
Corpus Linguistics for English Teachers

Corpus Linguistics for English Teachers: New Tools, Online Resources, and Classroom
Activities describes Corpus Linguistics (CL) and its many relevant, creative, and
engaging applications to language teaching and learning for teachers and prac-
titioners in TESOL and ESL/EFL, and graduate students in applied linguistics.
English language teachers, both novice and experienced, can benefit from the
list of new tools, sample lessons, and resources as well as the introduction of
topics and themes that connect CL constructs to established theories in lan-
guage teaching and second language acquisition. Key topics discussed include:

• CL and the teaching of English vocabulary, grammar, and spoken-written academic discourse;
• new tools, online resources, and classroom activities; and
• focus on the “English teacher as a corpus-based researcher.”

With ready-to-use teaching vignettes, tips and step-by-step guides, case studies
with practitioner interviews, and discussion of corpora and corpus tools, Corpus
Linguistics for English Teachers is a thoughtfully designed and skillfully executed
resource, bridging theory with practice for anyone looking to understand and
apply corpus-based tools dynamically in the language learning classroom.

Eric Friginal is Associate Professor of Applied Linguistics at the Department of Applied Linguistics and ESL and Director of International Programs, College of Arts and Sciences, Georgia State University.
Corpus Linguistics for English Teachers
New Tools, Online Resources, and Classroom Activities

Eric Friginal
First published 2018
by Routledge
711 Third Avenue, New York, NY 10017
and by Routledge
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2018 Taylor & Francis
The right of Eric Friginal to be identified as author of this work has
been asserted by him in accordance with sections 77 and 78 of the
Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or
reproduced or utilised in any form or by any electronic, mechanical,
or other means, now known or hereafter invented, including
photocopying and recording, or in any information storage or retrieval
system, without permission in writing from the publishers.
Every effort has been made to contact copyright-holders. Please advise
the publisher of any errors or omissions, and these will be corrected in
subsequent editions.
Trademark notice: Product or corporate names may be trademarks
or registered trademarks, and are used only for identification and
explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
A catalog record for this title has been requested

ISBN: 978-1-138-12308-3 (hbk)
ISBN: 978-1-138-12309-0 (pbk)
ISBN: 978-1-315-64905-4 (ebk)
Typeset in Bembo
by codeMantra
With lesson plans, classroom activities, and related
materials contributed by Cynthia Berger, Maxi-Ann Campbell,
Melinda Childs, Sean Dunaway, Peter Dye, Lena Emeliyanova,
Tia Gass, Jack A. Hardy, Tyler Heath, Jonathan McNair,
Tamanna Mostafa, Robert Nelson, Matthew Nolen,
Janet Beth Randall, Jennifer Roberts, Marsha Walker, and Ying Zhu
Contents

List of Figures xi
List of Tables xiii
Acknowledgments xv

Part A
Corpus Linguistics for English Teachers: Overview,
Definitions, and Scope 1

A1 Corpus Linguistics for English Teachers: An Introduction 3


A1.1  Corpus Linguistics and Pedagogy 6
A1.2 What is a Corpus? 10
A1.3  So, What is Corpus Linguistics? 12
A1.4  Corpora: Types and Descriptions 15
A1.5  Historical Overview of Corpus Linguistics  19
A1.6  How to Use this Book 23
A1.7  CL Limitations and Future Directions 24

A2 Connections: CL and Instructional Technology, CALL, and Data-Driven Learning 26
A2.1  Corpus Linguistics and Instructional Technology 27
A2.2 Corpus Linguistics and Computer-Assisted Language 
Learning (CALL) 33
A2.3  CL and Data-Driven Learning (DDL) 39

A3 Analyzing and Visualizing English Using Corpora 46


A3.1  Linguistic Analysis of Corpora 46
A3.2  CL and Visualization of Linguistic Data 60

Part B
Tools, Corpora, and Online Resources 79

B1 Corpora and Online Databases 81


B1.1  Written Learner Corpora 86
B1.2  Spoken Learner Corpora 93
B1.3  Spoken-Written Academic Corpora 100
B1.4  Varieties of English 102
B1.5  Online Collections 107

B2 Collecting Your Own (Teaching) Corpus 114


B2.1  Corpus Collection Process 115
B2.2  Collecting Written Texts 121
B2.3  Collecting Spoken Texts 136

B3 Corpus Tools, Online Resources, and an Annotated Bibliography of Recent Studies 148
B3.1  Online Directories, Facebook Groups, and MOOCs 148
B3.2  Corpus Annotation and Markup: Taggers/Parsers 150
B3.3  Other CL Tools (and Where to Find Them) Online 156
B3.4  Annotated Bibliography of CL Studies 160

Part C
Corpus-Based Lessons and Activities in the Classroom 185

C1 Developing Corpus-Based Lessons and Activities: An Introduction 187
C1.1  CL for an EAP Course: A Case Study 188

C2 CL and Vocabulary Instruction 213


C2.1 Identifying and Analyzing Vocabulary from Authentic
Materials in a Content-Based ESP Class 218
C2.2 Using a Concordancer for Vocabulary Learning with Pre-Intermediate EFL Students 224
C2.3 Implementing the Frequency-Based VocabProfile Tool from
LexTutor: Improving English Learners’ Essay Writing for
Proficiency Exams 233

C3 CL and Grammar Instruction 243


C3.1  Analyzing Verb Usage: A Concordancing Homework 246
C3.2 Developing Corpus-Based Materials for the Classroom:
Past or Past Progressive with Telicity 252
C3.3  Quantifiers in Spoken and Academic Registers 263
C3.4 Teaching Linking Adverbials in an English for (Legal)
Specific Purposes Course 268
C3.5 AntConc Lesson on Transitions for an Intermediate
Writing Class 277
C3.6 “The Explorer’s Journal”: A Long-Term, Corpus
Exploration Project for ELLs 279

C4 CL and Teaching Spoken/Written Discourse 293


C4.1  Using COCA to Answer the Question on
Everyone’s Lips 299
C4.2  Using Text Lex Compare to Examine the Language
of Political Speeches 305
C4.3  An Eight-Week Corpus-Based Writing Course for Academic
Professionals 315
C4.4  Incorporating a Corpus-Based Text Visualization Program
in the Writing Classroom 322

References 331
Index 347
Figures

A2.1 Components of English language pedagogy. Adapted from Chapelle and Jamieson (2009) 31
A2.2 The relationship between target second language skills and priorities for CALL design. Adapted from Chapelle and Jamieson (2009) 34
A2.3 Concordance lines for in fact in academic written texts 45
A3.1 Top 12 most common lexical verbs in spoken American English, normalized per one million words. Adapted from Biber (2004) 48
A3.2 AntConc’s (Anthony, 2014) first left and first right collocations for the word know from a blog corpus 51
A3.3 A word cloud of the first 10 pages of this book 61
A3.4 Visual representation of the Top 20 words in the English language from Google Books (a mega corpus of more than 500 billion words from scanned books in English and also other languages) 62
A3.5 Use of personal pronouns I, we, my, mine, and our by men and women bloggers in two age groups (30 and younger vs. 31 and older). Adapted from Friginal (2009) 62
A3.6 Distributional comparison of the word gentleman from the 1880s to the present in COHA, with KWIC results. Figure and illustrations adapted from Davies, 2010– 64
A3.7 Frequency of gentleman in English books from the 1800s to the present from the Google Ngram Viewer 64
A3.8 Sample sentence tree from Zhu. Adapted from Zhu and Friginal (2015) 65
A3.9 Text X-Ray’s text editor and standard application tools (Zhu & Friginal, 2015) 66
A3.10 POS-visualizing through Text X-Ray (color coded POS not shown in the gray scale image, e.g., green = nouns, red = verbs, bold = prepositions) 67
A3.11 Text X-Ray’s structural workflow. Adapted from He (2016) and Zhu and Friginal (2015) 68
A3.12 Sample Sketch Engine’s word sketch feature for work from the BNC 72
A3.13 Selfie/s first appearance and dramatic increase in usage from Twitter in 2013. The gray parts in the US map (typically major cities in the Northeast and the Southwest) indicate that selfie/s originated from and was immediately popularly used in many major US cities (Grieve et al., 2014) 74
A3.14 Representation of TAGSExplorer’s nodes of event hashtags and interactive visualization 76
A3.15 Comparison of “Big Brother” (Kanye West) and “Trapped” (Tupac Shakur) from the Hip Hop Word Count. Adapted from Hemphill: http://staplecrops.com/ 78
B1.1 Comparison of dimension scores for disciplines in Dimension 1: (+) Involved, Academic Narrative vs. (−) Descriptive, Informational Discourse (ANOVA: group, F(15, 809) = 18.65, p < 0.001), adapted from Hardy and Römer (2013) 90
C1.1 Distribution of demonstrative pronouns this and that in conversation and academic writing, frequency per million words, adapted from Biber (2004) 193
C2.1 Meanings of o-kay or OK in spoken and written, adapted from the LDOCE 214
C2.2 Sample vocabulary lists provided by WordandPhrase.Info 219
C2.3 Concordance entries for “impact (n) + minimize” 221
C2.4 Your essay pasted on VocabProfile field 238
C3.1 Frequency of perfect and progressive aspect in AmE vs. BrE conversation and news. Adapted from Biber et al. (2002) 260
C3.2 Sample AntConc tutorial figure for students 285
C4.1 Word cloud of President Obama’s 2013 address 309
C4.2 Highlighted grammatical categories of writing from Text X-Ray (Zhu & Friginal, 2015) 324
C4.3 Students working on their Text X-Ray activity in class 325
C4.4 Sample MICUSP comparison output in Text X-Ray (visualized comparison output of student paper and MICUSP averages automatically obtained by clicking the “Compare” button under the “Compare with MICUSP” tab) 327
Tables

A2.1 Chapelle’s (1998) framework on SLA hypotheses and CALL/CL applications 35
A2.2 Criteria for evaluating CL tools and materials 38
A3.1 Google Books’ (from the BYU collection) changing collocates over time for women, art, fast, music, and food (Davies, 2017a) 51
A3.2 Key word comparison from two groups of essays written by L2 students 55
A3.3 The 50 most common 4-grams from the Enron Email Corpus 56
B1.1 Corpus matrix for the BAWE corpus (Nesi, 2008, 2011) 92
B1.2 Composition of the T2K-SWAL Corpus 101
B1.3 Description of topics/prompts and other items from ICNALE 102
B1.4 Registers of the ANC for language teaching 105
B1.5 Spoken and written registers of the International Corpus of English 107
B1.6 English corpora from Davies 108
B2.1 List of linguistic features compared 123
B2.3 Gray’s corpus description in number of words (30 texts per discipline/category) 133
B2.4 The corpus of US-based and Iraqi RA abstracts 135
B2.5 Description of the L2CD corpus 139
B2.6 Interview transcription guidelines 142
B2.7 Sample students’ thoughts about grammar teaching methods 144
B2.8 Composition of the L2PR corpus 145
B3.1 Description of relevant tools/software (organized by their developers) 152
B3.2 Other relevant tools/software 158
B3.3 Annotated bibliography: CL and language teaching and learning 160
B3.4 Annotated bibliography: CL and learner writing 175
B3.5 Annotated bibliography: CL and learner speech 179
C1.1 Major topics and assignments in the Writing in Forestry course by week 190
C1.2 Categories of reporting verbs (adapted from Francis et al., 1996) 196
C1.3 Sample completed group worksheet (1): Verb use (arranged alphabetically, normalized per 1,000 words and including their lemmas) in student reports in the Writing in Forestry course (n = 144 reports) and 500 refereed forestry research articles 197
C1.4 Sample completed group worksheet (2): The 35 frequently used reporting verbs (ranked and normalized per 1,000 words, including their lemmas) in student reports in the Writing in Forestry course (n = 144 reports) and 500 refereed forestry research articles 200
C1.5 Use of linking adverbials (arranged alphabetically, normalized per 1,000 words) in student lab reports (144 papers) and professional articles (500 articles) 202
C1.6 List of 24 frequently used linking adverbials (ranked and standardized per 1,000 words) in lab reports (n = 144 reports) and 500 refereed forestry research articles 203
C3.1 Sample instructional information about necessity modals from Real Grammar: A Corpus-Based Approach to English 244
C3.2 Journal entry grading rubric 289
C4.1 Common extended chunks from agents 297
C4.2 Common extended chunks from callers 298
C4.3 Comparison of unique and shared word tokens/families from Text Lex Compare 310
C4.4 Sample writing survey responses from participants 316
C4.5 Sample worksheet: analyzing patterns in writing from your discipline with AntConc 320
Acknowledgments

This book synthesizes data, findings, and ideas in corpus-based teaching and
research from my work with mentors, colleagues, students, and the many
prominent publications from applied CL practitioners over the past two de-
cades. I present materials that I have myself developed as well as those produced
with my former students and collaborators. I would not have been able to finish
this book without their help! I thank Mike Cullom for his invaluable insights
and reviews of several drafts, guiding its completion to directly address teach-
ers’ needs and relevant practical concerns. As always, thanks to Doug Biber and
Randi Reppen of Northern Arizona University (NAU) for their constant sup-
port and encouragement; the faculty, staff, and students of the Department of
Applied Linguistics and ESL at Georgia State University (GSU); Jack A. Hardy
(Oxford College of Emory University); Ute Römer (GSU); Joseph J. Lee (Ohio
University); Brittany Polat (Lakeland, FL); Audrey Roberson (Hobart and
William Smith Colleges); my Routledge editors and reviewers; and the staff of
the Longview Public Library, Longview, WA.
Much appreciation to Susan Conrad (Portland State University), Carol Chapelle
(Iowa State University), Joan Jamieson (NAU), Ying Zhu of the Creative Media
Industries Institute (GSU), Laurence Anthony (Waseda University), Mark
Davies (Brigham Young University), Jack Grieve (University of Birmingham),
Tony McEnery (Lancaster University), and Tom Cobb (Université du Québec à
Montréal) for leading the way with their research, teaching approaches, and the
corpora and innovative corpus tools that they have developed and shared freely
with all of us. I’d like to acknowledge the support of Tom Kolb; Martha Lee;
and all faculty, staff, and students at the School of Forestry at NAU for giving me
the opportunity to develop my very first corpus-based academic writing course
in 2005. Finally, thanks to all my contributors and former students for joining
me in this journey: Cynthia Berger, Maxi-Ann Campbell, Melinda Childs,
Sean Dunaway, Peter Dye, Lena Emeliyanova, Tia Gass, Tyler Heath, Jonathan
McNair, Tamanna Mostafa, Robert Nelson, Matthew Nolen, Janet Beth Randall,
Jennifer ­Roberts, and Marsha Walker.
For Nanay and Tatay
Eric Friginal
Part A
Corpus Linguistics for English Teachers: Overview, Definitions, and Scope

A1
Corpus Linguistics for English Teachers: An Introduction

This book is about Corpus Linguistics (CL) and its many relevant applications
to language teaching and learning. In this setting, CL includes components
such as computers and the internet, corpora, corpus tools, online databases (es-
pecially those with an interface to search electronic texts), and frequency-based
data or results from analyses of corpora (e.g., word lists, key words, n-grams,
part-of-speech or POS tags, and many others). The applications of CL include
incorporating corpus tools and the analysis of corpora in a classroom activity or
a homework assignment, using corpus-based online databases to help students
conduct a mini-research project or respond to a question about language pat-
terns (e.g., “What are the top 5 collocations of the word freedom in newspaper
writing?”), teachers and students collecting their own corpora for analysis in a
writing classroom, or exploring vocabulary patterns of use from concordancers
and corpus-based dictionaries. Language learners may come from a range of
first (L1) and second (L2) language backgrounds, language learning experi-
ences, or fields of study. The common thread is that they are learning spoken
and written English, a majority of them coming from university settings in
the United States (US) and similar locations, but their needs will also overlap
with English learners from all over the world. They also have commonalities
across writing courses; vocabulary and grammar classes in an Intensive English
Program (IEP); graduate-level academic writing programs; or sociolinguistics,
creative writing, and literature classes exploring variation in language form
and use.
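Queries like the freedom example above are normally run in a concordancer or an online corpus interface, but the computation behind a basic collocation list is simple. The following Python sketch (a minimal illustration of the general technique, not code from any tool discussed in this book) counts the words that co-occur with a node word within a window of four words on either side:

```python
import re
from collections import Counter

def top_collocates(text, node, window=4, n=5):
    """Count words co-occurring with `node` within +/- `window` words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts.most_common(n)

sample = ("Freedom of the press is essential. The press values freedom, "
          "and freedom of speech protects the press.")
print(top_collocates(sample, "freedom", n=3))
```

Real corpus tools refine this raw co-occurrence count with association measures such as mutual information or log-likelihood, and often let users filter out high-frequency function words such as of and the, which otherwise dominate any window-based count.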
The book is written primarily for English teachers. Teachers ranging from
those with a limited background in CL to those who have taken a CL course
in teacher preparation or master’s programs may benefit from the book’s list
of tools, sample lessons, and resources as well as its introduction of topics and
themes that connect CL constructs to established theories in language teaching
and Second Language Acquisition (SLA). The CL resources in this book may
also supplement the ones that experienced teachers have already incorporated
into their classes. My intent is to summarize developments in applied CL over
the past 10 years, from approximately 2007–2008 to 2018, highlighting recent
innovations ranging from downloadable concordancers and taggers to freely
available learner corpora. I also developed this book in response to the expressed
needs of the students who take my undergraduate and graduate courses in CL,
Technology and Language Teaching, Corpus-Based Sociolinguistics, and Research Meth-
ods in Applied Linguistics at Georgia State University (GSU). My former and
current students, through their many meaningful comments and reflections, have
given direction to and shaped the themes and foci of this book.
As an applied corpus linguist, I was greatly influenced by my mentors Douglas
Biber and Randi Reppen of the applied linguistics program in the English De-
partment at Northern Arizona University (NAU). It was while working there
with Doug that I learned about CL’s direct applications to the study of linguistic
variation, especially the broader exploration of lexico-syntactic characteristics
of spoken and written language. Randi offered an innovative seminar on CL
and language teaching when I was a doctoral student at NAU, and, under her
supervision, I developed a suite of lessons and corpus-based tools intended
for students of forestry writing laboratory reports and for trainees in profes-
sional, intercultural communication contexts. I provide an account of my
corpus-based writing course in forestry in Section C1. Randi’s publications on
this particular CL topic include various corpus-based vocabulary and grammar
books for English as a Second/Foreign Language (ESL)/EFL students (e.g., the
Grammar and Beyond series), especially Using Corpora in the Language Classroom
(2010), one of the first books written specifically for teachers to support them
in making use of actual corpus tools and data in their classrooms. It is my hope
that the book you are reading honors and recognizes with great appreciation
Doug’s and Randi’s influential work in the field.
One of the first edited volumes of papers on corpora and language teaching
was published in 1997 and actually comprised studies presented three years
prior, in spring 1994, at the conference on Teaching and Language Corpora
(TALC) at Lancaster University. The volume, edited by Anne Wichmann,
Steven Fligelstone, Tony McEnery, and Gerry Knowles, was entitled Teaching
and Language Corpora. Both the TALC conference and the edited volume
resulted from discussions between members of the International Computer
Archive of Modern English (ICAME) on the emerging need to establish a
teaching-oriented CL conference. In 1997, corpus work in language teaching
and learning was, essentially, still in its infancy; online databases, downloadable
concordancers, for-purchase learner corpora, and collections of lesson plans
were still not freely or readily available.
In his chapter “Teaching and Language Corpora: A Convergence,” Geoffrey
Leech noted that there was every reason to believe that language corpora would
have a role of growing importance in teaching. He added that, even at that time,
there were enthusiastic testimonials on the richness of the interest and experience
already being applied to the convergence of language teaching and language re-
search through the link of corpus-based methods. Leech discussed the benefits
of the direct use of corpora in teaching, and he outlined the advantages of using
the computer in language learning, which aligned with Computer-Assisted
Language Learning (CALL) principles: (1) automatic searching, sorting, and scoring;
(2) promoting a learner-centered approach; (3) open-ended supply of language
data; and (4) enabling the learning process to be tailored. In 1994, these ideas
were still somewhat difficult to operationalize in the language classroom. Com-
puters were still bigger, heavier, and very expensive, and designing, collecting,
and storing open-ended linguistic data were not things a language teacher could
quickly or easily accomplish. Leech envisioned future teacher-learner applica-
tions of various corpora in the classroom that seemed possible for only a select few
at that point. Over 20 years later, we now have computer-based tools and mate-
rials to which virtually everyone has global access. Hardware and software have
become immediately accessible online, and data storage has evolved from floppy
disks, CD-ROMs, and flash drives to cloud-based storage solutions. Researchers
and teachers freely share collections of texts, and ICAME has created various
groups and conferences targeting a range of practitioners, language learners, and
materials developers. Potential resources have become real and readily available.
Leech has gone on to work with Doug Biber, Stig Johansson, Edward Finegan,
and Susan Conrad to develop and publish the Longman Grammar of Spoken and
Written English (LGSWE) (see also Section C3). The LGSWE is a seminal work
focusing on the convergence of frequency data from corpora, varieties of English
(British and American), and spoken vs. written registers to comprehensively
describe the grammar of the English language. An important implication here
is that there are multiple ‘grammars’: dialect-based grammar, the grammar of
speech and writing, and the grammar that is mediated and modified by registers.
These are all relevant in directing learners to acquire the comprehensive skills
and awareness essential to successfully using English across contexts. Clearly, the
goal is to value descriptions of language features in use and to allow learners to
appreciate the range of vocabulary and grammar options available to them. CL
frequency data are all options, if I can put it this way, not new prescriptions.
Teachers need to fully understand the meaning and implications of frequency
distributions and numbers indicating the likelihood or percentage of occurrence
of a feature and how to best share this information with their students. Students
will have to incrementally learn what linguistic variation in everyday language
means, together with its sociocultural values. Correctness and accuracy in using
language, however these are defined, are clearly important constructs in CL, but
instead of focusing on or prioritizing prescribed (i.e., ‘correct’) forms, actual fre-
quencies of use, not intuitions, alongside a full attention to and consideration of
contexts, are established in the forefront.
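One small calculation underlies much of this frequency talk: raw counts from corpora of different sizes cannot be compared directly, so corpus studies normalize them to a common base, often a rate per one million words. A minimal sketch of the arithmetic (my illustration, with hypothetical counts):

```python
def per_million(raw_count, corpus_size):
    """Normalize a raw frequency to a rate per one million words."""
    return raw_count / corpus_size * 1_000_000

# Hypothetical counts: a feature occurring 1,200 times in a
# 4-million-word spoken corpus and 450 times in a 1.5-million-word
# written corpus turns out to have the same normalized rate.
spoken_rate = per_million(1200, 4_000_000)    # 300.0 per million words
written_rate = per_million(450, 1_500_000)    # 300.0 per million words
```

Only after normalization can a teacher tell students that a feature is genuinely more (or, as here, equally) frequent in one register than another, despite very different raw counts.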
Various linguistic frequencies, from corpus-based English dictionaries to


follow-up studies on the LGSWE model, have been used as the foundation
of classroom teaching materials and lesson plans. The corpus approach to the
teaching of English has been integrated into several niches of university IEPs
and writing or tutoring centers, with special attention paid to the role of fre-
quency in awareness-raising activities. Frequency-related constructs and how
they can inform effective language instruction are emphasized in this book. I
would argue that we must advocate more effectively for the greater utilization
of CL innovations in learner-centered and teacher-friendly language teaching
models.
Finally, Schlitz (2010) noted that there are a great many additional issues to
consider related to corpus-informed instruction, including further discussion
of the challenges as well as the successes and potential failures of usage-based
approaches. This book synthesizes wide-ranging data from my own research
and teaching over the past decade, together with materials that I have myself
produced as well as those produced in collaboration with my students and col-
leagues. In summary, I discuss herein key topics such as CL and the teaching
of English, tools and online resources, classroom activities, and “the English
teacher as a corpus-based researcher.” Teaching vignettes and tips for teachers,
case studies with practitioner interviews, and discussion of corpus tools are
offered for your consideration and possible utilization as you plan your own
teaching strategies and work with your English learners.

A1.1  Corpus Linguistics and Pedagogy


CL, broadly defined, is a research methodology focusing on the empiri-
cal investigation of language use and variation, producing results that have
much greater generalizability and validity than would otherwise be feasible
(Biber, Reppen, & Friginal, 2010). Corpora and CL approaches, over the
past three decades, have supported the view that language is systematic and
can be described and taught using empirical, frequency- and pattern-based
approaches. When applied to the teaching of English, CL methodologi-
cally offers learners relevant and meaningful data—frequency distributions
and actual patterns of English vocabulary and grammar as used in natural,
authentic contexts. This reliance on accurate, real-world (linguistic) data
supports many interrelated theories of successful language learning and
teaching. The corpus approach merges innovations in instructional technol-
ogy, educational computing, and various online resources that promote the
inclusion of different perspectives on language learning inside and outside of
the classroom, an approach differing somewhat from those taken in ‘tradi-
tional’ pedagogical models. Mobile technology, individualized instruction,
and big data visualization as integral parts of CL all contribute to how digital
learners may, in fact, fully adapt and appreciate corpus-based approaches in
their learning of English.
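The "actual patterns of English vocabulary and grammar as used in natural, authentic contexts" described above typically reach learners as Key Word in Context (KWIC) concordance lines. The core routine of a concordancer can be sketched in a few lines of Python (a simplified illustration, not the implementation of any particular tool):

```python
import re

def kwic(text, node, width=25):
    """Return concordance lines with `node` bracketed and aligned context."""
    lines = []
    for m in re.finditer(r"\b" + re.escape(node) + r"\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(left.rjust(width) + "[" + m.group() + "]" + right.ljust(width))
    return lines

sample = ("In fact, the results were clear. The data, in fact, "
          "supported the claim that frequency matters.")
for line in kwic(sample, "in fact"):
    print(line)
```

Aligning every hit on the node word is what lets learners scan the left and right context vertically and notice recurring patterns, which is the pedagogical point of the concordance display.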
The number of teachers incorporating CL approaches in their English class-


rooms has grown exponentially from the mid-1990s, as evidenced by CL and
teaching conference presentation topics in major international conferences,
such as the TESOL (Teachers of English to Speakers of Other Languages) In-
ternational Convention or the American Association for Applied Linguistics
(AAAL). Conferences on CL and Learner Corpora (e.g., the American Asso-
ciation for Corpus Linguistics or Learner Corpus Research Conference, often
held in Europe) now have strands focusing specifically on corpus-based materi-
als design and the application of corpus tools in teaching a particular linguistic
feature. An informal survey of articles published in pedagogically inclined ac-
ademic journals such as TESOL Quarterly, English for Academic Purposes Journal,
English for Specific Purposes Journal, or Journal of Second Language Writing shows a
substantial increase in peer-reviewed publications that explored learner corpora
from the early 2000s to the present. However, as Conrad (2000), Meunier and
Reppen (2015), and Friginal (2013b) noted, many teachers who received some
training in CL are still not regularly using corpus-based activities in the class-
room for reasons such as lack of confidence, time constraints, questions of rele-
vance, and difficulty in orienting their students (and their courses) regularly to
corpus-based foci. In addition, there are many who may still be initially intim-
idated by various CL tools and software, most of which have originated from
works pioneered by computer programmers in the 1990s. And finally, there
are still many English teachers who are simply not familiar with the approach.
To emphasize, an important goal of this book is to add to the growing, albeit
still limited, number of CL and language teaching textbooks and resources cur-
rently available. An additional impetus is to continue to introduce and encourage
the utilization of CL to actual practitioners, especially those who did not come
from applied linguistics teacher-training programs: for example, teachers from ed-
ucation departments who may have limited background in this approach as it can
be integrated into the teaching of ESL/EFL or English language learners (ELLs).
In the past 10 years, there have been only a few “corpora in the language class-
room” textbooks written primarily for classroom teachers by prominent corpus
linguists. There are also textbook treatments of CL in general, with some sections
or chapters dedicated to applied topics, which include CL and language teach-
ing and learning. These resources (see a short bibliographic list with annotations)
have presented the practical aspects of developing corpus-based activities in the
classroom while also providing a survey of available references and tools, such as
concordancers and online databases. Some of these are introductory materials that
encourage further research and exploration on the part of teachers, while others
are quite specialized, targeting a particular field, like sociolinguistics (i.e., teaching
and researching language variation) or the use of online corpora and a particular
software. I suggest that readers also consult these books and others not listed here
for continuing education on CL’s various applications and for a greater under-
standing of historical perspectives on how CL has gradually been introduced to
language learners over the years.
For the Teacher

Your Reading List: Notable “Corpora in the Language Classroom” books pub-
lished in the past 10 years (from 2007). As previously noted, the LGSWE (pub-
lished in 1999) is what I consider to be the default corpus-based resource,
especially for grammar instruction, but the following textbook treatments
of CL and language teaching may speak directly to your immediate needs to
create classroom activities and lesson plans. I have provided level categories
(introductory, topic-specific, or advanced) to help guide you as to how these
books could be best utilized as additional resources.

Anderson, W. & Corbett, J. (2009). Exploring English with online corpora. New
York: Palgrave Macmillan.
This book introduces readers to available electronic resources (up until early
2009) and demonstrates how teachers can utilize corpora in the classroom. A
glossary of practical terms and topics, interactive tasks, and further readings
are provided. Level: Introductory.

Bennett, G. (2010). Using corpora in the language learning classroom. Michigan:
University of Michigan Press.
Bennett’s goal was to make CL approaches accessible to teachers and to
provide ideas, instructions, and sample opportunities to adapt CL tools in
the classroom. A set of corpus-designed activities is presented to teach a
variety of language skills. Highlight: Bennett provides a step-by-step guide
to using the software TextSTAT for computer-aided error analysis (CEA), with
information on error tagging and a description of error tagging codes for
use in exploring errors in student academic writing. Level: Introductory,
Topic-Specific.

Cheng, W. (2012). Exploring corpus linguistics: Language in action. New York: Routledge.
The practical aspects of CL are introduced in this book by Cheng, with one
chapter specifically focusing on “data-driven and corpus-driven language
learning.” In this chapter, students are given ideas for conducting projects
that lead to the preparation of research reports (oral and written), with a set
of criteria for assessing the quality of the reports. Hands-on activities and case
studies, including the author’s commentary on featured tasks, are provided.
Level: Introductory, Topic-Specific.

Conrad, S. & Biber, D. (2009). Real grammar: A corpus-based approach to English. New York: Longman.
Real Grammar explicitly draws on corpus research and corpus-based com-
parison data “to show how 50 grammatical structures are used in speech
and writing.” I highly recommend this book for English teachers across
levels, especially those interested in developing materials, including grammar
textbook writers. See Section C3, which briefly illustrates how Real Grammar
could be incorporated directly into a lesson on necessity modal verbs. Level:
Topic-Specific (Grammar).

Flowerdew, L. (2012). Corpora and language education. New York: Palgrave Mac-
millan.
Flowerdew provides a critical examination of key concepts and issues in CL,
focusing on the interdisciplinary nature of the field and the role that written
and spoken corpora now play in these different disciplines. Corpus-based
case studies are presented to show central themes and best practices in CL
research. An ‘application’ section discusses CL in teaching arenas, exploring
the pedagogic relevance of corpora. Citing Cook (1998), Flowerdew suggests
that corpora are a contribution to, rather than a solid base of, materials in the
data-driven classroom. Level: Topic-Specific.

Friginal, E. & Hardy, J.A. (2014a). Corpus-based sociolinguistics: A guide for stu-
dents. New York: Routledge.
Jack A. Hardy and I intended to generate ideas about how sociolinguistic re-
search and linguistic distributions from corpora can be effectively merged to
produce a range of meaningful studies. The teaching applications are primar-
ily for upper-level undergraduate and graduate students taking Language in
Society or Sociolinguistics courses. Students are provided detailed information
on corpus collection, tools, and available (sociolinguistic) corpora that can
be used for semester-long research projects. Level: Topic-Specific, Advanced.

Friginal, E., Lee, J., Polat, B., & Roberson, A. (2017). Exploring spoken English learner
language using corpora: Learner talk. New York: Palgrave Macmillan.
This book focuses on corpus-based analyses of learner oral production
in university-level English or ESL classrooms. Our analyses are primarily research-based, but pedagogical applications are discussed in three specific areas of student oral production: (1) English for Academic Purposes (EAP)
classrooms, (2) English language experience interviews, and (3) peer response/
feedback activities (see Section B2 for additional descriptions of this project and
our corpus design and collection). Level: Topic-Specific, Advanced.

Liu, D. & Lei, L. (2017). Using corpora for language learning and teaching. Annap-
olis Junction, MD: TESOL Press.
Liu and Lei ask readers, “How Can You Use Corpora in Your Classroom?” As
one of the newest additions to CL in the classroom literature, this book ad-
dresses the needs of today’s teachers for a “step-by-step hands-on introduc-
tion to the use of corpora for teaching a variety of English language skills
such as grammar, vocabulary, and English academic writing.” The authors
provide discussions of basic essential corpus search and teaching procedures
and activities, including instructions on how to compile corpora for language
instruction and research purposes. Level: Introductory, Topic-Specific.

O’Keeffe, A., McCarthy, M. & Carter, R. (2007). From corpus to classroom. Cam-
bridge: Cambridge University Press.
From Corpus to Classroom is another recommended resource, which summa-
rizes accessible corpus research in the classroom (until 2007). O’Keeffe, Mc-
Carthy, and Carter explain how corpora can be developed and what they
tell teachers and researchers about language learning. The book intends to
answer key questions, such as, “Is there a basic, everyday vocabulary for En-
glish?”, “How should idioms be taught?”, and “What are the most common
spoken language chunks?”, among others. Level: Topic-Specific.

Reppen, R. (2010). Using corpora in the language classroom. Cambridge: Cambridge University Press.
This is one of the formative publications on corpora and language teaching
written specifically for the classroom teacher. Reppen explains and illustrates
how teachers can use corpora to create classroom materials and activities to
address various classroom needs. Her goal was to “demystify” CL by pro-
viding clear and simple explanations, instructions, and examples. The book
provides the essential knowledge, tools, and skills teachers need to enable
students to discover how language is actually used in context. Level: Intro-
ductory, Topic-Specific.

Timmis, I. (2015). Corpus linguistics for ELT: Research and practice. New York:
Routledge.
Timmis is an experienced language teacher who developed this book as an
accessible, hands-on introduction to using corpora in classroom contexts.
The main focus here is to emphasize a data-rich approach to pedagogy and
how frequency-based information may contribute to effective classroom
teaching. Level: Introductory, Topic-Specific.

A1.2 What is a Corpus?
From the Latin word for “body,” the word corpus (corpora, plural) has been
used to refer to a collection of texts stored on a computer. Note that references
to text are not limited only to language that was initially written. A text1
can also be a transcription of speech. These electronic texts are equivalent to
researcher datasets (i.e., researchers’ bodies of data). However, in linguistics,
a corpus is even more narrowly defined. In addition to being a collection of
information, it is also viewed as systematically collected, naturally occurring
categories of texts. Before the age of computers, such collection was accom-
plished laboriously by hand. With the advent of personal computers and the
digitization of much of our everyday spoken and written language, CL has
become a much more widely practiced, accessible approach to examining lan-
guages and their use.
In summary, a linguistic corpus is, by definition, computerized and search-
able by computer programs (Friginal & Hardy, 2014a). Anyone with a com-
puter and perhaps an internet connection can begin to create a corpus as long
as it follows a logical and linguistically principled design. There are also several
publicly available corpora, as presented elsewhere in this book, which research-
ers around the world can study and analyze. Since they are readily available
data, frequency distributions and text extracts from corpora can naturally be
used for language teaching. The following are more specific definitions of
­corpus/corpora, as provided by corpus linguists from seminal publications in the
field as well as a few newer ones:

a corpus is a large and principled collection of natural texts.
(Biber, Conrad, & Reppen, 1998, p. 12)
A corpus is a collection of pieces of language text in electronic form,
selected according to external criteria to represent, as far as possible, a
language or language variety as a source of data for linguistic research.
(Sinclair, 2005)
Computer corpora are rarely haphazard collections of textual material:
They are generally assembled with particular purposes in mind, and are
often assembled to be (informally speaking) representative of some lan-
guage or text type.
(Leech, 1992)
A corpus is a collection of (1) machine readable (2) authentic texts (including
transcripts of spoken data) which is (3) sampled to be (4) representative of
a particular language or language variety.
(McEnery, Xiao, & Tono, 2006, p. 5)
Corpora may encode language produced in any mode – for example,
there are corpora of spoken language and there are corpora of written
language. In addition, some video corpora record paralinguistic features
such as gesture (Knight et al., 2009) and corpora of sign language have
been constructed (Johnston & Schembri, 2006; Crasborn, 2008).
(McEnery & Hardie, 2012, p. 3)
… is a collection of spoken or written texts to be used for linguistic analysis and based on a specific set of design criteria influenced by its purpose
and scope.
(Weisser, 2016, p. 13)
A corpus can be briefly defined as a systematically designed electronic
collection of naturally occurring texts […] Researchers compile corpora
and search for existing constructs of written or speech patterns identified
as relevant and measurable. A corpus provides the opportunity to mea-
sure tendencies and distributions across registers (and genres) of language.
(Friginal et al., 2017)

As apparent from the various definitions, a corpus—a systematic compilation
of naturally occurring language—serves as a primary dataset for linguists/researchers interested in describing and analyzing linguistic forms, functions, and variation. For teachers, it is an important source of linguistic information that can inform and direct language teaching, from the extraction of text samples to the usage and interpretation of frequency distributions of a key word or phrase used in context. The argument here is that actual usage, gleaned from corpora, is important as a guide in teaching a language. The process of exploring corpora in the classroom allows students themselves to discover linguistic patterns and potentially learn and formulate ‘rules’ necessary in their acquisition of a language. Capturing various features of the written or spoken language they are learning/acquiring from corpora (e.g., for speech, unique features such as dysfluent markers, repeats and reformulations, overlaps and back channels, and many other language characteristics) no longer requires extensive reading or manual searching, since this can all now be done using simple tools. Students are able to extract matching strings of language and then observe the various ways in which these strings are used across contexts and settings by various writers or speakers.
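This kind of string extraction is what a concordancer automates. As a rough illustration, a key-word-in-context (KWIC) display can be sketched in a few lines of Python; the sample sentences, search word, and window size below are invented for illustration and are not drawn from any actual corpus or tied to any particular concordancing software.

```python
import re

def kwic(text, keyword, window=4):
    """Return key-word-in-context (KWIC) lines: each hit of the keyword
    with a few words of left and right co-text, the way a concordancer
    displays matching strings for classroom observation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{keyword}] {right}")
    return lines

# A tiny illustrative 'corpus' (invented sample sentences):
sample = ("We should submit the report today. "
          "Students should read the instructions carefully. "
          "The data should be interpreted in functional terms.")

for line in kwic(sample, "should"):
    print(line)
```

Students reading the aligned output can then compare how the same word behaves across contexts, which is exactly the observation task described above.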

A1.3 So, What is Corpus Linguistics?


CL has evolved, continuing its dynamic internet-based period from the 2010s into the present, to become a go-to approach in empirical investigations of language variation and use. Today, CL in applied linguistics is sometimes grouped
with related fields, especially computational linguistics and natural language
processing. CL research and teaching topics often appear together with con-
ference themes such as text analysis, analysis of written/spoken discourse, and
data-driven learning, among others.
As emphasized by Biber et al. (2010), CL is not, in itself, a model of language
with its own theory-generating constructs: for example, gender or geographic
region influencing linguistic variation in sociolinguistics. This suggests to some
that the term “corpus linguistics” has been used and applied imprecisely in
many research studies over the years. Some users of corpora maintain that “cor-
pus linguistics does not exist” or that “corpus-based research” or other variants,
for example, corpus-assisted or corpus-informed research, would be more
accurate descriptors.
There have been attempts to define CL as its own linguistic field, prompting
distinctions between “corpus-based” (i.e., a research approach or method) and
“corpus-driven” (i.e., as a theory-generating branch in the field of linguistics).
Corpus-based analysis is “a methodology that uses corpus evidence mainly as
a repository of examples to expound, test or exemplify given theoretical state-
ments” (Tognini-Bonelli, 2001, p. 10). This means that corpora can be used to
answer preexisting questions about preexisting suppositions in frameworks that
have already been accepted by scholars in the field. Such analysis has also been
described as top-down because features of language under investigation are
known and chosen before going down to explore the lower levels of individual
texts. The corpus-driven perspective is more inductive, or bottom-up, in that
the linguistic features that are investigated come directly from analyses of the
corpus, not from categories preestablished by the researcher. In a corpus-driven
approach, the commitment of the linguist is to the integrity of the data as a
whole, and descriptions aim to be comprehensive with respect to corpus evi-
dence. The corpus, therefore, is seen as more than a repository of examples to
back preexisting theories or a probabilistic extension to an already well-defined
system. The theoretical statements are fully consistent with, and reflect directly,
the evidence provided by the corpus. Text examples and extracted patterns
are normally taken verbatim; in other words, they are not adjusted in any way
to fit the predefined categories of the analyst (Tognini-Bonelli, 2001). Biber
(2009), however, describes how such research tries to minimize the number of
assumptions of linguistic constructs the researcher might have. Instead, the data
are expected to speak for themselves (Friginal & Hardy, 2014a).
Although corpus-based and corpus-driven approaches can be thought of
as dichotomous, they are more like two poles on either end of a continuum.
There are areas of research in which corpora are used purely to find examples
of predefined linguistic features (e.g., most common stance words), and, at the
same time, there are truly corpus-driven studies, such as allowing computer
programs to determine the likelihood of multiple words’ being used together
(e.g., the concept of lexical bundles or formulaic sequences). The similarity of
these approaches is that both involve the collection and analysis of corpora of
natural language. Many researchers are even interested in similar constructs.
Compilation and analysis, however, are influenced by the ultimate goals: If one
wants to study a preestablished construct, he or she might simply search for
that construct in a corpus. If, on the other hand, the researcher is curious and
does not want to come to the analysis with preconceptions, a large corpus may
be collected and analyzed using corpus analysis that does not include a priori
decisions of what to search for.
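The corpus-driven extraction of lexical bundles mentioned above can be approximated with a simple n-gram count: every contiguous word sequence is tallied, and recurrent sequences surface without the analyst deciding in advance which forms to search for. The two-sentence ‘corpus’, the bundle length, and the frequency cutoff below are illustrative assumptions only; real bundle studies use much larger corpora and normalized frequency thresholds.

```python
from collections import Counter
import re

def lexical_bundles(texts, n=3, min_freq=2):
    """Count every contiguous n-word sequence (n-gram) in a list of texts
    and keep the recurrent ones -- a rough, corpus-driven approximation
    of lexical bundle extraction (no a priori list of forms)."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}

# Invented mini-corpus of two short academic-style sentences:
corpus = [
    "On the other hand, the results of the study were clear.",
    "On the other hand, the results of the survey were mixed.",
]
print(lexical_bundles(corpus, n=3, min_freq=2))
```

Sequences like "on the other" recur across both texts and are retained, while "of the study" occurs only once and drops out: the data, not the researcher, determine what counts as a candidate bundle.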
In this book, I maintain and emphatically support the argument that CL is
a data-driven approach or methodology for researching language features and
that various frequency-based results are important, critical sources of informa-
tion for language teaching and learning. The CL approach can be defined as
follows:

• It is strictly based on the empirical analysis of actual patterns of language
use in natural texts.
• It utilizes a large and principled collection of natural texts, that is, a corpus,
as the basis for analysis.
• It makes extensive use of computers for analysis, employing both automatic
and interactive techniques.
• It relies on the combination of quantitative and qualitative analytical and
interpretive techniques (Biber et al., 1998).

So what? With relatively easy access to numerical data, it is important to remember that we still need to interpret these corpus-based findings in order to move on to the ‘applied’ phase, including teaching and materials design. Corpus
data will have to be functionally interpreted accurately and honestly. In an
interview I conducted with Biber (Friginal, 2013a), he emphasized the impor-
tance of functional analysis and interpretation of corpus-based data:

Quantitative patterns discovered through corpus analysis should always
be subsequently interpreted in functional terms. In some corpus studies,
quantitative findings are presented as the end of the story. I find this
unsatisfactory. For me, quantitative patterns of linguistic variation exist
because they reflect underlying functional differences, and a necessary
final step in any corpus analysis should be the functional.
(p. 119)

Results of these functional interpretations are key for classroom teachers. These
interpretations will reveal to teachers and students that language is, in fact,
mediated by and modified according to registers. There is simply no
one-size-fits-all approach to effectively teaching the lexico-syntactic features
of language (speech or writing). For example, the essential linguistic and con-
textual components of successful business email writing may not necessarily be
the same components that will make a business proposal or a laboratory report
equally successful. What to teach learners, therefore, relies on the various com-
binations of components and factors identified by the target register.
In CL, register is a situationally defined category of speech and writing. A
register distinction of spoken texts, for example, can cover sub-registers, such
as face-to-face interaction, telephone interaction, and video calls (e.g., Skype
calls or mobile “FaceTime” calls). Corpora representing these three sub-registers
could be collected and transcribed. These sub-registers are differentiated by the
medium and contexts, which can certainly influence the use of a whole range of
linguistic features. Register variation, therefore, is primarily based on these con-
textual differences. What I like about the concept of register is that a researcher
or teacher is in control of register comparisons. The situations that define reg-
isters will just have to be clear and consistent. Hence, I can categorize a target
register, such as “written technical reports,” and establish sub-registers, such as
laboratory report, incident report, or business-financial report in the fields of
chemistry, forestry, and business, respectively. I can show my students linguistic
variation across these three groups. There will, for example, probably be differ-
ent distributions of key words, personal pronouns, or passive verbs. The conse-
quence of this teaching approach is increased register awareness among students.
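A comparison like the one suggested here (e.g., personal pronoun use across sub-registers) typically relies on normalized rates, so that texts of different lengths can be compared fairly. The following is a minimal sketch, with invented one-sentence stand-ins for two sub-registers; the pronoun set and the per-1,000-words base are common conventions rather than fixed requirements.

```python
import re

def rate_per_1000(text, feature_words):
    """Normalized frequency of a feature (here, a small set of words)
    per 1,000 tokens, so sub-corpora of different sizes can be compared."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for t in tokens if t in feature_words)
    return 1000 * hits / len(tokens)

first_person = {"i", "we", "my", "our", "me", "us"}

# Invented one-line stand-ins for two sub-registers:
lab_report = "The solution was heated and the temperature was recorded at intervals."
incident_report = "I noticed the leak and we reported it to our supervisor immediately."

for name, text in [("lab report", lab_report),
                   ("incident report", incident_report)]:
    rate = rate_per_1000(text, first_person)
    print(f"{name}: {rate:.1f} first-person pronouns per 1,000 words")
```

Even this toy contrast makes the register point visible to students: the same feature, counted the same way, is distributed very differently across situationally defined categories of texts.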
Register has often been used interchangeably with genre. Although there is
little consensus as to the meaning and/or use of these terms, Biber and Conrad
(2009) explain that such a distinction is a matter of focus. Both concepts refer
to text categories that have been situationally defined and have shared general
communicative purposes; the difference between the two is determined more
by how those texts are studied or used. A genre perspective is more interested
in the conventional structures that are used to create an entire text or a section
of a text, such as research article introductions or the abstract from a research
article. On the other hand, a register perspective looks at the most common
linguistic features across spoken and written texts. These linguistic features,
from a register perspective, are thought to be pervasive, and thus, a sample of
a text can be analyzed. A genre analysis, however, would require the text to
remain intact. Although many corpus-based studies take a register perspective,
they may also use or be supplemented by other methods to become more in line
with genre-based approaches (Flowerdew, 2005). Wherever they land theoreti-
cally, however, corpus-based methodologies lend themselves well to answering
the questions relevant to disciplinary specificity. Literacy practices, even those
of linguists studying such practices, may be entrenched and not noticed. Intu-
itive conclusions as to what is frequent or infrequent are not always accurate,
and corpora offer measurable ways to describe what happens in the discourse
empirically. Another benefit of corpus-based methods is that they allow for
more objective empirical studies. This is especially useful for researchers—and
­teachers—who view their role as descriptive rather than prescriptive. Topics that
can be investigated include, but are not limited to, vocabulary, phraseological
units, grammatical features, and rhetorical functions (Friginal & Hardy, 2014a).

A1.4  Corpora: Types and Descriptions


Almost anyone interested in studying a language can explore corpora with
computational tools to help uncover and begin to understand a range of pat-
terns. Such methods have been commonly reserved for grammatical studies to
supplement qualitative analyses of structures and then apply them across con-
texts. Linguists compile corpora and search for existing patterns that some may
have posited intuitively to be common, rare, relevant, or significant. A corpus
allows for measurements of these tendencies to confirm (or not) our hypotheses
and intuitions. Biber has often referred to the “unreliability of our intu-
itions” about language, which is discussed further in other parts of this book.
Corpora, correctly compiled, can provide the sound data we need to support
our conclusions.

General/Reference vs. Specialized Corpora


Corpora of naturally occurring language from speakers and writers of a target
population have been grouped according to various categories or criteria. One
distinction is the number of individuals and types of language that they are
designed to represent. A corpus could focus on studying a very specific group
of language users or perhaps the language produced in one particular setting,
location, or type of interaction. At the opposite end of the continuum, a cor-
pus may have been created broadly to embody the full range of speaking and
writing of a language like American English, Tagalog, or Cantonese. A gen-
eral corpus, also referred to as a reference corpus, is constructed to reflect the
language use of very large, diverse groups of people; in our example, they are
American English speakers/writers, the people of the Philippines, or groups of
Cantonese L1 speakers in Hong Kong and Guangzhou, China.
General corpora have been designed to represent the language-at-large.
Multiple registers are included, giving a comparative, proportional view of
how language is used. In the early days of CL, the 1980s, a corpus of one
million words was considered large. Currently, there are corpora of hundreds
of millions of words. The size of the corpus does not, necessarily, make it a
general or a reference corpus; rather, the inclusion and distribution of multiple
registers that comprehensively represent or approximate the target language is
really what defines a general or reference corpus. If the goal of a team of corpus
developers is to attempt to represent the language (e.g., Tagalog) as a whole,
the team must necessarily include samples of written and spoken texts system-
atically identified to represent the language produced every day by Filipinos
in a variety of settings. This certainly is a major undertaking that will require
careful planning and considerable resources. The current generation of refer-
ence corpora, such as the British National Corpus (BNC) and some relatively
smaller collections of Spanish, Chinese, or Arabic corpora, have all followed a
design with a target number of registers and texts. Inclusion or exclusion crite-
ria corresponding to the sampling plan should be defined and followed in order
to qualify the resulting full and final composition as a general corpus.
It will probably be unlikely and unnecessary in most cases for a teacher
to aim for a collection of a general corpus. There are, after all, already exist-
ing ones that can be used (as listed in Section B1). Corpora for teaching pur-
poses will often be very specific and individualized to the register or speech
event, and these are much easier to collect (see Section B2). They are referred
to as specialized corpora. Specialized corpora allow us to control for many
more variables. They are developed to represent a particular domain, includ-
ing those that are dedicated to micro contexts (e.g., abstracts of research arti-
cles or students’ responses to interview questions). These collections are useful
when moving from the analysis of results to generalizations relative to a bigger
population. For the most part, this is a question of scope. What linguistic fea-
ture or domain is being investigated? You might be interested, for example, in
the questions, “What kind of oral language do first year, level 1 IEP students
produce? Do they ask many questions and share critical comments? Do they
produce high frequency academic word list words? Do they often use passive
verbs?” These are interesting questions and could be investigated by using a
smaller sample, like two classes at an IEP in a particular university.
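Questions like the academic-vocabulary one above can be answered by measuring list coverage: the share of transcript tokens that appear on a target word list. In the sketch below, the five-word list is a hypothetical placeholder, not the actual Academic Word List, and the transcript line is invented.

```python
import re

def list_coverage(text, word_list):
    """Proportion of tokens in a transcript that appear on a given
    word list (e.g., an academic vocabulary list)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for t in tokens if t in word_list)
    return hits / len(tokens)

# Hypothetical placeholder stand-in for an academic word list (NOT the real AWL):
mini_awl = {"analyze", "data", "research", "significant", "method"}

transcript = "We analyze the data using a simple method in our research project."
print(f"{list_coverage(transcript, mini_awl):.0%} of tokens are on the list")
```

Run over a small specialized corpus of level 1 IEP transcripts, a measure like this would give a concrete, comparable answer to the question of how much academic vocabulary students actually produce.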

Written vs. Spoken Corpora


Another grouping criterion is defined by the two primary register categories
of a language—writing vs. speaking. Corpora of written texts (e.g., newspaper
articles, academic reports, fiction, Facebook posts, emails, text (SMS) messages,
Amazon customer product reviews, Donald Trump tweets from 2012 to 2018,
and a range of many other emerging written sub-registers) are more common
than are corpora of transcripts of spoken language. For obvious reasons, written
corpora are much easier to collect electronically. The Santa Barbara Corpus of
Spoken American English (SBCSAE) is a spoken corpus that consists of various
speech events, including face-to-face conversations, telephone conversations, ser-
mons, and descriptions from tour guides. Containing almost a quarter million
words, this is a relatively large corpus of spoken language. Like the SBCSAE,
the Wellington Corpus of Spoken New Zealand English (WSC) contains multiple types of speech events. With one million words, the WSC is large and well-balanced, consisting of news monologues, sports commentary, judicial summaries, lectures, conversations, telephone conversations, interviews, radio conversations, political debate, and meetings. A sociolinguistic corpus collected by
Tagliamonte strictly followed a recording and transcription plan that involved the
following components (Tagliamonte, 2006): (1) recording media and ­audio-tapes
(analogue, digital, or other formats); (2) interview reports (hard copies) and signed
consent forms; (3) transcription files (ASCII, Word, .txt); (4) a transcription pro-
tocol (hard copy and soft copy); (5) a database of information (FileMaker, Excel,
etc.); and (6) analysis files (GoldVarb files, token, cel, cnd and res).

Annotated Corpora
An important aspect of a corpus design may include additional items embed-
ded in texts/transcripts. Especially for spoken corpora, it is, depending on the
research question(s), often beneficial to be able to return to audio files to confirm observations or make correlations with speakers or multiple participants.
Corpus markup and multimodal annotations of variables such as prosody, body
movements or facial expressions can certainly add multiple levels of relevant
information in qualitatively analyzing corpus-based data. With an annotated
corpus of speech from the British city of York, for example, Tagliamonte and
Roeder (2009) examined how the definite article the was used. In northern
England, there is a form of the definite article that omits the vowel or otherwise
does not follow the standard form of a voiced interdental fricative followed by a
vowel, which is described as Definite Article Reduction (DAR). Orthographi-
cally, this form is usually written as t’ instead of the. To further complicate mat-
ters, this dialect of English also allows for the complete omission of a definite
article in places where it would be required in Standard English. In other words,
there were three forms that the researchers investigated: full form the, DAR,
and zero. Tagliamonte and Roeder’s spoken corpus of 1.2 million words from
92 York natives gave the researchers many opportunities to explore how the
speakers pronounced definite articles under differing phonetic and pragmatic
situations in a specialized, spoken, annotated dataset (Friginal & Hardy, 2014a).

Comparable vs. Parallel Corpora


The distinction between comparable and parallel corpora is based on the pres-
ence of several languages (also described as multilingual corpora). Comparable
corpora contain components in two or more languages that have been collected
using the same sampling method and with clear consideration of balance and
proportion in number of texts, words, and registers and domains represented.
Note that the sub-corpora of a comparable corpus are not translations of each
other. An example is the use of the Lancaster-Oslo/Bergen (LOB) corpus’s sampling frame for the Lancaster Corpus of Mandarin Chinese (McEnery, Xiao, &
Mo, 2003), making these corpora comparable. Parallel corpora specifically
contain L1 source texts and their L2 translations. The design of corpus col-
lection strictly focuses on the collection of similar domains and contexts in the
L1 and L2 registers. The Canadian Hansard corpus (https://www.lipad.ca/)
(see also, Brown, Lai, & Mercer, 1991) and the CRATER corpus (McEnery &
Oakes, 1995) are parallel corpora (McEnery & Hardie, 2012).

Other Categories
Monitor Corpora are collections of texts intended to grow in size (e.g., num-
ber of words and the addition of emerging sub-registers over time). The Bank
of English (BoE), a collection of British, Australian, and American English,
for example, was first made available by the research team from the University
of Birmingham in the 1980s, with Susan Hunston as the lead developer. The
collection now has over half a billion words of “present-day English” and a
sub-corpus developed specifically for teaching purposes with over 65 million
words. Many corpora collected by Mark Davies for his Brigham Young Univer-
sity (BYU) corpus project (see Section B1) are considered to be monitor corpora,
especially the Corpus of Contemporary American English or COCA (Davies,
2008), the Corpus of Historical American English or COHA (Davies, 2010), and
the Global Web-Based English or GloWbE (Davies, 2013). Another category
quite similar to monitor corpora is Balanced Corpora, which typically intend
to represent a specific register over a period of time, with the collection empha-
sizing balance and representativeness, and with a clear sampling plan (sampling
frame) and identified schedule of collection when new data will be added. The
concept for now is primarily theoretical, but there are current corpora that,
when further developed in the future, might be categorized as fully balanced.
Examples of these are the British Academic Written English (BAWE) corpus
and Michigan Corpus of Upper-Level Student Papers (MICUSP) academic texts
(see Section B1) or specialized blogging and social media corpora (Facebook and
Twitter) that have been collected by researchers over the past 10 years.
Finally, McEnery and Hardie (2012) described Opportunistic Corpora
as corpora that do not match the descriptions of the categories mentioned ear-
lier. These corpora do not follow a strict design or sampling frame, and they
simply comprise specific data that were possible to gather for a particular task.
Sometimes, contextual and also technical restrictions may have prevented the
collection of texts originally sought from being completely realized. An ‘op-
portunistic’ process is often needed in the case of most spoken registers, from
recording to transcription.

A1.5 Historical Overview of Corpus Linguistics


The following is a brief historical overview of CL that I have adapted and syn-
thesized from Friginal and Hardy’s (2014a) Corpus-Based Sociolinguistics: A Guide
for Students and Biber et al.’s (2010) “Research in Corpus Linguistics” from the
Oxford Handbook of Applied Linguistics. Additional perspectives from McEnery and
Hardie’s description of “English Corpus Linguistics” (2012) are also provided.
In CL, collecting naturally occurring texts has always been an essential
methodological approach. Utilizing these texts in corpus-based research be-
came possible in the 1980s and 1990s, with developments in computer science
and ubiquitous desktop computing technology. Prior to these developments
and up until the 1950s, the common practice was for language researchers to
utilize collections of natural texts by ethnographers and field linguists as the
basis for language research. Descriptions of the structure of various languages
and the production of dictionaries were common results of these collected text
samples, the dictionaries having been based primarily on naturally spoken ut-
terances compiled from interviews representing a particular dialect region. For
example, the Oxford English Dictionary, which was first published in 1928, was
based on approximately 5,000,000 citations from natural texts, totaling ap-
proximately 50 million words, compiled by more than 2,000 volunteers over
a period of more than 70 years. Long before this, 150,000 natural sentences
written on slips of paper were the basis, in 1755, for Samuel Johnson’s Dictionary
of the English Language, which was published to illustrate the natural usage of
words at that time (Biber et al., 2010).
Prior to the availability of computers or electronically prepared corpora,
the empirical study of vocabulary use as well as grammar teaching in English
was accomplished by reliance upon texts, such as newspaper writing, short
stories, and scholarly essays. Actual sentences taken directly from novels and
newspapers were reproduced in many commonly used grammar books during
this period to show various structures of formal, grammatically correct sen-
tences and syntactic elements such as verb phrases and clauses. A corpus of
letters written to various government agencies was the basis for earlier works in
the field, such as that of C.C. Fries, who focused on usage and social variation
of language. Another work, published in 1952, which was essentially a gram-
mar of conversation based on a 250,000-word corpus of telephone conversa-
tions, focused on such grammatical features as discourse markers well, oh, now,
and why when these markers initiated a “response utterance unit.”
In the 1960s and 1970s, the majority of researchers in linguistics—­particularly
those in the US—adamantly maintained that, since language was a “mental
construct,” empirical research approaches were unsuitable to describe ­language
competence. They argued that, instead, what Biber (1988) referred to as
­intuition-based methods, intuition rather than empirical analysis, should be
adopted as the primary research methodology in the field. Some linguists, how-
ever, steadfastly maintained that empirical linguistic analysis had greater util-
ity and validity. In the early 1960s, for example, Randolph Quirk compiled a
pre-computer-era collection of 200 spoken and written texts, each approximat-
ing 5,000 words, which he dubbed the Survey of English Usage (SEU) and sub-
sequently used to compile descriptive grammars of English (e.g., Quirk et al.,
1972). This empirical, descriptive tradition also had the continuing support of
such functional linguists as Prince and Thompson, who argued that analysis of
(still noncomputerized at this point) collections of natural texts was useful in the
identification of systematic functional linguistic variation. Thompson has been
especially interested in the study of grammatical variation in spoken interac-
tions and has identified features in conversation that influence the retention or
omission of other features, such as complementizers (Biber et al., 2010).
Kučera and Francis (1967) had actually begun work on large electronic cor-
pora in the 1960s, compiling the Brown Corpus (or, in full, the Brown Uni-
versity Standard Corpus of Present-Day American English), a 1-million-word
corpus of published American English written texts. The Brown Corpus cata-
logued a wide variety of types of American English, all of which were written in
1961. A total of 500 samples of approximately 2,000 words each from 15 different
genres were collected for this project. News, religious texts, biographies, official
documents, academic prose, humor, and various styles of fiction were included
(see Kučera & Francis, 1967). A parallel corpus of British English written texts,
the LOB Corpus (also Lancaster-Oslo-Bergen), followed in the 1970s. It was
not until the 1980s that major studies of language use based on large electronic
corpora began to appear as these corpora became more available and accessible,
thanks to the increasing availability of computational tools to facilitate linguis-
tic analysis. For example, in 1982, Francis and Kučera provided a frequency
analysis of the words and grammatical part-of-speech categories found in the
Brown Corpus. Johansson and Hofland (1989) followed with a similar analysis
of the LOB Corpus. Also during this period, book-length descriptive studies of
linguistic features began to appear: for example, Granger (1983) on passives, de
Haan (1989) on nominal post-modifiers, and Biber (1988) on the seminal multi-
dimensional studies of register variation. This period also saw the emergence of
English language learner dictionaries, such as the Collins CoBuild English Language
Dictionary (1987) and the Longman Dictionary of Contemporary English (1987), all of
which were based on the analysis of large electronic corpora. Since the 1980s, the
majority of descriptive studies of linguistic variation in and usage of English—
whether a large, standard corpus, such as the BNC, or a smaller, study-specific
corpus, such as a corpus of 20 biology research articles constructed for a written
academic register analysis—have utilized analyses of electronic corpora, and this
has now become a standard research methodology in the field.

For the Teacher

Another way of documenting and understanding the development of
modern-day CL from the 1960s to the present is to look at the works of key
research centers and universities. The ICAME, an organization based in the
United Kingdom (UK/Europe) that was founded in the 1970s, has been instru-
mental in gathering various linguists (especially Leech and Johansson) who
inspired each other in experimenting on corpora and the first generation of
concordancers to process linguistic data. Several studies from scholars affili-
ated with the ICAME explored diachronic research of British English, with data
obtained from SEU. Although the genesis of CL with electronic corpora started
in the US with the collection of the Brown Corpus, linguists and university cen-
ters within the UK became the primary proponents of the CL method.
CL has flourished in the UK, especially since the 1980s, with various uni-
versities having a core group of linguists focusing on a range of research
projects and collections of corpora. University College London, Lancaster
University, University of Birmingham, and University of Nottingham have
been actively producing a wide range of projects, from corpora to tagging
programs and influential textbooks and related publications. Although UK
corpus linguists who have profoundly influenced language teaching have
had affiliations with a variety of institutions within and/or outside of the UK,
many have come from or are still currently with these UK universities. Various
current projects from Lancaster University, for example, include the works
of Tony McEnery and Paul Baker, and updates to the corpus annotation tool
CLAWS (the Constituent Likelihood Automatic Word-Tagging System). CLAWS
originated from Geoffrey Leech’s collaboration with Roger Garside, a com-
puter scientist, in the 1980s, and it has since been further developed and now
includes an online interface and is available for purchase. At the University
of Birmingham, Susan Hunston continues in the tradition of the late John
Sinclair (and also Tim Johns) by producing groundbreaking CL studies in dis-
course analysis, lexicography, and language teaching.
In the US, NAU, with Douglas Biber, Randi Reppen (and also Susan Con-
rad from Portland State University), and their many students who now have
teaching posts in various universities, has maintained the US tradition of
register-­based CL, quantitative variationist perspectives, and CL and gram-
mar teaching. The NAU approach also includes the application of multidi-
mensional analysis across general and specialized corpora and the merging
of corpus data and multivariate statistical tests in studying languages. Other
major contributors in the US to the proliferation of CL include Mark Davies
(BYU) in corpus creation and online sharing of corpora; Stefan Gries (Uni-
versity of California, Santa Barbara) in quantitative CL research and R for CL;
and the MICASE and MICUSP research teams at the University of Michigan,
especially the initiatives of John Swales, although the Michigan CL program
has been ‘inactive’ due to recent administrative changes.
Finally, an increasing number of European universities have been actively
pursuing CL projects, with a specific emphasis on teaching applications and
the collection of learner corpora. For example, Sylviane Granger and her
colleagues from the Université catholique de Louvain (see Section B1) are
global leaders in these areas. European corpus linguists have also led studies on
CL and English as a Lingua Franca (e.g., Barbara Seidlhofer from the University
of Vienna), often situated in the field of English for Specific/
Occupational Purposes (ESP or EOP). In Asia, Australia, and New Zealand,
various scholars contribute CL-based studies of workplace interactions, L2
writing, and many other EAP-ESP applications. Laurence Anthony from
Waseda University, Japan, has been a leading force in developing corpus
tools—all free for users! AntConc (Anthony, 2014) and his many other tools
(see Section B3) are main contributors to the successes of present-day CL.
A1.6 How to Use this Book


A primary characteristic of this book is the presentation of classroom activities
developed by actual English teachers (see Part C), with supporting interviews,
that is, qualitative comments and evaluative feedback, intended to share ‘best
practices’ and real-world applications. Additionally, I intend to emphasize how
using corpus tools might energize the classroom and encourage students to be-
come autonomous learners and provide effective alternatives for those with dif-
ferent learning styles. I also address the needs of upper-level undergraduate and
MA students who are ‘teachers in preparation’ in various university courses,
such as those that I teach, which I’ve referenced previously, including Mate-
rials Design, Curriculum Design, and Instructional Technology, among many other
courses. Thus, this book can be used as a supplemental text in these common
university teacher-training courses. A specialized course in Corpus Linguistics
and Language Teaching might use this book as a primary text.
Since Web 2.0, new tools and internet-based resources have been developed
and freely shared, and immediate, mobile access to data has continued to influ-
ence new teaching and learning paradigms. Concordancers, for example, have
become regular features in online databases, online dictionaries, and Google
searches (Google's search bar is, in effect, an actual concordancer inasmuch
as it is able to help users find all occurrences of a word or string of words/
characters available on the web). There has also been an increasing number
of empirical studies documenting some of the successes of corpus-informed
materials in the teaching of a range of linguistic features for learners, especially
non-native English speakers. However, as I emphasize here, more research is
still needed to fully support our current positive perceptions of the measurable
contributions of CL in language learning and acquisition. Data from exper-
imental research settings, comparisons of various test results, and analyses of
quality in language produced by learners who experienced CL-based materials
and instruction are still very much needed to further validate this approach. At
the same time, in-depth qualitative teacher comments and feedback during and
after implementation of CL approaches in the classroom are equally necessary.
To this end, I hope that teachers see how CL offers them various opportunities
to conduct classroom-based research and professional development, with its
push for innovation, active student involvement and creativity, and range of
student output. I hope that readers of this book will be encouraged to share
their experiences in integrating CL into their classrooms in local and national
conferences. Workshops showing or demoing activities or poster presentations
of step-by-step guides are certainly manageable future applications!
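The core concordancing operation described in this section—retrieving every occurrence of a search term together with its surrounding context (a keyword-in-context, or KWIC, display)—can be sketched in a few lines of Python. This is only a toy illustration of the general idea, not the implementation of AntConc, Google's search bar, or any other real tool:

```python
import re

def kwic(text, keyword, width=30):
    """Return a keyword-in-context (KWIC) line for each match of `keyword`."""
    lines = []
    for m in re.finditer(r"\b" + re.escape(keyword) + r"\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]   # left context
        right = text[m.end():m.end() + width]              # right context
        lines.append(f"{left:>{width}}[{m.group(0)}]{right:<{width}}")
    return lines

sample = ("The corpus shows that learners use however less often "
          "than published writers use however in academic essays.")
for line in kwic(sample, "however", width=20):
    print(line)
```

Real concordancers add sorting by left/right context, regular-expression search, and corpus-wide file handling, but the retrieval step is essentially this.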
In Part B, I highlight lists and descriptions of what’s currently available—
tools and resources that expand CL’s applications to the teaching of English.
I also connect existing and emerging theories and models of language acqui-
sition in Section A2 to focus on the current and future directions of CL in
the classroom. A more in-depth handling of materials, design constructs, assessment
and evaluation for practitioners, and the practical aspects of CL and
classroom-based research are central themes. There certainly are many re-
cent innovative studies and publications that have been geared toward a more
­research-oriented discussion of techniques to pedagogy and how CL has been
applied to fields such as forensic linguistics, legal research, literary stylistics, and
creative writing. I believe that a continued emphasis on practitioners’ needs
vis-à-vis new and emerging tools shared freely online, or otherwise readily
available, will further promote CL as integral to a variety of classrooms.

A1.7  CL Limitations and Future Directions


Again, we do need to establish that modern-day CL defines a corpus as a com-
puterized and electronically stored dataset. This dataset is then searchable by
computer programs. Corpora and corpus approaches in teaching English offer
relevant options for teachers to search for a wide variety of data on vocabulary
use, commonly used grammatical markers, rare features of speech and writing,
and potential errors as they occur in transcripts/texts. A clear limitation here
is that there are many other features of language that are not present in a cor-
pus. In spoken texts, for example, transcriptions of speech events do not fully
represent the complete, complex nature of interactions (e.g., suprasegmental
features, humor, sarcasm, and various speech acts are not easily captured by
plain transcripts).
I see this as an opportunity for teachers to integrate sound analytical and
interpretive approaches in their teaching. Learners can see how qualitative in-
terpretation is needed to draw conclusions about the patterns they observe.
The advantage of creating written and spoken corpora specifically intended for
teaching purposes is that the corpora can be designed with a clear purpose. For
example, a teacher-trainer developing materials about oral respect markers in
a task-based business interaction may construct a corpus of naturally occurring
speech in the workplace. The teacher-trainer can use the distributions of these
respect markers and other related features from the corpus and describe the ten-
dencies of these patterns to inform the creation of teaching materials, ‘mock’
interaction activities, and question-answer sequences in business settings.
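A sketch of that workflow in Python follows: the marker list and the tiny two-turn 'corpus' are invented for illustration, but the normalization step (occurrences per 1,000 words) is the standard way to compare counts across texts of different lengths:

```python
from collections import Counter

def marker_rates(texts, markers):
    """Count target markers across a corpus, normalized per 1,000 words."""
    counts = Counter()
    total_words = 0
    for text in texts:
        words = text.lower().split()
        total_words += len(words)
        for w in words:
            w = w.strip(".,?!")  # crude token cleanup for the sketch
            if w in markers:
                counts[w] += 1
    return {m: round(counts[m] * 1000 / total_words, 1) for m in markers}

# Invented stand-in for a teacher-collected workplace corpus
corpus = [
    "Could you please send the report, sir?",
    "Thanks. Please review the figures before the meeting.",
]
print(marker_rates(corpus, {"please", "sir", "thanks"}))
```

The resulting rates, rather than raw counts, are what a teacher-trainer would use to decide which markers deserve space in the 'mock' interaction materials.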
In Part C of this book, you will read interview responses from teachers as
to what they perceive are clear limitations of CL as well as its future direc-
tions. Like most tech-based approaches, changes are inevitable in CL, and as
one teacher mentioned, “every time I desire to create a corpus-based lesson in
my class, it seems as though some part of the corpus I am using has changed.”
Updates to databases and tools happen regularly, requiring us to learn new
things and rework previous lessons. Another major theme in the ‘limitations’
discussion is the constant demand for more time in CL-based classrooms: time
to prepare lessons, to train students, to learn about new tools, and more. CL
teachers see the value of developing corpus tools into mobile apps and using
them to create communities of practice online. It is quite possible for enter-
prising groups/companies (like Duolingo or Sketch Engine) to take the lead in
fully incorporating corpus tools into existing language learning sites and ap-
plications. Overall, upon recognition of the present challenges and looking
ahead, we are positioned to experience the merging of CL approaches, internet
technology, and many other epistemological fields. CL accommodates a variety
of learners and learning contexts, and effectively complements quantitative and
qualitative research paradigms. It is evident that we are going to see CL and
its constructs and tools more directly utilized by (English) teachers in
the classroom.

Note
1 Text is a tricky word to use in CL as it may mean multiple items. In this book, the
possible meanings are (1) text as discourse (e.g., spoken and written texts), (2) text as
a particular ‘file’ (e.g., “I have two essay texts written by my students.”), or (3) text as
an encoding system (e.g., “Please save your files as .txt files, not as .doc or .pdf ”).
A2
Connections
CL and Instructional Technology,
CALL, and Data-Driven Learning

There are several important theoretical underpinnings that support the inte-
gration and use of CL in the teaching of English for a range of learners and
settings. CL is an approach in researching language and its features, and also in
supporting language teachers as they facilitate the learning and acquisition of
English. This means that it is definitely consistent with SLA theories, partic-
ularly those that emphasize the importance of sociocultural approaches: focus
on the learner (e.g., learning styles and characteristics), input and output, use
of realia and authentic texts, the importance of real-world tasks, learner-learner
interaction, and learner autonomy and exploration of linguistic data.
Conrad (2000), in a seminal article that asked whether CL might revolu-
tionize grammar instruction, noted that the final decades of the 20th century
brought about dramatically novel approaches that subsequently redirected gram-
mar teaching and research. She identified renewed interest in an explicit focus on
form and/or grammar instruction in the classroom (e.g., Celce-Murcia, 1991a;
Celce-Murcia, Dörnyei, & Thurrell, 1997; Ellis, 1998; ­Master, 1994); new ap-
proaches to grammar pedagogy, such as teaching grammar in a discourse con-
text (Celce-Murcia, 1991b); and the design of grammatical consciousness-raising
and input analysis activities (e.g., Ellis, 1995; Fotos, 1994; ­Rutherford, 1987;
Yip, 1994). Simultaneously, classroom technologies and computers were making
it possible to conduct grammar studies of wide-­ranging scope and complex-
ity. Conrad then revealed how CL was positioned to facilitate the transfer of
research data to pedagogy. She added that linguistic data from an empirical
study of language, which relies on computer-assisted techniques, can best rep-
resent the context and also serve as the source for input analysis activities. She
noted that, at that time, only one aspect of CL, concordancing, tended to be
emphasized for classrooms (citing, particularly, Johns, 1986, 1994; Cobb,
1997; and Stevens, 1995), but most ESL grammarians regarded CL as contributing
Connections  27

to the radically changing domain of grammar research and instruction. With
anticipated advances in CL, Conrad predicted in 2000 that monolithic descrip-
tions of English grammar would be replaced by register-specific descriptions;
the teaching of grammar would become more integrated with the teaching of
vocabulary; and, finally, the emphasis on grammar instruction would shift from
structural accuracy to appropriate conditions of use.
Fifteen years after Conrad’s article, Meunier and Reppen (2015) examined
contemporary literature to determine whether or not Conrad’s predictions had
actually occurred in grammar instruction. They also explored whether current
teaching materials had evolved to fully incorporate register-based descriptions
and data. They noted some positive changes: for example, that corpus-informed
textbooks had been produced and adopted in a few English courses, especially in
the university setting, but they also noted significant continuing limitations and
areas for improvement as far as the consistency of the approach and the represen-
tativeness of distributional data and corpora were concerned. They compared
eight grammar textbooks, four corpus-informed and four non-corpus-­informed,
and argued that frequency data in textbooks, corpus-based results, and CL re-
search were all important contributions to effective language teaching and still
had to be more fully incorporated into English teaching materials.
Today, in the broader field of English teaching across learners (e.g., ELLs,
ESL/EFL settings) and contexts (e.g., IEP instruction, ESP/EAP, or Language in
Society), corpora and corpus tools have been associated with three primary in-
structional approaches: (1) educational/instructional technology-based learning,
(2) computer-assisted learning, and (3) data-driven instruction. These three, es-
pecially the first two, share common characteristics: They are m ­ achine-specific,
and they also align well with and support other instructional approaches, such as
learner-centered instruction or autonomous learning. Specifically, instructional
technology emphasizes the role of tools and how they are integrated into the
learning process; CALL focuses on learning languages with the aid of computers,
with a particular emphasis on software design and evaluation; and data-driven
learning (DDL) focuses on learners’ direct discovery and use of linguistic infor-
mation/data in the language classroom and beyond. These three have been the
most common instructional approaches in which corpora and corpus tools have
been situated in various studies over the past two decades. In this section, I focus
on how CL merges perfectly into these three approaches to classroom instruction.

A2.1  Corpus Linguistics and Instructional Technology


As I mentioned in Section A1, CL emphasizes that language use is mediated by
register, which means that language teaching needs to address register aware-
ness successfully in order for learners to use the appropriate form of language
across contexts. For example, the notion of what is ‘common,’ ‘rare,’ or ‘typical’
in speech or writing is usually not meaningful for general English teaching.
Rather, language features and patterns are typical of particular registers and
will have to be prioritized and highlighted accordingly in materials design
or classroom lesson planning (Biber, 2004). Learning about registers reflects
learning English for various purposes (i.e., ESP and EAP), and it can be (best)
accomplished in the classroom when students use corpora, corpus tools, and
corpus-based materials to examine specific characteristics of spoken and written
registers. Several ESP and EAP studies report that student concordancing based
on a specialized corpus (e.g., Boulton, 2015; Chambers, Farr, & ­O’Riordan,
2011; Donley & Reppen, 2001; Friginal, 2013b; Gavioli & Aston, 2001) has
proven to be effective in awareness-raising exercises. Students are able to make
distinct comparisons between features used in one type of writing and those
used in another, and distributional data showing how they use specific linguis-
tic features in their own writing, again, compared to another corpus or written
academic texts, can provide additional motivation in editing drafts. Classroom
activities along the lines of these exercises can be organized, with CL as an
approach within instructional technology.
Instructional or educational technology primarily refers to the tools, mate-
rials, and equipment used to support the teaching and learning of a particular
subject or topic. These technologies include hardware and software and audio/
video equipment, and devices. Corpus tools running on computers, tablets, and
mobile phones are all part of these classroom-based technologies. It is necessary
to keep in mind that technology is only one of many tools that English teach-
ers have at their disposal, and it is important to note that instructional tech-
nology should supplement and support instruction and help accomplish, rather
than replace, teachers’ specific teaching goals. A corollary, as far as students are
concerned, is that the availability of technologies, whether in the classroom, at
home, or anywhere learning is taking place, should also enhance and extend,
rather than replace, what only the individual brings to the dynamics of the
learning process. Students themselves need to see various patterns and be able
to interpret what’s going on. As is the case with any tool, whether a table saw
in the woodworking shop or learning/teaching technology in the classroom,
the user—both the teacher and the student—must discover the ways in which
to fully control the tools and make them work to address their particular plans,
objectives, and needs.
The Oregon Department of Education (2002), in a publication distributed
to its teachers, suggests that it is necessary for teachers to ask themselves the
following questions: “Will the use of technology make this lesson better? Will
it facilitate student understanding? Will students’ capacity to demonstrate their
understanding increase because of it?” This publication notes that, by asking
these questions, teachers will be able to determine when these technologies are
appropriate and when they are not. The answers to these questions can be use-
ful to English teachers as they formulate goals in incorporating CL tools into
the classroom. The recognition that CL tools will not work all the time, across
language topics and lesson settings, is very important for teachers. Knowing
how the tools work and being able to take control, in case they don’t work,
are necessary in the successful integration of CL in the classroom. By thinking
about CL tools within the frame of instructional technology, English teachers
will come to view these tools as everyday, nonthreatening classroom devices.
The tools will not be a step ahead of the teachers in their instruction, and they
can use the tools when they are needed for a collocation exercise, for example,
but not when it gets too complicated or confusing for learners.

For the Teacher

Read below how one of my graduate students responded to the
same Department of Education questions, specifically on the use of concor-
dancers and a teacher-collected corpus in teaching register-specific vocabu-
lary. Her students are primarily international teaching assistants (ITAs) from
STEM (Sciences, Technology, Engineering, and Math) fields who teach these
courses at the undergraduate level in a US university.
For the Teacher: How will you answer these questions across
various scenarios or specific classes you teach? What are typical
situations where you think your answers to these questions would
be "no" or "I don't think so"?
Will the use of a concordancer [you can add other CL tools] make
this lesson on register-specific vocabulary better? Yes, I believe so. My
students are well-adjusted to using computers, gadgets, and anything related
to the internet so they are ready for this. I know that I will have to provide them
with a sufficient demo in using the tool, having them practice and understand
the basics first before they use it on their own. YouTube tutorials on this and
similar tools are available and have been assigned. I plan to have this activity
repeated throughout the semester, so a concordancing activity will be a regular
part of our classroom [We are in a computer lab, which is good.]. The lessons
will be better because students will directly engage with linguistic data that
they can extract themselves from the class corpus that I collected. They have
control of the activity more than just being presented items from the textbook.
Will it facilitate student understanding of vocabulary use and
register variation? Yes. In my classes, when ITAs first see an academic
word list, they get so excited to explore the list and identify words that they
do not use or know from the top bands. They form their own ideas about
whether such words are fully reflective of writing (and speaking) in their disciplines
or not. Activities allowing them to compare and contrast frequency distribu-
tions from two different fields (e.g., Biology vs. Computer Science) will lead
to register awareness, in my opinion.
Will students’ capacity to demonstrate their understanding of
these target words increase because of it? Yes. After the concordancing
lessons, I often provide a series of objective tests to check for comprehension,
a sentence completion activity, and a reflective essay (given as a homework
assignment) that will require the use of the target vocabulary. I do see major
improvements in how students utilize these words in their short sentences
and reflective essays.
The key for me is that, for these questions, ‘yes’ is the answer because the
design of my class, my classroom, and especially my students are all aligned
well with how technologies are positioned to enrich and extend the process
of learning. I was trained in CL in my graduate program, so I have devel-
oped a more specific point of view on how it is integrated. I can see how other
settings (different students and teachers) will not have the same affirmative
responses to these questions.
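The compare-and-contrast activity this teacher describes can be prototyped in a few lines. In the sketch below, the two one-line 'sub-corpora' stand in for Biology and Computer Science course texts (both entirely invented), and the rates are relative frequencies per 100 words:

```python
from collections import Counter

def norm_freq(text):
    """Relative frequency (per 100 words) of each word in a text."""
    words = text.lower().split()
    counts = Counter(words)
    return {w: counts[w] * 100 / len(words) for w in counts}

# Invented stand-ins for two discipline-specific sub-corpora
bio = "the cell membrane regulates transport the cell divides"
cs = "the algorithm sorts the array the algorithm runs in linear time"

f_bio, f_cs = norm_freq(bio), norm_freq(cs)
# Rank words by how much their rates differ between the two mini-corpora
diffs = sorted(set(f_bio) | set(f_cs),
               key=lambda w: abs(f_bio.get(w, 0) - f_cs.get(w, 0)),
               reverse=True)
for w in diffs[:3]:
    print(w, round(f_bio.get(w, 0), 1), round(f_cs.get(w, 0), 1))
```

Even on toy data, the output pattern students would notice is the point of the exercise: function words such as the occur at similar rates in both fields, while the discipline-specific content words float to the top of the difference list.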

A2.1.1 CL Technology Integration


The area of technology integration focuses specifically on how new technol-
ogy can become an integral part of the learning process and positively
impact student learning. The website Edutopia (http://www.edutopia.
org/), which is published under the George Lucas Educational Foundation,
states that, “Technology integration is the use of technology resources—­
computers, mobile devices like smartphones and tablets, digital cameras,
social media platforms and networks, software applications, the internet,
etc.—in daily classroom practices, and in the management of a school.”
Educators affiliated with this website believe that successful technology in-
tegration is achieved when the use of technology in the classroom (micro)
or educational system (macro) (1) has become routine and transparent, (2)
is accessible and readily available for the task at hand, and (3) is supporting
curricular goals and helping teachers/learners effectively reach their goals
(Edutopia, 2017). In other words, the technology is ubiquitous and familiar
(or a ‘second nature’) to all users.
In summary, a well-integrated use of technology resources in the classroom
comes hand in hand with well-trained teachers, appropriate topics and learning
goals, and well-prepared students. This is a rather utopian view of unified
integration in which everything works out, but unfortunately, it does not happen
like this all the time. The concept covers so many varied tools, constructs,
practices, and relationships, and as can be expected with technology, glitches
happen. Teachers’ and learners’ willingness to embrace change is also a major
requirement for success as technology continues to evolve. It is an ongoing
process and demands continual learning and shifts in paradigm.
Connections  31

Figure A2.1  Components of English language pedagogy. Adapted from Chapelle and Jamieson (2009).

Chapelle and Jamieson (2009) describe the primary components of English


language pedagogy with technology integrated into the curriculum. The three
components are teacher, learner, and English (or the content of teaching) form-
ing a triangle, with technology situated in the center of the diagram. I specified
corpus-based technology in Figure A2.1, although Chapelle and Jamieson were
referring broadly to all classroom technologies. Specifically, for CL, this model
allows for an examination of the strengths and potential weaknesses of CL tools
in the classroom for the teacher, learners, and the English lesson. It can also
help teachers in articulating how corpus-based technology would fit into their
teaching philosophy and in considering who their learners are, their own values
and preferences, and what topic/s they are teaching.
When effectively integrated into the curriculum, and when the compo-
nents of the triangle are all considered and in concert, corpus-based technology
can certainly influence English learning and acquisition in powerful ways. As
postulated in Edutopia (2017), CL tools can provide students with access to
primary source materials (i.e., the corpus); exposure to methods of collecting/
recording and interpreting linguistic data; various ways to collaborate with
other students, and, potentially, experts around the world; opportunities for
expressing understanding; and training for oral and written presentation of
their new knowledge. Teachers are also provided many opportunities for class-
room research and materials development that could lead to professional de-
velopment activities. Graduate students who have taken my CL or Technology
and Language Teaching courses at GSU have been successful in presenting their
projects in various conferences. A few have been able to publish their work
or share corpus-based lessons and their implementation in online discussion
boards or blogs.
Walsh (2014) provides a list of “Engaging Uses of Instructional Technol-
ogy” posted in EmergingEdTech! (http://www.emergingedtech.com/), which
identifies several constructs capturing the pertinent contribution of CL tools
in successful language learning. Primarily, instructional technology for active
learning, social learning, engagement with digital content, and p­ roject-based
learning are among the most relevant. Active learning, which overlaps
with inquiry-based learning as well as experiential learning, is promoted in
­corpus-based classrooms. Learners are tasked to run the tools to be able to
extract frequencies and patterns that they can later analyze and interpret. Op-
portunities for extended explorations are always available. This process can be
autonomous, and some learners may prefer to work independently, but it is also
easily designed to incorporate paired or small group work in the classroom,
leading to prospects for social learning as well. As noted by Walsh, Bandura’s
Social Learning Theory underscores that students learn from each other in the
classroom through observation, imitation, and modeling. When a CL-based
English course is set in a computer lab, a concordancer or a visualizer can also
become a social learning tool. Teachers can create activities that will encourage
discussions on, for example, grammar rules and how to make generalizations
about them based on a particular set of examples from a target register. For
example, Text X-Ray (Zhu & Friginal, 2015), a visualizer (see Sections A3 and
C4) that I have co-developed and used extensively in my classes, allows for a
visual representation of various POS categories. I ask my students to define a
noun based on what they see in the color-coded POS-tagged texts and work
in pairs to develop their definitions that might differ from traditional grammar
book definitions (e.g., nouns as names of people, places, and things). Their
discussions usually include the expected structural descriptions (e.g., what
words typically come before and after nouns) but also some new discoveries
like noun-noun sequences that they see, implying that nouns can also modify
other nouns in various contexts.
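The noun-noun discovery activity described above can also be sketched programmatically. The snippet below is a minimal illustration only (it is not part of Text X-Ray); the POS-tagged sentence is hand-annotated for the example, using Penn Treebank-style tags:

```python
# A minimal sketch of the noun-noun discovery activity described above.
# The tagged sentence is hand-annotated here; a real lesson would use
# POS-tagged corpus output from a tool such as Text X-Ray.
tagged = [
    ("The", "DT"), ("computer", "NN"), ("lab", "NN"), ("hosts", "VBZ"),
    ("a", "DT"), ("corpus", "NN"), ("workshop", "NN"), ("every", "DT"),
    ("semester", "NN"), (".", "."),
]

def noun_noun_sequences(tagged_tokens):
    """Return word pairs where a noun directly precedes (modifies) another noun."""
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1.startswith("NN") and t2.startswith("NN"):
            pairs.append((w1, w2))
    return pairs

print(noun_noun_sequences(tagged))
# [('computer', 'lab'), ('corpus', 'workshop')]
```

Even this toy version surfaces the same insight students reach in class: nouns can modify other nouns, something a "person, place, or thing" definition never reveals.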
Using CL tools to help ensure engagement with digital content supports
the notion that digital content provides learners a range of modified input
useful in language learning. Word frequencies from a corpus and the con-
text in which a word or a phrase is used can be easily extracted through CL
tools. Online built-in concordancers are interactive, requiring user responses
or actions that lead to making decisions or interpretations. CL tools are still
not fully addressing gamification techniques (i.e., the application of common
elements of game play such as point scoring, competition with others, and
rewards upon completion of tasks), deemed an important focus of instructional
technology, but there are certainly easy options for these to be
incorporated in programming CL tools. Embedding questions or instructional
videos, clicking on response requests and responding to comments, or em-
bedding discussion forums in content are potential add-ons to future CL tools
(Walsh, 2014). All these relate to digitized content but also to heightened focus
on project-based learning, which is a CL advantage and the raison d’etre of
corpus-based instruction.
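To make the frequency-and-context extraction described above concrete, the sketch below (a toy illustration, not any particular CL tool) computes word frequencies and prints simple KWIC (Key Word In Context) concordance lines; the mini-corpus is invented for the example:

```python
from collections import Counter
import re

# Invented mini-corpus for illustration; a classroom activity would load
# texts from a real corpus instead.
corpus = (
    "The students run the concordancer. The teacher checks the output. "
    "Students discuss the patterns that the concordancer returns."
)

tokens = re.findall(r"[a-z']+", corpus.lower())

# Word frequencies: the most basic corpus statistic.
freq = Counter(tokens)
print(freq.most_common(3))

def kwic(tokens, node, span=3):
    """Key Word In Context: show `span` words on each side of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left:>30} | {node} | {right}")
    return lines

for line in kwic(tokens, "concordancer"):
    print(line)
```

Built-in concordancers in online tools do essentially this over millions of words, adding sorting, register filters, and clickable expanded contexts.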

A2.2 Corpus Linguistics and Computer-Assisted Language Learning (CALL)
Simply put, CALL is the approach to how computers can best be applied to
language teaching and learning. It embraces all types of applications—hardware
and software, and everything that has now been identified as ‘computers,’ from
traditional equipment to tablets, mobile technology, and cloud-based com-
puting. An important focus is to fully establish how these technologies work
(or fail to work) and how they are designed to accomplish effective teaching
and learning of first, second, or foreign languages across a range of education
levels and contexts.
Present-day CALL places great importance on learner-centered activities facil-
itated by devices and especially those that encourage learners to work on their
own and establish control of their learning of a language (Lamy & Hampel,
2007). The internet has revolutionized CALL and how learner and teacher
networks are being redefined. Online instruction, course learning platforms,
and distance learning have also dramatically changed in the past three decades
as a result of exponential advances in online technology.
CALL language learning software production is also evolving dramatically.
Originally purchased as CD-ROMs in their initial iterations (e.g., Rosetta Stone
products), these software programs have now been shared as free apps (e.g.,
Duolingo) running on tablets and mobile devices. As these language learning
applications follow CALL principles, they typically embody two important
structures: interactive and individualized learning platforms. CALL, then, is
essentially an instructional approach to language learning that helps learners di-
rect and control their progress. Computers and especially their software can be
used to reinforce what has already been learned in the classroom or as a remedial
tool for those who need additional support (Kukulska-Hulme & Shield, 2008).
Why situate CL in CALL? In applied linguistics, CALL has successfully
established its theoretical and practical foundations with SLA. Carol Chapelle
(Iowa State University) and Joan Jamieson (Northern Arizona University)
note that CALL software design features and evaluation criteria for multime-
dia CALL are developed on the basis of hypotheses about ideal conditions for
second language learning. CALL researchers and software developers outline
a relevant theory of SLA that complement hypotheses for ideal learning con-
ditions such as input saliency, opportunities for interaction, and learner focus
on communication. CALL software designers are required to adhere to sound
research and language learning principles from a variety of cross-disciplinary
sources (which include CL). Despite the potential value of diverse perspectives
(e.g., especially about learning styles and communicative language learning),
SLA theory and research are most often utilized in CALL design and to guide
research on effectiveness. Figure A2.2 illustrates the relationship between target
second language skills and how CALL software and instructional tools could be
prioritized in the classroom, from the lexico-syntactic level all the way to the

Figure A2.2  The relationship between target second language skills and priorities for CALL design. Adapted from Chapelle and Jamieson (2009).

acquisition and use of content-based language. CALL design based on
hypotheses about SLA and these eight skills, matched with continuing
evaluation, is deemed important.
In Figure A2.2, grammar and vocabulary teaching are intertwined in the
core circle, surrounded by the four macro skills (reading, writing, listening, and
speaking), which are all covered under discourse-level instruction of commu-
nication skills and content-based language. The model implies the role of regis-
ters: for instance, in how a specific content will encompass performance-based
topics and then the lexico-syntactic features a learner needs to acquire as he/
she progresses into the communicative-interactional phase of language learn-
ing and use. CL will be ‘strong’ in some areas more than in others, based
on this model. Speaking and listening skills, for example, may be best taught
with multimedia CALL rather than with corpora, unless the corpora are
expertly annotated and linked to sound files. However, grammar and vocabulary,
reading and writing, and register-based topics all fall within the immediate
purview of CL tools and applications.

A2.2.1 Evaluating CALL (and CL) Tools for SLA


Chapelle’s (1998) foundational article connecting multimedia CALL and in-
structed SLA outlines essential SLA hypotheses and how CALL can address
them successfully. These hypotheses, primarily influenced by the works of
Long (1996) and Pica (1994) on input/output and error processing, articulated
ideal conditions for second language learning that are relevant for developing
multimedia CALL. I followed Chapelle’s framework in Table A2.1, also noting
empirical and/or theoretical bases for these selected hypotheses and focusing
specifically on CALL and CL’s contributions.

Table A2.1  Chapelle’s (1998) framework on SLA hypotheses and CALL/CL applications

1 The linguistic characteristics of target language input need to be made salient.
Empirical Basis: Target language input facilitates L2 acquisition by distinguishing between useless target language noise and target language input, apperceived by the learner, which may influence language development. A learner's noticing of linguistic input plays an important role in making unknown target language forms into known and used forms (Schmidt & Frota, 1986).
CALL Focus: CALL design methods focus effortlessly on effective input enhancement. Even with factors internal to the learner that influence the likelihood of apprehension, instructional multimedia materials are developed to facilitate apperception of input.
CL Focus: The saliency of lexico-syntactic features of language is in the forefront, taken from authentic texts and according to registers relevant to the focus of instruction. The prominence of the input is shown through frequencies, contexts, and visualized data.
2 Learners should receive help in comprehending semantic and syntactic aspects of linguistic input.
Empirical Basis: The processable input useful to the learner contains linguistic forms that the learner may not yet know. Because of this, the learner needs help with the specifics of the input to comprehend it both semantically and syntactically.
CALL Focus: Modified input consists of such features as simplification, elaboration, or added redundancy (Larsen-Freeman & Long, 1991), which are all easily addressed in multimedia CALL. Language learning software like Duolingo starts from very simple vocabulary and syntactic structures of language, with greater repetition, before moving to more complicated structures. The move to the next level is guided by successful completion of prior tasks.
CL Focus: The interaction between lexis, grammar, and meaning can be clearly established through the extraction of collocations, their typical patterns, and then the interpretation of their meanings, including the analysis of semantic prosody. A feature of the corpus tool Sketch Engine called SkELL (see Section A3) provides learners with a range of information on a word and its usage compared to built-in corpora like selected registers from the British National Corpus.
3 Learners need to have opportunities to produce target language output.
Empirical Basis: Comprehensible output is learner language that is intended to convey meaning to an interlocutor while stretching the learner's linguistic resources or repertoire. It may be important that learners have an audience for the linguistic output they produce so that they attempt to use the language to construct meanings for communication rather than solely for practice.
CALL Focus: Multimedia CALL has improved tremendously in the past decade to provide learners immediate opportunities to produce language. One groundbreaking approach utilizes voice recognition applications. Learners can provide a source output by responding to an automated prompt that will be recorded, processed quickly, and then interpreted by the tool to return immediate feedback.
CL Focus: Producing target language output is accomplished through activities designed to complement corpus-based lessons, especially in the writing classroom. In a lesson on using stance markers in argumentative essays, for example, learners can be asked first to work on concordances of categories such as hedges and boosters, examine their frequencies and patterns, and then incorporate them into their own writing.
4 Learners need to notice errors in their own output.
Empirical Basis: The syntactic mode of processing helps learners to internalize new forms and to improve the accuracy of their existing grammatical knowledge. The process of noticing can occur through learners' own reflections and monitoring or through triggers provided by others.
CALL Focus: A learner is guided by CALL tools and exercises to an increased awareness and noticing of their errors through visual cues and immediate feedback (e.g., from writing support tools like Grammarly and similar applications). A linguistic problem or question can be presented, accompanied by various response options that, depending on learners' selection, can change their output accordingly, and the tool can then evaluate the selection and provide feedback. Noticing a problem will require that the learner interacts with the other options provided by the software.
CL Focus: There are several CL tools that show learners' typical errors in production (especially written production) that will promote learners' noticing. Error-tagged texts from essays produced by L2 writers can highlight common grammar errors (e.g., subject-verb agreement, tense/aspect use) in composition and will allow the learner to see his/her error patterns alongside other learners with the same background.

5 Learners need to engage in target language interaction whose structure can be modified for negotiation of meaning.
Empirical Basis: Negotiation of meaning refers to the process of comprehending input with less than perfect comprehension, producing output with less than perfect success, and identifying instances of imperfect communication and trying to resolve them. This process occurs when the normal conversational interaction is modified because of communication breakdowns. Miscommunication, as evidenced by modified interaction, focuses learners' attention on language, and through the resolution of miscommunication, makes input comprehensible (Larsen-Freeman & Long, 1991).
CALL Focus: Modification of the interactional structure of conversation or of written discourse during reading is targeted in multimedia CALL using built-in scenarios and also the opportunity to engage learners in settings mediated by technology. Data from guided Skype conversations or online discussion boards show that learners are able to negotiate meaning easily in the right context where they feel comfortable asking questions and explaining ideas.
CL Focus: Learner-learner interaction during concordancing activities in the classroom provides opportunities for them to engage in discussions of grammatical form and function. Teachers can modify the activities to include conflicting structures for the learners to interpret, evaluate, and make decisions as to what form is ideal based on the register.
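The collocation extraction mentioned in the CL Focus entries of the table can be sketched in a few lines. The snippet below is an illustration only (tools like Sketch Engine compute far richer statistics over real corpora): it counts bigrams in an invented mini-corpus and scores them with pointwise mutual information (PMI), one common collocation-strength measure:

```python
from collections import Counter
import math
import re

# Invented mini-corpus; a real lesson would use a register-specific corpus.
text = (
    "Students make decisions and make progress. Teachers make decisions "
    "about feedback. Learners make progress with practice."
)
tokens = re.findall(r"[a-z]+", text.lower())

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(w1, w2):
    """Pointwise mutual information: how much more often the pair occurs
    together than chance would predict from the individual word frequencies."""
    p_joint = bigrams[(w1, w2)] / (n - 1)
    p1, p2 = unigrams[w1] / n, unigrams[w2] / n
    return math.log2(p_joint / (p1 * p2))

# Which words follow "make", and how strongly are they attracted to it?
for (w1, w2), count in bigrams.items():
    if w1 == "make":
        print(f"make {w2}: count={count}, PMI={pmi('make', w2):.2f}")
```

Learners inspecting such output can see not just that "make decisions" recurs, but that the pairing is stronger than chance, which is the statistical intuition behind collocation displays in corpus tools.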

Another important focus of CALL is evaluation. Multimedia CALL will


have to be thoroughly evaluated or assessed to determine its usefulness in the
learning process. This strong emphasis on software evaluation allows CALL
to realistically position technology as simply a tool or material that assists
teachers and students in the learning process. Deciding when to use CL tools in a
language learning lesson or a curriculum is one of the essential steps teachers
should include as part of a needs analysis as the syllabus is designed.
A sample software evaluation questionnaire also developed by Chapelle and
Jamieson (2009) and adapted for CL tools is provided. In this evaluation rubric,
the primary criteria, also influenced by CALL/SLA principles, are (1) learner
fit, (2) explicit teaching of linguistic features, (3) interaction with the computer,
(4) interaction with other learners, (5) types of evaluation, and (6) learner strat-
egy development (Table A2.2).

Table A2.2  Criteria for evaluating CL tools and materials

Learner fit
• Does the CL tool or do corpus-based materials fit the learners in terms of the topics that are covered?
• Will the learners be able to see the need for learning target features?
• Are corpus-based activities chosen to match the students' level of language?

Explicit teaching
• Do corpus-based activities provide explicit instruction to teach target linguistic features?

Interaction with the computer
• Do corpus-based activities provide opportunities for interaction with the computer in a way that focuses learners' attention on particular forms?
• Does the CL tool provide learners with help in understanding the features that they do not know?

Interaction with other learners
• Do corpus-based activities guide learners to work with others? Can you think of other collaborative activities that you could develop based on this material?

Types of evaluation
• Do the activities provide feedback to learners about their responses?
• Does the program provide evaluation of learning outcomes (e.g., through quizzes that give learners evaluation of their performance)?

Learner strategy development
• Do corpus-based activities promote good learning strategies?
• Do the activities provide guidance for students to develop strategies that will help them continue to learn on their own?

Adapted from Chapelle and Jamieson (2009).



For the Teacher

Try to use the aforementioned questionnaire to evaluate CL tools or corpus-based


lessons. You can start with any online CL database like COCA or Cobb's (2016)
Compleat Lexical Tutor (read more about LexTutor in Section A3 and in Section
C2.3). Use various features of your selected tool to develop lessons in vocabulary
teaching for international students, or apply this directly to your own class,
with a specific target linguistic feature to be taught.
Next, choose any lesson in Part C of this book and evaluate how the
teacher designed and facilitated the lesson using the criteria in this rubric.

Questions:
What specific criteria presented in the questionnaire are best addressed by
CL tools and materials? What are those that will be clear limitations?
What are innovative ways to make learners interact with each other in the
CL classroom? What problems related to Learner Fit do you anticipate?

A2.3  CL and Data-Driven Learning (DDL)


DDL is specifically a CL-based instructional approach in the language classroom.
It originated from the late Tim Johns’ work at the University of Birmingham and
was also referred to originally as “Classroom Concordancing.” This strategy per-
mitted learners to access data from corpora and encouraged them to “discover the
foreign language” and its inherent structures. From the mid-1980s to early 1990s,
CL was a groundbreaking approach to research, but was not yet fully realized in
classroom pedagogy. Johns emphasized that the language learner needed to also
be a researcher of language whose learning had to be driven by access to linguistic
data. The central idea behind DDL is that giving access to resources and tools and
providing research training to learners in the use of corpora will allow them to
become autonomous in their own acquisition of the target language.
For Johns, the process of learners discovering the foreign language also requires
that language teachers provide a context within which learners can develop strate-
gies for this discovery. These machine-mediated strategies are important for them
to “learn how to learn.” Again, autonomy is emphasized here but with strategic
teacher support, given the existent limitations in classroom settings and availability
of shared computers in the early-1990s. Johns described the relationships between
the learners, teacher, and machine in the foreign language classroom:

At the heart of the approach is the use of the machine not as a surrogate
teacher or tutor, but as a rather special type of informant. The difference
between teacher and informant can best be defined in terms of the flow

of questions and answers. The teacher typically asks a question (answer al-
ready known) to check that learning has taken place: the learner attempts to
answer that question: and the teacher gives feedback on whether the question
has been successfully answered. Such is the I(nitiation)-R(esponse)-
F(eedback) structure of the classroom exchange as described in Sinclair
and Coulthard (1975): and such, too, is the structure of the typical “tu-
torial” computer program that purports to “teach a foreign language.”
The informant, on the other hand, is passive - and silent - until a question
(­answer unknown) is asked by the learner. The informant responds to
that question as best he (or she) can: and the learner then tries to make
sense of that response (possibly asking other questions in order to do so)
and to integrate it with what is already known.
(1994, p. 1)

The primary theoretical basis for DDL in SLA is the proven value of learners’
active examination of naturally occurring language and their discoveries of lin-
guistic patterns and rules on their own (Boulton, 2009). The focal point of an
effective DDL is the guided but free-form exploration of a language and its fea-
tures. It is in the discovery of these patterns that learners can articulate insights
and experience a degree of self-sufficiency in their language learning. As Johns
was still formally articulating this approach, corpus data (i.e., teachers’ access to
corpora) were clearly limited, and he was also restricted to basic concordancing
tools only available on a select number of computers. However, reflecting on
the years his students had used concordance output regularly, he reached
three primary conclusions (1991):

1. The use of the concordancer can have a considerable influence on the


process of language learning, stimulating enquiry and speculation on the
part of the learner, and helping the learner also to develop the ability to see
patterning in the target language and to form generalizations to account
for that patterning.
2. The teacher has to learn to become a director and coordinator of student-initiated research (this may be difficult for teachers to come to terms with).
3. The place of grammar in language learning and language teaching needs
to be reevaluated. Traditional grammar-based methods are vitiated by as-
sumptions about how grammar is learned and what is to be learned. The
how usually involves presenting the student with a known set of “rules” or
“patterns” that are then applied in ‘constructing’ text in the foreign lan-
guage. This is the opposite of the DDL process (i.e., traditional grammar
teaching is top-down; DDL is bottom-up).
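One classic concordancing move behind Johns' first conclusion (helping learners "see patterning" and form generalizations) is sorting concordance lines by the first word to the right of the node, so that recurring patterns line up visually. A rough sketch, with invented example lines standing in for real concordancer output:

```python
import re

# Invented example sentences; in a DDL lesson these lines would come
# from concordancer output for the node word "interested".
lines = [
    "she was interested in the results",
    "they seemed interested in corpus work",
    "he is interested to hear the answer",
    "we are interested in your feedback",
]

def right_sort(concordance_lines, node):
    """Sort concordance lines by the word immediately right of the node,
    so recurring patterns (e.g., 'interested in') line up together."""
    def key(line):
        words = re.findall(r"[a-z]+", line.lower())
        i = words.index(node)
        return words[i + 1] if i + 1 < len(words) else ""
    return sorted(concordance_lines, key=key)

for line in right_sort(lines, "interested"):
    print(line)
# the three "interested in" lines sort together; "interested to" sorts last
```

Once the "interested in" lines cluster, learners can induce the preposition pattern themselves, which is exactly the bottom-up discovery Johns contrasts with top-down grammar teaching.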

Since Johns, DDL has grown in popularity as a pedagogical tool, embraced in


EAP settings, and has incorporated register awareness and grammar teaching

that is based directly on linguistic data (Aston, 2015; Boulton, 2009; Charles,
2014; Friginal, 2013b; Geluso & Yamaguchi, 2014; Lee & Swales, 2006). Two
lessons in Part C of this book are situated specifically on DDL: C2.2 on “Us-
ing a Concordancer for Vocabulary Learning with Pre-Intermediate EFL Stu-
dents” from McNair and C3.6 developed by Nolen on “A Long-Term, Corpus
Exploration Project for ELLs.”
As I mentioned in Section A1, corpus-informed grammar teaching materials
such as the LGSWE (Biber et al., 1999) helped tremendously in introducing
corpus-based data to applied linguists as well as the general audience composed
of English language teachers and language learners. The LGSWE provides ex-
tensive distributional data of the lexico-syntactic features of written and spoken
registers of  British and American English, and it also presents corpus findings that
explain the functional parameters of these two national English varieties based
on comparative patterns of usage. The analysis and presentation of corpora from
the LGSWE have contributed an assortment of frequency distributions to many
language/grammar classrooms with applications to register and cross-linguistic
comparisons. This growth in corpus-based materials production is accompanied
by much easier access to these tools and by the continuing focus on CL research.
DDL has also now been recognized to coordinate well with several theories, ap-
proaches, and subfields within applied linguistics. DDL outside of the classroom
has also been established as a clear focus for the application of lessons and activities.
However, there are clear challenges and limitations to the successful utiliza-
tion of the DDL methodology. Learners who are not used to this approach may
be intimidated and find DDL technically complicated, especially lower-level
foreign language learners (of English). Kaltenböck and Mehlmauer-Larcher
(2005) point out that “being confronted with a vast amount of unordered,
‘messy’ data, can indeed be a frustrating, even daunting experience” (p. 79).
This is a clear challenge that has a tendency to evoke resistance from learners.
Geluso and Yamaguchi (2014) observed that there was some skepticism and
reluctance from their learners toward DDL, especially during the beginning
stages of instruction. This skepticism is limited not only to learners in the
English classroom. Römer (2011) observed that there is, in addition to learner
reticence, a reluctance to use and/or produce corpora by teachers and materials
writers due to the prospect of having to process an enormous amount of data.
To many learners and teachers, DDL may not be immediately pedagogically
appealing, to say the least, because reading line after line of authentic language
in order to induce meaning can become monotonous (Nolen, 2017). Clearly,
DDL requires an investment on the part of learners and teachers. Learners’ ages,
learning styles, technical knowhow, proficiency in English, and educational
backgrounds all play a role in how receptive they are to the approach (Geluso &
Yamaguchi, 2014). It must also be acknowledged that it can sometimes take
weeks for teachers to properly introduce and teach corpus tools. This was the
case in Geluso and Yamaguchi’s DDL curriculum for which three weeks were

set aside to explain and discuss the utility of DDL so that students would be
adequately motivated to invest their time and effort. Similarly, Mizumoto and
Chujo (2016) disclosed that in a 15-week semester, their corpus-based course
utilized the first 10 weeks to simply provide adequate DDL instruction.
Another widely raised concern is that DDL is only applicable for proficient,
intermediate to advanced learners of the language (Tribble, 2015). This seems
to be a logical reservation, since DDL often focuses on authentic, native-like
language in academia that may surpass the abilities of most beginner-level
learners. However, Boulton (2009) found that learners at an intermediate-low
proficiency level also benefited equally in learning new vocabulary from
corpus-based data rather than from traditional pedagogical strategies such as
the utilization of bilingual dictionaries (e.g., translators and electronic dic-
tionaries that provide word synonyms) within the context of certain tasks.
Boulton adds that DDL is most suited for corpus-trained, advanced proficiency
learners, but it also provides measurable benefits for lower-proficiency learners,
when guided as necessary according to their ability levels. Mueller and Jacobsen
(2016) also found that learners at lower-level English proficiencies can use DDL
as a means of error correction in their writing. DDL provides examples and as
much data as a learner is willing to extract and explore. Concerns about learner
proficiency can, in part, be resolved by various scaffolding exercises.
The current consensus is that DDL is a useful approach in the English class-
room, promotes students’ autonomous learning (Charles, 2014; Mueller &
Jacobsen, 2016), provides them with a wealth of linguistic information (Aston,
2015; Römer, 2011), and initiates their training to become language-based re-
searchers as they acquire the knowledge and abilities to utilize it (Friginal,
2013b). These learners need the support of a variety of language learning tools
in order to meet the challenges and demands of learning English, especially
in academia, for various purposes. DDL, then, can be a powerful approach to
fully expose learners to authentic language in order for them to examine and
understand how the language is structured and used naturally—leading, conse-
quently, to their own successful use. It is by no means an easy task, yet Römer
(2011) argues that DDL can certainly empower learners as it raises their lan-
guage awareness. She adds that corpora and corpus tools have great pedagogical
potential in creating a habit of obtaining data and developing a sense of own-
ership that will likely keep students motivated in their learning of a language,
both inside and outside of the classroom.
Below is a reflection by Nolen (2017) upon experiencing DDL as an English
teacher tasked to develop lessons and activities on phraseology for interna-
tional students in the US. In this reflection, he focuses on a collocation example,
showing students how to go beyond simple dictionary meanings into a deeper
examination of linguistic chunks to discover the functions of and important,
pragmatic information about words and phrases. See also Nolen’s semester-long
course plan for a DDL vocabulary and grammar instruction (Section C3.6),
focusing especially on concordancing activities outside the classroom.
Experiencing CL and DDL in Practice through Phraseology

Matthew Nolen, Conexion Training Panamá, Language Program Director

CL in general has contributed greatly to the rise of research in the field of
phraseology and its pedagogical applications. Phraseology is not necessar-
ily a new field in linguistics, with its foundation partially attributed to Firth
(1957) who argued that, “You shall know a word by the company it keeps” (as
cited in Friginal & Hardy, 2014a, p. 41). Firth’s statement encapsulates the
essence of phraseology—to know that the meaning of a phrase is not always
contained entirely within the phrase itself but is, sometimes, also defined par-
tially or refined by the words around it. This, then, implies that the complete
grasp of a phrase’s intended meaning must sometimes extend even beyond
the borders of its neighboring words to include entire chunks of language. As
expressed by Römer (2009), language is highly patterned and quite system-
atic, and these patterns are not rare, either.
Phraseology holds that lexis and grammar are not separate aspects of
language, but rather inseparable and interconnected through phraseological
items (Römer, 2009; Sinclair, 2004; Tribble, 2015). These items may carry spe-
cific semantic and pragmatic functions and are memorized and stored
holistically as a single, inseparable unit in the mind (Aston, 2015). A
good example is the expression, How’s it going? Pragmatically, this question
is not always an inquiry of one’s current well-being during the time the ques-
tion is asked, but rather a greeting. Clearly, the phraseological item, How’s
it going? has its own function within English that goes beyond its literal use.
DDL fits nicely into phraseological theory since it essentially exposes learn-
ers directly to the patterns and chunks of language so that they can start to
explore and interpret various meanings immediately. Collocations, the term
for one such type of pattern, are an important aspect of language
learning that clearly establishes the lexis-grammar connection necessary in
helping learners’ process of discovery. Hoey (2005) notes that “[c]ollocation
is, crudely, the property of language whereby two or more words seem to
appear frequently in each other’s company” (p. 2). For example, to laugh with
(someone) is appropriate and expected in English, whereas to laugh around
(someone) could be potentially understood, but sounds unnatural.
With DDL, concordancing activities on collocations present learners with
lines of authentic language that literally center on a particular word or phrase,
plus collocates. A particular command to extract the left and right collocates
of a word or a phrase is easily accessible from many present-day concordanc-
ers (see Section A3). These concordancers easily provide the user with the or-
ganized contexts of items that are searched. Focus is placed on the word or
phrase of interest and not on the meaning of the sentence or paragraph as a
whole. This may seem confusing and limited at first, but concordance lines
typically yield enough information to inform learners of the various uses of
words or phrases in relation to collocations or patterns. Boulton (2009) ar-
gues that this scope allows a focus on form and meaning in short, multiple
contexts, showing various usages simultaneously and without the distrac-
tion of longer stretches of discourse. This means that, while the discourse is
not fully attended to, the patterns found between words are extracted and
will be sufficient to enable the qualitative interpretation of meaning.
In DDL, what are learners supposed to look for when using corpus tools?
A quintessential element of DDL is the discovery of patterns. Aside from
learning the technical components of corpus tools, learners need to be pro-
ficient in noticing activities to be able to discern different structural pat-
terns in the language. Kennedy and Miceli (2010) defined such a practice as
­pattern-hunting, in which learners are prompted to explore words and their
collocations from a corpus, either for an assignment or simply for the sake
of exploration and discovery in the classroom.
As an example, consider an exploration of the word fact. College-level
ESL learners may first look the word up in an online dictionary (e.g., Dictionary.com
provides eight definitions, but they would likely choose the first:
“something that actually exists; reality; truth”). At this point, learners may
assume that they already know the meaning of fact as it is not particularly
complicated and exist, reality, or truth are likely common words. However,
they may not realize that when used in expressions like in fact, the fact that,
or as a matter of fact, there is a difference in resulting shades of meaning
and also a shift in pragmatic usage. The concordance lines in Figure A2.3
reveal that, pragmatically, in fact is often used to express that the opposite
of a perceived opinion or understanding is correct. In contrast (not shown
in the figure), as a matter of fact is often used emphatically to reinforce a
previously established fact.
One particular aspect in which dictionaries are getting better, which might
exemplify Römer’s (2011) “indirect applications of corpus data,” is in the rec-
ognition of pragmatic information about words and phrases. This improvement is
still largely unavailable to many dictionary users, but learners tasked to com-
plete DDL activities would likely become attuned to this varying functional
information, which is often not recorded elsewhere, and to how (and where) to
find it. In addition to examining general language use, concordance lines also
make markers of variation in language, such as the vernacular, much more
apparent and accessible to learners (Friginal & Hardy, 2014a).
Teachers are seen more as facilitators than direct disseminators of linguistic
information in the DDL classroom, since they focus more on providing con-
texts, assignments, and guidance throughout the process of discovery by the
students themselves.

Figure A2.3  Concordance lines for in fact in academic written texts.

It is important to note that this concept may not match
the cultural expectations of learners from countries where teachers assume a
very prominent and directive role in all forms of instruction. More importantly,
however, teachers are responsible for training students on the use of corpus
tools—clearly indicating that they themselves need to know these tools
thoroughly and be able to address any potential glitches as students acquire and refine their
self-learning skills. Students should hear and, hopefully, appreciate the rationale
in support of the DDL approach and the benefits that will accrue to them as
learners and practitioners from developing proficiency in the use of these tools.
The primary goal again is autonomous learning for students. In describing
the process leading to learner autonomy, Kaltenböck and Mehlmauer-Larcher
(2005) note that it resembles a continuum where the control gradually shifts
from the teacher to the students, as there are changes in attitudes, knowledge,
and abilities that will be observed throughout the process.
To summarize, the benefits of DDL outweigh the challenges. This is espe-
cially true for learners who gain access and training to investigate language
data not only independently, but also extensively. Learners can truly become
researchers or ‘detectives’ of their target language. The roles of learners and
teachers tend to change or shift, but both will continue to develop strategies
for discovery. This places the initiative and specific path to learning the lan-
guage on the learners, and demands that they take some of the responsibility
and accountability involved in the process—as well as enjoy the benefits to
be derived from doing so!
A3  Analyzing and Visualizing English Using Corpora

Various analyses of corpora can be accomplished using relatively simple
computer software programs, many of which are freely available online, referred to
as “freeware.” In Section B3, I provide a list of these corpus tools, particularly
those that are relevant to teachers, including a description of what they do, their
developers, and where they can be downloaded.
The most common and most relevant corpus tool for teachers and learn-
ers is the concordancer. AntConc (Anthony, 2014), WordSmith Tools (Scott,
2012), and MonoConc Pro (Barlow, 2012) are stand-alone concordancers that
are easy to use and have intuitive commands in running searches and other
functions. Concordancers are also included as built-in applications in MICUSP,
COCA, and many other online databases. These are programs that
can extract words or key words as they appear in a corpus. Word or phrase
frequencies can be easily obtained, and the contexts within which these words
are used can also be collected by taking words that appear before and after the
designated key words in the corpus. This process is known as Key Word in
Context or KWIC. Concordancers can also easily provide a word frequency
list (from the most common word to those appearing only once), n-grams, and
extract collocations of a target word or phrase. Advanced corpus researchers
may need to use very specialized computer programs designed to extract par-
ticularly unique patterns that are not provided by concordancers (Friginal &
Hardy, 2014a).
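The core of what a concordancer does can be sketched in a few lines of Python. The toy function below (an illustration of the logic only, not how AntConc or WordSmith Tools is actually implemented; the sample sentence is invented) prints KWIC lines for a search word:

```python
import re

def kwic(text, keyword, window=4):
    """Return KWIC lines: `window` words of left and right context
    around every occurrence of `keyword` in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>30} | {keyword} | {right}")
    return lines

sample = ("I know what you mean. You know, I don't know "
          "if she will come, but you never know.")
for line in kwic(sample, "know"):
    print(line)
```

Real concordancers add sorting by left or right collocate, regular-expression searches, and frequency breakdowns on top of this basic windowing step.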

A3.1  Linguistic Analysis of Corpora


The following subsections provide a brief discussion of common linguistic con-
structs typically investigated using corpora that are useful to teachers in developing
materials for the classroom. We start with basic unit-level frequency distribution,
from a single word or a phrase, then move on to KWICs, collocations, multiword
units, key words, and patterns of co-occurrence of various tagged features.

A3.1.1  Frequency of Single Features


Determining the frequency of a single linguistic feature from corpora is one
of the most basic types of analysis in corpus-based research. Questions such as
“What are the most frequently used words in A-graded laboratory reports in
Chemistry?” or “What are the top 12 most common lexical verbs in spoken
American English?” are easy to extract from the relevant corpus. The former
simply requires running the word list function of AntConc, and the latter will
first require a corpus that is tagged or annotated for part-of-speech (POS), that
is, the teacher will have to utilize a POS-tagger (see Section B3) to obtain the
frequency of the most common lexical verbs—these are meaning-carrying,
one-word verbs, such as sing, talk, think, or find and their lemmas—in the
corpus. As emphasized in the previous section, frequency is important for
teachers in describing the features of language varieties, including academic
language, and in determining what to focus on when considering how to
teach vocabulary or grammatical features. Popular word lists such as Cox-
head’s (2000, 2011) or Nation’s (2001) “Academic Word Lists” (see Section
C2) have been used in developing teaching and learning materials for students
in many academic writing/speaking classes (Friginal et al., 2017). Biber (2006)
noted that although most ESP/EAP studies have focused on written academic
discourse, more recently, researchers have also turned their attention to uni-
versity classroom discourse and the combined frequencies of various linguistic
features. In addition to individual counts and frequency distributions (e.g.,
counts for how many pronouns or counts for ‘in contrast’ or however), exploring
the distribution of functional features, such as the study of stance and eval-
uation, informational discourse, and hedging in speech, has provided results
for comparison across academic registers. Frequency is important in both the
description of language varieties and in determining what to focus on in a vo-
cabulary lesson. For example, it has been shown that even language specialists
cannot accurately estimate the relative frequencies of words in a particular
setting (Alderson, 2007). This is a paradox because many of our intuitions about
the existence and frequency of words, word types, and grammatical constructions
are influenced by what stands out to the observer as different. Thus, casual
observers of language may be more likely to perceive infrequent linguistic
features as frequent (Friginal & Hardy, 2014a).
Frequency will have to be properly measured and reported. The frequency
of a linguistic feature is relevant when compared with other features or when
interpreted across registers. In order to make correct comparisons and inter-
pretations of frequency data, normalized frequency (nf) will have to be
presented. Relative frequency can be determined by calculating the frequency
of the construct per x number of words. Depending on the feature and the size
of the corpus, a teacher might choose to measure the number of occurrences
per 100, 1,000, or 1,000,000 words. This process is also called normalizing (i.e.,
normed count or normed frequency). In many of my studies of word/gram-
matical constructions, I normalize the number of instances per 1,000 words,
following a simple calculation:

nf = (number of occurrences ÷ total number of words) × 1,000

Normalization not only allows for teachers to compare linguistic features with
one another but also, more importantly, allows us to compare texts and corpora
of differing lengths.
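The calculation is also easy to automate. A minimal Python sketch (the counts and corpus sizes below are invented for illustration):

```python
def normalize(raw_count, corpus_size, per=1000):
    """Normalized frequency: occurrences per `per` words."""
    return raw_count / corpus_size * per

# The same raw count of 250 means very different things in corpora
# of different sizes:
nf_small = normalize(250, 100_000)  # 2.5 occurrences per 1,000 words
nf_large = normalize(250, 500_000)  # 0.5 occurrences per 1,000 words
print(nf_small, nf_large)
```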
So, returning to my earlier question about the top 12 most common lexical
verbs in spoken American English, normalized frequency data is actually avail-
able for determining this. Figure A3.1 shows the top 12 lexical verbs obtained
from the Longman corpus (Biber, 2004).
Biber reports that these 12 verbs are very common in spoken interaction,
and they alone will comprise close to 50% of instances of lexical verb use in
the corpus. Based on these frequencies, teachers may start a lesson on teach-
ing verbs in conversation by focusing on introducing the forms and functions
of the first five: get, go, say, know, and think. University IEP students who are
in their first semester in the US in an English oral communication class may
directly benefit from this activity as they will hear these common verbs very

Figure A3.1  Top 12 most common lexical verbs in spoken American English, normalized per one million words (frequency per million words: get 9500, go 7300, say 7000, know 6800, think 4700, see 3200, want 3100, come 3010, mean 2500, take 1900, make 1400, give 1200). Adapted from Biber (2004).
frequently in interactions in and outside of the university setting. The following
text extracts illustrate the various forms and meanings of get in conversation
or informal speech:

Text Samples A3.1  Forms and meanings of get in conversation

Obtaining something (activity): Check if they can get some of that bread.
Moving to or away from something (activity): Get in the car.
Causing something to move (causative): We ought to get these wedding pictures
into an album.
Causing something to happen (causative): Uh, I got to get Max to sign one, too.
Changing from one state to another (occurrence): So I’m getting that way now.
Understanding something (mental): Do you get it?

A3.1.2  Concordances and KWICs


Traditionally, concordances are reference books comprised of alphabetical listings
of all significant content words in the source material, excluding grammatical
and functional words (e.g., prepositions, articles, adverbial phrases). In addition
to this alphabetized index of primary words from the source text, a secondary
list of words that co-occur before or after the primary word elsewhere in the text
is also provided, which enables users to understand the contextual meanings of
each word in the material. Scholars of the Bible, the Qur’an, and other significant
religious and historical texts created concordances for these documents manu-
ally before computers expedited the task. Concordances are provided today in
study or teaching editions of the Bible as appendices or footnotes, and early edi-
tions of literary works by Shakespeare, Plato, and Homer, for example, have
concordances that facilitate cross-referencing of relevant words, terms, and re-
peated word usage. These concordances are useful in helping identify key words
and, very importantly, in defining the subtle nuances and semantic meanings
intended by authors in the various, particular contexts that are essential to a com-
plete understanding of the texts. Concordances often provide additional author
commentaries, biographer footnotes, and editor narratives (Friginal, 2015).
Concordances derived from digital text files of actual language usage by
speakers and writers in particular groups can provide comparative qualitative
and quantitative data useful in characterizing the shared meanings of those
in the defined group. Concordances can be utilized to identify the different
usages and frequency of a content word, examine word collocations, explore
the distribution of key terms and phrases, and create a list of multiword
units. All of these additional features can be produced immediately from
AntConc, and the resulting concordance lines can be saved for additional
qualitative coding and analyses. Cross-comparisons of these concordances
and their distributions across groups of speakers/writers may be invaluable in
applied linguistics. Text Samples A3.2 show KWIC lines for the phrase in my
opinion from a corpus of personal blogs written by women based in the US
(collected by Samford, 2013).

Text Samples A3.2  Concordance lines for in my opinion in personal blogs

 1  stumbling expression (in my opinion I mean when I try writing anyway).
 2  They are not good drivers in my opinion. And what sucks is teens
 3  it was a really good movie in my opinion. But it brought me to tears
 4  Things change! And in my opinion they still make great music.
 5  Weekends are catch up days in my opinion. You get two whole days in a week
 6  a good person to be a nurse in my opinion. I’m not mean to many people
 7  time to spend with someone in my opinion. We accept each other for who we
 8  cause she deserves it in my opinion. look i have a nasty side too.
 9  It’s not very exciting in my opinion. Jazz isn’t something I’d of picked
10  (for $559,000- which in my opinion is NEVER going to sell).
11  brought up whatsoever. In my opinion the fact that we have gone 12 years
12  I have ever heard (in my opinion). His life is nothing short of a miracle

A3.1.3  Collocations
As noted earlier, the way in which linguists regard and examine discrete lin-
guistic elements, such as words and phrases, has been strongly influenced by
the work of Firth (1957). These elements should not be regarded or treated
as independent from rules and other words in a text. Accordingly, the corpus
approach allows for the determination of statistically significant word combi-
nations, that is, word collocations, in a given text and how these combinations
are distributed across registers. Collocations can also be found using more ob-
jective measurements from statistical results obtained from reference corpora.
Prediction models of what might follow or precede a word, a noun, or a verb
can be measured based on their expected frequencies. Table A3.1 shows how
the collocates of women, art, fast, music, and food have changed from older
to more recent periods (Davies, 2017a).

Table A3.1  Google Books’ (from the BYU collection) changing collocates over time
for women, art, fast, music, and food (Davies, 2017a)

        Older period                                 More recent period

women   1930–1950s: ridiculous, plump, loveliest,    1960–1980s: battered, militant,
        restless, agreeable                          college-educated, liberated
art     1830–1910s: noble, classic, Grecian          1960–2000s: abstract, Asian, African,
                                                     commercial
fast    1850–1910s: mail, train, horses, steamers    1960–2000s: food, track, lane, buck
music   1850–1910s: delightful, exquisite,           1970–2000s: Western, black, electronic,
        sweeter, tender                              recorded
food    1850–1910s: spiritual, insufficient,         1970–2000s: fast, Chinese, Mexican,
        unwholesome, mental                          organic

Figure A3.2  AntConc’s (Anthony, 2014) first left and first right collocations for
the word know from a blog corpus.

AntConc’s first left and first right collocations for the word know are provided in
Figure A3.2, from the same 584,714-word corpus of personal blogs referenced
earlier, written by women bloggers (Samford, 2013). The most frequently
occurring left collocate of know is “I” (I know, occurring 608 times), while the
most frequent right collocate is “what” (know what, occurring 214 times). A
contraction (‘t), often from don’t know, appeared 422 times in the corpus. In in-
terpreting the AntConc output, disregard the search word that is listed as Rank 1
(know) and focus on the raw frequency reported in the output window. Users
can download the full result saved as a text (.txt) file. The procedure for running
collocations in AntConc is pretty straightforward:

1 Load the corpus: File—Open File(s)—then select your folder where your
text files are located
2 Select the tab option for “Collocates” at the top of the main results win-
dow (between “Clusters” and “Word List”)
3 Type your search term (know) in the search bar
4 Identify your first left or first right options and minimum collocate fre-
quency (below “Window Span”)
5 Click “Start” and results (Figure A3.2) will be produced.
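The counting behind steps 4 and 5 amounts to tallying the immediate neighbors of the search word. A toy Python sketch of first left/first right counting (an illustration of the logic, not AntConc's actual code; the mini-corpus is invented):

```python
import re
from collections import Counter

def collocates(tokens, node):
    """Tally the first left and first right neighbors of `node`."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i < len(tokens) - 1:
                right[tokens[i + 1]] += 1
    return left, right

text = "i know what you said but i know that you don't know what i mean"
tokens = re.findall(r"[a-z']+", text)
left, right = collocates(tokens, "know")
print(left.most_common(2), right.most_common(2))
```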

For the Teacher

Interpreting collocations
An online article by “vaughanbell” (2017) published by Mind Hacks (https://
mindhacks.com/), a neuroscience and psychology news and opinion site,
notes that there is a preference for mental health practitioners to avoid
the phrase commit suicide. These practitioners argue that commit refers to a
crime, and this increases the stigma against what should be regarded as an
act of desperation that deserves compassion as opposed to condemnation.
The author added the following supporting arguments:

•	The Samaritans’ media guidelines discourage using the phrase commit
	suicide: Avoid labeling a death as someone having committed suicide.
	The word commit in the context of suicide is factually incorrect because
	it is no longer illegal.
•	The Australian Psychological Society’s InPsych magazine recommended
	against using the phrase because the word commit signifies not only a
	crime but a religious sin.

On the surface level, vaughanbell argues, claims that the word commit nec-
essarily indicates a crime are clearly wrong. We can commit money or commit
errors, or commit ourselves to work harder, for instance, and no crime is
implied.
After examining traditional dictionary definitions of commit (e.g., from
Google’s default dictionary: [commit] carry out or perpetrate [a mistake,
crime, or immoral act]), vaughanbell used COCA’s collocation analysis to
gather the following results. I provide the first 20 collocates of commit in
contemporary American English with their relative frequency:

  (1) Suicide 1151 (11) Himself 73
  (2) Crimes   314 (12) Adultery 68
  (3) Themselves   251 (13) Yourself 66
  (4) Murder   227 (14) Acts 63
  (5) Such   120 (15) Myself 51
  (6) Ourselves   100 (16) Fraud 50
  (7) Itself   86 (17) Crime 39
  (8) Any   80 (18) Atrocities 38
  (9) Perjury   79 (19) Genocide 34
(10) Violent   74 (20) Troops 32

I have used an activity like this many times in my classes to allow stu-
dents to reflect and share thoughts on an issue and then comment on
what corpus data provide. It is certainly encouraging to witness popu-
lar culture’s acknowledgment of corpus approaches in analyzing profes-
sional discourse. In small groups, discussion guide questions such as the
following could be provided after students have read a short article. In
my experience, these questions always encourage active participation and
immediate use of the COCA database, with students using their phones or
laptops to access the site:

1 What are your initial comments/impressions (first thing that came to
mind or jumped out) after reading this article and exploring colloca-
tional data? Please share your thoughts with your group.
2 What other combinations [_____________ + suicide] are possible? The
author should also have considered searching for SUICIDE collocations
in COCA. You can do this on your own.
3 With the list of the common collocates of commit shown earlier, do you
think mental health practitioners who are discouraging the use of the
phrase commit suicide are justified?

A3.1.4  Key Word Analysis


A key word analysis identifies significant differences in the distribution of
words used by speakers or writers from two corpora. Scott (1997) defines a key
word as “a word which occurs with unusual frequency in a given text” (p. 236).
This “unusual frequency” is also referred to as the keyness value of this word
and is based on the likelihood of occurrence of the word in a target corpus as
determined by a process called cross-tabulation. In other words, keyness draws
from word frequency data, but instead of simple averages, statistical computa-
tion is used to determine if a word is more or less likely to occur in one corpus
vs. another.
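One widely used statistic behind keyness values is log-likelihood, computed from the cross-tabulation of a word's frequency in each corpus against the corpus sizes. A minimal sketch (following the standard log-likelihood formulation; the frequencies below are invented):

```python
from math import log

def log_likelihood(a, b, c, d):
    """Keyness of a word occurring `a` times in a target corpus of
    `c` words and `b` times in a reference corpus of `d` words."""
    e1 = c * (a + b) / (c + d)  # expected frequency, target corpus
    e2 = d * (a + b) / (c + d)  # expected frequency, reference corpus
    ll = 0.0
    if a:
        ll += a * log(a / e1)
    if b:
        ll += b * log(b / e2)
    return 2 * ll

# 100 hits vs. 30 hits in two 50,000-word corpora; values above
# roughly 3.84 are significant at p < .05.
print(round(log_likelihood(100, 30, 50_000, 50_000), 2))
```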
Key word comparisons provide an interesting look at the unique features of
one type of discourse, language variety, or register compared to another. Key
words can be extracted easily using AntConc and WordSmith Tools. Note that
this process involves loading a target corpus, also known as “node corpus,” and
a reference corpus into the software to proceed with the analysis. A video tu-
torial for running key word analysis using AntConc is available from YouTube.
Search: “AntConc – Keywords.”
In the following example (Table A3.2), I provide two key word lists from
a collection of essays written by L2 university students responding to two ar-
gumentative email prompts. The focus here is to investigate topic effect, that
is, whether a certain topic may have an influence on writing quality. For this key
word analysis, I wanted to categorize the distribution of words repeated from
the actual prompts. Corpus 1 comprises essays responding to a question about the “im-
portance of planning for the future”; Corpus 2 responds to a question about the
implication of too much “emphasis on personal appearance and fashion.”
Frequency and keyness values are provided for each key word.

For the Teacher

Students in a CL class can be asked to interpret the data from the table
(Table A3.2). It’s a good idea to provide additional key words, if possible
the first 100 per corpus. Clearly, students identify words specifically
mentioned in the prompts as they write their responses, and these were
the primary key words per corpus. The first person pronoun I was the top key
word in the “appearance” corpus. The misspelled forms “apperance” and
“fashons,” occurring 75 and 66 times, respectively, are both in the top 30
for Corpus 2. Teachers can ask students the following questions after they
analyze the results:

1 What patterns did you recognize? How do you interpret the character-
istics of L2 student writing from these two prompts? When compared
to L1 writers, do you think there will be differences?
2 What are ideal topics of comparison for a key word analysis?
3 What are limitations in conducting key word analysis?

Table A3.2  Key word comparison from two groups of essays written by L2 students

Corpus 1 (Future)                      Corpus 2 (Appearance)
  Frequency Keyness Key word             Frequency Keyness Key word

 1 865 1312.833 future  1 666 788.214 I
 2 781 1258.964 we  2 593 701.818 appearance
 3 756 1193.287 plan  3 517 598.986 fashion
 4 502 508.426 young  4 398 399.135 look
 5 219 353.026 carefully  5 348 331.253 personal
 6 534 313.228 good  6 289 321.35 emphasis
 7 178 286.934 planning  7 854 288.933 on
 8 384 269.069 life  8 215 254.454 the
 9 155 249.858 in  9 167 197.645 it
10 174 234.975 ensure 10 155 183.443 wear
11 233 206.442 still 11 154 182.26 in
12 127 204.723 it 12 160 178.816 dress
13 110 177.319 the 13 158 176.474 clothes
14 608 172.87 we 14 215 175.691 put
15   99 159.587 however 15 146 172.792 clothing
16 844 155.405 you 16 143 169.241 this
17 182 153.326 while 17 141 166.874 people
18   94 151.527 for 18 119 140.837 wearing
19   94 151.527 plans 19 114 134.92 having
20   86 138.631 if 20 221 126.197 society
21   74 119.287 so 21 108 118.057 media
22   84 118.776 goal 22   93 110.066 they
23   73 117.675 when 23 225 97.515 too
24   93 116.337 early 24   82 97.047 for
25 190 115.973 he 25 313 90.27 much
26 386 108.578 will 26   75 88.763 apperance
27 312 105.719 your 27   71 84.029 women
28 175 102.906 best 28   70 82.845 when
29 299 102.498 my 29   66 78.111 fashons
30   79 100.017 career 30   66 78.111 there

A3.1.5  Multiword Units (MWUs) and Prefabricated Chunks


As with collocations, some words frequently co-occur as linear, formulaic
strings, like a prefabricated ‘chunk’ of language. Research on MWUs covers a
range of extended strings of language, and there are various ways and oper-
ationalizations (including definitions of terms) to explore this construct of for-
mulaic language using corpus tools.
to MWUs are n-grams, lexical bundles, and p-frames.
N-grams. The most basic construct associated with MWUs is that of the
n-gram. The n stands for any number variable (e.g., 4-gram = on the other
hand). N-grams can also be extracted using most basic corpus packages; both
AntConc and WordSmith Tools have intuitive commands for n-gram extraction.
Table A3.3 shows a list of the 50 most common 4-grams from a corpus of
professional, workplace emails from the Enron Email Corpus (see also Section B1).

56  Corpus Linguistics for English Teachers

Table A3.3  The 50 most common 4-grams from the Enron Email Corpus

Rank Frequency 4-gram                       Rank Frequency 4-gram

 1 87 you have any questions 26 19 a copy of the
 2 82 me know if you 27 19 I look forward to
 3 77 Let me know if 28 19 will let you know
 4 73 I would like to 29 19 you get a chance
 5 70 Please let me know 30 17 I m going to
 6 67 if you have any 31 17 I will not be
 7 60 let me know if 32 17 please let me know
 8 44 know if you have 33 16 be out of the
 9 39 I don t know 34 16 I don t have
10 38 If you have any 35 16 I will be in
11 31 Let me know what 36 16 let me know what
12 28 I m not sure 37 16 ll let you know
13 27 give me a call 38 16 when you get a
14 26 have any questions or 39 15 don t know if
15 26 I will be out 40 15 Give me a call
16 25 out of the office 41 15 I will let you
17 24 I don t think 42 15 me know if I
18 23 Thanks for your help 43 15 Thank you for your
19 23 You have two cows 44 15 to be able to
20 22 and let me know 45 15 to let you know
21 22 I am going to 46 15 will be able to
22 22 me know what you 47 14 I just wanted to
23 22 will be out of 48 14 if you need anything
24 21 know if you need 49 14 me know when you
25 21 Talk to you soon 50 14 not be able to
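For teachers curious about what a tool like AntConc is doing under the hood, the computation is simple: slide a window of n tokens across a text and count the resulting strings. The following minimal Python sketch (the sample sentence is invented, not Enron data) illustrates the idea:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a single string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "please let me know if you have any questions".split()
counts = Counter(ngrams(tokens, 4))  # 4-grams, as in Table A3.3
print(counts.most_common(3))
```

On a real corpus, the same `Counter` would simply be updated text by text before ranking the results.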
Lexical bundles. One particular type of n-gram is the lexical bundle, an
n-gram with additional specifications as to how it is extracted or catego-
rized. Customarily, lexical bundles consist of at least three words (tri-grams)
that occur frequently—frequency determined by the researcher—across a
corpus of at least one million words. Another important criterion for labeling
MWUs as lexical bundles is that they must appear in at least five different texts
in the corpus; that is, they must be common across texts rather than limited to
one writer or speaker. This is necessary to avoid any idiosyncratic language
usages (Cortes, 2004).
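The two bundle criteria—a frequency cutoff and appearance in a minimum number of texts—can be operationalized directly. In this illustrative Python sketch, the toy corpus and both thresholds are scaled far below the real standards (a million-word corpus, five texts) just to show the logic:

```python
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def lexical_bundles(texts, n=3, min_freq=2, min_texts=2):
    """Keep n-grams meeting both a frequency cutoff and a dispersion
    (range) cutoff: appearance in at least `min_texts` different texts."""
    freq, text_range = Counter(), Counter()
    for text in texts:
        grams = ngrams(text.lower().split(), n)
        freq.update(grams)
        text_range.update(set(grams))  # count each text at most once per gram
    return {g: c for g, c in freq.items()
            if c >= min_freq and text_range[g] >= min_texts}

corpus = ["please let me know if you can come",
          "just let me know if anything changes",
          "i will be out of the office"]
print(lexical_bundles(corpus))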
P-frames. Researchers have also moved beyond looking only at contiguous
strings of words to examine frequent, patterned constructions. P-frames are
consistent phraseological structures that allow, however, for variability in one
position of the phrase frame. An example of a p-frame, found by Römer (2010),
is it would be * to, in which the asterisk represents an open slot. Grammatically,
Analyzing English Using Corpora  57

any number of adjectives might go into the blank slot in this example. Römer
found that the most frequently occurring words in a corpus of student essays in
the “blank” slot were interesting, useful, nice, and better, these accounting for 77%
of all the variants in the corpus.
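A p-frame search can likewise be sketched in a few lines of Python. This hedged illustration (using an invented two-clause text) masks each interior position of an n-gram with * and tallies the words that fill the slot, which is how the variants of a frame such as it would be * to can be collected:

```python
from collections import Counter, defaultdict

def pframes(tokens, n=5):
    """Map each frame (an n-gram with one interior slot masked by *)
    to a Counter of the words observed in that slot."""
    frames = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        for slot in range(1, n - 1):  # mask interior positions only
            frame = " ".join(gram[:slot] + ["*"] + gram[slot + 1:])
            frames[frame][gram[slot]] += 1
    return frames

tokens = "it would be useful to read it would be nice to stop".split()
frames = pframes(tokens)
print(frames["it would be * to"])  # slot fillers and their counts
```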

A3.1.6  Vocabulary Usage and Lexico-Syntactic Measures: Cohesion,
Complexity, Sophistication, and Others

It has been well-documented that vocabulary development in spoken and
written discourse is critical in both the literacy development and academic
success of L2 learners. More specifically, students’ academic success depends
upon their developing the specialized and sophisticated vocabulary of aca-
demic discourse that is distinct from conversational language (Francis et al.,
2006). Corpus tools may be utilized to extract and then interpret the nature of
vocabulary usage by learners across levels of proficiency. For example, a num-
ber of studies have identified particular linguistic features (e.g., subordination,
prepositions, linking adverbials, etc.) that are predictive of scores given by
instructors/raters as well as features that distinguish differences among various
academic disciplines (Römer & Wulff, 2010) and demographic factors: for
example, language proficiency levels and graduate vs. undergraduate status
(Grant & Ginther, 2000; Hinkel, 2002).
Identifying features indicative of quality speech and writing—especially
those that are discipline-specific—is of obvious pedagogical importance to
teachers. An understanding and description of linguistic complexity is import-
ant insofar as it may relate to the amount of discourse produced by learners as
well as the quality of that discourse, including the types and variety of gram-
matical structures; the organization and cohesion of ideas; and, at the higher
levels of language proficiency, the use of text structures in specific genres.
These features may be defined and operationalized in the development of
teaching materials for the classroom. Measures such as t-units, clause construc-
tions, type/token ratio, and markers of information density and elaboration
have all informed the creation of lessons and test prompts in the L2 classroom,
especially in the university setting (Friginal et al., 2017).
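Some of these measures are trivial to compute once a text is tokenized. As a simple, hedged illustration (the one-line essay is invented), a type/token ratio can be calculated as below; note that serious studies typically control for text length, since TTR falls as texts get longer:

```python
def type_token_ratio(text):
    """Distinct word forms (types) divided by running words (tokens)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

essay = "the cat sat on the mat and the cat slept"
print(type_token_ratio(essay))  # 7 types over 10 tokens
```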
Computational tools, such as Coh-Metrix (see, e.g., Crossley & McNamara,
2009) and those developed by Scott Crossley (Georgia State University) and
Kris Kyle (University of Hawaii) and their colleagues entitled “Suite of Lin-
guistic Analysis Tools” or SALAT (http://www.kristopherkyle.com/), can be
used to rate the readability of texts and also to extract frequency counts for a range of
linguistic features (see additional discussion about Coh-Metrix and SALAT in
Section B3). These tools are more applicable for researchers and teachers than
for language learners themselves. Teachers may find them useful in materials
development for such topics as distinguishing between authors of texts; dis-
tinguishing between writers’ country of origin, for example L2 writers from
the International Corpus of Learner English (ICLE); identifying changes in
L2 language over time; distinguishing between L1 and L2 writers; classifying
spoken and written registers; distinguishing parts of a paper (abstract, introduc-
tion, methods); and many others.

A3.1.7 Patterns of Linguistic Co-Occurrence


The concept of linguistic co-occurrence suggests that a particular language
or discourse domain, such as face-to-face classroom interaction or a study
group, may have higher frequencies of questions and responses, inserts, dys-
fluent markers (e.g., filled pauses—uh, um), and back channels (e.g., uh-huh)
than the discourse of speakers in other settings. At the same time,
any given feature may not be as common in different settings such as extended
and prepared lectures, news reports, or formal speech. Linguistic features, such
as pronouns, past tense verbs, and nouns, often occur together whenever speak-
ers engage in everyday conversations or talk about their previous experiences
and recent events. These same features might also appear frequently in written,
first-person narratives or soliloquies about past events. A simple KWIC search
will not suffice to capture and document these co-occurring features from
corpora. A more advanced statistical framework is necessary to identify the
composition of features that are frequently found together within a corpus.
Corpus-based multidimensional analysis (MDA) was introduced in Biber’s
(1988) Variation across Speech and Writing as a research methodology for explor-
ing linguistic variation in spoken and written English texts. Biber’s primary
research goal was to conduct a unified linguistic analysis of spoken and written
registers from 23 sub-registers of the LOB (for written texts) and London-Lund
Corpus (for spoken texts). He was able to substantially redefine a range of regis-
ter characteristics of spoken/written discourse by using a multivariate statistical
procedure to identify intrinsic linguistic co-occurrence patterns across POS-
tagged texts. Subsequently, he was able to establish a model of corpus-based
research that could be applied to even more specialized contexts. MDA relies
on factor analysis (FA), which identifies the sequential, partial, and observed
correlations of a wide range of variables, resulting in groups of co-occurring
factors (Friginal & Hardy, 2014b).
Biber’s Factor 1, interpreted as Involved vs. Informational Production, is char-
acterized by the combination of private verbs (e.g., think, feel), demonstrative
pronouns, first- and second-person pronouns, and adverbial qualifiers as speak-
ers or writers talk about their personal ideas, share opinions, and involve an
audience with the use of you or your. This discourse is also informal and hedged
(that-deletions, contractions, almost, maybe). The contrasting features include
the giving of information (“Informational Production”) as a priority in the
discourse. There are many nouns and nominalizations (e.g., education, develop-
ment, communication), prepositions, and attributive adjectives (e.g., smart, effective,
pretty) appearing together with very few personal pronouns. This suggests that
the focus is upon informational data and descriptions of topics rather than upon
the speaker or writer. Additional characteristics of this production are more
unique and longer words (higher type/token ratio and average word length)
and greater formality in structure and focus.

For the Teacher

Using Biber’s MDA approach, Hardy and Römer (2013) extracted dimen-
sions of A-graded university writing from MICUSP. Their Dimension 1 distin-
guished between Involved, Academic Narrative, very common in Philosophy
and Education papers, and Descriptive, Informational Discourse, typical of
A-graded papers in Biology and Physics. The following text samples show a
biology report compared to a philosophy critique. What characteristics are
typical of one text sample in contrast to the other? What useful teaching
applications occur to you as you identify such patterns? See a brief addi-
tional discussion on the application of MDA results to pedagogy in Section
B1 from the MICUSP description.

Text Samples A3.3  Comparison of involved, academic narrative and
descriptive, informational discourse in MICUSP

BIO.G0.25.1, report, final year UG, NS

Normally malaria is a curable disease, but only if treated properly. After an
infectious bite there is an incubation period in the host that varies depending
on the species of Plasmodium, before there is an onset of symptoms. The
symptoms of malaria that a human host will go through can be categorized as
either uncomplicated or severe. With uncomplicated malaria, the symptoms
last between 6–10 hours and include a cold stage, a hot stage and then finally
a sweating stage. Symptoms occur in a mixture of fever, chills, sweats, head-
aches, nausea, vomiting, body aches, and general malaise.

PHI.G0.06.1, critique/evaluation, final year undergrad (UG), native
speaker (NS)

Socrates then concludes that group (D) does not exist, since those people,
by desiring what they believe to be harmful (bad) things are desiring to be
miserable and unhappy. No one wants to be miserable and unhappy, so no
one desires what he believes to be bad. (A)–(C) actually desire what they
believe to be good, and group (D) does not exist, so no one desires what
he believes to be bad. I feel compelled to say here that although Socrates
actually claims that “no one wants what is bad” (78), what he means is that
no one wants what he believes is bad.
A3.2  CL and Visualization of Linguistic Data


Various methods of visualizing linguistic data have resulted from the relatively
easy processing of corpus-based frequencies and the transformation of these data into
figures or images. From simple bar graphs or histograms to more complex, on-
line interactive semantic maps, CL approaches have produced excellent visual
representations of language and innovative approaches to their use in the class-
room. Technically, concordancers are also visualizers, able to highlight KWICs
as they appear in the corpus. Visualizers are important in language learning
because they add another layer of information that is not fully captured by texts
(i.e., letters and numbers) alone. They break the monotony of the written page,
provide teachers a creative outlet for sharing data, and also effectively address
the needs of visual learners.
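The core of a concordancer, the KWIC display itself, is easy to demonstrate. The following minimal sketch (with a made-up sentence) aligns every hit of a node word with a few words of left and right context:

```python
def kwic(tokens, node, width=3):
    """Return KWIC lines: each hit of `node` padded with `width`
    words of context on either side, aligned on the node word."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>25}  {tok}  {right}")
    return lines

text = "the gentleman smiled and the old gentleman bowed politely"
lines = kwic(text.split(), "gentleman")
print("\n".join(lines))
```

Right-aligning the left context (`{left:>25}`) is what produces the familiar vertical column of node words in a concordance printout.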
CL-based frequencies have been used in infographics (i.e., information +
graphics), which are now very common in online articles and advertisements.
These are visual representations of data or knowledge intended to present in-
formation quickly and clearly. Smiciklas (2012) notes that infographics can
develop reader cognition as graphics can enhance our visual system’s ability
to see patterns and trends more efficiently. There are many ‘drag and drop’ in-
fographic creators online, such as Canva (www.canva.com) and Piktochart
(https://piktochart.com/).

A3.2.1  Word Clouds and Frequency Visualizers


The most common approach for an exact-match visualization is a word cloud.
A word cloud is a graphical representation of word frequency from a text or a
corpus. The following sample word cloud (Figure A3.3), created using Word-
Clouds.com (https://www.wordclouds.com/), represents the first 10 pages of
this book, illustrating visually that the words language, English, teaching, corpora,
writing, tools, students, corpus-based, book, and grammar are the most frequently
repeated words. These 10 words can capture and display the overall theme of
the book based on nothing other than an ‘eye-balling’ of what’s frequently
repeated in the text.
The creation of word clouds, previously a complicated process involving
computer programming, has now become an easy cut, paste, and create
process online. Word cloud generators convert frequency data into a graphical
outline of text content. ‘Tags’ are identified from single words, and the im-
portance of each tag, defined as frequency of appearance in the text, is shown
with increased font size and/or change in color (Halvey & Keane, 2007). This
visualized format is convenient for quickly locating the most prominent word
in the input text or corpus. In Figure A3.3, I did not convert the entire text of
the first 10 pages of this book into all lowercase font, including the first letters
of all words, so the words Language and Teaching appear in the word cloud
with uppercase first letters.

Figure A3.3  A word cloud of the first 10 pages of this book.

This illustrates the fact that CL extracts anything and everything
that is available in the dataset, from the most frequent feature to those
that only appear once (see Section C4.2 for a lesson that incorporates visual-
izing political speeches with word clouds). In addition to WordClouds.com,
there are several other word cloud generators such as Wordle (www.wordle.net)
or Tagxedo (http://www.tagxedo.com/) that provide free word or tag cloud
templates and other applications. Tagxedo, for example, can also easily generate
a keyword list and offers various color and design options.
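Under the hood, a word cloud generator is doing little more than counting words and scaling font sizes. This hedged sketch (with an invented sample string and arbitrary point sizes) shows the basic mapping, including the case-folding step omitted in the Figure A3.3 example:

```python
from collections import Counter
import re

def tag_sizes(text, top=10, min_pt=12, max_pt=48):
    """Count case-folded words and map the `top` most frequent ones
    to font sizes scaled between `min_pt` and `max_pt`."""
    words = re.findall(r"[a-z]+", text.lower())  # fold case first
    ranked = Counter(words).most_common(top)
    hi, lo = ranked[0][1], ranked[-1][1]
    spread = max(hi - lo, 1)
    return {w: min_pt + (f - lo) * (max_pt - min_pt) // spread
            for w, f in ranked}

sample = "Language teaching and Language learning use language corpora"
print(tag_sizes(sample, top=3))
```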
Because CL relies on frequency data by group or by text file, it is easy to
transform these distributions into figures, especially histograms and charts.
From MS Excel functions to more sophisticated statistical packages like
SPSS or R, figures to enhance numerical presentation are often included in
CL research articles and textbooks. These figures are also easily incorporated
into language classroom activities. Students in small groups or pairs can ex-
amine figures/graphs, identify patterns, and make exploratory conclusions.
Figures A3.4 and A3.5 show word and tag frequency data that learners can
discuss and interpret.
Figure A3.4  Visual representation of the Top 20 words in the English language
from Google Books (a mega-corpus of more than 500 billion words
from scanned books in English and also other languages).


Figure A3.5  Use of personal pronouns I, we, my, mine, and our by men and women
bloggers in two age groups (30 and younger vs. 31 and older). Adapted
from Friginal (2009).

The comparison illustrated in Figure A3.5 shows a dramatic difference
between older men and younger women in their use of personal pronouns
in personal blogs, most of which were obtained from sites such as LiveJournal
from 2006 to 2009. In the following two short excerpts, the presence of many
first-person pronouns in a blog written by a 17-year-old female high school
student contrasts considerably with the tone and focus of blog writing by a
67-year-old retired male. The two excerpts seem to address personal topics and
were both directed to readers who were quite familiar with the bloggers.

Text Samples A3.4  Comparison of blog texts

Oh, thank you God. Band camp really sucks. I am so tired of all of it! It doesn’t
matter, tomorrow is the last day. I don’t really feel like updating much. Go figure.
We have the 1st, 2nd, and up to set 15 of the 3rd song completed, but just as last
year, our drill writer is stupid and is falling behind. We have no more drill to work
on. Hopefully we will have more tomorrow. (Female blogger, high school student)
Table talk for the Sunday brunch crowd was the Senior Prom at the Golden Age
Center last night. Retired biology teacher Denver Zygote and Granny Garbanzo
double-dated with Judge and Mrs. Halfthrottle. The big excitement came about
half-way through the festivities when Granny attempted to Watusi with her cane
in her hand. (Male blogger, 60s, retired)

A3.2.1.1  Focus on Diachronic Data


Visualizing linguistic changes or trends across time is one of the primary foci
of the Google Ngram Viewer (https://books.google.com/ngrams) and the
Corpus of Historical American English (COHA; Davies, 2010–), the latter
created and published by Mark Davies. These two mega-corpora feature billions of words
of American English representing various time periods. Both online visual-
izers present comparison data that default from the 1800s to the present, and
they can extract words or any multiword combinations as they appear in the
databases. For COHA, normalized frequencies of search words/phrases can be
easily obtained, and the contexts within which these words are used can also be
analyzed by examining KWIC lists that appear below the chart feature of the
site. Figures A3.6 and A3.7 illustrate the declining usage trend of the word gen-
tleman from the 1800s to the present from COHA and Google Ngram Viewer.
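Normalized frequencies are what make such diachronic comparisons fair, since decades (and corpora) differ in size. The arithmetic is simply a rate per million words; the counts below are invented for illustration, not actual COHA figures:

```python
def per_million(raw_count, corpus_size):
    """Convert a raw frequency into a rate per million words."""
    return raw_count * 1_000_000 / corpus_size

# Hypothetical hits for "gentleman" in two decades of unequal size
decades = {"1850s": (9_500, 15_000_000),
           "2000s": (1_200, 30_000_000)}
for decade, (hits, size) in decades.items():
    print(decade, round(per_million(hits, size), 1))
```

Comparing the raw counts alone would understate the decline, since the hypothetical later decade is twice the size of the earlier one.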

A3.2.2  Visualizers in the Classroom with Ying Zhu


I collaborated with Ying Zhu of the Creative Media Industries Institute and
Xi He of the Department of Computer Science at Georgia State University in
developing Text X-Ray (Zhu & Friginal, 2015), a POS-visualizer and writing
platform that can be used by teachers and their students in various contexts
of university-level language teaching, especially in academic writing across
disciplines.

Figure A3.6  Distributional comparison of the word gentleman from the 1800s to
the present in COHA, with KWIC results. Figure and illustrations
adapted from Davies, 2010–.

Figure A3.7  Frequency of gentleman in English books from the 1800s to the
present from the Google Ngram Viewer.

Zhu leads the Hypermedia and Visualization Lab and Brains &
Behavior research program at GSU, and his research interests include computer
graphics, data visualization, and bioinformatics. As an L2 learner himself, and
one who identifies as a visual learner, Zhu has advocated for the use of
computer-based visual data in language instruction. He believes that the struc-
ture of language, typically explored in grammar activities (e.g., tree diagrams),
can best be comprehended by groups of learners when they are aided by
color-coded and interactive visuals. A model of networks showing nodes of
sentences and the connections words have with each other allowed Zhu to fully
appreciate the functions of various parts of speech, more so than memorizing
what they mean, as required by the traditional grammar-book methodology.
A sentence tree program that he has developed (Figure A3.8) automatically
creates sentence diagrams based on POS-tagged data.

Figure A3.8  Sample sentence tree (for the sentence British actresses Joanne
Froggatt and Ruth Wilson also collected prizes) from Zhu. Adapted
from Zhu and Friginal (2015).
Although lexico-grammar and writing instruction are the primary focal ar-
eas of Text X-Ray’s applications, the software can also be used productively for
small group peer-reviewing activities in writing or peer-editing dyads that are
mediated by computer technology. The design of Text X-Ray takes into account
teachers’ needs and objectives, focusing on content-based activities that can be
applied to help students build academic vocabulary, learn grammatical struc-
tures, and analyze model texts, especially their own writing. Student-directed
comparisons of vocabulary/POS features of texts can be facilitated through Text
X-Ray by analyzing academic word lists and grammar patterns from teacher-
prepared focal writing excerpts. Activities utilizing Text X-Ray may help stu-
dents develop greater awareness of grammar and usage across contexts. This
approach can create a great deal of positive classroom energy, encouraging stu-
dents to become autonomous learners and providing effective alternatives for
students with different learning styles (i.e., student-driven learning). In Section
C4.4, Berger shares sample lessons and student/instructor feedback on using
Text X-Ray in an essay writing and editing activity.
Text X-Ray, we believe, contributes data and a range of linguistic information
for learners on the structure of written texts in various academic registers. The pro-
gram combines features from tools such as Compleat Lexical Tutor (Cobb, 2016) and
WordandPhrase.Info (Davies, 2017b), with the addition of an interface that highlights
how POS tags are used in context.

Figure A3.9  Text X-Ray’s text editor and standard application tools (Zhu &
Friginal, 2015).

This beta version of Text X-Ray1 works as a basic
text editor, with built-in POS visualizer for various POS tags (e.g., nouns, verbs,
prepositions) obtained from the built-in Stanford Tagger; readability and lexical
diversity measures; wordlist comparisons; and a word cloud application. Another
important feature of this program is its ability to compare normalized frequen-
cies of linguistic features, for example, word/phrasal classes, with those aggregated
from MICUSP. Note again that MICUSP is composed of advanced, A-graded
student papers categorized primarily across disciplines and text types collected at
the University of Michigan (O’Donnell & Römer, 2012). Student-produced texts
can be immediately compared with MICUSP data, in real time, across disciplines,
paper types, and student levels, including gender and native speaker vs. non-native
speaker groups. Figure A3.9 shows the primary text editor view of Text X-Ray and
its current set of tools and command buttons:

• File/Edit/Help—Standard application tools used to load (copy/paste) a text
or obtain technical help information
• Clear Text—Button allowing users to clear/delete texts loaded on the text
editor
• Parse—Command to run analysis
• Visualizer for Text Color Lightness (darker or lighter)—Color lightness
control
• Word Cloud
• Search Bar (Find/Clear)
• Applications: POS, customized word list, compare with MICUSP, read-
ability, reader expectations
Figure A3.10  POS-visualizing through Text X-Ray (color-coded POS not shown in
the grayscale image, e.g., green = nouns, red = verbs, bold = prepositions).

Figure A3.10 is a sample POS-visualized essay with immediate feedback for
students on options such as readability (Flesch–Kincaid Grade Level, Flesch
Reading Ease Score, and Gunning Fog Index), complex words and sentences
(which could be highlighted in bold), and reader expectation measures.
Structurally, the program’s Visualization Engine and Visualization Inter-
face are directly activated from a standalone browser application. Text X-Ray is
divided into multiple panels: text panel, visualization panel, linguistic analysis
panel, and control panels. The text panel displays the written input (the essay),
while the visualization panel, always parallel to the text panel, enables users to
analyze the texts at five levels of detail: corpus (source text), articles, words,
sentences, and paragraphs. The linguistic analysis panel holds commands for
readability indices, lexical density, word and sentence length, and other mea-
sures. The control panel allows users to adjust the visualization settings (colors
and highlights), manage texts, and manage multiple users. Figure A3.11 illustrates the
structural and programming workflow of Text X-Ray.
In computer-based visualization, texts should ideally be displayed alongside
their visual representations. In many text visualization programs or techniques,
however, visualizations often replace the texts; the original texts are often not
displayed in the interface. In the context of language teaching and learning, it
is necessary for the texts to be visible at all times. Visualizations should sup-
port, not replace, the original text; therefore, text visualizations will have to
be linked and synchronized with the original text display (He, 2016; Zhu &
Friginal, 2015). In the classroom, data visualization will have to be introduced
carefully and meaningfully. For example, because users often switch between
the original texts and their visualized versions, the visualized form may have
to be closer to the conventional textual display for easier mental transition and
connection. Some text visualization displays are difficult for learners because
they require additional mental adjustment from one form of display to another.
For programming purposes, therefore, the complexity and abstract nature of
innovative visualization should be controlled (He, 2016).

Figure A3.11  Text X-Ray’s structural workflow (corpus analysis engine, NLP
engine, web server, visualization engine, corpus database, corpus
management system, corpus API, and visualization interface).
Adapted from He (2016) and Zhu and Friginal (2015).

To investigate how Text X-Ray might be used in the classroom, we asked a
group of users, primarily graduate students and instructors at the Department
of Applied Linguistics and ESL at GSU, to pilot the software in their classrooms
from 2012 to 2016. Our plan was to distribute Text X-Ray to a wider range of
users, improve its online interface, and seek research funding to support the
program for future easy access globally. The beta version of this software is
still being examined for usability data from a limited release in order for us to
develop and finalize the next set of improved tools that will significantly en-
hance the program’s capabilities and usefulness. Teacher feedback was positive,
overall, and as shown in the following text, there are promising applications of
a program like Text X-Ray that teachers immediately noticed:

CF, Instructor, Hall County Alliance for Literacy, Gainesville, GA

Text X-Ray is a program that I could sit here and play with all day because
I just think it’s cool that a program can pick out parts of speech in a se-
lected text. I haven’t noticed POS-tagging mistakes made by Text X-Ray
yet, but I’m determined to stump it. My immediate thought was to in-
troduce this program to the other instructors where I teach. Several were
interested in entering their students’ essays and comparing them to the MIC-
USP papers. The intermediate and advanced-level teachers were inter-
ested in seeing if there was any notable difference between their students’
writing and the papers in MICUSP. I haven’t checked with any of them to
see if they have had a chance to do this yet, but I think that there will most
definitely be a difference between the papers because MICUSP handles
academic papers at the collegiate level, while the papers at my institution
are mostly written by students who hope to get into the GED program
or apply for citizenship. But, it would still be interesting to see what the
MICUSP papers have that the ones written at my school don’t have.
My first thought on using Text X-Ray in the classroom was as a sort of
self-check device that the students could use in our technology room. My
class is for beginners and we do go over the basic parts of speech, so by
having students enter a pre-selected text into the program, having them
pick out the nouns in the passage, and then checking their own accuracy
with Text X-Ray is an excellent way to get the students more engaged
with their language learning. Another feature of Text X-Ray that I could
see myself using in the future for vocabulary purposes is the word list
tab. Approximating word meaning from context is a very difficult task
in any writing classroom, but if I were able to create a list of words that
I think will be difficult for the students in my classroom from a specific
passage of written text, and then use Text X-Ray to highlight the words
in the passage, it would bolster class discussion of the context in which
the words are used.

JX, Visiting Scholar—China

Using Text X-Ray can highlight how native speakers of English use cer-
tain language forms, vocabulary items, and expressions. It offers students
the use of authentic and real-life examples when learning writing which
are better than examples that are made up by the teacher. It allow students
to learn useful phrases and typical collocations they might use themselves
as well as language features in context which means that students learn
language in context and not in isolation. And it can help students get a
broader view of language by comparison. By doing so, students become
aware of lexical chunks that are useful when it comes to completing writ-
ing tasks. It helps teachers to demonstrate how vocabulary, grammar,
idiomatic expressions and pragmatic constraints with real-life language.

JH, ESL Teacher—Korea

What can it do to help students and teachers in the writing classroom?


Compared with concordancers, it is VERY user-friendly. I thought that
I could use concordancers only when I prepare the class, but I thought I
won’t recommend students to use this kind of program before I saw Text
X-Ray. The program is colorful, and it is very easy to use. With tagging
applications, students can easily find the nouns, verbs, and adjectives in
their essays. I can use it when I teach verb valency patterns. Since my
interest is in teaching grammar using corpus tools, I have been thinking
about applicable methodology that I can follow for classroom research on
grammar teaching with a tool like this. Because of this visual recognition
or representation of “grammar” on the screen, I think students’ learning
will last more than just from simple rote memory.

MM, Instructor of Japanese, GSU Department of Modern and Classical Languages

Articles are hard to learn for Japanese learners of English since Japanese
does not have articles. Texts with highlighted articles (a, an, the) can be
used in the writing classroom as a focus on form activity. Compare with
MICUSP shows the comparison of frequency of major POS between the
corpus and the current essay. By focusing on article use, for example,
the program gives a clue to Japanese students of English if they supplied
the necessary articles or not. If their frequency of articles is much lower
compared to a model corpus, they can focus on their use of articles when
they proofread their essay. Word Cloud – it might help students with
writing a summary of a text. I remember when I was an undergraduate
student, writing a summary in English was so difficult for me. Visual
presentation through word clouds can be useful.

CM, IEP Instructor, GSU

I can imagine Text X-Ray being very helpful to advanced EAP students
who are practicing genre analysis, especially as more and more ELT writ-
ing instructors are attempting to empower students to become their own
investigators of genre. The Text X-Ray tool would allow such students to
determine for themselves the differences in, say, nominalization, between
academic texts and other types of writing.
  In my experience, because of the tendency to associate writing skills
with reading skills, a good deal of literacy practice in EAP programs is
focused primarily on writing. Even though students may be reading a good
deal for homework, there is little explicit instruction on how to approach
a text or improve one’s reading fluency and/or accuracy. Having taught an
upper-level reading course in an ESL program in the past, I certainly would
have devoted class time to having students explore their assigned reading
through Text X-Ray. For example, I may have begun by having students
highlight the nouns and do a quick scan for nouns they already know (good
for developing their scanning/skimming skills, as well). Which nouns do
they recognize? Which are unfamiliar? Which come from verbs?

A3.2.3  Other Visualizers and Tools


As noted previously, popular online tools such as LexTutor and WordandPhrase.
Info have components that visualize vocabulary features from texts, providing
support for students in discovering patterns of actual language use. In Section
C4.2, Roberts uses WordandPhrase.Info to show lexical items as categories of
frequency with discipline-specific word lists. She presents an IEP lesson at an
aeronautical university with authentic content from a single subject area to
teach academic English. Nelson in Section C4.4 discusses LexTutor and its suite
of corpus- and frequency-based tools for English and French language learn-
ers and teachers, especially focusing on lexical development, EAP, and CALL.
LexTutor’s visualized features can show relationships between words (through
word lists, word families, concordances, collocations, etc.) or frequency of
words and word families. The VocabProfile tool is designed to analyze vocabu-
lary use, and other applications include flashcards for learning vocabulary and
a hypertext-builder for readings linked with concordances and a WordReference
dictionary. The following subsections identify additional visualizers and useful
programs such as Sketch Engine; tools used to visualize online language, espe-
cially tweets; and visualizing the language of hip-hop.
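The frequency-band profiling that VocabProfile performs can be sketched in a few lines. The sketch below is a simplified illustration, not LexTutor's actual implementation; the K1 and K2 sets here are tiny stand-ins for the 1,000-word-family lists the real tool draws on:

```python
# Minimal sketch of VocabProfile-style banding: each token in a text is
# assigned to a frequency band (K1 = first 1,000 word families, K2 = second
# 1,000, or "off-list"). The word lists below are tiny illustrative stand-ins.
K1 = {"the", "a", "of", "students", "know", "good", "they", "which", "do", "be"}
K2 = {"scan", "familiar", "recognize"}

def vocab_profile(text):
    tokens = [t.strip(".,?!;:\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    profile = {"K1": 0, "K2": 0, "off-list": 0}
    for t in tokens:
        if t in K1:
            profile["K1"] += 1
        elif t in K2:
            profile["K2"] += 1
        else:
            profile["off-list"] += 1
    # Report each band as a percentage of all tokens, as VocabProfile does.
    total = len(tokens)
    return {band: round(100 * n / total, 1) for band, n in profile.items()}

print(vocab_profile("The students scan the text."))
# {'K1': 60.0, 'K2': 20.0, 'off-list': 20.0}
```

Teachers comparing a student essay against a model corpus could run both texts through such a profile and discuss why the off-list percentages differ.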

A3.2.3.1  Sketch Engine


One of the currently buzzed-about online corpus tools is Sketch Engine (https://
www.sketchengine.co.uk/), which is corpus management and analysis software
developed by the late Adam Kilgarriff and Pavel Rychlý. Sketch Engine has
evolved from its initial release in 2004 as a word sketch generator to its
present, very impressive applications consisting of three main com-
ponents: a large database management system, a web interface search program,
and a very useful web interface for corpus building and management. These are
products that users can access for academic and non-­academic license fees. A free
30-day trial is available. It is certainly more than just a visualizer, with its present
multiplatform structure, and I do recommend that teachers explore its various
features, especially how it can be used for corpus collection and management.
Sketch Engine’s database management system (called Manatee) was developed for
effective indexing of large corpora. Corpus Query Language (CQL) allows users to
easily extract word and phrase-level data from corpora. The current list of available
corpora in the program is impressive, and it continues to grow in number, types,
and languages. An earlier iteration of the program’s word sketch feature (Figure
A3.12) provides word distributions obtained from a source corpus like the BNC. In
Figure A3.12, work is visualized as a noun (verb is also available as an
alternative POS), appearing 641.5 times per million words. The top lemmas are
listed, and a word sketch, similar to a word cloud, is provided in various
colors and font sizes.

Figure A3.12  Sample Sketch Engine's word sketch feature for work from the BNC.
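The per-million figure quoted for work is a simple normalization: the raw frequency divided by the corpus size, scaled to one million words. A quick sketch, using invented counts rather than the BNC's actual totals:

```python
def per_million(raw_count, corpus_size):
    """Normalized frequency: occurrences per million words, rounded for display."""
    return round(raw_count / corpus_size * 1_000_000, 1)

# Invented counts for illustration (not the BNC's actual totals): a word
# occurring 64,150 times in a 100-million-word corpus yields 641.5 per million.
print(per_million(64_150, 100_000_000))  # 641.5
```

Normalized rates like this are what make frequencies comparable across corpora of different sizes.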
A new addition to Sketch Engine, updated from Figure A3.12, is the Sketch
Engine for Language Learning or SkELL tool, intended for teachers and students
of English. SkELL allows users to check how a word or a phrase is actually used
in a corpus by native English speakers (e.g., from the BNC). All text extracts,
collocations, or synonyms are identified and provided automatically by the pro-
gram. The SkELL tool is free and there is no registration required.

A3.2.3.2  Visualizing Online Language


In the field of sociolinguistics, dialectology research has benefited from ­corpus-
based quantitative data that support the analysis of dialect variation. For example,
Grieve’s (2016) study of regional variation in American English from a corpus
of newspaper letters to the editor, collected from over 70 cities from across the
US, shows how corpus-based methodologies and visualization techniques can
be applied directly to researching and teaching regional variation in written lan-
guage. Grieve carefully designed his corpus to account for geographic regions
in the US and their potential influence on variation across a range of linguistic
features. One of his primary outputs is a new dialect map of the US showing 12
different dialect regions identified by clusters of linguistic features distinguishing
one region from another.
Visualizing online language, especially from social media discourse such as
Facebook and Twitter status updates and tweets, has been featured increasingly
in several publications and academic articles. Popular culture references, from
the broad topic of the language of social media to more specific ones such
as President Trump’s Twitter analytics, have been explored quite extensively.
Trump, with close to 42 million followers in early 2018, has tweeted an aver-
age of 5.4 times per day since he became the US president. His top 10-word
list from 2016 includes Hillary, #trump2016, crooked, Clinton, #makeamericagreatagain,
people, America, Cruz, bad, and Trump. Twitter data are very useful, not
only for linguistic analysis but also especially for business and big data analytics.
Product sentiment analysis, movie box-office projections, and trending issues
or topics are all relatively easy to extract in real time from Twitter using its ap-
plication programming interface (API). Unlike Facebook, which can be set by
users to be exclusively private, Twitter defaults as a public platform.
Eisenstein et al. (2010) used computational models to identify regional mark-
ers from user postings on Twitter. For corpus-based dialectology research, the
important link here is how internet and mobile technology can code for vari-
ables such as location in tweets. As it is, Twitter can access users’ geographical
coordinates from, for example, mobile devices that are enabled by Global Po-
sitioning Systems or GPS. This feature produces ‘geotagged’ text data that re-
searchers can obtain from online logs. When tweets are geotagged, analysts
are able to identify the users' locations, especially if they
tweeted from their mobile phones. Posts from desktop computers or permanent
computer terminals may be identified from their internet access addresses or
universal resource locators (URL). There are more and more studies that mine
geotagged data online, focusing primarily on trends and internet user traffic.
These types of information are useful to marketing analysts and survey compa-
nies that collect quantitative tracking data of user behavior from the internet.
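As an illustration of what 'geotagged' text data look like, the sketch below filters a batch of tweet objects for coordinates. In Twitter's classic v1.1 tweet format, a geotagged tweet carries a GeoJSON coordinates field with longitude listed before latitude; the sample records themselves are invented for illustration:

```python
def geotagged_points(tweets):
    """Return (longitude, latitude) pairs for tweets that carry coordinates.

    Tweets without geotags have a null/None 'coordinates' field and are skipped.
    """
    points = []
    for tweet in tweets:
        geo = tweet.get("coordinates")
        if geo and geo.get("type") == "Point":
            lon, lat = geo["coordinates"]  # GeoJSON order: [longitude, latitude]
            points.append((lon, lat))
    return points

# Invented sample records, shaped like Twitter's v1.1 tweet JSON:
sample_tweets = [
    {"text": "selfie time!",
     "coordinates": {"type": "Point", "coordinates": [-84.39, 33.75]}},
    {"text": "posted from a desktop", "coordinates": None},
]
print(geotagged_points(sample_tweets))  # [(-84.39, 33.75)]
```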
One of Grieve and his collaborators’ ongoing projects is to document lexical
spread on Twitter. They are in the process of compiling a multi-billion-word
regional monitor corpus, using the Twitter API, consisting of nearly every
geocoded Tweet from the US and the UK since 2013, totaling approximately
25 million words per day. Given this large number of geocoded and time-
stamped Tweets, it is possible for Grieve and his team (2014) to identify newly
emerging words and map their geographical spread over time. An earlier study
that they conducted during the first three quarters of 2013 explored “rising” or
increasingly prevalent words identified from a particular period: for example,
from day 1 of 2013 to day 250 (from January to September). They first ex-
tracted 60,000 words that occurred at least 1,000 times in the corpus and iden-
tified rising words by correlating word relative frequency per day to day of the
year using a Spearman’s rank correlation coefficient. Their list of rising Twitter
words includes rn (right now), selfie/s (a photo of oneself ), tbh (to be honest),
literally, bc (because), ily (I love you), bae (babe, baby), schleep (sleep), sweg (swag,
i.e., style), and yasss (yes). On the declining side, the following are 2013’s “fall-
ing” Twitter words: wat (what), nf (now following), swerve, shrugs (* shrugs *),
dnt (don’t), wen (when), rite (right), yu (you), wats (what’s), and yeahh (yeah).
They were also able to visualize the data per word and then trace the spread
of each word across the US. For example, for the word selfie (named "Word
of the Year" for 2013), the following graph (Figure A3.13) clearly shows its
dramatic linear increase in usage from day 1 to day 250.

Figure A3.13  Selfie/s first appearance and dramatic increase in usage from Twitter in 2013. The gray parts in the US map (typically major cities in the Northeast and the Southwest) indicate that selfie/s originated from and was immediately popularly used in many major US cities (Grieve et al., 2014).

A 'heat map' from the
geotagged tweets shows parts of the US where selfie was included as part of a
tweet. Additionally, they also visualized the first users of these words in tweets,
obtaining profile pictures of Twitter users that could be qualitatively explored
according to gender, age, and other variables.

For the Teacher

The previous figures can be used to initiate small group discussions in a Soci-
olinguistics or Language in Society class. Grieve et al. (2014) found that most
rising words on Twitter follow an s-curve when presented graphically. They
also found patterns: (1) Acronyms were on the rise, but creative spellings
were on the decline, (2) there were relatively clear southern and northern US
patterns of lexical spread on Twitter, and (3) lexical innovators appeared to
include young black women in the South and young white men in the West
and the North. (This observation was derived from an examination of profile
pictures of Twitter users.) Students examining visualized geotagged Twitter
data might be asked to consider and discuss questions such as the following:

• What are your immediate impressions about the data? What jumped
out? What are lessons or takeaways from this visualized data?
• Explain what the data/figures and excerpts are about. Answer the
question, “So what?”
• Remind students that CL is a research approach, a way of thinking
about language that shines the spotlight on language use. What then
is a word? (Note that Grieve and colleagues referred to rn, ily, and
yasss as “words” from Twitter.) What are the implications of these new
Twitter words in the study of languages?
• If CL allows investigation of language choice, could we explain why
Twitter users prefer a particular word or grammatical form rather than
alternatives?

TAGSExplorer (https://hawksey.info/tagsexplorer/) is a Twitter archive
visualization tool developed by Martin Hawksey, which makes use of Twitter to collect
tweets related to a particular event or issue hashtag to enable participants to share
and contribute relevant comments or responses. As more and more participants
tweet, the utilization of the event/issue hashtag will continue to increase and can,
thereby, enable greater public visualization of archived tweets. Users can interact
with their own and other users’ tags, and the visualization could be shared online.
Figure A3.14 shows an example of this “queryable visualization” format, which was
made possible by using Google Sheets as a database of tweets.
Figure A3.14  Representation of TAGSExplorer's nodes of event hashtags and interactive visualization.

What does it all mean? Hawksey's goal in developing TAGSExplorer was to
archive event hashtags and create an interactive visualization of the conversa-
tion threads on Twitter. The device makes use of many data points identified by
users and provided in real time by Twitter. Twitter, in this context, is, therefore,
an automatic corpus available for instantaneous analysis. Although it may not
be immediately applicable to classroom teaching or in the teaching of linguistic
features aside from the tracking of vocabulary use in this register, the tool is a
reflection of language development online and how emerging technologies are
providing linguists and teachers access to authentic texts. TAGSExplorer, in the
future, could be the model for individualized, learner-centered instruction on
language description. This may first start with vocabulary instruction, followed
by focus on form/structure activities for learners to enable them to be more
aware of sentence-level nodes of language as they interact with the tool. Learn-
ers will be asked to focus on discovering unique patterns of language that they
can use in their own writing, whether sending tweets or, potentially at some
point in the future, writing more formal academic essays.

A3.2.3.3  Exploring the Language of Hip Hop (also Hip-Hop)


Finally, although they do not typically refer to CL as their underlying method
of analysis, recent projects developed by researchers associated with the Rap
Research Lab and similar groups such as The Hip-hop Archive & Research
Institute at Harvard University, and The Frank-Ratchye STUDIO for Creative
Inquiry at Carnegie Mellon University utilize a corpus of hip hop lyrics to ex-
plore vocabulary use in hip hop and to visualize and compare artists’ creative
use of language. Hip hop has been a leading source of linguistic innovation
and has also now been studied across academic fields in the digital humanities,
media criticism, and data visualization. In many of these academic studies, the
language of hip hop is viewed as a cultural indicator. Tahir Hemphill has de-
veloped a searchable rap almanac, the Hip Hop Word Count (http://staplecrops.
com/), which examines hip hop lyrics and allows a visualization tool to draw
out shapes and circular lines from the lyrics, revealing a layer of creative work
and the aesthetic focus that artists pursue in their songs.
The Hip-Hop Word Count is a searchable ethnographic database built from
the lyrics of over 40,000 hip hop songs (and growing) from 1979 to the present
(Hopkins, 2011). From this database, linguistic details of hip hop songs can be
explored and compared. As Hemphill suggests, these data can then be used to
not only derive interesting statistics about the songs themselves, but also po-
tentially to describe and explain the culture behind the music. An illustrated
visual on the artist, a particular song, and linguistic information such as total
words, average syllables per word, average letters per syllable, average letters per
word, polysyllabic words, and finally education level (e.g., some high school or
high school graduate) and readability or reading level are provided. Reading
levels are identified as “Readers’ Digest,” “Time Magazine,” and others. In the
following comparison data, adapted from Hemphill’s site, Kanye West’s “Big
Brother” and Tupac Shakur’s “Trapped” received word count scores of 9 and
12, respectively. (The higher the word count, the more sophisticated the lyrics
are, arguably.) Linguistic metrics of the two songs are provided (Figure A3.15).
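Some of the surface metrics reported by the Hip Hop Word Count, such as total words and average letters per word, are easy to recompute. The sketch below uses naive whitespace tokenization and is an illustration only, not Hemphill's actual procedure:

```python
def lyric_metrics(lyrics):
    """Naive surface metrics for a lyric: token count and average letters per word."""
    words = [w.strip(".,!?;:\"'()").lower() for w in lyrics.split()]
    words = [w for w in words if w]
    letters = sum(len(w) for w in words)
    return {"total_words": len(words),
            "avg_letters_per_word": round(letters / len(words), 2)}

# An invented line, not an actual lyric from either song:
print(lyric_metrics("I'm trapped, trapped in this cage"))
# {'total_words': 6, 'avg_letters_per_word': 4.5}
```

Students could compute these metrics for a country song and a hip hop song and discuss what, if anything, the differences reveal.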
How can analyzing hip hop lyrics teach us about cultures or subcultures?
Song-level comparisons are potentially interesting to students, especially those
who like this genre of music, but they can also apply this approach to other
genres or extend the comparison to two or more corpora. For example, my
students have always been curious about the differences of vocabulary use and
themes between country music and hip hop. They explore the distribution and
functions of words like love, God, freedom, America, and tagged POS features
such as personal pronouns, verb tenses, passive structures, and nominal modi-
fication. The Hip Hop Word Count also provides time and geographic location
identifiers based on where the artists came from in the US and related compari-
sons of metaphor use and other figures of speech, cultural references, phrase and
rhyme style, memes, and socio-political ideas. Hemphill's database then converts
various data points into an explorable visualization frame that charts “migra-
tion of ideas and builds a geography of language.”
Daniels (2017) used a token analysis method—basically, a type-token ­ratio—
to determine hip hop artists’ vocabulary range, identifying unique words from
an artist’s first 35,000 song lyrics collected in the corpus. His various results
allowed him to create a master list of who has the most to the least unique and
diverse vocabulary range.

Figure A3.15  Comparison of "Big Brother" (Kanye West) and "Trapped" (Tupac Shakur) from the Hip Hop Word Count. Adapted from Hemphill: http://staplecrops.com/.

An online interactive visualizer (https://pudding.cool/2017/02/vocabulary/index.html) provides a set of data that also show how
artists compare to Shakespeare and Herman Melville’s Moby Dick. Results from
this approach revealed that Aesop Rock (born 1976), based in Portland, Ore-
gon, was the artist with the “largest vocabulary in hip hop,” with 7,392 unique
words used. By comparison, Shakespeare’s total was 5,170; ­Melville’s was
6,022, based on the first 35,000 words of Moby Dick. The artist with the
smallest vocabulary was DMX (born 1970, from New York), with 3,214 unique
words. Daniels's comparison chart shows that a majority of artists plot in the
3,600–5,000 range, below Shakespeare. There were not many women in the
dataset; Lil' Kim (born 1974, from New York) scored 4,470, and Nicki Minaj
(born 1982, from Trinidad and Tobago, raised in Queens, NY) scored 4,162.
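Daniels's measure, the number of unique words among an artist's first 35,000 lyric tokens, is a capped type count. A minimal sketch, with naive tokenization and a small cap for demonstration:

```python
def unique_words(tokens, cap=35_000):
    """Count distinct word types among the first `cap` tokens (Daniels-style)."""
    return len(set(tokens[:cap]))

# An invented token list for demonstration:
lyrics = "cash rules everything around me cream get the money".split()
print(unique_words(lyrics, cap=5))  # 5: types among the first five tokens only
print(unique_words(lyrics))         # 9: all nine tokens here are distinct
```

Capping the token count matters: without it, artists with longer careers would score higher simply by having produced more text, which is the familiar type-token ratio problem of comparing texts of unequal length.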

Note
1 We plan to launch a full, new version of Text X-Ray upon completion of our usabil-
ity tests. If you want to access the beta version, please send an email to textxray.
beta@gmail.com for instructions on how to run the program online.
Part B
Tools, Corpora, and
Online Resources
B1
Corpora and Online Databases

Section B1 provides a list of currently available large-scale corpora and online
databases that are relevant for English teachers. Most of these corpora are
readily accessible online or may be purchased from their developers. In addi-
tion to written texts, which are easier to compile and are often the focus of
­classroom-based activities, I also introduce several texts and collections of spo-
ken academic discourse and learner oral language (in English). I describe useful
learner and professional written texts, and the types of student oral language,
together with some examples of corpus-based studies utilizing these corpora.
Readers will find it helpful to keep these resources in mind as they read related
sections, such as B2 (“Collecting Your Own Corpus”) and the four subsections
of Part C (“Corpus-Based Lessons and Activities in the Classroom”).
My goal in this section is to highlight corpora that are publicly available
(e.g., MICUSP, MICASE, BAWE, COCA, or ELFA) and those that may be
purchased online (e.g., ICLE, LINDSEI). In the past decade, specialized writ-
ten and spoken texts from L2 learners or the language of academia have been
widely collected and shared by various research groups globally, suggesting
a growing interest in this area of corpus-based research in the classroom and
the important merging of SLA and corpus-informed approaches in language
teaching. Some collections are private or not easily shared, but there are many
related studies or publications written about them. Examples of these types of
corpora are the T2K-SWAL (TOEFL 2000 Spoken and Written Academic
Language, described later) and many related TOEFL-based writing texts an-
alyzed by researchers in collaboration with the Educational Testing Service
(ETS), which owns these texts. For T2K-SWAL, various articles and a few
research manuscripts have been published detailing the characteristics of ac-
ademic language in the US, collected from four major public universities in
four states: for example, Biber’s University Language: A Corpus-Based Study of
Spoken and Written Registers (2006). Many readers may be familiar with teaching
materials developed from the T2K-SWAL, including those directly addressing
essay writing and test preparation activities.
Before proceeding with the listing of corpora and online databases, I’d like
to share my responses to several questions that I frequently hear from teachers
who attend my CL workshops or conference presentations about corpus data
and its use in the classroom. These are also common topics of discussion with
my graduate students in my CL and Technology and Language Teaching courses.

Why Use Existing Corpora?


For language teaching purposes, the simplest answer is because they are readily
available. There are now many free written (and some spoken) corpora, which
teachers can download and use for various applications. You can utilize what’s
available, and, especially during the early stages of your work with corpora, you
do not necessarily need your own collection of specialized texts. As described
later, many existing corpora collected and shared by researchers online are
excellent resources for teachers in developing lessons and classroom activities.
These can be saved and then shared with students for concordancing activities
or assigned for qualitative coding outside the classroom. Do not hesitate to
contact researchers about their corpora utilized in published studies. You may
be surprised to discover that there are a few who are willing to provide copies
of texts that can be used for replication studies. The ones that are strictly private
or proprietary will not be available, of course.
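For the concordancing activities mentioned above, the core of a keyword-in-context (KWIC) display takes only a few lines. Dedicated concordancers do far more (sorting, collocation statistics, regex search), but the basic idea can be sketched as:

```python
def kwic(text, keyword, width=3):
    """Return keyword-in-context lines: each hit with `width` words of context."""
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,!?;:") == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

sample = "The corpus shows that the word corpus occurs twice in this corpus sample."
for line in kwic(sample, "corpus"):
    print(line)
```

Running this over a downloaded corpus file, one text at a time, is enough for a first classroom concordancing exercise before moving to a full-featured tool.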
It is important to note that the selection of corpora comes second to articu-
lating your research questions and teaching goals. Do your best to first establish
clear objectives, including why you are using corpora in your course, then ex-
plore your options for available and appropriate texts you can use.

Do I Need to Purchase Corpora: For Example, the International
Corpus of Learner English (ICLE)?
There are advantages to having copies of the large-scale corpora that are avail-
able for purchase. In the classroom, these will be handy, when readily avail-
able, and could be used immediately for many (concordancing) activities by
students. It makes it easier for teachers to group files according to specific,
functional folders, and owning copies of texts allows for additional processing,
such as running taggers or parsers when needed. The ICLE provides examples
of L2 writing from university students, representing over 10 different L1 back-
grounds. In an ESL or SLA course, depending on students’ year level, the ICLE
may be relevant and useful to own, especially in activities that can illustrate to
students the concepts of transfer or show how to conduct a contrastive analysis
study. There will be a few limitations in some circumstances, and ICLE may
not provide the best comparable data for these types of studies, but it can be used
to provide beginning-level examples. All ICLE essays were lemmatized and
POS-tagged with the CLAWS Tagger, and these items can be searched using
the built-in concordancer. The ICLE’s current edition, though, is not cheap;
the manual with a CD-ROM of all texts costs 272.22€ for a single-user license
and 580.22€ for multiple users (as of early 2018). The LINDSEI manual with
CD-ROM from the same research team under Sylviane Granger (Université
catholique de Louvain) costs 211.75€ for a single-user license.
The important thing to remember here is that it is good to have a ‘reference’
corpus readily available for a class that incorporates corpus-based activities.
You do not need to purchase corpora, as there are excellent ones that are freely
available. For example, MICUSP texts can be downloaded for free with only
official email registration, and full-text corpus data from five large corpora
of English—News on the Web (NOW), Wikipedia, COCA, COHA, and GloWbE—as
well as from the Corpus del Español are all available from Davies's BYU corpora
web page (www.corpusdata.org/).

Are There Corpora Available to Suit My Specific Needs Beyond
Typical Academic Texts or Learner Language?
Specialized written corpora, such as memos, business letters, application let-
ters, various advertisements, and others, have been collected and examined in
research, and some of these sub-registers are also included in general corpora,
such as the BNC or the ANC. Corpora in professional fields have been ex-
plored, primarily with a focus on improving our understanding of professional
communication and also developing training materials to enhance service
quality, avoid miscommunication, and facilitate successful business transactions
(Pickering, Friginal, & Staples, 2016). Some of the most commonly studied reg-
isters of cross-cultural workplace discourse using corpora include office-based
workplace interactions; outsourced call center transactions; health-care (e.g.,
doctor-patient, nurse-patient) interactions; international business meetings and
negotiations; and tourism, hotel, and service industry communications, among
many others.
The Cambridge and Nottingham Business English Corpus (CANBEC) is a
1-million-word sub-corpus of the Cambridge English Corpus (CEC), cover-
ing a range of business settings from large companies to small firms and both
transactional (e.g., formal meetings and presentations) and interactional (e.g.,
lunchtime or coffee room conversations) language events. Studies using the
CANBEC have focused on the distribution of lexical chunks and discursive
practices in business meetings (Handford, 2010; McCarthy & Handford, 2004).
A similar, but much smaller, corpus is the American and British Office Talk
(ABOT) Corpus, which comprises mainly informal, unplanned workplace in-
teractions between coworkers in various office settings (Koester, 2010).
The Hong Kong Corpus of Spoken English (prosodic) (HKCSE) includes
a sub-corpus of business English with approximately 250,000 words (Cheng,
Greaves, & Warren, 2008; Warren, 2004). The HKCSE contains various types
of formal and informal office talk, service encounters in hotels, business presen-
tations, and conference calls. As a cross-cultural corpus, the two main cultural
groups communicating in many of the workplaces are Chinese speakers from
Hong Kong and native and non-native English speakers from many different
countries. The HKCSE is unique in that it is transcribed for prosodic fea-
tures using Brazil’s (1997) model of discourse intonation. A concordancing
program—iConc—was specifically developed for the corpus and allows quan-
titative analyses of intonational features (Cheng, Greaves, & Warren, 2005).
A manual on the HKCSE with an accompanying CD-ROM of texts (and the
iConc interface) is available for purchase.

Can I Focus on Collecting and Analyzing Emails?


My students have always been interested in studying and then teaching the
structure and form of emails, both formal and informal types, especially in ESL
classes. Emails, however, are not easy to collect, and their use will likely present
ethical issues and require that consent forms be obtained. Applying for institu-
tional review approval for a corpus collection of emails will be quite difficult,
and the process will require very specific justifications and the resolution of
legal concerns regarding privacy of communication. For further discussion on
corpus collection of specialized academic texts, see Section B2.
Business or corporate email text samples are actually available online. The
Enron Email Corpus was a product of a legal decision to force the Enron
Corporation—an American energy, commodities, and services company—to re-
lease its private documents to the public. Thus, the Enron Email Corpus became
the first massive collection of workplace emails from employees and executives.
During the criminal investigation of Enron’s business activities, the preserva-
tion of email messages, especially those from corporate executives, was ordered.
A collection of 619,446 messages sent by 158 senior Enron executives was subse-
quently made available to the public. A threaded corpus with 517,431 messages
sent by 151 employees from 1997 to 2002 was developed and freely distributed
online (Friginal & Hardy, 2014a; Klimt & Yang, 2004; Shetty & Adibi, 2004).
Internet versions of the Enron email database were first posted on the Fed-
eral Energy Regulatory Commission website and the Cognitive Assistant that
Learns and Organizes (CALO) project. US universities, such as the University
of Pennsylvania, Carnegie Mellon University, and University of California,
Berkeley, host database searches and visualization features for topics and senti-
ment labels (Cohen, 2015). These emails have been explored from various per-
spectives, ranging from an analysis of the messages in corporate management
audits, the spread of gossip in emails and computer-mediated communication,
and examinations of the emails' electronic structure to the use of electronic
data for machine learning and natural language processing.

What Do I Need to Know about Social Media Texts and Their
Teaching Applications?
There is a growing number of studies that explore corpora of Facebook and
Twitter posts from a range of perspectives. Register analyses of status updates
and tweets provide a description of the emerging linguistic patterns of so-
cial networks. For example, tweets have been effectively used in dialectology
studies and in studies of synchronic vocabulary development (Nini, Corradini,
Guo, & Grieve, 2016). Carr, Schrock, and Dauterman (2012) analyzed the fre-
quency and function of various speech acts and humor on Facebook, reporting
that self-expression and self-disclosure influenced the popular use of expres-
sive speech acts, followed by assertive speech acts. Humor was used in various
contexts in status updates but was typically connected with a personal topic
and reference. An ongoing study by Juola, Ryan, and Mehok (2011–) explores
the identification of specific lexical features in tweets that may be part of an
emerging regional ‘tweet dialect.’ Zappavigna (2017) noted that commenting
on political issues via Twitter has become a commonplace social practice, often
involving referencing material that has been produced by politicians or public
figures across media. This is typically achieved by using direct and indirect
quotations, which are features related in a broader sense to stance, attribution,
voice, and point of view (Friginal, Waugh, & Titak, 2017).
Studies of Facebook and Twitter platforms as applied in the classroom, and
specifically for language learning, have looked at learners’ behavior and posting
activity; the quality of language produced in these platforms; and various other
applications, such as posting frequency and learners’ level of participation. The
applications of Facebook and Twitter in academic settings have also been inves-
tigated, especially the ways in which teachers can utilize these sites to encourage
written interaction and practice, critical thinking, and engagement (Krutka &
Carpenter, 2016; Vasbø, Silseth, & Erstad, 2014). Facebook and Twitter have
been utilized as more accessible discussion boards for a range of hybrid classes,
which may serve as venues for required postings, similar to reaction papers or
threaded discussions outside the classroom. In these settings, language pro-
duction is encouraged, if not required or graded, allowing learners to produce
linguistic output throughout the semester.
I worked with my colleague Jing Paul, from Agnes Scott College, to study
the effects of symmetric (Facebook) and asymmetric (Twitter) online social
networks on foreign (Chinese) language learners’ language production in both
short- (10 days) and long-term (50 days) pseudo-experimental settings. Stu-
dents completed pre- and post-questionnaires, and posted a message in Chinese
daily for five days a week during the period of the project.

86  Tools, Corpora, and Online Resources

We found that Facebook participants believed strongly that reading symmetric
Facebook posts improved their reading skills. In both settings, Facebook users posted
more sentences than Twitter users per day. Additionally, posts on Facebook
were more interactive than those on Twitter. In both settings, there were more
grammatical errors on Facebook than on Twitter. In the long-term study, a
moderate positive correlation between the number of Chinese characters and
the number of grammar errors occurred for the Facebook group but not for
Twitter users. We concluded that both symmetric and asymmetric online net-
works provide an environment for learners to communicate in L2 beyond the
classroom. A symmetric social network is more suitable for interactive commu-
nication; it provides a meaningful context for learners to practice their reading
skills, improving L2 writing fluency and building L2 reading and writing con-
nections. It also provides a sense of security for the learners at the initial stage of
the online communication. An asymmetric social network is more suitable for
self-practice; it promotes L2 grammatical accuracy in communication.
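The moderate positive correlation reported for the long-term Facebook group is a standard Pearson correlation between per-learner counts. A minimal sketch in Python, using invented figures rather than the study's actual data:

```python
# Sketch: correlating post length (characters) with grammar-error counts
# per learner. All numbers below are hypothetical, not the study's data.
from statistics import mean

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

chars_per_post = [120, 95, 210, 160, 80, 190]   # hypothetical
errors_per_post = [4, 3, 8, 5, 2, 7]            # hypothetical

r = pearson_r(chars_per_post, errors_per_post)
print(round(r, 2))  # values near +1 indicate a strong positive correlation
```

With real data, the same computation can be run per group (Facebook vs. Twitter) to compare the strength of the association.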

B1.1 Written Learner Corpora

The International Corpus of Learner English (ICLE)


The ICLE contains about 3.7 million words of learner English. It consists
of academic writing—mainly argumentative—produced by university un-
dergraduates usually in their third or fourth year. The corpus is divided into
11 national sub-corpora of between 200,000 and 278,000 words each. Eleven
native language backgrounds are represented, but there is no exact match be-
tween the backgrounds and the sub-corpora; learners with a Swedish language
background, for instance, are represented in both the Finnish and Swedish
sub-corpora. The term ‘national’ is somewhat misleading regarding some
sub-corpora as used in the ICLE; for instance, the French sub-corpus consists of
essays written in Belgium (by native speakers [NS] of French), and the German
sub-corpus of essays written in Austria, Germany, and Switzerland. This po-
tential source of confusion is not significant, given the selection tool that comes
with the corpus texts, but it may still puzzle users, who will be faced with a list
of countries to choose from that does not match the list of national sub-corpora
(Friginal & Hardy, 2014a). The learner profiles are stored in a database and
contain a great deal of information on each essay and essay writer. The profiles
are linked to the texts by essay codes, which contain, among other things, a
national code and an institution code (e.g., FIHE for Finnish, Helsinki Uni-
versity). The texts are in ASCII format, and, as previously mentioned, they are
tagged and lemmatized, and contain no markup, except for essay codes linking
each text to its profile and codes for deleted quotes, deleted bibliographical
references, and illegible words. The text format is designed to work well with
software tools for linguistic analysis, such as WordSmith Tools or AntConc.
Corpora and Online Databases  87

The ICLE CD-ROM package contains about 20 variables (alphanumeri-
cal, numerical, alphabetical, or selection lists) according to which corpus users
can select texts. The coverage is quite impressive: It is possible to select essays
according to features of the essay (e.g., type, length, and production circum-
stances) as well as features of the learner (e.g., sex, country, native language,
language at home, age, and years of English at school). The advantage of this
coding scheme is that teachers can design their own tailor-made sub-corpora,
which clearly helps to increase the validity and reliability of comparisons across
native languages. For example, Aijmer (2011) emphasizes the importance of
controlling for topic in research on modality in learner writing. The ICLE
package enables users to select essays according to both type (“argumentative,”
“literary,” or “other”) and (words in) title. The only drawback, given this many
parameters for selection, is that some of the resulting sub-corpora that are pro-
duced by combining several variables will be quite small. The handbook de-
scribes the selection process well, and help files are also available via the menu
system of the program itself (Friginal & Hardy, 2014a).
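As a rough illustration of what the ICLE selection tool does, metadata-based filtering can be sketched in a few lines of Python. The profiles and field names below are invented for illustration (only the FIHE essay-code convention comes from the text), not ICLE's actual variable names:

```python
# Sketch: building a tailor-made sub-corpus by filtering learner-profile
# metadata, in the spirit of the ICLE selection tool. Profiles are invented.
profiles = [
    {"essay_id": "FIHE001", "country": "Finland", "type": "argumentative", "l1": "Swedish"},
    {"essay_id": "SWUP002", "country": "Sweden", "type": "argumentative", "l1": "Swedish"},
    {"essay_id": "FIHE003", "country": "Finland", "type": "literary", "l1": "Finnish"},
]

def select(profiles, **criteria):
    """Return the profiles matching all keyword criteria."""
    return [p for p in profiles
            if all(p.get(k) == v for k, v in criteria.items())]

# e.g., all argumentative essays by L1-Swedish writers, regardless of country
sub_corpus = select(profiles, l1="Swedish", type="argumentative")
print([p["essay_id"] for p in sub_corpus])  # → ['FIHE001', 'SWUP002']
```

Combining several criteria this way is also how the small-sub-corpus problem arises: each added filter shrinks the resulting text set.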

For the Teacher

One of the first studies published on ICLE texts investigated learners’ use of
high-frequency verbs in writing: in particular, the verb MAKE (Altenberg &
Granger, 2002). The main research questions were: Do learners tend to over-
or underuse these verbs? Are high-frequency verbs error-prone or safe? What
part does transfer play in the misuse of these verbs? To answer these ques-
tions, Altenberg and Granger compared Swedish and French student essays
from ICLE with native speaker essays (students from the US) from a sub-
corpus called the Louvain Corpus of Native English Essays (LOCNESS). Alten-
berg and Granger found that learners, even at an advanced proficiency level,
had difficulty in accurately and consistently using MAKE (and its forms) in var-
ious contexts. The authors suggested that some of these ‘problems’ were
shared by the Swedish and French learners, and may be L1-related. There
were many clear limitations and methodological issues about this study, es-
pecially in its conceptualization of L1 transfer and the concepts of over- and
underuse of linguistic features by learners, with NSs as the target for compari-
son. However, there are several interesting applications that clearly show how
a collection of essays like ICLE can be used for classroom activities, engaging
students actively in the discussion and analyses of semantic features of verbs.
The authors were able to extract 354.3 occurrences of the lemma MAKE
(i.e., all forms: make, makes, making, made) per 100,000 words for Swedish learners,


234.6 for French learners, and 339.8 for US students. They then grouped the
meanings of MAKE into eight major categories:

1. Produce something (result of creation): make furniture, make a hole, make a law
2. Delexical uses: make a distinction, a decision, a reform
3. Causative uses: make somebody believe, make it possible
4. Earn (money): make a fortune, a living
5. Link verb uses: she will make a good teacher
6. Make it (idiomatic): if we run, we should make it
7. Phrasal/prepositional uses: make out, make up, make out of
8. Other conventional uses: make good, make one’s way
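The normalized rates reported above (e.g., 354.3 occurrences per 100,000 words) come from a simple calculation that students can replicate for their own mini-studies. A minimal sketch, with hypothetical counts:

```python
# Sketch: normalizing a raw frequency count to a rate per 100,000 words,
# the measure Altenberg and Granger report for MAKE.
def normalized_rate(raw_count, corpus_size, basis=100_000):
    """Occurrences per `basis` words of running text."""
    return raw_count / corpus_size * basis

# Hypothetical counts: 708 hits of MAKE in a 200,000-word sub-corpus
print(round(normalized_rate(708, 200_000), 1))  # → 354.0
```

Normalizing to a common basis is what makes counts comparable across sub-corpora of different sizes (e.g., the Swedish, French, and US collections).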

An activity extracting usage of a word to categorize its meanings or func-
tions can be easily designed for the classroom. The aforementioned cate-
gories of MAKE are fairly easy to define, and there will be many examples
from concordance outputs that learners can discuss in pairs or small groups.
Other high-frequency verbs such as GET, TAKE, or GIVE can also be used in
similar activities or mini-research studies conducted by students. Altenberg
and Granger noted that concordance-based exercises are useful in raising
advanced learners’ awareness of the structural and collocational complexity
of high-frequency verbs. They suggested that learners could be specifically
instructed to scan concordance lines and compile a list of the major collo-
cates of verbs (like MAKE) as awareness-raising exercises. This would then
be followed by a consolidation exercise in which learners are asked to fill in
the blanks in corpus excerpts from which common collocates have been
removed (e.g., for MAKE: decision, mistake, claim, argument, effort, etc.). Such
exercises, they believe, would increase the learners’ ‘depth of processing’
and also potentially their degree of retention of collocates.
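The consolidation exercise described above can be generated semi-automatically: blank out a teacher-supplied list of collocates in concordance lines. A sketch, where the lines and collocate list are illustrative only:

```python
# Sketch: generating a gap-fill exercise by removing known collocates of
# MAKE from concordance lines. Lines and collocates are invented examples.
import re

COLLOCATES = ["decision", "mistake", "claim", "argument", "effort"]

def make_gapfill(line, collocates=COLLOCATES):
    """Replace each listed collocate in a concordance line with a blank."""
    pattern = re.compile(r"\b(" + "|".join(collocates) + r")\b", re.IGNORECASE)
    return pattern.sub("_____", line)

lines = [
    "the government had to make a difficult decision about taxes",
    "students often make the mistake of translating word for word",
]
for line in lines:
    print(make_gapfill(line))
# → the government had to make a difficult _____ about taxes
# → students often make the _____ of translating word for word
```

In practice, the input lines would be exported from a concordancer such as AntConc, and the collocate list would come from the awareness-raising stage of the activity.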

The Michigan Corpus of Upper-Level Student Papers (MICUSP)


MICUSP is a corpus of proficient student academic writing that was compiled
by a team of researchers and students at the English Language Institute of the
University of Michigan (Römer & O’Donnell, 2011). The MICUSP database
enables researchers, language teachers, and students to investigate the written
discourse of proficient, advanced-level native and non-native speaker student
writers at a large American research-oriented university. It also provides stu-
dents with a wide selection of A-graded papers that may serve as models for their
own academic writing. MICUSP is currently available through a user-friendly
online search interface known as MICUSP Simple. An off-line version of the
corpus (in annotated XML and plaintext format) on a CD-ROM, accompany-
ing a MICUSP resource book, is under preparation.

The papers collected for MICUSP were written by students at four different
levels of study: final-year undergraduate and first, second, and third-year graduate
levels. Different types of papers, ranging from essays to lab reports, were collected
from a wide range of disciplines within four academic divisions: humanities and
arts, social sciences, biological and health sciences, and physical sciences. The
corpus enables analyses of disciplinary, developmental, and genre-related phe-
nomena associated with student writing. Each paper in MICUSP also captures
metadata on the student’s gender and ‘nativeness’ status (including information
on first language background in the case of non-native speakers [NNSs]).
The number of papers from the individual MICUSP disciplines ranges from 21
to 104. Some of the well-represented disciplines in terms of papers and word counts
are psychology, English, sociology, and biology. The overall word count is high-
est for the social sciences division (978,254), followed by the humanities and arts
(734,437), and biological and health sciences (511,550), and is lowest for physical sci-
ences (392,288). MICUSP papers have been categorized according to paper types
(e.g., argumentative essay and report), following a systematic data-driven text-type
analysis of all the papers in the corpus (Römer & O’Donnell, 2011). The papers fall
into the following seven text type categories: argumentative essay, creative writing,
critique/evaluation, proposal, report, research paper, and response paper.

For the Teacher

[Discussion continued from Section A3] Hardy and Römer (2013) identified and
analyzed the co-occurring, lexico-grammatical features of MICUSP to help char-
acterize successful student writing. Following Biber (1988), they used multidi-
mensional analysis to establish dimensions of frequently co-occurring features
that best account for cross-disciplinary variation in MICUSP. The four functional
dimensions of MICUSP appear to distinguish between: (1) Involved, Academic
Narrative vs. Descriptive, Informational Discourse; (2) Expression of Opinions and
Mental Processes; (3) Situation-Dependent, Non-Procedural Evaluation vs. Procedural
Discourse; and (4) Production of Possibility Statement and Argumentation. They ar-
gued that as writing instruction increasingly spreads from English departments
to writing intensive coursework housed in other disciplines, there is a need to
better understand student writing as it exists in those content areas. Such an
understanding can help English writing teachers address the needs of students
who are beginning writers in the discipline. Figure B1.1 presents a comparison of
disciplines in how they are distinguished in Dimension 1, Involved, Academic Nar-
rative vs. Descriptive, Informational Discourse. This illustration clearly shows how
writing in the humanities (philosophy and education) is different in linguistic
co-occurrence patterns from those in the sciences (physics and biology).
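For teachers curious about the mechanics behind such dimension scores, the general Biber-style procedure (standardize each feature's rate across texts, then combine the z-scores according to the features' loadings) can be sketched as follows. The features and rates here are invented for illustration, not Hardy and Römer's actual feature set:

```python
# Sketch of the general multidimensional (MD) scoring procedure following
# Biber (1988): z-score each feature's normalized rate, then sum the
# z-scores of positive-loading features and subtract negative-loading ones.
from statistics import mean, stdev

def z_scores(values):
    """Standardize a list of rates to z-scores."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical per-text rates for two features across five texts
rates = {
    "private_verbs": [20, 35, 15, 40, 25],   # assumed positive loading
    "nouns": [180, 150, 210, 140, 170],      # assumed negative loading
}
z = {feat: z_scores(vals) for feat, vals in rates.items()}

# Dimension score per text: positive-loading minus negative-loading features
dim_scores = [zp - zn for zp, zn in zip(z["private_verbs"], z["nouns"])]
print([round(s, 2) for s in dim_scores])
```

Mean dimension scores per discipline (as plotted in Figure B1.1) are then simply the averages of these per-text scores within each discipline.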


The linguistic dimensions of MICUSP have pedagogical applications, espe-
cially in developing teaching materials in writing-across-the-curriculum set-
tings. It is clearly an objective in these various academic departments to better
understand student writing as it exists in the relevant content areas. Related to
this, the subfields of ESP and EAP, which make use of corpora in content-area

[Figure B1.1 near here: a vertical scale from (+8) Involved, Academic Narrative to (−8) Descriptive, Informational Discourse, plotting mean Dimension 1 scores by discipline: Philosophy (8.193, SD 3.983); Education (7.440, SD 10.183); Psychology (3.295, SD 8.645); English (1.990, SD 7.829); Nursing (0.484, SD 6.804); Sociology (0.354, SD 7.518); Political Science (0.165, SD 4.028); Industrial & Operations Engineering (−0.758, SD 6.011); Linguistics (−1.042, SD 7.221); Natural Resources & Environment (−2.953, SD 6.253); History & Classical Studies (−3.121, SD 5.154); Mechanical Engineering (−3.647, SD 4.308); Economics (−3.916, SD 4.591); Civil & Environmental Engineering (−5.297, SD 3.599); Biology (−5.975, SD 4.715); Physics (−8.193, SD 3.983).]

Figure B1.1  Comparison of dimension scores for disciplines in Dimension 1: (+) Involved, Academic Narrative vs. (−) Descriptive, Informational Discourse (ANOVA: group, F(15, 809) = 18.65, p < 0.001), adapted from Hardy and Römer (2013).

teaching, have become very popular in the past several years (Belcher, 2009;
Biber, Reppen, & Friginal, 2010; Johns, 2009). Studies such as Flowerdew
(2005), Gavioli (2005), Hinkel (2002), Hyland (2008b), Jarvis, Grant, Bikow-
ski, & Ferris (2003), and Yoon and Hirvela (2004), to name only a few, recognize
the valuable contribution of corpus-based data in the teaching of academic
writing across disciplines, especially in increasing learners’ awareness of the
textual features of their own writing relative to target (i.e., successful) models.
A poster presentation by Roberts and Samford (2013) explored the class-
room applications of MICUSP (and a few other tools and online databases)
by focusing on extralinguistic features such as patterns across specific learner
characteristics (e.g., nativeness, level, or discipline). They suggest to teach-
ers that various comparisons can illustrate progress in student work across
a semester or program and that the use of MICUSP texts can reveal specific
linguistic features unique to a genre (discipline), informing classroom deci-
sions on the instruction of that particular genre. To fully realize the potential
of these tools, teachers should explore and apply their discoveries to their
specific learner populations. However, as with any tool, care, deliberation,
and familiarity should all be taken into consideration when determining the
appropriateness of the application. Once familiar, MICUSP can be extended
to student use in or outside the classroom. Students can use their own texts
for comparison to see personalized results from their writing, rather than
a pre-compiled dataset. Some of their suggestions to teachers include the
following: (1) Pay attention to your learners’ needs. MICUSP is a collection of
upper-level papers, meaning it may not work for lower levels or non-native
professionals immediately. (2) If there is not an existing specialized corpus,
teachers are encouraged to build one from available materials. (3) Word list,
keyness, or frequency alone should not inform curriculum planning or course
development. A high keyness or frequency may provide a basis for justifying
some focus on a word, but may be deceptive due to the idiosyncratic nature
of writing. (4) Data derived from MICUSP is not meant to replace a teach-
er’s intuition, but rather to better inform and work in conjunction with it
(­McEnery, Xiao, & Tono, 2006). (5) Specialized corpora can inform novice
teachers or experienced teachers from different disciplinary backgrounds
charged with teaching an academic English course in a new discipline.

The British Academic Written English (BAWE)


The primary focus of the BAWE corpus (Nesi, 2008, 2011) was to create a
collection that contained a balanced number of writing samples from various
disciplines across university systems in Great Britain. Equal numbers of subjects
were established for each of the four disciplinary groups (arts and humanities,

life sciences, physical sciences, and social sciences). BAWE was also balanced
according to the number of texts per year. This ensured that the sample would
include equal representation of the levels of students in each subject. Some
subjects (e.g., archaeology, medical science), however, are not taken at all levels
of study, so fewer papers were expected. For medical science, for example, the

Table B1.1  Corpus matrix for the BAWE corpus (Nesi, 2008, 2011)

Disciplinary Group  | Subject                                              | Papers Per Year (1, 2, Final, and master's level) | Total
Arts & Humanities   | Applied linguistics/Applied English language studies | 32    | 128
                    | Classics                                             | 32    | 128
                    | Comparative American studies                         | 32    | 128
                    | English studies                                      | 32    | 128
                    | History                                              | 32    | 128
                    | Philosophy                                           | 32    | 128
                    | (Archaeology)                                        | 16    | 64
Life Sciences       | Agriculture                                          | 32    | 128
                    | Biological sciences/Biochemistry                     | 32    | 128
                    | Food science and technology                          | 32    | 128
                    | Health and social care                               | 32    | 128
                    | Plant biosciences                                    | 32    | 128
                    | Psychology                                           | 32    | 128
                    | (Medical science)                                    | 16:48 | 64
Physical Sciences   | Architecture                                         | 32    | 128
                    | Chemistry                                            | 32    | 128
                    | Computer science                                     | 32    | 128
                    | Cybernetics & electric engineering                   | 32    | 128
                    | Engineering                                          | 64    | 256
                    | Physics                                              | 32    | 128
                    | (Mathematics)                                        | 16    | 128
Social Sciences     | Anthropology                                         | 32    | 128
                    | Business                                             | 32    | 128
                    | Economics                                            | 32    | 128
                    | Hospitality, leisure, and tourism management         | 32    | 128
                    | Law                                                  | 32    | 128
                    | Sociology                                            | 32    | 128
                    | (Publishing)                                         | 16    | 64
Other               | Other                                                | 43    | 172
Total               |                                                      |       | 3500

*This table reflects data from Summer 2012 and includes texts from Jan to Jun 2012.

corpus planned to include only upper-level students (16 in their final year of
undergraduate, and 48 at the graduate level). We used this example in Friginal
and Hardy (2014a) to illustrate a corpus matrix and how you might develop
your own when you design a corpus. Table B1.1
shows the corpus matrix for BAWE, with total number of texts across disci-
plinary groups and subjects.
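If you build your own matrix, it can help to keep it in machine-readable form so that collection targets can be checked as texts come in. A minimal sketch covering a few of the Arts & Humanities rows of Table B1.1:

```python
# Sketch: representing part of a corpus matrix (as in Table B1.1) as a
# simple data structure so target totals can be tracked during collection.
# Only three Arts & Humanities rows of the table are included here.
matrix = {
    ("Arts & Humanities", "Classics"): {"per_year": 32, "total": 128},
    ("Arts & Humanities", "History"): {"per_year": 32, "total": 128},
    ("Arts & Humanities", "(Archaeology)"): {"per_year": 16, "total": 64},
}

# Running total of target texts for the rows represented above
group_total = sum(cell["total"] for cell in matrix.values())
print(group_total)  # → 320
```

As texts are collected, an analogous structure of actual counts can be compared cell by cell against these targets to spot under-represented subjects early.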
BAWE’s corpus matrix is a good example of how to set out to create
a balanced corpus, one that can help researchers answer questions associated
with the target population. The matrix for the BAWE corpus does not show
balance in terms of how much writing is done in each of these disciplines. In
other words, a discipline like history or philosophy might involve a lot more
writing of prose than some of the applied subjects (e.g., hospitality, health and
social care). The corpus creators were not interested in asking questions of
relative balance of writing. Instead, they thought it would be best to give each
subject equal representation, avoiding any idiosyncrasies that might be associ-
ated with a particular subject that might involve a lot more writing. Because
the BAWE project was interested in exploring disciplinary variation, the de-
velopers wanted to balance the amount of writing collected from each subject.
However, teachers might be interested in studying how language is used across
registers while putting more importance on the registers that are most fre-
quently used.

B1.2 Spoken Learner Corpora

The Michigan Corpus of Academic Spoken English (MICASE)


MICASE (Simpson, Briggs, Ovens, & Swales, 2002) is a collection of transcribed
speech that represents oral language in a university setting in the US. It is ac-
cessible online with a searchable interface that functions as a concordance pro-
gram. MICASE’s original audiotapes are available at the University of Michigan’s
English Language Institute and may be used by researchers after obtaining
permission. A MICASE users’ guide (Simpson-Vlach & Leicher, 2006) is also
available in book form. The MICASE team had two primary research ques-
tions that guided their design and collection: (1) What are the characteristics of
contemporary academic speech—its grammar, its vocabulary, its functions and
purposes, its fluencies and dysfluencies? (2) Are these characteristics different
for different academic disciplines and for different classes of speakers? As MI-
CASE focused on recording a range of academic speech, the team’s sampling
goals spanned 15 different types of speech events and four major academic di-
visions within those types (humanities and arts, social sciences, biological and
health sciences, and physical sciences). They followed a stratified random sam-
pling procedure, with each recording classified according to speech event type,

a preassigned number indicating the academic discipline, two letters represent-
ing the majority of participants in the event (e.g., junior undergraduate, senior
faculty, staff ), and a final three-digit sequence to track chronologically when
the tape was recorded. MICASE recordings had two researchers who attended
most speech events in order to identify speakers and facilitate transcription by
taking field notes about nonverbal contextual information. Small group events
(e.g., advising sessions, office hours, study groups), where an observer’s presence
would have been intrusive, did not include research assistants after the record-
ing equipment was set up (Simpson-Vlach & Leicher, 2006).
MICASE provides examples of speech events ranging in length from 19 to
178 minutes, with word counts ranging from 2,805 words to 30,328 words.
Clearly, this indicates that academic discourse varies with respect to both length
and form. In MICASE, academic speech is defined as “that speech which oc-
curs in academic settings.” This means that academic discourse is not pre-
defined as something like a scholarly discussion. Simpson-Vlach (2013) noted
that, in academic settings, such speech acts as jokes, confessions, and personal
anecdotes co-occur with definitions, explanations, and intellectual justifica-
tions. The most useful event types for investigating student speech are study
groups (8 events, 100% student), student presentations (11 events, 78% student),
labs (8 events, 68% student), and discussion sections (9 events, 67% student).
However, only 12% of MICASE came from NNSs of English, and this percent-
age also includes some faculty and staff. The range of speech events includes
monologic and interactive speech; undergraduate and graduate students; junior
faculty, senior faculty, and staff; and native, near-native, and NNSs of English.

For the Teacher

MICASE texts could be used beneficially to teach university-level international
students the structure of academic lectures in the US. For example, a project
by my former students Elizabeth Shafer and W. Anthony Yates (2012) looked
at the use of although, on the other hand, and others as discourse markers
used by professors in their lectures. Identifying discourse markers can help
students understand the relationship between statements and concepts and
can allow them to potentially predict subsequent information, thereby facil-
itating continuity of understanding (Nesi & Basturkmen, 2006). This is most
beneficial to students who struggle to follow along in a lecture. Understand-
ing discourse markers helps students engage actively, resulting in
more useful notes. (Note-taking remains the primary method that students
use to learn the content of lectures.)

Why Use MICASE to Teach Discourse Markers?


By their nature, lectures are generally delivered only once. This means that
students must process a lot of information in a short amount of time (Nesi &
Basturkmen, 2006). MICASE provides authentic data that can be easily ac-
cessed and revisited, allowing students to be exposed to discourse markers in
a stable environment. The large number of samples collected for MICASE also
helps ensure that materials developed from its use are reliable.
Teachers can use MICASE in creating guided activities for ESL students
to explore texts of lectures given by English NSs using academic spoken En-
glish. Common goals may focus on familiarizing teachers and ESL students
with how to find appropriate material in MICASE by using its built-in concor-
dancer, as well as the speaker and transcript attributes that can be extracted
online. Encouraging ESL students to examine how discourse markers are used
to structure the content of the lecture may provide them with a stable en-
vironment in which they can take notes on how contrastive views are pre-
sented in lecture format.
Shafer and Yates identified a debate text focused upon the role of govern-
ment in supporting its citizens on social issues. In preparation for this, they de-
veloped a lesson that examined how contrasting information was presented
in the lecture. The students (high-intermediate ESL students) were given a
worksheet containing the following discourse markers that they had to search
in MICASE and obtain context through concordance lines: although, on the
other hand, however, whereas, despite, nevertheless, but, and alternatively.
After searching, students were asked to explain the function of the dis-
course markers and if they were able to predict what the following idea or
sentence was about. They read the results aloud, stressing the discourse
markers as used in the speech/lecture by the instructor. They then encour-
aged students to examine other discourse markers that are related to lec-
ture structure (e.g., enumerating or defining markers).
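Where MICASE's built-in concordancer is unavailable, the core keyword-in-context (KWIC) display that students rely on in this activity can be approximated with a few lines of Python. The 'lecture' text below is invented, not a MICASE transcript:

```python
# Minimal keyword-in-context (KWIC) sketch, mimicking what students see
# in MICASE's built-in concordancer. The transcript line is invented.
def kwic(text, keyword, window=4):
    """Return (left context, keyword, right context) for each hit."""
    tokens = text.lower().split()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

lecture = ("the policy reduced costs however it also limited access "
           "to services however critics argued otherwise")
for left, kw, right in kwic(lecture, "however"):
    print(f"{left:>30} | {kw} | {right}")
```

Aligning hits in a column this way is what lets students scan the right-hand context and test their predictions about what follows a contrastive marker.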

The British Academic Spoken English Corpus (BASE)


The BASE (and a newer BASE Plus version) corpus has 160 lectures and 40
seminars recorded and transcribed from a variety of academic departments in
two universities in the UK. Overall, the BASE corpus contains 1,644,942 to-
kens (from lectures and seminars) available through the Oxford Text Archive.
BASE Plus is a much larger and more current collection of British academic
speech, with the original tagged transcripts of BASE, video and audio record-
ings of lectures and seminars, video recordings of academic conference presen-
tations, and interviews with academic staff “on aspects of their academic work

and field (audio recordings, transcripts, and interview notes).” BASE Plus may
be compared with MICASE for dialect comparisons of academic discourse. As
is the case with MICASE (and also T2K-SWAL—see the following), BASE
Plus represents language in academia which does not necessarily feature a large
amount of L2 learner output. The BASE Plus video recordings have been used
in materials development projects at the University of Warwick, most notably
the Essential Academic Skills in English (EASE) series. EASE: Seminar Discus-
sions and EASE: Listening to Lectures are available online (British Academic
Spoken English and BASE Plus Collections, 2017).

Vienna-Oxford International Corpus of English (VOICE)


VOICE is a structured collection of interactions capturing spoken English as a
Lingua Franca (ELF). ELF is widely regarded as the most accurate and compre-
hensive representation of contemporary English use globally, employed by
speakers from different first language (L1) backgrounds as a common means of
communication (Seidlhofer, 2007, 2012) across various locations and contexts
(e.g., business, education, tourism). The VOICE project was developed and col-
lected by research teams from the Department of English at the University of
Vienna (Barbara Seidlhofer, Project Director), funded by the Austrian Science
Fund, with support from Oxford University Press. VOICE currently has over 1
million words of transcribed spoken ELF (120 hours of transcribed speech, 23
recordings of speech events) from professional, educational, and leisure domains.
VOICE features transcripts of naturally occurring, non-scripted face-to-
face ELF interactions from 1,250 mostly European speakers. These speakers are
primarily “experienced ELF speakers” from a wide range of L1 backgrounds
(49 total). Interactions or speech events include interviews, press conferences,
service encounters, seminar discussions, working group discussions, workshop
discussions, meetings, panels, question-answer sessions, and conversations.
These speech events may also include code-switches into non-English speech
(e.g., German, French). VOICE 2.0 Online, which is based on VOICE 2.0
XML, is freely available at the VOICE project’s website: www.univie.ac.at/
voice. VOICE is not classroom based, but the corpus is certainly relevant as a
potential target corpus for many comparative studies of L2 speech across con-
texts. ELF texts from VOICE may, in fact, be considered as the type of English
student learners may aspire to in communicating successfully in English across
specific tasks and in a variety of settings.

English as a Lingua Franca in Academic Contexts (ELFA)


Also developed in the early 2000s—and around the same time as text collec-
tions for the initial version of VOICE—is the ELFA corpus. The ELFA corpus
recognizes that English has established itself as the global lingua franca, and

NNSs have increasingly outnumbered NSs in many global universities. Within
academic contexts, the English language constitutes the primary medium of
communication for a great number of international students, especially in com-
munities with speakers from different language backgrounds (Simpson-Vlach,
2013). The ELFA corpus, with one million words of transcribed speech from
a variety of speakers, provides an important resource for studying the linguis-
tic features of this speech community, both as a language variety in its own
right and as an important component of academic speech. ELFA’s collection of
texts (of speech events) was based on (1) prototypicality: the extent to which
genres are shared and named by most disciplines, such as lectures, seminars,
thesis defenses, and conference presentations; (2) influence: genres that affect a
large number of participants or are widely utilized, such as introductory lecture
courses, examinations, and consultation hours; and (3) prestige: genres with
high status in the discourse community, such as guest lectures, plenary confer-
ence presentations, and opening/closing speeches. The ELFA team also included
dialogic events alongside lectures, seminars, and conference presentations.

For the Teacher

Clearly, VOICE and ELF texts are very relevant in teaching English to a range
of international students across countries and areas of study. It is important
to fully capture the structure of ELF and develop classroom activities that will
facilitate a skillful recognition of and appreciation for the importance of this
language variety. Mauranen (2003) argues that the applications of theoretical
and descriptive work to ELF are of considerable practical significance in global
academia. She noted that,

An international language can be seen as a legitimate learning target, a
variety belonging to its speakers. Thus, deficiency models, that is, those
stressing the gap that distinguishes NNSs from NSs, should be seen as
inadequate for the description of fluent L2 speakers and discarded as the
sole basis of language education in English.
(p. 517)

The Louvain International Database of Spoken English Interlanguage (LINDSEI)

LINDSEI represents one of the first and most extensive collections of L2 spoken
English in an interview form. This corpus is especially well-suited to investiga-
tions of learner oral language because of its large size (800,000 learner words),
representativeness (11 L1 backgrounds with approximately 50 interviews each),
and consistency of its corpus design. Interviewers first asked each participant

to discuss a subject of his or her choice, from three options. They then contin-
ued the conversation informally by asking follow-up questions from the stu-
dent’s discussion, and the interview concluded with a picture-strip narration.
Each interview lasted approximately 15 minutes, and each was transcribed or-
thographically according to specific guidelines. Background information is also
noted for each speaker, including age, gender, L1, number of years of English at
school, and months living in an English-speaking country (Gilquin, De Cock, &
Granger, 2010). LINDSEI is a rich data source for investigations of ­lexico-
grammatical phenomena in learner speech and is suitable for a more detailed
analysis of learner interviews as a functional register. Various speech features
are captured in the transcripts, including dysfluent markers, repeats, and addi-
tional annotation (e.g., laughter, scream) as shown in the following excerpts (an
­interviewee/student responds to questions about studying and future plans):

Text Sample B1.1.  Sample LINDSEI interview responses


and it’s. good in the way that.. you only have to study.. and not worry about..
you know works and all that stuff.. and now that I’m.. finished and you know..
applying for jobs for next.. term and all that stuff and I start to be quite worried
and.. stress because.. I’ve been.. I mean I know I’ve been studying many years..
and.. I hope I can.. be a good..
(eh) this is my second degree so right now four years.. (mm). but I don’t
feel enough ready to: teach I mean that sometimes I think about and. next.
year next autumn I’m gonna be teaching I’m gonna have like.. twenty-five or..
thirty children.. right there in front of me looking at me and it’s like <screams>
I don’t know.. it’s.. I feel not ready I mean I think I need to know more..
even.. than I know.. and..yeah.. maybe.. I hope it’s gonna be.. enough..oh thank
you.. nice to hear
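Surface markers like these lend themselves to simple automated counts. Below is a minimal, illustrative Python sketch (not the official LINDSEI tooling) that tallies a few features in a fragment of the excerpt above; the regular expressions are assumptions based on the markers visible in this sample, not the full LINDSEI transcription scheme:

```python
import re

# A fragment of LINDSEI-style transcription (from the sample above)
excerpt = (
    "and it's. good in the way that.. you only have to study.. and not worry "
    "about.. you know works and all that stuff.. (eh) this is my second degree "
    "so right now four years.. (mm). but I don't feel enough ready to: teach"
)

# Illustrative patterns: unfilled pauses marked with periods, filled pauses
# in parentheses, and vowel lengthening marked with ':' -- a simplified
# subset inferred from the excerpt, not the complete guidelines
patterns = {
    "unfilled pauses": r"\.{1,3}(?=\s|$)",
    "filled pauses": r"\((?:eh|er|em|mm)\)",
    "lengthening": r"\w:",
}

for label, pat in patterns.items():
    count = len(re.findall(pat, excerpt))
    print(f"{label}: {count}")
```

Run over whole transcripts, counts like these can be normalized and compared across speakers or L1 groups.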

For the Teacher

The development of large-scale learner language corpora such as LINDSEI has
provided a wealth of information about how learners actually use language,
as well as how their language use compares to that of native English speakers
or across different L1 backgrounds. For example, discourse and pragmatic
markers have been studied extensively through LINDSEI, revealing that L2
speakers generally tend to prefer some pragmatic markers and not others
compared to native English speakers (Aijmer, 2011; Buysse, 2012; Gilquin,
2008; Mukherjee, 2009). Pérez-Paredes, Sánchez-Hernández, and Aguado-
Jiménez (2011) conducted an analysis of adverbial hedges in the Spanish
component of LINDSEI, and found that hedges differed significantly between
Corpora and Online Databases  99

the Spanish students and a control group of native English speakers. Other
researchers have used LINDSEI to study fluency and accuracy in learner lan-
guage (Brand & Götz, 2011), grammatical phenomena such as articles and
prepositions (Kaneko, 2007, 2008), or word collocations (Mukherjee, 2009).
A study that I conducted with Brittany Polat (Friginal & Polat, 2015)
explored the various linguistic dimensions of LINDSEI. One of our primary
findings revealed a contrast between the involved conversational style of stu-
dents from countries such as Sweden, the Netherlands, and Germany and the
more informational focus of Japanese students. It was apparent from LINDSEI
transcripts that Swedish students were highly interactive in their interviews
compared to Japanese students, who exhibited very minimal responses. In-
herent cultural factors in face-to-face interviews may have influenced this
difference, especially in how these two groups of students provided elabo-
ration in their responses. Of all the interlanguage groups, the Japanese stu-
dents had the lowest total number of words (36,928), whereas the Swedish
group had 69,301 total words. In addition to first language background and
related cultural and social variables, there are other learner factors that can
also be investigated further. Average length of responses from students and,
more importantly, language fluency factors may be contributing to how stu-
dents make use of conversational features of discourse. Short and simplified
student responses have typically focused on content words and have signifi-
cantly fewer informal characteristics of speech such as transition words (e.g., I
mean, you know) and vague references (e.g., stuff, thing, all those). These types
of results may be used in teaching conversational English to some groups of
students in primarily English-speaking countries.
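Because the national subcorpora differ considerably in size, comparisons like those above rely on normalized rates rather than raw counts. The standard per-1,000-words normalization can be sketched in a few lines; the group word totals below are those reported above, while the raw marker counts are hypothetical, for illustration only:

```python
# Word totals reported above for two LINDSEI L1 groups
group_sizes = {"Swedish": 69_301, "Japanese": 36_928}

# Hypothetical raw counts of a discourse marker (e.g., 'you know');
# real figures would come from a concordancer run on the corpus
raw_counts = {"Swedish": 410, "Japanese": 95}

def per_thousand(count, total_words):
    """Normalize a raw frequency to a rate per 1,000 words."""
    return count / total_words * 1000

for group in group_sizes:
    rate = per_thousand(raw_counts[group], group_sizes[group])
    print(f"{group}: {rate:.2f} per 1,000 words")
```

Without this step, the larger Swedish subcorpus would appear to contain more of almost everything simply because it contains more words.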

The European Corpus of Academic Talk (EuroCoAT)


One of the most recently compiled collections of transcripts of spoken, pri-
marily dialogic, speech between instructors and students is EuroCoAT (www.
eurocoat.es), a 58,834-word highly specialized corpus comprising a total of
27 transcripts of conversations and consultations between students and fac-
ulty occurring during office hours at five different European universities. The
conversations were carried out in English and were collected and compiled by
researchers based in Spain. The creation of this corpus was part of the Erasmus
Plus project, which required a minimum English language proficiency of students
who received grants; this requirement, which varies among Spanish universities,
was used to assess student proficiency levels. This
corpus contains comprehensive descriptions of all speakers, including detailed
demographics: gender, L1, work experience (for lecturers), age, and students’
English proficiency levels. Additionally, student information, such as how long
students have resided in the foreign country prior to the recorded conversa-
tions, whether or not they recall having had previous conversations with this
lecturer outside of class, and how long each day individual students listen to
and speak in English while on Erasmus (MacArthur et al., 2014), is included.
What is unique about EuroCoAT is its inclusion of additional observa-
tional information about speakers during the course of the discussions such
as their positioning, their apparent comfort as they were recorded, and other
general observations. Additionally, the corpus includes information derived
from questionnaires completed by participants after the discussions reporting
their impressions about the ‘naturalness’ of the conversation, their comfort level
throughout, and the similarity of the recorded conversation to other conver-
sations they’d ordinarily have during regular office hours. Specific participant
positioning information, which could potentially be useful in multimodal
studies, is also included. Information is provided as to how the participants
were seated and their postures, what they were sitting on (e.g., swivel or stable
chair), their perspective with respect to the recording camera (e.g., centered or
to the left or right of the camera), toward what or whom they were facing, and
so on. Additional detailed information about the physical environment within
which the discussions occurred is also included: for example, background and
foreground of the room, furniture and equipment in the room, and the location
of the windows and doors (MacArthur et al., 2014).

B1.3 Spoken-Written Academic Corpora

The TOEFL 2000 Spoken and Written Academic
Language (T2K-SWAL) Corpus
The T2K-SWAL Corpus was designed to represent the range of spoken and
written registers that students encounter as part of a university education in
the US. The project was sponsored by the Educational Testing Service, with
the goal of providing a basis for test construction and validation. The regis-
ter categories chosen for the T2K-SWAL Corpus are sampled from across the
range of spoken and written activities associated with university life, including
classroom teaching, office hours, study groups, on-campus service encounters,
textbooks, course syllabi, and institutional writing (e.g., university catalog,
brochures). Texts from all these registers were sampled from six major disci-
plines (business, engineering, natural science, social science, humanities, edu-
cation), three levels of education (lower division, upper division, graduate), and
four universities (Northern Arizona, Iowa State, California State Sacramento,
and Georgia State). Table B1.2 shows the register categories and number of
texts and words of T2K-SWAL.
Classroom teaching, the largest of the spoken registers in T2K-SWAL,
includes both lecture-style and more interactive teaching situations.

Table B1.2  Composition of the T2K-SWAL Corpus

Register                     # of Texts    # of Words

Spoken
Classroom teaching 176 1,248,811
Classroom management 40 39,255
Office hours 11 50,400
Study groups 25 141,100
Service encounters 22 97,700
Written
Textbooks 87 760,619
Course management 21 52,410
Institutional writing 37 151,500

Classroom-management talk occurs at the beginning and end of class sessions,
to discuss course requirements, expectations, and past student performance.
Office hours are individual meetings between a student and faculty member,
for advising purposes or for tutoring/mentoring on academic content. (An of-
fice hour ‘text’ usually includes multiple meetings, between one faculty mem-
ber and the different students who came to her/his office.) Study groups are
meetings with two or more students who are discussing course assignments and
content. Finally, service encounters are situations where students interact with
university staff conducting the business of the university.
The written component of the corpus includes textbooks, course man-
agement, and institutional writing. Textbooks include only published books.
Written course management includes 10 syllabi ‘text’ files (196 syllabi totaling
ca. 34,000 words) and 11 course assignment files (162 individual assignments
totaling ca. 18,500 words). Each file combines multiple syllabi or assignments
taken from a single academic discipline and level. The main communicative
purpose of these texts is similar to classroom management: namely, communi-
cating requirements and expectations about a course or particular assignment.
As written texts, however, syllabi and assignments differ from classroom man-
agement in fundamental ways: They are not interactive, not negotiated, and
serve as a kind of formal contract for the course.

The International Corpus Network of Asian Learners of English
(ICNALE)
The ICNALE is a collection of ‘controlled’ essays and speeches by learners of
English in 10 Asian countries and territories, with 1.8 million words (spoken
and written texts) from 3,550 university students, both undergraduate and
graduate. L1 texts from 350 NSs of English are also included in the collection.
For comparison, the total number of L1 and L2 texts is 10,000: 4,400 speech
samples and 5,600 essays. A primary objective of ICNALE’s collectors is to
compile the largest (so far) Asian learner corpus intended for contrastive inter-
language analysis. The essays and spoken texts were collected from EFL learn-
ers (China, Indonesia, Japan, Korea, Taiwan, Thailand) and ESL learners (Hong
Kong, Singapore, Pakistan, the Philippines), and native texts from the US, the
UK, and Australia. The ICNALE project was developed by Shin Ishikawa of
Kobe University, Japan, with a team of collaborators. Table B1.3 provides a
description of topics/prompts and other items guiding the corpus collection.

Table B1.3  Description of topics/prompts and other items from ICNALE

Topics (both modules): Do you agree or disagree with this statement? Use
reasons and specific details to support your claim:
  (A) It is important for college students to have a part time job.
  (B) Smoking should be completely banned at all the restaurants in the
      country.
Time allotment: 60 seconds for one speech (Spoken); 20 to 40 minutes for one
essay (Written)
Length: not controlled for the Spoken module (speakers were asked to speak as
long as they want during the collection period); 200 to 300 words (+−10%)
for the Written module
Dictionary use allowed? No (Spoken); No (Written)
Spell-checker use allowed? No (Spoken); Yes (Written)

Source: ICNALE home page, located at http://language.sakura.ne.jp/icnale/.

Aside from L1 background, ICNALE also provides data about the learners’
English proficiency, obtained from Test of English for International Commu-
nication (TOEIC) or TOEFL scores or from a standard vocabulary size test
(VST) (Nation & Beglar, 2007). Learners’ proficiency levels were classified into
A2 (Waystage), B1_1 (Threshold: Lower), B1_2 (Threshold: Upper), and B2+
(Vantage or higher). These levels correspond to the levels proposed in the CEFR
(Common European Framework of Reference).

B1.4  Varieties of English


As previously mentioned, the Bank of English (BoE) is a monitor corpus of
present-day English from the UK, Australia, and the US, which makes it ideal
for native English dialect comparisons. Texts from the BoE include books,
newspaper writing, magazine articles, various ephemera, and broadcasts (UK
and US news from the BBC and NPR). Other corpora of English varieties are
described in the following.

The British National Corpus (BNC)


The 100 million-word BNC has been used in more corpus-related studies than
any other corpus of English since its release in the early 1990s. The BNC was
a result of a collaborative project between two universi-
ties in the UK (the University of Oxford and Lancaster University), the British
Library, and three publishers (Oxford University Press, Longman, and W. &
R. Chambers). Overall, the BNC was designed as a general, monolingual, and
synchronic corpus of British English. This corpus comprises a collection of
over 4,000 written and spoken texts from a wide range of sources, originally
developed to represent a cross-section of modern British English from the later
part of the 20th century up to the 1990s. Written registers of the BNC include
newspaper articles, specialist periodical and journal articles, published academic
texts, popular fiction, published and unpublished letters and memoranda, stu-
dent essays, and non-academic prose and biography, among others. Spoken
registers include transcriptions of unscripted informal conversations and other
types of spoken language collected in different contexts such as formal business
meetings, radio shows, and parliamentary discussions. In total, 90% of all texts
in the BNC are dedicated to written registers, with only 10% for spoken regis-
ters. These texts were sampled to provide wide coverage of registers within the
100 million-word target. For longer registers such as fiction and biographies,
many written texts were capped at 45,000 words, taken from different parts of
a single-authored source. This process was used to limit the over-representation
of idiosyncratic parts of texts, including those from extended bibliographies or
indexes. Shorter texts (e.g., newspaper, interviews, and journal articles) were
included in full.
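This part-sampling strategy is easy to emulate for teachers compiling their own small corpora. A simplified sketch, assuming whitespace tokenization and a quota split into three evenly spaced chunks (the actual BNC sampling procedures were more elaborate than this):

```python
def sample_text(words, quota=45_000, n_chunks=3):
    """Take `quota` words from a long text as `n_chunks` evenly spaced
    slices, to avoid over-representing any one part of the source."""
    if len(words) <= quota:
        return words  # short texts are included in full
    chunk_size = quota // n_chunks
    # Starting offsets spread evenly across the text
    stride = (len(words) - chunk_size) // (n_chunks - 1)
    sample = []
    for i in range(n_chunks):
        start = i * stride
        sample.extend(words[start:start + chunk_size])
    return sample

# Toy example: a 'book' of 120,000 word tokens
book = [f"w{i}" for i in range(120_000)]
sampled = sample_text(book)
print(len(sampled))  # 45,000 words drawn from the beginning, middle, and end
```

Sampling from several points in a long text, rather than taking one contiguous run, keeps an index or bibliography from dominating the sample.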
Since its original release, the BNC has become very accessible to many re-
searchers, especially online. The complete corpus can still be purchased in two
versions on CD-ROMs, BNC World (2001), which was a second edition re-
lease, and its most recent XML version, BNC XML Edition (2007). Other
previous versions of the BNC available for purchase include a comparative and
parallel collection of one million written and spoken texts (BNC Sampler)
and a four million-word sampler from four different comparative registers
(BNC Baby). BNC XML Edition is completely POS-tagged using CLAWS.
This version is encoded following the Guidelines of the Text Encoding Initia-
tive (TEI) to match many text processing protocols used in many corpus and
computational tools. Various contextual and bibliographic information is also
included with each text from well-defined TEI-conformant header informa-
tion. Components of the BNC can be accessed through a few online sites as
part of the BNC consortium of universities in the UK (Oxford, Lancaster, and
Leeds) and those that are developed by unaffiliated individuals or groups (e.g.,
BNCWeb at Lancaster University; Intellitext: “Intelligent Tools for Creating
and Analysing Electronic Text Corpora for Humanities Research”; BNCWeb
at Oxford; Phrases in English (PIE) and the BNC; Davies’ BYU-BNC interface)
(Friginal & Hardy, 2014a).
Update: In late 2017, the Spoken BNC2014 was publicly released
by Lancaster University and Cambridge University Press. This new addition
to the BNC contains 11.5 million words of transcribed informal British En-
glish conversation, recorded mainly by British English speakers from
2012 to 2016. The recordings consist of casual conversation among friends and
family members and are designed to make the corpus broadly comparable to
and consistent with the demographically sampled original components of the
spoken BNC. This new addition is accessible online for free for “research and
teaching purposes.” Accessing the texts requires the creation of a free account
on Lancaster University’s CQPweb server (https://cqpweb.lancs.ac.uk/). Other
related information about the Spoken BNC2014 and a downloadable manual
and reference guide are also available.

The American National Corpus (ANC)


The BNC served as the model for the creation of the parallel American English
corpus, ANC (http://americannationalcorpus.org/), which has, so far, released
two versions of component sub-corpora totaling over 22 million words. Work
on the ANC is still ongoing, although at a very slow pace, and there has not
been a major update in the past few years, unfortunately. Unlike the BNC,
there is currently no user-friendly client program available for automatic online
searches of ANC texts, except for one that is designated for an n-gram search.
However, the ANC has downloadable text files totaling over 15 million words,
and the annotated versions of the corpus also include grammatically tagged
data using the Biber Tagger (Biber, 2010) and other XML annotations. The
ANC’s collection of texts reflects work done from early to mid-2000s. Table
B1.4 briefly describes the spoken and written texts from the current version
of the ANC. (Only those that may be directly used for language learning and
teaching are described here.)

Table B1.4  Registers of the ANC for language teaching

Spoken Registers
CallHome (Telephone; 24 files; 52,532 words): 24 unscripted telephone
conversations between NSs of American English covering a contiguous
10-minute segment of each call. CallHome transcripts are time-stamped by
speaker turn for alignment with the speech signal included in the corpus.
Charlotte (Face-to-face; 93 files; 198,295 words): The Charlotte Narrative and
Conversation Collection (CNCC) in the ANC contains 93 narratives,
conversations, and interviews representative of the residents of Mecklenburg
County, North Carolina, and surrounding North Carolina communities.
Information on speaker age and gender is included in the header for each
transcript.
Switchboard (Telephone; 2,307 files; 3,019,477 words): 2,307 texts of
spontaneous conversations averaging 6 minutes in length and comprising
about 3 million words of text, spoken by over 500 speakers of both sexes
from every major dialect of American English. (See Section B3.5.2 for
additional information on the Switchboard corpus.)
Spoken Totals: 2,474 files; 3,863,592 words

Written Registers
911 Reports (Government, technical; 17 files; 281,093 words): Full text of
reports released on July 22, 2004 by The National Commission on Terrorist
Attacks Upon the United States.
Biomed (Technical; 837 files; 3,349,714 words): Technical articles by American
authors obtained from BioMed Central, which publishes open access, peer-
reviewed biomedical research articles.
ICIC (Letters; 245 files; 91,318 words): The Indiana Center for Intercultural
Communication corpus of Philanthropic Fundraising Discourse consists of
fundraising texts, including case statements, annual reports, grant proposals,
and direct mail letters.
NY Times (Newspaper; 4,148 files; 3,625,687 words): The New York Times
component of the ANC consists of over 4,000 articles from The New York
Times newswire for each of the odd-numbered days in July 2002.
PLoS (Technical; 252 files; 409,280 words): The Public Library of Science is
an online, public domain journal consisting of scientific and medical
literature. Texts include articles written by American authors taken from
PLoS Medicine (2004–2005) and PLoS Biology (2003–2005). In addition to
technical articles, PLoS journals include editorials, commentaries, book
reviews, and essays.
Web Data Materials (Government; 285 files; 1,048,792 words): Web data
materials were drawn from public domain government websites, including
reports, speeches, letters, and press releases from the Environmental
Protection Agency, the General Accounting Office, the Japan US Friendship
Commission, the Legal Services Corporation, the National Center for Injury
Prevention and Control, and the Postal Rate Commission.

The International Corpus of English (ICE)

The ICE (International Corpus of English) project (http://ice-corpora.net/
ice/) was the result of work initiated in 1988 by the late Sidney Greenbaum,
who was at that time the Director of the Survey of English Usage, Univer-
sity College London. It is a collection of comparable corpora of varieties of
English spoken around the world. Each corpus is uniformly designed with a
common annotation format, with a total size of one million words. There are
500 texts of approximately 2,000 words each, all from the same registers
(news, parliamentary debates, lectures, etc.) and all dating from 1990 or later
(Nelson, 1996). The authors and speakers are all 18 years of age or older, were
either born in the target country or moved there at an early age, and were
educated in their respective countries through the medium of English. The
three primary goals of Greenbaum and his team in collecting these data were:
(1) to sample standard varieties from other countries where English is the first
language, such as Canada and Australia; (2) to sample national varieties from
countries such as India and Nigeria where English is an official second lan-
guage; and (3) to include both spoken and manuscript English as well as printed
English (Greenbaum, 1996). The ICE project has research teams in each of
the following countries: Australia, Cameroon, Canada, East Africa (Kenya,
Malawi, Tanzania), Fiji, Great Britain, Hong Kong, India, Ireland, Jamaica,
Kenya, Malta, Malaysia, New Zealand, Nigeria, Pakistan, the Philippines,
Sierra Leone, Singapore, South Africa, Sri Lanka, Trinidad and Tobago, and
the US. The spoken and written registers collected by the research teams for
the ICE project are shown in Table B1.5.
The ICE was intended primarily for comparative studies of emerging En-
glishes all over the world alongside ‘native-Englishes.’ The Asian varieties of
English available for free download from the ICE website feature countries/
territories where English has been used extensively as the language of business
and education. Although academic spoken language is very limited in ICE,
there are useful comparisons of spoken and written texts in professional settings
that may directly relate to academic discourses. Transcripts of class lessons, of-
ten with teacher and student interactions (mostly from teacher lectures), may be
extracted and compared across country groups.

Table B1.5  Spoken and written registers of the International Corpus of English

Spoken Texts (300 2,000-word samples)
  Dialogues (180)
    Spontaneous conversations (90)
    Telephone conversations (10)
    Class lessons (20)
    Broadcast discussions (20)
    Broadcast interviews (10)
    Political debates (10)
    Legal cross-examinations (10)
    Business transactions (10)
  Monologues (120)
    Spontaneous commentaries (20)
    Unscripted speeches (30)
    Demonstrations (10)
    Legal presentations (10)
    Broadcast news (20)
    Broadcast talks (20)
    Scripted speeches (10)

Written Texts (200 2,000-word samples)
  Student exams (10)
  Student essays (10)
  Social letters (15)
  Business letters (15)
  Learned humanistic (10)
  Learned social sciences (10)
  Learned natural sciences (10)
  Learned technology (10)
  Popular humanistic (10)
  Popular social sciences (10)
  Popular natural sciences (10)
  Popular technology (10)
  Press reportage (20)
  Administrative/regulatory directives (10)
  Instructional skills/hobbies (10)
  Press editorials (10)
  Fiction (20)

B1.5 Online Collections

BYU CORPORA: Suite of Corpora Created and Shared by Mark Davies


Mark Davies has single-handedly taken online corpus design, collection, and
free distribution to new heights with his suite of corpora available from “cor-
pus.byu.edu” (https://corpus.byu.edu/). His collections are massive, many
totaling over a billion words, and a few are constantly updated. The term
‘mega-corpora’ has been used to refer to his corpora, especially COCA, COHA,
Wikipedia Corpus, and GloWbE. Corpus of Contemporary American English
(COCA) is still the most popular among all of the BYU corpora and is argu-
ably most useful for English teachers. COCA, which was released online in
early 2008, covers a diverse collection of American English texts totaling ap-
proximately 520 million words (about 20 million words each year, and still
growing!) from 1990 to the present, across registers grouped into the follow-
ing categories: spoken, fiction, popular magazines, newspapers, and academic
journals. COCA’s online interface has been used extensively by researchers,
teachers, and students since its publication for various purposes including ma-
terials production in teaching lexico-syntactic features of English, collocations,
and synchronic word frequency changes across registers. COCA is comparable
to the BNC/ANC in terms of text types, with deviations especially with texts
included in its spoken data component. COCA’s spoken texts (20% of the cor-
pus) come from television news and interview programs for the most part and
not from the types of conversation data (e.g., face-to-face conversation, service
encounters, and telephone interactions) available in the BNC or other corpora,
such as the Longman Corpus. Davies (2009) maintains, however, that COCA’s
overall balanced composition means that researchers can compare data across
registers and achieve relatively accurate results that show patterns of change in
the language from the 1990s to the present. Related to COCA are the COHA,
also from Davies, and the Google Books Ngram Viewer or Google Books Cor-
pus collected by Google Inc., which are both time-stamped from the 1800s to
the present. The current list of BYU corpora is provided in Table B1.6.

Table B1.6  English corpora from Davies

English Corpora                                  # Words        Language or Dialect   Time Period
News on the Web (NOW)                            5.14 billion+  20 countries / Web    2010–yesterday
Global Web-Based English (GloWbE)                1.9 billion    20 countries / Web    2012–2013
Wikipedia Corpus                                 1.9 billion    English               –2014
Hansard Corpus                                   1.6 billion    British (parliament)  1803–2005
Early English Books Online                       755 million    British               1470s–1690s
Corpus of Contemporary American English (COCA)   520 million    American              1990–2015
Corpus of Historical American English (COHA)     400 million    American              1810–2009
Corpus of US Supreme Court Opinions              130 million    American (law)        1790s–present
TIME Magazine Corpus                             100 million    American              1923–2006
Corpus of American Soap Operas                   100 million    American              2001–2012
British National Corpus (BYU-BNC)                100 million    British               1980s–1993
Strathy Corpus (Canada)                          50 million     Canadian              1970s–2000s
CORE Corpus                                      50 million     Web registers         –2014

Source: Adapted from corpus.byu.edu.

The Corpus of Global Web-Based English (GloWbE, pronounced globe)
has a staggering 1.9 billion words from 1.8 million web sources in 20 different
­English-speaking countries. Texts are grouped according to where they came
from online (e.g., websites, web pages, or blogs) and the English dialects they
represent. The 20 countries currently in GloWbE include native varieties, such as
the US, Canada, the UK, Ireland, Australia, and New Zealand, and non-native
varieties, such as India, Sri Lanka, Pakistan, Bangladesh, Singapore, Malaysia,
the Philippines, Hong Kong, South Africa, Nigeria, Ghana, Kenya, Tanzania,
and Jamaica. Davies released the corpus and its online interface in April 2013.
Comparing corpus size, GloWbE is more than four times as large as COCA and
nearly 20 times as large as the BNC. Dialect studies with GloWbE can cover
international varieties of English, as they appear online, with cross-comparisons
focusing on British and American English texts (with more than 775 million
words of text for just these two dialects). Davies’s Corpus del Español is a 100
million-word diachronic corpus of Spanish (1200s–1900s),
funded by the US National Endowment for the Humanities. The corpus in-
terface (similar to COCA and COHA) allows users to search frequency data
and use of words, phrases, and grammatical constructions in different historical
periods. Registers of Modern Spanish (e.g., fiction, news) are also categorized.

For the Teacher

You might, at this point, want to go directly to Part C (“Corpus-Based Lessons
and Activities in the Classroom”) for sample classroom activities making use
of COCA.
For COCA beginners, I recommend the following: (1) Begin by becoming
completely familiar with Davies’s site. The HELP page has all the information
(FAQs) you will need to get started. Read the descriptions of the corpora and
make sure that you are fully aware of account options (e.g., a premium ac-
count is available for a fee, for a full year), some policies, and copyright infor-
mation. (2) Register using your email—this is needed so you can continue to
conduct searches. (3) Read or download the README files providing details
about how to conduct searches and comparisons.

Mark Davies Interview


Jack A. Hardy (Oxford College of Emory University) and I interviewed Davies
for a short feature on his journey into corpus-based (sociolinguistic) research
for our book (2014a). Davies began his academic career in Spanish and En-
glish Linguistics, focusing primarily on syntax and historical linguistics. When
asked about how he became interested in corpora and corpus linguistics,
Davies recounted a very pragmatic rationale. In the late 1980s and early
1990s, he had been conducting research on Spanish and also Portuguese. As
a non-native speaker of these languages, he realized that he did not have the
same intuitions about these languages that his native-speaking counterparts
had. Corpora offered him ways to potentially have an edge in research and
publishing in these areas. To Davies, corpora served as a “proxy access to
intuitions” about Spanish and Portuguese—what linguistic features speak-
ers of these languages actually use across registers.
Davies noted that historical linguists, especially in languages other than
English, are almost, by definition, corpus linguists because of the types of
data they collect and analyze. With the advent of computers and digitized
language, large amounts of data from various time periods could be stored,
organized, and explored much more easily. Early in his career, Davies
worked in more traditional models of functional language. In these models,
there was often the distinction made between ‘acceptable and unaccept-
able’ grammatical forms. With corpora, however, research has shown how
historical language change is very gradual, and that many linguistic features
become more or less acceptable over longer periods of time than previously
believed. His formal linguistics background taught him that lexis and syntax
were separable. Working with corpora has instead shown him the crucial
intersection of lexis and syntax in actual language in use.
From the early 2000s, since he has been at BYU, Davies’s research moved
away from studying historical changes to focus more on creating corpora,
especially in English, that can be accessed publicly online. He has con-
sciously targeted teachers and learners around the world to allow them
to conduct their own synchronic, diachronic, multi-lingual, multi-dialectal
studies using the data he collects and the online database he manages.
This has been a rewarding direction of scholarship for Davies who values
how he has been able to make a difference in many teachers’ and students’
lives. In fact, included on his BYU website is the option for researchers to list
their studies (e.g., journal articles, chapters, books) that have incorporated
datasets using his collection. More than 130,000 unique visitors check out
his corpora each month, with COCA being the most widely used corpus,
having more than 65,000 unique users each month.

Google Ngram Viewer/Google Books


The Google Ngram Viewer comprises over 5.2 million books from Google
Books, a subdivision of Google Inc. that has conducted an extensive scanning
of published manuscripts in order to create a database of electronic or digitized
texts. The number of scanned books comprises approximately 4% of all the
books ever written in English (Bohannon, 2011). This mega-corpus contains
over 500 billion words; the majority of them are in English (361 billion). Other
available languages in the corpus include French, Spanish, German, Chinese,
Russian, and Hebrew (Bohannon, 2011; Keuleers, Brysbaert, & New, 2011;
Michel, Shen, Aiden, Veres, & Gray, 2011). The Ngram Viewer contains data
that span from 1550 to 2008, although texts before the 1800s are extremely
limited and often there are only a few books per year. From the 1800s, the cor-
pus grows to 98 million words per year; by the 1900s, it reaches 1.8 billion, and
11 billion per year by the 2000s (Michel et al., 2011). Google Ngram Viewer
is composed of raw data encoded as frequency counts of n-grams, that is,
adjacent sequences of n items from a text.
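The sliding-window idea behind n-grams can be sketched in a few lines of Python (a toy whitespace tokenizer for illustration only, not Google’s actual processing pipeline):

```python
# An n-gram is a window of n adjacent tokens slid across a text.
# Toy example only: real pipelines handle punctuation, case, and
# tokenization far more carefully.
def ngrams(text, n):
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("corpora offer proxy access to intuitions", 2))
# → ['corpora offer', 'offer proxy', 'proxy access',
#    'access to', 'to intuitions']
```

Counting how often each n-gram occurs per year of publication is, in essence, what the Ngram Viewer plots.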
It is important to note that Google Ngram Viewer was not necessarily cre-
ated with language teaching or research in linguistics (or corpus linguistics)
as its primary application (Michel et al., 2011). The developers of the viewer
wanted to create a new approach to humanities research that they coined
Culturomics. Culturomics (www.culturomics.org/home) is a way to quan-
tify culture by analyzing the growth, change, and decline of published words
over centuries. This would make it possible to rigorously study the evolution
of culture using distributional, quantitative data on a grand scale (Bohannon,
2010). In an effort to prove the adequacy of and to provide clear impetus
for Google Ngram Viewer, Google-affiliated researchers have conducted a
series of studies to validate the usefulness and various applications of their
new program. One study, for example, showed that over 50% of the words in
the n-gram database do not appear in any published dictionary (Bohannon,
2010). In addition, it is argued that patterns and cultural influences of words
could be clearly established and tracked across time frames. The use of Google
Ngram Viewer and Culturomics, therefore, contributes academic and techni-
cal value to the study of culture, making it arguably a new cultural tool that
has several possibilities, including, of course, the teaching of new vocabu-
lary and syntactic patterns of English and other languages (Friginal & Hardy,
2014a).

Other Specialized Spoken and Written Corpora


Finally, the pioneering work of the Learner Corpus Association (LCA)
must be included in this section. LCA is an international association promoting
learner corpus research and providing an interdisciplinary forum for research-
ers to share results of their studies, corpora, and related projects. The LCA
hosts a biennial international research conference and maintains a compre-
hensive website (www.learnercorpusassociation.org/) which serves as a repos-
itory of data, published materials, and research tools for both members and
non-members. The group supports the compilation of learner corpora, both
written and spoken, in a wide range of languages and the design of innovative
methods and software. Members promote learner corpus research focusing on
SLA theory and applications in fields including foreign or second language
teaching, language testing, and natural language processing (e.g., automated
scoring, spell- and grammar-checking, L1 identification).
For the Teacher

I recommend exploring “The Learner Corpus Bibliography” (LCB) managed
by the LCA and located at www.learnercorpusassociation.org/resources/lcb/.
This bibliography compiles published studies related to learner corpus re-
search. The LCB has over 1,100 references (as of late 2017) and is updated
regularly. It is fully searchable by fields, such as author, title or publication
year, as well as by key words and languages (L1/L2). Electronic versions of
the articles (or links to web pages) may be available. Also featured in the LCA
website are lists and descriptions of spoken and written learner corpora col-
lected by various researchers or research teams worldwide (including learner
language other than English).
The founding members of the LCA are Gaëtanelle Gilquin, Sylviane
Granger, Fanny Meunier and Magali Paquot, all based at the Centre for En-
glish Corpus Linguistics, Université catholique de Louvain (Belgium). Recent
publications by LCA scholars, such as The Cambridge Handbook of Learner
Corpus Research (Granger, Gilquin, & Meunier, 2015), have covered emerging
models in speech annotation of learner corpora, statistics for learner cor-
pus research, and extensive historical overviews alongside future directions
(Friginal et al., 2017).

Related Non-English Corpora


In addition to Davies’s collection of Spanish and Portuguese texts, there is
an increasing number of non-English corpora available online. International
groups of linguists continue to collect corpora of specialized (national) lan-
guage varieties and also parallel texts to complement already existing data-
bases. Popular corpora of Chinese, Spanish, and Arabic languages have been
completed and analyzed, with historical/diachronic studies and cross-linguistic
comparisons as common research topics. The project Linguistic Variations in
Chinese Speech Communities (LIVAC) Synchronous Corpus (http://livac.org)
has collected and analyzed more than 450 million Chinese characters of media
texts from major Chinese speech communities, such as Beijing, Hong Kong,
Macau, Shanghai, Singapore, and Taiwan. The LIVAC is primarily a monitor
corpus (i.e., a corpus that is regularly updated), which also features POS-tagged
texts from a very large database of over 1.6 million word types. LIVAC word
types can be compared with English and other Western alphabetic languages.
The LIVAC research team has conducted several cross-linguistic compari-
sons, diachronic studies, and many sentiment analyses of global events such
as the Chinese press coverage of the US presidential elections. Corpus del
Español Actual (CEA), the Corpus of Contemporary Spanish, has over 540
million words, lemmatized and POS-tagged. The CEA is made up of the fol-
lowing texts: (1) the Spanish part of the 11-language parallel corpus Europarl:
European Parliament Proceedings Parallel Corpus, v. 6 (1996–2010); (2) the
Spanish portion of the trilingual Wikicorpus, which was obtained from Wiki-
pedia (2006); and (3) the Spanish part of the seven-language parallel corpus
MultiUN: Multilingual UN Parallel Text 2000–2009. The MultiUN corpus
utilizes texts of United Nations resolutions. The CEA provides a dedicated
online search interface, the CQPweb, which can be used to search for words,
lemmas, or sentence constructions.
B2  Collecting Your Own (Teaching) Corpus

Collecting a do-it-yourself (DIY) corpus for class use is certainly doable and,
when planned properly, quite manageable and (almost) fun! It does, however,
require a lot more time and effort—some corpora more than others—to con-
struct a pedagogically sound collection of texts. As discussed in Section A1, a
principled collection must include consideration of a few very important fac-
tors, not only related to the texts and their linguistic properties but also to
specific contexts, approvals and consents, and ethical concerns. Before you start
the actual collection of texts for your corpus, it is important that you first de-
velop your goals, objectives, and research questions if you plan to also use them
for related research. Although much can be understood through exploration
of any dataset, the way that your corpus will be organized, compiled, and
analyzed will depend upon your specific teaching goals and the questions you
want to answer.
Following a systematic procedure when collecting your corpus is quite sim-
ilar to developing your course syllabus after careful consideration of all your
contextual variables. You seek more information about your students, their
specific backgrounds, goals they have before and after completing your course,
and the various activities and evaluation items that you have. For materials
development and classroom research based on your own corpus, there are im-
portant considerations essential to maintaining a sound design. The corpus
texts you have are like your ‘participants’; in other words, these texts are your
individual observations. To create a corpus well designed for teaching pur-
poses, it is critical that you remember this. The same sampling issues associ-
ated with other social science research can be applied to creating a corpus by
looking at your texts this way. Each one is important and will show features
that represent a sampling of your target population. The following sections
describe what to do before, during, and after your collection of a corpus spe-
cifically intended for your classroom.
B2.1  Corpus Collection Process


In my work with Jack A. Hardy, we brainstormed the design of written corpora
for teachers’ use in writing classrooms and for large-scale research purposes.
We propose a two-part process from planning to actual collection of texts.
We believe that these procedures can address universal questions about the
collection of specialized corpora, both for direct use by classroom teachers and
their students, and for classroom-based researchers. We envision that these sug-
gestions could also be applied by materials developers who focus more on cre-
ating textbooks or lessons and activities for learners. Part 1 (“Preparation and
Pre-Collection”) includes (1) reading previous research and finalizing research
goals/questions, (2) understanding the feasibility of collection, (3) ethical con-
siderations, (4) representativeness and sampling, and (5) technical preparation.
Part 2 (“Corpus Collection”) focuses on (1) collection: recording, transcrib-
ing, and “cutting & pasting”; (2) organizing, storing, and filing options; and
(3) publishing and sharing. These topics are discussed next.

Part 1: Preparation and Pre-Collection


1 Reading previous research and finalizing research and teaching goals/
questions. Be familiar with what has already been done in your partic-
ular teaching and research setting. The most important reason for this is
that there may already be a preexisting and perhaps even publicly available
corpus that you can use. For the most part, there will be related studies
that have investigated similar questions and contexts (e.g., writing in IEPs,
writing across the curriculum, EAP in China), and they may have included
extensive descriptions of their corpus designs and collection approaches.
Related and especially new or innovative ideas in automatic transcription
or online data capture can help inform your own collection.
2 Understanding the feasibility of collection. You will have to articulate
clearly and specifically how your corpus will be collected. An important
question to answer is “Is it feasible for you to actually collect, organize,
annotate, analyze, and write about your data within the time frame you
have?” Asking your students to conduct a corpus-based study using a newly
collected corpus for a course project, for example, will be a very demand-
ing and time-consuming task. What, then, are your teaching goals in mak-
ing the corpus available for your students? You will have to make sure that
the construction of your corpus can be completed before you begin the
course, ideally, and that you are completely familiar with your data before
you start teaching with them. A proposed university-level writing corpus
by Hardy (2013, 2014), discussed later, followed a design plan that covers how texts
will be collected, how they will be organized, and the questions he wanted
to answer directly with the corpus. Making sure that the primary linguistic
and contextual constructs you are interested in are sufficiently represented
and defined is certainly very important.
3 Checking ethical considerations. Many English teachers will be interested
in collecting essays written by their students or recording their classrooms’
oral production during in-class presentations or peer-review or peer-
feedback activities. Be aware that these types of collection may involve
specific institutional review requirements and approval, and, definitely, the
completion of student consent forms before you begin. Creswell (2014)
outlines ethical issues to anticipate and consider during data collection,
some of which are directly applicable to teachers collecting texts (written
and spoken) produced by students: (1) protecting the anonymity of partic-
ipants, (2) storing data and destroying it after a set time, (3) planning for
ownership of the data and not sharing it with others, and (4) providing
an accurate account of the data. Creswell also emphasized that addressing
issues of confidentiality is important. In considering ethics in your corpus
collection, do not forget that any potential information that could identify
an individual should be conscientiously removed from the data. Names can
easily be replaced by numbers or pseudonyms. It will probably be challeng-
ing or impractical to consider reciprocity or to pay students for a small-
scale, classroom-based collection, but do your best to make provisions for
all participants to receive some benefits or recognition, even if only by ac-
knowledging that their contribution to the project will potentially support
continuing improvement of teaching and materials development in your
classroom. Finally, respect all of your participants and sites, and make it
clear that participation in your corpus collection is strictly voluntary.
• Ethics and online data collection: The internet makes it easy to collect
written data, such as blogs, social media posts (on Facebook and Twitter),
or texts from discussion boards like those in Reddit or ­Craigslist, whether
manually or by means of automated software programs (e.g., web crawl-
ers, web spiders, or web scutters). As online data mining research expands
and more scholars and industry marketers focus on the analysis of various
online registers, ethical considerations emerge, such as how to properly
represent authors and their ideas, even in private or noncommercial blog
sites, as you proceed to ‘copy’ texts for your corpus. Blogs, in general, are
considered to be under public domain, which allows teachers to collect
blog entries without the necessary author consent forms required by most
academic institutions in the US; however, the privacy and confidential-
ity of authors and their statements, ideas, pictures, or creative endeavors
must be ensured. Author names and their addresses are not necessary and
can easily be replaced by a specific author ID number or basic location,
such as city and state (Friginal & Hardy, 2014a). A list of “Copyright Law
Dos and Don’ts” posted by Scocco (2007) in an online blog includes the
following tips, which are relevant for teachers:
• Do use material under public domain: Public domain materials in-
clude government documents, materials produced before 1923, and
materials produced before 1977 without a copyright.
• Do quote something relevant to your work: The Copyright Act states
that short quotations for the purpose of criticism, commentary, or
news reporting are considered fair use. The quote, however, should
involve only a small portion of the work and not replicate the material.
• Don’t assume that if you credit the author/s, there is no copyright
infringement. Only use copyrighted material when you have explicit
permission to do so or if you make fair use of it.
• Don’t equate Creative Commons with totally free, no-need-to-cite
materials. While Creative Commons licenses are less restrictive than
standard copyright, they should not be interpreted as “free for grab.”
In order to understand what you can or cannot do with Creative
Commons material, you should check what kind of license it is using.
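The name-replacement step described in item 3 above can be scripted. A minimal sketch in Python, assuming you have already compiled the list of participant names; the STUDENT_ ID scheme is purely illustrative, and automated replacement should always be followed by a manual check:

```python
import re

# Replace each known participant name with a stable, anonymous ID.
# Longer names are substituted first so that overlapping names
# (e.g., "Ann" inside "Annmarie") are handled sensibly.
def pseudonymize(text, names):
    for i, name in enumerate(sorted(names, key=len, reverse=True), start=1):
        text = re.sub(re.escape(name), f"STUDENT_{i:03d}", text)
    return text

sample = "Maria asked Jose to review Maria's draft."
print(pseudonymize(sample, ["Maria", "Jose"]))
# → STUDENT_001 asked STUDENT_002 to review STUDENT_001's draft.
```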
4 Representativeness and sampling. For research purposes in CL, corpus rep-
resentativeness and sampling will have to be strictly ensured by utilizing
applicable statistical tests. Representativeness can be accomplished by fol-
lowing linguistic and sampling (inclusion/exclusion) criteria. If you plan to
analyze rare linguistic forms or structures in laboratory reports, for exam-
ple, concordance lines of likelihood adverbs (e.g., evidently, predictably) or
that-complement clauses controlled by stance nouns, then you need a large
academic corpus to make sure that you extract enough occurrences of these
structures. If you are looking only at distributions of attributive adjectives
or past tense verbs, then a normalized frequency of these POS-tagged fea-
tures from a corpus of about 200,000 words may be enough as these features
occur very frequently in the texts. Baker (2006) suggests that a corpus with
at least 200,000 words may be enough for a discourse analysis study.
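Normalized frequency, mentioned above for the POS-tagged features, is the standard way to make counts comparable across corpora of different sizes. A quick sketch, with figures invented for illustration:

```python
# Normalized (relative) frequency: scale a raw count to a common base,
# here occurrences per 1,000 words, so corpora of different sizes can
# be compared directly.
def per_thousand(raw_count, corpus_size):
    return raw_count / corpus_size * 1000

# e.g., 540 attributive adjectives counted in a 200,000-word corpus:
print(round(per_thousand(540, 200_000), 2))  # prints 2.7
```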
For corpora collected primarily for teaching purposes, requirements for
representativeness and sampling do not have to be as stringent. A true ran-
dom sample is, after all, difficult to achieve, so aiming for a balanced cor-
pus that approximates the characteristics of the specific register of speech
or writing you are interested in would be a more realistic goal. One way
to do this is through stratified sampling, as determined by a “corpus
matrix.” This can be accomplished by clearly establishing your variables,
which will enable you to select your sub-registers, and then establishing a
balanced quota of texts for each group, which also helps you further define
your various criteria. For example, if your goal is to collect essays written
by international students in a US IEP, your variables may include students’
national backgrounds, university majors, year levels, or written text types
(e.g., argumentative essays, reports, reflection papers). If you have sufficient
time to collect your texts (especially online, written texts), then consider
collecting more text samples, say, at least 100 files per group, across a wider
range of categories. It will be very manageable for you to reach a million
words by manually copying and pasting from online sources, but try to
gain access to automated web crawling software, which can fast-track your
collection. Again, make sure that you only ‘crawl’ public domain sites.
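A “corpus matrix” of this kind can be laid out programmatically before collection begins. In this sketch, the variables, their values, and the per-cell quota are all invented placeholders to be replaced with your own design:

```python
from itertools import product

# Cross the design variables to enumerate every cell of the corpus
# matrix, then assign each cell a balanced quota of texts.
variables = {
    "L1": ["Chinese", "Korean", "Spanish"],
    "text_type": ["argumentative essay", "report"],
}
quota_per_cell = 25

matrix = {cell: quota_per_cell for cell in product(*variables.values())}

print(len(matrix))             # 6 cells (3 L1s x 2 text types)
print(sum(matrix.values()))    # 150 texts planned in total
```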
5 Technical preparation. Finally, there is no substitute for preparation before
you undertake a demanding corpus collection project. Sinclair’s (2005)
guide for corpus compilation emphasizes the necessity of specific criteria
for the inclusion or exclusion of texts and careful documentation of your
design and composition:
• Criteria for determining the structure of a corpus should be small in
number, clearly separate from one another, and efficient as a group in
delineating a corpus that is representative of the language or variety
under examination.
• Any information about a text other than the alphanumeric string of its
words and punctuation should be stored separately from the plaintext
and merged when required in applications.
• Samples of language for a corpus should, wherever possible, consist of
entire documents or transcriptions of complete speech events or should
get as close to this target as possible. This means that samples will differ
substantially in size.
• The design and composition of a corpus should be documented fully
with information about the contents and arguments in justification of
the decisions taken.
• The corpus builder should retain, as target notions, representativeness
and balance. While these are not precisely definable and attainable
goals, they must be used to guide the design of a corpus and the selec-
tion of its components.
• Any control of subject matter in a corpus should be imposed by the use
of external, not internal, criteria.
• A corpus should aim for homogeneity in its components while main-
taining adequate coverage, and rogue texts should be avoided.

Part 2: Corpus Collection

1 Collection: recording, transcribing, and cutting & pasting. In general, a ma-
jority of corpus-based studies that have been conducted using DIY cor-
pora have explored written language: not only ‘written’ language but also
digitally or electronically produced ‘written’ language. Online sources are
very easily copied and saved to individual document files. If you have a
collection of handwritten student notes, you will be required to retype
those notes from paper. Another convenient option, if the writing on paper
is typed or clearly handwritten, is scanning the paper samples and then
using an optical character recognition (OCR) software to convert these
hard copy texts into electronic files readable by the computer. You will find
many OCR software options (mostly for sale or licensing) accessible online.
You do need to convert your MS Word documents, PDFs, and other
formats to plaintext (or .txt files) for these to be read by concordancing
programs like AntConc. Notepad or WordPad programs available on all
Windows-based personal computers produce plaintext. In these files, text
is minimally formatted. For example, it does not include bolding, under-
lining, or font differences. Most .txt files are saved using the encoding
scheme ASCII (American Standard Code for Information Interchange).
This is a useful system because of its simplicity and ability to be used in
multiple formats and programs, including most corpus tools. With an in-
creasing amount of research being completed in languages that do not use
the Roman alphabet, however, there are limitations to ASCII. Whereas
ASCII can represent 128 characters, Unicode, another encoding scheme,
can represent more than 100,000 characters. MS Word allows you to save
your document (or .doc file) directly into a .txt file. Some advanced versions
of Adobe PDF programs will also allow automatic copying and pasting of
PDF texts into plaintext programs like Notepad.
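AntFileConverter covers PDF and Word files; for files that are already plain text but saved in older or mixed encodings, a short batch re-encoder can standardize everything to UTF-8 before concordancing. A sketch, where the folder names and the cp1252 source encoding are assumptions you would adjust:

```python
from pathlib import Path

# Re-save every .txt file in src_dir as UTF-8 in out_dir, the safest
# format for concordancers such as AntConc. errors="replace" swaps any
# undecodable byte for a placeholder instead of crashing.
def to_utf8_txt(src_dir, out_dir, src_encoding="cp1252"):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for f in Path(src_dir).glob("*.txt"):
        text = f.read_text(encoding=src_encoding, errors="replace")
        (out / f.name).write_text(text, encoding="utf-8")
```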
The .txt format is not commonly used in the sources from which cor-
pus collectors derive their texts. For example, most online sources are in
HTML (hypertext markup language) format. This is a system for coding
websites, creating the format that we see on internet browsers. It is good
to become familiar with the various ways of copying and saving texts,
and with how to convert other files. Look at options for batch converters
online (e.g., “convert Word/PDF to text” programs), which will make it
easier for you when you need to process hundreds of .doc or .pdf files. I
recommend Laurence Anthony’s (2013) batch text converter AntFileCon-
verter for this purpose. The program, which is freeware, works efficiently
in converting PDF and Word (DOCX) files into plaintext (http://www.
laurenceanthony.net/software/antfileconverter/).
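For HTML pages you have saved, Python’s built-in html.parser module can make a rough first pass at stripping markup; real web pages also carry menus, scripts, and other boilerplate, so treat this only as a starting point:

```python
from html.parser import HTMLParser

# Collect the text content between tags and join it with spaces.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = "<html><body><h1>My Blog</h1><p>First <b>post</b> here.</p></body></html>"
print(html_to_text(page))  # → My Blog First post here.
```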
Spoken discourse collected into a corpus adds an additional layer of work
for teachers. We need to record interactions, interview responses, classroom
presentations, and various other spoken data into audio or video files and
then transcribe these audio-video materials into plaintext files. A systematic
transcription convention is needed to make sure that all relevant variables
and speech features are captured. Various multimodal annotations of mark-
ers, such as dysfluencies (e.g., pauses, repeats, self-correction), overlaps and
interruptions, noise, or additional contextual information (e.g., laughter,
‘sarcasm,’ eye movement, focus, emotional descriptions), may have to be
added in the transcript. A good, reliable digital audio recorder is essential,
and having handy backup equipment will ensure that you do not lose im-
portant spoken data. Make a habit of checking your recording device to
ensure that it is actually recording speech, at the right volume level, during
your target event, and learn how to best save your audio files into your
PCs or laptops. Once again, it is helpful to have backup copies available for
these recorded files. There are a few currently available options for audio-
to-text transcription programs, which make use of voice recognition and
closed-captioning technology. Dragon Dictation (“Dragon Naturally Speak-
ing”) and its suite of related products can be used to produce text transcripts
of audio played in its interface. The accuracy of transcriptions may be af-
fected by the sound quality of your recorded event, but the software helps
by at least limiting the time needed for full, manual transcription.
2 Organizing, storing, and filing options. It will help to have an easy-to-
follow (and easy-to-remember) organization system in place for your texts.
Organizing the data will help facilitate multiple analyses later. Consider
how your students will use the texts as you plan and develop your filing
system. In addition to ease of use, having a system in place will allow you
to always know where to look in case you need text samples, excerpts of
student responses, or information about a particular file. Creating header
information and consistent use of titles, numbers, or codes can help you
to better organize and subsequently access your folders to meet specific
needs. A spreadsheet with information about your files (e.g., annotations
for length/number of words, speaker or writer, dates and locations) is also
almost always necessary. And keep a record of everything! Maintaining a
corpus collection log can help organize your thoughts at the moment, but
it can also help you when you need to revisit your data or if you begin to
work on the data with your students and need to quickly access certain
information. Two useful ways of organizing your data are through (1) a
master spreadsheet and (2) purposeful file names.
When we collect corpora, there are often dozens, hundreds, or even
thousands of texts included. Such a volume of data can quickly become
overwhelming if not well organized. For MICUSP, for example, the text
files were organized into folders based on different disciplines. These fold-
ers contain subfolders with texts organized according to other independent
variables, such as level (e.g., graduate vs. undergraduate) or paper type.
When working with a corpus, it is essential to create a master list of all of
your texts. Include information about your texts in a searchable format.
Your spreadsheet can easily be sorted based on your column categories.
This is useful, especially when dealing with a large number of texts, which
may be organized into folders that do not reflect the variable you are inter-
ested in at the time. Such a spreadsheet can also be imported directly into
statistical software, such as SPSS. Finally, label your files in ways that show
information that is important to your teaching goals. For example, MIC-
USP is labeled based on (1) discipline, (2) level of education, (3) participant
number, and (4) the number of texts submitted by that participant. Your
objectives for collecting the corpus may not require the analysis of quite
as many variables as the MICUSP project, but you may have one central
variable that you are investigating, such as L1 background or proficiency
level. Instead of using disciplines for your file naming system, you could
choose something that signifies your specific goal.
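The master spreadsheet described above can be generated directly from purposeful file names. This sketch assumes MICUSP-style names such as BIO.G0.02.1.txt (discipline.level.participant.text number) and a consistent naming scheme; adjust the fields to match your own system:

```python
import csv
from pathlib import Path

# Parse each file name into its design variables, add a word count,
# and write everything to a master CSV that can be sorted and filtered
# (or imported into statistical software such as SPSS) later.
def build_master_list(corpus_dir, out_csv):
    rows = []
    for f in sorted(Path(corpus_dir).glob("*.txt")):
        discipline, level, participant, text_no = f.stem.split(".")
        n_words = len(f.read_text(encoding="utf-8").split())
        rows.append([f.name, discipline, level, participant, text_no, n_words])
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "discipline", "level",
                         "participant", "text_no", "n_words"])
        writer.writerows(rows)
    return rows
```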
3 Publishing and sharing. You will most likely not have to publish or share
your teaching corpus, but you may need to have it readily available for your
next course or for your students to download their own copy for projects or
homework activities. Using Dropbox or Google Drive could be your best
option for sharing your texts with your students. I suggest that you create
an ‘agreement document’ signed by your students as a form of contract
with them, indicating that they will only use the data for class purposes
and will not disclose potentially personal or private information about in-
dividuals or other students that may have been captured in the texts. You
can also ask your students to discard the texts after they have completed
your course.

B2.2  Collecting Written Texts


Academic DIY corpora typically collected by teachers often include those writ-
ten by students and professional, published articles that can be used for various
analyses and comparisons in the writing classroom (see Section C1 for an ex-
tensive discussion of this approach). Register options from fiction (short stories
and novels) and personal writing (reflections and journals) to technical reports
are ideal for a range of concordancing activities. Ideas for corpus collection of
student and professional (published) texts are presented in the following.

B2.2.1  Student Written Texts


I collected a corpus of computer-based responses to the written component of
the Test of English as a Foreign Language (TOEFL) to match an already exist-
ing collection of TOEFL essays provided by the Educational Testing Service
(ETS). The ETS sponsors corpus-based studies of TOEFL data and, with the
necessary approval, may allow the sharing of texts for research purposes. In my
corpus collection, I wanted to compare essay responses from two controlled
prompts from a range of international students (graduate and undergraduate)
who took the TOEFL with essays from already matriculated native speakers
(NS) of English in a US university.
For students coming from non-English-speaking countries, L2 writing ability
is an important consideration measured by compulsory tests like the TOEFL
(or IELTS—International English Language Testing System) before these inter-
national students gain formal admission to US universities. For NS of English,
there is no specific TOEFL score necessary for admission, but most of these stu-
dents are required to take composition courses which aim to develop a range of
writing abilities focusing on content, grammar, mechanics, and style of written
outputs expected of all university students. Descriptions of NS writing relative
to NNS writing samples (or L1 vs. L2—used interchangeably here) in an assessment
setting are still limited at present, especially those that make use of advanced
corpus tools that show generalizable, quantitative data of the linguistic character-
istics of academic writing. I argued that it would be interesting, and potentially
useful, to examine the quality of NS/NNS writing, and produce comparative
data that illustrate the degree of variation, gaps, or unique distribution of salient
linguistic features of writing among and between these groups of students. The
descriptions of the linguistic characteristics of writing samples from NS/NNS
have important pedagogical implications that apply directly to materials pro-
duction and syllabus design, to aid the development of L2 writing for NNS, and
support easier transition to advanced, genre-specific writing for NS students.
My focus here, specifically, was upon IEPs in US universities which are
tasked to prepare international undergraduate and graduate students to meet
the rigors of formal and informal academic writing across disciplines. Writing
ability, numerically quantified and measured by TOEFL and IEP test scores,
suggests whether or not a foreign student will be able to successfully meet the
writing requirements of his/her field of specialization. Conversely, many
writing instructors in freshman composition classes have reported the surprising
range of writing abilities of NS students, which points to the importance of
further studying their usage of linguistic features typically identified as in-
dicators of writing quality (e.g., transition words, epistemic adverbials, pas-
sive/active structures, verb tense/aspect). Relevant and important questions,
then, were: How well do NS students actually write in English when given the
same prompts developed for international students, that is, TOEFL prompts?
What is the linguistic composition of effective or ineffective essays written
from TOEFL prompts by students with varying first language backgrounds and
levels of fluency in English? And what do the results of these comparisons
imply, particularly with respect to ETS and various IEP programs in the US, for
the design of writing tests and the assessment of NNS writing, especially if some
NS data also show clear and specific limitations in the quality of academic
writing by NSs of English? My corpus was developed with these goals in mind,
to provide exploratory data and answers to these and other related questions.
The importance of a corpus-based comparison of learner (i.e., NNS)
TOEFL essays with a NS target corpus is supported by previous research
such as Granger, Hung, and Petch-Tyson (2002), Mukherjee and Rohrbach
(2006), Crossley and McNamara (2009), and O’Donnell, Römer, and Ellis
(2013), to name only a few. By comparing NS/NNS corpora, for example,
instances of L2 learners’ preferences (I hesitate to refer to under- or over-use of
a particular linguistic feature) are documented directly to show areas where
NNSs may have limitations or immediate successes in full-time undergradu-
ate and graduate writing contexts. The Santa Barbara Corpus of Spoken and
Written American English and the T2K-SWAL have been used to describe the
linguistic features of US-based academic writing against L2 writing, as well as
to design language teaching materials addressing the needs of foreign students
in US universities. However, in most of these previous comparisons, the types
Table B2.1  List of linguistic features compared

a  Lexico-Syntactic Complexity: Prepositions; coordinators/conjunctions;
   complement clause constructions; quantifiers; transitions (adverbial
   connectors, metadiscourse markers)
b  Expression of Stance: Explicit expressions of assessments, evaluations,
   synthesis; use of markers of intensity and affect; use of discourse markers
c  Vocabulary Use: Vocabulary size (type/token ratio); average word length;
   content word classes (e.g., nouns, verbs, adjectives, adverbs);
   nominalizations; hedges/vague references/key words
d  Comparison of Informational Content: Discourse markers of elaboration and
   informational content; distribution of semantic classifications of nouns,
   adjectives, and adverbs
e  Formal vs. Informal Stylistic Features: Pronouns (especially first- and
   second-person pronouns); contractions; tense/aspect shifts; that deletion;
   private verbs (e.g., believe, think)

of writing, including the use of prompts, and the circumstances of written pro-
duction (e.g., timed, with required word-limit and assessed for quality) have
not been methodically controlled in the corpus collection. There is a need to
carefully design parallel corpora of students’ writing that will clearly show the
influence of variables such as first language background, overall language abil-
ity, and prompts in written outputs. By using two similar TOEFL prompts and
following the same instructions and conditions of production from those uti-
lized in the existing corpus of NNS TOEFL responses, the NS corpus provided
appropriate baseline data about the range of NS writing in US universities.
The NS essays represent ‘real-world’ samples of the writing foreign students
could expect from their US peers. In addition, the use of a controlled corpus of NS/
NNS essays addresses essential research validity and reliability issues in these
types of corpus-based comparisons.
My NS/NNS TOEFL responses corpus maintained a balanced number of
texts (and students) per group. There were 320 NNS and 310 NS students who
responded to two essay prompts using a computer, with 30 minutes provided
per prompt. All the essays (N = 1,260) were evaluated for quality of writing
following a rubric that is similar to the one used in TOEFL assessment. The
NNS essays had 140,800 total number of words (an average of 220 words per
essay), while the NS group had 192,200 words (an average of 310 words per
essay). My initial analyses focused on the following features (Table B2.1), first to
obtain comparison data and second to develop IEP teaching materials based on
these results. Results and particular teaching implications of these comparisons
are discussed in Friginal, Li, and Weigle (2014) and Weigle and Friginal (2015).
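Descriptive figures like these are straightforward to recheck when documenting your own corpus. The short Python sketch below recomputes the essay counts and per-essay averages from the group sizes and word totals reported above:

```python
# Rechecking the corpus figures reported above: students per group,
# two prompts each, and total word counts per group.
groups = {
    "NNS": {"students": 320, "prompts": 2, "total_words": 140_800},
    "NS":  {"students": 310, "prompts": 2, "total_words": 192_200},
}

for name, g in groups.items():
    essays = g["students"] * g["prompts"]
    avg = g["total_words"] / essays
    print(f"{name}: {essays} essays, {avg:.0f} words per essay on average")

total_essays = sum(g["students"] * g["prompts"] for g in groups.values())
print(f"Total essays: {total_essays}")  # 1260, matching N = 1,260 above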
The following section is an undergraduate student corpus collection proposal
by Hardy (2013, 2014) for a large-scale dissertation project, which illustrates
the context and goals of the study along with what will be done to prepare for
the collection of student writing samples in a target two-year college. Hardy
has since used this corpus for various research and teaching applications, with
data informing work with students who attend university writing centers for
tutoring purposes.

Disciplinary Specificity in Student Writing:
A Proposal for Corpus Collection

Jack A. Hardy, Oxford College of Emory University, Oxford, GA, USA

I propose a study that combines both situational and textual analyses of un-
dergraduate student writing at a private, two-year college that offers students
a small-campus, intensive liberal arts experience. My study’s aim is to explore
issues of student writing in different disciplines along Flowerdew’s (2002) con-
textual to textual continuum. During the study, opposing points of the contin-
uum can be used to inform the understanding of a given analysis. For example,
as I conduct qualitative interviews and focus groups, my questions may be
influenced by my understandings of the texts. Similarly, textual analysis will
be informed and understood by the situational characteristics and functional
purposes of the assignments. These are the research questions for this study:

1 How does context (e.g., the school, faculty expectations, and disciplinary
differences) influence the writing practices of lower-level undergraduate
students?
2 What textual features are associated with successful writing of lower-level
undergraduate students?
3 How does context interact with rhetorical and linguistic forms?

The Context
The setting is a two-year institution that awards associate of arts (AA) degrees
to students. Upon graduation, the majority of students transfer to a ‘receiv-
ing,’ nationally recognized, private university in the same state to finish their
undergraduate careers. This private university ranks among the “Top 20” of
national universities in the US, and its students are held to high standards. The
proposed study will include popular fields or majors in the AA institution such
as psychology, biology, political science, physics, English, and philosophy.
In terms of writing practices, students in this two-year college are only re-
quired to take one semester of freshman reading and writing. However, many
students complete Advanced Placement (AP) tests to avoid this course. Also, all
students who graduate from the college must take another course that has
been identified as writing rich. Such courses emphasize learning disciplinary
conventions, writing as a process, and creating a polished product (or prod-
ucts) of at least 20 typed pages total. Many professors prefer to assign multi-
ple, shorter assignments that can be revised and resubmitted over the course
of the semester instead of a single, long paper due at the end of the semester.
The international student population in this college has increased dra-
matically over the last 10 years. In fact, the freshman class of 2016, for
example, consists of nearly 27% international students. This newly linguis-
tically heterogeneous situation has caused much discussion and attempts at
pedagogical reforms among faculty at the institution. Professors’ pedagogi-
cal practices that once assumed high levels of English proficiency from their
students have needed to be adjusted, and more English language help has
been necessary in the composition courses and the campus’ writing center.
One aspect of this context that will facilitate the data collection pro-
cess is that the target institution is primarily a teaching college. Although
associated with a research university, the faculty members at the two-year
college are not expected to publish as much in their discipline as those on
the receiving campus. Instead, many faculty members specialize in research
that involves pedagogical practice. Many of them have also expressed in-
terest in collaborating with me in the pedagogically based subareas of their
respective disciplines. Others have invited me to help their students write
for their classes. For these reasons, data collection is not anticipated to be as
great a burden as it has been for many other compilers of similar corpora.

Preparation
Having taught for a year at this college, I regularly met with other faculty
members and attended division and faculty meetings. Because I taught my
freshman composition course using an EAP approach, my students wrote
targeted pieces as if they were writing in psychology, then anatomy and
physiology, then education. Finally, students conducted their own mini
genre studies and wrote research papers as if they were in an applied lin-
guistics course. Their area of investigation for these research papers was
the academic discipline(s) they were considering majoring in upon comple-
tion of their AA degree. They became ethnographers and textual analysts
of their own intended future discourse communities, as recommended by
Johns (1995, 1997). Students interviewed faculty members, surveyed stu-
dents, familiarized themselves with syllabi and degree requirements, and
analyzed linguistic and rhetorical features of writing.


Not only did reading several dozen student-written research papers
help me understand the context I would be investigating, it also introduced
the faculty members in those disciplines to the potential of EAP practices.
Several professors who were interviewed for those course projects became
excited that their students were interested in learning about disciplinary
practices and were more than happy to help. Many instructors have con-
tinued to be excited at the potential for more discussion on writing in the
disciplines as it pertains to these lower-level students who must adapt from
course to course as they read and write in a broad array of disciplines.
In preliminary discussions with faculty members, one thing that I found
particularly relevant is the label that they prefer. Although Belcher (1989)
has influenced applied linguists to use the categories and labels of ‘hard’
and ‘soft’ disciplines (e.g., Hyland, 2008a), these labels were vehemently
protested by some of the target faculty from the proposed study. Social
scientists, in particular, felt that such a label negated the rigor needed for
their disciplines. For that reason, I will avoid using such terms, and try to
use more neutral language.
Faculty cooperation is essential for this process. I feel it is important to
maintain a close and open relationship with them, treating them as experts
in their respective areas and ideal informants for disciplinary practices. They
are central members of the target discourse communities, or communities
of practice, which I want to understand better. In addition to faculty sup-
port, I have built contact with administration at the college. In preparation
for this study, I have been in contact with the Dean of Academic Affairs, the
Director of the Center for Academic Excellence, and the Director of Institu-
tional Research. These faculty and administrative contacts are important for
both open collaboration and institutional access.

Disciplines
Although the best representation of student writing as a whole would in-
clude simple random sampling, the proposed corpus has the purpose of
comparing disciplinary writing. Thus, a more purposeful sampling must be
made to avoid the potential for obtaining much more writing from courses
that require more writing (e.g., English courses) compared to their natu-
ral science counterparts. In the planning of the corpus, three disciplinary
groupings have been made: humanities (HU), history and social sciences
(SS), and natural sciences and mathematics (NS). These three academic
divisions were predetermined because they exist as such at the target insti-
tution. Students at this school must also take courses from each division to
complete their AA degree after two years. Unlike the large-scale corpora of
Table B2.2  Corpora of student writing, organized by discipline

“AA corpus” (proposed)
• Humanities (HU): Philosophy, English
• History and Social Sciences (SS): Psychology, Political Science
• Natural Sciences and Mathematics (NS): Biology, Physics and Astronomy

MICUSP
• Humanities and Arts: Philosophy, English
• Social Sciences: Psychology, Political Science
• Biological and Health Sciences: Biology

BAWE corpus
• Arts and Humanities: Philosophy, English
• Social Sciences: Politics
• Life Sciences: Biological Science, Psychology

student writing, however, only two disciplines from each division have been
selected because of time constraints.
In the same way that the BAWE corpus was influenced in its choice
of disciplines by MICASE, so too was the proposed corpus influenced by
other corpora of student writing. This will allow for cross corpus compar-
isons. The disciplines to be examined are all present in both MICUSP and
the BAWE corpus, although grouped slightly differently by larger academic
area. Table B2.2 shows the disciplines under investigation and their coun-
terparts in the other corpora. One may notice that the BAWE corpus in-
cludes psychology in the area of life sciences and that the BAWE corpus and
MICUSP have a separate group, physical sciences, for physics.

Consent
Before any papers are collected, students will be asked to complete a con-
sent form. All of the students who agree to participate will also be asked to
fill out a brief questionnaire. This questionnaire will ask for information such
as age, gender, first language background, and intended major (students at this
school are not yet able to declare their majors).
A separate section of the questionnaire, including another consent form,
will ask this group of students if they would like to also participate in a qual-
itative follow-up study. These students will provide their email addresses
to be contacted after the corpus has been collected. A master list of par-
ticipating student information will be compiled and saved digitally. Code
numbers will be assigned to the students to maintain confidentiality when
dealing with their writing samples and future uses of the corpus.
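The code-number step can be handled with a few lines of scripting. The sketch below (hypothetical student names and a simple sequential counter) shows one way to build the master list so that only the code number travels with the writing samples:

```python
# A sketch of assigning code numbers to participants while keeping a
# master list; the student names here are invented.
import itertools

_counter = itertools.count(1)
master_list: dict[str, int] = {}

def code_for(student_name: str) -> int:
    """Return the participant's code number, assigning a new one if needed."""
    if student_name not in master_list:
        master_list[student_name] = next(_counter)
    return master_list[student_name]

print(code_for("Jane Doe"))  # 1
print(code_for("John Roe"))  # 2
print(code_for("Jane Doe"))  # 1 (codes are stable across lookups)
```

In practice the master list would be saved to a password-protected spreadsheet, as described above, and kept separate from the anonymized texts.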


Contextual Data Collection and Analysis


Because the proposed study will attempt to bring together both a contex-
tual and a textual understanding of undergraduate student writing, it is
important to incorporate qualitative data collection to help interpret the
textual analyses. Information associated with the courses and assignments
will be gathered, and interviews will be conducted with students and pro-
fessors. This will help answer the first research question of the study, which
is concerned with how the context interacts with writing practices.
The syllabi for each class studied will be collected and read to better un-
derstand the course and its assignments. Any expectations associated with
reading and writing will be collected and put into a digital spreadsheet.
Along with syllabi, I will also collect any handouts or instructions for the
particular assignments that are included in the corpus. These will be used
to qualitatively understand the tasks that students were asked to complete.
This step is essential to be able to discuss the extent to which student writ-
ing was able to match the expectations of the professors.
Understanding the tasks will also help in the labeling of the genres of
the assignments, following the analysis of purpose and structure of the
papers (Nesi & Gardner, 2012; Römer & O’Donnell, 2011). For this project,
I will supplement and compare these qualitative assignments of paper types
with quantitative methods. For example,
cluster analysis has been used to categorize text types (e.g., Biber, 1995;
Grieve, Biber, Friginal, & Nekrasova, 2010). This procedure can be used
to determine, without assigning labels a priori, which texts are more like
each other.
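As a rough illustration of the clustering idea, the sketch below groups texts by the similarity of their normalized feature counts using single-linkage agglomerative clustering. The file names, features, and values are invented; a real analysis would use many more features and a statistical package, but the principle of grouping texts without a priori labels is the same:

```python
# A toy illustration of cluster analysis for text types: each text is a
# vector of normalized feature counts (per 1,000 words), and texts are
# grouped by similarity without labels assigned in advance.
# All feature values below are invented for illustration.
import math

texts = {
    "PHI.101.1.001": [62.0, 11.5, 3.2],   # e.g., nouns, pronouns, passives
    "PHI.101.1.002": [60.5, 12.0, 3.0],
    "BIO.141.1.003": [88.0, 2.1, 9.8],
    "BIO.141.1.004": [85.5, 2.5, 10.3],
}

def distance(a, b):
    # Euclidean distance between two feature vectors
    return math.dist(a, b)

# Single-linkage agglomerative clustering: start with singletons and
# repeatedly merge the two closest clusters until two remain.
clusters = [[name] for name in texts]
while len(clusters) > 2:
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(distance(texts[x], texts[y])
                    for x in clusters[i] for y in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    clusters[i] += clusters.pop(j)

print(clusters)  # the two philosophy texts group together, as do the two biology texts
```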
The qualitative data collection will include focus groups of participat-
ing students. According to Hyland (2012), students may be less likely to
openly discuss their literacy practices than their professional counterparts.
He suggests that focus groups be used to scaffold and foster participation
among students. For the present study, these focus groups will be com-
posed according to discipline. However, if there is interest, another group
could include participants who are enrolled in more than one of the courses being
studied. Such students could talk about their experiences, interpretations,
and evaluations of reading and writing in those different disciplines and
genres.
Some of the questions I hope to explore in these student focus groups
include: To what extent are students aware of disciplinary variation? Are
they aware of faculty expectations? Do they feel that they adjust their writ-
ing accordingly? I will use a list of target questions such as these to guide
me through each focus group, allowing me to direct the groups to discuss,
at least minimally, the areas I am interested in. However, I will also allow
these students to deviate and continue to expand on the topic if they want.
With this method, there is potential for them to bring up issues of disci-
plinary student literacy practices that I had not previously considered.
Participating faculty members will also be interviewed. Ideally, at least
two faculty members from each discipline will be interviewed, providing
multiple perspectives of the situational context from the perspective of an
instructor. Toward the end of the semester in which the corpus is collected,
the 12 interviews should be conducted. This would provide time for me to
familiarize myself with some of the assignments and texts.
One important purpose of these interviews will be to clarify any ques-
tions about the instructors’ syllabi and the tasks that were assigned. In
addition, I will ask questions to better understand how the faculty mem-
bers view student writing in the courses they teach. These interviews will
be vital to understanding the expectations that faculty members have of
the writing of their students. The questions to be investigated are related
to the first research question of how the context, in this case the faculty,
view and potentially influence the writing practices of their students. Do
professors have students write to display knowledge? Or are there other
intentions (e.g., writing to learn, learning disciplinary practices, critical
thinking)?
The questions to faculty members will be driven by specific assignments
and student papers. Participants will be asked about what made those
samples successful. In a way, the semi-structured interviews involve asking
general questions about perceptions of writing but will be influenced by a
method similar to stimulated recall. That is, professors will be given a stim-
ulus (previously graded papers) to help them revisit and externalize their
beliefs and memories of the assignments.

Textual Data Collection


The following is a proposal for the design, collection, and annotation of this
corpus of student writing. (Note: For this book, this proposed corpus is
referred to as the AA Corpus.) As a liberal arts institution, this college requires
students to have breadth in their scheduling of courses. Students must thus
balance multiple disciplinary practices while potentially only recently being
exposed to them.
It is important to recognize time and space limitations in corpus design.
A comprehensive corpus of student writing would include all student writ-
ing (Krishnamurthy & Kosem, 2007); however, one must remember the
purpose of his or her corpus in its design. An analogy of sampling can be


made using more concrete terms: Although researchers at a pharmaceutical


company might want to understand how a new medication affects every
type of human being, such an experiment is not only unrealistic but also
might not even be necessary if the drug was created and indicated for a
specific segment of the overall population.
With that in mind, an EAP professional may have specific target disci-
plines, tasks, genres, and levels for her students because she has conducted
a context-specific needs analysis. Often, researchers who compile corpora of
academic writing do so in order to understand the specific practices of suc-
cessful members of a discourse community in order to provide materials and
design lessons for those wanting to enter such communities. This proposal
describes a principled corpus collection procedure of lower-level undergradu-
ate student writing. The important considerations to the corpus design have
been influenced by the culture of the institution, the needs of the students,
and a desire to be comparable, especially with MICUSP and the BAWE corpus.
In sum, an essential piece of this project is the collection of a corpus
that is representative of successful student writing from the institution and
in the disciplines being investigated. A corpus of successful student writing
from six disciplines will be collected: philosophy, English, psychology, po-
litical science, biology, and physics. The samples will come from freshmen
and sophomores, representing lower-level undergraduate student writing
in varied disciplines across the curriculum. Only papers that receive grades
of A or B will be included in the study. The rest of this section describes in
detail how this corpus will be collected.
The BAWE and MICUSP creators have advocated the use of electronic
submissions for the actual collection of texts (Alsop & Nesi, 2009; Römer &
O’Donnell, 2011). Both were very large-scale projects, involving several re-
searchers, dozens of faculty members, and hundreds of students. Also, the
BAWE corpus was collected over multiple universities, and MICUSP was
collected at a very large institution. These factors made digital submissions
ideal. The proposed context, however, is a very small college where faculty
members from all disciplines know and often socialize with each other.
Also, I have personal relationships with most of the professors who teach
the target courses. These differences should facilitate data
collection. The most direct way to gain access to writing is through the
professors instead of the students. Dozens of texts can be collected at a
time this way, and the person who assessed the writing can directly provide
those scores. During the in-person, digital or manual (copies and/or scan-
ning) collection of student papers, professors will be asked to include the
assessment scores the participating students received.
Original paper submissions will be scanned and returned to professors.


During this process, they will be kept in a locked, secure office at the data
collection site. Upon digital submission of groups of student papers, the
texts will be cataloged in an Excel spreadsheet. The documents will be an-
onymized, with any personal information (e.g., name, student ID number)
being replaced by the participant ID number assigned to them during the
consent process. This number will also be used to name the files. For exam-
ple, a physics paper might have the file name PHY.101.1.123, which would
show the name of the course (Physics 101), the assignment number (1), and
the participant ID number (123). This spreadsheet will also include contex-
tual information about the assignments collected, including the genre label
given by the instructor.
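A naming scheme like this is easy to generate and parse programmatically. The sketch below assumes the SUBJECT.NUMBER.ASSIGNMENT.PARTICIPANT pattern described above:

```python
# Parsing the file-naming scheme described above:
# SUBJECT.NUMBER.ASSIGNMENT.PARTICIPANT, e.g., PHY.101.1.123.
def parse_filename(name: str) -> dict:
    subject, course, assignment, participant = name.split(".")
    return {
        "course": f"{subject} {course}",
        "assignment": int(assignment),
        "participant_id": int(participant),
    }

info = parse_filename("PHY.101.1.123")
print(info)  # {'course': 'PHY 101', 'assignment': 1, 'participant_id': 123}
```

Parsing the names this way also makes it simple to cross-reference each text with the rows of the contextual-information spreadsheet.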
Because the purpose of this corpus is for internal use and will not be
made public, as its creator, I will have sole access to the spreadsheet with
demographic information about the student writers. Such variables will
easily be manipulated there, and I do not anticipate the need to mark up
individual files for metadata in the same way as MICUSP (O’Donnell &
Römer, 2012).
Copies of these texts will be saved in the format in which they were
turned in for future use and ease of reading. To be readable by most corpus
programs, however, the texts will need to be saved as txt files. Instead of
annotating individual files for group variables (e.g., discipline, course, sex,
language background), different copies of the corpus can be made. The pri-
mary corpus organization will be according to discipline. Six folders will be
created for each of the disciplines studied. Individual courses will then make
up subfolders. Depending on the data collection process, another subfolder
can be created for particular assignments. Further linguistic analyses using
corpus software packages like AntConc and WordSmith Tools would benefit
from easy access to different texts based on various groups. Thus, copies of
the corpus may be made that organize the texts based on variables such as
paper type or language backgrounds.
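Making alternate copies of the corpus organized by different variables can be scripted rather than done by hand. The sketch below shows one way to do it in Python; the folder layout and the metadata column names (discipline, course, filename, l1_background) are hypothetical stand-ins for whatever the spreadsheet actually records:

```python
# A sketch (hypothetical paths and column names) for making an alternate
# copy of the corpus grouped by a metadata variable such as L1 background.
# The primary corpus is assumed to live in discipline/course subfolders.
import csv
import shutil
from pathlib import Path

corpus = Path("AA_corpus")

def copy_by_variable(metadata_csv: str, variable: str, dest_root: str) -> None:
    """Copy each text into a folder named for its value of `variable`."""
    with open(metadata_csv, newline="") as f:
        for row in csv.DictReader(f):
            src = corpus / row["discipline"] / row["course"] / row["filename"]
            dest = Path(dest_root) / row[variable]
            dest.mkdir(parents=True, exist_ok=True)
            if src.exists():
                shutil.copy2(src, dest / row["filename"])

# e.g., copy_by_variable("metadata.csv", "l1_background", "AA_corpus_by_L1")
```

Because the copies are generated from the spreadsheet, they can be rebuilt at any time, and the primary discipline-based organization never has to be disturbed.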
After naming and organizing the text files, they will be ready to be
tagged using various tagging programs. I plan on having this corpus an-
notated using the Biber Tagger (Biber, 2010). This tagger tags each word
in the corpus with grammatical information (e.g., parts of speech, tense,
and syntactic structures) and, for some words, semantic information (e.g.,
common verbs of possibility, adjectives of size, and nouns of place). These
tags will then be counted, and the totals for each text will be normalized
per 1,000 words.
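Normalization per 1,000 words is a simple proportion; the sketch below shows the computation with invented counts:

```python
# Normalizing a raw feature count to a rate per 1,000 words; the counts
# here are invented for illustration.
def normalize(count: int, total_words: int, basis: int = 1000) -> float:
    """Return the rate of `count` occurrences per `basis` words."""
    return count / total_words * basis

# e.g., 57 passive constructions in a 3,800-word paper:
print(normalize(57, 3800))  # 15.0 passives per 1,000 words
```

Normalizing this way makes texts of different lengths directly comparable, which matters here because papers from different courses vary considerably in length.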
B2.2.2  Published or Professional Written Texts


Gray (2013) collected her corpus of disciplinary writing in three sub-registers
of published academic research articles (RA). Her first step in building a princi-
pled RA corpus that also represents distinct types of research within and across
disciplines focused on an extensive survey of academic journals in 11 academic
disciplines. She then narrowed down her target to six disciplines for inclusion
in the corpus: philosophy, history, political science, applied linguistics, biology,
and physics. This final selection was based on several competing factors, includ-
ing the ability to (1) represent each article type with at least two disciplines,
(2) represent disciplines with multiple article types to the extent possible,
(3) include disciplines from a range of academic areas, and (4) relate findings to
previous disciplinary research.
The next step was to develop discipline-specific processes for sampling
and accurately labeling articles according to the research paradigm or re-
search design reported on in the article. She consulted discipline experts
in the selection of high-quality journals to serve as the source for articles
compiled in the corpus. These experts also helped in validating discipline-
specific definitions to accurately categorize the articles by research type and
primary method or design (see, e.g., Gray, 2013). Three types of primary
research designs in her subgroups are (1) theoretical, (2) qualitative, and (3)
quantitative research. Gray explained that theoretical articles have no observed
data, but rather present arguments to advance and explore theoretical con-
cepts in the field. Qualitative research studies focus on the observation and
description of empirical data, make little or no attempt to quantify that data,
and are typically observational in nature. Quantitative articles (observational
or experimental) also analyze observed data; however, those data are numer-
ically based and often involve a comparison of groups. These groups are im-
portant in addressing her research and teaching goals in using the corpus. For
example, in student research writing, especially for graduate-level papers,
theoretical, qualitative, and quantitative papers are going to have distinct
patterns and features with which learners (and even professionals) will have
to be very familiar.
Table B2.3 describes Gray’s final corpus. Each discipline is represented
by only those research types prevalent in that discipline. Each discipline/­
register combination is represented by 30 research articles selected from 8
to 10 peer-reviewed journals that reflect a variety of areas or sub-disciplines
within the field. The resulting 270-text corpus comprises approximately two
million words and enables comparisons of the same journal register across
multiple disciplines (e.g., quantitative research in political science, applied
linguistics, biology, and physics), and to a lesser extent comparisons across two
registers within a single discipline (e.g., qualitative and quantitative research
in political science).
Table B2.3  Gray’s corpus description in number of words (30 texts per
discipline/category)

                      Theoretical   Qualitative   Quantitative   Total
Philosophy            280,826       –             –              280,826
History               –             282,898       –              282,898
Political Science     –             191,791       230,386        422,177
Applied Linguistics   –             237,089       202,871        439,960
Biology               –             –             154,824        154,824
Physics               194,029       –             183,279        377,308
Total                 474,855       711,778       771,360        1,957,993
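Totals like those in Table B2.3 are worth rechecking programmatically when assembling a corpus of this kind; the sketch below recomputes Gray's grand total from the table's cells:

```python
# Recomputing the totals in Table B2.3; cell values are taken directly
# from the table.
gray_corpus = {
    "Philosophy":          {"Theoretical": 280_826},
    "History":             {"Qualitative": 282_898},
    "Political Science":   {"Qualitative": 191_791, "Quantitative": 230_386},
    "Applied Linguistics": {"Qualitative": 237_089, "Quantitative": 202_871},
    "Biology":             {"Quantitative": 154_824},
    "Physics":             {"Theoretical": 194_029, "Quantitative": 183_279},
}

grand_total = sum(sum(cells.values()) for cells in gray_corpus.values())
print(grand_total)  # 1957993, i.e., approximately two million words
```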

As another example, I was the lead author of a collaborative project with
colleagues (especially Sabah Slebi Mustafa) from the University of Baghdad,
Baghdad, Iraq, which explored the linguistic structure of RA abstracts pub-
lished in US-based peer-reviewed journals and Iraqi publications (published
in Iraq and written by Iraqi authors) across four disciplines: (1) agriculture,
(2) nursing, (3) engineering, and (4) languages (linguistics, applied linguis-
tics, ESL/EFL, English) (Friginal & Mustafa, 2017). By using a specialized
corpus of US-based and Iraqi texts, a specific objective of this project was
to conduct a cross-cultural (i.e., based on country) and cross-disciplinary
comparison of abstracts written by scholars with differing first language back-
grounds and publishing in two different publication venues and contexts. The
commonality is the subjects’ use of English in these abstracts. Some Iraqi texts
may have been translated into English, with support from writing coaches
or translators, while US-based publications are not all necessarily written
by English native speakers. The focus here is to conduct a cross-linguistic
genre comparison to describe similarities and differences across these groups
of texts.
We were interested in developing materials for academic professionals in
Iraq and other international locations involved in the teaching of research re-
port writing focusing on various parts of the RA register. By focusing on ab-
stracts, we were also interested in teaching professionals how to write abstracts
intended for conference presentations, especially in English-based international
conferences. Swales and Feak (2009) note that there are “several million re-
search papers published every day” (p. 1). This figure takes into account
various forms of academic written output, from research-oriented technical
reports to interactive posters and reviews. It also includes papers from
countries such as Iraq, where recent reforms in higher education now require the
faculty’s active participation and commitment (Mustafa, 2015). In Iraq, the
production of research articles published in local universities has also become a
regular component of life in academia and the scientific community. Research
134  Tools, Corpora, and Online Resources

productivity, and the prestige attached to it, are measured in part by the
potential global circulation of Iraqi studies, presently more prevalent in the
form of written genres, in English (or translated in English from Arabic), that
follow strict writing conventions and expectations. Numerous attempts at in-
ternational publication, which also result in numerous failures or rejections, are
a common experience of faculty researchers in Iraq. In recent years, research
mentoring programs and collaborations between Iraqi scholars and researchers
based in universities such as those in the US have been funded by local and
international grants.
Current research in academia draws heavily on the immediate transfer and
dissemination of data, methodologies and approaches, and results that are made
possible by easy access to online journals and databases. These sites serve as
the primary sources of information for literature reviews, establishing research
gaps, and meta-analyses across settings and contexts. International scholars
are able to keep abreast of the current status of work conducted in
their areas of specialization, and updated syntheses of findings are
immediately shared globally (Cilveti & Perez, 2006). However, this rapid increase in
scholarly activity and the immediate publication of a wide range of studies pro-
duce many interrelated challenges, especially for international researchers (or
scholars outside of institutions in the US, the UK, and similar countries). With
English being the dominant language and the primary medium of publication
for the majority of academic or scientific research papers (Eggington, 2004;
Laborde, 2011; Swales, 1985), language use and the structuring or formatting
of research articles are expected to conform to international, very competitive
standards.
A cross-disciplinary comparison of abstracts has been previously con-
ducted by Melander, Swales, and Fredrickson (1997), using abstracts produced
by American NSs and Swedish NNSs in disciplines including linguistics,
biology, and medicine. In a similar vein, but focusing on cross-linguistic
comparison, Hu and Cao (2011) examined the use of hedges and boosters as
metadiscourse markers in English and Chinese abstracts, revealing patterns
of linguistic differences in how Chinese and English abstracts utilize hedging
structures in reporting results and making generalizations or conclusions.
Lopez-Arroyo and Menendez-Cendon (2007) also described and compared
the rhetorical and phraseological structures of medical research article ab-
stracts produced by English and Spanish authors. They found evidence of
cross-linguistic influences in phraseology, vocabulary use, tenses, and use of
other related discourse markers. Similarly, Kafes (2012) conducted a contras-
tive analysis of abstracts written by American, Taiwanese, and Turkish schol-
ars in the social sciences, reporting differing patterns of rhetorical structures
potentially influenced by the authors’ culturally and linguistically diverse
backgrounds. Kafes suggested that comparative results generally reflected the
Collecting Your Own (Teaching) Corpus  135

Anglo-American conventions of the academic discourse community in how


authors structure and organize their abstracts.
Our corpus of RA abstracts features four parallel sub-corpora comprising
RA abstracts in the aforementioned four disciplines from published, empirical
research papers in the US and Iraq. Overall, the texts collected in the four
corpora were written between 1995 and 2016 across a wide range
of publications and research approaches. Qualitative and quantitative studies
are included and coded but not analyzed in this paper. All US abstracts were
collected from reputable online databases (e.g., ERIC, ScienceDirect, EBSCO-
host) and from leading journals in the four disciplines. For Iraqi texts, the ‘pres-
tige’ of the publication and the nature of the review process (peer-reviewed or
not) associated with the RAs are not controlled but were coded in the corpus
collection. Some of the studies published in Iraq may have been published by
the author/s’ home university and not by an independent journal. Sample Iraqi
journals include the Iraqi National Journal of Nursing Specialties, Journal of Engi-
neering and Development, and The Iraqi Journal of Agricultural Sciences. The total
number of RA abstracts and the number of words per subgroup are shown in
Table B2.4.
Overall, Iraqi abstracts in the corpus are shorter than US-based abstracts
(averaging fewer than 200 words across all four disciplines). Up to five key
words are typically required. (Note that these key words were included in
the corpus collection.) As a common convention, Iraqi publishers specifi-
cally require the following components in an abstract: (1) brief purpose and
objectives of the research, (2) methodology and only the principal results,

Table B2.4  The corpus of US-based and Iraqi RA abstracts

Disciplines                  Iraqi RA abstracts           US-based RA abstracts
Agriculture                  150 (27,000 words;           225 (63,000 words;
                             180 average words per        280 average words per
                             abstract)                    abstract)
Nursing                      135 (26,055 words;           250 (67,500 words;
                             193 average words per        270 average words per
                             abstract)                    abstract)
Engineering                  180 (35,820 words;           250 (57,750 words;
                             199 average words per        215 average words per
                             abstract)                    abstract)
Languages (Linguistics,      210 (44,520 words;           250 (72,250 words;
  Applied Linguistics,       212 average words per        301 average words per
  ESL/EFL, English)          abstract)                    abstract)
Total                        675 abstracts (133,395       975 abstracts (259,500
                             words; 197.6 words per       words; 266.1 words per
                             abstract)                    abstract)
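The discipline-level figures reported in Table B2.4 can be tallied and cross-checked with a few lines of code. A minimal sketch, recomputing the Iraqi sub-corpus totals from the per-discipline counts (the figures are copied from the table):

```python
# Iraqi sub-corpus figures from Table B2.4:
# discipline -> (number of abstracts, total words)
IRAQI = {
    "Agriculture": (150, 27_000),
    "Nursing": (135, 26_055),
    "Engineering": (180, 35_820),
    "Languages": (210, 44_520),
}

total_abstracts = sum(n for n, _ in IRAQI.values())      # 675 abstracts
total_words = sum(w for _, w in IRAQI.values())          # 133,395 words
mean_length = round(total_words / total_abstracts, 1)    # 197.6 words per abstract
```

The same tally, run over the US-based column, is an easy way to verify a corpus description table before publishing it.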
and (3) summary of major conclusions (Mustafa, 2015). For example, many
nursing RA abstracts have explicit subsections for Objectives, Methodology,
Results, and Recommendation. These abstracts are often presented separately
from the article, and the onus is on authors to make them as ‘stand-alone’
as possible.
Of the 675 Iraqi abstracts, 53% have at least two authors (compared to 34%
for US-based abstracts). Coauthoring for Iraqi scholars and professionals is often
encouraged not only to foster collaborative work but also to add experts
who may focus specifically on such areas as statistics, research approaches,
or the actual writing process in English. Iraqi authors in the corpus all have
MA/MS or PhD degrees and have different academic titles/ranks starting with
the level of university teaching staff. Some authors were affiliated with uni-
versities outside Iraq. Finally, both American and British English conventions
(e.g., in spelling, vocabulary use, and some syntactic structures) are observed
in the Iraqi abstracts. Some RA manuscripts are reviewed and checked
by designated “specialists in English” employed by local journal publications.
For these Iraqi journals, authors writing in English are also required to translate
their abstracts into Arabic.

B2.3  Collecting Spoken Texts


A recent book project of mine entitled Exploring Spoken English Learner Lan-
guage Using Corpora: Learner Talk (2017) involved extensive work analyzing
the structure of learner-spoken discourse, with Joseph A. Lee (Ohio Uni-
versity), Brittany Polat (Georgia State University), and Audrey Roberson
(Hobart and William Smith Colleges). Three specialized DIY corpora were
collected for the project: (1) L2 Classroom Discourse (L2CD) Corpus, (2)
L2 Experience Interview Corpus, and (3) Second Language Peer Response
(L2PR) Corpus. The collection of these spoken corpora required consid-
erable planning, coordinating with participants (teachers, students, and
interviewees), audio-video recording, and transcription. In the following
sections, I describe the design processes and the particular approaches in
collecting these spoken learner corpora. Teachers may find these descrip-
tions helpful as they consider various options for recording and transcribing
their own spoken data for classroom use with their students. Although these
specialized DIY corpora were collected primarily for large-scale research
studies, there are clear ‘process-based guides’ and pedagogical applications
to be gleaned from analyzing the data and results. Audio recording and
transcription were tedious, and at times very
challenging: for example, Polat’s recruitment of interviewees and the need
for funding sources to offer participant remuneration. All corpus collection
projects passed a rigorous institutional review process.
For the Teacher


As you read the sections that follow, think about your teaching situations and
relevant spoken interactions you have with your students. These events may
be recorded and transcribed for your classroom-based research, including
potentially comparing them with existing texts from MICASE. Some topics
for investigation using classroom spoken corpora include how learners and
teachers mark their stance toward propositional content and each other by
focusing on hedges (e.g., almost, assume, suppose, “in my view”) and boosters
(e.g., certain, of course, never); the use of personal pronouns in learner and
teacher talk, or how teachers use spatial deixis (e.g., here, these) to conceptu-
alize classroom space (e.g., Lee, 2011).
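As a first pass at the stance-marking investigation described above, a teacher-researcher might simply count hedge and booster tokens in a transcript. A minimal sketch; the word lists are illustrative starters drawn from the examples above and should be extended:

```python
# Illustrative starter lists (extend with items such as "in my view").
HEDGES = {"almost", "assume", "suppose"}
BOOSTERS = {"certain", "never"}

def stance_counts(text: str) -> dict:
    """Count single-word hedges/boosters plus the multiword booster 'of course'."""
    tokens = [t.strip(".,?!") for t in text.lower().split()]
    return {
        "hedges": sum(t in HEDGES for t in tokens),
        "boosters": sum(t in BOOSTERS for t in tokens)
                    + text.lower().count("of course"),
    }
```

Other multiword items (e.g., in my view) can be counted the same way as of course; concordancing tools such as AntConc will give the same frequencies with context lines attached.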
Studies of learner comprehension and of how speakers modify their speech
to provide comprehensible input, through repetition, a slower speech rate,
and the rephrasing of utterances with more frequent and simpler words,
have typically been examined in experiments, but these phenomena may also be
analyzed using a comprehensive, well-developed corpus. From simple word
counts to more advanced frequencies of reformulations, various corpus
methods may also allow for distributions that can be used alongside test
results. Corpora can further describe the linguistic features of L2 negotiation
strategies (e.g., confirmation checks, clarification requests, recasts, or infor-
mation packaging). These descriptions may be used to develop testing and
teaching materials, and NNSs may also be induced to notice and understand
the gap between their own L2 speech system and those of other learners,
NSs, and their classroom instructors (Friginal, Lee, Polat, & Roberson, 2017).

B2.3.1  Collecting and Analyzing Classroom Spoken Corpora


The L2CD Corpus. The L2CD corpus designed and collected by Lee (2011)
consists of 24 EAP lessons taught by four highly experienced EAP teachers in
a university IEP: three female instructors and one male instructor (Burt, Mary,
Lillian, and Baker—all pseudonyms). At the time of data collection, Burt and
Mary taught oral communication, Lillian taught reading and listening, and
Baker taught structure and composition. Each EAP teacher’s lessons were video-
recorded six times over a 16-week semester, totaling 28 hours of recordings. The
uneven distribution of hours was due to the length of the teachers’ classes. Both
Burt and Mary taught afternoon classes that met for 50 minutes (totaling 5 hours
each). Lillian’s class was a morning class of 75 minutes in length (8 hours in total),
and Baker’s class was also a morning class of 100 minutes in length (for a total
of 10 hours). Both Lillian’s and Mary’s classes had 15 students each, Burt’s class
consisted of 13 learners, and Baker’s class had 17 students (Friginal et al., 2017).
The video camera was positioned in the back corner of the classrooms. It re-
corded the teachers’ linguistic and non-linguistic behaviors and learners’ speech
when they were interacting with the teachers in mostly whole-classroom for-
mats. Since the learners and teachers did not wear clip-on lavalier microphones,
however, it was difficult to capture most of their speech when learners and teach-
ers interacted during individual, pair, or group tasks. Additionally, since three of
the classes included student presentations (i.e., the oral communication and
reading/listening classes), the lessons were recorded on days that consisted of
more regular academic and language tasks such as vocabulary, grammar, reading,
writing, and listening activities. Therefore, the recordings were mostly of
instructor and learner talk during whole-class interactions. The first record-
ings occurred in Weeks 3 and 4, four consecutive lessons were then recorded
in Weeks 6–9, and the last recordings occurred in Weeks 11–14. All 24 video-
recorded lessons were transcribed verbatim, including dysfluencies, following the
transcription conventions shown below (adapted from Jefferson, 2004).

T Teacher
S1, S2, etc., Identified student
SU Unidentified student
Ss Several or all students at once
- Interruption; abruptly cutoff sound
, Brief mid-utterance pause of less than one second
. Final falling intonation contour with 1–2 second pause
? Rising intonation, not necessarily a question
(P: 02) Measured silence of greater than 2 seconds
X Unintelligible or incomprehensible speech; each token refers to one
word
<LAUGH> Laughter
() Uncertain transcription
{} Verbal description of events in the classroom
(()) Nonverbal actions
Italics Non-English words/phrases
// Phonetic transcription; pronunciation affects comprehension
ICE Capitals indicate names, acronyms, and letters
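Before running word counts over transcripts marked up with these conventions, the non-speech symbols usually need to be stripped out. A minimal sketch using Python’s re module; the patterns mirror a subset of the conventions above (uncertain transcriptions in single parentheses are kept, since they are words), and should be adjusted to your own markup:

```python
import re

# Patterns for the non-speech markup listed above.
MARKUP_PATTERNS = [
    r"\(\(.*?\)\)",   # nonverbal actions
    r"\{.*?\}",       # verbal descriptions of classroom events
    r"<LAUGH>",       # laughter
    r"\(P: ?\d+\)",   # measured silences
    r"\bX\b",         # unintelligible speech tokens
]

def strip_markup(utterance: str) -> str:
    """Remove transcription markup, leaving only the words spoken."""
    for pattern in MARKUP_PATTERNS:
        utterance = re.sub(pattern, "", utterance)
    return re.sub(r"\s{2,}", " ", utterance).strip()
```

Keeping two versions of each transcript, one fully marked up for discourse analysis and one stripped for frequency work, avoids having to re-transcribe later.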

The transcripts of the video-recorded lessons made up the L2CD corpus. Table
B2.5 provides a full description of the L2CD corpus. As previously mentioned,
it consists of 24 complete lessons, and the size of the corpus is 179,638 tokens.
The following text excerpt is illustrative of typical learner contributions in
teacher-student interactions in the L2CD.
Table B2.5  Description of the L2CD corpus

Teacher   Course                     Level(a)  Class     Class       Class time(d)  Label     Tokens
                                               size(b)   meeting(c)  (minutes)
Baker     Structure and Composition  3         17        MWF         100            L2CD-1     8,039
                                                                                    L2CD-2     9,977
                                                                                    L2CD-3    10,178
                                                                                    L2CD-4    10,528
                                                                                    L2CD-5    11,448
                                                                                    L2CD-6     9,705
Burt      Oral Communication         2         13        MWF          50            L2CD-7     7,854
                                                                                    L2CD-8     6,843
                                                                                    L2CD-9     6,579
                                                                                    L2CD-10    7,671
                                                                                    L2CD-11    6,632
                                                                                    L2CD-12    5,591
Lillian   Reading and Listening      3         15        TTH          80            L2CD-13    8,392
                                                                                    L2CD-14    6,450
                                                                                    L2CD-15    6,369
                                                                                    L2CD-16    5,085
                                                                                    L2CD-17    5,146
                                                                                    L2CD-18    7,432
Mary      Oral Communication         3         15        MWF          50            L2CD-19    6,086
                                                                                    L2CD-20    7,163
                                                                                    L2CD-21    5,398
                                                                                    L2CD-22    6,849
                                                                                    L2CD-23    6,874
                                                                                    L2CD-24    7,349
Total                                                                                        179,638

a Level refers to the proficiency level of the course: 2 = low-intermediate; 3 = intermediate.
b Class size refers to the number of students in the course.
c Class meeting refers to the days the course met: M = Monday, T = Tuesday, W = Wednesday,
TH = Thursday, and F = Friday.
d Class time refers to the total meeting time per lesson.
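Token totals like those in Table B2.5 can be reproduced by tokenizing each transcript file. A minimal sketch using simple whitespace tokenization over plain-text lesson transcripts (the folder and filename pattern are hypothetical):

```python
from pathlib import Path

def token_count(text: str) -> int:
    """Whitespace tokenization; exact counts depend on the tokenizer used."""
    return len(text.split())

def corpus_size(folder: str) -> int:
    """Sum token counts over all lesson transcripts in a folder."""
    return sum(token_count(path.read_text(encoding="utf-8"))
               for path in Path(folder).glob("L2CD-*.txt"))
```

Because different tools tokenize differently (e.g., how they split contractions or punctuation), it is worth reporting which tokenization produced the counts in a corpus description table.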
Text Sample B2.1.  Learner contributions in teacher-student interactions in the L2CD

T:  i want you to say a component. don’t worry, we’ll we’ll we’ll work with
that. here.
S5:  music.
T:  good.
S5:  music. dance.
T:  some other ones from the audience. music. dance. okay. so let’s see what we
have from the group over there. traditions. behavior.
S3:  subculture.
T:  food. what what Azeem?
S3:  subculture.
T:  speech.
S5:  religion.
T:  so we could say
S5:  reli
S4:  religion
S5:  religion is the is different.
T:  let’s put speech here.
S10:  belief.
T:  and we got behavior here. what else.
S10:  i think religion is part of belief.
SU:  religion?
S5:  no n- no.
SU:  x belief.
T:  good. religious beliefs. what’s a value.
S7:  what is the value.
T:  either give me an example or tell me what value means.
S16:  honesty. honesty.
S10:  individualism.
S6:  collectivist.
S4:  collectivist. collectivism.
T:  wow. we have some experts in here, i can see.

L2 Experience Interview Corpus. The L2 Experience Interview Corpus was
designed and collected by Polat (2013) to address methodological gaps in the
study of advanced L2 learning. This corpus aimed to provide transcriptions
of detailed interviews with a large number of advanced learners on the topic
of their own learning experience. In all, 123 interview texts were included
in the corpus for a total word count of 143,115. Texts ranged in length from
379 words to 3,334 words, with an average length of 1,164 words. Each par-
ticipant was interviewed by the same researcher and received exactly the same
questions in the same order, so that any differences in the interview texts are
the result of differences in the participants’ speech rather than of variations
in feedback from the interviewer. The resulting corpus combines elements of
quantitative and qualitative data collection; it is richly descriptive and detailed,
but also relatively large-scale and directly comparable across all texts.
Participants were all currently enrolled graduate or undergraduate students and
had lived in the US (or any other English-speaking country) for no more than
one year. Participants came from 23 countries, spoke 27 native languages, and
represented 43 academic majors. This diversity ensures a representative sample of
university-level English language learners. Sixty-five (52.85%) of the participants
were female and 58 (47.15%) were male. Ninety-five (77.23%) participants were
graduate students and 28 (22.76%) were undergraduate students. The average age
of participants was 26 years. Participants were recruited through a newsletter dis-
tributed to international students by the international services offices at two urban
universities. The short recruitment article provided information about the purpose
of the project, eligibility, compensation offered ($30 per participant), and contact
information. Students were informed in advance that the interviews would be au-
dio recorded and that they would be asked to provide their TOEFL scores.
Interviews with each student were recorded in a study room on campus and
were all conducted by the researcher. Interviews ranged in duration from 5 to
27 minutes. Prior to each interview, the researcher explained the purpose of
the project, reviewed the informed consent document, and allowed students to
ask questions before signing the informed consent form. Students also filled
out a short background information sheet which asked for their major, native
language(s), years studying English, months in the US, and TOEFL score. Each
interview strictly followed the interview protocol (see the following questions),
which was developed from previous L2 experience interviews (e.g., Polat, 2013).
Students were allowed to speak for up to four minutes in response to each ques-
tion. The interviewer did not ask follow-up questions or interrupt students, ex-
cept to enforce the time limit. Limited backchanneling cues, such as “Oh,” or “I
see,” were provided to put students more at ease and to more closely resemble au-
thentic conversation. This standard procedure ensured that all students received
the same input before answering questions and were not inadvertently primed to
produce different types of language (Friginal et al., 2017).
Interview Protocol Questions:

1 Tell me about your experience learning English.
2 Do you like learning English? Why or why not?
3 Why do you want to learn English?
4 What are the most important things you do to help you learn English?
5 What do you do to improve your speaking and listening ability?
6 What do you do to improve your reading and writing ability?
7 How do you learn grammar?
8 How do you learn vocabulary?
9 Do you feel that most people learn in the same way that you do, or in a
different way?
10 How do you feel when you use English?
Table B2.6  Interview transcription guidelines

• Hesitation words (uh, um) are not transcribed
• Repetition and reformulation are not transcribed
  ° If the speaker clearly meant to repeat a word or phrase, the repetition is included
• Discourse markers like, well, and you know are not transcribed
• “How do you say…” or similar phrases that indicate the participant is unsure of the
  words they are using are not transcribed
• Numbers are written out in words (including years)
  ° Except for course numbers (English Composition 1101 is changed to English
    Composition XYZ)
• Teachers’ names are changed to Ms. XYZ
• Question restatements are not included
  ° This includes when the participant says something while reflecting, such as
    saying, “Writing…” while thinking about how to answer the question about
    writing.
  ° This is somewhat up to the discretion of the transcriber. If the person says,
    “Writing” and then continues with the sentence, the word “writing” can be left
    if it is necessary to understand the meaning of the utterance.
• ’Cause and variations of because are written out as because
• All instances of mother tongue or similar are changed to native tongue to prevent
  this from being counted as a family word
• If a participant concludes an answer with a wrap-up phrase such as “That’s it” or
  “That’s all,” it is not transcribed (because it is assumed this is merely an indication
  to the interviewer that they are finished, not part of the response itself ). If these
  phrases are used as part of the response itself, they are transcribed. This is
  somewhat up to the discretion of the transcriber, but usually it is clear in the
  context of the response.
• US is written USA to avoid confusion with periods
• Adverbial like is marked as rrlike to distinguish it from the verb like
• When the speaker’s intention is clear, words have been changed to standard spelling
11 Is there anything you want to change about your English learning experience?
12 Is there anything else you want to discuss about your English learning
experience?

The audio-recorded interviews were orthographically transcribed by the re-


searcher and one paid assistant following the transcription guidelines shown in
Table B2.6. (All transcriptions were reviewed in full by the researcher to ensure
that guidelines were followed.)
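Several of the Table B2.6 guidelines are mechanical enough to be applied, or at least checked, automatically. A sketch of three such rules as regex substitutions (rules requiring transcriber judgment, such as adverbial like, are deliberately left out):

```python
import re

# Three mechanical rules from Table B2.6 as regex substitutions.
# Note: replacements are lowercase, so sentence-initial matches lose
# their capitalization in this simple sketch.
RULES = [
    (re.compile(r"'[Cc]ause\b"), "because"),                  # 'cause -> because
    (re.compile(r"mother tongue", re.IGNORECASE), "native tongue"),
    (re.compile(r"\bUS\b"), "USA"),                           # avoid period confusion
]

def apply_guidelines(text: str) -> str:
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Running such a script over finished transcripts is a cheap consistency check before the corpus is analyzed, even when the transcription itself is done by hand.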
One important theme to emerge from the L2 experience interviews is stu-
dents’ disdain for English grammar teaching methods that are still prevalent
around the world (see Table B2.7). The majority of students interviewed for this
study reported that their English learning experience in middle and high school
was filled with pencil-and-paper exercises and limited authentic communica-
tion. This was true for learners within all three clusters and from all parts of the
world. Some students suggested that this focus on written grammar exercises
rather than actual verbal interaction and practice resulted in part from their
teachers’ own lack of English proficiency, and many also felt that their national
education systems were to blame for favoring such poor teaching methods or
for simply allowing apathetic teaching. Chinese and Korean students frequently
complained about the grammar-exercise and test-focused nature of their edu-
cational systems. Many students from Europe, Asia, the Middle East, and Latin
America felt that their secondary education had not prepared them well for
speaking and listening in authentic communicative interactions (Polat, 2013).
The L2PR Corpus. The L2PR corpus (Roberson, 2015) is a specialized collec-
tion of authentic learner-learner talk during peer response activities in a writing
classroom. The spoken texts were collected in a section of a first-year composition
course for bilingual speakers or NNSs of English. In this course, students complete
writing assignments focusing on reading, writing, and revising in different aca-
demic genres such as summaries, response papers, annotated bibliographies, and
research papers. The instructor for this course teaches using a process-oriented
approach to writing, a common practice in university L2 writing classrooms
(Casanave, 2006). For three major writing assignments, students participate in
peer response sessions: two summary-response papers (a one- to two-page paper
that includes three components: a summary of the assigned text, a personal con-
nection, and opinions or evaluations of the text) and one persuasive research paper
(a three- to four-page paper that includes at least three academic sources in which
students state their opinion about a controversial topic of their choice).
Early in the semester, participants chose their own partners with whom they
would complete this activity. The procedure for the three class sessions during
which the peer response data were collected was to first distribute the handout with
the guiding questions and entertain any questions from the entire group about it.
These guiding questions for the peer response discussions focused on general con-
cerns such as paragraph development, transitions between different sections of the
Table B2.7  Sample students’ thoughts about grammar teaching methods

Participant 20 (French): That’s pretty much all how our classes were, we just had
grammar and only grammar. That’s why I think we can be ok at grammar but we’re
really bad at talking. Because we just don’t practice a lot, so it was just practice
about grammatical things and everything so it could be really boring, but that’s
how we got our bachelor’s, so.

Participant 37 (Chinese): I learn grammar because in China the English teacher
they teach a lot of grammar. That’s how I learn grammar, especially in the high
school. I believe the major part of the English exam is about the grammar, about
how you write your sentence, your vocabulary, all grammar. It’s only about twenty
percent about the listening, and there’s no speaking test in china in English exam.
Yeah all about grammar I think, at least sixty percent in my opinion.

Participant 86 (Korean): I saw many problems in Korea, when it comes to learning
English. Because we only focus on the reading and grammar, and sometimes
listening, but students cannot actually write in English and speak in English….
Because many Korean students actually hate learning English, because it’s really
stressful, and it’s not fun, because they always focus, memorize the vocabulary and
memorize the grammar rule and those kind of things, that makes students hate
English. So but I think, speaking is really important.
Table B2.8  Composition of the L2PR corpus

Pair no.      Session 1   Session 2   Session 3   Total words by pair
1             450         446         614         3,269
              459         461         839
2             604         881         469         4,214
              627         1,195       438
3             807         x           572         2,493
              714                     400
4             1,169       1,418       x           5,008
              716         1,705
5             875         1,104       807         6,445
              1,888       1,259       512
Total words   8,309       8,469       4,651       21,429
by session

paper, and the inclusion of a thesis statement that signaled the development of the
rest of the paper. After discussing the handout as a group and answering any
questions about it, students exchanged papers and silently read each other’s
drafts, making brief notes relative to the guiding questions as they read. To
encourage subsequent peer-peer verbal interaction rather than a simple exchange
of written feedback, students were told to make notes only detailed enough to
support verbal feedback to their partners later, since the majority of that
feedback would be delivered orally during the later conversations with their partner.
A digital recorder was placed on the desk between the pairs of students, and
when they were ready to begin giving their feedback to each other, they decided
who would go first, and the discussions began. After the first partner had com-
pleted the feedback to the other, they switched roles and repeated this process.
The notes they had made on the papers earlier relative to the discussion handout
questions served as reminders as they, in turn, delivered the oral feedback to their
partners. Trained research assistants later transcribed the conversations, referring
to the first student delivering the feedback as “S1” and the second as “S2.” These
abbreviations were subsequently eliminated, together with any irrelevant nota-
tions describing the environment (e.g., “papers shuffling in the background,”
etc.) when the transcripts were converted to text files so that the corpus consisted
entirely of student talk. Table B2.8 presents the composition of the L2PR cor-
pus, displaying the number of words in each text. In this corpus, each text is a
transcript of a pair’s discussion about one student’s paper (Friginal et al., 2017).
At each peer response session, there were two papers discussed by each pair,
and thus two transcripts generated by each pair. For example, in their first ses-
sion, the transcript of pair number one’s discussion of the first paper was 450
words long, and their discussion of the second paper was 459 words long. Each
pair participated in three peer response sessions over the course of the semes-
ter, with the exception of two pairs. Row totals show the number of words
generated by each pair across three sessions. Column totals show the number of
words generated during each session by all five pairs. The bold number in the
bottom right corner is the total number of words in the corpus: 21,429.
The following excerpt shows Zelda reviewing Ivana’s (both pseudonyms)
summary-response paper about the class book, Outcasts United. Ivana wrote
about two instances in the novel during which the refugees experienced dis-
crimination within their American community. Zelda thinks the paragraph
needs to be expanded. In this episode, Ivana and Zelda engage in collaborative
brainstorming that results in the generation of language that Ivana might use
in her second draft to expand the paragraph, consistent with Zelda’s recom-
mendation. Rather than wait for Zelda to point out problematic aspects of her
paper, Ivana begins the episode by sharing that she is stuck. Both students then
participate in the generation of new ideas, demonstrating shared direction of
the task and meaningful engagement with each other’s suggestions.

Text Sample B2.2.  Sample student interaction: summary-response paper

Ivana:  Here I stop because I have no idea, because I have no clue (laughing)
Zelda:  (laughing) I’ll just write you some notes here about just “church and
store” and, um, “stories”, and “your opinion” about it.
Ivana:  Because maybe I can say that they had to be thankful for escaping from
war, um, and don’t be so aggressive …
Zelda:  Mhm.
Ivana:  to the new life
Zelda:  You can keep going, saying about the church and the store and what
happened in your opinion …
Ivana:  Yes, there I will say about it [should not happen
Zelda:  Yes, that it’s not] supposed to be to happen …
Ivana:  Mhm.
Zelda:  because it is in United States. And in conclusion, you can just say that
although in theory it sounds [so easy …
Ivana:  Perfect, yeah]
Zelda:  uh, but in reality …
(Zelda and Ivana, Peer Response Session 1, February 2013)

B2.3.2  Classroom Spoken Corpora: Future Directions


Multimodal annotation of spoken interactions in language classrooms is becoming
an increasingly popular strategy in corpus-based research. Non-­linguistic factors
such as facial expression, gestures, and body position that are critically important
variables in all communication can, thereby, be identified and automatically ex-
tracted and utilized, together with enhanced prosodic and acoustic markups of
spoken corpora, to comprise a more comprehensive approach to research in this
discipline. The resulting annotations can then be interpreted alongside frequency
data and other learner demographic information. I expect to see an increasing
number of projects of this kind, with a particular focus on learner oral production
in the next few years. As briefly discussed in Section B1, collections such as
EuroCoAT and others (search online also for the LeaP Corpus, The LONGDALE
Project, YOLECORE, and the Multimedia Adult ESL Learner Corpus) are developed
for the purpose of including video- or audio-based information in corpus analyses.
During his March 2017 plenary lecture at the American Association for Applied
Linguistics (AAAL) Conference, Suresh Canagarajah (Penn State University) emphasized the role of spatiotemporal dimensions of communicative activity to show
language as a self-defining grammatical system. He pointed out that developments
in mobility, globalization, and technology, I believe including corpus-based tech-
nology, have prompted the realization that meanings and grammatical forms are
co-constructed in situated interactions, such as the classroom, in an expansive con-
text of social networks, ecological affordances, and material objects. Canagarajah
briefly presented data and video clips from the International Teaching Assistants
Corpus (ITACorp) collected by his research team and colleagues at Penn State.
The ITACorp has video recordings comprising over 500,000 words of language
from a variety of spoken classroom tasks including lectures, office hours, role plays,
presentations, and discussions from international teaching assistants.
His conclusions, in part, focused on important dimensions such as “board-
work” (i.e., how an ITA utilized the blackboard in writing mathematical equa-
tions), movement, and use of space in the classroom. Although, on occasion,
students may have immediately noted the ITA’s heavy L2 accent or expressed
difficulty in understanding some utterances, the ITA’s effective use of space
may have, to some extent, compensated and enhanced the classroom experi-
ence for his students. As one student who is a native speaker of English pointed
out, “Some ITAs might talk more fluently, but if the board work is poor, it will
not work at all." Clearly, these observations and findings validate the importance of combining non-linguistic data with linguistic frequency data, and designing creative research approaches to accomplish this, as we continue to explore
learner language acquisition in academic settings.
Canagarajah added that students and teachers are now becoming more aware
of the importance of the conscious management of space as a defining and
generative resource in successful communication. “A competence for such suc-
cess involves one’s emplacement in relevant spatiotemporal scales to strategi-
cally align with diverse semiotic features beyond language, participate in an
assemblage of ecological and material resources, and collaborate in complex
social networks.” He argued that such a consideration compels us to revise our
traditional notions about the autonomy of language, separation of labeled lan-
guages, primacy of cognition, and agency of individuals. The relevance and,
therefore, the utility of corpus-based approaches in language research going
forward depend upon our capacity and willingness to merge more traditional
corpus-based approaches with innovative ones that incorporate annotations and
evidence from multimedia sources (Friginal et al., 2017).
B3
Corpus Tools, Online
Resources, and an Annotated
Bibliography of Recent Studies
with Tamanna Mostafa and Melinda Childs

This section provides a comprehensive list of corpus tools and online resources
for teachers to explore or be familiar with. These tools are all accessible on the
web, but as URLs change and creators move from one university or company
to another, it would be good to refer to the names of the tools listed here and
search for them online in case a link no longer works. I also provide an an-
notated bibliography of CL research studies from 2010 to the present on the
following themes: CL in the classroom, CL and writing, and CL and spoken
learner data. These studies show CL applications in many contexts and settings,
and utilize a range of corpora that teachers can use or collect themselves.

B3.1 Online Directories, Facebook Groups, and MOOCs


There are several faculty and researcher web pages on CL topics that provide
relevant links to resources available for English teachers. In addition to very
useful sites created and administered by Davies (BYU corpora) or the mem-
bers of the Learner Corpus Association (http://www.learnercorpusassociation.
org/), the global internet has provided great support for advances in CL, with
easy access to sample data, corpora, and new tools. These resources have tradi-
tionally been primarily intended for linguists, researchers, or graduate students,
but now, they have also addressed many of the needs of teacher-practitioners.
For example, the website for the University Centre for Computer Corpus
Research on Language (UCREL) at Lancaster University (http://ucrel.lancs.
ac.uk/) has specialized in the sharing of automatic or computer-aided analysis
of corpora created during their more than 40 years as pioneers in the field, and
they have now been regularly providing various applied research updates, short
video tweets of conference presentations, and a growing publication list, all of
which provide great support for practitioners.
Also housed in Lancaster is Tony McEnery's Corpus Linguistics: Method, Analysis, and Interpretation Massive Open Online Course (MOOC). This impressive
8-week MOOC provides a practical introduction to CL approaches, primarily
focusing on discussions of skills necessary for collecting and analyzing corpora;
a demonstration of the use of CL in the humanities, especially history; and an
introduction to CL’s variationist approaches as a data-driven way of looking at
language. McEnery emphasizes the important contribution of CL to the social
sciences in providing valuable insight into ‘social reality’ by investigating the
use and manipulation of language in society. More details, including a video
introduction from McEnery, and related materials and pricing can be found
here: https://www.futurelearn.com/courses/corpus-linguistics.
In my opinion, the best and most extensive directory of CL topics online
can be found in Corpus-Based Linguistics Links (http://martinweisser.org/corpora_site/CBLLinks.html). I have this page bookmarked in my browsers.
The links are organized according to three primary areas: corpora, software,
and miscellaneous (e.g., conferences and projects; references, papers, and jour-
nals; courses; FAQs; e-lists; and standards). This site was originally created by
David Lee (University of Wollongong) and is currently managed and main-
tained by Martin Weisser (Guangdong University of Foreign Studies). Im-
portant links to software and corpora from those posted by Lee in 2001 to the
present are available, providing a wealth of information on large-scale projects
over the years but also smaller ones conducted outside North America and
Europe. Linked resources include PC/Linux/Unix/Mac concordancers, web-
based concordancers, chi-square and log-likelihood calculators, various kinds
of frequency lists and word lists, speech data resources (e.g., fonts, speech an-
alyzers, standards, formats, annotation, fonts for IPA, non-English-language
scripts), online dictionaries, taggers, parsers, format conversion tools, ‘web
snaggers,’ and many others.
Weisser regularly updates the site, and he has also started cleaning up Lee's
original HTML code and reformatting the previous structure. Note that this
site uses the term “Corpus-Based Linguistics” or CBL (not CL). Weisser ex-
plains that the reasons for this are (1) to put the focus on linguistics, that is,
what we primarily do is ‘linguistics,’ it just happens to be corpus-based; (2) ‘CL’
would be confusing since it is already widely used to mean ‘Computational
Linguistics’; and (3) the term corpus linguistics, while shorter and more popu-
lar, tends to give the impression that it is a branch of linguistics rather than just
a methodology that can be applied to any existing branch of linguistics. Our
interest should be in language, not corpora per se (Weisser, 2017).
Finally, a CL Facebook group (https://www.facebook.com/groups/Corpus
Linguistics/) currently administered by Glen Hadikin (University of Ports-
mouth) focuses on members’ and guests’ announcements and discussions of
new tools, corpora, conferences, publications, and many other topics. Group
members and guests post questions and answers from programming and
software troubleshooting topics to corpus design and analysis. Responses to
questions from active members who are experts in the field, like Mike Scott
(Aston University, creator of WordSmith Tools), Jack A. Hardy, Costas Gabrielatos (Edge Hill University), or Ramesh Krishnamurthy (Aston University), and
many others make discussions in this group very informative and useful, even
for CL beginners.

B3.2  Corpus Annotation and Markup: Taggers/Parsers


Corpus annotation in the form of text ‘tagging’ or ‘parsing’ involves the la-
beling of linguistic features onto the corpus. For example, one program might
attach a part-of-speech label to every word in the text (i.e., POS-tagging).
By doing this, when the researcher wants to count how many nouns or verbs
were used, he or she can simply count the corresponding labels to obtain raw
frequency counts. Fortunately, most such programs also produce 'tag count'
outputs. In addition to tagging, linguistic parsing can also be done by com-
puter programs in order to segment or separate parts of a sentence or paragraph.
A parser may be able to identify and separate verb phrases and clauses, and also
extract long vs. short sentences (depending on the number of words) from a
corpus.
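To make the counting step concrete, here is a minimal Python sketch (not tied to any particular tagger) that tallies tags from an already-tagged text. The word_TAG token format and the sample sentence are illustrative assumptions, though many taggers produce similar plain-text output.

```python
from collections import Counter

def count_tags(tagged_text):
    """Tally POS tags in a tagged text where each token is written
    as word_TAG (a common plain-text tagger output format)."""
    tags = [token.rsplit("_", 1)[1]
            for token in tagged_text.split() if "_" in token]
    return Counter(tags)

# A short pre-tagged sample using Penn Treebank-style tags
sample = "The_DT teacher_NN collects_VBZ essays_NNS from_IN students_NNS ._."
counts = count_tags(sample)
print(counts["NN"] + counts["NNS"])  # noun tokens: 3
```

Searching the tag counts, rather than rereading the text, is exactly what a 'tag count' program automates for a whole corpus.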
I often use the Biber Tagger (Biber, 2010) in my studies. It is a POS-tagger
developed by Biber to provide a grammatical tag for each word in a text file.
POS-tags follow every word or punctuation mark in the text output. The
tag symbols and tag fields represent the grammatical and semantic annotation
identified by this tagger. Tagged text files (in .txt format) allow for easy and
immediate processing, and counting of the rates of occurrence of linguistic/
grammatical features. A complementary ‘tag count’ program can be used to
extract actual frequency counts of POS features occurring in a corpus. The
Biber Tagger combines computerized dictionaries with the identification of
word sequences as instances of a linguistic feature (e.g., noun + WH pronoun
not preceded by the verb tell or say = “relative clause”) (Biber, 1988). There
are over 150 POS-tagged categories in the output, which includes grammatical
and some syntactic elements. Tag accuracy is around 95% for written texts.
Accuracy decreases a little bit for spoken texts, especially those that are not
consistently transcribed. Unfortunately, access is an issue with this tool since
the Biber Tagger is not commercially available or accessible online. However, researchers may contact The Corpus Linguistics Research Program at Northern
Arizona University for information about corpus tagging and analysis using the
Biber Tagger.
An alternative option to the Biber Tagger is the Multidimensional Analysis Tagger
(MAT) created by Andrea Nini (University of Manchester). The MAT Tagger
(Nini, 2014) is a Windows-based program that replicates Biber’s (1988) Vari-
ation across Speech and Writing tag set and dimensions, producing an automatic
comparison of the dimension scores of English texts used by Biber applicable
for text-type comparisons. Nini's program can also generate a grammatically
annotated version of the target corpus with the statistics needed to perform
a text-type analysis. Visualized output plots the target (input) text or corpus
on Biber's (1988) six dimensions, and it determines its closest text type from
Biber's (1988) A Typology of English Texts. Nini uses POS-tag sets from the
Stanford Tagger, not the Biber Tagger tag sets, providing a very useful normalized tag
count output file (normalized per 1,000 words) that can be immediately copied
and pasted into a spreadsheet. Figure B3.1 shows a comparison of spoken and
written registers from Biber's (1988) Dimension 1: Involved vs. Informational
Production and an input corpus of student essays.

[Figure B3.1 plots Dimension 1 ("involved vs. informational production") scores
for Biber's (1988) genres (conversation, broadcasts, prepared speeches, personal
letters, general fiction, press reportage, academic prose, and official documents)
together with the input student-essay corpus; the essays' closest genre is
prepared speeches.]

Figure B3.1  Comparison of spoken and written registers from Biber's (1988) dimensions
and an input corpus of student essays from the MAT Tagger (Nini, 2014).
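Normalization per 1,000 words, as in the MAT tag count output, is simple to reproduce. Here is a one-function Python sketch; the counts in the example are invented for illustration.

```python
def normalize_per_1000(raw_count, corpus_size):
    """Convert a raw frequency count into a rate per 1,000 words,
    so texts of different lengths can be compared directly."""
    return raw_count * 1000.0 / corpus_size

# Invented example: 52 first-person pronouns in a 4,300-word essay corpus
rate = normalize_per_1000(52, 4300)
print(round(rate, 2))  # 12.09
```

Rates like this, rather than raw counts, are what dimension scores and cross-register comparisons are built on.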
The following are more POS-taggers and linguistic parsers that can be
downloaded or accessed online (some are for purchase):

• TagAnt (Anthony, 2015) is Anthony's free POS-tagger, which is easy to
download and run to obtain a POS-tagged version of corpora in a .txt format,
available from http://www.laurenceanthony.net/software/tagant/. As with
most of the tools created and shared by Anthony (see Table B3.1), TagAnt also
has a useful YouTube tutorial that guides users through running the program and
interpreting outputs (on YouTube, search "TagAnt – Getting Started"). TagAnt
is built on TreeTagger (a tagging program developed by Helmut Schmid).
Table B3.1  Description of relevant tools/software (organized by their developers)

Tool/software Developer/s or location What it does

AntCLAWSGUI Laurence Anthony, AntCLAWSGUI is a free POS tagging
Waseda University tool that interfaces with the CLAWS
tagger. Note that it is a Windows-only
tool and users must have a copy of the
CLAWS tagger installed. Available at:
http://www.laurenceanthony.net/
software/antclawsgui/
AntCorGen Laurence Anthony, AntCorGen is a discipline-specific
Waseda University concordancer that also allows users to
create and manage corpora. Available
at: http://www.laurenceanthony.net/
software/antcorgen/
AntGram Laurence Anthony, AntGram generates n-grams and p-frames
Waseda University from input texts. Available at: http://
www.laurenceanthony.net/software/
antgram/
AntMover Laurence Anthony, AntMover is a text structure analysis tool
Waseda University that can also be used for visualizing
vocabulary usage in a text. Available
at: http://www.laurenceanthony.net/
software/antmover/
AntPConc Laurence Anthony, AntPConc is a concordancer that can be
Waseda University used specifically to analyze corpora with
UTF-8 encoded text files. Available
at: http://www.laurenceanthony.net/
software/antpconc/
AntWordProfiler Laurence Anthony, AntWordProfiler is a tool that can be used
Waseda University to analyze vocabulary complexity and
other related measures. Available at:
http://www.laurenceanthony.net/
software/antwordprofiler/
FireAnt Laurence Anthony, FireAnt is a freeware that works alongside
Waseda University Twitter to export and analyze social
media texts. The program also comes
with built-in visualizations like maps
for geo-positions, graphs, and others.
Users need a Twitter account to run
the program. Available at: http://www.
laurenceanthony.net/software/fireant/
ProtAnt Laurence Anthony, ProtAnt is a text analysis tool developed
Waseda University & by Laurence Anthony and Paul Baker
Paul Baker, Lancaster and provides type/token counts and
University comparisons. Available at: http://www.
laurenceanthony.net/software/protant/
VariAnt Laurence Anthony, VariAnt is a free spelling variation analysis
Waseda University program. Available at: http://www.
laurenceanthony.net/software/variant/
Onlist Apps4ESL Onlist is an online interface that allows
users to paste a text onto the site for
wordlist comparisons (e.g., with AWL
and NGSL). Common words become
highlighted in green, academic words
in blue, and there is a list of custom
and ‘offlist’ words provided as well.
Available at: https://www.apps4efl.
com/tools/onlist/
POS Tagger Apps4ESL POS Tagger runs like a regular tagger but
is limited to tagging a 5,000-words
corpus. Available at: https://www.
apps4efl.com/tools/pos_tagger/
CLA: Custom Kristopher Kyle, CLA is a text analysis tool that can be used
List Analyzer University of to analyze texts using very large custom
Hawaii; Scott dictionaries. In addition to words,
Crossley & Youjin custom dictionaries can include n-grams
Kim, Georgia State and wildcards. Available at: http://www.
University kristopherkyle.com/tools.html
CRAT: Scott Crossley, Georgia CRAT is a tool that includes over 700
Constructed State University; indices related to lexical sophistication,
Response Kristopher Kyle, cohesion, and source text/summary
Analysis Tool University of text overlap. It is useful for exploring
Hawaii; Danielle writing quality as it relates to summary
McNamara, Arizona writing. Available at: http://www.
State University kristopherkyle.com/tools.html
(with K. Davenport)
SÉANCE: Scott Crossley, Georgia SEANCE is a useful tool that includes 254
Sentiment State University; core indices and 20 component indices
Analysis and Kristopher Kyle, based on recent advances in sentiment
Cognition University of analysis. In addition to the core indices,
Engine Hawaii; Danielle it allows for a number of customized
McNamara, Arizona indices including filtering for particular
State University parts of speech and controlling for
instances of negation. Available at: http://
www.kristopherkyle.com/tools.html
SiNLP: The Scott Crossley, Georgia SiNLP is a software that allows
Simple Natural State University; users to analyze texts based on the
Language Kristopher Kyle, number of words, types, letters
Processing Tool University of per word, paragraphs, sentences,
Hawaii; Danielle and words per sentence. Users can
McNamara, Arizona analyze texts with their own custom
State University dictionaries. Available at: http://www.
(with A. Allen) kristopherkyle.com/tools.html
TAACO: Tool for Scott Crossley, Georgia TAACO is an easy-to-use tool that
the Automatic State University; calculates 150 indices of both local and
Analysis of Kristopher Kyle, global cohesion measures, including
Cohesion University of a number of type-token ratio indices,
Hawaii; Danielle adjacent overlap indices, and connectives
McNamara, Arizona indices. Available at: http://www.
State University kristopherkyle.com/tools.html
TAALES: Kristopher Kyle, TAALES is a tool that measures
Tool for the University of over 400 classic and new indices
Automatic Hawaii; Scott of lexical sophistication, like both
Analysis Crossley, Georgia single words and n-grams. It also
of Lexical State University provides comprehensive index
Sophistication diagnostics. Available at: http://www.
kristopherkyle.com/tools.html
TAASSC: Kristopher Kyle, TAASSC is an advanced syntactic analysis
Tool for the University of Hawaii tool that measures fine-grained indices
Automatic of clausal and phrasal complexity,
Analysis of classic indices of syntactic complexity,
Syntactic and frequency-based verb argument
Sophistication construction indices. Available at:
and http://www.kristopherkyle.com/tools.
Complexity html
UAM Corpus Mick O’Donnell, UAM CorpusTool is a free text annotation
Tool Universidad tool that also provides statistical output
Autónoma de for various types of linguistic/text
Madrid analysis. Available at: http://www.
corpustool.com/index.html
UAM ImageTool Mick O’Donnell, UAM ImageTool is an image annotator
Universidad tool for visual data corpora. Available
Autónoma de at: http://www.wagsoft.com/
Madrid ImageTool/
Atomic Stephan Druskat, Atomic is a cross-platform corpus
Volker Gast, & annotation tool. Available at: http://
Thomas Krause corpus-tools.org/atomic/index.html
and team, Humboldt
University of Berlin
ANNIS Stephan Druskat, ANNIS is an open source, cross-platform
Volker Gast, & program that uses a web browser-
Thomas Krause based search and visualizer for multi-
and team, Humboldt layer linguistic corpus annotation and
University of Berlin analysis. Available at: http://corpus-
tools.org/annis/
Pepper Stephan Druskat, Pepper is a free online converter tool used
Volker Gast, & to convert corpora from one format
Thomas Krause to another easily. Users need a Java
and team, Humboldt corpus-tools.org/pepper/index.html
University of Berlin corpus-tools.org/pepper/index.html
Salt Stephan Druskat, Salt is an annotator that can also be used
Volker Gast, & to store, manipulate, and represent
Thomas Krause linguistic data. Available at: http://
and team, Humboldt corpus-tools.org/salt/index.html
University of Berlin
PLC PLCorpora PLC Concordancer is an online text
Concordancer analysis tool. Full access to the tool
and all of its functions requires a
subscription. The tool can be accessed
at: http://www.plcorpora.com/
PLC Wordlist PLCorpora PLC Wordlist is a frequency generator
from PLCorpora. Parts of the program
are free, but users will have to pay
a fee to access all of the features.
Available at: http://www.plcorpora.
com/opencorpus/wordlist

• The Constituent Likelihood Automatic Word-tagging System (CLAWS) is a
POS-tagger that was used to tag the BNC and is available for user licenses
and as copies for single sites. CLAWS has over 160 different POS- and
semantic tags developed by UCREL at Lancaster University. Not only
can it be used as a downloaded program on an individual computer, but it
can also be accessed directly online. The CLAWS team offers tagging ser-
vices and charges depending on the amount of text being tagged (http://
ucrel.lancs.ac.uk/claws/). This program has consistently achieved 96%–
97% accuracy, which may vary based on the type of text or transcription
convention.
• A program designed to grammatically and semantically tag corpora
is Wmatrix, a corpus comparison tool also from Lancaster University
(Rayson, 2003, 2008) (http://ucrel.lancs.ac.uk/wmatrix/). This tool com-
bines the CLAWS tagger and a semantic annotation system. Many recent
studies have been conducted using this program because of its extensive tag
set and its accessibility. For example, Xiao and McEnery (2005) and Xiao
(2009) used it to replicate the methodology of Biber’s (1988) MD analysis
using tagged data from Wmatrix.
• As noted earlier, the Stanford Parser and the Stanford Tagger (http://nlp.
stanford.edu/software/lex-parser.shtml) may also be used to obtain POS-
tagged data and parsed texts, but this is done by obtaining NLP program-
ming codes and running them through another platform, just like Nini’s
work with his MAT Tagger. The current tag sets for these tools include pri-
mary POS counts of over 50 linguistic features (e.g., primary POS features
like nouns, verbs, modal verbs, prepositions, etc.).
• Coh-Metrix, developed by Arthur C. Graesser (University of Memphis)
and Danielle McNamara (Arizona State University), is a sophisticated
computational/corpus tool that rates readability and also provides frequency counts for a range of linguistic aspects, such as descriptive, connectives, syntactic pattern density, word information, and readability sections.
(See more information about this tool from Crossley & McNamara, 2009.)
The Coh-Metrix tag set is generally similar to most taggers but with many
more additional features focusing on lexical diversity and specificity mark-
ers. Data and related research from Coh-Metrix, including contact infor-
mation for potential tagging requests, are located at http://cohmetrix.
memphis.edu/cohmetrixpr/index.html.
• The Linguistic Inquiry and Word Count (LIWC, pronounced 'Luke') was
developed by Pennebaker et al. (2017), employing a different form of
corpus tagging technique that utilizes a dictionary approach with 80 preset
categories in order to analyze the linguistic and psychosocial features of texts.
The output includes linguistic dimensions (e.g., percentage of words in the
text that are pronouns, articles, auxiliary verbs, etc.); word categories tapping
psychological constructs (e.g., affect, cognition, biological processes); personal
concern categories (e.g., work, home, leisure activities); paralinguistic dimen-
sions (e.g., assents, fillers, nonfluencies); and punctuation categories (periods,
commas, etc.). LIWC is available for purchase (http://www.liwc.net/).
• Qualitative Coding Software: Many additional constructs from cor-
pora that are useful for teachers are not easily tagged or parsed automati-
cally. For example, one might be interested in investigating the themes or
patterns that emerge in learner interviews in L2. Although there is not a
way to automatically label such qualitatively determined features, there are
software and tools available that can help keep track of such items. Manual
coding and annotations may be required, but once these are completed, the
software tools are able to automate the extraction of these coded themes
or categories together with text samples. ATLAS.ti (http://www.atlasti.
com/index.html) and NVivo (http://www.nvivo10.com/) are two coding
software packages (commercially available) that incorporate corpus tech-
nology for qualitative research.
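The dictionary approach used by LIWC-style tools can be illustrated with a short Python sketch: each category is simply a word list, and the output is the percentage of tokens in a text that match each category. The categories and word lists below are invented stand-ins for LIWC's proprietary dictionary.

```python
# Invented stand-in for a LIWC-style category dictionary
DICTIONARY = {
    "affect": {"happy", "sad", "love", "worried"},
    "cognition": {"think", "because", "know", "consider"},
}

def category_percentages(text):
    """Return the percentage of tokens that fall into each category."""
    tokens = text.lower().split()
    return {cat: 100.0 * sum(t in words for t in tokens) / len(tokens)
            for cat, words in DICTIONARY.items()}

result = category_percentages("I think I know why she is happy")
print(result)  # {'affect': 12.5, 'cognition': 25.0}
```

Real tools refine this idea with stemming, wildcards, and negation handling, but the core output, the share of a text devoted to each psychosocial category, is computed the same way.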

B3.3 Other CL Tools (and Where to Find Them) Online


The Compleat Lexical Tutor or simply LexTutor (www.lextutor.ca) created and
maintained by Tom Cobb (Université du Québec à Montréal) and his team is an
internet-based suite of tools for “data-driving self-learning,” mainly for vocab-
ulary learning and the acquisition of French and English. The site is organized
primarily for users like researchers, teachers, and students. Students, for example,
may paste input texts to one of the tools available and then obtain an output
linked to speech, dictionary, concordance, and self-test resources. LexTutor has
a concordancer, a phrase (n-gram) extractor, a VocabProfile program, and other
tools. Teachers may be interested in using teaching tools and features such as a
vocabulary-based cloze passage generator or a traditional nth-word cloze builder.
Robert Nelson’s lesson in Section C2.3, “Implementing the Frequency-Based
VocabProfile Tool from LexTutor: Improving English Learners’ Essay Writing for
Proficiency Exams,” demos a direct application of LexTutor in the classroom.
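The traditional nth-word cloze builder mentioned above is easy to approximate in a few lines of Python. This sketch is a simplified stand-in for LexTutor's tool, and the passage is invented.

```python
def nth_word_cloze(text, n=7):
    """Blank out every nth word and return (cloze_text, answer_key),
    a simplified version of a traditional nth-word cloze builder."""
    words = text.split()
    answers = []
    for i in range(n - 1, len(words), n):
        answers.append(words[i])
        words[i] = "______"
    return " ".join(words), answers

# Invented passage
passage = ("Corpus tools can help teachers build vocabulary exercises "
           "directly from the texts their students read in class")
cloze, key = nth_word_cloze(passage, n=5)
print(cloze)
print(key)  # ['teachers', 'from', 'read']
```

A vocabulary-based version would blank words selected from a frequency list rather than every nth word, but the mechanics are otherwise the same.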
The Keywords Extractor tool in LexTutor computes the keyness value of a defin-
ing vocabulary in a particular corpus. The default comparison is with the spoken
sub-register of the BNC. Users can load their own specialized corpus for a ‘one-
way’ key word analysis. Keyness values represent the number of times a word is
more likely to occur in the input corpus compared to spoken British texts. For
example, although it is a slight stretch, a comparison of MICUSP biology papers
with spoken British English is shown in the following. One application of a
result like this is to clearly establish specialized words and jargon in one specific
field of study relative to everyday speech. Words with the highest keyness values
are frequently used in the specialized register but are rare in casual talk. Such
data can inform both teachers and students about another way of developing an
academic word list. The top 40 key words, with their corresponding keyness
values, in biology papers from MICUSP identified by Keywords Extractor are:

(1) 11745.00 phenotype (21) 2725.00 pandemic
(2) 8825.00 progeny (22) 2661.00 mangrove
(3) 7268.00 autosome (23) 2401.00 basal
(4) 6489.00 ecosystem (24) 2401.00 biogeography
(5) 5516.00 allele (25) 2271.00 virology
(6) 4477.00 antigen (26) 2271.00 clack
(7) 4348.00 epitope (27) 2271.00 genome
(8) 4218.00 biofuel (28) 2076.00 epidemiology
(9) 4153.00 taxon (29) 1947.00 homozygote
(10) 4088.00 plasmid (30) 1947.00 longitudinal
(11) 3893.00 speciate (31) 1882.00 insecticide
(12) 3699.00 weight (32) 1687.00 phenotypic
(13) 3309.00 biodiversity (33) 1557.00 pathogen
(14) 3180.00 genotype (34) 1492.00 parsimonious
(15) 3050.00 phylogenetic (35) 1492.00 genus
(16) 3050.00 phylogeny (36) 1428.00 polygyny
(17) 2920.00 avian (37) 1428.00 damselfly
(18) 2855.00 lineage (38) 1298.00 somatic
(19) 2790.00 fetal (39) 1298.00 cardiovascular
(20) 2790.00 biomass (40) 1298.00 topology
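The keyness idea described above (how many times more likely a word is in the input corpus than in the reference) can be sketched as a ratio of normalized frequencies. LexTutor's exact computation may differ, and the figures below are invented for illustration.

```python
def keyness(freq_target, size_target, freq_ref, size_ref):
    """Keyness as a ratio of normalized frequencies: how many times
    more likely a word is in the target corpus than in the reference.
    (A sketch; LexTutor's exact computation may differ.)"""
    target_rate = freq_target / size_target
    # Treat words absent from the reference as occurring once
    ref_rate = max(freq_ref, 1) / size_ref
    return target_rate / ref_rate

# Invented figures: a term occurring 40 times in a 100,000-word
# biology corpus vs. twice in a 10,000,000-word spoken reference
print(round(keyness(40, 100_000, 2, 10_000_000)))  # 2000
```

Words with the highest values, like the biology terms listed above, are frequent in the specialized register but rare in casual talk.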

CLiC (v.1.6.1) is a web application developed by a team from the University of
Nottingham and University of Birmingham that provides a set of interactive
tools to analyze literary texts (concordance, clusters, subsets, and key words). The
focus here is on merging corpus stylistics and computer-assisted methods to study
literature and discover new insights into how readers perceive fictional characters.
As part of CLiC, the CLiC Dickens project was launched in December 2017, with
an accompanying CLiC Activity Book for use in the English classroom to study
Dickens’s texts. With the CLiC web app (http://clic.bham.ac.uk/teachers), teach-
ers can create activities directly illustrating corpus-based applications in language
and literature teaching. CLiC allows users to explore 15 Dickens novels as well as
an expanding collection of texts by other authors (e.g., works by Austen, Bronte,
Hardy, Doyle, Wilde, and many others). Reference corpora like 19th-Century
Children’s Literature and a 19th-Century Reference Corpus are available for
comparison. CLiC classroom activities will enable students to explore and ana-
lyze literary texts themselves, formulate their own research questions, and extract
data and text samples for detailed analysis and interpretation.
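The concordance lines that tools like CLiC return are Key Word in Context (KWIC) displays. A minimal Python sketch of the idea, using an invented one-line sample:

```python
def kwic(text, keyword, width=3):
    """Key Word in Context: each occurrence of keyword with `width`
    words of context on either side (a minimal concordance sketch)."""
    words = text.split()
    lines = []
    for i, w in enumerate(words):
        if w.lower().strip(".,;:!?") == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{w}] {right}".strip())
    return lines

sample = "It was the best of times, it was the worst of times"
for line in kwic(sample, "times"):
    print(line)
```

Scanning concordance lines like these is how students can notice the recurring patterns around a word or phrase in a literary text.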
The following tables (Tables B3.1 and B3.2) list tools, developers, and their
current online locations. I selected and highlighted 'teacher-relevant' software
developed by corpus linguists like Anthony (in addition to AntConc and TagAnt)
as well as by computational linguists or researchers in the field of NLP with
whom some in the CL world may not be familiar.

Table B3.2  Other relevant tools/software

Tool/software Developer/s or location What it does


Concordancer University of Concordancer is a free online concordancer and
Pennsylvania visualizer for frequency counts and text clouds
obtained from an input corpus. Available at:
http://www.spaceless.com/concordancer.php
Corpkit Daniel McDonald Corpkit can be used for parsing, concordancing
University of and key word searches, as well as extracting
Melbourne combinations of lexical and grammatical features,
lemmas, and words in certain positions within
clauses. Available at: http://interrogator.github.io/
corpkit/index.html
CorporaCoCo Anthony CorporaCoCo is a tool with a set of R functions
Hennessey used to compare co-occurrences between corpora.
(Users must have an understanding of R.) Available
at: https://cran.r-project.org/web/packages/
CorporaCoCo/README.html
LancsBox Lancaster LancsBox or The Lancaster Desktop Corpus
University Toolbox is a software package used to analyze
new or existing data in any language. Available at:
http://corpora.lancs.ac.uk/lancsbox/
Mondofacto Mondofacto MondoFacto Word Visualizer is another free visualization
Word tool that allows users to see collocations of a word
Visualizer and the type of relationship the words have with the
original text based on colors. Available at: http://
www.mondofacto.com/word-tools/visualiser.html

Natural NLTK Project NLTK is a program for the analysis of natural
Language language (for English). It is written in the
Toolkit Python programming language and may
(NLTK) require some level of familiarity with Python.
Available at: http://www.nltk.org/
Tag Crowd Daniel Steinbock Tag Crowd is a word cloud generator that allows
users to upload documents and visualize word
frequencies. Available at: https://tagcrowd.com/
TextSTAT Matthias Hüning TextSTAT is a tool for the creation and
manipulation of linguistic data from different
languages. The program allows the storing and
statistical analysis of corpus data. Available at:
http://neon.niederlandistik.fu-berlin.de/static/
textstat/TextSTAT-Doku-EN.html
Tweet NLP Carnegie Mellon Tweet NLP is a tweet tokenizer, POS Tagger,
and a dependency parser for tweets, along with
annotated corpora and web-based annotation
tools. Available at: http://www.cs.cmu.edu/~ark/
TweetNLP/cluster_viewer.html
Visuwords Goocto Visuwords is a dictionary and thesaurus
database that allows users to look up a word
and see not only the definitions but also its
synonyms and antonyms with accompanying
colors and visuals. Available at: https://
visuwords.com/
Word It Out Enideo Word It Out is a free word cloud generator that
allows users to type words to generate a specific
styled image. Available at: https://worditout.
com/word-cloud/create
WordSift Kenji Hakuta WordSift is a visualizer tool and word cloud
Simon Wiles generator that allows users to see synonyms,
Stanford collocations, and frequency of words within a
text/corpus. Available at: https://wordsift.org/
WordSmith Mike Scott, WordSmith Tools is a concordancer, key word/
Tools Lexical Analysis collocation extractor, and KWIC visualizer
Software Ltd developed by Mike Scott (Aston University)
that can analyze English as well as other
languages. Available at: http://www.lexically.
net/wordsmith/version4/
WordStatix Massimo Nardello WordStatix is a corpus analysis tool used to
create concordances and can track specific
words, including their prefixes or suffixes.
Available at: https://sites.google.com/site/
wordstatix/
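Most of the concordancers and frequency tools listed in Table B3.2 are built around the same two core operations: producing key word in context (KWIC) lines and counting word frequencies. The following sketch is purely illustrative (the function names and sample sentence are invented for this example, not taken from any tool above); real tools add proper tokenization, tagging, and an interface on top of these basics.

```python
import re
from collections import Counter

def kwic(text, keyword, width=30):
    """Key word in context: return one padded line per match of keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

def top_words(text, n=5):
    """Case-folded word-frequency count, most frequent first."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(n)

sample = ("Corpora help teachers notice how language is used in context. "
          "A corpus shows language patterns that intuition alone may miss.")

for line in kwic(sample, "language"):
    print(line)
print(top_words(sample, 3))
```

Running the sketch prints two aligned concordance lines for *language* plus the three most frequent tokens, which is essentially what AntConc or TextSTAT display interactively.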
B3.4 Annotated Bibliography of CL Studies
with Tamanna Mostafa and Melinda Childs


Costas Gabrielatos publishes and maintains a list of publications, including PhD
dissertations, which focus on the use of corpora and CL techniques in dis-
course studies, that is, corpus studies focusing on discourse issues. The list is
not directly about CL in the classroom, but teachers will benefit from visiting
this page to find articles that can lead to sample corpora, related tools, and
relevant analytical approaches. Gabrielatos’s bibliography is available at: https://
www.edgehill.ac.uk/english/dr-costas-gabrielatos/?tab=bibliography-corpus-
approaches-to-discourse-studies.
The Learner Corpus Bibliography (http://www.learnercorpusassociation.
org/resources/lcb/) publishes its collection of bibliographical references on
learner corpus research. The bibliography now contains over 1,000 references, is
updated regularly, and is fully searchable by fields such as author, title, or publication
year, as well as by key words and languages (L1/L2). Electronic versions of draft
articles or links to web pages are also available.
The following tables (Tables B3.3–B3.5) provide an annotated bibliography
of CL papers published from 2010 to late 2017 on the themes CL in the classroom,
CL and writing, and CL and spoken learner data.

Table B3.3  Annotated bibliography: CL and language teaching and learning

Article Summary Corpus

1 Akeel, E. S. The focus of this article is to provide an A collection of


(2016). The Role overview of using corpus data for the purpose NS and learner
of Native and of vocabulary test design. It makes use of corpora.
Learner Corpora native speaker (NS) and learner corpora
in Vocabulary Test available for item writers to use. The article
Design. English discusses the benefits and limitations of using
Language Teaching, corpus data in language testing and argues for
9(7), 10. the usefulness of using NS and learner corpora,
specifically in vocabulary test development.
2 Almutairi, N. The paper uses corpora and corpus tools to Personal
D. (2016). The investigate the characteristics of personal statements from
Effectiveness statements of prospective law students law students in
of Corpus- written for their law school application. Saudi Arabia
Based Approach The study also explores the usefulness of a collected using
to Language corpus in developing activities on how to Sketch Engine.
Description in write a personal statement (and the overall The corpus has
Creating Corpus- pedagogical implications of the use of 67 personal
Based Exercises corpus-based activities in language teaching). statements with
to Teach Writing Results suggest that extracting the lexico- a total word
Personal Statements. syntactic features of law school personal count of 50,691.
English Language statements contributes positively to writing
Teaching, 9(7), 103. instruction.
3 Bardovi-Harlig, This article describes how to develop MICASE,
K., Mossman, S., teaching materials for pragmatics especially texts
& Vellenga, H. E. instruction based on authentic language from classroom
(2015). Developing through a spoken corpus. The use of a discussions,
Corpus-Based corpus in conjunction with textbooks to lectures,
Materials to identify pragmatic routines for speech acts and advising
Teach Pragmatic is presented in the article, together with meetings.
Routines. TESOL suggestions on extracting appropriate
Journal, 6(3), language samples. The focus of the
499–526. language samples is to help students
notice how language is used in context.
A step-by-step guide for working with a
corpus for pragmatics teaching with units
on agreements, disagreements, and other-
and self-clarifications is highlighted for
academic discussion in an EAP class.
4 Boulton, A. This timeline article looks at explicit uses Various corpora
(2017). Corpora in of corpora in foreign and second language (e.g., multimedia
Language Teaching learning and teaching (e.g., whether students corpora, ELISA,
and Learning. explore corpora directly via concordancers corpora of
Language Teaching, or as integrated in CALL software or transcribed
50(4), 483–506. indirectly with prepared, printed materials).
The rationale is that such contact provides academic and
“massive contextualized exposure needed for professional
language learning.” The goal of the paper texts).
is to provide an overview of the evolution
of the field of CL and teaching from the
beginning to the present. The primary
objective is to focus on empirical evolutions
of DDL alongside measurable gains in
language acquisition.
5 Chambers, A., Farr, This paper explores various ways teachers The Business
F., & O’Riordan, can integrate corpus approaches into their Letter Corpus
S. (2011). Language classroom teaching. After situating the including over
Teachers with topic in relation to current research and one million
Corpora in Mind: practice in ICT and language learning, the words of British
From Starting Steps paper discusses the application of selected and American
to Walking Tall. resources, suggesting how these can provide English.
Language Learning examples of naturally occurring discourse
Journal, 39(1), for use in the language classroom (e.g.,
85–104. in business and in everyday face-to-face
interaction). Online corpus resources with
built-in concordancers requiring no prior
technical training are presented, followed
by a discussion of challenges and current
limitations.


6 Chang, C. F., & Kuo, C. H. (2011). A Corpus-Based Approach to Online
Materials Development for Writing Research Articles. English for Specific
Purposes, 30(3), 222–234.
Summary: The article notes the increasing interest in the possible applications
of corpora and a genre-analytic approach to discipline-specific materials
development. Using a word frequency list, move codes are tagged in the
corpus in order to identify moves and move patterns that can help in
developing research-based online teaching materials for graduate students
of computer science. Examples of specialized vocabulary, grammatical
usage, and move structures that describe the discourse of computer science
are presented across learning tasks, discussion topics, and online writing
models. The article ends with a discussion of the usefulness and effectiveness
of the online RA writing materials, based on student feedback and course
evaluation.
Corpus: RA corpus consisting of 60 research articles from three major journals
in computer science. A word frequency list from the corpus was analyzed
to develop a vocabulary profile of RAs, and move analysis was also
conducted (based on a self-developed coding scheme of rhetorical moves).
7 Chang, P. (2012). This paper proposes a “textlinguistic” Web-based
Using a Stance approach in teaching advanced academic stance corpus,
Corpus to Learn writing to complement corpus approaches for allowing users
about Effective sentence-level lexico-grammatical instruction to study both
Authorial (with seven L2 doctoral students in the social the linguistic
Stance-Taking: sciences as participants). Chang examines how realizations of
A Textlinguistic L2 writers polish their research argument as stance at clause/
Approach. a result of improved stance deployment and sentence level
ReCALL: whether a web-based corpus tool can provide and how stance
the Journal of a constructivist environment which prompts meanings are
EUROCALL, the learners to infer linguistic patterns to made at the
24(2), 209–236. achieve deeper understanding of stance rhetorical move
use and patterns. Results suggest a positive level.
relationship between writing performance
and more accurate use of stance. However,
the application of higher-order cognitive
skills (e.g., inferring and verifying) is found
to be infrequent in the corpus environment.
Instead, participants use more lower-level
cognitive skills (e.g., making sense and
exploring) to learn. The paper concludes that
the learning of stance and stance patterns
is critically contingent on the surrounding
contexts, but overall, a clear authorial stance-
taking plays a critical role in developing an
effective academic argument.
8 Charles, M. (2012). Charles explores the feasibility of a student- EAP students
‘Proper Vocabulary collected corpus in multidisciplinary construct and
and Juicy classes of advanced-level students. The examine their
Collocations’: EAP course consists of six weekly, 2-hour own individual,
Students Evaluate sessions focusing on academic writing, discipline-
Do-It-Yourself with feedback data from 50 participants specific corpora.
Corpus-Building. presented and discussed in the paper.
English for Specific Over 90% of study participants found
Purposes, 31(2), it easy to build DIY corpora and most
93–102. succeeded in constructing a corpus of 10–15
research articles. Students in general were
enthusiastic about working with their own
corpora, and about 90% of them agreed
that their corpus helped them improve
their writing. Most of them mentioned
that they intended to use it in the
future. Students view corpora as a useful
resource in writing effective discipline-
specific texts. Participants’ attitudes and
experiences are also discussed in the paper,
and Charles also presents the issues and
problems that arise in connection with DIY
corpus-building.
9 Charles, M. This study, also by Charles, a follow-up International
(2014). Getting to the aforementioned study, reports graduate
the Corpus Habit: on the personal use of DIY corpora by students (40–50)
EAP Students’ EAP students. A year after completing built and
Long-Term Use of a corpus-based EAP writing course, examined their
Personal Corpora. students were asked to respond to an email own corpora of
English for Specific questionnaire which asked them about research articles
Purposes, 35, 30–40. their use of the corpus they collected. and were asked
Results show that 70% of the respondents to describe
used their corpus in one way or another. their use of the
Case studies of the use and a nonuse of corpus a year
DIY corpora are presented, highlighting after completing
two other key factors likely to affect corpus a corpus-based
use: the individual’s writing process and course.
the focus of their current writing concerns.
10 Chen, H. H. (2011). Previous studies suggest that existing web- Web-based
Developing and based concordancers have not been very collocation
Evaluating a Web- helpful in retrieving accurate and specific retrieval tool,
Based Collocation collocations meaningful for language WebCollocate,
Retrieval Tool for learners. This paper describes and explores which is
EFL Students and the applications of a web-based collocation based on the
Teachers. Computer retrieval tool, WebCollocate, in facilitating the POS-tagged
Assisted Language search for collocations. Gutenberg
Learning, 24(1), corpus.
59–76.


WebCollocate and the Hong Kong
Polytechnic web concordancer were used
by two groups of college EFL students
to find proper collocates in a translation
task. Results show that the students who
used the WebCollocate tool found more
proper English collocates than the other
option. A group of 35 pre-service English
teachers invited to evaluate the program
also reported that they could easily
find proper English collocates applicable for
learners from WebCollocate.
11 Comelles, E., Laso, This article emphasizes that the integration A database
N. J., Forcadell, of corpora and corpus tools into classroom (created in
M., Castano, activity enables both teachers and learners to Moodle) of
E., & Feijoo, S. reflect on genuine data with the assistance clause patterns
(2013). Using of computer technologies. The focus here is that was
Online Databases to illustrate how an online corpus database developed by
in the Linguistics for the teaching and learning of clause students. Each
Classroom: Dealing patterns can be fully integrated into language student had to
with Clause teaching. A continuous assessment task choose a text
Patterns. Computer especially designed for the undergraduate from any genre
Assisted Language course Gramatica Descriptiva de l’Angles II (such as novel,
Learning, 26(3), (GDAII) or Descriptive Grammar of English II is newspaper, or
282–294. the primary data source. Students responded children’s story),
with positive feedback to the newly designed and then they
online database. extracted eight
sentences from
that text that
exemplified
different
clause-patterns.
12 Conroy, M. A. In this paper, Australian EAL university Google, Online
(2010). Internet students were trained to use internet-based dictionaries,
Tools for Language tools, including corpus databases, and Compleat
Learning: techniques for language learning and then Lexical Tutor.
University Students were later surveyed as to their attitudes Participants
Taking Control about these resources. Conroy also measured were
of Their Writing. students’ competence in using the tools and introduced to
Australasian Journal their newly learned techniques to correct concordancers
of Educational errors in their writing. and Google-
Technology, 26(6). assisted
language
learning (GALL)
techniques.
The results show that the students are
generally enthusiastic about the tools and are
reasonably competent users of web-based
resources and techniques for independent
language learning. Conroy recommends that
internet-based corpus tools and techniques
should be valued more than they are by
Australian universities and further promoted
to support EAL writing in university
contexts.
13 Cortes, V. (2011). The purpose of this article is to introduce Various corpora
Genre Analysis the design, implementation, and collected as part
in the Academic comparison of two genre-based academic of corpus-based
Writing Class: writing classes for international graduate instruction for
With or Without students in US universities. One class was one learner
Corpora? Quaderns genre-based and corpus-based, while the group.
de Filologia-Estudis other class was genre-based only. Student
Lingüístics, 16, data were collected through questionnaire,
65–80. interviews, and ratings from final papers.
Cortes’s findings indicate that the use of
corpora and corpus-based tools resulted in
the acquisition of new knowledge by the
learners, although the evaluations of the
writing papers did not show significant
difference across the two learner groups.
In general, the students of both groups
were satisfied with genre-specificity and
disciplinarity, which were the primary foci
of both classes.
14 Cotos, E. (2014). This study explores the contribution of a Each student
Enhancing Writing local learner corpus to learner development participant used
Pedagogy with by investigating the effects of two types a corpus of
Learner Corpus of DDL activities: one relying on a NS 35–45 research
Data. ReCALL, corpus and the second combining NS and articles (about
26(2), 202–224. learner corpora. Both activities focused 30,000 words)
on improving L2 writers’ use of linking in their own
adverbials and were based on a preliminary discipline, which
analysis of adverbial use in the local learner was part of a
corpus produced by 31 study participants. larger corpus
Quantitative and qualitative data, obtained comprising
from writing samples, pre-/post-tests, and 40 academic
questionnaires, were analyzed. Results disciplines
show an increase in frequency, diversity, with a total
and accuracy in all participants’ use of of 1,322,089
adverbials, but more significant progress words.
was made by learners who were exposed to
the corpus containing their own writing.


15 Coxhead, A. (2011). In 2000, TESOL Quarterly published No corpus used.


The Academic Coxhead’s “A New Academic Word
Word List 10 Years List” introducing the AWL to TESOL
On: Research practitioners. The AWL has been widely used
and Teaching in EAP classrooms in many countries, and
Implications. especially in aiding instruction with a range
TESOL Quarterly, of teaching materials, especially vocabulary
45(2), 355–362. tests. In this 2011 article, Coxhead reflects
on the impact of the AWL by exploring and
responding to commonly asked questions
about the list: What is the AWL? Is the AWL
useful/adequate for a range of learners’ needs?
How can I help students learn academic
vocabulary? What materials using the AWL
are available? Her discussion also references
her intentions to update the list, hopefully
adding data from registers that emerged in the
previous 10 years.
16 Flowerdew, J. Flowerdew presents ideas and concepts Expert student
(2017). Corpus- in corpus research, specifically addressing corpora:
Based Approaches specialized academic writing. She reviews MICUSP,
to Language recent corpus studies relevant to ESP BAWE; Learner
Description writing, noting that the greatest advances in Corpora: the
for Specialized language description (an essential element International
Academic Writing. for syllabus design) in recent years have been Corpus of
Language Teaching, accomplished with the use of electronic Learner English
50(1), 90–106. corpora. The paper also features some caveats (ICLE); Lingua
of the corpus approach for future language Franca corpora:
teaching research, especially in the academic VOICE and
writing classroom. ELFA.
17 Geluso, J., & The study describes an ESL curriculum COCA.
Yamaguchi, A. based on data-driven and corpus-based
(2014). Discovering approaches to improve spoken fluency.
Formulaic Geluso and Yamaguchi also measure how
Language through successful students were in appropriately
Data-Driven using newly discovered phrases and
Learning: Student investigate students’ attitudes toward these
Attitudes and approaches in their learning of English.
Efficacy. ReCALL, Data came from questionnaires, student
26(2), 225–242. interviews, and student logs. The findings
suggest that students found data-driven
and corpus-based learning to be effective,
although they expressed some problems with
unfamiliar vocabulary and difficulty in using
a concordance (especially when there are
incomplete concordance lines).
18 Kennedy, C., & The study documents a semester-long CWIC
Miceli, T. (2010). apprenticeship in corpus use (which does not (Contemporary
Corpus-Assisted demand higher-level language proficiency). Written Italian
Creative Writing: The use of a corpus is introduced as an aid to Corpus).
Introducing “imagination in writing and then, to achieving
Intermediate accuracy through grammatical problem
Italian Learners solving.” Case studies of three students are
to a Corpus as presented focusing on their evaluation of the
a Reference approach and the students’ use of corpus and
Resource. Language bilingual dictionaries as reference resources
Learning and while writing. Drawing on insights from the
Technology, 14(1), case studies, Kennedy and Miceli outline a
28–44. working definition of corpus-consultation
literacy and identify areas for improvement in
this type of corpus-based instruction.
19 Leńko-Szymańska, The paper describes a teacher-training BNC, including
A. (2014). Is course on the use of corpora in language its various
this enough? education offered to graduate students at the interfaces (BNC
A Qualitative Institute of Applied Linguistics at University online, BYU-
Evaluation of the of Warsaw, Poland. Leńko-Szymańska BNC, Phrases
Effectiveness of a discusses the results of two questionnaire- in English),
Teacher-Training based studies from students who took the COCA, and
Course on the course. Results show that overall, the MICASE.
Use of Corpora students responded positively to the course,
in Language and they also saw the benefits of corpus-
Education. based tools and materials for language
ReCALL, 26(2), learning and teaching. The students also
260–278. reported that they needed more time to gain
full command of the resources and software
used in the course and added guidance on
the pedagogical issues related to corpus use.
20 Liu, D. (2011). To search for more effective and BNC, COCA,
Making Grammar “empowering” teaching of grammar, Liu and the
Instruction More examines the use of corpora for problem-based Time corpus;
Empowering: instruction in a college-level English grammar MICASE;
An Exploratory course. The data collected and analyzed Webcorp; and
Case Study of include students’ individual and group corpus Oxford English
Corpus Use in the research projects, reflection papers on corpus Dictionary
Learning/Teaching use, and responses to a post-study survey (online) were
of Grammar. consisting of both open-ended and Likert used by students.
Research in the scale-type questions. Four themes emerged in
Teaching of English, students’ use of, and reflections about, corpus
45(4), 353–377. study: (1) critical understanding of lexico-
grammar, (2) awareness of the dynamic
nature of language, (3) appreciation for the
context/register-appropriate use of lexico-
grammar, and (4) grasping of the nuances of
lexico-grammatical usages.


21 McCarthy, M. This paper considers what pedagogical Cambridge


(2016). Putting the grammar and reference grammar Learner Corpus
CEFR to Good teaching materials should ideally include (consisting of
Use: Designing based on corpus evidence from both Cambridge
Grammars Based NS and learner corpora. McCarthy examination
on Learner-Corpus demonstrates how learner corpora can scripts
Evidence. Language be used to track the emergence of a representing
Teaching, 49(1), grammatical feature from beginning to 150 different
99–115. advanced levels, how learners acquire new first language
grammar, and what teachers can learn backgrounds
from examining error-coded corpora. with 200,000
The study discusses the divide between examination
lexis and corpora and how this becomes scripts
“progressively blurred” and how corpus containing more
information can best be used to produce than 55 million
useful teaching materials for students at words).
different levels.
22 McGee, I. (2012). This paper analyzes what contribution four LTP Dictionary
Collocation English monolingual collocation dictionaries of Selected
Dictionaries as provide to ‘soft’ DDL inductive learning Collocations
Inductive Learning activities in the classroom. McGee describes (DSC), the BBI
Resources in Data- and compares the four dictionaries and Combinatory
Driven Learning examines data from concordance lines for a Dictionary of
- An Analysis series of DDL questions (with output from English (2009).
and Evaluation. the collocation dictionaries). The paper ends
International Journal by making a number of recommendations
of Lexicography, for the inductive use of the collocation
25(3), 319–361. dictionaries, and suggests ways in which
the dictionaries might be adapted in the
classroom.
23 Meunier, F. Meunier reviews the measurable effects Various
(2012). Formulaic and theoretical findings of the formulaic references and
Language and nature of language in instructed SLA. web-based
Language She also presents a rationale for tracing dictionaries;
Teaching. Annual elements of formulaicity as captured in WebCollocate.
Review of Applied corpora in the acquisition of English as a
Linguistics, 32, second language. The increasingly refined
111–129. understanding of the formulaic nature of
language is found to have clearly impacted
second language teaching, but Meunier
argues that there are still many gaps in
the actual benefits and in how formulaic
features could actually aid language
learning.
24 Miangah, T. M. This paper provides a thorough definition Various:
(2012). Different of corpora and their different types and BNC, Collins
Aspects of discusses the contribution of concordancers Wordbanks
Exploiting Corpora to language learning and teaching. A Online English
in Language summary of teaching and research areas Corpus,
Learning. Journal of that have benefited from linguistic corpora MICASE.
Language Teaching include language description, lexicography,
and Research, 3(5), morphology, syntax, semantics, cultural
1051–1060. studies, CALL, and ESP. Limitations and
future applications of the corpus approach
are discussed in the paper.
25 Mull, J., & The article provides an ethnographic Concordancers
Conrad, S. (2013). discussion of students’ use of concordancers and various
Student Use of in an ESL class. Two intermediate-level ESL corpora.
Concordancers for students were given a peer essay and were
Grammar Error asked to correct five highlighted errors.
Correction. The They then were asked to talk about the error,
ORTESOL Journal, what to correct, and how. Part of the process
30, 5–14. required the use of a concordancer and for
the students to search a corpus for related
patterns. The article reports the advantages
and promising applications of students using
concordancers to correct written errors. The
students were able to correct errors when
editing papers by learning how to discover
patterns of use from corpora.
26 Park, K. (2012). The paper examines how learners A corpus of
Learner–Corpus interact with a corpus system and the academic texts
Interaction: “microgenetic development” that emerge and a custom
A Locus of from this interaction. The corpus system search engine.
Microgenesis in is capable of retrieving highly relevant The corpus
Corpus-Assisted L2 textual examples tailored to individual system consisted
Writing. Applied needs. Data were collected from an of specialized
Linguistics, 33(4), undergraduate ESL composition course in academic papers
361–385. the US for one semester, with real-time electronically
screen recordings, corpus queries, and published in
oral-written reflections. Park documents free web-based
these interactions as evidence of learners’ journals in the
attempts to resolve issues by retrieving, areas of language,
evaluating, and appropriating the corpus culture, and
search results. Results show that the language
learners’ successful interaction with the learning. The
system relied on both their ability to corpus contained
interpret and exploit the searches and on 350,000 English
the corpus system’s ability to respond to words.
the learners’ particular needs.


27 Park, K., & The focus of this paper is the combined use Same as the
Kinginger, C. of real-time digital video and a networked previous study.
(2010). Writing/ linguistic corpus for exploring the ways in
Thinking in Real which these technologies enhance researchers’
Time: Digital capability to investigate the cognitive processes
Video and Corpus of learning. With the help of corpus search
Query Analysis. queries, the analysis of real time data can be
Language Learning extended to provide an explicit representation
and Technology, of learners’ cognitive processes. This innovative
14(3), 31–50. method applies to SLA, especially in writing
and exploring L2 writers’ composing process.
The paper argues that a writer’s composing
process is fundamentally developmental and
facilitated by means of an interactive process
(with a corpus).
28 Perez-Paredes, P., This study explores EFL learners’ (n = 24) BNC.
Sanchez-Tornel, behavior by tracking their interaction
M., & Alcaraz with corpus-based materials during focus-
Calero, J. M. on-form activities (“Observe, Search the
(2012). Learners’ Corpus, and Rewrite”). One group of
Search Patterns learners made no use of web services other
During Corpus- than the BNC during the “Search the
Based Focus-on- Corpus” activity, while the other group
Form Activities. was allowed to use other web services and/
International Journal or consultation guidelines. The overall
of Corpus Linguistics, performance of the second group was found
17(4), 482–515. to be better; the first group’s formulation
of corpus queries on the BNC was deemed
unsophisticated. The students in the first
group used the BNC’s search interface the
same way as they used Google or similar
resources. The researchers recommend that
careful consideration should be given to
the cognitive aspects of corpus searches, the
role of computer search interfaces, and the
implementation of corpus-based instruction.
29 Perez-Paredes, P., This article discusses the use and potential BNC.
Sanchez-Tornel, benefits of learning logs to study learners’
M., Alcaraz Calero, actual use of corpus-based resources. In
J. M., & Jimenez, P. tracking learners’ actual use of corpora,
A. (2011). Tracking the authors explore the number of events
Learners’ Actual or actions performed by each individual,
Use of Corpora: the total number of different web services
Guided vs. Non- used, the number of activities completed,
Guided Corpus the number of searches performed on the
Consultation. BNC, and the number of words or wildcards
extracted per BNC search.
Computer Assisted These parameters were used to examine
Language Learning, whether learner interaction with corpus-
24(3), 233–253. based resources differed under different
corpus consultation conditions, i.e., guided
versus non-guided consultation. Results
show that learners behaved differently in
both the number of different web services
used during the completion of tasks and the
number of BNC searches. The study suggests
that guided activities and learner-tracking
are necessary when teachers incorporate
corpus tools in the classroom.
30. Poole, R. (2012). Concordance-Based Glosses for Academic Vocabulary Acquisition. CALICO Journal, 29(4), 679–693.
Corpus: COCA.
Summary: This study compares the effectiveness of online textual glosses that are (1) enhanced with modified corpus-extracted sentences from concordance lines, and (2) enhanced with dictionary definitions drawn from an online learner’s dictionary. Poole aimed to determine which textual gloss technique would be most beneficial in helping intermediate to advanced language learners acquire academic lexical items. Learner attitudes toward the textual annotation techniques were also analyzed. Results show that participants in both experimental groups exhibited post-test gains in receptive and judgment tasks, but only the concordance-based group displayed improvement on the productive assessment. In addition, the concordance-based group members commented that the glosses were beneficial to them and that they would likely use them again. The dictionary group found the glosses ineffective.

31. Quinn, C. (2014). Training L2 Writers to Reference Corpora as a Self-Correction Tool. ELT Journal, 69(2), 165–177.
Corpus: Collins WordBanks Online and COCA.
Summary: This article reports on a corpus-training module that was implemented in an intermediate level EFL writing course and which taught students how to refer to corpora for the purpose of self-correcting teacher-coded errors. The full training sequence of the module is presented following a discussion of the students’ reactions to the process. The article guides teachers in preparing intermediate level L2 writers in learning about concordancers. The focus here is to offer students an alternative reference to traditional dictionary searches.



32. Rodgers, O., Chambers, A., & Le Baron-Earle, F. (2011). Corpora in the LSP Classroom: A Learner-Centered Corpus of French for Biotechnologists. International Journal of Corpus Linguistics, 16(3), 391–411.
Corpus: Various corpora for LSP.
Summary: The article points out that there is a growing body of research in the use of corpora for LSP, but learner evaluations of the activity are rare. The authors explore the use of a corpus of academic research articles on biotechnology in French (with native English-speaking university students of biotechnology as learners). After situating the study in the research context, they examine issues involved in the creation of an appropriate corpus and describe the integration of the corpus in the French language course. Finally, an evaluation of the learners’ reactions was conducted through questionnaires and semi-structured group interviews. Results show the positive application of corpus-based approaches, and students’ receptive attitude, especially with the use of learner-centered corpora.

33. Römer, U. (2011). Corpus Research Applications in Second Language Teaching. Annual Review of Applied Linguistics, 31, 205–225.
Corpus: BNC, COCA, MICASE, and ICLE.
Summary: Römer notes that corpora have not only revolutionized linguistic research but have also had a major impact on second language learning and teaching. She points out that applied linguists value what CL has to offer to language pedagogy, but that corpora and corpus tools have yet to be widely implemented in pedagogical contexts. The article provides a summary of pedagogical corpus applications and reviews recent publications that report on the merging of CL approaches and language teaching. CL in syllabus or materials design and the creation of class lessons/activities are overviewed. Römer illustrates how both general and specialized language corpora can be best adapted to the classroom and discusses directions for future research in applied CL.

34. Smith, S. (2011). Learner Construction of Corpora for General English in Taiwan. Computer Assisted Language Learning, 24(4), 291–316.
Corpus: WBC (WebBootCat) and Sketch Engine.
Summary: This exploratory study describes the DDL framework in general (non-major) English university classes, with learners directed to construct their own linguistic corpora. Smith argues that the process of creating a corpus “inculcates a sense of ownership in the learner” and that this approach contributes to their increased motivation in learning (especially if the corpus focuses on topics of great interest to the learner and matches their major field of study). The process of collecting the corpus leads to the acquisition of not only language but also useful transferable skills such as problem-solving competencies and knowledge of electronic tools. The study presents DDL applications and contexts in Taiwan and suggests that corpus construction is an important component of an effective DDL course. Participants (90 freshmen general English students) compiled and analyzed corpora, with overall positive results in terms of their increased motivation and signs of successful learning of English.

35. Walker, C. (2011). How a Corpus-Based Study of the Factors which Influence Collocation Can Help in the Teaching of Business English. English for Specific Purposes, 30(2), 101–112.
Corpus: The Bank of English (BoE) corpus, the BNC, and the British National Commercial Corpus (BNCC).
Summary: This paper presents two case studies of how CL can be used in business English instruction. The rationale here is that senior managers in multinational companies often find themselves needing more accurate business English models in learning how to best communicate in the international workplace. The paper highlights how a corpus-based investigation of the collocational behavior of key lexis features can be used to provide “sophisticated” vocabulary samples appropriate for senior managers. By studying collocations associated with a group of word synonyms, it is often possible to identify slight but noteworthy differences in the meaning of (business) words in the group, relevant for the specific needs of managers in high-impact intercultural English communication settings.

36. Xu, Q. (2016). Application of Learner Corpora to Second Language Learning and Teaching: An Overview. English Language Teaching, 9(8), 46.
Corpus: ICLE, LINDSEI, Chinese Learner English Corpus (CLEC), and Cambridge Learner Corpus.
Summary: This paper provides an overview of learner corpora, their types, and various applications, and how they might be efficiently utilized in classroom activities for second language learning. A detailed review of literature concerning the application of CL to SLA, especially focusing on contrastive interlanguage analysis, is highlighted. Xu then focuses on the relationship between learner corpora and foreign language teaching, calling for additional future research that may specifically include experimental settings typical in SLA studies.

37. Yoon, C. (2011). Concordancing in L2 Writing Class: An Overview of Research and Issues. Journal of English for Academic Purposes, 10(3), 130–139.
Corpus: No particular corpus used.
Summary: “Direct corpus use by learners or learner concordancing has been hailed as one of the promising areas that can revolutionize L2 writing and language pedagogy as a whole.” Yoon focuses on L2 writing in this study, examining how and to what extent concordancing activities have contributed to improved student texts as reported in related studies. The inclusion criteria for this review article were studies that provide information on the effects of corpus concordancing by learners on L2 writing and on learners’ evaluation of the process. Twelve studies included in the review show that, if proper training and assistance are provided, learner concordancing can be a valuable data source, research and reference tool, and a useful supplemental resource for enhancing the linguistic aspects of L2 writing and also for increasing learner autonomy.

38. Yoon, H., & Jo, J. (2014). Direct and Indirect Access to Corpora: An Exploratory Case Study Comparing Students’ Error Correction and Learning Strategy Use in L2 Writing. Language Learning and Technology, 18(1), 96–117.
Corpus: Texts from LexTutor.
Summary: This case study explores students’ use of a learning strategy in corpus-based writing revision and its effectiveness and benefit to the learners. Four Korean EFL students were asked to complete introspective and retrospective questionnaires and respond to interview questions as they used corpora for an error correction activity. The effects of corpus use on error correction, error correction patterns, and learning strategy use were the foci of collected data. Results show that a need-based approach to corpus use in L2 writing was effective for restructuring the learners’ “errant knowledge about language use.” This approach influenced learners to actively adopt cognitive learning strategies by performing as “language detectives.”

39. Young, B. P. (2011). The Grammar Voyeur: Using Google to Teach English Grammar to Advanced Undergraduates. American Speech, 86(2), 247–258.
Corpus: Web-as-corpus approach (through Google).
Summary: Young, a teacher of Modern English Grammar, maintains that when students are aware of how syntax affects the message, they understand the systematic nature of the language they are acquiring. Young notes that grammar is often characterized as “something that needs fixing,” a condition that has resulted in grammar teaching being viewed as a “mechanistic, anti-rhetorical enterprise relying on prescriptive drills and an emphasis on error avoidance.” This paper outlines several data-driven assignments utilizing a Web-as-corpus approach to connect formal grammar instruction to real-world usage. The assignments were referred to as “Grammar Voyeur” assignments, directing the students to observe natural language in use, rather than disjointed language in textbooks. Implications for corpus-based and data-driven materials design in grammar instruction are provided and discussed.


Table B3.4  Annotated bibliography: CL and learner writing

1. Chen, H. I. (2010). Contrastive Learner Corpus Analysis of Epistemic Modality and Interlanguage Pragmatic Competence in L2 Writing. Arizona Working Papers in SLA & Teaching, 17, 27–51.
Corpus: ‘BNC baby’ and the Chinese Learner English Corpus (CLEC).
Summary: This paper reports on a preliminary study of L2 learners’ interlanguage pragmatic development in academic written English discourse by examining how epistemic modality is used by NNS writers vs. NS writers with data from NS/NNS corpora. Chen also investigates how NNS writers develop interlanguage pragmatic competence in academic writing across L2 proficiency levels. Findings suggest a need for culture-sensitive curricula and explicit pragmatic instruction in writing classrooms.

2. Chen, Y. H., & Baker, P. (2010). Lexical Bundles in L1 and L2 Academic Writing. Language Learning and Technology, 14(2), 30–49.
Corpus: The Freiburg-Lancaster-Oslo/Bergen (FLOB) corpus and the BAWE corpus.
Summary: This study follows an automatic, frequency-based approach to identify frequently-used word combinations (i.e., lexical bundles) in academic writing. Lexical bundles retrieved from one corpus of published academic texts and two corpora of student academic writing were investigated both qualitatively and quantitatively.

3. Cotos, E. (2014). Enhancing Writing Pedagogy with Learner Corpus Data. ReCALL, 26(2), 202–224.
Corpus: A local learner corpus, an electronic collection of writing produced by the participants as course assignments (40 academic disciplines having 1,623 manuscripts with a total of 1,322,089 words).
Summary: Cotos explores a local learner corpus to identify the effects of two types of DDL activities: one relying on a NS corpus and the other upon a combination of NS and learner corpora. The objective of both types of activities was to improve L2 writers’ use of linking adverbials. Quantitative and qualitative data obtained from writing samples, pre-/post-tests, and questionnaire responses were data sources for the study. Results showed an increase in frequency, diversity, and accuracy in all participants’ use of adverbials, but more significant improvement was observed in students who were exposed to the corpus including their own writing.

4. Crompton, P. (2011). Article Errors in the English Writing of Advanced L1 Arabic Learners: The Role of Transfer. Asian EFL Journal, 50(1), 4–35.
Corpus: ICLE and the Arabic Learner English Corpus (95 essays with 42,391 words).
Summary: This article examines article system errors in a corpus of English writing by tertiary-level L1 Arabic speakers. Frequencies of articles are compared with those in native and non-native English speaker corpora. Crompton reports that the ‘commonest errors’ involve misuse of the definite article for generic reference. These errors are deemed to be caused by L1 transfer (rather than an interlanguage developmental order) as supported by corpus data.

5. Ha, M. J. (2016). Linking Adverbials in First-Year Korean University EFL Learners’ Writing: A Corpus-Informed Analysis. Computer Assisted Language Learning, 29(6), 1090–1101.
Corpus: A learner corpus composed of 105 essays produced by first-year university students in Korea; the control corpus was taken from the American LOCNESS sub-corpus.
Summary: The paper examines the frequency and usage patterns of linking adverbials in Korean students’ essay writing compared with NS English texts. The distribution of the different semantic categories of linking adverbials was nearly identical in the Korean writing and NS writing. The additive relation was most frequently used, followed by the causal, adversative, and sequential relations. Results show that Korean learners overused linking adverbials across all semantic categories when compared with LOCNESS texts.

6. Kennedy, C., & Miceli, T. (2010). Corpus-Assisted Creative Writing: Introducing Intermediate Italian Learners to a Corpus as a Reference Resource. Language Learning and Technology, 14(1), 28–44.
Corpus: CWIC (Contemporary Written Italian Corpus).
Summary: This study documents a semester-long apprenticeship in corpus use for creative writing applications. The corpus approach is introduced as an aid to learners’ imagination in writing and also as a resource to reinforce the correct use of specific grammar features. Corpus-based activities served as groundwork for students’ analysis of corpus data. Kennedy and Miceli describe the approach and also the results of their evaluation of its effectiveness through case studies of three students and their use of a corpus and a bilingual dictionary as reference resources while writing.

7. Laufer, B., & Waldman, T. (2011). Verb-Noun Collocations in Second Language Writing: A Corpus Analysis of Learners’ English. Language Learning, 61(2), 647–672.
Corpus: A learner corpus of about 300,000 words of argumentative and descriptive essays compared with a LOCNESS NS sub-corpus.
Summary: The use of English verb-noun collocations in the writing of NS of Hebrew at three proficiency levels was the focus of this paper. The authors extracted the 220 most frequently occurring nouns in the LOCNESS corpus and in the learner corpus and created concordances for them. Then, verb-noun collocations were also obtained for comparison. Results show that learners at all three proficiency levels produced far fewer collocations than NS.

8. Liu, G. (2013). On the Use of Linking Adverbials by Chinese College English Learners. Journal of Language Teaching & Research, 4(1).
Corpus: Chinese Learners’ English Corpus (CLEC) and the College Learners’ Spoken English Corpus (COLSEC); LOCNESS and the London-Lund Corpus (LLC).
Summary: This article examines the characteristics of Chinese EFL learners’ use of linking adverbials in speaking and writing through a comparison of learner and NS corpora. Using MicroConcord, linking adverbials were obtained for a contrastive study on the individual features used by the two groups. Data show that Chinese EFL learners used linking adverbials in their speech and writing overwhelmingly more frequently in comparison to NS. Chinese EFL learners tended to use more linking adverbials in their speech than in their writing, while NS data showed the opposite pattern, i.e., more linking adverbials in writing than in speaking.


9. Lu, X. (2011). A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-Level ESL Writers’ Language Development. TESOL Quarterly, 45(1), 36–62.
Corpus: The ESL writing corpus from the Written English Corpus of Chinese Learners (WECCL), with 3,554 essays (1,119,510 words in total) written by Chinese learners (aged 18–22 years) from 9 colleges.
Summary: This study evaluates 14 syntactic complexity measures as indices of language development in a corpus of college-level ESL writers. Data were analyzed by using a computational system developed for automatic analysis of syntactic complexity in college-level ESL writing. Results show a significant effect of the sampling measures on the mean values of most of the complexity measures. Ten complexity measures had significant between-level differences, and those measures showed several patterns of development. Correlations between the scores of the 14 complexity measures illustrate a strong relationship between measures of the same type, suggesting how complexity measures could be effectively used as indices of L2 writing development.

10. Luo, Q., & Liao, Y. (2015). Using Corpora for Error Correction in EFL Learners’ Writing. Journal of Language Teaching and Research, 6(6), 1333–1342.
Corpus: The Beijing Foreign Studies University (BFSU) CQPweb.
Summary: This paper reports the results of a small-scale study with 30 undergraduate students from 2 college English classes in China that explored the effects of using reference corpora during the process of revising essays in EFL. The BFSU CQPweb was used as a resource by the experimental group while correcting lexico-grammatical errors in writing. Findings reveal that corpora as a reference are more helpful than an online dictionary in enabling learners to produce accurate corrections and reduce errors in free production. Learners also showed a positive attitude toward corpus use in writing.

11. MacDonald, P. (2016). “We All Make Mistakes!” Analyzing an Error-Coded Corpus of Spanish University Students’ Written English. Complutense Journal of English Studies, 24, 13.
Corpus: UPV learner corpus from the MiLC corpus; the WriCLE corpus (750 argumentative essays written by Spanish learners of all proficiency levels).
Summary: MacDonald analyzes errors identified in written argumentative essays of 304 Spanish university students of English taken from two different corpora: one from a technical university context and the other from learners enrolled in the humanities. The study also explores the nature of errors coded in the corpus and the relationship, if any, between the learners’ level of competence and the type and frequency of errors they make. Results show that grammar errors are the most frequent and that the linguistic competence of the learners has a lower than expected influence on the most frequent types of errors coded in the corpus.

12. Paquot, M., & Granger, S. (2012). Formulaic Language in Learner Corpora. Annual Review of Applied Linguistics, 32, 130–149.
Corpus: No corpus used.
Summary: Paquot and Granger review learner corpus data and various methods utilized in the analysis of formulaic language. They provide an extensive discussion of findings from learner corpus research (LCR) on learner phraseology, distinguishing between co-occurrence and recurrence patterns. Emphasis is also placed on the relationship between learners’ use of formulaic sequences and potential transfer factors from the learners’ L1.

13. Thewissen, J. (2013). Capturing L2 Accuracy Developmental Patterns: Insights from an Error‐Tagged EFL Learner Corpus. The Modern Language Journal, 97(S1), 77–101.
Corpus: ICLE.
Summary: This paper examines L2 accuracy development trajectories and how they can be captured via an error-tagged version of an EFL learner corpus (from the ICLE). A subsection of 223 learner essays was used, with each essay manually annotated for errors following the Louvain error-tagging taxonomy and individually rated by two testing experts according to the Common European Framework linguistic descriptors for accuracy. A counting method, potential occasion analysis, which relies on both an error-tagged and a POS-tagged version of the learner data, was used to quantify the errors. Results indicate that the EFL error developmental patterns tend to be dominated by progress and stabilization trends.


Table B3.5  Annotated bibliography: CL and learner speech

1. Aijmer, K. (2011). ‘Well I’m not sure I think…’ The Use of Well by Non-Native Speakers. International Journal of Corpus Linguistics, 16(2), 231–254.
Corpus: Texts from the Swedish component of the LINDSEI corpus and its NS counterpart (LOCNEC), to examine similarities and differences between NS’ and NNS’ use of well.
Summary: This study explores the use of well as a pragmatic marker that helps speakers to organize and direct the conversation and to express specific feelings and attitudes. Aijmer focused on advanced EFL learners’ use of well using Swedish texts from the LINDSEI corpus as compared with LOCNEC (native English speakers) texts. Comparison data show that, overall, Swedish learners overuse well, although there are considerable individual differences observed in the corpus. Swedish learners use well primarily as a fluency device to cope with speech management problems. They use it infrequently for attitudinal purposes. Teaching applications are discussed, especially how pragmatic markers should not be taught the same way as other lexical items, given the importance of context in how these markers are learned or acquired.

2. Baumgarten, N., & House, J. (2010). I think and I don’t know in English as Lingua Franca and Native English Discourse. Journal of Pragmatics, 34(6), 1184–1200.
Corpus: Three audio-taped elicited conversations in English: one with a group of L1 English speakers and the other with L2 English speakers.
Summary: The article investigates the high-frequency collocations I think and I don’t know as markers of stance-taking by NSs and NNSs of English in L1 and ELF interaction. Stance expressions are seen as uniquely different in ELF and L1 English discourse because of the specific nature of ELF communicative situations, i.e., the speakers’ different L1s and the characteristics of their respective learner varieties in interaction may evoke ELF-specific patterns of stance-taking. Comparison data show that the co-occurrences of these collocation patterns align with syntactic and discourse features in a corpus of elicited conversation. While these collocations are among the most frequent stance-marking devices in both the English L1 and the ELF data, they show almost complementary distributions overall and only partially overlapping functional profiles.

3. Baur, C., Rayner, E., & Tsourakis, N. (2014). Using a Serious Game to Collect a Child Learner Speech Corpus. Proceedings of the Ninth International Conference on Language Resources and Evaluation, Reykjavik, Iceland.
Corpus: L2 child learner speech corpus (of German-speaking children learning English).
Summary: This paper documents the collection of an English-L2 child learner speech corpus from Swiss-German L1 students in their third year of learning English. The method of collection uses a web-enabled multimodal language game implemented using the CALL-SLT platform. In this setting, participants hold conversation prompts with an animated agent. Prompts consist of a short, animated English language video clip together with a German language piece of text indicating the semantic content of the requested response. The application or corpus data, text collection and annotation procedures, and an initial analysis of data are presented and discussed.

4. Brand, C., & Götz, S. (2011). Fluency Versus Accuracy in Advanced Spoken Learner Language: A Multi-Method Approach. International Journal of Corpus Linguistics, 16(2), 255–275.
Corpus: The German sub-corpus of LINDSEI and a native control corpus (LOCNEC).
Summary: This study presents a multi-method approach to the description of a potential correlation between errors and temporal variables of (dys)fluency in spoken learner language. Errors and temporal variables of fluency were quantitatively analyzed. Lexical and grammatical categories that are considered error-prone were identified from the LINDSEI corpus, as well as the problematic aspects of fluency for all learners, e.g., confusion in tense agreement across clauses or an overuse of unfilled pauses. Results from an analysis of data from five learners show no identifiable trend for the possible correlation of accuracy and fluency in speech. Fifty NS’ ratings of these five learners revealed that the learner with an average performance across the investigated variables received the highest ratings for overall oral proficiency.

5. Buysse, L. (2012). So as a Multifunctional Discourse Marker in Native and Learner Speech. Journal of Pragmatics, 44(13), 1764–1782.
Corpus: A learner corpus of informal interviews with 40 Belgian NSs of Dutch in their second or third year in higher education (aged 19–26 years).
Summary: This article explores the use of so as a discourse marker by Belgian students (NSs of Dutch) nearing the end of formal instruction in English. An interview corpus was compiled to determine the influence of distinct learning objectives in foreign language acquisition, with half of the learner participants majoring in English Linguistics, and the other half in Commercial Sciences. Comparisons of these two sub-corpora of Belgian learners’ data were also conducted with a comparable NS corpus. Primary results show that learners use so significantly more than their native English-speaking counterparts. Students of English Linguistics use so slightly more than students in Commercial Sciences. Implications for teaching and contrastive rhetoric are discussed.

6. Liu, B. (2013). Effect of First Language on the Use of English Discourse Markers by L1 Chinese Speakers of English. Journal of Pragmatics, 45(1), 149–172.
Corpus: A corpus of individual sociolinguistic interviews with 5 native English speakers and 10 L1 Chinese speakers.
Summary: This study investigates the role of native language (Mandarin Chinese) in the use of English discourse markers by L1 Chinese speakers of English. Individual sociolinguistic interviews with 15 participants (5 native English speakers and 10 L2 speakers) were collected to create a corpus for the study. Results show that three Chinese discourse markers were found to have some form of influence on the participants’ use of corresponding English expressions. The L1 Chinese speakers using the deliberative function of I think in medial or final position (while the native English speakers did not) may have transferred their use of I think from their L1 wo juede, because wo juede can mark the deliberative meaning in medial or final position. L1 Chinese speakers used yeah/yes as a back channel, while the native English speakers did not (potentially transferred from the corresponding Chinese expression dui).

7. Luk, J. (2010). Talking to Score: Impression Management in L2 Oral Assessment and the Co-Construction of a Test Discourse Genre. Language Assessment Quarterly, 7(1), 25–53.
Corpus: In all, 11 groups of interaction data (as part of school-based assessment in the Hong Kong education system).
Summary: This study documents an attempt to investigate students’ discourse performance in L2 oral proficiency assessments conducted in the context of peer group interactions in Hong Kong. Forty-three female Hong Kong secondary students were the participants. Discourse frames were characterized by features that seem to be ritualized, contrived, and colluded in speech. Interaction practices from the participants suggest a strong desire to maintain the impression of being effective interlocutors for scoring purposes rather than for authentic communication. Implications for test construct validity and the students’ L2 oral proficiency development are discussed.

8. Polat, B. (2011). Investigating Acquisition of Discourse Markers through a Developmental Learner Corpus. Journal of Pragmatics, 43(15), 3745–3756.
Corpus: A developmental learner corpus from a year-long recording of one language learner.
Summary: This paper explores a developmental learner corpus to examine discourse markers used by one adult language learner over the course of one year. Results show very different patterns of use and development among three focal discourse markers. You know was greatly overused by the participant, although its occurrences declined by 50% over the year; like increased from almost zero uses at the beginning of the year to over 2,300 occurrences per 100,000 words by mid-year, then dropped by 50% by the study’s end; well was not used at all as a discourse marker. Overall, the paper intends to show the usefulness of developmental learner corpora as a tool for studies of pragmatic acquisition, as well as the importance of considering naturalistic learners in the extensive description of second language pragmatics.

9. Salsbury, T., Crossley, S. A., & McNamara, D. S. (2011). Psycholinguistic Word Information in Second Language Oral Discourse. Second Language Research, 27(3), 343–360.
Corpus: Interview data from 6 L2 learners (interviewed every two weeks for a total of 18 sessions over a one-year period).
Summary: This study uses word information scores from the Medical Research Council (MRC) Psycholinguistic Database to analyze word development in the spontaneous speech data of six adult learners of English as a second language in a one-year longitudinal study. In contrast to broad measures of lexical development, such as word frequency and lexical diversity, this study analyzes L2 learners’ depth of word knowledge as measured by psycholinguistic values for concreteness, imageability, meaningfulness, and familiarity. The results provide evidence that learners’ productive vocabularies become more abstract, less context dependent, and more tightly associated over time. This observation suggests the acquisition of a deeper knowledge of L2 vocabulary and has important implications for how vocabulary knowledge can be measured in future studies of L2 lexical development.

Part C
Corpus-Based Lessons and
Activities in the Classroom
C1
Developing Corpus-Based
Lessons and Activities
An Introduction

Part C of this book presents various corpus-based lessons and activities developed
for the classroom and intended for a range of language learners. The three sub-
sections represent the primary themes of the lessons or activities: namely, (1) CL
and Vocabulary Instruction, (2) CL and Grammar Instruction, and (3) CL and
Teaching Spoken/Written Discourse. There may be overlaps in these themes as,
for example, a vocabulary lesson may also include some instructions related to
the discussion of a specific grammatical feature or structure. The three themes
represent common categories of activities in language classrooms, primarily
for English language learners, but may also be appropriate in settings such as
academic writing instruction or linguistic variation in spoken discourse for native
English speakers in various university-level classes. Instructors or materials
developers in intensive English programs, ‘study abroad’ courses or workshops,
EAP courses for professionals, writing and grammar courses, or sociolinguistics
courses (focusing on the study of linguistic variation) may also beneficially
utilize these model corpus-based lessons and activities.
The contributors of these lessons and activities were all my former students
in the master’s or doctoral programs at the Department of Applied Linguis-
tics and ESL at GSU. They have all taken my Corpus Linguistics or Technol-
ogy and Language Teaching courses, which required materials design projects for
the short-term language instruction of a linguistic feature (e.g., collocations,
linking adverbials, politeness markers, semantic categories of verbs) taught
with a corpus tool. The design of the lessons was often based upon a hypo-
thetical classroom with a specific group of learners that the lesson designers
were familiar with. However, some of them based their design on an actual
class (e.g., an intensive English program course on academic writing) that they
were teaching. After completing the program at GSU, most of the contributors
found teaching positions at various universities in the US or abroad, teaching in
ESL/EFL or EAP settings. Others continued to pursue doctoral studies or work
in research and administrative contexts. In many cases, the contributors have,
over the years, utilized corpus-based lessons in their classes and have presented
papers focusing on CL and language teaching in many national and interna-
tional conferences (e.g., TESOL Convention or the American Association for
Corpus Linguistics Conference).
The tools and databases utilized by the contributors include the COCA,
AntConc, WordandPhrase.Info, Compleat Lexical Tutor (especially Text Lex Com-
pare and VocabProfile), WebParaNews (WPN), Text X-Ray, and others. Top-
ics range from academic writing to analyzing spoken data to content-based
EAP (e.g., legal English and aviation English). The format of the lessons or
activities may follow short lesson plans for a course in a designated computer
lab (about 45 minutes to an hour) or a homework assignment, with a work-
sheet distributed to students. Some of the lessons describe a longer sequence of
corpus-based instruction or activities, such as combining data gathering and
interpretation, a comparison of linguistic features, or students collecting and
analyzing their exploratory corpora or completing a writing assignment. After
each complete lesson segment or handout, a short interview with contributors,
mainly focusing on what they believe are the important contributions of CL to
their teaching, is provided. The contributors highlighted tips and suggestions
for teachers and provided ideas for developing related lessons/activities.
Before Sections C2–C4, Section C1 provides a detailed case study of my
experiences in developing a corpus-based EAP course for university-level
students of forestry in the US. In the following sections, I present the creation of
this course, the specific corpus components, student comments and feedback,
and my overall impression of the course. A paper that described this short-
term corpus-based lesson and student vs. professional writing comparisons en-
titled “Developing Research Report Writing Skills Using Corpora” (Friginal,
2013b) was published by the English for Specific Purposes Journal. A conference
presentation and paper coauthored with forestry professors Thomas Kolb and
Martha Lee, and my corpus linguist colleagues Nicole Tracy-Ventura and Jack
Grieve (2007), “Teaching Writing within Forestry,” briefly documented the
development of the course, its rationale, and perspectives from forestry pro-
fessionals (Proceedings from the University Education in Natural Resources Conference
2007, Oregon State University). I am, as readers of my work would expect,
strongly recommending the corpus-based approach in similar EAP settings and
contexts but also providing a discussion of potential challenges and limitations.

C1.1  CL for an EAP Course: A Case Study


My first set of corpus-based lessons was developed for a university-level writing
course for students of forestry. As a writing consultant and a graduate teaching
assistant at a School of Forestry (SOF) at NAU, I was first tasked to propose the
structure and components of a Writing in Forestry course. The primary goal of
the course was to replace the required sophomore-level writing course taught
by the English Department with this new one, taught within the School. The
course followed the typical structure of an EAP curriculum, emphasizing the
need to improve students’ experience and competence in the types of writing
used in upper division forestry courses and by professional foresters. I received
enthusiastic encouragement and support from SOF faculty, allowing me, as a
language and writing expert, to gain more understanding of the writing re-
quirements they have in their courses. I was amazed at the level and rigor of
writing expected of undergraduate students in this forestry program. Before
the creation of my actual course, my work (and that of two other former writing
consultants) centered on providing informal writing instruction to students
and developing and presenting occasional workshops on writing in undergrad-
uate forestry at the School. Several classes required students to seek suggestions
or tutoring for improving their writing for specific laboratory assignments.
I was pleasantly surprised at the students’ enthusiasm in discussing with me
their work, and I had students lined up for my office hours, especially when a
major assignment was due. During my course development stage, I produced
a draft syllabus and a coursepack of instructional materials that incorporated
data from corpora. I visited forestry courses that required written assignments
and interviewed faculty about the types and registers of academic writing in
their courses. I started collecting various corpora of published, peer-reviewed
academic journals (including papers written by forestry professors from the
School). The assignments for the course were developed specifically to teach
writing skills to junior- and senior-level students.
The Writing in Forestry course was approved by the university as a required
course in the B.S. in Forestry program. Freshman-level English Compo-
sition was a prerequisite, and students took the course in their sophomore
year (and before taking a forest measurements class that requires several
laboratory-style reports). It was a two-credit course designed to provide a
survey and overview of written registers in the professional forestry program
and in forestry careers in the US. Part of its description included references
to “corpus-based approaches” and “corpora and academic and professional
writing” (Kolb et al., 2007). Students were introduced to the structure and
rhetorical styles of (a) annotated bibliography, (b) technical synthesis papers,
(c) laboratory reports, (d) forestry-based memos, (e) professional “opinion
pieces,” and (f ) selected sections of a forest management plan. The tasks and
activities identified for the course were intended to heighten students’ aware-
ness of the written communication component of the professional forestry
program and to serve as venues for practice in presenting content informa-
tion using persuasive, clear, concise, and accurate language. Table C1.1 provides
a list of weekly topics, highlighting parts that incorporated corpus-based
lessons and activities.

Table C1.1  M ajor topics and assignments in the Writing in Forestry course by week

Week Topics/assignments

 1 Class introduction, diagnostic essay activity


 2 Introduction to academic/forestry writing [Corpus Focus: Introduction to
academic written registers]
 3 Persuasive writing, letter to the editor writing, letter to the editor project
assigned [Corpus Focus: Exploring a corpus of “letters to the editor”
collected from major newspapers in the US]
 4 Annotated bibliography format, library research, annotated bibliography
project introduced and assigned
 5 Reference styles, summary/annotation writing
 6 Academic writing, synthesis paper assigned [Corpus Focus: Exploring the
linguistic characteristics of academic writing; brief lessons and activities
focusing on data from the academic registers of the COCA; discussion
of language in academic settings through comparison data from the
T2K-SWAL corpus, exploring writing by NS and learner texts from the
International Corpus of Learner English (ICLE)]
 7 Synthesis paper writing; citation styles
 8 Professional memo writing, professional memo project assigned [Corpus
Focus: Analyzing features of memos (and business letters) from a
specialized corpus of business letters and memos (some forestry-related
memos were collected from instructors in the School and from published/
public documents from various government departments and institutes)]
 9 Formatting tables and graphs; grammar lessons—referencing tables and
graphs in technical writing
10 Management plan (i.e., forest management plan) format; descriptive writing;
management plan project assigned
11 Management plan writing, data scanning and indexing
12 Lab report format, lab report project: introduction; lab report project
assigned [Corpus Focus: Understanding various sections of lab reports;
comparing professional and student research writing (using a corpus of
forestry papers published in leading forestry journals in the US); linguistic
topics—comparing linking adverbials, reporting verbs, verb tenses, and
others]
13 Lab report: methods, equations editor, results [Corpus Focus: Continuation
of the corpus-based topics mentioned earlier]
14 Lab report: discussion, abstract, appendix
15 Instructor consultations on lab report
16 Lab report due

My course started with a diagnostic essay assignment (“Why are you inter-
ested in forestry?”) to help me in conducting an initial assessment of each stu-
dent’s strengths and weaknesses in a more structured academic writing context.
The second week introduced students to more specific academic and technical
writing in forestry by discussing how forestry and scientific writing may be
different from other forms of writing (i.e., register comparison) they have done
in the past, including their freshman English prerequisite course. Empirical
fact and data-based argumentation were emphasized for this week’s discussion.
Corpus data for comparison and an introduction to CL were provided.
Persuasive writing was the focus of the third week, and I assigned my stu-
dents to write an approximately 200-word letter to the editor of the local news-
paper, expressing their opinion on a forestry-related topic of their choice. The
topic had to be current (e.g., forest fires, bark beetle infestation, forest lands
thinning concerns) and accessible to the general public. In the fourth week,
students were assigned to write an annotated bibliography on a forestry topic of
their choice. After choosing a topic, they conducted library research to identify
six relevant academic sources (i.e., journal articles, books, technical reports).
The sources had to provide different perspectives on the topic and come from
credible academic publications. Finally, each source had to be cited in the name-
year system and be followed by a one-paragraph annotation (100–250 words)
that summarized and evaluated the source and explained how this source was
relevant to the topic. The fifth week focused on developing the skills needed to
complete the annotated bibliography.
The sixth week expanded on the annotated bibliography assignment by expos-
ing students to writing for the purpose of synthesizing current knowledge about
a forestry topic (i.e., literature review for research reports). Students were assigned
to write a five- to seven-page technical synthesis paper on the forestry topic they
selected for the annotated bibliography. I encouraged them to emphasize in this
paper a key thesis or argument that was developed and supported by the findings
and ideas of the sources used for the annotated bibliography. The paper required an
introduction paragraph, which introduced the topic and presented the thesis state-
ment; a body, in which arguments were supported by synthesizing the sources from
the annotated bibliography; and a conclusion, wherein the thesis was restated and
summarized. The seventh week emphasized approaches for successful synthesis pa-
pers and citation styles, the eighth week focused upon professional memo writing,
and the ninth week demonstrated effective table and figure formats.
In the 10th week, students were introduced to the purpose, format, and
primary contexts of forest management plans, and were assigned to write the
current conditions section of a management plan for a hypothetical National
Forest: the “Greenville Forest.” They were provided with background data and
other information about the Greenville Forest and the following scenario that
framed the assignment: Improvements are desired to a forest service road that
carries commuter and recreation traffic through “Merganser Marsh,” an im-
portant wetland for wildlife and the home of several endangered species. The
paper required three sections: a Physical Setting section, in which the location,
climate, and transportation system of the Greenville Forest were described;
a Biological Setting section, in which the vegetation and the wildlife were de-
scribed; and a Social Setting section, in which the different recreational locations
of the Greenville Forest were described. Students worked on this assignment
in groups of three, exposing them to the collaboration and teamwork often re-
quired to produce many professional forestry management plans and, of course,
allowing them to benefit from the synergy of shared ideas and analysis. The
eleventh week focused on writing skills needed to complete the management
plan assignment, such as scanning a map and inserting the image into a text
document, and using word processing software to create an index.
The major component of the course, the lab report, was introduced and
assigned in the twelfth week, highlighting more specific corpus-based activi-
ties. For the report, students were provided with data on tree age, height, and
diameter at breast height (DBH), and were required to use a standard research
report format (introduction, methods, results, discussion) to address the follow-
ing questions in a six- to eight-page report:

1 Is there a relationship between tree age and tree height?


2 Is there a relationship between tree age and DBH?
3 Is there a relationship between DBH and tree height?
4 Is it possible to accurately predict tree age from DBH and/or height?
5 Are the results of your study on ponderosa pine in Northern Arizona appli-
cable to other species or to ponderosa pines growing in different regions?

The thirteenth to sixteenth weeks were devoted to demonstrations and
corpus-based exercises related to specific components of the lab report and my
review and consultation on the first draft of the report. The final student lab
report was submitted at the end of the course.

C1.1.1  Corpus-Based Writing Lessons and Activities

In further advocating for the role of CL in the language classroom, Gavioli
and Aston (2001) argue that

• Teaching syllabuses and materials should be corpus-driven.


• Corpora should be viewed as resources from which learners may learn
directly.
• Corpora can capture reality.
• Corpora can provide valid models for learners.

My students responded enthusiastically to classroom activities that high-
lighted key differences in linguistic features used by novice writers (i.e.,
students) in lab reports and those used by professional forest scientists from
the forestry corpus that I collected for the course. Discussions of these com-
parative features focused on the concept of academic written registers and
the specific features of writing defining sub-registers, including fiction,
nonfiction, narrative, analytical, or technical writing. Data visualization of
results from corpus descriptions of register-specific writing provided inter-
esting insights into the uniqueness of individual registers that students were
familiar with and exposed the systematic patterns of word use, structure,
and conventional word associations commonly employed by writers in the
same field. Figure C1.1 shows an example of register comparisons between
conversation and academic writing, and discussion topics I presented in
class. Although my students had limited background in linguistics or gram-
mar studies, they were able to draw connections between features such as
demonstrative pronouns (this and that) used very differently in speech and
in writing. I emphasized to my students that, intuitively, one might expect
that knowledge gleaned from corpus-based research that identified features
and systematic associations of patterns written in a particular field could aid
their writing in forestry, technical reports, memos, and many other future
applications.
Lesson Focus: The Grammar of Individual Words: Demonstrative Pro-
nouns this vs. that.
The traditional description of the difference from most grammar textbooks
and English instruction in the classroom:

Figure C1.1  Distribution of demonstrative pronouns this and that in conversation and academic writing, frequency per million words, adapted from Biber (2004).
This refers to a thing near the speaker


That refers to something that is not near the speaker

Examples of that in conversation:

That was delicious. (In this example, that was used by the speaker to
show vague or situational reference)

A: I was, I was flat on my back.


B: Uh, I can’t sleep like that (that here was used as a reference to the specific
situation)

Examples of this in academic writing:

GAAP requires that a business use the accrual basis. This means that
the accountant records revenues as they are earned… (This used in this
excerpt is an example of “text deixis,” defined as a reference to the reader
of the idea, topic, or specific item mentioned in the previous part of the
same text. Text deixis using this is more frequently used in academic
written registers than in other written registers.)

For exploratory comparisons, I collected first drafts of the lab report written by
my students, including similar papers written by other students in previous years
provided to me by instructors. My forestry corpus compiled 500 recent (from 2000
to 2005) refereed research articles from forestry and related journals (e.g., Forest
Ecology and Management, Forest Science, Journal of Forestry). As noted earlier, included
in the corpus are articles authored by faculty and graduate students in the School.
The novice/student corpus (total number of papers = 144) comprised mostly ‘first
attempts’ at forestry research reports written by the students in the School.

Note that I also attempted to collect a corpus of successful papers (e.g.,
papers that received a grade of A or those contributed by faculty as samples
of effective research reports in their previous classes). However, I was able
to compile only a few, and obtaining the appropriate institutional research
approval and consent forms from students, some of whom had already graduated
from the program, was a challenge. I decided to proceed with exploratory
comparisons using the professional texts, at least during the initial imple-
mentation of the course.

As the students’ initial research writing activity, this assigned paper focused
on simple forest measurement techniques. Much of the writing instruction I
provided in the classroom involved the discussion of technical report writing
conventions in citation, formatting tables and figures, and developing the pre-
sentation of results. With respect to language use, I focused on linguistic fea-
tures such as the use of linking adverbials, verb tenses (especially writing in the
passive), reporting verbs, pronoun use, and vocabulary features (leading into
key word comparisons). The measurement data were provided to the students,
and textbook chapters for citation in the “Review of Previous Works” section
were also provided. The School recommended a style and formatting manual,
Writing Papers in the Biological Sciences, 3rd Edition (McMillan, 2001), to aid stu-
dents in writing research reports across year levels. The format and mechanical
conventions (e.g., how to format tables and figures) in research writing fol-
lowed the suggestions from this manual.
With the collection of our two comparison corpora (student lab reports vs.
published forestry research articles), I provided the students with normalized fre-
quencies (normed per 1,000 words) of many of the target lexical features men-
tioned earlier. Specifically, these included verbs tagged into their “reporting”
categories (argue, show, find, claim), linking adverbials, and passive structures and
verb tenses, with the last three features drawing on definitions or categories from the
LGSWE as models. These groups of features were good examples because
they are all commonly used in technical research reports. Student concordancing
activities were conducted in a computer lab using MonoconcPro (Athelstan, 1999)
so that students could run their own comparisons and also obtain sample text
excerpts of features especially from the professional corpus. It was easy to demo
MonoconcPro to my students, and they immediately learned the basics of running
searches and analyzing concordance lines (“this is just like googling,” one stu-
dent commented). The concordancing exercises were facilitated using handouts
I developed with specific instructions about what the students needed to search,
count, normalize, and interpret. Discussion questions guiding small group activ-
ities were provided. The ultimate goal of the activities was to help students focus
on these features as they rewrote and finalized their lab reports. Sample data,
student interpretation and comments, and the primary results of comparisons and
concordancing activities are presented in the following.
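For teachers who want to try these activities without MonoconcPro, the core operations the handouts asked for (searching a word, extracting concordance lines, and norming raw counts per 1,000 words) can be sketched in a few lines of Python. This is a minimal illustration, not a description of MonoconcPro's own behavior; the sample text is invented.

```python
import re

def concordance(text, term, width=30):
    """Return simple KWIC (keyword-in-context) lines for a search term."""
    lines = []
    for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()].replace("\n", " ")
        right = text[m.end():m.end() + width].replace("\n", " ")
        lines.append(f"{left:>{width}} [{m.group()}] {right:<{width}}")
    return lines

def normalized_frequency(text, term, per=1000):
    """Count a term and norm the count per 1,000 running words."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    hits = sum(1 for t in tokens if t == term.lower())
    return hits / len(tokens) * per

sample = ("The results show that tree height increases with age. "
          "Previous studies show similar trends in ponderosa pine.")
for line in concordance(sample, "show"):
    print(line)
print(round(normalized_frequency(sample, "show"), 2))
```

Run over a folder of text files rather than a toy string, the same two steps yield normed rates that can be compared across a student corpus and a professional one, just as the worksheets describe.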

Reporting Verbs
The definition and phraseology of reporting verbs and reporting clauses in
my lessons were derived from the works of Francis, Hunston, and Manning
(1996), Hunston and Francis (1998); and Charles (2005). Reporting clauses fol-
low the verb patterns V + that (e.g., “You indicated that…”) and it be + V-ed
+ that (e.g., “It was stated that…”) in structures involving citation and general
reference. In a research setting, Charles (2005) examined the distribution of
reporting verbs in a corpus of academic theses written by native speakers (NS)
of English from several disciplines. She looked at the phraseological patterns
of reporting clauses by focusing on the structure of internal citation “with a
human subject/s.” Charles followed four categories of reporting verbs (Argue,
Think, Show, and Find, shown in Table C1.2) previously identified by Fran-
cis et al. (1996) (for an extensive discussion, see Francis et al. (1996, pp. 97–101)
and Charles (2005, pp. 10–11)). The same categories and the following list of
reporting verbs were used in my lessons. Clearly, the idea here was to show
my students that these verbs were very commonly used in academic writing
(particularly in research reports), especially in the introduction and review of
related literature sections. Charles (2005) and the LGSWE reported that the
most common reporting verb used in academic writing is “show” (and also
most of the show verbs, especially illustrate, indicate, and demonstrate). I provided
my students excerpts of novice student writing making use of the verb show
repeatedly in successive sentences, and I asked them to think of rephrasing the
segments to include possible replacements, with the intent of improving the
overall structure and flow of the excerpts (see the following two samples).

• As shown by Hill (2002) [not cited in this book’s bibliography], the mea-
surements did not indicate infestation. It also shows that trees were not
releasing pitch as defense against bark beetle attack. The white pitch tube
shows that the beetle was successfully repelled by the tree.
• Martin and Williams (2003) [not cited in this book’s bibliography] showed
that changes are not necessarily positive at this point. In some of the pictures
shown by the researchers, the needles on conifer trees, like pines, begin to
turn a reddish-brown color. Often the change begins at the top of the trees and
moves down. This shows that some trees’ color from green to brown will ….

Table C1.2  Categories of reporting verbs (adapted from Francis et al., 1996)

Argue Show Find Think

Argue Say Show Find Think


Suggest Add Illustrate Realize Hold
Assert Hypothesize Indicate Observe Assume
Note Insist Demonstrate Discover Feel
Predict Maintain Confirm Establish Hope
Write Propose Mean Infer Know
Explain Remark Reveal Recognize
Conclude Reply Identify
Mention Speculate Note
Admit Stress
Observe Contend
Accept State
Claim Report
Imply Postulate
Complain Acknowledge
Point out Posit

In discussing reporting verbs with my students, I emphasized the importance
of accurate citation of sources, which was vital in organizing the “Review of
Previous Works” and “Discussion and Conclusion” sections of their papers.
Students were given activities that followed technical citation conventions
(usually name + year in forestry, e.g., “Smith, 2011”), but essential discussion
of reporting verbs that accompany the citations may be highlighted in class-
room activities. The students helped me in manually obtaining reporting verb
frequencies, normalized per 1,000 words across our two corpora. Our com-
parative data for the professional and student corpora showed that the overall
frequency distribution of reporting verbs following Charles’s (2005) list was
almost identical (students = 15.07; professionals = 15.11 per thousand words).
However, further analysis indicated that the students exhibited a tendency to
‘overuse’ specific verbs such as show and find (including their lemmas), while
at the same time utilizing a limited range of reporting verbs in their research
papers in contrast to the professionals. The reason may, perhaps, be obvious,
and it was an interesting topic of discussion in class. My students mentioned
that, for the most part, academic writing was still a “developing process”
for them, and that novice writers still need to “acquire a more sophisticated
bank of words and jargon,” which certainly will include the use of a wider
range of reporting verbs in their research papers. In addition, the comparison
with professional papers was mainly exploratory and for teaching purposes,
and not ideally developed for a more consistent or balanced research setting.
Here, the students’ papers focused on simple measurement data that did not
allow for use of a more advanced range of linguistic features. The following
are two sample worksheets and brief interpretation and concluding statements
from my students (working in small groups) in concordancing and obtaining
normalized frequencies of reporting verbs across corpora (Table C1.3).

Table C1.3  Sample completed group worksheet (1): Verb use (arranged alphabetically,
normalized per 1,000 words and including their lemmas) in student
reports in the Writing in Forestry course (n = 144 reports) and 500 refereed
forestry research articles

Verb Student reports Professional articles


accept 0.000 0.428
acknowledge 0.000 0.045
add 0.240 0.151
admit 0.000 0.008
argue 0.000 0.071
assert 0.000 0.017
assume 0.133 0.382
claim 0.027 0.978
complain 0.000 0.004
conclude 0.240 0.132
contend 0.000 0.114
demonstrate 0.186 0.209
discover 0.027 0.025
establish 0.053 0.473
explain 0.213 0.432
feel 0.107 0.055
find 2.636 1.186
hold 0.000 0.115
hope 0.053 0.030
hypothesize 0.107 0.056
identify 0.213 0.458
illustrate 0.266 0.902
imply 0.027 0.118
indicate 0.080 0.875
infer 0.000 0.045
insist 0.000 0.003
know 0.455 0.199
maintain 0.107 0.280
mean 0.033 0.233
mention 0.133 0.074
note 0.373 0.293
observe 0.453 0.731
point out 0.000 0.042
posit 0.000 0.004
postulate 0.000 0.127
predict 0.586 0.409
propose 0.000 0.071
realize 0.000 0.043
recognize 0.027 0.140
remark 0.000 0.018
reply 0.000 0.001
report 0.053 0.672
reveal 0.107 0.191
say 0.160 0.063
show 6.233 1.599
speculate 0.000 0.011
state 0.453 0.155
stress 0.107 0.169
suggest 0.186 0.760
think 0.346 0.104
write 0.000 0.006

Sum 14.447 13.801
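The arithmetic behind a worksheet like this is simple: divide each verb's raw count (including its lemma forms) by the total number of running words in the corpus and multiply by 1,000. A minimal sketch of that step, using invented raw counts and corpus sizes purely for illustration:

```python
def per_thousand(raw_count, corpus_size):
    """Norm a raw frequency count per 1,000 running words."""
    return round(raw_count / corpus_size * 1000, 3)

# Hypothetical raw hits and corpus sizes, for illustration only.
student_size, professional_size = 75_000, 3_200_000  # running words
raw_hits = {"show": (467, 5117), "find": (198, 3795)}  # (student, professional)

for verb, (s, p) in raw_hits.items():
    print(f"{verb}: students {per_thousand(s, student_size)}, "
          f"professionals {per_thousand(p, professional_size)}")
```

Normalization is what makes the two columns comparable at all: the professional corpus is far larger, so raw counts alone would overstate the professionals' use of every verb.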



• [Sample Student Interpretation: Written Output] Total use of verbs was
about the same in the student reports (14.5 uses per 1,000 words) and
the professional research articles (13.8 uses per 1,000 words). The verbs
for the “show” and “find” categories were used most often in both the
student reports and professional articles, and their use was greater in
student reports than the research articles. The only other verbs com-
mon to the 10 most used verbs of both student reports and professional
research articles were “observe” and “illustrate.” Verb use in student
reports and also professional articles was dominated by “show” and
“find”; other verbs were rarely used by students. In contrast, the profes-
sional articles spread use over a larger number of verbs than the student
reports.
• [Sample Student Comment: Implications] “Forestry students need to
be mindful of these verbs and how we use them in our papers. I must
admit, I have not really paid attention to these words before. For ex-
ample, I did not notice my own use of the verb show (which I have also
repeatedly used in my papers). Having a list of options to choose from is
good. I don’t think the overall frequency of usage from the two corpora
is really that important to me, and I don’t see me following data from
the professionals all the time. But having a list is great. I can examine
my options and see what may work in replacing words that I repeat the
most” (Table C1.4)
• [Sample Student Interpretation: Written Output] The verbs show and find
were very commonly used in the two groups of texts, but clearly, their
standardized frequencies were very different. The students used show
(6.233 per thousand) and find (2.636) dramatically more than the pro-
fessionals (1.599 and 1.186 respectively). The verb claim was frequently
used by the professionals, but not the students; the verb hypothesize was
not on the list of popular verbs used by professionals, but it was popular
with students.
• [Sample Student Comment: Implication] “I learned a lot from the stan-
dardized comparisons. I think it makes it very accurate and systematic to
present data this way. One take away for me in this corpus activity is that
it’s like research. I was able to see actual patterns of usage, but this also
shows that I have various options. I don’t have to repeat the same word
over and over as it affects the quality of my writing. I learned to appre-
ciate that changes in meanings of words, when used in different contexts
[sic]. I think I do use illustrate and indicate quite frequently as well and it’s
good to be reminded that I can replace these words with confirm or recog-
nize, when appropriate.”

Table C1.4  Sample completed group worksheet (2): The 35 most frequently used reporting
verbs (ranked and normalized per 1,000 words, including their lemmas)
in student reports in the Writing in Forestry course (n = 144 reports) and
500 refereed forestry research articles

Student reports Use per 1,000 words Professional articles Use per 1,000 words

show 6.233 show 1.599


find 2.636 find 1.186
predict 0.586 claim 0.978
know 0.455 illustrate 0.902
observe 0.453 indicate 0.875
state 0.453 suggest 0.760
note 0.373 observe 0.731
think 0.346 report 0.672
illustrate 0.266 establish 0.473
add 0.240 identify 0.458
conclude 0.240 explain 0.432
explain 0.213 accept 0.428
identify 0.213 predict 0.409
demonstrate 0.186 assume 0.382
suggest 0.186 note 0.293
say 0.160 maintain 0.280
assume 0.133 mean 0.233
mention 0.133 demonstrate 0.209
feel 0.107 know 0.199
hypothesize 0.107 reveal 0.191
maintain 0.107 stress 0.169
reveal 0.107 state 0.155
stress 0.107 add 0.151
indicate 0.080 recognize 0.140
establish 0.053 conclude 0.132
hope 0.053 postulate 0.127
report 0.053 imply 0.118
mean 0.033 hold 0.115
claim 0.027 contend 0.114
confirm 0.027 think 0.104
discover 0.027 confirm 0.094
imply 0.027 mention 0.074
recognize 0.027 propose 0.071
accept 0.000 argue 0.071
acknowledge 0.000 say 0.063

Linking Adverbials
Linking adverbials (or adverbial connectors), also referred to as transition
words/phrases (e.g., however, for example, in addition), are considered necessary
features of academic and technical writing. The effective use of these linguistic
Developing Corpus-Based Lessons  201

devices is important in creating textual cohesion and logical flow of the nar-
rative, alongside coordinators and subordinators, because they clearly signal
the connection between passages of text (Biber et al., 1999). A critical aspect
of university-level research writing includes the development of logical argu-
ments supported by details and evidence in prose. The effective use of linking
adverbials certainly helps improve the flow of discussions and how ideas are
best organized in academic written discourse. In order to achieve the needed
cohesion in presenting arguments and supporting evidence, students in many
EAP writing classes are encouraged to use linking adverbials in their research
papers (Altenberg & Tapper, 1998; Tanko, 2004).
My students were as enthusiastic about and appreciative of the concordancing
activities focusing on linking adverbials as they were about their reporting
verb activities and discussions. We followed the same structure, with the
class in a computer lab working on worksheets that I developed, obtain-
ing normalized frequency counts and extracting text samples for analyses.
The comparison of professional and students’ corpora from our worksheets
showed that students’ papers used 7.58 linking adverbials in total, normalized
per 1,000 words; forestry professionals and practitioners had 12.12, also per
1,000 words. Further exploration of the data indicated that students not only
used a smaller number of linking adverbials but also used fewer
types. Students discussed this disparity in the figures from the corpora, and
we also talked about how to possibly best teach the use of linking adverbials
to students writing research reports. In creating the corpus-based handout
and concordancing activity for linking adverbials, I referenced the list of
commonly used transitions in the form of single adverbs and prepositional
phrases from LGSWE (Table C1.5).

• [Sample Student Interpretation: Group Output] We found that the distri-
bution of common linking adverbials was relatively lower in the student
reports (7.6 per 1,000 words) than forestry articles (12.2 per 1,000 words).
“Also” was used most frequently in both the student reports and the for-
estry articles. Seven words were common to the ten most frequently used
linking adverbials for each type of writing (also, then, however, as well as, for
example, although, and therefore). The abbreviated linkages e.g. and i.e. were
commonly used in the refereed articles, but were never used in the student
reports. It appeared that there were limited opportunities for students to
use these features. Similar to the findings for verbs, use of linking adver-
bials was spread over a larger number of words for the professional articles
compared with the student reports.
• [Sample Student Comments: Implication] (1) “I don’t know how to use e.g.
or i.e. (I did not check their actual meanings as well) and I haven’t used
these in my writing. I can see their importance in the concordance lines,
but these can be replaced by the more descriptive phrase (like, for example).

Table C1.5  Use of linking adverbials (arranged alphabetically, normalized per 1,000
words) in student lab reports (144 papers) and professional articles (500
articles)

Linking adverbials Student reports Professional articles

also 2.103 2.360
although 0.240 0.800
anyway 0.000 0.003
as well (as) 0.399 0.539
e.g. 0.000 1.111
finally 0.266 0.194
for example 0.266 0.583
for instance 0.000 0.095
furthermore 0.053 0.180
hence 0.000 0.160
however 0.719 1.610
i.e. 0.000 0.614
in addition 0.133 0.361
likewise 0.000 0.037
nevertheless 0.027 0.107
on the other hand 0.027 0.103
rather 0.053 0.339
similarly 0.027 0.147
so 0.772 0.482
then 2.024 0.595
therefore 0.213 0.626
though 0.107 0.198
thus 0.107 0.710
yet 0.053 0.171
Sum 7.589 12.125

I don’t think that professional forestry writing prefers these two examples,
so I may not necessarily have to use them. I like it that I learned about
them in this lesson!” (2) “I was surprised at the use of also in the papers.
I thought however would be the most popular here as I use this transition
a lot in my writing. I studied overseas when I was in high school and I
remember being taught to use however and nevertheless in essay writing a
lot. I can see their importance in showing logic and in connecting ideas.
I appreciate seeing the list of most commonly-used linking adverbials by
forestry professors.”

One group of students worked on comparing and ranking the top 24 linking
adverbials from the two corpora, simply by reorganizing data from the
aforementioned table, but they were tasked to obtain text samples and to fur-
ther identify major differences between student and professional writing. Their
results are presented in Table C1.6.

Table C1.6  List of 24 frequently used linking adverbials (ranked and normalized per
	1,000 words) in lab reports (n = 144 reports) and 500 refereed forestry
	research articles

Student reports Use per 1,000 Professional articles Use per 1,000
words words

also 2.103 also 2.360
then 2.024 however 1.610
so 0.772 e.g. 1.111
however 0.719 although 0.800
as well (as) 0.399 thus 0.710
finally 0.266 therefore 0.626
for example 0.266 i.e. 0.614
although 0.240 then 0.595
therefore 0.213 for example 0.583
in addition 0.133 as well (as) 0.539
though 0.107 so 0.482
thus 0.107 in addition 0.361
furthermore 0.053 rather 0.339
rather 0.053 though 0.198
yet 0.053 finally 0.194
nevertheless 0.027 furthermore 0.180
on the other hand 0.027 yet 0.171
similarly 0.027 hence 0.160
anyway 0.000 similarly 0.147
e.g. 0.000 nevertheless 0.107
for instance 0.000 on the other hand 0.103
hence 0.000 for instance 0.095
i.e. 0.000 likewise 0.037
likewise 0.000 anyway 0.003
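The reorganization this group performed amounts to sorting each corpus's rate list in descending order, which takes only a few lines of code. A short Python sketch (the `rank` function is my own illustrative helper; the four rates shown are taken from the student-report column of Table C1.5):

```python
def rank(freqs):
    """Sort a {feature: rate per 1,000 words} mapping, most frequent first."""
    return sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)

# Four student-report rates from Table C1.5
student_rates = {"however": 0.719, "also": 2.103, "so": 0.772, "then": 2.024}
for word, rate in rank(student_rates):
    print(f"{word:<12}{rate:.3f}")
```

Running the same sort on both corpora's lists produces the paired ranked columns seen in the table.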

• [Sample Student Interpretation] By reorganizing the data, the differences
between student and professional writing were much clearer. A total of six
features did not appear in the student corpus (anyway, e.g., for instance, hence,
i.e., and likewise). These were not necessarily highly technical or sophisti-
cated features, but they were not used by the students in their reports.
• [Sample Student Comment: Implication (from a group)] “The first key
difference we noticed was that students had a smaller ‘vocabulary’ of link-
ing adverbials than professional scientists. We speculate that the smaller
vocabulary of students is related to their limited experience with forestry
technical writing. Clearly, they do not know much yet, when it comes to
actual data and content that are being synthesized in the papers and how to
best present them. The low use of linking adverbials may lead to the lack
of logical flow that is common to technical writing of beginning students.
The most interesting or relevant contribution of this activity is that the list
is generated from forestry writing, not other types of writing. We liked to
see the way professionals write and present data with detailed explanation.”

Passive Structures, Pronouns, and Verb Tenses


The students also explored the use of active and passive structures and
patterns of verb tenses distributed across the corpora in their corpus-based
lessons. Results of our comparisons showed that passive structures were
extensively used in the professional corpus (22.86 per 1,000 words) com-
pared to the students’ papers (10.45 per 1,000 words). I gave the
students tagged data (using the Biber Tagger) with normalized
counts of passive constructions and asked them to extract concor-
dance lines for selected items such as were provided, was measured, was rejected,
and is illustrated. There was a clear contradiction between this result and
the way students have been taught to “use the passive voice sparingly” in
their papers. The suggested writing manual, Writing Papers in the Biological
Sciences, 3rd Edition (McMillan, 2001), and the School’s handout on
writing research reports explicitly recommended the use of the active voice
in technical reports, I or we as subject, and limiting the number of words
in a sentence, explaining that passive structures require more words than
active structures. The following example illustrates the common instruc-
tional focus on “Active vs. Passive Voice in Scientific Writing” (adapted
from Sainani, Eliott, & Harwell, 2015).
Lesson Focus: Scientific Writing Has Traditionally Been Third
Person, Passive Voice…

First person: I, we
Second person: you (singular), you (plural)
Third person: he, she, it, they

…but more editors are allowing, even encouraging, first person, active voice
because it may be more direct and concise:

Advantages of the Active Voice


1 Emphasizes author responsibility
No attempt was made to contact non-responders because they were
deemed unimportant to the analysis. (passive)
Vs.
We did not attempt to contact non-responders because we deemed them
unimportant to the analysis. (active)
2 Improves readability
The hypothesis that smoking causes lung cancer was rejected by tobacco
companies. (passive)
Vs.
Tobacco companies rejected the hypothesis that smoking causes lung
cancer. (active)

3 Reduces ambiguity
General dysfunction of the immune system at the leukocyte level is sug-
gested by both animal and human studies. (passive)
Vs.
Both human and animal studies suggest that diabetics have general im-
mune dysfunction at the leukocyte level. (active)

Lessons Learned
• It’s OK to use “We” and “I”! Avoiding personal pronouns does not
make your science more objective.
• The active voice is clearer, more direct, and more vigorous.
• The passive voice is appropriate in some cases, but should be used spar-
ingly and purposefully.
• Journal editors encourage use of the active voice.

Science:

Use active voice when suitable, particularly when necessary for cor-
rect syntax (e.g., “To address this possibility, we constructed a λZap
library…”).

Nature:

Nature journals prefer authors to write in the active voice (“we per-
formed the experiment…”) as experience has shown that readers find
concepts and results to be conveyed more clearly if written directly.
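For teachers without access to a full POS tagger such as the Biber Tagger, many passive candidates of the were provided / was measured type can be located with a rough "be + past participle" pattern over plain text. A naive Python sketch (my own illustrative heuristic, not the method used in the course; the sample sentences are invented):

```python
import re

# Naive passive finder: a form of BE directly followed by a word ending in
# -ed or -en. A rough heuristic only: it misses irregular participles such as
# "shown" or "made", skips intervening adverbs ("was also measured"), and
# over-matches predicative adjectives ("is tired"). Real analysis needs a tagger.
PASSIVE = re.compile(
    r"\b(am|is|are|was|were|be|been|being)\s+(\w+(?:ed|en))\b",
    re.IGNORECASE,
)

def find_passive_candidates(text):
    """Return (auxiliary, participle) pairs that look like passive constructions."""
    return [m.groups() for m in PASSIVE.finditer(text)]

sample = ("The dbh was measured at each plot. Samples were provided by the "
          "field crew, and the hypothesis was rejected.")
print(find_passive_candidates(sample))
# [('was', 'measured'), ('were', 'provided'), ('was', 'rejected')]
```

A heuristic like this can generate a first candidate list for students to verify by hand, which is itself a useful noticing exercise.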

I asked my students to work on further analyzing the concordance lines, ex-
tracting examples from the professional corpus, and rewriting the passive sen-
tences into active. This seemed like a useful activity, especially for paired work
with students comparing how they rephrased sentences from passive to active.
There were interesting discussions and comparisons resulting from this activity.
Students commented that

• Interesting patterns. We discussed passive vs. active in the compo-
sition course and I learned that in technical writing, active is better
because it  makes it clearer, especially for researcher accountability. I
­wonder why scientists still don’t follow this? Is it because it is a newer
trend?
• I don’t think it’s a major deal. I feel like it’s a style thing, not really influ-
encing or affecting message or content. I think it’s fine either way, but it
would be good to be consistent.

• I prefer passive. The focus on I did this, I did that is distracting to me. Tech-
nical writing is more formal than personal writing, so I have preference for
an informational style rather than a narrative.
• So, this lesson got me thinking about who to follow. If the writing text-
book tells me to use active, but the professors write in the passive, what
should my model be?

The students also compared the distribution of personal pronouns I and we.
There was very minimal use of I in the professional corpus in contrast to the
students’ corpora. Although there was a substantial number of single-authored
works in the professional corpus, it was apparent that single authors avoided
using the first-person singular in their “Methods” and “Discussions and Con-
clusion” sections, and preferred passive structures instead (e.g., I measured the
dbh using…vs. The dbh was measured…). However, the we construction was
very common in multiple-authored papers in the professional corpus. For the
students in the study, both the I and we active structures were regularly used.
We was used in the explicit write-up of process and procedures in the collection
of data because the students worked in groups; I was used in the “Discussions
and Conclusion” section of the paper. Clearly, the preference for these personal
pronouns in the students’ papers influenced their use of passive structures in their
writing.
For verb tenses, the professional corpus had 43.34 present tense verbs and
22.11 past tense verbs per 1,000 words (I provided these numbers to the stu-
dents from tagged data), while the student corpus had 52.67 present tense and
16.23 past tense verbs per 1,000 words. In both corpora, present tense verbs
were used more frequently than past tense verbs. A cross-comparison of the
tense features showed that the students used present tense verbs more frequently,
and past tense verbs less frequently, than the professionals.
In summary, highlighting these variations in the distributional patterns of
relevant linguistic features in the corpora could potentially enhance the overall
writing skills of students in forestry, especially in achieving a similar tone and
style typical of published articles in the field. In wrapping up the corpus-based
instruction for this part of the course, the class (as one group) developed the
following suggestions.
Based on the comparison of student and professional research papers
across reporting verbs, linking adverbials, passive/active structures, verb
tenses, and personal pronouns, we suggest that a course such as Writing in
Forestry should

• Increase the students’ overall use of linking adverbials in their research
reports; help the students use linking adverbials commonly found in pro-
fessional articles (e.g., nevertheless, therefore, i.e., e.g.) but not in their own
writing,

• Help the students diversify their use of reporting verbs and avoid overuse
of common verbs like show (shows, showed, shown, showing),
• Suggest to students that they consider using passive structures along-
side the predominantly active structures in their papers (especially in the
“Methods” or “Collection of Field Data” sections), and
• Help in checking verb tense patterns in the professional corpus and looking
at examples of shifts in tenses and tense consistency.

C1.1.2  Recap: Outline of the Corpus Instruction


This part of the corpus instruction covered a total of six hours, following the
outline described here:
Lesson 1: Linking Adverbials, Verb Tenses: Presentation of frequencies and
distributional data of linking adverbials and verb tenses in the professional
and students’ corpora; I introduced the use of a concordancer, MonoConc Pro, and
showed concordance lines of linking adverbials and patterns of verb tenses in
the corpora; students were asked to nominate a particular linking adverbial or
use of tense in sections of the research articles for a concordancing activity that
I demoed; we discussed the patterns and mechanical conventions in using link-
ing adverbials in technical reports (e.g., use of comma after ‘however’; location
[initial, middle, ending] of linking adverbials in the sentence).
Lesson 2: Reporting Verbs, Passive versus Active Sentence Structures:
Presentation of frequencies and distributional data of reporting verbs, passive
structures, and the use of personal pronouns I and we; I presented lectures and
discussed the importance of reporting verbs in citation and overall description
of what was conducted or reported in research; we reviewed passive and active
construction in research papers; and conducted a short paired activity on edit-
ing sentences with errors in voice and supplying the appropriate reporting verb
in sentences.
Lesson 3: Concordancing and Exploration of Linking Adverbials: Stu-
dents were asked to follow instructions in conducting hands-on concor-
dancing of linking adverbials (utilizing a handout that I prepared for the
class). A list of the most common linking adverbials in academic writing
from the LGSWE (Biber et al., 1999) was provided as a guide in searching
for corpus patterns and frequencies. I instructed the students to copy and
print concordance lines of linking adverbials that would be useful for them
in editing their papers.
Lessons 3 and 4 focused on hands-on concordancing of linking adverbials
and reporting verbs only. Note again that the students worked in a computer
lab and were given instructions on how to use the software and what to specif-
ically search for in the concordancing activity.
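The hands-on concordancing in Lessons 3 and 4 can also be approximated in a few lines of code when no dedicated concordancer is available. A minimal KWIC (keyword-in-context) sketch in Python (the `kwic` helper and the forestry-flavored sample sentence are my own illustrations, not the MonoConc output used in class):

```python
import re

def kwic(text, keyword, window=5):
    """Return keyword-in-context lines: `window` words of context on each side."""
    words = re.findall(r"\S+", text)
    lines = []
    for i, w in enumerate(words):
        # Match the token against the keyword, ignoring case and trailing punctuation
        if re.fullmatch(re.escape(keyword) + r"\W*", w, re.IGNORECASE):
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            lines.append(f"{left:>40}  [{w}]  {right}")
    return lines

# Invented sample sentence for illustration
sample = "Growth was slow; however, the stand recovered. However, drought persisted."
for line in kwic(sample, "however", window=3):
    print(line)
```

Aligning the keyword in a fixed-width column, as concordancers do, is what lets students scan dozens of hits for recurring patterns (e.g., the comma after however).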
Lesson 4: Concordancing and Exploration of Reporting Verbs: The students
brought a draft of their research papers with them and were asked to circle the

reporting verbs they used in the paper. A list of reporting verbs grouped into
the categories identified by Francis et al. (1996) was provided as a guide in the
hands-on concordancing activity that followed. Again, the students were in-
structed to copy and print concordance lines of reporting verbs that would be
useful for them in editing their papers.

C1.1.3  Course Evaluation, Limitations, and Future Directions


The logical next question here is, “Does corpus instruction improve the
quality of research reports written by students in forestry?” Given the lim-
ited set of features we explored in the short-term, corpus-based lessons
during Writing in Forestry’s initial semester, it would be presumptuous to
conclude that the overall quality of writing in students’ papers improved as
a result of the corpus instruction and concordancing activities. In addition,
the rubric used in grading the research reports (students’ final projects) had
40% of total ratings allotted for sentence structure, grammar, and word
choice, and 60% for content, mechanics, and organization; there was not
enough data to compare yet, especially for statistical analysis. However, it
might be relevant and instructive to future pedagogical research to deter-
mine whether qualitative ratings of the papers submitted by the students in
the two study groups differed relative to features considered as defining
‘quality’ writing, that is, corresponding to those in the professional forestry
corpus.
A related research output from this course was an exploratory, pseudo-
experimental study (Friginal, 2013b) that investigated the contribution and
value of short-term, corpus-informed instruction to heighten student aware-
ness and utilize, in their own writing, features of professional technical writ-
ing in the field of forestry. In this classroom-based research setting, I included
a control group, with output from students receiving traditional (i.e., not
corpus-based) instruction as another comparison group. As a primary goal of
the corpus-instruction, I aimed to develop students’ technical writing skills
through the use of real-world examples of language usage obtained from pro-
fessional corpora. Results from my analyses of frequencies and distributional
data of the linguistic features showed that pedagogical approaches influenced
by CL had important applications in technical writing classrooms (for NSs
of English, in this case). It appeared that the learners were able to explore pat-
terns of language in the professional and student corpora and, afterward, use
some of these patterns in their own writing to more closely approximate the
writing of professionals in the field. Teaching materials gleaned from frequen-
cies and comparative data of linguistic features obtained from these corpora
seemed to stimulate the students’ interest in reflection on their overall writing
skills to more closely resemble the writing of practicing professionals in their
own field. My activities involving the listing of frequencies of the linguistic

features in the study, followed by text samples extracted from the corpora, were
enthusiastically received by the students.
Most of the students in the concordancing activities stated that the pro-
cess was useful in writing and editing their papers. Since my students were
primarily L1 speakers of English, the language barrier was minimal;
vocabulary and structural examples in the corpus were easily noted and
applied in the writing and revision of the research reports. Many of the stu-
dents in this writing class expressed their preference for sample papers when-
ever they were given tasks. Because of the specialized content and writing
in the register, students benefited from examples and concrete instructions
in their own writing tasks. It helped tremendously that frequencies and
patterns from corpora provided the kind of examples students preferred in
the completion of tasks (Friginal, 2013b). Students worked with the concor-
dance lines they printed during the activity and reported that they used the
structures and patterns that they observed from their printouts in revising
their papers. There were some who mentioned that this activity was tedious
and time-consuming. Furthermore, some thought that concordancing for
linking adverbials and reporting verbs, for example, was nothing more than
using an online dictionary or MS Word’s thesaurus function to find
applicable synonyms of words. These are relevant comments about the role
of corpus tools in the writing classroom, particularly because these devices
require additional time and effort and students might not immediately see
how the activity relates to the task. It is important that the corpus-based
lesson be clearly introduced and the rationale supporting it be explained to
the students and that all expectations as to the task execution and grading be
clarified. Content and mechanical conventions of technical writing could
similarly be incorporated into corpus-based instruction. Potential improve-
ment in the overall quality of writing may come from students’ repeated
exposure to patterns and mechanics of writing shown by model texts. In
reading, printing, and analyzing writing structures from the concordancer,
it was possible that the students acquired additional skills related to editing
sentences and paragraphs.
I believe that the use of a target corpus of professional writing helped in
contextualizing the activities in corpus instruction and solidifying the exam-
ples and patterns that the students needed in constructing their sentences and
paragraphs. It appeared that the students found additional motivation from
studying the patterns within professional writing in their own field. It was
easy for them to check concordance structures useful in editing their papers.
The frequencies reflecting writer preferences (what some students described as over-
use or under-use of specific linguistic features) revealed through the exploration of cor-
pora also provided relevant data that reminded my students of the ideal style
and quality of writing expected in their area of specialization. At the end of
the course, students were asked to rate the level of difficulty for the required

writing assignments. They rated the management plan (3.9; 1 = easy, 5 = diffi-
cult) and lab report (3.9) as most difficult, followed by the synthesis paper (2.7),
memo (2.6), opinion piece (2.3) and annotated bibliography (2.3). Students
ranked the writing assignments in priority of importance to them (highest
to lowest): lab report, management plan, memo, synthesis paper, annotated
bibliography, opinion piece (Kolb et al., 2007). Ninety-seven percent of the
students agreed that the grammar workshops with corpus tools were helpful
and should be retained in the course. The most supportive written comments
from students were (directly quoted):

• I’m glad I took this class. I feel it has given me a preview of what I will
be writing in the professional program.
• I enjoyed the grammar lessons and feel that they helped me improve my
writing. I wish we could have focused more on them.
• I thought this class was awesome. I feel ahead of the game when it
comes to writing papers in other classes.
• I would like to say that this class is outstanding. I am impressed with my
overall experience in this class. It has prepared me for future classes and
applications down the road. I have had to take a lot of required courses
this semester and let me tell you that this is the only class that I enjoyed.
I do not like writing, but the instructor presented the material so well
that he made it a good experience.
• This class was great. I feel that the course has prepared me well for en-
trance into the forestry program.

The more neutral to negative comments or criticisms from students were
(directly quoted, also reported in Kolb et al., 2007):

• There was an enormous amount of writing, almost too much, but other
than that, I learned some new stuff.
• I think it would be good to include more examples of good work along
with work that needs improvement.
• The only suggestion I have is to change the grading system for the
class. It seems as since there is an individual emphasis on all the pa-
pers but the management plan, there should be an individual grade for
the management plan. By this I mean that a person should be graded
by  their individual sections in the management plan and not get a
group grade because not everyone exerted the same level of work or
effort.
• Seems like this course could be structured better. There was a lot of
unused time during the first two weeks. Perhaps a bit more focused on
writing lab reports would be helpful as well. A lot of work for a two
credit class.

For the Teacher


Many of the issues involved with the development of corpus-based lessons
and use of computational tools in EAP (writing) courses have to do with
the difficulty of collecting applicable corpora, access to computer programs,
and the limited expertise and training of writing teachers. It appears, how-
ever, that the future direction of the teaching of writing across various spe-
cialized disciplines will include corpus-based materials and data. Lee and
Swales (2006) explored the use of corpora and concordancing activities in
an EAP course for non-native English speakers who were doctoral students,
with relevant positive results related to the development of advanced L2
academic writing. The participants in this study followed an innovative 13-
week course in corpus-informed EAP and were able to compare their writ-
ing with that of the patterns shown in a corpus of professional, published
academic papers. The use of corpora in an academic writing classroom was
generally regarded as acceptable and meaningful by the doctoral students.
It is certainly possible that motivation and various personal goals or contexts
contributed to how doctoral students participated actively in the course.
The same positive perception of the use of corpora and corpus tools in
L2 writing was reported by Yoon and Hirvela (2004). Results of qualitative
and quantitative analyses in their study indicated that a corpus approach
in L2 academic writing was beneficial to the development of L2 writing
skills and contributed to the students’ increased confidence toward L2 writ-
ing. The majority of the participants in the study said that they would rec-
ommend corpus-informed writing classes. Additional discussions of these
themes can be found in the lessons and notes for teachers contributed by
the instructors in the following sections (see, e.g., Dye’s description of a
semester-long series of workshops for visiting Chinese scholars in the US in
Section C4).
For materials design and development, the number of corpus-informed
textbooks and instructional handouts for writing continues to increase, al-
beit still to a limited extent; these materials are not yet fully incorporated into many
college composition courses. It is still relatively rare for a department or
school, such as the School of Forestry at NAU, to be fully on-board in the
creation of a course that was specifically designed with CL applications
from the start. Recently, a TESOL publication in its New Ways series,
New Ways in Teaching with Corpora, has offered readers “at-
a-glance, simple lesson plans.” Edited by Vander Viana (University of Stir-
ling), this collection’s primary purpose is to provide teachers with simple
activities and suggestions on how to teach ESL/EFL topics “informed by,
based on, or with corpora” that may be implemented in the L2 classroom.


The volume features how corpora may be used directly (e.g., by introduc-
ing students to corpora and getting them to work with concordance lines)
and indirectly (e.g., by using the findings of corpus analysis to inform the
design of pedagogical activities); in several teaching contexts (e.g., schools,
language institutes, universities); and in the teaching of English for different
purposes (including EAP) and of different English varieties (e.g., American,
Singaporean, and other Englishes), among others. Sections C2–C4 of this
book share these similar features and objectives.
In 2010, Robinson, Stoller, Jones, and Costanza-Robinson published
Write Like a Chemist which makes use of information gathered from cor-
pus analyses of chemistry texts. This textbook was developed primarily for
chemistry students in the US, both NSs and non-native speakers (NNSs) of
English. In my review of this book for a writing across the curriculum jour-
nal, I noted that
The use of corpus-based technology as well as the collection of spe-
cialized corpora contributes to the innovative designs of writing classes
that incorporate detailed descriptive data of different genres of writing.
Students are exposed to the writing conventions of professionals in their
own field, are given clear examples of lexico-syntactic features of various
written reports and documents, and are able to focus on the nuances and
skills that prepare them not only in writing, but more importantly, in read-
ing genre-specific writing. Write Like a Chemist: A Guide and Resource is an
impressive textbook that is a product of collaborations between chemistry
professors, applied linguists, and technical writing instructors. The book
was intended for upper division and graduate-level university chemistry
classes, and as a resource book for chemistry students, postdocs, faculty,
and other professionals who want to improve their chemistry-specific writ-
ing in the field, particularly in writing: (1) journal articles, (2) conference
abstracts, (3) scientific posters, and (4) research proposals.
C2  CL and Vocabulary Instruction

Vocabulary-based activities using corpora and corpus tools are the ones most
commonly developed and used by teachers in the classroom for a range of
learners. This certainly reflects CL’s tradition of making use of corpora in
dictionary production and the benefits of using concordances and KWICs in
order to concretely and fully define a word and explore its various meanings.
A number of learner dictionaries of English are based on corpora. Over
the years, major English corpora of spoken and written language (both British
and American) have been collected by publishers to produce corpus-based
word lists, dictionaries, and thesauri. Commercially available corpus-based
dictionaries include The Longman Dictionary of Contemporary English (LDOCE),
6th edition (2015); Collins COBUILD English Dictionary (Sinclair, 2003); and
Davies’s (2010–) A Frequency Dictionary of Contemporary American English: Word
Sketches, Collocates, and Thematic Lists (with Dee Gardner). In addition to this
frequency dictionary based on COCA data, Davies also has various dictionar-
ies and word lists obtained from word frequencies of Spanish and ­Portuguese
corpora, among others. Vocabulary textbooks for English language teach-
ing utilizing various corpora are also available (e.g., McCarthy, O'Dell, and
Reppen's (2010) Basic Vocabulary in Use), with emphasis on the role of fre-
quency distribution and real-world language use in how to best teach the
acquisition of L2 vocabulary.
Corpus-based dictionaries offer some interesting features not often included
in more traditional or intuition-based dictionaries. The following is an example
of an entry from the LDOCE on the various meanings of o-kay or OK in spo-
ken and written texts (Figure C2.1).
o.kay, OK /pronunciation/ adj spoken 1 [not before noun]
not ill, injured, unhappy etc: Do you feel OK now? 2 used to
say that something is acceptable: "Sorry I'm late." "That's
okay." | Does my hair look OK? 3 [not before noun]
satisfactory but not extremely good: Well, it was OK, but I
like the other one better.
[…]
okay, OK interjection 1 used when you start talking about
something else, or when you pause before continuing: OK,
let's go on to item B. | OK, any questions so far? 2 used to
express agreement or give permission: "Can I take the car
today?"
[…]
okay, OK v okayed, okaying [T] informal to say officially that
you will agree to something or allow it to happen: Has the
bank okayed your request for a loan?

[Bar graph: Frequencies of the word okay in spoken and written English
(scale 0–1500), based on the BNC and the Longman Lancaster Corpus. This
graph shows that the word okay is much more common in spoken English
than in written English.]

Figure C2.1  Meanings of o-kay or OK in spoken and written, adapted from the
LDOCE.
CL and Vocabulary Instruction  215

In the aforementioned excerpt, the order of meanings (of o-kay or OK)
reflects its use. Sample sentences from corpora are provided, and a frequency
graph illustrates the difference in frequency of use of o-kay or OK in
spoken vs. written English, based on the normalized distribution of OK from the BNC
and the Longman Lancaster Corpus. British (BrE) or American English (AmE)
examples may be available in certain entries. This format is quite common in
online versions of many dictionaries, such as the Oxford English Dictionary. The
‘popularity’ or relative frequency of a word has been regularly referenced, to-
gether with data on register differences. For example, words moderately com-
mon in speech but not common in writing are flood, hopefully, messy, potato,
shave, and underneath; words moderately common in writing but not common
in speech are focus, glance, moreover, pollution, scope, and underlying.
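For teachers who want to reproduce this kind of register comparison with their own corpora, the underlying computation is a simple normalization to a common base, typically a rate per million words. The following Python sketch illustrates the idea; the raw counts and corpus sizes are invented for illustration and are not actual BNC or Longman figures:

```python
def per_million(raw_count, corpus_size):
    """Normalize a raw frequency to a rate per million words."""
    return raw_count / corpus_size * 1_000_000

# Hypothetical counts for 'okay' -- NOT actual BNC/Longman figures
spoken = per_million(raw_count=12_400, corpus_size=10_000_000)
written = per_million(raw_count=1_900, corpus_size=90_000_000)

print(f"okay (spoken):  {spoken:.1f} per million words")
print(f"okay (written): {written:.1f} per million words")
print(f"okay is about {spoken / written:.0f} times more frequent in speech")
```

Because subcorpora almost always differ in size, comparing raw counts directly would be misleading; normalization is what makes the spoken-written contrast in Figure C2.1 interpretable.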
Closely related to vocabulary instruction are studies on collocations and the
exploration of semantic prosody. These two constructs could be easily ex-
amined with corpora and vocabulary use. Semantic prosody describes the way
in which certain seemingly neutral words can be perceived as having either
positive or negative associations through frequent occurrences with particular
collocations. For example, the verb cause is used mostly in a negative context
(accident, catastrophe, etc.), although one can also say that something ‘caused
happiness’ (Sinclair, 1991). Concordancing activities allow learners to easily
extract collocational data or even lists such as bigrams in order to analyze the
semantic prosody of certain target features.
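A rough sense of how such collocational evidence is gathered can be given with a short Python sketch. It counts the word immediately to the right of a node word (here caused) in a tiny invented corpus; a real analysis would draw on a large corpus and a wider collocation window:

```python
from collections import Counter

# Toy corpus for illustration; real data would come from a large corpus file.
sentences = [
    "the storm caused damage to the coast",
    "the error caused confusion among users",
    "the delay caused problems for travelers",
    "her visit caused happiness in the family",
]

def right_collocates(sentences, node):
    """Count the word immediately following the node word (a 1R collocate)."""
    counts = Counter()
    for s in sentences:
        words = s.split()
        for i, w in enumerate(words[:-1]):
            if w == node:
                counts[words[i + 1]] += 1
    return counts

# Mostly negative nouns (damage, confusion, problems) -- the prosody
# Sinclair describes -- with 'happiness' as the rarer positive case.
print(right_collocates(sentences, "caused").most_common())
```

Learners can inspect such a list (or the equivalent concordance lines) and judge for themselves whether the node word leans positive or negative.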
Another major contribution of CL in vocabulary instruction is academic
word lists. These lists have been used regularly in many EAP settings, espe-
cially in writing instruction across the curriculum. An example of these word
lists is Averil Coxhead’s Academic Word List (AWL), which was developed as
a rationalized, more specialized extension of, or perhaps in response to, the
General Service List (GSL). The GSL is a list of over 2,000 words that was
developed by Michael West in 1953 and updated by John Bauman and Brent
Culligan in 1995. The list includes the most frequent words of written En-
glish collected primarily for English language learners and ESL writing teach-
ers (Friginal, Walker, & Randall, 2014). The updated version of the GSL uses
frequencies from the Brown Corpus (Bauman, 1995). Words from the AWL
came instead from an “Academic Corpus” that Coxhead herself collected. The
corpus contains a total of 3.5 million words, with texts representing multiple
academic sources, such as journals and textbooks published from 1993 to 1996.
Coxhead also included texts from the Brown Corpus, London/Oslo/Bergen
Corpus (LOB), and MicroConcord Academic Corpus. The four predominant
academic registers include arts, commerce, law, and science, and encompass 28
different sub-registers (Coxhead, 2000, 2002).
An automated search of the Academic Corpus yielded the words comprising
the AWL, which consists of 570 word families and 3,110 individual words.
The words were selected based on their specialized occurrence, range, and
216  Corpus-Based Activities in the Classroom

frequency, and were only included if not already present in the first 2,000
words of the GSL. The total number of words on the list comprises 10% of all
the words in academic texts. As a rule, each word had to be used at least 100
times in the Academic Corpus. Sample entries from the AWL are shown, with
each lemma and all of its lexemes provided. The bolded words were the most
frequent on the sub-list. The most frequent word in each word family could be
the lemma (like the word estimate seen later) or it could be one of the lexemes
(like the word derived, also seen later). The AWL has 10 sub-lists, with Sub-list
1 containing the highest-frequency words.
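The selection logic behind the AWL can be approximated in a few lines of code. The sketch below, with invented frequency figures, keeps words that meet a minimum frequency and are absent from the first 2,000 GSL words; Coxhead's actual procedure also required range across sub-corpora and worked with word families, both of which are omitted here:

```python
def awl_candidates(freq, gsl_top2000, min_freq=100):
    """Coxhead-style filter (simplified): keep words frequent in an
    academic corpus but absent from the first 2,000 GSL words."""
    return sorted(
        w for w, n in freq.items()
        if n >= min_freq and w not in gsl_top2000
    )

# Hypothetical frequencies from a small academic corpus
freq = {"the": 200_000, "analyze": 450, "derive": 180, "dog": 120, "estimate": 300}
gsl_top2000 = {"the", "dog"}

print(awl_candidates(freq, gsl_top2000))  # → ['analyze', 'derive', 'estimate']
```

Note that dog passes the frequency threshold but is excluded because it is already on the general list — exactly the complementarity Coxhead built into the AWL.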

Sample sub-list from AWL


derive, derivation, derivations, derivative, derivatives, derived, derives, deriving
distribute, distributed, distributing, distribution, distributional
estimate, estimated, estimating, estimation, estimations, over-estimate,
overestimate, overestimated, overestimates, overestimating, underesti-
mate, underestimated, underestimates, underestimating
function, functional, functionally, functioned, functioning, functions
identify, identifiable, identification, identified, identifies, identifying, identi-
ties, identity

As noted here, Coxhead’s primary reason for creating the AWL was that other
word lists, especially the GSL, were limited in their capacity to demonstrate
current lexical usage across academic registers. Yet after more than 15 years,
it is possible that the AWL may also need some updating. For example, Ming-
Tzu and Nation (2004) completed a study on homographs within the AWL
and concluded that the list should include a wider range of word families and
lemmas, representing a range of academic homographs. In addition, Nation and
Waring (1997) argued that in order to comprehend a text, 95%–98% of words
in it must be fully understood and acquired by learners. Unfortunately, as Na-
tion and Waring found, word lists such as the AWL and GSL arguably do not
represent at least 95% of words in a target text. A more recent paper of mine,
coauthored with my former graduate students (Friginal, Walker, & Randall,
2014), also found that the declining trend in vocabulary use, when measured
in a relatively short time frame (e.g., 1990–2012), was very relevant if applied
to standard word lists used for language teaching. As the AWL has been used
as a reference for teachers of academic writing for a range of learners, including
non-native speakers (NNSs) of English, accurate distributions representing the
present status of word usage in specific contexts are of great importance. Teachers
and curriculum developers, then, might focus on these distributions to corre-
spond with patterns of vocabulary utilized currently outside the classroom. In
EAP, these are very important arguments for pedagogy.
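Nation and Waring's coverage figures rest on a straightforward token-level calculation: what percentage of the running words in a text appear on a given word list. A minimal Python sketch, with a toy text and an invented word list:

```python
def coverage(text, word_list):
    """Percentage of running words (tokens) in a text covered by a word list."""
    tokens = text.lower().split()
    covered = sum(1 for t in tokens if t in word_list)
    return covered / len(tokens) * 100

text = "the results suggest that the data support the initial estimate"
known = {"the", "that", "data", "support", "results", "initial"}

print(f"{coverage(text, known):.0f}% coverage")  # 8 of 10 tokens → 80% coverage
```

Run over a target text with the GSL and AWL combined, this kind of calculation shows how far short of the 95%–98% comprehension threshold a given list falls.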
Additionally, concurrent with the declining frequencies of these AWL words
is the consistent rise of new words. Every year, there are approximately 8,500
new words added to the English language (Michel, Shen, Aiden, Veres, & Gray,
2011), primarily because of the internet (social media and online networks).
The constant influx of new words gives a writer more options in presenting
ideas and structuring academic arguments or explanations. This proliferation
(and spread) of new words certainly contributes to the variation in word lists
and may cause a gradual decrease in overall usage of the words similar to the
ones on the AWL. Coxhead’s AWL appears to be naturally heading in this
direction, and there may be a need for a new list, as Coxhead herself proposed
(Coxhead, 2011). At the very least, the AWL needs to be constantly updated
since it is used by many teachers for reference in genre-based instruction, espe-
cially in discipline-specific writing (Friginal et al., 2014).
There are three vocabulary-based lessons intended for university-level
English learners in Section C2: (1) "Identifying and Analyzing Vocabulary
from Authentic Materials in a Content-Based ESP Class" (Roberts), (2)
"Using a Concordancer for Vocabulary Learning with Pre-Intermediate EFL
Students" (McNair), and (3) "Implementing the Frequency-Based VocabProfile
Tool from LexTutor: Improving English Learners' Essay Writing for Profi-
ciency Exams" (Nelson).
The first lesson stresses that the compilation of authentic materials for teach-
ing purposes requires that the teacher determine effectively which vocabulary
items to focus on in order to aid learners with both receptive and productive
knowledge related to the actual content of the course. Authentic materials,
even with their known advantages, could still be challenging for learners who
may have only been exposed to modified texts in English, especially those
who were educated overseas. Roberts’s lesson (C2.1) makes use of extracted
texts from Aerosafety World, a flight safety magazine that publishes articles
about safety issues in the aviation industry. She notes, however, that the typical
content of the magazine is written for a native English speaker audience, de-
spite the fact that a large percentage of aviation personnel are second language
speakers of English. She recommends that scaffolding activities using corpus
tools and databases—for this lesson, WordandPhrase.Info (Davies, 2017b)—allow
learners to be introduced to this type of authentic content in the language
learning classroom as preparation for a more focused instruction in the aviation
industry. Lesson C2.2 aims to promote comprehensive vocabulary acquisition
for Japanese pre-intermediate learners of English by combining data from a
news database (WebParaNews, or WPN, a Japanese-English bilingual corpus
of news articles) and an online concordancer. McNair specifically draws direct
connections between CL and DDL, with emphasis on students’ easy access
and exposure to rich linguistic data. As previously discussed in Section A2,
DDL supports many SLA applications, especially noticing, learner-centered
approaches, and usage-based learning. Specifically, the noticing hypothesis
maintains that a language learner needs to notice some feature of the input to
have the best chance of acquiring it. McNair argues that corpora provide such
data for learners immediately, whereas it would take much longer if the learner
were only relying on noticing natural occurrences in reading material. Finally,
Nelson’s main objective in C2.3 is to show how teachers can effectively incor-
porate Cobb’s (2016) Compleat Lexical Tutor, particularly its VocabProfile feature,
to help students develop their essay writing skills for proficiency exams like
the TOEFL or IELTS. Nelson based this lesson on his actual Intensive English
Program (IEP) course, which prepares students for a norm-based test. His goal
was to guide them through the process of discovering the differences between
the way they wrote essays in tests and the benchmark essays he selected, so that
the students would, hopefully, incorporate these newly acquired 'awarenesses'
into their strategies to achieve higher scores.

C2.1  Identifying and Analyzing Vocabulary from Authentic Materials in a Content-Based ESP Class
Jennifer Roberts, College of Aeronautics at Embry-Riddle Aeronautical
University-Worldwide, Daytona Beach, FL, USA

Lesson Background
This lesson is situated in a course within an IEP at an aeronautical university
that uses only authentic content from a single subject area to teach academic
English. The majority of students will be pursuing careers as pilots, aerospace
engineers, or aircraft maintenance technicians, or in the aviation business, and
therefore, the content area for this course is aviation and aerospace.
Task: Analyze an authentic text to identify academic and technical vocabu-
lary items for classroom use.
When using authentic materials, unlike an ESL book, it is up to the teacher
to determine which vocabulary items should be focused on in order to aid
learners with both receptive and productive knowledge as related to the con-
tent. Authentic materials can be challenging for learners who may have, up
until this point, dealt only with adapted or modified texts in English.
Flight Safety Foundation is an international organization concerned with the
safety of flight. Their monthly magazine Aerosafety World publishes analyses of im-
portant safety issues facing the aviation industry. However, the content is written
for a native English-speaking audience, despite the fact that a large percentage of
aviation personnel are second language speakers of English. Through scaffolding,
learners can be introduced to this type of authentic content in the language learning
classroom and thereby be better enabled to enter the broader aviation community.
Instructors can develop a range of lessons based on a single article such as
one found in Aerosafety World, but where should they start? Through the online
corpus tool WordandPhrase.Info (Davies, 2017b), the article’s lexical items can be
revealed as categories of frequency and, even better for teachers of specific con-
tent, discipline-specific lists. The following example uses the article “Survival
Factors,” which discusses the National Transportation Safety Board’s analysis of
the Asiana Airlines Flight 214 crash (Rosenkrans, 2014).

Procedure

For the Teacher


Begin by accessing the online tool at WordandPhrase.Info. Click on “Input/ana-
lyze texts” on the right-hand side of the screen.
Copy and paste the text to be analyzed, click on “Academic” (“All Genres”
will analyze all words, not just academic words), and then “Search.” The article
will be displayed and color coded to illustrate academic vocabulary words from
two frequency ranges as well as a discipline-specific color (red), as chosen by
the instructor. The default setting is “All Acad,” but in this case, “Sci” is chosen
for an aviation article.
Figure C2.2 shows the top portion of the results of clicking on “See Lists.”
Three vocabulary lists, Range 1 (AVL List 0–500), Range 2 (AVL List 501–3000),
and subject-specific vocabulary (e.g., Science / Technology), are generated.

Range 1
(AVL LIST 0 - 500)
193 Words

21: report
15: impact
6: reported
5: response
4: factors, initial, research
3: between, female, group, than, within
2: based, colleague, components, control, described, design, directed, likely, occurred, perceived,
performance, performed, separated, significant, such
1: absence, addition, although, among, application, applied, apply, approach, article, assessed,
attempting, complex, comprehensive, concludes, condition, conducting, contains, contributing,
critical, current, currently, data, degrees, depending, descriptions, determined, difficulty, direct,
directing, discussion, due, enabling, established, estimated, experienced, failures, identified,
images, impacted, improve, include, including, information, involved, level, limits, lower,
mechanisms, needed, noting, objective, obtain, patterns, phase, positive, provide, provided,
relation, requirement, resulted, resulting, section, sections, selected, similar, sources, specifically,
standards, stated, status, subjected, testing, throughout, understanding, unique, units, used,
varied, various

Range 2
(AVL LIST 501 - 3000)
100 WORDS

Figure C2.2  Sample vocabulary lists provided by WordandPhrase.Info.


After the lists are generated, an instructor can examine the top vocabulary
words in each list and use a combination of frequency results and intuition to
decide which words to focus on in classroom activities. In the FSF example,
high-frequency items, such as report, impact, eject, trap, factor, sequence, fuselage,
and initial, are useful for both receptive (e.g., comprehension or main idea ques-
tions) and productive (e.g., group discussion or summary writing) activities
related to the article. Some lexical items, such as impact, are relevant as multiple
parts of speech, and WordandPhrase.Info permits students to discover this distinc-
tion on their own, as illustrated later. Alternatively, an instructor can choose
which part-of-speech the student should focus on.
An ESP instructor can further categorize the subject-specific vocabulary
based on her own content knowledge. For instance, this article contains many
parts of an airplane: fuselage, engine, tail, and gear (likely part of landing gear).

For the Student


Task: Analyze vocabulary identified by the teacher through discovery and
exploration.
WordandPhrase.Info’s interface is simple to interact with and affords many
opportunities for students to discover vocabulary meanings on their own.
After the instructor has determined a word list from the article, students can
follow the guidelines below. An example is given that permits the student
to discover part-of-speech distinctions used in the article (i.e., a word as a
noun and as a verb). An instructor could also indicate in the word list which
part-of-speech to focus on.

1. Go to WordandPhrase.info. Click on "Frequency List" on the left side of the page.
2. Consider the following excerpt from the article "Survival Factors":

   The initial impact with the seawall occurred at 1127:50. … Some flight attendants
   stated that the first impact was followed by a sensation of lifting off again.

   What part-of-speech is impact in this context? ______________________
3. In the search bar, type "impact," and choose the part-of-speech that
   matches the context described earlier.
   Result (WORDNET): Possible definitions of "impact":
   1. the striking of one body against another 2. a forceful consequence
   3. influencing strongly 4. the violent interaction of individuals or groups
   entering into combat
4. Which definition do you think fits the context of this excerpt?
5. Which synonyms could you use for impact in the context of this excerpt?
   Choose two (see the following list).

   Possible synonyms for impact (frequency rank shown before each word):

   crash sense:          influence sense:
     920 impact            427 effect
    1485 contact           432 control
    2312 shock             920 impact
    2854 crash            1396 influence
    3814 blow             2553 impression
    5700 collision        6817 bearing
    8401 bang            11990 sway
                         14157 brunt

6. Examine the collocates for impact. Click on a verb collocate and take a look
   at the concordance entries. Do you see any patterns? Create two original
   sentences that contain the collocate and impact, with at least one sentence
   using an aviation topic. (See Figure C2.3, which shows the example verb
   "minimize.")
7. Choose another collocate that is an adjective. Take a look at the concor-
   dance entries. Do you see any patterns? In particular, notice prepositions.
   Create two original sentences that contain the collocate and impact, with at
   least one sentence using an aviation topic.
8. Complete Steps 2–7 with the rest of the vocabulary words.

Figure C2.3  Concordance entries for “impact (n) + minimize”.


Suggestions to Teachers: There are several options for presenting the vocabu-
lary words to students. Pulling out excerpts and highlighting vocabulary words
is a good way for students to work on part-of-speech recognition, or instructors
can list items with parts of speech already indicated. The example article contains
uses of impact as both a noun and a verb, so students can explore both. Also, in-
structors can decide which part-of-speech collocates should be used based on the
content and goals of the activity, or allow students the freedom to choose.
Homework: Read the article “Survival Factors” about the Asiana Airlines
Flight 214 crash. Write a summary of the events using a minimum of six of
the vocabulary words you explored using WordandPhrase.Info. Try to write sen-
tences using the collocates you learned today.

Teacher Perspectives on CL in the Classroom

Jennifer Roberts is a faculty member in the College of Aeronautics at
Embry-Riddle Aeronautical University-Worldwide, serving as an Aviation
English Specialist to develop and implement aviation English programs
both locally and internationally. Before coming to Embry-Riddle, she served
as an English Language Fellow in Indonesia where she focused on teacher
training and program development. Her research interests are in the ped-
agogical applications of corpus linguistics, language policy and planning,
and curriculum and materials development in English for specific purposes
settings. Currently, her research focuses on the level of English language
proficiency necessary for ab initio aviation personnel, such as those begin-
ning flight training.

What do you see are primary hurdles for (ESP) teachers in integrating corpus approaches in their classrooms?
Particularly in ESP, instructors must use their own knowledge about the
subject area to intuitively know if the information provided in the corpus is
relevant to that industry and used in the correct context. In the case of ESP
housed in an academic institution, students need to understand academic
vocabulary in a more general context as well as their subject-specific con-
text, which can sometimes be a subtle distinction. Corpora are balanced
and representative of their respective registers, but these registers likely
cover areas outside of the specified industry. Therefore, although the dis-
covery portion of the aforementioned lesson can be done via the larger
corpus, it is important for production-based activities to be in the context
of the specific subject area (e.g., create sentences on an aviation topic).
Tools such as MICUSP illustrate academic writing patterns in specific subject
areas, but even this tool does not cover all fields (i.e., aviation-related papers
are not included). More industry-specific corpora need to be collected,
analyzed, and ideally made available online for students of specific concen-
trations to use. In this way, corpus tools could possibly begin to help stu-
dents with more operational and technical language (i.e., radiotelephony,
phraseology, manuals, etc.).

What works with CL and your teaching? What are examples of measurable student accomplishments?
In our IEP, despite the majority of students pursuing careers in aerospace
and aviation, there remains a population of students simply interested in
academic English. Using corpus tools allows these students to receive input
in a more generalized academic manner. The academic vocabulary words
identified through a tool like WordandPhrase.info (Davies, 2017b) should be
learned regardless of what specific major a student will study, but the added
benefit of grounding the learning in a context (e.g., aviation) that the stu-
dents find relevant and interesting has been tremendous. As an instructor,
it would be difficult and time-consuming to identify which words in an au-
thentic article are academic without using corpus tools. These lexical items
subsequently show up in student production in our aviation class, as well as
their other IEP classes. Additionally, the acquisition of these words leads to
the comprehension of the content of the article which, although secondary
to language, is important information to understand in the industry.

What do you see are future directions or applications of CL in data-driven learning and content-based instruction?
Content-based instruction relies heavily on vocabulary acquisition, as stu-
dents need strong vocabulary knowledge to work with the subject-specific
information which is presented in class. Corpus analysis can reveal which
items should be focused on in the classroom. Although this lesson focuses
on a single text, software such as AntConc can be used to analyze large
collections of texts from sources all related to the same content-area, and
those words can be integrated into lessons. In the case of aviation, texts
could come from sources such as handbooks, manuals, reference guides,
and even checklists. Analyzing these texts will assuredly reveal the most
frequent lexical items, rather than those chosen intuitively by an instructor,
and are very likely to show up repeatedly in a student’s career.
C2.2  Using a Concordancer for Vocabulary Learning with Pre-Intermediate EFL Students
Jonathan McNair, Georgia State University, Atlanta, GA, USA

Lesson Background
As technology continues to become more accessible and expand into our every-
day lives, there is a growing need to incorporate it into language teaching and
learning. As discussed in many parts of this book, DDL is one such approach,
providing students with access to rich linguistic data drawn from corpora, large
sets of texts. Through corpora, students are exposed to an abundance of input,
from which they can induce rules and patterns in language. DDL is consistent
with many theories in SLA, particularly the noticing hypothesis and usage-based
learning (UBL). According to the noticing hypothesis, a language learner needs
to notice some feature of the input to have the best chance of acquiring it. DDL
works by drawing the learner’s attention to the target feature by providing nu-
merous examples in context. One of the tenets of UBL is that we learn language
through analyses of massive amounts of form-function pairings in authentic con-
texts. Corpora provide these data for learners immediately, whereas it would take
much longer if the learner were only relying on natural occurrences in reading
material. These are two of the conceptual underpinnings for DDL.
The overall goal of the activities in this lesson is to promote deeper vocabulary
acquisition in pre-intermediate learners. I have selected pre-intermediate students
because current DDL research has largely overlooked this group or assumed that
they will not benefit from DDL (Chujo, Utiyama, & Miura, 2006). Vocabulary is
the target language feature because although it is instrumental for learners to be
successful, they may not be fully aware of what it means to know a word. It goes
beyond learning the pronunciation, spelling, and a primary definition. Richards
(2008) references several other important lexical characteristics that are necessary
for vocabulary, including polysemy, affective meaning, associations, and colloca-
tions. By providing large numbers of contextualized examples, corpora give learn-
ers the means to induce these characteristics. Another goal for DDL is promoting
learner autonomy by teaching them how to use corpus tools independently. This is
especially important for vocabulary acquisition, the onus of which is already on the
students to learn themselves. The activities in this lesson make use of two different
corpus tools in the hope of providing students with the skills and knowledge nec-
essary to pursue deeper word learning on their own.

Notes for the Teacher


I’ve learned that there are two approaches to using corpora in language teach-
ing: indirect and direct. Indirect applications use corpus studies to develop ma-
terials for language learning, but learners do not interact directly with corpus
tools. Indeed, they are likely not even aware of them. Examples of this include
the Macmillan English Dictionary for Advanced Learners (Rundell, 2007) and the
Longman Grammar of Spoken and Written English (Biber, Johansson, Leech, Con-
rad, & Finegan, 1999), both of which were developed based on corpus studies.
One helpful indirect use that is becoming more accessible for teachers is the
creation of learner corpora, as done by Grigoryan (2016). In a study set at
a school in the United Arab Emirates, 10 participants with proficiency lev-
els ranging from elementary to advanced wrote timed paragraphs about their
weekend on iPads. These paragraphs were processed through AntConc to ex-
amine word use. In a few minutes, AntConc found that the words anyone and
anywhere were used incorrectly in almost all paragraphs. Instructional materials
were developed based on these findings and used in class. Later, students wrote
a second timed paragraph. The AntConc result showed a 98% accurate usage of
anyone and anywhere, indicating that creating learner corpora may be useful for
teachers with access to digital copies of their students’ writing.
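The kind of check Grigoryan ran in AntConc can be approximated programmatically: find every learner paragraph containing a target word so that its uses can be inspected. A minimal Python sketch follows; the paragraphs are invented examples, not Grigoryan's data:

```python
# Invented learner paragraphs standing in for a small learner corpus.
paragraphs = [
    "I did not go anywhere on the weekend",
    "anyone of my friends came to visit me",   # non-target-like use
    "we did not meet anyone at the mall",
]

def hits(paragraphs, target):
    """Return (paragraph index, paragraph) for every paragraph containing
    the target word, concordance-style."""
    return [(i, p) for i, p in enumerate(paragraphs) if target in p.split()]

for i, p in hits(paragraphs, "anyone"):
    print(f"para {i}: {p}")
```

The teacher then reads the retrieved lines and judges which uses are target-like, exactly as in the AntConc-based study, only with texts the class itself produced.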
With technology becoming more and more prevalent in classrooms, it is
increasingly realistic to design activities in which students engage with lin-
guistic data themselves. This is the direct approach to corpus use. Fukushima,
Watanabe, Kinjo, Yoshihara, and Suzuki (2012) created a learner corpus of 212
graduation papers by Japanese students that will be utilized by future EFL stu-
dents at the university. Using a learner corpus for direct use by other learners
has a couple of important benefits. First, 98.2% of words from the AWL appear in
the learner corpus, but the difficulty of the writing is reduced. This gives learn-
ers access to the same vocabulary used in EAP, but with more comprehensible
context. Second, these papers provide successful non-native models to which
EFL students can aspire. This could serve as a motivational boost for learners to
see the work of former students used as a model.
I believe that it is also necessary to go beyond pre-/post-test differences
and basic attitudes to discover how learners are using corpus tools. To do this,
Geluso and Yamaguchi (2014) used four-step student journals: (1) Identify for-
mulaic sequences (FSs) in authentic material, (2) use corpus tools to identify
patterns in the FSs, (3) practice the FSs in small groups, and (4) use the FSs in
authentic communication. Rather than using the concordancer for new words,
students used it with words with which they were already familiar to learn
new uses and expressions. From there, they would use the many contexts pro-
vided to tease out the nuances of the FSs. Lastly, Boulton (2012) examined
how distance learners used corpus tools in novel ways. Rather than only using
a concordancer, the students often combined features to discover which words
co-occur, with what frequencies, and in what contexts. Although these students also
reported difficulty using corpus tools and did not find them especially effective,
they indicated that they planned to use them in the future. However, more
research is needed about how exactly DDL improves learning, if that is indeed
the case. Is it the result of focusing attention on specific features? Providing an
abundance of examples? Or is it more motivating, such that learners study more?
226  Corpus-Based Activities in the Classroom

Procedure

What Is WebParaNews?
This project uses WebParaNews (WPN), a free Japanese-English bilingual cor-
pus of news articles from a bilingual newspaper. WPN is a parallel corpus—
each English sentence has an equivalent sentence in Japanese. When a sentence
is highlighted, the translation equivalent is highlighted as well. Additionally,
WPN has a built-in concordancer, a tool that lets the user search for every in-
stance in which a word or phrase appears in the corpus.
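For teachers curious about the mechanics, the core of any concordancer is a keyword-in-context (KWIC) search. The following is a rough sketch of that logic, not WPN's actual implementation; the two-sentence corpus is a made-up stand-in for real data:

```python
# Minimal keyword-in-context (KWIC) search: find every instance of a
# word and show it with a window of surrounding context, as a
# concordancer does. Illustrative sketch only, not WPN's code.
def kwic(sentences, keyword, window=3):
    lines = []
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            # crude token match, ignoring case and trailing punctuation
            if w.lower().strip(".,") == keyword.lower():
                left = " ".join(words[max(0, i - window):i])
                right = " ".join(words[i + 1:i + 1 + window])
                lines.append(f"{left} [{w}] {right}")
    return lines

# Made-up mini-corpus standing in for the WPN news articles
corpus = [
    "The ceremony will begin at noon.",
    "Workers begin repairs on the bridge today.",
]
for line in kwic(corpus, "begin"):
    print(line)
```

A real concordancer such as WPN's does the same kind of matching over a far larger, sentence-aligned corpus and adds the parallel Japanese display.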

Why WPN?
The WPN concordancer will be used to build students’ vocabulary knowledge and demonstrate that learning a word goes beyond knowing synonyms or primary definitions. A bilingual concordancer was selected because pre-intermediate students often lack the linguistic knowledge necessary to understand the full context provided by a concordancer. The richest input is useless if students do not understand what they are reading. By using a Japanese translation, advanced vocabulary and grammatical patterns will not pose as much of a barrier to pre-intermediate students using the corpus. Because the data come from a newspaper in Japan, the topics may be more familiar to students, which will also facilitate comprehension.

How Is COCA Different from WPN?


COCA is a much larger corpus that contains multiple genres, whereas WPN con-
tains only a single newspaper. WPN has the benefit of being a bilingual parallel
corpus, but lacks the variety of tools that accompany COCA. For example, with
COCA, learners can search for specific collocates or by part-of-speech. COCA also
provides the co-occurrence frequencies. However, this added power could possibly
be overwhelming for students. Which corpus is most appropriate for instruction
will depend on many contextual factors, notably instructional design. If learners are
properly trained in the tool, I feel that either corpus will serve its purpose.

Setting
This lesson takes place in a high-beginner EFL class at a Japanese university. In
this classroom, students have access to computers and the internet. In the pre-
vious class period, students turned in a report summarizing two current events.
The teacher marked a few words in each report that sound off because near synonyms were used instead of true synonyms. The first lesson will use WPN to help students
identify better words to use in those situations and the reasoning behind their
use. The COCA activity is a stand-alone activity. It is not connected to this session, but is meant to introduce COCA as an alternative to WPN for pre-intermediate learners.

CL and Vocabulary Instruction  227

Recommendations for Teachers


Ø Practice extensively with the tools before presenting this lesson to the class.
Aside from knowing how to answer questions about the tools, it is important
to display confidence with the tool. If students see that even the instructor is
unsure, they may perceive WPN or COCA to be much too difficult.
Ø Do not worry about not having previous experience with corpora. I see
this as a potential benefit because you can use your learning experience to
predict difficulties that your students may have.
Ø It is best to demonstrate in a classroom where students can work with the
tools as you demonstrate them. If this is not possible, try to include students
in a demonstration by asking them questions about what to do next and
having volunteers come up to use the tool on your computer. If no com-
puters or projectors are available, include more pictures in the handout to
which students can refer later.
Ø You may feel that WPN is too limited in terms of its tools and its small cor-
pus. COCA is the more powerful tool, but lacks the translations in WPN
that help pre-intermediate students.
Ø Another activity specifically for WPN would be to have students identify Japanese words in WPN that are translated multiple ways in English, and vice versa. This type of awareness raising will reveal nuances in the vocabulary they are learning.
Ø An activity specifically for COCA would be to have students compare how
words are used in different genres. This can open up sociolinguistic issues
for discussion.

WPN Activity Worksheet for Students


Name: ____________________________

What does it mean to know a word?


Is it knowing the definition? Is it knowing how to make the word
plural or how to conjugate it?
Is it knowing its collocations (words that are commonly next to it)?

There are many parts to knowing a word, and it is important to learn all of
them to communicate effectively. These activities will give you an opportunity
to learn about the other parts of a word besides its definition.
Discuss with a Partner
Think about these two verbs: begin and start.

Go to http://www.dictionary.com/ and search for “begin” and “start.” Look
at the first definitions. These verbs are very similar, but you can’t always use
them in the same place. What do you think is different about these two verbs? Discuss with a partner and write your ideas:

Today’s Lesson: Vocabulary Building with WebParaNews


Today, we will use WebParaNews (WPN), a corpus of Japanese-English news
texts,  to discover more about words beyond their definitions. Open two tabs
in your web browser and open http://www.antlabsolutions.com/webparanews/
in each tab.

Follow these instructions for each tab:

1 Change “Sampled Hits” to 100. Leave other settings alone.


2 Type your search terms into the text box. (Separate tabs for each word.)
Ø Compare “start” and “begin” with a partner.
Ø Search for “start” in one tab and “begin” in the other.
Ø Look for patterns with a partner.
Ø Fill in the following sentences with “start,” “begin,” or “start/begin”
based on your findings with WPN.

Note: You will have to change the tense for the verbs in WPN. Also,
you can search for multiword phrases, such as “go away,” in WPN.
I could not ________________ the car.
I will ___________________ the journey tomorrow.
It ___________ to rain.
The washing machine would not _____________.

FOR THE TEACHER: Review the answers together. Ask students for possible guide-
lines for using “start” and “begin” based on their intuitions from the concordancer results.
Explain that only “start” can be used for machinery.

Idiomatic Language
Idiomatic expressions are very common expressions in a culture/language. It is
important to use the exact word in these expressions—not synonyms. Use the
concordancer again to select the right words for each expression. Choose one
of the bold words above each pair of sentences. Write a definition for each of
the following expressions.

Warm or Hot?
1 The Emperor thanked them for their _________ hospitality.
2 Inflation has become a ___________ issue in the presidential election.

Big or Large?
1 After winning the lottery, the woman was living __________.
2 After winning the gold medal, the athlete got a ________ head.

Cold or Cool?
1 Everyone thought that the king was ___________ hearted because he
never talked to the people.
2 Both the ruling and opposition parties must debate the matter with
_________ heads.

FOR THE TEACHER: Review the answers and meanings together. Discuss how
concordancers can help learn formulaic language. Return students’ reports from the previ-
ous class period.

Editing Reports with a Partner


[Teacher] I am returning your reports to you. However, there were problems
with a few of your word choices. If a word is circled, that means there is a more
natural-sounding word for that context.

Use WPN and http://www.thesaurus.com/ to find better words. Write the old
sentence and the new word in the blanks for five words in the following “sen-
tences and new word” worksheet:

Original Sentence New Word

WPN Activity Review Questions


Ø Was using a concordancer helpful for learning more about words?
Ø Are concordancers worth the time it takes to learn to use them?
Ø Will you use a concordancer on your own in the future?

COCA Activity
Name: _______________________________

Which Preposition Do I Use?

Did you know that English has 91 single-word prepositions? This can make
it hard to know which one to use. In this part of the lesson, we will use the
Corpus of Contemporary American English (COCA, Davies, 2008–) to help
us decide which prepositions to use.
Here are your COCA instructions:
Ø Click on Collocates. Note again that collocates are words that often ap-
pear together.
Ø Click POS in the options bar. This command means part-of-speech.
Then, select prep.ALL from the drop-down list. This means we want to
look for prepositions.
Ø Click the 1 on the right side of the number line. This means we want to
know what prepositions appear one word to the right of the word/phrase
we will search for.
Now we are ready to search! Let’s do one together.
EXAMPLE: I live _____ Osaka. (in/at)
Is it “in” or “at?”
Type “I live” into the space next to Word/Phrase and hit enter. What do
you see?
[Students will see a COCA output with the following results: in—611,
with—89, on—46, for—25, at—22, and so on.]
Ø What do you notice? IN appears 611 times, and AT appears only 22 times!
But that does not mean IN is the right answer! Click on each word to look
at the sentences. What do you notice? (TO THE TEACHER: Here you
can talk about situations where IN is used, like with cities, and situations
where AT is used, as with street addresses.)
Ø Now it is your turn! Work with a partner to answer these questions. As
you answer each question, compare the sentences where each preposition
is used. Can you make any rules?
1. The car was fixed ____ my mom. (with/by) (Hint: Search for “was fixed”)
2. They walked _____ the store. (to/until)
3. I was waiting ______ three hours. (for/since) (Hint: Search for “was waiting”)
4. Gasoline is priced _____ 100 yen a liter. (for/at) (Hint: Search for “is priced”)

To the Teacher: Talk about it together as a class. Review the rules that
students have created. A follow-up activity might increase the difficulty by
removing the hints, or using two prepositions that contain high frequencies,
which will force students to examine the contexts.

Teacher Perspectives on CL in the Classroom

Jonathan McNair is a graduate student in the Department of Applied Linguistics at Georgia State University. He is currently teaching in the university’s IEP and has previously taught English at a public elementary school in Seoul, South Korea. His research interests include data-driven learning in English as a Foreign Language settings, especially with pre-intermediate language learners, as well as Natural Language Processing.

What are challenges in implementing corpus-based


lessons and activities in EFL?
The first challenge that comes to mind is the time it takes for teachers to
learn how to use corpus tools, design a corpus-based lesson, and then
teach students how to use these tools. These challenges are not unique
to EFL settings, but they may be compounded by the fact that student
motivation may already be low because they see no relevance of English
to their lives. Special effort needs to be made to help students see the
value not only in learning English, but in learning how to use corpus
tools.
Teachers may also question how DDL can be done communicatively.
Geluso and Yamaguchi (2014) provide a good example of this with student
speaking journals and an activity that requires students to search out situa-
tions for authentic language use. Although the students in that study were
more advanced, I believe it provides a good model for teaching students
with lower proficiency.
Lastly, the corpora we choose reveal our own biases about learning
and speaking English. Are corpora of native speakers the ideal target
for EFL students, who may be more likely to use English as a lingua
franca with other non-native speakers? Differences in our students’ lan-
guage production and the native speaker data may simply be just that –
differences, not deficits. Even for ESL students, I do not think native
speaker comparisons are the best. It is important to identify appropriate
corpora for our students, but this is also limited by what is available to
us. In the same vein, the promotion of learner autonomy may not be an
educational value in our cultural context. Language teaching is always


accompanied by ideology, whether we are aware of it or not, and it
is important to understand our students’ expectations about language
learning. I am not arguing that we give up on DDL in these situations,
but that we should include students in our reasoning for incorporating
DDL.

What are areas for improvement in CL to effectively


support teachers and students?
In order to convince more teachers to try corpus-based lessons, there
needs to be more empirical evidence of why and how to use corpus
tools in different settings. This is an area where teachers need to be
included. Formal studies are important, because this is where empirical
evidence comes from, but there is also a need for more action research
with corpora. Even if the findings are not quantified or official, action re-
search is how teachers spread corpus-based lessons to other teachers. As
a teacher, I would be more likely to try something if a fellow teacher has
had success with it than if I only read about it in an academic article. In
addition to teacher training and research, there is room for improvement
on the coding side. Making corpus tools more intuitive and appealing
could go a long way with both teachers and students to promote the use
of these tools.

What future directions or applications of CL


do you anticipate?
As demonstrated by Fukushima et al. (2012), more researchers are de-
signing corpora for specific learner populations. This applies not only
to proficiency levels or native language, but even topics. For example,
a corpus of sports science articles would be useful for sports science
majors. As mentioned above, using corpora of World Englishes and successful non-native speakers would go a long way towards escaping the monolingual bias that still pervades linguistics and language teaching. There will always be some differences between the language of
native speakers and non-native speakers, but these are not necessarily
deficits.
There is also room to develop these corpus tools into mobile apps and
use them to create communities of practice online. This is likely an area that
private sector groups like Duolingo can help by incorporating corpus tools
into existing language learning sites and applications.

C2.3 Implementing the Frequency-Based VocabProfile Tool


from LexTutor: Improving English Learners’ Essay Writing for
Proficiency Exams
Robert Nelson, Alliance Française d’Atlanta, Atlanta, GA, USA

Lesson Background
The goal of this lesson is to show how teachers can implement a quantitative tool
to help students heuristically improve their writing (particularly, essay writing
for proficiency exams). The context is for preparing students for a norm-based
test, which has formulaic responses and predictable vocabulary; yet the lesson
can be easily adapted any time a teacher has a high-quality model for students
(discussed below). First, I will present thematic research that deals with writ-
ing for the TOEFL and quantitative methods of writing assessment. Second,
I will briefly introduce Cobb’s (2016) Compleat Lexical Tutor (or LexTutor) and
describe the capabilities/features of the VocabProfile tool. Third, I will illustrate
the particular context and student population for which I developed this lesson, exploring my motivations and pedagogical logic. This lesson was fundamentally influenced by the fact that my students are attempting to write for specific
tasks which require a rather banal structure and sophisticated vocabulary. My
goal was to guide students through the process of discovering the differences
between the way they write and benchmark essays so that they had a defined
conception of their path to a higher score. An important consideration in this
section is the fact that the class had a variety of levels and proficiencies, which is
unusual within the specific IEP. Fourth, along with a few recommendations for
teachers, I will present the lesson plan and each of its components: a list of mate-
rials, a plan for the teacher, a handout for students, and related resources. I con-
clude with a discussion of the feasibility of this tool in IEP classrooms and how
teachers in other contexts may benefit from implementing VocabProfile activities.
Note to Teachers: This particular IEP offers an elective special topics
course for English learners preparing for the TOEFL proficiency tests. The
TOEFL is a notoriously difficult exam to prepare for because the vast array of resources available to students, from many different for-profit companies, varies widely in quality and in similarity to the actual test. My lesson focuses on writing for the TOEFL and presents a plan that uses a frequency-based tool.
The TOEFL Writing section consists of two tasks. First is an integrated writ-
ing task, in which students must integrate and compare information from two
different sources: a short reading and then a short lecture in which a speaker re-
futes or expands on points brought up in the reading; students have 20 minutes
to write approximately 200 words. The second task is an independent writing
task in which students must present their opinion and real-life examples about
a given topic; students have 30 minutes to write at least 300 words.

Findings from Related Literature


Ø Quantitative methods of assessing writing, such as corpus- and frequency-
based tools, are used to develop and validate tests, as well as compare per-
formance of different learners and groups of writers based on proficiency or
language background. Weigle and Friginal (2015) found that writers make
different choices based on the discursive style, that is, narrative versus de-
scriptive writing, which has significant implications for this context, given
that there are two distinct writing tasks on the TOEFL: one that is more
descriptive and objective and one that is more opinion- and narrative-based (real-life examples). Another salient finding from their study was that
although the writing patterns found on the TOEFL were significantly dif-
ferent than those used in non-test academic writing, students who scored
well on tests were writing similarly to non-test writing. Overall, successful
test takers were also writing high-quality prose, regarding syntax and ad-
verbial cohesive devices, not lexical choice.
Ø Staples, Egbert, Biber, and McClair (2013) found that TOEFL exam re-
sponses differed significantly from authentic academic writing in terms of
the amount of lexical bundling (which is not necessarily unsophisticated) and the
conversational tone of examinees, although Guo (2011) found that lexical
sophistication was indeed a predictor of higher scores. Lexical choices influ-
ence independent writing scores on the TOEFL: Sophistication was not as
strong a predictor as the overall length of the text, but better scoring essays
tended to use less frequent words (Guo, Crossley, & McNamara, 2013).
Ø Stevenson (2016) examined the benefits of integrating Automated Writing
Evaluation (AWE) in the classroom through an abstraction of data from a va-
riety of studies dealing with academic writing using several evaluation pro-
grams. The social element of using quantitative means of analyzing writing is important for two reasons. First, it can foster a community of writers, which is reflected in the group discussion at the end of the lesson and in the suggestion that more tech-proficient students assist their classmates. Second, it presents an opportunity for teachers to be creative in their implementation of quantitative analysis, which is a major consideration in this lesson.
Ø Creativity of implementation does have suggested limitations from the re-
search on task-based academic writing and lexical choices. Kyle and Crossley
(2016) found that while lexical sophistication, which deals with the prove-
nance of vocabulary deployed by students from various frequency lists, is a
strong predictor of independent writing quality, there was not a strong cor-
relation between writing quality and lexical choice for integrated or source-
based writing.

In summary, it would theoretically behoove teachers interested in using this
lesson to focus more on independent rather than integrated writing tasks. This
observation is additionally supported by the work of Plakans and Gebril (2013),
who found that not only are the processes and products of the two different
tasks different, a theme which echoes Weigle and Friginal’s (2015) findings,
but also higher scores on integrated writing are associated with skilled use and
inclusion of material from the source text or lecture. Finally, although verbatim
citation of sources in the writing task negatively correlated with score, the lexicon of integrated writing remains more limited, as writers rely on paraphrasing and summarizing rather than expressing their own thoughts and opinions.

LexTutor and VocabProfile


Compleat Lexical Tutor [lextutor.ca] is a veritable armory of corpus- and
frequency-based tools for English and French language learners and teach-
ers, and researchers in linguistics. The overwhelming focus of the tools is
on lexical development, EAP, and CALL. The tools either concentrate on
the relationships between words (word lists, word families, concordances,
collocations, etc.) or frequency of words and word families. LexTutor brings
together a multiplicity of research and over 25 corpora. There are many
tools on LexTutor, so it is important to note that even if teachers don’t feel
that the VocabProfile tool, which is the focus of this lesson, could be bene-
ficial in their course, there are many more tools they should investigate to
evaluate and analyze vocabulary, from sophisticated flashcards for learning
class vocabulary to a hypertext-builder for readings linked with concor-
dances and the WordReference dictionary.
VocabProfile is a frequency-based analytical tool that sorts the vocabulary of a text into four categories: K1, K2, AWL, and Non-list. K1 is a list of the first thousand most frequent words in English, representing about 85% of speech; K2 is a list of the second thousand; the AWL is an academic word list; and Non-list covers all remaining words. As mentioned earlier, LexTutor’s tools focus on categorizing
words by relationships and frequency. The output from the tool classifies words
by list and also shows type-token analysis by list and family.
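The sorting VocabProfile performs can be sketched roughly as follows. This is not LexTutor’s actual code, and the tiny word lists are illustrative stand-ins for the real K1, K2, and AWL lists, which contain thousands of word families:

```python
# Rough sketch of a VocabProfile-style frequency profiler.
# The three word lists below are illustrative stand-ins for the real
# K1, K2, and AWL lists used by LexTutor.
import re
from collections import Counter

K1 = {"the", "a", "is", "of", "and", "to", "in", "begin", "start"}
K2 = {"journey", "lottery", "medal"}
AWL = {"analyze", "category", "data", "significant"}

def profile(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        if tok in K1:
            counts["K1"] += 1
        elif tok in K2:
            counts["K2"] += 1
        elif tok in AWL:
            counts["AWL"] += 1
        else:
            counts["Other"] += 1
    total = len(tokens)
    # Percentage of tokens drawn from each list, as in the tool's
    # "Current Profile" table
    return {cat: round(100 * counts[cat] / total, 2)
            for cat in ("K1", "K2", "AWL", "Other")}

print(profile("The data begin to show a significant category of words"))
```

The real tool additionally groups inflected forms into word families and reports type-token figures, but the category percentages rest on this same list-lookup idea.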

Materials and Procedure

VocabProfile Lesson Plan


This lesson plan focuses on using the VocabProfile tool to help students eval-
uate their writing and diagnose their areas for targeted lexical development.
Students will reflect on their essays by using the tool to compare them with a
benchmark essay, and perform their own diagnostic analysis. The course takes
place over a semester and meets three times per week for 50 minutes. There are
18 students in the class representing a broad range of proficiencies and language
backgrounds. Several students have already taken the TOEFL.

Materials
ü T workstation with projected PC and docu-cam
ü Computer workstations with internet access for Ss
ü Copies of the “Complete lesson plan handout”
ü Ss’ TOEFL integrated essays
o checked for spelling
o informally rated from 0 to 5 based on ETS rubric
o in digital format (for copy + paste)
o A copy of the ETS integrated writing rubric (as needed)
o Text files for benchmark essays
ü When possible, there can be multiple essays in one file and the profile will
be an average of several essays.
ü Links to the K1, K2, and AWL

Procedure
1 Understanding frequency: 10 minutes
T writes the word “Frequency” on the board and elicits the definition
from Ss (something like “how often or how many times something oc-
curs”). T then explains how words can be divided into categories based on
frequency. To illustrate, T might ask whether the word question (or another
word) would be more or less frequent (“Do you think this is one of the
1,000 most common words in English?”); T contrasts this with the word
collectively. T and Ss read the introduction of the handout; T checks to see
that all students understand:
Ø How frequency can categorize words
Ø How knowing what kinds of words [based on frequency] Ss use in their
essays and comparing that to the types of words a TOEFL essay uses can
potentially help Ss score better on the test.
2 Running profiles: 10 minutes
Students first run the profile of their essays. They follow the directions on
the handout. T helps students who have more difficulty; quicker Ss help
those struggling. T may demo on the overhead with an essay by an exem-
plary student who has agreed to let their essay be used for illustration.
T explains that students don’t need to worry about understanding all the
information on the page yet and immediately begins running the profiles
on the benchmark essays.
3 Benchmarking: 15 minutes
Students execute the profile on benchmark essays by following the direc-
tions. There are benchmark essay text files (cleaned for spelling) uploaded
in a designated course management site (e.g., Blackboard or iCollege; or a
Dropbox system). Students run the profile of the essay that is the next level
above their performance: For example, if a student received a Level 3 on
their essay, they run the profile for the level 4 essay.
Ss fill in the table so they can see a side-by-side comparison of their
frequency stats and those of the benchmark essay. They answer some basic
analytical questions.
4 Small group comparison: 15 minutes
Ss get into groups with mixed performance on the essays so that students can
compare their performance, frequency, and goals with others. Ss are aware
that the course outcomes are not driven by language acquisition but instead
test preparation, so there shouldn’t be too much discomfort sharing scores and
goals. If the class has shyer students, they can be grouped by similar scores.
In groups, students identify what kinds of words they are using too
many of in comparison with the benchmark and what are reasonable goals
for themselves. Depending on how their conversations go, they may begin
to look at which words they are using and what are appropriate alterna-
tives (which makes a group with heterogeneous scores very interesting).
Students review copies of the word lists that the VocabProfile uses, as well as
key words, function words, transitions, and turns of phrase that are recom-
mended for use on the TOEFL by their test-prep materials.

5 Homework
Ss will not completely rewrite their essays, but instead look critically at
their word choices and adjust to eventually match their vocabulary pro-
file with that of benchmark essays. They are instructed not to make any
significant grammatical or structural changes, just replacing words and as-
sociated pronouns, verbs, and so on. They may do this by looking at the
“Edit-to-a-Profile” section.
For example, Student X does not use as many words from the AWL as the
benchmark, and may replace an instance of “then” with “subsequently.” The
objective is that Ss can compare two versions of the same text, with the second
draft being more benchmark-like. T should emphasize to Ss that through re-
peated practice and a more sophisticated understanding of the word choices of
successful benchmark essays, they should aim to make similar lexical decisions
(such as more words from the AWL, or more creative non-list words, or more
K2 words…) during their timed-writing on the actual test.

Students’ Handout
Name ________________________________
My essay received a score of ______________
My benchmark is a _______________

In this activity, you will find out how the words you use in your TOEFL inte-
grated writing essay differ from benchmark (example) essays. This is especially
important because the essays are so short, only around 300 words. By compar-
ing your words to that of a benchmark essay, you can see what types of words
you should use more or fewer of to get a better score.
We will be using a tool called the VocabProfile. This tool sorts the words you
use in an essay into 4 categories:

K1, K2, AWL, and Other.

K1 represents the 1,000 most commonly used words in English.
K2 represents the 1,000 next most common words in English.
AWL stands for “Academic Word List.”
Other means words not on any of these lists (infrequent and non-academic).

Directions

1 Open the graded copy of your Integrated writing essay. I have fixed all the
spelling errors for you because the tool will not count misspelled words,
but the TOEFL does.
2 Open the website lextutor.ca and click on “VocabProfile,” which is in the
middle column under “researchers.” Then click “VP-Classic.”
3 Make the title “Your Name’s Integrated Essay.” Then, copy and paste your
essay into the field. It should look like this (Figure C2.4):

Figure C2.4  Your essay pasted on VocabProfile field.

4 Click the yellow “Submit_Window” button. Don’t worry about understanding this page quite yet.
5 Repeat steps 2, 3, and 4 with a benchmark essay. First, go to our class
Dropbox and open the text file for the appropriate benchmark essay. If
your essay received a rating of 3, download the Level 4 benchmark essay.
If you got a 5, just use the level 5 benchmark essay. Once you open the
VocabProfile page, give it the title “Level X benchmark Essay.” Copy and
paste the text into the field and click “Submit_Window.”
6 Now you should have two windows with two vocabulary profiles. One is
your essay; the other is an essay one level higher than yours. We will now
compare them in terms of word frequency. Focus on the box that says,
“Current Profile”; it looks like this:

Current Profile

% Cumul.

K1 81.33 81.33
K2 3.33 84.66
AWL 12.33 96.99
Other 3 100

7 Fill in the following worksheet table:

My Essay Level ___ Benchmark Essay

Category % Cumul. Category % Cumul.


K1 K1
K2 K2
AWL AWL
Other Other

8 Fill in the blank and circle more or fewer


I use _____% [more / fewer] words from the K1 list than the benchmark essay.
I use _____% [more / fewer] words from the K2 list than the benchmark essay.
I use _____% [more / fewer] words from the AWL than the benchmark essay.
I use _____% [more / fewer] words from the Other list than the benchmark essay.
What adjustments could you make to achieve a more TOEFL-like writing style? What kinds of words should you use more of? What kinds of words should you use fewer of?
Homework: Look at the Edit-to-a-Profile section of the VocabProfile page.
You may need to repeat the analysis of your essay at home. Notice how the
words are color coded based on the list they come from. Choose 10 words
that you could change to move your vocabulary closer to a benchmark
essay. Bring in your synonyms, and in the next class, we will analyze our
improvements.
Ex: I need to use fewer K1 words and more AWL words.
Change the word “then” to “subsequently.”

Recommendation for Teachers


Familiarize yourself with all the tools and seek inspiration for your class: There are so
many tools available on the LexTutor site that it would be a waste to force yourself to use one tool when it is not working for you or when another tool would suit your context better.
What are your students’ goals, and what do you want them to learn?: I wanted my
students to have a concrete understanding of the distance between their writ-
ing and the writing that will earn them a higher score. In other contexts, they
may be trying to write for another particular genre or with a specific group of
words, such as students who are just beginning to familiarize themselves with
the academic word list.
Focus on what is pertinent to your context: The VocabProfile output page can be
somewhat overwhelming, so it is important that teachers focus on what is
useful in their context. I based this lesson on one table among many, the one
that shows the cumulative profile, that is, what percentage of the text is made
up of words from each list. By focusing on just one part of the analysis, students
won’t be overwhelmed or confused. You can also apply these analytics
in a variety of ways. In this lesson, I discussed lexical choices and related them to
students’ writing, but you might also use this information to measure the
lexical sophistication of a text you are considering giving your students as a
reading. Thus, the same tool can be used in different ways for different skills
and objectives.
When your learners are familiar with the tool, go further: This lesson is a very basic
introduction to the tool and treats students as though they have never used the
tool before. Once they understand the tool and the way it analyzes writing,
they can begin to look at the writing of their peers or multiple samples of their
own and monitor their progress.
With this tool, spelling counts: While spelling is not a major consideration
for the TOEFL score (as it is in other tests, like the IELTS), quantitative tools
tend not to count misspelled words the same as their correctly spelled coun-
terparts. To mitigate this effect, students’ texts should be spell-checked either
by the teacher (more labor) or by students with spellcheck software such as
Microsoft Word or Grammarly.

Teacher Perspectives on CL in the Classroom

Robert Nelson is an applied linguist and double bassist from Atlanta,
GA, USA. He currently teaches French at the Alliance Française d’Atlanta,
as well as ESL/IEP at Georgia State and the Latin American Association.
His research interests include multilingual identities, metrolingualism,
emotion labor of language teachers, and the Modernist theater and
poetry of Belgium. He also studied French and classical music at Rice
University.

How did you learn about VocabProfile and its applications
for language teaching?
The inspiration for this lesson came from a presentation I gave in a course
about Corpus Linguistics and Technology and Language Teaching. I presented
highlights from the suite of tools provided on the LexTutor site to a class of
language teachers. In the section about the VocabProfile tool, I showed the
teachers a comparison of two anonymous students’ essays, one high and
one low performing, against a benchmark TOEFL essay. I asked teachers
about the implications, or what advice I could give to my students to help
them improve their writing. Following the discussion, I realized that I could
create a lesson where students could perform the analysis using the tool on
their own (the directions and goals just needed to be very explicit, because
the site inundates you with analytics and colors). The heuristic nature of the
analysis was also very attractive, as well as the fact that I was solving the
problem of paltry or unreliable TOEFL prep materials simply with bench-
mark essays and a website instead of expensive books that don’t guarantee
an authentic presentation of the test.

What are strengths and limitations of VocabProfile or
LexTutor as an online corpus tool for language teachers
and students?
VocabProfile can be a very useful tool that easily moves from quantitative analysis
to application when presented the right way. Students can easily see the
gap between their writing and high-scoring TOEFL writing, which is highly
stratified and formulaic. It’s also completely free!
Regarding limitations, VocabProfile does just what its name suggests: vocabulary. TOEFL
writing scores are also based on cohesive devices, diverse and sophisticated
grammar structures, and organization. LexTutor does not venture far out of
the realm of vocabulary development, so teachers might use other corpus
tools to parse the grammar structures and other rhetorical elements of writ-
ing to show students their target output.

What are your recommendations to help further develop,
improve, or update LexTutor?
The type of programming language used in LexTutor prioritizes business
logic over aesthetics and presentation. While most researchers familiar with
corpus tools would be comfortable filtering the panoply of fonts and colors
and text fields and buttons, it might not be obvious to the uninitiated stu-
dent and teacher. I think there could be better access to the guides, which
are either hidden or lost on the screen that teems with information, or they
are non-existent. I think that visually the different sections could be more
discrete and interconnected. For example, VocabProfile is designated as a
researcher’s tool, and yet it is, as I have hopefully demonstrated, very applicable
for teachers, students, and researchers alike.
C3
CL and Grammar Instruction

I note again here that I consider the Longman Grammar of Spoken and Written English
(LGSWE) (Biber et al., 1999) to be one of the most useful and influential grammar
resource books published in the past two decades, transforming the teaching of
English grammar to emphasize the important role of registers in mediating lan-
guage use. In highlighting real-world BrE and AmE in context, the LGSWE shows
language teachers the immediate teaching applications of frequency data. It spec-
ifies the unique distinctions between spoken and written grammars, and provides
researchers a much needed guide in defining and categorizing an extensive list of
syntactic features of English. A follow-up, student version of the LGSWE, The
Longman Student Grammar of Spoken and Written English (Biber, Conrad, & Leech,
2002), provides a slightly condensed discussion of the elements of English gram-
mar, this time, specifically for learners (ideally in upper-level undergraduate or
master’s level pedagogical grammar courses), including how, why, and when to
use different structures in speech and writing. For teachers, a companion Student
Workbook offers various activities and short lessons that could immediately be
adapted in the classroom or assigned as homework.
Conrad and Biber’s (2009) Real Grammar: A Corpus-Based Approach to English
uses corpus data to show how grammatical structures could be ‘easily’ developed
and taught in many classroom settings. The presentation here is not as extensive
as in the LGSWE, as the focus is only on a select set of 50 features, such as simple past
tense in polite offers, meanings of take + noun phrases, action verbs with inani-
mate subjects, and adjective clauses that modify sentences. However, the different
units are developed as pullout lessons that a teacher can simply assign to students
for hands-on activity in class or for work outside the classroom. Each unit starts
with a review of what traditional grammar textbooks typically define or em-
phasize compared to actual data from corpora. This instructional part is then
followed by various activities, often about noticing contexts, analyzing discourse,

Table C3.1  Sample instructional information about necessity modals from Real
Grammar: A Corpus-Based Approach to English

must
  Function: obligation (describing what needs to be done)
  Conversation or writing? relatively common in writing (often with a passive verb)
  Example: The trade-offs faced by farmers must be carefully considered.

must
  Function: conclusions (stating a logical conclusion)
  Conversation or writing? relatively common in both conversation and writing
    (often with a verb in a perfect tense)
  Examples: He didn’t know? Oh, so Mark must not have told him. / This climatic
    change must have had a significant impact on the habitat.

have to
  Function: obligation (expressing strong personal obligation)
  Conversation or writing? very common in conversation
  Example: A: Why are you late? B: I had to close the building.

have got to (gotta), had better (better)
  Function: obligation (expressing personal obligations)
  Conversation or writing? relatively common in conversation
  Examples: I’ve got to leave. / I’d better go.

  Function: advice (making a recommendation)
  Conversation or writing? relatively common in conversation
  Examples: You gotta get a sewing machine. / You better get going.

should
  Function: advice (asking for advice or recommending an action or procedure)
  Conversation or writing? relatively common in conversation and writing (often
    with a passive verb)
  Examples: Should I try for it? / I think you should give this to Stephen. / Traps
    should be placed in locations of high moth density.
Source: Adapted from Conrad and Biber (2009).

or practicing using patterns. Table C3.1 shows a discussion of necessity modals
followed by a ‘notice in context’ activity from Real Grammar.
Notice in context: Read the sentences, and circle the modals. Decide the
function for each one.

1 I remember my first car accident. It was right after I got my license, and I
must have been sixteen. My dad was in the car with me and I backed into
the car across the street.
2 In group counseling, comfortable seating should be used and chairs set out
in a circle so that everyone can see each other. This is important for pro-
moting trust and confidence in the group.

I have used Real Grammar in my Technology and Language Teaching and Corpus
Linguistics courses for graduate students to serve as a model for instructors in
developing lessons and activities for a variety of settings. My students respond
positively to the design and components of the 50 units, and they are able to ex-
tend them to the teaching of other features not covered in the textbook. They
notice areas in which additional discussion from the point of view of grammar
instructors could be added to possibly help ‘convert’ those who are not yet fully
versed in CL approaches and themes. Activities for students to directly interact
with corpus tools could also be added as improvements, although it is clear that
this is not a primary goal for the authors. Real Grammar has greatly influenced
the development of Corpus Linguistics for English Teachers (this book), especially
in selecting lessons and activities for Section C3.
Six grammar-focused lessons and activities for various groups of learners
and English courses are presented in Section C3: (1) “Analyzing Verb Us-
age: A Concordancing Homework” (Randall), (2) “Developing Corpus-Based
­Materials for the Classroom: Past or Past Progressive with Telicity” (Dun-
away), (3) “Quantifiers in Spoken and Academic Registers” (Walker), (4)
“Teaching Linking Adverbials in an English for (Legal) Specific Purposes
Course” (Heath), (5) “AntConc Lesson on Transitions for an Intermediate
Writing Class” (Emeliyanova), and (6) “The Explorer’s Journal”: A Long-Term,
Corpus Exploration Project for ELLs (Nolen). Lessons range from a homework
assignment on verb use to a semester-long project that requires students to reg-
ularly complete concordancing activities on collocations. Contributors make
use of COCA, GloWbE, or AntConc, with a teacher-collected corpus, and
they highlight the role of CL in encouraging students to think like research-
ers in a simple and exploratory study that aids in discovering and comparing
linguistic patterns.
Lesson C3.1 describes how students can use COCA to explore a variety of
verb structures and functions as they write academic essays. The worksheet de-
veloped by Randall allows students to extract and interpret collocation verbs in
order for them to avoid too much repetition and to add variety to their writing.
This lesson overlaps with my description earlier in Section C1 of my Writing in
Forestry lesson on reporting verbs. Nolen’s contribution (C3.6) is a Corpus Jour-
nal Project intended for smaller classes (ideally 5–10 students) of learners from
the American Council on the Teaching of Foreign Languages (ACTFL), with
proficiency levels ranging from intermediate-mid to advanced-high or above.
Students are taught to use concordancers to complete various in-class and take-
home activities. This is quite a robust nine-week plan, which serves as an in-
tegral part of the course that Nolen has been running for several terms now.
Two related lessons on linking adverbials for international LL.M.
graduate students (C3.4, Heath) and for students in a university-level aca-
demic writing course (C3.5, Emeliyanova) make use of AntConc to search
for and identify the functions of linking adverbials in legal texts and to
provide students with experience in completing a ‘mini research project’

to determine how frequently each type of transition or cohesive device is


used in argumentative essays (with texts and data from the MICUSP online
interface). Both lessons were developed to emphasize inductive learning,
asking students to notice structures and patterns of usage within model
texts. Finally, Walker (C3.3) targets students at a lower-to-basic proficiency
level of English for a 50-minute lesson, using an appropriate quantifier for a
specific context, while Dunaway (C3.2) contributes a two- to three-week
lesson on understanding telic/atelic verbs for advanced (international) stu-
dents of English. In Walker’s lesson, students search for information about
quantifiers from COCA to see if the words are used with a count or non-
count noun and if the words are more suitable for an academic writing
environment or for an informal conversational environment. Dunaway, for
his part, combines extensive uses of grammar-based definitions, examples,
and related materials to help students uncover semantic characteristics and
aspects of verbs.
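
The counting step behind a ‘mini research project’ like the one in C3.5, tallying how often each transition or cohesive device appears across a set of essays, reduces to a whole-word frequency count. Here is a minimal sketch; the transition list and the two sample essays are invented for illustration, not taken from MICUSP.

```python
import re
from collections import Counter

def transition_counts(essays, transitions):
    """Tally how often each transition (cohesive device) occurs
    across a list of essay texts, matching whole words only."""
    counts = Counter({t: 0 for t in transitions})
    for essay in essays:
        low = essay.lower()
        for t in transitions:
            counts[t] += len(re.findall(r"\b" + re.escape(t) + r"\b", low))
    return counts

transitions = ["however", "therefore", "moreover", "thus", "in addition"]
essays = ["However, the data are thin. Thus, we must wait.",
          "Moreover, costs rose; however, demand held. In addition, exports grew."]
print(transition_counts(essays, transitions))
```

Students who compare their own counts against counts from a reference corpus can then discuss which transitions they over- or underuse.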

C3.1 Analyzing Verb Usage: A Concordancing Homework


Janet Beth Randall, University of Macau, Macau, China

Lesson Background
This homework focuses on how COCA (Davies, 2008–) can be utilized by
students to explore a variety of verb structures and functions as they write
academic essays. The learners are primarily college-level students studying ac-
ademic ESL/EFL writing in countries such as China, South Korea, or Japan.
The homework covers concepts such as verb collocations and verb functions,
and asks students to predict, identify, and investigate various structures and dis-
tributions from COCA. A practice exercise allows them to use the knowledge
they have gained about the target verbs (prove and illustrate) in a sentence
completion activity.

Homework Outline and Worksheet

Concordancing Lesson

For this week’s homework, we are going to use the concordance tool, COCA
(corpus.byu.edu/coca), to understand how to use varieties of verbs more
appropriately in our essays.
Follow the step-by-step instructions to analyze your verb usage. Turn in
this handout at the beginning of class next week.

Part A: Predict

 How many English verbs do you guess that you know? ____________
 Which verbs do you guess are the most common in academic essays?
(Write down 4 guesses)
  _________________________  _______________________
  _________________________  _______________________

Part B: Identify
 Reread Example Essay 1 in Appendix B of your course booklet (page 82).
[Note to teachers: It would be ideal to identify a reading passage from the stu-
dents’ textbook as part of this homework activity. If there are limited options,
you may assign another essay.] Circle all the verbs.

How many DIFFERENT verbs are in this essay? _________________


What is the most common verb in this essay? _________________
Were your 4 guesses all in the essay? YES NO

Using more variety

It is true that common verbs like is or do will still be very common in formal,
academic writing. Often, however, we try to choose different verbs to avoid
too much repetition and to add variety to our writing. This keeps our read-
ers interested. Look at these two sentences from the reading:

The large-scale constructions may prove likely to cause
environmental destructions (para 7).

Justification for regulations would illustrate how small firms
can access more financial resources (para 3).

Verbs like illustrate and prove can be used in sentences where we might
choose be. In order to use them appropriately though, we need to know
more about the patterns that these verbs follow.

Part C: Investigate
 Log on to COCA (corpus.byu.edu/coca) to research the verbs PROVE and
ILLUSTRATE.
(Remember: We made accounts for this site earlier in the semester!)

Prove
 Find the most common collocates of prove in academic English.
Remember how to find collocates? [Instructor demos how to do this in
COCA.]
 Write down the 4 most common collocates, and write the total # of occur-
rences in the parentheses:
_________  (____)  _________  (____)  _________  (____)  _________  (____)
(what POS are these words?) __________________

 Click on EACH of the 4 most common collocates, and scroll through the
concordance lines to look at the examples.
 Write down 1 GOOD example sentence for each.

(Remember: A GOOD example sentence will be one that you understand and one that
represents how the verb is most commonly used with that collocate, i.e., tense, aspect,
preposition, etc.)
(_____________)_________________________________________________
(_____________)_________________________________________________
(_____________)_________________________________________________
(_____________)_________________________________________________

(Write the collocate in the parentheses.)

Illustrate
 Now, find the most common collocates with illustrate in academic English,
following the same process.
 Write down the 4 most common collocates:
_________  (____)  _________  (____)  _________  (____)  _________  (____)

(Are these the same kinds of words that collocate with prove?) YES  NO
(What POS are these words?) ________________________
 Click on each of the 4 most common collocates and write down 1 sentence
from each.
(_____________)_________________________________________________
(_____________)_________________________________________________
(_____________)_________________________________________________
(_____________)_________________________________________________

Part D: Practice
 Use the knowledge you have gained about these verbs. Each sentence is miss-
ing the verb. Choose whether it should be prove or illustrate, and fill in the
blank. BE SURE TO USE CORRECT VERB TENSE and ASPECT!

1 These tools may __________________ useful for monitoring the
dynamics of escaped transgenes.
2 This Eighth Amendment issue again __________________ how
impossible it is to have value-free constitutional adjudication.
3 Morison gives two examples to __________________ the electrifying
power of this speech.
4 …empirically validating this position has __________________
difficult due to methodological limitations…
5 To __________________ this point, Rahner uses the simple example of
buying a banana.
6 Youth community-based, in-home interventions have
__________________ effective for drug abuse and related problems.

Part E: Apply
 Now, create your own sentences using prove and illustrate in the ways we
have learned they are most frequently used.

Use your two chosen Sustainable Development Goals as the topic of your ex-
ample sentences. Think about how you could use these sentences in your essay.
Use all the different collocates that you have written about for this homework.

WRITE 8 SENTENCES IN TOTAL

Prove
(_____________)_________________________________________________
(_____________)_________________________________________________
(_____________)_________________________________________________
(_____________)_________________________________________________

Illustrate
(_____________)_________________________________________________
(_____________)_________________________________________________
(_____________)_________________________________________________
(_____________)_________________________________________________
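
For teachers who want to demystify the collocate searches in Part C, the underlying computation is a window count: every word that occurs within a few tokens of the node word is tallied, and the highest totals are reported as collocates. Here is a toy sketch of that idea, not COCA’s actual implementation, using an invented mini-corpus.

```python
from collections import Counter

def collocates(tokens, node, window=4, top=4):
    """Count words appearing within +/- `window` tokens of each
    occurrence of `node`; return the `top` most frequent."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts.most_common(top)

corpus = ("the results prove difficult to interpret and the findings "
          "prove difficult to replicate in practice").split()
print(collocates(corpus, "prove", window=4, top=3))
```

Real corpus tools refine this with part-of-speech filters and association measures, but the window count is the core idea students are exploiting when they scan COCA’s collocate lists.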

Teacher Perspectives on CL in the Classroom

Janet Beth Randall is an instructor at the English Language Center at the
University of Macau. Prior to this, she worked in diverse EFL/ESL contexts
with both academic and business students, most recently as a member of
the founding team for New York University’s American Language Institute in
Tokyo, Japan. Her current work consists of curriculum development and ped-
agogically oriented English for Specific Purposes. Her current research and
teaching interests include the use of corpus tools in content-based instruction.

What are your recommendations for language teachers
who are considering integrating the CL approach (using
corpora and corpus tools) into their teaching?
I think the easiest way to incorporate a CL approach is by using materials
or a textbook for students that have already integrated these approaches in
featured lessons and activities. With premade materials, someone else has
done a lot of the work for you and you can get more of a feel for what the
CL approach looks like and how to implement it in other ways and branch
out from there. Of course, not everyone has the freedom to choose his/
her own textbook, so if you’re feeling more adventurous, I have a couple of
other recommendations of good starting points.
I think one of the most natural ways to start exploring the plethora of
tools out there is by playing around with the Google Ngram viewer. In fact,
this is generally how I introduce my students to corpus tools. With this tool,
you can easily compare the frequency of multiple words or chunks visually,
and it is such a simple tool that it is easy to understand the concepts of
frequency and language change.
The main drawback to the Ngram Viewer is that it presents the language
out of any context, which is why a great next step is to move to a
concordancer to investigate the words more fully. I have an affinity for COCA,
but most work very similarly. With a concordancer, you can compare how
the vocabulary you are teaching is used in different types of texts. You can
look for example sentences for the vocabulary you might be teaching. You
can look for good collocates or verb patterns or see where in a sentence a
phrase you’re teaching usually occurs. Then you can decide if these kinds
of investigations are something your students could also do to learn more
about the language.
My strongest recommendation, if you would like to introduce students
to corpus tools, is to start with something small and explore from there. I
think it’s best to find one activity that you want to add to your repertoire
and then build on it step-by-step.
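
The concordancer behavior described above, every hit of a search word shown with its surrounding context, is the classic key-word-in-context (KWIC) display. A few lines of Python are enough to sketch the concept; this is only an illustration, not how COCA is implemented.

```python
def kwic(text, node, width=25):
    """Toy key-word-in-context display: return each occurrence of
    `node` with `width` characters of context on either side."""
    lines, low, n = [], text.lower(), node.lower()
    start = low.find(n)
    while start != -1:
        left = text[max(0, start - width):start]
        right = text[start + len(n):start + len(n) + width]
        lines.append(f"{left:>{width}} [{text[start:start + len(n)]}] {right}")
        start = low.find(n, start + len(n))
    return lines

sample = ("The results prove difficult to interpret. Such cases prove that "
          "more data are needed, and the claims may prove useful later on.")
for line in kwic(sample, "prove"):
    print(line)
```

Aligning the node word in a column is what makes recurring patterns to its left and right visible, which is exactly what students scan concordance lines for.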

In your opinion, what language teaching scenarios (e.g.,
contexts, topics, student level) are most applicable to
the CL approach?
I think a CL approach is pretty flexible, and teachers can use different as-
pects of this approach with many different levels of students in a variety of
contexts. Maybe because my context is largely academically focused stu-
dents, I think this might be the easiest context as there are probably the
greatest number of ready-made corpus-informed materials out there for
general academic contexts. Here are some of the approaches I use with my
different students.
I think with lower level students, it’s important to choose vocabulary
and language features that are most frequent or that will help them in the
most common situations, so with my low-level students, I might not teach
them to use corpus tools, but in my preparation for classes I use a vocab
profiler (LexTutor) to identify the level of a sample text quickly or decide
how to adapt authentic materials. I also critically evaluate the coursebook
readings and grammar explanations that are a part of my course to make
sure they describe real language usage and aren’t too unnatural.
When I work with students who are at a bit higher level, I want to in-
troduce them to tools that they can use autonomously. To that end, I use
the quick visual comparisons of the Ngram Viewer with my intermediate
students to show the probability of two words or phrases being the appro-
priate choice based on their frequency. My upper-level students learn more
how to use concordancers and vocab profilers. As I’ve said, I think you can
find good tools to add to your toolbox from the CL approach no matter the
context you’re working in.

What future directions or applications of CL
do you anticipate?
I’m not sure if this is really what I anticipate or just what I dream for, but I
would like to see more integration of corpus tools into other kinds of tech-
nology. Sites like rewordify.com (a site that identifies difficult words on
a webpage or in a text and replaces them with easier synonyms) and the
now-defunct readability.com (a site that would change web pages into
easier to read formatting) are moving in that direction, but seem to always
come up against the big issue of copyright. I think to get students and
teachers, for that matter, to use corpus tools, the tools need to be easy to
learn and easily accessible. I want browser extensions that bring up top
collocates for a word that I click on, or that help me easily collect texts


or transcripts from different pages to analyze for key words and phrases.
I want mobile app versions of popular online and offline corpus tools so
that students can use them easily in and out of class. I think to improve its
applicability and increase its utilization, corpus linguistics needs to address
its usability/access issues. Right now, I think that the learning curve on any
of these approaches is a bit steeper than it needs to be. That learning curve
is the biggest hurdle that is keeping most teachers from integrating it more
frequently into their instruction.

C3.2 Developing Corpus-Based Materials for the Classroom:
Past or Past Progressive with Telicity
Sean Dunaway, George Mason University, Fairfax, VA, USA

Lesson Background
The following materials are developed for a two- to three-week lesson on
understanding telic/atelic verbs. Telicity is an inherent property of a verb or
verb phrase that presents an action, situation, or event as either completed
or continuing (i.e., as having no definite ending).
A verb is telic when it clearly establishes completion, while a verb that pres-
ents an action or event as being incomplete is said to be atelic (Biber et al.,
1999). CL approaches through concordancers (especially with databases like
COCA or MICUSP) provide teachers and materials developers a wealth of
options in designing real-world and authentic classroom activities for these
types of grammar-based lessons. Students may relate well with localized
examples in explaining concepts and defining terms, especially if teachers
are allowed to utilize materials beyond required texts. The following lessons
and activities aim to also show teachers how to design materials that even-
tually can inform or guide the creation of a textbook for a particular group
of learners.
Note to Teachers: The target students here are university-level ESL stu-
dents taking a grammar course (e.g., Introduction to English Grammar), but the
activities may also be used for higher-level students in an intensive English
program focusing on practicing academic speech and writing. Lesson handouts
may be directly provided in the classroom with student (computer) worksta-
tions and access to COCA or may be given as an assignment. The students
already have sufficient background and experience running COCA searches.
Additional corpus-based data were adapted from Biber et al.’s (2002) Longman
Student Grammar of Spoken and Written English.

Past or Past Progressive with Telicity

A. Understanding Telic/Atelic Verbs


How do verbs stop?
Verbs that complete or finish have a natural end (i.e., they are telic verbs).
Atelic verbs stop or pause because they do not have a natural end.

Telic Verb Examples

When individuals fall, they later hit the ground and stop falling.
When David kicks, his leg goes out and back in. The kick starts and ends.
Telic verbs have two types:
➢ One with a long duration (accomplishments)
➢ One with a short duration (achievements)

Achievement (short duration): After he made a shot, spectators sometimes kicked
his golf ball into the rough or hid it under garbage.
Achievement (short duration): When the wind was the strongest and the rain the
hardest was when the tree fell on the house.

[Timeline: Begin → (short time) → Done]

Accomplishments are verbs with an end that take a long time, like draw and
clean.
Different from achievements, accomplishments finish more than stop.

Accomplishment (long duration): Carrie was drawing on a receipt with a ballpoint pen.
Accomplishment (long duration): After five frantic minutes of cleaning egg off her
skirt, she declared it the best burger ever.

[Timeline: Begin → (long time) → Finish]

Normally, a kick and a fall have a short duration, but if a kick is repeated, it can
also be an accomplishment.

Accomplishment: Nick’s father was kicking a little something his way.

Also, if a fall happens from a tall building or many things are falling, the word
can be an accomplishment.

Accomplishment (long duration): A light rain was falling from a gray, overcast sky.

Atelic verbs like play and swim, in contrast to telic verbs, have a less clearly
defined ending.

They do not complete or finish; they stop or pause.



Atelic verbs without an ending are called activity verbs because they are
activities that people often do.

[Timeline: Start → (long or short time) → Stop or pause]

Children can play for hours, and then stop, but they never really complete play-
ing because, arguably, there is no ending to play.

Activity: We were playing electric keyboards, which only require a very light touch.

A person (or an animal) can also swim for a long time, and then stop swim-
ming, but the person didn’t really complete swimming. He or she probably just
got tired. This means that he or she can swim again the following day (or after
resting).

Activity: She noticed the ducks were swimming her way.

The last important type of verb here is the non-action or stative verb:

Stative verbs are the following:

➢ when someone senses (e.g., smell, see, hear, taste, feel)


➢ when individuals use their mind (e.g., know, believe, think, understand)
➢ when people possess something (e.g., have, own, belong)
➢ when someone has an opinion or a specific emotion (e.g., like, love, hate)
➢ when someone measures something (e.g., weigh, cost, equal)
➢ when you relate (e.g., contain, entail, consist)
➢ when you describe (e.g., resemble, sound, appear, seem)

Stative verb: Whoever was waiting for me inside my apartment was about to get
what he deserved.
Stative verb: Mrs. Jackson was standing at a respectful distance by the door.

B. Identifying Telicity
Practice on Identifying Telicity

Identify the lexical aspect of each verb in the following sentences.

ACTIVITY: event or action; short or long duration; unclear start/end
ACCOMPLISHMENT: event or action; long duration; clear start/finish
ACHIEVEMENT: event or action; short duration; has start/finish
STATE: not an event or action; short or long duration

Worksheet: COCA extracts on lexical aspects of verbs
(Examples were obtained from COCA @ corpus.byu.edu/coca. Label each marked verb phrase with its type.)

Example: She <a> admitted Armstrong <b> was driving.  <a> Achievement  <b> Activity

I <c> was wondering what <d> smelled funny.  <c> ______  <d> ______
She <e> noticed Jake <f> was watching.  <e> ______  <f> ______
I didn’t <g> realize you <h> were bringing guests.  <g> ______  <h> ______
Once we <i> got inside, the Germans <j> were throwing grenades and shooting.  <i> ______  <j> ______
I <k> knew the feeling that she <l> was describing.  <k> ______  <l> ______
Mick <m> looked at two men nearby who <n> were listening in to the conversation.  <m> ______  <n> ______
Jade <o> was staring through the windshield and <p> was slowly shaking her head.  <o> ______  <p> ______
CL and Grammar Instruction  257

Joe <q> discovered that his wife and Mister Schirmer <r> were having an affair.  <q> ______  <r> ______
Interim CEO May <s> announced he <t> was creating a task force.  <s> ______  <t> ______
Rickie <u> called his father in the Middle East and told him we <v> were marrying for the green card.  <u> ______  <v> ______
Heidi <w> was bringing cupcakes to the classroom and <x> could drop off the costume.  <w> ______  <x> ______
They <y> were hoping a tip might <z> lead them to the gun used to murder Russel Douglas.  <y> ______  <z> ______

Question B1: Can achievements be accomplishments or activity verbs? Use
COCA to find examples of the highlighted [bold] verbs in the past progressive
tense. How frequent is each verb?
Example: The first example of an achievement verb was the verb admit. Was admitting
has a frequency of 48, while were admitting has a frequency of 12. Admit is not common
in the past progressive tense.
Policies have moved so far left that President Obama was admitting that he could not
bring his Democrats to the table on spending cuts (News Report on COCA).

Examples:
1. ________________________________________________________
2. ________________________________________________________
3. ________________________________________________________

Question B2: Can accomplishments be achievements? Use COCA to find
examples of the non-bold verb phrases in the past simple tense.

Example: Drove has a frequency of 23,982. Enter the two gunmen, whom Garland
police said drove up to a police car that was blocking an entrance to the exhibition hall.
(News Report on COCA).
Examples:
1. ________________________________________________________
2. ________________________________________________________
3. ________________________________________________________
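For teachers who want to replicate this kind of COCA-style frequency check on their own plain-text materials (COCA itself is searchable only through its web interface), a minimal Python sketch might look like the following; the sample text and file handling are illustrative assumptions, not part of the lesson:

```python
# Sketch: count past progressive forms ("was/were + V-ing") of one verb
# in a plain-text corpus. The sample string stands in for a real corpus
# file you would read from disk.
import re
from collections import Counter

def past_progressive_counts(text, verb_ing):
    """Return (count of 'was <verb_ing>', count of 'were <verb_ing>')."""
    pattern = re.compile(r"\b(was|were)\s+" + re.escape(verb_ing) + r"\b",
                         re.IGNORECASE)
    counts = Counter(m.group(1).lower() for m in pattern.finditer(text))
    return counts["was"], counts["were"]

sample = ("She was admitting her mistake. They were admitting defeat. "
          "He was admitting nothing at all.")
print(past_progressive_counts(sample, "admitting"))  # (was, were) counts
```

The same function can be pointed at the contents of any .txt file (e.g., open("corpus.txt").read()) to check the worksheet verbs one by one.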

C. Understanding Progressives
Question C1: Biber et al. (2002) report that progressive verbs are not com-
monly used across registers such as conversation, fiction, news, and academic
discourse. From the aforementioned examples, when do English users prefer to
use the past progressive?

Question C2: The following verbs occurred in the progressive 80% of the
time (from a different group of texts). What are their frequencies in the COCA?
Answer this question by completing the following worksheet.
Verb comparison worksheet: progressives 80%

(For each verb, record its frequency as were __ing, as was __ing, and as an *ed or irregular past form.)

Activity/Physical Verbs: Bleed, Chase, Stop, Starve
Communication Verbs: Chat, Joke, Kid, Moan

Question C3: The following verbs occurred in the progressive in less than 2%
of cases. What are their frequencies in the COCA? Answer this question by
completing the following worksheet.
Verb comparison table worksheet: progressives less than 2%

(For each verb, record its frequency as were __ing, as was __ing, and as an *ed or irregular past form.)

Physical Action Verbs: Find, Invent
Communication Verbs: Reply, Communicate
Mental Attitude Verbs: Believe, Like
Perceptual States: See, Hear

Question C4 (Past Simple/Past Progressive): With the information you
now know about these verbs, complete the following sentences with either the
past simple or past progressive form of the verb (sentences collected from COCA).

1. Look
Just like that you were looking (occurred at same moment) at days and days
of added labor (Conversation)

  2. Hope
I did finally get my day at the beach; just not the beach I ______________
for. (Fiction)
  3. Depend
Their aunt worked part-time at the visitor center, but they ___________
on her to keep the house running. (Fiction)
  4. Concern
One cost of service involved distance and the second ____________
traveling through city traffic. (News)
  5. Stay
The Mom and Dad had separated, but Russ ____________ with her
and their two children over Christmas. (Fiction)
  6. Chase
Moms ___________ after their kids, and dogs were running after any-
thing that moved. (Magazine)
  7. Leave
For example, when I ____________ my relative overnight in the hos-
pital, my relative was using the same bedpan. (Conversation)
  8. Think
I stumbled on this and ____________ it might be helpful for people
still learning how to use Gmail. (Conversation)
  9. Involve
This project also ____________ a class blog for students to post the
videos that they made. (Academic)
10. Spend
Roosevelt students ____________ a minimum of two hours every day
on reading. (Academic)

D. Further Projects: Conducting Exploratory Research


With the research skills you’ve gained, you’re now ready to learn more about different
verb tenses.

D1. Look at examples of simple past and present perfect verbs in COCA and find
out which verbs are most likely to occur in the present perfect. Discuss your
findings with a classmate.

D2. Biber et al. (2002) found that perfect tenses were more common
in  BrE  than in AmE and that progressive tenses were more common in
AmE.

Use the Corpus of Global Web-Based English (Davies, 2013) to compare
your results with the following figure (e.g., search have *ed or is *ing in the
chart section and see the results) (Figure C3.1).
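The have *ed / is *ing searches in D2 can also be roughly approximated on local text with simple surface patterns. The sketch below is only a classroom illustration under that assumption: it misses irregular participles (taken, gone) and progressives with other auxiliaries, so it is no substitute for the GloWbE interface:

```python
# Sketch: approximate counts of perfect ("have/has + -ed") and
# progressive ("is/are + -ing") aspect in a text via surface patterns.
import re

PERFECT = re.compile(r"\b(?:have|has)\s+\w+ed\b", re.IGNORECASE)
PROGRESSIVE = re.compile(r"\b(?:is|are)\s+\w+ing\b", re.IGNORECASE)

def aspect_counts(text):
    """Return (perfect_count, progressive_count) for the text."""
    return len(PERFECT.findall(text)), len(PROGRESSIVE.findall(text))

sample = ("She has finished the report. They have walked home. "
          "He is writing a letter while the kids are playing outside.")
print(aspect_counts(sample))  # (perfect, progressive)
```

Run over AmE and BrE samples separately, the two counts give a homemade version of the comparison in the figure.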

[Figure C3.1 here: bar chart (y-axis: frequency, 0–10,000) comparing AmE CONV, BrE CONV, AmE NEWS, and BrE NEWS for perfect vs. progressive aspect.]

Figure C3.1  Frequency of perfect and progressive aspect in AmE vs. BrE conversation and news. Adapted from Biber et al. (2002).

D3. Research synthesis of related findings

Synthesis: Progressive aspect: (1) subject is agent, (2) duration is described by the verb, that is, telicity

Subject is agent (frequent): stare, look, watch, listen
Subject is experiencer (rare): see, hear
Mental verbs (frequent): wonder, think, hope, wish
State of mind (rare): appreciate, know, want, believe
State of being?: has, rely, depend
Rare human subject (rare): concern, correlate
Stative verbs (frequent): stay, wait, stand, live
Agent, activity verbs (frequent): drive, chase, shout, bring, move, play, walk
Instantaneous verbs (rare): threw, shut, smash, swallow, throw, smile, spend, turn on/turn off, pick up/put down, approve, agree, broke
End point of process (rare): ruled, decided, attain, dissolve, find, invent, gave, leave, go, do, create, made, replace, born, shoot, finish, marry, involve
Greater sense of involvement (varies): say, think, describe

Synthesis: percentages and comparisons

Progressive aspect:
• 80%: bleed, chase, shop, starve (i.e., activity/physical verbs); chat, joke, kid, moan (i.e., communication verbs)
• 50%: dance, drip, head for, march, pound, rain, stream, sweat (i.e., physical/activity verbs); scream, talk (i.e., communication verbs)
• <2%: arrest, dissolve, find, invent, rule, shut, shrug, smash, swallow, throw (i.e., activity verbs); accuse, communicate, disclose, exclaim, label, reply, thank (i.e., communication verbs); agree, appreciate, believe, conclude, desire, know, like, want (i.e., mental/attitude verbs); detect, hear, perceive, see (i.e., perceptual states/activities); convince, guarantee, initiate, oblige, prompt, provoke (i.e., causation verbs)

Perfect aspect:
• Common: taken, become, given, shown, thought, called, gone, done, made, seen, come, said
• Rare: needs, doubt, represents, constitutes (i.e., state verbs); glance, kiss, nod, scream, smile

D4. Past or Past Continuous with Lexical Aspect and the COCA
Decide what kind of lexical aspect each verb in the following chart has.

ACTIVITY: event or action; short or long duration; unclear start/end
ACCOMPLISHMENT: event or action; long duration; clear start/finish
ACHIEVEMENT: event or action; short duration; has start/finish
STATE: not an event or action; short or long duration

Ø What are the most common past simple and past continuous words? (Make
a guess or use your intuition)
Ø Now, check your answer using COCA. Were the verbs more frequent in the
past simple or the past continuous?

Teacher Perspectives on CL in the Classroom

Sean Dunaway is an instructor at George Mason University in the INTO
Mason Joint Venture. Currently, he is working on partnerships with other
campus departments, such as Women and Gender Studies, to provide more
engaging content-based instruction to students. He also has developed lan-
guage support courses for information technology, computer science, and
financial accounting. Before obtaining his master’s in Applied Linguistics,
Sean worked as an English Language Instructor in Thailand and Taiwan. His
research interests include Corpus-Based Grammar materials design, English
for Specific Purposes instruction, and Blended Learning.

What works? What skills do you consider as ideal foci
of corpus-based instruction in your classroom? Do you
recommend ESL/EFL teachers across contexts to explore
what this approach can contribute to their teaching?
English Language Learners get a lot of mixed messages from different in-
structors and resources. The great thing about corpus-based instruction is
its authenticity: real tendencies based on real English usage. For example,
when teaching -ing verbs and to verbs (i.e., gerunds and infinitives), if an
instructor wants to show that stop cannot pair with to go, the instructor can
just put the entry into the COCA, and it will return no matches. Also, when
comparing advice modals like should, ought to, and had better, the instruc-
tor can show which is most common in particular contexts and show how
the use of these modals has changed over time. Therefore, it is easy to refer
to corpora as an independent source supporting the instructor; the same
way a citation supports a writer’s arguments. While rule-based instruction is
simpler, observing tendencies is more authentic. Corpus-based instruction
is also beneficial in its adaptability to different instructional contexts, and
for the boost it can give to learners’ computer literacy. In addition, tools
like BYU’s Global Web-Based English corpus allow learners to compare lan-
guage use between/among different contexts.

What did not/does not work? What are challenges
you encountered in using corpora and corpus-based
materials in the classroom?
The primary challenge is availability of quality corpus-based materials.
Material-producers minimize costs by favoring little revision and tried and
true methods. Real Grammar (Biber & Conrad, 2009) and English Grammar
Today (Carter et al., 2016) are two of the few great corpus-based work-
books, in my opinion, but they require supplements to scaffold and allow
for productive tasks. The onus really is on the practitioner to develop his
own corpus-based materials with the least metalanguage possible while at
the same time managing increased contact hours, service requirements,
professional development, multiple positions and personal obligations. A
final challenge is a program’s focus on the bottom line, i.e., if a program
invests into creating and adopting a corpus-based curriculum, does that
bring them greater profit and a larger, more diverse student population?
More often than not, achieving greater alignment with best practices does
not easily translate into higher revenue.

What future directions or applications of CL
do you anticipate?
Previous corpora, like the one used to develop Douglas Biber’s (and co-authors) seminal Longman Grammar of Spoken and Written English (1999),
were subject to patent and copyright, but the availability of open source
corpora like MICUSP, LexTutor, and COCA is promising for practitioners.
In the future, it will be easier to train learners and teachers in the use of
these applications. In addition, the increasing reputation and desirability of
accreditations (such as the Commission on English Accreditation, especially
for Intensive English Programs) is standardizing professional development
requirements across the nation, so there is now more potential for corpus-­
based instruction when its perks are demonstrated at local conferences and
relayed to colleagues.

C3.3 Quantifiers in Spoken and Academic Registers


Marsha Walker, Georgia Tech Language Institute, Atlanta, GA, USA

Lesson Background
This lesson is developed with the understanding that all students are in a com-
puter lab environment and have previously experienced using the COCA
(Davies, 2008–). If that is not true, some additional preparation may be neces-
sary. The targeted audience is a classroom of students at a lower-to-basic pro-
ficiency level of English. However, it could easily be utilized at many different
levels with an understanding that the time needed to accomplish the task would
likely vary. The main lesson is set to encompass approximately 50 minutes with
the possibility of extensions to the lesson mentioned at the end. As a final note,
please keep in mind that technology changes daily, and so the listed directions
may not be completely accurate.
The primary purpose for the lesson is to help students choose the appropriate
quantifier for a specific context. A quantifier is defined as an expression used to
represent a more general amount, such as a lot, lots, many, much, loads of, multiple,
and several. These commonly used expressions denote the plurality of a noun,
so they have been selected as the focus of the following lesson.
As the students search for information about the quantifiers, they will be able to see
if the words are used with a count or non-count noun and if the words are more
suitable for an academic writing environment or for an informal conversational en-
vironment. Using non-count nouns is one of the more notably challenging gram-
mar topics as is using situationally appropriate language, which is a major reason
why this topic was chosen. It is important for students to understand why they
are learning what they are learning. Therefore, it is strongly encouraged that
an on-level explanation is provided in order to have full buy-in from the students.

Procedure
The following are the directions that the students can follow. Teachers may
want to start the class with a brief review or warm-up activities for quantifiers
and/or count and non-count nouns.

1 Go to the COCA <corpus.byu.edu/coca/> and login to your account.


2 Check to make sure the search function is on List. Click to check the box
that says Sections. Then, click on the word Sections. In box 1, click on
SPOKEN, and in box 2, click on ACADEMIC. (This is how you can see
the quantifiers in different contexts.)
3 Type the word numerous in the search box and click Find matching strings.
(Be very careful of spelling and the use of lowercase letters.)
4 In the table worksheet, record the results.* Write down the number of
times the word was used in Tokens 1 for spoken and in Tokens 2 for ac-
ademic. Using the frequency information, determine if the word occurs
primarily in spoken, written, or both contexts.** (You can determine this
by simply looking at which context has a higher number.)
*Note: The word numerous is used as an example, so the results are already
written for you.
**Note: Written context is being used as a synonym for academic context.
5 Once the more common context for the quantifier has been discovered,
click on the word under the correct section (spoken or academic) and write
down two examples on the worksheet. Both examples should be complete
ideas. If the quantifier occurs relatively equally in both, then provide one
example from each context.
(When you are choosing examples, try and pick examples that are helpful
to you.)
6 In your written examples, underline the noun that the quantifier describes.
(We will come back to this and discuss it more after you have searched for
all of the words.)
7 Repeat steps 2–6 for a lot.
8 Repeat steps 2–6 for lots.
9 Repeat steps 2–6 for many.
10 Repeat steps 2–6 for much.
11 Repeat steps 2–6 for loads of.
12 Repeat steps 2–6 for multiple.
13 Repeat steps 2–6 for several.
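The COCA steps above can also be mirrored offline. In the sketch below, two short strings stand in for the spoken and academic sections, and the higher-count rule from step 4 is implemented directly; the sample texts and the equal-counts-means-Both rule are illustrative assumptions:

```python
# Sketch: tally each quantifier in a "spoken" and an "academic" text and
# decide which context it favors, as in step 4 of the lesson.
import re

QUANTIFIERS = ["numerous", "a lot", "lots", "many", "much",
               "loads of", "multiple", "several"]

def count_phrase(text, phrase):
    """Case-insensitive whole-phrase count."""
    return len(re.findall(r"\b" + re.escape(phrase) + r"\b", text,
                          re.IGNORECASE))

def compare(spoken_text, academic_text):
    rows = []
    for q in QUANTIFIERS:
        s = count_phrase(spoken_text, q)
        a = count_phrase(academic_text, q)
        label = "Both" if s == a else ("Spoken" if s > a else "Written")
        rows.append((q, s, a, label))
    return rows

spoken = "We got lots of rain and a lot of wind, so much wind."
academic = "Numerous studies report multiple effects across several sites."
for row in compare(spoken, academic):
    print(row)
```

In class, the two sample strings would be replaced by larger text files, or by the token counts students read directly off the COCA Sections display.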

Note to teachers: Once all of the students have completed these steps, write
each of the quantifiers on the board. Have each student come and write down
one example (or more as you see fit) from their worksheet.
After they have done this, review with the students and identify count vs.
non-count nouns. Use this as an opportunity to have a conversation about
which quantifiers can be used for both types of nouns and which quantifiers can
only be used for one type of noun.

Possible Extensions
• Assign the students to smaller groups of 2–4 and have them write a dia-
logue using quantifiers that are more common in spoken contexts. Have
them share their dialogues upon completion. (You could assign a specific
topic or a specific list of count and non-count nouns.)
• Put the students in pairs. Have the students write a summary of a recent
topic you have covered. Make them include the quantifiers that are more
common in written contexts. (You could choose to provide a specific list
of count and non-count nouns.)
• Have the students come up with a list of other quantifiers or provide a list
of other quantifiers and have them complete the same activity at home for
practice.

Variations
• The students could use their cell phones instead of computers.
• The students could work in pairs instead of individually.
• This activity could be done with different words.
• This activity could be done using different contexts.

Worksheet
Students’ quantifier worksheet

Columns: Quantifier | Tokens 1 in Spoken Corpora | Tokens 2 in Written Corpora | In which context should you use this word? (Spoken, Written, or Both) | Provide 2 examples in the chosen context; underline the noun that the quantifier modifies.

Example: numerous | 1306 | 8740 | Written | 1. You can find help at numerous online sites. 2. Numerous evaluations of after-school programs exist.

a lot | ______ | ______ | ______ | ______
lots | ______ | ______ | ______ | ______
many | ______ | ______ | ______ | ______
much | ______ | ______ | ______ | ______
loads of | ______ | ______ | ______ | ______
multiple | ______ | ______ | ______ | ______
several | ______ | ______ | ______ | ______

Teacher Perspectives on CL in the Classroom

Marsha Walker is a doctoral student at Georgia Southern University,
where she is working towards a degree in Higher Education Leadership.
She also currently works as an ESL lecturer at the Georgia Tech Language
Institute. Before joining the Language Institute, she worked in a variety of
educational settings, including a public middle school, public high school
(through Teach for America), and private tutoring company. Her research
interests consist of pedagogical applications of CL, technology applications,
vocabulary strategies, and curriculum development.

What do you see as the primary hurdles for teachers
in integrating corpus approaches into their teaching?
I think the main difficulty is time: the time it takes to create the corpus
activity and the time it takes to do the corpus activity. We teachers are
continually faced with deadlines, and incorporating corpus lessons can be
problematic when it often takes more time than a traditional worksheet.
However, I have found that there is a learning curve for the students and
instructors. Once students understand how to use a corpus, activities can


be completed more quickly. Once teachers know the types of lessons that
are typical/most useful, they can prepare for the tasks in a more reasonable
amount of time as well.

What future directions or applications of CL do you
anticipate?
Every time I desire to create a corpus-based lesson in my class, it seems as
though some part of the corpus I am using has changed. I think the change
has predominately been that corpora are becoming a more specific type.
What I mean is that instead of simply being a collection of English from ev-
erywhere, there is one corpus on historical American English, another corpus
on news English, and still another on international English. I anticipate in the
future that corpora will continue to be context-specific, and I believe that is
what will happen with corpus-based linguistic approaches. In other words,
I anticipate educators and researchers are going to choose a corpus which
is dependent on the specific context they are teaching or researching. I also
think that the current trend of corpora becoming more user-friendly will con-
tinue. A direct result of easier accessibility is likely to be a future cell phone
application.

What are your recommendations for language teachers
who are starting to explore this approach?
My major recommendation would be to not give up. Language teachers
should give the use of corpus-based materials several chances. In my ex-
perience, the first attempt of using a new corpus or a new corpus-based
lesson will cause you to undoubtedly have some issues. However, the more
experience you have with different corpora and corpus-based activities, the
smoother the class will go. I think it is truly worth your time because it
provides an authentic element into your classroom that is often missing.
I would also encourage you to experiment with many different activities
because you will find what works for you and for your class. My other rec-
ommendation would be for you to simplify the process or the results for the
students. Personally speaking, I often get very excited about information I
learn from corpora, and I sometimes find myself oversharing. For instance, I
have shown a specific graph that has all the exact numbers of frequency for
a specific word in different contexts. The students do not need to know that
the word academia has a ratio of 0.2 in spoken contexts but a ratio of 6.1 in
academic contexts. All the students need to know is that academia is more
common in academic contexts than in spoken contexts. So, try different
activities many times and keep it simple!

C3.4 Teaching Linking Adverbials in an English for (Legal)
Specific Purposes Course
Tyler Heath, Embry-Riddle Aeronautical University, Daytona Beach, FL, USA

Lesson Background
The goal of this lesson is to teach a group of international master of laws
(LL.M.) students how to use the AntConc concordancing software to search
for and identify the functions of linking adverbials in legal texts. This lesson is
intended to serve as an exercise in inductive learning, in the sense that students
will develop an understanding of how linking adverbials are used as cohesive
devices in American legal discourse by noticing their usage in written legal
corpora. At the very least, students would be able to use this lesson as an affor-
dance, or supplement, to their legal writing courses.
Students do not necessarily have to be fully versed in corpus-based ap-
proaches, discourse analysis, or grammar to make use of corpus tools in their
learning. The idea here is that, with the ability to use AntConc, international
law students may be better able to explore some of the discourse structures that
are unique to legal English texts, and common-law texts in particular. Because
so many of these students come from backgrounds in civil law, many of them
face particular challenges when writing legal English in the context of ­A merican
common law. By encouraging student familiarity with corpus s­ oftware and
patterns from corpora, they may better be able to solve quandaries that arise in
their legal writing without having to scour the internet, seek the advice of a
professional, or hunt through a textbook.

Notes to ESP Teachers


Ø The idea that students learning to write legal English can benefit from
using corpus tools is supported by Candlin and Hafner (2007), who pro-
vided a concordance tool specifically developed for law students at the City
University of Hong Kong. Despite challenges in having the students fully
commit to using the software (especially the more experienced students),
the study showed that younger, less experienced law students benefited
from using the software, which may have increased their confidence in
their ability to effectively write legal English.
Ø Li (2017) documented how a corpus was used to explore vague language in
legal texts, particularly as it occurred across semantic categories. Although
he used legal texts from the European Union, which, as primarily civil law
texts would not have exactly the same discourse structures as US materials,
the fact that he was able to investigate specific legal vocabulary translates
well for common law students in the US. Specifically, these could be
used to investigate the so-called terms of art, which are words that look
and sound like lay terms but have specific (and different) legal definitions,
such as consideration, for example.
Ø Baldwin (2014) noted that “these” is the most commonly used form of demonstrative in American legal English texts. “These,” then, has to be
correctly used as an anaphoric reference to make writing more precise.
However, students often make the mistake of using the word too broadly,
which leads to imprecise and frequently undesirable legal writing. Hartig
and Lu (2014) used corpus tools to show how plain English is not necessar-
ily the only correct way to write effective legal English, which could have
pedagogical implications when shown to students who feel dejected when
their writing style does not match a professor’s prescriptive approach.
Ø Maher (2015) provided some insight into specific grammar features in a
cross-textual corpus analysis of that as it is used in legal English, specif-
ically with regards to the management of averrals and attributions. His
argument was that any student wishing to study common law had to be
aware of the discourse practices found in that particular discourse com-
munity. While he argued for the use of corpora by teachers and materials
developers, the aforementioned possibilities for student-used corpora do
create an interesting possibility for future applications in the classroom.
For instance, those students who are motivated to use corpus tools may be
able to acquire basic skills in discourse analysis, at least in the very specific
discourse structures used in legal English.

Procedure
AntConc, a text analyzer and concordancer that allows users to search for word
frequencies and n-grams and to compare key words between different chunks of
corpora, will be used in this lesson to explore a set of 50 Constitutional Law
cases and search for the most frequently occurring linking adverbials and four-word
strings. Using AntConc as a teaching tool for legal English courses may
start with a search for frequently used linking adverbials (i.e., transition words).
These features are necessary to create cohesive statements in American legal
writing, as the genre requires students to show the effects caused by past texts
on current or future legal situations (e.g., if/then; therefore; as a result), the degree
of causality derived from contrastive analyses (however, although), and how the
sum of different informational units creates a whole (additionally, furthermore).
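The word-list and concordance searches described here can be approximated with a short script when AntConc is unavailable; the adverbial list, sample sentence, and context width below are illustrative assumptions, and AntConc remains the tool actually used in the lesson:

```python
# Sketch: count a linking adverbial in a text and show simple KWIC
# (key word in context) lines, roughly like AntConc's Concordance tab.
import re

def concordance(text, phrase, width=30):
    """Return (hit_count, list of KWIC strings) for one phrase."""
    hits = []
    for m in re.finditer(r"\b" + re.escape(phrase) + r"\b", text,
                         re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append(f"{left}[{m.group(0)}]{right}")
    return len(hits), hits

case_text = ("The court did not agree. However, it noted the claim. "
             "Therefore, the statute applies; although narrow, it binds.")
for adv in ["however", "therefore", "although", "moreover"]:
    n, lines = concordance(case_text, adv)
    print(adv, n, lines)
```

Loading real case files is simply a matter of reading each .txt into one string first.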

Settings and Contexts


This lesson is intended for students in Lawyering Skills for LL.M. Students, which
is a pilot course being taught in an IEP by an ESL instructor and a law practi-
tioner with over three decades of legal experience. The purpose of this course
is to teach international students in a LL.M. program. Instruction foci include
grammar, structured writing, and use of cohesive devices found in American
legal discourse. Due to the common-law legal system used in the US, students
must understand and produce the language necessary to clearly show how differ-
ent legal texts from the past—including cases, case briefs, and court opinions—
are or could be applied to current or future cases. That is, students must use past
legal discourse to make predictions about how future legal questions might be
best answered. The level of English required for this kind of written production
is very advanced; while all students show a high proficiency in oral communica-
tion (based on interview samples and instructors’ personal experience in work-
ing with the course), most are unfamiliar with, and have difficulty producing
accurate and sophisticated grammatical structures and cohesive devices neces-
sary for written legal discourse at a professional level in American law settings.

Lesson Outline
Pre-class
Teacher:  Informs students of the class focus and objectives, provides nec-
essary information and materials via email and, when available, through a
learning management system (e.g., Blackboard, iCollege).
Students:  Preview materials (.txt files of all legal cases they are using for the
course; AntConc.exe; instructions to download AntConc software)

In-class
Teacher:  [Projects an image of a text (below) onto a large screen] Please take a look at
these sentences from your legal case files, and tell me what you notice about
them. Do they have anything in common? Are they different in any way?

Although the court did not characterize this interest as absolute, it repeat-
edly indicated that it outweighs any countervailing interest that is based on
the quality of life of any individual patient.
In addition to relying on state constitutions and the common law, state
courts have also turned to state statutes for guidance.
The state, therefore, has power to prevent the individual from making
certain kinds of contracts, and in regard to them the Federal Constitution
offers no protection.

Students:  [Examine the sentences, and volunteer answers to the teacher]


Teacher:  [Responds to student answers; Reveals the target language struc-
ture, as shown by the bolding of although, in addition, and therefore].

These are called linking adverbials, or transition words, and they’re used to
build cohesion in legal texts. [Provides students with a list of common linking adver-
bials and transition words used in legal English writing (See Appendix A)].
First, I want you to try using the AntConc software to notice when the writ-
ers are using different linking adverbials or transition words in the cases. Can
you all please go to [Blackboard or iCollege] or your email and run AntConc
with the .txt files? [Models this action, which is projected to the students]

Students:  [Follow model and instructions, running the software and up-
loading and opening .txt files]
Teacher:  [Continues to model instructions as they are spoken] Double-click on
the AntConc icon. Now, click on the File tab in the top left corner of the win-
dow. Then, click Open File(s). Next, select the Search.Law.Casebook.txt file.
This file contains all of our legal cases for the semester. Finally, click Open.
Students:  [Follow teacher’s guide, load software and import .txt files into
software]
Teacher:  [Continues to model instructions as they are spoken] Click on the
tab that says, Word List. Then, at the bottom left of the window, click the
Start icon. You should see a list of words. Now, click on the tab that says,
“Concordance.” Let’s try searching for the word, however. In the search
box, type however, and then click the Search icon. Now, how many hits
do we have for however? [Asks whole class]
Students:  [Volunteer responses]
Teacher:  [Continues to model instructions as they are spoken] Now, if we
look at these occurrences, what do you notice about them?
Students:  [Volunteer responses]
Teacher:  [Continues to model instructions as they are spoken] That’s right,
they usually occur at the beginning and middle of the sentence. And, they
are almost always followed immediately by a comma. Let’s go to hit #205.
However is in the first position. This should be the Planned Parenthood
case. Click on however. It should be highlighted in blue.
Students:  [Follow instructions]
Teacher:  [Continues to model instructions while speaking] Please work
with a partner. Start with the sentence before however – the sentence
should start with, The separate but equal doctrine… Read the two sen-
tences together, and come up with a way to describe what however is
doing to the information.
Students:  [Read the sentences as they are instructed; Begin to openly vol-
unteer answers to the teacher]
Teacher:  [Addresses student responses, and calls attention to relevant points
made about the text as it is displayed on the projector] Good! I like that
response! I’d like you to notice that however is used, here, to show a con-
trast. In the first sentence, the court is stating one piece of information.
272  Corpus-Based Activities in the Classroom

Then, they use however to call attention to a contrast in ideas – the rest
of the sentence following however shows the court’s argument against the
information given in the preceding sentence.
Students:  [Remark on the structure presented by the teacher, possibly ask-
ing more detailed questions about however, or possibly other questions
about words that might be used as a substitution]
Teacher:  [Responds to student remarks/questions; gives instructions] Now,
continue working with the same partner. With your partner, I want you to
search for the following words in your corpus.

Group 1: Moreover, Furthermore
Group 2: Nevertheless, However
Group 3: Therefore, Thus
Group 4: Namely, Specifically
Group 5: Although, Notwithstanding
Group 6: In addition, Additionally

Teacher: [Gives instructions] As you work, make notes about the number of
occurrences of each word, and where they fall in the sentence structure.
If the word comes at the beginning of the sentence, read the sentence that
precedes it. Then, discuss with your partner the function of the word in
the sentence. What is it doing? Why do you think the writers chose that
word, and not another?
Students: [Work on assigned task]
Teacher: [Gives instructions] Now, take 5 minutes and talk to a pair or group close
to you. Talk about what you discovered in the corpus. Discussion questions:
Which words were more frequent?
Did a word appear at the beginning of a sentence, or in the middle?
Why do you think the writer made a choice to use a certain word, but not
another?
What function do you think the word has in the larger context in which
it occurred?
Students: [Work on task]
Teacher: [Addresses entire class] Well, I’m hearing some pretty interesting dis-
cussions. Did anyone have an insight they would like to share, or a question
about the word usage or function?
Students: [Volunteer responses; discussion ensues]
Teacher: [Addresses entire class] Now, I have a little something that I want you
to do before next class. [Gives worksheet (See Appendix B).] I want you to
run your Re Griffiths v. Ambach case file through the AntConc software,
and use that to complete the worksheet. You need to bring this to our next
meeting so we can discuss our results together. Okay? See you next time.
End of Class: [Class is dismissed, teacher answers any specific questions the students
have about the assignment]
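Teachers who eventually want to move beyond point-and-click searching can reproduce AntConc's Concordance view with a few lines of scripting. The sketch below is illustrative only and is not part of the lesson: the sample passage and the function name concordance are invented for demonstration, and a real lesson would load the class's .txt casebook file instead.

```python
import re

def concordance(text, word, width=30):
    """Return KWIC (keyword-in-context) lines for every hit of `word`."""
    hits = []
    for m in re.finditer(r'\b%s\b' % re.escape(word), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()].replace('\n', ' ')
        right = text[m.end():m.end() + width].replace('\n', ' ')
        hits.append(f"{left:>{width}} [{m.group()}] {right}")
    return hits

# A short invented passage standing in for a legal-case .txt file:
sample = ("The separate but equal doctrine was upheld for decades. "
          "However, the Court later rejected it. The ruling was narrow; "
          "however, its consequences were broad.")

for line in concordance(sample, "however"):
    print(line)
```

The number of lines returned corresponds to AntConc's "Concordance Hits" figure, so students' hand counts can be double-checked this way.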

Recommendations for Teachers


• When introducing AntConc to international, professional students, provide
very explicit modeling, repeated instructions, and activities
• Pause frequently during your instruction, as some students may be con-
fused at first by the software and common terms and procedures
• When first introducing the software, do simple searches
• Make sure all of your students are using the same .txt files
• Make sure all of your students are using the same version of AntConc—even
small variations in different versions can have large usage discrepancies, espe-
cially for students who are not very familiar with the process and its applications
• Using key word lists to compare corpora can get tricky—if you choose to
do this, go slow
• When teaching the activities provided in this lesson, be prepared to an-
swer a lot of questions about word placement—students typically have
­preconceived notions about things like whether or not however can be
used anywhere but the beginning of a sentence
• Even if students are familiar with other concordance tools, they may get
very frustrated if they feel lost during the explanation (focus on the first
two recommendations included)
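The warning above about key word lists can be made concrete. Keyness in concordancers such as AntConc is commonly computed with the log-likelihood statistic; the sketch below shows that calculation for a single word. All frequencies here are invented for illustration, and the function name is my own.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness for one word: how unexpected is its frequency
    in the study corpus, given the reference corpus?"""
    # Expected frequencies under the null hypothesis of no difference
    total = freq_study + freq_ref
    e1 = size_study * total / (size_study + size_ref)
    e2 = size_ref * total / (size_study + size_ref)
    ll = 0.0
    if freq_study:
        ll += freq_study * math.log(freq_study / e1)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / e2)
    return 2 * ll

# Invented example: 'however' in a 10,000-word legal corpus
# vs. a 100,000-word general reference corpus
print(round(log_likelihood(45, 10_000, 120, 100_000), 2))
```

A large value signals a genuine frequency difference between the corpora; a value near zero means the word is used at roughly the same rate in both, which is why corpus sizes must be taken into account before comparing raw counts with students.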

Appendix A: Transition Words Worksheet


Transitional Words and Phrases

Transitional words and phrases can create powerful links between ideas in
your paper and can help your reader understand the logic of your paper.
However, these words all have different meanings, nuances, and connotations.
Before using a particular transitional word in your paper, be sure you under-
stand its meaning and usage completely and be sure that it’s the right match for
the logic and intended flow in your paper.

ADDITION
furthermore moreover too even more
also in the second place again further
in addition next finally besides
and, or, nor first second, secondly last, lastly

TIME
while immediately never after
later, earlier always when soon
whenever meanwhile sometimes in the meantime
during afterwards now, until now next
following once then at length
simultaneously so far this time subsequently
PLACE
here there nearby beyond
wherever opposite to adjacent to neighboring on
above, below

EXEMPLIFICATION or ILLUSTRATION
to illustrate to demonstrate specifically for example
for instance as an illustration e.g.

COMPARISON
in the same way by the same token similarly
in like manner likewise in similar fashion

CONTRAST
yet and yet nevertheless nonetheless
after all but however though
otherwise on the contrary in contrast notwithstanding
on the other hand at the same time

CLARIFICATION
that is to say in other words to explain i.e., (that is)
to clarify to rephrase it to put it another way

CAUSE
because since on account of for that reason

EFFECT
therefore consequently accordingly
thus hence as a result
PURPOSE
in order that so that to that end, to this end for this purpose

QUALIFICATION
almost nearly probably never
although always frequently perhaps
maybe
INTENSIFICATION
indeed to repeat by all means of course
(un)doubtedly certainly without doubt in fact
surely

CONCESSION
to be sure granted of course, it is true

SUMMARY
to summarize in sum in brief
in short in summary to sum up

CONCLUSION
in conclusion to conclude finally

Appendix B: Homework
Instructions
• Load the Re Griffiths v. Ambach case file into AntConc
• Use the following words when doing your concordance searches:
o However
o Furthermore
o Although
• Use your Transition Words Worksheet to help find alternative word choices
• Fill in the blanks to complete the following exercise

However
Function of the word (give a brief summary):
________________________________________________________________
________________________________________________________________

Possible alternative word choices (list words):


________________________________________________________________

If you had to choose a word—either however or another word—which would


you choose, and why? (write a short answer):
________________________________________________________________
________________________________________________________________
________________________________________________________________

Furthermore
Function of the word (give a brief summary):
________________________________________________________________
________________________________________________________________

Possible alternative word choices (list words):


________________________________________________________________
If you had to choose a word—either furthermore or another word—which
would you choose, and why? (write a short answer):
________________________________________________________________
________________________________________________________________
________________________________________________________________
________________________________________________________________

Although
Function of the word (give a brief summary):
________________________________________________________________
________________________________________________________________

Possible alternative word choices (list words):


________________________________________________________________

If you had to choose a word—either although or another word—which would


you choose, and why? (write a short answer):
________________________________________________________________
________________________________________________________________
________________________________________________________________

Tyler Heath is an instructor at the IEP at Embry-Riddle Aeronautical Uni-


versity, Daytona Beach, FL. He worked as an English language instructor in
South Korea, focusing on oral communication and cross-cultural awareness.
His research interests include English for Specific Purposes, Aviation English,
sociolinguistics, and language policy and planning.

The following is a similar concordancing activity focusing on ‘transitions’ and


cohesion intended for an intermediate writing course in an IEP setting. Unlike
Heath’s section (C3.4), this activity only introduces three groups of transi-
tion markers: Addition, Causal, and Contrast/Concession. The activity also
incorporates a research component, allowing the students to make use of a
concordancer in obtaining data from a corpus. In this example, instead of Ant-
Conc (and teacher-collected corpora), the MICUSP database (MICUSP Simple
BETA: http://micusp.elicorpora.info/) is used.

C3.5  AntConc Lesson on Transitions for an Intermediate


Writing Class
Lena Emeliyanova, Georgia Tech Language Institute, Atlanta, GA, USA

Objectives
Students will review the concept of cohesion in academic writing. Students
will become aware of how frequently different transition words are used in
common written texts across disciplines. In addition, they will be provided
opportunities to see patterns of their own use of transitions in their writing.
Students will make sentences using transitions of their choice.

Warm-up
Teacher:  We have finished looking at the essay structure, and today we will
start looking at something that is very important for clear and effective
writing. (Goes to the board and writes ‘cohesion’.)
Teacher:  Can anybody tell us what ‘cohesion’ means?
Students:  Connection of ideas/Linking/Flow/etc.
Teacher:  Good. Why is creating strong cohesion important for good writing?
Students:  Maintains the flow/helps the reader/easier to read and understand.
Teacher:  When we write, what do we use to create cohesion?
Students:  Transitions.
Teacher:  Great! Yes, some people call them transitions; others might call
them linking words or linking adverbials.

Building Schemata
Teacher:  (Displays a list of transitions on the screen.) Work with your partner to
categorize the following transitions into three groups. You have to decide
what the categories are. Write them down in three columns.

List of transitions: therefore, moreover, however, on the other hand, yet, in
addition, thus, furthermore, first, nevertheless, finally, nonetheless.
(Students work in pairs to group the transitions. Teacher elicits categories’
names and students’ groupings.)

Addition Causal Contrast/Concession


moreover therefore however
in addition thus on the other hand
furthermore as a result yet
another reason hence nevertheless
additionally

Mini-Research Project
Teacher:  Transitions from which category do you think are used the most
in academic writing? Do you think that all disciplines/majors use these
transitions equally? (Students volunteer answers.)
Let’s go back to MICUSP and do some mini research on how frequently each
type of transition is used in argumentative essays, since your next project is
an argumentative essay.
(Teacher divides students into three groups, assigns each group a category. Students work
together to count how many times transitions from each category are used.)
Teacher:  What are your numbers? (Fills out the table)

Addition Causal Contrast/Concession


moreover (89); hence (194); however (3,242);
in addition (100); therefore (1,174); on the other hand (303);
furthermore (107); thus (1,926); yet (802);
another reason (3); as a result (363); nevertheless (146);
additionally (51);
also (5,791)
Total: 6,141 Total: 3,657 Total: 4,493

Teacher:  In groups, discuss the following questions (Displays questions on
the screen.) If you need to, please go back to MICUSP for clarification or
double-checking.
Do you see any problems with the frequencies of ‘also’ and ‘yet’? (Answer:
They are not always functioning as linking adverbials.)
Are any of these results surprising to you?
Which of the addition transitions do you use most frequently in your own
writing? How does it compare to the patterns that we discovered in MICUSP?
Have you ever used ‘thus’ or ‘hence’ in your writing?
Where in a sentence does ‘however’ usually appear?
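A teacher preparing this mini-research activity can pre-check the tallies with a short script before class. The sketch below counts raw string hits per category; the sample text, the category lists, and the function name category_counts are all invented for illustration. Note that, exactly as the discussion questions point out, raw hits for words like yet include non-adverbial uses.

```python
import re

CATEGORIES = {
    "Addition": ["moreover", "in addition", "furthermore", "additionally"],
    "Causal": ["therefore", "thus", "as a result", "hence"],
    "Contrast/Concession": ["however", "on the other hand", "yet",
                            "nevertheless"],
}

def category_counts(text):
    """Tally raw hits per transition category -- like the class tally,
    and with the same caveat: 'yet' hits include non-adverbial uses."""
    text = text.lower()
    return {cat: sum(len(re.findall(r'\b' + re.escape(w) + r'\b', text))
                     for w in words)
            for cat, words in CATEGORIES.items()}

sample = ("However, the data were clear. The test was not finished yet. "
          "Thus, we revised the design; moreover, we added a control. "
          "Yet the results held.")
print(category_counts(sample))
```

Running the same function over a folder of student essays gives each writer a quick profile of their own transition use, which can then be compared against the MICUSP figures.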

Positions of Transitions
Teacher:  We have identified four possible positions where transitions can
appear in sentences. Now, let’s find sample sentences for each transition.

Instructions:

• I will assign you a transition.


• Search MICUSP for sample sentences of your transition in different
positions.

• Copy and paste sample sentences in Google docs under your transition.
• Some transitions don’t appear in all four positions. Examine three
pages of examples, and if you can’t find an example for a certain posi-
tion, move on to the next one.
• Try to choose sentences that your classmates can understand.

(Teacher distributes these instructions.)


(Students work in pairs to complete a Google docs spreadsheet with sample sentences.
Teacher monitors and provides feedback).
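The position search that the students carry out in MICUSP can also be scripted, which is handy for checking sample sentences before distributing them. The sketch below is a simplification invented here: it uses a naive sentence splitter and a three-way classification (initial, medial, final) rather than the full set of positions discussed in class, and the example sentences are my own.

```python
import re

def however_positions(text):
    """Classify each hit of 'however' as sentence-initial, -medial, or -final."""
    positions = []
    for sent in re.split(r'(?<=[.!?])\s+', text):
        words = re.findall(r"[\w']+", sent.lower())
        for i, w in enumerate(words):
            if w == "however":
                if i == 0:
                    positions.append("initial")
                elif i == len(words) - 1:
                    positions.append("final")
                else:
                    positions.append("medial")
    return positions

sample = ("However, the court disagreed. The statute, however, was upheld. "
          "The defendants objected, however.")
print(however_positions(sample))
```

Counting the resulting labels across a corpus shows students at a glance which positions a transition actually favors.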

Teacher:  Now, homework. Read the sample sentences that you have com-
piled in class. Choose 5 transitions that you have never used before or you are
not very clear about how to use. Make sentences in Google docs/Transitions
Homework. Do not forget to put your names by your sentences. We will be
looking at your sentences tomorrow in class.

Lena Emeliyanova is an ESL lecturer at the Georgia Tech Language In-


stitute. Originally from Russia, she has lived in China where she spent four
years teaching English, learning Chinese, and gaining a deeper appreciation
of cultural differences and similarities. Her teaching and research interests
include intercultural communication, technology in the classroom, and L2
writing. She is also interested in collaborative processing of corrective feed-
back and how this impacts language development.

C3.6 “The Explorer’s Journal”: A Long-Term, Corpus Exploration


Project for ELLs
Matthew Nolen, Conexion Training Panamá, Language Program Director

Lesson Background
The following lesson is expected to be taught to smaller classes (between
5 and 10 students) of learners whose proficiency levels, on the American
Council on the Teaching of Foreign Languages (ACTFL) scale, range from
intermediate-mid to advanced-high or above. By the time of the lesson, ELLs
will be familiar with the basic principles of CL and the use of corpus tools but
may not be confident enough to go exploring through a corpus. The objectives
of the lesson are as follows:

1. Introduce students to the Journal Entry template.


2. Walk through each section of the Journal Entry template.
3. Assign the Corpus Journal Project.

Students in the course will have learned (1) the basic theory and use of corpora
in the classroom, (2) the effects of register and register variation in language
usage, (3) where to find online corpora for English, (4) the online interface of
the COCA, and (5) the AntConc concordancer using files from the Michigan
Corpus of Academic Spoken English (MICASE).
The following lesson is not intended for the instruction of new tools or techniques.
It is to establish a class project to sustain the use of corpora. The lesson is not
intended to focus solely on one corpus. Instead, it opens the doors to a variety
of corpora depending on the interests of the students.

The Project
This lesson will introduce a class project that will be active for approximately
nine weeks after the Journal Entry template is taught. The project will be
presented at the end of the handout provided. The project will serve three pur-
poses by the end of nine weeks:

• Students can reference their journal for corpus findings that they investigated.
• The Journal Entries will serve as part of a class portfolio for a grade.
• The project will allow students to become comfortable with corpus tools.

Notes to the Teacher


Ø The following handout is for the introduction of a template that is inten-
tionally open-ended. Journal Entries are meant to be applied to any corpus
and should not limit students’ investigations. To clarify, the purpose of the
Entries is to establish habits of investigation in the students that will promote
autonomy and agency. Entries center around 11 guiding questions or in-
structions that are listed later. If students wish to add to their Entries, DO
NOT LIMIT THEM. Allow them to dig deeper, but have them at least
complete the Entry. The explorer theme was selected to instill a sense of
adventure and discovery. After completing this lesson, consider how you
may want to elaborate on the theme or encourage students to become ex-
plorers of the English language.
Ø The intention of the Explorer’s Digital Journal (EDJ) is for any available
corpus to be explored systematically in a fashion that would benefit ELLs.
However, any corpus that teachers want to use with the EDJ needs to be
explicitly taught to learners before they can start exploring. For this par-
ticular lesson, the online interface for the MICASE was used. MICASE
was selected because of its relatively simple online interface and its register
for spoken academic English. In Friginal and Hardy’s (2014a) discussion,
registers are “text categories that have been situationally defined and have
shared general communicative purposes… [specifically,] the most common
linguistic features across spoken and written texts” (pp. 25–26). The em-
phasis of the course will largely center around spoken language; therefore,
a spoken register is necessary as the target domain.
Ø According to Tribble (2015), MICASE is the third most common corpus
tool taught to ELLs. This is likely because MICASE is a free online in-
terface that is relatively easy to teach to ELLs. It is important to remember,
however, that the selection of corpora for this particular project is flexible
since the EDJ is general enough to apply to a variety of corpora. Teachers
or learners willing to use other corpora, or to have learners develop their
own corpora as Charles (2014) did, can still participate in the EDJ if the
corpus tool to be applied is taught first. What is critical is that learners
are familiar with the tools before developing the habit of exploration. What-
ever corpus is used, a brief explanation of the limitations and uses of the
specific corpus is necessary to illuminate the importance of selecting corpora
based on whether a corpus is representative of the language to be modeled.
Ø As a coordinating side project, learners will be required to post their findings
on a blog as part of their project. For the learners that I teach, the blog serves
two purposes: First, it is part of a project where they will post reflections of
their language learning experiences. Second, it will also serve as a database for
some of the tools and techniques they learn throughout courses in the study
abroad program they are attending. This includes the EDJ entries that they
will maintain throughout the semester. The first step of creating a blog will be
part of a lesson during the first week of classes. WordPress (www.wordpress.
com), a website that offers a free blog service, is recommended for the purposes
of the activity since it is easy to create an account and free to use.

Setting and Context


As previously noted, for this project, only MICASE was selected as the corpus
database. However, time may allow for the teaching of additional tools and
resources later in the course. This is at the discretion of the teacher who needs
to decide if the additional databases would encumber or enlighten her or his
learners. As far as finding a variety of corpora, they are becoming easier to
locate as they grow in popularity (e.g., BYU databases from Davies: www.
corpus.byu.edu). Variety allows for teachers to select corpora that would best
serve students. It is important that the teacher takes ownership of the EDJ for
it to work. The teacher knows the needs of her or his learners better than
anyone else, and it is the teacher’s responsibility to act, teach, or train for the
betterment of those learners.
The EDJ was introduced in an ELL/ESL class within a study abroad pro-
gram that follows a Task-Based Language Teaching (TBLT) methodology. TBLT
courses center around having learners complete communicative tasks as a means
of language learning. The tasks in the course included explicit instruction of
cognitive and metacognitive learning strategies; the use of authentic materials like
movies, newspapers, Instagram or Facebook posts, and music; and corpora as a
language learning tool. The underlying goal of the course was to provide learners
with an alternative means to learn language outside of textbooks and traditional
classroom settings. It should be noted that corpora and this activity can be incor-
porated into a variety of courses and classes that may not directly require it. It is,
after all, a tool. With good direction, corpora benefit language learners in general;
they are not best kept in a single classroom that only deals with corpora.

EDJ Handouts and Worksheets

Keeping a Corpus-Based Journal

Name Date: Page:


1. What WORD or PHRASE would you like to explore?

2. What is the PART OF SPEECH of a word?

3. What CORPUS will you use?

4. What is the REGISTER(S) of the WORD or PHRASE you are interested in?

5. How many TOKENS/HITS does your word or phrase have?

6. List any COLLOCATIONS (friends of the word) that are common for the translation
of the WORD or PHRASE in your first language. Include the token number.

7. See if the same COLLOCATIONS that you would use in your first language are
possible in English.

8. Find three common patterns or characteristics of the COLLOCATIONS for your


WORD or PHRASE in English.
1.
2.
3.

9. Find two rare COLLOCATIONS in English.


1.
2.

10. Write an entire sentence as an EXAMPLE with the WORD or PHRASE you
explored.

11. Write a second EXAMPLE.

Please keep a master copy of the template drawn out earlier to use throughout
the journal.
One of the great things about using a corpus is discovery. With or with-
out the help of a guide (teacher), you now have the ability to explore English
through corpora. You need to maintain and continue the hunt for understand-
ing by continually exploring the English language! All great explorers kept
maps and journals of their journeys. Now it’s your turn! Your project is to
develop a corpus journal.

But How Do You Complete an Entry?


Don’t panic! This handout provides step-by-step instructions for your Journal
Entries to explore more and more about English. Please understand that an ex-
ample was provided on the handout AND another example will be completed
during this class.

STEP 1: Select the Word or Phrase


Start with a word or short phrase that you are interested in exploring! For
example, let’s look at the often-confusing preposition “by”.

Hint: Where can you find words? Look in your corrected homework for words
or phrases that you have trouble with. Phrases should not be greater than
three words.

What WORD or PHRASE would you like to explore?


by
(Example for Class)

STEP 2: Identify the POS


Let’s start with the basics. Before analyzing the word or phrase, try to figure
out the POS. This will help you understand the context(s) where you may
find the word.

Hint: What do you know about the word or phrase that you are exploring?
Can you tell me the part of speech? (Noun, verb, adjective, etc.) There may
be more than 1, so underline the part of speech that you are most interested
in. If you are having trouble finding the part of speech, look it up in a
dictionary.

What is the PART OF SPEECH of a word?

Preposition & Adverb

(Example for Class)

STEP 3: Select a Corpus & Identify the Register(s)


Now we have to choose where we get our information. It is very important
that we choose from the right body of information for our interests. For the
example of by, we will look at its use in a spoken, academic context through
the MICASE.

Hint: What context do we want to study? Are we more interested in the
written or spoken contexts? Do we want formal contexts like academic
journals or informal narrative contexts like works of fiction?

What CORPUS will you use?


Michigan Corpus of Academic Spoken English (MICASE)
(Example for Class)

What is the REGISTER(S) of the WORD or PHRASE you are interested in?
Spoken English in classrooms
(Example for Class)

STEP 4: Identify the Number of Tokens


Open up AntConc and download the MICASE corpus from Mr Matt’s flash
drive. Look up the example under the Concordance feature. In the top-left
corner, you will see a Concordance Hits line. Write the total number of hits
(Figure C3.2).

Figure C3.2  Sample AntConc tutorial figure for students.

How many TOKENS/HITS does your word or phrase have?

4037 hits

(Example for Class)

Don’t let all the details overwhelm you! If you have any questions or concerns,
talk to each other or ask Mr Matt.

STEP 5: Identify the Top Collocations


Look over the Collocate and Clusters/N-Gram functions in AntConc. Take
5 minutes to explore collocates and choose interesting or unpredicted
collocates. List at least three and include the total hits. Let’s look at by…

Hint: Take this opportunity to start exploring confusing or troublesome
words or phrases!

List any COLLOCATIONS (friends of the word) that are common for the translation
of the WORD or PHRASE in your first language. Include the token number.

is [past participle] by (360 tokens)/ by bona fide or by bona fide researchers (both
152 tokens) / by the way (118 tokens)

(Example for Class)
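What the Collocate function computes can be demystified with a short script. The sketch below counts words within a fixed window of the node word by raw frequency (AntConc can also rank collocates by a statistic rather than by raw counts); the function name and sample sentence are invented for illustration.

```python
from collections import Counter
import re

def collocates(text, node, window=3):
    """Count words within `window` tokens left/right of each hit of `node`
    (a raw-frequency version of a concordancer's Collocate view)."""
    tokens = re.findall(r"[\w']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

sample = "the study was conducted by the researchers and reviewed by the board"
print(collocates(sample, "by").most_common(3))
```

Even this toy example surfaces the passive-voice pattern ([past participle] + by) that students discover in Step 7.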



STEP 6: Try to Find Collocations from Your First Language


A common occurrence in language learning is literal translation from your
first language. For this example, we will explore by and its collocates in my
L2, Spanish.

Hint: There are 2 ways to approach this—either you can rely on intuition or
we can attempt to find corpora in your first language. The first option is the
easiest, but both will be reflected in this example.

See if the same COLLOCATIONS that you would use in your first language are
possible in English.

(Intuition) “By” carries a variety of translations in Spanish including por, de, en,
and con. Therefore, the literal translation of by is complicated. For example, the
common expression por favor would literally translate into by/through/because of
favor.

(Corpus) According to Corpus Del Español (on corpus.byu.edu), the apparent
translations es [pasado participio] por and por el camino are possible. The
passive voice in Spanish (Es [PP] por) follows a similar construction to English.
However, the meaning behind por el camino is literal. For example, por el
camino will indicate that something or someone is literally by the path/way/
street.

Por bona fide does not exist in the corpus and can be inferred to be rare or not possible.
(Example for Class)

We’ve explored collocations. We’ve gathered information and seen how they
are used in English and in your first language.
So…what’s next?
The next step is to form hypotheses based on patterns that we see!

STEP 7: Find Three Patterns or Characteristics in the Collocations for the
Word or Phrase

Have you seen some recurring patterns in the data? See if you can list
characteristics of the patterns in the following collocates. If something
occurs more than 10 times in the corpus, it may be a pattern.

Hint: Characteristics, for the purposes of the Journal, are any defining
feature for grammar, meaning, or vocabulary based on frequent patterns.
The discoveries of one student may not be identical to the discoveries of
another student. The example with by is going to be synthetic since it comes
from an instructor’s perspective.

Find three common patterns of the COLLOCATIONS for your WORD or PHRASE

Is [past participle] by:
1. Passive voice
2. Verb phrase
3. Is owned by has 152 hits in MICASE.

By bona fide:
1. Bona fide means authentic or true
2. Used mainly for emphasis.
3. Latin expression for in good faith used in English.

By the way:
1. A lexical bundle
2. Used to add information or change the subject in a conversation
3. Used in sentence-initial, medial, and final positions

(Example for Class)

If you have a hypothesis on how something in English works, test it out and
document your findings for extra credit!
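Hypotheses like these are exactly what the Clusters/N-Gram function helps test, and its core logic is simple to approximate in code. The sketch below counts n-word sequences containing the node word; the function name and sample sentence are invented for illustration.

```python
from collections import Counter
import re

def clusters(text, node, n=2):
    """Count n-word clusters containing `node` (a raw-frequency analogue
    of a concordancer's Clusters/N-Grams view)."""
    tokens = re.findall(r"[\w']+", text.lower())
    grams = (' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(g for g in grams if node in g.split())

sample = ("the paper was written by the students and by the way "
          "it was graded by the teacher")
print(clusters(sample, "by").most_common(1))
```

Changing n from 2 to 3 or 4 lets students watch a lexical bundle like by the way emerge from its shorter parts.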

STEP 8: Find Two Rare/Strange/Inappropriate Collocations


With a tentative understanding of how a word or phrase IS USED, it may be
clarifying to also understand how it IS NOT USED OFTEN. Write at least
two phrases that are rare or seem odd.

Hint: Sometimes rare collocations can be due to the context (register). The
order or rarity of the structure may change based on the register.

Find two rare COLLOCATIONS in English


1. Bit by bit (3 tokens)
Means little by little. Possibly more common in other registers.
2. By us (5 tokens)
Defeats the purpose of the passive voice and sounds informal.
(Example for Class)

1.
2.

STEP 9: Write Two Sentences Using the Word or Phrase in an Appropriate
Context

Write two examples of different uses for the word or phrase you explored!
You are free to write two examples for any cluster or collocation that you
discovered in the previous steps.

Hint: Every student may take something different away from the Journal
Entry. Use this section to give an example of a discovery that interested or
fascinated you.

10. Write an entire sentence as an EXAMPLE with the WORD or PHRASE you
explored.
The research project was conducted by bona fide applied linguists.
(Example for Class)
11. Write a second EXAMPLE.
We are learning more English bit by bit.
(Example for Class)

Finished! You’ve completed your first Explorer’s Journal Entry in the English
language. Congratulations!

But… you’re not finished yet!

The Explorer’s Journal Entry Project

I.  Instructions

• Complete two Journal Entries every week for the remainder of the semes-
ter (nine weeks).
• You will present your Journal Entries to class partners, the teacher, or the
entire class every Monday during class time.
• The Journal can be completed online through a blog or on paper through
a notebook or journal.
• Mr Matt will collect paper entries on Monday, look them over, and return
them on Tuesday.
• Mr Matt will check your blog weekly.
• You are expected to explore DIFFERENT words or phrases in ­E nglish.
For example, examining by and by the way as separate journal en-
tries  will not be allowed. You need to explore a variety of words or
phrases.
• Critical Sections are sections that require exploration or explanation. These
primarily include sections 6–11. Evidence of Testing for Extra Credit can
be concordance lines that show the hypothesized pattern or 10 examples
written or typed from the corpus.

II.  Rubric
Each journal entry will be graded in three different aspects + Extra Credit as
shown below:
CL and Grammar Instruction  289

Table C3.2  Journal entry grading rubric

Entry Completion
• Unsatisfactory: The Journal Entry is missing two or more sections, OR the
missing section is a critical section.
• Acceptable: The entry is missing one section, but it doesn’t affect the entire
entry.
• Good: All sections are completed, but some critical sections do not have
much written in them.
• Excellent: All the sections are fully completed and there is plenty written
in every critical section.

Variety (compared to previous entries)
• Unsatisfactory: Every entry is about the same word, group of words, or
types of phrases.
• Acceptable: There is some repetition, but there is also some variety.
• Good: Most of the entries are different, but there is some repetition between
them.
• Excellent: Every entry is different and explores different words or phrases
in the English language.

Hypothesis (pass or fail)
• Fail: Section 8 offers no clear pattern or hypothesis for the characteristics
listed.
• Pass: Section 8 has clear patterns that can be tested and a possible hypothesis.

Testing hypotheses (extra credit)
• Unsatisfactory/Acceptable: N/A
• Good: The hypothesis was documented in the entry, but no evidence of
testing was provided. (+1 point)
• Excellent: The hypothesis was documented and evidence of testing was
included. (+2 points)

III.  Final Thoughts


This is a great opportunity for you to explore and enjoy English inde-
pendently. You are required to complete 18 Journal Entries by the end of the semes-
ter (nine weeks). If you want to do extra, please feel free to do so! Mr Matt

will happily grade any extra exploring that you do. The Journal Entries will
be a part of your student portfolio in Mr Matt’s class. They will be consid-
ered part of the grade for the class.

Recommendations for Teachers

• You have to buy in. Honestly, if you’re not willing to spend time and effort
developing and refining corpus activities, they will likely not have the de-
sired effect. For learners to believe it will help, the teacher has to believe it
will help.
• To use corpora in the classroom, understand corpora! As teachers, we need to
take the time to explore any corpus tools or corpora before introducing them
to learners.
• Training montages are myths! There is no quick way to train learners in cor-
pora (no matter what music you choose). It will take time to reach a point
of competence with corpora in order to use this activity.
• Limitation: Are we there yet? An area of uncertainty is that the time and
effort it takes to get learners to the point of this activity will likely
differ between classes. Future studies may try to find out how long it
generally takes learners at different proficiency levels to develop autonomy
with corpus tools.
• How does one eat an elephant? One bite at a time. How does one teach corpora?
One byte at a time. Well, maybe not that slow. We do, however, need to
be careful to introduce only a few new things per class, especially when
dealing with technical tools like corpora.
• Read the room! If your learners have glazed-over eyes or blank expressions,
you might need to investigate what material is confusing them. Do not be
shy about mixing things up by going impromptu in class if you have no
plan B.
• Spice it up! Corpora have a tendency to not look exciting on their own.
Consider concordance lines; they are literally lines of text taken from a
broader context. It is on us teachers to make it fun. Good teachers can make
good entertainers.
• Make time for Q&A. Learners who have never worked with corpora may
not understand why it matters. It is important to address their questions
and concerns, including what corpora are, how they work, and why they
matter.
• Make it your own. Once you, the teacher, are familiar with material, change
things to complement your teaching style and philosophy. This activity is
not perfect for every group of learners or every teacher. Change things to
make them better for your learners!

Teacher Perspectives on CL in the Classroom

Matthew Nolen works at Conexion Training, a third-party study abroad
program, as an English language instructor and the Language Program
Director (based in Panamá). His passion for language and culture stemmed from
spending 12 years living abroad as a child and teenager in Argentina and India.
His experiences abroad have given him insights into the cultural and linguistic
challenges students may face. Matthew has taught in an ESL setting for four
years. His research interests include CL in the classroom, data-driven learning,
learner autonomy, learner motivation, and task-based language teaching.

What gave you ideas about developing a corpus journal,
and what types of learners and classes best suit this
activity?
When most people think of corpus activities, they probably think of an
activity that lasts a couple of hours, days, or weeks and then ends. To learn-
ers, it may seem like something to master and then it is over. The corpus
journal came from an idea on how to make the use of corpora practical
to individual learners and long-term. The purpose of the exercise is to
guide learners beyond a series of corpus lessons by demonstrating how a
corpus, as a tool, is a huge asset to learners of English.
It can be implemented in a variety of classes including writing, ESP, and,
with the right corpus, even conversation classes. Although it would benefit
advanced proficiency learners most, it is still beneficial to lower-proficiency
learners as well.

What do you see as primary hurdles for language
teachers in integrating corpus approaches into their
teaching?
The biggest challenge in learning to teach anything new is initial impres-
sions and a resistant attitude. To learners and teachers, corpus probably
looks too technical, too impractical, and too boring. If this is the attitude
that teachers carry into the classroom where they teach corpus linguistics,
it WILL be too technical, too impractical, and boring! The classroom
attitude towards something starts with the teacher. The challenge then
is not to see what corpus linguistics is as an academic field, but instead


what it can become in the classroom. As a tool for language learning,


corpus linguistics has huge potential if teachers’ beliefs and classroom
practices positively embrace it. This matters because it takes time to
properly teach anything in corpus linguistics.

How could learners be best motivated to engage in
corpus-based and data-driven activities?
I remember stumbling upon the quote ‘success begets success’ some time
ago, and believe it to be very true of anything. Motivation can come from
a variety of sources, internal and external, but ultimately, learners are more
likely to use something that they believe works for them. If learners can
come to see how corpora and DDL will benefit their language learning and
production, then I believe some would be motivated with that to continue
pursuing it. And yet, it never hurts to make it engaging and fun! Look up
something that the learners may find comical in the corpora or set a theme
for adventurers and ask learners to come in as explorers. Like anything we
are expected to teach, we need to strive to make it successful for learners
and fun to explore.
C4
CL and Teaching Spoken/Written Discourse

In this section, I focus primarily on the teaching of spoken and written ac-
ademic discourse for English learners, but the selected lessons and activities
following this introduction may also be adapted for specific contexts and/
or purposes across settings, including those that are outside university class-
rooms. Sociolinguistic and English for Occupational Purposes (EOP) topics
and applications also benefit from the utilization of corpus data, which, in turn,
could be taught to a wide range of learners. The acquisition of features, such
as politeness in English, for example, is relevant to many international students
attending US universities. Explaining effective, respectful features of email
writing, with text excerpts from corpora, can be a relevant and useful topic of
workshops or orientation programs for international students during their first
semester on US campuses.
It is important to fully explain and illustrate the concept of register variation
when teaching the characteristics of spoken and written English. Although we
might find that a rich and complete description of speakers or writers could
contextualize their language use, it is also important to realize that these indi-
viduals use language differently, depending on the audience and purposes they
have. Corpus-based research on register variation has shown that the lexical
and grammatical findings from one register of a language cannot easily be
generalized to other registers or to the language as a whole. In other words, if
a finding is made based on texts that come from one situational context, that
finding may not apply to language that is used in other settings (Friginal &
Hardy, 2014a). For example, the way that we speak to an employer or a school
principal would differ significantly from the way we banter with friends at a
sporting event. In analyzing this concept linguistically, Biber, Conrad, and
Reppen (1998) ­investigated how the lemma deal was used in two written regis-
ters (academic prose and fiction) from the Longman-Lancaster Corpus. Each of

these sub-corpora consisted of two million words. The researchers were inter-
ested in seeing how often deal was used as a noun or as a verb. Across the
four million words sampled, the overall difference between the two uses was
not that great: deal was used as a noun 366 times in total and was
used as a verb 482 times. However, when looking at the differences between
registers, the researchers found that the distribution of the nominal and verbal
forms was quite different. In fiction, deal was more likely to be used as a noun,
and in academic prose, it was much more likely to be used as a verb.
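A comparison like Biber, Conrad, and Reppen’s can be approximated in a few lines of code once each sub-corpus has been part-of-speech tagged. The sketch below is purely illustrative: the tiny hand-tagged token lists stand in for the real fiction and academic sub-corpora, and the Penn Treebank-style tags (NN, VB, VBP) are my assumption, not the tagset used in the original study.

```python
from collections import Counter

def tag_distribution(tagged_tokens, lemma):
    """Count the POS tags observed for a lemma in a list of (word, tag) pairs."""
    return Counter(tag for word, tag in tagged_tokens if word.lower() == lemma)

# Toy hand-tagged samples standing in for two register sub-corpora
fiction = [("a", "DT"), ("great", "JJ"), ("deal", "NN"), ("of", "IN"),
           ("money", "NN"), ("to", "TO"), ("deal", "VB"), ("with", "IN")]
academic = [("we", "PRP"), ("deal", "VBP"), ("with", "IN"),
            ("to", "TO"), ("deal", "VB"), ("with", "IN"), ("data", "NNS")]

print(tag_distribution(fiction, "deal"))   # one noun use, one verb use
print(tag_distribution(academic, "deal"))  # verb uses only
```

On real corpora, the raw counts would then be normalized (e.g., per million words) before registers of different sizes are compared.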
The theoretical arguments over what can and cannot be fully captured in
corpora and how to best conduct research or teach their use to language learn-
ers identify important considerations that should further define CL and its ap-
plications. Kachru (2008) notes that corpus-based linguistic research is as good
as the corpora on which it is based, and grammatical or lexical analyses of
corpora are as good as the analytical tools, such as grammatical taggers or con-
cordancers, and many other new software programs specifically developed to
analyze them. In addition, defining specific grammars of language, that is, spo-
ken vs. written grammars, and recognizing that they exist are also important
considerations. Sociolinguists are often interested in the context within which
speakers and writers use language. The study of sociolinguistics focuses on
variation in language form and use that is associated with social, situational,
attitudinal, temporal, and geographic influences (Friginal & Hardy, 2014a),
and CL has itself evolved over several decades to strongly support these em-
pirical investigations of language-in-use. It is clear that the use of corpora and
corpus-based approaches is an invaluable and indelible contribution to the field
of sociolinguistics in exploring the structural and functional characteristics of
spoken and written discourse. CL offers ways to investigate the composition
of linear strings of language, showing the linguistic context that can provide
learners with the components of a target word (i.e., a key word) used in a sen-
tence; an utterance; or, as introduced earlier in this book, KWIC searches. It
is important to note, however, that KWIC searches are not necessarily only of
orthographic words. On the contrary, one can search for letters, morphemes,
and even multiple words. In languages like Chinese and Korean, one might
even use such a program to search for characters.
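A bare-bones KWIC routine is short enough to sketch in full. The function below is a hypothetical illustration rather than the engine behind any particular concordancer; because it does plain substring matching, it can find not only orthographic words but also letters, morphemes, or multi-word strings, as noted above.

```python
def kwic(text, key, width=30):
    """Return key-word-in-context lines for each occurrence of `key`."""
    lines, low, key_low, i = [], text.lower(), key.lower(), 0
    while True:
        i = low.find(key_low, i)
        if i == -1:
            return lines
        # Pad the left and right co-text so key words line up in a column
        left = text[max(0, i - width):i].rjust(width)
        right = text[i + len(key):i + len(key) + width].ljust(width)
        lines.append(f"{left} [{text[i:i + len(key)]}] {right}")
        i += len(key)

for line in kwic("We deal with a great deal of spoken data.", "deal"):
    print(line)
```

Searching for a morpheme such as "ing" with the same function would match it inside any word that contains it, exactly as described for KWIC tools above.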
Closely connected with sociolinguistic research, the CL approach in the
study of World Englishes focuses on emerging varieties of English as they adapt
to changing circumstances of use and contact with local languages and cultures.
World Englishes has been operationalized to show the expanding nature of En-
glish used by ESL/EFL speakers in various contexts. Studies of World Englishes
have focused on two major subareas: (1) indigenous varieties of English and (2)
the study of English as a Lingua Franca (ELF). Corpus development efforts in
representing indigenous varieties of English are best represented by the Interna-
tional Corpus of English (ICE), as briefly noted in Section B1. The ICE project
is an attempt to construct comparable corpora for all varieties of English spoken

around the world (Greenbaum, 1996). According to Seidlhofer (2007), the most
widespread contemporary use of English throughout the world is that of ELF,
and the Vienna-Oxford International Corpus of English (VOICE) and the Cor-
pus of English as Lingua Franca in Academic Settings (ELFA) (Mauranen, 2007)
are both invaluable resources (also briefly introduced in Section B1).
Analyzing and teaching intercultural spoken interactions: Two rel-
evant examples here are (1) academic speech between international teaching
assistants (ITAs) in the US and their everyday interaction with students, and (2)
the discourse of telephone-based customer service call centers between offshore
(those located outside the US) representatives and American callers. Corpora
have been collected for these two registers, and results of various analyses have
been used in teaching and training purposes, especially for the L2 interlocu-
tors. Over the years, US universities have increasingly employed international
graduate students (especially doctoral students) as teaching assistants. These
ITAs typically assist professors in grading papers, manage study groups, and
hold regular office hours to meet with students regarding class projects and
examinations. In mathematics, science, and engineering departments, ITAs
commonly teach many introductory courses themselves. These ITAs know the
contents of the class very well, but limitations in language and with teaching
strategies have been reported by students (Reinhardt, 2010). Pickering (2006)
reported that student complaints about ITAs’ language use were motivated pri-
marily by prejudicial behavior or the general complexity of the content of the
class. Continuing training and research are instituted in many graduate pro-
grams to help ITAs improve their use of English in the classroom. To this end,
Reinhardt (2010) investigated spoken directives by ITAs in office-hour consul-
tations from a corpus-based perspective. Two corpora were used in the study
for comparison: the ITAcorp (Thorne, Reinhardt, & Golombek, 2008), which
was a learner corpus that collected classroom activities from advanced ESL and
ITA preparation courses (mentioned previously in Section B2.3.2), and MI-
CASE, the Michigan Corpus of Academic Spoken English (Simpson, Briggs,
Ovens, & Swales, 2002). ITAcorp represented the language of ITAs, while
MICASE was used for comparison as it provided samples of spoken texts by
academic professionals (e.g., full-time instructors and professors). The ITAcorp
includes spoken academic English texts (e.g., lecture presentations, discussion
leadings, and office hour role-plays). Spoken texts from MICASE include 152
academic speech events (e.g., advising, colloquia, discussion sections, lectures,
office hours, and tutorials) from a balance of university academic disciplines
(Simpson-Vlach & Leicher, 2006). MICASE also features coded sociolinguistic
variables, such as speaker ages, gender, academic rank, and field of study.
The primary goal of Reinhardt’s (2010) comparison was to inform ITA
instruction in the context of ESP and cross-cultural pragmatics. In addition, a
social-functional approach was conducted to analyze variables, such as polite-
ness in academic conversations involving directives. Summative corpus results

showed that the ITAs made fewer statements marking independence and inclu-
sion appeals than instructors and professors, but they used directive construc-
tions more frequently. The distribution of ‘directive vocabulary constructions’
(e.g., I suggest that, I recommend that you) shows that ITAs preferred these
constructions far more than the professors (the ‘experts’) did: overall, ITAs
produced 73 such constructions in contrast to 10 from the experts. Reinhardt,
citing Blum-Kulka, House, and Kasper (1989), suggested that performatives like I
suggest or I recommend, as used by ITAs, had the effect of an indirect imperative
used to soften the force of the instruction. Directive vocabulary constructions
preferred by ITAs emanate from an institutional authority, but many ITAs have
distanced themselves from the power source of the directive by invoking poli-
cies or rules (e.g., it is required or the administration suggests).
I also specialize in the analysis of professional spoken discourse, especially in
the context of outsourced call centers in countries such as the Philippines and
India, serving callers from the US. I collected a Call Center Corpus provided
by a global, US-owned call center company, and I use this corpus in part to
develop language training materials for call center representatives or ‘agents.’
Communications in outsourced call centers have clearly defined roles, power
structures, and standards against which the satisfaction levels of customers
during and after the transactions are often evaluated. Callers typically demand
to be given the quality of service they expect or can ask to be transferred to an
agent who will provide them the service they prefer. Offshore agents’ ‘perfor-
mance’ in language and explicit manifestations of pragmatic skills naturally are
scrutinized closely when defining ‘quality’ during these outsourced call center
interactions. It is clear that intercultural communication in customer service
has become an everyday phenomenon in the US as callers come into direct
contact with agents who do not share some of their basic assumptions and per-
spectives. Before the advent of outsourcing, Americans had a different view of
customer service facilitated on the telephone. Calling help desks or the cus-
tomer service departments of many businesses mostly involved call-takers who
were able to provide a more localized service. Interactants typically shared the
same “space and time,” and awareness of current issues inside and outside of
the interactions (Friginal, 2009; Friginal & Hardy, 2014a). Based on a 2010
survey, customer satisfaction with calls perceived to be handled in the US was
more than one-fifth higher than it was with calls perceived to be handled out-
side the country. Furthermore, callers said that one of the biggest differences
between “foreign and American call centers” was the ease of understanding
the customer service agent (Brockman, 2010; Friginal, 2011). Clearly, these
additional contexts affect the way in which offshore agents attempt to connect
and interact with their customers across various types of tasks.
Training materials developed for call center agents may cover word lists and
multiword units or chunks in teaching the use of recurring patterns, common

types of questions, or the use of politeness markers. These features could also
be extracted from a corpus of caller utterances to prepare agents for the typ-
ical caller language across a variety of tasks (e.g., troubleshooting or product
purchase). Tables C4.1 and C4.2 show frequent extended chunks from agents
and callers in a US-based call center corpus (i.e., with American agents). These
patterns could be presented in a training workshop in India or the Philippines
for discussion (hopefully, with accompanying sound files). Sample discussion
questions may include the following:

Study and compare/contrast the list of common agent/caller chunks.
How could these phrases contribute to new learning of more specific
language patterns compared to vocabulary lists? What are pronunciation
topics to introduce with these phrases? (Consider suprasegmentals here:
intonation and combination/merging of phrasal sounds, especially.)
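Lists like those in Tables C4.1 and C4.2 are typically produced by counting recurring n-grams across call transcripts. The following is a minimal sketch assuming nothing more than whitespace-tokenized utterances; the sample calls are invented for illustration and are not drawn from the Call Center Corpus.

```python
from collections import Counter

def frequent_chunks(utterances, n=4, min_count=2):
    """Count recurring n-word chunks across a list of utterances."""
    counts = Counter()
    for utt in utterances:
        tokens = utt.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    # Keep only chunks that recur at least `min_count` times, most frequent first
    return [(chunk, c) for chunk, c in counts.most_common() if c >= min_count]

calls = [
    "thank you for calling customer support",
    "thank you for calling how may i help you today",
    "how may i help you today",
]
for chunk, count in frequent_chunks(calls):
    print(count, chunk)
```

On a real corpus, the frequency threshold and chunk length would be tuned, and the resulting list hand-filtered before it reaches a training workshop.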

Table C4.1  Common extended chunks from agents

No. Frequent extended chunks

1 Thank you for calling customer support or Thank you for calling (followed by specific
company name)
2 may I have your first and last name please
3 how may I help you today
4 can I have your phone number starting with the area code please
5 how can I help you today
6 what else I can help you with
7 thank you so much for calling
8 your first and last name please
9 is there anything else I can help you with?
10 can I put you on hold for a minute
11 thanks for calling (company name) my name is
12 Can I have your phone number?
13 it’s at the back of the modem
14 thank you for choosing (company) services
15 do you have a pen and paper?
16 let me just go ahead and (cancel, file this, process this order, change the code, etc.)
17 thank you very much for that
18 Thank you, is there anything else?
19 thank you so much for waiting
20 if I put you on hold, would that be OK?
21 Thank you for calling (company name) this is (agent name) speaking
22 I’ll be more than happy to assist you
23 I’m gonna go ahead and change this number
24 may I please have your DSL telephone number?
25 could you please hold for a minute or two?

Table C4.2  Common extended chunks from callers

No. Frequent extended chunks

1 I don’t know if I can do it myself


2 I don’t know if you can help me
3 what do you want me to do?
4 what do you want me to do next?
5 I don’t know what that means
6 I just wanted to make sure
7 I’m sorry I didn’t hear that
8 I haven’t been able to check
9 I’m sorry what was that?
10 I’m trying to get the order processed
11 that’s what I’m trying to do
12 you know, I don’t know
13 I don’t know if it’s broken
14 I don’t know what you’re talking about
15 go ahead and check that last part
16 I don’t have a ticket
17 I don’t know anything about it
18 I don’t know if they are open
19 I don’t know what’s going on
20 I don’t know what you mean
21 I don’t know why I can run it
22 I’m trying to key in
23 that’s what I’ve been trying to do
24 this is the first time I called
25 uh go ahead and check that

Additional training questions

• Compare and contrast these two sets of phrases or extended chunks with
vocabulary lists (previously discussed). What are noticeable patterns, sim-
ilarities, and differences?
• Spend time studying and memorizing these chunks from US-based agents
and practice their pronunciation across various possible contexts (e.g., type
of transaction, level of pressure in the call, customer’s/caller’s patience or
disposition during the call).
• Define these chunks. Are meanings clear? Are there examples of poten-
tially misleading or complicated bundles on the list?
• What possible COMPREHENSION topics might these phrases suggest
that should be considered? Note that these are expressions from American
callers, and these chunks represent typical oral language structures from
actual customers.

Notes for trainers: Develop lessons/activities directly addressing the comprehension
and understanding of these expressions/bundles. How are they used and

what do they actually mean? Sample: In the following four-word bundles, what
could be the callers’ primary message or question to the agent? Use these ex-
pressions in a sentence:

• I’m pretty sure


• you want me to
• hold on one second
• let me ask you
• you know I don’t

The four lessons in Section C4 focus primarily on the teaching of academic
spoken and written discourse. Campbell’s contribution (C4.1) explores the use of
doing good and doing well in both spoken and written language for a content-based,
US academic writing course she teaches at Duke Kunshan University in Kun-
shan, China. Her course was developed for Chinese students who plan to study
abroad at an English-medium university. Lesson C4.2 (Gass) uses Text Lex Com-
pare (Cobb, 2016) to analyze words in political discourse (or political speeches)
for an ESL public speaking course, civics courses, or composition studies and
political science courses. In an ESL context, Gass proposes that her lesson be
integrated into a unit on persuasive speaking and writing. Section C4.3 (Dye)
describes a mentoring program for visiting scholars and how corpus-based activ-
ities could be tailored specifically for their English training for written and oral
comprehension and production. The nine-week course focuses on topics such
as writing professional emails, leading discussions, and giving presentations as
well as on specific elements of English, like pronunciation issues, field-related
writing conventions, and cultural aspects of communication. Finally, Section
C4.4 (Berger) describes the application of the visualizer Text X-Ray (see Section A3)
in an academic writing course for multilingual writers. Berger introduced Text
X-Ray to her students in a university-level undergraduate composition course.
She observed that the software program elicited greater engagement in discus-
sions of the patterns of academic written discourse; her students were interacting
extensively with one another—giving instructions, asking questions, or vocaliz-
ing surprise at the results.

C4.1 Using COCA to Answer the Question on Everyone’s Lips


Maxi-Ann Campbell, Duke Kunshan University, China

Lesson Background
This lesson takes place in a content-based, US academic writing course
taught at Duke Kunshan University in Kunshan, China. This course is geared
toward EFL students who plan to study abroad at an English-medium uni-
versity in the future. By the end of this course, students will be able to write

response papers that make an original argument, present support through evi-
dence and analysis, and incorporate the work of others through summary and
citation practices. Aside from writing, students will also be able to contribute
to academic group discussions and give short paper presentations. The course
content is sociolinguistics, with a specific look at endangered languages and
language policy.
Students in this class often feel self-conscious about their English lan-
guage skills, and they frequently apologize for their ‘Chinglish.’ These
apologies are often followed by questions of how to make their speaking
and writing more native-like. Additionally, the students mention concerns
about using new words they find in the dictionary in their writing because
they are afraid of using them incorrectly. Given these concerns, I incorpo-
rate corpus tools into the class to offer the students a way to do their own
investigation into US English. I hope that in these investigations, they will
come to find that English is more flexible than they might think. I hope this
approach will build students’ awareness of how grammar rules can change
depending on the context and that variation in usage is normal. As such,
even their ‘Chinglish’ grammar is a variety and appropriate within a par-
ticular context.
Students bring their laptops to class on the days when we engage in corpus
lessons (if a student’s laptop has a problem, one can be borrowed from the
school’s IT department). This class is 1 hour and 15 minutes long, and the fol-
lowing lesson constitutes the first 45 minutes.

Introduction
The lesson associated with the following handout takes place in the third week
of classes. The lesson has a sociolinguistic base, even though it is not initially
presented as such to the students. By the end of class, students will

1 Analyze the usage of “doing good” and “doing well” in both spoken
and written language using the COCA.
2 Develop awareness of differences in spoken and written registers.
3 Reflect on language usage (i.e., moving away from classifying language as
“correct” or “incorrect”).
4 Better understand my language teaching philosophy (i.e., moving away
from purely prescriptive rules and analyzing language in terms of its actual
usage).

Lesson Outline and Worksheet


Using COCA to Answer the Question on Everyone’s Lips
How are you doing?

Introduction
Speaker A: How are you doing?
Speaker B: I’m doing      . (Fill in the blank with the word(s) most
people would say.)
Compare with the students near you. Do you agree? Why or why not?

The Grammar
What part-of-speech is the word “good” usually?       
What part-of-speech is the word “well” usually?       
•         modify nouns.
•         modify verbs.
So, we could expect many people to say: I’m doing       .
Spoken Language Analysis (Using COCA)
1 Go to the website http://corpus.byu.edu/coca/
2 On the left-hand side of the page, in the blue box, click the word Compare.
3 Type doing good in the “Word1” box and doing well in the “Word2”
box.
4 Click on the word Sections (not the check box), and choose Spoken for
both “1” and “2.”
5 Click Compare words above “Sections.”
6 In line “1” for doing good, which should have the word good, click the num-
ber that appears under “W1.”
7 Look for examples where people are asked “how are you?” “are you okay?”
or similar questions, then fill in the following chart based upon the exam-
ple given. (Hint: You can click “Ctrl + F” or “Command + F” to search
the page for these questions.)

News/talk show name Question Response

NPR (Line 9) How you doing? I’m doing good.

Discuss with a Classmate


1 Based upon the examples in the chart, what part of the speech is the word
“good”?
2 Look at other examples of “doing good” where the speaker has not been asked
a question. What part of speech is the word “good” in these other contexts?
3 Based upon your discussion so far, how many parts of speech can the word
“good” serve?

Analysis of “Doing Well”


1 Now click Frequency on the tool bar at the top of the page.
2 Click the number that appears under “W2” in line 1 for doing well.
3 Repeat step 7 under “Spoken Language Analysis” for doing well. (Hint: You
can look through more pages of data by clicking the “>” next to “Page,”
below the main tool bar.)

News/talk show name Question Response

Discuss with a Classmate


1 Skim the other lines given as context. Are you able to find any instances of
“doing well” where “well” is not an adverb?
2 Based upon your analysis, which of the two words, “good” or “well,” would
you most likely use in answer to the question, how are you doing? Why?
3 Is it easy to decide which response is more correct/accurate?

Academic Language Analysis (Using COCA)


1 Click Search on the tool bar. Change the “Sections” criteria to Aca-
demic for both “1” and “2.”
2 Click Compare words.
3 In line “1” for doing good, click the number that appears under “W1.”
4 In the table on the next page, list four different academic sources in which
“doing good” is found. Then provide the part-of-speech (POS) for “good”
from “doing good.”
5 Give the example and definition of good in that context.
6 Follow the example given.

Source of information Part-of-speech Example and definition

Nursing Ethics (Line 13) Noun “The notion of doing good, being
good, and acting on the good, which
resembles virtue ethics” / Good
refers to charitable deeds.

7 Repeat the same steps for doing well. (Hint: Remember to look at column
“W2.”)

Source of information Part-of-speech Example and definition

General Discussion
1 What part(s) of speech does “good” serve in spoken language? Do you find
the same pattern in academic language?
2 What part(s) of speech does “well” serve in spoken language? Do you find
the same pattern in academic language?
3 Can you identify a rule for when to use “doing good” and “doing well”?
Is one of them simply incorrect?
4 Can you think of another way to characterize language use other than
correct vs. incorrect?

Teacher Perspectives on CL in the Classroom

Maxi-Ann Campbell currently teaches academic writing at Duke Kunshan University's Language and Culture Center, Kunshan, Jiangsu, China.
Her research has focused on improving native-non-native speaker interac-
tion, and best practices for teaching EFL. She is coauthor of the third edition
of TESOL Press’s bestselling book, More than a Native Speaker.

How did you learn about CL as an approach to language teaching? Were there any surprising or unexpected observations you had in developing CL-based lessons and activities?
My first in-depth look at corpus linguistics as a method to language teach-
ing was in a corpus linguistics course during my master’s program. I found
corpus linguistics to be particularly freeing because it gave me the data
I needed to support the instinctual feelings I had that some prescriptive
grammar rules were not actually representative of how proficient, educated
English speakers used the language. I can remember quite clearly the time
when a friend from Romania asked me “How are you doing?” and I re-
sponded automatically "I'm doing good." He then told me it would be
correct to say "I'm doing well." He then further generalized my "mistake"
to the American student population, whom he often heard making lazy
grammar mistakes like the one I just made. I remember at the time feeling
indignant at his lecture, but I could not think of any response beyond a
petty, “It’s my native language; I can speak it how I want to.” So, I said
nothing at all.
It was not until years later in this course creating my first corpus-based
lesson plan that I had the opportunity to actually investigate how people
respond to the question “How are you doing?” I was surprised to find that
many people, from talk show hosts to characters in books, responded "I'm
doing good.” Even as I type this, Microsoft Word is telling me that I should
change “good” to “well”; however, I now have the power of corpus data to
feel comfortable ignoring some of the grammar rules that are prescribed to
me. In designing corpus-based lessons, I wanted my students to also have
that power. I wanted them to realize that the English language was not one,
rigid set of rules. There are many Englishes, and each has its own validity.
My students’ Englishes are also valid, and they can contribute new ways
of expressing ideas as they learn English and develop their interlanguage.
Breaking a prescribed grammar rule is not an immediate sign of laziness.
In fact, it can show a student's innovation, the way that our brains learn
languages, or how the language is evolving.

What are the strengths of the CL approach in teaching ESP or in showing variations in language use for non-native-speaking students of English?
I think the field of foreign language teaching, especially English language
teaching, is beginning to recognize the importance of not only teaching
students but also empowering them. Recent research has suggested that
non-native English teaching assistants, for example, will give students
higher grades because they will automatically blame themselves and their
language skills when students do not perform well. It is easy as a non-native
speaker of a language to assume that any breakdowns in communication
are one's fault. However, this is not always the case, and communication is
a two-way street. As such, our students should not always have a deficit
ideology, a belief that their language is somehow always inferior to a native
speaker's. Instead, I hope to build a lingua franca ideology in my students.
I want them to realize that there is no one standard, correct English, and
I hope that corpus linguistics can introduce students to the variation in
language use. Corpus linguistics provides students with data of how peo-
ple actually use the language, and this gives students the opportunity to
notice that language is both creative and messy. I want my students to view
variation in language as the norm, which it is in reality. I want them to have
confidence that their variety of English is also acceptable, and it is up to
both interlocutors, whether native speaker or not, to negotiate for meaning
when communicating.

What are your recommendations for language teachers (in China or similar countries) who are starting to explore this approach?
I would recommend that they first try different corpus tools to answer some
of the questions that they have about language or the questions students
frequently ask them about English. For example, one pattern I feel I have
noticed recently in spoken English is the use of adjectives instead of adverbs
to modify verbs. For example, I have heard people say “She did perfect”
instead of “She did perfectly.” I wonder if the use of adjective forms instead
of the adverb is particular to the “do” verb, or if this is seen with other verbs
of a similar type, or whether it is found among many different types of verbs.
Perhaps this could also be a regional feature. These kinds of questions are
great for language teachers to start doing their own research into language
use. After doing their investigation into the different corpus tools available
and the language-related questions they have, then they can start to look
at how students can use this tool to answer the questions they have. If stu-
dents are not given a lot of guidance on how to use corpus tools effectively,
then it can either lead them astray or cause great frustration. So, I would
encourage any teacher considering using corpus tools in the classroom to
do some investigation on their own first.

C4.2 Using Text Lex Compare to Examine the Language of Political Speeches
Tia Gass, Georgia State University, Atlanta, GA, USA

Lesson Background
This lesson on using Text Lex Compare (Cobb, 2016) to examine political dis-
course could be used in a number of different courses, but it was designed
primarily with an ESL public speaking course in mind. (It could also be used
in an ESL civics course or in composition studies and political science courses.)
In an ESL context, this lesson would be integrated into a unit on persuasive
writing and speaking. It is a noticing activity that uses corpora from American
presidential speeches from the American Presidency Project (www.presidency.ucsb.edu/). By the end of the lesson, students will be able to pick out and discuss some of the persuasive features of formal (political) speeches used in the
particular text they select. This lesson will also help students reflect upon their
own language choices in public speaking.
For the Teachers: In following this lesson, teachers may be able to improve
their ability to help students reflect upon their language choices and speak in a
more persuasive manner. Students will most likely encounter politically charged
vocabulary that they are unfamiliar with. However, most presidential speeches
are fairly simple and straightforward. The main anticipated ‘setback’ with this
lesson is that some of the vocabulary may be a bit difficult, even for high-level
ESL learners. Also, in using the tool, there will be a large amount of data to work
with, and students might not be able to sort through lists and prioritize specific
items easily. Teachers will have to provide additional support and training in
using the tool, be ready with vocabulary help/tips, and remind students of the
different POS in English (hopefully covered in previous lessons). In this part of
the course, articles, prepositions, and conjunctions will not be covered.

Related Comparisons
(This part may be provided to students in the form of a handout or as the topic of a lecture
before the activity)

• Friginal and Hardy (2014a) and Reiter (2011) explored the language of US
presidential inaugural addresses, which comprise a distinct type of very formal
political discourse. The address is delivered orally as a speech in public, in a
very formal ceremony, and yet, this speech is written and extensively pre-
pared, at least for most presidents. In over 220 years of American history, there
have been only 58 such speeches, starting with George Washington in 1789
through Donald Trump in 2017. They appear at regularly scheduled intervals.
Up until 1933, when Franklin D. Roosevelt took office, presidents had been
inaugurated every four years on March 4, about four months after the Novem-
ber elections. The date changed with the adoption of the 20th Amendment to
the US Constitution. Since Roosevelt’s first term, inaugurals have been held
every four years on January 20th (Reiter, 2011). These inaugural addresses
serve a function that is ceremonial as well as deliberative, and the rhetoric in
these speeches reflects that.
Ø There is much related research on the form, content, rhetorical style, and
historical applications of presidential inaugurals. Recently, radio and television coverage of these speeches has become a major event, with both
the content and delivery scrutinized and analyzed immediately after de-
livery by political commentators. Media and internet broadcast of Barack
Obama’s inaugural address in 2009, a historic address delivered by the
CL and Teaching Spoken/Written Discourse  307

first African-American president of the US, was met with great global
enthusiasm. The ceremonial nature of this domain covers a lot of contexts
that have also shifted over time. George Washington spoke in New York
to a small group composed mostly of supporters and elected senators and
congressmen, while George W. Bush, in 2005, his second term, spoke in
Washington, D.C., to an international audience who tuned in to listen to
US post-9/11 thoughts and sentiments and immediate future directions
related to the on-going turmoil in the Middle East at that time.
• The following two excerpts compare George Washington’s first inaugural
address in 1789 with Bill Clinton’s second inaugural address, delivered on
January 20, 1997—a difference of over 200 years. Personal pronouns are
highlighted in these two excerpts, and references to the country are un-
derlined (e.g., note Clinton’s explicit mention of America). An additional
content analysis of Washington vs. Clinton’s speeches in these two excerpts
can certainly provide you with ideas to pursue further in conducting a
diachronic corpus-based analysis of all US presidential inaugural addresses
(adapted from Friginal & Hardy, 2014a).

George Washington
New York, Thursday, April 30, 1789
(Excerpt from 1,428 words total)

Fellow Citizens of the Senate and of the House of Representatives:

Among the vicissitudes incident to life no event could have filled me with
greater anxieties than that of which the notification was transmitted by your
order, and received on the 14th day of the present month. On the one hand,
I was summoned by my Country, whose voice I can never hear but
with veneration and love, from a retreat which I had chosen with the fond-
est predilection, and, in my flattering hopes, with an immutable decision,
as the asylum of my declining years--a retreat which was rendered every
day more necessary as well as more dear to me by the addition of habit
to inclination, and of frequent interruptions in my health to the gradual
waste committed on it by time. On the other hand, the magnitude and dif-
ficulty of the trust to which the voice of my country called me, being
sufficient to awaken in the wisest and most experienced of her citizens a
distrustful scrutiny into his qualifications, could not but overwhelm with
despondence one who (inheriting inferior endowments from nature and
unpracticed in the duties of civil administration) ought to be peculiarly
conscious of his own deficiencies. In this conflict of emotions all I dare
aver is that it has been my faithful study to collect my duty from a just
appreciation of every circumstance by which it might be affected. All I dare
hope is that if, in executing this task, I have been too much swayed by a
grateful remembrance of former instances, or by an affectionate sensibility
to this transcendent proof of the confidence of my fellow-citizens, and
have thence too little consulted my incapacity as well as disinclination for
the weighty and untried cares before me, my error will be palliated by
the motives which mislead me, and its consequences be judged by my
country with some share of the partiality in which they originated.

William J. Clinton
Washington, DC, January 20, 1997
(Excerpt from 2,157 words total)

My Fellow Citizens:

At this last presidential inauguration of the 20th century, let us lift our
eyes toward the challenges that await us in the next century. It is our
great good fortune that time and chance have put us not only at the
edge of a new century, in a new millennium, but on the edge of a bright
new prospect in human affairs, a moment that will define our course,
and our character, for decades to come. We must keep our old de-
mocracy forever young. Guided by the ancient vision of a promised
land, let us set our sights upon a land of new promise.
The promise of America was born in the 18th century out of the bold con-
viction that we are all created equal. It was extended and preserved in the
19th century, when our nation spread across the continent, saved the union,
and abolished the awful scourge of slavery. Then, in turmoil and triumph, that
promise exploded onto the world stage to make this the American Century.
And what a century it has been. America became the world’s mightiest
industrial power; saved the world from tyranny in two world wars and a
long cold war; and time and again, reached out across the globe to millions
who, like us, longed for the blessings of liberty.
Along the way, Americans produced a great middle class and security
in old age; built unrivaled centers of learning and opened public schools to
all; split the atom and explored the heavens; invented the computer and the
microchip; and deepened the wellspring of justice by making a revolution
in civil rights for African Americans and all minorities, and extending the
circle of citizenship, opportunity and dignity to women.
Now, for the third time, a new century is upon us, and another
time to choose. We began the 19th century with a choice, to spread our
nation from coast to coast. We began the 20th century with a choice,
to harness the Industrial Revolution to our values of free enterprise,
conservation, and human decency. Those choices made all the difference.

• Visual representation: The following is a word cloud of Barack Obama’s 2013 second inaugural speech with sample KWIC lines of the most
frequent word (must).
Figure C4.1  Word cloud of President Obama’s 2013 address.

Sample KWIC lines for must from President Obama’s 2013 address:

• Together, we resolved that a great nation must care for the vulnerable,
and protect its people from life’s worst hazards and misfortune.
• Now, more than ever, we must do these things together, as one nation,
and one people.
• We must harness new ideas and technology to remake our government,
revamp our tax code, reform our schools, and empower our citizens
with the skills they need to work harder, learn more, and reach higher.
• We must make the hard choices to reduce the cost of health care and the
size of our deficit. But we reject the belief that America must choose
between caring for the generation that built this country and investing
in the generation that will build its future.
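KWIC displays like the ones above are produced by concordancers, but the underlying idea is simple to sketch. The following Python snippet is only an illustration (the function name and sample sentence are invented for this example, not taken from any tool discussed in this chapter): it finds each occurrence of a keyword and prints it centered, with a fixed window of context on each side.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context (KWIC) lines: each occurrence of
    `keyword` is bracketed, with `width` characters of context per side."""
    pattern = re.compile(r'\b' + re.escape(keyword) + r'\b', re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()].rjust(width)
        right = text[m.end():m.end() + width].ljust(width)
        lines.append(f"{left} [{m.group(0)}] {right}")
    return lines

speech = ("Now, more than ever, we must do these things together, as one "
          "nation, and one people. We must harness new ideas and technology.")
for line in kwic(speech, "must"):
    print(line)
```

Aligning the keyword in a fixed column (the `rjust`/`ljust` calls) is what makes the repeated patterns around a word, such as *must* + verb, easy to scan by eye.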

What Is Text Lex Compare?


Text Lex Compare is one of the many tools in Compleat Lexical Tutor (www.lextutor.ca/), also developed by Tom Cobb (2016) and his team of linguists
from Université du Québec à Montréal. It is, in part, a corpus-based tool that
allows students to compare two different texts and note their similarities and
differences. This process is accomplished through a “recycling index” system,
which looks at the number of linguistic tokens that are being used in both
texts; it generates a percentage of similar tokens in both texts. It also lists the
most common tokens in each text. There is a “novel” index for tokens that are
unique to one text or the other. Text Lex Compare can be utilized to help stu-
dents note linguistic features that are common in one or several registers. It can
help familiarize students with the vocabulary features of texts and also inform
them about which vocabulary words they should not use in certain contexts.
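The token-level "recycling index" described above can be approximated in a few lines of code. The sketch below is a rough illustration only (the function name is invented for this example; Text Lex Compare itself also computes a family-level index and uses more careful tokenization than whitespace splitting):

```python
def recycling_index(base_text, new_text):
    """Percentage of running tokens in new_text that already occur in
    base_text; a simplified analogue of Text Lex Compare's token
    recycling index (word families are not handled here)."""
    base_types = set(base_text.lower().split())
    new_tokens = new_text.lower().split()
    recycled = sum(1 for tok in new_tokens if tok in base_types)
    return 100.0 * recycled / len(new_tokens)

print(round(recycling_index("we must act now", "now we must choose"), 2))  # → 75.0
```

Note that the denominator is the token count of the *new* text, which is why the index answers the question "how much of this text is already familiar from the other one?"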
310  Corpus-Based Activities in the Classroom

Procedure
By the end of this lesson, students will be familiar with the language and rhe-
torical features of formal, political speeches. They will be able to understand
the difference between casual spoken English and the register that is used in
speeches. Students will be able to pick out and discuss some characteristics of
persuasive language that are being used in the speeches they select. This lesson
will also help students improve their spoken English in formal registers.

Warm-up
• Do presidents use the same language when you see them on TV as every-
one else? Why or why not? What do you notice about the way they talk?
• What do you know about Barack Obama, Donald Trump, or Bill ­Clinton?
What can you say about the way they communicate ideas on TV or ­respond
to interviews?

Demonstration
• Teacher shows how to use Text Lex Compare with President John F. Kenne-
dy’s Inaugural speech compared with Barack Obama’s (Table C4.3).
• Teacher: These two speeches are of the same register, but Kennedy’s speech
was given in 1961 and Obama’s speech was given in 2009.
• Introduce the results (see Token Recycling Index Data in the following),
noting that the 3rd most common shared token was “we,” and the 16th
most common shared token was “they.” How do pronouns shape the tone
of each speech?
• The frequent use of auxiliary verbs in both speeches is also quite interest-
ing. The 8th most popular shared token is “be.” The 15th most popular
token in both speeches is “can.” “Will” is the 27th most popular shared
token. What rhetorical purpose do these auxiliary verbs serve?

Table C4.3  Comparison of unique and shared word tokens/families from Text Lex Compare

Unique to first (626 tokens, 424 families; freq. first, then alpha) | Shared (1028 tokens, 229 families) | Unique to second (337 tokens, 230 families) | Same list, alpha first

001. applause 25 001. the 86 001. both 10 001. abolish 1
002. equal 7 002. of 66 002. side 8 002. absolute 2
003. no 7 003. we 63 003. which 7 003. accident 1
004. act 6 004. to 42 004. ask 6 004. administration 1
005. journey 6 005. and 41 005. final 5 005. adversary 1
006. through 6 006. this 40 006. shall 5 006. again 1
007. until 6 007. a 31 007. first 4 007. age 1
008. care 5 008. be 28 008. hand 4 008. aggression 1
009. complete 5 009. in 24 009. help 4 009. ago 1
010. creed 5 010. not 19 010. weak 4 010. alarm 1
011. she 5 011. for 16 011. burden 3 011. alike 1
012. – 5 012. let 16 012. control 3 012. ally 1
013. child 4 013. it 15 013. dare 3 013. almighty 1
014. future 4 014. but 13 014. destroy 3 014. alter 1
015. meaning 4 015. can 13 015. form 3 015. americans: 1
016. principle 4 016. they 10 016. go 3 016. area 1
017. alone 3 017. you 10 017. good 3 017. around 2
018. build 3 018. all 9 018. join 3 018. art 1
019. carry 3 019. any 9 019. revolution 3 019. ask 6
020. debate 3 020. do 9 020. seek 3 020. assemble 1
021. declare 3 021. free 9 021. struggle 3 021. assist 1
022. economy 3 022. new 9 022. absolute 2 022. assure 2
023. endure 3 023. power 9 023. around 2 023. asunder 1
024. evident 3 024. i 8 024. assure 2 024. atom 1
025. guide 3 025. nation 8 025. back 2 025. back 2
026. happy 3 026. who 8 026. balance 2 026. balance 2
027. how 3 027. will 8 027. become 2 027. beachhead 1
028. job 3 028. by 7 028. beyond 2 028. become 2
029. live 3 029. from 7 029. cooperate 2 029. belabour 1
030. most 3 030. have 7 030. day 2 030. beyond 2
031. number 3 031. or 7 031. deed 2 031. bitter 1
032. person 3 032. pledge 7 032. disease 2 032. bond 1
033. real 3 033. world 7 033. divide 2 033. border 1
034. resolve 3 034. as 6 034. doubt 2 034. both 10
035. rule 3 035. man 6 035. endeavour 2 035. break 1
036. understand 3 036. what 6 036. explore 2 036. burden 3
037. value 3 037. america 5 037. far 2 037. chain 1
038. while 3 038. peace 5 038. finish 2 038. chief 1
039. about 2 039. with 5 039. foe 2 039. citizens: 1
040. advance 2 040. always 4 040. forth 2 040. civil 1
041. also 2 041. arm 4 041. instead 2 041. clergy 1
042. america’ 2 042. bear 4 042. instrument 2 042. colony 1
043. bind 2 043. begin 4 043. loyal 2 043. come 1
044. blood 2 044. call 4 044. mister 2 044. comfort 1
045. capacity 2 045. citizen 4 045. negotiate 2 045. communist 1
046. compel 2 046. country 4 046. offer 2 046. conquer 1
047. constant 2 047. fellow 4 047. oppose 2 047. control 3
048. could 2 048. generate 4 048. pass 2 048. convert 1
049. courage 2 049. hope 4 049. place 2 049. cooperate 2
050. crisis 2 050. human 4 050. problem 2 050. culture 1

TOKEN Recycling Index: (1028 repeated tokens : 1365 tokens in new text) = 75.31%. FAMILIES Recycling Index: (229 repeated families : 459 families in new text) = 49.89%. (Token recycling will normally be the most interesting measure of, for example, text comprehensibility, as it is with VPs.)

Exploration Activity
• The students will select two American presidential speeches to analyze
from a list of speeches from the American Presidency Project. (This can be
done as a paired activity.) Students will have to justify their comparisons,
identify vocabulary features to compare, and discuss and present their re-
sults in class.
• Some guide questions for students: What patterns did you find inter-
esting in your speeches? Were there any differences with regards to the
speaker’s political affiliation, era in which each speech was delivered,
others? Has this activity taught you about “word choices” in public
speaking?

Recommendations for Teachers


Tech Familiarity: This particular activity can be done by copying and pasting
text into the tool or uploading documents. Note that the “upload documents”
feature only takes text (.txt) documents and does not accept Microsoft Word
(.doc) files.
Materials: Students will need a computer with reliable internet access.
­Ideally, this lesson would be conducted in a computer lab. The current ver-
sion of this tool works with mobile phones, but it can be difficult to see the
data on an extremely small screen. It also is difficult to copy and paste large
amounts of text on a mobile phone. Students will have to know what “token”
and “family” mean. Explain to them that the word “novel” means “new”
and “unique” means “one of a kind” in this setting. Show students how to
interpret the results. Explain to them that the tokens are being ranked in
terms of popularity.
Vocabulary: ESL students are bound to encounter some political vocabulary
that they are unfamiliar with. This lesson provides them the opportunity to
practice dictionary skills and to guess/infer meaning from context.
Organizing the Results: Text Lex Compare produces a large amount of data,
and so it is crucial for teachers to provide very clear and specific prompts for
certain features to explore and compare. Students may be given a “theme” and
then search for words that fall under that theme. For example, students may
look for “religious language” (references to the Almighty, divine providence, etc.)
in the speeches they select and discover how religious language has changed in
America over time. They can also explore language related to war or foreign
policy and how these may have evolved.

Additional Resources:
• The Avalon Project (http://avalon.law.yale.edu/subject_menus/inaug.asp)
• Inaugural Words: 1789 to the Present, from the New York Times (http://www.nytimes.com/interactive/2009/01/17/washington/20090117_ADDRESSES.html)
• American Rhetoric (http://www.americanrhetoric.com/)
• The White House - Official Website (https://www.whitehouse.gov/briefing-room/speeches-and-remarks)
• The Miller Center, A Project of the University of Virginia (http://millercenter.org/president/speeches)

Teacher Perspectives on CL in the Classroom

Tia Gass is a recent graduate of the Applied Linguistics MA program at Georgia State University, where she tutored for the Intensive English
Program and helped undergraduate students explore potential study
abroad and international exchanges options. She also was an EFL instruc-
tor for the Mecidiyeköy and Taksim branches of English Time Language
School in Istanbul, Turkey. Her research interests include EAP curriculum
design and the use of corpus tools in intermediate- and advanced-level
ESL courses.

Why have you decided to develop this lesson?


Many university courses require students to do presentations in class. These
public speaking activities are often a source of anxiety for many native and
non-native English speakers. Students are often given these assignments
without any formal instruction on the language that they are supposed
to use in their presentations. For non-native speakers, these projects can
be especially daunting because they may not be familiar with the lexicon
of formal English. Furthermore, they may not be aware of the structural
conventions of English language speeches. This activity aims to help stu-
dents become more accustomed to the grammatical and lexical features
of formal English. An added bonus of this lesson is that it can help educate
students about American history. When I was an undergraduate student
in an “Intro to Foreign Policy” course, our textbook featured numerous
primary source documents. I felt that these primary source documents
helped me better understand the motivations of various political leaders.
This “primary source” method allowed me to ascertain the motivations
of important political figures for myself rather than rely on a secondary
source’s interpretation of the documents. It allowed me to challenge the
interpretations of other scholars with solid evidence and participate in the
larger discourse.


How can Text Lex Compare best serve teachers and students?
What are ideal topics, activities, and settings for this
feature?
Text Lex Compare is most effective when it is being used to compare a small
number of texts in great detail. It is more effective at analyzing lexical fea-
tures of texts due to the fact that it counts “tokens” rather than chunks of
words. Unfortunately, it is not as effective as AntConc at analyzing colloca-
tions. When using Text Lex Compare, make sure that you select a small num-
ber of texts that have a lot in common. Text length also plays a huge role in
the effectiveness of this tool. Make sure that the two texts are around the
same size; otherwise, the results will be skewed. The nice thing about Text
Lex Compare is that it automatically limits the number of texts that students
can look at. Therefore, they can focus on a smaller number of texts without
becoming overwhelmed by the flood of information that the larger corpus
tools sometimes give students. This is useful if you are looking at important
historical documents that students will need to be extremely familiar with.
For example, certain political speeches are discussed in detail on the US
citizenship exam and in many one-hundred level political science courses.

What are your recommendations for teachers who are considering developing activities from this site in their classroom?
This tool is extremely useful for noticing activities and content-based analy-
sis. Thus, it is crucial that instructors select appropriate materials for Text Lex
activities. These materials should be representative of the genre that is be-
ing studied in the course. Corpus activities are more effective if they can be
tied in to larger course themes. Corpus activities sometimes give students
a lot of information to work with, so instructors must “narrow” their focus.
I think that it is almost impossible to go too “narrow” with these sorts of
activities. Corpus tools such as AntConc work best for “macro” details while
tools like Text Lex Compare work best for “micro” details. Pick speeches that
are important enough to warrant this sort of micro analysis, otherwise the
activity will not be as effective. For example, they could search for lan-
guage related to war, religious language or other rhetorical devices. These
“theme” vocabulary searches can be extremely interesting when they are
tied into a larger concept that is being taught in the class. For instance, if
the class has a section on “Religion in America” the students could search
for religious language in 2–4 speeches and see how it has evolved over a
certain time period.

C4.3 An Eight-Week Corpus-Based Writing Course for Academic Professionals
Peter Dye, International Study Center, Oglethorpe University, Atlanta, GA, USA

Lesson Background
My (former) university’s Office of International Initiatives has hosted visiting
scholars through its Faculty Mentoring Program, in an effort to promote pro-
fessional development and strengthen the university’s global presence and rela-
tionships. Each visiting scholar is paired with a faculty mentor from the same
field to work with during his/her stay on campus. Within the program, there
is also an Intensive English Instruction component that lasts eight weeks, for
four hours per week. It involves English training for written and oral compre-
hension and production. This part of the program focuses on topics like writ-
ing professional emails, leading discussions, and giving presentations as well as
on specific elements of English, like pronunciation issues, field-related writing
conventions, and cultural aspects of communication.
The participants for this program were nine Chinese professors from a diverse
group of academic fields including art, biology, computer science, English, early
childhood education, history, and physics. For this group, we decided to devote
more time to academic writing skills as well as an analysis of more in-depth,
subject-specific vocabulary. The majority of the scholars are competent in com-
prehending academic reading (especially related to their field), and several of them
have published research articles in English before. Despite this, academic writing
and its specific conventions are an aspect that most members of the group ex-
pressed interest in improving. Corpus-based instruction appeared to be an appli-
cable and meaningful lesson focus for the eight-week course, and it matched the
academic level and specific goals of the participants. The following is the outline of
my course, together with some examples of activities and suggestions for teachers.

Outline of Corpus-Based Instruction


A. Writing Survey: A writing survey was distributed to the group in the be-
ginning of the course that asked the participants three questions:
1 What types of writing in English have you done in the past (university
papers, research articles, etc.)?
2 What are your English writing goals for the future? Do you hope to
publish papers in English?
3 What parts of English writing are most difficult for you?
The responses from participants to the writing survey helped guide the focus
of the corpus-based instruction and confirmed their desire for academic
writing improvement (Table C4.4 provides a short excerpt of participant
responses).
316  Corpus-Based Activities in the Classroom
B. Register Awareness Activities: The participants were provided various ac-
tivities emphasizing the different forms and structures of written texts such
as (1) text messages, (2) informal and formal emails, (3) online news and
sports articles, and (4) academic research articles. Davies’s (2008–) Corpus
of Contemporary American English (COCA) database was introduced to
the participants to highlight discussions of register similarities and differ-
ences, focusing on distributions and also concordance patterns, especially
between spoken and written texts provided by COCA.
C. Introduction to the Corpus Approach: This part of the course focused
specifically on introducing the corpus approach to the participants. Some
topics included relevant definitions and descriptions of processes required
in obtaining data and patterns from corpora. The following are sample
concepts and discussion topics for the group:

Table C4.4  Sample writing survey responses from participants

Participant 1
  Question 1 (What types of English writing have you done in the past?): My
  graduate thesis is written in English. I have written some abstracts in
  English. Euphemism. Email. Text.
  Question 2 (What are your English writing goals for the future? Do you hope
  to publish papers in English?): 1. Write English papers professionally and
  formally. 2. I hope to publish papers in English.
  Question 3 (What parts of English writing are most difficult for you?): I am
  an editor of my university press. My job is to check whether the English
  abstract is right or wrong. But sometimes it’s very difficult for me to find
  a suitable expression to make sure whether it is appropriate in English.

Participant 2
  Question 1: University papers. Research articles.
  Question 2: I have not published papers in English, but I want to write and
  publish papers and books in the future. Translate English works for Chinese
  scholar.
  Question 3: Verb tense. Grammar. Different sentence

Participant 3
  Question 1: Abstract of my article.
  Question 2: Send English emails.
  Question 3: Vocabulary and to express one meaning use different words.
CL and Teaching Spoken/Written Discourse  317

Ø A corpus is simply a large collection of authentic texts that is gathered and


can be analyzed electronically. What is it used for? There are several key
reasons a corpus can be a practical tool in the field of linguistics. First, it
allows writers to check their intuitions about language. Instead of making
decisions based on what ‘sounds right,’ a corpus can provide evidence of
language use based on a large pool of data. Second, a corpus can provide
authentic examples to draw from for language teaching or learning. Teachers
often create examples to illustrate a linguistic structure, but these may not
represent authentic language from the real world. Students have a lot to
gain from examining texts from their subject area as opposed to fabricated
examples made to seem real. Finally, a corpus allows researchers to con-
duct their own investigations of linguistic patterns. Language changes and
trends over time can be examined in a way that was never possible a few de-
cades ago. All of these methods can be employed to expose learners to the
language they need instead of the language teachers may think they need.
Ø The four main characteristics of a corpus are that it is authentic, large,
and electronic, and that it is compiled according to specific criteria. For
example, an electronic collection
of the text from 100 research articles in the field of physics is a corpus. A
corpus could also contain the novels of John Steinbeck or, more generally,
a collection of 20th-century American novels. There are corpora that con-
tain academic English, spoken English, newspaper articles, or legal cases.
Ø There is no specific rule regarding how large a corpus must be, but it
should be large enough to support analysis of linguistic patterns. Once
collected, a corpus can be analyzed using a concordancing program. A
concordancer allows the corpus to be searched based on word frequen-
cies, collocations, and any number of usage-related topics. There are
both internet- and PC-based concordancing programs freely available
to teachers and students. Despite their relatively recent debut in the
world of teaching, corpora and concordancing have been put to practi-
cal classroom use in a number of studies.
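For teachers curious about what a concordancer does under the hood, the word-frequency and keyword-in-context searches described above can be sketched in a few lines of Python. This is only an illustration using the standard library — the function names and the sample sentence are invented here, and a dedicated tool like AntConc remains the practical choice for real corpora.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count word frequencies, roughly as a concordancer's word list does."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def concordance(text, keyword, width=30):
    """Return keyword-in-context snippets for every hit of keyword."""
    hits = []
    for m in re.finditer(r"\b%s\b" % re.escape(keyword), text, re.IGNORECASE):
        snippet = text[max(0, m.start() - width):m.end() + width]
        hits.append(snippet.replace("\n", " "))
    return hits

sample = "The corpus was analyzed. A corpus can reveal patterns of use."
print(word_frequencies(sample).most_common(3))
for line in concordance(sample, "corpus"):
    print(line)
```

Running this on a small text prints the most frequent words and each occurrence of the search word with its surrounding context, which is essentially what the Word List and Concordance tabs of a concordancing program display.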
Ø Sample Discussion Questions: Discuss with a partner: What are some
aspects of writing that you think are unique to your discipline (words,
phrases, organization, grammar, etc.)? What are some questions that
you have about academic writing in English (when to use certain words,
punctuation, etc.)? Do you notice any problems with the following ex-
amples for an academic paper?

One of the key problem…


Have you ever think…
This is a hot topic in the field…

D. AntConc and Teacher/Learner-Compiled Corpora: Participants were in-


troduced to AntConc and how a concordancer can help analyze specific
corpora. Each class member was told to download AntConc and a sample
Spoken English Corpus (that I provided) and save them each onto a flash
drive for the next class. I demonstrated how to do this.
E. The following class took place in the computer lab again, and the group was
shown how to use the basic features of the AntConc program. They opened
the corpus file, generated the word list ranked by frequency, and were
shown how to search for specific words or phrases. After they had been
given sufficient time to familiarize themselves with the basic functions,
the class used a handout to guide them through more searches. Working
with a partner, they recorded the top ranked words of the corpus and then
completed searches to determine which prepositions and verbs were used
in differing contexts (e.g., move in vs. move on). They were shown how to
analyze the examples to help identify the differences in meaning. The final
activity allowed them to conduct their own searches to identify patterns
they were curious about, check their intuitions, and further familiarize
themselves with using the program.
  The next assignment for the class was to start collecting research articles
from their discipline and send them to me by email. The ultimate goal was
for each student to have his or her own discipline-specific corpus written
by native speakers (NSs) of English, which the students could keep as a ref-
erence tool well after the English instruction course had ended. They were
told to collect a minimum of 8 articles, but most of the students eventually
sent closer to 15. I converted the articles, most in PDF format, into text
files and uploaded them into separate folders organized by discipline that
the students could access online through a shared folder.
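The file-management side of this step can also be scripted. The sketch below assumes the converted .txt files are named with a discipline prefix (e.g., biology_smith2014.txt); that naming convention, and the function names, are illustrative rather than part of the lesson.

```python
from pathlib import Path
import shutil

def discipline_of(filename):
    """Read the discipline from a filename such as 'biology_smith2014.txt'
    (this naming convention is illustrative, not part of the lesson)."""
    return Path(filename).stem.split("_")[0]

def organize_by_discipline(source_dir, target_dir):
    """Copy converted .txt articles into one folder per discipline."""
    source, target = Path(source_dir), Path(target_dir)
    for txt in source.glob("*.txt"):
        folder = target / discipline_of(txt.name)
        folder.mkdir(parents=True, exist_ok=True)
        shutil.copy2(txt, folder / txt.name)
```

A script like this keeps each learner's corpus in its own folder, ready to be opened as a set of files in AntConc.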
F. Analyzing Patterns of Writing: It took approximately two weeks for par-
ticipants to have their self-compiled corpora ready. Before each participant
was assigned to analyze his or her specific corpus, a class session was devoted
to analyzing linguistic features of three registers of writing: news, fiction,
and research articles. This was intended to familiarize the participants with
exploring patterns first by hand in order to demonstrate the ease, effective-
ness, and speed of doing it with a concordancing program later. They were
given a one-page excerpt from each category and tasked with underlining
every verb on the page and trying to describe any patterns they found.
  After each sample was analyzed, I guided the class into accurately de-
scribing their findings. Examples of the findings included numerous uses
of the present perfect in the news excerpt, a high frequency of the past
simple and past perfect in the fiction sample, and the prevalence of the
passive voice in the research article. This activity culminated in the main
assignment: perform their own searches using their self-compiled corpora.
  Participants were given a great deal of flexibility in what to search for,
but the following questions served as a guide:
1 What are some examples of common verbs?
2 What grammatical patterns are most common?
3 Did you find anything interesting or surprising?


4 How can you use this information to improve your writing?

Analyzing patterns in writing: Verbs


This is an open research activity. Everyone may identify different patterns.
The goal is to determine which types of verbs are more appropriate in specific
situations to help our English writing.

1. News
Read through the news sample and underline all of the main verbs in each
sentence. What grammatical patterns are most common (For example: do/did/
have done/had done/is doing/was doing/was done/etc.)? Are there any verbs that are
used more than once?

2. Fiction
Read through the short story sample and underline all of the main verbs in each
sentence. What grammatical patterns are most common (For example: do/did/
have done/had done/is doing/was doing/was done/etc.)? Are there any verbs that are
used more than once?

3. Academic
Read through the academic article sample (applied linguistics) and underline
all of the main verbs in each sentence. What grammatical patterns are most
common (For example: do/did/have done/had done/is doing/was doing/was done/
etc.)? Are there any verbs that are used more than once?

4. Your Discipline
Analyze the verb patterns of your discipline using AntConc.
1 Open AntConc (from your USB or from http://www.laurenceanthony.net/
software/antconc/)
2 Select File and click on Open File(s)...
3 Select all of the files from the corpus of your field (biology, computer
­science, etc.)
4 Go to the Word List tab
5 Click the Start button to see the most frequent words in your corpus
6 You can use the Concordance tab to search for specific verbs/words/
phrases
7 Answer the following guide questions:
What are some examples of common verbs?
What grammatical patterns are most common?
Did you find anything interesting or surprising?
How can you use this information to improve your writing?
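A rough programmatic version of this verb-pattern hunt can be written with regular expressions. The patterns below are deliberately crude heuristics — the -ed endings miss irregular participles (e.g., "was written") and will catch some adjectives — so they mirror the spirit of the hand-analysis rather than the accuracy of a tagged corpus.

```python
import re

# Heuristic regexes for the grammatical frames discussed in class.
PATTERNS = {
    "is/are doing (present progressive)": r"\b(?:is|are)\s+\w+ing\b",
    "was/were doing (past progressive)": r"\b(?:was|were)\s+\w+ing\b",
    "have/has done (present perfect)": r"\b(?:have|has)\s+\w+ed\b",
    "is/are/was/were done (passive)": r"\b(?:is|are|was|were)\s+\w+ed\b",
}

def verb_pattern_counts(text):
    """Count rough matches for each grammatical frame in a text."""
    return {label: len(re.findall(pattern, text, re.IGNORECASE))
            for label, pattern in PATTERNS.items()}

sample = "The data were collected in 2015, and researchers have examined them."
print(verb_pattern_counts(sample))
```

Counts like these can then be compared across the news, fiction, and academic samples to see which frames dominate each register.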
Table C4.5  Sample worksheet: analyzing patterns in writing from your discipline
with AntConc

Participant 1
  Common verbs [participants were told they could look up any types of words
  if preferred]: Is. Are. Study. Learn. Found (that the suggestion found).
  Most common grammatical patterns: Passive voice. The subjects are usually
  something or objects.
  Interesting or surprising findings: TMM. MEM. WOW (what are these?)
  How to use this to improve your writing: To check the patterns if I am not
  sure about them.

Participant 2
  Common verbs (by frequency rank): 41. Development. 51. Study. 60. Teach.
  77. Growth. 78. Effect.
  Most common grammatical patterns: Past pattern.
  Interesting or surprising findings: Verbs come out on 51 rank.
  How to use this to improve your writing: First find the verb, then check it.

Participant 3
  Common verbs: 1. Defined parameter (d is the thickness of a tilted? form).
  2. Passive voice (the maximum value is improved). 3. Is followed by
  adjective (but the reason is not clear so far).
  Most common grammatical patterns: Passive voice. Present. Present perfect.
  Interesting or surprising findings: Both a and c increase slightly with
  increasing oxygen content. The Seebeck coefficient decreases due to the
  increased carriers concentration. With the increasing steps. [More
  “increasing” examples and other -ing adjectives.]
  How to use this to improve your writing: Gerund. The doping element. Heavy
  doping. Lower doping. increas* [further notes].

Table C4.5 shows a synthesis of participants’ responses to the AntConc guide


questions.

Notes for Teachers


Ø By the end of the course, the majority of the participants had positive
impressions of using corpus tools for their academic writing. Throughout
the course, however, there were some who did not find the use of corpora
to be particularly useful for their objectives. This seems to summarize
corpus-based instruction well: It is not for everyone. For the participants


that had clear and immediate academic English-related goals or for those
already possessing confidence with using various types of software for re-
search, understanding and applying the use of corpus tools came naturally.
For one or two of the students that did not have pressing English writing
goals and preferred to spend their time in the US practicing spoken English
skills, the advantages of corpus use were a harder sell. With this in mind,
there are a number of factors that instructors should consider before adding
a corpus element to their courses.
Ø Ample Time: Effectively introducing participants to corpus tools requires
sufficient time for explanation, demonstration, and practice. For this course,
several hours of class time were committed exclusively to this purpose. Ob-
viously, depending on the instructor’s goals, different elements will require
varying amounts of time. The most basic aspects of COCA could be intro-
duced within a single class, but if participants are expected to compile their
own corpora to be analyzed with a program like AntConc, more time will
have to be allotted. The participants in this course were given two weeks to
collect their research articles and then two full classes were spent practicing
analysis. Only after multiple opportunities to practice using AntConc did
they begin to realize its potential for their own writing and research.
Ø Proper Explanation of Merits: One of the most challenging responsibilities of
the instructor in this context is to properly explain to participants why these
tools can be helpful. Learning the process requires a time commitment on their
part, so they need to see that their time is being invested in a genuinely
practical application.
ticed examining linguistic features of various writing samples by hand so
that when they gained access to the concordancer the advantages would be
more apparent. I also had to spend a significant amount of time explaining
why factors like frequency and authenticity of texts are valuable in writing,
editing, and research. Without the appropriate explanation of the material, it
would be easy for participants to feel resentful, bored, or overwhelmed.

Teacher Perspectives on CL in the Classroom

Peter Dye is an English Instructor and Academic Manager in Oglethorpe


University’s International Study Center. Over the past decade, he has taught
a range of EAP and ESP courses in Spain, South Korea, and the US. His
research interests involve using corpus tools to teach academic writing to
English learners across disciplines.

You have worked with professional academics in your


lesson for this book. What, to you, is the ideal or most
appropriate English proficiency level of students in a
corpus-based course?
As I mentioned previously in the lesson, corpus instruction is not for ev-
eryone. The students need to have a decently strong foundation of English
before setting out to analyze millions of words of text for specific linguis-
tic features. Otherwise, they may not know what features to search for or
how to interpret the results. For this reason, I recommend that this type of
course be designed for at least intermediate students, but preferably more
advanced learners. Several of the lower level students in my classes required
extra time and attention from me (and also more advanced classmates)
before they were able to comprehend the tools, materials, and objectives.

How do you connect relevant student goals with corpus


instruction?
The students’ desired outcomes in English should relate to the instruction.
This class was particularly well suited for corpus instruction because the
majority of the students wanted to publish research articles in English.
Research articles are easily accessible and lend themselves to corpus analy-
sis. A speaking or listening course could incorporate corpora, but it may not
be as fitting. It is also helpful if the students are at least minimally computer
savvy. Students with a computer science background understood and used the
concordancing program more quickly than some of their classmates from other
fields or with less technological experience.

C4.4 Incorporating a Corpus-Based Text Visualization Program


in the Writing Classroom
Cynthia Berger, Duolingo, Pittsburgh, PA, USA

Lesson Background
The lesson describes Zhu and Friginal’s (2015) Text X-Ray (see Section A3.2
for an introduction to this POS-visualizer) and its application in an academic
writing course for multilingual writers. I introduced Text X-Ray to my students
in a university-level undergraduate composition course. Selected features of the
program were highlighted through three separate activities. While the curricu-
lum and syllabus for this course had already been determined before I had access
to Text X-Ray, I was able to integrate these activities into the pre-established
course syllabus in a way that aligned with the general course objectives.
The class consisted of 21 students representing the following 7 native lan-
guages: Arabic, Korean, Italian, Spanish, Chinese, Norwegian, and Romanian.
The age of students ranged from late teens to mid-20s. While some students
had been in the US for several years (the longest time spent in the US was eight
years), the majority had been in the country for as few as four months. The
latter self-identified in the classroom as “international students,” while many of
the former had immigrated to the US with their families as adolescents and thus
might best be understood in the field as “Generation 1.5” students (Doolan &
Miller, 2012).
At the risk of overgeneralizing the heterogeneity of learners’ experiences,
“Generation 1.5” typically refers to students aged 25 and under who immi-
grated from a non-English-dominant country during their adolescence, speak
a language other than English at home, and have strong speaking and listening
skills in English (Doolan & Miller, 2012). Students who fell within these crite-
ria exhibited advanced conversation skills in English but found issues pertaining
to sentence structure, basic grammatical categories (i.e., POS), and word choice
particularly challenging in their writing. Furthermore, I noticed that these
students struggled to grasp the conventions of college-level academic writing.

Lessons
The following sample lessons describe three separate activities in which Text
X-Ray was utilized in the described course. The activities are presented in the
order in which they took place in the semester.

Sample Lesson #1
The first time I introduced Text X-Ray to my students, I used it to demonstrate
the tool’s ability to highlight grammatical categories. The objective of this first
lesson was to cultivate learners’ ability to self-edit for subject-verb agreement
and for issues related to basic sentence structure, both of which I found to
be prevalent in learners’ later drafts, despite repeated attempts to
encourage students to self-edit their final versions of essays for sentence-level
mistakes.

Note to Teachers
• Text X-Ray, in its most basic application, can show teachers/learners the
use of particular POS (e.g., nouns, verbs, adjectives, and adverbs) in a text.
• If a specific objective in a course, perhaps a course in second language
grammar for international students in the US, focuses on the use of a
certain feature, such as ‘existential there,’ Text X-Ray can provide a quick
access to many potential examples for students to analyze (i.e., data-driven
learning applications).

Figure C4.2  Highlighted grammatical categories of writing from Text X-Ray
(Zhu & Friginal, 2015).
• This feature can build awareness of a form’s construction and typical place-
ment at sentence-level, paragraph-level, and even entire composition-level
writing. A color-coded visualizer helps users to focus on these ‘tagged’
features easily within the same text or group of texts.

In previous assignments, students had prepared a 2–3-page second draft of an


essay-in-progress. Students were asked to bring a hard copy of this draft to
class, in addition to uploading a digital version of the paper to the class Dropbox
folder. At the beginning of class, I told my students to give the hard copy of
their draft to a neighbor for peer editing. Peer editors were instructed to read
their partner’s essay carefully.
Afterward, peer editors were told to read the essay again, this time underlin-
ing all of the nouns that they encountered as they read. Following this activity,
the draft was passed to a second peer editor who (1) read the essay carefully, and
(2) read it a second time, underlining any nouns that the original peer editor
may have missed. In this regard, my students had ample opportunity to think
about nouns and the various roles they may play in a sentence (e.g., subject,
direct object, etc.) before using technology to identify them.
Next, the hard copies of drafts were returned to their original author, and
students were directed to the following directions on the digital projector:

1 Open Text X-Ray1 from the following website: http://cs.gsu.edu/yzhu/


Documents/TextVis/launch.jnlp (Case-sensitive—you must use capital
letters when appropriate.)
2 Open your Summary Paper + Response 3.2 from our course’s Dropbox folder.
3 Select and copy the text from the entire paper.
4 Paste your paper into Text X-Ray.
5 Click the green button that says, “nouns.”
6 Double-check to make sure all of your nouns have been underlined.

Students were instructed to complete this portion of the lesson in dyads, that is,
partners would paste and analyze one student’s paper before moving on to the
next student’s paper. While students were working, I walked around the room
and facilitated learners’ initial attempts to access Text X-Ray; there were few
difficulties. I noticed that partners were actively interacting with both the Text
X-Ray interface and with one another, pointing to the screen, pointing to the
essay in question, and often verbalizing and gesturing animatedly if there was
a discrepancy between the two.
Whenever possible, I asked students to reflect on such discrepancies. Why
had they not realized that this particular word was a noun? Why had students
labeled another word a noun if Text X-Ray did not? I also observed that many
students had begun to experiment with other features of Text X-Ray, such as
the brown button labeled “verbs.” At this point, I directed their attention to
further instructions on the digital projector:

7 Click the brown button that says, “verbs.”


8 Using Text X-Ray’s highlighted verbs as a guide, circle all of the verbs in
your paper.
9 Draw an arrow between each verb and the noun that is closest to it.

Figure C4.3  Students working on their Text X-Ray activity in class.


Following this portion of the lesson, I encouraged students to consider whether


or not each verb had an accompanying noun acting as its subject and whether
that noun and verb agreed in number.
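The pairing of each verb with its nearest noun can even be approximated in code, though only very roughly. In this illustrative Python sketch, any word ending in -s is treated as plural, so singular words like "this" or "analysis" will trigger false alarms; the output is best treated as a list of prompts for discussion, not as corrections.

```python
import re

# Very rough self-check in the spirit of the activity above: pair each
# be-verb with the word immediately before it and flag apparent number
# mismatches. Real agreement checking needs a parser; this heuristic
# only mirrors the classroom exercise of linking verbs to nearby nouns.
def flag_agreement(text):
    issues = []
    for m in re.finditer(r"\b(\w+?)(s?)\s+(is|are|was|were)\b", text, re.IGNORECASE):
        plural_noun = bool(m.group(2))
        plural_verb = m.group(3).lower() in ("are", "were")
        if plural_noun != plural_verb:
            issues.append(m.group(0))
    return issues

print(flag_agreement("The participants was ready."))
```

Each flagged phrase is simply a candidate for the learner to re-examine, much as the arrows drawn on paper were.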
While this particular sample lesson focused on students’ ability to self-edit
their writing for subject-verb agreement, the feature of Text X-Ray that has
been utilized could also be used to help learners self-edit with regard to article
usage, issues pertaining to the use of count vs. non-count nouns, and so forth.

Sample Lesson #2
Another lesson in which Text X-Ray was utilized took advantage of the corpus
tool’s ability to highlight the linguistic features of a particular text as compared
to the linguistic features conventional to a chosen genre or text type.
During an in-class discussion, a few students had pointed out that the essays
they were required to read in their textbook did not necessarily align with the
expectations I communicated to them regarding their own academic writing.
I was quite encouraged by this level of genre-awareness in my students and I
chose to incorporate a genre investigation into a subsequent lesson that would
allow students to further explore their observation.
Fortunately, students in this course were already familiar with MICUSP from
previous classroom activities. Students were directed to a document in the Drop-
box folder, which included three paragraphs from a recently assigned essay in the
class textbook: “Sorry, Vegans: Brussels Sprouts Like to Live, Too,” written by
Natalie Angier in 2009 for the New York Times. Working in pairs, students pasted
these three paragraphs into Text X-Ray’s text editor and then chose the “Compare
with MICUSP” option. Because the students were currently working on writing
argumentative essays of their own, they were told to compare based on this paper
type alone. Thus, students selected “Paper Type”—“Argumentative Essay” and
then hit “Compare.” They received a comparison like the one shown in Figure C4.4.

Sample Lesson #3
In a third lesson, students used Text X-Ray to investigate their own use of re-
cently learned vocabulary in their writing.
Throughout the assigned course readings, learners were required to keep logs
of “newly-encountered” words. The vocabulary log required learners to note the
page and paragraph number of four new words/phrases from each text, to copy each
word/phrase along with the 1–2 words that came before and after it in the text (for
the sake of recording collocational information), and to provide a definition of the
vocabulary items in the students’ own words. Students uploaded their vocabulary
logs following each course reading, after which I extracted the most frequently
cited words into a master list that students could reference at any time via the
course Dropbox folder. As students’ essays centered on topics that emerged from
course readings, they were encouraged to incorporate recently learned vocabulary
from course readings into their own essays; however, no quantitative manner of
identifying and/or tallying such vocabulary had been introduced.
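One simple way to quantify such vocabulary use — which Text X-Ray later automates with its word-list highlighting — is a short script that tallies class-list words in a draft. The following standard-library sketch is illustrative; the function name and the sample essay are invented for this example.

```python
import re
from collections import Counter

def vocabulary_coverage(essay, word_list):
    """Count how often each word from a class vocabulary list appears in an
    essay (case-insensitive, whole words only)."""
    tokens = Counter(re.findall(r"[a-z']+", essay.lower()))
    return {word: tokens[word.lower()] for word in word_list}

essay = "The hypothesis was tested twice, and the hypothesis held."
print(vocabulary_coverage(essay, ["hypothesis", "paradigm"]))
```

Words with a count of zero are exactly the ones a learner has not yet tried out in writing, which makes the tally a useful prompt for revision.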
Figure C4.4  Sample MICUSP comparison output in Text X-Ray (visualized
comparison output of student paper and MICUSP averages automatically
obtained by clicking the “Compare” button under the “Compare with
MICUSP” tab).

Students were instructed to copy and paste their most recent draft of an
essay into Text X-Ray’s text editor. They were then guided to download a
copy of the vocabulary list I prepared and select “Load a word list from this
computer,” under Text X-Ray’s “Customized word list” feature. Once this list
was uploaded, students highlighted the uploaded document under “Word list”
and clicked “Highlight the words from the list(s).” Doing so caused any words
from the class-generated vocabulary list to be underlined in red in the students’
writing, providing a visual representation of the amount of new vocabulary
learners were incorporating into their writing. Students performed this activity
in dyads so that there was greater spoken interaction and accountability.
In the final portion of this lesson, students were encouraged to use the “Word
Cloud” function of Text X-Ray to produce a visual graphic of the most fre-
quently used words in their writing. Students were then asked to individually
write a reflection of their current progress in vocabulary use based on the results
of the word cloud and the highlighted vocabulary found in their essay. Questions
students were encouraged to answer in their reflection included the following:

• How much of the class vocabulary are you using in your own writing?
• Based on your word cloud, which word(s) do you use the most in your
writing? Why do you think this is the case?
• What is one way you can practice using more unfamiliar vocabulary in
your academic writing?
Final Thoughts and Recommendations for Teachers


on Using Text X-Ray
I found three features of Text X-Ray to be particularly useful. First, the inter-
face is visually appealing. This is not the case with the majority of pedagogical
corpus tools I am familiar with. The majority of the learners in this course are
highly digitally literate. They can spot a poorly designed website from a mile
away and have little tolerance for it. While visual appeal isn’t my number one
priority in determining a digital tool’s usefulness in the classroom, design does
matter.
Second, I really liked how Text X-Ray combined a variety of features avail-
able in other corpus tools into one user-friendly location. Rather than needing
to introduce and familiarize my writers with several different corpus tools/
websites, students can access the majority of the features these tools provide in
one place. This makes it much more likely that my students will remember Text
X-Ray and use it independently outside of class.
Finally, as I observed my students completing these three activities, I no-
ticed the degree of interaction that Text X-Ray elicited, on a variety of levels.
First, learners were interacting with one another. The majority of activities
were performed in groups of 2–3; thus, students were talking with one another
throughout—giving instructions, asking questions, vocalizing surprise at the
results, and so forth. Text X-Ray is fairly intuitive, so rather than giving explicit
instructions on all of the features, I allowed students to explore on their own.
As a result, they interacted with each other more, and the overall energy in
the classroom seemed to increase. There was also interaction between media;
that is, students engaged more with the written text on paper, as well as with
the text on the screen. Such interaction may seem unremarkable, but text-to-
screen can be a difficult gap to bridge in writing courses. Many students are
either comfortable composing on paper or on screen, but not both. It can be
even more difficult for learners to self-edit in a particular medium. Finally, Text
X-Ray allowed students to interact with more than one text at a time. Rather
than engaging with assigned course readings in one portion of the class and
then with their own essays in another, Text X-Ray made it possible for learners
to visualize connections between their own writing, the articles and essays
they’d been reading on a similar topic, and even successful essays written by
students before them (by way of MICUSP). In this regard, my students became
less isolated as writers; they belonged to a community of writers striving to
better understand and perfect their craft. Students also began to deconstruct
the seemingly ‘perfect’ model texts in a way that made them better readers and
writers overall.
In future courses, I can envision using Text X-Ray with items from the
Academic Word List (AWL) to encourage learners to produce more academic
language and to monitor their use of words from the AWL in their writing.
I’d also like to play around more with the “Readability” functions, allowing
students to visualize the number of complex words and sentences they’re using
in their writing. There are also lots of other ways that highlighting grammat-
ical categories could be used to tailor an activity to whatever we’re working
on in class at a particular point in the semester. For example, I might want to
encourage learners to compose more compound sentences using coordinating
conjunctions, to proofread for misuse of prepositions, and so forth. Being able
to easily track their use of these categories throughout their writing would be
valuable to students.
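The kind of AWL monitoring described above can be approximated with a short script. The sketch below is purely illustrative — it is not part of Text X-Ray, and the word set is a tiny hypothetical subset of Coxhead’s Academic Word List, not the full list. It counts a learner’s AWL hits and, as a rough “complexity” proxy, the share of longer words in an essay:

```python
# Illustrative sketch: count AWL items and long ("complex") words in an essay.
# AWL_SUBSET is a tiny hypothetical sample, not the full Academic Word List.
import re
from collections import Counter

AWL_SUBSET = {"analyze", "concept", "data", "research", "significant", "theory"}

def profile_text(text, awl=AWL_SUBSET, long_word_len=7):
    """Return (AWL words used with counts, proportion of long words)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    awl_hits = {w: counts[w] for w in awl if counts[w] > 0}
    long_ratio = sum(len(t) >= long_word_len for t in tokens) / max(len(tokens), 1)
    return awl_hits, round(long_ratio, 2)

essay = "The research data support this theory, and the data are significant."
hits, ratio = profile_text(essay)
print(hits)   # AWL words the student actually used, with frequencies
print(ratio)  # rough proportion of long words
```

A teacher could run a script like this over successive drafts to show learners whether their academic vocabulary use is growing over time.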

Cynthia Berger is an Applied Linguist for the language learning platform
Duolingo, Pittsburgh, PA, USA. She is also currently a PhD candidate in
Applied Linguistics at Georgia State University. Her research interests in-
clude CL, second language acquisition, lexical development, and language
pedagogy. Before studying applied linguistics, Cynthia received an MFA in
Creative Writing from the University of Oregon and taught English as an
additional language in a variety of settings. She is the coauthor (with Jun
Liu) of the book TESOL: A Guide.

Note
1 Note again that we plan to launch a full, new version of Text X-Ray upon comple-
tion of our usability tests. If you want to access the beta version, please send an email
to textxray.beta@gmail.com for instructions on how to run the program online.
References

Aijmer, K. (2011). “Well I’m not sure I think…” The use of well by non-native speak-
ers. International Journal of Corpus Linguistics, 16(2), 231–254.
Alderson, J. C. (2007). Judging the frequency of English words. Applied Linguistics,
28(3), 383–409.
Alsop, S., & Nesi, H. (2009). Issues in the development of the British Academic Written
English (BAWE) corpus. Corpora, 4(1), 71–83.
Altenberg, B., & Granger, S. (2002). The grammatical and lexical patterning of MAKE
in native and non-native student writing. Applied Linguistics, 22(2), 173–195.
Altenberg, B., & Tapper, M. (1998). The use of adverbial connectors in advanced Swedish
learners’ written English. In S. Granger (Ed.), Learner English on computer (pp. 80–93).
Harlow, UK: Addison Wesley Longman.
Anderson, W., & Corbett, J. (2009). Exploring English with online corpora. New York, NY:
Palgrave Macmillan.
Angier, N. (December 21, 2009). Sorry, vegans: Brussels sprouts like to live, too. The
New York Times. Retrieved March 14, 2016 from www.nytimes.com/2009/12/22/
science/22angi.html
Anthony, L. (2013). AntFileConverter [Computer Software]. Tokyo, Japan: Waseda
University. Retrieved July 9 from www.laurenceanthony.net/
Anthony, L. (2014). AntConc (Version 3.4.3) [Computer Software]. Tokyo, Japan:
Waseda University. Retrieved July 9 from www.laurenceanthony.net/
Anthony, L. (2015). TagAnt [Computer Software]. Tokyo, Japan: Waseda University.
Retrieved July 9 from www.laurenceanthony.net/
Aston, G. (2015). Learning phraseology from speech corpora. In A. Lenko-Szymanska &
A. Boulton (Eds.), Multiple affordances of language corpora for data-driven learning (pp. 65–84).
Amsterdam: John Benjamins.
Athelstan. (1999). MonoConc Pro [Computer Software].
Baker, P. (2006). Using corpora in discourse analysis. New York, NY: Continuum.
Baldwin, E. (2014). Beyond contrastive rhetoric: Helping international lawyers use co-
hesive devices in legal writing. Florida Journal of International Law, 26, 399–446.
Barlow, M. (2012). MonoConc Pro 2.2 (MP2.2) [Computer Software]. Retrieved
September 17, 2014 from www.monoconc.com/
Bauman, J. (1995). About the general service list. Retrieved March 2011, from http://
jbauman.com/aboutgsl.html.
Belcher, D. (1994). The apprenticeship approach to advanced academic literacy: Gradu-
ate students and their mentors. English for Specific Purposes, 13(1), 23–34.
Belcher, D. (Ed.). (2009). English for specific purposes in theory and practice. Ann Arbor, MI:
University of Michigan Press.
Bennett, G. (2010). Using corpora in the language learning classroom. Ann Arbor, MI: University
of Michigan Press.
Biber, D. (1988). Variation across speech and writing. Cambridge, UK: Cambridge University
Press.
Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge,
UK: Cambridge University Press.
Biber, D. (2004). Corpus linguistics and language teaching. Invited lecture, University
of California, Berkeley, Berkeley, CA, March 15.
Biber, D. (2006). University language: A corpus-based study of spoken and written registers.
Amsterdam: John Benjamins.
Biber, D. (2009). A corpus-driven approach to formulaic language in English: Multi-
word patterns in speech and writing. International Journal of Corpus Linguistics, 14(3),
275–311.
Biber, D. (2010). Biber Tagger [Computer Software]. Flagstaff, AZ: Northern Arizona
University.
Biber, D., & Conrad, S. (2009). Register, genre, and style. New York, NY: Cambridge
University Press.
Biber, D., Conrad, S., & Leech, G. (2002). A student grammar of spoken and written English.
London, UK: Longman.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language struc-
ture and use. Cambridge, UK: Cambridge University Press.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman gram-
mar of spoken and written English. Harlow, UK: Pearson.
Biber, D., Reppen, R., & Friginal, E. (2010). Research in corpus linguistics. In R. B.
Kaplan (Ed.), The Oxford handbook of applied linguistics (2nd ed., pp. 548–570).
Oxford, UK: Oxford University Press.
Blum-Kulka, S., House, J., & Kasper, G. (1989). Cross-cultural pragmatics: Requests and
apologies. Norwood, NJ: Ablex.
Bohannon, J. (2010). Google opens books to new cultural studies. Science, 330(6011),
1600.
Bohannon, J. (2011). Google books, Wikipedia, and the future of culturomics. Science,
331, 135.
Boulton, A. (2009). Testing the limits of data-driven learning: Language proficiency
and training. ReCALL, 21(1), 37–54.
Boulton, A. (2012). Beyond concordancing: Multiple affordances of corpora in uni-
versity language degrees. Procedia - Social And Behavioral Sciences, 34, 33–38
(Languages, Cultures and Virtual Communities (Les Langues, les Cultures et les
Communautes Virtuelles).
Boulton, A. (2015). Applying data-driven learning to the web. In A. Lenko-Szymanska &
A. Boulton (Eds.), Multiple affordances of language corpora for data-driven learning (pp. 267–
295). Amsterdam: John Benjamins.
Brand, C., & Götz, S. (2011). Fluency versus accuracy in advanced spoken learner
language: A multi-method approach. International Journal of Corpus Linguistics, 16(2),
255–275.
British Academic Spoken English and BASE Plus Collections. (2017). The British Aca-
demic Spoken English. Retrieved March 14, 2016 from www2.warwick.ac.uk/fac/
soc/al/research/collections/base/
Brockman, J. (2010). Who’s taking your call? CFI Group’s 2010 Contact Center Satisfaction
Index. Retrieved March 14, 2016 from www.cfigroup.com/resources/whitepapers_
register.asp?wp=46
Brown, P., Lai, J., & Mercer, R. (1991). Aligning sentences in parallel corpora. In
Proceedings of the twenty-ninth annual meeting of the association for computational linguistics
(pp. 169–176). Berkeley, CA.
Buysse, L. (2012). So as a multifunctional discourse marker in native and learner speech.
Journal of Pragmatics, 44(13), 1764–1782.
Candlin, C. N., & Hafner, C. A. (2007). Corpus tools as an affordance to learning in
professional legal education. Journal of English for Academic Purposes, 6, 303–318.
Carr, C. T., Schrock, D. B., & Dauterman, P. (2012). Speech acts within Facebook
status messages. Journal of Language and Social Psychology, 31(2), 176–196.
Carter, R., McCarthy, M., Mark, G., & O’Keeffe, A. (2016). English grammar today.
Cambridge, UK: Cambridge University Press.
Casanave, C. P. (2006). Controversies in second language writing: Dilemmas and decisions in
research and instruction. Ann Arbor, MI: University of Michigan Press.
Celce-Murcia, M. (1991a). Discourse analysis and grammar instruction. Annual Review
of Applied Linguistics, 11, 459–480.
Celce-Murcia, M. (1991b). Grammar pedagogy in second and foreign language teach-
ing. TESOL Quarterly, 25, 135–151.
Celce-Murcia, M., Dörnyei, Z., & Thurrell, S. (1997). Direct approaches in L2 instruc-
tion: A turning point in communicative language teaching? TESOL Quarterly, 31,
141–152.
Chambers, A., Farr, F., & O’Riordan, S. (2011). Language teachers with corpora in
mind: From starting steps to walking tall. Language Learning Journal, 39, 85–104.
Chapelle, C. (1998). Multimedia CALL: Lessons to be learned from research on in-
structed SLA. Language Learning & Technology, 2(1), 22–34.
Chapelle, C., & Jamieson, J. (2009). Tips for teaching with CALL: Practical approaches to
computer-assisted language learning. White Plains, NY: Pearson Education.
Charles, M. (2005). Phraseological patterns in reporting clauses used in citation: A corpus-
based study of theses in two disciplines. English for Specific Purposes, 17, 113–134.
Charles, M. (2014). Getting the corpus habit: EAP learners’ long-term use of personal
corpora. English for Specific Purposes, 35, 30–40.
Cheng, W. (2012). Exploring corpus linguistics: Language in action. New York, NY:
Routledge.
Cheng, W., Greaves, C., & Warren, M. (2005). The creation of a prosodically tran-
scribed intercultural corpus: The Hong Kong Corpus of Spoken English (prosodic).
ICAME Journal, 29, 47–68.
Cheng, W., Greaves, C., & Warren, M. (2008). A corpus-driven study of discourse intona-
tion. Amsterdam: John Benjamins.
Chujo, K., Utiyama, M., & Miura, S. (2006). Using a Japanese-English parallel corpus
for teaching English vocabulary to beginning level students. English Corpus Studies,
13, 153–172.
Cilveti, L. D., & Perez, I. K. A. (2006). Textual and language flaws: Problems for
­Spanish doctors in producing abstracts in English. IBERICA, 11, 61–79.
Cobb, T. (1997). Is there any measurable learning from hands-on concordancing? Sys-
tem, 25, 301–315.
Cobb, T. (2016). Compleat Lexical Tutor v.8 [website]. Retrieved December 9, 2016
from http://lextutor.ca
Cohen, W. (2015). Enron email dataset. Retrieved from www.cs.cmu.edu/~enron/
Collins CoBuild English Language Dictionary. (1987). London: Collins CoBuild.
Conrad, S. (2000). Will corpus linguistics revolutionize grammar teaching in the 21st
century? TESOL Quarterly, 34(3), 548–560.
Conrad, S., & Biber, D. (2009). Real grammar: A corpus-based approach to English. New
York, NY: Pearson-Longman.
Cook, G. (1998). The uses of reality: A reply to Ronald Carter. ELT Journal, 52(1), 57–63.
Cortes, V. (2004). Lexical bundles in published and student disciplinary writing: Exam-
ples from history and biology. English for Specific Purposes, 23, 397–423.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213–238.
Coxhead, A. (2002). The academic wordlist: A corpus‐based word list for academic
purposes. In B. Kettemann & G. Marko (Eds.), Teaching and learning by doing corpus
analysis (pp. 73–89). Amsterdam: Rodopi.
Coxhead, A. (2011). The academic word list 10 years on: Research and teaching impli-
cations. TESOL Quarterly, 45(2), 355–361.
Crasborn, O. (2008). Open access to sign language corpora. In O. Crasborn, T. Hanke,
E. Efthimiou, I. Zwitserlood, & E. Thoutenhoofd (Eds.), Construction and exploitation of
sign language corpora (pp. 33–38). Third workshop on the representation and processing
of sign language. Paris: European Language Resources Association (ELRA).
Creswell, J. (2014). Research design: Qualitative, quantitative, and mixed methods approaches
(4th ed.). New York, NY: Sage.
Crossley, S., & McNamara, D. (2009). Computational assessment of lexical differences
in L1 and L2 writing. Journal of Second Language Writing, 18, 119–135.
Daniels, M. (2017). The largest vocabulary in hip hop. Retrieved from https://pudding.
cool/2017/02/vocabulary/index.html
Davies, M. (2008). The Corpus of Contemporary American English (COCA): 520
million words, 1990-present. Retrieved from https://corpus.byu.edu/coca/
Davies, M. (2009). The 385+ million word Corpus of Contemporary American
English (1990–2008+): Design, architecture, and linguistic insights. International
Journal of Corpus Linguistics, 14(2), 159–190.
Davies, M. (2010). The Corpus of Historical American English (COHA): 400 million
words, 1810–2009. Retrieved from https://corpus.byu.edu/coha/
Davies, M. (2013). Corpus of Global Web-Based English: 1.9 billion words from speak-
ers in 20 countries (GloWbE). Retrieved from https://corpus.byu.edu/glowbe/
Davies, M. (2017a). Using large online corpora to examine lexical, semantic, and cul-
tural variation in different dialects and time periods. In E. Friginal (Ed.), Studies in
corpus-based sociolinguistics (pp. 19–82). New York, NY: Routledge.
Davies, M. (2017b). WordandPhrase.Info [website]. Retrieved March 18, 2016 from
corpus.byu.edu
Davies, M., & Gardner, D. (2010). A frequency dictionary of contemporary American English:
Word sketches, collocates, and thematic lists. New York, NY: Routledge.
De Haan, P. (1989). Postmodifying clauses in the English noun phrase: A corpus-based study.
Amsterdam: Rodopi.
Donley, K., & Reppen, R. (2001). Using corpus tools to highlight academic vocabulary
in SCLT. TESOL Journal, 12, 7–12.
Doolan, S., & Miller, D. (2012). Generation 1.5 written error patterns: A comparative
study. Journal of Second Language Writing, 21(1), 1–22.
Educational Testing Service. (2016). TOEFL iBT questions writing sample responses
[pdf ]. Retrieved December 9, 2016 from www.ets.org/Media/Tests/TOEFL/pdf/
ibt_writing_sample_responses.pdf
Edutopia. (2017). [Website] Retrieved March 12, 2017 from www.edutopia.org/
Eggington, W. (2004). Rhetorical influences. As Latin was, English is? In C. L. Moder &
A. Martinovic-Zic (Eds.), Discourse across languages and cultures (pp. 251–265). Phila-
delphia, PA: John Benjamins.
Eisenstein, J., O’Connor, B., Smith, N. A., & Xing, E. P. (2010). A latent variable
model for geographic lexical variation. Paper presented at the Proceedings of the
2010 Conference on Empirical Methods in Natural Language Processing, MIT
Stata Center, Cambridge, MA, USA.
Ellis, R. (1995). Interpretation tasks for grammar teaching. TESOL Quarterly, 29, 87–105.
Ellis, R. (1998). Teaching and research: Options in grammar teaching. TESOL Quar-
terly, 32, 39–60.
Firth, J. (1957). Papers in linguistics. Oxford, UK: Oxford University Press.
Flowerdew, J. (2002). Genre in the classroom: A linguistic approach. In A. M. Johns (Ed.),
Genre in the classroom: Multiple perspectives (pp. 91–102). New York, NY: Routledge.
Flowerdew, L. (2005). An integration of corpus-based and genre-based approaches to
text analysis in EAP/ESP: Countering criticisms against corpus-based methodolo-
gies. English for Specific Purposes, 24(3), 321–332.
Flowerdew, L. (2012). Corpora and language education. New York, NY: Palgrave Macmillan.
Fotos, S. (1994). Integrating grammar instruction and communicative language use
through grammar consciousness-raising tasks. TESOL Quarterly, 28, 323–351.
Francis, D., Rivera, M., Lesaux, N., Kieffer, M., & Rivera, H. (2006). Practical guidelines
for the education of English language learners: Research-based recommendations for instruction
and academic interventions. Portsmouth, NH: RMC Research Corporation, Center
on Instruction.
Francis, G., Hunston, S., & Manning, E. (1996). Collins Cobuild grammar patterns 1:
Verbs. London, UK: Harper Collins.
Friginal, E. (2009). The language of outsourced call centers: A corpus-based study of cross-­
cultural interaction. Amsterdam: John Benjamins.
Friginal, E. (2011). Interactional and cross-cultural features of outsourced call center
discourse. International Journal of Communication, 21(1), 53–76.
Friginal, E. (2013a). 25 years of Biber’s multi-dimensional analysis: Introduction to the
special issue. Corpora, 8(2), 137–152.
Friginal, E. (2013b). Developing research report writing skills using corpora. English for
Specific Purposes, 32(4), 208–220.
Friginal, E. (2015). Concordancers. In J. Bennet (Ed.), The Sage encyclopedia of intercul-
tural communication (pp. 109–111). Thousand Oaks, CA: Sage.
Friginal, E., & Hardy, J. A. (2014a). Corpus-based sociolinguistics: A guide for students. New
York, NY: Routledge.
Friginal, E., & Hardy, J. A. (2014b). Conducting Biber’s multi-dimensional analysis
using SPSS. In T. Berber Sardinha & M. Pinto (Eds.), Multi-dimensional analysis: 25
years on (pp. 297–316). Amsterdam: John Benjamins.
Friginal, E., Lee, J. J., Polat, B., & Roberson, A. (2017). Exploring spoken English learner
language using corpora: Learner talk. London, UK: Palgrave Macmillan.
Friginal, E., Li, M., & Weigle, S. (2014). Revisiting multiple profiles of learner compo-
sitions: A comparison of highly rated NS and NNS essays. Journal of Second Language
Writing, 23, 1–14.
Friginal, E., & Mustafa, S. (2017). A comparison of US-based and Iraqi English research
article abstracts using corpora. Journal of English for Academic Purposes, 25, 45–57.
Friginal, E., & Polat, B. (2015). Linguistic dimensions of learner speech in English in-
terviews. Corpus Linguistics Research, 1, 53–82.
Friginal, E., Walker, M., & Randall, J. (2014). Exploring mega-corpora: Google Ngram
viewer and the Corpus of Historical American English. E-JournALL, 1(1), 132–151.
Friginal, E., Waugh, O., & Titak, A. (2017). Linguistic variation in Facebook and
Twitter posts. In E. Friginal (Ed.), Studies in corpus-based sociolinguistics (pp. 342–362).
New York, NY: Routledge.
Fukushima, S., Watanabe, Y., Kinjo, Y., Yoshihara, S., & Suzuki, C. (2012). Develop-
ment of a web-based concordance search system based on a corpus of English papers
written by Japanese university students. Procedia – Social and Behavioral Sciences, 34,
54–58. Languages, Cultures and Virtual Communities (Les Langues, les Cultures et
les Communautes Virtuelles).
Gavioli, L. (2005). Exploring corpora for ESP learning. Philadelphia, PA: John Benjamins.
Gavioli, L., & Aston, G. (2001). Enriching reality: Language corpora in language ped-
agogy. ELT Journal, 55(3), 238–246.
Geluso, J., & Yamaguchi, A. (2014). Discovering formulaic language through data-
driven learning: Learner attitudes and efficacy. ReCALL, 26(2), 225–242.
Gilquin, G. (2008). Hesitation markers among EFL learners: Pragmatic deficiency or
difference? In J. Romero-Trillo (Ed.), Pragmatics and corpus linguistics: A mutualistic
entente (pp. 119–149). Berlin: Mouton de Gruyter.
Gilquin, G., De Cock, S., & Granger, S. (Eds.). (2010). The Louvain International Da-
tabase of Spoken English Interlanguage, handbook and CD-ROM. Louvain-la-Neuve,
Belgium: Presses Universitaires de Louvain.
Granger, S. (1983). The be + past participle construction in spoken English with special emphasis
on the passive. New York, NY: Elsevier Science Publishers.
Granger, S., Gilquin, G., & Meunier, F. (2015). The Cambridge handbook of learner corpus
research. Cambridge, UK: Cambridge University Press.
Granger, S., Hung, J., & Petch-Tyson, S. (Eds.). (2002). Computer learner corpora, second
language acquisition, and foreign language teaching. Amsterdam: John Benjamins.
Grant, L., & Ginther, A. (2000). Using computer-tagged linguistic features to describe
L2 writing differences. Journal of Second Language Writing, 9(2), 123–145.
Gray, B. (2013). More than discipline: Uncovering multi-dimensional patterns of vari-
ation in academic research articles. Corpora, 8(2), 153–181.
Greenbaum, S. (Ed.). (1996). Comparing English worldwide: The International Corpus of
English. Oxford, UK: Clarendon Press.
Grieve, J. (2016). Regional variation in written American English. Cambridge, UK: ­Cambridge
University Press.
Grieve, J., Biber, D., Friginal, E., & Nekrasova, T. (2010). Variation among blogs: A
multi-dimensional analysis. In A. Mehler, S. Sharoff & M. Santini (Eds.), Genres on
the web: Corpus studies and computational models (pp. 45–71). New York, NY: Springer-
Verlag.
Grieve, J., Nini, A., Guo, D., & Kasakoff, A. (2014). Big data dialectology: Analyzing
lexical spread in a multi-billion word corpus of American English. Paper presented at
the American Association for Corpus Linguistics Conference, Northern Arizona
University, Flagstaff, AZ.
Grigoryan, T. (2016). Using learner corpora in language teaching. International Journal
of Language Studies, 10(1), 71–90.
Guo, L. (2011). Product and process in TOEFL iBT independent and integrated writ-
ing tasks: A validation study (Doctoral dissertation). Retrieved from Georgia State
University’s Catalog (gast.2478085).
Guo, G., Crossley, S. A., & McNamara, D. S. (2013). Predicting human judgments of
essay quality in both integrated and independent second language writing samples:
A comparison study. Assessing Writing, 18(3), 218–238.
Halvey, M., & Keane, M. T. (2007). An assessment of tag presentation techniques.
Poster presentation at http://dblp.uni-trier.de/db/conf/www/www2007.html.
Handford, M. (2010). The language of business meetings. Tokyo: Cambridge University
Press.
Hardy, J. A. (2013). Disciplinary specificity in student writing: A proposal for corpus collection
(Unpublished manuscript). Atlanta, GA: Georgia State University.
Hardy, J. A. (2014). Undergraduate student writing across the disciplines: Multi-dimensional anal-
ysis studies (Unpublished doctoral dissertation). Atlanta, GA: Georgia State University.
Hardy, J. A., & Römer, U. (2013). Revealing disciplinary variation in student writing: A
multi-dimensional analysis of 16 disciplines from MICUSP. Corpora, 8(2), 183–208.
Hartig, A. J., & Lu, X. (2014). Plain English and legal writing: Comparing expert and
novice writers. English for Specific Purposes, 3, 87–96.
He, X. (2016). Text and spatial-temporal data visualization (Unpublished doctoral disserta-
tion). Atlanta, GA: Georgia State University.
Hinkel, E. (2002). Second language writers’ text: Linguistic and rhetorical features. Mahwah,
NJ: Lawrence Erlbaum Associates.
Hoey, M. (2005). Lexical priming: A new theory of words and language. London, UK:
Routledge.
Hopkins, C. (2011). The hip-hop word count counts on language to understand the
reality we keep. Retrieved from https://readwrite.com/2011/01/20/the_hip-hop_
word_count_counts_on_language_to_under
Hu, G., & Cao, F. (2011). Hedging and boosting in abstracts of applied linguistics
articles: A comparative study of English- and Chinese-medium journals. Journal of
Pragmatics, 43, 2795–2809.
Hunston, S., & Francis, G. (1998). Verbs observed: A corpus-driven pedagogic gram-
mar. Applied Linguistics, 19(1), 45–72.
Hyland, K. (2008a). Scaffolding during the writing process: The role of informal peer in-
teraction in writing workshops. In D. D. Belcher & A. Hirvela (Eds.), The oral-literate
connection: Perspectives on L2 speaking, writing, and other media interactions (pp. 168–190).
Ann Arbor, MI: University of Michigan Press.
Hyland, K. (2008b). Genre and academic writing in the disciplines. Language Teaching,
41(4), 543–562.
Hyland, K. (2012). Disciplinary identities: Individuality and community in academic discourse.
New York, NY: Cambridge University Press.
Jarvis, S., Grant, L., Bikowski, D., & Ferris, D. (2003). Exploring multiple profiles of
highly rated learner compositions. Journal of Second Language Writing, 12(4), 377–403.
Jefferson, G. (2004). Glossary of transcript symbols with an introduction. In G. H.
Lerner (Ed.), Conversation analysis: Studies from the first generation (pp. 13–31). Amster-
dam: John Benjamins.
Johansson, S., & Hofland, K. (1989). Frequency analysis of English vocabulary and grammar.
Oxford, UK: Oxford University Press.
Johns, A. M. (1995). Teaching classroom and authentic genres: Initiating students into
academic cultures and discourses. In D. Belcher & G. Braine (Eds.), Academic writing
in a second language: Essays on research and pedagogy (pp. 277–291). Norwood, NJ:
Ablex Publishing Corporation.
Johns, A. M. (1997). Text, role, and context. New York, NY: Cambridge University Press.
Johns, A. M. (2009). Tertiary undergraduate EAP: Problems and possibilities. In D.
Belcher (Ed.), English for specific purposes in theory and practice (pp. 41–59). Ann Arbor,
MI: The University of Michigan Press.
Johns, T. (1986). Micro-Concord: A language learner’s research tool. System, 14, 151–162.
Johns, T., & King, P. (1991). Classroom concordancing. ELR Journal, 4, 1–16.
Johns, T. (1994). From printout to handout: Grammar and vocabulary teaching in the
context of data-driven learning. In T. Odlin (Ed.), Perspectives on pedagogical grammar
(pp. 293–313). Cambridge, UK: Cambridge University Press.
Johnston, T., & Schembri, A. (2006). Issues on the creation of a digital archive of a
signed language. In L. Barwick & N. Thieburger (Eds.), Sustainable data from digital
fieldwork (pp. 7–16). Sydney, NSW: University of Sydney Press.
Juola, P., Ryan, M., & Mehok, M. (2011). Geographic localizing Tweets using stylo-
metric analysis. Paper presented at the American Association for Corpus Linguistics
2011, Georgia State University, Atlanta, GA.
Kachru, Y. (2008). Language variation and corpus linguistics. World Englishes, 27(1), 1–8.
Kafes, H. (2012). Cultural traces on the rhetorical organization of research article abstracts.
International Journal on New Trends in Education and their Implications, 3(2), 207–220.
Kaltenböck, G., & Mehlmauer-Larcher, B. (2005). Computer corpora and the language
classroom: On the potential and limitations of computer corpora in language teach-
ing. ReCALL, 17(1), 65–84.
Kaneko, T. (2007). Why so many article errors? Use of articles by Japanese learners of
English. Gakuen, 798, 1–16.
Kaneko, T. (2008). Use of English prepositions by Japanese university students. Gakuen,
810, 1–12.
Kennedy, C., & Miceli, T. (2010). Corpus-assisted creative writing: Introducing inter-
mediate Italian learners to a corpus as a reference resource. Language Learning and
Technology, 14(1), 28–44.
Keuleers, E., Brysbaert, M., & New, B. (2011). An evaluation of the Google Books
Ngrams for psycholinguistic research. In K. M. Würzner & E. Pohl (Eds.), Lexical
resources in psycholinguistic research (vol. 3, pp. 23–27). Potsdam, Germany: Uni-
versitätsverlag Potsdam.
Klimt, B., & Yang, Y. (2004). Introducing the Enron Corpus. Paper presented at the
First Conference on Email and Anti-Spam (CEAS), Mountain View, CA.
Knight, D., Evans, D., Carter, R., & Adolphs, S. (2009). Headtalk, handtalk, and the
corpus: Towards a framework for multi-modal, multi-media corpus development.
Corpora, 4(1), 1–32.
Koester, A. (2010). Workplace discourse. London, UK: Continuum.
Kolb, T., Friginal, E., Lee, M., Tracy, N., & Grieve, J. (2007). Teaching writing within
forestry. In Proceedings from the University Education in Natural Resources Conference
2007, Oregon State University, Corvallis, OR. Retrieved September 22, 2015 from
www.uenr.forestry.oregonstate.edu
Krishnamurthy, R., & Kosem, I. (2007). Issues in creating a corpus for EAP pedagogy
and research. Journal of English for Academic Purposes, 6, 356–373.
Krutka, D., & Carpenter, J. (2016). Why social media must have a place in schools.
Kappa Delta Pi Record, 52(1), 6–10. doi:10.1080/00228958.2016.1123048
Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American
English. Providence, RI: Brown University Press.
Kukulska-Hulme, A., & Shield, L. (2008). An overview of mobile assisted language
learning: From content delivery to supported collaboration and interaction. Re-
CALL, 20(3), 271–289.
Kyle, K., & Crossley, S. (2016). The relationship between lexical sophistication and
independent and source-based writing. Journal of Second Language Writing, 34, 12–24.
Laborde, J. (2011). The evaluation of researchers and the future of Latin American
scientific journals. In A. M. Cetto, & J. O. Alonso (Eds.), Calidad e Impacto de la
revista Iberoamericana (pp. 59–79). Mexico City: Facultad de Ciencias, UNAM.
Retrieved September 22, 2016 from www.latindex.unam.mx/librociri/
Lamy, M. N., & Hampel, R. (2007). Online communication in language learning and teach-
ing. Houndmills: Palgrave Macmillan.
Larsen-Freeman, D., & Long, M. (1991). An introduction to second language acquisition
research (1st ed.). New York, NY: Routledge.
Lee, D., & Swales, J. (2006). A corpus-based EAP course for NNS doctoral students:
Moving from available specialized corpora to self-compiled corpora. English for
Specific Purposes, 25(1), 56–75.
Lee, J. J. (2011). A genre analysis of second language classroom discourse: Exploring the rhetorical,
linguistic, and contextual dimensions of language lessons (Unpublished doctoral disserta-
tion). Atlanta, GA: Georgia State University.
Leech, G. (1992). Corpora and theories of linguistic performance. In J. Svartvik (Ed.),
Directions in corpus linguistics: Proceedings of the Nobel symposium 82, Stockholm, August
1991 (pp. 105–122). Berlin: Mouton de Gruyter.
Leech, G. (1997). Teaching and language corpora: A convergence. In A. Wichmann, S.
Fligelstone, T. McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 1–24).
London, UK: Longman.
Li, S. (2017). A corpus-based study of vague language in legislative texts: Strategic use
of vague terms. English for Specific Purposes, 45, 98–109.
Liu, D., & Lei, L. (2017). Using corpora for language learning and teaching. Annapolis Junc-
tion, MD: TESOL Press.
Long, M. H. (1996). The role of the linguistic environment in second language acqui-
sition. In W. Ritchie & T. Bhatia (Eds.), Handbook of second language acquisition (pp.
413–468). New York, NY: Academic Press.
The Longman Dictionary of Contemporary English (1st ed.). (1987). London, UK: Longman.
The Longman Dictionary of Contemporary English (6th ed.). (2015). London, UK: Longman.
Lopez-Arroyo, B., & Mendez-Cendon, B. (2007). Describing phraseological devices in
medical abstracts: An English/Spanish contrastive analysis. META, 52(3), 503–516.
MacArthur, F., Alejo, R., Piquer-Piriz, A., Amador-Moreno, C., Littlemore, J., Ädel, A.,
…, Vaughn, E. (2014). EuroCoAT. The European Corpus of Academic Talk. Retrieved
September 24, 2016 from www.eurocoat.es
Maher, P. (2015). The role of ‘that’ in managing averrals and attributions in post-
graduate academic legal texts. English for Specific Purposes, 40, 42–56.
Master, P. (1994). The effect of systematic instruction on learning the English article
system. In T. Odlin (Ed.), Perspectives on pedagogical grammar (pp. 229–252). Cam-
bridge, UK: Cambridge University Press.
Mauranen, A. (2003). The Corpus of English as lingua franca in academic settings.
TESOL Quarterly, 37(3), 513–527.
Mauranen, A. (2007). Hybrid voices: English as the lingua franca of academics. In K.
Flottum, T. Dahl, & T. Kinn (Eds.), Language and discipline perspectives on academic
discourse (pp. 244–259). Cambridge, UK: Cambridge Scholars Press.
McCarthy, M., & Handford, M. (2004). Invisible to us: A preliminary corpus-based study
of spoken business English. In U. Connor & T. A. Upton (Eds.), Discourse in the pro-
fessions: Perspectives from corpus linguistics (pp. 167–201). Amsterdam: John Benjamins.
McCarthy, M., O’Dell, F., & Reppen, R. (2010). Basic vocabulary in use. Cambridge,
UK: Cambridge University Press.
McEnery, T., & Hardie, A. (2012). Corpus linguistics. Cambridge, UK: Cambridge
University Press.
McEnery, T., & Oakes, M. (1995). Sentence and word alignment in the Crater project:
Methods and assessment. In S. Warwick-Armstrong (Ed.), Proceedings of the association
for computational linguistics workshop, SIG-DAT workshop (pp. 104–116). Dublin: ACL.
McEnery, T., Xiao, R., & Mo, L. (2003). Aspect marking in English and Chinese using
the Lancaster corpus of Mandarin Chinese for contrastive language study. Literary
and Linguistic Computing, 18(4), 361–378.
McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-based language studies: An advanced
resource book. New York, NY: Routledge.
McMillan, V. (2001). Writing papers in the biological sciences (3rd ed.). New York,
NY: Bedford/St. Martins.
Melander, B. J. M., Swales, J., & Fredrickson, K. (1997). Journal abstracts from the
academic fields in the United States and Sweden: National and disciplinary procliv-
ities. In A. Duszak (Ed.), Cultures and styles of academic discourse (pp. 251–272). Berlin:
Mouton de Gruyter.
Meunier, F., & Reppen, R. (2015). Corpus vs. non-corpus informed pedagogical mate-
rials: Grammar as the focus. In D. Biber & R. Reppen (Eds.), The Cambridge handbook
of English corpus linguistics (pp. 498–514). Cambridge, UK: Cambridge University
Press.
Michel, J. B., Shen, Y., Aiden, A., Veres, A., & Gray, M. (2011). Quantitative analysis
of culture using millions of digitized books. Science, 331, 176–182.
Ming-Tzu, K., & Nation, P. (2004). Word meaning in academic English: Homography
in the academic word list. Applied Linguistics, 25(3), 291–314.
Mizumoto, A., & Chujo, K. (2016). Who is data-driven for? Challenging the mono-
lithic view of its relationship with learning styles. System, 61, 55–64.
Mueller, C., & Jacobsen, N. (2016). A comparison of the effectiveness of EFL students’
use of dictionaries and an online corpus for the enhancement of revision skills. Re-
CALL, 28(1), 3–21.
Mukherjee, J. (2009). The grammar of conversation in advanced spoken learner
English: Learner corpus data and language-pedagogical implications. In K. Aijmer
(Ed.), Corpora and language teaching (pp. 203–230). Amsterdam: John Benjamins.
Mukherjee, J., & Rohrbach, J. M. (2006). Rethinking applied corpus linguistics from
a language-pedagogical perspective: New departures in learner corpus research.
Retrieved March 14, 2014 from www.uni-giessen.de/faculties/f05/engl/ling/staff/
professors/jmukherjee/publications/pdfs/Mukherjee-Rohrbach-2006.pdf
Mustafa, S. (2015). Exploring EAP and ESP applications in Iraqi higher education (Unpub-
lished manuscript). Baghdad, Iraq: University of Baghdad.
Nation, P. (2001). Learning vocabulary in another language. Cambridge, UK: Cambridge
University Press.
Nation, P., & Beglar, D. (2007). A vocabulary size test. The Language Teacher, 31(7),
9–13.
Nation, P., & Waring, R. (1997). Vocabulary size, text coverage, and word lists. In N.
Schmitt & M. McCarthy (Eds.), Vocabulary: Description, acquisition and pedagogy (pp.
6–19). Cambridge, UK: Cambridge University Press.
Nelson, G. (1996). The design of the corpus. In S. Greenbaum (Ed.), Comparing English
worldwide: The International Corpus of English (pp. 27–35). Oxford, UK: Clarendon Press.
Nesi, H. (2008). BAWE: An introduction to a new resource. In A. Frankenberg-Garcia,
T. Rkibi, M. Braga da Cruz, R. Carvalho, C. Direito, & D. Santos-Rosa
(Eds.), Proceedings of the eighth teaching and language corpora conference (pp. 239–246).
Lisbon, Portugal: ISLA.
Nesi, H. (2011). BAWE: An introduction to a new resource. In A. Frankenberg-Garcia,
L. Flowerdew, & G. Aston (Eds.), New trends in corpora and language learning (pp.
213–228). London, UK: Continuum.
Nesi, H., & Basturkmen, H. (2006). Lexical bundles and discourse signaling in aca-
demic lectures. International Journal of Corpus Linguistics, 11(3), 282–304.
Nesi, H., & Gardner, S. (2012). Genres across the disciplines: Student writing in higher educa-
tion. New York, NY: Cambridge University Press.
Nini, A. (2014). Multidimensional analysis tagger 1.2 – Manual. Retrieved from http://
sites.google.com/site/multidimensionaltagger
Nini, A., Corradini, C., Guo, D., & Grieve, J. (2016). The application of growth curve
modeling for the analysis of diachronic corpora. Language Dynamics and Change.
doi:10.1017/S1360674316000113
Nolen, M. (2017). Exploring corpora: Developing exploratory habits with data-driven
learning in a study abroad setting (Unpublished Master’s paper). Department of Ap-
plied Linguistics and ESL, Georgia State University.
O’Donnell, M. B., & Römer, U. (2012). From student hard drive to web corpus (Part 2):
The annotation and online distribution of the Michigan Corpus of Upper-level Student
Papers (MICUSP). Corpora, 7(1), 1–18.
O’Donnell, M. B., Römer, U., & Ellis, N. (2013). The development of formulaic se-
quences in first and second language writing: Investigating effects of frequency,
association, and native norm. International Journal of Corpus Linguistics, 18(1), 83–108.
O’Keeffe, A., McCarthy, M., & Carter, R. (2007). From corpus to classroom. Cambridge,
UK: Cambridge University Press.
The Oregon Department of Education. (2002). Instructional technology ideas and re-
sources for Oregon teachers. Retrieved January 6, 2010 from www.ode.state.or.us/
teachlearn/subjects/technology/instrtec.pdf
Pennebaker, J. W., Chung, C. K., Ireland, M., Gonzalez, A., & Booth, R. J. (2017).
The development and psychometric properties of LIWC. Austin, TX. Retrieved
October 6, 2017 from www.LIWC.net
Pérez-Paredes, P., Sánchez-Hernández, P., & Aguado-Jiménez, P. (2011). The use
of adverbial hedges in EAP students' oral performance. In V. Bhatia, P. Sánchez-
Hernández, & P. Pérez-Paredes (Eds.), Researching specialized languages (pp. 95–114).
Amsterdam: John Benjamins.
Pica, T. (1994). Research on negotiation: What does it reveal about second-language
learning conditions, processes, and outcomes? Language Learning, 44(3), 493–527.
Pickering, L. (2006). Current research on intelligibility in English as a lingua franca.
Annual Review of Applied Linguistics, 26, 219–233.
Pickering, L., Friginal, E., & Staples, S. (Eds.). (2016). Talking at work: Corpus-based
explorations of workplace discourse. London, UK: Palgrave Macmillan.
Plakans, L., & Gebril, A. (2013). Using multiple texts in an integrated writing assessment:
Source text use as a predictor of score. Journal of Second Language Writing, 22(3), 217–230.
Polat, B. (2013). L2 experience interviews: What can they tell us about individual dif-
ferences? System, 41, 70–83.
Quirk, R., Greenbaum, S., Leech, G. N., & Svartvik, J. (1972). A grammar of contempo-
rary English. London, UK: Longman.
Rayson, P. (2003). WMatrix: A statistical method and software tool for linguistic analysis through cor-
pus comparison (Unpublished doctoral dissertation). Lancaster University, Lancaster, UK.
Rayson, P. (2008). From key words to key semantic domains. International Journal of
Corpus Linguistics, 13(4), 519–549.
Reinhardt, J. (2010). Directives in office hour consultations: A corpus-informed inves-
tigation of learner and expert usage. English for Specific Purposes, 29(2), 94–107.
Reiter, J. (2011). Lexical variation in inaugural addresses: A research proposal and pre-
liminary results (Unpublished manuscript). Department of Applied Linguistics and
ESL, Georgia State University, Atlanta, GA.
Reppen, R. (2010). Using corpora in the language classroom. Cambridge, UK: Cambridge
University Press.
Richards, J. C. (2008). Moving beyond the plateau. New York, NY: Cambridge Univer-
sity Press.
Roberson, A. (2015). The Second Language Peer Response (L2PR) corpus. Atlanta, GA:
Georgia State University.
Roberts, J., & Samford, W. (2013). Classroom applications of available corpus tools: Tips and
suggestions for analyzing and utilizing specialized corpora in the classroom. Poster presented
at TESOL Convention 2013, Dallas, TX, March 21, 2013.
Robinson, M., Stoller, F., Jones, J., & Costanza-Robinson, M. (2010). Write like a chem-
ist. Oxford, UK: Oxford University Press.
Römer, U. (2009). The inseparability of lexis and grammar: Corpus linguistic perspec-
tives. Annual Review of Cognitive Linguistics, 7, 140–162.
Römer, U. (2010). Establishing the phraseological profile of a text type: The construc-
tion of meaning in academic book reviews. English Text Construction, 3(1), 95–119.
Römer, U. (2011). Corpus research applications in second language teaching. Annual
Review of Applied Linguistics, 31, 205–225.
Römer, U., & O’Donnell, M. B. (2011). From student hard drive to web corpus (Part 1):
The design, compilation and genre classification of the Michigan Corpus of Upper-level
Student Papers (MICUSP). Corpora, 6(2), 159–177.
Römer, U., & Wulff, S. (2010). Applying corpus methods to written academic texts:
Explorations of MICUSP. Journal of Writing Research, 2(2), 99–127.
Rosenkrans, W. (2014). Survival factors. Aerosafety World. Retrieved from https://
flightsafety.org/asw-article/survival-factors-2/.
Rundell, M. (2007). Macmillan English dictionary for advanced learners (2nd ed.). Oxford:
Macmillan.
Rutherford, W. (1987). Second language grammar: Learning and teaching. London: Longman.
Sainani, K., Eliott, C., & Harwell, D. (2015). Active vs. passive voice in scientific
writing. Retrieved March 14, 2016 from www.acs.org/content/dam/acsorg/events/
professional-development/Slides/2015-04-09-active-passive.pdf
Samford, W. (2013). Cognitive and emotional elements of female online journals in
relation to age: An exploratory corpus study (Unpublished manuscript). Department
of Applied Linguistics and ESL, Georgia State University, Atlanta, GA.
Schlitz, S. A. (2010). Introduction to special issue: Exploring corpus-informed ap-
proaches to writing research. Journal of Writing Research, 2(2), 91–98.
Schmidt, R., & Frota, S. (1986). Developing basic conversational ability in a second
language: A case study of an adult learner of Portuguese. In R. Day (Ed.), Talking
to learn: Conversation in second language acquisition. Rowley, MA: Newbury House.
Scocco, D. (2007). Copyright Law: 12 Dos and Don’ts. DailyBlogTips. Retrieved March
22, 2016 from www.dailyblogtips.com/copyright-law-12-dos-and-donts
Scott, M. (1997). PC analysis of key words – and key key words. System, 25(2), 233–245.
Scott, M. (2012). WordSmith Tools (Version 6). [Software]. Retrieved March 22, 2016
from http://lexically.net/wordsmith
Seidlhofer, B. (2007). Common property: English as a lingua franca in Europe. In
J. Cummins & C. Davison (Eds.), International handbook of English language teaching
(pp. 137–153). New York, NY: Springer.
Seidlhofer, B. (2012). Anglophone-centric attitudes and the globalization of English.
Journal of English as a Lingua Franca, 1(2), 393–407.
Shafer, E., & Yates, A. (2012). Although, on the other hand: Using MICASE to identify
lecture structure by teaching discourse markers (Unpublished course paper). Georgia State
University, Atlanta, GA.
Shetty, J., & Adibi, J. (2004). The Enron dataset database schema and brief statistical re-
port. Information Sciences Institute Technical Report, University of Southern California.
Simpson, R., Briggs, S., Ovens, J., & Swales, J. (2002). The Michigan corpus of academic
spoken English. Ann Arbor, MI: The Regents of the University of Michigan.
Simpson-Vlach, R. (2013). Corpus analysis of spoken English for academic purposes. In
C. Chapelle (Ed.), The encyclopedia of applied linguistics (pp. 452–461). Malden, MA:
Wiley Blackwell.
Simpson-Vlach, R., & Leicher, S. (2006). The MICASE Handbook. Ann Arbor, MI:
University of Michigan Press.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford, UK: Oxford University Press.
Sinclair, J. (2003). Collins COBUILD advanced learner’s English dictionary. Glasgow, UK:
HarperCollins Publishers.
Sinclair, J. (2004). Trust the text: Language, corpus and discourse. New York, NY:
Routledge.
Sinclair, J. (2005). Corpus and text – basic principles. In M. Wynne (Ed.), Developing lin-
guistic corpora: A guide to good practice (pp. 1–16). Oxford, UK: Oxbow Books. Re-
trieved January 6, 2010 from www.ahds.ac.uk/creating/guides/linguistic-corpora/
chapter1.htm
Smiciklas, M. (2012). The power of infographics: Using pictures to communicate and connect
with your audience. Indianapolis, IN: Pearson Education.
Staples, S., Egbert, J., Biber, D., & McClair, A. (2013). Formulaic sequences and EAP
writing development: Lexical bundles in the TOEFL iBT writing section. Journal of
English for Academic Purposes, 12(3), 214–225.
Stevens, V. (1995). Concordancing with language learners: Why? When? What?
CAELL Journal, 6, 2–10.
Stevenson, M. (2016). A critical interpretive synthesis: The integration of automated
writing evaluation into classroom writing instruction. Computers and Composition,
42, 1–16.
Swales, J. (1985). English as the international language of research. RELC Journal, 16(1),
1–7.
Swales, J., & Feak, C. B. (2009). Abstracts and the writing of abstracts. Ann Arbor, MI:
The University of Michigan Press.
Tagliamonte, S. A. (2006). Analysing sociolinguistic variation. Cambridge, UK: Cambridge
University Press.
Tagliamonte, S. A., & Roeder, R. V. (2009). Variation in the English definite article:
Socio-historical linguistics in t’speech community. Journal of Sociolinguistics, 13(4),
435–471.
Tanko, G. (2004). The use of adverbial connectors in Hungarian university students’
argumentative essays. In J. Sinclair (Ed.), How to use corpora in language teaching (pp.
157–181). Philadelphia, PA: John Benjamins Publishing Company.
Thorne, S., Reinhardt, J., & Golombek, P. (2008). Mediation as objectification in the
development of professional discourse: A corpus-informed curricular innovation. In
J. Lantolf & M. Poehner (Eds.), Sociocultural theory and the teaching of second languages
(pp. 256–284). London: Equinox.
Timmis, I. (2015). Corpus linguistics for ELT: Research and practice. New York, NY:
Routledge.
Tognini-Bonelli, E. (2001). Corpus linguistics at work. Philadelphia, PA: John Benjamins.
Tribble, C. (2015). Teaching and language corpora: Perspectives from a personal jour-
ney. In A. Leńko-Szymańska & A. Boulton (Eds.), Multiple affordances of language
corpora for data-driven learning (pp. 15–36). Philadelphia, PA: John Benjamins.
Vasbø, K., Silseth, K., & Erstad, O. (2014). Being a learner using social media in school:
The case of Space2cre8. Scandinavian Journal of Educational Research, 58(1), 110–126.
doi:10.1080/00313831.2013.773555
Vaughanbell. (2017). Should we stop saying ‘commit’ suicide? [Mindhacks]. Retrieved
from https://mindhacks.com/2017/08/12/should-we-stop-saying-commit-suicide/
Walsh, K. (2014). Engaging uses of instructional technology [EmergingEdTech!]. Retrieved
January 3, 2017 from www.emergingedtech.com/
Warren, M. (2004). //So what have YOU been WORking on REcently//: Compiling
a specialized corpus of spoken business English. In U. Connor & T. A. Upton (Eds.),
Discourse in the professions: Perspectives from corpus linguistics (pp. 115–140). Philadelphia,
PA: John Benjamins.
Weigle, S. C., & Friginal, E. (2015). Linguistic dimensions of impromptu test essays
compared with successful student disciplinary writing: Effects of language back-
ground, topic, and L2 proficiency. Journal of English for Academic Purposes, 18, 25–39.
Weisser, M. (2016). Practical corpus linguistics: An introduction to corpus-based language anal-
ysis. Malden, MA: Wiley Blackwell.
Weisser, M. (2017). Corpus-based Linguistics Links [website]. Retrieved June 19, 2017
from http://martinweisser.org/corpora_site/CBLLinks.html
Wichmann, A., Fligelstone, S., McEnery, T., & Knowles, G. (Eds.). (1997). Teaching and
language corpora. London, UK: Longman.
Xiao, R. (2009). Multidimensional analysis and the study of world Englishes. World
Englishes, 28(4), 421–450.
Xiao, R., & McEnery, T. (2005). Two approaches to genre analysis. Journal of English
Linguistics, 33(1), 62–82.
Yip, V. (1994). Grammatical consciousness-raising and learnability. In T. Odlin (Ed.),
Perspectives on pedagogical grammar (pp. 123–138). Cambridge, UK: Cambridge
University Press.
Yoon, H., & Hirvela, A. (2004). ESL student attitudes toward corpus use in L2 writing.
Journal of Second Language Writing, 13, 257–283.
Zappavigna, M. (2017). ‘Had enough of experts’: Intersubjectivity and the quoted voice in
microblogging. In E. Friginal (Ed.), Studies in corpus-based sociolinguistics (pp. 319–341).
New York, NY: Routledge.
Zhu, Y., & Friginal, E. (2015). Interactive visual text analysis for corpus-based learning.
In Proceedings of IEEE big data computing service and applications conference,
Atlanta, GA.
Index

Active voice 204
American and British Office Talk (ABOT) Corpus 83
Academic Word List (AWL) 47, 233
American Council on the Teaching of Foreign Languages (ACTFL) 245, 279
American National Corpus (ANC) 104
Annotated bibliography of CL studies 159
Annotated corpora 17
AntConc 46, 51, 268, 277, 283, 317
Biber Tagger 150
British Academic Spoken English Corpus (BASE) 95
British Academic Written English (BAWE) 91
British National Corpus (BNC) 103
Brown Corpus 20
BYU Corpora (COCA, COHA, others) 107, 108
Computer-Assisted Language Learning (CALL) 33
Call Center Corpus 296
Cambridge and Nottingham Business English Corpus (CANBEC) 83
CliC 157
Coh-Metrix 57, 155
Cohesion 57
Collecting corpora 114
Collecting spoken texts 136
Collecting written texts 121
Collocations 50
Comparable corpora 18
Compleat Lexical Tutor 156
Complexity and sophistication 57
Concordances 49
Constituent Likelihood Automatic Word-tagging System (CLAWS) 22, 83
Corpus, annotation and markup 150
Corpus, definition 10
Corpus collection process 115
Corpus linguistics introduction, definition 12
Corpus linguistics, brief history 19
Corpus of Contemporary American English (COCA) 107
Corpus tools, list 148, 152–159
Data-Driven Learning (DDL) 39
English as a Lingua Franca (ELF) 96
English as a Lingua Franca in Academic Contexts (ELFA) 96
English for Academic Purposes (EAP) 27, 40, 47, 124, 137
English for Specific Purposes (ESP) 27, 40, 218, 222, 268
Enron Email Corpus 56, 84
European Corpus of Academic Talk (EuroCoAT) 99
Explorer’s Digital Journal (EDJ) 279–290
Frequency 6, 47
General corpora 16
Google Ngram Viewer/Google Books 64, 110, 111, 250
Grammar instruction 243
Hip-Hop language 76–78
Hip-Hop Word Count 77
Hong Kong Corpus of Spoken English (prosodic) (HKCSE) 84
Instructional technology 27
Intensive English Program (IEP) 3, 137, 218, 233
Intercultural spoken interactions 295
International Corpus of English (ICE) 104
International Corpus of Learner English (ICLE) 89
International Corpus Network of Asian Learners of English (ICNALE) 101
International English Language Testing System (IELTS) 121
International Teaching Assistants (ITA) 29, 147
ITAcorp 147
Keyness 54, 157
Key word analysis 54
Key Word in Context (KWIC) 49
Learner Corpus Association (LCA) 111
Learner Corpus Bibliography 112, 160
Lexical bundles 56
Linking adverbials 200, 202–206, 268, 277
Louvain International Database of Spoken English Interlanguage (LINDSEI) 97
Linguistic co-occurrence 58
Linguistic Inquiry and Word Count (LIWC) 156
LOB Corpus 18, 21
Longman Dictionary of Contemporary English (LDOCE) 213–214
Longman Grammar of Spoken and Written English (LGSWE) 5, 8, 243
MAT Tagger 150–151
Michigan Corpus of Academic Spoken English (MICASE) 93–96, 127, 279
Michigan Corpus of Upper-Level Student Papers (MICUSP) 19, 88, 277
Monitor corpora 18
Multidimensional analysis 58–59
Multiword units 55
N-grams 55
Non-English corpora 112
Online collections 107
Online directories 148
Parallel corpora 18
Passive structures 204–207, 287
P-frames 56
Political speeches 61, 305
Prefabricated chunks, see Multiword units 55
Qualitative coding software 156
Quantifiers 263–266
Real Grammar 8, 243–244
Reference corpora, see General corpora 16
Register 5, 14
Reporting verbs 195–200
Second language acquisition (SLA) 4, 224
Second Language Classroom Discourse (L2CD) Corpus 137
Second Language Peer Response (L2PR) Corpus 143
Sketch Engine 71
Specialized corpora 16
Spoken learner corpora 93
Spoken-written academic corpora 100
Stanford Parser/Tagger 155
T2K-SWAL Corpus 100–101
TAGSExplorer 75–76
Task-Based Language Teaching (TBLT) 281
Technology integration 30
Telic Verbs, Telicity 252–261
Test of English as a Foreign Language (TOEFL) 121
Text Lex Compare 305–310
Text visualization 322
Usage-based learning (UBL) 224
Varieties of English 102
Verb usage 246–249
Verb tenses 204–210
Visualizing online language 73–75
VocabProfile 156, 233
Vocabulary instruction 213
Vienna-Oxford International Corpus of English (VOICE) 96
WebParaNews (WPN) 226
Wmatrix 155
WordandPhrase.Info 218–222
Word Cloud 60–62, 308–309
Writing in Forestry 189
Written learner corpora 86