Corpus Linguistics for English Teachers: New Tools, Online Resources, and Classroom
Activities describes Corpus Linguistics (CL) and its many relevant, creative, and
engaging applications to language teaching and learning for teachers and prac-
titioners in TESOL and ESL/EFL, and graduate students in applied linguistics.
English language teachers, both novice and experienced, can benefit from the
list of new tools, sample lessons, and resources as well as the introduction of
topics and themes that connect CL constructs to established theories in lan-
guage teaching and second language acquisition.
With ready-to-use teaching vignettes, tips and step-by-step guides, case studies
with practitioner interviews, and discussion of corpora and corpus tools, Corpus
Linguistics for English Teachers is a thoughtfully designed and skillfully executed
resource, bridging theory with practice for anyone looking to understand and
apply corpus-based tools dynamically in the language learning classroom.
Eric Friginal
First published 2018
by Routledge
711 Third Avenue, New York, NY 10017
and by Routledge
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2018 Taylor & Francis
The right of Eric Friginal to be identified as author of this work has
been asserted by him in accordance with sections 77 and 78 of the
Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or
reproduced or utilised in any form or by any electronic, mechanical,
or other means, now known or hereafter invented, including
photocopying and recording, or in any information storage or retrieval
system, without permission in writing from the publishers.
Every effort has been made to contact copyright-holders. Please advise
the publisher of any errors or omissions, and these will be corrected in
subsequent editions.
Trademark notice: Product or corporate names may be trademarks
or registered trademarks, and are used only for identification and
explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
A catalog record for this title has been requested
List of Figures xi
List of Tables xiii
Acknowledgments xv
Part A
Corpus Linguistics for English Teachers: Overview,
Definitions, and Scope 1
Part B
Tools, Corpora, and Online Resources 79
Part C
Corpus-Based Lessons and Activities in the Classroom 185
References 331
Index 347
Acknowledgments
This book synthesizes data, findings, and ideas in corpus-based teaching and
research from my work with mentors, colleagues, students, and the many
prominent publications from applied CL practitioners over the past two de-
cades. I present materials that I have myself developed as well as those produced
with my former students and collaborators. I would not have been able to finish
this book without their help! I thank Mike Cullom for his invaluable insights
and reviews of several drafts, guiding its completion to directly address teach-
ers’ needs and relevant practical concerns. As always, thanks to Doug Biber and
Randi Reppen of Northern Arizona University (NAU) for their constant sup-
port and encouragement; the faculty, staff, and students of the Department of
Applied Linguistics and ESL at Georgia State University (GSU); Jack A. Hardy
(Oxford College of Emory University); Ute Römer (GSU); Joseph J. Lee (Ohio
University); Brittany Polat (Lakeland, FL); Audrey Roberson (Hobart and
William Smith Colleges); my Routledge editors and reviewers; and the staff of
the Longview Public Library, Longview, WA.
Much appreciation to Susan Conrad (Portland State University), Carol Chapelle
(Iowa State University), Joan Jamieson (NAU), Ying Zhu of the Creative Media
Industries Institute (GSU), Laurence Anthony (Waseda University), Mark
Davies (Brigham Young University), Jack Grieve (University of Birmingham),
Tony McEnery (Lancaster University), and Tom Cobb (Université du Québec à
Montréal) for leading the way with their research, teaching approaches, and the
corpora and innovative corpus tools that they have developed and shared freely
with all of us. I’d like to acknowledge the support of Tom Kolb; Martha Lee;
and all faculty, staff, and students at the School of Forestry at NAU for giving me
the opportunity to develop my very first corpus-based academic writing course
in 2005. Finally, thanks to all my contributors and former students for joining
me in this journey: Cynthia Berger, Maxi-Ann Campbell, Melinda Childs,
Sean Dunaway, Peter Dye, Lena Emeliyanova, Tia Gass, Tyler Heath, Jonathan
McNair, Tamanna Mostafa, Robert Nelson, Matthew Nolen, Janet Beth Randall,
Jennifer Roberts, and Marsha Walker.
For Nanay and Tatay
Eric Friginal
Part A
Corpus Linguistics for
English Teachers
Overview, Definitions,
and Scope
A1
Corpus Linguistics
for English Teachers
An Introduction
This book is about Corpus Linguistics (CL) and its many relevant applications
to language teaching and learning. In this setting, CL includes components
such as computers and the internet, corpora, corpus tools, online databases (es-
pecially those with an interface to search electronic texts), and frequency-based
data or results from analyses of corpora (e.g., word lists, key words, n-grams,
part-of-speech or POS tags, and many others). The applications of CL include
incorporating corpus tools and the analysis of corpora in a classroom activity or
a homework assignment, using corpus-based online databases to help students
conduct a mini-research project or respond to a question about language pat-
terns (e.g., “What are the top 5 collocations of the word freedom in newspaper
writing?”), teachers and students’ collecting their own corpora for analysis in a
writing classroom, or exploring vocabulary patterns of use from concordancers
and corpus-based dictionaries. Language learners may come from a range of
first (L1) and second (L2) language backgrounds, language learning experi-
ences, or fields of study. The common thread is that they are learning spoken
and written English, a majority of them coming from university settings in
the United States (US) and similar locations, but their needs will also overlap
with English learners from all over the world. They also have commonalities
across writing courses; vocabulary and grammar classes in an Intensive English
Program (IEP); graduate-level academic writing programs; or sociolinguistics,
creative writing, and literature classes exploring variation in language form
and use.
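Many of the frequency-based outputs named above (word lists, collocations) are simple to compute once a corpus exists as plain text. The following Python fragment is an illustrative sketch only, not a real corpus tool: the tiny invented "corpus," the simplistic tokenizer, and the four-word collocation span are all assumptions for demonstration.

```python
import re
from collections import Counter

def tokenize(text):
    """Very simple tokenizer: lowercase alphabetic strings only."""
    return re.findall(r"[a-z']+", text.lower())

def word_list(tokens):
    """A frequency word list: the most basic corpus-based output."""
    return Counter(tokens)

def collocates(tokens, node, span=4):
    """Count words occurring within `span` tokens of each hit of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - span):i])   # left context
            counts.update(tokens[i + 1:i + 1 + span])   # right context
    return counts

# Tiny invented 'corpus' for demonstration.
tokens = tokenize("Freedom of the press is freedom of the people. "
                  "The press guards freedom.")
print(word_list(tokens).most_common(3))
print(collocates(tokens, "freedom").most_common(3))
```

Dedicated concordancers apply the same logic at scale, with more careful tokenization and statistical association measures (e.g., mutual information) layered on top of the raw co-occurrence counts.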
The book is written primarily for English teachers. Teachers ranging from
those with a limited background in CL to those who have taken a CL course
in teacher preparation or master’s programs may benefit from the book’s list
of tools, sample lessons, and resources as well as its introduction of topics and
themes that connect CL constructs to established theories in language teaching
and Second Language Acquisition (SLA). The CL resources in this book may
also supplement the ones that experienced teachers have already incorporated
into their classes. My intent is to summarize developments in applied CL over
the past 10 years, from approximately 2007 to 2018, highlighting recent
innovations ranging from downloadable concordancers and taggers to freely
available learner corpora. I also developed this book in response to the expressed
needs of the students who take my undergraduate and graduate courses in CL,
Technology and Language Teaching, Corpus-Based Sociolinguistics, and Research Meth-
ods in Applied Linguistics at Georgia State University (GSU). My former and
current students, through their many thoughtful comments and reflections, have
given direction to and shaped the themes and foci of this book.
As an applied corpus linguist, I was greatly influenced by my mentors Douglas
Biber and Randi Reppen of the applied linguistics program in the English De-
partment at Northern Arizona University (NAU). It was while working there
with Doug that I learned about CL’s direct applications to the study of linguistic
variation, especially the broader exploration of lexico-syntactic characteristics
of spoken and written language. Randi offered an innovative seminar on CL
and language teaching when I was a doctoral student at NAU, and, under her
supervision, I developed a suite of lessons and corpus-based tools intended
for students of forestry writing laboratory reports and for trainees in profes-
sional, intercultural communication contexts. I provide an account of my
corpus-based writing course in forestry in Section C1. Randi’s publications on
this particular CL topic include various corpus-based vocabulary and grammar
books for English as a Second/Foreign Language (ESL)/EFL students (e.g., the
Grammar and Beyond series), especially Using Corpora in the Language Classroom
(2010), one of the first books written specifically for teachers to support them
in making use of actual corpus tools and data in their classrooms. It is my hope
that the book you are reading honors and recognizes with great appreciation
Doug’s and Randi’s influential work in the field.
One of the first edited volumes of papers on corpora and language teaching
was published in 1997 and actually comprised studies presented three years
prior, in spring 1994, at the conference on Teaching and Language Corpora
(TALC) at Lancaster University. The volume, edited by Anne Wichmann,
Steven Fligelstone, Tony McEnery, and Gerry Knowles, was entitled Teach-
ing and Language Corpora. Both the TALC conference and the edited vol-
ume resulted from discussions among members of the International Computer
Archive of Modern English (ICAME) on the emerging need to establish a
teaching-oriented CL conference. In 1997, corpus work in language teaching
and learning was, essentially, still in its infancy; online databases, downloadable
concordancers, for-purchase learner corpora, and collections of lesson plans
were still not freely or readily available.
In his chapter “Teaching and Language Corpora: A Convergence,” Geoffrey
Leech noted that there was every reason to believe that language corpora would
have a role of growing importance in teaching. He added that, even at that time,
there were enthusiastic testimonials on the richness of the interest and experience
already being applied to the convergence of language teaching and language re-
search through the link of corpus-based methods. Leech discussed the benefits
of the direct use of corpora in teaching, and he outlined the advantages of using
the computer in language learning, which aligned with Computer-Assisted Lan-
guage Learning (CALL) principles: (1) automatic searching, sorting, and scoring;
(2) promoting a learner-centered approach; (3) open-ended supply of language
data; and (4) enabling the learning process to be tailored. In 1994, these ideas
were still somewhat difficult to operationalize in the language classroom. Com-
puters were still bulky, heavy, and very expensive, and designing, collecting,
and storing open-ended linguistic data were not things a language teacher could
quickly or easily accomplish. Leech envisioned future teacher-learner applica-
tions of various corpora in the classroom that seemed possible for only a select few
at that point. Over 20 years later, we now have computer-based tools and mate-
rials to which virtually everyone has global access. Hardware and software have
become immediately accessible online, and data storage has evolved from floppy
disks, CD-ROMs, and flash drives to cloud-based storage solutions. Researchers
and teachers freely share collections of texts, and ICAME has created various
groups and conferences targeting a range of practitioners, language learners, and
materials developers. Potential resources have become real and readily available.
Leech went on to work with Doug Biber, Stig Johansson, Edward Finegan,
and Susan Conrad to develop and publish the Longman Grammar of Spoken and
Written English (LGSWE) (see also Section C3). The LGSWE is a seminal work
focusing on the convergence of frequency data from corpora, varieties of English
(British and American), and spoken vs. written registers to comprehensively
describe the grammar of the English language. An important implication here
is that there are multiple ‘grammars’: dialect-based grammar, the grammar of
speech and writing, and the grammar that is mediated and modified by registers.
These are all relevant in directing learners to acquire the comprehensive skills
and awareness essential to successfully using English across contexts. Clearly, the
goal is to value descriptions of language features in use and to allow learners to
appreciate the range of vocabulary and grammar options available to them. CL
frequency data offer options, if I can put it this way, not new prescriptions.
Teachers need to fully understand the meaning and implications of frequency
distributions and numbers indicating the likelihood or percentage of occurrence
of a feature and how to best share this information with their students. Students
will have to incrementally learn what linguistic variation in everyday language
means, together with its sociocultural values. Correctness and accuracy in using
language, however these are defined, are clearly important constructs in CL; but
rather than prioritizing prescribed (i.e., ‘correct’) forms, CL puts actual fre-
quencies of use, not intuitions, in the forefront, alongside full attention to the
contexts in which language occurs.
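Because corpora differ in size, the frequency numbers teachers encounter are normally reported as normalized rates rather than raw counts; a rate per million words is a common convention. A minimal sketch of the arithmetic, with invented counts for the modal *must* in two hypothetical samples:

```python
def per_million(raw_count, corpus_size):
    """Convert a raw frequency to a rate per million words so counts
    from corpora of different sizes can be compared."""
    return raw_count / corpus_size * 1_000_000

# Hypothetical counts: 'must' in a 5-million-word spoken sample
# versus a 20-million-word academic-writing sample.
print(per_million(3_500, 5_000_000))    # 700.0 per million
print(per_million(31_000, 20_000_000))  # 1550.0 per million
```

Reported as rates, the two samples can be compared directly even though their raw counts come from corpora of very different sizes.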
Your Reading List: Notable “Corpora in the Language Classroom” books pub-
lished in the past 10 years (from 2007). As previously noted, the LGSWE (pub-
lished in 1999) is what I consider to be the default corpus-based resource,
especially for grammar instruction, but the following textbook treatments
of CL and language teaching may speak directly to your immediate needs to
create classroom activities and lesson plans. I have provided level categories
(introductory, topic-specific, or advanced) to help guide you as to how these
books could be best utilized as additional resources.
Anderson, W. & Corbett, J. (2009). Exploring English with online corpora. New
York: Palgrave Macmillan.
This book introduces readers to available electronic resources (up until early
2009) and demonstrates how teachers can utilize corpora in the classroom. A
glossary of practical terms and topics, interactive tasks, and further readings
are provided. Level: Introductory.
and writing.” I highly recommend this book for English teachers across
levels, especially those interested in developing materials, including grammar
textbook writers. See Section C3, which briefly illustrates how Real Grammar
could be incorporated directly into a lesson on necessity modal verbs. Level:
Topic-Specific (Grammar).
Flowerdew, L. (2012). Corpora and language education. New York: Palgrave Mac-
millan.
Flowerdew provides a critical examination of key concepts and issues in CL,
focusing on the interdisciplinary nature of the field and the role that written
and spoken corpora now play in these different disciplines. Corpus-based
case studies are presented to show central themes and best practices in CL
research. An ‘application’ section discusses CL in teaching arenas, exploring
the pedagogic relevance of corpora. Citing Cook (1998), Flowerdew suggests
that corpora are a contribution to, rather than a solid base of, materials in the
data-driven classroom. Level: Topic-Specific.
Friginal, E. & Hardy, J.A. (2014a). Corpus-based sociolinguistics: A guide for stu-
dents. New York: Routledge.
Jack A. Hardy and I intended to generate ideas about how sociolinguistic re-
search and linguistic distributions from corpora can be effectively merged to
produce a range of meaningful studies. The teaching applications are primar-
ily for upper-level undergraduate and graduate students taking Language in
Society or Sociolinguistics courses. Students are provided detailed information
on corpus collection, tools, and available (sociolinguistic) corpora that can
be used for semester-long research projects. Level: Topic-Specific, Advanced.
Friginal, E., Lee, J., Polat, B., & Roberson, A. (2017). Exploring spoken English learner
language using corpora: Learner talk. New York: Palgrave Macmillan.
This book focuses on corpus-based analyses of learner oral production
in university-level English or ESL classrooms. Our analyses are primarily
research-based, but pedagogical applications are discussed in three spe-
cific areas of student oral production: (1) English for Academic Purposes (EAP)
classrooms, (2) English language experience interviews, and (3) peer response/
feedback activities (see Section B2 for additional descriptions of this project and
our corpus design and collection). Level: Topic-Specific, Advanced.
Liu, D. & Lei, L. (2017). Using corpora for language learning and teaching. Annap-
olis Junction, MD: TESOL Press.
Liu and Lei ask readers, “How Can You Use Corpora in Your Classroom?” As
one of the newest additions to CL in the classroom literature, this book ad-
dresses the needs of today’s teachers for a “step-by-step hands-on introduc-
tion to the use of corpora for teaching a variety of English language skills
O’Keeffe, A., McCarthy, M. & Carter, R. (2007). From corpus to classroom. Cam-
bridge: Cambridge University Press.
From Corpus to Classroom is another recommended resource, which summa-
rizes accessible corpus research in the classroom (until 2007). O’Keeffe, Mc-
Carthy, and Carter explain how corpora can be developed and what they
tell teachers and researchers about language learning. The book intends to
answer key questions, such as, “Is there a basic, everyday vocabulary for En-
glish?”, “How should idioms be taught?”, and “What are the most common
spoken language chunks?”, among others. Level: Topic-Specific.
Timmis, I. (2015). Corpus linguistics for ELT: Research and practice. New York:
Routledge.
Timmis is an experienced language teacher who developed this book as an
accessible, hands-on introduction to using corpora in classroom contexts.
The main focus here is to emphasize a data-rich approach to pedagogy and
how frequency-based information may contribute to effective classroom
teaching. Level: Introductory, Topic-Specific.
A1.2 What is a Corpus?
The word corpus (plural corpora), from the Latin for “body,” refers to a
collection of texts stored on a computer. Note that references to text are not
limited to language that was initially written; a text can also be a
transcription of speech. These electronic texts are equivalent to
many research studies over the years. Some users of corpora maintain that “corpus
linguistics” does not exist as a separate field and that “corpus-based research,” or
variants such as corpus-assisted or corpus-informed research, would be more
accurate descriptors.
There have been attempts to define CL as its own linguistic field, prompting
distinctions between “corpus-based” (i.e., a research approach or method) and
“corpus-driven” (i.e., as a theory-generating branch in the field of linguistics).
Corpus-based analysis is “a methodology that uses corpus evidence mainly as
a repository of examples to expound, test or exemplify given theoretical state-
ments” (Tognini-Bonelli, 2001, p. 10). This means that corpora can be used to
answer preexisting questions about preexisting suppositions in frameworks that
have already been accepted by scholars in the field. Such analysis has also been
described as top-down because features of language under investigation are
known and chosen before going down to explore the lower levels of individual
texts. The corpus-driven perspective is more inductive, or bottom-up, in that
the linguistic features that are investigated come directly from analyses of the
corpus, not from categories preestablished by the researcher. In a corpus-driven
approach, the commitment of the linguist is to the integrity of the data as a
whole, and descriptions aim to be comprehensive with respect to corpus evi-
dence. The corpus, therefore, is seen as more than a repository of examples to
back preexisting theories or a probabilistic extension to an already well-defined
system. The theoretical statements are fully consistent with, and reflect directly,
the evidence provided by the corpus. Text examples and extracted patterns
are normally taken verbatim; in other words, they are not adjusted in any way
to fit the predefined categories of the analyst (Tognini-Bonelli, 2001). Biber
(2009), however, describes how such research tries to minimize the assumptions
about linguistic constructs that the researcher brings to the analysis. Instead,
the data are expected to speak for themselves (Friginal & Hardy, 2014a).
Although corpus-based and corpus-driven approaches can be thought of
as dichotomous, they are more like two poles on either end of a continuum.
There are areas of research in which corpora are used purely to find examples
of predefined linguistic features (e.g., most common stance words), and, at the
same time, there are truly corpus-driven studies, such as allowing computer
programs to determine the likelihood of multiple words’ being used together
(e.g., the concept of lexical bundles or formulaic sequences). The similarity of
these approaches is that both involve the collection and analysis of corpora of
natural language. Many researchers are even interested in similar constructs.
Compilation and analysis, however, are influenced by the ultimate goals: If one
wants to study a preestablished construct, he or she might simply search for
that construct in a corpus. If, on the other hand, the researcher is curious and
does not want to come to the analysis with preconceptions, a large corpus may
be collected and analyzed using corpus analysis that does not include a priori
decisions of what to search for.
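The corpus-driven end of the continuum can be sketched in a few lines of Python: the program recovers every four-word sequence in the data and keeps those that recur, with no linguistic feature chosen in advance. The sample string and frequency threshold here are invented for illustration; real lexical-bundle studies impose much higher frequency and dispersion cutoffs across large corpora.

```python
import re
from collections import Counter

def lexical_bundles(text, n=4, min_freq=2):
    """Corpus-driven n-gram extraction: list every n-word sequence and
    keep those that recur, with no feature chosen in advance."""
    tokens = re.findall(r"[a-z']+", text.lower())
    ngrams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    counts = Counter(ngrams)
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}

sample = ("on the other hand the results suggest that "
          "on the other hand the data show that")
print(lexical_bundles(sample))
```

Notice that the recurrent sequences emerge from the data themselves; the researcher decided only how long a sequence to look for and how often it must recur.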
Results of these functional interpretations are key for classroom teachers. These
interpretations will reveal to teachers and students that language is, in fact,
mediated by and modified according to registers. There is simply no
one-size-fits-all approach to effectively teaching the lexico-syntactic features
of language (speech or writing). For example, the essential linguistic and con-
textual components of successful business email writing may not necessarily be
the same components that will make a business proposal or a laboratory report
equally successful. What to teach learners, therefore, depends on the various com-
binations of components and factors identified for the target register.
In CL, register is a situationally defined category of speech and writing. A
register distinction of spoken texts, for example, can cover sub-registers, such
as face-to-face interaction, telephone interaction, and video calls (e.g., Skype
calls or mobile “FaceTime” calls). Corpora representing these three sub-registers
could be collected and transcribed. These sub-registers are differentiated by the
medium and contexts, which can certainly influence the use of a whole range of
linguistic features. Register variation, therefore, is primarily based on these con-
textual differences. What I like about the concept of register is that a researcher
or teacher is in control of register comparisons. The situations that define reg-
isters will just have to be clear and consistent. Hence, I can categorize a target
register, such as “written technical reports,” and establish sub-registers, such as
laboratory report, incident report, or business-financial report in the fields of
chemistry, forestry, and business, respectively. I can show my students linguistic
variation across these three groups. There will, for example, probably be differ-
ent distributions of key words, personal pronouns, or passive verbs. The conse-
quence of this teaching approach is increased register awareness among students.
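A register comparison of this kind reduces, computationally, to counting a target feature in each sub-corpus and normalizing the counts. The sketch below uses two invented one-sentence "samples" in place of real sub-corpora and first-person pronouns as the target feature; both are stand-ins for demonstration only.

```python
import re

FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our"}

def rate_per_thousand(text, targets):
    """Normalized frequency of the target set per 1,000 running words,
    so samples of different sizes can be compared directly."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for tok in tokens if tok in targets)
    return hits / len(tokens) * 1000

# Invented one-sentence 'sub-corpora' standing in for real register samples.
lab_report = "The solution was heated and the results were recorded."
email = "I think we should send our proposal before I leave today."
print(rate_per_thousand(lab_report, FIRST_PERSON))  # 0.0
print(rate_per_thousand(email, FIRST_PERSON))
```

Even this toy comparison makes the register contrast visible to students: the impersonal lab-report style yields no first-person pronouns at all, while the email is saturated with them.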
Register has often been used interchangeably with genre. Although there is
little consensus as to the meaning and/or use of these terms, Biber and Conrad
(2009) explain that such a distinction is a matter of focus. Both concepts refer
to text categories that have been situationally defined and have shared general
communicative purposes; the difference between the two is determined more
by how those texts are studied or used. A genre perspective is more interested
in the conventional structures that are used to create an entire text or a section
of a text, such as research article introductions or the abstract from a research
article. On the other hand, a register perspective looks at the most common
linguistic features across spoken and written texts. These linguistic features,
from a register perspective, are thought to be pervasive, and thus, a sample of
a text can be analyzed. A genre analysis, however, would require the text to
remain intact. Although many corpus-based studies take a register perspective,
they may also use or be supplemented by other methods to become more in line
with genre-based approaches (Flowerdew, 2005). Wherever they land theoreti-
cally, however, corpus-based methodologies lend themselves well to answering
the questions relevant to disciplinary specificity. Literacy practices, even those
of linguists studying such practices, may be entrenched and not noticed. Intu-
itive conclusions as to what is frequent or infrequent are not always accurate,
and corpora offer measurable ways to describe what happens in the discourse
empirically. Another benefit of corpus-based methods is that they allow for
more objective empirical studies. This is especially useful for researchers—and
teachers—who view their role as descriptive rather than prescriptive. Topics that
can be investigated include, but are not limited to, vocabulary, phraseological
units, grammatical features, and rhetorical functions (Friginal & Hardy, 2014a).
event, and these are much easier to collect (see Section B2). They are referred
to as specialized corpora. Specialized corpora allow us to control for many
more variables. They are developed to represent a particular domain, includ-
ing those that are dedicated to micro contexts (e.g., abstracts of research arti-
cles or students’ responses to interview questions). These collections are useful
when moving from the analysis of results to generalizations relative to a bigger
population. For the most part, this is a question of scope. What linguistic fea-
ture or domain is being investigated? You might be interested, for example, in
the questions, “What kind of oral language do first year, level 1 IEP students
produce? Do they ask many questions and share critical comments? Do they
produce high frequency academic word list words? Do they often use passive
verbs?” These are interesting questions and could be investigated by using a
smaller sample, like two classes at an IEP in a particular university.
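Questions like the academic-word-list one above can be answered by measuring coverage: the share of running words in the student sample that belong to the list. The sketch below uses a five-word stand-in for a real list such as Coxhead's Academic Word List, and a single invented transcript sentence; a real study would load the full list and whole transcripts from files.

```python
import re

# Five-word stand-in for a real academic word list; an actual study would
# load e.g. Coxhead's AWL headwords from a file.
ACADEMIC_WORDS = {"analyze", "data", "method", "research", "significant"}

def coverage(text, word_list):
    """Proportion of running tokens that belong to the word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for tok in tokens if tok in word_list)
    return hits / len(tokens)

transcript = "We used this method to analyze the data from our research project."
print(round(coverage(transcript, ACADEMIC_WORDS), 3))  # 0.333
```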
Annotated Corpora
An important aspect of corpus design is the additional information embedded
in texts/transcripts. Especially for spoken corpora, it is, depending on the
Other Categories
Monitor Corpora are collections of texts intended to grow in size (e.g., num-
ber of words and the addition of emerging sub-registers over time). The Bank
of English (BoE), a collection of British, Australian, and American English,
for example, was first made available by the research team from the University
of Birmingham in the 1980s, with Susan Hunston as the lead developer. The
collection now has over half a billion words of “present-day English” and a
sub-corpus developed specifically for teaching purposes with over 65 million
words. Many corpora collected by Mark Davies for his Brigham Young Univer-
sity (BYU) corpus project (see Section B1) are considered to be monitor corpora,
especially the Corpus of Contemporary American English or COCA (Davies,
2008), the Corpus of Historical American English or COHA (Davies, 2010), and
the Global Web-Based English or GloWbE (Davies, 2013). Another category
quite similar to monitor corpora is Balanced Corpora, which typically intend
to represent a specific register over a period of time, with the collection empha-
sizing balance and representativeness, and with a clear sampling plan (sampling
frame) and an identified schedule for adding new data. The
concept for now is primarily theoretical, but there are current corpora that,
when further developed in the future, might be categorized as fully balanced.
Examples of these are the British Academic Written English (BAWE) corpus
and Michigan Corpus of Upper-Level Student Papers (MICUSP) academic texts
(see Section B1) or specialized blogging and social media corpora (Facebook and
Twitter) that have been collected by researchers over the past 10 years.
Finally, McEnery and Hardie (2012) described Opportunistic Corpora
as corpora that do not match the descriptions of the categories mentioned ear-
lier. These corpora do not follow a strict design or sampling frame, and they
simply comprise specific data that were possible to gather for a particular task.
Sometimes, contextual and also technical restrictions may have prevented the
collection of texts originally sought from being completely realized. An ‘op-
portunistic’ process is often needed in the case of most spoken registers, from
recording to transcription.
For example, the Oxford English Dictionary, which was first published in 1928, was
based on approximately 5,000,000 citations from natural texts, totaling ap-
proximately 50 million words, compiled by more than 2,000 volunteers over
a period of more than 70 years. Long before this, 150,000 natural sentences
written on slips of paper were the basis, in 1755, for Samuel Johnson’s Dictionary
of the English Language, which was published to illustrate the natural usage of
words at that time (Biber et al., 2010).
Before computers and electronically prepared corpora were available, the
empirical study of vocabulary use, as well as grammar teaching in English,
relied on texts such as newspaper writing, short stories, and scholarly essays.
Actual sentences taken directly from novels and
newspapers were reproduced in many commonly used grammar books during
this period to show various structures of formal, grammatically correct sen-
tences and syntactic elements such as verb phrases and clauses. A corpus of
letters written to various government agencies was the basis for earlier works in
the field, such as that of C.C. Fries, who focused on usage and social variation
of language. Another work, published in 1952, which was essentially a gram-
mar of conversation based on a 250,000-word corpus of telephone conversa-
tions, focused on such grammatical features as discourse markers well, oh, now,
and why when these markers initiated a “response utterance unit.”
In the 1960s and 1970s, the majority of researchers in linguistics—particularly
those in the US—adamantly maintained that, since language was a “mental
construct,” empirical research approaches were unsuitable to describe language
competence. They argued instead that what Biber (1988) referred to as
intuition-based methods, relying on intuition rather than empirical analysis,
should be adopted as the primary research methodology in the field. Some
linguists, how-
ever, steadfastly maintained that empirical linguistic analysis had greater util-
ity and validity. In the early 1960s, for example, Randolph Quirk compiled a
pre-computer-era collection of 200 spoken and written texts, each approximat-
ing 5,000 words, which he dubbed the Survey of English Usage (SEU) and sub-
sequently used to compile descriptive grammars of English (e.g., Quirk et al.,
1972). This empirical, descriptive tradition also had the continuing support of
such functional linguists as Prince and Thompson, who argued that analysis of
(still noncomputerized at this point) collections of natural texts was useful in the
identification of systematic functional linguistic variation. Thompson has been
especially interested in the study of grammatical variation in spoken interac-
tions and has identified features in conversation that influence the retention or
omission of other features, such as complementizers (Biber et al., 2010).
Kučera and Francis (1967) had actually begun work on large electronic cor-
pora in the 1960s, compiling the Brown Corpus (or, in full, the Brown Uni-
versity Standard Corpus of Present-Day American English), a 1-million-word
corpus of published American English written texts. The Brown Corpus cata-
logued a wide variety of types of American English, all of which were written in
1961. A total of 500 samples of approximately 2,000 words each from 15 different
genres were collected for this project. News, religious texts, biographies, official
documents, academic prose, humor, and various styles of fiction were included
(see Kučera & Francis, 1967). A parallel corpus of British English written texts,
the LOB Corpus (also Lancaster-Oslo-Bergen), followed in the 1970s. It was
not until the 1980s that major studies of language use based on large electronic
corpora began to appear as these corpora became more available and accessible,
thanks to the increasing availability of computational tools to facilitate linguis-
tic analysis. For example, in 1982, Francis and Kučera provided a frequency
analysis of the words and grammatical part-of-speech categories found in the
Brown Corpus. Johansson and Hofland (1989) followed with a similar analysis
of the LOB Corpus. Also during this period, book-length descriptive studies of
linguistic features began to appear: for example, Granger (1983) on passives, de
Haan (1989) on nominal post-modifiers, and Biber (1988) on the seminal multi-
dimensional studies of register variation. This period also saw the emergence of
English language learner dictionaries, such as the Collins CoBuild English Language
Dictionary (1987) and the Longman Dictionary of Contemporary English (1987), both
of which were based on the analysis of large electronic corpora. Since the 1980s, the
majority of descriptive studies of linguistic variation and usage in English—
whether based on a large, standard corpus, such as the BNC, or on a smaller,
study-specific corpus, such as a corpus of 20 biology research articles constructed
for a written academic register analysis—have utilized analyses of electronic corpora, and this
has now become a standard research methodology in the field.
teachers see the value of developing corpus tools into mobile apps and using
them to create communities of practice online. It is quite possible for enter-
prising groups/companies (like Duolingo or Sketch Engine) to take the lead in
fully incorporating corpus tools into existing language learning sites and ap-
plications. Overall, upon recognition of the present challenges and looking
ahead, we are positioned to experience the merging of CL approaches, internet
technology, and many other epistemological fields. CL accommodates a variety
of learners and learning contexts, and effectively complements quantitative and
qualitative research paradigms. It is evident that we are going to see more of
CL and its constructs and tools more directly utilized by (English) teachers in
the classroom.
Note
1 Text is a tricky word to use in CL as it may mean multiple items. In this book, the
possible meanings are (1) text as discourse (e.g., spoken and written texts), (2) text as
a particular ‘file’ (e.g., “I have two essay texts written by my students.”), or (3) text as
an encoding system (e.g., “Please save your files as .txt files, not as .doc or .pdf”).
A2
Connections
CL and Instructional Technology,
CALL, and Data-Driven Learning
There are several important theoretical underpinnings that support the inte-
gration and use of CL in the teaching of English for a range of learners and
settings. CL is an approach in researching language and its features, and also in
supporting language teachers as they facilitate the learning and acquisition of
English. This means that it is definitely consistent with SLA theories, partic-
ularly those that emphasize the importance of sociocultural approaches: focus
on the learner (e.g., learning styles and characteristics), input and output, use
of realia and authentic texts, the importance of real-world tasks, learner-learner
interaction, and learner autonomy and exploration of linguistic data.
Conrad (2000), in a seminal article that asked whether CL might revolu-
tionize grammar instruction, noted that the final decades of the 20th century
brought about dramatically novel approaches that subsequently redirected gram-
mar teaching and research. She identified renewed interest in an explicit focus on
form and/or grammar instruction in the classroom (e.g., Celce-Murcia, 1991a;
Celce-Murcia, Dörnyei, & Thurrell, 1997; Ellis, 1998; Master, 1994); new ap-
proaches to grammar pedagogy, such as teaching grammar in a discourse con-
text (Celce-Murcia, 1991b); and the design of grammatical consciousness-raising
and input analysis activities (e.g., Ellis, 1995; Fotos, 1994; Rutherford, 1987;
Yip, 1994). Simultaneously, classroom technologies and computers were making
it possible to conduct grammar studies of wide-ranging scope and complex-
ity. Conrad then revealed how CL was positioned to facilitate the transfer of
research data to pedagogy. She added that linguistic data from an empirical
study of language, which relies on computer-assisted techniques, can best rep-
resent the context and also serve as the source for input analysis activities. She
noted that, at that time, only one aspect of CL, concordancing, tended to be
emphasized for classrooms (citing, particularly, Johns, 1986, 1994; Cobb, 1997;
Stevens, 1995), but most ESL grammarians regarded CL as contributing
Rather, language features and patterns are typical of particular registers and
will have to be prioritized and highlighted accordingly in materials design
or classroom lesson planning (Biber, 2004). Learning about registers reflects
learning English for various purposes (i.e., ESP and EAP), and it can be (best)
accomplished in the classroom when students use corpora, corpus tools, and
corpus-based materials to examine specific characteristics of spoken and written
registers. Several ESP and EAP studies report that student concordancing based
on a specialized corpus (e.g., Boulton, 2015; Chambers, Farr, & O’Riordan,
2011; Donley & Reppen, 2001; Friginal, 2013b; Gavioli & Aston, 2001) has
proven to be effective in awareness-raising exercises. Students are able to make
distinct comparisons between features used in one type of writing and those
used in another, and distributional data showing how they use specific linguis-
tic features in their own writing, again, compared to another corpus or written
academic texts, can provide additional motivation in editing drafts. Classroom
activities along the lines of these exercises can readily be organized, with CL as an
approach within instructional technology.
Instructional or educational technology primarily refers to the tools, mate-
rials, and equipment used to support the teaching and learning of a particular
subject or topic. These technologies include hardware and software and audio/
video equipment, and devices. Corpus tools running on computers, tablets, and
mobile phones are all part of these classroom-based technologies. It is necessary
to keep in mind that technology is only one of many tools that English teach-
ers have at their disposal, and it is important to note that instructional tech-
nology should supplement and support instruction and help accomplish, rather
than replace, teachers’ specific teaching goals. A corollary, as far as students are
concerned, is that the availability of technologies, whether in the classroom, at
home, or anywhere learning is taking place, should also enhance and extend,
rather than replace, what only the individual brings to the dynamics of the
learning process. Students themselves need to see various patterns and be able
to interpret what’s going on. As is the case with any tool, whether a table saw
in the woodworking shop or learning/teaching technology in the classroom,
the user—both the teacher and the student—must discover the ways in which
to fully control the tools and make them work to address their particular plans,
objectives, and needs.
The Oregon Department of Education (2002), in a publication distributed
to its teachers, suggests that it is necessary for teachers to ask themselves the
following questions: “Will the use of technology make this lesson better? Will
it facilitate student understanding? Will students’ capacity to demonstrate their
understanding increase because of it?” This publication notes that, by asking
these questions, teachers will be able to determine when these technologies are
appropriate and when they are not. The answers to these questions can be use-
ful to English teachers as they formulate goals in incorporating CL tools into
the classroom. The recognition that CL tools will not work all the time, across
language topics and lesson settings, is very important for teachers. Knowing
how the tools work and being able to take control, in case they don’t work,
are necessary in the successful integration of CL in the classroom. By thinking
about CL tools within the frame of instructional technology, English teachers
will come to view these tools as everyday, nonthreatening classroom devices.
The tools will not be a step ahead of teachers in their instruction; teachers can
use them when needed, for a collocation exercise, for example, and set them aside
when things become too complicated or confusing for learners.
Figure A2.2 The relationship between target second language skills and priorities
for CALL design. Adapted from Chapelle and Jamieson (2009).
Questions:
What specific criteria presented in the questionnaire are best addressed by
CL tools and materials? What are those that will be clear limitations?
What are innovative ways to make learners interact with each other in the
CL classroom? What problems related to Learner Fit do you anticipate?
At the heart of the approach is the use of the machine not as a surrogate
teacher or tutor, but as a rather special type of informant. The difference
between teacher and informant can best be defined in terms of the flow
of questions and answers. The teacher typically asks a question (answer al-
ready known) to check that learning has taken place; the learner attempts to
answer that question; and the teacher gives feedback on whether the ques-
tion has been successfully answered. Such is the I(nitiation)-R(esponse)-
F(eedback) structure of the classroom exchange as described in Sinclair
and Coulthard (1975); and such, too, is the structure of the typical “tu-
torial” computer program that purports to “teach a foreign language.”
The informant, on the other hand, is passive - and silent - until a question
(answer unknown) is asked by the learner. The informant responds to
that question as best he (or she) can; and the learner then tries to make
sense of that response (possibly asking other questions in order to do so)
and to integrate it with what is already known.
(1994, p. 1)
The primary theoretical basis for DDL in SLA is the proven value of learners’
active examination of naturally occurring language and their discoveries of lin-
guistic patterns and rules on their own (Boulton, 2009). The focal point of an
effective DDL is the guided but free-form exploration of a language and its fea-
tures. It is in the discovery of these patterns that learners can articulate insights
and experience a degree of self-sufficiency in their language learning. As Johns
was still formally articulating this approach, corpus data (i.e., teachers’ access to
corpora) were clearly limited, and he was also restricted to basic concordancing
tools available on only a select number of computers. However, he noted at that
time that, over the few years during which his students had used concordance
output regularly, he had come to three primary conclusions (1991):
that is based directly on linguistic data (Aston, 2015; Boulton, 2009; Charles,
2014; Friginal, 2013b; Geluso & Yamaguchi, 2014; Lee & Swales, 2006). Two
lessons in Part C of this book are situated specifically on DDL: C2.2 on “Us-
ing a Concordancer for Vocabulary Learning with Pre-Intermediate EFL Stu-
dents” from McNair and C3.6 developed by Nolen on “A Long-Term, Corpus
Exploration Project for ELLs.”
As I mentioned in Section A1, corpus-informed grammar teaching materials
such as the LGSWE (Biber et al., 1999) helped tremendously in introducing
corpus-based data to applied linguists as well as the general audience composed
of English language teachers and language learners. The LGSWE provides ex-
tensive distributional data of the lexico-syntactic features of written and spoken
registers of British and American English, and it also presents corpus findings that
explain the functional parameters of these two national English varieties based
on comparative patterns of usage. The analysis and presentation of corpora from
the LGSWE have contributed an assortment of frequency distributions to many
language/grammar classrooms with applications to register and cross-linguistic
comparisons. This growth in corpus-based materials production is accompanied
by much easier access to these tools and by the continuing focus on CL research.
DDL has also now been recognized to coordinate well with several theories, ap-
proaches, and subfields within applied linguistics. DDL outside of the classroom
has been established as a clear focus and application of lessons and activities.
However, there are clear challenges and limitations to the successful utiliza-
tion of the DDL methodology. Learners who are not used to this approach may
be intimidated and find DDL technically complicated, especially lower-level
foreign language learners (of English). Kaltenböck and Mehlmauer-Larcher
(2005) point out that “being confronted with a vast amount of unordered,
‘messy’ data, can indeed be a frustrating, even daunting experience” (p. 79).
This is a clear challenge that has a tendency to evoke resistance from learners.
Geluso and Yamaguchi (2014) observed that there was some skepticism and
reluctance from their learners toward DDL, especially during the beginning
stages of instruction. This skepticism is not limited to learners in the
English classroom. Römer (2011) observed that there is, in addition to learner
reticence, a reluctance to use and/or produce corpora by teachers and materials
writers due to the prospect of having to process an enormous amount of data.
To many learners and teachers, DDL may not be immediately pedagogically
appealing, to say the least, because reading line after line of authentic language
in order to induce meaning can become monotonous (Nolen, 2017). Clearly,
DDL requires an investment on the part of learners and teachers. Learners’ ages,
learning styles, technical knowhow, proficiency in English, and educational
backgrounds all play a role in how receptive they are to the approach (Geluso &
Yamaguchi, 2014). It must also be acknowledged that it can sometimes take
weeks for teachers to properly introduce and teach corpus tools. This was the
case in Geluso and Yamaguchi’s DDL curriculum for which three weeks were
set aside to explain and discuss the utility of DDL so that students would be
adequately motivated to invest their time and effort. Similarly, Mizumoto and
Chujo (2016) disclosed that in a 15-week semester, their corpus-based course
utilized the first 10 weeks to simply provide adequate DDL instruction.
Another widely raised concern is that DDL is only applicable for proficient,
intermediate to advanced learners of the language (Tribble, 2015). This seems
to be a logical reservation, since DDL often focuses on authentic, native-like
language in academia that may surpass the abilities of most beginner-level
learners. However, Boulton (2009) found that learners at an intermediate-low
proficiency level also benefited equally in learning new vocabulary from
corpus-based data rather than from traditional pedagogical strategies such as
the utilization of bilingual dictionaries (e.g., translators and electronic dic-
tionaries that provide word synonyms) within the context of certain tasks.
Boulton adds that DDL is most suited for corpus-trained, advanced proficiency
learners, but it also provides measurable benefits for lower-proficiency learners,
when guided as necessary according to their ability levels. Mueller and Jacobsen
(2016) also found that learners at lower-level English proficiencies can use DDL
as a means of error correction in their writing. DDL provides examples and as
much data as a learner is willing to extract and explore. Concerns about learner
proficiency can, in part, be resolved by various scaffolding exercises.
The current consensus is that DDL is a useful approach in the English class-
room, promotes students’ autonomous learning (Charles, 2014; Mueller &
Jacobsen, 2016), provides them with a wealth of linguistic information (Aston,
2015; Römer, 2011), and initiates their training to become language-based re-
searchers as they acquire the knowledge and abilities to utilize it (Friginal,
2013b). These learners need the support of a variety of language learning tools
in order to meet the challenges and demands of learning English, especially
in academia, for various purposes. DDL, then, can be a powerful approach to
fully expose learners to authentic language in order for them to examine and
understand how the language is structured and used naturally—leading, conse-
quently, to their own successful use. It is by no means an easy task, yet Römer
(2011) argues that DDL can certainly empower learners as it raises their lan-
guage awareness. She adds that corpora and corpus tools have great pedagogical
potential in creating a habit of obtaining data and developing a sense of own-
ership that will likely keep students motivated in their learning of a language,
both inside and outside of the classroom.
Below is a reflection by Nolen (2017) upon experiencing DDL as an English
teacher tasked to develop lessons and activities on phraseology for international
students in the US. In this reflection, he focuses on a collocation example,
showing students how to go beyond simple dictionary meanings into a deeper
examination of linguistic chunks to discover the functions of and important,
pragmatic information about words and phrases. See also Nolen’s semester-long
course plan for a DDL vocabulary and grammar instruction (Section C3.6),
focusing especially on concordancing activities outside the classroom.
students themselves. It is important to note that this concept may not match
the cultural expectations of learners from countries where teachers assume a
very prominent and directive role in all forms of instruction. More importantly,
however, teachers are responsible for training students on the use of corpus
tools—clearly indicating that they need to know completely how to use these
tools and addressing any potential glitches as students acquire and refine their
self-learning skills. Students should hear and, hopefully, appreciate the rationale
in support of the DDL approach and the benefits that will accrue to them as
learners and practitioners from developing proficiency in the use of these tools.
The primary goal again is autonomous learning for students. In describing
the process leading to learner autonomy, Kaltenböck and Mehlmauer-Larcher
(2005) note that it resembles a continuum where the control gradually shifts
from the teacher to the students, as there are changes in attitudes, knowledge,
and abilities that will be observed throughout the process.
To summarize, the benefits of DDL outweigh the challenges. This is espe-
cially true for learners who gain access and training to investigate language
data not only independently, but also extensively. Learners can truly become
researchers or ‘detectives’ of their target language. The roles of learners and
teachers tend to change or shift, but both will continue to develop strategies
for discovery. This places the initiative and specific path to learning the lan-
guage on the learners, and demands that they take some of the responsibility
and accountability involved in the process—as well as enjoy the benefits to
be derived from doing so!
A3
Analyzing and Visualizing
English Using Corpora
of the construct per x number of words. Depending on the feature and the size
of the corpus, a teacher might choose to measure the number of occurrences
per 100, 1,000, or 1,000,000 words. This process is also called normalizing (i.e.,
normed count or normed frequency). In many of my studies of word/gram-
matical constructions, I normalize the number of instances per 1,000 words,
following a simple calculation:
nf = (number of occurrences ÷ total number of words) × 1,000
Normalization not only allows for teachers to compare linguistic features with
one another but also, more importantly, allows us to compare texts and corpora
of differing lengths.
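The normed-count calculation above amounts to one line of code. The Python sketch below (the counts and corpus sizes are invented for illustration) shows why normalization matters when corpora differ in length:

```python
def normalized_frequency(occurrences: int, corpus_size: int, per: int = 1000) -> float:
    """Return the normed frequency of a feature per `per` words."""
    return occurrences / corpus_size * per

# Invented figures: the same feature counted in two corpora of different sizes.
nf_small = normalized_frequency(1200, 250_000)    # 4.8 per 1,000 words
nf_large = normalized_frequency(3900, 1_000_000)  # 3.9 per 1,000 words
# The raw counts (1,200 vs. 3,900) suggest the larger corpus uses the feature
# more often; the normed counts show the opposite.
print(nf_small, nf_large)
```

The normed counts, not the raw counts, are the figures teachers should compare across texts.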
So, returning to my earlier question about the top 12 most common lexical
verbs in spoken American English, normalized frequency data is actually avail-
able for determining this. Figure A3.1 shows the top 12 lexical verbs obtained
from the Longman corpus (Biber, 2004).
Biber reports that these 12 verbs are very common in spoken interaction,
and they alone comprise close to 50% of instances of lexical verb use in
the corpus. Based on these frequencies, teachers may start a lesson on teach-
ing verbs in conversation by focusing on introducing the forms and functions
of the first five: get, go, say, know, and think. University IEP students who are
in their first semester in the US in an English oral communication class may
directly benefit from this activity, as they will hear these common verbs very
frequently.
Figure A3.1 Top 12 most common lexical verbs in spoken American English (get, go,
say, know, think, see, want, come, mean, take, make, give), normalized per one
million words (vertical axis: frequency per million words). Adapted from Biber (2004).
Obtaining something (activity): Check if they can get some of that bread.
Moving to or away from something (activity): Get in the car.
Causing something to move (causative): We ought to get these wedding pictures
into an album.
Causing something to happen (causative): Uh, I got to get Max to sign one, too.
Changing from one state to another (occurrence): So I’m getting that way now.
Understanding something (mental): Do you get it?
applied linguistics. Text Samples A3.2 show KWIC lines for the phrase in my
opinion from a corpus of personal blogs written by women based in the US
(collected by Samford, 2013).
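A KWIC display of this kind is straightforward to approximate in code. The Python sketch below (the two sample sentences are invented, not drawn from the Samford corpus) centers every hit of a search phrase in a fixed-width context window, the layout concordancers use:

```python
import re

def kwic(text: str, phrase: str, width: int = 30):
    """Return key-word-in-context lines for every match of `phrase`."""
    lines = []
    for m in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Pad the contexts so the search phrase lines up vertically.
        lines.append(f"{left:>{width}} {m.group()} {right:<{width}}")
    return lines

sample = ("In my opinion the schedule is too tight. "
          "She said that, in my opinion, nothing would change.")
for line in kwic(sample, "in my opinion"):
    print(line)
```

Dedicated concordancers add sorting, part-of-speech options, and corpus management on top of this basic alignment.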
A3.1.3 Collocations
As noted earlier, the way in which linguists regard and examine discrete lin-
guistic elements, such as words and phrases, has been strongly influenced by
the work of Firth (1957). These elements should not be regarded or treated
as independent from rules and other words in a text. Accordingly, the corpus
approach allows for the determination of statistically significant word combi-
nations, that is, word collocations, in a given text and how these combinations
are distributed across registers. Collocations can also be found using more ob-
jective measurements from statistical results obtained from reference corpora.
Prediction models of what might follow or precede a word, a noun, or a verb
can be estimated based on expected frequencies. Table A3.1 shows
collocates changing over time, from older to more recent, for the words women,
art, fast, music, and food (Davies, 2017a).
Table A3.1 Google Books’ (from the BYU collection) changing collocates over time
for women, art, fast, music, and food (Davies, 2017a)
Figure A3.2 AntConc’s (Anthony, 2014) first left and first right collocations for the
word know from a blog corpus.
AntConc’s first left and first right collocations for the word know are provided in
Figure A3.2, drawn from the same 584,714-word corpus of personal blogs referenced
earlier, written by women bloggers (Samford, 2013). The most frequently
occurring left collocate of know is “I” (I know, occurring 608 times), while the
most frequent right collocate is “what” (know what, occurring 214 times). A
contraction (‘t), often from don’t know, appeared 422 times in the corpus. In in-
terpreting the AntConc output, disregard the search word that is listed as Rank 1
(know) and focus on the raw frequency reported in the output window. Users
can download the full result saved as a text (.txt) file. The procedure for running
collocations in AntConc is pretty straightforward:
1 Load the corpus: File—Open File(s)—then select your folder where your
text files are located
2 Select the tab option for “Collocates” at the top of the main results win-
dow (between “Clusters” and “Word List”)
3 Type your search term (know) in the search bar
4 Identify your first left or first right options and minimum collocate fre-
quency (below “Window Span”)
5 Click “Start” and results (Figure A3.2) will be produced.
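Behind the scenes, first left/first right collocation is simply neighbor counting. A minimal Python sketch (the ten-word sample is invented; AntConc itself adds ranking, window spans, and statistics on top of counts like these):

```python
from collections import Counter

def neighbor_collocates(tokens, node):
    """Count first-left and first-right collocates of `node` (case-folded)."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            if i > 0:
                left[tokens[i - 1].lower()] += 1
            if i < len(tokens) - 1:
                right[tokens[i + 1].lower()] += 1
    return left, right

tokens = "I know what you mean and I know what happened".split()
left, right = neighbor_collocates(tokens, "know")
print(left.most_common(1))   # [('i', 2)]
print(right.most_common(1))  # [('what', 2)]
```

Even this toy example reproduces the pattern in Figure A3.2: I dominates the left slot of know, and what the right.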
Interpreting collocations
An online article by “vaughanbell” (2017) published by Mind Hacks (https://
mindhacks.com/), a neuroscience and psychology news and opinion site,
notes that mental health practitioners prefer to avoid
the phrase commit suicide. These practitioners argue that commit refers to a
crime, and this increases the stigma against what should be regarded as an
act of desperation that deserves compassion as opposed to condemnation.
The author added the following supporting arguments:
On the surface level, vaughanbell argues, claims that the word commit nec-
essarily indicates a crime are clearly wrong. We can commit money or commit
errors, or commit ourselves to work harder, for instance, and no crime is
implied.
After examining traditional dictionary definitions of commit (e.g., from
Google’s default dictionary: [commit] carry out or perpetrate [a mistake,
I have used an activity like this many times in my classes to allow stu-
dents to reflect and share thoughts on an issue and then comment on
what corpus data provide. It is certainly encouraging to witness popu-
lar culture’s acknowledgment of corpus approaches in analyzing profes-
sional discourse. In small groups, discussion guide questions such as the
following could be provided after students have read a short article. In
my experience, these questions always encourage active participation and
immediate use of the COCA database, with students using their phones or
laptops to access the site:
This “unusual frequency” is also referred to as the keyness value of this word
and is based on the likelihood of occurrence of the word in a target corpus as
determined by a process called cross-tabulation. In other words, keyness draws
from word frequency data, but instead of simple averages, statistical computa-
tion is used to determine if a word is more or less likely to occur in one corpus
vs. another.
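One widely used statistic for keyness is the log-likelihood ratio, computed from the cross-tabulated word and corpus frequencies. A compact Python sketch, with invented counts, shows the calculation:

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Keyness of one word: log-likelihood over a 2x2 cross-tabulation."""
    combined = freq_target + freq_ref
    total = size_target + size_ref
    # Expected counts if the word were equally likely in both corpora.
    expected_t = size_target * combined / total
    expected_r = size_ref * combined / total
    ll = 0.0
    if freq_target:
        ll += freq_target * math.log(freq_target / expected_t)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / expected_r)
    return 2 * ll

# Invented counts: a word occurring 180 times in a 40,000-word target
# corpus vs. 25 times in a 60,000-word reference corpus.
print(round(log_likelihood(180, 40_000, 25, 60_000), 1))  # 203.4
```

The larger the value, the more unusually frequent (or infrequent) the word is in the target corpus relative to the reference corpus; corpus tools then rank candidate key words by this score.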
Key word comparisons provide an interesting look at the unique features of
one type of discourse, language variety, or register compared to another. Key
words can be extracted easily using AntConc and WordSmith Tools. Note that
this process involves loading a target corpus, also known as “node corpus,” and
a reference corpus into the software to proceed with the analysis. A video tu-
torial for running key word analysis using AntConc is available from YouTube.
Search: “AntConc – Keywords.”
In the following example (Table A3.2), I provide two key word lists from
a collection of essays written by L2 university students responding to two
argumentative email prompts. The focus here is to investigate topic effect and
whether a certain topic may have an influence on writing quality. For this key
word analysis, I wanted to categorize the distribution of words repeated from
the actual prompts. Corpus 1 consists of essays responding to a question about the
“importance of planning for the future.” Corpus 2 consists of essays responding to a
prompt about the implications of too much “emphasis on personal appearance and
fashion.” Frequency and keyness values are provided for each key word.
Students in a CL class can be asked to interpret the data from the table
(Table A3.2). It’s a good idea to provide additional key words, if possible
the first 100 per corpus. Clearly, students identify words specifically
mentioned in the prompts as they write their responses, and these were
the primary key words per corpus. First person pronoun I was the top key
word in the “appearance” corpus. The misspellings “apperance” and
“fashons,” occurring 75 and 66 times, respectively, are both in the top 30
for Corpus 2. Teachers can ask students the following questions after they
analyze the results:
1 What patterns did you recognize? How do you interpret the character-
istics of L2 student writing from these two prompts? When compared
to L1 writers, do you think there will be differences?
2 What are ideal topics of comparison for a key word analysis?
3 What are limitations in conducting key word analysis?
Table A3.2 Key word comparison from two groups of essays written by L2 students
(Columns, for each corpus: key word, frequency, and keyness. The top key word
for Corpus 1 is future; for Corpus 2, appearance.)
Table A3.3 The 50 most common 4-grams from the Enron Email Corpus
hand). N-grams can also be extracted using most basic corpus packages; both
AntConc and WordSmith Tools have intuitive commands for n-gram extraction.
Table A3.3 shows a list of the 50 most common 4-grams from a corpus of pro-
fessional, workplace emails from the Enron Email Corpus (see also Section B1).
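Extraction itself is nothing more than sliding an n-word window across the token stream. A minimal Python sketch (the email snippet is invented, not taken from the Enron corpus):

```python
from collections import Counter

def ngrams(tokens, n=4):
    """Count all contiguous n-word sequences in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

email = "please let me know if you have any questions please let me know".split()
fourgrams = ngrams(email, 4)
print(fourgrams.most_common(1))  # [(('please', 'let', 'me', 'know'), 2)]
```

Applied across thousands of emails rather than one invented line, the same window-and-count logic produces frequency lists like the one in Table A3.3.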
Lexical bundles. One particular type of n-gram is the lexical bundle, an
n-gram with additional specifications as to how they are extracted or catego-
rized. Customarily, lexical bundles consist of at least three words (tri-grams)
that occur frequently—frequency determined by the researcher—across a
corpus of at least one million words. Another important criterion for labeling
MWUs as lexical bundles is that they must appear in at least five different texts
in the corpus; that is, they must be spread across texts rather than tied to a
single writer or speaker. This is necessary to avoid any idiosyncratic language
usages (Cortes, 2004).
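The frequency and dispersion criteria can both be checked in a single counting pass. The Python sketch below uses a toy three-text corpus and deliberately low cutoffs (min_freq and min_texts here are illustration-scale stand-ins for the real thresholds, which assume a million-word corpus and five texts):

```python
from collections import Counter, defaultdict

def lexical_bundles(texts, n=3, min_freq=2, min_texts=2):
    """Keep n-grams meeting both a frequency and a text-dispersion cutoff."""
    freq, seen_in = Counter(), defaultdict(set)
    for idx, text in enumerate(texts):
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            freq[gram] += 1
            seen_in[gram].add(idx)  # track which texts the gram appears in
    return {g: c for g, c in freq.items()
            if c >= min_freq and len(seen_in[g]) >= min_texts}

texts = ["as a result of the change",
         "as a result we stopped",
         "the change was as a result expected"]
print(lexical_bundles(texts))  # {('as', 'a', 'result'): 3}
```

Only as a result survives both cutoffs; the dispersion check is what filters out strings that are merely one writer's habit.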
P-frames. Researchers have also moved beyond looking only at contiguous
strings of words to also examine frequent, patterned constructions. P-frames are
consistent phraseological structures that nevertheless allow for variability in one
position of the phrase frame. An example of a p-frame, found by Römer (2010),
is it would be * to, in which the asterisk represents an open slot. Grammatically,
any number of adjectives might go into the blank slot in this example. Römer
found that the most frequently occurring words in a corpus of student essays in
the “blank” slot were interesting, useful, nice, and better, these accounting for 77%
of all the variants in the corpus.
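A p-frame search can be implemented as a token-pattern match with one open slot; a minimal sketch for Römer's it would be * to (the sample sentences below are invented):

```python
import re
from collections import Counter

def pframe_fillers(text, frame=("it", "would", "be", "*", "to")):
    """Count the words filling the open (*) slot of a phrase frame."""
    tokens = re.findall(r"[a-z']+", text.lower())
    star = frame.index("*")
    fillers = Counter()
    for i in range(len(tokens) - len(frame) + 1):
        window = tokens[i:i + len(frame)]
        # every fixed position must match; '*' matches anything
        if all(f == "*" or f == t for f, t in zip(frame, window)):
            fillers[window[star]] += 1
    return fillers

# Invented student-essay sentences:
essays = ("It would be interesting to compare them. "
          "It would be useful to check. It would be interesting to see.")
print(pframe_fillers(essays))  # -> Counter({'interesting': 2, 'useful': 1})
```

Sorting the resulting counter reproduces the kind of filler ranking (interesting, useful, nice, better) that Römer reported.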
Using Biber’s MDA approach, Hardy and Römer (2013) extracted dimen-
sions of A-graded university writing from MICUSP. Their Dimension 1 distin-
guished between Involved, Academic Narrative, very common in Philosophy
and Education papers, and Descriptive, Informational Discourse, typical of
A-graded papers in biology and physics. The following text samples show a
biology report compared to a philosophy critique. What characteristics are
typical of one text sample in contrast to the other? What useful teaching
applications occur to you as you identify such patterns? See the brief additional
discussion of applying MDA results to pedagogy in the MICUSP description
in Section B1.
Socrates then concludes that group (D) does not exist, since those people,
by desiring what they believe to be harmful (bad) things are desiring to be
miserable and unhappy. No one wants to be miserable and unhappy, so no
one desires what he believes to be bad. (A)–(C) actually desire what they
believe to be good, and group (D) does not exist, so no one desires what
he believes to be bad. I feel compelled to say here that although Socrates
actually claims that “no one wants what is bad” (78), what he means is that
no one wants what he believes is bad.
60 Corpus Linguistics for English Teachers
uppercase first letters. This illustrates the fact that CL extracts anything and ev-
erything that is available in the dataset, from the most frequent feature to those
that only appear once (see Section C4.2 for a lesson that incorporates visual-
izing political speeches with word clouds). In addition to WordClouds.com,
there are several other word cloud generators such as Wordle (www.wordle.net)
or Tagxedo (http://www.tagxedo.com/) that provide free word or tag cloud
templates and other applications. Tagxedo, for example, also easily provides a
key word list along with various color and design options.
Because CL relies on frequency data by group or by text file, it is easy to
transform these distributions into figures, especially histograms and charts.
From MS Excel functions to more sophisticated statistical packages like
SPSS or R, figures to enhance numerical presentation are often included in
CL research articles and textbooks. These figures are also easily incorporated
into language classroom activities. Students in small groups or pairs can ex-
amine figures/graphs, identify patterns, and make exploratory conclusions.
Figures A3.4 and A3.5 show word and tag frequency data that learners can
discuss and interpret.
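As a quick classroom alternative to Excel or SPSS/R charts, a frequency distribution can even be rendered as a plain-text bar chart; a sketch with invented counts (not the Google Books figures):

```python
from collections import Counter

def ascii_bars(counts, width=30):
    """Render a frequency Counter as a text bar chart, largest first."""
    top = max(counts.values())
    return "\n".join(
        f"{word:>5} {'#' * round(width * freq / top)} {freq}"
        for word, freq in counts.most_common()
    )

# Invented classroom counts (not the Google Books figures):
print(ascii_bars(Counter({"the": 60, "of": 35, "and": 28})))
```

Handouts built this way let students see relative frequencies at a glance before moving to polished figures.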
Figure A3.4 Visual representation of the Top 20 words in the English language from
Google Books (a mega corpus of more than 500 billion words from
scanned books in English and also other languages).
Figure A3.5 Use of personal pronouns I, we, my, mine, and our by men and women
bloggers in two age groups (30 and younger vs. 31 and older). Adapted
from Friginal (2009).
Oh, thank you God. Band camp really sucks. I am so tired of all of it! It doesn’t
matter, tomorrow is the last day. I don’t really feel like updating much. Go figure.
We have the 1st, 2nd, and up to set 15 of the 3rd song completed, but just as last
year, our drill writer is stupid and is falling behind. We have no more drill to work
on. Hopefully we will have more tomorrow. (Female blogger, high school student)
Table talk for the Sunday brunch crowd was the Senior Prom at the Golden Age
Center last night. Retired biology teacher Denver Zygote and Granny Garbanzo
double-dated with Judge and Mrs. Halfthrottle. The big excitement came about
half-way through the festivities when Granny attempted to Watusi with her cane
in her hand. (Male blogger, 60s, retired)
Figure A3.6 Distributional comparison of the word gentleman from the 1880s to
the present in COHA, with KWIC results. Figure and illustrations
adapted from Davies, 2010–.
Figure A3.7 Frequency of gentleman in English books from the 1800s to the
present from the Google Ngram Viewer.
disciplines. Zhu leads the Hypermedia and Visualization Lab and Brains &
Behavior research program at GSU, and his research interests include computer
graphics, data visualization, and bioinformatics. As an L2 learner himself, and
also one that identifies as a visual learner, Zhu has advocated for the use of
computer-based visual data in language instruction. He believes that the struc-
ture of language, typically explored in grammar activities (e.g., tree diagrams),
could be best comprehended by groups of learners when they were aided by
color coded and interactive visuals. A model of networks showing nodes of
sentences and the connections words have with each other allowed Zhu to fully
Figure A3.8 Sample sentence tree from Zhu, parsing the sentence "British actresses
Joanne Froggatt and Ruth Wilson also collected prizes." Adapted from
Zhu and Friginal (2015).
Figure A3.9 Text X-Ray's text editor and standard application tools (Zhu & Friginal,
2015).
how POS tags are used in context. This beta version of Text X-Ray1 works as a basic
text editor with a POS visualizer for various tags (e.g., nouns, verbs,
prepositions) obtained from the built-in Stanford Tagger; readability and lexical
diversity measures; wordlist comparisons; and a word cloud application. Another
important feature of this program is its ability to compare normalized frequen-
cies of linguistic features, for example, word/phrasal classes, with those aggregated
from MICUSP. Note again that MICUSP is composed of advanced, A-graded
student papers categorized primarily across disciplines and text types collected at
the University of Michigan (O’Donnell & Römer, 2012). Student-produced texts
can be immediately compared with MICUSP data, in real time, across disciplines,
paper types, and student levels, including gender and native speaker vs. non-native
speaker groups. Figure A3.9 shows the primary text editor view of Text X-Ray and
its current set of tools and command buttons:
Figure A3.10 POS-visualizing through Text X-Ray (color coded POS not shown in the
gray scale image, e.g., green = nouns, red = verbs, bold = prepositions).
Figure A3.11 Text X-Ray's structural workflow, linking the web interface, corpus
cache, corpus management system, corpus database, corpus API, and
visualization components. Adapted from He (2016) and Zhu and
Friginal (2015).
Text X-Ray is a program that I could sit here and play with all day because
I just think it’s cool that a program can pick out parts of speech in a se-
lected text. I haven’t noticed POS-tagging mistakes made by Text X-Ray
yet, but I’m determined to stump it. My immediate thought was to in-
troduce this program to the other instructors where I teach. Several were
interested in entering their students' essays and comparing them to the
MICUSP papers. The intermediate and advanced-level teachers were
interested in seeing if there was any notable difference between their students'
writing and the papers in MICUSP. I haven’t checked with any of them to
see if they have had a chance to do this yet, but I think that there will most
definitely be a difference between the papers because MICUSP handles
academic papers at the collegiate level, while the papers at my institution
are mostly written by students who hope to get into the GED program
or apply for citizenship. But, it would still be interesting to see what the
MICUSP papers have that the ones written at my school don’t have.
My first thought on using Text X-Ray in the classroom was as a sort of
self-check device that the students could use in our technology room. My
class is for beginners and we do go over the basic parts of speech; having
students enter a pre-selected text into the program, pick out the nouns in the
passage, and then check their own accuracy with Text X-Ray would be an
excellent way to get the students more engaged
with their language learning. Another feature of Text X-Ray that I could
see myself using in the future for vocabulary purposes is the word list
tab. Approximating word meaning from context is a very difficult task
in any writing classroom, but if I were able to create a list of words that
I think will be difficult for the students in my classroom from a specific
passage of written text, and then use Text X-Ray to highlight the words
in the passage, it would bolster class discussion of the context in which
the words are used.
Using Text X-Ray can highlight how native speakers of English use certain
language forms, vocabulary items, and expressions. It offers students
authentic, real-life examples when learning writing, which are better than
examples made up by the teacher. It allows students to learn useful phrases
and typical collocations they might use themselves, as well as language
features in context, which means that students learn language in context
and not in isolation. It can also help students get a broader view of language
through comparison. In doing so, students become aware of lexical chunks
that are useful for completing writing tasks. It also helps teachers demonstrate
how vocabulary, grammar, idiomatic expressions, and pragmatic constraints
work in real-life language.
I used to use concordancers only when preparing a class, and before I saw
Text X-Ray I thought I would not recommend this kind of program to
students. The program is colorful, and it is very easy to use. With tagging
applications, students can easily find the nouns, verbs, and adjectives in
their essays. I can use it when I teach verb valency patterns. Since my
interest is in teaching grammar using corpus tools, I have been thinking
about applicable methodology that I can follow for classroom research on
grammar teaching with a tool like this. Because of this visual recognition
or representation of “grammar” on the screen, I think students’ learning
will last longer than learning through simple rote memorization.
Articles are hard to learn for Japanese learners of English since Japanese
does not have articles. Texts with highlighted articles (a, an, the) can be
used in the writing classroom as a focus-on-form activity. The 'Compare
with MICUSP' feature shows a comparison of the frequencies of major POS
between the corpus and the current essay. By focusing on article use, for
example, the program gives Japanese students of English a clue as to whether
they supplied the necessary articles. If their frequency of articles is much lower
compared to a model corpus, they can focus on their use of articles when
they proofread their essay. Word Cloud – it might help students with
writing a summary of a text. I remember when I was an undergraduate
student, writing a summary in English was so difficult for me. Visual
presentation through word clouds can be useful.
I can imagine Text X-Ray being very helpful to advanced EAP students
who are practicing genre analysis, especially as more and more ELT writ-
ing instructors are attempting to empower students to become their own
investigators of genre. The Text X-Ray tool would allow such students to
determine for themselves the differences in, say, nominalization, between
academic texts and other types of writing.
In my experience, because of the tendency to associate writing skills
with reading skills, a good deal of literacy practice in EAP programs is
focused primarily on writing. Even though students may be reading a good
deal for homework, there is little explicit instruction on how to approach
a text or improve one’s reading fluency and/or accuracy. Having taught an
upper-level reading course in an ESL program in the past, I certainly would
have devoted class time to having students explore their assigned reading
through Text X-Ray. For example, I might have begun by having students
highlight the nouns and do a quick scan for nouns they already know (good
Figure A3.12 Sample of Sketch Engine's word sketch feature for work from the BNC.
appearing 641.5 per million words. The top lemmas are listed and a word sketch,
similar to a word cloud, is provided in various colors and font sizes.
A new addition to Sketch Engine, updated from Figure A3.12, is the Sketch
Engine for Language Learning or SkELL tool, intended for teachers and students
of English. SkELL allows users to check how a word or a phrase is actually used
in a corpus by native English speakers (e.g., from the BNC). All text extracts,
collocations, or synonyms are identified and provided automatically by the pro-
gram. The SkELL tool is free and there is no registration required.
Trump, with close to 42 million followers in early 2018, has tweeted an aver-
age of 5.4 times per day since he became the US president. His top 10-word
list from 2016 includes Hillary, #trump2016, crooked, Clinton, #makeamericagreat-
again, people, America, Cruz, bad, and Trump. Twitter data are very useful, not
only for linguistic analysis but also especially for business and big data analytics.
Product sentiment analysis, movie box-office projections, and trending issues
or topics are all relatively easy to extract in real time from Twitter using its ap-
plication programming interface (API). Unlike Facebook, which can be set by
users to be exclusively private, Twitter defaults as a public platform.
Eisenstein et al. (2010) used computational models to identify regional mark-
ers from user postings on Twitter. For corpus-based dialectology research, the
important link here is how internet and mobile technology can code for vari-
ables such as location in tweets. As it is, Twitter can access users’ geographical
coordinates from, for example, mobile devices that are enabled by Global Po-
sitioning Systems or GPS. This feature produces ‘geotagged’ text data that re-
searchers can obtain from online logs. Most users’ tweets are geotagged, which
means that analysts are able to identify the users’ locations, especially if they
tweeted from their mobile phones. Posts from desktop computers or permanent
computer terminals may be identified from their internet access addresses or
universal resource locators (URL). There are more and more studies that mine
geotagged data online, focusing primarily on trends and internet user traffic.
These types of information are useful to marketing analysts and survey compa-
nies that collect quantitative tracking data of user behavior from the internet.
One of Grieve and his collaborators’ ongoing projects is to document lexical
spread on Twitter. They are in the process of compiling a multi-billion-word
regional monitor corpus, using the Twitter API, consisting of nearly every
geocoded Tweet from the US and the UK since 2013, totaling approximately
25 million words per day. Given this large number of geocoded and time-
stamped Tweets, it is possible for Grieve and his team (2014) to identify newly
emerging words and map their geographical spread over time. An earlier study
that they conducted during the first three quarters of 2013 explored “rising” or
increasingly prevalent words identified from a particular period: for example,
from day 1 of 2013 to day 250 (from January to September). They first ex-
tracted 60,000 words that occurred at least 1,000 times in the corpus and iden-
tified rising words by correlating word relative frequency per day to day of the
year using a Spearman’s rank correlation coefficient. Their list of rising Twitter
words includes rn (right now), selfie/s (a photo of oneself ), tbh (to be honest),
literally, bc (because), ily (I love you), bae (babe, baby), schleep (sleep), sweg (swag,
i.e., style), and yasss (yes). On the declining side, the following are 2013’s “fall-
ing” Twitter words: wat (what), nf (now following), swerve, shrugs (* shrugs *),
dnt (don’t), wen (when), rite (right), yu (you), wats (what’s), and yeahh (yeah).
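The correlation step can be sketched in plain Python (SciPy's scipy.stats.spearmanr computes the same statistic); the daily rates below are invented, not Grieve et al.'s data:

```python
def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks
    (average ranks assigned to ties). Used to flag 'rising' words whose
    daily relative frequency climbs with the day of the year."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            # group tied values, then assign them the average rank
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # 1-based average rank for the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

days = [1, 2, 3, 4, 5]
per_million = [0.2, 0.5, 0.9, 1.4, 2.0]  # invented daily rates for one word
print(round(spearman_rho(days, per_million), 3))  # -> 1.0
```

A rho near +1 marks a rising word and a rho near −1 a falling one, which is how the lists above were derived.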
They were also able to visualize the data per word and then trace the spread
of each word across the US. For example, for the word selfie (named "Word
of the Year" for 2013), the following graph (Figure A3.13) clearly shows its
dramatic linear increase in usage from day 1 to day 250. A 'heat map' from the
geotagged tweets shows parts of the US where selfie was included as part of a
tweet. Additionally, they also visualized the first users of these words in tweets,
obtaining profile pictures of Twitter users that could be qualitatively explored
according to gender, age, and other variables.
Figure A3.13 Selfie/s first appearance and dramatic increase in usage on Twitter in
2013. The gray parts in the US map (typically major cities in the
Northeast and the Southwest) indicate that selfie/s originated from
and was immediately popularly used in many major US cities
(Grieve et al., 2014).
The previous figures can be used to initiate small group discussions in a Soci-
olinguistics or Language in Society class. Grieve et al. (2014) found that most
rising words on Twitter follow an s-curve when presented graphically. They
also found patterns: (1) Acronyms were on the rise, but creative spellings
were on the decline, (2) there were relatively clear southern and northern US
patterns of lexical spread on Twitter, and (3) lexical innovators appeared to
include young black women in the South and young white men in the West
and the North. (This observation was derived from an examination of profile
pictures of Twitter users.) Students examining visualized geotagged Twitter
data might be asked to consider and discuss questions such as the following:
• What are your immediate impressions about the data? What jumped
out? What are lessons or takeaways from this visualized data?
• Explain what the data/figures and excerpts are about. Answer the
question, “So what?”
• Remind students that CL is a research approach, a way of thinking
about language that shines the spotlight on language use. What then
is a word? (Note that Grieve and colleagues referred to rn, ily, and
yasss as “words” from Twitter.) What are the implications of these new
Twitter words in the study of languages?
• If CL allows investigation of language choice, could we explain why
Twitter users prefer a particular word or grammatical form rather than
alternatives?
Inquiry at Carnegie Mellon University utilize a corpus of hip hop lyrics to ex-
plore vocabulary use in hip hop and to visualize and compare artists’ creative
use of language. Hip hop has been a leading source of linguistic innovation
and has also now been studied across academic fields in the digital humanities,
media criticism, and data visualization. In many of these academic studies, the
language of hip hop is viewed as a cultural indicator. Tahir Hemphill has
developed a searchable rap almanac, the Hip Hop Word Count
(http://staplecrops.com/), which examines hip hop lyrics and includes a
visualization tool that can draw
out shapes and circular lines from the lyrics, revealing a layer of creative work
and the aesthetic focus that artists pursue in their songs.
The Hip-Hop Word Count is a searchable ethnographic database built from
the lyrics of over 40,000 hip hop songs (and growing) from 1979 to the present
(Hopkins, 2011). From this database, linguistic details of hip hop songs can be
explored and compared. As Hemphill suggests, these data can then be used to
not only derive interesting statistics about the songs themselves, but also po-
tentially to describe and explain the culture behind the music. An illustrated
visual on the artist, a particular song, and linguistic information such as total
words, average syllable per word, average letters per syllable, average letters per
word, polysyllabic words, and finally education level (e.g., some high school or
high school graduate) and readability or reading level are provided. Reading
levels are identified as “Readers’ Digest,” “Time Magazine,” and others. In the
following comparison data, adapted from Hemphill’s site, Kanye West’s “Big
Brother” and Tupac Shakur’s “Trapped” received word count scores of 9 and
12, respectively. (The higher the word count, the more sophisticated the lyrics
are, arguably.) Linguistic metrics of the two songs are provided (Figure A3.15).
How can analyzing hip hop lyrics teach us about cultures or subcultures?
Song-level comparisons are potentially interesting to students, especially those
who like this genre of music, but they can also apply this approach to other
genres or extend the comparison to two or more corpora. For example, my
students have always been curious about the differences of vocabulary use and
themes between country music and hip hop. They explore the distribution and
functions of words like love, God, freedom, America, and tagged POS features
such as personal pronouns, verb tenses, passive structures, and nominal modi-
fication. The Hip Hop Word Count also provides time and geographic location
identifiers based on where the artists came from in the US and related compari-
sons of metaphor use and other figures of speech, cultural references, phrase and
rhyme style, meme and socio-political ideas. Hemphill’s database then converts
various data points into an explorable visualization frame that charts “migra-
tion of ideas and builds a geography of language.”
Daniels (2017) used a token analysis method—basically, a type-token ratio—
to determine hip hop artists’ vocabulary range, identifying unique words from
an artist’s first 35,000 song lyrics collected in the corpus. His various results
allowed him to create a master list of who has the most to the least unique and
Figure A3.15 Comparison of "Big Brother" (Kanye West) and "Trapped" (Tupac
Shakur) from the Hip Hop Word Count. Adapted from Hemphill:
http://staplecrops.com/.
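Daniels's measure, the number of unique word types within a fixed window of an artist's first lyric tokens, can be sketched as follows (the toy lyrics are invented, not a real artist's corpus):

```python
def vocabulary_size(lyrics, window=35_000):
    """Count unique word types among the first `window` lyric tokens,
    the capped type count used to rank artists' vocabulary range."""
    tokens = lyrics.lower().split()[:window]
    return len(set(tokens))

# Toy lyrics, not a real artist's corpus:
toy = "started from the bottom now we here started from the bottom"
print(vocabulary_size(toy))  # -> 7
```

Capping the window is what makes the counts comparable: an artist with more songs would otherwise accumulate more types simply by having more tokens.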
Note
1 We plan to launch a full, new version of Text X-Ray upon completion of our
usability tests. If you want to access the beta version, please send an email to
textxray.beta@gmail.com for instructions on how to run the program online.
Part B
Tools, Corpora, and
Online Resources
B1
Corpora and Online Databases
Spoken and Written Registers (2006). Many readers may be familiar with teaching
materials developed from the T2K-SWAL, including those directly addressing
essay writing and test preparation activities.
Before proceeding with the listing of corpora and online databases, I’d like
to share my responses to several questions that I frequently hear from teachers
who attend my CL workshops or conference presentations about corpus data
and its use in the classroom. These are also common topics of discussion with
my graduate students in my CL and Technology and Language Teaching courses.
not provide the best comparable data for these types of studies, but it can be used
to provide beginning-level examples. All ICLE essays were lemmatized and
POS-tagged with the CLAWS Tagger, and these items can be searched using
the built-in concordancer. The ICLE’s current edition, though, is not cheap;
the manual with a CD-ROM of all texts costs 272.22€ for a single-user license
and 580.22€ for multiple users (as of early 2018). The LINDSEI manual with
CD-ROM from the same research team under Sylviane Granger (Université
catholique de Louvain) costs 211.75€ for a single-user license.
The important thing to remember here is that it is good to have a ‘reference’
corpus readily available for a class that incorporates corpus-based activities.
You do not need to purchase corpora, as there are excellent ones that are freely
available. For example, MICUSP texts can be downloaded for free with only
official email registration, and full-text corpus data from several large corpora—
News on the Web (NOW), Wikipedia, COCA, COHA, GloWbE, and the
Corpus del Español—are all available from Davies's BYU corpora web
page (www.corpusdata.org/).
and the examinations of the emails’ electronic structure to the use of electronic
data for machine learning and natural language processing.
One of the first studies published on ICLE texts investigated learners’ use of
high-frequency verbs in writing: in particular, the verb MAKE (Altenberg &
Granger, 2002). The main research questions were: Do learners tend to over-
or underuse these verbs? Are high-frequency verbs error-prone or safe? What
part does transfer play in the misuse of these verbs? To answer these ques-
tions, Altenberg and Granger compared Swedish and French student essays
from ICLE with native speaker essays (students from the US) from a sub-
corpus called the Louvain Corpus of Native English Essays (LOCNESS). Alten-
berg and Granger found that learners, even at an advanced proficiency level,
had difficulty in accurately and consistently using MAKE (in all its lemma
forms) in various contexts. The authors suggested that some of these 'problems' were
shared by the Swedish and French learners, and may be L1-related. There
were many clear limitations and methodological issues about this study, es-
pecially in its conceptualization of L1 transfer and the concepts of over- and
underuse of linguistic features by learners, with NSs as the target for compari-
son. However, there are several interesting applications that clearly show how
a collection of essays like ICLE can be used for classroom activities, engaging
students actively in the discussion and analyses of semantic features of verbs.
The authors reported a normalized rate of 354.3 occurrences of MAKE (all
lemma forms, i.e., make, makes, making, made) per 100,000 words for Swedish learners,
88 Tools, Corpora, and Online Resources
234.6 for French learners, and 339.8 for US students. They then grouped the
meanings of MAKE into eight major categories:
1. Produce something (result of creation): make furniture, make a hole, make a law
2. Delexical uses: make a distinction, a decision, a reform
3. Causative uses: make somebody believe, make it possible
4. Earn (money): make a fortune, a living
5. Link verb uses: she will make a good teacher
6. Make it (idiomatic): if we run, we should make it
7. Phrasal/prepositional uses: make out, make up, make out of
8. Other conventional uses: make good, make one's way
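Rates such as 354.3 per 100,000 words are simple normalized frequencies, which let corpora of different sizes be compared directly; a sketch with invented raw counts:

```python
def rate_per_100k(raw_count, corpus_tokens):
    """Normalize a raw frequency to occurrences per 100,000 words so
    that sub-corpora of different sizes (e.g., ICLE groups vs. LOCNESS)
    can be compared on the same scale."""
    return raw_count / corpus_tokens * 100_000

# Invented raw counts: 708 hits of MAKE lemmas in a 200,000-word sub-corpus
print(rate_per_100k(708, 200_000))  # -> 354.0
```

Without this step, a larger corpus would always appear to 'overuse' any feature simply because it contains more tokens.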
The papers collected for MICUSP were written by students at four different
levels of study: final-year undergraduate and first, second, and third-year graduate
levels. Different types of papers, ranging from essays to lab reports, were collected
from a wide range of disciplines within four academic divisions: humanities and
arts, social sciences, biological and health sciences, and physical sciences. The
corpus enables analyses of disciplinary, developmental, and genre-related phe-
nomena associated with student writing. Each paper in MICUSP also captures
metadata on the student’s gender and ‘nativeness’ status (including information
on first language background in the case of non-native speakers [NNSs]).
The number of papers from the individual MICUSP disciplines ranges from 21
to 104. Some of the well-represented disciplines in terms of papers and word counts
are psychology, English, sociology, and biology. The overall word count is high-
est for the social sciences division (978,254), followed by the humanities and arts
(734,437), and biological and health sciences (511,550), and is lowest for physical sci-
ences (392,288). MICUSP papers have been categorized according to paper types
(e.g., argumentative essay and report), following a systematic data-driven text-type
analysis of all the papers in the corpus (Römer & O’Donnell, 2011). The papers fall
into the following seven text type categories: argumentative essay, creative writing,
critique/evaluation, proposal, report, research paper, and response paper.
[Discussion continued from Section A3] Hardy and Römer (2013) identified and
analyzed the co-occurring, lexico-grammatical features of MICUSP to help char-
acterize successful student writing. Following Biber (1988), they used multidi-
mensional analysis to establish dimensions of frequently co-occurring features
that best account for cross-disciplinary variation in MICUSP. The four functional
dimensions of MICUSP appear to distinguish between: (1) Involved, Academic
Narrative vs. Descriptive, Informational Discourse; (2) Expression of Opinions and
Mental Processes; (3) Situation-Dependent, Non-Procedural Evaluation vs. Procedural
Discourse; and (4) Production of Possibility Statement and Argumentation. They ar-
gued that as writing instruction increasingly spreads from English departments
to writing intensive coursework housed in other disciplines, there is a need to
better understand student writing as it exists in those content areas. Such an
understanding can help English writing teachers address the needs of students
who are beginning writers in the discipline. Figure B1.1 presents a comparison of
disciplines in how they are distinguished in Dimension 1, Involved, Academic Nar-
rative vs. Descriptive, Informational Discourse. This illustration clearly shows how
writing in the humanities (philosophy and education) is different in linguistic
co-occurrence patterns from those in the sciences (physics and biology).
Figure B1.1 Dimension 1 scores across selected MICUSP disciplines: English
(1.990, SD 7.829), Nursing (0.484, SD 6.804), Sociology (0.354,
SD 7.518), Political Science (0.165, SD 4.028), Industrial & Operations
Engineering (-0.758, SD 6.011), and Physics (-8.193, SD 3.983).
teaching, have become very popular in the past several years (Belcher, 2009;
Biber, Reppen, & Friginal, 2010; Johns, 2009). Studies such as Flowerdew
(2005), Gavioli (2005), Hinkel (2002), Hyland (2008b), Jarvis, Grant, Bikow-
ski, & Ferris (2003), and Yoon and Hirvela (2004), to name only a few, recognize
the valuable contribution of corpus-based data in the teaching of academic
writing across disciplines, especially in increasing learners’ awareness of the
textual features of their own writing relative to target (i.e., successful) models.
A poster presentation by Roberts and Samford (2013) explored the class-
room applications of MICUSP (and a few other tools and online databases)
by focusing on extralinguistic features such as patterns across specific learner
characteristics (e.g., nativeness, level, or discipline). They suggest to teach-
ers that various comparisons can illustrate progress in student work across
a semester or program and that the use of MICUSP texts can reveal specific
linguistic features unique to a genre (discipline), informing classroom deci-
sions on the instruction of that particular genre. To fully realize the potential
of these tools, teachers should explore and apply their discoveries to their
specific learner populations. However, as with any tool, care, deliberation,
and familiarity should all be taken into consideration when determining the
appropriateness of the application. Once familiar, MICUSP can be extended
to student use in or outside the classroom. Students can use their own texts
for comparison to see personalized results from their writing, rather than
a pre-compiled dataset. Some of their suggestions to teachers include the
following: (1) Pay attention to your learners’ needs. MICUSP is a collection of
upper-level papers, meaning it may not work for lower levels or non-native
professionals immediately. (2) If there is not an existing specialized corpus,
teachers are encouraged to build one from available materials. (3) Word list,
keyness, or frequency alone should not inform curriculum planning or course
development. A high keyness or frequency may provide a basis for justifying
some focus on a word, but may be deceptive due to the idiosyncratic nature
of writing. (4) Data derived from MICUSP is not meant to replace a teach-
er’s intuition, but rather to better inform and work in conjunction with it
(McEnery, Xiao, & Tono, 2006). (5) Specialized corpora can inform novice
teachers or experienced teachers from different disciplinary backgrounds
charged with teaching an academic English course in a new discipline.
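Point (3) above can be made concrete. Keyness is typically computed by comparing a word's frequency in a study corpus against a reference corpus; a common statistic is Dunning's log-likelihood, used by many keyword tools. The following is a minimal sketch in Python, with invented toy corpora for illustration, not a definitive implementation:

```python
import math
from collections import Counter

def log_likelihood(a, b, n1, n2):
    """Dunning's log-likelihood keyness statistic for a single word.
    a, b: the word's frequency in the study and reference corpus;
    n1, n2: total words in each corpus."""
    expected1 = n1 * (a + b) / (n1 + n2)
    expected2 = n2 * (a + b) / (n1 + n2)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / expected1)
    if b > 0:
        ll += b * math.log(b / expected2)
    return 2 * ll

def keywords(study_tokens, ref_tokens, top=10):
    """Rank words by keyness in the study corpus against the reference."""
    f1, f2 = Counter(study_tokens), Counter(ref_tokens)
    n1, n2 = len(study_tokens), len(ref_tokens)
    scores = {w: log_likelihood(f1[w], f2.get(w, 0), n1, n2) for w in f1}
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

A word that dominates the study corpus but is absent from the reference corpus will rank at the top of the keyword list; as the chapter cautions, such a ranking is a starting point for pedagogical judgment, not a substitute for it.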
life sciences, physical sciences, and social sciences). BAWE was also balanced
according to the number of texts per year. This ensured that the sample would
include equal representation of the levels of students in each subject. Some
subjects (e.g., archaeology, medical science), however, are not taken at all levels
of study, so fewer papers were expected. For medical science, for example, the
Table B1.1 Corpus matrix for the BAWE corpus (Nesi, 2008, 2011)
*This table reflects data from Summer 2012 and includes texts from Jan to Jun 2012.
Corpora and Online Databases 93
corpus planned to include only upper-level students (16 in their final year of
undergraduate, and 48 at the graduate level). Although the primary topic here
(academic writing) is not directly related to sociolinguistics, we used this
example to illustrate a corpus matrix and how you might develop
your own when you design your corpus (Friginal & Hardy, 2014a). Table B1.1
shows the corpus matrix for BAWE, with total number of texts across disci-
plinary groups and subjects.
The BAWE corpus matrix is a good example of how to set out to create
a balanced corpus, one that can help researchers answer questions associated
with the target population. The matrix for the BAWE corpus does not show
balance in terms of how much writing is done in each of these disciplines. In
other words, a discipline like history or philosophy might involve a lot more
writing of prose than some of the applied subjects (e.g., hospitality, health and
social care). The corpus creators, however, were not interested in questions
about the relative amount of writing across disciplines. Instead, they thought
it best to give each subject equal representation, avoiding idiosyncrasies
associated with any particular subject that happens to involve much more writing. Because
the BAWE project was interested in exploring disciplinary variation, the de-
velopers wanted to balance the amount of writing collected from each subject.
However, teachers might be interested in studying how language is used across
registers while putting more importance on the registers that are most fre-
quently used.
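One way to think about a corpus matrix like Table B1.1 is as a simple data structure crossing disciplinary group with level of study, with a target number of texts in each cell. The sketch below is hypothetical: the group names mirror BAWE's broad divisions, but the cell targets are invented for illustration, not BAWE's actual figures:

```python
# A hypothetical corpus matrix for planning a balanced corpus: target
# number of texts per cell (disciplinary group x level of study).
# The cell counts are invented for illustration.
corpus_matrix = {
    "arts and humanities": {"year 1": 8, "year 2": 8, "year 3": 8, "graduate": 8},
    "life sciences":       {"year 1": 8, "year 2": 8, "year 3": 8, "graduate": 8},
    "physical sciences":   {"year 1": 8, "year 2": 8, "year 3": 8, "graduate": 8},
    "social sciences":     {"year 1": 8, "year 2": 8, "year 3": 8, "graduate": 8},
}

def planned_totals(matrix):
    """Total planned texts per disciplinary group, plus the overall total."""
    per_group = {group: sum(cells.values()) for group, cells in matrix.items()}
    return per_group, sum(per_group.values())

per_group, overall = planned_totals(corpus_matrix)
```

Laying out the design this way before collection makes it easy to see, cell by cell, where the corpus is still short of its targets, which is exactly the planning discipline the BAWE matrix illustrates.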
and field (audio recordings, transcripts, and interview notes).” BASE Plus may
be compared with MICASE for dialect comparisons of academic discourse. As
is the case with MICASE (and also T2K-SWAL—see the following), BASE
Plus represents language in academia which does not necessarily feature a large
amount of L2 learner output. The BASE Plus video recordings have been used
in materials development projects at the University of Warwick, most notably
the Essential Academic Skills in English (EASE) series. EASE: Seminar Discus-
sions and EASE: Listening to Lectures are available online (British Academic
Spoken English and BASE Plus Collections, 2017).
Clearly, VOICE and ELF texts are very relevant in teaching English to a range
of international students across countries and areas of study. It is important
to fully capture the structure of ELF and develop classroom activities that will
facilitate a skillful recognition of and appreciation for the importance of this
language variety. Mauranen (2003) argues that the applications of theoretical
and descriptive work to ELF are of considerable practical significance in global
academia. She noted that,
to discuss a subject of his or her choice, from three options. They then contin-
ued the conversation informally by asking follow-up questions from the stu-
dent’s discussion, and the interview concluded with a picture-strip narration.
Each interview lasted approximately 15 minutes, and each was transcribed or-
thographically according to specific guidelines. Background information is also
noted for each speaker, including age, gender, L1, number of years of English at
school, and months living in an English-speaking country (Gilquin, De Cock, &
Granger, 2010). LINDSEI is a rich data source for investigations of lexico-
grammatical phenomena in learner speech and is suitable for a more detailed
analysis of learner interviews as a functional register. Various speech features
are captured in the transcripts, including dysfluent markers, repeats, and addi-
tional annotation (e.g., laughter, scream) as shown in the following excerpts (an
interviewee/student responds to questions about studying and future plans):
the Spanish students and a control group of native English speakers. Other
researchers have used LINDSEI to study fluency and accuracy in learner lan-
guage (Brand & Götz, 2011), grammatical phenomena such as articles and
prepositions (Kaneko, 2007, 2008), or word collocations (Mukherjee, 2009).
A study that I conducted with Brittany Polat (Friginal & Polat, 2015)
explored the various linguistic dimensions of LINDSEI. One of our primary
findings revealed a contrast between the involved conversational style of stu-
dents from countries such as Sweden, the Netherlands, and Germany and the
more informational focus of Japanese students. It was apparent from LINDSEI
transcripts that Swedish students were highly interactive in their interviews
compared to Japanese students, who exhibited very minimal responses. In-
herent cultural factors in face-to-face interviews may have influenced this
difference, especially in how these two groups of students provided elabo-
ration in their responses. Of all the interlanguage groups, the Japanese stu-
dents had the lowest total number of words (36,928), whereas the Swedish
group had 69,301 total words. In addition to first language background and
related cultural and social variables, there are other learner factors that can
also be investigated further. Average length of responses from students and,
more importantly, language fluency factors may be contributing to how stu-
dents make use of conversational features of discourse. Short and simplified
student responses have typically focused on content words and have signifi-
cantly fewer informal characteristics of speech such as transition words (e.g., I
mean, you know) and vague references (e.g., stuff, thing, all those). These types
of results may be used in teaching conversational English to some groups of
students in primarily English-speaking countries.
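Because the national subcorpora differ considerably in size (e.g., 36,928 vs. 69,301 words), comparisons of features such as I mean or you know are normally made per 1,000 words rather than as raw counts. A minimal Python sketch of this standard normalization follows; the mini-transcript is invented for illustration:

```python
import re

def rate_per_1000(phrase, text):
    """Occurrences of a phrase per 1,000 words: a standard normalization
    that makes corpora of different sizes directly comparable."""
    n_words = len(text.split())
    hits = len(re.findall(re.escape(phrase), text, flags=re.IGNORECASE))
    return hits * 1000 / n_words

# Invented mini-transcript for illustration only
excerpt = "well I mean I like studying you know and I mean travel is fun"
```

With rates per 1,000 words, a feature count from the small Japanese subcorpus can be set directly against one from the larger Swedish subcorpus without the size difference distorting the comparison.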
students have resided in the foreign country prior to the recorded conversa-
tions, whether or not they recall having had previous conversations with this
lecturer outside of class, and how long each day individual students listen to
and speak in English while on Erasmus (MacArthur et al., 2014), is included.
What is unique about EuroCoAT is its inclusion of additional observa-
tional information about speakers during the course of the discussions such
as their positioning, their apparent comfort as they were recorded, and other
general observations. Additionally, the corpus includes information derived
from questionnaires completed by participants after the discussions reporting
their impressions about the ‘naturalness’ of the conversation, their comfort level
throughout, and the similarity of the recorded conversation to other conver-
sations they’d ordinarily have during regular office hours. Specific participant
positioning information, which could potentially be useful in multimodal
studies, is also included. Information is provided as to how the participants
were seated and their postures, what they were sitting on (e.g., swivel or stable
chair), their perspective with respect to the recording camera (e.g., centered or
to the left or right of the camera), toward what or whom they were facing, and
so on. Additional detailed information about the physical environment within
which the discussions occurred is also included: for example, background and
foreground of the room, furniture and equipment in the room, and the location
of the windows and doors (MacArthur et al., 2014).
Topics                      Do you agree or disagree with this statement? Use reasons
                            and specific details to support your claim:
                            (A) It is important for college students to have a part-time job.
                            (B) Smoking should be completely banned at all the restaurants
                            in the country.
Time allotment              60 seconds for one speech; 20 to 40 minutes for one essay
Length                      Speech: not controlled (speakers were asked to speak as long
                            as they want during the collection period); essay: 200 to 300
                            words (±10%)
Dictionary use allowed?     Speech: No; essay: No
Spell-checker use allowed?  Speech: No; essay: Yes
and US news from the BBC and NPR). Other corpora of English varieties are
described in the following.
Leeds) and those that are developed by unaffiliated individuals or groups (e.g.,
BNCWeb at Lancaster University; Intellitext: “Intelligent Tools for Creating
and Analysing Electronic Text Corpora for Humanities Research”; BNCWeb
at Oxford; Phrases in English (PIE) and the BNC; Davies’ BYU-BNC interface)
(Friginal & Hardy, 2014a).
Update: In late 2017, the Spoken BNC2014 was publicly released
by Lancaster University and Cambridge University Press. This new addition
to the BNC contains 11.5 million words of transcribed informal British En-
glish conversation, recorded mainly by speakers of British English from
2012 to 2016. The recordings consist of casual conversation among friends and
family members and are designed to make the corpus broadly comparable to
and consistent with the demographically sampled original components of the
spoken BNC. This new addition is accessible online for free for “research and
teaching purposes.” Accessing the texts requires the creation of a free account
on Lancaster University’s CQPweb server (https://cqpweb.lancs.ac.uk/). Other
related information about the Spoken BNC2014 and a downloadable manual
and reference guide are also available.
Written Registers
911 Reports: Full text of reports released on July 22, 2004 by The National
Commission on Terrorist Attacks Upon the United States (Government,
technical; 17 texts; 281,093 words)
Biomed: Technical articles by American authors obtained from BioMed Central,
which publishes open access, peer-reviewed biomedical research articles
(Technical; 837 texts; 3,349,714 words)
ICIC: The Indiana Center for Intercultural Communication corpus of
Philanthropic Fundraising Discourse consists of fundraising texts, including
case statements, annual reports, grant proposals, and direct mail letters
(Letters; 245 texts; 91,318 words)
NY Times: The New York Times component of the ANC consists of over 4,000
articles from The New York Times newswire for each of the odd-numbered
days in July 2002 (Newspaper; 4,148 texts; 3,625,687 words)
106 Tools, Corpora, and Online Resources
500 texts of approximately 2,000 words each, all from the same registers
(news, parliamentary debates, lectures, etc.) and all dating from 1990 or later
(Nelson, 1996). The authors and speakers are all 18 years of age or older, were
either born in the target country or moved there at an early age, and were
educated in their respective countries through the medium of English. The
three primary goals of Greenbaum and his team in collecting these data were:
(1) to sample standard varieties from other countries where English is the first
language, such as Canada and Australia; (2) to sample national varieties from
countries such as India and Nigeria where English is an official second lan-
guage; and (3) to include both spoken and manuscript English as well as printed
English (Greenbaum, 1996). The ICE project has research teams in each of
the following countries: Australia, Cameroon, Canada, East Africa (Kenya,
Malawi, Tanzania), Fiji, Great Britain, Hong Kong, India, Ireland, Jamaica,
Kenya, Malta, Malaysia, New Zealand, Nigeria, Pakistan, the Philippines,
Sierra Leone, Singapore, South Africa, Sri Lanka, Trinidad and Tobago, and
the US. The spoken and written registers collected by the research teams for
the ICE project are shown in Table B1.5.
The ICE was intended primarily for comparative studies of emerging En-
glishes all over the world alongside ‘native-Englishes.’ The Asian varieties of
English available for free download from the ICE website feature countries/
territories where English has been used extensively as the language of business
and education. Although academic spoken language is very limited in ICE,
there are useful comparisons of spoken and written texts in professional settings
Table B1.5 Spoken and written registers of the International Corpus of English
Spoken Texts (300 2,000-word samples) Written Texts (200 2,000-word samples)
that may directly relate to academic discourses. Transcripts of class lessons, of-
ten with teacher and student interactions (mostly from teacher lectures), may be
extracted and compared across country groups.
B1.5 Online Collections
to the BNC/ANC in terms of text types, with deviations especially with texts
included in its spoken data component. COCA’s spoken texts (20% of the cor-
pus) come from television news and interview programs for the most part and
not from the types of conversation data (e.g., face-to-face conversation, service
encounters, and telephone interactions) available in the BNC or other corpora,
such as the Longman Corpus. Davies (2009) maintains, however, that COCA’s
overall balanced composition means that researchers can compare data across
registers and achieve relatively accurate results that show patterns of change in
the language from the 1990s to the present. Related to COCA are the COHA,
also from Davies, and the Google Books Ngram Viewer or Google Books Cor-
pus collected by Google Inc., which are both time-stamped from the 1800s to
the present. The current list of BYU corpora is provided in Table B1.6.
The Corpus of Global Web-Based English (GloWbE, pronounced globe)
has a staggering 1.9 billion words from 1.8 million web sources in 20 different
English-speaking countries. Texts are grouped according to where they came
from online (e.g., websites, web pages, or blogs) and the English dialects they
represent. The 20 countries currently in GloWbE include native varieties, such as
the US, Canada, the UK, Ireland, Australia, and New Zealand, and non-native
varieties, such as India, Sri Lanka, Pakistan, Bangladesh, Singapore, Malaysia,
the Philippines, Hong Kong, South Africa, Nigeria, Ghana, Kenya, Tanzania,
and Jamaica. Davies released the corpus and its online interface in April 2013.
Comparing corpus size, GloWbE is more than four times as large as COCA and
nearly 20 times as large as the BNC. Dialect studies with GloWbE can cover
international varieties of English, as they appear online, with cross-comparisons
focusing on British and American English texts (with more than 775 million
words of text for just these two dialects). Davies’s Corpus del Español is a 100
million-word diachronic corpus of Spanish (1200s–1900s), funded by the US
National Endowment for the Humanities. The corpus in-
terface (similar to COCA and COHA) allows users to search frequency data
and use of words, phrases, and grammatical constructions in different historical
periods. Registers of Modern Spanish (e.g., fiction, news) are also categorized.
that span from 1550 to 2008, although texts before the 1800s are extremely
limited and often there are only a few books per year. From the 1800s, the cor-
pus grows to 98 million words per year; by the 1900s, it reaches 1.8 billion, and
11 billion per year by the 2000s (Michel et al., 2011). Google Ngram Viewer
is composed of raw data that are encoded by the number of n-grams, adjacent
sequences of n items from a text.
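Google Ngram Viewer's underlying unit, the n-gram, is simple to compute: slide a window of n tokens across a text and record every contiguous sequence. A minimal Python sketch follows (the example sentence is invented):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-item sequences from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
bigram_counts = Counter(ngrams(tokens, 2))
# this six-word sentence yields five bigrams, each occurring once
```

Counting these sequences across millions of books, year by year, is what lets the viewer chart the rise and fall of words and phrases over time.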
It is important to note that Google Ngram Viewer was not necessarily cre-
ated with language teaching or research in linguistics (or corpus linguistics)
as its primary application (Michel et al., 2011). The developers of the viewer
wanted to create a new approach to humanities research that they coined
Culturomics. Culturomics (www.culturomics.org/home) is a way to quan-
tify culture by analyzing the growth, change, and decline of published words
over centuries. This would make it possible to rigorously study the evolution
of culture using distributional, quantitative data on a grand scale (Bohannon,
2010). In an effort to prove the adequacy of and to provide clear impetus
for Google Ngram Viewer, Google-affiliated researchers have conducted a
series of studies to validate the usefulness and various applications of their
new program. One study, for example, showed that over 50% of the words in
the n-gram database do not appear in any published dictionary (Bohannon,
2010). In addition, it is argued that patterns and cultural influences of words
could be clearly established and tracked across time frames. The use of Google
Ngram Viewer and Culturomics, therefore, contributes academic and techni-
cal value to the study of culture, making it arguably a new cultural tool that
has several possibilities, including, of course, the teaching of new vocabu-
lary and syntactic patterns of English and other languages (Friginal & Hardy,
2014a).
Español Actual (CEA), the Corpus of Contemporary Spanish, has over 540
million words, lemmatized and POS-tagged. The CEA is made up of the fol-
lowing texts: (1) the Spanish part of the 11-language parallel corpus Europarl:
European Parliament Proceedings Parallel Corpus, v. 6 (1996–2010); (2) the
Spanish portion of the trilingual Wikicorpus, which was obtained from Wiki-
pedia (2006); and (3) the Spanish part of the seven-language parallel corpus
MultiUN: Multilingual UN Parallel Text 2000–2009. The MultiUN corpus
utilizes texts of United Nations resolutions. The CEA provides a dedicated
online search interface based on CQPweb, which can be used to search for words,
lemmas, or sentence constructions.
B2
Collecting Your Own
(Teaching) Corpus
Collecting a do-it-yourself (DIY) corpus for class use is certainly doable and,
when planned properly, quite manageable and (almost) fun! It does, however,
require a lot more time and effort—some corpora more than others—to con-
struct a pedagogically sound collection of texts. As discussed in Section A1, a
principled collection must include consideration of a few very important fac-
tors, not only related to the texts and their linguistic properties but also to
specific contexts, approvals and consents, and ethical concerns. Before you start
the actual collection of texts for your corpus, it is important that you first de-
velop your goals, objectives, and research questions if you plan to also use them
for related research. Although much can be understood through exploration
of any dataset, the way that your corpus will be organized, compiled, and
analyzed will depend upon your specific teaching goals and the questions you
want to answer.
Following a systematic procedure when collecting your corpus is quite sim-
ilar to developing your course syllabus after careful consideration of all your
contextual variables. You seek more information about your students, their
specific backgrounds, their goals before and after completing your course,
and the various activities and evaluation items that you plan to use. For materials
development and classroom research based on your own corpus, there are im-
portant considerations essential to maintaining a sound design. Remember
that your corpus texts are like your ‘participants’; in other words, each text
is an individual observation. Viewed this way, the same sampling issues
associated with other social science research apply to creating a corpus for
teaching purposes. Each text is important and will show features
that represent a sampling of your target population. The following sections
describe what to do before, during, and after your collection of a corpus spe-
cifically intended for your classroom.
3 Publishing and sharing. You will most likely not have to publish or share
your teaching corpus, but you may need to have it readily available for your
next course or for your students to download their own copy for projects or
homework activities. Using Dropbox or Google Drive could be your best
option for sharing your texts with your students. I suggest that you create
an ‘agreement document’ signed by your students as a form of contract
with them, indicating that they will only use the data for class purposes
and will not disclose potentially personal or private information about in-
dividuals or other students that may have been captured in the texts. You
can also ask your students to discard the texts after they have completed
your course.
corpus tools that show generalizable, quantitative data of the linguistic character-
istics of academic writing. I argued that it would be interesting, and potentially
useful, to examine the quality of NS/NNS writing, and produce comparative
data that illustrate the degree of variation, gaps, or unique distribution of salient
linguistic features of writing among and between these groups of students. The
descriptions of the linguistic characteristics of writing samples from NS/NNS
have important pedagogical implications that apply directly to materials pro-
duction and syllabus design, to aid the development of L2 writing for NNS, and
support easier transition to advanced, genre-specific writing for NS students.
My focus here, specifically, was upon IEPs in US universities which are
tasked to prepare international undergraduate and graduate students to meet
the rigors of formal and informal academic writing across disciplines. Writing
ability, numerically quantified and measured by TOEFL and IEP test scores,
suggests whether or not a foreign student will be able to successfully meet the
writing requirements of his/her field of specialization. Conversely, many writ-
ing instructors in freshman composition classes have reported the surprising
range of writing abilities of NS students, which points to the importance of
further studying their usage of linguistic features typically identified as in-
dicators of writing quality (e.g., transition words, epistemic adverbials, pas-
sive/active structures, verb tense/aspect). Relevant and important questions,
then, were: How well do NS students actually write in English when given the
same prompts developed for international students, that is, TOEFL
prompts? What is the linguistic composition of effective or ineffective essays
from TOEFL prompts as written by students with varying first language back-
grounds and levels of fluency in English? What do results of these comparisons
imply, particularly with respect to ETS and various IEP programs in the US, in
the design of writing tests and assessment of NNS writing, especially if some
NS data also show clear and specific limitations in the quality of academic
writing of NSs of English? My corpus focused on these goals and was developed
to provide exploratory data and answers to these and other related questions.
The importance of a corpus-based comparison of learner (i.e., NNS)
TOEFL essays with a target NS corpus is supported by previous research
such as Granger, Hung, and Petch-Tyson (2002), Mukherjee and Rohrbach
(2006), Crossley and McNamara (2009), and O’Donnell, Römer, and Ellis
(2013), to name only a few. By comparing NS/NNS corpora, for example,
instances of L2 learners’ preferences (I hesitate to refer to under- or over-use of
a particular linguistic feature) are documented directly to show areas where
NNSs may have limitations or immediate successes in full-time undergradu-
ate and graduate writing contexts. The Santa Barbara Corpus of Spoken and
Written American English and the T2K-SWAL have been used to describe the
linguistic features of US-based academic writing against L2 writing, as well as
to design language teaching materials addressing the needs of foreign students
in US universities. However, in most of these previous comparisons, the types
of writing, including the use of prompts, and the circumstances of written pro-
duction (e.g., timed, with required word-limit and assessed for quality) have
not been methodically controlled in the corpus collection. There is a need to
carefully design parallel corpora of students’ writing that will clearly show the
influence of variables such as first language background, overall language abil-
ity, and prompts in written outputs. By using two similar TOEFL prompts and
following the same instructions and conditions of production from those uti-
lized in the existing corpus of NNS TOEFL responses, the NS corpus provided
appropriate baseline data about the range of NS writing in US universities.
The NS essays represent ‘real-world’ samples of writing foreign students could
expect from their US peers. In addition, the use of a controlled corpus of NS/
NNS essays addresses essential research validity and reliability issues in these
types of corpus-based comparisons.
My NS/NNS TOEFL responses corpus maintained a balanced number of
texts (and students) per group. There were 320 NNS and 310 NS students who
responded to two essay prompts using a computer, with 30 minutes provided
per prompt. All the essays (N = 1,260) were evaluated for quality of writing
following a rubric that is similar to the one used in TOEFL assessment. The
NNS essays totaled 140,800 words (an average of 220 words per
essay), while the NS group totaled 192,200 words (an average of 310 words per
essay). My initial analyses focused on the following features (Table B2.1), first to
obtain comparison data and second to develop IEP teaching materials based on
these results. Results and particular teaching implications of these comparisons
are discussed in Friginal, Li, and Weigle (2014) and Weigle and Friginal (2015).
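As a quick check, the totals reported above follow directly from the design: each student responded to two prompts, so the essay and word counts can be reproduced from the figures given in this section alone:

```python
# Consistency check of the corpus figures: 320 NNS and 310 NS students,
# each responding to two essay prompts.
nns_essays = 320 * 2                   # 640 NNS essays
ns_essays = 310 * 2                    # 620 NS essays
total_essays = nns_essays + ns_essays  # N = 1,260 essays overall
nns_words = nns_essays * 220           # at an average of 220 words per essay
ns_words = ns_essays * 310             # at an average of 310 words per essay
```

Simple bookkeeping of this kind is worth doing for any DIY corpus: if the per-group totals do not reproduce from the design, either the collection or the documentation has drifted.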
I propose a study that combines both situational and textual analyses of un-
dergraduate student writing at a private, two-year college that offers students
a small-campus, intensive liberal arts experience. My study’s aim is to explore
issues of student writing in different disciplines along Flowerdew’s (2002) con-
textual to textual continuum. During the study, opposing points of the contin-
uum can be used to inform the understanding of a given analysis. For example,
as I conduct qualitative interviews and focus groups, my questions may be
influenced by my understandings of the texts. Similarly, textual analysis will
be informed and understood by the situational characteristics and functional
purposes of the assignments. These are the research questions for this study:
1 How does context (e.g., the school, faculty expectations, and disciplinary
differences) influence the writing practices of lower-level undergraduate
students?
2 What textual features are associated with successful writing of lower-level
undergraduate students?
3 How does context interact with rhetorical and linguistic forms?
The Context
The setting is a two-year institution that awards associate of arts (AA) degrees
to students. Upon graduation, the majority of students transfer to a ‘receiv-
ing,’ nationally recognized, private university in the same state to finish their
undergraduate careers. This private university ranks among the “Top 20” of
national universities in the US, and its students are held to high standards. The
proposed study will include popular fields or majors in the AA institution such
as psychology, biology, political science, physics, English, and philosophy.
In terms of writing practices, students in this two-year college are only re-
quired to take one semester of freshman reading and writing. However, many
students complete Advanced Placement (AP) tests to avoid this course. Also, all
students who graduate from the college must take another course identified
as writing-rich. Such courses emphasize learning disciplinary
conventions, writing as a process, and creating a polished product (or prod-
ucts) of at least 20 typed pages total. Many professors prefer to assign multi-
ple, shorter assignments that can be revised and resubmitted over the course
of the semester instead of a single, long paper due at the end of the semester.
The international student population in this college has increased dra-
matically over the last 10 years. In fact, the freshman class of 2016, for
example, consists of nearly 27% international students. This newly linguis-
tically heterogeneous situation has prompted much discussion and attempted
pedagogical reforms among faculty at the institution. Professors’ pedagogi-
cal practices that once assumed high levels of English proficiency from their
students have needed to be adjusted, and more English language help has
been necessary in the composition courses and the campus’ writing center.
One aspect of this context that will facilitate the data collection pro-
cess is that the target institution is primarily a teaching college. Although
associated with a research university, the faculty members at the two-year
college are not expected to publish as much in their discipline as those on
the receiving campus. Instead, many faculty members specialize in research
that involves pedagogical practice. Many of them have also expressed in-
terest in collaborating with me in the pedagogically based subareas of their
respective disciplines. Others have invited me to help their students write
for their classes. For these reasons, data collection is not anticipated to be as
great a burden as it has been for many other compilers of similar corpora.
Preparation
Having taught for a year at this college, I regularly met with other faculty
members and attended division and faculty meetings. Because I taught my
freshman composition course using an EAP approach, my students wrote
targeted pieces as if they were writing in psychology, then anatomy and
physiology, then education. Finally, students conducted their own mini
genre studies and wrote research papers as if they were in an applied lin-
guistics course. Their area of investigation for these research papers was
the academic discipline(s) they were considering majoring in upon comple-
tion of their AA degree. They became ethnographers and textual analysts
of their own intended future discourse communities, as recommended by
Johns (1995, 1997). Students interviewed faculty members, surveyed stu-
dents, familiarized themselves with syllabi and degree requirements, and
analyzed linguistic and rhetorical features of writing.
Disciplines
Although the best representation of student writing as a whole would come
from simple random sampling, the purpose of the proposed corpus is to
compare disciplinary writing. Thus, more purposeful sampling is needed to
avoid collecting far more writing from courses that require extensive writing
(e.g., English courses) than from their natural science counterparts. In the
planning of the corpus, three disciplinary
groupings have been made: humanities (HU), history and social sciences
(SS), and natural sciences and mathematics (NS). These three academic
divisions were predetermined because they exist as such at the target insti-
tution. Students at this school must also take courses from each division to
complete their AA degree after two years. Unlike the large-scale corpora of
Table B2.2 Corpora of student writing, organized by discipline

“AA corpus” (proposed)             MICUSP                           BAWE corpus
Humanities (HU)                    Humanities and Arts              Arts and Humanities
  • Philosophy                       • Philosophy                     • Philosophy
  • English                          • English                        • English
History and Social Sciences (SS)   Social Sciences                  Social Sciences
  • Psychology                       • Psychology                     • Politics
  • Political Science                • Political Science
Natural Sciences and               Biological and Health Sciences   Life Sciences
Mathematics (NS)                     • Biology                        • Biological Science
  • Biology                                                           • Psychology
  • Physics and Astronomy
student writing, however, only two disciplines from each division have been
selected because of time constraints.
In the same way that the BAWE corpus was influenced in its choice
of disciplines by MICASE, so too was the proposed corpus influenced by
other corpora of student writing. This will allow for cross-corpus compar-
isons. The disciplines to be examined are all present in both MICUSP and
the BAWE corpus, although grouped slightly differently by larger academic
area. Table B2.2 shows the disciplines under investigation and their coun-
terparts in the other corpora. One may notice that the BAWE corpus in-
cludes psychology in the area of life sciences and that the BAWE corpus and
MICUSP have a separate group, physical sciences, for physics.
Consent
Before any papers are collected, students will be asked to complete a con-
sent form. All of the students who agree to participate will also be asked to
fill out a brief questionnaire. This questionnaire will ask for information
such as age, gender, first language background, and intended major (students
at this school are not able to officially declare a major).
A separate section of the questionnaire, including another consent form,
will ask this group of students if they would like to also participate in a qual-
itative follow-up study. These students will provide their email addresses
to be contacted after the corpus has been collected. A master list of par-
ticipating student information will be compiled and saved digitally. Code
numbers will be assigned to the students to maintain confidentiality when
dealing with their writing samples and future uses of the corpus.
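The code-number workflow described above can be sketched programmatically. The following Python snippet is only a minimal illustration of the idea; the field names, ID format, and file location are my own assumptions, not part of the proposed study design:

```python
import csv
import os
import tempfile

def build_master_list(participants, out_path=None):
    """Assign sequential code numbers to participants and save the
    master list digitally; corpus files then carry only the codes."""
    if out_path is None:
        # Illustrative location; a real study would use secure storage.
        out_path = os.path.join(tempfile.gettempdir(), "master_list.csv")
    master = [{"code": f"S{i:03d}", **p}
              for i, p in enumerate(participants, start=1)]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(master[0]))
        writer.writeheader()
        writer.writerows(master)
    # Mapping used to rename writing samples, e.g., "Jane Doe" -> "S001".
    return {row["name"]: row["code"] for row in master}

codes = build_master_list([
    {"name": "Jane Doe", "age": 19, "l1": "Spanish", "intended_major": "Biology"},
    {"name": "John Roe", "age": 20, "l1": "Korean", "intended_major": "English"},
])
print(codes["Jane Doe"])  # S001
```

Keeping the name-to-code mapping in a single master file, separate from the writing samples themselves, is what allows the corpus to circulate later without exposing participant identities.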
at least minimally, the areas I am interested in. However, I will also allow
these students to deviate and continue to expand on the topic if they want.
With this method, there is potential for them to bring up issues of disci-
plinary student literacy practices that I had not previously considered.
Participating faculty members will also be interviewed. Ideally, at least
two faculty members from each discipline will be interviewed, providing
multiple instructor perspectives on the situational context.
Toward the end of the semester in which the corpus is collected,
the 12 interviews should be conducted. This would provide time for me to
familiarize myself with some of the assignments and texts.
One important purpose of these interviews will be to clarify any ques-
tions about the instructors’ syllabi and the tasks that were assigned. In
addition, I will ask questions to better understand how the faculty mem-
bers view student writing in the courses they teach. These interviews will
be vital to understanding the expectations that faculty members have of
the writing of their students. The questions to be investigated relate
to the first research question: how does the context, in this case the
faculty, view and potentially influence the writing practices of their students? Do
professors have students write to display knowledge? Or are there other
intentions (e.g., writing to learn, learning disciplinary practices, critical
thinking)?
The questions to faculty members will be driven by specific assignments
and student papers. Participants will be asked about what made those
samples successful. In a way, the semi-structured interviews involve asking
general questions about perceptions of writing but will be influenced by a
method similar to stimulated recall. That is, professors will be given a stim-
ulus (previously graded papers) to help them revisit and externalize their
beliefs and memories of the assignments.
Table B2.3 Gray’s corpus description in number of words (30 texts per discipline/
category)
productivity, and the prestige related thereto, are, in part, measured by the
potential global circulation of Iraqi studies, presently more prevalent in the
form of written genres, in English (or translated in English from Arabic), that
follow strict writing conventions and expectations. Numerous attempts at in-
ternational publication, which also result in numerous failures or rejections, are
a common experience of faculty researchers in Iraq. In recent years, research
mentoring programs and collaborations between Iraqi scholars and researchers
based in universities such as those in the US have been funded by local and
international grants.
Current research in academia draws heavily on the immediate transfer and
dissemination of data, methodologies and approaches, and results that are made
possible by easy access to online journals and databases. These sites serve as
the primary sources of information for literature reviews, establishing research
gaps, and meta-analyses across settings and contexts. International scholars
are able to keep abreast with the contemporary status of work conducted in
their areas of specialization, and updated syntheses of findings are immedi-
ately shared globally (Cilveti & Perez, 2006). However, this rapid increase in
scholarly activity and the immediate publication of a wide range of studies pro-
duce many interrelated challenges, especially for international researchers (or
scholars outside of institutions in the US, the UK, and similar countries). With
English being the dominant language and the primary medium of publication
for the majority of academic or scientific research papers (Eggington, 2004;
Laborde, 2011; Swales, 1985), language use and the structuring or formatting
of research articles are expected to conform to international, very competitive
standards.
A cross-disciplinary comparison of abstracts has been previously con-
ducted by Melander, Swales, and Fredrickson (1997), using abstracts produced
by American NSs and Swedish NNSs in disciplines including linguistics,
biology, and medicine. In a similar vein, but focusing on cross-linguistic
comparison, Hu and Cao (2011) examined the use of hedges and boosters as
metadiscourse markers in English and Chinese abstracts, revealing patterns
of linguistic differences in how Chinese and English abstracts utilize hedging
structures in reporting results and making generalizations or conclusions.
Lopez-Arroyo and Menendez-Cendon (2007) also described and compared
the rhetorical and phraseological structures of medical research article ab-
stracts produced by English and Spanish authors. They found evidence of
cross-linguistic influences in phraseology, vocabulary use, tenses, and use of
other related discourse markers. Similarly, Kafes (2012) conducted a contras-
tive analysis of abstracts written by American, Taiwanese, and Turkish schol-
ars in the social sciences, reporting differing patterns of rhetorical structures
potentially influenced by the authors’ culturally and linguistically diverse
backgrounds. Kafes suggested that comparative results generally reflected the
and (3) summary of major conclusions (Mustafa, 2015). For example, many
nursing RA abstracts have explicit subsections for Objectives, Methodology,
Results, and Recommendation. These abstracts are often presented separately
from the article, and the onus is on authors to make the abstract as
‘stand-alone’ as possible.
Of the 675 Iraqi abstracts, 53% have at least two authors (compared to 34%
for US-based abstracts). Coauthoring for Iraqi scholars and professionals is often
encouraged not only to foster collaborative work but also to add particular
experts who may focus specifically on such areas as statistics, research approaches,
or the actual writing process in English. Iraqi authors in the corpus all have
MA/MS or PhD degrees and have different academic titles/ranks starting with
the level of university teaching staff. Some authors were affiliated with uni-
versities outside Iraq. Finally, American and British English conventions
(e.g., in spelling, vocabulary use, and some syntactic structures) are
observed in the Iraqi abstracts. Some RA manuscripts are reviewed and checked
by designated “specialists in English” employed by local journal publications.
For these Iraqi journals, authors writing in English are also required to translate
their abstracts into Arabic.
The video camera was positioned in the back corner of the classrooms. It re-
corded the teachers’ linguistic and non-linguistic behaviors and learners’ speech
when they were interacting with the teachers in mostly whole-classroom for-
mats. Since the learners and teachers did not wear clip-on lavalier microphones,
however, it was difficult to capture most of their speech when learners and teach-
ers interacted during individual, pair, or group tasks. Additionally, since
three of the classes (i.e., the oral communication and reading/listening
classes) included student presentations, lessons were recorded on days that
consisted of more regular academic and language tasks such as vocabulary,
grammar, reading, writing, and listening activities. Therefore, the recordings were mostly of
instructor and learner talk during whole-class interactions. The first record-
ings occurred in Weeks 3 and 4; four consecutive lessons were then recorded
in Weeks 6–9; and the last recordings occurred in Weeks 11–14. All 24 video-
recorded lessons were transcribed verbatim, including dysfluencies, following the
transcription conventions shown below (adapted from Jefferson, 2004).
T Teacher
S1, S2, etc., Identified student
SU Unidentified student
Ss Several or all students at once
- Interruption; abruptly cut-off sound
, Brief mid-utterance pause of less than one second
. Final falling intonation contour with 1–2 second pause
? Rising intonation, not necessarily a question
(P: 02) Measured silence of greater than 2 seconds
X Unintelligible or incomprehensible speech; each token refers to one
word
<LAUGH> Laughter
() Uncertain transcription
{} Verbal description of events in the classroom
(()) Nonverbal actions
Italics Non-English words/phrases
// Phonetic transcription; pronunciation affects comprehension
ICE Capitals indicate names, acronyms, and letters
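Because the speaker markers above are systematic, they can also be exploited computationally. The short Python sketch below splits transcript lines into speaker codes and utterances and tallies turns per speaker category; the regular expression and category labels are my own assumptions for illustration, not part of the L2CD transcription scheme itself:

```python
import re
from collections import Counter

# Speaker markers follow the conventions above: T, S1/S2/..., SU, Ss.
SPEAKER = re.compile(r"^(T|S\d+|SU|Ss):\s*(.*)$")

def tally_turns(lines):
    """Count turns per speaker type: teacher (T), identified students
    (S1, S2, ...), unidentified students (SU), several at once (Ss)."""
    counts = Counter()
    for line in lines:
        m = SPEAKER.match(line.strip())
        if not m:
            continue  # continuation line of the previous turn
        code = m.group(1)
        if code == "T":
            counts["teacher"] += 1
        elif code[1:].isdigit():
            counts["identified student"] += 1
        else:
            counts[code] += 1  # SU or Ss
    return counts

sample = [
    "T: i want you to say a component.",
    "S5: music.",
    "T: good.",
    "S5: music. dance.",
    "SU: religion?",
]
print(tally_turns(sample))
```

A tally like this is a quick first check on a classroom corpus, for example to confirm that teacher talk dominates whole-class interaction before running any finer-grained analysis.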
The transcripts of the video-recorded lessons made up the L2CD corpus. Table
B2.5 provides a full description of the L2CD corpus. As previously mentioned,
it consists of 24 complete lessons, and the size of the corpus is 179,638 tokens.
Table B2.5 Description of the L2CD corpus
The following text excerpt is illustrative of typical learner contributions in
teacher-student interactions in the L2CD.
T: i want you to say a component. don’t worry, we’ll we’ll we’ll work with
that. here.
S5: music.
T: good.
S5: music. dance.
T: some other ones from the audience. music. dance. okay. so let’s see what we
have from the group over there. traditions. behavior.
S3: subculture.
T: food. what what Azeem?
S3: subculture.
T: speech.
S5: religion.
T: so we could say
S5: reli
S4: religion
S5: religion is the is different.
T: let’s put speech here.
S10: belief.
T: and we got behavior here. what else.
S10: i think religion is part of belief.
SU: religion?
S5: no n- no.
SU: x belief.
T: good. religious beliefs. what’s a value.
S7: what is the value.
T: either give me an example or tell me what value means.
S16: honesty. honesty.
S10: individualism.
S6: collectivist.
S4: collectivist. collectivism.
T: wow. we have some experts in here, i can see.
11 Is there anything you want to change about your English learning experience?
12 Is there anything else you want to discuss about your English learning
experience?
Participant Comments

Participant 20 (French): That’s pretty much all how our classes were, we just had grammar and only grammar. That’s why I think we can be ok at grammar but we’re really bad at talking. Because we just don’t practice a lot, so it was just practice about grammatical things and everything so it could be really boring, but that’s how we got our bachelor’s, so.

Participant 37 (Chinese): I learn grammar because in China the English teacher they teach a lot of grammar. That’s how I learn grammar, especially in the high school. I believe the major part of the English exam is about the grammar, about how you write your sentence, your vocabulary, all grammar. It’s only about twenty percent about the listening, and there’s no speaking test in china in English exam. Yeah all about grammar I think, at least sixty percent in my opinion.

Participant 86 (Korean): I saw many problems in Korea, when it comes to learning English. Because we only focus on the reading and grammar, and sometimes listening, but students cannot actually write in English and speak in English…. Because many Korean students actually hate learning English, because it’s really stressful, and it’s not fun, because they always focus, memorize the vocabulary and memorize the grammar rule and those kind of things, that makes students hate English. So but I think, speaking is really important.
paper, and the inclusion of a thesis statement that signaled the development of the
rest of the paper. After group discussion of and answering any questions about
the handout, students exchanged papers and silently read each other’s drafts and
made brief notes on the paper as they read it relative to the guiding questions. To
encourage subsequent peer-peer verbal interaction rather than a simple exchange of
written feedback, students were told to make notes sufficient to support later
verbal feedback, with the majority of that feedback to be delivered orally
during the subsequent conversations with their partners.
A digital recorder was placed on the desk between the pairs of students, and
when they were ready to begin giving their feedback to each other, they decided
who would go first, and the discussions began. After the first partner had com-
pleted the feedback to the other, they switched roles and repeated this process.
The notes they had made on the papers earlier relative to the discussion handout
questions served as reminders as they, in turn, delivered the oral feedback to their
partners. Trained research assistants later transcribed the conversations, referring
to the first student delivering the feedback as “S1” and the second as “S2.” These
abbreviations were subsequently eliminated, together with any irrelevant nota-
tions describing the environment (e.g., “papers shuffling in the background,”
etc.) when the transcripts were converted to text files so that the corpus consisted
entirely of student talk. Table B2.8 presents the composition of the L2PR cor-
pus, displaying the number of words in each text. In this corpus, each text is a
transcript of a pair’s discussion about one student’s paper (Friginal et al., 2017).
At each peer response session, there were two papers discussed by each pair,
and thus two transcripts generated by each pair. For example, in their first ses-
sion, the transcript of pair number one’s discussion of the first paper was 450
words long, and their discussion of the second paper was 459 words long. Each
pair participated in three peer response sessions over the course of the semes-
ter, with the exception of two pairs. Row totals show the number of words
generated by each pair across three sessions. Column totals show the number of
words generated during each session by all five pairs. The bold number in the
bottom right corner is the total number of words in the corpus: 21,429.
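The structure of a table like Table B2.8 (two transcripts per session, row totals per pair, column totals per session, and a grand total) is straightforward to recompute from raw word counts. In the sketch below, only pair one's first-session counts (450 and 459 words) come from the text; every other number is a hypothetical placeholder for illustration:

```python
# Word counts per pair: one inner list per peer response session,
# two transcripts (one per paper discussed) per session.
# Only pair 1, session 1 (450 and 459) comes from the chapter;
# the remaining counts are invented placeholders.
counts = {
    "pair 1": [[450, 459], [500, 480], [610, 590]],
    "pair 2": [[400, 420], [430, 410], [440, 450]],
}

# Row totals: words produced by each pair across all sessions.
row_totals = {pair: sum(sum(session) for session in sessions)
              for pair, sessions in counts.items()}

# Column totals: words produced by all pairs in each session.
col_totals = [sum(sum(counts[pair][i]) for pair in counts)
              for i in range(3)]

# Grand total: all words in the corpus.
grand_total = sum(row_totals.values())

print(row_totals["pair 1"])  # 450+459+500+480+610+590 = 3089
print(grand_total)
```

With the real Table B2.8 counts, the same three lines of arithmetic reproduce the published corpus size of 21,429 words.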
The following excerpt shows Zelda reviewing Ivana’s (both pseudonyms)
summary-response paper about the class book, Outcasts United. Ivana wrote
about two instances in the novel during which the refugees experienced dis-
crimination within their American community. Zelda thinks the paragraph
needs to be expanded. In this episode, Ivana and Zelda engage in collaborative
brainstorming that results in the generation of language that Ivana might use
in her second draft to expand the paragraph, consistent with Zelda’s recom-
mendation. Rather than wait for Zelda to point out problematic aspects of her
paper, Ivana begins the episode by sharing that she is stuck. Both students then
participate in the generation of new ideas, demonstrating shared direction of
the task and meaningful engagement with each other’s suggestions.
Ivana: Here I stop because I have no idea, because I have no clue (laughing)
Zelda: (laughing) I’ll just write you some notes here about just “church and
store” and, um, “stories”, and “your opinion” about it.
Ivana: Because maybe I can say that they had to be thankful for escaping from
war, um, and don’t be so aggressive …
Zelda: Mhm.
Ivana: to the new life
Zelda: You can keep going, saying about the church and the store and what
happened in your opinion …
Ivana: Yes, there I will say about it [should not happen
Zelda: Yes, that it’s not] supposed to be to happen …
Ivana: Mhm.
Zelda: because it is in United States. And in conclusion, you can just say that
although in theory it sounds [so easy …
Ivana: Perfect, yeah]
Zelda: uh, but in reality …
(Zelda and Ivana, Peer Response Session 1, February 2013)
This section provides a comprehensive list of corpus tools and online resources
for teachers to explore or become familiar with. These tools are all accessible on the
web, but as URLs change and creators move from one university or company
to another, it would be good to refer to the names of the tools listed here and
search for them online in case a link no longer works. I also provide an an-
notated bibliography of CL research studies from 2010 to the present on the
following themes: CL in the classroom, CL and writing, and CL and spoken
learner data. These studies show CL applications in many contexts and settings,
and utilize a range of corpora that teachers can use or collect themselves.
[Figure: dimension scores (y-axis: Score, approximately –30 to 40) plotted by
genre (x-axis: Genres), including conversation, broadcasts, personal letters,
prepared speeches, press reportage, general fiction, official documents,
academic prose, and a student essay.]
Figure B3.1 Comparison of spoken and written registers from Biber’s (1988) dimensions
and an input corpus of student essays from the MAT Tagger (Nini, 2014).
TAACO: Tool for the Automatic Analysis of Cohesion (Scott Crossley, Georgia State University; Kristopher Kyle, University of Hawaii; Danielle McNamara, Arizona State University). TAACO is an easy-to-use tool that calculates 150 indices of both local and global cohesion measures, including a number of type-token ratio indices, adjacent overlap indices, and connectives indices. Available at: http://www.kristopherkyle.com/tools.html

TAALES: Tool for the Automatic Analysis of Lexical Sophistication (Kristopher Kyle, University of Hawaii; Scott Crossley, Georgia State University). TAALES is a tool that measures over 400 classic and new indices of lexical sophistication, covering both single words and n-grams. It also provides comprehensive index diagnostics. Available at: http://www.kristopherkyle.com/tools.html

TAASSC: Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (Kristopher Kyle, University of Hawaii). TAASSC is an advanced syntactic analysis tool that measures fine-grained indices of clausal and phrasal complexity, classic indices of syntactic complexity, and frequency-based verb argument construction indices. Available at: http://www.kristopherkyle.com/tools.html

UAM CorpusTool (Mick O’Donnell, Universidad Autónoma de Madrid). UAM CorpusTool is a free text annotation tool that also provides statistical output for various types of linguistic/text analysis. Available at: http://www.corpustool.com/index.html

UAM ImageTool (Mick O’Donnell, Universidad Autónoma de Madrid). UAM ImageTool is an image annotation tool for visual data corpora. Available at: http://www.wagsoft.com/ImageTool/

Atomic (Stephan Druskat, Volker Gast, Thomas Krause, and team, Humboldt University of Berlin). Atomic is a cross-platform corpus annotation tool. Available at: http://corpus-tools.org/atomic/index.html

ANNIS (Stephan Druskat, Volker Gast, Thomas Krause, and team, Humboldt University of Berlin). ANNIS is an open-source, cross-platform program that provides a browser-based search and visualization interface for multi-layer linguistic corpus annotation and analysis. Available at: http://corpus-tools.org/annis/

Pepper (Stephan Druskat, Volker Gast, Thomas Krause, and team, Humboldt University of Berlin). Pepper is a free converter tool used to convert corpora easily from one format to another. Java is required to run it. Available at: http://corpus-tools.org/pepper/index.html
Corpus Tools, Resources, and Bibliography 155
As part of CLiC, the CLiC Dickens project was launched in December 2017, with
an accompanying CLiC Activity Book for use in the English classroom to study
Dickens’s texts. With the CLiC web app (http://clic.bham.ac.uk/teachers), teach-
ers can create activities directly illustrating corpus-based applications in language
and literature teaching. CLiC allows users to explore 15 Dickens novels as well as
an expanding collection of texts by other authors (e.g., works by Austen, Bronte,
Hardy, Doyle, Wilde, and many others). Reference corpora like 19th-Century
Children’s Literature and a 19th-Century Reference Corpus are available for
comparison. CLiC classroom activities will enable students to explore and ana-
lyze literary texts themselves, formulate their own research questions, and extract
data and text samples for detailed analysis and interpretation.
The following tables (Tables B3.1 and B3.2) list tools, developers, and their
current online locations. I selected and highlighted ‘teacher-relevant’ software
developed by corpus linguists such as Anthony (in addition to his AntConc and
TagAnt) as well as by computational linguists or researchers in the field of
NLP with whom some in the CL world may not be familiar.
Article Summary Corpus
6 Chang, C. F., & Kuo, C. H. (2011). A Corpus-Based Approach to Online Materials Development for Writing Research Articles. English for Specific Purposes, 30(3), 222–234.
Summary: The article notes the increasing interest in the possible applications of corpora and genre-analytic approach to discipline-specific materials development. Using a word frequency list, move codes are tagged in the corpus in order to identify moves and move patterns that can help in developing research-based online teaching materials for graduate students of computer science. Examples of specialized vocabulary, grammatical usage, and move structures that describe the discourse of computer science are presented across learning tasks, discussion topics, and online writing models. The article ends with a discussion of the usefulness and effectiveness of the online RA writing materials, based on student feedback and course evaluation.
Corpus: RA corpus consisting of 60 research articles from three major journals in computer science. A word frequency list from the corpus was analyzed to develop a vocabulary profile of RAs, and move analysis was also conducted (based on a self-developed coding scheme of rhetorical moves).
7 Chang, P. (2012). Using a Stance Corpus to Learn about Effective Authorial Stance-Taking: A Textlinguistic Approach. ReCALL: The Journal of EUROCALL, 24(2), 209–236.
Summary: This paper proposes a “textlinguistic” approach in teaching advanced academic writing to complement corpus approaches for sentence-level lexico-grammatical instruction (with seven L2 doctoral students in the social sciences as participants). Chang examines how L2 writers polish their research argument as a result of improved stance deployment and whether a web-based corpus tool can provide a constructivist environment which prompts the learners to infer linguistic patterns to achieve deeper understanding of stance use and patterns. Results suggest a positive relationship between writing performance and more accurate use of stance. However, the application of higher-order cognitive skills (e.g., inferring and verifying) is found to be infrequent in the corpus environment. Instead, participants use more lower-level cognitive skills (e.g., making sense and exploring) to learn. The paper concludes that the learning of stance and stance patterns is critically contingent on the surrounding contexts, but overall, a clear authorial stance-taking plays a critical role in developing an effective academic argument.
Corpus: Web-based stance corpus, allowing users to study both the linguistic realizations of stance at clause/sentence level and how stance meanings are made at the rhetorical move level.
8 Charles, M. (2012). ‘Proper Vocabulary and Juicy Collocations’: EAP Students Evaluate Do-It-Yourself Corpus-Building. English for Specific Purposes, 31(2), 93–102.
Summary: Charles explores the feasibility of a student-collected corpus in multidisciplinary classes of advanced-level students. The EAP course consists of six weekly, 2-hour sessions focusing on academic writing, with feedback data from 50 participants presented and discussed in the paper. Over 90% of study participants found it easy to build DIY corpora, and most succeeded in constructing a corpus of 10–15 research articles. Students in general were enthusiastic about working with their own corpora, and about 90% of them agreed that their corpus helped them improve their writing. Most of them mentioned that they intended to use it in the future. Students view corpora as a useful resource in writing effective discipline-specific texts. Participants’ attitudes and experiences are also discussed in the paper, and Charles also presents the issues and problems that arise in connection with DIY corpus-building.
Corpus: EAP students construct and examine their own individual, discipline-specific corpora.
9 Charles, M. (2014). Getting the Corpus Habit: EAP Students’ Long-Term Use of Personal Corpora. English for Specific Purposes, 35, 30–40.
Summary: This study, also by Charles, a follow-up to the aforementioned study, reports on the personal use of DIY corpora by EAP students. A year after completing a corpus-based EAP writing course, students were asked to respond to an email questionnaire which asked them about their use of the corpus they collected. Results show that 70% of the respondents used their corpus in one way or another. Case studies of the use and nonuse of DIY corpora are presented, highlighting two other key factors likely to affect corpus use: the individual’s writing process and the focus of their current writing concerns.
Corpus: International graduate students (40–50) built and examined their own corpora of research articles and were asked to describe their use of the corpus a year after completing a corpus-based course.
10 Chen, H. H. (2011). Developing and Evaluating a Web-Based Collocation Retrieval Tool for EFL Students and Teachers. Computer Assisted Language Learning, 24(1), 59–76.
Summary: Previous studies suggest that existing web-based concordancers have not been very helpful in retrieving accurate and specific collocations meaningful for language learners. This paper describes and explores the applications of a web-based collocation retrieval tool, WebCollocate, in facilitating the search for collocations.
Corpus: Web-based collocation retrieval tool, WebCollocate, which is based on the POS-tagged Gutenberg corpus.
27 Park, K., & Kinginger, C. (2010). Writing/Thinking in Real Time: Digital Video and Corpus Query Analysis. Language Learning and Technology, 14(3), 31–50.
Summary: The focus of this paper is the combined use of real-time digital video and a networked linguistic corpus for exploring the ways in which these technologies enhance researchers’ capability to investigate the cognitive processes of learning. With the help of corpus search queries, the analysis of real-time data can be extended to provide an explicit representation of learners’ cognitive processes. This innovative method applies to SLA, especially in writing and exploring L2 writers’ composing process. The paper argues that a writer’s composing process is fundamentally developmental and facilitated by means of an interactive process (with a corpus).
Corpus: Same as the previous study.
28 Perez-Paredes, P., Sanchez-Tornel, M., & Alcaraz Calero, J. M. (2012). Learners’ Search Patterns During Corpus-Based Focus-on-Form Activities. International Journal of Corpus Linguistics, 17(4), 482–515.
Summary: This study explores EFL learners’ (n = 24) behavior by tracking their interaction with corpus-based materials during focus-on-form activities (“Observe, Search the Corpus, and Rewrite”). One group of learners made no use of web services other than the BNC during the “Search the Corpus” activity, while the other group was allowed to use other web services and/or consultation guidelines. The overall performance of the second group was found to be better; the first group’s formulation of corpus queries on the BNC was deemed unsophisticated. The students in the first group used the BNC’s search interface the same way as they used Google or similar resources. The researchers recommend that careful consideration should be given to the cognitive aspects of corpus searches, the role of computer search interfaces, and the implementation of corpus-based instruction.
Corpus: BNC.
29 Perez-Paredes, P., Sanchez-Tornel, M., Alcaraz Calero, J. M., & Jimenez, P. A. (2011). Tracking Learners’ Actual Use of Corpora: Guided vs. Non-Guided Corpus Consultation. Computer Assisted Language Learning, 24(3), 233–253.
Summary: This article discusses the use and potential benefits of learning logs to study learners’ actual use of corpus-based resources. In tracking learners’ actual use of corpora, the authors explore the number of events or actions performed by each individual, the total number of different web services used, the number of activities completed, the number of searches performed on the BNC, and the number of words or wildcards extracted per BNC search. These parameters were used to examine whether learner interaction with corpus-based resources differed under different corpus consultation conditions, i.e., guided versus non-guided consultation. Results show that learners behaved differently in both the number of different web services used during the completion of tasks and the number of BNC searches. The study suggests that guided activities and learner-tracking are necessary when teachers incorporate corpus tools in the classroom.
Corpus: BNC.
30 Poole, R. (2012). Concordance-Based Glosses for Academic Vocabulary Acquisition. CALICO Journal, 29(4), 679–693.
Summary: This study compares the effectiveness of online textual glosses that are (1) enhanced with modified corpus-extracted sentences from concordance lines, and (2) textual glosses enhanced with dictionary definitions drawn from an online learner’s dictionary. Poole aimed to determine which textual gloss technique would be most beneficial in helping intermediate to advanced language learners acquire academic lexical items. Learner attitudes toward the textual annotation techniques were also analyzed. Results show that participants in both experimental groups exhibited post-test gains in receptive and judgment tasks, but only the concordance-based group displayed improvement on the productive assessment. In addition, the concordance-based group members commented that the glosses were beneficial to them and that they would likely use them again. The dictionary group found the glosses ineffective.
Corpus: COCA.
31 Quinn, C. (2014). Training L2 Writers to Reference Corpora as a Self-Correction Tool. ELT Journal, 69(2), 165–177.
Summary: This article reports on a corpus-training module that was implemented in an intermediate-level EFL writing course and which taught students how to refer to corpora for the purpose of self-correcting teacher-coded errors. The full training sequence of the module is presented following a discussion of the students’ reactions to the process. The article guides teachers in preparing intermediate-level L2 writers in learning about concordancers. The focus here is to offer students an alternative reference to traditional dictionary searches.
Corpus: Collins WordBanks Online and COCA.
32 Rodgers, O., Chambers, A., & Le Baron-Earle, F. (2011). Corpora in the LSP Classroom: A Learner-Centered Corpus of French for Biotechnologists. International Journal of Corpus Linguistics, 16(3), 391–411.
Summary: The article points out that there is a growing body of research on the use of corpora for LSP, but learner evaluations of the activity are rare. The authors explore the use of a corpus of academic research articles on biotechnology in French (with native English-speaking university students of biotechnology as learners). After situating the study in the research context, they examine issues involved in the creation of an appropriate corpus and describe the integration of the corpus in the French language course. Finally, an evaluation of the learners’ reactions was conducted through questionnaires and semi-structured group interviews. Results show the positive application of corpus-based approaches, and students’ receptive attitude, especially with the use of learner-centered corpora.
Corpus: Various corpora for LSP.

33 Römer, U. (2011). Corpus Research Applications in Second Language Teaching. Annual Review of Applied Linguistics, 31, 205–225.
Summary: Römer notes that corpora have not only revolutionized linguistic research but have also had a major impact on second language learning and teaching. She points out that applied linguists value what CL has to offer to language pedagogy, but that corpora and corpus tools have yet to be widely implemented in pedagogical contexts. The article provides a summary of pedagogical corpus applications and reviews recent publications that report on the merging of CL approaches and language teaching. CL in syllabus or materials design and the creation of class lessons/activities are overviewed. Römer illustrates how both general and specialized language corpora can be best adapted to the classroom and discusses directions for future research in applied CL.
Corpus: BNC, COCA, MICASE, and ICLE.

34 Smith, S. (2011). Learner Construction of Corpora for General English in Taiwan. Computer Assisted Language Learning, 24(4), 291–316.
Summary: This exploratory study describes the DDL framework in general (non-major) English university classes, with learners directed to construct their own linguistic corpora. Smith argues that the process of creating a corpus “inculcates a sense of ownership in the learner” and that this approach contributes to their increased motivation in learning (especially if the corpus focuses on topics of great interest to the learner and matches their major field of study). The process of collecting the corpus leads to the acquisition of not only language but also useful transferable skills such as problem-solving competencies and knowledge of electronic tools. The study presents DDL applications and contexts in Taiwan and suggests that corpus construction is an important component of an effective DDL course. Participants (90 freshmen general English students) compiled and analyzed corpora, with overall positive results in terms of their increased motivation and signs of successful learning of English.
Corpus: WBC (WebBootCat) and Sketch Engine.

35 Walker, C. (2011). How a Corpus-Based Study of the Factors which Influence Collocation Can Help in the Teaching of Business English. English for Specific Purposes, 30(2), 101–112.
Summary: This paper presents two case studies of how CL can be used in business English instruction. The rationale here is that senior managers in multinational companies often find themselves needing more accurate business English models in learning how to best communicate in the international workplace. The paper highlights how a corpus-based investigation of the collocational behavior of key lexis features can be used to provide “sophisticated” vocabulary samples appropriate for senior managers. By studying collocations associated with a group of word synonyms, it is often possible to identify slight but noteworthy differences in the meaning of (business) words in the group, relevant for the specific needs of managers in high-impact intercultural English communication settings.
Corpus: The Bank of English (BoE) corpus, BNC, and the British National Commercial Corpus (BNCC).

36 Xu, Q. (2016). Application of Learner Corpora to Second Language Learning and Teaching: An Overview. English Language Teaching, 9(8), 46.
Summary: This paper provides an overview of learner corpora, their types, and various applications, and how they might be efficiently utilized in classroom activities for second language learning.
Corpus: ICLE, LINDSEI, Chinese Learner English Corpus (CLEC), and Cambridge Learner Corpus.

1 Chen, H. I. (2010). Contrastive Learner Corpus Analysis of Epistemic Modality and Interlanguage Pragmatic Competence in L2 Writing. Arizona Working Papers in SLA & Teaching, 17, 27–51.
Summary: This paper reports on a preliminary study of L2 learners’ interlanguage pragmatic development in academic written English discourse by examining how epistemic modality is used by NNS writers vs. NS writers, with data from NS/NNS corpora. Chen also investigates how NNS writers develop interlanguage pragmatic competence in academic writing across L2 proficiency levels. Findings suggest a need for culture-sensitive curricula and explicit pragmatic instruction in writing classrooms.
Corpus: ‘BNC baby’ and the Chinese Learner English Corpus (CLEC).

2 Chen, Y. H., & Baker, P. (2010). Lexical Bundles in L1 and L2 Academic Writing. Language Learning and Technology, 14(2), 30–49.
Summary: This study follows an automatic, frequency-based approach to identify frequently used word combinations (i.e., lexical bundles) in academic writing. Lexical bundles retrieved from one corpus of published academic texts and two corpora of student academic writing were investigated both qualitatively and quantitatively.
Corpus: The Freiburg-Lancaster-Oslo/Bergen (FLOB) corpus and the BAWE corpus.

3 Cotos, E. (2014). Enhancing Writing Pedagogy with Learner Corpus Data. ReCALL, 26(2), 202–224.
Summary: Cotos explores a local learner corpus to identify the effects of two types of DDL activities: one relying on a NS corpus and the other upon a combination of NS and learner corpora. The objective of both types of activities was to improve L2 writers’ use of linking adverbials. Quantitative and qualitative data obtained from writing samples, pre-/post-tests, and questionnaire responses were data sources for the study. Results showed an increase in frequency, diversity, and accuracy in all participants’ use of adverbials, but more significant improvement was observed in students who were exposed to the corpus including their own writing.
Corpus: A local learner corpus, an electronic collection of writing produced by the participants as course assignments (40 academic disciplines having 1,623 manuscripts with a total of 1,322,089 words).

4 Crompton, P. (2011). Article Errors in the English Writing of Advanced L1 Arabic Learners: The Role of Transfer. Asian EFL Journal, 50(1), 4–35.
Summary: This article examines article system errors in a corpus of English writing by tertiary-level L1 Arabic speakers. Frequencies of articles are compared with those in native and non-native English speaker corpora. Crompton reports that the ‘commonest errors’ involve misuse of the definite article for generic reference. These errors are deemed to be caused by L1 transfer (rather than an interlanguage developmental order), as supported by corpus data.
Corpus: ICLE and the Arabic Learner English Corpus (95 essays with 42,391 words).

5 Ha, M. J. (2016). Linking Adverbials in First-Year Korean University EFL Learners’ Writing: A Corpus-Informed Analysis. Computer Assisted Language Learning, 29(6), 1090–1101.
Summary: The paper examines the frequency and usage patterns of linking adverbials in Korean students’ essay writing compared with NS English texts. The distribution of the different semantic categories of linking adverbials was nearly identical in the Korean writing and NS writing. The additive relation was most frequently used, followed by the causal, adversative, and sequential relations. Results show that Korean learners overused linking adverbials across all semantic categories when compared with LOCNESS texts.
Corpus: Learner corpus composed of 105 essays produced by first-year university students in Korea; the control corpus was taken from the American LOCNESS sub-corpus.

6 Kennedy, C., & Miceli, T. (2010). Corpus-Assisted Creative Writing: Introducing Intermediate Italian Learners to a Corpus as a Reference Resource. Language Learning and Technology, 14(1), 28–44.
Summary: This study documents a semester-long apprenticeship in corpus use for creative writing applications. The corpus approach is introduced as an aid to learners’ imagination in writing and also as a resource to reinforce the correct use of specific Italian grammar features. Corpus-based activities served as groundwork for students’ analysis of corpus data. Kennedy and Miceli describe the approach and also the results of their evaluation of its effectiveness through case studies of three students and their use of a corpus and a bilingual dictionary as reference resources while writing.
Corpus: CWIC (Contemporary Written Italian Corpus).

7 Laufer, B., & Waldman, T. (2011). Verb-Noun Collocations in Second Language Writing: A Corpus Analysis of Learners’ English. Language Learning, 61(2), 647–672.
Summary: The use of English verb-noun collocations in the writing of NS of Hebrew at three proficiency levels was the focus of this paper. The authors extracted the 220 most frequently occurring nouns in the LOCNESS corpus and in the learner corpus and created concordances for them. Then, verb-noun collocations were also obtained for comparison. Results show that learners at all three proficiency levels produced far fewer collocations than NS.
Corpus: Learner corpus of about 300,000 words of argumentative and descriptive essays, compared with a LOCNESS NS sub-corpus.

8 Liu, G. (2013). On the Use of Linking Adverbials by Chinese College English Learners. Journal of Language Teaching & Research, 4(1).
Summary: This article examines the characteristics of Chinese EFL learners’ use of linking adverbials in speaking and writing through a comparison of learner and NS corpora. Using MicroConcord, linking adverbials were obtained for a contrastive study on the individual features used by the two groups. Data show that Chinese EFL learners used linking adverbials in their speech and writing overwhelmingly more frequently in comparison to NS. Chinese EFL learners tended to use more linking adverbials in their speech than in their writing, while NS data showed the opposite pattern, i.e., more linking adverbials in writing than in speaking.
Corpus: Chinese Learners’ English Corpus (CLEC) and the College Learners’ Spoken English Corpus (COLSEC); LOCNESS and London-Lund Corpus (LLC).

9 Lu, X. (2011). A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-Level ESL Writers’ Language Development. TESOL Quarterly, 45(1), 36–62.
Summary: This study evaluates 14 syntactic complexity measures as indices of language development in a corpus of college-level ESL writers. Data were analyzed by using a computational system developed for automatic analysis of syntactic complexity in college-level ESL writing. Results show a significant effect of the sampling measures on the mean values of most of the complexity measures. Ten complexity measures had significant between-level differences, and those measures showed several patterns of development. Correlations between the scores of the 14 complexity measures illustrate a strong relationship between measures of the same type, suggesting how complexity measures could be effectively used as indices of L2 writing development.
Corpus: The ESL writing corpus from the Written English Corpus of Chinese Learners (WECCL), with 3,554 essays (total words = 1,119,510) written by Chinese learners (aged 18–22 years) from 9 colleges.

10 Luo, Q., & Liao, Y. (2015). Using Corpora for Error Correction in EFL Learners’ Writing. Journal of Language Teaching and Research, 6(6), 1333–1342.
Summary: This paper reports the results of a small-scale study with 30 undergraduate students from 2 college English classes in China that explored the effects of using reference corpora during the process of revising essays in EFL. The BFSU CQPweb was used as a resource by the experimental group while correcting lexico-grammatical errors in writing. Findings reveal that corpora as a reference are more helpful than an online dictionary in enabling learners to produce accurate corrections and reduce errors in free production. Learners also showed a positive attitude toward corpus use in writing.
Corpus: The Beijing Foreign Studies University (BFSU) CQPweb.

11 MacDonald, P. (2016). “We All Make Mistakes!” Analyzing an Error-Coded Corpus of Spanish University Students’ Written English. Complutense Journal of English Studies, 24, 13.
Summary: MacDonald analyzes errors identified in written argumentative essays of 304 Spanish university students of English taken from two different corpora: one from a technical university context and the other from learners enrolled in the humanities. The study also explores the nature of errors coded in the corpus and the relationship, if any, between the learners’ level of competence and the type and frequency of errors they make. Results show that grammar errors are the most frequent and that the linguistic competence of the learners has a lower than expected influence on the most frequent types of errors coded in the corpus.
Corpus: UPV learner corpus from the MiLC corpus; the WriCLE corpus (750 argumentative essays written by Spanish learners of all proficiency levels).

12 Paquot, M., & Granger, S. (2012). Formulaic Language in Learner Corpora. Annual Review of Applied Linguistics, 32, 130–149.
Summary: Paquot and Granger review learner corpus data and various methods utilized in the analysis of formulaic language. They provide an extensive discussion of findings from learner corpus research (LCR) on learner phraseology, distinguishing between co-occurrence and recurrence patterns. Emphasis is also placed on the relationship between learners’ use of formulaic sequences and potential transfer factors from the learners’ L1.
Corpus: No corpus used.

13 Thewissen, J. (2013). Capturing L2 Accuracy Developmental Patterns: Insights from an Error-Tagged EFL Learner Corpus. The Modern Language Journal, 97(S1), 77–101.
Summary: This paper examined L2 accuracy development trajectories and how they can be captured via an error-tagged version of an EFL learner corpus (the ICLE). A subsection of 223 learner essays was used, with each essay manually annotated for errors following the Louvain error-tagging taxonomy and individually rated by two testing experts according to the Common European Framework linguistic descriptors for accuracy. A counting method, potential occasion analysis, which relies on both an error-tagged and a POS-tagged version of the learner data, was used to quantify the errors. Results indicate that the EFL error developmental patterns tend to be dominated by progress and stabilization trends.
Corpus: ICLE.

1 Aijmer, K. (2011). ‘Well I’m not sure I think…’ The Use of Well by Non-Native Speakers. International Journal of Corpus Linguistics, 16(2), 231–254.
Summary: This study explores the use of well as a pragmatic marker that helps speakers to organize and direct the conversation and to express specific feelings and attitudes. Aijmer focused on advanced EFL learners’ use of well, using Swedish texts from the LINDSEI corpus as compared with LOCNEC (native English speakers) texts. Comparison data show that, overall, Swedish learners overuse well, although there are considerable individual differences observed in the corpus.
Corpus: Texts from the Swedish component of the LINDSEI corpus and its NS counterpart (LOCNEC), to examine similarities and differences between NS’ and NNS’ use of well.

Part C of this book presents various corpus-based lessons and activities developed
for the classroom and intended for a range of language learners. The three subsections represent the primary themes of the lessons or activities: namely, (1) CL
and Vocabulary Instruction, (2) CL and Grammar Instruction, and (3) CL and
Teaching Spoken/Written Discourse. There may be overlaps in these themes as,
for example, a vocabulary lesson may also include some instructions related to
the discussion of a specific grammatical feature or structure. The three themes
represent common categories of activities in language classrooms, primarily
for English language learners, but may also be appropriate in settings such as
academic writing instruction or linguistic variation in spoken discourse for native English speakers in various university-level classes. Instructors or materials
developers in intensive English programs, ‘study abroad’ courses or workshops,
EAP courses for professionals, writing and grammar courses, or courses in sociolinguistics (focusing on the study of linguistic variation) may also beneficially utilize these model corpus-based lessons and activities.
The contributors of these lessons and activities were all my former students
in the master’s or doctoral programs at the Department of Applied Linguistics and ESL at GSU. They have all taken my Corpus Linguistics or Technology and Language Teaching courses, which required materials design projects for
the short-term language instruction of a linguistic feature (e.g., collocations,
linking adverbials, politeness markers, semantic categories of verbs) taught
with a corpus tool. The design of the lessons was often based upon a hypothetical classroom with a specific group of learners that the lesson designers
were familiar with. However, some of them based their design on an actual
class (e.g., an intensive English program course on academic writing) that they
were teaching. After completing the program at GSU, most of the contributors
found teaching positions at various universities in the US or abroad, teaching in
188 Corpus-Based Activities in the Classroom
Table C1.1 Major topics and assignments in the Writing in Forestry course by week
Week Topics/assignments
My course started with a diagnostic essay assignment (“Why are you interested in forestry?”) to help me conduct an initial assessment of each student’s strengths and weaknesses in a more structured academic writing context.
The second week introduced students to more specific academic and technical
writing in forestry by discussing how forestry and scientific writing may be
different from other forms of writing (i.e., register comparison) they have done
in the past, including their freshman English prerequisite course. Empirical
fact and data-based argumentation were emphasized for this week’s discussion.
Corpus data for comparison and an introduction to CL were provided.
Persuasive writing was the focus of the third week, and I assigned my students to write an approximately 200-word letter to the editor of the local newspaper, expressing their opinion on a forestry-related topic of their choice. The
topic had to be current (e.g., forest fires, bark beetle infestation, forest lands
thinning concerns) and accessible to the general public. In the fourth week,
students were assigned to write an annotated bibliography on a forestry topic of
their choice. After choosing a topic, they conducted library research to identify
six relevant academic sources (i.e., journal articles, books, technical reports).
The sources had to provide different perspectives on the topic and come from
credible academic publications. Finally, each source had to be cited in the name-year system and be followed by a one-paragraph annotation (100–250 words)
that summarized and evaluated the source and explained how this source was
relevant to the topic. The fifth week focused on developing the skills needed to
complete the annotated bibliography.
The sixth week expanded on the annotated bibliography assignment by exposing students to writing for the purpose of synthesizing current knowledge about
a forestry topic (i.e., literature review for research reports). Students were assigned
to write a five- to seven-page technical synthesis paper on the forestry topic they
selected for the annotated bibliography. I encouraged them to emphasize in this
paper a key thesis or argument that was developed and supported by the findings
and ideas of the sources used for the annotated bibliography. The paper required an
introduction paragraph, which introduced the topic and presented the thesis statement; a body, in which arguments were supported by synthesizing the sources from
the annotated bibliography; and a conclusion, wherein the thesis was restated and
summarized. The seventh week emphasized approaches for successful synthesis papers and citation styles; the eighth week focused upon professional memo writing; and the ninth week demonstrated effective table and figure formats.
In the 10th week, students were introduced to the purpose, format, and
primary contexts of forest management plans, and were assigned to write the
current conditions section of a management plan for a hypothetical National
Forest: the “Greenville Forest.” They were provided with background data and
other information about the Greenville Forest and the following scenario that
framed the assignment: Improvements are desired to a forest service road that
carries commuter and recreation traffic through “Merganser Marsh,” an important wetland for wildlife and the home of several endangered species. The
paper required three sections: a Physical Setting section, in which the location,
climate, and transportation system of the Greenville Forest were described;
a Biological Setting section, in which the vegetation and the wildlife were described; and a Social Setting section, in which the different recreational locations
the forestry corpus that I collected for the course. Discussions of these comparative features focused on the concept of academic written registers and
the specific features of writing defining sub-registers, including fiction,
nonfiction, narrative, analytical, or technical writing. Data visualization of
results from corpus descriptions of register-specific writing provided interesting insights into the uniqueness of individual registers that students were
familiar with and exposed the systematic patterns of word use, structure,
and conventional word associations commonly employed by writers in the
same field. Figure C1.1 shows an example of register comparisons between
conversation and academic writing, and discussion topics I presented in
class. Although my students had limited background in linguistics or grammar studies, they were able to draw connections between features such as
demonstrative pronouns (this and that) used very differently in speech and
in writing. I emphasized to my students that, intuitively, one might expect
that knowledge gleaned from corpus-based research that identified features
and systematic associations of patterns written in a particular field could aid
their writing in forestry, technical reports, memos, and many other future
applications.
Lesson Focus: The Grammar of Individual Words: Demonstrative Pronouns this vs. that.
The traditional description of the difference, as given in most grammar textbooks and classroom English instruction:
Figure C1.1 Frequencies of that and this in conversation and academic writing (scale: 0–12,000).
That was delicious. (In this example, that was used by the speaker to
show vague or situational reference)
GAAP requires that a business use the accrual basis. This means that
the accountant records revenues as they are earned… (This used in this
excerpt is an example of “text deixis,” defined as a reference to the reader
of the idea, topic, or specific item mentioned in the previous part of the
same text. Text deixis using this is more frequently used in academic
written registers than in other written registers.)
For exploratory comparisons, I collected first drafts of the lab report written by
my students, including similar papers written by other students in previous years
provided to me by instructors. My forestry corpus comprised 500 recent (from 2000
to 2005) refereed research articles from forestry and related journals (e.g., Forest
Ecology and Management, Forest Science, Journal of Forestry). As noted earlier, included
in the corpus are articles authored by faculty and graduate students in the School.
The novice/student corpus (total number of papers = 144) comprised mostly ‘first attempts’ at forestry research reports written by the students in the School.
As the students’ initial research writing activity, this assigned paper focused
on simple forest measurement techniques. Much of the writing instruction I
provided in the classroom involved the discussion of technical report writing
conventions in citation, formatting tables and figures, and developing the presentation of results. With respect to language use, I focused on linguistic features such as the use of linking adverbials, verb tenses (especially writing in the
passive), reporting verbs, pronoun use, and vocabulary features (leading into
key word comparisons). The measurement data were provided to the students,
and textbook chapters for citation in the “Review of Previous Works” section
were also provided. The School recommended a style and formatting manual,
Writing Papers in the Biological Sciences, 3rd Edition (McMillan, 2001), to aid stu-
dents in writing research reports across year levels. The format and mechanical
conventions (e.g., how to format tables and figures) in research writing followed the suggestions from this manual.
With the collection of our two comparison corpora (student lab reports vs.
published forestry research articles), I provided the students with normalized frequencies (normed per 1,000 words) of many of the target lexical features mentioned earlier: for example, verbs tagged into their “reporting” categories (argue, show, find, claim); linking adverbials; and passive structures and verb tenses, the last three making use of definitions or categories from the LGSWE as models. These three groups of features were good examples because
they all are commonly used in technical research reports. Student concordancing
activities were conducted in a computer lab using MonoconcPro (Athelstan, 1999)
so that students could run their own comparisons and also obtain sample text
excerpts of features especially from the professional corpus. It was easy to demo
MonoconcPro to my students, and they immediately learned the basics of running
searches and analyzing concordance lines (“this is just like googling,” one student commented). The concordancing exercises were facilitated using handouts
I developed with specific instructions about what the students needed to search,
count, normalize, and interpret. Discussion questions guiding small group activities were provided. The ultimate goal of the activities was to help students focus
on these features as they rewrote and finalized their lab reports. Sample data,
student interpretation and comments, and the primary results of comparisons and
concordancing activities are presented in the following.
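The core of what a concordancer such as MonoconcPro returns — the node word with its left and right context — can be imitated in a few lines. This is only a sketch of the idea, not the tool's actual behavior, and the sample sentence is invented:

```python
import re

def concordance(text: str, node: str, width: int = 25) -> list[str]:
    """Key Word in Context (KWIC): each hit of `node` with surrounding context."""
    rows = []
    for m in re.finditer(r"\b" + re.escape(node) + r"\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append(f"{left:>{width}} [{m.group()}] {right:<{width}}")
    return rows

sample = ("Results show that growth declined. We argue that thinning reduces "
          "fire risk, and earlier work found that stand density matters.")
for row in concordance(sample, "that"):
    print(row)
```

Aligning hits in a single column is what lets students scan dozens of lines at once and notice patterns such as which verbs precede that.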
Reporting Verbs
The definition and phraseology of reporting verbs and reporting clauses in
my lessons were derived from the works of Francis, Hunston, and Manning
(1996); Hunston and Francis (1998); and Charles (2005). Reporting clauses follow the verb patterns V + that (e.g., “You indicated that…”) and it be + V-ed
+ that (e.g., “It was stated that…”) in structures involving citation and general
reference. In a research setting, Charles (2005) examined the distribution of
reporting verbs in a corpus of academic theses written by native speakers (NS)
of English from several disciplines. She looked at the phraseological patterns
of reporting clauses by focusing on the structure of internal citation “with a
• As shown by Hill (2002) [not cited in this book’s bibliography], the measurements did not indicate infestation. It also shows that trees were not
releasing pitch as defense against bark beetle attack. The white pitch tube
shows that the beetle was successfully repelled by the tree.
• Martin and Williams (2003) [not cited in this book’s bibliography] showed
that changes are not necessarily positive at this point. In some of the pictures
shown by the researchers, the needles on conifer trees, like pines, begin to
turn a reddish-brown color. Often the change begins at the top of the trees and
moves down. This shows that some trees’ color from green to brown will ….
Table C1.2 Categories of reporting verbs (adapted from Francis et al., 1996)
Table C1.3 Sample completed group worksheet (1): Verb use (arranged alphabetically,
normalized per 1,000 words and including their lemmas) in student
reports in the Writing in Forestry course (n = 144 reports) and 500 refereed
forestry research articles
Table C1.4 Sample completed group worksheet (2): The 35 frequently used reporting
verbs (ranked and normalized per 1,000 words, including their lemmas)
in student reports in the Writing in Forestry course (n = 144 reports) and
500 refereed forestry research articles
Student reports Use per 1,000 words Professional articles Use per 1,000 words
Linking Adverbials
Linking adverbials (or adverbial connectors), also referred to as transition
words/phrases (e.g., however, for example, in addition), are considered necessary
features of academic and technical writing. The effective use of these linguistic
devices is important in creating textual cohesion and logical flow of the narrative, alongside coordinators and subordinators, because they clearly signal
the connection between passages of text (Biber et al., 1999). A critical aspect
of university-level research writing includes the development of logical arguments supported by details and evidence in prose. The effective use of linking
adverbials certainly helps improve the flow of discussions and how ideas are
best organized in academic written discourse. In order to achieve the needed
cohesion in presenting arguments and supporting evidence, students in many
EAP writing classes are encouraged to use linking adverbials in their research
papers (Altenberg & Tapper, 1998; Tanko, 2004).
My students were as enthusiastic and appreciative of the concordancing
activities focusing on linking adverbials as they were about their reporting
verb activities and discussions. We followed the same structure, with the
class in a computer lab working on worksheets that I developed, obtaining normalized frequency counts and extracting text samples for analyses.
The comparison of the professional and student corpora from our worksheets showed that students’ papers used 7.58 linking adverbials in total, normalized per 1,000 words; forestry professionals and practitioners had 12.12, also per 1,000 words. Further exploration of the data indicated that students not only used a smaller number of linking adverbials but also used fewer types. Students discussed this disparity in the figures from the corpora, and
we also talked about how best to teach the use of linking adverbials
to students writing research reports. In creating the corpus-based handout
and concordancing activity for linking adverbials, I referenced the list of
commonly used transitions in the form of single adverbs and prepositional
phrases from LGSWE (Table C1.5).
Table C1.5 Use of linking adverbials (arranged alphabetically, normalized per 1,000
words) in student lab reports (144 papers) and professional articles (500
articles)
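The per-1,000-word normalization the students carried out on these worksheets is simple arithmetic, and can be sketched in a few lines of Python. The raw counts and corpus sizes below are hypothetical, chosen only so that the rates match the 7.58 and 12.12 figures reported above:

```python
def per_thousand(raw_count, corpus_size_in_words):
    """Normalize a raw frequency count to a rate per 1,000 words."""
    return raw_count / corpus_size_in_words * 1000

# Hypothetical figures: 379 linking adverbials in a 50,000-word student
# corpus, and 2,424 in a 200,000-word professional corpus.
student_rate = per_thousand(379, 50_000)         # 7.58 per 1,000 words
professional_rate = per_thousand(2424, 200_000)  # 12.12 per 1,000 words
print(student_rate, professional_rate)
```

Normalizing in this way makes corpora of very different sizes directly comparable, which is why the worksheets report rates rather than raw counts.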
I don’t think that professional forestry writing prefers these two examples,
so I may not necessarily have to use them. I like it that I learned about
them in this lesson!” (2) “I was surprised at the use of also in the papers.
I thought however would be the most popular here as I use this transition
a lot in my writing. I studied overseas when I was in high school and I
remember being taught to use however and nevertheless in essay writing a
lot. I can see their importance in showing logic and in connecting ideas.
I appreciate seeing the list of most commonly-used linking adverbials by
forestry professors.”
One group of students compared and ranked the top 24 linking adverbials
from the two corpora, largely by reorganizing data from the aforementioned
table, but they were also tasked with obtaining text samples and further
identifying major differences between student and professional writing. Their
results are presented in Table C1.6.
Table C1.6 List of 24 frequently used linking adverbials (ranked and standardized per
1,000 words) in lab reports (n = 144 reports) and 500 refereed forestry
research articles
Student reports | Use per 1,000 words | Professional articles | Use per 1,000 words
First person: I, we
Second person: you (singular), you (plural)
Third person: he, she, it, they
…but more editors are allowing, even encouraging, first person, active voice
because it may be more direct and concise:
3 Reduces ambiguity
General dysfunction of the immune system at the leukocyte level is sug-
gested by both animal and human studies. (passive)
vs.
Both human and animal studies suggest that diabetics have general im-
mune dysfunction at the leukocyte level.
Lessons Learned
• It’s OK to use “We” and “I”! Avoiding personal pronouns does not
make your science more objective.
• The active voice is more clear, direct, and vigorous.
• The passive voice is appropriate in some cases, but should be used spar-
ingly and purposefully.
• Journal editors encourage use of the active voice.
Science:
Use active voice when suitable, particularly when necessary for cor-
rect syntax (e.g., “To address this possibility, we constructed a λZap
library…”).
Nature:
Nature journals prefer authors to write in the active voice (“we per-
formed the experiment…”) as experience has shown that readers find
concepts and results to be conveyed more clearly if written directly.
• I prefer passive. The focus on I did this, I did that is distracting to me. Tech-
nical writing is more formal than personal writing, so I have preference for
an informational style rather than a narrative.
• So, this lesson got me thinking about who to follow. If the writing text-
book tells me to use active, but the professors write in the passive, what
should my model be?
The students also compared the distribution of the personal pronouns I and we.
There was minimal use of I in the professional corpus, in contrast to the
student corpus. Although there was a substantial number of single-authored
works in the professional corpus, it was apparent that single authors avoided
using the first-person singular in their “Methods” and “Discussions and Con-
clusion” sections, and preferred passive structures instead (e.g., I measured the
dbh using…vs. The dbh was measured…). However, the we construction was
very common in multiple-authored papers in the professional corpus. For the
students in the study, both the I and we active structures were regularly used.
We was used in the explicit write-up of process and procedures in the collection
of data because the students worked in groups; I was used in the “Discussions
and Conclusion” section of the paper. Clearly, the preference for these personal
pronouns in the students’ papers influenced their use of passive structures in their
writing.
For verb tenses, the professional corpus had 43.34 present tense verbs and
22.11 past tense verbs per 1,000 words (I provided these numbers to the stu-
dents from tagged data), while the student corpus had 52.67 present tense and
16.23 past tense verbs per 1,000 words. In both corpora, present tense verbs
were used more frequently than past tense verbs. A cross-comparison of tense
features showed that the students relied on present tense verbs more heavily,
and used past tense verbs less, than the professionals.
In summary, highlighting these variations in the distributional patterns of
relevant linguistic features in the corpora could potentially enhance the overall
writing skills of students in forestry, especially in achieving a similar tone and
style typical of published articles in the field. In wrapping up the corpus-based
instruction for this part of the course, the class (as one group) developed the
following suggestions.
Based on the comparison of students and professional research papers
across reporting verbs, linking adverbials, passive/active structures, verb
tenses, and personal pronouns, we suggest that a course such as Writing in
Forestry should
• Help the students diversify their use of reporting verbs and avoid overuse
of common verbs like show (shows, showed, shown, showing),
• Suggest to students that they consider using passive structures along-
side the predominantly active structures in their papers (especially in the
“Methods” or “Collection of Field Data” sections), and
• Help students check verb tense patterns in the professional corpus and look
at examples of tense shifts and tense consistency.
reporting verbs they used in the paper. A list of reporting verbs grouped into
the categories identified by Francis et al. (1996) was provided as a guide in the
hands-on concordancing activity that followed. Again, the students were in-
structed to copy and print concordance lines of reporting verbs that would be
useful for them in editing their papers.
features in the study, followed by text samples extracted from the corpora, were
enthusiastically received by the students.
Most of the students in the concordancing activities stated that the process
was useful in writing and editing their papers. Since my students were
primarily L1 speakers of English, the language barrier was minimal;
vocabulary and structural examples in the corpus were easily noted and
applied in the writing and revision of the research reports. Many of the stu-
dents in this writing class expressed a preference for sample papers when-
ever they were given tasks. Because of the specialized content and writing
in the register, students benefited from examples and concrete instructions
in their own writing tasks. It helped tremendously that frequencies and
patterns from corpora provided the kind of examples students preferred in
the completion of tasks (Friginal, 2013b). Students worked with the concor-
dance lines they printed during the activity and reported that they used the
structures and patterns that they observed from their printouts in revising
their papers. There were some who mentioned that this activity was tedious
and time-consuming. Furthermore, some thought that concordancing for
linking adverbials and reporting verbs, for example, was nothing more than
using an online dictionary for synonyms or the MS Word function to find
applicable synonyms of words. These are relevant comments about the role
of corpus tools in the writing classroom, particularly because these devices
require additional time and effort and students might not immediately see
how the activity relates to the task. It is important that the corpus-based
lesson be clearly introduced and the rationale supporting it be explained to
the students and that all expectations as to the task execution and grading be
clarified. Content and mechanical conventions of technical writing could
similarly be incorporated into corpus-based instruction. Potential improve-
ment in the overall quality of writing may come from students’ repeated
exposure to patterns and mechanics of writing shown by model texts. In
reading, printing, and analyzing writing structures from the concordancer,
it was possible that the students acquired additional skills related to editing
sentences and paragraphs.
I believe that the use of a target corpus of professional writing helped in
contextualizing the activities in corpus instruction and solidifying the exam-
ples and patterns that the students needed in constructing their sentences and
paragraphs. It appeared that the students found additional motivation from
studying the patterns within professional writing in their own field. It was
easy for them to check concordance structures useful in editing their papers.
The frequencies of writer preferences (what some students described as the
overuse or underuse of specific linguistic features), revealed through the
exploration of corpora, also provided relevant data that reminded my students of the ideal style
and quality of writing expected in their area of specialization. At the end of
the course, students were asked to rate the level of difficulty for the required
210 Corpus-Based Activities in the Classroom
writing assignments. They rated the management plan (3.9; 1 = easy, 5 = diffi-
cult) and lab report (3.9) as most difficult, followed by the synthesis paper (2.7),
memo (2.6), opinion piece (2.3) and annotated bibliography (2.3). Students
ranked the writing assignments in priority of importance to them (highest
to lowest): lab report, management plan, memo, synthesis paper, annotated
bibliography, opinion piece (Kolb et al., 2007). Ninety-seven percent of the
students agreed that the grammar workshops with corpus tools were helpful
and should be retained in the course. The most supportive written comments
from students were (directly quoted):
• I’m glad I took this class. I feel it has given me a preview of what I will
be writing in the professional program.
• I enjoyed the grammar lessons and feel that they helped me improve my
writing. I wish we could have focused more on them.
• I thought this class was awesome. I feel ahead of the game when it
comes to writing papers in other classes.
• I would like to say that this class is outstanding. I am impressed with my
overall experience in this class. It has prepared me for future classes and
applications down the road. I have had to take a lot of required courses
this semester and let me tell you that this is the only class that I enjoyed.
I do not like writing, but the instructor presented the material so well
that he made it a good experience.
• This class was great. I feel that the course has prepared me well for en-
trance into the forestry program.
• There was an enormous amount of writing, almost too much, but other
than that, I learned some new stuff.
• I think it would be good to include more examples of good work along
with work that needs improvement.
• The only suggestion I have is to change the grading system for the
class. It seems as since there is an individual emphasis on all the pa-
pers but the management plan, there should be an individual grade for
the management plan. By this I mean that a person should be graded
by their individual sections in the management plan and not get a
group grade because not everyone exerted the same level of work or
effort.
• Seems like this course could be structured better. There was a lot of
unused time during the first two weeks. Perhaps a bit more focused on
writing lab reports would be helpful as well. A lot of work for a two
credit class.
The volume illustrates how corpora may be used directly (e.g., by introduc-
ing students to corpora and getting them to work with concordance lines)
and indirectly (e.g., by using the findings of corpus analysis to inform the
design of pedagogical activities); in several teaching contexts (e.g., schools,
language institutes, universities); and in the teaching of English for different
purposes (including EAP) and of different English varieties (e.g., American,
Singaporean, and other Englishes), among others. Sections C2–C4 of this
book share these similar features and objectives.
In 2010, Robinson, Stoller, Jones, and Costanza-Robinson published
Write Like a Chemist which makes use of information gathered from cor-
pus analyses of chemistry texts. This textbook was developed primarily for
chemistry students in the US, both NSs and non-native speakers (NNSs) of
English. In my review of this book for a writing across the curriculum jour-
nal, I noted that
The use of corpus-based technology as well as the collection of spe-
cialized corpora contributes to the innovative designs of writing classes
that incorporate detailed descriptive data of different genres of writing.
Students are exposed to the writing conventions of professionals in their
own field, are given clear examples of lexico-syntactic features of various
written reports and documents, and are able to focus on the nuances and
skills that prepare them not only in writing, but more importantly, in read-
ing genre-specific writing. Write Like a Chemist: A Guide and Resource is an
impressive textbook that is a product of collaborations between chemistry
professors, applied linguists, and technical writing instructors. The book
was intended for upper division and graduate-level university chemistry
classes, and as a resource book for chemistry students, postdocs, faculty,
and other professionals who want to improve their chemistry-specific writ-
ing in the field, particularly in writing: (1) journal articles, (2) conference
abstracts, (3) scientific posters, and (4) research proposals.
C2
CL and Vocabulary Instruction
Vocabulary-based activities using corpora and corpus tools are the ones most
commonly developed and used by teachers in the classroom for a range of
learners. This certainly reflects CL’s tradition of making use of corpora in
dictionary production and the benefits of using concordances and KWICs in
order to concretely and fully define a word and explore its various meanings.
There have been a few learner English dictionaries based on corpora. Over
the years, major English corpora of spoken and written language (both British
and American) have been collected by publishers to produce corpus-based
word lists, dictionaries, and thesauri. Commercially available corpus-based
dictionaries include The Longman Dictionary of Contemporary English (LDOCE),
6th edition (2015); Collins COBUILD English Dictionary (Sinclair, 2003); and
Davies’s (2010–) A Frequency Dictionary of Contemporary American English: Word
Sketches, Collocates, and Thematic Lists (with Dee Gardner). In addition to this
frequency dictionary based on COCA data, Davies also has various dictionar-
ies and word lists obtained from word frequencies of Spanish and Portuguese
corpora, among others. Vocabulary textbooks for English language teach-
ing utilizing various corpora are also available (e.g., McCarthy, O’Dell, and
Reppen’s (2010) Basic Vocabulary in Use), with emphasis on the role of fre-
quency distribution and real-world language use in how to best teach the
acquisition of L2 vocabulary.
Corpus-based dictionaries offer some interesting features not often included
in more traditional or intuition-based dictionaries. The following is an example
of an entry from the LDOCE on the various meanings of okay or OK in spoken
and written texts (Figure C2.1).
o.kay, OK /pronunciation/ adj spoken 1 [not before noun]
not ill, injured, unhappy etc: Do you feel OK now? 2 used to
say that something is acceptable: "Sorry I'm late." "That's
okay." | Does my hair look OK? 3 [not before noun]
satisfactory but not extremely good: Well, it was OK, but I
like the other one better.
[…]
okay, OK interjection 1 used when you start talking about
something else, or when you pause before continuing: OK,
let's go on to item B. | OK, any questions so far? 2 used to
express agreement or give permission: "Can I take the car
today?"
[…]
okay, OK v okayed, okaying [T] informal to say officially that
you will agree to something or allow it to happen: Has the
bank okayed your request for a loan?
[Figure C2.1 also charts the frequencies of the word okay in spoken and written English.]
frequency, and were only included if not already present in the first 2,000
words of the GSL. The words on the list account for 10% of all the words in
academic texts. As a rule, each word had to be used at least 100
times in the Academic Corpus. Sample entries from the AWL are shown, with
each lemma and all of its lexemes provided. The bolded words were the most
frequent on the sub-list. The most frequent word in each word family could be
the lemma (like the word estimate seen later) or it could be one of the lexemes
(like the word derived, also seen later). The AWL has 10 sub-lists, with Sub-list
1 containing the highest-frequency words.
As noted here, Coxhead’s primary reason for creating the AWL was that other
word lists, especially the GSL, were limited in their capacity to demonstrate
current lexical usage across academic registers. Yet after more than 15 years,
it is possible that the AWL may also need some updating. For example, Ming-
Tzu and Nation (2004) completed a study on homographs within the AWL
and concluded that the list should include a wider range of word families and
lemmas, representing a range of academic homographs. In addition, Nation and
Waring (1997) argued that in order to comprehend a text, 95%–98% of words
in it must be fully understood and acquired by learners. Unfortunately, as Na-
tion and Waring found, word lists such as the AWL and GSL arguably do not
represent at least 95% of words in a target text. A more recent paper of mine,
coauthored with my former graduate students (Friginal, Walker, & Randall,
2014), also found that declining trends in vocabulary use, even when measured
over a relatively short time frame (e.g., 1990–2012), are very relevant to
standard word lists used for language teaching. As the AWL has been used
as a reference by teachers of academic writing for a range of learners, including
non-native speakers (NNS) of English, accurate distributions representing the
current status of word usage in specific contexts are of great importance. Teachers
and curriculum developers, then, might focus on these distributions to corre-
spond with patterns of vocabulary utilized currently outside the classroom. In
EAP, these are very important arguments for pedagogy.
CL and Vocabulary Instruction 217
maintains that a language learner needs to notice some feature of the input to
have the best chance of acquiring it. McNair argues that corpora provide such
data for learners immediately, whereas it would take much longer if the learner
were only relying on noticing natural occurrences in reading material. Finally,
Nelson’s main objective in C2.3 is to show how teachers can effectively incor-
porate Cobb’s (2016) Compleat Lexical Tutor, particularly its VocabProfile feature,
to help students develop their essay writing skills for proficiency exams like
the TOEFL or IELTS. Nelson based this lesson on his actual Intensive English
Program (IEP) course, which prepares students for a norm-based test. His goal
was to guide them through the process of discovering the differences between
the essays they wrote in tests and benchmark essays he selected, so that the
students would, hopefully, incorporate these newly acquired 'awarenesses' into
their strategies to achieve higher scores.
Lesson Background
This lesson is situated in a course within an IEP at an aeronautical university
that uses only authentic content from a single subject area to teach academic
English. The majority of students will be pursuing careers as pilots, aerospace
engineers, or aircraft maintenance technicians, or in the aviation business, and
therefore the content area for this course is aviation and aerospace.
Task: Analyze an authentic text to identify academic and technical vocabu-
lary items for classroom use.
When using authentic materials, unlike with an ESL textbook, it is up to the
teacher to determine which vocabulary items should be focused on in order to
aid learners with both receptive and productive knowledge related to the
content. Authentic materials can be challenging for learners who may have, up
until this point, dealt only with adapted or modified texts in English.
The Flight Safety Foundation is an international organization concerned with the
safety of flight. Its monthly magazine Aerosafety World publishes analyses of
important safety issues facing the aviation industry. However, the content is written
for a native English-speaking audience, despite the fact that a large percentage of
aviation personnel are second language speakers of English. Through scaffolding,
learners can be introduced to this type of authentic content in the language learning
classroom and thereby be better enabled to enter the broader aviation community.
Instructors can develop a range of lessons based on a single article such as
one found in Aerosafety World, but where should they start? Through the online
corpus tool WordandPhrase.Info (Davies, 2017b), the article's lexical items can be
grouped into frequency bands and, even better for teachers of specific content,
discipline-specific lists. The following example uses the article "Survival
Factors,” which discusses the National Transportation Safety Board’s analysis of
the Asiana Airlines Flight 214 crash (Rosenkrans, 2014).
Procedure
Range 1
(AVL LIST 0 - 500)
193 Words
21: report
15: impact
6: reported
5: response
4: factors, initial, research
3: between, female, group, than, within
2: based, colleague, components, control, described, design, directed, likely, occurred, perceived,
performance, performed, separated, significant, such
1: absence, addition, although, among, application, applied, apply, approach, article, assessed,
attempting, complex, comprehensive, concludes, condition, conducting, contains, contributing,
critical, current, currently, data, degrees, depending, descriptions, determined, difficulty, direct,
directing, discussion, due, enabling, established, estimated, experienced, failures, identified,
images, impacted, improve, include, including, information, involved, level, limits, lower,
mechanisms, needed, noting, objective, obtain, patterns, phase, positive, provide, provided,
relation, requirement, resulted, resulting, section, sections, selected, similar, sources, specifically,
standards, stated, status, subjected, testing, throughout, understanding, unique, units, used,
varied, various
Range 2
(AVL LIST 501 - 3000)
100 Words
After the lists are generated, an instructor can examine the top vocabulary
words in each list and use a combination of frequency results and intuition to
decide which words to focus on in classroom activities. In the FSF example,
high-frequency items, such as report, impact, eject, trap, factor, sequence, fuselage,
and initial, are useful for both receptive (e.g., comprehension or main idea ques-
tions) and productive (e.g., group discussion or summary writing) activities
related to the article. Some lexical items, such as impact, are relevant as multiple
parts of speech, and WordandPhrase.Info permits students to discover this distinc-
tion on their own, as illustrated later. Alternatively, an instructor can choose
which part-of-speech the student should focus on.
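WordandPhrase.Info performs this frequency banding internally; purely as an illustration of the idea, the sketch below maps words to the ranges shown above using a word's rank in an academic vocabulary list. The function name and the toy rank table are invented for this example and are not the tool's actual data:

```python
def band(rank):
    """Map a hypothetical academic-vocabulary-list rank to a range label."""
    if rank is None:
        return "off-list"          # e.g., discipline-specific items
    if rank <= 500:
        return "Range 1 (0-500)"
    if rank <= 3000:
        return "Range 2 (501-3000)"
    return "lower frequency"

# Toy rank table, invented for illustration only.
avl_rank = {"report": 120, "impact": 310, "factor": 450, "sequence": 1800}

for w in ["report", "impact", "sequence", "fuselage"]:
    print(w, "->", band(avl_rank.get(w)))
```

Note that a technical item such as fuselage falls outside the list entirely, which is exactly why an ESP instructor's own content knowledge is still needed to categorize subject-specific vocabulary.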
An ESP instructor can further categorize the subject-specific vocabulary
based on her own content knowledge. For instance, this article contains many
parts of an airplane: fuselage, engine, tail, and gear (likely part of landing gear).
crash: 920 impact, 1485 contact, 2312 shock, 2854 crash, 3814 blow, 5700 collision, 8401 bang, 14157 brunt
influence: 427 effect, 432 control, 920 impact, 1396 influence, 2553 impression, 6817 bearing, 11990 sway
• Examine the collocates for impact. Click on a verb collocate and take a look
at the concordance entries. Do you see any patterns? Create two original
sentences that contain the collocate and impact, with at least one sentence
using an aviation topic. (See Figure C2.3, which shows the example verb
"minimize.")
• Choose another collocate that is an adjective. Take a look at the concor-
dance entries. Do you see any patterns? In particular, notice prepositions.
Create two original sentences that contain the collocate and impact, with at
least one sentence using an aviation topic.
• Complete Steps 2–6 with the rest of the vocabulary words.
Suggestions to Teachers: There are several options for presenting the vocabu-
lary words to students. Pulling out excerpts and highlighting vocabulary words
is a good way for students to work on part-of-speech recognition, or instructors
can list items with parts of speech already indicated. The example article contains
uses of impact as both a noun and a verb, so students can explore both. Also, in-
structors can decide which part-of-speech collocates should be used based on the
content and goals of the activity, or allow students the freedom to choose.
Homework: Read the article “Survival Factors” about the Asiana Airlines
Flight 214 crash. Write a summary of the events using a minimum of six of
the vocabulary words you explored using WordandPhrase.Info. Try to write sen-
tences using the collocates you learned today.
Lesson Background
As technology continues to become more accessible and expand into our every-
day lives, there is a growing need to incorporate it into language teaching and
learning. As discussed in many parts of this book, DDL is one such approach,
providing students with access to rich linguistic data drawn from corpora, large
sets of texts. Through corpora, students are exposed to an abundance of input,
from which they can induce rules and patterns in language. DDL is consistent
with many theories in SLA, particularly the noticing hypothesis and usage-based
learning (UBL). According to the noticing hypothesis, a language learner needs
to notice some feature of the input to have the best chance of acquiring it. DDL
works by drawing the learner’s attention to the target feature by providing nu-
merous examples in context. One of the tenets of UBL is that we learn language
through analyses of massive amounts of form-function pairings in authentic con-
texts. Corpora provide these data for learners immediately, whereas it would take
much longer if the learner were only relying on natural occurrences in reading
material. These are two of the conceptual underpinnings for DDL.
The overall goal of the activities in this lesson is to promote deeper vocabulary
acquisition in pre-intermediate learners. I have selected pre-intermediate students
because current DDL research has largely overlooked this group or assumed that
they will not benefit from DDL (Chujo, Utiyama, & Miura, 2006). Vocabulary is
the target language feature because, although it is instrumental to learners'
success, they may not be fully aware of what it means to know a word. It goes
beyond learning the pronunciation, spelling, and a primary definition. Richards
(2008) references several other important lexical characteristics that are necessary
for vocabulary, including polysemy, affective meaning, associations, and colloca-
tions. By providing large numbers of contextualized examples, corpora give learn-
ers the means to induce these characteristics. Another goal of DDL is to promote
learner autonomy by teaching students how to use corpus tools independently. This is
especially important for vocabulary acquisition, where the onus is already on
students to learn words themselves. The activities in this lesson make use of two different
corpus tools in the hope of providing students with the skills and knowledge nec-
essary to pursue deeper word learning on their own.
tools. Indeed, they are likely not even aware of them. Examples of this include
the Macmillan English Dictionary for Advanced Learners (Rundell, 2007) and the
Longman Grammar of Spoken and Written English (Biber, Johansson, Leech, Con-
rad, & Finegan, 1999), both of which were developed based on corpus studies.
One helpful indirect use that is becoming more accessible for teachers is the
creation of learner corpora, as done by Grigoryan (2016). In a study set at
a school in the United Arab Emirates, 10 participants with proficiency lev-
els ranging from elementary to advanced wrote timed paragraphs about their
weekend on iPads. These paragraphs were processed through AntConc to ex-
amine word use. In a few minutes, AntConc found that the words anyone and
anywhere were used incorrectly in almost all paragraphs. Instructional materials
were developed based on these findings and used in class. Later, students wrote
a second timed paragraph. The AntConc result showed a 98% accurate usage of
anyone and anywhere, indicating that creating learner corpora may be useful for
teachers with access to digital copies of their students’ writing.
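AntConc itself is a desktop GUI tool, but the kind of frequency check described in Grigoryan's study is easy to picture in code. The sketch below (my own illustration, not AntConc's implementation) counts word occurrences across a set of learner paragraphs so that problem words like anyone and anywhere can then be inspected in context:

```python
import re
from collections import Counter

def word_frequencies(texts):
    """Count word occurrences across a set of learner paragraphs."""
    counts = Counter()
    for text in texts:
        # Lowercase and tokenize on letters/apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

# Hypothetical learner sentences, invented for illustration.
paragraphs = [
    "On the weekend I did not go anywhere.",
    "Anyone in my family was at home.",
]
freq = word_frequencies(paragraphs)
print(freq["anyone"], freq["anywhere"])  # 1 1
```

Once the counts flag a word as unusually frequent, the teacher would still read the actual sentences (as the study did) to judge whether the uses are correct.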
With technology becoming more and more prevalent in the classroom, it is
increasingly realistic to design activities in which students engage with lin-
guistic data themselves. This is the direct approach to corpus use. Fukushima,
Watanabe, Kinjo, Yoshihara, and Suzuki (2012) created a learner corpus of 212
graduation papers by Japanese students that will be utilized by future EFL stu-
dents at the university. Using a learner corpus for direct use by other learners
has a couple of important benefits. First, 98.2% of words from the AWL appear in
the learner corpus, but the difficulty of the writing is reduced. This gives learn-
ers access to the same vocabulary used in EAP, but with more comprehensible
context. Second, these papers provide successful non-native models to which
EFL students can aspire. This could serve as a motivational boost for learners to
see the work of former students used as a model.
I believe that it is also necessary to go beyond pre-/post-test differences
and basic attitudes to discover how learners are using corpus tools. To do this,
Geluso and Yamaguchi (2014) used four-step student journals: (1) Identify for-
mulaic sequences (FSs) in authentic material, (2) use corpus tools to identify
patterns in the FSs, (3) practice the FSs in small groups, and (4) use the FSs in
authentic communication. Rather than using the concordancer for new words,
students used it with words with which they were already familiar to learn
new uses and expressions. From there, they would use the many contexts pro-
vided to tease out the nuances of the FSs. Lastly, Boulton (2012) examined
how distance learners used corpus tools in novel ways. Rather than only using
a concordancer, the students often combined features to discover which words
co-occur, with what frequencies, and in what contexts. Although these students also
reported difficulty using corpus tools and did not find them especially effective,
they indicated that they planned to use them in the future. However, more
research is needed about how exactly DDL improves learning, if that is indeed
the case. Is it the result of focusing attention on specific features? Providing an
abundance of examples? Or is it simply more motivating, so that learners study more?
Procedure
What Is WebParaNews?
This project uses WebParaNews (WPN), a free Japanese-English bilingual cor-
pus of news articles from a bilingual newspaper. WPN is a parallel corpus—
each English sentence has an equivalent sentence in Japanese. When a sentence
is highlighted, the translation equivalent is highlighted as well. Additionally,
WPN has a built-in concordancer, a tool that lets the user search for every in-
stance in which a word or phrase appears in the corpus.
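For teachers curious about what a concordancer does under the hood, the core keyword-in-context (KWIC) idea can be sketched in a few lines of Python. This is a toy illustration only, not WPN's actual implementation; the sample sentences are invented:

```python
import re

def concordance(corpus_sentences, phrase, width=30):
    """Toy keyword-in-context (KWIC) search: list every occurrence
    of `phrase` with a window of surrounding characters."""
    lines = []
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    for sent in corpus_sentences:
        for m in pattern.finditer(sent):
            left = sent[max(0, m.start() - width):m.start()]
            right = sent[m.end():m.end() + width]
            lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines

sentences = [
    "The talks will start next week in Tokyo.",
    "Officials said the program would start in April.",
]
for line in concordance(sentences, "start"):
    print(line)
```

A real concordancer such as WPN's additionally tokenizes the corpus, aligns the parallel Japanese sentences, and indexes the text for speed, but the search-and-display loop above is the essential operation.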
Why WPN?
The WPN concordancer will be used to build students’ vocabulary knowl-
edge and demonstrate that learning a word goes beyond knowing synonyms
or primary definitions. A bilingual concordancer was selected because pre-
intermediate students often lack the linguistic knowledge necessary to
understand the full context provided by a concordancer. The richest
input is useless if students do not understand what they are reading. By using
a Japanese translation, advanced vocabulary and grammatical patterns will
not pose as much of a barrier to pre-intermediate students using the corpus.
Because the data come from a newspaper in Japan, the topics may be more
familiar to students, which will also facilitate comprehension.
Setting
This lesson takes place in a high-beginner EFL class at a Japanese university. In
this classroom, students have access to computers and the internet. In the pre-
vious class period, students turned in a report summarizing two current events.
The teacher marked a few words in each report that sounded off due to the use of
near, rather than true, synonyms. The first lesson will use WPN to help students
identify better words to use in those situations and the reasoning behind their
use. The COCA activity is a stand-alone activity; it is not connected to this lesson.
CL and Vocabulary Instruction 227
There are many parts to knowing a word, and it is important to learn all of
them to communicate effectively. These activities will give you an opportunity
to learn about the other parts of a word besides its definition.
Discuss with a Partner
Think about these two verbs: begin and start.
Note: You will have to change the tense for the verbs in WPN. Also,
you can search for multiword phrases, such as “go away,” in WPN.
I could not ________________ the car.
I will ___________________ the journey tomorrow.
It ___________ to rain.
The washing machine would not _____________.
FOR THE TEACHER: Review the answers together. Ask students for possible guide-
lines for using “start” and “begin” based on their intuitions from the concordancer results.
Explain that only “start” can be used for machinery.
Idiomatic Language
Idiomatic expressions are fixed expressions that are very common in a culture or
language. It is important to use the exact word in these expressions—not synonyms. Use the
concordancer again to select the right words for each expression. Choose one
of the bold words above each pair of sentences. Write a definition for each of
the following expressions.
Warm or Hot?
1 The Emperor thanked them for their _________ hospitality.
2 Inflation has become a ___________ issue in the presidential election.
Big or Large?
1 After winning the lottery, the woman was living __________.
2 After winning the gold medal, the athlete got a ________ head.
Cold or Cool?
1 Everyone thought that the king was ___________ hearted because he
never talked to the people.
2 Both the ruling and opposition parties must debate the matter with
_________ heads.
FOR THE TEACHER: Review the answers and meanings together. Discuss how
concordancers can help students learn formulaic language. Return students’ reports
from the previous class period.
Use WPN and http://www.thesaurus.com/ to find better words. For five of the
marked words, write the original sentence and the new word in the blanks on the
following “sentence and new word” worksheet:
COCA Activity
Name: _______________________________
Did you know that English has 91 single-word prepositions? This can make
it hard to know which one to use. In this part of the lesson, we will use the
Corpus of Contemporary American English (COCA, Davies, 2008–) to help
us decide which prepositions to use.
Here are your COCA instructions:
Ø Click on Collocates. Note again that collocates are words that often ap-
pear together.
Ø Click POS in the options bar. This command means part-of-speech.
Then, select prep.ALL from the drop-down list. This means we want to
look for prepositions.
Ø Click the 1 on the right side of the number line. This means we want to
know what prepositions appear one word to the right of the word/phrase
we will search for.
Now we are ready to search! Let’s do one together.
EXAMPLE: I live _____ Osaka. (in/at)
Is it “in” or “at?”
Type “I live” into the space next to Word/Phrase and hit enter. What do
you see?
[Students will see a COCA output with the following results: in—611,
with—89, on—46, for—25, at—22, and so on.]
Ø What do you notice? IN appears 611 times while AT appears only 22 times!
But that does not mean IN is the right answer! Click on each word to look
at the sentences. What do you notice? (TO THE TEACHER: Here you
can talk about situations where IN is used, such as with cities, and situations
where AT is used, such as with street addresses.)
Ø Now it is your turn! Work with a partner to answer these questions. As
you answer each question, compare the sentences where each preposition
is used. Can you make any rules?
1. The car was fixed ____ my mom. (with/by) (Hint: Search for “was fixed”)
2. They walked _____ the store. (to/until)
3. I was waiting ______ three hours. (for/since) (Hint: Search for “was waiting”)
4. Gasoline is priced _____ 100 yen a liter. (for/at) (Hint: Search for “is priced”)
To the Teacher: Talk about it together as a class. Review the rules that
students have created. A follow-up activity might increase the difficulty by
removing the hints or by using two prepositions that both occur with high
frequency, forcing students to examine the contexts.
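For teachers who want to see the logic behind a collocates search like the one above, here is a toy Python sketch. The mini-corpus and the preposition list are invented for illustration and are far smaller than COCA's actual data and part-of-speech tagging:

```python
from collections import Counter

# A few common single-word prepositions (illustrative subset, not COCA's full list)
PREPOSITIONS = {"in", "at", "on", "with", "by", "for", "to", "of"}

def right_collocates(texts, phrase, allowed=PREPOSITIONS):
    """Count the words that appear one position to the right of `phrase`,
    keeping only those in `allowed` (here, prepositions)."""
    target = phrase.lower().split()
    n = len(target)
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for i in range(len(words) - n):
            if words[i:i + n] == target:
                nxt = words[i + n].strip(".,!?")
                if nxt in allowed:
                    counts[nxt] += 1
    return counts

corpus = [
    "I live in Osaka and work in Kobe.",
    "I live at 45 Main Street.",
    "They live in a small town.",
]
print(right_collocates(corpus, "live").most_common())
```

COCA does the same kind of positional counting over hundreds of millions of words, using real part-of-speech tags rather than a fixed word list, which is why its frequency evidence (IN 611 vs. AT 22) is so much more reliable than intuition.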
Lesson Background
The goal of this lesson is to show how teachers can implement a quantitative tool
to help students heuristically improve their writing (particularly, essay writing
for proficiency exams). The context is preparing students for a norm-based
test, which has formulaic responses and predictable vocabulary; yet the lesson
can easily be adapted whenever a teacher has a high-quality model for students
(discussed below). First, I will present thematic research that deals with writ-
ing for the TOEFL and quantitative methods of writing assessment. Second,
I will briefly introduce Cobb’s (2016) Compleat Lexical Tutor (or LexTutor) and
describe the capabilities/features of the VocabProfile tool. Third, I will illustrate
the particular context and student population for which I developed this lesson
to explore my motivations and pedagogical logic. This lesson was fundamen-
tally influenced by the fact that my students are attempting to write for specific
tasks which require a rather banal structure and sophisticated vocabulary. My
goal was to guide students through the process of discovering the differences
between the way they write and benchmark essays so that they had a defined
conception of their path to a higher score. An important consideration in this
section is the fact that the class had a variety of levels and proficiencies, which is
unusual within the specific IEP. Fourth, along with a few recommendations for
teachers, I will present the lesson plan and each of its components: a list of mate-
rials, a plan for the teacher, a handout for students, and related resources. I con-
clude with a discussion of the feasibility of this tool in IEP classrooms and how
teachers in other contexts may benefit from implementing VocabProfile activities.
Note to Teachers: This particular IEP offers an elective special topics
course for English learners preparing for the TOEFL proficiency tests. The
TOEFL is a notoriously difficult exam to prepare for because the vast array
of resources available to students, from many different for-profit companies,
varies widely in quality and in fidelity to the actual test. My lesson focuses
on writing for the TOEFL and presents a plan that uses a frequency-based tool.
The TOEFL Writing section consists of two tasks. First is an integrated writ-
ing task, in which students must integrate and compare information from two
different sources: a short reading and then a short lecture in which a speaker re-
futes or expands on points brought up in the reading; students have 20 minutes
to write approximately 200 words. The second task is an independent writing
task in which students must present their opinion and real-life examples about
a given topic; students have 30 minutes to write at least 300 words.
Materials
ü T workstation with projected PC and docu-cam
ü Computer workstations with internet access for Ss
ü Copies of the “Complete lesson plan handout”
ü Ss’ TOEFL integrated essays
o checked for spelling
o informally rated from 0 to 5 based on ETS rubric
o in digital form (for copy + paste)
o A copy of the ETS integrated writing rubric (as needed)
o Text files for benchmark essays
ü When possible, there can be multiple essays in one file and the profile will
be an average of several essays.
ü Links to the K1, K2, and AWL
Procedure
1 Understanding frequency: 10 minutes
T writes the word “Frequency” on the board and elicits the definition
from Ss (something like “how often or how many times something oc-
curs”). T then explains how words can be divided into categories based on
frequency. To illustrate, T might ask whether the word question (or another
word) would be more or less frequent (“Do you think this is one of the
1,000 most common words in English?”); T contrasts this with the word
collectively. T and Ss read the introduction of the handout; T checks to see
that all students understand:
Ø How frequency can categorize words
Ø How knowing what kinds of words [based on frequency] Ss use in their
essays and comparing that to the types of words a TOEFL essay uses can
potentially help Ss score better on the test.
2 Running profiles: 10 minutes
Students first run the profile of their essays. They follow the directions on
the handout. T helps students who have more difficulty; quicker Ss help
those struggling. T may demo on the overhead with an essay by an exem-
plary student who has agreed to let their essay be used for illustration.
T explains that students don’t need to worry about understanding all the
information on the page yet and immediately begins running the profiles
on the benchmark essays.
3 Benchmarking: 15 minutes
Students execute the profile on benchmark essays by following the direc-
tions. There are benchmark essay text files (cleaned for spelling) uploaded
in a designated course management site (e.g., Blackboard or iCollege; or a
Dropbox system). Students run the profile of the essay that is one level
above their performance: for example, if a student received a Level 3 on
their essay, they run the profile for the Level 4 essay.
Ss fill in the table so they can see a side-by-side comparison of their
frequency stats and those of the benchmark essay. They answer some basic
analytical questions.
4 Small group comparison: 15 minutes
Ss get into groups with mixed performance on the essays so that students can
compare their performance, frequency, and goals with others. Ss are aware
that the course outcomes are driven not by language acquisition but by test
preparation, so there shouldn’t be too much discomfort in sharing scores and
goals. If the class has shyer students, they can be grouped by similar scores.
In groups, students identify what kinds of words they are using too
many of in comparison with the benchmark and what are reasonable goals
for themselves. Depending on how their conversations go, they may begin
to look at which words they are using and what are appropriate alterna-
tives (which makes a group with heterogeneous scores very interesting).
Students review copies of the word lists that the VocabProfile uses, as well as
key words, function words, transitions, and turns of phrase that are recom-
mended for use on the TOEFL by their test-prep materials.
5 Homework
Ss will not completely rewrite their essays, but instead look critically at
their word choices and adjust to eventually match their vocabulary pro-
file with that of benchmark essays. They are instructed not to make any
significant grammatical or structural changes, just replacing words and as-
sociated pronouns, verbs, and so on. They may do this by looking at the
“Edit-to-a-Profile” section.
For example, Student X does not use as many words from the AWL as the
benchmark, and may replace an instance of “then” with “subsequently.” The
objective is that Ss can compare two versions of the same text, with the second
draft being more benchmark-like. T should emphasize to Ss that through re-
peated practice and a more sophisticated understanding of the word choices of
successful benchmark essays, they should aim to make similar lexical decisions
(such as more words from the AWL, or more creative non-list words, or more
K2 words…) during their timed-writing on the actual test.
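The frequency-band comparison that VocabProfile performs can be sketched in Python. The word lists below are tiny invented stand-ins for the real K1, K2, and AWL lists, so this is an illustration of the idea rather than the actual tool:

```python
def vocab_profile(text, k1, k2, awl):
    """Toy VocabProfile: percentage of tokens falling in each
    frequency band (K1, K2, AWL, Other)."""
    tokens = [w.strip(".,;:!?").lower() for w in text.split()]
    tokens = [t for t in tokens if t]
    counts = {"K1": 0, "K2": 0, "AWL": 0, "Other": 0}
    for t in tokens:
        if t in k1:
            counts["K1"] += 1
        elif t in k2:
            counts["K2"] += 1
        elif t in awl:
            counts["AWL"] += 1
        else:
            counts["Other"] += 1
    total = len(tokens)
    return {band: round(100 * n / total, 2) for band, n in counts.items()}

# Tiny illustrative lists; the real tool uses the full K1/K2/AWL lists.
K1 = {"the", "and", "then", "we", "a", "of", "data"}
K2 = {"gather"}
AWL = {"subsequently", "analyze"}
profile = vocab_profile("We gather the data and subsequently analyze it.", K1, K2, AWL)
print(profile)
```

Note that the tool checks each word against the lists in order (K1, then K2, then AWL), so every token lands in exactly one band and the percentages sum to roughly 100, which is what allows a student's profile to be compared side by side with a benchmark essay's.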
Students’ Handout
Name ________________________________
My essay received a score of ______________
My benchmark is a _______________
In this activity, you will find out how the words you use in your TOEFL inte-
grated writing essay differ from benchmark (example) essays. This is especially
important because the essays are so short, only around 300 words. By compar-
ing your words to that of a benchmark essay, you can see what types of words
you should use more or fewer of to get a better score.
We will be using a tool called the VocabProfile. This tool sorts the words you
use in an essay into four categories: K1, K2, AWL, and Other.
K1 represents the 1,000 most commonly used words in English.
K2 represents the next 1,000 most common words in English.
AWL stands for “Academic Word List.”
Other means words not on any of these lists (infrequent and non-academic).
Directions
1 Open the graded copy of your Integrated writing essay. I have fixed all the
spelling errors for you because the tool will not count misspelled words,
but the TOEFL does.
2 Open the website lextutor.ca and click on “VocabProfile,” which is in the
middle column under “researchers.” Then click “VP-Classic.”
3 Make the title “Your Name’s Integrated Essay.” Then, copy and paste your
essay into the field. It should look like this (Figure C2.4):
VocabProfile page, give it the title “Level X benchmark Essay.” Copy and
paste the text into the field and click “Submit_Window.”
6 Now you should have two windows with two vocabulary profiles. One is
your essay; the other is an essay one level higher than yours. We will now
compare them in terms of word frequency. Focus on the box that says,
“Current Profile”; it looks like this:
Current Profile

            %     Cumul.
K1      81.33      81.33
K2       3.33      84.66
AWL     12.33      96.99
Other    3.00     100.00
I note again here that I consider the Longman Grammar of Spoken and Written English
(LGSWE) (Biber et al., 1999) to be one of the most useful and influential grammar
resource books published in the past two decades, transforming the teaching of
English grammar to emphasize the important role of registers in mediating lan-
guage use. In highlighting real-world BrE and AmE in context, the LGSWE shows
language teachers the immediate teaching applications of frequency data. It spec-
ifies the unique distinctions between spoken and written grammars, and provides
researchers a much needed guide in defining and categorizing an extensive list of
syntactic features of English. A follow-up, student version of the LGSWE, The
Longman Student Grammar of Spoken and Written English (Biber, Conrad, & Leech,
2002), provides a slightly condensed discussion of the elements of English gram-
mar, this time, specifically for learners (ideally in upper-level undergraduate or
master’s level pedagogical grammar courses), including how, why, and when to
use different structures in speech and writing. For teachers, a companion Student
Workbook offers various activities and short lessons that could immediately be
adapted in the classroom or assigned as homework.
Conrad and Biber’s (2009) Real Grammar: A Corpus-Based Approach to English
uses corpus data to show how grammatical structures could be ‘easily’ developed
and taught in many classroom settings. The presentation here is not as extensive
as in the LGSWE, as the focus is only on a select set of 50 features, such as simple past
tense in polite offers, meanings of take + noun phrases, action verbs with inani-
mate subjects, and adjective clauses that modify sentences. However, the different
units are developed as pullout lessons that a teacher can simply assign to students
for hands-on activity in class or for work outside the classroom. Each unit starts
with a review of what traditional grammar textbooks typically define or em-
phasize compared to actual data from corpora. This instructional part is then
followed by various activities, often about noticing contexts, analyzing discourse,
Table C3.1 Sample instructional information about necessity modals from Real
Grammar: A Corpus-Based Approach to English
1 I remember my first car accident. It was right after I got my license, and I
must have been sixteen. My dad was in the car with me and I backed into
the car across the street.
2 In group counseling, comfortable seating should be used and chairs set out
in a circle so that everyone can see each other. This is important for pro-
moting trust and confidence in the group.
CL and Grammar Instruction 245
I have used Real Grammar in my Technology and Language Teaching and Corpus
Linguistics courses for graduate students to serve as a model for instructors in
developing lessons and activities for a variety of settings. My students respond
positively to the design and components of the 50 units, and they are able to ex-
tend them to the teaching of other features not covered in the textbook. They
notice areas in which additional discussion from the point of view of grammar
instructors could be added to possibly help ‘convert’ those who are not yet fully
versed in CL approaches and themes. Activities in which students directly interact
with corpus tools would also be welcome additions, although it is clear that
this is not a primary goal for the authors. Real Grammar has greatly influenced
the development of Corpus Linguistics for English Teachers (this book), especially
in selecting lessons and activities for Section C3.
Six grammar-focused lessons and activities for various groups of learners
and English courses are presented in Section C3: (1) “Analyzing Verb Us-
age: A Concordancing Homework” (Randall), (2) “Developing Corpus-Based
Materials for the Classroom: Past or Past Progressive with Telicity” (Dun-
away), (3) “Quantifiers in Spoken and Academic Registers” (Walker), (4)
“Teaching Linking Adverbials in an English for (Legal) Specific Purposes
Course” (Heath), (5) “AntConc Lesson on Transitions for an Intermediate
Writing Class” (Emeliyanova), and (6) “The Explorer’s Journal: A Long-Term
Corpus Exploration Project for ELLs” (Nolen). Lessons range from a homework
assignment on verb use to a semester-long project that requires students to reg-
ularly complete concordancing activities on collocations. Contributors make
use of COCA, GloWbE, or AntConc with a teacher-collected corpus, and
they highlight the role of CL in encouraging students to think like research-
ers in a simple and exploratory study that aids in discovering and comparing
linguistic patterns.
Lesson C3.1 describes how students can use COCA to explore a variety of
verb structures and functions as they write academic essays. The worksheet de-
veloped by Randall allows students to extract and interpret collocation verbs in
order for them to avoid too much repetition and to add variety to their writing.
This lesson overlaps with my description earlier in Section C1 of my Writing in
Forestry lesson on reporting verbs. Nolen’s contribution (C3.6) is a Corpus Jour-
nal Project intended for smaller classes (ideally 5–10 students) of learners with
American Council on the Teaching of Foreign Languages (ACTFL) proficiency
levels ranging from intermediate-mid to advanced-high or above.
Students are taught to use concordancers to complete various in-class and take-
home activities. This is quite a robust nine-week plan, which serves as an in-
tegral part of the course that Nolen has been running for several terms now.
Two related lessons on linking adverbials for international LL.M.
graduate students (C3.4, Heath) and for students in a university-level aca-
demic writing course (C3.5, Emeliyanova) make use of AntConc to search
for and identify the functions of linking adverbials in legal texts and to
provide students with experience in completing a ‘mini research project’
Lesson Background
This homework focuses on how COCA (Davies, 2008–) can be utilized by
students to explore a variety of verb structures and functions as they write
academic essays. The learners are primarily college-level students studying ac-
ademic ESL/EFL writing in countries such as China, South Korea, or Japan.
The homework covers concepts such as verb collocations and verb functions,
and asks students to predict, identify, and investigate various structures and dis-
tributions from COCA. A practice exercise allows them to use the knowledge
they have gained about the target verbs (prove and illustrate) in a sentence
completion activity.
Concordancing Lesson
For this week’s homework, we are going to use COCA (corpus.byu.edu/coca)
and its concordance tool to learn how to use a variety of verbs more
appropriately in our essays.
Follow the step-by-step instructions to analyze your verb usage. Turn in
this handout at the beginning of class next week.
Part A: Predict
How many English verbs do you guess that you know? ____________
Which verbs do you guess are the most common in academic essays?
(Write down 4 guesses)
_________________________ _______________________
_________________________ _______________________
Part B: Identify
Reread Example Essay 1 in Appendix B of your course booklet (page 82).
[Note to teachers: It would be ideal to identify a reading passage from the stu-
dents’ textbook as part of this homework activity. If there are limited options,
you may assign another essay.] Circle all the verbs.
It is true that common verbs like is or do will still be very common in formal,
academic writing. Often, however, we try to choose different verbs to avoid
too much repetition and to add variety to our writing. This keeps our read-
ers interested. Look at these two sentences from the reading:
Verbs like illustrate and prove can be used in sentences where we might
choose be. In order to use them appropriately though, we need to know
more about the patterns that these verbs follow.
Part C: Investigate
Log on to COCA (corpus.byu.edu/coca) to research the verbs PROVE and
ILLUSTRATE.
(Remember: We made accounts for this site earlier in the semester!)
Prove
Find the most common collocates of prove in academic English.
Remember how to find collocates? [Instructor demos how to do this in
COCA.]
Write down the 4 most common collocates, and write the total # of occur-
rences in the parentheses:
_________ (____) _________ (____) _________ (____) _________ (____)
(what POS are these words?) __________________
Click on EACH of the 4 most common collocates, and scroll through the
concordance lines to look at the examples.
Write down 1 GOOD example sentence for each.
(Remember: A GOOD example sentence will be one that you understand and one that
represents how the verb is most commonly used with that collocate, i.e., tense, aspect,
preposition, etc.)
(_____________)_________________________________________________
(_____________)_________________________________________________
(_____________)_________________________________________________
(_____________)_________________________________________________
Illustrate
Now, find the most common collocates with illustrate in academic English,
following the same process.
Write down the 4 most common collocates:
_________ (____) _________ (____) _________ (____) _________ (____)
(Are these the same kinds of words that collocate with prove?) YES NO
(What POS are these words?) ________________________
Click on each of the 4 most common collocates and write down 1 sentence
from each.
(_____________)_________________________________________________
(_____________)_________________________________________________
(_____________)_________________________________________________
(_____________)_________________________________________________
Part D: Practice
Use the knowledge you have gained about these verbs. Each sentence is miss-
ing the verb. Choose whether it should be prove or illustrate, and fill in the
blank. BE SURE TO USE CORRECT VERB TENSE and ASPECT!
Part E: Apply
Now, create your own sentences using prove and illustrate in the ways we
have learned they are most frequently used.
Use your two chosen Sustainable Development Goals as the topic of your ex-
ample sentences. Think about how you could use these sentences in your essay.
Use all the different collocates that you have written about for this homework.
Prove
(_____________)_________________________________________________
(_____________)_________________________________________________
(_____________)_________________________________________________
(_____________)_________________________________________________
Illustrate
(_____________)_________________________________________________
(_____________)_________________________________________________
(_____________)_________________________________________________
(_____________)_________________________________________________
or transcripts from different pages to analyze for key words and phrases.
I want mobile app versions of popular online and offline corpus tools so
that students can use them easily in and out of class. I think to improve its
applicability and increase its utilization, corpus linguistics needs to address
its usability/access issues. Right now, I think that the learning curve on any
of these approaches is a bit steeper than it needs to be. That learning curve
is the biggest hurdle that is keeping most teachers from integrating it more
frequently into their instruction.
Lesson Background
The following materials are developed for a two- to three-week lesson on
understanding telic/atelic verbs. Telicity is an inherent property of a verb or
verb phrase that presents an action, situation, or event as either completed
or continuing (i.e., as having no definite ending).
A verb is telic when it clearly establishes completion, while a verb that pres-
ents an action or event as being incomplete is said to be atelic (Biber et al.,
1999). CL approaches through concordancers (especially with databases like
COCA or MICUSP) provide teachers and materials developers a wealth of
options in designing real-world and authentic classroom activities for these
types of grammar-based lessons. Students may relate well to localized
examples in explaining concepts and defining terms, especially if teachers
are allowed to utilize materials beyond required texts. The following lessons
and activities aim to also show teachers how to design materials that even-
tually can inform or guide the creation of a textbook for a particular group
of learners.
Note to Teachers: The target students here are university-level ESL stu-
dents taking a grammar course (e.g., Introduction to English Grammar), but the
activities may also be used for higher-level students in an intensive English
program focusing on practicing academic speech and writing. Lesson handouts
may be directly provided in the classroom with student (computer) worksta-
tions and access to COCA or may be given as an assignment. The students
already have sufficient background and experience running COCA searches.
Additional corpus-based data were adapted from Biber et al.’s (2002) Longman
Student Grammar of Spoken and Written English.
When individuals fall, they eventually hit the ground and stop falling.
When David kicks, his leg goes out and back in. The kick starts and ends.
Telic verbs have two types:
➢ One with a long duration (accomplishments)
➢ One with a short duration (achievements)
Achievement (short duration): After he made a shot, spectators sometimes kicked
his golf ball into the rough or hid it under garbage.
Achievement (short duration): When the wind was the strongest and the rain the
hardest was when the tree fell on the house.
Accomplishments are verbs with an end that take a long time, like draw and
clean.
Unlike achievements, accomplishments finish rather than simply stop.
Normally, a kick and a fall have a short duration, but if a kick is repeated it can
also be an accomplishment.
Also, if a fall happens from a tall building or many things are falling, the word
can be an accomplishment.
Atelic verbs like play and swim, in contrast to telic verbs, have a less clearly
defined ending.
Atelic verbs without an ending are called activity verbs because they are
activities that people often do.
[Diagram: activity verbs have a start, a stop or pause, and a long or short duration]
Children can play for hours, and then stop, but they never really complete play-
ing because, arguably, there is no ending to play.
Activity: We were playing electric keyboards, which only require a very light touch.
A person (or an animal) can also swim for a long time, and then stop swim-
ming, but the person didn’t really complete swimming. He or she probably just
got tired. This means that he or she can swim again the following day (or after
resting).
The last important type of verb here is the non-action or stative verb:
Stative verb: Whoever was waiting for me inside my apartment was about to get
what he deserved.
Stative verb: Mrs. Jackson was standing at a respectful distance by the door.
B. Identifying Telicity
Practice on Identifying Telicity
ACTIVITY: event or action; short or long duration; unclear start/end
ACCOMPLISHMENT: event or action; long duration; clear start/finish
ACHIEVEMENT: event or action; short duration; has start/finish
STATE: not an event or action; short or long duration
1. Joe <q> discovered that his wife and Mister Schirmer <r> were having an affair.
<q> ______________ <r> ______________
2. Interim CEO May <s> announced he <t> was creating a task force.
<s> ______________ <t> ______________
3. Rickie <u> called his father in the Middle East and told him we <v> were
marrying for the green card.
<u> ______________ <v> ______________
4. Heidi <w> was bringing cupcakes to the classroom and <x> could drop off
the costume.
<w> ______________ <x> ______________
5. They <y> were hoping a tip might <z> lead them to the gun used to murder
Russel Douglas.
<y> ______________ <z> ______________
Examples:
1. ________________________________________________________
2. ________________________________________________________
3. ________________________________________________________
Example: Drove has a frequency of 23,982. Enter the two gunmen, whom Garland
police said drove up to a police car that was blocking an entrance to the exhibition hall.
(News Report on COCA).
Examples:
1. ________________________________________________________
2. ________________________________________________________
3. ________________________________________________________
C. Understanding Progressives
Question C1: Biber et al. (2002) report that progressive verbs are not com-
monly used across registers such as conversation, fiction, news, and academic
discourse. From the aforementioned examples, when do English users prefer to
use the past progressive?
Question C2: The following verbs occurred in the progressive 80% of the
time (from a different group of texts). What are their frequencies in the COCA?
Answer this question by completing the following worksheet.
Verb comparison worksheet: progressives 80%
Question C3: The following verbs occurred in the progressive in less than 2%
of cases. What are their frequencies in the COCA? Answer this question by
completing the following worksheet.
Verb comparison table worksheet: progressives less than 2%
1. Look
Just like that you were looking (occurred at same moment) at days and days
of added labor (Conversation)
CL and Grammar Instruction 259
2. Hope
I did finally get my day at the beach; just not the beach I ______________
for. (Fiction)
3. Depend
Their aunt worked part-time at the visitor center, but they ___________
on her to keep the house running. (Fiction)
4. Concern
One cost of service involved distance and the second ____________
traveling through city traffic. (News)
5. Stay
The Mom and Dad had separated, but Russ ____________ with her
and their two children over Christmas. (Fiction)
6. Chase
Moms ___________ after their kids, and dogs were running after any-
thing that moved. (Magazine)
7. Leave
For example, when I ____________ my relative overnight in the hos-
pital, my relative was using the same bedpan. (Conversation)
8. Think
I stumbled on this and ____________ it might be helpful for people
still learning how to use Gmail. (Conversation)
9. Involve
This project also ____________ a class blog for students to post the
videos that they made. (Academic)
10. Spend
Roosevelt students ____________ a minimum of two hours every day
on reading. (Academic)
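The two worksheets above rest on a simple computation: the share of a verb's occurrences that are progressive (be + V-ing). The following sketch is not from the book; it uses a tiny invented sample so the logic is self-contained, where a real analysis would run over a corpus such as COCA or Brown.

```python
# A rough sketch of the computation behind the worksheets above:
# the share of a verb's occurrences that are progressive (be + V-ing).
# The sample text is invented so the example is self-contained.
from collections import Counter

SAMPLE = """
Moms were chasing after their kids , and dogs were running after anything
that moved . I know the answer . She knew him well . He chased the bus .
They were chasing rumors all week . Everyone knows that . We knew better .
"""

BE_FORMS = {"am", "is", "are", "was", "were", "be", "been", "being"}

def progressive_share(base_forms, ing_form, tokens):
    """Share of be + V-ing uses among all occurrences of the verb."""
    counts = Counter(tokens)
    # an -ing token counts as progressive when preceded by a form of 'be'
    prog = sum(1 for prev, cur in zip(tokens, tokens[1:])
               if cur == ing_form and prev in BE_FORMS)
    total = sum(counts[f] for f in base_forms) + counts[ing_form]
    return prog / total if total else 0.0

tokens = SAMPLE.lower().split()
chase = progressive_share(["chase", "chases", "chased"], "chasing", tokens)
know = progressive_share(["know", "knows", "knew", "known"], "knowing", tokens)
print(f"chase: {chase:.0%} progressive, know: {know:.0%} progressive")
```

Note that this counts surface forms only; a part-of-speech-tagged corpus would be needed to separate gerunds from true progressives.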
D1. Look at examples of simple past and past perfect verbs in COCA and find
out which aspect is most likely to occur in the present perfect. Discuss your
findings with a classmate.
D2. Biber et al. (2002) found that perfect tenses were more common
in BrE than in AmE and that progressive tenses were more common in
AmE.
Figure C3.1 Frequency of perfect and progressive aspect in AmE vs. BrE conversation and news. Adapted from Biber et al. (2002). [Bar chart: frequencies from 0 to 10,000 for perfect and progressive aspect across AmE CONV, BrE CONV, AmE NEWS, and BrE NEWS.]
D4. Past or Past Continuous with Lexical Aspect and the COCA
Decide what kind of lexical aspect each verb in the following chart has.
ACTIVITY: event or action; long duration; unclear start/end
STATE: not an event or action; short or long duration
ACCOMPLISHMENT: event or action; short or long duration; clear start/finish
ACHIEVEMENT: event or action; short duration; has start/finish
Ø What are the most common past simple and past continuous words? (Make
a guess or use your intuition.)
Ø Now, check your answers using COCA. Were the verbs used more frequently
in the past simple or the past continuous?
Today (Carter et al., 2016) are two of the few great corpus-based workbooks,
in my opinion, but they require supplements to scaffold and allow
for productive tasks. The onus really is on the practitioner to develop their
own corpus-based materials with the least metalanguage possible while at
the same time managing increased contact hours, service requirements,
professional development, multiple positions, and personal obligations. A
final challenge is a program's focus on the bottom line, i.e., if a program
invests in creating and adopting a corpus-based curriculum, does that
bring it greater profit and a larger, more diverse student population?
More often than not, achieving greater alignment with best practices does
not easily translate into higher revenue.
Lesson Background
This lesson is developed with the understanding that all students are in a com-
puter lab environment and have previously experienced using the COCA
(Davies, 2008–). If that is not true, some additional preparation may be neces-
sary. The targeted audience is a classroom of students at a lower-to-basic pro-
ficiency level of English. However, it could easily be utilized at many different
levels with an understanding that the time needed to accomplish the task would
likely vary. The main lesson is set to encompass approximately 50 minutes with
the possibility of extensions to the lesson mentioned at the end. As a final note,
please keep in mind that technology changes daily, and so the listed directions
may not be completely accurate.
The primary purpose of the lesson is to help students choose the appropriate
quantifier for a specific context. A quantifier is defined as an expression used to
represent a more general amount, such as a lot, lots, many, much, loads of, multiple,
and several. These commonly used expressions denote the plurality of a noun,
so they have been selected as the focus of the following lesson. As the students
search for information about the quantifiers, they will be able to see whether
the words are used with count or non-count nouns and whether the words are
more suitable for an academic writing environment or for an informal conversational
environment. Using non-count nouns is one of the more notably challenging
grammar topics, as is using situationally appropriate language, which is a major
reason why this topic was chosen. It is important for students to understand why
they are learning what they are learning. Therefore, it is strongly encouraged that
an on-level explanation be provided in order to have full buy-in from the students.
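The register comparison the students carry out can be illustrated in miniature. The two "sub-corpora" below are invented stand-ins; COCA itself is queried through its web interface, not from code like this.

```python
# Sketch of the register comparison behind the quantifier lesson.
# The two "sub-corpora" are invented stand-ins, not COCA data.

SPOKEN = "we got lots of rain and a lot of wind , lots of it".split()
ACADEMIC = "several studies report multiple findings across several trials".split()

def per_million(phrase, tokens):
    """Frequency of a (possibly multi-word) phrase per million tokens."""
    words = phrase.split()
    n = len(words)
    hits = sum(1 for i in range(len(tokens) - n + 1)
               if tokens[i:i + n] == words)
    return hits / len(tokens) * 1_000_000

for q in ["a lot", "lots", "several", "multiple"]:
    s, a = per_million(q, SPOKEN), per_million(q, ACADEMIC)
    print(f"{q}: leans {'spoken' if s > a else 'academic'}")
```

Normalizing to a rate per million tokens, as here, is what makes counts comparable across corpora of different sizes.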
Procedure
The following are the directions that the students can follow. Teachers may
want to start the class with a brief review or warm-up activities for quantifiers
and/or count and non-count nouns.
ideas. If the quantifier occurs relatively equally in both, then provide one
example from each context.
(When you are choosing examples, try to pick examples that are helpful
to you.)
6 In your written examples, underline the noun that the quantifier describes.
(We will come back to this and discuss it more after you have searched for
all of the words.)
7 Repeat steps 2–6 for a lot.
8 Repeat steps 2–6 for lots.
9 Repeat steps 2–6 for many.
10 Repeat steps 2–6 for much.
11 Repeat steps 2–6 for loads of.
12 Repeat steps 2–6 for multiple.
13 Repeat steps 2–6 for several.
Note to teachers: Once all of the students have completed these steps, write
each of the quantifiers on the board. Have each student come and write down
one example (or more as you see fit) from their worksheet.
After they have done this, review with the students and identify count vs.
non-count nouns. Use this as an opportunity to have a conversation about
which quantifiers can be used for both types of nouns and which quantifiers can
only be used for one type of noun.
Possible Extensions
• Assign the students to smaller groups of 2–4 and have them write a dia-
logue using quantifiers that are more common in spoken contexts. Have
them share their dialogues upon completion. (You could assign a specific
topic or a specific list of count and non-count nouns.)
• Put the students in pairs. Have the students write a summary of a recent
topic you have covered. Make them include the quantifiers that are more
common in written contexts. (You could choose to provide a specific list
of count and non-count nouns.)
• Have the students come up with a list of other quantifiers or provide a list
of other quantifiers and have them complete the same activity at home for
practice.
Variations
• The students could use their cell phones instead of computers.
• The students could work in pairs instead of individually.
• This activity could be done with different words.
• This activity could be done using different contexts.
Worksheet
Students’ quantifier worksheet
a lot
lots
many
much
loads of
multiple
several
Lesson Background
The goal of this lesson is to teach a group of international master of laws
(LL.M.) students how to use the AntConc concordancing software to search
for and identify the functions of linking adverbials in legal texts. This lesson is
intended to serve as an exercise in inductive learning, in the sense that students
will develop an understanding of how linking adverbials are used as cohesive
devices in American legal discourse by noticing their usage in written legal
corpora. At the very least, students would be able to use this lesson as an
affordance, or supplement, to their legal writing courses.
Students do not necessarily have to be fully versed in corpus-based ap-
proaches, discourse analysis, or grammar to make use of corpus tools in their
learning. The idea here is that, with the ability to use AntConc, international
law students may be better able to explore some of the discourse structures that
are unique to legal English texts, and common-law texts in particular. Because
so many of these students come from backgrounds in civil law, many of them
face particular challenges when writing legal English in the context of American
common law. By encouraging student familiarity with corpus software and
patterns from corpora, they may be better able to solve quandaries that arise in
their legal writing without having to scour the internet, seek the advice of a
professional, or hunt through a textbook.
and sound like lay terms but have specific (and different) legal definitions,
such as consideration, for example.
Ø Baldwin (2014) noted that “these” is the most commonly used form of
demonstrative in American legal English texts. “These,” then, has to be
used correctly as an anaphoric reference to make writing more precise.
However, students often make the mistake of using the word too broadly,
which leads to imprecise and frequently undesirable legal writing. Hartig
and Lu (2014) used corpus tools to show how plain English is not necessar-
ily the only correct way to write effective legal English, which could have
pedagogical implications when shown to students who feel dejected when
their writing style does not match a professor’s prescriptive approach.
Ø Maher (2015) provided some insight into specific grammar features in a
cross-textual corpus analysis of that as it is used in legal English, specifically
with regard to the management of averrals and attributions. His
argument was that any student wishing to study common law had to be
aware of the discourse practices found in that particular discourse com-
munity. While he argued for the use of corpora by teachers and materials
developers, the aforementioned possibilities for student-used corpora do
suggest interesting avenues for future applications in the classroom.
For instance, those students who are motivated to use corpus tools may be
able to acquire basic skills in discourse analysis, at least in the very specific
discourse structures used in legal English.
Procedure
AntConc, a text analyzer and concordancer that allows users to search for
word frequencies and n-grams and to compare keywords across different sets
of corpora, will be used in this lesson to explore a set of 50 Constitutional Law
cases and to search for the most frequently occurring linking adverbials and
four-word strings. Using AntConc as a teaching tool for legal English courses may
start with a search for frequently used linking adverbials (i.e., transition words).
These features are necessary to create cohesive statements in American legal
writing, as the genre requires students to show the effects caused by past texts
on current or future legal situations (e.g., if/then; therefore; as a result), the degree
of causality derived from contrastive analyses (however, although), and how the
sum of different informational units creates a whole (additionally, furthermore).
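The adverbial searches described above can also be run outside AntConc. A minimal sketch follows, assuming plain-text case files; the adverbial list and sample sentence are illustrative, not the lesson's actual materials.

```python
# Sketch of AntConc-style frequency searches for linking adverbials.
# The adverbial list and sample text below are illustrative.
import re
from collections import Counter

LINKING_ADVERBIALS = ["however", "therefore", "furthermore",
                      "although", "additionally", "as a result"]

def adverbial_counts(text):
    """Count each linking adverbial, matching whole words/phrases only."""
    counts = Counter()
    for adv in LINKING_ADVERBIALS:
        pattern = r"\b" + re.escape(adv) + r"\b"
        counts[adv] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

sample = ("Although the court did not characterize this interest as "
          "absolute, it repeatedly indicated otherwise. The state, "
          "therefore, has power to act; however, limits apply.")
counts = adverbial_counts(sample)
print(counts.most_common(3))
```

The word-boundary match (`\b`) matters here: without it, searching for *so* would also count *also* and *reason*.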
Lesson Outline
Pre-class
Teacher: Informs students of the class focus and objectives, provides nec-
essary information and materials via email and, when available, through a
learning management system (e.g., Blackboard, iCollege).
Students: Preview materials (.txt files of all legal cases they are using for the
course; AntConc.exe; instructions to download AntConc software)
In-class
Teacher: [Projects an image of a text (below) onto a large screen] Please take a look at
these sentences from your legal case files, and tell me what you notice about
them. Do they have anything in common? Are they different in any way?
Although the court did not characterize this interest as absolute, it repeat-
edly indicated that it outweighs any countervailing interest that is based on
the quality of life of any individual patient.
In addition to relying on state constitutions and the common law, state
courts have also turned to state statutes for guidance.
The state, therefore, has power to prevent the individual from making
certain kinds of contracts, and in regard to them the Federal Constitution
offers no protection.
These are called linking adverbials, or transition words, and they’re used to
build cohesion in legal texts. [Provides students with a list of common linking adver-
bials and transition words used in legal English writing (See Appendix A)].
First, I want you to try using the AntConc software to notice when the writ-
ers are using different linking adverbials or transition words in the cases. Can
you all please go to [Blackboard or iCollege] or your email and run AntConc
with the .txt files? [Models this action, which is projected to the students]
Students: [Follow model and instructions, running the software and up-
loading and opening .txt files]
Teacher: [Continues to model instructions as they are spoken] Double-click on
the AntConc icon. Now, click on the File tab in the top left corner of the win-
dow. Then, click Open File(s). Next, select the Search.Law.Casebook.txt file.
This file contains all of our legal cases for the semester. Finally, click Open.
Students: [Follow teacher’s guide, load software and import .txt files into
software]
Teacher: [Continues to model instructions as they are spoken] Click on the
tab that says, Word List. Then, at the bottom left of the window, click the
Start icon. You should see a list of words. Now, click on the tab that says,
“Concordance.” Let’s try searching for the word, however. In the search
box, type however, and then click the Search icon. Now, how many hits
do we have for however? [Asks whole class]
Students: [Volunteer responses]
Teacher: [Continues to model instructions as they are spoken] Now, if we
look at these occurrences, what do you notice about them?
Students: [Volunteer responses]
Teacher: [Continues to model instructions as they are spoken] That’s right,
they usually occur at the beginning and middle of the sentence. And, they
are almost always followed immediately by a comma. Let’s go to hit #205.
However is in the first position. This should be the Planned Parenthood
case. Click on however. It should be highlighted in blue.
Students: [Follow instructions]
Teacher: [Continues to model instructions while speaking] Please work
with a partner. Start with the sentence before however – the sentence
should start with, The separate but equal doctrine… Read the two sen-
tences together, and come up with a way to describe what however is
doing to the information.
Students: [Read the sentences as they are instructed; Begin to openly vol-
unteer answers to the teacher]
Teacher: [Addresses student responses, and calls attention to relevant points
made about the text as it is displayed on the projector] Good! I like that
response! I’d like you to notice that however is used, here, to show a con-
trast. In the first sentence, the court is stating one piece of information.
Then, they use however to call attention to a contrast in ideas – the rest
of the sentence following however shows the court’s argument against the
information given in the preceding sentence.
Students: [Remark on the structure presented by the teacher, possibly ask-
ing more detailed questions about however, or possibly other questions
about words that might be used as a substitution]
Teacher: [Responds to student remarks/questions; gives instructions] Now,
continue working with the same partner. With your partner, I want you to
search for the following words in your corpus.
Teacher: [Gives instructions] As you work, make notes about the number of
occurrences of each word, and where they fall in the sentence structure.
If the word comes at the beginning of the sentence, read the sentence that
precedes it. Then, discuss with your partner the function of the word in
the sentence. What is it doing? Why do you think the writers chose that
word, and not another?
Students: [Work on assigned task]
Teacher: [Gives instructions] Now, take 5 minutes and talk to a pair or group close
to you. Talk about what you discovered in the corpus. Discussion questions:
Which words were more frequent?
Did a word appear at the beginning of a sentence, or in the middle?
Why do you think the writer made a choice to use a certain word, but not
another?
What function do you think the word has in the larger context in which
it occurred?
Students: [Work on task]
Teacher: [Addresses entire class] Well, I’m hearing some pretty interesting dis-
cussions. Did anyone have an insight they would like to share, or a question
about the word usage or function?
Students: [Volunteer responses; discussion ensues]
Teacher: [Addresses entire class] Now, I have a little something that I want you
to do before next class. [Gives worksheet (See Appendix B).] I want you to
run your Re Griffiths v. Ambach case file through the AntConc software,
and use that to complete the worksheet. You need to bring this to our next
meeting so we can discuss our results together. Okay? See you next time.
End of Class: [Class is dismissed, teacher answers any specific questions the students
have about the assignment]
ADDITION
furthermore, moreover, too, even more, also, in the second place, again, further, in addition, next, finally, besides, and, or, nor, first, second, secondly, last, lastly
TIME
while, immediately, never, after, later, earlier, always, when, soon, whenever, meanwhile, sometimes, in the meantime, during, afterwards, now, until now, next
EXEMPLIFICATION or ILLUSTRATION
to illustrate, to demonstrate, specifically, for example, for instance, as an illustration, e.g.
COMPARISON
in the same way, by the same token, similarly, in like manner, likewise, in similar fashion
CONTRAST
yet, and yet, nevertheless, nonetheless, after all, but, however, though, otherwise, on the contrary, in contrast, notwithstanding, on the other hand, at the same time
CLARIFICATION
that is to say, in other words, to explain, i.e. (that is), to clarify, to rephrase it, to put it another way
CAUSE
because, since, on account of, for that reason
EFFECT
therefore, consequently, accordingly, thus, hence, as a result
PURPOSE
in order that, so that, to that end, to this end, for this purpose
QUALIFICATION
almost, nearly, probably, never, although, always, frequently, perhaps, maybe
INTENSIFICATION
indeed, to repeat, by all means, of course, (un)doubtedly, certainly, without doubt, in fact, surely
CONCESSION
to be sure, granted, of course, it is true
SUMMARY
to summarize, in sum, in brief, in short, in summary, to sum up
CONCLUSION
in conclusion, to conclude, finally
Appendix B: Homework
Instructions
• Load the Re Griffiths v. Ambach case file into AntConc
• Use the following words when doing your concordance searches:
o However
o Furthermore
o Although
• Use your Transition Words Worksheet to help find alternative word choices
• Fill in the blanks to complete the following exercise
However
Function of the word (give a brief summary):
________________________________________________________________
________________________________________________________________
Furthermore
Function of the word (give a brief summary):
________________________________________________________________
________________________________________________________________
Although
Function of the word (give a brief summary):
________________________________________________________________
________________________________________________________________
Objectives
Students will review the concept of cohesion in academic writing. Students
will become aware of how frequently different transition words are used in
common written texts across disciplines. In addition, they will be provided
opportunities to see patterns in their own use of transitions in their writing.
Students will make sentences using transitions of their choice.
Warm-up
Teacher: We have finished looking at the essay structure, and today we will
start looking at something that is very important for clear and effective
writing. (Goes to the board and writes ‘cohesion’.)
Teacher: Can anybody tell us what ‘cohesion’ means?
Students: Connection of ideas/Linking/Flow/etc.
Teacher: Good. Why is creating strong cohesion important for good writing?
Students: Maintains the flow/helps the reader/easier to read and understand.
Teacher: When we write, what do we use to create cohesion?
Students: Transitions.
Teacher: Great! Yes, some people call them transitions; others might call
them linking words or linking adverbials.
Building Schemata
Teacher: (Displays a list of transitions on the screen.) Work with your partner to
categorize the following transitions into three groups. You have to decide
what the categories are. Write them down in three columns.
Mini-Research Project
Teacher: Transitions from which category do you think are used the most
in academic writing? Do you think that all disciplines/majors use these
transitions equally? (Students volunteer answers.)
Let’s go back to MICUSP and do some mini-research on how frequently each type
of transition is used in argumentative essays, since your next project is an argu-
mentative essay.
(Teacher divides students in three groups, assigns each group a category. Students work
together to count how many times transitions from each category are used.)
Teacher: What are your numbers? (Fills out the table)
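The group tally above can also be sketched programmatically. The categories, transition lists, and sample essay below are placeholders for illustration, not MICUSP data or the teacher's actual list.

```python
# Sketch of the mini-research tally: transitions counted per category.
# Categories, word lists, and the essay are illustrative placeholders.
import re

CATEGORIES = {
    "addition": ["furthermore", "moreover", "in addition", "also"],
    "contrast": ["however", "nevertheless", "on the other hand"],
    "result": ["therefore", "thus", "as a result", "consequently"],
}

def tally(text):
    """Count transitions per category, matching whole words/phrases."""
    text = text.lower()
    return {cat: sum(len(re.findall(r"\b" + re.escape(t) + r"\b", text))
                     for t in terms)
            for cat, terms in CATEGORIES.items()}

essay = ("The results were mixed. However, the trend was clear; "
         "therefore, we revised the model. Furthermore, the data "
         "supported it. As a result, accuracy improved.")
print(tally(essay))
```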
Positions of Transitions
Teacher: We have identified four possible positions where transitions can
appear in sentences. Now, let’s find sample sentences for each transition.
Instructions:
• Copy and paste sample sentences in Google docs under your transition.
• Some transitions don’t appear in all four positions. Examine three
pages of examples, and if you can’t find an example for a certain posi-
tion, move on to the next one.
• Try to choose sentences that your classmates can understand.
Teacher: Now, homework. Read the sample sentences that you have com-
piled in class. Choose 5 transitions that you have never used before or you are
not very clear about how to use. Make sentences in Google docs/Transitions
Homework. Do not forget to put your names by your sentences. We will be
looking at your sentences tomorrow in class.
Lesson Background
The following lesson is expected to be taught to smaller classes (between
5 and 10 students) of learners whose proficiency, on the American Council on
the Teaching of Foreign Languages (ACTFL) scale, ranges from intermediate-mid
to advanced-high or above. By the time of the lesson, ELLs
will be familiar with the basic principles of CL and the use of corpus tools but
may not be confident enough to go exploring through a corpus. The objectives
of the lesson are as follows:
Students in the course will have learned (1) the basic theory and use of corpora
in the classroom, (2) the effects of register and register variation in language
usage, (3) where to find online corpora for English, (4) the online interface of
the COCA, and (5) the AntConc concordancer using files from the Michigan
Corpus of Academic Spoken English (MICASE).
The following lesson is not intended for the instruction of new tools or techniques.
Rather, it establishes a class project to sustain the use of corpora. The lesson is not
intended to focus solely on one corpus. Instead, it opens the doors to a variety
of corpora depending on the interests of the students.
The Project
This lesson will introduce a class project that will be active for approximately
nine weeks after the Journal Entry template is taught. The project will be
presented at the end of the handout provided. The project will serve three pur-
poses by the end of nine weeks:
• Students can reference their journal for corpus findings that they investigated.
• The Journal Entries will serve as part of a class portfolio for a grade.
• The project will allow students to become comfortable with corpus tools.
linguistic features across spoken and written texts (pp. 25–26). The em-
phasis of the course will largely center on spoken language; therefore,
a spoken register is necessary as the target domain.
Ø According to Tribble (2015), MICASE is the third most commonly taught
corpus tool for ELLs. This is likely because MICASE is a free online in-
terface that is relatively easy to teach to ELLs. It is important to remember,
however, that the selection of corpora for this particular project is flexible
since the EDJ is general enough to apply to a variety of corpora. Teachers
or learners willing to use other corpora, or to have learners develop their own
corpora as Charles (2014) did, can still participate in the EDJ as long as
the corpus tool to be applied is taught first. What is critical is that learners
are familiar with the tools before developing the habit of exploration. What-
ever corpus is used, a brief explanation of the limitations and uses of that
specific corpus is necessary to illuminate the importance of selecting corpora
based on whether a corpus is representative of the language to be modeled.
Ø As a coordinating side project, learners will be required to post their findings
on a blog as part of their project. For the learners that I teach, the blog serves
two purposes: First, it is part of a project where they will post reflections on
their language learning experiences. Second, it will also serve as a database for
some of the tools and techniques they learn throughout courses in the study
abroad program they are attending. This includes the EDJ entries that they
will maintain throughout the semester. The first step of creating a blog will be
part of a lesson during the first week of classes. WordPress (www.wordpress.com),
a website that offers a free blog service, is recommended for the purposes
of the activity since it is easy to create an account and free to use.
cognitive and metacognitive learning strategies; the use of authentic materials like
movies, newspapers, Instagram or Facebook posts, and music; and corpora as a
language learning tool. The underlying goal of the course was to provide learners
with an alternative means to learn language outside of textbooks and traditional
classroom settings. It should be noted that corpora and this activity can be incor-
porated into a variety of courses and classes that may not directly require it. It is,
after all, a tool. With good direction, corpora benefit language learners in general.
It is not best kept in a single classroom that only deals with corpora.
4. What is the REGISTER(S) of the WORD or PHRASE you are interested in?
6. List any COLLOCATIONS (friends of the word) that are common for the translation
of the WORD or PHRASE in your first language. Include the token number.
7. See if the same COLLOCATIONS that you would use in your first language are
possible in English.
10. Write an entire sentence as an EXAMPLE with the WORD or PHRASE you
explored.
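Collocation counts like those the template asks for can be approximated with a simple window count. The sample sentence and window size below are illustrative; a real journal entry would draw on MICASE or COCA search results.

```python
# Sketch: collocates of a node word with token counts, via a
# +/- 2-word window over an invented sample sentence.
from collections import Counter

SAMPLE = ("the study was conducted by bona fide researchers and , "
          "by the way , it was reviewed by the committee").split()

def collocates(node, tokens, window=2):
    """Count words occurring within +/- window of each hit of node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

print(collocates("by", SAMPLE).most_common(4))
```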
Please keep a master copy of the template drawn out earlier to use throughout
the journal.
One of the great things about using a corpus is discovery. With or with-
out the help of a guide (teacher), you now have the ability to explore English
through corpora. You need to maintain and continue the hunt for understand-
ing by continually exploring the English language! All great explorers kept
maps and journals of their journeys. Now it’s your turn! Your project is to
develop a corpus journal.
What is the REGISTER(S) of the WORD or PHRASE you are interested in?
Spoken English in classrooms
(Example for Class)
4037 hits
Don’t let all the details overwhelm you! If you have any questions or concerns,
talk to each other or ask Mr Matt.
List any COLLOCATIONS (friends of the word) that are common for the translation
of the WORD or PHRASE in your first language. Include the token number.
is [past participle] by (360 tokens)/ by bona fide or by bona fide researchers (both
152 tokens) / by the way (118 tokens)
See if the same COLLOCATIONS that you would use in your first language are
possible in English.
(Intuition) “By” carries a variety of translations in Spanish including por, de, en,
and con. Therefore, the literal translation of by is complicated. For example, the
common expression por favor would literally translate into by/through/because of
favor.
Por bona fide does not exist in the corpus and can be inferred to be rare or not possible.
(Example for Class)
We’ve explored collocations. We’ve gathered information and seen how they
are used in English and in your first language.
So…what’s next?
The next step is to form hypotheses based on patterns that we see!
Find three common patterns of the COLLOCATIONS for your WORD or PHRASE
Is [past participle] by
1. Passive voice
2. Verb phrase
3. Is owned by has 152 hits in MICASE.
By bona fide
1. Bona fide means authentic or true
2. Used mainly for emphasis.
3. Latin expression for in good faith used in English.
By the way
1. A lexical bundle
2. Used to add information or change the subject in a conversation
3. Used in sentence-initial, medial, and final positions
(Example for Class)
If you have a hypothesis on how something in English works, test it out and
document your findings for extra credit!
1.
2.
Finished! You’ve completed your first Explorer’s Journal Entry in the English
language. Congratulations!
I. Instructions
• Complete two Journal Entries every week for the remainder of the semes-
ter (nine weeks).
• You will present your Journal Entries to class partners, the teacher, or the
entire class every Monday during class time.
• The Journal can be completed online through a blog or on paper through
a notebook or journal.
• Mr Matt will collect paper entries on Monday to look them over and will
return them on Tuesday.
• Mr Matt will check your blog weekly.
• You are expected to explore DIFFERENT words or phrases in English.
For example, examining by and by the way as separate journal en-
tries will not be allowed. You need to explore a variety of words or
phrases.
• Critical Sections are sections that require exploration or explanation. These
primarily include sections 6–11. Evidence of Testing for Extra Credit can
be concordance lines that show the hypothesized pattern or 10 examples
written or typed from
II. Rubric
Each journal entry will be graded in three different aspects + Extra Credit as
shown later:
Entry Completion
• Unsatisfactory: The Journal Entry is missing two or more sections, OR the missing section is a critical section.
• Acceptable: The entry is missing one section, but it doesn’t affect the entire entry.
• Good: All sections are completed, but some critical sections do not have much written in them.
• Excellent: All the sections are fully completed and there is plenty written in every critical section.
Variety (compared to previous entries)
• Unsatisfactory: Every entry is about the same word, group of words, or types of phrases.
• Acceptable: There is some repetition, but there is also some variety.
• Good: Most of the entries are different, but there is some repetition between them.
• Excellent: Every entry is different and explores different words or phrases in the English language.
Mr Matt will happily grade any extra exploring that you do. The Journal
Entries will be a part of your student portfolio in Mr Matt’s class. They will
be considered part of the grade for the class.
• You have to buy in. Honestly, if you’re not willing to spend time and effort
developing and refining corpus activities, they will likely not have the de-
sired effect. For learners to believe it will help, the teacher has to believe it
will help.
• To use corpora in the classroom, understand corpora! As teachers, we need to
take the time to explore any corpus tools or corpora before introducing
them to learners.
• Training montages are myths! There is no quick way to train learners in cor-
pora (no matter what music you choose). It will take time to reach a point
of competence with corpora in order to use this activity.
• Limitation: Are we there yet? An area of uncertainty is that the time and
effort it takes to get learners to the point of this activity will likely
differ between classes. Future studies may try to find out how long it
generally takes to develop autonomy with corpus tools between learner
proficiencies.
• How does one eat an elephant? One bite at a time. How does one teach corpora?
One byte at a time. Well, maybe not that slow. We do, however, need to
be careful to expose learners to a few new things per class, especially when
dealing with technical tools like corpora.
• Read the room! If your learners have glazed-over eyes or blank expressions,
you might need to investigate what material might be confusing. Do not
be shy about mixing things up by going impromptu in class, even if you
have no plan B.
• Spice it up! Corpora have a tendency to not look exciting on their own.
Consider concordance lines; they are literally lines of text taken from a
broader context. It is on us teachers to make it fun. Good teachers can make
good entertainers.
• Make time for Q&A. Learners who have never worked with corpora may
not understand why it matters. It is important to address their questions
and concerns, including what corpora are, how they work, and why they
matter.
• Make it your own. Once you, the teacher, are familiar with material, change
things to complement your teaching style and philosophy. This activity is
not perfect for every group of learners or every teacher. Change things to
make them better for your learners!
292 Corpus-Based Activities in the Classroom
In this section, I focus primarily on the teaching of spoken and written ac-
ademic discourse for English learners, but the selected lessons and activities
following this introduction may also be adapted for specific contexts and/
or purposes across settings, including those that are outside university class-
rooms. Sociolinguistic and English for Occupational Purposes (EOP) topics
and applications also benefit from the utilization of corpus data, which, in turn,
could be taught to a wide range of learners. The acquisition of features, such
as politeness in English, for example, is relevant to many international students
attending US universities. Explaining effective, respectful features of email
writing, with text excerpts from corpora, can be a relevant and useful topic of
workshops or orientation programs for international students during their first
semester on US campuses.
It is important to fully explain and illustrate the concept of register variation
when teaching the characteristics of spoken and written English. Although we
might find that a rich and complete description of speakers or writers could
contextualize their language use, it is also important to realize that these indi-
viduals use language differently, depending on the audience and purposes they
have. Corpus-based research on register variation has shown that the lexical
and grammatical findings from one register of a language cannot easily be
generalized to other registers or to the language as a whole. In other words, if
a finding is made based on texts that come from one situational context, that
finding may not apply to language that is used in other settings (Friginal &
Hardy, 2014a). For example, the way that we speak to an employer or a school
principal would differ significantly from the way we banter with friends at a
sporting event. In analyzing this concept linguistically, Biber, Conrad, and
Reppen (1998) investigated how the lemma deal was used in two written regis-
ters (academic prose and fiction) from the Longman-Lancaster Corpus. Each of
these sub-corpora consisted of two million words. The researchers were
interested in seeing how often deal was used as a noun or as a verb. Across
this four-million-word sample, the overall difference between noun and verb
uses was not that great: deal was used as a noun 366 times and as a verb 482
times. However, when looking at the differences between
registers, the researchers found that the distribution of the nominal and verbal
forms was quite different. In fiction, deal was more likely to be used as a noun,
and in academic prose, it was much more likely to be used as a verb.
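The register comparison behind findings like these can be sketched in a few lines of Python over POS-tagged text. The tagged snippets below are invented toy data standing in for tagged sub-corpora, not the Longman-Lancaster Corpus itself:

```python
from collections import Counter

def pos_distribution(tagged_tokens, lemma):
    """Count how often `lemma` occurs under each POS tag."""
    counts = Counter(tag for word, tag in tagged_tokens
                     if word.lower() == lemma)
    return dict(counts)

# Invented toy samples standing in for tagged sub-corpora.
fiction = [("a", "DT"), ("great", "JJ"), ("deal", "NN"),
           ("of", "IN"), ("money", "NN"), ("deal", "NN")]
academic = [("we", "PRP"), ("deal", "VB"), ("with", "IN"),
            ("theories", "NNS"), ("that", "WDT"), ("deal", "VB")]

print(pos_distribution(fiction, "deal"))   # noun uses dominate here
print(pos_distribution(academic, "deal"))  # verb uses dominate here
```

A real replication would, of course, run the same count over tagged sub-corpora of comparable size, then normalize the frequencies.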
The theoretical arguments over what can and cannot be fully captured in
corpora and how to best conduct research or teach their use to language learn-
ers identify important considerations that should further define CL and its ap-
plications. Kachru (2008) notes that corpus-based linguistic research is as good
as the corpora on which it is based, and grammatical or lexical analyses of
corpora are as good as the analytical tools, such as grammatical taggers or con-
cordancers, and many other new software programs specifically developed to
analyze them. In addition, defining specific grammars of language, that is, spo-
ken vs. written grammars, and recognizing that they exist are also important
considerations. Sociolinguists are often interested in the context within which
speakers and writers use language. The study of sociolinguistics focuses on
variation in language form and use that is associated with social, situational,
attitudinal, temporal, and geographic influences (Friginal & Hardy, 2014a),
and CL has itself evolved over several decades to strongly support these em-
pirical investigations of language-in-use. It is clear that the use of corpora and
corpus-based approaches is an invaluable and indelible contribution to the field
of sociolinguistics in exploring the structural and functional characteristics of
spoken and written discourse. CL offers ways to investigate the composition
of linear strings of language, showing the linguistic context that can provide
learners with the components of a target word (i.e., a key word) used in a sen-
tence; an utterance; or, as introduced earlier in this book, KWIC searches. It
is important to note, however, that KWIC searches are not necessarily only of
orthographic words. On the contrary, one can search for letters, morphemes,
and even multiple words. In languages like Chinese and Korean, one might
even use such a program to search for characters.
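A basic KWIC display is straightforward to sketch. The function below is an illustrative toy, not the implementation behind any particular concordancer; the sample sentence is invented:

```python
def kwic(text, key, width=30):
    """Return keyword-in-context lines: left context, keyword, right context."""
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,;?!") == key.lower():
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            lines.append(f"{left:>{width}}  {tok}  {right}")
    return lines

sample = "We must deal with this issue before the deal expires."
for line in kwic(sample, "deal"):
    print(line)
```

Searching for morphemes or characters rather than orthographic words would simply mean matching substrings of each token instead of whole tokens.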
Closely connected with sociolinguistic research, the CL approach in the
study of World Englishes focuses on emerging varieties of English as they adapt
to changing circumstances of use and contact with local languages and cultures.
World Englishes has been operationalized to show the expanding nature of En-
glish used by ESL/EFL speakers in various contexts. Studies of World Englishes
have focused on two major subareas: (1) indigenous varieties of English and (2)
the study of English as a Lingua Franca (ELF). Corpus development efforts in
representing indigenous varieties of English are best represented by the Interna-
tional Corpus of English (ICE), as briefly noted in Section B1. The ICE project
is an attempt to construct comparable corpora for all varieties of English spoken
around the world (Greenbaum, 1996).
CL and Teaching Spoken/Written Discourse 295
According to Seidlhofer (2007), the most
widespread contemporary use of English throughout the world is that of ELF,
and the Vienna-Oxford International Corpus of English (VOICE) and the Cor-
pus of English as Lingua Franca in Academic Settings (ELFA) (Mauranen, 2007)
are both invaluable resources (also briefly introduced in Section B1).
Analyzing and teaching intercultural spoken interactions: Two rel-
evant examples here are (1) academic speech between international teaching
assistants (ITAs) in the US and their everyday interaction with students, and (2)
the discourse of telephone-based customer service call centers between offshore
(those located outside the US) representatives and American callers. Corpora
have been collected for these two registers, and results of various analyses have
been used for teaching and training purposes, especially for the L2 interlocutors.
Over the years, US universities have increasingly employed international
graduate students (especially doctoral students) as teaching assistants. These
ITAs typically assist professors in grading papers, manage study groups, and
hold regular office hours to meet with students regarding class projects and
examinations. In mathematics, science, and engineering departments, ITAs
commonly teach many introductory courses themselves. These ITAs know the
contents of the class very well, but limitations in language and with teaching
strategies have been reported by students (Reinhardt, 2010). Pickering (2006)
reported that student complaints about ITAs’ language use were motivated pri-
marily by prejudicial behavior or the general complexity of the content of the
class. Continuing training and research are instituted in many graduate pro-
grams to help ITAs improve their use of English in the classroom. To this end,
Reinhardt (2010) investigated spoken directives by ITAs in office-hour consul-
tations from a corpus-based perspective. Two corpora were used in the study
for comparison: the ITAcorp (Thorne, Reinhardt, & Golombek, 2008), which
was a learner corpus that collected classroom activities from advanced ESL and
ITA preparation courses (mentioned previously in Section B2.3.2), and MI-
CASE, the Michigan Corpus of Academic Spoken English (Simpson, Briggs,
Ovens, & Swales, 2002). ITAcorp represented the language of ITAs, while
MICASE was used for comparison as it provided samples of spoken texts by
academic professionals (e.g., full-time instructors and professors). The ITAcorp
includes spoken academic English texts (e.g., lecture presentations, discussion
leadings, and office hour role-plays). Spoken texts from MICASE include 152
academic speech events (e.g., advising, colloquia, discussion sections, lectures,
office hours, and tutorials) from a balance of university academic disciplines
(Simpson-Vlach & Leicher, 2006). MICASE also features coded sociolinguistic
variables, such as speaker age, gender, academic rank, and field of study.
The primary goal of Reinhardt’s (2010) comparison was to inform ITA
instruction in the context of ESP and cross-cultural pragmatics. In addition, a
social-functional approach was conducted to analyze variables, such as polite-
ness in academic conversations involving directives. Summative corpus results
showed that the ITAs made fewer statements marking independence and
inclusion appeals than the instructors and professors did, but they used
directive constructions more frequently. A comparison of ‘directive vocabulary
constructions’ (e.g., I suggest that, I recommend that you) showed that the ITAs
preferred these constructions far more than the professors (the ‘experts’) did:
overall, ITAs produced 73 such constructions in contrast to 10 from the experts.
Reinhardt, citing Blum-Kulka, House, and Kasper (1989), suggested that
performatives like I
suggest or I recommend, as used by ITAs, had the effect of an indirect imperative
used to soften the force of the instruction. Directive vocabulary constructions
preferred by ITAs emanate from an institutional authority, but many ITAs have
distanced themselves from the power source of the directive by invoking poli-
cies or rules (e.g., it is required or the administration suggests).
I also specialize in the analysis of professional spoken discourse, especially in
the context of outsourced call centers in countries such as the Philippines and
India that serve callers from the US. I collected a Call Center Corpus provided
by a global, US-owned call center company, and I use this corpus in part to
develop language training materials for call center representatives or ‘agents.’
Communications in outsourced call centers have clearly defined roles, power
structures, and standards against which the satisfaction levels of customers
during and after the transactions are often evaluated. Callers typically demand
to be given the quality of service they expect or can ask to be transferred to an
agent who will provide them the service they prefer. Offshore agents’ ‘perfor-
mance’ in language and explicit manifestations of pragmatic skills naturally are
scrutinized closely when defining ‘quality’ during these outsourced call center
interactions. It is clear that intercultural communication in customer service
has become an everyday phenomenon in the US as callers come into direct
contact with agents who do not share some of their basic assumptions and per-
spectives. Before the advent of outsourcing, Americans had a different view of
customer service facilitated on the telephone. Calling help desks or the cus-
tomer service departments of many businesses mostly involved call-takers who
were able to provide a more localized service. Interactants typically shared the
same “space and time” and an awareness of current issues inside and outside of
the interactions (Friginal, 2009; Friginal & Hardy, 2014a). Based on a 2010
survey, customer satisfaction with calls perceived to be handled in the US was
more than one-fifth higher than it was with calls perceived to be handled out-
side the country. Furthermore, callers said that one of the biggest differences
between “foreign and American call centers” was the ease of understanding
the customer service agent (Brockman, 2010; Friginal, 2011). Clearly, these
additional contexts affect the way in which offshore agents attempt to connect
and interact with their customers across various types of tasks.
Training materials developed for call center agents may cover word lists and
multiword units or chunks in teaching the use of recurring patterns, common
types of questions, or the use of politeness markers. These features could also
be extracted from a corpus of caller utterances to prepare agents for the typ-
ical caller language across a variety of tasks (e.g., troubleshooting or product
purchase). Tables C4.1 and C4.2 show frequent extended chunks from agents
and callers in a US-based call center corpus (i.e., with American agents). These
patterns could be presented in a training workshop in India or the Philippines
for discussion (hopefully, with accompanying sound files). Sample discussion
questions may include the following:
Table C4.1 Frequent extended chunks from agents (US-based call center corpus)
1 Thank you for calling customer support or Thank you for calling (followed by specific
company name)
2 may I have your first and last name please
3 how may I help you today
4 can I have your phone number starting with the area code please
5 how can I help you today
6 what else I can help you with
7 thank you so much for calling
8 your first and last name please
9 is there anything else I can help you with?
10 can I put you on hold for a minute
11 thanks for calling (company name) my name is
12 Can I have your phone number?
13 it’s at the back of the modem
14 thank you for choosing (company) services
15 do you have a pen and paper?
16 let me just go ahead and (cancel, file this, process this order, change the code, etc.)
17 thank you very much for that
18 Thank you, is there anything else?
19 thank you so much for waiting
20 if I put you on hold, would that be OK?
21 Thank you for calling (company name) this is (agent name) speaking
22 I’ll be more than happy to assist you
23 I’m gonna go ahead and change this number
24 may I please have your DSL telephone number?
25 could you please hold for a minute or two?
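Chunks like those in the list above are typically identified by counting recurring n-grams across a collection of transcripts. A minimal sketch, with invented call openings and an illustrative frequency threshold:

```python
from collections import Counter

def frequent_bundles(texts, n=4, min_freq=2):
    """Count n-word sequences across texts; keep those that recur."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return [(chunk, c) for chunk, c in counts.most_common() if c >= min_freq]

# Invented toy transcripts standing in for a call center corpus.
calls = [
    "thank you for calling customer support how may i help you today",
    "thank you for calling how can i help you today",
]
for chunk, count in frequent_bundles(calls, n=4):
    print(count, chunk)
```

On a real corpus, the threshold would normally be a normalized rate (e.g., occurrences per million words across a minimum number of texts) rather than a raw count.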
• Compare and contrast these two sets of phrases or extended chunks with
vocabulary lists (previously discussed). What are noticeable patterns, sim-
ilarities, and differences?
• Spend time studying and memorizing these chunks from US-based agents
and practice their pronunciation across various possible contexts (e.g., type
of transaction, level of pressure in the call, customer’s/caller’s patience or
disposition during the call).
• Define these chunks. Are meanings clear? Are there examples of poten-
tially misleading or complicated bundles on the list?
• What possible COMPREHENSION topics might these phrases suggest
that should be considered? Note that these are expressions from American
callers, and these chunks represent typical oral language structures from
actual customers.
What do they actually mean? Sample: In the following four-word bundles, what
could be the callers’ primary message or question to the agent? Use these
expressions in a sentence:
Lesson Background
This lesson takes place in a content-based, US academic writing course
taught at Duke Kunshan University in Kunshan, China. This course is geared
toward EFL students who plan to study abroad at an English-medium uni-
versity in the future. By the end of this course, students will be able to write
response papers that make an original argument, present support through evi-
dence and analysis, and incorporate the work of others through summary and
citation practices. Aside from writing, students will also be able to contribute
to academic group discussions and give short paper presentations. The course
content is sociolinguistics, with a specific look at endangered languages and
language policy.
Students in this class often feel self-conscious about their English lan-
guage skills, and they frequently apologize for their ‘Chinglish.’ These
apologies are often followed by questions of how to make their speaking
and writing more native-like. Additionally, the students mention concerns
about using new words they find in the dictionary in their writing because
they are afraid of using them incorrectly. Given these concerns, I incorpo-
rate corpus tools into the class to offer the students a way to do their own
investigation into US English. I hope that in these investigations, they will
come to find that English is more flexible than they might think. I hope this
approach will build students’ awareness of how grammar rules can change
depending on the context and that variation in usage is normal. As such, even
their ‘Chinglish’ grammar is a variety in its own right, appropriate within a
particular context.
Students bring their laptops to class on the days when we engage in corpus
lessons (if a student’s laptop has a problem, one can be borrowed from the
school’s IT department). This class is 1 hour and 15 minutes long, and the fol-
lowing lesson constitutes the first 45 minutes.
Introduction
The lesson associated with the following handout takes place in the third week
of classes. The lesson has a sociolinguistic base, even though it is not initially
presented as such to the students. By the end of class, students will
1 Analyze the usage of “doing good” and “doing well” in both spoken and
written language using the COCA.
2 Develop awareness of differences between spoken and written registers.
3 Reflect on language usage (i.e., moving away from classifying language as
“correct” or “incorrect”).
4 Better understand my language teaching philosophy (i.e., moving away
from purely prescriptive rules and analyzing language in terms of its actual
usage).
Introduction
Speaker A: How are you doing?
Speaker B: I’m doing ________. (Fill in the blank with the word(s) most
people would say.)
Compare with the students near you. Do you agree? Why or why not?
The Grammar
What part of speech is the word “good” usually?
What part of speech is the word “well” usually?
• ________ modify nouns.
• ________ modify verbs.
So, we could expect many people to say: I’m doing ________.
Spoken Language Analysis (Using COCA)
1 Go to the website http://corpus.byu.edu/coca/
2 On the left-hand side of the page, in the blue box, click the word Compare.
3 Type doing good in the “Word1” box and doing well in the “Word2”
box.
4 Click on the word Sections (not the check box), and choose Spoken for
both “1” and “2.”
5 Click Compare words above “Sections.”
6 In line “1” for doing good, which should have the word good, click the num-
ber that appears under “W1.”
7 Look for examples where people are asked “how are you?” “are you okay?”
or similar questions, then fill in the following chart based upon the exam-
ple given. (Hint: You can click “Ctrl + F” or “Command + F” to search
the page for these questions.)
Sample entry:
Source: Nursing Ethics (Line 13) | Part of speech: Noun | Example: “The notion
of doing good, being good, and acting on the good, which resembles virtue
ethics” | Meaning: Good refers to charitable deeds.
8 Repeat the same steps for doing well. (Hint: Remember to look at column
“W2.”)
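The steps above lean on the COCA web interface, but the underlying comparison is just phrase-frequency counting normalized across registers of different sizes. The snippet below sketches that idea with invented two-line “registers,” purely for illustration:

```python
def per_million(phrase, text):
    """Frequency of `phrase` per million words of `text`."""
    tokens = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    hits = sum(1 for i in range(len(tokens) - n + 1)
               if tokens[i:i + n] == target)
    return hits / len(tokens) * 1_000_000

# Invented toy samples standing in for spoken and academic sub-corpora.
spoken = "i'm doing good thanks how are you doing i'm doing good"
academic = "the firm is doing well and its rivals are doing well too"

for phrase in ("doing good", "doing well"):
    print(phrase, round(per_million(phrase, spoken)),
          round(per_million(phrase, academic)))
```

Normalizing to a rate per million words is what makes frequencies from sub-corpora of different sizes comparable, which is the point of COCA’s Compare display.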
General Discussion
1 What part(s) of speech does “good” serve in spoken language? Do you find
the same pattern in academic language?
2 What part(s) of speech does “well” serve in spoken language? Do you find
the same pattern in academic language?
3 Can you identify a rule for when to use “doing good” and “doing well”?
Is one of them simply incorrect?
4 Can you think of another way to characterize language use other than
correct vs. incorrect?
I want my students to notice that language is both creative and messy and to view
variation in language as the norm, which it is in reality. I want them to have
confidence that their variety of English is also acceptable, and it is up to
both interlocutors, whether native speaker or not, to negotiate for meaning
when communicating.
Lesson Background
This lesson on using Text Lex Compare (Cobb, 2016) to examine political dis-
course could be used in a number of different courses, but it was designed
primarily with an ESL public speaking course in mind. (It could also be used
in an ESL civics course or in composition studies and political science courses.)
In an ESL context, this lesson would be integrated into a unit on persuasive
writing and speaking. It is a noticing activity that uses corpora from American
Related Comparisons
(This part may be provided to students in the form of a handout or as the topic of a lecture
before the activity)
Ø Friginal and Hardy (2014a) and Reiter (2011) explored the language of US
presidential inaugural addresses, which comprise a distinct type of very formal
political discourse. The address is delivered orally as a speech in public, in a
very formal ceremony, and yet, this speech is written and extensively pre-
pared, at least for most presidents. In over 220 years of American history, there
have been only 58 such speeches, starting with George Washington in 1789
through Donald Trump in 2017. They appear at regularly scheduled intervals.
Up until 1933, when Franklin D. Roosevelt took office, presidents had been
inaugurated every four years on March 4, about four months after the Novem-
ber elections. The date changed with the adoption of the 20th Amendment to
the US Constitution. Since Roosevelt’s first term, inaugurals have been held
every four years on January 20th (Reiter, 2011). These inaugural addresses
serve a function that is ceremonial as well as deliberative, and the rhetoric in
these speeches reflects that.
Ø There is much related research on the form, content, rhetorical style, and
historical applications of presidential inaugurals. Recently, radio and television
coverage of these speeches has turned them into major events, with both the
content and delivery scrutinized and analyzed immediately after delivery by
political commentators. The media and internet broadcast of Barack
Obama’s inaugural address in 2009, a historic address delivered by the
first African-American president of the US, was met with great global
enthusiasm. The ceremonial contexts of these addresses have also shifted over
time. George Washington spoke in New York to a small group composed
mostly of supporters and elected senators and congressmen, while George W.
Bush, at the start of his second term in 2005, spoke in Washington, D.C., to an
international audience who tuned in to listen to US post-9/11 thoughts and
sentiments and immediate future directions related to the ongoing turmoil in
the Middle East at that time.
Ø The following two excerpts compare George Washington’s first inaugural
address in 1789 with Bill Clinton’s second inaugural address, delivered on
January 20, 1997—a difference of over 200 years. Personal pronouns are
highlighted in these two excerpts, and references to the country are un-
derlined (e.g., note Clinton’s explicit mention of America). An additional
content analysis of Washington vs. Clinton’s speeches in these two excerpts
can certainly provide you with ideas to pursue further in conducting a
diachronic corpus-based analysis of all US presidential inaugural addresses
(adapted from Friginal & Hardy, 2014a).
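One way to start such an analysis is to count pronoun rates programmatically. The sketch below uses naive whitespace tokenization and short invented fragments echoing the two excerpts, purely as an illustration:

```python
def pronoun_profile(text):
    """Rates per 1,000 words of first-person singular vs. plural pronouns."""
    tokens = text.lower().split()
    singular = {"i", "me", "my", "mine"}
    plural = {"we", "us", "our", "ours"}
    n = len(tokens)
    return (sum(t in singular for t in tokens) / n * 1000,
            sum(t in plural for t in tokens) / n * 1000)

# Invented fragments echoing the two excerpts, for illustration only.
washington = "I was summoned by my country whose voice I can never hear"
clinton = "let us lift our eyes we must keep our old democracy forever young"

print(pronoun_profile(washington))  # (singular rate, plural rate)
print(pronoun_profile(clinton))
```

Run over the full set of addresses, counts like these give a simple diachronic measure of the shift from first-person singular to first-person plural framing.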
George Washington
New York, Thursday, April 30, 1789
(Excerpt from 1,428 words total)
Among the vicissitudes incident to life no event could have filled me with
greater anxieties than that of which the notification was transmitted by your
order, and received on the 14th day of the present month. On the one hand,
I was summoned by my Country, whose voice I can never hear but
with veneration and love, from a retreat which I had chosen with the fond-
est predilection, and, in my flattering hopes, with an immutable decision,
as the asylum of my declining years--a retreat which was rendered every
day more necessary as well as more dear to me by the addition of habit
to inclination, and of frequent interruptions in my health to the gradual
waste committed on it by time. On the other hand, the magnitude and dif-
ficulty of the trust to which the voice of my country called me, being
sufficient to awaken in the wisest and most experienced of her citizens a
distrustful scrutiny into his qualifications, could not but overwhelm with
despondence one who (inheriting inferior endowments from nature and
unpracticed in the duties of civil administration) ought to be peculiarly
conscious of his own deficiencies. In this conflict of emotions all I dare
aver is that it has been my faithful study to collect my duty from a just
appreciation of every circumstance by which it might be affected. All I dare
hope is that if, in executing this task, I have been too much swayed by a
grateful remembrance of former instances, or by an affectionate sensibility
William J. Clinton
Washington, DC, January 20, 1997
(Excerpt from 2,157 words total)
My Fellow Citizens:
At this last presidential inauguration of the 20th century, let us lift our
eyes toward the challenges that await us in the next century. It is our
great good fortune that time and chance have put us not only at the
edge of a new century, in a new millennium, but on the edge of a bright
new prospect in human affairs, a moment that will define our course,
and our character, for decades to come. We must keep our old de-
mocracy forever young. Guided by the ancient vision of a promised
land, let us set our sights upon a land of new promise.
The promise of America was born in the 18th century out of the bold con-
viction that we are all created equal. It was extended and preserved in the
19th century, when our nation spread across the continent, saved the union,
and abolished the awful scourge of slavery. Then, in turmoil and triumph, that
promise exploded onto the world stage to make this the American Century.
And what a century it has been. America became the world’s mightiest
industrial power; saved the world from tyranny in two world wars and a
long cold war; and time and again, reached out across the globe to millions
who, like us, longed for the blessings of liberty.
Along the way, Americans produced a great middle class and security
in old age; built unrivaled centers of learning and opened public schools to
all; split the atom and explored the heavens; invented the computer and the
microchip; and deepened the wellspring of justice by making a revolution
in civil rights for African Americans and all minorities, and extending the
circle of citizenship, opportunity and dignity to women.
Now, for the third time, a new century is upon us, and another
time to choose. We began the 19th century with a choice, to spread our
nation from coast to coast. We began the 20th century with a choice,
to harness the Industrial Revolution to our values of free enterprise,
conservation, and human decency. Those choices made all the difference.
Sample KWIC lines for must from President Obama’s 2013 address:
• Together, we resolved that a great nation must care for the vulnerable,
and protect its people from life’s worst hazards and misfortune.
• Now, more than ever, we must do these things together, as one nation,
and one people.
• We must harness new ideas and technology to remake our government,
revamp our tax code, reform our schools, and empower our citizens
with the skills they need to work harder, learn more, and reach higher.
• We must make the hard choices to reduce the cost of health care and the
size of our deficit. But we reject the belief that America must choose
between caring for the generation that built this country and investing
in the generation that will build its future.
Procedure
By the end of this lesson, students will be familiar with the language and rhe-
torical features of formal, political speeches. They will be able to understand
the difference between casual spoken English and the register that is used in
speeches. Students will be able to pick out and discuss some characteristics of
persuasive language that are being used in the speeches they select. This lesson
will also help students improve their spoken English in formal registers.
Warm-up
• Do presidents use the same language as everyone else when you see them
on TV? Why or why not? What do you notice about the way they talk?
• What do you know about Barack Obama, Donald Trump, or Bill Clinton?
What can you say about the way they communicate ideas on TV or respond
to interviews?
Demonstration
• Teacher shows how to use Text Lex Compare with President John F. Kenne-
dy’s Inaugural speech compared with Barack Obama’s (Table C4.3).
• Teacher: These two speeches are of the same register, but Kennedy’s speech
was given in 1961 and Obama’s speech was given in 2009.
• Introduce the results (see Token Recycling Index Data in the following),
noting that the 3rd most common shared token was “we,” and the 16th
most common shared token was “they.” How do pronouns shape the tone
of each speech?
• The frequent use of auxiliary verbs in both speeches is also quite interest-
ing. The 8th most popular shared token is “be.” The 15th most popular
token in both speeches is “can.” “Will” is the 27th most popular shared
token. What rhetorical purpose do these auxiliary verbs serve?
Table C4.3 Comparison of unique and shared word tokens/families from Text Lex
Compare

Unique to first: 626 tokens, 424 families
Shared: 1028 tokens, 229 families
Unique to second: 337 tokens, 230 families
(Lists can be displayed alphabetically first or by frequency, then alphabetically.)

TOKEN Recycling Index: (1028 repeated tokens : 1365 tokens in new text) = 75.31%
FAMILIES Recycling Index: (229 repeated families : 459 families in new text) = 49.89%
(Token recycling will normally be the most interesting measure of, for example,
text comprehensibility, as it is with VPs.)
Exploration Activity
• The students will select two American presidential speeches to analyze
from a list of speeches from the American Presidency Project. (This can be
done as a paired activity.) Students will have to justify their comparisons,
identify vocabulary features to compare, and discuss and present their re-
sults in class.
• Some guide questions for students: What patterns did you find interesting in your speeches? Were there any differences related to the speakers' political affiliations, the eras in which the speeches were delivered, or other factors? What has this activity taught you about "word choices" in public speaking?
Additional Resources:
• The Avalon Project (http://avalon.law.yale.edu/subject_menus/inaug.asp)
• Inaugural Words: 1789 to the Present, from the New York Times
CL and Teaching Spoken/Written Discourse 313
How can Text Lex Compare best serve teachers and students?
What are ideal topics, activities, and settings for this
feature?
Text Lex Compare is most effective when used to compare a small number of texts in great detail. It is better suited to analyzing the lexical features of texts because it counts "tokens" rather than chunks of words; it is not as effective as AntConc at analyzing collocations. When using Text Lex Compare, select a small number of texts that have a lot in common. Text length also plays a large role in the effectiveness of this tool: make sure that the two texts are around the same size, or the results will be skewed. A helpful feature of Text Lex Compare is that it automatically limits the number of texts that students can look at, so they can focus on a smaller set of texts without being overwhelmed by the flood of information that larger corpus tools sometimes produce. This is useful if you are looking at important historical documents that students will need to know extremely well. For example, certain political speeches are discussed in detail on the US citizenship exam and in many 100-level political science courses.
Lesson Background
My (former) university’s Office of International Initiatives has hosted visiting
scholars through its Faculty Mentoring Program, in an effort to promote pro-
fessional development and strengthen the university’s global presence and rela-
tionships. Each visiting scholar is paired with a faculty mentor from the same
field to work with during his/her stay on campus. Within the program, there
is also an Intensive English Instruction component that lasts eight weeks, for
four hours per week. It involves English training for written and oral compre-
hension and production. This part of the program focuses on topics like writ-
ing professional emails, leading discussions, and giving presentations as well as
on specific elements of English, like pronunciation issues, field-related writing
conventions, and cultural aspects of communication.
The participants for this program were nine Chinese professors from a diverse
group of academic fields including art, biology, computer science, English, early
childhood education, history, and physics. For this group, we decided to devote
more time to academic writing skills as well as an analysis of more in-depth,
subject-specific vocabulary. The majority of the scholars were competent readers of academic prose (especially in their own fields), and several had published research articles in English before. Despite this, academic writing and its specific conventions were an area that most members of the group expressed interest in improving. Corpus-based instruction appeared to be an applicable and meaningful lesson focus for the eight-week course, and it matched the
academic level and specific goals of the participants. The following is the outline of
my course, together with some examples of activities and suggestions for teachers.
Participant | Question 1: What types of English writing have you done in the past? | Question 2: What are your English writing goals for the future? Do you hope to publish papers in English? | Question 3: What parts of English writing are most difficult for you?
Spoken English Corpus (that I provided) and save them each onto a flash
drive for the next class. I demonstrated how to do this.
E. The following class took place in the computer lab again, and the group was
shown how to use the basic features of the AntConc program. They opened
the corpus file, generated the word list ranked by frequency, and were
shown how to search for specific words or phrases. After they had been
given sufficient time to familiarize themselves with the basic functions,
the class used a handout to guide them through more searches. Working
with a partner, they recorded the top ranked words of the corpus and then
completed searches to determine which prepositions and verbs were used
in differing contexts (e.g., move in vs. move on). They were shown how to
analyze the examples to help identify the differences in meaning. The final
activity allowed them to conduct their own searches to identify patterns
they were curious about, check their intuitions, and further familiarize
themselves with using the program.
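What the Concordance tab produces for searches like move in vs. move on can be sketched in a few lines of Python. This Key Word in Context (KWIC) illustration is my own and is far simpler than AntConc's actual implementation:

```python
import re

def kwic(text, phrase, width=30):
    """Key Word in Context lines for a phrase, in the spirit of AntConc's
    Concordance tab: each match is shown with `width` characters of
    left and right context, aligned on the search phrase."""
    lines = []
    for m in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines

speech = "We plan to move in next week, then move on to the next project."
for line in kwic(speech, "move on"):
    print(line)
```

Scanning the aligned context lines is what lets learners see, for example, that move in tends to precede places and move on tends to precede next steps.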
The next assignment for the class was to start collecting research articles
from their discipline and send them to me by email. The ultimate goal was
for each student to have his or her own discipline-specific corpus written
by native speakers (NSs) of English, which the students could keep as a ref-
erence tool well after the English instruction course had ended. They were
told to collect a minimum of 8 articles, but most of the students eventually
sent closer to 15. I converted the articles, most in PDF format, into text
files and uploaded them into separate folders organized by discipline that
the students could access online through a shared folder.
F. Analyzing Patterns of Writing: It took approximately two weeks for par-
ticipants to have their self-compiled corpora ready. Before each participant
was assigned to analyze his or her specific corpus, a class was scheduled to analyze the linguistic features of three registers of writing: news, fiction,
and research articles. This was intended to familiarize the participants with
exploring patterns first by hand in order to demonstrate the ease, effective-
ness, and speed of doing it with a concordancing program later. They were
given a one-page excerpt from each category and tasked with underlining
every verb on the page and trying to describe any patterns they found.
After each sample was analyzed, I guided the class into accurately de-
scribing their findings. Examples of the findings included numerous uses
of the present perfect in the news excerpt, a high frequency of the past
simple and past perfect in the fiction sample, and the prevalence of the
passive voice in the research article. This activity culminated in the main assignment: performing their own searches using their self-compiled corpora.
Participants were given a great deal of flexibility in what to search for,
but the following questions served as a guide:
1 What are some examples of common verbs?
2 What grammatical patterns are most common?
1. News
Read through the news sample and underline all of the main verbs in each
sentence. What grammatical patterns are most common (For example: do/did/
have done/had done/is doing/was doing/was done/etc.)? Are there any verbs that are
used more than once?
2. Fiction
Read through the short story sample and underline all of the main verbs in each
sentence. What grammatical patterns are most common (For example: do/did/
have done/had done/is doing/was doing/was done/etc.)? Are there any verbs that are
used more than once?
3. Academic
Read through the academic article sample (applied linguistics) and underline
all of the main verbs in each sentence. What grammatical patterns are most
common (For example: do/did/have done/had done/is doing/was doing/was done/
etc.)? Are there any verbs that are used more than once?
4. Your Discipline
Analyze the verb patterns of your discipline using AntConc.
1 Open AntConc (from your USB or from http://www.laurenceanthony.net/
software/antconc/)
2 Select File and click on Open File(s)...
3 Select all of the files from the corpus of your field (biology, computer
science, etc.)
4 Go to the Word List tab
5 Click the Start button to see the most frequent words in your corpus
6 You can use the Concordance tab to search for specific verbs/words/
phrases
7 Answer the following guide questions:
What are some examples of common verbs?
What grammatical patterns are most common?
Did you find anything interesting or surprising?
How can you use this information to improve your writing?
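For teachers curious about what happens behind the Word List tab in step 5, the core idea is simply a frequency count over all the corpus files. This minimal Python sketch is my own illustration of the principle, not AntConc's code, and it uses a deliberately simple tokenizer:

```python
import re
from collections import Counter

def word_list(corpus_texts):
    """Frequency-ranked word list over a set of plain-text files,
    similar in spirit to AntConc's Word List tab (simplified tokenization:
    lowercased alphabetic strings only)."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common()

corpus = ["The results were analyzed.", "Results are reported in Table 1."]
print(word_list(corpus)[:3])  # [('results', 2), ('the', 1), ('were', 1)]
```

Seeing the counting laid bare can also help participants understand why tokenization choices (hyphens, apostrophes, numbers) change the word list a concordancer reports.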
Table C4.5 Sample worksheet: analyzing patterns in writing from your discipline with
AntConc
Participant | What are some examples of common verbs [participants were told they could look up any types of words if preferred]? | What grammatical patterns are most common? | Did you find anything interesting or surprising? | How can you use this information to improve your writing?
Lesson Background
The lesson describes Zhu and Friginal’s (2015) Text X-Ray (see Section A3.2
for an introduction to this POS-visualizer) and its application in an academic
writing course for multilingual writers. I introduced Text X-Ray to my students
in a university-level undergraduate composition course. Selected features of the
program were highlighted through three separate activities. While the curricu-
lum and syllabus for this course had already been determined before I had access
to Text X-Ray, I was able to integrate these activities into the pre-established
course syllabus in a way that aligned with the general course objectives.
The class consisted of 21 students representing the following 7 native lan-
guages: Arabic, Korean, Italian, Spanish, Chinese, Norwegian, and Romanian.
The age of students ranged from late teens to mid-20s. While some students
had been in the US for several years (the longest time spent in the US was eight
years), the majority had been in the country for as few as four months. The
latter self-identified in the classroom as “international students,” while many of
the former had immigrated to the US with their families as adolescents and thus
might best be understood in the field as “Generation 1.5” students (Doolan &
Miller, 2012).
At the risk of overgeneralizing the heterogeneity of learners’ experiences,
“Generation 1.5” typically refers to students aged 25 and under who immi-
grated from a non-English-dominant country during their adolescence, speak
a language other than English at home, and have strong speaking and listening
skills in English (Doolan & Miller, 2012). Students who fell within these crite-
ria exhibited advanced conversation skills in English but found issues pertaining
to sentence structure, basic grammatical categories (i.e., POS), and word choice
particularly challenging in their writing. Furthermore, I noticed that these
students struggled to grasp the conventions of college-level academic writing.
Lessons
The following sample lessons describe three separate activities in which Text
X-Ray was utilized in the described course. The activities are presented in the
order in which they took place in the semester.
Sample Lesson #1
The first time I introduced Text X-Ray to my students, I used it to demonstrate the tool's ability to highlight grammatical categories. The objective of this first lesson was to cultivate learners' ability to self-edit for subject-verb agreement and for issues related to basic sentence structure, both of which I had found to be prevalent in learners' later drafts, despite repeated attempts to encourage students to self-edit the final versions of their essays for sentence-level mistakes.
Note to Teachers
• Text X-Ray, in its most basic application, can show teachers/learners the
use of particular POS (e.g., nouns, verbs, adjectives, and adverbs) in a text.
• If a specific objective in a course, perhaps a course in second language
grammar for international students in the US, focuses on the use of a
certain feature, such as ‘existential there,’ Text X-Ray can provide a quick
Figure C4.2 Highlighted grammatical categories of writing from Text X-Ray (Zhu & Friginal, 2015).
Next, the hard copies of drafts were returned to their original authors, and students were shown the following instructions on the digital projector:
Students were instructed to complete this portion of the lesson in dyads, that is,
partners would paste and analyze one student’s paper before moving on to the
next student’s paper. While students were working, I walked around the room
and facilitated learners’ initial attempts to access Text X-Ray; there were few
difficulties. I noticed that partners were actively interacting with both the Text
X-Ray interface and with one another, pointing to the screen, pointing to the
essay in question, and often verbalizing and gesturing animatedly if there was
a discrepancy between the two.
Whenever possible, I asked students to reflect on such discrepancies. Why
had they not realized that this particular word was a noun? Why had students
labeled another word a noun if Text X-Ray did not? I also observed that many
students had begun to experiment with other features of Text X-Ray, such as
the brown button labeled “verbs.” At this point, I directed their attention to
further instructions on the digital projector:
Sample Lesson #2
Another lesson in which Text X-Ray was utilized took advantage of the corpus
tool’s ability to highlight the linguistic features of a particular text as compared
to the linguistic features conventional to a chosen genre or text type.
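The underlying comparison is length-normalized: Text X-Ray reports feature rates rather than raw counts, so a short student text can be set against genre averages. The sketch below illustrates just this normalization step; both the counts and the reference rate are invented for illustration, not actual MICUSP figures:

```python
def rate_per_100(count, total_words):
    """Occurrences per 100 words: a length-normalized rate, so texts of
    different sizes can be compared against a reference average."""
    return count / total_words * 100

# Hypothetical figures for illustration only (not real MICUSP averages).
student_nouns, student_words = 58, 250
reference_noun_rate = 28.0
student_rate = rate_per_100(student_nouns, student_words)
print(f"student: {student_rate:.1f} vs reference: {reference_noun_rate:.1f} nouns per 100 words")
```

Without this normalization, any three-paragraph excerpt would trivially show "fewer" of every feature than a full corpus of argumentative essays.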
During an in-class discussion, a few students had pointed out that the essays they were required to read in their textbook did not necessarily align with the expectations I had communicated to them regarding their own academic writing.
I was quite encouraged by this level of genre-awareness in my students and I
chose to incorporate a genre investigation into a subsequent lesson that would
allow students to further explore their observation.
Fortunately, students in this course were already familiar with MICUSP from
previous classroom activities. Students were directed to a document in the Drop-
box folder, which included three paragraphs from a recently assigned essay in the
class textbook: “Sorry, Vegans: Brussels Sprouts Like to Live, Too,” written by
Natalie Angier in 2009 for the New York Times. Working in pairs, students pasted
these three paragraphs into Text X-Ray’s text editor and then chose the “Compare
with MICUSP” option. Because the students were currently working on writing
argumentative essays of their own, they were told to compare based on this paper
type alone. Thus, students selected “Paper Type”—“Argumentative Essay” and
then hit “Compare.” They received a comparison like the following:
Sample Lesson #3
In a third lesson, students used Text X-Ray to investigate their own use of re-
cently learned vocabulary in their writing.
Throughout the assigned course readings, learners were required to keep logs
of “newly-encountered” words. The vocabulary log required learners to note the
page and paragraph numbers of four new words/phrases from each text, to copy each
word/phrase along with the 1–2 words that came before and after it in the text (for
the sake of recording collocational information), and to provide a definition of the
vocabulary items in the students’ own words. Students uploaded their vocabulary
logs following each course reading, after which I extracted the most frequently
cited words into a master list that students could reference at any time via the
course Dropbox folder. As students’ essays centered on topics that emerged from
course readings, they were encouraged to incorporate recently learned vocabulary
from course readings into their own essays; however, no quantitative manner of
identifying and/or tallying such vocabulary had been introduced.
Figure C4.4 Sample MICUSP comparison output in Text X-Ray (visualized comparison output of student paper and MICUSP averages automatically obtained by clicking the "Compare" button under the "Compare with MICUSP" tab).
Students were instructed to copy and paste their most recent draft of an
essay into Text X-Ray’s text editor. They were then guided to download a
copy of the vocabulary list I prepared and select “Load a word list from this
computer,” under Text X-Ray’s “Customized word list” feature. Once this list
was uploaded, students highlighted the uploaded document under “Word list”
and clicked “Highlight the words from the list(s).” Doing so caused any words
from the class-generated vocabulary list to be underlined in red in the students’
writing, providing a visual representation of the amount of new vocabulary
learners were incorporating into their writing. Students performed this activity
in dyads so that there was greater spoken interaction and accountability.
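The highlighting step amounts to checking an essay against the class-generated word list. As a rough, non-visual analogue of Text X-Ray's customized word-list feature (the essay and word list below are invented examples of my own):

```python
import re

def vocabulary_hits(essay, word_list):
    """Which class-vocabulary items appear in an essay, and how often.

    A sketch of what underlies Text X-Ray's word-list highlighting:
    instead of underlining matches in red, it returns a count per
    list word that occurs at least once."""
    tokens = re.findall(r"[a-z']+", essay.lower())
    counts = {w: tokens.count(w.lower()) for w in word_list}
    return {w: n for w, n in counts.items() if n > 0}

essay = "The author's argument is persuasive, and the evidence is compelling."
class_list = ["persuasive", "evidence", "rhetoric"]  # invented example list
print(vocabulary_hits(essay, class_list))  # {'persuasive': 1, 'evidence': 1}
```

Note that this exact-match approach, like a simple word-list highlighter, misses inflected forms ("persuaded" would not count as a hit for "persuasive"), which can itself be a useful discussion point with learners.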
In the final portion of this lesson, students were encouraged to use the “Word
Cloud” function of Text X-Ray to produce a visual graphic of the most fre-
quently used words in their writing. Students were then asked to individually
write a reflection of their current progress in vocabulary use based on the results
of the word cloud and the highlighted vocabulary found in their essay. Questions
students were encouraged to answer in their reflection included the following:
• How much of the class vocabulary are you using in your own writing?
• Based on your word cloud, which word(s) do you use the most in your
writing? Why do you think this is the case?
• What is one way you can practice using more unfamiliar vocabulary in
your academic writing?
I’d also like to play around more with the “Readability” functions, which allow students to visualize the number of complex words and sentences they’re using in their writing. There are also many other ways that highlighting grammatical categories could be used to tailor an activity to whatever we’re working
on in class at a particular point in the semester. For example, I might want to
encourage learners to compose more compound sentences using coordinating
conjunctions, to proofread for misuse of prepositions, and so forth. Being able
to easily track their use of these categories throughout their writing would be
valuable to students.
Note
1 Note again that we plan to launch a full, new version of Text X-Ray upon comple-
tion of our usability tests. If you want to access the beta version, please send an email
to textxray.beta@gmail.com for instructions on how to run the program online.
References
Aijmer, K. (2011). “Well I’m not sure I think…” The use of well by non-native speak-
ers. International Journal of Corpus Linguistics, 16(2), 231–254.
Alderson, J. C. (2007). Judging the frequency of English words. Applied Linguistics,
28(3), 383–409.
Alsop, S., & Nesi, H. (2009). Issues in the development of the British Academic Written
English (BAWE) corpus. Corpora, 4(1), 71–83.
Altenberg, B., & Granger, S. (2002). The grammatical and lexical patterning of MAKE
in native and non-native student writing. Applied Linguistics, 22(2), 173–195.
Altenberg, B., & Tapper, M. (1998). The use of adverbial connectors in advanced Swedish
learners’ written English. In S. Granger (Ed.), Learner English on computer (pp. 80–93).
Harlow, UK: Addison Wesley Longman.
Anderson, W., & Corbett, J. (2009). Exploring English with online corpora. New York, NY:
Palgrave Macmillan.
Angier, N. (December 21, 2009). Sorry, vegans: Brussels sprouts like to live, too. The
New York Times. Retrieved March 14, 2016 from www.nytimes.com/2009/12/22/
science/22angi.html
Anthony, L. (2013). AntFileConverter [Computer Software]. Tokyo, Japan: Waseda
University. Retrieved July 9 from www.laurenceanthony.net/
Anthony, L. (2014). AntConc (Version 3.4.3) [Computer Software]. Tokyo, Japan:
Waseda University. Retrieved July 9 from www.laurenceanthony.net/
Anthony, L. (2015). TagAnt [Computer Software]. Tokyo, Japan: Waseda University.
Retrieved July 9, from www.laurenceanthony.net/
Aston, G. (2015). Learning phraseology from speech corpora. In A. Lenko-Szymanska &
A. Boulton (Eds.), Multiple affordances of language corpora for data-driven learning (pp. 65–84).
Amsterdam: John Benjamins.
Athelstan. (1999). MonoConc Pro. (Computer Software).
Baker, P. (2006). Using corpora in discourse analysis. New York, NY: Continuum.
Baldwin, E. (2014). Beyond contrastive rhetoric: Helping international lawyers use co-
hesive devices in legal writing. Florida Journal of International Law, 26, 399–446.
Barlow, M. (2012). MonoConc Pro 2.2 (MP2.2) [Computer Software]. Retrieved
September 17, 2014 from www.monoconc.com/
Bauman, J. (1995). About the general service list. Retrieved March 2011, from http://
jbauman.com/aboutgsl.html.
Belcher, D. (1994). The apprenticeship approach to advanced academic literacy: Gradu-
ate students and their mentors. English for Specific Purposes, 13(1), 23–34.
Belcher, D. (Ed.). (2009). English for specific purposes in theory and practice. Ann Arbor, MI:
University of Michigan Press.
Bennett, G. (2010). Using corpora in the language learning classroom. Ann Arbor, MI: University
of Michigan Press.
Biber, D. (1988). Variation across speech and writing. Cambridge, UK: Cambridge University
Press.
Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge,
UK: Cambridge University Press.
Biber, D. (2004). Corpus linguistics and language teaching. Invited lecture, University
of California, Berkeley, Berkeley, CA, March 15.
Biber, D. (2006). University language: A corpus-based study of spoken and written registers.
Amsterdam: John Benjamins.
Biber, D. (2009). A corpus-driven approach to formulaic language in English: Multi-
word patterns in speech and writing. International Journal of Corpus Linguistics, 14(3),
275–311.
Biber, D. (2010). Biber Tagger [Computer Software]. Flagstaff, AZ: Northern Arizona
University.
Biber, D., & Conrad, S. (2009). Register, genre, and style. New York, NY: Cambridge
University Press.
Biber, D., Conrad, S., & Leech, G. (2002). A student grammar of spoken and written English.
London, UK: Longman.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language struc-
ture and use. Cambridge, UK: Cambridge University Press.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman gram-
mar of spoken and written English. Harlow, UK: Pearson.
Biber, D., Reppen, R., & Friginal, E. (2010). Research in corpus linguistics. In R. B.
Kaplan (Ed.), The Oxford handbook of applied linguistics (2nd ed., pp. 548–570).
Oxford, UK: Oxford University Press.
Blum-Kulka, S., House, J., & Kasper, G. (1989). Cross-cultural pragmatics: Requests and
apologies. Norwood, NJ: Ablex.
Bohannon, J. (2010). Google opens books to new cultural studies. Science, 330(6011),
1600.
Bohannon, J. (2011). Google books, Wikipedia, and the future of culturomics. Science,
331, 135.
Boulton, A. (2009). Testing the limits of data-driven learning: Language proficiency
and training. ReCALL, 21(1), 37–54.
Boulton, A. (2012). Beyond concordancing: Multiple affordances of corpora in uni-
versity language degrees. Procedia - Social And Behavioral Sciences, 34, 33–38
(Languages, Cultures and Virtual Communities (Les Langues, les Cultures et les
Communautes Virtuelles).
Boulton, A. (2015). Applying data-driven learning to the web. In A. Lenko-Szymanska &
A. Boulton (Eds.), Multiple affordances of language corpora for data-driven learning (pp. 267–
295). Amsterdam: John Benjamins.
Brand, C., & Götz, S. (2011). Fluency versus accuracy in advanced spoken learner
language: A multi-method approach. International Journal of Corpus Linguistics, 16(2),
255–275.
British Academic Spoken English and BASE Plus Collections. (2017). The British Aca-
demic Spoken English. Retrieved March 14, 2016 from www2.warwick.ac.uk/fac/
soc/al/research/collections/base/
Brockman, J. (2010). Who’s taking your call? CFI Group’s 2010 Contact Center Satisfaction
Index. Retrieved March 14, 2016 from www.cfigroup.com/resources/whitepapers_
register.asp?wp=46
Brown, P., Lai, J., & Mercer, R. (1991). Aligning sentences in parallel corpora. In
Proceedings of the twenty-ninth annual meeting of the association for computational linguistics
(pp. 169–176). Berkeley, CA.
Buysse, L. (2012). So as a multifunctional discourse marker in native and learner speech.
Journal of Pragmatics, 44(13), 1764–1782.
Candlin, C. N., & Hafner, C. A. (2007). Corpus tools as an affordance to learning in
professional legal education. Journal of English for Academic Purposes, 6, 303–318.
Carr, C. T., Schrock, D. B., & Dauterman, P. (2012). Speech acts within Facebook
status messages. Journal of Language and Social Psychology, 31(2), 176–196.
Carter, R., McCarthy, M., Mark, G., & O’Keeffe, A. (2016). English grammar today.
Cambridge, UK: Cambridge University Press.
Casanave, C. P. (2006). Controversies in second language writing: Dilemmas and decisions in
research and instruction. Ann Arbor, MI: University of Michigan Press.
Celce-Murcia, M. (1991a). Discourse analysis and grammar instruction. Annual Review
of Applied Linguistics, 11, 459–480.
Celce-Murcia, M. (1991b). Grammar pedagogy in second and foreign language teach-
ing. TESOL Quarterly, 25, 135–151.
Celce-Murcia, M., Dörnyei, Z., & Thurrell, S. (1997). Direct approaches in L2 instruc-
tion: A turning point in communicative language teaching? TESOL Quarterly, 31,
141–152.
Chambers, A., Farr, F., & O’Riordan, S. (2011). Language teachers with corpora in
mind: From starting steps to walking tall. Language Learning Journal, 39, 85–104.
Chapelle, C. (1998). Multimedia CALL: Lessons to be learned from research on in-
structed SLA. Language Learning & Technology, 2(1), 22–34.
Chapelle, C., & Jamieson, J. (2009). Tips for teaching with CALL: Practical approaches to
computer-assisted language learning. White Plains, NY: Pearson Education.
Charles, M. (2005). Phraseological patterns in reporting clauses used in citation: A corpus-
based study of theses in two disciplines. English for Specific Purposes, 17, 113–134.
Charles, M. (2014). Getting the corpus habit: EAP learners’ long-term use of personal
corpora. English for Specific Purposes, 35, 30–40.
Cheng, W. (2012). Exploring corpus linguistics: Language in action. New York, NY:
Routledge.
Cheng, W., Greaves, C., & Warren, M. (2005). The creation of a prosodically tran-
scribed intercultural corpus: The Hong Kong Corpus of Spoken English (prosodic).
ICAME Journal, 29, 47–68.
Cheng, W., Greaves, C., & Warren, M. (2008). A corpus-driven study of discourse intona-
tion. Amsterdam: John Benjamins.
Chujo, K., Utiyama, M., & Miura, S. (2006). Using a Japanese-English parallel corpus
for teaching English vocabulary to beginning level students. English Corpus Studies,
13, 153–172.
Cilveti, L. D., & Perez, I. K. A. (2006). Textual and language flaws: Problems for
Spanish doctors in producing abstracts in English. IBERICA, 11, 61–79.
Cobb, T. (1997). Is there any measurable learning from hands-on concordancing? Sys-
tem, 25, 301–315.
Cobb, T. (2016). Compleat Lexical Tutor v.8 [website]. Retrieved December 9, 2016
from http://lextutor.ca
Cohen, W. (2015). Enron email dataset. Retrieved from www.cs.cmu.edu/~enron/
Collins CoBuild English Language Dictionary. (1987). London: Collins CoBuild.
Conrad, S. (2000). Will corpus linguistics revolutionize grammar teaching in the 21st
century? TESOL Quarterly, 34(3), 548–560.
Conrad, S., & Biber, D. (2009). Real grammar: A corpus-based approach to English. New
York, NY: Pearson-Longman.
Cook, G. (1998). The uses of reality: A reply to Ronald Carter. ELT Journal, 52(1), 57–63.
Cortes, V. (2004). Lexical bundles in published and student disciplinary writing: Exam-
ples from history and biology. English for Specific Purposes, 23, 397–423.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213–238.
Coxhead, A. (2002). The academic wordlist: A corpus‐based word list for academic
purposes. In B. Kettemann & G. Marko (Eds.), Teaching and learning by doing corpus
analysis (pp. 73–89). Amsterdam: Rodopi.
Coxhead, A. (2011). The academic word list 10 years on: Research and teaching impli-
cations. TESOL Quarterly, 45(2), 355–361.
Crashborn, O. (2008). Open access to sign language corpora. In O. Crashborn, T. Hanke,
E. Efthimiou, I. Zwitserlood, & E. Thoutenhoofd (Eds.), Construction and exploitation of
sign language corpora (pp. 33–38). Third workshop on the representation and processing
of sign language. Paris: European Language Resources Association (ELRA).
Creswell, J. (2014). Research design: Qualitative, quantitative, and mixed methods approaches
(4th ed.). New York, NY: Sage.
Crossley, S., & McNamara, D. (2009). Computational assessment of lexical differences
in L1 and L2 writing. Journal of Second Language Writing, 18, 119–135.
Daniels, M. (2017). The largest vocabulary in hip hop. Retrieved from https://pudding.
cool/2017/02/vocabulary/index.html
Davies, M. (2008). The Corpus of Contemporary American English (COCA): 520
million words, 1990-present. Retrieved from https://corpus.byu.edu/coca/
Davies, M. (2009). The 385+ million word Corpus of Contemporary American
English (1990–2008+): Design, architecture, and linguistic insights. International
Journal of Corpus Linguistics, 14(2), 159–190.
Davies, M. (2010). The Corpus of Historical American English (COHA): 400 million
words, 1810–2009. Retrieved from https://corpus.byu.edu/coha/
Davies, M. (2013). Corpus of Global Web-Based English: 1.9 billion words from speak-
ers in 20 countries (GloWbE). Retrieved from https://corpus.byu.edu/glowbe/
Davies, M. (2017a). Using large online corpora to examine lexical, semantic, and cul-
tural variation in different dialects and time periods. In E. Friginal (Ed.), Studies in
corpus-based sociolinguistics (pp. 19–82). New York, NY: Routledge.
Davies, M. (2017b). WordandPhrase.Info [website]. Retrieved March 18, 2016 from
corpus.byu.edu
Davies, M., & Gardner, D. (2010). A frequency dictionary of contemporary American English:
Word sketches, collocates, and thematic lists. New York, NY: Routledge.
De Haan, P. (1989). Postmodifying clauses in the English noun phrase: A corpus-based study.
Amsterdam: Rodopi.
Donley, K., & Reppen, R. (2001). Using corpus tools to highlight academic vocabulary
in SCLT. TESOL Journal, 12, 7–12.
Doolan, S., & Miller, D. (2012). Generation 1.5 written error patterns: A comparative
study. Journal of Second Language Writing, 21(1), 1–22.
Educational Testing Service. (2016). TOEFL iBT questions writing sample responses
[pdf ]. Retrieved December 9, 2016 from www.ets.org/Media/Tests/TOEFL/pdf/
ibt_writing_sample_responses.pdf
Edutopia. (2017). [Website] Retrieved March 12, 2017 from www.edutopia.org/
Eggington, W. (2004). Rhetorical influences. As Latin was, English is? In C. L. Moder &
A. Martinovic-Zic (Eds.), Discourse across languages and cultures (pp. 251–265). Phila-
delphia, PA: John Benjamins.
Eisenstein, J., O’Connor, B., Smith, N. A., & Xing, E. P. (2010). A latent variable
model for geographic lexical variation. Paper presented at the Proceedings of the
2010 Conference on Empirical Methods in Natural Language Processing, MIT
Stata Center, Cambridge, MA, USA.
Ellis, R. (1995). Interpretation tasks for grammar teaching. TESOL Quarterly, 29, 87–105.
Ellis, R. (1998). Teaching and research: Options in grammar teaching. TESOL Quar-
terly, 32, 39–60.
Firth, J. (1957). Papers in linguistics. Oxford, UK: Oxford University Press.
Flowerdew, J. (2002). Genre in the classroom: A linguistic approach. In A. M. Johns (Ed.),
Genre in the classroom: Multiple perspectives (pp. 91–102). New York, NY: Routledge.
Flowerdew, L. (2005). An integration of corpus-based and genre-based approaches to
text analysis in EAP/ESP: Countering criticisms against corpus-based methodolo-
gies. English for Specific Purposes, 24(3), 321–332.
Flowerdew, L. (2012). Corpora and language education. New York, NY: Palgrave Macmillan.
Fotos, S. (1994). Integrating grammar instruction and communicative language use
through grammar consciousness-raising tasks. TESOL Quarterly, 28, 323–351.
Francis, D., Rivera, M., Lesaux, N., Kieffer, M., & Rivera, H. (2006). Practical guidelines
for the education of English language learners: Research-based recommendations for instruction
and academic interventions. Portsmouth, NH: RMC Research Corporation, Center
on Instruction.
Francis, G., Hunston, S., & Manning, E. (1996). Collins Cobuild grammar patterns 1:
Verbs. London, UK: Harper Collins.
Friginal, E. (2009). The language of outsourced call centers: A corpus-based study of cross-
cultural interaction. Amsterdam: John Benjamins.
Friginal, E. (2011). Interactional and cross-cultural features of outsourced call center
discourse. International Journal of Communication, 21(1), 53–76.
Friginal, E. (2013a). 25 years of Biber’s multi-dimensional analysis: Introduction to the
special issue. Corpora, 8(2), 137–152.
Friginal, E. (2013b). Developing research report writing skills using corpora. English for
Specific Purposes, 32(4), 208–220.
Friginal, E. (2015). Concordancers. In J. Bennet (Ed.), The Sage encyclopedia of intercul-
tural communication (pp. 109–111). Thousand Oaks, CA: Sage.
Friginal, E., & Hardy, J. A. (2014a). Corpus-based sociolinguistics: A guide for students. New
York, NY: Routledge.
Friginal, E., & Hardy, J. A. (2014b). Conducting Biber’s multi-dimensional analysis
using SPSS. In T. Berber Sardinha & M. Pinto (Eds.), Multi-dimensional analysis: 25
years on (pp. 297–316). Amsterdam: John Benjamins.
Friginal, E., Lee, J. J., Polat, B., & Roberson, A. (2017). Exploring spoken English learner
language using corpora: Learner talk. London, UK: Palgrave Macmillan.
Friginal, E., Li, M., & Weigle, S. (2014). Revisiting multiple profiles of learner compo-
sitions: A comparison of highly rated NS and NNS essays. Journal of Second Language
Writing, 23, 1–14.
Friginal, E., & Mustafa, S. (2017). A comparison of US-based and Iraqi English research
article abstracts using corpora. Journal of English for Academic Purposes, 25, 45–57.
Friginal, E., & Polat, B. (2015). Linguistic dimensions of learner speech in English in-
terviews. Corpus Linguistics Research, 1, 53–82.
Friginal, E., Walker, M., & Randall, J. (2014). Exploring mega-corpora: Google Ngram
viewer and the Corpus of Historical American English. E-JournALL, 1(1), 132–151.
Friginal, E., Waugh, O., & Titak, A. (2017). Linguistic variation in Facebook and
Twitter posts. In E. Friginal (Ed.), Studies in corpus-based sociolinguistics (pp. 342–362).
New York, NY: Routledge.
Fukushima, S., Watanabe, Y., Kinjo, Y., Yoshihara, S., & Suzuki, C. (2012). Develop-
ment of a web-based concordance search system based on a corpus of English papers
written by Japanese university students. Procedia – Social and Behavioral Sciences, 34,
54–58. Languages, Cultures and Virtual Communities (Les Langues, les Cultures et
les Communautes Virtuelles).
Gavioli, L. (2005). Exploring corpora for ESP learning. Philadelphia, PA: John Benjamins.
Gavioli, L., & Aston, G. (2001). Enriching reality: Language corpora in language ped-
agogy. ELT Journal, 55(3), 238–246.
Geluso, J., & Yamaguchi, A. (2014). Discovering formulaic language through data-
driven learning: Learner attitudes and efficacy. ReCALL, 26(2), 225–242.
Gilquin, G. (2008). Hesitation markers among EFL learners: Pragmatic deficiency or
difference? In J. Romero-Trillo (Ed.), Pragmatics and corpus linguistics: A mutualistic
entente (pp. 119–149). Berlin: Mouton de Gruyter.
Gilquin, G., De Cock, S., & Granger, S. (Eds.). (2010). The Louvain International Da-
tabase of Spoken English Interlanguage, handbook and CD-ROM. Louvain-la-Neuve,
Belgium: Presses Universitaires de Louvain.
Granger, S. (1983). The be + past participle construction in spoken English with special emphasis
on the passive. New York, NY: Elsevier Science Publishers.
Granger, S., Gilquin, G., & Meunier, F. (2015). The Cambridge handbook of learner corpus
research. Cambridge, UK: Cambridge University Press.
Granger, S., Hung, J., & Petch-Tyson, S. (Eds.). (2002). Computer learner corpora, second
language acquisition, and foreign language teaching. Amsterdam: John Benjamins.
Grant, L., & Ginther, A. (2000). Using computer-tagged linguistic features to describe
L2 writing differences. Journal of Second Language Writing, 9(2), 123–145.
Gray, B. (2013). More than discipline: Uncovering multi-dimensional patterns of vari-
ation in academic research articles. Corpora, 8(2), 153–181.
Greenbaum, S. (Ed.). (1996). Comparing English worldwide: The International Corpus of
English. Oxford, UK: Clarendon Press.
Grieve, J. (2016). Regional variation in written American English. Cambridge, UK: Cambridge
University Press.
Grieve, J., Biber, D., Friginal, E., & Nekrasova, T. (2010). Variation among blogs: A
multi-dimensional analysis. In A. Mehler, S. Sharoff & M. Santini (Eds.), Genres on
the web: Corpus studies and computational models (pp. 45–71). New York, NY: Springer-
Verlag.
Grieve, J., Nini, A., Guo, D., & Kasakoff, A. (2014). Big data dialectology: Analyzing
lexical spread in a multi-billion word corpus of American English. Paper presented at
the American Association for Corpus Linguistics Conference, Northern Arizona
University, Flagstaff, AZ.
Grigoryan, T. (2016). Using learner corpora in language teaching. International Journal
of Language Studies, 10(1), 71–90.
Guo, L. (2011). Product and process in TOEFL iBT independent and integrated writ-
ing tasks: A validation study (Doctoral dissertation). Retrieved from Georgia State
University’s Catalog (gast.2478085).
Guo, G., Crossley, S. A., & McNamara, D. S. (2013). Predicting human judgments of
essay quality in both integrated and independent second language writing samples:
A comparison study. Assessing Writing, 18(3), 218–238.
Halvey, M., & Keane, M. T. (2007). An assessment of tag presentation techniques.
Poster presentation at http://dblp.uni-trier.de/db/conf/www/www2007.html.
Handford, M. (2010). The language of business meetings. Cambridge, UK: Cambridge
University Press.
Hardy, J. A. (2013). Disciplinary specificity in student writing: A proposal for corpus collection
(Unpublished manuscript). Atlanta, GA: Georgia State University.
Hardy, J. A. (2014). Undergraduate student writing across the disciplines: Multi-dimensional anal-
ysis studies (Unpublished doctoral dissertation). Atlanta, GA: Georgia State University.
Hardy, J. A., & Römer, U. (2013). Revealing disciplinary variation in student writing: A
multi-dimensional analysis of 16 disciplines from MICUSP. Corpora, 8(2), 183–208.
Hartig, A. J., & Lu, X. (2014). Plain English and legal writing: Comparing expert and
novice writers. English for Specific Purposes, 3, 87–96.
He, X. (2016). Text and spatial-temporal data visualization (Unpublished doctoral disserta-
tion). Atlanta, GA: Georgia State University.
Hinkel, E. (2002). Second language writers’ text: Linguistic and rhetorical features. Mahwah,
NJ: Lawrence Erlbaum Associates.
Hoey, M. (2005). Lexical priming: A new theory of words and language. London, UK:
Routledge.
Hopkins, C. (2011). The hip-hop word count counts on language to understand the
reality we keep. Retrieved from https://readwrite.com/2011/01/20/the_hip-hop_
word_count_counts_on_language_to_under
Hu, G., & Cao, F. (2011). Hedging and boosting in abstracts of applied linguistics
articles: A comparative study of English- and Chinese-medium journals. Journal of
Pragmatics, 43, 2795–2809.
Hunston, S., & Francis, G. (1998). Verbs observed: A corpus-driven pedagogic gram-
mar. Applied Linguistics, 19(1), 45–72.
Hyland, K. (2008a). Scaffolding during the writing process: The role of informal peer in-
teraction in writing workshops. In D. D. Belcher & A. Hirvela (Eds.), The oral-literate
connection: Perspectives on L2 speaking, writing, and other media interactions (pp. 168–190).
Ann Arbor, MI: University of Michigan Press.
Hyland, K. (2008b). Genre and academic writing in the disciplines. Language Teaching,
41(4), 543–562.
Hyland, K. (2012). Disciplinary identities: Individuality and community in academic discourse.
New York, NY: Cambridge University Press.
Jarvis, S., Grant, L., Bikowski, D., & Ferris, D. (2003). Exploring multiple profiles of
highly rated learner compositions. Journal of Second Language Writing, 12(4), 377–403.
Jefferson, G. (2004). Glossary of transcript symbols with an introduction. In G. H.
Lerner (Ed.), Conversation analysis: Studies from the first generation (pp. 13–31). Amster-
dam: John Benjamins.
Johansson, S., & Hofland, K. (1989). Frequency analysis of English vocabulary and grammar.
Oxford, UK: Oxford University Press.
Johns, A. M. (1995). Teaching classroom and authentic genres: Initiating students into
academic cultures and discourses. In D. Belcher & G. Braine (Eds.), Academic writing
in a second language: Essays on research and pedagogy (pp. 277–291). Norwood, NJ:
Ablex Publishing Corporation.
Johns, A. M. (1997). Text, role, and context. New York, NY: Cambridge University Press.
Johns, A. M. (2009). Tertiary undergraduate EAP: Problems and possibilities. In D.
Belcher (Ed.), English for specific purposes in theory and practice (pp. 41–59). Ann Arbor,
MI: The University of Michigan Press.
Johns, T. (1986). Micro-Concord: A language learner’s research tool. System, 14, 151–162.
Johns, T. (1994). From printout to handout: Grammar and vocabulary teaching in the
context of data-driven learning. In T. Odlin (Ed.), Perspectives on pedagogical grammar
(pp. 293–313). Cambridge, UK: Cambridge University Press.
Johns, T., & King, P. (1991). Classroom concordancing. ELR Journal, 4, 1–16.
Johnston, T., & Schembri, A. (2006). Issues on the creation of a digital archive of a
signed language. In L. Barwick & N. Thieburger (Eds.), Sustainable data from digital
fieldwork (pp. 7–16). Sydney, NSW: University of Sydney Press.
Juola, P., Ryan, M., & Mehok, M. (2011). Geographic localizing Tweets using stylo-
metric analysis. Paper presented at the American Association for Corpus Linguistics
2011, Georgia State University, Atlanta, GA.
Kachru, Y. (2008). Language variation and corpus linguistics. World Englishes, 27(1), 1–8.
Kafes, H. (2012). Cultural traces on the rhetorical organization of research article abstracts.
International Journal on New Trends in Education and their Implications, 3(2), 207–220.
Kaltenböck, G., & Mehlmauer-Larcher, B. (2005). Computer corpora and the language
classroom: On the potential and limitations of computer corpora in language teach-
ing. ReCALL, 17(1), 65–84.
Kaneko, T. (2007). Why so many article errors? Use of articles by Japanese learners of
English. Gakuen, 798, 1–16.
Kaneko, T. (2008). Use of English prepositions by Japanese university students. Gakuen,
810, 1–12.
Kennedy, C., & Miceli, T. (2010). Corpus-assisted creative writing: Introducing inter-
mediate Italian learners to a corpus as a reference resource. Language Learning and
Technology, 14(1), 28–44.
Keuleers, E., Brysbaert, M., & New, B. (2011). An evaluation of the Google Books
Ngrams for psycholinguistic research. In K. M. Würzner & E. Pohl (Eds.), Lexical
resources in psycholinguistic research (Vol. 3, pp. 23–27). Potsdam, Germany: Uni-
versitätsverlag Potsdam.
Klimt, B., & Yang, Y. (2004). Introducing the Enron Corpus. Paper presented at the
First Conference on Email and Anti-Spam (CEAS), Mountain View, CA.
Knight, D., Evans, D., Carter, R., & Adolphs, S. (2009). Headtalk, handtalk, and the
corpus: Towards a framework for multi-modal, multi-media corpus development.
Corpora, 4(1), 1–32.
Koester, A. (2010). Workplace discourse. London, UK: Continuum.
Kolb, T., Friginal, E., Lee, M., Tracy, N., & Grieve, J. (2007). Teaching writing within
forestry. In Proceedings from the University Education in Natural Resources Conference
2007, Oregon State University, Corvallis, OR. Retrieved September 22, 2015 from
www.uenr.forestry.oregonstate.edu
Krishnamurthy, R., & Kosem, I. (2007). Issues in creating a corpus for EAP pedagogy
and research. Journal of English for Academic Purposes, 6, 356–373.
Krutka, D., & Carpenter, J. (2016). Why social media must have a place in schools.
Kappa Delta Pi Record, 52(1), 6–10. doi:10.1080/00228958.2016.1123048
Nation, P., & Beglar, D. (2007). A vocabulary size test. The Language Teacher, 31(7),
9–13.
Nation, P., & Waring, R. (1997). Vocabulary size, text coverage, and word lists. In N.
Schmitt & M. McCarthy (Eds.), Vocabulary: Description, acquisition and pedagogy (pp.
6–19). Cambridge, UK: Cambridge University Press.
Nelson, G. (1996). The design of the corpus. In S. Greenbaum (Ed.), Comparing English
worldwide: The International Corpus of English (pp. 27–35). Oxford, UK: Clarendon Press.
Nesi, H. (2008). BAWE: An introduction to a new resource. In A. Frankenberg-
Garcia, T. Rkibi, M. Braga da Cruz, R. Carvalho, C. Direito, & D. Santos-Rosa
(Eds.), Proceedings of the eighth teaching and language corpora conference (pp. 239–246).
Lisbon, Portugal: ISLA.
Nesi, H. (2011). BAWE: An introduction to a new resource. In A. Frankenberg-Garcia,
L. Flowerdew, & G. Aston (Eds.), New trends in corpora and language learning (pp.
213–228). London, UK: Continuum.
Nesi, H., & Basturkmen, H. (2006). Lexical bundles and discourse signaling in aca-
demic lectures. International Journal of Corpus Linguistics, 11(3), 282–304.
Nesi, H., & Gardner, S. (2012). Genres across the disciplines: Student writing in higher educa-
tion. New York, NY: Cambridge University Press.
Nini, A. (2014). Multidimensional analysis tagger 1.2 – Manual. Retrieved from http://
sites.google.com/site/multidimensionaltagger
Nini, A., Corradini, C., Guo, D., & Grieve, J. (2016). The application of growth curve
modeling for the analysis of diachronic corpora. Language Dynamics and Change.
doi:10.1017/S1360674316000113
Nolen, M. (2017). Exploring corpora: Developing exploratory habits with data-driven
learning in a study abroad setting (Unpublished master’s paper). Department of Ap-
plied Linguistics and ESL, Georgia State University.
O’Donnell, M. B., & Römer, U. (2012). From student hard drive to web corpus (Part 2):
The annotation and online distribution of the Michigan Corpus of Upper-level Student
Papers (MICUSP). Corpora, 7(1), 1–18.
O’Donnell, M. B., Römer, U., & Ellis, N. (2013). The development of formulaic se-
quences in first and second language writing: Investigating effects of frequency,
association, and native norm. International Journal of Corpus Linguistics, 18(1), 83–108.
O’Keeffe, A., McCarthy, M., & Carter, R. (2007). From corpus to classroom. Cambridge,
UK: Cambridge University Press.
Oregon Department of Education. (2002). Instructional technology ideas and re-
sources for Oregon teachers. Retrieved January 6, 2010 from www.ode.state.or.us/
teachlearn/subjects/technology/instrtec.pdf
Pennebaker, J. W., Chung, C. K., Ireland, M., Gonzalez, A., & Booth, R. J. (2017).
The development and psychometric properties of LIWC. Austin, TX. Retrieved
October 6, 2017 from www.LIWC.net
Pérez-Paredes, P., Sánchez-Hernández, P., & Aguado-Jiménez, P. (2011). The use
of adverbial hedges in EAP students’ oral performance. In V. Bhatia, P. Sánchez-
Hernández, & P. Pérez-Paredes (Eds.), Researching specialized languages (pp. 95–114).
Amsterdam: John Benjamins.
Pica, T. (1994). Research on negotiation: What does it reveal about second-language
learning conditions, processes, and outcomes? Language Learning, 44(3), 493–527.
Pickering, L. (2006). Current research on intelligibility in English as a lingua franca.
Annual Review of Applied Linguistics, 26, 219–233.
Pickering, L., Friginal, E., & Staples, S. (Eds.). (2016). Talking at work: Corpus-based
explorations of workplace discourse. London, UK: Palgrave Macmillan.
Plakans, L., & Gebril, A. (2013). Using multiple texts in an integrated writing assessment:
Source text use as a predictor of score. Journal of Second Language Writing, 22(3), 217–230.
Polat, B. (2013). L2 experience interviews: What can they tell us about individual dif-
ferences? System, 41, 70–83.
Quirk, R., Greenbaum, S., Leech, G. N., & Svartvik, J. (1972). A grammar of contempo-
rary English. London, UK: Longman.
Rayson, P. (2003). WMatrix: A statistical method and software tool for linguistic analysis through cor-
pus comparison (Unpublished doctoral dissertation). Lancaster University, Lancaster, UK.
Rayson, P. (2008). From key words to key semantic domains. International Journal of
Corpus Linguistics, 13(4), 519–549.
Reinhardt, J. (2010). Directives in office hour consultations: A corpus-informed inves-
tigation of learner and expert usage. English for Specific Purposes, 29(2), 94–107.
Reiter, J. (2011). Lexical variation in inaugural addresses: A research proposal and pre-
liminary results (Unpublished manuscript). Department of Applied Linguistics and
ESL, Georgia State University, Atlanta, GA.
Reppen, R. (2010). Using corpora in the language classroom. Cambridge, UK: Cambridge
University Press.
Richards, J. C. (2008). Moving beyond the plateau. New York, NY: Cambridge Univer-
sity Press.
Roberson, A. (2015). The Second Language Peer Response (L2PR) corpus. Atlanta, GA:
Georgia State University.
Roberts, J., & Samford, W. (2013). Classroom applications of available corpus tools: Tips and
suggestions for analyzing and utilizing specialized corpora in the classroom. Poster presented
at the TESOL Convention, Dallas, TX, March 21, 2013.
Robinson, M., Stoller, F., Jones, J., & Costanza-Robinson, M. (2010). Write like a chem-
ist. Oxford, UK: Oxford University Press.
Römer, U. (2009). The inseparability of lexis and grammar: Corpus linguistic perspec-
tives. Annual Review of Cognitive Linguistics, 7, 140–162.
Römer, U. (2010). Establishing the phraseological profile of a text type: The construc-
tion of meaning in academic book reviews. English Text Construction, 3(1), 95–119.
Römer, U. (2011). Corpus research applications in second language teaching. Annual
Review of Applied Linguistics, 31, 205–225.
Römer, U., & O’Donnell, M. B. (2011). From student hard drive to web corpus (Part 1):
The design, compilation and genre classification of the Michigan Corpus of Upper-level
Student Papers (MICUSP). Corpora, 6(2), 159–177.
Römer, U., & Wulff, S. (2010). Applying corpus methods to written academic texts:
Explorations of MICUSP. Journal of Writing Research, 2(2), 99–127.
Rosenkrans, W. (2014). Survival factors. Aerosafety World. Retrieved from https://
flightsafety.org/asw-article/survival-factors-2/
Rundell, M. (2007). Macmillan English dictionary for advanced learners (2nd ed.). Oxford,
UK: Macmillan.
Rutherford, W. (1987). Second language grammar: Learning and teaching. London, UK:
Longman.
Sainani, K., Eliott, C., & Harwell, D. (2015). Active vs. passive voice in scientific
writing. Retrieved March 14, 2016 from www.acs.org/content/dam/acsorg/events/
professional-development/Slides/2015-04-09-active-passive.pdf