2013 STG-JN CreatingUsingCorpora ResMethLing PDF

13 Creating and using corpora

Stefan Th. Gries and John Newman

1 Introduction

Over the last few decades, corpus-linguistic methods have established themselves as among the most powerful and versatile tools to study language acquisition, processing, variation, and change. This development has been driven in particular by the following considerations:

a. technological progress (e.g., processor speeds as well as hard drive and RAM sizes);
b. methodological progress (e.g., the development of software tools, programming languages, and statistical methods);
c. a growing desire by many linguists for (more) objective, quantifiable, and replicable findings as an alternative to, or at least as an addition to, intuitive acceptability judgments (see Chapter 3);
d. theoretical developments such as the growing interest in cognitively and psycholinguistically motivated approaches to language in which frequency of (co-)occurrence plays an important role for language acquisition, processing, use, and change.

In this chapter, we will discuss a necessarily small selection of issues regarding (i) the creation, or compilation, of new corpora and (ii) the use of corpora once they have been compiled. Although this chapter encompasses both the creation and use of corpora, there is no expectation that any individual researcher would be engaged in both these kinds of activities. Different skills are called for when it comes to creating and using corpora, a point noted by Sinclair (2005: 1), who draws attention to the potential pitfalls of a corpus analyst building a corpus, specifically, the danger that the corpus will be constructed in a way that can only serve to confirm the analyst's pre-existing expectations. Some of the issues addressed in this chapter are also dealt with in Wynne (2005), McEnery, Xiao, and Tono (2006), and McEnery and Hardie (2012) in a fairly succinct way, and more thoroughly in Lüdeling and Kytö (2008a, 2008b) and Beal, Corrigan, and Moisl (2007a, 2007b).¹

¹ Details of corpora and software mentioned in this chapter are provided in Appendices 1 and 2 (URLs accessed June 26, 2013). These are rapidly developing domains and information provided here is naturally only current as at the time of writing. Updated lists are available on the companion website for this volume.

2 Creating corpora

2.1 The notion of a "corpus": a prototype and dimensions of variation

The notion of a corpus can best be defined as a category organized around a prototype. Most generally, a corpus can be described as "a body of naturally occurring language" (McEnery, Xiao, and Tono 2006: 4), thereby distinguishing a corpus from word lists, dictionaries, databases, and so on. These days, the prototypical corpus is a machine-readable collection of language used in authentic settings/contexts: one that is intended to be representative for a particular language, variety, or register (in the sense of reflecting all the possible parts of the intended language/variety/register), and that is intended to be balanced such that the sizes of the parts of the corpus correspond to the proportion these parts make up in the language/variety/register (see McEnery, Xiao, and Tono 2006: 5; Hunston 2008: 160-6; Gries 2009: Chapter 1). However, many corpora differ from an ideal design along these (and other) parameters; in fact, there is disagreement as to whether just any body of naturally occurring language can be called a corpus. Kilgarriff and Grefenstette (2006: 334), by way of introducing and advocating the study of data from the World Wide Web, adopt a definition of a corpus as "a collection of texts when considered as an object of language or literary study." On the other hand, Sinclair (2005: 15) explicitly excludes a number of categories from linguistic corpora (e.g., a single text, an archive, and, in particular, the World Wide Web). Beyond being a body of naturally occurring language, then, it is difficult to agree on any more particular definition of what a corpus is or is not. Note, too, that some collections of language can diverge from the prototypical property of being "naturally occurring language," and yet are still happily referred to as corpora by their creators. As an example, consider the TIMIT Acoustic-Phonetic Continuous Speech Corpus, made up of audio recordings of 630 speakers of eight major dialects of American English. For these recordings, each speaker read ten "phonetically rich" sentences - a uniquely valuable resource for the study of acoustic properties of American English, but not what one would consider naturally occurring language.

A detailed overview of corpora, illustrating the range of types of corpora that are being studied within linguistics, can be found in the chapters of Lüdeling and Kytö (2008a: 154-483). Apart from the above criteria defining prototypical corpora, one can distinguish corpus types by the media that hold the data: written text (web, text documents, historical manuscripts; see Chapter 11 for details on the use of diachronic corpora); audio; video and audio; audio and transcribed spoken texts based on the audio, and so on. There is often an assumption that a corpus will include written language or transcriptions of spoken language (which arguably represents the prototypical kind of language use), but it is important to appreciate that collections of naturally occurring speech in the form of audio files ("speech corpora," as opposed to transcriptions of spoken language) are valid corpora. Ostler (2008: 459) remarks on the artificiality of distinctions between speech-based and text-based corpora in light of the increasing use of multi-tiered annotations of audio and video data (see Chapter 12 for details on transcription and multi-tier annotation). One may also choose to distinguish corpus types by content source: synchronic vs historical, national corpora, learner corpora, academic discourse, children's language, interviews, static vs monitor corpora, multilingual, web-based, and so on. Corpora, as used in linguistics, are created with particular purposes of study in mind and the variety of corpus types should not be surprising - it is no more than a reflection of the richness and multi-facetedness of language use and the many perspectives one can bring to the study of language. One cannot therefore speak of a "standard" in corpus construction or design in the sense of a set of protocols that must be adhered to in order for the corpus to be admissible in corpus linguistics; the conception of "corpus" as a category around a prototype is more appropriate (see Gilquin and Gries 2009: Section 2). Further information on selected corpora can be found in Appendix 13.1.

There are now many large corpora of high quality available, where "large" means, say, 100 million words or more. We emphasize, though, that smaller corpora also have their place alongside the larger corpora. The key consideration is to have an appropriate match of research goal and corpus type/size, and, for some research goals, even quite a small corpus constructed by a researcher can yield insightful results. Berkenfield (2001), using a corpus of just 10,640 words, was able to carry out research on phonetic reduction of that in spoken English; Thompson and Hopper (2001) successfully explored transitivity in a corpus of multi-party conversations consisting of just 446 clauses; Fiorentino (2009) studied ordering of adverbial and main clauses in an Italian corpus consisting of 26,000 words for the written part of the corpus and 32,000 for the spoken part. Smaller corpora such as these can suffice when the focus of the study is a relatively frequent phenomenon, but would not be advisable if the focus is a relatively rare phenomenon. Granath (2007), reflecting on the different results obtained from searching for an English inversion structure like Thus ended his dreams, found reason to appreciate both the 1 million-word corpora and the 50 million-word corpora used in the study: "in the end, combining evidence from large and small corpora can give us information that neither type of corpus could provide on its own" (Granath 2007: 183).

2.2 Collecting the corpus data

In this and the following section, we describe the main steps involved in preparing and annotating a new corpus, before reviewing readily available corpora in Section 2.4. It is fair to say that most corpora are created with the expectation that they are, in some sense, representative of something larger than themselves - what we referred to as the prototypical corpus in Section 2.1 - rather than the ultra-pragmatic view of a corpus held by Kilgarriff and Grefenstette. Consequently, an initial and profound decision relates to exactly what the corpus
is supposed to be representative of and what sampling technique is to be used (see Chapter 5 for a more general discussion of sampling). One very basic kind of decision guiding the collection of language data concerns the categories that form the basis of the sampling: categories of language users (e.g., gender, age, socioeconomic class, geographical location), categories of the language products (e.g., spoken language, written language, register of language use, text type, formality of the language), or a combination of both of these. A noteworthy example of how categories of language users can figure prominently in corpus data is the sub-corpus of the Uppsala Learner English Corpus used in Johansson and Geisler (2011). For the purposes of their study of the syntax of Swedish learners of English, the authors carefully chose learners' essays to balance the numbers of boys and girls and the levels of the school year, as summarized in Table 13.1.

Table 13.1 A subset of the Uppsala Learner English Corpus. Adapted from Table 1 in Johansson and Geisler 2011: 140

                            Boys                             Girls
Level        School year    Mean essay length   Number       Mean essay length   Number
                            in words            of essays    in words            of essays
Junior high  Year 7         228.0               5            217.0               5
             Year 9         221.8               5            234.0               5
Senior high  Year 1         220.8               5            190.0               5
             Year 3         277.8               5            245.0               5
Total                       237.1               20           221.5               20

Typically, it is categories such as register (i.e., categories relating to properties of the product rather than the user) that are the preferred basis for structuring the more common corpora in use (see the examples of widely used corpora in Section 2.4). This is due in part to the unavailability of sociodemographic data on speakers and writers in the case of many texts (as retrieved, for example, from the World Wide Web), but it may also be due to the view that the variation between, say, spoken and written modalities is far more significant than variation between male and female speech or writing. The approach adopted in creating the Canadian component of the International Corpus of English (ICE-CAN) offers a practical way of proceeding: data are basically sampled on the basis of categories of register (broadly understood), such as spoken vs written, spoken dialogue vs spoken monologue, spoken dialogue private vs spoken dialogue public, written printed vs non-printed, but some attempt is made to balance the numbers of male and female speakers in the data collection. The metadata on speakers contributing to the spoken part of ICE-CAN and available as part of the distribution of the corpus, summarized below, is in fact extensive enough for a sociolinguistically oriented use of the corpus:

a. date of recording
b. place of recording
c. gender
d. age
e. mother tongue
f. other languages spoken
g. self-reported ethnicity
h. occupation
i. educational profile
j. professional training
k. overseas experience

The decision as to what the corpus should be representative of will always have a large impact on how the corpus data will be collected: recordings of natural conversation, recorded interviews, conversation from TV programs, fictional texts or journalese (from the web or processed by optical character recognition [OCR] software), blogs and chat-room data, general content crawled/collected from the web are but a few possible data sources, and careful decisions as to what can and must be included are required, and, realistically, will often have to be balanced with what is possible within the restrictions of particular research agendas and goals. Sometimes there can be hidden biases in making decisions about representativeness, skewing the data collected in unintended ways. A typical bias may favor a "standard" or better known variety of language over less prestigious (dialectal, colloquial) varieties, or favor the collection of data from more educated speakers. Newman and Columbus (2009), for example, found an (unintended) over-representation of vocabulary relating to the education domain in a number of the conversational corpora in the International Corpus of English project, most likely a consequence of the easy availability of speakers from the education sector as contributors of data. Of course, the researcher may quite consciously opt for data specifically restricted to a standard variety, educated speakers, or other factors, but it should not be thought that a corpus must be restricted in this way. In addition, there is a variety of further restrictions on the collection of data which often have to do with what speakers/writers allow to be done with their speech/texts. For example, for reasons of copyright or the traditions of speech communities, not everything that can be found on the Web can be added to a corpus that is intended for use by others.

These days, the World Wide Web offers a useful starting point for obtaining text which can be utilized for the construction of corpora. Collections of published materials (out of the range of current copyright) such as Project Gutenberg provide a wealth of literary texts in many languages that can be exploited for the creation of customized digital corpora. But, as already indicated above, there is an abundance of material available for downloading apart from literary texts: newspaper collections, Wikipedia entries, university lectures, film scripts, translations of the Bible, blogs, and so on. Oral history projects provide opportunities for the creation of spoken
corpora. Consider, as just one example, the Southern Oral History Program, which began in 1973 with the aim of documenting the life of the American South in tapes, videos, and transcripts. According to the website, this project will ultimately make 500 oral history interviews available over the internet (400 are already available), selected from the 4,000 or so oral history interviews carried out over 30 years. The interviews cover a variety of topics in recent North Carolina history, particularly civil rights, politics, and women's issues. As of writing, the index contains a list of 496 topics. Interviews can be read as text transcript, listened to (or downloaded) with a media player, or both simultaneously. Note, also, that applications such as HTTrack (for Windows/Unix) or Sitesucker (for Mac) can currently be used with many sites, enabling an automated mirroring of whole websites.

Our emphasis in this chapter is on creating and using corpora as written or transcribed texts, but some comments on collecting spoken data are in order (see Chapters 9 and 11 for many observations directly relevant here). One issue immediately confronting a researcher collecting data directly from a speaker is how to minimize observer effects. Inconspicuousness and versatility are two key goals in managing the collection of speech data (intended to reflect natural, non-self-conscious use of language), as discussed in Chapters 6 and 9. The CallHome American English Speech Corpus, for example, follows a procedure which is likely to reduce any observer effect. The corpus is based on recorded telephone conversations lasting up to 30 minutes, where the participants are fully aware that they are being recorded. The transcripts which derive from these recordings, however, are based only on 10 contiguous minutes from within those 30 minutes. While this strategy does not exclude some self-consciousness on the part of the speakers, it does serve to lessen any such effect, since the speakers cannot know in advance which 10 minutes are being utilized for the transcript. A second issue surrounding the collection of spoken data concerns the quality of the audio/video recording. Needless to say, one aims for the best quality possible (WAV rather than MP3 format for audio files, for example), though sometimes a lesser quality may suffice. The corpora in the International Corpus of English project, for example, are designed primarily for distribution as corpora in the form of text files where the spoken data have been transcribed into regular English orthography. In such cases, the quality of the recording must be good enough for reliable transcription, even if it falls short of what a researcher carrying out a fine acoustic analysis requires.

Finally, creating a speech corpus in which the acoustic characteristics are of importance leads naturally to additional kinds of metadata compared with those listed above. Below is a summary of the metadata available in the CallHome American English Speech Corpus.

Metadata for a conversation recording:

a. total number of speakers
b. number of females and males
d. number of speakers per channel and number of males/females per channel
e. difficulty (overall quality of the channel in terms of number of speakers, background noise, channel noise, speed, accent, articulation)
f. background noise (amount of sound not made by the speakers, e.g., baby crying, television, radio)
g. distortion (echo and other types of recording problems)
h. crosstalk (audibility of the channel A speaker on channel B, and vice versa)

Metadata for the caller:

i. gender
j. age
k. years of education
l. where the caller grew up
m. telephone number called

Once first versions of video/sound/text files have been obtained, typically one or more follow-up steps are necessary, which are discussed in the following section.

2.3 Preparing the corpus data

The first versions of files obtained in the first collection step hardly ever correspond to the desired final versions. Rather, such files typically require two additional steps before they can be used and made available as corpus files: they virtually always need to be cleaned up and standardized, and they often need to be marked up and annotated. In today's age of increased data sharing, it is important to standardize corpus files to facilitate later use by other researchers with different goals.

2.3.1 Cleaning up and standardizing

The first versions of files typically need to be cleaned of any undesired information they may contain. Files which include information that is protected for privacy reasons need to have such information edited in some way (see Chapters 2 and 12). For example, if one gathers recordings of authentic conversation, it is often necessary to protect the speakers' privacy as well as the privacy of those whom a speaker talks about in their absence. (Imagine a case where, during a recording, a speaker mentions that her neighbor cheated on last year's tax report or that her brother's visa has expired.) Data like these require careful consideration of how much one can and must anonymize the data. In ICE-CAN, for example, names other than those of public figures were anonymized through the use of pseudonyms.

Files obtained from the internet or other sources can be in one of any number of formats (.txt, .html, .xml, .pdf, .doc, etc.) and will almost invariably require some editing for them to be used most effectively. In using files from the Web as a convenient example, editing may include, but is not limited to, the tasks listed below:

a. converting all files into one and the same interoperable file format and language encoding (e.g., converting data into Unicode text files);
b. removing and/or standardizing unwanted elements (e.g., deleting unwanted HTML tags such as image references, title, body, table, and other tags, links, scripts, etc.);
c. standardizing different spellings and character representations (e.g., standardizing variant encodings of ü into one consistent representation, etc.);
d. identifying files downloaded more than once and deleting copies.

This kind of editing typically requires ready-made tools with particular features, or, better still, the use of a programming language. An example of a ready-made application at the time of writing is the free cross-platform Java-based text editor jEdit. While jEdit has many attractive features, it includes three key features relevant to formatting texts for corpus-based research: (i) it accepts a wide range of language encodings, including UTF-8 and UTF-16; (ii) it allows for search and replace over multiple files; (iii) it features search and replacement operations using regular expressions, which are a method to describe simple or very complicated sequences of characters in files (see Table 13.11). Software like jEdit and other text editors intended for programmers force the user to be more attuned to properties of files which become important in working with corpus tools, such as language encodings and (Unix- vs Windows- vs Mac-style) line breaks. Regular expressions increase the power of editing considerably, allowing options such as finding and deleting all annotation contained within angular brackets, adding an annotation at the beginning of each line, removing some variable number of lines of text at the beginning of a file, such as all text within <teiHeader>...</teiHeader>, features that are not necessarily available in typical word-processing software.

    <teiHeader>
      <fileDesc>
        <titleStmt>
          <title>Sample A01 from The Atlanta Constitution</title>
          <title type="sub"> November 4, 1961, p.1 "Atlanta Primary &"
            "Hartsfield Files"
            August 17, 1961, "Urged strongly &"
            "Sam Caldwell Joins"
            March 6, 1961, p.1 "Legislators Are Moving" by Reg Murphy
            "Legislator to fight" by Richard Ashworth
            "House Due Bid &"
            p.18 "Harry Miller Wins &"
          </title>
        </titleStmt>
        <editionStmt>
          <edition>A part of the XML version of the Brown Corpus</edition>
        </editionStmt>
        <extent>1,988 words 431 (21.7%) quotes 2 symbols</extent>
        <publicationStmt>
          <idno>A01</idno>
          <availability><p>Used by permission of The Atlanta Constitution
            State News
            Service (H), and Reg Murphy (E).</p></availability>
        </publicationStmt>
        <sourceDesc>
          <bibl> The Atlanta Constitution</bibl>
        </sourceDesc>
      </fileDesc>
      <encodingDesc>
        <p>Arbitrary Hyphen: multi-million [0520]</p>
      </encodingDesc>
      <revisionDesc>
        <change when="2008-04-27">Header auto-generated for TEI version</change>
      </revisionDesc>
    </teiHeader>

Figure 13.1. Markup in the TEI Header of file A01 in the XML Brown Corpus
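
The clean-up tasks (a) to (d) in Section 2.3.1 and the regular-expression operations just described can be sketched in a programming language such as Python. The following is only a minimal illustration under stated assumptions, not a description of how any particular corpus was prepared: the function names clean_text and drop_duplicates, the file names, and the two sample texts are all hypothetical, and the input is assumed to be UTF-8. It decodes the raw bytes, standardizes character representations and line breaks, deletes everything within <teiHeader>...</teiHeader>, deletes all remaining annotation within angular brackets, and discards files whose cleaned content has already been seen.

```python
import hashlib
import re
import unicodedata

# Hypothetical patterns for illustration; real files may need more care.
HEADER_RE = re.compile(r"<teiHeader>.*?</teiHeader>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes and return a standardized plain-text version."""
    text = raw.decode(encoding)
    # Standardize character representations: NFC turns "u" followed by a
    # combining diaeresis into the single precomposed character "ü".
    text = unicodedata.normalize("NFC", text)
    # Standardize Windows- and old Mac-style line breaks to Unix-style.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Delete the whole header first, then every remaining tag.
    text = HEADER_RE.sub("", text)
    text = TAG_RE.sub("", text)
    # Collapse runs of spaces/tabs, trim each line, drop empty lines.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


def drop_duplicates(texts: dict) -> dict:
    """Keep only the first file for each distinct cleaned content."""
    seen = set()
    unique = {}
    for name, text in texts.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[name] = text
    return unique


# Two made-up downloads of the same text in different representations.
raw_files = {
    "a.xml": ("<teiHeader><title>Sample</title></teiHeader>\r\n"
              "<p>Gru\u0308n ist <b>gut</b>.</p>").encode("utf-8"),
    "b.xml": "<p>Gr\u00fcn ist gut.</p>\n".encode("utf-8"),
}
cleaned = {name: clean_text(raw) for name, raw in raw_files.items()}
corpus = drop_duplicates(cleaned)
print(corpus)
```

Deleting tags with a single regular expression is deliberately crude: it leaves behind, for instance, the text content of script elements in web pages, so real web data would usually call for an HTML or XML parser. Comparing files only after cleaning means that two downloads differing solely in encoding details or markup are still recognized as copies.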

2.3.2 Marking up and annotating

Once one has files that are cleaned up and standardized as desired, a second preparatory step usually involves enriching these with desired information they do not yet contain. Such information serves to facilitate the retrieval of linguistic patterns and their co-occurrence with other (linguistic or extra-linguistic) data. Usually, one distinguishes markup and annotation.

In the case of written or transcribed data, the markup section of a file refers to metadata about the file and might include information such as when the data in the file were collected, a description of the source of the data, when the file was prepared, basic social information about participants if relevant, and other such details. Figure 13.1 shows an example of markup from the beginning of the Extensible Markup Language (XML) version of the Brown Corpus, distributed as part of Baby BNC v.2. The elements of markup conform to the specifications laid down by the Text Encoding Initiative (TEI), a consortium of interested parties, which are concerned with establishing standards for sharing documents. Angled brackets < and > demarcate the tags which enclose metadata; / indicates a closing tag. All the information in the TEI header, for example, is found between the opening tag <teiHeader> and the closing tag </teiHeader>; the header, in turn, consists of a file description within the <fileDesc> tags, a title statement within the <titleStmt> tags, an edition statement within the <editionStmt> tags, and so on, as seen in Figure 13.1. The TEI guidelines for markup of texts are intended to apply to all kinds of texts and are not designed specifically for the files of a linguistic corpus. An extension of the TEI guidelines specifically intended for corpus markup (and annotation) is the Corpus Encoding Standard (CES) and the more recent version of these standards designed for XML, namely the XML Corpus Encoding Standard (XCES).

The annotation part of a file refers to elements added to provide specifically linguistic information (e.g., part of speech, semantic information, and pragmatic information). Most commonly, annotation takes the form of part-of-speech tagging of words. The first sentence of the Brown Corpus is shown in a parts-of-speech annotated form in (1a). The tags used in this sentence are explained in (1b) - full details can be found in the Brown Corpus Manual (khnt.aksis.uib.no/icame/manuals/brown/INDEX.HTM). Other tagsets are the various versions of
Constituent Likelihood Automatic Word-tagging System (CLAWS) and the University of Pennsylvania (Penn) Treebank Tagset. Figure 13.2 shows the same annotated sentence in an XML format.

(1) a. The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
    b. at = article; np-tl = proper noun, also appearing in the title (of the newspaper article, in this case); nn-tl = singular common noun, also appearing in the title; jj-tl = adjective, also appearing in the title; vbd = past tense of verb; nr = adverbial noun; nn = singular common noun; in = preposition; np$ = possessive proper noun; jj = adjective; cs = subordinating conjunction; dti = singular or plural determiner/quantifier; nns = plural common noun; . = sentence closer; `` and '' = punctuation

    <p>
    <s n="1">
    <w type="AT">The</w>
    <w type="NP" subtype="TL">Fulton</w>
    <w type="NN" subtype="TL">County</w>
    <w type="JJ" subtype="TL">Grand</w>
    <w type="NN" subtype="TL">Jury</w>
    <w type="VBD">said</w>
    <w type="NR">Friday</w>
    <w type="AT">an</w>
    <w type="NN">investigation</w>
    <w type="IN">of</w>
    <w type="NP">Atlanta's</w>
    <w type="JJ">recent</w>
    <w type="NN">primary</w>
    <w type="NN">election</w>
    <w type="VBD">produced</w>
    <c type="pct">``</c>
    <w type="AT">no</w>
    <w type="NN">evidence</w>
    <c type="pct">''</c>
    <w type="CS">that</w>
    <w type="DTI">any</w>
    <w type="NNS">irregularities</w>
    <w type="VBD">took</w>
    <w type="NN">place</w>
    <c type="pct">.</c>
    </s>
    </p>

Figure 13.2. The first sentence (and paragraph) in the text body of file A01 in the XML Brown Corpus (the tags beginning with p, s, and w mark the paragraph, sentence, and each word respectively)

Sometimes, a tagging system allows for multiple tags to be associated with one and the same word. In general, the CLAWS tagger assigns to each word in a text one or more tags (regardless of the context in which it occurs) and then tries to identify the one best tag based on the frequency of word-tag combinations in the immediate context. However, sometimes the algorithm is unable to clearly identify one and only one tag and uses a hyphenated tag, such as VVG-NN1 instead (as when singing in the sentence She says she couldn't stop singing is tagged VVG-NN1). The hyphenated tag in this case, as used in the British National Corpus (BNC), indicates that the algorithm was unable to decide between the VVG (the -ing form of a verb) and NN1 (the singular of a common noun), but the preference is for the VVG tag.

Hyphenated tags are employed by Meurman-Solin (2007) as a way of indicating the range of different functions that can be expressed by the word in a diachronic corpus of English, creating, in effect, tags which embody grammaticalization facts. Certainly, there should be no expectation that part-of-speech tagging algorithms will produce identical results. Consider the tags assigned to rid in the three sentences in Table 13.2, based on four automatic tagging programs, where it can be seen that there is no uniform assignment of the part of speech of rid in any of the three sentences given. Here we see indications of a re-grammaticalization of a past participle as an adjective, just one example of how any part-of-speech system needs to be critically assessed.

Table 13.2 Four tagging solutions for English rid

                         I am now completely    You are well       I got rid of
                         rid of such things.    rid of him.        the rubbish.
CLAWS tagger             past participle        past participle    past participle
Infogistics              verb base              verb base          past participle
FreeLing                 adjective              verb base          past participle
GoTagger (Brill-based)   adjective              adjective          adjective

Another way in which multiple tags can refer to one word involves multi-word units. For instance, the complex preposition in terms of is tagged in the BNC XML, as shown in Figure 13.3 (for expository reasons, we have added line breaks to highlight the annotation's structure).

    <mw c5="PRP">
    <w c5="PRP" hw="in" pos="PREP">in </w>
    <w c5="NN2" hw="term" pos="SUBST">terms </w>
    <w c5="PRF" hw="of" pos="PREP">of </w>
    </mw>

Figure 13.3. The annotation of in terms of as a multi-word unit in the BNC XML

Transcription of spoken language presents considerable challenges, at least if one wishes to highlight faithfully features of spoken language (see Newman 2008; see also Chapter 12). The annotated transcription in (2), a sample of transcribed spoken language taken from ICE-CAN, illustrates some of this complexity. Overlapping strings are indicated by <[>...</[>, with the complete set of overlapping strings contained within <{>...</{>, stretching across both speaker A and speaker B. The tags <}>...</}> indicate a "normative replacement," where a repetition of they (in casual, face-to-face conversation) is indicated. This
annotation allows for searching on the raw data (containing the original two instances of they) or on the normalized version (containing one instance of they within <=>...</=>). The example in (2) illustrates only a tiny fraction of the challenges presented by spoken language. The Great Britain component of the International Corpus of English (ICE-GB) contains syntactic parses for all the data, which make the annotation even more complex.

(2) <$A> <ICE-CAN:S1A-001#34:1:A> I think some of the trippers actually do a bit of the portaging by themselves <}> <-> they </-> <=> they </=> </}> bring it to the other end and they come back to help the kids with <{> <[> their packs </[>
    <$B> <ICE-CAN:S1A-001#35:1:B> <[> I see </[> </{>

The advent of extremely large multimodal corpora such as the corpus created through the Human Speechome Project (90,000 hours of video and 140,000 hours of audio recordings) takes the problems of dealing with audio and video to another level altogether, requiring the development of new kinds of tools to manage the extraordinary amount of data involved (Roy 2009).

Just as with cleaning up and standardizing data, the processes of marking up and annotating typically require more sophisticated tools than mere word-processing tools. For some tasks (e.g., straightforward replacement operations), general-purpose applications such as sophisticated text editors may be sufficient. For some more specialized tasks, ready-made applications with a graphical user interface are available. For example, language-encoding converters (Encoding Master for Windows/Mac, iconv for Unix/Linux, at the time of writing) and annotation software such as ELAN, Transcriber, and SoundScriber (Windows) are available (see Chapter 12 on transcription). Some larger and more automatic processes such as part-of-speech tagging, however, would normally be carried out by running scripts in a programming environment, though some graphical user interface (GUI) applications are also available (e.g., GoTagger for English and the Windows interface to TreeTagger for English and other languages).

To exemplify at least one application here, TreeTagger is a suite of scripts (currently available for Linux, Windows, and Mac) that would suit the needs of most researchers wanting to tag a corpus for part of speech. Some basic knowledge of programming environments is required to run these scripts, though running them is not a daunting task. To illustrate what is involved, (3) shows the one-line command needed to tag an English sentence, with the output directed to the screen as three columns (each word in the input, a tag, and a lemmatized form of the word). The tags are based on the Penn Treebank tagset. In this example, DT = determiner, VBP = non-[3rd person singular present] of a verb, NNS = plural common noun, WDT = Wh-determiner, NN = singular common noun, SENT = sentence closer. It is equally straightforward to tag a whole file or a directory of files. The tagging requires language-specific parameter files, which are available for a dozen or so languages (including English, German, Italian, Dutch, Spanish, Bulgarian, Russian, French, Mandarin). TreeTagger includes a training module which allows one to create a new parameter file for any language, trained on a lexicon and a training corpus. A "chunker" script outputs the tagged words with some grouping into syntactic constituents. When run on the sentence in (3), for example, the chunker script would insert noun cluster (NC) tags around some words and a sentence, and verb cluster (VC) tags around the one-word verb clusters are and make. As reported by Schmid (1994), using TreeTagger to tag for parts of speech in an English corpus achieved over 95 percent accuracy.

(3) $ echo 'These are some words which make a sentence.' | cmd/tree-tagger-english
    reading parameters ...
    tagging ...
    finished.
    These       DT      these
    are         VBP     be
    some        DT      some
    words       NNS     word
    which       WDT     which
    make        VBP     make
    a           DT      a
    sentence    NN      sentence
    .           SENT    .

2.4 Several widely used corpora

Before turning to how corpora are used, we briefly present here a few widely used corpora with an eye to showcasing different kinds of data and annotation (see Appendix 13.1 for more information on access to these corpora). Readers should be aware that the Linguistic Data Consortium (LDC, www.ldc.upenn.edu) makes available many high-quality corpora, some free to non-members and others available through an annual subscription. It is also worth mentioning the Child Language Data Exchange System (CHILDES) database and associated tools, the child language component of the TalkBank project. Between them, CHILDES and TalkBank offer a great variety of freely available adult and child language corpora in various media, with an option of playing streaming audio and video through the internet. TalkBank, for example, includes corpora designed for the study of aphasia, dementia, second language acquisition, conversation analysis, and sociolinguistics. The CHILDES system of transcription and coding has in turn given rise to the Language Interaction Data Exchange System (LIDES), which aims to standardize transcription and coding for spoken multilingual data (LIPPS 2000; Gardner-Chloros, Moyer, and Sebba 2007).

The Brown Corpus (Kucera and Francis 1967) holds a unique place in the history of corpus linguistics. It represents the first systematic and, at the time, large-scale attempt to sample written American English containing material which
first appeared in print in the year 1961. The corpus, described by the authors as a "Standard Corpus of Present-Day American English," has become known as the Brown Corpus since it was created at Brown University. The corpus contains approximately 1 million words in 500 samples of 2,000+ words each, divided into fifteen sub-categories, shown in Table 13.3. There is quite a spread of writing styles represented in the corpus, with written language being the clear guiding principle in the collection of data. Drama writing, for example, was excluded on the basis of belonging more to the realm of spoken discourse. Fiction writing was included, as long as there was no more than 50 percent dialogue. The design of the Brown Corpus has been adopted in the creation of a number of other 1-million-word English corpora: the Lancaster-Oslo-Bergen Corpus (LOB), the Freiburg Brown Corpus (FROWN), the Freiburg LOB Corpus (FLOB), among others. The corpora mentioned here enable corpus-based comparative studies of American and British written English in 1961 (Brown, LOB), American English in 1961 and 1991 (Brown, FROWN), and British English in 1961 and 1991 (LOB, FLOB).

Table 13.3 Sub-corpora of the Brown written corpus

Genre                 Words        % of total
News                  88,000       8.8
Editorials            54,000       5.4
Reviews               34,000       3.4
Religion              34,000       3.4
Skills and Hobbies    72,000       7.2
Lore                  96,000       9.6
Belles lettres        150,000      15
Miscellaneous         60,000       6
Learned               160,000      16
General fiction       58,000       5.8
Mystery               48,000       4.8
Science fiction       12,000       1.2
Adventure             58,000       5.8
Romance               58,000       5.8
Humor                 18,000       1.8
Total                 1,000,000

The International Corpus of English (ICE) has been mentioned already: it is a global project whereby English language materials from many national varieties of English are being collected and marked up according to common guidelines. The primary aim of ICE is to collect material for comparative studies of English worldwide, based on the adoption of a common corpus size (approximately 1 million words) and design. As of April 2012, there were twenty-four varieties of English represented in the project, according to the ICE website. These varieties include better-known ones such as Great Britain and the US, as well as lesser-known ones such as Malta, Philippines, and Sri Lanka. A full description of the project, as originally conceived, is given in Greenbaum (1996) and Greenbaum and Nelson (1996). A breakdown of the sub-parts of an ICE corpus can be seen in Table 13.4.

Table 13.4 Sub-corpora of the ICE corpora

Mode             Genre                    Words        % of total
Spoken (60%)     Private                  200,000      20
                 Public                   160,000      16
                 Unscripted               140,000      14
                 Scripted                 100,000      10
Written (40%)    Student Writing          40,000       4
                 Letters                  60,000       6
                 Academic Writing         80,000       8
                 Popular Writing          80,000       8
                 Reportage                40,000       4
                 Instructional Writing    40,000       4
                 Persuasive Writing       20,000       2
                 Creative Writing         40,000       4
Total                                     1,000,000

The Michigan Corpus of Academic Spoken English (MICASE) is a corpus of spoken academic English as recorded at the University of Michigan (Simpson et al. 2002) between 1997 and 2002. It consists of transcriptions of almost 200 hours of recordings, amounting to about 1.8 million words (according to the MICASE website). Individual speech events range in length from 19 to 178 minutes, with word counts ranging from 2,805 words to 30,328 words. Table 13.5 provides word counts for an untagged version of MICASE in which hyphenated parts of a word and parts of a word separated by apostrophes count as one word.

The BNC contains a collection of written and transcribed spoken samples of British English reflecting a wide range of language use and totaling about 100 million words. The corpus has been published in various editions, the two most widely used (containing the same samples) being the BNC World Edition (2001), marked up in the Standard Generalized Markup Language (SGML), and the BNC XML Edition (2007). Most of the language samples date from the years 1985-93, but some written language samples were taken from the years 1960-84. For the "context-governed" part of the spoken component, data were collected based on particular domains of language usage; for the "spoken demographic" part, conversations were collected by 124 volunteers recruited by the British Market Research Bureau, with equal numbers of men and women, approximately equal numbers from each age group, and equal numbers from each social grouping. Table 13.6 provides a breakdown of the sub-parts of the BNC, with size in terms of
"w-units," where a "w-unit" is similar to an orthographic word of English, but may also include some multi-word units (i.e., sequences of orthographic words, such as a priori, of course, all of a sudden).

Table 13.5 Sub-corpora of the MICASE spoken corpus

Genre                    Words        % of total
Small Lectures           333,338      19.7
Large Lectures           251,632      14.8
Discussion Sections      74,904       4.4
Lab Sections             73,815       4.4
Seminars                 138,626      8.2
Student Presentations    143,369      8.5
Advising Sessions        35,275       2.1
Dissertation Defenses    56,837       3.4
Interviews               13,015       0.8
Meetings                 70,038       4.1
Office Hours             171,188      10.1
Service Encounters       24,691       1.5
Study Groups             129,725      7.7
Tours                    21,768       1.3
Colloquia                157,333      9.3
Total                    1,695,554

Table 13.6 Sub-corpora of the BNC

Mode                                  Genre                                     "w-units"     % of total "w-units"
Written (87.9%)                       Imaginative                               16,496,420    16.8
                                      Informative: natural and pure science     3,821,902     3.9
                                      Informative: applied science              7,174,152     7.3
                                      Informative: social science               14,025,537    14.3
                                      Informative: world affairs                17,244,534    17.5
                                      Informative: commerce and finance         7,341,163     7.5
                                      Informative: arts                         6,574,857     6.7
                                      Informative: belief and thought           3,037,533     3.1
                                      Informative: leisure                      12,237,834    12.4
Spoken: context-governed (6.1%)       Educational/Informative                   1,646,380     1.7
                                      Business                                  1,282,416     1.3
                                      Public/Institutional                      1,672,658     1.7
                                      Leisure                                   1,574,442     1.6
Spoken: spoken demographic (4.2%)     Respondent Age 0-14                       267,005       0.3
                                      Respondent Age 15-24                      665,358       0.7
                                      Respondent Age 25-34                      853,832       0.9
                                      Respondent Age 35-44                      845,153       0.9
                                      Respondent Age 45-59                      963,483       1.0
                                      Respondent Age 60+                        639,124       0.6
Total                                                                           98,363,783

The Corpus of Contemporary American English (COCA) is a corpus of contemporary American English sampled from the years 1990 onward (see Davies 2008-; Davies 2011), which is only available via a Web interface. The corpus is being added to each year (i.e., it is a "monitor corpus"). At the time of writing it contains more than 437 million words of text, divided roughly equally among spoken, fiction, popular magazines, newspapers, and academic texts, as shown in Table 13.7. The spoken samples are taken from transcripts of unscripted conversation from more than 150 different TV and radio programs. The Corpus of Historical American English (COHA) is an equally impressive historical corpus of American English sampled from the period 1810-2009, consisting of more than 400 million words, with the same kind of interface as COCA.

Table 13.7 Sub-corpora of the written component of COCA, as of April 2011

Genre            Sub-genre     Words          % of total
Spoken (20%)     Spoken        90,065,764     20.6
Written (80%)    Fiction       84,965,507     19.4
                 Magazine      90,292,046     20.6
                 Newspaper     86,670,481     19.8
                 Academic      85,791,918     19.6
Total                          437,785,716    100

The Buckeye Corpus of Conversational Speech was created primarily to support the study of phonological variation in American English speech (Pitt et al. 2005, 2007). The corpus consists of forty "talkers" from Columbus, Ohio, who were each interviewed at Ohio State University in 1999-2000. The interviewees were told prior to the interview that the purpose of the interview was "to learn how people express 'everyday' opinions in conversation, and that the actual topic was not important" (Pitt et al. 2005: 91). Debriefing on the true purpose of the interview and obtaining further consent of the interviewee were carried out after the interview had taken place. The target length of each interview was 60 minutes. The corpus includes high-fidelity WAV files, consists of a total of 305,652 words, and comes with phonemic labeling and orthographic transcription.

The six corpora singled out for discussion here give some sense of the kind of material that linguists work with as corpora. Clearly, there is considerable variation along many parameters as one compares these corpora: specialized (English as spoken in an academic context, informal interview speech, historical data) vs general (spoken and written language in a variety of contexts); written language vs speech; relative balance in the size of the main sub-parts of a corpus, as in COCA, vs skewing in the size of the main sub-parts, as in the BNC; single medium such as electronic texts vs multimedia. This variability in design also points to a need for
caution when making direct comparison across the corpora or when a researcher relies solely upon a particular corpus with its own idiosyncratic design to establish "baseline" frequencies of occurrence of words or patterns.

Obviously, many more corpora than those mentioned above are available. For instance, Xiao (2008) refers, by our count, to more than 130. Even the category of "national" corpora alone (i.e., corpora designed to be representative of a range of usage of a national language by native-speakers) includes more than twenty (three just for Polish), and that number has likely increased in the years since Xiao's overview was published. One particularly important desideratum for the future of corpus linguistics and the neighboring field of natural language processing is to recognize resources in languages other than English and to appreciate the need to develop tools and software applicable to all the languages of the world.

3 Using corpora

The previous section discussed a variety of topics concerned with how to create corpora. In this section, we will turn to how to study corpora. In Section 3.1, we will briefly introduce the three main corpus-linguistic methods, and in Section 3.2, we will discuss the kinds of applications and tools that corpus linguists use in their research.

3.1 Analytical tools of corpus linguistics

Corpus linguistics is inherently a distributional discipline because, essentially, corpora only offer answers to the following questions regarding the distributions of linguistic items:

a. How often and where does something occur in a corpus?
b. How often do linguistic expressions occur in close proximity to other linguistic expressions?
c. How are linguistic elements used in their actual contexts?

The following three sections will discuss each of these methods in turn.

3.1.1 Frequency lists and dispersion

Frequency lists are the most basic corpus-linguistic tool. They usually indicate how frequent each word or each n-gram (a chain of n adjacent words) is in a (part of a) corpus. Examples are shown in the three panels of Table 13.8. Crucially, this method assumes a working definition of what a word is, which is less straightforward than one may think and less straightforward than many corpus programs' default settings reveal: how many words are John's book and John's at home, or isn't it?

There are a variety of ways in which frequency lists are used and/or modified. First, one has to decide whether one needs the frequency lists of word forms or lemmas: should run, runs, running, and ran all be grouped together under the lemma RUN or not? Second, in order to be able to compare frequencies of words from corpora of different sizes, frequencies are often normalized as a ratio of occurrences per million words. Third, comparisons of frequency lists can give rise to interesting data, as when a frequency list of a (usually smaller) specialized corpus is compared to one of a (usually larger) general reference corpus. For example, one can compute for each word w the percentage p1 that it makes up of a corpus c1 and divide it by the percentage p2 that it makes up of a different corpus c2; when the resulting relative frequency ratios are ordered by size, the top and the bottom of the list reveal the words most strongly associated with c1 and c2, respectively.

It is important to realize how such lists decontextualize each use: one only sees how often, say, and, gracefully, and in the appear, but not where in the file or in which context(s). One way to obtain some information about where in a (part of a) corpus a word occurs is by exploring the dispersion of a word. In the left panel of Figure 13.4, the x-axis represents the distribution of the word perl in the Wikipedia entry for "Perl," and each occurrence of the word perl is indicated by a vertical line. It is very obvious that the highest density of occurrence is at the end of the file (where the reference section is located). In the right panel, the corpus has been divided into ten equally sized parts, and a barplot represents the frequencies of perl in the ten bins. Again, perl is particularly clustered in the final 10 percent of the file. Also, the dispersion of a word in a corpus can be quantified, and the right panel provides two such measures of dispersion, Juilland's D and chi-square. Such measures are particularly useful because two words may have (about) the same frequency of occurrence, but one of them may be evenly spread out through the corpus (reflecting its status as a common word), while the other may be much more unevenly distributed (reflecting its status as a more specialized

Table 13.8 Frequency lists: words sorted according to frequency (left panel); reversed words sorted alphabetically (center panel); 2-grams sorted according to frequency (right panel)

Words    Frequency    Words               Frequency      2-grams     Frequency
the      62,580       yllufdaerd          80             of the      4,892
of       35,958       yllufecaep          1              in the      3,006
and      27,789       yllufecarg          5              to the      1,751
to       25,600       yllufecruoser       8              on the      1,228
a        21,843       yllufeelg           [illegible]    and the     1,114
in       19,446       yllufeow            [illegible]    for the     906
that     10,296       ylluf[illegible]    2              at the      832
is       9,938        yllufepoh           8              to be       799
was      9,740        ylluferac           87             with the    783
for      8,799        yllufesoprup        1              from the    720
[Figure 13.4: two panels, each titled "The distribution of >perl< in the file". Left panel: one vertical line per occurrence (x-axis: The file). Right panel: barplot of frequencies per section (x-axis: Sections of the file), annotated with Juilland's D = 0.8144 and Chi-square = 77.2249.]

Figure 13.4. Two ways of representing the dispersion of a word (perl) in a file

word that is just very frequent in particular registers or topics). An example would be the words having and government, which occur roughly equally frequently in the BNC Baby, but the former is much more evenly spread out throughout the corpus. Similarly, words may be very unequal in frequency but still equally dispersed; for instance, any and the have very different frequencies in the BNC Baby corpus (4,563 and 201,940 respectively), but dispersion measures reflect that both of them are function words; see Gries 2008 for more discussion.

3.1.2 Collocations

Just like dispersion plots, the second most basic corpus-linguistic tool focuses on a particular linguistic element w (typically a word) and provides some information on where w is used. However, unlike dispersion plots, the information about where w occurs does not use the location in the file/corpus as a reference, but lists which words are most frequently found around w. The standard format in which collocations are displayed is exemplified in Table 13.9. Such tables are read vertically, not horizontally, such that the frequencies listed reveal how often a word occurs in a particular position around the node word, here general or generally. You can immediately see how the node word is used and which larger expressions it enters into: meaningful collocations such as General Motors (found thirty-one times), Attorney General (twenty-three), Secretary General (sixteen), General Assembly (fifteen), and others immediately stand out.

In a small table like Table 13.9, these few interesting collocations can be identified immediately, but it is also obvious that many collocations involve function words (the, and, in, to, a, ...) that are so widely dispersed that they will show up in every word's vicinity. Corpus linguists have therefore developed a variety of so-called association measures, most of which essentially quantify how much more frequent a collocate is around a word of interest w than one would expect given w's and that collocate's overall frequency in a corpus. In such an approach, collocates are then ranked by their association strength rather than their overall frequency; widely used measures are Mutual Information (MI), t, the log-likelihood ratio, and the Fisher-Yates exact test. Space does not permit us to discuss this in more detail, but see Wiechmann (2008) for a comprehensive discussion.

Table 13.9 Excerpt of a collocate display of general/generally

Left 2         Freq L2    Left 1       Freq L1    Node    Right 1     Freq R1    Right 2    Freq R2
[illegible]    53         the          121                motors      31         of         52
[illegible]    28         in           54                 and         15         the        30
[illegible]    20         a            40                 assembly    15         and        25
and            20         of           31                 the         14         in         12
[illegible]    15         attorney     23                 of          12         to         12
[illegible]    13         and          19                 public      12         that       11
at             12         secretary    16                 business    10         as         11
by             9          is           12                 s           10         with       8
is             8          more         10                 ized        9          for        8
be             7          was          10                 izations    7          a          8

3.1.3 Concordances

Probably the most common corpus-linguistic tool currently used is the KWIC (key word in context) concordance, that is, a display of the word of interest in its immediate context. Consider Table 13.10 for part of a KWIC concordance of alphabetic and alphabetical.

This is the most comprehensive display, showing exactly how the two adjectives are used, but the large amount of information comes at the cost that this display usually needs a human analyst for interpretation, whereas frequencies and collocate displays can often be processed further automatically. This type of table would normally be saved into a tab-delimited text file, which can then be opened with spreadsheet software (e.g., LibreOffice Calc) so that every match (i.e., every row) can be annotated for linguistic variables of interest. The resulting file would exhibit the case-by-variable format discussed in Chapter 15 and could then be loaded into statistics software and analyzed as discussed there.

With increasingly complex use of concordancing, it quickly becomes necessary to learn about regular expressions, mentioned earlier. This is because while one can search for the two forms alphabetic and alphabetical separately, the manual spelling-out of search patterns becomes cumbersome if many thousands of verb lemmas are being retrieved. Even worse, there are many applications where the desired result cannot even be spelled out a priori: if you want to find all words ending in -ic or -ical, then you cannot always predict which forms might exist in a given corpus; the same holds if you want to find all verbs ending in -ing in a given corpus. Regular expressions, a technique for describing (sets of) character
sequences, can handle such cases. Table 13.11 lists a few simple examples that showcase the potential of regular expressions (examples are based on SGML/XML annotation of the BNC).

Table 13.10 Excerpt of a concordance display of alphabetic and alphabetical

File    Line    Preceding context                                   Match           Subsequent context
A6S     687     and the invention of                                alphabetic      writing.
BN9     81      and seven first-class counties taken in             alphabetical    order of rotation.
H99     1583    seeks to negotiate the problems of the              alphabetical    subject approach as outlined in
EES     788     a word is a contiguous sequence of                  alphabetic      characters.
B2M     196     provided the basis for an                           alphabetical    sort within each functional category.
CHA     3656    and then put them into                              alphabetical    order.
EA3     516     to isolate the cultural consequences of             alphabetic      literacy.
F7G     656     But you would put it in                             alphabetical    order
CLH     1422    most languages with writing systems                 alphabetic      fingerspelling has been available for over
KCY     2439    again I can put the type in                         alphabetical    ascending order

Table 13.11 Examples of regular expressions

Regular expression                            "Translation"
colou?r                                       finds both color and colour because the u is made optional by the ?
smokin[g']                                    finds both smoking and smokin' because either g or ' is allowed after the n
\bg[eo]t(s|t(ing|en))?\b                      finds at least get, gets, getting, got, and gotten as individual words
[-\w]+ly\b                                    sequences of word characters and hyphens ending in ly
<w (dtq|pnq).*?<w prp[^<]*?<c pun>\?          wh-words followed by other words until a preposition before a question mark (to find cases of preposition stranding, such as What were you talking about?)

3.2 Tools for analysis in corpus linguistics

We have come to expect a range of basic features relevant to a corpus-based analysis, as listed below, and consequently there is an expectation that software tools will incorporate some selection of these.

a. open multiple files
b. accept a variety of language encodings, especially Unicode
c. calculate frequency of words, parts of words, sequences of words, and so on
d. calculate frequency of parts of speech in a part-of-speech tagged corpus
e. calculate frequency of patterns allowing for wild card searches
f. return concordance lines for a search pattern (word, phrase, part of speech)
g. return concordance lines with variable length of lines
h. return collocates of a search pattern (word, phrase)
i. calculate measures of strength of association between words
j. return a list of n-grams
k. save and export results

Four different kinds of approaches are available to corpus linguists, only the fourth of which covers all the functionality mentioned in the list above and more.

The most restricted of these approaches arises when a corpus is only available via a Web interface, as is currently the case with BNCWeb, MICASE, COCA, and many others. Here, the user is completely dependent on the functionality made available in the interface and the correctness of what is made available. While the search facilities of many online corpora are far-reaching, studies that require extensive frequency information or large amounts of context usually cannot be undertaken with such corpora.

Second, a situation often more useful to the analyst arises when a corpus can be installed on one's own hard drive and comes with specific software to explore that corpus. For example, the ICE-GB comes with a tool designed specifically for it (ICECUP III; see Nelson, Wallis, and Aarts 2002), which allows inspection of many features of the corpus. As another example, the BNC XML edition currently comes with Xaira searching software (Xiao 2006). In such cases, the advantages are that the linguist has the whole corpus available for more individual queries and that the corpus software is tailored to the precise format of the corpus. However, this type of corpus software is sometimes not as user-friendly as it could be, users are still restricted to the functionality of the program, and the ability to work with one corpus program does not transfer to other corpora.

Third, and perhaps most widely used, the corpus linguist has the corpus on his/her hard drive and uses a ready-made general corpus program for retrieval and other operations. Apart from some commercial applications that are restricted to the Microsoft Windows operating system (e.g., Wordsmith), several free alternatives are available, the most useful of which is perhaps AntConc, because it is the only tool we are aware of that runs on the three major operating systems, is good at handling different encodings, and possesses powerful regular expressions that, unlike nearly all other currently available tools (including the commercial ones), allow it to handle many kinds of annotations flexibly. AntConc has a built-in
Keywords feature which identifies words overused in one corpus by reference
to another corpus. While corpus tools like AntConc allow parallel analysis of
disparate corpora, users are still dependent on the functionality that is included in
the programs. This also means, for example, that hardly any of the widely used
ready-made programs can read CHAT files well, an annotation format widely used
in language acquisition research and the CHILDES database mentioned above.

The fourth and final scenario, one that is becoming increasingly common,
involves researchers having corpora on their hard drive and using general purpose
programming languages to process, manipulate, and search files. We devote the
next section to this topic.

3.3 Programming tools for corpus linguistics

The huge advantage of programming languages is that they are
immensely more versatile and powerful than any ready-made software. This allows
researchers to pursue research more efficiently, creatively, and within one
environment (as opposed to having to learn and use different applications for, say,
web-crawling, cleaning up files, standardizing them, retrieving concordances, annotating
them, analyzing them statistically, and plotting some graphs). There is a well-known
downside to using programming languages and that is the learning curve for the
novice user. However, the potential benefits to be gained from persevering and
achieving a basic and comfortable literacy in a programming language far outweigh,
in our opinion, any learning pains. And there are two additional considerations to
bear in mind when thinking about the pros and cons of investing time in learning
programming languages: (1) there is a vast number of ways in which programming
knowledge can be put to good use in dealing with digital information quite apart
from corpus linguistics; (2) once you have learned one programming language, like
R, then you generally have some advantage when it comes to learning another one.

Typically, programming languages can be installed on any modern desktop
computer or laptop; they may have to be installed as stand-alone applications or
they may be already included as part of the computer's installed software (e.g.,
Perl and Python are bundled with the Mac OS). Examples of well-known
programming languages are Perl, C#, Java, PHP, Python, and Ruby. While Perl was
probably the most widely used programming language for many years, an increasing
number of researchers are now using Python and R, which therefore deserve
brief exemplification here. Both Python and R are freely downloadable and
available as cross-platform installations (Linux/Unix, Mac OS, Windows). A
researcher can choose one or more GUIs for each of these languages to create a
more friendly or helpful interface (e.g., color coding in the script, help or
documentation available through pull-down menus).

 1. >>> import nltk
 2. >>> from nltk.corpus import PlaintextCorpusReader
 3. >>> corpus_root = '/Users/Myname/Desktop/MyFiles/'
 4. >>> MyFiles = PlaintextCorpusReader(corpus_root, '.*.txt')
 5. >>> MyFiles.fileids()
    ['Emma.txt', 'Pride_and_Prejudice.txt']
 6. >>> words = MyFiles.words()
 7. >>> words[:10]
    ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Emma', ',', 'by', 'Jane', 'Austen']
 8. >>> sents = MyFiles.sents()
 9. >>> sents[:3]
    [['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Emma', ',', 'by', 'Jane', 'Austen'],
    ['This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no',
    'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.'], ['You',
    'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're', '-', 'use', 'it', 'under',
    'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with',
    'this', 'eBook', 'or', 'online', 'at', 'www', '.', 'gutenberg', '.', 'org']]
10. >>> paras = MyFiles.paras()
11. >>> paras[:3]
    (results omitted here)
12. >>> words1 = nltk.Text(words)
13. >>> words1.concordance("friend", lines=10)
    Building index...
    Displaying 10 of 289 matches:
    family , less as a governess than a friend , very fond of both daughters , but
    , they had been living together as friend and friend very mutually attached ,
    been living together as friend and friend very mutually attached , and Emma d
    n the wedding - day of this beloved friend that Emma first sat in mournful tho
    every promise of happiness for her friend . Mr . Weston was a man of unexcept
    derer recollection . She had been a friend and companion such as few possessed
    the change ?-- It was true that her friend was going only half a mile from the
    as not only a very old and intimate friend of the family , but particularly co
    el so much pain as pleasure . Every friend of Miss Taylor must be glad to have
    t Smith ' s being exactly the young friend she wanted -- exactly the something
14. >>> words1.similar('friend')
    Building word-context index...
    like father sister mother [illegible] family daughter letter mind time brother aunt
    wife life and heart way side cousin eyes feelings
15. >>> words1.collocations()
    Building collocations list
    Frank Churchill; Lady Catherine; Miss Woodhouse; Project Gutenberg;
    young man; Miss Bates; Miss Fairfax; every thing; Jane Fairfax; great
    deal; dare say; every body; Sir William; Miss Bingley; John Knightley;
    Maple Grove; Miss Smith; Miss Taylor; Robert Martin; Colonel
    Fitzwilliam
16. >>> MyFiles_tag = [nltk.pos_tag(sent) for sent in sents]
17. >>> MyFiles_tag[13][:10]
    [('Emma', 'NNP'), ('Woodhouse', 'NNP'), (',', ','), ('handsome', 'NN'),
    (',', ','), ('clever', 'RB'), (',', ','), ('and', 'CC'), ('rich', 'JJ'),
    (',', ',')]

Figure 13.5. Python session illustrating some functions in NLTK

For the purposes of corpus linguistics, the comprehensive package of Python
tools known as the Natural Language Toolkit (NLTK) has many attractive features.
The best introduction to NLTK is Bird, Klein, and Loper (2009), also
currently available as a free online eBook at the NLTK website; Perkins (2010)
is a useful additional text. Figure 13.5 shows a log of a session working with
NLTK and illustrates just a sample of the functions that are available in this
module. In this session, a directory of two English .txt files (downloaded from
Project Gutenberg and pre-processed using jEdit) is loaded as a corpus with the
name "MyFiles" (lines 3-4). One can obtain a list of all the files that make up the
corpus (line 5). In this case, there are just two files: one being the Project
Gutenberg file for the novel Emma and another for the novel Pride and
Prejudice, both by Jane Austen. The corpus consisting of these two files is broken
down into a list of words (line 6) and then a list of the first ten words can be
displayed (line 7). As can be seen from the display of the first ten words, the files
have not been pre-processed and some metadata about Project Gutenberg appears
as the first ten words. Similarly, one can break the corpus into sentences (line 8)
and view the first three sentences (line 9), or paragraphs (line 10) and view the first
three paragraphs (line 11). Further commands can produce a set of the first ten
concordance lines based on the search term friend (line 13), words which occur in
similar contexts as friend (line 14), and significant bigrams (line 15). It is possible
to add part-of-speech tags (not always accurate) to create a tagged corpus
MyFiles_tag (line 16) and print out the first ten words and punctuation marks of
the first tagged sentence (= sentence 13 of the corpus) of Jane Austen's Emma.

R is an open-source programming language and environment originally
designed for statistical computing and graphics, but with all the functionality of
"normal" multipurpose programming languages, including loops, conditional
expressions, text processing with and without regular expressions, and so on.
Figure 13.6 exemplifies how very easily a rough frequency list can be created in
four lines of code in a short R session: first, a corpus file is loaded (line 1), then
it is split up into words (in a somewhat simplistic way, line 2), then R computes a
sorted frequency list of the whole file (line 3) and prints out the thirty most
frequent words and frequencies (line 4). Then, two of Zipf's laws are tested
by (i) plotting words' lengths against their frequencies (line 5; note that the
frequencies are logged in order to better represent the distribution of frequencies
of a corpus) and adding a summary line (line 6), and by (ii) plotting words'
frequency ranks against their (logged) frequencies (line 7) and adding a summary
line (line 8).

Given that corpora continuously increase in size and diversity, it is becoming
increasingly important that corpus linguists use tools that are not restricted to
particular formats, encodings, sizes, or other design factors, and recent changes
show that the field is making great strides to this end. If this trend continues, the
field will transform into an even more exciting discipline and contribute more than
its share to insightful studies of all aspects of language.
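The frequency-list and Zipf steps described above for the R session can also be sketched in Python, the other language exemplified in this section. What follows is a minimal sketch under stated assumptions, not the authors' code: the function names frequency_list and zipf_table and the sample sentence are hypothetical, tokenization is the same rough split on non-word characters, case is left untouched, and ranks are simple ordinal positions in the sorted list (R's rank() averages tied ranks, so tied words would receive slightly different values there).

```python
import math
import re
from collections import Counter


def frequency_list(text):
    """Return (word, count) pairs sorted by descending frequency."""
    # Rough tokenization: split on runs of non-word characters,
    # similar in spirit to splitting on "\\W+" in the R session.
    words = [w for w in re.split(r"\W+", text) if w]
    return Counter(words).most_common()


def zipf_table(freqs):
    """Return (word, length, log frequency, log rank) for each word."""
    rows = []
    # Assumption: rank is the ordinal position in the sorted list (1 = most frequent).
    for rank, (word, count) in enumerate(freqs, start=1):
        rows.append((word, len(word), math.log(count), math.log(rank)))
    return rows


# Hypothetical sample text standing in for a corpus file.
sample = "the cat saw the dog and the dog saw a bird"
freqs = frequency_list(sample)
for row in zipf_table(freqs)[:3]:
    print(row)
```

For a real corpus file, one would read the text from disk and pass it to frequency_list; the length, log-frequency, and log-rank columns can then be handed to any plotting library to reproduce the two kinds of plot.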
1. corpus.file <- scan("Brown1_G.txt", what=character(0), sep="\n")
2. words <- unlist(strsplit(corpus.file, "\\W+", perl=TRUE))
3. freq.list <- sort(table(words), decreasing=TRUE)
4. freq.list[1:30]
   words
    the   of  and   to    a   in that   is  was  his  for   he   as   it with
   9790 6363 4320 4116 3319 3100 1905 1795 1467 1342 1199 1182 1159 1069 1063
    The    s    I   be  not  had   by   on which from  are   at have this   or
    948  929  871  846  819  804  797  768  679  651  647  633  628  627  588
5. plot(nchar(names(freq.list)) ~ log(freq.list), xlab="Log word
   frequency", ylab="Word length in characters")
6. lines(lowess(nchar(names(freq.list)) ~ log(freq.list)))
7. plot(log(rank(-freq.list)) ~ log(freq.list), xlab="Log word
   frequency", ylab="Log rank frequency")
8. lines(lowess(log(rank(-freq.list)) ~ log(freq.list)))

[Two plots: "Frequency x length" (x-axis: Log word frequency; y-axis: Word length in characters) and "Frequency x rank" (x-axis: Log word frequency; y-axis: Log rank frequency)]

Figure 13.6. R session to create a frequency list of a file from the Brown Corpus
and the resulting plots

Appendix 13.1 Corpora referred to in this chapter

Baby BNC Details of this collection of corpora, with XAIRA, can be found at
www.natcorp.ox.ac.uk/corpus/babyinfo.html. Payment required.
BNC The British National Corpus can be accessed online at no cost through two
interfaces: Mark Davies' website at corpus.byu.edu/bnc and William Fletcher's
Phrases in English site at phrasesinenglish.org. Information on purchasing the
corpus (and other releases of samples of the BNC) may be found at
www.natcorp.ox.ac.uk. Online access to the BNC is also provided for BNC licensees.
A full description of the BNC can be found in the Reference Guide for the
British National Corpus (XML Edition) at www.natcorp.ox.ac.uk/docs/URG.
Brown The Brown Corpus may be downloaded at no cost through the "language
commons" collection at www.archive.org/details/BrownCorpus and the NLTK
package of Python at www.nltk.org. It can be searched online through the LDC
at online.ldc.upenn.edu/login.html and the Corpus Concordance English at
www.lextutor.ca/concordancers/concord_e.html. The corpus is included in
the ICAME Corpus Collection available on CD-ROM through ICAME at
icame.uib.no/cd. Different versions of the corpus may segment the corpus
differently. The language commons version contains the 500 x 2,000 word
samples as separate files; the ICAME version contains fifteen files reflecting the
sub-categories in Table 13.3. Both tagged and untagged versions of the corpus
are included in the ICAME Corpus Collection; an XML tagged version of the
Brown is included as part of BabyBNC v.2 which is available at
www.natcorp.ox.ac.uk.
Buckeye The Buckeye Corpus, together with a manual, may be obtained at no
cost by following instructions on the homepage of the project at
buckeyecorpus.osu.edu.
CallHome The CallHome American English Speech corpus is available at cost
through the Linguistic Data Consortium at www.ldc.upenn.edu.
CHILDES The Child Language Data Exchange System, developed by Brian
MacWhinney, is accessed freely at childes.psy.cmu.edu.
COCA The Corpus of Contemporary American English is freely accessible online
at www.americancorpus.org, but not distributed as a corpus. A full description
of the corpus can be found at this website.
COHA The Corpus of Historical American English is freely accessible online at
corpus.byu.edu/coha, but not distributed as a corpus. A full description of the
corpus can be found at this website.
FLOB The Freiburg LOB corpus is included in the ICAME Corpus Collection and
is described in the accompanying manual. Available for purchase from
icame.uib.no/cd.
FROWN The Freiburg Brown Corpus is included in the ICAME Corpus
Collection and is described in the accompanying manual. Available for
purchase from icame.uib.no/cd.
ICAME The International Computer Archive of Modern and Medieval
English Collection is available for purchase on CD-ROM at icame.uib.no/cd.
ICE Information on obtaining corpora of the International Corpus of English is
available through the ICE website at ice-corpora.net/ice/index.htm. At the time
of writing, ICE corpora for Canada, Jamaica, Hong Kong, East Africa, India,
Singapore, and Philippines are available at no cost and can be downloaded from
the ICE website; ICE corpora for Great Britain, New Zealand, and Ireland are
available on CD-ROM at relatively low cost.
ICE-CAN The Canadian component of the International Corpus of English is
freely available at ice-corpora.net/ice/index.htm and is described more fully in
Newman and Columbus (2010).
LOB The Lancaster-Oslo/Bergen (written) corpus is included in the
ICAME Corpus Collection and is described in the accompanying manual.
Available for purchase from icame.uib.no/cd.
MICASE The Michigan Corpus of Academic Spoken English is freely accessed
online at quod.lib.umich.edu/m/micase. A full description of the MICASE
project and the corpus can be found in the MICASE manual available at
micase.elicorpora.info. Individual XML transcripts of the files can be
downloaded at no cost. A version of the whole corpus can also be purchased through
the MICASE website.
TalkBank This collection of corpora and transcripts is accessed freely at
talkbank.org.
TIMIT The Acoustic-Phonetic Continuous Speech Corpus is available for
purchase through the Linguistic Data Consortium.
Uppsala Learner English Corpus This corpus is described in Johansson and
Geisler (2009, 2011).

(All websites accessed July 8, 2013.)

Appendix 13.2 Tools and software referred to in this chapter

AntConc Concordancer. www.antlab.sci.waseda.ac.jp/software.html
CES. Corpus Encoding Standard. www.cs.vassar.edu/CES
CLAWS. The Constituent Likelihood Automatic Word-tagging System tagset(s).
ucrel.lancs.ac.uk/claws
ELAN. EUDICO Linguistic Annotator software. www.lat-mpi.eu/tools/elan
FreeLing. nlp.lsi.upc.edu/freeling
GoTagger. web4u.setsunan.ac.jp/Website/GoTagger.htm (for notes in English on
this Windows-only tagger:
hi.baidu.com/seanxpq/blog/item/?aa9db03ftlbffc0f738da50e.html)
HTTrack. www.httrack.com
Infogistics. www.infogistics.com
jEdit. www.jedit.org
LibreOffice Calc. www.libreoffice.org
NLTK. Natural Language Toolkit. www.nltk.org. An electronic version of the
accompanying book (Bird, Klein, and Loper 2009) is also available at this site.
Penn Treebank Tagset. www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html
Project Gutenberg. www.gutenberg.org
R. www.R-project.org
SiteSucker. http://sitesucker.us/home.html
Southern Oral History Program. docsouth.unc.edu/sohp
Transcriber. trans.sourceforge.net
TreeTagger. www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
(for the Windows interface to TreeTagger:
www.smo.uhi.ac.uk/~oduibhin/oideasra/interfaces/winttinterface.htm)
Wordsmith. Corpus linguistic software available for purchase at
www.lexically.net/wordsmith
XCES. Corpus Encoding Standard in XML format. www.xces.org

(All websites accessed July 8, 2013.)

References

Beal, J. C., K. P. Corrigan, and H. L. Moisl, eds. 2007a. Creating and Digitizing Language
  Corpora. Volume I: Synchronic Databases. Basingstoke and New York: Palgrave
  Macmillan.
  2007b. Creating and Digitizing Language Corpora. Volume II: Diachronic Databases.
  Basingstoke and New York: Palgrave Macmillan.
Berkenfield, C. 2001. The role of frequency in the realization of English that. In
  J. L. Bybee and P. J. Hopper, eds. Frequency and the Emergence of Linguistic
  Structure. Philadelphia: John Benjamins, 281-307.
Bird, S., E. Klein, and E. Loper. 2009. Natural Language Processing with Python:
  Analyzing Text with the Natural Language Toolkit. Sebastopol, CA: O'Reilly
  Media.
Davies, M. 2008-. The Corpus of Contemporary American English (COCA). Available at
  www.americancorpus.org (accessed June 26, 2013).
  2011. The Corpus of Contemporary American English as the first reliable monitor
  corpus of English. Literary and Linguistic Computing 25.4: 447-65.
Fiorentino, G. 2009. The ordering of adverbial and main clauses in spoken and written
  Italian. In B. Lewandowska-Tomaszczyk and K. Dziwirek, eds. Studies in Cognitive
  Corpus Linguistics. Frankfurt am Main: Peter Lang, 207-22.
Gardner-Chloros, P., M. Moyer, and M. Sebba. 2007. Coding and analyzing multilingual
  data: the LIDES project. In Beal, Corrigan, and Moisl. Volume I: 91-120.
Gilquin, G. and S. Th. Gries. 2009. Corpora and experimental methods: a state-of-the-art
  review. Corpus Linguistics and Linguistic Theory 5.1: 1-26.
Granath, S. 2007. Size matters - or thus can meaningful structures be revealed in large
  corpora. In R. Facchinetti, ed. Corpus Linguistics 25 Years On. Amsterdam and New
  York: Rodopi, 169-85.
Greenbaum, S., ed. 1996. Comparing English Worldwide: The International Corpus of
  English. Oxford: Clarendon Press.
Greenbaum, S. and G. Nelson. 1996. The International Corpus of English (ICE) project.
  World Englishes 15.1: 3-15.
Gries, S. Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal
  of Corpus Linguistics 13.4: 403-37.
  2009. Quantitative Corpus Linguistics with R: A Practical Introduction. London and
  New York: Routledge.
Hunston, S. 2008. Collection strategies and design decisions. In Lüdeling and Kytö, eds.
  Volume I, 154-68.
Johansson, C. and C. Geisler. 2009. The Uppsala Learner English Corpus: a new corpus of
  Swedish high school students' writing. In A. Saxena and Å. Viberg, eds.
  Multilingualism: Proceedings of the 23rd Scandinavian Conference of Linguistics.
  Uppsala: Acta Universitatis Upsaliensis, 181-90.
  2011. Syntactic aspects of the writing of Swedish L2 learners of English. In
  J. Newman, H. Baayen, and S. Rice, eds. Corpus-based Studies in Language Use,
  Language Learning, and Language Documentation. Amsterdam: Rodopi Press,
  139-73.
Kilgarriff, A. and G. Grefenstette. 2006. Introduction to the special issue on the web as
  corpus. Computational Linguistics 29.3: 333-47.
Kucera, H. and W. N. Francis. 1967. Computational Analysis of Present-day English.
  Providence, RI: Brown University Press.
LIPPS-Language Interaction in Plurilingual and Plurilectal Speakers Group. 2000. A
  Document for Preparing and Analysing Language Interaction Data. Special issue of
  the International Journal of Bilingualism 4.2.
Lüdeling, A. and M. Kytö, eds. 2008a. Corpus Linguistics: An International Handbook.
  Volume I. Berlin and New York: Mouton de Gruyter.
  2008b. Corpus Linguistics: An International Handbook. Volume II. Berlin and New
  York: Mouton de Gruyter.
McEnery, T. and A. Hardie. 2012. Corpus Linguistics: Method, Theory, and Practice.
  Cambridge University Press.
McEnery, T., R. Xiao, and Y. Tono. 2006. Corpus-based Language Studies: An Advanced
  Resource Book. London: Routledge.
Meurman-Solin, A. 2007. The manuscript-based diachronic corpus of Scottish
  correspondence. In Beal, Corrigan, and Moisl. Volume II, 127-47.
Nelson, G., S. Wallis, and B. Aarts. 2002. Exploring Natural Language: Working with the
  British Component of the International Corpus of English. Amsterdam and
  Philadelphia: John Benjamins.
Newman, J. 2008. Spoken corpora: rationale and application. Taiwan Journal of
  Linguistics 6.2: 27-58.
Newman, J. and G. Columbus. 2009. Education as an over-represented topic in the ICE
  corpora [Part II]. Presentation for the 15th International Conference of the
  International Association for World Englishes (IAWE), October 22-24, Cebu City,
  Philippines.
  2010. The ICE-Canada Corpus. Version 1. Available at: http://ice-corpora.net/ice/
  download.htm (accessed June 26, 2013).
Ostler, N. 2008. Corpora of less studied languages. In Lüdeling and Kytö, eds. Volume I,
  457-83.
Perkins, J. 2010. Python Text Processing with NLTK 2.0 Cookbook. Birmingham, UK:
  Packt Publishing.
Pitt, M. A., K. Johnson, E. Hume, S. Kiesling, and W. Raymond. 2005. The Buckeye
  Corpus of Conversational Speech: labeling conventions and a test of transcriber
  reliability. Speech Communication 45.1: 89-95.
Pitt, M., L. Dilley, K. Johnson, S. Kiesling, W. Raymond, E. Hume, and E. Fosler-Lussier.
  2007. Buckeye Corpus of Conversational Speech (2nd release) [www.buckeyecorpus.
  osu.edu]. Columbus, OH: Department of Psychology, Ohio State University
  (Distributor).
Roy, D. 2009. New horizons in the study of child language acquisition. Proceedings of
  Interspeech 2009. September 6-10, Brighton, England. Available at:
  www.media.mit.edu/cogmac/publications/Roy_interspeech_keynote.pdf (accessed
  June 26, 2013).
Schmid, H. 1994. Probabilistic part-of-speech tagging using decision trees. Proceedings of
  International Conference on New Methods in Language Processing, September,
  Manchester, UK.
Simpson, R. C., S. L. Briggs, J. Ovens, and J. M. Swales. 2002. The Michigan Corpus of
  Academic Spoken English. Ann Arbor, MI: The Regents of the University of
  Michigan.
Sinclair, J. 2005. Corpus and text - basic principles. In Wynne, 1-16.
Thompson, S. A. and P. J. Hopper. 2001. Transitivity, clause structure and argument
  structure. In J. L. Bybee and P. J. Hopper, eds. Frequency and the Emergence of
  Linguistic Structure. Philadelphia: John Benjamins, 27-60.
Wiechmann, D. 2008. On the computation of collostruction strength: testing measures of
  association as expressions of lexical bias. Corpus Linguistics and Linguistic Theory
  4.2: 253-90.
Wynne, M., ed. 2005. Developing Linguistic Corpora: A Guide to Good Practice. Oxford:
  Oxbow Books. Available at: www.ahds.ac.uk/creating/guides/linguistic-corpora/
  index.htm (accessed June 26, 2013).
Xiao, R. 2006. Xaira - an XML-aware indexing and retrieval architecture. Corpora 1.1:
  99-103.
  2008. Well-known and influential corpora. In Lüdeling and Kytö, eds. Volume I, 383-457.
