Corpus Linguistics 5

Creating a corpus
http://tinyurl.com/669o4zt

Ylva Berglund Prytz & Martin Wynne
What is a corpus?

“…a collection of pieces of language, selected and ordered according to
explicit linguistic criteria in order to be used as a sample of the language.”
(Sinclair 1996)
Do you need to create
your own corpus?

Adam Cuerden [Public domain or Attribution], via Wikimedia Commons


Research question?
Creating a corpus

1. Design – planning your corpus
2. Data capture and text encoding – collecting your texts
3. Annotation – adding linguistic interpretation to the text
4. Metadata – describing your corpus
5. Format – saving the material in a format you can use
6. Archiving, preservation, distribution – the future of your resource
Corpus Design
planning your corpus

By Juan Consuegra (Personal archive) [GFDL (http://www.gnu.org/copyleft/fdl.html), CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0/) or FAL], via Wikimedia Commons
Data capture
collecting your texts

By Hans Sebald Beham (Germany, Nuremberg, 1500-1550) [Public domain], via Wikimedia Commons
The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, Eros Zanchetta
Language Resources and Evaluation 43(3): 209-226
Available at http://wacky.sslmit.unibo.it/lib/exe/fetch.php?media=papers:wacky_2008.pdf

Abstract:
This article introduces ukWaC, deWaC and itWaC, three very large corpora of English,
German, and Italian built by web crawling, and describes the methodology and tools
used in their construction. The corpora contain more than a billion words each, and
are thus among the largest resources for the respective languages. The paper also
provides an evaluation of their suitability for linguistic research, focusing on ukWaC
and itWaC. A comparison in terms of lexical coverage with existing resources for the
languages of interest produces encouraging results. Qualitative evaluation of ukWaC
vs. the British National Corpus was also conducted, so as to highlight differences in
corpus composition (text types and subject matters). The article concludes with
practical information about format and availability of corpora and tools.

See also http://wacky.sslmit.unibo.it


BootCaT
Simple Utilities to Bootstrap
Corpora And Terms from the Web

http://bootcat.sslmit.unibo.it/
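
Once pages have been collected from the web, the markup has to be removed before the text can go into a corpus. The following is a minimal sketch of that clean-up step using only the Python standard library. It is not how BootCaT or the WaCky tools work internally; the names `TextExtractor` and `html_to_text` are invented for this illustration, and real web pages will also need boilerplate (menus, adverts) removed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalise runs of whitespace and ignore empty chunks.
        cleaned = " ".join(data.split())
        if self._skip_depth == 0 and cleaned:
            self.parts.append(cleaned)


def html_to_text(html):
    """Return the visible text of an HTML string as one line of plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    page = "<html><body><h1>State of the Union</h1><p>Madam Speaker, ...</p></body></html>"
    print(html_to_text(page))
```

Keeping the original downloaded files alongside the cleaned text means the clean-up can be redone later if the rules change.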
Annotation
adding (linguistic) interpretation to the text
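
To make the idea of annotation concrete, here is a small sketch that splits a text into tokens and attaches a label to each one, writing the result one token per line. The tag set and the `TOY_LEXICON` lookup table are invented for this illustration only; a real project would use an established tagger and tag set rather than a hand-made word list.

```python
import re

# Hypothetical mini-lexicon, for illustration only.
TOY_LEXICON = {"the": "DET", "of": "PREP", "is": "VERB", "and": "CONJ"}


def tokenize(text):
    """Split into words (allowing one internal apostrophe) and punctuation marks."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\sA-Za-z]", text)


def annotate(text):
    """Pair every token with a tag: PUNCT, a lexicon tag, or UNK if unknown."""
    pairs = []
    for tok in tokenize(text):
        if not tok[0].isalpha():
            tag = "PUNCT"
        else:
            tag = TOY_LEXICON.get(tok.lower(), "UNK")
        pairs.append((tok, tag))
    return pairs


def to_vertical(pairs):
    """One token per line, token and tag separated by a tab."""
    return "\n".join(f"{tok}\t{tag}" for tok, tag in pairs)


if __name__ == "__main__":
    print(to_vertical(annotate("The impact of this recession is real.")))
```

The point of the sketch is the separation of text and interpretation: the tokens come from the source, while the tags are an analysis added on top and should be documented as such.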
Metadata
describing your corpus

© The American Board of Orthodontics - all rights reserved world wide
http://www.americanboardortho.com/professionals/clinicalexam/casereportpresentation/preparation/titlepage.aspx
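
One simple way to describe each text is to keep a small metadata record next to it. The sketch below stores such a record as JSON. The field names (`text_id`, `title`, `speaker`, `source_url`, `date`, `notes`) and the sample values are assumptions chosen for this illustration, not a standard; established schemes such as a TEI header record the same kind of information in more detail.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TextMetadata:
    """Hypothetical minimal description of one text in the corpus."""
    text_id: str
    title: str
    speaker: str
    source_url: str
    date: str = "unknown"
    notes: str = ""


def metadata_to_json(meta):
    """Serialise a metadata record as readable JSON with sorted keys."""
    return json.dumps(asdict(meta), indent=2, sort_keys=True)


if __name__ == "__main__":
    record = TextMetadata(
        text_id="obama-001",  # invented identifier
        title="Address before Congress",  # placeholder title
        speaker="Barack Obama",
        source_url="http://www.presidency.ucsb.edu/sou.php",
        notes="Downloaded for a class exercise.",
    )
    print(metadata_to_json(record))
```

Recording where and when each text was obtained makes it possible to check the source later and to select sub-corpora by speaker, date or text type.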
Format
save the material in a format you can use
Madam Speaker, Mr. Vice President, Members of Congress, the First Lady of the
United States--she's around here somewhere: I have come here tonight not only to
address the distinguished men and women in this great Chamber, but to speak
frankly and directly to the men and women who sent us here. I know that for many
Americans watching right now, the state of our economy is a concern that rises
above all others, and rightly so. If you haven't been personally affected by this
recession, you probably know someone who has: a friend, a neighbor, a member of
your family. You don't need to hear another list of statistics to know that our
economy is in crisis, because you live it every day. It's the worry you wake up with
and the source of sleepless nights. It's the job you thought you'd retire from but now
have lost, the business you built your dreams upon that's now hanging by a thread,
the college acceptance letter your child had to put back in the envelope. The impact
of this recession is real, and it is everywhere.
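
A passage like the one above can be saved as plain text, or wrapped in simple XML so that units such as sentences can be addressed by software. The following is a rough sketch under stated assumptions: the element names `text` and `s`, the attributes `id` and `n`, and the sentence-splitting rule are all choices made for this illustration. The splitting rule is deliberately naive and will wrongly break after abbreviations such as "Mr."

```python
import re
import xml.etree.ElementTree as ET


def split_sentences(text):
    """Naive split: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def to_xml(text, text_id):
    """Wrap each sentence in a numbered <s> element inside one <text> element."""
    root = ET.Element("text", {"id": text_id})
    for n, sentence in enumerate(split_sentences(text), start=1):
        s = ET.SubElement(root, "s", {"n": str(n)})
        s.text = sentence
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = ("The impact of this recession is real, and it is everywhere. "
              "It's the worry you wake up with.")
    print(to_xml(sample, "obama-001"))
```

Whatever format is chosen, it should be one that the intended analysis tools can read and that can be converted to other formats later.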
Data management and archiving
the future of your resource

The National Archives building at Kew. This work has been released into the public domain by its author, Matt Crypto.
Let’s try it!
Capture Obama’s language

Design
• What is ‘Obama’s language’?
• Where do you find it?
• Can you get everything? How do you select?
State of the Union addresses
http://www.presidency.ucsb.edu/sou.php
Think about

Data capture
Annotation and metadata
Format
Data management
(copyright)

Further reading
Wynne, Martin (ed.) (2005) Developing Linguistic Corpora: A
Guide to Good Practice. Oxford, Oxbow Books.
http://www.ota.ox.ac.uk/documents/creating/dlc/
Burnard, Lou (2007) Reference Guide for the British
National Corpus (XML Edition). Research Technologies
Service at Oxford University Computing Services.
http://www.natcorp.ox.ac.uk/docs/URG/
Bowker, Lynne and Jennifer Pearson (2002) Working with
Specialized Language: A Practical Guide to Using Corpora.
Routledge (extract via Google Books)
Next week: 6. Using the corpus in linguistic research

Hommerberg, C., Tottie, G. (2007). Try to or try and? Verb complementation in British
and American English. In ICAME Journal: Computers in English Linguistics. April. 45-64.
Available at http://icame.uib.no/ij31/ij31-page45-64.pdf