Corpus Presentation 5
Creating a corpus
http://tinyurl.com/669o4zt
By Hans Sebald Beham (Germany, Nuremberg, 1500-1550) [Public domain], via Wikimedia Commons
The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora
Abstract:
This article introduces ukWaC, deWaC and itWaC, three very large corpora of English,
German, and Italian built by web crawling, and describes the methodology and tools
used in their construction. The corpora contain more than a billion words each, and
are thus among the largest resources for the respective languages. The paper also
provides an evaluation of their suitability for linguistic research, focusing on ukWaC
and itWaC. A comparison in terms of lexical coverage with existing resources for the
languages of interest produces encouraging results. Qualitative evaluation of ukWaC
vs. the British National Corpus was also conducted, so as to highlight differences in
corpus composition (text types and subject matters). The article concludes with
practical information about format and availability of corpora and tools.
http://bootcat.sslmit.unibo.it/
Annotation
adding (linguistic) interpretation to the text
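As a rough illustration of what adding interpretation to the text can look like, the sketch below marks up one sentence with word and punctuation elements. The element names `s`, `w` and `c` and the function `annotate_sentence` are assumptions chosen for this example, not a scheme prescribed by the slides; real corpora typically add part-of-speech tags and lemmas as well.

```python
import re
import xml.etree.ElementTree as ET

def annotate_sentence(text, n):
    """Wrap words in <w> and punctuation marks in <c>, inside a numbered <s>."""
    s = ET.Element("s", {"n": str(n)})
    # A word may contain one internal apostrophe (straight or curly);
    # any other non-space, non-word character is its own token.
    for token in re.findall(r"\w+(?:['’]\w+)?|[^\w\s]", text):
        tag = "w" if re.match(r"\w", token) else "c"
        ET.SubElement(s, tag).text = token
    return s

sentence = annotate_sentence("Let’s try it!", 1)
print(ET.tostring(sentence, encoding="unicode"))
# prints: <s n="1"><w>Let’s</w><w>try</w><w>it</w><c>!</c></s>
```

The tokenisation rule here is deliberately naive; the point is only that annotation makes an analytic decision (what counts as a word) explicit and machine-readable.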
Metadata
describing your corpus
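One possible way to describe each text in a corpus is a small structured record kept alongside the text itself. The field names and the placeholder values below are assumptions made for illustration; which fields matter depends on the research questions the corpus is meant to serve.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TextMetadata:
    """Hypothetical per-text metadata record."""
    text_id: str
    title: str
    author: str
    date: str       # ISO 8601 date string, e.g. "YYYY-MM-DD"
    source: str     # where the text was obtained
    text_type: str  # e.g. "speech", "news", "blog"

# Placeholder values only, not real corpus data.
record = TextMetadata(
    text_id="example-001",
    title="Example text",
    author="Unknown",
    date="2000-01-01",
    source="http://example.org/text",
    text_type="speech",
)

print(json.dumps(asdict(record), indent=2))
```

Keeping metadata in a separate, regular structure makes it possible to select sub-corpora later (for instance, all texts of one type or from one period) without re-reading the texts.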
The National Archives building at Kew. This work has been released into the public domain by its author, Matt Crypto.
Let’s try it!
Capture Obama’s language
Design
• What is ‘Obama’s language’?
• Where do you find it?
• Can you get everything? How do you select?
State of the Union addresses
http://www.presidency.ucsb.edu/sou.php
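Texts captured from a web page usually arrive wrapped in HTML. A minimal sketch of reducing a saved page to plain text with Python's standard library follows; the class `TextExtractor`, the function `html_to_text` and the sample markup are assumptions for this example, and the actual page structure of any given site would need to be inspected first.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

# Invented sample markup, standing in for a downloaded page.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Address</h1><p>Thank you.</p></body></html>"
)
print(html_to_text(sample))
```

A real capture step would also have to decide what to do with navigation menus, footers and audience reactions recorded in transcripts, which is a design decision as much as a technical one.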
Think about
Data capture
Annotation and metadata
Format
Data management
(copyright)
…
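For the format and data management points above, one simple arrangement is to store each text as a UTF-8 file and keep a manifest that records what was stored. This is a sketch under assumed conventions: the file naming, the manifest fields and the function names `store_text` and `write_manifest` are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def store_text(corpus_dir, text_id, text):
    """Write one text as a UTF-8 file and return a manifest entry for it."""
    corpus_dir = Path(corpus_dir)
    corpus_dir.mkdir(parents=True, exist_ok=True)
    path = corpus_dir / f"{text_id}.txt"
    data = text.encode("utf-8")
    path.write_bytes(data)
    return {
        "id": text_id,
        "file": path.name,
        "bytes": len(data),
        "tokens": len(text.split()),  # whitespace-separated tokens only
        "sha256": hashlib.sha256(data).hexdigest(),
    }

def write_manifest(corpus_dir, entries):
    """Save the list of manifest entries as manifest.json in the corpus directory."""
    manifest_path = Path(corpus_dir) / "manifest.json"
    manifest_path.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return manifest_path

# Usage with a temporary directory and a placeholder text.
with tempfile.TemporaryDirectory() as tmp:
    entries = [store_text(tmp, "example-001", "Thank you.")]
    manifest = write_manifest(tmp, entries)
    print(manifest.read_text(encoding="utf-8"))
```

The checksum makes it possible to detect later whether a file has changed, and the token count gives a quick running total of corpus size.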
Further reading
Wynne, Martin (2005) Developing Linguistic Corpora: A
Guide to Good Practice. Oxford, Oxbow Books.
http://www.ota.ox.ac.uk/documents/creating/dlc/
Burnard, Lou (2007) Reference Guide for the British
National Corpus (XML Edition). Research Technologies
Service at Oxford University Computing Services.
http://www.natcorp.ox.ac.uk/docs/URG/
Bowker, Lynne and Jennifer Pearson (2002) Working with
Specialized Language: A Practical Guide to Using Corpora.
Routledge (extract via Google Books).
Next week: 6. Using the corpus in linguistic research