
Introduction to Corpora and Corpus Linguistics

COGS 523 - Lecture 2
Corpus Design Issues I

10.11.23 COGS 523 - Bilge Say 1


Related Readings
Readings (Course Pack):
• Tognini-Bonelli (2001) Corpus Issues. Ch 3
• McEnery et al. (2006) Units A7-A9, B1 (all appear as one article in the course pack)
• Meyer (2002) Planning the Construction of a Corpus. Ch 2
Optional: Penn Treebank and Czech National Corpus articles from the Course Pack
• McEnery and Wilson (2001) Chs 2 and 3
• Also available in the Sampson and McCarthy (2005) anthology:
  • Biber (1993) Representativeness in Corpus Design. Literary and Linguistic Computing 8(4)
  • Atkins, Clear and Ostler (1992) Corpus Design Criteria. Literary and Linguistic Computing 7(1)



What is a Corpus?
Turkish terms: derlem (alt. bütünce)

[Diagram labels: Text/Speech/Video + Annotation; Digital media; Written/Spoken; Design Criteria; Language]



Stages of Corpus Building - I
(also known as Corpus Compilation)
• Specifications and Design
• Develop Infrastructure and Find Funding!
• Sampling, Representativeness, Balance, Copyright Issues
• Piloting
• Planning Manpower
• Preparation of an Annotation Manual
• Acquisition or Development of Software for Annotation
• Technical Equipment Acquisition
• Design and Development of Corpus Query Tools
• Design of Change Management Processes



Stages of Corpus Building - II
• Data Capture and Preprocessing
• Transcription, Tokenization, Error Correction
• Annotation (Markup)
• User Documentation
All of these are accompanied by cyclic quality control processes and beta releases for user feedback.



Representativeness and Balance
• Balance: the weighting between different sections of a corpus, according to its design purpose
• Representativeness: the findings from an idealized representative corpus should be generalizable to the whole language or to a specified part of it
• What is the relationship between balance and representativeness?
• Is ideal representativeness possible?
Ways to Approach Sampling
• Elitist: based on literary and academic merit
• Popularity
• Typicality
• Availability
• Random (or sampling from, for example, a national library's holdings)



More about Sampling
• Choose a sampling frame: identify a specific population to make generalizations about
• For the spoken part of the BNC, the United Kingdom was divided into 12 regions, with 30 sampling points selected based on their demographic profile
• Gender balance: may be hard to achieve in some genres
• Who counts as native? ICE-US: had lived in the USA and spoken American English since 10-12 years of age
• Education level, age, dialect variation

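One way to read "choose a sampling frame, then sample by demographic profile" is as stratified random sampling. The sketch below is only an illustration under that assumption: the frame, the speaker IDs, the region names, and the `stratified_sample` helper are all hypothetical and do not reproduce the actual BNC procedure.

```python
import random
from collections import defaultdict


def stratified_sample(frame, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items at random from each stratum of `frame`."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    strata = defaultdict(list)
    for item in frame:
        strata[item[key]].append(item)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = min(per_stratum, len(members))  # a stratum may be smaller than the quota
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical sampling frame: 12 regions with 10 candidate speakers each.
frame = [
    {"speaker": f"r{region:02d}-s{i:02d}", "region": f"region-{region:02d}"}
    for region in range(1, 13)
    for i in range(10)
]

sample = stratified_sample(frame, key="region", per_stratum=3)
print(len(sample))  # 12 regions x 3 speakers = 36
```

The same helper could stratify on gender, age band, or education level by changing `key`, provided the frame records that attribute.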


Spoken Data Sampling
• Elicited: Map Task corpus
• Natural: self-recording
• Origins (immigrant/native status, age, gender, geographic district, dialect)
• Dialogues vs monologues



Something in Between
• Netspeak: blogs, chatrooms, SMS messages...
• Pre-prepared speeches...



Minimal Criteria for a Balanced General Corpus
• Suggested by Sinclair (1991)
• Fiction vs nonfiction
• Book, journal vs newspaper
• Formal vs informal
• Control of age, gender, and origin of authors



Idealized vs Opportunistic Representativeness
• Measuring exposure (perception)
• Measuring production
  Purely frequency-based estimate:
  90% conversation,
  3% letters or notes,
  7% press reportage, fiction, lectures, etc.
• Distinguishing genre, register, text type
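To make the production-based estimate concrete, the sketch below turns the three percentages into word quotas. The 1-million-word target and the `allocate_words` helper are assumptions for illustration only; the slide gives the shares, not a corpus size.

```python
def allocate_words(total_words, shares):
    """Split `total_words` across categories in proportion to `shares` (in percent)."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100")
    # Integer arithmetic keeps the quotas exact for whole-number percentages.
    return {name: total_words * pct // 100 for name, pct in shares.items()}


# Shares from the purely frequency-based production estimate.
production_shares = {
    "conversation": 90,
    "letters or notes": 3,
    "press reportage, fiction, lectures etc.": 7,
}

# Assumed target size of 1 million words (not from the source).
allocation = allocate_words(1_000_000, production_shares)
for name, words in allocation.items():
    print(f"{name}: {words:,} words")
```

A corpus built this way would be dominated by conversation, which is one reason real designs tend to depart from a purely production-based weighting.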



The size and frequency of exposure of Czech speakers to various topics and kinds of written language (Kucera, 2002)

I. Specialized (technical) texts    33.50%
II. Non-specialized texts           66.50%
    Journals                        56%
    Fiction and poetry              10%
    Letters or chronicles           0.50%



Size
• How many tokens are enough to discover the patterns of collocation, polysemy, morphology, syntax, discourse, etc.?
• 10-20 million words suggested by Sinclair in 1991 for a small but useful general corpus
• 100 million words: CNC, BNC
• 100 million words as core, plus several hundred million more as periphery, for the ANC



Types vs Tokens
• Hapax legomena (Greek for "said only once")
• Almost half of the word types occur only once in the corpus
• 1-million-word corpus: 100 word types occur more than 1000 times
• 100-million-word corpus: 8000 word types can be expected to occur more than 1000 times, accounting for 95% of tokens. The remaining 5% of tokens are spread over about half a million word types.
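The type/token distinction and the notion of a hapax legomenon can be checked on any text with a frequency count. This is a minimal sketch: the tokenizer (lowercasing plus a `\w+` regular expression) and the toy sentence are assumptions chosen for brevity, not a recommended tokenization scheme.

```python
import re
from collections import Counter


def type_token_profile(text):
    """Count tokens, types, and hapax legomena in `text`."""
    tokens = re.findall(r"\w+", text.lower())  # crude tokenizer, for illustration only
    counts = Counter(tokens)
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return {"tokens": len(tokens), "types": len(counts), "hapaxes": hapaxes}


profile = type_token_profile("the cat sat on the mat and the dog sat too")
print(profile["tokens"])   # 11 running words
print(profile["types"])    # 8 distinct word forms
print(profile["hapaxes"])  # ['and', 'cat', 'dog', 'mat', 'on', 'too']
```

On a real corpus the share of hapaxes among types is the figure to compare with the "almost half" rule of thumb.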



General Guidelines
• Prosody: 100,000 words of spontaneous speech
• 1 million words: verb form morphology, some syntactic processes, high-frequency vocabulary
• Cross-linguistic and scientific studies are rare!
• Always collect ~10% more than your target. Despite your best efforts at quality control, you may have to discard some data.
Individual Sample Size
• 2,000 words (first-generation corpora)
• Varied vs fixed: BNC samples vary, up to 40,000 words
• Fixed size: what if a text is too short or too long?
• Newspapers: the "constructed week" concept
• 20,000 words (Oostdijk, 1988)
• 2,000-5,000 words from 20-80 texts from each genre (based on Biber's 1990 study of 10 linguistic features in 55 pairs of samples from LOB and LLC)
• May be a copyright issue!

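The "constructed week" idea for newspapers is to build one artificial week by drawing each weekday from somewhere in the sampling period, so that weekday-specific content (for example weekend supplements) is all represented. The sketch below is one simple reading of that idea; the 2023 period and the `constructed_week` helper are assumptions, and it does not force the seven dates to come from seven different weeks.

```python
import random
from datetime import date, timedelta


def constructed_week(start, end, seed=0):
    """Pick one random date per weekday (Monday..Sunday) from [start, end]."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    by_weekday = {d: [] for d in range(7)}
    day = start
    while day <= end:
        by_weekday[day.weekday()].append(day)  # weekday(): Monday=0 .. Sunday=6
        day += timedelta(days=1)
    # Assumes the period covers at least one full week; otherwise a weekday list is empty.
    return [rng.choice(by_weekday[d]) for d in range(7)]


week = constructed_week(date(2023, 1, 1), date(2023, 12, 31))
for issue_date in week:
    print(issue_date.isoformat(), issue_date.strftime("%A"))
```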


Brown University Standard Corpus of Present-Day American English (Francis & Kucera) (Brown Corpus)
1 million words, 1961-1964, 500 samples of 2000 words each

Structure (Meyer, 2002)

Informative Prose         75.0%    Imaginative Prose                   25%
A. Press: reportage        8.8%    K. General fiction                  5.8%
B. Press: editorial        5.4%    L. Mystery and detective fiction    4.8%
C. Press: reviews          3.4%    M. Science fiction                  1.2%
D. Religion                3.4%    N. Adventure and western            5.8%
E. Skills and hobbies      7.2%    P. Romance and love story           5.8%
F. Popular lore            9.6%    R. Humor                            1.8%
G. Learned (academic)     16%



The division of text types and domains in the Czech synchronic corpus of written texts (Kucera, 2002)

I. Imaginative texts                        15%
   1. Fiction                               11.02%
   2. Poetry                                0.81%
   3. Drama                                 0.21%
   4. Other literary texts                  0.36%
   5. Transitional types of texts           2.60%
II. Informative texts                       85%
   1. Journals                              60%
   2. Technical and specialized texts       25%
      a. Lifestyle                          5.55%
      b. Technology                         4.61%
      c. Social sciences                    3.67%
      d. Arts                               3.48%
      e. Natural sciences                   3.37%
      f. Economics and management           2.27%
      g. Law and security                   0.82%
      h. Belief and religion                0.74%
      i. Administrative texts               0.49%



The composition of the British National Corpus (part of Table 2.1 in Meyer (2002))

Speech
Type                      Number of Texts   Number of Words   % of Spoken Corpus
Demographically sampled   153               4,211,216         41%
Educational               144               1,265,318         12%
Business                  136               1,321,844         13%
Institutional             241               1,345,694         13%
Leisure                   187               1,459,419         14%
Unclassified              54                761,973           7%
Total                     915               10,365,464        100%
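The percentage column of the spoken table follows directly from the word counts. As a worked check, the sketch below recomputes each share from the figures in the table; the variable names are illustrative only.

```python
# Word counts for the spoken part of the BNC, copied from the table.
spoken_words = {
    "Demographically sampled": 4_211_216,
    "Educational": 1_265_318,
    "Business": 1_321_844,
    "Institutional": 1_345_694,
    "Leisure": 1_459_419,
    "Unclassified": 761_973,
}

total = sum(spoken_words.values())
# Share of each category, rounded to the nearest whole percent.
shares = {name: round(100 * words / total) for name, words in spoken_words.items()}

print(f"Total: {total:,} words")
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The same calculation is a cheap sanity check to run on any composition table before publishing it with a corpus.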



The composition of the British National Corpus (part of Table 2.1 in Meyer (2002))

Writing
Type               Number of Texts   Number of Words   % of Written Corpus
Imaginative        625               19,664,309        22%
Natural Science    144               3,752,659         4%
Applied Science    364               7,369,290         8%
Social Science     510               13,290,441        15%
World Affairs      453               16,507,399        18%
Commerce           284               7,118,321         8%
Arts               259               7,523,846         8%
Belief & Thought   146               3,053,672         3%
Leisure            374               9,990,080         11%
Unclassified       50                1,740,527         2%
Total              3209              89,740,554        99%


Composition of the ICE (part of Table 2.2 in Meyer (2002))

Speech
Type             Number of Texts   Number of Words   % of Spoken Corpus
Dialogues        180               360,000           59%
  Private        100               200,000           33%
  Public         80                160,000           26%
Monologues       120               240,000           40%
  Unscripted     70                140,000           23%
  Scripted       50                100,000           17%
Total            300               600,000           99%

Private: direct conversations, distance conversations
Public: class lessons, broadcast discussions, broadcast interviews, parliamentary debates, legal cross-examinations, business transactions
Unscripted: spontaneous commentaries, speeches, demonstrations, legal presentations
Scripted: broadcast news, broadcast talks, speeches (not broadcast)



Copyright Issues
• Publishers
  • Scientific and commercial aims conflict
  • Check who holds the copyright
  • Have written, signed agreements
  • The status of some sources may be disputable: still obtain written, signed agreements
• Individuals
  • Obtain their informed consent and guarantee that they will not be identified



Collecting and Computerizing Samples
• Written Text
  • Scanning (introduces OCR errors)
  • Electronic documents (different formats, different character sets)
  • Uploading documents (see the ANC web site)
• Spoken Text
  • Inform participants of your aim and that there is no linguistically "correct" Turkish, etc.
  • Record longer than needed (a 2,000-word sample needs 10-20 minutes; collect 30 minutes) so that you can cut off unnatural parts at the beginning
  • Record in natural environments
  • Invest in good equipment and good software
  • Even so, 4 out of 10 samples may be unusable (Meyer, 2002)
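The two planning figures for spoken data (30 minutes recorded per sample, and roughly 4 in 10 samples unusable) can be combined into a rough recording budget. This is only back-of-the-envelope arithmetic: the target of 300 usable samples is borrowed from the size of the ICE spoken component as an assumption, and `recordings_to_collect` is a hypothetical helper.

```python
def recordings_to_collect(usable_needed, usable_per_ten=6):
    """Recordings to make so that about `usable_needed` survive quality control.

    `usable_per_ten=6` encodes "4 out of 10 samples may be unusable".
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    return (usable_needed * 10 + usable_per_ten - 1) // usable_per_ten


MINUTES_PER_RECORDING = 30  # collect 30 minutes for each 2,000-word sample

needed = 300  # assumed target, e.g. a 300-text spoken component
to_record = recordings_to_collect(needed)
total_hours = to_record * MINUTES_PER_RECORDING / 60

print(to_record)    # 500 recordings
print(total_hours)  # 250.0 hours of audio
```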



Recording Information About Samples
• File headers: annotation schemes like TEI account for these
  • Bibliographic info, ethnographic info, recording info, annotation info, etc.
• Directory structures and file names
  • Usable for the builders, for the users?

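As a rough illustration of what a file header holds, the sketch below builds a bare-bones TEI-style header with Python's standard XML library. The element names (`teiHeader`, `fileDesc`, `titleStmt`, `publicationStmt`, `sourceDesc`) come from the TEI Guidelines; the `build_header` helper and all field values are hypothetical, and a real header would also carry ethnographic, recording, and annotation details.

```python
import xml.etree.ElementTree as ET


def build_header(title, distributor, source):
    """Build a minimal TEI-style header holding bibliographic information."""
    header = ET.Element("teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")

    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title

    publication_stmt = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication_stmt, "p").text = distributor

    source_desc = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source_desc, "p").text = source
    return header


# All values below are made up for illustration.
header = build_header(
    title="Sample S1B-071: business transaction",
    distributor="Unpublished course corpus",
    source="Audio recording, 30 minutes, transcribed by the corpus team",
)
xml_text = ET.tostring(header, encoding="unicode")
print(xml_text)
```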


Partial directory structure of the American component of ICE (Figure 3.1 of Meyer (2002))

Spoken
  Dialogues
    Business transactions
      Draft: S1B-071d, S1B-072d, etc.
      Lexical version: S1B-071l, S1B-072l, etc.
      Proofread version (I): S1B-071p1, S1B-072p1, etc.
      Proofread version (II): S1B-071p2, S1B-072p2, etc.
    Classroom discussions
    Political debates
    Spontaneous conversations
    Broadcast discussions
    Broadcast interviews
    Legal conversations
  Monologues
Written
  Printed
  Non-Printed
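File names such as S1B-071d pack a text ID and a processing stage into one string, which makes them easy to check by script. The sketch below assumes the naming pattern visible in the figure (text ID plus one of the suffixes d, l, p1, p2); the regular expression and the `parse_name` helper are illustrative guesses, not part of the ICE documentation.

```python
import re

# Suffix -> processing stage, as shown in the directory figure.
VERSIONS = {
    "d": "draft",
    "l": "lexical version",
    "p1": "proofread version (I)",
    "p2": "proofread version (II)",
}

# Assumed pattern: S or W, a digit, a letter, a hyphen, three digits, then the suffix.
NAME_PATTERN = re.compile(r"^(?P<text_id>[SW]\d[A-Z]-\d{3})(?P<suffix>d|l|p1|p2)$")


def parse_name(name):
    """Split a file name into (text ID, processing stage)."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name!r}")
    return match.group("text_id"), VERSIONS[match.group("suffix")]


for name in ["S1B-071d", "S1B-071l", "S1B-072p1", "S1B-072p2"]:
    print(name, "->", parse_name(name))
```

A check like this can run over a whole directory to catch misnamed files before they reach users.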
Lecture 3
• Corpus Design II (Annotation)
• Readings: Meyer (2002) Ch 4; Sampson and McCarthy (2005) Ch 39; Garside (1997) Chs 4, 5, 16
• Inform me and Ayisigi (in writing) of your chosen corpus tool for the software review by 17 March. Check with Ayisigi beforehand that the tool suits the task criteria.

