
 spoken vs. written
 monolingual vs. bi/multilingual
 parallel vs. comparable corpora (translation corpora)
 general language purpose vs. specialised
language purpose
 diachronic vs. synchronic
 plain text vs. annotated (tagged) text
Dr AFIDA MOHAMAD ALI
FBMK
Spoken Corpora
 aim at representing spoken language
 London-Lund Corpus (LLC)
 Lancaster/IBM Spoken English Corpus (SEC)
 Cambridge and Nottingham Corpus of
Discourse in English (CANCODE)
 Santa Barbara Corpus of Spoken American
English (SBCSAE)
 Wellington Corpus of Spoken New Zealand
English (WSC)
Written Corpora
 aim at representing written language
 BROWN Corpus (written texts, AE in 1961)
 LOB Corpus (Comparable to BROWN Corpus,
BE, early 1960s)
 FROWN Corpus (AE, early 1990s)
 FLOB Corpus (BE, early 1990s)
Multilingual Corpora
 aim at representing several, at least two, different
languages, often with the same text types (for
contrastive analyses)
 Parallel corpora (source texts plus translations):
Canadian Hansard

 Comparable corpora (monolingual subcorpora
designed using the same sampling techniques):
Aarhus corpus of contract law
 Multilingual
 Bilingual
Multilingual Corpora
Important resources for translation and contrastive
studies.
Multilingual corpora…
 …give new insight into the language compared
 …can be used to study language
specific and universal features
 …illuminate differences between
source texts and translations
 …can be used for a number of practical
applications, in lexicography, language teaching,
translation, etc.
Parallel Corpora
 Bilingual vs. Multilingual
 Unidirectional (from La to Lb or from Lb
to La alone) vs. Bidirectional (from La to
Lb and from Lb to La) vs. Multidirectional
(from La to Lb, Lc etc.)
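The directionality labels above can be derived from the translation directions a corpus contains. The following is a minimal sketch, assuming each translated text is recorded as a hypothetical (source language, target language) pair with distinct languages; the function name and language codes are illustrative, and any corpus with more than two languages is simply treated as multidirectional.

```python
def classify_directionality(pairs):
    """Classify a parallel corpus by its translation directions.

    `pairs` is an iterable of (source_language, target_language) tuples,
    one per translated text (assumed: source and target always differ).
    """
    directions = set(pairs)
    if not directions:
        raise ValueError("no translation pairs given")
    if len(directions) == 1:
        # Only one direction, e.g. La -> Lb alone.
        return "unidirectional"
    languages = {lang for pair in directions for lang in pair}
    if len(languages) == 2:
        # Two distinct directions over two languages: La -> Lb and Lb -> La.
        return "bidirectional"
    # More than two languages involved, e.g. La -> Lb, La -> Lc.
    return "multidirectional"


# Hypothetical language codes, for illustration only.
print(classify_directionality([("en", "fr"), ("en", "fr")]))
print(classify_directionality([("en", "fr"), ("fr", "en")]))
print(classify_directionality([("en", "fr"), ("en", "de")]))
```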
Comparable corpora
A corpus containing components that
are collected using the same sampling
techniques and similar balance and
representativeness, e.g. the same
proportions of the texts of the same
genres in the same domains in a range
of different languages in the same
sampling period.
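The "same proportions" requirement can be checked mechanically. The following is a minimal sketch, assuming each language component is described by hypothetical (genre, word count) records; the genre labels, word counts and the 5% tolerance are invented for illustration and are not taken from any real corpus.

```python
from collections import Counter


def genre_proportions(texts):
    """Return each genre's share of the total word count in a subcorpus."""
    totals = Counter()
    for genre, n_words in texts:
        totals[genre] += n_words
    grand_total = sum(totals.values())
    return {genre: count / grand_total for genre, count in totals.items()}


def is_comparable(sub_a, sub_b, tolerance=0.05):
    """True if both subcorpora cover the same genres in similar proportions."""
    prop_a, prop_b = genre_proportions(sub_a), genre_proportions(sub_b)
    if prop_a.keys() != prop_b.keys():
        return False
    return all(abs(prop_a[g] - prop_b[g]) <= tolerance for g in prop_a)


# Invented figures for two language components (La and Lb).
corpus_la = [("press", 40_000), ("fiction", 30_000), ("academic", 30_000)]
corpus_lb = [("press", 42_000), ("fiction", 29_000), ("academic", 29_000)]
print(is_comparable(corpus_la, corpus_lb))
```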
 For the latest comprehensive website on corpora and
corpus tools, go to
 http://martinweisser.org/corpora_site/CBLLinks.html
Comparable vs. parallel
corpora
The sampling frame is essential for
comparable corpora but not for parallel
corpora because the texts are exact
translations of
each other.
General Corpora
 Broadest type of corpus: very large (more than 10
million words) and containing a variety of language, so
that findings from it may be somewhat generalized.

 Although no corpus will ever represent all possible
language, generalized corpora seek to give users as
much of a whole picture of a language as possible.

 Analysis of patterns of language use as a whole.


Examples:
 British National Corpus (BNC; 100,106,008 words)
 The American National Corpus
 ICE – regional corpus
 COCA (The Corpus of Contemporary American English)

 These large, generalized corpora contain written texts
(newspaper and magazine articles, works of fiction and
nonfiction, writing from scholarly journals) and spoken
transcripts (informal conversations, government
proceedings and business meetings).
 If generalizations about language as a whole are to be
drawn, a large general corpus should be consulted.
Specialized Corpora
 Compiled to describe language use in a specific
variety, register or genre.

 Contains texts of a certain type and aims to be
representative of the language of this type.

 Can be large or small and is often created to
answer very specific questions.
 MICASE (1,700,000 words of English spoken in the
academic domain)
 Contains only spoken language from a university
setting
 CHILDES Corpus - contains language used by children

 MICUSP (Michigan Corpus of Upper-level Student Papers)
– a collection of papers from a range of university
disciplines

 Medical corpus – contains language used by nurses and
hospital staff

 Guangzhou Petroleum English Corpus (411,612 words of
written English from the petrochemical domain)

 HKUST Computer Science Corpus (1,000,000 words of
written English sampled from undergraduate textbooks in
computer science)
 CPSA (Corpus of Professional Spoken American
English)

 Specialized corpora – often used in ESP settings

 The AWL (Academic Word List) was generated from a
specialized corpus of academic texts
Diachronic Corpora
Also known as historical corpora.
Texts date to different periods in time. Ideal
to study language change and history.
 Brown/Frown
 Lob/Flob
 Helsinki Diachronic Corpus of
English Texts (8th-18th century)
 ARCHER Corpus – A Representative Corpus of
Historical English Registers (BE and AE,
1650-1990).
Synchronic Corpora
Useful for comparing varieties of English. Texts all date
to the same period.
 Brown and Lob
 Frown and Flob
 International Corpus of English (ICE) (Texts
produced after 1989)
 BNC
Learner/developmental
Corpora
 Specialized corpus that contains written texts
and/or spoken transcripts of language used by
students who are currently acquiring the language.
 aim at representing the language as produced by
learners of this language.
 Learner corpora are often tagged and can be
examined, e.g., to see common errors students
make.
Lstr or L2 acquisition/L1 acquired by children
 International Corpus of
Learner English – ICLE (LC)
 Generalized corpora
 Contains essays written by English language learners
with 14 different native languages.

 Standard Speaking Test Corpus (SST)
 More specialized
 E.g., composed of oral interview tests of Japanese
learners.
Other examples:
 CHILDES (DC)
 Cambridge Learner Corpus (LC)

 Targeted instruction can be developed for general
language teaching or for specific language groups
depending on the type of learner corpus.
Pedagogic Corpora
 It is a corpus that contains language used in classroom
settings.

 It can include academic textbooks, transcripts of
classroom interactions, or any other written text or
spoken transcript that learners encounter in an
educational setting.
 Pedagogic corpora can be used:
 to ensure students are learning useful language,
 to examine teacher-student dynamics, or
 as a self-reflective tool for teacher development
Monitor Corpora
Constantly supplemented with fresh material and
keep increasing in size, though the proportion of
text types included in the corpus remains constant.
 Bank of English (BoE)
 Global English Monitor Corpus
 AVIATOR
Multimedia corpora
 Santa Barbara Corpus of Spoken American English

 Multimedia corpus of European teenager talk
(SACODEYL)
 Nottingham eLanguage Corpus (2010) – SMS/MMS
messages, emails, blogs.
Types of corpora
 Corpora
 Spoken vs. written
 Monolingual vs. bi-/multilingual
 Monolingual
 Language for General Purposes (LGP): reference corpora
 Language for Special Purposes (LSP): medical corpora, economic corpora, legal corpora
 Bi-/multilingual
 Comparable: L1, L2, L3 ... L-N
 Parallel: translations (L1 to L2), bidirectional (L1 to L2 and L2 to L1), free translations
 Written corpora
 Synchronic (e.g. varieties of English: BrEn, USEn, Euro-English, etc.)
 Diachronic (e.g. Modern English, Medieval English, etc.)
English Corpora
 The Brown Corpus (1964)
1 million words (500 samples of about 2,000 words each), written
American English, texts published in the US in 1961
 The Lancaster-Oslo/Bergen (LOB) Corpus (1978)
similar to the Brown corpus, British English, texts from
1961 (compiled 1970-1978)
English Corpora
 The London-Lund Corpus (LLC)
200 samples, ~5000 words each, 1953-1987, spoken
British English, transcribed.
 The Frown Corpus
Freiburg-Brown Corpus of American English (1992)
1990s analogue to the Brown corpus (1 million
words, written American English).
 The FLOB Corpus
Freiburg-LOB Corpus of British English, 1990s
analogue to the LOB corpus (1 million words,
written British English).
English Corpora
 The British National Corpus (BNC)
100 million-word, samples of written texts (90m
words) and spoken language (10m words).
 The International Corpus of English (ICE)
500 samples (300 spoken, 200 written), ~2,000 words
each, 1990 onwards, 20 national varieties of English
(e.g. UK, India, Singapore, Australia, Jamaica)
 The BoE Corpus (The Bank of English Corpus)
450M words, full texts, open, written and spoken,
mainly US and UK
Web Corpora
Adam Kilgarriff - http://www.kilgarriff.co.uk/

Marco Baroni - http://www.form.unitn.it/~baroni/

Web Corpora:

 Google: www.google.com

 www.webcorp.org.uk

Web Corpora resources:

 BootCat
http://corpora.fi.muni.cz/bootcat/

 VIEW: Variation in English Words and Phrases
Mark Davies / Brigham Young University
http://view.byu.edu/
Uses of Corpora
 Lexicography / terminology
 Linguistics / computational linguistics
Dictionaries & grammars (Collins Cobuild English Dictionary for
Advanced Learners; Longman Grammar of Spoken and Written
English)
Critical Discourse Analysis
- Study texts in social context
- Analyze texts to show underlying ideological meanings and
assumptions
- Analyze texts to show how other meanings and ways of talking
could have been used, and therefore the ideological implications
of the way things were stated
 Literary studies
 Translation practice and theory
 Language teaching / learning
ESL Teaching
LSP Teaching (exemplar texts)
Lexicography and corpora
Issues such as

1. How common are different words?
2. How common are the different senses for a given
word across registers?
3. Do words have systematic associations with other
words?
4. Do words have systematic associations with
particular registers or dialects?
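Questions 1 and 3 reduce to counting. The following is a minimal sketch, assuming a toy text in place of a real corpus and a deliberately simple tokenizer; the helper names (`tokenize`, `word_frequencies`, `right_collocates`) are illustrative, and a real study would use a proper corpus tool and a wider collocation window.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Question 1: how common is each word?"""
    return Counter(tokens)


def right_collocates(tokens, node):
    """Question 3: count words that appear immediately to the right of `node`."""
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == node)


# Toy text, invented for illustration.
sample = ("The little girl found a small box. "
          "A little boy and a little dog sat by the small box.")
tokens = tokenize(sample)
print(word_frequencies(tokens).most_common(3))
print(right_collocates(tokens, "little"))
print(right_collocates(tokens, "small"))
```

Running the same counts separately on subcorpora for each register or dialect would address questions 2 and 4, provided the texts carry that metadata.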
Linguistics and Corpora
 Research on empirical linguistics
 Study language use in various aspects
– Verify linguistic theory, e.g. the explanation of
definite description,
– Lexical studies, e.g. study the near-synonyms
‘little’ and ‘small’
– Sociolinguistics: compare the language
produced by different social groups
(m/f)
– Cultural study e.g. differences found in 2
comparable corpora (British/American) ….
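Comparing two corpora of different sizes requires normalized rather than raw frequencies. The following is a minimal sketch of the usual per-million-words normalization; the counts and corpus sizes are hypothetical and are not real corpus figures.

```python
def per_million(raw_count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    return raw_count * 1_000_000 / corpus_size


# Hypothetical counts of one word in two corpora, for illustration only.
count_a, size_a = 250, 1_000_000      # corpus A
count_b, size_b = 1_800, 10_000_000   # corpus B

freq_a = per_million(count_a, size_a)
freq_b = per_million(count_b, size_b)
print(f"A: {freq_a:.1f} pmw, B: {freq_b:.1f} pmw, ratio: {freq_a / freq_b:.2f}")
```

In this invented case the raw count is higher in corpus B, but the word is relatively more frequent in corpus A, which is why raw counts alone can mislead.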
Language Teaching and Corpus-based
approach
 Corpus based : use corpus as a resource
 Knowledge :
– Know better about English
answer specific questions of certain words,
phrases, structures.
– Know where the problems are
error analysis on a learner corpus
– Know what should be taught
word frequency, comparing native/learner
corpora
Language Teaching and Corpus-based
approach
 References :
– create better references
dictionary, grammar book, textbooks
– verify certain hypotheses about languages
find support examples / counter examples
– use a native corpus as a reference
see whether it is possible
which one is more natural
Language Teaching and Corpus-based
approach
 Corpus based : use corpus as a resource
Syllabus design :
– Native corpora => what is actually used
– Learner corpora => what the problems are
– Find out which aspects should be given priority
– Lexical syllabus = focus on frequency of occurrence
– How many words should the students know?
What are they?
– Knowing 90% or 95% of the words?
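The coverage question can be made concrete: how many of the most frequent word types does a learner need in order to recognize a given share of the running words in a text? The following is a minimal sketch, assuming a toy sentence in place of a native corpus; the function name is illustrative.

```python
from collections import Counter


def types_needed_for_coverage(tokens, target):
    """Return how many of the most frequent word types are needed
    to cover at least `target` (between 0 and 1) of all running words."""
    freqs = Counter(tokens)
    total = sum(freqs.values())
    covered = 0
    for rank, (_, count) in enumerate(freqs.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return rank
    return len(freqs)


# Toy text, invented for illustration: 17 running words, 11 word types.
tokens = ("the cat sat on the mat and the dog sat on the rug "
          "while the bird sang").split()
for target in (0.50, 0.90, 0.95):
    print(target, types_needed_for_coverage(tokens, target))
```

Even in this tiny sample, three word types cover half of the text while the last few percent of coverage require nearly every type, which is the pattern a lexical syllabus exploits.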
Language Teaching and Corpus-driven
approach
“In a corpus-driven approach the commitment of the linguist
is to the integrity of the data as a whole, and descriptions
aim to be comprehensive with respect to corpus evidence.
The corpus, therefore, is seen as more than a repository of
examples to back pre-existing theories or a probabilistic
extension to an already well defined system. […] Examples
are normally taken verbatim, in other words they are not
adjusted in any way to fit the predefined categories of the
analyst; recurrent patterns and frequency distributions are
expected to form the basic evidence for linguistic categories;
the absence of a pattern is considered potentially
meaningful.” (Tognini-Bonelli, Corpus linguistics at work,
2001:84)
Language Teaching and Corpus-driven
approach
 Corpus driven
– provides new paradigm of
teaching/learning
– students as researchers
– data-driven learning
– learn how to use concordances + corpora
– extract generalizations from data
– Is it possible?
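A concordance of the kind used in data-driven learning lists every occurrence of a word with its immediate context (keyword in context, KWIC). The following is a minimal sketch, assuming a toy text and a simple regular-expression tokenizer; the function name `kwic` and the three-word window are illustrative choices, not a description of any particular concordancer.

```python
import re


def kwic(text, node, width=3):
    """Return keyword-in-context lines for `node`, with `width` words
    of context on each side."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the node words line up.
            lines.append(f"{left:>25} [{token}] {right}")
    return lines


# Toy text, invented for illustration.
sample = ("She made a small mistake. He has a little time left. "
          "It was a small town with little traffic.")
for line in kwic(sample, "small"):
    print(line)
```

Students can read down the aligned column and generalize from what follows the node word, for example by comparing the lines for "small" with those for "little".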
