Ayesha Azhar
Bareera Akbar
Irum Masood
Maryam Ahmed
Tahira Jabeen
Essence of human beings
A sea of words
Incomprehensible consciously
Corpus: a Latin word meaning “body” or “mass”
A collection of written texts, especially the entire
works of a particular author or a body of writing
on a particular subject: “the Darwinian corpus”
Corpora (plural)
History of Corpus Linguistics
Language study is not a new idea.
1921: 30,000 words. A Treasure, but of no use.
1960s: with the advent of the computer....
The use of collections of COMPUTER-READABLE text for
language study.
Brown Corpus of Standard American English.
One million words of American English texts printed in 1961.
First electronic corpus
Corpus Linguistics
Linguistics being the scientific study of language
and its structure, ‘corpus linguistics’ is the study
of language “on the basis of text corpora.”
Research goals
Funding
Time
Staff/students
Written Corpora
Obtaining/creating, Storing, Organizing
Materials Required:
-scanner, OCR software
Process:
-paper document into electronic text file
Types:
-newspapers, periodicals
-small specialized corpora
-informal writings (travel diaries, e-mail,
discussion, blogs, newsgroups)
Spoken Corpora
deciding on a transcription system
I. prosodic/non-prosodic
II. representing interactional characteristics of
speech (overlapping speech, backchannels,
pauses, non-verbal contextual events)
III. permission to use data
IV. ensuring anonymity
V. avoiding impracticality of data
Markup
1. Structural markups:
-written corpus: Titles, authors, paragraphs, subheadings,
chapters etc.
-spoken corpus: Contextual events, paralinguistic features
2. Header:
-written corpus:
Classification into categories (register, genre, topic domain, discourse
mode, formality)
-spoken corpus:
Demographic information about the speaker (gender, social
class, occupation, age, native language/dialect)
Relationship among the participants
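The header information listed above can be stored as structured metadata alongside each text. The following is only a minimal sketch: the class names, field names, and the helper `texts_with_speaker` are hypothetical, and it covers just a subset of the demographic categories mentioned on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class SpeakerInfo:
    # Hypothetical subset of the demographic fields named in the header.
    gender: str
    age: int
    occupation: str
    native_language: str


@dataclass
class SpokenTextHeader:
    # One header per transcribed text; text_id is an assumed identifier.
    text_id: str
    speakers: list = field(default_factory=list)
    relationship: str = ""  # relationship among the participants


def texts_with_speaker(headers, **criteria):
    """Return ids of texts with at least one speaker matching all criteria."""
    return [
        h.text_id
        for h in headers
        if any(
            all(getattr(s, key) == value for key, value in criteria.items())
            for s in h.speakers
        )
    ]
```

With headers stored this way, a researcher could select, for instance, only the texts that include a speaker with a given native language before running any analysis.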
Linguistic Annotation
Parts of Speech Tagging:
Grammatical category, case assigning
Prosodic Annotation
Phonetic Annotation
Syntactic Parsing
Advantages of Tagging
Vast exploration
Frequency
Co-occurrence
Multiple meaning studies
Automatically retrievable
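Two of the advantages above, frequency and co-occurrence, can be illustrated on a tiny tagged sample. This is a sketch only: the sample tokens, the tag labels, and the function names are invented for illustration and do not come from any particular tagger or corpus.

```python
from collections import Counter

# Invented, hand-tagged sample: (word, part-of-speech) pairs.
tagged = [
    ("the", "DET"), ("bank", "NOUN"), ("raised", "VERB"), ("rates", "NOUN"),
    ("they", "PRON"), ("bank", "VERB"), ("on", "ADP"),
    ("the", "DET"), ("river", "NOUN"), ("bank", "NOUN"),
]


def tag_frequencies(tokens, word):
    """Count how often `word` occurs under each part-of-speech tag."""
    return Counter(tag for w, tag in tokens if w == word)


def cooccurrences(tokens, word, window=1):
    """Count the words found within `window` positions of `word`."""
    words = [w for w, _ in tokens]
    counts = Counter()
    for i, w in enumerate(words):
        if w == word:
            start = max(0, i - window)
            neighbours = words[start:i] + words[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts
```

Because each token carries a tag, the noun and verb uses of "bank" can be counted separately, which is the kind of multiple-meaning study that an untagged corpus makes much harder.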
Concordance Lines
Concordance lines are a useful tool for
investigating corpora, but their use is limited by
the ability of the human observer to process
information.
For example
• SARA for the BNC
• ICECUP for ICE-GB (ICE Great Britain).
• Concordancers can be used for the analysis of almost
any corpus.
Concordancer
One of the most frequently used concordancers is
‘WordSmith Tools’.
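What a concordancer produces can be sketched in a few lines. This is not how any of the tools named above work internally; it is a minimal keyword-in-context illustration, and the function name `concordance` and its `width` parameter are assumptions made for the example.

```python
import re


def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per match of `keyword` in `text`."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad the context so the keyword lines up in a single column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines
```

Printing the returned lines one under another aligns every occurrence of the keyword in a single column, which is what lets a human reader scan the left and right context for patterns.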
More text types and genres, to cover text types which are
less represented in corpora (letters, emails, leaflets, TV
programs, book synopses, recipes, short notes, chat room
logs, etc.).
More longitudinal language data:
from beginners to advanced levels, from children to adults,
from L1 to L2.
More variables:
more language learning variables should be collected and
encoded at the time of corpus collection (proficiency, language
aptitude, motivation, more precise description of the task, of
temporal, social or situational settings, etc.).
More languages:
to counterbalance the predominance of Anglo-Saxon native and
learner corpora and to foster the computer-aided analysis of
different languages and language families.
Prior to Corpus Linguistics it was difficult to note patterns of
use in language, since observing and tracking usage patterns
was a monumental task.
Scholars have used various types of corpora to gain insights
into changes related to language development, both in first
and second language situations.
Corpus Linguistics can show how language is used and
how that use varies in different situations.