
ASSIGNMENT

A COMPARATIVE ANALYSIS OF BNC & COCA

SUBMITTED BY: IFRAH ANUM (18081517-020), BS (A.T.S.) 5TH SEMESTER

SUBMITTED TO: MA’AM HAMNA SOHAIL

FACULTY OF ENGLISH
DEPARTMENT OF ENGLISH
UNIVERSITY OF GUJRAT
JANUARY, 2021
NAME: Ifrah Anum
ROLL NO.: 18081517-020
SEMESTER: 5th
SUBJECT: Corpus Linguistics

A COMPARATIVE ANALYSIS OF BNC & COCA

INTRODUCTION:

One of the first steps when it comes to studying a language is the compilation of data, so it can
be analysed and used for the purpose(s) required. As Sinclair (1991: 13) states:
“The beginning of any corpus study is the creation of the corpus itself. The decisions that are
taken about what is to be in the corpus, and how the selection is to be organized, control almost
everything that happens subsequently.”
A corpus is a searchable database of language samples for linguistic research. A corpus may be
based on written or spoken language. Some corpora are tagged or annotated by part of speech;
other corpora are plain text.
BRITISH NATIONAL CORPUS (BNC):

The British National Corpus (BNC) is a 100-million-word collection of samples of written and
spoken British English from the later part of the 20th century.
The BNC consists of a larger written part (90 %, e.g., newspapers, academic books, letters,
essays, etc.) and a smaller spoken part (the remaining 10 %, e.g., informal conversations, radio
shows, etc.). The spoken part is also available in audio format and can be played directly in
Sketch Engine. The corpus texts carry a large amount of metadata, so users can search by
criteria such as time of publication, the region where a spoken text was recorded, type of media,
text domain, or David Lee’s classification, a detailed genre specification. There is a broad
consensus among the participants in the project and among corpus linguists that a general-
purpose corpus of the English language would ideally contain a high proportion of spoken
language in relation to written texts. However, it is significantly more expensive to record and
transcribe natural speech than to acquire written text in computer-readable form. Consequently,
the spoken component of the BNC constitutes approximately 10 per cent (10 million words) of
the total and the written component 90 per cent (90 million words). These were agreed to be
realistic targets, given the constraints of time and budget, yet large enough to yield valuable
empirical statistical data about spoken English. In the BNC sampler, a two per cent sample taken
from the whole of the BNC, spoken and written language are present in approximately equal
proportions, but other criteria are not equally balanced.
From the start, a decision was taken to select material for inclusion in the corpus according to an
overt methodology, with specific target quantities of clearly defined types of language. This
approach makes it possible for other researchers and corpus compilers to review, emulate or
adapt concrete design goals.
The final make-up of the second version of the British National Corpus (the BNC World Edition)
is described in terms of:
• texts: number of distinct samples not exceeding 45,000 words
• S-units: number of <s> elements identified by the CLAWS system (more or less equivalent to sentences)
• W-units: number of <w> elements identified by the CLAWS system (more or less equivalent to words)
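
As a rough illustration of these three measures, the sketch below counts texts, <s> elements and <w> elements in a small XML fragment using Python's standard xml.etree.ElementTree module. The fragment, its <corpus> and <text> wrapper elements, the pos attribute values and the function name count_units are invented for this example; the real BNC files are far larger and carry richer markup, so this is only a minimal sketch of the counting idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified fragment (NOT real BNC data or the full BNC schema).
SAMPLE = """
<corpus>
  <text id="A01">
    <s n="1"><w pos="SUBST">Corpora </w><w pos="VERB">are </w><w pos="ADJ">useful</w><c>.</c></s>
    <s n="2"><w pos="PRON">They </w><w pos="VERB">grow</w><c>.</c></s>
  </text>
  <text id="A02">
    <s n="1"><w pos="SUBST">Speech </w><w pos="VERB">matters</w><c>.</c></s>
  </text>
</corpus>
"""


def count_units(xml_string):
    """Count texts, S-units (<s>) and W-units (<w>) in an XML string."""
    root = ET.fromstring(xml_string)
    return {
        "texts": len(root.findall(".//text")),
        "s_units": len(root.findall(".//s")),
        "w_units": len(root.findall(".//w")),
    }


counts = count_units(SAMPLE)
print(counts)
```

Punctuation is held in separate <c> elements in this fragment, so it is not counted as a W-unit.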
CORPUS OF CONTEMPORARY AMERICAN ENGLISH (COCA):

The Corpus of Contemporary American English (COCA) is the only large, genre-balanced
corpus of American English. COCA is probably the most widely used corpus of English, and it
is related to many other corpora of English hosted at English-corpora.org, which offer
unparalleled insight into variation in English.
The corpus contains more than one billion words of text (25+ million words for each year from
1990 to 2019) from eight genres: spoken, fiction, popular magazines, newspapers, academic texts,
and (with the update in March 2020) TV and movie subtitles, blogs, and other web pages.
There are four main ways to search the corpus:
First, you can browse a frequency list of the top 60,000 words in the corpus, including searches
by word form, part of speech, ranges in the 60,000-word list, and even by pronunciation. This
should be particularly useful for language learners and teachers.
Second, you can search by individual word, and see collocates, topics, clusters, websites,
concordance lines, and related words for each of these words. Note that some of these searches
are unique to COCA and iWeb.
Third, you can input entire texts and then use data from COCA to get detailed information on
the words and phrases in the text.
Fourth, you can search for phrases and strings. Because the corpus is optimized for speed,
searches for substrings (*ism, un*able) and phrases are very fast, e.g., got VERB-ed, BUY *
ADJ NOUN, "gorgeous" NOUN, and even high-frequency phrases like from ADJ to
ADJ, phrasal verbs, or NOUN.
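
The substring patterns mentioned above (*ism, un*able) can be tried out on a small scale with the sketch below. It uses Python's standard fnmatch module, in which * stands for any run of characters, including none. The word list and the function name wildcard_search are made up for this example, and the code only illustrates what such a pattern matches; it does not reproduce how COCA itself executes searches.

```python
from fnmatch import fnmatchcase

# Hypothetical mini word list standing in for a corpus vocabulary.
WORDS = ["realism", "unbelievable", "unable", "tourism",
         "table", "prism", "unthinkable", "ism"]


def wildcard_search(pattern, words):
    """Return the words that match a wildcard pattern such as '*ism'."""
    return [word for word in words if fnmatchcase(word, pattern)]


ism_words = wildcard_search("*ism", WORDS)
un_able_words = wildcard_search("un*able", WORDS)
print(ism_words)
print(un_able_words)
```

Because * may match nothing at all, "ism" itself matches *ism and "unable" matches un*able, while "table" is excluded because it does not begin with "un".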
You might pay special attention to the comparisons between genres and years and virtual
corpora, which allow you to create personalized collections of texts related to a particular area of
interest.
The Corpus of Contemporary American English (COCA) is the largest freely available corpus of
American English, with over 1 billion words, and the only large, genre-balanced corpus of
American English. COCA was released in 2008, and its texts date from 1990 to 2019. COCA
high-frequency word lists are also available for download.
Most importantly, the genre balance stays almost exactly the same from year to year, which
allows the corpus to model changes in the ‘real world’ accurately. It can therefore be used to
look at recent changes in English, including morphology (new suffixes -friendly and -gate),
syntax (including prescriptive rules, quotative like, so not ADJ, the get passive, resultatives,
and verb complementation), semantics (such as changes in meaning with web, green, or gay),
and lexis.
COMPARATIVE ANALYSIS:

BNC | COCA

100 M words of British English | 385+ M words of American English (at its 2008 release)

Work on building the corpus began in 1991 and was completed in 1994. No new texts have been added since the completion of the project, but the corpus was slightly revised prior to the release of the second edition, BNC World (2001), and the third edition, BNC XML Edition (2007). | 20 M words per year for 1990-2008

Since the completion of the project, two sub-corpora with material from the BNC have been released separately: the BNC Sampler (a general collection of one million written words and one million spoken words) and the BNC Baby (four one-million-word samples from four different genres). | Updated every 6-9 months

Newspapers, academic books, letters, essays, informal conversations, radio shows | Equally divided among spoken, fiction, popular magazines, newspapers, and academic texts

Covers a variety of different genres. | Useful for studying variation across genres and over time.

CHARACTERISTIC | BNC | COCA

Words | 100 million words | more than one billion words

Time period | 1980-1993 | 1990-2019

Consists of | written part (90 %): newspapers, academic books, letters, essays; spoken part (remaining 10 %): informal conversations, radio shows | spoken, fiction, popular magazines, newspapers, academic texts, and (with the update in March 2020) TV and movie subtitles, blogs, and other web pages
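
Because the two corpora differ so much in size, raw counts from the BNC and COCA cannot be compared directly; a common remedy is to normalize each count to a frequency per million words. The sketch below shows that arithmetic. The corpus sizes are the rounded figures given in the table above, while the hit counts and the function name per_million are hypothetical and chosen only for illustration.

```python
def per_million(raw_count, corpus_size):
    """Normalize a raw frequency to occurrences per million words."""
    return raw_count * 1_000_000 / corpus_size


# Rounded corpus sizes taken from the comparison above.
BNC_SIZE = 100_000_000
COCA_SIZE = 1_000_000_000

# Hypothetical raw hit counts for the same search term (not real query results).
bnc_hits = 1_500
coca_hits = 12_000

bnc_pm = per_million(bnc_hits, BNC_SIZE)
coca_pm = per_million(coca_hits, COCA_SIZE)
print(f"BNC: {bnc_pm:.1f} per million, COCA: {coca_pm:.1f} per million")
```

With these invented figures, the term has eight times as many raw hits in COCA, yet its normalized frequency is lower there (12 per million against 15 per million in the BNC), which is why the normalization step matters.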
CONCLUSION:

The British National Corpus (BNC) is a 100-million-word collection of samples of written and
spoken British English from the later part of the 20th century. The BNC consists of a larger
written part (90 %, e.g., newspapers, academic books, letters, essays, etc.) and a smaller spoken
part (the remaining 10 %, e.g., informal conversations, radio shows, etc.). The spoken part is
also available in audio format and can be played directly in Sketch Engine.
The corpus texts carry a large amount of metadata, so users can search by criteria such as time
of publication, the region where a spoken text was recorded, type of media, text domain, or
David Lee’s classification, a detailed genre specification.
The Corpus of Contemporary American English (COCA) is the only large, genre-balanced
corpus of American English. COCA is probably the most widely used corpus of English, and it
is related to many other corpora of English hosted at English-corpora.org, which offer
unparalleled insight into variation in English. The corpus contains more than one billion words
of text (25+ million words for each year from 1990 to 2019) from eight genres: spoken, fiction,
popular magazines, newspapers, academic texts, TV and movie subtitles, blogs, and other web pages.
BIBLIOGRAPHY
English-corpora.org. 2021. [online] Available at:
<https://www.english-corpora.org/davies/articles/davies_44.pdf> [Accessed 3 January 2021].

Guides.lib.umich.edu. 2021. Research Guides: Linguistics Resources: Corpora. [online] Available at:
<https://guides.lib.umich.edu/c.php?g=282869&p=1884909> [Accessed 3 January 2021].

Hilario, P., 2021. COCA Vs BNC, Two Corpora or Lexicological Databases, What Are the Odds of
Having the Same Features? [online] Academia.edu. Available at:
<https://www.academia.edu/44241712/COCA_vs_BNC_two_corpora_or_lexicological_databases_what_are_the_odds_of_having_the_same_features> [Accessed 3 January 2021].

Sketch Engine. 2021. British National Corpus (BNC) Search | Sketch Engine. [online] Available at:
<https://www.sketchengine.eu/british-national-corpus/> [Accessed 3 January 2021].

English-corpora.org. 2021. Corpus of Contemporary American English (COCA). [online] Available at:
<https://www.english-corpora.org/coca/> [Accessed 3 January 2021].

Burnard, L., 2021. [Bnc] About the British National Corpus. [online] Natcorp.ox.ac.uk. Available at:
<http://www.natcorp.ox.ac.uk/corpus/> [Accessed 3 January 2021].
