
Corpora and Concordancers
WEEK 5
ENGLISH FOR THE HUMANITIES AND SOCIAL SCIENCES
GE2412
What is a corpus?

 “A collection of naturally occurring examples of language,
consisting of anything from a few sentences to a set of written texts
or tape recordings, which have been collected for linguistic study.”
(Hunston, 2002, p. 2)

 The plural of corpus is corpora.


“I don’t think there can be any corpora,
however large, that contain information about
all of the areas of English ... that I want to
explore [but] every corpus that I’ve had a
chance to examine, however small, has taught
me facts that I couldn’t imagine finding out
about in any other way.”
(Fillmore, 1992, p. 35)
Why use corpora?

 Corpora can be used in academic research to learn more about a
language.
 Corpora can be used by language learners to find answers to
questions about the best way to say something.

 Discussion activity: With the person sitting next to you, list some
specific questions you have about the best way to say something
in English.
Benefits of corpus data

 Corpus data is natural: it shows how people really speak and write.
 Corpora are often very large, so the patterns they show reflect the
usage of many speakers and writers.
 Corpus data is contextualized: we can see how language is used
differently in different situations.
 Corpus data can reveal differences that intuition alone cannot
detect.
What can we learn from corpora?

By consulting corpora, we can learn things like:

1. The grammatical behaviour of words, e.g. which preposition to use
after a verb. Do we say “scare people up” or “scare people off”?
2. Words which are often used together (we call these collocations).
Do we say “play a job” or “do a job”?
3. The patterns words appear in. Do we say “bread and butter” or
“butter and bread”?
4. How frequently words and phrases are used in different contexts.
Can you use “I” in academic writing?
Major corpus (1):
The Corpus of Contemporary American English
(COCA)

 More than one billion words of text (25+ million words each year,
1990-2019).
 Eight genres: spoken, fiction, popular magazines, newspapers,
academic texts, and (with the update in March 2020) TV and
movie subtitles, blogs, and other web pages.
Major corpus (2):
The British National Corpus (BNC)

 Originally created by Oxford University Press in the 1980s and early
1990s.
 Contains 100 million words of text from a wide range of genres (e.g.
spoken, fiction, magazines, newspapers, and academic).
Major corpus (3):
Brown Corpus

 The first text corpus of American English.
 Consists of 1 million words (500 samples of 2,000+ words each) of
running text of edited English prose printed in the United States
during 1961. It was revised and amplified in 1979.
Major corpus (4):
MICASE

 A collection of nearly 1.8 million words of transcribed speech
(almost 200 hours of recordings) from the University of Michigan
(U-M) in Ann Arbor.
Access COCA

 Which corpus is the best one to use depends on your purpose.
 In this course we will use COCA. It can be accessed
at https://corpus.byu.edu/coca/
 COCA was created by researchers at Brigham Young University in
the United States. They generously allow free use of their corpus.
 However, registration (which is free) is needed to perform more
than a few queries. Additional features, such as saving queries and
avoiding ads, are available with a subscription. The Department of
English provides a subscription available to all students and staff at
CityU.
COCA Registration

(please refer to the demonstration video)


 Step 1: Access https://corpus.byu.edu
 Step 2: Go to “My Account>Register”
 Step 3: Input your name, email address and set up a password
 Step 4: Once you submit the form, you will receive an email. Click on the link in that
email.
 Step 5: Log into your account again
 Step 6: Choose City University of Hong Kong on the list.
 Step 7: Log in again.
 Step 8: Click “join license”.
 Step 9: Enter “English2018” as the password. Click “submit”.
Learning collocations with
COCA
The notion of collocations

 One feature common to natural languages is that words tend to occur
together with a restricted set of other words.
 These frequently co-occurring word strings are known as collocations.
 One common criterion for selecting collocations is their frequency of
occurrence in the corpus.

(Revier, 2009)
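
To make the frequency criterion concrete, the short Python sketch below counts which words directly follow a chosen word in a text. It is only an illustration: the sample sentences are invented (they are not COCA data), the function names tokenize and right_collocates are hypothetical, and COCA's own collocate search is more sophisticated than this.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def right_collocates(tokens, node):
    """Count the words that appear immediately after `node`."""
    counts = Counter()
    for current, following in zip(tokens, tokens[1:]):
        if current == node:
            counts[following] += 1
    return counts

# Toy text invented for illustration; it is not COCA data.
sample = (
    "The underlying cause of the delay was unclear. "
    "Doctors looked for the underlying cause first. "
    "We questioned the underlying assumptions of the model. "
    "The underlying problem remained."
)

counts = right_collocates(tokenize(sample), "underlying")
for word, freq in counts.most_common():
    print(word, freq)
```

In this toy text, "cause" follows "underlying" more often than any other word, so it would be the strongest candidate collocate by the frequency criterion.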
Activity 1: Query a corpus

We’ll now consult the Corpus of Contemporary American English (we
call that “querying” a corpus) to find the answers to some questions.
 What are some of the most common nouns following “underlying”?
Possible answers

Underlying cause/s
Underlying problem
Underlying assumptions
Activity 2: Query a corpus

 With a partner, try to find what preposition usually follows the word
“gravitate”. Provide two or three examples of words that follow
‘gravitate’.
Possible answers

 gravitate toward
 gravitate to
Activity 3: Query a corpus

Querying the Corpus of Contemporary American English, try to find
the answer to the following question.

 Do we say “make a job” or “play a job”?


Concordance Data

 Clicking on a particular collocation displays its concordance data, that is,
the keyword shown in context.
 Concordance data includes the year in which the phrase was used, as well as
the text type and sub-genre from which the phrase was extracted.
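
A keyword-in-context (KWIC) display can be sketched in a few lines of Python. This is a simplified illustration, not how COCA builds its concordance lines: the function name kwic, the width parameter and the two sample sentences are all assumptions made for this example, and no year or genre information is attached.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Sentences invented for illustration; they are not COCA data.
sample = ("Students gravitate toward familiar topics. "
          "Some readers gravitate to shorter texts.")

for left, node, right in kwic(sample, "gravitate"):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>25}  {node}  {right}")
```

Lining the keyword up in one column is what makes patterns such as "gravitate toward" and "gravitate to" easy to spot.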
Advanced search I

 Displays
 LIST: Shows a list of words or combinations of words, ordered by frequency.
 CHART: Shows a chart comparing the frequency of a word in different genres or
time periods.
 COMPARE: Compares two words according to their frequencies (either in general
or with a certain collocate).
 COLLOCATES: Finds a word (not a phrase) that occurs within up to 10 words
before or after the search word(s); you can choose the collocation range using
the two small boxes next to the COLLOCATE box.
 POS LIST: A list of parts of speech, used to look for a part of speech (a noun,
a verb, etc.) that occurs after a word.
Advanced search II

 Wildcard operators and other devices:

 [word] - all forms of a word
 [=word] - synonyms of the word
 word * - any word that comes after the word
 word* - any word that begins with these letters, whatever the ending
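
The Python sketch below imitates two of these devices on a list of words, to show what each one is asking for. It is a rough approximation under stated assumptions: match_query and LEMMA_FORMS are hypothetical names, the tiny lemma table is written by hand, and COCA's real query engine works on its own tagged index. Synonym search ([=word]) and the "word *" pattern are not covered.

```python
import re

# Hypothetical mini lemma table; COCA uses its own lemmatised index.
LEMMA_FORMS = {"go": {"go", "goes", "going", "went", "gone"}}

def match_query(query, tokens):
    """Return tokens matching a simplified [lemma] or prefix* query."""
    if query.startswith("[") and query.endswith("]"):
        lemma = query[1:-1]
        # Fall back to the bare word if the lemma is not in the table.
        forms = LEMMA_FORMS.get(lemma, {lemma})
        return [t for t in tokens if t in forms]
    if query.endswith("*"):
        pattern = re.compile(re.escape(query[:-1]) + r"\w*")
        return [t for t in tokens if pattern.fullmatch(t)]
    return [t for t in tokens if t == query]

# Word list invented for illustration.
tokens = "she went home and he goes out while explaining an explanation".split()
print(match_query("[go]", tokens))
print(match_query("expla*", tokens))
```

The first query returns the inflected forms of "go" found in the list; the second returns every word that starts with "expla".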
Use of square brackets: [word]

 So far, we have searched for single, exact word forms. If you are searching for a
verb, e.g. ‘go’, and would like to retrieve contexts containing all of its inflected
forms (go, went, gone, going, goes, etc.), type your search word in
square brackets: [go]
Activity 4: Use of square brackets: [word]

 Now, you can try that again with the following two
words. What inflected forms can you find?
 (i) [explain]
 (ii) [nice]
Exploring subcorpora
Activity 5: Discover whether a word is
more common in a particular text type
 Look at the example sentence below.
I am fully/totally aware of the problem.
 Answer the following questions by doing the word search in COCA.
a) In which kind of text is “totally” most frequently used?
b) In which kind of text is “fully” most frequently used?
c) Which word are we more likely to use in academic writing?
Instruction

 Type in ‘fully’
 Click on Chart
 Click on ‘See frequency by section’
 If necessary click ‘Change to Vertical chart’
 Repeat the same steps for “totally”
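
When comparing sections, raw counts can mislead because the sections differ in size, so it helps to compare a normalised rate such as occurrences per million words. The Python sketch below shows the arithmetic only. The counts and section sizes are invented for illustration (they are not COCA figures), and per_million is a hypothetical helper name.

```python
def per_million(raw_count, section_size):
    """Normalise a raw count to occurrences per million words."""
    return raw_count * 1_000_000 / section_size

# Invented counts and section sizes, for illustration only (not COCA figures).
sections = {
    "spoken":   {"hits": 1500, "words": 10_000_000},
    "academic": {"hits": 400,  "words": 8_000_000},
}

rates = {name: per_million(s["hits"], s["words"]) for name, s in sections.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} per million words")
```

With these made-up numbers the word is three times as frequent per million words in the spoken section as in the academic one, a gap that the raw counts alone would not show accurately.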
Possible solutions: the word “totally”
Possible solutions: the word “fully”
Discussion questions

 Which word are we more likely to use in an academic paper?
 Which word are you more likely to hear in spoken English?
Activity 6: Buy vs. Purchase

 Search for each word in COCA to answer the questions below.
a) In which kind of text is “buy” most frequently used?
b) In which kind of text is “purchase” most frequently used?
c) Which word is more common in academic writing?
d) Which word are you more likely to find in spoken communication?
Activity 7: Word collocation

Using COCA, find the more frequently used collocate for the word
“technology” in the sentence below.

I’m studying the utilization / application of modern technology in class.

Answer: ______________
Instruction
Suggested answers:

 Answer: application
Activity 8: collocation

 Look at the example sentences below. Using COCA,
choose the most frequently used word that collocates with
the word in red.

 I hope to succeed/ achieve the goal. ________________


 There has been a hot/ heated debate over the issue. __________
 He firmly/ highly recommended this place. ______________
Suggested answers

 I hope to achieve the goal.


 There has been a heated debate over the issue.
 He highly recommended this place.
Find and correct your
errors with COCA
Task: Improve your own writing

 Review one of your recent written assignments and find three places
where you received comments related to word choice or
grammatical mistakes.
 Using all the techniques you have learnt, correct your writing errors.
 Search for the frequency of each word or phrase, its different
combinations and their use, and complete the chart on the next page.
My word/phrase | Frequency (occurrences per 100,000 words) | What did you notice? | Ideas for better choices
e.g. entered onto | 9 | Usually followed by a noun, or in/into | entered into, or just entered
Extra: YouTube tutorials

Introduction to COCA:
https://www.youtube.com/watch?v=sCLgRTlxG0Y
Using Part-Of-Speech Tags:
https://www.youtube.com/watch?v=KP-7thiUnLM
Collocations:
https://www.youtube.com/watch?v=t_SxpfiPo_o
COCA- Lemmas, Parts of Speech Tags, and Wildcards
https://www.youtube.com/watch?v=3Oy7dL31rhY
COCA Bites: Using the Wildcard in Frequency Searches (5)
https://www.youtube.com/watch?v=_7mSZ6SRCjI
IMRD: Results
Results

 The results section describes the findings of the study.

 Task: Product review & results


 In pairs, take a look at the following product review and identify where
the results section is.
(Cordeaux, 2017)
Organising the results section

 The results section can be organized in different ways.


 Sometimes a short summary or preview of the results is given, before
they are presented in detail.
 Sometimes results are presented one by one, in detail.
 If you introduced formal research questions in the introduction, they
are usually answered in the same order.
 Findings are sometimes organized and presented as several substudies.
Guidelines

 Include only tables and figures that are necessary, clear, and worth
reproducing.
 Tables
 You should use tables only when necessary.
 You should maintain a uniform format when using more than one table.
 Figures (graphs, diagrams, maps, photographs, etc.)
 You should avoid including too much information in one figure.
 You should use figures only when they will help convey your information.

(Nair & Nair, 2014; University of Southern California, 2018)


Task: Read a passage and complete
the chart

 In pairs, read the article by Lutzky and Kehoe (2017).


 Lutzky, U., & Kehoe, A. (2017). ‘‘Oops, I didn’t mean to be so flippant’’.
A corpus pragmatic analysis of apologies in blog data. Journal of
Pragmatics, 116, 27-36.
 Then, complete the chart describing the characteristics of each
section.
Activity

                         | Introduction | Methodology | Results
Purpose                  |              |             |
Verb tense               |              |             |
Characteristics/elements |              |             |
Activity

                         | Introduction | Methodology | Results
Purpose                  | Describes rationale for the study | Provides information on what was done | Provides the data
Verb tense               | Present (refers to previous work) | Past | Past
Characteristics/elements | • Nature and scope of problem • Review of relevant literature • Principal results | • Description of materials and procedure • Sufficient detail so that procedure could be reproduced | • Observations • Results
Writing plan for LD Draft

 At this stage, you should have started exploring the use of COCA. Now, form a
group and read the assignment instructions with your groupmate(s).
 Pick a question and try searching for a few words from the suggested search terms
as a starting point. Keep a record of what you discover.

Topic Introduction Methodology Results

RQ 1/ 2/ 3
References

Cordeaux, S. A. (2017). Product review: Lipstick colors and swatches. Retrieved
from https://www.huffingtonpost.com/sana-alam/product-review-lipstick-c_b_8000324.html
Fillmore, C. (1992). “Corpus linguistics” or “computer-aided armchair linguistics”.
In J. Svartvik (Ed.), Directions in corpus linguistics: Proceedings of Nobel
Symposium 82 (pp. 35-60). Berlin: Mouton de Gruyter.
Hunston, S. (2002). Corpora in applied linguistics. Cambridge: Cambridge
University Press.
Nair, P. K. R., & Nair, V. D. (2014). Scientific writing and communication in
agriculture and natural resources. Switzerland: Springer International Publishing.
Revier, R. L. (2009). Evaluating a new test of whole English collocations. In A.
Barfield & H. Gyllstad (Eds.), Researching collocations in another language (pp.
49-59). London: Palgrave Macmillan.
University of Southern California. (2018). Organizing your social sciences research
paper. Retrieved from http://libguides.usc.edu/writingguide
Extra: Vocabulary assessment

 Various vocabulary assessment tools available at http://www.lextutor.ca/tests/


 Vocabulary Levels Tests (VLTs)
 To check vocabulary size at different word frequency levels – both receptive and productive
 2000, 3000, 5000, 10000-word levels; AWL
 Aim for a score of at least 80%

 Word Association Test


 Meaning (different senses of a word), collocations
 Vocabulary Knowledge Scale (VKS)
 To check “quality” or “depth” of vocab knowledge
 Vocab Profiler
 Lexical richness (type/token ratio) – more different words
 More frequent words or more low-frequency words being used
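
The type/token ratio mentioned above can be worked out by hand or with a few lines of Python. The sketch below is a minimal illustration: type_token_ratio is a hypothetical helper, the tokenisation is deliberately simple, and the Vocab Profiler may count words differently.

```python
import re

def type_token_ratio(text):
    """Return distinct word forms (types) divided by total words (tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Sentence invented for illustration: 11 tokens, 5 different word forms.
ratio = type_token_ratio("the cat saw the dog and the dog saw the cat")
print(round(ratio, 2))
```

A ratio closer to 1 means fewer repeated words. Because longer texts naturally repeat more words, the ratio is best compared across texts of similar length.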
