
Conducting corpus-based studies: Emerging challenges and opportunities

SHIRLEY N. DITA, Ph.D.


De La Salle University – Manila
shirley.dita@dlsu.edu.ph
Lecture outline

• Backgrounder
  - corpus
  - corpus linguistics
• Some resources: Online Corpora
• What can be done with corpora
  - previous work
  - on actual corpus
• Concluding remarks

@2021 Shirley N. Dita 5/23/22


What is a CORPUS?

Corpus linguistics approaches the study of language in use through corpora (singular: corpus).
• A corpus is a large collection of language texts, usually electronically stored and processed using a search engine.
• Francis (1982:7): “a collection of texts assumed to be representative of a given language, dialect, or other subset of language, to be used for linguistic analysis.”



A corpus is a collection of (1) machine-readable, (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety.
The Corpus Approach

• It is empirical, analyzing the actual patterns of language use in natural texts.
• It utilizes a large and principled collection of natural texts as the basis for analysis.
• It makes extensive use of computers for analysis.
• It depends on both quantitative and qualitative analytical techniques.



Advantages of computer-based language studies

• Machine-readability is a de facto attribute of modern corpora
• Speed of processing and ease of manipulating data
• Allows further automatic processing to be performed
• Can process machine-readable data accurately and consistently
• Can avoid human bias, hence more reliable



What Corpus Linguistics is NOT

Corpus Linguistics is NOT

• able to provide negative evidence (what is possible or correct, or what is not possible and incorrect, in language)
• able to explain why
• able to provide all possible language at one time



The Brown family

• Brown: published American English in the 1960s
• LOB (Lancaster-Oslo-Bergen): published British English in the 1960s
• Frown: early 1990s American English
• F-LOB: 1990s British English
Corpora all over the world

• The Survey of English Usage Corpus: used in the development of Quirk et al. (1985)
• Kolhapur (Indian English); Wellington (New Zealand); Australian Corpus of English (ACE)
• The British National Corpus (BNC): 100 million words
• The Bank of English Corpus (BoE): 450M words (mainly US and UK)
• Cambridge International Corpus (CIC): 1B+ words

Corpus size over time …

1960s-70s: Brown and LOB: 1M words

1980s: The Birmingham/Cobuild corpora: 20 M words

1990s: The British National Corpus: 100 M words

Early 21st Century: The Bank of English: 645 M words

Current trend: BIG DATA



English-Corpora.org



iWeb: 14B words



NOW Corpus: 15.2B words



Types of corpora

• General/reference vs. specialized corpora
• Written vs. spoken corpora
• Synchronic vs. diachronic corpora
• Monolingual vs. multilingual corpora
• Native vs. learner corpora
• Developmental vs. learner/interlanguage corpora
• Raw vs. annotated corpora
• Static/sample vs. dynamic/monitor corpora
What you can do with a corpus …

01 see the frequency of a word or phrase by date or variety or genre (or by other sections of the corpus)
02 limit by and compare words and phrases in different sections of the corpus (e.g. words that are more frequent in April 2020 than in March 2020)
03 find "collocates" (nearby words) of a given word, to investigate the meaning and usage of a word (following the maxim that "you can tell a lot about a word by the words that it hangs out with")
04 see concordance lines ("Keywords in Context"), to see patterns and phrases in which the word occurs
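
Frequency by section, concordance lines, and collocates can be sketched in a few lines of Python. This is only a toy illustration, not how English-Corpora.org works internally: the two-section `corpus` text is invented, and the tokenizer and the function names `per_million`, `kwic`, and `collocates` are assumptions made for the example.

```python
import re
from collections import Counter

# Toy two-section corpus (invented text, for illustration only).
corpus = {
    "spoken": "I baked bread today and the bread was warm",
    "written": "Fresh bread and butter were served with more bread and jam",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def per_million(word, tokens):
    """Normalized frequency: occurrences per million tokens."""
    return tokens.count(word) / len(tokens) * 1_000_000

def kwic(word, tokens, width=3):
    """Concordance lines: the word with `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def collocates(word, tokens, span=2):
    """Count the tokens found within `span` positions of the word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

sections = {name: tokenize(text) for name, text in corpus.items()}
for name, tokens in sections.items():
    print(name, round(per_million("bread", tokens)), kwic("bread", tokens))
print(collocates("bread", sections["written"]).most_common(3))
```

Normalizing to occurrences per million tokens is what makes sections of different sizes comparable; on a corpus this small the per-million figures are of course meaningless and serve only to show the arithmetic.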



Sample query on iWeb: BREAD

Frequency



Sample search: balikbayan box vs parcel



Sample query: STAYCATION



Philippine English words



Sample query: download



Thrice : antiquated … dead?



Areas in linguistics that use corpora …

• Lexicography
• Lexical studies
• Grammatical studies
• Register/genre analysis
• Language variation
• Contrastive analysis
• Translation studies
• Language change
• Language teaching
• Semantics
• Pragmatics
• Stylistics
• Literary study
• Sociolinguistics
• Discourse analysis
• Forensic linguistics
• Computational linguistics
• …
… outside linguistics

CORPUS LINGUISTICS TECHNIQUES are also used by:

• Historians (e.g. McEnery and Baker, 2016)
• Law scholars (e.g. Mouritsen, 2010)
• Sociologists (e.g. Zinn, 2018)
• Social sciences (in general)



Lexicography, lexical studies



Phil English entries in OED, June 2015

1. Mabuhay
2. Balikbayan
3. High-blood
4. Sari-sari store
5. Estafa
6. Despedida
7. Carnap
8. Halo-halo
9. Utang na loob
10. Comfort room
11. KKB
12. Barong
13. Pandesal
14. Suki
15. Bahala na
16. Presidentiable
17. Baon
18. Mani-pedi
19. Dirty kitchen
20. Sinigang
21. Kuya
22. Buko juice
23. Kikay
24. Barangay
25. Barkada
26. Gimmick
Phil English entries in OED, March 2016

• KILIG
  - it can be used as part of the phrase "kilig to the bones"
  - compounds: "kilig factor", "kilig moment"

Other words:
- Teleserye
- Vlog
October 2018 entries

• ambush interview (n.) – impromptu interview
• accomplish (v.) [forms and questionnaires rather than fill them out]
• bagoong (n.) – fish sauce
• bihon (n.) – long thin noodles
• ensaymada (n.) – spiral-shaped pastry with butter and cheese
• bold (adj.) [erotic or sexually explicit, not courageous]
• carinderia (n.) – low-key resto
• cartolina (n.) – thick, colored paper for posters
• dine-in (n. and adj.) – cf. eat in (SA); for here



October 2018 entries

• dirty ice cream, n.
• viand, n.
• holdupper, n.
• palay, n.
• panciteria, n.
• querida, n.
• rotonda, n.
• sorbetes, n.
• trapo, n.
• turon, n.



Verbing of nouns; nouning of verbs

• The nouning of verbs
  - There are new eats along Maginhawa!
  - We export our produce.
• And the verbing of nouns!
  - I'm soloing here! - Fat Amy (PP2)
  - I am waitressing at the Cheesecake Factory (Penny, TBBT)
  - We are holidaying in Bangkok!
  - We are goodbying again! – Tammy
  - Come over, we’re breakfasting!
Prison Break S5

SUITS, S8



Compounding

• Compounding/blending
  - Camwhore, attentionwhore
  - Foodporn, catporn, flowerporn, cloudporn
  - Eargasm, foodgasm, bedgasm
  - Staycation
  - Infomercial
  - Frenemies
  - Webinar, webisode
  - Guesstimate



Over-generalization of affixes

• pinkish … 10:30ish, Justin Bieberish, purplish
• doable … tweetable, IGable, sippable
• undo … unblock, unfriend, unsmile
• panicky … lyricky, vintagy
• pianist … apologist, revisionist, tiktokerist
• Christmassy … Thanksgivingy, Valentiney
• gentlemanly … Borlonganly …



Based on VS based from



Stay at home VS stay home



Physical VS Social distancing



reason for vs. reason to



DATA: singular or plural?

Per million words:


Singular: 776
Academic: 21
misc: 9.2
spoken: 1.9
newspaper: 1.6
fiction: 0.3

Per million words:


Plural: 1,035
academic: 42.5
misc: 8.8
spoken: 0.2
fiction/news: 0.1


Wildcard: *complete
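
A wildcard query such as `*complete` asks for every word form that ends in "complete". A minimal sketch of the same idea in Python, assuming shell-style matching via the standard `fnmatch` module and an invented word list; the corpus interface itself may treat wildcards differently:

```python
from fnmatch import fnmatchcase

# Invented word list standing in for a corpus vocabulary.
vocabulary = ["complete", "incomplete", "completely", "uncomplete", "compete"]

def wildcard_search(pattern, words):
    """Return the words matching a shell-style pattern, where * is any string."""
    return [w for w in words if fnmatchcase(w, pattern)]

print(wildcard_search("*complete", vocabulary))
```

Because `*` also matches the empty string, the bare form "complete" is returned alongside the prefixed forms.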





Grammatical Studies

• demise of the inflected form whom
• use of less instead of fewer with countable nouns
  - less [fewer] people
  - lesser [fewer] problems
• regularization of irregular morphology
  - dreamt → dreamed
  - burnt → burned
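
One way to investigate a change such as dreamt → dreamed is to compare the share of each variant in texts from different periods. The following is a toy sketch only: the sample sentences are invented, and the names `texts_by_period`, `variant_pairs`, and `variant_share` are hypothetical. A real study would load corpus files, for example from the Brown family, instead of the inline strings.

```python
import re
from collections import Counter

# Invented sample sentences; a real study would load corpus files here.
texts_by_period = {
    "1960s": "She dreamt of home. The toast burnt. He dreamt again. It burned.",
    "1990s": "She dreamed of home. The toast burned. He dreamt again. It burned.",
}

# Each pair is (irregular form, regularized form).
variant_pairs = [("dreamt", "dreamed"), ("burnt", "burned")]

def variant_share(text, irregular, regular):
    """Return the proportion of regularized forms among both variants."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    total = counts[irregular] + counts[regular]
    return counts[regular] / total if total else 0.0

for period, text in texts_by_period.items():
    for irregular, regular in variant_pairs:
        share = variant_share(text, irregular, regular)
        print(f"{period}: {regular} = {share:.0%} of {irregular}/{regular}")
```

Reporting a proportion rather than raw counts keeps the comparison fair when the two periods are represented by different amounts of text.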



Grammatical Studies

• spread of the s-genitive to non-human nouns
  - the book’s cover
  - the bag’s handle
• a tendency towards analytical comparison of disyllabic adjectives
  - politer, politest → more polite, most polite
• elimination of shall as a future marker in the first person
  - I will return!



Grammatical Studies

• Depassivization of progressives
  - The ticket is printing [is being printed]
• Get-passive over be-passive
  - I got hired [I was hired]
  - The man got shot. [The man was shot]
• spread of “singular” they
  - Everybody came in their black suit.
Grammatical Studies

• How is futurity realized in XX?
  - Modal: will/shall
  - Periphrastic modal: be + going to
  - Simple present: The gate closes at 9 pm.
  - Present progressive: I am leaving tomorrow.
• emergence of be like as a quotation-introducing verb
  - She’s like, “Wow! I never thought that!”
  - And I was like, “Really, are you serious?”
Multi-word verbs in World Englishes (Ella, 2019)



[Figure: bar chart comparing Spoken and Written texts (y-axis 0.0 to 1000.0) across Australia, Canada, New Zealand, Ireland, Great Britain, Philippines, India, Singapore, Hong Kong, East Africa, Nigeria, and Jamaica]
Adjuncts in Phil English (Morales, 2015)

[Figure: bar chart of totals (y-axis 0 to 450) by adjunct realization: prepositional phrase, finite clause, single adverb, adverb phrase, noun phrase, nonfinite clause, verbless clause, prepositional clause, prepositional adverb]

Disjuncts in World Englishes (Dita, 2017)

[Figure: bar chart of -ly disjuncts (y-axis 0 to 350) by variety: AUS, CAN, GB, IRE, NZ]




Split infinitives in WE (Gonzales & Dita, 2017)



Discourse Analysis

• What are the common forms of ‘name-calling’ or insults among teenagers?
• What are the common forms of expressing surprise, disgust, anger, disappointment?
• What are the forms of cursing among XX?
• What are the common discourse particles among X?
• What are the common vocatives for males? For females? For both?



Sociolinguistics

• What words are used to describe “politicians, police officers, DDS, etc.”?
• What are the usual collocates of UP, DLSU, student, man/guy/boy, woman/girl, mother/mommy, father/daddy?
• What patterns of description are used to refer to Martial law, EJK, Pnoy/president/Noynoy; to Filipinos; to GMA, Duterte?
• How are Filipinos/Pinoys/Pinays, etc. constructed in the papers?



How are man and woman described?



Critical Discourse Analysis

• The language of political campaigns
• The language of Duterte, Marcos, Leni
• The language of ‘apologists’, ‘trolls’
• The language of dilawans, pinklawans
• The political memes



Semantics

• A study on the functions of the following:
  - Negative words used positively: OC, bully, autistic
  - The semantics of ‘ass’, ‘solid’, ‘chill’, ‘steady’, ‘hot’



Philippine languages

• What Filipino words have emerged in SNS, as evident in the affixation?
  - Nag-download
  - Finorward
  - i-bluetooth
  - Pagti-tweet
  - Picturan
• What are the emerging patterns of aspectual affixation and reduplication in Tagalog?
  - Ikakamatay o ikamamatay
  - Ipapa-Xerox o Ipasi-Xerox
  - Ipapa-defend o ipade-defend
  - Ikakataba o ikatataba
  - TRINAY o TINRAY?
Filipino grammar

• The use of personal pronouns in the students’ essays
  - Kina → kila
  - Sina → si, sila
  - Nina → sila
• The demise of ang/ng; the rise of ‘yung’ and ‘nung’



What can corpora provide?

Corpora not only tell us what is possible to use, but also what is actually used, and what is typically used.

For non-native teachers of English, a corpus can be regarded as “an always available native-speaker consultant” (Römer, 2006, p. 129).
Some points to ponder on …

• Language will evolve, as it should, and there’s nothing we can do about it!
• Language will never stop changing; it will continue to respond to the needs of the people who use it.
• (English) teachers are NOT the gatekeepers of the (English) language …
• What was perceived to be ungrammatical or unacceptable a couple of years ago may be perfectly grammatical or acceptable in a specific variety.
  - error → deviation → innovation (feature of a particular variety)
• If in doubt, do a corpus-based investigation …
“A life without corpus linguistics is possible but meaningless.”
(Freely adapted from Vicco von Bülow)
