
Using Corpora for Language Research

COGS 523 - Lecture 9
Discourse Characteristics and Register Variations

10.11.23 COGS 523 - Bilge Say 1


Related Readings
Biber, Conrad and Reppen (1998).
Corpus Linguistics. Chs 5 and 6



Discourse Studies
 Text-based vs. corpus-based
 Text-based studies: lack of generalizability and of quantitative techniques
 Discourse features are hard to identify automatically
 Conventional corpus workbenches offer little help, but interactive tools used in conjunction with surface grammatical analysis tools can work



 Text Sample 5.1: News reportage
 Throtec. International Inc. said it
reached agreements with an investor
group and Wells Fargo Bank under
which it will receive loans and equity
infusion in return for stock that will
reduce the number of shares in public
hands by as much as 85 percent.
The engineering and consulting firm,
which has been plagued by losses of
five years, said the restructuring is
required to relieve its debt burden
and “acute shortage of cash.”
 Text Sample 5.2: Conversation
 A: Right, I’m ready. Have you locked
the back door? [pause] I thought we
were walking.
 B: Well do you want to walk or do
you want to go in the car.
 A: Well I have to go to the paper
shop.
 B: Well I’ll drop you at the paper
shop while I go round.
 A: Oh that’s a good idea.
Referring Expressions in
Different Text Types
 Exophoric (text-external) vs. text-internal
 Known vs. New
 London-Lund Corpus (Spoken)
 Conversation
 Public Speeches
 LOB Corpus
 News Reportage
 Academic Prose



Characteristics of Referring
Expressions
 Status of Information: Given vs new
 For given information: type of reference
(anaphoric, exophoric, or inferrable)
 For anaphoric reference, form of the expression (pronoun, synonym, or repetition)
 For anaphoric reference, the distance
between the anaphoric expression and
the antecedent



Illustrative Analysis
 Small sample of the texts from the
London-Lund and LOB corpora
were coded: the first 200 words in
forty texts (five texts from
conversation, nine texts from
public speeches, ten texts from
news reportage, and sixteen texts
from academic prose)
 Six noun phrase characteristics recorded:
1. register of the text
2. nominal form: pronoun versus full noun
3. information status: given versus new
4. if given, type of reference: anaphoric,
exophoric, or inferrable (last category not
included in the references)
5. if anaphoric and a full noun, type of
expression: synonym versus noun
repetition (pronouns have already been
identified in step 2)
6. if anaphoric, the distance between the
target referring expression and its
antecedent
Interactive Text Analysis
Program
 All texts are grammatically tagged
 The program stops at every noun and pronoun and asks for user feedback, chosen from a list
 Initial analysis, e.g. pronouns as anaphoric and given; repeated nouns as given...
 The user selects the antecedent if necessary; the program counts the number of noun phrases intervening between the referring expression and the antecedent.
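
The bookkeeping behind this kind of interactive coding can be sketched as follows. This is a minimal sketch, not the program used by Biber et al. (1998): the `NounPhrase` record, its field names, and the `default_analysis` and `distance` helpers are hypothetical. The defaults only mirror the initial analysis described above (pronouns proposed as given and anaphoric, repeated nouns as given), and distance is counted as the number of noun phrases intervening between a referring expression and the antecedent chosen by the user.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NounPhrase:
    """Hypothetical record for one coded noun phrase."""
    text: str
    is_pronoun: bool
    status: Optional[str] = None      # "given" or "new"
    reference: Optional[str] = None   # "anaphoric", "exophoric" or "inferrable"
    antecedent: Optional[int] = None  # index of the antecedent NP, set by the user


def default_analysis(nps: List[NounPhrase]) -> None:
    """Propose an initial coding that the analyst can confirm or override."""
    seen = set()
    for np in nps:
        if np.is_pronoun:
            np.status, np.reference = "given", "anaphoric"
            continue
        key = np.text.lower()
        # A noun phrase repeated verbatim is proposed as given, otherwise as new.
        np.status = "given" if key in seen else "new"
        seen.add(key)


def distance(nps: List[NounPhrase], index: int) -> Optional[int]:
    """Number of noun phrases between NP `index` and its antecedent."""
    ant = nps[index].antecedent
    if ant is None:
        return None
    return index - ant - 1


# Illustrative sequence of noun phrases (assumed already extracted from tagged text).
nps = [
    NounPhrase("the firm", False),
    NounPhrase("losses", False),
    NounPhrase("the firm", False),
    NounPhrase("it", True),
]
default_analysis(nps)
nps[3].antecedent = 0  # the analyst picks "the firm" (first mention)
print(distance(nps, 3))
```

Keeping the antecedent as an index into the noun phrase list makes the distance a simple subtraction; a real tool would also need to store the register and the synonym/repetition distinction listed above.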



[Figure: Frequency of given versus new referring expressions. Y-axis: referring expressions per 200 words (0 to 70). X-axis: register (Conversation, Speeches, News, Academic Prose). Series: new references, given references.]



[Figure: Frequency of exophoric and anaphoric referring expressions. Y-axis: referring expressions per 200 words (0 to 45). X-axis: register (Conversation, Speeches, News, Academic Prose). Series: exophoric pronouns, anaphoric pronouns, anaphoric nouns.]
Distance between RE and
antecedents
 On-line comprehension and production
requirements make a difference.
 Pronouns tend to occur much closer to
their antecedents than repeated full
nouns – holds across registers.
 Full noun expressions are preferred for
anaphoric reference over large distances.



Register          Average distance
Conversation      4.5
Speeches          5.5
News              11.0
Academic Prose    9.0

Table 5.1. Average distance measures for registers

Register          Average pronominal distance    Average full noun distance
Conversation      3.0                            9.0
Speeches          3.5                            10.0
News              3.0                            13.5
Academic Prose    2.5                            10.0

Table 5.2. Average distance measures for pronominal versus full noun anaphoric expressions (Biber et al., 1998)
Comments
 A larger number of texts and longer
text samples needed for
generalizable results.
 Other distinctions could be investigated, e.g. the distribution of referring expressions in main vs. dependent clauses.



Discourse maps of verb
tense and voice
 Marking of verb tense and voice can reflect larger rhetorical divisions within a text.
 Subtexts such as sections, as well as divisions that are not overtly marked, can reflect shifts in communicative purpose accompanied by shifts in linguistic features.
 Verb tense and voice shifts across the major sections (I: Introduction, M: Methods, R: Results, D: Discussion) of English medical research articles (19 articles published in 1985, taken from the ARCHER Corpus, with each text as a unit of analysis)



Linguistic feature    Introduction  Methods  Results  Discussion  Statistics
Present tense         47.9          21.1     35.9     60.6        F=29.25; p<0.001; r2=.549
Past tense            20.7          48.5     40.3     13.0        F=36.74; p<0.001; r2=.605
Agentless passives    18.4          39.9     16.9     16.3        F=33.17; p<0.001; r2=.580

Table 5.3. Mean scores (per 1000 words) of selected linguistic features across the I-M-R-D sections of English medical research articles (N=19) (Biber et al., 1998)
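
The F and r2 values reported with Table 5.3 are linked by simple arithmetic if one assumes a one-way analysis of variance with four groups (the I, M, R and D sections) and one observation per section per article (19 x 4 = 76 observations). That design is an assumption made here for illustration; the slides do not state it. Under it, r2 = SS_between / SS_total and F = (r2 / (k - 1)) / ((1 - r2) / (n - k)). The sketch below (the function name `f_from_r2` is hypothetical) recomputes F from each reported r2.

```python
def f_from_r2(r2: float, k: int, n: int) -> float:
    """F statistic of a one-way ANOVA with k groups and n observations,
    given the proportion of variance explained (r2 = SS_between / SS_total)."""
    df_between = k - 1
    df_within = n - k
    return (r2 / df_between) / ((1.0 - r2) / df_within)


# (r2, F) pairs as reported in Table 5.3; k=4 and n=76 are the assumed design.
reported = {
    "present tense": (0.549, 29.25),
    "past tense": (0.605, 36.74),
    "agentless passives": (0.580, 33.17),
}

for feature, (r2, f_reported) in reported.items():
    f_est = f_from_r2(r2, k=4, n=76)
    print(f"{feature}: F from r2 = {f_est:.2f}, reported F = {f_reported}")
```

Under the assumed design, the recomputed values agree with the reported F values to within the rounding of r2, which suggests that r2 here is the share of variance in each feature accounted for by section.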
Reflections on Frequency
Counts
 Present tense verbs in Introduction and
Discussion sections: emphasis on current state
of the art and the present implications of the
current research.
 Past tense in Methods and Results: focus on reporting past events and procedures.
 Methods: agentless passives, presenting events impersonally.
 How does a text develop? Are there systematic
patterns of variation within sections? What can
we do with texts that do not have overtly
marked sections?



Drawing a “map” of
progression of verbs
 Two medical research articles in
ecology. (from the Corpus of Writing
in the Disciplines)
 A program marks two binary distinctions over POS-tagged text: past vs. non-past (the latter including modals) and active vs. passive (non-finite clauses were excluded from the analysis)
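
A minimal sketch of how such a marking step might classify one finite verb group. It is not the program described in Biber et al. (1998): it assumes Penn Treebank-style POS tags (`VBD`, `VBZ`, `VBP`, `VBN`, `MD`), a hypothetical input format of `(word, tag)` pairs, and two rough heuristics. A group counts as past when its first verb is tagged `VBD`, and as passive when a form of "be" is followed later in the group by a past participle.

```python
from typing import List, Tuple

BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}


def classify_verb_group(group: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return (tense, voice): tense is "P" (past) or "NP" (non-past),
    voice is "A" (active) or "PS" (passive)."""
    # Keep only verbs and modals; adverbs inside the group are ignored.
    verbs = [(w.lower(), t) for w, t in group if t.startswith("VB") or t == "MD"]
    if not verbs:
        raise ValueError("no verb in group")
    # Tense: only a past-tense first verb counts as past; modals are non-past.
    tense = "P" if verbs[0][1] == "VBD" else "NP"
    # Voice: a form of "be" followed later by a past participle (VBN).
    voice = "A"
    for i, (word, _) in enumerate(verbs):
        if word in BE_FORMS and any(t == "VBN" for _, t in verbs[i + 1:]):
            voice = "PS"
            break
    return tense, voice


print(classify_verb_group([("were", "VBD"), ("collected", "VBN")]))
print(classify_verb_group([("may", "MD"), ("be", "VB"), ("used", "VBN")]))
print(classify_verb_group([("affect", "VBP")]))
```

Plotting the resulting labels in text order, one mark per verb group, gives the kind of "map" discussed here; real tagger output would need extra handling for get-passives and for tagging errors.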
[Figure: discourse map of verb tense and voice. Legend: NP = nonpast, P = past, A = active, PS = passive (Biber et al., 1998)]


Comments
 Transition zones between sections:
writers start a transition at the end
of one section, continue a transition
into the beginning of the following
section.
 Extensions possible: patterns of modal verbs, as well as perfect and progressive aspects.
Studying Register Variation
 Register: a cover term for varieties defined by their situational characteristics, such as purpose, topic, setting and interactiveness.
 We control a range of registers and switch from one to another; this ability is important for language acquisition and learning.
 Describing the linguistic characteristics of different registers may be a prerequisite to understanding and using this knowledge.



Corpus-based register
analysis
 Inclusion of a large number of texts
 Consideration of a wide range of linguistic features
 Comparison across registers

These requirements strengthen the applicability of a corpus-based approach.



Research Questions
 How do spoken and written registers differ in
their use of dependent clauses?
 What patterns in the use of linguistic features
are important in distinguishing among the major
spoken and written registers?
 How do texts from different academic disciplines
vary with respect to patterns of linguistic
variation?
 How do the internal sections of texts within a
single academic register vary linguistically?



Dependent Clause Use
 Are all kinds of dependent clauses functionally similar, that is, do they all represent structural elaboration and complexity?
 Previous studies: Written registers are
generally more structurally elaborated
than spoken ones
 Distribution of three kinds of dependent
clauses:
 Relative clauses
 Adverbial clauses
 Complement clauses



Illustrative Analysis
 Two written registers from the LOB corpus (80 academic prose texts, 14 official documents); two spoken registers from the London-Lund corpus (44 conversations, 14 prepared speeches)
 478,000 words
 Semi(?)-automatic counting based on POS-tagged text
 Only causative adverbial clauses are
counted
Register            Number of texts  Relative clauses  Causative adverbial subordinate clauses  that-complement clauses
Academic Prose      80               6.8               0.3                                      3.2
Official Documents  14               8.6               0.1                                      1.6
Conversations       44               2.9               3.5                                      4.1
Prepared Speeches   14               7.9               1.6                                      7.6

Table 6.1. Average frequencies of three dependent clause types (per 1000 words) in four registers (Biber et al., 1998)
Comments
 Academic prose, official documents and prepared speeches often focus on conveying information about particular referents in the text, which relative clauses help to identify. Conversation is more concerned with the interaction among participants and with causes and reasons, hence its more frequent causative adverbial clauses.
 That-complement clauses mark the stance of the writer or speaker (e.g. with verbs such as think, wish, hope).
 Treating all dependent clauses as one big category, or generalizing from one type only, is dangerous.



 Text sample 6.2: Conversation
 I wouldn’t want it before the end of June
anyhow Reynard because I’m going to
Madrid on the tenth...
 I rushed into the kitchen because I smelt
something was burning...
 Text sample 6.3: Prepared speeches
 There are many people who think that to be
a Christian is to lead a soft option in life ...
 We would hope that our students would
have a full understanding of the cultural
differences...
Importance of enough type
and token frequencies
 Having too few text samples can
lead to dramatically inaccurate
conclusions.
 The following sample, J30 from LOB, has 25 relative clauses per 1000 words (almost four times the academic prose average of 6.8) and no that-clauses at all (register average: 3.2 per 1000 words)
 Text sample 6.4: LOB Corpus
Academic Prose. J30
 Most Vale people also have kin ties
with people who live in these areas
and in other parts of south Wales with
whom they maintain effective social
relations. A larger number of Vale
people who do not work in the urban
areas nevertheless visit them fairly
regularly to see friends and relatives
who live there or who are in hospital
there...
Co-occurrence patterns in
linguistic features
 In the samples below, fragmented speech co-occurs with second-person pronouns, modals and wh-complement clauses, whereas academic prose shows frequent nouns, nominalizations, passive constructions and extraposed constructions (e.g. it is possible that...)
 Multidimensional Analysis (MD)
 Factor analysis for identifying sets of variables that are distributed in similar ways
 Count and normalize linguistic features in a representative corpus
 Each set of co-occurring linguistic features is called a “dimension”.
 Interpret the dimensions in terms of situational, social and cognitive functions, based on the assumption that co-occurrence reflects shared function.
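
The "count and normalize" step can be illustrated with a short sketch. Raw counts are converted to rates per 1,000 words so that texts of different lengths can be compared. The function name and the example counts below are hypothetical and are not taken from any corpus.

```python
def per_thousand(count: int, n_words: int) -> float:
    """Frequency of a feature normalized to a rate per 1,000 words of text."""
    if n_words <= 0:
        raise ValueError("text length must be positive")
    return count * 1000.0 / n_words


# Hypothetical raw counts for one 2,000-word text sample.
raw_counts = {"relative clauses": 50, "that-complement clauses": 0}
rates = {feature: per_thousand(c, 2000) for feature, c in raw_counts.items()}
print(rates)
```

Normalized rates of this kind are the input to the factor analysis; without normalization, longer texts would dominate simply because they contain more of everything.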



 Text sample 6.5: Conversation
 What you’d have to do, you know,
you tell him what you need to know,
he’d be able to tell you how to do it.
 Text sample 6.6: Academic Prose
 As has been repeatedly shown
cultural evolution is not a unilinear
process and it is possible that under
certain conditions a simpler social
formation may emerge out of a more
complex one.
Illustrative Analysis
 481 texts, 960,000 words
 LOB (written) and LLC (spoken)
 Sixteen major grammatical categories (Tense and aspect
markers, place and time adverbials, pronouns, questions,
nominal forms, passives, stative forms, modals etc.)
 Five major dimensions of variation were identified.
 Sets of features that occur in a complementary pattern
(positive-negative)
 Functional interpretation based on analysis of the communicative function(s) of the features and of the similarities and differences among the registers with respect to that dimension.
 This section is based on Biber (1988)



Features in parentheses are not used in the calculation of dimension scores.



Functional Interpretations
of Dimensions
 Dimension 1: Negative group: informational focus, with careful integration of information and precise word choice. Positive group: involved, non-informational focus, related to a primarily affective mode. The dimension reflects the primary purpose of the writer/speaker and the production circumstances.
 Dimension 2: Narrative vs. non-narrative; does not distinguish written from spoken registers



Factor Scores
 Calculate the dimension score of each text, then the mean dimension score of each register
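
A minimal sketch of this computation, under assumptions that go beyond the slide: normalized feature frequencies are first standardized across all texts (z-scores), and a text's dimension score is the sum of its standardized scores on the positive features minus the sum on the negative features, which is how the procedure in Biber (1988) is usually described. The function names, feature names, registers and numbers below are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize(texts, features):
    """Convert each feature's rate to a z-score across all texts."""
    z = [dict() for _ in texts]
    for f in features:
        values = [t["rates"][f] for t in texts]
        mu, sigma = mean(values), pstdev(values)
        for i, v in enumerate(values):
            z[i][f] = (v - mu) / sigma if sigma else 0.0
    return z


def dimension_scores(texts, positive, negative):
    """Score per text: sum of positive-feature z-scores minus negative ones."""
    z = standardize(texts, positive + negative)
    return [sum(zi[f] for f in positive) - sum(zi[f] for f in negative) for zi in z]


def register_means(texts, scores):
    """Mean dimension score of each register."""
    by_register = defaultdict(list)
    for t, s in zip(texts, scores):
        by_register[t["register"]].append(s)
    return {r: mean(v) for r, v in by_register.items()}


# Invented rates per 1,000 words, for illustration only.
texts = [
    {"register": "conversation", "rates": {"second_person_pronouns": 40.0, "nouns": 150.0}},
    {"register": "conversation", "rates": {"second_person_pronouns": 30.0, "nouns": 160.0}},
    {"register": "academic prose", "rates": {"second_person_pronouns": 1.0, "nouns": 300.0}},
    {"register": "academic prose", "rates": {"second_person_pronouns": 0.0, "nouns": 310.0}},
]
scores = dimension_scores(texts, positive=["second_person_pronouns"], negative=["nouns"])
means = register_means(texts, scores)
print(means)
```

With these toy numbers the conversation texts land on the positive side and the academic texts on the negative side, mirroring the involved versus informational contrast; the population standard deviation is used here only for simplicity.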



[Figure: Mean scores of English Dimension 1 for nine registers: “Involved versus informational production” (F=119.9, p<.0001, r2=84.3%). Score axis: -20 to 40. Registers shown: Official documents, Academic prose, Press editorials, General fiction, PREPARED SPEECHES, PUBLIC CONVERSATIONS, Personal letters, FACE-TO-FACE CONVERSATIONS, TELEPHONE CONVERSATIONS.]
[Figure: Mean scores of English Dimension 2 for eleven registers: “Narrative versus non-narrative discourse” (F=32.3, p<.0001, r2=60.8%). Score axis: -4 to 8. Registers shown: BROADCAST, Official documents, Academic prose, TELEPHONE CONVERSATIONS, FACE-TO-FACE CONVERSATIONS, Personal letters, Press reportage, PREPARED SPEECHES, Biographies, General fiction, Romance fiction.]
English for Special Purposes
 The MD study of general English serves as background
 Texts from the Corpus of Writing in the Disciplines
 History is not as narrative as often thought
 Subject matter affects linguistic realization



Category                                                                                                        No. of texts  Approx. no. of words
Ecology research articles (from Ecology, Journal of Ecology, and Journal of Animal Ecology)                     20            64,000
American history research articles (from The Journal of American History and The Western Historical Quarterly)  20            32,000

Table 6.3. Composition of subcorpus of biology and history research articles
[Figure: Mean scores of ecology and history research articles on Dimension 2, “Narrative versus non-narrative concerns”, shown alongside FACE-TO-FACE CONVERSATIONS and General Fiction.]
 Text sample 6.10: History research article
 Entertainer Josephine Baker posed a special problem
for the government. During her international concert
tours in the 1950s she harshly criticized American
racism. The United States government could not
restrict her travel by withdrawing her passport
because she carried the passport of her adopted
nation, France.
 Text sample 6.11: Ecology research article
 The effects of herbivores are potentially large and
long lasting. How herbivores affect nutrient cycles in
these forests is particularly important because
nutrient availability is generally low, and changes in
nutrient availability are major factors driving
succession. Furthermore, populations of boreal
herbivores fluctuate drastically between years and
decades...
[Figure: Mean scores of ecology and history research articles on Dimension 5, “Impersonal versus non-impersonal style”, shown alongside FACE-TO-FACE CONVERSATIONS and General Fiction.]
[Figure: Mean scores of ecology research article sections (Introduction, Methods, Results, Discussion) on Dimension 5, “Impersonal versus non-impersonal style”.]
Conclusion
 Corpus-based methods can be adapted even to areas of language study that are not (yet) automated.
 Complementary use of qualitative and quantitative methods



Next Week

 Meyer (2002) Pseudotitles Chapter


 Invited Talk by Ruken Çakıcı – pls come
at 9:40....

