basic material taken from Manning and Schütze [1999]; another useful overview: Krenn and Samuelsson [1997]
Collocations
Outline

1 Collocations
    Frequency
    Hypothesis Testing (+ background: Basic Probability Theory)
    The t-Test
    Pearson's Chi-Square Test (+ background: Distributions)
    Likelihood Ratio Test (+ background: Conditional Probability)
    Fisher's Exact Test
2 Verb Subcategorisation
    Precision and Recall
3 Semantic Similarity
    Latent Semantic Indexing
4 Register Analysis
5 References

Definitions

collocation: an expression consisting of two or more words that corresponds to some conventional way of saying things

Firth (1957): collocations of a given word are statements of the habitual or customary places of that word

examples:
    noun phrases: strong tea, weapons of mass destruction
    phrasal verbs: to make up
    other standard phrases: the rich and powerful
Frequency
most basic idea: start with a corpus, and count the relevant frequencies

if looking for two-word collocations, just count the frequencies of pairs of adjacent words

obvious problem: we get lots of useless high-frequency pairs; from the New York Times:
  C(w1 w2)   w1     w2
  80871      of     the
  58841      in     the
  26430      to     the
  21842      on     the
  ...
  12622      from   the
  11428      New    York
  10007      he     said
  ...
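The counting step is simple to sketch in Python; the tiny token list here is a stand-in for a real corpus such as the New York Times text:

```python
from collections import Counter

# stand-in for a tokenised corpus
tokens = "of the people in the city of the state".split()

# frequencies of pairs of adjacent words (bigrams)
bigram_counts = Counter(zip(tokens, tokens[1:]))

for (w1, w2), c in bigram_counts.most_common(3):
    print(c, w1, w2)
```

As in the table, raw frequency mostly surfaces function-word pairs like "of the".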
the basic idea of frequency may still be OK with some added conditions
idea: something like a mistaken but plausible reanalysis of a word or phrase

examples: to step foot in, baited breath, free reign, hone in, ripe with mistakes, for all intensive purposes, manner from Heaven, give up the goat

like a folk etymology, a malapropism, or a mondegreen, but not quite the same as any of these
Adding Filters
alternatively, if the problem is to find rather than to verify collocations, we can use filters based on part of speech

for instance, in the previous example of extracting collocations, we can use patterns like the following:
  Tag pattern   Example
  A N           linear function
  N N           regression coefficients
  A A N         Gaussian random variable
  N P N         degrees of freedom
  ...
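A sketch of such a filter in Python; the tag set and the tagged text here are simplified stand-ins (a real system would use full part-of-speech tags from a tagger):

```python
from collections import Counter

# (word, tag) pairs; A = adjective, N = noun, P = preposition (simplified)
tagged = [("linear", "A"), ("function", "N"), ("of", "P"),
          ("degrees", "N"), ("of", "P"), ("freedom", "N")]

# tag patterns that tend to indicate collocation candidates
patterns = {("A", "N"), ("N", "N"), ("N", "P", "N")}

candidates = Counter()
for n in (2, 3):  # bigram and trigram patterns
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if tuple(tag for _, tag in window) in patterns:
            candidates[" ".join(w for w, _ in window)] += 1

print(candidates)
```

Note that the filter is still purely syntactic: a spurious "function of degrees" would pass it too, which is why the frequency counts are kept alongside.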
example of a problem

Google query: junco partner lyrics (9440 whG)
Google query: junco partner lyrics connick (279 whG)
Google query: junco partner lyrics -connick (930 whG)

frequency counts over 100K are generally regarded as unreliable, but this may also be the case for smaller counts; the problems appear to be related to Google's indexing and its treatment of near-identical page matches
Adding Filters
applied to the same New York Times text, we get
  C(w1 w2)   w1      w2         Tag pattern
  11487      New     York       A N
  7261       United  States     A N
  5412       Los     Angeles    N N
  3301       last    year       A N
  ...
  1074       chief   executive  A N
  1073       real    estate     N N
  ...
Random Variable
the probability of an event is the likelihood that it will occur, represented by a number between 0 and 1:
  probability 0: impossibility
  probability 1: certainty
  probability 0.5: equally likely to occur as not
a random variable ranges over all the possible types of outcomes for the event being measured . . . example:
  random variable X = the result of rolling a die
  P(X = 1) = the probability of the die showing 1 = 1/6
  P(X = 1) = P(X = 2) = . . . = P(X = 6) = 1/6
properties
  the probability of an outcome is always between 0 and 1
  the sum of the probabilities of all outcomes is 1
Expectation and Variance

the expectation of X is E(X) = Σᵢ xᵢ P(xᵢ); the variance is Var(X) = E(X²) − E(X)²

from the die example, E(X) = 1·(1/6) + 2·(1/6) + . . . + 6·(1/6) = 3.5 and Var(X) = 1·(1/6) + 4·(1/6) + . . . + 36·(1/6) − 3.5² ≈ 2.92

Example

imagine a six-sided die where each outcome isn't equally likely, but instead P(X = 1) = P(X = 6) = 1/100, P(X = 2) = P(X = 5) = 4/100, P(X = 3) = P(X = 4) = 45/100

then E(X) = 1·(1/100) + 2·(4/100) + . . . + 6·(1/100) = 3.5, as before, but Var(X) = 1·(1/100) + 4·(4/100) + . . . + 36·(1/100) − 3.5² = 0.53
Estimating Probabilities
Maximum Likelihood Estimator (MLE)
used to estimate the theoretical probability from a sample

if a specific event has occurred m times out of n occasions, the MLE probability is m/n

the larger the number of occasions measured, the more accurate the MLE
Estimating Probabilities

imagine we don't know the population probabilities; we want to estimate them from a sample:

  die outcome              1   2   3   4   5   6   total
  number of times rolled   16  18  13  16  19  18  100

the sample mean is

  x̄ = (1/n) Σᵢ xᵢ = (16·1 + . . . + 18·6) / 100 = 3.58

and the sample variance is

  s² = Σᵢ (xᵢ − x̄)² / (n − 1)
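The sample statistics for the die-roll table can be checked directly:

```python
# die-roll sample from the table: outcome -> number of times rolled
counts = {1: 16, 2: 18, 3: 13, 4: 16, 5: 19, 6: 18}
n = sum(counts.values())  # 100 rolls

# sample mean: (1/n) * sum of the observed values
mean = sum(x * c for x, c in counts.items()) / n

# sample variance: sum of squared deviations over n - 1
var = sum(c * (x - mean) ** 2 for x, c in counts.items()) / (n - 1)

print(mean)  # 3.58
print(var)
```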
Estimating Probabilities
problem: sparse data

it's more difficult to estimate the probability of a rare event: if the corpus doesn't register, say, a rare word, the MLE for that word is 0
Hypothesis Testing
using frequencies, as previously, we might decide that new companies is a collocation because it has a high frequency

however, we don't really think it is one; maybe it's just that new and companies are individually frequent, and appear together by chance

hypothesis testing is a way of assessing whether something is due to chance; it has the following procedure:

  formulate a NULL HYPOTHESIS H0, that there is no association beyond chance
  calculate the probability p that the event would occur if H0 were true
  then, if p is too low (usually 0.05 or smaller), reject H0; retain it as a possibility otherwise
The t-Test

want a test that will say how likely or unlikely a certain event is to occur

the t-test compares a sample mean with a population mean, relative to the sample's variability:

  t = (x̄ − μ) / √(s² / N)

where N is the size of the sample, and μ is the population mean according to the null hypothesis

look up this t-value against a table; the table gives the t-value for a given confidence level and a given number of degrees of freedom (d.f. = N − 1):

  d.f.    p = 0.05   0.01    0.005   0.001
  1       6.314      31.82   63.66   318.3
  10      1.812      2.764   3.169   4.144
  20      1.725      2.528   2.845   3.552
  ∞       1.645      2.326   2.576   3.091

non-linguistic example:

  H0: the mean height of a population of men is μ = 158cm (vs a population of shorter men)
  sample data: size 200, with x̄ = 169 and s² = 2600

  t = (169 − 158) / √(2600 / 200) ≈ 3.05

looking up the table, t > 2.576, so we can reject H0 with 99.5% confidence
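The height example can be recomputed in a couple of lines:

```python
import math

# one-sample t-test for the height example: H0 says mu = 158
mu = 158.0      # population mean under the null hypothesis
x_bar = 169.0   # sample mean
s2 = 2600.0     # sample variance
n = 200         # sample size

t = (x_bar - mu) / math.sqrt(s2 / n)
print(round(t, 2))  # 3.05
```

Comparing against the 0.005 column for large d.f. (2.576) reproduces the rejection.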
Distributions
a PROBABILITY DISTRIBUTION FUNCTION is a function describing the mapping from random variable values to probabilities

these can be either discrete (from a finite set) or continuous
Gaussian Distribution

another important (continuous) one is the GAUSSIAN (OR NORMAL) DISTRIBUTION

defined by the function

  f(x) = (1 / (σ √(2π))) exp( −(x − μ)² / (2σ²) )

the population mean is μ, the variance σ²

a lot of data can be assumed to have this distribution, e.g. heights in a population; the t-test described previously assumes a normal distribution
Bernoulli Distribution
the discrete BERNOULLI DISTRIBUTION measures the probability of success in a yes/no experiment, with this probability called p

defined by P(X = 1) = p, P(X = 0) = 1 − p; the population mean is p, the variance is p(1 − p)
Binomial Distribution

the discrete BINOMIAL DISTRIBUTION measures the probability of the number of successes in a sequence of n independent yes/no experiments (Bernoulli trials), each of which has probability p

defined by P(X = k) = C(n, k) p^k (1 − p)^(n−k); the population mean is np, the variance is np(1 − p)

models things like the probability of getting k heads from n tosses of a fair coin

for large n, it can be approximated by the normal distribution
Zipf Distribution

the ZIPF DISTRIBUTION is a model of Zipf's law, which says that the frequency of any word is roughly inversely proportional to its rank in the frequency table

in its original form, Pn ∝ 1/nᵃ, where Pn is the frequency of a word, n is its rank, and a is a constant close to 1 (note that this is not a probability distribution)
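A quick numerical illustration of the law (the constant below is arbitrary, not from any real corpus): with a = 1, rank times frequency comes out roughly constant:

```python
# Zipf's law sketch: frequency at rank n is roughly C / n^a, with a close to 1
C = 10000   # arbitrary illustrative constant
a = 1.0

freqs = [C / n ** a for n in range(1, 6)]

# rank * frequency should then be roughly constant
products = [round(n * f) for n, f in enumerate(freqs, start=1)]
print(products)
```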
linguistic example:

we have 15,828 occurrences of new and 4,675 of companies, and there were 14,307,668 tokens overall

  P(new) = 15828 / 14307668,   P(companies) = 4675 / 14307668

H0 is that the occurrences of new and companies are independent:

  P(new companies) = P(new) P(companies) ≈ 3.615 × 10⁻⁷
the t-value comes out as t ≈ 0.999932

this is below any level of significance in our previous table; so we don't reject the null hypothesis: we can't say that new companies is a collocation

actually, the t-test will reject only very few possible collocations; the reason will come later . . . it is still useful for ranking, however
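The calculation for new companies can be sketched as follows, treating each bigram position as a Bernoulli trial (so s² ≈ x̄, since the sample mean is tiny); the count of 8 occurrences of the bigram is the same one used in the chi-square example:

```python
import math

# counts from the New York Times example
N = 14307668                     # total tokens
c_new, c_companies = 15828, 4675
c_bigram = 8                     # occurrences of the bigram "new companies"

# H0 (independence): mu = P(new) * P(companies)
mu = (c_new / N) * (c_companies / N)

# sample mean of the indicator "this bigram is 'new companies'"
x_bar = c_bigram / N
s2 = x_bar * (1 - x_bar)         # Bernoulli variance, approximately x_bar

t = (x_bar - mu) / math.sqrt(s2 / N)
print(t)  # about 1, well below every critical value in the table
```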
for comparing two samples, the t-value is calculated by

  t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

suppose H0 is that the words collocating with strong and with powerful come from the same distribution

  x̄₁ is the (mean) probability of the bigram strong w (for some word w)
  x̄₂ is the (mean) probability of the bigram powerful w
Pearson's Chi-Square Test
if looking for collocations of w1 and w2, the table will consist of the four combinations of w1, w2, not-w1 and not-w2

X² aggregates the differences between observed and expected values for these cells:

  X² = Σᵢⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ

where i ranges over table rows, j ranges over table columns, Oᵢⱼ is the observed value for cell (i, j) and Eᵢⱼ is the expected value

to calculate e.g. E₁,₁, we use the probability of the first word of a bigram being new and of the second word being companies:

  E₁,₁ = ((8 + 4667)/N) · ((8 + 15820)/N) · N ≈ 5.2
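The full X² computation for new companies, using the four observed cell counts implied by the totals above (the fourth cell is whatever remains of the N tokens):

```python
# observed 2x2 counts for w1 = new, w2 = companies
N = 14307668
O = {("new", "companies"): 8,
     ("new", "-companies"): 15820,
     ("-new", "companies"): 4667,
     ("-new", "-companies"): N - 8 - 15820 - 4667}

# marginal totals
row = {"new": 8 + 15820, "-new": N - 8 - 15820}
col = {"companies": 8 + 4667, "-companies": N - 8 - 4667}

X2 = 0.0
for (r, c), o in O.items():
    e = row[r] * col[c] / N   # expected count under independence
    X2 += (o - e) ** 2 / e

print(X2)  # about 1.55, below the critical value at alpha = 0.05
```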
the X² value is less than the critical value for α = 0.05 (3.841 at one degree of freedom), so we wouldn't reject H0: i.e. we wouldn't take new companies as a collocation, as before with the t-test

however, X² is also appropriate for large probabilities, where the normality assumption of the t-test fails
Conditional Probability
we've already in fact used the notion of independent events

two events are independent of each other if the occurrence of one does not affect the probability of the occurrence of the other

  tossing a coin and winning the lottery: independent
  speeding and having an accident: not independent
Example
the following table shows the weather conditions for 100 horse races and how many times Harry won:

           rain   shine
  win       15      5
  no win    15     65
conditional probability: the probability that one event occurs given that another event occurs

Harry won 20 out of 100 races: P(win) = 0.2 (by MLE)

the conditional probability of Harry winning given rain is P(win | rain) = 15/30 = 0.5

compare this with the χ² test: under the null hypothesis, the observed data was compared against the situation where the words were independent
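Harry's table can be worked through directly:

```python
# Harry's races: (outcome, weather) -> count, from the table
table = {("win", "rain"): 15, ("win", "shine"): 5,
         ("no win", "rain"): 15, ("no win", "shine"): 65}

n = sum(table.values())  # 100 races

p_win = (table[("win", "rain")] + table[("win", "shine")]) / n
p_rain = (table[("win", "rain")] + table[("no win", "rain")]) / n

# P(win | rain) = P(win and rain) / P(rain)
p_win_given_rain = (table[("win", "rain")] / n) / p_rain

print(p_win, p_win_given_rain)  # 0.2 and 0.5
```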
Likelihood Ratio
another approach to hypothesis testing

  more appropriate to sparse data than χ²
  more interpretable also: it says how much more likely one hypothesis is than another
Likelihood Ratio

we'll use the usual MLEs for p, p1, p2, writing c1, c2, c12 for the number of occurrences of w1, w2 and w1 w2:

  p = c2 / N,   p1 = c12 / c1,   p2 = (c2 − c12) / (N − c1)

we'll also use the notation b(k; n, p) for the binomial distribution:

  b(k; n, p) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ

now, the likelihoods are

  for Hypothesis 1, L(H1) = b(c12; c1, p) b(c2 − c12; N − c1, p)
  for Hypothesis 2, L(H2) = b(c12; c1, p1) b(c2 − c12; N − c1, p2)

Hypothesis 1 represents independence of w1 and w2; Hypothesis 2 represents dependence (and hence a possible collocation)
Likelihood Ratio

the log of the likelihood ratio is then

  log λ = log ( L(H1) / L(H2) )

for bigrams of powerful:

  −2 log λ   C(w1)   C(w2)   C(w1 w2)   w1           w2
  1291.42    12593   932     150        most         powerful
  99.31      379     932     10         politically  powerful
  82.96      932     934     10         powerful     computers
  80.39      932     3424    13         powerful     force
  57.27      932     291     6          powerful     symbol
  ...

the events we're interested in in text (individual words or n-grams) are in fact often less frequent than this: related to the Zipfian distribution of words

as an example, Dunning selected words from a 500,000 word corpus with frequencies of between 1 and 4; these included words like abandonment, clause, meat, poi and understatement

the log likelihood ratio is more accurate here (but still needs counts of at least 1)
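One row of the table can be rechecked numerically; the binomial coefficients cancel in the ratio, so only the p-dependent terms are needed (small differences from the published value are possible due to rounding in the counts):

```python
import math

def log_l(k, n, p):
    # log-likelihood of k successes in n trials, binomial coefficient omitted
    # (it is identical in numerator and denominator and cancels in the ratio)
    return k * math.log(p) + (n - k) * math.log(1 - p)

# counts for "powerful computers" from the table
N = 14307668
c1, c2, c12 = 932, 934, 10   # C(powerful), C(computers), C(powerful computers)

p = c2 / N
p1 = c12 / c1
p2 = (c2 - c12) / (N - c1)

log_lambda = (log_l(c12, c1, p) + log_l(c2 - c12, N - c1, p)
              - log_l(c12, c1, p1) - log_l(c2 - c12, N - c1, p2))

print(-2 * log_lambda)  # close to the 82.96 in the table
```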
Verb Subcategorisation
verbs express their semantic arguments with different syntactic means

the class of verbs with semantic arguments theme and recipient has a subcategory expressing these via a direct object and a prepositional phrase: he donated a large sum of money to the church

a second subcategory permits double objects: he gave the church a large sum of money
these subcategorisation frames are typically not in dictionaries; we might be interested in identifying them via statistics

Brent [1993] developed the system Lerner to assign one of six frames to verbs:

  Description       Good Example            Bad Example
  NP only           greet them              *arrive them
  tensed clause     hope he'll attend       *want he'll attend
  infinitive        hope to attend          *greet to attend
  NP & clause       tell him he's a fool    *yell him he's a fool
  NP & infinitive   want him to attend      *hope him to attend
  NP & NP           tell him the story      *shout him the story
Hypothesis Testing

Lerner works in two steps:

Define cues. Define a regular pattern of words and syntactic categories which indicates the presence of the frame with high certainty. For a particular cue cj we define the probability of error εj, which indicates how likely we are to make a mistake if we assign frame f to verb v based on cue cj.

a positive indicator for a transitive verb: (OBJ | SUBJ OBJ | CAP) (PUNC | CC), where OBJ = accusative-case personal pronouns, SUBJ OBJ = nominative- or accusative-case personal pronouns, CAP = capitalised word, PUNC = punctuation, CC = subordinating conjunction; this matches e.g. . . . greet/V Peter/CAP ,/PUNC . . .

Do hypothesis testing. Initially assume the frame is not appropriate for the verb: this is the null hypothesis H0. We reject H0 if the cue cj indicates with high probability that H0 is wrong.

suppose verb vi occurs a total of n times in the corpus and that there are m ≤ n occurrences with a cue for frame fj; this suggests a binomial distribution

we then reject the null hypothesis H0 that vi does not permit fj with the following probability of error:

  pE = Σ (from r = m to n) C(n, r) εjʳ (1 − εj)ⁿ⁻ʳ
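The error probability is a binomial tail sum; the n, m and ε values below are hypothetical, purely for illustration:

```python
from math import comb

def p_error(n, m, eps):
    # probability of seeing m or more cue occurrences in n verb occurrences
    # if every cue were spurious, each erring with probability eps
    return sum(comb(n, r) * eps ** r * (1 - eps) ** (n - r)
               for r in range(m, n + 1))

# hypothetical: a verb seen 50 times, 11 of those with the frame's cue,
# and a cue error rate of 0.05
pe = p_error(50, 11, 0.05)
print(pe)  # very small, so H0 (verb does not permit the frame) is rejected
```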
the resulting frame assignments were then assessed for how well they perform
Precision and Recall
the measures of PRECISION and RECALL are one way of doing that

imagine you have a system for sorting your objects of interest into two piles, relevant and irrelevant

the system can make two types of errors: classifying a relevant object as irrelevant, or an irrelevant one as relevant

system decisions can then be broken into four categories: true positive (TP), false positive (FP), false negative (FN), true negative (TN)

               system predicts:
               relevant   irrelevant
  relevant       TP          FN
  irrelevant     FP          TN
then, precision and recall are

  PRE = TP / (TP + FP)   and   REC = TP / (TP + FN)

for example, with TP = 20, FP = 30 and FN = 20, PRE = 20/50 and REC = 20/40
F-Measure
there's typically a trade-off between precision and recall; there are a number of ways of combining them into a single measure
one is the F-measure, the weighted harmonic mean of the two:

  F = 2 · PRE · REC / (PRE + REC)

an example set of results (MC is the total misclassified):

  TP   FP   TN   FN   MC   %MC   PRE    REC
  13    0   30   20   20    32   1.00   .39
  19    0   30   14   14    22   1.00   .58
  22    1   29   11   12    19    .96   .67
  25    1   29    8    9    14    .96   .76
  27    3   27    6    9    14    .90   .82
  29    5   25    4    9    14    .85   .88
  31    8   22    2   10    16    .79   .94
  31   13   17    2   15    24    .70   .94
  33   19   11    0   19    30    .63  1.00
Semantic Similarity
there are a number of resources that group words together by semantic relatedness

examples are thesauruses and WordNet; semantic relations include synonymy, hypernymy, etc.

e.g. dog and canine might be in a class together; this might be a hyponym of a class corresponding to animal
Example

say we have 5 terms of interest (cosmonaut, astronaut, moon, car, truck) and 6 documents

we describe their interaction by a matrix A, where cell aᵢⱼ contains the count of term i in document j:

  A =          d1  d2  d3  d4  d5  d6
    cosmonaut   1   0   1   0   0   0
    astronaut   0   1   0   0   0   0
    moon        1   1   0   0   0   0
    car         1   0   0   1   1   0
    truck       0   0   0   1   0   1

this can be thought of as a five-dimensional space (defined by the terms) with six objects in that space (the documents)

what we want to do is reduce the dimensions, thus grouping similar terms; the grouping is carried out by a process called Singular Value Decomposition (SVD)
Dimensionality Reduction

there are many possible types of dimensionality reduction; LSI chooses the mapping such that the reduced dimensions correspond to the greatest axes of variation

that is, if the new dimensions are numbered 1 . . . k, dimension 1 captures the greatest amount of commonality, dimension 2 the second greatest, and so on

Example

this process is carried out by the matrix operation called Singular Value Decomposition; here, the term-by-document matrix A (of size t × d) is decomposed into three other matrices:

  A(t×d) = T(t×n) S(n×n) (D(d×n))ᵀ

this decomposition is (almost) unique

T = [the t × n term matrix; its values are not reproduced here]
Example

  S =  2.16  0.00  0.00  0.00  0.00
       0.00  1.59  0.00  0.00  0.00
       0.00  0.00  1.28  0.00  0.00
       0.00  0.00  0.00  1.00  0.00
       0.00  0.00  0.00  0.00  0.39
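Assuming numpy is available, and that the rows of A are the terms cosmonaut, astronaut, moon, car, truck as in the example, the decomposition can be reproduced; the singular values come out matching the diagonal of S above:

```python
import numpy as np

# the term-by-document matrix A from the example
A = np.array([
    [1, 0, 1, 0, 0, 0],   # cosmonaut
    [0, 1, 0, 0, 0, 0],   # astronaut
    [1, 1, 0, 0, 0, 0],   # moon
    [1, 0, 0, 1, 1, 0],   # car
    [0, 0, 0, 1, 0, 1],   # truck
], dtype=float)

# SVD: A = T S D^T; numpy returns the singular values as a 1-d array
T, s, Dt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 2))  # approximately [2.16, 1.59, 1.28, 1.00, 0.39]
```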
Example

  Dᵀ = [the document matrix from the decomposition; its values are not reproduced here]
Example

so far we've just transformed the dimensions; now to reduce

for this example, we decide to reduce to 2 dimensions

to look at documents, we combine this reduced dimensionality with the weighting of the dimensions: derive a new matrix B(2×d) = S(2×2) (Dᵀ)(2×d)

  B =       d1    d2    d3    d4    d5    d6
    Dim 1  1.62  0.60  0.44  0.97  0.70  0.26
    Dim 2  0.46  0.84  0.30  1.00  0.35  0.65

Conceptually . . .

we can imagine that the terms are made up of semantic particles

perhaps along the lines of Wierzbicka's semantic primitives; however, these are not defined a priori, but only a consequence of the relations in the given set of documents

LSI rearranges things so that the terms with the greatest number of semantic particles in common are grouped
Register Analysis
this is work done by Douglas Biber; these notes are from Biber [1993]
Descriptive Analysis

example: from the Brown corpus, mean frequencies of three dependent clause types (per 1000 words):

  register             relative   causative adverbial    that complement
                       clauses    subordinate clauses    clauses
  press reports         4.6        0.5                    3.4
  official documents    8.6        0.1                    1.6
  conversations         2.9        3.5                    4.1
  prepared speeches     7.9        1.6                    7.6
from this, we can see e.g. that relative clauses are common in official documents and prepared speeches relative to conversation

we may be interested in grouping many of these characteristics of text together

Dimension Identification

Biber carried out a quantitative analysis of 67 linguistic features in the LOB and London-Lund corpora

features included: tense and aspect markers, place and time adverbials, pronouns and pro-verbs, nominal forms, prepositional phrases, adjectives, lexical specificity, lexical classes (e.g. hedges, emphatics), modals, specialised verb classes, reduced forms and discontinuous structures, passives, stative forms, dependent clauses, coordination, and questions

frequencies were counted, and normalised to per-1000 values

then FACTOR ANALYSIS was applied: this is a dimensionality reduction procedure very similar to LSI, and the dimensions similarly end up in decreasing order of explanatory power
Dimension Identification

after inspecting the results of the factor analysis, Biber interpretively labelled the first five dimensions:

  1 Informational vs Involved Production
  2 Narrative vs Nonnarrative Concerns
  3 Elaborated vs Situation-Dependent Reference
  4 Overt Expression of Persuasion
  5 Abstract vs Nonabstract Style
References
Douglas Biber. Using Register-Diversified Corpora for General Language Studies. Computational Linguistics, 19(2):219–241, 1993.

Michael Brent. From Grammar to Lexicon: Unsupervised Learning of Lexical Syntax. Computational Linguistics, 19(2):243–262, 1993.

Ted Dunning. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1):61–74, 1993.

Brigitte Krenn and Christer Samuelsson. The Linguist's Guide to Statistics: Don't Panic. URL http://coli.uni-sb.de/christer. Version of December 19, 1997.

Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, USA, 1999.

Ted Pedersen. Fishing for Exactness. In Proceedings of the South-Central SAS Users Group Conference, Austin, TX, USA, 1996.