
Statistics for Linguists: A Tutorial

Mark Dras
Centre for Language Technology, Macquarie University

HCSNet Summerfest, 28 November 2006

Tutorial Structure

primarily as some examples of tasks linguists might be interested in
within these, statistical ideas that are useful
    hypothesis testing
    various statistical measures ($\chi^2$, likelihood ratios, ...)
    statistical distributions
    some other useful ideas (e.g. Latent Semantic Analysis)
basic material taken from Manning and Schütze [1999]
another useful overview: Krenn and Samuelsson [1997]

Outline

1 Collocations
    Frequency
    Hypothesis Testing (+ background: Basic Probability Theory)
    The t-Test
    Pearson's Chi-Square Test (+ background: Distributions)
    Likelihood Ratio Test (+ background: Conditional Probability)
    Fisher's Exact Test
2 Verb Subcategorisation
    Precision and Recall
3 Semantic Similarity
    Latent Semantic Indexing
4 Register Analysis
5 References

Collocations

Definitions

collocation: an expression consisting of two or more words that correspond to some conventional way of saying things
    Firth (1957): Collocations of a given word are statements of the habitual or customary places of that word
examples:
    noun phrases: strong tea, weapons of mass destruction
    phrasal verbs: to make up
    other standard phrases: the rich and powerful
has limited compositionality
    more than an idiom
    example: international best practice

Frequency

most basic idea: start with a corpus, and count the relevant frequencies
    if looking for two-word collocations, just count frequencies of pairs of adjacent words
    obvious problem: get lots of useless high-frequency words
    from New York Times:

C(w1 w2)   w1     w2
80871      of     the
58841      in     the
26430      to     the
21842      on     the
...
12622      from   the
11428      New    York
10007      he     said
...

basic idea of frequency still maybe OK with conditions:
1 use when looking to verify specific alternatives or patterns; or
2 add a filter
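The counting step described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the text is already tokenised by whitespace, and the toy sentence and the name `bigram_counts` are made up for the example rather than taken from the tutorial.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs (w1, w2) in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

# Toy text standing in for a real corpus (hypothetical example data).
text = "the cat sat on the mat and the cat slept on the mat"
counts = bigram_counts(text.split())

# Print the most frequent bigrams as: C(w1 w2), w1, w2.
for (w1, w2), c in counts.most_common(3):
    print(c, w1, w2)
```

Even on this toy input the top pairs are dominated by function words, which is the problem the raw-frequency approach runs into on real text.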

Example #1a: Eggcorns

described in Language Log
    http://itre.cis.upenn.edu/myl/languagelog/
idea: something like a mistaken but plausible reanalysis of a word or phrase
    examples: to step foot in, baited breath, free reign, hone in, ripe with mistakes, for all intensive purposes, manner from Heaven, give up the goat
    like folk etymology, a malapropism, or a mondegreen; but not
collection by Chris Waigl
    http://www.eggcorns.lascribe.net
interested in seeing whether eggcorn is gaining currency
    compare by Google hits
    e.g. inclement weather (173K whG) vs inclimate weather (11K whG) vs incliment weather (719 whG)

Example #1b: Snowclones

also defined at Language Log
idea: adaptable cliché frames
examples: Have X, will travel; X is the new Y; X, we have a problem
again, use Google hits

Aside: Google Counts

there's been discussion about the reliability of Google-derived frequencies
    see Jean Véronis' blog and Language Log
example of problem
    Google query: junco partner lyrics (9440 whG)
    Google query: junco partner lyrics connick (279 whG)
    Google query: junco partner lyrics -connick (930 whG)
frequency counts over 100K generally regarded as unreliable, but this may also be the case for smaller counts
problems appear to be related to Google's indexing, and treatment of near-identical page matches

Adding Filters

alternatively, if the problem is to find rather than to verify, can use filters based on part of speech
for instance, in previous example of extracting collocations, can use patterns like the following:

Tag pattern   Example
A N           linear function
N N           regression coefficients
A A N         Gaussian random variable
N P N         degrees of freedom
...
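A minimal sketch of the part-of-speech filter described above, assuming the tokens have already been tagged with the single-letter tags used in the pattern list (A for adjective, N for noun). The tagged sample and the names `PATTERNS` and `filtered_bigrams` are illustrative only, and only the two-word patterns are handled.

```python
from collections import Counter

# Two-word tag patterns from the pattern list (A = adjective, N = noun).
PATTERNS = {("A", "N"), ("N", "N")}

def filtered_bigrams(tagged_tokens):
    """Count adjacent pairs whose tag sequence matches an allowed pattern."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in PATTERNS:
            counts[(w1, w2)] += 1
    return counts

# Hypothetical pre-tagged input; a real pipeline would get tags from a POS tagger.
tagged = [("the", "D"), ("linear", "A"), ("function", "N"), ("of", "P"),
          ("the", "D"), ("regression", "N"), ("coefficients", "N")]
result = filtered_bigrams(tagged)
print(result)
```

Pairs such as "of the" never reach the counter, so the high-frequency function-word bigrams drop out before ranking.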

Adding Filters (cont.)

applied to same New York Times text, get

C(w1 w2)   w1       w2          Tag pattern
11487      New      York        A N
7261       United   States      A N
5412       Los      Angeles     N N
3301       last     year        A N
...
1074       chief    executive   A N
1073       real     estate      A N
...

similarly, given particular adjectives, can find most frequent co-occurring nouns:

w         C(strong, w)   w           C(powerful, w)
support   50             force       13
safety    22             computers   10
sales     21             position    8
...
man       9              man         8
...

good, but still not perfect
    e.g. man occurring in both lists
    want to ignore if man is relatively common by itself

Hypothesis Testing (+ background: Basic Probability Theory)

Random Variable

the probability of an event is the likelihood that it will occur, represented by a number between 0 and 1:
    probability 0: impossibility
    probability 1: certainty
    probability 0.5: equally likely to occur as not
a random variable ranges over all the possible types of outcomes for the event being measured
example:
    random variable X = the result of rolling a die
    P(X = 1) = the probability of the die showing 1 = 1/6
    P(X = 1) = P(X = 2) = ... = P(X = 6) = 1/6
properties
    the probability of an outcome is always between 0 and 1
    the sum of probabilities of all outcomes is 1

Summary Measures for Random Variables

the (arithmetic) mean, or average: $E(X) = \sum_i x_i P(x_i)$
    from the die example, E(X) = 1·(1/6) + 2·(1/6) + ... + 6·(1/6) = 3.5
the variance, to measure the spread: $\mathrm{Var}(X) = E\big((X - E(X))^2\big) = E(X^2) - (E(X))^2 = \sum_i x_i^2 P(x_i) - (E(X))^2$
    from the die example, Var(X) = 1·(1/6) + 4·(1/6) + ... + 36·(1/6) - 3.5^2 ≈ 2.92
note that these apply to the whole population

Example

imagine a six-sided die where each outcome wasn't equally likely, but had P(X = 1) = P(X = 6) = 1/100, P(X = 2) = P(X = 5) = 4/100, P(X = 3) = P(X = 4) = 45/100
    E(X) = 1·(1/100) + 2·(4/100) + ... + 6·(1/100) = 3.5, as before
    Var(X) = 1·(1/100) + 4·(4/100) + ... + 36·(1/100) - 3.5^2 = 0.53
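The two die examples above can be checked with a short sketch. It assumes a distribution is stored as a plain mapping from outcome to probability; exact fractions are used so that rounding does not obscure the arithmetic. The function and variable names are illustrative.

```python
from fractions import Fraction

def mean(dist):
    """E(X): sum of x * P(x) over a {value: probability} mapping."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X) = E(X^2) - E(X)^2."""
    return sum(x * x * p for x, p in dist.items()) - mean(dist) ** 2

fair = {x: Fraction(1, 6) for x in range(1, 7)}
skewed = {1: Fraction(1, 100), 2: Fraction(4, 100), 3: Fraction(45, 100),
          4: Fraction(45, 100), 5: Fraction(4, 100), 6: Fraction(1, 100)}

print(float(mean(fair)), float(variance(fair)))      # 3.5 and about 2.92
print(float(mean(skewed)), float(variance(skewed)))  # 3.5 and 0.53
```

Both dice have the same mean, but the skewed die concentrates its mass on 3 and 4, which is what the much smaller variance reflects.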

Estimating Probabilities

Maximum Likelihood Estimator (MLE)
    used to estimate the theoretical probability from a sample
    if a specific event has occurred m times out of n occasions, the MLE probability is m/n
    the larger the number of occasions measured, the more accurate the MLE
sample mean and variance given observations $x_i$
    sample mean $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
    sample variance $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Estimating Probabilities (cont.)

imagine we don't know the population probabilities
    we want to estimate them from a sample

die outcome              1    2    3    4    5    6    total
number of times rolled   16   18   13   16   19   18   100

$\bar{x} = \frac{16 \cdot 1 + \ldots + 18 \cdot 6}{100} = 3.58$

$s^2 = \frac{16 (1 - 3.58)^2 + \ldots + 18 (6 - 3.58)^2}{99} \approx 3.05$
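As a rough check of the sample statistics above, the following sketch expands the table of counts into individual observations and uses Python's standard `statistics` module, whose `variance` function divides by n - 1 as in the formula for $s^2$. The variable names are illustrative.

```python
from statistics import mean, variance

# Observed counts for each face, taken from the table of 100 rolls.
rolls = {1: 16, 2: 18, 3: 13, 4: 16, 5: 19, 6: 18}

# Expand the counts into the 100 individual observations.
observations = [face for face, count in rolls.items() for _ in range(count)]

x_bar = mean(observations)    # sample mean
s2 = variance(observations)   # sample variance, divides by n - 1

# MLE of each outcome's probability: count / total.
mle = {face: count / len(observations) for face, count in rolls.items()}

print(x_bar, round(s2, 2), mle[1])
```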

Estimating Probabilities (cont.)

problem: sparse data
    it's more difficult to estimate the probability of a rare event
    if the corpus doesn't register, say, a rare word, the MLE for the word is 0
possible solution: smoothing
    even though the MLE seems like the right way of estimating probabilities, it's not always the right one
    infrequent events can get too little probability mass
    can redistribute some of the probabilities

Hypothesis Testing

using frequencies, as previously, we might decide that new companies is a collocation because it has a high frequency
    however, we don't really think it is one; maybe it's because new and companies are individually frequent, and just appear together by chance
hypothesis testing is a way of assessing whether something is due to chance
has the following procedure:
    formulate a NULL HYPOTHESIS H0, that there is no association beyond chance
    calculate the probability p that the event would occur if H0 were true
    then, if p is too low (usually 0.05 or smaller), reject H0; retain it as a possibility otherwise

The t-Test

want a test that will say how likely or unlikely a certain event is to occur
the t-test compares a sample mean with a population mean, relative to the sample's variability

$t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}$

where N is the size of the sample, and $\mu$ is the population mean according to the null hypothesis
look up this t-value against a table
    table gives t-value for a given confidence level and a given number of degrees of freedom (= N - 1)

p       d.f. = 1   d.f. = 10   d.f. = 20   d.f. = ∞
0.05    6.314      1.812       1.725       1.645
0.01    31.82      2.764       2.528       2.326
0.005   63.66      3.169       2.845       2.576
0.001   318.3      4.144       3.552       3.091

non-linguistic example:
    H0: the mean height of a population of men is 158cm (vs a population of shorter men)
    sample data: size 200 with $\bar{x} = 169$ and $s^2 = 2600$
    $t = \frac{169 - 158}{\sqrt{2600 / 200}} \approx 3.05$
    looking up the table, t > 2.576, so we can reject H0 with 99.5% confidence
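The height example above reduces to one line of arithmetic; a small sketch follows. The function name `t_statistic` and its parameter names are chosen for this illustration, and the critical value is simply read off the table rather than computed.

```python
import math

def t_statistic(sample_mean, pop_mean, sample_var, n):
    """One-sample t: (x_bar - mu) / sqrt(s^2 / N)."""
    return (sample_mean - pop_mean) / math.sqrt(sample_var / n)

t = t_statistic(sample_mean=169, pop_mean=158, sample_var=2600, n=200)
critical = 2.576  # table value for p = 0.005 with a large number of degrees of freedom
print(round(t, 2), t > critical)
```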

Pearson's Chi-Square Test (+ background: Distributions)

Distributions

a PROBABILITY DISTRIBUTION FUNCTION is a function describing the mapping from random variable values to probabilities
    these can be either discrete (from a finite set) or continuous
we've already seen a UNIFORM DISTRIBUTION (the original die example)
    this was a discrete function $P(X = x) = \frac{1}{n}$ (where n is the number of outcomes)

Gaussian Distribution

another important (continuous) one is the GAUSSIAN (OR NORMAL) DISTRIBUTION
    defined by the function $f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
    population mean is $\mu$, variance $\sigma^2$
a lot of data can be assumed to have this distribution, e.g. heights in a population
the t-test described previously assumes a normal distribution

Bernoulli Distribution

the discrete BERNOULLI DISTRIBUTION measures the probability of success in a yes/no experiment, with this probability called p
    defined by P(X = 1) = p, P(X = 0) = 1 - p
    population mean is p, variance is p(1 - p)

Binomial Distribution

the discrete BINOMIAL DISTRIBUTION measures the probability of the number of successes in a sequence of n independent yes/no experiments (Bernoulli distributions), each of which has probability p
    defined by $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$
    population mean is np, variance is np(1 - p)
models things like the probability of getting k heads from n tosses of a fair coin
for large n, can be approximated by the normal distribution

Zipf Distribution

the ZIPF DISTRIBUTION is a model of Zipf's law, which says that the frequency of any word is roughly inversely proportional to its rank in the frequency table
    in its original form, $P_n \propto \frac{1}{n^a}$, where $P_n$ is the frequency of a word, n is its rank, and a is a constant close to 1 (note that this is not a probability distribution)

The t-Test for Proportions

can extend the t-test to use proportions or counts
    for a text, consider it as a long sequence of N bigrams
    each bigram is either the one we're looking for (success) or not (failure)
    this gives a Bernoulli distribution
linguistic example:
    we have 15,828 occurrences of new and 4,675 of companies, and there were 14,307,668 tokens overall
    $P(\text{new}) = \frac{15828}{14307668}$, $P(\text{companies}) = \frac{4675}{14307668}$
    H0 is that the occurrences of new and companies are independent: $P(\text{new companies}) = P(\text{new}) P(\text{companies}) \approx 3.615 \times 10^{-7}$
    this P(new companies) is $\mu$

The t-Test for Proportions (cont.)

linguistic example (cont.):
    there are 8 occurrences of new companies, so sample mean (proportion) is $\bar{x} = p = \frac{8}{14307668} \approx 5.591 \times 10^{-7}$
    variance is p(1 - p), which for very small p is approximately p
    now, the t-test:

$t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}} \approx \frac{5.591 \times 10^{-7} - 3.615 \times 10^{-7}}{\sqrt{\frac{5.591 \times 10^{-7}}{14307668}}} \approx 0.999932$

this is below any level of significance in our previous table
    so, we don't reject the null hypothesis: we can't say new companies is a collocation

Finding Collocations by t-Test Ranking

some bigrams of frequency 20:

t        C(w1)   C(w2)   C(w1 w2)   w1              w2
4.4721   42      20      20         Ayatollah       Ruhollah
4.4721   41      27      20         Bette           Midler
4.4720   30      117     20         Agatha          Christie
4.4720   77      59      20         videocassette   recorder
4.4720   24      320     20         unsalted        butter
2.3714   14907   9017    20         first           made
2.2446   13484   10570   20         over            many
1.3685   14734   13478   20         into            them
1.2176   14093   14776   20         like            people
0.8036   15019   15629   20         time            last

actually, the t-test will reject only very few possible collocations
    reason will come later ...
    still useful for ranking, however
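The bigram version of the t-test can be sketched directly from the counts used above. The function name `bigram_t` is an assumption for this example; the variance uses the Bernoulli form p(1 - p) rather than the approximation p, which makes no visible difference at these magnitudes.

```python
import math

def bigram_t(c1, c2, c12, n):
    """t-statistic for a bigram under the independence null hypothesis."""
    mu = (c1 / n) * (c2 / n)   # expected proportion if w1 and w2 are independent
    x_bar = c12 / n            # observed proportion of the bigram
    s2 = x_bar * (1 - x_bar)   # Bernoulli variance, approximately x_bar
    return (x_bar - mu) / math.sqrt(s2 / n)

N = 14307668
t_new_companies = bigram_t(15828, 4675, 8, N)
t_ayatollah = bigram_t(42, 20, 20, N)
print(t_new_companies, t_ayatollah)
```

The second call uses the first row of the ranking table; rare words that almost always occur together score close to the square root of the bigram count.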

Hypothesis Testing of Differences

previous hypothesis testing compared a sample value against a postulated corresponding population value
may want to compare sample values from (what you believe are) two different distributions
    null hypothesis H0 is then that these two distributions are actually the same
t-value is calculated by

$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

suppose H0 is that the words that collocate with strong and powerful are from the same distribution
    $\bar{x}_1$ is the probability of the bigram strong w (for some word w)
    $\bar{x}_2$ is the probability of the bigram powerful w

Finding Collocations by t-Test Ranking (cont.)

can then approximate the t value by

$t \approx \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\bar{x}_1 + \bar{x}_2}{N}}} = \frac{C(\text{strong } w) - C(\text{powerful } w)}{\sqrt{C(\text{strong } w) + C(\text{powerful } w)}}$

collocations with strong vs powerful

t        C(w)   C(strong w)   C(powerful w)   w
3.1622   933    0             10              computers
2.8284   2337   0             8               computer
2.4494   289    0             6               symbol
2.4494   588    0             6               machines
...
7.0710   3685   50            0               support
6.3257   3616   58            7               enough
4.6904   986    22            0               safety
4.5825   3741   21            0               sales
...
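The approximation for comparing strong and powerful needs only the two bigram counts. A small sketch, with the rows copied from the table and the function name `diff_t` chosen for illustration; the table lists magnitudes, so the absolute value is printed.

```python
import math

def diff_t(c_strong, c_powerful):
    """Approximate t for the difference between two bigram counts."""
    return (c_strong - c_powerful) / math.sqrt(c_strong + c_powerful)

# (w, C(strong w), C(powerful w)) rows taken from the table.
rows = [("computers", 0, 10), ("support", 50, 0), ("enough", 58, 7)]
for w, cs, cp in rows:
    print(w, round(abs(diff_t(cs, cp)), 4))
```

The sign of the raw value indicates which adjective the word prefers: negative for powerful, positive for strong.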

Pearson's Chi-Square Test

t-test assumes normal distribution of sample mean probabilities; however, for text this is not true
an alternative for hypothesis testing is the $\chi^2$ test, which does not assume this
simplest form takes a 2 × 2 CONTINGENCY TABLE
    if looking for collocations of w1 and w2, this table will consist of the four combinations of w1, w2, not-w1 and not-w2
$X^2$ aggregates the differences between observed and expected values for these cells

$X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$

where i ranges over table rows, j ranges over table columns, $O_{ij}$ is the observed value for cell (i, j) and $E_{ij}$ is the expected value

Pearson's Chi-Square Test (cont.)

for the previous example, recall that C(new) = 15,828, C(companies) = 4,675, and C(new companies) = 8, and that there are 14,307,668 tokens in the corpus

                   w1 = new                    w1 ≠ new
w2 = companies     8 (new companies)           4667 (e.g. old companies)
w2 ≠ companies     15820 (e.g. new machines)   14287173 (e.g. old machines)

to calculate e.g. $E_{1,1}$, we use the probability of the first word of a bigram being new, the second word being companies:

$E_{1,1} = \frac{8 + 4667}{N} \times \frac{8 + 15820}{N} \times N \approx 5.2$

for this example, $X^2 \approx 1.55$
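The $X^2$ computation for the new companies table can be sketched as below. The expected count for each cell is taken as row total times column total divided by N, which is the same calculation as the $E_{1,1}$ example; the function name `chi_square` is an assumption of this sketch.

```python
def chi_square(table):
    """Pearson's X^2 for a contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
    return x2

# Rows: w2 = companies, w2 != companies; columns: w1 = new, w1 != new.
observed = [[8, 4667], [15820, 14287173]]
x2 = chi_square(observed)
print(round(x2, 2), x2 > 3.84)  # 3.84: critical value for p = 0.05 with 1 d.f.
```

Because the function works from row and column totals, it applies unchanged to larger r × c tables.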

Pearson's Chi-Square Test (cont.)

as for the t-test, the $\chi^2$ has an associated number of degrees of freedom
    for a table of dimensions r × c, there are (r - 1)(c - 1) degrees of freedom
we check the distribution for $\chi^2$:

p       d.f. = 1   d.f. = 2   d.f. = 3   d.f. = 4   d.f. = 100
0.99    0.00016    0.020      0.115      0.297      70.06
0.95    0.0039     0.10       0.35       0.71       77.93
0.10    2.71       4.60       6.25       7.78       118.5
0.05    3.84       5.99       7.81       9.49       124.3
0.01    6.63       9.21       11.34      13.28      135.8
0.005   7.88       10.60      12.84      14.86      140.2
0.001   10.83      13.82      16.27      18.47      149.4

the $X^2$ value is less than the critical value for $\alpha = 0.05$ (3.84 with 1 degree of freedom), so we wouldn't reject H0: i.e. we wouldn't take new companies as a collocation, as before with the t-test

Comparison: Chi-Square vs t-Test

for the previous example, there's quite a lot of overlap
    for example, the top 20 bigrams according to the t-test are the same as the top 20 for $\chi^2$
however, $\chi^2$ is also appropriate for large probabilities, where the normality assumption of the t-test fails

Likelihood Ratio Test (+ background: Conditional Probability)

Conditional Probability

we've already in fact used the notion of independent events
    two events are independent of each other if the occurrence of one does not affect the probability of the occurrence of the other
    tossing a coin and winning the lottery: independent
    speeding and having an accident: not independent
conditional probability: the probability that one event occurs given that another event occurs

Example

the following table shows the weather conditions for 100 horse races and how many times Harry won:

         rain   shine
win      15     5
no win   15     65

Harry won 20 out of 100 races: P(win) = 0.2 (by MLE)
the conditional probability of Harry winning given rain is P(win | rain) = 15/30 = 0.5
compare this with the $\chi^2$ test: under the null hypothesis, the observed data was compared against the situation where the words were independent
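The horse-race figures can be reproduced with a few lines, which also show what independence would have predicted. The dictionary layout and variable names are illustrative choices, not part of the original example.

```python
# Race outcomes from the table: counts[result][weather].
counts = {"win": {"rain": 15, "shine": 5},
          "no win": {"rain": 15, "shine": 65}}

total = sum(sum(row.values()) for row in counts.values())
rain_total = counts["win"]["rain"] + counts["no win"]["rain"]

p_win = sum(counts["win"].values()) / total        # MLE of P(win)
p_rain = rain_total / total                        # MLE of P(rain)
p_win_given_rain = counts["win"]["rain"] / rain_total

# Under independence we would expect P(win and rain) = P(win) * P(rain).
p_joint = counts["win"]["rain"] / total
print(p_win, p_win_given_rain, p_joint, p_win * p_rain)
```

The observed joint probability is well above the product of the two marginals, so winning and rain do not look independent in this sample.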

Likelihood Ratio

another approach to hypothesis testing
    more appropriate to sparse data than $\chi^2$
    more interpretable also: says how much more likely one hypothesis is than another
here, examine explicitly two hypotheses to explain bigram w1 w2
    Hypothesis 1: $P(w_2 \mid w_1) = p = P(w_2 \mid \neg w_1)$
    Hypothesis 2: $P(w_2 \mid w_1) = p_1 \neq p_2 = P(w_2 \mid \neg w_1)$
Hypothesis 1 represents independence of w1 and w2; Hypothesis 2 represents dependence (and hence a possible collocation)

Likelihood Ratio (cont.)

we'll use the usual MLEs for $p$, $p_1$, $p_2$, writing $c_1$, $c_2$, $c_{12}$ for the number of occurrences of w1, w2, w1 w2

$p = \frac{c_2}{N}, \quad p_1 = \frac{c_{12}}{c_1}, \quad p_2 = \frac{c_2 - c_{12}}{N - c_1}$

we'll also use the notation for a binomial distribution

$b(k; n, p) = \binom{n}{k} p^k (1 - p)^{n - k}$

now, the likelihoods are
    for Hypothesis 1, $L(H_1) = b(c_{12}; c_1, p) \, b(c_2 - c_{12}; N - c_1, p)$
    for Hypothesis 2, $L(H_2) = b(c_{12}; c_1, p_1) \, b(c_2 - c_{12}; N - c_1, p_2)$

Likelihood Ratio (cont.)

the log of the likelihood ratio is then $\log \lambda = \log \frac{L(H_1)}{L(H_2)}$
for bigrams of powerful:

-2 log λ   C(w1)   C(w2)   C(w1 w2)   w1            w2
1291.42    12593   932     150        most          powerful
99.31      379     932     10         politically   powerful
82.96      932     934     10         powerful      computers
80.39      932     3424    13         powerful      force
57.27      932     291     6          powerful      symbol
...

the value $-2 \log \lambda$ has (asymptotically) a $\chi^2$ distribution
    so you can do hypothesis testing using the $\chi^2$ table

Comparison: Chi-Square and Likelihood Ratio

the likelihood ratio has an intuitive meaning
    from the previous table, the bigram powerful symbol is $e^{0.5 \times 57.27} \approx 2.729 \times 10^{12}$ times more likely to occur than would be expected by the individual words alone
comparison carried out by Dunning [1993]
$\chi^2$ tends to be less accurate with sparse data
    as a rule of thumb, need large sample, and counts in each cell (i.e. occurrences of words or bigrams) of at least 5
the events we're interested in in text (individual words or n-grams) are in fact often less frequent than this: related to the Zipfian distribution of words
    as an example, Dunning selected words from a 500,000 word corpus with frequencies of between 1 and 4; these included words like abandonment, clause, meat, poi and understatement
    the log likelihood is more accurate here (but still needs counts of at least 1)
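The likelihood ratio statistic described above can be sketched as follows. Two assumptions are made here: N is taken to be the 14,307,668-token corpus size quoted earlier, and the binomial coefficients are dropped because they cancel between L(H1) and L(H2). The function names are illustrative, and the sketch requires every probability to lie strictly between 0 and 1.

```python
import math

def log_l(k, n, x):
    """Log of x^k (1 - x)^(n - k); the binomial coefficients cancel in the ratio."""
    return k * math.log(x) + (n - k) * math.log(1 - x)

def neg2_log_lambda(c1, c2, c12, n):
    """-2 log(L(H1) / L(H2)) for the bigram w1 w2 (needs 0 < p, p1, p2 < 1)."""
    p = c2 / n
    p1 = c12 / c1
    p2 = (c2 - c12) / (n - c1)
    log_lambda = (log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
                  - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2))
    return -2 * log_lambda

# powerful computers: C(w1) = 932, C(w2) = 934, C(w1 w2) = 10.
# N is assumed to be the corpus size quoted earlier in the tutorial.
stat = neg2_log_lambda(932, 934, 10, 14307668)
print(round(stat, 2), stat > 10.83)  # 10.83: chi-square critical value, p = 0.001, 1 d.f.
```

The result should land in the neighbourhood of the tabulated value for powerful computers; exact agreement depends on the counts and corpus size behind the original table, which this sketch can only assume.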

Fisher's Exact Test

the previous tests have all been PARAMETRIC tests: that is, they assume some distribution
it's possible to use a NON-PARAMETRIC test, which makes no assumptions
    trade-off is that it's typically more time-consuming to calculate, and is only feasible for smaller amounts of data
Fisher's Exact Test computes the significance of an observed table by exhaustively computing the probability of every table that would have the same marginal totals
    suggested as an alternative to the previous tests by Pedersen [1996]

Fisher's Exact Test (cont.)

consider again a 2 × 2 contingency table

                   w1 = new                        w1 ≠ new
w2 = companies     $E_{1,1}$ (new companies)       $E_{1,2}$ (e.g. old companies)
w2 ≠ companies     $E_{2,1}$ (e.g. new machines)   $E_{2,2}$ (e.g. old machines)

the probability of obtaining any such set of values is

$p = \frac{\binom{E_{1,1} + E_{1,2}}{E_{1,1}} \binom{E_{2,1} + E_{2,2}}{E_{2,1}}}{\binom{E_{1,1} + E_{1,2} + E_{2,1} + E_{2,2}}{E_{1,1} + E_{2,1}}}$
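The table probability above is a hypergeometric probability and can be computed exactly with `math.comb`. The sketch below also sums over the more extreme tables with the same marginal totals, which is one common (one-sided) way of turning the table probabilities into a significance value; that choice, the function names, and the small example table are assumptions made for illustration.

```python
from math import comb

def table_probability(e11, e12, e21, e22):
    """Hypergeometric probability of one 2 x 2 table with fixed marginal totals."""
    return (comb(e11 + e12, e11) * comb(e21 + e22, e21)
            / comb(e11 + e12 + e21 + e22, e11 + e21))

def fisher_one_sided(e11, e12, e21, e22):
    """Sum the probabilities of this table and all tables with a larger e11."""
    total = 0.0
    while e12 >= 0 and e21 >= 0:
        total += table_probability(e11, e12, e21, e22)
        # Shift one count along the diagonal; the marginal totals stay fixed.
        e11, e12, e21, e22 = e11 + 1, e12 - 1, e21 - 1, e22 + 1
    return total

# Hypothetical small table, chosen only to keep the arithmetic visible.
p_table = table_probability(3, 1, 1, 3)
p_value = fisher_one_sided(3, 1, 1, 3)
print(p_table, p_value)
```

For corpus-sized counts the loop may run over many tables, which is the computational cost mentioned in the trade-off above.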

Verb Subcategorisation

verbs express their semantic arguments with different syntactic means
    the class of verbs with semantic arguments theme and recipient has a subcategory expressing these via a direct object and a prepositional phrase: he donated a large sum of money to the church
    a second subcategory permits double objects: he gave the church a large sum of money
these subcategorisation frames are typically not in dictionaries
might be interested in identifying them via statistics
Brent [1993] developed the system Lerner to assign one of six frames to verbs

Description       Good Example           Bad Example
NP only           greet them             *arrive them
tensed clause     hope he'll attend      *want he'll attend
infinitive        hope to attend         *greet to attend
NP & clause       tell him he's a fool   *yell him he's a fool
NP & infinitive   want him to attend     *hope him to attend
NP & NP           tell him the story     *shout him the story

Algorithm for Learning Subcat Frames

Lerner had two steps:
1 Define cues. Define a regular pattern of words and syntactic categories which indicates the presence of the frame with high certainty (i.e. with a low probability of error). For a particular cue $c_j$ we define the probability of error $\epsilon_j$ that indicates how likely we are to make a mistake if we assign frame f to verb v based on cue $c_j$.
2 Do hypothesis testing. Initially assume the frame is not appropriate for the verb: this is the null hypothesis H0. We reject H0 if the cue $c_j$ indicates with high probability that H0 is wrong.
example: cue for frame NP only (transitive verb)
    (OBJ | SUBJ_OBJ | CAP) (PUNC | CC)
    OBJ = accusative case personal pronouns; SUBJ_OBJ = nominative or accusative case personal pronouns; CAP = capitalised word; PUNC = punctuation; CC = subordinating conjunction
    positive indicator for transitive verb: consider ... greet/V Peter/CAP ,/PUNC ...

Hypothesis Testing

suppose verb $v_i$ occurs a total of n times in the corpus and that there are m ≤ n occurrences with a cue for frame $f_j$
assume also some error $\epsilon_j$ in inferring a frame $f_j$ from cue $c_j$
this suggests a binomial distribution
then reject null hypothesis H0 that $v_i$ does not permit $f_j$ with the following probability of error:

$p_E = P(v_i \text{ does not permit } f_j \mid C(v_i, c_j) \geq m) = \sum_{r=m}^{n} \binom{n}{r} \epsilon_j^r (1 - \epsilon_j)^{n - r}$

various values for $\epsilon_j$ were assessed
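The error probability above is a binomial tail sum, which can be sketched directly. The counts in the example and the significance threshold `alpha = 0.02` are assumptions for illustration; the tutorial does not state which threshold Lerner used, and the function names are invented here.

```python
from math import comb

def error_probability(n, m, eps):
    """P(at least m of n occurrences show the cue purely through cue error eps)."""
    return sum(comb(n, r) * eps**r * (1 - eps)**(n - r) for r in range(m, n + 1))

def permits_frame(n, m, eps, alpha=0.02):
    """Reject H0 (verb does not permit the frame) if p_E falls below alpha."""
    return error_probability(n, m, eps) < alpha

# Hypothetical counts: a verb seen 50 times, 5 of them with the cue, eps = 0.01.
print(error_probability(50, 5, 0.01), permits_frame(50, 5, 0.01))
# With only one cue occurrence the evidence is too weak to reject H0.
print(error_probability(50, 1, 0.01), permits_frame(50, 1, 0.01))
```

Lowering eps makes each cue occurrence count as stronger evidence, which is the knob varied in the accuracy results for Lerner.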

Precision and Recall

typically, when building a statistical model to do something, you want to evaluate how suitable it is
    for example, how well it performs a simple task
the measures of PRECISION and RECALL are one way of doing that
imagine you have a system for sorting your objects of interest into two piles, relevant and irrelevant
    system can make two types of errors: classifying a relevant object as irrelevant, or an irrelevant one as relevant
    system decisions can then be broken into four categories: true positive (TP), false positive (FP), false negative (FN), true negative (TN)

                       system predicts: relevant   system predicts: irrelevant
actually: relevant     TP                          FN
actually: irrelevant   FP                          TN

Precision and Recall (cont.)

precision is the proportion of system-predicted relevant objects that are correct: $\mathrm{PRE} = \frac{TP}{TP + FP}$
recall is the proportion of actually relevant objects that the system managed to predict as relevant: $\mathrm{REC} = \frac{TP}{TP + FN}$
example: there's a set of 200 documents, of which 40 are actually relevant; your system says that 50 are relevant, including 20 of the ones that actually are

                       system predicts: relevant   system predicts: irrelevant
actually: relevant     TP = 20                     FN = 20
actually: irrelevant   FP = 30                     TN = 130

then, $\mathrm{PRE} = \frac{20}{50}$ and $\mathrm{REC} = \frac{20}{40}$
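The document-retrieval example works out as follows; the function names are illustrative.

```python
def precision(tp, fp):
    """Proportion of predicted-relevant items that are actually relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Proportion of actually relevant items that were predicted relevant."""
    return tp / (tp + fn)

# 200 documents, 40 relevant; the system returns 50, of which 20 are relevant.
tp, fp, fn, tn = 20, 30, 20, 130
pre, rec = precision(tp, fp), recall(tp, fn)
print(pre, rec)
```

Note that TN does not enter either measure, which is why precision and recall stay informative when irrelevant items vastly outnumber relevant ones.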

Verb Subcategorisation: Lerner Accuracy

for Lerner, precision and recall values were calculated for various $\epsilon_j$
this is the table for the tensed clause frame

$\epsilon_j$   TP   FP   TN   FN   MC   %MC   PRE    REC
.0312          13   0    30   20   20   32    1.00   .39
.0156          19   0    30   14   14   22    1.00   .58
.0078          22   1    29   11   12   19    .96    .67
.0039          25   1    29   8    9    14    .96    .76
.0020          27   3    27   6    9    14    .90    .82
.0010          29   5    25   4    9    14    .85    .88
.0005          31   8    22   2    10   16    .79    .94
.0002          31   13   17   2    15   24    .70    .94
.0001          33   19   11   0    19   30    .63    1.00

MC is total misclassified

F-Measure

there's typically a trade-off between precision and recall
there are a number of ways of combining them into a single measure
one is the F-measure, the harmonic mean of the two (here with equal weights):

$F = \frac{2 \cdot \mathrm{PRE} \cdot \mathrm{REC}}{\mathrm{PRE} + \mathrm{REC}}$
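As a worked instance of the F-measure, the sketch below uses the Lerner counts TP = 22, FP = 1, FN = 11 (the third row of the tensed clause table). The function name `f_measure` is an assumption of this example.

```python
def f_measure(pre, rec):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * pre * rec / (pre + rec)

# Lerner, tensed clause frame: TP = 22, FP = 1, FN = 11.
tp, fp, fn = 22, 1, 11
pre, rec = tp / (tp + fp), tp / (tp + fn)
f = f_measure(pre, rec)
print(round(pre, 2), round(rec, 2), round(f, 2))
```

The harmonic mean is pulled towards the smaller of the two values, so a system cannot score well on F by maximising precision alone or recall alone.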

Semantic Similarity

there are a number of resources that group words together by semantic relatedness
    examples are thesauruses, WordNet
    semantic relations are synonymy, hypernymy, etc
    e.g. dog and canine might be in a class together; this might be a hyponym of a class corresponding to animal
you might want to automatically derive classes to capture relations
    for when you have a new unknown word: e.g. if in Susan had never eaten a fresh durian before you don't know what kind of thing durian is
    if you want types of classes other than the standard ones

Latent Semantic Indexing (LSI)

in LSI, we look at the interaction of terms and documents
the purpose of this interaction is twofold
    to have the documents tell us which terms should be grouped together
    to have the grouped-together terms tell us about the similarity of the documents
this interaction is described by a matrix, and the grouping is carried out by a process called Singular Value Decomposition (SVD)

Example

say we have 5 terms of interest (cosmonaut, astronaut, moon, car, truck) and 6 documents
we describe their interaction by a matrix A, where cell $a_{ij}$ contains the count of term i in document j

A =
            d1   d2   d3   d4   d5   d6
cosmonaut   1    0    1    0    0    0
astronaut   0    1    0    0    0    0
moon        1    1    0    0    0    0
car         1    0    0    1    1    0
truck       0    0    0    1    0    1

this can be thought of as a five-dimensional space (defined by the terms) with six objects in that space (the documents)
what we want to do is reduce the dimensions, thus grouping similar terms

Dimensionality Reduction

there are many possible types of dimensionality reduction
LSI chooses the mapping that means that the reduced dimensions correspond to the greatest axes of variation
    that is, if the new dimensions are numbered 1 ... k, dimension 1 captures the greatest amount of commonality, dimension 2 the second greatest, and so on
this process is carried out by the matrix operation called Singular Value Decomposition
    here, the term-by-document matrix $A_{t \times d}$ is decomposed into three other matrices: $A_{t \times d} = T_{t \times n} S_{n \times n} (D_{d \times n})^T$
    this decomposition is (almost) unique

Example (cont.)

T =
            Dim 1   Dim 2   Dim 3   Dim 4   Dim 5
cosmonaut   0.44    0.30    0.57    0.58    0.25
astronaut   0.13    0.33    0.59    0.00    0.73
moon        0.48    0.51    0.37    0.00    0.61
car         0.70    0.35    0.15    0.58    0.16
truck       0.26    0.65    0.41    0.58    0.09

consider the columns ...

Example (cont.)

S =
2.16   0.00   0.00   0.00   0.00
0.00   1.59   0.00   0.00   0.00
0.00   0.00   1.28   0.00   0.00
0.00   0.00   0.00   1.00   0.00
0.00   0.00   0.00   0.00   0.39

this matrix embodies the weight of the dimensions
    it always goes largest to smallest

D^T =
        d1     d2     d3     d4     d5     d6
Dim 1   0.75   0.28   0.20   0.45   0.33   0.12
Dim 2   0.29   0.53   0.19   0.63   0.22   0.41
Dim 3   0.28   0.75   0.45   0.20   0.12   0.33
Dim 4   0.00   0.00   0.58   0.00   0.58   0.58
Dim 5   0.53   0.29   0.63   0.19   0.41   0.22

Example (cont.)

so far we've just transformed the dimensions; now to reduce
    for this example, decide to reduce to 2 dimensions
to look at documents, combine this reduced dimensionality with the weighting of the dimensions
    derive new matrix $B_{2 \times d} = S_{2 \times 2} D^T_{2 \times d}$

B =
        d1     d2     d3     d4     d5     d6
Dim 1   1.62   0.60   0.44   0.97   0.70   0.26
Dim 2   0.46   0.84   0.30   1.00   0.35   0.65

Conceptually ...

can imagine that the terms are made up of semantic particles
    perhaps along the lines of Wierzbicka's semantic primitives
    however, not defined a priori; only a consequence of the relations in the given set of documents
LSI rearranges things so that the terms with greatest number of semantic particles in common are grouped
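To make the LSI example concrete, the sketch below recovers the two largest singular values of the term-by-document matrix A and the reduced document matrix B. It deliberately avoids any external library: plain power iteration with deflation is swapped in here as a stand-in for a proper SVD routine, so it should be read as an illustration of what the decomposition produces rather than as the method behind the original figures. All function names are assumptions, and the sign of each dimension is arbitrary.

```python
import math

# Term-by-document matrix A (rows: cosmonaut, astronaut, moon, car, truck).
A = [[1, 0, 1, 0, 0, 0],
     [0, 1, 0, 0, 0, 0],
     [1, 1, 0, 0, 0, 0],
     [1, 0, 0, 1, 1, 0],
     [0, 0, 0, 1, 0, 1]]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def top_singular_triple(m, iterations=500):
    """Power iteration on M^T M: returns (sigma, u, v) for the largest singular value."""
    mt = transpose(m)
    v = [1.0 + 0.1 * i for i in range(len(m[0]))]  # arbitrary start vector
    for _ in range(iterations):
        v = matvec(mt, matvec(m, v))
        n = norm(v)
        v = [x / n for x in v]
    u = matvec(m, v)
    sigma = norm(u)
    return sigma, [x / sigma for x in u], v

def deflate(m, sigma, u, v):
    """Remove the rank-one component sigma * u * v^T from M."""
    return [[m[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]

s1, u1, v1 = top_singular_triple(A)
s2, u2, v2 = top_singular_triple(deflate(A, s1, u1, v1))

# Reduced document representation B = S D^T, keeping two dimensions.
B = [[s1 * x for x in v1], [s2 * x for x in v2]]
print(round(s1, 2), round(s2, 2))
```

Documents can then be compared by the angle between their columns of B, so that documents sharing no terms may still come out as similar when their terms pattern together elsewhere.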

Register Analysis

work done by Douglas Biber
    these notes from Biber [1993]
idea: different registers have systematic patterns of variation
    e.g. professional letters vs academic prose
can do descriptive analyses, based on frequency or proportions of selected characteristics
however, may also want to identify groups of characteristics distinguishing registers

Descriptive Analysis

example: from Brown corpus, mean frequencies of three dependent clause types (per 1000 words)

register             relative clauses   causative adverbial subordinate clauses   that complement clauses
press reports        4.6                0.5                                       3.4
official documents   8.6                0.1                                       1.6
conversations        2.9                3.5                                       4.1
prepared speeches    7.9                1.6                                       7.6

from this, can see e.g. that relative clauses are common in official documents and prepared speeches relative to conversation
may be interested in grouping many of these characteristics of text together

Dimension Identification

Biber carried out a quantitative analysis of 67 linguistic features in the LOB and London-Lund corpora
    features included: tense and aspect markers, place and time adverbials, pronouns and pro-verbs, nominal forms, prepositional phrases, adjectives, lexical specificity, lexical classes (e.g. hedges, emphatics), modals, specialised verb classes, reduced forms and discontinuous structures, passives, stative forms, dependent clauses, coordination, and questions
    frequencies were counted, and normalised to per-1000 values
then, FACTOR ANALYSIS was carried out
    this is a dimensionality reduction procedure very similar to LSI
    the dimensions similarly end up in decreasing order of explanatory power

Dimension Identification (cont.)

after inspecting the results of the factor analysis, Biber interpretively labelled the first five dimensions:
1 Informational vs Involved Production
2 Narrative vs Nonnarrative Concerns
3 Elaborated vs Situation-Dependent Reference
4 Overt Expression of Persuasion
5 Abstract vs Nonabstract Style
example of features associated with Dimension 1:

functions               linguistic features           characteristic registers
Monologue               nouns, adjectives             informational exposition
Careful Production      prepositional phrases         e.g. official documents
Informational           long words                    academic prose
Faceless

Interactive             1st and 2nd person pronouns   conversations
(Inter)personal Focus   questions, reductions         (personal letters)
Involved                stance verbs, hedges          (public conversations)
Personal Stance         emphatics
On-Line Production      adverbial subordination

References

Douglas Biber. Using Register-Diversified Corpora for General Language Studies. Computational Linguistics, 19(2):219-241, 1993.
Michael Brent. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243-262, 1993.
Ted Dunning. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1):61-74, 1993.
Brigitte Krenn and Christer Samuelsson. The Linguist's Guide to Statistics: DON'T PANIC. URL http://coli.uni-sb.de/christer. Version of December 19, 1997.
Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, USA, 1999.
Ted Pedersen. Fishing for Exactness. In Proceedings of the South-Central SAS Users Group Conference, Austin, TX, USA, 1996.