Statistika Za Filologe

Statistics for Linguists: A Tutorial

Mark Dras
Centre for Language Technology, Macquarie University
HCSNet Summerfest, 28 November 2006

Tutorial Structure
primarily as some examples of tasks linguists might be interested in
within these, statistical ideas that are useful
    hypothesis testing
    various statistical measures ($\chi^2$, likelihood ratios, ...)
    statistical distributions
    some other useful ideas (e.g. Latent Semantic Analysis)
basic material taken from Manning and Schütze [1999]
another useful overview: Krenn and Samuelsson [1997]

Outline
1 Collocations
    Frequency
    Hypothesis Testing (+ background: Basic Probability Theory)
    The t-Test
    Pearson's Chi-Square Test (+ background: Distributions)
    Likelihood Ratio Test (+ background: Conditional Probability)
    Fisher's Exact Test
2 Verb Subcategorisation
    Precision and Recall
3 Semantic Similarity
    Latent Semantic Indexing
4 Register Analysis
5 References

Collocations

Definitions
collocation: an expression consisting of two or more words that corresponds to some conventional way of saying things
Firth (1957): "Collocations of a given word are statements of the habitual or customary places of that word"
examples:
    noun phrases: strong tea, weapons of mass destruction
    phrasal verbs: to make up
    other standard phrases: the rich and powerful
has limited compositionality
    more than an idiom
    example: international best practice

Frequency
most basic idea: start with a corpus, and count the relevant frequencies
    if looking for two-word collocations, just count frequencies of pairs of adjacent words
    obvious problem: get lots of useless high-frequency words
from the New York Times:

C(w1 w2)   w1     w2
80871      of     the
58841      in     the
26430      to     the
21842      on     the
...        ...    ...
12622      from   the
11428      New    York
10007      he     said
...        ...    ...
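
The counting step above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce the New York Times figures: the function name `bigram_counts` and the toy sentence are assumptions made for the example.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count every pair of adjacent tokens, i.e. C(w1 w2)."""
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical toy text standing in for a real corpus.
tokens = "he said that in the city of New York the mayor of New York said".split()
counts = bigram_counts(tokens)

# Print the most frequent bigrams first, as in the table above.
for (w1, w2), c in counts.most_common(3):
    print(c, w1, w2)
```

Even on a toy text, pairs built from very common words (here "of New") rank alongside the pair of interest ("New York"), which is the problem the filters on the following slides address.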

Frequency
the basic idea of frequency may still be OK under one of two conditions:
1 use it when looking to verify specific alternatives or patterns; or
2 add a filter

Example #1a: Eggcorns
described in Language Log
    http://itre.cis.upenn.edu/myl/languagelog/
idea: something like a mistaken but plausible reanalysis of a word or phrase
    examples: to step foot in, baited breath, free reign, hone in, ripe with mistakes, for all intensive purposes, manner from Heaven, give up the goat
    similar to, but not the same as, a folk etymology, a malapropism, or a mondegreen
collection by Chris Waigl
    http://www.eggcorns.lascribe.net
interested in seeing whether an eggcorn is gaining currency
    compare by Google hits
    e.g. inclement weather (173K whG) vs inclimate weather (11K whG) vs incliment weather (719 whG)

Example #1b: Snowclones
also defined at Language Log
idea: adaptable cliché frames
examples: Have X, will travel; X is the new Y; X, we have a problem
again, use Google hits

Aside: Google Counts
there's been discussion about the reliability of Google-derived frequencies
    see Jean Véronis's blog and Language Log
example of problem
    Google query: junco partner lyrics (9440 whG)
    Google query: junco partner lyrics connick (279 whG)
    Google query: junco partner lyrics -connick (930 whG)
    the last two counts should add up to the first, but 279 + 930 = 1209
frequency counts over 100K are generally regarded as unreliable, but smaller counts may be unreliable too
problems appear to be related to Google's indexing, and treatment of near-identical page matches

Adding Filters
alternatively, if the problem is to find rather than to verify, can use filters based on part of speech
for instance, in the previous example of extracting collocations, can use patterns like the following:

Tag pattern   Example
A N           linear function
N N           regression coefficients
A A N         Gaussian random variable
N P N         degrees of freedom
...           ...

Adding Filters
applied to the same New York Times text, get

C(w1 w2)   w1       w2          Tag pattern
11487      New      York        A N
7261       United   States      A N
5412       Los      Angeles     N N
3301       last     year        A N
...        ...      ...         ...
1074       chief    executive   A N
1073       real     estate      A N
...        ...      ...         ...

similarly, given particular adjectives, can find the most frequent co-occurring nouns:

w          C(strong, w)      w           C(powerful, w)
support    50                force       13
safety     22                computers   10
sales      21                position    8
...        ...               ...         ...
man        9                 man         8
...        ...               ...         ...

good, but still not perfect
    e.g. man occurring in both lists
    want to ignore if man is relatively common by itself

Hypothesis Testing (+ background: Basic Probability Theory)

Random Variable
the probability of an event is the likelihood that it will occur, represented by a number between 0 and 1:
    probability 0: impossibility
    probability 1: certainty
    probability 0.5: equally likely to occur as not
a random variable ranges over all the possible types of outcomes for the event being measured
example:
    random variable X = the result of rolling a die
    P(X = 1) = the probability of the die showing 1 = 1/6
    P(X = 1) = P(X = 2) = ... = P(X = 6) = 1/6
properties
    the probability of an outcome is always between 0 and 1
    the sum of probabilities of all outcomes is 1

Summary Measures for Random Variables
the (arithmetic) mean, or average: $E(X) = \sum_i x_i P(x_i)$
    from the die example, $E(X) = 1 \cdot (1/6) + 2 \cdot (1/6) + \ldots + 6 \cdot (1/6) = 3.5$
the variance, to measure the spread: $Var(X) = E((X - E(X))^2) = E(X^2) - (E(X))^2 = \sum_i x_i^2 P(x_i) - (E(X))^2$
    from the die example, $Var(X) = 1 \cdot (1/6) + 4 \cdot (1/6) + \ldots + 36 \cdot (1/6) - 3.5^2 \approx 2.92$
note that these apply to the whole population

Example
imagine a six-sided die where each outcome wasn't equally likely, but had P(X = 1) = P(X = 6) = 1/100, P(X = 2) = P(X = 5) = 4/100, P(X = 3) = P(X = 4) = 45/100
    $E(X) = 1 \cdot (1/100) + 2 \cdot (4/100) + \ldots + 6 \cdot (1/100) = 3.5$, as before
    $Var(X) = 1 \cdot (1/100) + 4 \cdot (4/100) + \ldots + 36 \cdot (1/100) - 3.5^2 = 0.53$
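
The two die examples can be checked with a short Python sketch. The function names `mean` and `variance` are chosen for this illustration; exact fractions are used so that the results are not blurred by floating-point rounding.

```python
from fractions import Fraction

def mean(dist):
    """E(X) = sum_i x_i P(x_i) for a distribution given as {outcome: probability}."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X) = E(X^2) - (E(X))^2."""
    mu = mean(dist)
    return sum(x * x * p for x, p in dist.items()) - mu * mu

# The fair die and the loaded die from the slides above.
fair = {x: Fraction(1, 6) for x in range(1, 7)}
loaded = {1: Fraction(1, 100), 2: Fraction(4, 100), 3: Fraction(45, 100),
          4: Fraction(45, 100), 5: Fraction(4, 100), 6: Fraction(1, 100)}

print(float(mean(fair)), float(variance(fair)))      # 3.5 and 35/12
print(float(mean(loaded)), float(variance(loaded)))  # 3.5 and 0.53
```

Both dice have the same mean, but the loaded die concentrates its probability on 3 and 4, which is why its variance is so much smaller.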

Estimating Probabilities
Maximum Likelihood Estimator (MLE)
    used to estimate the theoretical probability from a sample
    if a specific event has occurred m times out of n occasions, the MLE probability is m/n
    the larger the number of occasions measured, the more accurate the MLE
sample mean and variance given observations $x_i$
    sample mean $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
    sample variance $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Example
imagine we don't know the population probabilities
    we want to estimate them from a sample

die outcome              1    2    3    4    5    6    total
number of times rolled   16   18   13   16   19   18   100

$\bar{x} = \frac{16 \cdot 1 + \ldots + 18 \cdot 6}{100} = 3.58$
$s^2 = \frac{16 \cdot (1 - 3.58)^2 + \ldots + 18 \cdot (6 - 3.58)^2}{99} \approx 3.05$
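
As a sketch of the estimation step, the following Python computes the MLE probabilities, the sample mean and the sample variance from the table of rolls above. The function names are assumptions made for this illustration.

```python
def sample_mean(counts):
    """Mean of a sample given as {value: number of observations}."""
    n = sum(counts.values())
    return sum(x * c for x, c in counts.items()) / n

def sample_variance(counts):
    """Sample variance with the n - 1 denominator."""
    n = sum(counts.values())
    m = sample_mean(counts)
    return sum(c * (x - m) ** 2 for x, c in counts.items()) / (n - 1)

# Observed rolls from the table above.
rolls = {1: 16, 2: 18, 3: 13, 4: 16, 5: 19, 6: 18}
n = sum(rolls.values())

# MLE for each outcome: m occurrences out of n occasions.
mle = {x: c / n for x, c in rolls.items()}

print(mle[1])                            # 0.16
print(sample_mean(rolls))                # about 3.58
print(round(sample_variance(rolls), 2))  # about 3.05
```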

problem: sparse data
    it's more difficult to estimate the probability of a rare event
    if the corpus doesn't register, say, a rare word, the MLE for the word is 0
possible solution: smoothing
    even though the MLE seems like the right way of estimating probabilities, it's not always the right one
    infrequent events can get too little probability mass
    can redistribute some of the probabilities

Hypothesis Testing
using frequencies, as previously, we might decide that new companies is a collocation because it has a high frequency
however, we don't really think it is one; maybe it's because new and companies are individually frequent, and just appear together by chance
hypothesis testing is a way of assessing whether something is due to chance
has the following procedure:
    formulate a NULL HYPOTHESIS H0, that there is no association beyond chance
    calculate the probability p that the event would occur if H0 were true
    then, if p is too low (usually 0.05 or smaller), reject H0; retain it as a possibility otherwise

The t-Test
want a test that will say how likely or unlikely a certain event is to occur
the t-test compares a sample mean with a population mean, relative to the sample's variability
    $t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}$
where N is the size of the sample, and $\mu$ is the population mean according to the null hypothesis
look up this t-value against a table
    table gives t-value for a given confidence level and a given number of degrees of freedom (= N - 1)

p       d.f. = 1   d.f. = 10   d.f. = 20   d.f. = ∞
0.05    6.314      1.812       1.725       1.645
0.01    31.82      2.764       2.528       2.326
0.005   63.66      3.169       2.845       2.576
0.001   318.3      4.144       3.552       3.091

The t-Test
non-linguistic example:
    H0: the mean height of a population of men is 158cm (vs a population of shorter men)
    sample data: size 200 with $\bar{x} = 169$ and $s^2 = 2600$
    $t = \frac{169 - 158}{\sqrt{2600 / 200}} \approx 3.05$
looking up the table, t > 2.576, so we can reject H0 with 99.5% confidence
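
The height example can be reproduced with a small Python sketch of the t formula. The function name `t_statistic` is an assumption for this illustration; the critical value 2.576 is the one read from the table (p = 0.005, large number of degrees of freedom).

```python
from math import sqrt

def t_statistic(sample_mean, population_mean, sample_variance, n):
    """t = (x_bar - mu) / sqrt(s^2 / N)."""
    return (sample_mean - population_mean) / sqrt(sample_variance / n)

# Sample of 200 men with mean 169 and variance 2600; H0 says the mean is 158.
t = t_statistic(169, 158, 2600, 200)
print(round(t, 2))   # about 3.05

# Compare against the tabulated critical value for p = 0.005.
reject_h0 = t > 2.576
print(reject_h0)
```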

Pearson's Chi-Square Test (+ background: Distributions)

Distributions
a PROBABILITY DISTRIBUTION FUNCTION is a function describing the mapping from random variable values to probabilities
    these can be either discrete (from a finite set) or continuous
we've already seen a UNIFORM DISTRIBUTION (the original die example)
    this was a discrete function $P(X = x) = \frac{1}{n}$ (where n is the number of outcomes)

Gaussian Distribution
another important (continuous) one is the GAUSSIAN (OR NORMAL) DISTRIBUTION
    defined by the function $f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
    population mean is $\mu$, variance $\sigma^2$
a lot of data can be assumed to have this distribution, e.g. heights in a population
the t-test described previously assumes a normal distribution

Bernoulli Distribution
the discrete BERNOULLI DISTRIBUTION measures the probability of success in a yes/no experiment, with this probability called p
    defined by P(X = 1) = p, P(X = 0) = 1 - p
    population mean is p, variance is p(1 - p)

Binomial Distribution
the discrete BINOMIAL DISTRIBUTION measures the probability of the number of successes in a sequence of n independent yes/no experiments (Bernoulli trials), each of which has probability p
    defined by $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$
    population mean is np, variance is np(1 - p)
models things like the probability of getting k heads from n tosses of a fair coin
for large n, can be approximated by the normal distribution
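
A minimal Python sketch of the binomial probability, using the standard-library `math.comb` for the binomial coefficient. The function name `binomial_pmf` and the coin-tossing numbers are chosen for this illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for n independent yes/no trials, each succeeding with probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 2 heads in 4 tosses of a fair coin: 6/16.
print(binomial_pmf(2, 4, 0.5))   # 0.375

# The probabilities over all k sum to 1, and the mean is n * p.
n, p = 10, 0.3
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
print(total, mean)
```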

Zipf Distribution
the ZIPF DISTRIBUTION is a model of Zipf's law, which says that the frequency of any word is roughly inversely proportional to its rank in the frequency table
    in its original form, $P_n \propto \frac{1}{n^a}$, where $P_n$ is the frequency of a word, n is its rank, and a is a constant close to 1 (note that this is not a probability distribution)

The t-Test for Proportions
can extend the t-test to use proportions or counts
    for a text, consider it as a long sequence of N bigrams
    each bigram is either the one we're looking for (success) or not (failure)
    this gives a Bernoulli distribution
linguistic example:
    we have 15,828 occurrences of new and 4,675 of companies, and there were 14,307,668 tokens overall
    $P(\text{new}) = \frac{15828}{14307668}$, $P(\text{companies}) = \frac{4675}{14307668}$
    H0 is that the occurrences of new and companies are independent: $P(\text{new companies}) = P(\text{new}) P(\text{companies}) \approx 3.615 \times 10^{-7}$
    this P(new companies) is $\mu$

The t-Test for Proportions
linguistic example (cont.):
    there are 8 occurrences of new companies, so the sample mean (proportion) is $\bar{x} = p = \frac{8}{14307668} \approx 5.591 \times 10^{-7}$
    variance is p(1 - p), which for very small p is approximately p
    now, the t-test:
    $t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}} \approx \frac{5.591 \times 10^{-7} - 3.615 \times 10^{-7}}{\sqrt{5.591 \times 10^{-7} / 14307668}} \approx 0.999932$
this is below any level of significance in our previous table
so, we don't reject the null hypothesis: we can't say new companies is a collocation

Finding Collocations by t-Test Ranking
some bigrams of frequency 20:

t        C(w1)   C(w2)   C(w1 w2)   w1              w2
4.4721   42      20      20         Ayatollah       Ruhollah
4.4721   41      27      20         Bette           Midler
4.4720   30      117     20         Agatha          Christie
4.4720   77      59      20         videocassette   recorder
4.4720   24      320     20         unsalted        butter
2.3714   14907   9017    20         first           made
2.2446   13484   10570   20         over            many
1.3685   14734   13478   20         into            them
1.2176   14093   14776   20         like            people
0.8036   15019   15629   20         time            last

actually, the t-test will reject only very few possible collocations
    reason will come later ...
    still useful for ranking, however
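
The bigram version of the t-test can be sketched as follows. The function name `bigram_t` is an assumption for this illustration; the corpus size N = 14,307,668 and the counts are the ones quoted in the slides.

```python
from math import sqrt

def bigram_t(c1, c2, c12, n):
    """t-score for the bigram w1 w2 under H0: w1 and w2 occur independently."""
    x_bar = c12 / n              # observed proportion of the bigram
    mu = (c1 / n) * (c2 / n)     # expected proportion under independence
    s2 = x_bar * (1 - x_bar)     # Bernoulli variance, approximately x_bar
    return (x_bar - mu) / sqrt(s2 / n)

N = 14307668

# new companies: C(new) = 15828, C(companies) = 4675, C(new companies) = 8
t_new_companies = bigram_t(15828, 4675, 8, N)
print(round(t_new_companies, 4))   # close to 1.0, below every tabulated critical value

# Ayatollah Ruhollah, the first row of the ranking table
t_ayatollah = bigram_t(42, 20, 20, N)
print(round(t_ayatollah, 4))       # close to 4.4721
```

Sorting candidate bigrams by this score gives a ranking like the one in the table above.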

Hypothesis Testing of Differences
previous hypothesis testing compared a sample value against a postulated corresponding population value
may want to compare sample values from (what you believe are) two different distributions
    null hypothesis H0 is then that these two distributions are actually the same
t-value is calculated by $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
suppose we believe that the words that collocate with strong and with powerful are from different distributions (so H0 is that they are from the same one)
    $\bar{x}_1$ is the probability of the bigram strong w (for some word w)
    $\bar{x}_2$ is the probability of the bigram powerful w

Finding Collocations by t-Test Ranking
can then approximate the t value by
    $t \approx \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\bar{x}_1 + \bar{x}_2}{N}}} = \frac{C(\text{strong } w) - C(\text{powerful } w)}{\sqrt{C(\text{strong } w) + C(\text{powerful } w)}}$
collocations with strong vs powerful (t shown as an absolute value):

t        C(w)   C(strong w)   C(powerful w)   w
3.1622   933    0             10              computers
2.8284   2337   0             8               computer
2.4494   289    0             6               symbol
2.4494   588    0             6               machines
...      ...    ...           ...             ...
7.0710   3685   50            0               support
6.3257   3616   58            7               enough
4.6904   986    22            0               safety
4.5825   3741   21            0               sales
...      ...    ...           ...             ...
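
The count-based approximation is simple enough to check directly. In this Python sketch the function name `t_difference` is an assumption; the counts are taken from the strong vs powerful table.

```python
from math import sqrt

def t_difference(count_1, count_2):
    """Approximate t for the difference between two bigram counts sharing the word w."""
    return (count_1 - count_2) / sqrt(count_1 + count_2)

# (w, C(strong w), C(powerful w)) from the table above.
rows = [("computers", 0, 10), ("support", 50, 0), ("enough", 58, 7)]
for w, c_strong, c_powerful in rows:
    # A positive score favours strong, a negative one favours powerful.
    print(w, round(t_difference(c_strong, c_powerful), 4))
```

The sign shows which adjective the noun prefers; the table lists the magnitudes only.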

Pearson's Chi-Square Test
t-test assumes normal distribution of sample mean probabilities; however, for text this is not true
an alternative for hypothesis testing is the $\chi^2$ test, which does not assume this
simplest form takes a 2 × 2 CONTINGENCY TABLE
    if looking for collocations of w1 and w2, this table will consist of the four combinations of w1, w2, not-w1 and not-w2
$X^2$ aggregates the differences between observed and expected values for these cells
    $X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
where i ranges over table rows, j ranges over table columns, $O_{ij}$ is the observed value for cell (i, j) and $E_{ij}$ is the expected value

Pearson's Chi-Square Test
for the previous example, recall that C(new) = 15,828, C(companies) = 4,675, and C(new companies) = 8, and that there are 14,307,668 tokens in the corpus

                  w1 = new                     w1 ≠ new
w2 = companies    8 (new companies)            4667 (e.g. old companies)
w2 ≠ companies    15820 (e.g. new machines)    14287173 (e.g. old machines)

to calculate e.g. $E_{1,1}$, we use the probability of the first word of a bigram being new, the second word being companies:
    $E_{1,1} = \frac{8 + 4667}{N} \times \frac{8 + 15820}{N} \times N \approx 5.2$
for this example, $X^2 \approx 1.55$
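
The $X^2$ computation for the new companies table can be sketched in Python as below. The function name `chi_square` is an assumption for this illustration; expected values are derived from the row and column totals, as in the calculation of $E_{1,1}$ above.

```python
def chi_square(table):
    """Pearson's X^2 for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: (row total * column total) / N.
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
    return x2

# Rows: w2 = companies, w2 != companies; columns: w1 = new, w1 != new.
table = [[8, 4667],
         [15820, 14287173]]
x2 = chi_square(table)
print(round(x2, 2))   # about 1.55

# Critical value for 1 degree of freedom at p = 0.05 is 3.84.
print(x2 > 3.84)
```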

Pearson's Chi-Square Test
as for the t-test, the $\chi^2$ has an associated number of degrees of freedom
    for a table of dimensions r × c, there are (r - 1)(c - 1) degrees of freedom
we check the distribution for $\chi^2$:

p       d.f. = 1   d.f. = 2   d.f. = 3   d.f. = 4   d.f. = 100
0.99    0.00016    0.020      0.115      0.297      70.06
0.95    0.0039     0.10       0.35       0.71       77.93
0.10    2.71       4.60       6.25       7.78       118.5
0.05    3.84       5.99       7.81       9.49       124.3
0.01    6.63       9.21       11.34      13.28      135.8
0.005   7.88       10.60      12.84      14.86      140.2
0.001   10.83      13.82      16.27      18.47      149.4

the $X^2$ value of 1.55 is less than the critical value of 3.84 for $\alpha = 0.05$ (1 degree of freedom), so we wouldn't reject H0: i.e. we wouldn't take new companies as a collocation, as before with the t-test

Comparison: Chi-Square vs t-Test
for the previous example, there's quite a lot of overlap
    for example, the top 20 bigrams according to the t-test are the same as the top 20 for $\chi^2$
however, $\chi^2$ is also appropriate for large probabilities, where the normality assumption of the t-test fails

Likelihood Ratio Test (+ background: Conditional Probability)

Conditional Probability
we've already in fact used the notion of independent events
    two events are independent of each other if the occurrence of one does not affect the probability of the occurrence of the other
    tossing a coin and winning the lottery: independent
    speeding and having an accident: not independent
conditional probability: the probability that one event occurs given that another event occurs

Example
the following table shows the weather conditions for 100 horse races and how many times Harry won:

          rain   shine
win       15     5
no win    15     65

Harry won 20 out of 100 races: P(win) = 0.2 (by MLE)
the conditional probability of Harry winning given rain is P(win | rain) = 15/30 = 0.5
compare this with the $\chi^2$ test: under the null hypothesis, the observed data was compared against the situation where the words were independent
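
The horse-race table can be turned into a small Python check of the conditional probability and of the independence question. The dictionary layout and variable names are assumptions made for this sketch.

```python
# Counts from the table above: (outcome, weather) -> number of races.
races = {("win", "rain"): 15, ("win", "shine"): 5,
         ("no win", "rain"): 15, ("no win", "shine"): 65}
total = sum(races.values())

p_win = sum(c for (outcome, _), c in races.items() if outcome == "win") / total
p_rain = sum(c for (_, weather), c in races.items() if weather == "rain") / total
p_win_and_rain = races[("win", "rain")] / total

# P(win | rain) = P(win, rain) / P(rain)
p_win_given_rain = p_win_and_rain / p_rain
print(p_win, p_win_given_rain)

# If winning and rain were independent, P(win, rain) would equal P(win) * P(rain).
print(p_win_and_rain, p_win * p_rain)
```

Here the joint probability (0.15) is well above the product of the marginals (0.06), so winning and rain are not independent in this sample.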

Likelihood Ratio
another approach to hypothesis testing
    more appropriate to sparse data than $\chi^2$
    more interpretable also: says how much more likely one hypothesis is than another
here, examine explicitly two hypotheses to explain bigram w1 w2
    Hypothesis 1: $P(w_2 \mid w_1) = p = P(w_2 \mid \neg w_1)$
    Hypothesis 2: $P(w_2 \mid w_1) = p_1 \neq p_2 = P(w_2 \mid \neg w_1)$
Hypothesis 1 represents independence of w1 and w2; Hypothesis 2 represents dependence (and hence a possible collocation)

Likelihood Ratio
we'll use the usual MLEs for $p$, $p_1$, $p_2$, writing $c_1$, $c_2$, $c_{12}$ for the number of occurrences of w1, w2, w1 w2
    $p = \frac{c_2}{N}$, $p_1 = \frac{c_{12}}{c_1}$, $p_2 = \frac{c_2 - c_{12}}{N - c_1}$
we'll also use the notation for a binomial distribution
    $b(k; n, p) = \binom{n}{k} p^k (1 - p)^{n - k}$
now, the likelihoods are
    for Hypothesis 1, $L(H_1) = b(c_{12}; c_1, p) \, b(c_2 - c_{12}; N - c_1, p)$
    for Hypothesis 2, $L(H_2) = b(c_{12}; c_1, p_1) \, b(c_2 - c_{12}; N - c_1, p_2)$

Likelihood Ratio
the log of the likelihood ratio is then $\log \lambda = \log \frac{L(H_1)}{L(H_2)}$
for bigrams of powerful:

-2 log λ   C(w1)   C(w2)   C(w1 w2)   w1            w2
1291.42    12593   932     150        most          powerful
99.31      379     932     10         politically   powerful
82.96      932     934     10         powerful      computers
80.39      932     3424    13         powerful      force
57.27      932     291     6          powerful      symbol
...        ...     ...     ...        ...           ...

the likelihood ratio has an intuitive meaning
    from the previous table, the bigram powerful symbol is $e^{0.5 \times 57.27} \approx 2.729 \times 10^{12}$ times more likely to occur than would be expected by the individual words alone
the value $-2 \log \lambda$ has (asymptotically) a $\chi^2$ distribution
    so you can do hypothesis testing using the $\chi^2$ table

Comparison: Chi-Square and Likelihood Ratio
comparison carried out by Dunning [1993]
$\chi^2$ tends to be less accurate with sparse data
    as a rule of thumb, need a large sample, and counts in each cell (i.e. occurrences of words or bigrams) of at least 5
    the events we're interested in in text (individual words or n-grams) are in fact often less frequent than this: related to the Zipfian distribution of words
    as an example, Dunning selected words from a 500,000 word corpus with frequencies of between 1 and 4; these included words like abandonment, clause, meat, poi and understatement
the log likelihood is more accurate here (but still needs counts of at least 1)
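
The likelihoods $L(H_1)$ and $L(H_2)$ defined above can be combined into $-2 \log \lambda$ with a short Python sketch. The function names are assumptions for this illustration, and the corpus size N = 14,307,668 is assumed to be the same one used in the earlier new companies example. The binomial coefficients are left out because they are identical in the numerator and denominator of the ratio and cancel.

```python
from math import log

def log_binomial_kernel(k, n, p):
    """log of p^k (1 - p)^(n - k); the binomial coefficient cancels in the ratio."""
    result = 0.0
    if k > 0:
        result += k * log(p)
    if n - k > 0:
        result += (n - k) * log(1 - p)
    return result

def minus_2_log_lambda(c1, c2, c12, n):
    """-2 log(L(H1) / L(H2)) for the bigram w1 w2."""
    p = c2 / n
    p1 = c12 / c1
    p2 = (c2 - c12) / (n - c1)
    log_h1 = log_binomial_kernel(c12, c1, p) + log_binomial_kernel(c2 - c12, n - c1, p)
    log_h2 = log_binomial_kernel(c12, c1, p1) + log_binomial_kernel(c2 - c12, n - c1, p2)
    return -2 * (log_h1 - log_h2)

N = 14307668  # assumed corpus size, as in the earlier examples

# powerful symbol: C(powerful) = 932, C(symbol) = 291, C(powerful symbol) = 6
score = minus_2_log_lambda(932, 291, 6, N)
print(round(score, 1))   # close to the tabulated 57.27

# Well above 10.83, the chi-square critical value for p = 0.001 with 1 d.f.
print(score > 10.83)
```

Working in log space avoids the underflow that multiplying the raw binomial probabilities would cause for corpus-sized counts.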

Fisher's Exact Test
the previous tests have all been PARAMETRIC tests: that is, they assume some distribution
it's possible to use a NON-PARAMETRIC test, which makes no distributional assumptions
    trade-off is that it's typically more time-consuming to calculate, and is only feasible for smaller amounts of data
Fisher's Exact Test computes the significance of an observed table by exhaustively computing the probability of every table that would have the same marginal totals
suggested as an alternative to the previous tests by Pedersen [1996]

Fisher's Exact Test
consider again a 2 × 2 contingency table

                  w1 = new                          w1 ≠ new
w2 = companies    $E_{1,1}$ (new companies)         $E_{1,2}$ (e.g. old companies)
w2 ≠ companies    $E_{2,1}$ (e.g. new machines)     $E_{2,2}$ (e.g. old machines)

the probability of obtaining any such set of values is
    $p = \frac{\binom{E_{1,1} + E_{1,2}}{E_{1,1}} \binom{E_{2,1} + E_{2,2}}{E_{2,1}}}{\binom{E_{1,1} + E_{1,2} + E_{2,1} + E_{2,2}}{E_{1,1} + E_{2,1}}}$
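
A minimal Python sketch of the table probability, together with a one-sided p-value obtained by summing over every table with the same marginal totals whose top-left cell is at least as large as the observed one. The function names and the small 2 × 2 table are hypothetical, chosen so that the arithmetic can be checked by hand; this is only practical for small tables.

```python
from math import comb

def table_probability(e11, e12, e21, e22):
    """Probability of one 2 x 2 table given its marginal totals (formula above)."""
    return (comb(e11 + e12, e11) * comb(e21 + e22, e21)
            / comb(e11 + e12 + e21 + e22, e11 + e21))

def fisher_one_sided(e11, e12, e21, e22):
    """Sum the probabilities of all tables with the same margins and top-left cell >= e11."""
    row1, row2, col1 = e11 + e12, e21 + e22, e11 + e21
    total = 0.0
    for k in range(e11, min(row1, col1) + 1):
        total += table_probability(k, row1 - k, col1 - k, row2 - (col1 - k))
    return total

# Hypothetical small table: 3 1 / 1 3.
p_table = table_probability(3, 1, 1, 3)   # 16/70
p_value = fisher_one_sided(3, 1, 1, 3)    # (16 + 1)/70
print(p_table, p_value)
```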

Verb Subcategorisation
verbs express their semantic arguments with different syntactic means
    the class of verbs with semantic arguments theme and recipient has a subcategory expressing these via a direct object and a prepositional phrase: he donated a large sum of money to the church
    a second subcategory permits double objects: he gave the church a large sum of money
these subcategorisation frames are typically not in dictionaries
    might be interested in identifying them via statistics
Brent [1993] developed the system Lerner to assign one of six frames to verbs

Description       Good Example           Bad Example
NP only           greet them             *arrive them
tensed clause     hope he'll attend      *want he'll attend
infinitive        hope to attend         *greet to attend
NP & clause       tell him he's a fool   *yell him he's a fool
NP & infinitive   want him to attend     *hope him to attend
NP & NP           tell him the story     *shout him the story

Algorithm for Learning Subcat Frames
Lerner had two steps:
1 Define cues. Define a regular pattern of words and syntactic categories which indicates the presence of the frame with high certainty (i.e. a low probability of error). For a particular cue $c_j$ we define the probability of error $\epsilon_j$ that indicates how likely we are to make a mistake if we assign frame $f_j$ to verb v based on cue $c_j$.
2 Do hypothesis testing. Initially assume the frame is not appropriate for the verb: this is the null hypothesis H0. We reject H0 if the cue $c_j$ indicates with high probability that H0 is wrong.
example: cue for frame NP only (transitive verb)
    (OBJ | SUBJ_OBJ | CAP) (PUNC | CC)
    OBJ = accusative case personal pronouns; SUBJ_OBJ = nominative or accusative case personal pronouns; CAP = capitalised word; PUNC = punctuation; CC = subordinating conjunction
    positive indicator for transitive verb: consider ... greet/V Peter/CAP ,/PUNC ...

Hypothesis Testing
suppose verb $v_i$ occurs a total of n times in the corpus and that there are $m \le n$ occurrences with a cue for frame $f_j$
assume also some error $\epsilon_j$ in inferring a frame $f_j$ from cue $c_j$
    this suggests a binomial distribution
then reject null hypothesis H0 that $v_i$ does not permit $f_j$ with the following probability of error:
    $p_E = P(v_i \text{ does not permit } f_j \mid C(v_i, c_j) \ge m) = \sum_{r=m}^{n} \binom{n}{r} \epsilon_j^r (1 - \epsilon_j)^{n - r}$
various values for $\epsilon_j$ were assessed
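
The error probability $p_E$ is a binomial tail sum, which can be sketched in Python as follows. The function name `error_probability`, the verb counts and the rejection threshold are all hypothetical values chosen for this illustration, not figures from Brent's system.

```python
from math import comb

def error_probability(n, m, eps):
    """p_E: probability of seeing m or more cues in n occurrences purely through cue error."""
    return sum(comb(n, r) * eps ** r * (1 - eps) ** (n - r) for r in range(m, n + 1))

# Hypothetical verb: 20 occurrences, 5 of them with the cue, cue error rate 0.05.
p_e = error_probability(20, 5, 0.05)
print(p_e)

# Reject H0 (verb does not permit the frame) if p_E falls below a chosen threshold.
threshold = 0.02   # assumed threshold for this sketch
print(p_e < threshold)
```

A lower $\epsilon_j$ makes the same number of cue occurrences stronger evidence for the frame, which is why precision and recall change with $\epsilon_j$ in the results that follow.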

Precision and Recall
typically, when building a statistical model to do something, you want to evaluate how suitable it is
    for example, how well it performs a simple task
the measures of PRECISION and RECALL are one way of doing that
imagine you have a system for sorting your objects of interest into two piles, relevant and irrelevant
    system can make two types of errors: classifying a relevant object as irrelevant, or an irrelevant one as relevant
    system decisions can then be broken into four categories: true positive (TP), false positive (FP), false negative (FN), true negative (TN)

                       system predicts:   system predicts:
                       relevant           irrelevant
actually: relevant     TP                 FN
actually: irrelevant   FP                 TN

Precision and Recall
precision is the proportion of system-predicted relevant objects that are correct: $PRE = \frac{TP}{TP + FP}$
recall is the proportion of actually relevant objects that the system managed to predict as relevant: $REC = \frac{TP}{TP + FN}$
example: there's a set of 200 documents, of which 40 are actually relevant; your system says that 50 are relevant, including 20 of the ones that actually are

                       system predicts:   system predicts:
                       relevant           irrelevant
actually: relevant     TP = 20            FN = 20
actually: irrelevant   FP = 30            TN = 130

then, $PRE = \frac{20}{50}$ and $REC = \frac{20}{40}$

Verb Subcategorisation: Lerner Accuracy
for Lerner, precision and recall values were calculated for various $\epsilon_j$
this is the table for the tensed clause frame

ε_j     TP   FP   TN   FN   MC   %MC   PRE    REC
.0312   13   0    30   20   20   32    1.00   .39
.0156   19   0    30   14   14   22    1.00   .58
.0078   22   1    29   11   12   19    .96    .67
.0039   25   1    29   8    9    14    .96    .76
.0020   27   3    27   6    9    14    .90    .82
.0010   29   5    25   4    9    14    .85    .88
.0005   31   8    22   2    10   16    .79    .94
.0002   31   13   17   2    15   24    .70    .94
.0001   33   19   11   0    19   30    .63    1.00

MC is total misclassified (FP + FN)

F-Measure
there's typically a trade-off between precision and recall
there are a number of ways of combining them into a single measure
one is the F-measure, the harmonic mean of the two (here with equal weights):
    $F = \frac{2 \cdot PRE \cdot REC}{PRE + REC}$
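
The three measures can be sketched in Python and checked against the 200-document example and one row of the Lerner table. The function names are assumptions made for this illustration.

```python
def precision(tp, fp):
    """Proportion of predicted-relevant objects that really are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Proportion of actually relevant objects that the system found."""
    return tp / (tp + fn)

def f_measure(pre, rec):
    """Equally weighted harmonic mean of precision and recall."""
    return 2 * pre * rec / (pre + rec)

# Document example: TP = 20, FP = 30, FN = 20.
pre, rec = precision(20, 30), recall(20, 20)
print(pre, rec, round(f_measure(pre, rec), 3))   # 0.4, 0.5, 0.444

# Lerner row with TP = 22, FP = 1, FN = 11.
lerner_pre, lerner_rec = precision(22, 1), recall(22, 11)
print(round(lerner_pre, 2), round(lerner_rec, 2), round(f_measure(lerner_pre, lerner_rec), 2))
```

Because it is a harmonic mean, the F-measure stays low unless both precision and recall are reasonably high.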

Semantic Similarity
there are a number of resources that group words together by semantic relatedness
    examples are thesauruses, WordNet
    semantic relations are synonymy, hypernymy, etc
    e.g. dog and canine might be in a class together; this might be a hyponym of a class corresponding to animal
you might want to automatically derive classes to capture relations
    for when you have a new unknown word: e.g. if in "Susan had never eaten a fresh durian before" you don't know what kind of thing durian is
    if you want types of classes other than the standard ones

Latent Semantic Indexing (LSI)
in LSI, we look at the interaction of terms and documents
the purpose of this interaction is twofold
    to have the documents tell us which terms should be grouped together
    to have the grouped-together terms tell us about the similarity of the documents
this interaction is described by a matrix, and the grouping is carried out by a process called Singular Value Decomposition (SVD)

Example
say we have 5 terms of interest (cosmonaut, astronaut, moon, car, truck) and 6 documents
we describe their interaction by a matrix A, where cell $a_{ij}$ contains the count of term i in document j

A =
             d1   d2   d3   d4   d5   d6
cosmonaut    1    0    1    0    0    0
astronaut    0    1    0    0    0    0
moon         1    1    0    0    0    0
car          1    0    0    1    1    0
truck        0    0    0    1    0    1

this can be thought of as a five-dimensional space (defined by the terms) with six objects in that space (the documents)
what we want to do is reduce the dimensions, thus grouping similar terms

Dimensionality Reduction
there are many possible types of dimensionality reduction
LSI chooses the mapping that means that the reduced dimensions correspond to the greatest axes of variation
    that is, if the new dimensions are numbered 1 ... k, dimension 1 captures the greatest amount of commonality, dimension 2 the second greatest, and so on
this process is carried out by the matrix operation called Singular Value Decomposition
    here, the term-by-document matrix $A_{t \times d}$ is decomposed into three other matrices
    $A_{t \times d} = T_{t \times n} S_{n \times n} (D_{d \times n})^T$
    this decomposition is (almost) unique

Example

T =
             Dim 1   Dim 2   Dim 3   Dim 4   Dim 5
cosmonaut    -0.44   -0.30    0.57    0.58    0.25
astronaut    -0.13   -0.33   -0.59    0.00    0.73
moon         -0.48   -0.51   -0.37    0.00   -0.61
car          -0.70    0.35    0.15   -0.58    0.16
truck        -0.26    0.65   -0.41    0.58   -0.09

consider the columns ...

Example

S =
2.16   0.00   0.00   0.00   0.00
0.00   1.59   0.00   0.00   0.00
0.00   0.00   1.28   0.00   0.00
0.00   0.00   0.00   1.00   0.00
0.00   0.00   0.00   0.00   0.39

this matrix embodies the weight of the dimensions
    the weights (singular values) always go from largest to smallest

Example

D^T =
         d1      d2      d3      d4      d5      d6
Dim 1   -0.75   -0.28   -0.20   -0.45   -0.33   -0.12
Dim 2   -0.29   -0.53   -0.19    0.63    0.22    0.41
Dim 3    0.28   -0.75    0.45   -0.20    0.12   -0.33
Dim 4    0.00    0.00    0.58    0.00   -0.58    0.58
Dim 5   -0.53    0.29    0.63    0.19    0.41   -0.22

Example
so far we've just transformed the dimensions; now to reduce
    for this example, decide to reduce to 2 dimensions
to look at documents, combine this reduced dimensionality with the weighting of the dimensions
    derive new matrix $B_{2 \times d} = S_{2 \times 2} D^T_{2 \times d}$

B =
         d1      d2      d3      d4      d5      d6
Dim 1   -1.62   -0.60   -0.44   -0.97   -0.70   -0.26
Dim 2   -0.46   -0.84   -0.30    1.00    0.35    0.65

Conceptually ...
can imagine that the terms are made up of semantic particles
    perhaps along the lines of Wierzbicka's semantic primitives
    however, not defined a priori; only a consequence of the relations in the given set of documents
LSI rearranges things so that the terms with the greatest number of semantic particles in common are grouped

Register Analysis
work done by Douglas Biber
    these notes from Biber [1993]
idea: different registers have systematic patterns of variation
    e.g. professional letters vs academic prose
    can do descriptive analyses, based on frequency or proportions of selected characteristics
    however, may also want to identify groups of characteristics distinguishing registers

Descriptive Analysis
example: from Brown corpus, mean frequencies of three dependent clause types (per 1000 words)

register             relative clauses   causative adverbial subordinate clauses   that complement clauses
press reports        4.6                0.5                                       3.4
official documents   8.6                0.1                                       1.6
conversations        2.9                3.5                                       4.1
prepared speeches    7.9                1.6                                       7.6

from this, can see e.g. that relative clauses are common in official documents and prepared speeches relative to conversation
may be interested in grouping many of these characteristics of text together

Dimension Identification
Biber carried out a quantitative analysis of 67 linguistic features in the LOB and London-Lund corpora
    features included: tense and aspect markers, place and time adverbials, pronouns and pro-verbs, nominal forms, prepositional phrases, adjectives, lexical specificity, lexical classes (e.g. hedges, emphatics), modals, specialised verb classes, reduced forms and discontinuous structures, passives, stative forms, dependent clauses, coordination, and questions
    frequencies were counted, and normalised to per-1000 values
then, FACTOR ANALYSIS was carried out
    this is a dimensionality reduction procedure very similar to LSI
    the dimensions similarly end up in decreasing order of explanatory power

Dimension Identification
after inspecting the results of the factor analysis, Biber interpretively labelled the first five dimensions:
1 Informational vs Involved Production
2 Narrative vs Nonnarrative Concerns
3 Elaborated vs Situation-Dependent Reference
4 Overt Expression of Persuasion
5 Abstract vs Nonabstract Style
example of features associated with Dimension 1:

functions               linguistic features           characteristic registers
Monologue               nouns, adjectives             informational exposition
Careful Production      prepositional phrases         e.g. official documents,
Informational           long words                    academic prose
Faceless

Interactive             1st and 2nd person pronouns   conversations
(Inter)personal Focus   questions, reductions         (personal letters)
Involved                stance verbs, hedges          (public conversations)
Personal Stance         emphatics
On-Line Production      adverbial subordination

References
Douglas Biber. Using Register-Diversified Corpora for General Language Studies. Computational Linguistics, 19(2):219-241, 1993.
Michael Brent. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243-262, 1993.
Ted Dunning. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1):61-74, 1993.
Brigitte Krenn and Christer Samuelsson. The Linguist's Guide to Statistics: Don't Panic. URL http://coli.uni-sb.de/christer. Version of December 19, 1997.
Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, USA, 1999.
Ted Pedersen. Fishing for Exactness. In Proceedings of the South-Central SAS Users Group Conference, Austin, TX, USA, 1996.
