Chapter · September 2008 · DOI: 10.13140/2.1.1684.6084


Basic statistics for corpus linguistics∗
Gard B. Jenset
Department of Foreign Languages
University of Bergen

1 Introduction
This tutorial is a short introduction to what statistics is good for, the basis of
statistical thinking, and how some statistical tests can be computed using the
program R, cf. R Development Core Team (2008). The internal mechanics of
the tests and measures presented below are only discussed to the extent that I
consider it necessary for correct use and interpretation. Note that this handout
is not intended as a substitute for a full statistics course where these topics are
treated in more depth. Many important concepts have been omitted here for
lack of time, space or because of their complexity. Statistics typically takes a
while to get used to, and the best way of doing so is through an organized course
spanning at least a full semester. Section 12 below lists some relevant statistics
courses at the Faculty of Humanities at the University of Bergen (UiB) as of the time of writing.

2 What is statistics?
Before getting to the heart of the matter, it is perhaps necessary to clear the way
by stating what statistics isn’t about. First, statistics is not an indicator of how
‘true’ or ‘correct’ the obtained results are. Second, statistics is not primarily
about mathematical calculations. And third, statistics is not a substitute for
informed reasoning.
Rather, statistics is a way of quantifying assumptions, so that they can be
applied to large data sets. Thus, statistics is an indicator of how ‘correct’ your
results are, if you have based the calculations on appropriate assumptions and
interpreted the results correctly – and this is a big ‘if.’ This is a matter of care-
ful consideration and experience, not mechanical application of test procedures.
Furthermore, the calculation of such tests is now a trivial matter, carried out
quickly and accurately with appropriate software. However, the software can
∗ Handout for methods seminar in English linguistics, Fall 2008. I am grateful to Kolbjørn

Slethei and Kari Haugland for valuable comments and suggestions. Any mistakes or misrep-
resentations that persist despite their advice remain my responsibility.

never tell you if you have made some erroneous assumption or violated the con-
ditions of a test: the software crunches numbers, and the validity of the results
depends on the person who entered those numbers into the program. Finally,
statistics can be used in two ways: to describe a data set, or to draw inferences
outside of the data set (descriptive and inferential statistics, respectively). The
conditions for describing or drawing inferences are obviously not the same, and
this means that it is important to define what is being studied, and how the
conditions for a given test are met in the data set.

3 A typology of statistics
Statistics is not one homogenous field, and it is sometimes useful to think of it
in terms of three broad paradigms:
i) The frequentist approach: The results of a statistical test are conceived of
as part of a very long (but hypothetical) series of repeated experiments or
tests.
ii) The Bayesian1 approach: The results of a statistical test are conceptualized
as the conditional probability of an outcome, given some data or
observation.
iii) The explorative analysis approach: This purely descriptive tradition, as
exemplified by for instance correspondence analysis, considers observations
as correlations between categories in an n-dimensional space.
It is probably safe to say that i) has dominated both the practice and teach-
ing of statistics in the 20th Century for a number of reasons that we will not
touch upon here; suffice it to say that this is what is primarily taught in introduc-
tory statistics courses in most universities. ii) has recently been getting more and
more popular, partially because of the increase in computational power avail-
able. However, Bayesian statistics is more properly taught in an intermediate
statistics course, see Gelman, Carlin, Stern, and Rubin (2004) for a comprehen-
sive and useful introduction. iii) has been most widely used within the French
sociology tradition following Benzécri, but the initial work was instigated in an
effort to solve philological problems, cf. Apollon (1990, 195–197) and Nenadic
and Greenacre (2007). Correspondence analysis is a useful descriptive addition
to the other traditions, and it will probably gain more widespread use, since
there are now a number of correspondence analysis packages available in R. For
the present tutorial, however, we will primarily deal with some common (and
some not-so-common) frequentist tests.
1 After the British Presbyterian minister and mathematician Thomas Bayes (1702–1761).

4 Why R?
R is a free, open source implementation of the S programming language (the
commercial version is S-PLUS), and is becoming the de facto standard statistical
toolbox in many academic fields. In
addition to being free, R has a number of advantages over commercial statistical
packages such as SPSS:
• once you get used to the idea of a command line interface, R is much faster
and easier to work with than SPSS.
• R is very flexible and can be used for preparing the data before applying
the statistical tests, that is, it is much more than just a statistical software
package.
• http://cran.r-project.org/ contains a large library of user-contributed
packages for solving various problems. These packages are mostly written
by researchers themselves, who know what problems they want to solve in
their particular field. That is, R provides not only general methods, but
custom methods fitted to various academic fields.
• furthermore, R has many linguistics-specific functions contained in pack-
ages such as languageR by Harald Baayen; other packages like openNLP
are also useful.
• R is more reliable than, say, an online statistics calculator. It is sometimes
difficult to check the reliability of such calculators, and there is no way
of knowing how long a given web page hosting such a calculator will be
available.
• R (like SPSS) produces print quality graphics like figures and charts.

5 Types of data
The notion of data type is crucial to all branches of statistics. Because all
statistical tests make assumptions about types of data (they are quite picky),
it is necessary to decide which type the data at hand most closely correspond
to, in order to choose the most appropriate test. In corpus linguistics, we are
almost always dealing with nominal data.

5.1 Nominal data


Most linguistic data are of the nominal kind. As the name implies, it deals with
named categories of things, like countries, beer types, or syntactic categories.
Such data are unordered, which means that rearranging them does not affect
their information value: listing countries by geographical size or alphabetically
by name does not affect their properties as data. Nominal data are sometimes
referred to as ‘count-data,’ because no arithmetical operations are allowed on
them, they can only be counted (1, 2, 3, . . . , n bottles of beer).

5.2 Ordinal data
Ordinal data are ordered categories of things, and classic examples are score lists
or race results: The winner was the first across the finishing line, but it is not
important by how much he or she beat the competitors; that is, the magnitude
of the difference between each category does not affect the information value it
has for this kind of data. The important thing is the order the data points2
occur in. In linguistic contexts, ordinal data can for instance be the result
of an experiment where the participants are asked to judge the naturalness of
sentences, and rank them according to which sounds ‘best’ and ‘worst.’

5.3 Continuous data


Continuous data are things that can be measured on a continuous scale. This
includes anything that can be measured in centimeters, inches, kilos, pounds,
years, hours, minutes, seconds etc. The key property which differentiates these
kinds of data from the previous ones, is their reducibility. One meter is composed
of 100 centimeters, each of those centimeters is 10 millimeters, each of which
can be measured in micrometers, nanometers and so on. This means that a
number of arithmetic operations can be carried out, such as calculating the
mean value. The average height in a population is easy to interpret in relation
to the height of each person (i.e., each data point). It is less obvious how to
interpret, say, an average number of children (which parts of a whole child are
missing in a ‘0.8 child’ ?). In linguistics, continuous data are mostly found in
psycholinguistic reaction-time experiments where reaction times to linguistic
stimuli are measured in seconds and milliseconds, or in studies where the age of
participants is relevant.3
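As a brief sketch of the arithmetic that continuous data permit (the reaction times below are invented for the illustration), the mean and median are computed in R like this:

```r
# Fictional reaction times in milliseconds (continuous data).
rt <- c(512, 487, 530, 501, 495)
mean(rt)     # arithmetic mean: 505
median(rt)   # the median (501) is also meaningful for continuous data
```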

6 A bit of terminology
A ‘population’ in statistics means a group or collection of entities that we want
to study. Thus, ‘population’ could refer to people, but also light bulbs, car
accidents, university students, or grammatical constructions. A population is
thus not something which occurs naturally – it is defined for the purposes of
the research project.
A ‘sample’ is a subset of the population that we want to study. Sometimes
the sample is carefully collected based on pre-defined criteria. However, we
sometimes have to work with the sample we happen to have available, as in
historical linguistics.
A ‘random sample’ is a sample where every member of the population has
equal probability of being included in the sample. This is not always possible
to achieve, but most statistical tests assume that the sample is drawn randomly
from the population. It is then a matter of interpretation how badly a violation
of this assumption will affect the results.
A ‘distribution’ is a mathematical function which can in some cases serve as a
fair (but not necessarily perfect) model of the population we wish to study. One
such model is the so-called ‘normal’ or Gauss-distribution (used with continuous
data), with a shape more or less like a bell. There are other such distributions,
notably the chi-square (or χ2) distribution used to model the population under
study in chi-square tests (nominal data).

Figure 1: A chi-square distribution
Figure 2: A normal distribution

A ‘null hypothesis’, or H0, is a term used to denote the default assumption
of most statistical tests, namely that all the variation in the sample data is due
to random variation. The null hypothesis is then tested against an alternative
hypothesis, or H1, which typically states that the variation is not due to
chance.

2 In statistics, data are usually referred to in the plural, while a single piece of data is a ‘data point.’
3 There are ways of transforming nominal corpus data into continuous data, through a log-transformation. Such procedures fall outside the scope of this tutorial.

7 Statistical tests
The sections below present some statistical tests as they are implemented in R.
For instructions on how to install and use R in general, see the web page http:
//cran.r-project.org/. As the tutorial is directed towards corpus linguistics,
the presentation will focus on the tests which are most appropriate for nominal
data, the so-called non-parametric tests.

7.1 Pearson’s chi-square


Pearson’s chi-square (often referred to as simply ‘chi-square’) is a commonly
used test in linguistics, because it can handle almost any kind of nominal data.
However, it still assumes that
i) the data are a random sample from the population
ii) the chi-square distribution is a fair model of how the phenomenon under
study is distributed in the population

iii) the expected frequency in each cell is larger than five
iv) you have actual observed frequencies – never do a chi-square test on per-
centages!
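Assumption iii) is easy to check in R, because chisq.test() stores the expected frequencies it computed. As a sketch, using a small fictional 2 × 2 table (the same genre data as in the worked example below):

```r
# A fictional 2 x 2 table of observed frequencies.
x <- matrix(c(45, 34, 67, 82), nrow = 2)

# chisq.test() reports the expected cell counts in its $expected component.
e <- chisq.test(x, correct = FALSE)$expected
e            # expected frequencies under the null hypothesis
all(e > 5)   # TRUE here, so assumption iii) is met
```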
The Pearson chi-square can be used in two ways, as a test of independence /
correlation or as a test of goodness of fit. The goodness of fit test is used to
check whether a set of observations are adequately represented by the chi-square
distribution, and it will not be discussed further here. The test for independence
is based on the following logic:
a) Take two sets or more of some observations in a 2 × 2 or larger table
b) Compare the observations to the chi-square distribution
The aim of this is to test whether the observations in the categories that we have
divided the data into (i.e. the rows and columns of the table) represent random
variation or whether they are caused by the factors represented by the categories.
The underlying assumption is that if the observations in the table (i.e. our
categories) are related only by chance, the observations will match well with
the chi-square distribution in figure 1. Conversely, if the observations do not
match well, then it is assumed that the categories have somehow influenced the
distribution of the observations. The result of a Pearson chi-square test is thus
an answer to a yes/no question: are the observations a random sample from a
single chi-square distributed population – yes or no.
This can be illustrated with the following fictional data:

Genre       NP subject   Clausal subject
Fiction         45             67
Newspaper       34             82
Pearson’s chi-square is computed in R the following way, assuming that x is a 2
× 2 table with nominal data, created like this (the <- sign is R’s assignment
operator which assigns the material on the right hand side to a short-hand
variable):
(1) x <- matrix(c(45, 34, 67, 82), nrow = 2)
which produces the following output when x is entered into R:4

[,1] [,2]
[1,] 45 67
[2,] 34 82

The chi-square test is then computed like this:5


(2) cx <- chisq.test(x)
4 In statistical terminology, this is called a ‘contingency table.’
5 Result as reported by R: X-squared = 2.5119, df = 1, p-value = 0.113

In some circumstances, R will apply the Yates correction for continuity to the
Pearson chi-square test. The issue is somewhat complicated, but there are good
reasons not to use the Yates corrected chi-square. In order to tell R not to use
it, write:6
(3) cxx <- chisq.test(x, correct = FALSE)
where the argument correct = FALSE turns off the Yates’ correction.

How are the results of a Pearson chi-square test to be interpreted? The
Pearson chi-square p-value indicates the probability of obtaining the entire set
of observations in the table, provided that the observations are a random sample
from the population, and that the null-hypothesis is appropriate. In other words,
the p-value indicates whether the null-hypothesis (the set of observations is a
random selection from a single, chi-square distributed population) should be
rejected (low p, in linguistics and the social sciences often somewhat arbitrarily
set to p < 0.05) or whether we should choose not to reject the null hypothesis
(p > 0.05). There is often an implicit alternative hypothesis of the form that
the set of observed values come from two (or more) different populations.
In the example above, the result of the uncorrected Pearson chi-square was
p = 0.0847. Since this is larger than the threshold of 0.05, the
result would normally be considered an example of random variation and thus
not significant. That is, in this case we cannot reliably differentiate between
random variation (noise) and interaction effects (information). But note that
the obtained p-value is also quite close to the conventional 5% threshold.
As pointed out above, the Pearson chi-square assumes that we have a random
sample from the entire population we want to generalize to. But what if this is
not the case? In this case, we need to interpret the results with more care, and
take into consideration the size of the sample in relation to the entire population
as well as the effect size (see below), instead of blindly trusting in the chi-square
p-value. Note that the p-value does not say anything about the association
between the observed values, it refers to the whole set of observations in relation
to a larger population (for between-observation association, see the section on
effect size below).

The proper way to report the results of a Pearson chi-square test is to include
all the following information:
• the chi-square value (reported as ‘X-squared’ in R),
• the df-value (stands for ‘degrees of freedom’, this is a complicated concept
which falls outside the scope of this tutorial),
• whether Yates’ correction for continuity was used,
6 Note the different result, as reported by R: X-squared = 2.9725, df = 1, p-value = 0.0847

• the p-value (this should be the value as reported by the test, not e.g.
p < 0.05 or p > 0.05)7

7.2 Fisher’s exact test


Fisher’s exact test8 has been vigorously promoted in linguistics in recent years
by Anatol Stefanowitsch and Stefan Th. Gries, cf. Stefanowitsch and Gries
(2003) and Gries and Stefanowitsch (2004) et seq., following an earlier proposal
by Pedersen (1996).
Traditionally, the Fisher exact test is treated as equivalent with the Pearson
chi-square, but used in cases where the Pearson chi-square is considered inap-
propriate, notably with very small sample sizes (n < 20) or in cases where the
expected table cell values are smaller than five.
The test has certain advantages and certain limitations. Among its advan-
tages are that it is less conservative than the Pearson chi-square, that is, it can
more easily detect a real relationship in the data. Furthermore, the Fisher exact
test p-value can be interpreted as a reasonable measure of the size of the observed
effect, i.e., the strength of association between the variables for purposes
of comparison, cf. footnote 6 in Stefanowitsch and Gries (2003, 238–239). In
this case, care should be taken to make sure that there could plausibly exist a
real association or dependency in the data in the first place.
Like in the Pearson chi-square, the R-format of the Fisher exact test is:
(4) fx <- fisher.test(x)
When the Fisher exact test is run with the x table above as its argument, we get
quite a lot of information from R:
data: x
p-value = 0.09575
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval: 0.9015428 2.9181355
sample estimates:
odds ratio
1.61635

For the present purposes we will ignore most of this information, and simply
consider the p-value, which in this case is 0.096. As with the Pearson chi-
square, this is normally taken to indicate that the result is not significant given
the conventional threshold of 0.05.

7 An argument could also be made in favor of the opposite case, i.e. reporting smaller
than/greater than 0.05. However, in this matter I choose to follow the conventions in Wilkinson
(1999).
8 In this context, ‘exact’ refers to how the test is computed – it has nothing to do with the
reliability of the test.

The Fisher exact p-value can be interpreted as the likelihood of obtaining
the observed table, or a table with ‘more extreme’ (essentially larger differences)
observations. Additionally, the p-value gives a relative effect size adjusted for
the observed frequencies in the table. In the context of corpus linguistics, the
most obvious role for the Fisher exact test is to measure dependencies between
collocations, or in the case of Stefanowitsch and Gries, dependencies between
words and constructions. Note that it is not a given that the results of a Fisher
exact test can be extended beyond the corpus, due to the mathematical assumptions
it is based on. Stefanowitsch and Gries do so anyway, but through
an explicitly psychological, or psycholinguistic, interpretation of their object of
study, thus illustrating both the limitations of the test and how to overcome
them.

7.3 Rank tests


This handout is primarily directed towards corpus linguistics, but as mentioned
in section 5 above, we sometimes deal with ordinal data in linguistics, typically
in the context of an experimental or sociolinguistic study.9 In principle, we could
test this kind of data using a chi-square test, but such an approach would ignore
useful information, namely the fact that the data points are ordered (as opposed
to the unordered nominal data). In the research situations mentioned above,
there are three rank tests that are appropriate, the Wilcoxon, Mann-Whitney U,
and Kruskal-Wallis rank tests, which are all variations on a common theme.10
These tests are based on mathematical operations on the rank values, which are
then compared to distribution models. In this workshop we will only discuss
the Mann-Whitney U test and the Wilcoxon test, both of which are (somewhat
confusingly) implemented in R as a version of the Wilcoxon test:
(5) m <- wilcox.test()

7.3.1 Mann-Whitney U
Consider the following situation, adapted from the example in Hinton (2004,
222–223): In a survey, native speakers from two different areas have been asked
to judge how ‘good’ or ‘acceptable’ a specific construction sounds in their dialect
on a scale from 0 to 100. The result is as follows:
9 The discussion of rank tests relies primarily on Hinton (2004, 216–229).
10 The differences between the tests are as follows: the Wilcoxon test is designed for comparing
two related samples; the Mann-Whitney U test is designed for comparing two unrelated
samples; the Kruskal-Wallis test is designed for testing more than two samples.

Tinytown   Megacity
   43         67
   34         82
   14         33
   62         46
              22
              75

This is tested in R in the following manner, by assigning the judgments for each
area to a vector, x and y respectively, and entering them into the formula as
follows:
x <- c(43,34,14,62)
y <- c(67,82,33,46,22,75)
m <- wilcox.test(x, y)
The result is p = 0.8714, which would usually be taken to indicate that there is
no real difference between the two areas in their judgments – in fact, they are
almost identical.

7.3.2 Wilcoxon
Now consider a slightly different scenario, adapted from the example in Hinton
(2004, 228–229), where a group of subjects are asked to rank two different
constructions using a scale of 1 to 20:

Subject construction 1 construction 2


George 17 8
Adele 20 11
Ray 8 15
Noam 18 5
Suzanne 17 15
Jim 11 5
Cynthia 16 8
Háj 23 6
Seana 8 9
Paul 21 8
These columns of scores can be considered ‘related’ since for each row, both
judgments were made by the same test subject. The procedure in R is the same
as above, with two minor modifications.
First, the paired = TRUE argument tells R that the two samples are related,
i.e., that it is a Wilcoxon test. Second, R will by default attempt to compute
the exact significance, which will result in warning messages if the size in rank
difference between two categories is the same. The solution is to tell R not
to use an exact method, like this: m1 <- wilcox.test(x1, y1, paired =
TRUE, exact = FALSE).

x1 <- c(17,20,8,18,17,11,16,23,8,21)
y1 <- c(8,11,15,5,15,5,8,6,9,8)
m1 <- wilcox.test(x1, y1, paired = TRUE, exact = FALSE)
The result is p = 0.025, suggesting (again based on the conventional threshold
of 0.05) that the subjects have a systematic preference for one construction
over the other (i.e., there is a real difference in the subjects’ rating of the two
constructions). Judging by the differences in rank sums, it seems that the
subjects find construction 1 more acceptable than construction 2.11

7.4 Student’s t-test, ANOVA, and parametric rank tests


These tests are so-called parametric tests designed for continuous data, and fall
outside the scope of the present tutorial. How and when to use them is taught
in all introductory statistics courses, such as the ones listed in section 12. In the
context of corpus linguistics, their use is somewhat questionable, and the reader
should be aware that it is regrettably not unusual to find these tests employed
in ways which do not fit well with their assumptions.

8 Effect size
What is the importance of effect size, or association strength? Generally, the
p-value of a statistical test says nothing about the size of the observed effect
in the data, that is, the association between variables in the data. Rather, the
p-value tests the hypothesis that the distribution in the data is a random sample
from a population which has the properties of some mathematical distribution
(e.g. the chi-square). That is, the p-value indicates how likely we would be to
observe the data – the full set of data – in this table if we assume that the
population follows a chi-square distribution and if the data in our matrix is a
random sample from some population.
Whether these assumptions hold or not, is often a question of interpretation.
However, the main reason why effect size is important is this:12
In corpus linguistics, the chi-square p-value addresses a different
question than the one we want to answer!
As Kilgarriff (2005) has pointed out, we usually know that the data in the corpus
we want to study is not a random collection of words with a chi-square distri-
bution. Applying a chi-square test is thus to attempt to measure something the
chi-square p-value was never intended to measure. To understand this problem
with the p-value, you need to know that the chi-square p depends very much
on the size of the sample (n, that is, the sum of all cells in the table): as the
11 This is based on the sums of negative and positive differences. For instance, if the subject

rates construction 1 over construction 2, the difference will be positive (14 − 10 = 4, while
12 − 14 = −2). A full explanation of this procedure falls outside the scope of this tutorial.
12 This applies primarily to corpus linguistics. In an experimental study things are a little

different.

sample size grows, the p-value will inevitably grow smaller; the end result being
a high number of false indications of statistical significance. Essentially, if you
fail to reject the null-hypothesis (i.e. you get a p-value which is larger than 0.05
– remember, we want small p-values) in corpus linguistics, this might simply
be an effect of a small sample size (n) – it says virtually nothing about the
size of the association between the variables we want to investigate. Thus,
null-hypothesis testing in corpus linguistics is problematic, and the problem is
not solved simply by acquiring a bigger corpus.
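This sample-size dependence is easy to demonstrate in R. The sketch below reuses the fictional genre table from section 7.1 and multiplies every cell by ten, so that the proportions stay identical while n grows tenfold:

```r
small <- matrix(c(45, 34, 67, 82), nrow = 2)
big   <- small * 10   # same proportions, ten times the sample size

chisq.test(small, correct = FALSE)$p.value   # about 0.085: not significant
chisq.test(big,   correct = FALSE)$p.value   # far below 0.05: 'significant'
```

The association between rows and columns is exactly the same in both tables; only the sample size has changed.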
What we need instead is some test or measure which indicates the magnitude
of difference when we observe 34 in one cell and 82 in another cell, or which
can tell us how much the information in one of the columns contributes to the
overall result. Put differently, when we observe 34 in one table cell and 82
in another table cell, how can we quantify the tendency of the factors involved
to go in the same (or opposite) directions? With the possible exception of the
Fisher exact test, cf. 7.2 above, the statistical tests we have looked at so far
need to be augmented by some kind of effect size measure to give us this kind of
information. In this section two such useful measures are introduced; however,
there are many more such measures in use in the social and behavioral sciences,
and no ‘gold standard’ currently exists.

8.1 Phi and Cramér V


Phi (or φ) is computed based on the chi-square value. Recall that the chi-square
p-value is very sensitive to the sample size (n). Phi and Cramér V ‘factor out’
the size of the sample, and give the ‘average’ contribution of rows and columns
(that is, the categories in the table and their respective observations in the rows
and columns) to the final result. Phi has certain weaknesses when the table gets
bigger than 2 × 2, however, and the Cramér V is a generalized version of Phi,
cf. Cramér (1946, 282); Cohen (1988, 223–225); Gries (2005, 280).
Essentially, Phi is restricted to 2 × 2 tables, whereas Cramér V can be used
on larger tables. Note that in the 2 × 2 table case the tests are identical, cf.
(6) and (8) below. The Cramér V is implemented as a default test in SPSS, but
not in R. However, it is easy to compute. The formula is as follows:
(6) V = √(χ2 / (n × (k − 1)))
χ2 is the computed test statistic from the uncorrected Pearson chi-square, n
is the total sample size (i.e. the sum of all cells in the matrix), and k is the
smaller of either the number of rows or the number of columns. Converted into
R code, this can be calculated quite efficiently as follows (assuming the same
matrix vector x as above):13
13 The procedure for computing the Cramér V in R was posted by Marc Schwartz on
http://www.biostat.wustl.edu/archives/html/s-news/2003-09/msg00185.html

(7) cv <- sqrt(chisq.test(x, correct = FALSE)$statistic /
(sum(x) * min(dim(x) - 1)))
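For repeated use, the code in (7) can be wrapped in a small function. A minimal sketch (the name cramer_v is my own, not part of base R):

```r
# A reusable wrapper around the Cramér V computation in (7).
cramer_v <- function(tab) {
  chi2 <- chisq.test(tab, correct = FALSE)$statistic
  as.numeric(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

cramer_v(matrix(c(45, 34, 67, 82), nrow = 2))   # approximately 0.114
```

For a 2 × 2 table the result is identical to Phi, as noted above.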
It is possible to save some typing by converting this code into a script and
loading the script into R. If you are only
working with 2 × 2 tables, Phi is even easier to compute:
(8) Phi = √(χ2 / n)
(Phi is simpler because the reason for introducing (6) was to test more complex
cases, i.e., cases where the table is larger than 2 × 2). Phi can be computed as
follows:
(9) f <- sqrt(chisq.test(x, correct = F)$statistic / sum(x))
Cramér V and Phi14 can be interpreted as follows: The test computes the mean
percent difference in clustering of observations between rows and columns. That
is, the test measures how closely the observations in rows and columns are
associated with each other. If we apply the Phi measure to the ‘x’ table above,
the result is as follows:
(10) Phi = √(2.9725 / 228) = 0.114
The result, 0.114 or 11.4 %, indicates a mutual association between rows
and columns of approximately eleven percent.15 Whether this is a
large effect size is a matter of interpretation. As a rule of thumb – it cannot be
stressed enough that this is only a guideline, not a fixed rule – according to Fleiss
et al. (2003, 99), effects of less than 30 or 35 % indicate a trivial association.
However, opinions vary and Cohen (1988, 224–226) considers effects smaller
than 1 % as trivial, 10-20 % small, 20-50 % medium sized, and anything over
50 % as large. Remember that these are percentages, i.e., 0 % indicates no
association whatsoever whereas 100 % indicates perfect association; essentially
a result close to 100 % means that the observed results are entirely due to (or
explained by) the categories of the investigation (i.e. rows/columns).
However, other factors should influence the interpretation of the effect size,
notably:
• the size of the sample – a small sample is almost always a bad representa-
tion of the population. Thus, whether the observed effect can be applied
to the entire population needs careful interpretation.
14 The subtle differences that exist in the interpretation of the two measures fall outside the
scope of this tutorial.
15 It is not entirely uncontroversial to think of the effect size results discussed in this handout
as percentages. I choose to do so anyway because it gives an intuitive grasp of the relative
size of the effect, but not everyone would agree with me in this. The reader is hereby warned.

• how much data is missing? If you know that a lot of data is missing, this
should influence the interpretation.
• what type of study are you conducting? The interpretation of the Phi/Cramér
V should differ in a corpus based syntax study, an experimental situation,
or the evaluation of a sociolinguistic survey.

Note that both these measures are symmetric, that is, they give you both
the association of rows with columns and columns with rows. Often this is fine,
but sometimes we want to measure asymmetric relationships – this is discussed
in the section on the Goodman-Kruskal lambda below.

8.2 Goodman-Kruskal lambda


Unlike Phi and Cramér V, the Goodman-Kruskal lambda is not based on the
chi-square statistic. Instead, it is based on the probability of ‘guessing’ the right
result in the table cells if you know something about the categories of the data.
See Goodman and Kruskal (1979) for the original proposals for this measure.
The Goodman-Kruskal lambda is a so-called ‘measure of error reduction’ which
is a useful variation to consider when one is looking for asymmetric associations
(i.e. rows are strongly associated with columns, but columns are not strongly
associated with rows; or vice versa). Consider another fictional example, with
the use of different constructions in historical periods:16

                Period 1   Period 2   Period 3
Construction1         82         11          3
Construction2         39          2          9

The constructions in question appear to be declining over the observed time
span. But how much of the variation is explained by the time period, and how
much is explained by the internal variation between the two constructions? By
taking the time periods (columns) as the potentially explanatory factor (the
independent variable), we do a lambda test to check how much of the temporal
variation is associated with the constructions.
First, create the table above in R:
(11) lx <- matrix(c(82, 39, 11, 2, 3, 9), nrow = 2)
The Goodman-Kruskal lambda (or λB) is not implemented as a default test in
R, but can be computed as follows:
(12) lb <- (sum(apply(lx, 2, max)) - max(rowSums(lx))) / (sum(lx) - max(rowSums(lx)))
16 We are of course ignoring the problem of whether it is meaningful to directly compare
linguistic data from different historical periods.
The R code in (12) above is an implementation of a mathematical formula from
Siegel and Castellan (1988, 299). However, the mathematical reasoning behind
this measure is slightly more complex than that of the Cramér V, and it will
not be explained in depth here.
Note that the code above assumes that lx is a matrix of nominal data
where the rows represent the observations and where the columns of the matrix
contain the classes, i.e., the independent variable.17

Goodman-Kruskal lambda can be interpreted as follows: this test measures
how much the potential error of predicting the observed results can be reduced
by looking at additional columns (or classes). Put differently, if we are trying
to predict the distribution of row observations other than the one with the
highest frequency (i.e. the variation), how much would knowing the classes
(columns) help us? In the case above, the result is 0.12, or 12 %. That is, in
this case information about time period is only moderately helpful in explaining
the variation (conversely, if the test is done with the rows as the independent
variable, the result is 0, indicating that knowing the construction does not help
at all in predicting the time period). In other words, the Goodman-Kruskal
lambda can be used to assess
to what extent (measured in percent) each variable in either rows or columns
contributes to the effect observed on the other variable. Note that this test is
not particularly suited for 2 × 2 tables, or tables where the observations are
very evenly distributed.
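The computation in (12) can be wrapped in a small reusable function. The following is a sketch based on the same formula from Siegel and Castellan (1988, 299); the function name gk_lambda and its margin argument are my own additions, not part of base R:

```r
## Goodman-Kruskal lambda for a matrix of counts; margin = 2 treats the
## columns as the independent variable, margin = 1 the rows.
gk_lambda <- function(tab, margin = 2) {
  ## Sum of the dependent-variable maxima within each class
  within_max <- sum(apply(tab, margin, max))
  ## Marginal totals of the dependent variable
  dep_totals <- if (margin == 2) rowSums(tab) else colSums(tab)
  (within_max - max(dep_totals)) / (sum(tab) - max(dep_totals))
}

lx <- matrix(c(82, 39, 11, 2, 3, 9), nrow = 2)
gk_lambda(lx, margin = 2)  # 0.12: periods (columns) as independent variable
gk_lambda(lx, margin = 1)  # 0: constructions (rows) as independent variable
```

Note that the function changes both the apply() margin and the marginal sums (rowSums versus colSums) together, which is what keeps the two asymmetric versions of the measure consistent.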

8.3 Effect size measures for ordinal data


There are a number of effect size measures available for ordinal data; examples
include Spearman's rho (ρ) and Kendall's tau (τ). Although only Kendall's tau
is discussed below, the R function cor.test() can be used to compute several
measures; see the R help files for details.18 To compute Kendall's tau on the
grammaticality judgments in section 7.3.2, we use the same vectors x1 and y1:
(13) r <- cor.test(x1, y1, method = "kendall",
     alternative = "two.sided", exact = FALSE)
In the code above, method refers to the type of test, alternative = "two.sided"
means that we had no indication before the experiment which construction
would be rated highest, and exact = FALSE is necessary because there are ties
in the rank sums. The output is as follows:

        Kendall's rank correlation tau

data:  x1 and y1
z = -0.9223, p-value = 0.3564
alternative hypothesis: true tau is not equal to 0
sample estimates:
       tau
-0.2411214

17 The code apply(...2...) instructs R to treat the columns as the independent variable.
Similarly, the rows can be treated as the independent variable by writing apply(...1...)
and replacing rowSums() with colSums().
18 Open R and type ?cor.test
An in-depth discussion of all the output above falls outside the scope of this
tutorial, and we will only consider the tau value.

The Kendall tau is always a number between −1 and 1, where −1 indicates
negative association (i.e. disagreement), 1 indicates positive association (i.e.
agreement), and 0 indicates no association. Formally, Kendall's tau is the
difference between the sum of actual rank scores and potential maximum rank
scores, which makes this a good measure of the size of the observed effect. The
value obtained above, −0.24 or 24 %, indicates a weak to moderate negative
association, or difference in acceptability, between the two constructions. Siegel
and Castellan (1988, 245) recommend Kendall's tau for measuring agreement
between raters; see Kendall (1938) for the first proposal of this measure.
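Since the vectors x1 and y1 from section 7.3.2 are not repeated here, the following sketch uses hypothetical rating vectors to illustrate the full computation in (13); only the R calls themselves are from the text above:

```r
## Hypothetical grammaticality ratings (1-7 scale) from ten informants;
## x1 and y1 stand in for the vectors used in section 7.3.2.
x1 <- c(5, 6, 4, 7, 5, 6, 3, 5, 6, 4)
y1 <- c(4, 3, 5, 2, 4, 3, 5, 4, 2, 5)

## exact = FALSE because the ties in the ratings rule out an exact p-value
res <- cor.test(x1, y1, method = "kendall",
                alternative = "two.sided", exact = FALSE)
res$estimate  # Kendall's tau; negative here, since the two ratings disagree
```

If only the effect size is needed, cor(x1, y1, method = "kendall") returns the same tau estimate without the accompanying test.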

9 P -values and research questions


It is crucial to keep in mind that the result of a statistical test cannot answer
your research questions for you: you need to interpret the statistical results, see
Woods et al. (1986, 127–131) for a good introductory discussion of this problem.
This process of interpretation can be more or less difficult, depending on some
factors:
i) Are the research questions well operationalized – i.e., have you spelled out
how you think your hypothesis relates to the data in terms of frequencies
or magnitudes?
ii) Do you have all the relevant information (i.e., are there other factors that
could influence the outcome)?
iii) How well do your data match the assumptions of the statistical test?
Basically, i) is your responsibility – the researcher conducting a study is respon-
sible for clarifying how the empirical and statistical results can be interpreted
as having explanatory value with regards to a research question.
ii) is obviously a matter of interpretation – what is ‘enough’ information
about the sample, the population, any missing data etc? As a rule of thumb,
the information should be sufficient to let you make good operationalizations.
iii) is a very difficult problem to handle, and very often a statistical test is
used in a way which does not match its assumptions well. It is important to keep
in mind, however, that even when the assumptions match almost perfectly, you
still need to (or ought to) explain your reasons for using a specific statistical test
– ‘everyone else does it’ is not an acceptable reason! How badly your data violate
the assumptions of a given test and how this will influence your interpretations

of the results in relation to the research questions is a factor of uncertainty
which must be dealt with in any case.
It might then be tempting to ask: what is the point of doing a statistical
analysis at all? The answer is simple: there is a world of difference between
interpreting the result of a statistical test and interpreting raw frequencies. The
human mind is not particularly well equipped to process complex frequency
data in a reliable, unbiased way. Consequently, an appropriate statistical test –
whatever its shortcomings – is in most cases preferable over raw frequencies as
the basis for quantitative, scientific analysis.

10 What is not covered in this tutorial?


As mentioned previously, this tutorial is restricted to a few nonparametric tests
and measures within the frequentist tradition. For a particular research project,
there might be useful tests and measures to be found among the parametric
tests, as well as in the Bayesian and correspondence analysis traditions. In
most cases it would be advisable to follow a formal course in statistics such
as one of the courses listed in section 12 below. Below are some examples of
important concepts that were omitted for reasons of space, but this is in no way
an exhaustive list:
There is a lot more to be said about data types than the brief exposition in
section 5. For instance, the problem of data source – as opposed to data type –
has not been touched upon at all, but is nevertheless important.
Furthermore, the question of sample and population is quite complex in
most real research projects and requires far more attention than it was
given in section 6. Yet another important – but omitted – aspect of
statistical testing is one-sided versus two-sided tests.
All of these concepts typically require more attention than they could pos-
sibly receive in a short workshop. Again, the best solution would be to follow a
regular course where these issues can be treated with the attention they require.

11 Relevant literature
For a gentle, non-numerical introduction to statistical thinking, Rowntree (1981)
is a good place to start. A more in-depth consideration of statistical methods
is presented in an easily accessible way in Hinton (2004). Specifically linguistic
applications of statistics are briefly introduced in Núñez (2007), and compre-
hensively treated in Baayen (2008) and Johnson (2008). Of the last two books,
Baayen’s is without doubt the most advanced, but also the one least suited for
a novice to statistics.
For an in-depth understanding of some of the issues pertaining to the inter-
pretation of statistics, it is necessary to go beyond introductory books. Articles
such as Cohen (1994), Johnson (1999), Tversky and Kahneman (1971), Upton
(1982) and Upton (1992) contain extremely valuable discussions on the choice

and use of statistical tests as well as the interpretation of p-values. For the rather
complicated question of effect size measures, Cohen (1988) is still a classic,
but Siegel and Castellan (1988) and Fleiss, Levin, and Paik (2003) also discuss
this question (along with many other questions) in a relatively accessible way.
Kempthorne (1979) contains an excellent discussion of the question of data ori-
gin. For the question of population and sample in linguistics, Woods et al. (1986,
48–57) is a good place to start, but Clark (1973) and Tversky and Kahneman
(1971) offer invaluable refinements.
Finally, Wilkinson (1999) is highly recommended as a guide to good practice
in handling and presenting statistics for research.

12 Statistics courses at the faculty of humanities


• dasp 106 “Statistikk og kognisjonsforskning” – 5 credits: A brief intro-
duction to parametric and nonparametric frequentist tests.
• dasp 302 “Statistisk metode” – 10 credits: A more in-depth introduction
to parametric and nonparametric frequentist tests.
• huin 308 “Statistikk for HF-fag” – 15 credits: Identical to dasp 302, but
with added coursework on correspondence analysis and Bayesian statistics.

For an updated list of statistics courses offered at the faculty of humanities,
see the course listings at http://studentportal.uib.no/.

References
Apollon, D. (1990). Dataanalytiske metoder i filologien. In O. E. Haugen and
E. Thomassen (Eds.), Den filologiske vitenskap, pp. 181–208. Oslo: Solum
forlag.
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to
statistics using R. Cambridge: Cambridge University Press.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language
statistics in psychological research. Journal of verbal learning and verbal be-
havior 12 (4), 335–359.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale, NJ: Lawrence Erlbaum.
Cohen, J. (1994). The earth is round (p < .05). American psychologist 49 (12),
997–1003.
Cramér, H. (1946). Mathematical methods of statistics. Princeton: Princeton
University Press.

Fleiss, J. L., B. Levin, and M. C. Paik (2003). Statistical methods for rates and
proportions (3rd ed.). Hoboken, NJ: Wiley.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (2004). Bayesian data
analysis (2nd ed.). Boca Raton, FL.: Chapman & Hall/CRC.

Goodman, L. A. and W. H. Kruskal (1979). Measures of association for cross
classifications. New York: Springer.
Gries, S. T. (2005). Null-hypothesis significance testing of word frequencies: A
follow-up on Kilgarriff. Corpus linguistics and linguistic theory 1 (2), 277–294.
Gries, S. T. and A. Stefanowitsch (2004). Extending collostructional analysis.
International journal of corpus linguistics 9 (1), 97–129.
Hinton, P. R. (2004). Statistics explained (2nd ed.). London: Routledge.
Johnson, D. H. (1999). The insignificance of statistical significance testing. The
journal of wildlife management 63 (3), 763–772.
Johnson, K. (2008). Quantitative methods in linguistics. Oxford: Blackwell
Publishing.
Kempthorne, O. (1979). In dispraise of the exact test: Reactions. Journal of
statistical planning and inference 3, 199–213.
Kendall, M. (1938). A new measure of rank correlation. Biometrika 30 (1-2),
91–93.
Kilgarriff, A. (2005). Language is never, ever, ever, random. Corpus linguistics
and linguistic theory 1 (2), 263–276.
Nenadic, O. and M. Greenacre (2007). Correspondence analysis in R, with
two- and three-dimensional graphics: The ca package. Journal of Statistical
Software 20 (3), 1–13.
Núñez, R. (2007). Inferential statistics in the context of empirical cognitive lin-
guistics. In M. Gonzalez-Marquez, I. Mittelberg, S. Coulson, and M. J. Spivey
(Eds.), Methods in Cognitive Linguistics, pp. 87–118. Amsterdam: John Ben-
jamins Publishing Company.
Pedersen, T. (1996). Fishing for exactness. In Proceedings of the South Central
SAS User’s Group (SCSUG-96) Conference, pp. 188–200.
R Development Core Team (2008). R: A Language and Environment for Statis-
tical Computing. Vienna, Austria: R Foundation for Statistical Computing.
Rowntree, D. (1981). Statistics without tears. London: Penguin Books.
Siegel, S. and N. J. Castellan (1988). Nonparametric statistics for the behavioral
sciences (2nd ed.). New York: McGraw-Hill.

Stefanowitsch, A. and S. T. Gries (2003). Collostructions: Investigating the
interaction of words and constructions. International journal of corpus lin-
guistics 8 (2), 209–243.
Tversky, A. and D. Kahneman (1971). Belief in the law of small numbers.
Psychological bulletin 76 (2), 105–110.

Upton, G. J. (1982). A comparison of the alternative tests for the 2 × 2 comparative
trial. Journal of the Royal Statistical Society. Series A (General) 145 (1),
86–105.
Upton, G. J. (1992). Fisher's exact test. Journal of the Royal Statistical Society.
Series A (Statistics in Society) 155 (3), 395–402.
Wilkinson, L. (1999). Statistical methods in psychology journals: Guidelines
and explanations. American psychologist 54 (8), 594–604.
Woods, A., P. Fletcher, and A. Hughes (1986). Statistics in language studies.
Cambridge: Cambridge University Press.
