Gard Jenset
1 Introduction
This tutorial is a short introduction to what statistics is good for, the basis of
statistical thinking, and how some statistical tests can be computed using the
program R, cf. R Development Core Team (2008). The internal mechanics of
the tests and measures presented below are only discussed to the extent that I
consider it necessary for correct use and interpretation. Note that this handout
is not intended as a substitute for a full statistics course where these topics are
treated in more depth. Many important concepts have been omitted here for
lack of time, space or because of their complexity. Statistics typically takes a
while to get used to, and the best way of doing so is through an organized course
spanning at least a full semester. Section 12 below lists some relevant statistics
courses at the Faculty of Humanities at the UiB as of the time of writing.
2 What is statistics?
Before getting to the heart of the matter, it is perhaps necessary to clear the way
by stating what statistics isn’t about. First, statistics is not an indicator of how
‘true’ or ‘correct’ the obtained results are. Second, statistics is not primarily
about mathematical calculations. And third, statistics is not a substitute for
informed reasoning.
Rather, statistics is a way of quantifying assumptions, so that they can be
applied to large data sets. Thus, statistics is an indicator of how ‘correct’ your
results are, if you have based the calculations on appropriate assumptions and
interpreted the results correctly – and this is a big ‘if.’ This is a matter of care-
ful consideration and experience, not mechanical application of test procedures.
Furthermore, the calculation of such tests is now a trivial matter, carried out
quickly and accurately with appropriate software. However, the software can
∗ Handout for methods seminar in English linguistics, Fall 2008. I am grateful to Kolbjørn
Slethei and Kari Haugland for valuable comments and suggestions. Any mistakes or misrep-
resentations that persist despite their advice remain my responsibility.
never tell you if you have made some erroneous assumption or violated the con-
ditions of a test: the software crunches numbers, and the validity of the results
depends on the person who entered those numbers into the program. Finally,
statistics can be used in two ways: to describe a data set, or to draw inferences
outside of the data set (descriptive and inferential statistics, respectively). The
conditions for describing or drawing inferences are obviously not the same, and
this means that it is important to define what is being studied, and how the
conditions for a given test are met in the data set.
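To anticipate the R notation introduced below, the descriptive/inferential distinction can be illustrated with a toy example (the numbers are invented):

```r
lengths <- c(4, 7, 5, 6, 8)   # word lengths from some hypothetical sample
mean(lengths)                 # descriptive: summarizes this data set only
t.test(lengths, mu = 5)       # inferential: generalizes beyond the data set
```

A one-sample t-test is used here purely for illustration; the tests discussed later in this tutorial are non-parametric.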
3 A typology of statistics
Statistics is not one homogeneous field, and it is sometimes useful to think of it
in terms of three broad paradigms:
i) The frequentist approach: The results of a statistical test are conceived of
as part of a very long (but hypothetical) series of repeated experiments or
tests.
ii) The Bayesian1 approach: The results of a statistical test are conceptualized
as the conditional probability of an outcome, given some data or observation.
iii) The explorative analysis approach: This purely descriptive tradition, as
exemplified by for instance correspondence analysis, considers observations
as correlations between categories in an n-dimensional space.
It is probably safe to say that i) has dominated both the practice and teach-
ing of statistics in the 20th Century for a number of reasons that we will not
touch upon here; suffice it to say that this is what is primarily taught in introduc-
tory statistics courses in most universities. ii) has recently been getting more and
more popular, partially because of the increase in computational power avail-
able. However, Bayesian statistics is more properly taught in an intermediate
statistics course, see Gelman, Carlin, Stern, and Rubin (2004) for a comprehen-
sive and useful introduction. iii) has been most widely used within the French
sociology tradition following Benzécri, but the initial work was instigated in an
effort to solve philological problems, cf. Apollon (1990, 195–197) and Nenadic
and Greenacre (2007). Correspondence analysis is a useful descriptive addition
to the other traditions, and it will probably gain more widespread use, since
there are now a number of correspondence analysis packages available in R. For
the present tutorial, however, we will primarily deal with some common (and
some not-so-common) frequentist tests.
1 After the British Presbyterian minister and mathematician Thomas Bayes (1702–1761).
4 Why R?
R is a free, open source implementation of the S programming language (of which
S-plus is a commercial version), and it is becoming the de facto standard
statistical toolbox in many academic fields. In
addition to being free, R has a number of advantages over commercial statistical
packages such as SPSS:
• once you get used to the idea of a command line interface, R is much faster
and easier to work with than SPSS.
• R is very flexible and can be used for preparing the data before applying
the statistical tests, that is, it is much more than just a statistical software
package.
• http://cran.r-project.org/ contains a large library of user-contributed
packages for solving various problems. These packages are mostly written
by researchers themselves, who know what problems they want to solve in
their particular field. That is, R provides not only general methods, but
custom methods fitted to various academic fields.
• furthermore, R has many linguistics-specific functions contained in pack-
ages such as languageR by Harald Baayen; other packages like openNLP
are also useful.
• R is more reliable than, say, an online statistics calculator. It is sometimes
difficult to check the reliability of such calculators, and there is no way
of knowing how long a given web page hosting such a calculator will be
available.
• R (like SPSS) produces print quality graphics like figures and charts.
5 Types of data
The notion of data type is crucial to all branches of statistics. Because all
statistical tests make assumptions about types of data (they are quite picky),
it is necessary to decide which type the data at hand most closely correspond
to, in order to choose the most appropriate test. In corpus linguistics, we are
almost always dealing with nominal data.
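In R, nominal data are typically represented as character vectors or factors and summarized with table(); the categories have no inherent order or magnitude. A minimal sketch with invented data:

```r
# hypothetical corpus observations: each token labelled with a nominal category
verbs <- c("go", "run", "go", "walk", "go", "run")
table(verbs)  # frequencies per category; the categories are unordered
```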
5.2 Ordinal data
Ordinal data are ordered categories of things, and classic examples are score lists
or race results: The winner was the first across the finishing line, but it is not
important by how much he or she beat the competitors; that is, the magnitude
of the difference between each category does not affect the information value it
has for this kind of data. The important thing is the order the data points2
occur in. In linguistic contexts, ordinal data can for instance be the result
of an experiment where the participants are asked to judge the naturalness of
sentences, and rank them according to which sounds ‘best’ and ‘worst.’
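In R, such judgments can be stored as an ordered factor, which records the ranking but says nothing about the distances between the categories (the labels are invented for illustration):

```r
# hypothetical naturalness judgments on a three-point ordinal scale
judgments <- factor(c("worst", "middle", "best", "middle"),
                    levels = c("worst", "middle", "best"),
                    ordered = TRUE)
judgments[1] < judgments[3]  # TRUE: the order is meaningful, the magnitudes are not
```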
6 A bit of terminology
A ‘population’ in statistics means a group or collection of entities that we want
to study. Thus, ‘population’ could refer to people, but also light bulbs, car
accidents, university students, or grammatical constructions. A population is
thus not something which occurs naturally – it is defined for the purposes of
the research project.
A ‘sample’ is a subset of the population that we want to study. Sometimes
the sample is carefully collected based on pre-defined criteria. However, we
sometimes have to work with the sample we happen to have available, like in
historical linguistics.
A ‘random sample’ is a sample where every member of the population has
equal probability of being included in the sample. This is not always possible
to achieve, but most statistical tests assume that the sample is drawn randomly
2 In statistics, data are usually referred to in plural, while a single piece of data is a ‘data
point.’
3 There are ways of transforming nominal corpus data into continuous data, through a
Figure 1: A chi-square distribution
Figure 2: A normal distribution
7 Statistical tests
The sections below present some statistical tests as they are implemented in R.
For instructions on how to install and use R in general, see the web page http:
//cran.r-project.org/. As the tutorial is directed towards corpus linguistics,
the presentation will focus on the tests which are most appropriate for nominal
data, the so-called non-parametric tests.
iii) the expected frequencies in each cell are larger than five
iv) you have actual observed frequencies – never do a chi-square test on per-
centages!
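Condition iii) can be checked directly in R, since chisq.test() stores the expected frequencies in its result (a sketch, using the fictional 2 × 2 table discussed below):

```r
# the fictional 2 x 2 table used later in this tutorial (filled column-wise)
x <- matrix(c(45, 34, 67, 82), nrow = 2)
chisq.test(x)$expected  # all cells should be (well) above five
```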
The Pearson chi-square can be used in two ways, as a test of independence /
correlation or as a test of goodness of fit. The goodness of fit test is used to
check whether a set of observations are adequately represented by the chi-square
distribution, and it will not be discussed further here. The test for independence
is based on the following logic:
a) Take two or more sets of observations in a 2 × 2 or larger table
b) Compare the observations to the chi-square distribution
The aim of this is to test whether the observations in the categories that we have
divided the data into (i.e. the rows and columns of the table) represent random
variation or whether the variation is caused by the factors represented by the categories.
The underlying assumption is that if the observations in the table (i.e. our
categories) are related only by chance, the observations will match well with
the chi-square distribution in figure 1. Conversely, if the observations do not
match well, then it is assumed that the categories have somehow influenced the
distribution of the observations. The result of a Pearson chi-square test is thus
an answer to a yes/no question: are the observations a random sample from a
single chi-square distributed population – yes or no.
This can be illustrated with the following fictional data:
[,1] [,2]
[1,] 45 67
[2,] 34 82
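This table can be entered into R and tested as follows (note that matrix() fills its values column by column):

```r
x <- matrix(c(45, 34, 67, 82), nrow = 2)  # the fictional table above
chisq.test(x)  # by default, R applies the Yates correction to 2 x 2 tables
```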
In some circumstances, R will apply the Yates correction for continuity to the
Pearson chi-square test. The issue is somewhat complicated, but there are good
reasons not to use the Yates corrected chi-square. In order to tell R not to use
it, write:6
(3) cxx <- chisq.test(x, correct = FALSE)
where the argument correct = FALSE turns off Yates’ correction.
How are the results of a Pearson chi-square test to be interpreted? The
Pearson chi-square p-value indicates the probability of obtaining the entire set
of observations in the table, provided that the observations are a random sample
from the population, and that the null-hypothesis is appropriate. In other words,
the p-value indicates whether the null-hypothesis (the set of observations is a
random selection from a single, chi-square distributed population) should be
rejected (low p, in linguistics and the social sciences often somewhat arbitrarily
set to p < 0.05) or whether we should choose to not reject the null hypothesis
(p > 0.05). There is often an implicit alternative hypothesis of the form that
the set of observed values come from two (or more) different populations.
In the example above, the result of the uncorrected Pearson chi-square was p
= 0.0847. Since this is larger than the threshold of 0.05, the
result would normally be considered an example of random variation and thus
not significant. That is, in this case we cannot reliably differentiate between
random variation (noise) and interaction effects (information). But note that
the obtained p-value is also quite close to the conventional 5% threshold.
As pointed out above, the Pearson chi-square assumes that we have a random
sample from the entire population we want to generalize to. But what if this is
not the case? In this case, we need to interpret the results with more care, and
take into consideration the size of the sample in relation to the entire population
as well as the effect size (see below), instead of blindly trusting in the chi-square
p-value. Note that the p-value does not say anything about the association
between the observed values, it refers to the whole set of observations in relation
to a larger population (for between-observation association, see the section on
effect size below).
The proper way to report the results of a Pearson chi-square test is to include
all the following information:
• the chi-square value (reported as ‘X-squared’ in R),
• the df-value (stands for ‘degrees of freedom’, this is a complicated concept
which falls outside the scope of this tutorial),
• whether Yates’ correction for continuity was used,
6 Note the different result, as reported by R: X-squared = 2.9725, df = 1, p-value = 0.0847
• the p-value (this should be the value as reported by the test, not e.g.
p < 0.05 or p > 0.05)7
For the present purposes we will ignore most of this information, and simply
consider the p-value, which in this case is 0.096. As with the Pearson chi-
square, this is normally taken to indicate that the result is not significant given
the conventional threshold of 0.05.
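The value above can be reproduced by applying Fisher’s exact test to the same fictional table (a sketch; fisher.test() is the relevant R function):

```r
x <- matrix(c(45, 34, 67, 82), nrow = 2)
f <- fisher.test(x)
f$p.value  # approximately 0.096
```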
7 An argument could also be made in favor of the opposite case, i.e. reporting smaller
than/greater than 0.05. However, in this matter I choose to follow the conventions in
Wilkinson (1999).
8 In this context, ‘exact’ refers to how the test is computed – it has nothing to do with the
The Fisher Exact p-value can be interpreted as the likelihood of obtaining
the observed table, or a table with ‘more extreme’ (essentially larger differences)
observations. Additionally, the p-value gives a relative effect size adjusted for
the observed frequencies in the table. In the context of corpus linguistics, the
most obvious role for the Fisher exact test is to measure dependencies between
collocations, or in the case of Stefanowitsch and Gries, dependencies between
words and constructions. Note that it is not a given that the results of a Fisher
exact test can be extended beyond the corpus, due to the mathematical as-
sumptions it is based on. Stefanowitsch and Gries do so anyway, but through
an explicitly psychological, or psycholinguistic interpretation of their object of
study, thus illustrating both the limitations of the test and how to overcome
them.
7.3.1 Mann-Whitney U
Consider the following situation, adapted from the example in Hinton (2004,
222–223): In a survey, native speakers from two different areas have been asked
to judge how ‘good’ or ‘acceptable’ a specific construction sounds in their dialect
on a scale from 0 to 100. The result is as follows:
9 The discussion of rank tests relies primarily on Hinton (2004, 216–229).
10 The differences between the tests are as follows: the Wilcoxon test is designed for
comparing two related samples; the Mann-Whitney U test is designed for comparing two
unrelated samples; the Kruskal-Wallis test is designed for testing more than two samples.
Tinytown  Megacity
43        67
34        82
14        33
62        46
          22
          75
This is tested in R in the following manner, by assigning the judgments for each
area to a vector, x and y respectively, and entering them into the formula as
follows:
x <- c(43, 34, 14, 62)
y <- c(67, 82, 33, 46, 22, 75)
m <- wilcox.test(x, y)
The result is p = 0.8714, which would usually be taken to indicate that there is
no real difference between the two areas in their judgments – in fact, they are
almost identical.
7.3.2 Wilcoxon
Now consider a slightly different scenario, adapted from the example in Hinton
(2004, 228–229), where a group of subjects are asked to rank two different
constructions using a scale of 1 to 20:
x1 <- c(17, 20, 8, 18, 17, 11, 16, 23, 8, 21)
y1 <- c(8, 11, 15, 5, 15, 5, 8, 6, 9, 8)
m1 <- wilcox.test(x1, y1, paired = TRUE, exact = FALSE)
The result is p = 0.025, suggesting (again based on the conventional threshold
of 0.05) that the subjects have a systematic preference for one construction
over the other (i.e., there is a real difference in the subjects’ rating of the two
constructions). Judging by the differences in rank sums, it seems that the
subjects find construction 1 more acceptable than construction 2.11
8 Effect size
What is the importance of effect size, or association strength? Generally, the
p-value of a statistical test says nothing about the size of the observed effect
in the data, that is, the association between variables in the data. Rather, the
p-value tests the hypothesis that the distribution in the data is a random sample
from a population which has the properties of some mathematical distribution
(e.g. the chi-square). That is, the p-value indicates how likely we would be to
observe the data – the full set of data – in this table if we assume that the
population follows a chi-square distribution and if the data in our matrix is a
random sample from some population.
Whether these assumptions hold or not is often a question of interpretation.
However, the main reason why effect size is important is this:12
In corpus linguistics, the chi-square p-value addresses a different
question than the one we want to answer!
As Kilgarriff (2005) has pointed out, we usually know that the data in the corpus
we want to study is not a random collection of words with a chi-square distri-
bution. Applying a chi-square test is thus to attempt to measure something the
chi-square p-value was never intended to measure. To understand this problem
with the p-value, you need to know that the chi-square p depends very much
on the size of the sample (n, that is, the sum of all cells in the table): as the
11 This is based on the sums of negative and positive differences. For instance, if the subject
rates construction 1 over construction 2, the difference will be positive (14 − 10 = 4, while
12 − 14 = −2). A full explanation of this procedure falls outside the scope of this tutorial.
12 This applies primarily to corpus linguistics. In an experimental study things are a little
different.
sample size grows, the p-value will inevitably grow smaller; the end result being
a high number of false indications of statistical significance. Essentially, if you
fail to reject the null-hypothesis (i.e. you get a p-value which is larger than 0.05
– remember, we want small p-values) in corpus linguistics, this might simply
be an effect of a small sample size (n) – it says virtually nothing about
the size of the association between the variables we want to investigate. Thus,
null-hypothesis testing in corpus linguistics is problematic, and the problem is
not solved simply by acquiring a bigger corpus.
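The sample-size dependence is easy to demonstrate: multiplying every cell of the fictional table by ten leaves the proportions (and hence the effect size) unchanged, but multiplies the chi-square statistic by ten and shrinks the p-value dramatically:

```r
x   <- matrix(c(45, 34, 67, 82), nrow = 2)
x10 <- x * 10                             # same proportions, ten times the data
chisq.test(x,   correct = FALSE)$p.value  # ~0.085: not 'significant'
chisq.test(x10, correct = FALSE)$p.value  # far below 0.05: 'significant'
```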
What we need instead is some test or measure which indicates the magnitude
of difference when we observe 34 in one cell and 82 in another cell, or which
can tell us how much the information in one of the columns contributes to the
overall result. Put differently, when we observe 34 in one table cell and 82
in another table cell, how can we quantify the tendency of the factors involved
to go in the same (or opposite) directions? With the possible exception of the
Fisher exact test, cf. 7.2 above, the statistical tests we have looked at so far
need to be augmented by some kind of effect size measure to give us this kind of
information. In this section two such useful measures are introduced; however,
there are many more such measures in use in the social and behavioral sciences,
and no ‘gold standard’ currently exists.
(7) cv <- sqrt(chisq.test(x, correct = FALSE)$statistic / (sum(x) * min(dim(x) - 1)))
It is possible to save some typing by converting this code into a script and
loading the script into R. This will not be covered here, though. If you are only
working with 2 × 2 tables, Phi is even easier to compute:

(8) Phi = √(χ² / n)

(Phi is simpler because the reason for introducing (6) was to test more complex
cases, i.e., cases where the table is larger than 2 × 2.) Phi can be computed as
follows:
(9) f <- sqrt(chisq.test(x, correct = FALSE)$statistic / sum(x))
Cramér V and Phi14 can be interpreted as follows: the measure computes the mean
percent difference in clustering of observations between rows and columns. That
is, it measures how closely the observations in rows and columns are associated
with each other. If we apply the Phi measure to the ‘x’ table above, the result
is as follows:

(10) Phi = √(2.9725 / 228) = 0.114
The result, 0.114 or 11.4 %, indicates a mutual association between rows
and columns of approximately eleven percent.15 Whether this is a
large effect size is a matter of interpretation. As a rule of thumb – it cannot be
stressed enough that this is only a guideline, not a fixed rule – according to Fleiss
et al. (2003, 99), effects of less than 30 or 35 % indicate a trivial association.
However, opinions vary and Cohen (1988, 224–226) considers effects smaller
than 1 % as trivial, 10-20 % small, 20-50 % medium sized, and anything over
50 % as large. Remember that these are percentages, i.e., 0 % indicates no
association whatsoever whereas 100 % indicates perfect association; essentially
a result close to 100 % means that the observed results are entirely due to (or
explained by) the categories of the investigation (i.e. rows/columns).
However, other factors should influence the interpretation of the effect size,
notably:
• the size of the sample – a small sample is almost always a bad representa-
tion of the population. Thus, whether the observed effect can be applied
to the entire population needs careful interpretation.
14 The subtle differences that exist in the interpretation of the two measures fall outside the
• how much data is missing? If you know that a lot of data is missing, this
should influence the interpretation.
• what type of study are you conducting? The interpretation of the Phi/Cramér
V should differ in a corpus based syntax study, an experimental situation,
or the evaluation of a sociolinguistic survey.
Note that both these measures are symmetric, that is, they give you both
the association of rows with columns and columns with rows. Often this is ok,
but sometimes we want to measure asymmetric relationships – this is discussed
in the section on the Goodman-Kruskal lambda below.
The R code in (12) above is an implementation of a mathematical formula from
Siegel and Castellan (1988, 299). However, the mathematical reasoning behind
this measure is slightly more complex than that of the Cramér V, and it will
not be explained in depth here.
Note that the code above assumes that x again is a matrix of nominal data
where rows represent observations and where the columns of the matrix contain
the classes, i.e., the independent variable.17
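The same computation can be sketched as a small function (the name gk.lambda is my own; as described above, the columns of x are treated as the independent variable):

```r
# Goodman-Kruskal lambda: the proportional reduction in error when predicting
# the row category with, versus without, knowledge of the column category
gk.lambda <- function(x) {
  best.without <- max(rowSums(x))        # best guess ignoring the columns
  best.with   <- sum(apply(x, 2, max))   # best guess within each column
  (best.with - best.without) / (sum(x) - best.without)
}
gk.lambda(matrix(c(45, 34, 67, 82), nrow = 2))  # about 0.098
```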
able. Similarly, the rows can be treated as the independent variable by simply writing
apply(...1...).
18 Open R and type ?cor.test
sample estimates:
tau
-0.2411214
An in-depth discussion of all the output above falls outside the scope of this
tutorial, and we will only consider the tau value.
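The output shown above can be reproduced with cor.test(), reusing the vectors x1 and y1 from the Wilcoxon example (R will warn that an exact p-value cannot be computed in the presence of ties):

```r
x1 <- c(17, 20, 8, 18, 17, 11, 16, 23, 8, 21)
y1 <- c(8, 11, 15, 5, 15, 5, 8, 6, 9, 8)
cor.test(x1, y1, method = "kendall")  # tau = -0.2411214
```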
The Kendall tau is always a number between −1 and 1, where −1 indicates
negative association (i.e. disagreement), 1 indicates positive association (i.e.
agreement), and 0 indicates no association. Formally, Kendall’s tau is the dif-
ference between the sum of actual rank scores and potential maximum rank
scores, which makes this a good measure of the size of the observed effect. The
value obtained above, −0.24 or 24 %, indicates a weak to moderate negative
association, or difference in acceptability, between the two constructions. Siegel
and Castellan (1988, 245) recommend Kendall’s tau for measuring agreement
between raters, see Kendall (1938) for the first proposal of this measure.
of the results in relation to the research questions is a factor of uncertainty
which must be dealt with in any case.
It might then be tempting to ask: what is the point of doing a statistical
analysis at all? The answer is simple: there is a world of difference between
interpreting the result of a statistical test and interpreting raw frequencies. The
human mind is not particularly well equipped to process complex frequency
data in a reliable, unbiased way. Consequently, an appropriate statistical test –
whatever its shortcomings – is in most cases preferable to raw frequencies as
the basis for quantitative, scientific analysis.
11 Relevant literature
For a gentle, non-numerical introduction to statistical thinking, Rowntree (1981)
is a good place to start. A more in-depth consideration of statistical methods
is presented in an easily accessible way in Hinton (2004). Specifically linguistic
applications of statistics are briefly introduced in Núñez (2007), and compre-
hensively treated in Baayen (2008) and Johnson (2008). Of the last two books,
Baayen’s is without doubt the most advanced, but also the one least suited for
a novice to statistics.
For an in-depth understanding of some of the issues pertaining to the inter-
pretation of statistics, it is necessary to go beyond introductory books. Articles
such as Cohen (1994), Johnson (1999), Tversky and Kahneman (1971), Upton
(1982) and Upton (1992) contain extremely valuable discussions on the choice
and use of statistical tests as well as the interpretation of p-values. For the rather
complicated question of effect size measures, Cohen (1988) is still a classic,
but Siegel and Castellan (1988) and Fleiss, Levin, and Paik (2003) also discuss
this question (along with many other questions) in a relatively accessible way.
Kempthorne (1979) contains an excellent discussion of the question of data ori-
gin. For the question of population and sample in linguistics, Woods et al. (1986,
48–57) is a good place to start, but Clark (1973) and Tversky and Kahneman
(1971) offer invaluable refinements.
Finally, Wilkinson (1999) is highly recommended as a guide to good practice
in handling and presenting statistics for research.
References
Apollon, D. (1990). Dataanalytiske metoder i filologien. In O. E. Haugen and
E. Thomassen (Eds.), Den filologiske vitenskap, pp. 181–208. Oslo: Solum
forlag.
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to
statistics using R. Cambridge: Cambridge University Press.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language
statistics in psychological research. Journal of verbal learning and verbal be-
havior 12 (4), 335–359.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale, NJ: Lawrence Erlbaum.
Cohen, J. (1994). The earth is round (p < .05). American psychologist 49 (12),
997–1003.
Cramér, H. (1946). Mathematical methods of statistics. Princeton: Princeton
University Press.
Fleiss, J. L., B. Levin, and M. C. Paik (2003). Statistical methods for rates and
proportions (3rd ed.). Hoboken, NJ: Wiley.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (2004). Bayesian data
analysis (2nd ed.). Boca Raton, FL.: Chapman & Hall/CRC.
Stefanowitsch, A. and S. T. Gries (2003). Collostructions: Investigating the
interaction of words and constructions. International journal of corpus lin-
guistics 8 (2), 209–243.
Tversky, A. and D. Kahneman (1971). Belief in the law of small numbers.
Psychological bulletin 76 (2), 105–110.