
# Basic statistics for corpus linguistics

Gard B. Jenset, Department of Foreign Languages, University of Bergen

Introduction

This tutorial is a short introduction to what statistics is good for, the basis of statistical thinking, and how some statistical tests can be computed using the program R, cf. R Development Core Team (2008). The internal mechanics of the tests and measures presented below are only discussed to the extent that I consider it necessary for correct use and interpretation. Note that this handout is not intended as a substitute for a full statistics course where these topics are treated in more depth. Many important concepts have been omitted here for lack of time, space or because of their complexity. Statistics typically takes a while to get used to, and the best way of doing so is through an organized course spanning at least a full semester. Section 12 below lists some relevant statistics courses at the faculty of humanities at the UiB as of the time of writing.

What is statistics?

Before getting to the heart of the matter, it is perhaps necessary to clear the way by stating what statistics isn't about. First, statistics is not an indicator of how true or correct the obtained results are. Second, statistics is not primarily about mathematical calculations. And third, statistics is not a substitute for informed reasoning. Rather, statistics is a way of quantifying assumptions, so that they can be applied to large data sets. Thus, statistics is an indicator of how correct your results are only if you have based the calculations on appropriate assumptions and interpreted the results correctly, and this is a big if. This is a matter of careful consideration and experience, not mechanical application of test procedures. Furthermore, the calculation of such tests is now a trivial matter, carried out quickly and accurately with appropriate software. However, the software can never tell you if you have made some erroneous assumption or violated the conditions of a test: the software crunches numbers, and the validity of the results depends on the person who entered those numbers into the program. Finally, statistics can be used in two ways: to describe a data set, or to draw inferences outside of the data set (descriptive and inferential statistics, respectively). The conditions for describing or drawing inferences are obviously not the same, and this means that it is important to define what is being studied, and how the conditions for a given test are met in the data set.

Handout for methods seminar in English linguistics, Fall 2008. I am grateful to Kolbjørn Slethei and Kari Haugland for valuable comments and suggestions. Any mistakes or misrepresentations that persist despite their advice remain my responsibility.

A typology of statistics

Statistics is not one homogeneous field, and it is sometimes useful to think of it in terms of three broad paradigms:

i) The frequentist approach: The results of a statistical test are conceived of as part of a very long (but hypothetical) series of repeated experiments or tests.
ii) The Bayesian¹ approach: The results of a statistical test are conceptualized as the conditional probability of an outcome, given some data or observation.
iii) The explorative analysis approach: This purely descriptive tradition, as exemplified by for instance correspondence analysis, considers observations as correlations between categories in an n-dimensional space.

It is probably safe to say that i) has dominated both the practice and teaching of statistics in the 20th century for a number of reasons that we will not touch upon here; suffice it to say that this is what is primarily taught in introductory statistics courses in most universities. ii) has recently become more and more popular, partly because of the increase in available computational power. However, Bayesian statistics is more properly taught in an intermediate statistics course; see Gelman, Carlin, Stern, and Rubin (2004) for a comprehensive and useful introduction. iii) has been most widely used within the French sociology tradition following Benzécri, but the initial work was instigated in an effort to solve philological problems, cf. Apollon (1990, 195–197) and Nenadic and Greenacre (2007). Correspondence analysis is a useful descriptive addition to the other traditions, and it will probably gain more widespread use, since there are now a number of correspondence analysis packages available in R. For the present tutorial, however, we will primarily deal with some common (and some not-so-common) frequentist tests.

1 After the British Presbyterian minister and mathematician Thomas Bayes (1702–1761).

Why R?

R is a free, open-source implementation of the S programming language (S-PLUS being its commercial counterpart), and is becoming the de facto standard statistical toolbox in many academic fields. In addition to being free, R has a number of advantages over commercial statistical packages such as SPSS:

- Once you get used to the idea of a command line interface, R is much faster and easier to work with than SPSS.
- R is very flexible and can be used for preparing the data before applying the statistical tests; that is, it is much more than just a statistical software package.
- http://cran.r-project.org/ contains a large library of user-contributed packages for solving various problems. These packages are mostly written by researchers themselves, who know what problems they want to solve in their particular field. That is, R provides not only general methods, but custom methods fitted to various academic fields. Furthermore, R has many linguistics-specific functions contained in packages such as languageR by Harald Baayen; other packages like openNLP are also useful.
- R is more reliable than, say, an online statistics calculator. It is sometimes difficult to check the reliability of such calculators, and there is no way of knowing how long a given web page hosting such a calculator will be available.
- R (like SPSS) produces print-quality graphics such as figures and charts.

Types of data

The notion of data type is crucial to all branches of statistics. Because all statistical tests make assumptions about types of data (they are quite picky), it is necessary to decide which type the data at hand most closely correspond to, in order to choose the most appropriate test. In corpus linguistics, we are almost always dealing with nominal data.

5.1 Nominal data

Most linguistic data are of the nominal kind. As the name implies, such data deal with named categories of things, like countries, beer types, or syntactic categories. Such data are unordered, which means that rearranging them does not affect their information value: listing countries by geographical size or alphabetically by name does not affect their properties as data. Nominal data are sometimes referred to as count data because no arithmetical operations are allowed on them; they can only be counted (1, 2, 3, ..., n bottles of beer).

5.2 Ordinal data

Ordinal data are ordered categories of things, and classic examples are score lists or race results: the winner was the first across the finishing line, but it is not important by how much he or she beat the competitors; that is, the magnitude of the difference between each category does not affect the information value it has for this kind of data. The important thing is the order the data points² occur in. In linguistic contexts, ordinal data can for instance be the result of an experiment where the participants are asked to judge the naturalness of sentences, and rank them according to which sounds best and worst.

5.3 Continuous data

Continuous data are things that can be measured on a continuous scale. This includes anything that can be measured in centimeters, inches, kilos, pounds, years, hours, minutes, seconds etc. The key property which differentiates these kinds of data from the previous ones is their reducibility. One meter is composed of 100 centimeters, each of those centimeters is 10 millimeters, each of which can be measured in micrometers, nanometers and so on. This means that a number of arithmetic operations can be carried out, such as calculating the mean value. The average height in a population is easy to interpret in relation to the height of each person (i.e., each data point). It is less obvious how to interpret, say, an average number of children (which parts of a whole child are missing in a 0.8 child?). In linguistics, continuous data are mostly found in psycholinguistic reaction-time experiments where reaction times to linguistic stimuli are measured in seconds and milliseconds, or in studies where the age of participants is relevant.³
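The three data types map fairly directly onto R's own data structures. The following is only a minimal sketch with made-up values; the variable names (subject_type, rating, reaction_ms) and the data are illustrative assumptions, not taken from any corpus or experiment.

```r
# Nominal data: unordered categories; the only meaningful operation is counting.
subject_type <- factor(c("NP", "clausal", "NP", "clausal", "clausal"))
counts <- table(subject_type)

# Ordinal data: ordered categories; the order matters, the distances do not.
rating <- factor(c("bad", "good", "ok", "good"),
                 levels = c("bad", "ok", "good"), ordered = TRUE)
worst_to_best <- sort(rating)

# Continuous data: arithmetic such as the mean is meaningful.
reaction_ms <- c(512.4, 630.1, 498.7, 575.0)
mean_rt <- mean(reaction_ms)
```

Calling mean() on subject_type or rating would not give a meaningful number, which mirrors the restriction on arithmetic for nominal and ordinal data described above.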

A bit of terminology
Population: A population in statistics means a group or collection of entities that we want to study. Thus, "population" could refer to people, but also to light bulbs, car accidents, university students, or grammatical constructions. A population is thus not something which occurs naturally; it is defined for the purposes of the research project.

Sample: A sample is a subset of the population that we want to study. Sometimes the sample is carefully collected based on pre-defined criteria. However, we sometimes have to work with the sample we happen to have available, as in historical linguistics.

Random sample: A random sample is a sample where every member of the population has an equal probability of being included in the sample. This is not always possible to achieve, but most statistical tests assume that the sample is drawn randomly from the population. It is then a matter of interpretation how badly a violation of this assumption will affect the results.

Distribution model: A distribution is a mathematical function which can in some cases serve as a fair (but not necessarily perfect) model of the population we wish to study. One such model is the so-called normal or Gauss distribution (used with continuous data), with a shape more or less like a bell. There are other such distributions, notably the chi-square (or χ²) distribution used to model the population under study in chi-square tests (nominal data).

[Figure 2: A normal distribution]

Null hypothesis: A null hypothesis, or H₀, is a term used to denote the default assumption of most statistical tests, namely that all the variation in the sample data is due to random variation. The null hypothesis is then tested against an alternative hypothesis, or H₁, which typically states that the variation is not due to chance.

2 In statistics, data are usually referred to in the plural, while a single piece of data is a data point.
3 There are ways of transforming nominal corpus data into continuous data, through a log-transformation. Such procedures fall outside the scope of this tutorial.
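To make the notions of population and random sample concrete, here is a small sketch in R. The population is entirely hypothetical (1,000 invented clause tokens), and the seed is arbitrary; the point is only that sample() gives every member the same chance of being drawn.

```r
set.seed(42)  # arbitrary seed, only so that the sketch is reproducible

# Hypothetical population: 1,000 clause tokens tagged with their subject type.
population <- rep(c("NP", "clausal"), times = c(400, 600))

# Simple random sample without replacement: every token is equally likely
# to be included.
smp <- sample(population, size = 100)

# Sample proportions; these will be close to, but rarely exactly, 0.4 and 0.6.
prop <- table(smp) / length(smp)
```

Rerunning the last two steps with a different seed gives slightly different proportions, which is the random variation that the null hypothesis refers to.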

Statistical tests

The sections below present some statistical tests as they are implemented in R. For instructions on how to install and use R in general, see the web page http://cran.r-project.org/. As the tutorial is directed towards corpus linguistics, the presentation will focus on the tests which are most appropriate for nominal data, the so-called non-parametric tests.

7.1 Pearson's chi-square

Pearson's chi-square (often referred to as simply chi-square) is a commonly used test in linguistics, because it can handle almost any kind of nominal data. However, it still assumes that

i) the data are a random sample from the population
ii) the chi-square distribution is a fair model of how the phenomenon under study is distributed in the population
iii) the expected number of observations in each cell is larger than five
iv) you have actual observed frequencies; never do a chi-square test on percentages!

The Pearson chi-square can be used in two ways, as a test of independence / correlation or as a test of goodness of fit. The goodness of fit test is used to check whether a set of observations is adequately represented by the chi-square distribution, and it will not be discussed further here. The test for independence is based on the following logic:

a) Take two or more sets of observations in a 2 × 2 or larger table
b) Compare the observations to the chi-square distribution

The aim of this is to test whether the observations in the categories that we have divided the data into (i.e. the rows and columns of the table) represent random variation or whether the variation is caused by the factors represented by the categories. The underlying assumption is that if the observations in the table (i.e. our categories) are related only by chance, the observations will match well with the chi-square distribution in figure 1. Conversely, if the observations do not match well, then it is assumed that the categories have somehow influenced the distribution of the observations. The result of a Pearson chi-square test is thus an answer to a yes/no question: are the observations a random sample from a single chi-square distributed population, yes or no? This can be illustrated with the following fictional data:

| Genre | NP subject | Clausal subject |
|---|---|---|
| Fiction | 45 | 67 |
| Newspaper | 34 | 82 |
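The comparison in step b) rests on the expected frequencies, i.e. the counts we would see if rows and columns were unrelated. As a sketch (not the handout's own procedure), they can be computed by hand in R from the row and column totals of the fictional table; the names obs, expected and chi_sq are illustrative.

```r
# The fictional genre-by-subject table, with labels added for readability.
obs <- matrix(c(45, 34, 67, 82), nrow = 2,
              dimnames = list(Genre = c("Fiction", "Newspaper"),
                              Subject = c("NP", "Clausal")))

# Expected count per cell = row total * column total / grand total.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Assumption iii): every expected count should be larger than five.
all_large_enough <- all(expected > 5)

# The (uncorrected) chi-square statistic sums the squared deviations,
# each scaled by its expected count.
chi_sq <- sum((obs - expected)^2 / expected)
```

For this table chi_sq comes out at about 2.97, the same value that chisq.test() reports when the Yates correction is switched off.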

Pearson's chi-square is computed in R in the following way, assuming that x is a 2 × 2 table with nominal data, created like this (the <- sign is R's assignment operator, which assigns the material on the right-hand side to a short-hand variable):

(1)  x <- matrix(c(45, 34, 67, 82), nrow = 2)

which produces the following output when x is entered into R:⁴

         [,1] [,2]
    [1,]   45   67
    [2,]   34   82

The chi-square test is then computed like this:⁵

(2)  cx <- chisq.test(x)

For 2 × 2 tables, R applies the Yates correction for continuity to the Pearson chi-square test by default. The issue is somewhat complicated, but there are good reasons not to use the Yates corrected chi-square. In order to tell R not to use it, write:⁶

(3)  cxx <- chisq.test(x, correct = FALSE)

where the argument correct = FALSE turns off the Yates correction.

Interpretation of Pearson's chi-square

How are the results of a Pearson chi-square test to be interpreted? The Pearson chi-square p-value indicates the probability of obtaining a set of observations at least as extreme as the one in the table, provided that the observations are a random sample from the population, and that the null hypothesis is appropriate. In other words, the p-value indicates whether the null hypothesis (the set of observations is a random selection from a single, chi-square distributed population) should be rejected (low p, in linguistics and the social sciences often somewhat arbitrarily set to p < 0.05) or whether we should choose not to reject the null hypothesis (p > 0.05). There is often an implicit alternative hypothesis of the form that the set of observed values comes from two (or more) different populations. In the example above, the result of the uncorrected Pearson chi-square was p = 0.0847. Since this number is larger than the threshold of 0.05, the result would normally be considered an example of random variation and thus not significant. That is, in this case we cannot reliably differentiate between random variation (noise) and interaction effects (information). But note that the obtained p-value is also quite close to the conventional 5% threshold. As pointed out above, the Pearson chi-square assumes that we have a random sample from the entire population we want to generalize to. But what if this is not the case? Then we need to interpret the results with more care, and take into consideration the size of the sample in relation to the entire population as well as the effect size (see below), instead of blindly trusting the chi-square p-value. Note that the p-value does not say anything about the association between the observed values; it refers to the whole set of observations in relation to a larger population (for between-observation association, see the section on effect size below).

Reporting results

The proper way to report the results of a Pearson chi-square test is to include all the following information:

- the chi-square value (reported as X-squared in R),
- the df-value (df stands for degrees of freedom; this is a complicated concept which falls outside the scope of this tutorial),
- whether Yates correction for continuity was used,
- the p-value (this should be the value as reported by the test, not e.g. p > 0.05 or p < 0.05).⁷

4 In statistical terminology, this is called a contingency table.
5 Result as reported by R: X-squared = 2.5119, df = 1, p-value = 0.113
6 Note the different result, as reported by R: X-squared = 2.9725, df = 1, p-value = 0.0847
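The reporting checklist above can be collected into a small helper. This is only a sketch: report_chisq is a hypothetical function name introduced here for illustration, and it simply reads the statistic, parameter and p.value components of the object that chisq.test() returns.

```r
# Hypothetical helper: run the test and return a one-line report string.
# Note that R only applies the Yates correction to 2 x 2 tables, so the
# 'correct' flag is only meaningful for tables of that size.
report_chisq <- function(tab, correct = FALSE) {
  res <- chisq.test(tab, correct = correct)
  sprintf("X-squared = %.4f, df = %d, p = %.4f (Yates correction: %s)",
          res$statistic, as.integer(res$parameter), res$p.value,
          if (correct) "yes" else "no")
}

x <- matrix(c(45, 34, 67, 82), nrow = 2)
uncorrected <- report_chisq(x)
corrected   <- report_chisq(x, correct = TRUE)
```

Printing uncorrected and corrected side by side makes it easy to see how much the continuity correction changes the statistic and the p-value for the same table.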

7.2 Fisher's exact test

Fisher's exact test⁸ has been vigorously promoted in linguistics in recent years by Anatol Stefanowitsch and Stefan Th. Gries, cf. Stefanowitsch and Gries (2003) and Gries and Stefanowitsch (2004) et seq., following an earlier proposal by Pedersen (1996). Traditionally, the Fisher exact test is treated as equivalent to the Pearson chi-square, but used in cases where the Pearson chi-square is considered inappropriate, notably with very small sample sizes (n < 20) or in cases where the expected table cell values are smaller than five. The test has certain advantages and certain limitations. Among its advantages is that it is less conservative than the Pearson chi-square, that is, it can more easily detect a real relationship in the data. Furthermore, the Fisher exact test p-value can be interpreted as a reasonable measure of the size of the observed effect, i.e., the strength of association between the variables for purposes of comparison, cf. footnote 6 in Stefanowitsch and Gries (2003, 238–239). In this case, care should be taken to make sure that there could plausibly exist a real association or dependency in the data in the first place. As with the Pearson chi-square, the R format of the Fisher exact test is:

(4)  fx <- fisher.test(x)

When the Fisher exact test is run with the x table above as its argument, we get quite a lot of information from R:

    data:  x
    p-value = 0.09575
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.9015428 2.9181355
    sample estimates:
    odds ratio
       1.61635

For the present purposes we will ignore most of this information, and simply consider the p-value, which in this case is 0.096. As with the Pearson chi-square, this is normally taken to indicate that the result is not significant given the conventional threshold of 0.05.
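The individual pieces of this output can also be picked out of the result object, which is convenient when only the p-value is needed. The sketch below assumes the same fictional table; sample_or is an illustrative name for the plain cross-product odds ratio, which is not the same quantity as the conditional estimate that fisher.test() reports.

```r
x  <- matrix(c(45, 34, 67, 82), nrow = 2)
fx <- fisher.test(x)

p_val    <- fx$p.value    # the p-value discussed in the text
or_est   <- fx$estimate   # conditional maximum-likelihood odds ratio
or_range <- fx$conf.int   # 95 percent confidence interval for the odds ratio

# For comparison: the simple cross-product odds ratio of the table.
# It is close to, but not identical with, the conditional estimate.
sample_or <- (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1])
```

Working with fx$p.value directly avoids copying numbers from the printed output by hand.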

7 An argument could also be made in favor of the opposite case, i.e. reporting smaller than/greater than 0.05. However, in this matter I choose to follow the conventions in Wilkinson (1999).
8 In this context, "exact" refers to how the test is computed; it has nothing to do with the reliability of the test.

Interpretation of Fisher's exact test

The Fisher exact p-value can be interpreted as the probability of obtaining the observed table, or a table with more extreme observations (essentially, larger differences). Additionally, the p-value gives a relative effect size adjusted for the observed frequencies in the table. In the context of corpus linguistics, the most obvious role for the Fisher exact test is to measure dependencies between collocations, or in the case of Stefanowitsch and Gries, dependencies between words and constructions. Note that it is not given that the results of a Fisher exact test can be extended beyond the corpus, due to the mathematical assumptions it is based on. Stefanowitsch and Gries do so anyway, but through an explicitly psychological, or psycholinguistic, interpretation of their object of study, thus illustrating both the limitations of the test and how to overcome them.

7.3 Rank tests

This handout is primarily directed towards corpus linguistics, but as mentioned in section 5 above, we sometimes deal with ordinal data in linguistics, typically in the context of an experimental or sociolinguistic study.⁹ In principle, we could test this kind of data using a chi-square test, but such an approach would ignore useful information, namely the fact that the data points are ordered (as opposed to the unordered nominal data). In the research situations mentioned above, there are three rank tests that are appropriate, the Wilcoxon, Mann-Whitney U, and Kruskal-Wallis rank tests, which are all variations on a common theme.¹⁰ These tests are based on mathematical operations on the rank values, which are then compared to distribution models. In this workshop we will only discuss the Mann-Whitney U test and the Wilcoxon test, both of which are (somewhat confusingly) implemented in R as a version of the Wilcoxon test:

(5)  m <- wilcox.test()

7.3.1 Mann-Whitney U

Consider the following situation, adapted from the example in Hinton (2004, 222–223): In a survey, native speakers from two different areas have been asked to judge how good or acceptable a specific construction sounds in their dialect on a scale from 0 to 100. The result is as follows:

| Area | Judgments |
|---|---|
| Tinytown | 43, 34, 14, 62 |
| Megacity | 67, 82, 33, 46, 22, 75 |

9 The discussion of rank tests relies primarily on Hinton (2004, 216–229).
10 The differences between the tests are as follows: the Wilcoxon test is designed for comparing two related samples; the Mann-Whitney U test is designed for comparing two unrelated samples; the Kruskal-Wallis test is designed for testing more than two samples.

This is tested in R in the following manner, by assigning the judgments for each area to a vector, x and y respectively, and entering them into the function as follows:

    x <- c(43, 34, 14, 62)
    y <- c(67, 82, 33, 46, 22, 75)
    m <- wilcox.test(x, y)

The result of this two-sided test is p = 0.3524, which would usually be taken to indicate that there is no reliable difference between the two areas in their judgments. (The one-sided variant, wilcox.test(x, y, alternative = "greater"), gives p = 0.8714.)

7.3.2 Wilcoxon

Now consider a slightly different scenario, adapted from the example in Hinton (2004, 228–229), where a group of subjects are asked to rank two different constructions using a scale of 1 to 20:

| Subject | Construction 1 | Construction 2 |
|---|---|---|
| George | 17 | 8 |
| Adele | 20 | 11 |
| Ray | 8 | 15 |
| Noam | 18 | 5 |
| Suzanne | 17 | 15 |
| Jim | 11 | 5 |
| Cynthia | 16 | 8 |
| Háj | 23 | 6 |
| Seana | 8 | 9 |
| Paul | 21 | 8 |

These columns of scores can be considered related since for each row, both judgments were made by the same test subject. The procedure in R is the same as above, with two minor modifications. First, the paired = TRUE argument tells R that the two samples are related, i.e., that it is a Wilcoxon test. Second, R will by default attempt to compute the exact significance, which will result in warning messages if two or more of the differences have the same size (ties). The solution is to tell R not to use an exact method, by adding the argument exact = FALSE:

    x1 <- c(17, 20, 8, 18, 17, 11, 16, 23, 8, 21)
    y1 <- c(8, 11, 15, 5, 15, 5, 8, 6, 9, 8)
    m1 <- wilcox.test(x1, y1, paired = TRUE, exact = FALSE)

The result is p = 0.025, suggesting (again based on the conventional threshold of 0.05) that the subjects have a systematic preference for one construction over the other (i.e., there is a real difference in the subjects' ratings of the two constructions). Judging by the differences in rank sums, it seems that the subjects find construction 1 more acceptable than construction 2.¹¹
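The rank sums referred to here can be reproduced by hand, which may help in seeing what the test actually compares. The following is a minimal sketch in base R using the same scores; the names d, r, pos_sum and neg_sum are illustrative choices.

```r
x1 <- c(17, 20, 8, 18, 17, 11, 16, 23, 8, 21)
y1 <- c(8, 11, 15, 5, 15, 5, 8, 6, 9, 8)

# Difference per subject: positive when construction 1 is rated higher.
d <- x1 - y1

# Rank the absolute differences; tied values share their average rank.
r <- rank(abs(d))

# Sum the ranks separately for positive and negative differences.
pos_sum <- sum(r[d > 0])
neg_sum <- sum(r[d < 0])

# The statistic V reported by R is the sum of the positive ranks.
m1 <- wilcox.test(x1, y1, paired = TRUE, exact = FALSE)
```

Here pos_sum is 50 and neg_sum is 5, so nearly all of the rank mass lies on the side of construction 1, which is what the small p-value reflects.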

7.4 Student's t-test, ANOVA, and parametric rank tests

These tests are so-called parametric tests designed for continuous data, and fall outside the scope of the present tutorial. How and when to use them is taught in all introductory statistics courses, such as the ones listed in section 12. In the context of corpus linguistics, their use is somewhat questionable, and the reader should be aware that it is regrettably not unusual to find these tests employed in ways which do not fit well with their assumptions.

Effect size

What is the importance of effect size, or association strength? Generally, the p-value of a statistical test says nothing about the size of the observed effect in the data, that is, the association between variables in the data. Rather, the p-value tests the hypothesis that the distribution in the data is a random sample from a population which has the properties of some mathematical distribution (e.g. the chi-square). That is, the p-value indicates how likely we would be to observe the data (the full set of data in this table) if we assume that the population follows a chi-square distribution and if the data in our matrix are a random sample from some population. Whether these assumptions hold or not is often a question of interpretation. However, the main reason why effect size is important is this:¹²

In corpus linguistics, the chi-square p-value addresses a different question than the one we want to answer!

As Kilgarriff (2005) has pointed out, we usually know that the data in the corpus we want to study are not a random collection of words with a chi-square distribution. Applying a chi-square test is thus to attempt to measure something the chi-square p-value was never intended to measure. To understand this problem with the p-value, you need to know that the chi-square p depends very much on the size of the sample (n, that is, the sum of all cells in the table): as the sample size grows, the p-value will inevitably grow smaller for any given non-zero association, the end result being a high number of false indications of statistical significance. Essentially, if you fail to reject the null hypothesis (i.e. you get a p-value which is larger than 0.05; remember, we want small p-values) in corpus linguistics, this might simply be an effect of a small sample size (n); it says virtually nothing about the size of the association between the variables we want to investigate. Thus, null-hypothesis testing in corpus linguistics is problematic, and the problem is not solved simply by acquiring a bigger corpus. What we need instead is some test or measure which indicates the magnitude of the difference when we observe 34 in one cell and 82 in another cell, or which can tell us how much the information in one of the columns contributes to the overall result. Put differently, when we observe 34 in one table cell and 82 in another table cell, how can we quantify the tendency of the factors involved to go in the same (or opposite) directions? With the possible exception of the Fisher exact test, cf. 7.2 above, the statistical tests we have looked at so far need to be augmented by some kind of effect size measure to give us this kind of information. In this section two such useful measures are introduced; however, there are many more such measures around in the social and behavioral sciences, and no gold standard currently exists.

11 This is based on the sums of negative and positive differences. For instance, if the subject rates construction 1 over construction 2, the difference will be positive (14 − 10 = 4, while 12 − 14 = −2). A full explanation of this procedure falls outside the scope of this tutorial.
12 This applies primarily to corpus linguistics. In an experimental study things are a little different.
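The dependence of the p-value on n is easy to demonstrate. In the sketch below, the fictional 2 × 2 table used earlier is multiplied by ten, so that the proportions (and hence the strength of association) stay exactly the same while the sample grows. The helper phi is an illustrative one-liner for the effect size measure introduced in section 8.1; the scaled table x10 is of course an invented data set.

```r
x   <- matrix(c(45, 34, 67, 82), nrow = 2)
x10 <- x * 10   # same proportions, ten times as many observations

p_small <- chisq.test(x,   correct = FALSE)$p.value
p_large <- chisq.test(x10, correct = FALSE)$p.value

# Illustrative helper: Phi = sqrt(chi-square / n) for a 2 x 2 table.
phi <- function(tab) {
  sqrt(unname(chisq.test(tab, correct = FALSE)$statistic) / sum(tab))
}

phi_small <- phi(x)
phi_large <- phi(x10)
```

The p-value drops from above 0.05 to far below 0.001, whereas Phi is identical for both tables: the test has become "significant" although the association itself has not changed.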

8.1 Phi and Cramér V

Phi (or φ) is computed based on the chi-square value. Recall that the chi-square p-value is very sensitive to the sample size (n). Phi and Cramér V factor out the size of the sample, and give the average contribution of rows and columns (that is, the categories in the table and their respective observations in the rows and columns) to the final result. Phi has certain weaknesses when the table gets bigger than 2 × 2, however, and the Cramér V is a generalized version of Phi, cf. Cramér (1946, 282); Cohen (1988, 223–225); Gries (2005, 280). Essentially, Phi is restricted to 2 × 2 tables, whereas Cramér V can be used on larger tables. Note that in the 2 × 2 table case the tests are identical, cf. (6) and (8) below. The Cramér V is implemented as a default test in SPSS, but not in R. However, it is easy to compute. The formula is as follows:

(6)  $V = \sqrt{\dfrac{\chi^2}{n\,(k - 1)}}$    (Cramér V)

χ² is the computed test statistic from the uncorrected Pearson chi-square, n is the total sample size (i.e. the sum of all cells in the matrix), and k is the smaller of either the number of rows or the number of columns. Converted into R code, this can be calculated quite efficiently as follows (assuming the same matrix x as above):¹³

(7)  cv <- sqrt(chisq.test(x, correct = FALSE)$statistic / (sum(x) * (min(dim(x)) - 1)))

13 The procedure for computing the Cramér V in R was posted by Marc Schwartz on http://www.biostat.wustl.edu/archives/html/s-news/2003-09/msg00185.html
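To avoid retyping the expression for every table, the formula can be wrapped in a function. This is a sketch only; cramers_v is a hypothetical name, and the function does nothing beyond restating formula (6) with the uncorrected chi-square statistic.

```r
# Hypothetical helper implementing V = sqrt(chi-square / (n * (k - 1))).
cramers_v <- function(tab) {
  chi_sq <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)        # total sample size
  k <- min(dim(tab))   # the smaller of the number of rows and columns
  sqrt(chi_sq / (n * (k - 1)))
}

x <- matrix(c(45, 34, 67, 82), nrow = 2)
v <- cramers_v(x)
```

For a 2 × 2 table k − 1 equals 1, so the function returns the same value as Phi; for the table x this is roughly 0.114.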

It is possible to save some typing by converting this code into a script and loading the script into R. This will not be covered here, though. If you are only working with 2 × 2 tables, Phi is even easier to compute:

(8)  $\phi = \sqrt{\dfrac{\chi^2}{n}}$    (Phi)

(Phi is simpler because the reason for introducing (6) was to test more complex cases, i.e., cases where the table is larger than 2 × 2.) Phi can be computed as follows:

(9)  f <- sqrt(chisq.test(x, correct = FALSE)$statistic / sum(x))

Interpretation of Cramér V

Cramér V and Phi¹⁴ can be interpreted as follows: the test computes the mean percent difference in clustering of observations between rows and columns. That is, the test measures how closely the observations in rows and columns are associated with each other. If we apply the Phi measure to the x table above, the result is as follows:

(10)  $\phi = \sqrt{\dfrac{2.9725}{228}} = 0.114$

The result, 0.114 or 11.4 %, indicates a mutual association between rows and columns of approximately eleven and a half percent.¹⁵ Whether this is a large effect size is a matter of interpretation. As a rule of thumb (it cannot be stressed enough that this is only a guideline, not a fixed rule), according to Fleiss et al. (2003, 99), effects of less than 30 or 35 % indicate a trivial association. However, opinions vary, and Cohen (1988, 224–226) considers effects smaller than 1 % as trivial, 10–20 % small, 20–50 % medium sized, and anything over 50 % as large. Remember that these are percentages, i.e., 0 % indicates no association whatsoever whereas 100 % indicates perfect association; essentially, a result close to 100 % means that the observed results are entirely due to (or explained by) the categories of the investigation (i.e. rows/columns). However, other factors should influence the interpretation of the effect size, notably:

- the size of the sample: a small sample is almost always a bad representation of the population. Thus, whether the observed effect can be applied to the entire population needs careful interpretation.
- how much data is missing? If you know that a lot of data is missing, this should influence the interpretation.
- what type of study are you conducting? The interpretation of the Phi/Cramér V should differ in a corpus-based syntax study, an experimental situation, or the evaluation of a sociolinguistic survey.

Note that both these measures are symmetric, that is, they give you both the association of rows with columns and columns with rows. Often this is fine, but sometimes we want to measure asymmetric relationships; this is discussed in the section on the Goodman-Kruskal lambda below.

14 The subtle differences that exist in the interpretation of the two measures fall outside the scope of this tutorial.
15 It is not entirely uncontroversial to think of the effect size results discussed in this handout as percentages. I choose to do so anyway because it gives an intuitive grasp of the relative size of the effect, but not everyone would agree with me in this. The reader is hereby warned.

8.2 Goodman-Kruskal lambda

Unlike Phi and Cramér V, the Goodman-Kruskal lambda is not based on the chi-square statistic. Instead, it is based on the probability of guessing the right result in the table cells if you know something about the categories of the data. See Goodman and Kruskal (1979) for the original proposals for this measure. The Goodman-Kruskal lambda is a so-called measure of error reduction, which is a useful variation to consider when one is looking for asymmetric associations (i.e. rows are strongly associated with columns, but columns are not strongly associated with rows; or vice versa). Consider another fictional example, with the use of different constructions in historical periods:¹⁶

| | Period 1 | Period 2 | Period 3 |
|---|---|---|---|
| Construction1 | 82 | 11 | 3 |
| Construction2 | 39 | 2 | 9 |

The constructions in question appear to be declining over the observed time span. But how much variation is explained by the time period, and how much is explained by the internal variation between the two constructions? By taking the time periods (columns) as the potentially explanatory factor (independent variable), we do a lambda test to check how much the temporal variation is associated with the constructions. First, create the table above in R:

(11) `lx <- matrix(c(82, 39, 11, 2, 3, 9), nrow = 2)`
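As a small aid to reading the table, the same matrix can be given row and column labels and its marginal totals inspected. This is only a sketch: the `dimnames` labels are taken from the table above, but adding them is an optional step that is not part of example (11).

```r
# Same counts as in (11); matrix() fills column by column, so each pair of
# numbers is one period (Construction1, Construction2)
lx <- matrix(c(82, 39, 11, 2, 3, 9), nrow = 2,
             dimnames = list(c("Construction1", "Construction2"),
                             c("Period 1", "Period 2", "Period 3")))

print(lx)
print(rowSums(lx))  # total per construction
print(colSums(lx))  # total per period
```

The labels do not affect any computation on `lx`; they only make the printed table easier to compare with the one in the text.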

The Goodman-Kruskal lambda (or λ_B) is not implemented as a default test in R, but can be computed as follows:

(12) `lb <- (sum(apply(lx, 2, max)) - max(rowSums(lx))) / (sum(lx) - max(rowSums(lx)))`

16 We are of course ignoring the problem of whether it is meaningful to directly compare linguistic data from different historical periods.


The R code in (12) above is an implementation of a mathematical formula from Siegel and Castellan (1988, 299). However, the mathematical reasoning behind this measure is slightly more complex than that of the Cramér V, and it will not be explained in depth here. Note that the code above assumes that lx is a matrix of nominal data where rows represent observations and where the columns of the matrix contain the classes, i.e., the independent variable.¹⁷

Goodman-Kruskal lambda can be interpreted as follows: this test measures how much the potential error of predicting the observed results can be reduced by looking at additional columns (or classes). Put differently, if we are trying to predict the distribution of row observations other than the one with the highest frequency (i.e. the variation), how much would knowing the classes (columns) help us? In the case above, the result is 0.12, or 12 %. That is, in this case information about time period is only moderately helpful in explaining the variation. Conversely, if the test is done on the rows instead of the columns, which means replacing both apply(lx, 2, max) with apply(lx, 1, max) and rowSums(lx) with colSums(lx), the result is 0: knowing the construction does not improve the prediction of the time period at all, because Period 1 is the most frequent period for both constructions. In other words, the Goodman-Kruskal lambda can be used to assess to what extent (measured in percent) each variable in either rows or columns contributes to the effect observed on the other variable. Note that this test is not particularly suited for 2 × 2 tables, or tables where the observations are very evenly distributed.
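The computation in (12) can be wrapped in a small function that handles both directions. This is a minimal sketch and not a function provided by R: the name `gk_lambda` and its `predictor` argument are hypothetical. Note that it swaps both the cell maxima and the marginal totals when the direction is reversed (by transposing the table), so its value for the reversed direction can differ from one obtained by changing only the `apply()` index.

```r
# Goodman-Kruskal lambda as proportional reduction in prediction error.
# predictor = "columns": predict the row category from the column, as in (12).
# predictor = "rows":    predict the column category from the row.
gk_lambda <- function(tab, predictor = c("columns", "rows")) {
  predictor <- match.arg(predictor)
  if (predictor == "rows") tab <- t(tab)  # reuse the column-predictor formula
  best_with_predictor <- sum(apply(tab, 2, max))  # correct guesses knowing the column
  best_without <- max(rowSums(tab))               # correct guesses from the margin alone
  (best_with_predictor - best_without) / (sum(tab) - best_without)
}

lx <- matrix(c(82, 39, 11, 2, 3, 9), nrow = 2)

lambda_cols <- gk_lambda(lx, "columns")  # time period as the independent variable
lambda_rows <- gk_lambda(lx, "rows")     # construction as the independent variable

print(lambda_cols)
print(lambda_rows)
```

Transposing the table keeps a single formula for both directions, which avoids the risk of updating one margin but not the other.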

## 8.3 Effect size measures for ordinal data

There are a number of effect size measures available for ordinal data; examples include Spearman's rho (ρ) and Kendall's tau (τ). Although only Kendall's tau is discussed below, the R function cor.test() can be used to compute several measures; see the R help files for details.¹⁸ To compute Kendall's tau on the grammaticality judgments in section 7.3.2, we use the same vectors x1 and y1:

(13) `r <- cor.test(x1, y1, method = "kendall", alternative = "two.sided", exact = FALSE)`

In the code above, method refers to the type of test, alternative = "two.sided" means that we had no indication before the experiment which construction would be rated highest, and exact = FALSE is necessary because there are ties in the rank sums. The output is as follows:

        Kendall's rank correlation tau

    data:  x1 and y1
    z = -0.9223, p-value = 0.3564
    alternative hypothesis: true tau is not equal to 0
    sample estimates:
           tau
    -0.2411214

An in-depth discussion of all the output above falls outside the scope of this tutorial, and we will only consider the tau value. The Kendall tau is always a number between -1 and 1, where -1 indicates negative association (i.e. disagreement), 1 indicates positive association (i.e. agreement), and 0 indicates no association. Formally, Kendall's tau is the ratio of the sum of actual rank scores to the potential maximum rank score, which makes this a good measure of the size of the observed effect. The value obtained above, -0.24 or 24 %, indicates a weak to moderate negative association, or difference in acceptability, between the two constructions. Siegel and Castellan (1988, 245) recommend Kendall's tau for measuring agreement between raters; see Kendall (1938) for the first proposal of this measure.

17 The code apply(..., 2, ...) instructs R to treat the columns as the independent variable. Similarly, the rows can be treated as the independent variable by writing apply(..., 1, ...) and replacing rowSums() with colSums().

18 Open R and type ?cor.test
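The tau value can also be extracted from the result object rather than read off the printed output. The following is only a sketch: the rating vectors `ratings_a` and `ratings_b` are invented for illustration and are not the x1 and y1 of section 7.3.2.

```r
# Hypothetical acceptability ratings (scale 1-5) given by the same ten raters
# to two constructions; invented values, NOT the data from section 7.3.2
ratings_a <- c(5, 4, 4, 3, 5, 2, 4, 3, 5, 4)
ratings_b <- c(2, 3, 3, 4, 1, 4, 2, 5, 3, 3)

res <- cor.test(ratings_a, ratings_b, method = "kendall",
                alternative = "two.sided", exact = FALSE)

tau <- unname(res$estimate)  # the value printed under "sample estimates"
p   <- res$p.value           # the p-value of the two-sided test

print(tau)
print(p)
```

Storing the estimate in a variable makes it easy to report the effect size alongside the p-value instead of the p-value alone.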

## 9 P-values and research questions

It is crucial to keep in mind that the result of a statistical test cannot answer your research questions for you: you need to interpret the statistical results; see Woods et al. (1986, 127-131) for a good introductory discussion of this problem. This process of interpretation can be more or less difficult, depending on some factors:

i) Are the research questions well operationalized, i.e., have you spelled out how you think your hypothesis relates to the data in terms of frequencies or magnitudes?

ii) Do you have all the relevant information (i.e., are there other factors that could influence the outcome)?

iii) How well do your data match the assumptions of the statistical test?

Basically, i) is your responsibility: the researcher conducting a study is responsible for clarifying how the empirical and statistical results can be interpreted as having explanatory value with regard to a research question. ii) is obviously a matter of interpretation: what is enough information about the sample, the population, any missing data, etc.? As a rule of thumb, the information should be sufficient to let you make good operationalizations. iii) is a very difficult problem to handle, and very often a statistical test is used in a way which does not match its assumptions well. It is important to keep in mind, however, that even when the assumptions match almost perfectly, you still need to (or ought to) explain your reasons for using a specific statistical test; "everyone else does it" is not an acceptable reason! How badly your data violate the assumptions of a given test, and how this will influence your interpretations of the results in relation to the research questions, is a factor of uncertainty which must be dealt with in any case.

It might then be tempting to ask what the point of doing a statistical analysis is at all. The answer is simple: there is a world of difference between interpreting the result of a statistical test and interpreting raw frequencies. The human mind is not particularly well equipped to process complex frequency data in a reliable, unbiased way. Consequently, an appropriate statistical test, whatever its shortcomings, is in most cases preferable to raw frequencies as the basis for quantitative, scientific analysis.

## 10 What is not covered in this tutorial?

As mentioned previously, this tutorial is restricted to a few nonparametric tests and measures within the frequentist tradition. For a particular research project, there might be useful tests and measures to be found among the parametric tests, as well as in the Bayesian and correspondence analysis traditions. In most cases it would be advisable to follow a formal course in statistics such as one of the courses listed in section 12 below. Below are some examples of important concepts that were omitted for reasons of space, but this is in no way an exhaustive list: There is a lot more to be said about data types than the brief exposition in section 5. For instance, the problem of data source as opposed to data type has not been touched upon at all, but is nevertheless important. Furthermore, the question of sample and population is often quite complex in most real research projects and requires a lot more attention than what was given to it in section 6. Yet another important but omitted aspect of statistical testing is one-directional versus two-directional tests. All of these concepts typically require more attention than they could possibly receive in a short workshop. Again, the best solution would be to follow a regular course where these issues can be treated with the attention they require.

## 11 Relevant literature

For a gentle, non-numerical introduction to statistical thinking, Rowntree (1981) is a good place to start. A more in-depth consideration of statistical methods is presented in an easily accessible way in Hinton (2004). Specifically linguistic applications of statistics are briefly introduced in Núñez (2007), and comprehensively treated in Baayen (2008) and Johnson (2008). Of the last two books, Baayen's is without doubt the most advanced, but also the one least suited for a novice to statistics.

For an in-depth understanding of some of the issues pertaining to the interpretation of statistics, it is necessary to go beyond introductory books. Articles such as Cohen (1994), Johnson (1999), Tversky and Kahneman (1971), Upton (1982) and Upton (1992) contain extremely valuable discussions on the choice and use of statistical tests as well as the interpretation of p-values. For the rather complicated question of effect size measures, Cohen (1988) is still a classic, but Siegel and Castellan (1988) and Fleiss, Levin, and Paik (2003) also discuss this question (along with many other questions) in a relatively accessible way. Kempthorne (1979) contains an excellent discussion of the question of data origin. For the question of population and sample in linguistics, Woods et al. (1986, 48-57) is a good place to start, but Clark (1973) and Tversky and Kahneman (1971) offer invaluable refinements. Finally, Wilkinson (1999) is highly recommended as a guide to good practice in handling and presenting statistics for research.

## 12 Statistics courses at the faculty of humanities

dasp 106 Statistikk og kognisjonsforskning, 5 credits: A brief introduction to parametric and nonparametric frequentist tests.

dasp 302 Statistisk metode, 10 credits: A more in-depth introduction to parametric and nonparametric frequentist tests.

huin 308 Statistikk for HF-fag, 15 credits: Identical to dasp 302, but with added coursework on correspondence analysis and Bayesian statistics.

For an updated list of statistics courses offered at the faculty of humanities, see the course listings at http://studentportal.uib.no/.

References
Apollon, D. (1990). Dataanalytiske metoder i filologien. In O. E. Haugen and E. Thomassen (Eds.), Den filologiske vitenskap, pp. 181-208. Oslo: Solum forlag.

Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press.

Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of verbal learning and verbal behavior 12 (4), 335-359.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

Cohen, J. (1994). The earth is round (p < .05). American psychologist 49 (12), 997-1003.

Cramér, H. (1946). Mathematical methods of statistics. Princeton: Princeton University Press.

Fleiss, J. L., B. Levin, and M. C. Paik (2003). Statistical methods for rates and proportions (3rd ed.). Hoboken, NJ: Wiley.

Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (2004). Bayesian data analysis (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.

Goodman, L. A. and W. H. Kruskal (1979). Measures of association for cross classifications. New York: Springer.

Gries, S. T. (2005). Null-hypothesis significance testing of word frequencies: A follow-up on Kilgarriff. Corpus linguistics and linguistic theory 1 (2), 277-294.

Gries, S. T. and A. Stefanowitsch (2004). Extending collostructional analysis. International journal of corpus linguistics 9 (1), 97-129.

Hinton, P. R. (2004). Statistics explained (2nd ed.). London: Routledge.

Johnson, D. H. (1999). The insignificance of statistical significance testing. The journal of wildlife management 63 (3), 763-772.

Johnson, K. (2008). Quantitative methods in linguistics. Oxford: Blackwell Publishing.

Kempthorne, O. (1979). In dispraise of the exact test: Reactions. Journal of statistical planning and inference 3, 199-213.

Kendall, M. (1938). A new measure of rank correlation. Biometrika 30 (1-2), 91-93.

Kilgarriff, A. (2005). Language is never, ever, ever, random. Corpus linguistics and linguistic theory 1 (2), 263-276.

Nenadic, O. and M. Greenacre (2007). Correspondence analysis in R, with two- and three-dimensional graphics: The ca package. Journal of Statistical Software 20 (3), 1-13.

Núñez, R. (2007). Inferential statistics in the context of empirical cognitive linguistics. In M. Gonzalez-Marquez, I. Mittelberg, S. Coulson, and M. J. Spivey (Eds.), Methods in Cognitive Linguistics, pp. 87-118. Amsterdam: John Benjamins Publishing Company.

Pedersen, T. (1996). Fishing for exactness. In Proceedings of the South Central SAS Users Group (SCSUG-96) Conference, pp. 188-200.

R Development Core Team (2008). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Rowntree, D. (1981). Statistics without tears. London: Penguin Books.

Siegel, S. and N. J. Castellan (1988). Nonparametric statistics for the behavioral sciences (2nd ed.). New York: McGraw-Hill.

Stefanowitsch, A. and S. T. Gries (2003). Collostructions: Investigating the interaction of words and constructions. International journal of corpus linguistics 8 (2), 209-243.

Tversky, A. and D. Kahneman (1971). Belief in the law of small numbers. Psychological bulletin 76 (2), 105-110.

Upton, G. J. (1982). A comparison of the alternative tests for the 2 × 2 comparative trial. Journal of the Royal Statistical Society. Series A (General) 145 (1), 86-105.

Upton, G. J. (1992). Fisher's exact test. Journal of the Royal Statistical Society. Series A (Statistics in society) 155 (3), 395-402.

Wilkinson, L. (1999). Statistical methods in psychology journals: Guidelines and explanations. American psychologist 54 (8), 594-604.

Woods, A., P. Fletcher, and A. Hughes (1986). Statistics in language studies. Cambridge: Cambridge University Press.
