Early Corpus Linguistics

"Early corpus linguistics" is a term we use here to describe linguistics before the advent of Chomsky. Field linguists, for example Boas (1940) who studied American-Indian languages, and later linguists of the structuralist tradition all used a corpus-based methodology. However, that does not mean that the term "corpus linguistics" was used in texts and studies from this era. Below is a brief overview of some interesting corpus-based studies predating 1950.

Language acquisition
The studies of child language in the diary studies period of language acquisition research (roughly 1876-1926) were based on carefully composed parental diaries recording the child's locutions. These primitive corpora are still used as sources of normative data in language acquisition research today, e.g. Ingram (1978). Corpus collection continued and diversified after the diary studies period: large sample studies covered the period roughly from 1927 to 1957, with data for analysis gathered from a large number of children with the express aim of establishing norms of development. Longitudinal studies have been dominant from 1957 to the present, again based on collections of utterances, but this time with a smaller (approximately 3) sample of children who are studied over long periods of time (e.g. Brown (1973) and Bloom (1970)).

Spelling conventions
Kading (1897) used a large corpus of German - 11 million words - to collate frequency distributions of letters and sequences of letters in German. The corpus, by size alone, is impressive for its time, and compares favourably in terms of size with modern corpora.

Language pedagogy
Fries and Traver (1940) and Bongers (1947) are examples of linguists who used the corpus in research on foreign language pedagogy. Indeed, as noted by Kennedy (1992), the corpus and second language pedagogy had a strong link in the early half of the twentieth century, with vocabulary lists for foreign learners often being derived from corpora. The word counts derived from such studies as Thorndike (1921) and Palmer (1933) were important in defining the goals of the vocabulary control movement in second language pedagogy.

Chomsky changed the direction of linguistics away from empiricism and towards rationalism in a remarkably short space of time. In doing so he apparently invalidated the corpus as a source of evidence in linguistic enquiry. Chomsky suggested that the corpus could never be a useful tool for the linguist, as the linguist must seek to model language competence rather than performance. Competence is best described as our tacit, internalised knowledge of a language.

Performance is external evidence of language competence, and is usage on particular occasions when, crucially, factors other than our linguistic competence may affect its form. Competence both explains and characterises a speaker's knowledge of a language. Performance, however, is a poor mirror of competence. For example, factors as diverse as short-term memory limitations or whether or not we have been drinking can alter how we speak on any particular occasion. This brings us to the nub of Chomsky's initial criticism: a corpus is by its very nature a collection of externalised utterances - it is performance data and is therefore a poor guide to modelling linguistic competence. Further to that, if we are unable to measure linguistic competence, how do we determine from any given utterance what are linguistically relevant performance phenomena? This is a crucial question, for without an answer to this, we are not sure that what we are discovering is directly relevant to linguistics. We may easily be commenting on the effects of drink on speech production without knowing it. However, this was not the only criticism that Chomsky had of the early corpus linguistics approach.

The non-finite nature of language
All the work of early corpus linguistics was underpinned by two fundamental, yet flawed assumptions:
The sentences of a natural language are finite.
The sentences of a natural language can be collected and enumerated.

The corpus was seen as the sole source of evidence in the formation of linguistic theory - "This was when linguists...regarded the corpus as the sole explicandum of linguistics" (Leech, 1991). To be fair, not all linguists at the time made such bullish statements - Harris (1951) is probably the most enthusiastic exponent of this point, while Hockett (1948) did make weaker claims for the corpus, suggesting that the purpose of the linguist working in the structuralist tradition "is not simply to account for utterances which comprise his corpus" but rather to "account for utterances which are not in his corpus at a given time."

The number of sentences in a natural language is not merely arbitrarily large - it is potentially infinite. This is because of the sheer number of choices, both lexical and syntactic, which are made in the production of a sentence. Also, sentences can be recursive. Consider the sentence "The man that the cat saw that the dog ate that the man knew that the..." This type of construct is referred to as centre embedding and can give rise to infinite sentences. (This topic is discussed in further detail in "Corpus Linguistics" Chapter 1, pages 7-8).

The only way to account for a grammar of a language is by description of its rules - not by enumeration of its sentences. It is the syntactic rules of a language that Chomsky considers finite. These rules in turn give rise to infinite numbers of sentences.

The value of introspection
Even if language was a finite construct, would corpus methodology still be the best method of studying language? Why bother waiting for the sentences of a language to enumerate themselves, when by the process of introspection we can delve into our own minds and examine our own linguistic competence? At times intuition can save us time in searching a corpus. Without recourse to introspective judgements, how can ungrammatical utterances be distinguished from ones that simply haven't occurred yet? If our finite corpus does not contain the sentence:
*He shines Tony books
how do we conclude that it is ungrammatical? Indeed, there may be persuasive evidence in the corpus to suggest that it is grammatical if we see sentences such as:
He gives Tony books
He lends Tony books
He owes Tony books
Introspection seems a useful and good tool for cases such as this. But early corpus linguistics denied its use. Also, ambiguous structures can only be identified and resolved with some degree of introspective judgement. An observation of physical form only seems inadequate. Consider the sentences:
Tony and Fido sat down - he read a book of recipes.
Tony and Fido sat down - he ate a can of dog food.
It is only with introspection that this pair of ambiguous sentences can be resolved e.g. we know that Fido is the name of a dog and it was therefore Fido who ate the dog food, and Tony who read the book.

Other criticisms of corpus linguistics
Apart from Chomsky's theoretical criticisms, there were problems of practicality with corpus linguistics. Abercrombie (1963) summed up the corpus-based approach as being composed of "pseudo-procedures". Can you imagine searching through an 11-million-word corpus such as that of Kading (1897) using nothing more than your eyes? The whole undertaking becomes prohibitively time consuming, not to say error-prone and expensive.

Whatever Chomsky's criticisms were, Abercrombie's were undoubtedly correct. Early corpus linguistics required data processing abilities that were simply not available at that time. The impact of the criticisms levelled at early corpus linguistics in the 1950s was immediate and profound. Corpus linguistics was largely abandoned during this period, although it never totally died.

Chomsky re-examined
Although Chomsky's criticisms did discredit corpus linguistics, they did not stop all corpus-based work. For example, in the field of phonetics, naturally observed data remained the dominant source of evidence with introspective judgements never making the impact they did on other areas of linguistic enquiry. Also, in the field of language acquisition the observation of naturally occurring evidence remained dominant. Introspective judgements are not available to the linguist/psychologist who is studying child language acquisition - try asking an eighteen-month-old child whether the word "moo-cow" is a noun or a verb! Introspective judgements are only available to us when our meta-linguistic awareness has developed, and there is no evidence that a child at the one-word stage has meta-linguistic awareness. Even Chomsky (1964) cautioned against the rejection of performance data as a source of evidence for language acquisition studies.

The revival of corpus linguistics
It is a common belief that corpus linguistics was abandoned entirely in the 1950s. This is simply untrue, and does a disservice to those linguists who continued to pioneer corpus-based work during this interregnum. For example, Quirk (1960) planned and executed the construction of his ambitious Survey of English Usage (SEU) which he began in 1961. In the same year, Francis and Kucera began work on the now famous Brown corpus. These researchers were in a minority, but they were not universally regarded as peculiar and others followed their lead. In 1975 Jan Svartvik started to build on the work of the SEU and the Brown corpus to construct the London-Lund corpus. During this period the computer slowly started to become the mainstay of corpus linguistics. Svartvik computerised the SEU, and as a consequence produced what some, including Leech (1991), still believe to be "to this day an unmatched resource for studying spoken English". The availability of the computerised corpus and the wider availability of institutional and private computing facilities do seem to have provided a spur to the revival of corpus linguistics. The table below (from Johansson, 1991) shows how corpus linguistics grew during the latter half of the twentieth century.

Date         Studies
To 1965      10
1966-1970    20
1971-1975    30
1976-1980    80
1981-1985    160
1985-1991    320

The machine readable corpus
The term corpus is almost synonymous with the term machine-readable corpus. Interest in the computer for the corpus linguist comes from the ability of the computer to carry out various processes, which when required of humans, ensured that they could only be described as pseudo-techniques. The type of analysis that Kading waited years for can now be achieved in a few moments on a desktop computer.

Processes
Considering the marriage of machine and corpus, it seems worthwhile to consider in slightly more detail what these processes that allow the machine to aid the linguist are. The computer has the ability to search for a particular word, sequence of words, or perhaps even a part of speech in a text. So if we are interested, say, in the usages of the word however in the text, we can simply ask the machine to search for this word in the text. The machine can find the relevant text and display it to the user. The computer's ability to retrieve all examples of this word, usually in context, is a further aid to the linguist. It can also calculate the number of occurrences of the word so that information on the frequency of the word may be gathered. We may then be interested in sorting the data in some way - for example, alphabetically on words appearing to the right or left. We may even sort the list by searching for words occurring in the immediate context of the word. We may take our initial list of examples of however presented in context (usually referred to as a concordance), and extract from this another list, say of all the examples of however followed closely by the word we, or followed by a punctuation mark. The processes described above are often included in a concordance program. This is the tool most often implemented in corpus linguistics to examine corpora. Whatever philosophical advantages we may eventually see in a corpus, it is the computer which allows us to exploit corpora on a large scale with speed and accuracy.
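The concordance processes described in this section (searching for a word, showing it in context, counting it, sorting on the right-hand context, and extracting a sub-list) can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular concordance program: the function names, the 30-character context width, the three-word window used for "followed closely" and the sample text are all assumptions made for the example.

```python
import re

def concordance(text, keyword, width=30):
    """Return (left context, match, right context) for each whole-word hit."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(0), right))
    return lines

def sort_by_right_context(lines):
    """Sort concordance lines alphabetically on the text to the right."""
    return sorted(lines, key=lambda line: line[2].strip().lower())

def followed_closely_by(lines, word, span=3):
    """Keep lines where `word` is among the next `span` words (assumed window)."""
    kept = []
    for left, hit, right in lines:
        following = re.findall(r"[a-z']+", right.lower())[:span]
        if word.lower() in following:
            kept.append((left, hit, right))
    return kept

# Hypothetical sample text, invented for this sketch.
sample = ("However, we left early. The plan, however, was sound. "
          "It rained; however we stayed.")

hits = concordance(sample, "however")      # every occurrence, in context
frequency = len(hits)                      # frequency of the word
by_right = sort_by_right_context(hits)     # sorted on right-hand context
with_we = followed_closely_by(hits, "we")  # sub-list extracted from the first

for left, hit, right in by_right:
    print(f"{left:>30} [{hit}] {right}")
```

Keeping each concordance line as a (left, match, right) triple means that sorting and sub-list extraction work on the already retrieved examples rather than searching the text again.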

Goals and conclusion
In this section we have
seen the failure of early corpus linguistics
examined Chomsky's criticisms
seen the failings of introspective data
seen how corpus linguistics was revived
In the remaining sections we will see
how corpus linguists study syntactic features (Section 2)
how corpus linguistics balances enumeration with introspection (Section 3)
how corpora can be used in language studies (Section 4)

Definition of a corpus
The concept of carrying out research on written or spoken texts is not restricted to corpus linguistics. Indeed, individual texts are often used for many kinds of literary and linguistic analysis - the stylistic analysis of a poem, or a conversation analysis of a tv talk show. However, the notion of a corpus as the basis for a form of empirical linguistics is different from the examination of single texts in several fundamental ways. In principle, any collection of more than one text can be called a corpus (corpus being Latin for "body", hence a corpus is any body of text). But the term "corpus" when used in the context of modern linguistics tends most frequently to have more specific connotations than this simple definition. The following list describes the four main characteristics of the modern corpus.

Text Encoding and Annotation
If a corpus is said to be unannotated it appears in its existing raw state of plain text, whereas an annotated corpus has been enhanced with various types of linguistic information. Unsurprisingly, the utility of the corpus is increased when it has been annotated, making it no longer a body of text where linguistic information is implicitly present, but one which may be considered a repository of linguistic information. The implicit information has been made explicit through the process of concrete annotation. For example, the form "gives" contains the implicit part-of-speech information "third person singular present tense verb" but it is only retrieved in normal reading by recourse to our pre-existing knowledge of the grammar of English. However, in an annotated corpus the form "gives" might appear as "gives_VVZ", with the code VVZ indicating that it is a third person singular present tense (Z) form of a lexical verb (VV). Such annotation makes it quicker and easier to retrieve and analyse information about the language contained in the corpus. Leech (1993) describes 7 maxims which should apply in the annotation of text corpora.
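To illustrate why annotation such as "gives_VVZ" makes retrieval quicker, the following minimal sketch splits word_TAG tokens into pairs and then retrieves every word carrying a given tag. It assumes, purely for illustration, that tokens are separated by spaces and that the tag follows the last underscore. Only the VVZ tag comes from the text above; the example sentence and the other tags are placeholders in the same style.

```python
def parse_tagged(text, sep="_"):
    """Split 'word_TAG' tokens into (word, tag) pairs; tag is None if absent."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition(sep)
        if not word:  # no separator in this token: treat it as untagged
            word, tag = tag, None
        pairs.append((word, tag))
    return pairs

def words_with_tag(pairs, tag):
    """Retrieve every word form annotated with the given tag."""
    return [word for word, word_tag in pairs if word_tag == tag]

# Hypothetical annotated sentence; only VVZ is taken from the text above.
annotated = "He_PPHS1 gives_VVZ Tony_NP1 books_NN2"
pairs = parse_tagged(annotated)
lexical_verbs = words_with_tag(pairs, "VVZ")
print(lexical_verbs)
```

With the part of speech stored explicitly, the search is a simple comparison of tags and needs no knowledge of English grammar.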

Formats of Annotation
Currently, there are no widely agreed standards of representing information in texts and in the past many different approaches have been adopted, some more lasting than others. One longstanding annotation practice is known as COCOA references. COCOA was an early computer program used for extracting indexes of words in context from machine readable texts. Its conventions were carried forward into several other programs, notably the OCP (Oxford Concordance Program). The Longman-Lancaster corpus and the Helsinki corpus have also used COCOA references. Very simply, a COCOA reference consists of a balanced set of angled brackets (< >) which contains two entities:
A code which stands for a particular variable name.
A string or set of strings, which are the instantiations of that variable.
For example, the code "A" could be used to refer to the variable "author" and the string would stand for the author's name. Thus COCOA references which indicate the author of a passage of text would look like the following:
<A CHARLES DICKENS>
<A WOLFGANG VON GOETHE>
<A HOMER>
COCOA references only represent an informal trend for encoding specific types of textual information, e.g. authors, dates and titles. Current trends are moving more towards more formalised international standards of encoding. The flagship of this current trend is the Text Encoding Initiative (TEI), a project sponsored by the Association for Computational Linguistics, the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities. Its aim is to provide standardised implementations for machine-readable text interchange. The TEI uses a form of document markup known as SGML (Standard Generalised Markup Language). SGML has the following advantages:
Clarity
Simplicity
Formally rigorous
Already recognised as an international standard
The TEI's contribution is a detailed set of guidelines as to how this standard is to be used in text encoding (Sperberg-McQueen and Burnard, 1994). In the TEI, each text (or document) consists of two parts - a header and the text itself. The header contains information such as the following:

author, title and date
the edition or publisher used in creating the machine-readable text
information about the encoding practices adopted.
You might also want to read about the EAGLES advisory body in chapter 2 of Corpus Linguistics (page 29).

Textual and extra-textual information
The most basic type of additional information is that which tells us what text or texts we are looking at. A computer file name may give us a clue to what the file contains, but in many cases filenames can only provide us with a tiny amount of information. Information about the nature of the text can often consist of much more than a title and an author. These information fields provide the document with a whole document header which can be used by retrieval programs to search and sort on particular variables. For example, we might only be interested in looking at texts in a corpus that were written by women, so we could ask a computer program to retrieve texts where the author's gender variable is equal to "FEMALE".

Orthography
It might be thought that converting a written or spoken text into machine-readable form is a relatively simple typing or optical scanning task, but even with a basic machine-readable text, issues of encoding are vital, although to English speakers their extent may not be apparent at first. In languages other than English, the issue of accents and of non-Roman alphabets such as Greek, Russian and Japanese presents a problem. IBM-compatible computers are capable of handling accented characters, but many other mainframe computers are unable to do this. Therefore, for maximum interchangeability, accented characters need to be encoded in other ways. Various strategies have been adopted by native speakers of languages which contain accents when using computers or typewriters which lack these characters. For example, French speakers omit the accent entirely, writing Hélène as Helene. To handle the umlaut, German speakers either introduce an extra letter "e" or place a double quote mark before the relevant letter, so Frühling would become Fruehling or Fr"uhling. However, these strategies cause additional problems - in the case of the French, information is lost, while in the German extraneous information is added. In response to this the TEI has suggested that these characters are encoded as TEI entities, using the delimiting characters of & and ;. For example, ü would be encoded by the TEI as &uumlaut;. Read about the handling of non-Roman alphabets and the transcription of spoken data in Corpus Linguistics, chapter 2, pages 34-36.

Types of annotation
Certain kinds of linguistic annotation, which involve the attachment of special codes to words in order to indicate particular features, are often known as "tagging" rather than annotation, and the codes which are assigned to features are known as "tags". These terms will be used in the sections which follow:
Part of Speech annotation
Lemmatisation
Parsing
Semantics
Discoursal and text linguistic annotation
Phonetic transcription
Prosody
Problem-oriented tagging

Multilingual Corpora
Not all corpora are monolingual, and an increasing amount of work is being carried out on the building of multilingual corpora, which contain texts of several different languages. First we must make a distinction between two types of multilingual corpora: the first can really be described as small collections of individual monolingual corpora in the sense that the same procedures and categories are used for each language, but each contains completely different texts in those several languages. For example, the Aarhus corpus of Danish, French and English contract law consists of a set of three monolingual law corpora, which is not comprised of translations of the same texts. The second type of multilingual corpora (and the one which receives the most attention) is parallel corpora. This refers to corpora which hold the same texts in more than one language. The parallel corpus dates back to mediaeval times when "polyglot bibles" were produced which contained the biblical texts side by side in Hebrew, Latin and Greek etc. A parallel corpus is not immediately user-friendly. For the corpus to be useful it is necessary to identify which sentences in the sub-corpora are translations of each other, and which words are translations of each other. A corpus which shows these identifications is known as an aligned corpus as it makes an explicit link between the elements which are mutual translations of each other. For example, in a corpus the sentences "Das Buch ist auf dem Tisch" and "The book is on the table" might be aligned to one another. At a further level, specific words might be aligned, e.g. "Das" with "The". This is not always a simple process, however, as often one word in one language might be equal to two words in another language, e.g. the German word "raucht" would be equivalent to "is smoking" in English.
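One way to picture sentence and word alignment is as a pair of token lists plus a set of links between positions, where a single link may join one word to two. The sketch below is only an illustration under stated assumptions: the AlignedPair class, its field names and the translations_of helper are hypothetical, and the sentence pair "Er raucht" / "He is smoking" is invented here to give the "raucht" example a context.

```python
from dataclasses import dataclass, field

@dataclass
class AlignedPair:
    """A sentence pair with word links (hypothetical structure for this sketch)."""
    source: list                                # tokens of the source sentence
    target: list                                # tokens of the target sentence
    links: list = field(default_factory=list)   # (source indices, target indices)

def translations_of(pair, word):
    """Return the target word(s) linked to a given source word."""
    results = []
    for src_idx, tgt_idx in pair.links:
        if word in [pair.source[i] for i in src_idx]:
            results.append(" ".join(pair.target[j] for j in tgt_idx))
    return results

# Sentence-level alignment with simple one-to-one word links.
book = AlignedPair(
    source="Das Buch ist auf dem Tisch".split(),
    target="The book is on the table".split(),
    links=[((i,), (i,)) for i in range(6)],
)

# One source word linked to two target words (invented sentence pair).
smoke = AlignedPair(
    source=["Er", "raucht"],
    target=["He", "is", "smoking"],
    links=[((0,), (0,)), ((1,), (1, 2))],
)

print(translations_of(book, "Das"))
print(translations_of(smoke, "raucht"))
```

Storing links as groups of indices rather than single positions is what allows the one-to-two case to be recorded without special handling.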

At present there are few cases of annotated parallel corpora, and those which exist tend to be bilingual rather than multilingual. However, two EU-funded projects (CRATER and MULTEXT) are aiming to produce genuinely multilingual parallel corpora. The Canadian Hansard corpus is annotated, and contains parallel texts in French and English, but it only covers a restricted range of text types (proceedings of the Canadian Parliament). However, this is an area of growth, and the situation is likely to change dramatically in the near future.

An example of a bilingual corpus
This example is taken from a parallel French-English corpus, and is aligned at sentence level.
sub d = 22 ----------& the location register should as a minimum contain the following information about a mobile station : -----& 1 ' enrigisteur de localisation doit contenir au moins les renseignments suivants sur une station mobile :
sub d = 386 ----------& handover is the action of switching a call in progress from one cell to another ( or radio channels in the same cell ) . -----& le transfert intercellulaire consiste à commuter une communication en cours d ' une cellule ( ou d ' une voie radioélectrique à l ' autre à l ' intérieur de la même cellule ) .
sub d = 380 ----------& the location register , other than the home location register used by an msc to retrieve information for , for instance , handling of calls to or from a roaming mobile station , currently located in its area . -----& enregistreur de localisation , autre que l ' enregistreur de localisation nominal , utlilisé par un ccm pour la recherche d ' informations en vue , par exemple , de l ' établissment de communication en provenance ou à destination d ' une station mobile en déplacement , temporairement située dans sa zone .

Introduction
In this session we'll be looking at the techniques used to carry out corpus analysis. We'll reexamine Chomsky's argument that corpus linguistics will result in skewed data, and see the procedures used to ensure that a representative sample is obtained. We'll also be looking at the relationship between quantitative and qualitative research. Although the majority of this session is concerned with statistical procedures which can be said to be quantitative, it is important not to ignore the importance of qualitative analyses.

With the statistical part of this session two points should be made. First, that this section is of necessity incomplete. Space precludes the coverage of all of the techniques which can be used on corpus data. Second, we do not aim here to provide a "step-by-step" guide to statistics. Many of the techniques used are very complex and to explain the mathematics in full would require a separate session for each one. Other books, notably Language and Computers and Statistics for Corpus Linguistics (Oakes, M. - forthcoming) present these methods in more detail than we can give here.
Tony McEnery, Andrew Wilson, Paul Baker.

Qualitative vs Quantitative analysis
Corpus analysis can be broadly categorised as consisting of qualitative and quantitative analysis. In this section we'll look at both types and see the pros and cons associated with each. You should bear in mind that these two types of data analysis form different, but not necessarily incompatible perspectives on corpus data.

Qualitative analysis: Richness and Precision.
The aim of qualitative analysis is a complete, detailed description. No attempt is made to assign frequencies to the linguistic features which are identified in the data, and rare phenomena receive (or should receive) the same amount of attention as more frequent phenomena. Qualitative analysis allows for fine distinctions to be drawn because it is not necessary to shoehorn the data into a finite number of classifications. Ambiguities, which are inherent in human language, can be recognised in the analysis. For example, the word "red" could be used in a corpus to signify the colour red, or as a political categorisation (e.g. socialism or communism). In a qualitative analysis both senses of red in the phrase "the red flag" could be recognised. The main disadvantage of qualitative approaches to corpus analysis is that their findings can not be extended to wider populations with the same degree of certainty that quantitative analyses can. This is because the findings of the research are not tested to discover whether they are statistically significant or due to chance.

Quantitative analysis: Statistically reliable and generalisable results.
In quantitative research we classify features, count them, and even construct more complex statistical models in an attempt to explain what is observed. Findings can be generalised to a larger population, and direct comparisons can be made between two corpora, so long as valid sampling and significance techniques have been used. Thus, quantitative analysis allows us to discover which phenomena are likely to be genuine reflections of the behaviour of a language or variety, and which are merely chance occurrences. The more basic task of just looking at a single language variety allows one to get a precise picture of the frequency and rarity of particular phenomena, and thus their relative normality or abnormality. However, the picture of the data which emerges from quantitative analysis is less rich than that obtained from qualitative analysis. For statistical purposes, classifications have to be of the hard-and-fast (so-called "Aristotelian") type. An item either belongs to class x or it doesn't. So in the above example about the phrase "the red flag" we would have to decide whether to classify "red" as "politics" or "colour". As can be seen, many linguistic terms and phenomena do not therefore belong to simple, single categories: rather they are more consistent with the recent notion of "fuzzy sets" as in the red example. Quantitative analysis is therefore an idealisation of the data in some cases. Also, quantitative analysis tends to sideline rare occurrences. To ensure that certain statistical tests (such as chi-squared) provide reliable results, it is essential that minimum frequencies are obtained - meaning that categories may have to be collapsed into one another resulting in a loss of data richness.

A recent trend
From this brief discussion it can be appreciated that both qualitative and quantitative analyses have something to contribute to corpus study. There has been a recent move in social science towards multi-method approaches which tend to reject the narrow analytical paradigms in favour of the breadth of information which the use of more than one method may provide. In any case, as Schmied (1993) notes, a stage of qualitative research is often a precursor for quantitative analysis, since before linguistic phenomena can be classified and counted, the categories for classification must first be identified. Schmied demonstrates that corpus linguistics could benefit as much as any field from multi-method research.

Corpus Representativeness
As we saw in Session One, Chomsky criticised corpus data as being only a small sample of a large and potentially infinite population, and that it would therefore be skewed and hence unrepresentative of the population as a whole. This is a valid criticism, and it applied not just to corpus linguistics but to any form of scientific investigation which is based on sampling. However, the picture is not as drastic as it first appears, as there are many safeguards which may be applied in sampling to ensure maximum representativeness. First, it must be noted that at the time of Chomsky's criticisms, corpus collection and analysis was a long and painstaking task, carried out by hand, with the result that the finished corpus had to be of a manageable size for hand analysis. Although size is not a guarantee of representativeness, it does enter significantly into the factors which must be considered in the production of a maximally representative corpus. Thus, Chomsky's criticisms were at least partly true at the time of those early corpora. However, today we have powerful computers which can store and manipulate many millions of words. The issue of size is no longer the problem that it used to be.

Random sampling techniques are standard to many areas of science and social science, and these same techniques are also used in corpus building. But there are additional caveats which the corpus builder must be aware of. Biber (1993) emphasises that we need to define as clearly as possible the limits of the population which we wish to study, before we can define sampling procedures for it. This means that we must rigorously define our sampling frame - the entire population of texts from which we take our samples. One way to do this is to use a comprehensive bibliographical index - this was the approach taken for the Lancaster-Oslo/Bergen corpus, which used the British National Bibliography and Willing's Press Guide as its indices. Another approach could be to define the sampling frame as being all the books and periodicals in a particular library which refer to your particular area of interest. For example, all the German-language books in Lancaster University library that were published in 1993. This approach is one which was used in building the Brown corpus. Read about a different kind of approach, which was used in collecting the spoken parts of the British National Corpus, in Corpus Linguistics, chapter 3, page 65.

Biber (1993) also points out the advantage of determining beforehand the hierarchical structure (or strata) of the population. This refers to defining the different genres, channels etc. that it is made up of. For example, written German could be made up of genres such as:
newspaper reporting
romantic fiction
legal statutes
scientific writing
poetry
and so on.
Stratificational sampling is never less representative than pure probabilistic sampling, and is often more representative, as it allows each individual stratum to be subjected to probabilistic sampling. However, these strata (like corpus annotation) are an act of interpretation on the part of the corpus builder and others may argue that genres are not naturally inherent within a language. Genre groupings have a lot to do with the theoretical perspective of the linguist who is carrying out the stratification. Read about optimal lengths and number of sample sizes, and the problems of using standard statistical equations to determine these figures in Corpus Linguistics, chapter 3, page 66.

Frequency Counts
This is the most straightforward approach to working with quantitative data. Items are classified according to a particular scheme and an arithmetical count is made of the number of items (or tokens) within the text which belong to each classification (or type) in the scheme.

For instance, we might set up a classification scheme to look at the frequency of the four major parts of speech: noun, verb, adjective and adverb. These four classes would constitute our types. Another example involves the simple one-to-one mapping of form onto classification. In other words, we count the number of times each word appears in the corpus, resulting in a list which might look something like:
abandon: 5
abandoned: 3
abandons: 2
ability: 5
able: 28
about: 128
etc.
More often, however, the use of a classification scheme implies a deliberate act of categorisation on the part of the investigator. Even in the case of word frequency analysis, variant forms of the same lexeme may be lemmatised before a frequency count is made. For instance, in the example above, abandon, abandons and abandoned might all be classed as the lexeme ABANDON. Very often the classification scheme used will correspond to the type of linguistic annotation which will have already been introduced into the corpus at some earlier stage (see Session 2). An example of this might be an analysis of the incidence of different parts of speech in a corpus which had already been part-of-speech tagged.

Working with Proportions
Frequency counts are useful, but they have certain disadvantages when one wishes to compare one data set with another, for example a corpus of spoken language with a corpus of written language. Frequency counts simply give the number of occurrences of each type; they do not indicate the prevalence of a type in terms of a proportion of the total number of tokens in the text. This is not a problem when the two corpora that are being compared are of the same size, but when they are of different sizes frequency counts are little more than useless. The following example compares two such corpora, looking at the frequency of the word boot:

Type of corpus     Number of words    Number of instances of boot
English Spoken     50,000             50
English Written    500,000            500

A brief look at the table seems to show that boot is more frequent in written rather than spoken English. However, if we calculate the frequency of occurrence of boot as a percentage of the total number of tokens in the corpus (the total size of the corpus) we get:
spoken English: 50/50,000 x 100 = 0.1%
written English: 500/500,000 x 100 = 0.1%
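The counting and normalising steps described in these two sections can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy sentence and the small lemma table are invented for the example, while the boot figures are the ones quoted in the text.

```python
from collections import Counter

def frequency_counts(tokens, lemma_of=None):
    """Count tokens per type; optionally map each word form to a lemma first."""
    lemma_of = lemma_of or {}
    return Counter(lemma_of.get(token, token) for token in tokens)

def percentage(occurrences, corpus_size):
    """Frequency of a type as a percentage of all tokens in the corpus."""
    return 100 * occurrences / corpus_size

# Toy data (hypothetical): a very small "corpus" and a hand-made lemma table.
tokens = "she abandoned the plan he abandons the plan they abandon it".split()
lemmas = {"abandon": "ABANDON", "abandons": "ABANDON", "abandoned": "ABANDON"}

raw = frequency_counts(tokens)                 # one type per word form
lemmatised = frequency_counts(tokens, lemmas)  # variant forms merged

# The boot example: raw counts differ, proportions do not.
spoken = percentage(50, 50_000)
written = percentage(500, 500_000)

print(raw.most_common(3))
print(lemmatised["ABANDON"])
print(spoken, written)
```

Keeping the lemma table as an optional argument mirrors the point made above: lemmatisation is a deliberate act of categorisation by the investigator, not something the count itself decides.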

Looking at these figures it can be seen that the frequency of boot in our made-up example is the same (0.1%) for both the written and spoken corpora. Even where disparity of size is not an issue, it is often better to use proportional statistics to present frequencies, since most people find them easier to understand than comparing fractions of unusual numbers like 53,000. The most basic way to calculate the ratio between the size of the sample and the number of occurrences of the type under investigation is:
ratio = number of occurrences of the type / number of tokens in the entire sample
This result can be expressed as a fraction, or more commonly as a decimal. However, if that results in an unwieldy looking small number (in the above example it would be 0.001) the ratio can then be multiplied by 100 and represented as a percentage.

Significance Testing
Significance tests allow us to determine whether or not a finding is the result of a genuine difference between two (or more) items, or whether it is just due to chance. For example, suppose we are examining the Latin versions of the Gospel of Matthew and the Gospel of John and we are looking at how third person singular speech is represented. Specifically we want to compare how often the present tense form of the verb "to say" is used ("dicit") with how often the perfect form of the verb is used ("dixit"). A simple count of the two verb forms in each text produces the following results:

Text       No. of occurrences of dicit    No. of occurrences of dixit
Matthew    46                             107
John       118                            119

From these figures it looks as if John uses the present form ("dicit") proportionally more often than Matthew does, but to be more certain that this is not just due to coincidence, we need to perform a further calculation - the significance test.
There are several types of significance test available to the corpus linguist: the chi-squared test, the [Student's] t-test, Wilcoxon's rank sum test and so on. Here we will only examine the chi-squared test as it is the most commonly used significance test in corpus linguistics. This is a non-parametric test which is easy to calculate, even without a computer statistics package, and can be used with data in 2 x 2 tables, such as the example above. However, it should be noted that the chi-squared test is unreliable where very small numbers are involved and should not therefore be used in such cases. Also, proportional data (percentages etc) cannot be used with the chi-squared test.
The test compares the difference between the actual frequencies (the observed frequencies in the data) with those which one would expect if no factor other than chance had been operating (the
expected frequencies). The closer these two results are to each other, the greater the probability that the observed frequencies are influenced by chance alone.
Having calculated the chi-squared value (we will omit this here and assume it has been done with a computer statistical package) we must look in a set of statistical tables to see how significant our chi-squared value is (usually this is also carried out automatically by computer). We also need one further value - the number of degrees of freedom, which is simply:
(number of columns in the frequency table - 1) x (number of rows in the frequency table - 1)
In the example above this is equal to (2-1) x (2-1) = 1.
We then look at the table of chi-square values in the row for the relevant number of degrees of freedom until we find the nearest chi-square value to the one which is calculated, and read off the probability value for that column. The closer to 0 the value, the more significant the difference is - i.e. the more unlikely that it is due to chance alone. A value close to 1 means that the difference is almost certainly due to chance. In practice it is normal to assign a cut-off point which is taken to be the difference between a significant result and an "insignificant" result. This is usually taken to be 0.05 (probability values of less than 0.05 are written as "p < 0.05" and are assumed to be significant.)
In our example about the use of dicit and dixit above we calculate a chi-squared value of 14.843. The table below shows the significant p values for the first 3 degrees of freedom:

Degrees of Freedom    p = 0.05    p = 0.01    p = 0.001
1                     3.84        6.63        10.83
2                     5.99        9.21        13.82
3                     7.81        11.34       16.27

The number of degrees of freedom in our example is 1, and our result is higher than 10.83 (see the final column in the table) so the probability value for this chi-square value is less than 0.001. Thus, the difference between Matthew and John can be said to be significant at p < 0.001, and we can therefore say with a high degree of certainty that this difference is a true reflection of variation in the two texts and not due to chance.

Collocations
The idea of collocations is an important one to many areas of linguistics. Kjellmer (1991) has argued that our mental lexicon is made up not only of single words, but also of larger phraseological units, both fixed and more variable. Information about collocations is important for dictionary writing, natural language processing and language teaching. However, it is not easy to determine which co-occurrences are significant collocations, especially if one is not a native speaker of a language or language variety.
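As a footnote to the Significance Testing section above: the chi-squared value for the dicit/dixit table is left there to a statistics package. The following is a minimal sketch of what such a package computes, using only the Python standard library. The function names are our own, and we assume the plain Pearson statistic with no continuity correction, since that is what reproduces the figure of 14.843 quoted in the text. The probability shortcut shown is valid only for 1 degree of freedom.

```python
import math

def chi_squared(table):
    """Pearson chi-squared statistic for a table of raw (not proportional) counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency if nothing but chance were operating.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic

def degrees_of_freedom(table):
    """(number of columns - 1) x (number of rows - 1)."""
    return (len(table[0]) - 1) * (len(table) - 1)

def p_value_one_df(statistic):
    """Probability of a value at least this large by chance; 1 degree of freedom only."""
    return math.erfc(math.sqrt(statistic / 2))

# Rows: Matthew, John. Columns: dicit, dixit.
observed = [[46, 107], [118, 119]]
chi2 = chi_squared(observed)
df = degrees_of_freedom(observed)
p = p_value_one_df(chi2)

print(round(chi2, 3), df, p < 0.001)
```

For tables with more than one degree of freedom the printed table of critical values, or a statistics package, is still needed; the closed form used in `p_value_one_df` does not generalise.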

Given a text corpus it is possible to empirically determine which pairs of words have a substantial amount of "glue" between them. Two of the most commonly encountered formulae are mutual information and the Z-score. Both tests provide similar data, comparing the probability that two words occur together as a joint event (i.e. because they belong together) with the probability that they are simply the result of chance. For example, the words riding and boots may occur as a joint event by reason of their belonging to the same multiword unit (riding boots) while the words formula and borrowed may simply occur because of a one-off juxtaposition and have no special relationship. For each pair of words, a score is given - the higher the score the greater the degree of collocality.
Mutual information and the Z-score are useful in the following ways:
• They enable us to extract multiword units from corpus data, which can be used in lexicography and particularly specialist technical translation.
• We can group similar collocates of words together to help to identify different senses of the word. For example, bank might collocate with words such as river, indicating the landscape sense of the word, and with words like investment indicating the financial use of the word.
• We can discriminate the differences in usage between words which are similar. For example, Church et al (1991) looked at collocations of strong and powerful in a corpus of press reports. Although these two words have similar meanings, their mutual information scores for associations with other words revealed interesting differences. Strong collocated with northerly, showings, believer, currents, supporter and odor, while powerful collocated with words such as tool, minority, neighbour, symbol, figure, weapon and post. Such information about the delicate differences in collocation between the two words has a potentially important role, for example in helping students who learn English as a foreign language.

Multiple Variables
The tests that we have looked at so far can only pick up differences between particular samples (i.e. texts and corpora) on particular variables (i.e. linguistic features) but they cannot provide a picture of the complex interrelationship of similarity and difference between a large number of samples, and large numbers of variables. To perform such comparisons we need to consider multivariate techniques. Those most commonly encountered in linguistic research are:
• factor analysis
• principal components analysis
• multidimensional scaling
• cluster analysis
The aim of multivariate techniques is to summarise a large set of variables in terms of a smaller set on the basis of statistical similarities between the original variables, whilst at the same time losing the minimal amount of information about their differences.

Although we will not attempt to explain the complex mathematics behind these techniques, it is worth taking time to understand the stages by which they work.
All the techniques begin with a basic cross-tabulation of the variables and samples.
For factor analysis an intercorrelation matrix is then calculated from the cross-tabulation, which is used to attempt to "summarise" the similarities between the variables in terms of a smaller number of reference factors which the technique extracts. The hypothesis is that the many variables which appear in the original frequency cross-tabulation are in fact masking a smaller number of variables (the factors) which can help explain better why the observed frequency differences occur. Each variable receives a loading on each of the factors which are extracted, signifying its closeness to that factor. For example, in analysing a set of word frequencies across several texts one might find that words in a certain conceptual field (e.g. religion) received high loadings on one factor, whereas those in another field (e.g. government) loaded highly on another factor.
Correspondence analysis is similar to factor analysis, but it differs in the basis of its calculations.
Multidimensional scaling (MDS) also makes use of an intercorrelation matrix, which is then converted to a matrix in which the correlation coefficients are replaced with rank order values, e.g. the highest correlation value receives a rank order of 1, the next highest receives a rank order of 2 and so on. MDS then attempts to plot and arrange these variables so that the more closely related items are plotted closer together than the less closely related items.
Cluster analysis involves assembling the variables into unique groups or "clusters" of similar items. A matrix is created, in a similar fashion to factor analysis (although this may be a distance matrix showing the degree of difference rather than similarity between the pairs of variables in the cross-tabulation). The matrix is then used to group the variables contained within it.

Log-linear Models
Here we will consider a different technique which deals with the interrelationships of several variables. As linguists, we often want to go beyond the simple description of a phenomenon, and explain what it is that causes the data to behave in a particular way. A loglinear analysis allows us to take a standard frequency cross-tabulation and find out which variables seem statistically most likely to be responsible for a particular effect. For example, let us imagine that we are interested in the factors which influence whether the word for is present or omitted from phrases of duration such as She studied [for] three years in Munich. We may hypothesise several factors which could have an effect on this, e.g. the text genre, the semantic category of the main verb and whether or not the verb is separated by an adverb from the phrase of duration. Any one of these factors might be solely responsible for the
omission of for, or it might be the case that a combination of factors is culpable. Finally, all the factors working together could be responsible for the presence/omission of for. A loglinear analysis provides us with a number of models which take these points into account.
The way that we test the models in loglinear analysis is first to test the significance of associations in the most complex model - that is, the model which assumes that all of the variables are working together. Then we take away one variable at a time from the model and see whether significance is maintained in each case, until we reach the model with the lowest possible dimensions. So in the above example, we would start with a model that posited three variables (e.g. genre, verb class and adverb separation) and test the significance of a three-variable model. Then we would test each of the two-variable models (taking away one variable in each case) and finally each of the three one-variable models. The best model would be taken to be the one with the fewest variables which still retained statistical significance.

Introduction
In this session we will examine the roles which corpora may play in the study of language. The importance of corpora to language study is aligned to the importance of empirical data. Empirical data enable the linguist to make objective statements, rather than those which are subjective, or based upon the individual's own internalised cognitive perception of language. Empirical data also allow us to study language varieties such as dialects or earlier periods in a language for which it is not possible to carry out a rationalist approach.
It is important to note that although many linguists may use the term "corpus" to refer to any collection of texts, when it is used here it refers to a body of text which is carefully sampled to be maximally representative of the language or language variety. Corpus linguistics proper should be seen as a subset of the activity within an empirical approach to linguistics. Although corpus linguistics entails an empirical approach, empirical linguistics does not always entail the use of a corpus.
In the following pages we'll consider the roles which corpus use may play in a number of different fields of study related to language. We will focus on the conceptual issues of why corpus data are important to these areas, and how they can contribute to the advancement of knowledge in each, providing real examples of corpus use. In view of the huge amount of corpus-based linguistic research, the examples are necessarily selective - you can consult further reading for additional examples.

Corpora in Speech Research
A spoken corpus is important because of the following useful features:
• It provides a broad sample of speech, extending over a wide selection of variables such as:
o speaker gender
o speaker age
o speaker class
o genre (e.g. newsreading, poetry, legal proceedings etc)
This allows generalisations to be made about spoken language as the corpus is as wide and as representative as possible. It also allows for variations within a given spoken language to be studied.
• It provides a sample of naturalistic speech rather than speech elicited under artificial conditions. The findings from the corpus are therefore more likely to reflect language as it is spoken in "real life" since the data is less likely to be subject to production monitoring by the speaker (such as trying to suppress a regional accent).
• Because the (transcribed) corpus has usually been enhanced with prosodic and other annotations it is easier to carry out large scale quantitative analyses than with fresh raw data. Where more than one type of annotation has been used it is possible to study the interrelationships between, say, phonetic annotations and syntactic structure.

Prosodic annotation of spoken corpora
Because much phonetic corpus annotation has been at the level of prosody, this has been the focus of most of the phonetic and phonological research in spoken corpora. This work can be divided roughly into three types:
1. How do prosodic elements of speech relate to other linguistic levels?
2. How does what is actually perceived and transcribed relate to the actual acoustic reality of speech?
3. How does the typology of the text relate to the prosodic patterns in the corpus?

Corpora in Lexical Studies
Empirical data was used in lexicography long before the discipline of corpus linguistics was invented. Samuel Johnson, for example, illustrated his dictionary with examples from literature, and in the 19th Century the Oxford Dictionary used citation slips to study and illustrate word usage. Corpora, however, have changed the way in which linguists can look at language. A linguist who has access to a corpus, or other (non-representative) collection of machine readable text, can call up all the examples of a word or phrase from many millions of words of text in a few seconds. Dictionaries can be produced and revised much more quickly than before, thus providing up-to-date information about language. Also, definitions can be more complete and precise since a larger number of natural examples are examined.
Examples extracted from corpora can be easily organised into more meaningful groups for analysis, for example by sorting the right-hand context of the word alphabetically so that it is possible to see all instances of a particular collocate together. Furthermore, because corpus data contains a rich amount of textual information - regional variety,
date, genre, part-of-speech tags etc - it is easier to tie down usages of particular words or phrases as being typical of particular regional varieties, genres and so on.
The open-ended (constantly growing) monitor corpus has its greatest role in dictionary building as it enables lexicographers to keep on top of new words entering the language, or existing words changing their meanings, or the balance of their use according to genre etc. However, finite corpora also have an important role in lexical studies - in the area of quantification. It is possible to rapidly produce reliable frequency counts and to subdivide these areas across various dimensions according to the varieties of language in which a word is used.
Finally, the ability to call up word combinations rather than individual words, and the existence of mutual information tools which establish relationships between co-occurring words (see Session 3) mean that we can treat phrases and collocations more systematically than was previously possible. A phraseological unit may constitute a piece of technical terminology or an idiom, and collocations are important clues to specific word senses.

Corpora and Grammar
Grammatical (or syntactic) studies have, along with lexical studies, been the most frequent types of research which have used corpora. Corpora make a useful tool for syntactical research because of:
• The potential for the representative quantification of a whole language variety.
• Their role as empirical data for the testing of hypotheses derived from grammatical theory.
Many smaller-scale studies of grammar using corpora have included quantitative data analysis (for example, Schmied's 1993 study of relative clauses). There is now a greater interest in the more systematic study of grammatical frequency - for example, Oostdijk and de Haan (1994a) are aiming to analyse the frequency of the various English clause types.
Since the 1950s the rational-theory based/empiricist-descriptive division in linguistics (see Session One) has often meant that these two approaches have been viewed as separate and in competition with each other. However, there is a group of researchers who have used corpora in order to test essentially rationalist grammatical theory, rather than use it for pure description or the inductive generation of theory.
At Nijmegen University, for instance, primarily rationalist formal grammars are tested on real-life language found in computer corpora (Aarts 1991). The formal grammar is first devised by reference to introspective techniques and to existing accounts of the grammar of the language. The grammar is then loaded into a computer parser and is run over a corpus to test how far it accounts for the data in the corpus. The grammar is then modified to take account of those analyses which it missed or got wrong.

Corpora and Semantics
The main contribution that corpus linguistics has made to semantics is by helping to establish an approach to semantics which is objective, and takes account of indeterminacy and gradience.
Mindt (1991) demonstrates how a corpus can be used in order to provide objective criteria for assigning meanings to linguistic terms. Mindt points out that frequently in semantics, meanings of terms are described by reference to the linguist's own intuitions - the rationalist approach that we mentioned in the section on Corpora and Grammar. Mindt argues that semantic distinctions are associated in texts with characteristic observable contexts - syntactic, morphological and prosodic - and by considering the environments of the linguistic entities an empirical objective indicator for a particular semantic distinction can be arrived at.
Another role of corpora in semantics has been in establishing more firmly the notions of fuzzy categories and gradience. In theoretical linguistics, categories are usually seen as being hard and fast - either an item belongs to a category or it does not. However, psychological work on categorisation suggests that cognitive categories are not usually "hard and fast" but instead have fuzzy boundaries, so it is not so much a question of whether an item belongs to one category or the other, but how often it falls into one category as opposed to the other one. In looking empirically at natural language in corpora it is clear that this "fuzzy" model accounts better for the data: clear-cut boundaries do not exist; instead there are gradients of membership which are connected with frequency of inclusion.

Corpora in Pragmatics and Discourse Analysis
The amount of corpus-based research in pragmatics and discourse analysis has been relatively small up to now. This is partly because these fields rely on context (Myers 1991) and the small samples of texts used in corpora tend to mean that they are somewhat removed from their social and textual contexts. Sometimes relevant social information (gender, class, region) is encoded within the corpus but it is still not always possible to infer context from corpus texts.
Much of the work that has been carried out in this area has used the London-Lund corpus, which was until recently the only truly conversational corpus. The main contribution of such research has been to the understanding of how conversation works, with respect to lexical items and phrases which have conversational functions.
Stenström (1984) correlated discourse items such as well, sort of and you know with pauses in speech and showed that such correlations related to whether or not the speaker expects a response from the addressee. Another study by Stenström (1987) examined "carry-on signals" such as right, right-o and all right. These signals were classified according to the typology of their various functions, e.g.:
• right was used in all functions, but especially as a response, to evaluate a previous response or terminate an exchange.
• All right was used to mark a boundary between two stages in discourse.
• that's right was used as an emphasiser.
• it's alright and that's alright were responses to apologies.
The availability of new conversational corpora, such as the spoken part of the BNC (British National Corpus), should provide a greater incentive both to extend and to replicate such studies, since the amount of conversational data available and the social/geographical range of people recorded will both have increased. At present, quantitative analyses of corpus-based approaches to issues in pragmatics have been poorly served. Hopefully this is one area which will be exploited by linguists in the near future.

Corpora and Sociolinguistics
Although sociolinguistics is an empirical field of research it has hitherto relied primarily upon the collection of research-specific data which is often not intended for quantitative study and is thus not often rigorously sampled. Sometimes the data are also elicited rather than naturalistic data. A corpus can provide what these kinds of data cannot provide - a representative sample of naturalistic data which can be quantified. Although corpora have not as yet been used to a great extent in sociolinguistics, there is evidence that this is a growing field.
The majority of studies in this area have concerned themselves with lexical studies in the area of language and gender. Kjellmer (1986), for example, used the Brown and LOB corpora to examine the masculine bias in American and British English. He looked at the occurrence of masculine and feminine pronouns, and at the occurrence of the items man/men and woman/women. As one would expect, the frequencies of the female items were much lower than the male items in both corpora. Interestingly, however, the female items were more common in British English than in American English. Another hypothesis of Kjellmer's was not supported in the corpora - that women would be less "active", that is, would be more frequently the objects rather than the subjects of verbs. In fact men and women had similar subject/object ratios.
Holmes (1994) makes two important points about the methodology of these kinds of study, which are worth bearing in mind. First, when classifying and counting occurrences the context of the lexical item should be considered. For instance, whilst there is a non-gender marked alternative for policeman/policewoman, namely police officer, there is no such alternative for the -ess form in Duchess of York. The latter form should therefore be excluded from counts of "sexist" suffixes when looking at gender bias in writing. Second, Holmes points out the difficulty of classifying a form when it is actively undergoing semantic change. She argues that the word man can either refer to a single male (such as in the phrase A 35 year old man was killed) or have a generic meaning which refers to mankind (such as Man has engaged in warfare for centuries). In phrases such as we need the right man for the job it is difficult to decide whether man is gender specific or could be replaced by person. These simple points should incite a more critical approach to data classification in further sociolinguistic work using corpora, both within and without the area of gender studies.

Corpora and Stylistics
Stylistics researchers are usually interested in individual texts or authors rather than the more general varieties of a language and tend not to be large-scale users of corpora. Nevertheless, some stylisticians are interested in investigating broader issues such as genre, and others have found corpora to be important sources of data in their research.
In order to define an author's particular style, we must, in part, examine the degree by which the author leans towards different ways of putting things (technical vs non-technical vocabulary, long sentences vs short sentences and so on). This task requires comparisons to be made not only internally within the author's own work, but also with other authors or the norms of the language or variety as a whole. As Leech and Short (1981) point out, stylistics often demands the use of quantification to back up judgements which may appear subjective rather than objective. This is where corpora can play a useful role.
Another type of stylistic variation is the more general variation between genres and channels - for example, one of the most common uses of corpora has been in looking at the differences between spoken and written language. Altenberg (1984) examined the differences in the ordering of cause-result constructions while Tottie (1991) looked at the differences in negation strategies. Other work has looked at variations between genres, using subsamples of corpora as a database. For example, Wilson (1992) used sections from the LOB and Kolhapur corpora, the Augustan Prose Sample and a sample of modern English conversation to examine the usage of since and found that causal since had evolved from being the main causal connective in late seventeenth century writing to being characteristic of formal learned writing in the twentieth century.

Corpora in the Teaching of Languages and Linguistics
Resources and practices in the teaching of languages and linguistics tend to reflect the division between the empirical and rationalist approaches. Many textbooks contain only invented examples and their descriptions are based upon intuition or second-hand accounts. Other books, however, are explicitly empirical and use examples and descriptions from corpora or other sources of real life language data.
Corpus examples are important in language learning as they expose students to the kinds of sentences that they will encounter when using the language in real life situations. Students who are taught with traditional syntax textbooks which contain sentences such as Steve puts his money in the bank are often unable to analyse more complex sentences such as The government has welcomed a report by an Australian royal commission on the effects of Britain's atomic bomb testing programme in the Australian desert in the fifties and early sixties (from the Spoken English Corpus).
Apart from being a source of empirical teaching data, corpora can be used to look critically at existing language teaching materials. Kennedy (1987a, 1987b) has looked at ways of expressing quantification and frequency in ESL (English as a second language) textbooks. Holmes (1988) has examined ways of expressing doubt and certainty in ESL textbooks, while Mindt (1992) has
looked at future time expressions in German textbooks of English. These studies have similar methodologies - they analyse the relevant constructions or vocabularies, both in the sample textbooks and in standard English corpora, and then they compare their findings between the two sets. Most studies found that there were considerable differences between what textbooks are teaching and how native speakers actually use language as evidenced in the corpora. Some textbooks gloss over important aspects of usage, or foreground less frequent stylistic choices at the expense of more common ones. The general conclusion from these studies is that non-empirically based teaching materials can be misleading and that corpus studies should be used to inform the production of material so that the more common choices of usage are given more attention than those which are less common.
Read about language teaching for "special purposes" in Corpus Linguistics, Chapter 4, pages 104-105.
Corpora have also been used in the teaching of linguistics. Kirk (1994) requires his students to base their projects on corpus data which they must analyse in the light of a model such as Brown and Levinson's politeness theory or Grice's co-operative principle. In taking this approach, Kirk is using corpora not only as a way of teaching students about variation in English but also to introduce them to the main features of a corpus-based approach to linguistic analysis.
A further application of corpora in this field is their role in computer-assisted language learning. Recent work at Lancaster University has looked at the role of corpus-based computer software for teaching undergraduates the rudiments of grammatical analysis (McEnery and Wilson 1993). This software - Cytor - reads in an annotated corpus (either part-of-speech tagged or parsed) one sentence at a time, hides the annotation and asks the student to annotate the sentence him- or herself. Students can call up help in the form of the list of tag mnemonics, a frequency lexicon or concordances of examples. McEnery, Baker and Wilson (1995) carried out an experiment over the course of a term to determine how effective Cytor was at teaching part-of-speech learning by comparing two groups of students - one who were taught with Cytor, and another who were taught via traditional lecturer-based methods. In general the computer-taught students performed better than the human-taught students throughout the term.

Corpora and Historical Linguistics
Historical linguistics can be seen as a species of corpus linguistics, since the texts of a historical period or a "dead" language form a closed corpus of data which can only be extended by the (re)discovery of previously unknown manuscripts or books. In some cases it is possible to use (almost) all of the closed corpus of a language for research - something which can be done for ancient Greek, for example, using the Thesaurus Linguae Graecae corpus which contains most of extant ancient Greek literature. However, in practice historical linguistics has not tended to follow a strict corpus linguistic paradigm, instead taking a selective approach to empirical data, looking for evidence of particular phenomena and making rough estimates of frequency. No real attempts were made to produce samples that were representative.

In recent years, however, some historical linguists have changed their approach, resulting in an upsurge in strictly corpus-based historical linguistics and the building of corpora for this purpose. The most widely known English historical corpus is the Helsinki corpus. The Helsinki corpus contains approximately 1.6 million words of English dating from the earliest Old English Period (before AD 850) to the end of the Early Modern English period (1710). It is divided into three main periods - Old English, Middle English and Early Modern English - and each period is subdivided into a number of 100-year subperiods (or 70-year subperiods in some cases). The Helsinki corpus is representative in that it covers a range of genres, regional varieties and sociolinguistic variables such as gender, age, education and social class. The Helsinki team have also produced "satellite" corpora of early Scots and early American English. Other examples of English historical corpora in development are the Zürich Corpus of English Newspapers (ZEN), the Lampeter Corpus of Early Modern English Tracts (a sample of English pamphlets from between 1640 and 1740) and the ARCHER corpus (a corpus of British and American English from 1650-1990).

The work which is carried out on historical corpora is qualitatively similar to that which is carried out on modern language corpora, although it is also possible to carry out work on the evolution of language through time. For example, Peitsara (1993) used four subperiods from the Helsinki corpus and calculated the frequencies of different prepositions introducing agent phrases. Throughout the period she found that the most common prepositions of this type were "of" and "by", which were of almost equal frequency at the beginning of the period, but by the fifteenth century "by" was three times more common than "of", and by 1640 "by" was eight times as common. Studies like this have particular importance in the context of Halliday's (1991) conception of language evolution as a motivated change in the probabilities of the grammar.

However, it is important to be aware of the limitations of corpus linguistics, as Rissanen (1989) pointed out. Rissanen identifies three main problems associated with using historical corpora:
1. The "philologist's dilemma" - the danger that the use of a corpus and a computer may supplant the in-depth knowledge of language history which is to be gained from the study of original texts in their context.
2. The "God's truth fallacy" - the danger that a corpus may be used to provide representative conclusions about the entire language period, without understanding its limitations in terms of which genres it does and does not cover.
3. The "mystery of vanishing reliability" - the more variables which are used in sampling and coding the corpus (periods, genres, age, gender etc) the harder it is to represent each one fully and achieve statistical reliability. The most effective way of solving this problem is, of course, to build larger corpora.

Rissanen's reservations are valid and important, but should not diminish the value of corpus-based linguistics; rather they should serve as warnings of possible pitfalls which need to be taken on board by scholars, since with appropriate care they are surmountable.
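The "mystery of vanishing reliability" can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that a corpus of roughly the Helsinki size is split evenly across every combination of sampling variables. Only the corpus size and the three main periods come from the text; the other category counts, and the names `cells_in_design` and `words_per_cell`, are hypothetical.

```python
from math import prod

HELSINKI_WORDS = 1_600_000  # approximate size quoted in the text


def cells_in_design(levels):
    """Number of sampling cells: the product of the category counts."""
    return prod(levels.values())


def words_per_cell(corpus_words, levels):
    """Average words per cell if the corpus were split evenly (an idealisation)."""
    return corpus_words / cells_in_design(levels)


if __name__ == "__main__":
    # Only "period" (three main periods) comes from the text;
    # every other count below is a hypothetical illustration.
    design = {"period": 3}
    for variable, count in [("genre", 10), ("gender", 2), ("age group", 3)]:
        design[variable] = count
        print(f"{len(design)} variables: {cells_in_design(design):4d} cells, "
              f"about {words_per_cell(HELSINKI_WORDS, design):,.0f} words each")
```

Each added variable multiplies the number of cells, so the material available per cell shrinks quickly, which is why building larger corpora is the remedy suggested above.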

Corpora in Dialectology and Variation Studies
In this section we are concerned with geographical variation - corpora have long been recognised as a valuable source of comparison between language varieties as well as for the description of those varieties themselves. Certain corpora have tried to follow as far as possible the same sampling procedures as other corpora in order to maximise the degree of comparability. For example, the LOB corpus contains roughly the same genres and sample sizes as the Brown corpus and is sampled from the same year (i.e. 1961). The Kolhapur Indian corpus is also broadly parallel to Brown and LOB, although the sampling year is 1978.

One of the earliest pieces of work using the LOB and Brown corpora in tandem was the production of a word frequency comparison of American and British written English. These corpora have also been used as the basis for studying more complex aspects of language such as the use of the subjunctive (Johansson and Norheim 1988).

One role for corpora in national variation studies has been as a testbed for two theories of language variation: Quirk et al's (1985) "common core" hypothesis, and Braj Kachru's conception of national varieties as forming many unique "Englishes" which differ in important ways from one another. Most work on lexis and grammar comparing the Kolhapur Indian corpus with Brown and LOB has supported the common core hypothesis (Leitner 1991). However, there is still scope for the extension of such work.

Few examples of dialect corpora exist at present - two of which are the Helsinki corpus of English dialects and Kirk's Northern Ireland Transcribed Corpus of Speech (NITCS). Both corpora consist of conversations with a fieldworker - in Kirk's corpus from Northern Ireland, and in the Helsinki corpus from several English regions. Dialectology is an empirical field of linguistics although it has tended to concentrate on experiments and less controlled sampling, rather than use corpora. Such elicitation experiments tend to focus on vocabulary and pronunciation, neglecting other aspects of linguistics such as syntax. Dialect corpora allow these other aspects to be studied, and because the corpora are sampled so as to be representative, quantitative as well as qualitative conclusions can be drawn about the target population as a whole.

Corpora and Psycholinguistics
Although psycholinguistics is inherently a laboratory subject, measuring mental processes such as the length of time it takes to position a syntactic boundary in reading or how eye movements change, corpora can still have a part to play in this field. One important use is as a source of data from which materials for laboratory experiments can be developed. Schreuder and Kerkman (1987) point out that frequency is an important consideration in a number of cognitive processes, including word recognition. The psycholinguist should not go blindly into experiments in areas such as this with only a vague notion of frequency to guide the selection and analysis of

materials. Sampled corpora can provide psycholinguists with more concrete and reliable information about frequency, including the frequencies of different senses and parts of speech of ambiguous words (if the corpora are annotated).

A more direct example of the role of corpora in psycholinguistics can be seen from Garnham et al's (1981) study which used the London-Lund corpus to examine the occurrence of speech errors in natural conversational English. Before the study was carried out nobody knew how frequent speech errors were in everyday language, because such an analysis required adequate amounts of natural conversation, while previous work on speech errors had been based on the gradual ad hoc accumulation of data from many different sources. However, the spoken corpus was able to provide exactly the kind of data that was required. Garnham's study was able to classify and count the frequencies of different error types and hence provide some estimate of the general frequency of these in relation to speakers' overall output.

A third role for corpora lies in the analysis of language pathologies, where an accurate picture of abnormal data must be constructed before it is possible to hypothesise and test what may be wrong with the human language processing system. Although little work has been done with sampled corpora to date, it is important to stress their potential for these analyses. Studies of the language of linguistically impaired people, and of the language of children who are developing their (normal) linguistic skills, lack the quantified representative descriptions which are available. In the last decade, however, there has been a move towards the empirical analysis of machine-readable data in these areas. For example, the Polytechnic of Wales (POW) corpus is a corpus of children's language, a corpus of impaired and normal language development has been collected at Reading University, while the CHILDES database contains a large amount of impaired and normal child language in several languages.

Corpora and Cultural Studies
It is only recently that the role of a corpus in telling us about culture has really begun to be explored. After the completion of the LOB corpus of British English, one of the earliest pieces of work to be carried out was a comparison of its vocabulary with the vocabulary of the American Brown corpus (Hofland and Johansson 1982). This revealed interesting differences which went beyond the purely linguistic ones such as spelling (colour/color) or morphology (got/gotten). Leech and Fallon (1992) used the results of these earlier studies, along with KWIC concordances of the two corpora to check up on the senses in which words were being used. They then grouped the differences which were statistically significant into fifteen broad categories. The frequencies of concepts in these categories revealed differences between the two countries which were primarily cultural, not linguistic. For example, travel words were more frequent in American English than British English, perhaps suggestive of the larger size of the United States. Words in the domains of crime and the military were also more common in the American data, as was "violent crime" in the crime category, perhaps suggestive of the American "gun culture". In general, the findings seemed to suggest a picture of American culture at the time of the two corpora (1961) that was more macho and dynamic than British culture. Although this work is in its infancy and requires methodological refinement, it seems to be an interesting and promising

area of study, which could also integrate more closely work in language learning with that in national cultural studies.

Corpora and Social Psychology
Although linguists are the main users of corpora, they are not the sole users. Researchers in other fields which make use of language data have also recently taken an interest in the exploitation of corpus data - perhaps the most important of these have been social psychologists. Social psychologists require access to naturalistic data which cannot be reproduced in laboratory conditions (unlike many other psychology-related fields), while at the same time they are under pressure to quantify and test their theories rather than rely on qualitative data. This places them in a curious position.

One area of research in social psychology is that of how and why people attempt to explain things. Explanations (or attributions) are important to the psychologist because they reveal the ways in which people regard their environment. To obtain data for studying explanations researchers have relied on naturally occurring texts such as newspapers, diaries, company reports etc. However, these are written texts, and most everyday human interaction takes place through the medium of speech. To solve this problem Antaki and Naji (1987) used the London-Lund corpus (of spoken language) as a source of data for explanations in everyday conversation. They took 200,000 words of conversation and retrieved all instances of the commonest causal conjunction "because" (and its variant "cos"). An analysis of a pilot sample derived a classification scheme for the data, which was then used to classify all the explanations according to what was being explained, for example "actions of speaker or speaker's group", "general states of affairs" and so on. A frequency analysis of the explanation types in the corpus showed that explanations of general states of affairs were the most common type of explanation (33.8%) followed by actions of speaker and speaker's group (28.8%) and actions of others (17.7%). This refuted previous theories that the prototypical type of explanation is the explanation of a person's single action. Work such as Antaki and Naji shows clearly the potential of corpora to test and modify theory in subjects which require naturalistic quantifiable language data, and one may expect other social psychologists to make use of corpora in the future.

Conclusion
In this session we have seen how a number of areas of language study have benefited from exploiting corpus data. To summarise, the main advantages of corpora are:

Sampling and quantification. Because a corpus is sampled to maximally represent the population, any findings taken from the corpus can be generalised to the larger population. Hence quantification in corpus linguistics is more meaningful than other forms of linguistic quantification because it can tell us about a variety of language, not just that which is being analysed.

Ease of access. As all of the data collection has been dealt with by someone else, the researcher does not have to go through the issues of sampling, collection and encoding. The majority of corpora are readily available, either free or at low cost. Once the corpora have been obtained, it is usually easy to access the data within them, e.g. by using a concordance program.

Enriched data. Many corpora have already been enriched with additional linguistic information such as part-of-speech annotation, parsing and prosodic transcription. Hence data retrieval from annotated corpora can be easier and more specific than with unannotated data.

Naturalistic data. Corpus data is not always completely unmonitored in the sense that the people producing the spoken or written texts are unaware until after the fact that they are being asked to participate in the building of a corpus. But for the most part, the data are largely naturalistic, unmonitored and the product of real social contexts. Thus the corpus provides one of the most reliable sources of naturally occurring data that can be examined.

Glossary

common core hypothesis: The theory that all varieties of English have central fundamental properties in common with each other, which differ quantitatively rather than qualitatively.

dialect: The term "dialect" is more difficult to define, since dialects cannot be readily distinguished from languages on solely empirical grounds. However, "dialect" is most commonly used to mean sub-national linguistic variation which is geographically motivated, in comparison to "national variety". Therefore, Australian English might not be considered to be a dialect of English, while Scottish English can be regarded this way, as Scotland is a part of the United Kingdom. A smaller subset of Scottish English, such as that which is spoken in Glasgow, would almost certainly be termed a dialect.

elicited: Elicited data is data which is gained under non-naturalistic conditions, e.g. a laboratory experiment, or when subjects are asked to "role-play" a situation.

empiricism: An empiricist approach to language is dominated by the observation of naturally occurring data, typically through the medium of the corpus. For example, we may decide to determine whether sentence x is a valid sentence of language y by looking in a corpus of the language in question and gathering evidence for the grammaticality, or otherwise, of the sentence.

KWIC: Key Word In Context.

prosody: Prosody refers to all aspects of the sound system above the level of segmental sounds, e.g. stress, intonation and rhythm. The annotations in prosodically annotated corpora typically follow widely accepted descriptive frameworks for prosody such as that of O'Connor and Arnold (1961). Usually, only the most prominent intonations are annotated, rather than the intonation of every syllable.

transitive and intransitive: Transitive verbs can take an object, while intransitive verbs can never take a direct object.

scientific method: No theory of science is ever complete. Popper states that empirical theories have to have the property not only of being verified, but of being able to be

B.765 0. co-occurence patterns of words.807 ought 0.544 .) collocations: Collocations are characteristic. although are different in either meaning. This forms theories which have predictive power. the following table gives a cross-tabulation of modal verbs across 4 genres of text (labelled A. part-of-speech: A way of describing a lexical item in grammatical terms.g.026 Word can could 0.798 0. E.782 must 0. Genre A B C D 210 148 59 89 120 49 36 23 100 86 15 46 24 29 13 4 43 34 12 28 3 4 0 1 0 10 12 4 Modal Verb can could may might must ought shall intercorrelation matrix: This is calculated from a cross-tabulation (see above)and shows how statistically similar all pairs of variables are in their distributions across the various samples. must.717 0. or it can mean "to kick". homographs: Homographs are words which have the same spelling. the forms kicks. Science proceeds by speculation and hypothesis. this is just a table showing the frequencies for each variable across each sample. but actively seeks to make the claim that it represents how the processing is actually undertaken.186 might 0. e.g. might.118 0. rationalism: rationalist theories are based on the development of a theory of mind in the case of linguistics. cross-tabulation: Put simply. For example: "Christmas" may collocate with "tree".    falsified (the process of finding a rule by looking for exceptions of it). could. "boot" can mean an item of shoe-wear. "angel". and have as a fundamental goal cognitive plausibility. The table below shows the intercorrelations between can. For example. The aim is to develop a theory of human language processing. derivation or pronunciation. For example.544 1 may 0. C. PEARSON PRODUCT MOMENT CORRELATION COEFFICIENT can 1 could 0. may. kicked and kicking would all be reduced to the lexeme KICK.528 shall 0. and D). ought and shall taken from the table above. comparative adjective. 
lexeme: The head word form that one would look up for if one were looking for the word in the dictionary. past participle. singular common noun. These variants form the lemma of the lexeme KICK. and "presents".796 0.

816 0. The relationship between can and can is 1.765 must 0. as they are identical.528 0.554 0. normal distribution: A variable follows a normal distribution if it is continuous and if its frequency graph follows the characteristic.026 1 0.such as a person's height in cms).521 0. Knowledge of parameters is not necessary either. median and mode co-incide (see graph on the left).717 shall 0. These tests are generally easier to learn and apply. Type I and Type II errors: Although we can be confident that the results of a significance test are accurate. There are two ways that this can occur:   A Type I error occurs when we decide the difference is significant (due to factors other than chance) when in fact it is not.554 0. second that the data is measured on an interval scale (e.306 0.798 might 0. . The probability of this happening is the same as the significance level of the test. Some variables show a greater similarity in their distributions than others: for instance.032 0.   Parametric tests make certain assumptions about the data on which the test is performed. This is not so serious relatively.032 0.601 0. First.782 0.587 0. there is the assumption that the data is drawn from a normal distribution (see below).796 ought 0.601 0.521 1 0.118).118 0.798) than it does to shall (0.795 0.816 1 0.g. any interval between two measurements is meaningful . Non-parametric tests make no assumptions at all about the population from which the data is drawn.186 0. non-parametric test: All statistical tests of significance belong to one of two distinct groups parametric and non-parametric.637 0. can shows a greater similarity to may (0.078 1 The closer the score is to 1.795 1 0. bell-shaped form in which all the values of mean. the better the correlation between the two variables.637 0. A Type II error occurs when we decide that the difference is due to chance.may 0.078 0. when in fact it is not. 
This is the most serious type of error to make (equivalent to a judge finding an innocent suspect guilty).807 0.587 0. there is always a small chance that the decision made might be wrong.306 0. parametric tests make use of parameters such as the mean and standard deviation. Thirdly. symmetrical.
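As an illustration of the cross-tabulation and intercorrelation matrix entries above, the sketch below computes Pearson product moment coefficients for every pair of modal verbs, using only the four genre counts given in the cross-tabulation. The function names are hypothetical, and the coefficients it yields for this four-genre table should not be expected to match the printed matrix figure for figure; it is meant only to show how such a matrix is derived from a table of frequencies.

```python
from math import sqrt

# Frequencies copied from the cross-tabulation (genres A, B, C, D).
CROSS_TAB = {
    "can":   [210, 148, 59, 89],
    "could": [120, 49, 36, 23],
    "may":   [100, 86, 15, 46],
    "might": [24, 29, 13, 4],
    "must":  [43, 34, 12, 28],
    "ought": [3, 4, 0, 1],
    "shall": [0, 10, 12, 4],
}


def pearson(xs, ys):
    """Pearson product moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)


def intercorrelation_matrix(table):
    """Correlate every row of the table with every other row."""
    return {a: {b: pearson(table[a], table[b]) for b in table} for a in table}


if __name__ == "__main__":
    matrix = intercorrelation_matrix(CROSS_TAB)
    words = list(CROSS_TAB)
    print("Word   " + " ".join(f"{w:>6}" for w in words))
    for a in words:
        print(f"{a:<6} " + " ".join(f"{matrix[a][b]:6.3f}" for b in words))
```

The diagonal is always 1 (each word correlates perfectly with itself) and the matrix is symmetric, which is why only half of it needs to be read.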

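The KWIC (Key Word In Context) display mentioned in the glossary can be sketched in a few lines. This is a minimal illustration rather than the behaviour of any particular concordance program: the function name `kwic`, the `width` parameter and the sample sentence are all invented for the example.

```python
def kwic(tokens, keyword, width=4):
    """Return (left context, key word, right context) for each hit."""
    hits = []
    target = keyword.lower()
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    # Invented sentence, used only to show the layout.
    text = "I stayed in because it was raining and because the bus was late"
    for left, key, right in kwic(text.split(), "because"):
        # Right-align the left context so the key words line up in a column.
        print(f"{left:>25}  {key}  {right}")
```

Lining the key word up in a central column is what lets an analyst scan quickly for the senses and patterns in which a word is used.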
