
Collocation frequency as a readability factor

George R. S. Weir and Nikolaos K. Anagnostou

Department of Computer and Information Sciences
University of Strathclyde
Richmond Street, Glasgow, G1 1XH, U.K.

A readability measure that can reliably estimate the
difficulty level of sample texts has great potential in ESL
teaching. We argue for the inclusion of collocation
frequency as a generic factor in estimates of ESL text
readability, since the readability of any text will be
affected adversely by the presence of collocations.
Collocational word combinations are seldom amenable
to comprehension solely on the basis of acquaintance
with individual word components, so they present
particular difficulties for L2 learners.
Our paper
describes the measure of average collocation frequency,
how this is derived from a reference corpus and can be
applied as a factor in estimating textual readability. A
software tool, based upon this approach, is presently in
development.

Readability, collocation,
frequency, ESL.



Readability formulae work by using quantifiable
textual aspects, in order to estimate the ‘difficulty’
inherent in that text. Commonly, the key factors
considered in readability measures are word length
and sentence length, or variations on these.
These aspects are founded in
readability studies (Dale and Chall, 1948).
Since the introduction of computer-based
textual analysis, newer factors such as word
frequency can be included in readability formulae.
The frequency of words, as derived from large
reference corpora, reflects a viable factor in
estimating readability since more common words
are likely to be familiar to more readers. Thereby,
a text composed mainly of highly common words is
likely to be more comprehensible. In spite of the plausibility of
including word frequency in readability measures,
there are few reported examples (cf. Weir and
Ritchie, 2007; Stenner et al., 1988).
The logic underlying a focus on word frequency
as a factor affecting readability also extends to
frequency of word sequences. For this reason, our
research activity on readability considers the impact
of word sequences. This work has two strands.
Firstly, the impact on text readability from the
presence of n-grams (word combinations of length
n) will also reflect the commonality of such word
sequences, i.e., the more frequently any n-gram
appears in general usage, the more likely that it will
be familiar to a reader and thereby have less impact
upon the difficulty of a text than a sequence of
similar length but with lower general frequency of
occurrence.
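The idea of n-gram frequency can be illustrated with a minimal sketch. It assumes lower-cased, whitespace-separated tokens, which is a simplification; the function names `ngrams` and `ngram_counts` and the sample sentence are illustrative only and are not taken from the authors' software.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Count n-grams in a text, using naive whitespace tokenisation."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


# Hypothetical sample sentence, used only for illustration.
sample = "he put up with the noise and she put up with the heat"
counts = ngram_counts(sample, 3)

# The most frequent trigrams, with their raw counts.
for gram, count in counts.most_common(2):
    print(" ".join(gram), count)
```

Applied to a large reference corpus rather than a single sentence, counts of this kind give the general frequency against which a word sequence in a sample text could be compared.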
The second strand of our investigation of word
sequence influence on readability centers on
collocations. We follow the sense of Manning and
Schutze (1999), who describe collocations as ‘any
turn of phrase or accepted usage where somehow
the whole is perceived to have an existence beyond
the sum of its parts’ (p. 29). Choueka (1988)
offers a similar description: ‘[a collocation is] a
sequence of two or more consecutive words, that
has characteristics of a syntactic and semantic unit,
and whose exact and unambiguous meaning cannot
be derived directly from the meaning or connotation
of its components.’
In the context of our research, the significant
characteristic of a collocation is that its meaning is
not simply derivable from the meaning of its
constituent words. Since their composite meaning
cannot be derived from an understanding of their
components, from a readability perspective,
collocations are semantically opaque.
Because collocations have this complex
semantics, their sophistication presents particular
difficulties for language learners.
Language learners, particularly L2 learners who are
non-native to the language culture, lack exposure to
the contexts and usages that imbue collocations
with meaning.
This complexity in collocations cannot be captured at the tractable levels of sentences, words or syllables and thereby constitutes a layer of semantics that is not currently considered in existing readability measures. If we can accommodate a plausible measure of collocational influence upon the readability of a text, then we should be able to ascertain a more accurate estimate for the semantic difficulty of that text. In what follows, we propose a method for gauging the degree of collocational influence on any sample English text.

1 Estimating collocational impact

Our procedure has three steps: quantification, scaling and aggregation. In the first step, we quantify the number and frequency of occurrence for collocations in a sample text. Next, the scaling step accounts for the effect of each particular collocation. This is derived by reference to an ‘external’ measure of ‘likely familiarity’ for collocations and is factored by the frequency of occurrence of the collocation in the sample text. Finally, we aggregate the individual measures of collocational influence in order to arrive at an estimate for overall impact of collocations upon the sample text.

Our ‘external measure’ of likely familiarity for a collocation reflects the frequency of that collocation in a reference corpus. We assume that any collocation (indeed, any word sequence) that occurs with high frequency in a plausible reference corpus is more likely to be familiar to the language user than another collocation that has lower frequency of occurrence. This approach to measuring the collocational impact upon a text is similar to that used for gauging the impact of individual words upon readability (Campbell & Weir, 2007). In both cases, the key requirement is a frequency list derived from a reference such as the British National Corpus (Burnard, 2000).

Collocation frequency lists are similar in nature to word frequency lists, in which the two main fields are the word type and its frequency in the corpus. Corpus frequency is given either as a percentage (relative frequency) or as a number of occurrences (absolute frequency). In similar vein, collocation frequency lists measure the frequency of collocations rather than individual words. We propose that the frequency of a particular collocation in the reference collocation frequency list is an indicator of its semantic opacity, or simply put, its comprehensibility. Thereby, higher frequency of occurrence for a collocation signifies a greater likelihood that it will be understood.

In order to derive such measures, we create a collocation frequency list from a reference corpus such as the British National Corpus (or, for initial proof of concept, the BNC Baby). Once the collocations have been identified, we can readily count their frequency of occurrence in the reference corpus and thereby populate our reference collocation frequency list. As such, on the assumption that we can generate a reference frequency list for collocations, we have a plausible means of automating the measurement of collocational impact for sample English texts.

2 Collocation extraction

The use of a reference collocation frequency list is central to the procedure we propose for gauging collocational impact. Ideally, this frequency list is derived from a large representative reference corpus. Since such corpora are not available with collocations pre-identified, a requirement for our frequency list is a means of collocation extraction or identification.

2.1 Association measures

Association measures are the criteria employed to decide whether any specific sequence of words qualifies as a collocation. The most common association measures used in collocation extraction are T-score, Mutual information, log-likelihood, Dice coefficient and Z-score (cf. Evert, 2004, p. 21; McEnery and Wilson, 2001, pp. 86-87). In keeping with Wermter and Hahn (2004), such measures may be classified as:

• frequency-based measures (e.g., based on absolute and relative co-occurrence frequencies);
• information-theoretic measures (e.g., mutual information, entropy);
• statistical measures (e.g., chi-square, t-test, log-likelihood, Dice’s coefficient).

Based upon such techniques, a range of software tools is available that aims to identify a list of the collocations in a given text. Such software tools are further detailed in Anagnostou & Weir (2007). For the most part, these approaches are more or less ‘noisy’, in the sense that they identify as likely collocations some multi-word units that would not be so considered by English language users. Despite the inherent noisiness in current collocation extraction techniques, we have selected the ‘Collocate’ program (Barlow, 2004) as a basis for our proof of concept approach to gauging collocational impact upon readability.
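To make the notion of an association measure more concrete, the following sketch computes two of the measures named above, mutual information (in its pointwise form) and the T-score, for a single word pair. It uses the usual textbook formulations and is not a description of how Collocate or any other tool implements them. The function names and the counts in the example are hypothetical.

```python
import math


def pointwise_mutual_information(pair_count, x_count, y_count, corpus_size):
    """log2 of the observed pair probability over the probability expected
    if the two words occurred independently of each other."""
    return math.log2((pair_count * corpus_size) / (x_count * y_count))


def t_score(pair_count, x_count, y_count, corpus_size):
    """Difference between observed and expected pair counts, scaled by the
    square root of the observed count."""
    expected = x_count * y_count / corpus_size
    return (pair_count - expected) / math.sqrt(pair_count)


# Hypothetical counts: the pair occurs 8 times, its two words 16 and 32
# times, in a corpus of 1024 tokens.
pmi = pointwise_mutual_information(8, 16, 32, 1024)
t = t_score(8, 16, 32, 1024)
print(f"PMI = {pmi:.2f}, T-score = {t:.2f}")
```

A word pair would be accepted as a collocation candidate when its score exceeds some chosen threshold. Both the choice of measure and the threshold influence how ‘noisy’ the resulting list is.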

Collocate provides two main functions for collocation extraction, called ‘Extract’ and ‘Full Extract’. The first of these applies filters such as word/phrase or word/tag combinations, along with regular expressions. While ‘Extract’ is geared towards targeted searches of collocations, ‘Full Extract’ allows for the comprehensive extraction of n-grams and collocation candidates from a corpus and is better suited to general collocation identification and, thereby, to our requirements. Consequently, this approach was used for producing our reference collocation frequency list and was also used to identify the collocations present in our sample texts.

Armed with our reference collocation frequency list and a method for identifying the collocations present in any sample text, we are able to perform the scaling step in our three-part measurement process. This scaling takes the reference frequency of a collocation instance and multiplies this by the number of occurrences of this collocation in the sample text, divided by the total number of all collocations in the sample text, i.e., $f_i \cdot n_i / n_c$, where $f_i$ is the reference frequency for collocation instance $i$, $n_i$ is the absolute number of occurrences of the collocation instance $i$ in the sample text and $n_c$ is the total number of collocation occurrences in the sample text. This provides a collocational impact factor for each individual collocation in the sample text.

Consider the following scenario. The collocation ‘manna from heaven’ appears once in the BNC Baby, thus it has a frequency of 0.2 occurrences per million words (pmw), using the BNC Baby (approximately five million words) as the reference corpus. This is a relatively rare collocation, therefore it is rather difficult and its impact on the semantic opacity of a passage that includes it will be significant. We also take the view that the semantic impact of the collocation is increased if it appears more than once in a text. In other words, the higher the frequency of occurrence of a difficult collocation in a sample text, the harder the text will be to understand. Any measure of semantic difficulty based upon collocations needs to accommodate this fact, and this is the role that the ‘relative’ weight of a collocation plays in the ACF. (Of course, we could consider the repeated appearance of a collocation as reduced in its semantic impact. This would accommodate the idea that repeated use may assist the reader to interpret the meaning of that collocation.)

3 Average Collocation Frequency

Having produced a measure of impact upon the sample text for each individual collocation instance, we proceed to the final (aggregation) step in our measurement process. Our aggregation step allows us to combine the impact measures of individual collocations in a sample text, in order to derive a single metric for the whole text. This metric we term the ‘average collocation frequency’ (ACF), and this is given by the formula:

$$\mathrm{ACF} = \frac{1}{n_c} \sum_{i=1}^{m} f_i \cdot n_i$$

Where:
ACF = average collocation frequency
$n_c$ = total collocation occurrences in sample text
$m$ = number of different collocations in sample text
$f_i$ = frequency of collocation $i$ in reference corpus
$n_i$ = number of instances of collocation $i$ in sample text

The following example helps to illustrate how the ACF is calculated for any sample text. In this instance, we have a text containing three different types ($m = 3$) of collocations.

Collocations found in the sample text    n_i        f_i
put up with                              n_1 = 2    f_1 = 60 pmw
kick the bucket                          n_2 = 1    f_2 = 22 pmw
pull yourself together                   n_3 = 4    f_3 = 46 pmw

n_i = number of occurrences of collocation type i in sample text
f_i = frequency of collocation type i in reference corpus
m = 3 (number of different types of collocations in sample text)
n_c = n_1 + n_2 + n_3 = 2 + 1 + 4 = 7 (total number of collocation occurrences in sample text)

Figure 1: Factoring collocations

Figure 1 provides the data required in order to proceed to the calculation of the ACF (Figure 2). In Figure 1, column $n_i$ indicates how many times each collocation appears in the sample text. By adding the cells in this column we can calculate $n_c$, or the total number of collocation occurrences in the sample text. The cells in column $f_i$ are populated with the frequency of each collocation in the reference corpus, in this case measured as occurrences per million words (pmw).

$$\mathrm{ACF} = \frac{1}{7} \sum_{i=1}^{3} f_i \cdot n_i = \frac{1}{7} \cdot 60 \cdot 2 + \frac{1}{7} \cdot 22 \cdot 1 + \frac{1}{7} \cdot 46 \cdot 4 = \frac{2}{7} \cdot 60 + \frac{1}{7} \cdot 22 + \frac{4}{7} \cdot 46 = \frac{120 + 22 + 184}{7} \approx 46.6 \text{ pmw}$$

In the term $\frac{2}{7} \cdot 60$, the value 60 is the frequency of collocation “put up with” in the language and $\frac{2}{7}$ is the relative weight of collocation “put up with” in the sample text. The result, 46.6 pmw, is the frequency of a hypothetical collocation, which would have the same effect on the semantic difficulty of the text as all the collocations in the considered text.

Figure 2: Deriving a value for the ACF

As illustrated in Figure 2, the ACF acts like a ‘replacement’ or hypothetical collocation, which, should it substitute all the collocations in the sample text, would have the same impact on semantic difficulty.
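The calculation in Figures 1 and 2 can be reproduced with a short sketch. This is not the ACFCalc tool; it is a minimal illustration of the ACF formula. It assumes the collocations in the sample text have already been identified and paired with their reference frequencies. The names `per_million`, `average_collocation_frequency` and `figure_1` are hypothetical.

```python
def per_million(occurrences, corpus_size):
    """Convert an absolute count into occurrences per million words (pmw)."""
    return occurrences * 1_000_000 / corpus_size


def average_collocation_frequency(collocations):
    """Compute ACF = (1 / n_c) * sum(f_i * n_i).

    `collocations` maps each collocation type to a pair (f_i, n_i):
    f_i is its frequency in the reference corpus and n_i is its number
    of occurrences in the sample text.
    """
    n_c = sum(n_i for _, n_i in collocations.values())
    if n_c == 0:
        raise ValueError("the sample text contains no collocations")
    return sum(f_i * n_i for f_i, n_i in collocations.values()) / n_c


# Data from Figure 1: (reference frequency in pmw, occurrences in sample).
figure_1 = {
    "put up with": (60, 2),
    "kick the bucket": (22, 1),
    "pull yourself together": (46, 4),
}

acf = average_collocation_frequency(figure_1)
print(f"ACF = {acf:.1f} pmw")  # ACF = 46.6 pmw

# 'manna from heaven': one occurrence in roughly five million words.
print(per_million(1, 5_000_000))  # 0.2
```

Because the result is a weighted mean of the reference frequencies, it carries whatever unit those frequencies use. Here that unit is pmw.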

In this example, the ACF is measured in occurrences per million words. The ACF unit of measurement is always the unit of the collocation frequencies in the reference collocation frequency list. For instance, if the collocation frequencies were percentages, the ACF would return a percentage. Whatever unit of measure is applied in the ACF calculation, we believe that the result is a plausible estimate of the aggregate impact of the collocations present in the sample text.

4 Conclusions and further work

Given the inherent difficulty that collocations present to English language learners, a method that can quantify such impact in sample texts has considerable potential as a teaching aid. Furthermore, such a measure can serve as a semantic factor in estimating the readability of texts and thereby supplement the word and sentence-based factors conventionally employed in such techniques. We propose the Average Collocation Frequency as filling such roles and have argued for its plausibility as a gauge for the semantic impact of collocations in any sample English text.

A software tool is under development for use in conjunction with a collocation extraction facility (such as Collocate). This new tool will generate ACF measurements based upon the approach and the factors described in this paper. This requires a reference collocation frequency list and a sample text collocation frequency list. From these inputs, the average collocation frequency for the sample text is derived. The prototype of this system, entitled ACFCalc, is illustrated in Figure 3.

Figure 3: ACFCalc prototype

This prototype is described further in Anagnostou & Weir (2008). We anticipate that this software facility will be made generally available to the research community in due course.

References

Anagnostou, N. K. and Weir, G. R. S. (2007). Review of software applications for deriving collocations, in G. R. S. Weir & T. Ozasa (Eds), Texts, Textbooks and Readability, University of Strathclyde Publishing, Glasgow.

Anagnostou, N. K. and Weir, G. R. S. (2008). The ACFCalc Tool, in Proceedings of ICTATLL 2008, Sri Lanka (forthcoming).

Barlow, M. (2004). Collocate 1.0: Locating collocations and terminology. Collocate User Manual.

Burnard, L. (2000). Users Reference Guide for the British National Corpus. Technical report, Oxford University Computing Services.

Campbell and Weir, G. R. S. (2007). Matching Readers to Texts, in G. R. S. Weir & T. Ozasa (Eds), Texts, Textbooks and Readability, University of Strathclyde Publishing, Glasgow.

Choueka, Y. (1988). Looking for needles in a haystack. Proceedings of RIAO ’88, pp. 609–623.

Dale, E. and Chall, J. S. (1948). A formula for predicting readability. Educational research bulletin, 27, pp. 11-20, 37-54.

Evert, S. (2004). The Statistics of Word Cooccurrences (Word Pairs and Collocations). Ph.D. dissertation, Universitat Stuttgart.

Manning, C. D. and Schutze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.

McEnery, T. and Wilson, A. (2001). Corpus Linguistics. Edinburgh: Edinburgh University Press.

Stenner, A. J., Horabin, I., Smith, D. R. and Smith, M. (1988). The Lexile Framework. Durham, NC: Metametrics, Inc.

Weir, G. R. S. and Ritchie, C. (2007). Estimating Readability with the Strathclyde Readability Measure, in G. R. S. Weir & T. Ozasa (Eds), Texts, Textbooks and Readability, University of Strathclyde Publishing, Glasgow.

Wermter, J. and Hahn, U. (2004). Collocation extraction based on modifiability statistics. Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland.