

AN AUTOMATIC CHINESE COLLOCATION EXTRACTION ALGORITHM
BASED ON LEXICAL STATISTICS
Ruifeng Xu, Qin Lu, and Yin Li

Department of Computing, The Hong Kong Polytechnic University,


Hung Hom, Kowloon, Hong Kong
{csrfxu, csluqin, csyinli}@comp.polyu.edu.hk

ABSTRACT

This paper presents an automatic Chinese collocation extraction system using lexical statistics and syntactical knowledge. The system extracts collocations from a manually segmented and tagged Chinese news corpus in three stages. First, BI-directional BI-Gram statistical measures, including BI-directional strength and spread and the χ² test value, are employed to extract candidate two-word pairs. These candidate word pairs are then used to extract high-frequency multi-word collocations from their contexts. In the third stage, precision is further improved by using syntactical knowledge of collocation patterns between content words to eliminate pseudo collocations. In a preliminary experiment on 30 selected headwords, this three-stage system achieves a 73% precision rate, a substantial improvement over the 61% achieved by an algorithm we developed earlier, itself an improved version of Smadja's Xtract system, which reaches 53% accuracy.

Keywords: Chinese collocation, information extraction, statistical models

1. INTRODUCTION

Collocation is a lexical phenomenon in which two or more words are used together to convey a specific semantic meaning. For example, we say "warm greeting" but not "hot greeting", and "broad daylight" but not "bright daylight". Similarly, in Chinese, … are synonyms, yet we say … rather than …, and … rather than …. In short, collocation refers to the co-occurrence of words and phrases that have a fixed usage and meaning but that follow no general syntactic or semantic rules, resting instead on natural habitual language [Manning, 1999].

Definitions of collocations are given by linguists. In this work we follow Benson's widely adopted definition [Benson, 1990]: a collocation is an arbitrary and recurrent word combination.

Choueka first attempted to extract collocations of two or more adjacent words that frequently appear together, using an 11-million-word corpus taken from the New York Times [Choueka, 1988]. Church and Hanks [Church, 1990] redefined the collocation as a pair of correlated words. They employed mutual information to extract pairs of words that tend to co-occur within a fixed-size window of five words, allowing both adjacent and distant pairs to be extracted. However, multi-word collocations were not extracted until Smadja [Smadja, 1993] developed Xtract, which extracts adjacent and distant, two-word and multi-word collocations. Xtract is a three-stage system. In the first stage, the strength and spread statistics (introduced in Section 3.2) are calculated for words co-occurring with a given headword in its −5 to +5 word context window, and the two-word pairs with significant strength and spread values are extracted as collocation candidates. In the second stage, high-frequency n-grams are extracted from the candidate list and treated as n-gram collocation candidates. In the last stage, parsing information is employed to weed out pseudo collocation candidates. Xtract remains the most influential and best-performing approach to English collocation extraction.

Works on Chinese collocation extraction were reported in [Sun, 1997; Sun, 1998; Wei, 2002], including work done by our team [Lu, 2003]. The main structure and the statistical measures used in these systems are quite similar to those of Xtract.
However, their performance is not ideal. When we directly applied the first two stages of Xtract to a Chinese corpus, the precision rate of two-word collocations was only around 50%. After we optimized the parameters of the algorithm, precision increased to 53%; taking relative proximities into account improved precision by another 8%.

As we sought, in the first two stages, to improve the performance of statistics-based Chinese collocation extraction, we noticed that the frequency of Chinese word co-occurrence does not follow the normal distribution assumed by Smadja and by most of the reported work. Collocation extraction, especially in Chinese, operates in Keyword in Context (KWIC) mode: once a headword is selected, the system searches the whole corpus to find all its co-words in a fixed context window. This over-emphasizes the frequency of the headword, with the result that the system cannot find headword-co-word pairs whose frequencies fall into a "high-low" or "low-high" pattern. For example, given a headword … and a co-word …, where the former is a high-frequency word and the latter a very low-frequency word, the collocation cannot be identified: the large frequency difference leads to over-dependence on the high-frequency word.

Based on these observations, we propose to extract two-word collocations using BI-directional BI-Gram statistics. The improved system still has three stages. In the first stage, bi-gram word pairs are extracted as collocation candidates using bi-directional bi-gram statistics: the average frequency and standard deviation of the co-words are optimized to better describe the actual distribution, the strength and spread are both extended to analyze the co-occurrence of a headword and its co-words bi-directionally, and the correlation of word pairs is then evaluated with the χ² test, a more effective measure for data that do not fit a normal distribution well. In the second stage, following our previously reported method, we retrieve high-frequency strings containing the extracted two-word collocation candidates as multi-word collocation candidates. In the third stage, the syntactic collocation patterns between content words introduced by [Lin, 1993] are applied to weed out candidates that do not fit Chinese syntactical patterns. Since function words are less important in Chinese than in English, only candidates with at least two content words (nouns, verbs, adjectives and adverbs) are kept in the final result.

In our preliminary experiment, 30 words (nouns, verbs and adjectives across a wide frequency range) are selected as headwords for testing. Comparing the automatically extracted collocations against those manually extracted by professional linguists, our improved system achieves a 75% precision rate.

The rest of the paper is organized as follows. Section 2 introduces the construction of the word BI-Gram co-occurrence database and shows the distribution of word pairs in Chinese. Section 3 describes our algorithm for extracting two-word collocations with word BI-Gram statistics. Section 4 briefly introduces multi-word collocation extraction. Section 5 explains the use of syntactical collocation patterns to remove pseudo collocation candidates. Section 6 briefly evaluates the algorithms using a simple example. Section 7 concludes the paper.
2. WORD BI-GRAM DATABASE

We used the annotated Peking University corpus of the People's Daily from January 1998 to June 1998 as our testing corpus. Following the lead of most reported works, we limited the observation context window to a range of [−5, 5] words around a headword. Each word pair occurrence in the corpus is recorded as a 3-tuple (wi, wj, d), where wi and wj are the two co-occurring words and d is the distance between them. Since a word pair can appear at different distances, the system collects all occurrences within the observation window, so the full record for a word pair takes the form shown below:

(wi, wj, t−5, t−4, t−3, t−2, t−1, t, t+1, t+2, t+3, t+4, t+5)

where t−5 to t+5 are the frequency counts of wj appearing in positions −5 to +5 relative to the headword wi, and t is the total frequency count of the headword wi. To speed up the search process, we built a database of all word pair co-occurrences and indexed it by headword and co-word. This way, co-occurrence information can be read directly from the database without searching the corpus.

Working with the co-occurrence database has an obvious advantage over the traditional KWIC mode: the database supports global statistical analysis and optimization, which the KWIC mode does not.
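To make the record structure concrete, here is a minimal sketch of how such records could be accumulated; it assumes the corpus is available as an iterable of segmented sentences (lists of words), and all names are ours rather than the paper's:

```python
from collections import defaultdict

WINDOW = 5  # observation context window of [-5, 5] words around a headword

def build_cooccurrence_db(sentences):
    """Accumulate, for each (headword, co-word) pair, the frequency of the
    co-word at every distance d in [-5, 5], plus each headword's total count."""
    # db[(wi, wj)][d + WINDOW] = count of wj at distance d from wi
    db = defaultdict(lambda: [0] * (2 * WINDOW + 1))
    headword_freq = defaultdict(int)
    for words in sentences:
        for i, wi in enumerate(words):
            headword_freq[wi] += 1
            lo, hi = max(0, i - WINDOW), min(len(words), i + WINDOW + 1)
            for j in range(lo, hi):
                if j != i:
                    db[(wi, words[j])][j - i + WINDOW] += 1
    return db, headword_freq
```

Because the counts are indexed by the (headword, co-word) pair, the statistics of Section 3 can be computed directly from this table instead of rescanning the corpus, which is the advantage over KWIC noted above.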
The distribution of word co-occurrences with respect to testing corpus size is represented by the solid line in Figure 1. The test corpus consists of 5,802,864 words, yielding 32,643,684 two-word pairs, of which 8,330,460 are distinct. Note that the number of word co-occurrences increases rapidly over the first 30% of the corpus, more slowly in the 30%-70% range, and more slowly still beyond that. Because the abundant numerals, personal names and place names in the corpus lead to a large number of sparse and useless word pairs, we substituted them with so-called class words. Consequently, the co-occurrence frequency grows much more flatly and the total number of distinct word pairs is reduced to 71,020,431; this is represented by the dotted line in Figure 1. Our system retains both the original and the substituted word co-occurrence data.

Figure 1. Co-occurrence frequency vs. corpus size

Figure 2. Frequency distribution of co-words

Figure 2 shows the frequency distribution of co-words with respect to their headwords. The X-axis enumerates the different co-words and the Y-axis shows the corresponding frequency, given as a relative percentage of co-occurrences rather than an absolute count. From Figure 2 we find that, on average, only the first 5-10% of co-words have high frequency; after that, the frequency of subsequent co-words decreases quickly, leaving a large number of sparse co-occurrence cases. It is clear that the distribution of word co-occurrence does not fit the normal distribution. This is an important reason to modify existing collocation extraction algorithms, whose measures were designed for an approximately normal distribution.

3. COLLOCATION EXTRACTION USING WORD BI-GRAM STATISTICS

3.1 Average frequency

In Xtract, the average frequency and the variance of the co-words of a given headword are used to calculate the strength and spread that identify word pairs with significant values. Obviously, the value of the average frequency strongly influences the threshold for collocation extraction. On closer examination of the co-word frequencies, we want to identify two important points of the distribution: the drop-down spot SA, a point after which the curve drops significantly, changing its shape; and the flat-out spot SB, a point after which the co-occurrence frequency is considered sparse and statistically insignificant.
Suppose there exists an i-th co-word such that

$$f(i) \ge 4\bar{f} \quad\text{and}\quad \frac{f(i-1) - f(i)}{f(i) - f(i+1)} \ge R, \qquad (3.1.1)$$

where $\bar{f}$ is the average co-word frequency and R is a threshold experimentally set to 5. Then i is taken as the drop-down spot SA, and the distribution is considered to have i co-words with large wild values. If no such spot is found, the first co-word is treated as SA.

To reduce the influence of the long range of sparse co-words on the average frequency, we sum the co-word frequencies until we find a j such that the sum reaches 95% of the total frequency; this j is taken as the flat-out spot SB.

The revised average frequency $\bar{f}_{new}$ is then calculated as

$$\bar{f}_{new} = \frac{\sum_{i=S_A}^{S_B} f(i)}{S_B - S_A}. \qquad (3.1.2)$$

Using the new average frequency, the standard deviation is recomputed as

$$\sigma_{new} = \sqrt{\frac{1}{S_B - S_A}\sum_{i=S_A}^{S_B} \left(f(i) - \bar{f}_{new}\right)^2}. \qquad (3.1.3)$$

$\bar{f}_{new}$ and $\sigma_{new}$ are expected to better reflect the characteristics of the distribution of co-words.
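As an illustration, the following is a minimal sketch of this adjustment; it is our own code, assuming `freqs` holds the co-word frequencies of one headword sorted in descending order:

```python
import math

def revised_mean_std(freqs, R=5.0):
    """Compute f_new and sigma_new over the co-words between the drop-down
    spot S_A and the flat-out spot S_B (equations 3.1.1-3.1.3)."""
    mean = sum(freqs) / len(freqs)
    # Drop-down spot S_A: first i with f(i) >= 4*mean and a sharp relative drop.
    s_a = 0
    for i in range(1, len(freqs) - 1):
        step = freqs[i] - freqs[i + 1]
        if freqs[i] >= 4 * mean and step > 0 and (freqs[i - 1] - freqs[i]) / step >= R:
            s_a = i
            break
    # Flat-out spot S_B: smallest j whose cumulative sum reaches 95% of the total.
    total, cum, s_b = sum(freqs), 0.0, len(freqs)
    for j, f in enumerate(freqs):
        cum += f
        if cum >= 0.95 * total:
            s_b = j + 1
            break
    s_b = max(s_b, s_a + 1)  # guard against an empty [S_A, S_B) range
    kept = freqs[s_a:s_b]
    f_new = sum(kept) / (s_b - s_a)
    sigma_new = math.sqrt(sum((f - f_new) ** 2 for f in kept) / (s_b - s_a))
    return f_new, sigma_new
```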
3.2 BI-directional BI-Gram statistics

This step is a key component of the whole collocation extraction process, because the remaining extraction steps are built on its results.

In Xtract, for a given headword w, the strength of a co-word wi is defined as

$$k_i = \frac{f_i - \bar{f}}{\sigma}. \qquad (3.2.1)$$

The strength indicates how strongly the two words are correlated: a greater strength means a stronger correlation between w and wi.

Further, let $f_{i,j}$ be the frequency with which wi co-occurs with w at distance j, where −5 ≤ j ≤ 5. The spread, which characterizes the positions wi takes around w, is defined as

$$U_i = \frac{\sum_{j} (f_{i,j} - \bar{f}_i)^2}{10}, \quad\text{where}\quad \bar{f}_i = \frac{1}{10}\sum_{j=-5}^{5} f_{i,j}. \qquad (3.2.2)$$

The larger the spread, the more the co-word concentrates in particular positions of the window rather than following a flat distribution.

In Xtract, Smadja uses the following three conditions to identify collocation candidates:

$$I_1:\; k_i = \frac{f_i - \bar{f}}{\sigma} \ge k_0, \qquad (3.2.3)$$

$$I_2:\; U_i \ge U_0, \qquad (3.2.4)$$

$$I_3:\; f_{i,j} \ge \bar{f}_i + k_1\sqrt{U_i}, \qquad (3.2.5)$$

where the threshold parameters $k_0$, $U_0$ and $k_1$ take the values 1, 10 and 1, respectively.

Because of its inability, discussed in Section 1, to handle the "high-low" and "low-high" problems, we modified the candidate selection conditions in several ways. First, we substitute the new average frequency and standard deviation for the original ones. Second, our strength and spread not only take into account wi in its role as a co-word, but also reverse the roles of headword and co-word so that the reversed-role data are considered as well; we therefore call our measurement bi-directional. Consequently, we extended the strength to

$$k_{new} = 0.5\,\frac{f(w_a w_b) - \bar{f}(w_a)}{\sigma(w_a)} + 0.5\,\frac{f(w_a w_b) - \bar{f}(w_b)}{\sigma(w_b)} \qquad (3.2.6)$$

and the spread to

$$U_{new} = 0.5\,\frac{\sum_j \left(f_j(w_a w_b) - \bar{f}_a\right)^2}{10} + 0.5\,\frac{\sum_j \left(f_j(w_a w_b) - \bar{f}_b\right)^2}{10}. \qquad (3.2.7)$$

The word pairs that satisfy $k_{new} > k_0$ and $U_{new} > U_0$ and that pass the following χ² test are extracted as two-word collocation candidates.
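As a concrete reading of the two extended measures, here is a short sketch in our own notation; `f_ab` is the pair's total co-occurrence frequency, the revised means and deviations come from the routine sketched in Section 3.1, and the ten positional counts are read from the word BI-Gram database:

```python
def bidirectional_strength(f_ab, f_new_a, sigma_a, f_new_b, sigma_b):
    """Extended strength (3.2.6): average the z-score of the pair frequency
    against the co-word statistics of both words, so that neither the
    headword's nor the co-word's frequency dominates the decision."""
    return 0.5 * (f_ab - f_new_a) / sigma_a + 0.5 * (f_ab - f_new_b) / sigma_b

def bidirectional_spread(pos_counts_a, pos_counts_b, f_bar_a, f_bar_b):
    """Extended spread (3.2.7): average the positional variance of the pair
    as seen from each word in turn; pos_counts_* are the ten positional
    counts (d = -5..-1, +1..+5) recorded for the pair in each direction."""
    var_a = sum((c - f_bar_a) ** 2 for c in pos_counts_a) / 10.0
    var_b = sum((c - f_bar_b) ** 2 for c in pos_counts_b) / 10.0
    return 0.5 * var_a + 0.5 * var_b
```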

3.3 The χ² test

Most reported works on collocation extraction, including our previous system, assume that the probabilities of word co-occurrence are approximately normally distributed. However, this assumption has been shown not to hold for English [Church, 1993], and according to our word co-occurrence statistics above, it does not hold for Chinese either.

To overcome this problem, we adjust the frequency means, as described above, to reduce the influence of wild values and sparse data. Furthermore, the χ² (chi-square) test, which does not assume normally distributed probabilities, is applied to evaluate each collocation candidate by comparing its observed frequencies with the frequencies expected under independence. If the difference is clearly larger than a threshold, the words in the candidate are far from independent and are thus highly associated.

The χ² statistic summarizes the differences between observed and expected frequencies as

$$\chi^2 = \sum_{i,j}\frac{(T_{ij} - E_{ij})^2}{E_{ij}}, \qquad (3.3.1)$$

where $T_{ij}$ is the observed count for a cell of the contingency table and $E_{ij}$ is the count expected under independence.

In our system we apply the χ² test only to the evaluation of BI-Gram collocation candidates. For words $w_a$ and $w_b$, suppose $w_a$ appears $t_a$ times and $w_b$ appears $t_b$ times in an N-word corpus, and that $w_a$ and $w_b$ co-occur $t_{ab}$ times. The expected frequency under independence is

$$E_{ab} = \frac{t_a}{N}\cdot\frac{t_b}{N}\cdot N = \frac{t_a t_b}{N}. \qquad (3.3.2)$$

The χ² test for the BI-Gram $w_a$, $w_b$ then reduces to

$$\chi^2 = \frac{N\left(t_{ab}\,t_{\bar{a}\bar{b}} - t_{a\bar{b}}\,t_{\bar{a}b}\right)^2}{t_a\,t_b\,(N - t_a)(N - t_b)}, \qquad (3.3.3)$$

where $t_{a\bar{b}}$, $t_{\bar{a}b}$ and $t_{\bar{a}\bar{b}}$ are the counts of the remaining cells of the 2×2 contingency table ($w_a$ without $w_b$, $w_b$ without $w_a$, and neither word).

Since a χ² value of 3.841 corresponds to the case that $w_a$ and $w_b$ are dependent at the 95% confidence level, this value is selected as the threshold: if the χ² value of a word pair exceeds 3.841, the two words tend to be dependent, with a larger value indicating a stronger association.

At this stage, the word pairs whose frequency and position distributions satisfy the extended strength, the extended spread, and the χ² test are extracted as candidate two-word collocations.
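A minimal sketch of this test in code form (our own naming; the three observed counts and the corpus size N come from the word BI-Gram database):

```python
def chi_square_bigram(t_a, t_b, t_ab, N, threshold=3.841):
    """2x2 chi-square test of independence for a word pair (equation 3.3.3).
    Returns (chi2, passed); passed is True when the pair is judged dependent,
    i.e., it survives as a collocation candidate."""
    t_a_nb = t_a - t_ab             # wa present, wb absent
    t_na_b = t_b - t_ab             # wb present, wa absent
    t_na_nb = N - t_a - t_b + t_ab  # both absent
    chi2 = N * (t_ab * t_na_nb - t_a_nb * t_na_b) ** 2 / (
        t_a * t_b * (N - t_a) * (N - t_b))
    return chi2, chi2 > threshold
```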
4. MULTI-WORD COLLOCATION EXTRACTION

Based on the extracted two-word BI-Grams, multi-word collocation extraction is simple; here we essentially follow Xtract's algorithm directly. The process is as follows. Given a pair of words w and wi and an integer d specifying the distance between them, (w, wi, d), all the sentences containing the pair in that relative position are produced. Since we are interested only in relative frequencies, we compute only the first moment of the frequency distributions: for each possible relative distance from w, we keep only the words whose probability of occupying that position, $f(w_i)/N_x$, exceeds a given threshold T, set to 0.75. Finally, the combination of all words satisfying this requirement is taken as a multi-word collocation.
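To illustrate the positional-probability step, here is a short sketch under our own conventions: `concordances` holds equal-width context windows around the headword, and T = 0.75 as above:

```python
def multiword_candidates(concordances, T=0.75):
    """For each relative position around the headword, keep the word that
    occupies it in more than a fraction T of the concordance lines; the
    kept words, read off in order, form a multi-word collocation candidate."""
    n = len(concordances)
    kept = []
    for pos in range(len(concordances[0])):
        counts = {}
        for line in concordances:
            counts[line[pos]] = counts.get(line[pos], 0) + 1
        word, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq / n > T:
            kept.append((pos, word))
    return kept
```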
Both Smadja's experiments and ours show that this simple statistical extraction of multi-word collocations achieves approximately 95% accuracy, so no further improvement is needed at this stage.

5. WEEDING OUT THE PSEUDO COLLOCATIONS

The precision of two-word collocation extraction up to this point is a mere 50%-60%. This is because some word pairs with significant statistical measures are strongly correlated syntactically yet are not considered true collocations, as in the case of …; we call these pseudo collocations. They can be removed using predefined syntactic templates.

According to [10], The Collocation Dictionary of Content Words in Modern Chinese, collocations usually fit syntactic template patterns. These template patterns are organized at three levels around noun, verb, and adjective headwords. Table 1 illustrates this; owing to space limitations, it shows the patterns for a noun headword only.

Table 1. The template patterns for a noun headword

The first-level pattern indicates the phrase of which the headword is the key word: Subject-Phrase, Verb-P, Object-P, Attribute-P, Complement-P, Adverbial-P, or Noun-Phrase. The second level indicates the POS of the co-word, such as /n, /v, /a, /m, /q, and we write PB (before) and PA (after) for the position of the co-word relative to the headword. This information lets us remove word pairs that have high statistical values but should nonetheless be eliminated. For example, from the table we know that when a noun appears in a Verb-Phrase, the noun must follow the verb; a noun-verb pair extracted from a Verb-Phrase in violation of this order is eliminated no matter how high its co-occurrence frequency. Furthermore, the co-occurrence of function words and content words in Chinese tends to be purely syntactical, so at this stage only collocations consisting of at least two content words are retained.
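A sketch of how such template filtering might be coded; the table entries here are illustrative stand-ins, since Table 1 itself is not reproduced, and all names are ours:

```python
# Allowed co-word positions keyed by (phrase type, headword POS, co-word POS);
# 'PB' = co-word before the headword, 'PA' = after. Entries are illustrative.
TEMPLATES = {
    ('Verb-P', 'n', 'v'): 'PB',   # in a verb phrase the noun follows the verb
    ('Noun-P', 'n', 'a'): 'PB',   # an adjective modifier precedes its noun
}

CONTENT_TAGS = {'n', 'v', 'a', 'd'}  # nouns, verbs, adjectives, adverbs

def keep_candidate(phrase_type, head_pos, co_pos, position):
    """Reject a statistically strong pair that violates the syntactic template
    for its phrase type, or that does not pair two content words."""
    if head_pos not in CONTENT_TAGS or co_pos not in CONTENT_TAGS:
        return False  # only pairs of content words are retained at this stage
    allowed = TEMPLATES.get((phrase_type, head_pos, co_pos))
    return allowed is None or allowed == position
```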
6. EVALUATION

Since no benchmark data are available for evaluating collocation precision, it was necessary, as with Xtract, to have professional linguists examine the results. At this time we have selected 30 headwords, comprising 10 nouns, 10 verbs and 10 adjectives across a wide frequency range, and compared the extracted results with manually established correct answers. Our improved collocation extraction system achieves a 73% precision rate on average. The following example shows how each part of the algorithm contributes to the result for a given headword ….

The frequency of the headword … in the corpus is 2,271, with a total of 462 co-words; the frequency distribution of its co-words is shown in Figure 3. Using Smadja's approach the average co-word frequency is 4.9, whereas under our revised computation it is 7.8. For two-word pair extraction, our previous system, modified from Xtract, suggests the collocations ….

Figure 3. Co-word frequency distribution for the example headword

Under our algorithm, however, … and … are eliminated because they do not satisfy the requirements of the BI-directional BI-Gram and χ² tests, while two further candidates are added: … (frequency 10) and … (frequency 8). Using the syntactical filter described in Section 5, the candidates … (20) and … (28) are filtered out. Finally, our system outputs the suggested collocations ….

Compared against the correct answers established by a professional linguist with reference to [10], 11 of the 14 suggested pairs are judged to be collocations. In this example our system thus achieves a precision rate of 11/14 = 78%, higher than the 13/23 = 56% achieved by our earlier system.
7. CONCLUSION

In this paper we present a collocation extraction system that first extracts candidates and then eliminates pseudo collocations using syntactic templates. The algorithm uses BI-directional BI-Gram measures as selection criteria, with selection functions improved over those of existing systems. Results show that our system achieves an average precision of 73% for the 30 words tested, a 10% to 20% absolute improvement over other systems. It should be pointed out that we are unable to report a recall rate, because the information needed to compute it is not available. In the future we will investigate identifying collocations through synonym substitution, so as to eliminate word pairs that are highly frequent but tend to be free combinations. Furthermore, a shallow parser will be built to process running text, so that collocations can be extracted in real environments.

Acknowledgement

This project is partly supported by a CERG grant entitled "Automatic Acquisition of Chinese Collocations with High Precision" (PolyU reference: B-Q535).

REFERENCES

[1] C. D. Manning and H. Schutze, 1999, Foundations of Statistical Natural Language Processing, The MIT Press.
[2] M. Benson, 1990, "Collocations and General-Purpose Dictionaries," International Journal of Lexicography, vol. 3 (1).
[3] Y. Choueka, 1988, "Looking for Needles in a Haystack, or Locating Interesting Collocation Expressions in Large Textual Databases," in Proc. of the RIAO Conf. on User-oriented Content-based Text and Image Handling, 21-24, Cambridge.
[4] K. Church and P. Hanks, 1990, "Word Association Norms, Mutual Information, and Lexicography," Computational Linguistics, vol. 16 (1).
[5] F. Smadja, 1993, "Retrieving Collocations from Text: Xtract," Computational Linguistics, vol. 19 (1).
[6] M. S. Sun, J. Fang, and C. Huang, 1997, "A Preliminary Study on the Quantitative Analysis of Chinese Collocations," Chinese Linguistics, vol. 1.
[7] H. L. Sun, 1998, "Distributional Properties of Chinese Collocations in Texts," in Proc. 1998 Int. Conf. on Chinese Information Processing, Tsinghua University Press.
[8] N. X. Wei, 2002, "The Research of Corpus-based and Corpus-driven Collocation Extraction," Modern Linguistics, vol. 4 (2).
[9] Q. Lu, Y. Li, and R. F. Xu, "Improving Xtract for Chinese Collocation Extraction," submitted for publication.
[10] S. K. Zhang and X. G. Lin, 1992, The Collocation Dictionary of Content Words in Modern Chinese, Commercial Press.
[11] K. W. Church et al., 1993, "Introduction to the Special Issue on Computational Linguistics Using Large Corpora," Computational Linguistics, vol. 19.
