
Proceedings of 2008 3rd International Conference on Intelligent System and Knowledge Engineering

Extracting Chinese Multi-word Terms from Small Corpus

Zhou Lang¹,³, Zhang Liang², Feng Chong³, Huang Heyan³


1. College of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, 210094
2. Dept. of Computer Science and Technology, Nanjing University, Nanjing, 210093
3. Research Center of Computer & Language Information Engineering, CAS, Beijing, 100089
yzzhoulang@126.com

Abstract

In this paper, we present an automatic terminology extraction approach for Chinese multi-word terms. In this term extraction system, besides five linguistic rules acquired from an available term list by machine learning methods, two statistical strategies are involved: a termhood measure based on the term distribution variation, and a unithood measure that adopts the left and right entropy to estimate the collocation variation degree. The candidates are ranked according to the values of the former; the latter is used to filter out the preposition phrases and some verb-object phrases that rarely appear as terms. Validated on a small scale corpus in the computer domain, the precision reaches 91.5% on the top 2000 outputs.

978-1-4244-2197-8/08/$25.00 ©2008 IEEE

1. Introduction

Along with the rapid development of science, technology and communication, more and more new terms are generated every day, and users increasingly wish to keep up with these changes and catch the new information. Automatic terminology extraction caters to this demand: it recognizes the terms in a given corpus according to linguistic knowledge and the statistical weight scores of the candidates.

Most automatic terminology extraction systems combine linguistic and statistical methods: the candidates are extracted by some linguistic knowledge, and then selected or ranked by statistical measures. The most widely applied grammatical construct was introduced by Justeson & Katz[1]:

((A | N)+ | ((A | N)* (N P)?)(A | N)*) N

where A denotes an adjective, N a noun and P a preposition. According to this grammatical rule, multi-word terms are restricted to a certain form of noun phrase. In fact, some terms are verb phrases or even adjective phrases. In order to obtain more terms with various grammatical constructs, more open linguistic rules are adopted in our term extraction system.
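As a rough illustration only (not part of the original system), the Justeson & Katz pattern amounts to a regular expression over POS tags. In the sketch below the single-letter tags and the helper function are illustrative placeholders, not the ICTCLAS tag set; Section 2 replaces this closed noun-phrase filter with more open rules.

import re

# Justeson & Katz pattern [1]: ((A|N)+ | ((A|N)*(N P)?)(A|N)*) N,
# with each token reduced to one letter: A = adjective, N = noun, P = preposition.
JK_PATTERN = re.compile(r"^((?:[AN])+|(?:[AN])*(?:NP)?(?:[AN])*)N$")

def is_jk_candidate(pos_tags):
    """pos_tags: e.g. ['A', 'N', 'N']; True if the whole sequence matches the pattern."""
    return JK_PATTERN.match("".join(pos_tags)) is not None

# is_jk_candidate(['A', 'N', 'N']) -> True   (adjective-noun-noun phrase)
# is_jk_candidate(['V', 'N'])      -> False  (verb-object string, rejected by this filter)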
Meanwhile, many statistical methods have been applied to term identification. Damerau[2] used the mutual information of the words in a candidate as the weight to measure the degree of term cohesion. Cohen[3] adopted an n-gram model to count the occurrences of the n-gram strings in each document and in the whole corpus respectively, and then computed the cohesion with a log-likelihood measure. Frantzi[4] proposed the C-value approach to identify terms among noun phrases; the particularity of the C-value is its consideration of nested terms. Minoru[5] presented a method combining perplexity and frequency information to rank the candidates. Li[6] removed non-term items from the candidates based on the CBC clustering algorithm. Hypothesis testing methods, such as the t-test, the χ²-test and Z-scores, are also widely used in the term extraction process.

Most of these methods are based on a large scale corpus. In actual research, collecting such a corpus needs the support of a mass of manpower, material and financial resources. Moreover, these methods pay more attention to measuring term cohesion than domain specificity.



In this paper, we introduce a new term identification measure based on two criteria: one is a termhood value, which is evaluated from the term distribution variation over the whole corpus; the other is a unithood measure used to filter out some preposition phrases and verb-object phrases which rarely appear as terms. Our approach is shown to be effective on a small scale corpus.

The remainder of the paper details the terminology extraction process. Section 2 presents the linguistic knowledge acquired from a term list in the computer domain; these rules are used to extract candidate terms from a small corpus. Section 3 introduces the two criteria used to rank and filter the candidate terms. Section 4 reports our experimental results and discusses them with respect to the C-value, mutual information, t-test, χ²-test and log-likelihood ratio approaches. The paper ends with conclusions and perspectives for future research.

2. Candidate term extraction

The purpose of this paper is to extract multi-word terms from texts. Most terminology extraction approaches have focused only on noun phrases. However, verbal phrases and adjectival phrases in domain-specific documents also carry domain information.

The choice of the linguistic knowledge used to extract candidates affects the precision and the recall of the result. Strict linguistic knowledge is positive for precision but negative for recall. For example, the linguistic filter used by Justeson and Katz[1] only permits noun phrases, which produces high precision, because noun phrases in a domain corpus are the most likely to be terms. At the same time, it brings a low recall, since verbs and other words are excluded by this filter.

Furthermore, automatic Chinese Part-of-Speech taggers still cannot produce a completely correct result. Since there is no obvious morphological change between the noun form and the verb form of a Chinese word, nouns and verbs are often confused in the Part-of-Speech tagging process. For instance, "依存分析" (dependency analysis) is segmented into two words, "依存" (dependency) and "分析" (analysis), and both are tagged as verbs.

Accordingly, we adopt an open linguistic filter in the candidate term extraction component of our system. This linguistic knowledge, acquired from a term list in the computer domain by a series of machine learning approaches, is composed of restriction rules about the term length and the inner structural knowledge of terms.

2.1. The term resource

The available term resource is a term list related to the computer domain, which contains 10,026 terms. The term list provides a great deal of samples for extracting candidate terms in the same domain, and is used to learn the length restriction and the inner structural rules. Before the linguistic knowledge is learned, the term list is pre-processed by ICTCLAS, an automatic Part-of-Speech tagger provided by the Institute of Computing Technology of the Chinese Academy of Sciences.

2.2. The length restriction

In this paper, the length of a term is the number of tokens after segmentation. From the statistical information about the length of the term samples, it is easy to see that the length of terms ranges from one to ten. The detailed term length distribution is shown in Table 1.

Table 1. The length distribution of the term samples
term length    number    ratio
1               1,089    10.86%
2-6             8,871    88.48%
7-10               66     0.66%

From Table 1 we can see that 88.48% of the terms have a length of two to six tokens. Although single-word terms account for 10.86%, 48.86% of them are abbreviations in English characters. Zhang Rong[7] reported similar statistics from a term bank with 328,150 terms. Consequently, we set the first restriction rule: the length of a candidate term must be between two and six.

2.3. The inner structural rules

Based on the observation of the 8,871 terms of two to six words, which were segmented into 24,127 tokens, we found that non-morpheme words, modal particles and status words never appear, while interjections, idioms, pronouns, location words and punctuation occur only 67 times in total. Only 0.13% of the term heads are tagged as an auxiliary word, a conjunction or a postpositional particle; likewise, only 0.11% of the terms end with a frontal particle, a localizer, a conjunction or an auxiliary word. In addition, 99.74% of the terms contain a noun, a verb, a measure word, a postpositional particle or a shortened form.



According to the above statistical information, four inner structural rules are designated to extract candidate terms. Firstly, a candidate term cannot contain a non-morpheme word, a modal particle, a status word, an interjection, an idiom, a pronoun, a location word or punctuation. Secondly, an auxiliary word, a conjunction or a postpositional particle should not be the head. Thirdly, word strings ending with a frontal particle, a localizer, a conjunction or an auxiliary word are filtered out. Finally, at least one of the words in a candidate term must be a noun, a verb, a measure word, a postpositional particle or a shortened form.
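As a minimal sketch (with assumed coarse tag labels rather than the actual ICTCLAS tag set, and reading "head" as the first word of the string), the length restriction of Section 2.2 and the four rules above could be applied to a POS-tagged word string as follows:

# Illustrative coarse POS labels; the real system works on ICTCLAS tags.
FORBIDDEN_ANYWHERE = {"non_morpheme", "modal_particle", "status_word", "interjection",
                      "idiom", "pronoun", "location_word", "punctuation"}
FORBIDDEN_HEAD = {"auxiliary", "conjunction", "postpositional_particle"}
FORBIDDEN_TAIL = {"frontal_particle", "localizer", "conjunction", "auxiliary"}
REQUIRED_ONE_OF = {"noun", "verb", "measure_word", "postpositional_particle", "shortened_form"}

def is_candidate(tags):
    """tags: POS tags of one segmented word string, e.g. ['noun', 'noun'].
    Returns True if the string passes the five linguistic restriction rules."""
    if not 2 <= len(tags) <= 6:                       # length restriction (Section 2.2)
        return False
    if any(t in FORBIDDEN_ANYWHERE for t in tags):    # rule 1: forbidden categories
        return False
    if tags[0] in FORBIDDEN_HEAD:                     # rule 2: head restriction
        return False
    if tags[-1] in FORBIDDEN_TAIL:                    # rule 3: tail restriction
        return False
    return any(t in REQUIRED_ONE_OF for t in tags)    # rule 4: required category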

3. Candidate term identification

According to the five linguistic restriction rules mentioned in Section 2, a series of word strings are extracted from the domain-specific corpus. We select the ones which occur more than twice in the whole corpus as the candidate terms. In general, some further processing with statistical measures is applied to rank or filter the candidates and select the final set of terms.

Kageura[8] proposes two critical notions about the statistical measures used in term extraction. One is unithood, which weights the cohesion of the syntactic collocations inside a term. The other is termhood, which measures the domain specificity of a term. Therefore, the unithood and termhood factors are incorporated together into the term identification process to measure the importance of the candidate terms.

The statistical measures in our term identification phase are named TDCV (Term Distribution and Collocation Variation). The TDCV approach contains two components. One is the candidate term ranking module: based on the statistical information of the term distribution variation in the whole corpus, the termhood value decides the rank of the candidates. The other is the candidate term filtering module: using the left and right entropy, the unithood value measures the flexibility of the collocations inside a candidate, which makes it easy to recognize candidates containing conventional lexical items that rarely appear in terms, such as prepositional phrases and part of the verb-object phrases.

3.1. Candidate term ranking

The term extraction approaches mentioned in Section 1 are based on a large domain-specific corpus, and some of them additionally need an even larger general corpus. In this paper, we propose a termhood measure based on the statistical information of the term distribution variation, which is also effective on a small scale corpus.

By observing the distribution of terms over the corpus, we find that the frequency variation of terms (e.g. "句法分析" (syntactic analysis)) across different documents is much larger than that of common words (e.g. "测试数据" (test data)). Figure 1 shows the distribution variation of "句法分析" and "测试数据" in a corpus of the computer domain.

Figure 1. Distribution variation curves of "句法分析" and "测试数据"

In Figure 1, the vertical axis denotes the ratio of the frequency in each document, and the zero points are eliminated. Comparing the two curves, it is easy to see that the frequency curve of the common word ("测试数据") is calm, while that of the term ("句法分析") fluctuates acutely. By analyzing the corpus, we found that most occurrences of terms in science or technology documents follow two conditions: (1) if a term has a strong relation with the document topic, it is referred to continually in the text; (2) if a term and the document topic have only a faint correlation, its appearance probability decreases dramatically. Therefore, in most cases, the variation of term occurrences is larger than that of common words over the whole corpus. The following paragraphs present a new termhood measure based on the term distribution variation.

To measure the fluctuation degree of a sample distribution, the simplest and most effective measure is the sample variance. A small variance means a stable appearance of the word; on the contrary, a remarkable change brings a large variance. For a given candidate term t, the termhood measure is defined as follows:

termhood(t) = \frac{tf(t)}{df(t)} \cdot \frac{1}{N-1} \sum_{i=1}^{N} \left( tf_i(t) - \overline{tf}(t) \right)^2    (1)

where tf(t) is the frequency of t, df(t) is the document frequency, N is the number of documents which contain t, tf_i(t) is the frequency of t in the i-th document, and \overline{tf}(t) is the average frequency over these N documents.



Considering that the document frequency of some terms is very low (for example, "fisher线性判别式" (Fisher linear discriminant) occurs 12 times but only in a single document), some modification of equation (1) is needed, giving equation (2). We add one new point (N+1, tf(t)/M) to the distribution of each term, where tf(t)/M is the average frequency of t over all the documents of the corpus.

termhood(t) = \frac{tf(t)}{df(t)} \cdot \frac{1}{N} \sum_{i=1}^{N+1} \left( tf_i(t) - \overline{tf}^{*}(t) \right)^2    (2)

Assuming there are M documents in the corpus, \overline{tf}(t) is replaced by \overline{tf}^{*}(t):

\overline{tf}^{*}(t) = \frac{tf(t) + tf(t)/M}{N+1} = \frac{(M+1)\,tf(t)}{M(N+1)}    (3)

If the corpus is big enough, the value of \overline{tf}^{*}(t) approximates tf(t)/(N+1). The frequency information required in equation (3) can easily be collected from the corpus.
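A small sketch of the termhood computation of equations (1)-(3), assuming the corpus is held as a list of tokenized documents and a candidate is a tuple of tokens (the document representation and the naive n-gram matching below are simplifications for illustration, not the authors' implementation):

def termhood(candidate, docs):
    """candidate: tuple of tokens; docs: list of documents, each a list of tokens.
    Variance-based termhood of equations (2)-(3); higher values rank higher."""
    M = len(docs)
    n = len(candidate)
    # frequency of the candidate in each document (naive n-gram matching)
    per_doc = [sum(1 for i in range(len(d) - n + 1) if tuple(d[i:i + n]) == candidate)
               for d in docs]
    tf = sum(per_doc)                      # total frequency tf(t) in the corpus
    freqs = [c for c in per_doc if c > 0]  # tf_i(t) over the N documents containing t
    N = len(freqs)                         # document frequency df(t)
    if N == 0:
        return 0.0
    freqs.append(tf / M)                   # added point (N+1, average over all M documents)
    mean = sum(freqs) / (N + 1)            # tf*(t), equation (3)
    variance = sum((f - mean) ** 2 for f in freqs) / N
    return (tf / N) * variance             # equation (2): (tf(t)/df(t)) * variance

Candidates would then be sorted by this value from high to low to produce the ranked list.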
3.2. The unithood of candidate terms

Many methods for evaluating term unithood have been proposed in other works[1][2][3]. These methods are designed to measure the cohesion between the words of a candidate, and a candidate with a fixed syntactic structure gets a higher score. But many kinds of phrases have little chance of being a term, such as preposition phrases and some verb-object phrases. As a result of the open linguistic knowledge adopted in the candidate term extraction phase, preposition phrases and verb-object phrases take up a large part of the candidate list. Investigating these errors, we find that most of them contain conventional lexical items, such as "……中" (in …) and "利用……" (use …). In general, conventional lexical items easily form collocations with many other words or phrases, and few of them are nouns.

In this work, we use the left and right entropy[9] to measure the collocation variation degree of the candidate terms as the unithood, with particular emphasis on filtering out the phrases that contain conventional lexical items. First of all, the statistical information in the corpus is used to compute the left and right entropy of each word appearing in the candidates. The left and right entropy of a noun is defined as zero.

Le(w) = -\sum_{l \in L} p(lw \mid w) \cdot \log_2 p(lw \mid w)
Re(w) = -\sum_{r \in R} p(wr \mid w) \cdot \log_2 p(wr \mid w)    (4)

where L is the set of words appearing to the left of w, R is the set of words appearing to the right of w, p(lw|w) denotes the probability of l adjoining w on the left, and p(wr|w) denotes the probability of r adjoining w on the right. If w is a conventional lexical item, at least one of its left and right entropies must be high. Based on the left and right entropy, the unithood of t is defined as follows:

unithood(t) = \max_{w_i \in t} \{ Entropy(w_i) \}    (5)

Entropy(w_i) = \begin{cases} Re(w_i) & i = 1 \\ (Le(w_i) + Re(w_i))/2 & 1 < i < n \\ Le(w_i) & i = n \end{cases}

where w_i is the i-th word of t and n is the length of t.

When a candidate term contains conventional lexical items, its unithood is higher than that of the others, i.e. it is unlikely to be a true term in the corpus. A threshold can therefore be set to filter out the candidates whose unithood is higher than it.
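A sketch of the unithood of equations (4) and (5), assuming the left- and right-neighbour counts of each word have already been collected from the segmented corpus (the two count dictionaries and the is_noun predicate are assumed inputs, not part of the original description):

import math

def entropy(neighbour_counts):
    """Shannon entropy (base 2) of a word's neighbour distribution, equation (4)."""
    total = sum(neighbour_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbour_counts.values())

def unithood(candidate, left_counts, right_counts, is_noun):
    """candidate: list of words w_1..w_n.
    left_counts[w] / right_counts[w]: dicts mapping each word adjoining w on the
    left/right to its co-occurrence count. is_noun(w): nouns get entropy zero.
    Implements equation (5)."""
    n = len(candidate)
    scores = []
    for i, w in enumerate(candidate):
        if is_noun(w):
            scores.append(0.0)
            continue
        le_w = entropy(left_counts.get(w, {}))
        re_w = entropy(right_counts.get(w, {}))
        if i == 0:
            scores.append(re_w)               # first word: right entropy
        elif i == n - 1:
            scores.append(le_w)               # last word: left entropy
        else:
            scores.append((le_w + re_w) / 2)  # inner word: average of both
    return max(scores)

# Candidates whose unithood exceeds a threshold (4.0 in Section 4) would be filtered out.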



4. Experiments and evaluation

4.1. Test corpus

The small corpus used for the experiments consists of 142 Chinese documents belonging to the computer domain. Pre-processed by ICTCLAS, these documents, 3.14 megabytes in total, were segmented into 422,792 words.

4.2. Method of evaluation

In order to evaluate the performance of the proposed method, we conducted two evaluations. First, we evaluated the quality of the terminology extraction by identifying the true positives among the N highest ranked candidates returned by the system. The top 2000 candidates ranked by termhood were validated by a domain expert. Setting the threshold of the unithood to 4.0, the true negatives among the filtered candidates were counted manually.

Second, we compared our approach with several other methods commonly used in term extraction: the widely used C-value measure, which is sensitive to nested terms; the mutual information method, which is the basis of many extended measures; two hypothesis testing methods (the t-test and the χ²-test); and the log-likelihood ratio method. We counted how many domain-specific terms were included in the top 2000 candidates returned by each of these measures. Additionally, 1,265 terms in 50 randomly selected documents were identified to estimate the recall of each approach.
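A minimal sketch of these two evaluations, assuming a ranked candidate list and manually validated term sets (the variable names are illustrative):

def precision_at(ranked, gold, n):
    """Fraction of the top-n ranked candidates that are validated terms."""
    top = ranked[:n]
    return sum(1 for c in top if c in gold) / len(top)

def recall_at(ranked, sampled_terms, n):
    """Fraction of the terms found in the sampled documents that appear in the top-n output."""
    top = set(ranked[:n])
    return sum(1 for t in sampled_terms if t in top) / len(sampled_terms)

# e.g. precision_at(ranked_candidates, validated_terms, 2000)
#      recall_at(ranked_candidates, terms_in_50_docs, 5000)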
4.3. Results and analysis

51,361 candidates were extracted from the small corpus according to the five linguistic rules mentioned in Section 2. After filtering out the candidates occurring fewer than 3 times, 10,413 items remained.

4.3.1. Quality of the TDCV approach. The remaining candidates were processed by termhood ranking and unithood filtering in the term identification phase. The precision values of the top 2000 terms produced by the two strategies are listed in Table 2.

Table 2. The precision of the top 2000 terms produced by the ranking and filtering process
                        Top 500   Top 1000   Top 1500   Top 2000
Termhood ranking          90.4%      89.1%      84.8%      79.6%
Unithood filtering        95.5%      94.2%      92.7%      91.3%

Ranked by the termhood values from high to low, there were 408 errors in the top 2000 terms, of which 281 items contained conventional words. Processed by the unithood filter, the terms with a value higher than 4.0 were removed. As a result, 267 preposition phrases and verb-object phrases containing conventional words were recognized, of which only 3 were terms. After the unithood filtering, 264 true negatives were removed from the ranked list, and the precision improved greatly from 79.6% to 91.3%.

4.3.2. Comparison with other methods. In order to explicitly exhibit the performance of the proposed method, we implemented several comparative experiments with the C-value, mutual information, t-test, χ²-test and log-likelihood ratio methods. These measures were used to process the candidates extracted by the same five linguistic rules as ours. The precision values on the top 2000 candidates ranked by the six methods and their recall values are contrasted in Table 3 and Table 4.

Table 3. Precision of the six methods
                        Top 500   Top 1000   Top 1500   Top 2000
TDCV                      95.5%      94.2%      92.7%      91.3%
C-value                   64.0%      65.1%      62.3%      60.2%
MI                        46.2%      44.9%      41.8%      40.1%
t-test                    80.0%      80.0%      78.4%      78.1%
χ²-test                   79.8%      79.0%      78.9%      79.5%
Log-likelihood ratio      72.2%      74.0%      74.8%      76.2%

Table 4. Recall of the six methods
                        Top 1000   Top 2000   Top 5000
ATR-UF                    16.02%     24.40%     39.15%
C-value                   16.13%     23.95%     37.47%
MI                         6.09%     12.56%     31.62%
t-test                    12.29%     20.02%     33.71%
χ²-test                   11.06%     19.96%     33.54%
Log-likelihood ratio      10.38%     18.83%     33.37%

It is obvious that our TDCV method substantially outperforms the other five measures. On the top 2000 results, the precision of TDCV is 31 points higher than that of the C-value, leads mutual information by 50 points, and exceeds the t-test, the χ²-test and the log-likelihood ratio by roughly 12 points.



Furthermore, our TDCV method has a better capability to improve the rank of terms. For example, the term "语音检索" (speech retrieval), with 18 occurrences in the corpus, gets a much higher weight than the common lexical item "表1" (table 1), which occurs 105 times. This is also reflected in the respective ranks of the C-value output list: while "语音检索" is ranked 94th on the TDCV output list, C-value puts it at rank 738; conversely, "表1" is ranked 48th by C-value, whereas it is ranked 2939th by TDCV. On the other hand, TDCV may also misrank candidates whose occurrences in the corpus are not balanced; in particular, common lexical items with a high frequency but a low document frequency are easily ranked near the front of the output list by mistake. For example, four documents of our test corpus take articles about news reports as their material, so the phrase "新闻报道" (news report), with 32 occurrences, is ranked 524th. However, such cases are rather the exception in the corpus.

In our experiments, the performance of C-value and mutual information is much lower than that of the other methods. After analyzing the errors of each method, we find that the linguistic rules applied in the candidate extraction phase are the main cause. Because the rules are more open than a noun-phrase restriction, they bring many prepositional phrases, verb phrases and other phrases into the candidate list. As the number and variety of the parent strings of a candidate grow, the term weight given by the C-value method decreases greatly, which in turn lowers its precision.
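For reference, the C-value of a nested candidate string a, as defined by Frantzi et al. [4], is

C\text{-}value(a) = \log_2 |a| \cdot \Bigl( f(a) - \frac{1}{P(T_a)} \sum_{b \in T_a} f(b) \Bigr)

where |a| is the length of a, f(a) its frequency, T_a the set of extracted candidate strings that contain a, and P(T_a) the number of such strings. When a candidate occurs inside many frequent longer candidates, the subtracted average grows and its C-value drops, which is the degradation described above.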
The recall values in Table 4 are universally low, with only narrow disparities among them. The main cause of the low recall lies in the small scale of the corpus: many terms are filtered out because their occurrences in the corpus cannot reach the frequency threshold. Furthermore, Part-of-Speech tagging errors and the linguistic knowledge restrictions likewise contribute to the problem.

5. Conclusion and future work

In our study, we proposed a new terminology extraction approach and showed that it clearly outperforms some widely used measures on a small computer domain-specific corpus. Our TDCV method introduces a new termhood measurement for multi-word candidate terms according to the frequency distribution variance, and a unithood filter based on the measure of the collocation variation degree to remove the preposition phrases and some verb-object phrases which contain conventional lexical items. Our experiments show that this method is effective on a small scale corpus.

Although the proposed approach shows a high precision for multi-word terms with representative domain specificity, there is still some work to be done in the future. Our experiments were conducted on a small scale test corpus with only 142 documents; it might be necessary to validate our approach on a larger corpus, such as one with over 500 documents. In addition, a portion of the mistakes cannot yet be resolved; it is necessary to analyze the features of these errors and try some solutions.

References

[1] John Justeson and Slava Katz. Technical Term: Some Linguistic Properties and an Algorithm for Identification in Text. Natural Language Engineering, 1995, Vol.1, No.1, pp.9-27.
[2] Damerau F.J. Evaluating Domain-Oriented Multi-Word Terms from Texts. Information Processing and Management, 1993, Vol.29, No.4, pp.433-447.
[3] Jonathan D. Cohen. Highlights: Language- and Domain-Independent Automatic Indexing Terms for Abstracting. Journal of the American Society for Information Science, 1995, Vol.46, No.3, pp.162-174.
[4] Frantzi K.T., Sophia Ananiadou, Hideki Mima. Automatic Recognition of Multi-word Terms: the C-value/NC-value Method. International Journal on Digital Libraries, 2000, Vol.3, No.2, pp.115-130.
[5] Minoru Yoshida and Hiroshi Nakagawa. Automatic Term Extraction Based on Perplexity of Compound Words. IJCNLP 2005, pp.269-279.
[6] Li Yong. To Automatically Filter Specific Field Terms Based on the Clustering Method. Computer Engineering & Science, 2008, Vol.30, No.2, pp.64-66.
[7] Zhang Rong. Research on Extraction and Clustering of Term Definition and Term Extraction. Doctoral dissertation, Beijing Language and Culture University, 2006.
[8] Didier Bourigault. Surface Grammatical Analysis for the Extraction of Terminological Noun Phrases. Proceedings of COLING'92, 1992, pp.977-981.
[9] Virach Sornlertlamvanich, Tanapong Potipiti, and Thatsanee Charoenporn. Automatic Corpus-based Thai Word Extraction with the C4.5 Learning Algorithm. Proceedings of the 18th Conference on Computational Linguistics, 2000, pp.802-807.


