3 authors, including:
Ruifeng Xu, Harbin Institute of Technology, Shenzhen Graduate School
Qin Lu, The Hong Kong Polytechnic University
All content following this page was uploaded by Qin Lu on 27 May 2014.
where

\[ f_i = \sum_{j=-5}^{5} f_{i,j} / 10. \tag{3.2.2} \]

The larger the spread, the more likely it is that the co-word appears in certain positions in the window rather than following a flat distribution.

In Xtract, Smadja uses the following three conditions to identify collocation candidates:

\[ I_1: \; k_i = \frac{f_i - \bar{f}}{\sigma} \geq k_0 \tag{3.2.3} \]

\[ I_2: \; U_i \geq U_0 \tag{3.2.4} \]

\[ I_3: \; f_{i,j} \geq f_i + (k_1 \times U_i) \tag{3.2.5} \]

where $k_0$, $U_0$, and $k_1$ are threshold parameters, taking the values 1, 10, and 1, respectively.

Because Xtract cannot handle the “high-low” and “low-high” problems discussed in Section 1, we modified the candidate selection conditions in several ways. First, we substitute the new average frequency and standard deviation for the original average frequency and standard deviation. Second, our strength and spread not only take into account the values of $w_i$ as a co-word, but also reverse the roles of the head word and the co-word, so that the data in the reversed roles are also taken into consideration. Thus we call our

To overcome this problem, we adjust the f-means to reduce the influence of wild values and sparse data.

Furthermore, the $\chi^2$ (chi-square) test, which does not assume normally distributed probabilities, is applied here to evaluate a collocation candidate by comparing its observed frequencies with the frequencies expected under independence. If the difference is clearly larger than a threshold, the words in the collocation candidate are taken to be far from independent and thus highly associated.

The $\chi^2$ statistic summarizes the differences between observed and expected frequencies as follows:

\[ \chi^2 = \sum_{i,j} \frac{(T_{ij} - E_{ij})^2}{T_{ij}} \tag{3.3.1} \]

where $T_{ij}$ is the observed frequency of the word pair and $E_{ij}$ is the frequency expected under independence.

In our system, we applied the $\chi^2$ test only to the evaluation of bi-gram collocation candidates. For words $w_a$ and $w_b$, suppose $w_a$ appears $t_a$ times and $w_b$ appears $t_b$ times in an $N$-word corpus, and $w_a$ and $w_b$ co-occur $t_{ab}$ times. The expected frequency under independence, $E_{ab}$, is calculated as:

\[ E_{ab} = \frac{t_a}{N} \cdot \frac{t_b}{N} \cdot N \tag{3.3.2} \]
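As an illustration, the expected frequency of Eq. 3.3.2 and a chi-square score for a single bi-gram candidate might be computed as in the minimal sketch below. The function names, the 2x2 contingency-table layout, and the example counts are assumptions made for this sketch and are not taken from the system described here. The sketch also divides each squared difference by the expected count, as in the standard Pearson statistic, which may differ from the denominator used in Eq. 3.3.1.

```python
def expected_frequency(t_a: int, t_b: int, n: int) -> float:
    """Expected co-occurrence count of w_a and w_b under independence (Eq. 3.3.2)."""
    return (t_a / n) * (t_b / n) * n


def chi_square_bigram(t_a: int, t_b: int, t_ab: int, n: int) -> float:
    """Pearson chi-square for a bi-gram over an assumed 2x2 contingency table.

    Rows: w_a present / absent. Columns: w_b present / absent.
    """
    # Observed counts for the four cells of the table.
    observed = [
        [t_ab, t_a - t_ab],
        [t_b - t_ab, n - t_a - t_b + t_ab],
    ]
    row_totals = [t_a, n - t_a]
    col_totals = [t_b, n - t_b]

    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            # Expected count from the marginals; cell (0, 0) equals E_ab.
            expected = row_totals[r] * col_totals[c] / n
            if expected == 0:
                raise ValueError("expected count is zero; counts are degenerate")
            chi2 += (observed[r][c] - expected) ** 2 / expected
    return chi2


if __name__ == "__main__":
    # Hypothetical counts: w_a occurs 100 times, w_b 200 times, N = 10,000.
    print(expected_frequency(100, 200, 10_000))
    print(chi_square_bigram(100, 200, 50, 10_000))
```

A candidate would then be kept when its score exceeds a chosen threshold. For a 2x2 table (one degree of freedom), 3.841 is the conventional critical value at the 5% level; it is mentioned here only as an example, not as the threshold used in this system.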