Using The Web For Automated Translation Extraction in Cross-Language Information Retrieval

Using the Web for Automated Translation Extraction in
Cross-Language Information Retrieval
Ying Zhang Phil Vines

School of Computer Science and Information Technology
RMIT University, GPO Box 2476V, Melbourne, Australia, 3001.
{yzhang,phil}@cs.rmit.edu.au
ABSTRACT from English to Chinese, existing systems are able to detect

There have been significant advances in Cross-Language In- an English OOV term since no segmentation is required for
formation Retrieval (CLIR) in recent years. One of the ma- English terms. They are either present in the translation
jor remaining reasons that CLIR does not perform as well dictionary or not. However when trying to discover appro-
as monolingual retrieval is the presence of out of vocabulary priate Chinese translations for these terms, segmentation
(OOV) terms. Previous work has either relied on manual again comes into play, and previous work has suffered from
intervention or has only been partially successful in solving inability to correctly identify new terms automatically. We
this problem. We use a method that extends earlier work in propose a segmentation free method based on frequency and
this area by augmenting this with statistical analysis, and length analysis and corpus-based disambiguation, and show
corpus-based translation disambiguation to dynamically dis- that it is successful in the vast majority of cases. We have
cover translations of OOV terms. The method can be ap- concentrated on short queries as they represent typical web
plied to both Chinese-English and English-Chinese CLIR, queries and have proved difficult to translate due to lack of
correctly extracting translations of OOV terms from the context.
Web automatically, and thus is a significant improvement The structure of the paper is a as follows. In Section 2,
on earlier work. we describe the various components of CLIR systems, exist-
ing approaches to the OOV problem, and explain the ideas
Categories and Subject Descriptors behind the extensions we have developed. In Section 3, we
H.3.3 [Information Search and Retrieval] describe our algorithm for extracting English translations of
General Terms Chinese OOV terms, while in Section 4, we give our algo-
Algorithms, Languages rithm for extracting Chinese translations of English OOV
terms. In Section 5, we detail our experiments and the re-
Keywords sults we obtained; and Section 6 concludes the paper.
Cross-Language IR, OOV problem, query translation
1. INTRODUCTION 2. PREVIOUS WORK

Dictionary-based query translation is a widely used ap-
Successful translation of OOV terms is one of the chal-
proach in CLIR [3, 6, 7, 10], because of its simplicity and
lenges of CLIR. Particular difficulties exist in languages
the increasing availability of machine readable dictionaries.
where there are no clearly defined boundaries between words
However, dictionary-based translation schemes need to ad-
as is the case with Chinese text. When translating from
dress three major issues; phrase identification and transla-
Chinese to English, a standard first step is to segment the
tion, translation ambiguity and out of vocabulary (OOV)
text into words based on an existing segmentation dictio-
terms.
nary. However where an OOV term occurs, it will not be
Phrase identification refers to the identification of groups
recognized, and segmented into either smaller sequences of
of words which have a special meaning when they co-occur
characters or individual characters. In this case the con-
that is different from the individual meanings of the words,
stituent components will usually be translated into terms
for example non proliferation treaty and cross straits. It has
in the target language that have little relationship to the
been noted that correct phrase translation may improve re-
original meaning. We describe a technique to detect and
trieval effectiveness by up to 25%[2]. Often a given word
correct this situation via means of a probability value gen-
has multiple translations. Translation disambiguation refers
erated by a hidden Markov model (HMM). When translating
to selecting the most appropriate translation in the given
context. A number of schemes have been developed to re-
solve the translation ambiguity problem, using techniques
Permission to make digital or hard copies of all or part of this work for such as term co-occurrence [3, 7], mutual information [12] or
personal or classroom use is granted without fee provided that copies are language modeling [6]. OOV terms are typically new terms
not made or distributed for profit or commercial advantage and that copies from current affairs, such as personnel names, place names
bear this notice and the full citation on the first page. To copy otherwise, to and translated words. Several authors [4, 5, 1, 11] have pro-
republish, to post on servers or to redistribute to lists, requires prior specific posed techniques to deal with OOV terms in CLIR, and we
permission and/or a fee.
SIGIR’04, July 25–29, 2004, Sheffield, South Yorkshire, UK. summarize these below.
Copyright 2004 ACM 1-58113-881-4/04/0007 ...$5.00. While each of the above phases involve different tech-
162
niques, they are all inter-related. For example, if the term different approaches are needed in each case. When looking
cross straits is missing from the phrase dictionary, it will for English translations of Chinese OOV terms, the Chinese
be translated word by word, and the meaning will be lost. OOV terms need to be appropriately detected as the first
It would be much more appropriate if it could be identi- step. Normally a segmenter is used to determine Chinese
fied as an OOV term and translated as a phrase. Similarly, word boundaries, and this information would be used to as-
OOV term translation extraction sometimes produces more sist in the identification of the Chinese OOV term. The
than one candidate translation. However, once added to problem is that the Chinese OOV term we are looking for is
the dictionary, a sound disambiguation technique will usu- currently unknown, and thus we have no information about
ally be able to select the most appropriate translation. We how it should be segmented. In previous work [5], this prob-
have developed a disambiguation technique based on lan- lem was overcome by manual intervention to provide appro-
guage modeling using a HMM [6] together with a decay priate segmentation. Looking for a Chinese translation of
factor [7], and extended it by adding the concept of win- an English OOV term is also not straightforward since a
dow size effect. Although the focus of this paper is on the number of candidate Chinese character strings are normally
OOV problem, it must be used in conjunction with a trans- found and must then be segmented. Previous automatic
lation disambiguation technique if sensible results are to be procedures [11, 4] have not been particularly successful.
achieved. However when measuring the effectiveness of any Our segmentation free technique overcomes the difficul-
OOV term translation extraction technique, it is necessary ties experienced previously by researchers in this area. The
to carefully design experiments so as to isolate the improve- details of each procedure will be explained in the following
ment separately contributed by disambiguation and OOV sections.
term translation extraction.
3. ENGLISH TRANSLATION EXTRACTION
2.1 Existing Approaches to the OOV Problem IN CHINESE-ENGLISH CLIR
Depending on the language, it may be possible to de-
In this section, we discuss the task of automatically ex-
duce appropriate transliterated translations automatically.
tracting the English translations of Chinese OOV terms thro-
For example, AbdulJaleel and Larkey describe a translitera-
ugh the mining of web text. We use a four stage process to
tion technique [1] that they successfully applied in English-
extract the English translations: Chinese OOV term detec-
Arabic CLIR. However the issue is more difficult in Chinese
tion, web text extraction, collection of co-occurrence statis-
as many characters have the same sound, and many English
tics, and translation selection.
syllables do not have equivalent sounds in Chinese, meaning
that selecting the correct characters to represent a translit- 3.1 Chinese OOV Term Detection
erated word can be problematic. Meng et. al. [11] describe
First, we detect the Chinese OOV terms. This process
a similar technique for transliteration of English names to
builds on a translation disambiguation technique we devel-
Chinese. However, the technique appears to only be par-
tially successful. In their example, they derive “ä°v ”
oped previously [15]. We used a HMM to select the most
appropriate translations for a sequence of query key terms.
as the translation of “Christopher”, where the dictionary [9]
and the Web gives “ .°Wv ” as the translation. More-
A byproduct of this technique is that the probability esti-
mate can be used to indicate the likely occurrence of OOV
over, when an OOV term translation is based on meaning
terms. When a Chinese OOV term occurs, it will be incor-
rather than sound, transliteration techniques will fail. For
rectly segmented into individual characters, which will then
example, “NASD” and “Mars rover” cannot be transliter-
be separately translated into English terms. For example,
ated. Chen et. al. used Yahoo-China search engine to find
a Japanese personnel name was segmented and translated
translations of OOV terms [5]. Our approach extends this
idea as we explain below.
as ð (north) (limit) É (military). Since the correlation
between these English terms is weak, the language model
probability value (Pvalue ) given by the HMM will be very
2.2 Segmentation Free Translation Extraction low. When Pvalue falls below Pmin , we conclude that the
Our approach stems from the observation that when new query contains OOV terms.
terms, foreign terms, or proper nouns are used in Chinese
web text, they are sometimes accompanied by the English 3.2 Extraction of Web Text
translation in the vicinity of the Chinese text, for example
ÜÄCyCè ...Seattle Mariners. By mining the Web to col-
Second, we extract strings that contain the Chinese query
terms and some English text from the Web.
lect a sufficient number of such instances for any given word
and applying statistical techniques, we are then able to in- 1. When a Chinese query term is missing from the dic-
fer the appropriate translation with reasonable confidence. tionary or the probability value does not reach the
The idea of using the Web to search for translations is not Pmin [15], we run a script file that uses Google to fetch
new [5, 11], however our technique is segmentation-free and the top 100 Chinese documents using the whole Chi-
consequently can extract translations that were previously nese query and save them into a local file using the
undetected, or only detected by manual intervention to pro- following command:
vide correct segmentation. lynx -source "http://www.google.com.au/
It is common to find a small amount of English text in search?q=Chinese-Query&num=100&
Chinese web documents, but extremely rare to find Chi- lr=lang_zh-TW&cr=countryTW&hl=zh-TW&
nese text in English web documents. We therefore rely on ie=UTF-8&oe=UTF-8" > local_file
Chinese web documents to extract translations in both di-
rections. It should be noted that a side effect of using a search
Segmentation causes difficulty in both directions, however engine is that only higher quality web text is returned.
163
This reduces the likelihood of noisy translations being (c) Add (Ct , et ) into the translation dictionary.
collected.
2. Then search for the English term et with the highest
2. For each returned document, only the title and the frequency:
query-biased summary are extracted and saved into a
local file. (a) Search for the English terms etargets , where
f (etargets ) = max(f (ei )).
3. The file is then filtered to remove HTML tags and
(b) Extract the English term et and the Chinese que-
metadata, leaving only the web text.
ry substring Ct , where
For example, suppose we have a query Q which is com- f (et , Ct ) = max(f (etargets, Cij )).
posed of a sequence of Chinese terms (c1 , c2 , c3 , c4 , c5 ),
and we have retrieved a series of titles and query- (c) if Ct = Ct and et = et , add (Ct , et ) into the
biased summaries of web text that contain both Chi- translation dictionary.
nese query substrings C ∈ Q and English terms e, as In the example in Table 1 above, two translation pairs (c2 c3 ,
shown in Figure 1. e1 ) and (c1 c2 c3 c4 c5 , e3 ) are extracted and added into the
translation dictionary. We have extracted at most two trans-
......c2 c3 e1 ......c1 c2 c3 c4 c5 e2 ......
lation pairs, which proved to be ample for short queries;
...c2 c3 e1 ......c1 c2 c3 c4 c5 e3 ......
and in fact, in most cases, only one translation pair was
......c2 c3 e1 ......c2 c3 e4 ......
extracted.
...c1 c2 c3 c4 c5 e3 ......c2 c3 e1 ......
...c1 c2 e2 ......c3 c4 e1 ......
4. CHINESE TRANSLATION EXTRACTION
Figure 1: Web text retrieved IN ENGLISH-CHINESE CLIR
Our work builds on previous work in this area, in particu-
3.3 Collection of Co-occurrence Statistics lar that of Chen and Gey [4]. In their translation extraction
process, each English OOV term is submitted as a query to
We then collect co-occurrence information from the data Yahoo!Chinese in traditional Chinese (Big5 encoding). The
we obtained, in the following mannner: top 200 result entries are then segmented into words using
1. Scan for the occurrence of English text. Where English a dictionary-based longest matching method. For each line
text occurs, check the immediately proceeding Chinese containing the English query word or phrase, they consider
text to see if it is a substring of the original Chinese the five Chinese “words” immediately before and after the
query. English word or phrase, and use a weighting scheme to se-
lect the top m of these as the best translation, where m is
2. We collect the frequency of co-occurrence of the En- the number of English terms.
glish text and all Chinese query substrings that appear Since the Chinese word being searched for is currently un-
immediately before the English text. known, there is no information as to how it should be seg-
For each English term ei with frequency f (ei ) we obtained a mented, and thus segmentation errors may occur, leading to
group of associated Chinese query substrings Cij with length an incorrect translation extraction. When this occurs the re-
|Cij | and co-occurrence frequency f (ei , Cij ). Extending the trieval effectiveness is inevitably substantially degraded. In
example from in Figure 1, this information is summarized an example provided [4], 7 out of 17 extracted translations
in Table 1. exhibited this problem. Another problem is that sometimes
the correct Chinese translation may occur some distance
ei f (ei ) Cij |Cij | f (ei , Cij ) ÂkqÇ
Ç Ü0ÁǱ
from the English OOV term, for example “ ...
e1 5 c2 c3 2 4 ·0ÁǱ the Nansha Islands the Spratley Is-
c3 c4 2 1 lands ...”. In contrast to Chen and Gey [4], our technique
e2 2 c1 c2 c3 c4 c5 5 1 uses a larger window size and does not discriminate against
c1 c2 2 1 terms that occur some distance from the English OOV term.
e3 2 c1 c2 c3 c4 c5 5 2 It does not rely on a prior segmentation and is based on
e4 1 c2 c3 2 1 the consideration of every possible Chinese substring occur-
ring adjacent to the English OOV term. In the experiments
Table 1: The frequency of co-occurrence of English terms we have conducted we have found this procedure to be free
and Chinese query substrings from segmentation error in translation extraction. We de-
scribe below our process to automatically extract the Chi-
nese translations of English OOV terms from the Web.
3.4 Translation Selection
We then select the most appropriate translation as follows: 4.1 Extraction of Web Text
First, we extract strings that contain the English OOV
1. Firstly search for longest Chinese substring Ct : term and some Chinese text from the Web.
(a) Search for the Chinese query substrings Ctargets , 1. We use Google to fetch the top 100 Chinese documents
where |Ctargets | = max(|Cij |). with the English OOV term eoov as the query. For each
(b) Extract the English term et and the Chinese query returned document, only the title and the query-biased
substring Ct , where summary are extracted and saved into a local file using
f (et , Ct ) = max(f (ei , Ctargets )). the following command:
164
lynx -source "http://www.google.com.au/ 2. We select the two substrings s1 and s2 from Table 2
search?q=English-OOV-Term&num=100& with the highest frequency. In the event of a tie we
lr=lang_zh-CN&cr=countryCN&hl=zh-CN" use length to discriminate. Provided that one is not a
> local_file substring of the other, both s1 and s2 are added into
the T as shown in Table 3.
2. The file is then filtered to remove HTML tags and
metadata, leaving only the web text. sn |sn | f (sn )
s10 16 5
4.2 Collection of Co-occurrence Statistics s3 8 9
We then collect co-occurrence information from the data. s1 4 13
s2 4 11
1. We scan for the occurrence of eoov and accumulate the
frequency foov . Where eoov occurs, we collect twenty Table 3: Translation candidate set
Chinese characters immediately before as Slef t and
twenty Chinese characters immediately after as Sright . 3. From the translation candidate set T , we exclude any
substring that is already in the translation dictionary
2. Since we want to use a process that does not rely on or does not occur in the document collection.
segmentation, we start by considering all substrings in
Slef t and Sright , and collecting the frequency fn and If we have added more than one translation to the
the length |sn | of each Chinese substring. translation dictionary, we use our disambiguation tech-
nique to select the most appropriate alternative in the
3. We then rank the substrings based on the likelihood given context.
of being the correct translation. We use the rank-
ing function r to select only the top ten strings for 5. EXPERIMENTS AND RESULTS
further consideration. Generally we prefer substrings
that occur more frequently over those that occur less We wish to explore issues related to querying in each di-
frequently, and prefer longer substrings over shorter rection, and therefore we have conducted CLIR experiments
ones. However the natural distribution is that shorter on both Chinese-English and English-Chinese query trans-
strings occur more frequently than longer ones. The lation.
ranking function we have developed is
5.1 Chinese-English CLIR
|sn | fn In this section, we describe the experimental setup for
r =α× + (1 − α) × retrieving English documents using Chinese queries.
L foov
Where L is the maximum length of the substring. From Document Collection and Queries
our experiments, we determined that α = 0.25 pro- We used the English document collection from the NTCIR-
vides the best combination of frequency and length, a 4 1 CLIR task and the associated 50 Chinese training topics.
value that proved to be robust across our experiments. A Chinese topic contains four parts: title, description, nar-
An example is given in Table 2, which we refer to in rative and key words relevant to whole topic. The titles of
the remaining steps. the Chinese topics were translated and used as queries to
retrieve the documents from the English document collec-
sn |sn | f (sn ) r tion. The average length of the titles is 3.3 terms which
s1 4 13 0.598529 approximates the average length of short web queries.
s2 4 11 0.510294 Chinese-English Dictionaries
s3 8 9 0.447059 We used two dictionaries in our experiments: ce3 from Lin-
s4 6 9 0.434559 guistic Data Consortium2 and CEDICT Chinese-English dic-
s5 6 9 0.434559 tionary3 to translate Chinese queries into English. Using
s6 4 9 0.422059 these two dictionaries, we were able to find at least one En-
s7 4 9 0.422059 glish translation for each term in a given Chinese query for
s8 4 7 0.333824 74% of the queries. However, 54% of translated queries con-
s9 4 7 0.333824 tain inappropriate or wrong translations.
s10 16 5 0.320588
Pre-processing
Table 2: The frequency and length of Chinese substrings English stop words were removed from the English docu-
ment collection, and the Porter stemmer [13] was used to re-
duce words to stems. To obtain optimal segmentation accu-
4.3 Translation Selection racy, we combined two segmenters. The first one is compiled
From these top ten substrings in Table 2, we select the by Erik Peterson4 ; the second, Autotag, provided by Chi-
most appropriate translations in the following manner: nese Knowledge Information Processing Group (CKIP) from
Taiwan. The Chinese stop list was manually selected from
1. We select the two substrings s10 and s3 from Table 2
1
with the longest length. In the event of a tie we use http://research.nii.ac.jp/ntcir-ws4/
2
frequency to discriminate. Provided that one is not a http:// www.ldc.upenn.edu/
3
substring of the other, both s10 and s3 are added into http://www.mandarintools.com/cedict.html
4
the translation candidate set T as shown in Table 3. http://www.mandarintools.com/segmenter.html
165
the statistical results we obtained from the given Chinese 30 BabelFish
30
topic file. Each Chinese query was segmented into words Disambiguation
using the segmenters as described above, the Chinese stop Disambiguation + oov
23
words were then removed from each Chinese query. Our
CLIR experiments used the Lucy search engine developed 19 20
20
by the Search Engine Group5 at RMIT University. 16
Experiment Design
11 11 10 10
The following three runs were performed in our Chinese to 10
English CLIR experiments:
1. RUN1: To provide a baseline for our CLIR results, we
used BableFish to “manually” translate each Chinese
query. Kraaij [8] showed successful use of the widely 0
used BableFish 6 translation service based on Systran. Correct Inappropriate Wrong
2. RUN2: Chinese queries were translated using a dictio- Figure 2: Translation Quality Comparison
nary look-up and the disambiguation technique previ-
ously developed [15].
to extract correct translations of Chinese OOV terms, we
3. RUN3: Chinese queries were translated using the dis- decided to use the NTCIR-4 formal run topics to test the
ambiguation technique combined with the English trans- robustness of our technique. The title of each topic is pre-
lation extraction technique described in Section 3. sented a list of comma separated query key terms. From
the 60 titles, we found 25 Chinese OOV terms. Of these,
The latter two experiments allow us to isolate the improve- our technique was able to extract English translations for
ment contributed by each of the techniques. 18 terms (72%). Each of these was the same as or equiv-
Experimental Results and Discussion alent to the provided English translation, as shown in Ta-
As the relevance judgements for this collection are as yet ble 6. While not quite as good as the results we obtained for
unavailable, we were only able to evaluate the translation the NTCIR-3 collection, it clearly demonstrates that in the
quality. We define successful translation as each term in the majority of cases, we are able to automatically extract the
given query being correctly translated. An inappropriate Chinese OOV terms and corresponding English translations
translation occurs when a translation of a query term has without having to rely on manual segmentation.
been found that relates to the original meaning, but is in-
appropriate in the given context. For example, “train” was 5.2 English-Chinese CLIR
selected in Query011, since the more appropriate term “rail- In this section, we describe the experimental setup for
way” was not in the dictionary. A wrong translation occurs retrieving Chinese documents using English queries. The
when a query term has been translated into a term which aim of our work is to find appropriate Chinese translations
has no relation to the original meaning. Such wrong trans- of English OOV terms.
lations never return relevant documents. Figure 2 shows
Document Collection and Queries
the comparison of translation quality of the three runs we
The test collection used in this task is the TREC-5 and
described earlier. When the disambiguation technique [15]
TREC-6 Chinese collection. We used dictionary-based seg-
was applied, the number of successful translations increased
mentation with greedy-parsing to segment the document col-
to 23. When we combined our English translation extrac-
lection. There are 54 English topics of TREC-5 and TREC-6
tion technique with the disambiguation technique, 30 queries
Chinese track. Each topic consists of three sections: title,
were successfully translated.
description, and narrative. Not every query contains En-
Our English translation extraction technique yielded a
glish OOV terms and such queries are obviously not effected
number of translations of previously untranslated terms, as
by the OOV problem. As we are particularly interested in
well as correct translation where previously only inappropri-
OOV problem, we have only selected the queries containing
ate ones existed (see Table 5). For example, we were able to
English OOV terms. In order to mimic typical web queries,
replace “north limit military director film” with “Takeshi ki-
we decided to use only the title of topics as queries. To
tano director film” in Query017 and “non national boundary
maximize the number of test queries, we augmented some
doctor” with “Medecins Sans Frontieres” in Query024. In-
otherwise OOV free topics with English OOV terms from
terestingly, our techniques sometimes produced translations
the description and narrative sections (see Table 7). This
that might be considered more correct than the provided
provided a total of 14 queries.
translation. For example, our system extracted the English
term “La Nina”, which is arguably more correct than the English-Chinese Dictionaries
given translation “anti-El Nino”. Of the 50 queries from We compiled a translation dictionary for our experiments
the training topics 8 were found to contain Chinese OOV using three dictionaries: ec2 and ec2 from Linguistic Data
terms. The translations that our technique found are shown Consortium, and CEDICT Chinese-English dictionary. Our
in Table 5, it can be seen that 7 out of 8 are correct. translation dictionary contains 128, 527 entries including 19,
Although this result is quite encouraging, the final data 081 multi word phrases that were used for phrase identifica-
set is very small. In order to test the ability of our system tion and translation.
5
http://www.seg.rmit.edu.au/ Experiment Design
6
http://world.altavista.com/ The goal of our experiments was to measure the ability of our
166
technique to find appropriate Chinese translations of English Recall C-C E-C CO-C EO-C
OOV terms. Since the relevance judgements for this collec- 0.00 0.4850 0.4639 0.6261 0.5698
tion are available, rather than directly judging translation 0.10 0.2839 0.2361 0.4992 0.3928
quality we instead measure the difference in retrieval effec- 0.20 0.2208 0.1953 0.4160 0.3561
tiveness of the translated queries. In our first run we used 0.30 0.1828 0.1694 0.3431 0.3030
the given Chinese queries without any of the Chinese equiv- 0.40 0.1411 0.1369 0.2985 0.2561
alents of the English OOV terms (C-C), and used this to 0.50 0.1104 0.1055 0.2625 0.2121
compare the performance of the translated English queries 0.60 0.0824 0.0718 0.2255 0.1797
without any English OOV terms (E-C). This allowed us to 0.70 0.0624 0.0409 0.1933 0.1291
test the basic effectiveness of our system using the transla- 0.80 0.0339 0.0276 0.1293 0.0614
tion disambiguation technique, without regard to the OOV 0.90 0.0089 0.0077 0.0828 0.0280
problem. We then manually added the Chinese equivalents 1.00 0.0000 0.0000 0.0082 0.0057
of the English OOV terms to the Chinese queries (CO-C), Avg.P 0.1284 0.1113 0.2610 0.2013
and used this as a further baseline to test the ability of % Mono - 86.68 - 77.13
our technique to automatically find the appropriate Chi-
nese translations of the English OOV terms. This was done Table 4: English-Chinese CLIR results
by adding the English OOV terms to the English queries
and using our system to translate and then retrieve Chinese
documents (EO-C). Our English-Chinese CLIR experiments not being able to translate OOV terms leads to significant
used the MG [14] search engine. loss in retrieval performance for such queries; and (b) our
technique is effective in automatically finding appropriate
Experimental Results and Discussion translations of English OOV terms. Although we were only
Table 4 shows a comparison of the recall precision values able to automatically find appropriate translations for 88%
for the English-Chinese CLIR experimental results. With- (22 out of 25) of English OOV terms, this is a consider-
out any English OOV terms, our translated queries achieved able improvement on previous work in this area [5], where
86.7% of the monolingual result. The underlying perfor- 72% (8 out 11) translations required manual segmentation
mance of our query translation system was effected by the correction in order to be processed correctly.
following two factors: first, some of the English query trans- As was the case with our Chinese-English experiments,
lations provided by the TREC organizers did not precisely the final sample size was somewhat small. To further test the
parallel the original Chinese queries, for example: in CH47,
9F5
robustness of our technique, we collected 50 English OOV
“ (Philippines)” is missing from the English query;
Ã¥
terms from news web sites and applied our Chinese trans-
in CH21, there is no exact English equivalent of “
)
lation extraction technique. However, since our technique
(return to China)” given in the English query; second, requires a corpus to provide disambiguation, we were not
some translations provided by the translation dictionary are able to carry out the final step of our translation extraction
inappropriate in the given context, for example, in CH28,
ð°
procedure, namely using the corpus to select the most ap-
the English term “cellular phone” is translated into “
#Ä
propriate translation from a set of candidate translations.
”, where the given Chinese equivalent is “ ”; addi-
àâ
Table 8 shows set of candidate translations for each English
tionally, in CH47, “impact” is translated into “ ”, where
*
OOV term. It can be seen that 10 of the English OOV
the given Chinese equivalents is “ ”. When the trans- terms have a single correct Chinese translation. A further
lated English OOV terms were added, we achieved 77.1% of 35 English OOV terms have at least one correct translation
the monolingual result. Besides the two factors we discussed in the candidate set, while our technique failed to find cor-
above, we were not able to automatically find the most ap- rect translations in 5 instances. In our experiments using
propriate Chinese translations for every English OOV term. the TREC Chinese data, we arrived at similar situation at
We failed to find the translation of “Sino -Vietnamese” in the penultimate stage, but with the aid of corpus-based dis-
CH46, and thus did not obtain any improvement in retrieval ambiguation were able to select the most appropriate trans-
effectiveness. In CH48, the correct translation of “Kuwaiti” lation in the final phase.
was lost because we did not consider any translation that We were concerned that the English OOV terms extracted
is already in the translation dictionary. This is a short- from Chinese web pages might only pertain specifically to
coming of our extraction algorithm, which assumes that Chinese news. However, we note that we were able to find
no English OOV term shares the same Chinese translation Chinese translations for 96% of English OOV terms from
with any English term in the translation dictionary. We are TREC-5 and TREC-6. This gives us some confidence that
presently working to overcome this problem. In CH49, we the technique should at least have general applicability to
extracted an inappropriate translation of “START treaty”, news based queries. We are currently investigating how well
because there is currently no single dominant accepted Chi-
:>4QÉìÕ :>
the techniques performs in other vocabulary domains.
nese translation for this term; “ ”, “
Ôu4QÉìÕ ” and “!4QÉìÕ ” being used inter-
6. CONCLUSION
changeably.
It can be seen in Table 4 that successful translation of We have looked in detail at the OOV problem as it applies
English OOV terms results in a 80% improvement in av- to Chinese-English and English-Chinese CLIR. We have de-
erage recall precision. Obviously, this is because we have scribed a new technique to detect potential Chinese OOV
specifically selected the queries known to contain English terms based on a HMM and term co-occurrence. We have
OOV terms. While the improvement in a heterogenous set also described improved ways to extract the translation of
of queries would be more modest, the results show that (a) OOV terms from the Web in a way that does not rely on
prior segmentation. We have tested these techniques on sev-
167
eral collections and a set of terms from news articles and
found them to be robust and provide a substantial improve- English OOV terms Chinese Translation
ment in OOV term translation quality. Interestingly, al- Candidate Set
ûÂne/
though the Web is constantly changing, we were able to find
:|ûûÂne/
1 Pervez Musharraf
most OOV terms, many of which related to news events up
óÉ\)®ÌY\/
É\)®ÌY\©/
2 Shiite Muslim
to 10 years ago.
°ÁRBy°/
!°ÁRBy°/
3 Ali al-Sistani
n./cs¸c/
7. REFERENCES 0¡/
4 Paul Bremer
é&¬0¡{<Â/
[1] N. AbdulJaleel and L. Larkey. Statistical transliteration for 5 Avian flu
english-arabic cross language information retrieval. In
G®°/
óÀÆǱK9]/
6 Hambali
Proceedings of the 12th International Conference on
Information and Knowledge Management, pages 139–146, .²úSÆ¢¦Ö/
2003. [:/:>/[:>/
ÒOǱh#ó])ñ/
7 Mad Cow
[2] L. Ballesteros and W. B. Croft. Dictionary-based Methds
for Cross-Lingual Information Retrieval. In Proc. ì/
,K/
8 Nintendo
¡n¡n¾/
International Conference on Database and Expert Systems 9 Carlsberg
Applications, pages 791–801, 1996.
{¡n¡n¾/
[3] L. Ballesteros and W. B. Croft. Resolving Ambiguity for fTUq/°fTUq/
n~/ýn~/
10 Credit Lyonnais
Cross-Language Retrieval. In Research and Development in
y/¥L¼Æ®/
11 Osama bin Laden
Information Retrieval, pages 64–71, 1998.
©Hð/©Hðú/
12 John Howard
[4] A. Chen and F. Gey. Experiments on Cross-language and
\)/É\)³V/
13 Saddam Hussein
Patent Retrieval at NTCIR-3 Workshop. In Proceedings of
É\)³V/
14 Kofi Annan
the 3rd NTCIR Workshop, Japan, 2003.
]Ìj/
ù©®Éã9]/
[5] A. Chen, H. Jiang, and F. Gey. Combining Multiple 15 Hezbollah
Sources for Short Query Translation in Chinese-English 1ù©®Éã9]/
Cross-Language Information Retrieval. In Proceedings of y`b4ÛNÌ/
{w/Ê{w/{w
16 NASD
the 5th International Workshop Information Retrieval with
:/
17 PricewaterhouseCoopers
Asian Languages, pages 17–23, 2000.
0/0)/ãÞj
18 SARS
[6] M. Federico and N. Bertoldi. Statistical Cross-Language
/g¾C
19 Matrix Reloaded
¢/¢q°/
Information Retrieval Using N-Best Query Translations. In 20 Martha Stewart
))sf3´ø:/
Proceedings of the 25th annual international ACM SIGIR 21 Tour de France
Ûhæì
conference on Research and development in information 22 NMD
àâE/>ÓÛÀÓä/
retrieval, pages 167–174, Tampere, Finland, 2002. 23 Mars rover
/NkQ/
24 Blaster
[7] J. Gao, M. Zhou, J. Nie, H. He, and W. Chen. Resolving
t
25 Forest Gump
query translation ambiguity using a decaying co-occurrence
model and syntactic dependence relations. In Proceedings w@Ó/
d?ûì/
26 DJIA
of the 25th annual international ACM SIGIR conference
:/¦:/
27 NAS
on Research and development in information retrieval,
.°/.°/
28 Land Rover
.°/
pages 183–190, Tampere, Finland, 2002. 29 Kim Clijsters
[8] W. Kraaij. TNO at CLEF-2001: Comparing Translation
¼Ey/¼EyøL/
à/
Resources. In Proceedings of the CLEF-2001, pages 79–83, 30 Likud Party
àD\Màü/
31 Lord of the Rings
2001.
[9] S. Kuai, Q. Lu, and L. Yang, editors. A New h®./
h®.
;/
32 Starbucks
English-Chinese Dictionary. Joint Publishing Company,
HongKong, 1st edition, 1984. ¤4+/
lÚ/Ú»/
nØ/
[10] K. L. Kwok. Exploiting a Chinese-English Bilingual 33 Enron
iTÚnØ/
Wordlist for English-Chinese Cross Language Information 34 Abdullah Gul
Retrieval. In Proceedings of the 5th International ®/iTÚnØ/
Workshop Information Retrieval with Asian Languages, #å/£õ®/
Æ/YÛL/7Æ/
35 Olympus
pages 173–179, 2000.
?L¼?
;~Bì/
36 Cappuccino
[11] H. Meng, B. Chen, S. Khudanpur, G. Levow, W. Lo,
L¼?
;~Bì¬/
37 Espresso
D. Oard, P. Schone, K. Tang, H. Wang, and J. Wang.
-É²/û>åy/
0ÄÊ/
Mandarin-English Information (MEI): investigating 38 Mohammad Khatami
fbÂ/LyfbÂ/
translingual speech retrieval. In Computer Speech and 39 Finding Nemo
åõ./åõ./cªøL/
Language, 2003. 40 Arnold Schwarzenegger
}g/
41 Rupert Murdoch
[12] A. Mirna. Using Statistical Term Similarity for Sense
b¦s¸¦Ó/
42 Lancome
Disambiguation in Cross -language Information Retrieval.
[/`)/;Æ¥©Ç/
43 TAFE
Information Retrieval, 2(1):67–68, 2000.
¬/¬/¬+ì/
44 Logitech
[13] M. F. Porter. An Algorithm for Suffix Stripping.
Oä/ÌÆ/ïø:/
45 PDP
Program(Automated Library and Information Systems),
`/
46 Aopen
`øï+ì/
14(3):130–137, 1980. 47 ViewSonic
[14] I. H. Witten, A. Moffat, and T. C. Bell. Managing
`øï+ì/
Gigabytes: Compressing and Indexing Documents and
1ºï®Â/
nð/nð9/
48 Ariel Sharon
Images. Morgan Kaufmann Publishers, Los Altos, CA
LäÆ/úCå/
49 Donald H. Rumsfeld
94022, USA, 2nd edition, 1999. 59 N-Gage
[15] Y. Zhang and P. Vinese. Improved Cross-Language
Information Retrieval via Disambiguation and Vocabulary
Discovery. In Proceedings of the 8th Australasian Table 8: Extracted Chinese translations of English OOV
Document Computing Symposium, pages 3–7, 2003. terms
168
Query ID Chinese query Chinese OOV Terms Extracted English Translations Given English Translations
L¦¦bêFòÖ L¦¦bêFòÖ
0 0
003 Program for Promoting Academic Program for Promoting Academic
Excellence of Universities Excellence of Universities
007 J/ J CI611 China Airlines
009 ¥cR¥h ¥cR¥h ST1 ST1
'_6 '_6
_
010 La Nina Anti-El Nino
El Nino El Nino
''÷ÌSR ''÷ÌS
ÜÄCyCè ÜÄCyCè
016 Kazuhiro Sasaki Kazuhiro Sasaki
Seattle Mariners Seattle Mariners
017 ð ÉsÜ{k ð É Takeshi Kitano Takeshi Kitano
¦Lð°Ú
ýý\ L
020 NISSAN Nissan Mortor Company
RENAULT Renualt
024 Ã) Ã) MEDECINS SANS FRONTIERES Medecins Sans Frontieres
Table 5: NTCIR3: Extracted English translations of Chinese OOV terms
Query ID Chinese query Chinese OOV Terms Extracted English Translations Given English Translations
001 B — Chiutou
002 Õ?4 Õ?4 Johnnie Walker Johnnie Walker
003 vÎûÜ vÎûÜ Embryonic Stem Cell Embryonic Stem Cells
Á¡9 Á¡9

004 Griffith Griffith
²
— Joyner
— Flojo
005 P£b P£b Dioxin Dioxin
006 p,[ p,[ Michael Jordan Michael Jordan
®üjä` ®üjä`
ÚÕ
007 Panama Canal Panama Canal
— Torrijos-Carter Treaty
008 § § Viagra Viagra
012 gÒ gÒ Akira Kurosawa Akira Kurosawa
013 BÁÈ® — Keizo Obuchi
014 ¢¸V¤ ¢¸V¤ environmental hormone Environmental Hormone
021 ÛÖb4 ÛÖ Electronic Commerce Electronic Commercial Transaction
022 åÆð° åÆð° Kia Motors Corp Kia Motors
030 ÄÔb clone Cloning
034 À®Ñ/ — Tokyo provincial governor
038 ²b ²b Nanotechnology Nanotechnology
046 äO£å äO gene Genetic Treatment
048 )Ô85 )Ô85 ISS International Space Station
[o4Ìå [o4Ìå
4Ìå
051 stealth fighter Stealth Fighter
F117 —-
048 :zþ*Æ :zþ*Æ Contactless Smart Cards CSC Contactless SMART Card
µÔQ
Ä
052 — Crown Princess
— Masako
Table 6: NTCIR4: Extracted English translations of Chinese OOV terms
Topic Number English OOV terms Extracted Chinese Translations Given Chinese Translations
1 2 reunification Z²: :
2 2 cross-strait 0Üø Ü
3 3 Daya Bay LÆl LÆl
4 3 Qinshan +Ì +Ì
5 7 Dongsha Islands ÀÂkq ÀÂkq
6 7 Xisha Islands ÜÂkq ÜÂkq
7 7 Spratly Islands Âkq Âkq
8 8 Richter °< Ð<
9 11 Peace-keeping Z\è Z
10 14 HIV ÿ>Ó No translation
11 21 Peng Dingkang ½ ½
12 21 Reunification Z²: :
13 28 PSDN Ib¦ Ib¦
14 31 Castro [ [
15 42 Liaohe River è` è`
16 42 Haihe River 0` 0`
17 42 Huaihe River ` `
18 42 Songhua River T T
19 42 Pearl River ¾T ¾T
20 46 Sino-Vietnamese Not Found ¥Ö
21 47 Pinatubo CFÛÌ CF
22 47 Subic Bay .l .l
23 48 Kuwaiti ) )
24 49 Non-Proliferation Treaty Xj±XÉìÕ Xj±XÉìÕ
25 49 START 4QÉìÕ >>4QÉìÕ
Table 7: TREC-5 and TREC-6: Extracted Chinese translations of English OOV terms
169

Using The Web For Automated Translation Extraction in Cross-Language Information Retrieval

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Using The Web For Automated Translation Extraction in Cross-Language Information Retrieval

Uploaded by

Copyright:

Available Formats

Using the Web for Automated Translation Extraction in

Cross-Language Information Retrieval

Ying Zhang Phil Vines

ABSTRACT from English to Chinese, existing systems are able to detect

1. INTRODUCTION 2. PREVIOUS WORK

Table 5: NTCIR3: Extracted English translations of Chinese OOV terms

Table 6: NTCIR4: Extracted English translations of Chinese OOV terms

You might also like