Mining the Web for Bilingual Text

Philip Resnik*
Dept. of Linguistics/Institute for Advanced Computer Studies
University of Maryland, College Park, MD 20742
resnik@umiacs.umd.edu

* This work was supported by Department of Defense contract
MDA90496C1250, DARPA/ITO Contract N66001-97-C-8540, and a research
grant from Sun Microsystems Laboratories. The author gratefully
acknowledges the comments of the anonymous reviewers, helpful
discussions with Dan Melamed and Doug Oard, and the assistance of
Jeff Allen in the French-English experimental evaluation.

Abstract

STRAND (Resnik, 1998) is a language-independent system for automatic
discovery of text in parallel translation on the World Wide Web. This
paper extends the preliminary STRAND results by adding automatic
language identification, scaling up by orders of magnitude, and
formally evaluating performance. The most recent end-product is an
automatically acquired parallel corpus comprising 2491 English-French
document pairs, approximately 1.5 million words per language.

1 Introduction

Text in parallel translation is a valuable resource in natural
language processing. Statistical methods in machine translation (e.g.
(Brown et al., 1990)) typically rely on large quantities of bilingual
text aligned at the document or sentence level, and a number of
approaches in the burgeoning field of cross-language information
retrieval exploit parallel corpora either in place of or in addition
to mappings between languages based on information from bilingual
dictionaries (Davis and Dunning, 1995; Landauer and Littman, 1990;
Hull and Oard, 1997; Oard, 1997). Despite the utility of such data,
however, sources of bilingual text are subject to such limitations as
licensing restrictions, usage fees, restricted domains or genres, and
dated text (such as 1980's Canadian politics); or such sources simply
may not exist for language pairs of interest.

Although the majority of Web content is in English, it also shows
great promise as a source of multilingual content. Using figures from
the Babel survey of multilinguality on the Web (http://www.isoc.org/),
it is possible to estimate that as of June, 1997, there were on the
order of 63000 primarily non-English Web servers, ranging over 14
languages. Moreover, a follow-up investigation of the non-English
servers suggests that nearly a third contain some useful
cross-language data, such as parallel English on the page or links to
parallel English pages -- the follow-up also found pages in five
languages not identified by the Babel study (Catalan, Chinese,
Hungarian, Icelandic, and Arabic; Michael Littman, personal
communication). Given the continued explosive increase in the size of
the Web, the trend toward business organizations that cross national
boundaries, and high levels of competition for consumers in a global
marketplace, it seems impossible not to view multilingual content on
the Web as an expanding resource. Moreover, it is a dynamic resource,
changing in content as the world changes. For example, Diekema et al.,
in a presentation at the 1998 TREC-7 conference (Voorhees and Harman,
1998), observed that the performance of their cross-language
information retrieval was hurt by lexical gaps such as Bosnia/Bosnie
-- this illustrates a highly topical missing pair in their static
lexical resource (which was based on WordNet 1.5). And Gey et al.,
also at TREC-7, observed that in doing cross-language retrieval using
commercial machine translation systems, gaps in the lexicon (their
example was acupuncture/Akupunktur) could make the difference between
precision of 0.08 and precision of 0.83 on individual queries.

Resnik (1998) presented an algorithm called
STRAND (Structural Translation Recognition for Acquiring Natural
Data) designed to explore the Web as a source of parallel text,
demonstrating its potential with a small-scale evaluation based on the
author's judgments. After briefly reviewing the STRAND architecture
and preliminary results (Section 2), this paper goes beyond that
preliminary work in two significant ways. First, the framework is
extended to include a filtering stage that uses automatic language
identification to eliminate an important class of false positives:
documents that appear structurally to be parallel translations but are
in fact not in the languages of interest. The system is then run on a
somewhat larger scale and evaluated formally for English and Spanish
using measures of agreement with independent human judges, precision,
and recall (Section 3). Second, the algorithm is scaled up more
seriously to generate large numbers of parallel documents, this time
for English and French, and again subjected to formal evaluation
(Section 4). The concrete end result reported here is an automatically
acquired English-French parallel corpus of Web documents comprising
2491 document pairs, approximately 1.5 million words per language
(without markup), containing little or no noise.

2 STRAND Preliminaries

This section is a brief summary of the STRAND system and previously
reported preliminary results (Resnik, 1998).

The STRAND architecture is organized as a pipeline, beginning with a
candidate generation stage that (over-)generates candidate pairs of
documents that might be parallel translations. (See Figure 1.) The
first implementation of the generation stage used a query to the
Altavista search engine to generate pages that could be viewed as
"parents" of pages in parallel translation, by asking for pages
containing one portion of anchor text (the readable material in a
hyperlink) containing the string "English" within a fixed distance of
another anchor text containing the string "Spanish". (The matching
process was case-insensitive.) This generated many good pairs of
pages, such as those pointed to by hyperlinks reading Click here for
English version and Click here for Spanish version, as well as many
bad pairs, such as university pages containing links to English
Literature in close proximity to Spanish Literature.

[Figure: pipeline of three stages: Candidate Pair Generation;
Candidate Pair Evaluation (structural); Candidate Pair Filtering
(language dependent)]
Figure 1: The STRAND architecture

The candidate generation stage is followed by a candidate evaluation
stage that represents the core of the approach, filtering out bad
candidates from the set of generated page pairs. It employs a
structural recognition algorithm exploiting the fact that Web pages in
parallel translation are invariably very similar in the way they are
structured -- hence the 's' in STRAND. For example, see Figure 2.

[Figure: side-by-side screenshots of an English page, "Highlights of
Best Practices: Seminar on Self-Regulation", and its French
counterpart, "Faits saillants des pratiques exemplaires: Séminaire sur
l'autoréglementation"]
Figure 2: Structural similarity in parallel translations on the Web

The structural recognition algorithm first runs both documents through
a transducer that reduces each to a linear sequence of tokens
corresponding to HTML markup elements, interspersed with tokens
representing undifferentiated "chunks" of text. For example, the
transducer would replace the HTML source text <TITLE>ACL'99 Conference
Home Page</TITLE> with the three tokens [BEGIN:TITLE], [Chunk:24], and
[END:TITLE]. The number inside the chunk token is the length of the
text chunk, not counting whitespace; from this point on only the
length of the text chunks is used, and therefore the structural
filtering algorithm is completely language independent.

Given the transducer's output for each document, the structural
filtering stage aligns the two streams of tokens by applying a
standard, widely available dynamic programming algorithm for finding
an optimal alignment between two linear sequences.[1] This alignment
matches identical markup tokens to each other as much as possible,
identifies runs of unmatched tokens that appear to exist only in one
sequence but not the other, and marks pairs of non-identical tokens
that were forced to be matched to each other in order to obtain the
best alignment possible.[2] At this point, if there are too many
unmatched tokens, the candidate pair is taken to be prima facie
unacceptable and immediately filtered out.

Otherwise, the algorithm extracts from the alignment those pairs of
chunk tokens that were matched to each other in order to obtain the
best alignments.[3] It then computes the correlation between the
lengths of these non-markup text chunks. As is well known, there is a
reliably linear relationship in the lengths of text translations --
small pieces of source text translate to small pieces of target text,
medium to medium, and large to large. Therefore we can apply a
standard statistical hypothesis test, and if p < .05 we can conclude
that the lengths are reliably correlated and accept the page pair as
likely to be translations of each other. Otherwise, this candidate
page pair is filtered out.[4]

In the preliminary evaluation, I generated a test set containing 90
English-Spanish candidate pairs, using the candidate generation stage
as just described. I evaluated these candidates by hand, identifying
24 as true translation pairs.[5] Of these 24, STRAND identified 15 as
true translation pairs, for a recall of 62.5%. Perhaps more important,
it only generated 2 additional translation pairs incorrectly, for a
precision of 15/17 = 88.2%.

[1] Known to many programmers as diff.

[2] An anonymous reviewer observes that diff has no preference for
aligning chunks of similar lengths, which in some cases might lead to
a poor alignment when a good one exists. This could result in a
failure to identify true translations and is worth investigating
further.

[3] Chunk tokens with exactly equal lengths are excluded; see (Resnik,
1998) for reasons and other details of the algorithm.

[4] The level of significance (p < .05) was the initial selection
during algorithm development, and never changed. This, the
unmatched-tokens threshold for prima facie rejection due to mismatches
(20%), and the maximum distance between hyperlinks in the generation
stage (10 lines), are parameters of the algorithm that were determined
during development using a small amount of arbitrarily selected
French-English data downloaded from the Web. These values work well in
practice and have not been varied systematically; their values were
fixed in advance of the preliminary evaluation and have not been
changed since.

[5] The complete test set and my judgments for this preliminary
evaluation can be found at http://umiacs.umd.edu/~resnik/amta98/.

3 Adding Language Identification

In the original STRAND architecture, additional filtering stages were
envisaged as possible (see Figure 1), including such
language-dependent processes as automatic language identification and
content-based comparison of structurally aligned document segments
using cognate matching or existing bilingual dictionaries. Such stages
were initially avoided in order to keep the system simple,
lightweight, and independent of linguistic resources. How-
ever, in conducting an error analysis for the preliminary evaluation,
and further exploring the characteristics of parallel Web pages, it
became evident that such processing would be important in addressing
one large class of potential false positives. Figure 3 illustrates: it
shows two documents that are generated by looking for "parent" pages
containing hyperlinks to English and Spanish, which pass the
structural filter with flying colors. The problem is potentially acute
if the generation stage happens to yield up many pairs of pages that
come from on-line catalogues or other Web sites having large numbers
of pages with a conventional structure.

Figure 3: Structurally similar pages that are not translations

There is, of course, an obvious solution that will handle most such
cases: making sure that the two pages are actually written in the
languages they are supposed to be written in. In order to filter out
candidate page pairs that fail this test, statistical language
identification based on character n-grams was added to the system
(Dunning, 1994). Although this does introduce a need for
language-specific training data for the two languages under
consideration, it is a very mild form of language dependence: Dunning
and others have shown that when classifying strings on the order of
hundreds or thousands of characters, which is typical of the
non-markup text in Web pages, it is possible to discriminate languages
with accuracy in the high 90% range for many or most language pairs
given as little as 50k characters per language as training material.

For the language filtering stage of STRAND, the following criterion
was adopted: given two documents d1 and d2 that are supposed to be in
languages L1 and L2, keep the document pair iff Pr(L1|d1) > Pr(L2|d1)
and Pr(L2|d2) > Pr(L1|d2). For English and Spanish, this translates as
a simple requirement that the "English" page look more like English
than Spanish, and that the "Spanish" page look more like Spanish than
English. Language identification is performed on the plain-text
versions of the pages. Character 5-gram models for languages under
consideration are constructed using 100k characters of training data
from the European Corpus Initiative (ECI), available from the
Linguistic Data Consortium (LDC).

In a formal evaluation, STRAND with the new language identification
stage was run for English and Spanish, starting from the top 1000 hits
yielded up by Altavista in the candidate generation stage, leading to
a set of 913 candidate pairs. A test set of 179 items was generated
for annotation by human judges, containing:

• All the pairs marked GOOD (i.e. translations) by STRAND (61); these
  are the pairs that passed both the structural and language
  identification filter.

• All the pairs filtered out via language identification (73)

• A random sample of the pairs filtered out structurally (45)

It was impractical to manually evaluate all pairs filtered out
structurally, owing to the time required for judgments and the desire
for two independent judgments per pair in order to assess inter-judge
reliability.

The two judges were both native speakers of Spanish with high
proficiency in English, neither previously familiar with the project.
They worked independently, using a Web browser to access test pairs in
a fashion that allowed them to view pairs side by side. The judges
were told they were helping to evaluate a system that identifies pages
on the Web that are translations of each other, and were instructed to
make decisions according to the following criterion:

    Is this pair of pages intended to show the same material to two
    different users, one a reader of English and the other a reader
    of Spanish?

The phrasing of the criterion required some consideration, since in
previous experience with human judges and translations I have found
that judges are frequently unhappy with the quality of the
translations they are looking at. For present purposes it was required
neither that the document pair represent a perfect translation
(whatever that might be), nor even necessarily a good one: STRAND was
being tested not on its ability to determine translation quality,
which might or might not be a criterion for inclusion in a parallel
corpus, but rather its ability to facilitate the task of locating page
pairs that one might reasonably include in a corpus undifferentiated
by quality (or potentially post-filtered manually).

The judges were permitted three responses:

• Yes: translations of each other

• No: not translations of each other

• Unable to tell

When computing evaluation measures, page pairs classified in the third
category by a human judge, for whatever reason, were excluded from
consideration.

Comparison          N     Pr(Agree)   κ
J1, J2:             106   0.85        0.70
J1, STRAND:         165   0.91        0.79
J2, STRAND:         113   0.81        0.61
J1 ∩ J2, STRAND:    90    0.91        0.82

Table 1: English-Spanish evaluation

Table 1 shows agreement measures between the two judges, between
STRAND and each individual judge, and the agreement between STRAND and
the intersection of the two judges' annotations -- that is, STRAND
evaluated against only those cases where the two judges agreed, which
are therefore the items we can regard with the highest confidence. The
table also shows Cohen's κ, an agreement measure that corrects for
chance agreement (Carletta, 1996); the most important κ value in the
table is the value of 0.7 for the two human judges, which can be
interpreted as sufficiently high to indicate that the task is
reasonably well defined. (As a rule of thumb, classification tasks
with κ < 0.6 are generally thought of as suspect in this regard.) The
value of N is the number of pairs that were included, after excluding
those for which the human judgment in the comparison was undecided.

Since the cases where the two judges agreed can be considered the most
reliable, these were used as the basis for the computation of recall
and precision. For this reason, and because the human-judged set
included only a sample of the full set evaluated by STRAND, it was
necessary to extrapolate from the judged (by both judges) set to the
full set in order to compute recall/precision figures; hence these
figures are reported as estimates. Precision is estimated as the
proportion of pages judged GOOD by STRAND that were also judged to be
good (i.e. "yes") by both judges -- this figure is 92.1%. Recall is
estimated as the proportion of pairs that should have been judged GOOD
by STRAND (i.e. that received a "yes" from both judges) that STRAND
indeed marked GOOD -- this figure is 47.3%.

These results can be read as saying that of every 10 document pairs
included by STRAND in a parallel corpus acquired fully automatically
from the Web, fewer than 1 pair on average was included in error.
Equivalently, one could say that the resulting corpus contains only
about
8% noise. Moreover, at least for the confidently judged cases, STRAND
is in agreement with the combined human judgment more often than the
human judges agree with each other. The recall figure indicates that
for every true translation pair it accepts, STRAND must also
incorrectly reject a true translation pair. Alternatively, this can be
interpreted as saying that the filtering process has the system
identifying about half of the pairs it could in principle have found
given the candidates produced by the generation stage. Error analysis
suggests that recall could be increased (at a possible cost to
precision) by making structural filtering more intelligent; for
example, ignoring some types of markup (such as italics) when
computing alignments. However, I presume that if the number M of
translation pairs on the Web is large, then half of M is also large.
Therefore I focus on increasing the total yield by attempting to bring
the number of generated candidate pairs closer to M, as described in
the next section.

4 Scaling Up Candidate Generation

The preliminary experiments and the new experiment reported in the
previous section made use of the Altavista search engine to locate
"parent" pages, pointing off to multiple language versions of the same
text. However, the same basic mechanism is easily extended to locate
"sibling" pages: cases where the page in one language contains a link
directly to the translated page in the other language. Exploration of
the Web suggests that parent pages and sibling pages cover the major
relationships between parallel translations on the Web. Some sites
with bilingual text are arranged according to a third principle: they
contain a completely separate monolingual sub-tree for each language,
with only the single top-level home page pointing off to the root page
of a single-language version of the site. As a first step in
increasing the number of generated candidate page pairs, STRAND was
extended to permit both parent and sibling search criteria. Relating
monolingual sub-trees is an issue for future work.

In principle, using Altavista queries for the candidate generation
stage should enable STRAND to locate every page pair in the Altavista
index that meets the search criteria. This is likely to be an upper
bound on the candidates that can be obtained without building a Web
crawler dedicated to the task, since one of Altavista's distinguishing
features is the size of its index. In practice, however, the user
interface for Altavista appears to limit the number of hits returned
to about the first 1000. It was possible to break this barrier by
using a feature of Altavista's "Advanced Search": including a range of
dates in a query's selection criteria. Having already redesigned the
STRAND generation component to permit multiple queries (in order to
allow search for both parent and sibling pages), each query in the
query set was transformed into a set of mutually exclusive queries
based on a one-day range; for example, one version of a query would
restrict the result to pages last updated on 30 November 1998, the
next 29 November 1998, and so forth.

Although the solution granularity was not perfect -- searches for some
days still bumped up against the 1000-hit maximum -- use of both
parent and sibling queries with date-range restricted queries
increased the productivity of the candidate generation component by an
order of magnitude. The scaled-up system was run for English-French
document pairs in late November, 1998, and the generation component
produced 16763 candidate page pairs (with duplicates removed), an
18-fold increase over the previous experiment. After eliminating 3153
page pairs that were either exact duplicates or irretrievable,
STRAND's structural filtering removed 9820 candidate page pairs, and
the language identification component removed another 414. The
remaining pairs identified as GOOD -- i.e. those that STRAND
considered to be parallel translations -- comprise a parallel corpus
of 3376 document pairs.

A formal evaluation, conducted in the same fashion as the previous
experiment, yields the agreement data in Table 2. Using the cases
where the two human judgments agree as ground truth, precision of the
system is estimated at 79.5%, and recall at 70.3%.

Comparison          N     Pr(Agree)   κ
J1, J2:             267   0.98        0.95
J1, STRAND:         273   0.84        0.65
J2, STRAND:         315   0.85        0.63
J1 ∩ J2, STRAND:    261   0.86        0.68

Table 2: English-French evaluation
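The agreement tables pair a raw agreement rate, Pr(Agree), with
Cohen's kappa, which discounts the agreement expected by chance:
kappa = (Pr(Agree) - Pr(Chance)) / (1 - Pr(Chance)). The following is
a minimal Python sketch of that computation from a confusion matrix of
judgment counts. The function name cohen_kappa and the counts are
illustrative assumptions only; the confusion matrices behind the
tables are not reported in this paper, so the numbers below are not
data from the evaluation.

```python
def cohen_kappa(confusion):
    """Return (observed agreement, Cohen's kappa) for a square matrix of counts.

    confusion[i][j] is the number of items that rater A put in category i
    and rater B put in category j.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of items on the diagonal.
    p_agree = sum(confusion[i][i] for i in range(k)) / n
    # Chance agreement: product of the two raters' marginal proportions,
    # summed over categories.
    row_prop = [sum(confusion[i]) / n for i in range(k)]
    col_prop = [sum(confusion[i][j] for i in range(k)) / n for j in range(k)]
    p_chance = sum(r * c for r, c in zip(row_prop, col_prop))
    return p_agree, (p_agree - p_chance) / (1 - p_chance)


# Hypothetical counts, NOT taken from the paper:
# rows = rater A (yes, no), columns = rater B (yes, no).
counts = [[45, 5],
          [10, 40]]
p_agree, kappa = cohen_kappa(counts)
print(f"Pr(Agree) = {p_agree:.2f}, kappa = {kappa:.2f}")
```

With these made-up counts the two raters agree on 85 of 100 items, and
half of that agreement would be expected by chance, which is why kappa
is noticeably lower than the raw agreement rate.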

A look at STRAND's errors quickly identifies the major source of error
as a shortcoming of the language identification module: its implicit
assumption that every document is either in English or in French. This
assumption was violated by a set of candidates in the test set, all
from the same site, that pair Dutch pages with French. The language
identification criterion adopted in the previous section requires only
that the Dutch pages look more like English than like French, which in
most cases is true. This problem is easily resolved by training the
existing language identification component with a wider range of
languages, and then adopting a stricter filtering criterion requiring
that Pr(English|d1) > Pr(L|d1) for every language L in that range, and
that d2 meet the corresponding requirement for French.[6] Doing so
leads to the results in Table 3.

Comparison          N     Pr(Agree)   κ
J1, J2:             267   0.98        0.95
J1, STRAND:         273   0.88        0.70
J2, STRAND:         315   0.88        0.69
J1 ∩ J2, STRAND:    261   0.90        0.75

Table 3: English-French evaluation with stricter language ID criterion

This translates into an estimated 100% precision against 64.1% recall,
with a yield of 2491 document pairs, approximately 1.5 million words
per language as counted after removal of HTML markup. That is, with a
reasonable though admittedly post-hoc revision of the language
identification criterion, comparison with human subjects suggests the
acquired corpus is non-trivial and essentially noise free, and
moreover, that the system excludes only a third of the pages that
should have been kept. Naturally this will need to be verified in a
new evaluation on fresh data.

[6] Language ID across a wide range of languages is not difficult to
obtain. E.g. see the 13-language set of the freely available CMU
stochastic language identifier (.../~dougb/ident.html), the
18-language set of the Sun Language ID Engine
(http://.../research/ila/demo/index.html), or the 31-language set of
the XRCE Language Identifier (.../mltt/Tools/guesser.html). Here I
used the language ID method of the previous section trained with
profiles of Danish, Dutch, English, French, German, Italian,
Norwegian, Portuguese, Spanish, and Swedish.

5 Conclusions

This paper places acquisition of parallel text from the Web on solid
empirical footing, making a number of contributions that go beyond the
preliminary study. The system has been extended with automated
language identification, and scaled up to the point where a
non-trivial parallel corpus of English and French can be produced
completely automatically from the World Wide Web. In the process, it
was discovered that the most lightweight use of language
identification, restricted to just the language pair of interest,
needed to be revised in favor of a strategy that includes
identification over a wide range of languages. Rigorous evaluation
using human judges suggests that the technique produces an extremely
clean corpus -- noise estimated at between 0 and 8% -- even without
human intervention, requiring no more resources per language than a
relatively small sample of text used to train automatic language
identification.

Two directions for future work are apparent. First, experiments need
to be done using languages that are less common on the Web. Likely
first pairs to try include English-Korean, English-Italian, and
English-Greek. Inspection of Web sites -- those with bilingual text
identified by STRAND and those without -- suggests that the strategy
of using Altavista to generate candidate pairs could be improved upon
significantly by adding a true Web crawler to "mine" sites where
bilingual text is known to be available, e.g. sites uncovered by a
first pass of the system using the Altavista engine. I would
conjecture that for English-French there is an order of magnitude more
bilingual text on the Web than that uncovered in this early stage of
research.

A second natural direction is the application of Web-based parallel
text in applications such as lexical acquisition and cross-language
information retrieval -- especially since a side-effect of the core
STRAND algorithm is aligned "chunks", i.e. non-markup segments found
to correspond to each other based on alignment of the markup.
Preliminary experiments using even small amounts of these data suggest
that standard techniques, such as cross-language lexical association,
can uncover useful data.
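To make the notion of aligned chunks concrete, the following is a
minimal Python sketch of the markup-based alignment summarized in
Section 2. It is an illustration under stated assumptions, not the
STRAND implementation: Python's html.parser stands in for the
transducer, difflib.SequenceMatcher stands in for diff, and the names
MarkupTokenizer, tokenize, and aligned_chunks as well as the two
sample pages are hypothetical. The unmatched-token threshold, the
exclusion of equal-length chunks, and the correlation test on chunk
lengths are all omitted.

```python
import difflib
from html.parser import HTMLParser


class MarkupTokenizer(HTMLParser):
    """Hypothetical stand-in for the transducer: HTML -> linear token sequence."""

    def __init__(self):
        super().__init__()
        self.tokens = []  # token strings such as "[BEGIN:TITLE]" or "[Chunk:24]"
        self.texts = []   # chunk text for chunk tokens, None for markup tokens

    def handle_starttag(self, tag, attrs):
        self._add(f"[BEGIN:{tag.upper()}]", None)

    def handle_endtag(self, tag):
        self._add(f"[END:{tag.upper()}]", None)

    def handle_data(self, data):
        length = len("".join(data.split()))  # chunk length, ignoring whitespace
        if length:
            self._add(f"[Chunk:{length}]", " ".join(data.split()))

    def _add(self, token, text):
        self.tokens.append(token)
        self.texts.append(text)


def tokenize(html):
    """Return (tokens, texts) for one HTML document."""
    parser = MarkupTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens, parser.texts


def aligned_chunks(html_a, html_b):
    """Pair up text chunks that the token alignment places opposite each other."""
    tokens_a, texts_a = tokenize(html_a)
    tokens_b, texts_b = tokenize(html_b)
    matcher = difflib.SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "equal" spans match identical tokens; same-sized "replace" spans are
        # treated here (an assumption) as non-identical tokens matched to
        # each other. Insertions and deletions are left unmatched.
        if tag in ("equal", "replace") and (i2 - i1) == (j2 - j1):
            for i, j in zip(range(i1, i2), range(j1, j2)):
                if texts_a[i] is not None and texts_b[j] is not None:
                    pairs.append((texts_a[i], texts_b[j]))
    return pairs


# Hypothetical sample pages with identical markup structure.
english = ("<html><head><title>ACL'99 Conference Home Page</title></head>"
           "<body><h1>Welcome</h1><p>Hello world</p></body></html>")
french = ("<html><head><title>Page d'accueil de la conférence ACL'99</title></head>"
          "<body><h1>Bienvenue</h1><p>Bonjour le monde</p></body></html>")

print(tokenize(english)[0][2:5])
for pair in aligned_chunks(english, french):
    print(pair)
```

Because the two sample pages share the same markup skeleton, each text
chunk in one page lines up with exactly one chunk in the other, and
the resulting pairs are the kind of segment-level correspondences that
lexical acquisition methods could take as input.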

References

P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek,
R. Mercer, and P. Roossin. 1990. A statistical approach to machine
translation. Computational Linguistics,

Jean Carletta. 1996. Assessing agreement on classification tasks: the
Kappa statistic. Computational Linguistics, 22(2):249-254, June.

Mark Davis and Ted Dunning. 1995. A TREC evaluation of query
translation methods for multi-lingual text retrieval. In Fourth Text
Retrieval Conference (TREC-4). NIST.

Ted Dunning. 1994. Statistical identification of language. Computing
Research Laboratory Technical Memo MCCS 94-273, New Mexico State
University, Las Cruces, New Mexico.

David A. Hull and Douglas W. Oard. 1997. Symposium on cross-language
text and speech retrieval. Technical Report SS-97-04, American
Association for Artificial Intelligence, Menlo Park, CA, March.

Thomas K. Landauer and Michael L. Littman. 1990. Fully automatic
cross-language document retrieval using latent semantic indexing. In
Proceedings of the Sixth Annual Conference of the UW Centre for the
New Oxford English Dictionary and Text Research, pages 31-38, UW
Centre for the New OED and Text Research, Waterloo, Ontario, October.

Douglas W. Oard. 1997. Cross-language text retrieval research in the
USA. In Third DELOS Workshop. European Research Consortium for
Informatics and Mathematics

Philip Resnik. 1998. Parallel strands: A preliminary investigation
into mining the web for bilingual text. In Proceedings of the Third
Conference of the Association for Machine Translation in the Americas,
AMTA-98, in Lecture Notes in Artificial Intelligence, 1529, Langhorne,
PA, October 28-31.

E. M. Voorhees and D. K. Harman. 1998. The seventh Text REtrieval
Conference (TREC-7). NIST special publication, Gaithersburg, Maryland,
November 9-11. http://trec.nist.gov/pubs.html.