Some of My Best Friends Are Linguists
DOI 10.1007/s10579-005-2693-4
Springer 2005
1. Introduction
This article concerns the relationship between linguistics and the work carried
out during 1972–1993 at IBM Research in automatic speech recognition (ASR)
and natural language processing (NLP). Many statements I will make will be
incomplete: I am not that conversant with the literature. I apologize to those
whom I may offend. Conceivably it would have been much better to leave things
alone and stay silent. Hopefully this journal will be willing to devote some of its
pages to Letters to the Editor to correct the record or air opposing views.
The starting point is the following quote attributed to me: "Whenever I fire
a linguist our system performance improves." I have hoped for many years that
this quote was only apocryphal, but at least two reliable witnesses have
recently convinced me that I really stated this publicly in a conference talk
(Jelinek, 1998). Accepting then that I really said it, I must first of all affirm
that I never fired anyone, and a linguist least of all.
So my motivation is defensive: to show that neither I nor my colleagues at
IBM ever had any hostility to linguists or linguistics. In fact, we all hoped
that linguists would provide us with needed help. We were never reluctant
to include linguistic knowledge or intuition in our systems: if we didn't
succeed, it was because we didn't find an efficient way to include it.
FREDERICK JELINEK
Most recognizers behave not like scientists, but like mad inventors or
untrustworthy engineers.
. . . performance will continue to be very limited unless the recognizing
device understands what is being said with something of the facility of a
native speaker (that is, better than a foreigner fluent in the language).
Any application of the foregoing discussion to work in the general area of
pattern recognition is left as an exercise for the reader.
3. The NLP/ASR Situation in the 1970s
In the 1970s NLP and ASR research was dominated by an artificial intelligence approach. Programs were rule-based; expert systems were beginning to
take over. Noam Chomsky, a very respected (and rightly so) though controversial figure, felt that statistics had no place in linguistic research. His
demonstration that language is not finite state (Chomsky, 1962) was considered decisive, and its applicability to NLP was over-estimated. The purest
linguists based their work on self-constructed examples, not on the prevalence of phenomena in observed data.

As already mentioned, the strict distinction between training and test data was
frequently ignored. Grammars were being written that applied to fewer than
a dozen verbs.
Our ASR group at IBM was composed mostly of engineers and physicists.
Only 3 or 4 people out of 10 had any previous experience with speech. None
had graduate training in that field. But several of us had a background in
Information Theory, and that influenced our thinking. Because of that
background it was natural for us to come up with the Communication
Theory formulation of ASR (see Figure 2). Our creed was as follows:

- The structure of models and their parametrization will be determined by
linguistic intuition.
- The models' complexity will be limited by our ability to devise algorithms
capable of estimating robust parameter values from available data.
[Figure 2. The Communication Theory formulation of ASR: the speaker's mind produces a word string W which passes through a speech producer and an acoustic processor; a linguistic decoder outputs the estimate Ŵ. Speech producer and acoustic processor together constitute the acoustic channel; acoustic processor and linguistic decoder constitute the speech recognizer.]
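The article does not write the formulation out, but the Communication Theory view of Figure 2 is conventionally expressed as a noisy channel decoding problem (a standard rendering of the IBM formulation, not a quotation from the text):

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid A) \;=\; \arg\max_{W}\, P(W)\, P(A \mid W)
```

where A is the output of the acoustic processor, P(W) is the language model, and P(A | W) is the model of the acoustic channel.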
The second point accounted for our relatively primitive modeling of the
language translation problem (see below). We were always accused of
reluctance to use linguistic information, and when we did, remarks were
made like "The IBM group is coming around. They admit the need for a
linguistic approach." Well, we always wanted linguistics, only we didn't
know how to incorporate it.
What we did realize is that for most distinguished linguists the NLP/ASR
problem was of no direct research interest. Of course, there were other linguists, such as Geoffrey Leech or Henry Kucera, who were very interested in
data, and as soon as we could we sought cooperation with them.

It was in any case clear that we should seek linguists' advice about the creation of resources to be exploited by NLP/ASR.
4. Availability of Linguistic Resources
Linguistic resources pre-dated the modern statistical, data-driven approach
to NLP/ASR. I will mention resources in the order in which we naive engineers working on ASR discovered them. First was the Brown Corpus
(Francis and Kucera, 1982), which had existed since 1967. It was rather small by
today's standards, 1 million words, but it contained a selection of genres and
was annotated by parts of speech. That got us interested in automatic tagging: we thought of it as an opportunity to improve the ASR language model.
So Bahl and Mercer invented the HMM approach to tagging (Bahl and Mercer,
1976). It was quite a disappointment to us that even though the accuracy of
our tagger was quite high (about 96%), we found no effective way to exploit
it in a language model.
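The HMM approach treats the tags as hidden states emitting words, and recovers the most probable tag sequence by dynamic programming. The toy sketch below shows the idea; the tag set and all probability tables are invented for illustration and are not taken from Bahl and Mercer:

```python
# Minimal HMM part-of-speech tagger sketch: tags are hidden states, words
# are emissions, and Viterbi search finds the most probable tag path.

def viterbi(words, tags, trans, emit, start):
    """Return the most probable tag sequence for `words`."""
    # best[t] = (probability of best path ending in tag t, that path)
    best = {t: (start[t] * emit[t].get(words[0], 0.0), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            p, path = max(
                (best[s][0] * trans[s][t] * emit[t].get(w, 0.0), best[s][1])
                for s in tags
            )
            new[t] = (p, path + [t])
        best = new
    return max(best.values())[1]

# Invented toy model: initial, transition, and emission probabilities.
tags = ["DET", "NOUN", "VERB"]
start = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans = {
    "DET":  {"DET": 0.0, "NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
emit = {
    "DET":  {"the": 1.0},
    "NOUN": {"dog": 0.5, "walks": 0.2, "park": 0.3},
    "VERB": {"walks": 0.8, "dog": 0.2},
}

print(viterbi(["the", "dog", "walks"], tags, trans, emit, start))
```

A tagger of the period would estimate the transition and emission tables from an annotated corpus such as Brown, with smoothing for unseen words; the search itself is exactly this recursion.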
After some search we became aware that Geoff Leech and Roger Garside
at Lancaster had attempted automatic tagging by rule (Garside, 1987, 1993). And
this led to our discovery of the existence of the Lancaster-Oslo/Bergen
corpus (Johansson et al., 1978) associated with the English grammar books
by Quirk et al. (1985).
Actually, by 1985 we had started looking around for a possible new application
of the statistical methods we had developed for ASR. We hit on the possibility of
machine translation (MT). And, of course, we thought that grammar would be
important and we wanted to induce it from annotated data. That's when we
found the Lancaster treebank constructed in the years 1983–1986 under the
leadership of Geoff Sampson and Geoff Leech (Garside et al., 1987; Sampson,
1987). Unfortunately, as I recall, it was hard to obtain the rights to use this
treebank, and so we commissioned the University of Lancaster to create a new
treebank just for our own use.
We believed then that when it comes to data for parameter estimation,
quantity would beat quality (within reason, of course). And we thought that
the treebank should be based on the solid intuition carried in the mind of
every native speaker. So the resulting 3 million word treebank (Leech and
Garside, 1991) was constructed by 10 Lancaster housewives guided by Leech,
who did finally write quite a thick annotation manual. At the end, the
housewives became experts . . .
- Eric Brill, a graduate student at U Penn, invented the concept of transformation-based learning, which he applied first to part-of-speech tagging
(Brill, 1992).
- Chaired by Ezra Black, a group developed the PARSEVAL guidelines
for determining parsing accuracy (Black et al., 1991).
- Spatter, the first statistical, history-based parser, was implemented by
David Magerman (Black et al., 1992). It built up the parse left-to-right
with the help of questions embedded in a decision tree.
- Spatter phrases were annotated by lexical headwords. In order for
Spatter to learn its moves from the treebank, it was necessary to provide
its parses with phrase headwords. So Ezra Black developed the headword
percolation rules later used by many projects.
I think that it is accurate to say that this initial effort, supported by NSF,
eventually resulted in the parsers developed by Collins (1996).
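PARSEVAL-style scoring compares a proposed parse against the treebank parse through the constituent brackets they share. The sketch below is a minimal illustration under assumed conventions (nested-tuple trees, labeled brackets as a set); the actual guidelines add refinements such as crossing-bracket counts and treatment of duplicate brackets:

```python
# PARSEVAL-style labeled bracket scoring: each parse is reduced to a set of
# (label, start, end) constituents and the two sets are compared.

def spans(tree, i=0):
    """Return ([(label, start, end), ...], end_index) for a nested-tuple tree."""
    if isinstance(tree, str):          # a leaf (word) covers one position
        return [], i + 1
    label, *children = tree
    out, j = [], i
    for child in children:
        child_spans, j = spans(child, j)
        out.extend(child_spans)
    return [(label, i, j)] + out, j

def parseval(gold, candidate):
    """Return (precision, recall) over labeled constituent brackets."""
    g = set(spans(gold)[0])
    c = set(spans(candidate)[0])
    match = len(g & c)
    return match / len(c), match / len(g)

# Invented example: the candidate adds one extra V bracket over "chased".
gold = ("S", ("NP", "the", "dog"), ("VP", "chased", ("NP", "a", "cat")))
cand = ("S", ("NP", "the", "dog"), ("VP", ("V", "chased"), ("NP", "a", "cat")))
print(parseval(gold, cand))  # extra bracket costs precision, not recall
```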
7. Machine Translation
As mentioned earlier, we embarked on MT in 1986 when we sought a new
area to which to apply our statistical, self-organized techniques. Besides, we
had 15 years of ASR work behind us, and those who switched were also
attracted by the change as well as by the possibility of picking some low
hanging fruit.
We had two ideas: to use the noisy channel paradigm to formulate the
problem (see Figure 3), and to base our learning on parallel texts. Naturally,
as the source language we wanted to use one not too different from English.
So we were very fortunate when we discovered the Canadian Hansards text
that transcribed in English and French the debates of the Ottawa parliament.
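The learning-from-parallel-text idea can be illustrated with a stripped-down version of the simplest of the later IBM translation models (in the spirit of Brown et al., 1993, but omitting the NULL word and length model): EM estimation of word translation probabilities t(f | e) from sentence pairs. The three-sentence corpus is invented for illustration:

```python
from collections import defaultdict

# Tiny invented parallel corpus of (English, French) sentence pairs.
pairs = [
    ("the house".split(), "la maison".split()),
    ("the book".split(), "le livre".split()),
    ("a book".split(), "un livre".split()),
]

e_vocab = {e for es, _ in pairs for e in es}
f_vocab = {f for _, fs in pairs for f in fs}

# Initialize t(f|e) uniformly over the French vocabulary.
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}

for _ in range(20):                       # EM iterations
    count = defaultdict(float)            # expected counts c(f, e)
    total = defaultdict(float)            # expected counts c(e)
    for es, fs in pairs:
        for f in fs:                      # E-step: split each French word's
            z = sum(t[(f, e)] for e in es)   # mass over candidate alignments
            for e in es:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    for (f, e) in t:                      # M-step: renormalize per English word
        t[(f, e)] = count[(f, e)] / total[e] if total[e] else 0.0

# Shared contexts pull the right pairs together: "livre" wins for "book".
print(max(f_vocab, key=lambda f: t[(f, "book")]))
```

The point of the toy run is the one Brown et al. exploited at scale: co-occurrence across many sentence pairs, plus renormalization, is enough to pull out word translations with no linguistic annotation at all.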
As to our problem formulation, we were later somewhat surprised when it
was revealed to be almost common sense. In fact, it was probably Bob
Mercer who found the following quotations in an article by Weaver (1955):

When I look at an article in Russian I say: "This is really written in English
but it has been coded in some strange symbols. I will now proceed to decode
it."3

. . . the matter is probably absolutely basic, namely the statistical character of the problem.
[Figure 3. The noisy channel formulation of MT: the foreign writer transforms the intention behind his "English" into the foreign language; the machine translator inverts that transformation to recover the writer's "English".]
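Applied to translation, the noisy channel paradigm amounts to the following decoding problem (again a standard rendering rather than a quotation from the text):

```latex
\hat{E} \;=\; \arg\max_{E} P(E \mid F) \;=\; \arg\max_{E}\, P(E)\, P(F \mid E)
```

where F is the observed French sentence, P(E) is a model of English, and P(F | E) is the translation (channel) model.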
The leaders and driving forces behind the project were Bob Mercer and
Peter Brown, although many others were also importantly involved. Before
we really got down to business, four of us took a fast course in French. Not
that we believed it, but the organization that offered the course claimed that
it would teach us French in 2 weeks!
The result of our endeavor was a series of systems (Brown et al., 1990, 1993)
that participated in a 1991 DARPA project in which we had two competitors:
Dragon Systems (Lingstat), and the combined forces of NMSU, CMU, and ISI
(Pangloss). Our own system was called Candide; I don't know why.

The other two teams took advantage of linguistic knowledge; we used it almost
not at all. And again, wise people said profoundly: "What we need to do is to
combine linguistics with statistics." As if we had not tried as hard as we could
to include in our statistical frame the linguistics we could get hold of! Why
else did we put in the effort to create good parsers?
Besides, we did have linguistic components: morphology and word sense
disambiguation (Brown et al., 1991). And we performed preprocessing in
which we attempted to rearrange the words of the French source sentence
into a more English-like progression.
8. Conclusion
The IBM group (actually, its successor) continues to exist and carries out
outstanding research, but beginning in 1993 several of us founders
started leaving it. The original MT project was also stopped around 1996,
although it has since been resurrected and is now going strong.
Linguists study language phenomena much as physicists
study physical phenomena. They will give us advice, but will not directly
engage themselves in building systems. Just as engineers learned to take
advantage of the insights of physics, it is our task to figure out how to use the
insights of linguistics.
Our main problem is sparseness of data, and more precisely of annotated
or categorized data. What we should ask is that linguists help us structure
systems capable of extracting knowledge under minimal human supervision.
Our second crucial task is the design of relatively compact systems of
modules that reflect language phenomena and enable machine learning to
estimate the corresponding parameters (Baker, 1975).
Notes

1. Not all participants, of course. James Baker, who was a pioneer in introducing HMMs to the ASR field, based his work on a rigorous mathematical formulation [4].
2. John Cocke was actually the main driving force behind the establishment of an ASR group at IBM Research. It was he who suggested the use of trigrams as the basis of language modeling.
3. From a March 1947 letter to Norbert Wiener.
4. This is a very difficult assignment. The number of linguistic insights may be large, each contributing just a little to overall performance. The kludge that might result from straightforward incorporation of specialized modules would be hard to manage and the parameters impossible to estimate. Consider the difference between the elegant ASR and MT structures (1980 and 1990, respectively) and today's high performing systems!
References

ARPA Project on Speech Understanding Research (1971–1976).
Bahl L.R., Baker J.K., Jelinek F., Mercer R.L. (1977) Perplexity – A Measure of Difficulty of Speech Recognition Tasks. 94th Meeting of the Acoustical Society of America, Miami Beach, Florida.
Bahl L.R., Mercer R.L. (1976) Part of Speech Assignment by a Statistical Algorithm. IEEE International Symposium on Information Theory, Ronneby, Sweden.
Baker J.K. (1975) The Dragon System – An Overview. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-23(1), pp. 24–29.
Black E., et al. (1991) A Procedure for Quantitatively Comparing the Syntactic Coverage of English Grammars. Proceedings of the Fourth DARPA Speech and Natural Language Workshop, Pacific Grove, CA, pp. 306–311.
Black E., Jelinek F., Lafferty J., Magerman D., Mercer R.L., Roukos S. (1992) Towards History-Based Grammars: Using Richer Models for Probabilistic Parsing. Proceedings of the Fifth DARPA Speech and Natural Language Workshop, Harriman, NY.
Brill E. (1992) A Simple Rule-Based Part of Speech Tagger. Proceedings of the Third Conference on Applied Natural Language Processing, ACL, Trento, Italy.
Brown P.F., Cocke J., Della Pietra S.A., Della Pietra V.J., Jelinek F., Lafferty J.D., Mercer R.L., Roossin P.S. (1990) A Statistical Approach to Machine Translation. Computational Linguistics, 16(2), pp. 79–85.
Brown P.F., Della Pietra S.A., Della Pietra V.J., Mercer R.L. (1991) A Statistical Approach to Sense Disambiguation in Machine Translation. Proceedings of the Fourth DARPA Speech and Natural Language Workshop, Pacific Grove, CA, pp. 146–151.
Brown P.F., Della Pietra S.A., Della Pietra V.J., Mercer R.L. (1993) The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2), pp. 263–311.
Canadian Hansards.
Chomsky N. (1962) Syntactic Structures. Mouton & Co., 's-Gravenhage.
Collins M. (1996) A New Statistical Parser Based on Bigram Lexical Dependencies. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, CA, pp. 184–191.
DARPA Mohonk Conference on Natural Language Processing (1988). Lake Mohonk, NY.
Francis W.N., Kucera H. (1982) Frequency Analysis of English Usage. Houghton Mifflin Co., Boston.
Garside R. (1987) The CLAWS Word-Tagging System. In Garside R., Leech G., Sampson G. (eds), The Computational Analysis of English: A Corpus-Based Approach, Longman, London, pp. 30–41.
Garside R. (1993) Large Scale Production of Syntactically Analysed Corpora. Literary and Linguistic Computing, 8(1), pp. 39–46.
Garside R.G., Leech G.N., Sampson G.R. (1987) The Computational Analysis of English: A Corpus-Based Approach. Longman, London.
Jelinek F. (1998) Applying Information Theoretic Methods: Evaluation of Grammar Quality. Workshop on Evaluation of NLP Systems, Wayne, PA.
Johansson S., Leech G., Goodluck H. (1978) Manual of Information to Accompany the Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers. Department of English, University of Oslo, Oslo.
Leech G., Garside R. (1991) Running a Grammar Factory: The Production of Syntactically Analysed Corpora or Treebanks. In Johansson S., Stenström A.-B. (eds), English Computer Corpora: Selected Papers and Research Guide. Mouton de Gruyter, Berlin & New York, pp. 15–32.
Pierce J.R. (1969) Whither Speech Recognition? The Journal of the Acoustical Society of America, 46(4) (Part 2), pp. 1049–1050.
Quirk R., Greenbaum S., Leech G., Svartvik J. (1985) A Comprehensive Grammar of the English Language. Longman, London.
Sampson G. (1987) The Grammatical Database and Parsing Scheme. In Garside R., Leech G., Sampson G. (eds), The Computational Analysis of English: A Corpus-Based Approach, Longman, London, pp. 82–96.
Second Conference on Applied Natural Language Processing (1988). ACL, Austin, TX.
Weaver W. (1955) Translation. In Machine Translation of Languages. MIT Press, Cambridge, MA.