
Language Resources and Evaluation (2005) 39: 25–34

DOI 10.1007/s10579-005-2693-4

© Springer 2005

Some of my Best Friends are Linguists


FREDERICK JELINEK
Department of Electrical and Computer Engineering, Johns Hopkins University, Barton Hall
320, Baltimore, MD 21218, USA
E-mail: jelinek@jhu.edu

1. Introduction
This article concerns the relationship between linguistics and the work carried out during 1972–1993 at IBM Research in automatic speech recognition (ASR) and natural language processing (NLP). Many statements I will make will be incomplete: I am not that conversant with the literature. I apologize to those whom I may offend. Conceivably it would have been much better to leave things alone, stay silent. Hopefully this journal will be willing to devote some of its pages to Letters to the Editor to correct the record or air opposing views.
The starting point is the following quote attributed to me: "Whenever I fire a linguist our system performance improves." I have hoped for many years that this quote was only apocryphal, but at least two reliable witnesses have recently convinced me that I really stated this publicly in a conference talk (Jelinek, 1998). Accepting then that I really said it, I must first of all affirm that I never fired anyone, and a linguist least of all.
So my motivation is defensive: to show that neither I nor my colleagues at
IBM ever had any hostility to linguists or linguistics. In fact, we all hoped
that linguists would provide us with needed help. We were never reluctant
to include linguistic knowledge or intuition into our systems: if we didn't succeed, it was because we didn't find an efficient way to include it.

2. The Beginning of ASR/NLP Data Driven Methods


When our Continuous Speech Recognition group started its work at IBM
Research, the management wanted to make sure that our endeavors were
guided by strict scientific principle. They therefore placed into the group two linguists who were going to guide our progress. Both linguists were quite self-confident, sure that fast progress would be possible. For instance, when we (trained as engineers or physicists) were at a loss how to construct a language model, one of the linguists declared "I'll just write a little grammar."
Before we started to develop our data driven approach, the speech recognition paradigm was as follows:


1. Segment speech into phone-like units


2. Use pattern recognition to identify the segments
3. On the basis of confusion penalties determined by experts, find the least penalized utterance fitting the identified segment string.
The first task undertaken by our group was the recognition of utterances generated by the so-called Raleigh Finite State Language (see Figure 1). For
every word we hand-crafted a pronunciation baseform (string of phones from
an alphabet of about 50 phones) and carried out a recognition experiment
using a trained speaker. Confusion statistics obtained by a standard EM
approach easily beat those estimated by our experts (25% versus 65% error
rates).
We then put the data-driven approach to a more daring test. In our
experiment we replaced the phonetic baseforms by orthographic baseforms
(e.g., the word ought was described by the 5-unit-long string OUGHT rather than by the phonetic OT. Thus from the system's point of view the G sound in OUGHT was the same as the G sound in GENERAL or in GO!). This
orthographic experiment turned out to have only a 57% error rate,
superior to the 65% error rate based on confusion penalties determined by
experts.
After about a year of frustration the linguists left our group, returned to
their basic research, and we were free to pursue our self-organized, data-driven, statistical dream. This is the reality to which the admittedly hyperbolic word "fire" in my quote referred.
We are talking here of the period 1972–1974, when the first ARPA project (1971–1976) was started. Before that time, researchers in the speech understanding NLP/ASR field routinely presented results achieved on training data. Most participants in the ARPA project estimated the difficulty of their recognition tasks by the somewhat vaguely defined concept of branching factor that took no statistics into account1 and was essentially equal to the arithmetic average of the number of words between which the recognizer had to decide at each decision point (all tasks were then finite state). To combat this fallacy and yet stay in the realm of decisions between words, we introduced the concept of perplexity (Bahl et al., 1977), directly related to the mathematically traditional cross entropy.
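The relationship to cross entropy can be made concrete. The sketch below (a toy unigram model with invented probabilities; real language models condition on word history) computes perplexity as 2 raised to the per-word cross entropy:

```python
import math

def perplexity(model_probs, test_words):
    """Perplexity = 2 ** cross entropy, where cross entropy is the
    average negative log2 probability the model assigns per word."""
    log2_sum = sum(math.log2(model_probs[w]) for w in test_words)
    cross_entropy = -log2_sum / len(test_words)
    return 2 ** cross_entropy

# Uniform model over 8 words: perplexity equals the branching factor, 8.
uniform = {w: 1 / 8 for w in "a b c d e f g h".split()}
print(perplexity(uniform, ["a", "b", "c"]))          # 8.0

# A skewed model is "easier": perplexity falls below the vocabulary size.
skewed = {"the": 0.5, "cat": 0.25, "sat": 0.25}
print(perplexity(skewed, ["the", "the", "cat"]))     # ~2.52
```

On a uniform distribution perplexity reduces exactly to the branching factor; on realistic, skewed statistics it is much smaller than the vocabulary size, which is what made it a fairer measure of task difficulty.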
It is in this atmosphere of bad formulation and exaggerated claims that the famous and influential AT&T communications engineer J.R. Pierce published his warning (Pierce, 1969) that certainly slowed down investment in ASR research. Here are some quotes from his article:
. . . ASR is attractive to money. The attraction is perhaps similar to the
attraction of schemes for turning water into gasoline, extracting gold from
the sea, or going to the moon.

Figure 1. Grammar of the Raleigh Language.


Most recognizers behave not like scientists, but like mad inventors or
untrustworthy engineers.
. . . performance will continue to be very limited unless the recognizing device understands what is being said with something of the facility of a native speaker (that is, better than a foreigner fluent in the language).
Any application of the foregoing discussion to work in the general area of
pattern recognition is left as an exercise for the reader.
3. The NLP/ASR Situation in the 1970s
In the 1970s NLP and ASR research was dominated by an artificial intelligence approach. Programs were rule-based; expert systems were beginning to take over. Noam Chomsky, a very respected (and rightly so) though controversial figure, felt that statistics had no place in linguistic research. His demonstration that language is not finite state (Chomsky, 1962) was considered decisive, and its applicability to NLP was over-estimated. The purest linguists based their work on self-constructed examples, not on the prevalence of phenomena in observed data.
As already mentioned, strict distinction between training and test was frequently ignored. Grammars were being written that applied to fewer than a dozen verbs.
Our ASR group at IBM was composed mostly of engineers and physicists.
Only 3 or 4 people out of 10 had any previous experience with speech. None
had graduate training in that field. But several of us had a background in Information Theory, and that influenced our thinking. Because of that
background it was natural for us to come up with the Communication
Theory formulation of ASR (see Figure 2). Our creed was as follows:
• The structure of models and their parametrization will be determined by linguistic intuition.
• The models' complexity will be limited by our ability to devise algorithms capable of estimating robust parameter values from available data.

[Figure 2: the speaker's mind drives a speech producer (speaker plus acoustic channel); the resulting speech passes through an acoustic processor and a linguistic decoder (together the speech recognizer), which outputs the word string Ŵ.]

Figure 2. Source-channel model of speech recognition.
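In formulas, the source-channel model of Figure 2 amounts to the decision rule Ŵ = argmax over W of P(W)·P(A|W), with P(W) supplied by the language model and P(A|W) by the acoustic model. A toy sketch (the two candidate strings and all probabilities are invented for illustration):

```python
# Toy source-channel decoder: choose the word string W maximizing P(W) * P(A|W).
# Both candidate strings and all probabilities are invented for illustration.
language_model = {                 # P(W): prior probability of the word string
    "recognize speech": 0.6,
    "wreck a nice beach": 0.4,
}
acoustic_model = {                 # P(A|W): how well W explains the observed audio
    "recognize speech": 0.3,
    "wreck a nice beach": 0.2,
}

def decode():
    """Bayes decision rule of the source-channel model."""
    return max(language_model, key=lambda w: language_model[w] * acoustic_model[w])

print(decode())  # prints "recognize speech" (0.6*0.3 = 0.18 beats 0.4*0.2 = 0.08)
```

The two creed points above map directly onto this rule: linguistic intuition shapes what P(W) and P(A|W) look like, and the available data limits how many parameters they may have.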


The second point accounted for our relatively primitive modeling of the
language translation problem (see below). We were always accused of reluctance to use linguistic information, and when we did, remarks were made like "The IBM group is coming around. They admit the need for a linguistic approach." Well, we always wanted linguistics, only we didn't know how to incorporate it.
What we did realize is that for most distinguished linguists the NLP/ASR problem was of no direct research interest. Of course, there were other linguists, such as Geoffrey Leech or Henry Kucera, who were very interested in data, and as soon as we could we sought cooperation with them.
It was in any case clear that we should seek linguists' advice about the creation of resources to be exploited by NLP/ASR.
4. Availability of Linguistic Resources
Linguistic resources pre-dated the modern statistical, data driven approach
to NLP/ASR. I will mention resources in the order in which we naive engineers working on ASR discovered them. First was the Brown Corpus (Francis and Kucera, 1982), which had existed since 1967. It was rather small by today's standards, 1 million words, but it contained a selection of genres and was annotated by parts of speech. That got us interested in automatic tagging: we thought of it as an opportunity to improve the ASR language model. So Bahl and Mercer invented the HMM approach to tagging (Bahl and Mercer, 1976). It was quite a disappointment to us that even though the accuracy of our tagger was quite high (about 96%), we found no effective way to exploit it in a language model.
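The HMM tagging idea can be sketched with a miniature Viterbi decoder: tags are hidden states, words are emissions, and the decoder picks the tag sequence maximizing the product of transition and emission probabilities. Everything below (tag set, probabilities) is invented for illustration, not Bahl and Mercer's actual trained model:

```python
def viterbi(words, tags, start, trans, emit):
    """Most probable tag sequence for `words` under a toy HMM."""
    # best[t] = (probability, path) of the best tag path ending in tag t
    best = {t: (start[t] * emit[t].get(words[0], 0.0), [t]) for t in tags}
    for w in words[1:]:
        best = {
            t: max(
                ((p * trans[prev][t] * emit[t].get(w, 0.0), path + [t])
                 for prev, (p, path) in best.items()),
                key=lambda pair: pair[0],
            )
            for t in tags
        }
    return max(best.values(), key=lambda pair: pair[0])[1]

tags = ["DET", "NOUN", "VERB"]
start = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans = {"DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
         "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
         "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2}}
emit = {"DET": {"the": 1.0},
        "NOUN": {"dog": 0.7, "barks": 0.3},
        "VERB": {"barks": 0.8, "dog": 0.2}}

print(viterbi(["the", "dog", "barks"], tags, start, trans, emit))
# ['DET', 'NOUN', 'VERB']
```

The point of the statistical approach is visible even in the toy: "barks" could be a noun or a verb, and the transition statistics, not a hand-written rule, resolve the ambiguity.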
After some search we became aware that Geoff Leech and Roger Garside at Lancaster had attempted automatic tagging by rule (Garside, 1987, 1993). And this led to our discovery of the existence of the Lancaster-Oslo/Bergen corpus (Johansson et al., 1978), associated with the English grammar books by Quirk et al. (1985).
Actually, by 1985 we started looking around for a possible new application of the statistical methods we had developed for ASR. We hit on the possibility of machine translation (MT). And, of course, we thought that grammar would be important, and we wanted to induce it from annotated data. That's when we found the Lancaster treebank constructed in the years 1983–1986 under the leadership of Geoff Sampson and Geoff Leech (Garside et al., 1987; Sampson, 1987). Unfortunately, as I recall, it was hard to obtain the rights to use this treebank, and so we commissioned the University of Lancaster to create a new treebank just for our own use.
We believed then that when it comes to data for parameter estimation,
quantity would beat quality (within reason, of course). And we thought that
the treebank should be based on solid intuition carried in the minds of


every native speaker. So the resulting 3 million word treebank (Leech and
Garside, 1991) was constructed by 10 Lancaster housewives guided by Leech
who did finally write quite a thick annotation manual. In the end, the
housewives became experts . . .

5. The Founding of LDC


In the late 1980s the NSF Directorate of Computer and Information Science and Engineering (CISE) was headed by the famous applied mathematician Jack Schwartz. Before he assumed the job he used to collaborate with the great John Cocke of IBM, the C in the CKY algorithm and the originator of the RISC machine concept and of many other computer innovations.2 In the fall of 1987 I went to visit Jack at NSF and suggested to him that NSF should underwrite the creation of a treasury of annotated English. I had in mind a much more sizeable treebank than the one then being constructed at Lancaster. Jack was willing to explore the problem and instructed Charles Wayne to help.
We had experience at IBM with acquisition of rights to machine-readable data. It was an enormous problem: organizations wanted to charge considerable money to a deep-pockets company like IBM. We did find some free data at the Reading House for the Blind, but in order to use it, we had to obtain individual releases from the owners of each separate item (book, article, etc.) contained in the collection. It was clear that to carry out negotiations for rights, the data guardianship task should reside in a non-profit institution, best associated with a university. So at the NLP conference in January 1988 (Second Conference on Applied Natural Language Processing, 1988) I inquired of Aravind Joshi and Mitch Marcus if they would be interested in having such an institution at the University of Pennsylvania. They said they would, I reported it to Charles Wayne, and he took it from there.
A conference was organized at Lake Mohunk, NY (DARPA Mohunk conference, 1988), a steering committee was set up, rules about membership were drafted, and LDC came into being with its first task: the U Penn Treebank.

6. Rise of Data Driven Parsing


By this time we were more eager than ever to see if construction of a statistical parser was possible (on the basis of a treebank, of course). We thought we needed cooperation with some group more experienced in NLP. So we applied for an NSF grant jointly with the University of Pennsylvania. We did receive the support, and several good things came out of it:


• Eric Brill, a graduate student at U Penn, invented the concept of transformation-based learning, which he applied first to part-of-speech tagging (Brill, 1992).
• Chaired by Ezra Black, a group developed the PARSEVAL guidelines for determining parsing accuracy (Black et al., 1991).
• Spatter, the first statistical, history-based parser, was implemented by David Magerman (Black et al., 1992). It built up the parse left-to-right with the help of questions embedded in a decision tree.
• Spatter phrases were annotated by lexical headwords. In order for Spatter to learn its moves from the treebank, it was necessary to provide its parses with phrase headwords. So Ezra Black developed the headword percolation rules later used by many projects.
I think that it is accurate to say that this initial effort supported by NSF eventually resulted in the parsers developed by Collins (1996).
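At its core, the PARSEVAL comparison reduces a parse to its labeled constituent spans and scores a candidate against the gold parse by bracket precision and recall. A minimal sketch (toy spans; the actual guidelines also treat crossing brackets, punctuation, and unlabeled matching):

```python
def parseval_scores(gold_spans, test_spans):
    """Bracket precision/recall over (label, start, end) constituents."""
    matched = len(set(gold_spans) & set(test_spans))
    precision = matched / len(test_spans)
    recall = matched / len(gold_spans)
    return precision, recall

# Gold and candidate constituents for a 5-word sentence (made-up example):
# the candidate parser mislabels one constituent (PP instead of NP).
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
test = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]

p, r = parseval_scores(gold, test)
print(p, r)  # 0.75 0.75
```

Measuring agreement span by span, rather than requiring whole-tree identity, is what made it possible to compare parsers built on quite different grammatical theories.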

7. Machine Translation
As mentioned earlier, we embarked on MT in 1986, when we sought a new area to which to apply our statistical, self-organized techniques. Besides, we had 15 years of ASR work behind us, and those who switched were also attracted by the change as well as the possibility of picking some low-hanging fruit.
We had two ideas: to use the noisy channel paradigm to formulate the problem (see Figure 3), and to base our learning on parallel texts. Naturally, as the source language we wanted to use one not too different from English. So we were very fortunate when we discovered the Canadian Hansards text that transcribed in English and French the debates of the Ottawa parliament.
As to our problem formulation, we were later somewhat surprised when it was revealed to be almost common sense. In fact, it was probably Bob Mercer who found the following quotations in an article by Weaver (1955):
When I look at an article in Russian I say: "This is really written in English but it has been coded in some strange symbols. I will now proceed to decode it."3
. . . the matter is probably absolutely basic, namely the statistical character of the problem.

[Figure 3: the writer's "English" intention passes through a transformation into the foreign language, producing the foreign writer's text; the machine translator inverts that transformation.]

Figure 3. Communication Theory MT Formulation.
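Learning translation probabilities from parallel text of the Hansards kind can be sketched with a toy EM loop in the spirit of the simplest of the Brown et al. models (an IBM Model 1-style word-translation table; the three-sentence corpus and uniform initialization are illustrative only, not the actual Candide system):

```python
from collections import defaultdict

# Tiny parallel "corpus" of (English, French) sentence pairs, for illustration.
corpus = [(["the", "house"], ["la", "maison"]),
          (["the", "book"], ["le", "livre"]),
          (["a", "book"], ["un", "livre"])]

e_vocab = {e for es, _ in corpus for e in es}
t = defaultdict(lambda: 1.0 / len(e_vocab))   # t(f|e), uniform initialization

for _ in range(20):                           # EM iterations
    count = defaultdict(float)                # expected counts c(f, e)
    total = defaultdict(float)                # expected counts c(e)
    for es, fs in corpus:
        for f in fs:                          # E-step: fractional alignment counts
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                count[(f, e)] += t[(f, e)] / norm
                total[e] += t[(f, e)] / norm
    for (f, e), c in count.items():           # M-step: renormalize per English word
        t[(f, e)] = c / total[e]

# "book" co-occurs with "livre" in two sentence pairs, so EM concentrates
# probability mass on t(livre | book) without any hand alignment.
print(t[("livre", "book")] > t[("le", "book")])  # True
```

No dictionary and no alignment annotation goes in; the co-occurrence statistics of the parallel sentences alone pull the translation table toward the right pairings, which is exactly what made the Hansards discovery so fortunate.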


The leaders and driving forces behind the project were Bob Mercer and Peter Brown, although many others were also importantly involved. Before we really got down to business, four of us took a fast course in French. Not that we believed it, but the organization that offered the course claimed that it would teach us French in 2 weeks!
The result of our endeavor was a series of systems (Brown et al., 1990, 1993) that participated in a 1991 DARPA project in which we had two competitors: Dragon Systems (Lingstat), and the combined forces of NMSU, CMU, and ISI (Pangloss). Our own system was called Candide; I don't know why.
The other two teams took advantage of linguistic knowledge, we almost not at all. And again, wise people said profoundly "What we need to do is to combine linguistics with statistics." As if we had not tried as hard as we could to include in our statistical frame the linguistics we could get hold of! Why else did we put in the effort to create good parsers?
Besides, we did have linguistic components: morphology and word sense
disambiguation (Brown et al., 1991). And we performed preprocessing in
which we attempted to rearrange the words of the French source sentence
into a more English-like progression.
8. Conclusion
The IBM group (actually, its successor) continues to exist and carries out outstanding research, but beginning in 1993 several of us "founders" started leaving it. The original MT project was also stopped around 1996, although it has since been resurrected and is now going strong.
Linguists study language phenomena much as physicists study physical phenomena. They will give us advice, but will not directly engage themselves in building systems. Just as engineers learned to take advantage of the insights of physics, it is our task to figure out how to use the insights of linguistics.
Our main problem is sparseness of data, and more precisely of annotated
or categorized data. What we should ask is that linguists help us structure
systems capable of extracting knowledge under minimal human supervision.
Our second crucial task is the design of relatively compact systems of modules that reflect language phenomena and enable machine learning to estimate the corresponding parameters (Baker, 1975).

Notes
1. Not all participants, of course. James Baker, who was a pioneer in introducing HMMs to the ASR field, based his work on a rigorous mathematical formulation (Baker, 1975).



2. John Cocke was actually the main driving force behind the establishment of an ASR group at IBM Research. It was he who suggested the use of trigrams as the basis of language modeling.
3. From a March 1947 letter to Norbert Wiener.
4. This is a very difficult assignment. The number of linguistic insights may be large, each contributing just a little to overall performance. The kludge that might result from straightforward incorporation of specialized modules would be hard to manage and the parameters impossible to estimate. Consider the difference between the elegant ASR and MT structures (1980 and 1990, respectively) and today's high-performing systems!

References
ARPA Project on Speech Understanding Research (1971–1976).
Bahl L.R., Baker J.K., Jelinek F., Mercer R.L. (1977) Perplexity: A Measure of the Difficulty of Speech Recognition Tasks. 94th Meeting of the Acoustical Society of America, Miami Beach, Florida.
Bahl L.R., Mercer R.L. (1976) Part of Speech Assignment by a Statistical Algorithm. IEEE International Symposium on Information Theory, Ronneby, Sweden.
Baker J.K. (1975) The Dragon System: An Overview. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-23(1), pp. 24–29.
Black E., et al. (1991) A Procedure for Quantitatively Comparing the Syntactic Coverage of English Grammars. Proceedings of the Fourth DARPA Speech and Natural Language Workshop, Pacific Grove, CA, pp. 306–311.
Black E., Jelinek F., Lafferty J., Magerman D., Mercer R.L., Roukos S. (1992) Towards History-Based Grammars: Using Richer Models for Probabilistic Parsing. Proceedings of the Fifth DARPA Speech and Natural Language Workshop, Harriman, NY.
Brill E. (1992) A Simple Rule-Based Part of Speech Tagger. Proceedings of the Third Conference on Applied Natural Language Processing, ACL, Trento, Italy.
Brown P.F., Cocke J., Della Pietra S.A., Della Pietra V.J., Jelinek F., Lafferty J.D., Mercer R.L., Roossin P.S. (1990) A Statistical Approach to Machine Translation. Computational Linguistics, 16(2), pp. 79–85.
Brown P.F., Della Pietra S.A., Della Pietra V.J., Mercer R.L. (1991) A Statistical Approach to Sense Disambiguation in Machine Translation. Proceedings of the Fourth DARPA Speech and Natural Language Workshop, Pacific Grove, CA, pp. 146–151.
Brown P.F., Della Pietra S.A., Della Pietra V.J., Mercer R.L. (1993) The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2), pp. 263–311.
Canadian Hansards.
Chomsky N. (1962) Syntactic Structures. Mouton & Co., 's-Gravenhage.
Collins M. (1996) A New Statistical Parser Based on Bigram Lexical Dependencies. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, CA, pp. 184–191.
DARPA Mohunk Conference on Natural Language Processing (1988). Lake Mohunk, NY.
Francis W.N., Kucera H. (1982) Frequency Analysis of English Usage. Houghton Mifflin Co., Boston.
Garside R. (1987) The CLAWS Word-Tagging System. In Garside R., Leech G., Sampson G. (eds), The Computational Analysis of English: A Corpus-Based Approach, Longman, London, pp. 30–41.


Garside R. (1993) Large Scale Production of Syntactically Analyzed Corpora. Literary and Linguistic Computing, 8(1), pp. 39–46.
Garside R.G., Leech G.N., Sampson G.R. (1987) The Computational Analysis of English: A Corpus-Based Approach. Longman, London.
Jelinek F. (1998) Applying Information Theoretic Methods: Evaluation of Grammar Quality,
Workshop on Evaluation of NLP Systems, Wayne, PA.
Johansson S., Leech G., Goodluck H. (1978) Manual of Information to Accompany the Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers. Department
of English, University of Oslo, Oslo.
Leech G., Garside R. (1991) Running a Grammar Factory: The Production of Syntactically Analysed Corpora or Treebanks. In Johansson S., Stenström A.-B. (eds), English Computer Corpora: Selected Papers and Research Guide. Mouton de Gruyter, Berlin & New York, pp. 15–32.
Pierce J.R. (1969) Whither Speech Recognition? The Journal of the Acoustical Society of America, 46(4) (Part 2), pp. 1049–1050.
Quirk R., Greenbaum S., Leech G., Svartvik J. (1985) A Comprehensive Grammar of the English Language. Longman, London.
Sampson G. (1987) The Grammatical Database and Parsing Scheme. In Garside R., Leech G., Sampson G. (eds), The Computational Analysis of English: A Corpus-Based Approach, Longman, London, pp. 82–96.
Second Conference on Applied Natural Language Processing. (1988) ACL, Austin, TX.
Weaver W. (1955) Translation. In Machine Translation of Languages. MIT Press, Cambridge, MA.
