Speech Recognition and Understanding: Recent Advances, Trends and Applications
Edited by Pietro Laface, Dipartimento di Automatica e Informatica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy, and Renato De Mori, School of Computer Science, McGill University, 3480 University St., Montreal, Quebec H3A 2A7, Canada
Springer-Verlag Berlin Heidelberg New York London Paris Tokyo Hong Kong Barcelona Budapest
Published in cooperation with NATO Scientific Affairs Division

Experimenting Text Creation by Natural-Language, Large-Vocabulary Speech Recognition

P. Alto, M. Brandetti, M. Ferretti, G. Maltese, F. Mancini, A. Mazza, S. Scarci, G. Volpi
IBM Italy Rome Scientific Center, via Giorgione 159, 00147 Rome (Italy)

1. Introduction

In the last years the probabilistic approach to speech recognition has allowed the development of high-performance large-vocabulary speech recognition systems [1] [2]. At the IBM Rome Scientific Center a speech-recognition prototype for the Italian language, based on this approach, has been built. The prototype is able to recognize in real time natural-language sentences built using a vocabulary containing up to 20000 words [4]. Once and for all, the user has to perform an acoustic training phase (about 20 minutes long), during which he is required to utter a predefined text. Words must be uttered inserting small pauses (a few centiseconds) between them. The prototype architecture is based on a personal computer equipped with special hardware.

The first system we developed was aimed at a business and finance lexicon. Many laboratory tests have shown the effectiveness of the prototype as a tool to create texts by voice. After a first phase during which in-house experiments were carried on [5], the need arose to test the system in real work environments and for different applications. Two applications were considered: the dictation of radiological reports and of insurance company documents. Due to their characteristics, these applications seemed to be very well suited for our purposes. Since the vocabulary of the recognizer must be predefined, we had to adapt the system to the lexicon required by the new applications. The paper describes the techniques developed to efficiently adapt the basic components of the recognizer: the acoustic and the language models. The results obtained experimenting automatic text dictation during real work are also presented.

2. System structure

To understand the steps necessary to perform the system adaptation, a brief outline of its basic structure is needed. In the probabilistic approach to speech recognition we look for the sequence of words W which has the highest probability given the acoustic information A extracted from the observed signal [1]. In our case the acoustic information is a sequence of acoustic labels extracted from the signal every centisecond and representing the energy content of the signal in 20 frequency bands.

Applying the Bayes theorem we can write:

$$P(W \mid A) = \frac{P(A \mid W)\, P(W)}{P(A)} \qquad (1)$$

where P(A|W) is the probability that the sequence of words W will produce the sequence of acoustic information A, P(W) is the a priori probability of the sequence of words W, and P(A) is the probability of the sequence of acoustic information A. We seek the maximum of the above expression with respect to W. We can ignore P(A) because it does not depend on W. Therefore we need to maximize the numerator of expression (1).
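As an illustration, a minimal sketch of this decision rule in Python, with invented toy scores standing in for the two model components (log-probabilities are summed rather than multiplying raw probabilities, a standard numerical precaution the paper does not discuss):

```python
import math

# Toy stand-ins for the two model components (numbers are invented):
# the real system computes log P(A|W) with hidden Markov models and
# log P(W) with a trigram language model.
TOY_ACOUSTIC = {                      # log P(A | W) for one fixed A
    ("buon", "giorno"): math.log(0.020),
    ("buongiorno",):    math.log(0.015),
}
TOY_LANGUAGE = {                      # log P(W)
    ("buon", "giorno"): math.log(0.0010),
    ("buongiorno",):    math.log(0.0025),
}

def recognize(candidates):
    """Return the word sequence W maximizing P(A|W) * P(W); P(A) is
    dropped because it is the same for every candidate W."""
    return max(candidates, key=lambda w: TOY_ACOUSTIC[w] + TOY_LANGUAGE[w])

print(recognize([("buon", "giorno"), ("buongiorno",)]))  # ('buongiorno',)
```

The two scoring terms correspond exactly to the acoustic model and the language model described next.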
As a consequence of equation (1), the recognition task can be viewed as composed of four components:

1. an acoustic processor to extract the acoustic information A from the speech signal;
2. an acoustic model to compute the probability P(A|W);
3. a language model to compute the a-priori probability P(W);
4. an efficient search strategy, to find, among all possible sequences of words W, the most probable one.

While the signal processing stage and the search strategy can be considered as vocabulary independent, both the acoustic and the language models depend on the application which the system is aimed for, so they must be adapted to the lexicon of the new application. In the next paragraphs we will present the structure of the models and the techniques employed to adapt them.

Acoustic Model

The acoustic model task is to compute P(A|W). The approach currently employed for the acoustic model is based on hidden Markov models, which are finite state automata [3]. For every time slice the model takes a transition from the current state to one of the allowed states (the transition can also produce no state change). For each transition an acoustic label is produced [1]. Both the transitions and the label emissions occur according to two probability distributions which depend on the current state only. These models are called hidden because it is only possible to observe the sequence of acoustic symbols produced, while the sequence of states is unknown.

Each word belonging to the vocabulary is represented by a different model. To allow speech recognition for a very large vocabulary, the basic speech units modeled by the Markov sources are associated to the basic sounds of the language. So an alphabet of acoustic units must be defined to represent the basic sounds of the language. Examples of acoustic units used for speech recognition are: syllables, diphones, phones. We chose the phone as the phonetic unit. The basic sounds of the Italian language were described by a set of 56 phonetic units [4]. For each phonetic unit a Markov model representing its pronunciation has been created.

In our system all the phonetic units have the same topological structure. The distinction between different sounds is left entirely to the probability distributions, namely, to the parameters of the model. The computation of the parameters is accomplished during the acoustic training phase employing the predefined text uttered by the user. The model for each word is built by the concatenation of the Markov sources corresponding to the string of phonetic units forming its pronunciation.

Language Model

The task of the language model is to compute P(W). The computation is performed in the following way:

$$P(W) = \prod_i P(w_i \mid w_{i-2}, w_{i-1})$$

This is the so-called trigram language model [2]: it contains the approximation of considering equivalent all the sentences ending with the same couple of words w_{i-2}, w_{i-1}.

The number of possible trigrams is so large that it is practically impossible to collect the amount of data needed to estimate the probability of each of them. To overcome the problem, an interpolation between different probability distributions is performed. Three different distributions are computed for trigrams, bigrams and unigrams. The probability of word w3 given the words w1 and w2 is estimated as follows:

$$P(w_3 \mid w_1, w_2) = \lambda_3 \frac{C(w_1 w_2 w_3)}{C(w_1 w_2)} + \lambda_2 \frac{C(w_2 w_3)}{C(w_2)} + \lambda_1 \frac{C(w_3)}{N} + \lambda_0 \frac{1}{V}$$

where C(·) is the number of times the word string (·) was observed in the training data, N is the total number of words in the training data, and V is the number of words in the vocabulary.
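A minimal sketch of this interpolated estimate on a toy corpus (the λ weights shown are illustrative placeholders, not trained values; their actual estimation is described next):

```python
from collections import Counter

def train_counts(corpus):
    """Collect unigram, bigram and trigram counts from a token list."""
    uni = Counter(corpus)
    bi = Counter(zip(corpus, corpus[1:]))
    tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
    return uni, bi, tri

def p_interp(w1, w2, w3, uni, bi, tri, vocab_size,
             lambdas=(0.50, 0.30, 0.15, 0.05)):
    """Interpolated estimate of P(w3 | w1, w2) following the formula
    above; missing counts simply contribute zero to their term."""
    l3, l2, l1, l0 = lambdas
    n_words = sum(uni.values())
    p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p1 = uni[w3] / n_words
    return l3 * p3 + l2 * p2 + l1 * p1 + l0 / vocab_size

# Toy corpus; the real model is trained on millions of words.
tokens = "il mercato apre in rialzo e il mercato chiude in rialzo".split()
uni, bi, tri = train_counts(tokens)
print(p_interp("il", "mercato", "apre", uni, bi, tri, vocab_size=20000))
```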
The λ coefficients are estimated using the expectation-maximization algorithm [9]. The trigram language model is an effective tool to represent the linguistic constraints for speech recognition purposes; on the other hand, it requires a large amount of training data: the larger the training corpus, the higher the recognition rate [10].

Perplexity is a typical measure of the predictive power of a language model [11]. It estimates the average number of words that are considered equiprobable by the model. Of course, if the language model is not used, the perplexity value equals the vocabulary size.

Vocabulary definition

In our system the vocabulary is predefined; this means that the choice of the vocabulary is a crucial step for the system performances.

In the first version of our prototype a vocabulary containing more than 20000 words was used. It is aimed at the dictation of economy and finance reports [5]. The 20000 words were chosen as the most frequent ones in a corpus containing 44 million words, composed of articles from the most important Italian economy and finance newspaper (Il Sole 24 Ore), articles from an economy and finance newsmagazine (Il Mondo), and press agency news. The coverage of this vocabulary computed on a disjoint corpus was 96.5%.

3. System adaptation

To build a speech recognition system for a new application, we need:

- to define the vocabulary;
- to adapt the acoustic and language models.

While the adaptation of acoustic and language models can be automatized, the choice of vocabulary depends on factors like the lexicon of the application and the size of the available texts. Thus in the next paragraphs we will describe the tools that have been created to make the required adaptations easy and quick.

Acoustic model adaptation

We have just seen that each word is described by the concatenation of the Markov models of the phonetic units forming its pronunciation. According to the chosen technique, the first step in adapting the system to a new application is to perform the phonetic transcription of all the words in the new vocabulary. To limit the number of words that must be transcribed, a large database was built containing all the words and the phonetic transcriptions used in previous vocabularies. By using the database it is possible to find all the words for which a new phonetic transcription must be supplied.

Usually, the phonetic transcription is performed manually. It is a very expensive process, and for large vocabularies the transcriptions could contain errors. We tried to make the phonetic transcription process as automatic as possible.
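The following toy sketch combines the database lookup just described with a rule-based candidate generator of the kind detailed below; the database contents, rule entries and phone symbols are all invented for illustration and bear no relation to the actual 78-rule set:

```python
from itertools import product

# Hypothetical database of transcriptions reused from previous
# vocabularies; words found here need no new transcription.
KNOWN_TRANSCRIPTIONS = {"casa": "k'aza"}

def words_to_transcribe(vocabulary):
    """Words of the new vocabulary missing from the database."""
    return [w for w in vocabulary if w not in KNOWN_TRANSCRIPTIONS]

# Toy grapheme-to-phoneme rules keyed by (grapheme, right context);
# a None context is the default case.  Purely illustrative.
RULES = {
    ("c", "e"): ["tS"], ("c", "i"): ["tS"], ("c", None): ["k"],
    ("a", None): ["a", "'a"], ("e", None): ["e", "E", "'e", "'E"],
    ("i", None): ["i", "'i"], ("o", None): ["o", "O", "'o", "'O"],
    ("u", None): ["u", "'u"],
}

def candidate_transcriptions(word):
    """All transcriptions obtainable by applying the rules, pruned by
    the global rule: keep only those with exactly one stressed vowel
    (stress is marked here with an apostrophe before the vowel)."""
    choices = []
    for i, g in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else None
        choices.append(RULES.get((g, nxt)) or RULES.get((g, None)) or [g])
    forms = ("".join(phones) for phones in product(*choices))
    return [t for t in forms if t.count("'") == 1]

for w in words_to_transcribe(["casa", "pesca"]):
    # "casa" is filtered out by the database; for "pesca" four
    # candidates survive pruning and the user picks the correct one
    # (open vs. closed e, and the stress position)
    print(w, candidate_transcriptions(w))
```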
The systems that have been proposed to solve the problem of automatic phonetic transcription are based on rules [6] [7] or on automatic learning from training data [8]. Actually, these systems cannot provide the accuracy required for automatic speech recognition. This is due both to the complexity of the problem and to the difficulty of describing all the possibilities with a limited set of rules. We employed a technique different from the mentioned ones. We separated phonotactical knowledge (well described by a limited set of rules) from lexical knowledge (based on experience and not suitable for a formal description). Given the string representing the orthographic form of a word, our system produces a set of phonetic transcriptions for that word, which are the ones that can be obtained applying our set of rules for the grapheme-to-phoneme translation. The user can then choose manually the correct transcription on the basis of his lexical knowledge. Grapheme-to-phoneme translation for the Italian language has a relatively low uncertainty: a set of 78 rules allows to describe all the ambiguities. Each rule consists of a left part and a right part. The left part consists of a grapheme string and its (possibly empty) left and right graphemic contexts; the right part consists of the set of possible phonetic transcriptions for the grapheme string. The set of transcriptions produced applying this set of rules is then pruned by means of a set of global rules (for example, discarding all the transcriptions which do not have one and just one stressed vowel). The correct phonetic transcription always belongs to the resulting set. The average number of phonetic transcriptions per word is …

Language model

The language model is built following the previously described technique. We have a tool that takes a dictionary and a corpus as a starting point and gives the probability P(w3 | w1, w2) for any word w3, given the context w1, w2.

4. In-house test for the new application

Before experimenting the speech recognizer in a real environment we needed to perform in-house tests to assess the recognition rate of the system when used to dictate predefined texts. A preliminary study on the usage of a voice-activated text editor indicated that large-vocabulary speech recognition can offer a very competitive alternative to traditional text entry. It is likely to be well accepted even by users who have a large experience in keyboard text entry. The results (error rate 2.35%) show that the system could be a revolutionary tool, for example, for the office of the future, but poses some human-factors problems. These experiments have been very useful for us, because they have shown the need of some improvements to the system interface and of some special keys to make the editing and the correction phases easier.

5. Real-work applications

Two application areas were considered: medical report dictation and office (insurance company) documents and memos creation. The economy and finance lexicon is quite different from the lexicon required to dictate radiological reports, while it has some similarities with respect to the lexicon used in the insurance company documents. The radiological reports vocabulary was selected by using a corpus containing only radiological reports; in the insurance company case the vocabulary was built taking the economy-finance vocabulary as a starting point.
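In both cases the selection relies on word-frequency ranking, and its quality is judged by coverage measurements of the kind plotted in Figure 1 below. A minimal sketch of both computations on toy data (tokenization and figures are assumptions):

```python
from collections import Counter

def build_vocabulary(corpus_tokens, size):
    """The `size` most frequent words of a corpus."""
    return {w for w, _ in Counter(corpus_tokens).most_common(size)}

def coverage(vocabulary, test_tokens):
    """Fraction of the running words of a disjoint corpus that belong
    to the vocabulary: the quantity reported as coverage in the text."""
    inside = sum(1 for w in test_tokens if w in vocabulary)
    return inside / len(test_tokens)

# Illustrative usage with toy data:
train = "il sole verso il mercato il sole e il mercato apre".split()
test = "il mercato chiude e il sole".split()
vocab = build_vocabulary(train, size=4)
print(f"coverage: {coverage(vocab, test):.1%}")
```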
Radiological Reports Vocabulary

The whole corpus contained about 5 million words collected in four different hospitals. …

Figure 1. Coverage of corpus as a function of vocabulary size (coverage plotted against vocabulary dimension, from 5000 to 30000 words).

A vocabulary containing 5000 words seemed to us a reasonable trade-off between the need to reach a high coverage and the need to have enough data to estimate the language model parameters.

One of the main characteristics found in this kind of lexicon was the presence of a set of words peculiar to the report dictation process at each location. To make the recognizer well suited to the needs of the experimenter, all the 3200 different words found in the corpus provided by the hospital where the experiment was performed were included in the vocabulary. The vocabulary was completed adding 1900 words which were the most frequent ones in the other corpora and not included in the previous list of 3200 words. The resulting vocabulary had a 100% coverage on reports from the experiment location and a 97.5% coverage on other reports.

Insurance Company Reports Vocabulary

The selection of the words to be used in this vocabulary was performed by analyzing a corpus containing 1.5 million words provided by the insurance company. There was a certain similarity between this lexicon and the economy and finance one: the coverage obtained by the economy and finance vocabulary (20000 words) on the insurance corpus was 95.2%. The 15000 most frequent words of the economy and finance vocabulary were selected and the 3100 most frequent words found in the insurance corpus were added. The resulting vocabulary showed a 99% coverage on insurance texts and a 95.7% coverage on economy and finance ones.

Language models

Table 1. Language models characteristics

  Parameter                           Radiology reports   Insurance reports
  Vocabulary size                     5100                18100
  Training data (millions of words)   4.8                 1.5 + 40
  Millions of distinct trigrams       0.62                3.4
  Millions of distinct bigrams        …                   2.3
  Perplexity                          8                   …

The radiology reports language model was trained using 4.8 million words. For the insurance company reports, a corpus containing 1.5 million words typical of the application was merged with 40 million words extracted from the economy and finance corpus. In the table the most significant parameters of the language models are reported.

6. Results

In the following paragraphs the results obtained experimenting the two prototypes during real work are reported.

Radiological Reports Dictation

Four doctors have used the recognizer during their every-day work to prepare the reports to be delivered to the patients. No one had any difficulty in inserting short pauses between words. The doctors dictated 150 reports containing more than 12000 words. The vocabulary coverage for the dictated reports was 98.6%. Table 2 reports the results of the experiment.

Table 2. Radiological reports dictation

  Speaker   Number of reports   Error rate   Speaker's error rate
  sp1       35                  1.7%         …
  sp2       …                   2.0%         1.2%
  sp3       …                   3.3%         0.9%
  sp4       …                   …            …

The error rate refers to the number of errors made by the recognizer, without taking into account errors due to words not included in the vocabulary. The speaker's error rate refers to the errors due to misuse of the recognizer (wrong commands, absence of pause between words, etc.). The global error rate can be computed summing the numbers in the third and fourth columns of the table.
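A trivial sketch of this bookkeeping, with made-up percentages rather than the table's values:

```python
def global_error(recognizer_error, speaker_error):
    """Global error rate: recognizer errors (third column) plus
    speaker errors (fourth column), as stated above."""
    return recognizer_error + speaker_error

# Illustrative per-speaker percentages:
for rec, spk in [(2.0, 1.5), (4.0, 1.0)]:
    total = global_error(rec, spk)
    print(f"recognizer {rec}%, speaker {spk}%, "
          f"global {total}%, accuracy {100 - total}%")
```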
We can see that the recognizer's performances range from 90.5% to 95.6% of accuracy.

Insurance Company Reports Dictation

The experimentation was carried on by five different users, who dictated more than 8000 words. The vocabulary coverage on the dictated text was about 99%. Table 3 reports the results obtained in this case.

Table 3. Insurance company reports dictation

  Speaker   Error rate   Speaker's error rate
  sp1       1.9%         1.0%
  sp2       1.4%         0.5%
  sp3       14.0%        2.0%
  sp4       7.4%         1.5%
  sp5       2.4%         0.8%

References

[1] L.R. Bahl, F. Jelinek, R.L. Mercer, A Maximum Likelihood Approach to Continuous Speech Recognition, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. PAMI-5, no. 2, 1983, pp. 179-190.
[2] F. Jelinek, The Development of an Experimental Discrete Dictation Recognizer, Proceedings of the IEEE, vol. 73, no. 11, November 1985, pp. 1616-1624.
[3] L.R. Bahl, P.F. Brown, P.V. De Souza, R.L. Mercer, M.A. Picheny, Acoustic Markov Models Used in the Tangora Speech Recognition System, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, New York, April 1988.
[4] P. D'Orta, M. Ferretti, A. Martelli, S. Melecrinis, S. Scarci, G. Volpi, Large-Vocabulary Speech Recognition: a System for the Italian Language, IBM Journal of Research and Development, vol. 32, no. 2, March 1988, pp. 217-226.
[5] P. Alto, M. Brandetti, M. Ferretti, G. Maltese, S. Scarci, Experimenting Natural-Language Dictation with a 20000-Word Speech Recognizer, IEEE CompEuro 89, Hamburg, May 8-12, 1989, pp. 2-78 - 2-…
[6] R. Carlson, B. Granström, A Text-to-Speech System Based Entirely on Rules, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Philadelphia, PA, April 1976.
[7] D.H. Klatt, Structure of a Phonological Rule Component for a Synthesis-by-Rule Program, IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-24, no. 5, 1976, pp. 391-398.
[8] T.J. Sejnowski, C.R. Rosenberg, Parallel Networks that Learn to Pronounce English Text, Complex Systems, vol. 1, 1987, pp. 145-168.
[9] F. Jelinek, R.L. Mercer, Interpolated Estimation of Markov Source Parameters from Sparse Data, in "Pattern Recognition in Practice", E.S. Gelsema and L.N. Kanal, Eds., North-Holland, New York, 1980, pp. 381-397.
[10] M. Ferretti, G. Maltese, S. Scarci, Measures of Language Model and Acoustic Information in Probabilistic Speech Recognition, Eurospeech 89, Paris, September 1989, pp. 473-476.
[11] F. Jelinek, R.L. Mercer, L.R. Bahl, J.K. Baker, Perplexity: a Measure of Difficulty of Speech Recognition Tasks, 94th Meeting of the Acoustical Society of America, Miami Beach, December 1977.
