
Hardware Oriented Architectures for Continuous-Speech Speaker-Independent ASR Systems

Gian Carlo Cardarilli (1), Alessandro Malatesta (1), Marco Re (1), Luigi Arnone (2), Sara Bocchio (2)
(1) Dept. of Electronic Engineering, University of Tor Vergata, Rome, Italy
(2) ST Microelectronics, Advanced System Technology, Agrate Brianza (MI), Italy

Abstract

In this paper we focus on the design and development of high performance speech recognition systems. The main problem with state-of-the-art speech recognition software is the uneven balance between accuracy and speed: systems with a high level of accuracy tend to be extremely slow, while fast systems have a degree of accuracy unsuitable for most general purpose applications. We propose to speed up the system by performing the most computationally demanding tasks with dedicated hardware. We first analyse some commonly used ASR algorithms in order to choose the most suitable one. Then we develop a parallel hardware architecture that implements the selected algorithm. Finally we show how we described the proposed architectural model using the SystemC extension to the C++ programming language. Simulation results are given, along with reference tests made with HTK's speech recogniser for comparison purposes.

1. Introduction

There are many challenges to face in realizing a speech recognizer. The main goal is to obtain a system that matches the requirements of the application in which it will be used, both in speed and accuracy. Our focus is on continuous-speech speaker-independent automatic speech recognition (ASR) systems. This task requires great processing power and large quantities of memory. As a matter of fact, the main features of a good speech recognizer are a high level of accuracy along with the capability to work in real-time, in order to guarantee proper interactivity with the user. This last need can be satisfied using two different approaches: a) reducing the complexity of the system, thus losing part of the accuracy [10]; b) using more computational resources, thus raising the cost of the system [12-13-14]. Usually there will be a trade-off between these two choices. However, this trade-off generally produces systems with insufficient accuracy to be used in real applications, or systems with computational needs too high to run in real-time on platforms with relatively low processing power (such as portable systems). Our attempt aims to obtain additional low-cost processing power by developing application specific hardware. There are two main advantages in this approach: first, the recognizer will be able to run in real-time with more complex algorithms (with a resulting improvement in accuracy); second, we will be able to use high performance recognizers also on portable platforms with low processing power (avoiding the use of distributed systems, such as Distributed Speech Recognition [12], which can be used only when a proper network connection is available). In the following sections we will show the most common solutions used to realize ASR systems. Then we will extend one of these solutions to develop a model for a hardware architecture.

Figure 1 - HMM with left-right topology

2. Classic implementation of ASR systems

The most common approach for the design of ASR systems uses Hidden Markov Models [1] for acoustic modeling. In this approach every acoustic event (word, phoneme, syllable or whatever else is chosen by the designer to be the speech's building block in the system) is modeled using a left-right Markov chain like the one shown in Figure 1. The model's evolution in time is determined by transition probabilities between states (given by a fixed value associated to the HMM's arcs) and by emission probabilities from states (the values of which are determined at every time interval according to the coded speech). In the rest of this paper we will consider the common assumption that every HMM models a single phoneme (in our case a phoneme is a time-stationary acoustic event). Given a model set, obtained using iterative estimation methods (such as the Baum-Welch algorithm [1]) and large recorded speech databases, it is possible to generate a search space that obeys the rules given by a vocabulary and a language model. The search space is thus a network of phonetic HMMs connected by cross-model transitions. Both the vocabulary and the language model apply their rules by adding constraints to these cross-model transitions, constraining, in that way, the possible phoneme sequences. After generating the search space, the only action to perform in order to recognize a speech segment is to identify the state sequence that best matches the observation sequence. The latter is extracted from the speech segment itself using a signal processing front-end. The search is done using the Viterbi algorithm [2], according to which the probability of a state j in the search space is computed iteratively at each time interval (usually referred to as a frame) using the following formula:

$$\phi_t(j) = \max_i \left[ \phi_{t-1}(i)\, a_{ij} \right] b_j(O_t)$$

Here $\phi_t(j)$ is the probability of state $j$ at time $t$, $a_{ij}$ is the transition probability from state $i$ to state $j$, and $b_j(O_t)$ is the emission probability of state $j$ given the observation $O_t$ at time $t$. Usually the emission probability density functions are modeled using Gaussian mixtures.

Figure 2 - Classic ASR System Structure

In that case the emission probability computation is done according to the following formula:
$$b_j(O_t) = \sum_{m=1}^{N_{MIXES}} c_{jm}\, N\!\left[O_t, \mu_{jm}, \Sigma_{jm}\right]$$

Here $N_{MIXES}$ is the number of Gaussian components in the mixture, $c_{jm}$ are the mixture weight coefficients and $N[O_t, \mu_{jm}, \Sigma_{jm}]$ are multidimensional Gaussian PDFs with mean $\mu_{jm}$ and covariance matrix $\Sigma_{jm}$ (diagonal in most systems) evaluated in $O_t$. During the search process the most relevant paths through the search space are recorded as state sequences. This allows, at the end of the search process, the selection of the best path via backtracking. A representation of an ASR system implemented with the method just described is shown in Figure 2. The system is made of different parts, representable as cascaded processing blocks. The first block is the feature extraction front-end (many systems use Mel-Frequency Cepstral Coefficients [6]). The observations generated by the front-end are then passed to the search block. This block evaluates the evolution of the search space using the observations and applying the Viterbi algorithm to each state. At the end of the search process the best state sequence is obtained by performing backtracking from the end state of the search space with the highest probability. Alternatively it is possible to obtain one or more recognition hypotheses using more complex postprocessing algorithms such as the stack decoding algorithm [3]. For example, the Sphinx 2 ASR system, developed by Carnegie Mellon University [4-10], has a structure similar to the one just described. This system, though it has good recognition accuracy, is characterized by extremely long execution times (over 10x real-time) [4-8-10]. In the next section we will introduce an alternative methodology to realize ASR systems. This methodology uses a different formulation of the Viterbi algorithm known as the Token Passing algorithm.
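To make the two formulas above concrete, the following C++ fragment sketches how they are typically evaluated in practice, working in the log-probability domain (as most recognizers do, to avoid numerical underflow). The data layout and all names are our own illustrative assumptions, not taken from any specific system:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical layout for one Gaussian mixture with diagonal covariance.
struct Mixture {
    std::vector<double> logWeight;           // log(c_jm), one per component
    std::vector<std::vector<double>> mean;   // mu_jm, one vector per component
    std::vector<std::vector<double>> var;    // diagonal of Sigma_jm
    std::vector<double> gconst;              // precomputed log((2*pi)^D * |Sigma_jm|)
};

// log b_j(O_t): log-sum of the weighted diagonal Gaussians evaluated at obs.
double logEmission(const Mixture& mix, const std::vector<double>& obs) {
    double logB = -std::numeric_limits<double>::infinity();
    for (std::size_t m = 0; m < mix.logWeight.size(); ++m) {
        double e = mix.gconst[m];
        for (std::size_t d = 0; d < obs.size(); ++d) {
            const double diff = obs[d] - mix.mean[m][d];
            e += diff * diff / mix.var[m][d];
        }
        double logTerm = mix.logWeight[m] - 0.5 * e;  // log(c_jm * N[O_t, mu, Sigma])
        // log-add: logB = log(exp(logB) + exp(logTerm)), keeping the larger term first
        if (logTerm > logB) std::swap(logB, logTerm);
        if (logTerm > -std::numeric_limits<double>::infinity())
            logB += std::log1p(std::exp(logTerm - logB));
    }
    return logB;
}

// One Viterbi step for state j: phi_t(j) = max_i [phi_{t-1}(i) * a_ij] * b_j(O_t),
// carried out in the log domain as a max over sums.
double viterbiStep(const std::vector<double>& logPhiPrev,
                   const std::vector<std::vector<double>>& logA,
                   std::size_t j, double logBj) {
    double best = -std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < logPhiPrev.size(); ++i)
        best = std::max(best, logPhiPrev[i] + logA[i][j]);
    return best + logBj;
}
```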

3. Token Passing based ASR systems

The Token Passing model [5] implements the Viterbi algorithm by reorganizing the recognizer's structure into distinct levels. In this model each HMM state j is able to hold, at each frame, one or more objects called tokens. Each token contains an alignment probability $\phi_t(j)$ between a partial observation sequence $\{O_1, \dots, O_t\}$ and a partial state sequence that ends in state j at time t. A token can be passed from one state to another if there is a transition connecting the states. In this process the token's probability is modified according to the transition's score. With these hypotheses we can replace the iterative formulation of the Viterbi algorithm with the following steps, applied for each state in every HMM:

1. Pass a copy of every token held in the current state to every state towards which a transition exists. During the token passing from state i to state j, update the token's alignment probability with the transition probability $a_{ij}$ and the emission probability $b_j(O_t)$.
2. Reorder the token list in each state according to the alignment probability. Keep only the best N tokens and discard the rest.

With this algorithm, the structure of the ASR system becomes the one shown in Figure 3. The feature extraction front-end is the same used in the classical implementation, but the search space is now broken into single HMM models that are evaluated independently according to the Token Passing algorithm. This segmentation of the search space is one of the features of the Token Passing approach that will allow us to introduce parallelism in our final architecture. Another facet is that the cross-model transitions are handled by a dedicated module that controls the token flow towards the first state of every HMM. The tokens exiting from the last state of each HMM are passed to the token flow control module that, applying the constraints imposed by the vocabulary and the language model, selects the tokens to be passed to the first state of each HMM and applies the needed corrections to the tokens' alignment probabilities. Finally, the token flow control module also serves the task of generating and storing a lattice of recognized words along with their start and end times. From this lattice, via backtracking or other postprocessing methods, the final recognition hypothesis will then be extracted.
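A minimal sketch of the two steps above, under our own hypothetical data layout (history links and beam pruning omitted); the log emission scores are assumed to come from a routine like the one sketched in Section 2:

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// A token carries the log alignment probability of one partial path ending
// in the state that currently holds it (word-history links omitted here).
struct Token {
    double logProb;
};

struct State {
    std::vector<Token> tokens;  // token list held by this state
};

// One frame of token passing inside a single HMM:
// step 1 propagates a copy of every token along each existing transition
// i -> j, scoring it with log a_ij + log b_j(O_t);
// step 2 keeps only the best N tokens in each destination state.
void tokenPassingFrame(const std::vector<State>& prev, std::vector<State>& next,
                       const std::vector<std::vector<double>>& logA,
                       const std::vector<double>& logB,  // log b_j(O_t), one per state
                       std::size_t maxTokens) {
    const double kNoArc = -std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < prev.size(); ++i)
        for (std::size_t j = 0; j < next.size(); ++j) {
            if (logA[i][j] == kNoArc) continue;  // no transition i -> j
            for (const Token& t : prev[i].tokens)
                next[j].tokens.push_back({t.logProb + logA[i][j] + logB[j]});
        }
    for (State& s : next) {
        std::sort(s.tokens.begin(), s.tokens.end(),
                  [](const Token& a, const Token& b) { return a.logProb > b.logProb; });
        if (s.tokens.size() > maxTokens)
            s.tokens.resize(maxTokens);  // discard all but the best N
    }
}
```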

Figure 3 - Token Passing Based System (speech → feature extraction front-end → HMM instances list ⇄ token flow control → backtracking and postprocessing → recognition hypothesis; acoustic, lexical and language models as inputs)
The main characteristic of this method is to separate the recognition task into several distinct sub-tasks: the acoustic processing (token flow control through the HMM's internal states), the lexical processing (token flow control on transitions between different HMMs belonging to the same word) and the grammatical processing (token flow control on transitions between word boundary models). This subdivision makes the system more versatile and greatly simplifies data handling. The structure of a Token Passing based system is shown in Figure 3. A similar structure is used in the recognition tool HVite that comes with the HTK toolkit [6], developed by Cambridge University Engineering Department and Entropic Research Laboratory Inc.

4. Hardware oriented analysis of ASR algorithms

In order to develop a hardware architecture that realizes an ASR system (or part of it), we analysed in greater detail the algorithms presented in the previous sections. In the classic Viterbi search architecture the system needs to access memories that hold the acoustic models' parameters, the vocabulary and the language model. The probability of each state in the search space is held in another storage element that also contains the information on the topology of the state network. The main problem of this last structure is that its size grows rapidly as the number of words in the vocabulary increases (for example, with 3-state HMMs and a dictionary containing 20000 words, the search space can easily contain more than 300000 states; see the estimate below). During the recognition process the temporal evolution of the search space is recorded in another memory in the form of a word lattice. Such a speech recognizer is characterized by a strictly sequential behavior: the use of a search space built as a single net does not allow for clustering the states into blocks that can be processed in parallel. Moreover the memory is centralized and its access is controlled by the search block. This implies the use of large memories, with consequently high latency in data access. As regards the HMM parameters, it is usually possible to reduce their size by using shared-mixture or tied-state models [7]. Unfortunately, using model sets with shared components does not solve the data access latency problem, and it also introduces a loss in the recognizer's accuracy. The information on the search space topology and on the state probabilities is likewise held in a single memory, raising the same data access latency problems described above. At this point we analyse the specific properties of the Token Passing method. For this analysis we will focus on HTK's HVite implementation. As we already said, in this implementation the search space is segmented into clusters of states, each cluster corresponding to an HMM. From the point of view of a hardware implementation this feature enables us to parallelize the operations to be performed on each cluster, thus considerably speeding up the evaluation process. Obviously this would not be possible with a single search space containing all the states, as in the classic Viterbi search.
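As a rough sanity check of that estimate (assuming, purely for illustration, an average word length of five phonemes):

$$\underbrace{20000}_{\text{words}} \times \underbrace{5}_{\text{phonemes/word}} \times \underbrace{3}_{\text{states/phoneme}} = 300000 \ \text{states}$$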
State clustering also makes it possible to distribute the memory containing the models' parameters amongst the separate processing units, thus reducing data access latency. The latency can be reduced also for the access to the states' probabilities: these probabilities are now handled as a data flow (a token flow in our case) rather than stored in a centralized memory. A software sketch of this organization is given below.
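In software terms, this organization amounts to one worker per HMM cluster, each owning a private copy of its model parameters. The following is only an assumption-laden illustration of the parallelism, not the hardware design itself; all names are hypothetical:

```cpp
#include <thread>
#include <vector>

// Hypothetical per-cluster parameter block: transition matrix and
// mixture parameters for one phoneme model.
struct HmmParams { /* ... */ };

struct HmmCluster {
    HmmParams local;  // private copy: no contention on a centralized memory
    void evaluateFrame(const std::vector<double>& obs) {
        // internal token passing for this HMM only, reading `local`
        (void)obs;
    }
};

// One frame: every cluster is evaluated independently and in parallel,
// mirroring the array of coprocessors with local model memories.
void evaluateAll(std::vector<HmmCluster>& hmms, const std::vector<double>& obs) {
    std::vector<std::thread> workers;
    workers.reserve(hmms.size());
    for (HmmCluster& h : hmms)
        workers.emplace_back([&h, &obs] { h.evaluateFrame(obs); });
    for (std::thread& w : workers)
        w.join();
}
```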

Figure 4 - HTK's recognizer structure (feature extraction front-end → token passing → backtrace/postprocessing)

Figure 5 - Hardware architecture for Token-Passing based ASR system (feature extraction front-end feeding an array of HMM model blocks, each with an acoustic model local memory; token flow control; backtracking and postprocessing with lexical and language models)
On the basis of these remarks we decided to use the Token Passing model as a starting point to define an architecture that implements, with some proper changes, the system shown in Figure 3.

5. Implementing the token passing algorithm in hardware: a possible architecture

Starting from HTK's structure, we aimed to develop part of the system in hardware in order to obtain a speaker independent continuous speech recognizer capable of running in real-time even with complex acoustic, lexical and language models (with a consequent improvement in accuracy). This choice is based on the results of several tests made on both HTK and Sphinx 2. According to these results, most of the processing time (more than 50%) needed for a typical speech recognition session is spent on HMM evaluation (mainly to evaluate emission probabilities) and on feature extraction [8]. The time spent on these two tasks grows rapidly as the acoustic models become more complex. Hence a hardware system that implemented these operations in a considerably shorter time than the corresponding software implementation would greatly reduce the overall recognition time. A simplified structure of HTK's recognizer is shown in Figure 4. On the basis of our previous remarks we define an architecture for the HMM evaluation block (the one marked in the picture), leaving the detailed implementation of the remaining blocks of the system to our future work. Notice that the hardware implementation of the blocks that make up the feature extraction front-end has already been dealt with in other papers [9]. The proposed architecture is shown in Figure 5. Its structure has been derived from the block diagram in Figure 3. The system is made of an array of parallel HMM hardware coprocessors. The model parameters have been distributed amongst the coprocessors and stored in local memories in order to speed up data retrieval. The language model evaluation has been moved to a later stage of processing (for example a CPU): the token flow control block will apply only the lexical constraints and will generate the word lattice, storing it in an external memory. The grammatical constraints will be applied during the postprocessing step. This solution avoids storing in the hardware block the statistical language models that, for very large vocabularies, need huge amounts of memory, and gives the flexibility to store different grammars and vocabularies in the postprocessing layer. A similar approach using FPGA technology is described in [15], but it carries all the drawbacks of FPGAs and, moreover, requires FPGA reconfiguration every time a new grammar and vocabulary is loaded. The following section shows how we developed a preliminary SystemC model for the architecture described in the first part of the paper. Simulation results will also be shown.

6. A model for an ASR core: a phonetic processor

In order to test the architecture shown in Figure 5 we developed a SystemC architectural model of the system. We chose SystemC as our modeling language because it allowed us to exploit parallelism while describing the algorithms at a behavioral level. Furthermore, SystemC simulations are significantly faster than models described with Matlab or HDL. Finally, by following some simple programming rules, it is possible to directly convert SystemC code into synthesizable HDL using tools such as Synopsys' SystemC Compiler.
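For illustration, a minimal SystemC skeleton of one HMM coprocessor in this modeling style might look as follows. The port list, data types and update rule are our own simplifying assumptions, not the actual model described in the paper:

```cpp
#include <systemc.h>

// Skeleton of one HMM coprocessor: on every clock edge it consumes an
// observation score and an incoming token and produces an exit-state token.
SC_MODULE(HmmProcessor) {
    sc_in<bool>    clk;
    sc_in<double>  obs_score;   // simplified: one precomputed score per frame
    sc_in<double>  token_in;    // token entering the model's first state
    sc_out<double> token_out;   // token leaving the model's last state

    double state_score[3];      // three emitting states (left-right topology)

    void evaluate() {
        // placeholder for the real per-frame internal token passing update
        token_out.write(state_score[2] + obs_score.read() + token_in.read());
    }

    SC_CTOR(HmmProcessor) {
        for (double& s : state_score) s = 0.0;
        SC_METHOD(evaluate);
        sensitive << clk.pos();
    }
};
```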
The model we realized is actually a simplified version of the architecture shown in Figure 5. It is basically a phone recognizer with the structure shown in Figure 6: an input block that reads observation vectors from an HTK-format observation file (previously obtained by MFCC coding of speech via HTK's HCopy tool), an array of HMM processors (one for each of the 33 phonemes in the modelset used) that evaluate the HMMs' internal transitions, and a token flow control module (TFC) that controls the feedback of tokens to the first state of each HMM processor and outputs the recognition results as phone sequences.

Figure 6 - Hardware architecture of the tested phonetic recognizer (input module → HMM processor array ⇄ TFC module, with a modelset memory)
There is also a modelset memory module that loads the acoustic models from file (HTK's Master Model File format) and, during the system's initialization, loads them into the local memories of the HMM processors. In this configuration, at each frame an observation is passed to all the HMM processors running in parallel. The results are then collected by the token flow control module, which selects the tokens to be passed back to the first state of each HMM for the next frame. The internal architecture of the HMM processor is shown in Figure 7. In our case we are using a left-right topology with three emitting states for each HMM. As we can see from the picture, the three emitting states are processed in parallel by dedicated modules. All the token data is stored in a shared token buffer. After internal token passing evaluation, the results are collected by the exit token module that performs external token passing (towards the token flow control module). The local model memory is used in the initialization phase to load the proper model from the external model memory and distribute the data to the emitting states and the exit state. As we stated previously, the token flow control module should also apply lexical constraints. Though in this preliminary model the system's output is a phone sequence instead of a word sequence, we introduced a new kind of constraint in order to control the phone sequences according to a dictionary. A cross-phone transition matrix was generated from the phone sequences contained in a pronunciation dictionary using HTK's HLStats tool. Every time a token is passed between two HMMs, the corresponding cross-model transition score is applied to the token's probability. The following section describes the testing environment and the results obtained.

7. Tests and Results

The system described in Section 6 has been tested on a database of 200 isolated Italian words spoken by 7 male speakers and 6 female speakers. The modelset used for the test is made of HMMs with left-right topology (the same as Figure 1), three emitting states and a single-Gaussian probability density function for the emission probabilities. The modelset has been trained with HTK on the same database used for testing. Before testing our system we evaluated the recognition accuracy of the modelset on the testing database using HTK's HVite. Note that recognition accuracy is expressed in terms of the percentage of correctly recognized phones (%Correct) and the Phone Error Rate (PER). Given that N is the total number of phones in the test set, H the number of phones correctly recognised, D the number of deletions, S the number of substitutions and I the number of insertions, these quantities are defined in the following way:

$$\%Correct = \frac{H}{N} \times 100 \qquad PER = \frac{D + S + I}{N} \times 100$$
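For reference, the two metrics translate directly into code; this tiny helper is our own illustration, not an HTK routine:

```cpp
struct PhoneScores {
    double percentCorrect;  // %Correct = H / N * 100
    double per;             // PER = (D + S + I) / N * 100
};

// H = correctly recognized phones, D = deletions, S = substitutions,
// I = insertions, N = total phones in the reference transcription.
PhoneScores score(int H, int D, int S, int I, int N) {
    return {100.0 * H / N, 100.0 * (D + S + I) / N};
}
```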

Running the test with the same dictionary used for training, HVite obtained %Correct = 97% and PER = 93.52%. However, these results are hardly comparable to the ones given by our test system, mainly because HTK can recover a large number of errors by applying the dictionary constraints to the hypotheses given by the acoustic models. In order to make HTK behave like our system, we repeated the test using a dictionary in which single phones are used instead of words.

Figure 7 - Hardware architecture of a single HMM processor (local model memory, three emitting states, exit state, shared token buffer)

With this setup HTK gave %Correct = 79.23% and PER = 54.67%. At this point, having obtained a reference performance level, we could run the test with our architectural SystemC model. The system's output resulted in %Correct = 79.23% and PER = 54.67%. Compared to the performance of HVite without a dictionary, our system has almost the same level of accuracy. On the other hand, we can see that HVite with the use of a dictionary definitely outperforms our system given the same acoustic accuracy. Therefore we can be confident that, by providing a good lexical/grammatical back-end to our architecture, it is possible to reach the same level of accuracy as HTK's HVite. The development of such a back-end, along with the porting of the tested subsystems towards hardware platforms, will be the subject of our future work.

Conclusions

In this paper we first showed the main approaches used to realize HMM-based speech recognition systems, with the main focus on continuous-speech speaker-independent systems with large vocabularies. In particular we found that using the Token Passing algorithm allows for an efficient hardware implementation of the recognizer. We then proposed a hardware architecture derived from an existing Token-Passing based speech recognition software package (namely HTK), since, after an analysis of different design choices, we concluded that custom parallel hardware that performs decoding is the most promising solution. We finally showed our attempt to validate our architecture by describing it in a high-level language. We chose SystemC because it enables the designer to describe the system at a behavioral level while exploiting parallelism and emulating hardware-like communication between subsystems. Furthermore, following some simple coding guidelines, it is possible, by using commercial tools, to directly convert the SystemC code into an HDL for fast prototyping or ASIC production. So far we have modeled only part of our architecture, the phonetic recognizer, in order to test only the acoustic accuracy of our system, without further constraints. The recognition tests have been set up for both our model and HTK, for comparison purposes. The results showed that the SystemC model performed with the same accuracy as HTK running without a dictionary. Nevertheless we saw that, by adding a dictionary, HTK showed a sharp improvement in performance. Therefore we can state that the acoustic accuracy of our model is a suitable starting point to obtain a recognition accuracy similar to that of HTK's HVite. The next step will be to extend our model with lexical and grammatical models in order to improve accuracy. Meanwhile the validated subsystems, like the phonetic recognizer shown in this paper, will be converted to HDL using a tool like Synopsys' SystemC Compiler, and then synthesized onto a fast prototyping hardware platform (FPGA based, for example) so that the processing speed improvement with respect to software solutions like HTK can be evaluated. Finally, improvements in acoustic accuracy can be pursued through changes in the core architecture.

References

[1] Lawrence R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
[2] A. J. Viterbi, Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm, IEEE Transactions on Information Theory, Vol. IT-13, pp. 260-269, April 1967.
[3] R. Schwartz and Y. L. Chow, The Optimal N-Best Algorithm: An Efficient Procedure for Finding Multiple Sentence Hypotheses, IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1990.
[4] Kai-Fu Lee, Automatic Speech Recognition: The Development of the SPHINX System, Kluwer Academic Publishers, 1989.
[5] S. J. Young, N. H. Russell, J. H. S. Thornton, Token Passing: A Simple Conceptual Model for Connected Speech Recognition Systems, Cambridge University Engineering Department, July 31, 1989.
[6] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, V. Valtchev, P. Woodland, The HTK Book (for HTK Version 3.1), Cambridge University Engineering Department, December 2001.
[7] M. Hwang, X. Huang, Shared-Distribution Hidden Markov Models for Speech Recognition, IEEE Transactions on Speech and Audio Processing, Vol. 1, No. 4, October 1993.
[8] A. Malatesta, Studio e definizione di architetture hardware per riconoscitori vocali per parlato continuo e vocabolario esteso (Study and definition of hardware architectures for continuous-speech, large-vocabulary speech recognizers), Thesis in Electronics Engineering, University of Tor Vergata, Rome, July 2003.
[9] N. Kumar, W. Himmelbauer, G. Cauwenberghs, A. G. Andreou, An Analog VLSI Chip with Asynchronous Interface for Auditory Feature Extraction, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, USA.
[10] Mosur K. Ravishankar, Efficient Algorithms for Speech Recognition, Ph.D. Thesis, School of Computer Science, Computer Science Division, Carnegie Mellon University, May 1996.
[11] J. M. Huerta, Speech Recognition in Mobile Environments, Dept. of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, April 2000.
[12] M. Ravishankar, Parallel Implementation of Fast Beam Search for Speaker-Independent Continuous Speech Recognition, Technical report, Computer Science and Automation, Indian Institute of Science, Bangalore, India, April 1993.
[13] J. S. Reeve, A Parallel Viterbi Decoding Algorithm, Department of Electronics and Computer Science, University of Southampton, July 2000.
[14] Rajeev Dujari, Parallel Viterbi Search Algorithm for Speech Recognition, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, February 1992.
[15] S. J. Melnikoff, S. F. Quigley, M. J. Russell, Implementing a Simple Continuous Speech Recognition System on an FPGA, IEEE Symposium on Field-Programmable Custom Computing Machines, 2002.
