


Automatic speech recognition systems: challenges and recent implementation trends

Davinder Pal Sharma* and Jamin Atkins


VLSI Research Laboratory,
Department of Physics,
The University of the West Indies,
St. Augustine, Trinidad and Tobago
Email: Davinder.Sharma@sta.uwi.edu
Email: Jamin.Atkins@sta.uwi.edu
*Corresponding author

Abstract: Speech recognition is one of the next-generation technologies for human-computer interaction. Speech recognition has been researched since the late 1950s, but its progress has been impeded by its computational complexity and the limited computing capabilities of past decades. In laboratory settings, Automatic Speech Recognition (ASR) systems have achieved high recognition accuracies, which tend to degrade in real-world environments. This paper analyses the basics of the speech recognition system. Major problems faced by ASR in real-world environments are discussed, with particular focus on the techniques used in the development of noise-robust ASR. Throughout the years there have been different implementation media for ASR, but Field Programmable Gate Arrays (FPGAs) seem to provide a unique advantage for the implementation of Digital Signal Processing (DSP) systems and, by extension, ASR systems.

Keywords: ASR; automatic speech recognition; DSP; digital signal processing; FPGA; field
programmable gate array; Matlab; FFT; fast Fourier transform.

Reference to this paper should be made as follows: Sharma, D.P. and Atkins, J. (2014)
‘Automatic speech recognition systems: challenges and recent implementation trends’,
Int. J. Signal and Imaging Systems Engineering, Vol. 7, No. 4, pp.220–234.

Biographical notes: Davinder Pal Sharma is a Lecturer of Electronics at the University of the West Indies, Trinidad and Tobago. He received his PhD from Guru Nanak Dev University, India, in 2004. He has 14 years of teaching and research experience. He is the author of the book Digital Signal Processing and has more than 25 publications in reputed international/national journals and conferences to his credit. His research areas include digital signal processing, data communication, VLSI (very large scale integration) implementation of DSP algorithms, fuel cell modelling and smart grid systems. He is a member of IEEE.

Jamin Atkins received his BSc degree in Computer Science from the University of the West Indies in 2009. He is currently pursuing a PhD in the Department of Physics of the same institution, where his research involves the removal of babble noise from clean speech. His other academic interests are VLSI implementation of speech recognition systems and statistical signal processing. He is a student member of IEEE.

1 Introduction

Automatic Speech Recognition (ASR) has traditionally been defined as the ability of a computer system to recognise speech. A more recent and widely accepted description is given by Jurafsky and Martin (2000) and Rabiner and Schafer (2007), who define ASR as the design and operation of systems that map acoustic signals to a string of words. The overall goal of an ASR system is to function accurately independent of the microphone used, the speaker's accent and the acoustic environment. Most of these issues have not been solved (Raj and Stern, 2005; Huang et al., 2001). Despite its flaws, ASR is an established technology with integration into mobile phone platforms, automobile stereo systems and many other devices. It allows hands-free interaction within environments where attention must be directed to another task (Sharma and Atkins, 2010). A typical ASR system is shown in Figure 1. It consists of a frontend that accepts a speech waveform as input and extracts excitation features, also known as observations, which represent the information required to perform recognition. The extracted features are usually Mel-Frequency Cepstral Coefficients (MFCCs), although other features such as Perceptual Linear Predictive (PLP) coefficients are also used. The backend attempts to match the speech utterance to previously stored acoustic models, using the language model and dictionary blocks to constrain the possible decoded output to words within the dictionary that follow the chosen language model. The language model is developed by training the system to detect the most likely word sequences. The pattern matching block makes use of the Viterbi algorithm to perform recognition using a set of phone-level statistical acoustic models known as Hidden Markov Models (HMMs; Melnikoff et al., 2002b; Rabiner, 1989). This is done by calculating the maximum likelihood of the words that could have produced the input feature sequence. Recognition output from the pattern matching block is sent to a decision rule block, which further refines the recognition output and attempts to maximise recognition accuracy by limiting the possible output based on a priori statistics.

Figure 1 Basic speech recognition system (Rodríguez-Andina et al., 2001)


The rest of the paper is organised as follows: Section 2 describes the importance of the frontend, with emphasis on MFCCs as the feature of choice; Section 3 discusses the basic constituent parts of the backend of the speech recognition system and how they relate to each other; Section 4 outlines the problems faced by ASR systems, while solutions to those problems are presented in Section 5, with particular attention to the problem of background noise. An overview of implementation trends for ASR systems is given in Section 6. Section 7 discusses Field Programmable Gate Array (FPGA) based implementations of ASR systems and their advantages over traditional processing systems. FPGA-based implementation results for a Fast Fourier Transform (FFT) engine, which is required for feature extraction, are presented in Section 8.

2 The frontend of ASR system

In order for recognition to take place, the analog speech signal must be converted not only into a digital format but also into one in which it is possible to differentiate between the fundamental speech components. Speech can be represented in the phonetic domain through the use of a set of fundamental components known as phonemes. For most languages the number of phonemes ranges from 32 to 64. As such, a Digital Signal Processing (DSP) frontend is required for analog to digital conversion, spectral analysis and feature extraction, as shown in Figure 2. The reliability of frontend analysis is critical to the performance of the speech recognition system; for example, the frontend of a speaker-independent speech recognition system should be able to extract features that are based solely on the message conveyed by the speech signal, with no bearing on the speaker's accent, unique voice or acoustic environment.

Figure 2 Speech recognition frontend processing (Huang et al., 2001)

The frontend is responsible for digital conversion of the speech signal, environmental noise filtering and feature extraction. In order to perform its function, this block segments the input signal into frames of 20–30 ms in length. This length is usually chosen because small segments of the waveform can be considered stationary over such intervals, which allows short-time analysis of the speech signal (Rabiner and Schafer, 2007). The frames are then multiplied by a windowing function, usually a Hanning window, whose equation is given by:

h[n] = \begin{cases} \frac{1}{2}\left(1 - \cos\frac{2\pi n}{N}\right), & 0 \le n \le N \\ 0, & \text{otherwise} \end{cases}   (1)

The result of this multiplication is the transformation of the frame into a bell-shaped curve, as shown in Figure 2.
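As an illustration of this framing and windowing step, the following sketch (Python with NumPy; the 16 kHz rate and 30 ms/10 ms framing are the values quoted later in this section, and the function name is ours) segments a signal into overlapping frames and applies the window of equation (1):

    import numpy as np

    def frame_and_window(x, fs=16000, frame_ms=30, shift_ms=10):
        """Split signal x into overlapping frames and apply a Hanning window."""
        N = int(fs * frame_ms / 1000)   # samples per frame (30 ms -> 480)
        L = int(fs * shift_ms / 1000)   # inter-frame shift (10 ms -> 160)
        n = np.arange(N)
        h = 0.5 * (1.0 - np.cos(2.0 * np.pi * n / N))   # equation (1)
        frames = [x[s:s + N] * h for s in range(0, len(x) - N + 1, L)]
        return np.array(frames)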

(FPGAs) based implementations of ASR systems and their
advantages over traditional processing systems. The FPGA- The result of this multiplication is the transformation of the
based implementation results of a Fast Fourier Transform window into a bell-shaped curve, as shown in Figure 2. While
(FFT) engine, which is required for feature extraction, are the performance of the frontend is paramount to recognition
presented in Section 8. accuracy, the time taken for frontend processing must be
smaller than the frame overlap used in processing (Gomes-
Cipriano et al., 2004). A thorough review of the algorithms
2 The frontend of ASR system currently associated with frontend signal processing and
analysis, which include Linear Predictive Coding (LPC),
In order for recognition to take place, there is a need to Mel-Frequency Cepstral Coefficients (MFCCs), Relative
convert the analog speech signal not only into a digital Spectral Analysis (RASTA) as well as different variants of
format but also into one in which it is possible to Perceptual Linear Prediction (PLP), is available in literature
differentiate between the fundamental speech components. (Rabiner and Schafer, 2007; Huang et al., 2001; Rabiner
222 D.P. Sharma and J. Atkins

and Schafer, 2011; Anusuya and Katti, 2010; Davis and where N represents the total number of samples in the 10 ms
Mermelstein, 1980). window and n represents the position of the sample under
Of these feature extraction techniques, MFCC and consideration. The result of the Hamming window is a bell-
RASTA-PLP are the most commonly used due to their ability shaped transform of the input data.
to provide more robust features in adverse conditions. The FFT block accepts the windowed frame output by
the Hamming block and calculates the Discrete Fourier
2.1 Mel-frequency cepstral coefficients Transform (DFT) from the input data. This block is
responsible for the conversion of the input signal from the
The aim of the frontend processing system is emulate the time to the frequency domain. The most popular algorithm
human auditory system. As such, the frontend makes use of for performing the FFT is the Cooley–Tukey algorithms,
much signal processing in order to match the frequency which use the divide and conquer approach to recursively
response of the eardrum. When sound waves hit the compute the DFT on the input signal. The DFT of the
eardrum, different sections vibrate based on the constituent input signal is of particular importance as the amount
frequencies of the incident sound wave (Allen, 1994). The of energy at each frequency is unique to each phoneme.
non-linear sensitivity of the auditory system is modelled by Early implementation of speech recognition attempted to
the use of the Mel-frequency scale, which relates input recognise speech directly from the DFT of phonemes. The
frequency to perceived pitch and defined as: size of the FFT is performed usually 512 or 1024 points.
 f  One approach to simulating the subjective Mel spectrum
mels  1125  1   (2) is by the use of a filter bank with a triangular band pass
 700  frequency response and constant Mel-frequency interval
Davis and Mermelstein (1980) presented this method of spacing. The filter bank is made up of about 24 filters
feature extraction and defined the MFCC as a representation (Vu et al., 2010). The Mel filter bank is important as it
defined by the real cestrum of a windowed short-time signal has a frequency response similar to that of the human ear. In
derived from the FFT of the signal, while Zolnay et al. this filter bank, more emphasis is given to the lower
(2003) show that MFCC representation is beneficial to frequencies with a log decrease in resolution as the
speech recognition due the fact that the human perception of frequencies increase. The process illustrated in Figure 3
frequency is not linear. The block diagram of frontend of uses a short-time Fourier analysis on pre-windowed (via the
ASR system is given in Figure 3. use of the Hamming window) frames resulting in the
With the exception of the analog to digital converter calculation of the DFT of each frame λ, via the use of the
(ADC block), all of the other block of the feature extracting FFT. The values are centred on constant Mel-frequency
frontend can be implemented in software. The output of the interval spacing, given in equation (2), which cluster at the
last five blocks is the result of a computation, which is low end of the scale and spread out at larger frequencies. A
performed on the waveform values generated by the analog Mel-frequency filter bank of M (m = 1,2,…,M) filters can be
to digital converter. defined as:
The feature extraction process illustrated in Figure 3

involves the processing of multiple, short overlapping 
frames of speech, which is shown in Figure 2. Each frame is 0 k  f  m  1 ;k  f  m  1

usually a length of 30 ms with a 10 ms overlap usually  k  f  m  1
sampled at 16 kHz, the minimum sampling frequency H m k    f  m  1  k  f  m  (4)
possible since speech may contain components with  f  m   f  m  1
 f m 1  k
frequencies as high as 8 kHz. The abrupt truncation of the     
f  m   k  f  m  1
signal into 30 ms segments results in spectral leakage in the  f  m  1  f  m 

frequency domain. A Hamming window is used to minimise
this leakage before converting to the frequency domain. The where k is the k-th term in the DFT of the input frame.
windowed signal is given by The DFT is defined as:
N 1
 2 n  X  k   x  n  e  j 2 n / N
w  n   0.54  0.46 cos   (3) 0k  N (5)
 N 1  n 0

Figure 3 Block diagram of frontend of an ASR system (Huang et al., 2001) (see online version for colours)
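One way to realise such a filter bank in software is sketched below (Python/NumPy). The 24-filter count and the Mel mapping of equation (2) come from the text; the FFT size, sampling rate and bin-mapping details are illustrative assumptions:

    import numpy as np

    def mel_filter_bank(M=24, n_fft=512, fs=16000):
        """Triangular filters on a constant Mel-interval grid, equations (2) and (4)."""
        mel = lambda f: 1125.0 * np.log(1.0 + f / 700.0)        # equation (2)
        inv_mel = lambda m: 700.0 * (np.exp(m / 1125.0) - 1.0)  # its inverse
        # M + 2 equally spaced points on the Mel scale, mapped back to FFT bins
        mel_pts = np.linspace(0.0, mel(fs / 2.0), M + 2)
        f = np.floor((n_fft + 1) * inv_mel(mel_pts) / fs).astype(int)
        H = np.zeros((M, n_fft // 2 + 1))
        for m in range(1, M + 1):
            # Rising slope from f(m-1) to f(m), falling slope from f(m) to f(m+1)
            H[m - 1, f[m - 1]:f[m]] = (np.arange(f[m - 1], f[m]) - f[m - 1]) / max(f[m] - f[m - 1], 1)
            H[m - 1, f[m]:f[m + 1]] = (f[m + 1] - np.arange(f[m], f[m + 1])) / max(f[m + 1] - f[m], 1)
        return H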
The log energy output of each of the M filters is then calculated as:

S[m] = \ln\left( \sum_{k=0}^{N-1} |X[k]|^2 H_m[k] \right), \quad 0 \le m < M   (6)

The log Mel spectrum is then converted back to a time cepstrum through the use of the Discrete Cosine Transform (Ahmed et al., 1974), resulting in the M MFCC coefficients given by:

c[n] = \sum_{m=0}^{M-1} S[m] \cos\left( \frac{\pi n (m + 0.5)}{M} \right), \quad 0 \le n < M   (7)

The value of M is usually between 10 and 15, with 13 being the most common number (Huang et al., 2001). The delta and acceleration coefficients of these are also taken, resulting in a 39-dimensional vector for each speech frame. This process is usually carried out on speech frames of 30 ms with 10 ms overlap. Huang et al. (2001), Patel and Roa (2010), Staworko and Rawski (2010) and Gracuarena et al. (2004) provide a more in-depth review of the MFCC computation.
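Putting equations (5)–(7) together, the closing stages of the MFCC computation can be sketched as follows (Python/NumPy; the windowed frame and filter bank H are assumed to come from the steps above, the small floor inside the logarithm is a practical safeguard, and the 13 retained coefficients follow the text):

    import numpy as np

    def mfcc_from_frame(windowed_frame, H, n_coeff=13):
        """Equations (5)-(7): power spectrum -> log Mel energies -> DCT."""
        n_fft = (H.shape[1] - 1) * 2
        X = np.fft.rfft(windowed_frame, n=n_fft)        # equation (5)
        S = np.log(H @ np.abs(X) ** 2 + 1e-12)          # equation (6), floored to avoid log(0)
        M = S.shape[0]
        m = np.arange(M)
        c = np.array([np.sum(S * np.cos(np.pi * n * (m + 0.5) / M))   # equation (7)
                      for n in range(n_coeff)])
        return c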

3 The backend of ASR system

The backend of the speech recognition system is responsible for the decoding of the input feature vectors, performing the computations necessary for recognition of the speech pattern initially input into the frontend. That is, given a sequence of observed speech feature vectors O of the form:

O = o_1, o_2, \ldots, o_T   (8)

where o_t is the feature vector at time t, the recogniser aims at calculating the maximum likelihood given by:

\arg\max_i P(w_i \mid O)   (9)

where w_i is the i-th vocabulary word. This calculation is computed through the use of Bayes' rule, which can be stated as:

P(w_i \mid O) = \frac{P(O \mid w_i)\, P(w_i)}{P(O)}   (10)

Given that prior probabilities P(w_i) exist within the system together with parametric models such as Markov models, it is possible to solve equation (10) without solving the more difficult problem of estimating class-conditional observation densities P(O|w) (Van Segbroeck and Hamme, 2011). In order to generate the prior probabilities, the system must be trained on large amounts of labelled data for the generation of models for matching.

The backend comprises acoustic models for a given vocabulary, a language modelling module and usually a confidence scoring module, arranged as shown in Figure 1. The recognition module makes use of a Viterbi decoder for calculating maximum likelihoods of states from generated emissions.

3.1 Hidden Markov Models (HMM)

The Hidden Markov Model is a statistically based finite state automaton which has an associated square transition probability matrix A defining transitions between states, a vector of emission probabilities b for the generation of observations, and an initial state probability vector π (Rabiner and Juang, 1986). The underlying state transitions can only be probabilistically associated with another observable stochastic process, which is capable of producing the features that can then be observed. An example of an HMM, in which b is defined by a continuous multivariate Probability Density Function (PDF), is shown in Figure 4.

Figure 4 Example of an HMM (Huang et al., 2001)

The various constituents of the HMM shown in Figure 4 can be defined as:

• O = {o_1, o_2, …, o_M}: a set of possible observations, which can be either discrete or continuous. In terms of speech recognition, the observations are generally parameterised as 10 ms sections of speech.

• Ω = {1, 2, …, N}: a set of states, which constitute the model. At every time interval t, the model must be in one of these states, usually denoted s.

• A = {a_ij}: a state transition probability matrix, where a_ij is the probability of a transition from state i to state j, also denoted as:

  a_{ij} = P(s_t = j \mid s_{t-1} = i)   (11)

• B = {b_i(k)}: an output probability matrix (in the case of a discrete probability distribution), where b_i(k) is the probability of emitting observation o_k in state i. If X = X_1, X_2, …, X_t is the observed output of an HMM, the state sequence S = s_1, s_2, …, s_t is hidden and b_i(k) can be written as:

  b_i(k) = P(X_t = o_k \mid s_t = i)   (12)
In the case of speech recognition, the continuous output probabilities are usually modelled via the use of Gaussian Mixture Models (GMMs) with a diagonal covariance for each mixture. The emission probability is given by:

P(o_k \mid s_k) = \sum_{r=1}^{R} c_r \prod_{i=1}^{n} \Psi\left(o_{ki}; \mu_{ri}, \sigma_{ri}^2\right)   (13)

where o_k = (o_{k1}, …, o_{kn})^T is a deterministic feature vector of dimension n, c_r ranges from 0 to 1 and represents the weight of the r-th mixture, R denotes the total number of mixtures and Ψ(o; µ; σ²) is a univariate Gaussian defined as:

\Psi(o; \mu; \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(o - \mu)^2}{2\sigma^2} \right)   (14)

with mean µ and variance σ². The parameter set for the diagonal Gaussian mixture density of equation (13) is fully defined by the mean vectors, variance vectors and mixture weights of all component densities (Kuhne et al., 2011). All parameters in s_k are developed through the training of the model on an extensive corpus.

• π = {π_i}: an initial state distribution vector, where

  \pi_i = P(s_1 = i)   (15)

In-depth analyses of HMMs and the associated algorithms are found in the work of Rabiner (1989).
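As a sketch of how equations (13) and (14) are evaluated in practice, the following routine computes a diagonal-covariance GMM score (Python/NumPy). Working in the log domain with the log-sum-exp trick is a standard numerical-stability measure, not something prescribed by the cited works:

    import numpy as np

    def gmm_log_likelihood(o, weights, means, variances):
        """log P(o|s) for a diagonal-covariance GMM, equations (13)-(14).

        o: feature vector (n,); weights: (R,); means, variances: (R, n)."""
        # Per-mixture log of the product of univariate Gaussians (equation 14)
        log_comp = -0.5 * np.sum(np.log(2.0 * np.pi * variances)
                                 + (o - means) ** 2 / variances, axis=1)
        log_weighted = np.log(weights) + log_comp
        m = np.max(log_weighted)                    # log-sum-exp trick
        return m + np.log(np.sum(np.exp(log_weighted - m)))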

3.2 Acoustic modelling

In the probabilistic approach to speech recognition, given a sequence of words W = {w_1, w_2, w_3, …, w_n} and O, the acoustic information extracted from the speech signal, the aim is to find the most probable sequence of words given O, that is, the sequence maximising P(W|O), which can be found by the use of Bayes' formula given in equation (10).

The task of the acoustic model is to compute the probability P(O|W) that the pronunciation of the word string W will produce the extracted features O. By modelling the fundamental sounds of a language, a complete observation set can be created to cover all possible acoustic excitations (D'Orta et al., 1987). The study of linguistics has shown that the phoneme can be considered the most fundamental classification for speech (Flanagan, 1972). A phoneme is formally defined as the smallest contrastive unit in the sound system of a language. A single phoneme may have multiple pronunciations, which allows for the use of HMMs for modelling. Phonemes are best modelled through the use of the left-to-right or Bakis HMM (shown in Figure 5), due to the unidirectional change in their state (Rabiner et al., 1983). There are currently two methods of modelling words (Rabiner and Juang, 1993). The first indicates that there should be as many states as there are phonemes within a word. This results in the number of states varying from 2 to about 10. The second is that the number of states should be equal to the average number of observations in a spoken version of the word. In this case, each state would correspond to about 10–15 ms for the standard methods of analysis. Other restrictions may require that each word model has the same number of states, since lower word error rates are produced when models represent words with a similar number of sounds. Since words are constituted of multiple phonemes, they are modelled by concatenating HMMs together.

Figure 5 Bakis (left to right) HMM used for modelling based on the number of spoken observations (Rabiner and Juang, 1993)

3.3 Pattern matching

In the pattern matching step, the result is obtained as the sequence of HMMs which maximises the probability of the observation sequence. While the most obvious computation of this probability is via the sequencing of all possible states, the computational complexity of doing so is exponential. The Viterbi algorithm is used as a more efficient manner of probability computation. This algorithm is similar to the forward algorithm used in training the HMMs, except that the forward probabilities are maximised instead of summed (Rabiner, 1989).
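A minimal sketch of this recursion is given below (Python/NumPy), assuming log-domain parameters and a precomputed matrix of per-state emission scores, for instance from a GMM evaluator such as the one sketched in Section 3.1; the interface is illustrative:

    import numpy as np

    def viterbi(log_pi, log_A, log_B):
        """Most likely state path. log_pi: (N,), log_A: (N,N), log_B: (N,T)."""
        N, T = log_B.shape
        delta = log_pi + log_B[:, 0]            # best score ending in each state
        psi = np.zeros((N, T), dtype=int)       # backpointers
        for t in range(1, T):
            scores = delta[:, None] + log_A     # scores[i, j]: from state i to j
            psi[:, t] = np.argmax(scores, axis=0)
            delta = scores[psi[:, t], np.arange(N)] + log_B[:, t]
        path = [int(np.argmax(delta))]          # backtrace from the best final state
        for t in range(T - 1, 0, -1):
            path.append(int(psi[path[-1], t]))
        return path[::-1]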
3.4 Language modelling

As the size of an ASR system increases, so does the number of possible recognition outputs and hence the probability of erroneous recognition output symbols (Paliwal and Yao, 2010). The language model seeks to refine the output of the pattern matching section by attempting to limit the number of invalid output strings, placing a confidence score on consecutive output words or phonemes. One of the most commonly used language models is the N-gram model (Wang and Li, 2009). The N-gram language model calculates the probability of a word ending an N-word sequence given the last N–1 words, i.e. P(w_N | w_1 w_2 … w_{N−1}). The N-gram model presents a confidence score based on the last N–1 words via the use of estimates. The chain rule of probability allows for the decomposition of the probability computation for word sequences. Given a word sequence W = {w_1, w_2, …, w_N}, P(w_1, …, w_N) can be calculated as:

P(w_1 \cdots w_N) = P(w_1)\, P(w_2 \mid w_1) \cdots P(w_N \mid w_1^{N-1}) = \prod_{k=1}^{N} P(w_k \mid w_1^{k-1})   (16)

Using the Markov assumption that the probability of a word in the sequence depends only on its predecessor, the bigram model allows for the word sequence probability computation using equation (17) (Rapp, 2008):

P(w_1^N) = \prod_{k=1}^{N} P(w_k \mid w_{k-1})   (17)
It should be noted that the language model should contain strings related to the task for which the recognition system is being developed, as this would increase the probability of jargon within the application environment being added to the language model. Since language is creative, every possible N-gram cannot be modelled, and smoothing techniques must be used to remove zero probabilities. The ASR system can use the language model block as a means of validating the output of the system by defining valid output as any recognised string with an N-gram probability above a threshold limit. It should also be noted that N-gram modelling can also be applied at the phoneme level to provide a confidence score on a phone string (Wessel et al., 2001).

4 Challenges in automatic speech recognition systems

While much work has been done in the field of ASR to improve recognition accuracies, the problem of providing acceptable recognition rates under real-world conditions in real time still proves elusive. Considering the entire speech recognition system, it is not difficult to see that there are still many problems to be solved. In the frontend, for example, the required features for speech recognition should be constant regardless of sex, accent and age. The development of such features would not only mean a large reduction in the amount of training data required for the generation of acceptable model parameters, but also a standard frontend capable of extracting features from a wide range of languages, which would in turn make multiple language-based models possible. While HMMs together with GMMs provide an adequate acoustic modelling scheme for speech from the phonetic to the word levels, the amount of training data required for acceptable results is proportional to the size of the recognition vocabulary (Choi et al., 2010b). Provision of language models capable of reducing perplexity would increase the capacity of current systems while retaining a low error rate. Another problem, which may be considered one of the most imperative, is that of noise. Currently, speech recognition systems give acceptable performance under acoustically controlled laboratory conditions; however, performance tends to decrease significantly in real-world environments. This is due to the presence of non-stationary noise in the actual operating environments (Betkowska et al., 2007). The two main sources of noise affecting ASR are convolution noise, due to changes in microphone and channel conditions, and background noise.

5 Methods of creating noise robust systems

Many approaches have been taken towards solving the noise problem; they may be placed in two groups: feature compensation methods and model adaptation methods (Raj and Stern, 2005). The two main sources of noise affecting ASR, as shown in Figure 6, are convolution noise, due to changes in microphone and channel conditions, and background noise. The solution to the problem of convolution noise is relatively simple, since this form of noise becomes an additive constant in the log domain and can readily be removed (Van Segbroeck and Hamme, 2011). Some common classes of noise are babble, factory, aircraft and car noise (also known as Volvo noise) (Krishnamurthy and Hansen, 2009; Maithani and Tyagi, 2008; Schuller et al., 2009).

Figure 6 Model of the acoustic environment showing sources of noise: clean speech s[m] passes through channel h and is combined with additive noise n[m] to produce noisy speech y[m]

Each different class of noise is the result of different environmental processes. Car noise, for example, refers to the noise present in the interior of a car and may be split into four groups. First is wind noise, which is produced by turbulence at the edges of the vehicle and is proportional to the velocity of the vehicle. Second is driving noise, a result of the wheels, driving and suspension. This component is highly influenced by the road surface and wheel type, with a rough surface being more influential than a smooth one. The third group is the engine, which depends on the load and the number of revolutions. The fourth group results from the rattles and squeaks produced by interior components of the vehicle (Grimm et al., 2007). Speech corrupted by Volvo noise tends to remain quite audible even below 0 dB. This is due to the fact that most of the energy of Volvo noise lies above 2000 Hz (Schuller et al., 2009).

Speech babble is one of the most challenging noise interferences for all speech systems, since it occupies the same feature space as clean speech. Babble is commonly defined as the noise encountered when a crowd or group of individuals is speaking together. Babble is an overlap of multiple conversations and individual speakers. New analysis has shown that babble is more accurately modelled as a sum of conversations rather than as individual speech streams overlapped with each other (Krishnamurthy and Hansen, 2009). Babble noise is dependent on a number of factors. These include the number of speakers, the subject, as well as the emotional stress levels of the individuals. An investigation of babble by Krishnamurthy and Hansen seeks to generate a model for the noise. Babble affects clean speech by increasing the number of Gaussian hops per feature frame. The possible corruption by babble can then be calculated by counting Gaussian hops.

Factory and aircraft noise have a few similar properties, one of them being that they are relatively stationary. Factory noise is known to fluctuate from moment to moment, but the noise is constant on average over a long interval of time (Sacerdote, 1959). Factory noise, like aircraft noise, is predominantly due to fans and turbines.
Speech recognition systems are also affected by white and coloured noise. White noise contains equal energy at all frequency components, with a zero mean value. White noise is essentially a random process, usually considered to have a Gaussian distribution. Coloured noise does not vary completely randomly, as there is a covariance between samples at different time indexes. Since white and coloured noise have different energy-frequency characteristics, they have different effects on the speech signal. Kalman and Wiener filtering have proven very effective in negating these kinds of noise in frontend speech processing.

5.1 Speech enhancement techniques

Speech enhancement techniques attempt to suppress noise in the speech signal at the risk of degrading the original clean signal (Paliwal and Yao, 2010). Both spectral subtraction and the various filtering techniques belong to this sub-category of feature compensation.

The method of spectral subtraction (Huang et al., 2001) involves the assumption, for the environment shown in Figure 6, that the clean input signal s[m] has been corrupted by some additive noise n[m], resulting in:

y[m] = s[m] + n[m]   (18)

Statistical independence of s[m] and n[m] results in the power spectrum of y[m] being equal to the sum of the expected values of the individual power spectra, given by:

E\left[|Y(f)|^2\right] = E\left[|S(f)|^2\right] + E\left[|N(f)|^2\right]   (19)

The general idea in spectral subtraction (Choi, 2004) is that, in periods of speech silence, when the output power spectrum is assumed to be due to noise alone, a running noise power average E[|N(f)|²] can be found, given by:

|N(f)|^2 = \frac{1}{M} \sum_{i=0}^{M-1} |Y_i(f)|^2   (20)

Equations (19) and (20) imply that |S(f)|² is given by:

|S(f)|^2 = |Y(f)|^2 - |N(f)|^2 = |Y(f)|^2 \left( 1 - \frac{1}{SNR(f)} \right)   (21)

where SNR(f) is the frequency-dependent signal-to-noise ratio, given by:

SNR(f) = \frac{|Y(f)|^2}{|N(f)|^2}   (22)

From equation (21) we can calculate the MFCCs of the cleaned speech signal for use in recognition. It should be noted that an additional module would be required for speech detection. Many novel ways have been developed for the calculation of SNR(f) (Claes and Van Compernolle, 1996; Morgan and Hermansky, 1992; Van Compernolle et al., 2007).
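A minimal sketch of power spectral subtraction following equations (18)–(22) is shown below (Python/NumPy). The assumption that the first frames are noise-only, and the flooring of the subtracted spectrum at zero, are illustrative practical choices rather than part of the equations:

    import numpy as np

    def spectral_subtract(frames, n_noise=10):
        """Per-frame magnitude-squared subtraction, equations (20)-(21).

        frames: (n_frames, N) windowed time-domain frames."""
        Y = np.fft.rfft(frames, axis=1)
        noise_psd = np.mean(np.abs(Y[:n_noise]) ** 2, axis=0)    # equation (20)
        clean_psd = np.maximum(np.abs(Y) ** 2 - noise_psd, 0.0)  # equation (21), floored
        # Re-synthesise with the noisy phase (the phase is left unmodified)
        S = np.sqrt(clean_psd) * np.exp(1j * np.angle(Y))
        return np.fft.irfft(S, n=frames.shape[1], axis=1)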
Wiener filtering (Neves et al., 2008) attempts to generate a filtering operation which results in a linear estimate of the clean speech. The general approach taken is to minimise the mean-square value of the error signal, which is defined as the difference between the desired response and the filter output. While the Wiener filter is known for producing acceptable results under conditions where stationary noise is present, its reliability rapidly deteriorates if the inputs become non-stationary. The following equations describe the process mathematically:

x_e[n] = \sum_{m=-\infty}^{\infty} h[m]\, y[n-m]   (23)

E\left[ \left( x[n] - \sum_{m=-\infty}^{\infty} h[m]\, y[n-m] \right)^2 \right]   (24)

The estimated output x_e[n] is generated via convolution of the noisy signal with h[n], as stated in equation (23). Using the Minimum Mean Square Error (MMSE) estimation of equation (24), h[n] is chosen so as to minimise the expected squared error.
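For stationary noise, this minimisation has the well-known frequency-domain form H(f) = P_s(f)/(P_s(f) + P_n(f)); the sketch below applies such a gain per frequency bin (Python/NumPy; estimating the clean-speech spectrum by subtraction is an illustrative choice, not the method of Neves et al. (2008)):

    import numpy as np

    def wiener_filter(frames, noise_psd):
        """Apply the frequency-domain Wiener gain H = Ps / (Ps + Pn) per frame."""
        Y = np.fft.rfft(frames, axis=1)
        speech_psd = np.maximum(np.abs(Y) ** 2 - noise_psd, 1e-12)  # crude Ps estimate
        H = speech_psd / (speech_psd + noise_psd)                   # Wiener gain
        return np.fft.irfft(H * Y, n=frames.shape[1], axis=1)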
5.2 Missing feature techniques

Missing Feature Theory (MFT) attempts to determine the level of reliability of different spectral regions in the speech spectrogram (Raj and Stern, 2005). This is achieved by segmenting the time-frequency spectrogram into cells and attempting to determine which ones are unreliable or missing for subsequent processing. MFT compensates for additive noise distortions by locating the unreliable time-frequency cells and then performing the recognition on the partial feature vectors. Firstly, a missing feature mask must be estimated to differentiate the spectrally reliable regions (Van Segbroeck and Hamme, 2011). The application of missing feature techniques to ASR is due to the influence of two complementary fields of signal analysis, with the majority of mathematical approaches to missing feature (MF) recognition having been developed for the completion of partially occluded objects in the field of visual pattern recognition (Raj and Stern, 2005). Raj and Stern point out that the mathematics developed for missing feature recognition can be applied to types of signal restoration, which can prove useful in the identification of degraded signal components with the potential for restoration through the use of MF techniques.

The procedure for MF analysis requires 20 ms frame-based segmentation in the conventional frontend, followed by the computation of each frame's power spectrum, given by:

Z_p(m, k) = \left| \sum_{n=0}^{N-1} w(n - mL)\, x(n)\, e^{-j 2\pi (n - mL) k / N} \right|^2   (25)

where Z_p(m,k) represents the k-th frequency band of the power spectrum of the m-th frame of speech, w(·) represents the desired windowing function (usually Hamming), x(n) represents the n-th sample of the speech signal, L represents the adjacent inter-frame shift and N represents the number of samples per frame. A perception-based filter bank such as the Mel filter bank may be applied to emulate the frequency sensitivity of the human ear, as:

X_p(m, k) = \sum_{i} h_k(i)\, Z_p(m, i)   (26)

where h_k(i) represents the response of the k-th filter in the filter bank. The MF technique makes use of the additive noise composition given in equation (18), which shows that the output power spectrum of the corrupted speech signal is the sum of the power spectra of the corrupting noise and the clean speech. A hard masking system, such as the binary system (Kim and Loizou, 2010; Srinivasan and Wang, 2007), scores a cell as being either fully reliable or not. In general, the hard masking system is essentially a maximising operator between the speech signal and the corrupting noise, generating the score based on the higher of the two signals. The soft or fuzzy masking technique instead attempts to estimate the probability of each cell being dominated by speech. This mask m_t can be generated via the use of a sigmoid function (Barker et al., 2000):

m_t = \frac{1}{1 + \exp\left( -\rho \left( (s_t - n_t) - \theta \right) \right)}   (27)

where ρ is the slope and θ is a preset threshold. It has been observed that soft masking provides better decisions than hard masking, but ρ tends to be sensitive to the noise type (Van Segbroeck and Hamme, 2011). Another modification required for MFT implementation occurs during the evaluation of the acoustic model: at this stage, the probability of each feature being reliable should be taken into account when computing the acoustic score.
6 Implementation trends for ASR systems

Speech recognition has seen a shift in implementation schemes over the years. Initially, the entire system was implemented in software, resulting in slower than real-time recognition of speech (Lee et al., 1990). Currently, much work is being done on the mapping of ASR systems onto FPGAs as an option over implementations making use of digital signal processors. The reconfigurable nature of the FPGA, as well as the relative ease of subsystem implementation on FPGA, makes it an interesting option for system deployment. FPGA-based implementations not only reduce the software overhead and system support requirement, but also increase the real-time factor while maintaining the level of accuracy of speech recognition systems. For speech recognition to become viable, the parallel nature of pattern recognition algorithms must be exploited (Melnikoff et al., 2002a; Ke et al., 2008).

The SPHINX system developed by Carnegie Mellon University provides an example of the pure software approach to speech recognition (Lee et al., 1990). The system produced accuracies of 96% for a 997-word vocabulary. The main shortfall of the software-based approach is that the von Neumann architecture of most CPU-based systems limits the amount of true parallelism which can be exploited.

Meng et al. (2005) show the complications involved in optimising speech recognition algorithms for a digital signal processor. These include the use of low-level languages, decomposition of the algorithm to identify computationally intensive sections and pre-computation of frequently used parameters (which may result in additional memory requirements). Equipped with wider internal bus lines, the ability to perform numerous simultaneous commands and high-speed sum-of-products calculations, digital signal processors are capable devices for signal processing (Heungsuk et al., 2001; Muzaffar et al., 2005). Optimisation techniques enable the realisation of real-time functionality, with the recognition of a 20 ms frame being accomplished in 3.090 ms, a time much shorter than the 10 ms inter-frame spacing. Such implementations have achieved recognition accuracies ranging from 85% to 98% for small vocabulary systems. The major pitfall of the DSP hardware approach is that, in order to convert the system into a real-time capable one, code optimisations must be performed in assembly, which can prove to be a time-consuming task; moreover, the degree of parallelism which can be achieved is constrained by the fixed architecture available.

FPGAs, which allow the implementation of RISC soft-core processors, provide a framework in which the entire speech recognition system may be mapped onto the FPGA platform for a cheap embedded solution. The addition of a PROM for storing the reference HMMs is also easily achieved. The fact that FPGAs provide a hardware implementation of algorithms allows an RISC processor to be incorporated into the system design without drawing power for unused resources, resulting in savings in power, board real estate and overall system cost. As a result, smaller and cheaper systems can be implemented with improved performance, as shown in Figure 7, where all data were obtained in a laboratory environment.

Many attempts have been made to port Hidden Markov Model-based speech recognition to FPGAs. Notable attempts include the work of Hauck (1998) and Choi et al. (2010b), in which full systems were implemented completely, with independent external memories. FPGAs provide an ideal platform for speech recognition systems since they allow for the implementation of true parallelism.

The frontend may be implemented in software and run on a soft-core processor such as the Xilinx MicroBlaze or the Altera Nios II platforms (Xilinx Foundation, 2011a, 2011b, 2011c). The approach taken by Cheng et al. (2009) was to implement a GMM calculator, which is used to calculate the emission probabilities at each state. It was found that about 65% of the processing time was spent calculating emission probabilities, with the actual recognition by Viterbi decoding taking up 33% (Miura et al., 2008). A GMM module was proposed to perform all emission probability calculations in parallel. A similar approach was also taken by Choi et al. (2010b). Implementation trends indicate that the gain of hardware over software implementation of the frontend is minuscule (Melnikoff et al., 2002b; Gonzalez-Concejero et al., 2006; Mosleh et al., 2010). It is therefore more resource efficient to implement only the backend in hardware. Most implementations attempt to improve the efficiency of the emission probability calculator and the Viterbi decoder (Ke et al., 2008; Lim et al., 2006).
Figure 7 Performance comparison of different ASR implementation technologies (Choi et al., 2010a; Lee et al., 1990; Meng et al., 2005)

7 FPGA-based implementation of ASR systems

FPGAs were developed from the advancement of reprogrammable and reconfigurable hardware such as the Programmable Logic Device (PLD; Stitt, 2011). In general, an FPGA consists of many basic hardware cells which can be reprogrammed to implement small combinational or sequential circuits. These cells, called logic blocks, are placed on a matrix of horizontal and vertical pathways, as shown in Figure 8, which can be programmed to interconnect cells, resulting in signal routing throughout the device. On the outside of this matrix are rows of Input/Output Blocks (IOBs) for interfacing with other devices.

Figure 8 Block diagram of a typical FPGA architecture (Stitt, 2011)

There has been a great deal of interest in the creation of custom application-specific reprogrammable systems. In such an approach, the logic is supplied by the designer and tools are used to automatically map this circuitry into FPGA logic. This reduces the cost, in terms of money and time, of implementing algorithms in hardware (Sharma, 2009). A comparison of conventional CPU/DSP processor and FPGA technologies for the implementation of DSP algorithms shows that both the CPU and the DSP are controlled by an operating system, which runs algorithms in pseudo-parallel by context switching between process threads, with all threads sharing the same memory and limited multipliers, among other resources. FPGA implementation of an algorithm allows each subsection to be implemented as an individual core, with pathways connecting each core, resulting in a highly efficient pipeline without the restrictions of the one-memory von Neumann architecture. Hauck (1998) provides an in-depth review of the role of FPGAs in reprogrammable computing and the benefits they offer over ASICs.

Many implementations of ASR systems on FPGAs have been successfully attempted. As stated earlier in this section, it is evident that FPGAs provide an optimal platform for implementation. Current implementations fall into three major categories:

• Soft-core implementations used for running software-based ASR.

• Implementation of parts of the speech recognition system, as discussed in Sections 2 and 3.

• Implementation of the entire speech recognition system (system-on-chip).

The choice of implementation method may be based on the amount of resources available, as the real estate needed for the implementation of the different categories differs greatly: the full System-on-Chip (SoC) structure requires the most space and the partial implementation requires the least. The soft-core implementation, however, has the advantage that it provides a Reduced Instruction Set Computing (RISC) based microprocessor which may be used to run modules other than the speech recognition algorithms.

7.1 Soft-core processor-based implementations

The soft-core implementation approach involves the implementation of a microprocessor onto the FPGA, which is then used to run the speech recognition algorithm, usually written in C.

Soft-core processors are generally used to create an FPGA-based SoC. In these systems, the central processing unit core controls the circuit and performs minimal computations, while other parts of the circuit are responsible for interfacing and parallel processing. One major advantage of this method is that, since the soft core is a general computing engine, it may be used to run other necessary processes, a feature which provides a system with a microprocessor for less intense computation while still providing the necessary hardware for the computationally demanding speech recognition system. Some popular soft cores include LEON, OpenRISC, MicroBlaze, Nios II and Cortex-M1. Table 1 provides a comparison of available soft cores.
Table 1 Comparison of available soft cores (1-CORE Technologies, 2012)

Core | Architecture | Word size (bits) | Pipeline depth | Cycles per instruction | Hardware multiplier | Floating point unit | Size in logic elements | Comments
S1 Core | SPARC-v9 | 64 | 6 | 1 | Yes | Yes | 37,000–60,000 | A single-core version of UltraSPARC T1
LEON3 | SPARC-v8 | 32 | 7 | 1 | Yes | Yes | 3500 |
LEON2 | SPARC-v8 | 32 | 5 | 1 | Yes | Extension available | 5000 | Core unmaintained in favour of LEON3
OpenRISC 1200 | OpenRISC 1000 | 32 | 5 | 1 | Yes | No | 6000 |
MicroBlaze | MicroBlaze | 32 | 3, 5 | 1 | Optional | Optional | 1324 | Limited to Xilinx devices
aeMB | MicroBlaze | 32 | 3 | 1 | Optional | No | 2536 | Open-source clone of MicroBlaze
OpenFire | MicroBlaze | 32 | 3 | 1 | Optional | No | 1928 | Proprietary Altera devices
Nios II/f | Nios II | 32 | 6 | 1 | Yes | Optional | 1800 |
Nios II/s | Nios II | 32 | 5 | 1 | Yes | Optional | 1170 |
Nios II/e | Nios II | 32 | 0 | 6 | No | Optional | 390 |
LatticeMico32 | LatticeMico32 | 32 | 6 | 1 | Optional | No | 1984 | Not limited to Lattice devices
Cortex-M1 | ARMv6 | 32 | 3 | 1 | Yes | No | 2600 |
DSPuva16 | DSPuva16 | 16 | 0 | 4 | Yes | No | 510 |
PicoBlaze | PicoBlaze | 8 | 0 | 2 | No | No | 192 | Only on Xilinx platforms
PacoBlaze | PicoBlaze | 8 | – | 2 | No | No | 204 | Open-source PicoBlaze clone
LatticeMico8 | LatticeMico8 | 8 | 0 | 2 | No | No | 200 |

The use of the soft-core for speech processing is best suits should be obvious that partial ASR systems require a
to the designer who may use the speech recognition system controlling processor, which feeds input and receives the
as a secondary module in his design. It allows for taking output of the system. This controller may be a soft-core on
advantage of the FPGA parallelism in sections of the the same or different FPGA or a traditional microprocessor.
algorithm, which may be time consuming if ran on a general According to Choi et al. (2010b), implementations can also
microprocessor. It provides the best balance between contain sections, which read and write to a shared RAM
performance and flexibility of all three methods. using a valid flag to indicate when processing should be done.
Soft-core-based implementations may be full software or Vu et al. (2010) implement a low power architecture of an
a hardware-software co-design (Cheng et al., 2009). The full
MFCC-based frontend, which consumes little resources and
software-based implementation running on the soft-core does
power but matches performance of microprocessor-based
not take full advantage of the resources and parallel nature of
systems. The flexibility offered by partial system
the FPGA. Most co-designs implement the MFCC feature
extraction via the use of the soft-core with the computationally implementation was also utilised by Veitch et al. (2010) and
demanding GMM computation and Viterbi decoding also provided an IP (Intellectual Proprietary) core for the
implemented in the FPGA fabric (Lim et al., 2006). acceleration of HMM-based speech recognition systems. By
decoupling Gaussian calculations from the rest of the
system, results were calculated with minimal communication
7.2 Partial implementations
between the backend search software and the FPGA-based
Partial implementations involve the implementation of small Gaussian core.
segments of the speech recognition system onto FPGA. This Designers may choose to implement or improve sections
method can prove advantageous when only minimum such as emission probability computation though the GMM
computational boosts are required to meet performance calculator (Miura et al., 2008; Schuster et al., 2006). Partial
goals. Calculations made by Cheng et al. (2009) and Lim implementations seek to accelerate speech-recognition system
et al. (2006) indicate that the most of the time is spent on the implementations, which require minimal speed gains or
GMM emission calculations of the speech pattern matching whose speed is fairly adequate for its use. The frontend
block. They also indicate that limited computational gains implementation found in the work of Ahadi et al. (2003)
can be achieved by hardware implementation of the provides a good example of partial implementation as it
frontend of the system; Cheng et al. (2009) indicate that provides facilities for changing the speech enhancement
only 6.02% of the timing profile for speech recognition algorithms used in noisy environment. The FPGA platform
involves feature extraction, while GMM calculation and allows for the easy redirection of outputs to newly implemented
Viterbi decoding require 63.1% and 30.88%, respectively. It cores, which facilitate dynamic hardware choice.
Table 2 Comparison of various FPGA-based implementation methods

Full implementation
  Advantages: Full parallelism resulting in extremely high performance. Takes full advantage of the underlying hardware.
  Disadvantages: High resource utilisation. High power consumption. Poor flexibility: system components are not readily available for use by other algorithms.
  Resource utilisation: High. Entire system implemented in logic elements.
  Ease of implementation: Low. Requires intricate planning of system state flow and resource management.

Soft-core implementation
  Advantages: Cores with a very small footprint can be used to run ASR algorithms. Cores can be used to run other tasks. Sections of the system can be developed in software, which is a simpler process.
  Disadvantages: Use of floating point numbers is limited by the core architecture. Some cores lack the capability to use hardware multipliers. Limited scope to take advantage of the parallel nature of both the algorithm and the FPGA.
  Resource utilisation: Low to high. Determined by the choice of soft core as well as whether accelerator cores are also implemented.
  Ease of implementation: Medium.

Partial implementation
  Advantages: Provides acceleration of computationally intensive algorithms. Best performance gain to footprint ratio of the three techniques. Low power consumption. Portable. Implementation may be used by other algorithms requiring the same processing.
  Disadvantages: May require data type conversion for use with the driver system. While this method provides a speed gain, the overall ASR system may be slower than the full implementation.
  Resource utilisation: Low to medium. Highly dependent on the section of the core which is to be implemented.
  Ease of implementation: High.
7.3 Full system implementations

There have been many successful attempts to develop full system-on-chip implementations of the ASR system. The Speech Silicon project (Schuster et al., 2006) discusses the design of an FPGA-based system-on-a-chip implementation which is capable of performing continuous speech recognition on medium-sized vocabularies. The ability of FPGAs to be interfaced with many different devices makes them a natural candidate for use as a control device powered by speech recognition. Coupled with the wealth of resources available, the entire speech chain shown in Figure 1 can be implemented in the FPGA fabric. Choi et al. (2010a, 2010b) provide good examples of SoC implementations of ASR. It should be noted that the majority of the logic of the implementation is contained in the sections of the emission probability calculator and the Viterbi decoder. Implementations in the work of Gomes-Cipriano et al. (2004), Elmisery et al. (2003), Melnikoff et al. (2002a) and Schuster et al. (2006) show that the clock speed required to perform the recognition is of the order of 10² MHz, well below the gigahertz clock rates of traditional microprocessors.

Table 2 shows a comparison of the three methods of implementing ASR technology on FPGA.

8 Implementation of FFT engine for ASR system

As a case study, a 1024-point 4-channel FFT engine, which is used in the frontend of the ASR system, has been simulated and implemented on various FPGAs. Using Simulink and the Xilinx DSP System Generator, a model of the FFT engine, shown in Figure 9, was developed and implemented on two different platforms. These platforms, selected from the low and mid-high end of the FPGA spectrum in terms of both capability and price, are the Spartan-3E and the Virtex-5. Using the System Generator block, the relevant netlist was developed to implement the circuit on each platform. When compared to the output of the Simulink model, the results generated by the Virtex-5 showed a strong correlation. Apart from correlation with theoretical values, the resource usage of the two devices was also compared. The presence of dedicated multipliers and summers allows for the implementation of high-level logic without the use of CLBs to build these systems. The result of this is faster performance and a more efficient implementation.
perform the recognition is in the order of 102 MHz and well out system on the identified platform.
of the Giga hertz region of traditional microprocessors. Table 3 Device utilisation summary of Virtex-5 FPGA
Table 2 shows a comparison of the three methods of
implementing ASR technology on FPGA. Device Utilisation Summary Virtex-5
Slice Logic Utilisation Used Available Utilisation
Number of slice registers 3635 32,640 11%
8 Implementation of FFT engine for ASR system Number of Slice LUTs 3253 32,640 9%
Number used as logic 2044 32,640 6%
As a case study, a 1024-point 4-channel FFT engine, which
is used in the frontend of the ASR system has been Number used as Memory 1089 12,480 8%
simulated and implemented on various FPGAs. Using Number of occupied Slices 1154 8160 14%
Simulink and Xilinx DSP system generator, a model of FFT Number with an unused Flip Flop 342 3977 8%
engine shown in Figure 9 was developed and implemented
Number of DSP48Es 10 288 3%
on two different platforms. These platforms selected from
Table 4 Device utilisation summary for Spartan-3E 1600 FPGA

Logic utilisation | Used | Available | Utilisation
Total number of slice registers | 4418 | 29,504 | 14%
Number of 4-input LUTs | 4035 | 29,504 | 13%
Number of occupied slices | 3101 | 14,752 | 21%
Number of slices containing only related logic | 3101 | 3101 | 100%
Number of slices containing unrelated logic | 0 | 3101 | 0%
Total number of 4-input LUTs | 4380 | 29,504 | 14%
Number of RAMB16s | 7 | 36 | 19%
Number of MULT18X18SIOs | 16 | 36 | 44%

This non-optimal implementation of a very important signal processing function consumes 11–14% of the total slice logic, but less than 20% of the total logic implementation area, of the Virtex-5 and Spartan-3E, respectively. The difference in device utilisation between the Virtex-5 (V5) and the Spartan-3E (S3E) is due to the presence of DSP48 blocks, a dedicated piece of hardware which allows for efficient multiplication and accumulation on the Virtex-5. It should also be noted that the structure of the V5 slice is slightly more complex than that of the S3E, with the Virtex-5 being able to implement Boolean equations of six input variables, while the S3E is limited to four inputs. With further optimisation of the implementation, a decrease in utilisation of 5% is possible; this further solidifies the premise that algorithms 4–5 times more complex than the FFT may be implemented without consuming the entire resource set available.

Table 5 shows a comparison of the resource utilisation for a 1024-point radix-2 implementation of the FFT on various FPGAs. It is evident that specialised hardware such as the Xilinx DSP48 blocks and the Altera 18 × 18 blocks greatly reduces the device latency. The Xilinx devices provide the DSP48 block, which is capable of performing complex arithmetic as well as accumulation, and greatly reduces the memory requirements and footprint of the design. The Altera FPGAs, however, tend to have a shorter latency than their Xilinx-branded counterparts.

Figure 9 Simulink model used for FFT engine simulation on FPGA

Table 5 Comparison of typical 1024-point FFT implementations on various FPGA platforms

Slice logic utilisation | Virtex-5 | Virtex-6 | Spartan-6 XC6SLX150T | Spartan-3A DSP XC3D3400A | Spartan-3E 1600 | Cyclone III | Stratix III
Number of slice registers | 3635 | 2774 | 3344 | 2618 | 4418 | 4650 | 4458
Number of slice LUTs | 3235 | 2744 | 2567 | 4144 | 4035 | 3857 | 2480
Flip flops | 1089 | 3626 | 3701 | 3779 | 4380 | 155,940 | 155,904
Maximum clock frequency (MHz) | 380 | 395 | 210 | 180 | 66 | 244 | 413
Latency (µs) | 8.84 | 18.64 | 74.17 | 86.53 | 834.23 | 4.19 | 2.48
18 kb block RAM | 5 | 9 | 33 | 17 | – | 20 | 12
Number of DSP48Es | 16 | 12 | 12 | 12 | – | – | –
18 × 18 multipliers | – | – | – | – | 16 | 24 | 10
232 D.P. Sharma and J. Atkins

9 Conclusion

This paper provides an overview of the HMM-based ASR system, which accepts speech signals and produces output symbols that are ideally a unique representation of the input speech signal. The current problems of speech recognition systems have been discussed, with keen focus on environmental noise as well as the currently proposed solutions. It has been noted that, although these proposals provide some relief, they remain highly dependent on the kind of noise present; a side effect that could prove detrimental in real-world environments, where noise sources and characteristics may change rapidly. The highlighted solutions attempt to deal with generalised classes of noise instead of individual types. While similarities in noise properties may support a general stationary/non-stationary classification, the effect of a particular noise group on the feature vector may differ, rendering identical treatments ineffective. ASR system implementations on various targets have also been discussed, leading to the conclusion that FPGAs are the best solution for such implementations.
Acknowledgements

The authors are thankful to the University of the West Indies for providing the necessary funding, through Grant No. CRP.4.MAR11.4, to carry out research on the project ‘Development of algorithms and systems for robust speech recognition in the noisy environments’.
