
A novel voice recognition model based on HMM and fuzzy PPM
Jackson Zhang*
*Software Engineer, G&PS (R&D)
Chengdu Site, Motorola China
*Email: a22026@motorola.com
Bruce Wang
Software Engineer, G&PS (R&D)
Chengdu Site, Motorola China
Email: a21992@motorola.com
Abstract

Hidden Markov Model (HMM) is a robust statistical methodology for automatic speech recognition. It has been tested in a wide range of applications. A prediction approach traditionally applied to text compression and coding, Prediction by Partial Matching (PPM), which is a finite-context statistical modeling technique that can predict the next characters based on the context, has shown great potential in developing novel solutions to several language modeling problems in speech recognition. These two different approaches have their own special features, each contributing to voice recognition. However, no work has been reported on integrating them in an attempt to form a hybrid voice recognition scheme. To take advantage of the strengths of these two approaches, we propose a hybrid speech recognition model based on HMM and fuzzy PPM, which has demonstrated, by experiment, competitive and promising performance in speech recognition.

Keywords - HMM; PPM; voice recognition; fuzzy logic; statistical model

I. INTRODUCTION

Hidden Markov Model (HMM) is a highly robust statistical methodology for automatic speech recognition. Prediction by Partial Matching (PPM) is a finite-context statistical modeling technique that can be viewed as blending together several fixed-order context models to predict the next character in the input sequence. While HMM focuses on the global statistical features of the speech to build a robust speech recognition process, PPM emphasizes context-based partial matching prediction to enhance the recognition. The PPM method can be further improved by incorporating fuzzy matching into the matching process. With their respective strengths, HMM and the fuzzy PPM, when integrated, naturally lead to a hybrid novel solution to the speech recognition problem. The key to a successful hybridization is to build a paradigm in which these two methods work together smoothly. In this paper, we propose a scheme which first incorporates the HMM in our signal processing model, followed by fuzzy matching to generate the likelihood factor. Fuzzy logic enables the technique to handle noise better and improves classification accuracy. An important consequence of using a fuzzy-based system is that the system's level of confidence in its pattern selection can be used to identify further information. Next, the PPM model is used to generate compression indicators based on the prediction power of PPM. Finally, a matching model is used to compute and weight the different factors for the determination of the best patterns.
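For readers unfamiliar with PPM, the sketch below illustrates the core idea of blending fixed-order context models with back-off to shorter contexts. It is a deliberate simplification (full PPM uses escape probabilities rather than hard back-off), and every name in it is hypothetical rather than taken from the paper.

```python
from collections import defaultdict

class SimplePPM:
    """Minimal order-k PPM-style predictor: counts contexts of length
    k, k-1, ..., 0 and backs off to a shorter context when a longer
    one has never been seen (a simplification of PPM escape handling)."""

    def __init__(self, max_order=2):
        self.max_order = max_order
        # counts[order][context][symbol] -> frequency
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order + 1)]

    def train(self, text):
        for i, ch in enumerate(text):
            for k in range(self.max_order + 1):
                if i >= k:
                    self.counts[k][text[i - k:i]][ch] += 1

    def predict(self, context):
        """Return (symbol, probability) for the most likely next symbol,
        backing off from the longest matching context."""
        for k in range(self.max_order, -1, -1):
            ctx = context[-k:] if k else ""
            dist = self.counts[k].get(ctx)
            if dist:
                total = sum(dist.values())
                ch, n = max(dist.items(), key=lambda kv: kv[1])
                return ch, n / total
        return None, 0.0

model = SimplePPM(max_order=2)
model.train("the cat sat on the mat")
print(model.predict("th"))   # likely ('e', 1.0)
```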

II. HMM SIGNAL PROCESSING MODEL


As the front end of the speech recognition system, the signal processing unit must be capable of capturing most, if not all, of the speech features, to facilitate lossless speech processing. It should be able to represent the whole word pattern against pattern variation and background noise and, at the same time, it should be sophisticated enough to represent the phonemic features within a word pattern. A hybrid word-spotting method that combines word-based pattern matching and phoneme-based HMM is a novel solution to this problem. With a time-frequency spectrum and the single-frame frequency spectra, the word can be described globally without losing its detailed features [1]. In the traditional HMM approach, the behavior of the utterance of the speech segment is captured by the model parameters. Many algorithms have been developed to optimize these model parameters so that they best describe the trained observation sequences. However, when processing human speech, these parameters can only be estimated from an initial guess and iteratively converge to a local maximum. Therefore, a genetic algorithm version of HMM is employed to generate a complete set of word lattices within the defined search space.



Human speech contains many fuzziness factors, such as pitch, amplitude and environmental noise. To handle this fuzziness, a simple supplementary fuzzy process is used. The fuzzy implementation assumes that all waveforms contain uncertainty. This uncertainty comes from speaker variation, waveform quantization, noise, and the inability to completely specify the process of speech recognition. Each amplitude is therefore represented as a fuzzy number. The fuzzy number can then be viewed as a set of numbers within the range f_i ± k·f of the original, where k is a fuzzy number theoretically ranging from 0 to 1. For example, integers close to X can be represented by a fuzzy set in which the closer a number is to X, the higher its membership in the fuzzy set. When generating a set of word lattices, a certain degree of deviation, say 60%, is allowed, in order to generate a large set of words for an input speech pattern. When no accepted match within the threshold is found among the preprocessed patterns, the input speech pattern is considered a new pattern, either to be ignored or to be stored in the pattern database.

For each detected end point (an individual word), the waveform is chopped into n intervals. In each interval, a fixed number of sample points is set. The number of sample points, m, falling within the f_i ± k·f region in each interval is counted, and its membership of the original waveform is calculated. The membership is calculated as the average of the memberships of every point along the original waveform. The result of comparing each template to an unknown waveform is a fuzzy set consisting of the set of templates and their degrees of membership, which represent the templates' similarity to the unknown waveform. Generally speaking, there are two main cases into which most of the wave sample points can fall, as shown in Figure 1: the points are either bounded inside the wave or lie outside the wave's area. When the number of sampling points is large enough, the points can roughly represent the corresponding curve, and their relative distance from the wave can be regarded as an indicator of their likeness to the wave in a speech recognition sense. The stored templates are chopped into small sampling chunks. These sample points of the templates are matched against the input wave. The relative distance of the points to the wave gives the membership of the template to the input wave, and thus the degree of matching.
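To make the template matching concrete, the sketch below computes a template's degree of matching as the average point membership, as described above. The paper does not give the exact membership function, so a linear decay over an assumed tolerance band (with the 60% deviation mentioned earlier as the default k) is used here, and all function names are hypothetical.

```python
import numpy as np

def point_membership(sample, wave_value, k=0.6):
    """Membership of one template sample point with respect to the input
    wave: 1.0 when the point lies on the wave, decaying linearly to 0 as
    it leaves the band wave_value +/- k*|wave_value|.  The linear shape
    and the default k = 0.6 are assumptions, not taken from the paper."""
    band = k * max(abs(wave_value), 1e-9)      # tolerance band width
    distance = abs(sample - wave_value)        # relative distance to wave
    return max(0.0, 1.0 - distance / band)

def template_membership(template, wave):
    """Degree of matching of a whole template: the average membership of
    every sample point along the input waveform."""
    return float(np.mean([point_membership(s, w)
                          for s, w in zip(template, wave)]))

wave = np.sin(np.linspace(0, np.pi, 50))          # toy input waveform
template = wave + np.random.normal(0, 0.05, 50)   # noisy stored template
print(template_membership(template, wave))        # close to 1.0
```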
III. THE TRAINING MODEL OF HMM

As mentioned in the introduction, the technique used to implement the speech recognition system is the Hidden Markov Model (HMM). The technique is used to train a model which, in our case, represents an utterance of a word. This model is later used when testing an utterance, to calculate the probability that the model has created the observed sequence of vectors (the utterance after the parameterization described in section IV).

The difference between an observable Markov model and a hidden Markov model is that in the observable case the output state is completely determined at each time t, whereas in the hidden Markov model the state at each time t must be inferred from observations. An observation is a probabilistic function of a state. The hidden Markov model is represented by

λ = (π, A, B), where
π = the initial state distribution vector,
A = the state transition probability matrix,
B = the continuous observation probability density function matrix.

The three fundamental problems in hidden Markov model design are the following.

Problem one - Recognition. Given the observation sequence O = (o1, o2, ..., oT) and the model λ = (π, A, B), how is the probability of the observation sequence given the model computed? That is, how is P(O|λ) computed efficiently?

Problem two - Optimal state sequence. Given the observation sequence O = (o1, o2, ..., oT) and the model λ = (π, A, B), how is a corresponding state sequence q = (q1, q2, ..., qT) chosen to be optimal in some sense (i.e. to best explain the observations)?

Problem three - Adjustment. How are the probability measures λ = (π, A, B) adjusted to maximize P(O|λ)?
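Problem one is classically solved with the forward algorithm. The paper does not spell it out, so the following is a standard sketch (not the authors' code), using the λ = (π, A, B) notation above, with the observation likelihoods b_j(o_t) assumed to be precomputed into a matrix.

```python
import numpy as np

def forward_probability(pi, A, B):
    """Forward algorithm for Problem one: computes P(O | lambda).
    pi: (N,) initial state distribution
    A:  (N, N) state transition matrix
    B:  (N, T) precomputed likelihoods, B[j, t] = b_j(o_t)
    Runs in O(N^2 * T) rather than the naive O(T * N^T)."""
    N, T = B.shape
    alpha = pi * B[:, 0]                 # initialization: alpha_1(i)
    for t in range(1, T):
        alpha = (alpha @ A) * B[:, t]    # induction: alpha_{t+1}(j)
    return alpha.sum()                   # termination: P(O | lambda)

# Toy example with N = 3 left-to-right states and T = 3 frames
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1, 0.1],
              [0.1, 0.8, 0.2],
              [0.1, 0.1, 0.7]])
print(forward_probability(pi, A, B))
```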



IV. IMPROVED HMM COMPUTATION ALGORITHM

A. Signal - the utterance
The signals used for training purposes are ordinary utterances of the specific word, i.e. the word to be recognized.

Fig. 1 Original signal (y(n)) and preemphasized signal (x(n))


B. MFCC - Mel Frequency Cepstrum Coefficients


The MFCC matrix is calculated by transforming the speech signal into Mel Frequency Cepstrum Coefficients. The same transformation is also used when testing an utterance against a model.
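As an illustration of this step, here is a minimal sketch of the MFCC computation. It assumes the librosa library and an illustrative pre-emphasis coefficient of 0.97; neither the library nor the coefficient value is specified in the paper.

```python
import numpy as np
import librosa  # assumed available; any MFCC implementation would do

def mfcc_matrix(path, n_mfcc=13):
    """Compute the MFCC matrix for one utterance.  The coefficient
    count and pre-emphasis factor are illustrative choices."""
    y, sr = librosa.load(path, sr=None)        # keep native sample rate
    # Pre-emphasis, as in Fig. 1: x(n) = y(n) - 0.97 * y(n-1)
    x = np.append(y[0], y[1:] - 0.97 * y[:-1])
    return librosa.feature.mfcc(y=x, sr=sr, n_mfcc=n_mfcc)

# mfcc = mfcc_matrix("utterance.wav")   # shape: (n_mfcc, n_frames)
```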

C. INITIALIZATION

1) A, the state transition probability matrix, using the left-to-right model

The state transition probability matrix A is initialized with equal probability for each state:

$$A = \begin{pmatrix} 0.5 & 0.5 & 0 & 0 & 0 \\ 0 & 0.5 & 0.5 & 0 & 0 \\ 0 & 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0 & 0.5 & 0.5 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$

During experimentation with the number of iterations within the re-estimation of A, the final estimated values of A were shown to deviate considerably from this initial estimate. The matrix A was therefore initialized with the following values instead, which are closer to the re-estimated values (the re-estimation problem is dealt with later on in this chapter):

$$A = \begin{pmatrix} 0.85 & 0.15 & 0 & 0 & 0 \\ 0 & 0.85 & 0.15 & 0 & 0 \\ 0 & 0 & 0.85 & 0.15 & 0 \\ 0 & 0 & 0 & 0.85 & 0.15 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$

The change of initialization values is not a critical event, since the re-estimation adjusts the values to the correct ones according to the estimation procedure.

2) π, the initial state distribution vector, using the left-to-right model

The initial state distribution vector is initialized with probability one of being in state one at the beginning, as is assumed in speech recognition theory. It is also assumed that the number of states is five in this case:

$$\pi = [\,1 \ \ 0 \ \ 0 \ \ 0 \ \ 0\,], \qquad 1 \le i \le \text{number of states, in this case } i = 5$$
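A minimal sketch of this left-to-right initialization follows, assuming numpy; the helper name left_to_right_A is hypothetical. Passing stay=0.5 gives the first matrix above, stay=0.85 the final one.

```python
import numpy as np

N = 5  # number of states, as assumed above

# pi: start in state one with probability 1 (left-to-right model)
pi = np.zeros(N)
pi[0] = 1.0

def left_to_right_A(stay=0.85):
    """Left-to-right transition matrix: each state either stays (with
    probability `stay`) or moves on to the next state; the last state
    is absorbing."""
    A = np.zeros((N, N))
    for i in range(N - 1):
        A[i, i] = stay
        A[i, i + 1] = 1.0 - stay
    A[N - 1, N - 1] = 1.0
    return A

print(left_to_right_A(0.85))
```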

D. MULTIPLE UTTERANCE ITERATION

As mentioned, since direct observation of the state of the speech process is not possible, some statistical calculation is needed. This is done by introducing the continuous observation probability density function matrix, B. The idea is that there is a probability of making a certain observation in a state: the probability that the model has produced the observed Mel Frequency Cepstrum Coefficients. There is a discrete observation probability alternative; it is less complicated to calculate, but it uses a vector quantization which generates a quantization error. The advantage of continuous observation probability density functions is that the probabilities are calculated directly from the MFCC without any quantization.

Fig. 2 Original signal (y(n)) and preemphasized signal (x(n))

The distribution commonly used to describe the observation densities is the Gaussian, which is also used in this project. To represent the continuous observation probability density function matrix B, the mean and the variance are used.

Since the MFCC are generally not normally distributed, weight coefficients are necessary and a mixture of pdfs is applied. Several such weights are used to model the frequency functions, which leads to a mixture of the pdfs:

$$b_j(o_t) = \sum_{k=1}^{M} c_{jk}\, b_{jk}(o_t), \qquad j = 1, 2, \ldots, N$$

where M is the number of mixture weights, c_jk. These are restricted by

$$\sum_{k=1}^{M} c_{jk} = 1, \qquad j = 1, 2, \ldots, N$$

$$c_{jk} \ge 0, \qquad j = 1, 2, \ldots, N, \quad k = 1, 2, \ldots, M$$

Diagonal covariance matrices are used, because they require less computation and give a faster implementation; the following formula is then used:


639

$$b_{jk}(o_t) = \frac{1}{(2\pi)^{D/2} \left( \prod_{l=1}^{D} \sigma_{jkl}^{2} \right)^{1/2}} \exp\!\left( -\frac{1}{2} \sum_{l=1}^{D} \frac{(o_{tl} - \mu_{jkl})^2}{\sigma_{jkl}^{2}} \right)$$
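The two formulas above combine into a few lines of code. The sketch below evaluates b_j(o_t) with diagonal covariances, working in log space inside each component for numerical stability (an implementation choice, not something prescribed by the paper).

```python
import numpy as np

def log_gaussian_diag(o, mu, var):
    """Log of the diagonal-covariance Gaussian b_jk(o_t) given above."""
    D = o.shape[0]
    return (-0.5 * D * np.log(2.0 * np.pi)
            - 0.5 * np.sum(np.log(var))
            - 0.5 * np.sum((o - mu) ** 2 / var))

def b_j(o, c, mu, var):
    """Mixture observation likelihood b_j(o_t) = sum_k c_jk b_jk(o_t).
    c: (M,) mixture weights; mu, var: (M, D) per-mixture parameters."""
    return sum(ck * np.exp(log_gaussian_diag(o, mk, vk))
               for ck, mk, vk in zip(c, mu, var))

o = np.zeros(13)                          # one MFCC frame (toy data)
c = np.array([0.6, 0.4])                  # mixture weights, sum to 1
mu = np.zeros((2, 13))
var = np.ones((2, 13))
print(b_j(o, c, mu, var))
```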


E. Re-estimation of the state transition probability matrix

When solving problem three, optimizing the model parameters [Rab89], an adjustment of the parameters of the model is done. The Baum-Welch algorithm is used, as mentioned in the previous section of this chapter. The adjustment of the model parameters should be done in a way that maximizes the probability of the model having generated the observation sequence:

$$\bar{\lambda} = \arg\max_{\lambda}\, P(O \mid \lambda)$$

The variable ξ is calculated for every word in the training session. It is used together with the variable γ, which is also calculated for every word in the training session; this means that we have two matrices of size (number of words × samples per word). The following equations are used:

$$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$$

$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)}$$

Note that there is no difference between using the scaled or the unscaled variables here, since the scaling factors depend on time and not on state.

The re-estimation of the A matrix is quite extensive due to the use of multiple observation sequences. For the collection of words in the training session, an average estimate is calculated with contributions from all utterances used in the training session. The following equation is used:

$$\bar{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \qquad \text{(Eq. 6.9)}$$
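A compact sketch of this update is given below, assuming numpy and unscaled α and β for clarity; the function name reestimate_A is hypothetical. For multiple utterances, the numerator and denominator would be accumulated across all training utterances before dividing, giving the averaged estimate described above.

```python
import numpy as np

def reestimate_A(alpha, beta, A, Bmat):
    """One Baum-Welch update of A from the xi and gamma variables above.
    alpha, beta: (T, N) forward/backward variables
    Bmat:        (N, T) observation likelihoods, Bmat[j, t] = b_j(o_t)"""
    T, N = alpha.shape
    num = np.zeros((N, N))   # accumulates sum_t xi_t(i, j)
    den = np.zeros(N)        # accumulates sum_t gamma_t(i)
    for t in range(T - 1):
        # xi_t(i, j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j), normalized
        xi = alpha[t][:, None] * A * Bmat[:, t + 1] * beta[t + 1]
        xi /= xi.sum()
        num += xi
        den += xi.sum(axis=1)   # gamma_t(i) = sum_j xi_t(i, j)
    return num / den[:, None]   # a_ij = sum_t xi / sum_t gamma
```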
V. CONCLUSION

In this paper, we have introduced a hybrid model of HMM and fuzzy PPM and applied it to speech recognition. The model provides a global speech suggestion scheme based not only on the speech signals but also on the context. Different weightings, such as the pattern-based and the context-based ones, are applied to deliver a broader and more meaningful speech recognition. During the process, fuzziness is incorporated to handle the nature of human speech. A final text file, with the highest overall weighting, is generated which best matches the input speech in both a global and a local context-based fashion.

VI. ACKNOWLEDGEMENTS

The authors would like to express their appreciation for the assistance of colleagues at CDC during the writing of this paper.

VII. REFERENCES

[1] H. Kanazawa, M. Tachimori and Y. Takebayashi: A hybrid wordspotting method for spontaneous speech understanding using word-based pattern matching and phoneme-based HMM. IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 289-292, May 1995.
[2] W. J. Teahan, J. G. Cleary: Applying compression to natural language processing. Submitted to ANLP.
[3] C. W. Chau, S. Kwong, C. K. Diu, W. R. Fahrner: Optimization of HMM by a Genetic Algorithm. IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 3, pp. 289-292, April 1997.
[4] Kruchten, Ph.: The Rational Unified Process: An Introduction. Addison-Wesley, 2003.
[5] Manzoni, L.V., Price, R.T.: Identifying extensions required by RUP (Rational Unified Process) to comply with CMM (Capability Maturity Model) levels 2 and 3. IEEE Transactions on Software Engineering, vol. 29, no. 2, IEEE, Feb. 2003, pp. 181-192.
[6] Microsoft Corporation: Microsoft Solutions Framework White Paper. Microsoft Press, 1999.
[7] Nawrocki, J., Walter, B., and Wojciechowski, A.: Toward maturity model for extreme programming. In Proceedings of the Euromicro Conference, 2001, IEEE, 2001, pp. 233-239.
[8] Paulk, M.C.: Extreme programming from a CMM perspective. IEEE Software, vol. 18, no. 6, IEEE, Nov.-Dec. 2001, pp. 19-26.
[9] Pollice, G.: Using the Rational Unified Process for Small Projects: Expanding Upon eXtreme Programming. A Rational Software White Paper, Rational, 2001.
[10] Runeson, P., Isacsson, P.: Software quality assurance - concepts and misconceptions. In Proceedings of the 24th EUROMICRO Conference, IEEE Computer Soc., 1998, pp. 853-859.
[11] Osterweil, L.J.: Improving the quality of software quality determination processes. In Proceedings of the IFIP TC2/WG2.5 Working Conference on Quality of Numerical Software: Assessment and Enhancement, Chapman & Hall, London, 1997, pp. 90-105.
[12] Ward, W. A., and Venkataraman, B.: Some Observations on Software Quality. In Proceedings of the 37th Annual Southeast Regional Conference (CD-ROM), ACM, 1999, Article No. 2.
[13] Wikipedia: http://www.wikipedia.org.


