
CS5241 Speech Processing

Project Report
Ngo Minh Duc

23/04/2010

1 Summary
In this project, we build a simple Continuous Speech Recognition (CSR) system for the
English language, using the Hidden Markov Model Toolkit (HTK).
We aim to learn the basic usage of HTK and the basics of speech recognition system
design rather than to achieve high recognition accuracy.

2 Introduction
The goal of automatic speech recognition (ASR) is to build systems that map from a
human voice signal to a string of words. A major application area of ASR is in human-
computer interaction. In continuous speech recognition (CSR), the voice signal is a
natural continuous utterance of a speaker in which words run into each other naturally
and have to be segmented. CSR is much harder than isolated word recognition.
CSR tasks themselves vary greatly in difficulty. For instance, speech directed by a human
at a computer is easier to recognize than human-to-human conversation. One general and
important area of CSR is Large-Vocabulary Continuous Speech Recognition (LVCSR).
Generally, large vocabulary means that the vocabulary of the system is on the order of
50,000 words.
The Hidden Markov Model (HMM) is a statistical model that finds numerous applications
in machine learning and natural language processing. HMM-based systems have come to
dominate the field of speech recognition research in recent decades.
The Hidden Markov Model Toolkit (HTK) is a collection of tools for building and
manipulating HMMs, which is primarily used to build HMM-based speech recognition
systems. HTK was originally developed at Cambridge University, UK.
In this project, we use HTK to build an HMM-based CSR system with a medium vocabulary
size. The techniques used are very similar to those of an LVCSR system. The training data
was recorded by students of CS5241 Speech Processing at the National University of
Singapore. There are a total of 623 voice sessions as training data and 180 voice sessions
as test data. Each voice session consists of several transcribed sentences.

3 Basic System Design


We describe the basic workflow of our CSR system.

• The training data and test data are provided as sound wave files.

• Feature vectors are extracted from the sound wave files.

• We then train the parameters of our acoustic model, which is an HMM.

• To recognize a voice signal, a decoding algorithm is applied to our trained HMM to find the most probable word sequence.

3.1 Feature Extraction


The sound wave files are sampled, quantized and converted to a spectral representation.
A commonly used spectral representation is the mel-frequency cepstrum (MFCC), which
provides a real-valued feature vector for each frame in the input. Typically, an MFCC
feature vector consists of 39 features, including:

• 12 cepstral coefficients

• 1 energy coefficient

• 12 delta cepstral coefficients

• 1 delta energy coefficient

• 12 double delta cepstral coefficients

• 1 double delta energy coefficient

It turns out that the cepstral coefficients in an MFCC feature vector tend to be uncorrelated,
which makes MFCC useful since it simplifies our acoustic model.
In our experiment, to generate the MFCC feature files from the wave files, we use the
HCopy command with a configuration file as follows.

HCopy -T 1 -C lib/cfgs/wave2mfcc.cfg -S lib/flists/all.mfcc.scp

• all.mfcc.scp is a list containing the mapping between wave files and the corresponding
MFCC files

• wave2mfcc.cfg is a configuration file that tells HTK to transform the wave data to
MFCC features. In this configuration file, the option TARGETKIND specifies the
destination format. If we set TARGETKIND = MFCC_E_D_A_Z, the wave data
are transformed to the typical MFCC feature vectors of 39 elements (a sketch of
such a configuration file is shown below the list).
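
Below is a minimal sketch of what wave2mfcc.cfg might contain. The parameter values (frame period, window size, number of filterbank channels, and so on) are common HTK settings assumed here for illustration, not necessarily the exact contents of the file used in this project.

# wave2mfcc.cfg (illustrative sketch; values are assumptions)
SOURCEFORMAT = WAV           # input is wave files
TARGETKIND   = MFCC_E_D_A_Z  # MFCC + energy + deltas + accelerations, with cepstral mean normalization
TARGETRATE   = 100000.0      # 10 ms frame shift (HTK times are in units of 100 ns)
WINDOWSIZE   = 250000.0      # 25 ms analysis window
USEHAMMING   = T             # apply a Hamming window to each frame
PREEMCOEF    = 0.97          # pre-emphasis coefficient
NUMCHANS     = 26            # number of mel filterbank channels
NUMCEPS      = 12            # number of cepstral coefficients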

3.2 Hidden Markov Model
An HMM for speech recognition is parameterized by

• Q = q_1 q_2 ... q_n: a set of states corresponding to subphones. Because of the non-homogeneous nature of phones over time, each phone is usually divided into 3 subphones corresponding to 3 HMM states: a beginning, middle and end state.

• A = a_{01} a_{02} ... a_{n1} ... a_{nn}: a transition probability matrix. Each element represents the probability of a subphone taking a self-loop or moving to the next subphone.

• O: the set of observations, which are the cepstral feature vectors in our task.

• B = b_i(o_t): a set of observation likelihoods, each expressing the probability of a cepstral feature vector observed from a subphone state.

We can view the HMM as a representation of the lexicon: a set of pronunciations for
words, where each pronunciation is a sequence of subphones whose order is specified by the
transition probability matrix.
Given an HMM state, we need a way to compute the probability of an observation, i.e.
a feature vector. The most common method is to use Gaussian mixture model (GMM)
probability density functions. We treat the feature vectors as being generated by k Gaussian
distributions for a fixed k. The distributions have unknown means and variances; we can
train these unknown parameters iteratively using the Baum-Welch algorithm, a special
case of the Expectation Maximization (EM) algorithm. Each iteration of the algorithm
produces more accurate estimates of the means, variances and observation probability
functions.
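
Concretely, the GMM observation likelihood of a feature vector o_t in state i takes the standard form

\[
  b_i(o_t) \;=\; \sum_{m=1}^{k} c_{im}\, \mathcal{N}(o_t;\, \mu_{im}, \Sigma_{im}),
  \qquad \sum_{m=1}^{k} c_{im} = 1,
\]

where the c_{im} are the mixture weights and each component is a (typically diagonal-covariance) Gaussian; each Baum-Welch iteration re-estimates the weights, means and variances.
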
To train our acoustic HMM model in HTK, we first need a prototype of the model,
which is provided in the file lib/proto. The prototype is initialized with zero mean and
unit variance.
After that, we initialize the HMM as a single-Gaussian-component model and set its
mean and variance to the global mean and variance of all the training feature vectors. This
can be done using HCompV as follows.

HCompV -T 7 -C lib/cfgs/basic.cfg -m -f 0.01 \
       -M exp/hmm10 \
       -S lib/flists/train.scp \
       lib/proto

This command takes in the MFCC feature files listed in train.scp, the model prototype
and a configuration file basic.cfg, and generates the HMM model files in the directory
exp/hmm10.
To perform an iteration of the Baum-Welch algorithm on the existing model, we can
use the HERest command as follows:

HERest -T 1 -C lib/cfgs/basic.cfg \
       -H exp/hmm10/MMF \
       -M exp/hmm11 \
       -I lib/mlabs/train.mlf \
       -S lib/flists/train.scp \
       lib/mlists/all.mlist \
       > exp/hmm11/LOG

• train.mlf is the phone-level transcription of the training data

• train.scp is the list of MFCC training feature files

• all.mlist is the list of phones

• exp/hmm10/MMF is the initial HMM model generated by the command HCompV as described above

• basic.cfg is a configuration file

We perform the Baum-Welch iteration a few times to achieve a higher likelihood value, as sketched below.
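
In practice this simply means chaining HERest runs, each one reading the model written by the previous one. The directory names below are illustrative assumptions rather than the exact ones used in this project, and each output directory must exist before HERest writes into it.

HERest -T 1 -C lib/cfgs/basic.cfg -H exp/hmm11/MMF -M exp/hmm12 \
       -I lib/mlabs/train.mlf -S lib/flists/train.scp lib/mlists/all.mlist
HERest -T 1 -C lib/cfgs/basic.cfg -H exp/hmm12/MMF -M exp/hmm13 \
       -I lib/mlabs/train.mlf -S lib/flists/train.scp lib/mlists/all.mlist
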
A more sophisticated HMM is the context-dependent triphone model. Phones vary
enormously depending on their neighboring phones; thus, in a triphone HMM, each phone
is represented together with its left and right neighboring phones.
For example, the triphone [y-eh+l] means [eh] preceded by [y] and followed by [l].
In HTK, a triphone model can be generated from a monophone model by using the
command HLEd with the edit configuration file mktri.led.

3.3 Speaker Dependent vs. Independent Acoustic Modeling


It is usually impractical to collect enough training data from every individual user to
build a speaker-dependent recognition system. Thus speech recognition systems are usually
designed to be speaker-independent. But when we have enough data, speaker-dependent
systems perform better since the test data is more similar to the training data.
Although we rarely have enough training data for each individual speaker, we can
still train separate models for two broad groups of speakers, males and females, since
their voices obviously have different acoustic characteristics. When a test sentence comes
in, we can use a gender detector component to decide which of the two acoustic models
to use. A gender detector is a binary classifier that can be built with high
accuracy using GMM classifiers based on cepstral features.
Even when there is not enough training data for a speaker-dependent system, there are
speaker adaptation techniques that adapt speaker-independent acoustic models to new
speakers using only a small amount of voice data from the new speaker. A commonly
used technique is maximum likelihood linear regression (MLLR). The MLLR algorithm
takes in a trained acoustic model and a small amount of adaptation data from a new speaker,
which can be as little as ten seconds of speech. It learns a linear transform matrix W
and a bias vector ω to transform the means of the current acoustic model.
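
In the standard MLLR formulation, each Gaussian mean μ of the speaker-independent model is replaced by an adapted mean

\[
  \hat{\mu} \;=\; W\mu + \omega,
\]

where W and ω are estimated to maximize the likelihood of the adaptation data. Because one transform can be shared across many Gaussian components, a few seconds of adaptation speech are enough to estimate it.
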
In HTK, speaker adaptation can be performed using the command HEAdapt. HEAdapt
does a supervised MLLR adaptation in which the transcription of the voice data from
the new speaker is known. If the transcription is unknown, unsupervised adaptation can
be performed using the tool HVite.

4 Experimental Results
4.1 Word-loop vs. Unigram/Bigram Recognition
We perform recognition using word-loop nets as follows

HVite -T 1 -C lib/cfgs/basic.cfg -H exp/hmm84/MMF \
      -i exp/hmm84/test.mlf \
      -w lib/nets/word.loop.net -S lib/flists/test.scp \
      lib/dcts/words.dct lib/mlists/all.mlist \
      > exp/hmm84/test.LOG

The word accuracy result is 6.50%.

------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=180, N=180]
WORD: %Corr=13.47, Acc=6.50 [H=230, D=313, S=1164, I=119, N=1707]
===================================================================
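
In these result tables (here and in the following experiments), H, D, S and I are the numbers of correctly recognized words (hits), deletions, substitutions and insertions, and N is the total number of words in the reference transcriptions. The word-level figures follow the standard HTK definitions:

\[
  \%\mathrm{Corr} = \frac{H}{N} \times 100, \qquad
  \mathrm{Acc} = \frac{H - I}{N} \times 100, \qquad H = N - D - S.
\]

Tables of this form are produced by the HTK scoring tool HResults; a typical invocation, shown here with placeholder file names rather than the exact ones used in this project, is

HResults -I <reference_word.mlf> <wordlist> <recognized.mlf>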

For unigram nets, we set the cut-off threshold to be -25 to reduce the running time.

HVite -t 25 -T 1 -C lib/cfgs/basic.cfg -H exp/hmm84/MMF \
      -i exp/hmm84/test_unigram.mlf -w lib/nets/unigram.all.net \
      -S lib/flists/test.scp lib/dcts/words.dct lib/mlists/all.mlist \
      > exp/hmm84/test_unigram.LOG

The word accuracy result is 3.82%.

------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=92, N=92]
WORD: %Corr=11.82, Acc=3.82 [H=102, D=139, S=622, I=69, N=863]
===================================================================

For bigram nets, we set the threshold to be -50.

HVite -t 50 -T 1 -C lib/cfgs/basic.cfg -H exp/hmm84/MMF \
      -i exp/hmm84/test_bigram.mlf -w lib/nets/bigram.all.net \
      -S lib/flists/test.scp lib/dcts/words.dct lib/mlists/all.mlist \
      > exp/hmm84/test_bigram.LOG

The word accuracy result is 4.94%.

------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=141, N=141]
WORD: %Corr=13.97, Acc=4.94 [H=184, D=174, S=959, I=119, N=1317]
===================================================================
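
For completeness, unigram and bigram word networks of this kind are usually built from word-level training transcriptions using the HTK language-modelling tools. The sketch below shows the general procedure; all of the file names are assumptions rather than the ones used in this project.

HLStats -b lib/lms/bigram.lm -o lib/dcts/wlist lib/mlabs/train_word.mlf
HBuild -n lib/lms/bigram.lm lib/dcts/wlist lib/nets/bigram.all.net

The first command estimates a backed-off bigram from the training labels; the second converts it into a word network that HVite can decode with.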

4.2 Varying amount of training data


For this experiment, we remove 314 out of 623 training sessions, about half of the training
data, and retrain the HMM models. The final model is hmmcut84. We perform
recognition using word-loop nets as follows

HVite -T 1 -C lib/cfgs/basic.cfg -H exp/hmmcut84/MMF \
      -i exp/hmmcut84/test.mlf -w lib/nets/word.loop.net \
      -S lib/flists/test.scp lib/dcts/words.dct lib/mlists/all.mlist \
      > exp/hmmcut84/test.LOG

The word accuracy result is -4.86%.

------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=180, N=180]
WORD: %Corr=12.48, Acc=-4.86 [H=213, D=261, S=1233, I=296, N=1707]
===================================================================

4.3 Adding differential parameters


For this experiment, we set TARGETKIND to be MFCC_E_D_A_Z, so the feature vectors
have dimension 39. We regenerate the MFCC feature files, retrain the models and
perform recognition as follows

HVite -T 1 -C lib/cfgs/basic_new.cfg -H exp/hmmnew84/MMF \
      -i exp/hmmnew84/test.mlf -w lib/nets/word.loop.net \
      -S lib/flists/test.scp lib/dcts/words.dct lib/mlists/all.mlist \
      > exp/hmmnew84/test.LOG

The word accuracy result improves to 8.14%.

------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=180, N=180]
WORD: %Corr=32.57, Acc=8.14 [H=556, D=120, S=1031, I=417, N=1707]
===================================================================

4.4 Varying the number of Gaussian components
For this experiment, we perform recognition with models of 1, 2, 4, 6 and 8 Gaussian
components (hmm14, hmm24, hmm44, hmm64 and hmm84).
The word accuracy results are, respectively, -3.69%, 3.46%, 4.45%, 4.57% and 6.50%.
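
Models with more Gaussian components are typically obtained by repeatedly splitting the existing components with the HHEd MU command and then re-estimating with HERest. The sketch below shows one such splitting step from 1 to 2 components per state; the edit script name, its contents and the directory names are assumptions rather than the exact ones used in this project.

# mix2.hed would contain:  MU 2 {*.state[2-4].mix}
HHEd -H exp/hmm14/MMF -M exp/hmm20 lib/edfiles/mix2.hed lib/mlists/all.mlist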

4.5 Varying the grammar scale factor and insertion penalty


For this experiment, we set the grammar scale factor to be 10, the insertion penalty to
be -10 and perform recognition as follows

HVite -s 10.0 -p -10.0 -T 1 -C lib/cfgs/basic_new.cfg \
      -H exp/hmmnew84/MMF -i exp/hmmnew84/test_grammar.mlf \
      -w lib/nets/word.loop.net -S lib/flists/test.scp \
      lib/dcts/words.dct lib/mlists/all.mlist \
      > exp/hmmnew84/test.LOG

The word accuracy result improves substantially to 22.03%.

------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=180, N=180]
WORD: %Corr=28.12, Acc=22.03 [H=480, D=506, S=721, I=104, N=1707]
===================================================================

4.6 Varying the pruning thresholds for beam search


In this experiment, we perform recognition with various beam search pruning thresholds
from -50 to -400 in steps of 50.

HVite -t 50.0 -s 10.0 -p -10.0 -T 1 -C lib/cfgs/basic_new.cfg \
      -H exp/hmmnew84/MMF -i exp/hmmnew84/test_grammar_50.mlf \
      -w lib/nets/word.loop.net \
      -S lib/flists/test.scp lib/dcts/words.dct lib/mlists/all.mlist \
      > exp/hmmnew84/test.LOG

The accuracy results are provided below

• t = −50: We receive the message "No transcription found" from HTK. The threshold
is so high that it cuts off every node in the search tree.

• t = −100: Acc = 18.40

• t = −150: Acc = 21.83

• t = −200, −250, −300, −350, −400: Acc = 22.03

The higher the threshold, the faster the recognition is performed. For example, among
the above threshold values, recognition runs fastest when t = −50.
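
Beam pruning works by discarding, at every frame, any partial hypothesis whose score falls too far below the current best one. Using our convention that the threshold is negative, the beam width is |t|, so a token with log score ℓ is pruned whenever

\[
  \ell \;<\; \ell_{\max} - |t|,
\]

where ℓ_max is the score of the best token at that frame. A small |t| prunes aggressively and runs fast, but risks discarding the correct hypothesis entirely, which is what happens at t = −50.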

4.7 Other Experiments
We have also tried to build triphone models using HLEd and speaker adaptation systems
using HERest, but have not finished the recognition part successfully.
To build triphone HMM models from existing monophone HMM models, one needs to
create a configuration file mktri.led. This file can be generated by using the Perl script
maketrihed in HTKTutorial as follows

perl maketrihed all.mlist triphones.mlist

After that, we generate the triphone-level transcriptions (and the list of triphones) using the HLEd command as follows

HLEd -n triphones1 -i lib/mlabs/train_tri.mlf lib/edfiles/mktri.led \
     lib/mlabs/train.mlf
HLEd -n triphones1 -i lib/mlabs/test_tri.mlf lib/edfiles/mktri.led \
     lib/mlabs/test.mlf
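
Had we completed this part, the usual next step (following the HTK Book tutorial [1]) would have been to clone the monophone models into the triphones listed in triphones1 and tie their transition matrices using HHEd. The sketch below illustrates that step; the edit script (a separate HHEd script, distinct from the HLEd file mktri.led above), its contents and the directory names are all assumptions.

# The HHEd edit script would contain a clone command and transition-tying commands, e.g.
#   CL triphones1
#   TI T_ah {(*-ah+*,ah+*,*-ah).transP}
HHEd -H exp/hmm84/MMF -M exp/hmmtri0 lib/edfiles/mktri.hed lib/mlists/all.mlist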

5 Discussions
In experiment 4.1, the running time of unigram/bigram recognition is very long if we do
not set the cut-off threshold values.
In experiment 4.2, as expected, cutting the amount of training data roughly in half reduces
the accuracy value substantially.
In experiment 4.3, increasing the dimension of the feature vectors by adding differential
components improves the accuracy value but also increases the training and recognition
running time.
In experiment 4.4, the accuracy value increases as the number of Gaussian components
increases. No overfitting occurs; we think this is because 8 components are still too few
to overfit the distribution of the feature vectors.
Experiment 4.5 improves the accuracy the most by changing the grammar scale factor
and insertion penalty. The word insertion penalty is a fixed value added to each token
when it transits from the end of one word to the start of the next. The grammar scale
factor is the amount by which the language model probability is scaled before being
added to each token as it transits from the end of one word to the start of the next [1].
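
In other words, when a token crosses a word boundary during decoding, its log score is updated as

\[
  \log P \;\leftarrow\; \log P + s \cdot \log P_{LM}(w) + p,
\]

where s is the grammar scale factor (10 here), p is the word insertion penalty (-10 here) and P_{LM}(w) is the language model probability of the next word. A more negative p discourages the recognizer from hypothesizing many short words, which is consistent with the number of insertions dropping from 417 in experiment 4.3 to 104 in experiment 4.5.
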
As expected, in experiment 4.6, increasing the threshold values (we use the convention
that the threshold is negative) reduces the recognition running time, but at the cost of
recognition accuracy.

6 Conclusions
By working on this practical project on speech recognition, we review the theory
of speech recognition systems along the way and also gain practical skills with the
HTK tools. We learn simple techniques to improve the recognition rate. We also find out
that there are numerous parameters that can affect the recognition performance. These
factors can be tuned with both theoretical understanding of speech recognition systems
and practical usage skills of the HTK tools.
As future work, we could use the HTK tools for speech recognition applications in
our mother tongue, Vietnamese.

References
[1] S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, P. Woodland (1999). The
HTK Book (for HTK Version 3.1).

[2] D. Jurafsky, J. H. Martin (2009). Speech and Language Processing, Second Edition.
