
BACK-PROPAGATION TRAINING OF A NEURAL NETWORK

FOR WORD SPOTTING


Thomas M. English† and Lois C. Boggess*
†Dept. of Computer Science, Texas Tech University, Lubbock, Texas 79409-3104 USA
*Dept. of Computer Science, Mississippi State University, Mississippi State, Mississippi 39762 USA

ABSTRACT
An approach to back-propagation training of a neural network for word spotting is described. It is assumed that the network has one output unit for each keyword to be detected, and that features of the speech signal are input at fixed intervals. The goal of training is to obtain a network that emits a detection pulse at the appropriate output unit when the utterance of a keyword is completed. We have developed a successful back-propagation strategy which incorporates "don't care" targets for outputs expected to be in the process of rising or falling, propagation of errors for only a subset of those times at which no detection pulse is expected, iterative refinement of the temporal placement of target outputs, and use of a super-squared error criterion. In an application of the strategy to speaker-dependent, continuous digit recognition (i.e., digit spotting with no utterances of non-digits), word-error rates of 0% and 2.5% were
achieved for the training and test utterances, respectively.

1. INTRODUCTION
We describe a procedure for training neural networks to spot
utterances of "keywords" in fluent speech. A successful
word spotter would have many applications, including monitoring of enemy voice transmissions, voice control of
devices in environments where non-control speech is prevalent (e.g., in homes, offices, and airplane cockpits),
information retrieval from stored speech messages [1], and
limited vocabulary speech recognition in domains where
talkers' interjections are frequent and varied (e.g., as in
determining the type of a long-distance call [2]).
The training procedure described here incorporates error
back-propagation [3], and should be applicable to various
neural network architectures. We assume that the network
has one output unit for each keyword to be spotted. At
fixed intervals, a representation of the speech signal is
input to the network. When the network detects the completion of a keyword utterance, it identifies the word by
generating a "high" signal at the appropriate output unit.
At all other times, the network generates "low" signals at
all output units.
Obtaining this behavior through back-propagation
training has proven to be difficult. One problem is that
there is no general way to specify precisely when in the
utterance of a keyword the appropriate output unit should
"turn on" and when it should "turn off." The behavior that
is realizable by the network depends upon the particular
keyword and the context in which it is uttered.
Figure 1. Modules of a neural network for word spotting: map, memory units, and feed-forward subnet.

Another problem is that output units are expected to emit low signals much more often than high signals. This
means that failure to emit desired high signals contributes
relatively little to the mean squared error of the network. In
our initial attempts to apply back-propagation, the network
entered local minima in which some output units gave low
responses to all inputs.
Before presenting details of the training procedure
(Section 3), we describe the particular architecture for which
it was developed (Section 2). Section 4 gives a rule for
deciding when a keyword utterance has been spotted.
Experiments, discussion, and conclusions are presented in
Sections 5-7.

2. WORD-SPOTTING ARCHITECTURE
As suggested in Figure 1, the word-spotting network comprises three subnetworks. Vectors of speech signal features
are input to a Kohonen map [4], the outputs of which are
processed by a layer of self-connected "memory" units. The
outputs of the memory layer are transformed into detection
pulses by a feed-forward subnet.

2.1. Kohonen Map

The Kohonen map functions as a vector quantizer, effectively labeling each input by setting one and only one of its outputs high. That is, for each input vector x(t) there is precisely one value of index i for which map output bi(t) = 1. For all j ≠ i, bj(t) = 0. The map's partition of the input space is obtained through an unsupervised clustering procedure [4].
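As an illustration of this labeling, the sketch below (ours, not the authors' code) quantizes an input vector against an assumed codebook of trained map-unit weight vectors and returns a one-hot output:

```python
import numpy as np

def map_outputs(x, codebook):
    """One-hot output of a trained Kohonen map used as a vector quantizer.

    x        -- input feature vector, shape (d,)
    codebook -- trained map-unit weight vectors, shape (n_units, d)
    Returns b with b[i] = 1 for the best-matching unit and 0 elsewhere.
    """
    distances = np.linalg.norm(codebook - x, axis=1)  # distance to every map unit
    b = np.zeros(len(codebook))
    b[np.argmin(distances)] = 1.0                     # label the input by its nearest unit
    return b
```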



Figure 2. Feed-forward subnet with two pools in each hidden layer, and with three units in each pool of the second hidden layer. Units with squashing functions are labeled "s." Those with Gaussian functions are labeled "g."

2.2. Memory Units


Each memory unit is "hard-wired" to encode how recently its
corresponding map unit has emitted a high signal. When
the speech trajectory is within the map unit's block of input
space, the activation of the memory unit is maximal.
During other intervals the memory unit's activation decays
exponentially.
More formally, the memory units' activations, mi(t),
are all 0 at time t = 0. Then
mi(t) = min{bi(t) + a·mi(t−1), 1}
for t > 0, with a ∈ (0, 1). Implicitly, the weight of the connection from a map unit to its corresponding memory unit is 1, and the weight of the feedback connection is a.
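A minimal sketch of this update, assuming the one-hot map outputs of Section 2.1 (the decay value shown is merely illustrative):

```python
import numpy as np

def update_memory(m_prev, b, a=0.955):
    """One time step of the hard-wired memory layer.

    m_prev -- memory activations at time t-1, shape (n_units,)
    b      -- one-hot map outputs at time t, shape (n_units,)
    a      -- decay constant in (0, 1); 0.955 roughly matches the
              15-step half-life mentioned in Section 5.2
    Implements mi(t) = min{bi(t) + a * mi(t-1), 1}.
    """
    return np.minimum(b + a * m_prev, 1.0)
```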

2.3. Feed-Forward Subnet


Figure 2 depicts a feed-forward architecture which has
been successful in generating detection pulses. Trainable
biases of non-input units are omitted from the picture. The
input units simply hold "copies" of the memory unit activations. Each unit in the first hidden layer has only one
input connection, and applies the Gaussian activation function exp(−x²). Other non-input units apply the logistic "squashing" function 1/(1 + exp(−x)).
Units are grouped into pools, and the arrows between
pools in Figure 2 are labeled to indicate connectivity of
units within the pools. Note that the size of pools in the
first hidden layer is implicitly that of the Kohonen map,
and that each pool in the first hidden layer is connected to
precisely one pool in the second hidden layer.
Members of this family of networks may be identified
by two parameters: the number of pools per hidden layer,
and the size of pools in the second hidden layer. The network of Figure 2 has two pools per hidden layer, with three
units per pool in the second hidden layer.
Training of the feed-forward subnet (described in the
following section) begins only when training of the
Kohonen map is complete.
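Since the paper gives no code, the forward pass through a member of this family might be sketched as follows; the parameter names and shapes are our assumptions, and biases are included for all non-input units:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(m, params):
    """Forward pass through one feed-forward subnet of the family in Section 2.3.

    m      -- memory-layer activations (the subnet's inputs), shape (n_map,)
    params -- dict of weights; shapes below assume n_pools pools per hidden
              layer and pool_size units per pool in the second hidden layer.
              All names and shapes are illustrative, not taken from the paper.
    """
    n_pools = len(params["w1"])                 # one entry per pool
    h2 = []
    for p in range(n_pools):
        # First hidden layer: one Gaussian unit per input, single connection each.
        w1, b1 = params["w1"][p], params["b1"][p]        # each shape (n_map,)
        g = np.exp(-(w1 * m + b1) ** 2)                  # Gaussian activations
        # Second hidden layer: each pool fully connected to one first-layer pool.
        W2, b2 = params["W2"][p], params["b2"][p]        # (pool_size, n_map), (pool_size,)
        h2.append(logistic(W2 @ g + b2))
    h2 = np.concatenate(h2)
    # Output layer: one logistic unit per keyword, connected to all second-layer units.
    return logistic(params["W3"] @ h2 + params["b3"])
```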

3. BACK-PROPAGATION TRAINING

As mentioned in the introduction, it is difficult to specify precisely the shape of detection pulses to be emitted by the
network. Also, the fact that pulses are desired a relatively
small fraction of the time leads to problems with local minima in which some output units never produce high signals.
We address these problems by incorporating a) "don't care"
targets for outputs we expect to be in the process of rising
or falling, b) back-propagation of errors for only a subset
of those times at which no output is expected to be high, c)
iterative refinement of the temporal placement of targets,
and d) use of back-propagation to reduce the mean of super-squared errors.

3.1. Defining a Sequence of Desired Outputs


We assume that the training set has been divided into
segments, with each segment comprising a single keyword
utterance or an interval in which no keyword occurs. We
consider the cases in which the current segment is an utterance of the keyword with index c. If the previous segment
is a keyword utterance, we denote the index of the keyword
with 1. Let t denote the time at which the last segment
ended, and let n denote the number of time steps in the
current segment.
The sequence of target outputs is defined in terms of
constants P and T, where P is the pulse duration and T is the
number of transitional time steps at which to train the net.
We assume that T is odd, and that P + T ≤ n.
The appropriate output unit is trained to emit a high
signal for the last P time steps of the segment:
dc(t+n−i) = 1,   i = 0, 1, ..., P−1,
where dc(t) is the desired output for unit c at time t. Of the m = n − P transitional time steps (preceding the pulse), T evenly spaced steps are selected for training:
ti = t + [i·m / (T+1)],   i = 1, 2, ..., T,
where [x] is x rounded to the nearest integer. If the preceding segment is the utterance of a keyword, then the output for that word is a "don't care" condition for the first part of the current segment:
dl(ti) = ol(ti),   i ≤ ⌊T/2⌋,
where ol(t) is the actual output of unit l at time t.
Similarly, the output of the unit representing the current keyword is not adjusted in the latter part of the transitional region:
dc(ti) = oc(ti),   i > ⌈T/2⌉.
Any target output not explicitly specified in the preceding
is set to zero. Note that all target outputs are zero at the
middle of the transitional region.
When a keyword utterance is followed by a nonkeyword segment, an "all-zero" target vector is specified
about 1/5 s after the end of the desired pulse or just prior to
the next keyword utterance, depending upon which comes
first. This encourages the net to "turn off" the pulse.
For consecutive utterances of a single keyword (l = c), we have modified two of the transitional targets:
dc(ti) = min{oc(ti), 0.5},   i ∈ {⌊T/2⌋, ⌈T/2⌉+1}.
Put simply, the three targets at the middle of the transition are "not high," "low," and "not high." This was beneficial with a decision rule which gave detection of repeated utterances only if the output dropped below a threshold [5],


but may be unnecessary with the rule described in Section 4.


In our experience, use of "don't care" and "not high"
targets has produced higher accuracy than any precise specification of the rise and fall of an output unit's activation.
Reducing the number of times at which we train for
only low outputs increases the mean desired activation of
the output units, and consequently ameliorates problems
with local minima in training. Also, the computational
cost of training is reduced.
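As an illustration of the rules above, the sketch below assembles the targets and an error mask for one keyword segment. It reflects our reading of this section, with "don't care" outputs expressed as masked-out errors (equivalent to copying the actual outputs into the targets); the repeated-keyword and post-pulse "all-zero" cases are omitted for brevity.

```python
import numpy as np

def keyword_targets(t, n, c, l, n_keywords, P=5, T=7):
    """Targets and error mask for one keyword segment (sketch of Section 3.1).

    t -- time at which the previous segment ended
    n -- number of time steps in the current segment
    c -- index of the current keyword
    l -- index of the previous keyword, or None if the previous segment
         contained no keyword
    Returns (times, targets, mask): for each selected training time, a target
    vector and a 0/1 mask saying which outputs contribute error.
    """
    times, targets, mask = [], [], []
    m = n - P                                   # transitional steps before the pulse
    # T evenly spaced transitional training times.
    for i in range(1, T + 1):
        ti = t + int(round(i * m / (T + 1)))
        d = np.zeros(n_keywords)
        w = np.ones(n_keywords)
        if l is not None and i <= T // 2:
            w[l] = 0.0                          # previous keyword: "don't care" early on
        if i > (T + 1) // 2:
            w[c] = 0.0                          # current keyword: "don't care" late
        times.append(ti); targets.append(d); mask.append(w)
    # Detection pulse: the last P time steps of the segment.
    for i in range(P):
        d = np.zeros(n_keywords)
        d[c] = 1.0
        times.append(t + n - i); targets.append(d); mask.append(np.ones(n_keywords))
    return times, targets, mask
```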

3.2. Iterative Adjustment of Desired Outputs


In experiments reported upon here, pulse locations were
initially specified through manual segmentation of the
training utterances. Comparison of the actual pulses of
trained nets with the desired pulses revealed that the nets
were often giving correct pulses at times quite different from
those specified. Consequently, centers of desired pulses
were manually adjusted to agree with observed pulses, and
nets were retrained with random initial weights.
Target adjustment was complicated by the tendency for
two pulses to emerge for keyword utterances followed by
pauses. In such a context, the duration of the final phone
is quite long, and the first pulse occurs in response to an
acoustic prefix that "sounds like" an utterance of the word
in mid-sentence. The second pulse occurs close to the end
of the utterance, where the desired pulse was initially
placed.
We consistently aligned the desired pulse with the
earlier of the two pulses, although the latter of the actual
pulses was often the stronger. This strategy, combined
with lowpass filtering of network outputs, did result in
single detection pulses. Our experience in manual realignment suggests that the best initial placement of the pulse
for a keyword followed by a pause is at the onset of the
final phone.
In summary, bringing our specification of when output
activations should go high into better agreement with the
actual response of trained networks has given stronger, less
noisy detection pulses.

3.3. Super-squared Error Criteria


The back-propagation procedure [3] is easily generalized to reduce the Lp norm of the vector of deviations of actual outputs from desired outputs. For p = 2, the conventional squared error criterion is obtained.
A super-squared error criterion (p > 2) places greater
emphasis upon reduction of occasional large errors than
does the squared error criterion. Since deviations from
desired high outputs tend to be greater than deviations from
desired low outputs, this essentially gives more importance
to the high target outputs. In experiments described below,
the super-squared error criterion appeared to remove all
problems with local minima that were encountered when the
squared error criterion was used.
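One way to realize such a criterion and its derivative for back-propagation is sketched below; the assumption that the error at each training time is the sum of |o − d|^p over output units is ours.

```python
import numpy as np

def lp_error_and_delta(o, d, p=10, mask=None):
    """Super-squared error criterion and its derivative w.r.t. the outputs.

    o, d -- actual and desired output vectors
    p    -- error exponent (p = 2 gives conventional squared error)
    mask -- optional 0/1 vector implementing "don't care" targets
    Returns (error, dE/do) for use in ordinary error back-propagation.
    """
    diff = o - d
    if mask is not None:
        diff = diff * mask
    error = np.sum(np.abs(diff) ** p)
    grad = p * np.sign(diff) * np.abs(diff) ** (p - 1)
    return error, grad
```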

4. DECISION RULE
We have developed an effective approach to interpreting the

outputs of the network. An utterance of a keyword is spotted if the corresponding output is above threshold and is
also the maximum of all outputs in a temporal window centered at the current time. That is, an above-threshold output
is ignored if it is weaker than some output in the recent
past or near future. The window length is set somewhat less
than twice the minimum duration of a keyword utterance.
Accuracy is improved when the network's outputs are
smoothed prior to application of this decision rule. The
object is to "flatten" short pulses sufficiently to keep their
peaks below threshold. Appropriate definition of the
smoothing operation depends upon the setting of the pulse
length parameter, P (Section 3.1).
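A sketch of the full decision procedure, with smoothing followed by the windowed-maximum test, is given below; the window and smoothing lengths are illustrative values chosen to roughly match Sections 5.3 and 5.4.

```python
import numpy as np

def spot_keywords(outputs, threshold=0.5, smooth_len=7, window=19):
    """Decision rule of Section 4 (a sketch; parameter values are illustrative).

    outputs -- array of shape (n_times, n_keywords) of raw network outputs
    Returns a list of (time, keyword) detections.
    """
    outputs = np.asarray(outputs, dtype=float)
    # Smooth each output sequence with a rectangular (moving-average) window.
    kernel = np.ones(smooth_len) / smooth_len
    smoothed = np.column_stack(
        [np.convolve(outputs[:, k], kernel, mode="same")
         for k in range(outputs.shape[1])])
    half = window // 2
    detections = []
    for t in range(len(smoothed)):
        k = int(np.argmax(smoothed[t]))
        value = smoothed[t, k]
        lo, hi = max(0, t - half), min(len(smoothed), t + half + 1)
        # Accept only if above threshold and maximal over all outputs in the window.
        if value >= threshold and value >= smoothed[lo:hi].max():
            detections.append((t, k))
    return detections
```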

5. EXPERIMENTS
We have trained several variants of a hybrid neural network
(Section 2) to spot completed utterances of the English
digits. As a simplification, our preliminary experiments
have not included utterances of words outside the keyword
vocabulary. Thus it is appropriate to evaluate the sequences
of spotted digits as though they were joint decisions made
by a continuous speech recognizer.

5.1. Signal Processing

Simulated filterbank analysis of a 12.5 kHz signal produced a 15-component mel-scale power spectrum about 98
times per second [5]. Energy in frequencies below 200 Hz
was ignored.
At each time step, the three most recent spectra were
concatenated into a 45-component "long vector." Loudness
normalization was accomplished by subtracting the mean of
all components in the long vector from each component.
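A minimal sketch of this front-end step, assuming the mel-scale spectra have already been computed:

```python
import numpy as np

def long_vector(spectra, t):
    """Build the normalized 45-component "long vector" for time step t.

    spectra -- array of shape (n_times, 15) of mel-scale power spectra
    The three most recent spectra are concatenated, and the mean of all 45
    components is subtracted for loudness normalization (Section 5.1).
    """
    v = np.concatenate([spectra[t - 2], spectra[t - 1], spectra[t]])
    return v - v.mean()
```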

5.2. Network Parameters


The Kohonen map was an 8-by-8 grid of 64 units, with
hexagonal adjacency of units [4]. The training set comprised approximately 12 000 long vectors extracted from
multiple utterances of all pairs of digits. Kohonen's training algorithm [4] was run for 50 000 iterations.
Studies of isolated digit recognition [5] have suggested that accuracy is not at all sensitive to the setting of memory parameter a, and here a was chosen to make the half-life of a memory unit's activation equal to half the average
duration of a digit utterance. That is, in the absence of a
high input from the corresponding map unit, activation decayed by half in about 15 time steps.
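The corresponding decay constant follows directly from the stated half-life (a small worked computation, not taken from the paper):

```python
# Choosing the memory decay constant a so that activation halves in 15 steps:
# a ** 15 = 0.5  =>  a = 0.5 ** (1 / 15)
a = 0.5 ** (1 / 15)
print(round(a, 4))   # approximately 0.9548
```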
In experiments involving a decision rule different from
that described here, various feed-forward subnets from the
family described in Section 2.3 gave virtually identical
results [5]. Here we report upon a subnet with 3 pools per
hidden layer, and with 6 squashing units in each pool of the
second hidden layer.

5.3. Training Parameters


In defining the desired outputs for back-propagation
training, the pulse duration was 5 time steps (P = 5), and
targets were specified for 7 time steps in each transitional
segment (T = 7). The error criterion was based on the L10
norm. "Batch" back-propagation was applied, with weight
updates occurring about once per 2000 inputs. We allowed
training to continue sufficiently long that further training
was unlikely to increase the trial-set error rate.



5.4. Results
Random strings of digits were read by a single talker.
The digit "0" was always pronounced as "zero." The feedforward subnet was trained with 385 strings (1540 words),
and was tested with 140 strings (560 words). The detection
threshold was set arbitrarily to the mean of possible output
values (0.5). Each unit's output sequence was smoothed
with a 7-point rectangular window (coefficients of 1/7). The
length of the decision window was about 0.2 s.
The network recognized the training and test utterances
with word-error rates of 0% and 2.5%, respectively.

6. DISCUSSION
While the accuracy we have observed compares poorly to
that of digit recognizers utilizing time alignment procedures
(e.g., hidden control neural nets [6]), the results seem quite
good for a system that makes strictly local decisions. We
emphasize that the ephemeral peak performance of the network has not been reported, and that no system parameters
have been tuned using the test set.
Recalling, however, that our primary interest is the
training strategy, the important observation is that a network with reasonably good generalization properties learned
to process the training set perfectly. The strategy should
allow for training of different types of networks, some of
which may generalize better than the type presented here.
An alternate architecture of particular interest is the
time-delay neural network (TDNN), which has been applied to phoneme spotting [7, 8]. Transitional regions have
often been omitted from the training sets for TDNNs, with
the tacit assumption that reasonable behavior would
spontaneously emerge for transitions in the trial set. It is
possible that insertion errors could be reduced with our
approach to defining target outputs for transitions.
A critical problem, at present, is that we have no procedure for automatically aligning desired pulses with actual
pulses. Given the "two-pulse'' phenomenon described in
Section 3.2, it seems unlikely that a simple dynamic programming algorithm would be adequate. Automating the
alignment of pulse centers for regions processed correctly
by the network, and leaving other regions for manual
alignment, should be straightforward, however.
We have not addressed directly the problem of training
a network to reject non-keyword utterances. Our observations strongly suggest that it will be necessary to limit the
"weight" given to non-keyword intervals. This might be
accomplished by initially training only during keyword
utterances, and then adding to the training set those intervals in which "false alarm" pulses emerge. Another possibility is to use a weighted error criterion, giving relatively
large weight to errors for keyword intervals. Also, increasing the parameter p of the error criterion has been shown
effective in coping with low mean desired activations. But
we do not know how high one can set p without encountering numerical (or other) problems.

7. CONCLUSION

The contributions of this work are a) an approach to word spotting that does not require time alignment, b) demonstration of how to apply back-propagation to the word spotting problem, and c) empirical evidence that super-squared error criteria are sometimes preferable to the conventional squared error criterion. The back-propagation techniques may be used in studies of networks very different from our own. Our rationale for super-squared error criteria applies whenever the mean desired activations for output units are very low. For instance, super-squared error criteria have proven advantageous in the context of handwritten letter recognition [9].

ACKNOWLEDGEMENTS

Computing resources were supplied by the Mississippi Center for Supercomputing Research (MCSR). Special
thanks go to Kathy Gates of MCSR for her generous
assistance in various matters.
REFERENCES
[1] R. Rose, E. Chang, and R. Lippmann, "Techniques for
information retrieval from voice messages," in ICASSP 91,
pp. 317-20, 1991.

[2] J. G. Wilpon, L. R. Rabiner, C.-H. Lee, and E. R. Goldman, "Automatic recognition of keywords in
unconstrained speech using hidden Markov models," IEEE
Trans. ASSP, vol. 38, pp. 1870-78, 1990.
[3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams,
"Learning internal representations by error propagation," in
Parallel Distributed Processing, D. E. Rumelhart, J.
L. McClelland, and the PDP Research Group, eds.
Cambridge: MIT Press, 1986.
[4] T. Kohonen, K. Torkkola, M. Shozakai, J. Kangas,
and O. Venta, "Phonetic typewriter for Finnish and
Japanese," in ICASSP 88, pp. 607-10, 1988.

[5] T. English, A Neural Network for Classification of Spoken Digits. Ph.D. dissertation, Dept. of Computer
Science, Mississippi State Univ., 1990.
[6] E. Levin, "Modeling time varying systems using
hidden control neural architecture," in Advances in
Neural Information Processing Systems 3, R. P.
Lippmann, J. E. Moody, and D. S. Touretzky, eds. San
Mateo, Calif.: Morgan Kaufmann, 1991.

[7] M. Miyatake, H. Sawai, Y. Minami, and K. Shikano, "Integrated training for spotting Japanese phonemes using large phonemic time-delay neural networks," in ICASSP
90, pp. 449-52, 1990.
[8] W. Ma and D. Van Compernolle, "TDNN Labeling for
a HMM Recognizer," in ICASSP 90, pp. 421-23, 1990.
[9] T. English, M. Gomez-Gil, and W. Oldham,
"Recognition of handwritten lower-case letters: a
comparison of LeCun networks to other classifiers,"
submitted for publication, 1991.
