ABSTRACT
An approach to back-propagation training of a neural network
1. INTRODUCTION
We describe a procedure for training neural networks to spot
utterances of "keywords" in fluent speech. A successful
word spotter would have many applications, including monitoring of enemy voice transmissions, voice control of
devices in environments where non-control speech is prevalent (e.g., in homes, offices, and airplane cockpits),
information retrieval from stored speech messages [1], and
limited vocabulary speech recognition in domains where
talkers' interjections are frequent and varied (e.g., as in
determining the type of a long-distance call [2]).
The training procedure described here incorporates error
back-propagation [3], and should be applicable to various
neural network architectures. We assume that the network
has one output unit for each keyword to be spotted. At
fixed intervals, a representation of the speech signal is
input to the network. When the network detects the completion of a keyword utterance, it identifies the word by
generating a "high" signal at the appropriate output unit.
At all other times, the network generates "low" signals at
all output units.
Obtaining this behavior through back-propagation
training has proven to be difficult. One problem is that
there is no general way to specify precisely when in the
utterance of a keyword the appropriate output unit should
"turn on" and when it should "turn off." The behavior that
is realizable by the network depends upon the particular
keyword and the context in which it is uttered.
Another problem is that output units are expected to
Figure 1. Word-spotting network: Kohonen map, memory layer, and feed-forward subnet.
2. WORD-SPOTTING ARCHITECTURE
As suggested in Figure 1, the word-spotting network comprises three subnetworks. Vectors of speech signal features
are input to a Kohonen map [4], the outputs of which are
processed by a layer of self-connected "memory" units. The
outputs of the memory layer are transformed into detection
pulses by a feed-forward subnet.
2.1. Kohonen Map
0-7803-0532-9/92 $3.00 © 1992 IEEE
3. BACK-PROPAGATION TRAINING
Figure 2. Feed-forward subnet with two pools in each hidden
layer, and with three units in each pool of the second
hidden layer. Units with squashing functions are labeled
"s." Those with Gaussian functions are labeled "r."
4. DECISION RULE
We have developed an effective approach to interpreting the
outputs of the network. An utterance of a keyword is spotted if the corresponding output is above threshold and is
also the maximum of all outputs in a temporal window centered at the current time. That is, an above-threshold output
is ignored if it is weaker than some output in the recent
past or near future. The window length is set somewhat less
than twice the minimum duration of a keyword utterance.
Accuracy is improved when the network's outputs are
smoothed prior to application of this decision rule. The
object is to "flatten" short pulses sufficiently to keep their
peaks below threshold. Appropriate definition of the
smoothing operation depends upon the setting of the pulse
length parameter, P (Section 3.2).
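The smoothing and decision rule described above can be sketched as follows. This is a minimal NumPy illustration; the function names, the 7-point window, and the float threshold are choices made for the example, not definitions taken from the paper:

```python
import numpy as np

def smooth(outputs, k=7):
    """Smooth each output unit's sequence with a k-point
    rectangular (moving-average) window."""
    kernel = np.ones(k) / k
    return np.array([np.convolve(row, kernel, mode="same")
                     for row in outputs])

def spot_keywords(outputs, threshold=0.5, window=5):
    """Decision rule: a keyword is spotted at time t if its output
    exceeds the threshold AND is the maximum of all outputs within a
    temporal window centered at t."""
    n_units, n_frames = outputs.shape
    half = window // 2
    detections = []
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        peak = outputs[:, lo:hi].max()        # strongest output nearby
        for unit in range(n_units):
            v = outputs[unit, t]
            if v > threshold and v == peak:   # above threshold and locally maximal
                detections.append((t, unit))
    return detections
```

In use, the raw network outputs would first be smoothed (`spot_keywords(smooth(raw))`), so that short spurious pulses are flattened below threshold before the maximum test is applied.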
5. EXPERIMENTS
We have trained several variants of a hybrid neural network
(Section 2) to spot completed utterances of the English
digits. As a simplification, our preliminary experiments
have not included utterances of words outside the keyword
vocabulary. Thus it is appropriate to evaluate the sequences
of spotted digits as though they were joint decisions made
by a continuous speech recognizer.
5.1. Signal Processing
Simulated filterbank analysis of a 12.5 kHz signal produced a 15-component mel-scale power spectrum about 98
times per second [5]. Energy in frequencies below 200 Hz
was ignored.
At each time step, the three most recent spectra were
concatenated into a 45-component "long vector." Loudness
normalization was accomplished by subtracting the mean of
all components in the long vector from each component.
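The long-vector construction and loudness normalization can be sketched as follows (a minimal illustration; the function name is ours):

```python
import numpy as np

def long_vector(spectra):
    """Concatenate the three most recent 15-component spectra into a
    45-component "long vector", then subtract the mean of all its
    components from each component (loudness normalization)."""
    v = np.concatenate(spectra[-3:])   # last three spectra -> 45 components
    return v - v.mean()                # zero-mean across the long vector
```

Subtracting the overall mean removes a constant log-energy offset from every component, so an overall change in loudness leaves the normalized vector unchanged.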
5.4. Results
Random strings of digits were read by a single talker.
The digit "0" was always pronounced as "zero." The feed-forward subnet was trained with 385 strings (1540 words),
and was tested with 140 strings (560 words). The detection
threshold was set arbitrarily to the mean of possible output
values (0.5). Each unit's output sequence was smoothed
with a 7-point rectangular window (coefficients of 1/7). The
length of the decision window was about 0.2 s.
The network recognized the training and test utterances
with word-error rates of 0% and 2.5%, respectively.
6. DISCUSSION
While the accuracy we have observed compares poorly to
that of digit recognizers utilizing time alignment procedures
(e.g., hidden control neural nets [6]), the results seem quite
good for a system that makes strictly local decisions. We
emphasize that the ephemeral peak performance of the network has not been reported, and that no system parameters
have been tuned using the test set.
Recalling, however, that our primary interest is the
training strategy, the important observation is that a network with reasonably good generalization properties learned
to process the training set perfectly. The strategy should
allow for training of different types of networks, some of
which may generalize better than the type presented here.
An alternate architecture of particular interest is the
time-delay neural network (TDNN), which has been applied
to phoneme spotting [7, 8]. Transitional regions have
often been omitted from the training sets for TDNNs, with
the tacit assumption that reasonable behavior would
spontaneously emerge for transitions in the trial set. It is
possible that insertion errors could be reduced with our
approach to defining target outputs for transitions.
A critical problem, at present, is that we have no procedure for automatically aligning desired pulses with actual
pulses. Given the "two-pulse" phenomenon described in
Section 3.2, it seems unlikely that a simple dynamic programming algorithm would be adequate. Automating the
alignment of pulse centers for regions processed correctly
by the network, and leaving other regions for manual
alignment, should be straightforward, however.
We have not addressed directly the problem of training
a network to reject non-keyword utterances. Our observations strongly suggest that it will be necessary to limit the
"weight" given to non-keyword intervals. This might be
accomplished by initially training only during keyword
utterances, and then adding to the training set those intervals in which "false alarm" pulses emerge. Another possibility is to use a weighted error criterion, giving relatively
large weight to errors for keyword intervals. Also, increasing the parameter p of the error criterion has been shown to be effective in coping with low mean desired activations. But
we do not know how high one can set p without encountering numerical (or other) problems.
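As an illustration of the weighted-criterion idea, a per-frame weighted squared error might look like the following. This is our own sketch, not the paper's formulation; the function name, the keyword mask, and the kw_weight parameter are hypothetical:

```python
import numpy as np

def weighted_error(outputs, targets, keyword_mask, kw_weight=10.0):
    """Squared error in which frames inside keyword intervals receive a
    relatively large weight, limiting the influence of the far more
    numerous non-keyword frames on training."""
    w = np.where(keyword_mask, kw_weight, 1.0)   # per-frame weights
    return np.sum(w * (outputs - targets) ** 2)
```

With kw_weight much greater than 1, errors during keyword utterances dominate the criterion, so the network is not pushed toward the trivial solution of keeping every output low.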
7. CONCLUSION
The contributions of this work are a) an approach to word
spotting that does not require time alignment, b) demonstration of how to apply back-propagation to the word spotting