USING A REAL-TIME, TRACKING MICROPHONE ARRAY AS INPUT TO AN HMM SPEECH RECOGNIZER

Tadd B. Hughes, Hong-Seok Kim, Joseph H. DiBiase, and Harvey F. Silverman

Laboratory for Engineering Man/Machine Systems (LEMS)
Division of Engineering, Brown University
Providence, RI 02912

ABSTRACT
A major problem for speech recognition systems is relieving the talker of the need to use a close-talking, head-mounted or a desk-stand microphone. A likely solution is the use of an array of microphones that can steer itself to the talker and can use a beamforming algorithm to overcome the reduced signal-to-noise ratio due to room acoustics. This paper reports results for a tracking, real-time microphone array as an input to an HMM-based connected alpha-digits speech recognizer. For a talker in the very near field of the array (within a meter), performance approaches that of a close-talking microphone input device. The effects of both the noise-reducing steered array and the use of a maximum a posteriori (MAP) training step are shown to be significant. Here, the array system and the recognizer are described, experiments are presented, and the implications of combining these two systems are discussed.
1. INTRODUCTION

Microphone-array systems have potential as a replacement for inconvenient conventional microphones, such as head-mounted, close-talking microphones or desk-stand microphones, for speech recognition applications. However, only a few attempts have been made to date to use a real-time microphone-array system for this purpose. Most early microphone-array work has been done off-line using a fixed talker with fixed beamforming [1, 2, 3], resulting in performance that has, in general, been significantly poorer than that of using a close-talking microphone. While a remote microphone system is less obtrusive, it suffers from degradation due to reverberant room acoustic artifacts and background noise. Remote microphone-array systems, however, are designed to overcome these degradations. These effects are simply not significant when the talker is within a few inches of a microphone.

The novelty in this paper is the combination of a real-time, tracking microphone-array system with a modern HMM-based recognizer for the difficult English alpha-digits vocabulary. Though similar results have been shown by Yamada, Nakamura and Shikano [4, 5], the system here can be distinguished by its use of very near field, three-dimensional beamforming and an incremental-training method [6].

Two separate experiments have been performed. The first shows the recognition performance of a Baseline HMM Model incrementally retrained with microphone array data. Here, an HMM model was initially trained from a large dataset collected from talkers using a close-talking microphone for an input device. The model was then incrementally retrained with a relatively small dataset, 4500 word tokens, taken from talkers using the beamformed, microphone-array signal.

The second experiment compares, through recognition performance of the Baseline Model, the microphone array, a single remote microphone and a close-talking microphone as input devices. For this experiment an additional dataset was collected through the use of the same apparatus and testing procedure as the first, except the close-talking microphone was replaced by a single remote microphone (of the same type as the microphones used in the array) centered in the array.

The sections following briefly describe the microphone-array system, the LEMS recognition system and properties of the working dataset. Next, the experiments are described and the results are presented. Finally, there is a discussion of the implications that follow.

This work was funded by NSF grants MIP-9120813, MIP-9314625, and MIP-9509505.

2. THE BROWN MEGAMIKE II
A by-product of a Huge Microphone Array (HMA) project [7] is a circuit board that may be populated and packaged as a stand-alone, 16-channel microphone-array system called the Brown Megamike II. The system interfaces to a personal computer as shown in Figure 1.

[Figure 1: Block Diagram of the Apparatus]

The microphone array itself consists of 16 pressure gradient microphone elements set in rubber grommets on hardware cloth about one inch in front of a rectangular piece of acoustic foam at a 45 degree angle (see Figure 2).
The array is placed between the monitor and the keyboard of a SUN workstation.

[Figure 2: Microphone Array Showing Delay Estimation Pairings]

All location estimation and beamforming is done by a microprocessor in the microphone-array electronics. While there is a capability for transmitting both audio and auxiliary data (e.g., pointing location) digitally, in this work the beamformed signal was passed through a DAC system and the analog signal was resampled using the recording capabilities of the SUN workstation.

Table 1: Statistics of the Baseline dataset

dataset    # talkers (female / male / total)    # utterances (female / male / total)
train           30 /  50 /  80                      1328 / 2156 / 3484
adjust           9 /  11 /  20                       243 /  370 /  613
test             8 /  12 /  20                       234 /  361 /  595
total           47 /  73 / 120                      1805 / 2887 / 4692

The processing used in these experiments is as follows (a short code sketch of the localization and delay-and-sum steps appears after the list):
- Signals from all 16 microphone channels are sampled simultaneously at 20 ks/s to make frames of 1024 points per channel.
- The window which is applied to each frame is essentially rectangular, but has 100 points of cosine taper and 100 points of zero padding at each end. This gives 41.2 ms of nonzero data.
- Sixteen 1024-point real DFTs are taken. On the ADSP-21020 microprocessor, these are quite fast, requiring only about 365 µs each.
- Fifteen time delays for the differences in the time-of-arrival of the talker signals in pairs of microphones (see Figure 2) are estimated by applying a weighted linear fitting and unwrapping algorithm to the phase difference in the frequency domain [8].
- Three-dimensional bearing estimates are made for each quad (see Figure 2).
- For practical reasons, the array is considered to be two fully independent sub-arrays, each having three quads and, therefore, three spatial bearings. An estimate of the source location is made from the close-crossing points of the bearings of the two sub-arrays. (A close-crossing point is the midpoint of the line representing the shortest distance between two bearing lines.)
- When the time-delay estimates are all considered good (a tight cloud), then the beamformer time delays are updated for the current frame. In those cases where the estimates are not deemed good, the time-delay parameters for the beamformer remain unchanged.
- The array is pointed by multiplying each signal in the frequency domain by its appropriate phase shift.
- All the delayed signals are added across microphones, including the unshifted signal from the reference microphone. This is essentially unweighted delay-and-sum (DS) beamforming.
- The IDFT is taken and overlap-add is used to assemble the output.
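To make the localization and pointing steps concrete, the following is a minimal Python/NumPy sketch of (a) the close-crossing point of two bearing lines and (b) one frame of frequency-domain delay-and-sum, using the frame length, taper and sample rate given above. The function and variable names are illustrative; this is not the Megamike firmware, which runs on the ADSP-21020.

import numpy as np

FS = 20000      # sample rate: 20 ks/s
FRAME = 1024    # frame length and DFT size per channel
TAPER = 100     # cosine-taper points at each end of the frame
PAD = 100       # zero-padding points at each end of the frame

def frame_window():
    # Essentially rectangular window: zero pads, cosine tapers, flat middle
    # (824 nonzero points = 41.2 ms at 20 ks/s).
    w = np.zeros(FRAME)
    ramp = 0.5 * (1.0 - np.cos(np.pi * np.arange(TAPER) / TAPER))
    w[PAD:PAD + TAPER] = ramp
    w[PAD + TAPER:FRAME - PAD - TAPER] = 1.0
    w[FRAME - PAD - TAPER:FRAME - PAD] = ramp[::-1]
    return w

def close_crossing_point(p1, d1, p2, d2):
    # Midpoint of the shortest segment between bearing lines p1 + t*d1 and p2 + s*d2.
    w0 = p1 - p2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b
    if abs(denom) < 1e-12:        # bearings nearly parallel: no usable crossing
        return None
    t = (b * e - c * d) / denom
    s = (a * e - b * d) / denom
    return 0.5 * ((p1 + t * d1) + (p2 + s * d2))

def delay_and_sum_frame(frames, delays, ref=0):
    # frames: (16, 1024) array of time-domain samples, one row per microphone.
    # delays: (16,) steering delays in samples; the reference microphone's delay is 0.
    spectra = np.fft.rfft(frames * frame_window(), n=FRAME, axis=1)   # 16 real DFTs
    k = np.arange(spectra.shape[1])
    # Pointing the array: a delay of d samples is the phase shift exp(-j*2*pi*k*d/N).
    steered = spectra * np.exp(-2j * np.pi * np.outer(delays, k) / FRAME)
    steered[ref] = spectra[ref]            # reference-microphone signal stays unshifted
    beam = steered.sum(axis=0)             # unweighted delay-and-sum
    return np.fft.irfft(beam, n=FRAME)     # the caller overlap-adds successive frames

In the real system the steering delays come from the phase-difference fitting of [8] and are refreshed only when the time-delay estimates form a tight cloud, as described above.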
3. THE LEMS RECOGNIZER
The LEMS recognizer, developed over the last 18 years, is an HMM-based system that has been described more fully in [9]. Its purpose is to allow the study of better signal processing and feature analysis methods for speech recognition. For this purpose, the English alphabet plus digits vocabulary is a good choice. To these 36 words are added two control words, period and space; thus, results are gauged on a 38-word vocabulary. A tied-mixture, explicit-duration HMM is used for the recognition system. No language model is used.

Three feature sets are generated and treated as being independent; these are (1) 12 mean-corrected low-order cepstral coefficients, (2) their 12 first differences in time, and (3) a set of two features: overall energy and its first difference. The DFT-based signal processing has been optimized in earlier work [10], in which the log-magnitudes of the DFT are low-pass filtered (liftered), nonlinearly sampled, and transformed to obtain the real cepstral coefficients.
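As a rough illustration of this front end, the sketch below (Python/NumPy) obtains low-order real cepstral coefficients by low-pass filtering (liftering) the log-magnitude DFT of a frame, then applies mean correction and first differences. The lifter shape, the nonlinear frequency sampling and the optimized settings of [10] are not reproduced here, so the smoothing kernel and coefficient count are illustrative assumptions.

import numpy as np

N_CEPSTRA = 12    # low-order cepstral coefficients kept per frame

def cepstra_from_frame(frame, lifter_len=9):
    # Log-magnitude DFT -> low-pass "liftering" -> low-order real cepstrum.
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
    kernel = np.ones(lifter_len) / lifter_len          # crude stand-in for the optimized lifter
    smoothed = np.convolve(log_mag, kernel, mode="same")
    cep = np.fft.irfft(smoothed)                       # inverse transform of the log spectrum
    return cep[1:N_CEPSTRA + 1]                        # drop c0; energy is a separate feature

def mean_corrected(cepstra):
    # Per-utterance cepstral mean subtraction ("mean-corrected" coefficients).
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def first_differences(features):
    # First differences in time, used as the second and third feature streams.
    return np.diff(features, axis=0, prepend=features[:1])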
Word models are used in the LEMS recognizer. These average about five states; words like w have several more, and words like e may have as few as four states. The models are very simple, as there are no self-loops, since explicit duration modeling eliminates this need. Skipping states is also not allowed, as experiments have shown that allowing states to be skipped generally degrades performance for the alpha-digits task with speech data obtained from talkers reading orthographic strings. Some 256 Gaussians are used for each feature space, with unique mixture coefficients for each state in each model. In training, a discrete, explicit-duration model is trained first after clustering each feature-set vector into a codebook of 256 classes. The training from the discrete system is used to initialize the training for the tied-mixture system.
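Because every state of every word model draws on the same 256 Gaussians per feature stream and contributes only its own mixture weights, a state's observation likelihood has a compact form. The sketch below assumes diagonal-covariance Gaussians and illustrative array shapes; it shows the tied-mixture idea rather than the LEMS implementation.

import numpy as np

N_GAUSS = 256    # Gaussians shared by all states, per feature stream

def shared_gaussian_likelihoods(x, means, variances):
    # Likelihood of one feature vector under each of the 256 shared
    # diagonal-covariance Gaussians; means and variances have shape (256, D).
    diff = x - means
    log_p = -0.5 * np.sum(diff * diff / variances + np.log(2.0 * np.pi * variances), axis=1)
    return np.exp(log_p)

def state_likelihood(streams, state_weights):
    # streams:       one (x, means, variances) tuple per independent feature stream
    #                (cepstra, delta-cepstra, energy plus delta-energy).
    # state_weights: one (256,) mixture-weight vector per stream, unique to this state.
    # The streams are treated as independent, so their likelihoods multiply.
    p = 1.0
    for (x, means, variances), w in zip(streams, state_weights):
        p *= float(w @ shared_gaussian_likelihoods(x, means, variances))
    return p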
4. DATABASES
The largest dataset used was taken several years ago from talkers using a close-talking microphone who uttered connected strings, each of about twelve alpha-digits, lasting about four seconds. One hundred twenty talkers were recorded, each giving three sessions of 15 strings each. This totaled about 8 hours of speech data. From this, about five hours, or 3600 utterances, from 80 talkers, both male and female, have been selected for training. Twenty more talkers are considered the "tweaking" set, and 20 more have been used exclusively for testing. The important statistics for this dataset are given in Table 1. All talkers have standard American dialects, with some bias due to our location in New England. This dataset is known as the Baseline dataset.

For the experiments presented here, three new speech databases were collected from 15 male talkers and 15 female talkers who were all native speakers of American English. Table 2 gives the statistics of the new datasets.
 
[Table 2: New Simultaneously Recorded Datasets (for each Session): training (Session A), testing (Session A), and testing (Session B), each with 5 female and 5 male talkers.]
Ideally, the second experiment would require three simultaneously recorded datasets: one from the close-talking headset, one from the microphone array, and one from the single remote microphone. Unfortunately, the current apparatus limited concurrent recording to two channels (stereo). In order to still make a fair comparison, one set of simultaneous recordings was made with the close-talking headset and the microphone array (Session A). Recently (some eight months later), a second set of simultaneous recordings was made (Session B). One channel took data directly from a single pressure gradient electret element mounted in the center of the array, while the other channel took the analog array output. Several of the talkers were the same as in Session A, but the room environment had changed somewhat over the interim. The new environment had more background noise in the form of "screaming" monitors and noisy disk drives. Listening to the data, one can hear a distinct improvement in the signal-to-noise ratio of the array signal relative to that of the single microphone.
5. INCREMENTAL MAXIMUM A POSTERIORI (MAP) TRAINING FOR HMM PARAMETERS
The learning technique used to adapt the recognition model is a variation on the recursive Bayes approach for performing sequential estimation of model parameters [11] given incremental data [12]. In the case of missing-data problems (e.g., HMMs), the Expectation-Maximization (EM) algorithm can be used to provide an iterative solution for estimation of the MAP parameters. The training algorithm that incorporates the recursive Bayes approach with the incremental EM method [12] (i.e., randomly selecting a subset of data from the training set and immediately applying the updated model) is fully described in [13] and partly in [14], and thus is not repeated here. However, it has been shown that incremental training can quickly and efficiently adapt an HMM system with a minimal amount of new data. At each update of the iterative training, there is a corresponding sequence of posterior parameters which act as the memory for previously observed data. Here, the Baseline Model is used as the initial prior information.
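The full procedure is given in [13]; as a flavor of what one incremental update does, here is a minimal sketch for the mixture weights of a single state, assuming a Dirichlet prior whose accumulated pseudo-counts serve as the memory of previously observed data. The choice of prior and the restriction to mixture weights are illustrative assumptions, not the algorithm of [13].

import numpy as np

def incremental_map_weight_update(prior_counts, new_occupancies):
    # prior_counts:    (256,) Dirichlet pseudo-counts carried over from earlier updates
    #                  (initially derived from the Baseline Model).
    # new_occupancies: (256,) expected mixture occupancies from an E-step run with the
    #                  current model on a randomly selected subset of the adaptation data.
    # Returns the MAP mixture weights and the updated counts, which become the prior
    # ("memory") for the next incremental update.
    counts = prior_counts + new_occupancies
    return counts / counts.sum(), counts

The updated model is applied immediately, another random subset is drawn, and the step repeats; this is what allows a dataset as small as 4500 word tokens to adapt the Baseline Model.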
6. EXPERIMENTS
The large LEMS alpha-digits speech dataset described above was used to develop a Baseline HMM Model. This model became the initial prior information for MAP training. The training subsets from Session A were used exclusively to generate two new recognition models: a new close-talking microphone model (CTMM), and a new microphone-array model (MAM). The new recognizers were developed by applying incremental MAP training with priors from the Baseline Model.

The objective of these experiments was to determine the impact of real-time, near-field beamforming (Sessions A and B) and of MAP training (Session A) on a complex speech recognition task. For the new data recorded in Session A, talkers properly wore a head-mounted microphone and were seated in an office-type chair that swivels and is on casters, in front of the array/workstation apparatus. They were asked to read strings from a predefined, balanced vocabulary displayed in large type on the workstation screen, with no instruction given about posture or movement. However, the screen-reading task itself likely reduced movement substantially. The two recordings were not exactly simultaneous, since the microphone-array speech data has a latency of about 100 ms, which was estimated and corrected in every recording. The talkers for Session B were seated in the same position and used the same procedure and apparatus except for the replacement of the close-talking microphone by a single remote microphone.

Results for Session A are presented in Table 3. The performance of the Baseline Model recognizing the close-talking microphone (Session A) data is better than that of the same model on the Baseline dataset. Considering that the two test datasets were different, these performances are consistent. The results for the CTMM and the MAM set a new standard. The CTMM had a 94.5% performance rate on the close-talking test data (Session A), an increase of 0.5% over the Baseline Model on the same data. The MAM's 90.9% performance on the array test data (Session A) represents an increase of nearly 8% over the Baseline Model on the same data. One may conclude that incremental MAP training has significantly improved the HMM model.

[Table 3: Recognition Performance for Session A (% Correct). Baseline Model: 94.0 on close-talking data, 83.2 on microphone-array data; CTMM: 94.5 on close-talking data; MAM: 90.9 on microphone-array data.]

The results for Session B were obtained using the Baseline Model exclusively and are given in Table 4. The raw score for the microphone array is 5.4% lower than the Baseline score in Session A. The reason for this can be found by investigating the details of how these scores were computed: the number of substitutions, insertions and deletions were summed and then divided by the total number of words spoken (a short sketch of this computation follows).
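As a reference for that computation, here is a small sketch: each hypothesis is aligned to its reference string by minimum edit distance, and the summed substitutions, insertions and deletions are divided by the total number of words spoken. The function names are illustrative.

def word_error_counts(reference, hypothesis):
    # Minimum number of substitutions + insertions + deletions aligning the two word lists.
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                   # all deletions
    for j in range(m + 1):
        d[0][j] = j                   # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m]

def error_rate(ref_strings, hyp_strings):
    # Summed errors over all utterances divided by the total number of words spoken.
    errors = sum(word_error_counts(r, h) for r, h in zip(ref_strings, hyp_strings))
    total_words = sum(len(r) for r in ref_strings)
    return errors / total_words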
[Table 4: Recognition Performance for Session B (% Correct): Baseline Model on the single remote microphone and on the microphone array.]

It was noticed that a higher number of insertions was being made in Session B than in Session A, while the number of substitutions roughly stayed the same. Furthermore, 57.0% of the insertions in Session B were occurring at the beginning of each utterance, compared with only 11.6% for Session A. It was also noticed that there were audible artifacts at the beginning of many of the recordings made during Session B. These artifacts were due to the microphone array initially mis-aiming, possibly to a noise source, then adapting and steering correctly to the talker. In the eight-month interval between Sessions A and B, background noise in the recording environment increased substantially and was