
Speaker direction tracking using microphones located at the vertices of an equilateral triangle

Yusuke Hioka and Nozomu Hamada

School of Integrated Design Engineering, Keio University, Yokohama, Kanagawa 223-8522, Japan. Email: hioka@hamada.sd.keio.ac.jp, hamada@sd.keio.ac.jp

Abstract. In this report, we propose an algorithm for speaker direction tracking using three microphones located at the vertices of an equilateral triangle. The method estimates the speaker direction by integrating the data derived from three different microphone pairs so that the estimation accuracy becomes uniform over all directions. Based on this algorithm, we propose a tracking method for the speaker direction. At the end of this report, we show results of simulations and real experiments that verify the effectiveness of the proposed method.

I. INTRODUCTION

With the recent rapid increase of data transmission bit rates, teleconference and remote learning systems are becoming familiar even to individual users. The speaker direction is an essential piece of information in these systems, not only for enhancing the desired speech signal but also for steering a camera toward the speaker, and several algorithms to estimate it using a microphone array have been reported. The best-known approach is Time Delay Estimation (TDE) performed by calculating the Generalized Cross Correlation (GCC)[1], and many amendments of it have been proposed[2][3]. Another method uses MUSIC[4] with the Coherent Signal Subspace[5]. In this method, the components in several frequency bands are gathered together; however, it requires a roughly estimated direction in advance, and in practice the pre-estimation error strongly affects the final estimation result. Speaker direction tracking is an important and advanced subject of direction estimation, and there are several methods for this problem as well[6]-[9]. Some of them rely on adaptive beamforming, such as LCMV[6] and GSC[7], where the speaker direction is determined from the beampattern of the adaptively calculated array weights. In this strategy, however, the computational cost becomes quite heavy because of the beampattern calculation, and the accuracy depends on the beampattern resolution. On the other hand, Kawakami et al.[8] proposed a method that does not require the beampattern calculation, by minimizing the output power of a null-steering fixed beamformer, and Suyama et al.[9] extended this method to the double-talk situation. In practical situations such as teleconferences, multiple attendees utter alternately; therefore, abrupt alternation of the current speaker direction often occurs besides gradual speaker movement. However, the applications of the methods [8], [9] are restricted to a single speaker with smooth movement.

Thus, a speaker direction tracking method is expected to cope with both gradual and abrupt movement. From the practical point of view, in addition, the accuracy should be spatially uniform over all directions, and it is preferable that the number of microphones and the array aperture be small. In this paper, we propose a method of speaker direction tracking using three microphones located at the vertices of an equilateral triangle (we call this the equilateral-triangular microphone array). The main proposals in the method are (1) a new direction estimation algorithm with uniform accuracy over all directions, and (2) a speaker direction tracking method for both gradually and abruptly moving directions. In the remainder of this report, we first explain the basic algorithm of direction estimation using the equilateral-triangular microphone array in Sec.II; the tracking algorithm is then explained in Sec.III. After presenting some simulation and experimental results, we conclude this report with some comments.

II. PRINCIPLE OF DIRECTION ESTIMATION USING THE EQUILATERAL-TRIANGULAR MICROPHONE ARRAY

A. Problem settings

We use the equilateral-triangular microphone array shown in Fig.1. A speaker in the direction θ utters a speech signal s(n). The microphones receive the signals x(n), y(n), and z(n) given by

x(n) = s(n - τ_x) + v_x(n),  y(n) = s(n - τ_y) + v_y(n),  z(n) = s(n - τ_z) + v_z(n),   (1)

under the assumption that we receive a plane wave under anechoic conditions, with the additive sensor noise signals v_x(n), v_y(n), v_z(n) modeled as spatially uncorrelated. Here, τ_m (m ∈ {x, y, z}) is the signal delay at microphone m with respect to the reference point located at the array origin O, and n is the sampling index. In the triangular configuration, we can take three pairs of microphones that have an equal distance D between microphones, and each pair faces a direction differing by 2π/3 [rad] from the others. Because a microphone pair has the highest spatial resolution toward its facing direction, we aim to realize uniform accuracy by integrating the data of the three pairs. As is well known from the GCC-based methods[1]-[3], the cross spectrum of a microphone pair contains the speaker direction information in its phase term. The proposed method therefore estimates the direction by integrating the cross spectra derived from the three different microphone pairs. Here we assume the following a) and b) for the input signal.
a) Only one speech signal is received.
b) The location of the speaker is restricted to the array plane.

B. Cross spectra at harmonics

The short-time Fourier transforms of the microphone input signals x(n), y(n), and z(n) in Fig.1 are given by

X(ω, k) = S(ω, k) e^{-jωτ_x} + V_x(ω, k),  and similarly Y(ω, k), Z(ω, k),   (2)

where S(ω, k) and V_m(ω, k) are the Fourier transforms of the speech s(n) and the noise v_m(n) (m ∈ {x, y, z}) at frame k, respectively. Here we define the cross spectra of the three microphone pairs:

G_xy(ω) = E[X(ω, k) Y*(ω, k)] = P_s(ω) e^{-jωτ_xy(θ)},  and similarly G_yz(ω), G_zx(ω),   (3)
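As a concrete illustration of Eqs.(1)-(3), the sketch below simulates one microphone pair: a wideband source delayed by τ_xy between the two microphones, whose cross-spectrum phase carries ωτ_xy. The sampling rate, spacing, and the sinusoidal pair-delay model are illustrative assumptions, not the paper's experimental values.

```python
import numpy as np

# Illustrative values only (not the paper's Table I parameters).
fs = 16000.0                     # sampling frequency [Hz]
c = 340.0                        # sound velocity [m/s]
D = 0.1                          # microphone spacing [m]
theta = np.deg2rad(40.0)         # true speaker direction
tau_xy = D / c * np.sin(theta)   # assumed pair delay for direction theta

# Build a toy wideband source in the frequency domain so that the
# fractional-sample delay of Eq.(1) is exact.
rng = np.random.default_rng(0)
n = 4096
S = np.fft.rfft(rng.standard_normal(n))
omega = 2 * np.pi * np.fft.rfftfreq(n, d=1.0 / fs)

X = S                                 # microphone x taken as time reference
Y = S * np.exp(-1j * omega * tau_xy)  # y receives the wave tau_xy later

# Cross spectrum of the pair (Eq.(3)): its phase holds omega * tau_xy.
G_xy = X * np.conj(Y)

k = 100                               # a bin where omega*tau_xy < pi (no wrap)
tau_hat = np.angle(G_xy[k]) / omega[k]
print(abs(tau_hat - tau_xy) < 1e-9)   # -> True
```

The delay appears as a pure phase ramp across frequency, which is why the phase term of the cross spectrum is the natural carrier of direction information.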

0-7803-8834-8/05/$20.00 2005 IEEE.


Fig. 1. Model of input signal to the equilateral-triangular microphone array (microphones x(n), y(n), z(n) at the vertices of an equilateral triangle with side D; the three pairs face directions 120 degrees apart; a plane wavefront arrives from direction θ)
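The geometric facts Fig.1 relies on, that the three pair distances are all equal to the side length D and that the three baselines point in directions 120 degrees apart, can be checked numerically. The coordinates below are a hypothetical layout consistent with the figure, not the paper's exact coordinates.

```python
import numpy as np

# Vertices of an equilateral triangle with side D, centered at the origin.
D = 0.1
r = D / np.sqrt(3.0)                       # circumradius of the triangle
angles = np.deg2rad([90.0, 210.0, 330.0])  # x, y, z placed 120 deg apart
pos = r * np.column_stack([np.cos(angles), np.sin(angles)])

# All three pair distances equal the side length D ...
dists = [np.linalg.norm(pos[i] - pos[j]) for i, j in [(0, 1), (1, 2), (2, 0)]]
print(np.allclose(dists, D))   # -> True

# ... and the pair baselines point in directions 120 deg apart.
baselines = [pos[j] - pos[i] for i, j in [(0, 1), (1, 2), (2, 0)]]
dirs = np.rad2deg([np.arctan2(b[1], b[0]) for b in baselines]) % 360
print(np.round(np.diff(sorted(dirs))))   # -> [120. 120.]
```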

where P_s(ω) and the expectation E[·] denote the power spectral density of s(n) and the average of the DFT over several frames, respectively, and * denotes the complex conjugate. The delay constants τ_xy(θ), τ_yz(θ), τ_zx(θ) in Eq.(3) are functions of θ given by

τ_xy(θ) = (D/c) sin θ,  τ_yz(θ) = (D/c) sin(θ - 2π/3),  τ_zx(θ) = (D/c) sin(θ + 2π/3),   (4)

where c denotes the sound velocity. Then we derive the following normalized cross spectra

Ḡ_xy(ω) = G_xy(ω)/|G_xy(ω)| = e^{-jωτ_xy(θ)},  and similarly Ḡ_yz(ω), Ḡ_zx(ω),   (5)

filtered by the whitening prefilter, which is generally called the Phase Transform (PHAT)[2]. Because the major power of a speech signal is localized at its harmonic frequencies, the SNRs at these frequencies are rather high; as a result, the harmonic elements contribute to improving the estimation accuracy. Thus, in the following process, we utilize the cross spectra at the harmonics of the fundamental frequency ω_0, i.e. ω_p = p ω_0 (p is the order of the harmonic), selected when the SNR exceeds a threshold [10]. Here the selected harmonic frequencies should be small enough to follow the spatial sampling theory[4].

C. Direction estimation by the cross spectra integration

Now let us consider the differences between the delay terms of two cross spectra for a signal propagating from a candidate direction θ̂:

τ_xy(θ̂) - τ_yz(θ̂)   (6)
τ_xy(θ̂) - τ_zx(θ̂)   (7)

Then we define the following phase rotation factors composed of the above phase compensating components:

R_yz(ω, θ̂) = e^{-jω(τ_xy(θ̂) - τ_yz(θ̂))}   (8)
R_zx(ω, θ̂) = e^{-jω(τ_xy(θ̂) - τ_zx(θ̂))}   (9)

Using these phase rotation factors, we define the following integrated cross spectrum:

G(ω, θ̂) = Ḡ_xy(ω) + R_yz(ω, θ̂) Ḡ_yz(ω) + R_zx(ω, θ̂) Ḡ_zx(ω).   (10)

Now, for the three terms in G(ω, θ̂), the following theorem¹ is satisfied.

[Theorem]
|G(ω, θ̂)| ≤ 3   (11)

The equality is satisfied if and only if θ̂ = θ. From this theorem, the direction estimation results in a problem of searching for θ̂ that satisfies the equality in Eq.(11). In the following Sec.III, we propose a method of speaker direction tracking based on this algorithm.

¹The proof of this theorem is given in the Appendix.

Fig. 2. Flow diagram of the proposed method (short-time DFT of x(n), y(n), z(n), cross spectrum computation with harmonics selection, and the update decision)

III. SPEAKER DIRECTION TRACKING

A. Steepest descent method using harmonics

First of all, we derive the following lemma from the theorem given in Eq.(11), that is

3 - |G(ω, θ̂)| ≥ 0,   (12)

where the equality is satisfied if and only if θ̂ = θ. From this lemma, we define the following non-negative performance index, which takes its global minimum at θ̂ = θ:

J(ω, θ̂) = 3 - |G(ω, θ̂)|.   (13)
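One way to check the theorem numerically is to build unit-magnitude cross spectra for a true direction, integrate them with phase rotation factors referred to the (x, y) pair, and scan candidate directions: the index 3 - |G(ω, θ̂)| vanishes only at the true direction. The delay model, frequency, and geometry below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

c, D = 340.0, 0.1
theta_true = np.deg2rad(75.0)
omega = 2 * np.pi * 400.0        # one low ("guaranteed-band") frequency [rad/s]

def pair_delays(theta):
    """Assumed delays of the three pairs facing directions 120 deg apart."""
    return D / c * np.sin(theta - np.array([0.0, 2 * np.pi / 3, -2 * np.pi / 3]))

# Unit-magnitude (whitened) cross spectra observed for the true direction.
G_bar = np.exp(-1j * omega * pair_delays(theta_true))

def J(theta_hat):
    """Sketch of the index 3 - |G(omega, theta_hat)|: zero only at theta."""
    d = pair_delays(theta_hat)
    rot = np.exp(-1j * omega * (d[0] - d))   # phase rotation factors
    return 3.0 - abs(np.sum(rot * G_bar))

grid = np.deg2rad(np.arange(0.0, 360.0, 0.5))
theta_est = grid[np.argmin([J(t) for t in grid])]
print(round(np.rad2deg(theta_est), 6))       # -> 75.0
```

At the true direction all three rotated terms align to the same unit phasor, so their sum reaches magnitude 3 and the index reaches its global minimum of zero.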

Thus, the tracking problem results in searching for θ̂ that minimizes this index. The speaker direction tracking is achieved by minimizing a performance index that consists of a combination of J(ω_p, θ̂) over the selected harmonic frequencies, and the minimization is performed by the steepest descent method. As stated in Sec.I, we aim at tracking an abruptly moving speaker direction by avoiding convergence to local minima in the adaptation. To realize this, the adaptation process takes two steps: the first step aims at global convergence, and the second step improves the estimation accuracy. The two steps are switched according to the state of convergence. In Phase 1, we select harmonics ω_p at which J(ω_p, θ̂) provides global convergence in the steepest descent algorithm; the performance index in this phase is the sum of the selected J(ω_p, θ̂). In contrast, in Phase 2, the weighted sum of all J(ω_p, θ̂) over the selected harmonics is used as the performance index. We show the flow diagrams of the respective steps in Fig.2.
1) [Phase 1] Steepest descent method and harmonics selection: Fig.3 shows the profile of J(ω, θ̂) at different frequencies, and Fig.4 shows a simulation result on the number of local minima in J for all θ and ω. We find a unique local minimum in the lower band, which verifies that global convergence is guaranteed there (we call this band the guaranteed band). In contrast, J(ω, θ̂)


steeply decreases as ω goes higher (even though that is the non-guaranteed band), so that the convergence speed is faster there. Motivated by these features of J, we propose the following recursive method to obtain the optimal θ̂. We start updating θ̂ using the performance index within the guaranteed band; then we use the index in the non-guaranteed band by switching the set of selected harmonics according to the convergence rate. Now we update θ̂ by

θ̂(i+1) = θ̂(i) - (μ/|Ω_i|) Σ_{p∈Ω_i} ∂J(ω_p, θ̂(i))/∂θ̂,   (14)

where i and μ are the iteration index and the step-size parameter, respectively. Since the gradient is nearly proportional to the number of selected harmonics, we normalize it by |Ω_i|, where Ω_i is the index set utilized for the update; Ω_i is modified by the following rules.

[Initial setting] For the initial set Ω_0, we use the harmonic frequencies in the guaranteed band:

Ω_0 = { p | ω_p < ω_G },   (15)

where ω_G should be less than the upper limit of the guaranteed band.

Fig. 3. Performance index J(ω, θ̂) at (a) the guaranteed band and (b) the non-guaranteed band

Fig. 4. Number of local minima in the performance index J(ω, θ̂); a unique minimum appears below the normalized frequency ω/ω_max of about 0.278 (the guaranteed band)
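The Phase 1 update of Eq.(14) can be sketched with a numerical gradient summed over a few assumed guaranteed-band harmonics and normalized by their number. The frequencies, geometry, and step size below are guesses for illustration, not the paper's values.

```python
import numpy as np

c, D = 340.0, 0.1
theta_true = np.deg2rad(-30.0)
# A few assumed low-frequency "guaranteed-band" harmonics [rad/s].
omegas = 2 * np.pi * np.array([150.0, 250.0, 350.0])

def pair_delays(theta):
    return D / c * np.sin(theta - np.array([0.0, 2 * np.pi / 3, -2 * np.pi / 3]))

def J(omega, theta_hat):
    G_bar = np.exp(-1j * omega * pair_delays(theta_true))   # measured, whitened
    d = pair_delays(theta_hat)
    return 3.0 - abs(np.sum(np.exp(-1j * omega * (d[0] - d)) * G_bar))

# Steepest descent in the spirit of Eq.(14): the gradient summed over the
# selected harmonics, normalized by their number; mu is a guessed step size.
theta_hat, mu, eps = np.deg2rad(60.0), 0.3, 1e-6
for _ in range(400):
    g = sum((J(w, theta_hat + eps) - J(w, theta_hat - eps)) / (2 * eps)
            for w in omegas)
    theta_hat -= mu / len(omegas) * g

print(round(np.rad2deg(theta_hat), 2))   # converges near -30.0
```

Restricting the sum to low frequencies keeps the landscape unimodal, which is exactly the role of the guaranteed band in Phase 1.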

[Update of Ω_i] If the conditions (17) and (18) are simultaneously satisfied at iteration i, we update Ω_i as

(16)

(17)

(18)

where Ω̄_i is the complement of Ω_i, and max/min denote the maximum/minimum elements of the respective index sets.

[Termination of Phase 1] Phase 1 is terminated if the following condition is satisfied:

(19)

2) [Phase 2] Weighted steepest descent method based on SNR: In Phase 2, we adopt the following weighted steepest descent method:

θ̂(i+1) = θ̂(i) - μ Σ_p w_p ∂J(ω_p, θ̂(i))/∂θ̂.   (20)

The weight w_p is given by

w_p = γ_p / Σ_q γ_q,   (21)

where γ_p is the SNR at ω_p. This weighting aims to improve the accuracy of the final estimation result.

IV. RESULTS OF SIMULATION AND EXPERIMENTS

To verify the effectiveness of the proposed method, we performed experiments in a real acoustic environment (a conference room with the size of W × D × H [m]). For the parameters of the proposed method, we used the values given in Tab.I, and the results are quantitatively evaluated by the deviation of estimation error (DEE)[10].

TABLE I
PARAMETERS FOR EXPERIMENT
Sampling Frequency [Hz]
Wave Velocity [m/s]
Microphone Distance [m]
SNR Threshold [dB]
Window: Hamming
FFT point [samples]
Frame Length [samples]
Frame Overlap
Data Length [ms]

At first, we examined the uniformity of the estimation accuracy and the global convergence using real phoneme data (/a/, /e/, /i/, /o/, /u/) uttered by male and female subjects as the source signal. The results in Fig.5 reveal that the proposed method sufficiently achieves uniform accuracy; furthermore, a non-zero elevation angle, which occasionally occurs in practice, has little influence on the estimation accuracy. In Fig.6 and Fig.7, we compare the adaptation profiles of the conventional method[8] and the proposed method. The proposed method succeeds in converging to the global optimum wherever the initial parameter is set, while the conventional method converges to the nearest local optimum. Finally, in Fig.8 and Fig.9, we show examples of tracking results for gradually and abruptly moving speaker directions, respectively. These satisfactory results confirm the ability of the method in practical use.

V. CONCLUSION

In this report, we have proposed a method of speaker direction tracking using the equilateral-triangular microphone array.


Fig. 5. DEEs for omni-direction

Fig. 6. Adaptation profile of the conventional method

Fig. 7. Adaptation profile of the proposed method

Fig. 8. Example of tracking a gradually moving speaker direction (a male speaker moves from one direction to another and back again; estimated angle [deg] versus time [sec], over about 25 s)

The method realizes omni-directional uniformity of estimation accuracy by the integrated use of the microphone pairs in the equilateral-triangular microphone array. Through the experiments, the method was verified to track both abruptly and gradually changing speaker directions. As a future subject, the tracking problem for multiple simultaneous speakers should be considered.

ACKNOWLEDGMENT

The authors kindly thank our colleagues, especially Mr. Matsuo, for his devotion in carrying out the experiments. This work is supported in part by a Grant-in-Aid for the 21st Century Center Of Excellence for Optical and Electronic Device Technology for Access Network from the Ministry of Education, Culture, Sports, Science, and Technology in Japan.

REFERENCES
[1] C.H. Knapp and G.C. Carter, "The Generalized Correlation Method for Estimation of Time Delay," IEEE Trans. ASSP, Vol.ASSP-24, No.4, pp.320-327, Aug. 1976.
[2] M. Omologo and P. Svaizer, "Use of the Crosspower-Spectrum Phase in Acoustic Event Location," IEEE Trans. SAP, Vol.5, No.3, pp.288-292, May 1997.
[3] M.S. Brandstein, "Time-delay estimation of reverberated speech exploiting harmonic structure," J. Acoust. Soc. America, Vol.105, No.5, pp.2914-2919, May 1999.
[4] D.H. Johnson and D.E. Dudgeon, Array Signal Processing, PTR Prentice Hall, 1993.
[5] H. Wang and M. Kaveh, "Coherent Signal-Subspace Processing for the Detection and Estimation of Angles of Arrival of Multiple Wide-Band Sources," IEEE Trans. ASSP, Vol.ASSP-33, No.4, Aug. 1985.
[6] G. Nokas and E. Dermatas, "Speaker Tracking for Hands-Free Continuous Speech Recognition in Noise Based on a Spectrum-Entropy Beamforming Method," IEICE Trans. Inf. & Syst., Vol.E86-D, No.4, pp.755-758, Apr. 2003.
[7] Y. Nagata and M. Abe, "Two-Channel Adaptive Microphone Array with Target Tracking," IEICE Trans. Fundamentals, Vol.J82-A, No.6, pp.860-866, Jun. 1999. (in Japanese)
[8] H. Kawakami, M. Abe, and M. Kawamata, "A Two-Channel Microphone Array with Adaptive Target Tracking Using Frequency Domain Generalized Sidelobe Cancellers," IEEE Int. Symp. on Intelligent Signal Processing & Communication Systems, pp.291-296, Nov. 2002.

Fig. 9. Example of tracking abruptly alternating speaker directions (5 speakers [A]-[E] surrounding the microphone array speak alternately in the order [A], [B], [C], [D], [E]; dash-dot lines: beginning of a sentence; dashed lines: end of a sentence)

[9] K. Suyama and T. Tasaki, "A Study on Target Talker Tracking via Two Microphones," Proc. 18th IEICE DSP Symposium, A5-6, Nov. 2003. (in Japanese)
[10] Y. Hioka and N. Hamada, "DOA Estimation of Speech Signal Using Microphones Located at Vertices of Equilateral Triangle," IEICE Trans. Fundamentals, Vol.E87-A, No.3, pp.559-566, Mar. 2004.

APPENDIX

Because the magnitudes of both the phase rotation factors and the normalized (prefiltered) cross spectra are unity, the following relation holds:

|G(ω, θ̂)| = |Ḡ_xy(ω) + R_yz(ω, θ̂) Ḡ_yz(ω) + R_zx(ω, θ̂) Ḡ_zx(ω)|   (22)
≤ |Ḡ_xy(ω)| + |R_yz(ω, θ̂) Ḡ_yz(ω)| + |R_zx(ω, θ̂) Ḡ_zx(ω)|   (23)
= 3.   (24)

The equality between Eq.(22) and Eq.(23) is satisfied if and only if the three complex terms are equal. Checking the arguments of the three terms pairwise shows that they coincide only at θ̂ = θ; therefore the theorem in Sec.II-C holds.
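The chain (22)-(24) is the triangle inequality applied to three unit-magnitude complex terms. A quick numerical check of the bound and of its equality case, independent of the array model:

```python
import numpy as np

# Random phase triples never push |sum| above 3 ...
rng = np.random.default_rng(1)
phases = rng.uniform(-np.pi, np.pi, size=(10000, 3))
mags = np.abs(np.exp(1j * phases).sum(axis=1))
print(bool(mags.max() <= 3.0 + 1e-12))        # -> True

# ... and equality is reached exactly when the three arguments coincide.
aligned = abs(np.exp(1j * np.full(3, 0.7)).sum())
print(bool(np.isclose(aligned, 3.0)))         # -> True
```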

