
Multi-Speaker Localization by Central and Lateral Microphone Arrays Based on the Combination of 2D-SRP and Subband GEVD Algorithms
Ali Dehghan Firoozabadi1,*, Pablo Irarrázaval2,3,4, Pablo Adasme5, David Zabala-Blanco6, Pablo Palacios-Játiva7,8, Hugo Durney1, Miguel Sanhueza Olave1, Cesar Azurdia7
2022 8th International Conference on Signal Processing and Communication (ICSC) | 978-1-6654-5430-8/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICSC56524.2022.10009454

1Department of Electricity, Universidad Tecnológica Metropolitana, Av. José Pedro Alessandri 1242, Santiago 7800002, Chile
2Electrical Engineering Department, Pontificia Universidad Católica de Chile, Santiago 7820436, Chile
3Biomedical Imaging Center, Pontificia Universidad Católica de Chile, Santiago 7820436, Chile
4Institute for Biological and Medical Engineering, Pontificia Universidad Católica de Chile, Santiago 7820436, Chile

5Electrical Engineering Department, Universidad de Santiago de Chile, Av. Victor Jara 3519, Santiago 9170124, Chile
6Department of Computer Science and Industry, Universidad Católica del Maule, Talca 3466706, Chile
7Department of Electrical Engineering, Universidad de Chile, Santiago 8370451, Chile
8Escuela de Informática y Telecomunicaciones, Universidad Diego Portales, Santiago 8370190, Chile

Tel.: +56 (2) 2787 7117; E-mail: adehghanfirouzabadi@utem.cl, ORCID: 0000-0002-6391-6863

Abstract: In this paper, a novel speech source localization (SSL) algorithm that uses central and lateral microphone arrays is proposed, offering low complexity and high precision in adverse acoustical environments. First, a proposed star-shaped microphone array (SSMA) at the room's center is combined with an improved 2D steered response power (2D-SRP) algorithm to determine the speakers' directions, where the PENS algorithm is used for speaker counting. Then, the two cross-shaped microphone arrays (CSMAs) on the walls that are closest to each speaker are selected; these arrays provide the required information with fewer microphones. Each CSMA is combined with the subband generalized eigenvalue decomposition (SB-GEVD) method over successive time frames to estimate the two other directions of the speaker. The 3D speaker positions are then calculated from the intersection of the directions estimated by the SSMA and the two CSMAs. The proposed lateral-central microphone array multiple sound source localization method (LCMA-MSSL) is compared with other methods and shows superior accuracy in noisy and reverberant scenarios.

Keywords: Speaker localization, microphone array, adaptive processing, eigenvalue decomposition, time delay estimation.

I. INTRODUCTION
Localizing multiple simultaneous speech sources in adverse environments is a challenge in the implementation of signal processing systems. SSL algorithms are normally used as a pre-processing step for other speech processing applications such as speech enhancement [1], speaker tracking [2], and speech recognition [3]. The microphone array (MA) is a suitable instrument for increasing the precision of SSL algorithms in applications such as hearing aids [4], videoconferencing [5], smart robots [6], and acoustical speech processing rooms. Learning-based algorithms [7] localize the speakers more accurately, but their high computational complexity makes them impractical in real conditions. In the last decade, various SSL algorithms have been introduced, which are categorized into time-frequency (TF) signal processing [8] and spatial spectrum analysis [9]. Another category of SSL algorithms calculates the steered response power (SRP) for candidate areas in the recording space, exploiting the spatial diversity of the acoustical environment to evaluate the probability that a sound source exists at each position.
Cross-correlation-based methods are one of the important categories of two-step SSL algorithms [10]. In contrast, SRP-based methods are used when a large number of microphones is available [11]. Stefanakis et al. presented the perpendicular cross-spectra fusion (PCSF) method in 2017 as a robust algorithm for speaker localization [12] that selects the mathematical formulas for estimating the speakers' positions. The localization subsystems are implemented in parallel to provide candidate direction of arrival (DOA) points in each TF bin. The authors proposed a divergence feature computed over various DOA estimators to verify the reliability of the estimated locations. The calculated DOAs are the input to a classification process based on a match criterion. Chu et al. in 2021 proposed sound field morphological component analysis (SF-MCA) with the alternating direction method of multipliers (ADMM) for multiple SSL [13]. In the first step, the Helmholtz equation is used to analyze the speech components. In the next step, plane-wave and Green's functions are used to represent a sparse model of the speech signal, where the decomposition parameters are extracted with the ADMM method. An et al. in 2022 presented a Monte Carlo localization method that uses diffraction and reflection features (DR-MCL) [14]. The method estimates the propagation path from the diffraction and reflection of sound signals by means of backward ray tracing. To estimate diffraction along the propagation path, ray tracing is combined with the uniform theory of diffraction model, based on the features around the obstacles. The 3D structure of the obstacles is reconstructed in the precomputation phase, and the obtained information is used to produce the direct, reflection, and diffraction sound rays in the real-time phase.
In this research, a novel 3D multi-source localization system is proposed based on central and lateral MAs, using the 2D-SRP and subband generalized eigenvalue decomposition (SB-GEVD) methods in adverse environments. In the first step, a proposed central star-shaped microphone array (SSMA) is used to estimate the directions of the speakers. The information obtained by this SSMA


Authorized licensed use limited to: Universidad Tecnologica Metropolitana (UTEM). Downloaded on January 12,2023 at 22:01:55 UTC from IEEE Xplore. Restrictions apply.
is used as an input for the 2D-SRP algorithm. Implementing the SRP function in 2D reduces the computational complexity. The peak positions of the 2D-SRP function, which are the speakers' directions, are extracted by the PENS algorithm [15]. The two lateral cross-shaped arrays (CSMAs) closest to each speaker are selected based on the speakers' directions. Next, the SB-GEVD algorithm is implemented adaptively on each CSMA over successive time frames. Finally, the location of each speaker is calculated from the intersection between the DOA of the central MA and the DOAs of the lateral MAs. This algorithm is repeated for all speakers to calculate each speaker's 3D location.
Section II presents the microphone signal model. The central SSMA and lateral CSMAs are presented in Section III. Section IV describes the proposed multi-speaker SSL method, which combines the SB-GEVD method with the 2D-SRP algorithm. Section V presents the results and discussion of the simulations for multiple speakers. Finally, Section VI gives some conclusions on the proposed SSL algorithm.

II. MICROPHONE SIGNAL MODEL IN ACOUSTICAL ENVIRONMENT
Acoustic sensors, especially MAs, are the basis of SSL algorithms. Since SSL methods are implemented first on simulated data and then on real signals, the model selected for simulations should be as similar as possible to real environments. The real model is used for the microphone signal simulations and is expressed as:

$$x_m(t) = \sum_{l=1}^{L} s_l(t) * \gamma(d_{m,l}, t) + v_m(t), \quad m = 1, \ldots, M \tag{1}$$

where in Eq. (1), $x_m(t)$ is the signal at the m-th microphone, $s_l(t)$ is the signal of the l-th speech source, M is the number of microphones, L is the number of speech sources, $\gamma(d_{m,l}, t)$ is the room impulse response (RIR) between the l-th speech source and the m-th microphone, $v_m(t)$ is white noise, and $*$ is the convolution operator. Because the model includes the noise and reverberation effects, the simulated signals are highly similar to speech captured in real scenarios.

III. THE PROPOSED STAR AND CROSS-SHAPED MICROPHONE ARRAYS
Noise and reverberation in acoustical environments decrease the accuracy of localization algorithms. The MA is therefore a suitable instrument for increasing the accuracy of localization algorithms, because its spatial layout provides extra information. Various MA structures have been proposed [16] for different speech processing algorithms. In this paper, a combination of central and lateral MAs with the SSL algorithms is proposed to increase the estimation accuracy and to cover a wide range of acoustical environments. The proposed MA layout provides the algorithms with the required information through a sufficient number of microphone pairs. In the first step, an SSMA is proposed for estimating the speakers' directions. Figure 1(a) shows the structure of the proposed central SSMA with 10 microphones and the microphone pairs selected for the SSL algorithm. This structure is used for DOA estimation with the 2D-SRP algorithm. The microphone pairs (1,6), (1,7), (2,7), (2,8), (3,8), (3,9), (4,9), (4,10), (5,10), and (5,6) are used for calculating the cross-correlation in the 2D-SRP method. Restricting the computation to these microphone pairs decreases the complexity while still estimating the speakers' directions ($\hat{\alpha}_{C,l}$ for $l = 1, \ldots, L$) with high accuracy. Two lateral MAs in a distributed layout are then used for accurate SSL. Figure 1(b) and (c) show the lateral CSMAs for the second step of the localization algorithm. Each CSMA consists of 5 microphones, and the CSMAs are installed at suitable distances on the walls. These lateral MAs, in combination with the central MA, complete the localization process. Figure 1(b) shows the microphone pairs in the first CSMA (LA1) for horizontal DOA estimation ($\hat{\alpha}_{LA1,l}$ for $l = 1, \ldots, L$). Three microphone pairs (m1, m2), (m2, m3), and (m3, m1) are selected for DOA estimation with the adaptive SB-GEVD method. As mentioned, the two CSMAs closest to the speaker are used. In the next step, the second closest CSMA to the speaker is used to estimate the vertical direction ($\hat{\alpha}_{LA2,l}$ for $l = 1, \ldots, L$). Figure 1(c) shows the microphone pairs (m1, m2), (m2, m3), and (m3, m1) in the second CSMA (LA2), which are used with the adaptive SB-GEVD algorithm for vertical DOA estimation. The central and lateral MAs (SSMA and CSMAs) provide the best acoustical information with minimal noise and reverberation effects, which increases the reliability of the estimated locations. In addition, using lateral and central MAs decreases the computational complexity.

Fig. 1. a) The central star-shaped array, b) the lateral cross-shaped array for horizontal direction estimation, and c) the lateral CSMA for vertical DOA estimation with selected microphone pairs.

IV. THE PROPOSED SSL METHOD BY 2D-SRP AND SUBBAND GEVD ALGORITHMS
Multi-speaker localization in real conditions is a challenge in speech processing, especially in noisy and reverberant environments. One-step algorithms based on energy calculation offer good accuracy at high computational cost, whereas time-delay estimation (TDE)-based algorithms estimate the speakers' locations faster but with less precision. Figure 2 shows the proposed 3D SSL system for simultaneous speakers in acoustical environments. In this system, the SSMA and CSMAs are combined with the 2D-SRP and adaptive SB-GEVD algorithms for multiple SSL. The design strategy of the proposed system is to combine the high accuracy of the SRP algorithm [9] with the low complexity of the GEVD method [17]. In addition, implementing the SRP function in two dimensions greatly decreases the complexity of the method, and the accuracy of the GEVD method is increased by its adaptive and subband implementation.
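As a rough illustration of the first stage, the following is a minimal sketch (not the authors' implementation) of a PHAT-weighted SRP evaluated on a 2D grid at a fixed height, using only the ten SSMA microphone pairs listed in Section III. Several things are assumptions made for the example: the star geometry (an inner and an outer pentagon with radii 0.10 m and 0.25 m), the speed of sound, the source position (chosen to lie on the grid), the grid step, and an anechoic, noise-free signal built from ideal circular delays. The function name `srp_phat_2d` and all variable names are hypothetical.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s


def srp_phat_2d(signals, mic_xyz, pairs, grid_xy, z_fixed, fs):
    """Return the PHAT-weighted SRP value at each candidate (x, y) point."""
    n = signals.shape[1]
    spectra = np.fft.rfft(signals, axis=1)                 # X_m(omega), shape (M, F)
    omega = 2.0 * np.pi * np.fft.rfftfreq(n, d=1.0 / fs)   # angular frequency in rad/s
    # Candidate points with the height held fixed (the 2D restriction).
    pts = np.column_stack([grid_xy, np.full(len(grid_xy), z_fixed)])
    # Propagation delay from each candidate point to each microphone, shape (K, M).
    delays = np.linalg.norm(pts[:, None, :] - mic_xyz[None, :, :], axis=2) / C
    power = np.zeros(len(grid_xy))
    for a, b in pairs:
        cross = spectra[a] * np.conj(spectra[b])
        cross = cross / (np.abs(cross) + 1e-12)            # PHAT: keep the phase only
        tdoa = delays[:, a] - delays[:, b]                 # expected delay difference
        steer = np.exp(1j * np.outer(tdoa, omega))         # (K, F) phase compensation
        power += np.real(steer @ cross)                    # sum over frequency bins
    return power


# --- Demo with an assumed geometry and an ideal (anechoic, noise-free) source ---
fs, n = 16000, 1024
center = np.array([2.30, 1.75, 0.95])                      # SSMA position from Section V
ang_in = 2.0 * np.pi * np.arange(5) / 5                    # mics 1-5 (assumed layout)
ang_out = ang_in - np.pi / 5                               # mics 6-10 (assumed layout)
mic_xyz = center + np.column_stack([
    np.concatenate([0.10 * np.cos(ang_in), 0.25 * np.cos(ang_out)]),
    np.concatenate([0.10 * np.sin(ang_in), 0.25 * np.sin(ang_out)]),
    np.zeros(10),
])
# Pairs (1,6), (1,7), (2,7), ..., (5,10), (5,6) converted to zero-based indices.
pairs = [(0, 5), (0, 6), (1, 6), (1, 7), (2, 7),
         (2, 8), (3, 8), (3, 9), (4, 9), (4, 5)]

src = np.array([1.2, 2.6, 1.65])                           # hypothetical source position
rng = np.random.default_rng(0)
S = np.fft.rfft(rng.standard_normal(n))
omega = 2.0 * np.pi * np.fft.rfftfreq(n, d=1.0 / fs)
tau = np.linalg.norm(mic_xyz - src, axis=1) / C            # true propagation delays
signals = np.fft.irfft(S * np.exp(-1j * omega * tau[:, None]), n=n, axis=1)

# 0.1 m grid over the 4.6 m x 3.5 m floor plan, evaluated at the source height.
gx, gy = np.meshgrid(np.arange(0.0, 4.61, 0.1), np.arange(0.0, 3.51, 0.1))
grid_xy = np.column_stack([gx.ravel(), gy.ravel()])

power = srp_phat_2d(signals, mic_xyz, pairs, grid_xy, z_fixed=src[2], fs=fs)
est_xy = grid_xy[np.argmax(power)]
azimuth_deg = np.degrees(np.arctan2(est_xy[1] - center[1], est_xy[0] - center[0]))
print("estimated (x, y):", est_xy, "azimuth from array centre (deg):", azimuth_deg)
```

The cost of the sketch scales with the number of grid points times the number of pairs, which is why fixing the height and using 10 pairs instead of all 45 reduces the computation. With reverberation, noise, or a source off the assumed height, the peak would be broader and possibly displaced, which is where the peak extraction and the lateral arrays of the proposed system come in.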

Fig. 2. The block diagram of the presented localization system based on the central and lateral MAs in combination with 2D-SRP and adaptive subband GEVD.

A. The 2D-SRP method in combination with central star-shaped microphone array
In this paper, a novel system is proposed for simultaneous SSL. As shown in Figure 2, the first step is to estimate the directions of the speakers with the SSMA in combination with a 2D version of the SRP method, which decreases the complexity while keeping high accuracy. Based on the microphone signal model in Eq. (1), the delay-and-sum beamformer (DSB) is formed as:

$$P\left(t, \theta_1^{2D}, \ldots, \theta_M^{2D}\right) = \sum_{m=1}^{M} x_m(t - \theta_m) \tag{2}$$

where in Eq. (2), $\theta_1^{2D}, \ldots, \theta_M^{2D}$ are the M steering delays that orient the array toward the 2D candidate points in the acoustical environment. The DSB output is therefore expressed as a function of the microphone signals and steering delays as:

$$P\left(t, \theta_1^{2D}, \ldots, \theta_M^{2D}\right) = s(t - \tau_0) * \sum_{m=1}^{M} \frac{1}{r_m} \gamma(d_{m,l}, t - \tau_0 + \tau_m) + \sum_{m=1}^{M} \frac{1}{r_m} v_m(d_{m,l}, t - \tau_0 + \tau_m) \tag{3}$$

where in Eq. (3), $\tau_0$ is the phase center, $\tau_m$ is the propagation delay between the source and the m-th microphone, and $r_m$ is the distance between the source and the m-th microphone. Since the reverberation contains multi-path versions of the speech signals, the noise $v_m(t)$ is highly correlated with the source signal, which decreases the performance of the DSB. The alternative is to use adaptive filters in this beamformer, which gives the filter-and-sum beamformer (FSB). The FSB decreases the noise and reverberation effects and is expressed in the frequency domain as:

$$P\left(\omega, \theta_1^{2D}, \ldots, \theta_M^{2D}\right) = \sum_{m=1}^{M} G_m(\omega) X_m(\omega) e^{-j\omega\theta_m} \tag{4}$$

where in Eq. (4), $X_1(\omega), \ldots, X_M(\omega)$ are the Fourier transforms of the SSMA signals and $G_1(\omega), \ldots, G_M(\omega)$ are the Fourier transforms of the adaptive filters. The selection of these filters depends on the source and noise signals in the acoustical environment. The 2D-SRP is calculated from the output power of the FSB as:

$$Q\left(\theta_1^{2D}, \ldots, \theta_M^{2D}\right) = \int_{-\infty}^{+\infty} P\left(\omega, \theta_1^{2D}, \ldots, \theta_M^{2D}\right) P'\left(\omega, \theta_1^{2D}, \ldots, \theta_M^{2D}\right) d\omega \tag{5}$$

where $P\left(\omega, \theta_1^{2D}, \ldots, \theta_M^{2D}\right)$ is the FSB output restricted to two dimensions and the prime denotes complex conjugation. Substituting Eq. (4) into Eq. (5) and exchanging the order of the summations and the integral, the 2D-SRP is implemented as:

$$Q\left(\theta_1^{2D}, \ldots, \theta_M^{2D}\right) = \sum_{a=1}^{M} \sum_{b=1}^{M} \int_{-\infty}^{+\infty} \psi_{ab}(\omega) X_a(\omega) X_b'(\omega) e^{j\omega(\theta_b - \theta_a)} d\omega \tag{6}$$

where the weighting function is $\psi_{ab}(\omega) = G_a(\omega) G_b'(\omega)$. In this paper, the phase transform (PHAT) function $\psi_{ab}^{PHAT}(\omega) = 1 / \left|X_a(\omega) X_b'(\omega)\right|$ is selected because of its performance in reverberant environments. Combined with the 2D-SRP algorithm, it gives:

$$Q\left(\theta_1^{2D}, \ldots, \theta_M^{2D}\right) = \sum_{a=1}^{M} \sum_{b=1}^{M} \int_{-\infty}^{+\infty} \frac{X_a(\omega) X_b'(\omega)}{\left|X_a(\omega) X_b'(\omega)\right|} e^{j\omega(\theta_b - \theta_a)} d\omega \tag{7}$$

In ideal conditions, the peak positions of this function are the source locations. In real scenarios with strong noise and reverberation, however, the results contain large errors. Therefore, the number of local peaks in this function is determined by the PENS algorithm, and only the first L peaks are retained to decrease the complexity:

$$\hat{\alpha}_{C,1} = \arg\max_{\theta_1, \ldots, \theta_M \in R(x,y)} Q\left(\theta_1^{2D}, \ldots, \theta_M^{2D}\right)$$
$$\vdots \tag{8}$$
$$\hat{\alpha}_{C,L} = \arg\max_{\substack{\theta_1, \ldots, \theta_M \in R(x,y) \\ \hat{\alpha}_{C,L} \neq \hat{\alpha}_{C,1}, \ldots, \hat{\alpha}_{C,L-1}}} Q\left(\theta_1^{2D}, \ldots, \theta_M^{2D}\right)$$

where in Eq. (8), $\hat{\alpha}_{C,l}$ is the speaker's direction for the l-th sound source (l = 1,…,L) based on the central SSMA with M microphones. In addition, the z coordinate is fixed to reduce the computational complexity in the SRP

algorithm, making real implementations feasible. Since the 2D-SRP function is only used for DOA estimation, the 2D restriction does not affect the accuracy of the final results. Next, the two lateral CSMAs closest to each speaker are selected and combined with the adaptive SB-GEVD algorithm for vertical and horizontal direction estimation.

B. The adaptive SB-GEVD algorithm in combination with lateral MAs
In real scenarios, the acoustical environment is considered a linear system. Therefore, the relation between the microphone signals and the RIRs is:

$$x_i(n)\, q_j = x_j(n)\, q_i \tag{9}$$

where $x_i(n)$ is the signal recorded by the i-th microphone, including the noise and reverberation effects. The RIR between the source and each microphone is written as:

$$q_j = \left[ q_{j,0}, q_{j,1}, \ldots, q_{j,B} \right] \tag{10}$$

where B is the length of the RIR vector. Since three microphone pairs in the CSMA are used for the evaluations, the indices i and j range from 1 to 3. The covariance matrix is therefore calculated as:

$$H = \begin{bmatrix} h_{x_1 x_1} & h_{x_1 x_2} & h_{x_1 x_3} \\ h_{x_2 x_1} & h_{x_2 x_2} & h_{x_2 x_3} \\ h_{x_3 x_1} & h_{x_3 x_2} & h_{x_3 x_3} \end{bmatrix} \tag{11}$$

where in Eq. (11), the covariance values in this matrix are $h_{x_i x_j} = E\{ x_i(n) x_j^T(n) \}$ for i, j = 1, 2, 3. To solve for this covariance matrix, the impulse response vector $F = [q_3; -q_2; -q_1]$ is formed by stacking the impulse responses obtained from the three microphone pairs, with length 3B. The vector F is the eigenvector associated with the eigenvalue 0. The adaptive SB-GEVD algorithm uses iterative stochastic gradient methods to calculate the eigenvector related to the smallest generalized eigenvalue of the signal covariance matrix $H_{Bx}$ and the noise covariance matrix $H_{Bv}$ over successive time frames. The generalized eigenvalue of the covariance matrix H is obtained by minimizing the cost function $F^T H_{Bx} F$, which is treated as minimizing the mean square error (MSE) of:

$$e(n) = \frac{F^T(n)\, x_B(n)}{\sqrt{F^T(n)\, H_{Bv}\, F(n)}} \tag{12}$$

which is implemented by the least mean square (LMS) method as:

$$F(n+1) = F(n) - \mu\, e(n)\, \frac{\partial e(n)}{\partial F(n)} \tag{13}$$

where $\mu$ is the adaptation step. To avoid the delay propagation, an extra normalization step is implemented in the LMS algorithm, so that the vector F is finally expressed as:

$$F(n+1) = \frac{\tilde{F}(n+1)}{\sqrt{\tilde{F}^T(n+1)\, H_{Bv}\, \tilde{F}(n+1)}} \tag{14}$$

where

$$\tilde{F}(n+1) = F(n) - \mu\, e(n) \left\{ x_B(n) - e(n)\, H_{Bv}\, F(n) \right\} \tag{15}$$

The impulse responses between the sound source and the 3 microphone pairs ($q_1, q_2, q_3$) are obtained by estimating the vector F. Finally, the DOA for each speaker ($\hat{\alpha}_{LA1,l}$ for $l = 1, \ldots, L$) is obtained by averaging the three DOAs from the microphone pairs. This process is repeated for the other closest CSMA to obtain the vertical DOAs ($\hat{\alpha}_{LA2,l}$ for $l = 1, \ldots, L$). In this paper, the SB-GEVD algorithm is adaptively implemented on the frequency bands $b_1$ = 0-4 kHz and $b_2$ = 4-8 kHz over successive time frames of the recorded signals. The aim of subband processing is to focus on the low-frequency components of the recorded signal (which carry more speech information) and to decrease the effect of the high-frequency components (which are noisier and carry less speech), giving the estimates $\hat{\alpha}_{LAi,l,b_1}$ and $\hat{\alpha}_{LAi,l,b_2}$ for i = 1, 2. This process is repeated in parallel on the two CSMAs closest to each speaker over successive time frames. The weights are then calculated as:

$$w_i = \frac{1}{N} \sum_{n=1}^{N} \frac{\max \mathrm{Hist}\left(\hat{\alpha}_{LAi,l,b_1}\right)}{\sum_{j=1}^{2} \max \mathrm{Hist}\left(\hat{\alpha}_{LAi,l,b_j}\right)} \quad \text{for } i = 1, 2;\ l = 1, \ldots, L \tag{16}$$

The final histogram is calculated by applying the weights to the histograms of the two subbands, and its peak positions are the DOAs of the speakers:

$$\hat{\alpha}_{LAi,l} = \arg\max_{\theta_1, \ldots, \theta_M \in R(x,y)} \frac{1}{2} \sum_{j=1}^{2} w_i\, \mathrm{Hist}\left(\hat{\alpha}_{LAi,l,b_j}\right) \quad \text{for } i = 1, 2;\ l = 1, \ldots, L \tag{17}$$

This calculation is performed on both CSMAs closest to each speaker to estimate the horizontal ($\hat{\alpha}_{LA1,l}$) and vertical ($\hat{\alpha}_{LA2,l}$) directions. Finally, three directions are available for each speaker ($\hat{\alpha}_{C,l}$, $\hat{\alpha}_{LA1,l}$, $\hat{\alpha}_{LA2,l}$), which are the DOAs in 3D space. The intersection of these DOAs is a point in the ideal case and an area in real conditions, where the 3D source location is selected as the point in the area closest to the three DOAs:

$$(\hat{x}_l, \hat{y}_l, \hat{z}_l) = \arg\min_{x, y, z} \sqrt{\left(x - x_l^{C}\right)^2 + \left(y - y_l^{L1}\right)^2 + \left(z - z_l^{L2}\right)^2} \quad \text{for } l = 1, \ldots, L \tag{18}$$

The various combinations of x, y, and z in this area are evaluated to estimate the final 3D location $(\hat{x}_l, \hat{y}_l, \hat{z}_l)$ of each speaker. This estimation is repeated to calculate the 3D locations of all speakers in the acoustical environment.

V. SIMULATIONS AND RESULTS
The experiments are performed on simulated data from the TIMIT dataset [18]. According to experiments in real scenarios, 90% of overlapped speech involves 2 simultaneous speakers and 8% involves 3 simultaneous speakers [19]. Therefore, the experiments are implemented for 2 and 3 overlapped speakers. One female (X1) and one male (X2) speech signal are selected for the 2 simultaneous speakers. The third speaker (X3) in the scenarios with 3 simultaneous speakers is male. Figure 3 shows the simulated room used to evaluate the proposed LCMA-MSSL algorithm in acoustical scenarios. As shown, the SSMA with 10 microphones is located at (230,175,95) cm. In addition, two CSMAs with 5 microphones each are installed on each wall at a height of 170 cm above the ground. The room

dimension is (460,350,270) cm, which is similar to real conditions. The three speakers are located at X1=(125,262,165) cm, X2=(141,74,175) cm, and X3=(383,158,179) cm, respectively.

Fig. 3. The simulated room for speech signal recording in the multiple SSL application by SSMA and CSMAs for 2 and 3 simultaneous speakers.

A Hamming window of 40 ms with 50% overlap between windows is selected in the simulations to make maximum use of the recorded data and to update the noise covariance matrix. Gaussian noise is selected to resemble real scenarios. In addition, the image method [20] is used to generate the reverberation in the acoustical room.
The proposed LCMA-MSSL algorithm is compared with the PCSF [12], SF-MCA [13], and DR-MCL [14] methods in terms of the mean absolute estimation error (MAEE) for 2 and 3 simultaneous speakers in different noisy and reverberant environments. In the first step, the signal-to-noise ratio (SNR) is fixed and the reverberation time ($RT_{60}$) is variable; in the second step, the $RT_{60}$ is fixed and the SNR is variable. Figure 4 shows the results of the proposed LCMA-MSSL in comparison with the PCSF, SF-MCA, and DR-MCL for 2 simultaneous speakers based on the averaged MAEE criterion. Figure 4(a) shows this comparison for fixed SNR = 5 dB and variable $0 \le RT_{60} \le 600$ ms, where the aim is to evaluate the effect of reverberation on the accuracy of the estimated locations. As shown, the PCSF and LCMA-MSSL methods have the lowest and highest localization accuracy, respectively. For example, at $RT_{60}$ = 500 ms the MAEE is 33 cm for the LCMA-MSSL, compared with 51 cm for PCSF, 47 cm for SF-MCA, and 40 cm for DR-MCL, which shows the superiority of the presented algorithm. The precision of all methods decreases as the reverberation increases. Figure 4(b) shows the MAEE results of the LCMA-MSSL in comparison with the PCSF, SF-MCA, and DR-MCL algorithms for fixed $RT_{60}$ = 500 ms and $-10 \le SNR \le 20$ dB, to evaluate the effect of noise on the localization accuracy. At SNR = 10 dB, the averaged MAEE is 31 cm for the LCMA-MSSL, compared with 48 cm for PCSF, 43 cm for SF-MCA, and 38 cm for DR-MCL. As shown, the DR-MCL method gives the results closest to the LCMA-MSSL algorithm, but the presented algorithm still localizes the speakers more accurately. In addition, increasing the noise level decreases the accuracy of all localization methods.

Fig. 4. The averaged MAEE for the LCMA-MSSL in comparison with the PCSF, SF-MCA, and DR-MCL algorithms for 2 simultaneous speakers, a) for fixed SNR = 5 dB and variable $0 \le RT_{60} \le 600$ ms, b) for fixed $RT_{60}$ = 500 ms and variable $-10 \le SNR \le 20$ dB.

Figure 5 shows the MAEE results of the LCMA-MSSL compared with the PCSF, SF-MCA, and DR-MCL for three overlapped speakers in noisy and reverberant environments. As seen, the presented algorithm localizes the speakers more precisely than the other works. Figure 5(a) shows the averaged MAEE results for fixed SNR = 5 dB and variable $0 \le RT_{60} \le 600$ ms. For example, at $RT_{60}$ = 500 ms the proposed LCMA-MSSL localizes the 3 simultaneous speakers with an averaged MAEE of 38 cm, compared with 58 cm for PCSF, 51 cm for SF-MCA, and 48 cm for DR-MCL. Figure 5(b) shows the MAEE for the LCMA-MSSL compared with the previous works for 3 simultaneous speakers at fixed $RT_{60}$ = 500 ms and variable $-10 \le SNR \le 20$ dB, to verify the effect of noise on the localization accuracy. As seen, the proposed method gives better results than the previous methods. At SNR = 10 dB, the averaged MAEE is 36 cm for the proposed LCMA-MSSL method, compared with 54 cm for PCSF, 47 cm for SF-MCA, and 42 cm for DR-MCL, and the accuracy of all methods decreases as the noise increases. Figures 4 and 5 show the reliability of the presented method compared with the other algorithms.

VI. CONCLUSIONS
In this paper, a two-step 3D SSL system was proposed for simultaneous speakers using central and lateral MAs. The central SSMA and lateral CSMAs were proposed to increase the localization accuracy. First, the speakers' directions are calculated by the central SSMA in combination with the 2D-SRP method. Implementing the SRP method in 2D

provides high accuracy in estimating the speakers' directions. Next, the two closest CSMAs are selected for each speaker. These CSMAs estimate the horizontal (L1) and vertical (L2) directions in combination with the adaptive SB-GEVD method. Finally, the intersection between these DOAs gives a 3D area in the acoustical space, and the point in this area closest to all three DOAs is selected as the 3D location of the speaker. The proposed LCMA-MSSL method was compared with the PCSF, SF-MCA, and DR-MCL algorithms based on the averaged MAEE criterion for 2 and 3 overlapped speakers. The results show the reliability of the presented algorithm in comparison with the other SSL algorithms.

Fig. 5. The averaged MAEE for the LCMA-MSSL in comparison with the PCSF, SF-MCA, and DR-MCL algorithms for 3 simultaneous speakers, a) for fixed SNR = 5 dB and variable $0 \le RT_{60} \le 600$ ms, b) for fixed $RT_{60}$ = 500 ms and variable $-10 \le SNR \le 20$ dB.

ACKNOWLEDGMENT
The authors acknowledge financial support from: ANID/FONDECYT Postdoctorado No. 3190147, Competition for Research Regular Projects, year 2021, code LPR21-02, Universidad Tecnológica Metropolitana, projects DICYT 062117S, ANID FCHA/Beca de Doctorado Nacional/2019 21190489, CODELCO: Concurso Piensa Minería 2021, SENESCYT “Convocatoria abierta 2014-primera fase, Acta CIBAE-023-2014”.

REFERENCES
[1] C. Xu, B. Zhou, and L. Xu, “Adaptive Speech Enhancement Algorithm Based on First-order Differential Microphone Array,” in Proceedings IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE), 2021, pp. 41-44.
[2] Q. Zhang, Z. Chen, and F. Yin, “Speaker Tracking Based on Distributed Particle Filter in Distributed Microphone Networks,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 9, 2017, pp. 2433-2443.
[3] F. Zeng and X. Kong, “Design of Speech Recognition System Based on Linear Microphone Array,” in Proceedings International Conference on Intelligent Computing, Automation and Applications (ICAA), 2021, pp. 333-338.
[4] M. Farmani, M. S. Pedersen, and J. Jensen, “Sound Source Localization for Hearing Aid Applications Using Wireless Microphones,” in Proceedings 10th Sensor Array and Multichannel Signal Processing Workshop (SAM), 2018, pp. 455-459.
[5] R. Cutler et al., “Multimodal Active Speaker Detection and Virtual Cinematography for Video Conferencing,” in Proceedings International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020), 2020, pp. 4527-4531.
[6] G. Liu, S. Yuan, J. Wu, and R. Zhang, “A Sound Source Localization Method Based on Microphone Array for Mobile Robot,” in Proceedings Chinese Automation Congress (CAC), 2018, pp. 1621-1625.
[7] B. Yang, H. Liu, C. Pang, and X. Li, “Multiple sound source counting and localization based on TF-wise spatial spectrum clustering,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, 2019, pp. 1241-1255.
[8] T. N. T. Nguyen, S. Zhao, and D. L. Jones, “Robust DOA estimation of multiple speech sources,” in Proceedings IEEE International Conference Acoustics, Speech, Signal Processing, 2014, pp. 2287-2291.
[9] H. Do and H. F. Silverman, “SRP-PHAT methods of locating simultaneous multiple talkers using a frame of microphone array data,” in Proceedings IEEE International Conference Acoustics, Speech, Signal Processing, 2010, pp. 125-128.
[10] A. N. Dwiputra, R. H. G. Nainggolan, and M. A. M. Nasution, “3-dimension sound source localization with cross-correlation and CORDIC algorithm on FPGA,” in Proceedings International Symposium on Electronics and Smart Devices (ISESD), 2016, pp. 365-370.
[11] H. Do, H. F. Silverman, and Y. Yu, “A Real-Time SRP-PHAT Source Location Implementation using Stochastic Region Contraction (SRC) on a Large-Aperture Microphone Array,” in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, 2007, pp. I-121-I-124.
[12] N. Stefanakis, D. Pavlidi, and A. Mouchtaris, “Perpendicular Cross-Spectra Fusion for Sound Source Localization With a Planar Microphone Array,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 9, 2017, pp. 1821-1835.
[13] N. Chu, Y. Ning, L. Yu, Q. Liu, Q. Huang, D. Wu, and P. Hou, “Acoustic Source Localization in a Reverberant Environment Based on Sound Field Morphological Component Analysis and Alternating Direction Method of Multipliers,” IEEE Transactions on Instrumentation and Measurement, vol. 70, 2021, pp. 1-13.
[14] I. An, Y. Kwon, and S.-e. Yoon, “Diffraction- and Reflection-Aware Multiple Sound Source Localization,” IEEE Transactions on Robotics, vol. 38, no. 3, 2022, pp. 1925-1944.
[15] H. Sayoud and S. Ouamour, “Proposal of a new confidence parameter estimating the number of speakers-An experimental investigation,” Journal of Information Hiding and Multimedia Signal Processing, vol. 1, no. 2, 2010, pp. 101-109.
[16] K. Niwa, Y. Hioka, and K. Kobayashi, “Optimal Microphone Array Observation for Clear Recording of Distant Sound Sources,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 10, 2016, pp. 1785-1795.
[17] S. Doclo and M. Moonen, “Robust Adaptive Time Delay Estimation for Speaker Localization in Noisy and Reverberant Acoustic Environments,” EURASIP Journal on Advances in Signal Processing, vol. 11, 2003, pp. 1110-1124.
[18] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, “TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1,” Web Download. Philadelphia: Linguistic Data Consortium (1993). Available from: https://catalog.ldc.upenn.edu/LDC93S1. Last accessed May 2022.
[19] O. Cetin and E. Shriberg, “Analysis of overlaps in meetings by dialog factors, hot spots, speakers, and collection site: Insights for automatic speech recognition,” in Proceedings Interspeech, 2006, pp. 293-296.
[20] J. Allen and D. Berkley, “Image method for efficiently simulating small room acoustics,” Journal of the Acoustical Society of America, vol. 65, no. 4, 1979, pp. 943-950.
