This is a special issue published in volume 2010 of “EURASIP Journal on Advances in Signal Processing.” All articles are open access
articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Editor-in-Chief
Phillip Regalia, Institut National des Télécommunications, France
Associate Editors
Adel M. Alimi, Tunisia; Kenneth Barner, USA; Yasar Becerikli, Turkey; Kostas Berberidis, Greece; Enrico Capobianco, Italy; A. Enis Cetin, Turkey; Jonathon Chambers, UK; Mei-Juan Chen, Taiwan; Liang-Gee Chen, Taiwan; Satya Dharanipragada, USA; Kutluyil Dogancay, Australia; Florent Dupont, France; Frank Ehlers, Italy; Sharon Gannot, Israel; Samanwoy Ghosh-Dastidar, USA; Norbert Goertz, Austria; M. Greco, Italy; Irene Y. H. Gu, Sweden; Fredrik Gustafsson, Sweden; Ulrich Heute, Germany; Sangjin Hong, USA; Jiri Jan, Czech Republic; Magnus Jansson, Sweden; Sudharman K. Jayaweera, USA; Soren Holdt Jensen, Denmark; Mark Kahrs, USA; Moon Gi Kang, South Korea; Walter Kellermann, Germany; Lisimachos P. Kondi, Greece; Alex Chichung Kot, Singapore; Ercan E. Kuruoglu, Italy; Tan Lee, China; Geert Leus, The Netherlands; T.-H. Li, USA; Husheng Li, USA; Mark Liao, Taiwan; Y.-P. Lin, Taiwan; Shoji Makino, Japan; Stephen Marshall, UK; C. Mecklenbräuker, Austria; Gloria Menegaz, Italy; Ricardo Merched, Brazil; Marc Moonen, Belgium; Christophoros Nikou, Greece; Sven Nordholm, Australia; Patrick Oonincx, The Netherlands; Douglas O’Shaughnessy, Canada; Björn Ottersten, Sweden; Jacques Palicot, France; Ana Perez-Neira, Spain; Wilfried R. Philips, Belgium; Aggelos Pikrakis, Greece; Ioannis Psaromiligkos, Canada; Athanasios Rontogiannis, Greece; Gregor Rozinaj, Slovakia; Markus Rupp, Austria; William Sandham, UK; B. Sankur, Turkey; Erchin Serpedin, USA; Ling Shao, UK; Dirk Slock, France; Yap-Peng Tan, Singapore; João Manuel R. S. Tavares, Portugal; George S. Tombras, Greece; Dimitrios Tzovaras, Greece; Bernhard Wess, Austria; Jar-Ferr Yang, Taiwan; Azzedine Zerguine, Saudi Arabia; Abdelhak M. Zoubir, Germany
Contents
Microphone Array Speech Processing, Sven Nordholm, Thushara Abhayapala, Simon Doclo,
Sharon Gannot (EURASIP Member), Patrick Naylor, and Ivan Tashev
Volume 2010, Article ID 694216, 3 pages
Selective Frequency Invariant Uniform Circular Broadband Beamformer, Xin Zhang, Wee Ser,
Zhang Zhang, and Anoop Kumar Krishna
Volume 2010, Article ID 678306, 11 pages
First-Order Adaptive Azimuthal Null-Steering for the Suppression of Two Directional Interferers,
René M. M. Derkx
Volume 2010, Article ID 230864, 16 pages
Musical-Noise Analysis in Methods of Integrating Microphone Array and Spectral Subtraction Based on
Higher-Order Statistics, Yu Takahashi, Hiroshi Saruwatari, Kiyohiro Shikano, and Kazunobu Kondo
Volume 2010, Article ID 431347, 25 pages
Microphone Diversity Combining for In-Car Applications, Jürgen Freudenberger, Sebastian Stenzel,
and Benjamin Venditti
Volume 2010, Article ID 509541, 13 pages
Editorial
Microphone Array Speech Processing
Copyright © 2010 Sven Nordholm et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
Significant knowledge about microphone arrays has been gained from years of intense research and product development. Numerous applications have been suggested, ranging from large arrays (on the order of more than 100 elements) for use in auditoriums to small arrays of only 2 or 3 elements for hearing aids and mobile telephones. Beyond that, microphone array technology has been widely applied in speech recognition, surveillance, and warfare. Traditional techniques used for microphone arrays include fixed spatial filters, such as frequency invariant beamformers, as well as optimal and adaptive beamformers. These array techniques assume either model knowledge or calibration signal knowledge, together with localisation information, for their design. Thus they usually combine some form of localisation and tracking with the beamforming. Today, techniques based on blind signal separation (BSS) and time-frequency masking have attracted significant attention. These techniques rely less on an array model and localisation, and more on the statistical properties of speech signals such as sparseness, non-Gaussianity, and non-stationarity. The main advantage that multiple microphones add, from a theoretical perspective, is spatial diversity, which is an effective tool to combat interference, reverberation, and noise. The underpinning physical feature is the difference in coherence between the target field (the speech signal) and the noise field. Viewing the processing in this way, one can also understand the difficulty of enhancing highly reverberant speech given that we can only observe the received microphone signals.
This special issue contains contributions to traditional areas of research such as frequency invariant beamforming [1], hands-free operation of microphone arrays in cars [2], and source localisation [3]. The contributions show new ways to study these traditional problems and give new insights into them. Small arrays have always attracted many applications and much interest, for mobile terminals, hearing aids, and close-up microphones [4]. The novel way of representing small arrays leads to a capability to suppress multiple interferers. Artifacts in noise and speech stemming from processing are largely unavoidable, and nonlinear processing often results in a significant change of character, particularly of the noise. It is thus important to provide new insights into these phenomena, particularly the so-called musical noise [5]. Finally, new and unusual uses of microphone arrays are always interesting to see. Distributed microphone arrays in a sensor network [6] provide a novel approach to locating snipers. This type of processing has good prospects of growing interest in new and improved applications.
The contributions found in this special issue can be categorized into three main aspects of microphone array processing: (i) microphone array design based on eigenmode decomposition [1, 4]; (ii) multichannel processing methods [2, 5]; and (iii) source localisation [3, 6].
The paper by Zhang et al., “Selective frequency invariant uniform circular broadband beamformer” [1], describes a design method for Frequency-Invariant (FI) beamforming, a well-known array signal processing technique used in many applications such as speech acquisition, acoustic imaging, and communications. Many existing FI beamformers, however, are designed to have a frequency invariant gain over all angles. This might not be necessary, and if the gain constraint is confined to a specific angle, the FI performance over that selected region (in frequency and angle) can be expected to improve. Inspired by this idea, the proposed algorithm attempts to optimize the frequency invariant beampattern solely for the mainlobe and relaxes the FI requirement on the sidelobes. This sacrifice of performance in the undesired region is traded off for better performance in the desired region as well as a reduced number of microphones. The objective function is designed to minimize the overall spatial response of the beamformer with a constraint on the gain being smaller than a predefined threshold value across a specific frequency range and at a specific angle. The problem is formulated as a convex optimization problem, and the solution is obtained using the Second-Order Cone Programming (SOCP) technique. An analysis of the computational complexity of the proposed algorithm is presented as well as its performance, which is evaluated via computer simulation for different numbers of sensors and different threshold values. Simulation results show that the proposed algorithm is able to achieve a smaller mean square error of the spatial response gain for the specific FI region compared to existing algorithms.
The paper by Derkx, “First-order adaptive azimuthal null-steering for the suppression of two directional interferers” [4], shows that an azimuth-steerable first-order superdirectional microphone response can be constructed by a linear combination of three eigenbeams: a monopole and two orthogonal dipoles. Although a (rotation symmetric) first-order response can only exhibit a single null, the paper studies a slice through this beampattern lying in the azimuthal plane. In this way, a maximum of two nulls in the azimuthal plane can be defined, symmetric with respect to the main-lobe axis. By placing these two nulls on at most two directional sources to be rejected and compensating for the drop in level in the desired direction, these directional sources can be effectively rejected without attenuating the desired source. An adaptive null-steering scheme for adjusting the beampattern, which enables automatic source suppression, is presented. Closed-form expressions for this optimal null-steering are derived, enabling the computation of the azimuthal angles of the interferers. It is shown that the proposed technique has a good directivity index when the angular difference between the desired source and each directional interferer is at least 90 degrees.
In the paper by Takahashi et al., “Musical-noise analysis in methods of integrating microphone array and spectral subtraction based on higher-order statistics” [5], an objective analysis of musical noise is conducted. The musical noise is generated by two methods of integrating microphone array signal processing and spectral subtraction. To obtain better noise reduction, methods of integrating microphone array signal processing and nonlinear signal processing have been researched; however, nonlinear signal processing often generates musical noise. Since such musical noise causes discomfort to users, it is desirable that it be mitigated. Moreover, it has recently been reported that higher-order statistics are strongly related to the amount of musical noise generated. This implies that it is possible to optimize the integration method from the viewpoint of not only noise reduction performance but also the amount of musical noise generated. Thus, the simplest methods of integration, that is, the delay-and-sum beamformer and spectral subtraction, are analysed, and the features of the musical noise generated by each method are clarified. As a result, it is shown that a specific structure of integration is preferable from the viewpoint of the amount of generated musical noise. The validity of the analysis is demonstrated via computer simulation and a subjective evaluation.
The paper by Freudenberger et al., “Microphone diversity combining for in-car applications” [2], proposes a frequency-domain diversity approach for two or more microphone signals, for example, for in-car applications. The microphones should be positioned separately to ensure diverse signal conditions and incoherent recording of noise. This enables a better compromise for the microphone position with respect to different speaker sizes and noise sources. The work proposes a two-stage approach: in the first stage, the microphone signals are weighted with respect to their signal-to-noise ratio and then summed, similar to maximum-ratio combining. The combined signal is then used as a reference for a frequency-domain least-mean-squares (LMS) filter for each input signal. The output SNR is significantly improved compared to coherence-based noise reduction systems, even if one microphone is heavily corrupted by noise.
The paper by Ichikawa et al., “DOA estimation with local-peak-weighted CSP” [3], proposes a novel weighting algorithm for Cross-power Spectrum Phase (CSP) analysis to improve the accuracy of direction of arrival (DOA) estimation for beamforming in a noisy environment. A human speaker is used as the sound source, and broadband automobile noise as the noise source. The harmonic structures in the human speech spectrum can be used to weight the CSP analysis, because harmonic bins must contain more speech power than the others and thus give more reliable information. However, most conventional methods leveraging harmonic structures require pitch estimation with voiced-unvoiced classification, which is not sufficiently accurate in noisy environments. The suggested approach employs the observed power spectrum, which is directly converted into weights for the CSP analysis by retaining only the local peaks considered to come from a harmonic structure. The presented results show that the proposed approach significantly reduces localization errors, and it shows further improvement when used with other weighting algorithms.
The paper by Lindgren et al., “Shooter localization in wireless microphone networks” [6], is an interesting combination of microphone array technology with distributed
References
[1] X. Zhang, W. Ser, Z. Zhang, and A. K. Krishna, “Selective
frequency invariant uniform circular broadband beamformer,”
EURASIP Journal on Advances in Signal Processing, vol. 2010,
Article ID 678306, 11 pages, 2010.
[2] J. Freudenberger, S. Stenzel, and B. Venditti, “Microphone
diversity combining for in-car applications,” EURASIP Journal
on Advances in Signal Processing, vol. 2010, Article ID 509541,
13 pages, 2010.
[3] O. Ichikawa, T. Fukuda, and M. Nishimura, “DOA estimation
with local-peak-weighted CSP,” EURASIP Journal on Advances
in Signal Processing, vol. 2010, Article ID 358729, 9 pages, 2010.
[4] R. M. M. Derkx, “First-order adaptive azimuthal null-steering
for the suppression of two directional interferers,” EURASIP
Journal on Advances in Signal Processing, vol. 2010, Article ID
230864, 16 pages, 2010.
[5] Y. Takahashi, H. Saruwatari, K. Shikano, and K. Kondo,
“Musical-noise analysis in methods of integrating microphone
array and spectral subtraction based on higher-order statistics,”
EURASIP Journal on Advances in Signal Processing, vol. 2010,
Article ID 431347, 25 pages, 2010.
[6] D. Lindgren, O. Wilsson, F. Gustafsson, and H. Habberstad,
“Shooter localization in wireless sensor networks,” in Proceed-
ings of the 12th International Conference on Information Fusion
(FUSION ’09), pp. 404–411, July 2009.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 678306, 11 pages
doi:10.1155/2010/678306
Research Article
Selective Frequency Invariant Uniform
Circular Broadband Beamformer
Xin Zhang,1 Wee Ser,1 Zhang Zhang,1 and Anoop Kumar Krishna2
1 Center for Signal Processing, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798
2 EADS Innovation Works, EADS Singapore Pte Ltd., No. 41, Science Park Road, 01-30, Singapore 117610
Copyright © 2010 Xin Zhang et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Frequency-Invariant (FI) beamforming is a well-known array signal processing technique used in many applications. In this paper, an algorithm is proposed that attempts to optimize the frequency invariant beampattern solely for the mainlobe and relaxes the FI requirement on the sidelobes. This sacrifice of performance in the undesired region is traded off for better performance in the desired region as well as a reduced number of microphones. The objective function is designed to minimize the overall spatial response of the beamformer with a constraint on the gain being smaller than a predefined threshold value across a specific frequency range and at a specific angle. The problem is formulated as a convex optimization problem, and the solution is obtained using the Second-Order Cone Programming (SOCP) technique. An analysis of the computational complexity of the proposed algorithm is presented as well as its performance. The performance is evaluated via computer simulation for different numbers of sensors and different threshold values. Simulation results show that the proposed algorithm is able to achieve a smaller mean square error of the spatial response gain for the specific FI region compared to existing algorithms.
the desired beampattern is minimized over a range of frequencies. Some such beamformers are designed in the time-frequency domain [10–12], while others are designed in the eigen-space domain [13].

The third type of FI beamformer is designed based on “Signal Transformation.” For this type of beamformer, the signal received at the sensor array is transformed into a domain in which the frequency response and the spatial response of the signal can be decoupled and hence adjusted independently. This is the principle adopted in [14], where a uniform concentric circular array (UCCA) is designed to achieve the FI beampattern. Excellent results have been produced by this algorithm. One limitation of the UCCA beamformer is that a relatively large number of sensors has to be used to form the concentric circular array.

Inspired by the UCCA beamformer design, a new algorithm has been proposed by the authors of this paper and presented in [15]. The proposed algorithm attempts to optimize the FI beampattern solely for the main lobe, where the signal of interest comes from, and relaxes the FI requirement on the side lobes. As a result, the sacrifice of performance in the undesired region is traded off for better performance in the desired region, and fewer microphones are employed. To achieve this goal, an objective function with a quadratic constraint is designed. This constraint function allows the FI characteristic to be accurately controlled over the specified bandwidth at the expense of other parts of the spectrum that are not of concern to the designer. The objective function is formulated as a convex optimization problem and readily solved by SOCP. Our algorithm has a frequency band of interest from 0.3π to 0.95π. If the sampling frequency is 16000 Hz, the frequency band of interest ranges from 2400 Hz to 7600 Hz. The algorithm can thus be applied in speech processing, as the labial and fricative sounds of speech mostly lie in the 8th to 9th octave. If the sampling frequency is 8000 Hz, the frequency band of interest is from 1200 Hz to 3800 Hz. This frequency range is useful for respiratory sounds [16].

The aim of this paper is to provide the full details of the design proposed in [15]. In addition, a computational complexity analysis of the proposed algorithm and sensitivity performance evaluations for different numbers of sensors and different constraint parameter values are also included.

The remainder of the paper is organized as follows: in Section 2, the problem formulation is discussed; in Section 3, the proposed beamforming design is described; in Section 4, the design of the beamforming weights using SOCP is shown; numerical results are given in Section 5; and finally, conclusions are drawn in Section 6.

2. Problem Formulation

A uniformly distributed circular sensor array with K microphones is arranged as shown in Figure 1. Each omnidirectional sensor is located at (r cos φ_k, r sin φ_k), where r is the radius of the circle, φ_k = 2kπ/K, and k = 0, ..., K − 1. In this configuration, the intersensor spacing is fixed at λ/2, where λ is the wavelength of the signals of interest and its minimum value is denoted by λ_min. The radius corresponding to λ_min is given by [14]

r = \frac{\lambda_{\min}}{4\sin(\pi/K)}. \qquad (1)

Assuming that the circular array lies in a horizontal plane, the steering vector is

a(f, \varphi) = \left[ e^{j2\pi f r\cos(\varphi-\varphi_0)/c}, \ldots, e^{j2\pi f r\cos(\varphi-\varphi_{K-1})/c} \right]^T, \qquad (2)

where T denotes transpose. For convenience, let ω be the normalized angular frequency, ω = 2πf/f_s; let κ be the ratio of the sampling frequency to the maximum frequency, κ = f_s/f_max; and let r̄ be the normalized radius, r̄ = r/λ_min. The steering vector can then be rewritten as

a(\omega, \varphi) = \left[ e^{j\omega\kappa\bar{r}\cos(\varphi-\varphi_0)}, \ldots, e^{j\omega\kappa\bar{r}\cos(\varphi-\varphi_{K-1})} \right]^T. \qquad (3)

Figure 2 shows the system structure of the proposed uniform circular array beamformer. The sampled sensor signals are represented by the vector X[n] = [x_0(n), x_1(n), ..., x_{K−1}(n)]^T, where n is the sampling instant. These sampled signals are transformed into a set of coefficients via the Inverse Discrete Fourier Transform (IDFT), where each coefficient is called a phase mode [17]. The mth phase mode at time instant n can be expressed as

p_m[n] = \sum_{k=0}^{K-1} x_k[n]\, e^{j2\pi km/K}. \qquad (4)

These phase modes are passed through an FIR (Finite Impulse Response) filter whose coefficients are denoted b_m[n]. The purpose of this filter is to remove the frequency dependency of the received signal X[n]. The beamformer output y[n] is then determined as the weighted sum of the filtered signals:

y[n] = \sum_{m=-L}^{L} \left( p_m[n] * b_m[n] \right) h_m, \qquad (5)

where h_m are the phase-mode spatial weighting coefficients (the beamforming weights) and * is the discrete-time convolution operator.

Let M be the total number of phase modes, which is assumed to be odd. As can be seen from Figure 2, the K received signals are transformed into M phase modes, where L = (M − 1)/2.
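As a concrete illustration of (1)–(5), the short Python sketch below builds the steering vector and the phase-mode transform for a uniform circular array. The parameter values (K, M, ω, the source angle) are illustrative assumptions, not taken from the paper's experiments; κ = 2 corresponds to sampling at twice the maximum frequency.

```python
import numpy as np

K = 20             # number of microphones on the circle (assumed)
M = 17             # number of phase modes, odd; L = (M - 1) // 2

def steering_vector(omega, phi, kappa, r_bar, K):
    """Normalized-frequency steering vector of eq. (3)."""
    phi_k = 2 * np.pi * np.arange(K) / K          # sensor angles phi_k = 2k*pi/K
    return np.exp(1j * omega * kappa * r_bar * np.cos(phi - phi_k))

def phase_modes(x, M):
    """Eq. (4): p_m = sum_k x_k e^{j 2 pi k m / K} for m = -L..L (one snapshot)."""
    K = len(x)
    L = (M - 1) // 2
    m, k = np.arange(-L, L + 1), np.arange(K)
    return np.exp(1j * 2 * np.pi * np.outer(m, k) / K) @ x

# Example: a unit plane wave from 30 degrees at omega = 0.5*pi,
# with r_bar = 1 / (4 sin(pi/K)) from eq. (1) and kappa = fs/fmax = 2.
omega, kappa, r_bar = 0.5 * np.pi, 2.0, 1 / (4 * np.sin(np.pi / K))
x = steering_vector(omega, np.deg2rad(30), kappa, r_bar, K)
p = phase_modes(x, M)
print(np.round(np.abs(p), 3))
```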
The corresponding spectrum of the phase modes can be obtained by taking the Discrete-Time Fourier Transform (DTFT). For a source with spectrum S(ω) impinging from angle φ, so that X_k(ω) is the spectrum of x_k[n],

P_m(\omega) = \sum_{k=0}^{K-1} X_k(\omega)\, e^{j2\pi km/K} = S(\omega) \sum_{k=0}^{K-1} e^{j\omega\kappa\bar{r}\cos(\varphi-\varphi_k)}\, e^{j2\pi km/K}, \qquad (6)

Y(\omega) = \sum_{m=-L}^{L} h_m P_m(\omega) B_m(\omega) = S(\omega) \sum_{m=-L}^{L} h_m \left( \sum_{k=0}^{K-1} e^{j\omega\kappa\bar{r}\cos(\varphi-\varphi_k)}\, e^{j2\pi km/K} \right) B_m(\omega). \qquad (7)

Figure 1: Uniform Circular Array Configuration.

Consequently, the response of the beamformer can be expressed as

G(\omega, \varphi) = \sum_{m=-L}^{L} h_m \left( \sum_{k=0}^{K-1} e^{j\omega\kappa\bar{r}\cos(\varphi-\varphi_k)}\, e^{j2\pi km/K} \right) B_m(\omega). \qquad (8)

In order to obtain an FI response, terms that are functions of ω are grouped together using the Jacobi–Anger expansion [18]:

e^{j\beta\cos\gamma} = \sum_{n=-\infty}^{+\infty} j^n J_n(\beta)\, e^{jn\gamma}, \qquad (9)

where J_n(β) is the Bessel function of the first kind of order n. Substituting (9) into (8) and applying properties of the Bessel function, the spatial response of the beamformer can now be approximated by

G(\omega, \varphi) = \sum_{m=-L}^{L} h_m\, e^{jm\varphi}\, K\, j^m\, J_m(\omega\kappa\bar{r})\, B_m(\omega). \qquad (10)

This process has been described in [13], and its detailed derivation can be found in [14].

3. Proposed Novel Beamformer

With the above formulation, we propose the following beampattern synthesis method. The basic idea is to enhance broadband signals over a specific frequency region and at a certain direction. To achieve this goal, the following objective function is proposed:

\min \int_{\omega}\int_{\varphi} \left| G(\omega, \varphi) \right|^2 d\omega\, d\varphi, \quad \text{s.t. } \left| G(\omega, \varphi_0) - 1 \right| \le \delta, \ \omega \in [\omega_l, \omega_u], \qquad (11)

where G(ω, φ) is the spatial response of the beamformer given in (10), ω_l and ω_u are the lower and upper limits of the specified frequency region, respectively, φ_0 is the specified direction, and δ is a predefined threshold value that controls the magnitude of the ripples of the main beam.

In principle, the objective function defined above aims to minimize the square of the spatial gain response across all frequencies and all angles, while constraining the gain to a value of one at the specified angle. The gain constraint is thus relaxed to one angle instead of all angles, so that the FI beampattern in the specified region can be improved. With this constraint setting, the resulting beamformer can enhance broadband desired signals arriving from one direction while attenuating broadband noise received from other directions. The concept behind the objective function is similar to that of the Capon beamformer [19]. One difference is that the Capon beamformer aims to minimize the data-dependent array output power at a single frequency, while the proposed algorithm aims to minimize the data-independent array output power across a wide range of frequencies. Another difference is that the constraint used in the Capon beamformer is a hard constraint, whereas the array-gain constraint used in the proposed algorithm is a soft constraint, which results in a higher degree of flexibility.

The proposed algorithm is expected to have lower computational complexity than the UCCA beamformer. The latter is designed to achieve an FI beampattern for all angles, whereas the proposed algorithm focuses only on a specified angle. For the same reason, the proposed algorithm is expected to have a larger degree of freedom as well, which explains its better FI beampattern for a given number of sensors. These performance improvements have been supported by computer simulations and will be discussed in the later part of this paper.
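To make (10) concrete, the following Python sketch evaluates the approximated spatial response for a given set of spatial weights and compensation filters. The function signature, the uniform filter length, and the random coefficients are illustrative assumptions, not the paper's design values; scipy's jv provides J_m.

```python
import numpy as np
from scipy.special import jv  # Bessel function of the first kind, J_m

def spatial_response(h, b, omega, phi, K, kappa, r_bar):
    """Eq. (10): G(w, phi) = sum_m h_m e^{jm phi} K j^m J_m(w kappa r_bar) B_m(w).
    h: (M,) spatial weights for m = -L..L; b: (M, N+1) FIR coefficients b_m[n]."""
    L = (len(h) - 1) // 2
    m = np.arange(-L, L + 1)
    n = np.arange(b.shape[1])
    B = b @ np.exp(-1j * n * omega)      # B_m(w) = sum_n b_m[n] e^{-j n w}
    return np.sum(h * np.exp(1j * m * phi) * K * (1j ** m)
                  * jv(m, omega * kappa * r_bar) * B)

# Example: 17 phase modes, order-16 filters, response at (w, phi) = (0.5*pi, 0).
K, M, N = 20, 17, 16
rng = np.random.default_rng(1)
h = rng.standard_normal(M)
b = rng.standard_normal((M, N + 1))
print(spatial_response(h, b, 0.5 * np.pi, 0.0, K, 2.0, 1 / (4 * np.sin(np.pi / K))))
```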
Figure 2: System structure of the proposed uniform circular array beamformer (IDFT phase-mode decomposition, compensation filters b_m[n], spatial weights h_m, and summation).
The optimization problems defined by (10) and (11) require the optimum values of both the compensation filters and the spatial weightings to be determined simultaneously. As such, Cholesky factorization is used to transform the objective function into a Second-Order Cone Programming (SOCP) problem. The details of the implementation are discussed in the following section. It should be noted that when the threshold value δ equals zero, the optimization process becomes a linearly constrained problem.

4. Convex Optimization-Based Implementation

Second-Order Cone Programming (SOCP) is a popular tool for solving convex optimization problems, and it has been used for array pattern synthesis problems [20–22] since the early papers by Lobo et al. [23]. One advantage of SOCP is that the global optimal solution is guaranteed if it exists, whereas a constrained least squares optimization procedure looks for a local minimum. Another important advantage is that it is very convenient to include additional linear or convex quadratic constraints, such as a norm constraint on the variable vector, in the problem formulation. The standard form of SOCP can be written as follows:

\min\ b^T x, \quad \text{s.t. } d_i^T x + q_i \ge \left\| A_i x + c_i \right\|_2, \quad i = 1, \ldots, N, \qquad (12)

where x ∈ R^m is the variable vector; the parameters are b ∈ R^m, A_i ∈ R^{(n_i−1)×m}, c_i ∈ R^{n_i−1}, d_i ∈ R^m, and q_i ∈ R. The norm appearing in the constraints is the standard Euclidean norm, that is, \|u\|_2 = (u^T u)^{1/2}.

4.1. Convex Optimization of the Beampattern Synthesis Problem. The following transformations are carried out to convert (11) into the standard form defined by (12).

First, B_m(\omega) = \sum_{n=0}^{N_m} b_m[n] e^{-jn\omega} is substituted into (10), where N_m is the filter order for each phase mode. The spatial response of the beamformer can now be expressed as

G(\omega, \varphi) = \sum_{m=-L}^{L} h_m\, e^{jm\varphi}\, K\, j^m\, J_m(\omega\kappa\bar{r}) \left[ \sum_{n=0}^{N_m} b_m[n]\, e^{-jn\omega} \right]. \qquad (13)

Using the identity e^{−jnω} = cos(nω) − j sin(nω), (13) becomes

G(\omega, \varphi) = K \sum_{m=-L}^{L} h_m\, e^{jm\varphi}\, j^m\, J_m(\omega\kappa\bar{r}) \left[ \left( \sum_{n=0}^{N_m} b_m[n]\cos(n\omega) \right) - j \left( \sum_{n=0}^{N_m} b_m[n]\sin(n\omega) \right) \right]
= K \sum_{m=-L}^{L} h_m\, e^{jm\varphi}\, j^m\, J_m(\omega\kappa\bar{r}) \left( c_m b_m - j\, s_m b_m \right), \qquad (14)

where b_m = [b_m[0], b_m[1], \ldots, b_m[N_m]]^T, c_m = [\cos(0), \cos(\omega), \ldots, \cos(N_m\omega)], and s_m = [\sin(0), \sin(\omega), \ldots, \sin(N_m\omega)]. Here h_m is the spatial weighting in the system structure, and b_m is the FIR filter coefficient vector for each phase mode.

Letting u_m = h_m\, j^m\, b_m, we have

G(\omega, \varphi) = K \sum_{m=-L}^{L} e^{jm\varphi} J_m(\omega\kappa\bar{r})\, c_m u_m - jK \sum_{m=-L}^{L} e^{jm\varphi} J_m(\omega\kappa\bar{r})\, s_m u_m = c(\omega, \varphi)\, u - j\, s(\omega, \varphi)\, u, \qquad (15)

where c(\omega, \varphi) = [K e^{-jL\varphi} J_{-L}(\omega\kappa\bar{r})\, c_{-L}, \ldots, K e^{jL\varphi} J_{L}(\omega\kappa\bar{r})\, c_{L}], u = [u_{-L}^T, u_{-L+1}^T, \ldots, u_{L}^T]^T, and s(\omega, \varphi) = [K e^{-jL\varphi} J_{-L}(\omega\kappa\bar{r})\, s_{-L}, \ldots, K e^{jL\varphi} J_{L}(\omega\kappa\bar{r})\, s_{L}].
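A compact Python sketch of the linearization (13)–(15): the rows c(ω, φ) and s(ω, φ) are assembled so that the response is linear in the stacked vector u. Equal filter orders across phase modes and the helper names are assumptions made here for brevity.

```python
import numpy as np
from scipy.special import jv

def c_s_rows(omega, phi, K, L, N, kappa, r_bar):
    """Assemble the row vectors c(w, phi) and s(w, phi) of eq. (15)."""
    n = np.arange(N + 1)
    cm, sm = np.cos(n * omega), np.sin(n * omega)   # c_m and s_m of eq. (14)
    rows_c, rows_s = [], []
    for m in range(-L, L + 1):
        g = K * np.exp(1j * m * phi) * jv(m, omega * kappa * r_bar)
        rows_c.append(g * cm)
        rows_s.append(g * sm)
    return np.concatenate(rows_c), np.concatenate(rows_s)

def G_of_u(u, omega, phi, K, L, N, kappa, r_bar):
    """Eq. (15): G(w, phi) = c(w, phi) u - j s(w, phi) u for stacked u."""
    c, s = c_s_rows(omega, phi, K, L, N, kappa, r_bar)
    return c @ u - 1j * (s @ u)

# Example with a random stacked u of length M * (N + 1) = 17 * 17
rng = np.random.default_rng(2)
u = rng.standard_normal(17 * 17) + 1j * rng.standard_normal(17 * 17)
print(G_of_u(u, 0.5 * np.pi, 0.0, 20, 8, 16, 2.0, 1 / (4 * np.sin(np.pi / 20))))
```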
Stacking the two row vectors, define

g = \begin{bmatrix} c(\omega, \varphi) \\ -s(\omega, \varphi) \end{bmatrix} u = A(\omega, \varphi)^H u. \qquad (16)

Hence, \left| G(\omega, \varphi) \right|^2 = g^H g = \left( A(\omega, \varphi)^H u \right)^H \left( A(\omega, \varphi)^H u \right) = u^H A(\omega, \varphi) A(\omega, \varphi)^H u.

The objective function and the constraint inequality defined in (11) can now be written as

\min_{u}\ u^H R u, \quad \text{s.t. } \left| G(\omega, \varphi_0) - 1 \right| \le \delta \ \text{for } \omega \in [\omega_l, \omega_u], \qquad (17)

where R = \int_{\omega}\int_{\varphi} A(\omega, \varphi) A(\omega, \varphi)^H\, d\omega\, d\varphi.

In order to transform (17) into the SOCP form defined by (12), the cost function must be linear. Since the matrix R is Hermitian and positive definite, it can be decomposed into an upper triangular matrix and its transpose using Cholesky factorization, that is, R = D^H D, where D is the Cholesky factor of R. Substituting this into (17), we have

u^H R u = u^H D^H D u = (Du)^H (Du). \qquad (18)

This further simplifies (17) into the following form:

\min_{u}\ \|d\|_2, \quad \text{s.t. } \|d\|_2 = \|D u\|_2,\ \left| G(\omega, \varphi_0) - 1 \right| \le \delta \ \text{for } \omega \in [\omega_l, \omega_u]. \qquad (19)

Denoting by t the maximum norm of the vector Du subject to various choices of u, (19) reduces to

\min_{u}\ t, \quad \text{s.t. } \|D u\|_2 \le t,\ \left| G(\omega, \varphi_0) - 1 \right| \le \delta \ \text{for } \omega \in [\omega_l, \omega_u]. \qquad (20)

It should be noted that (20) contains I different constraints, where I uniformly divides the frequency range spanned by ω.

Lastly, in order to solve (20) with an SOCP toolbox, we stack t and the coefficients of u together and define y = [t; u]. Let a = [1, 0]^T, so that t = a^T y. As a result, the objective function and the constraint defined in (11) can be expressed as

\min_{y}\ a^T y, \quad \text{s.t. } \left\| \begin{bmatrix} 0 & D \end{bmatrix} y \right\|_2 \le a^T y, \quad \left\| \begin{bmatrix} 0 & A(\omega, \varphi_0)^H \end{bmatrix} y - \begin{bmatrix} 1 \\ 0 \end{bmatrix} \right\|_2 \le \delta \ \text{for } \omega \in [\omega_l, \omega_u], \qquad (21)

where 0 is the zero matrix with its dimension determined from the context. Equation (21) can now be solved with great efficiency using a convex optimization toolbox such as SeDuMi [24].

Figure 3: The normalized spatial response of the proposed beamformer for ω = [0.3π, 0.95π].

4.2. Computational Complexity. When the Interior-Point Method (IPM) is used to solve the SOCP problem defined in (21), the number of iterations needed is bounded by O(\sqrt{N}), where N is the number of constraints. The amount of computation per iteration is O(n^2 \sum_i n_i) [23]. The bulk of the computational requirement of the broadband array pattern synthesis comes from the optimization process. The computational complexity of the optimization process of the proposed algorithm and that of the UCCA algorithm have been calculated and are listed in Table 1. It can be seen from Table 1 that the proposed algorithm requires a similar amount of computation per iteration but a much smaller number of iterations compared to the UCCA algorithm. The overall computational load of the proposed method is therefore much smaller than that needed by the UCCA algorithm. It should be noted that, as the coefficients are optimized in the phase-mode domain, the comparative computational load presented above is calculated based on the number of phase modes and not the number of sensors. Nevertheless, the larger the number of sensors, the larger the number of phase modes.

5. Numerical Results

In this numerical study, the performance of the proposed beamformer is compared with that of the UCCA beamformer [14] and Yan's beamformer [25] for the specified frequency region. The evaluation metric used to quantify the frequency invariance (FI) characteristics is the mean squared error of the array gain variation at the specified direction.
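The constrained problem (17)–(21) can be prototyped directly with a convex modeling tool. Below is a minimal CVXPY sketch under stated assumptions: D and the rows A(ω_i, φ_0)^H are random placeholders standing in for the quantities precomputed from (10)–(16), and the backend is whatever SOCP-capable solver CVXPY selects (the paper used SeDuMi).

```python
import numpy as np
import cvxpy as cp

# Minimize ||D u||_2 subject to |G(w_i, phi0) - 1| <= delta on a frequency
# grid, as in (17)/(19). Placeholder data only; in a real design, D is the
# Cholesky factor of R and each row of A0 is A(w_i, phi0)^H.
n = 40                                    # length of the stacked variable u
rng = np.random.default_rng(0)
D = rng.standard_normal((n, n))           # placeholder Cholesky factor of R
A0 = rng.standard_normal((8, n)) + 1j * rng.standard_normal((8, n))
delta = 0.01

u = cp.Variable(n, complex=True)
constraints = [cp.abs(A0[i] @ u - 1) <= delta for i in range(A0.shape[0])]
prob = cp.Problem(cp.Minimize(cp.norm(D @ u, 2)), constraints)
prob.solve()
print(prob.status, prob.value)
```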
The sensitivity of the proposed algorithm is also evaluated for different numbers of sensors and different threshold values, the latter set for controlling the magnitude of the ripples of the main beam.

A uniform circular array consisting of 20 sensors is considered. All sensors are assumed perfectly calibrated. The number of phase modes M is set to 17, and thus there are 17 spatial weighting coefficients. The order of the compensation filter is set to 16 for all the phase modes. The frequency region of interest is specified to be from 0.3π to 0.95π. The threshold value, δ, which controls the … microphone is located.

There are several optimization criteria presented in [25]. The one chosen for comparison is the peak-sidelobe-constrained minimax mainlobe spatial response variation (MSRV) design. Its objective is to minimize the maximum MSRV subject to a peak sidelobe constraint. The mathematical expression is as follows:

\min_{h}\ \sigma, \quad \text{s.t. } u(f_0, \varphi_0)^T h = 1,\ \left| \left[ u(f_k, \theta_q) - u(f_0, \theta_q) \right]^T h \right| \le \sigma,\ \left| u(f_k, \theta_s)^T h \right| \le \varepsilon,\ f_k \in [f_l, f_u],\ \theta_q \in \Theta_{ML},\ \theta_s \in \Theta_{SL}, \qquad (22)

where f_0 is the reference frequency, chosen to have the value f_l, and h is the vector of beamforming weights to be optimized. ε is the peak sidelobe constraint, set to 0.036. Θ_ML and Θ_SL represent the mainlobe and sidelobe regions, respectively.

The beampattern obtained by the proposed beamformer for the frequency region of interest is shown in Figure 3. The spatial response of the proposed beamformer at 10 uniformly spaced discrete frequencies is superimposed. It can be seen that the proposed beamformer has an approximately constant gain within the frequency region of interest in the specified direction (0◦). As the direction deviates from 0◦, the FI property becomes poorer. The peak sidelobe level has a value of −8 dB.

Figure 4: The normalized spatial response of the UCCA beamformer for ω = [0.3π, 0.95π].

Figure 5: The normalized spatial response of Yan's beamformer for ω = [0.3π, 0.95π].

The beampattern of the UCCA beamformer is shown in Figure 4. As the proposed algorithm is based on a circular array, only one layer of the UCCA concentric array is used for the numerical study. All other parameter settings remain the same as those used for the proposed algorithm.
Figure 6: Comparison of the FI characteristic of the proposed beamformer, the UCCA beamformer, and Yan's beamformer at 0 degrees for ω = [0.3π, 0.95π].

Figure 7: Directivity (dB) versus normalised frequency (radians/sample).

Figure 8: White noise gain versus frequency for the broadband beam pattern shown in Figure 3.

As shown … different methods are shown in Figure 6. It is seen that the proposed beamformer outperforms both the UCCA beamformer and Yan's beamformer in achieving the FI characteristic at the desired direction. Table 2 tabulates the array gain at each frequency along the desired direction for these three methods.

Furthermore, the performance of the frequency invariant beam pattern obtained by the proposed method is assessed …

5.1. Sensitivity Study—Different Number of Sensors. The numbers of microphones are reduced from 20 to 10 and 8, and the performances of the proposed FI beamformer, the UCCA beamformer, and Yan's beamformer are compared. The results are shown in Figures 9, 10, 11, 12, 13, and 14. As seen from the simulations, when 10 microphones are employed, the proposed algorithm achieves the best FI performance in the mainlobe region, with a sidelobe level of −8 dB. For the UCCA method and Yan's method, the frequency invariant characteristics are not promising at the desired direction, and higher sidelobes are obtained. When the number of microphones is further reduced to 8, our proposed method is still able to produce a reasonable FI beampattern, whereas the FI property of the beampattern of the UCCA algorithm becomes much poorer in the specified direction.

Figure 9: The normalized spatial response of the proposed FI beamformer for 10 microphones.

Figure 10: The normalized spatial response of the UCCA beamformer for 10 microphones.

Figure 11: The normalized spatial response of Yan's beamformer for 10 microphones.

Figure 12: The normalized spatial response of the proposed FI beamformer for 8 microphones.

5.2. Sensitivity Study—Different Threshold Value δ. In the proposed algorithm, δ is a parameter created to define the allowed ripples in the magnitude of the main-beam spatial gain response. In this section, different values of δ are used to study the sensitivity of the performance of the proposed algorithm to this parameter. Three values, namely δ = [0.001, 0.01, 0.1], are selected, and the results obtained are shown in Figures 15, 16, and 17, respectively. The specified frequency region of interest remains the same. Figure 18 shows the mean squared error of the array gain at the specified direction (0◦) for the three different values of δ studied.

As shown in the figures, as the value of δ decreases, the FI performance at the specified direction improves. The results also show that the improvement in FI performance in the specified direction comes with an increase in the peak sidelobe level and a poorer FI beampattern in the other directions within the main beam. For example, when the value of δ is 0.001, the peak sidelobe of the spatial response is
as high as −5 dB, and the beampatterns do not overlap well in the main beam. As δ increases to 0.1, the peak sidelobe of the spatial response is approximately −10 dB (lower), and the beampatterns in the main beam are observed to have relatively good FI characteristics.

Figure 13: The normalized spatial response of the UCCA beamformer for 8 microphones.

Figure 14: The normalized spatial response of Yan's beamformer for 8 microphones.

Figure 15: The normalized spatial response of the proposed beamformer for δ = 0.001.

Figure 16: The normalized spatial response of the proposed beamformer for δ = 0.01.

Figure 17: The normalized spatial response of the proposed beamformer for δ = 0.1.

6. Conclusion

A selective frequency invariant uniform circular broadband beamformer is presented in this paper. Besides providing the details of a recent conference paper by the authors, a complexity analysis and two sensitivity studies on the proposed algorithm are also presented. The proposed algorithm is designed to minimize an objective function of the spatial response gain with a constraint on the gain being smaller than a predefined threshold value across a specified frequency range and in a specified direction. The problem is formulated as a convex
Figure 18: Mean square error of the array gain at the specified direction (0◦) for the three values of δ.
Research Article
First-Order Adaptive Azimuthal Null-Steering for
the Suppression of Two Directional Interferers
René M. M. Derkx
Digital Signal Processing Group, High Tech Campus 36, 5656 AE Eindhoven, The Netherlands
Copyright © 2010 René M. M. Derkx. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
An azimuth-steerable first-order superdirectional microphone response can be constructed by a linear combination of three eigenbeams: a monopole and two orthogonal dipoles. Although a (rotation symmetric) first-order response can only exhibit a single null, we will look at a slice through this beampattern lying in the azimuthal plane. In this way, we can define at most two nulls in the azimuthal plane, which are symmetric with respect to the main-lobe axis. By placing these two nulls on at most two directional sources to be rejected and compensating for the drop in level for the desired direction, we can effectively reject these directional sources without attenuating the desired source. We present an adaptive null-steering scheme for adjusting the beampattern so as to obtain this suppression of the two directional interferers automatically. Closed-form expressions for this optimal null-steering are derived, enabling the computation of the azimuthal angles of the interferers. It is shown that the proposed technique has a good directivity index when the angular difference between the desired source and each directional interferer is at least 90 degrees.
Figure 1: Circular array configuration of the three cardioid microphones M0, M1, and M2, with azimuth φ and elevation θ defined with respect to the (x, y) coordinate system.
by a linear combination of the uni-directional microphone signals. In such an approach, there is no need to apply a first-order integrator (as was the case for omni-directional microphone elements), and we avoid a 20 dB/decade increased sensitivity to sensor noise [7]. Nevertheless, uni-directional microphones may have a low-frequency roll-off, which can be compensated for by means of proper equalization techniques. Throughout this paper, we will assume that the uni-directional microphones have a flat frequency response.

We focus on the construction of first-order superdirectional beampatterns where the nulls of the beampattern are steered towards the directional interferers while a unity response is maintained in the direction of the desired sound source. In Section 2, we construct a monopole and two orthogonal dipole responses (known as “eigenbeams” [10, 11]) out of a circular array of three first-order cardioid microphone elements M0, M1, and M2 (with a heart-shaped directional pattern), as shown in Figure 1. Here θ and φ are the standard spherical coordinate angles: elevation and azimuth.

Based on these eigenbeams, we are able to construct arbitrary first-order responses that can be steered with the main-lobe in any azimuthal direction (see Section 2). Although the (rotation symmetric) first-order response can only exhibit a single null, we will look at a slice through the beampattern lying in the azimuthal plane. In this way, we can define at most two nulls in the azimuthal plane, which are symmetric with respect to the main-lobe axis. By placing these two nulls on the two directional sources to be rejected and compensating for the drop in level for the desired direction, we can effectively reject the directional sources without attenuating the desired source. In Section 3, expressions are derived for this beampattern synthesis.

To develop an adaptive null-steering algorithm, we first show in Section 4 how the superdirective beampattern can be synthesized via the Generalized Sidelobe Canceller (GSC) [12]. This GSC enables us to optimize a cost-function in an unconstrained manner with the gradient descent search method described in Section 5. Furthermore, the GSC enables tracking of the angles of the separate directional interferers, which is validated by means of simulations and experiments in Section 6. Finally, in Section 7, conclusions are given.

2. Construction of Eigenbeams

We know from [7, 9] that by using a circular array of at least three (omni- or uni-directional) microphone sensors in a planar geometry and applying signal processing techniques, it is possible to construct a first-order superdirectional response. This superdirectional response can be steered with its main-lobe to any desired azimuthal angle and can be adjusted to have any first-order directivity pattern. As mentioned in the introduction, we will use three uni-directional cardioid microphones (with a heart-shaped directional pattern) in a circular configuration, where the main-lobes of the three cardioid responses point outwards, as shown in Figure 1.

The responses of the three cardioid microphones M0, M1, and M2 are given by E_c^0(r, θ, φ), E_c^1(r, θ, φ), and E_c^2(r, θ, φ), respectively, having their main-lobes at φ = 0, 2π/3, and 4π/3 radians, respectively. Assuming that there is no sensor noise, the nth cardioid microphone response, with n = 0, 1, 2, for a harmonic plane wave with frequency f is ideally given by [11]

E_c^n(r, \theta, \varphi) = A_n e^{j\psi_n}. \qquad (1)

The magnitude response A_n and phase response ψ_n of the nth cardioid microphone are given by, respectively,

A_n = \frac{1}{2} + \frac{1}{2}\cos\left(\varphi - \frac{2n\pi}{3}\right)\sin\theta, \qquad (2)

\psi_n = \frac{2\pi f}{c}\sin\theta \left( x_n \cos\varphi + y_n \sin\varphi \right). \qquad (3)

Here c is the speed of sound, and x_n and y_n are the x and y coordinates of the nth microphone (as shown in Figure 1), given by

x_n = r\cos\left(\frac{2n\pi}{3}\right), \qquad y_n = r\sin\left(\frac{2n\pi}{3}\right), \qquad (4)

with r being the radius of the circle on which the microphones are located.
Figure 2: Directivity patterns of the eigenbeams: (a) E_m(θ, φ), (b) E_d^0(θ, φ), and (c) E_d^{π/2}(θ, φ).
We can simplify (3) as

\psi_n = \frac{2\pi f}{c}\, r \sin\theta \cos\left(\varphi - \frac{2n\pi}{3}\right). \qquad (5)

From the three cardioid microphone responses, we can construct the circular harmonics [7], also known as “eigenbeams” [10, 11], by using the 3-point Discrete Fourier Transform (DFT) with the three microphones as inputs. This DFT produces three phase modes P_i(r, θ, φ) [7] with i = 0, 1, 2:

P_0(r, \theta, \varphi) = \frac{1}{3}\sum_{n=0}^{2} E_c^n(r, \theta, \varphi),
P_1(r, \theta, \varphi) = P_2^{*}(r, \theta, \varphi) = \frac{1}{3}\sum_{n=0}^{2} E_c^n(r, \theta, \varphi)\, e^{-j2\pi n/3}, \qquad (6)

with j = \sqrt{-1} and * being the complex-conjugate operator. Via the phase modes, we can construct the monopole as

E_m(r, \theta, \varphi) = 2 P_0(r, \theta, \varphi), \qquad (7)

and the orthogonal dipoles as

E_d^0(r, \theta, \varphi) = 2\left[ P_1(r, \theta, \varphi) + P_2(r, \theta, \varphi) \right],
E_d^{\pi/2}(r, \theta, \varphi) = 2j\left[ P_1(r, \theta, \varphi) - P_2(r, \theta, \varphi) \right]. \qquad (8)

In matrix notation,

\begin{bmatrix} E_m \\ E_d^0 \\ E_d^{\pi/2} \end{bmatrix} = \frac{2}{3} \begin{bmatrix} 1 & 1 & 1 \\ 2 & -1 & -1 \\ 0 & \sqrt{3} & -\sqrt{3} \end{bmatrix} \begin{bmatrix} E_c^0 \\ E_c^1 \\ E_c^2 \end{bmatrix}. \qquad (9)

For frequencies with wavelengths larger than the size of the array (for wavelengths smaller than the size of the array, spatial aliasing effects will occur), that is, r ≪ c/f, the phase component ψ_n given by (5) can be neglected, and the responses of the eigenbeams at these frequencies are equal to

E_m = 1,
E_d^0(\theta, \varphi) = \cos\varphi \sin\theta, \qquad (10)
E_d^{\pi/2}(\theta, \varphi) = \cos\left(\varphi - \frac{\pi}{2}\right)\sin\theta.

The directivity patterns of these eigenbeams are shown in Figure 2. The zeroth-order eigenbeam E_m represents the monopole response, while the first-order eigenbeams E_d^0(θ, φ) and E_d^{π/2}(θ, φ) represent the orthogonal dipole responses.

The dipole can be steered to any angle ϕ_s by means of a weighted combination of the orthogonal dipole pair:

E_d^{\varphi_s}(\theta, \varphi) = \cos\varphi_s\, E_d^{0}(\theta, \varphi) + \sin\varphi_s\, E_d^{\pi/2}(\theta, \varphi), \qquad (11)

with 0 ≤ ϕ_s ≤ 2π being the steering angle.

Finally, the steered and scaled superdirectional microphone response can be constructed via

E(\theta, \varphi) = S\left[ \alpha E_m + (1-\alpha) E_d^{\varphi_s}(\theta, \varphi) \right] = S\left[ \alpha + (1-\alpha)\cos(\varphi - \varphi_s)\sin\theta \right], \qquad (12)

with α ≤ 1 being the parameter controlling the directional pattern of the first-order response and S an arbitrary scaling factor. Both α and S may also take negative values.
Alternatively, we can write the construction of the response in matrix-vector notation:

E(\theta, \varphi) = S\, F_{\alpha}^{T} R_{\varphi_s} X, \qquad (13)

with the pattern-synthesis vector

F_{\alpha} = \begin{bmatrix} \alpha \\ (1-\alpha) \\ 0 \end{bmatrix}, \qquad (14)

the rotation matrix

R_{\varphi_s} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\varphi_s & \sin\varphi_s \\ 0 & -\sin\varphi_s & \cos\varphi_s \end{bmatrix}, \qquad (15)

and the input vector

X = \begin{bmatrix} E_m \\ E_d^0(\theta, \varphi) \\ E_d^{\pi/2}(\theta, \varphi) \end{bmatrix} = \begin{bmatrix} 1 \\ \cos\varphi \sin\theta \\ \sin\varphi \sin\theta \end{bmatrix}. \qquad (16)

In the remainder of this paper, we will assume a unity response of the superdirectional microphone for a desired source coming from an arbitrary azimuthal angle φ = ϕ_s at elevation angle θ = π/2, and we want to suppress two interferers by steering two nulls towards the azimuthal angles φ = ϕ_{n1} and φ = ϕ_{n2}, also at elevation angle θ = π/2. Hence, we assume θ = π/2 in the remainder of this paper.

3. Optimal Null-Steering for Two Directional Interferers via Direct Pattern Synthesis

3.1. Pattern Synthesis. The first-order response of (12), with the main-lobe of the response steered to ϕ_s, has two nulls for α ≤ 1/2, given by (see [13])

\varphi_{n1}, \varphi_{n2} = \varphi_s \pm \arccos\left( \frac{-\alpha}{1-\alpha} \right). \qquad (17)

If we want to steer two nulls to arbitrary angles ϕ_{n1} and ϕ_{n2} that do not lie symmetrically with respect to ϕ_s, it can be seen that we cannot steer the main-lobe of the first-order response to ϕ_s. Therefore, we steer the main-lobe to \breve{\varphi}_s and use a scale factor \breve{S} under the constraint that a unity response is obtained at angle ϕ_s. In matrix notation,

E(\theta, \varphi) = \breve{S}\, F_{\breve{\alpha}}^{T} R_{\breve{\varphi}_s} X, \qquad (18)

with the rotation matrix and the pattern-synthesis vector as in (15) and (14), respectively, with \breve{\alpha}, \breve{\varphi}_s instead of α, ϕ_s.

From (12), we see that a unity desired response at angle ϕ_s is obtained when we choose the scale factor \breve{S} as

\breve{S} = \frac{1}{\breve{\alpha} + (1-\breve{\alpha})\cos(\varphi_s - \breve{\varphi}_s)}, \qquad (19)

with \breve{\alpha} being the parameter controlling the directional pattern of the first-order response (similar to the parameter α), ϕ_s the angle of the desired sound, and \breve{\varphi}_s the steering angle (which, in general, is different from ϕ_s).

Next, we want to place the nulls at ϕ_{n1} and ϕ_{n2}. Hence, we solve the following system of two equations:

\breve{S}\left[ \breve{\alpha} + (1-\breve{\alpha})\cos(\varphi_{n1} - \breve{\varphi}_s) \right] = 0,
\breve{S}\left[ \breve{\alpha} + (1-\breve{\alpha})\cos(\varphi_{n2} - \breve{\varphi}_s) \right] = 0. \qquad (20)

Solving for the two unknowns \breve{\alpha} and \breve{\varphi}_s gives

\breve{\varphi}_s = 2\arctan X, \qquad (21)

\breve{\alpha} = \frac{X \sin\Delta\varphi_n}{\cos\varphi_{n1} - \cos\varphi_{n2} + X\left( \sin\varphi_{n1} - \sin\varphi_{n2} + \sin\Delta\varphi_n \right)}, \qquad (22)

with

X = \frac{\sin\varphi_{n1} - \sin\varphi_{n2} \pm \sqrt{2 - 2\cos\Delta\varphi_n}}{\cos\varphi_{n1} - \cos\varphi_{n2}}, \qquad (23)

\Delta\varphi_n = \varphi_{n1} - \varphi_{n2}. \qquad (24)

It is noted that (23) can have two solutions, leading to different solutions for \breve{\varphi}_s, \breve{\alpha}, and \breve{S}. However, the resulting beampatterns are identical.

As can be seen, we get a vanishing denominator in (22) for ϕ_{n1} = ϕ_s and/or ϕ_{n2} = ϕ_s. Similarly, this is the case when Δϕ_n = ϕ_{n1} − ϕ_{n2} goes to zero. For this latter case, we can compute the limits of \breve{\varphi}_s and \breve{\alpha}:

\lim_{\Delta\varphi_n \to 0} \breve{\varphi}_s = 2\arctan\left( \frac{\sin\varphi_{ni}}{\cos\varphi_{ni} - 1} \right) = \varphi_{ni} + \pi, \qquad (25)

with i = 1, 2, and

\lim_{\Delta\varphi_n \to 0} \breve{\alpha} = \frac{1}{2}. \qquad (26)

For the case Δϕ_n = 0, we actually steer a single null towards the two directional interferers ϕ_{n1} and ϕ_{n2}. Equations (25) and (26) describe the limit-case solution, for which there are an infinite number of solutions that satisfy the system of equations given by (21).

3.2. Analysis of Directivity Index. Although the optimization in this paper is focused on the suppression of two directional interferers, it is also important to analyze the noise-reduction performance under isotropic noise circumstances. We will only analyze the spherically isotropic noise case, for which we compute the spherical directivity factor Q_S given by [4, 5]

Q_S = \frac{4\pi E^2(\pi/2, \varphi_s)}{\int_{\varphi=0}^{2\pi}\int_{\theta=0}^{\pi} E^2(\theta, \varphi)\sin\theta\, d\theta\, d\varphi}. \qquad (27)

If we combine (27) with (18), we get

Q_S(\phi_1, \phi_2) = \frac{6\left(1 - \cos\phi_1\right)\left(1 - \cos\phi_2\right)}{5 + 3\cos(\phi_1 - \phi_2)}, \qquad (28)

with

\phi_1 = \varphi_{n1} - \varphi_s, \qquad (29)
\phi_2 = \varphi_{n2} - \varphi_s. \qquad (30)

In Figure 3, the contour plot of the directivity factor Q_S is shown with φ_1 and φ_2 on the x- and y-axes, respectively.
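A direct transcription of the closed-form null placement (19)–(24) and the directivity factor (28) as a small Python sketch; the function names are mine, and the printed checks (unity response at the desired angle, nulls at the interferers, Q_S = 4 for the hypercardioid case) follow from the formulas above. The two roots of (23) give identical beampatterns.

```python
import numpy as np

def null_steering(phi_n1, phi_n2, phi_s, sign=1.0):
    """Closed-form steering angle, pattern parameter, and scale, eqs. (19)-(24)."""
    d = phi_n1 - phi_n2                                             # eq. (24)
    X = (np.sin(phi_n1) - np.sin(phi_n2)
         + sign * np.sqrt(2.0 - 2.0 * np.cos(d))) \
        / (np.cos(phi_n1) - np.cos(phi_n2))                         # eq. (23)
    phi_steer = 2.0 * np.arctan(X)                                  # eq. (21)
    alpha = X * np.sin(d) / (np.cos(phi_n1) - np.cos(phi_n2)
            + X * (np.sin(phi_n1) - np.sin(phi_n2) + np.sin(d)))    # eq. (22)
    S = 1.0 / (alpha + (1.0 - alpha) * np.cos(phi_s - phi_steer))   # eq. (19)
    return phi_steer, alpha, S

def pattern(phi, phi_steer, alpha, S):
    """First-order azimuthal response of (18) at theta = pi/2."""
    return S * (alpha + (1.0 - alpha) * np.cos(phi - phi_steer))

def directivity_factor(p1, p2):
    return 6 * (1 - np.cos(p1)) * (1 - np.cos(p2)) / (5 + 3 * np.cos(p1 - p2))  # (28)

# Desired source at 0 rad, nulls at 90 and 180 degrees:
ps, a, S = null_steering(np.pi / 2, np.pi, 0.0, sign=-1.0)
print([round(pattern(x, ps, a, S), 6) for x in (0.0, np.pi / 2, np.pi)])  # ~[1, 0, 0]
print(round(directivity_factor(np.arccos(-1 / 3), 2 * np.pi - np.arccos(-1 / 3)), 6))  # 4.0
```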
Figure 3: Contour-plot of the directivity factor Q_S(φ_1, φ_2).

As can be seen from (28), the directivity factor goes to zero if one of the angles ϕ_{n1} or ϕ_{n2} gets close to ϕ_s. Clearly, a directivity factor smaller than unity is not very useful in practice. Hence, the pattern synthesis technique is only useful when the angles ϕ_{n1} and ϕ_{n2} are located in one half-plane and the desired source is located around the center of the opposite half-plane.

It can be found in the appendix that for

\phi_1 = \arccos\left(-\frac{1}{3}\right), \qquad \phi_2 = 2\pi - \arccos\left(-\frac{1}{3}\right), \qquad (31)

a maximum directivity factor Q_S = 4 is obtained. This corresponds to a 6 dB directivity index, defined as 10 log_{10} Q_S, where the directivity pattern resembles a hypercardioid. Furthermore, for (φ_1, φ_2) = (π, π) rad, a directivity factor Q_S = 3 is obtained, corresponding to a 4.8 dB directivity index, where the directivity pattern yields a cardioid. As can be seen from Figure 3, we can define a usable region where the directivity factor is Q_S > 3/4, for π/2 ≤ φ_1, φ_2 ≤ 3π/2.

4. Optimal Null-Steering for Two Directional Interferers via GSC

4.1. Generalized Sidelobe Canceller (GSC) Structure. To develop an adaptive algorithm for steering two nulls towards the two directional interferers based on the pattern-synthesis technique of Section 3, a constrained optimization technique would be required, since we want to maintain a unity response towards the angle ϕ_s. For adaptive algorithms, it is generally easier to adapt in an unconstrained manner. Therefore, we first present an alternative scheme for the null-steering, similar to the direct pattern-synthesis technique discussed in Section 3, but based on the well-known Generalized Sidelobe Canceller (GSC) [12]. In the GSC scheme, a prefiltering with fixed values of ϕ_s and α is first performed to construct a primary signal with a unity response towards angle ϕ_s, together with two noise references. As the two noise references do not include the source coming from angle ϕ_s, the two noise-canceller weights w_1 and w_2 can be optimized in an unconstrained manner. The GSC scheme is shown in Figure 4.

Figure 4: GSC scheme: the eigenbeams E_m, E_d^0, and E_d^{π/2} are combined via R_{ϕ_s} and F_α into the primary signal E_p and, via the blocking matrix B, into the noise references E_{r1} and E_{r2}, which are weighted by w_1 and w_2 and subtracted to form the output E.

We start by constructing the primary response as

E_p(\theta, \varphi) = F_{\alpha}^{T} R_{\varphi_s} X, \qquad (32)

with F_α, R_{ϕ_s}, and X as defined in the introduction and using a scale factor S = 1.

Furthermore, we can create two noise references via

\begin{bmatrix} E_{r1}(\theta, \varphi) \\ E_{r2}(\theta, \varphi) \end{bmatrix} = B^{T} R_{\varphi_s} X, \qquad (33)

with the blocking matrix B [14] given by

B = \begin{bmatrix} \tfrac{1}{2} & 0 \\ -\tfrac{1}{2} & 0 \\ 0 & 1 \end{bmatrix}. \qquad (34)

It is noted that the noise references E_{r1} and E_{r2} are, respectively, a cardioid and a dipole response, each with a null steered towards the angle of the desired source at azimuth φ = ϕ_s and elevation θ = π/2.

The primary and noise responses can be used in the generalized sidelobe canceller structure to obtain an output

E(\theta, \varphi) = E_p(\theta, \varphi) - w_1 E_{r1}(\theta, \varphi) - w_2 E_{r2}(\theta, \varphi). \qquad (35)

It is important to note that for any value of ϕ_s, α, w_1, and w_2, a unity response at the output of the GSC is maintained for angle φ = ϕ_s and θ = π/2.

In the next sections we give some details on computing w_1 and w_2 for the suppression of two directional interferers, as discussed in the previous section.
4.2. Optimal GSC Null-Steering for Two Directional Interferers. Using the GSC structure of Figure 4, which has a unity response at angle φ = ϕ_s, we can compute the weights w_1 and w_2 that steer two nulls towards the azimuthal angles ϕ_{n1} and ϕ_{n2}, by solving

E_p\left(\frac{\pi}{2}, \varphi_i\right) - w_1 E_{r1}\left(\frac{\pi}{2}, \varphi_i\right) - w_2 E_{r2}\left(\frac{\pi}{2}, \varphi_i\right) = 0 \qquad (36)

for i = 1, 2. This results in the following relations:

w_1 = 2\alpha + \frac{2\sin(\phi_1 - \phi_2)}{\sin\phi_1 - \sin(\phi_1 - \phi_2) - \sin\phi_2}, \qquad (37)

w_2 = \frac{\cos\phi_1 - \cos\phi_2}{\sin\phi_1 - \sin(\phi_1 - \phi_2) - \sin\phi_2}, \qquad (38)

where φ_1 and φ_2 are defined as given by (29) and (30), respectively.
… the noise reference signals, and p[k] the primary signal. The inclusion of the term 2α in (47) is a consequence of the fact that w_1[k] is an estimate of w_1 (see (39), in which 2α is not included).

In the ideal case that we want to obtain a unity response for a source signal s[k] originating from angle ϕ_s and have an undesired source signal n_1[k] originating from angle ϕ_{n1} together with an undesired source signal n_2[k] originating from angle ϕ_{n2}, we have

p[k] = s[k] + \sum_{i=1,2}\left[ \alpha + (1-\alpha)\cos\phi_i \right] n_i[k],
r_1[k] = \sum_{i=1,2}\left[ \frac{1}{2} - \frac{1}{2}\cos\phi_i \right] n_i[k], \qquad (48)
r_2[k] = \sum_{i=1,2}\sin\phi_i\, n_i[k].

The cost function J(w_1, w_2) is defined as a function of w_1 and w_2 and is given by

J(w_1, w_2) = \mathcal{E}\left\{ y^2[k] \right\}, \qquad (49)

with \mathcal{E}\{\cdot\} being the expectation operator. Using that \mathcal{E}\{n_1[k]n_2[k]\} = 0 and \mathcal{E}\{n_i[k]s[k]\} = 0 for i = 1, 2, we can write

J(w_1, w_2) = \mathcal{E}\left\{ \left( p[k] - (w_1[k] + 2\alpha)\, r_1[k] - w_2[k]\, r_2[k] \right)^2 \right\}
= \sigma_s^2[k] + \sum_{i=1,2} \left[ w_1[k] - (2 + w_1[k])\cos\phi_i + 2 w_2[k]\sin\phi_i \right]^2 \frac{\sigma_{n_i}^2[k]}{4}, \qquad (50)

with

\sigma_s^2[k] = \mathcal{E}\left\{ s^2[k] \right\}, \qquad \sigma_{n_i}^2[k] = \mathcal{E}\left\{ n_i^2[k] \right\}. \qquad (51)

We can see that the cost function is a quadratic function [15] that can be written in matrix notation (for convenience, we leave out the index k):

J(w_1, w_2) = \sigma_s^2 + \left\| A_p w - v_p \right\|^2 = \sigma_s^2 + w^T A_p^T A_p w - 2 w^T A_p^T v_p + v_p^T v_p, \qquad (52)

with

A_p = \begin{bmatrix} \frac{\sigma_{n_1}}{2}\left(1 - \cos\phi_1\right) & \sigma_{n_1}\sin\phi_1 \\ \frac{\sigma_{n_2}}{2}\left(1 - \cos\phi_2\right) & \sigma_{n_2}\sin\phi_2 \end{bmatrix}, \qquad w = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}, \qquad v_p = \begin{bmatrix} \sigma_{n_1}\cos\phi_1 \\ \sigma_{n_2}\cos\phi_2 \end{bmatrix}. \qquad (53)

The singularity of A_p^T A_p can be analyzed by computing the determinant of A_p and setting this determinant to zero:

\frac{\sigma_{n_1}\sigma_{n_2}}{2}\left[ \sin\phi_2\left(1 - \cos\phi_1\right) - \sin\phi_1\left(1 - \cos\phi_2\right) \right] = 0. \qquad (54)

Equation (54) is satisfied when σ_{n_1} and/or σ_{n_2} are equal to zero, when φ_1 and/or φ_2 are equal to zero, or when

\frac{\sin\phi_1}{1 - \cos\phi_1} = \frac{\sin\phi_2}{1 - \cos\phi_2} \equiv \cot\frac{\phi_1}{2} = \cot\frac{\phi_2}{2}. \qquad (55)

Equation (55) is satisfied only when φ_1 = φ_2. This agrees with the result obtained in Section 3.1, where Δϕ = 0.

In all other cases (so when φ_1 ≠ φ_2, σ_{n_1} > 0, and σ_{n_2} > 0), the matrix A_p is nonsingular and the matrix A_p^T A_p is positive definite. Hence, the cost function is a convex function with a global minimum that can be found by solving the least-squares problem:

w_{\text{opt}} = \left( A_p^T A_p \right)^{-1} A_p^T v_p = A_p^{-1} v_p = \frac{1}{A} \begin{bmatrix} 2\sin(\phi_1 - \phi_2) \\ \cos\phi_1 - \cos\phi_2 \end{bmatrix}, \qquad (56)

with

A = \sin\phi_1 - \sin(\phi_1 - \phi_2) - \sin\phi_2, \qquad (57)

similar to the solutions given in (37) and (38).

As an example, we show the contour plot of the cost function 10 log_{10} J(w_1, w_2) in Figure 6, for the case where ϕ_s = π/2, ϕ_{n1} = 0, ϕ_{n2} = π rad, σ_{n_i}^2 = 1 for i = 1, 2, and σ_s^2 = 0. As can be seen, the global minimum is obtained for w_1 = 0 and w_2 = 0, resulting in a dipole beampattern. When we make σ_{n_1}^2 ≠ σ_{n_2}^2, the shape of the cost function becomes more and more stretched, but the global optimum is obtained for the same values of w_1 and w_2. In the extreme case when σ_{n_2}^2 = 0 and σ_{n_1}^2 > 0, we obtain the cost function shown in Figure 7. (It is interesting to note that this cost function is exactly the same as for the case where ϕ_s = π/2, ϕ_{n1} = ϕ_{n2} = 0 radians, with σ_{n_i}^2 = 1 for i = 1, 2 and σ_s^2 = 0.)
8 EURASIP Journal on Advances in Signal Processing
2 2 10
5 0
1.5 1.5
10 −5
5 5
−10
5
1 −1
1 5 0
5 5
5 −1 10
−5 −
0
0
−10
0.5 0.5 −1
5 −5
−5 5 0
−10 5 0
−1 10
−15 −5 −
w2
w2
0 0 −10
−5
−10 −5 5
0 −1
−5
0
5 0 5
−1 10
−5
−0.5
0
−0.5 −
−10
0 15 −5
5 5 −
0
−1 −1 5
−1 10
−
5
5 5
−5 10
−1.5 −1.5 0 5
−2 −2 10
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2
w1 w1
Figure 6: Contour-plot of the cost-function 10 log10 J(w1 , w2 ) for Figure 7: Contour-plot of the cost-function 10 log10 J(w1 , w2 ) for
the case where ϕs = π/2, ϕn1 = 0, and ϕn2 = π radians. the case where ϕs = π/2 and ϕn1 = ϕn2 = 0 radians.
seen that there is no strict global minimum. For example, Assuming that there are no directional interferers,
also w1 = −2 and w2 = 1 is an optimal solution (yielding a we obtain the following primary signal p[k] and noise-
cardioid beampattern). references r1 [k] and r2 [k] in the generalized sidelobe can-
For the situation where there is only a single interferer celler scheme:
or the situation where there are two interferers coming from
(nearly) the same angle, the resulting beampattern will have $
p[k] = s[k] + αd1 [k] + (1 − α)d2 [k] γ,
a null to this angle, while the other (second) null will be
placed randomly (i.e., the second null is not uniquely defined 1 1 $
r1 [k] = d1 [k] − d2 [k] γ, (59)
and the adaptation of this second null is poor). However in 2 2
situations where we have additive diffuse-noise present, we $
r2 [k] = d3 [k] γ.
obtain an extra degree of freedom, for example, optimization
of the directivity index. This is however outside the scope of
this paper. As di [k] with i = 1, 2, 3 and s[k] are mutually uncorre-
lated, we can write the cost-function as
5.2. Cost-Function for Isotropic Noise. It is also useful to 2 2
1 1
analyze the cost-function in the presence of isotropic (i.e., J(w1 , w2 ) = σs2 [k] + σd2 w1 + γ 1 + w1 + γw22 .
diffuse) noise. We know from [16] that spherical and 2 2
cylindrical isotropic noise can be modelled by adding (60)
uncorrelated additive white-noise signals d1 , d2 , and d3 to the
three eigenbeams Em , Ed0 , and Edπ/2 with variances σd2 , σd2 γ, and Just as for the cost-function with two directional interfer-
σd2 γ, respectively, or alternatively with a covariance matrix Kd ers, we can write the cost-function for isotropic noise also as
given by a quadratic function in matrix notation:
⎡ ⎤
1 0 0 ! !2 γ
⎢ ⎥
⎢ ⎥ Jd (w1 , w2 ) = σs2 + !Ad w − vd ! + , (61)
Kd = σd2 ⎢0 γ 0⎥. (58) 1+γ
⎣ ⎦
0 0 γ
with
(for diffuse noise situations, the individual elements are ⎡σ $ ⎤
d
1+γ 0 ⎦
correlated. However, due the construction of eigenbeams, Ad = ⎣ 2 √ ,
the diffuse noise will be decorrelated. Hence, it is allowed 0 σd γ
to add uncorrelated additive white-noise signals to these ⎡ −σ γ ⎤ (62)
eigenbeams to simulate diffuse-noise situations,) We choose d
⎢ $1 + γ ⎥
γ = 1/3 for spherically isotropic noise and γ = 1/2 for vd = ⎣ ⎦.
cylindrically isotropic noise. 0
EURASIP Journal on Advances in Signal Processing 9
It can be easily seen that Ad is positive definite and hence and where μ is the update step-size. As in practice, the
we have a convex cost-function with a global minimum. ensemble average E { y 2 [k]} is not available, we have to use an
Via (56) we can easily compute this minimum of the cost- instantaneous estimate of the gradient ∇ wi J(w
1 , w
2 ), which is
function, which is obtained by solving the least-squares computed as
problem:
" #−1 wi J(w
d y 2 [k]
∇ 1 , w
2 ) =
w opt = ATd Ad ATd vd dwi
= A−p 1 v p = −2 p[k] − (w
1 + 2α)r1 [k] − w
2 r2 [k] ri [k]
⎡ ⎤ (63)
2γ = −2y[k]ri [k].
⎢− ⎥ (69)
= ⎣ 1 + γ ⎦.
0
Hence, we can write the update equation as
5.3. Cost-Function for Directional Interferers and Isotropic wi [k + 1] = wi [k] + 2μy[k]ri [k]. (70)
Noise. In case we have directional interferers as well as
isotropic noise and assume that all these noise-components Just as proposed in [5], we can apply a power-
are mutually uncorrelated, we can construct the cost- normalization such that the convergence speed is indepen-
function based on addition of the two cost-functions: dent of the power:
J p,d (w1 , w2 ) = J p (w1 , w2 ) + Jd (w1 , w2 ) 2μy[k]ri [k]
wi [k + 1] = wi [k] + , (71)
!
!
!2 !
! !2 σ 2γ Pri [k] +
− v p ! + ! Ad w
= σs2 + !A p w − vd ! + d
1+γ with being a small value to prevent zero division and where
!
!
!2
! σd2 γ the power-estimate Pri [k] of the i th reference signal ri [k] can
= σs2 + !A p,d w
− v p,d ! + , be computed by a recursive averaging:
1+γ
(64)
Pri [k + 1] = βPri [k] + 1 − β ri2 [k], (72)
with:
⎡ ⎤ with β being a smoothing parameter (lower, but close to 1).
Ap The gradient search only needs to be performed in case
A p,d = ⎣ ⎦,
one or both of the directional interferers are present. In
Ad
case the desired speech is present during the adaptation,
⎡ ⎤ (65)
vp the gradient search will not behave robustly in practice.
v p,d = ⎣ ⎦. This nonrobust behaviour is caused by leakage of speech
vd in the noise references r1 and r2 due to either variations
of the desired speaker location, microphone mismatches
Since J p (w1 , w2 ) and Jd (w1 , w2 ) were found to be convex, or reverberation (multipath) effects. To avoid adaptation
the sum J p,d (w1 , w2 ) is also convex. The optimal weights w opt during desired speech, we will apply a step-size control factor
can be obtained by computing in the adaptation-rule, given by
" #−1
w opt = ATp,d A p,d ATp,d v p,d , (66) Pr1 [k] + Pr2 [k]
Ψ[k] = , (73)
Pr1 [k] + Pr2 [k] + Pp [k] +
which can be solved numerically via standard SVD tech-
niques [15].
where Pr1 [k] + Pr2 [k] is an estimate of the noise power and
Pp [k] is an estimate of the primary signal p[k] that contains
5.4. Gradient Search Algorithm. As we know that the cost-
mainly desired speech. The power estimate Pp [k] is, just
function is a convex function with a global minimum, we
can find this optimal solution by means of a steepest descent as for the reference-signal powers Pr1 and Pr2 , obtained via
update equation for wi with i = 1, 2 by stepping in the recursive averaging:
direction opposite to the surface J(w1 , w2 ) with respect to wi ,
similar to [5] Pp [k + 1] = βPp [k] + 1 − β p2 [k]. (74)
wi [k + 1] = wi [k] − μ∇wi J(w1 , w2 ), (67) We can see that the value of Ψ[k] will be small when the
desired speech is dominating, while Ψ[k] will be much larger
with a gradient given by (but lower than 1) when either the directional interferers or
spherically isotropic noise is dominating. As it is beneficial
∂J(w1 [k], w2 [k]) ∂E y 2 [k] to have a low amount of noise components in the power
∇wi J(w
1 , w
2 ) = = , (68)
∂wi [k] ∂wi [k] estimate Pp [k], we found that α = 0.25 is a good choice.
10 EURASIP Journal on Advances in Signal Processing
Initialize w1 [0] = 0, w2 [0] = 0, Pr1 [0] = r12 [0], Pr2 [0] = r22 [0] and Pp [0] = p2 [0]
for k = 0, ∞: do
Pr1 [k] + Pr2 [k]
Ψ[k] =
Pr1 [k] + Pr2 [k] + Pp [k] +
1 [k]w
−2(w 2 [k] + X1 )
N=
X2
The algorithm now looks as shown in Algorithm 1. Table 1: Computed values of ϕs , α, and S for placing two nulls at
As can be seen in the algorithm, the two weights w1 [k] ϕn1 and ϕn2 and having a unity response at ϕs .
and w2 [k] are adapted based on a gradient-search method.
Based on these two weights, a computation with arctan- ϕn1 ϕn2 ϕs ϕs
α S QS
function is performed to obtain the angles of the directional (deg) (deg) (deg) (deg)
interferers ϑni with i = 1, 2. 45 180 90 292.5 0.277 1.141 0.61
0 180 90 90 0 1.0 3.0
0 225 90 112.5 0.277 1.058 3.56
6. Validation
0 0 90 0 0.5 2 0.75
6.1. Directivity Pattern for Directional Interferers. First, we
show the beampatterns for a number of situations where two
nulls are placed. In Table 1, we show the computed values for
the direct pattern synthesis for 4 different situations, where Table 2: Computed values of w1 and w2 for placing two nulls at ϕn1
and ϕn2 and having a unity response at ϕs .
nulls are placed at different angles. Furthermore, we assume
that there is no isotropic noise present. ϕn1 ϕn2 ϕs
As was explained in Section 3.1, we can obtain two w1 w2 QS
(deg) (deg) (deg)
In Table 1, we show
different sets of solutions for ϕs , α, and S. √ 1√
the set of solutions where α is positive. 45 180 90 2 − 2 0.61
2
Similarly, in Table 2, we show the computed values for w1
and w2 in the GSC structure as explained in Section 4 for the 0 180 90 0 0 3.0
same situations as for the direct pattern synthesis.
−2 −1
The polar-plots resulting from the computed values in 0 225 90 √ √ 3.56
2+ 2 2+ 2
Tables 1 and 2 are shown in Figure 8. It is noted that the two
examples of Section 5.1 where we analyzed the cost-function 0 0 90 −2 −1 0.75
are depicted in Figures 8(b) and 8(d).
EURASIP Journal on Advances in Signal Processing 11
90 3 90 1
120 60 120 60
0.8
2 0.6
150 30 150 30
1 0.4
0.2
180 0 180 0
180 0 180 0
Figure 8: Azimuthal polar-plots for the placement of two nulls with nulls placed at (a) 45 and 180 degrees, (b) 0 and 180 degrees, (c) 0 and
225 degrees and (d) 0, and 0 degrees (two identical nulls).
1 1
0.5 0.5
0 0
−0.5 −0.5
−1 −1
ϑn1 and ϑn2
−2 −2
−2.5 −2.5
−3 −3
−3.5 −3.5
−4 −4
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
k ×103 k ×103
ϑn1 ϑn1
ϑn2 ϑn2
ϕni with i = 1, 2 ϕni with i = 1, 2
(a) (b)
Figure 10: Simulation of the null-steering algorithm with two directional interferers where σn21 = σn22 = 1 and with a desired source where
σs2 = 1/16 with ϕs = 90 degrees (a) and ϕs = 60 degrees (b).
1 1
0.5 0.5
0 0
−0.5 −0.5
−1 −1
ϑn1 and ϑn2
−1.5 −1.5
−2 −2
−2.5 −2.5
−3 −3
−3.5 −3.5
−4 −4
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
k ×103 k ×103
ϑn1 ϑn1
ϑn2 ϑn2
ϕni with i = 1, 2 ϕni with i = 1, 2
(a) (b)
Figure 11: Simulation of the null-steering algorithm with two directional interferers where σn21 = σn22 = 1 and with (spherically isotropic)
spherical isotropic noise (γ = 1/3), where σd2 = 1/16 (a) and σd2 = 1/4 (b).
from 135 to 45 degrees) and we linearly decrease the angle σn2i = 1. The results are shown in Figure 9. It can be seen
of a second undesired directional interferer (ranging from 30 that ϑn1 and ϑn2 do not cross (in contrast to the angles of the
degrees to −90 degrees) in a time-span of 10000 samples. For directional interferers ϕn1 and ϕn2 ). The first null placed at ϑn1
the simulation, we used α = 0.25, μ = 0.02, and β = 0.95.
adapts very well, while the second null, placed at ϑn2 , is poorly
First, we simulate the situation, where only two direc- adapted. The reason for this was explained in Section 5.1.
tional interferers are present. The two directional interferers Similarly, we simulate the situation with the same two
are uncorrelated white random-noise signals with variance directional interferers but now together with a desired
EURASIP Journal on Advances in Signal Processing 13
0.1
0.05
0
−0.05
−0.1
0 2 4 6 8 10 12 14 16
t (s)
(a) Cardioid to 0 degrees, that is, M0
0.1
0.05
0
−0.05
−0.1
0 2 4 6 8 10 12 14 16
Figure 12: Microphone array with 3 outward facing cardioid t (s)
microphones.
(b) Proposed adaptive null-steering algorithm
6
π rad
5
4
M2 M1
ϑni with i = 1, 2
0 rad 1
S
0
0 2 4 6 8 10 12 14 16
t (s)
180 0 180 0
larger compared to the directional interferers, the adaptation was generated via four loudspeakers, placed close to the walls
will be influenced by the diffuse noise that is present. The and each facing diffusers hanging on the walls. The level of
larger the diffuse noise, the more the final beampattern the diffuse noise is 12 dB lower compared to the directional
will resemble the hypercardioid. If diffuse noise would be (interfering) sources. The experiment is done in a time-span
dominant over the directional interferers, the estimates ϕn1 of 17.5 seconds, where we switch the directional sources as
and ϕn2 will be equal to 90−109 degrees, and 90+109 degrees, shown in Table 3.
respectively, (or −0.33 and −2.81 radians, resp.). We use mutually uncorrelated white random-noise
sequences for the directional sources N1, N2, and N3 played
6.3. Real-Life Experiments. To validate the null-steering by loudspeakers and use speech for the desired sound-source
algorithm in real-life, we used a microphone array with S.
3 outward facing cardioid electret microphones, as shown For the algorithm, we use discrete-time signals with a
in Figure 12. As directional cardioid microphones have sample-rate of 8 KHz. Furthermore, we used α = 0.25, μ =
openings on both sides, the microphones are placed in 0.001, and β = 0.95.
rubber holders, enabling sound to enter both sides of the Figure 14(a) shows the waveform obtained from micro-
directional microphones. phone #0 (M0 ), which is a cardioid pointed with its main-
The type of microphone elements used for this array lobe to 0 radians. This waveform is compared with the result-
is the Primo EM164 cardioid microphones [17]. These ing waveform of the null-steering algorithm, and is shown
elements are placed uniformly on a circle with a radius in Figure 14(b). As the proposed null-steering algorithm is
of 1 cm. This radius is sufficient for the construction of able to steer nulls toward the directional interferers, the direct
eigenbeams up to a frequency of 4 KHz. part of the interferers is removed effectively (this can be seen
For the experiment, we placed the array on a table in by the lower noise-level in Figure 14(b) in the time-frame
a moderately reverberant room (conferencing-room) with from 0–10.5 seconds). In the segment from 10.5–14 seconds
a T60 of approximately 200 milliseconds. As shown in the (where there is only a single directional interferer at φ = π
setup in Figure 13, all directional sources are placed at a radians), it can be seen that the null-steering algorithm is
distance of 1 meter from the array (at discrete azimuthal able to reject this interferer just as good as the single cardioid
angles: φ = 0, π/2, π, and 3π/2 radians), while diffuse noise microphone.
EURASIP Journal on Advances in Signal Processing 15
Source angle φ (rad) 0–3.5 (s) 3.5–7 (s) 7–10.5 (s) 10.5–14 (s) 14 s–17.5 (s)
N1 π/2 active — active — —
N2 π active active — active —
N3 3π/2 — active active — —
S 0 active active active active active
In Figure 15, the resulting angle-estimates from the null- Proof. First, we compute the numerator of the partial
steering algorithm are shown. Here, it can be seen that the derivative ∂QS /∂ϕ1 and set this derivative to zero:
angle-estimation for the first three segments of 3.5 seconds
6 1 − cos ϕ1 sin ϕ1 5 + 3 cos ϕ1 − ϕ2
is done accurately. For the fourth segment, there is only (A.2)
a single point interferer. In this segment, only a single + 6 1 − cos ϕ1 1 − cos ϕ2 3 sin ϕ1 − ϕ2 = 0.
angle-estimation is stable, while the other angle-estimation
is highly influenced by the diffuse noise. Finally, in the The common factor 6(1 − cos ϕ1 ) can be removed, resulting
fifth segment, only diffuse noise is present and the final in
beampattern will optimize the directivity-index, leading to a sin ϕ1 5 + 3 cos ϕ1 − ϕ2 + 3 1 − cos ϕ1 sin ϕ1 − ϕ2 = 0.
more hypercardioid beampattern steered with its main-lobe (A.3)
to 0 degrees (as explained in Section 6.2).
Finally, in Figure 16, the resulting polar-patterns from Similarly, setting the partial derivative ∂QS /∂ϕ2 equal to
the null-steering algorithm are shown for some discrete zero, we get
time-stamps. Again, it becomes clear that the null-steering sin ϕ2 5 + 3 cos ϕ2 − ϕ1 + 3 1 − cos ϕ2 sin ϕ2 − ϕ1 = 0.
algorithm is able to steer the nulls toward the angles where (A.4)
the interferers are coming from.
Combining (A.3) and (A.4) gives
sin ϕ1 −3 sin ϕ1 − ϕ2
=
7. Conclusions 1 − cos ϕ1 5 + 3 cos ϕ1 − ϕ2
(A.5)
We analyzed the construction of a first-order superdirec- 3 sin ϕ2 − ϕ1 − sin ϕ2
= = ,
tional response in order to obtain a unity response for a 5 + 3 cos ϕ2 − ϕ1 1 − cos ϕ2
desired azimuthal angle and to obtain a placement of two
nulls to undesired azimuthal angles to suppress two direc- or alternatively
tional interferers. We derived a gradient search algorithm to 2 sin ϕ1 /2 cos ϕ1 /2 ϕ1 ϕ2
adapt two weights in a generalized sidelobe canceller scheme. 2 = cot = −cot , (A.6)
2 sin ϕ1 /2 2 2
Furthermore, we analyzed the cost-function of this gradient
search algorithm, which was found to be convex. Hence with ϕ1 , ϕ2 ∈ [0, π].
a global minimum is obtained in all cases. From the two From (A.6), we can see that ϕ1 /2 + ϕ2 /2 = π (or ϕ1 + ϕ2 =
weights in the algorithm and using a four-quadrant inverse- 2π) and can derive
tangent operation, it is possible to obtain estimates of the cos ϕ2 = cos 2π − ϕ1 = cos ϕ1 , (A.7)
azimuthal angles where the two directional interferers are
coming from. Simulations and real-life experiments show a sin ϕ2 = sin 2π − ϕ1 = − sin ϕ1 . (A.8)
good performance in moderate reverberant situations. Using (A.7) and (A.8) in (A.1) gives
2 2
6 1 − cos ϕ1 6 1 − cos ϕ1 6(1 − x)2
QS = = = ,
Appendix 5 + 3 2 cos ϕ1 − 1 2 + 6 cos2 ϕ1 2 + 6x2
(A.9)
Proofs
with x = cos ϕ1 ∈ [−1, 1].
Maximum Directivity Factor QS . We prove that for We can compute the optimal value for x by differentia-
tion of (A.9) and setting the result to zero:
2
6 1 − cos ϕ1 1 − cos ϕ2 − 12(1 − x) 2 + 6x2 − 6(1 − x) 12x = 0
QS ϕ1 , ϕ2 = , (A.1) (A.10)
5 + 3 cos ϕ1 − ϕ2 ≡ −2 − 6x2 − 6x + 6x2 = 0.
Solving (A.10) gives x = cos ϕ1 = −1/3 and conse-
with ϕ1 , ϕ2 ∈ [0, 2π], a maximum QS = 4 is obtained for quently, ϕ1 = arccos (−1/3) and ϕ2 = 2π − arccos (−1/3). Via
ϕ1 = arccos (−1/3) and ϕ2 = 2π − arccos (−1/3). (A.9), we can see that for these values, we have QS = 4.
16 EURASIP Journal on Advances in Signal Processing
Acknowledgment (OCEANS ’03), vol. 4, pp. 2127–2132, San Diego, Calif, USA,
September 2003.
The author likes to thank Dr. A. J. E. M. Janssen for his [17] R. M. M. Derkx, “Spatial harmonic analysis of unidirectional
valuable suggestions. microphones for use in superdirective beamformers,” in
Proceedings of the 36th International Conference: Automotive
Audio, Dearborn, Mich, USA, June 2009.
References
[1] G. W. Elko, F. Pardo, D. Lopez, D. Bishop, and P. Gammel,
“Surface-micromachined mems microphone,” in Proceedings
of the 115th AES Convention, p. 1–8, October 2003.
[2] P. L. Chu, “Superdirective microphone array for a set-top
video conferencing system,” in Proceedings of the IEEE Inter-
national Conference on Acoustics, Speech, and Signal Processing
(ICASSP ’97), vol. 1, pp. 235–238, Munich, Germany, April
1997.
[3] R. L. Pritchard, “Maximum directivity index of a linear point
array,” Journal of the Acoustical Society of America, vol. 26, no.
6, pp. 1034–1039, 1954.
[4] H. Cox, “Super-directivity revisited,” in Proceedings of the 21st
IEEE Instrumentation and Measurement Technology Conference
(IMTC ’04), vol. 2, pp. 877–880, May 2004.
[5] G. W. Elko and A. T. Nguyen Pong, “A simple first-order
differential microphone,” in Proceedings of the IEEE Workshop
on Applications of Signal Processing to Audio and Acoustics
(WASPAA ’95), pp. 169–172, New Paltz, NY, USA, October
1995.
[6] G. W. Elko and A. T. Nguyen Pong, “A steerable and variable
first-order differential microphone array,” in Proceedings of
the IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP ’97), vol. 1, pp. 223–226, Munich,
Germany, April 1997.
[7] M. A. Poletti, “Unified theory of horizontal holographic sound
systems,” Journal of the Audio Engineering Society, vol. 48, no.
12, pp. 1155–1182, 2000.
[8] H. Cox, R. M. Zeskind, and M. M. Owen, “Robust adaptive
beamforming,” IEEE Transactions on Acoustics, Speech, and
Signal Processing, vol. 35, no. 10, pp. 1365–1376, 1987.
[9] R. M. M. Derkx and K. Janse, “Theoretical analysis of a first-
order azimuth-steerable superdirective microphone array,”
IEEE Transactions on Audio, Speech and Language Processing,
vol. 17, no. 1, pp. 150–162, 2009.
[10] Y. Huang and J. Benesty, Audio Signal Processing for Next Gen-
eration Multimedia Communication Systems, Kluwer Academic
Publishers, Dordrecht, The Netherlands, 1st edition, 2004.
[11] H. Teutsch, Modal Array Signal Processing: Principles and
Applications of Acoustic Wavefield Decomposition, Springer,
Berlin, Germany, 1st edition, 2007.
[12] L. J. Griffiths and C. W. Jim, “An alternative approach to lin-
early constrained adaptive beamforming,” IEEE Transactions
on Antennas and Propagation, vol. 30, no. 1, pp. 27–34, 1982.
[13] R. M. M. Derkx, “Optimal azimuthal steering of a first-
order superdirectional microphone response,” in Proceedings
of the 11th International Workshop on Acoustic Echo and Noise
Control (IWAENC ’08), Seattle, Wash, USA, September 2008.
[14] J.-H. Lee and Y.-H. Lee, “Two-dimensional adaptive array
beamforming with multiple beam constraints using a general-
ized sidelobe canceller,” IEEE Transactions on Signal Processing,
vol. 53, no. 9, pp. 3517–3529, 2005.
[15] W. Kaplan, Maxima and Minima with Applications: Practical
Optimization and Duality, John Wiley & Sons, New York, NY,
USA, 1999.
[16] B. H. Maranda, “The statistical accuracy of an arctangent
bearing estimator,” in Proceedings of the Oceans Conference
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 431347, 25 pages
doi:10.1155/2010/431347
Research Article
Musical-Noise Analysis in Methods of Integrating Microphone
Array and Spectral Subtraction Based on Higher-Order Statistics
Copyright © 2010 Yu Takahashi et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
We conduct an objective analysis on musical noise generated by two methods of integrating microphone array signal processing and
spectral subtraction. To obtain better noise reduction, methods of integrating microphone array signal processing and nonlinear
signal processing have been researched. However, nonlinear signal processing often generates musical noise. Since such musical
noise causes discomfort to users, it is desirable that musical noise is mitigated. Moreover, it has been recently reported that higher-
order statistics are strongly related to the amount of musical noise generated. This implies that it is possible to optimize the
integration method from the viewpoint of not only noise reduction performance but also the amount of musical noise generated.
Thus, we analyze the simplest methods of integration, that is, the delay-and-sum beamformer and spectral subtraction, and fully
clarify the features of musical noise generated by each method. As a result, it is clarified that a specific structure of integration
is preferable from the viewpoint of the amount of generated musical noise. The validity of the analysis is shown via a computer
simulation and a subjective evaluation.
Multichannel
observed signal
. Beamforming to +
. Spectral Output
. enhance target speech
subtraction
(delay-and-sum)
. Beamforming to
.
. estimate noise signal
Figure 1: Block diagram of architecture for spectral subtraction after beamforming (BF+SS).
Multichannel
observed signal
+ Spectral
Beamforming to
. subtraction
. . enhance target speech
. .. Output
− + Spectral (delay-and-sum)
subtraction
−
Beamforming to
. .
. estimate noise signal .
. .
in each channel
Figure 2: Block diagram of architecture for channelwise spectral subtraction before beamforming (chSS+BF).
methods of integrating microphone array signal processing (i) The amount of musical noise generated strongly
and nonlinear signal processing can be optimized from the depends on not only the oversubtraction parameter
viewpoint of not only noise reduction performance but of SS but also the statistical characteristics of the input
also the sound quality according to human hearing. As signal.
a first step toward achieving this goal, in this study we (ii) Except for the specific condition that the input signal
analyze the simplest case of the integration of microphone is Gaussian, the noise reduction performances of the
array signal processing and nonlinear signal processing by two methods are not equivalent even if we set the
considering the integration of DS and SS. As a result of the same SS parameters.
analysis, we clarify the musical-noise generation features of
two types of methods on integration of microphone array (iii) Under equivalent noise reduction performance con-
signal processing and SS. ditions, chSS+BF generates less musical noise than
Figure 1 shows a typical architecture used for the inte- BF+SS for almost all practical cases.
gration of microphone array signal processing and SS, where The most important contribution of this paper is that
SS is performed after beamforming. Thus, we call this type these findings are mathematically proved. In particular, the
of architecture BF+SS. Such a structure has been adopted amount of musical noise generated and the noise reduction
in many integration methods [11, 15]. On the other hand, performance resulting from the integration of microphone
the integration architecture illustrated in Figure 2 is an array signal processing and SS are analytically formulated on
alternative architecture used when SS is performed before the basis of HOS. Although there have been many studies on
beamforming. Such a structure is less commonly used, optimization methods based on HOS [21], this is the first
but some integration methods use this structure [12, 14]. time they have been used for musical-noise assessment. The
In this architecture, channelwise SS is performed before validity of the analysis based on HOS is demonstrated via a
beamforming, and we call this type of architecture chSS+BF. computer simulation and a subjective evaluation by humans.
We have already tried to analyze such methods of The rest of the paper is organized as follows. In Section 2,
integrating DS and SS from the viewpoint of musical-noise the two methods of integrating microphone array signal
generation on the basis of HOS [20]. However, in the processing and SS are described in detail. In Section 3, the
analysis, we did not consider the effect of flooring in SS metric based on HOS used for the amount of musical noise
and the noise reduction performance. On the other hand, generated is described. Next, the musical-noise analysis of
in this study we perform an exact analysis considering the SS, microphone array signal processing, and their integration
effect of flooring in SS and the noise reduction performance. methods are discussed in Section 4. In Section 5, the noise
We analyze these two architectures on the basis of HOS and reduction performances of the two integration methods are
obtain the following results. discussed, and both methods are compared under equivalent
EURASIP Journal on Advances in Signal Processing 3
T
θU yDS f , τ = gDS f , θU x f , τ ,
T
gDS f , θU = g1(DS) f , θU , . . . , gJ(DS) f , θU , (2)
0 d
Mic. 1 Mic. 2 Mic. j Mic. J
··· ···
(d = d1 ) (d = d2 ) (d = d j ) (d = dJ ) i2π f /M fs d j sin θU
g (DS)
j f , θU =J −1
· exp − ,
Figure 3: Configuration of microphone array and signals. c
where x( f , τ) = [x1 ( f , τ), . . . , xJ ( f , τ)]T is the observed where gNBF ( f ) is the filter coefficient vector of the null
signal vector, h( f ) = [h1 ( f ), . . . , hJ ( f )]T is the transfer beamformer [22] that steers the null directivity to the speech
function vector, s( f , τ) is the target speech signal, and direction θU , and λ( f ) is the gain adjustment term, which
n( f , τ) = [n1 ( f , τ), . . . , nJ ( f , τ)]T is the noise signal vector. is determined in a speech break period. Since the null
beamformer can remove the speech signal by steering the
null directivity to the speech direction, we can estimate
2.2. SS after Beamforming. In BF+SS, the single-channel the noise signal. Moreover, a method exists in which
target-speech-enhanced signal is first obtained by beam- independent component analysis (ICA) is utilized as a noise
forming, for example, by DS. Next, single-channel noise estimator instead of the null beamformer [15].
estimation is performed by a beamforming technique, for
example, null beamformer [22] or adaptive beamforming
[1]. Finally, we extract the resultant target-speech-enhanced 2.3. Channelwise SS before Beamforming. In chSS+BF, we
signal via SS. The full details of signal processing are given first perform SS independently in each input channel and
below. then we derive a multichannel target-speech-enhanced signal
4 EURASIP Journal on Advances in Signal Processing
by channelwise SS. This can be expressed as 3.2. Relation between Musical-Noise Generation and Kurtosis.
In our previous works [18–20], we defined musical noise as
(chSS)
yj f ,τ the audible isolated spectral components generated through
⎧ signal processing. Figure 4(b) shows an example of a spectro-
⎪
⎪
⎪ 2 2 gram of musical noise in which many isolated components
⎪
⎪ x f , τ − β · E n f , τ
⎪
⎪
j τ j
can be observed. We speculate that the amount of musical
⎪
⎪
⎪
⎪
⎪
⎪
2 noise is strongly related to the number of such isolated
⎪
⎨ where x j f , τ (5)
components and their level of isolation.
=
⎪
⎪ Hence, we introduce kurtosis to quantify the isolated
⎪
⎪
⎪
2
⎪
⎪ nj f , τ ≥ 0 ,
−β · Eτ spectral components, and we focus on the changes in kur-
⎪
⎪
⎪
⎪ tosis. Since isolated spectral components are dominant, they
⎪
⎪
⎪
⎩η · are heard as tonal sounds, which results in our perception
x j f , τ (otherwise),
of musical noise. Therefore, it is expected that obtaining
where y (chSS)
j ( f , τ) is the target-speech-enhanced signal the number of tonal components will enable us to quantify
obtained by SS at a specific channel j and n j ( f , τ) is the the amount of musical noise. However, such a measurement
estimated noise signal in the jth channel. For instance, is extremely complicated; so instead we introduce a simple
the multichannel noise can be estimated by single-input statistical estimate, that is, kurtosis.
multiple-output ICA (SIMO-ICA) [24] or a combination of This strategy allows us to obtain the characteristics of
ICA and the projection back method [25]. These techniques tonal components. The adopted kurtosis can be used to
can provide the multichannel estimated noise signal, unlike evaluate the width of the probability density function (p.d.f.)
traditional ICA. SIMO-ICA can separate mixed signals not and the weight of its tails; that is, kurtosis can be used to
into monaural source signals but into SIMO-model signals evaluate the percentage of tonal components among the total
at the microphone. Here SIMO denotes the specific trans- components. A larger value indicates a signal with a heavy
mission system in which the input signal is a single source tail in its p.d.f., meaning that it has a large number of tonal
signal and the outputs are its transmitted signals observed components. Also, kurtosis has the advantageous property
at multiple microphones. Thus, the output signals of SIMO- that it can be easily calculated in a concise algebraic form.
ICA maintain the rich spatial qualities of the sound sources
[24] Also the projection back method provides SIMO- 3.3. Kurtosis. Kurtosis is one of the most commonly used
model-separated signals using the inverse of an optimized HOS for the assessment of non-Gaussianity. Kurtosis is
ICA filter [25]. defined as
Finally, we extract the target-speech-enhanced signal by μ4
kurtx = 2 , (7)
applying DS to ychSS ( f , τ) = [y1(chSS) ( f , τ), . . . , yJ(chSS) ( f , τ)]T . μ2
This procedure can be expressed by where x is a random variable, kurtx is the kurtosis of x, and
y f ,τ = T
gDS f , θU ychSS f , τ , (6) μn is the nth-order moment of x. Here μn is defined as
+∞
where y( f , τ) is the final output of chSS+BF. μn = xn P(x)dx, (8)
Such a chSS+BF structure performs DS after (multichan- −∞
nel) SS. Since DS is basically signal processing in which the where P(x) denotes the p.d.f. of x. Note that this μn is
summation of the multichannel signal is taken, it can be not a central moment but a raw moment. Thus, (7) is not
considered that interchannel smoothing is applied to the kurtosis according to the mathematically strict definition,
multichannel spectral-subtracted signal. On the other hand, but a modified version; however, we refer to (7) as kurtosis
the resultant output signal of BF+SS remains as it is after SS. in this paper.
That is to say, it is expected that the output signal of chSS+BF
is more natural (contains less musical noise) than that of 3.4. Kurtosis Ratio. Although we can measure the number of
BF+SS. In the following sections, we reveal that chSS+BF can tonal components by kurtosis, it is worth mentioning that
output a signal with less musical noise than BF+SS in almost kurtosis itself is not sufficient to measure musical noise. This
all cases on the basis of HOS. is because that the kurtosis of some unprocessed signals such
as speech signals is also high, but we do not perceive speech
3. Kurtosis-Based Musical-Noise as musical noise. Since we aim to count only the musical-
Generation Metric noise components, we should not consider genuine tonal
components. To achieve this aim, we focus on the fact that
3.1. Introduction. It has been reported by the authors that the musical noise is generated only in artificial signal processing.
amount of musical noise generated is strongly related to the Hence, we should consider the change in kurtosis during
difference between the kurtosis of a signal before and after signal processing. Consequently, we introduce the following
signal processing. Thus, in this paper, we analyze the amount kurtosis ratio [18] to measure the kurtosis change:
of musical noise generated through BF+SS and chSS+BF on
kurtproc
the basis of the change in the measured kurtosis. Hereinafter, kurtosis ratio = , (9)
we give details of the kurtosis-based musical-noise metric. kurtinput
EURASIP Journal on Advances in Signal Processing 5
Frequency (Hz)
Frequency (Hz)
Time (s) Time (s)
(a) (b)
where kurtproc is the kurtosis of the processed signal and 4.2. Signal Model Used for Analysis. Musical-noise compo-
kurtinput is the kurtosis of the input signal. A larger kurtosis nents generated from the noise-only period are dominant
ratio (1) indicates a marked increase in kurtosis as a result in spectrograms (see Figure 4); hence, we mainly focus our
of processing, implying that a larger amount of musical noise attention on musical-noise components originating from
is generated. On the other hand, a smaller kurtosis ratio input noise signals.
(1) implies that less musical noise is generated. It has been Moreover, to evaluate the resultant kurtosis of SS, we
confirmed that this kurtosis ratio closely matches the amount introduce a gamma distribution to model the noise in the
of musical noise in a subjective evaluation based on human power domain [26–28]. The p.d.f. of the gamma distribution
hearing [18]. for random variable x is defined as
1 α−1 x
PGM (x) = ·x exp − , (10)
4. Kurtosis-Based Musical-Noise Analysis for Γ(α)θ α θ
Microphone Array Signal Processing and SS where x ≥ 0, α > 0, and θ > 0. Here, α denotes the shape
parameter, θ is the scale parameter, and Γ(·) is the gamma
4.1. Analysis Flow. In the following sections, we carry out an function. The gamma distribution with α = 1 corresponds
analysis on musical-noise generation in BF+SS and chSS+BF to the chi-square distribution with two degrees of freedom.
based on kurtosis. The analysis is composed of the following Moreover, it is well known that the mean of x for a gamma
three parts. distribution is E[x] = αθ, where E[·] is the expectation
operator. Furthermore, the kurtosis of a gamma distribution,
(i) First, an analysis on musical-noise generation in kurtGM , can be expressed as [18]
BF+SS and chSS+BF based on kurtosis that does (α + 2)(α + 3)
not take noise reduction performance into account kurtGM = . (11)
α(α + 1)
is performed in this section.
Moreover, let us consider the power-domain noise signal,
(ii) The noise reduction performance is analyzed in xp , in the frequency domain, which is defined as
Section 5, and we reveal that the noise reduction
performances of BF+SS and chSS+BF are not equiv- xp = |xre + i · xim |2
∗
alent. Moreover, a flooring parameter designed to = (xre + i · xim )(xre + i · xim ) (12)
align the noise reduction performances of BF+SS and 2 2
= xre + xim ,
chSS+BF is also derived to ensure the fair comparison
of BF+SS and chSS+BF. where xre is the real part of the complex-valued signal and xim
is its imaginary part, which are independent and identically
(iii) The kurtosis-based comparison between BF+SS and distributed (i.i.d.) with each other, and the superscript ∗
chSS+BF under the same noise reduction perfor- expresses complex conjugation. Thus, the power-domain
mance conditions is carried out in Section 6. signal is the sum of two squares of random variables with
the same distribution.
In the analysis in this section, we first clarify how kurtosis Hereinafter, let xre and xim be the signals after DFT
is affected by SS. Next, the same analysis is applied to analysis of signal in a specific microphone j, x j , and we
DS. Finally, we analyze how kurtosis is increased by BF+SS suppose that the statistical properties of x j equal to xre and
and chSS+BF. Note that our analysis contains no limiting xim . Moreover, we assume the following; x j is i.i.d. in each
assumptions on the statistical characteristics of noise; thus, channel, the p.d.f. of x j is symmetrical, and its mean is zero.
all noises including Gaussian and super-Gaussian noise can These assumptions mean that the odd-order cumulants and
be considered. moments are zero except for the first order.
6 EURASIP Journal on Advances in Signal Processing
0 βαθ 0 βαθ
Flooring
0 βαη2 θ
(3) The region corresponding to (4) Positive components
the negative components is remain as they are.
compressed by a small positive (5) Remaining positive components
flooring parameter η. and floored components are merged.
Although kurtx = 3 if x is a Gaussian signal, note where z is the random variable of the p.d.f. after SS. The
that the kurtosis of a Gaussian signal in the power spectral derivation of PSS (z) is described in Appendix A.
domain is 6. This is because a Gaussian signal in the time From (13), the kurtosis after SS can be expressed as
domain obeys the chi-square distribution with two degrees of
freedom in the power spectral domain; for such a chi-square F α, β, η
kurtSS = Γ(α) 2 , (14)
distribution, μ4 /μ22 = 6. G α, β, η
4.3. Resultant Kurtosis after SS. In this section, we analyze the where
kurtosis after SS. In traditional SS, the long-term-averaged
power spectrum of a noise signal is utilized as the estimated G α, β, η = Γ(α)Γ βα, α + 2 − 2βαΓ βα, α + 1
noise power spectrum. Then, the estimated noise power
spectrum multiplied by the oversubtraction parameter β + β2 α2 Γ βα, α + η4 γ βα, α + 2 ,
is subtracted from the observed power spectrum. When a
F α, β, η = Γ βα, α + 4 − 4βαΓ βα, α + 3
gamma distribution is used to model the noise signal, its
mean is αθ. Thus, the amount of subtraction is βαθ. The + 6β2 α2 Γ βα, α + 2 − 4β3 α3 Γ βα, α + 1
subtraction of the estimated noise power spectrum in each
frequency band can be considered as a shift of the p.d.f. to + β4 α4 Γ βα, α + η8 γ βα, α + 4 .
the zero-power direction (see Figure 5). As a result, negative- (15)
power components with nonzero probability arise. To avoid
this, such negative components are replaced by observations Here, Γ(b, a) is the upper incomplete gamma function
that are multiplied by a small positive value η (the so-called defined as
flooring technique). This means that the region correspond- ∞
ing to the probability of the negative components, which Γ(b, a) = t a−1 exp{−t }dt, (16)
forms a section cut from the original gamma distribution, is b
compressed by the effect of the flooring. Finally, the floored
components are superimposed on the laterally shifted p.d.f. and γ(b, a) is the lower incomplete gamma function defined
(see Figure 5). Thus, the resultant p.d.f. after SS, PSS (z), can as
be written as b
⎧ γ(b, a) = t a−1 exp{−t }dt. (17)
⎪
⎪ 1 α−1 z + βαθ 0
⎪
⎪ z + βαθ exp −
⎪
⎪ θ α Γ(α) θ
⎪
⎪
⎪
⎪ The detailed derivation of (14) is given in Appendix B.
⎪
⎪ z ≥ βαη2 θ ,
⎪
⎪ Although Uemura et al. have given an approximated form
⎪
⎪
⎪
⎪ (lower bound) of the kurtosis after SS in [18], (14) involves
⎨
α−1 z + βαθ
PSS (z) = ⎪ 1 z + βαθ exp − no approximation throughout its derivation. Furthermore,
⎪ θ α Γ(α)
⎪ θ (14) takes into account the effect of the flooring technique
⎪
⎪
⎪
⎪
⎪
⎪ unlike [18].
⎪
⎪ 1 − z
⎪ + 2 α
⎪ z exp − 2
α 1
Figure 6(a) depicts the theoretical kurtosis ratio after
⎪
⎪ η θ Γ(α) η θ
⎪
⎪ SS, kurtSS /kurtGM , for various values of oversubtraction
⎪
⎩
2
0 < z < βαη θ , parameter β and flooring parameter η. In the figure, the
(13) kurtosis of the input signal is fixed to 6.0, which corresponds
EURASIP Journal on Advances in Signal Processing 7
60 100
50
80
40
Kurtosis ratio
Kurtosis ratio
60
30
40
20
10 20
0 0
0 0.5 1 1.5 2 2.5 3 3.5 4 10 100
Oversubtraction parameter Input kurtosis
Figure 6: (a) Theoretical kurtosis ratio after SS for various values of oversubtraction parameter β and flooring parameter η. In this figure,
kurtosis of input signal is fixed to 6.0. (b) Theoretical kurtosis ratio after SS for various values of input kurtosis. In this figure, flooring
parameter η is fixed to 0.0.
to a Gaussian signal. From this figure, it is confirmed that For cumulants, when X and Y are independent random
thekurtosis ratio is basically proportional to the oversub- variables it is well known that the following relation holds:
traction parameter β. However, kurtosis does not mono-
cumn (aX + bY ) = an cumn (X) + bn cumn (Y ), (18)
tonically increase when the flooring parameter is nonzero.
For instance, the kurtosis ratio is smaller than the peak where cumn (·) denotes the nth-order cumulant. The cumu-
value when β = 4 and η = 0.4. This phenomenon can be lants of the random variable X, cumn (X), are defined by
explained as follows. For a large oversubtraction parameter, a cumulant-generating function, which is the logarithm of
almost all the spectral components become negative due to the moment-generating function. The cumulant-generating
the larger lateral shift of the p.d.f. by SS. Since flooring is function C(ζ) is defined as
applied to avoid such negative components, almost all the ∞
components are reconstructed by flooring. Therefore, the ζn
C(ζ) = log E exp ζX = cumn (X) , (19)
statistical characteristics of the signal never change except for n=1
n!
its amplitude if η =/ 0. Generally, kurtosis does not depend on
the change in amplitude; consequently, it can be considered where ζ is an auxiliary variable and E[exp{ζX }] is the
that kurtosis does not markedly increase when a larger moment-generating function. Thus, the nth-order cumulant
oversubtraction parameter and a larger flooring parameter cumn (X) is represented by
are set. cumn (X) = C (n) (0), (20)
The relation between the theoretical kurtosis ratio and
the kurtosis of the original input signal is shown in where C (n) (ζ) is the nth-order derivative of C(ζ).
Figure 6(b). In the figure, η is fixed to 0.0. It is revealed Now we consider the DS beamformer, which is steered
that the kurtosis ratio after SS rapidly decreases as the to θU = 0 and whose array weights are 1/J. Using (18), the
input kurtosis increases, even with the same oversubtraction resultant nth-order cumulant after DS, Kn = cumn (yDS ),
parameter β. Therefore, the kurtosis ratio after SS, which is can be expressed by
related to the amount of musical noise, strongly depends on
1
the statistical characteristics of the input signal. That is to say, Kn = Kn , (21)
SS generates a larger amount of musical noise for a Gaussian J n−1
input signal than for a super-Gaussian input signal. This fact where Kn = cumn (x j ) is the nth-order cumulant of x j .
has been reported in [18]. Therefore, using (21) and the well-known mathematical rela-
tion between cumulants and moments, the power-spectral-
4.4. Resultant Kurtosis after DS. In this section, we analyze domain kurtosis after DS, kurtDS can be expressed by
the kurtosis after DS, and we reveal that DS can reduce the K8 + 38K42 + 32K2 K6 + 288K22 K4 + 192K24
kurtosis of input signals. Since we assume that the statistical kurtDS = .
2K42 + 16K22 K4 + 32K24
properties of xre or xim are the same as that of x j , the effect (22)
of DS on the change in kurtosis can be derived from the
cumulants and moments of x j . The detailed derivation of (22) is described in Appendix C.
8 EURASIP Journal on Advances in Signal Processing
100 100
80 80
Output kurtosis
Output kurtosis
60 60
40 40
20 20
6 6
20 40 60 80 100 20 40 60 80 100
Input kurtosis Input kurtosis
(a) 1-microphone case (b) 2-microphone case
100 100
80 80
Output kurtosis
Output kurtosis
60 60
40 40
20 20
6 6
20 40 60 80 100 20 40 60 80 100
Input kurtosis Input kurtosis
Simulation Simulation
Theoretical Theoretical
Approximated Approximated
(c) 4-microphone case (d) 8-microphone case
Figure 7: Relation between input kurtosis and output kurtosis after DS. Solid lines indicate simulation results, broken lines express
theoretical plots obtained by (22), and dotted lines show approximate results obtained by (23).
18 18
15 15
Kurtosis
Kurtosis
12 12
9 9
6 6
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
Experimental Experimental
Theoretical Theoretical
(a) 1000 Hz (b) 8000 Hz
Figure 8: Simulation result for noise with interchannel correlation (solid line) and theoretical effect of DS assuming no interchannel
correlation (broken line) in each frequency subband.
(11), we can derive a shape parameter for the gamma First, we derive the average power of the input signal. We
distribution corresponding to kurtDS , α, as assume that the input signal in the power domain can be
! modeled by a gamma distribution. Then, the average power
kurt2DS + 14 kurtDS + 1 − kurtDS + 5 of the input signal is given as
α = . (24)
2 kurtDS − 2 ∞
E[nin ] = E[x] = xPGM (x)dx
The derivation of (24) is shown in Appendix D. Conse- 0
quently, using (14) and (24), the resultant kurtosis after ∞
1 x
BF+SS, kurtBF+SS , can be written as = x· α
xα−1 exp − dx (29)
0 θ Γ(α) θ
F α, β, η ∞
kurtBF+SS ) 2
= Γ(α . (25) 1 x
G α, β, η = xα exp − dx.
θ α Γ(α) 0 θ
In chSS+BF, SS is first applied to each input channel.
Thus, the output kurtosis after channelwise SS, kurtchSS , is Here, let t = x/θ, then θdt = dx. Thus,
given by ∞
1
E[nin ] = (θt)α exp{−t }θdt
F α, β, η θ α Γ(α) 0
kurtchSS = Γ(α) . (26) ∞
G2 α, β, η θ α+1
= t α exp{−t }dt (30)
Finally, DS is performed and the resultant kurtosis after θ α Γ(α) 0
35 35
Noise reduction performance (dB)
25 25
20 20
15 15
10 10
5 5
0 0
0 1 2 3 4 5 6 7 8 6 10 100
Oversubtraction parameter Input kurtosis
Figure 10: (a) Theoretical noise reduction performance of SS with various oversubtraction parameters β and flooring parameters η. In this
figure, kurtosis of input signal is fixed to 6.0. (b) Theoretical noise reduction performance of SS with various values of input kurtosis. In this
figure, flooring parameter η is fixed to 0.0.
Also, we deal with the second term of the right-hand side in In this figure, η is fixed to 0.0. It is revealed that NRPSS
(31). We let t = z/(η2 θ) then η2 θdt = dz, resulting in decreases as the input kurtosis increases. This is because the
mean of a high-kurtosis signal tends to be small. Since the
βαη2 θ shape parameter α of a high-kurtosis signal becomes small,
z α−1 z
α z exp − 2 dz the mean αθ corresponding to the amount of subtraction
0 η2 θ Γ(α) η θ also becomes small. As a result, NRPSS is decreased as the
βα input kurtosis increases. That is to say, the NRPSS strongly
1 α
= α η2 θt · exp{−t }η2 θdt (33) depends on the statistical characteristics of the input signal
η2 θ Γ(α) 0 as well as the values of the oversubtraction and flooring
η2 θ parameters.
= γ βα, α + 1 .
Γ(α)
5.2. Noise Reduction Performance of DS. It is well known
that the noise reduction performance of DS (NRPDS ) is
Using (30), (32), and (33), the noise reduction performance
proportional to the number of microphones. In particular,
of SS, NRPSS , can be expressed by
for spatially uncorrelated multichannel signals, NRPDS is
given as [1]
E[z]
NRPSS = 10 log10
E[x] NRPDS = 10 log10 J. (35)
"
Γ βα, α + 1
= −10 log10 5.3. Resultant Noise Reduction Performance: BF+SS versus
Γ(α + 1)
chSS+BF. In the previous subsections, the noise reduction
# performances of SS and DS were discussed. In this subsec-
Γ βα, α γ βα, α + 1
−β · + η2 . tion, we derive the resultant noise reduction performances
Γ(α) Γ(α + 1)
of the composite systems of SS and DS, that is, BF+SS and
(34) chSS+BF.
The noise reduction performance of BF+SS is analyzed
Figure 10(a) shows the theoretical value of NRPSS for as follows. In BF+SS, DS is first applied to a multichannel
various values of oversubtraction parameter β and flooring input signal. If this input signal is spatially uncorrelated, its
parameter η, where the kurtosis of the input signal is fixed noise reduction performance can be represented by 10 log10 J.
to 6.0, corresponding to a Gaussian signal. From this figure, After DS, SS is applied to the signal. Note that DS affects
it is confirmed that NRPSS is proportional to β. However, the kurtosis of the input signal. As described in Section 4.4,
NRPSS hits a peak when η is nonzero even for a large value of the resultant kurtosis after DS can be approximated as
β. The relation between the theoretical value of NRRSS and J −0.7 · (kurtin − 6) + 6. Thus, SS is applied to the kurtosis-
the kurtosis of the input signal is illustrated in Figure 10(b). modified signal. Consequently, using (24), (34), and (35),
12 EURASIP Journal on Advances in Signal Processing
24 24
Noise reduction performance (dB)
16 16
12 12
8 8
0 2 4 6 8 0 2 4 6 8
Oversubtraction parameter Oversubtraction parameter
(a) Input kurtosis = 6 (b) Input kurtosis = 20
24
Noise reduction performance (dB)
20
16
12
8
0 2 4 6 8
Oversubtraction parameter
BF+SS
chSS+BF
(c) Input kurtosis = 80
Figure 11: Comparison of noise reduction performances of chSS+BF with BF+SS. In this figure, flooring parameter is fixed to 0.2 and
number of microphones is 8.
the noise reduction performance of BF+SS, NRPBF+SS , is (34) and (35), the noise reduction performance of chSS+BF,
given as NRPchSS+BF , can be represented by
NRPBF+SS NRPchSS+BF
= 10 log10 J − 10 log10 1
" # = −10 log10
Γ βα, α + 1 Γ βα, α γ βα, α + 1 J · Γ(α) (37)
× −β· + η2 " #
Γ(α + 1) Γ(α) Γ(α + 1) Γ βα, α + 1 γ βα, α + 1
(36) × − β · Γ βα, α + η2 .
1 α α
= −10 log10
J · Γ(α)
" # Figure 11 depicts the values of NRPBF+SS and NRPchSS+BF .
Γ βα, α + 1 γ βα, α + 1
× − β · Γ βα + η2
, α , From this result, we can see that the noise reduction
α α
performances of both methods are equivalent when the input
signal is Gaussian. However, if the input signal is super-
where α is defined by (24). Gaussian, NRPBF+SS exceeds NRPchSS+BF . This is due to the
In chSS+BF, SS is first applied to a multichannel input fact that DS is first applied to the input signal in BF+SS;
signal; then DS is applied to the resulting signal. Thus, using thus, DS reduces the kurtosis of the signal. Since NRPSS for
EURASIP Journal on Advances in Signal Processing 13
1.5 1.5
1 1
R 0.5 R 0.5
0 0
−0.5 −0.5
10 100 10 100
Input kurtosis Input kurtosis
(a) Flooring parameter η = 0.0 (b) Flooring parameter η = 0.1
1.5 1.5
1 1
R 0.5 R 0.5
0 0
−0.5 −0.5
10 100 10 100
Input kurtosis Input kurtosis
Figure 12: Theoretical kurtosis ratio between BF+SS and chSS+BF for various values of input kurtosis. In this figure, oversubtraction
parameter is β = 2.0 and flooring parameter in chSS+BF is (a) η = 0.0, (b) η = 0.1, (c) η = 0.2, and (d) η = 0.4.
5.4. Flooring-Parameter Design in BF+SS for Equivalent Noise 6. Output Kurtosis Comparison under
Reduction Performance. In this section, we describe the
flooring-parameter design in BF+SS so that NRPBF+SS and
Equivalent NRP Condition
NRPchSS+BF become equivalent. In this section, using the new flooring parameter for BF+SS,
Using (36) and (37), the flooring parameter η that makes η, we compare the output kurtosis of BF+SS and chSS+BF.
NRPBF+SS equal to NRPchSS+BF , is Setting η to (25), the output kurtosis of BF+SS is
$ " # modified to
%
% α Γ(α)
η = & · H α, β, η − I α, β , (38) F α, β, η
γ βα, α + 1 Γ(α) kurtBF+SS = Γ(α) . (41)
G2 α, β, η
14 EURASIP Journal on Advances in Signal Processing
4 4
3 3
2 2
1 1
R R
0 0
−1 −1
−2 −2
−3 −3
0 5 10 15 20 0 5 10 15 20
Oversubtraction parameter Oversubtraction parameter
Figure 13: Theoretical kurtosis ratio between BF+SS and chSS+BF for various oversubtraction parameters. In this figure, number of
microphones is fixed to 8, and input kurtosis is (a) 6.0 (Gaussian) and (b) 20.0 (super-Gaussian).
Loudspeakers (for interferences) In this figure, β is fixed to 2.0 and the flooring parameter
in chSS+BF is set to η = 0.0, 0.1, 0.2, and 0.4. The
flooring parameter for BF+SS is automatically determined
by (38). From this figure, we can confirm that chSS+BF
Loudspeaker (for target source) reduces the kurtosis more than BF+SS for almost all input
signals with various values of input kurtosis. Theoretical
values of R for various oversubtraction parameters are
depicted in Figure 13. Figure 13(a) shows that the output
kurtosis after chSS+BF is always less than that after BF+SS
1m
10 20
6
5
4 10
3
2
1 5
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
Figure 15: Results for Gaussian input signal. (a) Kurtosis ratio and (b) noise reduction performance for BF+SS with various flooring
parameters.
Here, we utilize the kurtosis ratio defined in Section 3.4 reduction performance closely fit the experimental results.
to measure the difference in kurtosis, which is related to These findings also support the validity of the analysis in
the amount of musical noise generated. The kurtosis ratio Sections 4, 5, and 6.
is given by Figures 18–20 illustrate the simulation results for a super-
Gaussian input signal. It is confirmed from Figure 18(a) that
kurt nproc f , τ the kurtosis ratio of chSS+BF also decreases monotonically
Kurtosis ratio = , (43)
kurt norg f , τ with increasing number of microphones. Unlike the case
of the Gaussian input signal, the kurtosis ratio of BF+SS
where nproc ( f , τ) is the power spectra of the residual noise with η = 0.8 also decreases with increasing number of
signal after processing, and norg ( f , τ) is the power spectra microphones. However, for a lower value of the flooring
of the original noise signal before processing. This kurtosis parameter, the kurtosis ratio of BF+SS is not degraded.
ratio indicates the extent to which kurtosis is increased Moreover, the kurtosis ratio of chSS+BF is lower than that
with processing. Thus, a smaller kurtosis ratio is desirable. of BF+SS for almost all cases. For the super-Gaussian input
Moreover, the noise reduction performance is measured signal, in contrast to the case of the Gaussian input signal,
using (28). the noise reduction performance of BF+SS with η = 0.0
Figures 15–17 show the simulation results for a Gaussian is greater than that of chSS+BF (see Figure 18(b)). That
input signal. From Figure 15(a), we can see that the kurtosis is to say, the noise reduction performance of BF+SS is
ratio of chSS+BF decreases almost monotonically with superior to that of chSS+BF for the same flooring parameter.
increasing number of microphones. On the other hand, the This result is consistent with the analysis in Section 5. The
kurtosis ratio of BF+SS does not exhibit such a tendency noise reduction performance of BF+SS with η = 0.4 is
regardless of the flooring parameter. Also, the kurtosis ratio comparable to that of chSS+BF. However, the kurtosis ratio
of chSS+BF is lower than that of BF+SS for all cases except of chSS+BF is still lower than that of BF+SS with η = 0.4.
for η = 0.8. Moreover, we can confirm from Figure 15(b) This result also coincides with the analysis in Section 6.
that the values of noise reduction performance for BF+SS On the other hand, the kurtosis ratio of BF+SS with η =
with flooring parameter η = 0.0 and chSS+BF are almost the 0.8 is almost the same as that of chSS+BF. However, the
same. When the flooring parameter for BF+SS is nonzero, noise reduction performance of BF+SS with η = 0.8 is
the kurtosis ratio of BF+SS becomes smaller but the noise lower than that of chSS+BF. Thus, it can be confirmed that
reduction performance degrades. On the other hand, for chSS+BF reduces the kurtosis ratio more than BF+SS for
Gaussian signals, chSS+BF can reduce the kurtosis ratio, a super-Gaussian signal under the same noise reduction
that is, reduce the amount of musical noise generated, performance. Furthermore, the theoretical kurtosis ratio and
without degrading the noise reduction performance. Indeed noise reduction performance closely fit the experimental
BF+SS with η = 0.8 reduces the kurtosis ratio more than results in Figures 19 and 20.
chSS+BF, but the noise reduction performance of BF+SS We also compare speech distortion originating from
is extremely degraded. Furthermore, we can confirm from chSS+BF and BF+SS on the basis of cepstral distortion
Figures 16 and 17 that the theoretical kurtosis ratio and noise (CD) [29] for the four-microphone case. The comparison
16 EURASIP Journal on Advances in Signal Processing
10 10
8 8
Kurtosis ratio
Kurtosis ratio
6 6
4 4
2 2
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
(a) chSS+BF (b) BF+SS (η = 0.0)
10 10
8 8
Kurtosis ratio
Kurtosis ratio
6 6
4 4
2 2
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
(c) BF+SS (η = 0.2) (d) BF+SS (η = 0.4)
10
8
Kurtosis ratio
2 4 6 8 10 12 14 16
Number of microphones
Experimental
Theoretical
(e) BF+SS (η = 0.8)
Figure 16: Comparison between experimental and theoretical kurtosis ratios for Gaussian input signal.
EURASIP Journal on Advances in Signal Processing 17
20 20
Noise reduction performance
10 10
5 5
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
(a) chSS+BF (b) BF+SS (η = 0.0)
20 20
Noise reduction performance
15 15
10 10
5 5
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
(c) BF+SS (η = 0.2) (d) BF+SS (η = 0.4)
20
Noise reduction performance
15
10
5
2 4 6 8 10 12 14 16
Number of microphones
Experimental
Theoretical
(e) BF+SS (η = 0.8)
Figure 17: Comparison between experimental and theoretical noise reduction performances for Gaussian input signal.
18 EURASIP Journal on Advances in Signal Processing
6 20
15
Kurtosis ratio
3
10
2
1
5
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
Figure 18: Results for super-Gaussian input signal. (a) Kurtosis ratio and (b) noise reduction performance for BF+SS with various flooring
parameters.
Table 1: Speech distortion comparison of chSS+BF and BF+SS on (iii) Under the same level of noise reduction performance,
the basis of CD for four-microphone case. the amount of musical noise generated via chSS+BF
is less than that generated via BF+SS.
Input noise type chSS+BF BF+SS
Gaussian 6.15 dB 6.45 dB (iv) Thus, the chSS+BF structure is preferable from the
viewpoint of musical-noise generation.
Super-Gaussian 6.17 dB 5.12 dB
(v) However, the noise reduction performance of BF+SS
is superior to that of chSS+BF for a super-Gaussian
is made under the condition that the noise reduction signal when the same parameters are set in the SS part
performances of both methods are almost the same. For for both methods.
the Gaussian input signal, the same parameters β = 2.0
and η = 0.0 are utilized for BF+SS and chSS+BF. On (vi) These results imply a trade-off between the amount
the other hand, β = 2.0 and η = 0.4 are utilized of musical noise generated and the noise reduction
for BF+SS and β = 2.0 and η = 0.0 are utilized for performance. Thus, we should use an appropriate
chSS+BF for the super-Gaussian input signal. Table 1 shows structure depending on the application.
the result of the comparison, from which we can see that
the amount of speech distortion originating from BF+SS and These results should be applicable under different SNR con-
chSS+BF is almost the same for the Gaussian input signal. ditions because our analysis is independent of the noise level.
For the super-Gaussian input signal, the speech distortion In the case of more reverberation, the observed signal tends
originating from BF+SS is less than that from chSS+BF. This to become Gaussian because many reverberant components
is owing to the difference in the flooring parameter for each are mixed. Therefore, the behavior of both methods under
method. more reverberant conditions should be similar to that in the
In conclusion, all of these results are strong evidence for case of a Gaussian signal.
the validity of the analysis in Sections 4, 5, and 6. These
results suggest the following.
7.2. Subjective Evaluation. Next, we conduct a subjective
evaluation to confirm that chSS+BF can mitigate musical
(i) Although BF+SS can reduce the amount of musical
noise. In the evaluation, we presented two signals processed
noise by employing a larger flooring parameter,
by BF+SS and by chSS+BF to seven male examinees in
it leads to a deterioration of the noise reduction
random order, who were asked to select which signal they
performance.
considered to contain less musical noise (the so-called AB
(ii) In contrast, chSS+BF can reduce the kurtosis ratio, method). Moreover, we instructed examinees to evaluate
which corresponds to the amount of musical noise only the musical noise and not to consider the amplitude of
generated, without degradation of the noise reduc- the remaining noise. Here, the flooring parameter in BF+SS
tion performance. was automatically determined so that the output SNR of
EURASIP Journal on Advances in Signal Processing 19
6 6
5 5
Kurtosis ratio
Kurtosis ratio
4 4
3 3
2 2
1 1
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
(a) chSS+BF (b) BF+SS (η = 0.0)
6 6
5 5
Kurtosis ratio
Kurtosis ratio
4 4
3 3
2 2
1 1
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
(c) BF+SS (η = 0.2) (d) BF+SS (η = 0.4)
5
Kurtosis ratio
1
2 4 6 8 10 12 14 16
Number of microphones
Experimental
Theoretical
(e) BF+SS (η = 0.8)
Figure 19: Comparison between experimental and theoretical kurtosis ratios for super-Gaussian input signal.
BF+SS and chSS+BF was equivalent. We used the preference used. Note that noises (b) and (c) were recorded in the actual
score as the index of the evaluation, which is the frequency of room shown in Figure 14 and therefore include interchannel
the selected signal. correlation because they were recordings of actual noise
In the experiment, three types of noise, (a) artificial signals.
spatially uncorrelated white Gaussian noise, (b) recorded Each test sample is a 16-kHz-sampled signal, and
railway-station noise emitted from 36 loudspeakers, and (c) the target speech is the original speech convoluted with
recorded human speech emitted from 36 loudspeakers, were impulse responses recorded in a room with 200 millisecond
20 EURASIP Journal on Advances in Signal Processing
20 20
Noise reduction performance
10 10
5 5
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
(a) chSS+BF (b) BF+SS (η = 0.0)
20 20
Noise reduction performance
10 10
5 5
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
(c) BF+SS (η = 0.2) (d) BF+SS (η = 0.4)
20
Noise reduction performance
15
10
5
2 4 6 8 10 12 14 16
Number of microphones
Experimental
Theoretical
(e) BF+SS (η = 0.8)
Figure 20: Comparison between experimental and theoretical noise reduction performances for super-Gaussian input signal.
reverberation (see Figure 14) and to which the above- Figure 21 shows the subjective evaluation results, which
mentioned recorded noise signal is added. Ten pairs of signals confirm that the output of chSS+BF is preferred to that
per type of noise, that is, a total of 30 pairs of processed of BF+SS, even for actual acoustic noises including non-
signals, were presented to each examinee. Gaussianity and interchannel correlation properties.
EURASIP Journal on Advances in Signal Processing 21
80 distribution becomes
60 1 α−1 x + βαθ
PGM (x) = α
· x + βαθ exp −
40 Γ(α)θ θ (A.1)
20
x ≥ −βαθ .
0
White Gaussian Station noise from Speech from 36
36 loudspeakers loudspeakers Since the domain of the original gamma distribution is x ≥
0, the domain of the resultant p.d.f. is x ≥ −βαθ. Thus,
chSS+BF negative-power components with nonzero probability arise,
which can be represented by
BF+SS
95% confidence interval 1 α−1 x + βαθ
Pnegative (x) = α
· x + βαθ exp −
Figure 21: Subjective evaluation results. Γ(α)θ θ
−βαθ ≤ x ≤ 0 ,
(A.2)
8. Conclusion
where Pnegative (x) is part of PGM (x). To remove the negative-
In this paper, we analyze two methods of integrating power components, the signals corresponding to Pnegative (x)
microphone array signal processing and SS, that is, BF+SS are replaced by observations multiplied by a small positive
and chSS+BF, on the basis of HOS. As a result of the analysis,
value η. The observations corresponding to (A.2), Pobs (x),
it is revealed that the amount of musical noise generated
are given by
via SS strongly depends on the statistical characteristics of
the input signal. Moreover, it is also clarified that the noise
1 α−1 x
reduction performances of BF+SS and chSS+BF are different Pobs (x) = α
· (x) exp − 0 ≤ x ≤ βαθ .
Γ(α)θ θ
except in the case of a Gaussian input signal. As a result of (A.3)
our analysis under equivalent noise reduction performance
conditions, it is shown that chSS+BF reduces musical noise Since a small positive flooring parameter η is applied to
more than BF+SS in almost all practical cases. The results (A.3), the scale parameter θ becomes η2 θ and the range is
of a computer simulation also support the validity of our changed from 0 ≤ x ≤ βαθ to 0 ≤ x ≤ βαη2 θ. Then, (A.3) is
analysis. Moreover, by carrying out a subjective evaluation, modified to
it is confirmed that the output of chSS+BF is considered to
contain less musical noise than that of BF+SS. These analytic 1 x
Pfloor (x) = α · (x)α−1 exp − 2
and experimental results imply the considerable potential of 2
Γ(α) η θ η θ (A.4)
optimization based on HOS to reduce musical noise.
As a future work, it remains necessary to carry out 0 ≤ x ≤ βαη2 θ ,
signal analysis based on more general distributions. For
instance, analysis using a generalized gamma distribution where Pfloor (x) is the probability of the floored components.
[26, 27] can lead to more general results. Moreover, an exact This Pfloor (x) is superimposed on the p.d.f. given by (A.1)
formulation of how kurtosis is changed through DS under within the range 0 ≤ x ≤ βαη2 θ. By considering the positive
a coherent condition is still an open problem. Furthermore, range of (A.1) and Pfloor (x), the resultant p.d.f. of SS can be
the robustness of BF+SS and chSS+BF against low-SNR or formulated as
more reverberant conditions is not discussed in this paper.
In the future, the discussion should involve not only noise PSS (z)
reduction performance and musical-noise generation but ⎧
also such robustness. ⎪
⎪
⎪
1 α−1 z + βαθ
⎪
⎪ z + βαθ exp −
⎪
⎪ θ α Γ(α) θ
⎪
⎪
⎪
⎪
⎪
⎪ z ≥ βαη2 θ ,
Appendices ⎪
⎪
⎪
⎪
⎪
⎨ (A.5)
A. Derivation of (13) = 1 α−1 z + βαθ
⎪
⎪ z + βαθ exp −
⎪ α
⎪ θ Γ(α) θ
⎪
⎪
When we assume that the input signal of the power domain ⎪
⎪
⎪
⎪ 1 z
can be modeled by a gamma distribution, the amount ⎪
⎪
+ 2 α −1
z exp − 2
α
⎪
⎪
of subtraction is βαθ. The subtraction of the estimated ⎪
⎪ η θ Γ(α) η θ
⎪
⎪
⎩ 2
noise power spectrum in each frequency subband can be 0 < z < βαη θ ,
considered as a lateral shift of the p.d.f. to the zero-power
direction (see Figure 5). As a result of this subtraction, the where the variable x is replaced with z for convenience.
22 EURASIP Journal on Advances in Signal Processing
B. Derivation of (14) Consequently, using (B.4) and (B.5), the kurtosis after SS is
given as
To derive the kurtosis after SS, the 2nd- and 4th-order
moments of z are required. For PSS (z), the 2nd-order
moment is given by F α, β, η
kurtSS = Γ(α) , (B.6)
∞ G2 α, β, η
μ2 = z2 · PSS (z)dz
0
where
∞
1 α−1 z + βαθ
= z2 α z + βαθ exp − dz (B.1)
0 θ Γ(α) θ G α, β, η = Γ(α)Γ βα, α + 2 − 2βαΓ βα, α + 1
βαη2 θ
1 z + β2 α2 Γ βα, α + η4 γ βα, α + 2 ,
+ z2 2 α zα−1 exp − 2 dz.
0 η θ Γ(α) η θ
F α, β, η = Γ βα, α + 4 − 4βαΓ βα, α + 3
We now expand the first term of the right-hand side of (B.1).
Here, let t = (z + βαθ)/θ; then θdt = dz and z = θ(t − βα). + 6β2 α2 Γ βα, α + 2 − 4β3 α3 Γ βα, α + 1
Consequently,
+ β4 α4 Γ βα, α + η8 γ βα, α + 4 .
∞ (B.7)
1
2
α−1 z + βαθ
z α z + βαθ exp − dz
0 θ Γ(α) θ
∞ C. Derivation of (22)
1 2
= θ t − βα α 2
(θt)α−1 exp{−t }θdt
βα θ Γ(α) As described in (12), the power-domain signal is the sum of
∞ two squares of random variables with the same distribution.
θ2 (p)
= t 2 − 2βαt + β2 α2 t α−1 exp{−t }dt Using (18), the power-domain cumulants Kn can be written
Γ(α) βα as
θ2
⎧ (p)
= Γ βα, α + 2 − 2βαΓ βα, α + 1 + β2 α2 Γ βα, α .
Γ(α) ⎪
⎪
⎪K1 = 2K1(2) ,
⎪
⎪
(B.2) ⎨K (p) = 2K (2) ,
power-domain cumulants ⎪ 2(p) 2
(2) (C.1)
Next we consider the second term of the right-hand side of ⎪
⎪K = 2K 3 ,
⎪
⎪
3
(B.1). Here, let t = z/(η2 θ); then η2 θdt = dz. Thus, ⎩ (p)
K4 = 2K4(2) ,
βαη2 θ
2 1 α−1 z
z α z exp − 2 dz where Kn(2) is the nth square-domain moment. Here, the
0 η2 θ Γ(α) η θ
p.d.f. of such a square-domain signal is not symmetrical and
βα its mean is not zero. Thus, we utilize the following relations
2 1 2 α−1
= η2 θt 2 α η θt exp{−t }η2 θdt between the moments and cumulants around the origin:
0 η θ Γ(α)
βα ⎧
η4 θ 2 γ βα, α + 2 ⎪
= t α+1 exp{−t }dt = η4 θ 2 . ⎨μ1 = κ1 ,
⎪
Γ(α) 0 Γ(α) moments ⎪μ2 = κ2 + κ12 , (C.2)
(B.3) ⎪
⎩μ = κ + 4κ κ + 3κ2 + 6κ κ2 + κ4 ,
4 4 3 1 2 2 1 1
As a result, the 2nd-order moment after SS, μ(SS)
2 , is a
composite of (B.2) and (B.3) and is given as where μn is the nth-order raw moment and κn is the nth-
θ2 order cumulant. Moreover, the square-domain moments μ(2)
μ(SS)
2 = Γ βα, α + 2 − 2βαΓ βα, α + 1 n
Γ(α) (B.4) can be expressed by
+β2 α2 Γ βα, α + η4 γ βα, α + 2 . ⎧ (2)
⎪
In the same manner, the 4th-order moment after SS, ⎨ μ1 = μ2 ,
⎪
μ(SS) squared-domain moments ⎪μ(2)
2 = μ4 , (C.3)
4 , can be represented by ⎪
⎩ (2)
μ4 = μ8 .
θ4
μ(SS)
4 = Γ βα, α + 4 − 4βαΓ βα, α + 3
Γ(α)
Using (C.1)–(C.3), the power-domain moments can be
+ 6β2 α2 Γ βα, α + 2 − 4β3 α3 Γ βα, α + 1 expressed in terms of the 4th- and 8th-order moments in the
time domain. Therefore, to obtain the kurtosis after DS in
+β4 α4 Γ βα, α + η8 γ βα, α + 4 . the power domain, the moments and cumulants after DS up
(B.5) to the 8th order are needed.
EURASIP Journal on Advances in Signal Processing 23
The 3rd-, 5th-, and 7th-order cumulants are zero because D. Derivation of (24)
we assume that the p.d.f. of x j is symmetrical and that its
mean is zero. If these conditions are satisfied, the following According to (11), the shape parameter α corresponding to
relations between moments and cumulants hold: the kurtosis after DS, kurtDS , is given by the solution of the
quadratic equation:
⎧
⎪
⎪μ1 = 0,
⎪
⎪ (α + 2)(α + 3)
⎪
⎪ kurtDS = . (D.1)
⎪
⎪ μ2 = κ2 ,
⎪
⎪ α(α + 1)
⎨
moments ⎪μ4 = κ4 + 3κ22 ,
⎪
⎪ This can be expanded as
⎪
⎪
⎪
⎪ μ6 = κ6 + 15κ4 κ2 + 15κ23 ,
⎪
⎪
⎪
⎩μ = κ + 35κ2 + 28κ κ + 210κ2 κ + 105κ4 . α2 (kurtDS − 1) + α(kurtDS − 5) − 6 = 0. (D.2)
8 8 4 2 6 2 4 2
(C.4)
Using the quadratic formula,
Using (21) and (C.4), the time-domain moments after !
−kurtDS + 5 ± kurt2DS + 14 kurtDS + 1
DS are expressed as α = , (D.3)
2 kurtDS − 2
⎧ (DS)
⎪
⎪μ 2 = K2 ,
⎪
⎪ whose denominator is larger than zero because kurtDS > 1.
⎪
⎪
⎪ (DS) Here, since α > 0, we must select the appropriate numerator
⎪μ4 = K4 + 3K2 ,
⎪ 2
⎪
⎨ of (D.3). First, suppose that
moments after DS ⎪μ(DS)
6 = K6 + 15K2 K4 + 15K23 ,
⎪
⎪ !
⎪
⎪
⎪
⎪
⎪
μ(DS)
8 = K8 + 35K42 + 28K2 K6 −kurtDS + 5 + kurt2DS + 14 kurtDS + 1 > 0. (D.4)
⎪
⎪
⎩
+210K2 K4 + 105K2 ,
2 4
where μ(DS)
n is the nth-order raw moment after DS in the time !
domain. −kurtDS + 5 > − kurt2DS + 14 kurtDS + 1. (D.5)
Using (C.2), (C.3), and (C.5), the square-domain cumu-
lants can be written as When kurtDS ≥ 5, the following relation also holds:
⎧ (2)
⎪ (−kurtDS + 5)2 < kurt2DS + 14 kurtDS + 1,
⎪ K1 = K2 ,
⎪
⎪ (D.6)
⎪
⎪
⎪
⎪K2(2) = K4 + 2K22 , ⇐⇒ 24 kurtDS > 24.
⎪
⎪
⎨
square-domain cumulants ⎪K3(2) = K6 +12K4 K2 +8K23 ,
⎪
⎪ Since (D.6) is true when kurtDS ≥ 5, (D.4) holds. In
⎪
⎪
⎪
⎪
⎪
K4(2) = K8 +32K42 +24K2 K6 summary, (D.4) always holds for 1 < kurtDS < 5 and 5 ≤
⎪
⎪
⎩ kurtDS . Thus,
+144K2 K4 + 48K2 ,
2 4
(C.6) !
−kurtDS + 5 + kurt2DS + 14 kurtDS + 1 > 0 for kurtDS > 1.
(D.7)
where Kn(2) is the nth-order cumulant in the square domain.
Moreover, using (C.1), (C.2), and (C.6), the 2nd- and Overall,
4th-order power-domain moments can be written as
!
−kurtDS + 5 + kurt2DS + 14 kurtDS + 1
(p) > 0. (D.8)
μ2 = 2 K4 + 4K22 , 2 kurtDS − 2
(p)
μ4 = 2 K8 + 38K42 + 32K6 K2 + 288K4 K22 + 192K24 . On the other hand, let
(C.7) !
−kurtDS + 5 − kurt2DS + 14 kurtDS + 1 > 0. (D.9)
As a result, the power-domain kurtosis after DS, kurtDS , is
given as ! satisfied when kurtDS > 5 because
This inequality is not
−kurtDS + 5 < 0 and kurt2DS + 14 kurtDS + 1 > 0. Now (D.9)
K8 + 38K42 + 32K2 K6 + 288K22 K4 + 192K24 can be modified as
kurtDS = .
2K42 + 16K22 K4 + 32K24 !
(C.8) −kurtDS + 5 > kurt2DS + 14 kurtDS + 1, (D.10)
24 EURASIP Journal on Advances in Signal Processing
then the following relation also holds for 1 < kurtDS ≤ 5: This can be rewritten as
Γ(α) γ βα, α + 1
(−kurtDS + 5)2 > kurt2DS + 14 kurtDS + 1, η2
(D.11) Γ(α) α
⇐⇒ 24 kurtDS < 24. " #
Γ βα, α + 1 γ βα, α + 1
= − β · Γ βα, α + η2 (E.5)
This is not true for 1 < kurtDS ≤ 5. Thus, (D.9) is not α α
appropriate for kurtDS > 1. Therefore, α corresponding to " #
Γ(α) Γ βα, α + 1
kurtDS is given by − − β · Γ βα
, α
,
Γ(α) α
!
−kurtDS + 5 + kurt2DS + 14 kurtDS + 1 and consequently
α = . (D.12)
2 kurtDS − 2 " #
α Γ(α)
η =
2 H α, β, η − I α, β , (E.6)
E. Derivation of (38) γ βα, α + 1 Γ(α)
For 0 < α ≤ 1, which corresponds to a Gaussian or super- where H (α, β, η) is defined by (39) and I(α, β) is given by
Gaussian input signal, it is revealed that the noise reduction (40). Using (E.3) and (E.4), the right-hand side of (E.5) is
performance of BF+SS is superior to that of chSS+BF from clearly greater than or equal to zero. Moreover, since Γ(α) >
the numerical simulation in Section 5.3. Thus, the following 0, Γ(α) > 0, α > 0, and γ(βα, α + 1) > 0, the right-hand side
relation holds: of (E.6) is also greater than or equal to zero. Therefore,
$ " #
1 %
% α Γ(α)
− 10 log10 η = & · H α, β, η − I α, β . (E.7)
J · Γ(α) γ βα, α + 1 Γ(α)
" #
Γ βα, α + 1 γ βα, α + 1
× − β · Γ βα + η2
, α
α α Acknowledgment
1 This work was partly supported by MIC Strategic Informa-
≥ −10 log10
J · Γ(α) tion and Communications R&D Promotion Programme in
" # Japan.
Γ βα, α + 1 γ βα, α + 1
× − β · Γ βα, α + η2 .
α α
(E.1) References
[1] M. Brandstein and D. Ward, Eds., Microphone Arrays: Signal
This inequality corresponds to Processing Techniques and Applications, Springer, Berlin, Ger-
" many, 2001.
#
1 Γ βα, α + 1 γ βα, α + 1 [2] J. L. Flanagan, J. D. Johnston, R. Zahn, and G. W. Elko,
− β · Γ βα + η2
, α “Computer-steered microphone arrays for sound transduc-
Γ(α) α α
tion in large rooms,” Journal of the Acoustical Society of
" # America, vol. 78, no. 5, pp. 1508–1518, 1985.
1 Γ βα, α + 1 γ βα, α + 1
≤ − β · Γ βα, α + η2 . [3] M. Omologo, M. Matassoni, P. Svaizer, and D. Giuliani,
Γ(α) α α “Microphone array based speech recognition with different
(E.2) talker-array positions,” in Proceedings of the International
Conference on Acoustics, Speech, and Signal Processing (ICASSP
Then, the new flooring parameter η in BF+SS, which makes ’97), pp. 227–230, Munich, Germany, September 1997.
the noise reduction performance of BF+SS equal to that of [4] H. F. Silverman and W. R. Patterson, “Visualizing the perfor-
chSS+BF, satisfies η ≥ η (≥ 0) because mance of large-aperture microphone arrays,” in Proceedings of
the International Conference on Acoustics, Speech, and Signal
Processing (ICASSP ’99), pp. 962–972, 1999.
γ βα, α + 1
≥ 0. (E.3) [5] O. Frost, “An algorithm for linearly constrained adaptive array
α processing,” Proceedings of the IEEE, vol. 60, pp. 926–935,
1972.
Moreover, the following relation for η also holds: [6] L. J. Griffiths and C. W. Jim, “An alternative approach to lin-
" # early constrained adaptive beamforming,” IEEE Transactions
1 Γ βα, α + 1 γ βα, α + 1 on Antennas and Propagation, vol. 30, no. 1, pp. 27–34, 1982.
− β · Γ βα + η2
, α [7] Y. Kaneda and J. Ohga, “Adaptive microphone-array system
Γ(α) α α
for noise reduction,” IEEE Transactions on Acoustics, Speech,
" #
1 Γ βα, α + 1 γ βα, α + 1 and Signal Processing, vol. 34, no. 6, pp. 1391–1400, 1986.
= − β · Γ βα, α + η2 . [8] S. Boll, “Suppression of acoustic noise in speech using spectral
Γ(α) α α subtraction,” IEEE Transactions on Acoustics, Speech and Signal
(E.4) Processing, vol. 27, no. 2, pp. 113–120, 1979.
EURASIP Journal on Advances in Signal Processing 25
[9] J. Meyer and K. Simmer, “Multi-channel speech enhancement Journal on Applied Signal Processing, vol. 2003, no. 11, pp.
in a car environment using Wiener filtering and spectral 1135–1146, 2003.
subtraction,” in Proceedings of the International Conference [23] M. Mizumachi and M. Akagi, “Noise reduction by paired-
on Acoustics, Speech, and Signal Processing (ICASSP ’97), pp. microphone using spectral subtraction,” in Proceedings of
1167–1170, 1997. the International Conference on Acoustics, Speech, and Signal
[10] S. Fischer and K. D. Kammeyer, “Broadband beamforming Processing (ICASSP ’98), vol. 2, pp. 1001–1004, 1998.
with adaptive post filtering for speech acquisition in noisy [24] T. Takatani, T. Nishikawa, H. Saruwatari, and K. Shikano,
environment,” in Proceedings of the International Conference “High-fidelity blind separation of acoustic signals using
on Acoustics, Speech, and Signal Processing (ICASSP ’97), pp. SIMO-model-based independent component analysis,” IEICE
359–362, 1997. Transactions on Fundamentals of Electronics, Communications
[11] R. Mukai, S. Araki, H. Sawada, and S. Makino, “Removal and Computer Sciences, vol. E87-A, no. 8, pp. 2063–2072, 2004.
of residual cross-talk components in blind source separation
[25] S. Ikeda and N. Murata, “A method of ICA in the frequency
using time-delayed spectral subtraction,” in Proceedings of
domain,” in Proceedings of the International Workshop on
the International Conference on Acoustics, Speech, and Signal
Independent Component Analysis and Blind Signal Separation,
Processing (ICASSP ’02), pp. 1789–1792, Orlando, Fla, USA,
pp. 365–371, 1999.
May 2002.
[12] J. Cho and A. Krishnamurthy, “Speech enhancement using [26] E. W. Stacy, “A generalization of the gamma distribution,” The
microphone array in moving vehicle environment,” in Pro- Annals of Mathematical Statistics, pp. 1187–1192, 1962.
ceedings of the IEEE Intelligent Vehicles Symposium, pp. 366– [27] K. Kokkinakis and A. K. Nandi, “Generalized gamma density-
371, Graz, Austria, April 2003. based score functions for fast and flexible ICA,” Signal
[13] Y. Ohashi, T. Nishikawa, H. Saruwatari, A. Lee, and K. Processing, vol. 87, no. 5, pp. 1156–1162, 2007.
Shikano, “Noise robust speech recognition based on spatial [28] J. W. Shin, J.-H. Chang, and N. S. Kim, “Statistical modeling
subtraction array,” in Proceedings of the International Workshop of speech signals based on generalized gamma distribution,”
on Nonlinear Signal and Image Processing, pp. 324–327, 2005. IEEE Signal Processing Letters, vol. 12, no. 3, pp. 258–261, 2005.
[14] J. Even, H. Saruwatari, and K. Shikano, “New architecture [29] L. Rabiner and B. Juang, Fundamentals of Speech Recognition,
combining blind signal extraction and modified spectral sub- Prentice-Hall PTR, 1993.
traction for suppression of background noise,” in Proceedings
of the International Workshop on Acoustic Echo and Noise
Control (IWAENC ’08), Seattle, Wash, USA, 2008.
[15] Y. Takahashi, T. Takatani, K. Osako, H. Saruwatari, and K.
Shikano, “Blind spatial subtraction array for speech enhance-
ment in noisy environment,” IEEE Transactions on Audio,
Speech and Language Processing, vol. 17, no. 4, pp. 650–664,
2009.
[16] S. B. Jebara, “A perceptual approach to reduce musical
noise phenomenon with Wiener denoising technique,” in
Proceedings of the International Conference on Acoustics, Speech,
and Signal Processing (ICASSP ’06), vol. 3, pp. 49–52, 2006.
[17] Y. Ephraim and D. Malah, “Speech enhancement using a
minimum mean-square error short-time spectral amplitude
estimator,” IEEE Transactions on Acoustics, Speech, and Signal
Processing, vol. 32, no. 6, pp. 1109–1121, 1984.
[18] Y. Uemura, Y. Takahashi, H. Saruwatari, K. Shikano, and
K. Kondo, “Automatic optimization scheme of spectral sub-
traction based on musical noise assessment via higher-order
statistics,” in Proceedings of the International Workshop on
Acoustic Echo and Noise Control (IWAENC ’08), Seattle, Wash,
USA, 2008.
[19] Y. Uemura, Y. Takahashi, H. Saruwatari, K. Shikano, and K.
Kondo, “Musical noise generation analysis for noise reduction
methods based on spectral subtraction and MMSE STSA
estimatio,” in Proceedings of the International Conference on
Acoustics, Speech, and Signal Processing (ICASSP ’09), pp.
4433–4436, 2009.
[20] Y. Takahashi, Y. Uemura, H. Saruwatari, K. Shikano, and K.
Kondo, “Musical noise analysis based on higher order statistics
for microphone array and nonlinear signal processing,” in
Proceedings of the International Conference on Acoustics, Speech,
and Signal Processing (ICASSP ’09), pp. 229–232, 2009.
[21] P. Comon, “Independent component analysis, a new concept?”
Signal Processing, vol. 36, pp. 287–314, 1994.
[22] H. Saruwatari, S. Kurita, K. Takeda, F. Itakura, T. Nishikawa,
and K. Shikano, “Blind source separation combining inde-
pendent component analysis and beamforming,” EURASIP
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 509541, 13 pages
doi:10.1155/2010/509541
Research Article
Microphone Diversity Combining for In-Car Applications
Copyright © 2010 Jürgen Freudenberger et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
This paper proposes a frequency domain diversity approach for two or more microphone signals, for example, for in-car
applications. The microphones should be positioned separately to insure diverse signal conditions and incoherent recording of
noise. This enables a better compromise for the microphone position with respect to different speaker sizes and noise sources. This
work proposes a two-stage approach. In the first stage, the microphone signals are weighted with respect to their signal-to-noise
ratio and then summed similar to maximum ratio combining. The combined signal is then used as a reference for a frequency
domain least-mean-squares (LMS) filter for each input signal. The output SNR is significantly improved compared to coherence-
based noise reduction systems, even if one microphone is heavily corrupted by noise.
SNR (dB)
different signal conditions? In this paper, we consider a diver- 20
sity technique that combines the processed signals of several 10
separate microphones. The basic idea of our approach is to 0
apply maximum-ratio-combining (MRC) to speech signals, −10
−20
where we propose a frequency domain diversity approach for 0 1000 2000 3000 4000 5000
two or more microphone signals. MRC maximizes the signal- Frequency (Hz)
to-noise ratio in the combined signal.
A major issue for the application of maximum-ratio- mic. 1
mic. 2
combining for multimicrophone setups is the estimation
of the acoustic transfer functions. In telecommunications, Figure 1: Input SNR values for a driving situation at a car speed of
the signal attenuation as well as the phase shift for each 100 km/h.
transmission path are usually measured to apply MRC. With
speech applications we have no means to directly measure
the acoustic transfer functions. There exists several blind
approaches to estimate the acoustic transfer functions (see environment. For these measurements, we used two cardioid
e.g., [14–16]) which were successfully applied to derever- microphones with positions suited for car integration. One
beration. However, the proposed estimation methods are microphone (denoted by mic. 1) was installed close to the
computationally demanding. inside mirror. The second microphone (mic. 2) was mounted
In this paper, we show that maximum-ratio-combining at the A-pillar.
can be achieved without explicit knowledge of the acoustic Figure 1 depicts the SNR versus frequency for a driving
transfer functions. Proper signal weighting can be achieved situation at a car speed of 100 km/h. From this figure, we
based on an estimate of the input signal-to-noise ratio. We observe that the SNR values are quite distinct for these
propose a two stage processing of the microphone signals. two microphone positions with differences of up to 10 dB
In the first stage, the microphone signals are weighted depending on the particular frequency. We also note that
with respect to their input signal-to-noise ratio. These the better microphone position is not obvious in this case,
weights guarantee maximum-ratio-combining of the signals because the SNR curves cross several times.
with respect to the signal magnitudes. To ensure cophasal Theoretically, a MRC combining of the two input signals
addition of the weighted signals, we use the combined would result in an output SNR equal to the sum of the input
signal as reference signal for frequency domain LMS filters SNR values. With two inputs, MRC achieves a maximum
in the second stage. These filters adjust the phases of the gain of 3 dB for equal input SNR values. In case of the input
microphone signals to guarantee coherent signal combining. SNR values being rather different, the sum is dominated by
The proposed concept is similar to the single channel the maximum value. Hence, for the curves in Figure 1 the
noise reduction system presented by Mukherjee and Gwee output SNR would essentially be the envelope of the two
[17]. This system uses spectral subtraction to obtain a crude curves.
estimate of the speech signal. This estimate is then used as Next we consider the coherence for the noise and speech
the reference signal of a single LMS filter. In this paper, we signals. The corresponding results are depicted in Figure 2.
generalize this concept to multimicrophone systems, where The figure presents measurements for two microphones
our aim is not only noise reduction, but also dereverberation installed close to the inside mirror in an end-fire beamformer
of the microphone signals. constellation with a microphone distance of 7 cm. The lower
The paper is organized as follows: In Section 2, we figure contains the results for the microphone positions
present some measurement results obtained in a car environ- mic. 1 and mic. 2 (distance of 65 cm). From these results,
ment. This results motivate the proposed diversity approach. we observe that the noise coherence closely follows the
In Section 3, we present a signal combiner that achieves theoretical coherence function (dotted line in Figure 2) in an
MRC weighting based on the knowledge of the input ideal diffuse sound field [18]. Separating the microphones
signal-to-noise ratios. Coherence based signal combining significantly reduces the noise coherence for low frequencies.
is discussed in Section 4. In the subsequent section, we On the other hand, both microphone constellations have
consider implementation issues. In particular, we present similar speech coherence. We note that the speech coherence
an estimator for the required input signal-to-noise ratios. is not ideal, as it has steep dips. The corresponding frequen-
Finally, in Section 6, we present some simulation results for cies will probably be attenuated by a signal combiner that is
different real world noise situations. solely based on coherence.
M
2
=
=X + G(1) + G(2) + ··· , 1
MRC N1 MRC N2 H j f (19)
cSC f j =1
where we have omitted the dependency on f . The estimated
speech spectrum X( f ) is therefore equal to the actual speech can be interpreted as the resulting transfer characteristic
spectrum X( f ) plus some weighted noise term. of the system. An example is depicted in Figure 3. The
The filter defined in (12) was previously applied to speech upper figure presents the measured transfer characteristics
dereverberation by Gannot and Moonen in [14], because for two microphones in a car environment. Note that the
it ideally equalizes the microphone signals if a sufficiently microphones have a high-pass characteristic and attenuate
accurate estimate of the acoustic transfer functions is avail- signal components for frequencies below 1 kHz. The lower
able. The problem at hand with maximum-ratio-combining figure is the curve 1/cSC ( f ). The spectral combiner equalizes
is that it is rather difficult and computationally complex to most of the deep dips in the transfer functions from the
explicitly estimate the acoustic transfer characteristic Hi ( f ) mouth of the speaker to the microphones while the envelope
for our microphone system. of the transfer functions is not equalized.
In the next section, we show that MRC combining
can be achieved without explicit knowledge of the acoustic 3.3. Magnitude Combining. One challenge in multimicro-
channels. The weights for the different microphones can phone systems with spatially separated microphones is a
be calculated based on an estimate of the signal-to-noise reliable phase estimation of the different input signals. For
ratio for each microphone. The proposed filter achieves a a coherent combining of the speech signals, we have to
signal-to-noise ratio according to (9), but does not guarantee compensate the phase difference between the speech signals
perfect equalization. at each microphone. Therefore, it is sufficient to estimate the
phase differences to a reference microphone, for example,
3.2. Diversity Combining for Speech Signals. We consider the to the first microphone Δi ( f ) = φ1 ( f ) − φi ( f ), for all i =
weights 2, . . . , M. Cophasal addition is then achieved by
γi f X = G(1) (2) jΔ2
Y2 + G(3) jΔ3
Y3 · · · .
G(i) = M . (14) SC Y1 + GSC e SC e (20)
SC
j =1 γ j f
But a reliable estimation of the phase differences is only
Assuming the noise power is the same for all microphones possible in speech active periods and furthermore only for
and substituting γi ( f ) by (10) leads to that frequencies where speech is present. Estimating the
phase differences
Hi f Hi f
2
(i)
GSC f = 2 =
. Y1 f Yi∗ f
M M
2
(15)
e jΔi ( f ) = E
j =1 H j f H j f Y1 f Yi f (21)
j =1
EURASIP Journal on Advances in Signal Processing 5
leads to unreliable phase values for time-frequency points Transfer characteristics to the microphones
without speech. In particular, if Hi ( f ) = 0 for some
H1 ( f ), H2 ( f ) (dB)
0
frequency f , the estimated phase Δi ( f ) is undefined. A −10
combining using this estimate leads to additional signal
distortions. Additionally, noise correlation would distort the −20
1/cSC (dB)
differences. −20
Because of the drawbacks, which come along with the
−30
phase estimation methods described above, we propose
another scheme. Therefore, we use a two stage combining −40
500 1000 1500 2000 2500 3000 3500 4000 4500 5000
approach. In the first stage, we use the spectral combining Frequency (Hz)
approach as described in Section 3.2 with a simple magni-
tude combining of the microphone signals. For the mag- (b)
nitude combining the noisy phase of the first microphone Figure 3: Transfer characteristics to the microphones and of the
signal is adopted to the other microphone signals. This is also combined signal.
obvious in Figure 5, where the phase of the noisy spectrum
e j φ1 ( f ) is taken for the spectrum at the output of the filter
G(2)
SC ( f ), before the signals were combined. This leads to the applied the dereverberation principle of Allen et al. [13]
following incoherent combining of the input signals to noise reduction. In particular, they proposed an LMS-
(2) based time domain algorithm to combine the different
X f = G(1) Y2 f e j φ1 ( f ) + · · ·
SC f Y1 f + GSC f microphone signals. This approach provides effective noise
j φ ( f ) suppression for frequencies where the noise components of
+ G(M) YM f e 1
SC f the microphone signals are uncorrelated.
(1) (2)
However, as we have seen in Section 2, for practical
= GSC f Y1 f + GSC f Y2 f e j φ1 ( f ) + · · · . microphone distances in the range of 0.4 to 0.8 m the noise
(22) signals are correlated for low frequencies. These correlations
reduce the noise suppression capabilities of the algorithm
f ) is equal to
The estimated speech spectrum X( and lead to musical noise.
We will show in this section that a combination of the
X f e j φ1 ( f ) spectral combining with the coherence based approach by
(23) Martin and Vary reduces this issues.
cSC f
plus some weighted noise terms. It follows from the triangle 4.1. Analysis of the LMS Approach. We present now an
inequality that analysis of the scheme by Martin and Vary as depicted in
Figure 4. The filter gi (k) is adapted using the LMS algorithm.
M
2 For stationary signals x(k), n1 (k), and n2 (k), the adaptation
=
1 1
≤ H j f . (24) converts to filter coefficients gi (k) and a corresponding filter
cSC f cSC f j =1 transfer function
Magnitude combining does not therefore guarantee maxi- E Yi∗ f Y j f
mum-ratio-combining. Yet the signal X( f ) is taken as a refer- G(i)
LMS f = , i=
/ j (25)
E Yi f
2
ence signal in the second stage where the phase compensation
is done. This coherence based signal combining scheme is
described in the following section. that minimizes the expected value
(i)
2
E Yi f GLMS f − Y j f , (26)
4. Coherence-Based Combining
As an example of a coherence based diversity system we where E{Yi∗ ( f )Y j ( f )} is the cross-power spectrum of the
first consider the two microphone approach by Martin two microphone signals and E{|Yi ( f )|2 } is the power
and Vary [5, 6] as depicted in Figure 4. Martin and Vary spectrum of the ith microphone signal.
6 EURASIP Journal on Advances in Signal Processing
− 0.5
x(k) h1 (k)
x(k)
−
h2 (k)
n2 (k) g2 (k)
y2 (k) = x(k) ∗ h2 (k) + n2 (k) y1 (k)
Assuming that the speech signal and the noise signals are 4.2. Combining MRC and LMS. To ensure suitable weighting
uncorrelated, (25) can be written as and coherent signal addition we combine the diversity
technique with the LMS approach to process the signals
E X f Hi∗ f H j f + E Ni∗ f N j f
2
of the different microphones. It is informative to examine
G(i)
LMS f = .
E X f Hi f + E Ni f the combined approach under ideal conditions, that is, we
2 2 2
assume ideal MRC weighting.
(27)
Analog to (13), weighting with the MRC gains factors
For frequencies where the noise components are uncorre- according to (12) results in the estimate
lated, that is, E{Ni∗ ( f )N j ( f )} = 0, this formula is reduced
to X f = X f + G(1) (2)
MRC f N1 f + GMRC f N2 f + · · · .
(30)
E X f Hi∗ f H j f
2
G(i)
LMS f = 2 2 . (28) f ) as the reference signal for the
E X f Hi f + E Ni f We now use the estimate X(
2
LMS algorithm. That is, we adapted a filter for each input
The filter G(i) signal such that the expected value
LMS ( f ) according to (28) results in fact in a
minimum mean squared error (MMSE) estimate of the (i)
2
signal X( f )H j ( f ) based on the signal Yi ( f ). Hence, the E Yi f GLMS f − X f (31)
weighted output is a combination of the MMSE estimates
of the speech components of the two input signals. This is minimized. The adaptation results in the filter transfer
explains the good noise reduction properties of the approach functions
by Martin and Vary.
On the other hand, the coherence of the noise depends E Yi∗ f X f
G(i)
LMS f = . (32)
E Yi f
strongly on the distance between the microphones. For in- 2
in this sum is the Wiener filter that results in a minimum This formula shows that noise suppression can be introduced
mean squared error estimate of the signal X( f ) based on by simply adding a constant to the numerator term in (14).
the signal Yi ( f ). The Wiener filter equalizes the microphone Most, if not all, implementations of spectral subtraction
signal and minimizes the mean squared error between the are based on an over-subtraction approach, where an
filter output and the actual speech signal X( f ). Note that the overestimate of the noise power is subtracted from the
phase of the term in (36) is −φi , that is, the filter compensates power spectrum of the input signal (see e.g., [22–25]). Over-
the phase of the acoustic transfer function Hi ( f ). subtraction can be included in (40) by using a constant ρ
The other terms in the sum can be considered as filter larger than one. This leads to the final gain factor
biases where the term in (34) depends on the noise power
n1 (k) Y1 (κ, ν)
y1 (k)
Windowing (1)
GSC (κ, ν) (1)
GLMS (κ, ν)
x(k) ∗ h1 (k) + FFT
−
SNR and gain Phase ν)
X(κ, IFFT x(k)
computing computing + OLA
x(k) ∗ h2 (k) −
Windowing (2)
(2)
GSC (κ, ν) |·| e j φ1 (κ,ν) GLMS (κ, ν)
y2 (k) + FFT
n2 (k) Y2 (κ, ν)
Figure 5: Basic system structure of the diversity system with two inputs.
To perform spectral combining we have to estimate the [23, 27]. With this approach, the noise PSD estimate is
current signal-to-noise ratio based on the noisy microphone determined by the minimum value
input signals. In the next sections, we propose a simple
and efficient method to estimate the noise power spectral λmin,i (κ, ν) = min λY ,i (l, ν) (47)
l∈[κ−W+1,κ]
densities of the microphone inputs.
within a sliding window of W consecutive values of λY ,i (κ, ν).
5.2. PSD Estimation. Commonly the noise PSD is estimated The noise PSD is then estimated by
in speech pauses where the pauses are detected using voice
2
activity detection (VAD, see e.g., [24, 26]). VAD-based E |Ni (κ, ν)| ≈ omin · λmin,i (κ, ν), (48)
methods provide good estimates for stationary noise. How-
ever, they may suffer from error propagation if subsequent where omin is a parameter of the algorithm and should be
decisions are not independent. Other methods, like the min- approximated as
imum statistics approach introduced by Martin [23, 27], use
a continuous estimation that does not explicitly differentiate 1
omin = . (49)
between speech pauses and speech active segments. E{λmin }
Our estimation method combines the VAD approach The MS approach provides a rough estimate of the noise
with the minimum statistics (MS) method. Minimum power that strongly depends on the smoothing parameter α
statistics is a robust technique to estimate the power spectral and the window size of the sliding window (for details cf.
density of non-stationary noise by tracing the minimum of [27]). However, this estimate can be obtained regardless of
the recursively smoothed power spectral density within a speech being present or not.
time window of 1 to 2 seconds. We use these MS estimates The idea of our approach is to approximate the PSD
and a simple threshold test to determine voice activity for by the MS estimate during speech active periods while the
each time-frequency point. smoothed input power is used for time-frequency points
The proposed method prevents error propagation, where speech is absent.
because the MS approach is independent of the VAD. During
speech pauses the noise PSD estimation can be enhanced
2
E |Ni (κ, ν)| ≈ β(κ, ν)omin · λmin,i (κ, ν)
compared with an estimate solely based on minimum (50)
statistics. A similar time-frequency dependent VAD was
+ 1 − β(κ, ν) λY ,i (κ, ν),
presented by Cohen to enhance the noise power spectral
density estimation of minimum statistics [28]. where β(κ, ν) ∈ {0, 1} is an indicator function for speech
For time-frequency points (κ, ν) where the speech signal activity which will be discussed in more detail in the next
is inactive, the noise PSD E{|Ni (κ, ν)|2 } can be approximated section.
by recursive smoothing The current signal-to-noise ratio is then obtained by
2 2
2
E |Ni (κ, ν)| ≈ λY ,i (κ, ν) (45) E |Yi (κ, ν)| − E |Ni (κ, ν)|
γi (κ, ν) = , (51)
2
E |Ni (κ, ν)|
with
assuming that the noise and speech signals are uncorrelated.
λY ,i (κ, ν) = (1 − α)λY ,i (κ − 1, ν) + α|Yi (κ, ν)|2 , (46)
5.3. Voice Activity Detection. Human speech contains gaps
where α ∈ (0, 1) is the smoothing parameter. not only in time but also in frequency domain. It is
During speech active periods the PSD can be estimated therefore reasonable to estimate the voice activity in the time-
using the minimum statistics method introduced by Martin frequency domain in order to obtain a more accurate VAD.
EURASIP Journal on Advances in Signal Processing 9
The VAD function β(κ, ν) can then be calculated upon the The decision rule for the ith channel is based on the
current input noise PSD obtained by minimum statistics. conditional speech presence probability
Our aim is to determine for each time-frequency point ⎧
(κ, ν) whether the speech signal is active or inactive. We ⎪
⎨1, P H1 | Yi
≥ T,
therefore consider the two hypotheses H1 (κ, ν) and H0 (κ, ν) βi (κ, ν) = ⎪ P H0 | Yi (61)
which indicate speech presence or absence at the time- ⎩0, otherwise.
frequency point (κ, ν), respectively. We assume that the
coefficients X(κ, ν) and Ni (κ, ν) of the short-time spectra of The parameter T > 0 enables a tradeoff between the two
both the speech and the noise signal are complex Gaussian possible error probabilities of voice activity detection. A
random variables. In this case, the current input power, that value T > 1 decreases the probability of a false alarm, that
is, squared magnitude |Yi (κ, ν)|2 , is exponentially distributed is, β(κ, ν) = 1 when speech is absent. T < 1 reduces the
with mean (power spectral density) probability of a miss, that is, β(κ, ν) = 0 in the presence of
speech. Note that the generalized likelihood-ratio test
λYi (κ, ν) = E |Y (κ, ν)|2 . (52)
P H1 | Yi pi (κ, ν)
= ≥T (62)
Similarly we define P H0 | Yi 1 − pi (κ, ν)
λXi (κ, ν) = |Hi (κ, ν)|2 E |X(κ, ν)|2 , is according to the Neyman-Pearson-Lemma (see e.g., [30])
(53) an optimal decision rule. That is, for a fixed probability of a
λNi (κ, ν) = E |Ni (κ, ν)|2 . false alarm it minimizes the probability of a miss and vice
versa. The generalized likelihood-ratio test was previously
We assume that speech and noise are uncorrelated. used by Sohn and Sung to detect speech activity in subbands
Hence, we have [29, 31].
The test in inequality (62) is equivalent to
λYi (κ, ν) = λXi (κ, ν) + λNi (κ, ν) (54)
−1 λX,i + λN,i q 1+T
pi (κ, ν) = 1+ exp(−ui ) ≤ ,
during speech active periods and λN,i 1 − q T
(63)
λYi (κ,ν) = λNi (κ, ν) (55)
where we have used (59). Solving for |Yi (κ, ν)|2 using (60),
in speech pauses. we obtain a simple threshold test for the ith microphone
In the following, we occasionally omit the dependency on
κ and ν in order to keep the notation lucid. The conditional 1, 2
|Yi (κ, ν)| ≥ λN,i (κ, ν)Θi (κ, ν),
probability density functions of the random variable Yi = βi (κ, ν) = (64)
0, otherwise.
|Yi (κ, ν)|2 are [29]
⎧ with the threshold
⎪
⎨ 1 exp −Yi ,
⎪
Yi ≥ 0,
f Yi | H0 = ⎪ λNi λNi (56) λ Tq 1 + λX,i /λN,i
⎪
⎩0, Θi (κ, ν) = 1 + N,i log . (65)
Yi < 0, λX,i 1−q
⎧
⎪
⎪ 1 −Yi This threshold test is equivalent to the decision rule in (61).
⎨ exp , Yi ≥ 0,
f Yi | H1 = λXi + λNi
⎪
λXi + λNi (57) With this threshold test, speech is detected if the current
⎪
⎩0, input power |Yi (κ, ν)|2 is greater or equal to the average noise
Yi < 0.
power λN,i (κ, ν) times the threshold Θi (κ, ν). This factor
Applying Bayes rule for the conditional speech presence depends on the input signal-to-noise ratio λX,i /λN,i and the
probability a priori probability of speech absence q(κ, ν).
In order to combine the activity estimates for the
pi (κ, ν) = P H1 | Yi (58) different input signals, we use the following rule
we have [29] 1, if |Yi (κ, ν)|2 ≥ λN,i Θi for any i,
β(κ, ν) = (66)
−1 0, otherwise.
λXi + λNi q
pi (κ, ν) = 1+ exp(−ui ) , (59)
λNi 1 − q
×103 mic. 1 Table 1: Average input SNR values [dB] from mic. 1/mic. 2 for
5 typical background noise conditions in a car.
Frequency (Hz)
4
SNR IN 100 km/h 140 km/h defrost
3
short speaker 1.2/3.1 −0.7/−0.5 1.7/1.3
2
1 tall speaker 1.9/10.8 −0.1/7.2 2.4/9.0
0
1 2 3 4 5 6 7 Table 2: Log spectral distances with minimum statistics noise PSD
Time (s) estimation and with the proposed noise PSD estimator.
(a)
DLS [dB] 100 km/h 140 km/h defrost
×103 Activity
mic. 1 3.93/3.33 2.47/2.07 3.07/1.27
5
mic. 2 4.6/4.5 3.03/2.33 3.4/1.5
Frequency (Hz)
4
3
2 while the second ones are according to a tall person. For all
1 algorithms, we used an FFT length of L = 512 and an overlap
0 of 256 samples. For time windowing we apply a Hamming
1 2 3 4 5 6 7
window.
Time (s)
(b)
6.1. Estimating the Noise PSD. The spectrogram of one input
Figure 6: Spectrogram of the microphone input (mic. 1 at car speed signal and the result of the voice activity detection are shown
of 140 km/h, short speaker). The lower figure depicts the results in Figure 6 for the worst case scenario (short speaker at car
of the voice activity detection (black representing estimated speech speed of 140 km/h). It can be observed that time-frequency
activity) with T = 1.2 and q = 0.5. points with speech activity are reliably detected. Because the
noise PSD is estimated with minimum statistics also during
speech activity, the false alarms in speech pauses do hardly
10
0 affect the noise PSD estimation.
−10 In Figure 7, we compare the estimated noise PSD with
PSD (dB)
−20 actual PSD for the same scenario. The PSD is well approx-
−30 imated with only minor deviations for high frequencies.
−40
To evaluate the noise PSD estimation for several driving
−50
−60 situations we calculated as an objective performance measure
0 1000 2000 3000 4000 5000 6000 the log spectral distance (LSD)
Frequency (Hz)
2
1 λN (ν)
Noise
DLS = 10 log10 (67)
Estimate L ν λN (ν)
Figure 7: Estimated and actual noise PSD for mic. 2 at car speed of
140 km/h. between the actual noise power spectrum λN (ν) and the
estimate λN (ν). From the definition, it is obvious that the
LSD can be interpreted as the mean distance between two
because this is probably the most interesting case for in-car PSDs in dB. An extended analysis of different distance
applications. measures is presented in [33].
With respect to three different background noise situa- The log spectral distances of the proposed noise PSD
tions, we recorded driving noise at 100 km/h and 140 km/h. estimator are shown in Table 2. The first number in each field
As third noise situation, we considered the noise which arises is the LSD achieved with the minimum statistics approach
from an electric fan (defroster). With an artificial head we while the second number is the value for the proposed
recorded speech samples for two different seat positions. scheme. Note that every noise situation was evaluated with
From both positions, we recorded two male and two female four different voices (two male and two female). From these
speech samples, each of a length of 8 seconds. Therefore, results, we observe that the voice activity detection improves
we took the German-speaking speech samples from the rec- the PSD estimation for all considered driving situations.
ommendation P.501 of the International Telecommunication
Union (ITU) [32]. Hence the evaluation was done using 6.2. Spectral Combining. Next we consider the spectral
four different voices with two different speaker sizes, which combining as discussed in Section 3. Figure 8 presents the
leads to 8 different speaker configurations. For all recordings, output SNR values for a driving situation with a car speed of
we used a sampling rate of 11025 Hz. Table 1 contains the 100 km/h. For this simulation we used ρ = 0, that is, spectral
average SNR values for the considered noise conditions. The combining without noise suppression. In addition to the
first values in each field are with respect to a short speaker output SNR, the curve for ideal maximum-ratio-combining
…is depicted. This curve is simply the sum of the input SNR values for the two microphones, which we calculated based on the actual noise and speech signals (cf. Figure 1). We observe that the output SNR curve closely follows the ideal curve, but with a loss of 1–3 dB. This loss is essentially caused by the phase differences of the input signals: with the spectral combining approach, only a magnitude combining is possible. Furthermore, the power spectral densities are estimated from the noisy microphone signals, which leads to an additional loss in the SNR.

6.3. Combining SC and FLMS. The output SNR of the combined approach without additional noise suppression is depicted in Figure 9. The theoretical SNR curve for ideal MRC is closely approximated by the output SNR of the combined system. This is the result of the implicit phase estimation of the FLMS approach, which leads to a coherent combining of the speech signals.

[Figure 9: Output SNR values for the combined approach without additional noise suppression (car speed of 100 km/h, ρ = 0). Curves: Out MRC-FLMS and Ideal MRC; axes: SNR (dB) versus frequency (Hz).]

Now we consider the combined approach with additional noise suppression (ρ = 10). Figure 10 presents the corresponding results for a driving situation with a car speed of 100 km/h. The output SNR curve still follows the ideal MRC curve, but now with a gain of up to 5 dB.

[Figure 10: Output SNR values for the combined approach with additional noise suppression (ρ = 10); axes: SNR (dB) versus frequency (Hz).]

In Table 3, we compare the output SNR values of the three considered noise conditions for different combining techniques. The first value is the output SNR for a short speaker, while the second number represents the result for the tall speaker. The values marked with FLMS correspond to the coherence-based FLMS approach with bias compensation as presented in [21] (see also Section 4.1). The label SC marks results solely based on spectral combining with additional noise suppression as discussed in Sections 3 and 4.3. The results with the combined approach are labeled SC + FLMS. Finally, the values marked with the label ideal FLMS are a benchmark obtained by using the clean and unreverberated speech signal x(k) as a reference for the FLMS algorithm.

Table 3: Output SNR values in dB for different combining techniques (short/tall speaker).

SNR_out      100 km/h    140 km/h    defrost
FLMS         8.8/13.3    4.4/9.0     7.8/12.3
SC           16.3/20.9   13.3/18.0   14.9/19.9
SC + FLMS    13.5/17.8   10.5/15.0   12.5/16.9
ideal FLMS   12.6/15.2   10.5/13.3   14.5/17.3

Table 4: Cosh spectral distances for different combining techniques (short/tall speaker).

d_cosh       100 km/h    140 km/h    defrost
FLMS         0.9/0.9     0.9/1.0     1.2/1.2
SC           1.3/1.4     1.4/1.5     1.5/1.7
SC + FLMS    1.2/1.1     1.2/1.2     1.4/1.5
ideal FLMS   0.9/0.8     1.1/1.0     1.5/1.4

From the results in Table 3, we observe that the spectral combining leads to a significant improvement of the output SNR compared to the coherence-based noise reduction. It even outperforms the "ideal" FLMS scheme. However, the spectral combining introduces undesired speech distortions similar to single-channel noise reduction. This is also indicated by the results in Table 4, which presents distance values for the different combining systems. As an objective measure of speech distortion, we calculated the cosh spectral distance (a symmetrical version of the Itakura-Saito distance) between the power spectra of the clean input signal (without reverberation and noise) and the output speech signal (filter coefficients were obtained from noisy data).
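For readers who want to reproduce the distortion measure, the following is a minimal numpy sketch of the cosh spectral distance used in Table 4, written as the symmetrized Itakura-Saito distortion between two power spectra on a common frequency grid. The function name and the small floor constant are our own choices, not from the paper.

```python
import numpy as np

def cosh_spectral_distance(p_ref: np.ndarray, p_test: np.ndarray) -> float:
    """Symmetric (cosh) spectral distance between two power spectra.

    Computed as the mean of cosh(V) - 1 with V = log(p_ref / p_test),
    i.e. the symmetrized Itakura-Saito distortion.
    """
    eps = 1e-12                              # guard against division/log of zero
    r = (p_ref + eps) / (p_test + eps)
    return float(np.mean(0.5 * (r + 1.0 / r) - 1.0))

# Example: distance between a clean and a slightly distorted frame spectrum
rng = np.random.default_rng(0)
clean = np.abs(rng.standard_normal(257)) ** 2
enhanced = clean * (1.0 + 0.1 * rng.standard_normal(257)) ** 2
print(cosh_spectral_distance(clean, enhanced))
```

The measure is zero when the two spectra are identical and grows symmetrically in either direction of spectral mismatch, which is why it is a convenient objective proxy for speech distortion here.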
The benefit of the combined system is also indicated by the results in Table 5, which presents Mean Opinion Score (MOS) values for the different algorithms. The MOS test was performed by 24 persons. The test set was presented in a randomized order to avoid statistical dependencies on the test order. The FLMS approach using the spectral combining output as its reference signal and the "ideal" FLMS filter reference approach are rated as the best noise reduction algorithms, where the values of the combined approach are similar to the results with the reference implementation of the "ideal" FLMS filter solution. From this evaluation, it can also be seen that the FLMS approach with spectral combining outperforms the pure FLMS and the pure spectral combining algorithms in all tested acoustic situations.

The combined approach sounds more natural compared to the pure spectral combining. The SNR and distance values are close to the "ideal" FLMS scheme. The speech is free of musical tones. The lack of musical noise can also be seen in Figure 11, which shows the spectrograms of the enhanced speech and the input signals.

[Figure 11: Spectrograms of the input and output signals with the SC + FLMS approach (car speed of 100 km/h, ρ = 10); axes: frequency (Hz) versus time (s).]

…we have assumed that the noise power spectral densities are equal for all microphone inputs. This assumption might be unrealistic. However, the simulation results for a two-microphone system demonstrate that a performance close to that of MRC can be achieved in real-world noise situations. Moreover, diversity combining is an effective means to reduce signal distortions due to reverberation and therefore improves the speech intelligibility compared to single-channel noise reduction.

Acknowledgments

Research for this paper was supported by the German Federal Ministry of Education and Research (Grant no. 17 N11 08). Last but not least, the authors would like to thank the reviewers for their constructive comments and suggestions, which greatly improved the quality of this paper.

References

[1] E. Hänsler and G. Schmidt, Acoustic Echo and Noise Control: A Practical Approach, John Wiley & Sons, New York, NY, USA, 2004.
[2] P. Vary and R. Martin, Digital Speech Transmission: Enhancement, Coding and Error Concealment, John Wiley & Sons, New York, NY, USA, 2006.
[3] E. Hänsler and G. Schmidt, Speech and Audio Processing in Adverse Environments: Signals and Communication Technology, Springer, Berlin, Germany, 2008.
[4] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[5] R. Martin and P. Vary, "A symmetric two microphone speech enhancement system: theoretical limits and application in a car environment," in Proceedings of the Digital Signal Processing Workshop, pp. 451–452, Helsingoer, Denmark, August 1992.
[6] R. Martin and P. Vary, "Combined acoustic echo cancellation, dereverberation and noise reduction: a two microphone approach," Annales des Télécommunications, vol. 49, no. 7-8, pp. 429–438, 1994.
[7] A. A. Azirani, R. L. Bouquin-Jeannès, and G. Faucon, "Enhancement of speech degraded by coherent and incoherent noise using a cross-spectral estimator," IEEE Transactions on Speech and Audio Processing, vol. 5, no. 5, pp. 484–487, 1997.
[8] A. Guérin, R. L. Bouquin-Jeannès, and G. Faucon, "A two-sensor noise reduction system: applications for hands-free car kit," EURASIP Journal on Applied Signal Processing, vol. 2003, no. 11, pp. 1125–1134, 2003.
[9] J. Freudenberger and K. Linhard, "A two-microphone diversity system and its application for hands-free car kits," in Proceedings of the European Conference on Speech Communication and Technology (INTERSPEECH '05), pp. 2329–2332, Lisbon, Portugal, September 2005.
[10] T. Gerkmann and R. Martin, "Soft decision combining for dual channel noise reduction," in Proceedings of the 9th International Conference on Spoken Language Processing (INTERSPEECH-ICSLP '06), vol. 5, pp. 2134–2137, Pittsburgh, Pa, USA, September 2006.
[11] J. Freudenberger, S. Stenzel, and B. Venditti, "Spectral combining for microphone diversity systems," in Proceedings of the European Signal Processing Conference (EUSIPCO '09), pp. 854–858, Glasgow, UK, July 2009.
[12] J. L. Flanagan and R. C. Lummis, "Signal processing to reduce multipath distortion in small rooms," Journal of the Acoustical Society of America, vol. 47, no. 6, pp. 1475–1481, 1970.
[13] J. B. Allen, D. A. Berkley, and J. Blauert, "Multimicrophone signal-processing technique to remove room reverberation from speech signals," Journal of the Acoustical Society of America, vol. 62, no. 4, pp. 912–915, 1977.
[14] S. Gannot and M. Moonen, "Subspace methods for multimicrophone speech dereverberation," EURASIP Journal on Applied Signal Processing, vol. 2003, no. 11, pp. 1074–1090, 2003.
[15] M. Delcroix, T. Hikichi, and M. Miyoshi, "Dereverberation and denoising using multichannel linear prediction," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 6, pp. 1791–1801, 2007.
[16] I. Ram, E. Habets, Y. Avargel, and I. Cohen, "Multi-microphone speech dereverberation using LIME and least squares filtering," in Proceedings of the European Signal Processing Conference (EUSIPCO '08), Lausanne, Switzerland, August 2008.
[17] K. Mukherjee and B.-H. Gwee, "A 32-point FFT based noise reduction algorithm for single channel speech signals," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '07), pp. 3928–3931, New Orleans, La, USA, May 2007.
[18] W. Armbrüster, R. Czarnach, and P. Vary, "Adaptive noise cancellation with reference input," in Signal Processing III, pp. 391–394, Elsevier, 1986.
[19] B. Sklar, Digital Communications: Fundamentals and Applications, Prentice Hall, Upper Saddle River, NJ, USA, 2001.
[20] C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 24, no. 4, pp. 320–327, 1976.
[21] J. Freudenberger, S. Stenzel, and B. Venditti, "An FLMS based two-microphone speech enhancement system for in-car applications," in Proceedings of the 15th IEEE Workshop on Statistical Signal Processing (SSP '09), pp. 705–708, 2009.
[22] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '79), pp. 208–211, Washington, DC, USA, April 1979.
[23] R. Martin, "Spectral subtraction based on minimum statistics," in Proceedings of the European Signal Processing Conference (EUSIPCO '94), pp. 1182–1185, Edinburgh, UK, April 1994.
[24] H. Puder, "Single channel noise reduction using time-frequency dependent voice activity detection," in Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC '99), pp. 68–71, Pocono Manor, Pa, USA, September 1999.
[25] A. Juneja, O. Deshmukh, and C. Espy-Wilson, "A multi-band spectral subtraction method for enhancing speech corrupted by colored noise," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), vol. 4, pp. 4160–4164, Orlando, Fla, USA, May 2002.
[26] J. Ramírez, J. C. Segura, C. Benítez, A. de la Torre, and A. Rubio, "A new voice activity detector using subband order-statistics filters for robust speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '04), vol. 1, pp. I849–I852, 2004.
[27] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5, pp. 504–512, 2001.
[28] I. Cohen, "Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 5, pp. 466–475, 2003.
[29] J. Sohn and W. Sung, "A voice activity detector employing soft decision based noise spectrum adaptation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), vol. 1, pp. 365–368, 1998.
[30] G. D. Forney Jr., "Exponential error bounds for erasure, list, and decision feedback schemes," IEEE Transactions on Information Theory, vol. 14, no. 2, pp. 206–220, 1968.
[31] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, 1999.
[32] ITU-T, Test Signals for Use in Telephonometry, Recommendation ITU-T P.501, International Telecommunication Union, Geneva, Switzerland, 2007.
[33] A. H. Gray Jr. and J. D. Markel, "Distance measures for speech processing," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 24, no. 5, pp. 380–391, 1976.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 358729, 9 pages
doi:10.1155/2010/358729
Research Article
DOA Estimation with Local-Peak-Weighted CSP
Copyright © 2010 Osamu Ichikawa et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
This paper proposes a novel weighting algorithm for Cross-power Spectrum Phase (CSP) analysis to improve the accuracy of
direction of arrival (DOA) estimation for beamforming in a noisy environment. Our sound source is a human speaker and the
noise is broadband noise in an automobile. The harmonic structures in the human speech spectrum can be used for weighting the
CSP analysis, because harmonic bins must contain more speech power than the others and thus give us more reliable information.
However, most conventional methods leveraging harmonic structures require pitch estimation with voiced-unvoiced classification,
which is not sufficiently accurate in noisy environments. In our new approach, the observed power spectrum is directly converted
into weights for the CSP analysis by retaining only the local peaks considered to be harmonic structures. Our experiments showed that the proposed approach significantly reduced the localization errors, and it yielded further improvements when used with other weighting algorithms.
…pitches (F0) of the target sound and extracted localization cues from the harmonic structures based on those pitches. However, the pitch estimation and the associated voiced-unvoiced classification may be insufficiently accurate in noisy environments. Also, it should be noted that not all harmonic bins have distinct harmonic structures. Some bins may not be in the speech formants and may be dominated by noise. Therefore, we want a special weighting algorithm that puts larger weights on the bins where the harmonic structures are distinct, without requiring explicit pitch detection and voiced-unvoiced classification.

2. Sound Source Localization Using CSP Analysis

2.1. CSP Analysis. CSP analysis measures the normalized correlations between two-microphone inputs with an Inverse DFT,

ϕT(i) = IDFT[ (S1,T(j) · S∗2,T(j)) / (|S1,T(j)| · |S2,T(j)|) ],  (1)

with Sm,T(j) the spectrum of frame T at microphone m, where ∗ means complex conjugate. The bin number j corresponds to the frequency. The CSP coefficient ϕT(i) is a time-domain representation of the normalized correlation for the i-sample delay. For a stable representation, the coefficients are averaged over neighboring frames,

ϕ̄T(i) = (1/(2H + 1)) Σ_{h=−H..H} ϕT+h(i),  (2)

where 2H + 1 is the number of averaged frames. Figure 1 shows an example of ϕT. In clean conditions, there is a sharp peak for a sound source. The estimated DOA iT for the sound source is

iT = arg max_i ϕ̄T(i).  (3)

[Figure 1: An example of the CSP coefficients ϕT over the delay index i = −7, …, 7.]

2.2. Tracking a Moving Sound Source. If a sound source is moving, the past location or DOA can be used as a cue to the new location. Tracking techniques may use Dynamic Programming (DP), the Viterbi search [10], Kalman Filters, or Particle Filters [11]. For example, to find the series of […]

2.3. Weighted CSP Analysis. Equation (1) can be viewed as a summation of each contribution at bin j. Therefore we can introduce a weight W(j) on each bin so as to focus on the more reliable bins, as

ϕT(i) = IDFT[ W(j) · (S1,T(j) · S∗2,T(j)) / (|S1,T(j)| · |S2,T(j)|) ].  (5)

Denda et al. introduced an average speech spectrum for the weights [7] to focus on human speech. Figure 2 shows their weights. We use the symbol WDenda for later reference to these weights. It does not have any suffix T, since it is time-invariant.

[Figure 2: Average speech spectrum weight.]

Another weighting approach would be to use the local SNR [12], as long as the ambient noise is stationary and measurable. For our evaluation in Section 4, we simply used larger weights where the local SNR is high, as

WSNR,T(j) = max( log|ST(j)|² − log|NT(j)|², ε ) / KT,  (6)

KT = max_k ( log|ST(k)|² − log|NT(k)|², ε ).  (7)

Figure 3(c) shows an example of the local SNR weights.
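To make (1), (3), and (5) concrete, here is a minimal numpy sketch of weighted CSP analysis for one frame. The helper names are ours; the spectra are assumed to come from an rfft of the two windowed microphone signals, and passing all-ones weights reproduces the conventional CSP of (1).

```python
import numpy as np

def weighted_csp(s1: np.ndarray, s2: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Weighted CSP coefficients of one frame, cf. (5).

    s1, s2: complex one-sided (rfft) spectra from the two microphones;
    w: real per-bin weights W(j). Returns the real lag-domain phi_T(i).
    """
    eps = 1e-12
    cross = s1 * np.conj(s2)
    cross /= np.abs(s1) * np.abs(s2) + eps     # normalization (phase transform)
    return np.fft.irfft(w * cross)             # IDFT back to the lag domain

def estimate_doa(phi: np.ndarray, max_delay: int) -> int:
    """DOA as the delay (in samples) maximizing phi, cf. (3)."""
    n = len(phi)
    lags = np.concatenate([np.arange(0, max_delay + 1),
                           np.arange(-max_delay, 0)])
    idx = np.concatenate([np.arange(0, max_delay + 1),
                          np.arange(n - max_delay, n)])    # irfft lag layout
    return int(lags[np.argmax(phi[idx])])
```

With the 12.5 cm microphone spacing and 22 kHz sampling described in Section 4, the physically feasible delays span roughly ±(0.125/343)·22050 ≈ ±8 samples, consistent with the −7 to +7 DOA steps reported there.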
[Figure 3: Sample spectra and the associated weights. The spectra were of the recording with air conditioner noise at an SNR of 0 dB. The noisy speech spectrum (b) was sampled in a vowel segment. Panels: (a) a sample of the average noise spectrum; (b) a sample of the observed noisy speech spectrum; (c) a sample of the local SNR weights; (d) a sample of the local peak weights.]
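As an illustration of (6) and (7), the sketch below computes local SNR weights, like those in Figure 3(c), from a frame's power spectrum and a noise estimate. Part of (6) is cut off in the source, so the max-normalization shown here is a reconstruction rather than the authors' exact formula.

```python
import numpy as np

def local_snr_weights(s_pow: np.ndarray, n_pow: np.ndarray,
                      eps: float = 1e-3) -> np.ndarray:
    """Local SNR weights, cf. (6)-(7).

    s_pow, n_pow: power spectra |S_T(j)|^2 and |N_T(j)|^2 of the current
    frame and the noise estimate. The log-SNR is floored at eps and then
    normalized by its maximum K_T (assumed normalization).
    """
    snr = np.maximum(np.log(s_pow + 1e-12) - np.log(n_pow + 1e-12), eps)
    return snr / snr.max()         # K_T = max_k(...), cf. (7)
```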
[Figure 4: A sample of comb weights (pitch = 300 Hz).]

[Figure 6: Steps for obtaining the local peak weights: observed spectrum → log power spectrum → DCT to get cepstrum → cut off upper and lower cepstrum → I-DCT → get exponential and normalise to get weights W(ω) → weighted CSP.]
classification [16]. When noise-corrupted speech is falsely detected as unvoiced, there is little benefit from the CSP weighting.

There is another problem with the uniform adoption of comb weights for all of the bins. Those bins not in the speech formants and degraded by noise may not contain reliable cues even though they are harmonic bins. Such bins should receive smaller weights.

Therefore, in Section 3.2, we explore a new weighting algorithm that does not depend on explicit pitch detection or voiced-unvoiced classification. Our approach is like a continuous converter from an input spectrum to a weight vector, which can be locally large for the bins whose harmonic structures are distinct.

3.2. Proposed Local Peak Weights. We previously proposed a method for speech enhancement called Local Peak Enhancement (LPE) to provide robust ASR even in very low SNR conditions due to driving noises from an open window or loud air conditioner noises [17]. LPE does not leverage pitch information explicitly, but estimates filters from the observed speech to enhance the speech spectrum. LPE
assumes that pitch information containing the harmonic structure is included in the middle range of the cepstral coefficients obtained with the discrete cosine transform (DCT) from the power spectral coefficients. The LPE filter retrieves information only from that range, so it is designed to enhance the local peaks of the harmonic structures for voiced speech frames. Here, we propose that the LPE filter be used for the weights in the CSP approach. This use of the LPE filter is named Local Peak Weight (LPW), and we refer to the CSP with LPW as the Local-Peak-Weighted CSP (LPW-CSP).

Figure 6 shows all of the steps for obtaining the LPW and sample outputs of each step for both a voiced frame and an unvoiced frame. The process is the same for all of the frames, but the generated filters differ depending on whether or not the frame is voiced speech, as shown in the figure.

Here are the details for each step.

(1) Convert the observed spectrum from one of the microphones to a log power spectrum YT(j) for each frame, where T and j are the frame number and the bin index of the DFT. Optionally, we may take a moving average using several frames around T, to smooth the power spectrum for YT(j).

(2) Convert the log power spectrum YT(j) into the cepstrum CT(i) by using D(i, j), a DCT matrix:

CT(i) = Σ_j D(i, j) · YT(j),  (8)

where i is the bin number of the cepstral coefficients. In our experiments, the size of the DCT matrix is 256 by 256.
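A compact Python sketch of the LPW computation, following the flowchart of Figure 6, is given below. Because this excerpt does not state the exact cut-off quefrencies for the liftering step or the normalization, the band edges derived from the 100–400 Hz pitch range and the mean-one normalization are assumptions.

```python
import numpy as np
from scipy.fft import dct, idct

def local_peak_weights(log_power: np.ndarray, fs: float = 22050.0,
                       f0_min: float = 100.0, f0_max: float = 400.0) -> np.ndarray:
    """Local Peak Weights (LPW) for one frame, following Figure 6.

    log_power: log power spectrum Y_T(j) of one frame (e.g. 256 bins).
    For a DCT over n bins spanning fs/2, a pitch f0 shows up near
    cepstral index ~ fs/f0, so the lifter keeps that mid-range band
    (assumed edges; the paper's exact cut-offs are not given here).
    """
    n = len(log_power)
    c = dct(log_power, type=2, norm='ortho')       # (2) cepstrum C_T = D Y_T
    lo = int(fs / f0_max)                          # (3) cut off upper and
    hi = min(int(fs / f0_min), n - 1)              #     lower cepstrum
    lifter = np.zeros(n)
    lifter[lo:hi + 1] = 1.0
    w = np.exp(idct(c * lifter, type=2, norm='ortho'))  # (4)-(5) I-DCT, exp
    return w * (n / w.sum())                       # normalise (mean weight 1)
```

For a voiced frame the retained mid-range cepstrum carries the harmonic ripple, so the resulting weights peak at harmonic bins; for an unvoiced frame the band is nearly empty and the weights stay close to flat, which is exactly why no voiced-unvoiced switch is needed.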
…for human speech is from 100 Hz to 400 Hz. We hope to minimize some fake peaks in the weights by using the products of different metrics. Equations (13) to (16) show the combinations we evaluate in Section 4:

WLPW&Denda,T(j) = WLPW,T(j) · WDenda(j),  (13)
WLPW&SNR,T(j) = WLPW,T(j) · WSNR,T(j),  (14)
WSNR&Denda,T(j) = WSNR,T(j) · WDenda(j),  (15)
WLPW&SNR&Denda,T(j) = WLPW,T(j) · WSNR,T(j) · WDenda(j).  (16)

4. Experiment

In the experimental car, two microphones were installed near the map-reading lights on the ceiling with 12.5 cm between them. We used omnidirectional microphones. The sampling frequency for the recordings was 22 kHz. In this configuration, CSP gives 15 steps from −7 to +7 for the DOA resolution (see Figure 7).

[Figure 7: DOA steps from −7 to +7 for the two-microphone configuration.]

A higher sampling rate might yield higher directional resolution. However, many beamformers do not support higher sampling frequencies because of processing costs and aliasing problems. We also know that most ASR systems work at sampling rates below 22 kHz. These considerations led us to use 22 kHz.

Again, we could have gained directional resolution by increasing the distance between the microphones. In general, a larger baseline distance improves the performance of a beamformer, especially for lower-frequency sounds. However, this increases the aliasing problems for higher-frequency sounds. Our separation of 12.5 cm was another tradeoff.

Our analysis used a Hamming window and 23-ms-long frames with 10-ms frame shifts. The FFT length was 512. For (2), the length of the moving average was 0.2 seconds.

The test subject speakers were 4 females and 4 males. Each speaker read 50 Japanese commands. These are short phrases for automobiles known as Free Form Command [18]. The total number of utterances was 400. They were recorded in a stationary car, a full-size sedan. The subject speakers sat in the driver's seat. The seat was adjusted to each speaker's preference, so the distance to the microphones varied from approximately 40 cm to 60 cm. Two types of noise were recorded separately in a moving car, and they were combined with the speech data at various SNRs (clean, 10 dB, and 0 dB). The SNRs were measured as ratios of speech power and noise power, ignoring the frequency components below 300 Hz. One of the recorded noises was an air conditioner at maximum fan speed while driving on a highway with the windows closed. This will be referred to as "Fan Max". The other was of driving noise on a highway with the windows fully opened. This will be referred to as "Window Full Open". Figure 8 compares the average spectra of the two noises. "Window Full Open" contains more power around 1 kHz, and "Fan Max" contains relatively large power around 4 kHz. Although it is not shown in the graph, "Window Full Open" contains lots of transient noise from the wind and other automobiles.

[Figure 8: Averaged noise spectra used in the experiment: "Window Full Open" and "Fan Max".]

Figure 9 shows the system used for this evaluation. We used various types of weights for the weighted CSP analysis. The input from one microphone was used to generate the weights. Using both microphones could provide better weights, but in this experiment we used only one microphone for simplicity. Since the baseline (normal CSP) does not use weighting, all of its weights were set to 1.0. The weighted CSP was calculated using (5), with smoothing over the frames using (2). In addition to the weightings, we introduced a lower cut-off frequency of 100 Hz and an upper cut-off frequency of 5 kHz to stabilize the CSP analysis. Finally, the DOA was estimated using (3) for each frame. We did not use the tracking algorithms discussed in Section 2.2, because we wanted to accurately measure the contributions of the various types of weights in a simplified form. Actually, the subject speakers rarely moved when speaking.

[Figure 9: System used for the evaluation: DFT of the two inputs, weight generation W(j) from one input, and weighted CSP.]

The performance was measured as frame-based accuracy. The frames reporting the correct DOA were counted, and that was divided by the total number of speech frames. The correct DOA values were determined manually. The speech segments were determined using clean speech data with a rather strict threshold, so extra segments were not included before or after the phrases.

4.1. Experiment Using Single Weights. We evaluated five types of CSP analysis.

Case 1. Normal CSP (uniform weights, baseline).

Case 2. Comb-Weighted CSP.

Case 3. Local-Peak-Weighted CSP (our proposal).

Case 4. Local-SNR-Weighted CSP.

Case 5. Average-Speech-Spectrum-Weighted CSP (Denda).

Case 2 requires the pitch and voiced-unvoiced information. We used SPTK-3.0 [14] with default parameters to obtain this data. Case 4 requires estimating the noise spectrum. In this experiment, the noise spectrum was continuously updated within the noise segments based on oracle VAD information (a short code sketch of this update follows below), as

NT(j) = (1 − α) · NT−1(j) + α · ST(j),
α = 0.0 if VAD = active, 0.1 otherwise.  (17)

The initial value of the noise spectrum for each utterance file was given by the average of all of the noise segments in that file.

[Figure 10: Error rate of frame-based DOA detection (Fan Max: single-weight cases). Legend: 1. CSP (baseline); 2. W-CSP (Comb); 3. W-CSP (LPW); 4. W-CSP (Local SNR); 5. W-CSP (Denda).]

[Figure 11: Error rate of frame-based DOA detection (Window Full Open: single-weight cases).]

Figures 10 and 11 show the experimental results for "Fan Max" and "Window Full Open", respectively. Case 2 failed to show significant error reduction in both situations. This failure is probably due to bad pitch estimation or poor voiced-unvoiced classification in the noisy environments.
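The oracle-VAD noise update of (17) is straightforward to implement; a minimal sketch (the function name is ours):

```python
import numpy as np

def update_noise_spectrum(n_prev: np.ndarray, s_cur: np.ndarray,
                          vad_active: bool, alpha_noise: float = 0.1) -> np.ndarray:
    """Recursive noise spectrum estimate, cf. (17).

    During active speech (VAD = active) alpha is 0.0, so the estimate is
    frozen; in noise-only segments it is smoothed with alpha = 0.1. The
    initial estimate is the average over all noise segments of the file.
    """
    alpha = 0.0 if vad_active else alpha_noise
    return (1.0 - alpha) * n_prev + alpha * s_cur
```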
This suggests that the result could be improved by introducing robust pitch trackers and voiced-unvoiced classifiers. However, there is an intrinsic problem, since noisier speech segments are more likely to be classified as unvoiced and thus lose the benefit of weighting.

Case 5 failed to show significant error reduction for "Fan Max", but it showed good improvement for "Window Full Open". As shown in Figure 8, "Fan Max" contains more noise power around 4 kHz than around 1 kHz. In contrast, the speech power is usually lower around 4 kHz than around 1 kHz. Therefore, the 4-kHz region tends to be more degraded. However, Denda's approach does not sufficiently lower the weights in the 4-kHz region, because the weights are time-invariant and independent of the noise. Case 3 and Case 4 outperformed the baseline in both situations. For "Fan Max", since the noise was almost stationary, the local-SNR approach can accurately estimate the noise. This is also a favorable situation for LPW, because the noise does not include harmonic components. However, LPW does little for consonants. Therefore, Case 4 had the best results for "Fan Max". In contrast, since the noise is nonstationary for "Window Full Open", Case 3 had slightly fewer errors than Case 4. We believe this is because the noise estimation for the local SNR calculations is inaccurate for nonstationary noises. Considering that the local SNR approach in this experiment used the given and accurate VAD information, the actual performance in the real world would probably be worse than our results. LPW has an advantage in that it requires neither noise estimation nor VAD information.

4.2. Experiment Using Combined Weights. We also evaluated some combinations of the weights in Cases 3 to 5. The combined weights were calculated using (13) to (16).

Case 6. CSP weighted with LPW and Denda (Cases 3 and 5).

Case 7. CSP weighted with LPW and Local SNR (Cases 3 and 4).

Case 8. CSP weighted with Local SNR and Denda (Cases 4 and 5).

Case 9. CSP weighted with LPW, Local SNR, and Denda (Cases 3, 4, and 5).

[Figures 12 and 13: Error rates of frame-based DOA detection for "Fan Max" and "Window Full Open" (combined-weight cases).]

Figures 12 and 13 show the experimental results for "Fan Max" and "Window Full Open", respectively, for the combined weight cases.

For the combination of two weights, the best combination was dependent on the situation. For "Fan Max", Case 7, the combination of LPW and the local SNR approach, was best, reducing the error by 51% for 0 dB. For "Window Full Open", Case 6, the combination of LPW and Denda's approach, was best, reducing the error by 37% for 0 dB. These results correspond to the discussion in Section 4.1 about how the local SNR approach is suitable for stationary noises, while LPW is suitable for nonstationary noises, and Denda's approach works well with noise concentrated in the lower frequency region.

In Case 9, the combination of the three weights worked well in both situations. Because each weighting method has different characteristics, we expected that their combination would help against variations in the noise. Actually, the results were almost equivalent to the best combinations of the paired weights in each situation.

5. Conclusion

We proposed a new weighting algorithm for CSP analysis to improve the accuracy of DOA estimation for beamforming in a noisy environment, assuming the source is human speech and the noise is broadband noise such as a fan, wind, or road noise in an automobile.

The proposed weights are extracted directly from the input speech using the midrange of the cepstrum. They represent the local peaks of the harmonic structures. As the process does not involve voiced-unvoiced classification, it does not have to switch its behavior over the voiced-unvoiced transitions.

Experiments showed the proposed local peak weighting algorithm significantly reduced the errors in localization using CSP analysis. A weighting algorithm using local SNR also reduced the errors, but it did not produce the best results in the nonstationary noise situation in our evaluations. Also, it requires VAD information to estimate the noise spectrum. Our proposed algorithm does not require VAD information, voiced-unvoiced information, or pitch information. It does not assume the noise is stationary. Therefore, it showed advantages in the nonstationary noise situation. Also, it can be combined with existing weighting algorithms for further improvements.

References

[1] D. Johnson and D. Dudgeon, Array Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, USA.
[2] F. Asano, H. Asoh, and T. Matsui, "Sound source localization and separation in near field," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E83-A, no. 11, pp. 2286–2294, 2000.
[3] M. Omologo and P. Svaizer, "Acoustic event localization using a crosspower-spectrum phase based technique," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '94), pp. 273–276, 1994.
[4] K. D. Martin, "Estimating azimuth and elevation from interaural differences," in Proceedings of the IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '95), p. 4, 1995.
[5] O. Ichikawa, T. Takiguchi, and M. Nishimura, "Sound source localization using a profile fitting method with sound reflectors," IEICE Transactions on Information and Systems, vol. E87-D, no. 5, pp. 1138–1145, 2004.
[6] T. Nishiura, T. Yamada, S. Nakamura, and K. Shikano, "Localization of multiple sound sources based on a CSP analysis with a microphone array," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '00), vol. 2, pp. 1053–1056, 2000.
[7] Y. Denda, T. Nishiura, and Y. Yamashita, "Robust talker direction estimation based on weighted CSP analysis and
Research Article
Shooter Localization in Wireless Microphone Networks
Copyright © 2010 David Lindgren et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
Shooter localization in a wireless network of microphones is studied. Both the acoustic muzzle blast (MB) from the gunfire and the ballistic shock wave (SW) from the bullet can be detected by the microphones and considered as measurements. The MB measurements give rise to a standard sensor network problem, similar to time difference of arrival in cellular phone networks, and the localization accuracy is good, provided that the sensors are well synchronized relative to the MB detection accuracy. The detection times of the SW depend on both shooter position and aiming angle and may provide additional information besides the shooter location, but again this requires good synchronization. We analyze the approach of basing the estimation on the time difference of MB and SW at each sensor, which is insensitive to synchronization inaccuracies. Cramér-Rao lower bound analysis indicates how a lower bound of the root mean square error depends on the synchronization error for the MB and the MB-SW difference, respectively. The estimation problem is formulated in a separable nonlinear least squares framework. Results from field trials with different types of ammunition show excellent accuracy using the MB-SW difference for both the position and the aiming angle of the shooter.
Also the bullet speed v and speed of sound c are unknown. Basic signal models for the detected times as a function of the parameters will be derived in the next section. The notation is summarized in Table 1.

The derived signal models will be of the form

y = h(x, θ; p) + e,  (1)

where y is a vector with the measured detection times, h is a nonlinear function with values in R^{M+S}, and where θ represents the unknown parameters apart from x. The error e is assumed to be stochastic; see Section 4.5. Given the sensor locations in p ∈ R^{M×3}, nonlinear optimization can be performed to estimate x, using the nonlinear least squares (NLS) criterion:

x̂ = arg min_x min_θ V(x, θ; p),
V(x, θ; p) = ‖y − h(x, θ; p)‖²_R.  (2)

Here, arg min denotes the minimizing argument, min the minimum of the function, and ‖v‖²_Q denotes the Q-norm, that is, ‖v‖²_Q = v^T Q^{−1} v. Whenever Q is omitted, Q = I is assumed. The loss function norm R is chosen by consideration of the expected error characteristics. Numerical optimization, for instance the Gauss-Newton method, can here be applied to get the NLS estimate.

In the next section it will become clear that the assumed unknown firing time and the inverse speed of sound enter the model equations linearly. To exploit this fact, we identify a sublinear structure in the signal model and apply the weighted least squares method to the parameters appearing linearly, the separable least squares method; see, for instance, [17]. By doing so, the NLS search space is reduced, which in turn significantly reduces the computational burden. For that reason, the signal model (1) is rewritten as

y = hN(x, θN; p) + hL(x, θN; p) θL + e.  (3)

Note that θL enters linearly here. The NLS problem can then be formulated as

x̂ = arg min_x min_{θL,θN} V(x, θN, θL; p),
V(x, θN, θL; p) = ‖y − hN(x, θN; p) − hL(x, θN; p) θL‖²_R.  (4)

Since θL enters linearly, it can be solved for by linear least squares (the arguments of hL(x, θN; p) and hN(x, θN; p) are suppressed for clarity):

θ̂L = arg min_{θL} V(x, θN, θL; p) = (hL^T R^{−1} hL)^{−1} hL^T R^{−1} (y − hN),  (5a)
PL = (hL^T R^{−1} hL)^{−1}.  (5b)

Here, θ̂L is the weighted least squares estimate and PL is the covariance matrix of the estimation error. This simplifies the nonlinear minimization to

x̂ = arg min_x min_{θN} ‖y − hN − hL (hL^T R^{−1} hL)^{−1} hL^T R^{−1} (y − hN)‖²_{R̄},  (6)
R̄ = R + hL PL hL^T.

This general separable least squares (SLS) approach will now be applied to four different combinations of signal models for the MB and SW detection times.

4. Signal Models

4.1. Muzzle Blast Model (MB). According to the clock at microphone k, the muzzle blast (MB) sound is assumed to reach pk at the time

yk = t0 + bk + (1/c) ‖pk − x‖ + ek.  (7)

The shooter position x and microphone location pk are in R^n, where generally n = 3. However, both computational and numerical issues occasionally motivate a simplified plane model with n = 2. For all M microphones, the model is represented in vector form as

y = b + hL(x; p) θL + e,  (8)

where

θL = [t0, 1/c]^T,  (9a)
hL,k(x; p) = [1, ‖pk − x‖],  (9b)

and where y, b, and e are vectors with elements yk, bk, and ek, respectively. 1M is the vector with M ones, where M might be omitted if there is no ambiguity regarding the dimension. Furthermore, p is M-by-n, where each row is a microphone position. Note that the inverse of the speed of sound enters linearly. The ·L notation indicates that · is part of a linear relation, as described in the previous section. With hN = 0 and hL = hL(x; p), (6) gives

x̂ = arg min_x ‖y − hL (hL^T R^{−1} hL)^{−1} hL^T R^{−1} y‖²_{R̄},  (10a)
R̄ = R + hL (hL^T R^{−1} hL)^{−1} hL^T.  (10b)

Here, hL depends on x as given in (9b).

This criterion has computationally efficient implementations that, in many applications, make the time it takes to do an exhaustive minimization over a, say, 10-meter grid acceptable.
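As a concrete illustration of the separable structure, the sketch below implements the MB-model grid search of (8)-(10) for a plane model (n = 2) under the simplifying assumption R = I. With that choice, the projected residual is orthogonal to hL, so the R̄-weighting of (10a) does not change the cost. The synthetic setup (microphone layout, 80 μs noise level, grid spacing) is illustrative only, not the paper's data.

```python
import numpy as np

def mb_grid_search(y, p, grid):
    """Grid-based separable least squares for the MB model, cf. (8)-(10).

    y: (M,) MB detection times; p: (M, 2) microphone positions (plane
    model); grid: (G, 2) candidate shooter positions x. Assumes R = I;
    the linear parameters theta_L = [t0, 1/c] are eliminated in closed
    form per (5a) at each grid point.
    """
    best_cost, best_x, best_theta = np.inf, None, None
    for x in grid:
        d = np.linalg.norm(p - x, axis=1)                   # ||p_k - x||
        h_l = np.column_stack([np.ones_like(d), d])         # cf. (9b)
        theta_l, *_ = np.linalg.lstsq(h_l, y, rcond=None)   # [t0, 1/c], cf. (5a)
        r = y - h_l @ theta_l                               # projected residual
        cost = float(r @ r)
        if cost < best_cost:
            best_cost, best_x, best_theta = cost, x, theta_l
    return best_x, best_theta, best_cost

# Illustrative synthetic example: ten microphones, shooter at (120, 40) m,
# c = 343 m/s, firing time t0 = 0.5 s, ~80 us detection noise.
rng = np.random.default_rng(1)
p = rng.uniform(0.0, 200.0, size=(10, 2))
x_true = np.array([120.0, 40.0])
y = 0.5 + np.linalg.norm(p - x_true, axis=1) / 343.0
y = y + 80e-6 * rng.standard_normal(10)
xs = np.linspace(0.0, 200.0, 41)                            # 5 m grid
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
x_hat, theta_hat, _ = mb_grid_search(y, p, grid)
print(x_hat, 1.0 / theta_hat[1])       # position estimate and implied c
```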
The grid-based minimization of course reduces […]

Table 1: Notation. MB, SW, and MB-SW are different models, and L/N indicates if model parameters or signals enter the model linearly (L) or nonlinearly (N). [Table body not recovered.]

[Figure 3: Geometry of the supersonic bullet trajectory and shock wave, with the gun at x, the aim angle α, the trajectory point dk, and the angles γk and 90° + βk at microphone pk. Given the shooter location x, the shooting direction (aim) α, the bullet speed v, and the speed of sound c, the time it takes from firing the gun to detecting the shock wave can be calculated.]

The vector form of the model is

y = b + hN(x, θN; p) + hL(x, θN; p) θL + e,  (16)

where

hL(x, θN; p) = 1,
θL = t0,  (17)
θN = [1/c, α^T, v0]^T,

and where row k of hN(x, θN; p) ∈ R^{S×1} is

hN,k(x, θN; pk) = (1/r) log( v0 / (v0 − r ‖dk − x‖) ) + (1/c) ‖dk − pk‖,  (18)

and dk is the admissible solution to (12) and (15).

4.3. Combined Model (MB;SW). In the MB and SW models, the synchronization error has to be regarded as a noise component. In a combined model, each pair of MB and SW detections depends on the same synchronization error, and consequently the synchronization error can be regarded as a parameter (at least for all sensor nodes inside the SW cone). The total signal model could be fused from the MB and SW models as the total observation vector:

y^{MB;SW} = hN^{MB;SW}(x, θN; p) + hL^{MB;SW}(x, θN; p) θL + e,  (19)

where

y^{MB;SW} = [ y^{MB} ; y^{SW} ],  (20)
θL = [t0, b^T]^T,  (21)
hL^{MB;SW}(x, θN; p) = [ 1_{M,1}  I_M ; 1_{S,1}  (I_S 0_{S,M−S}) ],  (22)
θN = [1/c, α^T, v0]^T,  (23)
hN^{MB;SW}(x, θN; p) = [ hL^{MB}(x; p)(0, 1/c)^T ; hN^{SW}(x, θN; p) ],  (24)

that is, the first M rows equal (1/c)‖pk − x‖ and the last S rows follow the SW model.

4.4. Time Difference Model (MB-SW). Subtracting the SW detection time from the MB detection time at each microphone gives

yk^{MB} − yk^{SW} = (1/c) ‖pk − x‖ − (1/r) log( v0 / (v0 − r ‖dk − x‖) ) − (1/c) ‖dk − pk‖ + ek^{MB} − ek^{SW},  (25)

for k = 1, 2, …, S. This rather special model has also been analyzed in [12, 15]. The key idea is that y is, by cancellation, independent of both the firing time t0 and the synchronization error b. The drawback, of course, is that there are only S equations (instead of a total of M + S) and the detection error increases, ek^{MB} − ek^{SW}. However, when the synchronization errors are expected to be significantly larger than the detection errors, and when also S is sufficiently large (at least as large as the number of parameters), this model is believed to give better localization accuracy. This will be investigated later.

There are no parameters in (25) that appear linearly everywhere. Thus, the vector form for the MB-SW model can be written as

y^{MB-SW} = hN^{MB-SW}(x, θN; p) + e,  (26)

where

hN,k^{MB-SW}(x, θN; pk) = (1/c) ‖pk − x‖ − (1/r) log( v0 / (v0 − r ‖dk − x‖) ) − (1/c) ‖dk − pk‖,  (27)

and y = y^{MB} − y^{SW} and e = e^{MB} − e^{SW}. As before, dk is the admissible solution to (12) and (15). The MB-SW least squares criterion is

x̂ = arg min_{x,θN} ‖y^{MB-SW} − hN^{MB-SW}(x, θN; p)‖²_R,  (28)

which requires numerical optimization. Numerical experiments indicate that this optimization problem is more prone to local minima compared to (10a) for the MB model; therefore, good starting points for the numerical search are essential. One such starting point could, for instance, be the MB estimate x̂^{MB}. The initial shooting direction could be given by assuming, in a sense, the worst possible case: that the shooter aims at some point close to the center of the microphone network.

4.5. Error Model. At an arbitrary moment, the detection errors and synchronization errors are assumed to be independent stochastic variables with normal distribution:

e^{MB} ∼ N(0, R^{MB}),  (29a)
e^{SW} ∼ N(0, R^{SW}),  (29b)
b ∼ N(0, R^b).  (29c)
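To show the shape of the MB-SW criterion (26)-(28), here is a hedged Python sketch with R = I. The computation of dk, the admissible solution of (12) and (15), is not reproduced in this excerpt, so it is abstracted behind a shock_point callback, and the retardation constant r is a placeholder value (the paper notes it was tuned manually).

```python
import numpy as np
from scipy.optimize import minimize

def mbsw_cost(params, y_diff, p, shock_point, r=0.05):
    """MB-SW criterion (28) with R = I, plane model (n = 2).

    params: [x1, x2, inv_c, alpha, v0] with scalar aim angle alpha;
    y_diff: y^MB - y^SW per microphone; shock_point: callback returning
    d_k, the trajectory point whose shock wave reaches p_k (the
    admissible solution of (12) and (15), not spelled out here);
    r: retardation constant (placeholder value).
    """
    x, inv_c, alpha, v0 = params[:2], params[2], params[3], params[4]
    cost = 0.0
    for k, pk in enumerate(p):
        dk = shock_point(x, alpha, v0, pk)
        hk = (inv_c * np.linalg.norm(pk - x)
              - (1.0 / r) * np.log(v0 / (v0 - r * np.linalg.norm(dk - x)))
              - inv_c * np.linalg.norm(dk - pk))          # cf. (27)
        cost += (y_diff[k] - hk) ** 2
    return cost

# Numerical minimization, started from the MB estimate as the text suggests:
# res = minimize(mbsw_cost, x0=[*x_mb, 1/343.0, 0.0, 800.0],
#                args=(y_diff, p, shock_point), method='Nelder-Mead')
```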
For the MB-SW model, the error is consequently

e^{MB-SW} ∼ N(0, R^{MB} + R^{SW}).  (29d)

Assuming that S = M in the MB;SW model, the covariance of the summed detection and synchronization errors can be expressed in a simple manner as

R^{MB;SW} = [ R^{MB} + R^b , R^b ; R^b , R^{SW} + R^b ].  (29e)

5. Cramér-Rao Lower Bound

The accuracy of any unbiased estimator η̂ in the rather general model

y = h(η) + e  (30)

is, under not too restrictive assumptions [20], bounded by the Cramér-Rao bound:

Cov(η̂) ≥ I^{−1}(η°),  (31)
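For the additive Gaussian model (30), the bound (31) can be evaluated numerically: the Fisher information is I(η) = H^T R^{−1} H with H the Jacobian of h at the true parameter, which the sketch below approximates by central differences. The step size and function signature are our own choices.

```python
import numpy as np

def crlb(h, eta0, R, delta=1e-6):
    """Cramer-Rao lower bound for y = h(eta) + e, e ~ N(0, R), cf. (30)-(31).

    Returns the inverse Fisher information I(eta0)^{-1} with
    I = H^T R^{-1} H, where the Jacobian H of h is approximated by
    central differences around the true parameter eta0.
    """
    eta0 = np.asarray(eta0, dtype=float)
    m, d = len(h(eta0)), len(eta0)
    H = np.zeros((m, d))
    for j in range(d):
        step = np.zeros(d)
        step[j] = delta
        H[:, j] = (h(eta0 + step) - h(eta0 - step)) / (2.0 * delta)
    info = H.T @ np.linalg.solve(R, H)
    return np.linalg.inv(info)
```

A position RMSE bound can then be read off as the square root of the trace of the block of this matrix corresponding to the position coordinates.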
Table 2: Summary of parameter vectors for the different models y = hL(θN)θL + hN(θN) + e, where the noise models are summarized in (29a), (29b), (29c), (29d), and (29e). The values of the dimensions assume that the set of microphones giving SW observations is a subset of the MB observations.

Model     Linear parameters               Nonlinear parameters                  dim(θ)              dim(y)
MB        θL^MB = [t0, 1/c]^T             θN^MB = [ ]                           2 + 0               M
SW        θL^SW = t0                      θN^SW = [1/c, α^T, v0]^T              1 + (n + 1)         S
MB;SW     θL^{MB;SW} = [t0, b^T]^T        θN^{MB;SW} = [1/c, α^T, v0]^T         (M + 1) + (n + 1)   M + S
MB-SW     θL^{MB-SW} = [ ]                θN^{MB-SW} = [1/c, α^T, v0]^T         0 + (n + 1)         S
[Figure: RMSE (m) as a function of the synchronization error σb (ms), for detection error levels σe = 50 μs, 200 μs, 500 μs, and 1000 μs.]

[Figure 6: Scene of the shooter localization field trial. There are ten microphones, three shooter positions, and a common target. (Scale bar: 500 m; markers: shooter, microphone, target.)]

Table 3: Armament and ammunition used at the trial, and number of rounds fired at each shooter position. Also, the resulting localization RMSE for the MB-SW model for each shooter position. For the Luger Pistol the MB model RMSE is given, since only one microphone is located in the Luger Pistol SW cone. [Table body not recovered.]
Apparently, the use of the shock wave significantly improves localization at positions 1 and 2, while rather the opposite holds at position 3. Figure 8 visualizes the shooting direction estimates, α̂. Estimated root mean square errors (RMSEs) for the three shooter positions, together with the theoretical bounds (34), are given in Table 4. The practical results indicate that the use of the shock wave from distant shooters cuts the error by at least 75%.

Table 4: Localization RMSE and theoretical bound (34) for the three different shooter positions using the MB and the MB-SW models, respectively, beside the aim RMSE for the MB-SW model. The aim RMSE is with respect to the aim at x̂ against the target, α°, not with respect to the true direction α. This way the ability to identify the target is assessed.

Shooter position     1         2        3
RMSE(x̂^MB)           105 m     28 m     2.4 m
MB bound             1 m       0.4 m    0.02 m
RMSE(x̂^{MB-SW})      26 m      5.7 m    5.2 m
MB-SW bound          9 m       0.1 m    0.08 m
RMSE(α̂)              0.041°    0.14°    17°

6.3.1. Synchronization and Detection Errors. Since all microphones are recorded by a common recorder, there are actually no timing errors due to inaccurate clocks. This is of course the best way to conduct a controlled experiment, where any uncertainty renders the dataset less useful. From an experimental point of view, it is then simple to add synchronization errors of any desired magnitude off-line. On the dataset at hand, this is however work in progress. At the moment, there are apparently other sources of error worth identifying. It should however be clarified that in the final wireless sensor product, there will always be an unpredictable clock error. As mentioned, detection errors are present, and the expected level of these (80 μs) is used for the bound calculations in Table 4. It is noted that the bounds are in level with, or below, the positioning errors.

There are at least two explanations for the bad performance using the MB-SW model at shooter position 3. One is that the number of microphones reached by the shock wave is insufficient to make accurate estimates. There are four unknown model parameters, but for the relatively low speed of pistol ammunition, for instance, only one microphone has a valid shock wave detection. Another explanation is that the increased detection uncertainty (due to SW/MB intermix) impacts the MB-SW model harder, since it relies on accurate detection of both the MB and SW.

6.3.2. Model Errors. No doubt, there are model inaccuracies both in the ballistic and in the acoustic domain. To that end, there are meteorological uncertainties out of our control. For instance, looking at the MB-SW localizations around shooter position 1 in Figure 7 (squares), three clusters are identified that correspond to three ammunition types with different ballistic properties; see the RMSE for each ammunition and position in Table 3. This clustering, or bias, more likely stems from model errors than from detection errors, and could at least partially explain the large gap between the theoretical bound and the RMSE in Table 4. Working with three-dimensional data in the plane is of course another model discrepancy that could have a greater impact than we first anticipated. This will be investigated in experiments to come.

6.3.3. Numerical Uncertainties. Finally, we face numerical uncertainties. There is no guarantee that the numerical minimization programs we have used here for the MB-SW model really deliver the global minimum. In a realistic implementation, every possible a priori knowledge and also qualitative analysis of the SW and MB signals (amplitude, duration, caliber classification, etc.), together with basic consistency checks, are used to reduce the search space. The reduced search space may then be exhaustively sampled over a grid prior to the final numerical minimization. Simple experiments on an ordinary desktop PC indicate that, with an efficient implementation, it is feasible to minimize any of the described model objective functions over a discrete grid with 10^7 points within the time frame of one second. Thus, by allowing, say, one second extra of computation time, the risk of hitting a local optimum could be significantly reduced.
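A minimal sketch of the coarse-grid-then-polish strategy just described; the choice of Nelder-Mead for the local search is ours, as the paper does not name its optimizer.

```python
import numpy as np
from scipy.optimize import minimize

def grid_then_refine(cost, grid, args=()):
    """Coarse exhaustive search followed by local refinement, cf. 6.3.3.

    Evaluates the objective on a discrete grid of starting points and
    polishes the best one with a derivative-free local search, reducing
    the risk of ending in a local minimum.
    """
    costs = np.array([cost(g, *args) for g in grid])
    best = np.asarray(grid[int(np.argmin(costs))], dtype=float)
    res = minimize(cost, best, args=args, method='Nelder-Mead')
    return res.x, res.fun
```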
7. Conclusions

We have presented a framework for the estimation of shooter location and aiming angle from wireless networks where each node has a single microphone. Both the acoustic muzzle blast (MB) and the ballistic shock wave (SW) contain useful information about the position, but only the SW contains information about the aiming angle. A separable nonlinear least squares (SNLS) framework was proposed to limit the parametric search space and to enable the use of global grid-based optimization algorithms (for the MB model), eliminating potential problems with local minima.

[Figure 7: Estimated positions x̂ based on the MB model and on the MB-SW model. The diagrams are enlargements of the interesting areas around the shooter positions. The dashed lines identify the shooting directions. Markers: shooter, microphone, estimated position (MB model, MB-SW model).]

For a perfectly synchronized network, both MB and SW measurements should be stacked into one large signal model to which SNLS is applied. However, when the synchronization error in the network becomes comparable to the detection error for MB and SW, the performance quickly deteriorates. For that reason, the time difference of MB and SW at each microphone is used, which automatically eliminates any clock offset. The effective number of measurements decreases in this approach, but as the CRLB analysis showed, the root mean square position error is comparable to that of the ideal stacked model, at the same time as the synchronization error distribution may be completely disregarded.

The bullet speed occurs as a nuisance parameter in the proposed signal model. Further, the bullet retardation constant was optimized manually. Future work will investigate if the retardation constant should also be estimated, and if these two parameters can be used, together with the MB and SW signal forms, to identify the weapon and ammunition.

Acknowledgment

This work is funded by the VINNOVA-supported Centre for Advanced Sensors, Multisensors and Sensor Networks, FOCUS, at the Swedish Defence Research Agency, FOI.

References

[1] J. Bédard and S. Paré, "Ferret, a small arms' fire detection system: localization concepts," in Sensors, and Command, Control, Communications, and Intelligence (C3I) Technologies for Homeland Defense and Law Enforcement II, vol. 5071 of Proceedings of SPIE, pp. 497–509, 2003.
[2] J. A. Mazurek, J. E. Barger, M. Brinn et al., "Boomerang mobile counter shooter detection system," in Sensors, and C3I Technologies for Homeland Security and Homeland Defense IV, vol. 5778 of Proceedings of SPIE, pp. 264–282, Bellingham, Wash, USA, 2005.
[3] D. Crane, "Ears-MM soldier-wearable gun-shot/sniper detection and location system," Defence Review, 2008.
[4] "PILAR Sniper Countermeasures System," November 2008, http://www.canberra.com.
[5] J. Millet and B. Balingand, "Latest achievements in gunfire detection systems," in Proceedings of the RTO-MP-SET-107 Battlefield Acoustic Sensing for ISR Applications, Neuilly-sur-Seine, France, 2006.
[6] P. Volgyesi, G. Balogh, A. Nadas, et al., "Shooter localization and weapon classification with soldier-wearable networked sensors," in Proceedings of the 5th International Conference on Mobile Systems, Applications, and Services (MobiSys '07), San Juan, Puerto Rico, 2007.
[7] A. Saxena and A. Y. Ng, "Learning sound location from a single microphone," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '09), pp. 1737–1742, Kobe, Japan, May 2009.