EURASIP Journal on Advances in Signal Processing

Microphone Array Speech Processing

Guest Editors: Sven Nordholm, Thushara Abhayapala, Simon Doclo,
Sharon Gannot, Patrick Naylor, and Ivan Tashev

Copyright © 2010 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2010 of “EURASIP Journal on Advances in Signal Processing.” All articles are open access
articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Editor-in-Chief
Phillip Regalia, Institut National des Télécommunications, France

Associate Editors
Adel M. Alimi, Tunisia
Kenneth Barner, USA
Yasar Becerikli, Turkey
Kostas Berberidis, Greece
Enrico Capobianco, Italy
A. Enis Cetin, Turkey
Jonathon Chambers, UK
Mei-Juan Chen, Taiwan
Liang-Gee Chen, Taiwan
Satya Dharanipragada, USA
Kutluyil Dogancay, Australia
Florent Dupont, France
Frank Ehlers, Italy
Sharon Gannot, Israel
Samanwoy Ghosh-Dastidar, USA
Norbert Goertz, Austria
M. Greco, Italy
Irene Y. H. Gu, Sweden
Fredrik Gustafsson, Sweden
Ulrich Heute, Germany
Sangjin Hong, USA
Jiri Jan, Czech Republic
Magnus Jansson, Sweden
Sudharman K. Jayaweera, USA
Soren Holdt Jensen, Denmark
Mark Kahrs, USA
Moon Gi Kang, South Korea
Walter Kellermann, Germany
Lisimachos P. Kondi, Greece
Alex Chichung Kot, Singapore
Ercan E. Kuruoglu, Italy
Tan Lee, China
Geert Leus, The Netherlands
T.-H. Li, USA
Husheng Li, USA
Mark Liao, Taiwan
Y.-P. Lin, Taiwan
Shoji Makino, Japan
Stephen Marshall, UK
C. Mecklenbräuker, Austria
Gloria Menegaz, Italy
Ricardo Merched, Brazil
Marc Moonen, Belgium
Christophoros Nikou, Greece
Sven Nordholm, Australia
Patrick Oonincx, The Netherlands
Douglas O’Shaughnessy, Canada
Björn Ottersten, Sweden
Jacques Palicot, France
Ana Perez-Neira, Spain
Wilfried R. Philips, Belgium
Aggelos Pikrakis, Greece
Ioannis Psaromiligkos, Canada
Athanasios Rontogiannis, Greece
Gregor Rozinaj, Slovakia
Markus Rupp, Austria
William Sandham, UK
B. Sankur, Turkey
Erchin Serpedin, USA
Ling Shao, UK
Dirk Slock, France
Yap-Peng Tan, Singapore
João Manuel R. S. Tavares, Portugal
George S. Tombras, Greece
Dimitrios Tzovaras, Greece
Bernhard Wess, Austria
Jar-Ferr Yang, Taiwan
Azzedine Zerguine, Saudi Arabia
Abdelhak M. Zoubir, Germany
Contents
Microphone Array Speech Processing, Sven Nordholm, Thushara Abhayapala, Simon Doclo,
Sharon Gannot (EURASIP Member), Patrick Naylor, and Ivan Tashev
Volume 2010, Article ID 694216, 3 pages

Selective Frequency Invariant Uniform Circular Broadband Beamformer, Xin Zhang, Wee Ser,
Zhang Zhang, and Anoop Kumar Krishna
Volume 2010, Article ID 678306, 11 pages

First-Order Adaptive Azimuthal Null-Steering for the Suppression of Two Directional Interferers,
René M. M. Derkx
Volume 2010, Article ID 230864, 16 pages

Musical-Noise Analysis in Methods of Integrating Microphone Array and Spectral Subtraction Based on
Higher-Order Statistics, Yu Takahashi, Hiroshi Saruwatari, Kiyohiro Shikano, and Kazunobu Kondo
Volume 2010, Article ID 431347, 25 pages

Microphone Diversity Combining for In-Car Applications, Jürgen Freudenberger, Sebastian Stenzel,
and Benjamin Venditti
Volume 2010, Article ID 509541, 13 pages

DOA Estimation with Local-Peak-Weighted CSP, Osamu Ichikawa, Takashi Fukuda,
and Masafumi Nishimura
Volume 2010, Article ID 358729, 9 pages

Shooter Localization in Wireless Microphone Networks, David Lindgren, Olof Wilsson,
Fredrik Gustafsson, and Hans Habberstad
Volume 2010, Article ID 690732, 11 pages
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 694216, 3 pages
doi:10.1155/2010/694216

Editorial
Microphone Array Speech Processing

Sven Nordholm (EURASIP Member),1 Thushara Abhayapala (EURASIP Member),2
Simon Doclo (EURASIP Member),3 Sharon Gannot (EURASIP Member),4
Patrick Naylor (EURASIP Member),5 and Ivan Tashev6
1 Department of Electrical and Computer Engineering, Curtin University of Technology, Perth, WA 6845, Australia
2 College of Engineering & Computer Science, The Australian National University, Canberra, ACT 0200, Australia
3 Institute of Physics, Signal Processing Group, University of Oldenburg, 26111 Oldenburg, Germany
4 School of Engineering, Bar-Ilan University, 52900 Tel Aviv, Israel
5 Department of Electrical and Electronic Engineering, Imperial College, London SW7 2AZ, UK
6 Microsoft Research, USA

Correspondence should be addressed to Sven Nordholm, s.nordholm@curtin.edu.au

Received 21 July 2010; Accepted 21 July 2010

Copyright © 2010 Sven Nordholm et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

Significant knowledge about microphone arrays has been gained from years of intense research and product development. Numerous applications have been suggested, ranging from large arrays (on the order of more than 100 elements) for use in auditoriums to small arrays with only two or three elements for hearing aids and mobile telephones. Apart from that, microphone array technology has been widely applied in speech recognition, surveillance, and warfare. Traditional techniques that have been used for microphone arrays include fixed spatial filters, such as frequency invariant beamformers, and optimal and adaptive beamformers. These array techniques assume either model knowledge or calibration-signal knowledge, as well as localization information, for their design. Thus they usually combine some form of localisation and tracking with the beamforming. Today, contemporary techniques using blind signal separation (BSS) and time-frequency masking have attracted significant attention. Those techniques are less reliant on an array model and localization, and more on the statistical properties of speech signals such as sparseness, non-Gaussianity, and non-stationarity. The main advantage that multiple microphones add, from a theoretical perspective, is spatial diversity, which is an effective tool to combat interference, reverberation, and noise. The underpinning physical feature used is a difference in coherence between the target field (the speech signal) and the noise field. Viewing the processing in this way, one can also understand the difficulty in enhancing highly reverberant speech given that we can only observe the received microphone signals.

This special issue contains contributions to traditional areas of research such as frequency invariant beamforming [1], hands-free operation of microphone arrays in cars [2], and source localisation [3]. The contributions show new ways to study these traditional problems and give new insights into them. Small arrays have always attracted many applications and much interest for mobile terminals, hearing aids, and close-up microphones [4]; the novel way to represent small arrays presented here leads to a capability to suppress multiple interferers. Abnormalities in noise and speech stemming from processing are largely unavoidable, and the use of nonlinear processing often results in a significant change of character, particularly in the noise; it is thus important to provide new insights into those phenomena, particularly the so-called musical noise [5]. Finally, new and unusual uses of microphone arrays are always interesting to see: distributed microphone arrays in a sensor network [6] provide a novel approach to finding snipers. This type of processing has good opportunities to grow in interest for new and improved applications.

The contributions found in this special issue can be categorized into three main aspects of microphone array processing: (i) microphone array design based on eigenmode decomposition [1, 4]; (ii) multichannel processing methods [2, 5]; and (iii) source localisation [3, 6].

The paper by Zhang et al., “Selective frequency invariant uniform circular broadband beamformer” [1], describes a design method for Frequency-Invariant (FI) beamforming, a well-known array signal processing technique used in many applications such as speech acquisition, acoustic imaging, and communications. However, many existing FI beamformers are designed to have a frequency invariant gain over all angles. This might not be necessary, and if a gain constraint is confined to a specific angle, then the FI performance over that selected region (in frequency and angle) can be expected to improve. Inspired by this idea, the proposed algorithm attempts to optimize the frequency invariant beampattern solely for the mainlobe and relaxes the FI requirement on the sidelobes. This sacrifice of performance in the undesired region is traded off for better performance in the desired region as well as a reduced number of microphones. The objective function is designed to minimize the overall spatial response of the beamformer with a constraint on the gain being smaller than a predefined threshold value across a specific frequency range and at a specific angle. The problem is formulated as a convex optimization problem and the solution is obtained by using the Second-Order Cone Programming (SOCP) technique. An analysis of the computational complexity of the proposed algorithm is presented, and its performance is evaluated via computer simulation for different numbers of sensors and different threshold values. Simulation results show that the proposed algorithm achieves a smaller mean square error of the spatial response gain over the specific FI region compared to existing algorithms.

The paper by Derkx, “First-order adaptive azimuthal null-steering for the suppression of two directional interferers” [4], shows that an azimuth-steerable first-order superdirectional microphone response can be constructed by a linear combination of three eigenbeams: a monopole and two orthogonal dipoles. Although a (rotation-symmetric) first-order response can only exhibit a single null, the paper studies a slice through this beampattern lying in the azimuthal plane. In this way, a maximum of two nulls in the azimuthal plane can be defined; these nulls are symmetric with respect to the main-lobe axis. By placing these two nulls on at most two directional sources to be rejected, and compensating for the drop in level in the desired direction, these directional sources can be effectively rejected without attenuating the desired source. An adaptive null-steering scheme for adjusting the beampattern, which enables automatic source suppression, is presented. Closed-form expressions for this optimal null-steering are derived, enabling the computation of the azimuthal angles of the interferers. It is shown that the proposed technique has a good directivity index when the angular difference between the desired source and each directional interferer is at least 90 degrees.

In the paper by Takahashi et al., “Musical-noise analysis in methods of integrating microphone array and spectral subtraction based on higher-order statistics” [5], an objective analysis of musical noise is conducted. The musical noise is generated by two methods of integrating microphone array signal processing and spectral subtraction. To obtain better noise reduction, methods of integrating microphone array signal processing and nonlinear signal processing have been researched. However, nonlinear signal processing often generates musical noise. Since such musical noise causes discomfort to users, it is desirable that it be mitigated. Moreover, it has recently been reported that higher-order statistics are strongly related to the amount of musical noise generated. This implies that it is possible to optimize the integration method from the viewpoint of not only the noise reduction performance but also the amount of musical noise generated. Thus, the simplest methods of integration, that is, the delay-and-sum beamformer and spectral subtraction, are analysed, and the features of the musical noise generated by each method are clarified. As a result, it is shown that a specific structure of integration is preferable from the viewpoint of the amount of generated musical noise. The validity of the analysis is demonstrated via a computer simulation and a subjective evaluation.

The paper by Freudenberger et al., “Microphone diversity combining for in-car applications” [2], proposes a frequency-domain diversity approach for two or more microphone signals, for example, for in-car applications. The microphones should be positioned separately to ensure diverse signal conditions and incoherent recording of noise. This enables a better compromise for the microphone position with respect to different speaker sizes and noise sources. This work proposes a two-stage approach: in the first stage, the microphone signals are weighted with respect to their signal-to-noise ratio and then summed, similar to maximum-ratio combining. The combined signal is then used as a reference for a frequency-domain least-mean-squares (LMS) filter for each input signal. The output SNR is significantly improved compared to coherence-based noise reduction systems, even if one microphone is heavily corrupted by noise.

The paper by Ichikawa et al., “DOA estimation with local-peak-weighted CSP” [3], proposes a novel weighting algorithm for Cross-power Spectrum Phase (CSP) analysis to improve the accuracy of direction-of-arrival (DOA) estimation for beamforming in a noisy environment. A human speaker is used as the sound source, and broadband automobile noise is used as the noise source. The harmonic structures in the human speech spectrum can be used to weight the CSP analysis, because harmonic bins must contain more speech power than the others and thus give more reliable information. However, most conventional methods leveraging harmonic structures require pitch estimation with voiced/unvoiced classification, which is not sufficiently accurate in noisy environments. The suggested approach employs the observed power spectrum, which is directly converted into weights for the CSP analysis by retaining only the local peaks considered to come from a harmonic structure. The presented results show that the proposed approach significantly reduces the errors in localization, and it shows further improvement when used with other weighting algorithms.

The paper by Lindgren et al., “Shooter localization in wireless microphone networks” [6], is an interesting combination of microphone array technology with distributed
communications. By detecting the muzzle blast as well as the ballistic shock wave, the microphone array algorithm is able to locate the shooter when the sensors are synchronized. However, in the distributed-sensor case, synchronization is either not achievable or very expensive to achieve, and therefore the accuracy of the localization comes into question. Field trials are described to support the algorithmic development.
Sven Nordholm
Thushara Abhayapala
Simon Doclo
Sharon Gannot
Patrick Naylor
Ivan Tashev

References
[1] X. Zhang, W. Ser, Z. Zhang, and A. K. Krishna, “Selective
frequency invariant uniform circular broadband beamformer,”
EURASIP Journal on Advances in Signal Processing, vol. 2010,
Article ID 678306, 11 pages, 2010.
[2] J. Freudenberger, S. Stenzel, and B. Venditti, “Microphone
diversity combining for in-car applications,” EURASIP Journal
on Advances in Signal Processing, vol. 2010, Article ID 509541,
13 pages, 2010.
[3] O. Ichikawa, T. Fukuda, and M. Nishimura, “DOA estimation
with local-peak-weighted CSP,” EURASIP Journal on Advances
in Signal Processing, vol. 2010, Article ID 358729, 9 pages, 2010.
[4] R. M. M. Derkx, “First-order adaptive azimuthal null-steering
for the suppression of two directional interferers,” EURASIP
Journal on Advances in Signal Processing, vol. 2010, Article ID
230864, 16 pages, 2010.
[5] Y. Takahashi, H. Saruwatari, K. Shikano, and K. Kondo,
“Musical-noise analysis in methods of integrating microphone
array and spectral subtraction based on higher-order statistics,”
EURASIP Journal on Advances in Signal Processing, vol. 2010,
Article ID 431347, 25 pages, 2010.
[6] D. Lindgren, O. Wilsson, F. Gustafsson, and H. Habberstad,
“Shooter localization in wireless sensor networks,” in Proceed-
ings of the 12th International Conference on Information Fusion
(FUSION ’09), pp. 404–411, July 2009.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 678306, 11 pages
doi:10.1155/2010/678306

Research Article
Selective Frequency Invariant Uniform
Circular Broadband Beamformer

Xin Zhang,1 Wee Ser,1 Zhang Zhang,1 and Anoop Kumar Krishna2
1 Center for Signal Processing, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798
2 EADS Innovation Works, EADS Singapore Pte Ltd., No. 41, Science Park Road, 01-30, Singapore 117610

Correspondence should be addressed to Xin Zhang, zhang xin@pmail.ntu.edu.sg

Received 16 April 2009; Revised 24 August 2009; Accepted 3 December 2009

Academic Editor: Thushara Abhayapala

Copyright © 2010 Xin Zhang et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Frequency-Invariant (FI) beamforming is a well-known array signal processing technique used in many applications. In this paper, an algorithm is proposed that attempts to optimize the frequency invariant beampattern solely for the mainlobe and relaxes the FI requirement on the sidelobes. This sacrifice of performance in the undesired region is traded off for better performance in the desired region as well as a reduced number of microphones. The objective function is designed to minimize the overall spatial response of the beamformer with a constraint on the gain being smaller than a predefined threshold value across a specific frequency range and at a specific angle. The problem is formulated as a convex optimization problem and the solution is obtained by using the Second-Order Cone Programming (SOCP) technique. An analysis of the computational complexity of the proposed algorithm is presented, as well as its performance, evaluated via computer simulation for different numbers of sensors and different threshold values. Simulation results show that the proposed algorithm is able to achieve a smaller mean square error of the spatial response gain for the specific FI region compared to existing algorithms.

1. Introduction

Broadband beamforming techniques using an array of microphones have been applied widely in hearing aids, teleconferencing, and voice-activated human-computer interface applications. Several broadband beamformer designs have been reported in the literature [1–3]. One design approach is to decompose the broadband signal into several narrowband signals and apply narrowband beamforming techniques to each narrowband signal [4]. This approach requires several narrowband processing stages to run simultaneously and is computationally expensive. Another design approach is to use adaptive broadband beamformers. Such techniques use a bank of linear transversal filters to generate the desired beampattern; the filter coefficients can be derived adaptively from the received signals. One classic design example is the Frost beamformer [5]. However, in order to have a similar beampattern over the entire frequency range, a large number of sensors and filter taps are needed. This again leads to high computational complexity. The third approach to designing broadband beamformers is to use the Frequency-Invariant (FI) beampattern synthesis technique. As the name implies, such beamformers are designed to have a constant spatial gain response over the desired frequency bands.

In recent years, FI beamforming techniques have developed at a fast pace, and it is difficult to make a distinct classification. However, in order to grasp the literature on FI beamforming at a glance, we classify the techniques loosely into the following three types.

One type of FI beamformer focuses on designs based on array geometry. These include, for example, the 3D sensor array design reported in [6], the rectangular sensor array design reported in [7], and the design using subarrays in [8]. In [9], the FI beampattern is achieved by exploiting the relationship among the frequency responses of the various filters implemented at the output of each sensor.

The second type of FI beamformer is designed on the basis of a least-squares approach. For this type, the weights of the beamformer are optimized such that the error between the actual beampattern and the desired beampattern is minimized over a range of frequencies.
Some of these beamformers are designed in the time-frequency domain [10–12], while others are designed in the eigen-space domain [13].

The third type of FI beamformer is based on “Signal Transformation.” For this type, the signal received at the sensor array is transformed into a domain in which the frequency response and the spatial response of the signal are decoupled and can hence be adjusted independently. This is the principle adopted in [14], where a uniform concentric circular array (UCCA) is designed to achieve the FI beampattern. Excellent results have been produced by this algorithm. One limitation of the UCCA beamformer is that a relatively large number of sensors has to be used to form the concentric circular array.

Inspired by the UCCA beamformer design, a new algorithm has been proposed by the authors of this paper and presented in [15]. The proposed algorithm attempts to optimize the FI beampattern solely for the main lobe, where the signal of interest comes from, and relaxes the FI requirement on the side lobes. As a result, the sacrifice of performance in the undesired region is traded off for better performance in the desired region, and fewer microphones are employed. To achieve this goal, an objective function with a quadratic constraint is designed. This constraint function allows the FI characteristic to be accurately controlled over the specified bandwidth at the expense of other parts of the spectrum that are not of concern to the designer. The objective function is formulated as a convex optimization problem and solved readily by SOCP. Our algorithm has a frequency band of interest from 0.3π to 0.95π. If the sampling frequency is 16000 Hz, the frequency band of interest ranges from 2400 Hz to 7600 Hz; this is applicable to speech processing, as the labial and fricative sounds of speech mostly lie in the 8th to 9th octave. If the sampling frequency is 8000 Hz, the frequency band of interest is from 1200 Hz to 3800 Hz, a range useful for respiratory sounds [16].

The aim of this paper is to provide the full details of the design proposed in [15]. In addition, a computational complexity analysis of the proposed algorithm and sensitivity evaluations for different numbers of sensors and different constraint parameter values are included.

The remainder of the paper is organized as follows: in Section 2, the problem formulation is discussed; in Section 3, the proposed beamforming design is described; in Section 4, the design of the beamforming weights using SOCP is shown; numerical results are given in Section 5; and finally, conclusions are drawn in Section 6.

2. Problem Formulation

A uniformly distributed circular sensor array with K microphones is arranged as shown in Figure 1. Each omnidirectional sensor is located at (r cos φ_k, r sin φ_k), where r is the radius of the circle, φ_k = 2kπ/K, and k = 0, ..., K − 1. In this configuration, the intersensor spacing is fixed at λ/2, where λ is the wavelength of the signals of interest and its minimum value is denoted by λ_min. The radius corresponding to λ_min is given by [14]

    r = \frac{\lambda_{min}}{4 \sin(\pi/K)}.    (1)

Assuming that the circular array lies in a horizontal plane, the steering vector is

    a(f, \phi) = \left[ e^{j 2\pi f r \cos(\phi - \phi_0)/c}, \ldots, e^{j 2\pi f r \cos(\phi - \phi_{K-1})/c} \right]^T,    (2)

where T denotes transpose and c is the speed of sound. For convenience, let ω be the normalized angular frequency, that is, ω = 2πf/f_s; let ρ be the ratio of the sampling frequency to the maximum frequency, that is, ρ = f_s/f_max; and let r̄ be the normalized radius, that is, r̄ = r/λ_min. The steering vector can then be rewritten as

    a(\omega, \phi) = \left[ e^{j \omega \rho \bar{r} \cos(\phi - \phi_0)}, \ldots, e^{j \omega \rho \bar{r} \cos(\phi - \phi_{K-1})} \right]^T.    (3)
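To make (1)–(3) concrete, the following is a minimal NumPy sketch (ours, not from the paper) that evaluates the normalized steering vector of the uniform circular array; the function name and the example value ρ = 2 (sampling at twice the maximum frequency) are illustrative assumptions.

```python
import numpy as np

def steering_vector(omega, phi, K, rho, r_bar):
    """Steering vector of (3): a_k = exp(j*omega*rho*r_bar*cos(phi - phi_k)),
    with sensor azimuths phi_k = 2*pi*k/K and omega the normalized frequency."""
    phi_k = 2 * np.pi * np.arange(K) / K
    return np.exp(1j * omega * rho * r_bar * np.cos(phi - phi_k))

# Example: K = 20 sensors; from (1), the normalized radius is
# r_bar = r / lambda_min = 1 / (4 * sin(pi / K)).
K = 20
rho = 2.0                                   # assumed fs = 2 * fmax
r_bar = 1.0 / (4 * np.sin(np.pi / K))
a = steering_vector(0.5 * np.pi, 0.0, K, rho, r_bar)
print(a.shape)                              # (20,)
```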
Figure 2 shows the system structure of the proposed uniform circular array beamformer. The sampled signals after the sensors are represented by the vector X[n] = [x_0(n), x_1(n), ..., x_{K−1}(n)]^T, where n is the sampling instant. These sampled signals are transformed into a set of coefficients via the Inverse Discrete Fourier Transform (IDFT), where each of the coefficients is called a phase mode [17]. The mth phase mode at time instant n can be expressed as

    p_m[n] = \sum_{k=0}^{K-1} x_k[n] \, e^{j 2\pi k m / K}.    (4)

These phase modes are passed through an FIR (Finite Impulse Response) filter whose coefficients are denoted b_m[n]. The purpose of this filter is to remove the frequency dependency of the received signal X[n]. The beamformer output y[n] is then determined as the weighted sum of the filtered signals:

    y[n] = \sum_{m=-L}^{L} \left( p_m[n] * b_m[n] \right) \cdot h_m,    (5)

where h_m are the phase-mode spatial weighting coefficients (the beamforming weights), and * is the discrete-time convolution operator.

Let M be the total number of phase modes; it is assumed to be an odd number. It can be seen from Figure 2 that the K received signals are transformed into M phase modes, where L = (M − 1)/2.
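As an illustration of the phase-mode transform (4), here is a short NumPy sketch of ours; the sensors-by-samples data layout and the names are assumptions.

```python
import numpy as np

def phase_modes(X, M):
    """Map K sensor signals to M = 2L+1 phase modes per (4):
    p_m[n] = sum_k x_k[n] * exp(j*2*pi*k*m/K) for m = -L, ..., L.
    X has shape (K, N); the result has shape (M, N)."""
    K = X.shape[0]
    L = (M - 1) // 2
    k = np.arange(K)
    rows = [np.exp(1j * 2 * np.pi * k * m / K) @ X for m in range(-L, L + 1)]
    return np.stack(rows)

# Example: 20 sensors, 1024 samples, 17 phase modes.
P = phase_modes(np.random.randn(20, 1024), 17)
print(P.shape)  # (17, 1024)
```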

Figure 1: Uniform circular array configuration (the kth element at azimuth φ_k on a circle of radius r).

The corresponding spectrum of the phase modes can be obtained by taking the Discrete-Time Fourier Transform (DTFT) of the phase modes defined in (4):

    P_m(\omega) = \sum_{k=0}^{K-1} X_k(\omega) \, e^{j 2\pi k m / K} = S(\omega) \sum_{k=0}^{K-1} e^{j \omega \rho \bar{r} \cos(\phi - \phi_k)} \, e^{j 2\pi k m / K},    (6)

where S(ω) is the spectrum of the source signal.

Taking the DTFT on both sides of (5) and using (6), we have

    Y(\omega) = \sum_{m=-L}^{L} h_m P_m(\omega) B_m(\omega) = S(\omega) \sum_{m=-L}^{L} h_m \left( \sum_{k=0}^{K-1} e^{j \omega \rho \bar{r} \cos(\phi - \phi_k)} \, e^{j 2\pi k m / K} \right) B_m(\omega).    (7)

Consequently, the response of the beamformer can be expressed as

    G(\omega, \phi) = \sum_{m=-L}^{L} h_m \left( \sum_{k=0}^{K-1} e^{j \omega \rho \bar{r} \cos(\phi - \phi_k)} \, e^{j 2\pi k m / K} \right) B_m(\omega).    (8)

In order to obtain an FI response, the terms that are functions of ω are grouped together using the Jacobi-Anger expansion [18]:

    e^{j \beta \cos\gamma} = \sum_{n=-\infty}^{+\infty} j^n J_n(\beta) \, e^{j n \gamma},    (9)

where J_n(β) is the Bessel function of the first kind of order n. Substituting (9) into (8) and applying a property of the Bessel functions, the spatial response of the beamformer can now be approximated by

    G(\omega, \phi) = \sum_{m=-L}^{L} h_m \cdot e^{j m \phi} \cdot K \cdot j^m \cdot J_m(\omega \rho \bar{r}) \cdot B_m(\omega).    (10)

This process has been described in [13], and its detailed derivation can be found in [14].
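The closed form (10) makes the beampattern cheap to evaluate numerically. The sketch below is ours (not the authors' code); it assumes the filter responses B_m(ω) are supplied as a callable and uses SciPy's Bessel function of the first kind.

```python
import numpy as np
from scipy.special import jv   # Bessel function of the first kind, J_n

def response(omega, phi, h, B, rho, r_bar, K):
    """Evaluate G(omega, phi) of (10). h holds the weights h_m for
    m = -L..L; B(m, omega) returns the FIR response B_m(omega)."""
    L = (len(h) - 1) // 2
    G = 0j
    for i, m in enumerate(range(-L, L + 1)):
        G += (h[i] * np.exp(1j * m * phi) * K * (1j ** m)
              * jv(m, omega * rho * r_bar) * B(m, omega))
    return G
```

Evaluating |G| on a grid of ω and φ reproduces beampattern plots such as Figures 3–5.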
3. Proposed Novel Beamformer

With the above formulation, we propose the following beampattern synthesis method. The basic idea is to enhance broadband signals within a specific frequency region and from a certain direction. In order to achieve this goal, the following objective function is proposed:

    \min \int_{\omega} \int_{\phi} \left| G(\omega, \phi) \right|^2 d\omega \, d\phi, \quad \text{s.t. } \left| G(\omega, \phi_0) - 1 \right| \le \delta, \ \omega \in [\omega_l, \omega_u],    (11)

where G(ω, φ) is the spatial response of the beamformer given in (10), ω_l and ω_u are the lower and upper limits of the specified frequency region, respectively, φ_0 is the specified direction, and δ is a predefined threshold value that controls the magnitude of the ripples of the main beam.

In principle, the objective function defined above aims to minimize the square of the spatial gain response across all frequencies and all angles, while constraining the gain to a value of one at the specified angle. The gain constraint is thus relaxed to one angle instead of all angles, so that the FI beampattern in the specified region can be improved. With this constraint setting, the resulting beamformer can enhance broadband desired signals arriving from one direction while attenuating broadband noise received from other directions. The concept behind the objective function is similar to the Capon beamformer [19]. One difference is that the Capon beamformer aims to minimize the data-dependent array output power at a single frequency, while the proposed algorithm aims to minimize the data-independent array output power across a wide range of frequencies. Another difference is that the constraint used in the Capon beamformer is a hard constraint, whereas the array gain used in the proposed algorithm is a soft constraint, which can result in a higher degree of flexibility.

The proposed algorithm is expected to have lower computational complexity than the UCCA beamformer. The latter is designed to achieve an FI beampattern for all angles, whereas the proposed algorithm focuses only on a specified angle. For the same reason, the proposed algorithm is expected to have more degrees of freedom, which explains why it obtains a better FI beampattern for a given number of sensors. These performance improvements have been supported by computer simulations and will be discussed in the later part of this paper.

Figure 2: The system structure of a uniform circular array beamformer (the K sensor signals x_k[n] are transformed by an IDFT into phase modes p_m[n], filtered by b_m[n], weighted by h_m, and summed to produce y[n]).

The optimization problem defined by (10) and (11) requires the optimum values of both the compensation filters and the spatial weightings to be determined simultaneously. As such, Cholesky factorization is used to transform the objective function further into a Second-Order Cone Programming (SOCP) problem. The details of the implementation will be discussed in the following section. It should be noted that when the threshold value δ equals zero, the optimization process becomes a linearly constrained problem.

4. Convex Optimization-Based Implementation

Second-Order Cone Programming (SOCP) is a popular tool for solving convex optimization problems, and it has been used for array pattern synthesis [20–22] since the early papers by Lobo et al. [23]. One advantage of SOCP is that the globally optimal solution is guaranteed if it exists, whereas a constrained least-squares optimization procedure looks for a local minimum. Another important advantage is that it is very convenient to include additional linear or convex quadratic constraints, such as a norm constraint on the variable vector, in the problem formulation. The standard form of SOCP can be written as follows:

    \min_{x} \ b^T x, \quad \text{s.t. } d_i^T x + q_i \ge \| A_i x + c_i \|_2, \ i = 1, \ldots, N,    (12)

where x ∈ R^m is the variable vector; the parameters are b ∈ R^m, A_i ∈ R^{(n_i−1)×m}, c_i ∈ R^{n_i−1}, d_i ∈ R^m, and q_i ∈ R. The norm appearing in the constraints is the standard Euclidean norm, that is, \|u\|_2 = (u^T u)^{1/2}.

4.1. Convex Optimization of the Beampattern Synthesis Problem. The following transformations are carried out to convert (11) into the standard form defined by (12).

First, B_m(\omega) = \sum_{n=0}^{N_m} b_m[n] e^{-jn\omega} is substituted into (10), where N_m is the filter order for each phase mode. The spatial response of the beamformer can now be expressed as

    G(\omega, \phi) = \sum_{m=-L}^{L} h_m \cdot e^{j m \phi} \cdot K \cdot j^m \cdot J_m(\omega \rho \bar{r}) \cdot \left[ \sum_{n=0}^{N_m} b_m[n] e^{-jn\omega} \right].    (13)

Using the identity e^{-jn\omega} = \cos(n\omega) - j\sin(n\omega), (13) becomes

    G(\omega, \phi) = \sum_{m=-L}^{L} h_m \cdot e^{j m \phi} \cdot K \cdot j^m \cdot J_m(\omega \rho \bar{r}) \cdot \left[ \sum_{n=0}^{N_m} b_m[n] \left( \cos(n\omega) - j\sin(n\omega) \right) \right]
              = K \sum_{m=-L}^{L} h_m \cdot e^{j m \phi} \cdot j^m \cdot J_m(\omega \rho \bar{r}) \cdot \left[ \sum_{n=0}^{N_m} b_m[n] \cos(n\omega) - j \sum_{n=0}^{N_m} b_m[n] \sin(n\omega) \right]
              = K \sum_{m=-L}^{L} h_m \cdot e^{j m \phi} \cdot j^m \cdot J_m(\omega \rho \bar{r}) \cdot \left( c_m b_m - j s_m b_m \right),    (14)

where b_m = [b_m[0], b_m[1], ..., b_m[N_m]]^T, c_m = [cos(0), cos(ω), ..., cos(N_m ω)], and s_m = [sin(0), sin(ω), ..., sin(N_m ω)]. Here h_m is the spatial weighting in the system structure, and b_m is the FIR filter coefficient vector for each phase mode.

Let u_m = h_m · j^m · b_m; then

    G(\omega, \phi) = K \sum_{m=-L}^{L} e^{j m \phi} J_m(\omega \rho \bar{r}) \, c_m u_m - jK \sum_{m=-L}^{L} e^{j m \phi} J_m(\omega \rho \bar{r}) \, s_m u_m = c(\omega, \phi) u - j s(\omega, \phi) u,    (15)

where c(ω, φ) = [K e^{−jLφ} J_{−L}(ωρr̄) c_{−L}, ..., K e^{jLφ} J_L(ωρr̄) c_L], u = [u_{−L}^T, u_{−L+1}^T, ..., u_L^T]^T, and s(ω, φ) = [K e^{−jLφ} J_{−L}(ωρr̄) s_{−L}, ..., K e^{jLφ} J_L(ωρr̄) s_L].
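For implementation, the row vectors c(ω, φ) and s(ω, φ) of (15) can be assembled block-by-block over the phase modes and stacked as in (16) below. A minimal sketch of ours; all names are illustrative:

```python
import numpy as np
from scipy.special import jv

def rows_cs(omega, phi, L, Nm, rho, r_bar, K):
    """Row vectors c(omega, phi) and s(omega, phi) of (15); each phase
    mode m contributes a block of length Nm + 1."""
    n = np.arange(Nm + 1)
    c_blocks, s_blocks = [], []
    for m in range(-L, L + 1):
        g = K * np.exp(1j * m * phi) * jv(m, omega * rho * r_bar)
        c_blocks.append(g * np.cos(n * omega))
        s_blocks.append(g * np.sin(n * omega))
    return np.concatenate(c_blocks), np.concatenate(s_blocks)

def A_hermitian(omega, phi, L, Nm, rho, r_bar, K):
    """Stack c and -s as the two rows of A(omega, phi)^H, per (16)."""
    c, s = rows_cs(omega, phi, L, Nm, rho, r_bar, K)
    return np.vstack([c, -s])
```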

Representing the complex spatial response G(ω, φ) by a two-dimensional vector g(ω, φ) that places the two terms of (15) in separate rows, (15) is rewritten in the following form:

    g(\omega, \phi) = \begin{pmatrix} c(\omega, \phi) \\ -s(\omega, \phi) \end{pmatrix} u = A(\omega, \phi)^H u.    (16)

Hence, \left| G(\omega, \phi) \right|^2 = g^H g = \left( A(\omega, \phi)^H u \right)^H \left( A(\omega, \phi)^H u \right) = u^H A(\omega, \phi) A(\omega, \phi)^H u.

The objective function and the constraint inequality defined in (11) can now be written as

    \min_{u} \ u^H R u, \quad \text{s.t. } \left| G(\omega, \phi_0) - 1 \right| \le \delta \ \text{for } \omega \in [\omega_l, \omega_u],    (17)

where R = \int_{\omega} \int_{\phi} A(\omega, \phi) A(\omega, \phi)^H \, d\omega \, d\phi.

In order to transform (17) into the SOCP form defined by (12), the cost function must be linear. Since the matrix R is Hermitian and positive definite, it can be decomposed into an upper triangular matrix and its transpose using Cholesky factorization, that is, R = D^H D, where D is the Cholesky factor of R. Substituting this into (17), we have

    u^H R u = u^H D^H D u = (Du)^H (Du).    (18)

This further simplifies (17) into the following form:

    \min_{u} \ \|d\|_2, \quad \text{s.t. } \|d\|_2 = \|D u\|_2, \ \left| G(\omega, \phi_0) - 1 \right| \le \delta \ \text{for } \omega \in [\omega_l, \omega_u].    (19)

Introducing t as an upper bound on the norm of the vector Du over the choices of u, (19) reduces to

    \min_{u} \ t, \quad \text{s.t. } \|D u\| \le t, \ \left| G(\omega, \phi_0) - 1 \right| \le \delta \ \text{for } \omega \in [\omega_l, \omega_u].    (20)

It should be noted that (20) contains I different constraints, where the I constraint frequencies uniformly divide the frequency range spanned by ω.

Lastly, in order to solve (20) with an SOCP toolbox, we stack t and the coefficients of u together and define y = [t; u]. Let a = [1, 0]^T, so that t = a^T y. As a result, the objective function and the constraint defined in (11) can be expressed as

    \min_{y} \ a^T y, \quad \text{s.t. } \left\| [0 \ D] \, y \right\| \le a^T y, \quad \left\| [0 \ A(\omega, \phi_0)^H] \, y - \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right\| \le \delta \ \text{for } \omega \in [\omega_l, \omega_u],    (21)

where 0 is the zero matrix with its dimension determined from the context.

Equation (21) can now be solved with great efficiency using a convex optimization toolbox such as SeDuMi [24].
4.2. Computational Complexity. When the Interior-Point Method (IPM) is used to solve the SOCP problem defined in (21), the number of iterations needed is bounded by O(\sqrt{N}), where N is the number of constraints. The amount of computation per iteration is O(n^2 \sum_i n_i) [23]. The bulk of the computational requirement of the broadband array pattern synthesis comes from the optimization process. The computational complexity of the optimization process of the proposed algorithm and that of the UCCA algorithm have been calculated and are listed in Table 1. It can be seen from Table 1 that the proposed algorithm requires a similar amount of computation per iteration but a much smaller number of iterations compared to the UCCA algorithm. The overall computational load of the proposed method is therefore much smaller than that needed by the UCCA algorithm. It should be noted that, as the coefficients are optimized in the phase-mode domain, the comparative computational load presented above is calculated based on the number of phase modes and not the number of sensors. Nevertheless, the larger the number of sensors, the larger the number of phase modes too.

Figure 3: The normalized spatial response of the proposed beamformer for ω = [0.3π, 0.95π].

5. Numerical Results

In this numerical study, the performance of the proposed beamformer is compared with that of the UCCA beamformer [14] and Yan's beamformer [25] for the specified frequency region. The evaluation metric used to quantify the frequency-invariance (FI) characteristic is the mean squared error of the array gain variation at the specified direction. The sensitivity of the proposed algorithm will also be evaluated for different numbers of sensors and for different threshold values controlling the magnitude of the ripples of the main beam.

Table 1: Computational complexity of different broadband beampattern synthesis methods.

Method          Number of iterations   Amount of computation per iteration
UCCA            O{√(I × M)}            O{(1 + P(1 + N_m))^2 · [2M(I + 1)]}
Equation (11)   O{√(1 + I)}            O{[M(N_m + 1)]^2 · [2I + M(N_m + 1) + 1]}

Table 2: Comparison of array gain at each frequency along the desired direction for the three methods.

Normalized frequency    Proposed         Yan's            UCCA
(radians/sample)        beamformer (dB)  beamformer (dB)  beamformer (dB)
0.3                     −0.0007          0                0.6761
0.4                     −0.0248          −0.8230          0.1760
0.5                     0.0044           −1.3292          −0.022
0.6                     −0.0097          −1.6253          −0.2186
0.7                     −0.0046          −1.8789          −0.6301
0.8                     0.0085           −2.9498          −0.1291
0.9                     −0.0033          −6.2886          0.1477
Figure 4: The normalized spatial response of the UCCA beamformer for ω = [0.3π, 0.95π].

Figure 5: The normalized spatial response of Yan's beamformer for ω = [0.3π, 0.95π].

A uniform circular array consisting of 20 sensors is considered. All the sensors are assumed to be perfectly calibrated. The number of phase modes M is set to 17, and thus there are 17 spatial weighting coefficients. The order of the compensation filter is set to 16 for all the phase modes. The frequency region of interest is specified to be from 0.3π to 0.95π. The threshold value δ, which controls the magnitude of the ripples of the main beam, is set to 0.1. The specified direction is set to 0°, where the reference microphone is located.

There are several optimization criteria presented in [25]. The one chosen for comparison is the peak-sidelobe-constrained minimax mainlobe spatial response variation (MSRV) design. Its objective is to minimize the maximum MSRV subject to a peak sidelobe constraint. The mathematical expression is as follows:

    \min_{h} \ \sigma, \quad \text{s.t. } u(f_0, \phi_0)^T h = 1, \quad \left| \left[ u(f_k, \theta_q) - u(f_0, \theta_q) \right]^T h \right| \le \sigma, \quad \left| u(f_k, \theta_s)^T h \right| \le \varepsilon, \quad f_k \in [f_l, f_u], \ \theta_q \in \Theta_{ML}, \ \theta_s \in \Theta_{SL},    (22)

where f_0 is the reference frequency, chosen to have the value of f_l, and h is the vector of beamforming weights to be optimized. ε is the peak sidelobe constraint, set to 0.036. Θ_ML and Θ_SL represent the mainlobe and sidelobe regions, respectively.
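For reference, the minimax MSRV criterion (22) is itself a convex program and can be transcribed in the same way as (21). The following CVXPY sketch is ours (not Yan's code) and assumes the steering data have been precomputed into matrices:

```python
import cvxpy as cp

def minimax_msrv(u_ref, U_diff, U_side, eps=0.036):
    """Sketch of (22): minimize the maximum mainlobe spatial response
    variation subject to a peak sidelobe constraint.
    u_ref:  steering vector u(f_0, phi_0);
    U_diff: rows [u(f_k, theta_q) - u(f_0, theta_q)]^T over the mainlobe grid;
    U_side: rows u(f_k, theta_s)^T over the sidelobe grid."""
    h = cp.Variable(u_ref.shape[0], complex=True)
    sigma = cp.Variable()
    cons = [u_ref @ h == 1,                # distortionless at (f_0, phi_0)
            cp.abs(U_diff @ h) <= sigma,   # mainlobe variation bound
            cp.abs(U_side @ h) <= eps]     # peak sidelobe constraint
    cp.Problem(cp.Minimize(sigma), cons).solve()
    return h.value, sigma.value
```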
The beampattern obtained for the proposed beamformer over the frequency region of interest is shown in Figure 3, where the spatial response at 10 uniformly spaced discrete frequencies is superimposed. It can be seen that the proposed beamformer has an approximately constant gain within the frequency region of interest in the specified direction (0°). As the direction deviates from 0°, the FI property becomes poorer. The peak sidelobe level has a value of −8 dB.

The beampattern of the UCCA beamformer is shown in Figure 4. As the proposed algorithm is based on a circular array, only one layer of the UCCA concentric array is used for the numerical study. All other parameter settings remain the same as those used for the proposed algorithm.
Figure 6: Comparison of the FI characteristic of the proposed beamformer, the UCCA beamformer, and Yan's beamformer at 0 degrees for ω = [0.3π, 0.95π] (mean square error versus normalized frequency).

Figure 7: Directivity versus frequency for the broadband beam pattern shown in Figure 3.

Figure 8: White noise gain versus frequency for the broadband beam pattern shown in Figure 3.

As shown in Figure 4, the beampattern of the UCCA beamformer is not as constant as that of the proposed beamformer in the specified direction (0°). Its peak sidelobe level, at −6 dB, is also higher than that of the proposed beamformer.

The beampattern of Yan's beamformer is shown in Figure 5. Its frequency invariant characteristic is poorer at the desired direction; however, it has the lowest sidelobe level of all. From this comparison, we find that by processing the signal in phase modes, the frequency range over which the beamformer achieves Frequency-Invariant (FI) characteristics is wider.

The mean squared errors of the spatial response gain in the specified direction and across different frequencies for the different methods are shown in Figure 6. It is seen that the proposed beamformer outperforms both the UCCA beamformer and Yan's beamformer in achieving the FI characteristic at the desired direction. Table 2 tabulates the array gain at each frequency along the desired direction for these three methods.

Furthermore, the performance of the frequency invariant beam pattern obtained by the proposed method is assessed by evaluating the directivity and the white noise gain over the entire frequency band considered, as shown in Figures 7 and 8, respectively. Directivity describes the ability of the array to suppress a diffuse noise field, while white noise gain shows the ability of the array to suppress spatially uncorrelated noise, which can be caused by the self-noise of the sensors. Because our array is circular, the directivity D(ω) is calculated using the following equation:

    D(\omega) = \frac{\left| \sum_{m=-L}^{L} B_m(\omega) \right|^2}{\sum_{m=-L}^{L} \sum_{n=-L}^{L} B_m(\omega) B_n(\omega) \, \mathrm{sinc}\!\left[ (m - n) \, 2\pi \omega r / c \right]},    (23)

where B_m(ω) is the frequency response of the FIR filter at the mth phase mode, and r is the radius of the circle.
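A direct numerical transcription of (23), following the equation as printed (a sketch of ours; note that NumPy's sinc is the normalized sinc sin(πx)/(πx), so its argument is divided by π, and the speed-of-sound value is our assumption):

```python
import numpy as np

def directivity_db(B, omega, r, c=343.0):
    """Directivity D(omega) of (23) for the phase-mode filter responses
    B (array ordered m = -L..L) at one normalized frequency omega."""
    L = (len(B) - 1) // 2
    m = np.arange(-L, L + 1)
    num = np.abs(np.sum(B)) ** 2
    x = (m[:, None] - m[None, :]) * 2 * np.pi * omega * r / c
    den = np.sum(np.outer(B, B) * np.sinc(x / np.pi))   # B_m * B_n, per (23)
    return 10.0 * np.log10(np.real(num / den))
```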
As shown in Figures 7 and 8, the directivity has a nearly constant profile, with an average value of 13.1755 dB, while the white noise gain ranges from 5.5 dB to 11.3 dB. These positive values represent an attenuation of the self-noise of the microphones. As expected, the lower the frequency, the smaller the white noise gain and the higher the sensitivity to array imperfections. Hence, the proposed beamformer is most sensitive to array imperfections at low frequencies and most robust to array imperfections at the normalized frequency 0.75π.

5.1. Sensitivity Study—Number of Sensors. Most FI beamformers reported in the literature employ a large number of sensors.

Figure 9: The normalized spatial response of the proposed FI beamformer for 10 microphones.

Figure 10: The normalized spatial response of the UCCA beamformer for 10 microphones.

Figure 11: The normalized spatial response of Yan's beamformer for 10 microphones.

Figure 12: The normalized spatial response of the proposed FI beamformer for 8 microphones.

In this study, the number of sensors is reduced from 20 to 10 and then to 8, and the performances of the proposed FI beamformer, the UCCA beamformer, and Yan's beamformer are compared. The results are shown in Figures 9, 10, 11, 12, 13, and 14. As seen from the simulations, when 10 microphones are employed, the proposed algorithm achieves the best FI performance in the mainlobe region, with a sidelobe level of −8 dB. For the UCCA method and Yan's method, the frequency invariant characteristics are not promising at the desired direction, and higher sidelobes are obtained. When the number of microphones is further reduced to 8, our proposed method is still able to produce a reasonable FI beampattern, whereas the FI property of the beampattern of the UCCA algorithm becomes much poorer in the specified direction.

5.2. Sensitivity Study—Different Threshold Value δ. In the proposed algorithm, δ is a parameter created to define the allowed ripple in the magnitude of the main-beam spatial gain response. In this section, different values of δ are used to study the sensitivity of the performance of the proposed algorithm to this parameter. Three values, namely δ = [0.001, 0.01, 0.1], are selected, and the results obtained are shown in Figures 15, 16, and 17, respectively. The specified frequency region of interest remains the same. Figure 18 shows the mean squared error of the array gain at the specified direction (0°) for the three different δ values studied.

Figure 13: The normalized spatial response of the UCCA beamformer for 8 microphones.

Figure 14: The normalized spatial response of Yan's beamformer for 8 microphones.

Figure 15: The normalized spatial response of the proposed beamformer for δ = 0.001.

Figure 16: The normalized spatial response of the proposed beamformer for δ = 0.01.

Figure 17: The normalized spatial response of the proposed beamformer for δ = 0.1.

As shown in the figures, as the value of δ decreases, the FI performance at the specified direction improves. The results also show that this improvement in FI performance in the specified direction comes at the cost of an increased peak sidelobe level and a poorer FI beampattern in the other directions within the main beam. For example, when the value of δ is 0.001, the peak sidelobe of the spatial response is as high as −5 dB and the beampatterns do not overlap well in the main beam. As δ increases to 0.1, the peak sidelobe of the spatial response is approximately −10 dB (lower), and the beampatterns in the main beam are observed to have relatively good FI characteristics.

6. Conclusion

A selective frequency invariant uniform circular broadband beamformer is presented in this paper. In addition to providing the details of a recent conference paper by the authors, a complexity analysis and two sensitivity studies of the proposed algorithm are presented. The proposed algorithm is designed to minimize an objective function of the spatial response gain with a constraint on the gain being smaller than a predefined threshold value across a specified frequency range and in a specified direction. The problem is formulated as a convex optimization problem, and the solution is obtained by using the Second-Order Cone Programming (SOCP) technique.

Figure 18: Comparison of the FI characteristic of the proposed beamformer for δ = 0.001, 0.01, and 0.1 at 0 degrees for ω = [0.3π, 0.95π] (mean square error versus normalized frequency).

The complexity analysis shows that the proposed algorithm has a lower computational requirement than the UCCA algorithm for the problem defined. Numerical results show that the proposed algorithm is able to achieve a more frequency-invariant beampattern and a smaller mean square error of the spatial response gain in the specified direction across the specified FI region compared to the UCCA algorithm.

Acknowledgments

The authors would like to acknowledge the helpful discussions with H. H. Chen of the University of Hong Kong on the UCCA algorithm. The authors would also like to thank STMicroelectronics (Singapore) for the sponsorship of this project. Last but not least, the authors would like to thank the reviewers for their constructive comments and suggestions, which greatly improved the quality of this manuscript.

References

[1] H. Krim and M. Viberg, “Two decades of array signal processing research: the parametric approach,” IEEE Signal Processing Magazine, vol. 13, no. 4, pp. 67–94, 1996.
[2] D. H. Johnson and D. E. Dudgeon, Array Signal Processing: Concepts and Techniques, Prentice-Hall, Upper Saddle River, NJ, USA, 1993.
[3] R. A. Monzingo and T. W. Miller, Introduction to Adaptive Arrays, John Wiley & Sons, SciTech, New York, NY, USA, 2004.
[4] B. D. Van Veen and K. M. Buckley, “Beamforming: a versatile approach to spatial filtering,” IEEE ASSP Magazine, vol. 5, no. 2, pp. 4–24, April 1988.
[5] O. L. Frost III, “An algorithm for linearly constrained adaptive array processing,” Proceedings of the IEEE, vol. 60, no. 8, pp. 926–935, 1972.
[6] W. Liu, D. McLernon, and M. Ghogho, “Frequency invariant beamforming without tapped delay-lines,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’07), vol. 2, pp. 997–1000, Honolulu, Hawaii, USA, April 2007.
[7] M. Ghavami, “Wideband smart antenna theory using rectangular array structures,” IEEE Transactions on Signal Processing, vol. 50, no. 9, pp. 2143–2151, 2002.
[8] T. Chou, “Frequency-independent beamformer with low response error,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’95), vol. 5, pp. 2995–2998, Detroit, Mich, USA, May 1995.
[9] D. B. Ward, R. A. Kennedy, and R. C. Williamson, “FIR filter design for frequency invariant beamformers,” IEEE Signal Processing Letters, vol. 3, no. 3, pp. 69–71, 1996.
[10] A. Trucco and S. Repetto, “Frequency invariant beamforming in very short arrays,” in Proceedings of the MTS/IEEE Techno-Ocean (Oceans ’04), vol. 2, pp. 635–640, November 2004.
[11] A. Trucco, M. Crocco, and S. Repetto, “A stochastic approach to the synthesis of a robust frequency-invariant filter-and-sum beamformer,” IEEE Transactions on Instrumentation and Measurement, vol. 55, no. 4, pp. 1407–1415, 2006.
[12] S. Doclo and M. Moonen, “Design of far-field and near-field broadband beamformers using eigenfilters,” Signal Processing, vol. 83, no. 12, pp. 2641–2673, 2003.
[13] L. C. Parra, “Least squares frequency-invariant beamforming,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 102–105, New Paltz, NY, USA, October 2005.
[14] S. C. Chan and H. H. Chen, “Uniform concentric circular arrays with frequency-invariant characteristics—theory, design, adaptive beamforming and DOA estimation,” IEEE Transactions on Signal Processing, vol. 55, no. 1, pp. 165–177, 2007.
[15] X. Zhang, W. Ser, Z. Zhang, and A. K. Krishna, “Uniform circular broadband beamformer with selective frequency and spatial invariant region,” in Proceedings of the 1st International Conference on Signal Processing and Communication Systems (ICSPCS ’07), Gold Coast, Australia, December 2007.
[16] W. Ser, T. T. Zhang, J. Yu, and J. Zhang, “Detection of wheezes using a wearable distributed array of microphones,” in Proceedings of the 6th International Workshop on Wearable and Implantable Body Sensor Networks (BSN ’09), pp. 296–300, Berkeley, Calif, USA, June 2009.
[17] D. E. N. Davies, “Circular arrays,” in Handbook of Antenna Design, Peregrinus, London, UK, 1983.
[18] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions, Dover, New York, NY, USA, 1965.
[19] J. Capon, “High-resolution frequency-wavenumber spectrum analysis,” Proceedings of the IEEE, vol. 57, no. 8, pp. 1408–1418, 1969.
[20] F. Wang, V. Balakrishnan, P. Y. Zhou, J. J. Chen, R. Yang, and C. Frank, “Optimal array pattern synthesis using semidefinite programming,” IEEE Transactions on Signal Processing, vol. 51, no. 5, pp. 1172–1183, 2003.
[21] J. Liu, A. B. Gershman, Z.-Q. Luo, and K. M. Wong, “Adaptive beamforming with sidelobe control: a second-order cone programming approach,” IEEE Signal Processing Letters, vol. 10, no. 11, pp. 331–334, 2003.
[22] S. Autrey, “Design of arrays to achieve specified spatial characteristics over broadbands,” in Signal Processing, J. W. R. Griffiths, Ed., pp. 507–524, Academic Press, New York, NY, USA, 1973.

[23] M. S. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret, “Applications of second-order cone programming,” Linear Algebra and Its Applications, vol. 284, no. 1–3, pp. 193–228, 1998.
[24] J. F. Sturm, “Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones,” Optimization Methods and Software, vol. 11, no. 1–4, pp. 625–653, 1999.
[25] S. Yan, Y. Ma, and C. Hou, “Optimal array pattern synthesis for broadband arrays,” Journal of the Acoustical Society of America, vol. 122, no. 5, pp. 2686–2696, 2007.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 230864, 16 pages
doi:10.1155/2010/230864

Research Article
First-Order Adaptive Azimuthal Null-Steering for
the Suppression of Two Directional Interferers

René M. M. Derkx
Digital Signal Processing Group, High Tech Campus 36, 5656 AE Eindhoven, The Netherlands

Correspondence should be addressed to René M. M. Derkx, renederkx@online.nl

Received 21 July 2009; Revised 10 November 2009; Accepted 15 December 2009

Academic Editor: Simon Doclo

Copyright © 2010 René M. M. Derkx. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

An azimuth steerable first-order superdirectional microphone response can be constructed by a linear combination of three
eigenbeams: a monopole and two orthogonal dipoles. Although the response of a (rotation symmetric) first-order response can
only exhibit a single null, we will look at a slice through this beampattern lying in the azimuthal plane. In this way, we can define
maximally two nulls in the azimuthal plane which are symmetric with respect to the main-lobe axis. By placing these two nulls on
maximally two directional sources to be rejected and compensating for the drop in level for the desired direction, we can effectively
reject these directional sources without attenuating the desired source. We present an adaptive null-steering scheme for adjusting
the beampattern so as to obtain this suppression of the two directional interferers automatically. Closed-form expressions for this
optimal null-steering are derived, enabling the computation of the azimuthal angles of the interferers. It is shown that the proposed
technique has a good directivity index when the angular difference between the desired source and each directional interferer is at
least 90 degrees.

1. Introduction

In applications such as hands-free communication and voice control systems, the microphone signal does not only contain the desired sound source (e.g., a speech signal) but can also contain undesired directional interferers and background noise (e.g., diffuse noise). To reduce the amount of noise and minimize the influence of interferers, we can use a microphone array and apply beamforming techniques to steer the main lobe of a beam towards the desired source signal, for example, a speech signal. In this paper, we focus on arrays where the wavelength of the sound is much larger than the size of the array; these arrays are therefore called “small microphone arrays.” When using omnidirectional (monopole) microphones in a small microphone array configuration, additive beamformers like delay-and-sum are not able to obtain sufficient directivity, as the beamwidth deteriorates for larger wavelengths [1, 2]. A common method to obtain improved directivity is to apply superdirective beamforming techniques. In this paper, we will focus on first-order superdirective beamforming. (The term “first-order” is used to indicate that the directivity pattern of the superdirectional response is constructed by means of a linear combination of a pressure and a velocity (first-order spatial derivative of the pressure field) response.)

A first method to obtain this first-order superdirectivity is to use microphone arrays with omnidirectional microphone elements and to apply beamforming techniques with asymmetrical filter coefficients [3]. Basically, this asymmetrical filtering corresponds to a subtraction of signals, as in delay-and-subtract techniques [4, 5], or to taking spatial derivatives of the sound pressure field [6, 7]. As subtraction leads to smaller signals at low frequencies, a first-order integrator needs to be applied to equalize the frequency response, resulting in an increased sensitivity (20 dB/decade) to sensor noise and an increased sensitivity to mismatches in microphone characteristics [8, 9] in the lower frequency range.

A second method to obtain first-order superdirectivity is to use microphone arrays with first-order unidirectional microphone elements. As the separate unidirectional microphone elements already have a first-order superdirective response, consisting of a sum of a pressure and a velocity response, the beamformer can simply be constructed by a linear combination of the unidirectional microphone signals.
by a linear combination of the uni-directional microphone signals. In such an approach, there is no need to apply a first-order integrator (as was the case for omni-directional microphone elements), and we avoid a 20 dB/decade increased sensitivity for sensor-noise [7]. Nevertheless, uni-directional microphones may have a low-frequency roll-off, which can be compensated for by means of proper equalization techniques. Throughout this paper, we will assume that the uni-directional microphones have a flat frequency response.

Figure 1: Circular array geometry with three cardioid microphones.

We focus on the construction of first-order superdirectional beampatterns where the nulls of the beampattern are steered to the directional interferers, while having a unity response in the direction of the desired sound-source. In Section 2, we construct a monopole and two orthogonal dipole responses (known as "eigenbeams" [10, 11]) out of a circular array of three first-order cardioid microphone elements M0, M1, and M2 (with a heart-shaped directional pattern), as shown in Figure 1. Here θ and φ are the standard spherical coordinate angles: elevation and azimuth.

Based on these eigenbeams, we are able to construct arbitrary first-order responses that can be steered with the main-lobe in any azimuthal direction (see Section 2). Although the (rotation symmetric) first-order response can only exhibit a single null, we will look at a slice through the beampattern lying in the azimuthal plane. In this way, we can define maximally two nulls in the azimuthal plane which are symmetric with respect to the main-lobe axis. By placing these two nulls on the two directional sources to be rejected and compensating for the drop in level for the desired direction, we can effectively reject the directional sources without attenuating the desired source. In Section 3, expressions are derived for this beampattern synthesis.

To develop an adaptive null-steering algorithm, we first show in Section 4 how the superdirective beampattern can be synthesized via the Generalized Sidelobe Canceller (GSC) [12]. This GSC enables us to optimize a cost-function in an unconstrained manner with a gradient descent search-method that is described in Section 5. Furthermore, the GSC enables tracking of the angles of the separate directional interferers, which is validated by means of simulations and experiments in Section 6. Finally, in Section 7, conclusions are given.

2. Construction of Eigenbeams

We know from [7, 9] that by using a circular array of at least three (omni- or uni-directional microphone) sensors in a planar geometry and applying signal processing techniques, it is possible to construct a first-order superdirectional response. This superdirectional response can be steered with its main-lobe to any desired azimuthal angle and can be adjusted to have any first-order directivity pattern. As mentioned in the introduction, we will use three uni-directional cardioid microphones (with a heart-shaped directional pattern) in a circular configuration, where the main-lobes of the three cardioid responses are pointed outwards, as shown in Figure 1.

The responses of the three cardioid microphones M0, M1, and M2 are given by, respectively, Ec0(r, θ, φ), Ec1(r, θ, φ), and Ec2(r, θ, φ), having their main-lobes at, respectively, φ = 0, 2π/3, and 4π/3 radians. Assuming that we have no sensor-noise, the nth cardioid microphone response, with n = 0, 1, 2, for a harmonic plane-wave with frequency f is ideally given by [11]

    Ecn(r, θ, φ) = An e^{jψn}.   (1)

The magnitude-response An and phase-response ψn of the nth cardioid microphone are given by, respectively,

    An = 1/2 + (1/2) cos(φ − 2nπ/3) sin θ,   (2)

    ψn = (2πf/c) sin θ (xn cos φ + yn sin φ).   (3)

Here c is the speed of sound and xn and yn are the x and y coordinates of the nth microphone (as shown in Figure 1), given by

    xn = r cos(2nπ/3),   yn = r sin(2nπ/3),   (4)
with r being the radius of the circle on which the microphones are located.

We can simplify (3) as

    ψn = (2πf/c) r sin θ cos(φ − 2nπ/3).   (5)

From the three cardioid microphone responses, we can construct the circular harmonics [7], also known as "eigenbeams" [10, 11], by using the 3-point Discrete Fourier Transform (DFT) with the three microphones as inputs. This DFT produces three phase-modes Pi(r, θ, φ) [7] with i = 0, 1, 2:

    P0(r, θ, φ) = (1/3) Σ_{n=0}^{2} Ecn(r, θ, φ),
    P1(r, θ, φ) = P2*(r, θ, φ) = (1/3) Σ_{n=0}^{2} Ecn(r, θ, φ) e^{−j2πn/3},   (6)

with j = √(−1) and * being the complex-conjugate operator.

Via the phase-modes, we can construct the monopole as

    Em(r, θ, φ) = 2 P0(r, θ, φ),   (7)

and the orthogonal dipoles as

    Ed0(r, θ, φ) = 2 [P1(r, θ, φ) + P2(r, θ, φ)],
    Ed^{π/2}(r, θ, φ) = 2j [P1(r, θ, φ) − P2(r, θ, φ)].   (8)

In matrix notation,

    [Em; Ed0; Ed^{π/2}] = (2/3) [1 1 1; 2 −1 −1; 0 √3 −√3] [Ec0; Ec1; Ec2].   (9)

Figure 2: Eigenbeams (monopole and two orthogonal dipoles): (a) Em(θ, φ); (b) Ed0(θ, φ); (c) Ed^{π/2}(θ, φ).

For frequencies with wavelengths larger than the size of the array (for wavelengths smaller than the size of the array, spatial aliasing effects will occur), that is, r ≪ c/f, the phase-component ψn given by (5) can be neglected, and the responses of the eigenbeams for these frequencies are equal to

    Em = 1,
    Ed0(θ, φ) = cos φ sin θ,   (10)
    Ed^{π/2}(θ, φ) = cos(φ − π/2) sin θ.

The directivity patterns of these eigenbeams are shown in Figure 2. The zeroth-order eigenbeam Em represents the monopole response, while the first-order eigenbeams Ed0(θ, φ) and Ed^{π/2}(θ, φ) represent the orthogonal dipole responses.

The dipole can be steered to any angle ϕs by means of a weighted combination of the orthogonal dipole pair:

    Ed^{ϕs}(θ, φ) = cos ϕs Ed0(θ, φ) + sin ϕs Ed^{π/2}(θ, φ),   (11)

with 0 ≤ ϕs ≤ 2π being the steering angle.

Finally, the steered and scaled superdirectional microphone response can be constructed via

    E(θ, φ) = S [α Em + (1 − α) Ed^{ϕs}(θ, φ)]
            = S [α + (1 − α) cos(φ − ϕs) sin θ],   (12)

with α ≤ 1 being the parameter for controlling the directional pattern of the first-order response and S being an arbitrary scaling factor. Both parameters α and S may also have negative values.
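To make the construction above concrete, the following minimal Python/NumPy sketch (our own illustration, not part of the original paper; function and variable names are ours) forms the eigenbeams from three time-aligned cardioid signals according to (9) and steers a first-order response according to (11)-(12), under the small-array assumption r ≪ c/f so that the phase term (5) can be neglected.

```python
import numpy as np

def eigenbeams(c0, c1, c2):
    """Monopole and orthogonal dipole signals from three cardioid
    signals via the 3-point DFT combination of (9)."""
    Em = (2.0 / 3.0) * (c0 + c1 + c2)
    Ed0 = (2.0 / 3.0) * (2.0 * c0 - c1 - c2)
    Ed90 = (2.0 / 3.0) * np.sqrt(3.0) * (c1 - c2)
    return Em, Ed0, Ed90

def steer_first_order(Em, Ed0, Ed90, phi_s, alpha, S=1.0):
    """Steered and scaled first-order response, cf. (11)-(12)."""
    Ed_s = np.cos(phi_s) * Ed0 + np.sin(phi_s) * Ed90  # steered dipole
    return S * (alpha * Em + (1.0 - alpha) * Ed_s)
```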
Alternatively, we can write the construction of the response in matrix-vector notation:

    E(θ, φ) = S Fα^T Rϕs X,   (13)

with the pattern-synthesis vector

    Fα = [α, (1 − α), 0]^T,   (14)

the rotation-matrix Rϕs:

    Rϕs = [1 0 0; 0 cos ϕs sin ϕs; 0 −sin ϕs cos ϕs],   (15)

and the input-vector:

    X = [Em, Ed0(θ, φ), Ed^{π/2}(θ, φ)]^T = [1, cos φ sin θ, sin φ sin θ]^T.   (16)

In the remainder of this paper, we will assume that we have unity response of the superdirectional microphone for a desired source coming from an arbitrary azimuthal angle φ = ϕs and for an elevation angle θ = π/2, and we want to suppress two interferers by steering two nulls towards two azimuthal angles φ = ϕn1 and φ = ϕn2, also for an elevation angle θ = π/2. Hence, we assume θ = π/2 in the remainder of this paper.

3. Optimal Null-Steering for Two Directional Interferers via Direct Pattern Synthesis

3.1. Pattern Synthesis. The first-order response of (12), with the main-lobe of the response steered to ϕs, has two nulls for α ≤ 1/2, given by (see [13])

    ϕn1, ϕn2 = ϕs ± arccos(−α / (1 − α)).   (17)

If we want to steer two nulls to arbitrary angles ϕn1 and ϕn2, not lying symmetrically with respect to ϕs, it can be seen that we cannot steer the main-lobe of the first-order response to ϕs. Therefore, we steer the main-lobe to ϕ̆s and use a scale-factor S̆ under the constraint that a unity response is obtained at angle ϕs. In matrix notation,

    E(θ, φ) = S̆ Fᾰ^T Rϕ̆s X,   (18)

with the rotation-matrix and the pattern-synthesis vector being as in (15) and (14), respectively, with ᾰ, ϕ̆s instead of α, ϕs.

From (12), we see that a unity desired response at angle ϕs is obtained when we choose the scale-factor S̆ as

    S̆ = 1 / (ᾰ + (1 − ᾰ) cos(ϕs − ϕ̆s)),   (19)

with ᾰ being the parameter for controlling the directional pattern of the first-order response (similar to the parameter α), ϕs the angle for the desired sound, and ϕ̆s the angle for the steering (which, in general, is different from ϕs).

Next, we want to place the nulls at ϕn1 and ϕn2. Hence, we solve the following system of two equations:

    S̆ [ᾰ + (1 − ᾰ) cos(ϕn1 − ϕ̆s)] = 0,
    S̆ [ᾰ + (1 − ᾰ) cos(ϕn2 − ϕ̆s)] = 0.   (20)

Solving the two unknowns ᾰ and ϕ̆s gives

    ϕ̆s = 2 arctan X,   (21)
    ᾰ = sin(Δϕn) X / (cos ϕn1 − cos ϕn2 + X (sin ϕn1 − sin ϕn2) + sin Δϕn),   (22)

with

    X = (sin ϕn1 − sin ϕn2 ± √(2 − 2 cos Δϕn)) / (cos ϕn1 − cos ϕn2),   (23)
    Δϕn = ϕn1 − ϕn2.   (24)

It is noted that (23) can have two solutions, leading to different solutions for ϕ̆s, ᾰ, and S̆. However, the resulting beampatterns are identical.

As can be seen, we get a vanishing denominator in (22) for ϕn1 = ϕs and/or ϕn2 = ϕs. Similarly, this is the case when Δϕn = ϕn1 − ϕn2 goes to zero. For this latter case, we can compute the limit of ϕ̆s and ᾰ:

    lim_{Δϕn→0} ϕ̆s = 2 arctan(sin ϕni / (cos ϕni − 1)) = ϕni + π,   (25)

with i = 1, 2, and

    lim_{Δϕn→0} ᾰ = 1/2.   (26)

For the case Δϕn = 0, we actually steer a single null towards the two directional interferers ϕn1 and ϕn2. Equations (25) and (26) describe the limit-case solution, for which there are an infinite number of solutions that satisfy the system of equations given by (20).
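The synthesis above can be checked numerically with a few lines of Python. The sketch below is our own hedged illustration (NumPy assumed; names are ours): rather than evaluating the closed form (21)-(23) directly, it uses the equivalent observation that the two nulls of a first-order pattern lie symmetrically about the steering angle, so ϕ̆s is the mid-angle of the two nulls up to a π offset; both solution branches are returned and, as noted above, yield identical beampatterns. Degenerate inputs (a null coinciding with the steering angle, or ϕn1 = ϕn2) are not guarded.

```python
import numpy as np

def first_order(phi, phi_sb, alpha, S):
    """First-order response (12) in the azimuthal plane (theta = pi/2)."""
    return S * (alpha + (1.0 - alpha) * np.cos(phi - phi_sb))

def synthesize(phi_s, phi_n1, phi_n2):
    """Solve (20): nulls at phi_n1 and phi_n2 with a unity response at
    phi_s (all angles in radians). Returns both solution branches
    (breve-phi_s, breve-alpha, breve-S)."""
    solutions = []
    mid = 0.5 * (phi_n1 + phi_n2)
    for phi_sb in (mid, mid + np.pi):
        c = np.cos(phi_n1 - phi_sb)
        alpha = c / (c - 1.0)  # from the null condition cos(.) = -a/(1-a)
        S = 1.0 / (alpha + (1.0 - alpha) * np.cos(phi_s - phi_sb))  # (19)
        solutions.append((phi_sb % (2.0 * np.pi), alpha, S))
    return solutions

# example: nulls at 0 and 225 degrees, unity response at 90 degrees
for phi_sb, a, S in synthesize(np.radians(90), 0.0, np.radians(225)):
    print(np.degrees(phi_sb), round(a, 3), round(S, 3))  # e.g. 112.5 0.277 1.058
```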
3.2. Analysis of Directivity Index. Although the optimization in this paper is focused on the suppression of two directional interferers, it is also important to analyze the noise-reduction performance for isotropic noise circumstances. We will only analyze the spherical isotropic noise case, for which we compute the spherical directivity factor QS given by [4, 5]

    QS = 4π E²(π/2, ϕs) / ∫_{φ=0}^{2π} ∫_{θ=0}^{π} E²(θ, φ) sin θ dθ dφ.   (27)

If we combine (27) with (18), we get

    QS(ϕ1, ϕ2) = 6 (1 − cos ϕ1)(1 − cos ϕ2) / (5 + 3 cos(ϕ1 − ϕ2)),   (28)

with

    ϕ1 = ϕn1 − ϕs,   (29)
    ϕ2 = ϕn2 − ϕs.   (30)

In Figure 3, the contour-plot of the directivity factor QS is shown with ϕ1 and ϕ2 on the x- and y-axes, respectively.

Figure 3: Contour-plot of the directivity factor QS(ϕ1, ϕ2).

As can be seen in (28), the directivity factor goes to zero if one of the angles ϕn1 or ϕn2 gets close to ϕs. Clearly, a directivity factor which is smaller than unity is not very useful in practice. Hence, the pattern synthesis technique is only useful when the angles ϕn1 and ϕn2 are located in one half-plane and the desired source is located around the center of the opposite half-plane.

It can be found in the appendix that for

    ϕ1 = arccos(−1/3),
    ϕ2 = 2π − arccos(−1/3),   (31)

a maximum directivity factor QS = 4 is obtained. This corresponds with 6 dB directivity index, defined as 10 log10 QS, where the directivity pattern resembles a hypercardioid. Furthermore, for (ϕ1, ϕ2) = (π, π) rad., a directivity factor QS = 3 is obtained, corresponding with 4.8 dB directivity index, where the directivity pattern yields a cardioid. As can be seen from Figure 3, we can define a usable region, where the directivity-factor is QS > 3/4, for π/2 ≤ ϕ1, ϕ2 ≤ 3π/2.

4. Optimal Null-Steering for Two Directional Interferers via GSC

4.1. Generalized Sidelobe Canceller (GSC) Structure. To develop an adaptive algorithm for steering two nulls towards the two directional interferers based on the pattern-synthesis technique in Section 3, it would be required to use a constrained optimization technique where we want to maintain a unity response towards the angle ϕs. For adaptive algorithms, it is generally easier to adapt in an unconstrained manner. Therefore, we first present an alternative scheme for the null-steering, similar to the direct pattern-synthesis technique as discussed in Section 3, but based on the well-known Generalized Sidelobe Canceller (GSC) [12]. In the GSC scheme, first a prefiltering with a fixed value of ϕs and α is performed, to construct a primary signal with a unity response to angle ϕs and two noise references. As the two noise references do not include the source coming from angle ϕs, two noise-canceller weights w1 and w2 can be optimized in an unconstrained manner. The GSC scheme is shown in Figure 4.

Figure 4: Generalized Sidelobe Canceller scheme.

We start by constructing the primary-response as

    Ep(θ, φ) = Fα^T Rϕs X,   (32)

with Fα, Rϕs, and X being as defined in the introduction and using a scale-factor S = 1.

Furthermore, we can create two noise-references via

    [Er1(θ, φ), Er2(θ, φ)]^T = B^T Rϕs X,   (33)

with a blocking-matrix B [14] given by

    B = [1/2 0; −1/2 0; 0 1].   (34)

It is noted that the noise-references Er1 and Er2 are, respectively, a cardioid and a dipole response, with a null steered towards the angle of the desired source at azimuth φ = ϕs and elevation θ = π/2.

The primary- and the noise-responses can be used in the generalized sidelobe canceller structure, to obtain an output as

    E(θ, φ) = Ep(θ, φ) − w1 Er1(θ, φ) − w2 Er2(θ, φ).   (35)

It is important to note that for any value of ϕs, α, w1, and w2, a unity-response at the output of the GSC is maintained for angle φ = ϕs and θ = π/2.

In the next sections we give some details on computing w1 and w2 for the suppression of two directional interferers, as discussed in the previous section.

4.2. Optimal GSC Null-Steering for Two Directional Interferers. Using the GSC structure of Figure 4 having a unity response at angle φ = ϕs, we can compute the weights w1
and w2 to steer two nulls towards azimuthal angles ϕn1 and ϕn2, by solving

    Ep(π/2, ϕni) − w1 Er1(π/2, ϕni) − w2 Er2(π/2, ϕni) = 0   (36)

for i = 1, 2.

This results in the following relations:

    w1 = 2α + 2 sin(ϕ1 − ϕ2) / (sin ϕ1 − sin(ϕ1 − ϕ2) − sin ϕ2),   (37)
    w2 = (cos ϕ1 − cos ϕ2) / (sin ϕ1 − sin(ϕ1 − ϕ2) − sin ϕ2),   (38)

where ϕ1 and ϕ2 are defined as given by (29) and (30), respectively.

To eliminate the dependency on α in (37), we will use

    w̆1 = w1 − 2α.   (39)

The denominators in (37) and (38) vanish when ϕn1 = ϕs and/or ϕn2 = ϕs. Also when Δϕn = ϕn1 − ϕn2 goes to zero, the denominator vanishes. In this case, we can compute the limit of w̆1 and w2:

    lim_{Δϕn→0} w̆1 = −2,   (40)
    lim_{Δϕn→0} w2 = sin ϕi,   (41)

with i = 1, 2.

For the case Δϕn = 0, we actually steer a single null towards the two directional interferers ϕn1 and ϕn2. Equations (40) and (41) describe the limit-case solution for which there are an infinite number of solutions (w1, w2) that satisfy (36).

From the values of w̆1 and w2, we can derive the two angles of the directional interferers ϑ1 and ϑ2, where (ϑ1, ϑ2) = (ϕ1, ϕ2) or (ϑ1, ϑ2) = (ϕ2, ϕ1). The two angles are obtained via a computation involving the arctan-function with additional sign checking to resolve all four quadrants in the azimuthal plane and can be computed as

    ϑ1, ϑ2 = arctan(N/D)        for D ≥ 0,
             arctan(N/D) + π    for D < 0, N ≥ 0,   (42)
             arctan(N/D) − π    for D < 0, N < 0,

with

    N = −2 (w̆1 w2 ∓ X1) / X2,
    D = (w̆1³ + 4w̆1² + 4w̆1 ± 4w2 X1) / (X2 (w̆1 + 2)),   (43)

where the upper and lower signs correspond to ϑ1 and ϑ2, respectively, and with

    X1 = √((w̆1 + 2)² (1 + w̆1 + w2²)),
    X2 = 4 + 4w̆1 + w̆1² + 4w2².   (44)

Note that with this computation, it is not necessarily true that ϑ1 = ϕ1 and ϑ2 = ϕ2; that is, we can have a permutation ambiguity. Furthermore, we compute the resolved angles of the directional interferers as

    ϑni = ϑi + ϕs,   (45)

where (ϑn1, ϑn2) = (ϕn1, ϕn2) or (ϑn1, ϑn2) = (ϕn2, ϕn1).

Figure 5: Contour-plot of the directivity factor QS(w̆1, w2).

4.3. Analysis of Directivity Index. Just as for the direct pattern synthesis in the previous section, we can analyze the directivity factor for spherical isotropic noise. We can insert the values of w̆1 and w2 into (27) and (35) and get

    QS(w̆1, w2) = 3 / (w̆1 + w̆1² + w2² + 1).   (46)

In Figure 5, we show the contour-plot of the directivity factor with w̆1 and w2 on the x- and y-axes, respectively. From Figure 5 and (46), it can be seen that the contours are concentric circles with the center at coordinate (w̆1, w2) = (−1/2, 0), where the maximum directivity factor of 4 is obtained.
5. Adaptive Algorithm

5.1. Cost-Function for Directional Interferers. Next, we develop an adaptation scheme to adapt two weights in the GSC structure as discussed in the previous Section 4. We aim at obtaining the solution where a unity response is obtained at angle ϕs and two nulls are placed at angles ϕn1 and ϕn2.

We start with

    y[k] = p[k] − (w̆1[k] + 2α) r1[k] − w̆2[k] r2[k],   (47)

with k being the discrete-time index, y[k] the output signal, w̆1[k] and w̆2[k] the adaptive weights (note that w̆2 ≡ w2, as only w1 contains the fixed term 2α; cf. (39)), r1[k] and r2[k] the noise reference signals, and p[k] the primary signal. The inclusion of the term 2α in (47) is a consequence of the fact that w̆1[k] is an estimate of w̆1 (see (39), in which 2α is not included).

In the ideal case that we want to obtain a unity response for a source-signal s[k] originating from angle ϕs and have an undesired source-signal n1[k] originating from angle ϕn1 together with an undesired source-signal n2[k] originating from angle ϕn2, we have

    p[k] = s[k] + Σ_{i=1,2} (α + (1 − α) cos ϕi) ni[k],
    r1[k] = Σ_{i=1,2} (1/2 − (1/2) cos ϕi) ni[k],   (48)
    r2[k] = Σ_{i=1,2} sin ϕi ni[k].

The cost-function J(w̆1, w̆2) is defined as a function of w̆1 and w̆2 and is given by

    J(w̆1, w̆2) = E{y²[k]},   (49)

with E{·} being the expectation operator.

Using that E{n1[k] n2[k]} = 0 and E{ni[k] s[k]} = 0 for i = 1, 2, we can write

    J(w̆1, w̆2) = E{(p[k] − (w̆1[k] + 2α) r1[k] − w̆2[k] r2[k])²},

which, after substituting (48) and collecting the terms per interferer, reduces to

    J(w̆1, w̆2) = σs²[k] + Σ_{i=1,2} (w̆1[k] − (2 + w̆1[k]) cos ϕi + 2 w̆2[k] sin ϕi)² σni²[k]/4,   (50)

with

    σs²[k] = E{s²[k]},   σni²[k] = E{ni²[k]}.   (51)

We can see that the cost-function is a quadratic function [15] that can be written in matrix-notation (for convenience, we leave out the index k):

    J(w̆1, w̆2) = σs² + ‖Ap w − vp‖²
               = σs² + w^T Ap^T Ap w − 2 w^T Ap^T vp + vp^T vp,   (52)

with

    Ap = [(σn1/2)(1 − cos ϕ1)  σn1 sin ϕ1; (σn2/2)(1 − cos ϕ2)  σn2 sin ϕ2],
    w = [w̆1; w̆2],
    vp = [σn1 cos ϕ1; σn2 cos ϕ2].   (53)

The singularity of Ap^T Ap can be analyzed by computing the determinant of Ap and setting this determinant to zero:

    (σn1 σn2 / 2) [sin ϕ2 (1 − cos ϕ1) − sin ϕ1 (1 − cos ϕ2)] = 0.   (54)

Equation (54) is satisfied when σn1 and/or σn2 are equal to zero, ϕ1 and/or ϕ2 are equal to zero, or when

    sin ϕ1 / (1 − cos ϕ1) = sin ϕ2 / (1 − cos ϕ2) ≡ cot(ϕ1/2) = cot(ϕ2/2).   (55)

Equation (55) is satisfied only when ϕ1 = ϕ2. This agrees with the result that was obtained in Section 3.1, where Δϕ = 0.

In all other cases (so when ϕ1 ≠ ϕ2, σn1 > 0, and σn2 > 0), the matrix Ap is nonsingular and the matrix Ap^T Ap is positive definite. Hence, the cost-function is a convex function with a global minimum that can be found by solving the least-squares problem:

    wopt = (Ap^T Ap)^{−1} Ap^T vp = Ap^{−1} vp = (1/A) [2 sin(ϕ1 − ϕ2); cos ϕ1 − cos ϕ2],   (56)

with

    A = sin ϕ1 − sin(ϕ1 − ϕ2) − sin ϕ2,   (57)

similar to the solutions as given in (37) and (38).
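The least-squares solution (56) is straightforward to verify numerically. The following snippet (our own illustration, not from the paper) builds Ap and vp from (53) and solves for wopt; for ϕ1 = −π/2, ϕ2 = 3π/4 it reproduces the weights −2/(2+√2) and −1/(2+√2) found earlier.

```python
import numpy as np

def wopt_direct(phi1, phi2, s_n1=1.0, s_n2=1.0):
    """Minimizer (56) of the quadratic cost (52)-(53); s_n1, s_n2 are
    the interferer standard deviations sigma_n1 and sigma_n2."""
    Ap = np.array([[s_n1 / 2 * (1 - np.cos(phi1)), s_n1 * np.sin(phi1)],
                   [s_n2 / 2 * (1 - np.cos(phi2)), s_n2 * np.sin(phi2)]])
    vp = np.array([s_n1 * np.cos(phi1), s_n2 * np.cos(phi2)])
    # lstsq mirrors (A^T A)^{-1} A^T v and also covers the 2x2 case
    return np.linalg.lstsq(Ap, vp, rcond=None)[0]

print(wopt_direct(np.radians(-90), np.radians(135)))  # about [-0.586, -0.293]
```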
As an example, we show the contour-plot of the cost-function 10 log10 J(w̆1, w̆2) in Figure 6, for the case where ϕs = π/2, ϕn1 = 0, ϕn2 = π rad., σni² = 1 for i = 1, 2, and σs² = 0.

Figure 6: Contour-plot of the cost-function 10 log10 J(w̆1, w̆2) for the case where ϕs = π/2, ϕn1 = 0, and ϕn2 = π radians.

Figure 7: Contour-plot of the cost-function 10 log10 J(w̆1, w̆2) for the case where ϕs = π/2 and ϕn1 = ϕn2 = 0 radians.

As can be seen, the global minimum is obtained for w̆1 = 0 and w̆2 = 0, resulting in a dipole beampattern. When we change σn1² ≠ σn2², the shape of the cost-function will be more and more stretched, but the global optimum will be obtained for the same values of w̆1 and w̆2. In the extreme case when σn2² = 0 and σn1² > 0, we obtain the cost-function as shown in Figure 7. (It is interesting to note that this cost-function is exactly the same as for the case where ϕs = π/2, ϕn1 = ϕn2 = 0 radians with σni² = 1 for i = 1, 2 and σs² = 0.) Although w̆1 = 0 and w̆2 = 0 is still an optimal solution, it can be seen that there is no strict global minimum. For example, also w̆1 = −2 and w̆2 = −1 is an optimal solution (yielding a cardioid beampattern).

For the situation where there is only a single interferer, or the situation where there are two interferers coming from (nearly) the same angle, the resulting beampattern will have a null at this angle, while the other (second) null will be placed randomly (i.e., the second null is not uniquely defined and the adaptation of this second null is poor). However, in situations where we have additive diffuse-noise present, we obtain an extra degree of freedom, for example, optimization of the directivity index. This is however outside the scope of this paper.

5.2. Cost-Function for Isotropic Noise. It is also useful to analyze the cost-function in the presence of isotropic (i.e., diffuse) noise. We know from [16] that spherical and cylindrical isotropic noise can be modelled by adding uncorrelated additive white-noise signals d1, d2, and d3 to the three eigenbeams Em, Ed0, and Ed^{π/2} with variances σd², σd²γ, and σd²γ, respectively, or alternatively with a covariance matrix Kd given by

    Kd = σd² [1 0 0; 0 γ 0; 0 0 γ].   (58)

(For diffuse-noise situations, the individual microphone signals are correlated. However, due to the construction of eigenbeams, the diffuse noise will be decorrelated. Hence, it is allowed to add uncorrelated additive white-noise signals to these eigenbeams to simulate diffuse-noise situations.) We choose γ = 1/3 for spherically isotropic noise and γ = 1/2 for cylindrically isotropic noise.
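As a small illustration of this noise model (our own sketch, with hypothetical parameter values), the three eigenbeam noise signals of (58) can be generated as follows:

```python
import numpy as np

def eigenbeam_diffuse_noise(num_samples, sigma_d=1.0, gamma=1/3, rng=None):
    """Uncorrelated white-noise signals d1, d2, d3 for the eigenbeams
    Em, Ed0, Ed90 with variances sigma_d^2, gamma*sigma_d^2 and
    gamma*sigma_d^2, modelling spherically (gamma = 1/3) or cylindrically
    (gamma = 1/2) isotropic noise, cf. (58)."""
    rng = rng if rng is not None else np.random.default_rng()
    d1 = sigma_d * rng.standard_normal(num_samples)
    d2 = sigma_d * np.sqrt(gamma) * rng.standard_normal(num_samples)
    d3 = sigma_d * np.sqrt(gamma) * rng.standard_normal(num_samples)
    return d1, d2, d3
```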
Assuming that there are no directional interferers, we obtain the following primary signal p[k] and noise-references r1[k] and r2[k] in the generalized sidelobe canceller scheme:

    p[k] = s[k] + α d1[k] + (1 − α) √γ d2[k],
    r1[k] = (1/2) d1[k] − (1/2) √γ d2[k],   (59)
    r2[k] = √γ d3[k].

As di[k] with i = 1, 2, 3 and s[k] are mutually uncorrelated, we can write the cost-function as

    J(w̆1, w̆2) = σs²[k] + σd² [((1/2) w̆1)² + γ (1 + (1/2) w̆1)² + γ w̆2²].   (60)

Just as for the cost-function with two directional interferers, we can write the cost-function for isotropic noise also as a quadratic function in matrix notation:

    Jd(w̆1, w̆2) = σs² + ‖Ad w − vd‖² + σd² γ / (1 + γ),   (61)

with

    Ad = [(σd/2)√(1 + γ)  0; 0  σd√γ],
    vd = [−σd γ / √(1 + γ); 0].   (62)

It can be easily seen that Ad is positive definite and hence we have a convex cost-function with a global minimum. Via (56) we can easily compute this minimum of the cost-function, which is obtained by solving the least-squares problem:

    wopt = (Ad^T Ad)^{−1} Ad^T vd = Ad^{−1} vd = [−2γ / (1 + γ); 0].   (63)

5.3. Cost-Function for Directional Interferers and Isotropic Noise. In case we have directional interferers as well as isotropic noise and assume that all these noise-components are mutually uncorrelated, we can construct the cost-function based on addition of the two cost-functions:

    Jp,d(w̆1, w̆2) = Jp(w̆1, w̆2) + Jd(w̆1, w̆2)
                 = σs² + ‖Ap w − vp‖² + ‖Ad w − vd‖² + σd² γ / (1 + γ)
                 = σs² + ‖Ap,d w − vp,d‖² + σd² γ / (1 + γ),   (64)

with

    Ap,d = [Ap; Ad],   vp,d = [vp; vd].   (65)

Since Jp(w̆1, w̆2) and Jd(w̆1, w̆2) were found to be convex, the sum Jp,d(w̆1, w̆2) is also convex. The optimal weights wopt can be obtained by computing

    wopt = (Ap,d^T Ap,d)^{−1} Ap,d^T vp,d,   (66)

which can be solved numerically via standard SVD techniques [15].

5.4. Gradient Search Algorithm. As we know that the cost-function is a convex function with a global minimum, we can find this optimal solution by means of a steepest descent update equation for w̆i with i = 1, 2, by stepping in the direction opposite to the gradient of the surface J(w̆1, w̆2) with respect to w̆i, similar to [5]:

    w̆i[k+1] = w̆i[k] − μ ∇w̆i J(w̆1, w̆2),   (67)

with a gradient given by

    ∇w̆i J(w̆1, w̆2) = ∂J(w̆1[k], w̆2[k]) / ∂w̆i[k] = ∂E{y²[k]} / ∂w̆i[k],   (68)

and where μ is the update step-size. As, in practice, the ensemble average E{y²[k]} is not available, we have to use an instantaneous estimate of the gradient ∇̂w̆i J(w̆1, w̆2), which is computed as

    ∇̂w̆i J(w̆1, w̆2) = d y²[k] / d w̆i
                   = −2 (p[k] − (w̆1 + 2α) r1[k] − w̆2 r2[k]) ri[k]
                   = −2 y[k] ri[k].   (69)

Hence, we can write the update equation as

    w̆i[k+1] = w̆i[k] + 2μ y[k] ri[k].   (70)

Just as proposed in [5], we can apply a power-normalization such that the convergence speed is independent of the power:

    w̆i[k+1] = w̆i[k] + 2μ y[k] ri[k] / (Pri[k] + ε),   (71)

with ε being a small value to prevent division by zero and where the power-estimate Pri[k] of the ith reference signal ri[k] can be computed by recursive averaging:

    Pri[k+1] = β Pri[k] + (1 − β) ri²[k],   (72)

with β being a smoothing parameter (lower than, but close to, 1).

The gradient search only needs to be performed in case one or both of the directional interferers are present. In case the desired speech is present during the adaptation, the gradient search will not behave robustly in practice. This nonrobust behaviour is caused by leakage of speech into the noise references r1 and r2 due to either variations of the desired speaker location, microphone mismatches, or reverberation (multipath) effects. To avoid adaptation during desired speech, we will apply a step-size control factor in the adaptation-rule, given by

    Ψ[k] = (Pr1[k] + Pr2[k]) / (Pr1[k] + Pr2[k] + Pp[k] + ε),   (73)

where Pr1[k] + Pr2[k] is an estimate of the noise power and Pp[k] is an estimate of the power of the primary signal p[k], which mainly contains desired speech. The power estimate Pp[k] is, just as for the reference-signal powers Pr1 and Pr2, obtained via recursive averaging:

    Pp[k+1] = β Pp[k] + (1 − β) p²[k].   (74)

We can see that the value of Ψ[k] will be small when the desired speech is dominating, while Ψ[k] will be much larger (but lower than 1) when either the directional interferers or spherically isotropic noise is dominating. As it is beneficial to have a low amount of noise components in the power estimate Pp[k], we found that α = 0.25 is a good choice.
The algorithm now looks as shown in Algorithm 1.

    Initialize w̆1[0] = 0, w̆2[0] = 0, Pr1[0] = r1²[0], Pr2[0] = r2²[0], and Pp[0] = p²[0]
    for k = 0, 1, 2, ... do
        Ψ[k] = (Pr1[k] + Pr2[k]) / (Pr1[k] + Pr2[k] + Pp[k] + ε)
        y[k] = p[k] − (w̆1[k] + 2α) r1[k] − w̆2[k] r2[k]
        for i = 1, 2 do
            w̆i[k+1] = w̆i[k] + Ψ[k] · 2μ y[k] ri[k] / (Pri[k] + ε)
            X1 = (−1)^i √((w̆1[k] + 2)² (1 + w̆1[k] + w̆2[k]²))
            X2 = 4 + 4w̆1[k] + w̆1[k]² + 4w̆2[k]²
            N = −2 (w̆1[k] w̆2[k] + X1) / X2
            D = (w̆1[k]³ + 4w̆1[k]² + 4w̆1[k] − 4w̆2[k] X1) / (X2 (w̆1[k] + 2))
            ϑni = arctan(N/D) + ϕs
            if D < 0 then
                ϑni = ϑni − π sgn(N)
            end if
            Pri[k+1] = β Pri[k] + (1 − β) ri²[k]
            Pp[k+1] = β Pp[k] + (1 − β) p²[k]
        end for
    end for

Algorithm 1: Optimal null-steering for two directional interferers.

As can be seen in the algorithm, the two weights w̆1[k] and w̆2[k] are adapted based on a gradient-search method. Based on these two weights, a computation with the arctan-function is performed to obtain the angles of the directional interferers ϑni with i = 1, 2.

6. Validation

6.1. Directivity Pattern for Directional Interferers. First, we show the beampatterns for a number of situations where two nulls are placed. In Table 1, we show the computed values for the direct pattern synthesis for 4 different situations, where nulls are placed at different angles. Furthermore, we assume that there is no isotropic noise present.

As was explained in Section 3.1, we can obtain two different sets of solutions for ϕ̆s, ᾰ, and S̆. In Table 1, we show the set of solutions where ᾰ is positive.

Table 1: Computed values of ϕ̆s, ᾰ, and S̆ for placing two nulls at ϕn1 and ϕn2 and having a unity response at ϕs.

    ϕn1 (deg) | ϕn2 (deg) | ϕs (deg) | ϕ̆s (deg) | ᾰ     | S̆     | QS
    45        | 180       | 90       | 292.5     | 0.277 | 1.141 | 0.61
    0         | 180       | 90       | 90        | 0     | 1.0   | 3.0
    0         | 225       | 90       | 112.5     | 0.277 | 1.058 | 3.56
    0         | 0         | 90       | 180       | 0.5   | 2     | 0.75

Similarly, in Table 2, we show the computed values for w̆1 and w̆2 in the GSC structure, as explained in Section 4, for the same situations as for the direct pattern synthesis.

Table 2: Computed values of w̆1 and w̆2 for placing two nulls at ϕn1 and ϕn2 and having a unity response at ϕs.

    ϕn1 (deg) | ϕn2 (deg) | ϕs (deg) | w̆1            | w̆2            | QS
    45        | 180       | 90       | √2             | −(1/2)√2       | 0.61
    0         | 180       | 90       | 0              | 0              | 3.0
    0         | 225       | 90       | −2/(2 + √2)    | −1/(2 + √2)    | 3.56
    0         | 0         | 90       | −2             | −1             | 0.75

The polar-plots resulting from the computed values in Tables 1 and 2 are shown in Figure 8. It is noted that the two examples of Section 5.1 where we analyzed the cost-function are depicted in Figures 8(b) and 8(d).
Figure 8: Azimuthal polar-plots for the placement of two nulls, with nulls placed at (a) 45 and 180 degrees, (b) 0 and 180 degrees, (c) 0 and 225 degrees, and (d) 0 and 0 degrees (two identical nulls).

From the plots in Figure 8, it can be seen that if one of the two null-angles is close to the desired source angle (e.g., in Figure 8(a)), the directivity index becomes worse. Because of this poor directivity index, the null-steering method as proposed in this paper will only be useful when either azimuthal angle of the two directional interferers is not very close to the azimuthal angle of the desired source. When we limit the main-beam to be steered maximally 90 degrees away from the desired direction, that is, |ϕ̆s − ϕs| < π/2, we avoid a poor directivity index. For example, in Figure 8(d) such a situation is shown where the main-beam is steered 90 degrees away from the desired direction. In case the two directional interferers change quickly from 0 to 180 degrees, the adaptive algorithm will automatically adapt and remove these two directional interferers at 180 degrees. As only two weights are used in the adaptive algorithm, the convergence to the optimal weights will be very fast.

Figure 9: Simulation of the null-steering algorithm with two directional interferers only, where σn1² = σn2² = 1.

6.2. Gradient Search Algorithm. Next, we validate the tracking behaviour of the gradient update algorithm, as proposed in Section 5.4. We perform a simulation, where we have a desired source at 90 degrees and where we linearly increase the angle of a first undesired directional interferer (ranging
from 135 to 45 degrees) and we linearly decrease the angle of a second undesired directional interferer (ranging from 30 degrees to −90 degrees) in a time-span of 10000 samples. For the simulation, we used α = 0.25, μ = 0.02, and β = 0.95.

Figure 10: Simulation of the null-steering algorithm with two directional interferers where σn1² = σn2² = 1 and with a desired source where σs² = 1/16, with ϕs = 90 degrees (a) and ϕs = 60 degrees (b).

Figure 11: Simulation of the null-steering algorithm with two directional interferers where σn1² = σn2² = 1 and with spherically isotropic noise (γ = 1/3), where σd² = 1/16 (a) and σd² = 1/4 (b).

First, we simulate the situation where only two directional interferers are present. The two directional interferers are uncorrelated white random-noise signals with variance σni² = 1. The results are shown in Figure 9. It can be seen that ϑn1 and ϑn2 do not cross (in contrast to the angles of the directional interferers ϕn1 and ϕn2). The first null, placed at ϑn1, adapts very well, while the second null, placed at ϑn2, is poorly adapted. The reason for this was explained in Section 5.1.

Similarly, we simulate the situation with the same two directional interferers but now together with a desired
source-signal s[k]. The desired source is modelled as a white-noise signal, with a variance σs² = 1/16. The result is shown in Figure 10(a). We see that, due to the adaptation-noise (caused by s[k]), there is more variance in the estimates of the angles ϑn1 and ϑn2. In contrast to the situation with two directional interferers only, we see that there is a region where ϑn1 = ϑn2.

Figure 12: Microphone array with 3 outward facing cardioid microphones.

Figure 13: Practical setup of the microphone array.

Figure 14: Results of the real-life experiment (waveform): (a) cardioid to 0 degrees, that is, M0; (b) proposed adaptive null-steering algorithm.

Figure 15: Results of the real-life experiment (angle estimates).

To show how the adaptation behaviour looks in the presence of variation in the desired source location, we do a similar simulation as above, but now with ϕs set to 60 degrees, while the desired source is coming from 90 degrees. This means that there will be leakage of the desired source signal into the noise reference signals r1[k] and r2[k]. The results are shown in Figure 10(b). Here, it can be seen that the adaptation shows a small offset if one of the directional source angles comes close to the desired source angle. For example, at the end of the simulation, where k = 10000, this can be clearly seen for ϑn1.

Finally, we simulate the situation of the same directional interferers, but now in a spherical isotropic noise situation. As was explained in Section 5.2, isotropic noise can be modelled by adding uncorrelated additive white-noise to the three eigenbeams Em, Ed0, and Ed^{π/2} with variances σd², σd²γ, and σd²γ, respectively. Here γ = 1/3 for spherically isotropic noise and γ = 1/2 for cylindrically isotropic noise. In our simulation, we use γ = 1/3. The results are shown in Figures 11(a) and 11(b) with variances σd² = 1/16 and σd² = 1/4, respectively. When the variance of the diffuse noise gets
larger compared to the directional interferers, the adaptation will be influenced by the diffuse noise that is present. The larger the diffuse noise, the more the final beampattern will resemble the hypercardioid. If the diffuse noise would be dominant over the directional interferers, the estimates ϕn1 and ϕn2 will be equal to 90 − 109 degrees and 90 + 109 degrees, respectively (or −0.33 and −2.81 radians, resp.).

Figure 16: Polar-plot results of the real-life experiment at (a) t = 2.5 seconds, (b) t = 6 seconds, (c) t = 9.5 seconds, (d) t = 13 seconds, and (e) t = 16.5 seconds.

6.3. Real-Life Experiments. To validate the null-steering algorithm in real-life, we used a microphone array with 3 outward facing cardioid electret microphones, as shown in Figure 12. As directional cardioid microphones have openings on both sides, the microphones are placed in rubber holders, enabling sound to enter both sides of the directional microphones.

The type of microphone element used for this array is the Primo EM164 cardioid microphone [17]. These elements are placed uniformly on a circle with a radius of 1 cm. This radius is sufficient for the construction of eigenbeams up to a frequency of 4 kHz.

For the experiment, we placed the array on a table in a moderately reverberant room (conferencing-room) with a T60 of approximately 200 milliseconds. As shown in the setup in Figure 13, all directional sources are placed at a distance of 1 meter from the array (at discrete azimuthal angles: φ = 0, π/2, π, and 3π/2 radians), while diffuse noise was generated via four loudspeakers, placed close to the walls and each facing diffusers hanging on the walls. The level of the diffuse noise is 12 dB lower compared to the directional (interfering) sources. The experiment is done in a time-span of 17.5 seconds, where we switch the directional sources as shown in Table 3.

We use mutually uncorrelated white random-noise sequences for the directional sources N1, N2, and N3, played by loudspeakers, and use speech for the desired sound-source S.

For the algorithm, we use discrete-time signals with a sample-rate of 8 kHz. Furthermore, we used α = 0.25, μ = 0.001, and β = 0.95.

Figure 14(a) shows the waveform obtained from microphone #0 (M0), which is a cardioid pointed with its main-lobe to 0 radians. This waveform is compared with the resulting waveform of the null-steering algorithm, which is shown in Figure 14(b). As the proposed null-steering algorithm is able to steer nulls toward the directional interferers, the direct part of the interferers is removed effectively (this can be seen from the lower noise-level in Figure 14(b) in the time-frame from 0–10.5 seconds). In the segment from 10.5–14 seconds (where there is only a single directional interferer at φ = π radians), it can be seen that the null-steering algorithm is able to reject this interferer just as well as the single cardioid microphone.
Table 3: Switching of sound-sources during the real-life experiment.

    Source | angle φ (rad) | 0–3.5 s | 3.5–7 s | 7–10.5 s | 10.5–14 s | 14–17.5 s
    N1     | π/2           | active  | —       | active   | —         | —
    N2     | π             | active  | active  | —        | active    | —
    N3     | 3π/2          | —       | active  | active   | —         | —
    S      | 0             | active  | active  | active   | active    | active

In Figure 15, the resulting angle-estimates from the null-steering algorithm are shown. Here, it can be seen that the angle-estimation for the first three segments of 3.5 seconds is done accurately. For the fourth segment, there is only a single point interferer. In this segment, only a single angle-estimation is stable, while the other angle-estimation is highly influenced by the diffuse noise. Finally, in the fifth segment, only diffuse noise is present and the final beampattern will optimize the directivity-index, leading to a more hypercardioid beampattern steered with its main-lobe to 0 degrees (as explained in Section 6.2).

Finally, in Figure 16, the resulting polar-patterns from the null-steering algorithm are shown for some discrete time-stamps. Again, it becomes clear that the null-steering algorithm is able to steer the nulls toward the angles where the interferers are coming from.

7. Conclusions

We analyzed the construction of a first-order superdirectional response in order to obtain a unity response for a desired azimuthal angle and to obtain a placement of two nulls at undesired azimuthal angles to suppress two directional interferers. We derived a gradient search algorithm to adapt two weights in a generalized sidelobe canceller scheme. Furthermore, we analyzed the cost-function of this gradient search algorithm, which was found to be convex. Hence, a global minimum is obtained in all cases. From the two weights in the algorithm and using a four-quadrant inverse-tangent operation, it is possible to obtain estimates of the azimuthal angles where the two directional interferers are coming from. Simulations and real-life experiments show a good performance in moderately reverberant situations.

Appendix

Proofs

Maximum Directivity Factor QS. We prove that for

    QS(ϕ1, ϕ2) = 6 (1 − cos ϕ1)(1 − cos ϕ2) / (5 + 3 cos(ϕ1 − ϕ2)),   (A.1)

with ϕ1, ϕ2 ∈ [0, 2π], a maximum QS = 4 is obtained for ϕ1 = arccos(−1/3) and ϕ2 = 2π − arccos(−1/3).

Proof. First, we compute the numerator of the partial derivative ∂QS/∂ϕ1 and set this derivative to zero:

    6 sin ϕ1 (1 − cos ϕ2)(5 + 3 cos(ϕ1 − ϕ2)) + 6 (1 − cos ϕ1)(1 − cos ϕ2) · 3 sin(ϕ1 − ϕ2) = 0.   (A.2)

The common factor 6 (1 − cos ϕ2) can be removed, resulting in

    sin ϕ1 (5 + 3 cos(ϕ1 − ϕ2)) + 3 (1 − cos ϕ1) sin(ϕ1 − ϕ2) = 0.   (A.3)

Similarly, setting the partial derivative ∂QS/∂ϕ2 equal to zero, we get

    sin ϕ2 (5 + 3 cos(ϕ2 − ϕ1)) + 3 (1 − cos ϕ2) sin(ϕ2 − ϕ1) = 0.   (A.4)

Combining (A.3) and (A.4) gives

    sin ϕ1 / (1 − cos ϕ1) = −3 sin(ϕ1 − ϕ2) / (5 + 3 cos(ϕ1 − ϕ2))
                          = 3 sin(ϕ2 − ϕ1) / (5 + 3 cos(ϕ2 − ϕ1)) = −sin ϕ2 / (1 − cos ϕ2),   (A.5)

or alternatively

    2 sin(ϕ1/2) cos(ϕ1/2) / (2 sin²(ϕ1/2)) = cot(ϕ1/2) = −cot(ϕ2/2),   (A.6)

with ϕ1/2, ϕ2/2 ∈ [0, π].

From (A.6), we can see that ϕ1/2 + ϕ2/2 = π (or ϕ1 + ϕ2 = 2π) and can derive

    cos ϕ2 = cos(2π − ϕ1) = cos ϕ1,   (A.7)
    sin ϕ2 = sin(2π − ϕ1) = −sin ϕ1.   (A.8)

Using (A.7) and (A.8) in (A.1) gives

    QS = 6 (1 − cos ϕ1)² / (5 + 3 (2 cos²ϕ1 − 1)) = 6 (1 − cos ϕ1)² / (2 + 6 cos²ϕ1) = 6 (1 − x)² / (2 + 6x²),   (A.9)

with x = cos ϕ1 ∈ [−1, 1].

We can compute the optimal value for x by differentiating (A.9) and setting the result to zero:

    −12 (1 − x)(2 + 6x²) − 6 (1 − x)² · 12x = 0
    ≡ −2 − 6x² − 6x + 6x² = 0.   (A.10)

Solving (A.10) gives x = cos ϕ1 = −1/3 and, consequently, ϕ1 = arccos(−1/3) and ϕ2 = 2π − arccos(−1/3). Via (A.9), we can see that for these values, we have QS = 4.
Acknowledgment

The author would like to thank Dr. A. J. E. M. Janssen for his valuable suggestions.

References

[1] G. W. Elko, F. Pardo, D. Lopez, D. Bishop, and P. Gammel, "Surface-micromachined MEMS microphone," in Proceedings of the 115th AES Convention, pp. 1–8, October 2003.
[2] P. L. Chu, "Superdirective microphone array for a set-top video conferencing system," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), vol. 1, pp. 235–238, Munich, Germany, April 1997.
[3] R. L. Pritchard, "Maximum directivity index of a linear point array," Journal of the Acoustical Society of America, vol. 26, no. 6, pp. 1034–1039, 1954.
[4] H. Cox, "Super-directivity revisited," in Proceedings of the 21st IEEE Instrumentation and Measurement Technology Conference (IMTC '04), vol. 2, pp. 877–880, May 2004.
[5] G. W. Elko and A. T. Nguyen Pong, "A simple first-order differential microphone," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '95), pp. 169–172, New Paltz, NY, USA, October 1995.
[6] G. W. Elko and A. T. Nguyen Pong, "A steerable and variable first-order differential microphone array," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), vol. 1, pp. 223–226, Munich, Germany, April 1997.
[7] M. A. Poletti, "Unified theory of horizontal holographic sound systems," Journal of the Audio Engineering Society, vol. 48, no. 12, pp. 1155–1182, 2000.
[8] H. Cox, R. M. Zeskind, and M. M. Owen, "Robust adaptive beamforming," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 10, pp. 1365–1376, 1987.
[9] R. M. M. Derkx and K. Janse, "Theoretical analysis of a first-order azimuth-steerable superdirective microphone array," IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 1, pp. 150–162, 2009.
[10] Y. Huang and J. Benesty, Audio Signal Processing for Next Generation Multimedia Communication Systems, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1st edition, 2004.
[11] H. Teutsch, Modal Array Signal Processing: Principles and Applications of Acoustic Wavefield Decomposition, Springer, Berlin, Germany, 1st edition, 2007.
[12] L. J. Griffiths and C. W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Transactions on Antennas and Propagation, vol. 30, no. 1, pp. 27–34, 1982.
[13] R. M. M. Derkx, "Optimal azimuthal steering of a first-order superdirectional microphone response," in Proceedings of the 11th International Workshop on Acoustic Echo and Noise Control (IWAENC '08), Seattle, Wash, USA, September 2008.
[14] J.-H. Lee and Y.-H. Lee, "Two-dimensional adaptive array beamforming with multiple beam constraints using a generalized sidelobe canceller," IEEE Transactions on Signal Processing, vol. 53, no. 9, pp. 3517–3529, 2005.
[15] W. Kaplan, Maxima and Minima with Applications: Practical Optimization and Duality, John Wiley & Sons, New York, NY, USA, 1999.
[16] B. H. Maranda, "The statistical accuracy of an arctangent bearing estimator," in Proceedings of the Oceans Conference (OCEANS '03), vol. 4, pp. 2127–2132, San Diego, Calif, USA, September 2003.
[17] R. M. M. Derkx, "Spatial harmonic analysis of unidirectional microphones for use in superdirective beamformers," in Proceedings of the 36th International Conference: Automotive Audio, Dearborn, Mich, USA, June 2009.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 431347, 25 pages
doi:10.1155/2010/431347

Research Article
Musical-Noise Analysis in Methods of Integrating Microphone
Array and Spectral Subtraction Based on Higher-Order Statistics

Yu Takahashi,1 Hiroshi Saruwatari (EURASIP Member),1 Kiyohiro Shikano (EURASIP Member),1 and Kazunobu Kondo2
1 Graduate School of Information Science, Nara Institute of Science and Technology, Nara 630-0192, Japan
2 SP Group, Center for Advanced Sound Technologies, Yamaha Corporation, Shizuoka 438-0192, Japan

Correspondence should be addressed to Yu Takahashi, yuu-t@yuu-t.sakura.ne.jp

Received 5 August 2009; Revised 3 November 2009; Accepted 16 March 2010

Academic Editor: Simon Doclo

Copyright © 2010 Yu Takahashi et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We conduct an objective analysis on musical noise generated by two methods of integrating microphone array signal processing and
spectral subtraction. To obtain better noise reduction, methods of integrating microphone array signal processing and nonlinear
signal processing have been researched. However, nonlinear signal processing often generates musical noise. Since such musical
noise causes discomfort to users, it is desirable that musical noise is mitigated. Moreover, it has been recently reported that higher-
order statistics are strongly related to the amount of musical noise generated. This implies that it is possible to optimize the
integration method from the viewpoint of not only noise reduction performance but also the amount of musical noise generated.
Thus, we analyze the simplest methods of integration, that is, the delay-and-sum beamformer and spectral subtraction, and fully
clarify the features of musical noise generated by each method. As a result, it is clarified that a specific structure of integration
is preferable from the viewpoint of the amount of generated musical noise. The validity of the analysis is shown via a computer
simulation and a subjective evaluation.

1. Introduction

There have recently been various studies on microphone array signal processing [1]; in particular, the delay-and-sum (DS) [2–4] array and the adaptive beamformer [5–7] are the most conventionally used microphone arrays for speech enhancement. Moreover, many methods of integrating microphone array signal processing and nonlinear signal processing such as spectral subtraction (SS) [8] have been studied with the aim of achieving better noise reduction [9–15]. It has been well demonstrated that such integration methods can achieve higher noise reduction performance than that obtained using conventional adaptive microphone arrays [13] such as the Griffith-Jim array [6]. However, a serious problem exists in such methods: artificial distortion (so-called musical noise [16]) due to nonlinear signal processing. Since the artificial distortion causes discomfort to users, it is desirable that musical noise is controlled through signal processing. However, in almost all nonlinear noise reduction methods, the strength parameter to mitigate musical noise in nonlinear signal processing is determined heuristically. Although there have been some studies on reducing musical noise [16] and on nonlinear signal processing with less musical noise [17], evaluations have mainly depended on subjective tests by humans, and no objective evaluations have been performed to the best of our knowledge.

In our recent study, it was reported that the amount of generated musical noise is strongly related to the difference between higher-order statistics (HOS) before and after nonlinear signal processing [18]. This fact makes it possible to analyze the amount of musical noise arising through nonlinear signal processing. Therefore, on the basis of HOS, we can establish a mathematical metric for the amount of musical noise generated in an objective manner. One of the authors has analyzed single-channel nonlinear signal processing based on the objective metric and clarified the features of the amount of musical noise generated [18, 19]. In addition, this objective metric suggests the possibility that
methods of integrating microphone array signal processing and nonlinear signal processing can be optimized from the viewpoint of not only noise reduction performance but also the sound quality according to human hearing. As a first step toward achieving this goal, in this study we analyze the simplest case of the integration of microphone array signal processing and nonlinear signal processing by considering the integration of DS and SS. As a result of the analysis, we clarify the musical-noise generation features of two types of methods on integration of microphone array signal processing and SS.

Figure 1: Block diagram of architecture for spectral subtraction after beamforming (BF+SS).

Figure 2: Block diagram of architecture for channelwise spectral subtraction before beamforming (chSS+BF).

Figure 1 shows a typical architecture used for the integration of microphone array signal processing and SS, where SS is performed after beamforming. Thus, we call this type of architecture BF+SS. Such a structure has been adopted in many integration methods [11, 15]. On the other hand, the integration architecture illustrated in Figure 2 is an alternative architecture used when SS is performed before beamforming. Such a structure is less commonly used, but some integration methods use this structure [12, 14]. In this architecture, channelwise SS is performed before beamforming, and we call this type of architecture chSS+BF.

We have already tried to analyze such methods of integrating DS and SS from the viewpoint of musical-noise generation on the basis of HOS [20]. However, in that analysis, we did not consider the effect of flooring in SS and the noise reduction performance. In this study, on the other hand, we perform an exact analysis considering the effect of flooring in SS and the noise reduction performance. We analyze these two architectures on the basis of HOS and obtain the following results.

(i) The amount of musical noise generated strongly depends on not only the oversubtraction parameter of SS but also the statistical characteristics of the input signal.

(ii) Except for the specific condition that the input signal is Gaussian, the noise reduction performances of the two methods are not equivalent even if we set the same SS parameters.

(iii) Under equivalent noise reduction performance conditions, chSS+BF generates less musical noise than BF+SS for almost all practical cases.

The most important contribution of this paper is that these findings are mathematically proved. In particular, the amount of musical noise generated and the noise reduction performance resulting from the integration of microphone array signal processing and SS are analytically formulated on the basis of HOS. Although there have been many studies on optimization methods based on HOS [21], this is the first time they have been used for musical-noise assessment. The validity of the analysis based on HOS is demonstrated via a computer simulation and a subjective evaluation by humans.

The rest of the paper is organized as follows. In Section 2, the two methods of integrating microphone array signal processing and SS are described in detail. In Section 3, the metric based on HOS used for the amount of musical noise generated is described. Next, the musical-noise analysis of SS, microphone array signal processing, and their integration methods is discussed in Section 4. In Section 5, the noise reduction performances of the two integration methods are discussed, and both methods are compared under equivalent
noise reduction performance conditions in Section 6. Moreover, the result of a computer simulation and experimental results are given in Section 7. Following a discussion of the results of the experiments, we give our conclusions in Section 8.

2. Methods of Integrating Microphone Array Signal Processing and SS

In this section, the formulations of the two methods of integrating microphone array signal processing and SS are described. First, BF+SS, which is a typical method of integration, is formulated. Next, an alternative method of integration, chSS+BF, is introduced.

Figure 3: Configuration of microphone array and signals.

2.1. Sound-Mixing Model. In this study, a uniform linear microphone array is assumed, where the coordinates of the elements are denoted by dj (j = 1, ..., J) (see Figure 3) and J is the number of microphones. We consider one target speech signal and an additive interference signal. Multiple mixed signals are observed at each microphone element, and the short-time analysis of the observed signals is conducted by a frame-by-frame discrete Fourier transform (DFT). The observed signals are given by

    x(f, τ) = h(f) s(f, τ) + n(f, τ),   (1)

where x(f, τ) = [x1(f, τ), ..., xJ(f, τ)]^T is the observed signal vector, h(f) = [h1(f), ..., hJ(f)]^T is the transfer function vector, s(f, τ) is the target speech signal, and n(f, τ) = [n1(f, τ), ..., nJ(f, τ)]^T is the noise signal vector.

2.2. SS after Beamforming. In BF+SS, the single-channel target-speech-enhanced signal is first obtained by beamforming, for example, by DS. Next, single-channel noise estimation is performed by a beamforming technique, for example, the null beamformer [22] or adaptive beamforming [1]. Finally, we extract the resultant target-speech-enhanced signal via SS. The full details of the signal processing are given below.

To enhance the target speech, DS is applied to the observed signal. This can be represented by

    yDS(f, τ) = gDS^T(f, θU) x(f, τ),
    gDS(f, θU) = [g1^{(DS)}(f, θU), ..., gJ^{(DS)}(f, θU)]^T,   (2)
    gj^{(DS)}(f, θU) = J^{−1} · exp(−i 2π (f/M) fs dj sin θU / c),

where gDS(f, θU) is the coefficient vector of the DS array and θU is the specific fixed look direction known in advance. Also, fs is the sampling frequency, M is the DFT size, and c is the sound velocity. Finally, we obtain the target-speech-enhanced spectral amplitude based on SS. This procedure can be expressed as

    |ySS(f, τ)| = √(|yDS(f, τ)|² − β · Eτ[|n̂(f, τ)|²])
                      (where |yDS(f, τ)|² − β · Eτ[|n̂(f, τ)|²] ≥ 0),
                  η · |yDS(f, τ)|   (otherwise),   (3)

where this procedure is a type of extended SS [23]. Here, ySS(f, τ) is the target-speech-enhanced signal, β is the oversubtraction parameter, η is the flooring parameter, and n̂(f, τ) is the estimated noise signal, which can generally be obtained by a beamforming technique such as fixed or adaptive beamforming. Eτ[·] denotes the expectation operator with respect to the time-frame index τ. For example, n̂(f, τ) can be expressed as [13]

    n̂(f, τ) = λ(f) gNBF^T(f) x(f, τ),   (4)

where gNBF(f) is the filter coefficient vector of the null beamformer [22] that steers the null directivity to the speech direction θU, and λ(f) is the gain adjustment term, which is determined in a speech break period. Since the null beamformer can remove the speech signal by steering the null directivity to the speech direction, we can estimate the noise signal. Moreover, a method exists in which independent component analysis (ICA) is utilized as a noise estimator instead of the null beamformer [15].
4 EURASIP Journal on Advances in Signal Processing

by channelwise SS. This can be expressed as 3.2. Relation between Musical-Noise Generation and Kurtosis.
  In our previous works [18–20], we defined musical noise as
 (chSS)  
yj f ,τ  the audible isolated spectral components generated through
⎧   signal processing. Figure 4(b) shows an example of a spectro-
⎪    

⎪  2   2 gram of musical noise in which many isolated components

⎪ x f , τ  − β · E n f , τ 


j τ j
can be observed. We speculate that the amount of musical



⎪  

⎪   
2 noise is strongly related to the number of such isolated

⎨ where x j f , τ  (5)
components and their level of isolation.
=

⎪    Hence, we introduce kurtosis to quantify the isolated


⎪   
2

⎪ nj f , τ  ≥ 0 ,
−β · Eτ  spectral components, and we focus on the changes in kur-



⎪ tosis. Since isolated spectral components are dominant, they

⎪   

⎩η ·   are heard as tonal sounds, which results in our perception
x j f , τ  (otherwise),
of musical noise. Therefore, it is expected that obtaining
where y (chSS)
j ( f , τ) is the target-speech-enhanced signal the number of tonal components will enable us to quantify
obtained by SS at a specific channel j and n j ( f , τ) is the the amount of musical noise. However, such a measurement
estimated noise signal in the jth channel. For instance, is extremely complicated; so instead we introduce a simple
the multichannel noise can be estimated by single-input statistical estimate, that is, kurtosis.
multiple-output ICA (SIMO-ICA) [24] or a combination of This strategy allows us to obtain the characteristics of
ICA and the projection back method [25]. These techniques tonal components. The adopted kurtosis can be used to
can provide the multichannel estimated noise signal, unlike evaluate the width of the probability density function (p.d.f.)
traditional ICA. SIMO-ICA can separate mixed signals not and the weight of its tails; that is, kurtosis can be used to
into monaural source signals but into SIMO-model signals evaluate the percentage of tonal components among the total
at the microphone. Here SIMO denotes the specific trans- components. A larger value indicates a signal with a heavy
mission system in which the input signal is a single source tail in its p.d.f., meaning that it has a large number of tonal
signal and the outputs are its transmitted signals observed components. Also, kurtosis has the advantageous property
at multiple microphones. Thus, the output signals of SIMO- that it can be easily calculated in a concise algebraic form.
ICA maintain the rich spatial qualities of the sound sources
[24] Also the projection back method provides SIMO- 3.3. Kurtosis. Kurtosis is one of the most commonly used
model-separated signals using the inverse of an optimized HOS for the assessment of non-Gaussianity. Kurtosis is
ICA filter [25]. defined as
Finally, we extract the target-speech-enhanced signal by μ4
kurtx = 2 , (7)
applying DS to ychSS ( f , τ) = [y1(chSS) ( f , τ), . . . , yJ(chSS) ( f , τ)]T . μ2
This procedure can be expressed by where x is a random variable, kurtx is the kurtosis of x, and
     
y f ,τ = T
gDS f , θU ychSS f , τ , (6) μn is the nth-order moment of x. Here μn is defined as
 +∞
where y( f , τ) is the final output of chSS+BF. μn = xn P(x)dx, (8)
Such a chSS+BF structure performs DS after (multichan- −∞
nel) SS. Since DS is basically signal processing in which the where P(x) denotes the p.d.f. of x. Note that this μn is
summation of the multichannel signal is taken, it can be not a central moment but a raw moment. Thus, (7) is not
considered that interchannel smoothing is applied to the kurtosis according to the mathematically strict definition,
multichannel spectral-subtracted signal. On the other hand, but a modified version; however, we refer to (7) as kurtosis
the resultant output signal of BF+SS remains as it is after SS. in this paper.
That is to say, it is expected that the output signal of chSS+BF
is more natural (contains less musical noise) than that of 3.4. Kurtosis Ratio. Although we can measure the number of
BF+SS. In the following sections, we reveal that chSS+BF can tonal components by kurtosis, it is worth mentioning that
output a signal with less musical noise than BF+SS in almost kurtosis itself is not sufficient to measure musical noise. This
all cases on the basis of HOS. is because that the kurtosis of some unprocessed signals such
as speech signals is also high, but we do not perceive speech
3. Kurtosis-Based Musical-Noise as musical noise. Since we aim to count only the musical-
Generation Metric noise components, we should not consider genuine tonal
components. To achieve this aim, we focus on the fact that
3.1. Introduction. It has been reported by the authors that the musical noise is generated only in artificial signal processing.
amount of musical noise generated is strongly related to the Hence, we should consider the change in kurtosis during
difference between the kurtosis of a signal before and after signal processing. Consequently, we introduce the following
signal processing. Thus, in this paper, we analyze the amount kurtosis ratio [18] to measure the kurtosis change:
of musical noise generated through BF+SS and chSS+BF on
kurtproc
the basis of the change in the measured kurtosis. Hereinafter, kurtosis ratio = , (9)
we give details of the kurtosis-based musical-noise metric. kurtinput
EURASIP Journal on Advances in Signal Processing 5

Frequency (Hz)

Frequency (Hz)
Time (s) Time (s)
(a) (b)

Figure 4: (a) Observed spectrogram and (b) processed spectrogram.

where kurtproc is the kurtosis of the processed signal and 4.2. Signal Model Used for Analysis. Musical-noise compo-
kurtinput is the kurtosis of the input signal. A larger kurtosis nents generated from the noise-only period are dominant
ratio (1) indicates a marked increase in kurtosis as a result in spectrograms (see Figure 4); hence, we mainly focus our
of processing, implying that a larger amount of musical noise attention on musical-noise components originating from
is generated. On the other hand, a smaller kurtosis ratio input noise signals.
(1) implies that less musical noise is generated. It has been Moreover, to evaluate the resultant kurtosis of SS, we
confirmed that this kurtosis ratio closely matches the amount introduce a gamma distribution to model the noise in the
of musical noise in a subjective evaluation based on human power domain [26–28]. The p.d.f. of the gamma distribution
hearing [18]. for random variable x is defined as
 
1 α−1 x
PGM (x) = ·x exp − , (10)
4. Kurtosis-Based Musical-Noise Analysis for Γ(α)θ α θ
Microphone Array Signal Processing and SS where x ≥ 0, α > 0, and θ > 0. Here, α denotes the shape
parameter, θ is the scale parameter, and Γ(·) is the gamma
4.1. Analysis Flow. In the following sections, we carry out an function. The gamma distribution with α = 1 corresponds
analysis on musical-noise generation in BF+SS and chSS+BF to the chi-square distribution with two degrees of freedom.
based on kurtosis. The analysis is composed of the following Moreover, it is well known that the mean of x for a gamma
three parts. distribution is E[x] = αθ, where E[·] is the expectation
operator. Furthermore, the kurtosis of a gamma distribution,
(i) First, an analysis on musical-noise generation in kurtGM , can be expressed as [18]
BF+SS and chSS+BF based on kurtosis that does (α + 2)(α + 3)
not take noise reduction performance into account kurtGM = . (11)
α(α + 1)
is performed in this section.
Moreover, let us consider the power-domain noise signal,
(ii) The noise reduction performance is analyzed in xp , in the frequency domain, which is defined as
Section 5, and we reveal that the noise reduction
performances of BF+SS and chSS+BF are not equiv- xp = |xre + i · xim |2

alent. Moreover, a flooring parameter designed to = (xre + i · xim )(xre + i · xim ) (12)
align the noise reduction performances of BF+SS and 2 2
= xre + xim ,
chSS+BF is also derived to ensure the fair comparison
of BF+SS and chSS+BF. where xre is the real part of the complex-valued signal and xim
is its imaginary part, which are independent and identically
(iii) The kurtosis-based comparison between BF+SS and distributed (i.i.d.) with each other, and the superscript ∗
chSS+BF under the same noise reduction perfor- expresses complex conjugation. Thus, the power-domain
mance conditions is carried out in Section 6. signal is the sum of two squares of random variables with
the same distribution.
In the analysis in this section, we first clarify how kurtosis Hereinafter, let xre and xim be the signals after DFT
is affected by SS. Next, the same analysis is applied to analysis of signal in a specific microphone j, x j , and we
DS. Finally, we analyze how kurtosis is increased by BF+SS suppose that the statistical properties of x j equal to xre and
and chSS+BF. Note that our analysis contains no limiting xim . Moreover, we assume the following; x j is i.i.d. in each
assumptions on the statistical characteristics of noise; thus, channel, the p.d.f. of x j is symmetrical, and its mean is zero.
all noises including Gaussian and super-Gaussian noise can These assumptions mean that the odd-order cumulants and
be considered. moments are zero except for the first order.
6 EURASIP Journal on Advances in Signal Processing

Before subtraction After subtraction P.d.f. after SS


As a result of subtraction,
(1) p.d.f. is laterally shifted to the P.d.f. after SS
zero-power direction, and Original p.d.f.
(2) negative components with
nonzero probability arise.

0 βαθ 0 βαθ

Flooring
0 βαη2 θ
(3) The region corresponding to (4) Positive components
the negative components is remain as they are.
compressed by a small positive (5) Remaining positive components
flooring parameter η. and floored components are merged.

Figure 5: Deformation of original p.d.f. of power-domain signal via SS.

Although kurtx = 3 if x is a Gaussian signal, note where z is the random variable of the p.d.f. after SS. The
that the kurtosis of a Gaussian signal in the power spectral derivation of PSS (z) is described in Appendix A.
domain is 6. This is because a Gaussian signal in the time From (13), the kurtosis after SS can be expressed as
domain obeys the chi-square distribution with two degrees of  
freedom in the power spectral domain; for such a chi-square F α, β, η
kurtSS = Γ(α) 2  , (14)
distribution, μ4 /μ22 = 6. G α, β, η

4.3. Resultant Kurtosis after SS. In this section, we analyze the where
kurtosis after SS. In traditional SS, the long-term-averaged      
power spectrum of a noise signal is utilized as the estimated G α, β, η = Γ(α)Γ βα, α + 2 − 2βαΓ βα, α + 1
noise power spectrum. Then, the estimated noise power    
spectrum multiplied by the oversubtraction parameter β + β2 α2 Γ βα, α + η4 γ βα, α + 2 ,
is subtracted from the observed power spectrum. When a      
F α, β, η = Γ βα, α + 4 − 4βαΓ βα, α + 3
gamma distribution is used to model the noise signal, its    
mean is αθ. Thus, the amount of subtraction is βαθ. The + 6β2 α2 Γ βα, α + 2 − 4β3 α3 Γ βα, α + 1
subtraction of the estimated noise power spectrum in each    
frequency band can be considered as a shift of the p.d.f. to + β4 α4 Γ βα, α + η8 γ βα, α + 4 .
the zero-power direction (see Figure 5). As a result, negative- (15)
power components with nonzero probability arise. To avoid
this, such negative components are replaced by observations Here, Γ(b, a) is the upper incomplete gamma function
that are multiplied by a small positive value η (the so-called defined as
flooring technique). This means that the region correspond- ∞
ing to the probability of the negative components, which Γ(b, a) = t a−1 exp{−t }dt, (16)
forms a section cut from the original gamma distribution, is b
compressed by the effect of the flooring. Finally, the floored
components are superimposed on the laterally shifted p.d.f. and γ(b, a) is the lower incomplete gamma function defined
(see Figure 5). Thus, the resultant p.d.f. after SS, PSS (z), can as
be written as b
⎧   γ(b, a) = t a−1 exp{−t }dt. (17)

⎪ 1  α−1 z + βαθ 0

⎪ z + βαθ exp −

⎪ θ α Γ(α) θ



⎪   The detailed derivation of (14) is given in Appendix B.

⎪ z ≥ βαη2 θ ,

⎪ Although Uemura et al. have given an approximated form



⎪   (lower bound) of the kurtosis after SS in [18], (14) involves

 α−1 z + βαθ
PSS (z) = ⎪ 1 z + βαθ exp − no approximation throughout its derivation. Furthermore,
⎪ θ α Γ(α)
⎪ θ (14) takes into account the effect of the flooring technique



⎪  

⎪ unlike [18].

⎪ 1 − z
⎪ +  2 α
⎪ z exp − 2
α 1
Figure 6(a) depicts the theoretical kurtosis ratio after

⎪ η θ Γ(α) η θ

⎪ SS, kurtSS /kurtGM , for various values of oversubtraction

⎩  
2
0 < z < βαη θ , parameter β and flooring parameter η. In the figure, the
(13) kurtosis of the input signal is fixed to 6.0, which corresponds
EURASIP Journal on Advances in Signal Processing 7

60 100

50
80

40
Kurtosis ratio

Kurtosis ratio
60
30
40
20

10 20

0 0
0 0.5 1 1.5 2 2.5 3 3.5 4 10 100
Oversubtraction parameter Input kurtosis

η=0 η = 0.2 β=1 β=4


η = 0.1 η = 0.4 β=2 β=8
(a) (b)

Figure 6: (a) Theoretical kurtosis ratio after SS for various values of oversubtraction parameter β and flooring parameter η. In this figure,
kurtosis of input signal is fixed to 6.0. (b) Theoretical kurtosis ratio after SS for various values of input kurtosis. In this figure, flooring
parameter η is fixed to 0.0.

to a Gaussian signal. From this figure, it is confirmed that For cumulants, when X and Y are independent random
thekurtosis ratio is basically proportional to the oversub- variables it is well known that the following relation holds:
traction parameter β. However, kurtosis does not mono-
cumn (aX + bY ) = an cumn (X) + bn cumn (Y ), (18)
tonically increase when the flooring parameter is nonzero.
For instance, the kurtosis ratio is smaller than the peak where cumn (·) denotes the nth-order cumulant. The cumu-
value when β = 4 and η = 0.4. This phenomenon can be lants of the random variable X, cumn (X), are defined by
explained as follows. For a large oversubtraction parameter, a cumulant-generating function, which is the logarithm of
almost all the spectral components become negative due to the moment-generating function. The cumulant-generating
the larger lateral shift of the p.d.f. by SS. Since flooring is function C(ζ) is defined as
applied to avoid such negative components, almost all the ∞
components are reconstructed by flooring. Therefore, the     ζn
C(ζ) = log E exp ζX = cumn (X) , (19)
statistical characteristics of the signal never change except for n=1
n!
its amplitude if η =/ 0. Generally, kurtosis does not depend on
the change in amplitude; consequently, it can be considered where ζ is an auxiliary variable and E[exp{ζX }] is the
that kurtosis does not markedly increase when a larger moment-generating function. Thus, the nth-order cumulant
oversubtraction parameter and a larger flooring parameter cumn (X) is represented by
are set. cumn (X) = C (n) (0), (20)
The relation between the theoretical kurtosis ratio and
the kurtosis of the original input signal is shown in where C (n) (ζ) is the nth-order derivative of C(ζ).
Figure 6(b). In the figure, η is fixed to 0.0. It is revealed Now we consider the DS beamformer, which is steered
that the kurtosis ratio after SS rapidly decreases as the to θU = 0 and whose array weights are 1/J. Using (18), the
input kurtosis increases, even with the same oversubtraction resultant nth-order cumulant after DS, Kn = cumn (yDS ),
parameter β. Therefore, the kurtosis ratio after SS, which is can be expressed by
related to the amount of musical noise, strongly depends on
1
the statistical characteristics of the input signal. That is to say, Kn = Kn , (21)
SS generates a larger amount of musical noise for a Gaussian J n−1
input signal than for a super-Gaussian input signal. This fact where Kn = cumn (x j ) is the nth-order cumulant of x j .
has been reported in [18]. Therefore, using (21) and the well-known mathematical rela-
tion between cumulants and moments, the power-spectral-
4.4. Resultant Kurtosis after DS. In this section, we analyze domain kurtosis after DS, kurtDS can be expressed by
the kurtosis after DS, and we reveal that DS can reduce the K8 + 38K42 + 32K2 K6 + 288K22 K4 + 192K24
kurtosis of input signals. Since we assume that the statistical kurtDS = .
2K42 + 16K22 K4 + 32K24
properties of xre or xim are the same as that of x j , the effect (22)
of DS on the change in kurtosis can be derived from the
cumulants and moments of x j . The detailed derivation of (22) is described in Appendix C.
8 EURASIP Journal on Advances in Signal Processing

100 100

80 80
Output kurtosis

Output kurtosis
60 60

40 40

20 20

6 6
20 40 60 80 100 20 40 60 80 100
Input kurtosis Input kurtosis
(a) 1-microphone case (b) 2-microphone case

100 100

80 80
Output kurtosis

Output kurtosis

60 60

40 40

20 20

6 6
20 40 60 80 100 20 40 60 80 100
Input kurtosis Input kurtosis

Simulation Simulation
Theoretical Theoretical
Approximated Approximated
(c) 4-microphone case (d) 8-microphone case

Figure 7: Relation between input kurtosis and output kurtosis after DS. Solid lines indicate simulation results, broken lines express
theoretical plots obtained by (22), and dotted lines show approximate results obtained by (23).

Regarding the power-spectral components obtaining explicit function form:


from a gamma distribution, we illustrate the relation
between input kurtosis and output kurtosis after DS in
kurtDS  J −0.7 · (kurtin − 6) + 6, (23)
Figure 7. In the figure, solid lines indicate simulation
results and broken lines show theoretical relations given by
(22). The simulation results are derived as follows. First, where kurtin is the input kurtosis. The approximated plots
multichannel signals with various values of kurtosis are also match the simulation results in Figure 7.
generated artificially from a gamma distribution. Next, DS When input signals involve interchannel correlation, the
is applied to the generated signals. Finally, kurtosis after DS relation between input kurtosis and output kurtosis after
is estimated from the signal resulting from DS. From this DS approaches that for only one microphone. If all input
figure, it is confirmed that the theoretical plots closely fit signals are identical signals, that is, the signals are completely
the simulation results. The relation between input/output correlated, the output after DS also becomes the same as the
kurtosis behaves as follows: (i) The output kurtosis is very input signal. In such a case, the effect of DS on the change
close to a linear function of the input kurtosis, and (ii) in kurtosis corresponds to that for only one microphone.
the output kurtosis is almost inversely proportional to the However, the interchannel correlation is not equal to one
number of microphones. These behaviors result in the within all frequency subbands for a diffuse noise field that
following simplified (but useful) approximation with an is a typically considered noise field. It is well known that the
EURASIP Journal on Advances in Signal Processing 9

18 18

15 15
Kurtosis

Kurtosis
12 12

9 9

6 6
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones

Experimental Experimental
Theoretical Theoretical
(a) 1000 Hz (b) 8000 Hz

Figure 8: Simulation result for noise with interchannel correlation (solid line) and theoretical effect of DS assuming no interchannel
correlation (broken line) in each frequency subband.

27 we experimentally investigate the effect of interchannel


correlation in the following.
24
Figures 8 and 9 show preliminary simulation results of
21 DS. In this simulation, SS is first applied to a multichannel
Gaussian signal with interchannel correlation in the diffuse
18
Kurtosis

noise field. Next, DS is applied to the signal after SS. In the


15
preliminary simulation, the interelement distance between
microphones is 2.15 cm. From the results shown in Figures
12 8(a) and 9, we can confirm that the effect of DS on kurtosis
is weak in lower-frequency subbands, although it should be
9
noted that the effect does not completely disappear. Also,
6 the theoretical kurtosis curve is in good agreement with the
0 2000 4000 6000 8000 actual results in higher-frequency subbands (see Figures 8(b)
Frequency and 9). This is because the interchannel correlation is weak in
Observed
higher-frequency subbands. Consequently, for a diffuse noise
Experimental field, DS can reduce the kurtosis of the input signal even if
Theoretical interchannel correlation exists.
If input noise signals contain no interchannel correlation,
Figure 9: Simulation result for noise with interchannel correlation the distance between microphones does not affect the results.
(solid line), theoretical effect of DS assuming no interchannel
That is to say, the kurtosis change via DS can be well fit to
correlation (broken line), and kurtosis of the observed signal
without any signal processing (dotted line) in eight-microphone
(23). Otherwise, in lower-frequency subbands, it is expected
case. that the mitigation effect of kurtosis by DS degrades with
decreasing distance between microphones. This is because
the interchannel correlation in lower-frequency subbands
increases with decreasing distance between microphones.
intensity of the interchannel correlation is strong in lower- In higher-frequency subbands, the effect of the distance
frequency subbands and weak in higher-frequency subbands between microphones is thought to be small.
for the diffuse noise field [1]. Therefore, in lower-frequency
subbands, it can be expected that DS does not significantly 4.5. Resultant Kurtosis: BF+SS versus chSS+BF. In the pre-
reduce the kurtosis of the signal. vious subsections, we discussed the resultant kurtosis after
As it is well known that the interchannel correlation for SS and DS. In this subsection, we analyze the resultant
a diffuse noise field between two measurement locations kurtosis for two types of composite systems, that is, BF+SS
can be expressed by the sinc function [1], we can state and chSS+BF, and compare their effect on musical-noise
how array signal processing is affected by the interchannel generation. As described in Section 3, it is expected that a
correlation. However, we cannot know exactly how cumu- smaller increase in kurtosis leads to a smaller amount of
lants are changed by the interchannel correlation because musical noise generated.
(18) only holds when signals are mutually independent. In BF+SS, DS is first applied to a multichannel input
Therefore, we cannot formulate how kurtosis is changed via signal. At this point, the resultant kurtosis in the power
DS for signals with interchannel correlation. For this reason, spectral domain, kurtDS , can be represented by (23). Using
10 EURASIP Journal on Advances in Signal Processing

(11), we can derive a shape parameter for the gamma First, we derive the average power of the input signal. We
distribution corresponding to kurtDS , α, as assume that the input signal in the power domain can be
! modeled by a gamma distribution. Then, the average power
kurt2DS + 14 kurtDS + 1 − kurtDS + 5 of the input signal is given as
α = . (24)
2 kurtDS − 2 ∞
E[nin ] = E[x] = xPGM (x)dx
The derivation of (24) is shown in Appendix D. Conse- 0
quently, using (14) and (24), the resultant kurtosis after ∞  
1 x
BF+SS, kurtBF+SS , can be written as = x· α
xα−1 exp − dx (29)
  0 θ Γ(α) θ
F α, β, η ∞  
kurtBF+SS ) 2 
= Γ(α . (25) 1 x
G α, β, η = xα exp − dx.
θ α Γ(α) 0 θ
In chSS+BF, SS is first applied to each input channel.
Thus, the output kurtosis after channelwise SS, kurtchSS , is Here, let t = x/θ, then θdt = dx. Thus,
given by ∞
1
  E[nin ] = (θt)α exp{−t }θdt
F α, β, η θ α Γ(α) 0
kurtchSS = Γ(α)  . (26) ∞
G2 α, β, η θ α+1
= t α exp{−t }dt (30)
Finally, DS is performed and the resultant kurtosis after θ α Γ(α) 0

chSS+BF, kurtchSS+BF , can be written as θΓ(α + 1)


"   # = = θα.
Γ(α)
F α, β, η
kurtchSS+BF = J −0.7 Γ(α) 2   − 6 + 6, (27)
G α, β, η This corresponds to the mean of a random variable with a
gamma distribution.
where we use (23).
Next, the average power of the signal after SS is calcu-
We should compare kurtBF+SS and kurtchSS+BF here.
lated. Here, let z obey the p.d.f. of the signal after SS, PSS (z),
However, one problem still remains: comparison under
defined by (13); then the average power of the signal after SS
equivalent noise reduction performance; the noise reduction
can be expressed as
performances of BF+SS and chSS+BF are not equivalent as
described in the next section. Moreover, the design of a
E[nout ] = E[z]
flooring parameter so that the noise reduction performances
of both methods become equivalent will be discussed in ∞
the next section. Therefore, kurtBF+SS and kurtchSS+BF will = zPSS (z)dz
0
be compared in Section 6 under equivalent noise reduction  
∞
performance conditions. z  α−1 z + βαθ
= α
z + βαθ exp − dz
0 θ Γ(α) θ
5. Noise Reduction Performance Analysis  βαη2 θ  
z α−1 z
+  α z exp − 2 dz.
In the previous section, we did not discuss the noise reduc- 0 η2 θ Γ(α) η θ
tion performances of BF+SS and chSS+BF. In this section, a (31)
mathematical analysis of the noise reduction performances
of BF+SS and chSS+BF is given. As a result of this analysis, it We now consider the first term of the right-hand side in (31).
is revealed that the noise reduction performances of BF+SS We let t = z + βαθ, then dt = dz. As a result,
and chSS+BF are not equivalent even if the same parameters
∞  
are set in the SS part. We then derive a flooring-parameter z  α−1 z + βαθ
design strategy for aligning the noise reduction performances z + βαθ exp − dz
0 θ α Γ(α) θ
of BF+SS and chSS+BF.
∞  
  1 t
= t − βαθ · · t α−1 exp − dt
5.1. Noise Reduction Performance of SS. We utilize the βαθ θ α Γ(α) θ
following index to measure the noise reduction performance ∞  
1 t
(NRP): = · t α exp − dt (32)
βαθ θ α Γ(α) θ
E[nout ] ∞  
NRP = 10 log10 , (28) βαθ t
E[nin ] − · t α−1 exp − dt
βαθ θ α Γ(α) θ
where nin is the power-domain (noise) signal of the input and    
nout is the power-domain (noise) signal of the output after θ · Γ βα, α + 1 Γ βα, α
= − βαθ · .
processing. Γ(α) Γ(α)
EURASIP Journal on Advances in Signal Processing 11

35 35
Noise reduction performance (dB)

Noise reduction performance (dB)


30 30

25 25

20 20

15 15

10 10

5 5

0 0
0 1 2 3 4 5 6 7 8 6 10 100
Oversubtraction parameter Input kurtosis

η=0 η = 0.2 β=1 β=4


η = 0.1 η = 0.4 β=2 β=8
(a) (b)

Figure 10: (a) Theoretical noise reduction performance of SS with various oversubtraction parameters β and flooring parameters η. In this
figure, kurtosis of input signal is fixed to 6.0. (b) Theoretical noise reduction performance of SS with various values of input kurtosis. In this
figure, flooring parameter η is fixed to 0.0.

Also, we deal with the second term of the right-hand side in In this figure, η is fixed to 0.0. It is revealed that NRPSS
(31). We let t = z/(η2 θ) then η2 θdt = dz, resulting in decreases as the input kurtosis increases. This is because the
mean of a high-kurtosis signal tends to be small. Since the
 βαη2 θ   shape parameter α of a high-kurtosis signal becomes small,
z α−1 z
 α z exp − 2 dz the mean αθ corresponding to the amount of subtraction
0 η2 θ Γ(α) η θ also becomes small. As a result, NRPSS is decreased as the
 βα input kurtosis increases. That is to say, the NRPSS strongly
1  α
=  α η2 θt · exp{−t }η2 θdt (33) depends on the statistical characteristics of the input signal
η2 θ Γ(α) 0 as well as the values of the oversubtraction and flooring
η2 θ   parameters.
= γ βα, α + 1 .
Γ(α)
5.2. Noise Reduction Performance of DS. It is well known
that the noise reduction performance of DS (NRPDS ) is
Using (30), (32), and (33), the noise reduction performance
proportional to the number of microphones. In particular,
of SS, NRPSS , can be expressed by
for spatially uncorrelated multichannel signals, NRPDS is
  given as [1]
E[z]
NRPSS = 10 log10
E[x] NRPDS = 10 log10 J. (35)
"  
Γ βα, α + 1
= −10 log10 5.3. Resultant Noise Reduction Performance: BF+SS versus
Γ(α + 1)
chSS+BF. In the previous subsections, the noise reduction
   # performances of SS and DS were discussed. In this subsec-
Γ βα, α γ βα, α + 1
−β · + η2 . tion, we derive the resultant noise reduction performances
Γ(α) Γ(α + 1)
of the composite systems of SS and DS, that is, BF+SS and
(34) chSS+BF.
The noise reduction performance of BF+SS is analyzed
Figure 10(a) shows the theoretical value of NRPSS for as follows. In BF+SS, DS is first applied to a multichannel
various values of oversubtraction parameter β and flooring input signal. If this input signal is spatially uncorrelated, its
parameter η, where the kurtosis of the input signal is fixed noise reduction performance can be represented by 10 log10 J.
to 6.0, corresponding to a Gaussian signal. From this figure, After DS, SS is applied to the signal. Note that DS affects
it is confirmed that NRPSS is proportional to β. However, the kurtosis of the input signal. As described in Section 4.4,
NRPSS hits a peak when η is nonzero even for a large value of the resultant kurtosis after DS can be approximated as
β. The relation between the theoretical value of NRRSS and J −0.7 · (kurtin − 6) + 6. Thus, SS is applied to the kurtosis-
the kurtosis of the input signal is illustrated in Figure 10(b). modified signal. Consequently, using (24), (34), and (35),
12 EURASIP Journal on Advances in Signal Processing

24 24
Noise reduction performance (dB)

Noise reduction performance (dB)


20 20

16 16

12 12

8 8
0 2 4 6 8 0 2 4 6 8
Oversubtraction parameter Oversubtraction parameter
(a) Input kurtosis = 6 (b) Input kurtosis = 20

24
Noise reduction performance (dB)

20

16

12

8
0 2 4 6 8
Oversubtraction parameter

BF+SS
chSS+BF
(c) Input kurtosis = 80

Figure 11: Comparison of noise reduction performances of chSS+BF with BF+SS. In this figure, flooring parameter is fixed to 0.2 and
number of microphones is 8.

the noise reduction performance of BF+SS, NRPBF+SS , is (34) and (35), the noise reduction performance of chSS+BF,
given as NRPchSS+BF , can be represented by

NRPBF+SS NRPchSS+BF
= 10 log10 J − 10 log10 1
"      # = −10 log10
Γ βα, α + 1 Γ βα, α γ βα, α + 1 J · Γ(α) (37)
× −β· + η2 "    #
Γ(α + 1) Γ(α) Γ(α + 1) Γ βα, α + 1   γ βα, α + 1
(36) × − β · Γ βα, α + η2 .
1 α α
= −10 log10
J · Γ(α)
"    # Figure 11 depicts the values of NRPBF+SS and NRPchSS+BF .
Γ βα, α + 1   γ βα, α + 1
× − β · Γ βα  + η2
, α , From this result, we can see that the noise reduction
α α
performances of both methods are equivalent when the input
signal is Gaussian. However, if the input signal is super-
where α is defined by (24). Gaussian, NRPBF+SS exceeds NRPchSS+BF . This is due to the
In chSS+BF, SS is first applied to a multichannel input fact that DS is first applied to the input signal in BF+SS;
signal; then DS is applied to the resulting signal. Thus, using thus, DS reduces the kurtosis of the signal. Since NRPSS for
EURASIP Journal on Advances in Signal Processing 13

1.5 1.5

1 1

R 0.5 R 0.5

0 0

−0.5 −0.5
10 100 10 100
Input kurtosis Input kurtosis
(a) Flooring parameter η = 0.0 (b) Flooring parameter η = 0.1

1.5 1.5

1 1

R 0.5 R 0.5

0 0

−0.5 −0.5
10 100 10 100
Input kurtosis Input kurtosis

1 mic. 4 mics. 1 mic. 4 mics.


2 mics. 8 mics. 2 mics. 8 mics.
(c) Flooring parameter η = 0.2 (d) Flooring parameter η = 0.4

Figure 12: Theoretical kurtosis ratio between BF+SS and chSS+BF for various values of input kurtosis. In this figure, oversubtraction
parameter is β = 2.0 and flooring parameter in chSS+BF is (a) η = 0.0, (b) η = 0.1, (c) η = 0.2, and (d) η = 0.4.

a low-kurtosis signal is greater than that for a high-kurtosis where


signal (see Figure 10(b)), the noise reduction performance of    
  Γ βα, α+1   γ βα, α+1
BF+SS is superior to that of chSS+BF. H α, β, η = − β · Γ βα, α +η2 ,
α α
This discussion implies that NRPBF+SS and NRPchSS+BF   (39)
are not equivalent under some conditions. Thus the kurtosis-   Γ βα , α
+1  
I α, β = − β · Γ βα
, α
 . (40)
based analysis described in Section 4 is biased and requires α
some adjustment. In the following subsection, we will discuss The detailed derivation of (38) is given in Appendix E. By
how to align the noise reduction performances of BF+SS and replacing η in (3) with this new flooring parameter η, we can
chSS+BF. align NRPBF+SS and NRPchSS+BF to ensure a fair comparison.

5.4. Flooring-Parameter Design in BF+SS for Equivalent Noise 6. Output Kurtosis Comparison under
Reduction Performance. In this section, we describe the
flooring-parameter design in BF+SS so that NRPBF+SS and
Equivalent NRP Condition
NRPchSS+BF become equivalent. In this section, using the new flooring parameter for BF+SS,
Using (36) and (37), the flooring parameter η that makes η, we compare the output kurtosis of BF+SS and chSS+BF.
NRPBF+SS equal to NRPchSS+BF , is Setting η to (25), the output kurtosis of BF+SS is
$ " # modified to
%
% α Γ(α)      
η = &  · H α, β, η − I α, β , (38) F α, β, η
γ βα, α + 1 Γ(α) kurtBF+SS = Γ(α)  . (41)
G2 α, β, η
14 EURASIP Journal on Advances in Signal Processing

4 4

3 3

2 2

1 1
R R
0 0

−1 −1

−2 −2

−3 −3
0 5 10 15 20 0 5 10 15 20
Oversubtraction parameter Oversubtraction parameter

η=0 η = 0.2 η=0 η = 0.2


η = 0.1 η = 0.4 η = 0.1 η = 0.4
(a) Input kurtosis = 6.0 (b) Input kurtosis = 20.0

Figure 13: Theoretical kurtosis ratio between BF+SS and chSS+BF for various oversubtraction parameters. In this figure, number of
microphones is fixed to 8, and input kurtosis is (a) 6.0 (Gaussian) and (b) 20.0 (super-Gaussian).

Loudspeakers (for interferences) In this figure, β is fixed to 2.0 and the flooring parameter
in chSS+BF is set to η = 0.0, 0.1, 0.2, and 0.4. The
flooring parameter for BF+SS is automatically determined
by (38). From this figure, we can confirm that chSS+BF
Loudspeaker (for target source) reduces the kurtosis more than BF+SS for almost all input
signals with various values of input kurtosis. Theoretical
values of R for various oversubtraction parameters are
depicted in Figure 13. Figure 13(a) shows that the output
kurtosis after chSS+BF is always less than that after BF+SS
1m

for a Gaussian signal, even if η is nonzero. On the other


hand, Figure 13(b) implies that the output kurtosis after
3.9 m

BF+SS becomes less than that after chSS+BF for some


parameter settings. However, this only occurs for a large
Microphone array
oversubtraction parameter, for example, β ≥ 7, which is not
(with interelement spacing of 2.15 cm)
often applied in practical use. Therefore, it can be considered
that chSS+BF reduces the kurtosis and musical noise more
than BF+SS in almost all cases.

(Reverberation time: 200 ms)

7. Experiments and Results


7.1. Computer Simulations. First, we compare BF+SS and
3.9 m
chSS+BF in terms of kurtosis ratio and noise reduction
Figure 14: Reverberant room used in our simulations. performance. We use 16-kHz-sampled signals as test data,
in which the target speech is the original speech convoluted
with impulse responses recorded in a room with 200
Here, we adopt the following index to compare the resultant millisecond reverberation (see Figure 14), and to which an
kurtosis after BF+SS and chSS+BF: artificially generated spatially uncorrelated white Gaussian
kurtBF+SS or super-Gaussian signal is added. We use six speakers
R = ln , (42) (six sentences) as sources of the original clean speech. The
kurtchSS+BF
number of microphone elements in the simulation is varied
where R expresses the resultant kurtosis ratio between BF+SS from 2 to 16, and their interelement distance is 2.15 cm each.
and chSS+BF. Note that a positive R indicates that chSS+BF The oversubtraction parameter β is set to 2.0 and the flooring
reduces the kurtosis more than BF+SS, implying that less parameter for BF+SS, η, is set to 0.0, 0.2, 0.4, or 0.8. Note
musical noise is generated in chSS+BF. The behavior of that the flooring parameter in chSS+BF is set to 0.0. In the
R is depicted in Figures 12 and 13. Figure 12 illustrates simulation, we assume that the long-term-averaged power
theoretical values of R for various values of input kurtosis. spectrum of noise is estimated perfectly in advance.
EURASIP Journal on Advances in Signal Processing 15

10 20

Noise reduction performance (dB)


9
8
7 15
Kurtosis ratio

6
5
4 10
3
2
1 5
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones

chSS+BF BF+SS (η = 0.4) chSS+BF BF+SS (η = 0.4)


BF+SS (η = 0) BF+SS (η = 0.8) BF+SS (η = 0) BF+SS (η = 0.8)
BF+SS (η = 0.2) BF+SS (η = 0.2)
(a) (b)

Figure 15: Results for Gaussian input signal. (a) Kurtosis ratio and (b) noise reduction performance for BF+SS with various flooring
parameters.

Here, we utilize the kurtosis ratio defined in Section 3.4 reduction performance closely fit the experimental results.
to measure the difference in kurtosis, which is related to These findings also support the validity of the analysis in
the amount of musical noise generated. The kurtosis ratio Sections 4, 5, and 6.
is given by Figures 18–20 illustrate the simulation results for a super-
   Gaussian input signal. It is confirmed from Figure 18(a) that
kurt nproc f , τ the kurtosis ratio of chSS+BF also decreases monotonically
Kurtosis ratio =    , (43)
kurt norg f , τ with increasing number of microphones. Unlike the case
of the Gaussian input signal, the kurtosis ratio of BF+SS
where nproc ( f , τ) is the power spectra of the residual noise with η = 0.8 also decreases with increasing number of
signal after processing, and norg ( f , τ) is the power spectra microphones. However, for a lower value of the flooring
of the original noise signal before processing. This kurtosis parameter, the kurtosis ratio of BF+SS is not degraded.
ratio indicates the extent to which kurtosis is increased Moreover, the kurtosis ratio of chSS+BF is lower than that
with processing. Thus, a smaller kurtosis ratio is desirable. of BF+SS for almost all cases. For the super-Gaussian input
Moreover, the noise reduction performance is measured signal, in contrast to the case of the Gaussian input signal,
using (28). the noise reduction performance of BF+SS with η = 0.0
Figures 15–17 show the simulation results for a Gaussian is greater than that of chSS+BF (see Figure 18(b)). That
input signal. From Figure 15(a), we can see that the kurtosis is to say, the noise reduction performance of BF+SS is
ratio of chSS+BF decreases almost monotonically with superior to that of chSS+BF for the same flooring parameter.
increasing number of microphones. On the other hand, the This result is consistent with the analysis in Section 5. The
kurtosis ratio of BF+SS does not exhibit such a tendency noise reduction performance of BF+SS with η = 0.4 is
regardless of the flooring parameter. Also, the kurtosis ratio comparable to that of chSS+BF. However, the kurtosis ratio
of chSS+BF is lower than that of BF+SS for all cases except of chSS+BF is still lower than that of BF+SS with η = 0.4.
for η = 0.8. Moreover, we can confirm from Figure 15(b) This result also coincides with the analysis in Section 6.
that the values of noise reduction performance for BF+SS On the other hand, the kurtosis ratio of BF+SS with η =
with flooring parameter η = 0.0 and chSS+BF are almost the 0.8 is almost the same as that of chSS+BF. However, the
same. When the flooring parameter for BF+SS is nonzero, noise reduction performance of BF+SS with η = 0.8 is
the kurtosis ratio of BF+SS becomes smaller but the noise lower than that of chSS+BF. Thus, it can be confirmed that
reduction performance degrades. On the other hand, for chSS+BF reduces the kurtosis ratio more than BF+SS for
Gaussian signals, chSS+BF can reduce the kurtosis ratio, a super-Gaussian signal under the same noise reduction
that is, reduce the amount of musical noise generated, performance. Furthermore, the theoretical kurtosis ratio and
without degrading the noise reduction performance. Indeed noise reduction performance closely fit the experimental
BF+SS with η = 0.8 reduces the kurtosis ratio more than results in Figures 19 and 20.
chSS+BF, but the noise reduction performance of BF+SS We also compare speech distortion originating from
is extremely degraded. Furthermore, we can confirm from chSS+BF and BF+SS on the basis of cepstral distortion
Figures 16 and 17 that the theoretical kurtosis ratio and noise (CD) [29] for the four-microphone case. The comparison
16 EURASIP Journal on Advances in Signal Processing

10 10

8 8
Kurtosis ratio

Kurtosis ratio
6 6

4 4

2 2

2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
(a) chSS+BF (b) BF+SS (η = 0.0)

10 10

8 8
Kurtosis ratio

Kurtosis ratio

6 6

4 4

2 2

2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
(c) BF+SS (η = 0.2) (d) BF+SS (η = 0.4)

10

8
Kurtosis ratio

2 4 6 8 10 12 14 16
Number of microphones

Experimental
Theoretical
(e) BF+SS (η = 0.8)

Figure 16: Comparison between experimental and theoretical kurtosis ratios for Gaussian input signal.
EURASIP Journal on Advances in Signal Processing 17

20 20
Noise reduction performance

Noise reduction performance


15 15

10 10

5 5
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
(a) chSS+BF (b) BF+SS (η = 0.0)

20 20
Noise reduction performance

Noise reduction performance

15 15

10 10

5 5
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
(c) BF+SS (η = 0.2) (d) BF+SS (η = 0.4)

20
Noise reduction performance

15

10

5
2 4 6 8 10 12 14 16
Number of microphones

Experimental
Theoretical
(e) BF+SS (η = 0.8)

Figure 17: Comparison between experimental and theoretical noise reduction performances for Gaussian input signal.
18 EURASIP Journal on Advances in Signal Processing

6 20

Noise reduction performance


5

15
Kurtosis ratio

3
10
2

1
5
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones

chSS+BF BF+SS (η = 0.4) chSS+BF BF+SS (η = 0.4)


BF+SS (η = 0) BF+SS (η = 0.8) BF+SS (η = 0) BF+SS (η = 0.8)
BF+SS (η = 0.2) BF+SS (η = 0.2)
(a) (b)

Figure 18: Results for super-Gaussian input signal. (a) Kurtosis ratio and (b) noise reduction performance for BF+SS with various flooring
parameters.

Table 1: Speech distortion comparison of chSS+BF and BF+SS on (iii) Under the same level of noise reduction performance,
the basis of CD for four-microphone case. the amount of musical noise generated via chSS+BF
is less than that generated via BF+SS.
Input noise type chSS+BF BF+SS
Gaussian 6.15 dB 6.45 dB (iv) Thus, the chSS+BF structure is preferable from the
viewpoint of musical-noise generation.
Super-Gaussian 6.17 dB 5.12 dB
(v) However, the noise reduction performance of BF+SS
is superior to that of chSS+BF for a super-Gaussian
is made under the condition that the noise reduction signal when the same parameters are set in the SS part
performances of both methods are almost the same. For for both methods.
the Gaussian input signal, the same parameters β = 2.0
and η = 0.0 are utilized for BF+SS and chSS+BF. On (vi) These results imply a trade-off between the amount
the other hand, β = 2.0 and η = 0.4 are utilized of musical noise generated and the noise reduction
for BF+SS and β = 2.0 and η = 0.0 are utilized for performance. Thus, we should use an appropriate
chSS+BF for the super-Gaussian input signal. Table 1 shows structure depending on the application.
the result of the comparison, from which we can see that
the amount of speech distortion originating from BF+SS and These results should be applicable under different SNR con-
chSS+BF is almost the same for the Gaussian input signal. ditions because our analysis is independent of the noise level.
For the super-Gaussian input signal, the speech distortion In the case of more reverberation, the observed signal tends
originating from BF+SS is less than that from chSS+BF. This to become Gaussian because many reverberant components
is owing to the difference in the flooring parameter for each are mixed. Therefore, the behavior of both methods under
method. more reverberant conditions should be similar to that in the
In conclusion, all of these results are strong evidence for case of a Gaussian signal.
the validity of the analysis in Sections 4, 5, and 6. These
results suggest the following.
7.2. Subjective Evaluation. Next, we conduct a subjective
evaluation to confirm that chSS+BF can mitigate musical
(i) Although BF+SS can reduce the amount of musical
noise. In the evaluation, we presented two signals processed
noise by employing a larger flooring parameter,
by BF+SS and by chSS+BF to seven male examinees in
it leads to a deterioration of the noise reduction
random order, who were asked to select which signal they
performance.
considered to contain less musical noise (the so-called AB
(ii) In contrast, chSS+BF can reduce the kurtosis ratio, method). Moreover, we instructed examinees to evaluate
which corresponds to the amount of musical noise only the musical noise and not to consider the amplitude of
generated, without degradation of the noise reduc- the remaining noise. Here, the flooring parameter in BF+SS
tion performance. was automatically determined so that the output SNR of
EURASIP Journal on Advances in Signal Processing 19

6 6

5 5
Kurtosis ratio

Kurtosis ratio
4 4

3 3

2 2

1 1
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
(a) chSS+BF (b) BF+SS (η = 0.0)

6 6

5 5
Kurtosis ratio

Kurtosis ratio
4 4

3 3

2 2

1 1
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
(c) BF+SS (η = 0.2) (d) BF+SS (η = 0.4)

5
Kurtosis ratio

1
2 4 6 8 10 12 14 16
Number of microphones

Experimental
Theoretical
(e) BF+SS (η = 0.8)

Figure 19: Comparison between experimental and theoretical kurtosis ratios for super-Gaussian input signal.

BF+SS and chSS+BF was equivalent. We used the preference used. Note that noises (b) and (c) were recorded in the actual
score as the index of the evaluation, which is the frequency of room shown in Figure 14 and therefore include interchannel
the selected signal. correlation because they were recordings of actual noise
In the experiment, three types of noise, (a) artificial signals.
spatially uncorrelated white Gaussian noise, (b) recorded Each test sample is a 16-kHz-sampled signal, and
railway-station noise emitted from 36 loudspeakers, and (c) the target speech is the original speech convoluted with
recorded human speech emitted from 36 loudspeakers, were impulse responses recorded in a room with 200 millisecond
20 EURASIP Journal on Advances in Signal Processing

20 20
Noise reduction performance

Noise reduction performance


15 15

10 10

5 5
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
(a) chSS+BF (b) BF+SS (η = 0.0)

20 20
Noise reduction performance

Noise reduction performance


15 15

10 10

5 5
2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16
Number of microphones Number of microphones
(c) BF+SS (η = 0.2) (d) BF+SS (η = 0.4)

20
Noise reduction performance

15

10

5
2 4 6 8 10 12 14 16
Number of microphones

Experimental
Theoretical
(e) BF+SS (η = 0.8)

Figure 20: Comparison between experimental and theoretical noise reduction performances for super-Gaussian input signal.

reverberation (see Figure 14) and to which the above- Figure 21 shows the subjective evaluation results, which
mentioned recorded noise signal is added. Ten pairs of signals confirm that the output of chSS+BF is preferred to that
per type of noise, that is, a total of 30 pairs of processed of BF+SS, even for actual acoustic noises including non-
signals, were presented to each examinee. Gaussianity and interchannel correlation properties.
EURASIP Journal on Advances in Signal Processing 21

100 random variable x is replaced with x + βαθ and the gamma


Preference score (%)

80 distribution becomes
 
60 1  α−1 x + βαθ
PGM (x) = α
· x + βαθ exp −
40 Γ(α)θ θ (A.1)
20  
x ≥ −βαθ .
0
White Gaussian Station noise from Speech from 36
36 loudspeakers loudspeakers Since the domain of the original gamma distribution is x ≥
0, the domain of the resultant p.d.f. is x ≥ −βαθ. Thus,
chSS+BF negative-power components with nonzero probability arise,
which can be represented by
BF+SS
 
95% confidence interval 1  α−1 x + βαθ
Pnegative (x) = α
· x + βαθ exp −
Figure 21: Subjective evaluation results. Γ(α)θ θ
 
−βαθ ≤ x ≤ 0 ,
(A.2)
8. Conclusion
where Pnegative (x) is part of PGM (x). To remove the negative-
In this paper, we analyze two methods of integrating power components, the signals corresponding to Pnegative (x)
microphone array signal processing and SS, that is, BF+SS are replaced by observations multiplied by a small positive
and chSS+BF, on the basis of HOS. As a result of the analysis,
value η. The observations corresponding to (A.2), Pobs (x),
it is revealed that the amount of musical noise generated
are given by
via SS strongly depends on the statistical characteristics of
the input signal. Moreover, it is also clarified that the noise  
1 α−1 x  
reduction performances of BF+SS and chSS+BF are different Pobs (x) = α
· (x) exp − 0 ≤ x ≤ βαθ .
Γ(α)θ θ
except in the case of a Gaussian input signal. As a result of (A.3)
our analysis under equivalent noise reduction performance
conditions, it is shown that chSS+BF reduces musical noise Since a small positive flooring parameter η is applied to
more than BF+SS in almost all practical cases. The results (A.3), the scale parameter θ becomes η2 θ and the range is
of a computer simulation also support the validity of our changed from 0 ≤ x ≤ βαθ to 0 ≤ x ≤ βαη2 θ. Then, (A.3) is
analysis. Moreover, by carrying out a subjective evaluation, modified to
it is confirmed that the output of chSS+BF is considered to  
contain less musical noise than that of BF+SS. These analytic 1 x
Pfloor (x) =  α · (x)α−1 exp − 2
and experimental results imply the considerable potential of 2
Γ(α) η θ η θ (A.4)
optimization based on HOS to reduce musical noise.  
As a future work, it remains necessary to carry out 0 ≤ x ≤ βαη2 θ ,
signal analysis based on more general distributions. For
instance, analysis using a generalized gamma distribution where Pfloor (x) is the probability of the floored components.
[26, 27] can lead to more general results. Moreover, an exact This Pfloor (x) is superimposed on the p.d.f. given by (A.1)
formulation of how kurtosis is changed through DS under within the range 0 ≤ x ≤ βαη2 θ. By considering the positive
a coherent condition is still an open problem. Furthermore, range of (A.1) and Pfloor (x), the resultant p.d.f. of SS can be
the robustness of BF+SS and chSS+BF against low-SNR or formulated as
more reverberant conditions is not discussed in this paper.
In the future, the discussion should involve not only noise PSS (z)
reduction performance and musical-noise generation but ⎧  
also such robustness. ⎪


1  α−1 z + βαθ

⎪ z + βαθ exp −

⎪ θ α Γ(α) θ



⎪  

⎪ z ≥ βαη2 θ ,
Appendices ⎪



⎪  
⎨ (A.5)
A. Derivation of (13) = 1  α−1 z + βαθ

⎪ z + βαθ exp −
⎪ α
⎪ θ Γ(α) θ

⎪  
When we assume that the input signal of the power domain ⎪


⎪ 1 z
can be modeled by a gamma distribution, the amount ⎪
⎪ 
+ 2 α  −1
z exp − 2
α


of subtraction is βαθ. The subtraction of the estimated ⎪
⎪ η θ Γ(α) η θ

⎪  
⎩ 2
noise power spectrum in each frequency subband can be 0 < z < βαη θ ,
considered as a lateral shift of the p.d.f. to the zero-power
direction (see Figure 5). As a result of this subtraction, the where the variable x is replaced with z for convenience.
22 EURASIP Journal on Advances in Signal Processing

B. Derivation of (14)

To derive the kurtosis after SS, the 2nd- and 4th-order moments of z are required. For P_SS(z), the 2nd-order moment is given by

\[
\mu_2 = \int_0^{\infty} z^2\, P_{\mathrm{SS}}(z)\,dz
= \int_0^{\infty} z^2\,\frac{1}{\theta^{\alpha}\Gamma(\alpha)}(z+\beta\alpha\theta)^{\alpha-1}\exp\!\left(-\frac{z+\beta\alpha\theta}{\theta}\right)dz
+ \int_0^{\beta\alpha\eta^2\theta} z^2\,\frac{1}{(\eta^2\theta)^{\alpha}\Gamma(\alpha)}\,z^{\alpha-1}\exp\!\left(-\frac{z}{\eta^2\theta}\right)dz. \tag{B.1}
\]

We now expand the first term of the right-hand side of (B.1). Here, let t = (z + βαθ)/θ; then θ dt = dz and z = θ(t − βα). Consequently,

\[
\begin{aligned}
\int_0^{\infty} z^2\,\frac{1}{\theta^{\alpha}\Gamma(\alpha)}(z+\beta\alpha\theta)^{\alpha-1}\exp\!\left(-\frac{z+\beta\alpha\theta}{\theta}\right)dz
&= \int_{\beta\alpha}^{\infty} \theta^2(t-\beta\alpha)^2\,\frac{1}{\theta^{\alpha}\Gamma(\alpha)}(\theta t)^{\alpha-1}\exp\{-t\}\,\theta\,dt \\
&= \frac{\theta^2}{\Gamma(\alpha)}\int_{\beta\alpha}^{\infty}\left(t^2 - 2\beta\alpha t + \beta^2\alpha^2\right)t^{\alpha-1}\exp\{-t\}\,dt \\
&= \frac{\theta^2}{\Gamma(\alpha)}\left[\Gamma(\beta\alpha,\alpha+2) - 2\beta\alpha\,\Gamma(\beta\alpha,\alpha+1) + \beta^2\alpha^2\,\Gamma(\beta\alpha,\alpha)\right].
\end{aligned} \tag{B.2}
\]

Next we consider the second term of the right-hand side of (B.1). Here, let t = z/(η²θ); then η²θ dt = dz. Thus,

\[
\begin{aligned}
\int_0^{\beta\alpha\eta^2\theta} z^2\,\frac{1}{(\eta^2\theta)^{\alpha}\Gamma(\alpha)}\,z^{\alpha-1}\exp\!\left(-\frac{z}{\eta^2\theta}\right)dz
&= \int_0^{\beta\alpha}\left(\eta^2\theta t\right)^2\frac{1}{(\eta^2\theta)^{\alpha}\Gamma(\alpha)}\left(\eta^2\theta t\right)^{\alpha-1}\exp\{-t\}\,\eta^2\theta\,dt \\
&= \frac{\eta^4\theta^2}{\Gamma(\alpha)}\int_0^{\beta\alpha} t^{\alpha+1}\exp\{-t\}\,dt
= \eta^4\theta^2\,\frac{\gamma(\beta\alpha,\alpha+2)}{\Gamma(\alpha)}.
\end{aligned} \tag{B.3}
\]

As a result, the 2nd-order moment after SS, μ₂^(SS), is a composite of (B.2) and (B.3) and is given as

\[
\mu_2^{(\mathrm{SS})} = \frac{\theta^2}{\Gamma(\alpha)}\left[\Gamma(\beta\alpha,\alpha+2) - 2\beta\alpha\,\Gamma(\beta\alpha,\alpha+1) + \beta^2\alpha^2\,\Gamma(\beta\alpha,\alpha) + \eta^4\,\gamma(\beta\alpha,\alpha+2)\right]. \tag{B.4}
\]

In the same manner, the 4th-order moment after SS, μ₄^(SS), can be represented by

\[
\mu_4^{(\mathrm{SS})} = \frac{\theta^4}{\Gamma(\alpha)}\left[\Gamma(\beta\alpha,\alpha+4) - 4\beta\alpha\,\Gamma(\beta\alpha,\alpha+3) + 6\beta^2\alpha^2\,\Gamma(\beta\alpha,\alpha+2) - 4\beta^3\alpha^3\,\Gamma(\beta\alpha,\alpha+1) + \beta^4\alpha^4\,\Gamma(\beta\alpha,\alpha) + \eta^8\,\gamma(\beta\alpha,\alpha+4)\right]. \tag{B.5}
\]

Consequently, using (B.4) and (B.5), the kurtosis after SS is given as

\[
\mathrm{kurt}_{\mathrm{SS}} = \Gamma(\alpha)\,\frac{F(\alpha,\beta,\eta)}{G^2(\alpha,\beta,\eta)}, \tag{B.6}
\]

where

\[
\begin{aligned}
G(\alpha,\beta,\eta) &= \Gamma(\beta\alpha,\alpha+2) - 2\beta\alpha\,\Gamma(\beta\alpha,\alpha+1) + \beta^2\alpha^2\,\Gamma(\beta\alpha,\alpha) + \eta^4\,\gamma(\beta\alpha,\alpha+2),\\
F(\alpha,\beta,\eta) &= \Gamma(\beta\alpha,\alpha+4) - 4\beta\alpha\,\Gamma(\beta\alpha,\alpha+3) + 6\beta^2\alpha^2\,\Gamma(\beta\alpha,\alpha+2) - 4\beta^3\alpha^3\,\Gamma(\beta\alpha,\alpha+1) + \beta^4\alpha^4\,\Gamma(\beta\alpha,\alpha) + \eta^8\,\gamma(\beta\alpha,\alpha+4).
\end{aligned} \tag{B.7}
\]
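The closed form (B.6)-(B.7) is easy to evaluate with SciPy. In the sketch below (our illustration, not from the paper), the paper's incomplete gamma notation Γ(βα, a) = ∫_{βα}^∞ t^{a−1}e^{−t} dt and γ(βα, a) = ∫_0^{βα} t^{a−1}e^{−t} dt is mapped onto the regularized functions gammaincc and gammainc; with the same illustrative parameter values, the result can be compared against the Monte-Carlo estimate from the sketch after Appendix A:

    import numpy as np
    from scipy.special import gamma, gammainc, gammaincc

    def G_upper(x, a):
        """Paper notation Gamma(x, a) = int_x^inf t^(a-1) e^(-t) dt."""
        return gammaincc(a, x) * gamma(a)

    def g_lower(x, a):
        """Paper notation gamma(x, a) = int_0^x t^(a-1) e^(-t) dt."""
        return gammainc(a, x) * gamma(a)

    def kurt_ss(alpha, beta, eta):
        """Kurtosis after spectral subtraction following (B.6)-(B.7)."""
        ba = beta * alpha
        G = (G_upper(ba, alpha + 2) - 2 * ba * G_upper(ba, alpha + 1)
             + ba**2 * G_upper(ba, alpha) + eta**4 * g_lower(ba, alpha + 2))
        F = (G_upper(ba, alpha + 4) - 4 * ba * G_upper(ba, alpha + 3)
             + 6 * ba**2 * G_upper(ba, alpha + 2) - 4 * ba**3 * G_upper(ba, alpha + 1)
             + ba**4 * G_upper(ba, alpha) + eta**8 * g_lower(ba, alpha + 4))
        return gamma(alpha) * F / G**2

    print(kurt_ss(alpha=0.8, beta=1.4, eta=0.3))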
C. Derivation of (22)

As described in (12), the power-domain signal is the sum of two squares of random variables with the same distribution. Using (18), the power-domain cumulants K_n^(p) can be written as

\[
K_1^{(p)} = 2K_1^{(2)}, \quad K_2^{(p)} = 2K_2^{(2)}, \quad K_3^{(p)} = 2K_3^{(2)}, \quad K_4^{(p)} = 2K_4^{(2)}, \tag{C.1}
\]

where K_n^(2) is the nth-order cumulant in the square domain. Here, the p.d.f. of such a square-domain signal is not symmetrical and its mean is not zero. Thus, we utilize the following relations between the moments and cumulants around the origin:

\[
\mu_1 = \kappa_1, \quad \mu_2 = \kappa_2 + \kappa_1^2, \quad \mu_4 = \kappa_4 + 4\kappa_3\kappa_1 + 3\kappa_2^2 + 6\kappa_2\kappa_1^2 + \kappa_1^4, \tag{C.2}
\]

where μ_n is the nth-order raw moment and κ_n is the nth-order cumulant. Moreover, the square-domain moments μ_n^(2) can be expressed by

\[
\mu_1^{(2)} = \mu_2, \quad \mu_2^{(2)} = \mu_4, \quad \mu_4^{(2)} = \mu_8. \tag{C.3}
\]

Using (C.1)-(C.3), the power-domain moments can be expressed in terms of the 4th- and 8th-order moments in the time domain. Therefore, to obtain the kurtosis after DS in the power domain, the moments and cumulants after DS up to the 8th order are needed.

The 3rd-, 5th-, and 7th-order cumulants are zero because we assume that the p.d.f. of x_j is symmetrical and that its mean is zero. If these conditions are satisfied, the following relations between moments and cumulants hold:

\[
\mu_1 = 0, \quad \mu_2 = \kappa_2, \quad \mu_4 = \kappa_4 + 3\kappa_2^2, \quad \mu_6 = \kappa_6 + 15\kappa_4\kappa_2 + 15\kappa_2^3, \quad \mu_8 = \kappa_8 + 35\kappa_4^2 + 28\kappa_2\kappa_6 + 210\kappa_2^2\kappa_4 + 105\kappa_2^4. \tag{C.4}
\]

Using (21) and (C.4), the time-domain moments after DS are expressed as

\[
\mu_2^{(\mathrm{DS})} = K_2, \quad \mu_4^{(\mathrm{DS})} = K_4 + 3K_2^2, \quad \mu_6^{(\mathrm{DS})} = K_6 + 15K_2K_4 + 15K_2^3, \quad \mu_8^{(\mathrm{DS})} = K_8 + 35K_4^2 + 28K_2K_6 + 210K_2^2K_4 + 105K_2^4, \tag{C.5}
\]

where μ_n^(DS) is the nth-order raw moment after DS in the time domain.

Using (C.2), (C.3), and (C.5), the square-domain cumulants can be written as

\[
K_1^{(2)} = K_2, \quad K_2^{(2)} = K_4 + 2K_2^2, \quad K_3^{(2)} = K_6 + 12K_4K_2 + 8K_2^3, \quad K_4^{(2)} = K_8 + 32K_4^2 + 24K_2K_6 + 144K_2^2K_4 + 48K_2^4. \tag{C.6}
\]

Moreover, using (C.1), (C.2), and (C.6), the 2nd- and 4th-order power-domain moments can be written as

\[
\mu_2^{(p)} = 2\left(K_4 + 4K_2^2\right), \quad \mu_4^{(p)} = 2\left(K_8 + 38K_4^2 + 32K_6K_2 + 288K_4K_2^2 + 192K_2^4\right). \tag{C.7}
\]

As a result, the power-domain kurtosis after DS, kurt_DS, is given as

\[
\mathrm{kurt}_{\mathrm{DS}} = \frac{K_8 + 38K_4^2 + 32K_2K_6 + 288K_2^2K_4 + 192K_2^4}{2K_4^2 + 16K_2^2K_4 + 32K_2^4}. \tag{C.8}
\]

D. Derivation of (24)

According to (11), the shape parameter α corresponding to the kurtosis after DS, kurt_DS, is given by the solution of the quadratic equation

\[
\mathrm{kurt}_{\mathrm{DS}} = \frac{(\alpha+2)(\alpha+3)}{\alpha(\alpha+1)}. \tag{D.1}
\]

This can be expanded as

\[
\alpha^2\left(\mathrm{kurt}_{\mathrm{DS}} - 1\right) + \alpha\left(\mathrm{kurt}_{\mathrm{DS}} - 5\right) - 6 = 0. \tag{D.2}
\]

Using the quadratic formula,

\[
\alpha = \frac{-\mathrm{kurt}_{\mathrm{DS}} + 5 \pm \sqrt{\mathrm{kurt}_{\mathrm{DS}}^2 + 14\,\mathrm{kurt}_{\mathrm{DS}} + 1}}{2\,\mathrm{kurt}_{\mathrm{DS}} - 2}, \tag{D.3}
\]

whose denominator is larger than zero because kurt_DS > 1. Here, since α > 0, we must select the appropriate numerator of (D.3). First, suppose that

\[
-\mathrm{kurt}_{\mathrm{DS}} + 5 + \sqrt{\mathrm{kurt}_{\mathrm{DS}}^2 + 14\,\mathrm{kurt}_{\mathrm{DS}} + 1} > 0. \tag{D.4}
\]

This inequality clearly holds when 1 < kurt_DS < 5 because −kurt_DS + 5 > 0 and √(kurt_DS² + 14 kurt_DS + 1) > 0. Thus,

\[
-\mathrm{kurt}_{\mathrm{DS}} + 5 > -\sqrt{\mathrm{kurt}_{\mathrm{DS}}^2 + 14\,\mathrm{kurt}_{\mathrm{DS}} + 1}. \tag{D.5}
\]

When kurt_DS ≥ 5, the following relation also holds:

\[
\left(-\mathrm{kurt}_{\mathrm{DS}} + 5\right)^2 < \mathrm{kurt}_{\mathrm{DS}}^2 + 14\,\mathrm{kurt}_{\mathrm{DS}} + 1 \;\Longleftrightarrow\; 24\,\mathrm{kurt}_{\mathrm{DS}} > 24. \tag{D.6}
\]

Since (D.6) is true when kurt_DS ≥ 5, (D.4) holds. In summary, (D.4) always holds for 1 < kurt_DS < 5 and 5 ≤ kurt_DS. Thus,

\[
-\mathrm{kurt}_{\mathrm{DS}} + 5 + \sqrt{\mathrm{kurt}_{\mathrm{DS}}^2 + 14\,\mathrm{kurt}_{\mathrm{DS}} + 1} > 0 \quad \text{for } \mathrm{kurt}_{\mathrm{DS}} > 1. \tag{D.7}
\]

Overall,

\[
\frac{-\mathrm{kurt}_{\mathrm{DS}} + 5 + \sqrt{\mathrm{kurt}_{\mathrm{DS}}^2 + 14\,\mathrm{kurt}_{\mathrm{DS}} + 1}}{2\,\mathrm{kurt}_{\mathrm{DS}} - 2} > 0. \tag{D.8}
\]

On the other hand, let

\[
-\mathrm{kurt}_{\mathrm{DS}} + 5 - \sqrt{\mathrm{kurt}_{\mathrm{DS}}^2 + 14\,\mathrm{kurt}_{\mathrm{DS}} + 1} > 0. \tag{D.9}
\]

This inequality is not satisfied when kurt_DS > 5 because −kurt_DS + 5 < 0 and √(kurt_DS² + 14 kurt_DS + 1) > 0. Now (D.9) can be modified as

\[
-\mathrm{kurt}_{\mathrm{DS}} + 5 > \sqrt{\mathrm{kurt}_{\mathrm{DS}}^2 + 14\,\mathrm{kurt}_{\mathrm{DS}} + 1}; \tag{D.10}
\]

then the following relation also holds for 1 < kurt_DS ≤ 5:

\[
\left(-\mathrm{kurt}_{\mathrm{DS}} + 5\right)^2 > \mathrm{kurt}_{\mathrm{DS}}^2 + 14\,\mathrm{kurt}_{\mathrm{DS}} + 1 \;\Longleftrightarrow\; 24\,\mathrm{kurt}_{\mathrm{DS}} < 24. \tag{D.11}
\]

This is not true for 1 < kurt_DS ≤ 5. Thus, (D.9) is not appropriate for kurt_DS > 1. Therefore, α corresponding to kurt_DS is given by

\[
\alpha = \frac{-\mathrm{kurt}_{\mathrm{DS}} + 5 + \sqrt{\mathrm{kurt}_{\mathrm{DS}}^2 + 14\,\mathrm{kurt}_{\mathrm{DS}} + 1}}{2\,\mathrm{kurt}_{\mathrm{DS}} - 2}. \tag{D.12}
\]
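Equation (D.12) gives the gamma shape parameter directly from an observed power-domain kurtosis. A minimal Python helper (our illustration, not part of the paper) implements it, together with a sanity check based on (11): a Gaussian input has power-domain kurtosis 6 and should map back to α = 1:

    import numpy as np

    def shape_from_kurtosis(kurt_ds):
        """Gamma shape parameter solving kurt = (a+2)(a+3)/(a(a+1)), cf. (D.12).
        Valid for kurt_ds > 1."""
        if kurt_ds <= 1.0:
            raise ValueError("kurtosis must exceed 1")
        return (5.0 - kurt_ds + np.sqrt(kurt_ds**2 + 14.0 * kurt_ds + 1.0)) / (2.0 * kurt_ds - 2.0)

    # Sanity check: kurtosis 6 corresponds to alpha = 1 (Gaussian input).
    print(shape_from_kurtosis(6.0))  # -> 1.0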
E. Derivation of (38)

For 0 < α ≤ 1, which corresponds to a Gaussian or super-Gaussian input signal, the numerical simulation in Section 5.3 reveals that the noise reduction performance of BF+SS is superior to that of chSS+BF. Thus, the following relation holds:

\[
-10\log_{10}\!\left\{\frac{1}{J\,\Gamma(\bar\alpha)}\!\left[\frac{\Gamma(\beta\bar\alpha,\bar\alpha+1)}{\bar\alpha} - \beta\,\Gamma(\beta\bar\alpha,\bar\alpha) + \eta^2\,\frac{\gamma(\beta\bar\alpha,\bar\alpha+1)}{\bar\alpha}\right]\right\}
\ge
-10\log_{10}\!\left\{\frac{1}{J\,\Gamma(\alpha)}\!\left[\frac{\Gamma(\beta\alpha,\alpha+1)}{\alpha} - \beta\,\Gamma(\beta\alpha,\alpha) + \eta^2\,\frac{\gamma(\beta\alpha,\alpha+1)}{\alpha}\right]\right\}, \tag{E.1}
\]

where ᾱ denotes the shape parameter of the power-domain signal after DS (cf. (24)). This inequality corresponds to

\[
\frac{1}{\Gamma(\bar\alpha)}\!\left[\frac{\Gamma(\beta\bar\alpha,\bar\alpha+1)}{\bar\alpha} - \beta\,\Gamma(\beta\bar\alpha,\bar\alpha) + \eta^2\,\frac{\gamma(\beta\bar\alpha,\bar\alpha+1)}{\bar\alpha}\right]
\le
\frac{1}{\Gamma(\alpha)}\!\left[\frac{\Gamma(\beta\alpha,\alpha+1)}{\alpha} - \beta\,\Gamma(\beta\alpha,\alpha) + \eta^2\,\frac{\gamma(\beta\alpha,\alpha+1)}{\alpha}\right]. \tag{E.2}
\]

Then, the new flooring parameter η̄ in BF+SS, which makes the noise reduction performance of BF+SS equal to that of chSS+BF, satisfies η̄ ≥ η (≥ 0) because

\[
\frac{\gamma(\beta\bar\alpha,\bar\alpha+1)}{\bar\alpha} \ge 0. \tag{E.3}
\]

Moreover, the following relation for η̄ also holds:

\[
\frac{1}{\Gamma(\bar\alpha)}\!\left[\frac{\Gamma(\beta\bar\alpha,\bar\alpha+1)}{\bar\alpha} - \beta\,\Gamma(\beta\bar\alpha,\bar\alpha) + \bar\eta^2\,\frac{\gamma(\beta\bar\alpha,\bar\alpha+1)}{\bar\alpha}\right]
=
\frac{1}{\Gamma(\alpha)}\!\left[\frac{\Gamma(\beta\alpha,\alpha+1)}{\alpha} - \beta\,\Gamma(\beta\alpha,\alpha) + \eta^2\,\frac{\gamma(\beta\alpha,\alpha+1)}{\alpha}\right]. \tag{E.4}
\]

This can be rewritten as

\[
\bar\eta^2\,\frac{\Gamma(\alpha)}{\Gamma(\bar\alpha)}\,\frac{\gamma(\beta\bar\alpha,\bar\alpha+1)}{\bar\alpha}
=
\left[\frac{\Gamma(\beta\alpha,\alpha+1)}{\alpha} - \beta\,\Gamma(\beta\alpha,\alpha) + \eta^2\,\frac{\gamma(\beta\alpha,\alpha+1)}{\alpha}\right]
- \frac{\Gamma(\alpha)}{\Gamma(\bar\alpha)}\left[\frac{\Gamma(\beta\bar\alpha,\bar\alpha+1)}{\bar\alpha} - \beta\,\Gamma(\beta\bar\alpha,\bar\alpha)\right], \tag{E.5}
\]

and consequently

\[
\bar\eta^2 = \frac{\bar\alpha}{\gamma(\beta\bar\alpha,\bar\alpha+1)}\left[\frac{\Gamma(\bar\alpha)}{\Gamma(\alpha)}\,H(\alpha,\beta,\eta) - I(\bar\alpha,\beta)\right], \tag{E.6}
\]

where H(α, β, η) is defined by (39) and I(ᾱ, β) is given by (40). Using (E.3) and (E.4), the right-hand side of (E.5) is clearly greater than or equal to zero. Moreover, since Γ(α) > 0, Γ(ᾱ) > 0, ᾱ > 0, and γ(βᾱ, ᾱ+1) > 0, the right-hand side of (E.6) is also greater than or equal to zero. Therefore,

\[
\bar\eta = \sqrt{\frac{\bar\alpha}{\gamma(\beta\bar\alpha,\bar\alpha+1)}\left[\frac{\Gamma(\bar\alpha)}{\Gamma(\alpha)}\,H(\alpha,\beta,\eta) - I(\bar\alpha,\beta)\right]}. \tag{E.7}
\]

Acknowledgment

This work was partly supported by the MIC Strategic Information and Communications R&D Promotion Programme in Japan.

References

[1] M. Brandstein and D. Ward, Eds., Microphone Arrays: Signal Processing Techniques and Applications, Springer, Berlin, Germany, 2001.
[2] J. L. Flanagan, J. D. Johnston, R. Zahn, and G. W. Elko, “Computer-steered microphone arrays for sound transduction in large rooms,” Journal of the Acoustical Society of America, vol. 78, no. 5, pp. 1508–1518, 1985.
[3] M. Omologo, M. Matassoni, P. Svaizer, and D. Giuliani, “Microphone array based speech recognition with different talker-array positions,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’97), pp. 227–230, Munich, Germany, September 1997.
[4] H. F. Silverman and W. R. Patterson, “Visualizing the performance of large-aperture microphone arrays,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’99), pp. 962–972, 1999.
[5] O. Frost, “An algorithm for linearly constrained adaptive array processing,” Proceedings of the IEEE, vol. 60, pp. 926–935, 1972.
[6] L. J. Griffiths and C. W. Jim, “An alternative approach to linearly constrained adaptive beamforming,” IEEE Transactions on Antennas and Propagation, vol. 30, no. 1, pp. 27–34, 1982.
[7] Y. Kaneda and J. Ohga, “Adaptive microphone-array system for noise reduction,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 6, pp. 1391–1400, 1986.
[8] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[9] J. Meyer and K. Simmer, “Multi-channel speech enhancement in a car environment using Wiener filtering and spectral subtraction,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’97), pp. 1167–1170, 1997.
[10] S. Fischer and K. D. Kammeyer, “Broadband beamforming with adaptive post filtering for speech acquisition in noisy environment,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’97), pp. 359–362, 1997.
[11] R. Mukai, S. Araki, H. Sawada, and S. Makino, “Removal of residual cross-talk components in blind source separation using time-delayed spectral subtraction,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’02), pp. 1789–1792, Orlando, Fla, USA, May 2002.
[12] J. Cho and A. Krishnamurthy, “Speech enhancement using microphone array in moving vehicle environment,” in Proceedings of the IEEE Intelligent Vehicles Symposium, pp. 366–371, Graz, Austria, April 2003.
[13] Y. Ohashi, T. Nishikawa, H. Saruwatari, A. Lee, and K. Shikano, “Noise robust speech recognition based on spatial subtraction array,” in Proceedings of the International Workshop on Nonlinear Signal and Image Processing, pp. 324–327, 2005.
[14] J. Even, H. Saruwatari, and K. Shikano, “New architecture combining blind signal extraction and modified spectral subtraction for suppression of background noise,” in Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC ’08), Seattle, Wash, USA, 2008.
[15] Y. Takahashi, T. Takatani, K. Osako, H. Saruwatari, and K. Shikano, “Blind spatial subtraction array for speech enhancement in noisy environment,” IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 4, pp. 650–664, 2009.
[16] S. B. Jebara, “A perceptual approach to reduce musical noise phenomenon with Wiener denoising technique,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’06), vol. 3, pp. 49–52, 2006.
[17] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.
[18] Y. Uemura, Y. Takahashi, H. Saruwatari, K. Shikano, and K. Kondo, “Automatic optimization scheme of spectral subtraction based on musical noise assessment via higher-order statistics,” in Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC ’08), Seattle, Wash, USA, 2008.
[19] Y. Uemura, Y. Takahashi, H. Saruwatari, K. Shikano, and K. Kondo, “Musical noise generation analysis for noise reduction methods based on spectral subtraction and MMSE STSA estimation,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’09), pp. 4433–4436, 2009.
[20] Y. Takahashi, Y. Uemura, H. Saruwatari, K. Shikano, and K. Kondo, “Musical noise analysis based on higher order statistics for microphone array and nonlinear signal processing,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’09), pp. 229–232, 2009.
[21] P. Comon, “Independent component analysis, a new concept?” Signal Processing, vol. 36, pp. 287–314, 1994.
[22] H. Saruwatari, S. Kurita, K. Takeda, F. Itakura, T. Nishikawa, and K. Shikano, “Blind source separation combining independent component analysis and beamforming,” EURASIP Journal on Applied Signal Processing, vol. 2003, no. 11, pp. 1135–1146, 2003.
[23] M. Mizumachi and M. Akagi, “Noise reduction by paired-microphone using spectral subtraction,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’98), vol. 2, pp. 1001–1004, 1998.
[24] T. Takatani, T. Nishikawa, H. Saruwatari, and K. Shikano, “High-fidelity blind separation of acoustic signals using SIMO-model-based independent component analysis,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E87-A, no. 8, pp. 2063–2072, 2004.
[25] S. Ikeda and N. Murata, “A method of ICA in the frequency domain,” in Proceedings of the International Workshop on Independent Component Analysis and Blind Signal Separation, pp. 365–371, 1999.
[26] E. W. Stacy, “A generalization of the gamma distribution,” The Annals of Mathematical Statistics, pp. 1187–1192, 1962.
[27] K. Kokkinakis and A. K. Nandi, “Generalized gamma density-based score functions for fast and flexible ICA,” Signal Processing, vol. 87, no. 5, pp. 1156–1162, 2007.
[28] J. W. Shin, J.-H. Chang, and N. S. Kim, “Statistical modeling of speech signals based on generalized gamma distribution,” IEEE Signal Processing Letters, vol. 12, no. 3, pp. 258–261, 2005.
[29] L. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice-Hall PTR, 1993.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 509541, 13 pages
doi:10.1155/2010/509541
Research Article
Microphone Diversity Combining for In-Car Applications

Jürgen Freudenberger, Sebastian Stenzel (EURASIP Member), and Benjamin Venditti (EURASIP Member)

Department of Computer Science, University of Applied Sciences Konstanz, Hochschule Konstanz, Brauneggerstr. 55, 78462 Konstanz, Germany

Correspondence should be addressed to Jürgen Freudenberger, juergen.freudenberger@htwg-konstanz.de

Received 1 August 2009; Revised 23 January 2010; Accepted 17 March 2010

Academic Editor: Ivan Tashev

Copyright © 2010 Jürgen Freudenberger et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This paper proposes a frequency domain diversity approach for two or more microphone signals, for example, for in-car applications. The microphones should be positioned separately to ensure diverse signal conditions and incoherent recording of noise. This enables a better compromise for the microphone position with respect to different speaker sizes and noise sources. This work proposes a two-stage approach. In the first stage, the microphone signals are weighted with respect to their signal-to-noise ratio and then summed similarly to maximum ratio combining. The combined signal is then used as a reference for a frequency domain least-mean-squares (LMS) filter for each input signal. The output SNR is significantly improved compared to coherence-based noise reduction systems, even if one microphone is heavily corrupted by noise.

1. Introduction

With in-car speech applications like hands-free car kits and speech recognition systems, speech is corrupted by engine noise and other noise sources like airflow from electric fans or car windows. For safety and comfort reasons, hands-free telephone systems should provide the same quality of speech as conventional fixed telephones. In practice however, the speech quality of a hands-free car kit heavily depends on the particular position of the microphone. Speech has to be picked up as directly as possible to reduce reverberation and to provide a sufficient signal-to-noise ratio. The important question, where to place the microphone inside the car, is, however, difficult to answer. The position is apparently a compromise for different speaker sizes, because the distance between microphone and speaker depends significantly on the position of the driver and therefore on the size of the driver. Furthermore, noise sources like airflow from electric fans or car windows have to be considered. Placing two or more microphones in different positions enables a better compromise with respect to different speaker sizes and yields more noise robustness.

Today, noise reduction in hands-free car kits and in-car speech recognition systems is usually based on single channel noise reduction or beamformer arrays [1–3]. Good noise robustness of single microphone systems requires the use of single channel noise suppression techniques, most of them derived from spectral subtraction [4]. Such noise reduction algorithms improve the signal-to-noise ratio, but they usually introduce undesired speech distortion. Microphone arrays can improve the performance compared to single microphone systems. Nevertheless, the signal quality does still depend on the speaker position. Moreover, the microphones are located in close proximity. Therefore, microphone arrays are often vulnerable to airflow that might disturb all microphone signals.

Alternatively, multimicrophone setups have been proposed that combine the processed signals of two or more separate microphones. The microphones are positioned separately (e.g., 40 to 80 cm apart) in order to ensure incoherent recording of noise [5–11]. Similar multichannel signal processing systems have been suggested to reduce signal distortion due to reverberation [12, 13]. Basically, all these approaches exploit the fact that speech components in the microphone signals are strongly correlated while the noise components are only weakly correlated if the distance between the microphones is sufficiently large.
The question at hand with distributed arrays is how to combine microphone signals with possibly rather different signal conditions. In this paper, we consider a diversity technique that combines the processed signals of several separate microphones. The basic idea of our approach is to apply maximum-ratio-combining (MRC) to speech signals, where we propose a frequency domain diversity approach for two or more microphone signals. MRC maximizes the signal-to-noise ratio in the combined signal.

A major issue for the application of maximum-ratio-combining for multimicrophone setups is the estimation of the acoustic transfer functions. In telecommunications, the signal attenuation as well as the phase shift for each transmission path are usually measured to apply MRC. With speech applications we have no means to directly measure the acoustic transfer functions. There exist several blind approaches to estimate the acoustic transfer functions (see, e.g., [14–16]) that were successfully applied to dereverberation. However, the proposed estimation methods are computationally demanding.

In this paper, we show that maximum-ratio-combining can be achieved without explicit knowledge of the acoustic transfer functions. Proper signal weighting can be achieved based on an estimate of the input signal-to-noise ratio. We propose a two stage processing of the microphone signals. In the first stage, the microphone signals are weighted with respect to their input signal-to-noise ratio. These weights guarantee maximum-ratio-combining of the signals with respect to the signal magnitudes. To ensure cophasal addition of the weighted signals, we use the combined signal as reference signal for frequency domain LMS filters in the second stage. These filters adjust the phases of the microphone signals to guarantee coherent signal combining.

The proposed concept is similar to the single channel noise reduction system presented by Mukherjee and Gwee [17]. This system uses spectral subtraction to obtain a crude estimate of the speech signal. This estimate is then used as the reference signal of a single LMS filter. In this paper, we generalize this concept to multimicrophone systems, where our aim is not only noise reduction, but also dereverberation of the microphone signals.

The paper is organized as follows: In Section 2, we present some measurement results obtained in a car environment. These results motivate the proposed diversity approach. In Section 3, we present a signal combiner that achieves MRC weighting based on the knowledge of the input signal-to-noise ratios. Coherence based signal combining is discussed in Section 4. In the subsequent section, we consider implementation issues. In particular, we present an estimator for the required input signal-to-noise ratios. Finally, in Section 6, we present some simulation results for different real world noise situations.

2. Measurement Results

The basic idea of our spectral combining approach is to apply MRC to speech signals. To motivate this approach, we first discuss some measurement results obtained in a car environment. For these measurements, we used two cardioid microphones with positions suited for car integration. One microphone (denoted by mic. 1) was installed close to the inside mirror. The second microphone (mic. 2) was mounted at the A-pillar.

Figure 1: Input SNR values for a driving situation at a car speed of 100 km/h.

Figure 1 depicts the SNR versus frequency for a driving situation at a car speed of 100 km/h. From this figure, we observe that the SNR values are quite distinct for these two microphone positions with differences of up to 10 dB depending on the particular frequency. We also note that the better microphone position is not obvious in this case, because the SNR curves cross several times.

Theoretically, an MRC combining of the two input signals would result in an output SNR equal to the sum of the input SNR values. With two inputs, MRC achieves a maximum gain of 3 dB for equal input SNR values. In case of the input SNR values being rather different, the sum is dominated by the maximum value. Hence, for the curves in Figure 1 the output SNR would essentially be the envelope of the two curves.

Next we consider the coherence for the noise and speech signals. The corresponding results are depicted in Figure 2. The figure presents measurements for two microphones installed close to the inside mirror in an end-fire beamformer constellation with a microphone distance of 7 cm. The lower figure contains the results for the microphone positions mic. 1 and mic. 2 (distance of 65 cm). From these results, we observe that the noise coherence closely follows the theoretical coherence function (dotted line in Figure 2) in an ideal diffuse sound field [18]. Separating the microphones significantly reduces the noise coherence for low frequencies. On the other hand, both microphone constellations have similar speech coherence. We note that the speech coherence is not ideal, as it has steep dips. The corresponding frequencies will probably be attenuated by a signal combiner that is solely based on coherence.

Figure 2: Coherence for noise and speech signals for two different microphone positions.

3. Spectral Combining

In this section, we present the basic system concept. To simplify the discussion, we assume that all signals are stationary and that the acoustic system is linear and time-invariant.
In the subsequent section we consider the modifications for nonstationary signals and time variant systems.

We consider a scenario with M microphones. The microphone signals y_i(k) can be modeled by the convolution of the speech signal x(k) with the impulse response h_i(k) of the acoustic system plus additive noise n_i(k). Hence the M microphone signals y_i(k) can be expressed as

\[
y_i(k) = h_i(k) * x(k) + n_i(k), \tag{1}
\]

where ∗ denotes the convolution.

To apply the diversity technique, it is convenient to consider the signals in the frequency domain. Let X(f) be the spectrum of the speech signal x(k) and Y_i(f) be the spectrum of the ith microphone signal y_i(k). The speech signal is linearly distorted by the acoustic transfer function H_i(f) and corrupted by the noise term N_i(f). Hence, the signal observed at the ith microphone has the spectrum

\[
Y_i(f) = X(f)H_i(f) + N_i(f). \tag{2}
\]

In the following, we assume that the speech signal and the channel coefficients are uncorrelated. We assume a complex Gaussian distribution of the noise terms N_i(f). Moreover, we presume that the noise power spectral density λ_N(f) = E{|N_i(f)|²} is the same for all microphones. This assumption is reasonable for a diffuse sound field.

Our aim is to linearly combine the M microphone signals Y_i(f) so that the signal-to-noise ratio in the combined signal X̂(f) is maximized. In the frequency domain, the signal combining can be expressed as

\[
\hat{X}(f) = \sum_{i=1}^{M} G_i(f)\,Y_i(f), \tag{3}
\]

where G_i(f) is the weight of the ith microphone signal. With (2) we have

\[
\hat{X}(f) = X(f)\sum_{i=1}^{M} G_i(f)H_i(f) + \sum_{i=1}^{M} G_i(f)N_i(f), \tag{4}
\]

where the first sum represents the speech component and the second sum represents the noise component of the combined signal. Hence, the overall signal-to-noise ratio of the combined signal is

\[
\gamma(f) = \frac{E\left\{\left|X(f)\sum_{i=1}^{M} G_i(f)H_i(f)\right|^2\right\}}{E\left\{\left|\sum_{i=1}^{M} G_i(f)N_i(f)\right|^2\right\}}. \tag{5}
\]

3.1. Maximum-Ratio-Combining. The optimal combining strategy that maximizes the signal-to-noise ratio in the combined signal X̂(f) is usually called maximal-ratio-combining (MRC) [19]. In this section, we briefly outline the derivation of the MRC weights for completeness. Furthermore, some of the properties of maximal ratio combining are discussed.

Let λ_X(f) = E{|X(f)|²} be the speech power spectral density. Assuming that the noise power λ_N(f) is the same for all microphones and that the noise at the different microphones is uncorrelated, we have

\[
\gamma(f) = \frac{\lambda_X(f)\left|\sum_{i=1}^{M} G_i(f)H_i(f)\right|^2}{\lambda_N(f)\sum_{i=1}^{M}\left|G_i(f)\right|^2}. \tag{6}
\]

We consider now the term |Σ_{i=1}^{M} G_i(f)H_i(f)|² in the numerator of (6). Using the Cauchy-Schwarz inequality we have

\[
\left|\sum_{i=1}^{M} G_i(f)H_i(f)\right|^2 \le \sum_{i=1}^{M}\left|G_i(f)\right|^2 \sum_{i=1}^{M}\left|H_i(f)\right|^2 \tag{7}
\]

with equality if G_i(f) = cH_i^*(f), where H_i^* is the complex conjugate of the channel coefficient H_i. Here c is a real-valued constant common to all weights G_i(f). Thus, for the signal-to-noise ratio we obtain

\[
\gamma(f) \le \frac{\lambda_X(f)\sum_{i=1}^{M}\left|G_i(f)\right|^2 \sum_{i=1}^{M}\left|H_i(f)\right|^2}{\lambda_N(f)\sum_{i=1}^{M}\left|G_i(f)\right|^2} = \frac{\lambda_X(f)}{\lambda_N(f)}\sum_{i=1}^{M}\left|H_i(f)\right|^2. \tag{8}
\]
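This bound can be verified numerically. The following Python sketch (our illustration, with arbitrary toy values for the channels and noise level) simulates a single frequency bin with two microphones, applies the weights G_i(f) = cH_i*(f), and compares the resulting output SNR with the sum of the input SNRs stated below in (9):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200_000

    # Toy single-bin scenario with two microphones (values are assumptions).
    H = np.array([0.9 * np.exp(1j * 0.4), 0.5 * np.exp(-1j * 1.1)])  # H_i(f)
    lam_X, lam_N = 1.0, 0.1                                          # speech / noise PSD

    X = np.sqrt(lam_X / 2) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
    N = np.sqrt(lam_N / 2) * (rng.standard_normal((2, n)) + 1j * rng.standard_normal((2, n)))
    Y = H[:, None] * X + N                                           # microphone spectra, cf. (2)

    G = H.conj() / np.sum(np.abs(H) ** 2)                            # MRC weights, cf. (12)
    speech = np.sum(G * H) * X                                       # combined speech component
    noise = np.sum(G[:, None] * N, axis=0)                           # combined noise component

    snr_out = np.mean(np.abs(speech) ** 2) / np.mean(np.abs(noise) ** 2)
    snr_in = lam_X * np.abs(H) ** 2 / lam_N
    print(snr_out, snr_in.sum())  # both approx. 10.6: sum of the input SNRs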
With the weights G_i(f) = cH_i^*(f), we obtain the maximum signal-to-noise ratio of the combined signal as the sum of the signal-to-noise ratios of the M received signals

\[
\gamma(f) = \sum_{i=1}^{M}\gamma_i(f), \tag{9}
\]

where

\[
\gamma_i(f) = \frac{\lambda_X(f)\left|H_i(f)\right|^2}{\lambda_N(f)} \tag{10}
\]

is the input signal-to-noise ratio of the ith microphone. It is appropriate to choose c as

\[
c_{\mathrm{MRC}}(f) = \frac{1}{\sum_{j=1}^{M}\left|H_j(f)\right|^2}. \tag{11}
\]

This leads to the MRC weights

\[
G_{\mathrm{MRC}}^{(i)}(f) = c_{\mathrm{MRC}}(f)\,H_i^{*}(f) = \frac{H_i^{*}(f)}{\sum_{j=1}^{M}\left|H_j(f)\right|^2}, \tag{12}
\]

and the estimated (equalized) speech spectrum

\[
\begin{aligned}
\hat{X} &= G_{\mathrm{MRC}}^{(1)}Y_1 + G_{\mathrm{MRC}}^{(2)}Y_2 + G_{\mathrm{MRC}}^{(3)}Y_3 + \cdots \\
&= \frac{H_1^{*}}{\sum_{i=1}^{M}|H_i|^2}\,Y_1 + \frac{H_2^{*}}{\sum_{i=1}^{M}|H_i|^2}\,Y_2 + \cdots \\
&= \frac{H_1^{*}(H_1X+N_1)}{\sum_{i=1}^{M}|H_i|^2} + \frac{H_2^{*}(H_2X+N_2)}{\sum_{i=1}^{M}|H_i|^2} + \cdots \\
&= X + \frac{H_1^{*}}{\sum_{i=1}^{M}|H_i|^2}\,N_1 + \frac{H_2^{*}}{\sum_{i=1}^{M}|H_i|^2}\,N_2 + \cdots \\
&= X + G_{\mathrm{MRC}}^{(1)}N_1 + G_{\mathrm{MRC}}^{(2)}N_2 + \cdots,
\end{aligned} \tag{13}
\]

where we have omitted the dependency on f. The estimated speech spectrum X̂(f) is therefore equal to the actual speech spectrum X(f) plus some weighted noise term.

The filter defined in (12) was previously applied to speech dereverberation by Gannot and Moonen in [14], because it ideally equalizes the microphone signals if a sufficiently accurate estimate of the acoustic transfer functions is available. The problem at hand with maximum-ratio-combining is that it is rather difficult and computationally complex to explicitly estimate the acoustic transfer characteristic H_i(f) for our microphone system.

In the next section, we show that MRC combining can be achieved without explicit knowledge of the acoustic channels. The weights for the different microphones can be calculated based on an estimate of the signal-to-noise ratio for each microphone. The proposed filter achieves a signal-to-noise ratio according to (9), but does not guarantee perfect equalization.

3.2. Diversity Combining for Speech Signals. We consider the weights

\[
G_{\mathrm{SC}}^{(i)} = \sqrt{\frac{\gamma_i(f)}{\sum_{j=1}^{M}\gamma_j(f)}}. \tag{14}
\]

Assuming the noise power is the same for all microphones and substituting γ_i(f) by (10) leads to

\[
G_{\mathrm{SC}}^{(i)}(f) = \sqrt{\frac{\left|H_i(f)\right|^2}{\sum_{j=1}^{M}\left|H_j(f)\right|^2}} = \frac{\left|H_i(f)\right|}{\sqrt{\sum_{j=1}^{M}\left|H_j(f)\right|^2}}. \tag{15}
\]

Hence, we have

\[
G_{\mathrm{SC}}^{(i)}(f) = c_{\mathrm{SC}}(f)\left|H_i(f)\right| \tag{16}
\]

with

\[
c_{\mathrm{SC}}(f) = \frac{1}{\sqrt{\sum_{j=1}^{M}\left|H_j(f)\right|^2}}. \tag{17}
\]

We observe that the weight G_SC^(i)(f) is proportional to the magnitude of the MRC weight H_i^*(f), because the factor c_SC is the same for all M microphone signals. Consequently, coherent addition of the sensor signals weighted with the gain factors G_SC^(i)(f) still leads to a combining where the signal-to-noise ratio at the combiner output is the sum of the input SNR values. However, coherent addition requires an additional phase estimate. Let φ_i(f) denote the phase of H_i(f) at frequency f. Assuming cophasal addition the estimated speech spectrum is

\[
\begin{aligned}
\hat{X} &= G_{\mathrm{SC}}^{(1)}e^{-j\varphi_1}Y_1 + G_{\mathrm{SC}}^{(2)}e^{-j\varphi_2}Y_2 + G_{\mathrm{SC}}^{(3)}e^{-j\varphi_3}Y_3 + \cdots \\
&= \frac{1}{c_{\mathrm{SC}}}\,X + G_{\mathrm{SC}}^{(1)}e^{-j\varphi_1}N_1 + G_{\mathrm{SC}}^{(2)}e^{-j\varphi_2}N_2 + \cdots.
\end{aligned} \tag{18}
\]

Hence, in the case of stationary signals the term

\[
\frac{1}{c_{\mathrm{SC}}(f)} = \sqrt{\sum_{j=1}^{M}\left|H_j(f)\right|^2} \tag{19}
\]

can be interpreted as the resulting transfer characteristic of the system. An example is depicted in Figure 3. The upper figure presents the measured transfer characteristics for two microphones in a car environment. Note that the microphones have a high-pass characteristic and attenuate signal components for frequencies below 1 kHz. The lower figure is the curve 1/c_SC(f). The spectral combiner equalizes most of the deep dips in the transfer functions from the mouth of the speaker to the microphones while the envelope of the transfer functions is not equalized.

Figure 3: Transfer characteristics to the microphones and of the combined signal.
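The first-stage weights follow directly from the per-microphone SNR estimates. The following small Python helper (our illustration; the guard against division by zero is an implementation detail not discussed in the paper) evaluates (14) for all microphones and frequency bins at once:

    import numpy as np

    def spectral_combining_weights(snr):
        """Diversity weights from per-microphone SNR estimates, cf. (14).

        snr: array of shape (M, F) with gamma_i(f) >= 0.
        """
        snr = np.asarray(snr, dtype=float)
        total = np.sum(snr, axis=0)                     # gamma(f) = sum_j gamma_j(f)
        return np.sqrt(snr / np.maximum(total, 1e-12))  # G_SC^(i)(f)

    # Example: one bin with input SNRs 10 and 2 (linear scale).
    print(spectral_combining_weights(np.array([[10.0], [2.0]])))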
3.3. Magnitude Combining. One challenge in multimicrophone systems with spatially separated microphones is a reliable phase estimation of the different input signals. For a coherent combining of the speech signals, we have to compensate the phase difference between the speech signals at each microphone. Therefore, it is sufficient to estimate the phase differences to a reference microphone, for example, to the first microphone: Δ_i(f) = φ_1(f) − φ_i(f) for all i = 2, ..., M. Cophasal addition is then achieved by

\[
\hat{X} = G_{\mathrm{SC}}^{(1)}Y_1 + G_{\mathrm{SC}}^{(2)}e^{j\Delta_2}Y_2 + G_{\mathrm{SC}}^{(3)}e^{j\Delta_3}Y_3 + \cdots. \tag{20}
\]

But a reliable estimation of the phase differences is only possible in speech active periods and furthermore only for those frequencies where speech is present. Estimating the phase differences as

\[
e^{j\Delta_i(f)} = E\left\{\frac{Y_1(f)\,Y_i^{*}(f)}{\left|Y_1(f)\right|\left|Y_i(f)\right|}\right\} \tag{21}
\]

leads to unreliable phase values for time-frequency points without speech. In particular, if H_i(f) = 0 for some frequency f, the estimated phase Δ_i(f) is undefined. A combining using this estimate leads to additional signal distortions. Additionally, noise correlation would distort the phase estimation. A coarse estimate of the phase difference can also be obtained from the time-shift τ_i between the speech components in the microphone signals, for example, using the generalized correlation method [20]. The estimate is then Δ_i(f) ≈ 2πf τ_i. Note that a combiner using these phase values would in a certain manner be equivalent to a delay-and-sum beamformer. However, for distributed microphone arrays in reverberant environments this phase compensation leads to a poor estimate of the actual phase differences.

Because of the drawbacks which come along with the phase estimation methods described above, we propose another scheme with a two stage combining approach. In the first stage, we use the spectral combining approach as described in Section 3.2 with a simple magnitude combining of the microphone signals. For the magnitude combining the noisy phase of the first microphone signal is adopted for the other microphone signals. This is also obvious in Figure 5, where the phase of the noisy spectrum e^{jφ̂₁(f)} is taken for the spectrum at the output of the filter G_SC^(2)(f) before the signals are combined. This leads to the following incoherent combining of the input signals:

\[
\begin{aligned}
\hat{X}(f) &= G_{\mathrm{SC}}^{(1)}(f)Y_1(f) + G_{\mathrm{SC}}^{(2)}(f)\left|Y_2(f)\right|e^{j\hat{\varphi}_1(f)} + \cdots + G_{\mathrm{SC}}^{(M)}(f)\left|Y_M(f)\right|e^{j\hat{\varphi}_1(f)}.
\end{aligned} \tag{22}
\]

The estimated speech spectrum X̂(f) is then equal to

\[
\frac{X(f)\,e^{j\hat{\varphi}_1(f)}}{\tilde{c}_{\mathrm{SC}}(f)} \tag{23}
\]

plus some weighted noise terms, where 1/c̃_SC(f) denotes the effective transfer characteristic of the magnitude combiner. It follows from the triangle inequality that

\[
\frac{1}{\tilde{c}_{\mathrm{SC}}(f)} \le \frac{1}{c_{\mathrm{SC}}(f)} = \sqrt{\sum_{j=1}^{M}\left|H_j(f)\right|^2}. \tag{24}
\]

Magnitude combining does not therefore guarantee maximum-ratio-combining. Yet the signal X̂(f) is taken as a reference signal in the second stage where the phase compensation is done. This coherence based signal combining scheme is described in the following section.

4. Coherence-Based Combining

As an example of a coherence based diversity system we first consider the two microphone approach by Martin and Vary [5, 6] as depicted in Figure 4. Martin and Vary applied the dereverberation principle of Allen et al. [13] to noise reduction. In particular, they proposed an LMS-based time domain algorithm to combine the different microphone signals. This approach provides effective noise suppression for frequencies where the noise components of the microphone signals are uncorrelated.

Figure 4: Basic system structure of the LMS approach.

However, as we have seen in Section 2, for practical microphone distances in the range of 0.4 to 0.8 m the noise signals are correlated for low frequencies. These correlations reduce the noise suppression capabilities of the algorithm and lead to musical noise. We will show in this section that a combination of the spectral combining with the coherence based approach by Martin and Vary alleviates these issues.

4.1. Analysis of the LMS Approach. We present now an analysis of the scheme by Martin and Vary as depicted in Figure 4. The filter g_i(k) is adapted using the LMS algorithm. For stationary signals x(k), n_1(k), and n_2(k), the adaptation converges to filter coefficients g_i(k) and a corresponding filter transfer function

\[
G_{\mathrm{LMS}}^{(i)}(f) = \frac{E\left\{Y_i^{*}(f)\,Y_j(f)\right\}}{E\left\{\left|Y_i(f)\right|^2\right\}}, \quad i \ne j, \tag{25}
\]

that minimizes the expected value

\[
E\left\{\left|Y_i(f)\,G_{\mathrm{LMS}}^{(i)}(f) - Y_j(f)\right|^2\right\}, \tag{26}
\]

where E{Y_i^*(f)Y_j(f)} is the cross-power spectrum of the two microphone signals and E{|Y_i(f)|²} is the power spectrum of the ith microphone signal.
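In an implementation, the expectation in (26) is not available and the filter is adapted frame by frame. A minimal per-bin normalized LMS update in Python might look as follows (our sketch, a simplified stand-in for the FLMS adaptation; step size and regularization constant are illustrative choices):

    import numpy as np

    def flms_update(G, Y, ref, mu=0.5, eps=1e-8):
        """One bin-wise NLMS-style update of the filter G (complex array,
        one entry per frequency bin), driving Y * G towards the reference
        spectrum ref, in the spirit of minimizing (26).
        """
        err = ref - G * Y                                    # a priori error per bin
        G = G + mu * np.conj(Y) * err / (np.abs(Y) ** 2 + eps)
        return G

In the combined scheme of Section 4.2 below, ref is the first-stage combined spectrum, and the update is executed once per signal frame for each microphone.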
Assuming that the speech signal and the noise signals are uncorrelated, (25) can be written as

\[
G_{\mathrm{LMS}}^{(i)}(f) = \frac{E\left\{\left|X(f)\right|^2\right\}H_i^{*}(f)H_j(f) + E\left\{N_i^{*}(f)N_j(f)\right\}}{E\left\{\left|X(f)\right|^2\right\}\left|H_i(f)\right|^2 + E\left\{\left|N_i(f)\right|^2\right\}}. \tag{27}
\]

For frequencies where the noise components are uncorrelated, that is, E{N_i^*(f)N_j(f)} = 0, this formula is reduced to

\[
G_{\mathrm{LMS}}^{(i)}(f) = \frac{E\left\{\left|X(f)\right|^2\right\}H_i^{*}(f)H_j(f)}{E\left\{\left|X(f)\right|^2\right\}\left|H_i(f)\right|^2 + E\left\{\left|N_i(f)\right|^2\right\}}. \tag{28}
\]

The filter G_LMS^(i)(f) according to (28) results in fact in a minimum mean squared error (MMSE) estimate of the signal X(f)H_j(f) based on the signal Y_i(f). Hence, the weighted output is a combination of the MMSE estimates of the speech components of the two input signals. This explains the good noise reduction properties of the approach by Martin and Vary.

On the other hand, the coherence of the noise depends strongly on the distance between the microphones. For in-car applications, practical distances are in the range of 0.4 to 0.8 m. Therefore, only the noise components for frequencies above 1 kHz can be considered to be uncorrelated [6]. According to formula (27), the noise correlation leads to a bias

\[
\frac{E\left\{N_i^{*}(f)N_j(f)\right\}}{E\left\{\left|Y_i(f)\right|^2\right\}} \tag{29}
\]

of the filter transfer function. An approach to correct the filter bias by estimating the noise cross-power density was presented in [21]. Another issue with speech enhancement solely based on the LMS approach is that the speech signals at the microphone inputs may only be weakly correlated for some frequencies as shown in Section 2. Consequently, these frequency components will be attenuated in the output signals.

In the following, we discuss a modified LMS approach, where we first combine the microphone signals to obtain an improved reference signal for the adaptation of the LMS filters.

4.2. Combining MRC and LMS. To ensure suitable weighting and coherent signal addition we combine the diversity technique with the LMS approach to process the signals of the different microphones. It is informative to examine the combined approach under ideal conditions, that is, we assume ideal MRC weighting.

Analog to (13), weighting with the MRC gain factors according to (12) results in the estimate

\[
\hat{X}(f) = X(f) + G_{\mathrm{MRC}}^{(1)}(f)N_1(f) + G_{\mathrm{MRC}}^{(2)}(f)N_2(f) + \cdots. \tag{30}
\]

We now use the estimate X̂(f) as the reference signal for the LMS algorithm. That is, we adapt a filter for each input signal such that the expected value

\[
E\left\{\left|Y_i(f)\,G_{\mathrm{LMS}}^{(i)}(f) - \hat{X}(f)\right|^2\right\} \tag{31}
\]

is minimized. The adaptation results in the filter transfer functions

\[
G_{\mathrm{LMS}}^{(i)}(f) = \frac{E\left\{Y_i^{*}(f)\,\hat{X}(f)\right\}}{E\left\{\left|Y_i(f)\right|^2\right\}}. \tag{32}
\]

Assuming that the speech signal and the noise signals are uncorrelated and substituting X̂(f) according to (30) leads to

\[
G_{\mathrm{LMS}}^{(i)}(f) = \frac{E\left\{Y_i^{*}(f)\,X(f)\right\}}{E\left\{\left|Y_i(f)\right|^2\right\}} \tag{33}
\]
\[
\qquad + \; G_{\mathrm{MRC}}^{(i)}(f)\,\frac{E\left\{\left|N_i(f)\right|^2\right\}}{E\left\{\left|Y_i(f)\right|^2\right\}} \tag{34}
\]
\[
\qquad + \; G_{\mathrm{MRC}}^{(j)}(f)\,\frac{E\left\{N_i^{*}(f)N_j(f)\right\}}{E\left\{\left|Y_i(f)\right|^2\right\}} + \cdots. \tag{35}
\]

The first term

\[
\frac{E\left\{Y_i^{*}(f)\,X(f)\right\}}{E\left\{\left|Y_i(f)\right|^2\right\}} = \frac{H_i^{*}(f)\,E\left\{\left|X(f)\right|^2\right\}}{\left|H_i(f)\right|^2 E\left\{\left|X(f)\right|^2\right\} + E\left\{\left|N_i(f)\right|^2\right\}} \tag{36}
\]
in this sum is the Wiener filter that results in a minimum mean squared error estimate of the signal X(f) based on the signal Y_i(f). The Wiener filter equalizes the microphone signal and minimizes the mean squared error between the filter output and the actual speech signal X(f). Note that the phase of the term in (36) is −φ_i, that is, the filter compensates the phase of the acoustic transfer function H_i(f).

The other terms in the sum can be considered as filter biases, where the term in (34) depends on the noise power density of the ith input. The remaining terms depend on the noise cross power and vanish for uncorrelated noise signals. However, noise correlation might distort the phase estimation.

Similarly, when we consider the actual reference signal X̂(f) according to (22), the filter equation for G_LMS^(i)(f) contains the term

\[
\frac{H_i^{*}(f)\,E\left\{\left|X(f)\right|^2\right\}e^{j\hat{\varphi}_1(f)}}{\tilde{c}_{\mathrm{SC}}(f)\left(\left|H_i(f)\right|^2 E\left\{\left|X(f)\right|^2\right\} + E\left\{\left|N_i(f)\right|^2\right\}\right)} \tag{37}
\]

with the sought phase Δ_i(f) = φ_1(f) − φ_i(f). If the correlation of the noise terms is sufficiently small we obtain the estimated phase

\[
\hat{\Delta}_i(f) = \arg\left\{G_{\mathrm{LMS}}^{(i)}(f)\right\}. \tag{38}
\]

The LMS algorithm implicitly estimates the phase differences between the reference signal X̂(f) and the input signals Y_i(f). Hence, the spectra at the outputs of the filters G_LMS^(i)(f) are in phase. This enables a cophasal addition of the signals according to (20).

By estimating the noise power and noise cross-power densities we could correct the biases of the LMS filter transfer functions. Similarly, reducing the noisy signal components in (30) diminishes the filter biases. In the following, we will pursue the latter approach.

4.3. Noise Suppression. Maximum-ratio-combining provides an optimum weighting of the M sensor signals. However, it does not necessarily suppress the noisy signal components. We therefore combine the spectral combining with an additional noise suppression filter. Of the numerous noise reduction techniques proposed in the literature, we consider only spectral subtraction [4], which supplements the spectral combining quite naturally. The basic idea of spectral subtraction is to subtract an estimate of the noise floor from an estimate of the spectrum of the noisy signal.

Estimating the overall SNR according to (9), the spectral subtraction filter (see, e.g., [1, page 239]) for the combined signal X̂(f) can be written as

\[
G_{\mathrm{NS}}(f) = \sqrt{\frac{\gamma(f)}{1+\gamma(f)}}. \tag{39}
\]

Multiplying this filter transfer function with (14) leads to the term

\[
\sqrt{\frac{\gamma_i(f)}{\gamma(f)}}\,\sqrt{\frac{\gamma(f)}{1+\gamma(f)}} = \sqrt{\frac{\gamma_i(f)}{1+\gamma(f)}}. \tag{40}
\]

This formula shows that noise suppression can be introduced by simply adding a constant to the denominator term in (14). Most, if not all, implementations of spectral subtraction are based on an over-subtraction approach, where an overestimate of the noise power is subtracted from the power spectrum of the input signal (see, e.g., [22–25]). Over-subtraction can be included in (40) by using a constant ρ larger than one. This leads to the final gain factor

\[
G_{\mathrm{SC}}^{(i)}(f) = \sqrt{\frac{\gamma_i(f)}{\rho+\gamma(f)}}. \tag{41}
\]

The parameter ρ hardly affects the gain factors for high signal-to-noise ratios, retaining optimum weighting. For low signal-to-noise ratios this term leads to an additional attenuation. The over-subtraction factor is usually a function of the SNR; sometimes it is also chosen differently for different frequency bands [25].

5. Implementation Issues

Real world speech and noise signals are non-stationary processes. For an implementation of the spectral weighting, we have to consider short-time spectra of the microphone signals and estimate the short-time power spectral densities (PSD) of the speech signal and the noise components. Therefore, the noisy signal y_i(k) is transformed into the frequency domain using a short-time Fourier transform of length L. Each block of L consecutive samples is multiplied with a Hamming window. Subsequent blocks are overlapping by K samples. Let Y_i(κ, ν), X_i(κ, ν), and N_i(κ, ν) denote the corresponding short-time spectra, where κ is the subsampled time index and ν is the frequency bin index.

5.1. System Structure. The processing system for two inputs is depicted in Figure 5. The spectrum X̂(κ, ν) results from incoherent magnitude combining of the input signals

\[
\hat{X}(\kappa, \nu) = G_{\mathrm{SC}}^{(1)}(\kappa, \nu)Y_1(\kappa, \nu) + G_{\mathrm{SC}}^{(2)}(\kappa, \nu)\left|Y_2(\kappa, \nu)\right|e^{j\hat{\varphi}_1(\kappa, \nu)} + \cdots, \tag{42}
\]

where

\[
G_{\mathrm{SC}}^{(i)}(\kappa, \nu) = \sqrt{\frac{\gamma_i(\kappa, \nu)}{\rho+\gamma(\kappa, \nu)}}. \tag{43}
\]

Figure 5: Basic system structure of the diversity system with two inputs.

The power spectral density of speech signals is relatively fast time varying. Therefore, the FLMS algorithm requires a quick update, that is, a large step size. If the step size is sufficiently large, the magnitudes of the FLMS filters G_LMS^(i)(κ, ν) follow the filters G_SC^(i)(κ, ν). Because the spectra at the outputs of the filters G_LMS^(i)(f) are in phase, we obtain the estimated speech spectrum as

\[
\hat{X}(\kappa, \nu) = G_{\mathrm{LMS}}^{(1)}(\kappa, \nu)Y_1(\kappa, \nu) + G_{\mathrm{LMS}}^{(2)}(\kappa, \nu)Y_2(\kappa, \nu) + \cdots. \tag{44}
\]
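Putting (42) and (43) together, the first combining stage reduces to a few array operations per frame. The following sketch (our illustration; it assumes the SNR estimates of Section 5.2 are already available) computes the magnitude-combined spectrum with the noisy phase of microphone 1:

    import numpy as np

    def combine_frame(Y, snr, rho=10.0):
        """First-stage magnitude combining of one STFT frame, cf. (42)/(43).

        Y:   complex spectra, shape (M, F); microphone 1 provides the phase.
        snr: per-microphone SNR estimates gamma_i(kappa, nu), shape (M, F).
        """
        gain = np.sqrt(snr / (rho + np.sum(snr, axis=0)))   # (43), over-subtraction rho
        phase = np.exp(1j * np.angle(Y[0]))                 # noisy phase of microphone 1
        return np.sum(gain * np.abs(Y), axis=0) * phase     # (42)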
To perform spectral combining we have to estimate the current signal-to-noise ratio based on the noisy microphone input signals. In the next sections, we propose a simple and efficient method to estimate the noise power spectral densities of the microphone inputs.

5.2. PSD Estimation. Commonly the noise PSD is estimated in speech pauses where the pauses are detected using voice activity detection (VAD, see, e.g., [24, 26]). VAD-based methods provide good estimates for stationary noise. However, they may suffer from error propagation if subsequent decisions are not independent. Other methods, like the minimum statistics approach introduced by Martin [23, 27], use a continuous estimation that does not explicitly differentiate between speech pauses and speech active segments.

Our estimation method combines the VAD approach with the minimum statistics (MS) method. Minimum statistics is a robust technique to estimate the power spectral density of non-stationary noise by tracing the minimum of the recursively smoothed power spectral density within a time window of 1 to 2 seconds. We use these MS estimates and a simple threshold test to determine voice activity for each time-frequency point.

The proposed method prevents error propagation, because the MS approach is independent of the VAD. During speech pauses the noise PSD estimation can be enhanced compared with an estimate solely based on minimum statistics. A similar time-frequency dependent VAD was presented by Cohen to enhance the noise power spectral density estimation of minimum statistics [28].

For time-frequency points (κ, ν) where the speech signal is inactive, the noise PSD E{|N_i(κ, ν)|²} can be approximated by recursive smoothing

\[
E\left\{\left|N_i(\kappa, \nu)\right|^2\right\} \approx \lambda_{Y,i}(\kappa, \nu) \tag{45}
\]

with

\[
\lambda_{Y,i}(\kappa, \nu) = (1-\alpha)\,\lambda_{Y,i}(\kappa-1, \nu) + \alpha\left|Y_i(\kappa, \nu)\right|^2, \tag{46}
\]

where α ∈ (0, 1) is the smoothing parameter.

During speech active periods the PSD can be estimated using the minimum statistics method introduced by Martin [23, 27]. With this approach, the noise PSD estimate is determined by the minimum value

\[
\lambda_{\min,i}(\kappa, \nu) = \min_{l\in[\kappa-W+1,\,\kappa]}\left\{\lambda_{Y,i}(l, \nu)\right\} \tag{47}
\]

within a sliding window of W consecutive values of λ_{Y,i}(κ, ν). The noise PSD is then estimated by

\[
E\left\{\left|N_i(\kappa, \nu)\right|^2\right\} \approx o_{\min}\cdot\lambda_{\min,i}(\kappa, \nu), \tag{48}
\]

where o_min is a parameter of the algorithm and should be approximated as

\[
o_{\min} = \frac{1}{E\left\{\lambda_{\min}\right\}}. \tag{49}
\]

The MS approach provides a rough estimate of the noise power that strongly depends on the smoothing parameter α and the window size of the sliding window (for details cf. [27]). However, this estimate can be obtained regardless of speech being present or not.

The idea of our approach is to approximate the PSD by the MS estimate during speech active periods while the smoothed input power is used for time-frequency points where speech is absent:

\[
E\left\{\left|N_i(\kappa, \nu)\right|^2\right\} \approx \beta(\kappa, \nu)\,o_{\min}\cdot\lambda_{\min,i}(\kappa, \nu) + \left(1-\beta(\kappa, \nu)\right)\lambda_{Y,i}(\kappa, \nu), \tag{50}
\]

where β(κ, ν) ∈ {0, 1} is an indicator function for speech activity which will be discussed in more detail in the next section.

The current signal-to-noise ratio is then obtained by

\[
\gamma_i(\kappa, \nu) = \frac{E\left\{\left|Y_i(\kappa, \nu)\right|^2\right\} - E\left\{\left|N_i(\kappa, \nu)\right|^2\right\}}{E\left\{\left|N_i(\kappa, \nu)\right|^2\right\}}, \tag{51}
\]

assuming that the noise and speech signals are uncorrelated.
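The estimator (45)-(50) can be summarized in a few lines of Python. The sketch below is our illustration only: the smoothing constant, window length, and o_min are placeholder values, and the speech-activity flags β(κ, ν) are assumed to be supplied by the detector of Section 5.3:

    import numpy as np
    from collections import deque

    class NoisePsdEstimator:
        """Per-bin noise PSD tracker combining recursive smoothing (46),
        minimum statistics (47)-(48), and a VAD indicator beta, cf. (50).
        Parameter values are illustrative, not taken from the paper."""

        def __init__(self, n_bins, alpha=0.2, window=96, o_min=1.5):
            self.alpha = alpha
            self.o_min = o_min
            self.lam_y = np.zeros(n_bins)
            self.history = deque(maxlen=window)   # sliding window for the minimum

        def update(self, Y, beta):
            """Y: complex spectrum of one frame; beta: 0/1 activity flags per bin."""
            self.lam_y = (1 - self.alpha) * self.lam_y + self.alpha * np.abs(Y) ** 2
            self.history.append(self.lam_y.copy())
            lam_min = np.min(self.history, axis=0)
            return beta * self.o_min * lam_min + (1 - beta) * self.lam_y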
5.3. Voice Activity Detection. Human speech contains gaps not only in the time but also in the frequency domain. It is therefore reasonable to estimate the voice activity in the time-frequency domain in order to obtain a more accurate VAD. The VAD function β(κ, ν) can then be calculated upon the current input noise PSD obtained by minimum statistics.

Our aim is to determine for each time-frequency point (κ, ν) whether the speech signal is active or inactive. We therefore consider the two hypotheses H_1(κ, ν) and H_0(κ, ν) which indicate speech presence or absence at the time-frequency point (κ, ν), respectively. We assume that the coefficients X(κ, ν) and N_i(κ, ν) of the short-time spectra of both the speech and the noise signal are complex Gaussian random variables. In this case, the current input power, that is, the squared magnitude |Y_i(κ, ν)|², is exponentially distributed with mean (power spectral density)

\[
\lambda_{Y_i}(\kappa, \nu) = E\left\{\left|Y_i(\kappa, \nu)\right|^2\right\}. \tag{52}
\]

Similarly we define

\[
\lambda_{X_i}(\kappa, \nu) = \left|H_i(\kappa, \nu)\right|^2 E\left\{\left|X(\kappa, \nu)\right|^2\right\}, \quad \lambda_{N_i}(\kappa, \nu) = E\left\{\left|N_i(\kappa, \nu)\right|^2\right\}. \tag{53}
\]

We assume that speech and noise are uncorrelated. Hence, we have

\[
\lambda_{Y_i}(\kappa, \nu) = \lambda_{X_i}(\kappa, \nu) + \lambda_{N_i}(\kappa, \nu) \tag{54}
\]

during speech active periods and

\[
\lambda_{Y_i}(\kappa, \nu) = \lambda_{N_i}(\kappa, \nu) \tag{55}
\]

in speech pauses.

In the following, we occasionally omit the dependency on κ and ν in order to keep the notation lucid. The conditional probability density functions of the random variable Y_i = |Y_i(κ, ν)|² are [29]

\[
f\left(Y_i \mid H_0\right) = \begin{cases}\dfrac{1}{\lambda_{N_i}}\exp\!\left(-\dfrac{Y_i}{\lambda_{N_i}}\right), & Y_i \ge 0,\\[2mm] 0, & Y_i < 0,\end{cases} \tag{56}
\]

\[
f\left(Y_i \mid H_1\right) = \begin{cases}\dfrac{1}{\lambda_{X_i}+\lambda_{N_i}}\exp\!\left(-\dfrac{Y_i}{\lambda_{X_i}+\lambda_{N_i}}\right), & Y_i \ge 0,\\[2mm] 0, & Y_i < 0.\end{cases} \tag{57}
\]

Applying Bayes' rule for the conditional speech presence probability

\[
p_i(\kappa, \nu) = P\left(H_1 \mid Y_i\right) \tag{58}
\]

we have [29]

\[
p_i(\kappa, \nu) = \left(1 + \frac{\lambda_{X_i}+\lambda_{N_i}}{\lambda_{N_i}}\,\frac{q}{1-q}\exp(-u_i)\right)^{-1}, \tag{59}
\]

where q(κ, ν) = P(H_0(κ, ν)) is the a priori probability of speech absence and

\[
u_i(\kappa, \nu) = \frac{Y_i}{\lambda_{N_i}}\,\frac{\lambda_{X_i}}{\lambda_{X_i}+\lambda_{N_i}} = \frac{\left|Y_i(\kappa, \nu)\right|^2\lambda_{X_i}}{\lambda_{N_i}\left(\lambda_{X_i}+\lambda_{N_i}\right)}. \tag{60}
\]

The decision rule for the ith channel is based on the conditional speech presence probability:

\[
\beta_i(\kappa, \nu) = \begin{cases}1, & \dfrac{P\left(H_1 \mid Y_i\right)}{P\left(H_0 \mid Y_i\right)} \ge T,\\[2mm] 0, & \text{otherwise.}\end{cases} \tag{61}
\]

The parameter T > 0 enables a tradeoff between the two possible error probabilities of voice activity detection. A value T > 1 decreases the probability of a false alarm, that is, β(κ, ν) = 1 when speech is absent. T < 1 reduces the probability of a miss, that is, β(κ, ν) = 0 in the presence of speech. Note that the generalized likelihood-ratio test

\[
\frac{P\left(H_1 \mid Y_i\right)}{P\left(H_0 \mid Y_i\right)} = \frac{p_i(\kappa, \nu)}{1-p_i(\kappa, \nu)} \ge T \tag{62}
\]

is according to the Neyman-Pearson lemma (see, e.g., [30]) an optimal decision rule. That is, for a fixed probability of a false alarm it minimizes the probability of a miss and vice versa. The generalized likelihood-ratio test was previously used by Sohn and Sung to detect speech activity in subbands [29, 31].

The test in inequality (62) is equivalent to

\[
p_i(\kappa, \nu)^{-1} = 1 + \frac{\lambda_{X,i}+\lambda_{N,i}}{\lambda_{N,i}}\,\frac{q}{1-q}\exp(-u_i) \le \frac{1+T}{T}, \tag{63}
\]

where we have used (59). Solving for |Y_i(κ, ν)|² using (60), we obtain a simple threshold test for the ith microphone

\[
\beta_i(\kappa, \nu) = \begin{cases}1, & \left|Y_i(\kappa, \nu)\right|^2 \ge \lambda_{N,i}(\kappa, \nu)\,\Theta_i(\kappa, \nu),\\ 0, & \text{otherwise,}\end{cases} \tag{64}
\]

with the threshold

\[
\Theta_i(\kappa, \nu) = \left(1 + \frac{\lambda_{N,i}}{\lambda_{X,i}}\right)\log\!\left(\frac{T\,q\left(1 + \lambda_{X,i}/\lambda_{N,i}\right)}{1-q}\right). \tag{65}
\]

This threshold test is equivalent to the decision rule in (61). With this threshold test, speech is detected if the current input power |Y_i(κ, ν)|² is greater than or equal to the average noise power λ_{N,i}(κ, ν) times the threshold Θ_i(κ, ν). This factor depends on the input signal-to-noise ratio λ_{X,i}/λ_{N,i} and the a priori probability of speech absence q(κ, ν).

In order to combine the activity estimates for the different input signals, we use the following rule:

\[
\beta(\kappa, \nu) = \begin{cases}1, & \text{if } \left|Y_i(\kappa, \nu)\right|^2 \ge \lambda_{N,i}\,\Theta_i \text{ for any } i,\\ 0, & \text{otherwise.}\end{cases} \tag{66}
\]
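The complete per-frame decision (64)-(66) then amounts to one comparison per time-frequency point. The following sketch (our illustration; the flooring of the a priori SNR is a safeguard not specified in the paper) returns the combined activity flags β(κ, ν):

    import numpy as np

    def vad_flags(Y, lam_n, lam_x, T=1.2, q=0.5):
        """Per-bin speech activity via the threshold test (64)-(66).

        Y: complex spectra, shape (M, F); lam_n, lam_x: noise/speech PSD estimates.
        """
        xi = np.maximum(lam_x / np.maximum(lam_n, 1e-12), 1e-3)    # a priori SNR, floored
        theta = (1 + 1 / xi) * np.log(T * q * (1 + xi) / (1 - q))  # threshold (65)
        beta_i = np.abs(Y) ** 2 >= lam_n * theta                   # per-microphone test (64)
        return np.any(beta_i, axis=0).astype(float)                # combination rule (66)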
6. Simulation Results

In this section, we present some simulation results for different noise conditions typical in a car. For our simulations we consider the same microphone setup as described in Section 2, that is, we use a two-channel diversity system, because this is probably the most interesting case for in-car applications.

With respect to three different background noise situations, we recorded driving noise at 100 km/h and 140 km/h. As third noise situation, we considered the noise which arises from an electric fan (defroster). With an artificial head we recorded speech samples for two different seat positions. From both positions, we recorded two male and two female speech samples, each of a length of 8 seconds. For this purpose, we took the German-speaking speech samples from recommendation P.501 of the International Telecommunication Union (ITU) [32]. Hence the evaluation was done using four different voices with two different speaker sizes, which leads to 8 different speaker configurations. For all recordings, we used a sampling rate of 11025 Hz. Table 1 contains the average SNR values for the considered noise conditions. The first values in each field are with respect to a short speaker while the second ones are according to a tall person. For all algorithms, we used an FFT length of L = 512 and an overlap of 256 samples. For time windowing we apply a Hamming window.

Table 1: Average input SNR values [dB] from mic. 1/mic. 2 for typical background noise conditions in a car.

SNR IN           100 km/h    140 km/h     defrost
short speaker    1.2/3.1     −0.7/−0.5    1.7/1.3
tall speaker     1.9/10.8    −0.1/7.2     2.4/9.0

6.1. Estimating the Noise PSD. The spectrogram of one input signal and the result of the voice activity detection are shown in Figure 6 for the worst case scenario (short speaker at a car speed of 140 km/h). It can be observed that time-frequency points with speech activity are reliably detected. Because the noise PSD is estimated with minimum statistics also during speech activity, the false alarms in speech pauses hardly affect the noise PSD estimation.

Figure 6: Spectrogram of the microphone input (mic. 1 at car speed of 140 km/h, short speaker). The lower figure depicts the results of the voice activity detection (black representing estimated speech activity) with T = 1.2 and q = 0.5.

In Figure 7, we compare the estimated noise PSD with the actual PSD for the same scenario. The PSD is well approximated with only minor deviations for high frequencies.

Figure 7: Estimated and actual noise PSD for mic. 2 at car speed of 140 km/h.

To evaluate the noise PSD estimation for several driving situations we calculated as an objective performance measure the log spectral distance (LSD)

\[
D_{\mathrm{LS}} = \sqrt{\frac{1}{L}\sum_{\nu}\left(10\log_{10}\frac{\lambda_N(\nu)}{\hat{\lambda}_N(\nu)}\right)^2} \tag{67}
\]

between the actual noise power spectrum λ_N(ν) and the estimate λ̂_N(ν). From the definition, it is obvious that the LSD can be interpreted as the mean distance between two PSDs in dB. An extended analysis of different distance measures is presented in [33].

The log spectral distances of the proposed noise PSD estimator are shown in Table 2. The first number in each field is the LSD achieved with the minimum statistics approach while the second number is the value for the proposed scheme. Note that every noise situation was evaluated with four different voices (two male and two female). From these results, we observe that the voice activity detection improves the PSD estimation for all considered driving situations.

Table 2: Log spectral distances with minimum statistics noise PSD estimation and with the proposed noise PSD estimator.

D_LS [dB]    100 km/h     140 km/h     defrost
mic. 1       3.93/3.33    2.47/2.07    3.07/1.27
mic. 2       4.6/4.5      3.03/2.33    3.4/1.5
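For reference, (67) translates directly into code; the sketch below is our illustration and assumes both PSDs are strictly positive arrays of equal length:

    import numpy as np

    def log_spectral_distance(lam_true, lam_est):
        """Log spectral distance in dB between two noise PSDs, cf. (67)."""
        d = 10.0 * np.log10(lam_true / lam_est)
        return np.sqrt(np.mean(d ** 2))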
30 30
SNR (dB)

SNR (dB)
20 20
10 10
0 0
−10 −10
0 500 1000 1500 2000 2500 3000 3500 4000 0 500 1000 1500 2000 2500 3000 3500 4000
Frequency (Hz) Frequency (Hz)

Out MRC Out MRC-FLMS


Ideal MRC Ideal MRC
Figure 8: Output SNR values for spectral combining without Figure 10: Output SNR values for the combined approach with
additional noise suppression (car speed of 100 km/h, ρ = 0). additional noise suppression (car speed of 100 km/h, ρ = 10).

30 Table 3: Output SNR values [dB] for different combining tech-


niques—short/tall speaker.
SNR (dB)

20
10 SNR OUT 100 km/h 140 km/h defrost
0 FLMS 8.8/13.3 4.4/9.0 7.8/12.3
−10 SC 16.3/20.9 13.3/18.0 14.9/19.9
0 500 1000 1500 2000 2500 3000 3500 4000
SC + FLMS 13.5/17.8 10.5/15.0 12.5/16.9
Frequency (Hz)
ideal FLMS 12.6/15.2 10.5/13.3 14.5/17.3
Out MRC-FLMS
Ideal MRC
Table 4: Cosh spectral distances for different combining tech-
Figure 9: Output SNR values for the combined approach without niques—short/tall speaker.
additional noise suppression (car speed of 100 km/h, ρ = 0).
DCH 100 km/h 140 km/h defrost
FLMS 0.9/0.9 0.9/1.0 1.2/1.2
is depicted. This curve is simply the sum of the input SNR SC 1.3/1.4 1.4/1.5 1.5/1.7
values for the two microphones which we calculated based SC + FLMS 1.2/1.1 1.2/1.2 1.4/1.5
on the actual noise and speech signals (cf. Figure 1). ideal FLMS 0.9/0.8 1.1/1.0 1.5/1.4
We observe that the output SNR curve closely follows the
ideal curve but with a loss of 1–3 dB. This loss is essentially
caused by the phase differences of the input signals. With the as presented in [21] (see also Section 4.1). The label SC
spectral combining approach only a magnitude combining marks results solely based on spectral combining with
is possible. Furthermore, the power spectral densities are additional noise suppression as discussed in Sections 3 and
estimates based on the noisy microphone signals, this leads 4.3. The results with the combined approach are labeled by
to an additional loss in the SNR. SC + FLMS. Finally, the values marked with the label ideal
FLMS are a benchmark obtained by using the clean and
6.3. Combining SC and FLMS. The output SNR of the unreverberant speech signal x(k) as a reference for the FLMS
combined approach without additional noise suppression is algorithm.
depicted in Figure 9. It is obvious that the theoretical SNR From the results in Table 3, we observe that the spectral
curve for ideal MRC is closely approximated by the output combining leads to a significant improvement of the output
SNR of the combined system. This is the result of the implicit SNR compared to the coherence based noise reduction. It
phase estimation of the FLMS approach which leads to a even outperforms the “ideal” FLMS scheme. However, the
coherent combining of the speech signals. spectral combining introduces undesired speech distortions
Now we consider the combined approach with additional similar to single channel noise reduction. This is also
noise suppression (ρ = 10). Figure 10 presents the corre- indicated by the results in Table 4. This table presents
sponding results for a driving situation with a car speed of distance values for the different combining systems. As an
100 km/h. The output SNR curve still follows the ideal MRC objective measure of speech distortion, we calculated the
curve but now with a gain of up to 5 dB. cosh spectral distance (a symmetrical version of the Itakura-
In Table 3, we compare the output SNR values of the Saito distance) between the power spectra of the clean input
three considered noise conditions for different combining signal (without reverberation and noise) and the output
techniques. The first value is the output SNR for a short speech signal (filter coefficients were obtained from noisy
speaker while the second number represents the result for data).
the tall speaker. The values marked with FLMS correspond to The benefit of the combined system is also indicated by
the coherence based FLMS approach with bias compensation the results in Table 5 which presents Mean Opinion Score
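For reference, the cosh spectral distance used in Table 4 is the symmetrized Itakura-Saito distance between two power spectra [33]. A minimal sketch, with an assumed flooring constant to avoid division by zero:

```python
import numpy as np

def cosh_distance(P, Q, eps=1e-12):
    """Cosh spectral distance between two power spectra P and Q,
    i.e., the symmetrized Itakura-Saito distance (cf. [33])."""
    P = np.maximum(P, eps)
    Q = np.maximum(Q, eps)
    r = P / Q
    d_pq = np.mean(r - np.log(r) - 1.0)        # IS(P, Q)
    d_qp = np.mean(1.0 / r + np.log(r) - 1.0)  # IS(Q, P)
    return 0.5 * (d_pq + d_qp)
```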
The benefit of the combined system is also indicated by the results in Table 5, which presents Mean Opinion Score (MOS) values for the different algorithms. The MOS test was performed by 24 persons. The test set was presented in randomized order to avoid statistical dependencies on the test order. Obviously, the FLMS approach using spectral combining as reference signal and the "ideal" FLMS filter reference approach are rated as the best noise reduction algorithms, where the values of the combined approach are similar to the results with the reference implementation of the "ideal" FLMS filter solution. From this evaluation, it can also be seen that the FLMS approach with spectral combining outperforms the pure FLMS and the pure spectral combining algorithms in all tested acoustic situations.

Table 5: Evaluation of the MOS test.

MOS          100 km/h   140 km/h   defrost   average
FLMS         2.58       2.77       2.10      2.49
SC           3.19       3.15       2.96      3.10
SC + FLMS    3.75       3.73       3.88      3.78
ideal FLMS   3.81       3.67       3.94      3.81

The combined approach sounds more natural compared to the pure spectral combining. The SNR and distance values are close to the "ideal" FLMS scheme. The speech is free of musical tones. The lack of musical noise can also be seen in Figure 11, which shows the spectrograms of the enhanced speech and the input signals.

[Figure 11: Spectrograms (0–4 kHz, 0–7 s) of the two input signals (a), (b) and the output signal (c) with the SC + FLMS approach (car speed of 100 km/h, ρ = 10).]

7. Conclusions

In this paper, we have presented a diversity technique that combines the processed signals of several separate microphones. The aim of our approach was noise robustness for in-car hands-free applications, because single-channel noise suppression methods are sensitive to the microphone location and in particular to the distance between speaker and microphone.

We have shown theoretically that the proposed signal weighting is equivalent to maximum-ratio-combining. Here we have assumed that the noise power spectral densities are equal for all microphone inputs. This assumption might be unrealistic. However, the simulation results for a two-microphone system demonstrate that a performance close to that of MRC can be achieved in real-world noise situations. Moreover, diversity combining is an effective means to reduce signal distortions due to reverberation and therefore improves the speech intelligibility compared to single-channel noise reduction. This improvement can be explained by the fact that spectral combining equalizes frequency dips that occur in only one microphone input (cf. Figure 3).

The spectral combining requires an SNR estimate for each input signal. We have presented a simple noise PSD estimator that reliably approximates the noise power for stationary as well as nonstationary noise.

Acknowledgments

Research for this paper was supported by the German Federal Ministry of Education and Research (Grant no. 17 N11 08). Last but not least, the authors would like to thank the reviewers for their constructive comments and suggestions, which greatly improved the quality of this paper.

References

[1] E. Hänsler and G. Schmidt, Acoustic Echo and Noise Control: A Practical Approach, John Wiley & Sons, New York, NY, USA, 2004.
[2] P. Vary and R. Martin, Digital Speech Transmission: Enhancement, Coding and Error Concealment, John Wiley & Sons, New York, NY, USA, 2006.
[3] E. Hänsler and G. Schmidt, Speech and Audio Processing in Adverse Environments: Signals and Communication Technology, Springer, Berlin, Germany, 2008.
[4] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[5] R. Martin and P. Vary, "A symmetric two microphone speech enhancement system: theoretical limits and application in a car environment," in Proceedings of the Digital Signal Processing Workshop, pp. 451–452, Helsingoer, Denmark, August 1992.
[6] R. Martin and P. Vary, "Combined acoustic echo cancellation, dereverberation and noise reduction: a two microphone approach," Annales des Télécommunications, vol. 49, no. 7-8, pp. 429–438, 1994.
[7] A. A. Azirani, R. L. Bouquin-Jeannès, and G. Faucon, "Enhancement of speech degraded by coherent and incoherent noise using a cross-spectral estimator," IEEE Transactions on Speech and Audio Processing, vol. 5, no. 5, pp. 484–487, 1997.
[8] A. Guérin, R. L. Bouquin-Jeannès, and G. Faucon, "A two-sensor noise reduction system: applications for hands-free car kit," EURASIP Journal on Applied Signal Processing, vol. 2003, no. 11, pp. 1125–1134, 2003.
[9] J. Freudenberger and K. Linhard, "A two-microphone diversity system and its application for hands-free car kits," in Proceedings of European Conference on Speech Communication and Technology (INTERSPEECH '05), pp. 2329–2332, Lisbon, Portugal, September 2005.
[10] T. Gerkmann and R. Martin, "Soft decision combining for dual channel noise reduction," in Proceedings of the 9th International Conference on Spoken Language Processing (INTERSPEECH—ICSLP '06), vol. 5, pp. 2134–2137, Pittsburgh, Pa, USA, September 2006.
[11] J. Freudenberger, S. Stenzel, and B. Venditti, "Spectral combining for microphone diversity systems," in Proceedings of European Signal Processing Conference (EUSIPCO '09), pp. 854–858, Glasgow, UK, July 2009.
[12] J. L. Flanagan and R. C. Lummis, "Signal processing to reduce multipath distortion in small rooms," Journal of the Acoustical Society of America, vol. 47, no. 6, pp. 1475–1481, 1970.
[13] J. B. Allen, D. A. Berkley, and J. Blauert, "Multimicrophone signal-processing technique to remove room reverberation from speech signals," Journal of the Acoustical Society of America, vol. 62, no. 4, pp. 912–915, 1977.
[14] S. Gannot and M. Moonen, "Subspace methods for multimicrophone speech dereverberation," EURASIP Journal on Applied Signal Processing, vol. 2003, no. 11, pp. 1074–1090, 2003.
[15] M. Delcroix, T. Hikichi, and M. Miyoshi, "Dereverberation and denoising using multichannel linear prediction," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 6, pp. 1791–1801, 2007.
[16] I. Ram, E. Habets, Y. Avargel, and I. Cohen, "Multi-microphone speech dereverberation using LIME and least squares filtering," in Proceedings of European Signal Processing Conference (EUSIPCO '08), Lausanne, Switzerland, August 2008.
[17] K. Mukherjee and B.-H. Gwee, "A 32-point FFT based noise reduction algorithm for single channel speech signals," in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS '07), pp. 3928–3931, New Orleans, La, USA, May 2007.
[18] W. Armbrüster, R. Czarnach, and P. Vary, "Adaptive noise cancellation with reference input," in Signal Processing III, pp. 391–394, Elsevier, 1986.
[19] B. Sklar, Digital Communications: Fundamentals and Applications, Prentice Hall, Upper Saddle River, NJ, USA, 2001.
[20] C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 24, no. 4, pp. 320–327, 1976.
[21] J. Freudenberger, S. Stenzel, and B. Venditti, "An FLMS based two-microphone speech enhancement system for in-car applications," in Proceedings of the 15th IEEE Workshop on Statistical Signal Processing (SSP '09), pp. 705–708, 2009.
[22] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '79), pp. 208–211, Washington, DC, USA, April 1979.
[23] R. Martin, "Spectral subtraction based on minimum statistics," in Proceedings of the European Signal Processing Conference (EUSIPCO '94), pp. 1182–1185, Edinburgh, UK, April 1994.
[24] H. Puder, "Single channel noise reduction using time-frequency dependent voice activity detection," in Proceedings of International Workshop on Acoustic Echo and Noise Control (IWAENC '99), pp. 68–71, Pocono Manor, Pa, USA, September 1999.
[25] A. Juneja, O. Deshmukh, and C. Espy-Wilson, "A multi-band spectral subtraction method for enhancing speech corrupted by colored noise," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), vol. 4, pp. 4160–4164, Orlando, Fla, USA, May 2002.
[26] J. Ramírez, J. C. Segura, C. Benítez, A. de la Torre, and A. Rubio, "A new voice activity detector using subband order-statistics filters for robust speech recognition," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '04), vol. 1, pp. I849–I852, 2004.
[27] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5, pp. 504–512, 2001.
[28] I. Cohen, "Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 5, pp. 466–475, 2003.
[29] J. Sohn and W. Sung, "A voice activity detector employing soft decision based noise spectrum adaptation," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), vol. 1, pp. 365–368, 1998.
[30] G. D. Forney Jr., "Exponential error bounds for erasure, list, and decision feedback schemes," IEEE Transactions on Information Theory, vol. 14, no. 2, pp. 206–220, 1968.
[31] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, 1999.
[32] ITU-T, Test Signals for Use in Telephonometry, Recommendation ITU-T P.501, International Telecommunication Union, Geneva, Switzerland, 2007.
[33] A. H. Gray Jr. and J. D. Markel, "Distance measures for speech processing," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 24, no. 5, pp. 380–391, 1976.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 358729, 9 pages
doi:10.1155/2010/358729

Research Article
DOA Estimation with Local-Peak-Weighted CSP

Osamu Ichikawa, Takashi Fukuda, and Masafumi Nishimura


IBM Research-Tokyo, 1623-14, Shimotsuruma, Yamato, Kanagawa 242-8502, Japan

Correspondence should be addressed to Osamu Ichikawa, ichikaw@jp.ibm.com

Received 31 July 2009; Revised 18 December 2009; Accepted 4 January 2010

Academic Editor: Sharon Gannot

Copyright © 2010 Osamu Ichikawa et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

This paper proposes a novel weighting algorithm for Cross-power Spectrum Phase (CSP) analysis to improve the accuracy of
direction of arrival (DOA) estimation for beamforming in a noisy environment. Our sound source is a human speaker and the
noise is broadband noise in an automobile. The harmonic structures in the human speech spectrum can be used for weighting the
CSP analysis, because harmonic bins must contain more speech power than the others and thus give us more reliable information.
However, most conventional methods leveraging harmonic structures require pitch estimation with voiced-unvoiced classification,
which is not sufficiently accurate in noisy environments. In our new approach, the observed power spectrum is directly converted
into weights for the CSP analysis by retaining only the local peaks considered to be harmonic structures. Our experiment showed
the proposed approach significantly reduced the errors in localization, and it showed further improvements when used with other
weighting algorithms.

1. Introduction

The performance of automatic speech recognition (ASR) is severely affected in noisy environments. For example, in automobiles the ASR error rates during high-speed cruising with an open window are generally high. In such situations, the noise reduction of beamforming technology can improve the ASR accuracy. However, all beamformers except for Blind Signal Separation (BSS) require accurate localization to focus on the target sound source. If a beamformer has high performance with acute directivity, then the performance declines greatly if the localization is inaccurate. This means ASR may actually lose accuracy with a beamformer if the localization is poor in a noisy environment. Accurate localization is critically important for ASR with a beamformer.

For sound source localization, conventional methods include MUSIC [1, 2], Minimum Variance (MV), Delay and Sum (DS), and Cross-power Spectrum Phase (CSP) [3] analysis. For two-microphone systems installed on physical objects such as dummy heads or external ears, approaches with head-related transfer functions (HRTF) have been investigated to model the effect of diffraction and reflection [4]. Profile Fitting [5] can also address the diffraction and reflection, with the advantage of reducing the effects of noise sources through localization.

Among these methods, CSP analysis is popular because it is accurate, reliable, and simple. CSP analysis measures the time differences in the signals from two microphones using normalized correlation. The differences correspond to the direction of arrival (DOA) of the sound sources. Using multiple pairs of microphones, CSP analysis can be enhanced for 2D or 3D space localization [6].

This paper seeks to improve CSP analysis in noisy environments with a special weighting algorithm. We assume the target sound source is a human speaker and the noise is broadband noise such as a fan, wind, or road noise in an automobile. Denda et al. proposed weighted CSP analysis using average speech spectrums as weights [7]. The assumption is that a subband with more speech power conveys more reliable information for localization. However, it did not use the harmonic structures of human speech. Because the harmonic bins must contain more speech power than the other bins, they should give us more reliable information in noisy environments. The use of harmonic structures for localization has been investigated in prior art [8, 9], but not for CSP analysis. This work estimated the pitches (F0) of the target sound and extracted localization cues from the harmonic structures based on those pitches. However, the pitch estimation and the associated voiced-unvoiced classification may be insufficiently accurate in noisy environments. Also, it should be noted that not all harmonic bins have distinct harmonic structures: some bins may not be in the speech formants and may be dominated by noise. Therefore, we want a special weighting algorithm that puts larger weights on the bins where the harmonic structures are distinct, without requiring explicit pitch detection and voiced-unvoiced classification.

DOA the CSP coefficients should be processed as a moving average


0.2
using several frames around T, as long as the sound source is
not moving, using
0.15
H
l=−H ϕT (i + l)
0.1 ϕT (i) = , (2)
(2H + 1)
φT (i)

0.05
where 2H + 1 is the number of averaged frames. Figure 1
shows an example of ϕT . In clean conditions, there is a sharp
0
−7 −6 −5 −4 −3 −2 −1 0 1 2 3 4 5 6 7 peak for a sound source. The estimated DOA iT for the sound
i source is
−0.05

Figure 1: An example of CSP. iT = argmax ϕT (i) . (3)


i

1.1
2.2. Tracking a Moving Sound Source. If a sound source is
moving, the past location or DOA can be used as a cue
1
to the new location. Tracking techniques may use Dynamic
Programming (DP), the Viterbi search [10], Kalman Filters,
0.9
or Particle Filters [11]. For example, to find the series of
Weight

DOAs that maximize the function for the input speech


0.8
frames, DP can use the evaluation function Ψ as
0.7
ΨT (i) = ϕT (i) · L(k, i) + max (ΨT −1 (k)), (4)
i−1≤k≤i+1
0.6
0 2000 4000 6000 8000 where L(k, i) is a cost function from k to i.
(Hz)

Figure 2: Average speech spectrum weight. 2.3. Weighted CSP Analysis. Equation (1) can be viewed as a
summation of each contribution at bin j. Therefore we can
introduce a weight W( j) on each bin so as to focus on the
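As an illustration, (1)–(3) can be implemented in a few lines of NumPy. This is a minimal sketch, not the authors' implementation; the windowing, FFT size, and delay search range are assumptions.

```python
import numpy as np

def csp(frame1, frame2, n_fft=512):
    """CSP coefficients of one frame, cf. (1): the cross-power spectrum
    is normalized to unit magnitude (phase only) and transformed back."""
    S1 = np.fft.rfft(frame1 * np.hamming(len(frame1)), n_fft)
    S2 = np.fft.rfft(frame2 * np.hamming(len(frame2)), n_fft)
    cross = S1 * np.conj(S2)
    phi = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n_fft)
    return np.roll(phi, n_fft // 2)   # put the zero delay at the center

def estimate_doa(phi_frames, max_delay=7):
    """Smooth over frames, cf. (2), and pick the best delay, cf. (3)."""
    phi_bar = np.mean(phi_frames, axis=0)   # average over the 2H+1 frames
    center = phi_bar.shape[0] // 2
    lags = np.arange(-max_delay, max_delay + 1)
    return lags[np.argmax(phi_bar[center + lags])]
```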
pitches (F0) of the target sound and extracted localization more reliable bins, as
cues from the harmonic structures based on those pitches.     ∗ 
However, the pitch estimation and the associated voiced-   S1,T j · S2,T j
ϕT (i) = IDFT W j ·      
S1,T j  · S2,T j  . (5)
unvoiced classification may be insufficiently accurate in noisy
environments. Also, it should be noted that not all harmonic
bins have distinct harmonic structures. Some bins may not Denda et al. introduced an average speech spectrum for the
be in the speech formants and be dominated by noise. weights [7] to focus on human speech. Figure 2 shows their
Therefore, we want a special weighting algorithm that puts weights. We use the symbol WDenda for later reference to
larger weights on the bins where the harmonic structures these weights. It does not have any suffix T, since it is time
are distinct, without requiring explicit pitch detection and invariant.
voiced-unvoiced classification. Another weighting approach would be to use the local
SNR [12], as long as the ambient noise is stationary and
measurable. For our evaluation in Section 4, we simply used
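A DP tracker in the spirit of (4) can be sketched as follows; the uniform transition cost $L(k,i) = 1$ and the backtracking details are illustrative assumptions.

```python
import numpy as np

def track_doa(phi):
    """DP tracking over frames, cf. (4). phi: array (n_frames, n_doa)
    of smoothed CSP values; returns the best DOA index per frame."""
    n_frames, n_doa = phi.shape
    psi = phi[0].copy()
    back = np.zeros((n_frames, n_doa), dtype=int)
    for t in range(1, n_frames):
        new = np.empty(n_doa)
        for i in range(n_doa):
            ks = np.arange(max(0, i - 1), min(n_doa, i + 2))
            k_best = ks[np.argmax(psi[ks])]   # max over neighboring DOAs
            back[t, i] = k_best
            new[i] = phi[t, i] + psi[k_best]  # L(k, i) = 1 assumed
        psi = new
    path = [int(np.argmax(psi))]              # backtrack the best sequence
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```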
2.3. Weighted CSP Analysis. Equation (1) can be viewed as a summation of the contributions at each bin $j$. Therefore we can introduce a weight $W(j)$ on each bin so as to focus on the more reliable bins, as

$$\varphi_T(i) = \mathrm{IDFT}\left[ W(j) \cdot \frac{S_{1,T}(j) \cdot S_{2,T}^{*}(j)}{|S_{1,T}(j)| \cdot |S_{2,T}(j)|} \right]. \tag{5}$$

Denda et al. introduced an average speech spectrum for the weights [7] to focus on human speech. Figure 2 shows their weights. We use the symbol $W_{\mathrm{Denda}}$ for later reference to these weights. It does not have any suffix $T$, since it is time invariant.

[Figure 2: Average speech spectrum weight over 0–8000 Hz.]

Another weighting approach would be to use the local SNR [12], as long as the ambient noise is stationary and measurable. For our evaluation in Section 4, we simply used larger weights where the local SNR is high, as

$$W_{\mathrm{SNR},T}(j) = \frac{\max\left( \log|S_T(j)|^2 - \log|N_T(j)|^2, \; \varepsilon \right)}{K_T}, \tag{6}$$

where $N_T$ is the spectral magnitude of the average noise, $\varepsilon$ is a very small constant, and $K_T$ is a normalizing factor

$$K_T = \sum_{k} \max\left( \log|S_T(k)|^2 - \log|N_T(k)|^2, \; \varepsilon \right). \tag{7}$$

Figure 3(c) shows an example of the local SNR weights.
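A direct transcription of (6) and (7) could look as follows (NumPy); the flooring constants are illustrative choices.

```python
import numpy as np

def local_snr_weights(S, N, eps=1e-3):
    """Local SNR weights, cf. (6)-(7). S: observed complex spectrum of
    one frame; N: spectral magnitude of the average noise."""
    d = np.log(np.abs(S) ** 2 + 1e-12) - np.log(np.abs(N) ** 2 + 1e-12)
    w = np.maximum(d, eps)     # floor by the small constant epsilon
    return w / np.sum(w)       # normalization by K_T, cf. (7)
```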

12 12

10 10
Log power

Log power
8 8

6 6

4 4

2 2
0 2000 4000 6000 8000 0 2000 4000 6000 8000
(Hz) (Hz)
(a) A sample of the average noise spectrum. (b) A sample of the observed noisy speech spectrum.
0.1 0.1

Weight
Weight

0.05 0.05

0 0
0 2000 4000 6000 8000 0 2000 4000 6000 8000
(Hz) (Hz)
(c) A sample of the local SNR weights. (d) A sample of the local peak weights.

Figure 3: Sample spectra and the associated weights. The spectra were of the recording with air conditioner noise at an SNR of 0 dB. The
noisy speech spectrum (b) was sampled in a vowel segment.

[Figure 4: A sample of comb weight (pitch = 300 Hz).]

[Figure 5: A sample waveform (clean) and its pitches detected by SPTK at various SNRs (25 dB (clean), 10 dB, 5 dB, 0 dB). The threshold of voiced-unvoiced classification was set to 6.0 (SPTK default). For the frames detected as unvoiced, SPTK outputs zero. The test data was prepared by blending noise at different SNRs. The noise was recorded in a car moving on an expressway with a fan at a medium level.]

3. Harmonic Structure-Based Weighting

3.1. Comb Weights. If there is accurate information about the pitch and voiced-unvoiced labeling of the input speech, then we can design comb filters [13] for the frames in the voiced segments. The optimal CSP weights will be equivalent to the gains of the comb filters, so as to selectively use those harmonic bins. Figure 4 shows an example of the weights when the pitch is 300 Hz.

Unfortunately, the estimates of the pitch and the voiced-unvoiced classification become inaccurate in noisy environments. Figure 5 shows our tests using the "Pitch command" in SPTK-3.0 [14] to obtain the pitch and voiced-unvoiced information. There are many outliers in the low SNR conditions. Many researchers have tried to improve the accuracy of the detection in noisy environments [15], but their solutions require some threshold for voiced-unvoiced classification [16]. When noise-corrupted speech is falsely detected as unvoiced, there is little benefit from the CSP weighting.
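For illustration, comb weights like those in Figure 4 can be generated from a given pitch. The Gaussian lobe shape and width are assumptions, since the paper only requires weights that emphasize the harmonic bins.

```python
import numpy as np

def comb_weights(f0, n_bins=257, fs=22050, width_hz=50.0):
    """Comb weights for a voiced frame with pitch f0 (Hz), cf. Figure 4.
    Places a Gaussian lobe around each harmonic below the Nyquist rate."""
    freqs = np.linspace(0, fs / 2, n_bins)
    w = np.zeros(n_bins)
    for h in np.arange(f0, fs / 2, f0):   # all harmonics of f0
        w += np.exp(-0.5 * ((freqs - h) / width_hz) ** 2)
    return w / (w.max() + 1e-12)
```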
[Figure 6: Process to obtain the Local Peak Weight. For each frame (noise/unvoiced or voiced), the observed spectrum is converted to a log power spectrum, transformed by DCT to the cepstrum, liftered by cutting off the upper and lower cepstra, transformed back by inverse DCT, and exponentiated and normalized to give the weights W(ω) used in the weighted CSP.]
There is another problem with the uniform adoption of comb weights for all of the bins. Those bins not in the speech formants and degraded by noise may not contain reliable cues even though they are harmonic bins. Such bins should receive smaller weights.

Therefore, in Section 3.2, we explore a new weighting algorithm that does not depend on explicit pitch detection or voiced-unvoiced classification. Our approach is like a continuous converter from an input spectrum to a weight vector, which can be locally large for the bins whose harmonic structures are distinct.

3.2. Proposed Local Peak Weights. We previously proposed a method for speech enhancement called Local Peak Enhancement (LPE) to provide robust ASR even in very low SNR conditions due to driving noises from an open window or loud air conditioner noises [17]. LPE does not leverage pitch information explicitly, but estimates the filters from the observed speech to enhance the speech spectrum.
[Figure 7: Microphone installation and the resolution of DOA in the experimental car; the two ceiling microphones give 15 DOA steps from −7 to +7.]

[Figure 9: System for the evaluation. The DFTs $S_{1,T}(j)$ and $S_{2,T}(j)$ of the two microphone signals enter the weighted CSP computation; one input is also used to get the weights $W(j)$. The CSP coefficients $\varphi_T(i)$ are smoothed over frames to $\overline{\varphi}_T(i)$, from which the DOA is determined.]

[Figure 8: Averaged noise spectra (log power over 0–8000 Hz) used in the experiment for the "Window full open" and "Fan max" conditions.]

[Figure 10: Error rate of frame-based DOA detection (Fan Max, single-weight cases) for clean, 10 dB, and 0 dB SNR; bars: 1. CSP (baseline), 2. W-CSP (Comb), 3. W-CSP (LPW), 4. W-CSP (Local SNR), 5. W-CSP (Denda).]

LPE assumes that pitch information containing the harmonic structure is included in the middle range of the cepstral coefficients obtained with the discrete cosine transform (DCT) from the power spectral coefficients. The LPE filter retrieves information only from that range, so it is designed to enhance the local peaks of the harmonic structures for voiced speech frames. Here, we propose the LPE filter be used for the weights in the CSP approach. This use of the LPE filter is named Local Peak Weight (LPW), and we refer to the CSP with LPW as the Local-Peak-Weighted CSP (LPW-CSP).

Figure 6 shows all of the steps for obtaining the LPW and sample outputs of each step for both a voiced frame and an unvoiced frame. The process is the same for all of the frames, but the generated filters differ depending on whether or not the frame is voiced speech, as shown in the figure.

Here are the details for each step.

(1) Convert the observed spectrum from one of the microphones to a log power spectrum $Y_T(j)$ for each frame, where $T$ and $j$ are the frame number and the bin index of the DFT. Optionally, we may take a moving average over several frames around $T$ to smooth the power spectrum for $Y_T(j)$.

(2) Convert the log power spectrum $Y_T(j)$ into the cepstrum $C_T(i)$ by using $D(i,j)$, a DCT matrix:

$$C_T(i) = \sum_{j} D(i,j) \cdot Y_T(j), \tag{8}$$

where $i$ is the bin number of the cepstral coefficients. In our experiments, the size of the DCT matrix is 256 by 256.
[Figure 11: Error rate of frame-based DOA detection (Window Full Open, single-weight cases); bars: 1. CSP (baseline), 2. W-CSP (Comb), 3. W-CSP (LPW), 4. W-CSP (Local SNR), 5. W-CSP (Denda).]

[Figure 12: Error rate of frame-based DOA detection (Fan Max, combined-weight cases); bars: 1. CSP (baseline), 6. W-CSP (LPW and Denda), 7. W-CSP (LPW and Local SNR), 8. W-CSP (Local SNR and Denda), 9. W-CSP (LPW and Local SNR and Denda).]

[Figure 13: Error rate of frame-based DOA detection (Window Full Open, combined-weight cases); bars: 1. CSP (baseline), 6. W-CSP (LPW and Denda), 7. W-CSP (LPW and Local SNR), 8. W-CSP (Local SNR and Denda), 9. W-CSP (LPW and Local SNR and Denda).]

(3) The cepstra represent the curvatures of the log power spectra. The lower and higher cepstra include long and short oscillations, while the medium cepstra capture the harmonic structure information. Thus the range of cepstra is chosen by filtering out the lower and upper cepstra in order to cover the possible harmonic structures in the human voice:

$$\tilde{C}_T(i) = \begin{cases} \lambda \cdot C_T(i) & \text{if } (i < I_L) \text{ or } (i > I_H), \\ C_T(i) & \text{otherwise}, \end{cases} \tag{9}$$

where $\lambda$ is a small constant. $I_L$ and $I_H$ correspond to the bin indices of the possible pitch range, which for human speech is from 100 Hz to 400 Hz. This assumption gives $I_L = 55$ and $I_H = 220$ when the sampling frequency is 22 kHz.

(4) Convert $\tilde{C}_T(i)$ back to the log power spectrum domain $V_T(j)$ by using the inverse DCT:

$$V_T(j) = \sum_{i} D^{-1}(j,i) \cdot \tilde{C}_T(i). \tag{10}$$

(5) Then convert back to a linear power spectrum:

$$w_T(j) = \exp\left( V_T(j) \right). \tag{11}$$

(6) Finally, we obtain the LPW, after normalizing, as

$$W_{\mathrm{LPW},T}(j) = \frac{w_T(j)}{\sum_{k} w_T(k)}. \tag{12}$$

For voiced speech frames, the LPW will be designed to retain only the local peaks of the harmonic structure, as shown in the bottom-right graph in Figure 6 (see also Figure 3(d)). For unvoiced speech frames, the result will be almost flat due to the lack of local peaks with the target harmonic structure. Unlike the comb weights, the LPW is not uniform over the target frequencies and is more focused on the frequencies where harmonic structures are observed in the input spectrum.
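The six steps map directly onto a few lines of Python. This is a sketch under the paper's settings (256-point spectra, 22 kHz sampling, 100–400 Hz pitch range); the flooring constants and the value of λ are assumptions.

```python
import numpy as np
from scipy.fftpack import dct, idct

def local_peak_weights(power_spec, fs=22050, f0_lo=100.0, f0_hi=400.0, lam=1e-3):
    """Local Peak Weight (LPW), following steps (1)-(6), eqs. (8)-(12).

    power_spec : linear power spectrum of one frame, length n (e.g., 256).
    The retained cepstral band [IL, IH] covers pitches from f0_lo to
    f0_hi; for n = 256 and fs = 22 kHz this gives IL = 55, IH = 220,
    as in the paper. lam is the small constant lambda of eq. (9).
    """
    n = len(power_spec)
    Y = np.log(power_spec + 1e-12)              # step (1): log power spectrum
    C = dct(Y, norm='ortho')                    # step (2): cepstrum, eq. (8)
    IL, IH = int(fs / f0_hi), int(fs / f0_lo)   # possible pitch range
    lifter = np.full(n, lam)                    # step (3): lifter of eq. (9)
    lifter[IL:IH + 1] = 1.0
    V = idct(C * lifter, norm='ortho')          # step (4): inverse DCT, eq. (10)
    w = np.exp(V)                               # step (5): eq. (11)
    return w / w.sum()                          # step (6): normalize, eq. (12)
```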
3.3. Combination with Existing Weights. The proposed LPW and existing weights can be used in various combinations. For the combinations, the two choices are sum and product. In this paper, they are defined as the products of the components for each bin $j$, because the scale of each component is too different for a simple summation, and we hope to minimize fake peaks in the weights by using the products of different metrics. Equations (13) to (16) show the combinations we evaluate in Section 4:
$$W_{\mathrm{LPW\&Denda},T}(j) = W_{\mathrm{LPW},T}(j) \cdot W_{\mathrm{Denda}}(j), \tag{13}$$

$$W_{\mathrm{LPW\&SNR},T}(j) = W_{\mathrm{LPW},T}(j) \cdot W_{\mathrm{SNR},T}(j), \tag{14}$$

$$W_{\mathrm{SNR\&Denda},T}(j) = W_{\mathrm{SNR},T}(j) \cdot W_{\mathrm{Denda}}(j), \tag{15}$$

$$W_{\mathrm{LPW\&SNR\&Denda},T}(j) = W_{\mathrm{LPW},T}(j) \cdot W_{\mathrm{SNR},T}(j) \cdot W_{\mathrm{Denda}}(j). \tag{16}$$

4. Experiment

In the experimental car, two microphones were installed near the map-reading lights on the ceiling with 12.5 cm between them. We used omnidirectional microphones. The sampling frequency for the recordings was 22 kHz. In this configuration, CSP gives 15 steps from −7 to +7 for the DOA resolution (see Figure 7).

A higher sampling rate might yield higher directional resolution. However, many beamformers do not support higher sampling frequencies because of processing costs and aliasing problems. We also know that most ASR systems work at sampling rates below 22 kHz. These considerations led us to use 22 kHz.

Again, we could have gained directional resolution by increasing the distance between the microphones. In general, a larger baseline distance improves the performance of a beamformer, especially for lower-frequency sounds. However, this increases the aliasing problems for higher-frequency sounds. Our separation of 12.5 cm was another tradeoff.

Our analysis used a Hamming window and 23-ms-long frames with 10-ms frame shifts. The FFT length was 512. For (2), the length of the moving average was 0.2 seconds.

The test subject speakers were 4 females and 4 males. Each speaker read 50 Japanese commands. These are short phrases for automobiles known as Free Form Command [18]. The total number of utterances was 400. They were recorded in a stationary car, a full-size sedan. The subject speakers sat in the driver's seat. The seat was adjusted to each speaker's preference, so the distance to the microphones varied from approximately 40 cm to 60 cm. Two types of noise were recorded separately in a moving car, and they were combined with the speech data at various SNRs (clean, 10 dB, and 0 dB). The SNRs were measured as ratios of speech power and noise power, ignoring the frequency components below 300 Hz. One of the recorded noises was an air conditioner at maximum fan speed while driving on a highway with the windows closed; this will be referred to as "Fan Max". The other was of driving noise on a highway with the windows fully opened; this will be referred to as "Window Full Open". Figure 8 compares the average spectra of the two noises. "Window Full Open" contains more power around 1 kHz, and "Fan Max" contains relatively large power around 4 kHz. Although it is not shown in the graph, "Window Full Open" contains lots of transient noise from the wind and other automobiles.

Figure 9 shows the system used for this evaluation. We used various types of weights for the weighted CSP analysis. The input from one microphone was used to generate the weights. Using both microphones could provide better weights, but in this experiment we used only one microphone for simplicity. Since the baseline (normal CSP) does not use weighting, all of its weights were set to 1.0. The weighted CSP was calculated using (5), with smoothing over the frames using (2). In addition to the weightings, we introduced a lower cut-off frequency of 100 Hz and an upper cut-off frequency of 5 kHz to stabilize the CSP analysis. Finally, the DOA was estimated using (3) for each frame. We did not use the tracking algorithms discussed in Section 2.2, because we wanted to accurately measure the contributions of the various types of weights in a simplified form. Actually, the subject speakers rarely moved when speaking.

The performance was measured as frame-based accuracy. The frames reporting the correct DOA were counted, and that count was divided by the total number of speech frames. The correct DOA values were determined manually. The speech segments were determined using clean speech data with a rather strict threshold, so extra segments were not included before or after the phrases.

4.1. Experiment Using Single Weights. We evaluated five types of CSP analysis.

Case 1. Normal CSP (uniform weights, baseline).

Case 2. Comb-Weighted CSP.

Case 3. Local-Peak-Weighted CSP (our proposal).

Case 4. Local-SNR-Weighted CSP.

Case 5. Average-Speech-Spectrum-Weighted CSP (Denda).

Case 2 requires the pitch and voiced-unvoiced information. We used SPTK-3.0 [14] with default parameters to obtain this data. Case 4 requires estimating the noise spectrum. In this experiment, the noise spectrum was continuously updated within the noise segments based on oracle VAD information as

$$N_T(j) = (1-\alpha) \cdot N_{T-1}(j) + \alpha \cdot S_T(j), \qquad \alpha = \begin{cases} 0.0 & \text{if VAD = active}, \\ 0.1 & \text{otherwise}. \end{cases} \tag{17}$$
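The recursive update (17) is straightforward to implement; interpreting $S_T(j)$ as the spectral magnitude of the current frame, a sketch is:

```python
import numpy as np

def update_noise(N_prev, S, vad_active, alpha=0.1):
    """Recursive noise spectrum update, cf. (17): the estimate is frozen
    while the oracle VAD reports speech and tracks the input otherwise."""
    a = 0.0 if vad_active else alpha
    return (1.0 - a) * N_prev + a * np.abs(S)
```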
The initial value of the noise spectrum for each utterance file was given by the average of all of the noise segments in that file.

Figures 10 and 11 show the experimental results for "Fan Max" and "Window Full Open", respectively. Case 2 failed to show significant error reduction in either situation. This failure is probably due to bad pitch estimation or poor voiced-unvoiced classification in the noisy environments.
This suggests that the result could be improved by introducing robust pitch trackers and voiced-unvoiced classifiers. However, there is an intrinsic problem, since noisier speech segments are more likely to be classified as unvoiced and thus lose the benefit of weighting.

Case 5 failed to show significant error reduction for "Fan Max", but it showed good improvement for "Window Full Open". As shown in Figure 8, "Fan Max" contains more noise power around 4 kHz than around 1 kHz. In contrast, the speech power is usually lower around 4 kHz than around 1 kHz. Therefore, the 4-kHz region tends to be more degraded. However, Denda's approach does not sufficiently lower the weights in the 4-kHz region, because the weights are time-invariant and independent of the noise. Case 3 and Case 4 outperformed the baseline in both situations. For "Fan Max", since the noise was almost stationary, the local-SNR approach can accurately estimate the noise. This is also a favorable situation for LPW, because the noise does not include harmonic components. However, LPW does little for consonants. Therefore, Case 4 had the best results for "Fan Max". In contrast, since the noise is nonstationary for "Window Full Open", Case 3 had slightly fewer errors than Case 4. We believe this is because the noise estimation for the local SNR calculations is inaccurate for nonstationary noises. Considering that the local SNR approach in this experiment used the given and accurate VAD information, the actual performance in the real world would probably be worse than our results. LPW has the advantage that it requires neither noise estimation nor VAD information.

4.2. Experiment Using Combined Weights. We also evaluated some combinations of the weights in Cases 3 to 5. The combined weights were calculated using (13) to (16).

Case 6. CSP weighted with LPW and Denda (Cases 3 and 5).

Case 7. CSP weighted with LPW and Local SNR (Cases 3 and 4).

Case 8. CSP weighted with Local SNR and Denda (Cases 4 and 5).

Case 9. CSP weighted with LPW, Local SNR, and Denda (Cases 3, 4, and 5).

Figures 12 and 13 show the experimental results for "Fan Max" and "Window Full Open", respectively, for the combined-weight cases.

For the combination of two weights, the best combination was dependent on the situation. For "Fan Max", Case 7, the combination of LPW and the local SNR approach, was best, reducing the error by 51% at 0 dB. For "Window Full Open", Case 6, the combination of LPW and Denda's approach, was best, reducing the error by 37% at 0 dB. These results correspond to the discussion in Section 4.1 about how the local SNR approach is suitable for stationary noises, while LPW is suitable for nonstationary noises, and Denda's approach works well with noise concentrated in the lower frequency region.

In Case 9, the combination of the three weights worked well in both situations. Because each weighting method has different characteristics, we expected that their combination would help against variations in the noise. Indeed, the results were almost equivalent to the best combinations of the paired weights in each situation.

5. Conclusion

We proposed a new weighting algorithm for CSP analysis to improve the accuracy of DOA estimation for beamforming in a noisy environment, assuming the source is human speech and the noise is broadband noise such as a fan, wind, or road noise in an automobile.

The proposed weights are extracted directly from the input speech using the midrange of the cepstrum. They represent the local peaks of the harmonic structures. As the process does not involve voiced-unvoiced classification, it does not have to switch its behavior over the voiced-unvoiced transitions.

Experiments showed that the proposed local peak weighting algorithm significantly reduced the errors in localization using CSP analysis. A weighting algorithm using the local SNR also reduced the errors, but it did not produce the best results in the nonstationary noise situation in our evaluations. Also, it requires VAD information to estimate the noise spectrum. Our proposed algorithm does not require VAD information, voiced-unvoiced information, or pitch information. It does not assume the noise is stationary. Therefore, it showed advantages in the nonstationary noise situation. Also, it can be combined with existing weighting algorithms for further improvements.

References

[1] D. Johnson and D. Dudgeon, Array Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, USA.
[2] F. Asano, H. Asoh, and T. Matsui, "Sound source localization and separation in near field," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E83-A, no. 11, pp. 2286–2294, 2000.
[3] M. Omologo and P. Svaizer, "Acoustic event localization using a crosspower-spectrum phase based technique," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '94), pp. 273–276, 1994.
[4] K. D. Martin, "Estimating azimuth and elevation from interaural differences," in Proceedings of IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '95), p. 4, 1995.
[5] O. Ichikawa, T. Takiguchi, and M. Nishimura, "Sound source localization using a profile fitting method with sound reflectors," IEICE Transactions on Information and Systems, vol. E87-D, no. 5, pp. 1138–1145, 2004.
[6] T. Nishiura, T. Yamada, S. Nakamura, and K. Shikano, "Localization of multiple sound sources based on a CSP analysis with a microphone array," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '00), vol. 2, pp. 1053–1056, 2000.
[7] Y. Denda, T. Nishiura, and Y. Yamashita, "Robust talker direction estimation based on weighted CSP analysis and maximum likelihood estimation," IEICE Transactions on Information and Systems, vol. E89-D, no. 3, pp. 1050–1057, 2006.
[8] T. Yamada, S. Nakamura, and K. Shikano, "Robust speech recognition with speaker localization by a microphone array," in Proceedings of the International Conference on Spoken Language Processing (ICSLP '96), vol. 3, pp. 1317–1320, 1996.
[9] T. Nagai, K. Kondo, M. Kaneko, and A. Kurematsu, "Estimation of source location based on 2-D MUSIC and its application to speech recognition in cars," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '01), vol. 5, pp. 3041–3044, 2001.
[10] T. Yamada, S. Nakamura, and K. Shikano, "Distant-talking speech recognition based on a 3-D Viterbi search using a microphone array," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 2, pp. 48–56, 2002.
[11] H. Asoh, I. Hara, F. Asano, and K. Yamamoto, "Tracking human speech events using a particle filter," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '05), vol. 2, pp. 1153–1156, 2005.
[12] J.-M. Valin, F. Michaud, J. Rouat, and D. Létourneau, "Robust sound source localization using a microphone array on a mobile robot," in Proceedings of IEEE International Conference on Intelligent Robots and Systems (IROS '03), vol. 2, pp. 1228–1233, 2003.
[13] H. Tolba and D. O'Shaughnessy, "Robust automatic continuous-speech recognition based on a voiced-unvoiced decision," in Proceedings of the International Conference on Spoken Language Processing (ICSLP '98), p. 342, 1998.
[14] SPTK: http://sp-tk.sourceforge.net/.
[15] M. Wu, D. L. Wang, and G. J. Brown, "A multi-pitch tracking algorithm for noisy speech," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '02), vol. 1, pp. 369–372, 2002.
[16] T. Nakatani, T. Irino, and P. Zolfaghari, "Dominance spectrum based V/UV classification and F0 estimation," in Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech '03), pp. 2313–2316, 2003.
[17] O. Ichikawa, T. Fukuda, and M. Nishimura, "Local peak enhancement combined with noise reduction algorithms for robust automatic speech recognition in automobiles," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), pp. 4869–4872, 2008.
[18] http://www-01.ibm.com/software/pervasive/embedded viavoice/.
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 690732, 11 pages
doi:10.1155/2010/690732

Research Article
Shooter Localization in Wireless Microphone Networks

David Lindgren,1 Olof Wilsson,2 Fredrik Gustafsson (EURASIP Member),2


and Hans Habberstad1
1 Swedish Defence Research Agency, FOI Department of Information Systems, Division of Informatics, 581 11 Linköping, Sweden
2 Linköping University, Department of Electrical Engineering, Division of Automatic Control, 581 83 Linköping, Sweden

Correspondence should be addressed to David Lindgren, david.lindgren@foi.se

Received 31 July 2009; Accepted 14 June 2010

Academic Editor: Patrick Naylor

Copyright © 2010 David Lindgren et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

Shooter localization in a wireless network of microphones is studied. Both the acoustic muzzle blast (MB) from the gunfire and
the ballistic shock wave (SW) from the bullet can be detected by the microphones and considered as measurements. The MB
measurements give rise to a standard sensor network problem, similar to time difference of arrivals in cellular phone networks,
and the localization accuracy is good, provided that the sensors are well synchronized compared to the MB detection accuracy.
The detection times of the SW depend on both shooter position and aiming angle and may provide additional information besides
the shooter location, but again this requires good synchronization. We analyze the approach to base the estimation on the time
difference of MB and SW at each sensor, which becomes insensitive to synchronization inaccuracies. Cramér-Rao lower bound
analysis indicates how a lower bound of the root mean square error depends on the synchronization error for the MB and the
MB-SW difference, respectively. The estimation problem is formulated in a separable nonlinear least squares framework. Results
from field trials with different types of ammunition show excellent accuracy using the MB-SW difference for both the position and
the aiming angle of the shooter.

1. Introduction

Several acoustic shooter localization systems are today commercially available; see, for instance, [1–4]. Typically, one or more microphone arrays are used, each synchronously sampling acoustic phenomena associated with gunfire. An overview is found in [5]. Some of these systems are mobile, and in [6] it is even described how soldiers can carry the microphone arrays on their helmets. One interesting attempt to find the direction of sound from one microphone only is described in [7]. It is based on direction-dependent spatial filters (mimicking the human outer ear) and prior knowledge of the sound waveform, but this approach has not yet been applied to gun shots.

Less common, however, are shooter localization systems based on singleton microphones geographically distributed in a wireless sensor network. An obvious issue in wireless networks is the sensor synchronization. For localization algorithms that rely on accurate timing, like the ones based on time difference of arrival (TDOA), it is of major importance that synchronization errors are carefully controlled. Regardless of whether the synchronization is solved by using GPS or other techniques, see, for instance, [8–10], the synchronization procedures are associated with costs in battery life or communication resources that usually must be kept at a minimum.

In [11] the impact of synchronization errors on the sniper localization ability of an urban network is studied by using Monte Carlo simulations. One of the results is that the inaccuracy increased significantly (>2 m) for synchronization errors exceeding approximately 4 ms; 56 small wireless sensor nodes were modeled. Another closely related work that deals with mobile asynchronous sensors is [12], where the estimation bounds with respect to both sensor synchronization and position errors are developed and validated by Monte Carlo simulations. Also [13] should be mentioned, where combinations of directional and omnidirectional acoustic sensors for sniper localization are evaluated by perturbation analysis. In [14], estimation bounds for multiple acoustic arrays are developed and validated by Monte Carlo simulations.

acoustic sensors for sniper localization are evaluated by per-


turbation analysis. In [14], estimation bounds for multiple Shock wave
acoustic arrays are developed and validated by Monte Carlo
simulations.
In this paper we derive fundamental estimation bounds
for shooter localization systems based on wireless sensor
networks, with the synchronization errors in focus. An
accurate method independent of the synchronization errors
will be analyzed (the MB-SW model) as well as a useful
bullet deceleration model. The algorithms are tested on data
from a field trial with 10 microphones spread over an area
of 100 m and with gunfire at distances up to 400 m. Partial
results of this investigation appeared in [15] and almost Muzzle blast
simultaneously in [12].
The outline is as follows. Section 2 sketches the local- 0 50 100 150 200
ization principle and describes the acoustical phenomena (ms)
that are used. Section 3 gives the estimation framework.
Figure 1: Signal from a microphone placed 180 m from a firing gun.
Section 4 derives the signal models for the muzzle blast
Initial bullet speed is 767 m/s. The bullet passes the microphone at a
(MB), shock wave (SW), combined MB;SW, and difference distance of 30 m. The shockwave from the supersonic bullet reaches
MB-SW, respectively. Section 5 derives expressions for the the microphone before the muzzle blast.
root mean square error (RMSE) Cramér-Rao lower bound
(CRLB) for the described models and provides numerical
results from a realistic scenario. Section 6 presents the results
to the microphone. The figure shows real data, but a rather
from field trials, and Section 7 gives the conclusions.
ideal case. Usually, and particularly in urban environments,
there are reflections and other acoustic effects that make
it difficult to accurately determine the MB and SW times.
2. Localization Principle This issue will however not be treated in this work. We will
instead assume that the detection error is stochastic with a
Two acoustical phenomena associated with gunfire will be
certain distribution. A more thorough analysis of the SW
exploited to determine the shooter’s position: the muzzle
propagation is given in [16].
blast and the shock wave. The principle is to detect and time
Of course, the MB and SW (when present) can be used
stamp the phenomena as they reach microphones distributed
in conjunction with each other. One of the ideas exploited
over an area, and let the shooter’s position be estimated by,
later is to utilize the time difference between the MB and
in a sense, the most likely point, considering the microphone
SW detections. This way, the localization is independent of
locations and detection times.
the clock synchronization errors that are always present in
The muzzle blast (MB) is the sound that probably most of
wireless sensor networks.
us associate with a gun shot, the “bang.” The MB is generated
by the pressure depletion in effect of the bullet leaving the
gun barrel. The sound of the MB travels at the speed of sound 3. Estimation Framework
in all directions from the shooter. Provided that a sufficient
number of microphones detect the MB, the shooters position It is assumed throughout this work that
can be more or less accurately determined. (1) the coordinates of the microphones are known with
The shock wave (SW) is formed by supersonic bullets. negligible error,
The SW has (approximately) the shape of an expanding
cone, with the bullet trajectory as axis, and reaches only (2) the arrival times of the MB and SW at each micro-
microphones that happens to be located inside the cone. phone are measured with significant synchronization
The SW propagates at the speed of sound in direction away error,
from the bullet trajectory, but since it is generated by a (3) the shooter position and aim direction are the sought
supersonic bullet, it always reaches the microphone before parameters.
the MB, if it reaches the microphone at all. A number of SW
Thus, assume that there are M microphones with known
detections may primarily reveal the direction to the shooter.
positions { pk }M k=1 in the network detecting the muzzle blast.
Extra observations or assumptions on the ammunition are
Without loss of generality, the first S ≤ M ones also detect
generally needed to deduce the distance to the shooter. The
SW detection is also more difficult to utilize than the MB the shock wave. The detected times are denoted by { ykMB }M 1
detection, since it depends on the bullet’s speed and ballistic and { ykSW }S1 , respectively. Each detected time is subject to a
behavior. detection error {ekMB }M SW S
1 and {ek }1 , different for all times,
Figure 1 shows an acoustic recording of gunfire. The and a clock synchronization error {bk }M 1 specific for each
first pulse is the SW, which for distant shooters significantly microphone. The firing time t0 , shooter position x ∈ R3 ,
dominates the MB, not the least if the bullet passes close and shooting direction α ∈ R2 are unknown parameters.
Also the bullet speed $v$ and the speed of sound $c$ are unknown. Basic signal models for the detected times as a function of the parameters will be derived in the next section. The notation is summarized in Table 1.

The derived signal models will be of the form

$$y = h(x, \theta; p) + e, \tag{1}$$

where $y$ is a vector with the measured detection times, $h$ is a nonlinear function with values in $\mathbb{R}^{M+S}$, and $\theta$ represents the unknown parameters apart from $x$. The error $e$ is assumed to be stochastic; see Section 4.5. Given the sensor locations in $p \in \mathbb{R}^{M \times 3}$, nonlinear optimization can be performed to estimate $x$, using the nonlinear least squares (NLS) criterion:

$$\hat{x} = \arg\min_x \min_\theta V(x, \theta; p), \qquad V(x, \theta; p) = \left\| y - h(x, \theta; p) \right\|_R^2. \tag{2}$$

Here, $\arg\min$ denotes the minimizing argument, $\min$ the minimum of the function, and $\|v\|_Q^2$ denotes the $Q$-norm, that is, $\|v\|_Q^2 \triangleq v^T Q^{-1} v$. Whenever $Q$ is omitted, $Q = I$ is assumed. The loss function norm $R$ is chosen by consideration of the expected error characteristics. Numerical optimization, for instance, the Gauss-Newton method, can here be applied to get the NLS estimate.

In the next section it will become clear that the assumed unknown firing time and the inverse speed of sound enter the model equations linearly. To exploit this fact we identify a sublinear structure in the signal model and apply the weighted least squares method to the parameters appearing linearly, the separable least squares method; see, for instance, [17]. By doing so, the NLS search space is reduced, which in turn significantly reduces the computational burden. For that reason, the signal model (1) is rewritten as

$$y = h_N(x, \theta_N; p) + h_L(x, \theta_N; p)\,\theta_L + e. \tag{3}$$

Note that $\theta_L$ enters linearly here. The NLS problem can then be formulated as

$$\hat{x} = \arg\min_x \min_{\theta_L, \theta_N} V(x, \theta_N, \theta_L; p), \qquad V(x, \theta_N, \theta_L; p) = \left\| y - h_N(x, \theta_N; p) - h_L(x, \theta_N; p)\,\theta_L \right\|_R^2. \tag{4}$$

Since $\theta_L$ enters linearly, it can be solved for by linear least squares (the arguments of $h_L(x, \theta_N; p)$ and $h_N(x, \theta_N; p)$ are suppressed for clarity):

$$\hat{\theta}_L = \arg\min_{\theta_L} V(x, \theta_N, \theta_L; p) = \left( h_L^T R^{-1} h_L \right)^{-1} h_L^T R^{-1} (y - h_N), \tag{5a}$$

$$P_L = \left( h_L^T R^{-1} h_L \right)^{-1}. \tag{5b}$$

Here, $\hat{\theta}_L$ is the weighted least squares estimate and $P_L$ is the covariance matrix of the estimation error. This simplifies the nonlinear minimization to

$$\hat{x} = \arg\min_x \min_{\theta_N} \left\| (y - h_N) - h_L \left( h_L^T R^{-1} h_L \right)^{-1} h_L^T R^{-1} (y - h_N) \right\|_{\bar{R}}^2, \qquad \bar{R} = R + h_L P_L h_L^T. \tag{6}$$

This general separable least squares (SLS) approach will now be applied to four different combinations of signal models for the MB and SW detection times.

4. Signal Models

4.1. Muzzle Blast Model (MB). According to the clock at microphone $k$, the muzzle blast (MB) sound is assumed to reach $p_k$ at the time

$$y_k = t_0 + b_k + \frac{1}{c} \left\| p_k - x \right\| + e_k. \tag{7}$$

The shooter position $x$ and microphone location $p_k$ are in $\mathbb{R}^n$, where generally $n = 3$. However, both computational and numerical issues occasionally motivate a simplified plane model with $n = 2$. For all $M$ microphones, the model is represented in vector form as

$$y = b + h_L(x; p)\,\theta_L + e, \tag{8}$$

where

$$\theta_L = \left( t_0, \; \frac{1}{c} \right)^T, \tag{9a}$$

$$h_{L,k}(x; p) = \left( 1, \; \left\| p_k - x \right\| \right), \tag{9b}$$

and where $y$, $b$, and $e$ are vectors with elements $y_k$, $b_k$, and $e_k$, respectively. $\mathbf{1}_M$ is the vector with $M$ ones, where $M$ might be omitted if there is no ambiguity regarding the dimension. Furthermore, $p$ is $M$-by-$n$, where each row is a microphone position. Note that the inverse of the speed of sound enters linearly. The $(\cdot)_L$ notation indicates that $(\cdot)$ is part of a linear relation, as described in the previous section. With $h_N = 0$ and $h_L = h_L(x; p)$, (6) gives

$$\hat{x} = \arg\min_x \left\| y - h_L \left( h_L^T R^{-1} h_L \right)^{-1} h_L^T R^{-1} y \right\|_{\bar{R}}^2, \tag{10a}$$

$$\bar{R} = R + h_L \left( h_L^T R^{-1} h_L \right)^{-1} h_L^T. \tag{10b}$$

Here, $h_L$ depends on $x$ as given in (9b).

This criterion has computationally efficient implementations that in many applications make the time it takes to do an exhaustive minimization over a, say, 10-meter grid acceptable. The grid-based minimization of course reduces the risk of settling on suboptimal local minimizers, which otherwise could occur using greedy search methods.
4 EURASIP Journal on Advances in Signal Processing

Table 1: Notation. MB, SW, and MB-SW are different models, and L/N indicates if model parameters or signals enter the model linearly (L)
or nonlinearly (N).

Variable MB SW MB-SW Description


M Number of microphones
S Number of microphones receiving shock wave, S ≤ M
x N N N Position of shooter, Rn (n = 2, 3)
pk N N N Position of microphone k, Rn (n = 2, 3)
yk L L L Measured detection time for microphone at position pk
t0 L L Rifle or gun firing time
c L N N Speed of sound
v N N Speed of bullet
α N N Shooting direction, Rn−1 (n = 2, 3)
bk L L Synchronization error for microphone k
ek L L L Detection error at microphone k
r N N Bullet speed decay rate
dk Point of origin for shock wave received by microphone k
β Mach angle, sin β = c/v
γ Angle between line of sight to shooter and shooting angle

Figure 2: Level curves of the muzzle blast localization criterion based on data from a field trial. (The scene shows the microphones, the shooter, and a 1000 m scale bar.)

4.2. Shock Wave Model (SW). In general, the bullet follows a ballistic three-dimensional trajectory. In practice, a simpler model with a two-dimensional trajectory with constant deceleration might suffice. Thus, it will be assumed that the bullet follows a straight line with initial speed v0; see Figure 3. Due to air friction, the bullet decelerates; so when the bullet has traveled the distance ‖dk − x‖, for some point dk on the trajectory, the speed is reduced to

v = v0 − r‖dk − x‖,    (11)

where r is an assumed known ballistic parameter. This is a rather coarse bullet trajectory model compared with, for instance, the curvilinear trajectories proposed by [18], but we use it here for simplicity. This model is also a special case of the ballistic model used in [19].

The shock wave from the bullet trajectory propagates at the speed of sound c with angle βk to the bullet heading. βk is the Mach angle defined as

sin βk = c/v = c/(v0 − r‖dk − x‖).    (12)

dk is now the point where the shock wave that reaches microphone k is generated. The time it takes the bullet to reach dk is

∫₀^{‖x−dk‖} dξ/(v0 − r·ξ) = (1/r) log( v0/(v0 − r‖dk − x‖) ).    (13)

This time and the wave propagation time from dk to pk sum up to the total time from firing to detection:

yk = t0 + bk + (1/r) log( v0/(v0 − r‖dk − x‖) ) + (1/c)‖dk − pk‖ + ek,    (14)

according to the clock at microphone k. Note that the variable names y and e have for notational simplicity been reused from the MB model. Below, also h, θN, and θL will be reused. When there is ambiguity, a superscript will indicate exactly which entity is referred to, for instance, y^MB, h^SW.

It is a little bit tedious to calculate dk. The law of sines gives

sin(90° − βk − γk)/‖dk − x‖ = sin(90° + βk)/‖pk − x‖,    (15)

which together with (12) implicitly defines dk. We have not found any simple closed form for dk, so we solve for dk numerically, and in case of multiple solutions we keep the admissible one (which turns out to be unique). γk is trivially induced by the shooting direction α (and x, pk). Both these angles thus depend on x implicitly.
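To make the numerical solution of (12) and (15) concrete, here is one possible implementation (our sketch, not the authors' code): the law-of-sines relation is cross-multiplied into a scalar residual in s = ‖dk − x‖ and solved by bracketed root finding. The bracketing logic is our choice, and the "admissible solution" handling is simplified to a single bracketed root.

```python
import numpy as np
from scipy.optimize import brentq

def solve_dk(x, u, pk, v0, r, c):
    # Solve (12) and (15) for the shock-generation point d_k = x + s*u,
    # where u is the unit shooting direction. Returns s = ||d_k - x||,
    # or None if no admissible solution exists (mic outside the SW cone).
    dist = np.linalg.norm(pk - x)
    gamma = np.arccos(np.clip(u @ (pk - x) / dist, -1.0, 1.0))

    def residual(s):
        beta = np.arcsin(c / (v0 - r * s))       # Mach angle, cf. (12)
        # law of sines (15), cross-multiplied; sin(90-a)=cos a, sin(90+a)=cos a
        return dist * np.cos(beta + gamma) - s * np.cos(beta)

    s_max = (v0 - c) / r * (1 - 1e-9)            # bullet must stay supersonic
    if residual(0.0) <= 0 or residual(s_max) >= 0:
        return None                              # no bracketed admissible root
    return brentq(residual, 0.0, s_max)

# Toy usage with arbitrary numbers:
x  = np.array([0.0, 0.0]); u = np.array([1.0, 0.0])   # shooter, aim (east)
pk = np.array([300.0, 40.0])                          # one microphone
s  = solve_dk(x, u, pk, v0=700.0, r=0.63, c=330.0)
print(s)  # distance along the trajectory to the shock origin d_k
```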

Figure 3: Geometry of supersonic bullet trajectory and shock wave. Given the shooter location x, the shooting direction (aim) α, the bullet speed v, and the speed of sound c, the time it takes from firing the gun to detecting the shock wave can be calculated.

The vector form of the model is

y = b + hN(x, θN; p) + hL(x, θN; p)θL + e,    (16)

where

hL(x, θN; p) = 1,  θL = t0,  θN = [1/c, α^T, v0]^T,    (17)

and where row k of hN(x, θN; p) ∈ R^{S×1} is

hN,k(x, θN; pk) = (1/r) log( v0/(v0 − r‖dk − x‖) ) + (1/c)‖dk − pk‖,    (18)

and dk is the admissible solution to (12) and (15).

4.3. Combined Model (MB;SW). In the MB and SW models, the synchronization error has to be regarded as a noise component. In a combined model, each pair of MB and SW detections depends on the same synchronization error, and consequently the synchronization error can be regarded as a parameter (at least for all sensor nodes inside the SW cone). The total signal model could be fused from the MB and SW models as the total observation vector:

y^MB;SW = hN^MB;SW(x, θN; p) + hL^MB;SW(x, θN; p)θL + e,    (19)

where

y^MB;SW = [y^MB; y^SW],    (20)

θL = [t0, b^T]^T,    (21)

hL^MB;SW(x, θN; p) = [1_{M,1}, I_M; 1_{S,1}, [I_S 0_{S,M−S}]],    (22)

θN = [1/c, α^T, v0]^T,    (23)

hN^MB;SW(x, θN; p) = [hL^MB(x; p)[0, 1/c]^T; hN^SW(x, θN; p)].    (24)

4.4. Difference Model (MB-SW). Motivated by accurate localization despite synchronization errors, we study the MB-SW model:

yk^MB-SW = yk^MB − yk^SW
         = hL,k^MB(x; p)θL^MB − hN,k^SW(x, θN^SW; p) − hL,k^SW(x, θN^SW; p)θL^SW + ek^MB − ek^SW,    (25)

for k = 1, 2, ..., S. This rather special model has also been analyzed in [12, 15]. The key idea is that y is by cancellation independent of both the firing time t0 and the synchronization error b. The drawback, of course, is that there are only S equations (instead of a total of M + S) and the detection error increases, ek^MB − ek^SW. However, when the synchronization errors are expected to be significantly larger than the detection errors, and when also S is sufficiently large (at least as large as the number of parameters), this model is believed to give better localization accuracy. This will be investigated later.

There are no parameters in (25) that appear linearly everywhere. Thus, the vector form for the MB-SW model can be written as

y^MB-SW = hN^MB-SW(x, θN; p) + e,    (26)

where

hN,k^MB-SW(x, θN; pk) = (1/c)‖pk − x‖ − (1/r) log( v0/(v0 − r‖dk − x‖) ) − (1/c)‖dk − pk‖,    (27)

and y = y^MB − y^SW and e = e^MB − e^SW. As before, dk is the admissible solution to (12) and (15). The MB-SW least squares criterion is

x̂ = arg min_{x,θN} ‖y^MB-SW − hN^MB-SW(x, θN; p)‖²_R,    (28)

which requires numerical optimization. Numerical experiments indicate that this optimization problem is more prone to local minima compared to (10a) for the MB model; therefore good starting points for the numerical search are essential. One such starting point could, for instance, be the MB estimate x̂^MB. Initial shooting direction could be given by assuming, in a sense, the worst possible case: that the shooter aims at some point close to the center of the microphone network.

4.5. Error Model. At an arbitrary moment, the detection errors and synchronization errors are assumed to be independent stochastic variables with normal distribution:

e^MB ∼ N(0, R^MB),    (29a)

e^SW ∼ N(0, R^SW),    (29b)

b ∼ N(0, R^b).    (29c)
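To illustrate (25)–(28), the sketch below (again our construction) predicts the MB-SW time differences via (27), reusing solve_dk from the earlier sketch, and wraps them in the least squares criterion (28) with R = I. Unlike the paper, the speed of sound c is treated as known here to keep the parameter vector short, and α is a scalar aim angle in the plane.

```python
import numpy as np

def h_mbsw(x, alpha, v0, p, r=0.63, c=330.0):
    # predicted MB-SW time differences, one per microphone, cf. (27);
    # assumes solve_dk from the previous sketch is in scope
    u = np.array([np.cos(alpha), np.sin(alpha)])  # unit shooting direction
    out = []
    for pk in p:
        s = solve_dk(x, u, pk, v0, r, c)
        if s is None:                             # microphone outside SW cone
            out.append(np.nan)
            continue
        dk = x + s * u
        t_bullet = np.log(v0 / (v0 - r * s)) / r  # bullet flight time, cf. (13)
        t_shock = np.linalg.norm(dk - pk) / c     # shock travel d_k -> p_k
        t_mb = np.linalg.norm(pk - x) / c         # muzzle blast travel x -> p_k
        out.append(t_mb - t_bullet - t_shock)
    return np.array(out)

def mbsw_cost(theta, y, p):
    # least squares criterion (28) with R = I; theta = (x1, x2, alpha, v0)
    res = y - h_mbsw(theta[:2], theta[2], theta[3], p)
    res = res[~np.isnan(res)]                     # drop mics without SW data
    return res @ res
```

A general-purpose optimizer (e.g., scipy.optimize.minimize) can then be run on mbsw_cost from a good starting point, as discussed above.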

For the MB-SW model the error is consequently

e^MB-SW ∼ N(0, R^MB + R^SW).    (29d)

Assuming that S = M in the MB;SW model, the covariance of the summed detection and synchronization errors can be expressed in a simple manner as

R^MB;SW = [R^MB + R^b, R^b; R^b, R^SW + R^b].    (29e)

Note that the correlation structure of the clock synchronization error b enables estimation of these. Note also that the (assumed known) total error covariance, generally denoted by R, dictates the norm used in the weighted least squares criterion. R also impacts the estimation bounds. This will be discussed in the next section.

4.6. Summary of Models. Four models with different purposes have been described in this section.

(i) MB. Given that the acoustic environment enables reliable detection of the muzzle blast, the MB model promises the most robust estimation algorithms. It also allows global minimization with low-dimensional exhaustive search algorithms. This model is thus suitable for initialization of algorithms based on the subsequent models.

(ii) SW. The SW model extends the MB model with shooting angle, bullet speed, and deceleration parameters, which provide useful information for sniper detection applications. The SW is easier to detect in disturbed environments, particularly when the shooter is far away and the bullet passes closely. However, a sufficient number of microphones are required to be located within the SW cone, and the SW measurements alone cannot be used to determine the distance to the shooter.

(iii) MB;SW. The total MB;SW model keeps all information from the observations and should thus provide the most accurate and general estimation performance. However, the complexity of the estimation problem is large.

(iv) MB-SW. All algorithms based on the models above require that the synchronization error in each microphone either is negligible or can be described with a statistical distribution. The MB-SW model relaxes such assumptions by eliminating the synchronization error by taking differences of the two pulses at each microphone. This also eliminates the shooting time. The final model contains all interesting parameters for the problem, but only one nuisance parameter (the actual speed of sound, which further may be eliminated if known sufficiently well).

The different parameter vectors in the relation y = hL(θN)θL + hN(θN) + e are summarized in Table 2.

5. Cramér-Rao Lower Bound

The accuracy of any unbiased estimator η̂ in the rather general model

y = h(η) + e    (30)

is, under not too restrictive assumptions [20], bounded by the Cramér-Rao bound:

Cov(η̂) ≥ I⁻¹(η°),    (31)

where I(η°) is Fisher's information matrix evaluated at the correct parameter values η°. Here, the location x is for notational purposes part of the parameter vector η. Also the sensor positions pk can be part of η, if these are known only with a certain uncertainty. The Cramér-Rao lower bound provides a fundamental estimation limit for unbiased estimators; see [20]. This bound has been analyzed thoroughly in the literature, primarily for AOA, TOA, and TDOA [21–23].

The Fisher information matrix for e ∼ N(0, R) takes the form

I(η) = [∇η h(η)] R⁻¹ [∇η h(η)]^T.    (32)

The bound is evaluated for a specific location, parameter setting, and microphone positioning, collectively η = η°. The bound for the localization error is

Cov(x̂) ≥ [I_n 0] I⁻¹(η°) [I_n 0]^T.    (33)

This covariance can be converted to a more convenient scalar value giving a bound on the root mean square error (RMSE) using the trace operator:

RMSE ≥ sqrt( (1/n) tr( [I_n 0] I⁻¹(η°) [I_n 0]^T ) ).    (34)

The RMSE bound can be used to compare the information in different models in a simple and unambiguous way, which does not depend on which optimization criterion is used or which numerical algorithm is applied to minimize the criterion.

5.1. MB Case. For the MB case, the entities in (32) are identified by

η = [x^T, θL^T]^T,  h(η) = hL^MB(x; p)θL,  R = R^MB + R^b.    (35)

Note that b is accounted for by the error model. The Jacobian ∇η h is an M-by-(n + 2) matrix, n being the dimension of x. The LS solution in (5a) however gives a shortcut to an M-by-n Jacobian:

∇x(hL θ̂L) = ∇x( hL (hL^T R⁻¹ hL)⁻¹ hL^T R⁻¹ y° ),    (36)

for y° = hL(x°; p°)θL°, where x°, p°, and θ° denote the true (unperturbed) values. For the case n = 2 and known p = p°, this Jacobian can, with some effort, be expressed explicitly. The equivalent bound is

Cov(x̂) ≥ ( [∇x(hL θ̂L)]^T R⁻¹ [∇x(hL θ̂L)] )⁻¹.    (37)
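The bound computations are straightforward to mechanize. The sketch below (our code, not the authors') evaluates (32)–(34) with a central finite-difference Jacobian, which is also how the numerical bounds later in the paper are obtained; the step size eps and the parameter ordering (position first) are our choices.

```python
import numpy as np

def fim(h, eta0, Rinv, eps=1e-6):
    # Fisher information (32): J R^-1 J^T, with the rows of J holding the
    # central finite-difference derivatives of h at the true values eta0
    J = np.zeros((len(eta0), len(h(eta0))))
    for i in range(len(eta0)):
        d = np.zeros(len(eta0))
        d[i] = eps
        J[i] = (h(eta0 + d) - h(eta0 - d)) / (2.0 * eps)
    return J @ Rinv @ J.T

def rmse_bound(h, eta0, Rinv, n=2):
    # scalar position bound (33)-(34); the first n entries of eta are x
    C = np.linalg.inv(fim(h, eta0, Rinv))
    return np.sqrt(np.trace(C[:n, :n]) / n)
```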

Table 2: Summary of parameter vectors for the different models y = hL(θN)θL + hN(θN) + e, where the noise models are summarized in (29a), (29b), (29c), (29d), and (29e). The values of the dimensions assume that the set of microphones giving SW observations is a subset of the MB observations.

Model | Linear parameters | Nonlinear parameters | dim(θ) | dim(y)
MB | θL^MB = [t0, 1/c]^T | θN^MB = [ ] | 2 + 0 | M
SW | θL^SW = t0 | θN^SW = [1/c, α^T, v0]^T | 1 + (n + 1) | S
MB;SW | θL^MB;SW = [t0, b^T]^T | θN^MB;SW = [1/c, α^T, v0]^T | (M + 1) + (n + 1) | M + S
MB-SW | θL^MB-SW = [ ] | θN^MB-SW = [1/c, α^T, v0]^T | 0 + (n + 1) | S

5.2. SW, MB;SW, and MB-SW Cases. The estimation bounds for the SW, MB;SW, and MB-SW cases are analogous to (33), but there are hardly any analytical expressions available. The Jacobian is probably best evaluated by finite difference methods.

5.3. Numerical Example. The really interesting question is how the information in the different models relates to each other. We will study a scenario where 14 microphones are deployed in a sensor network to support camp protection; see Figure 4. The microphones are positioned along a road to track vehicles and around the camp site to detect intruders. Of course, the microphones also detect muzzle blasts and shock waves from gunfire, so shooters can be localized and the shooter's target identified.

Figure 4: Example scenario. A network with 14 sensors deployed for camp protection. The sensors detect intruders, keep track of vehicle movements, and, of course, locate shooters. (Axes x1, x2; the scene shows a road, trees, the camp, the shooter, and the microphones, with a 1000 m scale bar.)

A plane model (flat camp site) is assumed, x ∈ R², α ∈ R. Furthermore, it is assumed that

R^b = σb² I (synchronization error covariance),
R^MB = R^SW = σe² I (detection error covariance),    (38)

and that α = 0, c = 330 m/s, v0 = 700 m/s, and r = 0.63. The scenario setup implies that all microphones detect the shock wave, so S = M = 14. All bounds presented below are calculated by numerical finite difference methods.

MB Model. The localization accuracy using the MB model is bounded below according to

Cov(x̂^MB) ≥ (σe² + σb²) · 10⁴ · [64, −17; −17, 9].    (39)

The root mean square error (RMSE) is consequently bounded according to

RMSE(x̂^MB) ≥ sqrt( (1/n) tr Cov(x̂^MB) ) ≈ 606 sqrt(σe² + σb²) [m].    (40)

Monte Carlo simulations (not described here) indicate that the NLS estimator attains this lower bound for sqrt(σe² + σb²) < 0.1 s. The dash-dotted curve in Figure 5 shows the bound versus σb for fixed σe = 500 μs. An uncontrolled increase as soon as σb > σe can be noted.

SW Model. The SW model is disregarded here, since the SW detections alone contain no shooter distance information.

MB-SW Model. The localization accuracy using the MB-SW model is bounded according to

Cov(x̂^MB-SW) ≥ σe² · 10⁵ · [28, 5; 5, 12],    (41)

RMSE(x̂^MB-SW) ≥ 1430 σe [m].    (42)

The dashed lines in Figure 5 correspond to the RMSE bound for four different values of σe. Here, the MB-SW model gives at least twice the error of the MB model, provided that there are no synchronization errors. However, in a wireless network we expect the synchronization error to be 10–100 times larger than the detection error, and then the MB-SW error will be substantially smaller than the MB error.

MB;SW Model. The expression for the MB;SW bound is somewhat involved; so the dependence on σb is only presented graphically, see Figure 5. The solid curves correspond to the MB;SW RMSE bound for the same four values of σe as for the MB-SW bound. Apparently, when the synchronization error σb is large compared to the detection error σe, the MB-SW and MB;SW models contain roughly the same amount of information, and the model having the simplest estimator, that is, the MB-SW model, should be preferred. However, when the synchronization error is smaller than 100 times the detection error, the complete MB;SW model becomes more informative.

These results are comparable with the analysis in [12, Figure 4a], where an example scenario with 6 microphones is considered.
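As a sanity check of the (σe² + σb²) scaling in (39)–(40), the snippet below reuses fim and rmse_bound from the sketch in Section 5 on an MB model with η = (x1, x2, t0, 1/c), cf. (35) and (38). The microphone layout here is random, so the geometry constant differs from the 606 of this scenario, but the sqrt(σe² + σb²) proportionality is exact because R simply scales.

```python
import numpy as np

# MB model: predicted detection times for eta = (x1, x2, t0, 1/c)
p = np.random.default_rng(1).uniform(-500.0, 500.0, size=(14, 2))  # layout [m]
h_mb = lambda eta: eta[2] + np.linalg.norm(p - eta[:2], axis=1) * eta[3]
eta0 = np.array([1500.0, 400.0, 0.0, 1.0 / 330.0])  # a distant shooter

for se, sb in [(500e-6, 0.0), (500e-6, 5e-3), (500e-6, 50e-3)]:
    Rinv = np.eye(len(p)) / (se**2 + sb**2)    # R = (se^2 + sb^2) I, cf. (38)
    bound = rmse_bound(h_mb, eta0, Rinv)
    print(f"sigma_b = {sb:6.3f} s: RMSE bound = {bound:9.2f} m, "
          f"constant = {bound / np.hypot(se, sb):7.0f}")  # constant per (40)
```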

Figure 5: Cramér-Rao RMSE bound (34) for the MB (40), the MB-SW (42), and the MB;SW models, respectively, as a function of the synchronization error (STD) σb, and for different levels of detection error σe. (Axes: RMSE (m) versus σb (ms); curves: MB for σe = 500 μs, and MB-SW and MB;SW for σe = 50–1000 μs.)

5.4. Summary of the CRLB Analysis. The synchronization error level in a wireless sensor network is usually a matter of design tradeoff between performance and battery costs required by synchronization mechanisms. Based on the scenario example, the CRLB analysis is summarized with the following recommendations.

(i) If σb ≫ σe, then the MB-SW model should be used.

(ii) If σb is moderate, then the MB;SW model should be used.

(iii) Only if σb is very small (σb ≤ σe), the shooting direction is of minor interest, and performance may be traded for simplicity, then the MB model should be used.

6. Experimental Data

A field trial to collect acoustic data on nonmilitary small arms fire is conducted. Ten microphones are placed around a fictitious camp; see Figure 6. The microphones are placed close to the ground and wired to a common recorder with 16-bit sampling at 48 kHz. A total of 42 rounds are fired from three positions and aimed at a common cardboard target. Three rifles and one pistol are used; see Table 3. Four rounds are fired of each armament at each shooter position, with two exceptions. The pistol is only used at position three. At position three, six instead of four rounds of 308 W are fired. All ammunition types are supersonic. However, when firing from position three, not all microphones are subjected to the shock wave.

Figure 6: Scene of the shooter localization field trial. There are ten microphones, three shooter positions, and a common target. (The scene shows the target, the three shooter positions, the microphones, and a 500 m scale bar.)

Light wind, no clouds, and around 24°C are the weather conditions. Little or no acoustic disturbances are present. The terrain is rough. Dense woods surround the test site. There is light bush vegetation within the site. Shooter position 1 is elevated some 20 m; otherwise spots are within ±5 m of a horizontal plane. Ground truth values of the positions are determined with less relative error than 1 m, except for shooter position 1, which is determined with 10 m accuracy.

6.1. Detection. The MB and SW are detected by visual inspection of the microphone signals in conjunction with filtering techniques. For shooter positions 1 and 2, the shock wave detection accuracy is approximately σe^SW ≈ 80 μs, and the muzzle blast error σe^MB is slightly worse. For shooting position 3 the accuracies are generally much worse, since the muzzle blast and shock wave components become intermixed in time.

6.2. Numerical Setup. For simplicity, a plane model is assumed. All elevation measurements are ignored, and x ∈ R² and α ∈ R. Localization using the MB model (7) is done by minimizing (10a) over a 10 m grid well covering the area of interest, followed by numerical minimization.

Localization using the MB-SW model (25) is done by numerically minimizing (28). The objective function is subject to local optima; therefore the more robust muzzle blast localization x̂ is used as an initial guess. Furthermore, the direction from x̂ toward the mean point of the microphones (the camp) is used as the initial shooting direction α. The initial bullet speed is v = 800 m/s and the initial speed of sound is c = 330 m/s. r = 0.63 is used, which is a value derived from the 308 Winchester ammunition ballistics.
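A minimal sketch of this two-stage procedure follows (our illustration of the setup described above; mb_cost and mbsw_cost are the helper functions from the earlier sketches, and the grid extent is an arbitrary choice):

```python
import numpy as np
from scipy.optimize import minimize

def localize(y_mb, y_mbsw, p, step=10.0, half_width=1000.0):
    # stage 1: MB criterion over a coarse 10 m grid, cf. (10a), then local search
    Rinv = np.eye(len(p))
    g = np.arange(-half_width, half_width, step)
    x0 = min((np.array([a, b]) for a in g for b in g),
             key=lambda x: mb_cost(x, y_mb, p, Rinv))
    x_mb = minimize(mb_cost, x0, args=(y_mb, p, Rinv), method="Nelder-Mead").x
    # stage 2: MB-SW criterion (28), initialized at the MB estimate, with the
    # aim pointing toward the microphone mean point and v0 = 800 m/s
    d = p.mean(axis=0) - x_mb
    theta0 = np.array([x_mb[0], x_mb[1], np.arctan2(d[1], d[0]), 800.0])
    theta = minimize(mbsw_cost, theta0, args=(y_mbsw, p),
                     method="Nelder-Mead").x
    return x_mb, theta                            # theta = (x1, x2, alpha, v0)
```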

Table 3: Armament and ammunition used at the trial, and number of rounds fired at each shooter position. Also, the resulting localization RMSE for the MB-SW model for each shooter position. For the Luger Pistol the MB model RMSE is given, since only one microphone is located in the Luger Pistol SW cone.

Type | Caliber | Weight | Velocity | Sh. pos. | # Rounds | RMSE
308 Winchester | 7.62 mm | 9.55 g | 847 m/s | 1, 2, 3 | 4, 4, 6 | 19, 6, 6 m
Hunting Rifle | 9.3 mm | 15 g | 767 m/s | 1, 2, 3 | 4, 4, 4 | 6, 5, 6 m
Swedish Mauser | 6.5 mm | 8.42 g | 852 m/s | 1, 2, 3 | 4, 4, 4 | 40, 6, 6 m
Luger Pistol | 9 mm | 6.8 g | 400 m/s | 3 | —, —, 4 | —, —, 2 m
Luger Pistol 9 mm 6.8 g 400 m/s 3 —, —, 4 —, —, 2 m

6.3. Results. Figure 7 shows, at three enlarged parts of the scene, the resulting position estimates based on the MB model (crosses) and based on the MB-SW model (squares). Apparently, the use of the shock wave significantly improves localization at positions 1 and 2, while rather the opposite holds at position 3. Figure 8 visualizes the shooting direction estimates α̂. Estimate root mean square errors (RMSEs) for the three shooter positions, together with the theoretical bounds (34), are given in Table 4. The practical results indicate that the use of the shock wave from distant shooters cuts the error by at least 75%.

Table 4: Localization RMSE and theoretical bound (34) for the three different shooter positions using the MB and the MB-SW models, respectively, beside the aim RMSE for the MB-SW model. The aim RMSE is with respect to the aim α̂′ at x̂ against the target, not with respect to the true direction α. This way the ability to identify the target is assessed.

Shooter position | 1 | 2 | 3
RMSE(x̂^MB) | 105 m | 28 m | 2.4 m
MB bound | 1 m | 0.4 m | 0.02 m
RMSE(x̂^MB-SW) | 26 m | 5.7 m | 5.2 m
MB-SW bound | 9 m | 0.1 m | 0.08 m
RMSE(α̂′) | 0.041° | 0.14° | 17°

6.3.1. Synchronization and Detection Errors. Since all microphones are recorded by a common recorder, there are actually no timing errors due to inaccurate clocks. This is of course the best way to conduct a controlled experiment, since any uncertainty renders the dataset less useful. From an experimental point of view, it is then simple to add synchronization errors of any desired magnitude off-line. On the dataset at hand, this is however work under progress. At the moment, there are apparently other sources of error, worth identifying. It should however be clarified that in the final wireless sensor product, there will always be an unpredictable clock error. As mentioned, detection errors are present, and the expected level of these (80 μs) is used for bound calculations in Table 4. It is noted that the bounds are in level with, or below, the positioning errors.

There are at least two explanations for the bad performance using the MB-SW model at shooter position 3. One is that the number of microphones reached by the shock wave is insufficient to make accurate estimates. There are four unknown model parameters, but for the relatively low speed of pistol ammunition, for instance, only one microphone has a valid shock wave detection. Another explanation is that the increased detection uncertainty (due to SW/MB intermix) impacts the MB-SW model harder, since it relies on accurate detection of both the MB and SW.

6.3.2. Model Errors. No doubt, there are model inaccuracies both in the ballistic and in the acoustic domain. To that end, there are meteorological uncertainties out of our control. For instance, looking at the MB-SW localizations around shooter position 1 in Figure 7 (squares), three clusters are identified that correspond to three ammunition types with different ballistic properties; see the RMSE for each ammunition and position in Table 3. This clustering, or bias, more likely stems from model errors than from detection errors and could at least partially explain the large gap between theoretical bound and RMSE in Table 4. Working with three-dimensional data in the plane is of course another model discrepancy that could have greater impact than we first anticipated. This will be investigated in experiments to come.

6.3.3. Numerical Uncertainties. Finally, we face numerical uncertainties. There is no guarantee that the numerical minimization programs we have used here for the MB-SW model really deliver the global minimum. In a realistic implementation, every possible a priori knowledge and also qualitative analysis of the SW and MB signals (amplitude, duration, caliber classification, etc.) together with basic consistency checks are used to reduce the search space. The reduced search space may then be exhaustively sampled over a grid prior to the final numerical minimization. Simple experiments on an ordinary desktop PC indicate that with an efficient implementation, it is feasible to minimize any of the described model objective functions over a discrete grid with 10⁷ points within the time frame of one second. Thus, by allowing, say, one second extra of computation time, the risk of hitting a local optimum could be significantly reduced.

Figure 7: Estimated positions x̂ based on the MB model and on the MB-SW model. The diagrams are enlargements of the interesting areas around the shooter positions (panels (a)–(c) correspond to positions 1–3; axes in meters). The dashed lines identify the shooting directions.

Figure 8: Estimated shooting directions. The relatively slow pistol ammunition is excluded. (The scene shows the target, the shooter positions, the microphones, and the estimated positions, with a 500 m scale bar.)

7. Conclusions

We have presented a framework for estimation of shooter location and aiming angle from wireless networks where each node has a single microphone. Both the acoustic muzzle blast (MB) and the ballistic shock wave (SW) contain useful information about the position, but only the SW contains information about the aiming angle. A separable nonlinear least squares (SNLS) framework was proposed to limit the parametric search space and to enable the use of global

grid-based optimization algorithms (for the MB model), eliminating potential problems with local minima.

For a perfectly synchronized network, both MB and SW measurements should be stacked into one large signal model for which SNLS is applied. However, when the synchronization error in the network becomes comparable to the detection error for MB and SW, the performance quickly deteriorates. For that reason, the time difference of MB and SW at each microphone is used, which automatically eliminates any clock offset. The effective number of measurements decreases in this approach, but as the CRLB analysis showed, the root mean square position error is comparable to that of the ideal stacked model, at the same time as the synchronization error distribution may be completely disregarded.

The bullet speed occurs as a nuisance parameter in the proposed signal model. Further, the bullet retardation constant was optimized manually. Future work will investigate if the retardation constant should also be estimated, and if these two parameters can be used, together with the MB and SW signal forms, to identify the weapon and ammunition.

Acknowledgment

This work is funded by the VINNOVA supported Centre for Advanced Sensors, Multisensors and Sensor Networks, FOCUS, at the Swedish Defence Research Agency, FOI.

References

[1] J. Bédard and S. Paré, "Ferret, a small arms' fire detection system: localization concepts," in Sensors, and Command, Control, Communications, and Intelligence (C3I) Technologies for Homeland Defense and Law Enforcement II, vol. 5071 of Proceedings of SPIE, pp. 497–509, 2003.
[2] J. A. Mazurek, J. E. Barger, M. Brinn et al., "Boomerang mobile counter shooter detection system," in Sensors, and C3I Technologies for Homeland Security and Homeland Defense IV, vol. 5778 of Proceedings of SPIE, pp. 264–282, Bellingham, Wash, USA, 2005.
[3] D. Crane, "Ears-MM soldier-wearable gun-shot/sniper detection and location system," Defence Review, 2008.
[4] "PILAR Sniper Countermeasures System," November 2008, http://www.canberra.com.
[5] J. Millet and B. Balingand, "Latest achievements in gunfire detection systems," in Proceedings of the RTO-MP-SET-107 Battlefield Acoustic Sensing for ISR Applications, Neuilly-sur-Seine, France, 2006.
[6] P. Volgyesi, G. Balogh, A. Nadas, et al., "Shooter localization and weapon classification with soldier-wearable networked sensors," in Proceedings of the 5th International Conference on Mobile Systems, Applications, and Services (MobiSys '07), San Juan, Puerto Rico, 2007.
[7] A. Saxena and A. Y. Ng, "Learning sound location from a single microphone," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '09), pp. 1737–1742, Kobe, Japan, May 2009.

[8] W. S. Conner, J. Chhabra, M. Yarvis, and L. Krishnamurthy,


“Experimental evaluation of synchronization and topology
control for in-building sensor network applications,” in
Proceedings of the 2nd ACM International Workshop on Wireless
Sensor Networks and Applications (WSNA ’03), pp. 38–49, San
Diego, Calif, USA, September 2003.
[9] O. Younis and S. Fahmy, “A scalable framework for distributed
time synchronization in multi-hop sensor networks,” in
Proceedings of the 2nd Annual IEEE Communications Society
Conference on Sensor and Ad Hoc Communications and
Networks (SECON ’05), pp. 13–23, Santa Clara, Calif, USA,
September 2005.
[10] J. Elson and D. Estrin, “Time synchronization for wireless
sensor networks,” in Proceedings of the International Parallel
and Distributed Processing Symposium, 2001.
[11] G. Simon, M. Maróti, Á. Lédeczi, et al., “Sensor network-based
countersniper system,” in Proceedings of the 2nd International
Conference on Embedded Networked Sensor Systems (SenSys
’04), pp. 1–12, Baltimore, Md, USA, November 2004.
[12] G. T. Whipps, L. M. Kaplan, and R. Damarla, “Analysis
of sniper localization for mobile, asynchronous sensors,” in
Signal Processing, Sensor Fusion, and Target Recognition XVIII,
vol. 7336 of Proceedings of SPIE, 2009.
[13] E. Danicki, “Acoustic sniper localization,” Archives of Acoustics,
vol. 30, no. 2, pp. 233–245, 2005.
[14] L. M. Kaplan, T. Damarla, and T. Pham, “Qol for passive
acoustic gunfire localization,” in Proceedings of the 5th IEEE
International Conference on Mobile Ad-Hoc and Sensor Systems
(MASS ’08), pp. 754–759, Atlanta, Ga, USA, 2008.
[15] D. Lindgren, O. Wilsson, F. Gustafsson, and H. Habberstad,
“Shooter localization in wireless sensor networks,” in Proceed-
ings of the 12th International Conference on Information Fusion
(FUSION ’09), pp. 404–411, Seattle, Wash, USA, 2009.
[16] R. Stoughton, “Measurements of small-caliber ballistic shock
waves in air,” Journal of the Acoustical Society of America, vol.
102, no. 2, pp. 781–787, 1997.
[17] F. Gustafsson, Statistical Sensor Fusion, Studentlitteratur,
Lund, Sweden, 2010.
[18] E. Danicki, “The shock wave-based acoustic sniper localiza-
tion,” Nonlinear Analysis: Theory, Methods & Applications, vol.
65, no. 5, pp. 956–962, 2006.
[19] K. W. Lo and B. G. Ferguson, “A ballistic model-based method
for ranging direct fire weapons using the acoustic muzzle blast
and shock wave,” in Proceedings of the International Conference
on Intelligent Sensors, Sensor Networks and Information Pro-
cessing (ISSNIP ’08), pp. 453–458, December 2008.
[20] S. Kay, Fundamentals of Signal Processing: Estimation Theory,
Prentice Hall, Upper Saddle River, NJ, USA, 1993.
[21] N. Patwari, A. O. Hero III, M. Perkins, N. S. Correal, and
R. J. O’Dea, “Relative location estimation in wireless sensor
networks,” IEEE Transactions on Signal Processing, vol. 51, no.
8, pp. 2137–2148, 2003.
[22] S. Gezici, Z. Tian, G. B. Giannakis et al., “Localization via
ultra-wideband radios: a look at positioning aspects of future
sensor networks,” IEEE Signal Processing Magazine, vol. 22, no.
4, pp. 70–84, 2005.
[23] F. Gustafsson and F. Gunnarsson, “Possibilities and funda-
mental limitations of positioning using wireless commu-
nication networks measurements,” IEEE Signal Processing
Magazine, vol. 22, pp. 41–53, 2005.
