Speech Communication 67 (2015) 92–101
www.elsevier.com/locate/specom
Speech enhancement based on β-order MMSE estimation of Short Time Spectral Amplitude and Laplacian speech modeling
Hamid Reza Abutalebi *, Mehdi Rashidinejad
Electrical and Computer Engineering Dept., Yazd University, Yazd, Iran
Received 4 January 2014; received in revised form 20 November 2014; accepted 2 December 2014
Available online 9 December 2014
Abstract
This paper addresses the problem of speech enhancement using Minimum Mean-Square Error (MMSE) estimation of the β-order Short Time Spectral Amplitude (STSA). The motivation is to take advantage of both Laplacian speech modeling and the β-order cost function in MMSE estimation of clean speech. We present an analytical solution for the β-order MMSE STSA estimator assuming a Laplacian prior for the real and imaginary parts of the Discrete Fourier Transform (DFT) coefficients of (clean) speech. We also assume a Gaussian distribution for the real and imaginary parts of the DFT coefficients of the noise. The analytical solution, named β-order LapMMSE, does not have a closed form and is highly non-linear and computationally complex. Using some approximations for the joint probability density function and the Bessel function, we also present an improved closed-form version of the estimator (called β-order ImpLapMMSE). The value of β is adapted as a function of the frame Signal to Noise Ratio (SNR). We have compared the performance of the proposed estimator with state-of-the-art estimators that assume either Gaussian or Laplacian probability density functions for the real and imaginary parts of the DFT coefficients of clean speech. To this end, the input noisy signal and the outputs of the MMSE STSA, β-order STSA, and ImpLapMMSE estimators have been compared with the output of the proposed estimator. Our comparative evaluations in terms of Segmental SNR (SegSNR), Perceptual Evaluation of Speech Quality (PESQ), and Log-Likelihood Ratio (LLR) distance demonstrate the superior performance of the proposed β-order ImpLapMMSE estimator.
© 2014 Elsevier B.V. All rights reserved.
Keywords: Speech enhancement; Laplacian speech modeling; Spectral amplitude estimation; β-order MMSE
* Corresponding author at: Electrical and Computer Engineering Dept., Yazd University, Pajuhesh St., Safaieh, 89195-741 Yazd, Iran. Tel.: +98 (35)31232396; fax: +98 (35)38200144.
E-mail address: habutalebi@yazd.ac.ir (H.R. Abutalebi).
http://dx.doi.org/10.1016/j.specom.2014.12.002
0167-6393/© 2014 Elsevier B.V. All rights reserved.

1. Introduction

In an increasing number of speech processing applications, noise reduction is becoming an essential pre-processing component to improve system performance. In the past three decades, Minimum Mean Square Error (MMSE)-based single-channel speech enhancement algorithms have received a lot of attention. The fundamental work on this topic dates back to the presentation of the Short Time Spectral Amplitude (STSA) estimator of the clean speech signal by Ephraim and Malah (1984). In this method, the MMSE estimate of the clean speech STSA is obtained and combined with the short-time phase of the noisy signal to produce the enhanced speech signal. Ephraim and Malah then extended the STSA estimator to a Log Spectral Amplitude (LSA) estimator by utilizing a perceptually motivated cost function which minimizes the mean squared error between the LSAs of the clean and estimated signals (Ephraim and Malah, 1985). Considering speech presence uncertainty, a modified version of LSA, called the Optimally Modified-LSA (OM-LSA) estimator, was proposed by Cohen (2001).

On the other hand, as a good trade-off between speech distortion and noise reduction, You et al. (2005) proposed the β-order MMSE approach for the STSA estimation of clean speech. In their work, You et al. investigated the effectiveness of a range of β values in estimating the STSA based on the MMSE criterion, and discussed how the β value could be adapted using the frame signal-to-noise ratio (SNR). By incorporating the speech presence probability in the β-order MMSE approach, an improved version of the β-order MMSE estimator was also presented and applied in speech enhancement (Dashtbozorg and Abutalebi, 2009; You et al., 2009) and dereverberation tasks (Abutalebi and Dashtbozorg, 2011). In these works, some alternative approaches were also proposed for the adaptation of the β value; while (You et al., 2009) considered the β value as a function of both local and frame SNRs, in (Abutalebi and Dashtbozorg, 2011) the value of β was calculated through a linear relationship with the speech presence probability for each frame and each frequency bin. Furthermore, inspired by the masking property of the human auditory system, another generalization of the STSA estimator was proposed in (Loizou, 2005), where the MMSE cost function was weighted by the p-th power of the STSA of the clean speech. Finally, incorporating both aspects of spectral weighting and β-order MMSE estimation, a generalized weighted β-order STSA estimator was proposed and examined in (Plourde and Champagne, 2008; Deng et al., 2014).

However, most of the previous works (including those in (Ephraim and Malah, 1984, 1985; Cohen, 2001; You et al., 2005; Abutalebi and Dashtbozorg, 2011)) are based on assuming a Gaussian prior for the real and imaginary parts of the Discrete Fourier Transform (DFT) coefficients of the clean speech signal. This Gaussian assumption, however, holds asymptotically for long duration analysis frames, for which the span of the correlation of the signal is much shorter than the DFT size. While this assumption might hold for the real and imaginary parts of the noise DFT coefficients, it does not hold for the real and imaginary parts of the speech DFT coefficients, which are typically estimated using relatively short (20–30 ms) duration windows (Chen and Loizou, 2007). To resolve this shortcoming and improve the speech estimator, researchers have more recently sought a more appropriate statistical model for the real and imaginary parts of the DFT coefficients of the speech signal. In this way, non-Gaussian distributions have been employed to model the real and imaginary parts of the DFT coefficients of (clean) speech signals. Generally, a Laplacian or a Gamma probability density function (pdf) is used to model the real and imaginary parts of the clean DFT coefficients. For example, see (Breithaupt and Martin, 2003; Chen and Loizou 2005, 2007; Erkelens et al., 2007; Gazor and Zhang, 2005; Hendriks and Heusdens, 2010; Lotter and Vary, 2003; Martin 2002, 2005).

The suitability of the Laplacian or Gamma pdf has been validated by comparing different non-Gaussian distributions with the histograms of the real and imaginary parts of the clean speech DFT coefficients from a large speech database (Martin, 2002). This can also be justified by considering that the short size of the DFT frames results in a distribution tail which is not properly modeled by a Gaussian prior; instead, a Laplacian or Gamma distribution can model this heavy tail more appropriately.

The use of a Laplacian or Gamma pdf, however, complicates the derivation of the MMSE estimate of the magnitude spectrum. This complication arises partly because the magnitude and phase of the DFT coefficients are no longer independent when the real and imaginary parts of the DFT coefficients are modeled by a Laplacian (or Gamma) distribution (Chen and Loizou, 2007).

Chen and Loizou (2007) derived an approximate MMSE estimator (named LapMMSE) of the speech magnitude spectrum based on a Laplacian model for the real and imaginary parts of the speech DFT coefficients and a Gaussian model for the noise DFT coefficients. This estimator was derived under the assumption that the magnitude and phase of the complex DFT coefficients were independent. The present authors followed up that work in (Rashidinejad et al., 2010), where an improved version of LapMMSE (called ImpLapMMSE) was presented using appropriate approximations for the pdf of the speech magnitude spectrum and the Bessel function.

The motivation of the current work is to take advantage of both Laplacian speech modeling and the β-order MMSE approach for speech enhancement. In the present paper, starting with a formulation similar to that of (Chen and Loizou, 2007), we derive the β-order LapMMSE estimator when the clean speech DFT coefficients are modeled by a Laplacian distribution. However, the derived analytical solution is highly non-linear, computationally complex, and very time-consuming to implement. Hence, similar to our approach in (Rashidinejad et al., 2010), we apply some approximations for the Bessel function as well as for the pdf of the magnitude spectrum of the clean speech, to reduce the complexity of the estimator. This is shown to result in an improved closed form of the estimator, namely β-order ImpLapMMSE. In the proposed method, the order of the cost function (β) is adapted based on the frame SNR.

Simulation results on a wide range of noisy conditions demonstrate that the proposed method reduces the corrupting noise component more effectively, which results in less residual noise compared to many existing methods (with either Laplacian or Gaussian assumption).

In its use of a β-order cost function, our proposed method is similar to the work of Breithaupt et al. (2008), which presents an analytic solution for parameterized (or, β-order) MMSE estimation of the speech STSA based on a Chi-distribution for the prior model of the (clean) speech spectral magnitude.

The rest of the paper is organized as follows. In Section 2, we explain our formulation and derivation of the proposed β-order LapMMSE estimator. In Section 3, a closed form expression is derived as the β-order ImpLapMMSE estimator. In Section 4, we explain our evaluation process and discuss the resulting performance. Finally, Section 5 concludes the paper.
2. β-Order LapMMSE estimator

Let r(n) = s(n) + d(n), where s(n), d(n), and r(n) respectively denote the clean speech signal, the additive noise, and the received noisy signal. Applying the Short Time Fourier Transform (STFT), we have

$$R_k(l) = S_k(l) + D_k(l), \tag{1}$$

where l and k denote the discrete time (frame) and frequency indices, respectively. To simplify the notation, we omit l in the following and express the above equation as $R_k = S_k + D_k$. Using the exponential notation, the k-th spectral components of the clean speech signal and the noisy signal can be respectively expressed as

$$S_k = X_k e^{j\theta_k}, \tag{2.a}$$

$$R_k = Y_k e^{j\psi_k}. \tag{2.b}$$

The objective is to find $\hat{X}_k$, the estimate of the spectral amplitude (magnitude) of the clean speech. You et al. (2005) considered the following cost function:

$$J = E\left\{\left(X_k^{\beta} - \hat{X}_k^{\beta}\right)^2\right\}, \tag{3}$$

where E{·} is the expectation operator. The cost function measures the mean-square error between the β-order (clean) speech spectral amplitude and the β-order estimated spectral amplitude. By minimizing the cost function with respect to $\hat{X}_k$, the estimated value of the spectral amplitude is obtained as (You et al., 2005)

$$\hat{X}_k = \sqrt[\beta]{E\left\{X_k^{\beta} \mid R_k\right\}} = \left[\frac{\int_0^{\infty}\int_0^{2\pi} x_k^{\beta}\, f(R_k \mid x_k,\theta_k)\, f(x_k,\theta_k)\, d\theta_k\, dx_k}{\int_0^{\infty}\int_0^{2\pi} f(R_k \mid x_k,\theta_k)\, f(x_k,\theta_k)\, d\theta_k\, dx_k}\right]^{1/\beta}, \tag{4}$$

where $x_k$ denotes the sample value of $X_k$; also, $f(x_k,\theta_k)$ denotes the joint pdf of the magnitude (amplitude) and phase spectra, and $f(R_k \mid x_k,\theta_k)$ is given by Ephraim and Malah (1984)

$$f(R_k \mid x_k,\theta_k) = \frac{1}{\pi\lambda_d(k)} \exp\left\{-\frac{1}{\lambda_d(k)}\left|R_k - x_k e^{j\theta_k}\right|^2\right\}, \tag{5}$$

where $\lambda_d(k)$ is the variance of the k-th DFT coefficient of the noise. Assuming a Laplacian pdf for the real and imaginary parts of the speech DFT coefficients, $f(x_k,\theta_k)$ is given by Chen and Loizou (2007)

$$f(x_k,\theta_k) = \frac{x_k}{2\sqrt{\lambda_x(k)}} \exp\left\{-\frac{x_k}{\sqrt{\lambda_x(k)}}\left(|\cos\theta_k| + |\sin\theta_k|\right)\right\}, \tag{6}$$

where $\lambda_x(k)$ is the variance of the k-th clean DFT coefficient. Substituting (5) and (6) into (4) yields the following form of the estimator:

$$\hat{X}_k = \left[\frac{\int_0^{\infty} x_k^{\beta+1} \exp\left(-\frac{x_k^2}{\lambda_d(k)}\right) \int_0^{2\pi} \exp\left(\frac{2 x_k Y_k \cos\theta_k}{\lambda_d(k)} - \frac{x_k}{\sqrt{\lambda_x(k)}}\left(|\cos\theta_k| + |\sin\theta_k|\right)\right) d\theta_k\, dx_k}{\int_0^{\infty} x_k \exp\left(-\frac{x_k^2}{\lambda_d(k)}\right) \int_0^{2\pi} \exp\left(\frac{2 x_k Y_k \cos\theta_k}{\lambda_d(k)} - \frac{x_k}{\sqrt{\lambda_x(k)}}\left(|\cos\theta_k| + |\sin\theta_k|\right)\right) d\theta_k\, dx_k}\right]^{1/\beta}. \tag{7}$$

Let $\xi_k \triangleq \lambda_x(k)/\lambda_d(k)$ and $\gamma_k \triangleq Y_k^2/\lambda_d(k)$ respectively denote the a priori and a posteriori SNRs (McAulay and Malpass, 1980). We can express (7) in terms of the a priori and a posteriori SNRs as follows:

$$\hat{X}_k = \left[\frac{\int_0^{\infty} x_k^{\beta+1} \exp\left(-\frac{\gamma_k x_k^2}{Y_k^2}\right) \int_0^{2\pi} \exp\left(\frac{2 x_k \gamma_k \cos\theta_k}{Y_k} - \frac{x_k\sqrt{\gamma_k}}{Y_k\sqrt{\xi_k}}\left(|\cos\theta_k| + |\sin\theta_k|\right)\right) d\theta_k\, dx_k}{\int_0^{\infty} x_k \exp\left(-\frac{\gamma_k x_k^2}{Y_k^2}\right) \int_0^{2\pi} \exp\left(\frac{2 x_k \gamma_k \cos\theta_k}{Y_k} - \frac{x_k\sqrt{\gamma_k}}{Y_k\sqrt{\xi_k}}\left(|\cos\theta_k| + |\sin\theta_k|\right)\right) d\theta_k\, dx_k}\right]^{1/\beta}. \tag{8}$$

The above equation gives the β-order Laplacian MMSE estimator of the spectral magnitude, referred to in this article as β-order LapMMSE.

To the knowledge of the authors, (8) has no closed form solution. In Chen and Loizou (2007), by applying some approximations, a closed form solution was derived for the standard (β = 1) LapMMSE estimator (called ApLapMMSE) assuming that the magnitude and phase of the complex DFT coefficients of the clean speech signals are statistically independent. The marginal pdf of the phase can be derived from (6), using integration by parts, as follows:

$$\begin{aligned}
f(\theta_k) &= \int_0^{\infty} \frac{x_k}{2\sqrt{\lambda_x(k)}} \exp\left[-\frac{x_k}{\sqrt{\lambda_x(k)}}\left(|\cos\theta_k| + |\sin\theta_k|\right)\right] dx_k \\
&= \left.-\frac{x_k}{2\left(|\cos\theta_k| + |\sin\theta_k|\right)} \exp\left[-\frac{x_k}{\sqrt{\lambda_x(k)}}\left(|\cos\theta_k| + |\sin\theta_k|\right)\right]\right|_0^{\infty} \\
&\quad + \int_0^{\infty} \frac{1}{2\left(|\cos\theta_k| + |\sin\theta_k|\right)} \exp\left[-\frac{x_k}{\sqrt{\lambda_x(k)}}\left(|\cos\theta_k| + |\sin\theta_k|\right)\right] dx_k \\
&= 0 - \left.\frac{\sqrt{\lambda_x(k)}}{2\left(|\cos\theta_k| + |\sin\theta_k|\right)^2} \exp\left[-\frac{x_k}{\sqrt{\lambda_x(k)}}\left(|\cos\theta_k| + |\sin\theta_k|\right)\right]\right|_0^{\infty} \\
&= \frac{\sqrt{\lambda_x(k)}}{2 + 4\left|\cos\theta_k \sin\theta_k\right|}.
\end{aligned} \tag{9}$$

As for the marginal pdf of the magnitude, it has been shown in (Chen and Loizou, 2007) that

$$f(x_k) = \frac{2x_k}{\lambda_x(k)} \int_0^{\pi/4} \exp\left(-\sqrt{\frac{2}{\lambda_x(k)}}\, x_k \cos u\right) du, \quad x_k \ge 0. \tag{10}$$
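Because the double integrals in (7) have no closed form, one way to get a feel for the estimator is to evaluate them by brute-force quadrature. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `lap_mmse_beta`, the midpoint rule, the grid sizes, and the truncation of the magnitude axis are all illustrative choices introduced here.

```python
import math

def lap_mmse_beta(Y, lam_x, lam_d, beta, n_x=400, n_theta=180):
    """Midpoint-rule estimate of the ratio of double integrals in (7).

    Y is the noisy magnitude Y_k, lam_x and lam_d are the speech and noise
    variances, and beta is the order of the cost function.
    """
    # Assumption: the Gaussian factor makes the integrand negligible beyond
    # Y + 6*sqrt(lam_d), so the infinite magnitude axis is truncated there.
    x_max = Y + 6.0 * math.sqrt(lam_d)
    dx = x_max / n_x
    dth = 2.0 * math.pi / n_theta
    num = 0.0
    den = 0.0
    for i in range(n_x):
        x = (i + 0.5) * dx
        inner = 0.0
        for j in range(n_theta):
            th = (j + 0.5) * dth
            # Exponents of (7) merged into one term; the extra -Y^2/lam_d is a
            # constant that cancels in the ratio and keeps exp() from overflowing.
            expo = -(x * x - 2.0 * x * Y * math.cos(th) + Y * Y) / lam_d
            expo -= x / math.sqrt(lam_x) * (abs(math.cos(th)) + abs(math.sin(th)))
            inner += math.exp(expo)
        num += x ** (beta + 1.0) * inner
        den += x * inner
    # dx and dth are common to numerator and denominator, so they cancel.
    return (num / den) ** (1.0 / beta)

if __name__ == "__main__":
    for beta in (0.5, 1.0, 2.0):
        print(beta, lap_mmse_beta(Y=2.0, lam_x=1.0, lam_d=1.0, beta=beta))
```

Each call costs `n_x * n_theta` exponentials per frequency bin, which illustrates why a closed-form approximation is attractive. Since the result is a weighted power mean of the magnitude, the estimate cannot decrease as β grows.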
By considering

$$\exp(x\cos u) = I_0(x) + 2\sum_{n=1}^{\infty} I_n(x)\cos(nu), \tag{11}$$

where $I_n(\cdot)$ is the modified Bessel function of n-th order, the marginal pdf of the magnitude can be re-written as (Chen and Loizou, 2007)

$$f(x_k) = \frac{\pi x_k}{2\lambda_x(k)}\, I_0\!\left(-\frac{\sqrt{2}}{\sqrt{\lambda_x(k)}}\, x_k\right) + \frac{4x_k}{\lambda_x(k)} \sum_{n=1}^{\infty} \frac{1}{n}\, I_n\!\left(-\frac{\sqrt{2}}{\sqrt{\lambda_x(k)}}\, x_k\right) \sin\frac{n\pi}{4}, \quad x_k \ge 0. \tag{12}$$

Considering (6), (9) and (12), it is obvious that the speech spectral amplitude and phase are not statistically independent, i.e., $f(x_k,\theta_k) \neq f(x_k)\, f(\theta_k)$.

Nonetheless, further analysis by Chen and Loizou (2007) on the typical plots of $f(x_k,\theta_k)$ and $f(x_k)\, f(\theta_k)$ has revealed that the difference between these two densities is large around $x_k \approx 0$, but reduces to near zero for $x_k \ge 2$. So, for magnitude values in the range $x_k \ge 2$, the joint pdf $f(x_k,\theta_k)$ is almost equal to $f(x_k)\, f(\theta_k)$. This in turn means that the magnitude and phase can be considered nearly independent, at least for a specific range of magnitude values (say $x_k \ge 2$). Based on this observation, and assuming a uniform distribution for the marginal pdf of the phase (i.e. $f(\theta_k) \approx 1/2\pi$ for $\theta_k \in [0, 2\pi)$), it is possible to approximate the joint pdf as (Chen and Loizou, 2007)

$$f(x_k,\theta_k) \approx f(x_k)\, f(\theta_k) \approx \frac{x_k}{4\lambda_x(k)}\, I_0\!\left(-\frac{\sqrt{2}}{\sqrt{\lambda_x(k)}}\, x_k\right) + \frac{2x_k}{\pi\lambda_x(k)} \sum_{n=1}^{\infty} \frac{1}{n}\, I_n\!\left(-\frac{\sqrt{2}}{\sqrt{\lambda_x(k)}}\, x_k\right) \sin\frac{n\pi}{4}, \quad 0 \le \theta_k < 2\pi,\; x_k \ge 0. \tag{13}$$

Using a formulation similar to (Chen and Loizou, 2007), we generalize the estimator for the case of β-order Laplacian MMSE estimation. Substituting (5) and (13) into (4) and using (Gradshteyn and Ryzhik, 2000, Thm.6.633.1), the β-order ApLapMMSE estimator is obtained as

$$\hat{X}_k = \frac{1}{\sqrt{\gamma_k}} \sqrt[\beta]{\frac{A_k + B_k}{C_k + D_k}}\; Y_k, \tag{14}$$

where

$$A_k = \sum_{m=0}^{\infty} \frac{\Gamma\left(m + \frac{1}{2}\beta + 1\right)}{m!\,\Gamma(m+1)} \left(\frac{1}{2\xi_k}\right)^{m} F(-m, -m; 1; 2\xi_k\gamma_k), \tag{14.a}$$

$$B_k = \frac{8}{\pi} \sum_{n=1}^{\infty} \left[\frac{(-1)^n}{n} \sin\left(\frac{\pi n}{4}\right) \left(\frac{1}{2\xi_k}\right)^{n/2} \sum_{m=0}^{\infty} \left[\frac{\Gamma\left(m + \frac{1}{2}n + \frac{1}{2}\beta + 1\right)}{m!\,\Gamma(m+n+1)} \left(\frac{1}{2\xi_k}\right)^{m} F(-m, -n-m; 1; 2\xi_k\gamma_k)\right]\right], \tag{14.b}$$

$$C_k = \sum_{m=0}^{\infty} \frac{1}{m!} \left(\frac{1}{2\xi_k}\right)^{m} F(-m, -m; 1; 2\xi_k\gamma_k), \tag{14.c}$$

$$D_k = \frac{8}{\pi} \sum_{n=1}^{\infty} \left[\frac{(-1)^n}{n} \sin\left(\frac{n\pi}{4}\right) \left(\frac{1}{2\xi_k}\right)^{n/2} \sum_{m=0}^{\infty} \left[\frac{\Gamma\left(m + \frac{1}{2}n + 1\right)}{m!\,\Gamma(m+n+1)} \left(\frac{1}{2\xi_k}\right)^{m} F(-m, -n-m; 1; 2\xi_k\gamma_k)\right]\right], \tag{14.d}$$

and Γ(·) is the Gamma function. Also, F(a, b; c; z) is the Gaussian hypergeometric function defined as

$$F(a, b; c; z) \triangleq \sum_{n=0}^{\infty} \frac{(a)_n (b)_n}{(c)_n} \frac{z^n}{n!}, \tag{15}$$

where

$$(q)_n \triangleq \begin{cases} 1, & n = 0 \\ q(q+1)(q+2)\cdots(q+n-1), & n > 0. \end{cases} \tag{16}$$

As shown in (14.a)–(14.d), the derived β-order ApLapMMSE estimator is highly non-linear and computationally complex. In Section 3, we present another expression for $f(x_k,\theta_k)$ and apply some approximations for the Bessel function, which result in a computationally feasible estimator.

3. Derivation of the improved closed form approximation for β-order LapMMSE estimator

3.1. Approximation of joint pdf

As stated above, assuming the statistical independence of the magnitude and phase, and considering (11) for expanding the integrand of (10) in terms of Bessel functions, the β-order ApLapMMSE (Eq. (14)) is obtained. As mentioned before, (14) is highly complex and nonlinear; it also involves an infinite number of terms. To solve this problem, we exploit here an alternative expression for $f(x_k)$ (instead of (12)) to reach a more appropriate approximation for the joint pdf.

Using the Taylor series expansion for the integrand of (10) around $x_k = 0$, we have

$$\begin{aligned}
f(x_k) &= \frac{2x_k}{\lambda_x(k)} \int_0^{\pi/4} \exp\left(A x_k \cos u\right) du \\
&= \frac{2x_k}{\lambda_x(k)} \int_0^{\pi/4} \sum_{n=0}^{\infty} \frac{\left(A x_k \cos u\right)^n}{n!}\, du \\
&= \frac{2x_k}{\lambda_x(k)} \sum_{n=0}^{\infty} \frac{\left(A x_k\right)^n}{n!} \int_0^{\pi/4} \cos^n(u)\, du,
\end{aligned} \tag{17}$$

where $A \triangleq -\sqrt{2/\lambda_x(k)}$. Considering the values of the following integrals:
$$\begin{aligned}
\int_0^{\pi/4} \cos^0(u)\, du &= \int_0^{\pi/4} du = \frac{\pi}{4}, &
\int_0^{\pi/4} \cos(u)\, du &= \frac{\sqrt{2}}{2}, \\
\int_0^{\pi/4} \cos^2(u)\, du &= \frac{\pi + 2}{8}, &
\int_0^{\pi/4} \cos^3(u)\, du &= \frac{5\sqrt{2}}{12}, \\
\int_0^{\pi/4} \cos^4(u)\, du &= \frac{3\pi + 8}{32}, &
\int_0^{\pi/4} \cos^5(u)\, du &= \frac{43\sqrt{2}}{120}, \\
\int_0^{\pi/4} \cos^6(u)\, du &= \frac{15\pi + 44}{192}, & &\ldots
\end{aligned} \tag{18}$$

It can be easily deduced that

$$\int_0^{\pi/4} \cos^n(u)\, du \approx \frac{\pi}{4} \left(\frac{2\sqrt{2}}{\pi}\right)^{n}. \tag{19}$$

Substituting (19) in (17), we get

$$\begin{aligned}
f(x_k) &= \frac{2x_k}{\lambda_x(k)} \sum_{n=0}^{\infty} \frac{\left(A x_k\right)^n}{n!} \int_0^{\pi/4} \cos^n(u)\, du \\
&\approx \frac{\pi}{4}\, \frac{2x_k}{\lambda_x(k)} \sum_{n=0}^{\infty} \frac{\left(A x_k\right)^n}{n!} \left(\frac{2\sqrt{2}}{\pi}\right)^{n} \\
&= \frac{\pi}{4}\, \frac{2x_k}{\lambda_x(k)} \sum_{n=0}^{\infty} \frac{1}{n!} \left(\frac{2\sqrt{2}}{\pi} A x_k\right)^{n} \\
&= \frac{\pi x_k}{2\lambda_x(k)} \exp\left(\frac{2\sqrt{2}}{\pi} A x_k\right), \quad x_k \ge 0.
\end{aligned} \tag{20}$$

Considering the constraint $\int_0^{+\infty} f(x_k)\, dx_k = 1$, $f(x_k,\theta_k)$ can be approximated as

$$f(x_k,\theta_k) \approx \frac{1}{2\pi} f(x_k) = \frac{8x_k}{\pi^3 \lambda_x(k)} \exp\left(-\frac{4}{\pi\sqrt{\lambda_x(k)}}\, x_k\right), \quad 0 \le \theta_k < 2\pi,\; x_k \ge 0. \tag{21}$$

Substituting (5) and (21) into (4) yields

$$\hat{X}_k = \left[\frac{\int_0^{\infty} x_k^{\beta+1} \exp\left(-\frac{1}{\lambda_d(k)} x_k^2 - \frac{4}{\pi\sqrt{\lambda_x(k)}}\, x_k\right) I_0\!\left(\frac{2x_k Y_k}{\lambda_d(k)}\right) dx_k}{\int_0^{\infty} x_k \exp\left(-\frac{1}{\lambda_d(k)} x_k^2 - \frac{4}{\pi\sqrt{\lambda_x(k)}}\, x_k\right) I_0\!\left(\frac{2x_k Y_k}{\lambda_d(k)}\right) dx_k}\right]^{1/\beta}. \tag{22}$$

Since there is as yet no closed form solution for these integrals (the numerator and denominator of (22)), we suggest the following approximations for the Bessel function to reach a closed form low-complexity estimator.

3.2. Approximation of the Bessel function

To reach a closed form solution for (22), we first propose the use of the Taylor series expansion of $I_0(\cdot)$ around x = 0, i.e.

$$I_0(x; M) \approx \sum_{m=0}^{M-1} (x/2)^{2m} (1/m!)^2. \tag{23}$$

Considering this expansion and using (Gradshteyn and Ryzhik, 2000, Thm.3.462.1) and (Abramowitz and Stegun, 1964), we get

$$\hat{X}_k = \frac{1}{\sqrt{2\gamma_k}} \left[\frac{\sum_{m=0}^{M-1} \frac{1}{(m!)^2} \left(\frac{\gamma_k}{2}\right)^{m} \Gamma(\beta + 2m + 2)\, D_{-(\beta+2m+2)}(T_k)}{\sum_{m=0}^{M-1} \frac{1}{(m!)^2} \left(\frac{\gamma_k}{2}\right)^{m} \Gamma(2m + 2)\, D_{-(2m+2)}(T_k)}\right]^{1/\beta} Y_k, \tag{24}$$

where $T_k \triangleq \frac{2\sqrt{2}}{\pi\sqrt{\xi_k}}$. Also, $D_p(\cdot)$ is the parabolic cylinder function defined as

$$D_p(z) \triangleq 2^{p/2} e^{-z^2/4} \left\{\frac{\sqrt{\pi}}{\Gamma\left(\frac{1-p}{2}\right)} M\left(-\frac{p}{2}, \frac{1}{2}; \frac{z^2}{2}\right) - \frac{\sqrt{2\pi}\, z}{\Gamma\left(-\frac{p}{2}\right)} M\left(\frac{1-p}{2}, \frac{3}{2}; \frac{z^2}{2}\right)\right\}, \tag{25}$$

where M(a, b; z) is the confluent hypergeometric function,

$$M(a, b; z) \triangleq \sum_{n=0}^{\infty} \frac{(a)_n}{(b)_n} \frac{z^n}{n!}, \tag{26}$$

and $(a)_n$ and $(b)_n$ are calculated as defined in (16).

The accuracy of the truncated Taylor series expansion of the Bessel function depends on the value of x; larger values of x (which correspond to higher SNRs) require more terms in the series expansion (larger M). As the number of summation terms (M) increases, (24) becomes a good approximation of (22) (Rashidinejad et al., 2010). However, the resulting estimator is still computationally demanding. As another solution, we consider the following well-known approximation of the Bessel function:

$$I_0(x) \approx \left(1/\sqrt{2\pi x}\right) \exp(x). \tag{27}$$

Again, using (Gradshteyn and Ryzhik, 2000, Thm.3.462.1) and (Abramowitz and Stegun, 1964), this results in

$$\hat{X}_k = \frac{1}{\sqrt{2\gamma_k}} \left(\frac{\Gamma\left(\beta + \frac{3}{2}\right) D_{-\left(\beta+\frac{3}{2}\right)}(P_k)}{\Gamma\left(\frac{3}{2}\right) D_{-\frac{3}{2}}(P_k)}\right)^{1/\beta} Y_k, \tag{28}$$

where $P_k \triangleq \frac{2\sqrt{2}}{\pi\sqrt{\xi_k}} - \sqrt{2\gamma_k}$.

Eq. (28) now presents an improved low-complexity version of the proposed estimator (which is referred to as the β-order ImpLapMMSE). Finally, the clean speech component is obtained using the inverse STFT and the weighted overlap-add method.

To check the suitability of the utilized approximations of the Bessel function, in Fig. 1 the values of these approximations ((23) and (27)) have been compared with the exact values of the Bessel function, $I_0(x)$, in the range 0 < x < 15. For approximation (23), we have considered M = 40. For ease of comparison, the values have been drawn separately for 0 < x < 5, 5 < x < 10, and 10 < x < 15. As seen in Fig. 1, for small values of x (say 0 < x < 5), (23) presents a more adequate approximation (at the cost of higher computational load); for x > 5, both (23) and (27) are acceptable approximations of the Bessel function.
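The comparison between (23) and (27) can be reproduced numerically with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the reference value is computed from the standard integral representation $I_0(x) = \frac{1}{\pi}\int_0^{\pi} e^{x\cos t}\,dt$ with a midpoint rule, which is an assumption of this example rather than the method used to draw the figure.

```python
import math

def i0_reference(x, n=2000):
    """Reference I_0(x) from (1/pi) * integral_0^pi exp(x cos t) dt (midpoint rule)."""
    h = math.pi / n
    return sum(math.exp(x * math.cos((i + 0.5) * h)) for i in range(n)) * h / math.pi

def i0_taylor(x, M=40):
    """Truncated series (23): sum over m < M of (x/2)^(2m) / (m!)^2."""
    return sum((x / 2.0) ** (2 * m) / math.factorial(m) ** 2 for m in range(M))

def i0_asymptotic(x):
    """Large-argument approximation (27): exp(x) / sqrt(2 pi x)."""
    return math.exp(x) / math.sqrt(2.0 * math.pi * x)

if __name__ == "__main__":
    for x in (0.5, 2.0, 5.0, 10.0, 15.0):
        ref = i0_reference(x)
        err23 = abs(i0_taylor(x) / ref - 1.0)
        err27 = abs(i0_asymptotic(x) / ref - 1.0)
        print(f"x={x:5.1f}  rel.err (23)={err23:.2e}  rel.err (27)={err27:.2e}")
```

With M = 40 the truncated series stays accurate over the whole range 0 < x < 15, while the relative error of (27) shrinks as x grows, in line with the observation above.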
[Figure 1: three panels, (a), (b) and (c), plotting Bessel(x) against x; curves: Bessel(x), approximation (equation (23)), approximation (equation (27)).]
Fig. 1. Comparative plot of the exact Bessel function and the utilized approximations ((23) and (27)) for the ranges (a) 0 < x < 5, (b) 5 < x < 10, and (c) 10 < x < 15.
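As a numerical sanity check on the closed form (28), the gain $\hat{X}_k / Y_k$ can be evaluated without a special-function library. The sketch below relies on the standard integral representation $\Gamma(\nu)\, D_{-\nu}(z) = e^{-z^2/4} \int_0^{\infty} t^{\nu-1} e^{-zt - t^2/2}\, dt$ (valid for ν > 0), so that the exponential prefactors cancel in the ratio of (28). The function names, the substitution t = u², the Simpson rule and the truncation point are assumptions made for this illustration only.

```python
import math

def _weighted_moment(nu, P, n=4000):
    """Simpson estimate of integral_0^inf t^(nu-1) exp(-(t+P)^2/2) dt, with t = u^2."""
    # Assumption: the integrand is negligible beyond t = max(0, -P) + 12.
    u_max = math.sqrt(max(0.0, -P) + 12.0)
    h = u_max / n
    total = 0.0
    for i in range(n + 1):
        u = i * h
        # dt = 2u du turns t^(nu-1) into 2 u^(2 nu - 1), which is smooth at u = 0.
        f = 2.0 * u ** (2.0 * nu - 1.0) * math.exp(-0.5 * (u * u + P) ** 2)
        w = 1.0 if i in (0, n) else (4.0 if i % 2 else 2.0)
        total += w * f
    return total * h / 3.0

def implap_gain(xi, gamma, beta):
    """Gain X_hat / Y of (28) for a priori SNR xi and a posteriori SNR gamma."""
    P = 2.0 * math.sqrt(2.0) / (math.pi * math.sqrt(xi)) - math.sqrt(2.0 * gamma)
    # exp(-(t+P)^2/2) differs from exp(-P t - t^2/2) by a constant that cancels here.
    ratio = _weighted_moment(beta + 1.5, P) / _weighted_moment(1.5, P)
    return ratio ** (1.0 / beta) / math.sqrt(2.0 * gamma)

if __name__ == "__main__":
    for xi, gamma in ((0.1, 1.0), (1.0, 2.0), (100.0, 50.0)):
        print(xi, gamma, implap_gain(xi, gamma, beta=1.0))
```

In this form the gain is a power mean of t under a fixed positive weight, so it cannot decrease with β, and it approaches unity when both SNRs are high.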
4. Implementation and performance evaluation

To evaluate the performance of the proposed estimator (β-order ImpLapMMSE), we have compared it with MMSE-STSA (Ephraim and Malah, 1984), β-order MMSE (You et al., 2005), and our recently proposed estimator, ImpLapMMSE (Rashidinejad et al., 2010). Unlike the first two reference methods (Ephraim and Malah, 1984; You et al., 2005), which consider a Gaussian pdf for speech, ImpLapMMSE and β-order ImpLapMMSE assume Laplacian priors.

For simulation, twenty (clean) speech signals (sampled at 16 kHz) were selected randomly from the TIMIT database (Garofolo et al., 1993) (10 male speakers, 10 female speakers). In accordance with the two main references (Chen and Loizou, 2007; You et al., 2005), in the first set of experiments, we corrupted the speech signals with white Gaussian noise as well as F-16 cockpit noise from the Noisex database (Varga and Steeneken, 1993), covering a wide range of input SNRs (−10 dB, −5 dB, −2 dB, 0 dB, 5 dB, and 10 dB). The noise signals were sampled at 16 kHz. In all the experiments, 15 ms, 50% overlapped Hamming windows were employed. To estimate the noise Power Spectral Density (PSD), we utilized the bias compensated MMSE-based method that has been proposed in (Gerkmann and Hendriks, 2012). The ‘decision-directed’ approach (Ephraim and Malah, 1984) was used to compute the a priori SNR, $\xi_k$, with α = 0.98.

Similar to You et al. (2005), in our simulations, the value of β (in Eq. (28)) was adapted in each frame as a function of the frame SNR, Ξ(l), which is defined as

$$\Xi(l) = 10\log_{10} \frac{\sum_{k=0}^{(N/2)-1} \left|\,|R_k(l)| - \sqrt{\lambda_d(l,k)}\,\right|^2}{\sum_{k=0}^{(N/2)-1} \lambda_d(l,k)}, \tag{29}$$

and N is the frame length. Based on the performance analysis of the β-order MMSE estimator, You et al. (2005) proposed the following formula for the value of β in the l-th frame:

$$\beta(l) = \mu_1 \Xi(l) + \mu_2, \tag{30}$$

where $\mu_1$ and $\mu_2$ are two linear coefficients. It has been seen that this linear relationship between β and Ξ(l) can help to preserve weak speech spectral components in strong speech frames. Actually, (30) is an empirical formula with parameters $\mu_1$ and $\mu_2$ that are obtained from simulation experiments. In addition to adapting β based on the frame SNR, the value of β is also limited to a maximum value of 4. This is because too high β values will leave most of the noise components intact. So, the final β value can be expressed by (You et al., 2005)

$$\beta(l) = \max\left\{\min\left[\mu_1 \Xi(l) + \mu_2,\; \mu_3\right],\; \mu_4\right\}. \tag{31}$$

In our simulations, similar to the work of You et al. (2005), we set $\mu_1$ = 0.25, $\mu_2$ = 1.75, $\mu_3$ = 4, and $\mu_4$ = 0.001.

To evaluate the performance of the estimators in the speech enhancement task, we have used three common objective measures for speech quality: SegSNR, PESQ, and LLR. These measures have been selected based on their considerable correlation with subjective quality assessments (Hu and Loizou, 2008). In the SegSNR measurement, we compute the SNR of each frame of the test (noisy/processed) signal and then average over all frames. The LLR measure for each frame of the test signal is defined as

$$\mathrm{LLR} = \log \frac{a_t^{T} R_s\, a_t}{a_s^{T} R_s\, a_s}, \tag{32}$$

where $a_s$ is the Linear Prediction Coefficient (LPC) vector of the original speech frame, $a_t$ is the LPC vector of the test speech frame, and $R_s$ is the autocorrelation matrix of the original (clean) speech signal. The mean LLR value is computed by averaging frame LLRs across all frames. PESQ is an objective measure that predicts the subjective Mean Opinion Score (MOS) for the given test signal. The PESQ score is mapped to a MOS-like scale in the range of −0.5 to 4.5. Superior performance (greater noise reduction, improved speech quality, and less speech distortion) corresponds to higher SegSNR and PESQ values and lower LLR values.

The evaluation results for the case of white Gaussian noise and F-16 noise have been listed in Tables 1–3 in terms of SegSNR, PESQ and LLR, respectively. The reported values are the averages over twenty input signals. The best value in each column has been highlighted in bold case. As shown, the Laplacian-based estimators (ImpLapMMSE and β-order ImpLapMMSE) have produced superior output signals compared to the Gaussian-based estimators (MMSE and β-order MMSE).

Table 1
Comparative performance, in terms of Segmental SNR, of the Gaussian-based MMSE-STSA and β-order MMSE, and the Laplacian-based ImpLapMMSE and β-order ImpLapMMSE estimators in the case of White and F-16 noises.

                        White noise input SNR (dB)            F-16 noise input SNR (dB)
Estimator               −10   −5    −2    0     5     10      −10   −5    −2    0     5     10
Noisy signal (Input)    9.5   8.5   7.6   6.5   4.5   3.0     8.4   7.3   6.1   5.3   4.1   3.0
MMSE-STSA               4.7   4.1   3.4   2.9   1.7   0.5     4.5   3.7   3.2   2.5   1.2   0.9
β-order MMSE            4.4   3.6   3.1   2.5   1.2   0.8     4.1   3.1   2.6   2.1   0.9   1.1
ImpLapMMSE              3.9   2.8   2.2   1.7   0.8   1.1     3.5   2.0   1.7   1.1   0.6   1.6
β-order ImpLapMMSE      3.1   2.2   1.1   0.3   0.9   2.4     2.8   1.7   0.9   0.1   1.1   2.6

Table 2
Comparative performance, in terms of PESQ, of the Gaussian-based MMSE-STSA and β-order MMSE, and the Laplacian-based ImpLapMMSE and β-order ImpLapMMSE estimators in the case of White and F-16 noises.

                        White noise input SNR (dB)            F-16 noise input SNR (dB)
Estimator               −10   −5    −2    0     5     10      −10   −5    −2    0     5     10
MMSE-STSA               1.0   1.3   1.5   1.7   2.2   2.5     1.1   1.5   1.8   1.9   2.3   2.6
β-order MMSE            1.1   1.4   1.6   1.8   2.3   2.5     1.2   1.6   1.8   2.0   2.3   2.6
ImpLapMMSE              1.2   1.4   1.7   1.9   2.4   2.7     1.3   1.8   2.0   2.1   2.4   2.7

Table 3
Comparative performance, in terms of LLR, of the Gaussian-based MMSE-STSA and β-order MMSE, and the Laplacian-based ImpLapMMSE and β-order ImpLapMMSE estimators in the case of White and F-16 noises.

                        White noise input SNR (dB)            F-16 noise input SNR (dB)
Estimator               −10   −5    −2    0     5     10      −10   −5    −2    0     5     10
MMSE-STSA               3.1   2.7   2.6   2.5   2.2   2.0     1.5   1.3   1.2   1.1   1.0   0.9
β-order MMSE            2.9   2.6   2.5   2.4   2.2   2.0     1.5   1.3   1.2   1.1   1.0   1.0
ImpLapMMSE              2.8   2.5   2.4   2.3   2.2   1.9     1.4   1.2   1.1   1.0   0.9   0.8

Table 4
Comparative performance, in terms of Segmental SNR, of the Gaussian-based MMSE-STSA and β-order MMSE, and the Laplacian-based ImpLapMMSE and β-order ImpLapMMSE estimators in the case of Babble and Pink noises.

                        Babble noise input SNR (dB)    Pink noise input SNR (dB)
Estimator               −10   0     10                 −10   0     10
Noisy signal (Input)    9.1   6.1   1.6                8.6   5.5   1.1
MMSE-STSA               7.1   3.9   0.2                5.1   1.8   1.2
β-order MMSE            6.8   3.5   0.5                4.9   1.7   1.4
ImpLapMMSE              5.9   3.1   0.9                4.6   1.4   1.8
β-order ImpLapMMSE      5.1   2.7   1.9                3.7   1.1   2.1

Table 5
Comparative performance, in terms of PESQ, of the Gaussian-based MMSE-STSA and β-order MMSE, and the Laplacian-based ImpLapMMSE and β-order ImpLapMMSE estimators in the case of Babble and Pink noises.

                        Babble noise input SNR (dB)    Pink noise input SNR (dB)
Estimator               −10   0     10                 −10   0     10
Noisy signal (Input)    1.0   1.4   2.1                0.8   1.3   2.1
MMSE-STSA               1.2   1.6   2.4                1.2   2.1   2.8
β-order MMSE            1.3   1.7   2.4                1.2   2.2   2.9
ImpLapMMSE              1.4   1.7   2.5                1.3   2.3   3.0
β-order ImpLapMMSE      1.5   1.8   2.6                1.4   2.4   3.1

Table 6
Comparative performance, in terms of LLR, of the Gaussian-based MMSE-STSA and β-order MMSE, and the Laplacian-based ImpLapMMSE and β-order ImpLapMMSE estimators in the case of Babble and Pink noises.

                        Babble noise input SNR (dB)    Pink noise input SNR (dB)
Estimator               −10   0     10                 −10   0     10
Noisy signal (Input)    1.6   1.3   1.1                1.7   1.3   1.0
MMSE-STSA               1.5   1.3   0.9                1.4   1.0   0.7
β-order MMSE            1.5   1.2   0.8                1.4   0.9   0.7
ImpLapMMSE              1.4   1.2   0.8                1.3   0.8   0.6
β-order ImpLapMMSE      1.4   1.1   0.8                1.3   0.7   0.6

Fig. 2. Comparison of the effect of frame length: comparative performance, in terms of (a) Segmental SNR, (b) PESQ, and (c) LLR, of the Gaussian-based β-order MMSE and β-order ImpLapMMSE estimators at the frame lengths of 15 ms and 30 ms.

Fig. 3. Sample spectrograms of: (a) clean speech signal (TIMIT sentences “Big dogs can be dangerous. The shop closes for lunch.” uttered by a male speaker), (b) noisy signal (white Gaussian noise, 0 dB), (c) the output of the Gaussian-based MMSE-STSA estimator, (d) the output of the Gaussian-based β-order MMSE estimator, (e) the output of the Laplacian-based ImpLapMMSE estimator, and (f) the output of the β-order ImpLapMMSE estimator.
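For reference, the frame-SNR-driven adaptation of β described in (29)–(31) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the list-based inputs and the small floor inside the logarithm are assumptions introduced here, while the default coefficients are the values quoted for the simulations.

```python
import math

def frame_snr_db(noisy_mag, noise_psd):
    """Frame SNR of (29): noisy_mag[k] = |R_k(l)|, noise_psd[k] = lambda_d(l, k)."""
    num = sum((abs(r) - math.sqrt(p)) ** 2 for r, p in zip(noisy_mag, noise_psd))
    den = sum(noise_psd)
    # Assumption: a tiny floor avoids log10(0) when a frame sits exactly on the noise floor.
    return 10.0 * math.log10(max(num, 1e-12) / den)

def adapt_beta(snr_db, mu1=0.25, mu2=1.75, mu3=4.0, mu4=0.001):
    """Clipped linear rule of (31): linear in the frame SNR, limited to [mu4, mu3]."""
    return max(min(mu1 * snr_db + mu2, mu3), mu4)

if __name__ == "__main__":
    # Hypothetical bins of one frame: noisy magnitudes and noise PSD estimates.
    mags = [3.0, 2.5, 1.2, 0.8]
    psd = [1.0, 1.0, 1.0, 1.0]
    snr = frame_snr_db(mags, psd)
    print(snr, adapt_beta(snr))
```

The clipping keeps β at 4 for strong speech frames and just above zero for noise-dominated frames, so the exponent 1/β in (28) stays well defined.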
To extend our evaluations to other noise types, in the second set of experiments we repeated our tests on speech signals degraded by babble and pink noises from the Noisex database (Varga and Steeneken, 1993), at three input SNRs (−10 dB, 0 dB, and 10 dB). The results are presented in Tables 4–6 in terms of SegSNR, PESQ, and LLR, respectively.

As seen in Tables 1–6, the proposed estimator (β-order ImpLapMMSE) consistently yields higher SegSNR, lower LLR, and higher PESQ values in all cases of white, F-16, babble, and pink noises. The results show that, for both high and low input SNRs, the proposed β-order ImpLapMMSE estimator yields superior noise reduction, demonstrating the efficiency of the Laplacian assumption for speech priors.

As mentioned in Section 1, the suitability of Gaussian or Laplacian distributions for modeling the real and imaginary parts of speech DFT samples depends on the duration of the analysis frames (or the DFT size). It is therefore informative to examine the effect of the frame size on the performance of Gaussian- and Laplacian-based estimators. To this end, in the next set of experiments, we compared the performance of the β-order MMSE and β-order ImpLapMMSE estimators for 15 ms and 30 ms frames. The tests were done on speech signals corrupted by white Gaussian noise. As shown in Fig. 2, the β-order ImpLapMMSE estimator outperforms the β-order MMSE estimator for both 15 ms and 30 ms frames. Our tests also show that the outputs for 15 ms frames are slightly better than those for 30 ms frames.

To further analyze the performance of the estimators, we have provided sample spectrograms in Fig. 3. This figure shows the spectrograms of sample output signals of the MMSE-STSA (Ephraim and Malah, 1984), β-order MMSE (You et al., 2005), ImpLapMMSE (Rashidinejad et al., 2010), and the proposed β-order ImpLapMMSE estimators, along with those of the clean and noisy signals. The clean signal consists of the TIMIT sentences “Big dogs can be dangerous. The shop closes for lunch.” uttered by a male speaker. The noisy signal was produced by adding white Gaussian noise to the clean signal at an SNR of 0 dB. As can be seen, the outputs of the Laplacian-based estimators (ImpLapMMSE and β-order ImpLapMMSE) have less residual noise.

Furthermore, we validated the comparative results through informal listening tests with five listeners. These tests show that the β-order ImpLapMMSE estimator produces lower residual noise than the state-of-the-art estimators.

5. Summary and conclusion

In this paper, we focus on speech enhancement using a β-order STSA MMSE estimator in which the real and imaginary parts of the clean speech DFT coefficients are modeled by a Laplacian prior. The resulting analytical solution is highly complex. Therefore, using approximations for the joint pdf as well as the Bessel function, we derive the β-order ImpLapMMSE estimator. Comparing the proposed method with alternative state-of-the-art approaches (namely, MMSE STSA, β-order STSA, and ImpLapMMSE) shows that the proposed estimator is more effective in reducing additive noise. Also, similar to what was reported in (Chen and Loizou, 2005), the proposed β-order ImpLapMMSE estimator yields less residual noise, less speech distortion, and better overall performance.

It should be mentioned that the parameters of the β-adaptation formula and the other parameters of the proposed method have been set empirically based on experimental results. By choosing a more appropriate function for β-adaptation and/or extending the experiments for parameter setting, the outputs of the β-order estimator can be further improved.

Motivated by previous experience with β-order estimators (Abutalebi and Dashtbozorg, 2011; You et al., 2005), we are continuing this research to propose an adaptive procedure for calculating the optimum value of β in each frame and each frequency bin. We are also examining the performance of the proposed estimators in the problem of speech de-reverberation. Derivation of the MMSE (or β-order MMSE) STSA estimator under other non-Gaussian priors is another direction for future work.

References

Abramowitz, M., Stegun, I.A., 1964. Handbook of Mathematical Functions. Dover, New York.
Abutalebi, H.R., Dashtbozorg, B., 2011. Speech dereverberation in noisy environments using an adaptive minimum mean square estimator. IET Signal Proc. 5 (2), 130–137.
Breithaupt, C., Martin, R., 2003. MMSE estimation of magnitude squared DFT coefficients with supergaussian priors. In: Proc. IEEE ICASSP. Hong Kong, pp. 896–899.
Breithaupt, C., Krawczyk, M., Martin, R., 2008. Parameterized MMSE spectral magnitude estimation for the enhancement of noisy speech. In: Proc. IEEE ICASSP. Las Vegas, Nevada, USA, pp. 4037–4040.
Chen, B., Loizou, P.C., 2005. Speech enhancement using a MMSE short time spectral amplitude estimator with Laplacian speech modeling. In: Proc. IEEE ICASSP. Philadelphia, Pennsylvania, USA, pp. 1097–1100.
Chen, B., Loizou, P.C., 2007. A Laplacian-based MMSE estimator for speech enhancement. Speech Commun. 49, 134–143.
Cohen, I., 2001. On speech enhancement under signal presence uncertainty. In: Proc. IEEE ICASSP. Salt Lake City, Utah, USA.
Dashtbozorg, B., Abutalebi, H.R., 2009. Adaptive MMSE speech spectral amplitude estimator under signal presence uncertainty. In: Proc. 17th European Signal Processing Conference (EUSIPCO). Glasgow, Scotland, pp. 209–212.
Deng, F., Bao, F., Bao, C.-C., 2014. Speech enhancement using generalized weighted β-order spectral amplitude estimator. Speech Commun. 59, 55–68.
Ephraim, Y., Malah, D., 1984. Speech enhancement using a minimum mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Process. 32 (6), 1109–1121.
Ephraim, Y., Malah, D., 1985. Speech enhancement using a minimum mean square error log-spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Process. 33 (2), 443–445.
Erkelens, J.S., Hendriks, R.C., Heusdens, R., Jensen, J., 2007. Minimum mean-square error estimation of discrete Fourier coefficients with generalized Gamma priors. IEEE Trans. Audio, Speech, Language Process. 15 (6), 1741–1752.
Garofolo, J.S., et al., 1993. TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium, Philadelphia. <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1> (accessed December 2013).
Gazor, S., Zhang, W., 2005. Speech enhancement employing Laplacian–Gaussian mixture. IEEE Trans. Speech, Audio Process. 13 (5), 896–904.
Gradshteyn, I., Ryzhik, I., 2000. Table of Integrals, Series and Products, sixth ed. Academic, New York.
Gerkmann, T., Hendriks, R.C., 2012. Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. IEEE Trans. Audio, Speech, Language Process. 20 (4), 1383–1393.
Hendriks, R.C., Heusdens, R., 2010. On linear versus non-linear magnitude-dft estimators and the influence of super-Gaussian speech priors. In: Proc. IEEE ICASSP. Dallas, Texas, USA.
Hu, Y., Loizou, P.C., 2008. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio, Speech, Language Process. 16 (1), 229–238.
Loizou, P.C., 2005. Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum. IEEE Trans. Speech, Audio Process. 13 (5), 857–869.
Lotter, T., Vary, P., 2003. Noise reduction by maximum a posteriori spectral amplitude estimation with Supergaussian speech modeling. In: Proc. Int. Workshop Acoustic Echo Noise Control. Kyoto, Japan.
Martin, R., 2002. Speech enhancement using MMSE short time spectral estimation with Gamma distributed priors. In: Proc. IEEE ICASSP. Orlando, Florida, USA, pp. 504–512.
Martin, R., 2005. Speech enhancement based on minimum mean-square error estimation and supergaussian priors. IEEE Trans. Speech, Audio Process. 13 (5), 845–856.
McAulay, R.J., Malpass, M.L., 1980. Speech enhancement using a soft-decision noise suppression filter. IEEE Trans. Acoust., Speech, Signal Process. 28 (2), 137–145.
Plourde, E., Champagne, B., 2008. Auditory-based spectral amplitude estimators for speech enhancement. IEEE Trans. Acoust., Speech, Signal Process. 16 (8), 1614–1623.
Rashidinejad, M., Abutalebi, H.R., Tadaion, A.A., 2010. Speech enhancement using an improved MMSE estimator with Laplacian prior. In: Proc. 5th Int. Symp. on Telecommunication. Tehran, Iran.
Varga, A., Steeneken, H., 1993. Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12 (3), 247–251.
You, C.H., Koh, S.N., Rahardja, S., 2005. β-Order MMSE spectral amplitude estimation for speech enhancement. IEEE Trans. Speech, Audio Process. 13 (4), 475–486.
You, C.H., Koh, S.N., Li, H., Rahardja, S., 2009. Improved adaptive β-order MMSE speech enhancement. In: Proc. 2009 Annual Summit and Conference of Asia-Pacific Signal and Information Processing Association, pp. 797–800.
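
As an illustration of the β-adaptation discussed in Section 5, the following minimal sketch assumes the clamped-linear rule of Eq. (31), in which β grows linearly with the frame SNR and is then limited to a fixed range. The function name, the argument names, the use of dB for the frame SNR, and all numerical values are illustrative assumptions; they are not the empirically tuned settings used in the experiments.

```python
def adapt_beta(frame_snr_db, mu1, mu2, beta_max, beta_min):
    """Clamped-linear beta adaptation (hypothetical helper).

    Computes max(min(mu1 * frame_snr_db + mu2, beta_max), beta_min):
    a linear map of the frame SNR, limited to [beta_min, beta_max].
    """
    linear = mu1 * frame_snr_db + mu2      # linear dependence on frame SNR
    upper_limited = min(linear, beta_max)  # inner min: cap from above
    return max(upper_limited, beta_min)    # outer max: floor from below


if __name__ == "__main__":
    # Placeholder parameters chosen only to exercise the three regimes.
    params = dict(mu1=0.05, mu2=0.5, beta_max=2.0, beta_min=0.1)
    for snr_db in (-100.0, 0.0, 100.0):
        print(snr_db, adapt_beta(snr_db, **params))
```

With such a rule, frames with very low SNR use the floor value, frames with very high SNR use the cap, and only intermediate frames follow the linear segment. A per-bin adaptation of the kind mentioned as future work would replace the single frame SNR with one value per frequency bin.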
