
electronics

Article
Speech Enhancement for Secure Communication
Using Coupled Spectral Subtraction and Wiener Filter
Hilman Pardede 1, Kalamullah Ramli 2,*, Yohan Suryanto 2, Nur Hayati 2 and Alfan Presekal 2
1 Research Center for Informatics, Indonesian Institute of Sciences, Jawa Barat 40135, Indonesia
2 Department of Electrical Engineering, University of Indonesia, Jawa Barat 16424, Indonesia
* Correspondence: kalamullah.ramli@ui.ac.id; Tel.: +62-811-8606-457

Received: 28 June 2019; Accepted: 13 August 2019; Published: 14 August 2019

Abstract: The encryption process for secure voice communication may degrade the speech quality
when it is applied to the speech signals before encoding them through a conventional communication
system such as GSM or radio trunking. This is because the encryption process usually includes a
randomization of the speech signals, and hence, when the speech is decrypted, it may perceptibly
be distorted, so satisfactory speech quality for communication is not achieved. To deal with this,
we could apply a speech enhancement method to improve the quality of decrypted speech. However,
many speech enhancement methods work by assuming noise is present all the time, so the voice
activity detector (VAD) is applied to detect the non-speech period to update the noise estimate.
Unfortunately, this assumption is not valid for the decrypted speech. Since the encryption process
is applied only when speech is detected, distortions from the secure communication system are
characteristically different. They exist when speech is present. Therefore, a noise estimator that is able
to update noise even when speech is present is needed. However, most noise estimator techniques
only adapt to slow changes of noise to avoid over-estimation of noise, making them unsuitable for
this task. In this paper, we propose a speech enhancement technique to improve the quality of speech
from secure communication. We use a combination of the Wiener filter and spectral subtraction for
the noise estimator, so our method is better at tracking fast changes of noise without over-estimating
them. Our experimental results on various communication channels indicate that our method is
better than other popular noise estimators and speech enhancement methods.

Keywords: secure voice communication; speech enhancement; noise estimation; Wiener filter;
spectral subtraction

1. Introduction
Secure voice communication is necessary for many applications. Transmitted speech may have
sensitive and personal information of the speakers that needs to be kept confidential. For military
purposes, this becomes even more important since the information may affect national security.
The Global System for Mobile Communications (GSM) is the technology mainly used for voice communications in many
countries, even for military purposes. However, while it ensures confidentiality over the radio access
channel, security is not guaranteed when speech is transmitted across the switched network using
standard modulation such as pulse-code modulation (PCM) and adaptive differential pulse-code
modulation (ADPCM) forms [1]. To ensure the end-to-end confidentiality, the speech signals are
encrypted before entering the communication systems. However, the encryption process suffers
several drawbacks. First, the process may delay the transmission. Secondly, the decrypted speech
may be distorted since the encryption process usually includes a randomization of the speech signals,
and as a result, the speech signals may fail to reach sufficient quality.

Electronics 2019, 8, 897; doi:10.3390/electronics8080897 www.mdpi.com/journal/electronics



This study aims to improve the quality of the decrypted speech to achieve satisfying quality.
We focus on the implementation of speech enhancement (SE) for this purpose. SE is an active area of
research that aims to remove noise from speech, and hence, the quality of the speech signals may be
improved. This is very crucial in many speech-based applications such as speech recognition [2,3],
hearing aid devices [4,5], mobile communications [6], and hands-free devices [7,8].
There has been a plethora of algorithms proposed in this field. With the increasing trend of
deep learning technologies, many current studies model the non-linear relations between speech
and noise using various deep learning networks [9–13]. One of the first approaches is to use the
denoising autoencoder where the networks learn to map the speech with itself after it is disturbed
by noise [9]. Later, there was increasing interest in using the generative adversarial network (GAN)
to reconstruct the clean signals from the noisy ones [12]. However, deep learning methods require
heavy computational loads for training, and the resulting models require a large amount of storage,
which may make them unsuitable for mobile devices or other storage-limited devices, such as the
hand-held transceivers (HT) in which the systems are planned to be embedded.
For real-time and fast SE that requires minimum storage, signal processing-based methods would
be preferred. Earlier signal processing-based-methods were derived from the assumption of linear
relations between noise and speech in the time domain. While they are assumed to be linear in the time
domain, many techniques prefer to remove noise in the frequency domain by transforming the noisy
speech, i.e., speech that is contaminated with noise, into the spectral domain [14] since the energy of
speech and noise is more distinguishable in that domain. One of the simplest and arguably the most
popular methods is spectral subtraction (SS) [15]. This method simply subtracts the noise estimate
from the noisy speech. Noise is usually estimated using the early frames of an utterance, assuming speech is not
present during those periods. Later, spectral subtraction is also used for other purposes such as for
source separations [16] and dereverberation [17].
However, SS requires a good noise estimation, or otherwise, some artifacts may be left in the
enhanced speech that would cause unwanted and disturbing sound when it is transformed back into
the time domain. This is called musical noise [18]. Numerous methods have been proposed to reduce
musical noise. In [18], over-subtraction and flooring factors were proposed for SS. The over-subtraction
and flooring factors, which are determined heuristically, are set so that the gaps between the amplitudes
of those artifacts and the actual speech are minimum, hence reducing the loudness of musical noise.
Initially, these factors are set the same for all frequency bands. However, noise in the real world is
non-stationary, and it may have different energies for different spectra. Some types of noise such as
babble noise may have more energy at low frequencies, while others, such as the sound of machinery,
fans, etc., may have more energy at high frequencies. In [19], a non-linear spectral subtraction method
was proposed. The method works by computing the over-subtraction factor as a function of SNR for
each frequency. Other studies also showed that the SNRs change more extremely when a narrower
band is used, and they tend to change more slowly when a large band is used [20]. This property was
utilized in [21,22] by proposing multi-band spectral subtraction (MBSS). In their method, the speech
spectra were grouped into several wider frequency bands, and spectral subtraction was performed
separately for each band.
Other SE methods are derived using statistical properties of speech and noise signals. Usually,
Gaussian assumptions are used to model the speech for mathematical convenience. In [23],
the Wiener filter (WF) was used for speech enhancement. WF is the optimal solution, in the
minimum-mean-squared-error (MMSE) sense, when we assume speech and noise spectra to be
Gaussian and uncorrelated. There were various variants of WF-based SE in later studies [24–26].
Meanwhile, the SE method proposed in [26] is derived by assuming the speech and noise spectral
amplitudes to be Gaussian distributed. While SS is derived intuitively, we could say that it is the
optimal estimator of the variance, or power spectra, in the maximum likelihood sense, when we
assume the noisy speech to be Gaussian distributed. Recently,
there have been several efforts to use non-negative matrix factorization (NMF) to factorize noisy speech

into noise and speech components for speech enhancement [27,28]. NMF builds a noise model during
non-speech periods assuming it follows a certain distribution, such as Gaussian.
Studies indicated that the Gaussian assumptions are generally not valid when we use short time
windows for the processing of the speech signals. Non-Gaussian distributions are often used instead,
and SE methods are derived accordingly. In [29], the MMSE estimator was derived by assuming the
log spectral amplitude of speech to be Gaussian, hence assuming a non-Gaussian distribution for the
spectral amplitudes themselves. Other studies used Laplace [30,31] and Gamma distributions [32–34].
In [35,36], a variant of spectral subtraction, called q-spectral subtraction, was derived by assuming that
noisy speech spectra were distributed as q-Gaussian.
Another common assumption for SE is that noise is present all the time. Noise is usually estimated
during the non-speech period using a voice activity detector (VAD). However, VAD usually fails when
the SNR is low or when noise is non-stationary. Noise estimators are used instead of VAD to provide
estimations regardless of speech being present or not. Many noise estimators work by tracking
the minimum of the noisy speech spectra [37,38]. Unfortunately, these methods usually work for
slowly-changing noise. Other methods exploit the fact that noise usually does not change very
rapidly, so noise is estimated by averaging it from previous frames [39–42]. The methods apply the
adaptation speed weight factor to control how much the current frames affect the noise estimate.
Usually, the weight is set to be close to zero to minimize over-subtraction. However, as a consequence,
the methods are unable to track sudden changes of noise.
In secure communication, the noise energy is usually low during the non-speech period, while it
changes drastically when speech is present. This is because the encryption process usually detects
the speech presence and applies randomization on the detected speech parts only. While the sudden
change could be tracked by giving larger values for adaptation speed, the enhanced speech is very
likely to be over-subtracted, leaving unwanted artifacts. Therefore, the noise estimator needs not
only to track sudden changes of noise, but also to avoid over-subtraction. In this paper,
we propose a noise estimator that combines the Wiener filter and spectral subtraction. By doing so,
we could control the amount of the noise energy of current frames using the Wiener filter and the
average noise estimate from preceding frames using spectral subtraction to minimize over-subtraction.
The remainder of this paper is organized as follows. In Section 2, we briefly explain the spectral
subtraction and Wiener filter. We describe our proposed method in Section 3. In Section 4, we describe
the experimental setup, and the results are explained and discussed in Section 5. The paper is concluded
in Section 6.

2. Speech Enhancement Methods


Spectral subtraction (SS) is one of the earliest and arguably the most popular SE methods [15].
SS works with the assumption that the noise is additive and stationary. Let us denote y(t) as noisy
speech, i.e., the clean speech x (t) contaminated with additive noise d(t) at time t. Therefore, in the
time domain, their relation is:
$$ y(t) = x(t) + d(t). \qquad (1) $$

By taking the discrete Fourier transform (DFT) of (1) and the power magnitude, we obtain their
relation in the spectral domain, assuming that speech and noise are uncorrelated, as follows:

$$ |Y(m,k)|^2 = |X(m,k)|^2 + |D(m,k)|^2, \qquad (2) $$

where m and k denote the frame and frequency indices, respectively. Assuming that we can obtain the
estimate of $|D(m,k)|^2$, denoted $|\hat{D}(m,k)|^2$, the spectral subtraction is formulated as:

$$ |\hat{X}(m,k)|^2 = |Y(m,k)|^2 - |\hat{D}(m,k)|^2. \qquad (3) $$
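To make (1)–(3) concrete, the following is a minimal Python/NumPy sketch of power spectral subtraction; it is not the implementation evaluated in this paper, and the window, frame length, hop size, and use of the first few frames for noise estimation are illustrative assumptions.

```python
import numpy as np

def spectral_subtraction(y, frame_len=256, hop=128, noise_frames=5):
    """Basic power spectral subtraction, Eq. (3).

    The noise power |D(k)|^2 is estimated by averaging the first
    `noise_frames` frames, assuming speech is absent at the start.
    """
    window = np.hanning(frame_len)
    # Frame and window the noisy signal y(t), then take the DFT per frame.
    frames = np.array([y[i:i + frame_len] * window
                       for i in range(0, len(y) - frame_len, hop)])
    Y = np.fft.rfft(frames, axis=1)                          # Y(m, k)
    noise_pow = np.mean(np.abs(Y[:noise_frames]) ** 2, axis=0)

    # Eq. (3): subtract the noise power estimate; clip negatives to zero.
    clean_pow = np.maximum(np.abs(Y) ** 2 - noise_pow, 0.0)
    # Recombine with the noisy phase and overlap-add back to the time domain.
    X = np.sqrt(clean_pow) * np.exp(1j * np.angle(Y))
    x = np.zeros(len(y))
    for m, frame in enumerate(np.fft.irfft(X, n=frame_len, axis=1)):
        x[m * hop:m * hop + frame_len] += frame
    return x
```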

The performance of SS heavily depends on the accuracy of the noise estimate. However, this is
not an easy task due to several reasons. First, speech and noise are correlated when we process them
with a limited window length. There exist cross-terms between them, and therefore, even when noise
could perfectly be estimated, the exact clean signal could not be extracted [43,44]. Secondly, noise is
usually non-stationary, and hence, obtaining a good estimator is very difficult. The inaccuracy of noise
estimation leaves artifacts that produce unwanted and disturbing sound, called musical noise, in the
time domain. To minimize this, over-subtraction and flooring factors are introduced for spectral
subtraction [18]:



$$ |\hat{X}(m,k)|^2 = \begin{cases} |Y(m,k)|^2 - \alpha\,|D(m,k)|^2 & \text{if } |Y(m,k)|^2 > (\alpha+\beta)\,|D(m,k)|^2 \\[4pt] \beta\,|Y(m,k)|^2 & \text{otherwise,} \end{cases} \qquad (4) $$

where $\alpha$ and $\beta$ are the over-subtraction and flooring factors, respectively.


The parameter α is usually chosen to be higher than one to reduce the energy of the artifacts and
hence reduce their loudness. Meanwhile, β is set between zero and one to reduce the gap between
the remaining peaks of the artifacts and the spectral valleys of the speech signals. There is a trade-off
in determining the values of α and β. If α is too large, the intelligibility of the speech signals may
suffer, but if it is too small, much of the noise may still remain. On the other hand, if β is too large,
more background noise would be present, but if it is too small, musical noise would be more audible.
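As an illustration of the rule in (4), a hedged NumPy sketch is given below; the values of α and β are placeholders, not the values used in this paper.

```python
import numpy as np

def oversubtract(Y_pow, D_pow, alpha=2.0, beta=0.01):
    """Spectral subtraction with over-subtraction and flooring, Eq. (4).

    Y_pow: |Y(m,k)|^2, noisy power spectra (frames x bins).
    D_pow: |D(m,k)|^2, noise power estimate (broadcastable to Y_pow).
    alpha, beta: over-subtraction and flooring factors (assumed values).
    """
    subtracted = Y_pow - alpha * D_pow
    floored = beta * Y_pow
    # Apply the subtraction only where it would stay above the floor.
    return np.where(Y_pow > (alpha + beta) * D_pow, subtracted, floored)
```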
Many studies have proposed solutions for finding good values of α and β; they are typically determined
heuristically based on the estimate of the signal-to-noise ratio (SNR) [19,21,22]. In [35,36], a variant
of spectral subtraction was derived by assuming noisy speech to be non-Gaussian. The method,
called q-spectral subtraction (q-SS), explains the non-linear relations between speech and noise and
derives over-subtraction factors analytically.
The Wiener filter (WF) was first introduced by Norbert Wiener in the 1940s as a solution to find
the optimum estimation of signals from a noisy environment. If the noisy signal y(t) consists of the
clean signal x (t) and the noise d(t) and the noise and the signals are uncorrelated, then the estimation
of the clean signals can be obtained using linear combinations of the data y(t) such that the mean
squared error is minimum.
Many studies used WF for SE [23–26]. The transfer function of the Wiener filter is computed in the
frequency domain as follows:

$$ H(m,k) = \frac{|X(m,k)|^2}{|X(m,k)|^2 + |\hat{D}(m,k)|^2}. \qquad (5) $$

However, as indicated by (5), WF requires knowledge of the clean speech. It is also non-causal,
making it unrealizable. To overcome this, the transfer function is computed in an iterative manner,
where the enhanced speech is estimated using the transfer function from the preceding frame
as follows:

$$ |\hat{X}(m,k)|^2 = H(m-1,k)\,|Y(m,k)|^2. \qquad (6) $$

However, a noise estimator or a VAD is still needed to estimate the noise spectra. To overcome
this, we could modify the Wiener filter to estimate noise instead of the enhanced speech [45].
The transfer function of the Wiener filter can then be formulated as follows:

$$ H(m,k) = \frac{|\hat{D}(m-1,k)|^2}{|\hat{X}(m-1,k)|^2 + |\hat{D}(m-1,k)|^2}. \qquad (7) $$

By doing so, we compute the Wiener filter from the enhanced speech instead of the noisy speech,
and a VAD is not needed.
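A possible frame-by-frame reading of (6), (7), and (16) is sketched below; the noise-only initialization from the first frame and the simple speech-magnitude update are our assumptions, not a prescription from [45].

```python
import numpy as np

def wiener_noise_tracking(Y_mag, eps=1e-12):
    """Track noise with the modified Wiener filter of Eq. (7).

    Y_mag: |Y(m,k)| magnitude spectra (frames x bins).
    Returns the per-frame noise magnitude estimates |D_hat(m,k)|.
    The first frame is assumed noise-only for initialization.
    """
    M, K = Y_mag.shape
    D_hat = np.zeros((M, K))
    D_hat[0] = Y_mag[0]
    X_hat = np.maximum(Y_mag[0] - D_hat[0], eps)   # crude speech estimate
    for m in range(1, M):
        # Eq. (7): transfer function from the previous frame's estimates.
        H = D_hat[m - 1] ** 2 / (X_hat ** 2 + D_hat[m - 1] ** 2 + eps)
        D_hat[m] = H * Y_mag[m]                    # noise estimate, Eq. (16)
        X_hat = np.maximum(Y_mag[m] - D_hat[m], eps)
    return D_hat
```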

3. The Proposed Method


Figure 1 illustrates the block diagram of our proposed method. The proposed method works as
follows. The noisy speech y(t) is framed, windowed, and then transformed into the frequency domain
by taking the discrete Fourier transform (DFT). After that, we take the magnitude spectra | Y (m, k) |.
The magnitude spectra of the noisy speech are fed into the proposed noise estimator, and a simple
subtraction with the following formula is applied to obtain the enhanced speech:

$$ |\hat{X}(m,k)| = \begin{cases} |Y(m,k)| - |\hat{D}(m,k)| & \text{if } |Y(m,k)| > |\hat{D}(m,k)| \\[4pt] \beta\,|Y(m,k)| & \text{otherwise.} \end{cases} \qquad (8) $$

Figure 1. The block diagram of the proposed method.

In this paper, β was set to 0.002.


For the noise estimator, we combined the Wiener filter and q-spectral subtraction [35,36]. For the
Wiener filter, we modify (7) and compute its transfer function as follows:

$$ H(m,k) = \left( \frac{|\tilde{D}(m-1,k)|^2}{|\tilde{X}(m-1,k)|^2 + |\tilde{D}(m-1,k)|^2} \right)^{0.5}, \qquad (9) $$

where $|\tilde{D}(m-1,k)|$ is the smoothed estimate of the noise spectra, obtained by averaging the
estimate of noise at the current frame with the preceding estimates of noise. The decision-directed
formula [26] is modified with the following rule:

$$ |\tilde{D}(m,k)| = c\,|\tilde{D}(m-1,k)| + (1-c)\,|\hat{D}_n(m,k)|, \qquad (10) $$

where c is the adaptation speed factor, which is chosen between zero and one. Here, instead of
updating the noise spectra with the current noisy estimate $|Y(m,k)| - |\hat{D}(m,k)|$ as in [26],
we updated them with the noise estimate from the Wiener filter (see (16)). Therefore, we could
control the contribution of the slow changes of noise through $|\tilde{D}(m-1,k)|$, which is
determined from the averaged estimates of preceding frames, and of the faster changes through
$|\hat{D}_n(m,k)|$, which is computed by the Wiener filter. By doing so, we would be able to track
the changes of noise and minimize the risk of over-subtraction as well. In this paper, we evaluated
the optimal c for the performance of our method empirically. We varied c from 0.5 to 0.95.
Meanwhile, $|\tilde{X}(m-1,k)|$ in (9) is the average estimate of the clean speech of the preceding
frames. It is calculated using q-spectral subtraction (q-SS) with the following rule [36]:

$$ |\tilde{X}(m,k)| = \frac{2(2-q)}{3-q}\,|Y(m,k)| - |\tilde{D}(m,k)|, \qquad (11) $$

where the parameter q is estimated based on the a priori SNR, $\Psi(m,k)$, as reported in [36],
with the following relation:

$$ q = \begin{cases} 1 & \text{if } \Psi(m,k) \geq 20\ \mathrm{dB} \\ -0.038\,\Psi(m,k) + 1.8 & \text{if } -5\ \mathrm{dB} \leq \Psi(m,k) < 20\ \mathrm{dB} \\ 1.88 & \text{if } \Psi(m,k) < -5\ \mathrm{dB}. \end{cases} \qquad (12) $$
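The piecewise mapping in (12) translates directly into code; a small vectorized sketch (an assumed helper, not code from the paper):

```python
import numpy as np

def q_from_snr(psi_db):
    """Map the a priori SNR (in dB) to the q parameter, Eq. (12)."""
    q = -0.038 * psi_db + 1.8                # -5 dB <= psi < 20 dB
    q = np.where(psi_db >= 20.0, 1.0, q)     # high-SNR branch
    q = np.where(psi_db < -5.0, 1.88, q)     # low-SNR branch
    return q
```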

The parameter $\Psi(m,k)$ is the clean speech-to-noise ratio (in dB), i.e., $\Psi(m,k) = 10\log_{10}\psi(m,k)$.
Since $\psi(m,k)$ cannot be directly estimated from the noisy speech, we estimated it using maximum
likelihood, assuming the noise variance was known. It could be estimated from the posterior SNR
estimations. However, to reduce the computations, we estimated $\psi(m,k)$ using the formula reported
in [26]:

$$ \psi(m,k) = \max\big(\tilde{\phi}(m,k),\, 0\big), \qquad (13) $$

where:

$$ \tilde{\phi}(m,k) = a\,\tilde{\phi}(m-1,k) + (1-a)\,\phi(m,k). \qquad (14) $$
The notation $\phi(m,k)$ is the noisy signal-to-noise ratio (the posterior SNR), i.e.:

$$ \phi(m,k) = \left( \frac{|Y(m,k)|}{|\tilde{D}(m,k)|} \right)^{b}. \qquad (15) $$

The parameters a and b were set to 0.725 and 2, respectively, in this paper, as reported in [29].
The computation for calculating the posterior SNR was simple and fairly fast. We evaluated the
performance of these formulations by comparing their estimations with the actual SNR. The actual
SNR could be calculated since we had access to the clean speech. The results are illustrated in Figure 2.
As we can see, the SNR estimator did not perform very well. This was as expected, as the distortions
were highly non-stationary.

Figure 2. Comparisons of actual SNR with estimated SNR.



We opted to use q-SS instead of the conventional SS in this study. As reported in [36], q-SS
has more attenuation than SS. This means the flooring process would occur at a higher SNR for q-SS
than that of SS. This is especially beneficial when over-subtraction occurs. Since noise is more dominant
in low SNR and also more difficult to estimate, having them floored at higher SNR conditions would
minimize the loss of information.
Then, we can obtain the noise estimate $|\hat{D}(m,k)|$ using the Wiener filter:

$$ |\hat{D}(m,k)| = H(m,k)\,|Y(m,k)|. \qquad (16) $$

The estimate found in (16) is used in (8) to find the enhanced speech $\hat{X}(m,k)$. For the first
frame, we assumed $|\tilde{D}(m-1,k)|$ to be equal to $|\hat{D}(m,k)|$, and they were estimated using
the average spectra of the first five frames of an utterance. After the enhancement process,
the enhanced speech was transformed back to the time domain by applying the inverse discrete
Fourier transform (IDFT) and overlap and add (OLAP).
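To summarize the noise estimator, the following Python/NumPy sketch traces one possible ordering of the updates (9)–(16) per frame. It reflects our reading of the method rather than the authors' code; in particular, treating b as an exponent in (15), the small ε guards, and the exact update order are assumptions.

```python
import numpy as np

def enhance_magnitudes(Y_mag, a=0.725, b=2.0, c=0.8, beta=0.002, eps=1e-12):
    """Sketch of the coupled SS + Wiener-filter noise estimator.

    Y_mag: |Y(m,k)| magnitude spectra of the decrypted noisy speech.
    a, b:  a priori SNR smoothing parameters, Eqs. (13)-(15).
    c:     adaptation speed of the decision-directed rule, Eq. (10).
    beta:  flooring factor of the final subtraction, Eq. (8).
    """
    M, K = Y_mag.shape
    # Initialization from the average of the first five frames (as stated).
    D_tilde = np.mean(Y_mag[:5], axis=0)
    X_tilde = np.maximum(Y_mag[0] - D_tilde, eps)
    phi_s = np.zeros(K)                     # smoothed posterior SNR, Eq. (14)
    X_hat = np.zeros((M, K))
    for m in range(M):
        # Posterior SNR and its smoothed version, Eqs. (13)-(15).
        phi = (Y_mag[m] / (D_tilde + eps)) ** b
        phi_s = a * phi_s + (1 - a) * phi
        psi_db = 10 * np.log10(np.maximum(phi_s, eps))
        # q parameter from the a priori SNR, Eq. (12).
        q = -0.038 * psi_db + 1.8
        q = np.where(psi_db >= 20.0, 1.0, q)
        q = np.where(psi_db < -5.0, 1.88, q)
        # Wiener transfer function from previous-frame estimates, Eq. (9).
        H = np.sqrt(D_tilde ** 2 / (X_tilde ** 2 + D_tilde ** 2 + eps))
        D_n = H * Y_mag[m]                         # noise estimate, Eq. (16)
        D_tilde = c * D_tilde + (1 - c) * D_n      # smoothed noise, Eq. (10)
        # q-spectral subtraction for the clean-speech average, Eq. (11).
        X_tilde = np.maximum(2 * (2 - q) / (3 - q) * Y_mag[m] - D_tilde, eps)
        # Final enhancement, Eq. (8).
        X_hat[m] = np.where(Y_mag[m] > D_n, Y_mag[m] - D_n, beta * Y_mag[m])
    return X_hat
```

The enhanced magnitudes would then be combined with the noisy phase and inverted via the IDFT and overlap-add, as described above.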

4. Experimental Setup
For the evaluation data, we selected a subset from the Tokyo Institute of Technology Multilingual
Speech Corpus-Indonesian (TITML-IDN) [46]. It is an Indonesian phonetically-balanced speech
corpus, which consists of the recordings of 20 Indonesian speakers (10 male and 10 female). Each
speaker was asked to read 343 phonetically-balanced sentences. From this dataset, we selected
2 speakers, a female and a male, and 10 utterances for each speaker, for a total of 20 sentences.
The phonetic transcription of the spoken utterances used for these experiments is shown in
Table 1.

Table 1. The list of utterances used for the experiments.

Utterance Phonetic Transcription


1 [h-ai] [s-@-l-a:-m-a:-t] [p-a:-g-i] [a:-p-a:] [k-a:-b-a:-R]
2 [s-@-m-O:-g-a:] [u:-Ã-i-a:-n-ñ-a:] [b-@-R-Ã-a:-l-a:-n] [l-a:-n-Ù-a:-R]
3 [a:-Ù-a:-R-a:] [n-O:-n-t-O:-n-ñ-a:] [Ã-a:-d-i] [k-a:-n]
4 [a:-p-a:-k-a:-h] [k-a:-m-u:] [s-u:-d-a:-h] [m-a:-k-a:-n] [s-i-a:-N]
5 [n-a:-n-t-i] [m-a:-l-a:-m] [p-u:-l-a:-N] [Ã-a:-m] [b-@-R-a:-p-a:]
6 [Ã-a:-N-a:-n] [l-u:-p-a:] [s-a:-R-a:-p-a:-n] [j-a:]
7 [Ù-@-p-a:-t] [i-s-t-i-R-a:-h-a:-t] [d-a:-n] [m-i-m-p-i] [j-a:-N] [i-n-d-a:-h]
8 [m-a:-a:-f] [s-a:-j-a:] [t-@-R-l-a:-m-b-a:-t] [d-a:-t-a:- N] [k-@] [k-a:-n-t-o-R]
9 [d-i-a:] [t-i-d-a:-k] [d-a:-t-a:- N] [k-@] [s-@-k-O:-l-a:-h]
10 [a:-l-a:-s-a:-n-ñ-a:] [b-@-l-u:-m] [m-@-N-@-R-Ã-a:-k-a:-n] [p-@-k-@-R-Ã-a:-a:-n] [R-u:-m-a:-h]

The process of creating the experimental data is shown in Figure 3. All selected utterances were
passed through the encryptor before being transmitted using communication devices and encoded
using various communication channels. In this study, we used four types of communication channels:
GSM, Interactive Multimedia Association (IMA)-ADPCM (denoted as I-ADPCM), Microsoft ADPCM
(denoted as M-ADPCM), and PCM. After that, the speech was received by another communication
device, and then, the decryptor process was applied to obtain the decoded speech. The decoded speech
was then fed through the speech enhancement method.
The encryption utilized multicircular permutation to randomize the voice signals. Encryption
and decryption were executed utilizing a digital dynamic complex permutation module rotated by
a set of expanded keys. At the sender or transmitter, the voice was encrypted using permutation
multicircular shrinking. At the receiver, the encrypted speech was decrypted with permutation
multicircular expanding. The direction of the shift in permutations, both shrinking and expanding,
was determined by the value of the expanded key. More details about the encryption process can be
found in [47].

Figure 3. The process of creating the data for the experiments. IMA, Interactive Multimedia Association;
M, Microsoft.

We evaluated our method using the perceptual evaluation of speech quality (PESQ) [48] and
frequency-weighted segmental SNR (FwSNR) [49]. Both metrics were chosen since they have a
strong correlation with subjective quality measures [50].
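For readers who wish to reproduce such scores, open-source tooling exists; for instance, a sketch using the community `pesq` package (an open-source P.862 implementation, not necessarily the tool used in this paper; the file names are hypothetical):

```python
import soundfile as sf
from pesq import pesq  # pip install pesq

# Hypothetical file names; fs must be 8000 (narrowband) or 16000 (wideband).
ref, fs = sf.read("clean_utterance.wav")
deg, _ = sf.read("enhanced_utterance.wav")
score = pesq(fs, ref, deg, "nb")   # 'nb' = narrowband PESQ mode
print(f"PESQ: {score:.3f}")
```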
Our noise estimator was compared with other popular methods based on minima tracking
and recursive averaging. Three noise estimation methods, Martin [37], Hirsch [51], and improved
minimum controlled recursive averaging (IMCRA) [52], were selected. We also provided the results
of SS with these noise estimators. In addition, we compared our method with several SE methods.
They were SS [18], the Wiener filter (WF) [25], Karhunen–Loève transform-based speech enhancement
(KLT) [53], NMF [27], and log minimum-mean-squared-error (LogMMSE) [29]. For SS, WF, KLT,
and LogMMSE, we implemented the code published in [54]. The implementations of IMCRA, Martin,
and Hirsch were also taken from the same source. The NMF implementation was taken from [55].

5. Results and Discussions


In this section, we present evaluations of the proposed method. First, we evaluate the effect of
various values of the adaptation speed c on the performance of the method. Secondly, we evaluate
the ability of our method to track the noise from decrypted speech and compare it with other
noise estimators: Martin, Hirsch, and IMCRA. Lastly, we compare our method with several speech
enhancement methods and noise estimators.

5.1. The Effect of Adaptation Speed


Figures 4 and 5 show the performance of the proposed method when we varied the adaptation
speed factor, c. The performance shown is the average over all utterances. It can clearly be seen that
the performance of the method is affected by c. For PESQ, the best performance was obtained when
c = 0.75 for PCM, I-ADPCM, and M-ADPCM, while the results indicated that lower c was required
for GSM. Meanwhile, the highest FwSNR was achieved when we used c = 0.85 for PCM, I-ADPCM,
and M-ADPCM, while c = 0.75 was the best for GSM. As indicated by both metrics, a lower c was required
for GSM to achieve better performance. This was because GSM produced the noisiest speech (as
indicated by its low FwSNR and low PESQ). Therefore, the effect of noise was more severe in GSM than
in other channels. As a result, a lower c was needed to compensate for the fast changes of noise.
While our experiments indicated that the objective quality could be improved with smaller c,
the best PESQ and FwSNR performance was achieved at different values of c. Since having “cleaner”
speech does not necessarily improve the perceived quality of speech, the appropriate c should be
selected carefully.

Figure 4. The performance of the proposed method (perceptual evaluation of speech quality
(PESQ)) when the adaptation speed c was varied for various communication channels: (a) GSM,
(b) IMA-ADPCM, (c) PCM, and (d) Microsoft-ADPCM. The results are the average PESQ scores over
all 20 utterances.

Figure 5. The performance of the proposed method (frequency-weighted segmental SNR (FwSNR))
when the adaptation speed c is varied for various communication channels: (a) GSM, (b) IMA-ADPCM,
(c) PCM, and (d) Microsoft-ADPCM. The results are the average FwSNR over all 20 utterances.

5.2. Comparison with Other Noise Estimators


Table 2 shows the average log spectral distances (LSD) in dB. The LSD is calculated as follows:

$$ \mathrm{LSD} = \frac{1}{K \times M} \sum_{m=1}^{M} \sum_{k=1}^{K} \sqrt{ \left( 10\log_{10} \frac{|N_{ref}(m,k)|^2}{|N_{ne}(m,k)|^2} \right)^2 }, \qquad (17) $$

where $|N_{ref}(m,k)|^2$ is the power spectrum of the actual noise, which is computed by subtracting
the actual clean speech from the noisy decrypted speech, $|N_{ne}(m,k)|^2$ is the noise estimate from
the noise estimators, M is the number of frames, and K is the number of frequency components.
The results are the average LSD over five utterances of the female speaker on various communication
channels. The results show that our method had the smallest LSD among the evaluated noise
estimators on most channels, implying that it tracked the changes in noise better than the other methods.
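A direct sketch of (17), assuming time-frequency-aligned noise power matrices, could be:

```python
import numpy as np

def log_spectral_distance(N_ref_pow, N_est_pow, eps=1e-12):
    """Average log spectral distance (dB), Eq. (17).

    N_ref_pow: |N_ref(m,k)|^2, actual noise power (frames x bins).
    N_est_pow: |N_ne(m,k)|^2, estimated noise power (same shape).
    """
    ratio_db = 10.0 * np.log10((N_ref_pow + eps) / (N_est_pow + eps))
    # The sqrt of the squared term is the absolute log distance per bin.
    return np.mean(np.sqrt(ratio_db ** 2))
```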

Table 2. The average LSD (in dB) of the actual noise and estimated noise of our method and other
noise estimators for various communication channels. I, IMA; IMCRA, improved minimum controlled
recursive averaging.

Methods GSM I-ADPCM M-ADPCM PCM


Hirsch 43.89 40.50 43.25 40.47
Martin 43.88 41.65 46.06 41.52
IMCRA 41.20 37.34 39.26 37.28
Proposed 40.00 36.17 43.93 36.17

Figure 6 shows the comparisons of the noise estimates for the proposed method with other noise
estimators: Martin, Hirsch, and IMCRA. It is the magnitude spectrum (in dB) of the 10th band of a
female speaker on various encodings: GSM, I-ADPCM, M-ADPCM, and PCM. The figure further
suggests that the proposed method was closer to actual noise than other noise estimators in tracking
the changes of noise. Other estimators, such as Martin and Hirsch's methods, are based on tracking
the minimum statistics of the spectrum to avoid over-estimating the noise. As a result, while they
are very unlikely to over-estimate the noise, there are large gaps between the actual noise and their
estimates. In contrast, our method generally had smaller gaps between the prediction and the actual
noise. Interestingly, our method was also able to minimize over-estimation.

Figure 6. Comparisons of the resulting noise estimation spectra of the proposed method and several
noise estimators: Martin, Hirsch, and IMCRA. The figures show the magnitude spectra (in dB) of a
frequency bin (k = 10) on (a) IMA-ADPCM and (b) M-ADPCM. For the proposed method, we used
c = 0.8.

5.3. Comparisons with Other SE Methods


Table 3 compares our proposed method with several SE methods: SS, WF, KLT, NMF,
and LogMMSE. For this, we used c = 0.8 for the remainder of our experiments. In addition, we
compared our method with SS when noise estimators were used: Martin, Hirsch, and IMCRA. We
also investigated whether the improvements were due to the overlap and add (OLAP) process,
so we present the results of applying OLAP alone to the noisy speech. The results indicated that OLAP
did not improve the PESQ and FwSNR metrics. On average, the results were slightly worse than the
original noisy speech, suggesting that OLAP may not have contributed to the improvements of our method.
It is obvious that the SS did not improve the PESQ scores either. Similar results can also be seen for WF
and KLT. In some cases, the PESQ scores were even worse. This is not surprising, because the methods
were unable to track the changes of noise when speech was present. Only NMF and LogMMSE were
able to improve the PESQ scores slightly. NMF, due to its prior assumptions on noise, may be able to
remove some parts of the noise even during the speech period.

Table 3. Results of our method (PESQ scores and FwSNR) in comparison with other methods.
The proposed method applied c = 0.8. OLAP, overlap and add; SS, spectral subtraction; WF, Wiener filter;
KLT, Karhunen–Loève transform.

PESQ FwSNR
Methods
GSM I-ADPCM M-ADPCM PCM GSM I-ADPCM M-ADPCM PCM
Noisy speech 1.234 2.303 2.346 2.303 2.501 15.300 13.533 15.309
With OLAP only 1.234 2.295 2.345 2.295 2.495 15.210 13.523 15.190
SS 1.229 2.308 2.351 2.308 2.431 13.148 12.575 13.155
SS + Martin 1.235 2.296 2.341 2.296 2.213 13.505 12.773 13.510
SS + IMCRA 1.234 2.295 2.339 2.295 2.185 13.207 12.613 13.200
SS + Hirsch 1.234 2.289 2.333 2.288 2.131 13.192 12.562 13.187
WF 1.163 1.963 1.989 1.963 2.535 11.918 11.328 11.919
LogMMSE 1.240 2.352 2.405 2.352 2.711 12.439 11.895 12.440
KLT 1.173 2.090 2.118 2.091 2.206 10.512 10.128 10.510
NMF 1.350 2.551 2.606 2.591 1.654 2.987 2.957 2.995
Proposed method 1.251 2.654 2.649 2.654 3.620 13.415 12.799 13.405

Our method improved the PESQ scores for most encoding techniques. We obtained PESQ
improvement of 0.256, 0.253, 0.260, 0.261, 0.266, 0.533, 0.214, and 0.434 on average over all encoding
schemes compared to the original decrypted speech, SS, SS + Martin, SS + IMCRA, SS + Hirsch,
WF, LogMMSE, and KLT, respectively, confirming that our method could improve the quality of the
decrypted speech. Only NMF had better PESQ scores than our method for GSM, whereas our method
was slightly better for other encoding schemes.
As indicated by the spectrograms of the enhanced speech (see Figures 7 and 8), the spectrograms
of all reference methods did not change drastically compared to the original noisy speech for
IMA-ADPCM. In contrast, NMF and our proposed method were clearly able to remove some parts of
the noise from the speech. Similar spectrogram results were observed for other utterances as well.
Meanwhile, for GSM, where the conditions were noticeably the worst, we see that the other methods
removed parts of the noise during non-speech periods only, whereas our method could remove some
parts of the noise during speech periods.

Figure 7. The spectrogram of the female speaker: “hai selamat pagi apa kabar” of (a) the original speech,
(b) the decrypted speech from I-ADPCM, the enhanced speech from (c) the Wiener filter, (d) KLT,
(e) LogMMSE, and (f) NMF, and (g) the proposed method.

Figure 8. The spectrogram of the female speaker: “acara nontonnya jadi kan” of (a) the original speech,
(b) the decrypted speech from GSM, the enhanced speech from (c) the Wiener filter, (d) KLT,
(e) LogMMSE, and (f) NMF, and (g) the proposed method.

Surprisingly, we observed that the performance of SS tended to get worse when the noise estimators
were used. Since most noise estimators avoid over-subtraction, their noise estimates tended to be very
low compared to the actual noise (as we showed in Section 5.2). As the spectrograms indicate, our
method was better at removing noise compared to the other noise estimators (see Figure 9). This is
consistent with our previous findings.
Meanwhile, for FwSNR, the enhanced speech tended to have a lower FwSNR for most methods.
LogMMSE, WF, and the proposed method improved the FwSNR only on GSM encoding, where our method
was better than LogMMSE. The FwSNR for NMF dropped largely for all encoding schemes. As NMF
usually assumes Gaussian priors, it is very likely that noise and speech did not follow the same
distributions, forcing the enhanced speech to be Gaussian, and as a consequence, the FwSNR dropped.
In general, the proposed method was among the methods that degraded the FwSNR the least for the
other encoding schemes. These results indicate that while the method may cause the enhanced speech
to be over-subtracted, the effect may not be as significant as for other methods. This might be because
the other methods may not perform well on the boundary between when speech is present and absent.
This caused distortions in the boundary area of speech, and hence, the FwSNR was worse.
We compared the computational complexity of the proposed method with that of the other SE methods.
We did this by measuring the runtime of each method for processing 80 utterances, comprising
the 20 sentences on the 4 types of encoding schemes. We repeated the experiments five times, and their
average was calculated. The experiments were conducted on an Intel i3 processor with 4 GB of memory.
The results are shown in Table 4. Based on the results, we found that WF had the largest runtime,
indicating that it was the method with the largest computational complexity. This was as expected,
since WF is an iterative process with the maximum number of iterations set to 1000 in the implementation.
NMF also had large computational complexity; the matrix decomposition operations may contribute
to its large computational time. Meanwhile, SS had the smallest computation time, as it is one of the
simplest methods. Only SS and LogMMSE had smaller computation times than the proposed method,
suggesting that the method is quite simple and its computational load can be considered low.

Table 4. Runtime (a proxy for computational complexity) of the proposed method in comparison with other methods over five repeated runs.

No. SS KLT LogMMSE WF NMF Proposed


1 12,945 43,157 27,170 255,627 205,544 37,139
2 13,131 41,292 28,038 256,006 204,908 37,268
3 13,022 41,753 27,171 255,987 206,773 37,117
4 13,420 41,885 27,477 256,151 207,229 37,103
5 13,470 41,003 27,405 255,775 205,974 36,644
Average 13,198 41,818 27,452 255,909 206,086 37,054

Figure 9. The spectrogram of the male speaker: “apakah kamu sudah makan siang” of (a) the original
speech, (b) the decrypted speech from PCM, and the enhanced speech from (c) SS + IMCRA, (d) SS +
Martin, (e) SS + Hirsch, and (f) the proposed method.

6. Conclusions
One way to secure voice communication is to apply an encryption method before sending the
voice through mobile or wired communication networks. However, this approach may cause the
decrypted speech to have degraded quality. In this paper, we presented an enhancement method with
coupled spectral subtraction and the Wiener filter for noise estimation to improve the quality of speech
from secure communication.

Our experimental work on speech transmitted over various communication channels (GSM,
IMA-ADPCM, M-ADPCM, and PCM) showed that our proposed method was generally better at
improving the PESQ scores of the decrypted speech compared to other speech enhancement or noise
estimator methods. While the FwSNR metrics tended to get worse for I-ADPCM, M-ADPCM, and PCM
in general, our method improved them on GSM channels, the conditions that had the worst FwSNR.
The results might indicate the effectiveness of our method in conditions of highly-distorted speech.
Our observations on the spectrogram results suggested that the enhanced speech signals from our
method were cleaner than those from the reference methods.
In the future, we plan to further improve the performance of our method. In this research,
we applied only a single adaptation speed, determined heuristically. Since it clearly affected the
performance, it may be interesting to find out whether we could automatically predict the
optimum adaptation speed factor.

Author Contributions: Conceptualization, H.P. and K.R.; methodology, H.P., N.H., and Y.S.; software, H.P., N.H.,
and Y.S.; validation, H.P., Y.S., and N.H.; formal analysis, H.P.; investigation, H.P. and N.H.; resources, H.P., N.H.,
and Y.S.; data curation, N.H. and A.P.; writing, original draft preparation, H.P.; writing, review and editing, H.P.
and K.R.; visualization, H.P.; supervision, K.R.; project administration, A.P., Y.S., and N.H.; funding acquisition,
K.R.
Funding: This work was supported by Lembaga Pengelola Dana Pendidikan (LPDP) RISPRO from the Ministry
of Finance of the Republic of Indonesia.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Katugampala, N.N.; Al-Naimi, K.T.; Villette, S.; Kondoz, A.M. Real time data transmission over GSM
voice channel for secure voice and data applications. In Proceedings of the 2nd IEEE Secure Mobile
Communications Forum: Exploring the Technical Challenges in Secure GSM and WLAN, London, UK,
23–23 September 2004; pp. 7/1–7/4.
2. Gong, Y. Speech recognition in noisy environments: A survey. Speech Commun. 1995, 16, 261–291. [CrossRef]
3. Vincent, E.; Watanabe, S.; Nugraha, A.A.; Barker, J.; Marxer, R. An analysis of environment, microphone and
data simulation mismatches in robust speech recognition. Comput. Speech Lang. 2017, 46, 535–557. [CrossRef]
4. Levitt, H. Noise reduction in hearing aids: A review. J. Rehabil. Res. Dev. 2001, 38, 111–122. [PubMed]
5. Kam, A.C.S.; Sung, J.K.K.; Lee, T.; Wong, T.K.C.; van Hasselt, A. Improving mobile phone speech recognition
by personalized amplification: Application in people with normal hearing and mild-to-moderate hearing
loss. Ear Hear 2017, 38, e85–e92. [CrossRef] [PubMed]
6. Goulding, M.M.; Bird, J.S. Speech enhancement for mobile telephony. IEEE Trans. Veh. Technol. 1990,
39, 316–326. [CrossRef]
7. Juang, B.; Soong, F. Hands-free telecommunications. In Proceedings of the International Workshop on
Hands-Free Speech Communication, Kyoto, Japan, 9–11 April 2001.
8. Jin, W.; Taghizadeh, M.J.; Chen, K.; Xiao, W. Multi-channel noise reduction for hands-free voice
communication on mobile phones. In Proceedings of the 2017 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 506–510.
9. Lu, X.; Tsao, Y.; Matsuda, S.; Hori, C. Speech enhancement based on deep denoising autoencoder.
In Proceedings of the Interspeech, 14th Annual Conference of the International Speech Communication
Association, Lyon, France, 25–29 August 2013; pp. 436–440.
10. Xu, Y.; Du, J.; Dai, L.; Lee, C. A Regression Approach to Speech Enhancement Based on Deep Neural
Networks. IEEE/ACM IEEE Trans. Audio Speech Lang. Process. 2015, 23, 7–19.
11. Weninger, F.; Erdogan, H.; Watanabe, S.; Vincent, E.; Le Roux, J.; Hershey, J.R.; Schuller, B.
Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR.
In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation, Liberec,
Czech Republic, 25–28 August 2015; pp. 91–99.

12. Pascual, S.; Bonafonte, A.; Serrà, J. SEGAN: Speech Enhancement Generative Adversarial Network.
In Proceedings of the Interspeech, 18th Annual Conference of the International Speech Communication
Association, Stockholm, Sweden, 20–24 August 2017; pp. 3642–3646.
13. Kumar, A.; Florencio, D. Speech Enhancement in Multiple-Noise Conditions Using Deep Neural Networks.
In Proceedings of the Interspeech, San Francisco, CA, USA, 8–12 September 2016; pp. 3738–3742.
14. Shekokar, S.; Mali, M. A brief survey of a DCT-based speech enhancement system. Int. J. Sci. Eng. Res 2013,
4, 1–3.
15. Boll, S. A spectral subtraction algorithm for suppression of acoustic noise in speech. In Proceedings of the
1979 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Washington, DC,
USA, 2–4 April 1979; Volume 4, pp. 200–203.
16. Nasu, Y.; Shinoda, K.; Furui, S. Cross-Channel Spectral Subtraction for meeting speech recognition.
In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Process (ICASSP),
Prague, Czech Republic, 22–27 May 2011; pp. 4812–4815.
17. Furuya, K.; Kataoka, A. Robust speech dereverberation using multichannel blind deconvolution with
spectral subtraction. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 1579–1591. [CrossRef]
18. Berouti, M.; Schwartz, R.; Makhoul, J. Enhancement of speech corrupted by acoustic noise. In Proceedings of
the 1979 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Washington,
DC, USA, 2–4 April 1979; Volume 4, pp. 208–211.
19. Lockwood, P.; Boudy, J. Experiments with a nonlinear spectral subtractor (NSS), hidden Markov models and
the projection, for robust speech recognition in cars. Speech Commun. 1992, 11, 215–228. [CrossRef]
20. Cappé, O. Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor.
IEEE Trans. Speech Audio Process. 1994, 2, 345–349. [CrossRef]
21. Kamath, S.; Loizou, P. A multi-band spectral subtraction method for enhancing speech corrupted by colored
noise. In Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing,
Orlando, FL, USA, 13–17 May 2002; Volume 4, p. 44164.
22. Udrea, R.M.; Oprea, C.C.; Stanciu, C. Multi-microphone Noise Reduction System Integrating Nonlinear
Multi-band Spectral Subtraction. In Pervasive Computing Paradigms for Mental Health; Oliver, N., Serino, S.,
Matic, A., Cipresso, P., Filipovic, N., Gavrilovska, L., Eds.; Springer International Publishing: Cham,
Switzerland, 2018; pp. 133–138.
23. McAulay, R.; Malpass, M. Speech enhancement using a soft-decision noise suppression filter. IEEE Trans.
Acoust. Speech Signal Process. 1980, 28, 137–145. [CrossRef]
24. Scalart, P. Speech enhancement based on a priori signal to noise estimation. In Proceedings of the 1996 IEEE
International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, Atlanta, GA,
USA, 9 May 1996; Volume 2, pp. 629–632.
25. Hu, Y.; Loizou, P.C. Speech enhancement based on wavelet thresholding the multitaper spectrum. IEEE Trans.
Speech Audio Process. 2004, 12, 59–67. [CrossRef]
26. Ephraim, Y.; Malah, D. Speech enhancement using a minimum-mean square error short-time spectral
amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 1109–1121. [CrossRef]
27. Lyubimov, N.; Kotov, M. Non-negative matrix factorization with linear constraints for single-channel speech
enhancement. In Proceedings of the Interspeech, 14th Annual Conference of the International Speech
Communication Association, Lyon, France, 25–29 August 2013; pp. 446–450.
28. Duan, Z.; Mysore, G.J.; Smaragdis, P. Speech enhancement by online non-negative spectrogram
decomposition in nonstationary noise environments. In Proceedings of the Interspeech, Portland, OR, USA,
9–13 September 2012.
29. Ephraim, Y.; Malah, D. Speech enhancement using a minimum mean-square error log-spectral amplitude
estimator. IEEE Trans. Acoust. Speech Signal Process. 1985, 33, 443–445. [CrossRef]
30. Mahmmod, B.M.; Ramli, A.R.; Abdulhussian, S.H.; Al-Haddad, S.A.R.; Jassim, W.A. Low-Distortion MMSE
Speech Enhancement Estimator Based on Laplacian Prior. IEEE Access 2017, 5, 9866–9881. [CrossRef]
31. Chen, B.; Loizou, P.C. Speech enhancement using a MMSE short time spectral amplitude estimator with
Laplacian speech modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech,
and Signal Processing, Philadelphia, PA, USA, 23–23 March 2005; Volume 1.

32. Wang, Y.; Brookes, M. Speech enhancement using an MMSE spectral amplitude estimator based on a
modulation domain Kalman filter with a Gamma prior. In Proceedings of the 2016 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016;
pp. 5225–5229.
33. Martin, R. Speech enhancement using MMSE short time spectral estimation with gamma distributed speech
priors. In Proceedings of the 2002 IEEE International Conference on Acoustics, Speech and Signal Process
(ICASSP), Orlando, FL, USA, 13–17 May 2002; Volume 1, pp. I-253–I-256.
34. Andrianakis, I.; White, P.R. Mmse Speech Spectral Amplitude Estimators With Chi and Gamma Speech
Priors. In Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Process
(ICASSP), Toulouse, France, 14–19 May 2006; Volume 3.
35. Pardede, H.F.; Koichi, S.; Koji, I. Q-Gaussian based spectral subtraction for robust speech recognition.
In Proceedings of the Interspeech, 13th Annual Conference of the International Speech Communication
Association, Portland, OR, USA, 9–13 September 2012; pp. 1255–1258.
36. Pardede, H.; Iwano, K.; Shinoda, K. Spectral subtraction based on non-extensive statistics for speech
recognition. IEICE Trans. Inf. Syst. 2013, 96, 1774–1782. [CrossRef]
37. Martin, R. Noise power spectral density estimation based on optimal smoothing and minimum statistics.
IEEE Trans. Speech Audio Process. 2001, 9, 504–512. [CrossRef]
38. Barnov, A.; Bracha, V.B.; Markovich-Golan, S. QRD based MVDR beamforming for fast tracking of speech
and noise dynamics. In Proceedings of the 2017 IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics (WASPAA), New Paltz, NY, USA, 15–18 October 2017; pp. 369–373.
39. Cohen, I.; Berdugo, B. Noise estimation by minima controlled recursive averaging for robust speech
enhancement. IEEE Signal Process Lett. 2002, 9, 12–15. [CrossRef]
40. Lu, C.T.; Lei, C.L.; Shen, J.H.; Wang, L.L.; Tseng, K.F. Estimation of noise magnitude for speech denoising
using minima-controlled-recursive-averaging algorithm adapted by harmonic properties. Appl. Sci. 2017,
7, 9. [CrossRef]
41. Lu, C.T.; Chen, Y.Y.; Shen, J.H.; Wang, L.L.; Lei, C.L. Noise Estimation for Speech Enhancement Using
Minimum-Spectral-Average and Vowel-Presence Detection Approach. In Proceedings of the International
Conference on Frontier Computing, Bangkok, Thailand, 9–11 September 2016; pp. 317–327.
42. He, Q.; Bao, F.; Bao, C.; He, Q.; Bao, F.; Bao, C. Multiplicative update of auto-regressive gains for
codebook-based speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 457–468.
[CrossRef]
43. Faubel, F.; Mcdonough, J.; Klakow, D. A phase-averaged model for the relationship between noisy speech,
clean speech and noise in the log-mel domain. In Proceedings of the Interspeech, 9th Annual Conference of
the International Speech Communication Association, Brisbane, Australia, 22–26 September 2008.
44. Zhu, Q.; Alwan, A. The effect of additive noise on speech amplitude spectra: A quantitative analysis.
IEEE Signal Process Lett. 2002, 9, 275–277.
45. Sovka, P.; Pollak, P.; Kybic, J. Extended spectral subtraction. In Proceedings of the 1996 8th European Signal
Processing Conference (EUSIPCO 1996), Trieste, Italy, 10–13 September 1996; pp. 1–4.
46. Lestari, D.P.; Iwano, K.; Furui, S. A large vocabulary continuous speech recognition system for Indonesian
language. In Proceedings of the 15th Indonesian Scientific Conference in Japan Proceedings, Hiroshima,
Japan, 4–7 August 2006; pp. 17–22.
47. Hayati, N.; Suryanto, Y.; Ramli, K.; Suryanegara, M. End-to-End Voice Encryption Based on Multiple
Circular Chaotic Permutation. In Proceedings of the 2019 2nd International Conference on Communication
Engineering and Technology (ICCET), Nagoya, Japan, 12–15 April 2019.
48. Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ)—A
new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001
IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, UT, USA, 7–11
May 2001; Volume 2, pp. 749–752.
49. Tribolet, J.; Noll, P.; McDermott, B.; Crochiere, R. A study of complexity and quality of speech waveform
coders. In Proceedings of the ICASSP’78. IEEE International Conference on Acoustics, Speech, and Signal
Processing, Tulsa, OK, USA, 10–12 April 1978; Volume 3, pp. 586–590.
50. Hu, Y.; Loizou, P.C. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio
Speech Lang. Process. 2008, 16, 229–238. [CrossRef]

51. Hirsch, H.G.; Ehrlicher, C. Noise estimation techniques for robust speech recognition. In Proceedings of the
1995 International Conference on Acoustics, Speech, and Signal Processing, Detroit, MI, USA, 9–12 May 1995;
Volume 1, pp. 153–156.
52. Cohen, I. Noise spectrum estimation in adverse environments: Improved minima controlled recursive
averaging. IEEE Trans. Speech Audio Process. 2003, 11, 466–475. [CrossRef]
53. Hu, Y.; Loizou, P.C. A generalized subspace approach for enhancing speech corrupted by colored noise.
IEEE Trans. Speech Audio Process. 2003, 11, 334–341. [CrossRef]
54. Loizou, P.C. Speech Enhancement: Theory and Practice, 2nd ed.; CRC Press, Inc.: Boca Raton, FL, USA, 2013.
55. NMFdenoiser. 2014. Available online: https://github.com/niklub/NMFdenoiser (accessed on
11 March 2019).

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
