Signal Analysis and Interpretation Lab, University of Southern California, Los Angeles, USA 90089
ppapadop@usc.edu, tsiartas@usc.edu, jjgibson@usc.edu, shri@sipi.usc.edu
In this work, our goal is to estimate the SNR of spontaneous speech signals, or of signals whose speech boundaries are not available to us. Although there are different kinds of SNR criteria, such as Global SNR, Local SNR, and Segmental SNR [7], we focus on the estimation of Global SNR. Global SNR gives us information about the effect of noise on the whole signal and is defined as:

    SNR = 10 · log10 [ ( (1/N) Σ_{i=1}^{N} s^2(i) ) / ( (1/N) Σ_{i=1}^{N} n^2(i) ) ]    (1)

where the numerator is the mean-square value of the speech signal and the denominator is the mean-square value of the noise signal, expressing their respective energies P(S) and P(N).

Assuming that the noise is additive, the observed signal x(i) is the sum of the speech signal s(i) and the noise signal n(i): x(i) = s(i) + n(i), with i being the time index. Furthermore, if the speech and noise signals are independent and zero-mean, we can rewrite equation (1) as:

    SNR = 10 · log10 [ (P(X) − P(N)) / P(N) ]    (2)

which will be the basis of our estimation formulae.

Our approach focuses on finding regions of speech presence (and absence) in the signal without requiring a VAD. We measure the respective energies of these regions and create SNR estimators based on the formula of equation (2). Afterwards, we create a regression model, which we train with ordinary least squares to obtain our final SNR estimate.

To distinguish the regions of speech presence and absence in the signal, we use a variety of features, such as long-term energy, variability, pitch, and voicing probability. We take percentile windows of those features and calculate the energies P(X) and P(N) corresponding to those windows. The bands of high and low energies offer a reasonable approximation for separating speech regions from noise-only regions. Such an estimate is given by equation (3).

Since SNR is a ratio of energies, we first calculate the long-term energy in each frame from the spectrogram (the average energy in each frame). Then we apply different smoothing windows, using the moving-average smoothing method. For every smoothing-window length, we estimate P(X) and P(N) by taking percentile windows on the long-term energy and substituting those values into (3). So, for different smoothing windows and energy regions, we obtain different features.

2.2. Long-Term Signal Variability (LTSV)

Long-Term Signal Variability (LTSV) was proposed in [16] as a way of measuring the degree of non-stationarity in a signal. Since speech is non-stationary, we can use LTSV to identify speech regions in a signal. Hence, we can estimate P(X) and P(N) based on percentage regions of variability and measure the respective energies of those regions. For example, when the noise is stationary, we can deduce that speech is present in the region where 85% to 90% of the LTSV is concentrated; on the other hand, in the region where 10% to 15% of the LTSV is concentrated, only noise is present.

An estimate based on variability is similar to that of equation (3), with the energy windows used for the estimates corresponding to regions of the LTSV. However, before we compute those estimates, we first apply smoothing windows to the LTSV and median filtering to the corresponding energy regions.

2.3. Pitch

Another measure we can use to identify speech regions is pitch detection. We use the openSMILE software [17] to extract pitch information from the signal. Since pitch transitions are abrupt, and speech exists in the neighbourhood of pitch regions, we apply smoothing to the outcome of pitch detection. Afterwards, we estimate P(X) and P(N) based on percentage regions of pitch presence in the signal, in a similar fashion to equation (3).
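As a minimal sketch (assuming NumPy; the function names are ours, not the paper's), equations (1) and (2) can be evaluated directly when both the clean speech and the noise are available:

```python
import numpy as np

def global_snr(s, n):
    """Eq. (1): global SNR as the ratio of speech power to noise power, in dB."""
    return 10.0 * np.log10(np.mean(s ** 2) / np.mean(n ** 2))

def global_snr_from_mixture(x, n):
    """Eq. (2): the same SNR recovered from the mixture x = s + n, valid when
    s and n are independent and zero-mean, so that P(X) ~ P(S) + P(N)."""
    p_x = np.mean(x ** 2)
    p_n = np.mean(n ** 2)
    return 10.0 * np.log10((p_x - p_n) / p_n)
```

For an alternating ±1 "speech" signal and a constant noise of one tenth its RMS, both forms give 20 dB, illustrating that (2) matches (1) when the cross term vanishes.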
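Equation (3) itself is not reproduced in this excerpt, but from equation (2) and the percentile-window description above, one plausible reading of the energy-based estimator can be sketched as follows. Everything here is an illustrative assumption: the frame and hop sizes, the helper names, and the exact way the percentile bands select frames; the default bands mirror one row of the percentile table in section 3.

```python
import numpy as np

def frame_energies(x, frame_len=400, hop=160):
    """Average energy per frame (frame and hop sizes are illustrative)."""
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]
    return np.array([np.mean(f ** 2) for f in frames])

def smooth(e, win=5):
    """Moving-average smoothing of the frame-energy track."""
    kernel = np.ones(win) / win
    return np.convolve(e, kernel, mode="same")

def percentile_snr(e, speech_band=(85, 95), noise_band=(5, 15)):
    """Assumed eq.-(3)-style estimate: P(X) from frames in a high-percentile
    energy band, P(N) from a low-percentile band, plugged into eq. (2)."""
    lo_s, hi_s = np.percentile(e, speech_band)
    lo_n, hi_n = np.percentile(e, noise_band)
    p_x = np.mean(e[(e >= lo_s) & (e <= hi_s)])
    p_n = np.mean(e[(e >= lo_n) & (e <= hi_n)])
    return 10.0 * np.log10((p_x - p_n) / p_n)
```

On a synthetic energy track whose upper half sits at speech-plus-noise power and whose lower half sits at noise power, the estimate reduces exactly to equation (2).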
kind. The SNR estimation is based on the features we described and is given by:

    SNR_hat = Σ_{i=1}^{M} a_i · f_i + ε    (4)

where M is the number of features, ε is the disturbance term, and a_i and f_i are the regression coefficients and the regressors, respectively.

In the second case, we have no prior knowledge about the kind of noise that corrupts the signal. Instead, we use a classification scheme to identify the noise type and use the appropriate regression model. In [18], the authors use a K-Nearest Neighbour (KNN) classifier based on Bark-scale features to classify noise types. In our work we have used a KNN classifier on 13 MFCCs.

3. EXPERIMENTAL SETUP

The percentile values a, b, c, d in (3) we used to estimate the energies are shown in table 1:

    Table 1
    a      b      c      d
    85%    95%    5%     15%
    80%    90%    10%    20%

These values were the result of experimental procedure. Our experiments showed that adding more features (i.e. more smoothing windows, etc.) boosts the performance of the estimation. Since this is a work in progress, in the future we plan to provide a detailed analysis of the impact each feature has on the model.

For every noise type we used 1680 clean speech files from the TIMIT database, sampled at 16 kHz, in which we introduced silence periods randomly selected between 3 and 10 seconds to create signals with unknown speech boundaries. Then we added noise at six SNR levels (-5 dB, 0 dB, 5 dB, 10 dB, 15 dB, 20 dB), resulting in a total of 10080 training samples per regression model.

For the KNN classifier we used 20 nearest neighbours (K = 20) based on 13 MFCCs. We used the same set of 1680 files (adding noise at every SNR level) to train the KNN classifier. The final decision is made by calculating the probability of each class in every frame, followed by a majority vote.

4. EXPERIMENTAL RESULTS

We have tested our system on five different noise types. We randomly selected 150 files from the TIMIT database (there was no overlap between the training and testing files). In each file we introduced silence regions of 3 to 10 seconds and then added noise at 6 different SNR levels. We compared our method with the WADA and NIST SNR estimation methods using the mean absolute error metric. In all cases we found that our method outperforms the other methods.

Fig. 1. Mean absolute error for White Noise.

Fig. 2. Mean absolute error for Pink Noise.

In figures 1, 2, 3 the results for white, pink, and car interior noise are presented. By comparing the mean absolute error of our method with that of the WADA and NIST SNR methods for 6 different SNR levels, it is clear that our method provides better estimates at every SNR level (the difference in error ranges from 0.3 dB to 7 dB).

In the case of machine gun noise (figure 4), our method greatly outperforms the other methods (the difference in mean absolute error is about 30 dB). Both WADA and NIST SNR fail to provide accurate estimates, as shown by their mean absolute error.
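The combination step of equation (4) is an ordinary-least-squares fit over the feature-based estimators. A minimal sketch with synthetic data (NumPy's `lstsq` is our stand-in solver, with an explicit intercept column absorbing the mean of the disturbance term):

```python
import numpy as np

def fit_snr_regression(F, snr_true):
    """Fit eq. (4), SNR ~ sum_i a_i * f_i + eps, by ordinary least squares.
    F: (num_samples, M) matrix of feature-based SNR estimators."""
    A = np.hstack([F, np.ones((F.shape[0], 1))])  # append intercept column
    coeffs, *_ = np.linalg.lstsq(A, snr_true, rcond=None)
    return coeffs

def predict_snr(F, coeffs):
    """Apply the fitted coefficients to new feature rows."""
    A = np.hstack([F, np.ones((F.shape[0], 1))])
    return A @ coeffs
```

With noise-free synthetic targets, the fit recovers the generating coefficients exactly, which is a convenient sanity check before training on real feature matrices.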
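The corruption step in this setup (adding noise at a prescribed SNR level) is conventionally done by rescaling the noise so the power ratio hits the target; a sketch under the same power-ratio definition as equation (1) (the function name and scaling rule are ours):

```python
import numpy as np

def mix_at_snr(s, n, target_snr_db):
    """Return s + g*n, with gain g chosen so that
    10*log10(P(S) / P(g*N)) equals target_snr_db (eq. (1), power form)."""
    p_s = np.mean(s ** 2)
    p_n = np.mean(n ** 2)
    g = np.sqrt(p_s / (p_n * 10.0 ** (target_snr_db / 10.0)))
    return s + g * n
```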
Fig. 3. Mean absolute error for Car Interior Noise.

Finally, in the case of Babble Speech Noise (figure 5), we can see that only for 0 dB does the WADA method perform better. Since babble speech noise is similar to speech, some of our features (i.e. pitch, voicing) fail at some energy levels. However, our method gives better estimates overall.

The above results refer to the case where we know the type of noise that corrupts the signal and choose the appropriate regression model. In the second set of experiments we used the same test set of files. In every case where we corrupted a signal with a noise type that was used for training the KNN classifier, the signal was correctly classified and the appropriate regression model was used. Since our classifier achieved perfect accuracy for the given set of noises, we tried to corrupt a signal with high-frequency noise (which was not used for training the classifier). The classifier chose the regression model for white noise. In figure 6 we can see the results when we corrupted signals with high-frequency noise and used the white noise regression model to estimate the SNR.

Fig. 5. Mean absolute error for Babble Speech Noise.

... other SNR estimation algorithms (WADA, NIST SNR), and the proposed method in general outperforms across all experimental conditions.

Finally, we plan to attempt to generalize across noise types. Moreover, we want to improve our channel classification by employing features that can capture noise characteristics, since it is well known that MFCCs are not very robust under noise conditions. We also plan to test more advanced classifiers (e.g. DBN-DNNs, SVMs) as well as adaptive schemes and soft-assignment approaches that will generalize better to unseen noise conditions.
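The noise-type classification used in the second set of experiments (frame-level KNN followed by a majority vote, as described in section 3) can be sketched as below; the Euclidean metric and the random stand-in features are our assumptions, with 13-dimensional vectors in place of actual MFCCs:

```python
import numpy as np
from collections import Counter

def knn_label(frame, train_feats, train_labels, k=20):
    """Label one frame by a vote among its k nearest training frames."""
    d = np.linalg.norm(train_feats - frame, axis=1)  # Euclidean distances
    nearest = np.argsort(d)[:k]
    return Counter(train_labels[nearest]).most_common(1)[0][0]

def classify_noise(frames, train_feats, train_labels, k=20):
    """Per-frame KNN decisions followed by a majority vote over all frames."""
    votes = [knn_label(f, train_feats, train_labels, k) for f in frames]
    return Counter(votes).most_common(1)[0][0]
```

On two well-separated synthetic noise classes, this pipeline assigns a test signal's frames to the nearer class and the majority vote returns that class label.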
6. REFERENCES

[1] H. G. Hirsch and C. Ehricher, "Noise estimation techniques for robust speech recognition," in Proc. IEEE ICASSP, 1995.

[2] J. Morales-Cordovilla, N. Ma, V. Sanchez, J. Carmona, A. Peinado, and J. Barker, "A pitch based noise estimation technique for robust speech recognition with missing data," in Proc. IEEE ICASSP, 2011, pp. 4808–4811.

[3] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.

[4] C. Plapous, C. Marro, and P. Scalart, "Improved signal-to-noise ratio estimation for speech enhancement," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 6, pp. 2098–2108, 2006.

[5] Y. Ren and M. T. Johnson, "An improved SNR estimator for speech enhancement," in Proc. IEEE ICASSP, 2008, pp. 4901–4904.

[13] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5, pp. 504–512, 2001.

[14] C. Kim and R. M. Stern, "Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis," in Proc. Interspeech, 2008, pp. 2598–2601.

[15] E. Nemer, R. Goubran, and S. Mahmoud, "SNR estimation of speech signals using subbands and fourth-order statistics," IEEE Signal Processing Letters, vol. 6, no. 7, pp. 171–174, 1999.

[16] P. Ghosh, A. Tsiartas, and S. Narayanan, "Robust voice activity detection using long-term signal variability," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 600–613, 2011.

[17] "openSMILE," http://opensmile.sourceforge.net/.

[18] C. Eamdeelerd and K. Songwatana, "Audio noise classification using Bark scale features and k-NN technique," in Proc. International Symposium on Communications and Information Technologies (ISCIT), 2008, pp. 131–134.