
EFFICIENT CLASSIFICATION OF NOISY SPEECH USING NEURAL NETWORKS

C. Shao, M. Bouchard

School of Information Technology and Engineering,


University of Ottawa, 161 Louis-Pasteur,
Ottawa (Ontario), K1N 6N5, Canada, e-mail: bouchard@site.uottawa.ca

This work was supported by the Natural Sciences and Engineering Research Council (NSERC), Canada.

ABSTRACT

The classification of active speech vs. inactive speech in noisy speech is an important part of speech applications, typically in order to achieve a lower bit-rate. In this work, the error rates for raw classification (i.e. with no hangover mechanism) of noisy speech obtained with traditional classification algorithms are compared to the rates obtained with Neural Network classifiers trained with different learning algorithms. The traditional classification algorithms used are the linear classifier, some Nearest Neighbor classifiers and the Quadratic Gaussian classifier. The training algorithms used for the Neural Network classifiers are the Extended Kalman Filter and the Levenberg-Marquardt algorithm. An evaluation of the computational complexity of the different classification algorithms is presented. Our noisy speech classification experiments show that Neural Network classifiers typically produce a more accurate and more robust classification than the other traditional algorithms, while having a significantly lower computational complexity. Neural Network classifiers may therefore be a good choice for the core component of a noisy speech classifier, which would typically also include a hangover mechanism and possibly a speech enhancement algorithm.

1. INTRODUCTION

With the development of multimedia and Internet communications, classification is increasingly used in speech and audio applications. One of the best-known classification applications in speech processing is the VAD (Voice Activity Detector) used in speech coding standards [1],[2]. In this case, different encoders are used for speech and noise based on the decision of the VAD, in order to reduce the overall bit rate for transmission. Classification is also used in other speech applications such as multi-mode low bit-rate coding [3],[4], and to classify different types of noise in order to improve the performance of Comfort Noise Generation (CNG) [5].

This paper considers the classification of active speech vs. inactive speech in noisy speech (i.e. VAD classification). The work focuses on raw classification, without hangover mechanisms or speech enhancement, in order to identify which type of algorithm would lead to a better core in a more complex classifier. Hangover schemes such as those in [1],[2] are usually added to a raw classifier to smooth its decisions (i.e. to avoid spurious changes in the classification decision), by combining the current raw decision with the decisions of the previous frames.
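To make the idea concrete, a minimal hangover rule of this kind is sketched below in Python; the 1/0 frame labels, the hangover length and the function itself are illustrative assumptions and are not taken from [1],[2] (recall that the present work deliberately evaluates the raw decisions without such smoothing).

```python
def apply_hangover(raw_decisions, hangover_frames=6):
    """Smooth a sequence of raw VAD decisions (1 = active, 0 = inactive).

    After the raw decision falls back to 0, the output is held at 1 for
    `hangover_frames` additional frames, which bridges short spurious
    drops inside an utterance.  Illustrative sketch only.
    """
    smoothed, counter = [], 0
    for d in raw_decisions:
        if d == 1:
            counter = hangover_frames   # reload the hangover counter
            smoothed.append(1)
        elif counter > 0:
            counter -= 1                # still inside the hangover period
            smoothed.append(1)
        else:
            smoothed.append(0)
    return smoothed

# Example: isolated inactive frames inside an active region are bridged over.
print(apply_hangover([1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1], hangover_frames=3))
```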
In this paper, the error rates for raw classification (i.e. with no hangover mechanism) of noisy speech obtained with traditional classification algorithms are compared to the rates obtained with Neural Network classifiers trained with different learning algorithms. The traditional classification algorithms used in the paper are the linear classifier, some Nearest Neighbor classifiers and the Quadratic Gaussian classifier. The training algorithms used for the Neural Network classifiers are the Extended Kalman Filter and the Levenberg-Marquardt algorithm. Previous work on speech classification using Neural Network classifiers has been published [6],[7]; however, the simple Back Propagation algorithm was used for the training and, as mentioned later in this work, this can greatly affect the classification performance. Moreover, features that are known to be efficient for speech classification are used in this paper. The computational complexity of the different classification algorithms is also estimated. The paper is organized as follows: Section 2 introduces the classification features used in the experiments and briefly describes the algorithms for the different classifiers, Section 3 shows the experimental results of active speech vs. inactive speech classification in noisy speech, Section 4 presents a comparison of the computational complexity for each considered classifier, and Section 5 presents a conclusion and ideas for future work.

2. CLASSIFICATION FEATURES AND ALGORITHMS

The features that were used for the noisy speech classification are very similar to the ones used in ITU-T G.729B [1], except that the average values of the features were not subtracted from the features. The first features are the 10th order LSF (Line Spectral Frequency) coefficients [1]. The Zero-Crossing (ZC) rate [1] is the second feature used. The third and fourth features are energy measures: the full-band energy and the lower-band energy (i.e. the energy of the speech filtered with a 1 kHz cutoff low-pass filter), as in [1]. The sound files used for the experiments were 16-bit linear PCM files with an 8 kHz sampling rate for both speech (2 males and 2 females, 10 sentences each) and noise (street, car and music). The files were divided into frames and the features were extracted on a frame-by-frame basis with a 20 ms frame size. Each frame was non-overlapping and was processed with a 160-coefficient Hamming window. The LSF coefficients were calculated using a 10th order Levinson-Durbin algorithm, followed by a conversion of the resulting LP coefficients to LSF coefficients. The zero-crossing rate ZC, full-band energy Ef and low-band energy El were also calculated on a frame-by-frame basis. For each frame there are thus 10 LSF coefficients, 1 ZC, 1 Ef and 1 El, which form a 13-dimensional feature vector.
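A rough sketch of this frame-based extraction is given below for the three scalar features (the 10 LSF coefficients are omitted for brevity); the log-energy definitions and the low-pass filter order are assumptions of this illustration, not details taken from G.729B.

```python
import numpy as np
from scipy.signal import butter, lfilter

FRAME = 160  # 20 ms at 8 kHz

def frame_features(frame, fs=8000):
    """Zero-crossing rate, full-band and low-band energies for one 20 ms frame.

    The 10 LSF coefficients used in the paper (Levinson-Durbin followed by an
    LP-to-LSF conversion) are not computed here; this sketch only shows the
    three scalar features.
    """
    windowed = frame * np.hamming(len(frame))               # 160-coefficient Hamming window
    zc = np.mean(np.abs(np.diff(np.sign(windowed))) > 0)    # zero-crossing rate
    ef = 10 * np.log10(np.mean(windowed ** 2) + 1e-12)      # full-band energy (dB, assumed)
    b, a = butter(4, 1000 / (fs / 2))                       # 1 kHz low-pass (order assumed)
    low = lfilter(b, a, windowed)
    el = 10 * np.log10(np.mean(low ** 2) + 1e-12)           # low-band energy (dB, assumed)
    return zc, ef, el

# Example with a synthetic 20 ms frame of white noise:
rng = np.random.default_rng(0)
print(frame_features(rng.standard_normal(FRAME)))
```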
The four types of classifiers compared for noisy speech classification in this paper are the linear Least-Squares classifier, some Nearest Neighbor classifiers, the Quadratic Gaussian classifier, and some Neural Network classifiers. A linear classifier assumes that the class boundaries can be defined by a linear combination of the input features. It uses Least-Squares optimization or Least Mean-Square optimization to find the optimal estimate of a linear function between the features and the decision space [8]. An additional DC feature (the bias) is often used to achieve a better classification.
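A minimal sketch of such a Least-Squares linear classifier with an appended bias feature is shown below; the {-1, +1} labelling and the toy data are assumptions made only for the example.

```python
import numpy as np

def train_linear_ls(X, y):
    """Least-Squares linear classifier with a bias (DC) feature.

    X: (num_frames, M) feature matrix; y: labels in {-1, +1}
    (+1 = active speech, -1 = inactive).  Illustrative sketch only.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append the DC/bias feature
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)      # Least-Squares solution
    return w

def classify_linear(w, x):
    """Decision: sign of the dot product between [x, 1] and the weight vector."""
    return 1 if np.append(x, 1.0) @ w > 0 else -1

# Example with random 13-dimensional features and toy labels:
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 13))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(200))
w = train_linear_ls(X, y)
print(classify_linear(w, X[0]))
```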
The Nearest Neighbor classifiers use the distance between the measured features and the prototypes of each class to perform the classification [9]. Many distance measures could be used; in this paper, the Euclidean distance is used. The k-means algorithm from vector quantization [10] is used to find the centroids/prototypes of each class. During the testing phase (i.e. after training), there are different decision rules for the Nearest Neighbor classifiers. The simplest one is 1-NN: it assigns the class of the closest centroid among all the centroids of all classes. Another rule is k-NN, which first finds the k (or k%) nearest neighbors and then chooses the most frequently occurring class (the highest probability) among these k (or k%) centroids. From our experiments, it was found that this decision rule did not work particularly well for speech classification. A modified k-NN (3-NN, to be more specific) was used instead, denoted k-NN-major. It chooses the majority among the k Nearest Neighbors, but favors the nearest one and changes the decision only when the frequency of the second or third candidate is significantly greater than that of the nearest one. When the number of prototypes in each class is different, the probability of each centroid for k-NN should be normalized according to the ratio of the numbers of centroids for each class.
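The following sketch illustrates the prototype-based approach with plain k-means and a 1-NN decision; the prototype counts mirror one configuration of Table 1, but the data are synthetic and the k-NN-major tie-breaking rule described above is not reproduced.

```python
import numpy as np

def kmeans_prototypes(X, n_prototypes, n_iter=20, seed=0):
    """Plain k-means to obtain class prototypes (centroids) from training frames."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), n_prototypes, replace=False)]
    for _ in range(n_iter):
        # assign each frame to its nearest centroid (Euclidean distance)
        labels = np.argmin(((X[:, None, :] - centroids) ** 2).sum(-1), axis=1)
        for j in range(n_prototypes):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def classify_1nn(x, centroids, classes):
    """1-NN: pick the class of the single closest centroid."""
    d = ((centroids - x) ** 2).sum(axis=1)
    return classes[int(np.argmin(d))]

# Example: 100 "active" and 50 "inactive" prototypes (sizes as in Table 1, data synthetic).
rng = np.random.default_rng(2)
active = rng.standard_normal((500, 13)) + 1.0
inactive = rng.standard_normal((500, 13)) - 1.0
protos = np.vstack([kmeans_prototypes(active, 100), kmeans_prototypes(inactive, 50)])
classes = np.array([1] * 100 + [0] * 50)
print(classify_1nn(active[0], protos, classes))
```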
The Quadratic Gaussian classifier classifies the input features according to probability information about the population of the training data [9]. It assumes that the feature vectors of each class obey a multivariate Gaussian distribution. The decision function is di(x) = p(ωi) f(x | x ∈ ωi), where p(ωi) is the a priori probability that x is chosen from class i, and f(x | x ∈ ωi) is the conditional Gaussian probability distribution function of the observation x given that x is chosen from class i. Only the covariance matrix of x and the mean vector of x are required to fully describe the probability distribution function.
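A minimal sketch of this decision rule is given below, assuming pre-estimated means, covariances and priors; a single Gaussian per class is used here for simplicity (the experiments use several prototypes per class), and the log-domain evaluation is an implementation choice of this illustration.

```python
import numpy as np

def gaussian_discriminant(x, mean, cov, prior):
    """Evaluate log di(x) = log[p(ωi) f(x | x ∈ ωi)] for one Gaussian class model.

    mean, cov and prior are assumed to be estimated from the training data and
    pre-stored, as in the complexity discussion of Section 4.
    """
    d = len(mean)
    diff = x - mean
    cov_inv = np.linalg.inv(cov)
    log_f = (-0.5 * diff @ cov_inv @ diff
             - 0.5 * np.log(np.linalg.det(cov))
             - 0.5 * d * np.log(2 * np.pi))
    return np.log(prior) + log_f

def classify_quadratic_gaussian(x, models):
    """models: list of (mean, cov, prior) per class; pick the largest discriminant."""
    scores = [gaussian_discriminant(x, m, c, p) for (m, c, p) in models]
    return int(np.argmax(scores))

# Toy example with two 13-dimensional classes:
m0, m1, cov = np.zeros(13), np.ones(13), np.eye(13)
print(classify_quadratic_gaussian(m1, [(m0, cov, 0.6), (m1, cov, 0.4)]))
```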
A Neural Network classifier is an artificial network of parallel processing units working together. A feedforward multi-layer perceptron structure with one hidden layer [11] was used in this paper. The activation function of the neurons in the hidden layer was selected to be the hyperbolic tangent function, and there was no activation function in the output layer (identity linear function). Although standard Back Propagation (BP) is a common learning algorithm for multilayer perceptron networks, it converges slowly and typically suffers from local minima. Some fast learning algorithms have been proposed in the past to train feedforward neural networks, such as the conjugate gradient algorithm, the quasi-Newton algorithm, the Levenberg-Marquardt algorithm, and the Extended Kalman Filter algorithm. The Matlab™ toolbox documentation indicates that the Levenberg-Marquardt algorithm (LM) is typically better than the conjugate gradient or quasi-Newton algorithms [12]. The Extended Kalman Filter algorithm (EKF) is a non-linear system identification algorithm which can also be used as a fast training algorithm for feedforward neural networks [13]. It typically does not require guessing step sizes as in the standard BP algorithm and the LM algorithm, and usually shows excellent performance in terms of convergence speed and the solution achieved. For the LM algorithm, the Matlab™ Neural Network toolbox implementation was used [12], while for the EKF our own implementation was programmed.
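For reference, the testing-phase forward pass of such a network takes only a few lines; the sketch below assumes already trained weights (the LM and EKF training procedures themselves are not reproduced) and thresholds the single linear output to an active/inactive decision.

```python
import numpy as np

def mlp_forward(x, W1, b1, w2, b2):
    """Forward pass of a one-hidden-layer perceptron classifier.

    Hidden layer: hyperbolic tangent activations; output layer: a single
    linear (identity) neuron, as described in the text.  The weights are
    assumed to come from LM or EKF training, which is not shown here.
    """
    hidden = np.tanh(W1 @ x + b1)        # n1 hidden neurons, tanh activation
    output = w2 @ hidden + b2            # single linear output neuron
    return 1 if output > 0 else 0        # threshold to an active/inactive decision

# Example shapes: M = 13 input features, n1 = 20 hidden neurons (values from the paper).
rng = np.random.default_rng(4)
M, n1 = 13, 20
W1, b1 = rng.standard_normal((n1, M)), rng.standard_normal(n1)
w2, b2 = rng.standard_normal(n1), rng.standard_normal()
print(mlp_forward(rng.standard_normal(M), W1, b1, w2, b2))
```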
3. EXPERIMENTAL RESULTS FOR NOISY SPEECH CLASSIFICATION

In this section, the results for raw classification (i.e. with no hangover mechanism) of active vs. inactive parts of noisy speech are presented, under different SNR (Signal-to-Noise Ratio) conditions. The active speech frames were constructed by mixing the clean speech frames (with silence sections removed) with different types of "noise": car noise, street noise and classical music, at different SNRs (20 dB and 10 dB). The inactive speech/silence parts were built from different segments of the same sources of noise. The database was grouped into several sets according to the different SNRs and noise types. There were about 4300 frames for each active speech set, and 3800 frames for each inactive speech or noise set. For all the sets in the database, the order of the feature vectors was randomized. During training, 600 frames from the clean speech set and from the 20 dB and 10 dB SNR noisy speech sets (for each of the three noise types) were chosen to construct the training data; all the remaining frames were available for the testing set. Thus the same speech files (i.e. the same male and female speakers) were used for training and testing, but different frames were used during each phase. During testing, the active speech and inactive speech/noise segments were chosen at a 40%:60% ratio, for every SNR. In total, 4200:4200 active/inactive frames were used for training and 8000:12000 active/inactive frames were used for testing. The same weight was given to active speech misclassified as inactive (i.e. clipping) and to inactive speech misclassified as active. It would be possible to modify the different algorithms to further reduce the active speech misclassifications by biasing the cost functions, but this was not done in our simulations.
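The mixing at a prescribed SNR can be sketched as follows; this generic construction is assumed to be representative of how the noisy sets were built, but the exact scaling procedure is an assumption of this illustration.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals snr_db,
    then add it to `speech`.  Generic construction, not the exact recipe
    used for the experiments.
    """
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Example: build a 20 dB SNR frame from synthetic signals and verify the SNR.
rng = np.random.default_rng(5)
speech = np.sin(2 * np.pi * 200 * np.arange(160) / 8000)
noisy = mix_at_snr(speech, rng.standard_normal(160), snr_db=20)
print(round(10 * np.log10(np.mean(speech**2) / np.mean((noisy - speech)**2)), 1))
```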
Table 1 presents the main results of the active speech vs. inactive speech classification for noisy speech. In Table 1, the number of prototypes used for active and inactive speech is indicated when appropriate (e.g. 100+50), and the number of neurons is also indicated for the Neural Network classifiers. It was found that the performance of the Neural Network classifiers was significantly less sensitive to the number of neurons than the performance of the other classifiers was to the number of prototypes. This is therefore a first aspect in which the performance of the Neural Network classifiers was more robust. From Table 1, when the testing data was clean speech the Quadratic Gaussian classifier produced the best performance with a 2.5 % total error rate, followed by the Neural Network classifiers trained with the LM and EKF algorithms, with 4.62 % and 5 % total error rates, respectively. However, the performance of the Quadratic Gaussian classifier degraded severely when more realistic 20 dB SNR noisy speech was used. In this case the Neural Network classifiers with LM and EKF produced a much better performance than all the other algorithms (5.82 % and 5.85 % total error rates, compared to 9.18 % for the next best classifier). For 10 dB SNR noisy speech, the Neural Network classifiers again outperformed all the other types of classifiers, with 15.46 % and 17.09 % total error rates for the LM- and EKF-trained networks, while the next best method had a 20.36 % total error rate. Since the Neural Network classifiers have a better performance in our experiments with finite SNR noisy speech, and considering that their complexity is lower (as shown in Section 4), they are an interesting choice for practical implementations. It should be noted that the use of an advanced training algorithm (LM or EKF) was required for the Neural Network classifiers to produce a good performance; only poor results were obtained using the basic Back Propagation (BP) training algorithm. The main problem with the BP algorithm was not the slow convergence speed, but mostly the local minima found by the algorithm.

4. COMPARISON OF THE COMPUTATIONAL COMPLEXITY

The computational complexity of the different classifiers during the testing phase (i.e. the phase that matters in practice) is presented in Table 2. For the linear classifier, a single dot product Xᵀ·W has to be computed, where X and W are (M+1) by 1 vectors (M is the number of features used; there is an extra dimension for the bias in the input vector). The Nearest Neighbor classifiers need to calculate the distances from the input vector to all the centroids, and then compare them. Using the Euclidean distance, the computation amounts mainly to two dot products of size M for each centroid. In Table 2, N is used to denote the total number of centroids used by the Nearest Neighbor or Quadratic Gaussian classifiers. The discriminant function of the Quadratic Gaussian classifier is f(x|ωi)p(ωi), and most of the computation is needed to calculate the probability f(x|ωi). Assuming that the mean vector and covariance matrix of each prototype and the p(ωi) probabilities are pre-calculated and stored in memory (which may require a very significant amount of memory), the main complexity comes from a matrix product, which is proportional to M² computations. This computation must be done for each centroid, thus N times. For the Neural Network classifiers, in the testing phase a linear combination is computed in each neuron, followed by an activation function. Assuming that all the activation functions in the hidden layer neurons are hyperbolic tangents and can be implemented as a single look-up table, most of the computational complexity comes from the linear combinations, which have size (M+1) for each of the n1 neurons of the hidden layer (including a bias weight) and size (n1+1) for the neurons of the output layer (a single neuron at the output layer was used in this paper).
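The example counts in Table 2 follow directly from these expressions; the short check below simply plugs in the Section 3 values (M = 13 features, N = 150 centroids for the 1-NN classifier, N = 80 prototypes for the Quadratic Gaussian classifier, and n1 = 20 hidden neurons).

```python
# Reproducing the per-frame operation counts of Table 2 with the Section 3 values.
M, N_nn, N_qg, n1 = 13, 150, 80, 20

print("linear       multiplies:", M + 1, "additions:", M)                 # 14, 13
print("1-NN         multiplies:", (2 * M + 1) * N_nn,
      "additions:", (2 * M - 1) * N_nn)                                   # 4050, 3750
print("quad. Gauss. multiplies:", N_qg * (M**2 + M + 3),
      "additions:", N_qg * (M**2 + M - 1))                                 # 14800, 14480
print("neural net   multiplies:", M * n1 + 2 * n1 + 1,
      "additions:", (M + 1) * n1)                                          # 301, 280
```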
Table 2 also provides an example of the complexity with the values of M, N, and n1 used for the results of Section 3. It is clear from this table that the complexity of the Quadratic Gaussian method is the highest, much higher than for the Nearest Neighbor and Neural Network classifiers. The Neural Network classifiers also have a significantly lower complexity than the Nearest Neighbor classifiers. Except for the linear classifier (which produced a very weak performance), the computational load of the Neural Network classifiers is the lowest.

5. CONCLUSION AND FUTURE WORK

In this paper, the raw classification (i.e. with no hangover mechanism) of active and inactive segments in noisy speech was explored, using classical classifiers such as linear classifiers, Nearest Neighbor classifiers and Quadratic Gaussian classifiers, and using Neural Network classifiers trained with advanced algorithms such as the Levenberg-Marquardt algorithm and the Extended Kalman Filter algorithm. The experimental results presented in the paper illustrate that, in general, the Neural Network classifiers obtained with one of the advanced training algorithms produced the best and most robust performance for the classification of active speech vs. inactive speech in noisy speech. The computational load of the Neural Network classifiers during the testing phase was also shown to be significantly less than for the other non-linear classifiers. Neural Network classifiers may therefore be a good choice for the core component of a noisy speech classifier, which would typically also include a hangover mechanism and possibly a speech enhancement algorithm. Future work could include classification with a biased cost function to further minimize the misclassifications of active speech (i.e. clipping), a comparison with other machine learning algorithms such as Support Vector Machines, classification of speech under noisier conditions, and a comparison of Neural Network classifiers combined with hangover mechanisms and speech enhancement against standard classifiers using such mechanisms.
6. REFERENCES

[1] ITU-T G.729B, "A Silence Compression Scheme for G.729 Optimized for Terminals Conforming to Recommendation V.70", International Telecommunication Union, 1996.
[2] ETSI GSM 06.32, "Full rate speech; VAD for full rate speech traffic channel", European Telecommunications Standards Institute, 1998.
[3] J. Thyssen et al., "A Candidate for the ITU-T 4 kbits/s Speech Coding Standard", ICASSP 2001, Vol. 2, pages 681-684, Salt Lake City, USA, May 2001.
[4] G. Ruggeri, F. Beritelli and S. Casale, "Hybrid Multi-mode/Multi-rate CS-ACELP Speech Coding for Adaptive Voice over IP", ICASSP 2001, Vol. 2, pages 733-736, Salt Lake City, USA, May 2001.
[5] K. El-Maleh, A. Samouelian and P. Kabal, "Frame-Level Noise Classification in Mobile Environments", ICASSP 1999, Vol. 2, pages 237-240, Phoenix, USA, March 1999.
[6] T. G. Crippa et al., "A Fast Neural Network Training Algorithm and its Application to Voiced-Unvoiced-Silence Classification of Speech", ICASSP 1991, Vol. 1, pages 441-447, Toronto, Canada, May 1991.
[7] J. Ikedo, "Voice Activity Detection Using Neural Network", IEICE Trans. Communications, Vol. E81-B, pages 2509-2513, Dec. 1998.
[8] B. Widrow and S. D. Stearns, Adaptive Signal Processing, Prentice-Hall, 1985.
[9] E. Micheli-Tzanakou, Supervised and Unsupervised Pattern Recognition: Feature Extraction and Computational Intelligence, CRC Press, 1999.
[10] J. Makhoul, S. Roucos and H. Gish, "Vector Quantization in Speech Coding", Proceedings of the IEEE, Vol. 73, pages 1551-1588, Nov. 1985.
[11] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd edition, Prentice-Hall, 1999.
[12] Matlab™ Neural Network Toolbox Reference Guide, The MathWorks Inc., 1995.
[13] S. Singhal and L. Wu, "Training Feed-Forward Networks with the Extended Kalman Algorithm", ICASSP 1989, Vol. 2, pages 1187-1190, Glasgow, Scotland, May 1989.

TABLE 1
THE EXPERIMENTAL RESULTS FOR THE CLASSIFICATION OF ACTIVE SPEECH VS. INACTIVE SPEECH SEGMENTS IN NOISY SPEECH

Classifier                     Testing set   Error, active   Error, inactive   Weighted total
                               (SNR)         speech (%)      speech (%)        error rate (%)
Linear (Least-Squares)         clean         23.05            0.43              9.48
                               20 dB         23.23            1.39             10.13
                               10 dB         13.30           39.48             29.01
Nearest Neighbor               clean         18.80            2.96              9.30
(100+50 prototypes, 1-NN)      20 dB         22.32           16.08             18.12
                               10 dB         21.93           22.83             22.93
Nearest Neighbor               clean         20.30            2.66              9.72
(75+50 prototypes, 3-NN)       20 dB         22.43            1.47              9.86
                               10 dB         14.22           27.62             22.26
Quadratic Gaussian             clean          2.80            2.30              2.50
(50+30 prototypes)             20 dB          7.38           10.37              9.18
                               10 dB         14.70           24.13             20.36
Neural Network                 clean          5.85            3.80              4.62
(20 neurons with LM)           20 dB         12.27            1.57              5.82
                               10 dB         19.37           12.85             15.46
Neural Network                 clean          6.50            4.00              5.00
(20 neurons with EKF)          20 dB         12.48            1.42              5.85
                               10 dB         18.26           16.31             17.09

TABLE 2
COMPARISON OF THE COMPUTATIONAL COMPLEXITY FOR THE DIFFERENT CLASSIFIERS

Classifier                  Operation      Computational complexity    Example (Section 3 values)
Linear classifier           multiplies     M+1                          14
                            additions      M                            13
                            others         one comparison               1
Nearest Neighbor            multiplies     (2M+1)N                      4050
classifier (1-NN)           additions      (2M-1)N                      3750
                            others         N number comparisons         150
Quadratic Gaussian          multiplies     N(M²+M+3)                    14800
classifier                  additions      N(M²+M-1)                    14480
                            others         N number comparisons,        80
                                           N look-up table searches     80
Neural Network              multiplies     M·n1+2·n1+1                  301
                            additions      (M+1)·n1                     280
