EFFICIENT CLASSIFICATION OF NOISY SPEECH USING NEURAL
NETWORKS
C. Shao, M. Bouchard
School of Information Technology and Engineering,
University of Ottawa, 161 Louis-Pasteur, Ottawa (Ontario), K1N 6N5, Canada, e-mail: bouchard@site.uottawa.ca
ABSTRACT

The classification of active speech vs. inactive speech in noisy speech is an important part of speech applications, typically in order to achieve a lower bit rate. In this work, the error rates for raw classification (i.e. with no hangover mechanism) of noisy speech obtained with traditional classification algorithms are compared to the rates obtained with Neural Network classifiers trained with different learning algorithms. The traditional classification algorithms used are the linear classifier, some Nearest Neighbor classifiers and the Quadratic Gaussian classifier. The training algorithms used for the Neural Network classifiers are the Extended Kalman Filter and the Levenberg-Marquardt algorithm. An evaluation of the computational complexity of the different classification algorithms is presented. Our noisy speech classification experiments show that Neural Network classifiers typically produce a more accurate and more robust classification than the other traditional algorithms, while having a significantly lower computational complexity. Neural Network classifiers may therefore be a good choice for the core component of a noisy speech classifier, which would typically also include a hangover mechanism and possibly a speech enhancement algorithm.

1. INTRODUCTION

With the development of multimedia and Internet communications, classification is increasingly used in speech and audio applications. One of the best known classification applications in speech processing is the VAD (Voice Activity Detector) used in speech coding standards [1],[2]. In this case, different encoders are used for speech and noise based on the decision of the VAD, in order to reduce the overall bit rate for transmission. Classification is also used in some other speech applications such as multi-mode low bit-rate coding [3],[4], and to classify different types of noise to achieve an improved performance in Comfort Noise Generation (CNG) [5].

This paper considers the classification of active speech vs. inactive speech in noisy speech (i.e. VAD classification). The work focuses on raw classification, without hangover mechanisms or speech enhancement, in order to identify which type of algorithm would lead to a better core in a more complex classifier. Hangover schemes such as those in [1],[2] are usually added to a raw classifier to smooth its decisions (i.e. to avoid spurious changes in the classification decision), by combining the current raw decision with the decisions of the previous frames. In this paper, the error rates for raw classification (i.e. with no hangover mechanism) of noisy speech obtained with traditional classification algorithms are compared to the rates obtained with Neural Network classifiers trained with different learning algorithms. The traditional classification algorithms used in the paper are the linear classifier, some Nearest Neighbor classifiers and the Quadratic Gaussian classifier. The training algorithms used for the Neural Network classifiers are the Extended Kalman Filter and the Levenberg-Marquardt algorithm. Previous work on speech classification using Neural Network classifiers has been published [6],[7]; however, the simple Back Propagation algorithm was used for the training, and as mentioned later in this work this can greatly affect the classification performance. Moreover, features that are known to be efficient for speech classification are used in this paper. The computational complexity of the different classification algorithms is also estimated. The paper is organized as follows: Section 2 introduces the classification features used in the experiments and briefly describes the algorithms for the different classifiers, Section 3 shows the experimental results of active speech vs. inactive speech classification in noisy speech, Section 4 presents a comparison of the computational complexity for each considered classifier, and Section 5 presents a conclusion and ideas for future work.
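The hangover smoothing mentioned above can be illustrated with a minimal sketch. This is not the scheme of [1] or [2]: the hold length `hang_frames` and the simple hold-counter logic are illustrative assumptions only.

```python
def hangover_smooth(raw_decisions, hang_frames=4):
    """Smooth raw per-frame VAD decisions (1 = active, 0 = inactive).

    Once a frame is declared active, the active decision is held for
    `hang_frames` additional frames, suppressing spurious transitions
    to inactive. The hold length is an illustrative value, not taken
    from the standards cited in the text.
    """
    smoothed, hold = [], 0
    for d in raw_decisions:
        if d == 1:
            hold = hang_frames        # refresh the hangover counter
            smoothed.append(1)
        elif hold > 0:
            hold -= 1                 # still inside the hangover period
            smoothed.append(1)
        else:
            smoothed.append(0)
    return smoothed
```

For example, with a hold of 2 frames, an isolated active frame keeps the decision active for the two following frames before the classifier is allowed to report inactive again.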
This work was supported by the Natural Sciences and Engineering Research Council (NSERC), Canada.

2. CLASSIFICATION FEATURES AND ALGORITHMS

The features used for the noisy speech classification are very similar to the ones used in ITU-T G.729B [1], except that the average values of the features were not subtracted from the features. The first features are the 10th order LSF (Line Spectral Frequency) coefficients [1]. The Zero-Crossing (ZC) rate [1] is the second feature used. The third and fourth features are energy measures: the full-band energy and the lower-band energy (i.e. the energy of the speech filtered with a 1 kHz cutoff low-pass filter), as in [1]. The sound files used for the experiments were 16-bit linear PCM files with an 8 kHz sampling rate, for both speech (2 males and 2 females, 10 sentences each) and noise (street, car and music). The files were divided into frames and the features were extracted on a frame-by-frame basis with a 20 ms frame size. The frames were non-overlapping and each frame was processed with a 160-coefficient Hamming window. The LSF coefficients were calculated using a 10th order Levinson-Durbin algorithm, followed by a conversion to LSF coefficients. The zero-crossing rate ZC, full-band energy Ef and low-band energy El were also calculated on a frame-by-frame basis. For each frame there are thus 10 LSF coefficients, 1 ZC, 1 Ef and 1 El, which form a 13-dimensional feature vector.
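Three of the scalar features above can be sketched for a single 160-sample frame as follows. This is only a sketch: the LSF computation is omitted for brevity, and the 1 kHz low-pass filter of [1] is replaced here by an assumed one-pole smoother with an illustrative coefficient `alpha`, so the low-band energy is only an approximation of the G.729B feature.

```python
import math

def frame_features(frame, alpha=0.67):
    """Zero-crossing rate, full-band log energy and an approximate
    low-band log energy for one 20 ms frame (160 samples at 8 kHz).
    `alpha` is an illustrative smoothing coefficient, not the G.729B
    filter; the 10 LSF coefficients are not computed here.
    """
    n = len(frame)
    # Zero-crossing rate: fraction of adjacent sample pairs changing sign.
    zc = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (n - 1)
    # Full-band log energy (dB), with a small floor to avoid log(0).
    ef = 10.0 * math.log10(sum(s * s for s in frame) / n + 1e-12)
    # Crude low-band signal via a one-pole low-pass filter.
    low, y = [], 0.0
    for s in frame:
        y = alpha * y + (1.0 - alpha) * s
        low.append(y)
    el = 10.0 * math.log10(sum(s * s for s in low) / n + 1e-12)
    return zc, ef, el
```

The full 13-dimensional vector of the paper would concatenate the 10 LSF coefficients with these three scalars.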
The four types of classifiers compared for noisy speech classification in this paper are the linear Least-Squares classifier, some Nearest Neighbor classifiers, the Quadratic Gaussian classifier and some Neural Network classifiers. A linear classifier assumes that the class boundaries can be defined by a linear combination of the input features. It uses Least-Squares optimization or Least Mean-Square optimization to find the optimal estimate of a linear function between the features and the decision space [8]. An additional DC feature (the bias) is often used to achieve a better classification.

The Nearest Neighbor classifiers use the distance between the measured features and the prototypes of each class to perform the classification [9]. Many distance measures can be used; in this paper, the Euclidean distance is used. The k-means algorithm from vector quantization [10] is used to find the centroids/prototypes of each class. During the testing phase (i.e. after training), there are different decision rules for the Nearest Neighbor classifiers. The simplest one is 1-NN: it classifies according to the closest centroid among all the centroids of all classes. Another rule is k-NN, which first finds the k (or k%) nearest neighbors and then chooses the most frequently occurring class (the highest probability) among those k (or k%) centroids. From our experiments, it was found that this decision rule did not work particularly well for speech classification. A modified k-NN (3-NN to be more specific) was used instead, denoted k-NN-major. It chooses the majority among the k nearest neighbors, but favors the nearest one and adjusts the decision only when the frequency of the second or third candidate is significantly greater than that of the nearest one. When the number of prototypes in each class is different, the probability of each centroid for k-NN should be normalized according to the ratio of the numbers of centroids for each class.

The Quadratic Gaussian classifier classifies the input features according to probability information on the population of the training data [9]. It assumes that the feature vectors of each class obey a multivariate Gaussian distribution. The decision function is d_i(x) = p(ω_i) f(x | x ∈ ω_i), where p(ω_i) is the a priori probability that x is chosen from class i, and f(x | x ∈ ω_i) is the conditional Gaussian probability density function of the observation x given that x is chosen from class i. Only the covariance matrix and the mean vector of x for each class are required to fully describe the probability density function.
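The Quadratic Gaussian decision rule can be sketched as below. To keep the sketch free of matrix inversions, a diagonal covariance is assumed, whereas the paper uses full covariance matrices; the comparison is done in the log domain, which leaves the argmax unchanged.

```python
import math

def gaussian_score(x, mean, var, prior):
    """Log of the decision function d_i(x) = p(w_i) f(x | x in w_i),
    under a *diagonal* covariance simplification (the paper uses full
    covariance matrices).
    """
    log_f = 0.0
    for xj, mj, vj in zip(x, mean, var):
        log_f += -0.5 * math.log(2.0 * math.pi * vj) \
                 - (xj - mj) ** 2 / (2.0 * vj)
    return math.log(prior) + log_f

def classify(x, classes):
    """`classes` maps a label to (mean, var, prior); the label with
    the largest decision function wins."""
    return max(classes, key=lambda c: gaussian_score(x, *classes[c]))
```

With the means, variances and priors estimated from training data (or per centroid, as in the paper), classification reduces to evaluating this score for each class and picking the maximum.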
A Neural Network classifier is an artificial intelligence network made of parallel processing units working together. A feedforward multi-layer perceptron structure with one hidden layer [11] was used in this paper. The activation function of the neurons in the hidden layer was selected to be the hyperbolic tangent function, and there was no activation function in the output layer (identity/linear function). Although standard Back Propagation (BP) is a common learning algorithm for multilayer perceptron networks, it converges slowly and typically suffers from local minima problems. Some fast learning algorithms have been proposed in the past to train feedforward neural networks, such as the conjugate gradient algorithm, the quasi-Newton algorithm, the Levenberg-Marquardt algorithm and the Extended Kalman Filter algorithm. The Matlab™ toolbox documentation indicates that the Levenberg-Marquardt algorithm (LM) is typically better than the conjugate gradient or quasi-Newton algorithms [12]. The Extended Kalman Filter algorithm (EKF) is a non-linear system identification algorithm which can also be used as a fast training algorithm for feedforward neural networks [13]. It typically does not require guessing step sizes as in the standard BP algorithm and the LM algorithm, and it usually shows excellent performance in terms of convergence speed and the solution achieved. For the LM algorithm, the Matlab™ Neural Network toolbox implementation was used [12], while our own implementation of the EKF was programmed.
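The testing-phase pass of the one-hidden-layer perceptron just described can be sketched as follows (tanh hidden layer, single linear output neuron, sign threshold for the active/inactive decision). The weight values must come from training (LM or EKF in the paper); any weights used with this sketch are arbitrary illustrations.

```python
import math

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One-hidden-layer perceptron forward pass, as described above:
    tanh activations in the hidden layer, one linear output neuron,
    and a threshold for the binary decision.
    """
    # Hidden layer: linear combination plus bias, then tanh.
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
              for w, b in zip(w_hidden, b_hidden)]
    # Output layer: linear combination plus bias, no activation.
    y = sum(wi * hi for wi, hi in zip(w_out, hidden)) + b_out
    return 1 if y > 0.0 else 0   # 1 = active speech, 0 = inactive
```

In the paper's configuration this is called with 13-dimensional feature vectors and 20 hidden neurons; the per-frame cost is exactly the multiply/addition counts given in Section 4.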
3. EXPERIMENTAL RESULTS FOR NOISY SPEECH CLASSIFICATION

In this section, the results for the raw classification (i.e. with no hangover mechanism) of active vs. inactive parts of noisy speech are presented, under different SNR (Signal-to-Noise) ratios. The active speech frames were constructed by mixing the clean speech frames (with silence sections removed) with different types of "noise": car noise, street noise and classical music, at different SNR ratios (20 dB and 10 dB). The inactive speech/silence parts were built from different segments of the same noise sources. The database was grouped into several sets according to the different SNR ratios and noise types. There were about 4300 frames for each active speech set, and 3800 frames for each inactive speech or noise set. For all the sets in the database, the order of the feature vectors was randomized. For training, 600 frames from the clean speech set and from each of the 20 dB and 10 dB SNR noisy speech sets (for each of the three noise types) were chosen to construct the training data; all the remaining frames were available for the testing set. Thus the same speech files (i.e. the same male and female speakers) were used for training and testing, but different frames were used in each phase. During testing, the active speech and inactive speech/noise segments were chosen at a 40%:60% ratio, for every SNR ratio. In total, 4200:4200 active/inactive frames were used for training and 8000:12000 active/inactive frames were used for testing. The same weight was given to active speech misclassified as inactive (i.e. clipping) and to inactive speech misclassified as active. It would be possible to modify the different algorithms to further reduce the active speech misclassifications by biasing the cost functions, but this was not done in our simulations.
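The noise-mixing step used to build the database above can be sketched generically: scale the noise so that the speech-to-noise power ratio of the mixture equals the target SNR. The exact procedure used for the paper's database is not specified beyond the target SNRs, so this is an assumed, standard formulation.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` after scaling it so that the mixture has
    the requested speech-to-noise ratio in dB. Generic sketch of the
    database construction described in the text.
    """
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Gain such that p_speech / (gain^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

For a 20 dB target, the noise power after scaling is 100 times smaller than the speech power, matching the "20 dB SNR" sets of the experiments.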
Table 1 presents the main results of the active speech vs. inactive speech classification for noisy speech. In Table 1, the number of prototypes used for active and inactive speech is indicated when appropriate (e.g. 100+50), and the number of neurons is also indicated for the Neural Network classifiers. It was found that the performance of the Neural Network classifiers was significantly less sensitive to the number of neurons than the performance of the other classifiers was to the number of prototypes. This is therefore a first aspect in which the performance of the Neural Network classifiers was more robust. From Table 1, when the testing data was clean speech the Quadratic Gaussian classifier produced the best performance with a 2.5% total error rate, followed by the Neural Network classifiers trained with the LM and EKF algorithms, with 4.62% and 5% total error rates, respectively. However, the performance of the Quadratic Gaussian classifier severely degraded when more realistic 20 dB SNR noisy speech was used. In this case the Neural Network classifiers trained with LM and EKF produced a much better performance than all the other algorithms (5.82% and 5.85% total error rates, compared to 9.18% for the next best classifier). For 10 dB SNR noisy speech, the Neural Network classifiers again outperformed all the other types of classifiers, with 15.46% and 17.09% total error rates for the LM- and EKF-trained networks, while the next best method had a 20.36% total error rate. Since the Neural Network classifiers had a better performance in our experiments with finite-SNR noisy speech, and considering that their complexity is lower (as shown in Section 4), they are an interesting choice for practical implementations. It should be noted that the use of an advanced training algorithm (LM or EKF) was required for the Neural Network classifiers to produce a good performance; only poor results were obtained using the basic Back Propagation (BP) training algorithm. The main problem with the BP algorithm was not its slow convergence speed, but mostly the local minima found by the algorithm.

4. COMPARISON OF THE COMPUTATIONAL COMPLEXITY

The computational complexity of the different classifiers during the testing phase (i.e. the phase that matters in practice) is presented in Table 2. For the linear classifier, a single dot product X^T·W has to be computed, where X and W are (M+1)-by-1 vectors (M is the number of features used; there is an extra dimension for the bias in the input vector). The Nearest Neighbor classifiers need to calculate the distances from the input vector to all the centroids, and then compare them. Using the Euclidean distance, the computation becomes mainly two dot products of size M for each centroid. In Table 2, N denotes the total number of centroids used by the Nearest Neighbor or Quadratic Gaussian classifiers. The discriminant function of the Quadratic Gaussian classifier is f(x|ω_i)p(ω_i), and most of the computation is spent calculating the probability f(x|ω_i). Assuming that the mean vector and covariance matrix of each prototype and the p(ω_i) probabilities are pre-calculated and stored in memory (which may require a very significant amount of memory), the main complexity comes from a matrix product, which is proportional to M² computations. This computation must be done for each centroid, thus N times. For the Neural Network classifiers, in the testing phase a linear combination is computed in each neuron, followed by an activation function. Assuming that all the hyperbolic tangent activation functions of the hidden layer can be implemented as a single look-up table, most of the computational complexity comes from the linear combinations, which have size (M+1) for each of the n1 neurons of the hidden layer (including a bias weight) and size (n1+1) for the neurons of the output layer (a single output neuron was used in this paper). Table 2 also provides an example of the complexity with the values of M, N and n1 used for the results of Section 3. It is clear from this table that the complexity of the Quadratic Gaussian method is the highest, much higher than for the Nearest Neighbor and Neural Network classifiers. The Neural Network classifiers also have a significantly lower complexity than the Nearest Neighbor classifiers.
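The per-frame multiplication counts of Table 2 follow directly from these formulas. The helper below is a small sketch that reproduces the example column using the Section 3 values (M = 13 features, N = 150 centroids for 1-NN, N = 80 for the Quadratic Gaussian classifier, n1 = 20 hidden neurons).

```python
def multiply_counts(M, n_nn, n_qg, n1):
    """Multiplications per classified frame, from the Table 2 formulas."""
    return {
        "linear": M + 1,                    # one dot product incl. bias
        "1-NN": (2 * M + 1) * n_nn,         # two size-M dot products/centroid
        "quad_gaussian": n_qg * (M * M + M + 3),
        "neural_net": M * n1 + 2 * n1 + 1,  # hidden layer + output neuron
    }
```

Evaluating it with the Section 3 values gives 14, 4050, 14800 and 301 multiplications, matching Table 2.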
Except for the linear classifier (which produced a very weak performance), the computational load of the Neural Network classifiers is the lowest.

5. CONCLUSION AND FUTURE WORK

In this paper, the raw classification (i.e. with no hangover mechanism) of active and inactive segments in noisy speech was explored, using classical classifiers such as linear classifiers, Nearest Neighbor classifiers and Quadratic Gaussian classifiers, and using Neural Network classifiers trained with advanced algorithms such as the Levenberg-Marquardt algorithm and the Extended Kalman Filter algorithm. The experimental results presented in the paper illustrate that the Neural Network classifiers obtained with one of the advanced training algorithms typically produced the best and most robust performance for the classification of active speech vs. inactive speech in noisy speech. The computational load of the Neural Network classifiers during the testing phase was also shown to be significantly less than that of the other non-linear classifiers. Neural Network classifiers may therefore be a good choice for the core component of a noisy speech classifier, which would typically also include a hangover mechanism and possibly a speech enhancement algorithm. Future work could include classification with a bias to further minimize the misclassification of active speech (i.e. clipping), a comparison with other machine learning algorithms such as Support Vector Machines, classification of speech under noisier conditions, and a comparison of Neural Network classifiers combined with hangover mechanisms and speech enhancement against standard classifiers using such mechanisms.

TABLE 1
THE EXPERIMENTAL RESULTS FOR THE CLASSIFICATION OF ACTIVE SPEECH VS. INACTIVE SPEECH SEGMENTS IN NOISY SPEECH

Classifier                   | Testing set | Error, active | Error, inactive | Weighted total
                             | (SNR)       | speech (%)    | speech (%)      | error rate (%)
Linear (Least-Squares)       | clean       | 23.05         | 0.43            | 9.48
                             | 20 dB       | 23.23         | 1.39            | 10.13
                             | 10 dB       | 13.30         | 39.48           | 29.01
Nearest Neighbor             | clean       | 18.80         | 2.96            | 9.30
(100+50 prototypes, 1-NN)    | 20 dB       | 22.32         | 16.08           | 18.12
                             | 10 dB       | 21.93         | 22.83           | 22.93
Nearest Neighbor             | clean       | 20.30         | 2.66            | 9.72
(75+50 prototypes, 3-NN)     | 20 dB       | 22.43         | 1.47            | 9.86
                             | 10 dB       | 14.22         | 27.62           | 22.26
Quadratic Gaussian           | clean       | 2.80          | 2.30            | 2.50
(50+30 prototypes)           | 20 dB       | 7.38          | 10.37           | 9.18
                             | 10 dB       | 14.70         | 24.13           | 20.36
Neural Network               | clean       | 5.85          | 3.80            | 4.62
(20 neurons with LM)         | 20 dB       | 12.27         | 1.57            | 5.82
                             | 10 dB       | 19.37         | 12.85           | 15.46
Neural Network               | clean       | 6.50          | 4.00            | 5.00
(20 neurons with EKF)        | 20 dB       | 12.48         | 1.42            | 5.85
                             | 10 dB       | 18.26         | 16.31           | 17.09

TABLE 2
COMPARISON OF THE COMPUTATIONAL COMPLEXITY FOR THE DIFFERENT CLASSIFIERS

Classifier           | Computational complexity         | Example (Section 3 values)
Linear classifier    | multiplies: M+1                  | 14
                     | additions: M                     | 13
                     | others: one comparison           | 1
Nearest Neighbor     | multiplies: (2M+1)N              | 4050
classifier (1-NN)    | additions: (2M-1)N               | 3750
                     | others: N number comparisons     | 150
Quadratic Gaussian   | multiplies: N(M²+M+3)            | 14800
classifier           | additions: N(M²+M-1)             | 14480
                     | others: N number comparisons,    | 80
                     |   N look-up table searches       | 80
Neural Network       | multiplies: M·n1+2·n1+1          | 301
                     | additions: (M+1)·n1              | 280

6. REFERENCES

[1] ITU-T G.729 Annex B, "A Silence Compression Scheme for G.729 Optimized for Terminals Conforming to Recommendation V.70", International Telecommunication Union, 1996
[2] ETSI GSM 06.32, "Full rate speech; VAD for full rate speech traffic channel", European Telecommunications Standards Institute, 1998
[3] J. Thyssen et al., "A Candidate for the ITU-T 4 kbit/s Speech Coding Standard", ICASSP 2001, Vol. 2, pages 681-684, Salt Lake City, USA, May 2001
[4] G. Ruggeri, F. Beritelli and S. Casale, "Hybrid Multi-mode/Multi-rate CS-ACELP Speech Coding for Adaptive Voice over IP", ICASSP 2001, Vol. 2, pages 733-736, Salt Lake City, USA, May 2001
[5] K. El-Maleh, A. Samouelian and P. Kabal, "Frame-Level Noise Classification in Mobile Environments", ICASSP 1999, Vol. 2, pages 237-240, Phoenix, USA, March 1999
[6] T.G. Crippa et al., "A Fast Neural Network Training Algorithm and its Application to Voiced-Unvoiced-Silence Classification of Speech", ICASSP 1991, Vol. 1, pages 441-447, Toronto, Canada, May 1991
[7] J. Ikedo, "Voice Activity Detection Using Neural Network", IEICE Trans. on Communications, Vol. E81-B, pages 2509-2513, Dec. 1998
[8] B. Widrow and S. D. Stearns, Adaptive Signal Processing, Prentice Hall, 1985
[9] E. Micheli-Tzanakou, Supervised and Unsupervised Pattern Recognition: Feature Extraction and Computational Intelligence, CRC Press, 1999
[10] J. Makhoul, S. Roucos and H. Gish, "Vector Quantization in Speech Coding", Proceedings of the IEEE, Vol. 73, pages 1551-1588, Nov. 1985
[11] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd edition, Prentice-Hall, 1999
[12] Matlab™ Neural Network Toolbox Reference Guide, The MathWorks Inc., 1995
[13] S. Singhal and L. Wu, "Training Feed-Forward Networks with the Extended Kalman Algorithm", ICASSP 1989, Vol. 2, pages 1187-1190, Glasgow, Scotland, May 1989