
Chapter 1

Introduction
1.1 Preamble
Speech processing is an interdisciplinary field that studies acoustic signals using both signal
processing techniques and knowledge from linguistics. An efficient speech recognition system
needs both speech enhancement and speech feature extraction methods to handle clean and noisy
data. Error-free recognition of noisy speech has remained an unsolved goal over the years,
whereas recognition of clean speech is considerably more mature. Speech enhancement focuses
on estimating optimal parameters by reducing the noise in, or enhancing the speech component
of, a noisy signal. Speech recognition systems normally work well in controlled environments
but perform poorly in noisy ones. This work addresses the handling of clean and noisy speech
signals using wavelet and fuzzy techniques, proposing improved feature enhancement and
extraction algorithms to raise recognition performance. The proposed algorithms are verified on
speech recognition systems for both clean and noisy speech signals.

1.2 Speech Recognition Systems and Noise:


Speech recognition systems have evolved out of the need for systems that operate on speech
input, but much remains to be developed in this direction. To date, speech recognition research
has largely treated recognition as a subjective phenomenon. Most speech recognition systems
have been designed to work in controlled environments using clean speech, and such systems
suffer performance degradation under noisy conditions. Well-known problems such as speaker
variation, background noise and the continuous character of speech degrade the performance of
recognition systems on noisy speech data.

If a recognition system is used in a noisy environment, it must be robust to many different types
and levels of noise as well as to changes in the speaker's voice. Noise is categorized as either
additive or convolutive. Noise vectors contaminate the speech signal, changing the data vectors
that represent the speech. Noise can be introduced by adding various types of background or
environmental noise, such as babble, street or car noise, or the hum of telephones and fans
picked up during recording. Variability can also be introduced by changes of speaker. Changes
in the speaker's voice are caused by modifications of the articulation parameters; the main
variations are increases in pitch, amplitude, vowel duration and spectral tilt, as well as shifts in
the formant frequencies F1 and F2. However, these changes are by no means constant, even for
the same speaker under similar noise conditions. Such distortions are difficult to remove or to
model, so efficient techniques such as wavelets and fuzzy logic are needed to handle them in the
feature extraction process. The sketch below illustrates the two noise categories.
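
To make the distinction concrete, the following is a minimal sketch (using NumPy only; the
array names, the toy impulse response and the 10 dB target are illustrative and not taken from
this thesis) of how additive noise at a target SNR and convolutive noise can be simulated:

    import numpy as np

    def add_noise_at_snr(speech, noise, snr_db):
        """Mix a noise recording into speech at a target SNR (additive noise)."""
        noise = noise[:len(speech)]                     # trim noise to the speech length
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
        return speech + scale * noise

    def convolutive_noise(speech, impulse_response):
        """Pass speech through a channel/room impulse response (convolutive noise)."""
        return np.convolve(speech, impulse_response, mode='full')[:len(speech)]

    # Illustrative usage with synthetic stand-in signals (hypothetical data).
    fs = 16000
    speech = np.random.randn(fs)             # stand-in for one second of speech
    babble = np.random.randn(fs)             # stand-in for a babble-noise recording
    h = np.array([1.0, 0.0, 0.4, 0.0, 0.2])  # toy channel impulse response

    noisy_additive = add_noise_at_snr(speech, babble, snr_db=10)
    noisy_convolutive = convolutive_noise(speech, h)
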
The wavelet transform is well proven for handling non-stationary signals, and algorithms
derived from wavelet theory have become standard tools in digital signal processing. Alongside
wavelets, soft computing (SC) techniques, notably fuzzy logic, have been identified as a
powerful foundation for designing and developing intelligent systems that provide feasible
solutions with better features. Hence these two techniques are adopted in this thesis for
extracting the features of speech signals. The next section presents a detailed review of speech
feature extraction techniques.

1.3 Literature Review


Feature extraction is one of the most important phases in a speech recognition system [1,2,3,4].
Extracting the features of speech signals under adverse conditions and speaker variability is a
tedious task, and the feature extraction procedure plays a major role in obtaining features that
yield good recognition accuracy. If this phase is designed and pursued appropriately, it results in
efficient recognition models with lower computation times.
This section presents a detailed review of work reported in the literature on speech signal
enhancement as well as feature extraction. Speech recognition has been widely researched over
the decades and many outstanding contributions have been reported.
The previous works are presented in the following order: i) works related to continuous wavelets
for speech recognition, ii) applications of fuzzy techniques to speech recognition, and iii) hybrid
approaches to speech recognition.
Many authors have reviewed hybrid feature extraction for speech recognition alongside the
conventional methods. Prithvi [P. Prithvi et al., 2015 (5)] presented a theoretical analysis of the
conventional feature extraction methods, namely Mel Frequency Cepstral Coefficients (MFCC),
Linear Prediction Cepstral Coefficients (LPCC) and Relative Spectra-Perceptual Linear
Prediction (RASTA-PLP), with their pros and cons for speech recognition. The study highlights
that MFCC features perform well if the codebook size is small and give smaller spectral
estimation errors for high-order MFCC coefficients than RASTA-PLP, and that RASTA-PLP is
more suitable for handling noisy signals. Dona Vargese [Dona Vargese, Dominic Mathew, 2016
(6)] proposed a new reservoir computing model to classify speech features extracted with
MFCC and RASTA-PLP on the TIMIT dataset for phoneme classification. Accuracies of 91.5%
and 98.5% were reported for the first and second layers of the reservoir model when MFCC
features are used, compared with RASTA-PLP features.
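
As a point of reference for the MFCC features discussed above, the following is a minimal
sketch (assuming the librosa library; the file path and parameter values are illustrative and not
taken from the cited papers) of extracting 13 MFCC coefficients from a waveform:

    import librosa

    # Load a speech file (the path is hypothetical) at a 16 kHz sampling rate.
    y, sr = librosa.load("speech_sample.wav", sr=16000)

    # 13 MFCCs per frame, using a 25 ms window and 10 ms hop (typical ASR settings).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)

    print(mfcc.shape)   # (13, number_of_frames)
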
Mohamed, Lotif [Mohamed, Lotif et al., 2017 (7)] proposed hybrid algorithms using MFCC,
PLP, RASTA, RASTA-PLP and VQLBG with wavelet and PCA techniques for an Arabic
dataset. A neural network classifier is applied for a speaker-dependent speech recognition
system. They present various combinations of algorithms such as MFCC, RASTA-PLP,
Delta-MFCC and RASTA-VQLBG; in total eight methods are combined in various ways to
extract the speech features. The combination of MFCC, delta-MFCC and VQLBG shows the
best performance, obtaining an error rate of 1.41, which is far lower than the other combinations
of feature extraction algorithms. In the same year, Veton Z. Këpuska [Veton Z. Këpuska,
Hussien A. Elharati, 2017 (8)] proposed hybrid feature extraction algorithms to extract features
from clean and additive-noise speech signals. Four conventional feature extraction methods,
namely MFCC, LPCC, RASTA and PLP, were hybridized to extract clean and noisy features.
The conventional methods were integrated in various combinations and the obtained features
were modeled using a multivariate Hidden Markov Model. The performance of the different
hybrid methods on the TI digits dataset is presented by varying the number of states and
Gaussian mixtures of the HMM. The results are tabulated for 5, 10 and 30 dB additive Gaussian
noise, with recognition performance varying between 94% and 97% for the MFCC, LPCC, PLP
and RASTA combinations. In 2019, authors [Wafa Helali et al., 2019 (9)] proposed perceptual
wavelet packets, Gammatone filters and amplitude modulation filters to extract features. These
techniques were hybridized with the traditional MFCC, RASTA-PLP and PLP methods, and an
HMM classifier was applied to model the features on the TIMIT dataset for clean and
additive-noise data. A recognition accuracy of 100% is reported for clean signals, and 98.8% at
15 dB and 92.8% at -15 dB additive noise when MFCC, Gammatone filters and amplitude
modulation filters are combined.
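
Many of the hybrid combinations above amount to computing several feature streams per frame
and stacking them. A minimal sketch of one such combination, static MFCCs stacked with their
first- and second-order derivatives (again assuming librosa with a hypothetical file path; this
illustrates feature stacking in general, not any specific cited method):

    import numpy as np
    import librosa

    y, sr = librosa.load("speech_sample.wav", sr=16000)   # hypothetical file

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    delta = librosa.feature.delta(mfcc)             # first-order derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)   # second-order derivatives

    # Stack the streams frame-wise: each column is a 39-dimensional hybrid vector.
    hybrid = np.vstack([mfcc, delta, delta2])
    print(hybrid.shape)   # (39, number_of_frames)
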
Feature extraction algorithms normally use the Discrete Cosine Transform (DCT) and Fast
Fourier Transform (FFT). The Hilbert-Huang Transform (HHT) has also been widely used for
speaker recognition applications (10-15). [Nordon, E. Huang et al., 1995; Huang, H., and Pan, J.,
2006; Liwei Liu, Feng Qian et al., 2012; Md Khademul Islam Molla et al., 2012; Rudramurthy
et al., 2013; Henna et al., 2017] discuss the use of HHT, in which Empirical Mode
Decomposition (EMD) decomposes the signal into Intrinsic Mode Functions (IMFs), for speaker
recognition. The authors state that an improper choice of IMFs yields lower filtering efficiency
and recognition accuracy. This is due to a trade-off between under-sifting (producing incorrectly
defined IMFs because too few sifts are applied to reveal all the riding waves) and over-sifting
(producing less physically meaningful IMFs). All these papers apply HHT to speaker
recognition only. A minimal illustration of EMD is sketched below.
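
The sketch assumes the third-party PyEMD (EMD-signal) package; the synthetic two-component
signal stands in for a speech frame and is illustrative only:

    import numpy as np
    from PyEMD import EMD   # provided by the pip package "EMD-signal"

    fs = 1000
    t = np.arange(0, 1, 1.0 / fs)
    # Synthetic two-component signal standing in for a short speech segment.
    signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 5 * t)

    emd = EMD()
    imfs = emd.emd(signal)   # each row is one Intrinsic Mode Function (IMF)

    print(imfs.shape)        # (number_of_IMFs, len(signal))
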
An extensive review of wavelet-based speech feature extraction shows that discrete wavelets
such as db11, Coiflet, Biorthogonal, Symlet and Haar [R. Coifman, 1995; Ayat, 2004; Slavy G.,
V. Balakrishnan et al., 2006; Mihov, 2009; M. A. Oktar et al., 2016; R. Hidayat et al., 2018]
have been extensively used to design feature extraction methods. However, continuous wavelets
have certain advantages: i) they possess a good scaling property, from which better localized
features can be extracted, and ii) they can be applied over the whole signal, in contrast to the
discrete wavelet transform, as the sketch below illustrates.
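
The contrast between the two transforms can be sketched with the PyWavelets library (a
minimal illustration on a synthetic signal; the wavelet names and scale range are illustrative):

    import numpy as np
    import pywt

    fs = 8000
    t = np.arange(0, 0.5, 1.0 / fs)
    x = np.sin(2 * np.pi * 300 * t) + 0.3 * np.random.randn(len(t))  # toy signal

    # Discrete wavelet transform: a fixed dyadic set of approximation/detail coefficients.
    dwt_coeffs = pywt.wavedec(x, 'db11', level=4)

    # Continuous wavelet transform: coefficients at any chosen scales over the whole signal.
    scales = np.arange(1, 64)
    cwt_coeffs, freqs = pywt.cwt(x, scales, 'morl', sampling_period=1.0 / fs)

    print(len(dwt_coeffs), cwt_coeffs.shape)   # 5 coefficient arrays vs. a (63, len(x)) scalogram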

Where traditional filtering algorithms such as spectral subtraction, Wiener filtering, and
time-varying speech model-based or state-based methods fail to handle noisy speech signals,
Yuan [Yuan, 2003] proposed the adaptive Bionic Wavelet Transform. The algorithm handles
various types of additive noise and is evaluated in two ways: i) by segmental signal-to-noise
ratio (SSNR) and ii) by signal-to-noise ratio (SNR), using Morlet as the mother wavelet.
Experiments were conducted on the TIMIT dataset and the results of the Bionic wavelet are
tabulated against other thresholding methods. The risk thresholding function, when used with
the adaptive Bionic wavelet, consistently gives the best performance as the SNR increases.
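
These enhancement schemes follow a common decompose-threshold-reconstruct pattern. A
minimal sketch of that pattern with plain soft thresholding and the universal threshold is given
below (assuming PyWavelets; this is a generic illustration, not the Bionic wavelet or risk
thresholding method itself, and the wavelet choice and synthetic data are assumptions):

    import numpy as np
    import pywt

    def wavelet_denoise(noisy, wavelet='db8', level=4):
        """Soft-threshold the detail coefficients with the universal threshold."""
        coeffs = pywt.wavedec(noisy, wavelet, level=level)
        # Estimate the noise level from the finest detail band (median absolute deviation).
        sigma = np.median(np.abs(coeffs[-1])) / 0.6745
        thr = sigma * np.sqrt(2 * np.log(len(noisy)))          # universal threshold
        denoised_coeffs = [coeffs[0]] + [
            pywt.threshold(c, value=thr, mode='soft') for c in coeffs[1:]
        ]
        return pywt.waverec(denoised_coeffs, wavelet)[:len(noisy)]

    # Illustrative usage with a synthetic noisy signal (hypothetical data).
    fs = 8000
    t = np.arange(0, 1, 1.0 / fs)
    clean = np.sin(2 * np.pi * 200 * t)
    noisy = clean + 0.3 * np.random.randn(len(t))
    enhanced = wavelet_denoise(noisy)
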
An extension of this work [Yuan, 2003 (23,24)] was proposed by [Mourad Talbi, 2009 (25)],
using a Recurrent Neural Network to classify adaptive Bionic features. The features are
enhanced using an Elman filter together with the Bionic wavelet, and the results are compared
with spectral subtraction, the Bionic wavelet alone, and the Elman filter adapted to the Bionic
wavelet for thresholding. The Elman adaptive Bionic wavelet improves the SNR compared with
spectral subtraction and the plain Bionic wavelet; the improvement is larger for additive white
noise and Volvo noise than for F16 noise. Authors [Rajkumar, Angamba Singh, K. Pritamdas,
2016 (26)] proposed a modified speech enhancement algorithm based on the continuous wavelet
transform, using the Morlet wavelet as the mother wavelet. The paper reports results for additive
Gaussian noise as well as babble, airport and car noise from the NOIZEUS dataset. The
continuous adaptive Bionic wavelet is compared with the traditional DWT at high noise levels,
i.e. 0 dB input SNR, and a range of scales is fixed for different SNRs to enhance the signals.
Experiments are conducted for SNRs varying from 0 to 30 dB, and an improvement in SNR is
observed when the continuous wavelet is applied instead of the discrete wavelet packet. In 2018,
[Hangting Chen, Pengyuan Zhang, 2018 (27,28)] proposed a deep convolutional neural network
classifier with scalograms for audio scene modeling, using Morlet wavelets to extract features
from the DCASE dataset. An approach to learning based on the short-time Fourier transform
and hand-tailored filters is presented, and the scalogram features are compared with classical
Mel energies; an improved accuracy of 90.5% is observed for the multi-scale features of
continuous Morlet wavelets.
Fuzzy logic has evolved into a prominent method for developing speech recognition systems.
Authors [Ramin Halavati, Saeed Bagheri Shouraki et al., 2006 (28,29)] propose a fuzzy
approach to phoneme recognition on the TIMIT database. The fuzzy modeling approach ignores
noise instead of reducing or removing it: the speech features are extracted by converting the
speech spectrogram into a fuzzy linguistic description instead of precise acoustic features, and
phonemes are defined by linguistic terms. A genetic algorithm is used as the classifier to find
appropriate definitions for the phonemes, and the results are compared with a Hidden Markov
Model. The GA method gives better results in noisy environments. Each test is repeated five
times with six different noise levels, i.e. clean, 30, 20, 10, 0 and 10 dB SNR. The fuzzy approach
shows 20-28% more immunity against noise in normally noisy environments (SNR 30 to 10 dB);
this immunity decreases as the noise level approaches amounts that make the input unidentifiable
even for human beings, but it always remains above that of HMM-MFCC.
[Ingjr Ding, 2013 (30)] proposed speech recognition using variable-length frame overlaps
controlled by a fuzzy logic control (FLC) mechanism, which determines a variable-length
overlap between two consecutive frames. The FLC is devised to regulate the frame overlap size,
and the fuzzy-overlapped features are compared against LPC, LPCC and MFCC test cases. The
proposed variable-length frame overlapping is observed to be superior to fixed-length
overlapping. The fuzzy overlapping is applied with a 2 ms decrement of the overlap for all
classes, and an adaptive learning approach is used to derive accurate acoustic features. These
variable-frame features are experimented with using the conventional LPC, LPCC and MFCC
approaches; frame analysis with MFCC yields 99.33% when modeled with an HMM,
outperforming the LPC and LPCC methods.
In 2014, authors [Seyed Mostafa Mirhassani, Hua-Nong Ting, 2014 (31)] proposed fuzzy-based
discriminative and complementary feature extraction and selection procedures for a
speaker-independent Malay vowel phoneme recognition system for children's speech.
Fuzzification is applied using discriminative criteria, fuzzy codification and fuzzy aggregation
to extract and select the optimal features from MFCC. The obtained features are compared with
other feature selection methods such as Sequential Forward Floating Search (SFFS) and
Sequential Backward Floating Selection (SBFS), and are classified using a Multi-Layer
Perceptron (MLP) and a Hidden Markov Model (HMM) for phoneme recognition. An improved
recognition accuracy of 95.28% is observed for the fuzzy features when classified with the
HMM, whereas the MLP classifier gives 93.14%.
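
The fuzzy feature ideas above rest on replacing precise measurements with degrees of
membership in linguistic terms. A minimal sketch in plain NumPy follows, with illustrative
triangular membership functions over normalized frame energy; the break-points and values are
arbitrary assumptions, not taken from the cited papers:

    import numpy as np

    def trimf(x, a, b, c):
        """Triangular membership function rising from a to b and falling from b to c."""
        return np.maximum(np.minimum((x - a) / (b - a + 1e-12),
                                     (c - x) / (c - b + 1e-12)), 0.0)

    # Normalized frame energies (hypothetical values in [0, 1]).
    energy = np.array([0.05, 0.30, 0.75])

    membership = {
        'LOW':    trimf(energy, -0.01, 0.0, 0.4),
        'MEDIUM': trimf(energy, 0.2, 0.5, 0.8),
        'HIGH':   trimf(energy, 0.6, 1.0, 1.01),
    }
    # Each frame is now described by degrees of membership rather than a crisp value.
    for term, mu in membership.items():
        print(term, np.round(mu, 2))
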
Amane [Amane Taleeb, 2012 (32)] proposed speech recognition with a neuro-fuzzy ANFIS
network and genetic algorithms on the TIMIT speech database. The learning algorithm is
applied to TIMIT to extract the MFCC coefficients, while a genetic algorithm minimizes the
number of input parameters of the ANFIS and the classification error rate by determining the
optimal parameters. The work is oriented towards continuous speech recognition in noise-free
conditions, and it is observed that the GA with ANFIS yields better results with a one-point
crossover technique. Authors [Lubna Eljawad, Rami Aljamaeen, 2019 (33)] proposed speech
recognition for the Arabic language using an adaptive neuro-fuzzy inference system. The process
involves preprocessing (DC level removal and resizing), feature extraction, and MLP and fuzzy
logic classifiers. Two intelligent recognizers are constructed, a fuzzy logic recognizer and a
neural network recognizer, and are used to study the recognition ability of the MLP and the
fuzzy logic system for the Arabic and English languages. Testing is performed on male and
female speakers using cross-validation, and the recognition accuracies of the two recognizers are
compared. An improved accuracy of 94.5% is obtained for the MLP compared with 77% for the
Sugeno fuzzy inference model, and the authors suggest combining the two intelligent
recognizers to increase the recognition accuracy. Authors [Samiya Silarbi, Bendahmane et al.,
2014 (34)] proposed ANFIS for phoneme recognition using the TIMIT database. Pre-processing
and feature extraction are carried out for MFCC parameters, and the network is learned using
subtractive clustering to define an optimal structure with a small number of rules. Hybrid
learning using gradient descent and least-squares estimation is used to find a feasible set of
parameters. A recognition accuracy of 100% is obtained for 6 vowels, 6 fricatives and 6 plosive
phonemes.
[Sankar K. Pal, 1992 (36)] developed a linguistic recognition system based on approximate
reasoning for handling various imprecise input patterns. Fuzzy rules are used to design a fuzzy
inference system for the classification of phonemes, and the model obtains 80% recognition
accuracy. The authors consider only three linguistic properties, SMALL, MEDIUM and HIGH,
and suggest using linguistic hedges such as 'very small' and 'more or less small' to reduce the
impreciseness of the input linguistic information, as sketched below.
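
A common formalization of such hedges, assumed here as the standard textbook convention
rather than taken from Pal's paper, is concentration for 'very' (squaring the membership value)
and dilation for 'more or less' (taking its square root):

    import numpy as np

    def very(mu):
        """Concentration hedge: membership of 'very SMALL' from that of SMALL."""
        return mu ** 2

    def more_or_less(mu):
        """Dilation hedge: membership of 'more or less SMALL' from that of SMALL."""
        return np.sqrt(mu)

    # Membership degrees of some feature values in the term SMALL (hypothetical).
    mu_small = np.array([0.9, 0.5, 0.1])

    print(very(mu_small))          # squared memberships: 0.81, 0.25, 0.01 (sharper term)
    print(more_or_less(mu_small))  # square-root memberships: about 0.95, 0.71, 0.32 (broader term)
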
Extending this, [Cetisli, B., 2010 (37)] proposed a neuro-fuzzy classifier using linguistic hedges
(LH) to classify non-linear data such as the Pima Indian Diabetes and spam e-mail datasets,
using the hedges to handle class overlaps. LH has not been applied to speech processing
applications.

Observations:
• The literature review shows that hybrid models have the potential to deliver improved
accuracy in speech recognition applications; however, these models have not been tried with
convolutive noise.
• Continuous wavelets have been experimented with and shown to produce improved accuracy,
specifically with additive noise. Classification is achieved using well-known methods, and
accuracies of up to 90.5% are reported in the works above. They have not been tried on
convolutive noise.
• The LH concept has been applied to the Pima Indian Diabetes and spam e-mail datasets.
ANFIS has been proposed for phoneme recognition but not for word recognition with noisy
data. Fuzzy logic has been used to ignore noise rather than remove it, and fuzzy tools are mainly
used for feature extraction and selection.

1.4 Motivation:
From the literature it is evident that previous methods for hybrid feature extraction, speech
enhancement and classification using wavelet and fuzzy techniques have not been analyzed for
convolutive noisy speech signals across various types of noise. The study of speech recognition
under such degraded conditions is a difficult problem and therefore an interesting area, involving
challenges such as ambiguity, impreciseness and incompleteness in the speech data. These
challenges motivated us to conduct this research and propose novel algorithms for feature
enhancement and extraction of speech signals using fuzzy and wavelet techniques.

1.5 Challenges in Speech Recognition:


Most contemporary speech recognition systems provide excellent recognition performance
under matched conditions, i.e. when similar conditions are maintained during the training and
testing phases of the speech recognition system. Several factors lead to a mismatch between the
training and testing sessions. Some of the critical challenges addressed in this work that degrade
the performance of a speech recognition system are as follows:
• The speaking style
• The linguistic content of the task
• The environmental noise

1.6 Problem Definition:


It is proposed to enhance and extract features from the speech signal using new methods based
on hybrid, wavelet and fuzzy approaches, with the objective of improving the recognition
accuracy.

1.7 Objective:
The main objective is to propose feature extraction algorithms that use MFCC as the base model
together with wavelet and fuzzy techniques. Hybrid techniques and continuous wavelets are
used to enhance and extract speech features that increase the recognition accuracy. Fuzzy-based
framing and linguistic hedges are proposed and experimented with on various homogeneous and
heterogeneous data.

1.8 Organization of the thesis:


The outline of the thesis is as follows:

Chapter 1 reviews state-of-the-art methods for feature extraction using fuzzy and wavelet
techniques, along with the motivation, objectives and challenges of speech recognition systems.
A general introduction to Automatic Speech Recognition is given in Chapter 2, together with the
relevant issues and designs of ASR systems. In Chapter 3 we present hybrid and Hilbert-Huang
Transform methods with wavelets to extract and cluster speech features using various clustering
algorithms.

Feature extraction algorithms using the adaptive Bionic and continuous perceptual Morlet
wavelets, which enhance and extract speech features using thresholding functions, are discussed
in Chapter 4.

In Chapter 5 we present fuzzy framing and a linguistic hedge classifier to extract and classify
features for homogeneous and heterogeneous datasets. Chapter 6 presents the conclusions as
well as suggestions for future work.
