
Speech Processing & Recognition

Internship Project Report

by

Sangeet Sagar
(Dept. of ECE, The LNMIIT Jaipur)

Department of ECE
Birla Institute of Technology
Mesra, Ranchi
July-2017
BONAFIDE CERTIFICATE

This is to certify that this project entitled “SPEECH PROCESSING & RECOGNITION”, submitted to Birla Institute of Technology, Mesra, Ranchi, is a bonafide record of the work done by “Sangeet Sagar” under my supervision from 11th May to 8th July.

Place-

Date -
Declaration by Author

This is to declare that this report has been written by me. No part of the report is plagiarized from other sources; all information drawn from other sources has been duly acknowledged. I aver that if any part of the report is found to be plagiarized, I will take full responsibility for it.

Sangeet Sagar
Roll- 15uec053
(Dept. of ECE, The LNMIIT Jaipur)
Table of Contents

Introduction
Pre-Processing
Feature Extraction
    Mel-Frequency Cepstrum Coefficients (MFCC)
    Linear Predictive Coding (LPC)
Speech Classification
    Artificial Neural Networks (ANN)
Introduction to Accent Recognition System
Database Preparation
Advanced Feature Extraction Techniques
    Delta-Cepstral Coefficients (Delta-MFCC)
    Delta-Spectral Cepstral Features (Double-delta-MFCC)
Results and Outcomes (Classifier-ANN)
Classification based on SVM
Results and Outcomes (Classifier-SVM)
Conclusion
Introduction

Speech processing is an important research area that spans speaker recognition, speech synthesis, speech coding, and speech noise reduction. Speech recognition is one of the fastest-growing engineering technologies; it has applications in many areas and offers substantial potential benefits. Speech recognition usually involves extracting features from the speech signal and representing them with an appropriate data model.

During this internship I completed a mini project on “Isolated word recognition”, using Mel-frequency cepstral coefficients (MFCC) and Linear Predictive Coding (LPC) as feature extraction techniques and an Artificial Neural Network (ANN) as the classifier, and a final project on “Accent recognition for Hindi and Bengali speech signals”, using MFCC, delta-MFCC, and double-delta-MFCC as feature extraction techniques and an ANN as the classifier. The performance of automatic speech recognition (ASR) systems can be improved if the speaker’s accent or dialect is detected before the recognition of speech, by adapting suitable ASR acoustic and/or language models.

1. Pre-Processing
Preprocessing of speech signals is considered a crucial step in the development of a
robust and efficient speech or speaker recognition system. The general preprocessing
pipeline is depicted in the following figure.

x(t) → [Sampling] → x[n] → [Windowing and frame formation] → x_w[n] → [Noise removal] → x'_w → [Feature extraction]

Fig 1: General steps of the preprocessing stage

1.1 Sampling
Before a computer can process a speech signal, the signal must be digitized. The time-continuous speech signal is therefore sampled and quantized, yielding a signal that is discrete in both time and value.

1.2 Windowing and frame formation


To obtain information about each partial signal, we multiply the speech signal by a windowing function (assuming the signal behaves stationarily within those time frames). This windowing function weights the signal in the time domain and divides it into a sequence of partial signals.

1.3 Noise Removal


Microphone-related noise and electrical noise can be compensated for fairly easily by training the speech recognizers on correspondingly noisy speech samples. The basic problem of noise reduction is to reduce the external noise without disturbing the unvoiced, low-intensity, noise-like components of the speech signal itself.

This operation was performed in MATLAB: noise removal was applied to a recorded speech sample of the word ‘shunya’, and the following was observed.

Fig 2: Noise Removal Technique

2. Feature Extraction

For feature extraction, Mel-frequency cepstral coefficients (MFCC) and the combined features of MFCC and LPC are used. Both techniques are described below:

2.1 Mel-Frequency Cepstrum Coefficients (MFCC)

Human sound perception is nonlinear, and Mel-frequency cepstral coefficients (MFCC) mimic this perceptual behaviour. MFCCs capture both the linear and nonlinear properties of the speech signal.

Step 1: Pre-emphasis - The signal is passed through a filter that emphasizes high frequencies, increasing the energy of the signal at high frequencies, which also carry information. The pre-emphasis filter is:

𝑆(𝑛) = 𝑋(𝑛) − 0.95 ∗ 𝑋(𝑛 − 1)

where S(n) denotes the output sample, X(n) the present sample, and X(n − 1) the previous sample.
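The project itself was implemented in MATLAB; purely as an illustrative sketch (the function name and test signals here are mine, not from the original code), the pre-emphasis filter can be written in plain Python:

```python
def pre_emphasis(x, alpha=0.95):
    """Apply the pre-emphasis filter S(n) = X(n) - alpha * X(n - 1)."""
    # The first sample has no predecessor, so it is passed through unchanged.
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

# A constant (low-frequency) signal is almost entirely suppressed, while a
# rapidly alternating (high-frequency) signal is boosted.
print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
print(pre_emphasis([1.0, -1.0, 1.0, -1.0]))
```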

Step 2: Framing and Overlapping - The speech signal is split into overlapping frames so that each frame can be examined over a short time interval instead of over the entire signal. The frame size is in the range 20-40 ms. A Hamming window is then applied to each frame:

𝑆(𝑛) = 𝑋(𝑛) ∗ 𝑊(𝑛)

𝑊(𝑛) = 0.54 − 0.46 ∗ cos[2𝜋𝑛 / (𝑁 − 1)] ;  0 ≤ 𝑛 ≤ 𝑁 − 1
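Framing with overlap followed by Hamming windowing can be sketched as below (a hypothetical 8 kHz signal with 25 ms frames and a 10 ms hop; all names and parameter values are illustrative, not taken from the project):

```python
import math

def hamming(N):
    """Hamming window: W(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames and apply a Hamming window to each."""
    w = hamming(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        frames.append([s * wn for s, wn in zip(frame, w)])
    return frames

# 8000 Hz sampling: 25 ms frames (200 samples) with a 10 ms hop (80 samples).
signal = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(1600)]
frames = frame_signal(signal, frame_len=200, hop=80)
print(len(frames), "frames of", len(frames[0]), "windowed samples")
```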

Step 3: Framing - The input speech signal is partitioned into frames whose duration is less than the window duration.

Step 4: Fast Fourier Transform - The Fast Fourier Transform (FFT) converts each frame from the time domain to the frequency domain, where the information content of speech is more readily accessible. The FFT is therefore applied to obtain the magnitude frequency response of each frame and to prepare the signal for the next stage:

𝑆(𝜔) = 𝑓𝑓𝑡(𝑋(𝑛))

Step 5: Mel Warping - The human ear’s perception of the frequency content of speech sounds does not follow a linear scale. Therefore, for each tone with an actual frequency f, measured in Hz, the subjective pitch is measured on a scale called the “Mel scale”. The Mel frequency scale is linear below 1000 Hz and logarithmic above 1000 Hz. The Mel value for a given frequency f in Hz is computed with the standard formula:

Mel(f) = 2595 ∗ log10(1 + f / 700)
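The standard Hz-to-Mel mapping, Mel(f) = 2595 · log10(1 + f/700), and its inverse can be sketched in Python (helper names are mine):

```python
import math

def hz_to_mel(f):
    """Convert a frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used when placing Mel filter-bank centre frequencies."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The scale is roughly linear below 1000 Hz and logarithmic above it:
for f in (100, 500, 1000, 2000, 4000):
    print(f, "Hz ->", round(hz_to_mel(f), 1), "Mel")
```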

2.2 Linear Predictive Coding (LPC)

Linear Predictive Coding (LPC) analysis states that a speech sample can be approximated as a linear combination of past speech samples. LPC is based on the source-filter model of speech production:

𝑆̃[𝑛] = ∑_{𝑘=1}^{𝑝} 𝑎_𝑘 𝑠[𝑛 − 𝑘]

The unknowns 𝑎_𝑘, 𝑘 = 1, 2, …, 𝑝 are called the LPC coefficients and can be solved for by the least-squares method.
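One common way to solve this least-squares problem is the Levinson-Durbin recursion on the frame’s autocorrelation; a plain-Python sketch is below (function names are mine, and the AR(1) demo signal is synthetic, not project data):

```python
import random

def autocorr(x, p):
    """Autocorrelation lags r[0..p] of the frame x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(p + 1)]

def lpc(x, p):
    """Levinson-Durbin recursion: LPC coefficients a_1..a_p for order p."""
    r = autocorr(x, p)
    a = [0.0] * (p + 1)          # a[0] is unused; prediction uses a[1..p]
    err = r[0]                   # prediction error energy
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err            # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]                 # [a_1, ..., a_p]

# An AR(1) process s[n] = 0.9 * s[n-1] + e[n] should yield a_1 close to 0.9.
random.seed(0)
s = [0.0]
for _ in range(5000):
    s.append(0.9 * s[-1] + random.gauss(0, 1))
print(lpc(s, 1))
```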

Fig 3: MFCC matrix for five samples of the word "shunya". We take only 13 coefficients, as they contain the majority of the information in the message signal. More than 13 coefficients could be taken, but the extra coefficients are of no use and would also carry undesirable information.

When calculating LPC coefficients we generally take only 11 or 12, and the first coefficient is always 1; taking more than 11-12 coefficients generally picks up undesirable information about the speech signal, such as background noise. To increase the accuracy of the feature extraction process, these LPC coefficients are vertically concatenated (using the vertcat function in MATLAB) with the MFCC matrix to obtain 25 coefficients for each sample.

3. Speech Classification

3.1 Artificial Neural Networks (ANN)


ANNs are, in essence, crude electronic models based on the neural structure of the brain; the human brain learns from experience, and ANNs are computing systems whose architecture is modelled after it. The classifier used in this speech recognition task is a Back-Propagation Neural Network (BPNN). The backpropagation architecture takes as input nodes the features based on the MFCC coefficients, or on the combined MFCC and LPC features.
There are two ways to carry out the classification process in MATLAB: either by typing “nnstart” or by using the command-line functions. The target matrix must be prepared according to the number of samples of the isolated words.

I performed isolated word recognition for 14 speech signals: “zero, one, two, three, four, five, six, seven, eight, nine, add, minus, into, divide” (these speech samples were pronounced and recorded as written, in the controlled environment of a laboratory), with 20 samples of each, using both MFCC and LPC for feature extraction and an ANN as the classifier; the following confusion matrices were obtained.

Fig 4: Confusion matrix for the above speech signals (20 samples per speech signal) with only MFCC as extraction features and ANN as classifier.

Recognition Accuracy: 99.6%

Fig 5: Confusion matrix for the above speech signals (20 samples per speech signal) with both MFCC and LPC as extraction features and ANN as classifier.

Recognition Accuracy: 100%

Conclusion: The experimental results show that using the combination of MFCC and LPC feature extraction gives better results than using MFCC alone: the recognition accuracy is 100% in the former case versus 99.6% in the latter. The recognition accuracy may differ when using only MFCC, only LPC, or the combination of both, as well as with other classification techniques.

4. Introduction to Accent Recognition System


A large amount of useful information, such as the speaker’s gender, age, race, and accent, is embedded within the speech signal. Growing attention is being paid to the development of natural interfaces and effective human-machine communication applications, which must address issues such as the recognition of accent/dialect even when the words are spoken in an explicit manner. The speech signal carries knowledge of the accent within the words, depending on the region to which the speaker belongs. Accent recognition therefore helps speech recognition systems.

5. Database Preparation
In this project we collected a database from four speakers: two native Hindi speakers and two native Bengali speakers. The database included five speech signals (pronounced and recorded as written): “shunya”, “ek”, “do”, “teen”, “chaar”, with 50 samples of each speech signal per speaker. In total we had 1000 speech signals (500 from the Hindi speakers and 500 from the native Bengali speakers). Each sample was recorded in ‘.wav’ format in a noise-controlled environment. We then performed the accent recognition task with MFCC [2.1], delta-MFCC [6.1], and double-delta-MFCC [6.2] as our feature extraction process and artificial neural networks (ANN) [3.1] as our classifier.

6. Advanced Feature Extraction Techniques for Accent Recognition System

6.1 Delta-Cepstral Coefficients (Delta-MFCC)


Delta-cepstral features were proposed to add dynamic information to the static cepstral features. They also improve recognition accuracy by adding a characterization of temporal dependencies to the hidden Markov model (HMM) frames, which are nominally assumed to be statistically independent of one another. For a short-time cepstral sequence C[n], the delta-cepstral features are typically defined as

𝐷[𝑛] = 𝐶[𝑛 + 𝑚] − 𝐶[𝑛 − 𝑚]

where n is the index of the analysis frame and, in practice, m is approximately 2 or 3.
These delta-MFCC coefficients (a set of 13 coefficients, like the MFCC coefficients) are concatenated vertically with the MFCC matrix to increase the accuracy of the feature extraction technique.
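The delta computation D[n] = C[n + m] − C[n − m] can be sketched as follows (clamping frame indices at the sequence edges is my own choice of boundary handling; the formula itself does not specify it):

```python
def delta(C, m=2):
    """Delta features D[n] = C[n + m] - C[n - m] for a cepstral sequence C.

    C is a list of frames (each a list of cepstral coefficients); indices
    are clamped at the edges so the output has the same number of frames.
    """
    N = len(C)
    D = []
    for n in range(N):
        hi = min(n + m, N - 1)   # clamp at the sequence boundaries
        lo = max(n - m, 0)
        D.append([a - b for a, b in zip(C[hi], C[lo])])
    return D

# Toy example: 5 frames of 2 coefficients each, linearly increasing.
C = [[float(n), float(2 * n)] for n in range(5)]
print(delta(C, m=1))
```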

6.2 Delta-Spectral Cepstral Features (Double-delta-MFCC)


These features are motivated by the non-stationarity of speech signals and add a greater degree of robustness to the feature extraction technique. Here we again compute delta coefficients, this time using the process described in [6.1]: the input matrix of the double-delta-MFCC stage is the output matrix of the delta-MFCC stage.

Now we vertically concatenate the MFCC features (13 coefficients), the delta-MFCC features (13 coefficients), and the double-delta-MFCC features (13 coefficients) to get a set of 39 coefficients for each speech signal sample. These contain more of the desired information in the speech signal, thus increasing the accuracy.
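The 39-coefficient stacking (the analogue of MATLAB’s vertcat on the three 13-by-T matrices; function and variable names here are illustrative) can be sketched as:

```python
def stack_features(mfcc, d, dd):
    """Concatenate MFCC, delta and double-delta coefficients per frame.

    Each input is a list of 13-coefficient frames; the result has one
    39-coefficient vector per frame.
    """
    return [m + d1 + d2 for m, d1, d2 in zip(mfcc, d, dd)]

mfcc = [[0.1] * 13, [0.2] * 13]   # two dummy frames
d    = [[0.0] * 13, [0.0] * 13]
dd   = [[0.0] * 13, [0.0] * 13]
feats = stack_features(mfcc, d, dd)
print(len(feats), "frames,", len(feats[0]), "coefficients each")
```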

7. Results and Outcomes (Classifier-ANN)


We observed different accuracies in three different cases:
i) Using only MFCC features.
ii) Combining MFCC + delta-MFCC features.
iii) Combining MFCC + delta-MFCC + double-delta-MFCC features.

The following are the confusion matrices obtained in each case, with their recognition accuracies:

Fig. 6: Confusion matrix for Hindi and Bengali speech signals using only MFCC features and ANN as classifier.

Recognition Accuracy = 98.6%

Fig. 7: Confusion matrix for Hindi and Bengali speech signals combining MFCC + delta-MFCC features and ANN as classifier.

Recognition Accuracy = 99.6%

Fig. 8: Confusion matrix for Hindi and Bengali speech signals combining MFCC + delta-MFCC + double-delta-MFCC features and ANN as classifier.

Recognition Accuracy = 99.7%

Accuracy plot: comparison of recognition accuracies for the three feature sets.

8. Classification based on SVM (Support Vector Machines)


The support vector machine (SVM) is a simple and effective algorithm for classification in speech or speaker recognition. SVM is a binary nonlinear classifier capable of determining whether an input vector x belongs to class 1 or class 2. A decision boundary, known as the soft margin, is determined, and the distance (R) between the sample points and the margin is calculated.

As shown in Fig. 9, the feature space consists of the features extracted from the speech signal. The decision boundary (soft margin) is determined to classify the samples; the distances between the margin and the samples are computed, and the samples nearest to the margin (with the least distance) are the ones that define it.

Fig. 9: Decision logic in SVM
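The project used MATLAB’s SVM tooling; purely as an illustrative toy (not the original implementation, and all names and data here are mine), a linear SVM can be trained by subgradient descent on the regularized hinge loss:

```python
import random

def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=500):
    """Train weights w and bias b on the regularized hinge loss."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:   # inside the margin: hinge loss is active
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:            # outside the margin: only the regularizer acts
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

def predict(w, b, x):
    """Sign of the decision function: +1 for class 1, -1 for class 2."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Two linearly separable 2-D clusters standing in for the two accent classes.
random.seed(1)
X = [[random.gauss(2, 0.5), random.gauss(2, 0.5)] for _ in range(50)] + \
    [[random.gauss(-2, 0.5), random.gauss(-2, 0.5)] for _ in range(50)]
y = [1] * 50 + [-1] * 50
w, b = train_linear_svm(X, y)
correct = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y))
print(correct, "/ 100 training points classified correctly")
```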

9. Results and Outcomes (Classifier-SVM)


After feature extraction using MFCC, delta-MFCC, and double-delta-MFCC, we used SVM as the classifier. Two groups were created: one for the Bengali accent and the other for the Hindi accent. Group 1 included the first 500 samples (i.e. 1 to 500) and group 2 the remaining 500 samples (i.e. 501 to 1000). This database of 1000 samples was used as training data.

We then manually collected 50 samples from native Hindi speakers and 50 samples from native Bengali speakers as testing data and extracted their features. The resulting confusion matrices are shown below:

Confusion matrix obtained using MFCC features and SVM as classifier.

                      Actual Hindi    Actual Bengali
Predicted Hindi            50                2
Predicted Bengali           0               48

Confusion matrix obtained using MFCC + delta-MFCC features and SVM as classifier.

                      Actual Hindi    Actual Bengali
Predicted Hindi            50                2
Predicted Bengali           0               48

Confusion matrix obtained using MFCC + delta-MFCC + double-delta-MFCC features and SVM as classifier.

                      Actual Hindi    Actual Bengali
Predicted Hindi            50                2
Predicted Bengali           0               48

Accuracy - We observe that when SVM is used as the classifier, the accuracy is the same (98 out of 100 test samples) in all three cases. This may be due to the limited database and the limited number of speakers.

Conclusion

Different feature extraction and recognition techniques are discussed and used in this report, and it can be concluded that the performance of the combined MFCC and LPC features is superior to that of MFCC features alone. This report also attempts to provide a broad overview of speech recognition. We further observed that advanced feature extraction techniques such as delta-MFCC and double-delta-MFCC increase the speech recognition accuracy to an appreciable extent, although we did not observe much difference in accuracy when SVM was used as the classifier.

Speech recognition has attracted scientists as an important discipline and has had a technological influence on society. It is hoped that this report brings understanding and inspiration to the research community working on automatic speech recognition (ASR) systems.
