
“Speech to Text Conversion using Machine Learning Model”

A
Mini Project Report
Submitted in fulfillment of the
Power Electronics Mini Project (Third Year Electrical)

Academic Year 2020-21

Project Guide: Prof. Suhas Kakade and Prof. Rohan Kulkarni

Project Members:

Sr No Name MIS Number

1 Pravin Maragale 111805054

2 Anshul Malokar 111805076

3 Aditya Upadhye 111805079

4 Jayesh Ingale 111805080

Project Type: Machine Learning

Platform/Software Used: Python (libraries used: librosa, keras, flask, onnx)


ABSTRACT

Sound plays an important role in every aspect of human life. From personal security to critical
surveillance, sound is a key element in building automated systems for these fields. A few
systems are already on the market, for example software such as Jasper, Resourcemate and Spok
Speech Solution, but their efficiency is a point of concern for deployment in real-life
scenarios. An efficient, high-accuracy speech-to-text recognition model is therefore essential
for providing a good service to the customer.

The learning capabilities of deep learning architectures can be used to develop sound
classification systems that overcome the efficiency issues of traditional systems. Our aim in this
project is to use deep learning networks to build a speech-to-text conversion model in Python
that understands simple spoken commands. We have used the Speech Commands dataset released
by TensorFlow. It includes 65,000 one-second-long utterances of short words by thousands of
different people, which we use to train the convolutional neural network (CNN) and the tensor deep
stacking network (TDSN). The dataset consists of one-second .wav audio files, each containing a single
spoken English word. These words come from a small set of commands and are spoken by a
variety of different speakers. The labels to be predicted in this project are yes, no, up, down,
left, right, on, off, stop and go.

CONTENTS

Module 1: Introduction……………………………………………………………………..3

Module 2: Data Exploration, Visualization and Preprocessing Data………...………….….6

Module 3: Model Architecture…………………………………………………………….10

Module 4: Deployment and Uses……………………………………………………….....13

Module 5: Result and Conclusion………………………………………………………….15

Module 1
Introduction
______________________________________________________________________________
In this project, we have built a speech recognition machine learning model which recognises
10 words (yes, no, up, down, left, right, on, off, stop, go) using the keras and librosa
Python libraries, together with a desktop app to showcase the working of our model.

Introduction to Signal Processing

Signal processing is an electrical engineering subfield that focuses on analysing, modifying and
synthesizing signals such as sound, images and scientific measurements. Signal processing
techniques can be used to improve transmission, storage efficiency and subjective quality, and
also to emphasize or detect components of interest in a measured signal.

Digital signal

A digital signal is a discrete representation of a signal over a period of time: only a finite
number of samples exists between any two time instants. Applications of digital signals include
audio and speech processing, sonar, radar and other sensor array processing, and spectral density
estimation.

Analog signal

An analog signal is a continuous representation of a signal over a period of time. In an analog
signal, an infinite number of samples exists between any two time instants. Examples of analog
signals are the human voice, thermometer readings and analog phones.

Audio Signal

When an object vibrates, the air molecules oscillate to and fro about their rest positions and
transmit their energy to neighboring molecules. This transfer of energy from
one molecule to another produces a sound wave. Audio signal processing is a
subfield of signal processing that is concerned with the electronic manipulation of audio signals.
Audio signals are electronic representations of sound waves: longitudinal waves which travel
through air, consisting of compressions and rarefactions. The energy contained in audio signals is
typically measured in decibels. As audio signals may be represented in either digital or analog
format, processing may occur in either domain. Audio signals have frequencies in the range of
roughly 20 Hz to 20,000 Hz.

Speech Signal

The speech signal, as it emerges from a speaker’s mouth, is a one-dimensional function of time
(air pressure). Microphones convert the fluctuating air pressure into the electrical signals we
deal with in speech processing. The average fundamental frequency is about 125 Hz for a male
voice, 200 Hz for a female voice and 300 Hz for a child’s voice.

Sampling of an Audio Signal

Analog signals are memory-intensive since they contain an infinite number of samples, and
processing them is computationally demanding. Therefore, we need a technique for converting
analog signals to digital signals so that we can work with them easily on digital
hardware.

Sampling the signal is a process of converting an analog signal to a digital signal by selecting a
certain number of samples per second from the analog signal. We are converting a speech signal
to a discrete signal through sampling so that it can be stored and processed efficiently in memory.

The below illustration depicts how the analog audio signal is discretized and stored in the
memory:

Fig. no. 1: (a) Analog signal, (b) Sampling process, (c) Digital signal

The key takeaway from the above figure is that we are able to reconstruct a speech wave very
close to the original even after sampling the analog signal, since we have chosen a sufficiently
high sampling rate. The sampling rate or sampling frequency is defined as the number of samples
taken per second.
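As a small illustration of the idea, the sketch below builds a one-second discrete signal in numpy (the 440 Hz tone and the 8000 Hz rate are only example values standing in for a speech clip):

import numpy as np

sampling_rate = 8000                              # samples per second (Hz)
duration = 1.0                                    # a one-second clip, like the ones in our dataset
t = np.arange(0, duration, 1 / sampling_rate)     # discrete time instants
signal = np.sin(2 * np.pi * 440 * t)              # a 440 Hz tone standing in for speech

print(len(signal))                                # 8000 samples represent one second of audio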

Module 2
Data Exploration, Visualization and Preprocessing Data
____________________________________________________________________________

Data exploration and visualization help us understand the data and the preprocessing steps in a
better way.

Time series: A time series is a sequence of data points that occur in successive order over some
period of time. We’ll visualize the audio signal in the time domain.

Python Audio Libraries:


Python has some great libraries for audio preprocessing, such as Librosa and PyAudio. There are also
built-in modules for some basic audio functionality.

Librosa
Librosa is a Python module for analysing audio signals in general, though it is geared more towards music.
An audio recording may contain a single channel (mono) or two channels (stereo).
By default, librosa loads audio at a sample rate of 22,050 Hz, converts it to mono and normalizes
the sample values to the range -1 to +1, so the data is already in a normalized form.
Installation:
pip install librosa
or
conda install -c conda-forge librosa

Loading an audio file and visualizing the audio:

import librosa

train_audio_path = './train/audio/'
samples, sample_rate = librosa.load(train_audio_path + 'yes/0a7c2a8d_nohash_0.wav', sr=16000)

Fig. 2a. Data in Time Domain
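The time-domain plot in Fig. 2a can be reproduced with a short matplotlib snippet like the one below (a sketch assuming the samples and sample_rate variables from the loading step above):

import numpy as np
import matplotlib.pyplot as plt

times = np.arange(len(samples)) / sample_rate     # time axis in seconds

plt.figure(figsize=(12, 4))
plt.plot(times, samples)
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.title('yes/0a7c2a8d_nohash_0.wav in the time domain')
plt.show()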

Sampling rate:
The sample rate is the number of samples of audio carried per second, measured in Hz or kHz.

Resampling:
We’ll resample the audio to a sampling rate of 8000 Hz, since most speech-related frequencies lie
below 4000 Hz and are therefore preserved at this rate:

samples = librosa.resample(samples, sample_rate, 8000)

Using IPython.display.Audio, we can play the audio in a Jupyter notebook:

import IPython.display as ipd
ipd.Audio(samples, rate=sample_rate)

Fig. 2b. Playing Audio

Fig. 2c. Distribution of Duration of recordings
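The duration distribution in Fig. 2c can be obtained by reading the length of each recording; the sketch below scans only the 'yes' folder for brevity and uses the standard-library wave module (folder names follow the train_audio_path layout above):

import os
import wave
import matplotlib.pyplot as plt

durations = []
for file_name in os.listdir(train_audio_path + 'yes/'):
    with wave.open(train_audio_path + 'yes/' + file_name, 'rb') as f:
        durations.append(f.getnframes() / f.getframerate())   # duration in seconds

plt.hist(durations)                # histogram of clip durations, as in Fig. 2c
plt.xlabel('Duration (s)')
plt.ylabel('Number of recordings')
plt.show()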

Exploring the dataset:
The dataset contains a total of 23,682 speech recordings in .wav file format mostly of one sec
duration bifurcated into 10 classes as [2377, 2375, 2375, 2359, 2353, 2367, 2367, 2357, 2380,
2372] recordings per class.

Fig. 2d. Number of recordings per class
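A bar chart like Fig. 2d can be produced by counting the files in each class folder (a sketch assuming one sub-folder per command word under train_audio_path):

import os
import matplotlib.pyplot as plt

labels = os.listdir(train_audio_path)                          # one sub-folder per command word
no_of_recordings = [len(os.listdir(os.path.join(train_audio_path, label))) for label in labels]

plt.figure(figsize=(12, 4))
plt.bar(labels, no_of_recordings)                              # recordings per class, as in Fig. 2d
plt.xlabel('Command word')
plt.ylabel('Number of recordings')
plt.show()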

Preprocessing the audio waves


Data preprocessing: this involves loading the wav data, label encoding, feature scaling and splitting
the data into training and test sets.

In the data exploration part we found that the duration of a few recordings is less than 1 second
and that the sampling rate is too high. So we read the audio waves and apply the preprocessing
steps below to deal with this.

Here are the two steps we’ll follow (a sketch of the corresponding loop is given after the list):

● Resampling to 8000 Hz.
● Removing recordings shorter than 1 second.
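In the sketch below, the list names all_wave and all_label are assumptions, and the librosa.resample call follows the positional form used earlier (newer librosa versions expect the keyword arguments orig_sr and target_sr):

import os
import librosa

labels = ['yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go']

all_wave = []
all_label = []
for label in labels:
    folder = os.path.join(train_audio_path, label)
    for file_name in os.listdir(folder):
        samples, sample_rate = librosa.load(os.path.join(folder, file_name), sr=16000)
        samples = librosa.resample(samples, sample_rate, 8000)   # resample to 8000 Hz
        if len(samples) == 8000:                                 # keep only full one-second clips
            all_wave.append(samples)
            all_label.append(label)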

Labels Encoding:
Label encoding refers to converting the labels into numeric, machine-readable form so that
machine learning algorithms can operate on them. It is an important preprocessing step for
structured datasets in supervised learning. The sklearn.preprocessing package provides several
common utility functions and transformer classes to change raw feature vectors into a
representation that is more suitable for the downstream estimators.

from sklearn.preprocessing import LabelEncoder

# convert the word labels into integer codes
le = LabelEncoder()
y = le.fit_transform(all_label)
classes = list(le.classes_)

Converting the integer-encoded labels to a one-hot vector, since it is a multi-class
classification problem:
Machine learning algorithms cannot work with categorical data directly; categorical data must
be converted to numbers. A one-hot encoding is a representation of categorical variables as
binary vectors. This first requires that the categorical values be mapped to integer values.
Then each integer value is represented as a binary vector that is all zeros except at the index
of the integer, which is marked with a 1.

This applies when you are working with a sequence classification problem and plan on using deep
learning methods such as a Long Short-Term Memory recurrent neural network. After converting the
integer-encoded labels to one-hot vectors, we reshape the 2D input array to 3D, since the input
to Conv1D must be a 3D array.
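A sketch of these two steps, assuming the integer-encoded labels y and the list all_wave from the previous steps (to_categorical is the usual Keras helper for one-hot encoding):

import numpy as np
from keras.utils import to_categorical

# one-hot encode the integer labels (10 classes)
y = to_categorical(y, num_classes=len(classes))

# stack the waveforms and add a channel dimension for Conv1D: (samples, 8000, 1)
all_wave = np.array(all_wave).reshape(-1, 8000, 1)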

Split into train and validation set


The train-test split is a technique for evaluating the performance of a machine learning
algorithm. The procedure involves taking a dataset and dividing it into two subsets. The first
subset is used to fit the model and is referred to as the training dataset. The second subset is not
used to train the model; instead, the input element of the dataset is provided to the model, then
predictions are made and compared to the expected values. This second dataset is referred to as
the test dataset.

● Train Dataset: Used to fit the machine learning model.


● Test Dataset: Used to evaluate the machine learning model.

Fig. 2e. Splitting Dataset into train and test set

We will train the model on 80% of the data and validate on the remaining 20%.
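A minimal sketch of the 80/20 split with scikit-learn (variable names carried over from the preprocessing steps; the random_state value is an arbitrary assumption):

from sklearn.model_selection import train_test_split

# 80% of the clips for training, 20% held back for validation
x_train, x_valid, y_train, y_valid = train_test_split(
    all_wave, y, test_size=0.2, random_state=777, shuffle=True)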

Module 3

Model Architecture
______________________________________________________________________________

After the data preprocessing phase and splitting the dataset into training and test sets, building
the model is the next step of the project. CNNs are quite successful and widely used in the field
of image classification. Since audio can be represented in the form of an image, we have used a
CNN for audio classification. Our independent variable consists of the audio samples, whereas the
dependent variable has 10 different classes. Convolution in only one dimension is sufficient for
audio classification, so Conv1D from keras is used for model building. Firstly, the session is
cleared and memory used by any earlier model is freed using clear_session(), so that the chances
of slowdown and memory clutter are reduced.

The input shape is (8000, 1), as the clip length is one second and the sampling rate is 8000 Hz.
Now we start adding 1-D convolution layers to the model. The first layer has 8 filters. Max
pooling with a pool size of 3 is then applied.

Fig. 3a. Convolution

Max pooling reduces the overfitting of the model. It also reduces the computational cost by
reducing the number of parameters to learn.

Fig. 3b. Max pooling

Finally, dropout is applied to the layer so that overfitting can be prevented. The Dropout layer
randomly sets input units to 0 with a frequency of rate at each step during training, which helps
prevent overfitting. Inputs not set to 0 are scaled up by 1/(1 - rate) such that the sum over all
inputs is unchanged. Here 30% of the inputs are set to 0. After trying models with one, two and
three additional layers, we found that the model with three more layers gave good accuracy. As a
result, three more convolution layers are added to the model with 16, 32 and 64 filters
respectively. The ReLU activation function is used in each layer; we have chosen it as it is the
most commonly used activation function in CNN models.
The next step in modelling is the Flatten layer. The Flatten layer converts the data into a
one-dimensional array for input to the next layer; we flatten the output of the convolutional
layers to create a single long feature vector.
Dense layers are then added to obtain fully connected layers. Here we have used two Dense layers
so that subsequent layers can learn more complex features and higher-level layers encode more
abstract features. After each Dense layer, a Dropout layer is added.

Fig. 3c. Fully-connected layer

Finally we have the output layer, which is also a fully connected layer, so Dense is used to
create it. The number of labels is taken as the number of output nodes, and the activation
function used is softmax, since the output should be a probability distribution over the classes.
For compiling the model, a standard set of parameters is used: categorical cross-entropy as the
loss function and the Adam optimizer, with accuracy as the only metric. Finally the model is
fitted on the x_train and y_train datasets. To reduce the training time we have used callbacks,
so that training stops when there is no significant improvement; the callbacks used are
ModelCheckpoint and EarlyStopping. A function is created for the final use of the model: it takes
an array representing the audio as input, reshapes it to (1, 8000, 1) and returns the predicted
text as output.
So the model looks like:

1) Input layer
2) 4 blocks of:
● Convolution layer
● MaxPooling layer
● Dropout layer
3) Flatten layer
4) 2 blocks of:
● Dense layer
● Dropout layer
5) Output layer

Fig. 3d. Representation of Convolutional Neural Network
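A condensed sketch of such a network in Keras is given below. The filter counts, pool size, dropout rate, loss, optimizer and callbacks follow the description above, while the kernel sizes and dense-layer widths are illustrative assumptions rather than the exact values used in the project:

import numpy as np
from keras import backend as K
from keras.models import Model
from keras.layers import Input, Conv1D, MaxPooling1D, Dropout, Flatten, Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint

K.clear_session()                               # free memory from any earlier model

inputs = Input(shape=(8000, 1))

x = inputs
for filters, kernel_size in [(8, 13), (16, 11), (32, 9), (64, 7)]:
    x = Conv1D(filters, kernel_size, activation='relu')(x)
    x = MaxPooling1D(3)(x)                      # max pooling with pool size 3
    x = Dropout(0.3)(x)                         # 30% of the inputs are set to 0

x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.3)(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.3)(x)
outputs = Dense(len(classes), activation='softmax')(x)

model = Model(inputs, outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# stop training when the validation loss stops improving and keep the best weights on disk
callbacks = [EarlyStopping(monitor='val_loss', patience=10, min_delta=0.0001),
             ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True)]

history = model.fit(x_train, y_train, epochs=100, batch_size=32,
                    validation_data=(x_valid, y_valid), callbacks=callbacks)

def predict(audio):
    # audio: 1-D array of 8000 samples; reshape to (1, 8000, 1) before inference
    prob = model.predict(audio.reshape(1, 8000, 1))
    return classes[int(np.argmax(prob[0]))]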

Module 4
Deployment and uses
______________________________________________________________________________
Overview
Using a machine learning model in an application is as important as creating and training the
model; every model is trained with some use case in mind. Our speech-to-text model is deployed
in a desktop app which recognises your voice as you speak. The app is currently available only
for Windows. To use the app:
1. Download the zip file from
https://github.com/Adityaupadhye/mini-project-speech-to-text/releases/download/v1.0.0/
speech.to.text.zip
2. Extract the contents and run “speech to text.exe”

Fig. 4a. Home screen of the app

Fig. 4b. Prediction screen

Steps Involved
After training, the model is saved in the .h5 file format and then converted into the .onnx
(Open Neural Network Exchange) format, a lightweight format for running inference on an ML model.
As the user speaks, 1 second of audio is recorded at a sample rate of 16,000 Hz and then resampled
and reshaped to (1, 8000, 1), which is the input shape of the model.
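A sketch of this pipeline is shown below. The tf2onnx converter, the sounddevice recording call and the file name speech_to_text.onnx are assumptions standing in for whatever the app actually uses (keras2onnx was another common converter at the time), and classes is the label list from training:

import numpy as np
import librosa
import tf2onnx
import onnxruntime as ort
import sounddevice as sd

# one-off conversion: saved Keras model -> ONNX
model.save('speech_to_text.h5')
model_proto, _ = tf2onnx.convert.from_keras(model, output_path='speech_to_text.onnx')

# at prediction time: record one second of mono audio at 16 kHz
recording = sd.rec(16000, samplerate=16000, channels=1, dtype='float32')
sd.wait()
samples = librosa.resample(recording.flatten(), 16000, 8000)      # down to 8000 samples

# run inference with onnxruntime
session = ort.InferenceSession('speech_to_text.onnx')
input_name = session.get_inputs()[0].name
prob = session.run(None, {input_name: samples.reshape(1, 8000, 1).astype(np.float32)})[0]
predicted_word = classes[int(np.argmax(prob))]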

Development of desktop app


As the ML model is defined and trained using Python, the same environment was needed to make the
process of running inference smooth.
The Python web framework Flask is used to run a server that serves HTML as the frontend.
To convert the web app into a desktop app, the flaskwebgui library is used, which runs the web
app in a Chrome window as a desktop app.
The Python script which runs the application is then converted to an executable file using the
pyinstaller library, which bundles all the required dependencies and generates a
“speech to text.exe” file.
The advantage of creating an executable is that the user does not need to install the libraries
or set up an environment to run the application; instead, only the executable file with all its
dependencies needs to be extracted.
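A minimal sketch of how the Flask app and flaskwebgui fit together (the route, the template name and the window size are assumptions, and flaskwebgui's constructor arguments differ slightly between versions):

from flask import Flask, render_template
from flaskwebgui import FlaskUI

app = Flask(__name__)

@app.route('/')
def home():
    # index.html holds the record button and the prediction display
    return render_template('index.html')

if __name__ == '__main__':
    # open the Flask app in a Chrome window instead of a browser tab
    FlaskUI(app, width=800, height=600).run()

The executable is then built with a pyinstaller command along the lines of pyinstaller --onefile application.py (the exact flags used in the project are not recorded here).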

Uses of speech recognition


Nowadays speech recognition has a variety of uses, such as personal assistants (Google Assistant,
Alexa, Cortana, etc.), voice search, voice typing in Google's keyboard and phone unlock using
voice. All these use cases are possible because an ML model trained on the words of a language is
running inference in the background as you speak. There are also multilingual systems, such as
Google's keyboard, which offers voice typing in multiple languages.

Module 5
Result and Conclusion
______________________________________________________________________________

We have successfully created a desktop app which uses a machine learning model to predict the
spoken word from audio and display it as text. It can predict 10 different words. If a word
outside these 10 is given as input, the model returns the label that matches the input audio most
closely.
We can also see that the loss value decreases significantly as the number of epochs increases.

Fig. 5a. loss vs number of epochs

After training the CNN model, we have been able to achieve 91.75% accuracy on the training set.
Accuracy on the validation data is approximately 85.34%, as shown below.

Code:

The ML model is defined and trained in the file Speech Recognition

Desktop app creation file: application.py

GitHub repo for this project: https://github.com/Adityaupadhye/mini-project-speech-to-text

TensorFlow Speech Recognition Dataset:
https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/data

References:
1) Aditya Khamparia, Deepak Gupta, Nhu Gia Nguyen, Ashish Khanna, Babita Pandey and Prayag
Tiwari, "Sound Classification Using Convolutional Neural Network and Tensor Deep Stacking
Network", pp. 7717-7727, January 2019.
2) A. Graves, A. Mohamed and G. Hinton, "Speech recognition with deep recurrent neural
networks", 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.
6645-6649, May 2013.
3) C. Vimala and V. Radha, "A Review on Speech Recognition Challenges and Approaches",
World of Computer Science and Information Technology Journal (WCSIT), vol. 2, no. 1, pp. 1-7,
2012.
4) https://keras.io/
5) https://medium.com/
6) https://towardsdatascience.com/
7) https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/data
8) https://librosa.org/
9) http://home.iitk.ac.in/~vipular/stuff/2019_MLSP.html

