
Flow Chart:

Input Audio → Preprocessing → Splitting data into train and test → Building model using CNN → Training the model → Predicting the output

Data Set:
• The dataset used in this project is taken from the Kaggle Speech Commands dataset.
• It contains .wav audio files of roughly one-second-long utterances of 5 words, namely ‘right’, ‘left’, ‘yes’, ‘no’ and ‘happy’, with about 2,500 recordings per word spoken by different speakers.
Coding platform used:
• We will be using Jupyter Notebook for the implementation of our code.
• The main advantages of using Jupyter Notebook are:
• It is well suited to developing ML models: you can run the code cell by cell and see both the code and its results, which gives a better understanding of what the code does.
• It is also language independent and easy to share.
Input and Output:
 Input: The inputs are 1-second recordings of isolated words (happy, left, right, yes and no), stored in the TrainAudio folder in .wav format. The sampling rate of these signals is 16 kHz; we resample them to 8000 Hz, since most speech-related frequency content lies below 4 kHz and is therefore preserved at an 8000 Hz sampling rate.
 Output: After preprocessing, building the model and training it on the input data with the help of a few Python modules such as librosa, SciPy and Keras (which provides the CNN layers), the model finally predicts the correct text output for a given audio input.
Important Python libraries used:
• Librosa and SciPy: used for processing audio signals.
• NumPy: used for working with arrays and matrices.
• Matplotlib: a plotting library for the Python programming language and its numerical mathematics extension NumPy.
• Random: used to pick samples at random, e.g. to select a recording from the test set for prediction.
• Keras: a powerful and easy-to-use open-source Python library for developing and evaluating deep learning models. It wraps the efficient numerical computation libraries Theano and TensorFlow and allows you to define and train neural network models in just a few lines of code.
Stagewise results and related discussion:
• Import all the libraries mentioned on the previous slide.
• Data exploration and visualization help us understand the data, as well as the pre-processing steps, in a better way. Below is a plot of the audio signal in the time domain.
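The original plot screenshot is not reproduced here. A minimal sketch of visualizing one recording in the time domain might look like the following; a synthetic tone stands in for a real clip, which in the notebook would instead be loaded with `librosa.load` from a (hypothetical) path under TrainAudio:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# In the actual notebook the samples would come from something like:
#   samples, sample_rate = librosa.load('TrainAudio/yes/example.wav', sr=16000)
# (hypothetical path). Here a synthetic 1-second tone stands in for a recording.
sample_rate = 16000
t = np.linspace(0, 1, sample_rate, endpoint=False)
samples = 0.5 * np.sin(2 * np.pi * 440 * t)

plt.figure(figsize=(10, 3))
plt.plot(np.arange(len(samples)) / sample_rate, samples)
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.title('Audio signal in the time domain')
plt.savefig('waveform.png')
```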
• Sampling and resampling of the signal:
• The sampling rate of the signal is 16,000 Hz, but we re-sample it to 8000 Hz, since most speech-related frequency content is preserved at that rate. The code below is used for this:
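The resampling code screenshot is missing. In the notebook this is typically done by passing `sr=8000` to `librosa.load`; an equivalent, self-contained sketch using SciPy (one of the listed libraries) on a synthetic signal:

```python
import numpy as np
from scipy import signal

# Synthetic 1-second recording at the original 16 kHz rate
# (in the notebook, samples would come from librosa.load on a .wav file).
sample_rate = 16000
samples = np.random.randn(sample_rate).astype(np.float32)

# Re-sample from 16 kHz down to 8 kHz
target_rate = 8000
resampled = signal.resample(samples, int(len(samples) * target_rate / sample_rate))

print(len(samples), '->', len(resampled))  # 16000 -> 8000
```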
• The next step is defining the labels and taking a look at the distribution of recording durations, which is shown in the screenshot below:
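The screenshot is not available. A small sketch of this step, computing duration as sample count divided by sampling rate (with a few synthetic lengths standing in for the real .wav files):

```python
# The five command words used as labels in this project
labels = ['happy', 'left', 'no', 'right', 'yes']

# Duration of a recording = number of samples / sampling rate.
# These sample counts are synthetic stand-ins for the real files.
sample_rate = 16000
sample_counts = [16000, 16000, 12800, 16000, 9600]
durations = [n / sample_rate for n in sample_counts]

print(durations)  # [1.0, 1.0, 0.8, 1.0, 0.6]
```

In the notebook these durations would then be plotted (e.g. as a histogram) to reveal that some recordings are shorter than 1 second.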
• Pre-processing the audio waves:
• In the data exploration part earlier, we saw that the duration of a few recordings is less than 1 second and the sampling rate is too high. So let us read the audio waves of the defined labels and apply the preprocessing steps below to deal with this.
• The two steps we follow are: 1. resampling, and 2. removing commands shorter than 1 second. The code for this is below.
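The preprocessing code screenshot is missing; a self-contained sketch of the two steps (resample to 8 kHz, then keep only clips that are exactly 1 second long), using synthetic arrays in place of loaded recordings:

```python
import numpy as np
from scipy import signal

sample_rate, target_rate = 16000, 8000

# Synthetic stand-ins for loaded recordings: two full 1-second clips
# and one clip shorter than 1 second.
recordings = [np.random.randn(16000), np.random.randn(12000), np.random.randn(16000)]

all_wave = []
for samples in recordings:
    # Step 1: re-sample from 16 kHz to 8 kHz
    resampled = signal.resample(samples, int(len(samples) * target_rate / sample_rate))
    # Step 2: keep only commands that are exactly 1 second long after resampling
    if len(resampled) == target_rate:
        all_wave.append(resampled)

print(len(all_wave))  # 2 of the 3 clips survive the length filter
```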
• Next, we converted the output labels to integer encoding, since the CNN requires the input and output variables to be numbers before we can fit and evaluate a model. A screenshot of the code is below:
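The screenshot is missing. This step is commonly done with scikit-learn's `LabelEncoder`; the same integer encoding can be sketched with plain Python and NumPy:

```python
import numpy as np

# A few example labels standing in for the full label list
all_label = ['yes', 'no', 'happy', 'yes', 'left', 'right', 'no']

# LabelEncoder assigns integers to classes in sorted order; the same
# mapping reproduced by hand:
classes = sorted(set(all_label))            # ['happy', 'left', 'no', 'right', 'yes']
y = np.array([classes.index(label) for label in all_label])

print(y)  # [4 2 0 4 1 3 2]
```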

• Now we converted the integer-encoded labels to one-hot vectors, since this is a multi-class classification problem:
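The corresponding code is not shown. Keras provides `to_categorical` for this; an equivalent dependency-free sketch with NumPy:

```python
import numpy as np

num_classes = 5
y = np.array([4, 2, 0])  # integer-encoded labels from the previous step

# One-hot encode: row i is all zeros except a 1 at column y[i]
# (keras.utils.to_categorical produces the same result).
y_hot = np.eye(num_classes)[y]

print(y_hot.shape)  # (3, 5)
```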

• Reshape the 2D array to 3D, since the input to Conv1D must be a 3D array. The code for it is:
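The reshape code is not shown; a minimal sketch. Conv1D expects input of shape (batch, steps, channels), so a trailing channel dimension of 1 is added:

```python
import numpy as np

# 100 preprocessed recordings, each 8000 samples long (a 2-D array)
all_wave = np.random.randn(100, 8000)

# Add a trailing channel dimension: (batch, steps) -> (batch, steps, 1)
all_wave = all_wave.reshape(-1, 8000, 1)

print(all_wave.shape)  # (100, 8000, 1)
```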
• Split into train and test sets
• Next, we will train the model on 80% of the data and test on the remaining 20%:
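The split code is not shown. The notebook likely used scikit-learn's `train_test_split`; an equivalent shuffled 80/20 split sketched with NumPy alone:

```python
import numpy as np

# Synthetic data standing in for the preprocessed waves and one-hot labels
x = np.random.randn(100, 8000, 1)
y = np.eye(5)[np.random.randint(0, 5, size=100)]

# Shuffle the indices, then take the first 80% for training
rng = np.random.default_rng(0)
idx = rng.permutation(len(x))
cut = int(0.8 * len(x))
x_tr, x_val = x[idx[:cut]], x[idx[cut:]]
y_tr, y_val = y[idx[:cut]], y[idx[cut:]]

print(x_tr.shape, x_val.shape)  # (80, 8000, 1) (20, 8000, 1)
```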

• Model Architecture for this problem


• We have built this model using Conv1D. Conv1D is a convolutional layer that performs the convolution along only one dimension. We used 4 Conv1D layers; each creates a convolution kernel that is convolved with the layer input over a single spatial (or temporal) dimension to produce a vector of outputs.
• MaxPool1D downsamples the input representation by taking the maximum value over a window defined by pool_size, and the Dropout layer randomly sets input units to 0 at a given rate at each step during training, which helps prevent overfitting.
• This is followed by 2 dense (fully connected) layers, in which every neuron in one layer is connected to every neuron in the next.
 The screenshot of the code and output of the model is as follows:
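The model screenshot is missing. A sketch of the 4-layer Conv1D architecture described above; the exact filter counts, kernel sizes and dropout rates here are assumptions, not the original values:

```python
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(8000, 1))

# 4 Conv1D blocks, each followed by max-pooling and dropout
x = layers.Conv1D(8, 13, activation='relu')(inputs)
x = layers.MaxPooling1D(3)(x)
x = layers.Dropout(0.3)(x)

x = layers.Conv1D(16, 11, activation='relu')(x)
x = layers.MaxPooling1D(3)(x)
x = layers.Dropout(0.3)(x)

x = layers.Conv1D(32, 9, activation='relu')(x)
x = layers.MaxPooling1D(3)(x)
x = layers.Dropout(0.3)(x)

x = layers.Conv1D(64, 7, activation='relu')(x)
x = layers.MaxPooling1D(3)(x)
x = layers.Dropout(0.3)(x)

# 2 dense layers on top, then a softmax over the 5 command words
x = layers.Flatten()(x)
x = layers.Dense(256, activation='relu')(x)
x = layers.Dropout(0.3)(x)
x = layers.Dense(128, activation='relu')(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(5, activation='softmax')(x)

model = models.Model(inputs, outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
```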
• Next, we trained the model with a batch size of 32, evaluated its performance on the holdout set, and saved and later reloaded the best model according to the validation accuracy achieved during training.
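The training code is not shown. A typical Keras pattern for "keep the best model" combines `ModelCheckpoint` with `EarlyStopping`; a minimal runnable sketch using a tiny stand-in model and random data (the real notebook would train the Conv1D model on the actual train/validation split):

```python
import numpy as np
from tensorflow.keras import layers, models, callbacks

# Tiny stand-in model and random data, only to illustrate the fit() call.
model = models.Sequential([
    layers.Input(shape=(8000, 1)),
    layers.Flatten(),
    layers.Dense(5, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

x_tr = np.random.randn(16, 8000, 1); y_tr = np.eye(5)[np.random.randint(0, 5, 16)]
x_val = np.random.randn(4, 8000, 1); y_val = np.eye(5)[np.random.randint(0, 5, 4)]

# Stop when validation loss stagnates; checkpoint only the best model so far.
es = callbacks.EarlyStopping(monitor='val_loss', patience=10)
mc = callbacks.ModelCheckpoint('best_model.keras', monitor='val_accuracy',
                               save_best_only=True)

history = model.fit(x_tr, y_tr, batch_size=32, epochs=2,
                    validation_data=(x_val, y_val), callbacks=[es, mc])

best = models.load_model('best_model.keras')  # reload the best checkpoint
```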
• Defining the predict function, which predicts the correct text output for an audio input:
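The function itself is not shown; a minimal sketch. It reshapes one 8000-sample clip to the model's expected 3-D input, takes the class probabilities from the trained model, and returns the word with the highest probability (the label order is assumed to be the sorted one produced by the integer encoding step):

```python
import numpy as np

classes = ['happy', 'left', 'no', 'right', 'yes']  # assumed label order

def predict(audio, model):
    """Return the predicted word for a 1-D array of 8000 audio samples."""
    prob = model.predict(audio.reshape(1, 8000, 1))
    index = np.argmax(prob[0])   # class with the highest probability
    return classes[index]
```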

• Finally, with the help of the random module, the model takes a randomly chosen audio clip, and by calling the predict function on it we get the final output text.
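A sketch of this final step, picking a random clip from the validation set (synthetic arrays stand in for the real hold-out data, and in the notebook `predict()` would then be called on the chosen clip):

```python
import random
import numpy as np

# Stand-ins for the real validation clips and their true labels
x_val = np.random.randn(20, 8000, 1)
y_val_words = ['yes', 'no', 'happy', 'left', 'right'] * 4

# Pick one validation clip at random
index = random.randint(0, len(x_val) - 1)
audio = x_val[index].ravel()

# In the notebook: print('Text:', predict(audio, model))
print('True label:', y_val_words[index])
```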
