BACHELOR OF ENGINEERING
IN
BY
CERTIFICATE
This is to certify that the dissertation titled ‘Speech Detection of Urban Sounds’
submitted by Fardeen Hasan bearing Roll No: 1604-15-735-095, Ethashyam Ur
Rahman bearing Roll No: 1604-15-735-090 and Mohammed Wasay Ahmed bearing
Roll No: 1604-15-735-114 in partial fulfilment of the requirements for the award of the
Degree of Bachelor of Engineering, is a bona fide record of work carried out by them
under my guidance and supervision during the year 2018-2019. The results embodied in
this report have not been submitted to any University or Institute for the award of any
Degree or Diploma.
8-2-249, Mount Pleasant, Road No.3, Banjara Hills, Hyderabad – 500 034
Date:
Place: Hyderabad
ACKNOWLEDGEMENT
Ethashyam Ur Rahman
Fardeen Hasan
Mohammed Wasay Ahmed
CONTENTS
ABSTRACT i
LIST OF ABBREVIATIONS ii
LIST OF TABLES iv
CHAPTER 1 : INTRODUCTION
1.1 Introduction 1
1.2 Problem statement 2
1.3 Objective 2
1.4 System model 2
1.5 Organization of Report 2-3
Recent studies have demonstrated the potential of unsupervised feature learning for
sound classification. In this thesis we further explore the application of the k-means
algorithm for feature learning from audio signals, here in the domain of urban sound
classification.
k-means is a relatively simple technique that has recently been shown to be competitive
with other more complex and time-consuming approaches. We study how different parts
of the processing pipeline influence performance, taking into account the specificities of
the urban sonic environment.
We evaluate our approach on a dataset compiled from urban sound sources. The acoustic
features of the respective sounds in the dataset are extracted using MFCC. The results are
complemented with error analysis and some proposals for future research.
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
1.1 Introduction
Speaker recognition is the art of recognizing a speaker from a given database using speech
as the only input. In this thesis we discuss a novel approach to detecting speakers.
Speech processing has emerged as one of the most important application areas of digital
signal processing. Fields of research in speech processing include speech recognition,
speaker recognition, speech synthesis, speech coding, etc. The objective of automatic
speaker recognition is to extract, characterize and recognize information about the
speaker's identity.
Speaker recognition can be divided in two ways. One way is to divide it into speaker
verification and speaker identification. In speaker verification, the test is based on a
claimed identity and a decision to accept or reject is made. In speaker identification,
there is no claim of identity: the system chooses the speaker from the database, or, in an
open-set system, the identity can be unknown.
Feature extraction is the first step in speaker recognition. Many algorithms have been
suggested and developed by researchers for feature extraction. In this work, the Mel
Frequency Cepstral Coefficient (MFCC) feature has been used for designing a text-
dependent speaker identification system. Some modifications to the existing MFCC
feature extraction technique are also suggested to improve speaker recognition efficiency.
o Sounds in an urban environment are usually composed of multiple sound sources,
which makes it challenging to classify them into categories.
o Urban sounds are unstructured, unlike speech and music signals.
1.3 Objective
The main objectives of this project are:
1. To build a diverse dataset of different urban sounds for testing.
2. To extract features from the urban sounds dataset using MFCC in order to classify them.
3. To demonstrate clustering with the K-means technique.
System model: Speech Signal → Normalization to the Mel frequency scale → MFCC Feature Extraction
Chapter 3: Introduction to urban sounds, physiology of the human ear, the speech signal,
the urban sound taxonomy and dataset, methods of speaker change detection, feature
extraction methods, Environmental Sound Classification (ESC).
Chapter 4: K-Means Algorithm, Clustering, Bayesian Information Criterion
Chapter 5: Mel Frequency Cepstral Coefficient (MFCC)
Chapter 6: Proposed Approach
Chapter 7: Simulated results and discussion
Chapter 8: Conclusion
CHAPTER 2
LITERATURE SURVEY
2.1 Introduction
In this chapter, a literature survey of two papers on classification and feature extraction of
urban sounds is presented. The chapter discusses their techniques and their results.
In paper (1), Unsupervised feature learning for urban sound classification, the application
of the spherical k-means algorithm for feature learning from audio signals is explored in
the domain of urban sound classification. The approach to feature learning for audio
signals is to convert the signal into a time-frequency representation, a common choice being
the mel-spectrogram. Log-scaled mel-spectrograms are then extracted with 40 components
(bands) covering the audible frequency range (0-22050 Hz), using a window size of 23 ms
(1024 samples at 44.1 kHz) and a hop size of the same duration. Experimentation
with a larger number of bands (128) was also performed, but this did not improve
performance, so the lower (and faster to process) resolution of 40 bands was used. The
mel-spectrograms were extracted with the Essentia audio analysis library via its Python
bindings. Shingling, i.e. grouping several consecutive frames (by concatenating them into
a single larger vector prior to PCA whitening), allows features to be learned that take
temporal dynamics into account. This option is particularly interesting for urban noise-like sounds such
as idling engines or jackhammers, where the temporal dynamics could potentially improve
our ability to distinguish sounds whose instantaneous features (i.e. a single frame) can be
very similar. The spherical k-means algorithm has been shown to be competitive with more
advanced (and much slower) techniques such as sparse coding, and has been used
successfully to learn features from audio for both music and birdsong. The learned codebook
is used to encode the samples presented to the classifier (both for training and testing). A
possible encoding scheme is vector quantization, i.e. assigning each sample to its closest
centroid (or n closest for some choice of n) in the codebook, resulting in a binary feature
vector whose only non-zero elements are the n selected neighbors. While this approach has
been shown to work for music, the experimentation concluded that a linear encoding scheme,
where each sample is represented by its multiplication with the codebook matrix, provides
better results.
After encoding, every audio recording is represented as a series of encoded frames (or
patches) over time. For classification, summarization over the time axis is performed so
that the dimensionality of all samples is the same (and not too large). Different studies
report success using different summary (or pooling) statistics, such as the maximum, mean
and standard deviation, or a combination of a larger number of statistics such as the
minimum, maximum, mean and variance. The experiments use the mean and standard
deviation, which were found to be the best-performing combination of two pooling
functions. After the feature learning stage, a single classification algorithm is used for all
experiments: a random forest classifier (500 trees). This classifier has been used
successfully in combination with learned features, and was also one of the top-performing
classifiers for a baseline system evaluated on the same dataset used in this study. The
experiments use the implementation provided in the scikit-learn Python library. For
evaluation, the UrbanSound8K dataset is used. The dataset is comprised of 8732 slices
(excerpts) of up to 4 s in duration extracted from field recordings. To facilitate comparable
research, the slices in UrbanSound8K come pre-sorted into 10 folds using a stratified
approach, which ensures that slices from the same recording are not used both for training
and testing, which could otherwise lead to artificially high results. The paper concludes
that classification accuracy can be significantly improved by feature learning if the
specificities of this domain are taken into consideration, primarily the importance of
capturing the temporal dynamics of urban sound sources.
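The mean-and-standard-deviation pooling described above can be sketched in a few lines; the array sizes below are made up purely for illustration.

```python
import numpy as np

# Hypothetical encoded recording: 200 frames, each a 1000-dimensional
# code vector (e.g. the product of a frame with a learned codebook).
rng = np.random.default_rng(0)
encoded_frames = rng.random((200, 1000))

# Pool over the time axis with mean and standard deviation, so every
# recording maps to a fixed-length feature vector regardless of duration.
pooled = np.concatenate([encoded_frames.mean(axis=0),
                         encoded_frames.std(axis=0)])
```

Whatever the number of frames, the pooled vector always has twice the code dimensionality, which is what lets a fixed-input classifier such as a random forest consume recordings of different lengths.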
In paper (2), Using Deep Convolutional Neural Networks to Classify Urban Sounds, a
convolutional neural network is used to classify environmental sounds. However, few
studies investigate the network input construction issue, so the impact of the time
resolution of the input spectrogram on classification performance is observed.
Unlike speech and audio signals, urban sounds are usually unstructured. They include
various real-life noises generated by human activities, ranging from transportation to leisure
activities. Automatic urban sound classification could identify the noise source, benefiting
urban livability through noise control, audio surveillance, soundscape assessment and
acoustic environment planning. However, sounds in an urban environment are usually
composed of multiple sound sources, which makes it challenging to classify them into
categories. For urban sound classification with a relatively simple convolutional
neural network, the features automatically learned from a large sound dataset are usually
presented in the form of distinctive temporal-spectral representations. Combined with a
simple softmax classifier, improved classification accuracy over hand-crafted features is
frequently reported. For a given sound sample, its 2D mel-spectrogram low-level
representation is adopted as the convolutional neural network input layer. Specifically, for
a sound sample, frame-based log-scaled mel-spectrograms are first extracted from
overlapping sample frames. Then all these frame-based static mel-spectrograms are
concatenated to form one 2D image. If dynamic frequency-delta (i.e. first-order frame-to-
frame difference) coefficients are also included, then the overall convolutional neural
network input has the dimension X ∈ R^(Nt×Nf×2), where Nt denotes the total number of
frames in the sound and Nf denotes the number of frequency bands. To take into account
the temporal dynamics of a sound sample, different frame lengths can be set. The shorter
the frame length, the finer the time resolution. The approach is evaluated on the
UrbanSound8K dataset,
which includes 8732 typical urban noise samples. These samples span 10 environmental
sound classes (see Table I), labeled: air conditioner (AI), car horn (CA), children playing
(CH), dog bark (DO), drilling (DR), engine idling (EN), gun shot (GU), jackhammer (JA),
siren (SI) and street music (ST). With the strengths of its large size, wide range of sound
classes and clear annotation, the dataset has been widely used. The network is implemented
in Python with the Keras package. During the training process, the convolutional neural
network model optimizes the cross-entropy loss. Each mini-batch consists of 32 randomly
selected inputs from the training dataset. Each model is trained for 50 epochs.
The input construction is also implemented in Python, with the aid of the librosa library for
the mel-spectrogram extraction. The results clearly verify that the time resolution of the
network input has significant effects on classification performance. Further analysis of the
confusion matrices of other inputs (not provided here due to space limitations) even shows
that the optimal time resolution differs between sound classes.
The experimental results confirm that the time resolution of the input contributes
significantly to classification performance, and the best performance is usually obtained
for inputs with a moderate time resolution.
2.3 Summary
The reviewed literature provides background on urban sounds, MFCC feature extraction,
K-means clustering and their design methods.
CHAPTER 3
URBAN SOUNDS
3.1 Introduction
According to the source-filter model of speech production, the speech signal can be
considered to be the output of a linear system. Depending on the type of input excitation
(source), two classes of speech sounds are produced, voiced and unvoiced. If the input
excitation is noise, then unvoiced sounds like /s/, /t/, etc. are produced, and if the input
excitation is periodic then voiced sounds like /a/, /i/, etc., are produced. In the unvoiced case,
noise is generated either by forcing air through a narrow constriction (e.g., production of /f/)
or by building air pressure behind an obstruction and then suddenly releasing that pressure
(e.g., production of /t/). In contrast, the excitation used to produce voiced sounds is
periodic and is generated by the vibrating vocal cords. The frequency of the voiced
excitation is commonly referred to as the fundamental frequency (F0).
The vocal tract shape, defined in terms of tongue, velum, lip and jaw position, acts like a
"filter" that filters the excitation to produce the speech signal. The frequency response of the
filter has different spectral characteristics depending on the shape of the vocal tract. The
broad spectral peaks in the spectrum are the resonances of the vocal tract and are commonly
referred to as formants.
The figure above shows, for example, the formants of the vowel /eh/ (as in "head"). The
frequencies of the first three formants (denoted F1, F2 and F3) contain sufficient
information for the recognition of vowels as well as other voiced sounds. Formant
movements have also been found to be extremely important for the perception of unvoiced
sounds (i.e., consonants).
In summary, the formants carry essential information about the speech signal.
Urban sound taxonomy:
Construction: 1. Drilling 2. Jackhammer
Mechanical: 1. Air Conditioner 2. Engine idling
Community: 1. Dog Bark 2. Children playing 3. Street Music
Traffic: 1. Car horn
Emergency: 1. Gunshot 2. Siren
Since real dialogue does not always conform to the simple model of speaker turns, the
possibility of overlapping speech segments must be accounted for, and the definition of
'babble noise' is unclear in this sense. Overlapping speech naturally spreads a speaker
change over the time domain; this may even blur the speaker change to insignificance, and
a smooth transition to a different speaker altogether is a real possibility. The methods
discussed above can ease this issue, assuming the notion of speaker turns remains valid.
This issue naturally lowers the precision of the model; in such cases the overlapping region
is regarded as a single speaker in babble noise. The data used in this proposed research
does not contain overlapping speech, therefore these consequences are purely theoretical.
Once the speech has been separated into speaker turn segments and saved in matrix
format, the segments can be clustered. This provides an estimate of how many speakers
are present in the speech dataset. The performance of this step may be impacted by noise
in the background environment and other such limitations, which are a definite challenge
in the proposed research. Agglomerative Hierarchical Clustering (AHC) [4] has been
compared; the general concept is to start by assuming that every speaker turn is a unique
person. In the proposed experiment we have likewise presumed that every speaker turn is
a unique person. In the proposed algorithm for speaker change detection, the most similar
segments are iteratively merged until only two segments remain for observation. The
number of speakers is established at the point where the merged segments are no longer
similar.
3.5 Pre-processing
The first step is to apply a high-pass filter to our audio signal with a cutoff around 200 Hz.
The purpose of this is to remove any DC offset and low-frequency noise. There is usually
very little speech information below 200 Hz, so it does not hurt to suppress the low
frequencies.
Prior to feature extraction our audio signal is split into overlapping frames 20-40ms in length.
A frame is often extracted every 10ms. For example, if our audio signal is sampled at 16 kHz
and we take our frame length to be 25ms, then each frame will have 0.025*16000 = 400
samples per frame. With 1 frame every 10ms, we'll have the first frame starting at sample 0,
the second frame will start at sample 160 etc.
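The frame arithmetic above can be checked with a short sketch; the `frame_signal` helper is hypothetical, not part of any cited library.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms frames, 10 ms hop
    at a 16 kHz sample rate)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Build an index matrix: row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

# One second of (silent) audio sampled at 16 kHz.
audio = np.zeros(16000)
frames = frame_signal(audio)
# First frame starts at sample 0, the second at sample 160, and so on.
```

With 16000 samples, a 400-sample frame and a 160-sample hop, this yields 98 frames of 400 samples each, matching the 0.025 × 16000 = 400 samples-per-frame calculation in the text.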
and fundamental frequency etc. The analysis in the cochlea takes place on a nonlinear
frequency scale (known as the Bark scale or the mel scale). This scale is approximately linear
up to about 1000 Hz and is approximately logarithmic thereafter. So, in the feature
extraction, it is very common to perform a frequency warping of the frequency axis after the
spectral computation.
The LPC analysis of each frame also involves a voiced/unvoiced decision. A pitch-detection
algorithm is employed to determine the correct pitch period / frequency. It is important to
re-emphasize that the pitch, gain and coefficient parameters vary with time from one
frame to another.
In reality the actual predictor coefficients are never used in recognition, since they
typically show high variance. The predictor coefficients are transformed to a more robust
set of parameters known as cepstral coefficients.
provides an exposure to a variety of sound sources - some very common (laughter, cat
meowing, dog barking), some quite distinct (glass breaking, brushing teeth) and then some
where the differences are more nuanced (helicopter and airplane noise). One of the possible
deficiencies of this dataset is the limited number of clips available per class. This is related
to the high cost of manual annotation and extraction, and the decision to maintain strict
balance between classes despite limited availability of recordings for more exotic types of
sound events. Nevertheless, it will, hopefully, be useful in its current form and is a concept
that could be expanded on if sufficient interest is expressed.
Examples of sound classes in the dataset:
Cow | Crickets | Breathing | Door, wood creaks | Car horn
Frog | Chirping birds | Coughing | Can opening | Engine
Insects (flying) | Pouring water | Brushing teeth | Clock alarm | Airplane
Clips in this dataset have been manually extracted from public field recordings gathered by
the Freesound.org project. The dataset has been prearranged into 5 folds for comparable
cross-validation, making sure that fragments from the same original source file are contained
in a single fold.
CHAPTER 4
K-MEANS and
CLUSTERING
4.1 Introduction
K-means clustering partitions n objects into k clusters in which each object belongs to the
cluster with the nearest mean. This method produces exactly k different clusters of the
greatest possible distinction. The best number of clusters k, leading to the greatest
separation (distance), is not known a priori and must be computed from the data. The
objective of K-means clustering is to minimize the total intra-cluster variance, i.e. the
squared error function.
K-means is one of the most popular clustering algorithms. K-means stores k centroids that
it uses to define clusters. A point is considered to be in a particular cluster if it is closer to
that cluster's centroid than to any other centroid. Machine learning is divided into two
parts, supervised learning and unsupervised learning; K-means clustering is a type of
unsupervised learning, which is used when you have unlabeled data (i.e., data without
defined categories or groups). The goal of this algorithm is to find groups in the data, with
the number of groups represented by the variable K.
The algorithm works iteratively to assign each data point to one of K groups based on the
features that are provided. Data points are clustered based on feature similarity. The results
of the K-means clustering algorithm are:
The centroids of the K clusters, which can be used to label new data
Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and
analyze the groups that have formed organically. The "Choosing K" section below describes
how the number of groups can be determined.
Each centroid of a cluster is a collection of feature values which define the resulting groups.
Examining the centroid feature weights can be used to qualitatively interpret what kind of
group each cluster represents.
4.2.1 Algorithm
The Κ-means clustering algorithm uses iterative refinement to produce a final result.
The algorithm inputs are the number of clusters Κ and the data set. The data set is a
collection of features for each data point. The algorithm starts with initial estimates
for the Κ centroids, which can either be randomly generated or randomly selected
from the data set. The algorithm then iterates between two steps:
Each centroid defines one of the clusters. In this step, each data point is assigned to
its nearest centroid, based on the squared Euclidean distance. More formally, if c_i is
a centroid in the collection C, then each data point x is assigned to the cluster

argmin_{c_i ∈ C} dist(c_i, x)^2

where dist(·) is the standard (L2) Euclidean distance. Let S_i be the set of data point
assignments for the ith cluster centroid.
In this step, the centroids are recomputed. This is done by taking the mean of all data
points assigned to that centroid's cluster:

c_i = (1 / |S_i|) Σ_{x ∈ S_i} x

The algorithm iterates between steps one and two until a stopping criterion is met (i.e.,
no data points change clusters, the sum of the distances is minimized, or some maximum
number of iterations is reached). The algorithm converges, though not necessarily to
the best possible outcome, meaning that assessing more than one run of the
algorithm with randomized starting centroids may give a better result.
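The two-step iteration can be sketched as a minimal Lloyd's-algorithm implementation, with random initial centroids drawn from the data; the test data and all names are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: squared Euclidean distance of every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2: recompute centroids (keep the old one if a cluster emptied).
        new_centroids = np.array(
            [X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
             for i in range(k)])
        if np.allclose(new_centroids, centroids):  # stopping criterion
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs; k-means should recover them as the two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
centroids, labels = kmeans(X, k=2)
```

A single run with one random initialization is shown here; as noted above, several runs with different seeds can avoid poor local optima.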
4.2.2 CHOOSING K
The algorithm described above finds the clusters and data set labels for a particular pre-
chosen K. To find the number of clusters in the data, the user needs to run the K-means
clustering algorithm for a range of K values and compare the results. In general, there is no
method for determining the exact value of K, but an accurate estimate can be obtained
using the following techniques.
One of the metrics that is commonly used to compare results across different values of K is
the mean distance between data points and their cluster centroid. Since increasing the number
of clusters will always reduce the distance to data points, increasing K will always decrease
this metric, to the extreme of reaching zero when K is the same as the number of data points.
Thus, this metric cannot be used as the sole target. Instead, mean distance to the centroid as
a function of K is plotted and the "elbow point," where the rate of decrease sharply shifts,
can be used to roughly determine K.
A number of other techniques exist for validating K, including cross-validation, information
criteria, the information theoretic jump method, the silhouette method, and the G-means
algorithm. In addition, monitoring the distribution of data points across groups provides
insight into how the algorithm is splitting the data for each K.
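The elbow computation can be roughly sketched as follows, using a bare-bones k-means to evaluate the mean distance to the centroid over a range of K; the data and helper names are illustrative.

```python
import numpy as np

def mean_distance_to_centroid(X, k, n_iter=50, seed=0):
    """Run a basic k-means, then return the mean distance of each point
    to its assigned centroid (the elbow-plot metric)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array(
            [X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
             for i in range(k)])
    return np.linalg.norm(X - centroids[labels], axis=1).mean()

# Three tight, well-separated clusters: the metric should drop steeply up to
# K = 3 and only creep downward after that (the "elbow").
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.05, (30, 2)) for c in (0.0, 4.0, 8.0)])
curve = {k: mean_distance_to_centroid(X, k) for k in range(1, 6)}
```

Plotting `curve` against K and looking for the point where the rate of decrease sharply shifts gives the rough estimate of K described above.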
Formally, a distance function is a function Dist with non-negative real values, defined on
the Cartesian product X × X of a set X. It is called a metric on X if, for each x, y, z ∈ X:
Dist(x, y) ≥ 0, with Dist(x, y) = 0 if and only if x = y (non-negativity and identity);
Dist(x, y) = Dist(y, x) (symmetry);
Dist(x, z) ≤ Dist(x, y) + Dist(y, z) (triangle inequality).
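These metric axioms can be spot-checked for the Euclidean distance used by k-means; the point values below are arbitrary.

```python
import numpy as np

def dist(x, y):
    """Standard (L2) Euclidean distance."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

# Spot-check the metric axioms on a few sample points.
x, y, z = np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.array([6.0, 8.0])
assert dist(x, y) >= 0                         # non-negativity
assert dist(x, x) == 0                         # identity of indiscernibles
assert dist(x, y) == dist(y, x)                # symmetry
assert dist(x, z) <= dist(x, y) + dist(y, z)   # triangle inequality
```

For the classic 3-4-5 triangle, `dist(x, y)` evaluates to exactly 5.0.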
4.3 CLUSTERING
Clustering is one of the most fundamental problems in pattern recognition. It plays a very
important role in searching for structure in data. It may serve as a pre-processing step for
other algorithms, which then operate on the identified clusters.
In general, clustering algorithms are used to group given objects, defined by a set of
numerical properties, in such a way that the objects within a group are more similar to
each other than to objects in different groups. Therefore, a specific clustering algorithm
needs to be provided with a criterion to measure the similarity of objects and a rule for
how to group the objects or points into clusters.
The k-means clustering algorithm uses the Euclidean distance to measure the similarity
between objects. Both iterative and adaptive algorithms exist for standard k-means
clustering. K-means clustering algorithms need to assume that the number of groups
(clusters) is known a priori.
The Bayesian Information Criterion is defined as

BIC = k ln(n) − 2 ln(L)

where
L = the maximized value of the likelihood function of the model M, i.e. L = p(x | θ̂, M),
where θ̂ are the parameter values that maximize the likelihood function;
x = the observed data;
n = the number of data points in x, the number of observations, or equivalently, the sample size;
k = the number of parameters estimated by the model.
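As an illustrative sketch, the BIC of a univariate Gaussian fitted by maximum likelihood (k = 2 parameters, the mean and the variance) can be computed directly from the definition above; the data here is synthetic.

```python
import math
import numpy as np

def gaussian_bic(x):
    """BIC = k*ln(n) - 2*ln(L) for a univariate Gaussian fit by maximum
    likelihood; the model has k = 2 parameters (mean and variance)."""
    n = len(x)
    mu, var = x.mean(), x.var()  # ML estimates (ddof=0 is the ML variance)
    # Maximized log-likelihood of a Gaussian at its ML parameters:
    # ln L = -(n/2) * (ln(2*pi*var) + 1)
    log_l = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return 2 * math.log(n) - 2 * log_l

rng = np.random.default_rng(0)
bic = gaussian_bic(rng.normal(0.0, 1.0, 500))
```

Fitting candidate models with different numbers of clusters and picking the one with the lowest BIC is one of the model-selection techniques mentioned for choosing K.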
CHAPTER 5
Mel Frequency Cepstral Coefficient (MFCC)
5.1 MFCC
The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of
a sound, based on a linear cosine transform of a log power spectrum on a nonlinear Mel scale
of frequency.
It is a popular feature extraction technique. Mel-frequency cepstral coefficients are the
features that collectively make up the mel-frequency cepstrum (MFC). The difference
between the cepstrum and the mel-frequency cepstrum is that in the MFC the frequency
bands are equally spaced on the Mel scale, which means that it approximates the human
auditory system's response more closely than the linearly spaced frequency bands used in
the normal cepstrum. This frequency warping allows a better representation of sound.
Sounds generated by a human are filtered by the shape of the vocal tract, among other
things. This shape determines what sound comes out. If we can determine the shape
accurately, this gives us an accurate representation of the phoneme being produced. The
shape of the vocal tract manifests itself in the envelope of the short-time power spectrum,
and the job of MFCCs is to accurately represent this envelope. Mel Frequency Cepstral
Coefficients (MFCCs) are a feature widely used in automatic speech and speaker
recognition. MFCCs were introduced by Davis and Mermelstein in the 1980s.
MFCC is based on human hearing perception, which cannot resolve frequencies above
1 kHz linearly. In other words, MFCC is based on the known variation of the human ear's
critical bandwidth with frequency. Pitch is measured on the Mel frequency scale to capture
the important characteristics of speech. MFCC takes human perceptual sensitivity with
respect to frequency into consideration, which makes it well suited to speech recognition.
1. Framing: The input speech signal is segmented into frames of 20~30 ms with an optional
overlap of 1/3~1/2 of the frame size. Usually the frame size (in terms of sample points) is
equal to a power of two in order to facilitate the use of the FFT. If this is not the case, we
need to zero-pad to the nearest power-of-two length. If the sample rate is 16000 Hz and the
frame size is 480 sample points, then the frame duration is 480/16000 = 0.03 sec = 30 ms.
Additionally, if the overlap is 160 points, then the frame rate is 16000/(480-160) = 50
frames per second. Framing is the process of segmenting the speech samples obtained from
analog-to-digital conversion (ADC) into small frames with a length in the range of 20 to
40 ms. The speech signal is divided into frames of N samples. Adjacent frames are
separated by M samples (M < N).
Typical values used are M = 100 and N = 256.
2. Windowing:
The important function of windowing is to reduce the spectral leakage that occurs when a
long signal is cut into short-time segments for frequency-domain analysis.
Different types of windowing functions:
Rectangular window
Bartlett window
Hamming window
Of these, the most widely used is the Hamming window. The Hamming window is used as
the window shape in the feature extraction processing chain because it tapers the frame
edges and integrates the closest frequency lines. Each frame has to be multiplied by a
Hamming window in order to keep the continuity of the first and the last points in the frame
(to be detailed in the next step). If the signal in a frame is denoted by s(n), n = 0…N−1,
then the signal after Hamming windowing is s(n)·w(n), where w(n) is the Hamming
window defined by:

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1
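The window formula can be sketched in a few lines and checked against NumPy's built-in version; the frame length of 400 samples corresponds to 25 ms at 16 kHz.

```python
import numpy as np

N = 400  # one 25 ms frame at a 16 kHz sample rate
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # Hamming window

# numpy ships the same window; both ends taper to 0.08 rather than 0,
# while the centre of the window stays near 1.
assert np.allclose(w, np.hamming(N))
```

Multiplying each frame elementwise by `w` before the FFT is exactly the s(n)·w(n) operation described above.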
3. FFT:
Spectral analysis shows that different accents in speech signals correspond to different
energy distribution over frequencies. Therefore we perform FFT to obtain the magnitude
frequency response of each frame. When we implement FFT on a frame, we consider that
the signal in frame is periodic, and continuous. If this is not the case though, we can perform
FFT but the discontinuity at the frame's first and last points is likely to introduce undesirable
effects in the frequency response. To handle this problem, we have two strategies.
a. Multiply each frame by a Hamming window to increase its continuity at the first and last
points.
b. Take a frame of a variable size such that it always contains an integer multiple number
of the fundamental periods of the speech signal.
In practice, the second strategy encounters difficulty because identifying the fundamental
period is not a trivial problem. Moreover, unvoiced sounds do not have a fundamental
period at all. We therefore generally adopt the first strategy: multiply the frame by a
Hamming window before performing the FFT.
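Strategy (a) can be sketched on a synthetic 1 kHz tone; the frame length and FFT size below are illustrative choices, with the frame zero-padded to the next power of two as described in the framing step.

```python
import numpy as np

fs = 16000
t = np.arange(400) / fs                    # one 25 ms frame
frame = np.sin(2 * np.pi * 1000 * t)       # synthetic 1 kHz test tone

windowed = frame * np.hamming(len(frame))  # strategy (a): Hamming before FFT
nfft = 512                                 # zero-pad to the next power of two
spectrum = np.fft.rfft(windowed, nfft)
power = (np.abs(spectrum) ** 2) / nfft     # periodogram-style power spectrum

# Bin k corresponds to k * fs / nfft Hz, so the strongest bin of the
# magnitude frequency response should sit at ~1 kHz.
peak_hz = np.argmax(power) * fs / nfft
```

Without the window, the discontinuity between the frame's first and last samples would smear energy across neighbouring bins; the Hamming window keeps the peak cleanly at the tone's frequency.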
4. Mel Filter Bank:
Triangular bandpass filters are used because the frequency range of the FFT spectrum is
very wide and the voice signal does not follow a linear scale. We multiply the magnitude
frequency response by a set of 20 triangular bandpass filters to get the log energy of each
triangular bandpass filter. The positions of these filters are equally spaced according to the
Mel frequency, which is related to the linear frequency f by the following equation:

Mel(f) = 1125 · ln(1 + f/700)

The Mel frequency is proportional to the logarithm of the linear frequency, reflecting
similar effects in the human's subjective aural perception.
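The warping equation and its inverse can be sketched directly; the inverse is needed when converting equally spaced Mel points back to linear frequencies for filter placement.

```python
import math

def hz_to_mel(f):
    """Mel(f) = 1125 * ln(1 + f/700): the warping used to place the filters."""
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse warping: f = 700 * (exp(m/1125) - 1)."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)
```

With this formula, 1000 Hz maps to roughly 1000 mel, consistent with the scale being approximately linear below 1 kHz and logarithmic above it.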
Advantages of triangular bandpass filters:
They smooth the magnitude spectrum so that the harmonics are flattened, in order to
obtain the envelope of the spectrum. This means that the pitch of a speech signal is
generally not present in the MFCCs.
They reduce the size of the features involved.
6. MFCC Features:
In this way, Mel-frequency cepstral coefficients are extracted from the speech signal. These
features are the main component of the speech recognition process. Further classification
of these features is done by various types of classifiers.
We will now go a little more slowly through the steps and explain why each of the steps is
necessary.
An audio signal is constantly changing, so to simplify things we assume that on short time
scales the audio signal doesn't change much (when we say it doesn't change, we mean
statistically i.e. statistically stationary, obviously the samples are constantly changing on
even short time scales). This is why we frame the signal into 20-40ms frames. If the frame
is much shorter we don't have enough samples to get a reliable spectral estimate, if it is longer
the signal changes too much throughout the frame.
The next step is to calculate the power spectrum of each frame. This is motivated by the
human cochlea (an organ in the ear) which vibrates at different spots depending on the
frequency of the incoming sounds. Depending on the location in the cochlea that vibrates
(which wobbles small hairs), different nerves fire informing the brain that certain frequencies
are present. Our periodogram estimate performs a similar job for us, identifying which
frequencies are present in the frame.
The periodogram spectral estimate still contains a lot of information not required for
Automatic Speech Recognition (ASR). In particular the cochlea can not discern the
difference between two closely spaced frequencies. This effect becomes more pronounced
as the frequencies increase. For this reason we take clumps of periodogram bins and sum
them up to get an idea of how much energy exists in various frequency regions. This is
performed by our Mel filterbank: the first filter is very narrow and gives an indication of
how much energy exists near 0 Hertz. As the frequencies get higher our filters get wider as
we become less concerned about variations. We are only interested in roughly how much
energy occurs at each spot. The Mel scale tells us exactly how to space our filterbanks and
how wide to make them. See below for how to calculate the spacing.
Once we have the filterbank energies, we take the logarithm of them. Human hearing also
motivates this: we don't hear loudness on a linear scale. Generally to double the perceived
volume of a sound we need to put 8 times as much energy into it. This means that large
variations in energy may not sound all that different if the sound is loud to begin with. This
compression operation makes our features match more closely what humans actually hear.
Why the logarithm and not a cube root? The logarithm allows us to use cepstral mean
subtraction, which is a channel normalization technique.
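To illustrate why the log enables cepstral mean subtraction: a multiplicative channel effect becomes an additive offset in the log domain, so subtracting the per-coefficient mean over all frames removes it. A minimal sketch with stand-in values:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-coefficient mean over all frames (channel normalization)."""
    return cepstra - np.mean(cepstra, axis=0, keepdims=True)

# A constant offset in the log/cepstral domain vanishes after CMS
cepstra = np.random.randn(100, 12)        # 100 frames x 12 coefficients
shifted = cepstra + 3.0                   # simulated channel offset
print(np.allclose(cepstral_mean_subtraction(shifted),
                  cepstral_mean_subtraction(cepstra)))  # True
```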
The final step is to compute the DCT of the log filterbank energies. There are 2 main reasons
this is performed. Because our filterbanks are all overlapping, the filterbank energies are
quite correlated with each other. The DCT decorrelates the energies which means diagonal
covariance matrices can be used to model the features in e.g. a HMM classifier. But notice
that only 12 of the 26 DCT coefficients are kept. This is because the higher DCT coefficients
represent fast changes in the filterbank energies and it turns out that these fast changes
actually degrade ASR performance, so we get a small improvement by dropping them.
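The DCT step can be sketched with SciPy's DCT routine; the random stand-in filterbank values below are illustrative only:

```python
import numpy as np
from scipy.fftpack import dct

# 26 log filterbank energies for one frame (random stand-in values)
log_fbank = np.log(np.random.rand(26) + 1e-8)

# Type-II DCT with orthonormal scaling; keep only the lowest 12 coefficients,
# discarding the fast-changing higher-order ones
mfcc = dct(log_fbank, type=2, norm='ortho')[:12]
print(mfcc.shape)  # (12,)
```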
5.2 What is the Mel scale?
The Mel scale relates perceived frequency, or pitch, of a pure tone to its actual measured
frequency. Humans are much better at discerning small changes in pitch at low frequencies
than they are at high frequencies. Incorporating this scale makes our features match more
closely what humans hear.
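The widely used form of the Mel scale, m = 2595 log10(1 + f/700), can be sketched as a pair of conversion functions; by construction 1000 Hz maps to approximately 1000 Mel:

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the Mel scale (2595 log10(1 + f/700))."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(round(hz_to_mel(1000)))  # 1000 (the scale's reference point)
print(round(mel_to_hz(hz_to_mel(440))))  # 440 (exact round trip)
```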
1. Frame the signal into 20-40 ms frames. 25ms is standard. This means the frame length for
a 16kHz signal is 0.025*16000 = 400 samples. Frame step is usually something like 10ms
(160 samples), which allows some overlap between the frames. The first 400-sample frame starts
at sample 0, the next 400 sample frame starts at sample 160 etc. until the end of the speech
file is reached. If the speech file does not divide into an even number of frames, pad it with
zeros so that it does.
The next steps are applied to every single frame; one set of 12 MFCC coefficients is extracted
for each frame. A short aside on notation: we call our time domain signal s(n). Once it is
framed we have s_i(n), where n ranges over 1-400 (if our frames are 400 samples)
and i ranges over the number of frames. When we calculate the complex DFT, we
get S_i(k), where the i denotes the frame number corresponding to the time-domain frame.
2. To take the Discrete Fourier Transform of the frame, perform the following:

S_i(k) = sum over n = 1..N of s_i(n) h(n) e^(-j 2 pi k n / N),  1 <= k <= K

where h(n) is an N-sample analysis window (e.g. a Hamming window) and K is the DFT
length. The periodogram-based power spectral estimate for frame i is then

P_i(k) = (1/N) |S_i(k)|^2

This is called the periodogram estimate of the power spectrum. We take the absolute value
of the complex Fourier transform, and square the result. We would generally perform a 512-
point FFT and keep only the first 257 coefficients.
3. Compute the Mel-spaced filterbank. This is a set of 20-40 (26 is standard) triangular filters
that we apply to the periodogram power spectral estimate from step 2. Our filterbank comes
in the form of 26 vectors of length 257 (assuming the FFT settings from step 2). Each vector
is mostly zeros, but is non-zero for a certain section of the spectrum. To calculate filterbank
energies we multiply each filterbank with the power spectrum, then add up the coefficients.
Once this is performed we are left with 26 numbers that give us an indication of how much
energy was in each filterbank.
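A sketch of constructing such a filterbank and computing the 26 energies, assuming the 512-point FFT and 16 kHz sampling rate from step 2; the floor((N+1)f/fs) bin spacing is the common convention, and the power spectrum here is a random stand-in:

```python
import numpy as np

def mel_filterbank(nfilt=26, nfft=512, samplerate=16000):
    """Triangular filters evenly spaced on the Mel scale, shape (nfilt, nfft//2 + 1)."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # nfilt + 2 points: each filter spans (left, centre, right) of three of them
    mel_points = np.linspace(0, hz2mel(samplerate / 2), nfilt + 2)
    bins = np.floor((nfft + 1) * mel2hz(mel_points) / samplerate).astype(int)
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for i in range(nfilt):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, centre):        # rising edge of triangle
            fbank[i, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):       # falling edge of triangle
            fbank[i, k] = (right - k) / max(right - centre, 1)
    return fbank

fbank = mel_filterbank()
power = np.random.rand(257)        # stand-in periodogram estimate
energies = np.dot(fbank, power)    # 26 filterbank energies
print(fbank.shape, energies.shape)  # (26, 257) (26,)
```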
CHAPTER 6
SIMULATION AND PROGRAMMING TOOLS
6.2.2 SCIKIT-LEARN
Scikit-learn provides machine learning in Python: simple and efficient tools for data mining
and data analysis, accessible to everybody and reusable in various contexts. It is built on
NumPy, SciPy, and matplotlib.
6.2.3 PYTHON_SPEECH_FEATURES
This library provides common speech features for ASR, including MFCCs and filterbank
energies.
6.2.4 MATPLOTLIB
Matplotlib is a comprehensive Python library for creating static, animated, and interactive
visualizations; it is used here to plot the extracted features and the detected change points.
6.3.1 Introduction
PyCharm offers great framework-specific support for modern web development frameworks
such as Django, Flask, Google App Engine, Pyramid, and web2py.
3. Scientific Tools
PyCharm integrates with IPython Notebook, has an interactive Python console, and
supports Anaconda as well as multiple scientific packages including matplotlib and NumPy.
4. Cross-technology Development
A huge collection of tools out of the box: an integrated debugger and test runner; Python
profiler; a built-in terminal; and integration with major VCS and built-in Database Tools.
CHAPTER 7
PROPOSED APPROACH
7.1 Speaker change detection
To start with, we need to first develop a dataset or use a predefined, open-source dataset.
Next, we frame the signal and apply normalization to each frame. Then we calculate the
power spectrum of the individual frames, which identifies the frequencies present in a given
frame. The Mel filterbank is applied to the power spectra, giving the total energy present in
each subband filter. Then we apply the log operation to the subband energies and take the
DCT of the result.
We take the first 13 MFCC coefficients, as they contain most of the information (the
python_speech_features library is used to extract the features). We then apply the K-Means
algorithm and BIC for modelling the developed file. Finally, to plot the final change point
we use the matplotlib library.
[Figure: block diagram of the MFCC feature-extraction pipeline - speech signal, normalization, Mel-scale mapping, MFCC feature extraction]
Choosing the dataset is the most crucial part of any speech processing; a dataset should contain
diverse sounds to make the program robust to any harsh environment. We chose 10 sounds
that are of diverse nature and have the most distinguishable pitch and other linguistic sound
features. The 10 chosen sounds are: air conditioner, car horn, children playing, dog bark,
drilling, engine idling, gun shot, jackhammer, siren and street music. These are exterior
sounds that occur in abundance in nature. The sounds were recorded in mp3 format, and
converting them to wav format was integral to the program. The output is in "npz" format,
which can be used as training input to other programs.
The linguistic features are unique to their respective sounds. MFCC extracts these features;
we use MFCC through the python_speech_features library, keeping only the first 13
coefficients, which play a vital role in processing these sounds and detecting the relevant
change points. MFCC is preferred over LPC (linear predictive coding), as the latter is more
complex and slower; MFCC is simpler than LPC and proves more efficient.
7.1.3 K-Means
K-Means clustering intends to partition n objects into k clusters in which each object belongs
to the cluster with the nearest mean. This method produces exactly k different clusters of
greatest possible distinction. The best number of clusters k, leading to the greatest separation
(distance), is not known a priori and must be computed from the data. The objective of
K-Means clustering is to minimize the total intra-cluster variance, i.e. the squared-error function.
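The squared-error objective above can be checked directly against scikit-learn's inertia_ attribute on toy two-cluster data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs; K-Means should recover the two groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# inertia_ is the total intra-cluster variance (squared-error objective):
# sum over clusters of squared distances to that cluster's mean
manual = sum(np.sum((X[km.labels_ == k] - km.cluster_centers_[k]) ** 2)
             for k in range(2))
print(np.isclose(km.inertia_, manual))  # True
```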
7.1.4 Clustering
Clustering is one of the most fundamental issues in pattern recognition. It plays a very important role
in searching for structures in data. It may serve as a pre-processing step for other algorithms,
which will operate on the identified clusters.
In general, clustering algorithms are used to group some given objects defined by a set of
numerical properties in such a way that the objects within a group are more similar than the
objects in different groups.
IMPORTANCE OF COEFFICIENTS
1. If a cepstral coefficient has a positive value, it represents a sonorant sound, since most
of the spectral energy in sonorant sounds is concentrated at low frequencies.
2. On the other hand, if a cepstral coefficient has a negative value, it represents a fricative
sound since most of the spectral energies in fricative sounds are concentrated at high
frequencies.
3. The lower order coefficients contain most of the information about the overall spectral
shape of the source-filter transfer function.
4. Even though higher order coefficients represent increasing levels of spectral details,
depending on the sampling rate and estimation method, 12 to 20 cepstral coefficients
are typically optimal for speech analysis. Selecting a large number of cepstral coefficients
results in more complexity in the models.
CHAPTER 8 RESULTS
[Table: change-point detection output; sample row - 0 DOG 0 12.227]
[Figure: extracted features plotted against Time (ms)]
[Figure: clustering output]
CHAPTER 9
CONCLUSION & FUTURE SCOPE
9.1 Conclusion
It has been demonstrated how a dataset of diverse sounds can be successfully detected and
distinguished from one another. This was made possible by extracting, by means of MFCC,
the linguistic features of each sound in the created dataset comprising these diverse urban
sounds.
K-Means is an algorithm which uses clustering to group the 10 sounds into k clusters; by
partitioning the dataset, it assigns each sample to the cluster whose mean is closest.
K-Means powered by MFCC gave accurate results when the urban dataset of 10 sounds was
used. In future work, an urban dataset with a wider variety of sounds, available online, will
test the accuracy and speed of the program at a higher level.
This algorithm can also be tested with classifiers and may give better results with higher
accuracy; this approach can be taken into consideration in future projects.
REFERENCES
[1] Loizou, P. C., Mimicking the Human Ear, IEEE Signal Processing Magazine, pp.
101-130, September 1998.
[2] J. Salamon and J. P. Bello, "Unsupervised feature learning for urban sound
classification," 2015 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), Brisbane, QLD, 2015, pp. 171-175.
[3] H. Zhou, Y. Song and H. Shu, "Using deep convolutional neural network to classify
urban sounds," TENCON 2017 - 2017 IEEE Region 10 Conference, Penang, 2017, pp.
3089-3092.
[4] Zhang, C., Wang, Z., Li, D., and Dong, M., A multi-mode and multichannel cochlear
implant. In Signal Processing Proceedings. IEEE Vol. 3, pp. 2237-2240, 2004.
[5] Loizou, P. C., Dorman, M., and Tu. Z., On the Number of Channels Needed to
Understand Speech, The Journal of the Acoustical Society of America, 106, 2097, 1999.
[6] Li, Y., and Chu, W., A new non-restoring square root algorithm and its VLSI
Implementations, IEEE International Conference, pp.538-544, 1996.
[7] Raitio, T., Juvela, L., Suni, A., Vainio, M., Alku, P.: Phase perception of the glottal
excitation and its relevance in statistical parametric speech synthesis. Speech Commun. 81,
104–119 (2016)
[8] Yegnanarayana, B., Saikia, D., Krishnan, T.: Significance of group delay functions
in signal reconstruction from spectral magnitude or phase. IEEE Trans. Acoust. Speech
Signal Process. 32(3), 610–623 (1984)
[9] Saratxaga, I., Sanchez, J., Wu, Z., Hernaez, I., Navas, E.: Synthetic speech detection
using phase information. Speech Commun. 81, 30–41 (2016)
[10] Wu, Z., Siong, C.E., Li, H.: Detecting converted speech and natural speech for
antispoofing attack in speaker recognition. In: INTERSPEECH, Portland, Oregon,
USA, pp. 1700–1703 (2012)
[13] Shenoy, B.A., Mulleti, S., Seelamantula, C.S.: Exact phase retrieval in
principal shift-invariant spaces. IEEE Trans. Signal Process. 64(2), 406–416 (2016)
[14] Julius O. Smith III: Interpolated Delay Lines, Ideal Bandlimited Interpolation, and
Fractional Delay Filter Design. Center for Computer Research in Music and Acoustics
(CCRMA) Department of Music, Stanford University Stanford, California 94305, May 12,
2017
[15] Syed Sibtain Khalid, Safdar Tanweer, Dr. Abdul Mobin, Dr. Afshar Alam: A
comparative Performance Analysis of LPC and MFCC for Noise Estimation in Speech
Recognition Task, International Journal of Electronics Engineering Research, ISSN 0975-
6450 Volume 9, Number 3 (2017) pp. 377-390
APPENDIX A
Details of Project and Relevance to Environment and Mapping with POs and PSOs with
Justification
PROGRAM OUTCOMES
APPENDIX B
GANTT CHART