BACHELOR OF ENGINEERING
IN
BY
CERTIFICATE
This is to certify that the dissertation titled ‘Speech Detection of Urban Sounds’
submitted by Fardeen Hasan bearing Roll No: 1604-15-735-095, Ethashyam Ur
Rahman bearing Roll No: 1604-15-735-090 and Mohammed Wasay Ahmed bearing
Roll No: 1604-15-735-114 in partial fulfilment of the requirements for the award of the
Degree of Bachelor of Engineering, is a bona fide record of work carried out by them
under my guidance and supervision during the year 2018-2019. The results embodied in
this report have not been submitted to any University or Institute for the award of any
Degree or Diploma.
8-2-249, Mount Pleasant, Road No.3, Banjara Hills, Hyderabad – 500 034
Date:
Place: Hyderabad
ACKNOWLEDGEMENT
Ethashyam Ur Rahman
Fardeen Hasan
Mohammed Wasay Ahmed
CONTENTS
ABSTRACT i
LIST OF ABBREVIATIONS ii
LIST OF TABLES iv
CHAPTER 1 : INTRODUCTION
1.1 Introduction 1
1.2 Problem statement 2
1.3 Objective 2
1.4 System model 2
1.5 Organization of Report 2-3
Recent studies have demonstrated the potential of unsupervised feature learning for
sound classification. In this thesis we further explore the application of the k-means
algorithm for feature learning from audio signals, here in the domain of urban sound
classification.
k-means is a relatively simple technique that has recently been shown to be competitive
with other more complex and time-consuming approaches. We study how different parts
of the processing pipeline influence performance, taking into account the specificities of
the urban sonic environment.
We evaluate our approach on a dataset compiled from urban sound sources. The acoustic
features of the respective sounds in the dataset are extracted using MFCC. The results are
complemented with error analysis and some proposals for future research.
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
1.1 Introduction
Speaker recognition is the art of recognizing a speaker from a given database using speech
as the only input. In this thesis we discuss a novel approach to detecting speakers.
Speech processing has emerged as one of the most important application areas of digital
signal processing. Fields of research in speech processing include speech recognition,
speaker recognition, speech synthesis, speech coding, etc. The objective of automatic
speaker recognition is to extract, characterize and recognize information about the
speaker's identity.
Speaker recognition can be divided in two ways. One way is to divide it into speaker
verification and speaker identification. In speaker verification, the test is based on a
claimed identity and a decision to accept or reject is made. In speaker identification,
there is no claim of identity: the system chooses the speaker from the database, or, in an
open-set system, the identity can be unknown.
Feature extraction is the first step in speaker recognition. Many algorithms have been
suggested and developed by researchers for feature extraction. In this work, the Mel
Frequency Cepstral Coefficient (MFCC) feature has been used for designing a text-
dependent speaker identification system. Some modifications to the existing MFCC
feature extraction technique are also suggested to improve speaker recognition efficiency.
o Sounds in an urban environment are usually composed of multiple sound sources,
which makes it challenging to classify them into categories.
o Urban sounds are unstructured, unlike speech and music signals.
1.3 Objective
The main objectives of this project are:
1. To build a diverse dataset of different urban sounds for testing.
2. To extract features from the urban sounds dataset using MFCC in order to classify them.
3. To demonstrate clustering with the K-means technique.
System model: Speech Signal → Normalization to the Mel frequency scale → MFCC Feature Extraction
Chapter 3: Introduction to urban sounds, physiology of the human ear, the speech signal,
the urban sound taxonomy and dataset, methods of speaker change detection, feature
extraction methods, Environmental Sound Classification (ESC).
Chapter 4: K-Means Algorithm, Clustering, Bayesian Information Criterion
Chapter 5: Mel Frequency Cepstral Coefficient (MFCC)
Chapter 6: Proposed Approach
Chapter 7: Simulated results and discussion
Chapter 8: Conclusion
CHAPTER 2
LITERATURE SURVEY
2.1 Introduction
In this chapter, a literature survey of two papers on classification and feature extraction of
urban sounds is presented. The chapter discusses their techniques and their results.
In paper (1), Unsupervised feature learning for urban sound classification, the application
of the spherical k-means algorithm for feature learning from audio signals is explored in
the domain of urban sound classification. The approach to feature learning for audio
signals is to convert the signal into a time-frequency representation, a common choice being
the mel-spectrogram. Log-scaled mel-spectrograms are then extracted with 40 components
(bands) covering the audible frequency range (0-22050 Hz), using a window size of 23 ms
(1024 samples at 44.1 kHz) and a hop size of the same duration. Experimentation
with a larger number of bands (128) was also performed, but this did not improve
performance, so the lower (and faster to process) resolution of 40 bands was used. The
mel-spectrograms were extracted with the Essentia audio analysis library via its Python
bindings. Shingling, i.e. grouping several consecutive frames (by concatenating them into
a single larger vector prior to PCA whitening), allows features to be learned that take
temporal dynamics into account. This option is particularly interesting for urban noise-like sounds such
as idling engines or jackhammers, where the temporal dynamics could potentially improve
our ability to distinguish sounds whose instantaneous features (i.e. a single frame) can be
very similar. The spherical k-means algorithm has been shown to be competitive with more
advanced (and much slower) techniques such as sparse coding, and has been used
successfully to learn features from audio for both music and birdsong. The learned codebook
is used to encode the samples presented to the classifier (both for training and testing). A
possible encoding scheme is vector quantization, i.e. assigning each sample to its closest
centroid (or n closest for some choice of n) in the codebook, resulting in a binary feature
vector whose only non-zero elements are the n selected neighbors. While this approach has
been shown to work for music, the experimentation concluded that a linear encoding scheme,
where each sample is represented by its multiplication with the codebook matrix, provides
better results.
After encoding, every audio recording is represented as a series of encoded frames (or
patches) over time. For classification, summarization over the time axis is performed so
that the dimensionality of all samples is the same (and not too large). Different studies
report success using different summary (or pooling) statistics, such as the maximum, mean
and standard deviation, or a combination of a larger number of statistics such as the
minimum, maximum, mean and variance. The experiments use the mean and standard
deviation, which were found to be the best-performing combination of two pooling
functions. After the feature learning stage, a single classification algorithm is used for all
experiments: a random forest classifier (500 trees). This classifier has been used
successfully in combination with learned features, and was also one of the top-performing
classifiers for a baseline system evaluated on the same dataset used in this study. The
experiments use the implementation provided in the scikit-learn Python library. For
evaluation, the UrbanSound8K dataset is used. The dataset is comprised of 8732 slices
(excerpts) of up to 4 s in duration extracted from field recordings. To facilitate comparable
research, the slices in UrbanSound8K come pre-sorted into 10 folds using a stratified
approach, which ensures that slices from the same recording are not used both for training
and testing, which could otherwise lead to artificially high results. The paper concludes
that classification accuracy can be significantly improved by feature learning if the
specificities of this domain are taken into consideration, primarily the importance of
capturing the temporal dynamics of urban sound sources.
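The mean-and-standard-deviation pooling described above can be sketched in a few lines; the array sizes below are made up purely for illustration.

```python
import numpy as np

# Hypothetical encoded recording: 200 frames, each a 1000-dimensional
# code vector (e.g. the product of a frame with a learned codebook).
rng = np.random.default_rng(0)
encoded_frames = rng.random((200, 1000))

# Pool over the time axis with mean and standard deviation, so every
# recording maps to a fixed-length feature vector regardless of duration.
pooled = np.concatenate([encoded_frames.mean(axis=0),
                         encoded_frames.std(axis=0)])
```

Whatever the number of frames, the pooled vector always has twice the code dimensionality, which is what lets a fixed-input classifier such as a random forest consume recordings of different lengths.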
In paper (2), Using Deep Convolutional Neural Networks to Classify Urban Sounds, a
convolutional neural network is used to classify environmental sounds. However, few
studies investigate the network input construction issue, so the impact of the time
resolution of the input spectrogram on classification performance is observed.
Unlike speech and audio signals, urban sounds are usually unstructured. They include
various real-life noises generated by human activities, ranging from transportation to leisure
activities. Automatic urban sound classification could identify the noise source, benefiting
urban livability through noise control, audio surveillance, soundscape assessment and
acoustic environment planning. However, sounds in an urban environment are usually
composed of multiple sound sources, which makes it challenging to classify them into
categories. For urban sound classification with a relatively simple convolutional
neural network, the features automatically learned from a large sound dataset are usually
presented in the form of distinctive temporal-spectral representations. Combined with a
simple softmax classifier, improved classification accuracy over hand-crafted features is
frequently reported. For a given sound sample, its 2D mel-spectrogram low-level
representation is adopted as the convolutional neural network input layer. Specifically, for
a sound sample, frame-based log-scaled mel-spectrograms are first extracted from
overlapping sample frames. Then all these frame-based static mel-spectrograms are
concatenated to form one 2D image. If dynamic frequency-delta (i.e. first-order frame-to-
frame difference) coefficients are also included, then the overall convolutional neural
network input has the dimension X ∈ R^(Nt×Nf×2), where Nt denotes the total number of
frames in the sound and Nf denotes the number of frequency bands. To take into account
the temporal dynamics of a sound sample, different frame lengths can be set. The shorter
the frame length, the finer the time resolution. The approach is evaluated on the
UrbanSound8K dataset,
which includes 8732 typical urban noise samples. These samples span 10 environmental
sound classes (see Table I), labeled: air conditioner (AI), car horn (CA), children playing
(CH), dog bark (DO), drilling (DR), engine idling (EN), gun shot (GU), jackhammer (JA),
siren (SI) and street music (ST). With the strengths of its large size, wide range of sound
classes and clear annotation, the dataset has been widely used. The network is implemented
in Python with the Keras package. During the training process, the convolutional neural
network model optimizes the cross-entropy loss. Each mini-batch consists of 32 randomly
selected inputs from the training dataset. Each model is trained for 50 epochs.
The input construction is also implemented in Python, with the aid of the librosa library for
the mel-spectrogram extraction. The results clearly verify that the time resolution of the
network input has significant effects on classification performance. Further analysis of the
confusion matrices of other inputs (not provided here due to space limitations) even shows
that the optimal time resolution differs between sound classes.
The experimental results confirm that the time resolution of the input contributes
significantly to classification performance, and the best performance is usually obtained
for inputs with a moderate time resolution.
2.3 Summary
The reviewed literature provides background on urban sounds, MFCC feature extraction,
K-means clustering and their design methods.
CHAPTER 3
URBAN SOUNDS
3.1 Introduction
According to the source-filter model of speech production, the speech signal can be
considered to be the output of a linear system. Depending on the type of input excitation
(source), two classes of speech sounds are produced, voiced and unvoiced. If the input
excitation is noise, then unvoiced sounds like /s/, /t/, etc. are produced, and if the input
excitation is periodic then voiced sounds like /a/, /i/, etc., are produced. In the unvoiced case,
noise is generated either by forcing air through a narrow constriction (e.g., production of /f/)
or by building air pressure behind an obstruction and then suddenly releasing that pressure
(e.g., production of /t/). In contrast, the excitation used to produce voiced sounds is
periodic and is generated by the vibrating vocal cords. The frequency of the voiced
excitation is commonly referred to as the fundamental frequency (F0).
The vocal tract shape, defined in terms of tongue, velum, lip and jaw position, acts like a
"filter" that filters the excitation to produce the speech signal. The frequency response of the
filter has different spectral characteristics depending on the shape of the vocal tract. The
broad spectral peaks in the spectrum are the resonances of the vocal tract and are commonly
referred to as formants.
The figure above shows, for example, the formants of the vowel /eh/ (as in "head"). The
frequencies of the first three formants (denoted F1, F2 and F3) contain sufficient
information for the recognition of vowels as well as other voiced sounds. Formant
movements have also been found to be extremely important for the perception of unvoiced
sounds (i.e., consonants).
In summary, the formants carry essential information about the speech signal.
Urban sound taxonomy:
Construction: 1. Drilling 2. Jackhammer
Mechanical: 1. Air Conditioner 2. Engine idling
Community: 1. Dog Bark 2. Children playing 3. Street Music
Traffic: 1. Car horn
Emergency: 1. Gunshot 2. Siren
Since real dialogue does not always conform to the simple model of speaker turns, the
possibility of overlapping speech segments must be accounted for, and the definition of
'babble noise' is unclear in this sense. Overlapping speech naturally spreads a speaker
change over the time domain; this may even blur the speaker change to insignificance, and
a smooth transition to a different speaker altogether is a real possibility. The methods
discussed above can ease this issue, assuming the notion of speaker turns remains valid.
This issue naturally lowers the precision of the model; in such cases the overlapping region
is regarded as a single speaker in babble noise. The data used in this proposed research
does not contain overlapping speech, therefore these consequences are purely theoretical.
Once the speech has been separated into speaker turn segments and saved in matrix
format, the segments can be clustered. This provides an estimate of how many speakers
are present in the speech dataset. The performance of this step may be impacted by noise
in the background environment and other such limitations, which are a definite challenge
in the proposed research. Agglomerative Hierarchical Clustering (AHC) [4] has been
compared; the general concept is to start by assuming that every speaker turn is a unique
person. In the proposed experiment we have likewise presumed that every speaker turn is
a unique person. In the proposed algorithm for speaker change detection, the most similar
segments are iteratively merged until only two segments remain for observation. The
number of speakers is established at the point where the merged segments are no longer
similar.
3.5 Pre-processing
The first step is to apply a high-pass filter to our audio signal with a cutoff around 200 Hz.
The purpose of this is to remove any DC offset and low-frequency noise. There is usually
very little speech information below 200 Hz, so it does not hurt to suppress the low
frequencies.
Prior to feature extraction our audio signal is split into overlapping frames 20-40ms in length.
A frame is often extracted every 10ms. For example, if our audio signal is sampled at 16 kHz
and we take our frame length to be 25ms, then each frame will have 0.025*16000 = 400
samples per frame. With 1 frame every 10ms, we'll have the first frame starting at sample 0,
the second frame will start at sample 160 etc.
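The frame arithmetic above can be checked with a short sketch; the `frame_signal` helper is hypothetical, not part of any cited library.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms frames, 10 ms hop
    at a 16 kHz sample rate)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Build an index matrix: row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

# One second of (silent) audio sampled at 16 kHz.
audio = np.zeros(16000)
frames = frame_signal(audio)
# First frame starts at sample 0, the second at sample 160, and so on.
```

With 16000 samples, a 400-sample frame and a 160-sample hop, this yields 98 frames of 400 samples each, matching the 0.025 × 16000 = 400 samples-per-frame calculation in the text.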
and fundamental frequency etc. The analysis in the cochlea takes place on a nonlinear
frequency scale (known as the Bark scale or the mel scale). This scale is approximately linear
up to about 1000 Hz and is approximately logarithmic thereafter. So, in the feature
extraction, it is very common to perform a frequency warping of the frequency axis after the
spectral computation.
The LPC analysis of each frame also involves a voiced/unvoiced decision. A pitch-detection
algorithm is employed to determine the correct pitch period / frequency. It is important to
re-emphasize that the pitch, gain and coefficient parameters vary with time from one
frame to another.
In reality the actual predictor coefficients are never used in recognition, since they
typically show high variance. The predictor coefficients are transformed to a more robust
set of parameters known as cepstral coefficients.
provides an exposure to a variety of sound sources - some very common (laughter, cat
meowing, dog barking), some quite distinct (glass breaking, brushing teeth) and then some
where the differences are more nuanced (helicopter and airplane noise). One of the possible
deficiencies of this dataset is the limited number of clips available per class. This is related
to the high cost of manual annotation and extraction, and the decision to maintain strict
balance between classes despite limited availability of recordings for more exotic types of
sound events. Nevertheless, it will, hopefully, be useful in its current form and is a concept
that could be expanded on if sufficient interest is expressed.
Examples of sound classes in the dataset:
Cow | Crickets | Breathing | Door, wood creaks | Car horn
Frog | Chirping birds | Coughing | Can opening | Engine
Insects (flying) | Pouring water | Brushing teeth | Clock alarm | Airplane
Clips in this dataset have been manually extracted from public field recordings gathered by
the Freesound.org project. The dataset has been prearranged into 5 folds for comparable
cross-validation, making sure that fragments from the same original source file are contained
in a single fold.
CHAPTER 4
K-MEANS and
CLUSTERING
4.1 Introduction
K-means clustering partitions n objects into k clusters in which each object belongs to the
cluster with the nearest mean. This method produces exactly k different clusters of the
greatest possible distinction. The best number of clusters k, leading to the greatest
separation (distance), is not known a priori and must be computed from the data. The
objective of K-means clustering is to minimize the total intra-cluster variance, i.e. the
squared error function.
K-means is one of the most popular clustering algorithms. K-means stores k centroids that
it uses to define clusters. A point is considered to be in a particular cluster if it is closer to
that cluster's centroid than to any other centroid. Machine learning is divided into two
parts, supervised learning and unsupervised learning; K-means clustering is a type of
unsupervised learning, which is used when you have unlabeled data (i.e., data without
defined categories or groups). The goal of this algorithm is to find groups in the data, with
the number of groups represented by the variable K.
The algorithm works iteratively to assign each data point to one of K groups based on the
features that are provided. Data points are clustered based on feature similarity. The results
of the K-means clustering algorithm are:
The centroids of the K clusters, which can be used to label new data
Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and
analyze the groups that have formed organically. The "Choosing K" section below describes
how the number of groups can be determined.
Each centroid of a cluster is a collection of feature values which define the resulting groups.
Examining the centroid feature weights can be used to qualitatively interpret what kind of
group each cluster represents.
4.2.1 Algorithm
The Κ-means clustering algorithm uses iterative refinement to produce a final result.
The algorithm inputs are the number of clusters Κ and the data set. The data set is a
collection of features for each data point. The algorithm starts with initial estimates
for the Κ centroids, which can either be randomly generated or randomly selected
from the data set. The algorithm then iterates between two steps:
Each centroid defines one of the clusters. In this step, each data point is assigned to
its nearest centroid, based on the squared Euclidean distance. More formally, if c_i is
a centroid in the collection C, then each data point x is assigned to the cluster

argmin_{c_i ∈ C} dist(c_i, x)^2

where dist(·) is the standard (L2) Euclidean distance. Let S_i be the set of data point
assignments for the ith cluster centroid.
In this step, the centroids are recomputed. This is done by taking the mean of all data
points assigned to that centroid's cluster:

c_i = (1 / |S_i|) Σ_{x ∈ S_i} x

The algorithm iterates between steps one and two until a stopping criterion is met (i.e.,
no data points change clusters, the sum of the distances is minimized, or some maximum
number of iterations is reached). The algorithm converges, though not necessarily to
the best possible outcome, meaning that assessing more than one run of the
algorithm with randomized starting centroids may give a better result.
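The two-step iteration can be sketched as a minimal Lloyd's-algorithm implementation, with random initial centroids drawn from the data; the test data and all names are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: squared Euclidean distance of every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2: recompute centroids (keep the old one if a cluster emptied).
        new_centroids = np.array(
            [X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
             for i in range(k)])
        if np.allclose(new_centroids, centroids):  # stopping criterion
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs; k-means should recover them as the two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
centroids, labels = kmeans(X, k=2)
```

A single run with one random initialization is shown here; as noted above, several runs with different seeds can avoid poor local optima.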
4.2.2 CHOOSING K
The algorithm described above finds the clusters and data set labels for a particular pre-
chosen K. To find the number of clusters in the data, the user needs to run the K-means
clustering algorithm for a range of K values and compare the results. In general, there is no
method for determining the exact value of K, but an accurate estimate can be obtained
using the following techniques.
One of the metrics that is commonly used to compare results across different values of K is
the mean distance between data points and their cluster centroid. Since increasing the number
of clusters will always reduce the distance to data points, increasing K will always decrease
this metric, to the extreme of reaching zero when K is the same as the number of data points.
Thus, this metric cannot be used as the sole target. Instead, mean distance to the centroid as
a function of K is plotted and the "elbow point," where the rate of decrease sharply shifts,
can be used to roughly determine K.
A number of other techniques exist for validating K, including cross-validation, information
criteria, the information theoretic jump method, the silhouette method, and the G-means
algorithm. In addition, monitoring the distribution of data points across groups provides
insight into how the algorithm is splitting the data for each K.
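The elbow computation can be roughly sketched as follows, using a bare-bones k-means to evaluate the mean distance to the centroid over a range of K; the data and helper names are illustrative.

```python
import numpy as np

def mean_distance_to_centroid(X, k, n_iter=50, seed=0):
    """Run a basic k-means, then return the mean distance of each point
    to its assigned centroid (the elbow-plot metric)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array(
            [X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
             for i in range(k)])
    return np.linalg.norm(X - centroids[labels], axis=1).mean()

# Three tight, well-separated clusters: the metric should drop steeply up to
# K = 3 and only creep downward after that (the "elbow").
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.05, (30, 2)) for c in (0.0, 4.0, 8.0)])
curve = {k: mean_distance_to_centroid(X, k) for k in range(1, 6)}
```

Plotting `curve` against K and looking for the point where the rate of decrease sharply shifts gives the rough estimate of K described above.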
Formally, a distance function is a function Dist with non-negative real values, defined on
the Cartesian product X × X of a set X. It is called a metric on X if, for each x, y, z ∈ X:
Dist(x, y) ≥ 0, with Dist(x, y) = 0 if and only if x = y (non-negativity and identity);
Dist(x, y) = Dist(y, x) (symmetry);
Dist(x, z) ≤ Dist(x, y) + Dist(y, z) (triangle inequality).
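These metric axioms can be spot-checked for the Euclidean distance used by k-means; the point values below are arbitrary.

```python
import numpy as np

def dist(x, y):
    """Standard (L2) Euclidean distance."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

# Spot-check the metric axioms on a few sample points.
x, y, z = np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.array([6.0, 8.0])
assert dist(x, y) >= 0                         # non-negativity
assert dist(x, x) == 0                         # identity of indiscernibles
assert dist(x, y) == dist(y, x)                # symmetry
assert dist(x, z) <= dist(x, y) + dist(y, z)   # triangle inequality
```

For the classic 3-4-5 triangle, `dist(x, y)` evaluates to exactly 5.0.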
4.3 CLUSTERING
Clustering is one of the most fundamental problems in pattern recognition. It plays a very
important role in searching for structure in data. It may serve as a pre-processing step for
other algorithms, which then operate on the identified clusters.
In general, clustering algorithms are used to group given objects, defined by a set of
numerical properties, in such a way that the objects within a group are more similar to
each other than to objects in different groups. Therefore, a specific clustering algorithm
needs to be provided with a criterion to measure the similarity of objects and a rule for
how to group the objects or points into clusters.
The k-means clustering algorithm uses the Euclidean distance to measure the similarity
between objects. Both iterative and adaptive algorithms exist for standard k-means
clustering. K-means clustering algorithms need to assume that the number of groups
(clusters) is known a priori.
The Bayesian Information Criterion is defined as

BIC = k ln(n) − 2 ln(L)

where
L = the maximized value of the likelihood function of the model M, i.e. L = p(x | θ̂, M),
where θ̂ are the parameter values that maximize the likelihood function;
x = the observed data;
n = the number of data points in x, the number of observations, or equivalently, the sample size;
k = the number of parameters estimated by the model.
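As an illustrative sketch, the BIC of a univariate Gaussian fitted by maximum likelihood (k = 2 parameters, the mean and the variance) can be computed directly from the definition above; the data here is synthetic.

```python
import math
import numpy as np

def gaussian_bic(x):
    """BIC = k*ln(n) - 2*ln(L) for a univariate Gaussian fit by maximum
    likelihood; the model has k = 2 parameters (mean and variance)."""
    n = len(x)
    mu, var = x.mean(), x.var()  # ML estimates (ddof=0 is the ML variance)
    # Maximized log-likelihood of a Gaussian at its ML parameters:
    # ln L = -(n/2) * (ln(2*pi*var) + 1)
    log_l = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return 2 * math.log(n) - 2 * log_l

rng = np.random.default_rng(0)
bic = gaussian_bic(rng.normal(0.0, 1.0, 500))
```

Fitting candidate models with different numbers of clusters and picking the one with the lowest BIC is one of the model-selection techniques mentioned for choosing K.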
CHAPTER 5
Mel Frequency Cepstral Coefficient (MFCC)
5.1 MFCC
The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of
a sound, based on a linear cosine transform of a log power spectrum on a nonlinear Mel scale
of frequency.
It is a popular feature extraction technique. Mel-frequency cepstral coefficients are the
features that collectively make up the mel-frequency cepstrum (MFC). The difference
between the cepstrum and the mel-frequency cepstrum is that in the MFC the frequency
bands are equally spaced on the Mel scale, which means that it approximates the human
auditory system's response more closely than the linearly spaced frequency bands used in
the normal cepstrum. This frequency warping allows a better representation of sound.
Sounds generated by a human are filtered by the shape of the vocal tract, among other
things. This shape determines what sound comes out. If we can determine the shape
accurately, this gives us an accurate representation of the phoneme being produced. The
shape of the vocal tract manifests itself in the envelope of the short-time power spectrum,
and the job of MFCCs is to accurately represent this envelope. Mel Frequency Cepstral
Coefficients (MFCCs) are a feature widely used in automatic speech and speaker
recognition. MFCCs were introduced by Davis and Mermelstein in the 1980s.
MFCC is based on human hearing perception, which cannot resolve frequencies above
1 kHz linearly. In other words, MFCC is based on the known variation of the human ear's
critical bandwidth with frequency. Pitch is measured on the Mel frequency scale to capture
the important characteristics of speech. MFCC takes human perceptual sensitivity with
respect to frequency into consideration, which makes it well suited to speech recognition.
1. Framing: The input speech signal is segmented into frames of 20~30 ms with an optional
overlap of 1/3~1/2 of the frame size. Usually the frame size (in terms of sample points) is
equal to a power of two in order to facilitate the use of the FFT. If this is not the case, we
need to zero-pad to the nearest power-of-two length. If the sample rate is 16000 Hz and the
frame size is 480 sample points, then the frame duration is 480/16000 = 0.03 sec = 30 ms.
Additionally, if the overlap is 160 points, then the frame rate is 16000/(480-160) = 50
frames per second. Framing is the process of segmenting the speech samples obtained from
analog-to-digital conversion (ADC) into small frames with a length in the range of 20 to
40 ms. The speech signal is divided into frames of N samples. Adjacent frames are
separated by M samples (M < N).
Typical values used are M = 100 and N = 256.
2. Windowing:
The important function of windowing is to reduce the spectral leakage that occurs when a
long signal is cut into short-time segments for frequency-domain analysis.
Different types of windowing functions:
Rectangular window
Bartlett window
Hamming window
Of these, the most widely used is the Hamming window. The Hamming window is used as
the window shape in the feature extraction processing chain because it tapers the frame
edges and integrates the closest frequency lines. Each frame has to be multiplied by a
Hamming window in order to keep the continuity of the first and the last points in the frame
(to be detailed in the next step). If the signal in a frame is denoted by s(n), n = 0…N−1,
then the signal after Hamming windowing is s(n)·w(n), where w(n) is the Hamming
window defined by:

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1
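The window formula can be sketched in a few lines and checked against NumPy's built-in version; the frame length of 400 samples corresponds to 25 ms at 16 kHz.

```python
import numpy as np

N = 400  # one 25 ms frame at a 16 kHz sample rate
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # Hamming window

# numpy ships the same window; both ends taper to 0.08 rather than 0,
# while the centre of the window stays near 1.
assert np.allclose(w, np.hamming(N))
```

Multiplying each frame elementwise by `w` before the FFT is exactly the s(n)·w(n) operation described above.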
3. FFT:
Spectral analysis shows that different accents in speech signals correspond to different
energy distribution over frequencies. Therefore we perform FFT to obtain the magnitude
frequency response of each frame. When we implement FFT on a frame, we consider that
the signal in frame is periodic, and continuous. If this is not the case though, we can perform
FFT but the discontinuity at the frame's first and last points is likely to introduce undesirable
effects in the frequency response. To handle this problem, we have two strategies.
a. Multiply each frame by a Hamming window to increase its continuity at the first and last
points.
b. Take a frame of a variable size such that it always contains an integer multiple number
of the fundamental periods of the speech signal.
In practice, the second strategy encounters difficulty because identifying the fundamental
period is not a trivial problem. Moreover, unvoiced sounds do not have a fundamental
period at all. We therefore generally adopt the first strategy: multiply the frame by a
Hamming window before performing the FFT.
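Strategy (a) can be sketched on a synthetic 1 kHz tone; the frame length and FFT size below are illustrative choices, with the frame zero-padded to the next power of two as described in the framing step.

```python
import numpy as np

fs = 16000
t = np.arange(400) / fs                    # one 25 ms frame
frame = np.sin(2 * np.pi * 1000 * t)       # synthetic 1 kHz test tone

windowed = frame * np.hamming(len(frame))  # strategy (a): Hamming before FFT
nfft = 512                                 # zero-pad to the next power of two
spectrum = np.fft.rfft(windowed, nfft)
power = (np.abs(spectrum) ** 2) / nfft     # periodogram-style power spectrum

# Bin k corresponds to k * fs / nfft Hz, so the strongest bin of the
# magnitude frequency response should sit at ~1 kHz.
peak_hz = np.argmax(power) * fs / nfft
```

Without the window, the discontinuity between the frame's first and last samples would smear energy across neighbouring bins; the Hamming window keeps the peak cleanly at the tone's frequency.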
4. Mel Filter Bank:
Triangular bandpass filters are used because the frequency range of the FFT spectrum is
very wide and the voice signal does not follow a linear scale. We multiply the magnitude
frequency response by a set of 20 triangular bandpass filters to get the log energy of each
triangular bandpass filter. The positions of these filters are equally spaced according to the
Mel frequency, which is related to the linear frequency f by the following equation:

Mel(f) = 1125 · ln(1 + f/700)

The Mel frequency is proportional to the logarithm of the linear frequency, reflecting
similar effects in the human's subjective aural perception.
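The warping equation and its inverse can be sketched directly; the inverse is needed when converting equally spaced Mel points back to linear frequencies for filter placement.

```python
import math

def hz_to_mel(f):
    """Mel(f) = 1125 * ln(1 + f/700): the warping used to place the filters."""
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse warping: f = 700 * (exp(m/1125) - 1)."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)
```

With this formula, 1000 Hz maps to roughly 1000 mel, consistent with the scale being approximately linear below 1 kHz and logarithmic above it.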
Advantages of triangular bandpass filters:
They smooth the magnitude spectrum so that the harmonics are flattened, in order to
obtain the envelope of the spectrum. This means that the pitch of a speech signal is
generally not present in the MFCCs.
They reduce the size of the features involved.
6. MFCC Features:
In this way, Mel-frequency cepstral coefficients are extracted from the speech signal. These
features are the main component of the speech recognition process. Further classification
of these features is done by various types of classifiers.
We will now go a little more slowly through the steps and explain why each of the steps is
necessary.
An audio signal is constantly changing, so to simplify things we assume that on short time
scales the audio signal doesn't change much (when we say it doesn't change, we mean
statistically i.e. statistically stationary, obviously the samples are constantly changing on
even short time scales). This is why we frame the signal into 20-40ms frames. If the frame
is much shorter we don't have enough samples to get a reliable spectral estimate, if it is longer
the signal changes too much throughout the frame.
The next step is to calculate the power spectrum of each frame. This is motivated by the
human cochlea (an organ in the ear) which vibrates at different spots depending on the
frequency of the incoming sounds. Depending on the location in the cochlea that vibrates
(which wobbles small hairs), different nerves fire informing the brain that certain frequencies
are present. Our periodogram estimate performs a similar job for us, identifying which
frequencies are present in the frame.
The periodogram spectral estimate still contains a lot of information not required for
Automatic Speech Recognition (ASR). In particular the cochlea can not discern the
difference between two closely spaced frequencies. This effect becomes more pronounced
as the frequencies increase. For this reason we take clumps of periodogram bins and sum
them up to get an idea of how much energy exists in various frequency regions. This is
performed by our Mel filterbank: the first filter is very narrow and gives an indication of
how much energy exists near 0 Hertz. As the frequencies get higher our filters get wider as
we become less concerned about variations. We are only interested in roughly how much
energy occurs at each spot. The Mel scale tells us exactly how to space our filterbanks and
how wide to make them. See below for how to calculate the spacing.
Once we have the filterbank energies, we take the logarithm of them. Human hearing also
motivates this: we don't hear loudness on a linear scale. Generally to double the perceived
volume of a sound we need to put 8 times as much energy into it. This means that large
variations in energy may not sound all that different if the sound is loud to begin with. This
compression operation makes our features match more closely what humans actually hear.
Why the logarithm and not a cube root? The logarithm allows us to use cepstral mean
subtraction, which is a channel normalization technique.
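To illustrate why the log enables cepstral mean subtraction: a multiplicative channel effect becomes an additive offset in the log domain, so subtracting the per-coefficient mean over all frames removes it. A minimal sketch with stand-in values:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-coefficient mean over all frames (channel normalization)."""
    return cepstra - np.mean(cepstra, axis=0, keepdims=True)

# A constant offset in the log/cepstral domain vanishes after CMS
cepstra = np.random.randn(100, 12)        # 100 frames x 12 coefficients
shifted = cepstra + 3.0                   # simulated channel offset
print(np.allclose(cepstral_mean_subtraction(shifted),
                  cepstral_mean_subtraction(cepstra)))  # True
```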
The final step is to compute the DCT of the log filterbank energies. There are 2 main reasons
this is performed. Because our filterbanks are all overlapping, the filterbank energies are
quite correlated with each other. The DCT decorrelates the energies which means diagonal
covariance matrices can be used to model the features in e.g. a HMM classifier. But notice
that only 12 of the 26 DCT coefficients are kept. This is because the higher DCT coefficients
represent fast changes in the filterbank energies and it turns out that these fast changes
actually degrade ASR performance, so we get a small improvement by dropping them.
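The DCT step can be sketched with SciPy's DCT routine; the random stand-in filterbank values below are illustrative only:

```python
import numpy as np
from scipy.fftpack import dct

# 26 log filterbank energies for one frame (random stand-in values)
log_fbank = np.log(np.random.rand(26) + 1e-8)

# Type-II DCT with orthonormal scaling; keep only the lowest 12 coefficients,
# discarding the fast-changing higher-order ones
mfcc = dct(log_fbank, type=2, norm='ortho')[:12]
print(mfcc.shape)  # (12,)
```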
5.2 What is the Mel scale?
The Mel scale relates perceived frequency, or pitch, of a pure tone to its actual measured
frequency. Humans are much better at discerning small changes in pitch at low frequencies
than they are at high frequencies. Incorporating this scale makes our features match more
closely what humans hear.
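The widely used form of the Mel scale, m = 2595 log10(1 + f/700), can be sketched as a pair of conversion functions; by construction 1000 Hz maps to approximately 1000 Mel:

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the Mel scale (2595 log10(1 + f/700))."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(round(hz_to_mel(1000)))  # 1000 (the scale's reference point)
print(round(mel_to_hz(hz_to_mel(440))))  # 440 (exact round trip)
```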
1. Frame the signal into 20-40 ms frames. 25ms is standard. This means the frame length for
a 16kHz signal is 0.025*16000 = 400 samples. Frame step is usually something like 10ms
(160 samples), which allows some overlap between the frames. The first 400-sample frame starts
at sample 0, the next 400 sample frame starts at sample 160 etc. until the end of the speech
file is reached. If the speech file does not divide into an even number of frames, pad it with
zeros so that it does.
The next steps are applied to every single frame; one set of 12 MFCC coefficients is extracted
for each frame. A short aside on notation: we call our time domain signal s(n). Once it is
framed we have s_i(n), where n ranges over 1-400 (if our frames are 400 samples)
and i ranges over the number of frames. When we calculate the complex DFT, we
get S_i(k), where the i denotes the frame number corresponding to the time-domain frame.
2. To take the Discrete Fourier Transform of the frame, perform the following:

S_i(k) = sum over n = 1..N of s_i(n) h(n) e^(-j 2 pi k n / N),  1 <= k <= K

where h(n) is an N-sample analysis window (e.g. a Hamming window) and K is the DFT
length. The periodogram-based power spectral estimate for frame i is then

P_i(k) = (1/N) |S_i(k)|^2

This is called the periodogram estimate of the power spectrum. We take the absolute value
of the complex Fourier transform, and square the result. We would generally perform a 512-
point FFT and keep only the first 257 coefficients.
3. Compute the Mel-spaced filterbank. This is a set of 20-40 (26 is standard) triangular filters
that we apply to the periodogram power spectral estimate from step 2. Our filterbank comes
in the form of 26 vectors of length 257 (assuming the FFT settings from step 2). Each vector
is mostly zeros, but is non-zero for a certain section of the spectrum. To calculate filterbank
energies we multiply each filterbank with the power spectrum, then add up the coefficients.
Once this is performed we are left with 26 numbers that give us an indication of how much
energy was in each filterbank.
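A sketch of constructing such a filterbank and computing the 26 energies, assuming the 512-point FFT and 16 kHz sampling rate from step 2; the floor((N+1)f/fs) bin spacing is the common convention, and the power spectrum here is a random stand-in:

```python
import numpy as np

def mel_filterbank(nfilt=26, nfft=512, samplerate=16000):
    """Triangular filters evenly spaced on the Mel scale, shape (nfilt, nfft//2 + 1)."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # nfilt + 2 points: each filter spans (left, centre, right) of three of them
    mel_points = np.linspace(0, hz2mel(samplerate / 2), nfilt + 2)
    bins = np.floor((nfft + 1) * mel2hz(mel_points) / samplerate).astype(int)
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for i in range(nfilt):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, centre):        # rising edge of triangle
            fbank[i, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):       # falling edge of triangle
            fbank[i, k] = (right - k) / max(right - centre, 1)
    return fbank

fbank = mel_filterbank()
power = np.random.rand(257)        # stand-in periodogram estimate
energies = np.dot(fbank, power)    # 26 filterbank energies
print(fbank.shape, energies.shape)  # (26, 257) (26,)
```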
CHAPTER 6
SIMULATION AND PROGRAMMING TOOLS
6.2.2 SCIKIT-LEARN
Scikit-learn provides machine learning in Python: simple and efficient tools for data mining
and data analysis, accessible to everybody and reusable in various contexts. It is built on
NumPy, SciPy, and matplotlib.
6.2.3 PYTHON_SPEECH_FEATURES
This library provides common speech features for ASR, including MFCCs and filterbank
energies.
6.2.4 MATPLOTLIB
Matplotlib is a comprehensive Python library for creating static, animated, and interactive
visualizations; it is used here to plot the extracted features and the detected change points.
6.3.1 Introduction
PyCharm offers great framework-specific support for modern web development frameworks
such as Django, Flask, Google App Engine, Pyramid, and web2py.
3. Scientific Tools
PyCharm integrates with IPython Notebook, has an interactive Python console, and
supports Anaconda as well as multiple scientific packages including matplotlib and NumPy.
4. Cross-technology Development
A huge collection of tools out of the box: an integrated debugger and test runner; Python
profiler; a built-in terminal; and integration with major VCS and built-in Database Tools.
CHAPTER 7
PROPOSED APPROACH
7.1 Speaker change detection
To start with, we need to first develop a dataset or use a predefined, open-source dataset.
Next, we frame the signal and apply normalization to each frame. Then we calculate the
power spectrum of the individual frames, which identifies the frequencies present in a given
frame. The Mel filterbank is applied to the power spectra, giving the total energy present in
each subband filter. Then we apply the log operation to the subband energies and take the
DCT of the result.
We take the first 13 MFCC coefficients, as they contain most of the information (the
python_speech_features library is used to extract the features). We then apply the K-Means
algorithm and BIC for modelling the developed file. Finally, to plot the final change point
we use the matplotlib library.
[Figure: block diagram of the MFCC feature-extraction pipeline - speech signal, normalization, Mel-scale mapping, MFCC feature extraction]
Choosing the dataset is the most crucial part of any speech processing; a dataset should contain
diverse sounds to make the program robust to any harsh environment. We chose 10 sounds
that are of diverse nature and have the most distinguishable pitch and other linguistic sound
features. The 10 chosen sounds are: air conditioner, car horn, children playing, dog bark,
drilling, engine idling, gun shot, jackhammer, siren and street music. These are exterior
sounds that occur in abundance in nature. The sounds were recorded in mp3 format, and
converting them to wav format was integral to the program. The output is in "npz" format,
which can be used as training input to other programs.
The linguistic features are unique to their respective sounds. MFCC extracts these features;
we use MFCC through the python_speech_features library, keeping only the first 13
coefficients, which play a vital role in processing these sounds and detecting the relevant
change points. MFCC is preferred over LPC (linear predictive coding), as the latter is more
complex and slower; MFCC is simpler than LPC and proves more efficient.
7.1.3 K-Means
K-Means clustering intends to partition n objects into k clusters in which each object belongs
to the cluster with the nearest mean. This method produces exactly k different clusters of
greatest possible distinction. The best number of clusters k, leading to the greatest separation
(distance), is not known a priori and must be computed from the data. The objective of
K-Means clustering is to minimize the total intra-cluster variance, i.e. the squared-error function.
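The squared-error objective above can be checked directly against scikit-learn's inertia_ attribute on toy two-cluster data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs; K-Means should recover the two groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# inertia_ is the total intra-cluster variance (squared-error objective):
# sum over clusters of squared distances to that cluster's mean
manual = sum(np.sum((X[km.labels_ == k] - km.cluster_centers_[k]) ** 2)
             for k in range(2))
print(np.isclose(km.inertia_, manual))  # True
```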
7.1.4 Clustering
Clustering is one of the most fundamental issues in pattern recognition. It plays a very important role
in searching for structures in data. It may serve as a pre-processing step for other algorithms,
which will operate on the identified clusters.
In general, clustering algorithms are used to group some given objects defined by a set of
numerical properties in such a way that the objects within a group are more similar than the
objects in different groups.
IMPORTANCE OF COEFFICIENTS
1. If a cepstral coefficient has a positive value, it represents a sonorant sound, since most
of the spectral energy in sonorant sounds is concentrated at low frequencies.
2. On the other hand, if a cepstral coefficient has a negative value, it represents a fricative
sound since most of the spectral energies in fricative sounds are concentrated at high
frequencies.
3. The lower order coefficients contain most of the information about the overall spectral
shape of the source-filter transfer function.
4. Even though higher order coefficients represent increasing levels of spectral details,
depending on the sampling rate and estimation method, 12 to 20 cepstral coefficients
are typically optimal for speech analysis. Selecting a large number of cepstral coefficients
results in more complexity in the models.
CHAPTER 8 RESULTS
[Table: change-point detection output; sample row - 0 DOG 0 12.227]
[Figure: extracted features plotted against Time (ms)]
[Figure: clustering output]
CHAPTER 9
CONCLUSION & FUTURE SCOPE
9.1 Conclusion
It has been demonstrated how a dataset of diverse sounds can be successfully detected and
distinguished from one another. This was made possible by extracting, by means of MFCC,
the linguistic features of each sound in the created dataset comprising these diverse urban
sounds.
K-Means is an algorithm which uses clustering to group the 10 sounds into k clusters; by
partitioning the dataset, it assigns each sample to the cluster whose mean is closest.
K-Means powered by MFCC gave accurate results when the urban dataset of 10 sounds was
used. In future work, an urban dataset with a wider variety of sounds, available online, will
test the accuracy and speed of the program at a higher level.
This algorithm can also be tested with classifiers and may give better results with higher
accuracy; this approach can be taken into consideration in future projects.
REFERENCES
[1] Loizou, P. C., Mimicking the Human Ear, IEEE Signal Processing Magazine, pp.
101-130, September 1998.
[2] J. Salamon and J. P. Bello, "Unsupervised feature learning for urban sound
classification," 2015 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), Brisbane, QLD, 2015, pp. 171-175.
[3] H. Zhou, Y. Song and H. Shu, "Using deep convolutional neural network to classify
urban sounds," TENCON 2017 - 2017 IEEE Region 10 Conference, Penang, 2017, pp.
3089-3092.
[4] Zhang, C., Wang, Z., Li, D., and Dong, M., A multi-mode and multichannel cochlear
implant. In Signal Processing Proceedings. IEEE Vol. 3, pp. 2237-2240, 2004.
[5] Loizou, P. C., Dorman, M., and Tu. Z., On the Number of Channels Needed to
Understand Speech, The Journal of the Acoustical Society of America, 106, 2097, 1999.
[6] Li, Y., and Chu, W., A new non-restoring square root algorithm and its VLSI
Implementations, IEEE International Conference, pp.538-544, 1996.
[7] Raitio, T., Juvela, L., Suni, A., Vainio, M., Alku, P.: Phase perception of the glottal
excitation and its relevance in statistical parametric speech synthesis. Speech Commun. 81,
104–119 (2016)
[8] Yegnanarayana, B., Saikia, D., Krishnan, T.: Significance of group delay functions
in signal reconstruction from spectral magnitude or phase. IEEE Trans. Acoust. Speech
Signal Process. 32(3), 610–623 (1984)
[9] Saratxaga, I., Sanchez, J., Wu, Z., Hernaez, I., Navas, E.: Synthetic speech detection
using phase information. Speech Commun. 81, 30–41 (2016)
[10] Wu, Z., Siong, C.E., Li, H.: Detecting converted speech and natural speech for
antispoofing attack in speaker recognition. In: INTERSPEECH, Portland, Oregon,
USA, pp. 1700–1703 (2012)
[13] Shenoy, B.A., Mulleti, S., Seelamantula, C.S.: Exact phase retrieval in
principal shift-invariant spaces. IEEE Trans. Signal Process. 64(2), 406–416 (2016)
[14] Julius O. Smith III: Interpolated Delay Lines, Ideal Bandlimited Interpolation, and
Fractional Delay Filter Design. Center for Computer Research in Music and Acoustics
(CCRMA) Department of Music, Stanford University Stanford, California 94305, May 12,
2017
[15] Syed Sibtain Khalid, Safdar Tanweer, Dr. Abdul Mobin, Dr. Afshar Alam: A
comparative Performance Analysis of LPC and MFCC for Noise Estimation in Speech
Recognition Task, International Journal of Electronics Engineering Research, ISSN 0975-
6450 Volume 9, Number 3 (2017) pp. 377-390
APPENDIX A
Details of Project and Relevance to Environment and Mapping with POs and PSOs with
Justification
PROGRAM OUTCOMES
APPENDIX B
GANTT CHART