
An Audio Classification Approach Using Feature Extraction and Neural Network Classification


Krishna Kumar, Kapil Chaturvedi
Department of CSE, SIRTS, BHOPAL

Abstract: Audio signals and audio data are an important source of information. They help in understanding the behaviour of existing or past scenarios. A sound signal contains multiple features that support such analysis, and there are several approaches for extracting them, such as wavelet analysis, chroma and centroid based approaches, with MFCC being the main approach for feature extraction. In this paper, a sound classification method is proposed that performs proper feature extraction followed by classification using a fast and efficient neural network based approach. The algorithm processes and classifies the input sound dataset and achieves a high classification rate. The system is implemented in MATLAB 2018b, and the execution results show the efficiency of the algorithm over traditional classification solutions.

Keywords: Audio classification, neural network, MFCC, frequency, audio signals.

I INTRODUCTION
Audio signals are an important part of any communication medium today, where speech recognition helps in understanding and visualizing various scenarios. There are many classification approaches that address the analysis of audio, and they mainly work with the various parameters involved in speech signals. Audio information generated from different sources plays a role when working with vocal human voice or with instrumental sound analysis. Previous approaches, however, worked with only a limited set of properties of the audio signal or speech input.

This paper deals with audio classification over the various kinds of content available in the signal. Recognition using approaches such as classification, regression and other techniques, and how they identify such content, is discussed.

The objective of the data mining procedure is to discover knowledge from large databases and convert it into a human understandable form. It is the process by which we automatically assign an individual item to one of several categories or classes, based on its attributes. In our case:
(1) The items are audio signals (for example recordings, tracks, excerpts);
(2) Their attributes are the features we extract from them (MFCC, chroma, centroid);
(3) The classes (for example speakers, instruments, telephones, audio conditions) fit the problem definition.
The complexity lies in finding a suitable relationship between features and classes.

Figure 1: Audio Signals Frequency.

Machine learning and its various classification approaches work with different signal analysis policies. Signal frequency analysis operates on the audio signal and contributes to classification efficiency. Audio information also supports pitch analysis and several other automatic processing values.

AUDIO CLASSIFICATION
The rapid increase in the amount of audio data calls for an automated method that permits effective content-based classification and retrieval of audio databases [1], [2], [3], [4]. However, raw audio data is a featureless collection of bytes with only the most basic fields attached, such as name, format and sampling rate, which does not readily permit content-based classification and retrieval. While audio content may be described using keywords and text, such metadata has so far been produced manually. Recent findings from auditory physiology, psychoacoustics and speech recognition suggest that the auditory system re-encodes the acoustic spectrum in terms of spectral and temporal modulations. The perceptual role of very low frequency modulations resembles that of quiet message-bearing waves modulating higher frequency carriers, and the presence of such modulations in speech has been verified. Dynamic information given by the modulation spectrum includes fast and slower time-varying quantities, for example pitch, the phonetic and syllabic rates of speech, the tempo of music, and so forth [8]. In addition, speech intelligibility depends on the integrity of the slow spectro-temporal energy modulations.

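To make the features listed in (1)-(3) concrete before turning to the recognition components, the following is a minimal MATLAB sketch of extracting MFCC and spectral centroid features from a single clip. It is an illustration only: the file name is a placeholder, the mfcc function assumes the Audio Toolbox (R2018a or later), hamming assumes the Signal Processing Toolbox, and the spectral centroid is computed directly from the magnitude spectrum; chroma features would need a separate implementation.

% Minimal feature-extraction sketch (placeholder file name, toolbox assumptions as noted above).
[x, fs] = audioread('example_clip.wav');    % read samples and sampling rate
x = mean(x, 2);                             % mix down to mono

coeffs = mfcc(x, fs);                       % one row of MFCCs per analysis frame (Audio Toolbox)

frameLen = round(0.025 * fs);               % 25 ms frames
hop      = round(0.010 * fs);               % 10 ms hop
nFrames  = floor((length(x) - frameLen) / hop) + 1;
half     = floor(frameLen / 2);
f        = (0:half-1)' * fs / frameLen;     % frequency axis of the kept FFT bins
centroid = zeros(nFrames, 1);
for k = 1:nFrames
    seg = x((k-1)*hop + (1:frameLen)') .* hamming(frameLen);
    mag = abs(fft(seg));
    mag = mag(1:half);                      % keep the non-mirrored half of the spectrum
    centroid(k) = sum(f .* mag) / (sum(mag) + eps);
end

clipFeatures = [mean(coeffs, 1), mean(centroid)];   % one clip-level feature vector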
• Recognition accuracy: An audio recognition and analysis tool works on the basis of its working procedure. Accurate recognition from the voice sample plays an important role, so accuracy is a key parameter to understand.
• Recognition speed: The absolute recognition speed is always a factor that plays an important role when dealing with real-time analysis of audio.
• Pre-processing: Segmentation and frame conversion of the voice signal is performed by the pre-processing module.
• HMM Training: Pattern learning over the voice samples is performed by the HMM module while the sample units are collected.
• HMM Recognition: Pattern matching and analysis is performed by the HMM recognition module, which recognizes the signal based on the HMM training (a simplified training and recognition sketch is given below).
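As a rough illustration of the HMM training and recognition steps above, the sketch below uses hmmtrain and hmmdecode from MATLAB's Statistics and Machine Learning Toolbox on discrete observation sequences. It is a simplified assumption of the pipeline, not the system described here: a real system would first vector-quantize frame-level features (e.g., MFCC) into the discrete symbols used below, and the state and symbol counts are arbitrary choices for the example.

% Hedged HMM sketch: discrete observation symbols (e.g., from vector-quantized MFCC frames).
numStates  = 4;
numSymbols = 16;

% Hypothetical training sequence of symbol indices in 1..numSymbols.
trainSeq = randi(numSymbols, 1, 500);

% Random, row-normalized initial guesses for the transition and emission matrices.
TRguess = rand(numStates);
TRguess = TRguess ./ sum(TRguess, 2);
EMguess = rand(numStates, numSymbols);
EMguess = EMguess ./ sum(EMguess, 2);

% Baum-Welch re-estimation of the model parameters.
[TR, EM] = hmmtrain(trainSeq, TRguess, EMguess);

% Recognition: score a test sequence by its log-likelihood under the trained model;
% with one model per class, the class with the highest log-likelihood wins.
testSeq = randi(numSymbols, 1, 200);
[~, logLik] = hmmdecode(testSeq, TR, EM);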
ii) Artificial Neural Network Classifier (ANN): The neural network approach, with its network layers, weight distribution and ability to work on different analysis platforms, follows the ANN architecture [4].
• Speech segmentation, noise cancellation and processing of the raw data are handled by the pre-processing unit.
• Two kinds of acoustic features are extracted from the speech signal: Mel Frequency Cepstrum Coefficients (MFCC) and Linear Predictive Coding coefficients (LPCC). A small extraction and classification sketch is given below.
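The following is a minimal, hedged sketch of the ANN path described above: LPC coefficients are computed with lpc (Signal Processing Toolbox) and, together with MFCC, fed to a small pattern-recognition network (patternnet, Deep Learning Toolbox). The feature choices, network size and randomly generated labels are placeholders, not the exact configuration used in this work.

% Hedged sketch: clip-level MFCC + LPC features classified by a small ANN.
[x, fs] = audioread('example_clip.wav');    % placeholder input file
x = mean(x, 2);

coeffs = mfcc(x, fs);                       % MFCC matrix, one row per frame (Audio Toolbox)
order  = 12;
a      = lpc(x, order);                     % LPC coefficients of the whole clip

% One feature vector per clip: mean MFCCs plus LPC coefficients (excluding the leading 1).
feat = [mean(coeffs, 1), a(2:end)];

% Toy training set: 50 clips x F features, 3 classes (placeholder data only).
F = numel(feat);
X = rand(F, 50);                            % columns are observations for patternnet
T = full(ind2vec(randi(3, 1, 50)));         % one-hot target matrix

net = patternnet(10);                       % one hidden layer with 10 neurons
net = train(net, X, T);
scores = net(feat');                        % class scores for the new clip
[~, predictedClass] = max(scores);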
II LITERATURE REVIEW

The paper [11] proposed a text-dependent speaker verification system which uses different kinds of information for making a decision regarding the identity of the claimed speaker. The baseline system uses dynamic time warping (DTW) for template matching. Detection of the end point location is critical for the performance of DTW-based template matching, and a technique based on the vowel onset point (VOP) is proposed for finding the end point of an utterance. The proposed method for speaker verification uses suprasegmental and source features in addition to the spectral features. The suprasegmental features, such as pitch and duration, are extracted from the warping path information of the DTW algorithm. Features of the excitation source, extracted using neural network models, are also used for the text-dependent speaker verification system.

The paper [12] built a system which is effective at discriminating speech from music on broadcast FM radio. The computational simplicity of the approach could suit wide application, including the ability to automatically change channels when advertisements appear. The algorithm robustly distinguishes the two classes, speech and music, and runs efficiently in real time. The use of time-domain features avoids the need for an FFT processor to compute spectral features. Given that this algorithm was developed in an exploratory manner, it is expected that performance beyond that reported is achievable.

The paper [13] reported on the development of a real-time computer system capable of distinguishing speech signals from music signals over a wide range of digital audio input. The authors examined 13 features intended to measure conceptually distinct properties of speech and music signals, and combined them in several multidimensional classification frameworks. They provided extensive data on system performance and on the cross-validated training/test setup used to evaluate the system. For the datasets currently in use, the best classifier achieves 5.8% error on a frame-by-frame basis, and 1.4% error when integrating long (2.4 second) segments of audio.

The paper [14] introduced a technique for rapidly determining the characteristics of audio samples, using a supervised tree-based vector quantizer trained to maximize mutual information (MMI). Such a measure has proven effective for talker identification, and the extension from speech to general audio, such as music, is straightforward. A classifier that distinguishes speech from music and non-vocal sounds is presented, along with experimental results showing that perfect classification accuracy may be achieved on a small corpus using considerably less than two seconds of each test audio file. The techniques presented there may be extended to other applications and domains, for example audio retrieval-by-similarity, musical genre classification, and automatic segmentation of continuous audio.

The paper [15] analyzed the discrimination achieved by several different features using common training and test sets and a comparable classifier. The database gathered for these experiments combines speech from thirteen languages and music from all over the world. In each case the distributions in the feature space were modelled by a support vector machine (SVM). Experiments were carried out on four kinds of features: amplitude, cepstra, pitch and zero crossings. In each case the derivative of the feature was also used and was found to improve performance. The best performance resulted from using the cepstra and delta cepstra, which gave an equal error rate (EER) of 1.2%. This was closely followed by normalized amplitude and delta amplitude, which nevertheless used a considerably less complex model. The pitch and delta pitch gave an EER of 4%, which was better than the zero crossings, which yielded an EER of 6%.

The paper [16] presented a hierarchical system for audio classification and retrieval based on audio content analysis. The system consists of three stages. The first stage is the coarse-level audio classification and segmentation stage, where audio recordings are classified and segmented into speech, music, several types of environmental sounds, and silence, based on morphological and statistical analysis of temporal curves of short-time features of the audio signal. In the second stage, environmental sounds are further classified into finer classes such as applause, rain, bird sound, and so forth. This fine-level classification is based on time-frequency analysis of the audio signal and the use of a hidden Markov model (HMM) for classification.
III PROBLEM DEFINITION

• Non-uniform availability of voice samples and issues in processing them are a major limitation, so a proper pre-processing technique is required.
• Performing a proper grammar-level check with an automated system requires an up-to-date dictionary; analysis with only a limited library is therefore a major limitation.
• In practice, efficient classifier selection and a proper working procedure are often lacking, and both are needed in the approach.
• The problem of pattern recognition, which traditionally followed the framework of Bayes and required estimation of distributions for the data, was transformed into an optimization problem involving minimization of the empirical recognition error.
• The second problem is the choice of k: choosing a large k generally results in a linear classifier, whereas a small k results in a nonlinear one. This influences the generalization capability of the k-NN classifier. The optimal k can be found, for instance, by using the leave-one-out method on the training set (a selection sketch is given after this list). A disadvantage of this method is its large computing power requirement, since for classifying an object its distance to all the objects in the learning set has to be calculated.
• Both pitch and loudness appeared to be an issue for spectral flatness, due to the perception of crescendos and decrescendos resulting from the source separation of noise and pitch (perceived fading or increase in one or the other).
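For the k-NN point above, the following is a small MATLAB sketch (Statistics and Machine Learning Toolbox) of selecting k by leave-one-out cross-validation. The feature matrix and labels are placeholders; the sketch only illustrates the selection procedure, not the data used in this work.

% Hedged sketch: choose k for k-NN by leave-one-out cross-validation.
X = rand(100, 26);                 % placeholder feature matrix (100 clips x 26 features)
Y = randi(3, 100, 1);              % placeholder class labels (3 classes)

kCandidates = 1:2:15;
looErr = zeros(size(kCandidates));
for i = 1:numel(kCandidates)
    mdl   = fitcknn(X, Y, 'NumNeighbors', kCandidates(i));
    cvmdl = crossval(mdl, 'Leaveout', 'on');   % one fold per observation
    looErr(i) = kfoldLoss(cvmdl);              % leave-one-out error estimate
end
[~, best] = min(looErr);
bestK = kCandidates(best);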
Figure 2: Flow chart of the proposed algorithm.

Figure 2 above shows the flow of the complete approach and the various steps involved in the process.

Proposed Architecture
The components of the proposed approach are described below along with their functionality. The dataset is obtained from a web resource and stored on the local disk, and the MATLAB libraries and components needed for data processing are initialized.
• Input file selection, which is needed to match against the existing image data.
• Pre-processing of the image with grayscale conversion, followed by binarization.
• Feature measurement using Histogram of Oriented Gradients (HOG) feature extraction.
• Filtering of colour and intensity data using filter kernels, followed by block normalization.
• Applying a CNN with a selected number of layers and kernels; classification of the data is performed using the traditional CNN approach.
• Applying the proposed hybrid approach, which takes advantage of the SENet and CNN layer concepts for better classification. In this scenario, fast processing is achieved by suppressing rarely used information: the SE block learns a weight for each feature map in the layer.
• Thus the SE approach together with the CNN works as the proposed solution for fast processing with high accuracy (a simplified layer-stack sketch is given after this list).
• Computation of the confusion matrix and reporting of the resulting parameter comparison.
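As an illustration of the CNN step, the following is a minimal MATLAB (Deep Learning Toolbox) layer stack for classifying fixed-size spectrogram-like images. It is a hedged sketch under assumed input dimensions and class count; it omits the squeeze-and-excitation (SE) block of the proposed hybrid, which would require a custom layer, and it is not the exact network used in this work.

% Hedged sketch: a small CNN for 64x64 single-channel spectrogram images, 4 classes (assumed sizes).
inputSize  = [64 64 1];          % assumed image size
numClasses = 4;                  % assumed number of audio classes

layers = [
    imageInputLayer(inputSize)
    convolution2dLayer(3, 16, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(3, 32, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];

options = trainingOptions('adam', 'MaxEpochs', 10, 'MiniBatchSize', 32, ...
    'Shuffle', 'every-epoch', 'Verbose', false);

% trainImds would be an imageDatastore of labelled spectrogram images (placeholder name):
% net = trainNetwork(trainImds, layers, options);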
IV SIMULATION PLATFORM
In order to perform a proper simulation of the technique, MATLAB is used as the simulation tool together with its audio library. The neural network layer library and the audio analysis tools helped in carrying out the implementation and the result analysis.

MATLAB covers:
1. Basic flow control and the programming language
2. How to write scripts (main functions) in MATLAB
3. How to write functions in MATLAB
4. How to use the debugger
5. How to use the graphical interface
6. Examples of useful scripts and functions for audio signal processing

After learning about MATLAB, it can be used as a tool to assist with mathematics, electronics, signal and audio processing, statistics, neural networks, control and automation.

MATLAB FUNCTIONS:

imread: Reads image data from a graphics file.

Syntax:

A = imread(filename,fmt)
[X,map] = imread(filename,fmt)
[...] = imread(filename)
[...] = imread(...,idx) (TIFF only)
[...] = imread(...,ref) (HDF only)
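Since the input to this pipeline is audio, the corresponding built-in function for reading sound files is audioread; a brief usage sketch is shown below (the file name is a placeholder).

% Reading an audio file (the counterpart of imread for sound data).
[y, fs] = audioread('example_clip.wav');   % samples and sampling rate (placeholder file)
info = audioinfo('example_clip.wav');      % duration, channels, bits per sample, etc.
y = mean(y, 2);                            % collapse to mono before feature extraction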
RESULT ANALYSIS:
This section discusses the comparison parameters and their analysis alongside the traditional algorithms. The proposed CNN algorithm is compared with spectro-temporal feature, deep learning and SVM approaches to audio classification. The results show the efficiency of the proposed approach over the existing solutions.

Computation Parameters:
To study the efficiency of the proposed algorithm, the comparison is performed using accuracy and error rate. The confusion matrix is computed and further analysis of the proposed approach is carried out. The parameters and the formulae used for their computation are discussed below.

Accuracy:
Accuracy (ACC) is calculated as the number of all correct predictions divided by the total number of samples in the dataset. The best accuracy is 1.0, whereas the worst is 0.0.

Figure 3: Accuracy computation formula.

Figure 3 above shows the formula used for computing accuracy in the execution: the total number of correct predictions (TP + TN) divided by the total size of the dataset (P + N), i.e. ACC = (TP + TN) / (P + N).

Error Rate:
Error rate (ERR) is calculated as the number of all incorrect predictions divided by the total number of samples in the dataset. The best error rate is 0.0, whereas the worst is 1.0.
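A minimal MATLAB sketch of computing these two metrics from predicted and true labels is shown below; it assumes label vectors are available from the classifier and uses confusionmat from the Statistics and Machine Learning Toolbox. Since ERR counts the incorrect predictions, ERR = 1 - ACC.

% Hedged sketch: accuracy and error rate from a confusion matrix (placeholder labels).
trueLabels = randi(4, 200, 1);             % placeholder ground-truth classes
predLabels = randi(4, 200, 1);             % placeholder classifier output

C = confusionmat(trueLabels, predLabels);  % C(i,j): true class i predicted as class j
accuracy  = sum(diag(C)) / sum(C(:));      % (TP + TN) / (P + N) in the binary case
errorRate = 1 - accuracy;                  % (FP + FN) / (P + N)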

Figure 4: Error rate computation formula.

Figure 4 above shows the formula used for computing the error rate in the execution, ERR = (FP + FN) / (P + N).

RESULT COMPARISON:
The following table and graphs show the statistical and graphical analysis of the approach.

Table 1: Comparison among obtained results.

Method                     Features   Accuracy   Error Rate (EER)
Proposed CNN               Audio      89.2       77.8
Spectro-temporal feature   Audio      87.4       89.5
Deep Learning              Audio      85.2       78.9
SVM                        Audio      86.7       78.5

Table 1 above shows the comparison between the existing algorithms and the proposed CNN approach. The comparison shows the efficiency of the proposed algorithm in terms of accuracy and error rate relative to the other algorithms.

Figure 5: Graphical analysis of the comparative study between the proposed and traditional solutions (bar chart of Accuracy and Error Rate (%) for Proposed CNN, Spectro-temporal feature, Deep Learning and SVM).

In Figure 5 above, the comparison between the proposed CNN and the traditional approaches is shown. The execution results show the efficiency of the proposed system.

Figure 6: Comparison analysis between previous and proposed work (bar chart of Accuracy and Error Rate (%) for Proposed CNN and SVM).

In Figure 6 above, the comparison between the proposed CNN and the previously published SVM algorithm is shown. The execution results show the efficiency of the proposed system.

V CONCLUSION

Sound is a signal form which is always around us, and many of the latest technologies and applications work with sound. Today many aids for specially-abled people also involve sound technology. This approach works with such signals and their classification according to the requirement of using them. Signal analysis gives an important understanding of different motions, and working with different chords and their visualization provides the actual classification, which can give a proper study of the input signals. Finding a technique with which a better classification can be performed is always a challenging issue when working with dynamic operations. This research discusses sound classification using its different intermediate features and their analysis. Recent researchers have discussed algorithms such as support vector machines, feature-based solutions and other rule-based solutions for the analysis of audio signals. The presented approach, a SENet-based CNN, is a parallel-computing-oriented fast approach offering higher accuracy and a lower error rate. The approach is implemented using the MATLAB tool, and the parameters Accuracy and Error Rate are computed using the classification confusion matrix. The results obtained show a high accuracy value and a low error rate for the proposed CNN solution. Thus the approach is efficient and can be used further for better classification.

VI FUTURE WORK

The scope of the research work in the area of content-based audio classification is not limited by any means. There is scope to expand the research both vertically (i.e., increasing the number of classes) and laterally (i.e., adding more unique features to each class). This will require more feature sets and more rigorous analysis. An optimum feature set for each audio class can be created by including certain unique features which may improve the performance of the classifier.

Following are the future work possibilities for sound classification:
1. Working with mobile devices and implementing the approach with voice input from real-time devices.
2. It can be used with specially-abled people, where the application can be very useful for understanding the person's behaviour.
3. The approach can be used in robotics for automatic analysis and for performing different tasks according to the user's requirements.

REFERENCES

[1] S. M. Biagio, M. Crocco, M. Cristani, S. Martelli, and V. Murino, Heterogeneous Auto-Similarities of Characteristics (HASC): Exploiting Relational Information for Classification, in IEEE International Conference on Computer Vision (ICCV), IEEE, 2013, p. 809.

[2] J. Dennis, H. D. Tran, and H. Li, Spectrogram Image Feature for Sound Event Classification in Mismatched Conditions, IEEE Signal Processing Letters, vol. 18, pp. 130-133, Feb. 2011.

[3] J. Dennis, H. D. Tran, and E. S. Chng, Image Feature Representation of the Subband Power Distribution for Robust Sound Event Classification, IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, pp. 367-377, Feb. 2013.

[4] J. Dennis, H. D. Tran, and E. S. Chng, Overlapping Sound Event Recognition Using Local Spectrogram Features and the Generalised Hough Transform, Pattern Recognition Letters, vol. 34, pp. 1085-1093, July 2013.

[5] J. Dennis, Q. Yu, H. Tang, H. D. Tran, and H. Li, Temporal Coding of Local Spectrogram Features for Robust Sound Recognition, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013.

[6] J. Cai, D. Ee, B. Pham, P. Roe, and J. Zhang, Sensor Network for the Monitoring of Ecosystem: Bird Species Recognition, in Proceedings of the International Conference on Intelligent Sensors, Sensor Networks and Information, pp. 293-298, IEEE, 2007.

[7] A. Temko, R. Malkin, C. Zieger, D. Macho, C. Nadeu, and M. Omologo, CLEAR Evaluation of Acoustic Event Detection and Classification Systems, Multimodal Technologies for Perception of Humans, pp. 311-322, 2007.

[8] P. Li, Y. Guan, S. Wang, B. Xu, and W. Liu, Monaural Speech Separation Based on MAXVQ and CASA for Robust Speech Recognition, Computer Speech and Language, vol. 24, no. 1, pp. 30-44, 2010.

[9] D. Gerhard, Audio Signal Classification: History and Current Techniques, Technical Report TR-CS 2003-07, pp. 1-38, Department of Computer Science, University of Regina, 2003.

[10] Y. T. Peng, C. Y. Lin, M. T. Sun, and K. C. Tsai, Healthcare Audio Event Classification Using Hidden Markov Models and Hierarchical Hidden Markov Models, in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), pp. 1218-1221, IEEE, June 2009.

[11] B. Sharma and S. R. Mahadeva Prasanna, Vowel Onset Point Detection Using Sonority Information, August 20-24, 2017.

[12] E. Nordhamn, B. Sikström, and L. Wanhammar, Design of an FFT Processor.

[13] E. Sultanow, A Multidimensional Classification of 55 Enterprise Architecture Frameworks.

[14] J. M. Leiva-Murillo and A. Artés-Rodríguez, Maximization of Mutual Information for Supervised Linear Feature Extraction, IEEE Transactions on Neural Networks, vol. 18, no. 5, September 2007.

[15] H. Ling and K. Zhu, Predicting Precipitation Events Using Gaussian Mixture Model, Journal of Data Analysis and Information Processing, vol. 5, 2017.

[16] M. Pietrzykowski and W. Sałabun, Applications of Hidden Markov Model: State-of-the-Art, International Journal of Computer Technology and Applications, vol. 5, no. 4, pp. 1384-1391.

