ON
MUSIC RECOGNISER
Signature of Supervisor
Name of Supervisor: Dr. Vimal Kumar Mishra
ECE Department,
JIIT, Sec-128,
Noida-201304
Date: 09-05-2019
DECLARATION
We hereby declare that this written submission represents our own ideas in our own
words, and where others' ideas or words have been included, we have adequately cited
and referenced the original sources. We also declare that we have adhered to all principles
of academic honesty and integrity and have not misrepresented or fabricated or falsified
any idea/data/fact/source in our submission.
Place: Noida
Dated: 09-05-2019
ABSTRACT
The rapid growth of mobile devices and the internet has made a vast number of songs
freely accessible, and there is a strong need for a way to identify a random piece of
music playing nearby. In this project we have developed a music recogniser: an Android
application that records a short audio sample through the phone's built-in microphone
and identifies the song it was taken from.
The recognition algorithm is noise- and distortion-resistant, computationally efficient,
and massively scalable. The captured audio is converted to a spectrogram, which is
reduced to a sparse "constellation map" of time-frequency peaks. Pairs of peaks are
combinatorially hashed into compact 32-bit fingerprints and matched against a
pre-computed database index; a histogram of the time-offset differences between
matching hashes is then scanned for a statistically significant peak, which identifies
the track.
The system can quickly identify a short segment of music captured in the presence of
foreground voices, background noise, and voice codec compression, out of a database of
millions of tracks. The combinatorial hashing also yields unusual properties such as
transparency, in which multiple tracks mixed together may each be identified.
ACKNOWLEDGEMENTS
We would like to take this opportunity to thank everyone who enabled us to create
and execute our minor project. We would like to show our gratitude to Dr. Vimal
Kumar Mishra for guiding the project throughout numerous consultations. We would
also like to extend our deepest gratitude to all those who have directly and indirectly
guided us in this project.
We would also like to thank our classmates and team members for their insightful
comments and constructive suggestions to improve the quality of this project.
Contents

CERTIFICATE
DECLARATION
ABSTRACT
ACKNOWLEDGEMENTS
1 Introduction
1.1 Problem Statement
1.2 Motivation To Choose The Topic
1.3 Objective and Scope of The Project
2 Requirements
2.1 Android Studio - for developing the application
2.2 Server Space - for hosting the database and storing user information
2.3 Dataset
3 Implementation
3.1 Android Application
3.2 Recognition
3.2.1 Fingerprinting
3.3 Searching and Scoring
3.3.1 Significance
3.4 Performance
3.4.1 Noise
3.4.2 Speed
3.4.3 Specificity and False Positives
4 Results
5 Future Scope
References

List of Figures

3.1 Spectrogram
3.2 Constellation Map
3.3 Fast Combinatorial Hashing
3.4 Hash Data
3.5 Scatter plot of matching hash diagonals: No diagonal
3.6 Histogram of differences of time offset: Signals do not match
3.7 Scatter plot of matching hash diagonals: Diagonal Present
3.8 Histogram of differences of time offset: Signals match
3.9 Recognition rate – Additive noise
3.10 Recognition rate – Additive noise + GSM compression
4.1 Opening Page Of App
4.2 Song After Recognition
Chapter 1
Introduction
The rapid development of mobile devices and the internet has made it possible for us
to access different music resources freely. The number of songs available exceeds the
listening capacity of a single individual, and people sometimes find it difficult to choose
from millions of songs. Moreover, music service providers need an efficient way to
manage songs and to help their customers discover music through quality
recommendations. Thus there is a strong need both for a way to identify some random
music playing nearby and for a good recommendation system. Currently there are many
music streaming services, such as Pandora and Spotify, which have built and are
continuing to build high-precision commercial music recognition and recommendation
systems. These companies generate revenue by helping their customers recognize the
random music playing around them, thereby helping them discover relevant music, and
charging them for the quality of their services. From a user's perspective, it's simple:
start the app, press a button, and let your phone listen to the song. After a few seconds,
even with background noise and distortion, the app will tell you what the song is. It
works so quickly and so well that it almost seems like magic; but, as with most magical
things these days, it's mostly run by algorithms.
There is a growing need for automatic methods of music classification and similarity
assessment, and emotion-based methods are potentially among the most useful access
mechanisms for music collections. This is the main problem we wish to address with
this project.
We have tried to develop a flexible audio search engine. The algorithm is noise- and
distortion-resistant, computationally efficient, and massively scalable, capable of
quickly identifying a short segment of music captured through the microphone of a
smartphone in the presence of foreground voices and other dominant noise, and through
voice codec compression, out of a database of over a million tracks. The algorithm
uses a combinatorial hashed time-frequency constellation analysis of the audio, yielding
unusual properties such as transparency, in which multiple tracks mixed together may
each be identified.
Chapter 2
Requirements
In this chapter, we describe the different components and programming
languages that we have used.
specialized for running server software. This often implies that it is more powerful and
reliable than standard personal computers; alternatively, large computing clusters
may be composed of many relatively simple, replaceable server components.
2.3 Dataset
In this project, we used a large dataset which contains millions of songs. The data
is open; metadata, audio content analysis, etc. are available for all the songs. It is
also very large, containing around 48 million (user id, song id, play count) triplets
collected from the listening histories of over one million users, together with metadata
(280 GB) for millions of songs. Users are anonymous, so their demographic information
and the timestamps of listening events are not known.
Chapter 3
Implementation
3.2 Recognition
We then moved on to the major function of our app, namely recognition. We implemented
this by starting with recording and capturing the audio using the built-in microphone
of the phone. For this purpose we used the MediaRecorder class provided by Android to
record audio or video. After that we performed various processing steps on the captured
audio signal, such as fingerprinting and hashing.
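A minimal sketch of this capture step is shown below, assuming the caller supplies an
output file path and that the RECORD_AUDIO permission has already been granted; the
container format and encoder chosen here are illustrative defaults rather than the exact
settings used in our app:

```java
import android.media.MediaRecorder;
import java.io.IOException;

public class AudioCapture {
    private MediaRecorder recorder;

    // Start recording from the built-in microphone into the given file.
    public void start(String outputPath) throws IOException {
        recorder = new MediaRecorder();
        recorder.setAudioSource(MediaRecorder.AudioSource.MIC);
        recorder.setOutputFormat(MediaRecorder.OutputFormat.THREE_GPP);
        recorder.setAudioEncoder(MediaRecorder.AudioEncoder.AMR_NB);
        recorder.setOutputFile(outputPath);
        recorder.prepare();
        recorder.start();
    }

    // Stop recording and release the recorder. The captured clip is then
    // decoded to raw PCM samples and handed to the fingerprinting stage.
    public void stop() {
        recorder.stop();
        recorder.release();
        recorder = null;
    }
}
```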
3.2.1 Fingerprinting
Each audio file is fingerprinted, a process in which reproducible hash tokens are
extracted, and the fingerprints from the unknown sample are matched against a large set of
fingerprints derived from the music database. The candidate matches are subsequently
evaluated for correctness of match. Some guiding principles for the attributes to use as
fingerprints are that they should be temporally localized, translation-invariant, robust,
and sufficiently entropic.
The temporal locality guideline suggests that each fingerprint hash is calculated
using audio samples near a corresponding point in time, so that distant events do not
affect the hash. The translation-invariant aspect means that fingerprint hashes derived
from corresponding matching content are reproducible independent of position within
an audio file, as long as the temporal locality containing the data from which the hash is
computed is contained within the file. This makes sense, as an unknown sample could
come from any portion of the original audio track. Robustness means that hashes
generated from the original clean database track should be reproducible from a degraded
copy of the audio. Furthermore, the fingerprint tokens should have sufficiently high
entropy in order to minimize the probability of false token matches at non-corresponding
locations between the unknown sample and tracks within the database. Insufficient
entropy leads to excessive and spurious matches at non-corresponding locations, requiring
more processing power to cull the results, and too much entropy usually leads to fragility
and non-reproducibility of fingerprint tokens in the presence of noise and distortion.
Spectrogram
Inside the microphone the audio is sampled and quantized into digital form; this digital
signal is then converted into a spectrogram. Generating a signature from the audio is
essential for searching by sound, and one common technique is to create a time-frequency
graph called a spectrogram.
Any piece of audio can be translated to a spectrogram. Each piece of audio is split
into segments over time. In some cases adjacent segments share a common time
boundary; in other cases adjacent segments might overlap. The result is a graph that
plots three dimensions of audio: frequency vs amplitude (intensity) vs time. In this
three-dimensional plot:
• the horizontal (X) axis represents time;
• the vertical (Y) axis represents the frequency of the pure tone;
• the colour represents the amplitude in dB.
Another interesting fact is that the intensity of the frequencies changes through time
and varies from instrument to instrument.
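The sketch below illustrates how such a spectrogram can be computed from PCM samples.
For brevity it uses a Hann window and a naive per-frame discrete Fourier transform; a
real implementation would use an FFT, and the frame length and hop size are parameters
the caller must choose:

```java
public final class Spectrogram {
    // Compute a magnitude spectrogram. Rows index time frames,
    // columns index frequency bins (0 .. frameLen/2 - 1).
    public static double[][] compute(double[] samples, int frameLen, int hop) {
        int nFrames = Math.max(0, (samples.length - frameLen) / hop + 1);
        double[][] spec = new double[nFrames][frameLen / 2];
        // Hann window reduces spectral leakage at the frame boundaries.
        double[] window = new double[frameLen];
        for (int n = 0; n < frameLen; n++) {
            window[n] = 0.5 - 0.5 * Math.cos(2 * Math.PI * n / (frameLen - 1));
        }
        for (int f = 0; f < nFrames; f++) {
            for (int k = 0; k < frameLen / 2; k++) {
                double re = 0, im = 0;
                // Naive DFT of one windowed frame; an FFT would be O(N log N).
                for (int n = 0; n < frameLen; n++) {
                    double x = samples[f * hop + n] * window[n];
                    double phi = -2 * Math.PI * k * n / frameLen;
                    re += x * Math.cos(phi);
                    im += x * Math.sin(phi);
                }
                spec[f][k] = Math.sqrt(re * re + im * im);
            }
        }
        return spec;
    }
}
```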
Robust Constellation
In order to address the problem of robust identification in the presence of highly
significant noise and distortion, we experimented with a variety of candidate features that
could survive GSM encoding in the presence of noise. We settled on spectrogram peaks,
due to their robustness in the presence of noise and approximate linear superposability.
A time-frequency point is a candidate peak if it has a higher energy content than all its
neighbors in a region centered around the point. Candidate peaks are chosen according
to a density criterion in order to assure that the time-frequency strip for the audio file
has reasonably uniform coverage. The peaks in each time-frequency locality are also
chosen according to amplitude, with the justification that the highest-amplitude peaks
are most likely to survive the distortions listed above.
Thus, a complicated spectrogram, as illustrated in Figure 3.1, may be reduced to a
sparse set of coordinates, as illustrated in Figure 3.2. Generally, a peak in the spectrum is
still a peak with the same coordinates in a filtered spectrum (assuming that the derivative
of the filter transfer function is reasonably small; peaks in the vicinity of a sharp
transition in the transfer function are slightly frequency-shifted). We term the sparse
coordinate lists "constellation maps" since the coordinate scatter plots often resemble a
star field.
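The peak-picking rule described above (a point survives only if its energy exceeds that
of every neighbor in a surrounding region) can be sketched as follows; the neighborhood
radius is an illustrative parameter:

```java
import java.util.ArrayList;
import java.util.List;

public class ConstellationMap {
    // A peak is a (frame, bin) coordinate in the spectrogram.
    public record Peak(int time, int freq) {}

    // Keep a point only if it is strictly larger than every neighbor
    // in a (2r+1) x (2r+1) region centered on it.
    public static List<Peak> findPeaks(double[][] spec, int r) {
        List<Peak> peaks = new ArrayList<>();
        for (int t = 0; t < spec.length; t++) {
            for (int f = 0; f < spec[t].length; f++) {
                double v = spec[t][f];
                boolean isPeak = v > 0;
                for (int dt = -r; dt <= r && isPeak; dt++) {
                    for (int df = -r; df <= r && isPeak; df++) {
                        int tt = t + dt, ff = f + df;
                        if ((dt != 0 || df != 0)
                                && tt >= 0 && tt < spec.length
                                && ff >= 0 && ff < spec[tt].length
                                && spec[tt][ff] >= v) {
                            isPeak = false;
                        }
                    }
                }
                if (isPeak) peaks.add(new Peak(t, f));
            }
        }
        return peaks;
    }
}
```

A density criterion, as described above, would additionally cap the number of surviving
peaks per unit time; that pruning step is omitted here for brevity.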
The pattern of dots should be the same for matching segments of audio. If you put
the constellation map of a database song on a strip chart, and the constellation map of a
short matching audio sample of a few seconds length on a transparent piece of plastic,
then slide the latter over the former, at some point a significant number of points will
coincide when the proper time offset is located and the two constellation maps are
aligned in register.
The number of matching points will be significant in the presence of spurious peaks
injected due to noise, as peak positions are relatively independent; further, the number
of matches can also be significant even if many of the correct points have been deleted.
After this filtering and peak-picking, the resulting sparse map of time-frequency points
is the constellation map.
Finding the correct registration offset directly from constellation maps can be rather
slow, due to raw constellation points having low entropy. For example, a 1024-bin fre-
quency axis yields only at most 10 bits of frequency data per peak. We have developed
a fast way of indexing constellation maps.
Fingerprint hashes are formed from the constellation map, in which pairs of time-
frequency points are combinatorially associated. Anchor points are chosen, each anchor
point having a target zone associated with it. Each anchor point is sequentially paired
with points within its target zone, each pair yielding two frequency components plus
the time difference between the points (Figures 3.3 and 3.4). These hashes are quite
reproducible, even in the presence of noise and voice codec compression. Furthermore,
each hash can be packed into a 32-bit unsigned integer. Each hash is also associated
with the time offset from the beginning of the respective file to its anchor point, though
the absolute time is not a part of the hash itself.
To create a database index, the above operation is carried out on each track in a
database to generate a corresponding list of hashes and their associated offset times.
Track IDs may also be appended to the small data structs, yielding an aggregate 64-bit
struct, 32 bits for the hash and 32 bits for the time offset and track ID. To facilitate fast
processing, the 64-bit structs are sorted according to hash token value.
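The packing and indexing steps just described might look as follows. Only the 32-bit
hash and the 64-bit aggregate struct are fixed by the design; the exact bit allocation
inside each word below is our own illustrative choice:

```java
import java.util.Arrays;

public class FingerprintIndex {
    // Pack (f1, f2, dt) into a 32-bit hash. The widths below (10 bits per
    // frequency component, 10 bits for the time delta) are illustrative;
    // leaving the top bits clear keeps the value non-negative in Java.
    public static int hash(int f1, int f2, int dt) {
        return ((f1 & 0x3FF) << 20) | ((f2 & 0x3FF) << 10) | (dt & 0x3FF);
    }

    // Aggregate 64-bit struct: hash in the high 32 bits; the anchor's time
    // offset and the track ID share the low 32 bits (split is illustrative).
    public static long entry(int hash, int timeOffset, int trackId) {
        int payload = ((timeOffset & 0xFFFF) << 16) | (trackId & 0xFFFF);
        return ((long) hash << 32) | (payload & 0xFFFFFFFFL);
    }

    // Sorting by value orders entries by hash (the high bits), so all
    // occurrences of a given hash are contiguous and can be binary-searched
    // at query time.
    public static void buildIndex(long[] entries) {
        Arrays.sort(entries);
    }
}
```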
The number of hashes per second of audio recording being processed is approximately
equal to the density of constellation points per second times the fan-out factor into the
target zone.
3.3 Searching and Scoring
Due to the rigid constraints of the problem, the following technique solves the problem
in approximately N*log(N) time, where N is the number of points appearing on the
scatterplot. For the purposes of this discussion, we may assume that the slope of the
diagonal line is 1.0. Then corresponding times of matching features between matching
files have the relationship tk' = tk + offset, where tk' is the time coordinate of the feature
in the matching (clean) database soundfile and tk is the time coordinate of the
corresponding feature in the sample soundfile to be identified. For each (tk', tk) coordinate
in the scatterplot, we calculate δtk = tk' - tk. Then we calculate a histogram of these δtk
values and scan for a peak. This may be done by sorting the set of δtk values and quickly
scanning for a cluster of values. The scatterplots are usually very sparse, due to the
specificity of the hashes owing to the combinatorial method of generation as discussed
above. Since the number of time pairs in each bin is small, the scanning process takes
on the order of microseconds per bin, or less. The score of the match is the number
of matching points in the histogram peak. The presence of a statistically significant
cluster indicates a match. Figure 3.5 illustrates a scatterplot of database time versus
sample time for a track that does not match the sample. There are a few chance
associations, but no linear correspondence appears. Figure 3.7 shows a case where a
significant number of matching time pairs appear on a diagonal line. Figures 3.6 and 3.8
show the histograms of the δtk values corresponding to Figures 3.5 and 3.7. This bin
scanning process is repeated for each track in the database until a significant match is found.
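The core of this scoring step can be sketched as follows, assuming the matching hash
pairs for one candidate track have already been retrieved from the index; a bin width of
one time unit is used for simplicity:

```java
import java.util.HashMap;
import java.util.Map;

public class OffsetScorer {
    // Given pairs (dbTime, sampleTime) of matching hashes for one track,
    // histogram the offsets dt = dbTime - sampleTime and return the size
    // of the largest bin. A statistically significant peak means a match.
    public static int score(int[] dbTimes, int[] sampleTimes) {
        Map<Integer, Integer> histogram = new HashMap<>();
        int best = 0;
        for (int i = 0; i < dbTimes.length; i++) {
            int dt = dbTimes[i] - sampleTimes[i];
            int count = histogram.merge(dt, 1, Integer::sum);
            if (count > best) best = count;
        }
        return best;
    }
}
```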
Note that the matching and scanning phases do not make any special assumption
about the format of the hashes. In fact, the hashes only need to have sufficient entropy
to avoid too many spurious matches, as well as being reproducible. In the scanning
phase the main thing that matters is for the matching hashes to be temporally aligned.
Figure 3.7: Scatter plot of matching hash diagonals: Diagonal Present
Figure 3.8: Histogram of differences of time offset: Signals match
3.3.1 Significance
As described above, the score is simply the number of matching and time-aligned hash
tokens. The distribution of scores of incorrectly-matching tracks is of interest in deter-
mining the rate of false positives as well as the rate of correct recognitions. To summa-
rize briefly, a histogram of the scores of incorrectly-matching tracks is collected. The
number of tracks in the database is taken into account and a probability density
function of the score of the highest-scoring incorrectly-matching track is generated. Then
an acceptable false positive rate is chosen.
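One way this threshold choice could be sketched, assuming a histogram of best scores
over incorrectly-matching tracks has already been collected (the acceptable rate is a
free parameter):

```java
public class Significance {
    // scoreHistogram[s] = number of incorrectly-matching tracks whose best
    // score was s. Returns the smallest score threshold at which the
    // fraction of impostors scoring at or above it does not exceed the
    // acceptable false-positive rate.
    public static int threshold(long[] scoreHistogram, double falsePositiveRate) {
        long total = 0;
        for (long count : scoreHistogram) total += count;
        if (total == 0) return 0;
        long tail = 0;
        for (int s = scoreHistogram.length - 1; s >= 0; s--) {
            tail += scoreHistogram[s];
            if ((double) tail / total > falsePositiveRate) {
                return s + 1; // one above the last score that is too common
            }
        }
        return 0; // every score is already rare enough
    }
}
```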
3.4 Performance
3.4.1 Noise
The algorithm performs well with significant levels of noise and even non-linear distor-
tion. It can correctly identify music in the presence of voices, traffic noise, dropout, and
even other music. To give an idea of the power of this technique, from a heavily cor-
rupted 15 second sample, a statistically significant match can be determined with only
about 1-2% of the generated hash tokens actually surviving and contributing to the offset
cluster. A property of the scatterplot histogramming technique is that discontinuities
are irrelevant, allowing immunity to dropouts and masking due to interference. One
somewhat surprising result is that even with a large database, we can correctly identify
each of several tracks mixed together, including multiple versions of the same piece, a
property we call “transparency”.
Figure 3.9 shows the result of performing 250 sample recognitions of varying length
and noise levels against a test database of 10,000 tracks consisting of popular music. A
noise sample was recorded in a noisy pub to simulate "real-life" conditions. Audio
excerpts of 15, 10, and 5 seconds in length were taken from the middle of each test
track, each of which was in the test database. For each test excerpt, the relative
power of the noise was normalized to the desired signal-to-noise ratio, then linearly
added to the sample. We see that the recognition rate drops to 50% at approximately -9,
-6, and -3 dB SNR, respectively. Figure 3.10 shows the same analysis, except that the
resulting music+noise mixture was further subjected to GSM 6.10 compression, then
reconverted to PCM audio. In this case, the 50% recognition-rate thresholds shift toward
higher SNRs.
Figure 3.9: Recognition rate – Additive noise
Figure 3.10: Recognition rate – Additive noise + GSM compression
3.4.2 Speed
For a database of about 20 thousand tracks implemented on a PC, the search time is
on the order of 5-500 milliseconds, depending on parameter settings and application.
The service can find a matching track for a heavily corrupted audio sample within a few
hundred milliseconds of core search time. With “radio quality” audio, we can find a
match in less than 10 milliseconds, with a likely optimization goal reaching down to 1
millisecond per query.
3.4.3 Specificity and False Positives
The algorithm was designed specifically to target recognition of sound files that are
already present in the database. It is not expected to generalize to live recordings. That
said, we have anecdotally discovered several artists in concert who apparently either
have extremely accurate and reproducible timing (with millisecond precision), or are
more plausibly lip syncing. The algorithm is conversely very sensitive to which
particular version of a track has been sampled. Given a multitude of different performances
of the same song by an artist, the algorithm can pick the correct one even if they are
virtually indistinguishable by the human ear.
Chapter 4
Results
We tested our app under many conditions that could have affected our results.
We tested the recognition feature at multiple noise levels and volume levels; our app
worked about 8 out of 10 times (over 30 trials) and recognized almost all of the songs
in the English language, although results were below acceptable levels for the Hindi
language.
Figure 4.1: Opening Page Of App
Figure 4.2: Song After Recognition
Chapter 5
Future Scope
The future scope of this project is a recommendation system that can be implemented
in this application with the help of machine learning. A recommendation system
enhances the overall user experience by suggesting songs to the user on the basis of the
user's previous listening history. The system will recommend songs on the basis of the
release year of the songs the user has previously listened to, similar artists, and similar
genres. For the implementation, the most readily available solution is CF-NADE, a type
of collaborative filtering approach for recommendation from user data, currently used
by popular streaming services such as Spotify.
• The information can also be made available on a web site, where the user may
register and log in with her mobile phone number and password.
• At the web site, or on a smartphone, the user may view her tagged track list and
buy the CD.
• The user may also download the ringtone corresponding to the tagged track, if it
is available. The user may also send a 30-second clip of the song to a friend.
• Other services, such as purchasing an MP3 download, may become available soon.
References
[1] Avery Li-Chun Wang and Julius O. Smith, III., WIPO publication WO
02/11123A2, 7 February 2002, (Priority 31 July 2000).
[2] Jaap Haitsma, Antonius Kalker, Constant Baggen, and Job Oostveen., WIPO
publication WO 02/065782A1, 22 August 2002, (Priority 12 February, 2001).
[3] Jaap Haitsma, Antonius Kalker, “A Highly Robust Audio Fingerprinting System”,
International Symposium on Music Information Retrieval (ISMIR) 2002, pp. 107-
115.
[4] Sean Ward and Isaac Richards, WIPO publication WO 02/073520A1, 19 September
2002, (Priority 13 March 2001).
[5] Erling Wold, Thom Blum, Douglas Keislar, James Wheaton, “Content-Based
Classification, Search, and Retrieval of Audio”, in IEEE Multimedia, Vol. 3, No.
3: FALL 1996, pp. 27-36.
[6] Cheng Yang, “MACS: Music Audio Characteristic Sequence Indexing For Similar-
ity Retrieval”, in IEEE Workshop on Applications of Signal Processing to Audio
and Acoustics, 2001.
[7] http://www.musiwave.com.