Assignment IV
Q1. Explain how audio signal processing is used for music applications.
A1. Music signal processing may appear to be the junior relation of the large
and mature field of speech signal processing, not least because many
techniques and representations originally developed for speech have been
applied to music, often with good results. However, music signals possess
specific acoustic and structural characteristics that distinguish them from
spoken language or other non-musical signals.
A few applications of audio signal processing for music are as follows:
1. Acoustic fingerprint
An acoustic fingerprint is a condensed digital summary, a fingerprint,
deterministically generated from an audio signal, that can be used to
identify an audio sample or quickly locate similar items in an audio
database. Practical uses of acoustic fingerprinting include identifying songs,
melodies, tunes, or advertisements; sound effect library management; and
video file identification. Media identification using acoustic fingerprints can
be used to monitor the use of specific musical works and performances on
radio broadcast, records, CDs, streaming media and peer-to-peer
networks. This identification has been used in copyright compliance,
licensing, and other monetisation schemes.
A robust acoustic fingerprint algorithm must take into account the
perceptual characteristics of the audio. If two files sound alike to the human
ear, their acoustic fingerprints should match, even if their binary
representations are quite different. Acoustic fingerprints are not hash
functions, which must be sensitive to any small changes in the data.
Acoustic fingerprints are more analogous to human fingerprints where
small variations that are insignificant to the features the fingerprint uses are
tolerated. One can imagine the case of a smeared human fingerprint
impression which can accurately be matched to another fingerprint sample
in a reference database; acoustic fingerprints work in a similar way.
Perceptual characteristics often exploited by audio fingerprints include
average zero crossing rate, estimated tempo, average spectrum, spectral
flatness, prominent tones across a set of frequency bands, and bandwidth.
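Two of these perceptual features can be computed in a few lines. The following Python sketch (NumPy only; the sampling rate, tone frequency, and random seed are arbitrary choices for the illustration) computes the zero-crossing rate and the spectral flatness of a pure tone and of white noise:

```python
import numpy as np

def zero_crossing_rate(x):
    """Fraction of adjacent-sample pairs whose signs differ."""
    signs = np.signbit(x)
    return np.mean(signs[1:] != signs[:-1])

def spectral_flatness(x):
    """Geometric mean over arithmetic mean of the power spectrum.
    Near 1 for noise-like signals, near 0 for tonal ones."""
    power = np.abs(np.fft.rfft(x)) ** 2
    power = power[power > 0]                 # avoid log(0)
    geo = np.exp(np.mean(np.log(power)))
    return geo / np.mean(power)

fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)           # pure tone: very low flatness
rng = np.random.default_rng(0)
noise = rng.standard_normal(fs)              # white noise: much higher flatness

print(zero_crossing_rate(tone), spectral_flatness(tone))
print(zero_crossing_rate(noise), spectral_flatness(noise))
```

A pure tone at 440 Hz crosses zero about 880 times per second, so its zero-crossing rate is near 0.11 at 8 kHz sampling, while white noise sits near 0.5; the flatness values separate the two signals even more sharply.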
Most audio compression techniques will make radical changes to the binary
encoding of an audio file, without radically affecting the way it is perceived
by the human ear. A robust acoustic fingerprint will allow a recording to be
identified after it has gone through such compression, even if the audio
quality has been reduced significantly.
Generating a signature from the audio is essential for searching by sound.
One common technique is creating a time-frequency graph called a
spectrogram.
Any piece of audio can be translated to a spectrogram. The audio is split
into segments over time; in some cases adjacent segments share a common
time boundary, in other cases adjacent segments overlap. The result is a
graph that plots three dimensions of audio: frequency vs amplitude
(intensity) vs time.
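The segmentation just described can be sketched as a short-time Fourier transform. This Python example is a minimal illustration using NumPy; the window length and hop size are arbitrary choices, and overlapping segments arise whenever the hop is smaller than the window:

```python
import numpy as np

def spectrogram(x, fs, win_len=1024, hop=512):
    """Magnitude spectrogram: split x into overlapping, windowed
    segments and take the FFT of each (frames x frequency bins)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(win_len, d=1 / fs)
    times = (np.arange(n_frames) * hop + win_len / 2) / fs
    return times, freqs, spec

fs = 8000
t = np.arange(2 * fs) / fs
x = np.sin(2 * np.pi * 440 * t)              # a steady 440 Hz tone
times, freqs, spec = spectrogram(x, fs)
peak_bin = spec.mean(axis=0).argmax()
print(f"dominant frequency ~= {freqs[peak_bin]:.0f} Hz")
```

With a 1024-point window at 8 kHz the frequency resolution is about 7.8 Hz, so the detected peak lands within one bin of the true 440 Hz.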
Shazam's algorithm picks out points where there are peaks in the
spectrogram, which represent higher energy content. Focusing on peaks in
the audio greatly reduces the impact that background noise has on audio
identification. Shazam builds its fingerprint catalogue as a hash table,
where the key is derived from frequency. It does not mark just a single
point in the spectrogram; rather, it marks a pair of points: the peak
intensity plus a second anchor point. So the database key is not a single
frequency, it is a hash of the frequencies of both points. This leads to
fewer hash collisions, improving the performance of the hash table.
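The peak-pairing idea can be sketched in a few lines. This Python fragment illustrates the general technique rather than Shazam's actual implementation; the peak list, the fan-out value, and the hash construction are all hypothetical:

```python
import hashlib

# Hypothetical peaks: (time_frame, frequency_bin) pairs that would be
# taken from the local maxima of a spectrogram.
peaks = [(0, 56), (3, 112), (5, 56), (8, 90), (12, 112)]

def fingerprint(peaks, fan_out=3):
    """Pair each anchor peak with a few later peaks; the hash key
    combines both frequencies and the time offset between them."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            key = hashlib.sha1(f"{f1}|{f2}|{t2 - t1}".encode()).hexdigest()[:10]
            hashes.append((key, t1))   # store the anchor time with the key
    return hashes

prints = fingerprint(peaks)
print(len(prints), "hashes from", len(peaks), "peaks")
```

Because each key encodes two frequencies and a time delta, it is far more discriminative than a single frequency, which is what keeps hash collisions rare.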
2. Musical Pitch
Most musical instruments—including string-based instruments such as
guitars, violins, and pianos, as well as instruments based on vibrating air
columns such as flutes, clarinets, and trumpets—are explicitly constructed
to allow performers to produce sounds with easily controlled, locally stable
fundamental periods. Such a signal is well described as a harmonic series
of sinusoids at multiples of a fundamental frequency, and results in the
percept of a musical note (a single perceived event) at a clearly defined
pitch in the mind of the listener. With the exception of unpitched
instruments like drums, and a few inharmonic instruments such as bells,
the periodicity of individual musical notes is rarely ambiguous, and thus
equating the perceived pitch with fundamental frequency is common.
A sequence of notes—a melody—performed at pitches exactly one octave
displaced from an original will be perceived as largely musically
equivalent. We note that the sinusoidal harmonics of a fundamental at f0,
at frequencies f0, 2f0, 3f0, 4f0, …, are a proper superset of the harmonics
of a note with fundamental 2f0 (i.e. 2f0, 4f0, 6f0, …), and this is
presumably the basis of the perceived similarity. Other pairs of notes with
frequencies in simple ratios, such as f0 and 3f0/2, will also share many
harmonics, and are also perceived as similar—although not as close as the
octave. The figure below shows the waveforms and spectrograms of middle C
(with fundamental frequency 262 Hz) played on a piano and a violin.
Zoomed-in views above the waveforms show the relatively stationary
waveform with a 3.8-ms period in both cases. The spectrograms (calculated
with a 46-ms window) show the harmonic series at integer multiples of the
fundamental. Obvious differences between piano and violin sound include
the decaying energy within the piano note, and the slight frequency
modulation (“vibrato”) on the violin.
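A standard way to recover such a locally stable fundamental period is autocorrelation: a periodic signal correlates strongly with itself at a lag of one period. The Python sketch below applies this to a synthetic harmonic tone at 262 Hz; the harmonic amplitudes and the pitch search range are arbitrary choices for the illustration:

```python
import numpy as np

fs = 44100
f0 = 262.0                                   # middle C, ~3.8 ms period
t = np.arange(int(0.1 * fs)) / fs
# A crude harmonic tone: fundamental plus two weaker harmonics.
x = (np.sin(2 * np.pi * f0 * t)
     + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
     + 0.25 * np.sin(2 * np.pi * 3 * f0 * t))

def pitch_autocorr(x, fs, fmin=80.0, fmax=1000.0):
    """Estimate f0 as the lag of the autocorrelation peak
    within a plausible pitch range."""
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

est = pitch_autocorr(x, fs)
print(f"estimated pitch: {est:.1f} Hz, period {1000 / est:.2f} ms")
```

Restricting the lag search to a musically plausible range (80–1000 Hz here) avoids the trivial zero-lag maximum and spurious subharmonic peaks.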
3. Music Production
Algorithmic composition is the technique of using algorithms to create
music.Algorithms (or, at the very least, formal sets of rules) have been used to
compose music for centuries; the procedures used to plot voice-leading in
Western counterpoint, for example, can often be reduced to algorithmic
determinacy. The term can be used to describe music-generating techniques
that run without ongoing human intervention, for example through the
introduction of chance procedures. However through live coding and other
interactive interfaces, a fully human-centric approach to algorithmic
composition is possible.
Some algorithms or data that have no immediate musical relevance are used
by composers as creative inspiration for their music. Algorithms such as
fractals, L-systems, statistical models, and even arbitrary data (e.g. census
figures, GIS coordinates, or magnetic field measurements) have been used as
source materials.
Some models for algorithmic composition:
Compositional algorithms are usually classified by the specific programming
techniques they use. The results of the process can then be divided into 1)
music composed by computer and 2) music composed with the aid of a
computer. Music may be considered composed by computer when the
algorithm is able to make choices of its own during the creation process.
Another way to sort compositional algorithms is to examine the results of
their compositional processes. Algorithms can either 1) provide notational
information (sheet music or MIDI) for other instruments or 2) provide an
independent means of sound synthesis (playing the composition by itself).
There are also algorithms that create both notational data and sound
synthesis.
One way to categorise compositional algorithms is by their structure and the
way they process data, as seen in this model of seven partly overlapping
types:
• mathematical models
• knowledge-based systems
• grammars
• optimisation approaches
• evolutionary methods
• systems which learn
• hybrid systems
Translational models
This is an approach to music synthesis that involves "translating" information
from an existing non-musical medium into a new sound. The translation can
be either rule-based or stochastic. For example, when translating a picture into
sound, a jpeg image of a horizontal line may be interpreted in sound as a
constant pitch, while an upwards-slanted line may be an ascending scale.
Oftentimes, the software seeks to extract concepts or metaphors from the
medium (such as height or sentiment) and apply the extracted information to
generate songs using the ways music theory typically represents those
concepts. Another example is the translation of text into music: composition
can proceed by extracting sentiment (positive or negative) from the text
using machine-learning methods such as sentiment analysis, and then
representing that sentiment through chord quality, such as minor (sad) or
major (happy) chords, in the generated musical output.
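A toy version of this text-to-chord translation can be written in a few lines. The word lists, the scoring rule, and the just-intonation chord ratios below are illustrative assumptions, not a standard mapping:

```python
# Rule-based translation: sentiment words vote for a major (happy)
# or minor (sad) tonic chord built on an assumed C4 root.
POSITIVE = {"bright", "joy", "happy", "sweet"}
NEGATIVE = {"dark", "sad", "lonely", "cold"}

def text_to_chord(text, root_hz=261.63):     # C4 root, an arbitrary choice
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    # Just-intonation thirds: major third = 5/4, minor third = 6/5.
    third = 5 / 4 if score >= 0 else 6 / 5
    quality = "major" if score >= 0 else "minor"
    return quality, [root_hz, root_hz * third, root_hz * 3 / 2]

quality, freqs = text_to_chord("a dark and lonely night")
print(quality, [round(f, 2) for f in freqs])
```

A real system would replace the word lists with a trained sentiment model, but the mapping from sentiment score to chord quality works the same way.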
Mathematical models
Mathematical models are based on mathematical equations and random
events. The most common way to create compositions through mathematics
is through stochastic processes. In stochastic models a piece of music is
composed as a result of non-deterministic methods. The compositional process is only
partially controlled by the composer by weighting the possibilities of random
events. Prominent examples of stochastic algorithms are Markov chains and
various uses of Gaussian distributions. Stochastic algorithms are often used
together with other algorithms in various decision-making processes.
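A first-order Markov chain over scale degrees can be sketched as follows; the note names and transition weights are hypothetical, chosen only to show how the composer controls the process by weighting the possibilities of random events:

```python
import random

# Each row gives the transition weights from the current note to the
# next; the composer shapes the music by choosing these weights.
NOTES = ["Sa", "Re", "Ga", "Ma", "Pa"]
TRANSITIONS = {
    "Sa": [0.1, 0.5, 0.2, 0.1, 0.1],
    "Re": [0.2, 0.1, 0.5, 0.1, 0.1],
    "Ga": [0.1, 0.2, 0.1, 0.5, 0.1],
    "Ma": [0.1, 0.1, 0.2, 0.1, 0.5],
    "Pa": [0.5, 0.1, 0.1, 0.2, 0.1],
}

def compose(start="Sa", length=8, seed=42):
    """Walk the chain: each next note is drawn from the row of the
    current note, so the result is random but biased by the weights."""
    rng = random.Random(seed)
    melody = [start]
    for _ in range(length - 1):
        melody.append(rng.choices(NOTES, weights=TRANSITIONS[melody[-1]])[0])
    return melody

print(" ".join(compose()))
```

Fixing the seed makes the run reproducible; changing it, or the weights, yields a different melody from the same model.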
Music has also been composed from natural phenomena. These chaotic
models create compositions from the harmonic and inharmonic phenomena of
nature. For example, since the 1970s fractals have also been studied as
models for algorithmic composition.
As an example of deterministic composition through mathematical models,
the On-Line Encyclopedia of Integer Sequences provides an option to play
an integer sequence as 12-tone equal-temperament music. (It is initially set
to convert each integer to a note on an 88-key musical keyboard by
computing the integer modulo 88, at a steady rhythm. Thus the sequence
1, 2, 3, 4, 5, 6, the natural numbers, yields half of a chromatic scale.)
As another example, the all-interval series has been used for
computer-aided composition.
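The modulo-88 keyboard mapping described above can be sketched as follows. The details here (key 1 = A0 at 27.5 Hz, and the wrap-around of integers outside 1–88) are assumptions for the illustration, not a description of OEIS's actual player:

```python
# Map integers onto an 88-key piano via modulo 88. Assuming key 1 is
# A0 at 27.5 Hz, each key up multiplies the frequency by 2**(1/12).
def key_to_freq(key):
    return 27.5 * 2 ** ((key - 1) / 12)

def sequence_to_keys(seq):
    # (n - 1) % 88 + 1 keeps every integer on a playable key 1..88.
    return [(n - 1) % 88 + 1 for n in seq]

naturals = list(range(1, 9))
keys = sequence_to_keys(naturals)
print(keys)                                   # consecutive keys: a chromatic run
print(round(key_to_freq(49), 1))              # key 49 is A4 (concert pitch)
```

Consecutive integers land on consecutive keys, which is exactly why the natural numbers play as a rising chromatic line.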
Q2. Choose a Raag, note its ‘Swars’, play its composite frequencies on
MATLAB for the ‘Aaroh’ and ‘Avaroh’, plot the waveform and pitch
contour. Write down your conclusions.
A2.
The chosen Raag is Jhinjhoti, and its specifications are as follows:

Table 1. Frequencies of the Swars (Hz): 240, 270, 300, 337.5, 360, 400, 432, 480

Fig 1. Waveform and Pitch Contour (Aaroh)
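As a sketch of the synthesis (here in Python rather than the MATLAB used for the figures; the sampling rate and note duration are arbitrary choices), each of the swar frequencies listed above can be rendered as a fixed-length sine tone and concatenated, with the Avaroh simply playing the same frequencies in descending order:

```python
import numpy as np

# Swar frequencies of Raag Jhinjhoti (Hz), ascending order for the Aaroh.
AAROH = [240, 270, 300, 337.5, 360, 400, 432, 480]

def synthesize(freqs, fs=8000, note_dur=0.5):
    """Concatenate fixed-length sine tones, one per swar."""
    t = np.arange(int(note_dur * fs)) / fs
    return np.concatenate([np.sin(2 * np.pi * f * t) for f in freqs])

aaroh = synthesize(AAROH)
avaroh = synthesize(AAROH[::-1])              # same swars, descending
print(len(aaroh) / 8000, "seconds")           # 8 notes x 0.5 s each
```

The resulting arrays are what the waveform, pitch-contour, and STFT plots would be computed from.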
Q3. For the Raag chosen, also plot the power spectrum and verify if the
constituent frequencies can be identified. Write down your conclusions.
A3.
Fig 3. Power spectrum of Aaroh using Short Time Fourier Transform (STFT)
Fig 4. Power spectrum of Avaroh using Short Time Fourier Transform (STFT)
Conclusions
1. The frequency components carrying the most power in the Raag audio are shown in
warm colours (refer to the side bars in Fig. 3 and Fig. 4). From this, we infer that the
dominant frequencies in the audio lie between 200 Hz and 500 Hz (0.2–0.5 kHz) for both
Aaroh and Avaroh, which is indeed the case with Raag Jhinjhoti (refer Table 1).
2. More importantly, the power spectrum using STFT shows a time-frequency break-up of
the Raag audio, and from it we can clearly make out the increasing trend of the
dominant frequency along the time axis, which concurs with our Pitch Contour plot (refer
Fig. 1 and Fig. 2).
3. The STFT power spectrum also shows that the audio contains some frequencies up to
5 kHz (although not dominant). This can be explained by the fact that in the code we
record the Raag from the speaker, and hence some stray frequencies may also have been
recorded due to the non-ideal frequency response of the speaker, leading to the high,
albeit non-dominant, frequencies observed in Fig. 3 and Fig. 4.
Q4. How would you use signal processing for speech compression for
mobile phones, as well as speech transmission for mobile phones?