
Advanced Digital Signal Processing

Assignment IV
Q1. Explain how audio signal processing is used for music applications.

A1. Music signal processing may appear to be the junior relation of the large
and mature field of speech signal processing, not least because many
techniques and representations originally developed for speech have been
applied to music, often with good results. However, music signals possess
specific acoustic and structural characteristics that distinguish them from
spoken language or other non-musical signals.
A few applications of audio signal processing for music are as follows:
1. Acoustic fingerprint
An acoustic fingerprint is a condensed digital summary, a fingerprint,
deterministically generated from an audio signal, that can be used to
identify an audio sample or quickly locate similar items in an audio
database. Practical uses of acoustic fingerprinting include identifying songs,
melodies, tunes, or advertisements; sound effect library management; and
video file identification. Media identification using acoustic fingerprints can
be used to monitor the use of specific musical works and performances on
radio broadcast, records, CDs, streaming media and peer-to-peer
networks. This identification has been used in copyright compliance,
licensing, and other monetisation schemes.
A robust acoustic fingerprint algorithm must take into account the
perceptual characteristics of the audio. If two files sound alike to the human
ear, their acoustic fingerprints should match, even if their binary
representations are quite different. Acoustic fingerprints are not hash
functions, which must be sensitive to any small changes in the data.
Acoustic fingerprints are more analogous to human fingerprints where
small variations that are insignificant to the features the fingerprint uses are
tolerated. One can imagine the case of a smeared human fingerprint
impression which can accurately be matched to another fingerprint sample
in a reference database; acoustic fingerprints work in a similar way.
Perceptual characteristics often exploited by audio fingerprints include
average zero crossing rate, estimated tempo, average spectrum, spectral
flatness, prominent tones across a set of frequency bands, and bandwidth.
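As an illustration, two of these features (average zero-crossing rate and spectral flatness) could be computed frame by frame in MATLAB roughly as follows; the 30 ms frame length is an assumption, and x and fs denote a hypothetical mono recording and its sample rate, not values taken from any particular fingerprinting system:

% Minimal illustration: frame-wise zero-crossing rate and spectral flatness.
% x is a hypothetical mono audio vector, fs its sample rate; the ~30 ms frame
% length is an assumed choice.
frameLen = round(0.03*fs);
nFrames  = floor(length(x)/frameLen);
zcr  = zeros(nFrames,1);
flat = zeros(nFrames,1);
for i = 1:nFrames
    frame = x((i-1)*frameLen+1 : i*frameLen);
    % zero-crossing rate: fraction of adjacent sample pairs that change sign
    zcr(i) = mean(abs(diff(sign(frame))) > 0);
    % spectral flatness: geometric mean / arithmetic mean of the power spectrum
    P = abs(fft(frame)).^2;
    P = P(1:floor(frameLen/2)) + eps;    % positive frequencies only, avoid log(0)
    flat(i) = exp(mean(log(P))) / mean(P);
end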
Most audio compression techniques will make radical changes to the binary
encoding of an audio file, without radically affecting the way it is perceived
by the human ear. A robust acoustic fingerprint will allow a recording to be
identified after it has gone through such compression, even if the audio
quality has been reduced significantly.
Generating a signature from the audio is essential for searching by sound.
One common technique is creating a time-frequency graph called a spectrogram.
Any piece of audio can be translated to a spectrogram. Each piece of audio
is split into some segments over time. In some cases adjacent segments
share a common time boundary, in other cases adjacent segments might
overlap. The result is a graph that plots three dimensions of audio:
frequency vs amplitude (intensity) vs time.
Shazam's algorithm picks out points where there are peaks in the spectrogram, which represent higher energy content. Focusing on peaks in the audio greatly reduces the impact that background noise has on audio identification. Shazam builds its fingerprint catalogue as a hash table whose keys are derived from frequencies. It does not just mark a single point in the spectrogram; rather, it marks a pair of points: the peak intensity plus a second anchor point. So the database key is not a single frequency, it is a hash of the frequencies of both points. This leads to fewer hash collisions, improving the performance of the hash table.
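A toy MATLAB sketch of this peak-pairing ("constellation") idea is given below. It illustrates the principle only and is not Shazam's actual implementation; the window length, the fan-out offset and the way the key is packed are assumptions, and x, fs again denote a recording and its sample rate:

% Toy sketch of peak-pair ("constellation") hashing; an illustration of the
% principle only, not Shazam's actual implementation. x, fs as before.
[S,F,T] = spectrogram(x, hann(1024), 512, 1024, fs);   % short-time spectra
P = abs(S);                                            % magnitude spectrogram
[~, peakBin] = max(P, [], 1);                          % crude peak: strongest bin per frame
peakFreq = F(peakBin);                                 % peak frequency per frame (Hz)
fanout = 5;                                            % assumed anchor offset, in frames
nPairs = numel(peakFreq) - fanout;
hashes = zeros(nPairs, 4);
for n = 1:nPairs
    f1 = peakFreq(n);                                  % first peak of the pair
    f2 = peakFreq(n + fanout);                         % second (anchor) peak
    dt = T(n + fanout) - T(n);                         % time offset between them
    % composite key: both frequencies plus their time offset, with the frame
    % index kept alongside so a match can be located in time
    hashes(n,:) = [round(f1), round(f2), round(1000*dt), n];
end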
2. Musical Pitch
Most musical instruments—including string-based instruments such as guitars, violins, and pianos, as well as instruments based on vibrating air columns such as flutes, clarinets, and trumpets—are explicitly constructed to allow performers to produce sounds with easily controlled, locally stable fundamental periods. Such a signal is well described as a harmonic series
of sinusoids at multiples of a fundamental frequency, and results in the
percept of a musical note (a single perceived event) at a clearly defined
pitch in the mind of the listener. With the exception of unpitched
instruments like drums, and a few inharmonic instruments such as bells,
the periodicity of individual musical notes is rarely ambiguous, and thus
equating the perceived pitch with fundamental frequency is common.
A sequence of notes—a melody—performed at pitches exactly one octave displaced from an original will be perceived as largely musically equivalent. We note that the sinusoidal harmonics of a fundamental at f0, at frequencies f0, 2f0, 3f0, 4f0, …, are a proper superset of the harmonics of a note with fundamental 2f0 (i.e. 2f0, 4f0, 6f0, …), and this is presumably the basis of the perceived similarity. Other pairs of notes with frequencies in simple ratios, such as f0 and 3f0/2, will also share many harmonics, and are also perceived as similar—although not as close as the octave. The figure below shows the waveforms and
spectrograms of middle C (with fundamental frequency 262 Hz) played on
a piano and a violin. Zoomed-in views above the waveforms show the
relatively stationary waveform with a 3.8-ms period in both cases. The
spectrograms (calculated with a 46-ms window) show the harmonic series
at integer multiples of the fundamental. Obvious differences between piano
and violin sound include the decaying energy within the piano note, and the
slight frequency modulation (“vibrato”) on the violin.
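A harmonic note of this kind is easy to synthesise. The following MATLAB sketch builds the first eight harmonics of f0 = 262 Hz with an assumed 1/k amplitude roll-off (an idealisation, not a model of the piano or violin) and displays a spectrogram with a 2048-sample window, which at 44.1 kHz is roughly the 46 ms window mentioned above:

% Minimal sketch: synthesise a harmonic series at f0 = 262 Hz (middle C) and
% view its spectrogram. The 1/k amplitude roll-off is an assumed, idealised
% choice; 2048 samples at 44.1 kHz is roughly a 46 ms analysis window.
fs = 44100; f0 = 262; dur = 1;
t  = 0:1/fs:dur-1/fs;
x  = zeros(size(t));
for k = 1:8                                % first eight harmonics
    x = x + (1/k)*sin(2*pi*k*f0*t);
end
x = x / max(abs(x));
soundsc(x, fs);                            % one pitched "note" is heard
spectrogram(x, hann(2048), 1024, 2048, fs, 'yaxis');   % harmonic stack at k*f0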

3. Music Production
Algorithmic composition is the technique of using algorithms to create music. Algorithms (or, at the very least, formal sets of rules) have been used to
compose music for centuries; the procedures used to plot voice-leading in
Western counterpoint, for example, can often be reduced to algorithmic
determinacy. The term can be used to describe music-generating techniques
that run without ongoing human intervention, for example through the
introduction of chance procedures. However through live coding and other
interactive interfaces, a fully human-centric approach to algorithmic
composition is possible.
Some algorithms or data that have no immediate musical relevance are used
by composers as creative inspiration for their music. Algorithms such as
fractals, L-systems, statistical models, and even arbitrary data (e.g. census
figures, GIS coordinates, or magnetic field measurements) have been used as
source materials.
Some models for algorithmic composition:
Compositional algorithms are usually classified by the specific programming
techniques they use. The results of the process can then be divided into 1)
music composed by computer and 2) music composed with the aid of
computer. Music may be considered composed by computer when the
algorithm is able to make choices of its own during the creation process.
Another way to sort compositional algorithms is to examine the results of their
compositional processes. Algorithms can either 1) provide notational
information (sheet music or MIDI) for other instruments or 2) provide an
independent way of sound synthesis (playing the composition by itself). There
are also algorithms creating both notational data and sound synthesis.
One way to categorise compositional algorithms is by their structure and the way they process data, as seen in this model of seven partly overlapping types:
• mathematical models
• knowledge-based systems
• grammars
• optimisation approaches
• evolutionary methods
• systems which learn
• hybrid systems
Translational models
This is an approach to music synthesis that involves "translating" information
from an existing non-musical medium into a new sound. The translation can
be either rule-based or stochastic. For example, when translating a picture into
sound, a jpeg image of a horizontal line may be interpreted in sound as a
constant pitch, while an upwards-slanted line may be an ascending scale.
Oftentimes, the software seeks to extract concepts or metaphors from the medium (such as height or sentiment) and apply the extracted information to generate songs using the ways music theory typically represents those concepts. Another example is the translation of text into music, which can approach composition by extracting sentiment (positive or negative) from the text using machine learning methods like sentiment analysis and representing that sentiment in terms of chord quality, such as minor (sad) or major (happy) chords, in the musical output generated.
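A toy MATLAB sketch of such a translational mapping is shown below: an upward-slanted "line" is translated into an ascending major scale. The height-to-degree mapping, the choice of middle C as the starting note and the 0.3 s note length are illustrative assumptions:

% Toy translational model: an upward-slanted "line" becomes an ascending
% major scale. The height-to-degree mapping and note length are assumptions.
fs = 8000;
lineHeights = linspace(0, 1, 8);                 % extracted "height" feature
scaleSteps  = [0 2 4 5 7 9 11 12];               % major-scale semitone offsets
idx = 1 + round(lineHeights*(numel(scaleSteps)-1));  % height -> scale degree
f = 262 * 2.^(scaleSteps(idx)/12);               % degrees -> frequencies from middle C
t = 0:1/fs:0.3;                                  % 0.3 s per note
y = [];
for k = 1:numel(f)
    y = [y, sin(2*pi*f(k)*t)]; %#ok<AGROW>
end
soundsc(y, fs);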
Mathematical models
Mathematical models are based on mathematical equations and random
events. The most common way to create compositions through mathematics
is stochastic processes. In stochastic models a piece of music is composed as
a result of non-deterministic methods. The compositional process is only
partially controlled by the composer by weighting the possibilities of random
events. Prominent examples of stochastic algorithms are Markov chains and
various uses of Gaussian distributions. Stochastic algorithms are often used
together with other algorithms in various decision-making processes.
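For example, a first-order Markov chain over a handful of scale degrees can be sketched in MATLAB as follows; the transition matrix, the pentatonic degrees and the 240 Hz tonic are arbitrary illustrative choices, and the resulting frequencies can be synthesised exactly as in the earlier sketches:

% Minimal stochastic sketch: a first-order Markov chain over five scale
% degrees. The transition matrix, degrees and tonic are arbitrary choices.
P = [0.1 0.4 0.3 0.1 0.1;      % row i holds the move probabilities from degree i
     0.3 0.1 0.4 0.1 0.1;
     0.1 0.3 0.2 0.3 0.1;
     0.1 0.1 0.3 0.2 0.3;
     0.2 0.1 0.2 0.4 0.1];
degrees = [0 2 4 7 9];                         % semitone offsets from the tonic
state = 1; notes = zeros(1,32);
for n = 1:32
    notes(n) = degrees(state);
    state = find(rand < cumsum(P(state,:)), 1);   % sample the next state
end
f = 240 * 2.^(notes/12);                       % note frequencies, ready to synthesise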
Music has also been composed through natural phenomena. These chaotic models create compositions from the harmonic and inharmonic phenomena of nature. For example, since the 1970s fractals have also been studied as models for algorithmic composition.
As an example of deterministic compositions through mathematical models,
the On-Line Encyclopedia of Integer Sequences provides an option to play
an integer sequence as 12-tone equal temperament music. (It is initially set
to convert each integer to a note on an 88-key musical keyboard by
computing the integer modulo 88, at a steady rhythm. Thus 123456, the
natural numbers, equals half of a chromatic scale.) As another example,
the all-interval series has been used for computer-aided composition.
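This deterministic integer-to-note mapping can be sketched in MATLAB as follows; the exact key numbering used by the OEIS player is not reproduced here, so the modulo arithmetic and the standard 88-key tuning formula f(k) = 440*2^((k-49)/12) are stated assumptions:

% Sketch of the deterministic integer-to-note mapping: integer -> key (mod 88)
% -> pitch via the standard piano tuning f(k) = 440*2^((k-49)/12). The exact
% key numbering used by the OEIS player is an assumption here.
seq  = 1:24;                              % example sequence: the natural numbers
keys = mod(seq - 1, 88) + 1;              % fold each integer onto an 88-key keyboard
f    = 440 * 2.^((keys - 49)/12);         % key 49 is A4 = 440 Hz
fs = 8000; t = 0:1/fs:0.2; y = [];
for k = 1:numel(f)
    y = [y, sin(2*pi*f(k)*t)]; %#ok<AGROW>   % a steady rhythm, one note per integer
end
soundsc(y, fs);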

Q2. Choose a Raag, note its ‘Swars’, play its composite frequencies on
MATLAB for the ‘Aaroh’ and ‘Avaroh’, plot the waveform and pitch
contour. Write down your conclusions.

A2.
The chosen Raag is Jhinjhoti and its specifications are as follows:

Table 1. Swars of Raag Jhinjhoti and their frequencies

Swar            Sa    Re    Ga    Ma      Pa    Dha   Komal Ni   Sa`
Frequency (Hz)  240   270   300   337.5   360   400   432        480
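The figures below were generated in MATLAB; a minimal sketch of the kind of script involved is given here. It is not the exact code behind Fig. 1 and Fig. 2: the 0.5 s note duration is assumed, and the pitch contour is approximated by the strongest spectrogram bin in each frame rather than by a dedicated pitch estimator.

% Minimal sketch (not the exact script behind Fig. 1 and Fig. 2): synthesise
% the Aaroh from Table 1, plot its waveform and a simple pitch contour.
fs = 8000;
aaroh  = [240 270 300 337.5 360 400 432 480];   % swar frequencies from Table 1
avaroh = fliplr(aaroh);                         % Avaroh: the same swars, descending
noteDur = 0.5;                                  % 0.5 s per swar (assumed)
t = 0:1/fs:noteDur-1/fs;
x = [];
for f0 = aaroh                                  % repeat with avaroh for Fig. 2
    x = [x, sin(2*pi*f0*t)]; %#ok<AGROW>
end
subplot(2,1,1); plot((0:numel(x)-1)/fs, x);
title('Waveform (Aaroh)'); xlabel('Time (s)'); ylabel('Amplitude');
% crude pitch contour: frequency of the strongest spectrogram bin in each frame
[S,F,T] = spectrogram(x, hann(1024), 512, 4096, fs);
[~, idx] = max(abs(S), [], 1);
subplot(2,1,2); plot(T, F(idx));
title('Pitch contour (Aaroh)'); xlabel('Time (s)'); ylabel('Frequency (Hz)');
soundsc(x, fs);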
Fig 1. Waveform and Pitch Contour (Aaroh)

Fig 2. Waveform and Pitch Contour (Avaroh)


Conclusions:
1. From the waveform plots in both Fig. 1 and Fig. 2 (Aaroh and Avaroh), the start and
end of the swars can be easily identified.
2. We can also see that the start and end of each individual swar coincide with a change in the pitch frequency, which then stays constant throughout the duration of that swar.
3. When the Raag is played in Aaroh (ascending), the pitch contour shows an increasing trend (Fig. 1), and when it is played in Avaroh (descending), the pitch contour shows a decreasing trend (Fig. 2), as expected; the two contours are mirror images of each other.

Q3. For the Raag chosen, also plot the power spectrum and verify if the
constituent frequencies can be identified. Write down your conclusions.

A3.

Fig 3. Power spectrum of Aaroh using Short Time Fourier Transform (STFT)
Fig 4. Power spectrum of Avaroh using Short Time Fourier Transform (STFT)
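A minimal sketch of how such an STFT power-spectrum plot can be produced in MATLAB, assuming the synthesised Aaroh signal x and sample rate fs from A2 are available, is:

% Minimal sketch of an STFT power-spectrum plot like Fig. 3 / Fig. 4, assuming
% the synthesised Aaroh signal x and sample rate fs from A2 are in the workspace.
spectrogram(x, hann(1024), 512, 4096, fs, 'yaxis');   % power/frequency shown in dB
title('STFT power spectrum (Aaroh)');
colorbar;                        % warm colours mark the dominant 200-500 Hz band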

Conclusions
1. The frequency components carrying the most power in the Raag audio are shown in warm colours (refer to the colour bars in Fig. 3 and Fig. 4). From this, we infer that the dominant frequencies in the audio lie between 200 Hz and 500 Hz for both Aaroh and Avaroh, which is indeed the case for Raag Jhinjhoti (refer to Table 1).
2. More importantly, the power spectrum obtained with the STFT shows a time-frequency breakup of the Raag audio, from which we can clearly make out the increasing (Aaroh) and decreasing (Avaroh) trends of the dominant frequency along the time axis, in agreement with the pitch contour plots (refer to Fig. 1 and Fig. 2).
3. The STFT power spectrum also shows that the audio contains some energy at frequencies up to 5 kHz (although not dominant). This can be explained by the fact that the Raag was recorded from a loudspeaker, so some spurious components were also captured due to the non-ideal frequency response of the speaker, leading to the weak, non-dominant high-frequency content observed in Fig. 3 and Fig. 4.

Q4. How would you use signal processing for speech compression for
mobile phones, as well as speech transmission for mobile phones?

A4. Traditionally, quantisation techniques such as µ-law and A-law companding were used to lower the number of bits per second sent over a communication network. However, owing to their inherent rigidity, these compression techniques fell out of favour against the newer LPC (Linear Predictive Coding) based methods such as ADPCM, which gained rapid acceptance due to their ability to adapt to each speaker individually and to increase the compression ratio by exploiting the high correlation between consecutive speech samples. ADPCM and its variants were still waveform
encoders. More recently, however, a new class of speech compression or speech coding methods took form which, instead of encoding the waveform of speech, aimed to code and transmit the minimal amount of information necessary to synthesise speech that is audibly perceived to be accurate. This is achieved by modelling aspects of the speech generation process, effectively modelling the vocal tract. Two components are generally found in CELP and its variants: one models the long-term speech structure (the pitch), and another models the short-term speech structure (the formants).
However, CELP, with its advanced techniques such as speech modelling and vector quantisation, had the disadvantage of being slow and computationally expensive compared to ADPCM. Consequently, one of the latest speech compression and coding techniques, used by Skype, WhatsApp, the PlayStation 4 and the like, is CELT. CELT stands for Constrained Energy Lapped Transform and, unlike CELP, offers low latency while giving equally high compression. Moreover, CELT is open source and completely royalty-free, developed by the group behind the Opus codec. It is suitable for both speech and music. It is based on the Modified Discrete Cosine Transform (MDCT) and borrows ideas from the CELP algorithm, but avoids some of its limitations by operating exclusively in the frequency domain.

The initial PCM-coded signal is handled in relatively small, overlapping blocks for the MDCT and transformed to frequency coefficients. Choosing an especially short block size enables low latency, but also leads to poor frequency resolution that has to be compensated. For a further reduction of the algorithmic delay, at the expense of a minor sacrifice in audio quality, the inherent 50% overlap between blocks is effectively cut down to half by silencing the signal during one eighth at both ends of each block. The coefficients are grouped to resemble the critical bands of the human auditory system, as shown in the diagram below comparing the CELT bands with the Bark scale.
The total energy of each group (band) is analysed, and the values are quantised for data reduction and compressed through prediction by transmitting only the difference from the predicted values (delta encoding). The (unquantised) band energy values are divided out of the raw MDCT coefficients (normalisation), and the coefficients of the resulting residual signal are coded by Pyramid Vector Quantisation (PVQ). This encoding leads to code words of fixed (predictable) length, which in turn provides robustness against bit errors and removes the need for entropy encoding. Finally, all outputs of the encoder are combined into one bitstream by a range encoder. In connection with the PVQ, CELT uses a technique known as band folding, which delivers a similar effect to Spectral Band Replication (SBR) by reusing coefficients of lower bands for higher ones, but has much less impact on the algorithmic delay and computational complexity than SBR. The decoder unpacks the individual components from the range-coded bitstream, multiplies the band energies by the band-shape coefficients and transforms them back (via the inverse MDCT) to PCM data. The individual blocks are rejoined using weighted overlap-add (WOLA). Many parameters are not explicitly coded, but are instead reconstructed by using the same functions as the encoder. A small sketch of the band-energy normalisation step is given after the list below.
The advantages of CELT that make it a strong choice for modern speech coding are as follows:
1. Low algorithmic delay: it allows latencies of typically 3 to 9 ms, and is configurable down to below 2 ms at the price of more bitrate to reach a similar audio quality.
2. CELT supports mono and stereo audio and is applicable to both speech and music.
3. It has very low computational complexity.
4. It supports both constant and variable bitrate.
5. It is completely open source.
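The band-energy normalisation step referred to above can be sketched as follows; this is a toy illustration, not the actual CELT code, and the block length and band edges are invented for the example:

% Toy sketch of the band-energy normalisation idea described above (not the
% actual CELT code). X stands in for one block of MDCT coefficients; the band
% edges are invented, coarse, Bark-like values chosen only for the example.
X = randn(1, 256);                          % stand-in for real MDCT coefficients
edges = [1 9 17 33 65 129 257];             % assumed band edges
nBands = numel(edges) - 1;
bandEnergy = zeros(1, nBands);
shape = X;
for b = 1:nBands
    idx = edges(b):edges(b+1)-1;
    bandEnergy(b) = sqrt(sum(X(idx).^2));             % per-band energy (gain)
    shape(idx) = X(idx) / (bandEnergy(b) + eps);      % unit-energy band "shape"
end
% The encoder transmits the quantised, delta-coded band energies plus the
% PVQ-coded shape; the decoder multiplies them back together before the iMDCT.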

Q5. Explain the role of signal processing in Biomedical applications


A5. One common biomedical application is the hearing aid, which is essentially an electronic sound amplifier. You have seen people on stage speak into a microphone and have their voices hugely amplified by giant loudspeakers so crowds can hear them. A hearing aid works in much the same way, except that the microphone, amplifier, and loudspeaker are built into a small, discreet plastic package worn behind the ear.
Types: analog and digital hearing aids.
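In signal-processing terms, a digital hearing aid applies a frequency-shaped gain between the microphone and the loudspeaker. A toy MATLAB sketch of that amplification stage is given below; the 2-4 kHz band and the roughly 20 dB boost are illustrative assumptions rather than an audiological prescription, and x, fs denote the microphone signal and a sample rate of at least 16 kHz:

% Toy sketch of the digital amplification stage: boost the band where hearing
% loss is assumed to be greatest. The 2-4 kHz band and ~20 dB boost are
% illustrative only; x is the microphone signal, fs its sample rate (>= 16 kHz).
[b, a] = butter(4, [2000 4000]/(fs/2), 'bandpass');   % isolate the 2-4 kHz band
boosted = x + (10^(20/20) - 1)*filter(b, a, x);       % add roughly 20 dB in-band
boosted = boosted / max(abs(boosted));                % avoid clipping at the output
soundsc(boosted, fs);                                 % the "loudspeaker" stage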

You might also like