Assignment IV
Music signals possess specific acoustic and structural characteristics that distinguish
them from spoken language or other nonmusical signals. Music signal processing may
appear to be the junior relation of the large and mature field of audio signal processing,
not least because many techniques and representations originally developed for audio
have been applied to music, often with extraordinary results.
The audio signal processing involved in various music-centric applications is as
follows:
I. Music Information Retrieval Systems: Currently, music classification and searching
depends entirely upon textual metadata (title of the piece, composer, players,
instruments, etc.). Developing features that can be extracted automatically from
recorded music for describing musical content would be very useful for music
classification and subsequent retrieval based on various user-specified similarity
criteria. The chief attributes of a piece of music are its timbral texture, pitch
content and rhythmic content. The pitch and rhythm relationships between
successive notes make up the melody of the music. These features can be
developed with the help of appropriate audio processing methods which we will
now describe.
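For instance, the timbral texture of a recording is often summarised by simple spectral statistics. As an illustration (a plain NumPy sketch rather than any specific MIR toolkit), the spectral centroid, a common brightness feature, can be computed as:

```python
import numpy as np

def spectral_centroid(x, sr):
    """Brightness proxy: amplitude-weighted mean frequency of the spectrum."""
    window = np.hanning(len(x))          # reduce spectral leakage
    mag = np.abs(np.fft.rfft(x * window))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return np.sum(freqs * mag) / np.sum(mag)

# sanity check on a pure 1 kHz tone: the centroid should sit near 1000 Hz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
centroid = spectral_centroid(tone, sr)
```

Features such as this, together with pitch and rhythm descriptors, form the feature vectors used for similarity-based retrieval.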
II. Music Source Separation: The magnitude spectrogram of a song can be decomposed
by matrix factorisation into spectral templates and their activations; by clustering
the factors, the vocals (lead) can be separated from the background (accompaniment).
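The factorisation step can be sketched, assuming a standard non-negative matrix factorisation (NMF) with Lee–Seung multiplicative updates; the matrix sizes and data below are toy values for illustration only:

```python
import numpy as np

def nmf(V, rank, iters=200, seed=0):
    """Factorise a nonnegative magnitude spectrogram V ~ W @ H using
    Lee-Seung multiplicative updates (Frobenius norm)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + 1e-3   # spectral templates (freq x rank)
    H = rng.random((rank, T)) + 1e-3   # activations over time (rank x frames)
    eps = 1e-12                        # guard against division by zero
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# toy "spectrogram": a sum of two nonnegative rank-1 sources
rng = np.random.default_rng(1)
V = np.outer(rng.random(32), rng.random(40)) + np.outer(rng.random(32), rng.random(40))
W, H = nmf(V, rank=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

In a real separation pipeline the columns of W would then be clustered into "vocal" and "accompaniment" groups and each group's partial reconstruction inverted back to audio.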
III. Music Production: Composers work with notes, melodies and chord progressions
to produce a song. It would not be a stretch to say that a music producer's toolbox is
built on another kind of creative tool: audio effects. They are at the core of how
producers shape sound and make it into music. Audio processing can turn a so-so mix
into a powerful finished track. Common audio processing effects are:
A. Modulation effects: These effects include Chorus, Tremolo, Flanger and Phaser.
For example, for the Chorus effect, the audio processor makes copies of the
original audio signal and applies delay and pitch modulation to those copies.
These are then mixed together with the original, adding harmonics and giving a
'fuller' sound. Tremolo, on the other hand, is a modulation effect created by
varying the amplitude (volume) of a signal. It gives a trembling quality and
makes the sound more rhythmic, percussive or stuttering.
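A minimal Python sketch of the tremolo effect described above (the rate and depth values are illustrative, not taken from any particular processor):

```python
import numpy as np

def tremolo(x, sr, rate_hz=5.0, depth=0.6):
    """Amplitude modulation: multiply the signal by a slow sine-shaped gain
    that swings between (1 - depth) and 1."""
    t = np.arange(len(x)) / sr
    gain = (1 - depth) + depth * 0.5 * (1 + np.sin(2 * np.pi * rate_hz * t))
    return x * gain

# demo: a constant-amplitude input makes the gain envelope directly visible
sr = 8000
x = np.ones(sr)
y = tremolo(x, sr)
```

The same multiply-by-an-LFO structure, applied to delay time instead of amplitude, gives flanger- and chorus-like effects.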
B. Time-based effects: These effects include Reverb, Delay and Echo. Reverb is
many echoes arriving so close together that they are heard as one single sound.
Music composers use audio processing to calculate the needed delay, level and
frequency response, algorithmically generate multiple echoes, and overlap
them into one sound. Reverb brings sustain to a sound and adds spaciousness
and depth to the signal. Delay and echo are simply copies of the original
signal with different phases or lags. Delay is also the foundation for other
effects, including Chorus and Reverb.
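The delay/echo idea can be sketched as a feedback comb filter; the delay length and feedback gain below are illustrative:

```python
import numpy as np

def feedback_delay(x, delay_samples, feedback=0.5, n_out=None):
    """Echo effect: each output sample adds a decaying copy of the output
    from `delay_samples` earlier, i.e. y[n] = x[n] + feedback * y[n - delay]."""
    n_out = n_out or len(x)
    y = np.zeros(n_out)
    for n in range(n_out):
        dry = x[n] if n < len(x) else 0.0
        wet = feedback * y[n - delay_samples] if n >= delay_samples else 0.0
        y[n] = dry + wet
    return y

# an impulse input exposes the echo train: 1, 0.5, 0.25, ... spaced by the delay
impulse = np.zeros(1)
impulse[0] = 1.0
y = feedback_delay(impulse, delay_samples=100, feedback=0.5, n_out=400)
```

Summing many such delay lines with different lengths and gains is, in essence, how algorithmic reverb is built.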
Q. Choose a Raag, note its ‘Swars’, play its composite frequencies on MATLAB for
the ‘Aaroh’ and ‘Avaroh’, plot the waveform and pitch contour. Write down your
conclusions.
Conclusions:
1. From the waveform plots in both Fig. 1 and Fig. 2 (Aaroh and Avaroh), the start and
end of the swars can be easily identified.
2. We can also see that the start and end of each individual swar coincides perfectly
with a change in the pitch frequency which then stays constant throughout the
duration of that swar.
3. When the Raag is played in Aaroh (ascending), the Pitch Contour plot shows an
increasing trend (Fig. 1), and when the Raag is played in Avaroh (descending), the
Pitch Contour plot shows a decreasing trend (Fig. 2), as expected; the two contours
are near mirror images of each other.
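The experiment can be sketched in Python (the assignment itself used MATLAB); the swar frequencies below are illustrative just-intonation values for Bhairav with Sa taken at 240 Hz, not the values from the report, and the pitch contour is estimated crudely from zero-crossing rates:

```python
import numpy as np

SR = 8000
# Illustrative just-intonation Bhairav swars (komal re and komal dha), Sa = 240 Hz
AAROH = {"Sa": 240.0, "re": 256.0, "Ga": 300.0, "ma": 320.0,
         "Pa": 360.0, "dha": 384.0, "Ni": 450.0, "Sa'": 480.0}

def synth(freqs, dur=0.4):
    """Concatenate fixed-pitch sine tones, one per swar."""
    t = np.arange(int(SR * dur)) / SR
    return np.concatenate([np.sin(2 * np.pi * f * t) for f in freqs])

def pitch_track(x, frame=800):
    """Crude pitch contour: zero-crossing rate per non-overlapping frame."""
    contour = []
    for start in range(0, len(x) - frame + 1, frame):
        seg = x[start:start + frame]
        crossings = np.sum(np.abs(np.diff(np.signbit(seg).astype(np.int8))))
        contour.append(crossings * SR / (2.0 * frame))
    return np.array(contour)

audio = synth(AAROH.values())
contour = pitch_track(audio)   # should step upward, swar by swar, for Aaroh
```

Reversing the frequency list reproduces the Avaroh case with a decreasing contour, mirroring the observation in conclusion 3.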
Q. For the Raag chosen, also plot the power spectrum and verify if the constituent
frequencies can be identified. Write down your conclusions.
Figure 3. Power Spectrum of Aaroh using the Short-Time Fourier Transform (STFT)
Figure 4. Power Spectrum of Avaroh using the Short-Time Fourier Transform (STFT)
Conclusions
1. The frequency components carrying the most power in the Raag audio are
shown in warm colours (refer to the colour bars in Fig. 3 and Fig. 4). From this, we
infer that the dominant frequencies in the audio lie between 0.2 kHz and 0.5 kHz,
i.e., between 200 Hz and 500 Hz, for both Aaroh and Avaroh, which is indeed the
case with Bhairav Raag (refer Table 1).
2. More importantly, the power spectrum using the STFT shows a time-frequency
breakup of the Raag audio, from which we can clearly make out the trend of the
dominant frequency along the time axis (increasing for Aaroh, decreasing for
Avaroh), which concurs with our Pitch Contour plots (refer Fig. 1 and Fig. 2).
3. The STFT Power Spectrum also shows that the audio contains some frequencies up
to 5 kHz (although not dominant). This can be explained by the fact that in the code
we record the Raag from the speaker, so some stray frequencies may also have been
captured due to the non-ideal frequency response of the speaker, leading to the
high, albeit non-dominant, frequencies observed in Fig. 3 and Fig. 4.
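The kind of STFT power spectrum behind Fig. 3 and Fig. 4 can be reproduced in outline with plain NumPy (frame sizes and the test tone are illustrative):

```python
import numpy as np

def stft_power(x, sr, n_fft=512, hop=256):
    """Power spectrogram: Hann-windowed frames -> |rfft|^2 per frame."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        spec = np.fft.rfft(window * x[start:start + n_fft])
        frames.append(np.abs(spec) ** 2)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    return freqs, np.array(frames).T   # shape: (n_freqs, n_frames)

# a test tone inside the 200-500 Hz band should dominate the spectrogram
sr = 8000
t = np.arange(2 * sr) / sr
x = np.sin(2 * np.pi * 300 * t)
freqs, power = stft_power(x, sr)
dominant = freqs[np.argmax(power.mean(axis=1))]
```

Plotting `10 * np.log10(power)` against `freqs` and frame times gives the warm-colour heatmap read off in the conclusions.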
Q. How would you use signal processing for speech compression for mobile
phones, as well as speech transmission for mobile phones?
Traditionally, quantisation techniques such as μ-law and A-law companding were used
to lower the number of bits per second sent over a communication network. However,
owing to their inherent rigidity, these compression techniques fell out of favour against
the newer LPC (Linear Predictive Coding) based methods such as ADPCM, which gained
rapid acceptance due to their ability to adapt to each speaker individually and to
increase the compression ratio by exploiting the high correlation between consecutive
samples of speech. ADPCM and its variants were still waveform encoders.
More recently, however, a new class of speech compression or speech coding methods
took form which, instead of encoding the waveform of speech, aims to code and
transmit the minimal amount of information necessary to synthesize speech that is
audibly perceived as accurate. This is achieved by modelling aspects of the speech
generation process, effectively modelling the vocal tract. Two components are generally
found in CELP and its variants: one models the long-term speech structure, the
pitch, and the other models the short-term speech structure, the formants. However,
CELP, with its advanced techniques such as speech modelling and vector quantisation,
had the disadvantage of being slow and computationally expensive compared to ADPCM.
Consequently, the latest in speech compression and coding, used by Skype,
WhatsApp, the PlayStation 4 and the like, is a technique called CELT. CELT stands for
Constrained Energy Lapped Transform and, unlike CELP, offers low latency while giving
equally high compression. Moreover, CELT is open source and completely royalty-free,
developed within the Opus project. It is suitable for both speech and music. It is based
on the Modified Discrete Cosine Transform (MDCT) and borrows ideas from the CELP
algorithm, but avoids some of its limitations by operating exclusively in the frequency
domain.
The initial PCM-coded signal is processed in relatively small, overlapping blocks,
each of which the MDCT transforms into frequency coefficients. Choosing an especially
short block size enables low latency on the one hand, but on the other leads to poor
frequency resolution that has to be compensated. To reduce the algorithmic delay
further, at the expense of a minor sacrifice in audio quality, the natural 50% overlap
between blocks is effectively halved by silencing the signal during one eighth of the
block at each end.
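The block processing just described can be sketched with a textbook MDCT and a sine (Princen-Bradley) window; this illustrates the plain 50%-overlap case with time-domain alias cancellation, without CELT's reduced-overlap trick:

```python
import numpy as np

def mdct(block):
    """Forward MDCT: a 2N-sample block -> N frequency coefficients."""
    N = len(block) // 2
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    return basis @ block

def imdct(coeffs):
    """Inverse MDCT: N coefficients -> 2N (aliased) time samples."""
    N = len(coeffs)
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return (2.0 / N) * (basis @ coeffs)

def analysis_synthesis(x, N):
    """50%-overlap MDCT round trip with a sine window applied at both the
    encoder and decoder; overlap-add cancels the time-domain aliasing."""
    w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))
    pad = (-len(x)) % N
    xp = np.concatenate([np.zeros(N), x, np.zeros(N + pad)])
    out = np.zeros_like(xp)
    for start in range(0, len(xp) - 2 * N + 1, N):
        coeffs = mdct(w * xp[start:start + 2 * N])      # encoder side
        out[start:start + 2 * N] += w * imdct(coeffs)   # decoder side, overlap-add
    return out[N:N + len(x)]

x = np.random.default_rng(0).standard_normal(64)
y = analysis_synthesis(x, N=8)   # should reconstruct x to float precision
```

Each block alone is not invertible (N coefficients for 2N samples), yet the windowed overlap-add recovers the signal exactly, which is what makes the lapped transform attractive for low-delay coding.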
The coefficients are grouped to resemble the critical bands of the human auditory
system as shown below in the diagram which compares the CELT bands with the Bark
scale.
The total energy of each band is analysed, and the values are quantised for data
reduction and compressed by prediction, transmitting only the difference from the
predicted values (delta encoding).
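Delta encoding of the quantised band energies can be sketched as follows (the energy values are illustrative):

```python
import numpy as np

def delta_encode(values):
    """Transmit the first value, then only successive differences."""
    values = np.asarray(values)
    return np.concatenate([values[:1], np.diff(values)])

def delta_decode(deltas):
    """Undo delta encoding by accumulating the differences."""
    return np.cumsum(deltas)

bands_db = np.array([30, 32, 31, 28, 27, 27, 25])  # illustrative band energies (dB)
enc = delta_encode(bands_db)   # small differences cost fewer bits than raw values
```

Because band energies change slowly from frame to frame and band to band, the differences cluster near zero and compress well.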
The (unquantised) band energies are divided out of the raw MDCT coefficients
(normalisation). The coefficients of the resulting residual signal are coded by Pyramid
Vector Quantisation (PVQ).
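A toy version of the PVQ idea is to find an integer vector with a fixed number of 'pulses' that points in roughly the same direction as the residual; the greedy search below is an illustration, not the search CELT actually uses:

```python
import numpy as np

def pvq_quantize(x, K):
    """Quantise the direction of x to an integer vector y with sum(|y|) == K,
    i.e. a point on the 'pyramid' of K unit pulses."""
    x = np.asarray(x, dtype=float)
    target = K * x / np.sum(np.abs(x))   # ideal real-valued pulse allocation
    y = np.rint(target).astype(int)
    while np.sum(np.abs(y)) > K:         # too many pulses: remove where overshoot is largest
        i = np.argmax(np.abs(y) - np.abs(target))
        y[i] -= np.sign(y[i])
    while np.sum(np.abs(y)) < K:         # too few pulses: add where undershoot is largest
        i = np.argmax(np.abs(target) - np.abs(y))
        y[i] += 1 if target[i] >= 0 else -1
    return y

y = pvq_quantize(np.array([0.9, -0.1, 0.3]), K=5)
```

Since every codeword has exactly K pulses, the set of codewords is finite and enumerable, which is what yields the fixed-length codes mentioned next.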
This encoding leads to code words of fixed (predictable) length, which in turn provides
robustness against bit errors and removes the need for entropy encoding. Finally, all
outputs of the encoder are combined into one bitstream by a range encoder. In
connection with PVQ, CELT uses a technique known as band folding, which delivers an
effect similar to Spectral Band Replication (SBR) by reusing coefficients of lower bands
for higher ones, but with much less impact on algorithmic delay and computational
complexity than SBR. The decoder unpacks the individual components from the
range-coded bitstream, multiplies the band energies with the band-shape coefficients,
and transforms them back (via the iMDCT) to PCM data. The individual blocks are
rejoined using weighted overlap-add (WOLA). Many parameters are not explicitly
coded, but are instead reconstructed by using the same functions as the encoder.
The advantages of CELT that make it a strong choice for modern speech coding
are as follows:
• Very low algorithmic delay, achieved through short, overlapping MDCT blocks
with reduced overlap.
• High compression at a quality suitable for both speech and music.
• Robustness against bit errors, since PVQ yields code words of fixed,
predictable length and no entropy coding is required.
• Lower computational complexity than CELP-style analysis-by-synthesis coders.
• An open-source, completely royalty-free implementation within the Opus
project.