Supervisor:
prof. Marco Masoero
Corrado SCANAVINO
JULY 2009
Acknowledgements
Àlaleria.
Contents
Acknowledgements
1 Introduction
4.3.4 The Inverse Short Time Fourier Transform & Overlap-Add Resynthesis
4.4 Constant-Q analysis
4.4.1 Implementation of Constant-Q Analysis
7 Conclusion
Bibliography
List of Tables
List of Figures
4.2 Two plots of static spectrum. The image represents the SPL against frequency of a
drum hit played by a robot (on the left) and of a note of a violin (on the right). The
difference is noticeable: while the robot hit has apparently no harmonically related
frequency components, in the violin note they are clear.
4.3 Basic operation of the STFT used for sound analysis.
4.4 Waterfall spectrum, a 3D representation of the STFT spectrum. The graph was obtained
with the Spectutils package for GNU Octave. The analysis parameters of the STFT are shown
above the figure; the audio sample analyzed is extracted from Laurie Anderson’s Violin Solo.
4.5 Types of windows used in STFT for audio analysis. No ideal window exists; the term
"optimal window" is preferred. Several types of windows are used; for musical purposes the
Kaiser window is usually preferred.
4.6 Spacing of filters for STFT (filterbank view) on the top and Constant-Q filterbank on
the bottom. The advantage of the Constant-Q filterbank method is clear: it places the
filters linearly against log(frequency), which is similar to the frequency response of the
human ear.
4.7 Waterfall spectrogram of a Constant Q transform of violin glissando from 587 Hz to
880 Hz (D5 to A5). Taken from Judith Brown’s Calculation of a constant Q spectral
transform. [A glissando is a glide from one pitch to another. It is an Italianized musical
term derived from the French glisser, to glide. It is also where the pianist slides up the
piano with his or her hands. From Wikipedia.]
4.8 Waterfall spectrogram of a Constant Q transform of flute playing diatonic scale from
262 Hz to 523 Hz (C4 to C5). Taken from Judith Brown’s Calculation of a constant Q
spectral transform. [In music theory, a diatonic scale is a seven note musical scale
comprising five whole steps and two half steps, in which the half steps are maximally
separated. From Wikipedia.]
5.1 Max 5 patcher window
5.2 Max 5 window
6.1 The two robots on the sides, SCS + the performer in the middle.
6.2 On the top, the waveform corresponding to a hit of a robot percussionist of ·O M M· .
On the bottom, the intensity profile of the hit (using Praat), where onset, attack and the
transient/steady state separation are highlighted.
6.3 From top to bottom: waveform, static spectrum (FFT) and time-varying spectrum (STFT).
From right to left: one hit of ·O M M· robot, one hit of snare drum.
6.4 Unwrapped phase deviation between two adjacent analysis frames. ∆ϕn,k is the unwrapped
phase deviation. For the simpler case represented by a steady state sinusoid, the phase
deviation is approximately 0 and constant across all the analysis frames, while during a
transient the phase deviation should be extremely large and easy to detect.
6.5 Graphical representation of the bounded-Q filterbank. Only the octaves are
geometrically spaced; within each octave the spacing between analysis bins is linear. This
allows the application of FFT-like algorithms to calculate the spectrum of each component.
B.1 Max patcher window showing our test patch realized to analyze the ·O M M· sounds with
bonk∼ 3.0
C.1 XCode main window
C.2 a Bundle
Chapter 1
Introduction
Sound analysis is a very wide area of research; its typical applications range from
studies of environmental impact and vibration models to bioacoustics and... music.
Each of these fields has its own specific characteristics, so the most suitable approach
must be identified case by case.
The specific field of application of this thesis is musical: the method has been tested
within the Orchestra Meccanica Marinetti project, or ·O M M· . This thesis is about a
specific sound analysis approach, defined as perceptually grounded because it mimics the
human perception of sound (the auditory system). Perceptual sound analysis is an ideal
candidate for applications in this context. The idea was to extend or improve some of the
musical characteristics of the robotic orchestra.
·O M M· is a project about a robotic orchestra, controlled in real time by a performer. The
project was conceived by the programmer and digital artist Angelo Comino, AKA Motor,
and consists mainly of two robots that play drums, conducted by a performer through
a gestural controller via MIDI. The two robots are more than 2 meters high, and the
drums are standard-size oil cans (such as those used by petrol companies).
These devices were designed and built with industrial components, with special care taken
to emulate the movement of a real drummer, by the people of the Mechatronic Lab
(LIM) of Politecnico di Torino, in collaboration with local robotics companies:
Prima Electronics, ERXA and ACTUA. Each robot has two arms, moved by two powerful electric
motors and controlled by dedicated FPGA-DSP hardware, while the interaction with the
performer is managed in real time by the Show Control System, developed in Max/MSP
(a typical development environment for this kind of application) and running on an Apple
laptop. A picture of the ensemble (robots and performer) can be seen in figure 6.1.
·O M M· was presented in October 2008 in Turin, attracting great visibility in the national
media (press and television); it is currently completing its engineering development
and starting its artistic deployment.
The idea of a robot musician is not new: in the early 1980s, at Waseda University (Japan),
1 – Introduction
WABOT-2 was tested: a robot keyboardist able to converse with a person, read
a musical score with its eyes (a camera) and play it on an electronic organ. A robot
drummer named Haile has been under development since 2005 at the Georgia Tech College of
Computing. Haile is able to listen to live musicians and accompany them by playing a drum.
Haile’s output is based on live analysis and processing of the sounds produced by other
musicians playing at the same time, not on pre-recorded sequences. Other examples are the
recent robotic trumpeter and violinist by TMC (Toyota Motor Corporation).
One of the critical points within a mechanical orchestra is synchronizing the execution of
the musical score among real and virtual instruments. The human ear is particularly
sensitive to timing, but physical devices, such as the electromechanical arms of the robots,
have variable delays when activated. These variations depend on the requested note intensity
or on the execution rhythm. Each robot basically receives a message on a serial line (MIDI)
stating what kind of hit should be executed. Besides the message generation and data
transmission delay, usually negligible from a human point of view, there is the delay
introduced by the physical movement of the arms. Thus, we need to measure the time
interval between the digital command and the perceived strike on the can.
We propose a "perceptually grounded approach" to recognize the different strikes, in order
to compute the delay matrix of a generic score. The robots can play approximately two hits
per second with each arm, and the sounds they can play consist of three different
variations. That is, each robot arm must be positioned at the correct height and will hit
the drum after a delay that is primarily related to the distance of the arm from the drum
and the acceleration with which the arm is driven. Hence, the delays are variable; moreover,
vibrations not yet absorbed between one hit and the next can cause unwanted changes in the
perceived loudness and pitch. That is why a perceptually based approach was needed to
characterize the behaviour of the robots’ performance and its response to the applied
digital stimuli.
The first part of the thesis contains an overview of the phenomena occurring in our
auditory system, in particular when a new musical event occurs (i.e. a robot hit in
our case), during which our ear encodes both the time and the frequency phenomena related
to the human perception of sound. In the specific case of percussion, the sound produced can
conveniently be subdivided into two parts: the transient (the origin of the sound) and
the steady-state portion (which can be understood as the extended support of the transient).
Two important points should be considered. First, during the transient, the precise instant
at which the sound originates (the onset of the sound) corresponds to a rapid increase in
sound energy, which can reach its peak in less than 5 ms. Detecting the onset, our first
aim, is not a trivial task. Second, the time at which the onset occurs is the meaningful
component of a percussive sound: the information obtained by performing a spectrum analysis
at this point is usually considered sufficient to predict the entire sound, i.e. the
steady-state portion of the sound can be derived from it.
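As a rough illustration of the first point, the sketch below locates the onset of a synthetic hit by looking for the first short frame whose energy rises to a fraction of the peak frame energy. It is a minimal, purely energy-based sketch in Python and not the perceptually grounded method developed in this thesis; the function names, the 1 ms frame length, the 10% ratio and the synthetic burst are all assumptions made for the example.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive, non-overlapping frames."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_onset(samples, sample_rate, frame_ms=1.0, ratio=0.1):
    """Time (s) of the first frame whose energy reaches ratio * peak energy."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    energies = frame_energies(samples, frame_len)
    peak = max(energies, default=0.0)
    if peak == 0.0:
        return None  # silent input: no onset to report
    for index, energy in enumerate(energies):
        if energy >= ratio * peak:
            return index * frame_len / sample_rate
    return None


# Hypothetical test signal: 50 ms of silence, then a decaying 200 Hz burst.
SAMPLE_RATE = 44100
silence = [0.0] * int(0.050 * SAMPLE_RATE)
burst = [
    math.exp(-n / 2000.0) * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
    for n in range(4410)
]
onset_time = detect_onset(silence + burst, SAMPLE_RATE)
```

With these assumed settings the detected time falls within about one frame of the true start of the burst. The time resolution is bounded by the frame length, which is one reason why such a simple detector is only a starting point for sounds whose energy peaks in a few milliseconds.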
Efforts were then extended to the subfield of signal processing concerned with the specific
treatment of musical audio signals (DSP). We therefore present digital audio representation
and the realization of digital filters, basic (but still very advanced) components
of every digital signal processing task. For this purpose the books of Roads, Dodge,
Rocchesso and Beauchamp were a good starting point. Later we introduce the methods
used for audio analysis, in particular those systems which perform harmonic analysis,
usually grouped under the name of Fourier analysis. The Fourier transform,
applied to a musical signal, can be viewed as a decomposition of the sound into a finite
number of harmonics, each of which is represented by a complex value. This value is
sufficient to extract all the information needed to derive the frequency, intensity and
phase of each harmonic, but it is not the only solution for musical analysis. The method
we suggest considering, under certain conditions (in particular for music or speech), is
based on the Constant-Q Transform, which approximates well some features of the human
auditory system.
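To make the role of the complex value concrete, the following Python sketch computes a naive discrete Fourier transform of a short cosine and reads frequency, amplitude and phase from a single bin. It is only an illustration under stated assumptions: the helper names, the 800 Hz sampling rate, the 64-sample length and the test tone are hypothetical, and the O(N^2) transform stands in for the FFT used in practice.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform, adequate for short example signals."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def describe_bin(spectrum, k, sample_rate):
    """Frequency (Hz), amplitude and phase (rad) of bin k, for 0 < k < N/2."""
    n = len(spectrum)
    value = spectrum[k]
    frequency = k * sample_rate / n   # bin index to Hz
    amplitude = 2 * abs(value) / n    # modulus gives the intensity
    phase = cmath.phase(value)        # argument gives the phase
    return frequency, amplitude, phase


# Hypothetical tone: 100 Hz cosine, amplitude 0.5, initial phase pi/4.
SAMPLE_RATE = 800
N = 64
signal = [
    0.5 * math.cos(2 * math.pi * 100 * t / SAMPLE_RATE + math.pi / 4)
    for t in range(N)
]
spectrum = dft(signal)
frequency, amplitude, phase = describe_bin(spectrum, 8, SAMPLE_RATE)
```

Because the assumed tone completes a whole number of periods in the analysis frame, bin 8 recovers the 100 Hz frequency, the 0.5 amplitude and the pi/4 phase essentially exactly. The bins are spaced linearly in frequency, which is the property that the Constant-Q approach changes.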
Composers of the 20th century contributed to the evolution of electronic music
in ways that even they would not have expected. Since Luigi Russolo and the intonarumori
in 1918, mechanical instruments producing non-harmonic sound (Art of Noise), musical
schemes have been continuously redefined. Russolo influenced even Stravinsky (Paris
1921), and after Stravinsky and the expressivity and richness of his music, musicians
also became technicians and explored new electro-acoustic machines for producing sound.
Several years passed between the Russian composer and the first Moog, but the
works of other composers such as Bartók, Varèse, Messiaen, Schaeffer, Ligeti and Cage
kept innovation going.
In parallel, the studies of certain mathematicians and physicists of the
19th century (primarily Helmholtz and Fourier) led technicians to the discoveries
that made possible the first electronic instruments, which became famous
through movies such as "Forbidden Planet" (Louis and Bebe Barron) or "2001: A Space Odyssey"
(Ligeti, and the voice of HAL, inspired by the computer synthesis experiments of Max
Mathews), or through the experimental works of Norman McLaren.
Other experiments between art and technology were proposed, such as the
electroacoustic compositions "A man sitting in a cafeteria" by Charles Dodge and "I
am sitting in a room" by Alvin Lucier, very attractive for their expressivity and their
educational approach. The first is one of the first experiments in reproducing speech
with a computer, and the second is a brilliant example of applying different impulse
responses of a room. This music comes from the 1960s and 1970s.
Thank you!
Chapter 2
A Perceptually Grounded
Approach...
2 – A Perceptually Grounded Approach...
conveyed by sounds, with respect to the study of natural auditory systems (human ear),
had been successfully applied to derive the relationship between physical stimuli and the
induced mental construct.
The subfield of psychophysics (the study of psychological responses to physical
stimuli) describing these phenomena is psychoacoustics.
2.1 – Auditory Cognition (Reminding Psychoacoustics)
300 Hz and use the corresponding value of SPL, then the two tones will sound equally
loud to the listener.
Obviously, the perfect sine wave is an artifact: no sound in nature consists solely of a
single frequency. However, it has been demonstrated that it is possible to decompose5 a
sound into a sum of perfect sine waves. We can therefore assume that each of them,
weighted according to the Fletcher-Munson (FM) curves and then summed, contributes to the
total loudness. But this is again a theoretical situation, since linearity does not
actually hold, at least not over the whole spectrum, because of the presence of critical
bands6 .
Before introducing the time and frequency perception of human hearing, the most
advanced features of our auditory system, it may be better to understand how the ear
works.
Figure 2.2: Equal-loudness contours for the human ear, determined experimentally by
Fletcher and Munson and published in Loudness, its definition, measurement and calculation
[1933].
• The outer ear: amplifies and conveys incoming sound waves (air vibrations).
Here the sound waves enter the auditory canal, which can amplify sounds containing
frequencies in the range between 3 kHz and 12 kHz. At the far end of the
5 See chapter 4, Fourier Transform and Overlap-Add Resynthesis, for an explanation of this fact.
6 See section 3.1.3 for an explanation of critical bands.
auditory canal is the eardrum (or tympanic membrane), which marks the beginning
of the middle ear.
• The middle ear: sound waves, coming from the auditory canal, now hit the tympanic
membrane. Here, three delicate bones, the malleus (hammer), incus (anvil) and stapes
(stirrup), convert the lower-pressure sound vibrations of the eardrum into higher-pressure
sound vibrations at another, smaller membrane, called the oval (or elliptical) window.
Finally, the stapedius muscle has the role of preventing damage to the inner ear. The
middle ear still contains the sound information in wave form; it is converted to nerve
impulses in the cochlea. Higher pressure is necessary because the inner ear beyond the
oval window contains liquid rather than air.
• The inner ear: processes mechanical vibrations and transduces them mechanically,
hydrodynamically and electrochemically. The resulting signals are then transmitted
through nerves to the brain.
The inner ear consists of the cochlea and several non-auditory structures. The
cochlea has three fluid-filled sections, and supports a fluid wave driven by pressure
across the basilar membrane separating two of the sections. Strikingly, one section,
called the cochlear duct or scala media, contains endolymph, an extracellular fluid
similar in composition to the fluid usually found inside cells. The organ of
Corti is located in this duct, and transforms mechanical waves into electric signals
in neurons. The other two sections are known as the scala tympani and the scala
vestibuli; these are located within the bony labyrinth, which is filled with a fluid
called perilymph. The chemical difference between the two fluids (endolymph and
perilymph) is important for the function of the inner ear.
Additional processes occur at the brain level; for example, other neurally encoded
information is used to combine the signals coming from both ears and fuse them
into one sensation. However, although complex, these mechanisms alone do not provide the
brain with the information needed to understand, for example, a single note, a harmony, a
rhythm or higher-level musical structures. It appears that the low-level time and frequency
perceptual mechanisms also operate on the musical signal in parallel. Thus the nature of a
sound is not determined only by the physical properties of the sound and of the human ear:
all this information is combined at a high level (i.e. in the brain), where the sound
takes its musical form.
Period detector
The period detection mechanism of the auditory system operates on the fine structure
of the neurally translated incoming waveform. The neural pattern is obtained by nerve
cells (in the organ of Corti) firing individually or in groups, at a rate which corresponds
to the period of the wave. Individually, each cell can operate in this manner only down to
a certain period: if the period is too short, the cell cannot recover quickly enough.
However, groups of cells can rotate or stagger their firing so that, in effect, they follow
submultiples of the sound period.
A special feature is that the ear can encode variations in the envelope of the wave:
studies have demonstrated the existence of a mechanism in the central auditory system
that detects amplitude modulation (AM), although only in a small range of frequencies
(75 to 500 Hz) and only for a significant depth of modulation.
Event detector
Another time-related mechanism, deep inside the human ear, is the perception of events.
A musical event occurs every time there is a variation in the vibration pattern, that is,
something has happened nearby and we hear a new sound. The sound onset 7 is the perception
that a new sound has been born. At onset time other nerve cells fire, and different cells
respond to different onset slopes. A model for onset detection, developed by Gordon in 1984
[26], showed that the moment of perceptual onset of a musical event can be significantly
delayed with respect to the physical onset. Another problem is that it is not possible to
establish unequivocally the threshold above which an event becomes audible to the ear, that
is, the threshold above which the ear recognizes the onset. What does the
human ear consider an audible event? Bilmes posed these questions: does it refer
to the time when the physical energy in a signal increases infinitesimally? The time of
peak firing rate of the cochlear nerve? The time when we first notice a sound? The time
when we first perceive a musical event? Or something else? [8] Whatever it means,
it has been demonstrated that the perceptual onset time does not necessarily coincide with
the initial increase in physical energy.
Again, since other cells respond to the temporal interval between events, the
human auditory system is able to connect single events into a rhythmic stream.
Temporal integration
Figure 2.4: Part of the inner ear, the cochlea is shaped like a 32 mm long snail and is
filled with two different fluids separated by the basilar membrane.
Frequency is a physical parameter associated with each wave that carries sound energy
to the ear. Pitch is the perceived parameter related to frequency; it can be thought of as
the quality of a sound governed by the rate of the vibrations producing the sound [44].
In the inner ear, the oscillations of the oval window take the form of traveling
waves which move along the basilar membrane, i.e. along the entire length of the cochlea.
The mechanism for detecting frequencies is located in the basilar membrane. A simple
correspondence occurs: when a single sine tone excites the ear, a region of the basilar
membrane oscillates around its equilibrium position. Since real sounds do not consist of a
single frequency, this region will show a place where the excitation has a maximum,
corresponding to the fundamental frequency. The distance of this maximum from the end of
the basilar membrane is directly related to frequency, so that each frequency is mapped to
a precise place along the membrane. The mechanical properties of the cochlea (wide and stiff
at the base, narrower and much less stiff at the apex) result in a roughly logarithmic
decrease in bandwidth as we move linearly away from the cochlear opening (the oval
window), as shown in figure 2.4. Thus, the auditory system acts as a spectrum analyzer,
detecting the frequencies in the incoming sound at every moment in time. In the inner
ear, the cochlea can be understood as a set of band-pass filters, each filter letting only
frequencies in a very narrow range pass. This mechanism can be associated with a
filterbank of constant-Q filters 8 , because their center frequencies are linearly spaced
on a logarithmic scale9 .
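The idea of filters linearly spaced on a logarithmic frequency axis can be sketched in a few lines of Python. This is a minimal illustration, not the implementation discussed in chapter 4: the function name, the 220 Hz to 880 Hz range, the choice of 12 filters per octave, and the assumption that each bandwidth equals the distance to the next center frequency are all hypothetical.

```python
def constant_q_bank(f_min_hz, f_max_hz, bins_per_octave):
    """Return (Q, [(center_hz, bandwidth_hz), ...]) for geometrically spaced filters."""
    ratio = 2 ** (1 / bins_per_octave)  # frequency ratio between adjacent centers
    q = 1 / (ratio - 1)                 # assumed: bandwidth = distance to next center
    bank = []
    k = 0
    # Small relative tolerance so that rounding does not drop the last filter.
    while f_min_hz * ratio ** k <= f_max_hz * (1 + 1e-9):
        center = f_min_hz * ratio ** k
        bank.append((center, center / q))  # bandwidth grows with center frequency
        k += 1
    return q, bank


# Hypothetical range: two octaves starting at 220 Hz, one filter per semitone.
q, bank = constant_q_bank(220.0, 880.0, 12)
```

Under these assumptions every filter has the same ratio of center frequency to bandwidth (Q close to 16.8), so the filters are narrow at low frequencies and wide at high frequencies, while an equal step on a logarithmic axis separates each center from the next.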
However, the sensation of pitch is not only related to the perceived fundamental
frequency. Other contributions, related to the temporal mechanisms encoded in the ear, such
as period detection, can alter the sensation of pitch.
The sounds that the ear can sense span a wide frequency range, approximately from 20 Hz to
20 kHz. The perceived pitch, also expressed in Hz, has a more limited range, approximately
from 60 Hz to 5 kHz.
Critical Bands
Since each frequency stimulates a region of the basilar membrane, a limit is imposed on the
frequency resolution of the ear. This limit is reflected in another characteristic of
perception, known as the critical band.
A simple example helps to understand how the ear works within a critical band.
Think of, or better listen to, two sine waves very close in frequency: their total loudness
is less than the sum of the two loudnesses we would hear if they were well separated
in frequency. Now, if we slowly move them apart in frequency, we perceive the same
loudness up to a point; then, beyond a certain frequency difference, the total loudness
increases approximately to the sum of the individual loudnesses. The frequency difference
needed to perceive the loudness as the sum of the individual loudnesses is the critical band.
The behavior of the ear in this region can be thought of as a kind of frequency integration,
because it is similar to the temporal integration we have seen earlier. Inside the critical
band reside other important factors of perception: roughness and beating. Roughness
8 A constant-Q filterbank is a set of bandpass filters whose bandwidths are adjusted
according to their center frequencies, so as to maintain a fixed ratio (Q).
9 See chapter 4, the section on Constant-Q analysis, whose benefit is based on the
similarity with the human pitch detection mechanism occurring in the basilar membrane.
Figure 2.5: Cochleagrams, expressed in Bark units as a function of time. On the left the
spoken Italian word "ape", on the right a short excerpt of Moondog’s “Pigmy pig”.
is a sensation of dissonance; its presence is particularly strong near the lower and upper
bounds of the critical band, where the two tones are almost separated but not yet ready to
be perceived as two sounds. In the middle of the critical band the two tones are heard as
one, with a frequency that lies between the two frequencies, and we can clearly perceive
the sensation of beating. When the two tones are separated by 1 Hz we perceive a single
beat per second. The width of the critical bands (bandwidth) increases with frequency.
The Bark scale was proposed to represent the behavior of the human ear inside the
critical bands. An example of such a representation is given in figure 2.5, where the
spectrogram is plotted against frequency on a Bark scale; in this case it is appropriate
to speak of a cochleagram. The Bark scale (of human hearing) ranges from
1 to 24 Barks, corresponding to the first 24 critical bands. The proposed Bark center
frequencies, in Hz, are:
50, 150, 250, 350, 450, 570, 700, 840, 1000, 1170, 1370, 1600, 1850, 2150, 2500,
2900, 3400, 4000, 4800, 5800, 7000, 8500, 10500, 13500
and the critical bandwidths, in Hz, are:
100, 100, 100, 100, 110, 120, 140, 150, 160, 190, 210, 240, 280, 320, 380, 450, 550,
700, 900, 1100, 1300, 1800, 2500, 3500, 5000
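The two lists above can be combined to find the critical band into which a given frequency falls. The Python sketch below is only an illustration: it assumes that the band edges are obtained by accumulating the bandwidths starting from 0 Hz, it uses only the first 24 bandwidth values (matching the 24 center frequencies), and the names `BARK_EDGES_HZ` and `bark_band` are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

BARK_CENTERS_HZ = [
    50, 150, 250, 350, 450, 570, 700, 840, 1000, 1170, 1370, 1600,
    1850, 2150, 2500, 2900, 3400, 4000, 4800, 5800, 7000, 8500, 10500, 13500,
]
# First 24 bandwidth values from the list in the text.
BARK_BANDWIDTHS_HZ = [
    100, 100, 100, 100, 110, 120, 140, 150, 160, 190, 210, 240,
    280, 320, 380, 450, 550, 700, 900, 1100, 1300, 1800, 2500, 3500,
]
# Assumed band edges: running sum of the bandwidths, starting at 0 Hz.
BARK_EDGES_HZ = [0] + list(accumulate(BARK_BANDWIDTHS_HZ))


def bark_band(frequency_hz):
    """Return the 1-based index of the critical band containing frequency_hz."""
    if not 0 <= frequency_hz < BARK_EDGES_HZ[-1]:
        raise ValueError("frequency outside the 24 tabulated bands")
    # Number of edges that are <= frequency_hz equals the band index.
    return bisect_right(BARK_EDGES_HZ, frequency_hz)


band_of_1khz = bark_band(1000)
```

With these assumed edges a 1000 Hz tone falls in band 9, the band whose tabulated center frequency is 1000 Hz, and the 24 bands together cover the range from 0 Hz up to 15.5 kHz.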
to 15.5 kHz, the highest sampling rate for which the Bark scale is defined up to the
Nyquist limit, 31 kHz.
When many frequencies are present (fundamental tones and harmonics), the auditory
system works on all of them simultaneously, with the limit of resolution introduced by
the critical bands. This effect on the overall spectrum is another contribution to the
perceived pitch. Inevitably, the pitch is also influenced by inharmonic spectra, which are
a characteristic of noise.
Perception of noise
White noise does not affect pitch because it is completely random and has a flat spectrum
that does not evoke any pitch sensation, only annoyance. Since colored noises are created by
modulating white noise, some of them can yield a vague pitch sensation, depending on
the modulation applied. For example, with amplitude modulation of white noise there may
be a pitch corresponding to the modulation frequency. Other pitch sensations can
be obtained by filtering white noise or applying digital effects to it.
We have seen the most important factors characterizing the sensation of pitch, now
we can introduce the last perceived attribute of a sound, the timbre.
Chapter 3
Prior to the study of specific aspects of sound analysis (chapters 4 and 6), it is better to
clarify the basic concepts behind audio representation on digital computers. In the
following paragraphs, the main attributes of a digitized sound are dealt with, from basic
terms (sampling and quantization) to advanced applications (digital filters).
This chapter is therefore divided into two sections: the first contains a brief illustration
of the theory behind the digital representation of music, while the second gives a deeper
explanation of how filters are implemented on digital computers.
3 – Digital Audio Concepts
at which the sampling operation is performed, has to be at least twice the frequency
band of the analog signal. The frequency band is determined by the maximum frequency
contained in the signal. Since the (average) upper frequency limit of human hearing is
considered to be 20 kHz, a sampling rate higher than 40 kHz must be chosen. This is
enough to allow reconstruction of the original signal, starting from the samples, in a way
that the human ear cannot distinguish from the original.
The device performing this operation is called an analog-to-digital converter (ADC).
At each sampling period (i.e. the inverse of the sampling rate), the ADC produces a
binary number, called a sample; samples are stored in memory in the exact order in which
they are produced. The inverse operation, from digital to analog, is performed by the
digital-to-analog converter (DAC).
The sampling rate normally used in computers to represent digital audio signals is
44.1 kHz or 48 kHz. The frequency equal to half the sampling rate is called the Nyquist
frequency. The faster the sampling rate, the greater the Nyquist frequency and
consequently the range of frequencies that can be represented (but also the demands on
speed and power consumption of the hardware).
Aliasing
Like any other analog-to-digital conversion, audio conversion may be affected by
the problem of aliasing. Aliasing occurs because frequencies higher than half the sampling
rate (the Nyquist frequency) may be present at the input of the ADC. This results in a
distortion of the original signal, which can be heard, in acoustical terms, as an unwanted
change in pitch1 , because frequencies above Nyquist are converted to lower frequencies.
The problem can be easily overcome by placing an anti-aliasing filter 2 before the ADC,
which ensures that only signals below Nyquist enter the converter. A similar filter is also
placed at the end of the audio chain, between the DAC and the speaker, for the
same reason. A generic audio system is shown in figure C.2.
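The frequency at which an unfiltered component above Nyquist reappears after sampling can be computed by folding it back around the Nyquist frequency. The short Python sketch below illustrates this for an ideal sampler; the function name and the example frequencies are assumptions chosen for the illustration.

```python
def alias_frequency(frequency_hz, sample_rate_hz):
    """Apparent frequency of a sinusoid after ideal sampling (no anti-aliasing filter)."""
    nyquist = sample_rate_hz / 2
    folded = frequency_hz % sample_rate_hz  # the sampled spectrum repeats every fs
    # Components between Nyquist and fs are mirrored back below Nyquist.
    return sample_rate_hz - folded if folded > nyquist else folded


# Hypothetical examples at a 44.1 kHz sampling rate.
below_nyquist = alias_frequency(1000, 44100)   # unchanged
above_nyquist = alias_frequency(30000, 44100)  # folded to a lower frequency
```

At 44.1 kHz a 30 kHz component, itself inaudible, would be heard at 14.1 kHz, well inside the audible range, while a 1 kHz tone is left untouched. This is why the anti-aliasing filter has to act before the converter and cannot be replaced by later processing.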
3.1 – Toward Digital Representation of Sound
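[Figure: block diagram of a generic audio system. Recoverable labels: analog audio input,
Nyquist frequency (input side), MEMORY, Nyquist frequency (output side), analog audio output.]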
dynamic range for human hearing is called the threshold of pain3 and is estimated to lie
above 120 dB. The Hiroshima explosion was 180 dB. If the sound is particularly short the
threshold of pain can increase, but it is better not to try...
While recording music, it is important to capture the widest possible dynamic range,
in order to reproduce the music in its full expressiveness. For example, recording an
orchestra requires a wider dynamic range than recording a solo instrument [44]. The number
of bits (Nbit) used to represent each sample4 has a direct influence on the maximum dynamic
range of a digital audio system. The following simple formula can be used for this purpose:
Therefore, a 24 bit system may reach 147 dB, much more than the threshold of pain.
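As a hedged illustration of how the number of bits bounds the dynamic range, the Python sketch below uses the commonly quoted rule that the ratio between the largest and the smallest representable amplitude is 2^Nbit, which corresponds to about 6.02 dB per bit. This rule is an assumption of the example and gives a slightly lower figure than the 147 dB quoted above for 24 bits, which is consistent with a per-bit constant of about 6.1 dB. The function name is hypothetical.

```python
import math


def dynamic_range_db(n_bits):
    """Ratio between the largest and the smallest representable amplitude, in dB."""
    return 20 * math.log10(2 ** n_bits)  # about 6.02 dB for each bit


# Common word lengths in digital audio.
ranges = {bits: dynamic_range_db(bits) for bits in (8, 16, 24)}
```

Under this assumption a 16 bit system offers about 96.3 dB and a 24 bit system about 144.5 dB; both values for 24 bits, whichever constant is used, exceed the threshold of pain.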
Noise should also be considered when speaking of dynamic range, because
noisy sound components (not only the noise introduced by all electronic devices, but
also real noisy sounds) are always present in the proximity of the audio system and
can alter, for example, the lower limit of the dynamic range. The signal-to-noise ratio
(SNR) compares the level of a given signal to the level of noise in the system. Noise can
have a wide variety of meanings, and it also depends on the environment and on the
sensitivity of the listener. SNR is also expressed in dB, so that a high value in dB means
a clean sound. The SNR of a good audio system is often higher than 90 dB. Dynamic range and
SNR are good indicators of the quality of an audio system, but not the only ones.
Quantization Error
We now present the last (and one of the most important) factors determining digital
audio quality: quantization. How many bits are needed to represent the sampled
amplitude of the signal? Normally the answer is given by the maximum resolution of the
3 Levels higher than this threshold can seriously damage the human hearing system.
4 That is, the quantization, explained in the next section.
ADC used to perform the sampling. Obviously, the higher the resolution of the converter,
the better the quality of the digitized sound. Since the number of bits is a finite
integer n, only $2^n$ values can be used to represent the original value; these are called
quantization levels. When the system has to convert a value that does not coincide with one
of these levels, rounding is necessary. The quantization error is the difference between
the real value and the binary number used to represent it; it affects almost all the
samples and introduces the quantization noise. In 16 and 24 bit ADCs the quantization noise
is negligible.
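The rounding step and the resulting error can be sketched in Python as follows. This is a minimal model of a uniform quantizer, not a description of any particular ADC: the function names, the full-scale range of -1 to 1, the rounding to the nearest level and the sinusoidal test values are assumptions made for the example.

```python
import math


def quantize(value, n_bits, full_scale=1.0):
    """Round value to the nearest of 2**n_bits uniformly spaced levels."""
    levels = 2 ** n_bits
    step = 2 * full_scale / levels  # distance between adjacent levels
    index = round(value / step)
    # Clip to the available integer codes, as a two's complement word would.
    index = max(-(levels // 2), min(levels // 2 - 1, index))
    return index * step


def max_quantization_error(values, n_bits):
    """Largest absolute difference between the values and their quantized versions."""
    return max(abs(v - quantize(v, n_bits)) for v in values)


# Hypothetical test values: a slowly sampled sinusoid kept away from clipping.
test_values = [0.9 * math.sin(0.1 * n) for n in range(1000)]
error_8 = max_quantization_error(test_values, 8)
error_16 = max_quantization_error(test_values, 16)
```

As long as the input does not clip, the error never exceeds half a quantization step, so each additional bit halves the worst-case error. In this model the bound is about 0.004 for 8 bits and about 0.000015 for 16 bits, relative to a full scale of 1.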
3.2 – Digital Filters
All filters may be characterized by their frequency response. The well-known types of
frequency response are: low-pass, high-pass, band-pass and band-reject.
The frequency response consists of two parts: the amplitude response, shown in figure
3.2 for the four basic types of filters, and the phase response. The amplitude response is
the ratio of the amplitude of the output signal to that of the input signal, as it varies
over the frequency range. The phase response (also varying with frequency) is the amount of
phase change that the signal undergoes in passing through the filter. Sometimes it is
defined in terms of phase delay, that is, the phase change expressed as a time delay in ms.
5
The Z-transform converts a discrete time-domain signal, a sequence of real or complex numbers,
into a complex frequency-domain representation. See later in the text for more details.
6
The 4th release of the MUSIC saga, the last developed by Max Mathews at Bell Labs.
Figure 3.2: Amplitude (A, in dB) response versus frequency (f, in Hz) for the four basic types of filters: low-pass, high-pass, band-pass and band-reject.
Figure 3.3: The pass-band, or bandwidth, of a band-pass filter is the difference between
the upper and lower cutoff frequencies. The cutoff frequencies are defined as the frequencies
at which the power (rather than the amplitude) is half its pass-band value, i.e. 3 dB below the
pass-band level. In the figure (amplitude A in dB versus frequency f in Hz), 40 dB is assumed as the maximum level in the pass-band.
The bandwidth of the band-pass filter is also called the selectivity of the filter, and is
useful in quantifying the quality factor, Q.
7
An octave is the interval between two points where the frequency at the second point is twice the
frequency of the first.
8
The order is the mathematical measure of complexity.
Figure 3.4: Example of application of a constant-Q filter. Here the center frequencies
are tuned to a generic musical octave. In music, an octave is the interval between
one musical pitch and another with half or double its frequency.
\[ Q = \frac{f_c}{BW} \]
Parallel connection allows the filters to operate on the same signal at the same time.
The output signal is given by the sum of all the filters’ outputs; that is, the
frequency response of a parallel connection is the sum of all the frequency responses.
For instance, a band-reject filter can be obtained by connecting a low-pass and a high-pass
filter in parallel. An interesting example of parallel connection is the
constant-Q filterbank, which consists of an array of constant-Q band-pass filters that
separates the input signal into several components, each one carrying a single frequency
subband⁹ of the original signal. For musical purposes, these subbands are normally
non-overlapping and exponentially distributed, e.g. over the whole frequency range of human
hearing, between 20 Hz and 20 kHz. A special type of constant-Q filterbank is
the octave filterbank; in particular, third-octave filterbanks
have been standardized for use in audio analysis¹⁰ [12]. In a third-octave filterbank,
the center frequencies of the various bands are exponentially spaced along the frequency
axis, as described by the formula:
\[ f_c[k] = 2^{k/3} \cdot f_c[0] \]
where $f_c[k]$ are the center frequencies of an array of k filters, the first of which is centered,
for example, at $f_c[0] = 1000$ Hz. The bandwidth of the k-th filter is proportional
to the k-th center frequency, as the following formula states:
\[ BW[k] = f_c[k] \cdot \frac{2^{1/3} - 1}{2^{1/6}} \]
Therefore, since the expression of the bandwidth contains the center frequency, the quality
factor $Q[k] = f_c[k]/BW[k]$ is constant for all k filters.
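As a numerical check, the following minimal Python sketch tabulates a few third-octave bands. It assumes the usual third-octave spacing $f_c[k] = f_c[0] \cdot 2^{k/3}$ with $f_c[0] = 1000$ Hz and the bandwidth expression given above; the function name `third_octave_bands` and the choice of k from -3 to 3 are illustrative only.

```python
def third_octave_bands(f0=1000.0, k_min=-3, k_max=3):
    """Return (k, center frequency, bandwidth, Q) for third-octave bands."""
    bands = []
    for k in range(k_min, k_max + 1):
        fc = f0 * 2 ** (k / 3)                        # exponentially spaced centers
        bw = fc * (2 ** (1 / 3) - 1) / 2 ** (1 / 6)   # bandwidth proportional to fc
        bands.append((k, fc, bw, fc / bw))            # Q = fc / BW
    return bands

for k, fc, bw, q in third_octave_bands():
    print(f"k={k:+d}  fc={fc:8.1f} Hz  BW={bw:7.1f} Hz  Q={q:.3f}")
```

Every row should show the same Q, about 4.32, while the bandwidth doubles every three bands (one octave).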
Cascade connection, also called series connection, is the other way to connect filters
to each other. In this case, the signal passes through a series of filters, one by one,
in the order in which they are linked. The direct consequence is that the overall amplitude
response becomes the product (thus the sum in dB) of the individual filter responses,
while the overall filter order becomes the sum of the individual filter orders [18]. For
instance, cascading two or more low-pass filters with the same frequency response makes
it easy to obtain a steeper rolloff, i.e. greater attenuation beyond the cutoff frequency.
Cascade connection of filters may be critical in some cases; for example, much care
must be taken when designing a series of filters with different bandwidths. The filters of
the cascade must guarantee that significant energy passes at least through a common
range of frequencies, otherwise the output could be inaudible.
9
Submultiple of the signal’s bandwidth.
10
Third-octave filters are useful because they have a good correlation to the subjective response of
the human ear.
Figure 3.5: Alteration of the envelope of a tone (INPUT) passed through a narrow filter
(OUTPUT). The output envelope has been stretched in time during onset and offset
components of the tone (initial and final portion).
are connected in cascade, since each one will affect the time duration of the sound,
unwanted distortions can occur.
CD, MiniDisc, etc.). Compressors, limiters, expanders, noise gates and noise reduction are
only a few examples of the sound processing normally applied to the dynamic
range of the music we hear, coming from almost every digital music medium.
Digital filters play a primary role in almost all sound processing applications. Now that
classical filter theory has been briefly presented in the previous sections, the hard task
is to transpose some of those concepts to the discrete world of quantized
samples.
For that purpose, a general and expressive definition coming from LTI system theory
will be applied to the case of digital filters, but first some clarifications are
needed. Systems that do not change their behavior in time and fulfill the superposition
property¹⁵ are called linear time-invariant (LTI), and their most important property is that
they can be completely characterized by the impulse response. The impulse
response is the behavior of such a system when fed with a short impulse. Hence, the general
definition we will adopt for explicit filter realizations is the following: the output signal
of an LTI system is given by the convolution of the impulse response with the input
signal.
The assumption here is therefore that filters are considered as LTI systems.
Impulse Response
The general definition of the impulse response of a filter is the response of that filter when fed
with a short pulse. The short pulse can be considered as a test signal, through which
the characteristics of the filter are revealed. The common test signal used in LTI systems,
such as filters, is the unit impulse, defined as:
\[ UI(t) = \begin{cases} 1, & \text{if } t = 0; \\ 0, & \text{otherwise.} \end{cases} \]
In the case of discrete systems, such as digital filters, the unit impulse is obtained by substituting
t with n and enclosing the sample index in brackets [·], to avoid ambiguity.
Therefore, for discrete LTI systems, the unit impulse can be rewritten as follows:
\[ UI[n] = \begin{cases} 1, & \text{if } n = 0; \\ 0, & \text{otherwise,} \end{cases} \]
which can be seen as a one-sample impulse. In digital terms, the briefest signal possible
(the approximation of the unit impulse) is exactly a single sample, which contains energy
15
In the case of filters, the superposition property states: when two signals are added together and fed
to the filter, the filter’s output is the same as if the two signals were put through the filter separately
and the outputs were then added.
at all the frequencies below Nyquist¹⁶. By definition, the output signal of a filter fed
with the unit impulse is the impulse response, henceforth simply called IR.
Since the unit impulse contains all the frequencies, the IR
can also be seen as the time-domain representation of the amplitude-versus-frequency
response, presented earlier as the frequency response. The bridge between the two
domains is the convolution.
Convolution
Convolution is a generic signal processing operation, like addition or multiplication, but
it is of much greater interest because convolving two signals in the time domain is equivalent to
multiplying them in the frequency domain. That is why the convolution operation is considered
the bridge between the two domains. Convolution is a fundamental operation in digital
audio processing as well as in filters. Let’s see how it works, starting from the formula
representing the definition given above for LTI systems, now applied to the case
of filters: the output signal y[n] of every digital filter is given by the convolution of the
impulse response of the filter with the input signal x[n]. Here it is:
\[ y[n] = h[n] * x[n] \]
where ∗ is the convolution and h[n] is the impulse response. When the impulse response
h[n] is itself the one-sample impulse, acting as unit impulse, convolution
proves to be an identity operation:
\[ y[n] = x[n] * UI[n] = x[n] \]
That is, every function convolved with the unit impulse remains the same.
When speaking of convolution in terms of signal processing, two other
properties deserve attention. Convolving the input signal with a scaled version of the unit
impulse (c being a constant scale factor):
\[ y[n] = x[n] * (c \cdot UI[n]) = c \cdot x[n] \]
and convolving the input signal with a delayed copy of the unit impulse, by means of
time-shifting:
\[ y[n] = x[n] * UI[n - t] = x[n - t] \]
16
According to Fourier’s theories, an inverse relationship exists between the duration of a signal and
its frequency content: the shorter the signal, the wider the spectrum.
That is, the result of the convolution between the input signal and a scaled or time-shifted
unit impulse is the same as scaling or time-shifting the input signal. Consequently,
any input signal can be represented by a sequence of scaled and delayed unit impulse
functions. Moreover, easily recognizable effects in sound systems, such as echo and
reverberation, can be recreated by a really simple but appropriate design of the IR function, as
shown in figure 3.6.
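The identity, scaling and time-shifting properties, and the construction of a simple echo, can be tried out with a few lines of Python. This is only an illustrative sketch: the helper `convolve`, the three-sample test signal and the echo impulse response (a direct sound followed by one half-amplitude replica four samples later) are assumptions made for the example, not values taken from figure 3.6.

```python
def convolve(x, h):
    """Direct convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for m, hm in enumerate(h):
            y[n + m] += xn * hm        # each input sample launches a scaled copy of h
    return y

x = [1.0, 0.5, -0.25]                  # arbitrary short input signal

identity = convolve(x, [1.0])          # unit impulse: output equals the input
scaled = convolve(x, [0.5])            # scaled impulse: output is the input times 0.5
delayed = convolve(x, [0.0, 0.0, 1.0]) # impulse delayed by 2 samples: input delayed by 2

echo_ir = [1.0, 0.0, 0.0, 0.0, 0.5]    # direct sound plus one quieter, later replica
echo = convolve(x, echo_ir)

print(identity, scaled, delayed, echo, sep="\n")
```

A denser impulse response, with many closely spaced and progressively weaker replicas, would move the result from a distinct echo towards a reverberation-like tail.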
In the case of the reverberation effect, shown on the right side of figure 3.6, the
time-smearing¹⁷ effect occurs when the two time-shifted unit impulse functions are too
close relative to the duration of the sounds. Thus the first sound cannot be separated
from its following replica. When such replicas become dense, they take the form of reverberation.
The law of convolution, applied to computer music, states that the convolution of
two waveforms¹⁸ in the time domain is equal to the multiplication of the two spectra
in the frequency domain. This is a fundamental concept in sound processing techniques,
because any transformation applied to sound in the time domain has a direct
correspondence in the frequency domain, and vice versa.
Finally, the mathematical definition of discrete convolution, applied over two generic
17
See chapter 3. Time-smearing is a phenomenon which occurs when two close-in-time sounds cannot
be separated by the time-resolution of the ear.
18
From chapter 3, waveform will be used to define the analogue sound signal in the time-domain
\[ a[n_1] * b[n_2] = \sum_{m=0}^{n_1 - 1} a[m] \cdot b[n_2 - m] = y[k] \]
To enhance the analogy with filters, the formula may be interpreted in this way: a[n1]
acts as a weighting function (such as the IR) for each delayed copy of b[n2] (i.e. the
input signal). The result of the operation, y[k], is k samples long, with:
k = length(a[n]) + length(b[n]) − 1
This way of computing convolution, with a sum for each value of k, is called direct convolution.
The direct form is computationally intensive, requiring on the order of $N^2$ operations, where N is the
length of the longer of the two inputs. A faster way to implement convolution on
digital computers was found. It works with the FFT¹⁹ algorithm, applied to both
convolution operands. The results are multiplied, and the product is finally brought back to the time domain
through the IFFT²⁰ algorithm (and, when the input is processed in blocks, the output blocks are finally summed).
Fast convolution reduces the computational complexity to the order of N log N operations.
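The two approaches can be compared on a small example. The sketch below is a didactic illustration under stated assumptions: the recursive radix-2 `fft` is written out by hand (so the zero-padded length must be a power of two), and the names `direct_convolution` and `fast_convolution` are hypothetical. It is meant to show that both routes give the same output of length(a) + length(b) − 1 samples, not to be an efficient implementation.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. No 1/N scaling."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def direct_convolution(a, b):
    """One multiply-add for every pair of samples: about N**2 operations."""
    y = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y

def fast_convolution(a, b):
    """Zero-pad, transform both operands, multiply the spectra, transform back."""
    out_len = len(a) + len(b) - 1
    size = 1
    while size < out_len:              # next power of two that holds the result
        size *= 2
    fa = fft([complex(v) for v in a] + [0j] * (size - len(a)))
    fb = fft([complex(v) for v in b] + [0j] * (size - len(b)))
    product = [p * q for p, q in zip(fa, fb)]
    y = fft(product, inverse=True)
    return [v.real / size for v in y[:out_len]]   # 1/N scaling of the inverse

a = [1.0, 2.0, 3.0]
b = [0.5, -1.0, 0.25, 2.0]
print(direct_convolution(a, b))
print([round(v, 6) for v in fast_convolution(a, b)])
```

The zero-padding step is what makes the circular convolution implied by the FFT coincide with the linear convolution defined above.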
\[ X[z] = \sum_{n=-\infty}^{\infty} h[n] \cdot z^{-n} \]
while the frequency response can be achieved by applying the DFT to the IR of the filter.
In the equation above, the convolution formula given earlier can be easily recognized.
Here h[m] is the finite impulse response, typical of FIR filter realizations. The time
extension of the impulse response determines the length of the filter, which is N + 1.
As introduced above, the transfer function can be obtained by applying the Z-transform
to the impulse response, which results in:
\[ H[z] = \sum_{m=0}^{N} h[m] \cdot z^{-m} = h[0] + h[1] \cdot z^{-1} + h[2] \cdot z^{-2} + \ldots + h[N] \cdot z^{-N} \]
The simplest example of an FIR filter is the first-order low-pass filter, which takes into
account only the previous input sample in addition to the current one. The formula of this kind of filter is the
following:
y[n] = 0.5(x[n] + x[n − 1])
Besides, to obtain a high-pass filter, again of the first order, we must simply change
the sum into a difference, like this:
y[n] = 0.5(x[n] − x[n − 1])
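The behavior of these two first-order filters can be verified with a short Python sketch. The helper names `fir_filter` and `magnitude_response` are hypothetical, and samples before the start of the input are assumed to be zero. The amplitude response is evaluated directly from the transfer function on the unit circle, at 0 (DC) and at π (the Nyquist frequency).

```python
import cmath
import math

LOW_PASS = [0.5, 0.5]      # y[n] = 0.5 (x[n] + x[n-1])
HIGH_PASS = [0.5, -0.5]    # y[n] = 0.5 (x[n] - x[n-1])

def fir_filter(x, h):
    """y[n] = sum over m of h[m] * x[n - m], with x[n] = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for m, hm in enumerate(h):
            if n - m >= 0:
                acc += hm * x[n - m]
        y.append(acc)
    return y

def magnitude_response(h, omega):
    """|H(e^{j omega})| for a normalized angular frequency in [0, pi]."""
    return abs(sum(hm * cmath.exp(-1j * omega * m) for m, hm in enumerate(h)))

constant = [1.0, 1.0, 1.0, 1.0]          # lowest possible frequency (DC)
alternating = [1.0, -1.0, 1.0, -1.0]     # highest possible frequency (Nyquist)

print(fir_filter(constant, LOW_PASS), fir_filter(alternating, LOW_PASS))
print(fir_filter(constant, HIGH_PASS), fir_filter(alternating, HIGH_PASS))
print(magnitude_response(LOW_PASS, 0.0), magnitude_response(LOW_PASS, math.pi))
```

After the first output sample, which is affected by the zero initial state, the low-pass filter passes the constant signal and cancels the alternating one, while the high-pass filter does the opposite.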
In order to run an Nth-order FIR filter we need to have, at any instant, the current
input sample together with the sequence of the N preceding samples. These N samples
constitute the memory of the filter. In practical implementations, it is customary to
allocate the memory in contiguous cells of the data memory or, in any case, in locations
that can be easily accessed sequentially. At every sampling instant, the state must be
updated in such a way that x(k) becomes x(k + 1), and this seems to imply a shift of
N data words in the filter memory. Indeed, instead of moving data, it is convenient to
move the indexes that access the data.
As an example, three memory words are put in an area organized as a circular
buffer (see figure 3.8). The input is written to the word pointed to by the index, and the
three preceding values of the input are read with the three preceding values of the index.
At every sample instant, the four indexes are incremented by one, with the trick of
restarting from location 0 whenever we exceed the length M of the buffer (this ensures
the circularity of the buffer). The counterclockwise arrow indicates the direction taken
by the indexes, while the clockwise arrow indicates the movement that the data would have to make
if the indexes stayed in a fixed position.
As a matter of fact, an FIR filter contains a delay line since it stores N consecutive
samples of the input sequence and uses each of them with a delay of N samples at most.
The points where the circular buffer is read are called taps and the whole structure is
called a tapped delay line.
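The idea of moving the index instead of the data can be sketched as follows. This is a minimal Python illustration, not the layout of figure 3.8: the class name `TappedDelayLine` is hypothetical, and the buffer is assumed to hold exactly N + 1 cells (the current sample plus the N preceding ones).

```python
class TappedDelayLine:
    """FIR filter whose memory is a circular buffer read at fixed taps."""

    def __init__(self, coefficients):
        self.h = list(coefficients)
        self.buffer = [0.0] * len(self.h)   # current sample plus N previous ones
        self.index = 0                      # where the next input will be written

    def process(self, sample):
        self.buffer[self.index] = sample    # overwrite the oldest stored sample
        acc = 0.0
        for m, hm in enumerate(self.h):
            # Tap m reads the sample written m steps ago; the modulo wraps around.
            acc += hm * self.buffer[(self.index - m) % len(self.buffer)]
        self.index = (self.index + 1) % len(self.buffer)   # move the index, not the data
        return acc

line = TappedDelayLine([1.0, 2.0, 3.0, 4.0])
impulse_response = [line.process(s) for s in [1.0, 0.0, 0.0, 0.0, 0.0]]
print(impulse_response)
```

Feeding a unit impulse returns the filter coefficients one after another, which is a quick way to confirm that the taps are read in the right order.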
Due to the advanced forms in which digital filters can be designed, the result obtained
could be even more precise than the analog counterpart.
Chapter 4
4 – ...To Sound Spectrum Analysis
• Audio analysis takes a digital signal (leaving the stream of samples unaltered)
and mathematically determines its characteristics.
4.1 – Introduction to Sound Analysis in the Frequency Domain
Computer sound synthesis started in 1957 with Max Mathews’ MUSIC 1.
MUSIC 3, in 1960, introduced the concept of the Unit Generator (UG), the simplest
instrument for the computer and the greatest change in the computer sound
programmer’s approach. With a UG, one can create a sine wave to produce an oscillator;
with logical and arithmetic UGs one can multiply two oscillators to produce another sound,
design the frequency and impulse response of filters, combine filters with oscillators to create
new and more complex sounds, and so on, without limit. UGs quickly increased
in complexity, in parallel with the rapid rise of microelectronics, becoming one of the typical
features of most music programming languages. With the subsequent development of
faster algorithms, music synthesis has been widely extended into many research areas.
Curtis Roads says about synthesis:
After Max Mathews in 1957, dozens of sound synthesis techniques have been invented.
As in the field of computer graphics, it is difficult to say at any time which techniques will
flourish and which will fade over time. This situation is fueled by competitive pressure in
the music industry, making it inevitable that synthesis methods fall in and out of fashion,
because no one of these methods can satisfy [44].
As a reminder, some of the synthesis methods are listed here, in no particular order:
• wavetable synthesis
• sampling synthesis
• additive synthesis
• subtractive synthesis
• sinusoidal synthesis
• granular synthesis
• modulation synthesis
• formant synthesis
• residual synthesis
• graphic synthesis
• stochastic synthesis
[Figure: block diagram. Recoverable labels: analog audio input; loudness, (...), bpm; algorithms, oscillators, (...); analog audio output.]
4.2 – Introduction to the Fourier Analysis
that not all interesting musical sounds have a clear pitch and the pitch of a sound may
not necessarily correspond to the lower component of its spectrum, Fourier analysis still
constitutes one of the pillars of acoustics and music. [36]
Origin
Since Jean Baptiste Joseph, Baron de Fourier, published his revolutionary theory in 1822,
we can trace the events that made history, in rapid succession:
• 1898 first mechanical harmonic analyzer, which could be reversed into a waveform
synthesizer,
• 1965 advent of the FFT algorithm, which enormously reduced the computation required by the Fourier
transform,
• 1977 advent of the STFT, short-time Fourier transform, widely used in music systems.
What Fourier stated, in a few words, was that a complex but periodic
signal can be seen as a sum of simple signals. In a musical context this means that
a periodic waveform can be decomposed into a combination of simple sinusoidal
waves, each one with its own amplitude, frequency and phase. On digital computers
a sine wave is generated by an oscillator (the first UG was an oscillator), able to produce
a sine wave from only three parameters: amplitude, frequency and phase. In
engineering mathematics an oscillator is normally expressed in another form through
Euler’s relations, which allow sine and cosine functions to be expressed by means of complex
exponentials.
In this chapter we will introduce a particular Fourier-based analysis and synthesis
system, called the short-time Fourier transform, STFT, due to Allen and Rabiner (1977).
This is a very general technique, useful in the study of time-varying signals such as musical
sounds, that can be used as the basis for more specialized techniques. In the following
chapters the STFT serves as the basis for several analysis/synthesis systems.
In musical contexts, Fourier Transform is applied to analog signals (FT) having a
limited bandwidth or to a finite number of digital samples (DFT or STFT).
We can summarize here the techniques used to compute FT over analog and digital
input signals:
where x(t) is a generic waveform and t and ω are the continuous time index and the
continuous frequency index. ω is the angular frequency, expressed in radians per second.
The simple relationship with the corresponding frequency in Hz is f = ω/(2π).
The FT also admits another interpretation, more interesting in a musical context,
namely the decomposition of the waveform into an infinite number of sinusoidal
components.
The result of the FT is a complex value X(ω) for every value of ω, but X(ω) is
usually considered as the whole spectrum of x(t). Each complex value, expressed in the
form (a + jb), with a and b the real and imaginary parts, reveals the three fundamental
components of a sinusoid: frequency, amplitude and phase. Obviously, ω is the frequency,
and the other two can be computed with the following simple formulas:
\[ \text{amplitude} \Longrightarrow |X(\omega)| = \sqrt{a^2 + b^2} \]
\[ \text{phase} \Longrightarrow \arg[X(\omega)] = \arctan\frac{b}{a} \]
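In code, the conversion from a complex spectrum value to amplitude and phase is a one-liner each. The Python sketch below uses a hypothetical helper `amplitude_and_phase`; note that it relies on `math.atan2(b, a)` rather than a plain arctangent of b/a, an implementation choice made here so that the phase lands in the correct quadrant and a = 0 does not cause a division by zero.

```python
import math

def amplitude_and_phase(value):
    """Split a complex spectrum value a + jb into amplitude and phase (radians)."""
    a, b = value.real, value.imag
    amplitude = math.sqrt(a * a + b * b)
    phase = math.atan2(b, a)    # quadrant-aware version of arctan(b / a)
    return amplitude, phase

print(amplitude_and_phase(3 + 4j))    # amplitude 5, phase in the first quadrant
print(amplitude_and_phase(-1 + 0j))   # amplitude 1, phase pi (arctan(b/a) would give 0)
```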
X(ω), again the whole spectrum of x(t), is a periodic function of ω with period 2π and
the original signal x(t) can be reconstructed by means of the Inverse Fourier Transform,
defined as follows:
\[ x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega) \cdot e^{j\omega t} \, d\omega \]
The Fourier Transform as defined above applies only to continuous-time signals, e.g. waveforms.
Let’s see how it works with digital signals.
Figure 4.2: Two plots of static spectrum. The image represents the SPL against fre-
quency of a drum hit played by a robot (on the left), and a note of a violin (on the
right). The difference is noticeable, while the robot hit has apparently no harmonically
related frequency components, in the violin note this is clear.
where x[n] is the n-th value of a discrete-time signal N samples long. That is the reason
why the sum in the formula runs from −N/2 to N/2. $\omega_k = 2\pi \cdot (k/N)$ is the discrete
angular frequency, k is an integer going from 0 to N−1, and N must be chosen
even.
While X(k) is called the discrete spectrum, the k-th discrete frequency sample X(k)
is called the k-th frequency bin. In the DFT the relationship between discrete angular
frequency and frequency in Hz is:
\[ f = f_s \cdot \frac{\omega_k}{2\pi} \]
where $f_s = 1/T$ is the sampling frequency and T the period between samples.
Since k takes discrete values, the DFT assumes that x[n] can be represented by a finite
number of sinusoids; this means that the signal x[n] is band-limited in frequency. Besides,
the frequencies of the sinusoids are equally distributed between 0 Hz and the sampling
rate fs, or, in radians, between 0 and 2π.
The DFT internally masks the frequency-domain sampling function, because there
is a direct correspondence between the number of input samples and the number of
output frequencies.
The inverse DFT is defined as:
\[ x[n] = \frac{1}{N} \sum_{k=-N/2}^{N/2-1} X[k] \cdot e^{j\omega_k n} \]
There is a faster computational version of the DFT, called the FFT. The algorithm
used to compute the FFT allows complex products to be replaced by weighted
sums, so that the computational cost is reduced from $N^2$ to $N \cdot \log N$. This is still one
of the most widely used and advanced techniques for implementing the DFT on digital computers,
especially where a real-time DFT is needed or memory space is critical (e.g.
on chips).
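To make the notion of frequency bin concrete, the following Python sketch computes the DFT directly, with one complex product per pair (k, n), which is the $N^2$ cost mentioned above. Several choices are assumptions of the example: the sample index runs from 0 to N−1 instead of being centered on zero, the test tone is placed exactly on bin 4, and the sampling rate of 8000 Hz and the names `dft` and `bin_frequency` are arbitrary.

```python
import cmath
import math

def dft(x):
    """Direct DFT: N output bins, each a sum over the N input samples."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]

def bin_frequency(k, n_points, fs):
    """Frequency in Hz of bin k: f = fs * omega_k / (2 pi) = k * fs / N."""
    return k * fs / n_points

fs = 8000.0
n_points = 32
tone = [math.sin(2 * math.pi * 4 * n / n_points) for n in range(n_points)]

spectrum = dft(tone)
peak = max(range(n_points // 2 + 1), key=lambda k: abs(spectrum[k]))
print("peak bin:", peak, "->", bin_frequency(peak, n_points, fs), "Hz")
```

An FFT routine would return the same values as `dft`, only with fewer operations; the relationship between bin index and frequency does not change.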
Unfortunately both FT and DFT work only for periodic signals: in music only an
accurate note coming from a tuned musical instrument can be treated as a periodic
waveform, while most sounds are non-periodic and time-varying waveforms.
So, let’s now introduce the most widely used FT technique for musical purposes on digital
computers, the Short Time Fourier Transform.
take a breath.
Figure 4.3: Basic operation of the STFT used for sound analysis.
• windowed DFT, where the DFT (FFT) is computed over each windowed segment;
windows may overlap
• filterbank view, a bank of bandpass filters equally spaced across the frequency
domain (i.e. from 0Hz to Nyquist frequency)
4.3 – The Short Time Fourier Transform (STFT)
where the output, X[n,k], is the DFT of the windowed input at each discrete-time
n for each discrete frequency bin k. h[n − m] is the time-shifting window function that
follows the signal. m, in the general formulation can vary from −∞ to +∞ but can be
substituted with the appropriate length of the window. N is the number of points in the
spectrum.
The angular frequency associated with each bin k is:
\[ \omega_k = \frac{2\pi k}{N} \]
which corresponds to the frequency $f_k = k \cdot f_s / N$ in Hz.
Another formulation of the STFT is that of X. Serra, where H, the hop size, is the time
advance of the incoming signal, replacing the time-shifting window function.
It is also a function of two variables, and is defined as follows:
X −1
X[l,k] = {w[n] x[n + lH]} · e−jωk n
n=0
Now w[n] is a real window, l indicates the frame passing through the window and the
exponential function is the same as before. X[l,k], the spectrum, is the DFT of the sequence
w[n]x[n + lH] for 0 ≤ n ≤ N − 1. The spectrum is computed at every frame l,
advancing by H along the input signal x[n].
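The frame-by-frame definition translates almost literally into code. The following Python sketch is an illustration under stated assumptions: it uses a Hann window, a direct DFT instead of an FFT (for brevity), a window length equal to the number of spectrum points N, and it simply drops the incomplete frame at the end of the signal. The names `hann` and `stft` are hypothetical.

```python
import cmath
import math

def hann(length):
    """Periodic Hann window of the given length."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]

def stft(x, window, hop):
    """Return a list of spectra X[l][k], one per frame l, advancing by hop samples."""
    n_points = len(window)
    frames = []
    l = 0
    while l * hop + n_points <= len(x):
        # Windowed segment w[n] * x[n + l*H] for n = 0 .. N-1
        segment = [window[n] * x[n + l * hop] for n in range(n_points)]
        spectrum = [
            sum(segment[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
                for n in range(n_points))
            for k in range(n_points)
        ]
        frames.append(spectrum)
        l += 1
    return frames

# Stationary test tone centered on bin 2 of a 16-point analysis.
signal = [math.sin(2 * math.pi * 2 * n / 16) for n in range(64)]
frames = stft(signal, hann(16), hop=8)
peaks = [max(range(9), key=lambda k: abs(frame[k])) for frame in frames]
print(len(frames), "frames, peak bins:", peaks)
```

For a stationary tone every frame shows the same peak bin; with a time-varying sound the peak would move from frame to frame, which is exactly the information the STFT is meant to expose.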
Figure 4.4: Waterfall spectrum, a 3D representation of the STFT spectrum. The graph
was obtained with Spectutils package for GNU Octave. The analysis parameters of the
STFT are shown above the figure, the audio sample analyzed is extracted from Laurie
Anderson’s Violin Solo.
The STFT interpreted in this way can be seen as a filterbank which performs analysis
in parallel on each windowed segment of the input signal. For every frame of the input
signal, the complex values returned by the n filters can describe n sinusoids. The filterbank
view was the basis for the phase vocoder analysis/resynthesis technique, and inspired the
constant Q method proposed later. The filterbank view is therefore an abstraction, used
in computing the STFT with a programming language on a digital computer.
The STFT output (in both views) is a series of spectra, one for each frame of the
input signal. Each spectrum has a real and an imaginary part, which can be easily converted
into magnitude and phase values. In the filterbank view, frequency plays a more important
role than phase. Instantaneous frequency is therefore calculated by converting the phase
value by the method of phase unwrapping³, to obtain a sinusoid with the obtained
3
Phase unwrapping ensures that all appropriate multiples of 2π have been included in Θ(ω)
Figure 4.5: Types of windows used in STFT for audio analysis. No ideal window exists;
the term "optimal window" is preferred. Several types of windows are used; for musical
purposes the Kaiser window is usually preferred.
Another point in the choice of the window length is between odd and even length.
For phase detection a zero-phase window is better because the windowing process won’t
modify the phase of the analysis waveform. Therefore an odd window length is preferred,
with the middle sample centered at the time origin of the analysis window.
There are many standard window functions used for STFT purposes:
• rectangular
• Hamming
• Hanning or Hann
• Gaussian
• Blackman
• Blackman-Harris
• Kaiser
Gaussian, Hamming and Kaiser are the most often used. The Kaiser window has
characteristics which are well suited to musical contexts.
The ratio between the window length M and the hop size H is called the overlap factor (if H ≥ M the analysis windows
do not overlap). For example, if H = M = 1024 and fs = 44100 Hz, the time resolution over
the input waveform is 1024/44100 ≈ 23 ms; if the overlap factor is 8, the time resolution
becomes 2.9 ms.
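The arithmetic of this example can be reproduced with a short Python sketch. It assumes that an overlap factor of 8 means a hop size of M/8 samples, which is the reading consistent with the figures quoted above; the function name `time_resolution_ms` is hypothetical.

```python
def time_resolution_ms(window_length, overlap_factor, fs):
    """Time advance between consecutive frames, in milliseconds."""
    hop = window_length / overlap_factor   # hop size H in samples (assumed M / factor)
    return 1000.0 * hop / fs

no_overlap = time_resolution_ms(1024, 1, 44100)   # H = M = 1024 samples
overlap_8 = time_resolution_ms(1024, 8, 44100)    # H = 128 samples

print(round(no_overlap, 1), "ms without overlap")
print(round(overlap_8, 1), "ms with overlap factor 8")
```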
A greater overlap factor will generally give better analysis results, but also a greater
computational cost. Hence, the overlap factor has to be chosen according to the input waveform
characteristics, i.e. fast-changing waveforms need more overlap. There are some general
criteria for determining an efficient overlap factor; the most general is to choose the overlap
in such a way that all the data are equally weighted, as in the case of overlap-add synthesis
presented later.
Another criterion is to choose the overlap factor according to the nature of the window function,
that is, overlapping windows should add up exactly to a constant value, e.g. 1. For
a rectangular window this is easy to obtain: the hop size can simply be M/i, with i any
positive integer. If consecutive analysis windows add up to a constant, no
amplitude deviation is possible, hence successive windowing operations will not apply
amplitude modulation to the input waveform.
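Whether a given window and hop size add up to a constant can be checked numerically by summing shifted copies of the window. The Python sketch below is a minimal illustration: the helper names `overlap_sum` and `is_constant`, the window length of 8 samples and the choice of a periodic Hann window are assumptions of the example. Only the central region, where the windows fully overlap, is inspected.

```python
import math

def overlap_sum(window, hop, n_frames=16):
    """Sum n_frames copies of the window, each shifted by hop samples."""
    total = [0.0] * (hop * (n_frames - 1) + len(window))
    for frame in range(n_frames):
        for n, w in enumerate(window):
            total[frame * hop + n] += w
    # Discard the fade-in and fade-out regions at the two ends.
    return total[len(window):len(total) - len(window)]

def is_constant(values, tolerance=1e-9):
    return max(values) - min(values) < tolerance

M = 8
rectangular = [1.0] * M
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / M) for n in range(M)]

print(is_constant(overlap_sum(rectangular, M // 2)))   # hop M/2: constant (equal to 2)
print(is_constant(overlap_sum(hann, M // 2)))          # hop M/2: constant (equal to 1)
print(is_constant(overlap_sum(hann, 3)))               # hop 3: the sum ripples
```

For the rectangular window with hop M/i the constant is i rather than 1, so a scale factor of 1/i is needed if the sum must be exactly 1.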
To summarize, the STFT operation is applied to a stream of input samples and
results in a series of frames that, one after another, produce a time-varying spectrum,
giving the impression of a continuous spectrum.
The four parameters to choose in designing an efficient STFT can be summarized as
follows:
• window shape
• window length
• FFT size
• hop size
4.4 – Constant-Q analysis
The overlap-add resynthesis method, due to Allen and Rabiner (1977), states that we can
reconstruct each windowed segment of the original waveform starting from the spectrum
components by applying the ISTFT to each frame. It takes the magnitude and phase
values of each spectrum to generate a time-domain waveform with the same envelope as
the analysis window used to compute the STFT. Then the resynthesized time-domain
segments are overlapped and added to reconstruct the original waveform.
In theory, the overlap-add process is an identity operation (i.e. the reconstructed
signal equals the original) in the mathematical sense, only if the overlapped and added
windows sum to a constant. This would mean that we could pass the signal countless times through the
STFT and back to the original with the ISTFT; however, even good implementations of the
STFT lose a small amount of information, so in practice this is impossible.
OA resynthesis is not the only method for resynthesizing the original waveform
based on the STFT; many others are possible. The weighted overlap-add method is similar to
OA; the difference lies in the transformation applied to the window function before
resynthesis. The analysis window and the synthesis window must maintain the identity
property; this is achieved by respecting the relationship:
\[ \sum_{n=-\infty}^{\infty} w[m - nH] = c \]
where c is a constant.
The synthesis window is needed when, before resynthesis, a transformation is applied
to the phase spectrum, which can create phase discontinuities at the frame boundaries.
Oscillator-bank resynthesis (also called sinusoidal additive resynthesis, SAR) is another
method, in which analysis data (magnitude and phase) are converted into synthesis
data (amplitude and frequency) in order to drive, for each frame, a bank of oscillators whose outputs
are then summed to recreate the original signal. This method is free from the add-to-constant
rule of OA, because the converted spectrum is more robust against any digital
processing transformation applied before synthesis.
The SAR method can be applied to the filterbank interpretation of the STFT, by matching
each frequency bin to a sine wave and then summing all the sine waves for synthesis [54].
Figure 4.6: Spacing of filters for STFT (filterbank view) on the top and Constant-Q
filterbank on the bottom. It is clear the advantage of the Constant-Q filterbank method,
which places the filters linearly against log(frequency), which is similar to the frequency
response of the human ear.
analysis. Our aim in this thesis is to demonstrate its effectiveness in a
specific sound analysis task, onset detection, applied to the case of ·O M M· (next
chapter), and its perceptually grounded approach applied to the recognition of the sound
that has been played (final chapter).
The constant-Q transform has advantages over the Fourier transform, which lie in
musical aspects. The STFT computes frequency components (frequency bins) on a
linear scale, which means that it expresses frequencies with a fixed resolution or bandwidth. This
method has a drawback, because it frequently results in inadequate resolution for
low musical frequencies and excessive resolution for high frequencies. The choice of
the frequency resolution comes down to the choice of the appropriate window length,
the one which best fits the resolution needed (i.e. the lowest frequency content that can be resolved).
Moreover, high frequency resolution means poor time resolution and vice versa; that is,
there is always a tradeoff between time and frequency resolution in STFT-based analysis.
This problem will be discussed through an example: suppose the sampling
frequency is fs = 44100 Hz and the window length is N = 1024 samples. The frequency
bins that can be analyzed will be 512, equally spaced over the bandwidth, i.e. from 0 to
20 KHz. Increasing the sample rate, e.g fs =96 KHz will not increase the frequency res-
olution of analysis but will only widen the bandwidth up to 48 KHz. To get an increase
in frequency resolution, one must choose a larger window length. The limit example
is that to obtain a frequency resolution of 1 Hz, a window length up to 44100 samples
must be chosen, by sacrificing the time resolution to 1 s! Conversely, if time resolution
needed is 1 ms, i.e. to analyze 1000 events per second, the window length should be
44 samples, thus the frequency resolution will be about 1000 Hz!
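The figures quoted in this example follow from the two relations δf = fs/N and δt = N/fs.
A minimal Python sketch reproduces them; the function name stft_resolution is ours and is
introduced only for illustration.

```python
def stft_resolution(fs, n):
    """Return (frequency resolution in Hz, time resolution in s) of an N-sample window."""
    return fs / n, n / fs


# The cases discussed in the text: (sampling rate in Hz, window length in samples)
for fs, n in [(44100, 1024), (96000, 1024), (44100, 44100), (44100, 44)]:
    df, dt = stft_resolution(fs, n)
    print(f"fs = {fs} Hz, N = {n}: df = {df:.1f} Hz, dt = {dt * 1000:.1f} ms")
```

Note that, with N fixed, raising fs from 44100 Hz to 96 kHz makes the bin spacing fs/N
coarser rather than finer, which is consistent with the remark above.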
Now let’s see a practical example that introduces the need for a different analysis tool.
Suppose the task is to resolve (to analyze separately) the fundamental frequencies of the
notes of a piano, and suppose the two lowest notes are spaced, for example, 2.5 Hz apart.
The analysis window must be chosen N = 16384 samples long, so that for fs = 44100 Hz the
frequency resolution is fs/N ≈ 2.7 Hz. This not only results in a bad time resolution
(about 370 ms); the real problem is that the same fine frequency resolution is wasted on
the higher notes, where the spacing between notes is much larger than 2.5 Hz. Moreover,
what we said in the previous chapter about the perception of frequencies is completely
neglected here. However, since the STFT is performed via the FFT, the time required for the
output is extremely low and a real-time implementation does not constitute a problem,
although much of the data is useless and must be discarded after analysis. The reduction
in complexity achieved by the FFT algorithm is therefore the principal reason behind the
wide use of the STFT in sound analysis.
The constant-Q transform constitutes an alternative to the fixed frequency resolution of
the Fourier transform. In a constant-Q transform the bandwidth of each frequency bin
varies proportionally with frequency. In the next section, we’ll see a typical
implementation of the constant-Q transform for musical analysis, applied to the case of a
piano. But first, we should take a look at the waterfall spectrograms in figures 4.7 and
4.8, taken from Brown [10]. The two pictures clearly point out the advantage, from a
musical point of view, of representing a musical signal with the constant-Q transform.
This is especially clear when they are compared to the previous image (figure 4.4), which
shows an STFT waterfall spectrum.
4 – ...To Sound Spectrum Analysis
where f will vary from fmin to an upper frequency chosen below the Nyquist frequency.
The minimum frequency fmin can be chosen to be the lowest frequency about which
information is desired, e.g. a frequency just below that of the G string for calculations
on sound produced by a violin. The resolution or bandwidth δf of the discrete Fourier
transform is equal to the sampling rate divided by the window size (the number of
samples analyzed in the time domain). In order for the ratio of frequency to bandwidth
to be a constant (constant Q), the window size must vary inversely with frequency.
More precisely, the quality factor required for quarter-tone resolution is:
$$Q = \frac{f}{\delta f} = \frac{f}{0.029 f} \approx 34$$
where the quality factor Q defines the bandwidth as δf = f/Q. With a sampling frequency
fs = 1/T, where T is the sample period, the length of the window in samples at frequency
fk is:
$$N[k] = \frac{f_s}{\delta f_k} = \frac{f_s}{f_k}\, Q$$
4
Equal temperament is a musical temperament, or a system of tuning, in which every pair of adjacent
notes has an identical frequency ratio. In equal temperament tunings an interval, usually the octave, is
divided into a series of equal steps (equal frequency ratios).
Note also from this equation that the window contains Q complete cycles for each frequency
fk, since the period in samples is fs/fk. This makes physical sense since, in order to
distinguish between fk+1 and fk when their ratio is, e.g., 2^(1/24) ≈ 34/33, we must look
at at least 33 cycles. It is also interesting, for comparison, to consider the
conventional discrete Fourier transform in terms of the quality factor Q = f/δf. We find
that f/δf is equal to the coefficient number k, which is also the number of periods in
the fixed window for that frequency.
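As an illustration of the relation N[k] = (fs/fk) Q, the following Python sketch lists the
centre frequencies and window lengths of a quarter-tone constant-Q analysis. The function
name, the value fmin = 175 Hz (an assumed frequency just below the violin G string at
196 Hz) and the upper limit of 20 kHz are assumptions made for this example; they are not
taken from Brown's implementation.

```python
import math


def constant_q_windows(fs, f_min, f_max, bins_per_octave=24):
    """Return Q and a list of (centre frequency, window length) pairs.

    The window length follows N[k] = Q * fs / f_k, rounded up to an integer.
    """
    ratio = 2.0 ** (1.0 / bins_per_octave)  # frequency ratio between adjacent bins
    q = 1.0 / (ratio - 1.0)                 # Q = f / delta_f, about 34 for quarter tones
    bins = []
    k = 0
    while True:
        f_k = f_min * ratio ** k
        if f_k > f_max:
            break
        bins.append((f_k, math.ceil(q * fs / f_k)))
        k += 1
    return q, bins


# Hypothetical parameters: fs = 44100 Hz, f_min just below the violin G string
q, bins = constant_q_windows(44100, 175.0, 20000.0)
print(f"Q = {q:.1f}, number of bins = {len(bins)}")
for f_k, n_k in bins[:3] + bins[-3:]:
    print(f"f_k = {f_k:9.1f} Hz, N[k] = {n_k} samples")
```

The window is long at low frequencies and short at high frequencies, so every bin always
contains about Q cycles of its own centre frequency.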
The constant-Q transform has shown good results in sound analysis tasks; especially for
the identification of musical notes, it proves to be a more appropriate spectral
representation thanks to its geometrically spaced frequency channels. It should not,
however, be considered a good starting point for musical synthesis, because its inverse is
problematic: inverse transforms have been proposed but not successfully implemented for
musical purposes.
This method will be proposed, in the final chapter, to achieve the goal of onset
detection.
Chapter 5
To generate^1 and process acoustical signals is to compose music, more directly than
inscribing ink on paper. Curtis Roads [44].
For our purpose, which is to analyze the musical flow of the robotic orchestra, we need a
flexible and extensible platform. For this reason we discarded a priori the use of
numerical analysis software such as Matlab/Octave (we did use them, but for other
purposes). Although they offer many packages (free, in the case of GNU Octave) tuned for
sound analysis and processing ([63][47] and [32]), our need is to integrate the analysis
into the Show Control System^2 of the orchestra; in a musical context, other software is
recommended. One of the pillars in this category is Max/MSP, the software already used
for the development of the ·O M M· SCS.
In this chapter the main software packages for real-time audio applications are covered,
from basic to advanced features (the latter in the case of Max/MSP, the software we chose
for our purpose). At the end of the chapter we give an overview of the typical
applications these packages are designed for, and a comparison between text-based (Unix
style, terminal-like window) and graphical software.
1
Musical sound synthesis.
2
Show control system (entertainment) is a generic term for a system (possibly very complex) whose main
feature is to coordinate all the different systems (audio, video, MIDI, OSC, ...) that control the
hardware of which a show is made. In our case the SCS coordinates the robot musicians, the
gestural controller and possibly other features (still to be experimented with).
5 – Real-Time Audio Applications
5.1 Max/MSP
Max/MSP owes the first part of its name to Max Mathews^3, who in 1957 wrote the first
computer program specifically for sound generation^4. Max was also the original name of
the software, developed by Miller S. Puckette at IRCAM in the mid 80s and commercially
distributed since the early 90s. MSP is a package for real-time DSP (standing for Max
Signal Processing, or the initials of Miller S. Puckette), added to the software in 1997.
Because of its graphical (but minimal) nature, Max/MSP differs from most MUSIC-N
languages: Max can be considered a visual programming language. Visual programming lets
you graphically connect objects together with patch cords to design interactive software.
This resembles the usual way of designing programs; think of flowcharts or of more modern
techniques such as UML. The difference is that in a flowchart the blocks represent code
that has yet to be written, while in Max the code is already written.
Since Max uses icons to represent objects written in a high-level language, Max is a
meta-language, following the paradigm "programs can write programs". Max/MSP
distinguishes between two levels of timing: that of an "event" scheduler and that of the
DSP (similar to the distinction between control-rate and audio-rate processes in Csound,
a direct descendant of MUSIC).
With Max you can also control external hardware, read data from sensors, and exchange
audio and data with other software, as well as generate and analyze sounds and create
musical instruments, video and animation. All these features make Max a popular choice
for composing interactive media works, above all because of its approachable graphical
interface, its extensive bindings to media processes and protocols, and its open-ended
philosophy.
A short description of the principal Max features follows:
be opened up and filled with objects which will continue to work after the patcher object
is folded up again. The action flows from the top down. When an object is tweaked
by the user or MIDI comes in, a message is sent to any connected objects, which react
with messages of their own. Only one thing happens at a time, but it’s all so fast that it
seems instantaneous. When a pathway branches, messages are sent to the right-hand
destinations before the left-hand ones.
Objects
The name of the object represents what it does. There are a few hundred objects
included with Max, ranging in complexity from simple math to full featured sequencers.
Arguments, if present, specify initial values for the object to work with. Data comes into
the object via the inlets, and results are put out the outlets. Each inlet or outlet on an
object has a specific meaning. This will be displayed in a flag as the mouse passes by
(further details are in the manual). Usually, input to the left inlet triggers the operation
of the object. For instance, the delay object (as shown) will send a bang message out
the outlet 500 milliseconds after a bang is received in the left inlet. Data applied to the
right inlet will change the delay time.
Messages
Data bytes sent down the patch cords are called messages, which fall into one of the
following types:
• list Several of the above, separated by spaces. The first element of a list must be
a number.
Audio signals are sent in yellow patch cords. These are little packets of data, but
sent so fast as to be effectively continuous. Jitter signals are sent via green patch cords.
Jitter messages are names of matrices that hold data for jit.objects to process. Every
object responds to a variety of messages. If a message won’t work, a warning will appear
in the Max window.
Max windows
The Max window contains information sent from Max (like error messages) or things
you might like to print. It’s sort of a terminal window.
Max runtime
Max is not required to run a finished patch. Anyone can download Max/MSP Runtime
for free, which will run patches but not edit them. There is also a process for converting
patches into stand-alone applications.
Pure Data
PD is the open-source twin of Max, released under a BSD license and developed by the same
author, Miller Puckette, since 1996. It offers the same capabilities as Max, with small
differences explained by the author in [42] and [38].
5.2 CSound
CSound is one of the better-known textual interfaces for computer music composition.
CSound was originally written by Barry Vercoe at MIT in 1985, based on languages
of the Music-N family, and continues to be developed today. At its core, CSound is
“designed around the notion that the composer creates a synthesis orchestra and a score
that references the orchestra.”
Csound files were originally processed in non real-time to render sonic output, in a
“process referred to as ‘sound rendering’ as analogous to the process of ‘image rendering’
in the world of computer graphics.” [55]. Csound instruments are defined in the orchestra
file as directed graphs of unit generator types (called ‘opcodes’). Flexible sound routing
can be achieved using control and audio busses via the Zak objects. The distinction
between audio rate and control rate is explicit in CSound through the a-rate and k-rate
notations.
The strong separation of synthesis and temporal event definition imposes a strict
limitation on the scope for algorithmic composition: new synthesis processes cannot
be defined in response to temporal events, and new temporal events cannot occur in
response to the synthesis output. “Csound is very powerful for certain tasks (sound
synthesis) while not particularly suited to others (data management and manipulation,
etc.).” [55]
5.3 Supercollider
SuperCollider is a high-level music programming language, designed specifically for
dynamic and generative structures and for the synthesis of computer music. It can be
applied to many different approaches to composition and improvisation rather than to any
particular preconceived model. It features an application-specific high-level programming
language, SCLang (inspired by C++), with extensive data-description and functional
programming capabilities, and support functions for common musical needs.
SuperCollider also provides several libraries of unit generators for signal processing.
Sample-rate and control-rate distinctions are made explicit via the .ar and .kr notation.
A key distinction from CSound is that code can be evaluated in real time as the program
runs.
SuperCollider is ideal for algorithmic composition. Since version 3.0 (the currently
available version), graphs of unit generators are defined textually and compiled at run-
time into dynamic libraries (‘SynthDefs’) to be loaded as instruments (‘synths’) by the
synthesis engine (‘SCServer’), all under control of the language.
The separation of language and synthesis into distinct processes in version 3.0 in-
troduces compilation and performance optimizations, but also implies limitations in the
degree of temporal control: “Because instruments are compiled into code, it is not pos-
sible to generate patches programmatically at the time of the event as one could in SC2.
In SC2, an audio trigger could suspend the signal processing, run some composition
code, and then resume signal processing. In SC Server, messaging between the engines
causes a certain amount of latency.”[34]
SuperCollider 3.0 therefore represents a return to the CSound model of orchestra
and score, in which however the score is procedural rather than declarative.
5.4 Chuck
ChucK represents one of the only contemporary options that avoids latency in the pro-
cedural control of synthesis. It also provides a library of unit generators to be freely
instantiated and connected into graphs within ChucK scripts. The authors refer to
ChucK as ‘strongly timed’, which can be defined as follows:
Like SuperCollider’s SCLang, the ChucK language was written especially for the ChucK
software. It is a high-level interpreted programming language.
Table 5.1: Musical software for realtime synthesis and control

Name       Creator          Typical purposes             First release date  Recent release (2009)   License               Development status
Max/MSP    Miller Puckette  Realtime synthesis,          mid-1980s           Max 5.0.7               Commercial software   Mature
                            hardware control                                                         (Cycling’74)
Pure Data  Miller Puckette  Realtime synthesis,          1990s               pd-extended (0.41.4),   BSD-like              Stable
                            hardware control                                 pd-vanilla (0.42.5)
Csound     Barry Vercoe     Realtime synthesis,          1986                Csound 5.10             LGPL                  Mature
                            algorithmic composition^a,
                            audio rendering

a
Algorithmic composition is the technique of using algorithms to create music.
b
Live-coding is the name given to the process of writing code to modify software in realtime as part of a performance. Most generally,
writing (parts of) programs while they run. It’s sometimes known as "interactive programming" or "on-the-fly programming".
Chapter 6
In the first part of chapter 2 we presented the attributes of sound as perceived through
the human auditory system. We now focus the discussion on a particular aspect of this
mechanism: the perception of time events, especially those related to the initial portion
of sounds. By applying the computer music techniques discussed in chapters 3 and 4, in
particular digital filters and constant-Q analysis, we’ll try to simulate the
functionality of the auditory system to achieve the goal of event detection. The subject
of our research is widely treated in the literature and must be preceded by some
fundamental definitions, first of all those of the attack and the onset of a sound. Then
some of the most advanced techniques for onset and attack detection will be presented and
finally our method, based on bonk∼ for Max/MSP, will complete this chapter.
First of all, however, the project on which we are working is briefly presented.
6 – Perceptual Onset Detection
Figure 6.1: The two robots on the sides, SCS + the performer in the middle.
the Futurism, is ready for launch in November, after having been presented in October
2008.
In the organization of this thesis, this chapter is the right place to introduce the
project we are working on, before going into the discussion on the recognition of
particular sound onsets. Figure 6.1, which shows the whole show control system of
·O M M·, is provided to help visualize it. The two robotic drummers, each with two arms,
can play at up to 120 bpm with each arm (i.e. up to four simultaneous^3 hits) a score
sent to them via MIDI^4 from a computer. On the same computer, a complex Max/MSP patch
processes the input data received from the exoskeleton, calibrates it, and lets the
output modify the rhythmic pattern in real time^5. Since the robots play big and
cumbersome drums, two oil barrels of regular size, the music produced consists of loud
percussive sounds, in the style of the French band Les Tambours du Bronx.
3
Perceptually simultaneous. A minimum delay (less than 15 ms) between the mechanical actions of
the two arms of a robot must be ensured between two electrical transmissions.
4
Musical Instrument Digital Interface.
5
Quasi-realtime, to be precise.
6.2 – From Transient to Attack and Onset Definitions
Figure 6.2: On the top, the waveform corresponding to a hit of a robot percussionist
of ·O M M· . On the bottom, the intensity profile of the hit (using Praat), where onset,
attack and the transient/steady state separation are highlighted.
transients are normally considered the principal part of a percussive sound; once they
have been characterized, the steady state portions can be derived, albeit approximately,
by applying an appropriate synthesis stage which recreates the slow decay at the
estimated resonance of the drum.
The actual meaning of onset, which comes from psychoacoustics, allows us to define a
transient according to the behavior of the three perceptual attributes presented in
chapter 2. At a transient, we can observe the following behavior:
For the purpose of this thesis, the transient/steady-state separation is left out of the
following discussion (see [56] for further details) and attention will be focused on the
transient region, in particular on onset and attack recognition. But first of all, let’s
spend a few words on the importance of onset detection, often kept hidden, in most
musical software applications.
Figure 6.3: From top to bottom: waveform, static spectrum (FFT) and time-varying
spectrum (STFT). From right to left: one hit of ·O M M· robot, one hit of snare drum.
2. reduction: reduction is a process through which the complex signal, here considered
as a sum of sinusoids or oscillators, is simplified for analysis (e.g. by subsampling).
The simplified signal, which must reflect the local structure of the original, should
enhance the transient characteristics while de-emphasizing the steady state. This
operation is critical and has been proposed in many forms, which can be grouped into two
categories: methods that make use of explicit signal features (i.e. energy, frequency,
phase) and methods based on probabilistic models, which approximate the signal’s
behavior. The function obtained after reduction of the original signal can be called the
detection function [4] or observation function [50].
6.3 – General Scheme for Onset Detection
It is not difficult to imagine that overlapping sounds, noise, musical effects (e.g.
vibrato and tremolo^12) or modulations are just some examples of the difficulties
that a peak-picking stage can encounter. That’s why the final decision of the
peak-picking algorithm may, in certain cases, be preceded by post-processing
and thresholding stages.
In the next sections we will adopt the term detection function, described in point 2;
every detection scheme has its own detection function. Our method based on bonk∼ will
follow, after a discussion of the modern techniques, which are widely treated in the
literature and well summarized in [4][50] and [31].
• If the signal is strongly percussive (e.g. drums), time-domain methods are also
adequate (i.e. method based on thresholding the amplitude).
• Spectral methods, such as those based on phase distributions and spectral differ-
ence perform relatively well on strongly pitched transients. [4]
• The complex-domain spectral difference seems to be a good choice in general, at
the cost of a slight increase in computational complexity. [21][56]
• When precise time localization is required, then wavelet methods can be useful,
possibly in combination with another method. [21][22]
• If a high computational load is acceptable, and a suitable training set is available,
then statistical methods give the best overall results, and are less dependent on a
particular choice of parameters. [4] for introduction and [24][2] for more detail.
12
Vibrato and tremolo are two important musical effects. Vibrato is produced, in singing and musical
instruments, by a regular pulsating change of pitch, and is used to add expression and vocal-like qualities
to instrumental music. Tremolo usually refers to periodic variations in the amplitude of a musical note
(or in singing). Depth and speed of vibrato/tremolo determine the amount and speed of pitch/amplitude
changes. It is difficult to achieve, with singing voice, separated variations in pitch and amplitude, they
will usually be achieved at the same time; that’s why the two terms are sometimes confused. In digital
signal processing, vibrato and tremolo are easier to achieve separately.
Spectral Difference
This idea can be extended to obtain a more appropriate detection function by considering
frames of the STFT. We recall that a generic STFT frame of a waveform is given by:
$$X[n,k] = \sum_{m=-\infty}^{\infty} \{ x[m]\, h[n-m] \}\, e^{-j\omega_k n}$$
where k = 0,1,...,N − 1 is the frequency bin index and h the finite-length sliding
window^14. As in the previous method, if we now take the first difference between the
magnitudes of consecutive STFT frames, that is:
$$\delta X = \sum_{k=1}^{N} \left( |X[n,k]| - |X[n-1,k]| \right)$$
13
The first difference is the difference between two consecutive samples, in this case each sample
describes the energy content of the sampled waveform.
14
See chapter 4 for details on the STFT.
this measure, known as the spectral difference, can be used to build an efficient onset
detection function. Energy-based algorithms are fast and easy to implement, but their
effectiveness decreases with non-percussive sounds or when the transient energy comes
from overlapping (and more complex, e.g. strongly pitched) sounds.
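The spectral difference described above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the frames are Hann-windowed and
transformed with a directly computed DFT, only the non-redundant bins 0..N/2 are summed,
and the frame length, hop size and function names are hypothetical choices of ours. The
difference is left signed, as in the formula; many practical detectors keep only its
positive part.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 of a Hann-windowed frame."""
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, s in enumerate(frame)]
    return [abs(sum(windowed[m] * cmath.exp(-2j * math.pi * k * m / n)
                    for m in range(n)))
            for k in range(n // 2 + 1)]


def spectral_difference(signal, frame_len=64, hop=32):
    """Sum over bins of |X[n,k]| - |X[n-1,k]| for each pair of consecutive frames."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    mags = [magnitude_spectrum(signal[s:s + frame_len]) for s in starts]
    return [sum(c - p for c, p in zip(cur, prev))
            for prev, cur in zip(mags, mags[1:])]


# Synthetic test signal: silence followed by a 1 kHz sinusoid (fs = 8000 Hz)
fs = 8000
signal = [0.0] * 256 + [math.sin(2 * math.pi * 1000 * i / fs) for i in range(256)]
detection = spectral_difference(signal)
peak = max(range(len(detection)), key=lambda i: detection[i])
print("largest spectral difference at frame pair", peak)
```

The detection function stays at zero during the silence and during the stationary part of
the sinusoid, and departs from zero only around the frames that contain the start of the
tone.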
in the case of a given stationary sinusoid (i.e. extracted from the steady state portion
of the signal). For a steady state sinusoid, the phase in a single frame of the STFT,
together with the phase in the previous frame, is used to calculate a value for the
instantaneous frequency. An estimate of the instantaneous frequency for bin k of the STFT
frame n is:
$$f_k(n) = \left( \frac{\varphi_k(n) - \varphi_k(n-1)}{2\pi h} \right) f_s$$
where h is the hop size between windows and fs the sampling rate.
For a stationary sinusoid, the instantaneous frequency is expected to be approximately
constant over adjacent windows. This is equivalent to saying that the phase increment
between adjacent windows remains approximately constant, which is expressed by:
$$\varphi_k(n) - \varphi_k(n-1) \simeq \varphi_k(n-1) - \varphi_k(n-2)$$
Equivalently, the phase deviation can be defined as the second difference of the phase,
which is:
$$\Delta\varphi_k(n) = \varphi_k(n) - 2\varphi_k(n-1) + \varphi_k(n-2) \simeq 0$$
During a transient region, the instantaneous frequency is usually not well defined, and
hence the phase deviation will tend to be large. This is illustrated in figure 6.4. Bello
proposes a method that analyzes the instantaneous distribution (in the sense of a
probability distribution or histogram) of phase deviations across the frequency domain.
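To make the definition of the phase deviation concrete, the following Python sketch
computes the second difference of the phase for one bin over three consecutive frames and
wraps the result into [−π, π). It is a sketch under assumptions of ours (Hann-windowed
frames, a directly computed DFT bin, hypothetical function names and parameters), not the
implementation used by Bello.

```python
import cmath
import math


def bin_phase(frame, k):
    """Phase of DFT bin k of a Hann-windowed frame."""
    n = len(frame)
    acc = 0j
    for m, sample in enumerate(frame):
        w = 0.5 - 0.5 * math.cos(2 * math.pi * m / n)
        acc += sample * w * cmath.exp(-2j * math.pi * k * m / n)
    return cmath.phase(acc)


def princarg(phi):
    """Wrap a phase value into the interval [-pi, pi)."""
    return (phi + math.pi) % (2 * math.pi) - math.pi


def phase_deviation(signal, start, k, frame_len=64, hop=16):
    """phi_k(n) - 2 phi_k(n-1) + phi_k(n-2), wrapped; frame n begins at `start`."""
    phases = [bin_phase(signal[start - i * hop:start - i * hop + frame_len], k)
              for i in range(3)]
    return princarg(phases[0] - 2 * phases[1] + phases[2])


# Stationary sinusoid at 1100 Hz (fs = 8000 Hz), analysed at the nearest bin (k = 9)
fs = 8000
sine = [math.sin(2 * math.pi * 1100 * i / fs + 0.3) for i in range(200)]
print("phase deviation on a steady sinusoid:", phase_deviation(sine, 64, 9))
```

For the stationary sinusoid the value is close to zero, in agreement with the formula
above; around a transient the three phases are no longer linearly related and the
deviation is in general not small.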
During the steady-state part of a sound, deviations tend to zero, thus the distribution
is strongly peaked around this value. During attack transients, values increase, widening
and flattening the distribution. However, this method, although it shows some improvement
for complex signals, is susceptible to phase distortion and to noise introduced by the
phases of components with no significant energy. Finally, why not combine the phase and
energy approaches? Again Bello gives the answer, proposing this solution to the detection
task in [5].
Phase, energy, and phase/energy approaches are not the only methods applied to the onset
detection task. Several other methods have been proposed, in particular stochastic and
statistical methods ([2] for example and [4] for comparison), which are very different
from those above. A Deterministic Plus Stochastic model (such as the one described by
Serra in [54]) for specific onset detection has recently been presented by Gifford and
Brown in [24].
But let’s now introduce the perceptually based approach that we used for the task of
onset detection.
6.4 – Introduction to the Perceptual Based Approach to Onset Detection
Figure 6.4: Unwrapped phase deviation between two adjacent analysis frames. ∆ϕn,k
is the unwrapped phase deviation. In the simpler case of a steady state sinusoid, the
phase deviation is approximately zero and constant across all the analysis frames, while
during a transient the phase deviation should be extremely large and easy to detect.
mechanisms, which encodes both time and frequency effects, determines the subjective
perception of sound onsets. The principal limits are imposed by time resolution and fre-
quency masking effects (both explained in chapter 2). Moreover, overlapping pitched and
non-pitched sounds (even percussive pitched sounds) could obfuscate the perception of
pitch and also delay or obscure one or more adjacent onsets. Let’s see an example, before
introducing the perceptual method applied to ·O M M· , based on bonk∼ for Max/MSP.
Band-Wise Processing
Scheirer [51] was the first to clearly demonstrate that an onset detection algorithm
should follow the human auditory system, treating frequency bands separately and
combining the results at the end. An earlier system described by Bilmes was similar, but
it used only a high-frequency and a low-frequency band, which Bilmes himself judged not
very effective [8].
Scheirer [51] described a psychoacoustic demonstration on beat perception, which shows
that certain kinds of signal simplification can be performed without affecting the
perceived rhythmic content of a musical signal. When the signal is divided into at least
four frequency bands and the corresponding bands of a noise signal are controlled by the
amplitude envelopes of the musical signal, the noise signal will have a rhythmic percept
which shows significant similarities to the original signal. On the other hand, this does
not hold if only one band is used, in which case the original signal is no longer
recognizable from its simplified form (detection function).
The method proposed by Klapuri [30] is the most significant example of the successful
application of psychoacoustic knowledge to the onset detection task. It uses the
band-wise processing principle introduced by Bilmes and Scheirer. The procedure is the
following:
2. a filterbank divides the signal into 21 non-overlapping bands and, at each band,
the onset components are detected and their time and intensity are determined,
6.5 – Onset Detection in ·O M M·
the remaining eighteen are third-octave band-pass filters^18. All subsequent calculations
can be done one band at a time. This reduces the memory requirements of the algorithm
in the case of long input signals, provided that parallel processing is not desired.
The output of each filter is full-wave rectified and then decimated by a factor of 180 to
ease the following computations. Amplitude envelopes are calculated by convolving the
band-limited signals with a 100 ms half-Hanning (raised cosine) window. This window
performs much the same energy integration as the human auditory system, preserving
sudden changes but masking rapid modulation [30].
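The envelope extraction for one band can be sketched as follows. This is a minimal Python
illustration and not Klapuri's code: the decimation is done by simple block averaging (an
assumption of ours; any anti-aliased decimator could be used instead), the half-Hanning
window is taken as the decaying half of a raised cosine normalised to unit sum, and the
function name and default sampling rate are hypothetical.

```python
import math


def amplitude_envelope(band_signal, fs=44100, factor=180, window_ms=100.0):
    """Full-wave rectify, decimate and smooth one band; return (envelope, envelope rate)."""
    rectified = [abs(s) for s in band_signal]            # full-wave rectification
    # Decimation by block averaging (assumed scheme, see the note above)
    decimated = [sum(rectified[i:i + factor]) / factor
                 for i in range(0, len(rectified) - factor + 1, factor)]
    fs_env = fs / factor                                 # sampling rate of the envelope
    length = max(1, round(window_ms * 1e-3 * fs_env))    # about 100 ms of envelope samples
    # Decaying half of a Hann (raised cosine) window, normalised to unit sum
    window = [0.5 + 0.5 * math.cos(math.pi * i / length) for i in range(length)]
    total = sum(window)
    window = [w / total for w in window]
    # Causal convolution of the decimated signal with the window
    envelope = [sum(window[j] * decimated[n - j] for j in range(min(n + 1, length)))
                for n in range(len(decimated))]
    return envelope, fs_env


# Synthetic band signal: 0.5 s of silence followed by 0.5 s of a 1 kHz sinusoid
fs = 44100
band = [0.0] * 22050 + [math.sin(2 * math.pi * 1000 * i / fs) for i in range(22050)]
env, fs_env = amplitude_envelope(band, fs)
print(f"envelope rate = {fs_env} Hz, {len(env)} samples, final value = {env[-1]:.3f}")
```

Because the window starts at its maximum and then decays, a sudden rise in the band
appears immediately in the envelope, while fast modulations after the rise are smoothed
out.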
account the delay required for the note to be completed, and anticipate the execution
of the note by exactly that delay.
To calculate the delay we needed to provide an onset detection stage in our Show Control
System. We know exactly the time of the MIDI event, hence what is missing is the
detection of the note actually played.
We needed to integrate the onset detection stage into the SCS, developed in Max/MSP and
running on an Apple laptop. We initially tried to implement a new method, to become
familiar with development in Max. It was based on an envelope follower applied to the
signal, with a variable threshold. When the amplitude envelope of the signal exceeds the
threshold, an onset is detected. Because it performed poorly in practice, this method was
immediately set aside, and other methods were explored.
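For illustration only, the idea behind this first attempt (an envelope follower compared
with a variable threshold) can be sketched in Python as follows. The original Max patch is
not reproduced here: the one-pole follower, the way the threshold tracks a slow average of
the envelope, the re-arming rule and all parameter values are assumptions made for the
example.

```python
import math


def detect_onsets(signal, fs, attack_ms=1.0, release_ms=50.0,
                  slow_ms=200.0, ratio=2.0, floor=0.05):
    """Return onset times (s) where the envelope exceeds a variable threshold."""
    a_att = math.exp(-1.0 / (fs * attack_ms * 1e-3))
    a_rel = math.exp(-1.0 / (fs * release_ms * 1e-3))
    a_slow = math.exp(-1.0 / (fs * slow_ms * 1e-3))
    env = 0.0      # fast amplitude envelope
    slow = 0.0     # slow average of the envelope, used to move the threshold
    armed = True   # one detection per event: re-arm only after the envelope has dropped
    onsets = []
    for n, sample in enumerate(signal):
        x = abs(sample)
        coeff = a_att if x > env else a_rel
        env = coeff * env + (1.0 - coeff) * x
        threshold = max(floor, ratio * slow)
        if armed and env > threshold:
            onsets.append(n / fs)
            armed = False
        elif not armed and env < 0.5 * threshold:
            armed = True
        slow = a_slow * slow + (1.0 - a_slow) * env
    return onsets


# Two synthetic decaying hits at 0.1 s and 0.6 s (fs = 8000 Hz)
fs = 8000
hits = [0.0] * fs
for start in (800, 4800):
    for i in range(start, fs):
        t = (i - start) / fs
        hits[i] += math.exp(-t / 0.03) * math.sin(2 * math.pi * 200 * t)
print(detect_onsets(hits, fs))
```

A scheme of this kind depends strongly on the choice of time constants and threshold,
which is consistent with the practical difficulties mentioned above.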
We found a very interesting approach in the bonk∼ object, an external library for
Max/MSP available as open source on the web. The original code was written for Pure Data
by the author of Max himself, Miller Puckette, in 1998. It has since been revised by other
people over the years: Ted Apel ported bonk∼ to the Max/MSP platform and later Barry
Threw applied the latest modifications (2008). The version we have used is called
bonk∼ 1.4, found in M. Puckette’s repository, with permission to apply changes.
Figure 6.5: Graphical representation of the bounded-Q filterbank. Only the octaves are
geometrically spaced; within each octave the spacing between analysis bins is linear.
This allows the application of an FFT-like algorithm to calculate the spectrum of each
component.
We found that the implementation of 15 (non-overlapping) filters was successful for our
case. See table 6.1 for details on the filters used for the band-wise analysis. In this
table the filter spacing of two filters per octave can easily be recognized, except where
it is not possible (the first two filters do not respect this spacing^20). The details of
the filterbank implementation can be found in the appendix of this thesis.
The final stage, which we earlier called the peak-picking stage, in bonk∼ works
essentially through the definition of a growth function.
The final configuration of bonk∼ gave great results: all the hits produced by the
orchestra can be located in time and the value of CDR (Correct Detection Result),
proposed in [30], was very easy to calculate.
The CDR is given by:
6.6 – From Onset Analysis to Sound Classification
Table 6.2: Results in detecting the onsets of the five ·O M M· soundtracks created for
analysis purposes, played at different bpm.

·O M M· soundtrack (bpm)  Total Onsets  Detected  Undetected  False Detected  CDR [%]
100                       120           120       0           0               100
105                       80            80        0           0               100
110                       90            90        0           0               100
115                       120           120       0           2               98
120                       60            60        0           5               95
2. start the bonk∼ analysis and stop it after all the onsets provided are recognized
5. bonk∼ will report the number corresponding to the sound which best fit the
spectral template read from file
Table 6.3: Numerical results in detecting onsets and recognizing the three sounds (A/B/C)
produced by the ·O M M·

·O M M· soundtrack (bpm)  Total Onsets  Total A/B/C Notes  Total A/B/C Notes Recognized  A/B/C Notes Confused
100                       120           30/60/30           25/64/31                      5/5/0
105                       80            30/30/20           25/32/23                      5/6/4
110                       90            33/26/31           32/28/30                      2/1/1
115                       120           100/12/8           97/16/7                       4/1/3
120                       60            25/20/15           30/15/15                      4/0/7
The result is quite unexpected: more than 80% of correct correspondences (on average; in
particular cases more than 95%) were found simply by looking at the onset analysis
results.
Chapter 7
Conclusion
A specific component of the human ear, the basilar membrane inside the cochlea, located
in the inner ear, is responsible for detecting the frequency components of a sound.
In this thin membrane, 32 mm long, the frequencies cause oscillations around specific
points. The mechanical properties of the cochlea (wide and stiff
at the base, narrower and much less stiff at the end), in which the basilar membrane is
located, result in a roughly logarithmic decrease in bandwidth as we move linearly away
from the cochlear opening (the oval window).
Therefore we propose a different approach to sound analysis, known as the
Constant-Q filterbank method. This method is typically implemented with a bank of
band-pass filters (a filterbank) with a constant Q ratio. We recall that Q is defined as
the ratio between the center frequency and the bandwidth of a filter. This method mimics
the behaviour of the auditory system in detecting frequency, i.e. the filters are linearly
spaced along a logarithmic frequency axis. This is obtained with filters that keep
their Q ratio constant. It is established that Q must be chosen approximately equal
to 37 to perform a rigorous scan over the frequency range from 20 Hz to 20 kHz, and that
several filters (at least one hundred) must be used.
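To make the definition of Q concrete, the following sketch computes the bandwidth of a constant-Q filter and counts how many filters are needed to cover a frequency range. It assumes, purely for illustration, that each center frequency lies one bandwidth above the previous one, i.e. f_{k+1} = f_k (1 + 1/Q); the function names are hypothetical and the spacing rule is not taken from the thesis.

```c
#include <assert.h>

/* Bandwidth of a filter with center frequency fc and quality factor q,
   from the definition Q = fc / bandwidth. */
static double cq_bandwidth(double fc, double q)
{
    assert(fc > 0.0 && q > 0.0);
    return fc / q;
}

/* Number of filters needed to span [fmin, fmax] when each center
   frequency sits one bandwidth above the previous one (an assumption). */
static int cq_count_filters(double fmin, double fmax, double q)
{
    int n = 0;
    assert(fmin > 0.0 && fmax >= fmin && q > 0.0);
    for (double f = fmin; f <= fmax; f += cq_bandwidth(f, q))
        n++;
    return n;
}
```

Under this spacing assumption, Q = 37 over the range 20 Hz to 20 kHz requires a few hundred filters, which is consistent with the lower bound of one hundred quoted above.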
Our work was compared with the analysis provided by an external library for Max/MSP
called bonk∼, developed by Miller Puckette, the author of Max/MSP. We found
its approach very interesting and very helpful for our purpose. The code has been revised
by other people over the years (the original code dates from 1989) and the version we have
taken into account is bonk∼ 3.0. The code was found in M. Puckette's repository, with
permission to apply changes.
The bonk∼ method is essentially based on a specialization of the constant-Q filterbank
analysis, called bounded-Q analysis. This method has the advantage of reducing the
complexity of the constant-Q transform. In this kind of analysis the value of Q is limited
(bounded) to approximately 5, and only a small number of filters is needed to obtain at
least the same results. We found that an implementation with 15 non-overlapping filters
was successful in our case. The bandwidths of the filters subdivide the sound spectrum
into regions which are approximately tuned around musical octaves, thus respecting what
seems to be the response of the auditory system. This method, implemented for the first time
in the late 1980s, has shown good results in various fields of musical analysis, in particular
segmentation and transcription.
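The way the filter centers are spread over the spectrum can be sketched from the filterbank construction loop listed in Appendix B. The function below reproduces only the center-frequency recurrence of that loop, with all quantities expressed in FFT bins; bounded_q_centers is a hypothetical name, and relspace is taken as a plain parameter because its derivation from the halftone spacing is not part of the excerpted listing.

```c
#include <assert.h>
#include <stddef.h>

#define HALFWIDTH 0.75  /* minimum bandwidth in bins, as in the listing */

/* Fill cf_out with up to nfilters center frequencies (in FFT bins),
   stopping at the Nyquist bin npoints/2. Returns the number stored. */
static size_t bounded_q_centers(double firstbin, double relspace,
                                double overlap, int npoints,
                                size_t nfilters, double *cf_out)
{
    double cf = firstbin;
    double bw = cf * relspace * overlap;
    size_t i;
    assert(cf_out != NULL && relspace > 0.0 && relspace < 1.0 && overlap > 0.0);
    if (bw < HALFWIDTH)
        bw = HALFWIDTH;
    for (i = 0; i < nfilters; i++) {
        double newcf, newbw;
        if (cf > npoints / 2)
            break;                       /* ran past Nyquist */
        cf_out[i] = cf;
        newcf = (cf + bw / overlap) / (1.0 - relspace);
        newbw = newcf * overlap * relspace;
        if (newbw < HALFWIDTH) {         /* clamp narrow low-frequency bands */
            newbw = HALFWIDTH;
            newcf = cf + 2.0 * HALFWIDTH / overlap;
        }
        cf = newcf;
        bw = newbw;
    }
    return i;
}
```

Once the bandwidth is no longer clamped to HALFWIDTH, successive centers grow by the constant ratio (1 + relspace)/(1 - relspace), which is what gives the geometric spacing; the clamp explains why the lowest filters can deviate from that spacing.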
After several months spent debugging, optimizing and adapting the source code to
our needs, we subjected it to several tests. These tests, performed with the sounds
recorded at the LIM laboratories in Verres (AO), have demonstrated the soundness of this
approach. All the hits produced by the orchestra can be located in time. Moreover, we
also trained bonk∼ to recognize which kind of sound has been produced, obtaining
more than 85% of successful correspondences.
The percentage of recognized hits indicates the validity of the approach; other musical
applications can be foreseen inside or outside the ·O M M· .
Appendix A
Source 1: main.c
/**
    @page chapter_msp_anatomy Anatomy of a MSP Object

    Here is an enumeration of the basic tasks:

    1) additional header files

    2) C structure declaration

    The C structure declaration must begin with a #t_pxobject, not a #t_object:
    @code
    typedef struct _mydspobject
    {
        t_pxobject m_obj;
        // rest of the structure's fields
    } t_mydspobject;
    @endcode

    3) initialization routine
A – MSP, anatomy of the object
    After creating your class with class_new(), you must call class_dspinit(),
    which will add some standard method handlers for internal messages used by
    all signal objects.
    @code
    class_dspinit(c);
    @endcode

    4) new instance routine

    The dsp method specifies the signal processing function your object defines
    along with its arguments. Your object's dsp method will be called when the
    MSP signal compiler is building a sequence of operations (known as the DSP
    Chain) that will be performed on each set of audio samples. The operation
    sequence consists of pointers to functions (called perform routines)
    followed by arguments to those functions.

    The dsp method is declared as follows:
    @code
    void mydspobject_dsp(t_mydspobject *x, t_signal **sp, short *count);
    @endcode

    To add an entry to the DSP chain, your dsp method uses dsp_add(). The dsp
    method is passed an array of signals (#t_signal pointers), which contain
    pointers to the actual sample memory your object's perform routine will be
    using for input and output. The array of signals starts with the inputs
    (from left to right), followed by the outputs. For example, if your object
    has two inputs (because your new instance routine called dsp_setup(x, 2))
    and three outputs (because your new instance created three signal outlets),
    the signal array sp would contain five items as follows:
    @code
    sp[0] // left input
    sp[1] // right input
    sp[2] // left output
    sp[3] // middle output
    sp[4] // right output
    @endcode
    You can use a variety of strategies to pass arguments to your perform routine
    via dsp_add(). For simple unit generators that don't store any internal
    state between computing vectors, it is sufficient to pass the inputs,
    outputs, and vector size. For objects that need to store internal state
    between computing vectors such as filters or ramp generators, you will pass
    a pointer to your object, whose data structure should contain space to
    store this state. The plus1~ object does not need to store internal state.
    It passes the input, output, and vector size to its perform routine. The
    plus1~ dsp method is shown below:
    @code
    void plus1_dsp(t_plus1 *x, t_signal **sp, short count)
    {
        dsp_add(plus1_perform, 3, sp[0]->s_vec, sp[1]->s_vec, sp[0]->s_n);
    }
    @endcode

    The first argument to dsp_add() is your perform routine, followed by the number
    of additional arguments you wish to copy to the DSP chain, and then the
    arguments.

    Here is the plus1 perform routine:
    @code
    t_int *plus1_perform(t_int *w)
    {
        t_float *in, *out;
        int n;

        in = (t_float *)w[1];   // get input signal vector
        out = (t_float *)w[2];  // get output signal vector
        n = (int)w[3];          // vector size

        while (n--)             // perform calculation on all samples
            *out++ = *in++ + 1.;

        return w + 4;           // must return next DSP chain location
    }
    @endcode

    6) Free function
    @code
    void mydspobject_free(t_mydspobject *x)
    {
        dsp_free((t_pxobject *)x);
        // can do other stuff here
    }
    @endcode
Appendix B
No substantial modifications have been applied to the original bonk∼ code. Earlier
modifications to the original were made by Barry Threw for the latest version of
bonk∼, the one we used (see footnote 1).
Source 2: main.c
/*
###########################################################################
# bonk~ - a pd and Max/MSP external
# by miller puckette and ted appel
# http://crca.ucsd.edu/~msp/
# Max/MSP port by barry threw (me@barrythrew.com)
# http://www.barrythrew.com
# San Francisco, CA 2008
# for Kesumo - http://www.kesumo.com
# Max 5 optimized version for loud percussive sounds, by Zengi. BETA version
# Turin, June 2009
###########################################################################
// bonk~ detects attacks in an audio signal
###########################################################################
This software is copyrighted by Miller Puckette and others. The following
terms (the "Standard Improved BSD License") apply to all files associated with
the software unless explicitly disclaimed in individual files:

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

1. Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above
   copyright notice, this list of conditions and the following
1
Bonk3 can be obtained from M. Puckette's repository (on request) or from various unofficial
locations on the web. The version we used was found on Barry Threw's website, but it is no longer available there.
B – bonk∼ source code
   disclaimer in the documentation and/or other materials provided
   with the distribution.
3. The name of the author may not be used to endorse or promote
   products derived from this software without specific prior
   written permission.

//#ifdef NT
//#pragma warning( disable : 4305 4244 )
//#endif

#include "ext.h"
#include "z_dsp.h"
#include "math.h"
#include "ext_support.h"
#include "ext_proto.h"
#include "ext_obex.h"

typedef double t_floatarg; /* from m_pd.h */
#define flog log
#define fexp exp
#define fsqrt sqrt
#define t_resizebytes(a, b, c) t_resizebytes((char *)(a), (b), (c))
void *bonk_class;
#define getbytes t_getbytes
#define freebytes t_freebytes
B.1 – The bonk∼ Method
#define DEFMASKTIME 4
#define DEFMASKDECAY 0.7
#define DEFDEBOUNCEDECAY 0
#define DEFMINVEL 7
#define DEFATTACKBINS 1
#define MAXATTACKWAIT 4

//DATA STRUCTURES
typedef struct _filterkernel
{
    int k_filterpoints;
    int k_hoppoints;
    int k_skippoints;
    int k_nhops;
    float k_centerfreq;     /* center frequency, bins */
    float k_bandwidth;      /* bandwidth, bins */
    float *k_stuff;
} t_filterkernel;

/* 1.3 review */
#define MAXNFILTERS 50
#define MASKHIST 8

typedef struct template
{
    float t_amp[MAXNFILTERS];
} t_template;

typedef struct _insig
{
    t_hist g_hist[MAXNFILTERS];     /* history for each filter */
    void *g_outlet;                 /* outlet for raw data */
    float *g_inbuf;                 /* buffered input samples */
    t_float *g_invec;               /* new input samples */
} t_insig;

typedef struct _bonk
{

    t_pxobject x_obj;
    void *obex;
    void *x_cookedout;
    void *x_clock;
    short x_vol;

    /* parameters */
    int x_npoints;          /* number of points in input buffer */
    int x_period;           /* number of input samples between analyses */
    int x_nfilters;         /* number of filters requested */
    float x_halftones;      /* nominal halftones between filters */
    float x_overlap;
    float x_firstbin;
//PROTOTYPES
// prototypes for methods: need a method for each incoming message
static void *bonk_new(t_symbol *s, long ac, t_atom *av);
static void bonk_tick(t_bonk *x);
static void bonk_doit(t_bonk *x);
static t_int *bonk_perform(t_int *w);
static void bonk_dsp(t_bonk *x, t_signal **sp);
void bonk_assist(t_bonk *x, void *b, long m, long a, char *s);
static void bonk_free(t_bonk *x);
void bonk_setup(void);
void main();

// methods for threshold and other features
static void bonk_thresh(t_bonk *x, t_floatarg f1, t_floatarg f2);
static void bonk_print(t_bonk *x, t_floatarg f);

// methods for reading and writing templates
static void bonk_write(t_bonk *x, t_symbol *s);
static void bonk_read(t_bonk *x, t_symbol *s);

// methods for attribute setters
void bonk_minvel_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_lothresh_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_hithresh_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_masktime_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_maskdecay_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_debouncedecay_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_debug_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_spew_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_useloudness_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_attackbins_set(t_bonk *x, void *attr, long ac, t_atom *av);

float qrsqrt(float f);
double clock_getsystime();
double clock_gettimesince(double prevsystime);
char *strcpy(char *s1, const char *s2);

// clock function
static void bonk_tick(t_bonk *x);

#define HALFWIDTH 0.75 /* half peak bandwidth at half power point in bins */
    cf = firstbin; // first center freq of the filterbank
    // bandwidth
    bw = cf * relspace * overlap;
    if (bw < HALFWIDTH)
        bw = HALFWIDTH;
    // creates (i) filters, MAX(i)=50.
    // stops creating filters when cf exceeds npoints/2, returns i.
    for (i = 0; i < nfilters; i++)
    {
        float *fp, newcf, newbw;
        float normalizer = 0;
        int filterpoints, skippoints, hoppoints, nhops;

        filterpoints = 0.5 + npoints * HALFWIDTH/bw;
        //filterpoints = 0.5 + npoints / bw;
        if (cf > npoints/2)
        {
            post("bonk~: only using %d filters (ran past Nyquist)", i+1);
            break;
        }
        if (filterpoints < 4)
        {
            post("bonk~: only using %d filters (kernels got too short)", i+1);
            break;
        }
        else if (filterpoints > npoints)
            filterpoints = npoints;

        hoppoints = 0.5 + 0.5 * npoints * HALFWIDTH/bw;
        //hoppoints = 0.5 + 0.5 * npoints/bw;
        nhops = 1. + (npoints-filterpoints)/(float)hoppoints;
        skippoints = 0.5 * (npoints-filterpoints - (nhops-1) * hoppoints);
        // Fill the kernel of the filters in filterbank->filterkernel
        b->b_vec[i].k_stuff =
            (float *)getbytes(2 * sizeof(float) * filterpoints);
        b->b_vec[i].k_filterpoints = filterpoints;
        b->b_vec[i].k_nhops = nhops;
        b->b_vec[i].k_hoppoints = hoppoints;
        b->b_vec[i].k_skippoints = skippoints;
        b->b_vec[i].k_centerfreq = cf;
        b->b_vec[i].k_bandwidth = bw;
        post("i %d  cf %.2f  bw %.2f Q %.2f nhops %d, hop %d, skip %d, npoints %d, normalizer %.8f fp0 %.6f, fp1 %.6f",
            i, cf, bw, cf/bw, nhops, hoppoints, skippoints, filterpoints,
            normalizer, &fp[0], &fp[1]);

        newcf = (cf + bw/overlap)/(1 - relspace);
        newbw = newcf * overlap * relspace;
        if (newbw < HALFWIDTH)
        {
            newbw = HALFWIDTH;
            newcf = cf + 2 * HALFWIDTH / overlap;
        }
        cf = newcf;
        bw = newbw;
    }
    // sets to 0 the remaining filters, if fewer than 50 filters are used
    for (; i < nfilters; i++)
        b->b_vec[i].k_stuff = 0, b->b_vec[i].k_filterpoints = 0;
    return (b);
}

static void bonk_freefilterbank(t_filterbank *b)
{
    t_filterbank *b2, *b3;
    int i;
    if (bonk_filterbanklist == b)
        bonk_filterbanklist = b->b_next;
    else for (b2 = bonk_filterbanklist; b3 = b2->b_next; b2 = b3)
        if (b3 == b)
    {
        b2->b_next = b3->b_next;
        break;
    }
    for (i = 0; i < b->b_nfilters; i++)
        if (b->b_vec[i].k_stuff)
            freebytes(b->b_vec[i].k_stuff,
                b->b_vec[i].k_filterpoints * sizeof(float));
    freebytes(b, sizeof(*b));
}
    x->x_attacked = 0;
    x->x_maskphase = 0;
    x->x_debug = 0;
    x->x_hithresh = DEFHITHRESH;
    x->x_lothresh = DEFLOTHRESH;
    x->x_masktime = DEFMASKTIME;
    x->x_maskdecay = DEFMASKDECAY;
    x->x_learn = 0;
    x->x_learndebounce = clock_getsystime();
    x->x_learncount = 0;
    x->x_debouncedecay = DEFDEBOUNCEDECAY;
    x->x_minvel = DEFMINVEL;
    x->x_useloudness = 0;
    x->x_debouncevel = 0;
    x->x_attackbins = DEFATTACKBINS;
    x->x_sr = samplerate;
    x->x_filterbank = 0;
    x->x_hit = 0;
    for (fb = bonk_filterbanklist; fb; fb = fb->b_next)
        if (fb->b_nfilters == x->x_nfilters &&
            fb->b_halftones == x->x_halftones &&
            fb->b_firstbin == firstbin &&
            fb->b_overlap == overlap &&
            fb->b_npoints == x->x_npoints)
    {
        fb->b_refcount++;
        x->x_filterbank = fb;
        break;
    }
    if (!x->x_filterbank)
        x->x_filterbank = bonk_newfilterbank(npoints, nfilters, halftones, overlap,
            firstbin), x->x_filterbank->b_refcount++;
}

static void bonk_tick(t_bonk *x)
{
    t_atom at[MAXNFILTERS], *ap, at2[3];
    int i, j, k, n;
    t_hist *h;
    float *pp, vel = 0, temperature = 0;
    float *fp;
    t_template *tp;
    int nfit, ninsig = x->x_ninsig, ntemplate = x->x_ntemplate,
        nfilters = x->x_nfilters;
    t_insig *gp;
#ifdef _MSC_VER
    float powerout[MAXNFILTERS*MAXCHANNELS];
#else
    float *powerout = alloca(x->x_nfilters * x->x_ninsig * sizeof(*powerout));
#endif
    if (vel > 0) temperature /= vel;
    else temperature = 0;
    vel *= 0.5 / ninsig; /* fudge factor */
    if (x->x_hit)
    {
        /* if hit nonzero it's a clock callback. if in "learn" mode update the
        template list; in any event match the hit to known templates. */
        if (vel < x->x_debouncevel)
        {
            if (x->x_debug)
                post("bounce cancelled: vel %f debounce %f",
                    vel, x->x_debouncevel);
            return;
        }
        if (vel < x->x_minvel)
        {
            if (x->x_debug)
                post("low velocity cancelled: vel %f, minvel %f",
                    vel, x->x_minvel);
            return;
        }
        x->x_debouncevel = vel;
        if (x->x_learn)
        {
            double lasttime = x->x_learndebounce;
            double msec = clock_gettimesince(lasttime);
            if ((!ntemplate) || (msec > 200))
            {
                int countup = x->x_learncount;
                /* normalize to 100 */
                float norm;
                for (i = nfilters * ninsig, norm = 0, pp = powerout; i--; pp++)
                    norm += *pp * *pp;
                if (norm < 1.0e-15) norm = 1.0e-15;
                norm = 100.f * qrsqrt(norm);
                /* check if this is the first strike for a new template */
                if (!countup)
                {
                    int oldn = ntemplate;
                    x->x_ntemplate = ntemplate = oldn + ninsig;
                    x->x_template = (t_template *)t_resizebytes(x->x_template,
                        oldn * sizeof(x->x_template[0]),
                        ntemplate * sizeof(x->x_template[0]));
                    for (i = ninsig, pp = powerout; i--; oldn++)
                        for (j = nfilters, fp = x->x_template[oldn].t_amp; j--;
                            pp++, fp++)
                                *fp = *pp * norm;
                }
                else
                {
                    int oldn = ntemplate - ninsig;
                    if (oldn < 0) post("bonk_tick bug");
                    for (i = ninsig, pp = powerout; i--; oldn++)
                    {
                        for (j = nfilters, fp = x->x_template[oldn].t_amp; j--;
                            pp++, fp++)
                                *fp = (countup * *fp + *pp * norm)
                                    /(countup + 1.0f);
                    }
                }
                countup++;
                if (countup == x->x_learn) countup = 0;
                x->x_learncount = countup;
            }
            else return;
        }
        x->x_learndebounce = clock_getsystime();
        if (ntemplate)
        {
            float bestfit = -1e30;
            int templatecount;
            nfit = -1;
            for (i = 0, templatecount = 0, tp = x->x_template;
                templatecount < ntemplate; i++)
            {
                float dotprod = 0;
                for (k = 0, pp = powerout;
                    k < ninsig && templatecount < ntemplate;
                        k++, tp++, templatecount++)
                {
                    for (j = nfilters, fp = tp->t_amp;
                        j--; fp++, pp++)
                    {
                        if (*fp < 0 || *pp < 0) post("bonk_tick bug 2");
                        dotprod += *fp * *pp;
                    }
                }
                if (dotprod > bestfit)
                {
                    bestfit = dotprod;
                    nfit = i;
                }
            }
            if (nfit < 0) post("bonk_tick bug");
        }
        else nfit = 0;
    }
    else nfit = -1; /* hit is zero; this is the "bang" method. */

    x->x_attacked = 1;
    if (x->x_debug)
        post("bonk out: number %d, vel %f, temperature %f", nfit, vel, temperature);
    SETFLOAT(at2, nfit);
    SETFLOAT(at2+1, vel);
    SETFLOAT(at2+2, temperature);
    outlet_list(x->x_cookedout, 0, 3, at2);

    for (n = 0, gp = x->x_insig + (ninsig-1),
        pp = powerout + nfilters * (ninsig-1); n < ninsig;
            n++, gp--, pp -= nfilters)
    {
        float *pp2;
        for (i = 0, ap = at, pp2 = pp; i < nfilters;
            i++, ap++, pp2++)
        {
            ap->a_type = A_FLOAT;
            ap->a_w.w_float = *pp2;
        }
        outlet_list(gp->g_outlet, 0, nfilters, at);
    }
}
// report the attack
static void bonk_doit(t_bonk *x)
{
    int i, j, ch, n;
    t_filterkernel *k;
    t_hist *h;
    float growth = 0, *fp1, *fp3, *fp4, hithresh, lothresh;
    int ninsig = x->x_ninsig, nfilters = x->x_nfilters,
        maskphase = x->x_maskphase, nextphase, oldmaskphase;
    t_insig *gp;
    nextphase = maskphase + 1;
    if (nextphase >= MASKHIST)
        nextphase = 0;
    x->x_maskphase = nextphase;
    oldmaskphase = nextphase - x->x_attackbins;
    if (oldmaskphase < 0)
        oldmaskphase += MASKHIST;
    if (x->x_useloudness)
        hithresh = qrsqrt(qrsqrt(x->x_hithresh)),
            lothresh = qrsqrt(qrsqrt(x->x_lothresh));
    else hithresh = x->x_hithresh, lothresh = x->x_lothresh;
    for (ch = 0, gp = x->x_insig; ch < ninsig; ch++, gp++)
    {
        for (i = 0, k = x->x_filterbank->b_vec, h = gp->g_hist;
            i < nfilters; i++, k++, h++)
        {
            float power = 0, maskpow = h->h_mask[maskphase];
            float *inbuf = gp->g_inbuf + k->k_skippoints;
            int countup = h->h_countup;
            int filterpoints = k->k_filterpoints;
            /* if the user asked for more filters than fit under the
            Nyquist frequency, some filters won't actually be filled in
            so we skip running them. */
            if (!filterpoints)
            {
                h->h_countup = 0;
                h->h_mask[nextphase] = 0;
                h->h_power = 0;
                continue;
            }
            // for each filter:
            /* run the filter repeatedly, sliding it forward by hoppoints,
            for nhop times */
            for (fp1 = inbuf, n = 0; n < k->k_nhops; fp1 += k->k_hoppoints, n++)
            {
                float rsum = 0, isum = 0;
                for (fp3 = fp1, fp4 = k->k_stuff, j = filterpoints; j--;)
                {
                    // calculating the power for each filter
                    // g = the input buffer
                    // fp4 = fp[0] and fp[1]
                    float g = *fp3++;
                    rsum += g * *fp4++;
                    isum += g * *fp4++;
                }
                power += rsum * rsum + isum * isum;
                //post("power %.12f", power); // check whether the power values
                // posted to the Max window can be decimated
            }
                if (!x->x_willattack)
                    h->h_before = maskpow;
            }
            if (!x->x_willattack && countup >= x->x_masktime)
                maskpow *= x->x_maskdecay;

            if (power > maskpow)
            {
                maskpow = power;
                countup = 0;
            }
            countup++;
            h->h_countup = countup;
            h->h_mask[nextphase] = maskpow;
            h->h_power = power;
        }
    }
    if (x->x_willattack) // an attack is reported
    {
        // however we won't actually report the attack until the spectrum stops
        // growing. growth must decrease below lothresh.
        if (x->x_willattack > MAXATTACKWAIT || growth < x->x_lothresh)
        {
            /* if haven't yet, and if not in spew mode, report a hit */
            if (!x->x_spew && !x->x_attacked)
            {
                for (ch = 0, gp = x->x_insig; ch < ninsig; ch++, gp++)
                    for (i = nfilters, h = gp->g_hist; i--; h++)
                        h->h_outpower = h->h_mask[nextphase];
                x->x_hit = 1;
                // sets a clock to go off n milliseconds from the current logical
                // time with clock_delay(clock to schedule, n (ms))
                // Schedule the execution of a Clock
                clock_delay(x->x_clock, 0);
            }
        }
        if (growth < x->x_lothresh)
            x->x_willattack = 0;
        else x->x_willattack++;
    }
    else if (growth > x->x_hithresh)
    {
        if (x->x_debug) post("attack: growth = %f", growth);
        x->x_willattack = 1;
        x->x_attacked = 0;
        for (ch = 0, gp = x->x_insig; ch < ninsig; ch++, gp++)
            for (i = nfilters, h = gp->g_hist; i--; h++)
                h->h_mask[nextphase] = h->h_power, h->h_countup = 0;
    }
    // spew mode always output data for every performed analysis
    /* if in "spew" mode just always output */
    if (x->x_spew)
    {
        for (ch = 0, gp = x->x_insig; ch < ninsig; ch++, gp++)
            for (i = nfilters, h = gp->g_hist; i--; h++)
                h->h_outpower = h->h_power;
        x->x_hit = 0;
        clock_delay(x->x_clock, 0);
    }
    x->x_debouncevel *= x->x_debouncedecay;
}

// 4//PERFORM ROUTINE
// It receives a pointer to a piece of the DSP chain and it is expected to return
// the location of the next perform routine on the chain.
// The next location is determined by the number of arguments specified for the
// perform routine with the call to dsp_add().
// For example, if we pass three arguments, we need to return w + 4.
static t_int *bonk_perform(t_int *w)
{
    t_bonk *x = (t_bonk *)(w[1]);
    int n = (int)(w[2]); // vector size
    int onset = 0;
    if (x->x_countdown >= n)
        x->x_countdown -= n;
    else
    {
        int i, j, ninsig = x->x_ninsig;
        t_insig *gp;
        if (x->x_countdown > 0)
        {
            n -= x->x_countdown;
            onset += x->x_countdown;
            x->x_countdown = 0;
        }
        while (n > 0)
        {
            int infill = x->x_infill;
            int m = (n < (x->x_npoints - infill) ?
                n : (x->x_npoints - infill));
            for (i = 0, gp = x->x_insig; i < ninsig; i++, gp++)
            {
                float *fp = gp->g_inbuf + infill;
                t_float *in1 = gp->g_invec + onset;
                for (j = 0; j < m; j++)
                    *fp++ = *in1++;
            }
            infill += m;
            x->x_infill = infill;
            // when input is filled with npoint samples, bonk_doit!
            if (infill == x->x_npoints)
            {
                bonk_doit(x);

                /* shift or clear the input buffer and update counters */
                if (x->x_period > x->x_npoints)
                    x->x_countdown = x->x_period - x->x_npoints;
                else x->x_countdown = 0;
                if (x->x_period < x->x_npoints)
                {
                    int overlap = x->x_npoints - x->x_period;
                    float *fp1, *fp2;
                    for (n = 0, gp = x->x_insig; n < ninsig; n++, gp++)
105
B – bonk∼ source code
751 f o r ( i = o v e r l a p , f p 1 = gp−>g_inbuf ,
f p 2 = f p 1 + x−>x _ p e r i o d ; i −−;)
∗ f p 1++ = ∗ f p 2 ++;
x−> x _ i n f i l l = o v e r l a p ;
}
756 e l s e x−> x _ i n f i l l = 0 ;
}
n −= m;
o n s e t += m;
}
761 }
r e t u r n (w+3) ;
}
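
The buffering performed by the perform routine can be looked at in isolation from the Max/MSP API. What follows is a minimal sketch under stated assumptions, not the bonk∼ code: the names hop_buffer, hop_buffer_init, hop_buffer_feed, NPOINTS and PERIOD are hypothetical, only one input channel is handled, and the hop (period) is assumed not to exceed the window length, so the countdown branch is left out. Incoming vectors are copied into a window of NPOINTS samples; each time the window is full an analysis is counted and the last NPOINTS - PERIOD samples are moved to the front, so that consecutive windows overlap.

```c
#include <stddef.h>
#include <string.h>

#define NPOINTS 8   /* analysis window length (hypothetical value) */
#define PERIOD  4   /* hop between analyses; assumed <= NPOINTS in this sketch */

typedef struct {
    float  buf[NPOINTS]; /* window being assembled */
    size_t infill;       /* samples currently stored in buf */
    int    nanalyses;    /* number of full windows seen so far */
    float  last_first;   /* first sample of the most recent full window */
} hop_buffer;

static void hop_buffer_init(hop_buffer *hb)
{
    memset(hb, 0, sizeof(*hb));
}

/* Feed n input samples; returns how many analyses were triggered. */
static int hop_buffer_feed(hop_buffer *hb, const float *in, size_t n)
{
    int triggered = 0;
    while (n > 0) {
        size_t room = NPOINTS - hb->infill;
        size_t m = (n < room) ? n : room;   /* samples we can take now */
        memcpy(hb->buf + hb->infill, in, m * sizeof(float));
        hb->infill += m;
        in += m;
        n -= m;
        if (hb->infill == NPOINTS) {
            /* a real detector would analyse hb->buf at this point */
            hb->last_first = hb->buf[0];
            hb->nanalyses++;
            triggered++;
            /* keep the overlapping tail for the next window */
            memmove(hb->buf, hb->buf + PERIOD,
                    (NPOINTS - PERIOD) * sizeof(float));
            hb->infill = NPOINTS - PERIOD;
        }
    }
    return triggered;
}
```

With these values, feeding 16 consecutive samples fills the window three times, and each window starts PERIOD samples after the previous one.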
// 3//DSP METHOD
// From MAX 5 API: The dsp method specifies the signal processing function your
// object defines along with its arguments.
// The object's dsp method will be called whenever the MSP signal compiler is
// building a sequence of operations (known as the DSP Chain) that will be
// performed on each set of audio samples.
// The operation sequence consists of pointers to functions (called perform
// routines) followed by arguments to those functions.
static void bonk_dsp(t_bonk *x, t_signal **sp)
{
    int i, n = sp[0]->s_n, ninsig = x->x_ninsig;
    t_insig *gp;
    x->x_sr = sp[0]->s_sr;
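
The convention described in the comments above (a chain of function pointers, each followed by its arguments, with every perform routine returning the location of the next entry) can be sketched in plain C. This is only an illustration and not the MSP implementation: word_t, perform_fn, scale_perform and run_chain are hypothetical names, word_t stands in for t_int, the chain is assumed to end with a zero word, and storing a function pointer in an integer word relies on the usual behaviour of common desktop platforms rather than on strict ISO C.

```c
#include <stdint.h>
#include <stddef.h>

typedef intptr_t word_t;                  /* stand-in for t_int */
typedef word_t *(*perform_fn)(word_t *w); /* shape of a perform routine */

/* Perform routine taking two arguments: w[1] = buffer, w[2] = vector size.
   It doubles every sample and returns w + 3 (function slot + 2 arguments). */
static word_t *scale_perform(word_t *w)
{
    float *buf = (float *)w[1];
    int n = (int)w[2];
    int i;
    for (i = 0; i < n; i++)
        buf[i] *= 2.0f;
    return w + 3;
}

/* Walk the chain until the terminating zero word; returns the call count. */
static int run_chain(word_t *chain)
{
    int ncalls = 0;
    word_t *w = chain;
    while (*w) {
        perform_fn fn = (perform_fn)(*w);
        w = fn(w);   /* each routine tells us where the next entry starts */
        ncalls++;
    }
    return ncalls;
}
```

A routine registered with two arguments therefore occupies three words of the chain, which is why bonk_perform returns w+3.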

static void bonk_thresh(t_bonk *x, t_floatarg f1, t_floatarg f2)
{
    if (f1 > f2)
        post("bonk: warning: low threshold greater than hi threshold");
    x->x_lothresh = (f1 <= 0 ? 0.0001 : f1);
    x->x_hithresh = (f2 <= 0 ? 0.0001 : f2);
}
static void bonk_print(t_bonk *x, t_floatarg f)
{
    int i;
    post("thresh %f %f", x->x_lothresh, x->x_hithresh);
    post("mask %d %f", x->x_masktime, x->x_maskdecay);
    post("attack-bins %d", x->x_attackbins);
    post("debounce %f", x->x_debouncedecay);
    post("minvel %f", x->x_minvel);
    post("spew %d", x->x_spew);
    post("useloudness %d", x->x_useloudness);

    {
        int j, ninsig = x->x_ninsig;
        t_insig *gp;
        for (j = 0, gp = x->x_insig; j < ninsig; j++, gp++)
        {
            t_hist *h;
            if (ninsig > 1) post("input %d:", j+1);
            for (i = x->x_nfilters, h = gp->g_hist; i--; h++)
                post("pow %f mask %f before %f count %d",
                    h->h_power, h->h_mask[x->x_maskphase],
                    h->h_before, h->h_countup);
        }
        post("filter details (frequencies are in units of %.2f-Hz. bins):",
            x->x_sr);
        for (j = 0; j < x->x_nfilters; j++)
            post("%2d  cf %.2f  bw %.2f  nhops %d hop %d skip %d npoints %d",
                j,
                x->x_filterbank->b_vec[j].k_centerfreq,
                x->x_filterbank->b_vec[j].k_bandwidth,
                x->x_filterbank->b_vec[j].k_nhops,
                x->x_filterbank->b_vec[j].k_hoppoints,
                x->x_filterbank->b_vec[j].k_skippoints,
                x->x_filterbank->b_vec[j].k_filterpoints);
    }
    if (x->x_debug) post("debug mode");
}
static void bonk_forget(t_bonk *x)
{
    int ntemplate = x->x_ntemplate, newn = ntemplate - x->x_ninsig;
    if (newn < 0) newn = 0;
    x->x_template = (t_template *)t_resizebytes(x->x_template,
        x->x_ntemplate * sizeof(x->x_template[0]),
        newn * sizeof(x->x_template[0]));
    x->x_ntemplate = newn;
    x->x_learncount = 0;
}

    }
    x->x_template = (t_template *)t_resizebytes(x->x_template,
        x->x_ntemplate * sizeof(t_template), 0);
    while (1)
    {
        for (i = x->x_nfilters, fp = vec; i--; fp++)
            if (fscanf(fd, "%f", fp) < 1) goto nomore;
        x->x_template = (t_template *)t_resizebytes(x->x_template,
            ntemplate * sizeof(t_template),
            (ntemplate + 1) * sizeof(t_template));
        for (i = x->x_nfilters, fp = vec,
            fp2 = x->x_template[ntemplate].t_amp; i--;)
                *fp2++ = *fp++;
        ntemplate++;
    }
nomore:
    if ((remaining = (ntemplate % x->x_ninsig)))
    {
        /* pass the two values expected by the %d conversions */
        post("bonk_read: %d templates not a multiple of %d; dropping extras",
            ntemplate, x->x_ninsig);
        x->x_template = (t_template *)t_resizebytes(x->x_template,
            ntemplate * sizeof(t_template),
            (ntemplate - remaining) * sizeof(t_template));
        ntemplate = ntemplate - remaining;
    }
    post("bonk: read %d templates\n", ntemplate);
    x->x_ntemplate = ntemplate;
    fclose(fd);
}
static void bonk_write(t_bonk *x, t_symbol *s)
{
    FILE *fd = fopen(s->s_name, "w");
    int i, ntemplate = x->x_ntemplate;
    t_template *tp = x->x_template;
    float *fp;
    if (!fd)
    {
        post("%s: couldn't create", s->s_name);
        return;
    }
    for (; ntemplate--; tp++)
    {
        for (i = x->x_nfilters, fp = tp->t_amp; i--; fp++)
            fprintf(fd, "%6.2f ", *fp);
        fprintf(fd, "\n");
    }
    post("bonk: wrote %d templates\n", x->x_ntemplate);
    fclose(fd);
}
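
The read and write routines above store the templates as plain text, one row of amplitudes per template. The sketch below reproduces that file layout outside Max/MSP so it can be tried on its own. It is an illustration under assumptions, not the bonk∼ code: NFILTERS, MAXTEMPLATES, write_templates and read_templates are hypothetical names, the number of filters is an arbitrary small value, and a fixed-capacity array replaces the resizing done with t_resizebytes.

```c
#include <stdio.h>

#define NFILTERS 3       /* amplitudes per template (hypothetical value) */
#define MAXTEMPLATES 16  /* fixed capacity used here instead of resizing */

/* Write ntemplate rows of NFILTERS amplitudes; returns 0 on success. */
static int write_templates(const char *path,
                           float t[][NFILTERS], int ntemplate)
{
    FILE *fd = fopen(path, "w");
    int i, j;
    if (!fd)
        return -1;
    for (i = 0; i < ntemplate; i++) {
        for (j = 0; j < NFILTERS; j++)
            fprintf(fd, "%6.2f ", t[i][j]);
        fprintf(fd, "\n");
    }
    fclose(fd);
    return 0;
}

/* Read complete rows until the file ends; a trailing partial row is dropped.
   Returns the number of templates read, or -1 if the file cannot be opened. */
static int read_templates(const char *path, float t[][NFILTERS])
{
    FILE *fd = fopen(path, "r");
    float row[NFILTERS];
    int ntemplate = 0, j;
    if (!fd)
        return -1;
    while (ntemplate < MAXTEMPLATES) {
        for (j = 0; j < NFILTERS; j++)
            if (fscanf(fd, "%f", &row[j]) < 1)
                goto done;
        for (j = 0; j < NFILTERS; j++)
            t[ntemplate][j] = row[j];
        ntemplate++;
    }
done:
    fclose(fd);
    return ntemplate;
}
```

Because the values are written with the "%6.2f" format, amplitudes are stored with two decimal places only, so a template read back from disk may differ slightly from the one that was learned.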
// free function
static void bonk_free(t_bonk *x)
{
    int i, ninsig = x->x_ninsig;
    t_insig *gp = x->x_insig;
#ifdef MSP
    dsp_free((t_pxobject *)x);
#endif
    for (i = 0, gp = x->x_insig; i < ninsig; i++, gp++)
        freebytes(gp->g_inbuf, x->x_npoints * sizeof(float));
    clock_free(x->x_clock);
    if (!--(x->x_filterbank->b_refcount))
        bonk_freefilterbank(x->x_filterbank);
}

    // calcoffset calculates the byte offset from the beginning of the bonk
    // structure. The value is stored in the obex field of the same structure.
    class_obexoffset_set(c, calcoffset(t_bonk, obex));

    //NEW ATTRIBUTES
    // creates a (new) attribute with attr_offset_new(name, type, flags for
    // setting/querying the attribute, get method (NULL is the default method),
    // set method, byte offset).
    // adds the attribute to the objects of the class with class_addattr()
    attr = attr_offset_new("npoints", sym_long, attrflags, (method)0L, (method)0L,
        calcoffset(t_bonk, x_npoints));
    class_addattr(c, attr);

    class_addattr(c, attr);

    //METHODS!
    // adds a method to the objects of the class with class_addmethod(class pointer,
    // m, name, type, 0)
    // m=function that gets called when the method is invoked
    class_addmethod(c, (method)bonk_dsp, "dsp", A_CANT, 0);

    // adds special obex methods
    class_addmethod(c, (method)object_obex_dumpout, "dumpout", A_CANT, 0);
    class_addmethod(c, (method)object_obex_quickref, "quickref", A_CANT, 0);

    // registers a previously defined object class with class_register(name_space,
    // class pointer). This function is required, and should be called at the end
    // of main().
    // namespace=The desired class's name space. Typically, #CLASS_BOX, for obex
    // classes or #CLASS_NOBOX for classes which will only be used internally
    class_register(CLASS_BOX, c);
    bonk_class = c;
    post("\n");
    post("BonkOMM~ v 1.0 - detects attacks in audio signals");
    post("Zengi revision");
    post("Original by Miller Puckette and Ted Appel, http://crca.ucsd.edu/~msp/");
    post("\n");
}

    t_insig *g;
    x->x_hithresh = DEFHITHRESH;
    x->x_lothresh = DEFLOTHRESH;
    x->x_masktime = DEFMASKTIME;
    x->x_maskdecay = DEFMASKDECAY;
    x->x_debouncedecay = DEFDEBOUNCEDECAY;
    x->x_minvel = DEFMINVEL;
    x->x_attackbins = DEFATTACKBINS;

    if (!x->x_period) x->x_period = x->x_npoints / 2;
    x->x_template = (t_template *)getbytes(0);
    x->x_ntemplate = 0;
    x->x_infill = 0;
    x->x_countdown = 0;
    x->x_willattack = 0;
    x->x_attacked = 0;
    x->x_maskphase = 0;
    x->x_debug = 0;
    x->x_learn = 0;
    x->x_learndebounce = clock_getsystime();
    x->x_learncount = 0;
    x->x_useloudness = 0;
    x->x_debouncevel = 0;
    x->x_sr = sys_getsr(); /* gets the sample rate */
    /* something useful for debug

    if (ac) {
        switch (av[0].a_type) {
            case A_LONG:
                x->x_ninsig = av[0].a_w.w_long;
                break;
        }
    }

    //CREATE INLET
    // creates the signal inlets with dsp_setup((cast to t_pxobject) object
    // pointer, nsignals), so you need not make them yourself!
    // nsignals=The number of signal/proxy inlets to create for the object. If
    // the object has no signal inlets, you may pass 0.
    dsp_setup((t_pxobject *)x, x->x_ninsig);

    //CREATE OUTLETS
    // stores the dumpout outlet in the obex with the generic function
    // object_obex_store(object pointer, key, val). The dumpout outlet is the
    // one used by attributes to report data in response to 'get' queries.
    // key=A symbolic name for the data to be stored
    // val=A t_object *, to be stored in the obex, referenced under the key
    // The generic case is normally adapted to be used as follows:
    // object_obex_store(x, _sym_dumpout, outlet_new(x, NULL));
    // creates new outlets with outlet_new(object, s).
    // s=A C-string specifying the message that will be sent out this outlet, or
    // NULL to indicate the outlet will be used to send various messages.
    object_obex_store(x, gensym("dumpout"), outlet_new(x, NULL));

    //CLOCK
    // creates a new Clock object with clock_new(object pointer, (method)fn).
    // This function is normally called in the new instance routine function.
    // fn=Function to be called when the clock goes off, this function must be
    // called object_tick.
    // The Clock object is used as an interface to the Max scheduler.

void bonk_debouncedecay_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        float f = atom_getfloat(av);
        f = (f < 0) ? 0 : f;
        f = (f > 1) ? 1 : f;
        x->x_debouncedecay = f;
    }
}

void bonk_attackbins_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        int n = atom_getlong(av);
        n = (n < 1) ? 1 : n;
        n = (n > MASKHIST) ? MASKHIST : n;
        x->x_attackbins = n;
    }
}
/* end attr setters */
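
Both setters shown above follow the same pattern: the incoming value is forced into a valid range before it is stored in the object. A minimal sketch of that pattern, independent of the Max API, is given here; clamp_float and clamp_long are hypothetical helper names that do not appear in the bonk∼ source, and the ranges in the usage note are only examples.

```c
/* Clamp a floating-point attribute value to the range [lo, hi]. */
static float clamp_float(float f, float lo, float hi)
{
    if (f < lo) return lo;
    if (f > hi) return hi;
    return f;
}

/* Clamp an integer attribute value to the range [lo, hi]. */
static long clamp_long(long n, long lo, long hi)
{
    if (n < lo) return lo;
    if (n > hi) return hi;
    return n;
}
```

For instance, clamp_float(f, 0, 1) gives the same result as the two conditional assignments in bonk_debouncedecay_set, and clamp_long(n, 1, MASKHIST) corresponds to the range check in bonk_attackbins_set.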
void bonk_assist(t_bonk *x, void *b, long m, long a, char *s) {
    if (m == ASSIST_INLET)
        strcpy(s, "(Signal) Audio Input, Analysis Attributes");
    else if (m==ASSIST_OUTLET) {
        switch (a) {
            case 0: strcpy(s, "(List) Raw Filter Amplitudes"); break;
            case 1: strcpy(s, "(List) Instrument Number, Loudness, Temperature");
                break;
            case 2: strcpy(s, "Dump"); break;
        }
    }
}

static void bonk_learn(t_bonk *x, int n)
{
    if (n < 0) n = 0;
    if (n)
    {
        x->x_template = (t_template *)t_resizebytes(x->x_template, x->x_ntemplate *
            sizeof(x->x_template[0]), 0);
        x->x_ntemplate = 0;
    }
    x->x_learn = n;
    x->x_learncount = 0;
}

/* get current system time */
double clock_getsystime()
{
    return gettime();
}

/* get the elapsed time since the given system time, in milliseconds */
double clock_gettimesince(double prevsystime)
{
    return ((gettime() - prevsystime));
}

float qrsqrt(float f)
{
    return 1/sqrt(f);
}
Figure B.1: Max patcher window showing our test patch realized to analyze the ·O M M· sounds with bonk∼ 3.0
Appendix C
Writing Externals for Max/MSP with Xcode
Define the class: its data structure, size, and how instances are to be created and
destroyed
Define functions (called methods) that will respond to various messages, performing
some action
The term externals covers all the external objects of Max/MSP, i.e. those not
included in the software distribution. An external can therefore be any object
that you create yourself (in a suitable programming language) or that somebody
else has developed. In chapters 4 and 5, bonk∼, an external object originally
developed by Miller Puckette, is extensively analyzed and modified for the purpose
of onset detection. The reason for writing an external is to add one or more
specific tasks to the logic and arithmetic objects or to the DSP chain of the
software.
First, we downloaded the Max 5 Software Development Kit (SDK) from cycling74.com,
which includes the framework, the API reference and some examples. The framework
contains the header files in which the standard Max/MSP functions and structs are
declared. The various parts of the framework are described in the API reference.
So, whether you create a new external or modify an existing one, you must do so
according to the Max/MSP API reference.
To develop your object,
Since most objects are written in C, we now describe the process of developing
objects in C, although externals can also be created in other programming
languages. To develop the external we used Xcode (version 3.1.2), the native IDE
of Apple Mac OS X, in its latest version at the time of writing. Let's look at an
Xcode example to understand the most significant contents of a project:
The Source folder contains the source code you develop, typically a single file
named yourobject.c. The External Frameworks and Libraries folder is where the
MaxAPI and MaxAudioAPI (MSP) frameworks are added. The Product folder contains
the external created by compiling the source code, while Target holds the compiler
options. The objects created are single files with the .mxo extension, but they
only appear to be files, because they hide their contents. Under Mac OS this is
called a "bundle", or simply a package.
A bundle contains a list of files and folders, like the ones shown in this view:
You can create either Max or MSP externals, depending on your requirements, with
some differences in their structure: essentially, MSP externals are those that
involve audio DSP, such as the one used in this work, while Max externals are logic
and arithmetic objects.
To use the external in the Max patcher window, you have to add the .mxo package,
produced by building the source code, to the msp-external (or max-external) folder
in the application folder. You can do this automatically by telling Xcode, in the
build target, where to build your object.
To better understand what targets are, you can think of them as the compiler
options. Most of the options are predefined when compiling an external for Max/MSP
with Xcode. One way to configure the target manually is to write a file with the
.xcconfig extension and add it to the project. You can then use it in the target
field in Xcode.
The source code of a Max external has three basic components:
1. the entry point, i.e. the main() function
2. the description of the object, i.e. the structs
3. the definition of the functionality (the behaviour), i.e. the methods
Some methods and elements of the structs are required by Max, and are explained in
the MaxAPI reference. The development of the source code can be summarized in five
points:
1. including the right header files (usually ext.h and ext_obex.h, plus z_dsp.h
for MSP objects)
2. declaring a C structure for your object
3. writing an initialization routine called main that defines the class
4. writing a new instance routine that creates a new instance of the class, when
someone makes one or types its name into an object box
5. writing methods (or message handlers) that implement the behavior of the object.
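
The structure outlined in the points above can be mimicked with a toy example in plain C. This is deliberately not written against the Max API (no ext.h, no class registration calls), so every name in it (toy_counter, toy_binding, toy_counter_class, toy_counter_new, toy_counter_send and the two handlers) is hypothetical. It is only meant to show how a data structure, a table that binds message names to methods, a new instance routine and a dispatcher fit together.

```c
#include <stdlib.h>
#include <string.h>

/* the object's data structure (point 2) */
typedef struct {
    long count;   /* number of "bang" messages received */
} toy_counter;

typedef void (*toy_method)(toy_counter *x);

/* one entry of the class's message table */
typedef struct {
    const char *name;
    toy_method  fn;
} toy_binding;

/* methods, i.e. message handlers (point 5) */
static void toy_counter_bang(toy_counter *x)  { x->count++; }
static void toy_counter_reset(toy_counter *x) { x->count = 0; }

/* class definition: which message selects which method (point 3) */
static const toy_binding toy_counter_class[] = {
    { "bang",  toy_counter_bang  },
    { "reset", toy_counter_reset },
};

/* new instance routine (point 4); returns NULL if allocation fails */
static toy_counter *toy_counter_new(void)
{
    toy_counter *x = (toy_counter *)malloc(sizeof(*x));
    if (x)
        x->count = 0;
    return x;
}

/* Dispatch a message by name; returns 1 if a method handled it, else 0. */
static int toy_counter_send(toy_counter *x, const char *msg)
{
    size_t i, n = sizeof(toy_counter_class) / sizeof(toy_counter_class[0]);
    for (i = 0; i < n; i++) {
        if (strcmp(toy_counter_class[i].name, msg) == 0) {
            toy_counter_class[i].fn(x);
            return 1;
        }
    }
    return 0;
}
```

In a real external the table and the dispatching are handled by Max itself: the initialization routine registers each method under a message name, and the patcher delivers the messages to the instance.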
Bibliography
121
Bibliography
March 1994.
[15] Roger B. Dannenberg. Nyquist Reference Manual. Carnegie Mellon University
School of Computer Science, Pittsburgh, PA 15213, U.S.A., 2007.
[16] Alain de Cheveignè. Pitch perception models - a historical review. Technical report,
CNRS - Ircam, Paris, 2004.
[17] Filipe Diniz, Iuri Kothe, Sergio L. Netto, and Luiz W. P. Biscainho. High-selectivity
filter banks for spectral analysis of music signals. EURASIP Journal on Advances
in Signal Processing, 2007.
[18] C. Dodge and T. Jerse. Computer music: syntesis, composition and performance.
Thomson Learning, 1985.
[19] Carlo Drioli and Nicola Orio. Elementi di acustica e psicoacustica, 1999.
[20] C. Duxbury, M. Sandler, and M. Davis. A hybrid approach to musical note onset
detection. In In Proc. Digital Audio Effects Workshop (DAFx, 2002.
[21] Chris Duxbury, Juan Pablo Bello, Mike Davies, Mark Sandler, and Mark S. Com-
plex domain onset detection for musical signals. In In Proc. Digital Audio Effects
Workshop (DAFx, 2003.
[22] Chris Duxbury, Juan Pablo Bello, Mark Sandler, and Mike Davies. A comparison
between fixed and multiresolution analysis for onset detection in musical signals. In
In Proc. Digital Audio Effects Workshop (DAFx, 2004.
[23] Ichiro Fujinaga. Max/MSP Externals Tutorial, 2005.
[24] Toby Gifford and Andrew R. Brown. Listening for noise: An approach to percussiv
onset detection. In The Australasian Computer Music Conference, 2008.
[25] M. Gimenes, E. R. Miranda, and C. Johnson. A memetic approach to the evolution
of rhythms in a society of software agents. In Proceedings of the 10th Brazilian
Symposium of Musical Computation (SBCM), Belo Horizonte (Brazil), 2005.
[26] John William Gordon. Perception of Attack Transients in Musical Tones. PhD
thesis, CCRMA, Department of Music, Stanford University, 1984.
[27] Paul Gurnig. An Introduction to Writing Externs in C for Max/MSP. University of
Chicago, 2005.
[28] Kurt Jacobson. A metric for music similarity derived from psychoacoustic features
in digital music signals. PhD thesis, University of Miami, 2006.
[29] K. L. Kashima and B. Mont-Reynaud. The bounded-q approach to time-varying
spectral analysis. Tech. Rep. STAN M-28, Stanford University, Department of
Music, 1985.
[30] A. Klapuri. Sound onset detection by applying psychoacoustic knowledge. In
ICASSP ’99: Proceedings of the Acoustics, Speech, and Signal Processing, 1999.
on 1999 IEEE International Conference, pages 3089–3092, Washington, DC, USA,
1999. IEEE Computer Society.
[31] Alexandre Lacoste and Douglas Eck. A supervised classification algorithm for note
onset detection. EURASIP J. Appl. Signal Process., 2007(1):153, January 2007.
122
Bibliography
[32] Kai Lassfolk and Jaska Uimonen. Spectutils, an audio signal analysis and visual-
ization toolkit for gnu octave. In 11th Int. Conference on Digital Audio Effects
(DAFx-08), 2008.
[33] Paul Masri. Computer Modeling of Sound for Transformation and Synthesis of
Musical Signals. PhD thesis, University of Bristol, UK, 1996.
[34] James Mccartney. Rethinking the computer music language: Supercollider. In
Rethinking the Computer Music Language: SuperCollider, volume 26, pages 61–
68, Cambridge, MA, USA, 2002. MIT Press.
[35] Jon Mccormack. A developmental model for generative media. In Advances in
Artificial Life, pages 88–97. 2005.
[36] E. R. Miranda. Computer Sound Design Synthesis techniques and programming.
Focal press, 2002.
[37] Eduardo R. Miranda. Artificial phonology: Disembodied humanoid voice for com-
posing music with surreal languages. Leonardo Music Journal, 15(1):8–16, 2005.
[38] M. S. Puckette, T. Apel, and David Zicarelli. Real-time audio analysis tools for pd
and msp. In In Proceedings of the ICMC, 1998.
[39] Miller Puckette. Is there life after midi? ICMA, 1994.
[40] Miller Puckette. Max at seventeen. Comput. Music J., 26(4):31–43, 2002.
[41] Miller Puckette. The Theory and Technique of Electronic Music. World Scientific
Publishing Co. Pte. Ltd., 2007.
[42] Miller S. Puckette. Pure data: recent progress. In Pure Data: recent progress,
1997.
[43] Arunan Ramalingam and Sridhar Krishnan. Gaussian mixture modeling of short-
time fourier transform features for audio fingerprinting. IEEE Transactions on In-
formation Forensics and Security, 1(4):457–463, December 2006.
[44] Curtis Roads. The Computer Music Tutorial. The MIT Press, February 1996.
[45] Curtis Roads. Microsound. The MIT Press, 2004.
[46] D. Rocchesso and F. Fontana. The Sounding Object. Mondo Estremo, 2003.
[47] Davide Rocchesso. Introduction to Sound Processing. GNU GNU Free Documen-
tation License, 2003.
[48] Davide Rocchesso. Programmazione visuale, versione 1.3, 2007.
[49] Davide Rocchesso. Sound to sound, sense to sense, 2008.
[50] X. Rodet and F. Jaillet. Detection and modeling of fast attack transients. In
International Computer Music Conference (ICMC), pages 30–33, 2001.
[51] E. D. Scheirer. Tempo and beat analysis of acoustic musical signals. Journal of
the Acoustical Society of America, 103(1):588–601, 1998.
[52] X. Serra. Musical Sound Modeling with Sinusoids plus Noise, pages 91–122. Swets
and Zeitlinger, 1997.
[53] Xavier Serra. Parshl: An analysis/synthesis program for non-harmonic sounds based
on a sinusoidal representation, 1985.
123
Bibliography
124