
POLITECNICO DI TORINO

III Facoltà di Ingegneria dell’Informazione


Corso di Laurea in Ingegneria Elettronica

Master's Degree Thesis

A perceptually grounded approach


to sound analysis
An application for Orchestra Meccanica Marinetti

Supervisor:
prof. Marco Masoero

Corrado SCANAVINO

JULY 2009
Acknowledgements

Àlaleria.

Contents

Acknowledgements III

1 Introduction 1

2 A Perceptually Grounded Approach... 5


2.1 Auditory Cognition (Reminding Psychoacoustics) . . . . . . . . . . . . . 5
2.1.1 Limits of Perception, Perception of Intensity. Loudness . . . . . . 6
2.1.2 The Human Ear . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 Perception of Time and Periods . . . . . . . . . . . . . . . . . . 10
2.1.4 Perception of Frequency, the Sensation of Pitch . . . . . . . . . 12
2.1.5 Perception of Timbre . . . . . . . . . . . . . . . . . . . . . . . 15

3 Digital Audio Concepts 17


3.1 Toward Digital Representation of Sound . . . . . . . . . . . . . . . . . . 17
3.2 Digital Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Filters Background . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.2 Introduction to Digital Audio Processing with Filters . . . . . . . 27
3.2.3 Digital implementation of filters . . . . . . . . . . . . . . . . . . 32
3.2.4 FIR Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.5 IIR Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4 ...To Sound Spectrum Analysis 37


4.1 Introduction to Sound Analysis in the Frequency Domain . . . . . . . . . 38
4.2 Introduction to the Fourier Analysis . . . . . . . . . . . . . . . . . . . . 40
4.2.1 Fourier Transform (FT), Classic Formulation . . . . . . . . . . . 42
4.2.2 Discrete Fourier Transform (DFT) . . . . . . . . . . . . . . . . . 43
4.3 The Short Time Fourier Transform (STFT) . . . . . . . . . . . . . . . . 46
4.3.1 The Filterbank View . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.2 Windowing: Length and Shape of the Window Function . . . . . 49
4.3.3 Computation of the DFT (via FFT) . . . . . . . . . . . . . . . . 51

4.3.4 The Inverse Short Time Fourier Transform & Overlap-Add Resynthesis . . . . 52
4.4 Constant-Q analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.1 Implementation of Constant-Q Analysis . . . . . . . . . . . . . . 55

5 Real-Time Audio Applications 59


5.1 Max/MSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2 CSound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3 Supercollider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.4 Chuck . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

6 Perceptual Onset Detection 69


6.1 The Curious Case of ·O M M· . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2 From Transient to Attack and Onset Definitions . . . . . . . . . . . . . 71
6.3 General Scheme for Onset Detection . . . . . . . . . . . . . . . . . . . 74
6.3.1 Energy Based Approach . . . . . . . . . . . . . . . . . . . . . . 76
6.3.2 Phase Based Approach . . . . . . . . . . . . . . . . . . . . . . . 77
6.4 Introduction to the Perceptual Based Approach to Onset Detection . . . 79
6.5 Onset Detection in ·O M M· . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.5.1 The bonk∼ Method . . . . . . . . . . . . . . . . . . . . . . . . 82
6.5.2 Result of the Analysis in ·O M M· . . . . . . . . . . . . . . . . . . 83
6.6 From Onset Analysis to Sound Classification . . . . . . . . . . . . . . . 85
6.6.1 Learning Results . . . . . . . . . . . . . . . . . . . . . . . . . . 85

7 Conclusion 87

A MSP, anatomy of the object 89

B bonk∼ source code 93


B.1 The bonk∼ Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

C Writing External for Max/MSP with XCode 117

Bibliography 121

List of Tables

5.1 Musical software for realtime synthesis and control . . . . . . . . . . . . 67


6.1 Filterbank design in our method based on bonk∼ . . . . . . . . . . . . 84
6.2 Results in detecting onset of the five soundtracks created for analysis
purpose, played at different bpm. ·O M M· . . . . . . . . . . . . . . . . . 85
6.3 Numerical result in detecting onset and recognizing the three sounds
(A/B/C) produced by the ·O M M· . . . . . . . . . . . . . . . . . . . . . 86

List of Figures

2.1 Winckel’s threshold of hearing [1967]. . . . . . . . . . . . . . . . . . . . 7


2.2 Equal-loudness contours for the human ear, determined experimentally by
Fletcher and Munson, published on Loudness, its definition, measurement
and calculation [1933]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Peripheral auditory system. . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Part of the inner ear, the cochlea is shaped like a 32 mm long snail and
is filled with two different fluids separated by the basilar membrane. . . . 12
2.5 Cochleagrams, expressed in bark unit as function of time. On the left
the spoken Italian word "ape", on the right a short excerpt of Moondog’s
“Pigmy pig”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 Simpler digital audio system . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Amplitude (A) response versus frequency, for the four basic types of
filters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 The pass-band or bandwidth of a band-pass filter is the difference between
the upper and lower cutoff frequencies. The cutoff frequencies are defined
as the frequencies at which the energy (rather than the amplitude) falls
to half of its pass-band value. In the figure, 40 dB is
assumed as the maximum level of amplitude in pass-band. . . . . . . . . 23
3.4 Example of application of a constant Q filter. Here the center frequencies
are tuned around generic musical octaves. In music, an octave is the
interval between one musical pitch and another with half or double its
frequency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 Alteration of the envelope of a tone (INPUT) passed through a narrow
filter (OUTPUT). The output envelope has been stretched in time during
onset and offset components of the tone (initial and final portion). . . . 27
3.6 Echo and reverberation effects explained by convolution. . . . . . . . . . 30
3.7 Simple delay line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.8 Circular buffer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1 Digital sound synthesis and sound analysis. . . . . . . . . . . . . . . . . 40

4.2 Two plots of static spectrum. The image represents the SPL against
frequency of a drum hit played by a robot (on the left), and a note of a
violin (on the right). The difference is noticeable, while the robot hit has
apparently no harmonically related frequency components, in the violin
note this is clear. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Basic operation of the STFT used for sound analysis. . . . . . . . . . . 46
4.4 Waterfall spectrum, a 3D representation of the STFT spectrum. The
graph was obtained with Spectutils package for GNU Octave. The anal-
ysis parameters of the STFT are shown above the figure, the audio sample
analyzed is extracted from Laurie Anderson’s Violin Solo. . . . . . . . . 48
4.5 Types of windows used in STFT for audio analysis. No ideal window
exists, the term "optimal window" is preferred. Several types of windows
are used, for musical purpose the Kaiser window has usually a preferential
use. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.6 Spacing of filters for STFT (filterbank view) on the top and Constant-Q
filterbank on the bottom. It is clear the advantage of the Constant-Q
filterbank method, which places the filters linearly against log(frequency),
which is similar to the frequency response of the human ear. . . . . . . . 54
4.7 Waterfall spectrogram of a Constant Q transform of violin glissando from
578 Hz to 880 Hz (D5 to A5). Taken from Judith Brown’s Calculation
of a constant Q spectral transform. [A glissando is a glide from one pitch to another.
It is an Italianized musical term derived from the French glisser, to glide. It is also where the pianist
slides up the piano with his or her hands. From Wikipedia.] . . . . . . . . . . . . . . . . 56
4.8 Waterfall spectrogram of a Constant Q transform of flute playing diatonic
scale from 262 Hz to 523 Hz (C4 to C5). Taken from Judith Brown’s
Calculation of a constant Q spectral transform. [In music theory, a diatonic scale
is a seven note musical scale comprising five whole steps and two half steps, in which the half steps
are maximally separated. From Wikipedia.] . . . . . . . . . . . . . . . . . . . . . . 57
5.1 Max 5 patcher window . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Max 5 window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.1 The two robots on the sides, SCS + the performer in the middle. . . . . 70
6.2 On the top, the waveform corresponding to a hit of a robot percussion-
ist of ·O M M· . On the bottom, the intensity profile of the hit (using
Praat), where onset, attack and the transient/steady state separation
are highlighted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.3 From top to bottom: waveform, static spectrum (FFT) and time-varying
spectrum (STFT). From right to left: one hit of ·O M M· robot, one hit of
snare drum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

6.4 Unwrapped phase deviation between two adjacent analysis frames. ∆ϕn,k
is the unwrapped phase deviation. For the simpler case represented by
a steady state sinusoid, the phase deviation is approximately 0 constant
in-between the whole analysis frames, while, during transient the phase
deviation should be extremely large and easy to detect. . . . . . . . . . 79
6.5 Graphical representation of the bounded-Q filterbank. Only the octave
are geometrically spaced, in between the octave the spacing between
analysis bins is linear. This allows the application of FFT-like algorithm
to calculate the spectrum of each component. . . . . . . . . . . . . . . 83
B.1 Max patcher window showing our test patch realized to analyze the ·O M M·
sounds with bonk∼ 3.0 . . . . . . . . . . . . . . . . . . . . . . . . . . 116
C.1 XCode main window . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
C.2 a Bundle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

Chapter 1

Introduction

Sound analysis is a very wide area of research; its typical applications range from
environmental impact studies, vibration modeling and bioacoustics to... music.
Each of these fields has its own specific characteristics, so the best approach has to be
identified case by case.
The specific field of application of this thesis is musical, meaning that the method has
been tested within the Orchestra Meccanica Marinetti project, or ·O M M· . This thesis
is about a specific sound analysis approach, defined as perceptually grounded because
it mimics the human perception (auditory system) of sound. Perceptual sound
analysis is an ideal candidate for applications in this context. The idea was to extend or
improve some of the musical capabilities of the robotic orchestra.
·O M M· is a project about a robotic orchestra, controlled in real time by a performer. The
project was conceived by the programmer and digital artist Angelo Comino AKA Motor,
and consists mainly of two robots, which play drums, conducted by a performer through
a gestural controller, via MIDI. The two robots are more than 2 meters high, and the
drums consist of oil cans (such as those used by petrol companies) of standard size.
These devices were designed and built with industrial components, with special care taken
to emulate the movement of a real drummer, by the people of the Mechatronics Lab
(LIM) of Politecnico di Torino, with the collaboration of local robotics companies:
Prima Electronics, ERXA and ACTUA. Each robot has two arms, moved by two powerful
electric motors controlled by dedicated FPGA-DSP hardware, while the interaction with the
performer is adjusted in real time by the Show Control System, developed in Max/MSP
(a typical development environment for this kind of application), running on an Apple
laptop. A picture of the ensemble, robots + performer, can be seen in figure 6.1.
·O M M· was presented in October 2008 in Turin, gaining great visibility in the national
media (press and television); it is currently completing its engineering development
and starting its artistic deployment.
The idea of a robot musician is not new: in the early 80s, WABOT-2 was developed at
Waseda University (Japan), a robot keyboardist able to converse with a person, read
a musical score with its eyes (a camera) and play it on an electronic organ. A robot
drummer, Haile, has been developed since 2005 at the Georgia Tech College of
Computing; it is able to listen to live musicians and accompany them, playing a drum.
Haile’s output is based on live analysis and processing of the sounds produced by the
other musicians playing at the same time, not on pre-recorded sequences. Other examples
are the recent robotic trumpeter and violinist by TMC (Toyota Motor Corporation).
One of the critical points within a mechanical orchestra is synchronizing the execution of the
musical score between real and virtual instruments. The human ear is particularly sensitive
to timing, but physical devices, such as the electromechanical arms of the robots, have
variable delays when activated. These variations depend on the requested note intensity
or on the execution rhythm. Each robot basically receives a message on a serial line (MIDI)
stating what kind of hit should be executed. Besides the message generation and data
transmission delay, usually negligible from a human point of view, we have the delay
introduced by the physical movement of the arms. Thus, we need to measure the time
interval between the digital command and the perceived strike on the can.
We propose a "perceptually grounded approach" to recognize the different strikes, in
order to compute the delay matrix of a generic score. The robots can play approximately
two hits per second with each arm, and the sounds they can play consist of three different
variations. That is, a robot arm must be positioned at the correct height and will hit the
drum after a delay, which is primarily related to the distance of the arm from the drum
and the acceleration with which the arm is driven. Hence, the delays are variable; moreover,
vibrations not fully absorbed between one hit and the next can cause unwanted changes
in the perceived loudness and pitch. That is why a perceptually based approach was needed
to characterize the robots’ performance behaviour and their response to the digitally
applied stimuli.
The thesis contains, in its first part, an overview of the phenomena occurring in our
auditory system, in particular when a new musical event occurs (i.e. a robot hit, in
our case), in which our ear encodes both the time and frequency phenomena related to the
human perception of sound. In the specific case of percussion, the sound produced can,
for convenience, be subdivided into two parts, the transient (the origin of the sound) and
the steady-state portion (which can be thought of as the extended support of the transient).
Two important points should be considered. First, during the transient, the precise instant
at which the sound originates (the onset of a sound) corresponds to a rapid increase in
sound energy, which can reach its peak in less than 5 ms. Detecting the onset, our first
aim, is not a trivial task. Second, the time at which the onset occurs marks the meaningful
component of a percussive sound; performing a spectrum analysis at this point, the
information obtained is usually considered sufficient to predict the entire sound, i.e. the
steady-state portion of the sound can be derived.

Efforts have then been extended to the subfield of signal processing concerned with specific
treatments of the musical audio signal (DSP). We therefore present the digital audio
representation and the realization of digital filters, basic (but still very advanced) components
of every digital signal processing task. For this purpose the books of Roads, Dodge,
Rocchesso and Beauchamp were a good starting point. Later we introduce the methods
used for audio analysis, in particular those systems which perform harmonic analysis,
usually grouped under the name Fourier analysis. The Fourier transform, applied to a
musical signal, can be viewed as a decomposition of the sound into a finite number of
harmonics, each represented by a complex value. This value is sufficient to extract all the
information needed to derive the frequency, intensity and phase of each harmonic, but it is
not the only solution for musical analysis. The method we suggest considering, under
certain conditions (in particular for music or speech), is based on the Constant-Q Transform,
which approximates well some features of the human auditory system.
Composers of the 20th century have contributed to the evolution of electronic music
in ways that even they would not have expected. From Luigi Russolo and the intonarumori
in 1918, mechanical instruments producing non-harmonic sounds (The Art of Noises),
musical schemes have been continuously redefined. Russolo influenced even Stravinsky (Paris
1921), and after Stravinsky and the expressivity and richness of his music, musicians
also became technicians and explored new electro-acoustic machines for producing sound.
Several years passed between the Russian composer and the first Moog, but the
works of other composers like Bartók, Varèse, Messiaen, Schaeffer, Ligeti and Cage
kept innovation moving forward.
In parallel, the studies of certain mathematicians and physicists of the
19th century (primarily Helmholtz and Fourier) led to the discoveries
that made possible the realization of the first electronic instruments, which became famous
with movies like "Forbidden Planet" (Louis and Bebe Barron) or "2001: A Space Odyssey"
(Ligeti, and the HAL voice inspired by the computer synthesis experiments of Max Mathews),
or with the experimental works of Norman McLaren.
Other experiments between art and technology have been proposed, such as the
electroacoustic compositions "A man sitting in a cafeteria" by Charles Dodge and "I
am sitting in a room" by Alvin Lucier, very attractive for their expressivity and their
educational approach. The first is one of the earliest experiments in reproducing speech
with a computer, and the second is a brilliant example of the application of the different
impulse responses of a room. This music comes from the 60s and 70s.
Thank you!

Chapter 2

A Perceptually Grounded
Approach...

There are no theoretical limitations to the performance of the computer as a source
of musical sounds, in contrast to the performance of ordinary instruments. At present,
the range of computer music is limited principally by cost and by our knowledge of
psychoacoustics. M. V. Mathews1

2.1 Auditory Cognition (Reminding Psychoacoustics)
Some of the subjects treated in this section require notions of acoustics.
Intensity, frequency, duration and spectrum are the physical attributes used in the literature to
describe the acoustical properties of a sound. These attributes do not form music by themselves,
but they can vary the perception of each sound component of a musical flow. The
perceptual attributes, pitch, loudness and timbre, describe how the physical attributes
of sound are perceived and interpreted as mental constructs by the brain, through
our hearing system.
The composer needs to know how to construct and balance the physical attributes of sound
in a way that corresponds, more or less, to the composer’s musical concept [44].
Since sound is carried by vibrations2, propagating through a medium such as air,
the detection of these vibrations constitutes our sense of hearing. The physical information
1. From the article The Digital Computer as a Musical Instrument, published in Science, dated
1 Nov. 1963. Computers now cost a little less. And some psychoacoustics will follow.
2. Other representations of sound are possible; a microscale has been proposed, i.e. sound can be
decomposed into smaller time units called microsounds or sound particles. See [45] and Gabor, Schaeffer,
Schoenberg et al. for more on different sound decompositions.


conveyed by sounds, together with the study of natural auditory systems (the human ear),
has been successfully used to derive the relationship between physical stimuli and the
induced mental constructs.
The subfield of psychophysics (the study of psychological responses to physical stim-
uli) describing these phenomena is psychoacoustics.

2.1.1 Limits of Perception, Perception of Intensity. Loudness


Intensity is proportional to the energy, i.e. to the variance of the air pressure, in a sound wave.
Sound intensity is measured in terms of sound pressure level (SPL) on a logarithmic
scale, so that the result can be expressed in dB:

SPL [dB] = 20 · log10(p / p0)

where p0 corresponds to the estimated threshold of hearing at 1 kHz. The threshold of
hearing is generally reported as the RMS3 sound pressure of 20 µPa, which is approxi-
mately the quietest sound a young human with undamaged hearing can detect at 1 kHz.
Sound pressure, and hence SPL, decreases with distance from the sound source.
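As a quick numerical check of the formula above (a minimal sketch, not part of the original text), the following C fragment computes the SPL of two pressure values relative to the 20 µPa reference:

    #include <stdio.h>
    #include <math.h>

    /* Sound pressure level relative to the 20 uPa threshold of hearing. */
    static double spl_db(double p_pascal) {
        const double p0 = 20e-6;              /* reference pressure, 20 uPa */
        return 20.0 * log10(p_pascal / p0);   /* SPL in dB                  */
    }

    int main(void) {
        printf("p = 20 uPa -> %.1f dB SPL\n", spl_db(20e-6)); /* 0 dB  */
        printf("p = 0.2 Pa -> %.1f dB SPL\n", spl_db(0.2));   /* 80 dB */
        return 0;
    }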
Loudness is the perceptual attribute related to changes in intensity: increases in sound
intensity are perceived as increases in loudness. Unfortunately the relationship is not trivial.
Loudness also depends on other factors like spectrum, duration and the presence of
background sounds.
Winckel4 in 1967 proposed the range of hearing for a young adult human ear, shown
in figure 2.1; this range can vary with age and with the individual’s sensitivity. Winckel’s
range of hearing is valid for sustained sine tones. For shorter tones the threshold can rise;
this is because, approaching the borders of the threshold, the ear seems to integrate energy
over time, at least for tones shorter than about 200 ms. Other studies have shown that the
human body can sense very low frequencies, although the ears do not, and that the upper
limit of sensitivity may be well beyond 20 kHz.
Another useful tool is the set of Fletcher-Munson curves. They proposed a graph similar to
that of Winckel, introducing the concept of equal-loudness contours, easy to identify in
figure 2.2. The meaning of this graph is that every point on a given curve has roughly the
same loudness. The loudness levels of these curves are expressed in phons. A phon is
intended as the "number of dB at 1 kHz". In other words, a sine tone at 1 kHz with an intensity of 50 dB has a
loudness level of 50 phons. Therefore, if we want to produce a sine tone at 300 Hz with
the same loudness as the 1 kHz tone, it is necessary to follow the 50-phon curve until
300 Hz and use the corresponding value of SPL; the two tones will then sound equally
loud to the listener.

3. Root mean square, a statistical measure of the magnitude of a varying quantity.
4. Fritz Winckel, Austrian acoustician, is considered one of the pioneers of electronic music. In 1967 he
published the book Music, Sound and Sensation: A Modern Exposition.

Figure 2.1: Winckel’s threshold of hearing [1967].

Obviously, the perfect sine wave is an abstraction; no sound exists in nature as the expression
of a single frequency. However, it can be demonstrated that a sound can be decomposed5
into a sum of perfect sine waves. We can therefore assume that each of these components,
weighted by the Fletcher-Munson curves and then summed, contributes to the total loudness.
But this too is a theoretical situation, since such linearity cannot actually be assumed,
at least not over the whole spectrum, because of the presence of critical bands6.

Before introducing the time and frequency perception of human hearing, the most
advanced features of our auditory system, it may be better to understand how the ear
works.


Figure 2.2: Equal-loudness contours for the human ear, determined experimentally by
Fletcher and Munson, published in Loudness, its definition, measurement and calculation
[1933].

2.1.2 The Human Ear


The peripheral auditory system is the medium by which sound waves are detected,
encoded, and retransmitted through nerve cells to the brain, where humans finally
render sound. Although very sophisticated, the process can be intuitively subdivided into
three steps, each accomplished in a different place in the ear.

• The outer ear: amplifies and conveys the incoming sound waves (air vibrations).
Here the sound waves enter the auditory canal, which can amplify sounds con-
taining frequencies roughly between 3 and 12 kHz. At the far end of the
auditory canal is the eardrum (or tympanic membrane), which marks the beginning
of the middle ear.

5. See chapter 4, Fourier Transform and Overlap-Add Resynthesis, for an explanation of this fact.
6. See the discussion of critical bands later in this chapter.

Figure 2.3: Peripheral auditory system.

• The middle ear: transduces air vibrations into mechanical vibrations.

Sound waves coming from the auditory canal hit the tympanic membrane. Here,
three delicate bones, the malleus (hammer), incus (anvil) and stapes (stirrup),
convert the low-pressure sound vibrations of the eardrum into higher-pressure
vibrations on another, smaller membrane, called the oval (or elliptical) window.
Finally, the stapedius muscle has the role of protecting the inner ear from damage.
The middle ear still contains the sound information in wave form; it is converted
to nerve impulses in the cochlea. The higher pressure is necessary because the
inner ear beyond the oval window contains liquid rather than air.


• The inner ear: processes mechanical vibrations and transduces them mechani-
cally, hydrodynamically and electrochemically. The results are then transmitted
through nerves to the brain.
The inner ear consists of the cochlea and several non-auditory structures. The
cochlea has three fluid-filled sections, and supports a fluid wave driven by pressure
across the basilar membrane separating two of the sections. Strikingly, one section,
called the cochlear duct or scala media, contains an extracellular fluid similar in
composition to endolymph, which is usually found inside of cells. The organ of
Corti is located at this duct, and transforms mechanical waves to electric signals
in neurons. The other two sections are known as the scala tympani and the scala
vestibuli, these are located within the bony labyrinth which is filled with fluid
called perilymph. The chemical difference between the two fluids (endolymph &
perilymph) is important for the function of the inner ear.

Additional processing occurs at the brain level; for example, other neurally encoded in-
formation is used to combine the signals coming from both ears and fuse them
into one sensation. However, although complex, these mechanisms alone do not give the
brain the information necessary to understand, for example, a single note, a harmony, a rhythm,
or higher-level musical structures. It appears that the low-level time and frequency
perceptual mechanisms both operate on the musical signal in parallel. Thus the de-
termination of the nature of a sound comes not only from the physical properties of the
sound and of the human ear: all this information is combined at a higher level (i.e. in
the brain), where the sound takes its musical form.

2.1.3 Perception of Time and Periods


Higher-level perceptual processes are possible only because other mechanisms, in
the inner ear, encode both time and frequency. In this section we look at temporal
features; three of them seem to be the most prominent: period detection, event detection
and temporal integration.

Period detector
The period-detection mechanism inside the auditory system operates on the fine structure
of the neurally translated incoming waveform. The neural pattern is produced by nerve
cells (in the organ of Corti) firing individually or in groups, at a rate which corresponds
to the wave’s period. Individually, each cell can operate in this manner only up to a
certain period; if the period is too small, the cells cannot recover quickly enough. However,
groups of cells can rotate or stagger their firing, so that they, in effect, follow submultiples of
the sound’s period.


A special feature is that the ear can encode variations in the envelope of the wave;
studies have demonstrated the existence of a mechanism in the central auditory system
that detects amplitude modulation (AM), although only in a small range of modulation
frequencies (75 to 500 Hz) and only for a significant depth of modulation.

Event detector

Another time-related mechanism, deep inside the human ear, is the perception of events.
A musical event occurs every time there is a variation of the vibration pattern, that is,
something happens nearby and we hear a new sound. The sound onset7 is the perceived
instant at which a new sound is born. At onset time other nerve cells fire, and different
cells operate on different onset slopes. A model for onset detection, developed by Gordon
in 1984 [26], showed that the moment of perceptual onset of a musical event can be
significantly delayed with respect to the physical onset. Another problem is that it is not
possible to establish unequivocally the threshold over which an event becomes audible to
the ear, that is, the threshold over which the ear recognizes the onset. What does the
human ear consider an audible event? Bilmes proposed these questions: does it refer
to the time when the physical energy in a signal increases infinitesimally? the time of peak
firing rate of the cochlear nerve? the time when we first notice a sound? the time
when we first perceive a musical event? or something else? [8] Whatever it means,
it has been demonstrated that the perceptual onset time does not necessarily coincide with
the initial increase in physical energy.
Moreover, since other cells respond to the temporal interval between events, the human
auditory system is able to connect single events into a rhythmic stream.

Temporal integration

Temporal integration is another important feature in the perception of time. The human ear
seems to integrate two or more events if they are too close together. This is the principal
limit to the resolution of perceived rhythm. The minimal time between events that the
human ear can sense separately is variable, depending on the duration of each event.
For example this period can be a few milliseconds if the events are very short, but it can also
be much greater than 50 ms.
What happens when a succession of sound events cannot be perceived separately in
time by the human ear? They smear together to form one sensation; in other words, tem-
poral resolution is lost. Therefore, the human ear has no fixed “time resolution”. Many
phenomena are related to temporal integration, one for all: the (sometimes desired)
effect of reverberation.
7. Onset is the point at which a musical event becomes audible. For percussive sounds it is considered
to coincide with the attack time, the instant at which the stick strikes the drum.


The case of reverberation


Reverberation is different from an echo, and also from a sequence of echoes. If a sound
is reflected by a surface we hear the sound and its echo; if the surface is irregular or
other surfaces are present in the room, several echoes can be heard. The number of
echoes per second is normally referred to as the echo density. When the echo density is
greater than 30, individual echoes are separated by less than 35 ms and the ear cannot
perceive them separately. The fusion of the echoes into a single sensation is what we call
reverberation.
Not only short time intervals between events affect the probability of smearing, but also
frequency: if two successive notes in a musical stream have similar frequencies they
will probably smear together.

2.1.4 Perception of Frequency, the Sensation of Pitch

Figure 2.4: Part of the inner ear, the cochlea is shaped like a 32 mm long snail and is
filled with two different fluids separated by the basilar membrane.

Frequency is a physical parameter associated with each wave that carries sound energy
to the ear. Pitch is the perceptual parameter related to frequency; it can be thought of as
the quality of a sound, governed by the rate of the vibrations producing the sound [44].
In the inner ear, the oscillations of the oval window take the form of traveling
waves which move along the basilar membrane, i.e. along the entire length of the cochlea.


The mechanism for detecting frequencies is located in the basilar membrane. A simple
correspondence occurs: when a single sine tone excites the ear, a region of the basilar
membrane oscillates around its equilibrium position. Since real sounds never contain a
single frequency, this region shows a place where the excitation has a maximum, corresponding
to the fundamental frequency. The distance of this maximum from the end of the basilar
membrane is directly related to frequency, so that each frequency is mapped to a precise
place along the membrane. The mechanical properties of the cochlea (wide and stiff
at the base, narrower and much less stiff at the apex) produce a roughly logarithmic
decrease in bandwidth as we move linearly away from the cochlear opening (the oval
window), as shown in figure 2.4. Thus, the auditory system acts as a spectrum analyzer,
detecting the frequencies in the incoming sound at every moment in time. In the inner
ear, the cochlea can be understood as a set of band-pass filters, each filter letting only
frequencies in a very narrow range pass. This mechanism can be associated with a
filterbank of constant-Q filters8, because their center frequencies are linearly spaced on a
logarithmic scale9.
However, the sensation of pitch is not only related to the perceived fundamental frequency.
Other contributions, related to the temporal mechanisms encoded in the ear, such
as period detection, can alter the sensation of pitch.
The sounds that the ear can sense span a wide frequency range, approximately from 20 Hz to
20 kHz. The perceived pitch, also expressed in Hz, has a more limited range, approximately
from 60 Hz to 5 kHz.

Critical Bands
Since each frequency stimulates a region of the basilar membrane, a limit is imposed on the
frequency resolution of the ear. This limit is reflected in another characteristic of
perception, known as the critical band.
A simple example helps to understand how the ear behaves within a critical band.
Think of, or better listen to, two sine waves very close in frequency: their total loudness
is less than the sum of the two loudnesses we would hear if they were well separated
in frequency. Now, if we slowly move them apart in frequency, we perceive the same
loudness up to a point; beyond a certain frequency separation the total loudness increases
approximately to the sum of the individual loudnesses. The frequency difference
needed to perceive the loudness as the sum of the individual loudnesses is the critical band.
The ear’s behavior in this region can be thought of as a kind of frequency integration,
similar to the temporal integration we have seen earlier. Inside the critical band reside
other important factors of perception: roughness and beating. Roughness is a sensation
of dissonance; its presence is particularly strong near the lower and upper bounds of the
critical band, where the two tones are almost separated but not yet perceived as two
distinct sounds. In the middle of the critical band the two tones are heard as one, with a
frequency that lies between the two, and there we can clearly perceive the sensation of
beating. When the two tones are separated by 1 Hz we perceive a single beat per second.
The width of the critical bands (their bandwidths) increases with frequency.

8. A constant-Q filterbank is a set of band-pass filters whose bandwidths scale with their center
frequencies, so as to maintain a fixed ratio (Q).
9. See chapter 4, the section on Constant-Q analysis, whose benefit is based on its similarity to the
pitch-detection mechanism of the basilar membrane.

Figure 2.5: Cochleagrams, expressed in Bark units as a function of time. On the left the
spoken Italian word "ape", on the right a short excerpt of Moondog’s “Pigmy Pig”.
The Bark scale was proposed to represent the behavior of the human ear inside the crit-
ical bands. An example of such a representation is shown in figure 2.5, where the
spectrogram is plotted against frequency on a Bark scale; in this case the term
cochleagram is appropriate. The Bark scale (of human hearing) ranges from
1 to 24 Bark, corresponding to the first 24 critical bands. The proposed Bark center
frequencies, in Hz, are:

50, 150, 250, 350, 450, 570, 700, 840, 1000, 1170, 1370, 1600, 1850, 2150, 2500,
2900, 3400, 4000, 4800, 5800, 7000, 8500, 10500, 13500

while their corresponding bandwidths, in Hz, are:

100, 100, 100, 100, 110, 120, 140, 150, 160, 190, 210, 240, 280, 320, 380, 450, 550,
700, 900, 1100, 1300, 1800, 2500, 3500, 5000

These center frequencies and bandwidths should be interpreted as being associated with
a specific fixed filter bank in the ear. Note that, since the Bark scale is defined only up
to 15.5 kHz, the highest sampling rate for which the Bark scale covers the whole range up
to the Nyquist limit is 31 kHz.
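As an illustration, the sketch below maps a frequency in Hz to an approximate Bark band number using only the center frequencies listed above and a simple nearest-center lookup; this lookup is an assumption made for the example, not an exact Bark formula.

    #include <stdio.h>
    #include <math.h>

    /* Bark band center frequencies (Hz), as listed in the text. */
    static const double bark_center[24] = {
          50,  150,  250,  350,  450,  570,  700,  840, 1000, 1170, 1370, 1600,
        1850, 2150, 2500, 2900, 3400, 4000, 4800, 5800, 7000, 8500, 10500, 13500
    };

    /* Rough Hz -> Bark band number (1..24): index of the nearest center frequency. */
    static int bark_band(double f) {
        int best = 0;
        for (int k = 1; k < 24; k++)
            if (fabs(f - bark_center[k]) < fabs(f - bark_center[best]))
                best = k;
        return best + 1;
    }

    int main(void) {
        printf("440 Hz -> Bark band %d\n", bark_band(440.0));  /* band 5  */
        printf("3 kHz  -> Bark band %d\n", bark_band(3000.0)); /* band 16 */
        return 0;
    }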
When many frequencies are present (fundamental tones and harmonics) the auditory
system works on all of them simultaneously, with the limit of resolution introduced by the
critical bands. This effect on the overall spectrum is another contribution to the perceived
pitch. Moreover, pitch is also influenced by inharmonic spectra, which are a
characteristic of noise.

Perception of noise
White noise does not evoke a pitch sensation because it is completely random and has a flat
spectrum, which evokes no particular sensation other than, perhaps, annoyance. Since colored
noises are created by modulating white noise, some of them can yield a vague pitch sensation,
depending on the modulation applied. For example, for an AM modulation of white noise,
there may be a pitch corresponding to the modulation frequency. Other pitch sensations can
be obtained by filtering or applying digital effects to white noise.
We have seen the most important factors characterizing the sensation of pitch; now
we can introduce the last perceived attribute of a sound, the timbre.

2.1.5 Perception of Timbre


The generic definition of timbre is this: the attribute by which we can distinguish two
sounds with the same loudness and pitch. Thus timbre is the character or quality of a
musical sound, distinct from, but influenced by, its pitch and loudness. Sometimes, more
evocatively, it is also referred to as the color of a sound.
The characteristics determining timbre reside in the constantly changing spectrum
of a musical sound, produced for example by an instrument. The steady-state spectrum
is not enough to distinguish a sound produced by one instrument from another; the
attack and decay portions of the spectrum are also very important. Therefore, timbre must
have more than one dimension, because it involves the temporal envelope and the evolution
of the spectral distribution over time [44].

Chapter 3

Digital Audio Concepts

Prior to the study of specific aspects of sound analysis (chapters 4 and 6), it is worth
clarifying the basic concepts behind the representation of audio on digital computers. In the
following paragraphs, the main attributes of a digitized sound are dealt with, from basic
terms (sampling and quantization) to advanced applications (digital filters).
This chapter is therefore divided into two sections: the first contains a brief illustration
of the theory behind the digital representation of music, while the second gives a deeper
explanation of how filters are implemented on digital computers.

3.1 Toward Digital Representation of Sound


Analog audio signal
Sounds come in the form of vibrations, carried to our ears by a physical medium such
as air. In an electrical system, thus replacing the ears with a microphone, sounds
are transduced into a time-varying voltage in accordance with the vibration patterns
present in the air. This is what is called the analog audio signal, a continuous signal
which consists of a continuum of values. In physics, an analog sound signal is usually
considered a one-dimensional signal representing the air pressure on the microphone
membrane.

The Sampling Theorem (Nyquist/Shannon)


In order to perform any sort of sound processing on a digital computer, the analog signal
must be reduced to digital data, each datum representing a discrete value of the signal’s in-
stantaneous voltage. The operation that transforms the analog signal into a digital signal
is, ladies and gentlemen, sampling. The sampling theorem states that, in order to
accurately represent a sound digitally, the sampling rate, defined as the frequency in Hz
at which the sampling operation is performed, has to be at least twice the frequency
band of the analog signal. The frequency band is determined by the maximum frequency
contained in the signal. Since the (average) upper frequency limit of human hearing is
considered to be 20 kHz, a sampling rate higher than 40 kHz must be chosen. This is
enough to allow a reconstruction of the original signal, starting from the samples, that
the human ear cannot distinguish from the original.
The device performing this operation is called an analog-to-digital converter (ADC).
At each period (i.e. the inverse of the sampling rate), the ADC produces a string of
binary digits, called a sample, which is stored in memory in the exact order it is
produced. The inverse operation, from digital to analog, is realized by the digital-to-
analog converter, DAC.
The sampling rates normally used in computers to represent digital audio signals are
44.1 kHz and 48 kHz. The frequency at half the sampling rate is called the Nyquist
frequency. The higher the sampling rate, the greater the Nyquist frequency and conse-
quently the frequencies that can be represented (but also the demands on speed and
power consumption of the hardware).

Aliasing
Like any other analog-to-digital conversion, audio conversion may be affected by
the problem of aliasing. Aliasing occurs because frequencies higher than half the sampling
rate (the Nyquist frequency) may be present at the input of the ADC. This results in a
distortion of the original signal, and it can be heard, in acoustical terms, as an unwanted
change in pitch1, because frequencies above Nyquist are folded back to lower frequencies.
The problem can easily be overcome by placing an anti-aliasing filter2 before the ADC,
which ensures that only signal components below Nyquist enter the converter. This
arrangement is also replicated at the end of the audio chain, between the DAC and the
speaker, for the same reason. A generic digital audio system is shown in figure 3.1.
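As a small numerical sketch of this folding effect (not part of the original text, and assuming an ideal sampler with no anti-aliasing filter), the following C fragment computes the apparent frequency of a tone sampled at 44.1 kHz:

    #include <stdio.h>
    #include <math.h>

    /* Frequency at which a sine of frequency f appears after being sampled
     * at rate fs with no anti-aliasing filter (folding around Nyquist).   */
    static double alias_freq(double f, double fs) {
        return fabs(f - fs * round(f / fs));
    }

    int main(void) {
        double fs = 44100.0;                                     /* CD sampling rate */
        printf("10 kHz -> %.0f Hz\n", alias_freq(10000.0, fs));  /* 10000 Hz (below Nyquist) */
        printf("30 kHz -> %.0f Hz\n", alias_freq(30000.0, fs));  /* 14100 Hz (aliased)       */
        return 0;
    }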

Dynamic Range and Signal-to-Noise Ratio


The dynamic range is the difference between the softest and the loudest sound that
can exist in a system. It is expressed in decibels, because of their useful logarithmic
compression of large numbers (e.g. doubling a power ratio corresponds to a 3 dB increment).
Since the decibel is a unit of measurement for ratios, in acoustics dB is used to express
the ratio between the actual intensity level and a reference level. The reference level is
normally taken to be the threshold of hearing, 10⁻¹² W/m².
1. Pitch is a sound attribute, perceptually related to frequency. Pitch and the other perceptual musical
attributes are presented in chapter 2.
2. The simplest realization of an anti-aliasing filter is a low-pass filter with cutoff frequency equal to the
Nyquist frequency. See the next section for an explanation of filters.


Figure 3.1: A simple digital audio system: analog audio input → low-pass filter (at the
Nyquist frequency) → ADC → digital samples in memory → DAC → low-pass filter (at
the Nyquist frequency) → analog audio output.

The maximum value of the dynamic range for human hearing is called the threshold of
pain3 and is estimated above 120 dB (the Hiroshima explosion was about 180 dB). If a
sound is particularly short the threshold of pain can increase, but it is better not to try...
When recording music, it is important to capture the widest possible dynamic range,
in order to reproduce the music in its full expressive range. For example, recording an
orchestra requires a wider dynamic range than a solo instrument [44]. The number of bits
(Nbit) used to represent each sample4 has a direct influence on the maximum dynamic
range of a digital audio system. The following simple formula can be used for this purpose:

(DR)Max = Nbit · 6.11 [dB]

Therefore, a 24-bit system may reach about 147 dB, much more than the threshold of pain.
Consideration should be given to noise when speaking of dynamic range, because
noisy sound components (not only the noise introduced by all electronic devices, but
real noisy sounds) are always present around an audio system and can alter, for example,
the lower end of the dynamic range. The signal-to-noise ratio (SNR) compares the level of
a given signal to the level of noise in the system. Noise can have a wide variety of meanings
and also depends on the environment and on the sensitivity of the listener. SNR is also
expressed in dB, so a higher value means a cleaner sound. The SNR of a good audio system
is often higher than 90 dB. Dynamic range and SNR are good indicators of the quality of
an audio system, but not the only ones.

3. Levels higher than this threshold can seriously damage the human hearing system.
4. That is, the quantization, explained in the next section.

Quantization Error
Now we present the last (and one of the most important) factors determining digital
audio quality: quantization. How many bits are needed to represent the sampled
amplitude of the signal? Normally the answer is given by the maximum resolution of the
ADC used to perform the sampling. Obviously, the higher the resolution of the converter,
the better the quality of the digitized sound. Since the number of bits is a finite integer n,
only 2^n values can be used to represent the original value; these are called quantization
levels. When the system has to convert a value that does not fall exactly on a quantization
level, rounding is necessary. The quantization error is the difference between the real value
and the binary value used to represent it; it affects almost all samples and introduces the
quantization noise. In 16- and 24-bit ADCs the quantization noise is negligible.
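A minimal sketch of the rounding step just described: the C fragment below quantizes one sample value to different bit depths (assuming a simple mid-tread rounding scheme) and prints the resulting quantization error.

    #include <stdio.h>
    #include <math.h>

    /* Round a sample in [-1, 1) to the nearest of the 2^nbits quantization levels
     * and return the quantization error through *err.                            */
    static double quantize(double x, int nbits, double *err) {
        double half_levels = pow(2.0, nbits) / 2.0;
        double q = round(x * half_levels) / half_levels;   /* mid-tread rounding */
        *err = x - q;
        return q;
    }

    int main(void) {
        double x = 0.3001234567;   /* arbitrary "analog" sample value */
        for (int nbits = 8; nbits <= 24; nbits += 8) {
            double err, q = quantize(x, nbits, &err);
            printf("%2d bit: q = %.9f, error = %.2e\n", nbits, q, err);
        }
        return 0;
    }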

Digital Audio Signal, File Format and Perceptual Codec


Digital audio samples are finally grouped into files, to be stored on the hard drive of a
computer. Here we can distinguish between two different kinds of file format: those
obtained after a compression algorithm is applied and those that are not compressed. In the
non-compressed case, all the samples are stored without modification and a preamble
(header) is added at the beginning of the file, which includes the information required by a
player to correctly reproduce the musical content (sampling frequency, number of channels,
modulation, etc.). Compressed files were obviously introduced to reduce the amount of
space required to store audio. The quality of the algorithm determines the quality of the
compressed file (compared with the original, non-compressed one). In MP3, for example,
the algorithm takes into account phenomena occurring in the perception of frequency
(frequency masking) by the human ear. Therefore, what it does is a decimation (and
adaptive bit re-allocation) of the samples used to represent frequencies that the human
ear cannot sense or distinguish at the resolution imposed by the sampling. For this reason,
compressors such as MP3 are sometimes called perceptual codecs; under certain conditions,
e.g. bitrates higher than 128 kbps, perceptual codecs ensure a remarkable level of
transparency, i.e. their output can hardly be distinguished from the original.


3.2 Digital Filters


The most general definition of a digital filter is: a computational algorithm that converts
one sequence of numbers into another [18]. Thus, any digital device with an input and
an output is a filter [44]. The advanced design of filters is beyond the aim of this thesis,
but the basics have to be understood for chapters 5 and 6, where digital
filters are implemented for specific purposes in sound analysis. In the following pages,
instead of "implement", a term specific to computing, we may prefer the word "design" to
convey more or less the same idea, that is, the way the filters are realized.
Digital filters have been part of integrated circuits since the 1950s. From the 60s, the so-
called Z-transform5 was introduced to standardize the mathematical representation of
filter behavior. In sound-synthesis programming languages, discussed in the next chapter,
digital filters appeared in the early 60s, with MUSIC IV6. Only later, in the 80s, when the
cost of hardware fell, did real-time digital filters take the leading role in widespread
low-cost applications such as synthesizers, effects units and digital mixers.
Historically, the most common use of filters, at least in computer music, was that of
boosting, attenuating or separating regions of the sound spectrum. All these operations
imply processing in the frequency domain. However, since filters also carry out other
important sound-processing techniques, such as reverberation and delay, the effect of
filtering should not be understood as frequency-domain only. As we will see very
soon, the time structure of the signal can also be altered by means of filtering operations.

3.2.1 Filters Background


The frequency response

All filters may be characterized by their frequency response. The well-known frequency
responses are: low-pass, high-pass, band-pass and band-reject.
The frequency response consists of two parts: the amplitude response, shown in figure
3.2 for the four basic types of filters, and the phase response. The amplitude response is the
ratio of the amplitude of the output signal to that of the input signal, as a function of
frequency. The phase response (also varying with frequency) is the amount of phase
alteration in the signal passing through the filter. Sometimes it is described in terms of
phase delay, that is, the amount of phase change from the original phase, expressed in ms.

5. The Z-transform converts a discrete time-domain signal, a sequence of real or complex numbers,
into a complex frequency-domain representation. See later in the text for more details.
6. The fourth release of the MUSIC family, developed by Max Mathews at Bell Labs.


Figure 3.2: Amplitude (A) response versus frequency for the four basic types of filters:
low-pass, high-pass, band-pass and band-reject.

Pass-band and stop-band, cutoff and center frequency


The pass-band or bandwidth of a filter is defined as the frequency region in which the
filter has no effect (or at most a little attenuation) on the signal, while the stop-
band is the frequency region where a strong attenuation is applied to the signal. In
all kinds of filters there is always a smooth transition between the pass-band and the
stop-band (and vice versa), normally called the transition band. The most
important characteristic associated with the transition band is the cutoff frequency fc.
Conventionally, fc is chosen as the frequency at which the transmitted power is reduced
to one half (i.e. -3 dB) of the maximum power in the pass-band.
In low-pass and high-pass realizations, the cutoff frequency determines the bandwidth
of the filter, i.e., the extent of the frequency range that the filter passes. In band-pass
and band-reject filters, the bandwidth is bounded by two cutoff frequencies, fu and fl,
which stand for the upper and lower limits of the band. Consequently, the center frequency
fo of a band-pass (and band-reject) filter is defined as fo = (fu + fl) / 2. These
characteristics are shown in figure 3.3 for the case of a band-pass filter. The rate at which
the attenuation increases in the stop-band is called the rolloff. In filters for musical
applications, the rolloff is frequently measured in dB/octave7 of decrement across the
transition band. The slope of the transition band is determined by the order8 of the filter.
In the analog counterpart, the order is determined by counting the electronic components
used to realize the filter; for digital filters it is somewhat more elaborate, as explained
later in the chapter.

Figure 3.3: The pass-band or bandwidth of a band-pass filter is the difference between
the upper and lower cutoff frequencies. The cutoff frequencies are defined as the frequencies
at which the energy (rather than, strictly speaking, the amplitude) falls to half of its
pass-band value, i.e. −3 dB. In the figure, 40 dB is assumed as the maximum level of
amplitude in the pass-band.

Selectivity and quality factor (Q)

The bandwidth of a band-pass filter is also called the selectivity of the filter, and is
used in quantifying the quality factor, Q.

7. An octave is the interval between two points where the frequency of the second point is twice the
frequency of the first.
8. The order is a mathematical measure of the filter’s complexity.


Figure 3.4: Example of application of a constant-Q filter. Here the center frequencies
are tuned around generic musical octaves. In music, an octave is the interval between
one musical pitch and another with half or double its frequency.

In filters for musical purposes, the Q can be interpreted as the resonance degree of a
band-pass filter [18] and is given by:

Q = fc / BW

Here, BW stands for bandwidth. Hence, a higher Q implies a narrower bandwidth. As
an example, a high Q is needed when the goal is to extract a single frequency component
from a signal. A solution to this task will be presented later in this thesis.
A special type of band-pass filter is the constant-Q filter, which is widely used in
sound analysis. This filter lets the bandwidth vary as a function of the center
frequency, while keeping its Q fixed. For example, with a fixed Q = 5 and fc = 300 Hz
the bandwidth is 300/5 = 60 Hz. Shifting the center frequency to around 6 kHz, the band-
width becomes much wider, 6000/5 = 1200 Hz. This type of filter has two fascinating
characteristics for musical applications. Since the energy of a sound is normally concen-
trated at low frequencies and spreads towards the high frequencies, a constant-Q band-pass
filter extracts a narrower bandwidth when its center frequency is set in the low-frequency
region, and a wider one as the center frequency moves toward high frequencies.
This is only the first; the second is that the bandwidths of this kind of filter, plotted
against log frequency, appear to be constant. This behavior, presented in figure 4.6,
is quite similar to the frequency perception of the human auditory system,
as we aimed to explain in the first part of chapter 2.
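The worked example above (Q = 5) can be reproduced in a few lines of C; this is only an illustrative sketch of the BW = fc/Q relation, not code from the project.

    #include <stdio.h>

    /* Bandwidth of a constant-Q band-pass filter: BW = fc / Q. */
    int main(void) {
        double Q = 5.0;
        double centers[] = { 300.0, 1000.0, 6000.0 };
        for (int i = 0; i < 3; i++)
            printf("fc = %6.0f Hz -> BW = %6.0f Hz\n", centers[i], centers[i] / Q);
        return 0;   /* prints 60 Hz, 200 Hz and 1200 Hz respectively */
    }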

Filter combination, parallel and cascade


The four basic types of filters can be combined to form more complex filter designs. Two
fundamental methods of combination are possible: parallel and cascade.


Parallel connection allows the filters to operate on the same signal at the same time.
The output signal is given by the sum of all the filters’ outputs; that means the
frequency response of a parallel connection is the sum of all the individual frequency responses.
For instance, a band-reject filter can be obtained by connecting a low-pass and a high-
pass filter in parallel. An interesting example of parallel connection is represented by
the constant-Q filterbank, which consists of an array of constant-Q band-pass filters that
separates the input signal into several components, each one carrying a single frequency
subband9 of the original signal. For musical purposes, these subbands are normally non-
overlapping and exponentially distributed, e.g. over the whole frequency range of human
hearing, between 20 Hz and 20 kHz. A special type of constant-Q filterbank is historically
represented by the octave filterbank; in particular, third-octave filterbanks
have also been standardized for use in audio analysis10 [12]. In a third-octave filter bank,
the center frequencies of the various bands are exponentially spaced along the frequency
axis, as described by the formula:

f_c[k] = 2^{k/3} \cdot 1000\ \mathrm{Hz}

where f_c[k] are the center frequencies of an array of k filters, the first of which is centered,
as an example, at f_c[0] = 1000 Hz. The bandwidth of the k-th filter is proportional
to the k-th center frequency, as the following formula states:

BW[k] = f_c[k] \cdot \frac{2^{1/3} - 1}{2^{1/6}}

Therefore, since the expression of the bandwidth contains the center frequency, the quality
factor Q[k] = f_c[k]/BW[k] is constant for all k filters.
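As an illustration of the two formulas above, the short NumPy sketch below (illustrative only) computes the third-octave center frequencies and bandwidths and verifies that the resulting Q is the same for every band:

import numpy as np

k = np.arange(-10, 11)                              # a few bands below and above 1 kHz
fc = 2.0 ** (k / 3.0) * 1000.0                      # exponentially spaced center frequencies
bw = fc * (2.0 ** (1.0 / 3.0) - 1.0) / 2.0 ** (1.0 / 6.0)
q = fc / bw                                         # identical for all k (about 4.3)

for fck, bwk, qk in zip(fc[:3], bw[:3], q[:3]):
    print(f"fc = {fck:8.1f} Hz   BW = {bwk:7.1f} Hz   Q = {qk:.2f}")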
Cascade connection, also called series connection, is the other way to connect filters
to each other. In this case the signal passes through a series of filters, one by one,
respecting the linking order. The direct consequence is that the overall amplitude
response becomes the product (thus the sum in dB) of the individual filter responses,
while the overall filter order becomes the sum of the individual filter orders [18]. For
instance, cascading two or more low-pass filters with the same frequency response makes
it easy to obtain a higher rolloff, i.e. greater attenuation around the crossover frequency.
Cascade connection of filters may be critical in some cases; for example, much care
must be taken when designing series of filters with different bandwidths. Each filter of
the cascade must guarantee that significant energy passes through at least a common
range of frequencies, otherwise the output could be inaudible.
9 A submultiple of the signal's bandwidth.
10 Third-octave filters are useful because they correlate well with the subjective response of the human ear.


Impulse response and time-domain effect of filtering


As well as being characterized by its frequency response, every filter has an impulse
response.
The impulse response is the time-domain description of the filter's response to very short
pulses, approximations of the mathematical unit impulse function (or Dirac delta11).
Studying the filter's response to a unit impulse is useful to determine the filter's
response to any short-time change of the input signal. Therefore, the impulse response
can be interpreted as the adaptive response of a filter, the one most related to the time-varying
characteristics of the signal. That is why filters are sometimes designed for a
specific impulse response instead of a frequency response. What the frequency
response shows us is the filter's response after a stable output is reached, that is,
after a long enough time has elapsed from the beginning of the filter's operation.
However, in order to understand the effect of filtering, a time-domain point of view is
equally important.
To introduce what follows, which includes music-related content, figure 3.5 is proposed
as an example. What we will say about sound refers, in the figure, to a pure
sinusoidal tone, the simplest sound produced by an oscillator12.
By definition, every sound can be delimited by the so-called attack and decay
portions, which roughly correspond to the sudden start and the softer end of a sound.
It has been shown that their role leads to one of the most important time-related features
of the human auditory system, the perception of events. We recall this observation
here because it will be important later in this thesis. Thus, an alternative
way to design a filter is to derive its impulse response, and one way to do this is to observe
how the filter reacts at the beginning and at the end of the signal passing
through it. These are two particular cases of the so-called transient response, which
introduces another fundamental aspect related to sound. Transients will be discussed in
more detail in chapter 6. However, a close relation exists between impulse and transient
response [18].
Still referring to figure 3.5, the transient response is evident, acting as a time-stretching
operation applied to the initial portion of the sine wave, and similarly at
the end. The transient response, at least in this simple case, can be associated with the
two time intervals required by the filter: the time needed to produce the steady-state output
after the attack transient occurs, and the time spent before the sound dies after the
decay transient. This behavior becomes very problematic when a large number of filters
11 The Dirac delta is defined as \delta(t) = \infty for t = 0 and \delta(t) = 0 otherwise, with the constraint \int_{-\infty}^{\infty} \delta(x)\, dx = 1.
12 The digital oscillator is the simplest sound generator in computer music. It represents the propagation (in the air) of a single sound wave at a certain frequency and amplitude. It is still present in current computer synthesis programs and was introduced in 1960 as the fundamental Unit Generator in Music 3 by Max Mathews.


Figure 3.5: Alteration of the envelope of a tone (INPUT) passed through a narrow filter
(OUTPUT). The output envelope has been stretched in time during onset and offset
components of the tone (initial and final portion).

are connected in cascade: since each one affects the time duration of the sound,
unwanted distortions can occur.

3.2.2 Introduction to Digital Audio Processing with Filters


In the first part of the chapter, we described the process by which a generic audio signal
is transformed into digital samples and stored on the hard disk of a digital computer.
Thus, the simplest use of the computer in an audio system is the digital
recording and reproduction of sound files. The next step is represented by all
the useful and/or aesthetic operations that can be performed on digital audio signals.
This is the vast area of digital audio processing, in which filters play a determinant role.
In digital audio processing, four fields of particular interest can be isolated because of
their extensive treatment in the literature: sound mixing ([44]), delays and effects ([63] for
all), sound spatialization and reverberation13 ([63] and [41][15][55] for implementation
examples) and sound modeling14 ([44][52][33]). Useful applications of sound processing
can be found in all widespread digital audio systems, e.g. in digital music players (mp3,
13 Sound spatialization is essentially the movement of sound through space. Dolby Digital Surround is one of the best known techniques.
14 An example of sound modeling is generating sound by applying physical knowledge.


CD, MiniDisc, etc.). Compressors, limiters, expanders, noise gates and noise reduction are
only a few examples of the sound processing treatments normally applied to the dynamic
range of the music we hear, coming from almost every digital music medium.
Digital filters assume a primary role in nearly all sound processing applications. Having
quickly reviewed classical filter theory in the previous sections, the hard task
is now to transpose some of those concepts to the discrete world of quantized
samples.
For that purpose, a general and expressive definition coming from LTI system theory
will be adapted to the case of digital filters, but some clarifications must
precede it. Systems that do not change their behavior in time and fulfill the superposition
property15 are called linear time-invariant (LTI); their most important property is that
they can be completely characterized by their impulse response, i.e. their behavior for a
short impulse input. Hence, the general definition we will adopt for explicit filter
realizations is the following: the output signal of an LTI system is given by the convolution
of the impulse response with the input signal.
The assumption here, therefore, is to consider filters as LTI systems.

Impulse Response
The general definition of the impulse response of a filter is the response of the filter when
fed with a short pulse. The short pulse can be considered a test signal, through which
the characteristics of the filter are revealed. The common test signal used for LTI systems,
such as filters, is the unit impulse, defined as:

UI(t) = \begin{cases} 1, & t = 0 \\ 0, & \text{otherwise} \end{cases}

In the case of discrete systems, such as digital filters, the unit impulse is obtained by substituting
t with n and enclosing the sample index in brackets [·], for unambiguity.
Therefore, for discrete LTI systems, the unit impulse can be rewritten as follows:

UI[n] = \begin{cases} 1, & n = 0 \\ 0, & \text{otherwise} \end{cases}

which can be seen as a one-sample impulse. In digital terms, the briefest possible signal
(the approximation of the unit impulse) is exactly a single sample, which contains energy
15 In the case of filters, the superposition property states: when two signals are added together and fed to the filter, the filter's output is the same as if the two signals were put through the filter separately and their outputs then added.


at all the frequencies below Nyquist16. By definition, the output signal of a filter fed
with the unit impulse is the impulse response, henceforth simply called IR.
Since the unit impulse contains all the frequencies of the signal, the IR
can also be seen as the time-domain representation of the amplitude-versus-frequency
response, earlier presented as the frequency response. The bridge between the two
domains is the convolution.

Convolution
Convolution is a generic signal processing operation, like addition or multiplication, but
it is of much greater interest because convolving two signals in the time domain is equivalent to
multiplying them in the frequency domain. That is why the convolution operation is considered
the bridge between the two domains. Convolution is a fundamental operation in digital
audio processing as well as in filters. Let's see how it works, starting from the formula
representing the previous definition given for LTI systems, now stated for the case
of filters: the output signal y[n] of every digital filter is given by the convolution of the
impulse response of the filter with the input signal x[n]:

y[n] = x[n] * h[n]

where * denotes convolution and h[n] is the impulse response. When the impulse response
is the one-sample unit impulse, convolution proves to be an identity operation:

y[n] = x[n] * UI[n] = x[n]

That is, every function convolved with the unit impulse remains the same.
While speaking of convolution in terms of signal processing, two other properties deserve
attention. Convolving the input signal with a scaled version of the unit impulse gives:

y[n] = x[n] * (c \cdot UI[n]) = c \cdot x[n]

and convolving the input signal with a delayed copy of the unit impulse, i.e. a time-shifted one, gives:

y[n] = x[n] * UI[n - t] = x[n - t]

16 According to Fourier theory, an inverse relationship exists between the duration of a signal and its frequency content: the shorter the signal, the wider the spectrum.


Figure 3.6: Echo and reverberation effects explained by convolution.

That is, the result of the convolution between the input signal and a scaled or time-shifted
unit impulse is the same as scaling or time-shifting the input signal. Consequently,
any input signal can be represented by a sequence of scaled and delayed unit impulse
functions. Moreover, easily recognizable effects in sound systems, such as echo and
reverberation, can be recreated by really simple but appropriate designs of the IR function, as
shown in figure 3.6.
In the case of the reverberation effect, shown on the right side of figure 3.6, the time-smearing17
effect occurs when the two time-shifted unit impulse functions are too
close, relative to the duration of the sounds. Thus the first sound cannot be separated
from its following replica. Such effects, when dense, assume the form of reverberation.
The law of convolution, applied to computer music, states that the convolution of
two waveforms18 in the time domain is equal to the multiplication of the two spectra
in the frequency domain. This is a fundamental concept in sound processing techniques,
because any transformation applied to a sound in the time domain has a direct
correspondence in the frequency domain, and vice versa.
Finally, the mathematical definition of discrete convolution, applied over two generic

17 See chapter 3. Time-smearing is a phenomenon which occurs when two sounds close in time cannot be separated by the time resolution of the ear.
18 From chapter 3 onwards, waveform is used to denote the analog sound signal in the time domain.


finite-length input signals a[n_1] and b[n_2], is:

a[n_1] * b[n_2] = \sum_{m=0}^{n_1 - 1} a[m] \cdot b[n_2 - m] = y[k]

To enhance the analogy with filters, the formula may be interpreted this way: a[n_1]
acts as a weighting function (such as the IR) for each delayed copy of b[n_2] (i.e. the
input signal). The result of the operation, y[k], is k samples long, with:

k = length(a[n]) + length(b[n]) - 1

This way of computing convolution, one sum for each value of k, is called direct convolution.
The direct form is computationally intensive, requiring N^2 operations, where N is the
length of the longer of the two inputs. A faster way to implement convolution on
digital computers exists: it works with the FFT19 algorithm, applied to both
convolution operands; the resulting spectra are multiplied, and the product is finally brought back
to the time domain through the IFFT20 algorithm. This fast convolution
drastically reduces the computational complexity to N log N operations.
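The following sketch (NumPy assumed; the function names are only illustrative) contrasts the direct form of convolution with the FFT-based fast convolution described above; both produce a result of length len(a) + len(b) − 1 and agree up to rounding error:

import numpy as np

def direct_convolution(a, b):
    """Direct form: one sum per output sample, O(N^2) operations."""
    k = len(a) + len(b) - 1
    y = np.zeros(k)
    for n in range(k):
        for m in range(len(a)):
            if 0 <= n - m < len(b):
                y[n] += a[m] * b[n - m]
    return y

def fft_convolution(a, b):
    """Fast form: multiply the two spectra, then return to the time domain, O(N log N)."""
    k = len(a) + len(b) - 1
    return np.fft.irfft(np.fft.rfft(a, n=k) * np.fft.rfft(b, n=k), n=k)

x = np.random.randn(256)                 # input signal
h = np.array([0.5, 0.5])                 # a two-tap impulse response
print(np.allclose(direct_convolution(x, h), fft_convolution(x, h)))   # True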

Transfer function and frequency response


The frequency-domain description of a digital filter reflects its ability to pass, reject or
enhance certain frequencies contained in the input signal spectrum. The common terms
used to describe the characteristics of filters in the frequency domain are the transfer
function H(z) and the frequency response H(f). Both can be obtained by means of
mathematical transforms applied to the impulse response. Since the transfer function
is considered a useful tool when designing filters, especially in the electronics literature, the
Z-transform must be introduced. Here is the Z-transform definition of a generic
digital signal x[n]:

X[z] = \sum_{n=-\infty}^{\infty} x[n] \cdot z^{-n}

which is related to the Discrete Fourier Transform, introduced later, by the substitution
z = e^{j\omega}. The transfer function is obtained by applying the Z-transform to the IR h[n]
of the filter, as follows:
19 Fast Fourier Transform
20 Inverse Fast Fourier Transform



H[z] = \sum_{n=-\infty}^{\infty} h[n] \cdot z^{-n}

while the frequency response can be achieved by applying the DFT to the IR of the filter.
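As a small illustration of the last statement (a sketch only, NumPy assumed), the frequency response of a filter can be obtained numerically by taking the DFT of its impulse response; here this is done for a simple two-tap averaging IR, the first-order low-pass discussed later in section 3.2.4:

import numpy as np

h = np.array([0.5, 0.5])                   # impulse response of y[n] = 0.5*(x[n] + x[n-1])
N = 512                                    # number of frequency points (zero-padded DFT)
H = np.fft.rfft(h, n=N)                    # frequency response, complex valued

fs = 44100
freqs = np.fft.rfftfreq(N, d=1.0 / fs)     # frequency axis in Hz
magnitude_db = 20 * np.log10(np.abs(H) + 1e-12)
# magnitude_db is 0 dB at DC and falls to a deep notch at the Nyquist frequency,
# confirming the low-pass behavior of this impulse response.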

3.2.3 Digital implementation of filters


Traditionally, digital filter realizations have been classified into two large families:
• non-recursive filters
• recursive filters
These names come from the nature of the algorithms used to realize those filters. From the
transfer-function point of view, these two categories can be reformulated as follows:
• filters whose transfer function has no denominator
• filters whose transfer function has a denominator
In both cases, every output sample is calculated as a combination of previous input
and/or output samples. Respecting the order of the previous classification, the possible
combinations are:
• the current input sample with past input samples
• past output samples together with present (and sometimes past) input samples
Since digital filters base their output on combinations of past input/output samples, they
imply a concept of "memory". In computers, the first realization is easier than
the second. The concept of delay line21 should be introduced here: a generic scheme used to
represent a delay line is presented in figure 3.7. The amount of delay required depends on
the memory needed for the specific filter design; this delay determines the
number of memory cells dedicated to storage of the delayed samples. As an obvious
consequence, the storage space required is greater and the computational cost higher
when many past samples are taken into account.
Finally, since the two possible realizations of filters depend on the nature of their IR,
the last and most commonly used subdivision of filters is the following:
• Finite Impulse Response filter (FIR filter)
• Infinite Impulse Response filter (IIR filter)
21 A recirculating memory unit, whose purpose is to delay the incoming signal by an established number of samples. See [41] for more on delay lines and their digital implementation.


Figure 3.7: Simple delay line.

3.2.4 FIR Filters


In a FIR filter, the response due to an impulse input decays within a finite time.
Conversely, in IIR filter realizations, the impulse response theoretically never dies out. In
comparison, FIR filters are easier to implement but slower than IIR filters; though IIR
filters are fast, their practical implementation is somewhat harder than that of FIR filters.
A FIR filter is a linear combination of a finite number of samples of the input signal:

y[n] = \sum_{m=0}^{N} h[m] \cdot x[n-m]

In the equation above, the convolution formula given earlier can easily be recognized.
Here h[m] is the finite impulse response, typical of FIR filter realizations. The time
extension of the impulse response determines the length of the filter, which is N + 1.
As introduced above, the transfer function can be obtained by applying the Z-transform
to the impulse response, which results in:

H[z] = \sum_{m=0}^{N} h[m] \cdot z^{-m} = h[0] + h[1] \cdot z^{-1} + h[2] \cdot z^{-2} + \ldots + h[N] \cdot z^{-N}

The simplest example of a FIR filter is the first-order low-pass filter, which takes into
account only the previous input sample. The formula of this kind of filter is the
following:

y[n] = 0.5\,(x[n] + x[n-1])


To obtain a high-pass filter, again of the first order, we simply change the sign of the
second term:

y[n] = 0.5\,(x[n] - x[n-1])

The general formula for FIR filters follows:

y[n] = a_0 \cdot x[n] \pm a_1 \cdot x[n-1] \pm \ldots \pm a_i \cdot x[n-i]
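A minimal sketch (NumPy assumed, names purely illustrative) of the two first-order FIR filters above; the general routine implements the FIR formula as a direct convolution with the coefficients a_i:

import numpy as np

def fir_filter(x, coeffs):
    """General FIR filter: y[n] = sum_m coeffs[m] * x[n - m]."""
    y = np.zeros_like(x)
    for m, a in enumerate(coeffs):
        y[m:] += a * x[:len(x) - m]
    return y

fs = 44100
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 20000 * t)   # low tone + high tone

low = fir_filter(x, [0.5, 0.5])     # low-pass: the 20 kHz component is strongly attenuated
high = fir_filter(x, [0.5, -0.5])   # high-pass: the 100 Hz component is strongly attenuated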

In order to run an N order FIR filter we need to have, at any instant, the current
input sample together with the sequence of the N preceding samples. These N samples
constitute the memory of the filter. In practical implementations, it is customary to
allocate the memory in contiguous cells of the data memory or, in any case, in locations
that can be easily accessed sequentially. At every sampling instant, the state must be
updated in such a way that x(k) becomes x(k + 1), and this seems to imply a shift of
N data words in the filter memory. Indeed, instead of moving data, it is convenient to
move the indexes that access the data.
As an example, three memory words are put in an area organized as a circular
buffer (see figure 3.8). The input is written to the word pointed to by the index, and the
three preceding values of the input are read with the three preceding values of the index.
At every sample instant, the four indexes are incremented by one, with the trick of
restarting from location 0 whenever the length M of the buffer is exceeded (this ensures
the circularity of the buffer). The counterclockwise arrow indicates the direction taken
by the indexes, while the clockwise arrow indicates the movement that the data would
have to make if the indexes stayed in a fixed position.
As a matter of fact, an FIR filter contains a delay line since it stores N consecutive
samples of the input sequence and uses each of them with a delay of N samples at most.
The points where the circular buffer is read are called taps and the whole structure is
called a tapped delay line.
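The sketch below (plain Python; the class name is hypothetical) shows the idea of the tapped delay line: the filter memory is a circular buffer and, instead of shifting the stored samples, only the write index is moved:

class TappedDelayLineFIR:
    def __init__(self, coeffs):
        self.coeffs = list(coeffs)            # a0, a1, ..., aN
        self.buffer = [0.0] * len(coeffs)     # circular buffer with the last N+1 input samples
        self.write = 0                        # index where the newest sample is written

    def process(self, x):
        """Push one input sample, return one output sample."""
        self.buffer[self.write] = x
        y = 0.0
        for m, a in enumerate(self.coeffs):
            # tap m reads the sample written m steps ago, wrapping around the buffer
            y += a * self.buffer[(self.write - m) % len(self.buffer)]
        self.write = (self.write + 1) % len(self.buffer)
        return y

lp = TappedDelayLineFIR([0.5, 0.5])                  # the first-order low-pass seen above
print([lp.process(v) for v in (1.0, 0.0, 0.0)])      # impulse response: [0.5, 0.5, 0.0]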

3.2.5 IIR Filters


The filters of the second family admit only recursive realizations; thus the impulse re-
sponse of these filters is infinitely long, justifying their name, Infinite Impulse Response
(IIR) filters. In general, an IIR filter is represented by a difference equation where the
output signal at a given instant is obtained as a linear combination of samples of the
input and output signals at previous time instants. The simplest, nontrivial, IIR filter
that can be conceived: the one-pole filter having coefficients a1 = 1/2 and b0 = 1/2, is
defined by:
y[n] = 0.5(y[n − 1] + x[n])


Figure 3.8: Circular buffer.

and the transfer function of this filter is:

H[z] = \frac{1/2}{1 - \frac{1}{2} z^{-1}}

Due to the advanced forms in which digital filters can be designed, the result obtained
can be even more precise than with the analog counterpart.
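A tiny sketch (plain Python, illustrative) of the one-pole filter defined above makes the "infinite" impulse response visible: fed with a unit impulse, the output keeps halving and never reaches exactly zero.

def one_pole(x):
    """y[n] = 0.5*(y[n-1] + x[n]): recursion on the previous output sample."""
    y, y_prev = [], 0.0
    for v in x:
        y_prev = 0.5 * (y_prev + v)
        y.append(y_prev)
    return y

impulse = [1.0] + [0.0] * 7
print(one_pole(impulse))   # [0.5, 0.25, 0.125, 0.0625, ...] decays forever, never dies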

Chapter 4

...To Sound Spectrum Analysis


4.1 Introduction to Sound Analysis in the Frequency Domain

In this chapter, when we speak of an analog signal we will usually refer to waveforms.
The spectral analysis of musical sounds is the legacy of Fourier analysis. Although
other methods have been explored, Fourier concepts are still applied in virtually every digital
sound application. The soundness of Fourier analysis resides in its representation, which
highlights characteristics similar to those attributed to the auditory system by psychoacoustic knowledge.
We mentioned before, at the end of chapter 2, that all the operations that can be performed
on the sampled signal serve goals coming from the necessity to
obtain a different output (i.e. sound processing, in which filters have a dominant role),
to generate output from nothing (i.e. synthesis), or to analyze the digital samples to
estimate some of the sound's characteristics. The principal efforts in sound analysis are
pitch recognition, timbre perception, rhythm recognition and bpm extraction. In some
cases analysis is only the first stage in order to perform reconstruction of the original
signal, that is, resynthesis. Audio synthesis is definitively different from the above,
because it is the process of generating sounds, and this can be obtained in so many ways
that a suitable discussion in this thesis would be too long.
In general, in digital music system, we can distinguish between:

• Audio synthesis, the process of generating streams of audio samples by algorithmic
means.

• Audio analysis, which takes a digital signal (leaving the stream of samples unaltered)
and mathematically determines its characteristics.

These systems are represented in figure 4.1.

Digital audio synthesis


Audio synthesis is the other great challenge in digital audio. Since no sound
is generated by synthesis for the purpose of this thesis, we will treat the argument only
marginally.
Everything started with the necessity of letting the computer speak in a human (and not
humanoid) way. Building on psychoacoustics and vocoder techniques1, several studies made it
possible to implement the physical knowledge behind the human auditory system. Hence the
idea: why not try to replicate with the computer the most sophisticated sound produced by
the human body, speech?
1 A phase vocoder is a type of vocoder which can scale both the frequency and time domains of audio signals by using phase information. The computer algorithm allows frequency-domain modifications to a digital sound file (typically time expansion/compression and pitch shifting).


Sound synthesis made by computer starts in 1957 with Max Mathews' MUSIC 1.
With MUSIC 3, in 1960, the concept of the Unit Generator (UG) was introduced: the simplest
instrument for the computer and the greatest change in the computer sound
programmer's approach. With a UG one can create a sine wave to produce an oscillator;
with logical and arithmetic UGs one can multiply two oscillators to produce another sound,
design a filter's frequency and impulse response, combine filters with oscillators to create
new, more complex sounds, and so on, to infinity. The UGs so created quickly increased
in complexity, in parallel with the rapid rise of microelectronics, becoming one of the typical
features of most music programming languages. With the consequent development of
faster algorithms, music synthesis has been widely extended into many research areas.
Curtis Roads says about synthesis:
After Max Mathews in 1957, dozens of sound synthesis techniques have been invented.
As in the field of computer graphics, it is difficult to say at any time which techniques will
flourish and which will fade over time. This situation is fueled by competitive pressure in
the music industry, making it inevitable that synthesis methods fall in and out of fashion,
because no one of these methods can satisfy [44].
Just as a reminder, some of the synthesis methods are reported here, in no precise order:

• wavetable synthesis

• sampling synthesis

• additive synthesis

• subtractive synthesis

• sinusoidal synthesis

• granular synthesis

• modulation synthesis

• physical modeling synthesis

• formant synthesis

• residual synthesis

• graphic synthesis

• stochastic synthesis


Figure 4.1: Digital sound synthesis and sound analysis. (Block diagram: for analysis, the
analog audio input passes through a low-pass filter and an ADC into the computer, which
extracts loudness, pitch, rhythm, bpm, etc.; for synthesis, the computer generates samples
from algorithms, wavetables and oscillators, which pass through a DAC and a low-pass
filter to the analog audio output.)

Digital audio analysis


Any sound can be interchangeably represented in the time domain by a waveform or in
the frequency domain by a set of spectra [54]. Thus, in the following pages waveform is
the term adopted for the audio signal in the time domain. Three main aspects are treated in
analyzing sounds: pitch recognition, rhythm detection and spectrum analysis (also helpful
for the other two). A lot of synthesis techniques are based on data produced by analyzing
sound; in these cases we speak of analysis/synthesis or analysis/resynthesis techniques.
To understand this purpose, imagine a peak detector applied to a musical flow, and think
of using the frequencies and amplitudes of every detected peak (the product of the analysis)
to drive digital oscillators.

4.2 Introduction to the Fourier Analysis


Spectral modelling techniques are the legacy of the Fourier analysis theory. Originally
developed in the nineteenth century, Fourier analysis considers that a pitched sound is
made up of various sinusoidal components, where the frequencies of higher components
are integral multiples of the frequency of the lowest component. The pitch of a musical
note is then assumed to be determined by the lowest component, normally referred to as
the fundamental frequency. In this case, timbre is the result of the presence of specific
components and their relative amplitudes, as if it were the result of a chord over a
prominently loud fundamental with notes played at different volumes. Despite the fact


that not all interesting musical sounds have a clear pitch and the pitch of a sound may
not necessarily correspond to the lower component of its spectrum, Fourier analysis still
constitutes one of the pillars of acoustics and music. [36]

Origin
Since Jean Baptiste Joseph, Baron de Fourier, published his revolutionary theory in 1822,
the events that made its history can be traced in rapid succession:

• 1870 first mechanical harmonic analyzer,

• 1898 first mechanical harmonic analyzer that could be reversed into a waveform
synthesizer,

• 1930 the advent of analog filters made spectrum analysis possible,

• 1940 first digital implementations on computers,

• 1960 the advent of the FFT algorithm enormously reduced the calculations needed to
compute the Fourier transform,

• 1977 advent of the STFT, the short-time Fourier transform, widely used in music systems.

What Fourier stated, in a few mathematical words, was that a complex but periodic
signal can be seen as a sum of simple signals. In a musical context this is understood
as a periodic waveform that can be decomposed into a combination of simple sinusoidal
waves, each one with its own amplitude, frequency and phase. On digital computers
a sine wave is generated by an oscillator (the first UG was an oscillator) able to produce
a sine wave with only three parameters: amplitude, frequency and phase. In
engineering mathematics an oscillator is normally expressed in another form through
Euler's relations, which allow sine and cosine functions to be expressed by means of complex
exponentials.
In this chapter we will introduce a particular Fourier-based analysis and synthesis
system, called the short-time Fourier transform, STFT, due to Allen and Rabiner (1977).
This is a very general technique, useful in the study of time-varying signals such as musical
sounds, that can be used as the basis for more specialized techniques. In the following
chapters the STFT is accounted as the basis for several analysis/synthesis systems.
In musical contexts, Fourier Transform is applied to analog signals (FT) having a
limited bandwidth or to a finite number of digital samples (DFT or STFT).
We can summarize here the techniques used to compute FT over analog and digital
input signals:

• FT, time-continuous signal input, frequency-continuous spectrum output.


• DTFT, time-discrete signal input, frequency-continuous spectrum output.


• DFT, time-discrete input signal, frequency-discrete spectrum output.
• STFT, input signal analyzed over short time windows, time-varying
frequency-discrete spectrum output.

4.2.1 Fourier Transform (FT), Classic Formulation


In signal processing, the Fourier Transform is intended as the mathematical transforma-
tion by which a time-domain signal is converted into its frequency spectrum. Usually, the
result of the FT is just called the spectrum of the input signal. In its original formulation,
the FT extends the spectrum of the signal to the whole frequency range, from 0 to ∞.
Here is the definition:

X(\omega) = \int_{-\infty}^{\infty} x(t) \cdot e^{-j\omega t}\, dt

where x(t) is a generic waveform and t and ω are the continuous time index and the
continuous frequency index. ω is the angular frequency, expressed in radians per second;
the simple relationship with the corresponding frequency in Hz is f = ω/(2π).
The FT also allows another interpretation, more interesting in a musical context,
that is, the decomposition of the waveform into an infinite number of sinusoidal
components.
The result of the FT is a complex value X(ω) for every value of ω, and X(ω) as a whole is
usually considered the spectrum of x(t). Each complex value, expressed in the
form (a + jb), with a and b the real and imaginary parts, reveals the three fundamental
components of a sinusoid: frequency, amplitude and phase. Obviously, ω is the frequency
and the other two can be computed with the following simple formulas:

\text{amplitude:}\quad |X(\omega)| = \sqrt{a^2 + b^2}

\text{phase:}\quad \arg[X(\omega)] = \arctan\left(\frac{b}{a}\right)

X(ω), again the whole spectrum of x(t), completely characterizes the signal, and
the original x(t) can be reconstructed by means of the Inverse Fourier Transform,
defined as follows:

x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega) \cdot e^{j\omega t}\, d\omega

The Fourier Transform applies only to time-continuous signals, e.g. waveforms.
Let's now see how things work with digital signals.


4.2.2 Discrete Fourier Transform (DFT)

Figure 4.2: Two plots of static spectra. The image represents the SPL against frequency
of a drum hit played by a robot (on the left) and of a note of a violin (on the
right). The difference is noticeable: while the robot hit has apparently no harmonically
related frequency components, in the violin note these are clear.

In digital computers, waveforms are transformed into discrete samples by means of
sampling; therefore in this case the Discrete Fourier Transform is computed in place
of the FT. A signal that has a discrete-valued representation in the frequency domain is a
periodic signal, meaning that its spectrum shows isolated spectral lines.
The DFT formula can be written as follows:
X[k] = \sum_{n=-N/2}^{N/2-1} x[n] \cdot e^{-j\omega_k n}

where x[n] is the n-th value of a discrete-time signal N samples long (which is why the
sum in the formula runs from −N/2 to N/2 − 1). \omega_k = 2\pi \cdot (k/N) is the discrete
angular frequency, k is an integer going from 0 to N − 1 and N must be chosen
even.
While X[k] is called the discrete spectrum, the k-th discrete frequency sample X[k]
is called the k-th frequency bin. In the DFT, the relationship between the discrete angular
frequency and the frequency in Hz is:

f = f_s \cdot \frac{\omega_k}{2\pi}

where f_s = 1/T is the sampling frequency and T the period between samples.


Due to the discrete values of k, the DFT assumes that x[n] can be represented by a finite
number of sinusoids; this means that the signal x[n] is band-limited in frequency. Moreover,
the frequencies of the sinusoids are equally distributed between 0 Hz and the sampling
rate fs, or, in radians, between 0 and 2π.
The DFT internally masks the frequency-domain sampling operation, because there
is a direct correspondence between the number of input samples and the number of
output frequencies.
The inverse DFT is defined as:
x[n] = \frac{1}{N} \sum_{k=-N/2}^{N/2-1} X[k] \cdot e^{j\omega_k n}

There is a computationally faster version of the DFT, called the FFT. The algorithm
used to compute the FFT substitutes complex products with weighted
sums so that the computational cost is reduced from N^2 to N · log N. This is still
the most widely used technique for implementing the DFT on digital computers,
especially where real-time DFT is needed or memory space is critical (i.e.
on chips).
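The following sketch (NumPy assumed, names illustrative) evaluates the DFT exactly as in the summation above and checks it against the FFT, which returns the same bins at a fraction of the cost:

import numpy as np

def direct_dft(x):
    """X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N), evaluated directly: O(N^2)."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return (x * np.exp(-2j * np.pi * k * n / N)).sum(axis=1)

fs, N = 44100, 1024
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 441 * t)                    # a 441 Hz tone

print(np.allclose(direct_dft(x), np.fft.fft(x)))   # True: same spectrum, different cost
peak = np.argmax(np.abs(np.fft.fft(x)[:N // 2]))
print(peak * fs / N)                               # about 431 Hz: 441 Hz quantized to the bin grid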
Unfortunately both the FT and the DFT work well only for periodic signals: in music, only a
steady note coming from a tuned musical instrument can be treated as a periodic
waveform, while most sounds are non-periodic, time-varying waveforms.
So let us now introduce the most widely used Fourier technique for musical purposes on digital
computers, the Short Time Fourier Transform.


this page is intentionally left blank,

take a breath.


4.3 The Short Time Fourier Transform (STFT)


One of the main problems with the original Fourier transform theory is that it does not
take into account that the components of a sound spectrum vary substantially during
its course. In this case, the result of the analysis of a sound of, for example, 5 minutes’
duration would inform the various components of the spectrum but would not inform
when and how they developed in time.[36]
The Short Time Fourier Transform represents a solution to this problem. It splits
the sound into short-time segments by performing an operation called windowing, and
sequentially analyses each segment. Normally the FFT technique is applied to
compute the FT of each windowed portion, because its computational cost
is much lower. The reasons for the wide use of this technique can be summarized
as follows: the spectrum derives from a sequence of individual analysis windows that
can trace the time evolution of the sound. Thus the spectrum can be seen as a set of
spectra equally spaced in time, one for each windowed portion of the waveform, giving a
more sophisticated and convenient representation than the DFT. This can be seen in figure
4.4. The STFT is also a focal point in sound analysis, because a time-varying
spectrum is closer to the behavior of the human auditory system and can therefore be helpful
in determining perceptual attributes, above all pitch and timbre.
How short should the short-time segment be? Typically less than 1/10 of a second.

Figure 4.3: Basic operation of the STFT used for sound analysis (the waveform is
multiplied by a rectangular window, passed through the FFT and converted to polar
coordinates, yielding magnitude and phase spectra).

The STFT operation over a waveform can be interpreted in two ways:

• windowed DFT, where the DFT (FFT) is computed over each windowed segment;
windows may overlap

• filterbank view, a bank of bandpass filters equally spaced across the frequency
domain (i.e. from 0Hz to Nyquist frequency)


Windowed DFT (realized by FFT)


Windowing means that the incoming signal is segmented into temporal windows, each of which
has the same duration, although windows may overlap. Then the segmented portions of the signal
are analyzed separately with the DFT (or FFT). The general formula of the STFT is:

X[n,k] = \sum_{m=-\infty}^{\infty} x[m]\, h[n-m]\, e^{-j\omega_k m}

where the output X[n,k] is the DFT of the windowed input at each discrete time
n for each discrete frequency bin k. h[n − m] is the time-shifting window function that
follows the signal. m, in the general formulation, can vary from −∞ to +∞ but can be
restricted to the appropriate length of the window. N is the number of points in the
spectrum.
The angular frequency associated with each bin k is:

\omega_k = \frac{2\pi k}{N}

corresponding to a frequency in Hz of f_k = k \cdot f_s / N.

Another formulation of the STFT is that of X. Serra, where H, the hop size, is the time
advance of the incoming signal, replacing the time-shifting window function.
It is again a function of two variables; the definition follows:

X[l,k] = \sum_{n=0}^{N-1} w[n]\, x[n + lH]\, e^{-j\omega_k n}

Here w[n] is a real window, l indicates the frame passing through the window and the
exponential function is the same as before. X[l,k], the spectrum, is the DFT of the sequence
w[n] x[n + lH] for 0 ≤ n ≤ N − 1. The spectrum is computed at every frame l,
advancing by H along the input signal x[n].
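A compact sketch (NumPy assumed; the function name and parameter values are only illustrative) of the STFT in the X[l,k] form above: one FFT per windowed frame, with the frame advanced by the hop size H.

import numpy as np

def stft(x, N=1024, H=256, window=None):
    """Return a (frames x bins) matrix of complex short-time spectra."""
    if window is None:
        window = np.hanning(N)                     # w[n], a real window
    frames = []
    for start in range(0, len(x) - N + 1, H):      # frame l starts at sample l*H
        frames.append(np.fft.rfft(window * x[start:start + N]))   # DFT of w[n]*x[n+lH]
    return np.array(frames)

fs = 44100
t = np.arange(fs) / fs
X = stft(np.sin(2 * np.pi * 440 * t))
print(X.shape)                                     # (number of frames, N/2 + 1 frequency bins)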

4.3.1 The Filterbank View


The other possible view of the STFT is that of a group of band-pass filters
(a filterbank2), equally spaced so as to cover the whole frequency band of the input
signal. This method is similar to the one used in spectrum equalizers, in which the shape
of the spectrum can be modeled by the user by controlling the level of each filter. Here,
however, all filters have the same bandwidth and the center frequencies are equally spaced
up to Nyquist.
2 See chapter 3 for more details on filterbanks.


Figure 4.4: Waterfall spectrum, a 3D representation of the STFT spectrum. The graph
was obtained with the Spectutils package for GNU Octave. The analysis parameters of the
STFT are shown above the figure; the audio sample analyzed is extracted from Laurie
Anderson's Violin Solo.

Interpreted this way, the STFT can be seen as a filterbank which performs analysis
in parallel on each windowed segment of the input signal. For every frame of the input
signal, the complex values returned by the n filters describe n sinusoids. The filterbank
view was the basis for the phase vocoder analysis/resynthesis technique, and inspired the
constant-Q method proposed later. The filterbank view is therefore an abstraction, used
when computing the STFT with a programming language on a digital computer.
The STFT output (in both views) is a series of spectra, one for each frame of
the input signal. Each spectrum has a real and an imaginary part, which can easily be converted
into magnitude and phase values. In the filterbank view, frequency plays a more important
role than phase. The instantaneous frequency is therefore calculated by converting the phase
value through the method of phase unwrapping3, to obtain a sinusoid with the
3 Phase unwrapping ensures that all appropriate multiples of 2π have been included in Θ(ω).


frequency thus derived and the magnitude given by the spectrum.


Some steps are necessary in the calculation of the STFT on digital computers. They
are presented in the following sections.

4.3.2 Windowing: Length and Shape of the Window Function


The meaning of this operation is that every input waveform must be time-limited (windowed)
in order to calculate its digital FT. The windowing operation is so called because
it consists in multiplying the input waveform by a window function, which
allows us to extract values from a segment (frame) of the waveform, depending on the
window function and window length. Since multiplication in the time domain corresponds
to convolution in the frequency domain, the product operation inside the Fourier transform
gives a resulting spectrum that is the convolution of the spectrum of the
waveform with the spectrum of the window.
The window is a mathematical function which has non-zero values over a limited time
range. The simplest one is the rectangular window, which is 1 inside the window length
and 0 outside.
The choice of window length is very important because it determines the frequency
resolution and the time resolution of the analysis. In the filterbank view of the STFT,
the frequency resolution is the frequency band of each band-pass filter. It can be derived
from the ratio fs/N, with N the number of samples in the window; for example, for fs = 44.1 kHz and 1024
samples per window we obtain a frequency resolution of about 43 Hz. Thus, in this
case the waveform is decomposed into 1024 sinusoids having frequencies that are integer multiples
of 43 Hz (harmonics). The analysis of the frequencies in between the analysis bands is
obviously just as important, but this is beyond the scope of this thesis. We recommend
[3] for further reading.
The shape is the other most important criterion when choosing the appropriate
window function. All the standard windows adopted for computing the STFT of waveforms
are real and symmetric functions and their spectra look like the sinc function sin(t)/t. Figure
4.5 presents the most commonly used window functions and their spectra. In
the spectrum of a window we can see two features: the main lobe and
the side lobes. The width of the main lobe, defined as the number of bins it extends across,
determines the frequency resolution, i.e. a narrower lobe allows better resolution. The
attenuation of the side lobes with respect to the main one, that is, the difference in dB between
the height of the main lobe and the height of the adjacent side lobe, determines the level of
cross-talk interference between two adjacent analysis bands. Typically, reducing the side lobes
is reflected in an increase of the width of the main lobe, so a compromise must be chosen.
A generic rule could be the following: when the waveform is mainly composed of a
distinct number of sine waves a narrow main lobe is preferable; when the waveform is
made of noise-like components a wide main lobe is preferred.


Figure 4.5: Types of windows used in the STFT for audio analysis. No ideal window exists;
the term "optimal window" is preferred. Several types of windows are used; for musical
purposes the Kaiser window is usually preferred.

Another point in the choice of the window length is whether it should be odd or even.
For phase detection a zero-phase window is better because the windowing process does not
modify the phase of the analyzed waveform. Therefore an odd window length is preferred,
with the middle sample centered at the time origin of the analysis window.
There are many standard window functions used for STFT purposes:

• rectangular

• Hamming

• Hanning or Hann

• Gaussian

• Blackman

• Blackman-Harris

• Kaiser

Gaussian, Hamming and Kaiser are the most often used. The Kaiser window has characteristics
which are well suited to the musical context.


4.3.3 Computation of the DFT (via FFT)


The discrete spectrum of each portion of the windowed waveform can now be calculated
using the DFT. In practice, when possible, the FFT algorithm is used for this purpose.
The implementation of the FFT algorithm is discussed below.

FFT Size, Zero Padding


The problem with the FFT is that it requires the analyzed signal to be N samples long, with N
a power of two, called the FFT size. Since the signal length is fixed
by the window length, and this is chosen to obtain a desired frequency resolution (and
can also be variable, i.e. expanded for high frequencies), the analysis window hardly ever
fits the FFT size. In practice this problem is easily overcome by appending zeros to fill the
rest of the length required to match the FFT size; this operation is called zero-padding.
This method also has other benefits: since zero-padding in time means interpolating
in the frequency domain, the spectrum obtained will be denser (more spectral lines,
an oversampled spectrum). Note that zero-padding does not increase the frequency resolution
of the spectrum, because the analysis window length remains the same, but it can make
it easier to track spectral peaks which do not fall exactly on bin frequencies.
Usually the FFT size N is chosen to be the first power of two at least twice the window
length M, therefore N − M points will be forced to 0. N/M is called the zero-padding
factor. This factor should be large enough to enable estimation of the maximum of the
main lobe, that is, of the spectral peak. Since the window length is not an exact number of
periods for every frequency, the center frequencies of the spectral peaks will rarely match
the frequency bins; hence an appropriately chosen zero-padding factor can alleviate this
problem and the peak can be found.
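A small sketch (NumPy assumed) of the zero-padding operation just described: the window length M is left untouched, so the true resolution fs/M does not change, but padding up to the FFT size N gives a denser grid of spectral lines on which the peak is easier to locate.

import numpy as np

fs, M = 44100, 1000                         # window of 1000 samples (not a power of two)
N = 4096                                    # FFT size: a power of two, zero-padding factor ~4
t = np.arange(M) / fs
x = np.hanning(M) * np.sin(2 * np.pi * 441 * t)

spectrum = np.abs(np.fft.rfft(x, n=N))      # x is implicitly padded with N - M zeros
print(round(np.argmax(spectrum) * fs / N, 1))   # ~441.4 Hz: the peak lands close to 441 Hz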
As said before, choosing an odd-length window helps with phase detection. An
odd window length means that the windowed waveform is centered at the time origin,
so half of the samples are positioned before the time origin (negative-time values)
and the other half after it (positive-time values). The FFT input
buffer of this windowed waveform will contain the positive-time values at the beginning
and the negative-time values at the end; the rest of the length (in the middle) will be
zero-padded.

Hop Size, Overlap Factor


The hop size determines the time advance of the analysis window, i.e. the time difference
(in samples) between two adjacent analysis windows. The hop size is normally said to
assume the unitary value when it equals the window length (in samples). The
analysis windows can be overlapped in order to have more analysis points and therefore
more time resolution over the input waveform.


The inverse of the hop size is called the overlap factor (if H > M the analysis windows
do not overlap). For example, if H = M = 1024 and fs = 44100 Hz, the time resolution over
the input waveform is 1024/44100 ≈ 23 ms; if the overlap factor is 8 the time resolution
becomes about 2.9 ms.
A greater overlap factor will generally give better analysis results, but also a greater
computational cost. Hence, the overlap factor has to be chosen according to the input waveform
characteristics, i.e. fast-changing waveforms need more overlap. There are some general
criteria for determining an efficient overlap factor; the most general is to choose the overlap
in such a way that all the data are equally weighted, as in the case of the overlap-add synthesis
presented later.
Another criterion is to choose the overlap factor according to the nature of the window function,
that is, overlapping windows should add perfectly to a constant value, i.e. 1. For
a rectangular window this is easy to obtain, since the hop size can simply be M/i, with i any
positive integer. If consecutive analysis windows add up to a constant, no
amplitude deviation is possible, hence successive windowing operations will not apply
amplitude modulation to the input waveform.
To summarize, the STFT operation is applied to a stream of input samples and
results in a series of frames that, one after another, produce a time-varying spectrum,
giving the impression of a continuous spectrum.
The four parameters to choose when designing an efficient STFT can be summarized as
follows:

• window shape

• window length

• hop size / overlap factor

• FFT size

4.3.4 The Inverse Short Time Fourier Transform & Overlap-


Add Resynthesis
The ISTFT is the process by which the original waveform can be reconstructed starting
from the frequency-domain analysis data produced by the STFT. This is the typical
feature of a synthesis process, thus what will be presented in this section is the use of
ISTFT in the particular synthesis method called overlap-add resynthesis.
Before that, recall the frame-by-frame form of the STFT (with hop size R), whose frames
are what the overlap-add procedure inverts:

X_m(\omega_k) = e^{-j\omega_k mR} \sum_{n=-N/2}^{N/2-1} x(n + mR)\, w(n)\, e^{-j\omega_k n}


The overlap-add resynthesis method, due to Allen and Rabiner (1977), states that we can
reconstruct each windowed segment of the original waveform starting from the spectral
components, by applying the ISTFT to each frame. It takes the magnitude and phase
values of each spectrum to generate a time-domain waveform with the same envelope as
the analysis window used to compute the STFT. Then each resynthesized time-domain
segment is overlapped and added to reconstruct the original waveform.
In theory, the overlap-add process is an identity operation (i.e. the reconstructed
signal equals the original) in the mathematical sense, but only if the overlapped and added
windows sum to a constant. That would mean we could pass the signal through the STFT
and back with the ISTFT countless times; however, even good implementations of the
STFT lose a small amount of information, showing that in practice this is impossible.
OA resynthesis is not the only STFT-based method for resynthesizing the original waveform;
many others are possible. The weighted overlap-add method is similar to
OA, the difference residing in the transformation applied to the window function before
resynthesis. The analysis window and the synthesis window must maintain the identity
property, which is achieved by respecting the relationship:

\sum_{n=-\infty}^{\infty} w[m - nH] = c

where c is a constant.
The synthesis window is needed when, before resynthesis, a transformation is applied
to the phase spectrum, which can create phase discontinuities at the frame boundaries.
Oscillator-bank resynthesis (also called sinusoidal additive resynthesis, SAR) is another
method, in which the analysis data (magnitude and phase) are converted into synthesis
data (amplitude and frequency) in order to drive one oscillator per frequency bin; the
oscillator outputs are then summed to recreate the original signal. This method is free from the
add-to-a-constant rule of OA, because the converted spectrum is more robust against the digital
processing transformations possibly applied before synthesis.
The SAR method can be applied to the filterbank interpretation of the STFT, by matching
each frequency bin to a sine wave and then summing all the sine waves for synthesis [54].
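As a summary of the overlap-add idea, here is a hedged sketch (NumPy assumed, matching the stft() sketch given earlier): each frame is brought back to the time domain with an inverse FFT and added at its position, and dividing by the accumulated window envelope removes the amplitude modulation introduced by the analysis windows.

import numpy as np

def istft_overlap_add(X, N=1024, H=256):
    window = np.hanning(N)
    out = np.zeros(H * (len(X) - 1) + N)
    norm = np.zeros_like(out)
    for l, frame in enumerate(X):
        segment = np.fft.irfft(frame, n=N)     # windowed time-domain segment w[n]*x[n+lH]
        out[l * H:l * H + N] += segment        # overlap and add
        norm[l * H:l * H + N] += window        # accumulated window envelope
    norm[norm < 1e-12] = 1.0                   # avoid division by zero at the borders
    return out / norm
# Round-tripping a signal through stft() and istft_overlap_add() reconstructs it
# almost exactly, apart from the very first and last samples.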

4.4 Constant-Q analysis


The constant-Q filterbank analysis is another technique descended from harmonic analysis,
like the Fourier transform. It has been in use since the late 70s and inspired other techniques
such as the bounded-Q transform, the auditory transform and the wavelet transform. Q, the
ratio between the center frequency and the bandwidth of a filter, and the filterbank, a connection of
filters in parallel, were introduced in chapter 3. Now both these concepts
will be applied to the case of constant-Q analysis, as an alternative to strict Fourier

Figure 4.6: Spacing of filters for the STFT (filterbank view) on the top and for the constant-Q
filterbank on the bottom. The advantage of the constant-Q filterbank method is clear:
it places the filters linearly against log(frequency), which resembles the frequency
resolution of the human ear.

analysis. Our aim in this thesis is to demonstrate its suitability for a
specific task of sound analysis, onset detection, applied to the case of ·O M M· (next
chapter), and its perceptually grounded approach applied to the recognition of the sound
that has been played (final chapter).
The constant-Q transform has advantages over the Fourier transform which lie in
musical aspects. The STFT computes frequency components (frequency bins) on a
linear scale; that is, it expresses frequencies with a fixed resolution or bandwidth. This
method has an inconvenience, because it frequently results in inadequate resolution for
low musical frequencies and exaggerated resolution for high frequencies. The choice of
the frequency resolution comes down to the choice of the appropriate window length,
which best fits the resolution needed (i.e. the lowest frequency content it can resolve).
Moreover, high frequency resolution means poor time resolution and vice versa; that is,
there is always a tradeoff between time and frequency resolution in STFT-based analysis.
Such a problem will be discussed through an example: suppose the sampling fre-
quency to be fs =44100 Hz and N =1024 samples the window length. The frequency


bins that can be analyzed are then 512, equally spaced over the bandwidth, i.e. from 0 Hz
up to the Nyquist frequency. Increasing the sample rate, e.g. to fs = 96 kHz, will not increase
the frequency resolution of the analysis but will only widen the bandwidth up to 48 kHz. To get an
increase in frequency resolution, one must choose a larger window length. The limit example:
to obtain a frequency resolution of 1 Hz, a window length of 44100 samples
must be chosen, sacrificing the time resolution, which drops to 1 s! Conversely, if the time resolution
needed is 1 ms, i.e. to analyze 1000 events per second, the window length should be
44 samples, and thus the frequency resolution will be about 1000 Hz!
Now let's see a practical example, introducing the need for a different analysis tool.
Suppose the task is to resolve (to analyze separately) the frequencies corresponding
to the fundamental frequencies of the notes of a piano, and suppose the two lowest
notes are spaced, for example, 2.5 Hz apart. The analysis window must then be chosen
N = 16384 samples long, so that for fs = 44100 Hz the frequency resolution is
fs/N ≈ 2.5 Hz. This would not only result in a bad time resolution (about 400 ms): the
real problem is the uselessly fine frequency resolution applied to the higher
notes, where the spacing between notes is much more than 2.5 Hz.
Furthermore, what we said in chapter 2 about the perception of frequencies is here
completely neglected. However, since the STFT is performed via the FFT, the time required
to produce the output is extremely low and a real-time implementation does not constitute a problem,
although a lot of data is useless and must be discarded after the analysis. The
complexity reduction achieved by the FFT algorithm is therefore the principal reason behind
the wide use of the STFT for sound analysis purposes.
The constant-Q transform constitutes an alternative to the fixed-frequency representation
of the Fourier transform. In a constant-Q transform the bandwidth of each frequency
bin varies proportionally with frequency. In the next section we will see a typical implementation
of constant-Q analysis for musical purposes, applied to the case of a piano.
But first, we should take a look at the waterfall spectrograms shown in figures 4.7
and 4.8, taken from Brown [10]. The two pictures clearly point out the advantage of
representing musical signals with the constant-Q transform, which lies in its musical aspects.
It is especially clear when compared to the previous image (figure 4.4) showing an STFT waterfall
spectrum.

4.4.1 Implementation of Constant-Q Analysis


One example of a constant-Q transform implementation is the 1/24th-octave
bank of filters. The constant-Q filter bank and its similarity to the auditory
system have been explored in [42] and [38]. Various schemes for implementing constant-Q
spectral analysis outside a musical context have been published, for example that of
Gambardella, who proposed an inverse function to bring the constant-Q representation back
to the time domain. This is important if manipulation of the signal in the spectral
domain followed by transformation back to the time domain is desired.


Figure 4.7: Waterfall spectrogram of a constant-Q transform of a violin glissando from
578 Hz to 880 Hz (D5 to A5). Taken from Judith Brown's Calculation of a constant Q
spectral transform. [A glissando is a glide from one pitch to another. It is an Italianized musical term derived
from the French glisser, to glide. It is also where the pianist slides up the piano with his or her hands. From Wikipedia.]

For musical analysis, we would like frequency components corresponding to the quarter-tone
spacing of the equal-tempered scale4. The frequency of the k-th spectral component
is thus:

f_k = (2^{1/24})^k f_{min}

where f_k varies from f_{min} up to an upper frequency chosen below the Nyquist frequency.
The minimum frequency f_{min} can be chosen as the lowest frequency about which
4
Equal temperament is a musical temperament, or a system of tuning in which every pair of adjacent
notes has an identical frequency ratio. In equal temperament tunings an interval, usually the octave, is
divided into a series of equal steps (equal frequency ratios).


Figure 4.8: Waterfall spectrogram of a constant-Q transform of a flute playing a diatonic
scale from 262 Hz to 523 Hz (C4 to C5). Taken from Judith Brown's Calculation of a
constant Q spectral transform. [In music theory, a diatonic scale is a seven-note musical scale comprising five
whole steps and two half steps, in which the half steps are maximally separated. From Wikipedia.]

information is desired, e.g. a frequency just below that of the G string for calculations
on sound produced by a violin. The resolution or bandwidth δf for the discrete Fourier
transform is equal to the sampling rate divided by the window size (the number of
samples analyzed in the time domain). In order for the ratio of frequency to bandwidth
to be a constant (constant Q), the window size must vary inversely with frequency.
More precisely, quarter-tone resolution requires:

Q = f / δf = f / (0.029 f) ≈ 34

where the quality factor Q is defined as the ratio of frequency to bandwidth, so that the
bandwidth is δf = f/Q. With a sampling rate f_s = 1/T, where T is the sample period, the
length of the window in samples at frequency f_k is

N[k] = f_s / δf_k = (f_s / f_k) Q


Note also from this equation that the window contains Q complete cycles for each frequency
f_k, since the period in samples is f_s/f_k. This has a physical meaning: in order to
distinguish between f_{k+1} and f_k when their ratio is, e.g., 2^{1/24} ≈ 34/33, we must look
at at least 33 cycles. It is also interesting, for comparison, to consider the conventional
discrete Fourier transform in terms of the quality factor Q = f/δf. We find that f/δf
is equal to the number of the coefficient k, and this is the number of periods in the fixed
window for that frequency.
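The relations above can be turned into a small sketch (ours, not Brown's implementation; the choice f_min = 196 Hz, roughly the violin G string, is only an illustrative assumption) that computes the quarter-tone center frequencies f_k, the resulting Q ≈ 34, and the window lengths N[k] at fs = 44100 Hz:

#include <stdio.h>
#include <math.h>

/* Sketch: center frequencies and window lengths of a quarter-tone
   constant-Q analysis, following f_k = (2^(1/24))^k * fmin and
   N[k] = (fs / f_k) * Q, with Q = 1 / (2^(1/24) - 1) ~= 34.        */
int main(void)
{
    const double fs    = 44100.0;
    const double fmin  = 196.0;                  /* illustrative fmin    */
    const double ratio = pow(2.0, 1.0 / 24.0);   /* quarter-tone step    */
    const double Q     = 1.0 / (ratio - 1.0);    /* ~= 34                */

    for (int k = 0; k < 48; k++) {               /* two octaves          */
        double fk = pow(ratio, k) * fmin;
        int    Nk = (int)ceil((fs / fk) * Q);    /* window length N[k]   */
        printf("k = %2d  f_k = %8.2f Hz  N[k] = %5d samples\n", k, fk, Nk);
    }
    return 0;
}

Running such a sketch shows the window length shrinking from thousands of samples at low frequencies to a few hundred in the upper octaves, which is exactly the behaviour a fixed-window DFT cannot offer.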
The constant-Q transform has demonstrated good results when approaching the task
of sound analysis; especially regarding the identification of musical notes, it proves
to be a more appropriate spectral representation due to its geometrically spaced
frequency channels. It should not, however, be considered a good starting point for
musical synthesis, because of the controversial inverse function: inverse functions have
been proposed but not successfully implemented for musical purposes.
This method will be proposed, in the final chapter, to achieve the goal of onset
detection.

Chapter 5

Real-Time Audio Applications

To generate 1 and process acoustical signals is to compose music, more directly than
inscribing ink on paper. Curtis Roads[44].

For our purpose, that is to analyze the musical flow of the robotic orchestra, we need a
flexible and extensible platform. For this reason we discarded a priori the use of numerical
analysis software like Matlab/Octave (we did use them, but for other purposes). Although
they offer lots of packages (free in the case of GNU Octave) tuned around sound analysis and
processing ([63][47] and [32]), our need is to integrate the analysis into the Show Control
System2 of the orchestra; thus, for a musical context, other software is preferable.
One of the pillars in this category is Max/MSP, the software already used
for the development of the ·O M M· SCS.

In this chapter the main software environments for real-time audio applications are treated,
from basic features to advanced ones (in the case of Max/MSP, the software we chose for
our purpose). At the end of the chapter, an overview of the typical applications these
environments are designed for is given, together with a comparison between text-based
(Unix-style, terminal-like window) and graphical environments.

1
Musical sound synthesis.
2
Show control system (entertainment) is a generic term for a system (possibly very complex) whose main
feature is to coordinate all the different subsystems (audio, video, MIDI, OSC, ...) controlling the
hardware by which a show is formed. In our case, the SCS coordinates the robot musicians, the
gestural controller and possibly other (still to be experimented) features.


5.1 Max/MSP
Max/MSP owes the first part of its name to Max Mathews3 , who in 1957 wrote the
first ever computer program specifically for sound generation4 . Max was also the original
name of the software, developed by Miller S. Puckette at IRCAM in the mid 80s and first
commercially distributed in the early 90s. MSP is a package for real-time DSP (standing
for Max Signal Processing, or the initials of Miller S. Puckette), added to the software
in 1997. Due to its graphical (but minimal) nature, Max/MSP differs from most
MUSIC-N languages: Max can be considered a visual programming language. Visual
programming lets you graphically connect objects together with patch cords to design
interactive software. This is normally the way programs are designed; think of flowcharts
or more modern techniques such as UML. But the difference resides here: with
flowcharts the blocks represent code that will be written, while in Max the code is already
written.
Since Max uses icons to represent objects written in a high level language, Max is a
meta-language, responding to the paradigm "programs can write programs". Max/MSP
distinguishes between two levels of timing: that of an "event" scheduler, and that of the
DSP (similar to the distinction between control-rate and audio-rate processes in Csound,
a direct descendant of MUSIC).
With Max you can also control external hardware, read data from sensors, and interchange
audio and data with other software, other than generate and analyze sounds,
create musical instruments, video and animation. All these features make Max a popular
choice for composing interactive media works, most of all for the approachable graphical
interface, the extensive bindings to media processes and protocols, and the open-ended
philosophy.
A short description of the principal Max features follows:

The patcher windows


MAX is designed to look familiar to composers who have worked with patchable synthe-
sizers. What you see on the screen is a lot of boxes with lines connecting them. The
boxes are called objects and the lines are patch cords. What happens is that each object
represents a process. The results of the process are passed along the patch cord to the
next object. Ultimately, there is an object that sends MIDI data, audio or video out.
Each window full of objects is called a patcher. Several patchers may be open at once
and they can all be active, even if their window is hidden. Patchers can be saved, and
then entered as an object in another patcher. There is also a patcher object, that can
3
Max Vernon Mathews is unequivocally considered one of the pioneers in the world of computer
music. He received his Sc.D. in 1954 at the Massachusetts Institute of Technology (MIT); while working
at Bell Labs he created MUSIC, the first computer-based programming language for synthesis (1957).
4
MUSIC 1


Figure 5.1: Max 5 patcher window

be opened up and filled with objects which will continue to work after the patcher object
is folded up again. The action flows from the top down. When an object is tweaked
by the user or MIDI comes in, a message is sent to any connected objects, which react
with messages of their own. Only one thing happens at a time, but it’s all so fast it
seems instantaneous. When a pathway branches, messages are sent to right destinations
before left.

Objects
The name of the object represents what it does. There are a few hundred objects
included with Max, ranging in complexity from simple math to full featured sequencers.
Arguments, if present, specify initial values for the object to work with. Data comes into
the object via the inlets, and results are put out the outlets. Each inlet or outlet on an
object has a specific meaning. This will be displayed in a flag as the mouse passes by
(further details are in the manual). Usually, input to the left inlet triggers the operation
of the object. For instance, the delay object (as shown) will send a bang message out
the outlet 500 milliseconds after a bang is received in the left inlet. Data applied to the
right inlet will change the delay time.

Messages
Data bytes sent down the patch cords are called messages, which fall into one of the
following types:

• int A number without a decimal point.


• float A number with a decimal point.

• symbol A character string1 such as “stop” that may be understood by certain


objects. A symbol may be followed by further information in the message.

• list Several of the above, separated by spaces. The first element of a list must be
a number.

• bang A message that triggers the action of an object.

Audio signals are sent in yellow patch cords. These are little packets of data, but
sent so fast as to be effectively continuous. Jitter signals are sent via green patch cords.
Jitter messages are names of matrices that hold data for jit.objects to process. Every
object responds to a variety of messages. If a message won’t work, a warning will appear
in the Max window.

Max windows
The Max window contains information sent from Max (like error messages) or things
you might like to print. It’s sort of a terminal window.

Max runtime
Max is not required to run a finished patch. Anyone can download Max/MSP Runtime
for free, which will run patches but not edit them. There is also a process for converting
patches into stand-alone applications.

Pure Data
PD is the open source twin of Max, released under a BSD license and developed by the same
author, Miller Puckette, since 1996. It shows off the same potential as Max, with small
differences explained by the author in [42] and [38].

5.2 CSound
CSound is one of the better-known textual interfaces for computer music composition.
CSound was originally written by Barry Vercoe at MIT in 1985, based on languages
of the Music-N family, and continues to be developed today. At its core, CSound is
“designed around the notion that the composer creates a synthesis orchestra and a score
that references the orchestra.”


Figure 5.2: Max 5 window

Csound files were originally processed in non real-time to render sonic output, in a
“process referred to as ‘sound rendering’ as analogous to the process of ‘image rendering’
in the world of computer graphics.” [55]. Csound instruments are defined in the orchestra
file as directed graphs of unit generator types (called ‘opcodes’). Flexible sound routing
can be achieved using control and audio busses via the Zak objects. Control rate is
evident in CSound through the a-rate and k-rate notations.
The strong separation of synthesis and temporal event definition imposes a strict
limitation on the scope for algorithmic composition: new synthesis processes cannot
be defined in response to temporal events, and new temporal events cannot occur in
response to the synthesis output. “Csound is very powerful for certain tasks (sound
synthesis) while not particularly suited to others (data management and manipulation,
etc.).” [55]


5.3 Supercollider
SuperCollider is a high-level music programming language, designed specifically for dynamic
and generative structures and the synthesis of computer music. It can be applied generally
to many different approaches to composition and improvisation rather than to any
particular preconceived model. It features an application-specific high-level programming
language, SCLang (inspired by C++), with extensive data-description and functional
programming capabilities, and support functions for common musical needs.
SuperCollider also features several libraries of unit generators for signal processing.
Sample-rate and control-rate distinctions are made explicit via the .ar and .kr
notation. A key distinction from CSound is that code can be evaluated in real-time as
the program runs.
SuperCollider is ideal for algorithmic composition. Since version 3.0 (the currently
available version), graphs of unit generators are defined textually and compiled at run-
time into dynamic libraries (‘SynthDefs’) to be loaded as instruments (‘synths’) by the
synthesis engine (‘SCServer’), all under control of the language.
The separation of language and synthesis into distinct processes in version 3.0 in-
troduces compilation and performance optimizations, but also implies limitations in the
degree of temporal control: “Because instruments are compiled into code, it is not pos-
sible to generate patches programmatically at the time of the event as one could in SC2.
In SC2, an audio trigger could suspend the signal processing, run some composition
code, and then resume signal processing. In SC Server, messaging between the engines
causes a certain amount of latency.”[34]
SuperCollider 3.0 therefore represents a return to the CSound model of orchestra
and score, in which however the score is procedural rather than declarative.

5.4 Chuck
ChucK represents one of the only contemporary options that avoids latency in the pro-
cedural control of synthesis. It also provides a library of unit generators to be freely
instantiated and connected into graphs within ChucK scripts. The authors refer to
ChucK as ‘strongly timed’, which can be defined as follows:

• supports sample accurate events

• defines no-control-rate (or supports dynamically arbitrary control-rates)

• supports concurrent functional logic

• control logic can be placed at any granularity relative to synthesis

• supports run-time interaction and script execution


Like SuperCollider’s SCLang, the ChucK language was written especially for the ChucK
software. It is a high-level interpreted programming language.


Comparison between Graphical User Interface and Textual Language Interface Musical Software

GUI, advantages (+):
• Easier to view and input quantitatively rich data such as control envelopes
• Common tasks can be immediately and intuitively represented
• Interaction can be more rapid

GUI, drawbacks (−):
• Interfaces tend to be more specific
• Complex data-structures, if made visible, can be visually overwhelming
• Precise qualitative specification can be difficult at fine granularity

TLI, advantages (+):
• Compact description of complex data-structures
• High degree of precision & control
• Textual elements may more easily refer to or embed each other

TLI, drawbacks (−):
• Tiresome to specify by data-entry when precision is not required
• Simple tasks may require detailed code
• Interaction can be time-consuming, particularly if text must be compiled
Table 5.1: Musical software for realtime synthesis and control

Name           Creator                 Typical purposes                                                 First release  Recent release (2009)                      License                   Development status
Max/MSP        Miller Puckette         Realtime synthesis, hardware control                             mid-1980s      Max 5.0.7                                  Commercial (Cycling'74)   Mature
Pure Data      Miller Puckette         Realtime synthesis, hardware control                             1990s          pd-extended (0.41.4), pd-vanilla (0.42.5)  BSD-like                  Stable
Csound         Barry Vercoe            Realtime synthesis, algorithmic composition (a), audio rendering  1986          Csound 5.10                                LGPL                      Mature
SuperCollider  James McCartney         Realtime synthesis, live-coding (b), algorithmic composition      1996          SC 3.3.1                                   GPL                       Stable
ChucK          Ge Wang and Perry Cook  Realtime synthesis, live-coding, algorithmic composition          2004          ChucK 1.2.1.2 (dracula)                    GPL                       Immature

(a) Algorithmic composition is the technique of using algorithms to create music.
(b) Live-coding is the name given to the process of writing code to modify software in realtime as part of a performance. Most generally,
writing (parts of) programs while they run. It is sometimes known as "interactive programming" or "on-the-fly programming".
Chapter 6

Perceptual Onset Detection

In the first part of chapter 2 we presented the attributes of sound perceived through the
human auditory system; now we are going to focus the discussion on a particular aspect
of that mechanism, the perception of time events, especially the ones related to the
initial portion of sounds. By applying the computer music techniques discussed
in chapters 3 and 4, in detail digital filters and constant-Q analysis, we'll try to
simulate the functionality of the auditory system to achieve the goal of event detection.
Our research interest is widely treated in the literature and must be preceded by
some fundamental definitions, in primis the attack and onset of a sound. Then, some of the
most advanced techniques for onset and attack detection will be proposed and finally
our method based on bonk∼ for Max/MSP will complete this chapter.
First of all, though, the project on which we are working is presented at a
glance.

6.1 The Curious Case of ·O M M·


Two robots, loud sounds, and one computer. The reason I decided to make a thesis.
The project ·O M M· consists of a show in which a performer conducts two robot drummers.
Many people worked for more than one year, developing the electromechanical parts of
the robots at the LIM laboratories, and developing the software, in Max/MSP, to control
the robots. The two robotic percussionists (there were ten in the original concept) are directed
by the performer through a gestural controller, the GipsyMIDI1 exoskeleton. The ·O M M·
orchestra, named after the Italian futurist Filippo Tommaso Marinetti2 in the centenary of
1
http://www.sonalog.com/framesets/gypsymidi_frame.htm
2
Filippo Tommaso Marinetti (1876 – 1944) was an Italian poet and musician, founder of the Futurist
movement. He is responsible for the Futurist Manifesto, published in the French journal Le Figaro in
1909. He also introduced the presence "on stage" of humanoid forms of life, mechanical bodies
considered a primordial example of the "robot" (ten years before Karel Čapek introduced the concept in his
play Rossum's Universal Robots).


Figure 6.1: The two robots on the sides, SCS + the performer in the middle.

Futurism, is ready for its launch in November, after having been presented in October
2008.
In the organization of this thesis, this chapter is the right place to introduce the
project we are working on, before going into the discussion of the recognition of particular
sound onsets. Figure 6.1, representing the ensemble of the show control system of
·O M M· , is provided to aid the visual imagination. The two robotic drummers,
each having two arms, can play a score, sent to them via MIDI4 from a computer, at a
maximum of 120 bpm per arm (i.e. up to four simultaneous3 hits). On the
same computer, a complex Max/MSP patch elaborates the input data received from the
exoskeleton, calibrates it, and lets the output modify the rhythmic pattern in real-time5 .
Since the robots play big and cumbersome drums, two oil bins of regular size, the
music produced consists of loud percussive sounds, in the style of the French band Les
Tambours du Bronx.

3
Perceptually simultaneous. A minimum delay (but less than 15 ms) between the mechanical actions of
the two arms of a robot must be ensured between two electrical transmissions.
4
Musical Instrument Digital Interface.
5
Quasi-realtime, to be precise.


6.2 From Transient to Attack and Onset Definitions
Every musical signal can be subdivided6 into smaller units of sound. The two subdivisions
considered here are the transient and the steady-state portions of a sound. The transient
portion is located at the origin of a sound, originated by a stimulus (e.g. a chord is
played on a guitar, a stick strikes the drum), causing a sudden change in the perceptual
attributes. The duration of the transient is assumed to be very short, substantially related
to the duration of the stimulus. The steady-state portion, coming after the transient, is
a kind of support to the sound; it can be considered as the natural evolution of the transient
according to the instrument design and environment. Figure 6.2 shows this separation
for the case of a drum hit.
To give more precise definitions, regarding the terms used in figure 6.2, the words of
Bello [4] are proposed:
• transient: period during which the excitation is applied and then damped
• attack: time-lag during which the amplitude of a sound increases
• onset: single instant chosen to mark the start of the transient
Another interesting (and more intuitive) definition of onset is that given by Bilmes in
[8]: the onset is the point when a musical event becomes audible. He also refers to the onset
as the attack time. Actually, attack and onset are sometimes used as synonyms to
represent the same information. To circumvent this ambiguity, we propose what follows,
again from Bilmes: “each musical event in percussive music is called a drum stroke or
just a stroke, it corresponds to a hand or stick hitting a drum, two sticks hitting together,
two hands hitting together, a stick hitting a bell, etc. The attack time of a drum stroke
is its onset time, therefore the terms attack and onset have synonymous meanings, and
are used interchangeably.” This is not a risky assumption, because it has been demonstrated [26] that
the time between zero and maximum energy in a percussive musical event is almost
instantaneous.
From the onset recognition point of view, the simplest case is therefore represented
by the drum, where the generated sounds are percussive events, well characterized
by the noticeable change of sound parameters associated with the strokes. Hence,
6
This subdivision is normally indicated with the name "segmentation". Segmentation (also present
in many other applications besides sound) is an operation which preserves the temporal
structure of the signal and can be used to identify, separate and organize the smallest rhythmic events
found in the audio signal. It is a complex operation and can be done in several ways (choice of the
smallest unit) and in several domains (time, frequency, time-frequency, complex), according to the target
to achieve. The main applications are transcription, onset detection, and BPM/tempo calculation.
Transcription allows the musical flow to be represented by the notes from which it is generated; for this
reason it is one of the most explored and advanced areas of study.


Figure 6.2: On the top, the waveform corresponding to a hit of a robot percussionist
of ·O M M· . On the bottom, the intensity profile of the hit (using Praat), where onset,
attack and the transient/steady state separation are highlighted.

transients are normally considered the principal part of a percussive sound; once they have
been characterized, the steady-state portion can be, although approximately, derived by
applying an appropriate synthesis stage which recreates the slow decay at the estimated
resonance of the drum.
The actual meaning of onset, coming from psychoacoustic knowledge, allows us to
define a transient according to the behavior of the three perceptual attributes presented
in chapter 2. In correspondence of a transient, we can observe the following behavior:

• abrupt change in perceived loudness

• abrupt change in perceived pitch

• abrupt variation of the perceived timbre

For the purpose of this thesis, the transient/steady-state separation is left out of the
following discussion (see [56] for further details) and attention will be focused on the
transient region, in particular on onset and attack recognition. But first of all, let's spend


Figure 6.3: From top to bottom: waveform, static spectrum (FFT) and time-varying
spectrum (STFT). From right to left: one hit of an ·O M M· robot, one hit of a snare drum.

a few words on the importance of onset detection, often not made explicit, in most musical
software applications.

Importance of onset detection in musical applications


Several commercial software packages for musical applications exploit onset detection for
many special purposes, very often not explicitly. With a quick survey, we found that
onset detection techniques are particularly important in all these applications: cut'n'paste
(audio editing), audio/video synchronization, estimation of rhythm and temporal

features (audio analysis), compression, content delivery7 , indexing8 , music information


retrieval9 , time-stretching10 and pitch-shifting11 (audio processing). Besides, in other
musical applications, detected and modeled onsets can be used for
further musical processing or even synthesis [50], [3]. Detected onsets can, for instance,
trigger a synthesis algorithm in order to create a new musical piece from the percussive
rhythm found in a recording.

6.3 General Scheme for Onset Detection


Several studies and different approaches have been proposed for the solution of the onset
detection task. Here we present, in order, the three general steps which can be found
in almost all techniques [4]:

1. preprocessing: before the analysis, transformations are sometimes applied to the sound
(e.g. compression and limiting) in order to emphasize or de-emphasize some of the
characteristics of the signal. Preprocessing is usually considered optional, but it
can be extremely helpful, for example to accentuate the sudden changes in the
waveform. Normally, when high speed performance is required by the system, i.e.
for real-time applications, preprocessing is avoided.

2. reduction: reduction is a process through which the complex signal, here considered
as a sum of sinusoids or oscillators, is simplified for analysis (e.g. by subsampling).
The simplified signal (which must reflect the local structure of the original) should
enhance the transient characteristics while de-emphasizing the steady state. This operation
is critical and has been proposed in many ways, which can be summarized in
two categories: methods that make use of explicit signal features (i.e. energy,
frequency, phase) and methods based on probabilistic models, which approximate
the signal's behavior. The function obtained, after reduction of the original signal,
can be called detection function [4] or observation function [50].

3. peak-picking: the final stage of onset detection is historically entrusted to a peak-picking
algorithm, which localizes onsets as local maxima (i.e. peaks) of the detection
function. This stage is also critical, because it depends on the "goodness" of
the detection function and must be "robust" against possible misinterpretations.
7
Content delivery describes the delivery of audio "content" through a delivery medium such as the
Internet; onset detection may improve, for example, the organization of sound into UDP packets.
8
Indexing is a feature of databases. While creating a database of sounds, onset detection may be
important to determine similar characteristics to be used to subdivide the database into categories.
9
MIR, See http://en.wikipedia.org/wiki/Music_information_retrieval
10
Time stretching is a way to change the speed or duration of an audio signal without affecting its
pitch.
11
Pitch Shifting is a way to change the pitch of a signal without changing its length.


It is not difficult to imagine that overlapping sounds, noise, musical effects (e.g.
vibrato and tremolo12 ) or modulations are just some examples of the difficulties
that a peak-picking stage can encounter. That's why the final decision of the
peak-picking algorithm may, in certain cases, be preceded by post-processing
and thresholding stages.
In the next sections we will adopt the term detection function, described in point 2;
every detection scheme has its own detection function. Our method based on bonk∼ will
follow, after a discussion of the modern techniques, which are widely treated in the literature
and well summarized in [4], [50] and [31].

Choice of the Appropriate Detection Function


A brief guideline is proposed here, before entering the discussion of our method for
recognizing onsets in ·O M M· . What follows could be useful in determining the appropriate
method for approaching the onset detection task. The choice strongly depends on the
sound that is the object of the analysis.
Good general practice usually requires a balance of complexity between preprocessing,
construction of the detection function, and peak-picking [4].
In this summary, the methods proposed (in order of complexity) are followed by the
corresponding references where implementation details can be found.

• If the signal is strongly percussive (e.g. drums), time-domain methods are also
adequate (i.e. method based on thresholding the amplitude).
• Spectral methods, such as those based on phase distributions and spectral differ-
ence perform relatively well on strongly pitched transients. [4]
• The complex-domain spectral difference seems to be a good choice in general, at
the cost of a slight increase in computational complexity. [21][56]
• When precise time localization is required, then wavelet methods can be useful,
possibly in combination with another method. [21][22]
• If a high computational load is acceptable, and a suitable training set is available,
then statistical methods give the best overall results, and are less dependent on a
particular choice of parameters. [4] for introduction and [24][2] for more detail.
12
Vibrato and tremolo are two important musical effects. Vibrato is produced, in singing and musical
instruments, by a regular pulsating change of pitch, and is used to add expression and vocal-like qualities
to instrumental music. Tremolo usually refers to periodic variations in the amplitude of a musical note
(or in singing). The depth and speed of vibrato/tremolo determine the amount and speed of the pitch/amplitude
changes. It is difficult to achieve, with the singing voice, separate variations in pitch and amplitude; they
usually occur at the same time, which is why the two terms are sometimes confused. In digital
signal processing, vibrato and tremolo are easy to achieve separately.


6.3.1 Energy Based Approach


Log Energy Approach
Since the energy of a sound shows its most prominent variation in a musical flow during
transients, the energy-based approach can be considered the easiest way to achieve
the goal of event detection. Usually, the introduction of a new generic note (e.g. that of
a piano) leads to an increase in the energy of the signal, and for the specific case of strong
percussive note attacks (e.g. the case of ·O M M· ) this increase in energy will be very
sharp. For this reason, the energy method proves to be a useful and efficient approach for
a lot of onset detection applications, in particular for detecting percussive transients. The
local energy of a frame x(n) of the signal is defined as:
E[n] = \frac{1}{N} \sum_{m=-N/2}^{N/2-1} (x[n+m])^2 \, h(m)

where h(m) is a window of length N, centered at m = 0. Taking the first difference13
of E[n] produces a detection function, in which peaks can be localized in time by
a peak-picking algorithm to find onset locations. An improvement to this equation,
which follows from psychoacoustical knowledge, is to consider that loudness is perceived
logarithmically. Hence, computing the first difference of log E[n] roughly simulates the
ear's perception of loudness. These are usually considered the simplest approaches to
note onset detection.
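A minimal sketch of this detection function (our own illustration; function and variable names are hypothetical) computes the windowed local energy frame by frame and outputs the first difference of its logarithm:

#include <math.h>

/* Sketch: log-energy detection function. For each hop, compute the
   windowed local energy E[n] and output the first difference of
   log E[n]; peaks of the output indicate candidate onsets.          */
void log_energy_detection(const float *x, int len, int N, int hop,
                          const float *h,         /* window, length N */
                          float *detection)       /* one value per hop */
{
    double prev_log_e = 0.0;
    int frame = 0;

    for (int n = 0; n + N <= len; n += hop, frame++) {
        double e = 0.0;
        for (int m = 0; m < N; m++)
            e += (double)x[n + m] * x[n + m] * h[m];
        e /= N;

        double log_e = log(e + 1e-12);            /* avoid log(0)      */
        detection[frame] = (frame == 0) ? 0.0f
                                        : (float)(log_e - prev_log_e);
        prev_log_e = log_e;
    }
}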

Spectral Difference
This idea can be extended to obtain a more appropriate detection function, that is,
one considering frames of the STFT. We recall that a generic STFT frame of a waveform
is given by:

X[n,k] = \sum_{m=-\infty}^{\infty} x[m] \, h[n-m] \, e^{-j \omega_k m}

where k = 0,1,...,N − 1 is the frequency bin index and h the finite-length sliding window14 .
As in the previous method, we now take the first difference between the
magnitudes of consecutive STFT frames, that is:

\delta X[n] = \sum_{k=1}^{N} \big( |X[n,k]| - |X[n-1,k]| \big)
13
The first difference is the difference between two consecutive samples; in this case each sample
describes the energy content of the sampled waveform.
14
See chapter 4 for details on the STFT.


This measure, known as the spectral difference, can be used to build an efficient onset
detection function. Energy-based algorithms are fast and easy to implement, but their
effectiveness decreases when approaching nonpercussive sounds or when the transient energy
is produced by overlapping (and more complex, e.g. strongly pitched) sounds.
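As an illustration, and assuming the magnitude spectra |X[n,k]| of two consecutive frames have already been computed (e.g. with any FFT routine), the spectral difference can be sketched as:

/* Sketch: spectral difference between consecutive STFT frames,
   delta_X[n] = sum_k ( |X[n,k]| - |X[n-1,k]| ).
   In practice the difference is often half-wave rectified so that
   only energy increases contribute; here the plain sum is shown.    */
float spectral_difference(const float *mag_prev, const float *mag_curr,
                          int nbins)
{
    float sd = 0.0f;
    for (int k = 0; k < nbins; k++)
        sd += mag_curr[k] - mag_prev[k];
    return sd;
}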

High Frequency Content


This technique is very interesting and was successfully applied by Masri in [33]. The considerations
on which it is grounded were proposed by Rodet and Jaillet: energy increases linked
to transients tend to appear as broadband events and, since the energy of the signal is
usually concentrated at low frequencies, changes due to transients are more noticeable
at high frequencies. To emphasize this, the spectrum obtained by the STFT can be
weighted preferentially toward high frequencies before summing, to obtain a weighted
energy measure. The following formula was proposed:
E[n] = \frac{1}{N} \sum_{k=-N/2}^{N/2-1} W_k \, |X[n,k]|^2

where W_k is the (frequency dependent) weighting function. Masri proposed W_k =
|k|, called high frequency content (HFC), a linear weighting function by which each
frequency bin gives a contribution proportional to its frequency.
Energy increments related to transient components are more noticeable at higher frequencies
(although the total energy is usually concentrated at lower frequencies); therefore
HFC should be considered one of the pillars of energy-based onset detection.
However, the HFC method does not take into account temporal features of the waveform, which
could be equally important. That's why alternative methods (e.g. considering phase
spectrum information) should also be considered.
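A corresponding sketch of the HFC measure for one frame (again assuming the magnitude spectrum is available; the linear weight W_k = |k| follows Masri):

/* Sketch: high frequency content (HFC) of one STFT frame,
   E[n] = (1/N) * sum_k  W_k * |X[n,k]|^2   with   W_k = k
   (linear weighting toward high frequencies, after Masri).   */
float high_frequency_content(const float *mag, int nbins)
{
    float hfc = 0.0f;
    for (int k = 0; k < nbins; k++)
        hfc += (float)k * mag[k] * mag[k];
    return hfc / nbins;
}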

6.3.2 Phase Based Approach


The use of phase spectra in approaching the onset detection task is a relatively recent
introduction [20]. Let's see how it works.
The starting points are the definition of the unwrapped phase 15 , and its application
15
The unwrapped phase is the phase freed from the 2π ambiguity of its principal value (which is confined
to a range of width 2π, e.g. between 0 and 2π). It is related to the instantaneous frequency in the
following manner:

\omega(t) = \varphi'(t) = \frac{d}{dt}\varphi(t), the instantaneous angular frequency

f(t) = \frac{1}{2\pi}\varphi'(t), the instantaneous frequency in Hz

\varphi(t) = 2\pi \int_0^t f(\tau)\, d\tau + \varphi(0), the 2\pi-unwrapped phase.

77
6 – Perceptual Onset Detection

in the case of a given stationary sinusoid (i.e. extracted from steady state portion of
the signal). In a steady state sinusoid, extracted from a single frame of the STFT, the
phase, as well as the phase in the previous frame, are used to calculate a value for the
instantaneous frequency. An estimate of the instantaneous frequency of the STFT frame
within this window, is that:
 
ϕk (n) − ϕk (n − 1)
fk (n) = fs
2πh

where h is the hop size between windows and f_s the sampling rate.
What is expected, for a stationary sinusoid, is that the instantaneous frequency
should be approximately constant over adjacent windows. This is equivalent to saying that
the phase increment between adjacent windows remains approximately constant, which is
expressed as:

\varphi_k(n) - \varphi_k(n-1) \simeq \varphi_k(n-1) - \varphi_k(n-2)

Equivalently, the phase deviation can be defined as the second difference of the phase,
which is:

\Delta\varphi_k(n) = \varphi_k(n) - 2\varphi_k(n-1) + \varphi_k(n-2) \simeq 0

During a transient region, the instantaneous frequency is not usually well defined, and
hence the phase deviation will tend to a large value. This is illustrated in figure 6.4. Bello
proposes a method that analyzes the instantaneous distribution (in the sense of a probability
distribution or histogram) of phase deviations across the frequency domain.
During the steady-state part of a sound, deviations tend to zero, thus the distribution
is strongly peaked around this value. During attack transients, the values increase, widening
and flattening the distribution. However, this method, although showing some improvement
for complex signals, is susceptible to phase distortion and to the noise introduced by
the phases of components with no significant energy. Finally, why not mix the phase
and energy approaches? Again Bello gives the answer, proposing such a solution to the
detection task in [5].
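The phase-deviation measure can be sketched as follows (our illustration; the phases are assumed to be already unwrapped across frames, e.g. by a phase-vocoder front end, and the mean absolute deviation over all bins is used as a simple frame-wise summary):

#include <math.h>

/* Sketch: mean absolute phase deviation of one frame,
   d_k(n) = phi_k(n) - 2*phi_k(n-1) + phi_k(n-2), averaged over bins.
   Values near zero indicate steady state; large values suggest a
   transient region.                                                  */
float phase_deviation(const float *phi_n,   /* unwrapped phase, frame n */
                      const float *phi_n1,  /* frame n-1                */
                      const float *phi_n2,  /* frame n-2                */
                      int nbins)
{
    float acc = 0.0f;
    for (int k = 0; k < nbins; k++)
        acc += fabsf(phi_n[k] - 2.0f * phi_n1[k] + phi_n2[k]);
    return acc / nbins;
}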
Phase, energy, and phase/energy approaches do not represent the only methods
applied to the solution of the onset detection task. Several other methods have been
proposed, with particular regard to stochastic and statistical methods ([2] for example
and [4] for a comparison), very different from the ways above. A Deterministic Plus
Stochastic model (such as described by Serra in [54]) for specific onset detection has
recently been presented by Gifford and Brown in [24].
But let's now introduce the perceptual-based approach we used to address the task
of onset detection.


Figure 6.4: Unwrapped phase deviation between adjacent analysis frames. ∆ϕn,k
is the unwrapped phase deviation. For the simpler case represented by a steady-state
sinusoid, the phase deviation is approximately constant (close to zero) across the analysis
frames, while during a transient the phase deviation should be extremely large and easy
to detect.

6.4 Introduction to the Perceptual Based Approach to Onset Detection
Perceptual-based approaches have demonstrated their strength in the detection of both
pitched and non-pitched sounds16 , as summarized in [13]. Since the perceptual attributes
of sound are subject to judgments that vary from person to person, perceptual onset
detection should always be considered a good method when it achieves the aim of
approximating the onset detection mechanism of the human ear. The different approach at its
base may justify a possible divergence of results compared with other methods;
usually the aim here is not to find the best and most efficient algorithm which recognizes
every onset, but only the perceptually meaningful onsets. At the human auditory level,
16
Pitched and non-pitched sound is a common way to describe sounds with strong harmonically related
component frequencies (pitched sounds, e.g. piano, violin...) and sparse, unrelated harmonics
(non-pitched sounds, e.g. drums...). At a perceptual level, the difference is that when a pitched sound
occurs, a pitch is clearly associated with the fundamental frequency of the sound. Non-pitched sounds
trigger different mechanisms in the human auditory system; sometimes a pitch is still associated, but
it won't actually correspond to the fundamental frequency.


the mechanisms which encode both time and frequency effects determine the subjective
perception of sound onsets. The principal limits are imposed by time resolution and
frequency masking effects (both explained in chapter 2). Moreover, overlapping pitched and
non-pitched sounds (even percussive pitched sounds) could obfuscate the perception of
pitch and also delay or obscure one or more adjacent onsets. Let's see an example, before
introducing the perceptual method applied to ·O M M· , based on bonk∼ for Max/MSP.

Band-Wise Processing
Scheirer in [51] was the first to clearly demonstrate that an onset detection algorithm
should follow the human auditory system by treating frequency bands separately
and then combining the results at the end. An earlier system described by Bilmes was
similar, but it only used a high-frequency and a low-frequency band, which he himself
judged not very effective [8].
Scheirer in [51] described a psychoacoustic demonstration on beat perception, which
shows that certain kinds of signal simplifications can be performed without affecting
the perceived rhythmic content of a musical signal. When the signal is divided into at
least four frequency bands and the corresponding bands of a noise signal are controlled
by the amplitude envelopes of the musical signal, the noise signal will have a rhythmic
percept which exhibits significant similarities to the original signal. On the other hand,
this does not hold if only one band is used, in which case the original signal is no longer
recognizable from its simplified form (detection function).
The method proposed by Klapuri [30] is the most significant example of successful
application of psychoacoustic knowledge to the onset detection task. It utilizes
the band-wise processing principle as introduced by Bilmes and Scheirer. The procedure
is the following:

1. the overall loudness of the signal is normalized to a 70 dB level (pre-processing)
using the model of the equal loudness contour17 ,

2. a filterbank divides the signal into 21 non-overlapping bands and, in each band,
the onset components are detected and their time and intensity are determined,

3. in the final phase, the onset components are combined to yield onsets.

This method uses psychoacoustic models both in onset component detection, in
its time and intensity determination, and in combining the results. The design of the
filterbank is the core of this system: Klapuri proposed a filterbank which approximates
the critical-band placement and covers the frequencies from 44 Hz to 18 KHz. This
is obtained with 21 filters, where the lowest three are one-octave band-pass filters and
17
See chapter 2 for more on equal loudness contour.


the remaining eighteen are third-octave band-pass filters18 . All subsequent calculations
can be done one band at a time. This reduces the memory requirements of the algorithm
in the case of long input signals, assuming that parallel processing is not desired.
The output of each filter is full-wave rectified and then decimated by a factor of 180 to
ease the following computations. Amplitude envelopes are calculated by convolving the
band-limited signals with a 100 ms half-Hanning (raised cosine) window. This window
performs much the same energy integration as the human auditory system, preserving
sudden changes but masking rapid modulation [30].
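A rough sketch of this band-wise envelope extraction (ours, simplified: the decimation step is omitted and the exact window alignment of [30] is not reproduced) could look like this:

#include <math.h>
#include <stdlib.h>

/* Sketch: envelope of one band-limited signal, obtained by full-wave
   rectification followed by convolution with a 100 ms half-Hanning
   (raised cosine) window. fs_band is the rate of the band signal.    */
void band_envelope(const float *band, int len, float fs_band, float *env)
{
    const float PI_F = 3.14159265f;
    int W = (int)(0.100f * fs_band);             /* 100 ms window       */
    if (W < 2) W = 2;

    float *w = malloc(W * sizeof(float));
    float wsum = 0.0f;
    for (int i = 0; i < W; i++) {                /* rising half-Hanning */
        w[i] = 0.5f * (1.0f - cosf(PI_F * i / (W - 1)));
        wsum += w[i];
    }

    for (int n = 0; n < len; n++) {
        float acc = 0.0f;
        for (int i = 0; i < W && i <= n; i++)
            acc += fabsf(band[n - i]) * w[i];    /* rectify + convolve  */
        env[n] = acc / wsum;
    }
    free(w);
}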

6.5 Onset Detection in ·O M M·


From a Listener Point of View
One of the critical points within a mechanical orchestra is to synchronize the execution of the
musical score among real and virtual instruments; this problem is assigned to the Show
Control System of ·O M M· . We recall that the human ear is particularly sensitive to timing;
hence, with physical devices such as the electromechanical arms of the robots, we have
to ensure the correct timing expected by the listener. In our system, the variable delays
afflicting the orchestra while playing must be overcome to make a synchronized
execution possible.
The variation of the delays strongly depends on the requested note intensity and
the imposed execution rhythm. Each robot basically receives a message on a serial
line (MIDI), stating what kind of hit should be executed. In addition to the message
generation and data transmission delay, usually negligible from a human point of view,
we have the delay introduced by the physical movement of the arms. That is, the robot
arms must be positioned at the correct height and will hit the drum after an elapsed
time proportional to the distance of the arm from the drum and the acceleration by
which it is driven. Furthermore, the delays are not only variable in a predictable manner:
problems related to non-absorbed vibrations between one hit and another can also
occur. This could cause unwanted changes in the loudness and pitch19 perceived by the
listener. That's why we need a perceptual-based approach to analyze (in real-time) the
sound produced by the robot and its response to the different, digitally applied, stimuli.
What we need is to measure the time elapsed between the digital command and the
perceived strike on the can, and to try to compensate for it during the robotic performance. For
this purpose we thought of creating a delay matrix, where each field represents the delay
of a note to be executed. Once we have this delay matrix we can take into
18
Octave and third-octave filters are presented in chapter 3.
19
Since the sound produced by ·O M M· can be considered non-pitched, we do not provide a pitch detector
stage in the orchestra, although we might expect that a pitch will be perceived. We tried to understand
this behaviour with the fiddle∼ and pitch∼ external objects for Max/MSP.


account the delay required for the note to be completed, and anticipate the execution
of such a note by the exact duration of the delay.
To calculate the delay we needed to provide an onset detector stage in our Show
Control System. We know exactly the time of the MIDI event; hence what is missing is
the detection of the executed note.

Integration in the SCS

Our need was to integrate the onset detection stage into the SCS, developed in
Max/MSP and running on an Apple laptop. We initially tried to implement a new
method, to become familiar with development in Max. It was based on an envelope follower
of the signal with a variable threshold applied to it: when the amplitude envelope of
the signal exceeds the threshold, an onset is detected. Because of its practical inefficiency, this
method was immediately set aside, and other methods were explored.
We found a very interesting approach in the bonk∼ object, an external library for Max/MSP
available open source on the web. The original code was written by the author of Max
himself, Miller Puckette, for Pure Data in 1989. It has then been revised by other people
over the years: Ted Apel ported bonk∼ to the Max/MSP platform and later Barry Threw
applied the latest modifications (2008). The version we have used is called bonk∼ 1.4,
found in M. Puckette's repository, with permission to apply changes.

6.5.1 The bonk∼ Method


The bonk∼ method works essentially on a specialization of the constant-Q filter bank
analysis, called bounded-Q analysis. This method has the advantage of drastically
reducing the complexity of the constant-Q transform, as well described in [17] after
[29] and [11]. In this kind of analysis the value of Q is limited (bounded) to approximately
5, and a small number of filters can be used to obtain a filterbank which gives at least
the same results as a constant-Q analysis. In addition, bounded-Q analysis takes
advantage of an FFT-like algorithm applied within each octave. This
is possible because the octaves are geometrically separated, but within each octave
the frequency bins are equally spaced, as shown in figure 6.5. This channel distribution
becomes a good approximation of the geometric scale with a proper number of channels
per octave.
Puckette in [38] says: the bonk∼ object was written for dealing with sound sources
for which sinusoidal decomposition breaks down; the first application has been to drums
and percussion. That is, our case. The bandwidths of the filters subdivide the sound
spectrum into regions which are approximately tuned around the critical bands, in a
manner similar to the Klapuri approach above. This should mimic the auditory
system behavior well.


Figure 6.5: Graphical representation of the bounded-Q filterbank. Only the octaves are
geometrically spaced; within each octave the spacing between analysis bins is linear.
This allows the application of an FFT-like algorithm to calculate the spectrum of each
component.

We found that the implementation of 15 (non-overlapping) filters was successful for
our case. See table 6.1 for details on the filters used for the band-wise analysis. In this
table the spacing of two filters per octave can easily be recognized, except where
prohibited (the first two filters do not respect this spacing20 ). The details of the filterbank's
implementation can be found in the appendix of this thesis.
The final stage, what we earlier called the peak-picking stage, works in bonk∼
essentially through the definition of a growth function.

6.5.2 Result of the Analysis in ·O M M·


After several months spent debugging, optimizing and adapting the source code to
our needs, we subjected it to several tests. These tests, performed with the sounds
recorded at the LIM laboratories in Verres (AO), have demonstrated the soundness of this
approach. We recorded 3 tracks, including 300 sounds each; then we did a lot
of cut'n'paste to obtain five scores, called soundtracks in the result summary. Each
soundtrack, very different from the others, was realized at a different bpm21 , from 100
to 120 (which is the maximum value that a single robot of ·O M M· can reach). These
tracks, created with Ableton Live, were sent to bonk∼ by a dedicated Max patch
realized for testing. The sounding objects are sent to bonk∼ via a special object for
Max, called Elastic, which allows variation of the pitch and tempo of the execution. The
testing patch we realized for this purpose is presented in the appendix of the thesis.
20
See chapter 2 for critical band description.
21
Beats Per Minute.


Table 6.1: Filterbank design in our method based on bonk∼


Filter number    fc [Hz]    Bandwidth [Hz]    Filter points    Number of hops    Hop size
1 86 32.25 1024 1 512
2 150.5 32.25 1024 1 512
3 220.59 37.84 873 1 436
4 312.18 53.32 617 2 308
5 441.18 75.68 436 3 218
6 623.93 107.07 308 5 154
7 882.36 151.36 218 8 109
8 1247.86 214.14 154 12 77
9 1764.72 302.72 109 17 55
10 2495.72 428.28 77 25 39
11 3529.44 605.44 55 36 27
12 4991.44 856.56 39 52 19
13 7059.31 1211.31 27 72 14
14 9983.31 1712.69 19 101 10
15 14118.19 2422.19 14 145 7

The final configuration of bonk∼ gave great results: all the hits produced by the
orchestra can be located in time and the value of the CDR (Correct Detection Result),
proposed in [30], was very easy to calculate.
The CDR is given by:

CDR = \frac{total - undetected - false\ detected}{total} \cdot 100\%

and the results of the analysis are reported in table 6.2.

Only above 110 bpm does bonk∼ fail in some cases; that is, spurious onsets are
reported (but all the provided onsets are still recognized).
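For completeness, a small helper written for this discussion (not part of bonk∼ ) expresses the CDR computation directly:

/* Sketch: Correct Detection Result as defined above, in percent. */
double cdr_percent(int total, int undetected, int false_detected)
{
    return 100.0 * (total - undetected - false_detected) / total;
}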


Table 6.2: Results in detecting the onsets of the five soundtracks created for analysis purposes,
played at different bpm.

·O M M· soundtrack (bpm)    Total Onsets    Detected    Undetected    False Detected    CDR [%]
100 120 120 0 0 100
105 80 80 0 0 100
110 90 90 0 0 100
115 120 120 0 2 98
120 60 60 0 5 95

6.6 From Onset Analysis to Sound Classification


In this last section, we aim to demonstrate that the use of the bonk∼ analysis for
detecting onsets can be extended to the more ambitious task of sound classification. As
mentioned before, the ·O M M· robots are able to produce three different sounds, with
substantially different intensity and duration. What we are going to present in this section is the
ability of bonk∼ to predict which of the three sounds has been played, using only the
loudness content of each frequency band during onset detection. This, if true, would
mean that the spectral analysis performed by bonk∼ , especially when it detects an
onset, is enough to predict which kind of sound has been produced by the orchestra.
Puckette has provided another tool in bonk∼ , called learn mode, which allows one to
store into a template the pattern originating from the output of each filter used for the
analysis. The learn mode works essentially in this way:

1. enter learn mode and choose the number of equal sounds to be provided for each template

2. start the bonk∼ analysis and stop it after all the provided onsets have been recognized

3. store the spectral template into a file and read it back

4. exit learn mode and continue to analyze the signal

5. bonk∼ will report the number corresponding to the sound which best fits the
spectral template read from the file

6.6.1 Learning Results


Results of bonk∼ learning for the case of ·O M M· :


Table 6.3: Numerical results in detecting onsets and recognizing the three sounds (A/B/C)
produced by ·O M M·

·O M M· soundtrack (bpm)    Total Onsets    A/B/C Notes    A/B/C Notes Recognized    A/B/C Notes Confused
100 120 30/60/30 25/64/31 5/5/0
105 80 30/30/20 25/32/23 5/6/4
110 90 33/26/31 32/28/30 2/1/1
115 120 100/12/8 97/16/7 4/1/3
120 60 25/20/15 30/15/15 4/0/7

The result is quite unexpected: more than 80% (on average; in particular cases higher
than 95%) correct correspondences were found, simply by looking at the
onset analysis results.

Chapter 7

Conclusion

A specific component of the human ear, the basilar membrane inside the cochlea, located
in the inner ear, is responsible for the detection of the frequency components of a sound.
In this thin membrane, 32 mm long, each frequency causes oscillations around a specific
point of the basilar membrane. The mechanical properties of the cochlea (wide and stiff
at the base, narrower and much less stiff at the end), in which the basilar membrane is
located, result in a roughly logarithmic decrease in bandwidth as we move linearly away
from the cochlear opening (the oval window).
Therefore we propose a different approach to sound analysis, which is known as the
constant-Q filterbank method. This method is typically implemented with a bank of
band-pass filters (a filterbank) with constant Q ratio. We recall the definition of Q, which is
the ratio between the center frequency and the bandwidth of a filter. This method mimics
the behaviour of the auditory system in detecting frequencies, i.e. the filters are linearly
spaced along a logarithmic frequency axis. This can be obtained with filters that keep
their Q ratio constant. It is established that Q must be chosen approximately equal
to 37 to perform a rigorous scan over the frequency range from 20 Hz to 20 KHz, and
several filters (at least one hundred) must be used.
Our approach has been compared to the one provided by an external library for Max/MSP,
called bonk∼ , developed by the author of Max/MSP himself, Miller Puckette. We found
in it a very interesting approach, very helpful for our purpose. The code has been revised
by other people over the years (the original code dates from 1989) and the version we have
taken into account is bonk∼ 3.0. The code was found in M. Puckette's repository, with
permission to apply changes.
The bonk∼ method works essentially on a specialization of the constant-Q filter bank
analysis, called bounded-Q analysis. This method has the advantage of reducing the
complexity of the constant-Q transform. In this kind of analysis the value of Q is limited
(bounded) to approximately 5, and a small number of filters can be used to obtain at
least the same results. We found that the implementation of 15 non-overlapping filters

was successful for our case. The bandwidths of the filters subdivide the sound spectrum
into regions which are approximately tuned around musical octaves, thus respecting what
seems to be the auditory system response. This method, implemented for the first time
in the late 80s, has shown good results in various fields of musical analysis, in particular
segmentation and transcription.
After several months spent debugging, optimizing and adapting the source code to
our needs, we subjected it to several tests. These tests, performed with the sounds
recorded at the LIM laboratories in Verres (AO), have demonstrated the soundness of this
approach. All the hits produced by the orchestra can be located in time. Not only that: we
also trained bonk∼ to recognize which kind of sound has been produced, and we obtain
more than 85% successful correspondences.
The percentage of recognized hits indicates the validity of the approach; other possible musical
applications can be foreseen inside or outside ·O M M· .

Appendix A

MSP, anatomy of the object

Source 1: main.c
/**
    @page chapter_msp_anatomy Anatomy of a MSP Object

    An MSP object that handles audio signals is a regular Max object with a few
    extras. Refer to the <a href="plussz~_8c-source.html">plussz~</a> example
    project source as we detail these additions. plussz~ is simply an object
    that adds 1 to a signal, identical in function to the regular MSP +~ object
    if you were to give it an argument of 1.

    Here is an enumeration of the basic tasks:

    1) additional header files

    After including ext.h and ext_obex.h, include z_dsp.h

    @code
    #include "z_dsp.h"
    @endcode

    2) C structure declaration

    The C structure declaration must begin with a #t_pxobject, not a #t_object:
    @code
    typedef struct _mydspobject
    {
        t_pxobject m_obj;
        // rest of the structure's fields
    } t_mydspobject;
    @endcode

    3) initialization routine

    When creating the class with class_new(), you must have a free function. If you
    have nothing special to do, use dsp_free(), which is defined for this
    purpose. If you write your own free function, the first thing it should do
    is call dsp_free(). This is essential to avoid crashes when freeing your
    object when audio processing is turned on.
    @code
    c = class_new("mydspobject", (method)mydspobject_new, (method)dsp_free,
        sizeof(t_mydspobject), NULL, 0);
    @endcode

    After creating your class with class_new(), you must call class_dspinit(),
    which will add some standard method handlers for internal messages used by
    all signal objects.
    @code
    class_dspinit(c);
    @endcode

    Your signal object needs a method that is bound to the symbol "dsp" -- we'll
    detail what this method does below, but the following line needs to be
    added while initializing the class:
    @code
    class_addmethod(c, (method)mydspobject_dsp, "dsp", A_CANT, 0);
    @endcode

    4) new instance routine

    The new instance routine must call dsp_setup(), passing a pointer to the newly
    allocated object pointer plus a number of signal inlets the object will
    have. If the object has no signal inlets, you may pass 0. The plusz~ object
    (as an example) has a single signal inlet:
    @code
    dsp_setup((t_pxobject *)x, 1);
    @endcode

    dsp_setup() will make the signal inlets (as proxies) so you need not make them
    yourself.

    If your object will have audio signal outputs, they need to be created in the
    new instance routine with outlet_new(). However, you will never access them
    directly, so you don't need to store pointers to them as you do with
    regular outlets. Here is an example of creating two signal outlets:
    @code
    outlet_new((t_object *)x, "signal");
    outlet_new((t_object *)x, "signal");
    @endcode

    5) The dsp method and perform routine

    The dsp method specifies the signal processing function your object defines
    along with its arguments. Your object's dsp method will be called when the
    MSP signal compiler is building a sequence of operations (known as the DSP
    Chain) that will be performed on each set of audio samples. The operation
    sequence consists of a pointers to functions (called perform routines)
    followed by arguments to those functions.

    The dsp method is declared as follows:
    @code
    void mydspobject_dsp(t_mydspobject *x, t_signal **sp, short *count);
    @endcode

    To add an entry to the DSP chain, your dsp method uses dsp_add(). The dsp
    method is passed an array of signals (#t_signal pointers), which contain
    pointers to the actual sample memory your object's perform routine will be
    using for input and output. The array of signals starts with the inputs
    (from left to right), followed by the outputs. For example, if your object
    has two inputs (because your new instance routine called dsp_setup(x, 2))
    and three outputs (because your new instance created three signal outlets),
    the signal array sp would contain five items as follows:
    @code
    sp[0]    // left input
    sp[1]    // right input
    sp[2]    // left output
    sp[3]    // middle output
    sp[4]    // right output
    @endcode

    The #t_signal data structure (defined in z_dsp.h), contains two important
    elements: the s_n field, which is the size of the signal vector, and s_vec,
    which is a pointer to an array of 32-bit floats containing the signal data.
    All t_signals your object will receive have the same size. This size is
    not necessarily the same as the global MSP signal vector size, because your
    object might be inside a patcher within a poly~ object that defines its
    own size. Therefore it is important to use the s_n field of a signal passed
    to your object's dsp method.

    You can use a variety of strategies to pass arguments to your perform routine
    via dsp_add(). For simple unit generators that don't store any internal
    state between computing vectors, it is sufficient to pass the inputs,
    outputs, and vector size. For objects that need to store internal state
    between computing vectors such as filters or ramp generators, you will pass
    a pointer to your object, whose data structure should contain space to
    store this state. The plus1~ object does not need to store internal state.
    It passes the input, output, and vector size to its perform routine. The
    plus1~ dsp method is shown below:

    @code
    void plus1_dsp(t_plus1 *x, t_signal **sp, short count)
    {
        dsp_add(plus1_perform, 3, sp[0]->s_vec, sp[1]->s_vec, sp[0]->s_n);
    }
    @endcode

    The first argument to dsp_add() is your perform routine, followed by the number
    of additional arguments you wish to copy to the DSP chain, and then the
    arguments.

    The perform routine is not a "method" in the traditional sense. It will be
    called within the callback of an audio driver, which, unless the user is
    employing the Non-Real Time audio driver, will typically be in a high-
    priority thread. Thread protection inside the perform routine is minimal.
    You can use a clock, but you cannot use qelems or outlets. The design of
    the perform routine is somewhat unlike other Max methods. It receives a
    pointer to a piece of the DSP chain and it is expected to return the
    location of the next perform routine on the chain. The next location is
    determined by the number of arguments you specified for your perform
    routine with your call to dsp_add(). For example, if you will pass three
    arguments, you need to return w + 4.

    Here is the plus1 perform routine:

    @code
    t_int *plus1_perform(t_int *w)
    {
        t_float *in, *out;
        int n;

        in = (t_float *)w[1];   // get input signal vector
        out = (t_float *)w[2];  // get output signal vector
        n = (int)w[3];          // vector size

        while (n--)             // perform calculation on all samples
            *out++ = *in++ + 1.;

        return w + 4;           // must return next DSP chain location
    }
    @endcode

    6) Free function

    The free function for the class must either be dsp_free() or it must be written
    to call dsp_free() as shown in the example below:

    @code
    void mydspobject_free(t_mydspobject *x)
    {
        dsp_free((t_pxobject *)x);

        // can do other stuff here
    }
    @endcode

Appendix B

bonk∼ source code

No substantial modification has been applied to the original bonk∼ code. Previous
modifications to the original were made by Barry Threw for the latest version of
bonk∼ , which is the one we used. Bonk3 can be found in M. Puckette's repository (on
request) or in other unspecified locations on the web; the copy we used was found on
Barry Threw's website, but it is no longer available there.

B.1 The bonk∼ Method

Source 2: main.c
/*
###########################################################################
# bonk~ - a pd and Max/MSP external
# by miller puckette and ted appel
# http://crca.ucsd.edu/~msp/
# Max/MSP port by barry threw (me@barrythrew.com)
# http://www.barrythrew.com
# San Francisco, CA 2008
# for Kesumo - http://www.kesumo.com
# Max 5 optimized version for loud percussive sounds, by Zengi. BETA version
# Turin, June 2009
###########################################################################
// bonk~ detects attacks in an audio signal
###########################################################################
This software is copyrighted by Miller Puckette and others. The following
terms (the "Standard Improved BSD License") apply to all files associated with
the software unless explicitly disclaimed in individual files:

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

1. Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above
   copyright notice, this list of conditions and the following
   disclaimer in the documentation and/or other materials provided
   with the distribution.
3. The name of the author may not be used to endorse or promote
   products derived from this software without specific prior
   written permission.

THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR
BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
THE POSSIBILITY OF SUCH DAMAGE.
*/

/*
dolist:
    decay and other times in msec   // still to do
*/
#include <math.h>
#include <stdio.h>
#include <string.h>

//#ifdef NT
//#pragma warning (disable: 4305 4244)
//#endif

#include "ext.h"
#include "z_dsp.h"
#include "math.h"
#include "ext_support.h"
#include "ext_proto.h"
#include "ext_obex.h"

typedef double t_floatarg;      /* from m_pd.h */

#define flog log
#define fexp exp
#define fsqrt sqrt
#define t_resizebytes(a, b, c) t_resizebytes((char *)(a), (b), (c))

void *bonk_class;
#define getbytes t_getbytes
#define freebytes t_freebytes

//BONK ATTRIBUTE SETTINGS, YOU CAN OVERRIDE THEM IN MAX PATCH

#define DEFNPOINTS 1024
#define MAXCHANNELS 2           /* 8 */
#define MINPOINTS 64
#define DEFPERIOD 256           /* 128 */
#define DEFNFILTERS 15
#define DEFHALFTONES 6
#define DEFOVERLAP 1
#define DEFFIRSTBIN 2           /* modified, was 1 */
#define DEFHITHRESH 10          /* 5 */
#define DEFLOTHRESH 5           /* 2.5 */
#define DEFMASKTIME 4
#define DEFMASKDECAY 0.7
#define DEFDEBOUNCEDECAY 0
#define DEFMINVEL 7
#define DEFATTACKBINS 1
#define MAXATTACKWAIT 4

//DATA STRUCTURES
typedef struct _filterkernel
{
    int k_filterpoints;
    int k_hoppoints;
    int k_skippoints;
    int k_nhops;
    float k_centerfreq;         /* center frequency, bins */
    float k_bandwidth;          /* bandwidth, bins */
    float *k_stuff;
} t_filterkernel;

// filterbank structure, implements a linked list of filters with struct _filterbank *b_next.
typedef struct _filterbank
{
    int b_nfilters;             /* number of filters in bank */
    int b_npoints;              /* input vector size */
    float b_halftones;          /* filter bandwidth in halftones */
    float b_overlap;            /* overlap; default 1 for 1/2-power pts */
    float b_firstbin;           /* freq of first filter in bins, default 1 */
    t_filterkernel *b_vec;      /* filter kernels */
    int b_refcount;             /* number of bonk~ objects using this */
    struct _filterbank *b_next; /* next in linked list */
} t_filterbank;

/* 1.3 review */
#define MAXNFILTERS 50
#define MASKHIST 8

static t_filterbank *bonk_filterbanklist;

typedef struct _hist
{
    float h_power;
    float h_before;
    float h_outpower;
    int h_countup;
    float h_mask[MASKHIST];
} t_hist;

typedef struct template
{
    float t_amp[MAXNFILTERS];
} t_template;

typedef struct _insig
{
    t_hist g_hist[MAXNFILTERS]; /* history for each filter */
    void *g_outlet;             /* outlet for raw data */
    float *g_inbuf;             /* buffered input samples */
    t_float *g_invec;           /* new input samples */
} t_insig;

typedef struct _bonk
{
    t_pxobject x_obj;
    void *obex;
    void *x_cookedout;
    void *x_clock;
    short x_vol;

    /* parameters */
    int x_npoints;              /* number of points in input buffer */
    int x_period;               /* number of input samples between analyses */
    int x_nfilters;             /* number of filters requested */
    float x_halftones;          /* nominal halftones between filters */
    float x_overlap;
    float x_firstbin;

    float x_hithresh;           /* threshold for total growth to trigger */
    float x_lothresh;           /* threshold for total growth to re-arm */
    float x_minvel;             /* minimum velocity we output */
    float x_maskdecay;
    int x_masktime;
    int x_useloudness;          /* use loudness spectra instead of power */
    float x_debouncedecay;
    float x_debouncevel;
    double x_learndebounce;     /* debounce time (in "learn" mode only) */
    int x_attackbins;           /* number of bins to wait for attack */

    t_filterbank *x_filterbank;
    t_hist x_hist[MAXNFILTERS];
    t_template *x_template;
    t_insig *x_insig;
    int x_ninsig;
    int x_ntemplate;
    int x_infill;
    int x_countdown;
    int x_willattack;
    int x_attacked;
    int x_debug;
    int x_learn;
    int x_learncount;           /* countup for "learn" mode */
    int x_spew;                 /* if true, always generate output! */
    int x_maskphase;            /* phase, 0 to MASKHIST-1, for mask history */
    float x_sr;                 /* current sample rate in Hz. */
    int x_hit;                  /* next "tick" called because of a hit, not a poll */
} t_bonk;

//PROTOTYPES
// prototypes for methods: need a method for each incoming message
static void *bonk_new(t_symbol *s, long ac, t_atom *av);
static void bonk_tick(t_bonk *x);
static void bonk_doit(t_bonk *x);
static t_int *bonk_perform(t_int *w);
static void bonk_dsp(t_bonk *x, t_signal **sp);
void bonk_assist(t_bonk *x, void *b, long m, long a, char *s);
static void bonk_free(t_bonk *x);
void bonk_setup(void);
void main();

// methods for threshold and other features
static void bonk_thresh(t_bonk *x, t_floatarg f1, t_floatarg f2);
static void bonk_print(t_bonk *x, t_floatarg f);
static void bonk_bang(t_bonk *x);

static void bonk_learn(t_bonk *x, int n);
static void bonk_forget(t_bonk *x);

// methods for reading and writing templates
static void bonk_write(t_bonk *x, t_symbol *s);
static void bonk_read(t_bonk *x, t_symbol *s);

// methods for attribute setters
void bonk_minvel_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_lothresh_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_hithresh_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_masktime_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_maskdecay_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_debouncedecay_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_debug_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_spew_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_useloudness_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_attackbins_set(t_bonk *x, void *attr, long ac, t_atom *av);

float qrsqrt(float f);
double clock_getsystime();
double clock_gettimesince(double prevsystime);
char *strcpy(char *s1, const char *s2);

// clock function
static void bonk_tick(t_bonk *x);

#define HALFWIDTH 0.75  /* half peak bandwidth at half power point in bins */

//CONSTANT Q FILTERBANK IMPLEMENTATION

static t_filterbank *bonk_newfilterbank(int npoints, int nfilters, float halftones,
    float overlap, float firstbin)
{
    int i, j;
    float cf, bw, h, relspace;
    t_filterbank *b = (t_filterbank *)getbytes(sizeof(*b));
    b->b_npoints = npoints;
    b->b_nfilters = nfilters;
    b->b_halftones = halftones;
    b->b_overlap = overlap;
    b->b_firstbin = firstbin;
    b->b_refcount = 0;
    b->b_next = bonk_filterbanklist;
    bonk_filterbanklist = b;
    b->b_vec = (t_filterkernel *)getbytes(nfilters * sizeof(*b->b_vec));

    // in a constant-Q filterbank, the spacing between filters is implemented this way
    h = exp((log(2.) / 12.) * halftones);   /* specced interval between filters */
    // h = halftones * 5;
    relspace = (h - 1) / (h + 1);           /* nominal spacing-per-f for filterbank */
    // relspace = h / 2;

    cf = firstbin;                          // first center freq of the filterbank
    // bandwidth
    bw = cf * relspace * overlap;
    if (bw < HALFWIDTH)
        bw = HALFWIDTH;

    // creates (i) filters, MAX(i) = 50.
    // stops creating filters when cf exceeds npoints/2, returns i.
    for (i = 0; i < nfilters; i++)
    {
        float *fp, newcf, newbw;
        float normalizer = 0;
        int filterpoints, skippoints, hoppoints, nhops;

        filterpoints = 0.5 + npoints * HALFWIDTH / bw;
        // filterpoints = 0.5 + npoints / bw;
        if (cf > npoints / 2)
        {
            post("bonk~: only using %d filters (ran past Nyquist)", i + 1);
            break;
        }
        if (filterpoints < 4)
        {
            post("bonk~: only using %d filters (kernels got too short)", i + 1);
            break;
        }
        else if (filterpoints > npoints)
            filterpoints = npoints;

        hoppoints = 0.5 + 0.5 * npoints * HALFWIDTH / bw;
        // hoppoints = 0.5 + 0.5 * npoints / bw;

        nhops = 1. + (npoints - filterpoints) / (float)hoppoints;
        skippoints = 0.5 * (npoints - filterpoints - (nhops - 1) * hoppoints);

        // fill the kernel of the filters in filterbank->filterkernel
        b->b_vec[i].k_stuff =
            (float *)getbytes(2 * sizeof(float) * filterpoints);
        b->b_vec[i].k_filterpoints = filterpoints;
        b->b_vec[i].k_nhops = nhops;
        b->b_vec[i].k_hoppoints = hoppoints;
        b->b_vec[i].k_skippoints = skippoints;
        b->b_vec[i].k_centerfreq = cf;
        b->b_vec[i].k_bandwidth = bw;

        //BANDPASS FILTER DESIGN: to be checked
        for (fp = b->b_vec[i].k_stuff, j = 0; j < filterpoints; j++, fp += 2)
        {
            float phase = j * cf * (2 * 3.14159 / npoints);
            float wphase = j * (2 * 3.14159 / filterpoints);
            float window = sin(0.5 * wphase);
            fp[0] = window * cos(phase);
            fp[1] = window * sin(phase);
            normalizer += window;
            // post("cos phase %.2f sin phase %.2f wphase %.2f window %.2f norm %.2f fp0 %.2f fp1 %.2f",
            //     cos(phase), sin(phase), wphase, window, normalizer, fp[0], fp[1]);
        }
        normalizer = 1 / (normalizer * nhops);
        for (fp = b->b_vec[i].k_stuff, j = 0;
            j < filterpoints; j++, fp += 2)
                fp[0] *= normalizer, fp[1] *= normalizer;

        post("i %d  cf %.2f  bw %.2f Q %.2f nhops %d, hop %d, skip %d, npoints %d, normalizer %.8f fp0 %.6f, fp1 %.6f",
            i, cf, bw, cf / bw, nhops, hoppoints, skippoints, filterpoints, normalizer, &fp[0], &fp[1]);

        newcf = (cf + bw / overlap) / (1 - relspace);
        newbw = newcf * overlap * relspace;
        if (newbw < HALFWIDTH)
        {
            newbw = HALFWIDTH;
            newcf = cf + 2 * HALFWIDTH / overlap;
        }
        cf = newcf;
        bw = newbw;
    }
    // sets to 0 the remaining filters, if less than 50 filters are used
    for (; i < nfilters; i++)
        b->b_vec[i].k_stuff = 0, b->b_vec[i].k_filterpoints = 0;
    return (b);
}

static void bonk_freefilterbank(t_filterbank *b)
{
    t_filterbank *b2, *b3;
    int i;
    if (bonk_filterbanklist == b)
        bonk_filterbanklist = b->b_next;
    else for (b2 = bonk_filterbanklist; b3 = b2->b_next; b2 = b3)
        if (b3 == b)
        {
            b2->b_next = b3->b_next;
            break;
        }
    for (i = 0; i < b->b_nfilters; i++)
        if (b->b_vec[i].k_stuff)
            freebytes(b->b_vec[i].k_stuff,
                b->b_vec[i].k_filterpoints * sizeof(float));
    freebytes(b, sizeof(*b));
}

361 s t a t i c v o i d bonk_donew ( t_bonk ∗ x , i n t n p o i n t s , i n t p e r i o d , i n t n s i g , i n t nfilters ,


float halftones , float overlap , float f i r s t b i n , float samplerate )
{
int i , j ;
t _ h i s t ∗h ;
f l o a t ∗ fp ;
366 t _ i n s i g ∗g ;
t_filterbank ∗ fb ;
f o r ( j = 0 , g = x−>x _ i n s i g ; j < n s i g ; j ++, g++)
{
f o r ( i = 0 , h = g−>g _ h i s t ; i −−; h++)
371 {
h−>h_power = h−>h _ b e f o r e = 0 , h−>h_countup = 0 ;
f o r ( j = 0 ; j < MASKHIST ; j ++)
h−>h_mask [ j ] = 0 ;
}
376 /∗ we o u g h t t o c h e c k f o r f a i l u r e t o a l l o c a t e memory h e r e ∗/
g−>g _ i n b u f = ( f l o a t ∗ ) g e t b y t e s ( n p o i n t s ∗ s i z e o f ( f l o a t ) ) ;
f o r ( i = n p o i n t s , f p = g−>g _ i n b u f ; i −−; f p++) ∗ f p = 0 ;
}
i f (! period ) period = npoints /2;
381 x−>x _ n p o i n t s = n p o i n t s ;
x−>x _ p e r i o d = p e r i o d ;
x−>x _ n i n s i g = n s i g ;
x−>x _ n f i l t e r s = n f i l t e r s ;
x−>x _ h a l f t o n e s = h a l f t o n e s ;
386 x−>x _ t e m p l a t e = ( t _ t e m p l a t e ∗ ) g e t b y t e s ( 0 ) ;
x−>x _ n t e m p l a t e = 0 ;
x−> x _ i n f i l l = 0 ;
x−>x_countdown = 0 ;
x−>x _ w i l l a t t a c k = 0 ;


391 x−>x _ a t t a c k e d = 0 ;
x−>x_maskphase = 0 ;
x−>x_debug = 0 ;
x−>x _ h i t h r e s h = DEFHITHRESH ;
x−>x _ l o t h r e s h = DEFLOTHRESH ;
396 x−>x_masktime = DEFMASKTIME ;
x−>x_maskdecay = DEFMASKDECAY ;
x−>x _ l e a r n = 0 ;
x−>x _ l e a r n d e b o u n c e = c l o c k _ g e t s y s t i m e ( ) ;
x−>x _ l e a r n c o u n t = 0 ;
401 x−>x _ d e b o u n c e d e c a y = DEFDEBOUNCEDECAY ;
x−>x _ m i n v e l = DEFMINVEL ;
x−>x _ u s e l o u d n e s s = 0 ;
x−>x _ d e b o u n c e v e l = 0 ;
x−>x _ a t t a c k b i n s = DEFATTACKBINS ;
406 x−>x_sr = s a m p l e r a t e ;
x−>x _ f i l t e r b a n k = 0 ;
x−>x _ h i t = 0 ;
f o r ( f b = b o n k _ f i l t e r b a n k l i s t ; f b ; f b = f b −>b_next )
i f ( f b −>b _ n f i l t e r s == x−>x _ n f i l t e r s &&
411 f b −>b _ h a l f t o n e s == x−>x _ h a l f t o n e s &&
f b −>b _ f i r s t b i n == f i r s t b i n &&
f b −>b _ o v e r l a p == o v e r l a p &&
f b −>b _ n p o i n t s == x−>x _ n p o i n t s )
{
416 f b −>b _ r e f c o u n t ++;
x−>x _ f i l t e r b a n k = f b ;
break ;
}
i f ( ! x−>x _ f i l t e r b a n k )
421 x−>x _ f i l t e r b a n k = b o n k _ n e w f i l t e r b a n k ( n p o i n t s , n f i l t e r s , h a l f t o n e s , o v e r l a p ,
f i r s t b i n ) , x−>x _ f i l t e r b a n k −>b _ r e f c o u n t ++;
}

s t a t i c v o i d b o n k _ t i c k ( t_bonk ∗ x )
{
426 t_atom a t [ MAXNFILTERS ] , ∗ ap , a t 2 [ 3 ] ;
int i , j , k , n ;
t _ h i s t ∗h ;
f l o a t ∗pp , v e l = 0 , t e m p e r a t u r e = 0 ;
f l o a t ∗ fp ;
431 t_template ∗ tp ;
i n t n f i t , n i n s i g = x−>x _ n i n s i g , n t e m p l a t e = x−>x_ntemplate , n f i l t e r s = x−>
x_nfilters ;
t _ i n s i g ∗ gp ;
#i f d e f _MSC_VER
f l o a t p o w e r o u t [ MAXNFILTERS∗MAXCHANNELS ] ;
436 #e l s e
f l o a t ∗ p o w e r o u t = a l l o c a ( x−>x _ n f i l t e r s ∗ x−>x _ n i n s i g ∗ s i z e o f ( ∗ p o w e r o u t ) ) ;
#e n d i f

f o r ( i = n i n s i g , pp = po we ro u t , gp = x−>x _ i n s i g ; i −−; gp++)


441 {
f o r ( j = 0 , h = gp−>g _ h i s t ; j < n f i l t e r s ; j ++, h++, pp++)
{
f l o a t power = h−>h_outpower ;
f l o a t i n t e n s i t y = ∗ pp = ( power > 0 ? 1 0 0 . ∗ q r s q r t ( q r s q r t ( power ) ) : 0 ) ;
446 v e l += i n t e n s i t y ;
t e m p e r a t u r e += i n t e n s i t y ∗ ( f l o a t ) j ;
// p o s t ( " power %.12 f i n t e n s i t y %.6 f " , power , i n t e n s i t y ) ;
}
}


451 i f ( v e l > 0 ) t e m p e r a t u r e /= v e l ;
else temperature = 0;
v e l ∗= 0 . 5 / n i n s i g ; /∗ f u d g e f a c t o r ∗/
i f ( x−>x _ h i t )
{
456 /∗ i f h i t n o n z e r o i t ’ s a c l o c k c a l l b a c k . i f i n " l e a r n " mode u p d a t e t h e
t e m p l a t e l i s t ; i n any e v e n t match t h e h i t t o known t e m p l a t e s . ∗/

i f ( v e l < x−>x _ d e b o u n c e v e l )
{
461 i f ( x−>x_debug )
p o s t ( " b o u n c e ␣ c a n c e l l e d : ␣ v e l ␣%f ␣ d e b o u n c e ␣%f " ,
v e l , x−>x _ d e b o u n c e v e l ) ;
return ;
}
466 i f ( v e l < x−>x _ m i n v e l )
{
i f ( x−>x_debug )
p o s t ( " l o w ␣ v e l o c i t y ␣ c a n c e l l e d : ␣ v e l ␣%f , ␣ m i n v e l ␣%f " ,
v e l , x−>x _ m i n v e l ) ;
471 return ;
}
x−>x _ d e b o u n c e v e l = v e l ;
i f ( x−>x _ l e a r n )
{
476 d o u b l e l a s t t i m e = x−>x _ l e a r n d e b o u n c e ;
d o u b l e msec = c l o c k _ g e t t i m e s i n c e ( l a s t t i m e ) ;
i f ( ( ! n t e m p l a t e ) | | ( msec > 2 0 0 ) )
{
i n t c o u n t u p = x−>x _ l e a r n c o u n t ;
481 /∗ n o r m a l i z e t o 100 ∗/
f l o a t norm ;
f o r ( i = n f i l t e r s ∗ n i n s i g , norm = 0 , pp = p o w e r o u t ; i −−; pp++)
norm += ∗ pp ∗ ∗ pp ;
i f ( norm < 1 . 0 e −15) norm = 1 . 0 e −15;
486 norm = 1 0 0 . f ∗ q r s q r t ( norm ) ;
/∗ c h e c k i f t h i s i s t h e f i r s t s t r i k e f o r a new t e m p l a t e ∗/
i f ( ! countup )
{
int oldn = ntemplate ;
491 x−>x _ n t e m p l a t e = n t e m p l a t e = o l d n + n i n s i g ;
x−>x _ t e m p l a t e = ( t _ t e m p l a t e ∗ ) t _ r e s i z e b y t e s ( x−>x_te mpla te , o l d n
∗ s i z e o f ( x−>x _ t e m p l a t e [ 0 ] ) , n t e m p l a t e ∗ s i z e o f ( x−>
x_template [ 0 ] ) ) ;
f o r ( i = n i n s i g , pp = p o w e r o u t ; i −−; o l d n++)
f o r ( j = n f i l t e r s , f p = x−>x _ t e m p l a t e [ o l d n ] . t_amp ; j −−;
pp++, f p++)
496 ∗ f p = ∗ pp ∗ norm ;
}
else
{
int oldn = ntemplate − n i n s i g ;
501 i f ( o l d n < 0 ) p o s t ( " b o n k _ t i c k ␣ bug " ) ;
f o r ( i = n i n s i g , pp = p o w e r o u t ; i −−; o l d n++)
{
f o r ( j = n f i l t e r s , f p = x−>x _ t e m p l a t e [ o l d n ] . t_amp ; j −−;
pp++, f p++)
506 ∗ f p = ( c o u n t u p ∗ ∗ f p + ∗ pp ∗ norm )
/( countup + 1.0 f ) ;
}
}
c o u n t u p ++;


511 i f ( c o u n t u p == x−>x _ l e a r n ) c o u n t u p = 0 ;
x−>x _ l e a r n c o u n t = c o u n t u p ;
}
else return ;
}
516 x−>x _ l e a r n d e b o u n c e = c l o c k _ g e t s y s t i m e ( ) ;
i f ( ntemplate )
{
f l o a t b e s t f i t = −1e30 ;
int templatecount ;
521 n f i t = −1;
f o r ( i = 0 , t e m p l a t e c o u n t = 0 , t p = x−>x _ t e m p l a t e ;
t e m p l a t e c o u n t < n t e m p l a t e ; i ++)
{
f l o a t dotprod = 0;
526 f o r ( k = 0 , pp = p o w e r o u t ;
k < n i n s i g && t e m p l a t e c o u n t < n t e m p l a t e ;
k++, t p ++, t e m p l a t e c o u n t ++)
{
f o r ( j = n f i l t e r s , f p = tp−>t_amp ;
531 j −−; f p ++, pp++)
{
i f ( ∗ f p < 0 | | ∗ pp < 0 ) p o s t ( " b o n k _ t i c k ␣ bug ␣ 2 " ) ;
d o t p r o d += ∗ f p ∗ ∗ pp ;
}
536 }
i f ( dotprod > b e s t f i t )
{
b e s t f i t = dotprod ;
nfit = i ;
541 }
}
i f ( n f i t < 0 ) p o s t ( " b o n k _ t i c k ␣ bug " ) ;
}
else n f i t = 0;
546 }
else n f i t = −1; /∗ h i t i s zero ; t h i s i s t h e " bang " method . ∗/

x−>x _ a t t a c k e d = 1 ;
i f ( x−>x_debug )
551 p o s t ( " bonk ␣ o u t : ␣ number ␣%d , ␣ v e l ␣%f , ␣ t e m p e r a t u r e ␣%f " , n f i t , v e l , t e m p e r a t u r e )
;

SETFLOAT( at2 , n f i t ) ;
SETFLOAT( a t 2 +1 , v e l ) ;
SETFLOAT( a t 2 +2 , t e m p e r a t u r e ) ;
556 o u t l e t _ l i s t ( x−>x_cookedout , 0 , 3 , a t 2 ) ;

f o r ( n = 0 , gp = x−>x _ i n s i g + ( n i n s i g −1) ,
pp = p o w e r o u t + n f i l t e r s ∗ ( n i n s i g −1) ; n < n i n s i g ;
n++, gp−−, pp −= n f i l t e r s )
561 {
f l o a t ∗ pp2 ;
f o r ( i = 0 , ap = at , pp2 = pp ; i < n f i l t e r s ;
i ++, ap++, pp2++)
{
566 ap−>a_type = A_FLOAT ;
ap−>a_w . w _ f l o a t = ∗ pp2 ;
}
o u t l e t _ l i s t ( gp−>g _ o u t l e t , 0 , n f i l t e r s , a t ) ;
}
571 }


// report the attack
static void bonk_doit(t_bonk *x)
{
    int i, j, ch, n;
    t_filterkernel *k;
    t_hist *h;
    float growth = 0, *fp1, *fp3, *fp4, hithresh, lothresh;
    int ninsig = x->x_ninsig, nfilters = x->x_nfilters,
        maskphase = x->x_maskphase, nextphase, oldmaskphase;
    t_insig *gp;
    nextphase = maskphase + 1;
    if (nextphase >= MASKHIST)
        nextphase = 0;
    x->x_maskphase = nextphase;
    oldmaskphase = nextphase - x->x_attackbins;
    if (oldmaskphase < 0)
        oldmaskphase += MASKHIST;
    if (x->x_useloudness)
        hithresh = qrsqrt(qrsqrt(x->x_hithresh)),
            lothresh = qrsqrt(qrsqrt(x->x_lothresh));
    else hithresh = x->x_hithresh, lothresh = x->x_lothresh;
    for (ch = 0, gp = x->x_insig; ch < ninsig; ch++, gp++)
    {
        for (i = 0, k = x->x_filterbank->b_vec, h = gp->g_hist;
            i < nfilters; i++, k++, h++)
        {
            float power = 0, maskpow = h->h_mask[maskphase];
            float *inbuf = gp->g_inbuf + k->k_skippoints;
            int countup = h->h_countup;
            int filterpoints = k->k_filterpoints;
            /* if the user asked for more filters than fit under the
               Nyquist frequency, some filters won't actually be filled in
               so we skip running them. */
            if (!filterpoints)
            {
                h->h_countup = 0;
                h->h_mask[nextphase] = 0;
                h->h_power = 0;
                continue;
            }
            // for each filter:
            /* run the filter repeatedly, sliding it forward by hoppoints,
               for nhop times */
            for (fp1 = inbuf, n = 0; n < k->k_nhops; fp1 += k->k_hoppoints, n++)
            {
                float rsum = 0, isum = 0;
                for (fp3 = fp1, fp4 = k->k_stuff, j = filterpoints; j--;)
                {
                    //////////////////////////////////////////
                    // calculating the power for each filter /
                    // g = the input buffer                  /
                    // fp4 = fp[0] and fp[1]                 /
                    //////////////////////////////////////////
                    float g = *fp3++;
                    rsum += g * *fp4++;
                    isum += g * *fp4++;
                }
                power += rsum * rsum + isum * isum;
                // post("power %.12f", power); // check whether the power values posted to the Max window can be decimated
            }

            if (!x->x_willattack)
                h->h_before = maskpow;

            if (power > h->h_mask[oldmaskphase])
            {
                if (x->x_useloudness)
                    growth += qrsqrt(qrsqrt(power / (h->h_mask[oldmaskphase] + 1.0e-15))) - 1.f;
                else growth += power / (h->h_mask[oldmaskphase] + 1.0e-15) - 1.f;
                // post("power %.12f h->h_mask[oldmaskphase] %.12f growth %.12f",
                //     power, h->h_mask[oldmaskphase], growth);
            }
            if (!x->x_willattack && countup >= x->x_masktime)
                maskpow *= x->x_maskdecay;

            if (power > maskpow)
            {
                maskpow = power;
                countup = 0;
            }
            countup++;
            h->h_countup = countup;
            h->h_mask[nextphase] = maskpow;
            h->h_power = power;
        }
    }
    if (x->x_willattack)    // an attack is reported
    {
        // however we won't actually report the attack until the spectrum stops
        // growing. growth must decrease below lothresh.
        if (x->x_willattack > MAXATTACKWAIT || growth < x->x_lothresh)
        {
            /* if haven't yet, and if not in spew mode, report a hit */
            if (!x->x_spew && !x->x_attacked)
            {
                for (ch = 0, gp = x->x_insig; ch < ninsig; ch++, gp++)
                    for (i = nfilters, h = gp->g_hist; i--; h++)
                        h->h_outpower = h->h_mask[nextphase];
                x->x_hit = 1;
                // sets a clock to go off n milliseconds from the current logical
                // time with clock_delay(clock to schedule, n (ms))
                // Schedule the execution of a Clock
                clock_delay(x->x_clock, 0);
            }
        }
        if (growth < x->x_lothresh)
            x->x_willattack = 0;
        else x->x_willattack++;
    }
    else if (growth > x->x_hithresh)
    {
        if (x->x_debug) post("attack: growth = %f", growth);
        x->x_willattack = 1;
        x->x_attacked = 0;
        for (ch = 0, gp = x->x_insig; ch < ninsig; ch++, gp++)
            for (i = nfilters, h = gp->g_hist; i--; h++)
                h->h_mask[nextphase] = h->h_power, h->h_countup = 0;
    }

    // spew mode always outputs data for every performed analysis
    /* if in "spew" mode just always output */
    if (x->x_spew)
    {
        for (ch = 0, gp = x->x_insig; ch < ninsig; ch++, gp++)
            for (i = nfilters, h = gp->g_hist; i--; h++)
                h->h_outpower = h->h_power;
        x->x_hit = 0;
        clock_delay(x->x_clock, 0);
    }
    x->x_debouncevel *= x->x_debouncedecay;
}
701
// 4//PERFORM ROUTINE
// I t r e c e i v e s a p o i n t e r t o a p i e c e o f t h e DSP c h a i n and i t i s e x p e c t e d t o r e t u r n
t h e l o c a t i o n o f t h e n e x t p e r f o r m r o u t i n e on t h e c h a i n .
// The n e x t l o c a t i o n i s d e t e r m i n e d by t h e number o f a r g u m e n t s s p e c i f i e d f o r t h e
p e r f o r m r o u t i n e w i t h t h e c a l l t o dsp_add ( ) .
// F o r example , i f we p a s s t h r e e a r g u m e n t s , we n e e d t o r e t u r n w + 4 .
706 s t a t i c t _ i n t ∗ bonk_perform ( t _ i n t ∗w)
{
t_bonk ∗ x = ( t_bonk ∗ ) (w [ 1 ] ) ;
i n t n = ( i n t ) (w [ 2 ] ) ; // v e c t o r s i z e
int onset = 0;
711 i f ( x−>x_countdown >= n )
x−>x_countdown −= n ;
else
{
i n t i , j , n i n s i g = x−>x _ n i n s i g ;
716 t _ i n s i g ∗ gp ;
i f ( x−>x_countdown > 0 )
{
n −= x−>x_countdown ;
o n s e t += x−>x_countdown ;
721 x−>x_countdown = 0 ;
}
while ( n > 0)
{
i n t i n f i l l = x−> x _ i n f i l l ;
726 i n t m = ( n < ( x−>x _ n p o i n t s − i n f i l l ) ?
n : ( x−>x _ n p o i n t s − i n f i l l ) ) ;
f o r ( i = 0 , gp = x−>x _ i n s i g ; i < n i n s i g ; i ++, gp++)
{
f l o a t ∗ f p = gp−>g _ i n b u f + i n f i l l ;
731 t _ f l o a t ∗ i n 1 = gp−>g _ i n v e c + o n s e t ;
f o r ( j = 0 ; j < m; j ++)
∗ f p++ = ∗ i n 1 ++;
}
i n f i l l += m;
736 x−> x _ i n f i l l = i n f i l l ;
// when i n p u t i s f i l l e d w i t h n p o i n t s a m p l e s , b o n k _ d o i t !
i f ( i n f i l l == x−>x _ n p o i n t s )
{
bonk_doit ( x ) ;
741
/∗ s h i f t o r c l e a r t h e i n p u t b u f f e r and u p d a t e c o u n t e r s ∗/
i f ( x−>x _ p e r i o d > x−>x _ n p o i n t s )
x−>x_countdown = x−>x _ p e r i o d − x−>x _ n p o i n t s ;
e l s e x−>x_countdown = 0 ;
746 i f ( x−>x _ p e r i o d < x−>x _ n p o i n t s )
{
i n t o v e r l a p = x−>x _ n p o i n t s − x−>x _ p e r i o d ;
f l o a t ∗ fp1 , ∗ fp2 ;
f o r ( n = 0 , gp = x−>x _ i n s i g ; n < n i n s i g ; n++, gp++)


751 f o r ( i = o v e r l a p , f p 1 = gp−>g_inbuf ,
f p 2 = f p 1 + x−>x _ p e r i o d ; i −−;)
∗ f p 1++ = ∗ f p 2 ++;
x−> x _ i n f i l l = o v e r l a p ;
}
756 e l s e x−> x _ i n f i l l = 0 ;
}
n −= m;
o n s e t += m;
}
761 }
r e t u r n (w+3) ;
}

// 3//DSP METHOD
766 // From MAX 5 API : The d s p method s p e c i f i e s t h e s i g n a l p r o c e s s i n g f u n c t i o n y o u r
o b j e c t d e f i n e s along with i t s arguments .
// The o b j e c t ’ s d s p method w i l l be c a l l e d w h e n e v e r t h e MSP s i g n a l c o m p i l e r i s
b u i l d i n g a s e q u e n c e o f o p e r a t i o n s ( known a s t h e DSP C h a i n ) t h a t w i l l be
p e r f o r m e d on e a c h s e t o f a u d i o s a m p l e s .
// The o p e r a t i o n s e q u e n c e c o n s i s t s o f a p o i n t e r s t o f u n c t i o n s ( c a l l e d p e r f o r m
r o u t i n e s ) f o l l o w e d by a r g u m e n t s t o t h o s e f u n c t i o n s .
s t a t i c v o i d bonk_dsp ( t_bonk ∗ x , t _ s i g n a l ∗∗ s p )
{
771 i n t i , n = s p [0]−>s_n , n i n s i g = x−>x _ n i n s i g ;
t _ i n s i g ∗ gp ;

x−>x_sr = s p [0]−> s _ s r ;

776 f o r ( i = 0 , gp = x−>x _ i n s i g ; i < n i n s i g ; i ++, gp++)


gp−>g _ i n v e c = ( ∗ ( s p++))−>s_vec ;

// a d d s y o u r o b j e c t ’ s p e r f o r m method t o t h e DSP c a l l c h a i n w i t h dsp_add ( o b j e c t ’ s


p e r f o r m r o u t i n e , #, . . . ) and s p e c i f i e s t h e a r g u m e n t s i t w i l l be p a s s e d .
// The p e r f o r m r o u t i n e i s u s e d f o r p r o c e s s i n g a u d i o .
781 //#=The number o f a r g u m e n t s t h a t w i l l f o l l o w
// . . . t h e a r g u m e n t s
dsp_add ( bonk_perform , 2 , x , n ) ;
}

786 s t a t i c v o i d b o n k _ t h r e s h ( t_bonk ∗ x , t _ f l o a t a r g f 1 , t _ f l o a t a r g f 2 )
{
i f ( f1 > f2 )
p o s t ( " bonk : ␣ w a r n i n g : ␣ l o w ␣ t h r e s h o l d ␣ g r e a t e r ␣ t h a n ␣ h i ␣ t h r e s h o l d " ) ;
x−>x _ l o t h r e s h = ( f 1 <= 0 ? 0 . 0 0 0 1 : f 1 ) ;
791 x−>x _ h i t h r e s h = ( f 2 <= 0 ? 0 . 0 0 0 1 : f 2 ) ;
}

s t a t i c v o i d b o n k _ p r i n t ( t_bonk ∗ x , t _ f l o a t a r g f )
{
796 int i ;
p o s t ( " t h r e s h ␣%f ␣%f " , x−>x _ l o t h r e s h , x−>x _ h i t h r e s h ) ;
p o s t ( " mask ␣%d␣%f " , x−>x_masktime , x−>x_maskdecay ) ;
p o s t ( " a t t a c k −b i n s ␣%d" , x−>x _ a t t a c k b i n s ) ;
p o s t ( " d e b o u n c e ␣%f " , x−>x _ d e b o u n c e d e c a y ) ;
801 p o s t ( " m i n v e l ␣%f " , x−>x _ m i n v e l ) ;
p o s t ( " spew ␣%d" , x−>x_spew ) ;
p o s t ( " u s e l o u d n e s s ␣%d" , x−>x _ u s e l o u d n e s s ) ;

p o s t ( " number ␣ o f ␣ t e m p l a t e s ␣%d" , x−>x _ n t e m p l a t e ) ;


806 i f ( x−>x _ l e a r n ) p o s t ( " l e a r n ␣mode" ) ;
i f ( f != 0 )


{
i n t j , n i n s i g = x−>x _ n i n s i g ;
t _ i n s i g ∗ gp ;
811 f o r ( j = 0 , gp = x−>x _ i n s i g ; j < n i n s i g ; j ++, gp++)
{
t _ h i s t ∗h ;
i f ( n i n s i g > 1 ) p o s t ( " i n p u t ␣%d : " , j +1) ;
f o r ( i = x−>x _ n f i l t e r s , h = gp−>g _ h i s t ; i −−; h++)
816 p o s t ( "pow␣%f ␣ mask ␣%f ␣ b e f o r e ␣%f ␣ c o u n t ␣%d" ,
h−>h_power , h−>h_mask [ x−>x_maskphase ] ,
h−>h _ b e f o r e , h−>h_countup ) ;
}
p o s t ( " f i l t e r ␣ d e t a i l s ␣ ( f r e q u e n c i e s ␣ a r e ␣ i n ␣ u n i t s ␣ o f ␣ %.2 f −Hz . ␣ b i n s ) : " ,
821 x−>x_sr ) ;
f o r ( j = 0 ; j < x−>x _ n f i l t e r s ; j ++)
p o s t ( "%2d␣ ␣ c f ␣ %.2 f ␣ ␣bw␣ %.2 f ␣ ␣ n h o p s ␣%d␣ hop ␣%d␣ s k i p ␣%d␣ n p o i n t s ␣%d" ,
j ,
x−>x _ f i l t e r b a n k −>b_vec [ j ] . k _ c e n t e r f r e q ,
826 x−>x _ f i l t e r b a n k −>b_vec [ j ] . k_bandwidth ,
x−>x _ f i l t e r b a n k −>b_vec [ j ] . k_nhops ,
x−>x _ f i l t e r b a n k −>b_vec [ j ] . k _ h o p p o i n t s ,
x−>x _ f i l t e r b a n k −>b_vec [ j ] . k _ s k i p p o i n t s ,
x−>x _ f i l t e r b a n k −>b_vec [ j ] . k _ f i l t e r p o i n t s ) ;
831 }
i f ( x−>x_debug ) p o s t ( " debug ␣mode" ) ;
}

s t a t i c v o i d b o n k _ f o r g e t ( t_bonk ∗ x )
836 {
i n t n t e m p l a t e = x−>x_ntemplate , newn = n t e m p l a t e − x−>x _ n i n s i g ;
i f ( newn < 0 ) newn = 0 ;
x−>x _ t e m p l a t e = ( t _ t e m p l a t e ∗ ) t _ r e s i z e b y t e s ( x−>x_te mpla te ,
x−>x _ n t e m p l a t e ∗ s i z e o f ( x−>x _ t e m p l a t e [ 0 ] ) ,
841 newn ∗ s i z e o f ( x−>x _ t e m p l a t e [ 0 ] ) ) ;
x−>x _ n t e m p l a t e = newn ;
x−>x _ l e a r n c o u n t = 0 ;
}

846 s t a t i c v o i d bonk_bang ( t_bonk ∗ x )


{
i n t i , ch ;
x−>x _ h i t = 0 ;
t _ i n s i g ∗ gp ;
851 f o r ( ch = 0 , gp = x−>x _ i n s i g ; ch < x−>x _ n i n s i g ; ch++, gp++)
{
t _ h i s t ∗h ;
f o r ( i = 0 , h = gp−>g _ h i s t ; i < x−>x _ n f i l t e r s ; i ++, h++)
h−>h_outpower = h−>h_power ;
856 }
bonk_tick ( x ) ;
}

s t a t i c v o i d bonk_read ( t_bonk ∗ x , t_symbol ∗ s )


861 {
FILE ∗ f d = f o p e n ( s−>s_name , " r " ) ;
f l o a t v e c [ MAXNFILTERS ] ;
int i , ntemplate = 0 , remaining ;
f l o a t ∗ fp , ∗ fp2 ;
866 i f ( ! fd )
{
p o s t ( "%s : ␣ open ␣ f a i l e d " , s−>s_name ) ;
return ;


}
871 x−>x _ t e m p l a t e = ( t _ t e m p l a t e ∗ ) t _ r e s i z e b y t e s ( x−>x_te mpla te ,
x−>x _ n t e m p l a t e ∗ s i z e o f ( t _ t e m p l a t e ) , 0 ) ;
while (1)
{
f o r ( i = x−>x _ n f i l t e r s , f p = v e c ; i −−; f p++)
876 i f ( f s c a n f ( f d , "%f " , f p ) < 1 ) goto nomore ;
x−>x _ t e m p l a t e = ( t _ t e m p l a t e ∗ ) t _ r e s i z e b y t e s ( x−>x_te mpla te ,
ntemplate ∗ s i z e o f ( t_template ) ,
( ntemplate + 1) ∗ s i z e o f ( t_template ) ) ;
f o r ( i = x−>x _ n f i l t e r s , f p = vec ,
881 f p 2 = x−>x _ t e m p l a t e [ n t e m p l a t e ] . t_amp ; i −−;)
∗ f p 2++ = ∗ f p ++;
n t e m p l a t e ++;
}
nomore :
886 i f ( r e m a i n i n g = ( n t e m p l a t e % x−>x _ n i n s i g ) )
{
p o s t ( " bonk_read : ␣%d␣ t e m p l a t e s ␣ n o t ␣ a ␣ m u l t i p l e ␣ o f ␣%d ; ␣ d r o p p i n g ␣ e x t r a s " ) ;
x−>x _ t e m p l a t e = ( t _ t e m p l a t e ∗ ) t _ r e s i z e b y t e s ( x−>x_te mpla te ,
ntemplate ∗ s i z e o f ( t_template ) ,
891 ( ntemplate − remaining ) ∗ s i z e o f ( t_template ) ) ;
ntemplate = ntemplate − remaining ;
}
p o s t ( " bonk : ␣ r e a d ␣%d␣ t e m p l a t e s \n" , n t e m p l a t e ) ;
x−>x _ n t e m p l a t e = n t e m p l a t e ;
896 f c l o s e ( fd ) ;
}

s t a t i c v o i d b o n k _ w r i t e ( t_bonk ∗ x , t_symbol ∗ s )
{
901 FILE ∗ f d = f o p e n ( s−>s_name , "w" ) ;
i n t i , n t e m p l a t e = x−>x _ n t e m p l a t e ;
t _ t e m p l a t e ∗ t p = x−>x _ t e m p l a t e ;
f l o a t ∗ fp ;
i f ( ! fd )
906 {
p o s t ( "%s : ␣ c o u l d n ’ t ␣ c r e a t e " , s−>s_name ) ;
return ;
}
f o r ( ; n t e m p l a t e −−; t p++)
911 {
f o r ( i = x−>x _ n f i l t e r s , f p = tp−>t_amp ; i −−; f p++)
f p r i n t f ( f d , " %6.2 f ␣ " , ∗ f p ) ;
f p r i n t f ( f d , " \n" ) ;
}
916 p o s t ( " bonk : ␣ w r o t e ␣%d␣ t e m p l a t e s \n" , x−>x _ n t e m p l a t e ) ;
f c l o s e ( fd ) ;
}

// f r e e f u n t i o n
921 s t a t i c v o i d b o n k _ f r e e ( t_bonk ∗ x )
{

i n t i , n i n s i g = x−>x _ n i n s i g ;
t _ i n s i g ∗ gp = x−>x _ i n s i g ;
926 #i f d e f MSP
dsp_free ( ( t_pxobject ∗) x ) ;
#e n d i f
f o r ( i = 0 , gp = x−>x _ i n s i g ; i < n i n s i g ; i ++, gp++)
f r e e b y t e s ( gp−>g_inbuf , x−>x _ n p o i n t s ∗ s i z e o f ( f l o a t ) ) ;
931 c l o c k _ f r e e ( x−>x _ c l o c k ) ;


i f (!−−(x−>x _ f i l t e r b a n k −>b _ r e f c o u n t ) )
b o n k _ f r e e f i l t e r b a n k ( x−>x _ f i l t e r b a n k ) ;
}

936 // 1// INITILIZATION ROUTINE


v o i d main ( )
{
t_class ∗c ;
t_object ∗ a t t r ;
941 long a t t r f l a g s = 0 ;
t_symbol ∗ sym_long = gensym ( " l o n g " ) , ∗ s y m _ f l o a t 3 2 = gensym ( " f l o a t 3 2 " ) ;

//NEW INSTANCE ROUTINE


// c r e a t e s new i n s t a n c e o f t h e c l a s s w i t h c l a s s _ n e w ( name , mnew , m f r e e , s i z e ( i n
b y t e s ) o f t h e d a t a s t r u c t u r e , mmenu , t y p e ( most o f t e n A_GIMME, 0 ) , 0 . )
946 //mnew=The i n s t a n c e c r e a t i o n f u n c t i o n
// m f r e e=The i n s t a n c e f r e e f u n c t i o n
//mmenu=The f u n c t i o n c a l l e d when t h e u s e r c r e a t e s a new o b j e c t o f t h e c l a s s
from t h e Patch window ’ s p a l e t t e ( UI o b j e c t s o n l y ) , 0L i f you ’ r e n o t
d e f i n i n g a UI o b j e c t ,
// t y p e=A s t a n d a r d Max t y p e . The f i n a l ar g u m e nt o f t h e t y p e l i s t s h o u l d be a 0 .
G e n e r a l l y , o b e x o b j e c t s h a v e a s i n g l e t y p e argument , A_GIMME, f o l l o w e d by a
0.

951 c = c l a s s _ n e w ( " bonk3~" , ( method ) bonk_new , ( method ) b o n k _ f r e e , s i z e o f ( t_bonk ) , (


method ) 0L , A_GIMME, 0 ) ;

c l a s s _ o b e x o f f s e t _ s e t ( c , c a l c o f f s e t ( t_bonk , o b e x ) ) ; // c a l c o f f s e t c a l c u l a t e s
b y t e − o f f s e t from t h e b e g i n n i n g o f bonk s t r u c t u r e . The v a l u e i s s t o r e i n
o b e x f i e l d o f same s t r u c t u r e .

//NEW ATTRIBUTES
956 // c r e a t e s ( new ) a t t r i b u t e w i t h a t t r _ o f f s e t _ n e w ( name , t y p e , a t t r i b u t e i s f o r
s e t t i n g / q u e r y f l a g , method (NULL i s d e f a u l t method ) g e t , method s e t , b y t e −
offset ) .
// a d d s a t t r i b u t e t o t h e o b j e c t o f t h e c l a s s . w i t h c l a s s _ a d d a t r ( )
a t t r = a t t r _ o f f s e t _ n e w ( " n p o i n t s " , sym_long , a t t r f l a g s , ( method ) 0L , ( method ) 0L ,
c a l c o f f s e t ( t_bonk , x _ n p o i n t s ) ) ;
class_addattr (c , attr ) ;

961 a t t r = a t t r _ o f f s e t _ n e w ( " hop " , sym_long , a t t r f l a g s , ( method ) 0L , ( method ) 0L ,


c a l c o f f s e t ( t_bonk , x _ p e r i o d ) ) ;
class_addattr (c , attr ) ;

a t t r = a t t r _ o f f s e t _ n e w ( " n f i l t e r s " , sym_long , a t t r f l a g s , ( method ) 0L , ( method ) 0L ,


c a l c o f f s e t ( t_bonk , x _ n f i l t e r s ) ) ;
class_addattr (c , attr ) ;
966
a t t r = a t t r _ o f f s e t _ n e w ( " h a l f t o n e s " , s y m _ f l o a t 3 2 , a t t r f l a g s , ( method ) 0L , ( method
) 0L , c a l c o f f s e t ( t_bonk , x _ h a l f t o n e s ) ) ;
class_addattr (c , attr ) ;

a t t r = a t t r _ o f f s e t _ n e w ( " o v e r l a p " , s y m _ f l o a t 3 2 , a t t r f l a g s , ( method ) 0L , ( method ) 0


L , c a l c o f f s e t ( t_bonk , x _ o v e r l a p ) ) ;
971 class_addattr (c , attr ) ;

a t t r = a t t r _ o f f s e t _ n e w ( " f i r s t b i n " , s y m _ f l o a t 3 2 , a t t r f l a g s , ( method ) 0L , ( method )


0L , c a l c o f f s e t ( t_bonk , x _ f i r s t b i n ) ) ;
class_addattr (c , attr ) ;

976 a t t r = a t t r _ o f f s e t _ n e w ( " m i n v e l " , s y m _ f l o a t 3 2 , a t t r f l a g s , ( method ) 0L , ( method )


bonk_minvel_set , c a l c o f f s e t ( t_bonk , x _ m i n v e l ) ) ;


class_addattr (c , attr ) ;

    attr = attr_offset_new("lothresh", sym_float32, attrflags, (method)0L,
        (method)bonk_lothresh_set, calcoffset(t_bonk, x_lothresh));
    class_addattr(c, attr);

    attr = attr_offset_new("hithresh", sym_float32, attrflags, (method)0L,
        (method)bonk_hithresh_set, calcoffset(t_bonk, x_hithresh));
    class_addattr(c, attr);

    attr = attr_offset_new("masktime", sym_long, attrflags, (method)0L,
        (method)bonk_masktime_set, calcoffset(t_bonk, x_masktime));
    class_addattr(c, attr);

    attr = attr_offset_new("maskdecay", sym_float32, attrflags, (method)0L,
        (method)bonk_maskdecay_set, calcoffset(t_bonk, x_maskdecay));
    class_addattr(c, attr);

    attr = attr_offset_new("debouncedecay", sym_float32, attrflags, (method)0L,
        (method)bonk_debouncedecay_set, calcoffset(t_bonk, x_debouncedecay));
    class_addattr(c, attr);

    attr = attr_offset_new("debug", sym_long, attrflags, (method)0L,
        (method)bonk_debug_set, calcoffset(t_bonk, x_debug));
    class_addattr(c, attr);

    attr = attr_offset_new("spew", sym_long, attrflags, (method)0L,
        (method)bonk_spew_set, calcoffset(t_bonk, x_spew));
    class_addattr(c, attr);

    attr = attr_offset_new("useloudness", sym_long, attrflags, (method)0L,
        (method)bonk_useloudness_set, calcoffset(t_bonk, x_useloudness));
    class_addattr(c, attr);

    attr = attr_offset_new("attackbins", sym_long, attrflags, (method)0L,
        (method)bonk_attackbins_set, calcoffset(t_bonk, x_attackbins));
    class_addattr(c, attr);

    //METHODS!
    // adds a method to objects of the class with class_addmethod(class pointer, m, name, type, 0)
    // m = function that gets called when the method is invoked
    class_addmethod(c, (method)bonk_dsp, "dsp", A_CANT, 0);
    class_addmethod(c, (method)bonk_bang, "bang", A_CANT, 0);
    class_addmethod(c, (method)bonk_forget, "forget", 0);
    class_addmethod(c, (method)bonk_learn, "learn", A_LONG, 0);
    class_addmethod(c, (method)bonk_thresh, "thresh", A_FLOAT, A_FLOAT, 0);
    class_addmethod(c, (method)bonk_print, "print", A_DEFFLOAT, 0);
    class_addmethod(c, (method)bonk_read, "read", A_DEFSYM, 0);
    class_addmethod(c, (method)bonk_write, "write", A_DEFSYM, 0);
    class_addmethod(c, (method)bonk_assist, "assist", A_CANT, 0);

    // adds the special obex methods
    class_addmethod(c, (method)object_obex_dumpout, "dumpout", A_CANT, 0);
    class_addmethod(c, (method)object_obex_quickref, "quickref", A_CANT, 0);

    // adds standard method handlers for internal messages used by all MSP
    // objects with class_dspinit(class pointer)
    class_dspinit(c);
    // registers a previously defined object class with class_register(name_space, class pointer).
    // This function is required, and should be called at the end of main().
    // name_space = the desired class's name space: typically CLASS_BOX for obex
    // classes, or CLASS_NOBOX for classes which will only be used internally
    class_register(CLASS_BOX, c);

    bonk_class = c;

    post("\n");
    post("BonkOMM~ v 1.0 - detects attacks in audio signals");
    post("Zengi revision");
    post("Original by Miller Puckette and Ted Appel, http://crca.ucsd.edu/~msp/");
    post("\n");
}

// 2// NEW INSTANCE ROUTINE
static void *bonk_new(t_symbol *s, long ac, t_atom *av)
{
    short j;
    t_bonk *x;

    //CREATE INSTANCE
    // creates an instance of the object class by allocating memory with
    // object_alloc(class pointer): void *object_alloc(t_class *c);
    // Its use is required with obex-class objects, inside the object's new instance routine.
    if (x = (t_bonk *)object_alloc(bonk_class)) {

        t_insig *g;

        x->x_npoints = DEFNPOINTS;
        x->x_period = DEFPERIOD;
        x->x_nfilters = DEFNFILTERS;
        x->x_halftones = DEFHALFTONES;
        x->x_firstbin = DEFFIRSTBIN;
        x->x_overlap = DEFOVERLAP;
        x->x_ninsig = 1;

        x->x_hithresh = DEFHITHRESH;
        x->x_lothresh = DEFLOTHRESH;
        x->x_masktime = DEFMASKTIME;
        x->x_maskdecay = DEFMASKDECAY;
        x->x_debouncedecay = DEFDEBOUNCEDECAY;
        x->x_minvel = DEFMINVEL;
        x->x_attackbins = DEFATTACKBINS;

        if (!x->x_period) x->x_period = x->x_npoints/2;
        x->x_template = (t_template *)getbytes(0);
        x->x_ntemplate = 0;
        x->x_infill = 0;
        x->x_countdown = 0;
        x->x_willattack = 0;
        x->x_attacked = 0;
        x->x_maskphase = 0;
        x->x_debug = 0;
        x->x_learn = 0;
        x->x_learndebounce = clock_getsystime();
        x->x_learncount = 0;
        x->x_useloudness = 0;
        x->x_debouncevel = 0;
        x->x_sr = sys_getsr();    /* gets the sample rate */

        /* something useful for debug
        if (ac) {
            switch (av[0].a_type) {
                case A_LONG:
                    x->x_ninsig = av[0].a_w.w_long;
                    break;
            }
        }

        if (x->x_ninsig < 1) x->x_ninsig = 1;
        if (x->x_ninsig > MAXCHANNELS) x->x_ninsig = MAXCHANNELS; */

        // takes an atom list and properly sets any attributes described within; the
        // simplest way to do this is to use the function attr_args_process(object
        // whose attributes will be processed, ac, av)
        // ac = the count of t_atoms in av
        // av = an atom list
        // attr_args_process(x, ac, av) is typically used in the object's new
        // instance routine to conveniently process attribute arguments.
        attr_args_process(x, ac, av);

        x->x_insig = (t_insig *)getbytes(x->x_ninsig * sizeof(*x->x_insig));

        //CREATE INLET
        // creates the signal inlets with dsp_setup((cast to t_pxobject) object pointer,
        // nsignals), so you need not make them yourself!
        // nsignals = the number of signal/proxy inlets to create for the object.
        // If the object has no signal inlets, you may pass 0.
        dsp_setup((t_pxobject *)x, x->x_ninsig);

        //CREATE OUTLETS
        // stores the dumpout outlet in the obex with the generic function
        // object_obex_store(object pointer, key, val). The dumpout outlet is the one
        // used by attributes to report data in response to 'get' queries.
        // key = a symbolic name for the data to be stored
        // val = a t_object *, to be stored in the obex, referenced under the key
        // The generic case is normally adapted to be used as follows:
        object_obex_store(x, _sym_dumpout, outlet_new(x, NULL));
        // creates new outlets with outlet_new(object, s).
        // s = a C string specifying the message that will be sent out this outlet,
        // or NULL to indicate the outlet will be used to send various messages.
        object_obex_store(x, gensym("dumpout"), outlet_new(x, NULL));

        // creates an outlet that will ALWAYS send the list message with
        // listout((t_object *)object).
        x->x_cookedout = listout((t_object *)x);

        for (j = 0, g = x->x_insig + x->x_ninsig - 1; j < x->x_ninsig; j++, g--) {
            g->g_outlet = listout((t_object *)x);
        }

        //CLOCK
        // creates a new Clock object with clock_new(object pointer, (method)fn).
        // This function is normally called in the new instance routine.
        // fn = function to be called when the clock goes off; this function must be
        // called object_tick.
        // The Clock object is used as interface to the Max scheduler.
        x->x_clock = clock_new(x, (method)bonk_tick);

        bonk_donew(x, x->x_npoints, x->x_period, x->x_ninsig, x->x_nfilters,
            x->x_halftones, x->x_overlap, x->x_firstbin, sys_getsr());
    }
    return (x);
}

/* Attribute setters. */
void bonk_minvel_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        float f = atom_getfloat(av);
        if (f < 0) f = 0;
        x->x_minvel = f;
    }
}

void bonk_lothresh_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        float f = atom_getfloat(av);
        if (f > x->x_hithresh)
            post("bonk: warning: low threshold greater than hi threshold");
        x->x_lothresh = (f <= 0 ? 0.0001 : f);
    }
}

void bonk_hithresh_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        float f = atom_getfloat(av);
        if (f < x->x_lothresh)
            post("bonk: warning: low threshold greater than hi threshold");
        x->x_hithresh = (f <= 0 ? 0.0001 : f);
    }
}

void bonk_masktime_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        int n = atom_getlong(av);
        x->x_masktime = (n < 0) ? 0 : n;
    }
}

void bonk_maskdecay_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        float f = atom_getfloat(av);
        f = (f < 0) ? 0 : f;
        f = (f > 1) ? 1 : f;
        x->x_maskdecay = f;
    }
}

void bonk_debouncedecay_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        float f = atom_getfloat(av);
        f = (f < 0) ? 0 : f;
        f = (f > 1) ? 1 : f;
        x->x_debouncedecay = f;
    }
}

void bonk_debug_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        int n = atom_getlong(av);
        x->x_debug = (n != 0);
    }
}

void bonk_spew_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        int n = atom_getlong(av);
        x->x_spew = (n != 0);
    }
}

void bonk_useloudness_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        int n = atom_getlong(av);
        x->x_useloudness = (n != 0);
    }
}

void bonk_attackbins_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        int n = atom_getlong(av);
        n = (n < 1) ? 1 : n;
        n = (n > MASKHIST) ? MASKHIST : n;
        x->x_attackbins = n;
    }
}
/* end attr setters */

void bonk_assist(t_bonk *x, void *b, long m, long a, char *s)
{
    if (m == ASSIST_INLET)
        strcpy(s, "(Signal) Audio Input, Analysis Attributes");
    else if (m == ASSIST_OUTLET) {
        switch (a) {
            case 0: strcpy(s, "(List) Raw Filter Amplitudes"); break;
            case 1: strcpy(s, "(List) Instrument Number, Loudness, Temperature"); break;
            case 2: strcpy(s, "Dump"); break;
        }
    }
}

static void bonk_learn(t_bonk *x, int n)
{
    if (n < 0) n = 0;
    if (n)
    {
        x->x_template = (t_template *)t_resizebytes(x->x_template,
            x->x_ntemplate * sizeof(x->x_template[0]), 0);
        x->x_ntemplate = 0;
    }
    x->x_learn = n;
    x->x_learncount = 0;
}

/* get current system time */
double clock_getsystime()
{
    return gettime();
}

/* get the elapsed time since the given system time, in milliseconds */
double clock_gettimesince(double prevsystime)
{
    return ((gettime() - prevsystime));
}

float qrsqrt(float f)
{
    return 1/sqrt(f);
}

Figure B.1: Max patcher window showing our test patch, built to analyze the ·O M M· sounds with bonk∼ 3.0
Appendix C

Writing Externals for Max/MSP with XCode

Max 5 and Xcode 3.x


This appendix introduces the Max/MSP environment and shows how it can be extended by creating
external objects. For this thesis the tutorial by Zicarelli [62] has been taken into account;
since no existing paper was found on how to write externals for Max 5, the three available
tutorials have been adapted. Writing an external object for Max means writing a shared library
in C that is loaded and called by the “master environment” and that in turn calls helpful
routines back in the master environment. You create a class, a template for the behavior of an
object; instances of this class “do the work” of the object when they are sent messages. Your
external object definition will:

Define the class: its data structure, size, and how instances are to be created and
destroyed

Define functions (called methods) that will respond to various messages, performing
some action

The name externals denotes all the Max/MSP objects that are not included in the standard
distribution; an external can therefore be any object created by yourself (in a suitable
programming language) or developed by somebody else. In Chapter 6, bonk∼, an external object
originally developed by Miller Puckette, is extensively analyzed and modified for the purpose
of onset detection. The reason for writing an external is to add one or more specific tasks to
the logical and arithmetic unit or to the DSP chain of the software.
First, we downloaded the Max 5 Software Development Kit (SDK) from cycling74.com, which
includes the frameworks, the API reference and some examples. The frameworks contain the
header files in which the standard Max/MSP functions and structs are declared; the various
parts of the frameworks are described in the API reference, so any new or modified external
must be written according to it.
Since most objects are written in C, we describe the process of developing an object in C,
although externals can also be created in other programming languages. To develop the external
we used Xcode (version 3.1.2), the latest version of the native IDE of Apple Mac OS X. Let us
look at an Xcode project to understand its most significant contents:

Figure C.1: XCode main window

The Source folder includes the source code you develop, typically a single file named
yourobject.c. The External Frameworks and Libraries folder is where the MaxAPI and
MaxAudioAPI (MSP) frameworks are added. The Products folder contains the external created by
compiling the source code, while Targets hold the compiler options. The object created is a
single file with the .mxo extension, but it only appears to be a file, because it hides
content inside: this is what Mac OS calls a "bundle", or simply a package.
A bundle contains a hierarchy of files and folders, like the one shown in this view:

Figure C.2: a Bundle
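
As a rough, purely illustrative sketch (the object name myobject is hypothetical and the exact
contents may vary), the layout of such a .mxo bundle typically looks like this:

    myobject.mxo/
        Contents/
            Info.plist      (bundle metadata)
            MacOS/
                myobject    (the compiled shared library)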

You can create either Max or MSP externals, depending on your requirements, with some
differences in structure: essentially, MSP externals are those which involve audio DSP, such
as the one used here, while Max externals are logic and arithmetic objects. In order to use
the external in a Max patcher window, the .mxo package produced by building the source code
must be copied into the msp-external (or max-external) folder inside the application folder.
This can be done automatically by telling Xcode where to place the built object in the build
target.
To better understand what a target is, think of it as the set of compiler options. Most of
these options are predefined when compiling an external for Max/MSP with Xcode. One way of
configuring the target manually is to write a file with the .xcconfig extension, add it to
the project, and then use it as the configuration of the target in Xcode.
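
As a minimal sketch (the file name, product name and settings below are illustrative
assumptions, not taken from the thesis project), such a configuration file could contain:

    // MaxExternal.xcconfig -- illustrative build settings for an MSP external
    PRODUCT_NAME = myobject
    WRAPPER_EXTENSION = mxo
    // further settings (architectures, SDK, copy destination, ...) can be added here

Once the file has been added to the project, it can typically be chosen as the base
configuration of the target in the target's build settings.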
A Max external source code has three basic components:
1. the entry point, the main() function;
2. the description of the object, the structs;
3. the definition of its functionality (its behaviour), the methods.
Some methods and elements of the structs are required by Max and are explained in the Max API
reference. The development of the source code can be summarized in five points (a minimal
skeleton following these steps is sketched after the list):
1. including the right header files (usually ext.h and ext_obex.h for MSP objects);
2. declaring a C structure for your object;
3. writing an initialization routine, called main, that defines the class;
4. writing a new instance routine that creates a new instance of the class whenever someone
makes one or types its name into an object box;
5. writing methods (or message handlers) that implement the behavior of the object.
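
To make these five steps concrete, here is a minimal sketch of an external written against the
Max 5 SDK conventions used in Appendix B. The object name dummy, its struct field and its bang
handler are illustrative assumptions, not code taken from bonk∼ or from the thesis project;
only the SDK calls (class_new, class_addmethod, class_register, object_alloc, post) belong to
the actual API.

    /* dummy.c -- illustrative skeleton of a Max external (not part of the thesis code) */

    #include "ext.h"        /* 1. standard Max header */
    #include "ext_obex.h"   /*    obex support, used by Max 5 style objects */

    typedef struct _dummy   /* 2. the C structure describing the object */
    {
        t_object d_obj;     /*    the t_object header must always come first */
        long d_count;       /*    an illustrative instance variable */
    } t_dummy;

    static t_class *dummy_class;   /* global pointer to the class */

    static void *dummy_new(t_symbol *s, long argc, t_atom *argv);
    static void dummy_bang(t_dummy *x);

    int main(void)                 /* 3. initialization routine that defines the class */
    {
        t_class *c = class_new("dummy", (method)dummy_new, (method)NULL,
                               (long)sizeof(t_dummy), 0L, A_GIMME, 0);
        class_addmethod(c, (method)dummy_bang, "bang", 0);  /* 5. register a handler */
        class_register(CLASS_BOX, c);                       /* required, at the end of main() */
        dummy_class = c;
        return 0;
    }

    static void *dummy_new(t_symbol *s, long argc, t_atom *argv)  /* 4. new instance routine */
    {
        t_dummy *x = (t_dummy *)object_alloc(dummy_class);
        if (x)
            x->d_count = 0;
        return x;
    }

    static void dummy_bang(t_dummy *x)  /* 5. behaviour: count the bangs received */
    {
        x->d_count++;
        post("dummy: received %ld bangs", x->d_count);
    }

Once compiled into a .mxo bundle and placed in an externals folder, typing dummy into an object
box creates an instance, and every bang sent to it prints the running count to the Max window.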

Bibliography

[1] Cycling ’74. Max 5 API Reference, 2009.


[2] Samer Abdallah and Mark Plumbley. Unsupervised onset detection: A probabilistic approach
using ICA and a hidden Markov classifier, 2003.
[3] James Beauchamp. Analysis, Synthesis, and Perception of Musical Sounds: The
Sound of Music (Modern Acoustics and Signal Processing). Springer, December
2006.
[4] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler. A
tutorial on onset detection in music signals. Speech and Audio Processing, IEEE
Transactions on, 13(5):1035–1047, 2005.
[5] J. P. Bello, C. Duxbury, M. Davies, and M. Sandler. On the use of phase and energy
for musical onset detection in the complex domain. Signal Processing Letters, IEEE,
11(6):553–556, 2004.
[6] Juan Pablo Bello and Mark Sandler. Phase-based note onset detection for music signals. In
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP’03), pages 441–444. IEEE Computer Society, 2003.
[7] Luciano Berio. Intervista sulla musica. Laterza, 1981.
[8] J. Bilmes. Timing is of the essence: Perceptual and computational techniques
for representing, learning, and reproducing expressive timing in percussive rhythm.
Master’s thesis, MIT, Cambridge, MA, 1993.
[9] Paul Brossier. Automatic Annotation of Musical Audio for Interactive Applications.
PhD thesis, Queen Mary University of London, UK, August 2006.
[10] Judith C. Brown. Calculation of a constant Q spectral transform. J. Acoust. Soc.
Am., 1991.
[11] Judith C. Brown and Miller S. Puckette. An efficient algorithm for the calculation
of a constant Q transform. J. Acoust. Soc. Am., 1992.
[12] Ryan J. Cassidy and J. O. Smith III. Auditory filter bank lab.
[13] Nick Collins. A comparison of sound onset detection algorithms with emphasis on
psychoacoustically motivated detection functions. In AES Convention 118, pages 28–31, 2005.
[14] P. Cosi, G. De Poli, and G. Lauzzana. Auditory modelling and self-organizing neural
networks for timbre classification. Journal of New Music Research, 23(1):71–98, March 1994.
[15] Roger B. Dannenberg. Nyquist Reference Manual. Carnegie Mellon University
School of Computer Science, Pittsburgh, PA 15213, U.S.A., 2007.
[16] Alain de Cheveigné. Pitch perception models - a historical review. Technical report,
CNRS - Ircam, Paris, 2004.
[17] Filipe Diniz, Iuri Kothe, Sergio L. Netto, and Luiz W. P. Biscainho. High-selectivity
filter banks for spectral analysis of music signals. EURASIP Journal on Advances
in Signal Processing, 2007.
[18] C. Dodge and T. Jerse. Computer music: synthesis, composition and performance.
Thomson Learning, 1985.
[19] Carlo Drioli and Nicola Orio. Elementi di acustica e psicoacustica, 1999.
[20] C. Duxbury, M. Sandler, and M. Davies. A hybrid approach to musical note onset
detection. In Proc. Digital Audio Effects Workshop (DAFx), 2002.
[21] Chris Duxbury, Juan Pablo Bello, Mike Davies, and Mark Sandler. Complex domain onset
detection for musical signals. In Proc. Digital Audio Effects Workshop (DAFx), 2003.
[22] Chris Duxbury, Juan Pablo Bello, Mark Sandler, and Mike Davies. A comparison
between fixed and multiresolution analysis for onset detection in musical signals. In
Proc. Digital Audio Effects Workshop (DAFx), 2004.
[23] Ichiro Fujinaga. Max/MSP Externals Tutorial, 2005.
[24] Toby Gifford and Andrew R. Brown. Listening for noise: An approach to percussive
onset detection. In The Australasian Computer Music Conference, 2008.
[25] M. Gimenes, E. R. Miranda, and C. Johnson. A memetic approach to the evolution
of rhythms in a society of software agents. In Proceedings of the 10th Brazilian
Symposium of Musical Computation (SBCM), Belo Horizonte (Brazil), 2005.
[26] John William Gordon. Perception of Attack Transients in Musical Tones. PhD
thesis, CCRMA, Department of Music, Stanford University, 1984.
[27] Paul Gurnig. An Introduction to Writing Externs in C for Max/MSP. University of
Chicago, 2005.
[28] Kurt Jacobson. A metric for music similarity derived from psychoacoustic features
in digital music signals. PhD thesis, University of Miami, 2006.
[29] K. L. Kashima and B. Mont-Reynaud. The bounded-Q approach to time-varying
spectral analysis. Tech. Rep. STAN-M-28, Stanford University, Department of Music, 1985.
[30] A. Klapuri. Sound onset detection by applying psychoacoustic knowledge. In
ICASSP ’99: Proceedings of the Acoustics, Speech, and Signal Processing, 1999.
on 1999 IEEE International Conference, pages 3089–3092, Washington, DC, USA,
1999. IEEE Computer Society.
[31] Alexandre Lacoste and Douglas Eck. A supervised classification algorithm for note
onset detection. EURASIP J. Appl. Signal Process., 2007(1):153, January 2007.

[32] Kai Lassfolk and Jaska Uimonen. Spectutils, an audio signal analysis and visual-
ization toolkit for gnu octave. In 11th Int. Conference on Digital Audio Effects
(DAFx-08), 2008.
[33] Paul Masri. Computer Modeling of Sound for Transformation and Synthesis of
Musical Signals. PhD thesis, University of Bristol, UK, 1996.
[34] James McCartney. Rethinking the computer music language: SuperCollider. Comput.
Music J., 26(4):61–68, 2002. MIT Press, Cambridge, MA, USA.
[35] Jon McCormack. A developmental model for generative media. In Advances in
Artificial Life, pages 88–97. 2005.
[36] E. R. Miranda. Computer Sound Design Synthesis techniques and programming.
Focal press, 2002.
[37] Eduardo R. Miranda. Artificial phonology: Disembodied humanoid voice for com-
posing music with surreal languages. Leonardo Music Journal, 15(1):8–16, 2005.
[38] M. S. Puckette, T. Apel, and David Zicarelli. Real-time audio analysis tools for Pd
and MSP. In Proceedings of the ICMC, 1998.
[39] Miller Puckette. Is there life after MIDI? ICMA, 1994.
[40] Miller Puckette. Max at seventeen. Comput. Music J., 26(4):31–43, 2002.
[41] Miller Puckette. The Theory and Technique of Electronic Music. World Scientific
Publishing Co. Pte. Ltd., 2007.
[42] Miller S. Puckette. Pure Data: recent progress, 1997.
[43] Arunan Ramalingam and Sridhar Krishnan. Gaussian mixture modeling of short-
time fourier transform features for audio fingerprinting. IEEE Transactions on In-
formation Forensics and Security, 1(4):457–463, December 2006.
[44] Curtis Roads. The Computer Music Tutorial. The MIT Press, February 1996.
[45] Curtis Roads. Microsound. The MIT Press, 2004.
[46] D. Rocchesso and F. Fontana. The Sounding Object. Mondo Estremo, 2003.
[47] Davide Rocchesso. Introduction to Sound Processing. GNU Free Documentation
License, 2003.
[48] Davide Rocchesso. Programmazione visuale, versione 1.3, 2007.
[49] Davide Rocchesso. Sound to sound, sense to sense, 2008.
[50] X. Rodet and F. Jaillet. Detection and modeling of fast attack transients. In
International Computer Music Conference (ICMC), pages 30–33, 2001.
[51] E. D. Scheirer. Tempo and beat analysis of acoustic musical signals. Journal of
the Acoustical Society of America, 103(1):588–601, 1998.
[52] X. Serra. Musical Sound Modeling with Sinusoids plus Noise, pages 91–122. Swets
and Zeitlinger, 1997.
[53] Xavier Serra. Parshl: An analysis/synthesis program for non-harmonic sounds based
on a sinusoidal representation, 1985.

[54] Xavier Serra. A System for Sound Analysis/Transformation/Synthesis Based on


a Deterministic Plus Stochastic Decomposition. PhD thesis, Stanford University,
1989.
[55] Barry Vercoe. The Csound Book: Perspectives in Software Synthesis, Sound Design,
Signal Processing, and Programming. The MIT Press, March 2000.
[56] Tony S. Verma and Teresa H. Y. Meng. Extending spectral modeling synthesis with
transient modeling synthesis. Comput. Music J., 24(2):47–59, 2000.
[57] Gil Weinberg and Scott Driscoll. Robot-human interaction with an anthropomor-
phic percussionist. In CHI ’06: Proceedings of the SIGCHI conference on Human
Factors in computing systems, pages 1229–1232, New York, NY, USA, 2006. ACM.
[58] Gil Weinberg and Scott Driscoll. Toward robotic musicianship. Comput. Music J.,
30(4):28–45, 2006.
[59] Gil Weinberg and Scott Driscoll. The interactive robotic percussionist: new de-
velopments in form, mechanics, perception and interaction design. In HRI ’07:
Proceedings of the ACM/IEEE international conference on Human-robot interac-
tion, pages 97–104, New York, NY, USA, 2007. ACM.
[60] Gil Weinberg, Mark Godfrey, Alex Rae, and John Rhoads. A real-time genetic
algorithm in human-robot musical improvisation. Computer Music Modeling and
Retrieval: Sense of Sounds, pages 351–359, 2008.
[61] Stephen Wilson. Information Arts : Intersections of Art, Science, and Technology
(Leonardo Books). The MIT Press, April 2003.
[62] David Zicarelli. Writing External Objects for Max 4.0 and MSP 2.0. Cycling ’74,
2001.
[63] Udo Zölzer, Xavier Amatriain, Daniel Arfib, Jordi Bonada, Giovanni De Poli, Pierre
Dutilleux, Gianpaolo Evangelista, Florian Keiler, Alex Loscos, Davide Rocchesso,
Mark Sandler, Xavier Serra, and Todor Todoroff. DAFX: Digital Audio Effects.
John Wiley & Sons, May 2002.
