POLITECNICO DI TORINO

III Facoltà di Ingegneria dell’Informazione
Degree Course in Electronic Engineering

Master's Thesis

A perceptually grounded approach to sound analysis
An application for Orchestra Meccanica Marinetti

Supervisor: Prof. Marco Masoero
Corrado SCANAVINO

JULY 2009


Acknowledgements

Àlaleria.


Contents
Acknowledgements
1 Introduction
2 A Perceptually Grounded Approach...
   2.1 Auditory Cognition (Reminding Psychoacoustics)
      2.1.1 Limits of Perception, Perception of Intensity. Loudness
      2.1.2 The Human Ear
      2.1.3 Perception of Time and Periods
      2.1.4 Perception of Frequency, the Sensation of Pitch
      2.1.5 Perception of Timbre
3 Digital Audio Concepts
   3.1 Toward Digital Representation of Sound
   3.2 Digital Filters
      3.2.1 Filters Background
      3.2.2 Introduction to Digital Audio Processing with Filters
      3.2.3 Digital implementation of filters
      3.2.4 FIR Filters
      3.2.5 IIR Filters
4 ...To Sound Spectrum Analysis
   4.1 Introduction to Sound Analysis in the Frequency Domain
   4.2 Introduction to the Fourier Analysis
      4.2.1 Fourier Transform (FT), Classic Formulation
      4.2.2 Discrete Fourier Transform (DFT)
   4.3 The Short Time Fourier Transform (STFT)
      4.3.1 The Filterbank View
      4.3.2 Windowing: Length and Shape of the Window Function
      4.3.3 Computation of the DFT (via FFT)
      4.3.4 The Inverse Short Time Fourier Transform & Overlap-Add Resynthesis
   4.4 Constant-Q analysis
      4.4.1 Implementation of Constant-Q Analysis
5 Real-Time Audio Applications
   5.1 Max/MSP
   5.2 CSound
   5.3 Supercollider
   5.4 Chuck
6 Perceptual Onset Detection
   6.1 The Curious Case of ·O M M·
   6.2 From Transient to Attack and Onset Definitions
   6.3 General Scheme for Onset Detection
      6.3.1 Energy Based Approach
      6.3.2 Phase Based Approach
   6.4 Introduction to the Perceptual Based Approach to Onset Detection
   6.5 Onset Detection in ·O M M·
      6.5.1 The bonk∼ Method
      6.5.2 Result of the Analysis in ·O M M·
   6.6 From Onset Analysis to Sound Classification
      6.6.1 Learning Results
7 Conclusion
A MSP, anatomy of the object
B bonk∼ source code
   B.1 The bonk∼ Method
C Writing External for Max/MSP with XCode
Bibliography


List of Tables
5.1 Musical software for realtime synthesis and control
6.1 Filterbank design in our method based on bonk∼
6.2 Results in detecting onset of the five soundtracks created for analysis purpose, played at different bpm. ·O M M·
6.3 Numerical result in detecting onset and recognizing the three sounds (A/B/C) produced by the ·O M M·


List of Figures
2.1 Winckel's threshold of hearing [1967].
2.2 Equal-loudness contours for the human ear, determined experimentally by Fletcher and Munson, published on Loudness, its definition, measurement and calculation [1933].
2.3 Peripheral auditory system.
2.4 Part of the inner ear, the cochlea is shaped like a 32 mm long snail and is filled with two different fluids separated by the basilar membrane.
2.5 Cochleagrams, expressed in bark unit as function of time. On the left the spoken Italian word "ape", on the right a short excerpt of Moondog's "Pigmy pig".
3.1 Simpler digital audio system
3.2 Amplitude (A) response versus frequency, for the four basic types of filters.
3.3 The pass-band or bandwidth of a band-pass filter is the difference between the upper and lower cutoff frequency. The cutoff frequencies are defined as the frequency at which the amplitude, but energy would be better to say instead, is half the pass-band amplitude. In the figure, 40 dB is assumed as the maximum level of amplitude in pass-band.
3.4 Example of application of a constant Q filter. Here the center frequencies are tuned around generic musical octave. In music, an octave, is the interval between one musical pitch and another with half or double its frequency.
3.5 Alteration of the envelope of a tone (INPUT) passed through a narrow filter (OUTPUT). The output envelope has been stretched in time during onset and offset components of the tone (initial and final portion).
3.6 Echo and reverberation effects explained by convolution.
3.7 Simple delay line.
3.8 Circular buffer.
4.1 Digital sound synthesis and sound analysis.
4.2 Two plots of static spectrum. The image represents the SPL against frequency of a drum hit played by a robot (on the left), and a note of a violin (on the right). The difference is noticeable, while the robot hit has apparently no harmonically related frequency components, in the violin note this is clear.
4.3 Basic operation of the STFT used for sound analysis.
4.4 Waterfall spectrum, a 3D representation of the STFT spectrum. The graph was obtained with Spectutils package for GNU Octave. The analysis parameters of the STFT are shown above the figure, the audio sample analyzed is extracted from Laurie Anderson's Violin Solo.
4.5 Types of windows used in STFT for audio analysis. No ideal window exists, the term "optimal window" is preferred. Several types of windows are used, for musical purpose the Kaiser window has usually a preferential use.
4.6 Spacing of filters for STFT (filterbank view) on the top and Constant-Q filterbank on the bottom. It is clear the advantage of the Constant-Q filterbank method, which places the filters linearly against log(frequency), which is similar to the frequency response of the human ear.
4.7 Waterfall spectrogram of a Constant Q transform of violin glissando from 578 Hz to 880 Hz (D5 to A5). Taken from Judith Brown's Calculation of a constant Q spectral transform. [A glissando is a glide from one pitch to another. It is an Italianized musical term derived from the French glisser, to glide. It is also where the pianist slides up the piano with his or her hands. From Wikipedia.]
4.8 Waterfall spectrogram of a Constant Q transform of flute playing diatonic scale from 262 Hz to 523 Hz (C4 to C5). Taken from Judith Brown's Calculation of a constant Q spectral transform. [In music theory, a diatonic scale is a seven note musical scale comprising five whole steps and two half steps, in which the half steps are maximally separated. From Wikipedia.]
5.1 Max 5 patcher window
5.2 Max 5 window
6.1 The two robots on the sides, SCS + the performer in the middle.
6.2 On the top, the waveform corresponding to a hit of a robot percussionist of ·O M M·. On the bottom, the intensity profile of the hit (using Praat), where onset, attack and the transient/steady state separation are highlighted.
6.3 From top to bottom: waveform, static spectrum (FFT) and time-varying spectrum (STFT). From right to left: one hit of ·O M M· robot, one hit of snare drum.
6.4 Unwrapped phase deviation between two adjacent analysis frames. ∆ϕn,k is the unwrapped phase deviation. For the simpler case represented by a steady state sinusoid, the phase deviation is approximately 0 constant in-between the whole analysis frames, while, during transient the phase deviation should be extremely large and easy to detect.
6.5 Graphical representation of the bounded-Q filterbank. Only the octave are geometrically spaced, in between the octave the spacing between analysis bins is linear. This allows the application of FFT-like algorithm to calculate the spectrum of each component.
B.1 Max patcher window showing our test patch realized to analyze the ·O M M· sounds with bonk∼ 3.0
C.1 XCode main window
C.2 a Bundle


Chapter 1 Introduction
Sound analysis is a very wide area of research; its typical applications range from environmental impact studies, vibration models and bioacoustics to... music. Each of these fields has its own specific characteristics, so the best approach has to be identified case by case. The specific field of application of this thesis is musical: the method has been tested inside the Orchestra Meccanica Marinetti project, or ·O M M·. This thesis is about a specific approach to sound analysis, defined as perceptually grounded because it mimics the human perception of sound (the auditory system). Perceptual sound analysis is an ideal candidate for applications in this context. The idea was to extend or improve some of the musical characteristics of the robotic orchestra.

·O M M· is a project about a robotic orchestra, controlled in real time by a performer. The project was conceived by the programmer and digital artist Angelo Comino, a.k.a. Motor, and consists mainly of two robots that play drums, conducted by a performer through a gestural controller, via MIDI. The two robots are more than 2 metres high, and the drums are oil cans of standard size (such as those used by petrol companies). The robots were designed and built with industrial components, with special care taken to emulate the movement of a real drummer, by the people of the Mechatronic Lab (LIM) of Politecnico di Torino, in collaboration with local robotics companies: Prima Electronics, ERXA and ACTUA. Each robot has two arms, moved by two powerful electric motors and controlled by dedicated FPGA-DSP hardware, while the interaction with the performer is adjusted in real time by the Show Control System, developed in Max/MSP (a typical development environment for this kind of application) and running on an Apple laptop. A picture of the ensemble of robots and performer can be seen in figure 6.1.
·O M M· was presented in October 2008 in Turin, gaining great visibility in the national media (press and television); it is currently completing its engineering development and starting its artistic deployment.

The idea of a robot musician is not new. In the early 1980s WABOT-2 was tested at Waseda University (Japan): a robot keyboardist able to converse with a person, read a musical score with its eyes (a camera) and play it on an electronic organ. A robot drummer has been under development since 2005 at the Georgia Tech College of Computing. Haile, as it is called, is able to listen to live musicians and accompany them by playing a drum. Haile's output is based on live analysis and processing of the sounds produced by other musicians playing at the same time, not on pre-recorded sequences. Other examples are the recent robotic trumpeter and violinist by TMC (Toyota Motor Corporation).

One of the critical points within a mechanical orchestra is synchronizing the execution of the musical score among real and virtual instruments. The human ear is particularly sensitive to timing, but physical devices, such as the electromechanical arms of the robots, have variable delays when activated. These variations depend on the requested note intensity or on the rhythm of execution. Each robot basically receives a message on a serial line (MIDI), stating what kind of hit should be executed. Besides the message generation and data transmission delays, usually negligible from a human point of view, there is the delay introduced by the physical movement of the arms. Thus, we need to measure the time interval between the digital command and the perceived strike on the can. We propose a "perceptually grounded approach" to recognize the different strikes, in order to compute the delay matrix of a generic score. The robots can play approximately two hits per second with each arm, and the sounds they can play come in three different variations. That is, the robot arm must be positioned at the correct height and will hit the drum after a delay, which is primarily related to the distance of the arm from the drum and the acceleration with which the arm is driven.
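The delay measurement described above can be illustrated with a minimal sketch. It assumes that the command times and the detected onset times are already available as sorted lists of seconds; the function name `delay_table`, the data layout and the `max_delay` cut-off are hypothetical choices made for illustration, not the method used in the project.

```python
from collections import defaultdict
from statistics import mean


def delay_table(commands, onsets, max_delay=1.0):
    """Average command-to-onset delay per hit type (illustrative sketch).

    commands:  list of (time_s, hit_type) tuples sorted by time (assumed format).
    onsets:    sorted list of detected onset times in seconds.
    max_delay: an onset later than this after a command is not paired with it.
    """
    delays = defaultdict(list)
    i = 0
    for t_cmd, hit_type in commands:
        # Skip onsets that occurred before this command was sent.
        while i < len(onsets) and onsets[i] < t_cmd:
            i += 1
        if i < len(onsets) and onsets[i] - t_cmd <= max_delay:
            delays[hit_type].append(onsets[i] - t_cmd)
            i += 1  # each onset is paired with at most one command
    return {hit: mean(values) for hit, values in delays.items()}


# Made-up example: three hit types (A/B/C) and the onsets heard afterwards.
commands = [(0.0, "A"), (0.5, "B"), (1.0, "A"), (1.5, "C")]
onsets = [0.12, 0.65, 1.10, 1.68]
print(delay_table(commands, onsets))
```

Pairing each command with the first onset that follows it is the simplest possible matching rule; it only works when hits are far enough apart that two commands never compete for the same onset.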
Hence the delays are variable. Moreover, problems related to non-absorbed vibrations between one hit and the next could cause unwanted changes in the perceived loudness and pitch. That is why a perceptually based approach was needed to characterize the behaviour of the robots' performance and its response to the applied digital stimuli.

The first part of the thesis contains an overview of the phenomena occurring in our auditory system, in particular when a new musical event occurs (i.e. a robot hit in our case), in which our ear encodes both the time and the frequency phenomena related to the human perception of sound. In the specific case of percussion, the sound produced can for convenience be subdivided into two parts: the transient (the origin of the sound) and the steady-state portion (which can be regarded as the extended support of the transient). Two important points should be considered. First, during the transient, the precise instant at which the sound originates (the onset of the sound) corresponds to a rapid increase in sound energy, which can reach its peak in less than 5 ms. Detecting the onset, our first aim, is not a trivial task. Second, the time at which the onset occurs is the meaningful component of a percussive sound; the information obtained by performing a sort of spectrum analysis at this point is usually considered sufficient to predict the entire sound, i.e. the steady-state portion of the sound can be derived from it.

Efforts were then extended to the subfield of signal processing that deals specifically with the musical audio signal (DSP). We therefore present the digital representation of audio and the realization of digital filters, the basic (but still very advanced) components of every digital signal processing task. For this purpose the books of Roads, Dodge, Rocchesso and Beauchamp were a good starting point. Later we introduce the methods used for audio analysis, in particular those systems which perform harmonic analysis, usually grouped under the name of Fourier analysis. The Fourier transform, applied to a musical signal, can be viewed as a decomposition of the sound into a finite number of harmonics, each of which is represented by a complex value. This value is sufficient to extract all the information needed to derive the frequency, intensity and phase of each harmonic, but it is not the only solution for musical analysis. The method we suggest considering, under certain conditions (in particular for music or speech), is based on the Constant-Q Transform, which approximates well some features of the human auditory system.

Composers of the 20th century contributed to the evolution of electronic music in ways that even they would not have expected. Since Luigi Russolo and the intonarumori in 1918, mechanical instruments producing non-harmonic sounds (Art of Noise), musical schemes have been continuously redefined. Russolo influenced even Stravinsky (Paris 1921), and after Stravinsky and the expressivity and richness of his music, musicians also became technicians and explored new electro-acoustic machines for producing sound. Several years passed between the Russian composer and the first Moog, but the works of other composers like Bartók, Varèse, Messiaen, Schaeffer, Ligeti and Cage kept innovation alive.
In parallel, the studies of certain mathematicians and physicists of the 19th century (primarily Helmholtz and Fourier) led technicians to the discoveries that made possible the first electronic instruments, which became famous with movies like "Forbidden Planet" (Louis and Bebe Barron) or "2001: A Space Odyssey" (Ligeti, and the voice of HAL inspired by the computer synthesis experiments of Max Mathews), or with the experimental works of Norman McLaren. Other experiments between art and technology were proposed, such as the electroacoustic compositions "A man sitting in a cafeteria" by Charles Dodge and "I am sitting in a room" by Alvin Lucier, very attractive for their expressivity and their educational approach. The first is one of the first experiments in reproducing speech with a computer, and the second is a brilliant example of the application of the different impulse responses of a room. This music comes from the 60s and 70s. Thank you!


Chapter 2 A Perceptually Grounded Approach...
There are no theoretical limitations to the performance of the computer as a source of musical sounds, in contrast to the performance of ordinary instruments. At present, the range of computer music is limited principally by cost and by our knowledge of psychoacoustics.
M. V. Mathews¹

2.1 Auditory Cognition (Reminding Psychoacoustics)

Some of the subjects treated in this section require notions of acoustics. Intensity, frequency, duration and spectrum are the physical attributes used in the literature to describe the acoustical properties of a sound. These attributes do not form music by themselves, but they can change the perception of each sound component of a musical flow. The perceptual attributes (pitch, loudness and timbre) describe how the physical attributes of sound are perceived and interpreted as mental constructs by the brain, through our hearing system. The composer needs to know how to construct and balance the physical attributes of sound in a way that corresponds, more or less, to the composer's musical concept [44].

Since sound is assumed to be carried by vibrations² propagating through a medium such as air, the detection of these vibrations constitutes our sense of hearing. The physical information conveyed by sounds, together with the study of natural auditory systems (the human ear), has been successfully used to derive the relationship between physical stimuli and the mental constructs they induce. The subfield of psychophysics (the study of psychological responses to physical stimuli) that describes these phenomena is psychoacoustics.

¹ Appeared in the article The Digital Computer as a Musical Instrument, in the journal Science, dated 1 Nov. 1963. Computers now cost a little less, and something on psychoacoustics will follow.
² Other representations of sound are possible. A microscale has been proposed, i.e. sound can be decomposed into smaller time units called microsounds or sound particles. See [45] and Gabor, Schaeffer, Schoenberg et al. for more on different sound decompositions.

2.1.1 Limits of Perception, Perception of Intensity. Loudness

Intensity is proportional to energy, i.e. the variance of air pressure, in a sound wave. Sound intensity is measured in terms of sound pressure level (SPL) on a logarithmic scale, thus the result can be expressed in dB:

$$\mathrm{SPL}\,[\mathrm{dB}] = 20 \cdot \log_{10}\left(\frac{p}{p_0}\right)$$

where $p_0$ is the reference pressure, corresponding to the estimated threshold of hearing at 1 kHz. The threshold of hearing is generally reported as an RMS³ sound pressure of 20 µPa, which is approximately the quietest sound a young human with undamaged hearing can detect at 1 kHz. In a free field, sound pressure is inversely proportional to the distance from the sound source, so SPL drops by about 6 dB for every doubling of the distance. Loudness is the perceptual attribute related to changes in intensity: an increase in sound intensity is perceived as an increase in loudness. Unfortunately the relationship is not trivial. Loudness also depends on other factors such as spectrum, duration and the presence of background sounds.

In 1967 Winckel⁴ proposed the range of hearing for a young adult human ear shown in figure 2.1; this range can vary with age and with the individual's sensitivity.

Figure 2.1: Winckel's threshold of hearing [1967].

Winckel's range of hearing is valid for sustained sine tones. For shorter tones the threshold can rise, because, close to the borders of the threshold, the ear seems to integrate energy for shorter tones, at least for those shorter than 200 ms. Other studies have shown that the human body can sense very low frequencies, although the ears do not, and that the upper limit of sensitivity may be well beyond 20 kHz.

Another useful tool is the set of Fletcher-Munson curves. Fletcher and Munson proposed a graph similar to that of Winckel, introducing the concept of the constant-loudness contour, easy to identify in figure 2.2. The meaning of this graph is that all points on a given curve have roughly the same loudness. The loudness level of each curve is expressed in phons, where a phon is defined as the "number of dB at 1 kHz". In other words, a sine tone at 1 kHz with an intensity of 50 dB has a loudness level of 50 phons. Therefore, if we want to produce a sine tone at 300 Hz with the same loudness as the 1 kHz tone, it is necessary to follow the 50-phon curve to 300 Hz and use the corresponding value of SPL; the two tones will then sound equally loud to the listener.

Obviously, the perfect sine wave is an artifact: no sound exists in nature as the expression of a single frequency. However, it has been demonstrated that it is possible to decompose⁵ a sound into a sum of perfect sine waves. We could therefore assume that each of them, weighted according to the Fletcher-Munson curves and then summed, contributes to the total loudness. But this is another theoretical situation, since linearity cannot actually be applied, at least not over the whole spectrum, because of the presence of critical bands⁶. Before introducing the time and frequency perception of human hearing, the most advanced features of our auditory system, it may be better to understand how the ear works.

³ Root mean square, a statistical measure of the magnitude of a varying quantity.
⁴ Fritz Winckel, Austrian acoustician, is considered one of the pioneers of electronic music. In 1967 he published the book Music, Sound and Sensation: A Modern Exposition.
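As a small worked example of the SPL definition given in this section, the sketch below converts an RMS pressure into dB relative to 20 µPa. The second function assumes free-field spherical spreading (pressure falling as 1/r), an idealization introduced here only for illustration; the function names are hypothetical.

```python
import math

P0 = 20e-6  # reference RMS pressure in pascal (20 µPa), as quoted in the text


def spl_db(p_rms, p0=P0):
    """Sound pressure level in dB for an RMS pressure given in pascal."""
    return 20.0 * math.log10(p_rms / p0)


def spl_at_distance(spl_ref_db, r_ref, r):
    """SPL at distance r, given the SPL at r_ref.

    Assumes free-field spherical spreading, i.e. pressure proportional to 1/r.
    """
    return spl_ref_db - 20.0 * math.log10(r / r_ref)


print(spl_db(20e-6))                    # the reference pressure itself gives 0 dB
print(spl_db(1.0))                      # 1 Pa RMS is roughly 94 dB
print(spl_at_distance(94.0, 1.0, 2.0))  # doubling the distance costs about 6 dB
```

Real rooms add reflections, so measured levels usually fall off more slowly with distance than this idealized rule suggests.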


Figure 2.2: Equal-loudness contours for the human ear, determined experimentally by Fletcher and Munson and published in Loudness, its definition, measurement and calculation [1933].

2.1.2 The Human Ear

The peripheral auditory system is the means by which sound waves are detected, encoded and retransmitted through nerve cells to the brain, where humans finally render sound. Although very sophisticated, the process can be intuitively subdivided into three steps, each accomplished in a different part of the ear.

Figure 2.3: Peripheral auditory system.

• The outer ear: amplifies and conveys incoming sound waves as air vibrations. Here the sound waves enter the auditory canal, which can amplify sounds containing frequencies in the range between 3 kHz and 12 kHz. At the far end of the auditory canal is the eardrum (or tympanic membrane), which marks the beginning of the middle ear.
• The middle ear: transduces air vibrations into mechanical vibrations. Sound waves coming from the auditory canal hit the tympanic membrane. Here three delicate bones, the malleus (hammer), incus (anvil) and stapes (stirrup), convert the lower-pressure sound vibrations of the eardrum into higher-pressure sound vibrations at another, smaller membrane, called the oval or elliptical window. Finally, the stapedius muscle has the role of preventing damage to the inner ear. In the middle ear the sound information is still in wave form; it is converted into nerve impulses in the cochlea. Higher pressure is necessary because the inner ear beyond the oval window contains liquid rather than air.
• The inner ear: processes mechanical vibrations and transduces them mechanically, hydrodynamically and electrochemically. The resulting signals are then transmitted through nerves to the brain. The inner ear consists of the cochlea and several non-auditory structures. The cochlea has three fluid-filled sections, and supports a fluid wave driven by pressure across the basilar membrane separating two of the sections. Strikingly, one section, called the cochlear duct or scala media, contains endolymph, an extracellular fluid similar in composition to the fluid usually found inside cells. The organ of Corti is located in this duct, and transforms mechanical waves into electric signals in neurons. The other two sections are known as the scala tympani and the scala vestibuli; they are located within the bony labyrinth, which is filled with a fluid called perilymph. The chemical difference between the two fluids (endolymph and perilymph) is important for the function of the inner ear.

Additional processes occur at the brain level; for example, other neurally encoded information is used to combine the signals coming from both ears and fuse them into one sensation. However, although complex, this mechanism does not by itself give the brain the information it needs to understand, for example, a single note, a harmony, a rhythm or higher-level musical structures. It appears that the low-level time and frequency perceptual mechanisms also operate on the musical signal in parallel. Thus the nature of a sound is not determined only by the physical properties of the sound and of the human ear: all this information is combined at a high level (i.e. in the brain), where the sound takes its musical form.

⁵ See chapter 4, Fourier Transform and Overlap-Add Resynthesis, for an explanation of this fact.
⁶ See section 3.1.3 for an explanation of critical bands.

2.1.3 Perception of Time and Periods

Higher-level perceptual processes are possible only because other mechanisms, in the inner ear, encode both time and frequency. In this section we look at temporal features; two of them seem to be the most prominent: period detection and temporal integration.

Period detector
The period detection mechanism inside the auditory system operates on the fine structure of the neurally translated incoming waveform. The neural pattern is obtained by nerve cells (in the organ of Corti) firing individually or in groups, at a rate which corresponds to the wave's period. Individually, each cell can operate in this manner only down to a certain period; if the period is too short, the cell cannot recover quickly enough. However, groups of cells can rotate or stagger their firing, so that they, in effect, follow submultiples of the sound period.
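The staggered firing described above can be pictured with a toy model. The sketch below is purely illustrative: it assumes every cell needs the same fixed recovery time and that cells take turns in strict rotation, and the 1000 µs recovery time and the function name are hypothetical values chosen for the example, not physiological data.

```python
import math


def volley_spike_times(period_us, refractory_us, duration_us):
    """Toy model of staggered firing (all times are integers in microseconds).

    Each cell fires only on every n-th cycle of the wave, so it always has at
    least refractory_us to recover, while the group as a whole produces one
    spike per cycle. Returns (n_cells, list of spike times per cell).
    """
    # Smallest group size that gives every cell enough time to recover.
    n_cells = max(1, math.ceil(refractory_us / period_us))
    n_cycles = duration_us // period_us
    cells = [[] for _ in range(n_cells)]
    for cycle in range(n_cycles):
        # Cells take turns: cell 0 fires on cycle 0, cell 1 on cycle 1, ...
        cells[cycle % n_cells].append(cycle * period_us)
    return n_cells, cells


# A 4 kHz tone (250 µs period) with a hypothetical 1000 µs recovery time.
n_cells, cells = volley_spike_times(period_us=250, refractory_us=1000, duration_us=2000)
merged = sorted(t for cell in cells for t in cell)
print(n_cells)   # number of cells needed to cover every cycle
print(cells[0])  # spikes of a single cell, spaced by its recovery time
print(merged)    # spikes of the whole group, one per cycle of the wave
```

No single cell in this model fires faster than its recovery time allows, yet the merged spike train still marks every period of the wave.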


A special feature is that the ear can encode variations in the envelope of the wave. Studies have demonstrated the existence of a mechanism in the central auditory system that detects amplitude modulation (AM), although only in a small range of frequencies (75 to 500 Hz) and only for a significant depth of modulation.

Event detector
Another time-related mechanism, deep inside the human ear, is the perception of events. A musical event occurs every time there is a variation in the vibration pattern, that is, something has happened nearby and we hear a new sound. Sound onset⁷ is the perception that a new sound is born. At onset time other nerve cells fire, and different cells operate on different onset slopes. A model for onset detection, developed by Gordon in 1984 [26], showed that the moment of perceptual onset of a musical event can be significantly delayed with respect to the physical onset. Another problem is that it is not possible to establish unequivocally the threshold above which an event becomes audible to the ear, that is, the threshold above which the ear recognizes the onset. What does the human ear consider an audible event? Bilmes posed these questions: does it refer to the time when the physical energy in a signal increases infinitesimally? The time of the peak firing rate of the cochlear nerve? The time when we first notice a sound? The time when we first perceive a musical event? Or something else? [8] Whatever it means, it has been demonstrated that the perceptual onset time does not necessarily coincide with the initial increase in physical energy. Moreover, since other cells respond to the temporal interval between events, the human auditory system is able to connect single events into a rhythmic stream.

Temporal integration
Temporal integration is another important feature in the perception of time. The human ear seems to integrate two or more events if they are too close together. This is the principal limit to the resolution of perceived rhythm.
The minimal time between events that the human ear can sense separately is variable, depending on the duration of each event. For example, this interval can be a few milliseconds if the events are very short, but it can also be much greater than 50 ms. What happens when a succession of sound events cannot be perceived separately in time by the human ear? They smear together to form one sensation; in other words, temporal resolution is lost. Therefore, the human ear has no fixed "time resolution". Many phenomena are related to temporal integration, above all the (sometimes desired) effect of reverberation.
⁷ Onset is the point at which a musical event becomes audible. For a percussive sound it is considered to coincide with the attack time, the instant at which the stick strikes the drum.

The case of reverberation
Reverberation is different from an echo, and also from a sequence of echoes. If a sound is reflected by a surface we hear the sound and its echo; if the surface is irregular or other surfaces are present in the room, several echoes can be heard. The number of echoes per second is normally referred to as the echo density. When the echo density is greater than about 30 echoes per second, individual echoes are separated by less than about 35 ms and the ear cannot perceive them separately. The fusion of the echoes into a single sensation is what we call reverberation. The probability of smearing is affected not only by short time intervals between events, but also by frequency: if two consecutive notes in a musical stream have similar frequencies, they will probably smear together.
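The arithmetic behind the echo-density figures in the reverberation paragraph can be checked with a minimal sketch. It assumes evenly spaced echoes and treats the 35 ms figure quoted in the text as a hard limit, which is a simplification; the function names are hypothetical.

```python
def mean_echo_spacing_ms(echoes_per_second):
    """Average time between successive echoes, in milliseconds."""
    return 1000.0 / echoes_per_second


def fuses_into_reverberation(echoes_per_second, fusion_limit_ms=35.0):
    """True when the average spacing is below the fusion limit quoted in the text."""
    return mean_echo_spacing_ms(echoes_per_second) < fusion_limit_ms


# Spacing and fusion decision for a sparse, a borderline and a dense case.
for density in (10, 30, 100):
    spacing = round(mean_echo_spacing_ms(density), 1)
    print(density, spacing, fuses_into_reverberation(density))
```

At 30 echoes per second the average spacing is about 33 ms, which is consistent with the "less than 35 ms" stated above.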

2.1.4 Perception of Frequency, the Sensation of Pitch

Figure 2.4: Part of the inner ear. The cochlea is shaped like a snail, is about 32 mm long and is filled with two different fluids separated by the basilar membrane.

Frequency is a physical parameter associated with each wave that carries sound energy to the ear. Pitch is the perceived parameter related to frequency; it can be thought of as the quality of a sound governed by the rate of the vibrations producing it [44]. In the inner ear, the oscillations of the oval window take the form of traveling waves that move along the basilar membrane, i.e. along the entire length of the cochlea.

2.1 – Auditory Cognition (Reminding Psychoacoustics)

The mechanism for detecting frequencies is located in the basilar membrane. A simple correspondence occurs: when a single sine tone excites the ear, a region of the basilar membrane oscillates around its equilibrium position. Since real sounds do not consist of a single frequency, this region shows a place where the excitation has a maximum, corresponding to the fundamental frequency. The distance of this maximum from the end of the basilar membrane is directly related to frequency, so that each frequency is mapped to a precise place along the membrane. The mechanical properties of the cochlea (wide and stiff at the base, narrower and much less stiff at the apex) result in a roughly logarithmic decrease in bandwidth as we move linearly away from the cochlear opening (the oval window), as shown in figure 2.4. Thus, the auditory system acts as a spectrum analyzer, detecting the frequencies in the incoming sound at every moment in time. The cochlea can be understood as a set of band-pass filters, each letting only frequencies in a very narrow range pass. This mechanism can be associated with a filterbank of constant-Q filters⁸, whose center frequencies are linearly spaced on a logarithmic scale⁹.

However, the sensation of pitch is not only related to the perceived fundamental frequency. Other contributions, related to the temporal mechanisms encoded in the ear, such as period detection, can alter the sensation of pitch. The sounds that the ear can sense span a wide frequency range, approximately from 20 Hz to 20 kHz. The perceived pitch, also expressed in Hz, has a more limited range, approximately from 60 Hz to 5 kHz.

Critical Bands
Since each frequency stimulates a region of the basilar membrane, a limit is imposed on the frequency resolution of the ear. This limit is reflected in another characteristic of perception, known as the critical band. A simple example helps to understand how the ear works within the critical band.
Think of, or better listen to, two sine waves very close in frequency: their total loudness is less than the sum of the two loudnesses we would hear if they were well separated in frequency. Now, if we slowly move them apart in frequency, we perceive the same loudness up to a point; then, beyond a certain frequency difference, the total loudness increases approximately to the sum of the individual loudnesses. The frequency difference needed to perceive the loudness as the sum of the individual loudnesses is the critical band. The behavior of the ear in this region can be thought of as a kind of frequency integration, similar to the temporal integration seen earlier. Two other important factors of perception reside inside the critical band: roughness and beating. Roughness is a sensation of dissonance; it is particularly strong near the lower and upper bounds of the critical band, where the two tones are almost separated but not yet ready to be perceived as two sounds. In the middle of the critical band the two tones are heard as one, with a frequency that lies between the two frequencies, and we can clearly perceive the sensation of beating. When the two tones are separated by 1 Hz we perceive a single beat per second.

⁸ A constant-Q filterbank is a set of band-pass filters whose bandwidths are adapted to their center frequencies so as to maintain a fixed ratio (Q).
⁹ See chapter 4, the section on constant-Q analysis, which bases its benefit on the similarity with the human pitch-detection mechanism occurring in the basilar membrane.

Figure 2.5: Cochleagrams, expressed in Bark units as a function of time. On the left, the spoken Italian word "ape"; on the right, a short excerpt of Moondog’s “Pigmy pig”.

The width of the critical bands increases with frequency. The Bark scale was proposed to represent the behavior of the human ear inside the critical bands. An example of such a representation is given in figure 2.5, where the spectrogram is plotted against frequency on a Bark scale; in this case it is appropriate to speak of a cochleagram. The Bark scale (of human hearing) ranges from 1 to 24 Barks, corresponding to the first 24 critical bands. The proposed Bark center frequencies, in Hz, are:

50, 150, 250, 350, 450, 570, 700, 840, 1000, 1170, 1370, 1600, 1850, 2150, 2500, 2900, 3400, 4000, 4800, 5800, 7000, 8500, 10500, 13500

while the corresponding bandwidths, in Hz, are:

100, 100, 100, 100, 110, 120, 140, 150, 160, 190, 210, 240, 280, 320, 380, 450, 550, 700, 900, 1100, 1300, 1800, 2500, 3500

These center frequencies and bandwidths are best interpreted as samples of a continuously varying characteristic of the ear: critical bands form around the specific stimulus rather than belonging to a specific fixed filter bank in the ear. Note that, since the Bark scale is defined only up to 15.5 kHz, the highest sampling rate for which the Bark scale is defined up to the Nyquist limit is 31 kHz.

When many frequencies are present (fundamental tones and harmonics), the auditory system works on all of them simultaneously, within the limit of resolution introduced by the critical band. This effect on the overall spectrum is another contribution to the perceived pitch. Pitch is also influenced by inharmonic spectra, which are characteristic of noise.

Perception of noise
White noise does not affect pitch because it is completely random and has a flat spectrum that does not evoke any sensation other than annoyance. Since colored noises are created by modulating white noise, some of them can yield a vague pitch sensation, depending on the modulation applied. For example, amplitude modulation of white noise may produce a pitch corresponding to the modulation frequency. Other sensations of pitch can be achieved by filtering white noise or by applying digital effects to it.

We have seen the most important factors characterizing the sensation of pitch; we can now introduce the last perceived attribute of a sound, timbre.
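Returning to the Bark scale listed earlier in this section, the following is a minimal sketch of how a frequency in Hz could be assigned to one of the 24 critical bands. The band edges are an assumption: they are not given in the text, and are the commonly tabulated edges whose consecutive differences reproduce the bandwidths listed above. The function name `bark_band` is hypothetical.

```python
import bisect

# Assumed band edges in Hz (not given in the text). The difference between
# two consecutive edges is the bandwidth of the corresponding critical band.
BARK_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
              1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
              9500, 12000, 15500]

def bark_band(freq_hz):
    """Return the 1-based critical band index (1..24) containing freq_hz."""
    if not 0 <= freq_hz < BARK_EDGES[-1]:
        raise ValueError("the Bark scale is defined only up to 15.5 kHz")
    return bisect.bisect_right(BARK_EDGES, freq_hz)

# Bandwidths recovered from the edges, for comparison with the list above.
widths = [hi - lo for lo, hi in zip(BARK_EDGES, BARK_EDGES[1:])]

for f in (50, 1000, 13500):
    print(f, "Hz falls in Bark band", bark_band(f))
```

A table lookup is used here instead of a closed-form Hz-to-Bark formula, simply because the text defines the scale through its tabulated bands.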

2.1.5 Perception of Timbre

The generic definition of timbre is: the attribute by which we can distinguish two sounds with the same loudness and pitch. Thus timbre is the character or quality of a musical sound, distinct from but influenced by its pitch and loudness. Sometimes, more poetically, it is also referred to as the color of sound. The characteristics determining timbre reside in the constantly changing spectrum of a musical sound, produced for example by an instrument. The steady-state spectrum is not enough to distinguish the sound of one instrument from that of another; the attack and decay portions of the spectrum are also very important. Therefore, timbre has more than one dimension, because it involves the temporal envelope and the evolution of the spectral distribution over time [44].


Chapter 3 Digital Audio Concepts
Prior to the study of specific aspects of sound analysis (chapters 4 and 6), it is useful to clarify the basic concepts behind audio representation on digital computers. In the following paragraphs, the main attributes of a digitized sound are dealt with, from basic terms (sampling and quantization) to advanced applications (digital filters). This chapter is therefore divided into two sections: the first contains a brief illustration of the theory behind the digital representation of music, while the second gives a deeper explanation of how filters are implemented on digital computers.

3.1 Toward Digital Representation of Sound

Analog audio signal
Sounds come in the form of vibrations, carried to our ears by a physical medium such as air. In an electrical system, where a microphone replaces the ears, sounds are transduced into a time-varying voltage that follows the vibration patterns present in the air. This is what is called the analog audio signal, a continuous signal that takes a continuum of values. In physics, an analog sound signal is usually considered a one-dimensional signal representing the air pressure on the microphone membrane.

The Sampling Theorem (Nyquist/Shannon)
In order to perform any sort of sound processing on a digital computer, the analog signal must be reduced to digital data, each datum representing a discrete value of the signal’s instantaneous voltage. The operation that transforms the analog signal into a digital signal is sampling. The sampling theorem states that, in order to accurately represent sound digitally, the sampling rate, defined as the frequency in Hz at which the sampling operation is performed, has to be at least twice (strictly, more than twice) the frequency band of the analog signal. The frequency band is determined by the maximum frequency contained in the signal. Since the (average) upper frequency limit of human hearing is considered to be 20 kHz, a sampling rate higher than 40 kHz must be chosen. This is enough to allow reconstruction of the original signal from its samples in a way that the human ear cannot distinguish from the original. The device performing this operation is called the analog-to-digital converter (ADC). At each sampling period (i.e. the inverse of the sampling rate), the ADC produces a string of binary digits, called a sample; samples are stored in memory in the exact order in which they are received. The inverse operation, from digital to analog, is performed by the digital-to-analog converter (DAC). The sampling rates normally used in computers to represent digital audio signals are 44.1 kHz and 48 kHz. The frequency equal to half the sampling rate is called the Nyquist frequency. The higher the sampling rate, the higher the Nyquist frequency and consequently the range of frequencies that can be represented (but also the demands on the speed and power consumption of the hardware).

Aliasing
Like any other analog-to-digital conversion, audio conversion may be affected by the problem of aliasing. Aliasing occurs because frequencies higher than half the sampling rate (the Nyquist frequency) may be present at the input of the ADC. This results in distortion of the original signal and can be heard, in acoustical terms, as an unwanted change in pitch¹, because frequencies above Nyquist are folded back to lower frequencies. The problem can be overcome by placing an anti-aliasing filter² before the ADC, which ensures that only signals below Nyquist enter the converter. A similar filter is also placed at the end of the audio chain, between the DAC and the speaker, for the same reason. Figure 3.1 shows a generic audio system.
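As a rough numerical illustration of the folding described above, the sketch below computes the frequency at which an input sine would appear after sampling when no anti-aliasing filter is present. The helper names `nyquist_frequency` and `alias_frequency` and the 30 kHz test tone are illustrative assumptions, not taken from the text.

```python
def nyquist_frequency(fs):
    """Half the sampling rate, in Hz."""
    return fs / 2.0

def alias_frequency(f, fs):
    """Apparent frequency (Hz) of a real sine of frequency f sampled at fs."""
    f_mod = f % fs                  # sampling cannot tell f from f + k*fs
    return min(f_mod, fs - f_mod)   # fold into the range 0 .. fs/2

fs = 44100.0
print(nyquist_frequency(fs))          # 22050.0
print(alias_frequency(1000.0, fs))    # below Nyquist: unchanged, 1000.0
print(alias_frequency(30000.0, fs))   # above Nyquist: folded down to 14100.0
```

The 30 kHz component, inaudible in the analog signal, would thus reappear as an audible 14.1 kHz tone, which is why the low-pass filter has to act before the converter.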

Dynamic Range and Signal-to-Noise Ratio
The dynamic range is the difference between the softest and the loudest sound that can exist in the system. It is expressed in decibels, because of their useful logarithmic compression of large numbers (e.g. doubling a power corresponds to an increment of about 3 dB). Since the decibel is a unit of measurement for ratios, in acoustics the dB is used to represent the ratio between the actual level of intensity and a reference level. The reference level normally adopted is the threshold of hearing, 10⁻¹² W/m². The maximum value of the dynamic range for human hearing is called the threshold of pain³ and is estimated at above 120 dB. The Hiroshima explosion was 180 dB. If the sound is particularly short the threshold of pain can increase, but it is better not to try...

¹ Pitch is a sound attribute, perceptually related to frequency. Pitch and the other perceptual attributes of music are presented in chapter 2.
² The simplest realization of an anti-aliasing filter is a low-pass filter with cutoff frequency equal to the Nyquist frequency. See the next section for an explanation of filters.

[Figure 3.1: Simple digital audio system. Signal chain: analog audio input, low-pass filter (Nyquist frequency), ADC, digital samples, MEMORY, DAC, low-pass filter (Nyquist frequency), analog audio output.]

While recording music, it is important to capture the widest possible dynamic range, in order to reproduce music in its fully expressive way. For example, recording an orchestra requires a wider dynamic range than recording a solo instrument [44]. The number of bits (Nbit) used to represent each sample⁴ has a direct influence on the maximum dynamic range of digital audio systems. The following simple formula can be used for this purpose, where the factor is 20·log10(2) ≈ 6.02 dB per bit:

\[ (DR)_{Max} = N_{bit} \cdot 6.02 \ \mathrm{dB} \]

Therefore, a 24-bit system may reach about 144 dB, much more than the threshold of pain. Noise must also be considered when speaking of dynamic range, because noisy sound components (not only the noise introduced by electronic devices, but also real noisy sounds) are always present in the proximity of the audio system and can alter, for example, the lower end of the dynamic range. The signal-to-noise ratio (SNR) compares the level of a given signal to the level of noise in the system. Noise can have a wide variety of meanings and also depends on the environment and on the sensitivity of the listener. SNR is also expressed in dB, so that a large value in dB means a clean sound. The SNR of a good audio system is often higher than 90 dB. Dynamic range and SNR are good indicators of the quality of an audio system, but not the only ones.
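The decibel relations used in this section can be checked numerically. The sketch below is only illustrative: it assumes the usual figure of 20·log10(2) ≈ 6.02 dB per bit for the dynamic range of an ideal converter, and all function names are hypothetical.

```python
import math

HEARING_THRESHOLD = 1e-12   # reference intensity in W/m^2, as in the text

def db_from_power_ratio(ratio):
    """Decibel value of a ratio between two powers (or intensities)."""
    return 10.0 * math.log10(ratio)

def intensity_level_db(intensity):
    """Sound intensity level relative to the threshold of hearing."""
    return db_from_power_ratio(intensity / HEARING_THRESHOLD)

def max_dynamic_range_db(n_bits):
    """Ratio between full scale and one quantization step, in dB
    (assumption: ideal converter, 20*log10(2) dB per bit)."""
    return 20.0 * math.log10(2 ** n_bits)

print(round(db_from_power_ratio(2.0), 2))    # doubling a power: about 3 dB
print(round(intensity_level_db(1.0), 1))     # 1 W/m^2: about 120 dB
for bits in (8, 16, 24):
    print(bits, "bit:", round(max_dynamic_range_db(bits), 1), "dB")
```

With these assumptions a 16-bit system spans roughly 96 dB and a 24-bit system roughly 144 dB, both figures being theoretical upper bounds for the converter alone.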

Quantization Error
Now we present the last (and one of the most important) factors determining digital audio quality: quantization. How many bits are needed to represent the sampled amplitude of the signal? Normally the answer is given by the maximum resolution of the ADC used for sampling. Obviously, the higher the resolution of the converter, the better the quality of the digitized sound. Since the number of bits is a finite integer n, only 2ⁿ values can be used to represent the original value; these are called quantization levels. When the system has to convert a value that does not coincide with one of these levels, rounding is necessary. The quantization error is the difference between the real value and the binary string used to represent it; it affects almost all samples and introduces quantization noise. In 16- and 24-bit ADCs the quantization noise is negligible.

³ Levels above this threshold can seriously damage the human hearing system.
⁴ That is, the quantization, explained in the next section.
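To make the notion of quantization error concrete, here is a minimal sketch, assuming a uniform quantizer over the range [-1, 1) and an arbitrary 440 Hz test tone; neither choice comes from the text, and the names `quantize` and `quantization_snr_db` are hypothetical.

```python
import numpy as np

def quantize(x, n_bits):
    """Round samples in [-1, 1) to the nearest of 2**n_bits uniform levels."""
    step = 2.0 / (2 ** n_bits)            # distance between adjacent levels
    q = np.round(x / step) * step         # round to the nearest level
    return np.clip(q, -1.0, 1.0 - step)   # stay inside the representable range

def quantization_snr_db(x, n_bits):
    """Ratio between signal power and quantization-noise power, in dB."""
    err = x - quantize(x, n_bits)         # quantization error, sample by sample
    return 10.0 * np.log10(np.mean(x ** 2) / np.mean(err ** 2))

fs = 48000
t = np.arange(fs) / fs
x = 0.9 * np.sin(2 * np.pi * 440.0 * t)   # one second of a test tone

for n_bits in (8, 16):
    print(n_bits, "bit SNR:", round(float(quantization_snr_db(x, n_bits)), 1), "dB")
```

The error never exceeds half a quantization step, and each additional bit improves the measured SNR by roughly 6 dB, which is consistent with the dynamic range formula given earlier.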

Digital Audio Signal, File Format and Perceptual Codec
Digital audio samples are finally grouped into files, to be stored on the hard drive of a digital computer. We can distinguish between two kinds of file formats: those obtained after a compression algorithm is applied and those that are not. In the non-compressed case, all samples are stored without changes and a header is added at the beginning of the file, containing the information required by a player to correctly reproduce the musical content (sampling frequency, number of channels, modulation, etc.). Compressed files were introduced to reduce the amount of space required for storage. The quality of the algorithm determines the quality of the compressed file (compared with the original, non-compressed one). In MP3, for example, the algorithm takes into account phenomena occurring in the perception of frequency by the human ear (frequency masking). It therefore performs a decimation (and an adaptive bit re-allocation) of the samples used to represent frequencies that the human ear cannot sense or distinguish with the resolution imposed by the sampling. For this reason, compressors such as MP3 are sometimes called perceptual codecs; under certain conditions, e.g. a bitrate higher than 128 kbps, perceptual codecs ensure a remarkable level of transparency, so that their output should be hard to distinguish from the original.


3.2 Digital Filters

The most general definition of a digital filter is: a computational algorithm that converts one sequence of numbers into another [18]. Thus, any digital device with an input and an output is a filter [44]. The advanced design of filters is beyond the scope of this thesis, but the basics have to be understood in view of chapters 5 and 6, where digital filters are implemented for specific purposes in sound analysis. In the following pages, instead of "implement", a term specifically used in computing, we may prefer "design", which conveys more or less the same information, that is, the way the filters are realized.

Digital filters have been part of integrated circuits since 1950. From the 1960s, the so-called Z-transform⁵ was introduced to standardize the mathematical representation of a filter’s behavior. In sound synthesis programming languages, discussed in the next chapter, digital filters appeared in the early 1960s, with MUSIC IV⁶. Only later, in the 1980s, when the cost of hardware fell, did real-time digital filters take on the most important role in widespread low-cost applications, such as synthesizers, effects units and digital mixers. Historically, the most common use of filters, at least in computer music, was that of boosting, attenuating or separating regions of the sound spectrum. All these operations imply processing in the frequency domain. However, since filters also carry out other important sound processing techniques, such as reverberation and delay, the effect of filtering should not be regarded as related to the frequency domain only. As we will see very soon, the time structure of the signal can also be altered by filtering.

3.2.1 Filters Background

The frequency response
All filters can be characterized by their frequency response. The best-known frequency responses are: low-pass, high-pass, band-pass and band-reject. The frequency response consists of two parts: the amplitude response, shown in figure 3.2 for the four basic types of filters, and the phase response. The amplitude response is the ratio of the amplitude of the output signal to that of the input signal, as a function of frequency. The phase response (also varying with frequency) is the amount of phase alteration undergone by the signal passing through the filter. Sometimes it is expressed as a phase delay, that is, the phase shift converted into a time delay, expressed in ms.
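As a hedged illustration of the two parts of the frequency response, the sketch below evaluates amplitude and phase for a deliberately trivial digital filter, a two-point average, which behaves as a crude low-pass. The filter, the sampling rate and the function name `frequency_response` are assumptions made for this example; the response is computed from the filter's impulse response, a notion introduced later in this chapter.

```python
import numpy as np

def frequency_response(h, fs, n_fft=1024):
    """Amplitude (dB) and phase (rad) response of an impulse response h."""
    H = np.fft.rfft(h, n_fft)                      # zero-padded DFT of h
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)     # 0 .. fs/2 in Hz
    amp_db = 20.0 * np.log10(np.maximum(np.abs(H), 1e-12))  # avoid log(0)
    phase = np.angle(H)
    return freqs, amp_db, phase

fs = 48000.0
h = np.array([0.5, 0.5])              # two-point average (assumed example)
freqs, amp_db, phase = frequency_response(h, fs)

k = 100                               # an arbitrary bin inside the pass-band
phase_delay_ms = -phase[k] / (2.0 * np.pi * freqs[k]) * 1000.0
print("gain at 0 Hz:", amp_db[0], "dB")
print("gain at Nyquist:", amp_db[-1], "dB")
print("phase delay at", freqs[k], "Hz:", phase_delay_ms, "ms")
```

For this particular filter the phase delay equals half a sampling period at every frequency below Nyquist, so all components are delayed by the same amount of time.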
⁵ The Z-transform converts a discrete time-domain signal, a sequence of real or complex numbers, into a complex frequency-domain representation. See later in the text for more details.
⁶ The fourth release of the MUSIC series, the last developed by Max Mathews at Bell Labs.


[Figure 3.2: Amplitude (A, in dB) response versus frequency (f, in Hz) for the four basic types of filters. Panels: LOW-PASS FILTER, HIGH-PASS FILTER, BAND-PASS FILTER, BAND-REJECT FILTER.]

Pass-band and stop-band, cutoff and center frequency
The pass-band or bandwidth of a filter is defined as the frequency region in which the filter has no effect on the signal (or at most a slight attenuation), while the stop-band is the frequency region where a strong attenuation is applied to the signal. In all kinds of filters there is always a smooth transition between the pass-band and the stop-band (and vice versa), normally called the transition band. The most important characteristic associated with the transition band is the cutoff frequency fc. Conventionally, fc is chosen as the frequency at which the transmitted power is reduced to one half (i.e. −3 dB) of the maximum power in the pass-band. In low-pass and high-pass filters, the cutoff frequency determines the bandwidth of the filter, i.e. the extent of the frequency range of the signal passed by the filter. In band-pass and band-reject filters, the bandwidth is delimited by two cutoff frequencies, fu and fl, which stand for the upper and lower limits of the band. Consequently, the center frequency fo of a band-pass (or band-reject) filter is defined as

\[ f_o = \frac{1}{2} \cdot (f_u + f_l) \]

These characteristics are shown in figure 3.3 for the case of a band-pass filter. The rate at which the attenuation increases in the stop-band is called rolloff. In filters for musical applications, the rolloff is frequently measured in dB/octave⁷ of decrease across the transition band. The slope of the transition band is determined by the order⁸ of the filter. In analog filters, the order is determined by the number of (reactive) electronic components used to realize the filter. For digital filters the matter is more elaborate, as explained later in the chapter.

[Figure 3.3: The pass-band or bandwidth of a band-pass filter is the difference between the upper and lower cutoff frequencies. The cutoff frequencies are defined as the frequencies at which the amplitude (more precisely, the power) is half the pass-band value. In the figure, 40 dB is assumed as the maximum level in the pass-band. Axes: A [dB] versus f [Hz].]

Selectivity and quality factor (Q)
The bandwidth of a band-pass filter is also called the selectivity of the filter, and is useful in quantifying the quality factor, Q.
⁷ An octave is the interval between two points where the frequency at the second point is twice the frequency at the first.
⁸ The order is a mathematical measure of the filter’s complexity.


Figure 3.4: Example of application of a constant-Q filter. Here the center frequencies are tuned to generic musical octaves. In music, an octave is the interval between one musical pitch and another with half or double its frequency.

The Q, in filters for musical purposes, can be interpreted as the degree of resonance of a band-pass filter [18] and is given by:

\[ Q = \frac{f_c}{BW} \]

Here, BW stands for bandwidth and fc is the center frequency of the band-pass filter. Hence, a higher Q implies a narrower bandwidth. As an example, a high Q is needed when the goal is to extract a single frequency component from a signal. A solution to this task will be presented later in this thesis. A special type of band-pass filter is the constant-Q filter, which is widely used in sound analysis. This filter varies its bandwidth as a function of the center frequency, keeping its Q fixed. For example, with fixed Q = 5 and fc = 300 Hz the bandwidth is 300/5 = 60 Hz. Shifting the center frequency to 6 kHz, the bandwidth becomes much wider, 6000/5 = 1200 Hz. This type of filter has two attractive characteristics in musical applications. First, since the energy of a sound is normally concentrated at low frequencies and spreads out toward high frequencies, a constant-Q band-pass filter extracts a narrower band when its center frequency is set in the low-frequency region, and a wider one as the center frequency moves toward high frequencies. Second, the bandwidths of this kind of filter, plotted against log frequency, appear constant. This behavior, presented in figure 4.6, is quite similar to the perception of frequency in the human auditory system, as explained in the first part of chapter 2.

Filter combination, parallel and cascade
The four basic types of filters can be combined to form more complex filter designs. Two fundamental methods of combination are possible: parallel and cascade.


Parallel connection allows the filters to operate on the same signal at the same time. The output signal is the sum of all the filters’ outputs; that is, the frequency response of a parallel connection is the sum of the individual frequency responses. For instance, a band-reject filter can be obtained by connecting a low-pass and a high-pass filter in parallel. An interesting example of parallel connection is the constant-Q filterbank, which consists of an array of constant-Q band-pass filters that separates the input signal into several components, each carrying a single frequency subband⁹ of the original signal. For musical purposes, these subbands are normally non-overlapping and exponentially distributed, e.g. over the whole frequency range of human hearing, between 20 Hz and 20 kHz. A special type of constant-Q filterbank is historically represented by the octave filterbank; in particular, third-octave filterbanks have been standardized for use in audio analysis¹⁰ [12]. In a third-octave filterbank, the center frequencies of the various bands are exponentially spaced along the frequency axis, as described by the formula:

\[ f_c[k] = 2^{k/3} \cdot 1000 \ \mathrm{Hz} \]

where f_c[k] is the center frequency of the k-th filter of the array; here, as an example, the filter with k = 0 is centered at f_c[0] = 1000 Hz. The bandwidth of the k-th filter is proportional to the k-th center frequency, as the following formula states:

\[ BW[k] = f_c[k] \cdot \frac{2^{1/3} - 1}{2^{1/6}} \]

Therefore, since the expression of the bandwidth contains the center frequency, the quality factor Q[k] = f_c[k]/BW[k] is constant for all k filters.

Cascade connection, also called series connection, is the other way to connect filters to each other. In this case, the signal passes through a series of filters, one by one, in the order in which they are linked. The direct consequence is that the overall amplitude response becomes the product (thus the sum in dB) of the individual filter responses, while the overall filter order becomes the sum of the individual filter orders [18]. For instance, cascading two or more low-pass filters with the same frequency response makes it easy to obtain a steeper rolloff, i.e. greater attenuation around the cutoff frequency. Cascade connection of filters may be critical in some cases; for example, much care must be taken when designing a series of filters with different bandwidths. The filters of the cascade must guarantee that significant energy passes at least through a common range of frequencies, otherwise the output could be inaudible.
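The relations above between center frequency, bandwidth and Q, together with the dB rule for cascades, can be verified with a few lines of code. This is only a sketch: the function names are hypothetical, and the −3 dB stage gains in the last example are an assumed illustration rather than values taken from the text.

```python
def constant_q_bandwidth(fc, q):
    """Bandwidth (Hz) of a constant-Q band-pass filter: BW = fc / Q."""
    return fc / q

def third_octave_band(k, f_ref=1000.0):
    """Center frequency and bandwidth (Hz) of the k-th third-octave band."""
    fc = f_ref * 2.0 ** (k / 3.0)
    bw = fc * (2.0 ** (1.0 / 3.0) - 1.0) / 2.0 ** (1.0 / 6.0)
    return fc, bw

def cascade_gain_db(gains_db):
    """Overall gain of filters in series: amplitude ratios multiply, dB add."""
    return sum(gains_db)

# The constant-Q example from the text (Q = 5).
print(constant_q_bandwidth(300.0, 5.0), constant_q_bandwidth(6000.0, 5.0))

# Third-octave bands from one octave below to one octave above 1 kHz:
# the ratio fc / BW printed in the last column is the same for every band.
for k in range(-3, 4):
    fc, bw = third_octave_band(k)
    print(k, round(fc, 1), round(bw, 1), round(fc / bw, 3))

# Two identical low-pass stages, each at -3 dB at its cutoff frequency.
print(cascade_gain_db([-3.0, -3.0]))
```

Under these formulas the quality factor of a third-octave band is about 4.32 regardless of k, which is what makes the filterbank a constant-Q one.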
⁹ A submultiple (portion) of the signal’s bandwidth.
¹⁰ Third-octave filters are useful because they correlate well with the subjective response of the human ear.


Impulse response and time-domain effect of filtering
As well as being characterized by its frequency response, every filter has an impulse response. The impulse response is the time-domain description of the filter’s response to very short pulses, approximations of the mathematical unit impulse function (or Dirac delta¹¹). Studying the filter’s response to a unit impulse can be useful to determine the filter’s response to any short-time change of the input signal. The impulse response can therefore be interpreted as the response most closely related to the time-varying characteristics of the signal. That is why filters are sometimes designed to have a specific impulse response, rather than a specific frequency response. What the frequency response shows is the filter’s response after a stable output is reached, that is, after a long enough time has elapsed from the beginning of the filter’s operation. However, in order to understand the effect of filtering, a time-domain point of view is equally important. To anticipate what follows, which includes music-related content, figure 3.5 is proposed as an example. What we say about sound refers, in the figure, to a pure sinusoidal tone, the simplest sound produced by an oscillator¹². Every sound can be delimited by the so-called attack and decay portions, which roughly correspond to the sudden start and the softer end of a sound. It has been demonstrated that they play a role in one of the most important time-related features of the human auditory system, the perception of events. We recall this observation here because it will be important in the rest of this thesis. Thus, an alternative way to design a filter is to derive its impulse response, and one way to do this is to observe how the filter reacts at the beginning and at the end of the signal passing through it.
These are two particular cases of the so-called transient response, which introduces another fundamental aspect related to sound. Transients will be discussed in more detail in chapter 6. However, a close relation exists between impulse and transient response [18]. Referring again to figure 3.5, the transient response is evident: it acts as a time-stretching operation applied to the initial portion of the sine wave and, similarly, to its end. The transient response, at least in this simple case, can be associated with two time intervals: the time required by the filter to produce the steady-state output after the attack transient occurs, and the time spent before the sound dies away after the decay transient. This behavior becomes very dangerous when a large number of filters are connected in cascade: since each of them affects the duration of the sound, unwanted distortions can occur.

¹¹ The Dirac delta definition is:
\[ \delta(t) = \begin{cases} \infty, & \text{if } t = 0 \\ 0, & \text{otherwise} \end{cases} \qquad \text{with the constraint} \quad \int_{-\infty}^{\infty} \delta(x)\, dx = 1 \]
¹² The digital oscillator is the simplest sound generator in computer music. It represents the propagation (in the air) of a single sound wave at a certain frequency and amplitude. It is still present in current computer synthesis programs and was introduced in 1960 as the fundamental Unit Generator in Music 3 by Max Mathews.

Figure 3.5: Alteration of the envelope of a tone (INPUT) passed through a narrow filter (OUTPUT). The output envelope has been stretched in time during the onset and offset of the tone (its initial and final portions).
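The stretching of onset and offset can be reproduced with a small experiment. The sketch below is not the filter used for figure 3.5: it assumes a simple two-pole resonator as the narrow band-pass filter, with hypothetical parameters (8 kHz sampling rate, 1 kHz tone, pole radius r = 0.999), and compares the output level shortly after the onset, in the steady state and shortly after the offset.

```python
import numpy as np

def resonator(x, f0, fs, r):
    """Two-pole band-pass resonator centered at f0 (assumed example filter).
    r < 1 is the pole radius: the closer to 1, the narrower the band
    and the longer the transient."""
    w0 = 2.0 * np.pi * f0 / fs
    a1, a2 = 2.0 * r * np.cos(w0), -r * r
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] += a1 * y[n - 1]
        if n >= 2:
            y[n] += a2 * y[n - 2]
    return y

fs, f0 = 8000.0, 1000.0
n_tone, n_tail = 8000, 2000               # one second of tone, then silence
n = np.arange(n_tone)
x = np.concatenate([np.sin(2.0 * np.pi * f0 * n / fs), np.zeros(n_tail)])
y = resonator(x, f0, fs, r=0.999)

steady = np.max(np.abs(y[6000:8000]))     # level once the output has settled
early = np.max(np.abs(y[:100]))           # level just after the onset
tail = np.max(np.abs(y[8000:8100]))       # level just after the input stops
print("onset/steady:", early / steady, "offset/steady:", tail / steady)
```

The input tone starts and stops abruptly, whereas the output is still far below its steady level shortly after the onset and keeps ringing after the input has ended: the envelope is stretched at both ends, as described for figure 3.5.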

3.2.2 Introduction to Digital Audio Processing with Filters

In the first part of the chapter, we described the process by which a generic audio signal is transformed into digital samples and stored on the hard disk drive of a digital computer. Thus, the simplest use of the computer in an audio system is the digital recording and reproduction of sound files. The next step is represented by all the useful and/or aesthetic operations that can be performed on digital audio signals. This is the vast area of digital audio processing, in which filters play a decisive role. In digital audio processing, four fields of particular interest can be isolated, because of their extensive treatment in the literature: sound mixing ([44]), delays and effects ([63] for all), sound spatialization and reverberation¹³ ([63], and [41][15][55] for implementation examples) and sound modeling¹⁴ ([44][52][33]). Useful applications of sound processing can be found in all widespread digital audio systems, e.g. in digital music players (MP3, CD, MiniDisc, etc.). Compressors, limiters, expanders, noise gates and noise reduction are only a few examples of the processing normally applied to the dynamic range of the music we hear, coming from almost every digital music medium. Digital filters have a primary role in almost all sound processing applications.

¹³ Sound spatialization is essentially the movement of sound through space. Dolby Digital Surround is one of the best-known techniques.
¹⁴ For example, modeling a sound by applying physical knowledge.

Now that classical filter theory has been briefly presented in the previous sections, the hard task is to transpose some of those concepts to the discrete world of quantized samples. For that purpose, a general and expressive definition coming from LTI system theory will be specialized to the case of digital filters; first, however, some clarifications are needed. Systems that do not change their behavior in time and fulfill the superposition property¹⁵ are called linear time-invariant (LTI); their most important property is that they can be completely characterized by their impulse response. The impulse response is the behavior of such systems for a short impulse input. Hence, the general definition we adopt to make filter realizations explicit is the following: the output signal of an LTI system is given by the convolution of the impulse response with the input signal. The assumption here is therefore that filters are LTI systems.

Impulse Response
The general definition of the impulse response of a filter is the response of the filter when fed with a short pulse. The short pulse can be considered a test signal through which the characteristics of the filter are laid bare. The common test signal used in LTI systems, such as filters, is the unit impulse, defined as:

\[ UI(t) = \begin{cases} 1, & \text{if } t = 0 \\ 0, & \text{otherwise} \end{cases} \]

In the case of discrete systems, such as digital filters, the unit impulse is obtained by substituting t with n and enclosing the sample index in square brackets [·] to avoid ambiguity. Therefore, for discrete LTI systems, the unit impulse can be rewritten as follows:

\[ UI[n] = \begin{cases} 1, & \text{if } n = 0 \\ 0, & \text{otherwise} \end{cases} \]

which can be seen as a one-sample impulse. In digital terms, the briefest possible signal (the approximation of the unit impulse) is exactly a single sample, which contains energy at all the frequencies below Nyquist¹⁶. By definition, the output signal of a filter fed with the unit impulse is the impulse response, henceforth simply called IR. Since the unit impulse contains all the frequencies representable in the signal, the IR can also be seen as the time-domain representation of the amplitude-versus-frequency response, presented earlier as the frequency response. The bridge between the two domains is represented by convolution.

Convolution

Convolution is a generic signal processing operation, like addition or multiplication, but it is of much greater interest because convolving two signals in the time domain is equivalent to multiplying their spectra in the frequency domain. That is why convolution is considered the bridge between the two domains. Convolution is a fundamental operation in digital audio processing in general and in filters in particular. Let us see how it works, starting from the definition given above for LTI systems, now specialized to filters: the output signal y[n] of every digital filter is given by the convolution of the impulse response of the filter with the input signal x[n]:

\[ y[n] = x[n] * h[n] \]

where ∗ denotes convolution and h[n] is the impulse response. When the impulse response h[n] is itself the one-sample impulse, acting as unit impulse, convolution proves to be an identity operation:

\[ y[n] = x[n] * UI[n] = x[n] \]

That is, every signal convolved with the unit impulse remains the same. In signal processing terms, two further properties of convolution deserve attention. Convolving the input signal with a scaled version of the unit impulse gives:

\[ y[n] = x[n] * (c \cdot UI[n]) = c \cdot x[n] \]

and convolving the input signal with a copy of the unit impulse delayed by t samples, by means of time-shifting, gives:

\[ y[n] = x[n] * UI[n - t] = x[n - t] \]

¹⁵ In the case of filters, the superposition property states: when two signals are added together and fed to the filter, the filter's output is the same as if the two signals were put through the filter separately and the outputs then added.
¹⁶ According to Fourier's theory, an inverse relationship exists between the duration of a signal and its frequency content: the shorter the signal, the wider the spectrum.

Figure 3.6: Echo and reverberation effects explained by convolution.

That is, convolving the input signal with a scaled or time-shifted unit impulse is the same as scaling or time-shifting the input signal. Consequently, any input signal can be represented as a sequence of scaled and delayed unit impulse functions. Moreover, easily recognizable effects in sound systems, such as echo and reverberation, can be recreated by a very simple but appropriate design of the IR, as shown in figure 3.6. In the case of the reverberation effect, shown on the right side of figure 3.6, the time-smearing¹⁷ effect occurs when the two time-shifted unit impulse functions are too close relative to the duration of the sounds, so that the first sound cannot be separated from its replica. When such replicas are dense, they take the form of reverberation. The law of convolution, applied to computer music, states that the convolution of two waveforms¹⁸ in the time domain is equal to the multiplication of their spectra in the frequency domain. This is a fundamental concept in sound processing, because any transformation applied to sound in the time domain has a direct correspondence in the frequency domain, and vice versa.

¹⁷ See chapter 3. Time-smearing is a phenomenon which occurs when two sounds close in time cannot be separated by the time resolution of the ear.
¹⁸ From chapter 3, waveform will be used to denote the analogue sound signal in the time domain.

Finally, the mathematical definition of discrete convolution, applied to two generic finite-length input signals a[n1] and b[n2], is:

\[ a[n_1] * b[n_2] = \sum_{m=0}^{n_1 - 1} a[m] \cdot b[n_2 - m] = y[k] \]
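The definition above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `convolve` and the test signals are hypothetical, and samples outside the finite sequences are treated as zero.

```python
def convolve(a, b):
    """Direct (time-domain) convolution of two finite-length sequences."""
    length = len(a) + len(b) - 1           # length of the result
    y = [0.0] * length
    for n in range(length):
        for m in range(len(a)):
            # b[n - m] is taken as zero outside the range of b
            if 0 <= n - m < len(b):
                y[n] += a[m] * b[n - m]
    return y


x = [1.0, 2.0, 3.0]
print(convolve(x, [1.0]))                  # unit impulse: identity
print(convolve(x, [0.5]))                  # scaled impulse: scaled signal
print(convolve(x, [0.0, 0.0, 1.0]))        # delayed impulse: delayed signal
print(convolve(x, [1.0, 0.0, 0.0, 0.5]))   # impulse plus quieter replica: echo
```

The last call uses an impulse response made of a unit impulse followed by a quieter, delayed copy, which is the echo case described for figure 3.6.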

To strengthen the analogy with filters, the formula may be interpreted in this way: a[n1] acts as a weighting function (such as the IR) for each delayed copy of b[n2] (i.e. the input signal). The result of the operation, y[k], is k samples long, with:

\[ k = \mathrm{length}(a[n]) + \mathrm{length}(b[n]) - 1 \]

This way of computing convolution, one sum for each output sample, is called direct convolution. The direct form is computationally intensive, requiring on the order of N² operations, where N is the length of the longer of the two inputs. A faster way to implement convolution on digital computers has been found. It applies the FFT¹⁹ algorithm to both operands of the convolution; the two resulting spectra are multiplied, and the product is finally brought back to the time domain through the IFFT²⁰ algorithm (when the input is processed in blocks, the partial results are finally summed). Fast convolution drastically reduces the computational complexity, to the order of N log N operations.

¹⁹ Fast Fourier Transform.
²⁰ Inverse Fast Fourier Transform.

Transfer function and frequency response

The frequency-domain description of a digital filter reflects its ability to pass, reject or enhance certain frequencies of the input signal spectrum. The common terms used to describe the characteristics of filters in the frequency domain are the transfer function H(z) and the frequency response H(f). Both can be obtained by means of mathematical transforms applied to the impulse response. Since the transfer function is considered a useful tool in filter design, especially in the electronics literature, the Z-transform must be introduced. Here is the definition of the Z-transform of a generic digital signal x[n]:

\[ X(z) = \sum_{n=-\infty}^{\infty} x[n] \cdot z^{-n} \]

which is related to the Discrete Fourier Transform, introduced later, by the substitution z = e^{jω}. The transfer function is obtained by applying the Z-transform to the IR h[n] of the filter, as follows:

\[ H(z) = \sum_{n=-\infty}^{\infty} h[n] \cdot z^{-n} \]

while the frequency response can be obtained by applying the DFT to the IR of the filter.
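As a hedged illustration of the relation between the two descriptions, the sketch below evaluates the Z-transform of a finite impulse response and then restricts it to the unit circle, z = e^{jω}. The function names and the two-sample impulse response are assumptions chosen for the example, not part of the thesis.

```python
import cmath
import math


def transfer_function(h, z):
    """Evaluate H(z) = sum_n h[n] * z**(-n) for a finite impulse response h."""
    return sum(h_n * z ** (-n) for n, h_n in enumerate(h))


def frequency_response(h, omega):
    """Evaluate H on the unit circle, z = e^{j*omega} (omega in radians per sample)."""
    return transfer_function(h, cmath.exp(1j * omega))


h = [0.5, 0.5]  # illustrative impulse response: average of two adjacent samples
for omega in (0.0, math.pi / 2, math.pi):
    print(omega, abs(frequency_response(h, omega)))
```

For this impulse response the magnitude falls from 1 at ω = 0 to 0 at ω = π (the Nyquist frequency), i.e. the filter behaves as a low-pass.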

3.2.3 Digital implementation of filters

Traditionally, digital filter realizations have been classified into two large families:
• non-recursive filters
• recursive filters
These names come from the nature of the algorithms used to implement those filters. From the transfer function point of view, these two categories can be reformulated as follows:
• filters whose transfer function has no denominator
• filters whose transfer function has a denominator
In both cases, every output sample is calculated as a combination of previous input and/or output samples. Following the order of the previous classification, the possible combinations are:
• the current input sample with past input samples
• past output samples with the current and, sometimes, past input samples
Since digital filters base their output on combinations of past input/output samples, they imply the concept of "memory". On computers, the first kind of realization is easier than the second. The concept of delay line²¹ is central here. A generic scheme of a delay line is presented in figure 3.7. The amount of delay required depends on the memory needed by the specific filter design. This delay determines the number of memory cells dedicated to storing the delayed samples. As an obvious consequence, the more past samples are taken into account, the greater the storage space required and the higher the computational cost. Finally, since the two possible realizations of filters depend on the nature of their IR, the last and most commonly used subdivision of filters is:
• Finite Impulse Response filters (FIR filters)
• Infinite Impulse Response filters (IIR filters)

²¹ Recirculating memory unit, whose purpose is to delay the incoming signal by an established number of samples. See [41] for more on delay lines and digital implementation.

Figure 3.7: Simple delay line.

3.2.4 FIR Filters

In FIR filters, the response to an impulse input decays within a finite time. Conversely, in IIR filter realizations, the impulse response theoretically never dies out. FIR filters are easier to implement but slower than IIR filters; IIR filters are fast, but their practical implementation is somewhat harder. A FIR filter is a linear combination of a finite number of samples of the input signal:

\[ y[n] = \sum_{m=0}^{N} h[m] \cdot x[n - m] \]

In the equation above, the convolution formula given earlier can easily be recognized. Here h[m] is the finite impulse response, typical of FIR filter realizations. The time extension of the impulse response determines the length of the filter, which is N + 1. As introduced above, the transfer function is obtained by applying the Z-transform to the impulse response, which results in:

\[ H(z) = \sum_{m=0}^{N} h[m] \cdot z^{-m} = h[0] + h[1] \cdot z^{-1} + h[2] \cdot z^{-2} + \ldots + h[N] \cdot z^{-N} \]

The simplest example of a FIR filter is the first-order low-pass filter, which takes into account only the previous input sample in addition to the current one. The formula of this kind of filter is the following:

\[ y[n] = 0.5 \, (x[n] + x[n - 1]) \]

Similarly, to obtain a high-pass filter, again of the first order, we simply change the sign of the operation:

\[ y[n] = 0.5 \, (x[n] - x[n - 1]) \]

The general formula for FIR filters follows:

\[ y[n] = a_0 \cdot x[n] \pm a_1 \cdot x[n - 1] \pm \ldots \pm a_i \cdot x[n - i] \]

In order to run an N-th order FIR filter we need to have, at any instant, the current input sample together with the sequence of the N preceding samples. These N samples constitute the memory of the filter. In practical implementations, it is customary to allocate the memory in contiguous cells of the data memory or, in any case, in locations that can be easily accessed sequentially. At every sampling instant, the state must be updated in such a way that x(k) becomes x(k + 1), and this seems to imply a shift of N data words in the filter memory. Instead of moving data, however, it is convenient to move the indexes that access the data. As an example, three memory words are put in an area organized as a circular buffer (see figure 3.8). The input is written to the word pointed to by the index, and the three preceding values of the input are read with the three preceding values of the index. At every sample instant, the four indexes are incremented by one, with the trick of restarting from location 0 whenever we exceed the length M of the buffer (this ensures the circularity of the buffer). The counterclockwise arrow indicates the direction taken by the indexes, while the clockwise arrow indicates the movement that the data would have to make if the indexes stayed in a fixed position. As a matter of fact, a FIR filter contains a delay line, since it stores N consecutive samples of the input sequence and uses each of them with a delay of at most N samples. The points where the circular buffer is read are called taps, and the whole structure is called a tapped delay line.
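The tapped delay line just described can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the class name `CircularFIR` is hypothetical, and for simplicity the buffer holds N + 1 words (the current sample together with the N preceding ones), so that only the write index moves and no data is shifted.

```python
class CircularFIR:
    """FIR filter whose input history lives in a circular buffer."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)             # h[0] ... h[N]
        self.buffer = [0.0] * len(coeffs)      # current sample + N past samples
        self.index = 0                         # position of the next write

    def process(self, x):
        """Consume one input sample and return one output sample."""
        self.buffer[self.index] = x
        y = 0.0
        for m, h_m in enumerate(self.coeffs):
            # Tap m reads the sample written m steps ago; the modulo wraps around.
            y += h_m * self.buffer[(self.index - m) % len(self.buffer)]
        self.index = (self.index + 1) % len(self.buffer)
        return y


lowpass = CircularFIR([0.5, 0.5])      # y[n] = 0.5 (x[n] + x[n-1])
highpass = CircularFIR([0.5, -0.5])    # y[n] = 0.5 (x[n] - x[n-1])
signal = [1.0, 1.0, 1.0, -1.0, 1.0, -1.0]
print([lowpass.process(x) for x in signal])
print([highpass.process(x) for x in signal])
```

The test signal starts with a constant stretch and ends with an alternating one: the low-pass output settles on the constant part and cancels the alternation, while the high-pass output does the opposite.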

Figure 3.8: Circular buffer.

3.2.5 IIR Filters

The filters of the second family admit only recursive realizations; thus the impulse response of these filters is infinitely long, which justifies their name, Infinite Impulse Response (IIR) filters. In general, an IIR filter is represented by a difference equation in which the output signal at a given instant is obtained as a linear combination of samples of the input and output signals at previous time instants. The simplest nontrivial IIR filter that can be conceived, the one-pole filter with coefficients a1 = 1/2 and b0 = 1/2, is defined by:

\[ y[n] = 0.5 \, (y[n - 1] + x[n]) \]

and the transfer function of this filter is:

\[ H(z) = \frac{1/2}{1 - \frac{1}{2} z^{-1}} \]

Due to the advanced forms in which digital filters can be designed, the result obtained could be even more precise than the analog counterpart.
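The one-pole filter defined above can be tried out with a short sketch. The function name `one_pole` and the initial condition y[-1] = 0 are assumptions made for the example.

```python
def one_pole(x, a1=0.5, b0=0.5):
    """y[n] = a1 * y[n-1] + b0 * x[n], starting from y[-1] = 0."""
    y = []
    previous = 0.0
    for sample in x:
        previous = a1 * previous + b0 * sample
        y.append(previous)
    return y


impulse = [1.0] + [0.0] * 7
print(one_pole(impulse))
```

Fed with a unit impulse, the filter returns 0.5, 0.25, 0.125 and so on: each output is half the previous one, so the response keeps decaying but never reaches exactly zero in theory, which is the "infinite" impulse response.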


Chapter 4 ...To Sound Spectrum Analysis


4.1 Introduction to Sound Analysis in the Frequency Domain

In this chapter, when we speak of analog signals we will usually refer to waveforms. The spectral analysis of musical sounds is the legacy of Fourier analysis. Although other methods have been explored, Fourier concepts are still applied in nearly every digital sound application. The soundness of Fourier analysis lies in its representation, which highlights characteristics similar to those that psychoacoustics attributes to the auditory system. We mentioned before, at the end of chapter 2, that the operations performed on the sampled signal serve different goals: to obtain a different output (i.e. sound processing, in which filters have a dominant role), to generate output from nothing (i.e. synthesis), or to analyze the digital samples in order to predict some of the sound characteristics. The principal tasks in sound analysis are pitch recognition, timbre perception, rhythm recognition and bpm extraction. In some cases analysis is only the first stage toward the reconstruction of the original signal, that is, resynthesis. Audio synthesis is definitely different from the above, because it is the process of generating sounds, and this can be done in so many ways that a suitable discussion in this thesis would be too long. In general, in digital music systems, we can distinguish between:
• Audio synthesis, the process of generating a stream of audio samples by algorithmic means.
• Audio analysis, which takes a digital signal (leaving the stream of samples unaltered) and mathematically determines its characteristics.
Those systems are represented in figure 4.1.

Digital audio synthesis
Audio synthesis is the other great challenge in digital audio. Since no sound is generated by synthesis for the purposes of this thesis, we will treat the subject only marginally. Everything started with the need to let the computer speak in a human (and not humanoid) way. Since the advent of psychoacoustics and vocoder techniques¹, several studies have made it possible to implement the physical findings about the human auditory system. Hence the question: why not try to replicate with the computer the most sophisticated sound produced by the human body, speech?
¹ A phase vocoder is a type of vocoder which can scale both the frequency and time domains of audio signals by using phase information. The computer algorithm allows frequency-domain modifications to a digital sound file (typically time expansion/compression and pitch shifting).

Computer sound synthesis started in 1957 with Max Mathews' MUSIC 1. MUSIC 3, in 1960, introduced the concept of the Unit Generator (UG), the simplest instrument available to the computer and the greatest change in the computer sound programmer's approach. With a UG one can create a sine wave to produce an oscillator; with logical and arithmetic UGs one can multiply two oscillators to produce another sound, design the frequency and impulse response of filters, combine filters with oscillators to create new, more complex sounds, and so on, to infinity. UGs quickly increased in complexity, in parallel with the rapid rise of microelectronics, becoming one of the typical features of most music programming languages. With the subsequent development of faster algorithms, music synthesis has been widely extended into many research areas. Curtis Roads says about synthesis:

After Max Mathews in 1957, dozens of sound synthesis techniques have been invented. As in the field of computer graphics, it is difficult to say at any time which techniques will flourish and which will fade over time. This situation is fueled by competitive pressure in the music industry, making it inevitable that synthesis methods fall in and out of fashion, because no one of these methods can satisfy [44].

Just as a reminder, here are some of the synthesis methods, in no particular order:
• wavetable synthesis
• sampling synthesis
• additive synthesis
• subtractive synthesis
• sinusoidal synthesis
• granular synthesis
• modulation synthesis
• physical modeling synthesis
• formant synthesis
• residual synthesis
• graphic synthesis
• stochastic synthesis

Figure 4.1: Digital sound synthesis and sound analysis. [Block diagram. Sound analysis: analog audio input, low pass filter, ADC, computer (loudness, pitch, rhythm, bpm, ...). Sound synthesis: computer (algorithms, wavetables, oscillators, ...), DAC, low pass filter, analog audio output.]

Digital audio analysis
Any sound can be interchangeably represented in the time domain by a waveform or in the frequency domain by a set of spectra [54]. Thus, in the following pages, waveform is the term adopted to refer to the audio signal. Three main aspects are treated in analyzing sounds: pitch recognition, rhythm detection and spectrum analysis (which is also helpful for the other two). Many synthesis techniques are based on data produced by analyzing sound; in these cases we speak of analysis/resynthesis or analysis/synthesis techniques. To understand this, imagine a peak detector applied to a musical stream, and think of using the frequency and amplitude of every detected peak (derived from the analysis) to drive digital oscillators.

4.2 Introduction to the Fourier Analysis

Spectral modelling techniques are the legacy of the Fourier analysis theory. Originally developed in the nineteenth century, Fourier analysis considers that a pitched sound is made up of various sinusoidal components, where the frequencies of higher components are integral multiples of the frequency of the lowest component. The pitch of a musical note is then assumed to be determined by the lowest component, normally referred to as the fundamental frequency. In this case, timbre is the result of the presence of specific components and their relative amplitudes, as if it were the result of a chord over a prominently loud fundamental with notes played at different volumes. Despite the fact that not all interesting musical sounds have a clear pitch and the pitch of a sound may not necessarily correspond to the lowest component of its spectrum, Fourier analysis still constitutes one of the pillars of acoustics and music. [36]

Origin
Since Jean Baptiste Joseph, Baron de Fourier, published his revolutionary theory in 1822, the events that made history can be traced in rapid succession:
• 1870: first mechanical harmonic analyzer,
• 1898: first mechanical harmonic analyzer that could be reversed into a waveform synthesizer,
• 1930: the advent of analog filters made spectrum analysis possible,
• 1940: first digital implementation on computers,
• 1960: the advent of the FFT algorithm greatly reduced the computation required by the Fourier transform,
• 1977: advent of the STFT, the short-time Fourier transform, widely used in music systems.
What Fourier stated, in a few words, was that a complex but periodic signal can be seen as a sum of simple signals. In a musical context this means that a periodic waveform can be decomposed into a combination of simple sinusoidal waves, each with its own amplitude, frequency and phase. On digital computers a sine wave is generated by an oscillator (the first UG was an oscillator), which produces sound from a sine wave with only three parameters: amplitude, frequency and phase. In engineering mathematics an oscillator is normally expressed in another form through Euler's relations, which allow sine and cosine functions to be expressed by means of complex exponentials. In this chapter we will introduce a particular Fourier-based analysis and synthesis system, called the short-time Fourier transform (STFT), due to Allen and Rabiner (1977). This is a very general technique, useful in the study of time-varying signals such as musical sounds, that can be used as the basis for more specialized techniques. In the following chapters the STFT is taken as the basis for several analysis/synthesis systems. In musical contexts, the Fourier Transform is applied to analog signals having a limited bandwidth (FT) or to a finite number of digital samples (DFT or STFT).
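As a small illustration of the two ways of writing an oscillator, the sketch below generates the same sinusoid once with the sine function and once through Euler's relation sin(θ) = (e^{jθ} − e^{−jθ}) / (2j). Function names, parameter names and the numeric values are assumptions made for the example.

```python
import cmath
import math


def oscillator(amplitude, frequency, phase, sample_rate, num_samples):
    """Samples of amplitude * sin(2*pi*frequency*n/sample_rate + phase)."""
    return [amplitude * math.sin(2 * math.pi * frequency * n / sample_rate + phase)
            for n in range(num_samples)]


def oscillator_euler(amplitude, frequency, phase, sample_rate, num_samples):
    """The same oscillator written with complex exponentials (Euler's relation)."""
    samples = []
    for n in range(num_samples):
        theta = 2 * math.pi * frequency * n / sample_rate + phase
        value = (cmath.exp(1j * theta) - cmath.exp(-1j * theta)) / 2j
        samples.append(amplitude * value.real)   # the imaginary part cancels out
    return samples


a = oscillator(0.8, 440.0, 0.25, 44100, 64)
b = oscillator_euler(0.8, 440.0, 0.25, 44100, 64)
print(max(abs(u - v) for u, v in zip(a, b)))     # difference at rounding level
```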
We can summarize here the techniques used to compute the FT over analog and digital input signals:
• FT: time-continuous input signal, frequency-continuous spectrum output.
• DTFT: time-discrete input signal, frequency-continuous spectrum output.
• DFT: time-discrete input signal, frequency-discrete spectrum output.
• STFT: short-time input signal (time-continuous inside short-time periods), time-varying frequency-discrete spectrum output.

4.2.1 Fourier Transform (FT), Classic Formulation

In signal processing, the Fourier Transform is the mathematical transformation by which a time-domain signal is converted into its frequency spectrum. Usually, the result of the FT is simply called the spectrum of the input signal. In its original formulation, the FT extends the spectrum of the signal to the whole frequency range, from 0 to ∞. Here is the definition:

\[ X(\omega) = \int_{-\infty}^{\infty} x(t) \cdot e^{-j\omega t} \, dt \]

where x(t) is a generic waveform and t and ω are the continuous time index and the continuous frequency index. ω is the angular frequency, expressed in radians per second; its relationship with the corresponding frequency in Hz is simply f = ω/(2π). The FT also admits another interpretation, more interesting in a musical context: the decomposition of the waveform into an infinite number of sinusoidal components. The result of the FT is a complex value X(ω) for every value of ω, but X(ω) usually denotes the whole spectrum of x(t). Each complex value, expressed in the form (a + jb), with a and b the real and imaginary parts, reveals the three fundamental parameters of a sinusoid: frequency, amplitude and phase. Obviously, ω is the frequency, and the other two can be computed with the following simple formulas:

\[ \text{amplitude} \Longrightarrow |X(\omega)| = \sqrt{a^2 + b^2} \]

\[ \text{phase} \Longrightarrow \arg[X(\omega)] = \arctan\frac{b}{a} \]

The original signal x(t) can be reconstructed from X(ω), again the whole spectrum of x(t), by means of the Inverse Fourier Transform (once the signal is sampled, its spectrum becomes a periodic function of the normalized angular frequency, with period 2π). The inverse transform is defined as follows:

\[ x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega) \cdot e^{j\omega t} \, d\omega \]
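The amplitude and phase formulas above translate directly into code. The sketch below is only illustrative (the function name is an assumption); it uses the two-argument arctangent, which returns the angle of (a, b) in the correct quadrant, whereas a plain arctan(b/a) cannot distinguish opposite quadrants and fails for a = 0.

```python
import cmath
import math


def amplitude_and_phase(value):
    """Return (|X|, arg X) for one complex spectrum value X = a + jb."""
    a, b = value.real, value.imag
    amplitude = math.sqrt(a * a + b * b)
    phase = math.atan2(b, a)          # arctan(b/a) with the quadrant resolved
    return amplitude, phase


X = complex(-1.0, 1.0)
print(amplitude_and_phase(X))
print(cmath.polar(X))                 # the standard library gives the same pair
```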

The Fourier Transform, in this form, applies only to time-continuous signals, e.g. waveforms. Let us see how it works with digital signals.

4.2.2 Discrete Fourier Transform (DFT)

Figure 4.2: Two plots of static spectrum. The image represents the SPL against frequency of a drum hit played by a robot (on the left) and of a violin note (on the right). The difference is noticeable: while the robot hit apparently has no harmonically related frequency components, in the violin note they are clear.

In digital computers, waveforms are transformed into discrete samples by means of sampling; in this case the Discrete Fourier Transform is therefore computed in place of the FT. A signal that has a discrete-valued representation in the frequency domain is a periodic signal, which means that its spectrum shows isolated spectral lines. The DFT formula can be written as follows:

\[ X[k] = \sum_{n=-N/2}^{N/2-1} x[n] \cdot e^{-j\omega_k n} \]

where x[n] is the n-th value of a discrete-time signal N samples long (this is why the sum in the formula goes from −N/2 to N/2 − 1). ω_k = 2π · (k/N) is the discrete angular frequency, k is an integer going from 0 to N − 1, and N must be even. While X[k] is called the discrete spectrum, the k-th discrete frequency sample X[k] is called the k-th frequency bin. In the DFT the relationship between discrete angular frequency and frequency in Hz is:

\[ f = f_s \cdot \frac{\omega_k}{2\pi} \]

where f_s = 1/T is the sampling frequency and T the period between samples.

Because k takes discrete values, the DFT assumes that x[n] can be represented by a finite number of sinusoids, which means that the signal x[n] is band-limited in frequency. Besides, the frequencies of the sinusoids are equally distributed between 0 Hz and the sampling rate fs or, in radians, between 0 and 2π. The DFT internally hides the frequency-domain sampling function, because there is a direct correspondence between the number of input samples and the number of output frequencies. The inverse DFT is defined as:

\[ x[n] = \frac{1}{N} \sum_{k=-N/2}^{N/2-1} X[k] \cdot e^{j\omega_k n} \]
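The pair of formulas can be checked numerically with a direct, deliberately slow implementation. The sketch below follows the centered time index of the definition (n from −N/2 to N/2 − 1); storing the sample with index n at list position n + N/2 is an assumption of the example, as are the function names. The inverse sums over k = 0 ... N − 1, which gives the same result as the range −N/2 ... N/2 − 1 because X[k] and the exponential are both periodic in k with period N.

```python
import cmath
import math


def dft(x):
    """X[k] = sum over n = -N/2 .. N/2-1 of x[n] * exp(-j*w_k*n), k = 0 .. N-1.

    The list x stores the sample with time index n at position n + N/2.
    """
    N = len(x)
    if N % 2:
        raise ValueError("N must be even")
    half = N // 2
    return [sum(x[n + half] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(-half, half))
            for k in range(N)]


def idft(X):
    """x[n] = (1/N) * sum over k of X[k] * exp(j*w_k*n), n = -N/2 .. N/2-1."""
    N = len(X)
    half = N // 2
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(-half, half)]


N = 16
x = [math.cos(2 * math.pi * 3 * n / N) for n in range(-N // 2, N // 2)]
X = dft(x)
print([round(abs(value), 6) for value in X])    # energy only in bins 3 and N - 3
print(max(abs(u - v) for u, v in zip(x, idft(X))))
```

With a sampling frequency fs, bin k corresponds to the frequency k · fs / N in Hz, in agreement with the relation given above.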

There is a computationally faster version of the DFT, called the FFT. The FFT algorithm reorganizes the computation so that redundant complex products are avoided, reducing the computational cost from the order of N² to the order of N · log N operations. It is still one of the most widely used and advanced techniques for implementing the DFT on digital computers, especially where a real-time DFT is needed or where memory space is critical (i.e. on chips). Unfortunately, both FT and DFT, as described so far, are suited only to periodic signals: in music, only an accurate note coming from a tuned musical instrument can be treated as a periodic waveform, while most sounds are non-periodic and time-varying waveforms. So, let us now introduce the FT technique most used for musical purposes on digital computers, the Short Time Fourier Transform.
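To make the saving concrete, here is a minimal sketch of one classic FFT variant, the recursive radix-2 decimation-in-time algorithm, compared against a direct DFT. This is one possible FFT, not necessarily the one used for this thesis; both functions are hypothetical helpers and use the conventional time index n = 0 ... N − 1 rather than the centered index of the formula above.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    N = len(x)
    if N < 1 or N & (N - 1):
        raise ValueError("length must be a power of two")
    if N == 1:
        return [complex(x[0])]
    even = fft(x[0::2])                 # DFT of the even-indexed samples
    odd = fft(x[1::2])                  # DFT of the odd-indexed samples
    out = [0j] * N
    for k in range(N // 2):
        twiddle = cmath.exp(-2j * math.pi * k / N) * odd[k]
        out[k] = even[k] + twiddle
        out[k + N // 2] = even[k] - twiddle
    return out


def direct_dft(x):
    """Reference O(N^2) DFT with the same indexing, for comparison."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]


x = [1.0, 2.0, 0.0, -1.0, 0.5, 3.0, -2.0, 1.5]
print(max(abs(a - b) for a, b in zip(fft(x), direct_dft(x))))
```

Each level of recursion halves the problem and costs about N operations to recombine, which is where the N · log N figure comes from.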


4.3 The Short Time Fourier Transform (STFT)

One of the main problems with the original Fourier transform theory is that it does not take into account that the components of a sound spectrum vary substantially during its course. In this case, the result of the analysis of a sound of, for example, 5 minutes' duration would inform the various components of the spectrum but would not inform when and how they developed in time. [36]

The Short Time Fourier Transform represents a solution to this problem. It splits the sound into short-time segments, performing an operation called windowing, and analyses each segment in sequence. Normally the FFT is used to compute the FT of each windowed portion, because its computational cost is much lower. The reasons for the wide use of this technique can be summarized as follows: the spectrum derives from a sequence of individual analysis windows, which can trace the time evolution of the sound. The spectrum can thus be seen as a set of spectra equally spaced in time, one for each windowed portion of the waveform, giving a more sophisticated and convenient representation than the DFT (see figure 4.4). The STFT is also a focal point in sound analysis because a time-varying spectrum is closer to the behavior of the human auditory system, and can therefore help in determining perceptual attributes, above all pitch and timbre. How short should the short-time segment be? Typically, less than 1/10 of a second.

Figure 4.3: Basic operation of the STFT used for sound analysis. [Block diagram: waveform, window, FFT, rectangular to polar coordinates, magnitude spectra and phase spectra.]

The STFT operation over a waveform can be interpreted in two ways:
• windowed DFT, where the DFT (FFT) is computed over each windowed segment; windows may overlap
• filterbank view, a bank of bandpass filters equally spaced across the frequency domain (i.e. from 0 Hz to the Nyquist frequency)

Windowed DFT (realized by FFT)
Windowing means that the incoming signal is segmented into temporal windows, each of which has the same duration, although the windows may overlap. The segmented portions of the signal are then analyzed with the DFT (or FFT) separately. The general formula of the STFT is:

\[ X[n,k] = \sum_{m=-\infty}^{\infty} \{ x[m] \, h[n - m] \} \cdot e^{-j\omega_k m} \]

where the output, X[n,k], is the DFT of the windowed input at each discrete time n for each discrete frequency bin k. h[n − m] is the time-shifting window function that follows the signal. In the general formulation m can vary from −∞ to +∞, but in practice it is limited to the length of the window. N is the number of points in the spectrum. The angular frequency of each bin k is:

\[ \omega_k = 2\pi \cdot \frac{k}{N} \]

which corresponds to the frequency k · fs/N in Hz.

Another formulation of the STFT is that of X. Serra, where H, the hop size, is the time advance of the incoming signal, replacing the time-shifting window function. It is again a function of two variables, defined as follows:

\[ X[l,k] = \sum_{n=0}^{N-1} \{ w[n] \, x[n + lH] \} \cdot e^{-j\omega_k n} \]

where w[n] is now a real window, l is the index of the frame passing through the window, and the exponential function is the same as before. X[l,k], the spectrum, is the DFT of the sequence w[n] x[n + lH] for 0 ≤ n ≤ N − 1. The spectrum is computed at every frame l, advancing by H samples along the input signal x[n].
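The hop-size formulation can be sketched directly from the formula. The code below is a minimal illustration under stated assumptions: the function name `stft`, the Hann window and the test sinusoid are choices made for the example, and a direct DFT is used instead of an FFT to keep the code short.

```python
import cmath
import math


def stft(x, window, hop):
    """X[l][k] = sum_{n=0}^{N-1} w[n] * x[n + l*H] * exp(-j*2*pi*k*n/N)."""
    N = len(window)
    frames = []
    for start in range(0, len(x) - N + 1, hop):       # start = l * H
        segment = [window[n] * x[start + n] for n in range(N)]
        frames.append([sum(segment[n] * cmath.exp(-2j * math.pi * k * n / N)
                           for n in range(N))
                       for k in range(N)])
    return frames


N, H = 32, 8
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]
x = [math.cos(2 * math.pi * 4 * n / N) for n in range(4 * N)]   # sinusoid on bin 4
X = stft(x, hann, H)
peaks = [max(range(N // 2), key=lambda k: abs(frame[k])) for frame in X]
print(len(X), peaks)
```

Every frame yields one spectrum; for this stationary test signal the strongest bin is the same in all frames, while for a real musical signal the sequence of spectra would trace how the components evolve in time.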

4.3.1 The Filterbank View

The other possible view of the STFT is that of a group of band-pass filters (a filterbank²), equally spaced so as to cover the whole frequency band of the input signal. This method is similar to the one used in spectrum equalizers, in which the user shapes the spectrum by controlling the level of each filter. Here, however, all filters have the same bandwidth and the center frequencies are equally spaced up to Nyquist.
² See chapter 3 for more detail on filterbanks.

Figure 4.4: Waterfall spectrum, a 3D representation of the STFT spectrum. The graph was obtained with the Spectutils package for GNU Octave. The analysis parameters of the STFT are shown above the figure; the audio sample analyzed is extracted from Laurie Anderson's Violin Solo.

Interpreted in this way, the STFT can be seen as a filterbank which performs the analysis in parallel on each windowed segment of the input signal. For every frame of the input signal, the complex values returned by the n filters describe n sinusoids. The filterbank view was the basis for the phase vocoder analysis/resynthesis technique, and inspired the constant-Q method proposed later. The filterbank view is therefore an abstraction, used when computing the STFT with a programming language on a digital computer. The STFT output (in both views) is a series of spectra, one for each frame of the input signal. Each spectrum has a real and an imaginary part, which can easily be converted into magnitude and phase values. In the filterbank view, frequency plays a more important role than phase. The instantaneous frequency is therefore calculated from the phase values by the method of phase unwrapping³, to obtain a sinusoid with the resulting frequency and with the magnitude given by the derived spectrum magnitude.

³ Phase unwrapping ensures that all appropriate multiples of 2π have been included in Θ(ω).

Some steps are necessary in the calculation of the STFT on digital computers. They are presented in the following sections.
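The conversion from phase to instantaneous frequency mentioned above is not spelled out in the text. The sketch below shows one common way of doing it in phase-vocoder style analysis, and should be read as an assumption rather than as the method used in this thesis: the phase advance of a bin between two consecutive frames is compared with the advance expected at the bin center, and the wrapped difference corrects the bin frequency. All names and numeric values are hypothetical.

```python
import math


def principal_argument(phase):
    """Wrap a phase value into the interval [-pi, pi)."""
    return (phase + math.pi) % (2 * math.pi) - math.pi


def instantaneous_frequency(prev_phase, phase, k, fft_size, hop, sample_rate):
    """Estimate the frequency (Hz) of the sinusoid seen by bin k.

    prev_phase and phase are the phases of bin k in two consecutive frames,
    hop samples apart.
    """
    omega_k = 2 * math.pi * k / fft_size            # bin center, radians per sample
    expected = omega_k * hop                        # phase advance at the bin center
    deviation = principal_argument(phase - prev_phase - expected)
    omega = omega_k + deviation / hop               # corrected angular frequency
    return omega * sample_rate / (2 * math.pi)


fs, N, H = 8000, 64, 16
true_frequency = 1030.0                             # lies between bin centers
k = round(true_frequency * N / fs)                  # nearest bin (center at 1000 Hz)
phase_0 = 0.3
phase_1 = principal_argument(phase_0 + 2 * math.pi * true_frequency / fs * H)
print(instantaneous_frequency(phase_0, phase_1, k, N, H, fs))
```

The estimate is unambiguous only while the true frequency stays close enough to the bin center that the deviation remains below π, which is one reason why overlapping windows (small hop sizes) are used.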

4.3.2 Windowing: Length and Shape of the Window Function

The meaning of this operation is that every input waveform must be time-limited (windowed) in order to calculate its digital FT. The windowing operation is so called because it consists in multiplying the input waveform by a window function, which allows us to extract the values of a segment (frame) of the waveform, depending on the window function and the window length. Since multiplication in the time domain corresponds to convolution in the frequency domain, the product inside the Fourier transform gives a resulting spectrum that is the convolution of the spectrum of the waveform with the spectrum of the window. The window is a mathematical function which has non-zero values over a limited time range. The simplest one is the rectangular window, which equals 1 inside the window length and 0 outside. The choice of the window length is very important because it determines the frequency resolution and the time resolution of the analysis. In the filterbank view of the STFT, the frequency resolution is the frequency band of each band-pass filter. It is given by the ratio fs/N, where N is the number of samples in the window; for example, for fs = 44.1 kHz and 1024 samples per window, we obtain a frequency resolution of about 43 Hz. Thus, in this case the waveform is decomposed into 1024 sinusoids whose frequencies are integer multiples of 43 Hz (harmonics). The analysis of the frequencies lying between the analysis bands is obviously just as important, but it is beyond the scope of this thesis; we recommend [3] for further reading. The shape of the window is the other important criterion when choosing the appropriate window function. All the standard windows adopted for computing the STFT on waveforms are real and symmetric functions, and their spectra look like the sinc function sin(t)/t. Figure 4.5 presents the most commonly used window functions and their spectra. In the spectrum of a window we can see two features: the main lobe and the side lobes.
The width of the main lobe, defined as the extended bins across a period, determines the frequency resolution, i.e. narrower lobe allows better resolution. The attenuation of the side lobe from the main, the difference in dB between the height of the main lobe and the height of the adjacent side lobe, determines the level of cross-talk interference between two adjacent analyzing window. Typically reduce the side lobe is reflected in an increase of the width of the main lobe, so a compromise must be chosen. A generic rule could be the following: when the waveform is mainly composed by distinct number of sine wave a narrow main lobe is preferable, when the waveform is made of noisy like waves a wide main lobe is preferred. 49

Figure 4.5: Types of windows used in STFT for audio analysis.

No ideal window exists; the term "optimal window" is preferred. Several types of windows are in use, and for musical purposes the Kaiser window is usually preferred.

Another choice concerns odd versus even window length. For phase detection a zero-phase window is better, because the windowing process then does not modify the phase of the analyzed waveform. An odd window length is therefore preferred, with the middle sample centered at the time origin of the analysis window.

Many standard window functions are used for the STFT:
• rectangular
• Hamming
• Hann (also called Hanning)
• Gaussian
• Blackman
• Blackman-Harris
• Kaiser
Gaussian, Hamming and Kaiser are the most often used. The Kaiser window has characteristics that are well suited to a musical context.
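The trade-off between main-lobe width and side-lobe attenuation can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `lobe_metrics`, the window length of 64 samples and the FFT size of 4096 are illustrative choices, not values taken from this thesis. It measures the main-lobe half-width (in bins of the unpadded DFT) and the level of the highest side lobe for a rectangular and a Hann window.

```python
import numpy as np

def lobe_metrics(window, n_fft=4096):
    """Return (main-lobe half-width in unpadded DFT bins,
    highest side-lobe level in dB relative to the main-lobe peak)."""
    m = len(window)
    # Zero-padded FFT gives a finely sampled view of the window spectrum.
    mag = np.abs(np.fft.rfft(window, n=n_fft))
    mag /= mag[0]  # real, non-negative windows peak at DC
    # The first local minimum marks the edge of the main lobe.
    null = np.nonzero(np.diff(mag) > 0)[0][0]
    side_db = 20.0 * np.log10(mag[null:].max())
    half_width_bins = null * m / n_fft
    return half_width_bins, side_db

for name, w in [("rectangular", np.ones(64)), ("Hann", np.hanning(64))]:
    width, side = lobe_metrics(w)
    print(f"{name}: half-width {width:.2f} bins, side lobe {side:.1f} dB")
```

Under these assumptions the Hann window shows a main lobe roughly twice as wide as the rectangular one, in exchange for side lobes that are considerably lower, which is the compromise described above.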

4.3.3 Computation of the DFT (via FFT)

The discrete spectrum of each windowed portion of the waveform can now be calculated using the DFT. In practice, whenever possible, the FFT algorithm is used for this purpose. The implementation of the FFT algorithm will be discussed below.

FFT Size, Zero Padding
The FFT requires the analyzed signal to be N samples long, with N a power of two, called the FFT size. The signal length is fixed by the window length, which is chosen to obtain the desired frequency resolution (and may even be variable, i.e. expanded for high frequencies), so the analysis window hardly ever matches the FFT size. In practice this problem is easily overcome by appending zeros until the FFT size is reached; this operation is called zero-padding. Zero-padding has a further benefit: padding in time corresponds to interpolating in the frequency domain, so the resulting spectrum is more finely sampled (more spectral lines, an oversampled spectrum). Note that zero-padding does not increase the frequency resolution of the spectrum, because the analysis window length stays the same, but it makes it easier to track spectral peaks that do not fall exactly on bin frequencies. Usually the FFT size N is chosen as the first power of two that is at least twice the window length M, so that N − M points are set to 0. The ratio N/M is called the zero-padding factor. It should be large enough to allow estimation of the maximum of the main lobe, that is, the spectral peak. Since the window does not contain an exact number of periods of every frequency, the center frequency of a spectral peak rarely coincides with a frequency bin; an appropriately chosen zero-padding factor mitigates this problem and lets the peak be located.

As said before, choosing an odd-length window helps in phase detection. An odd-length window means that the windowed waveform is centered at the time origin: half of the samples lie before the time origin (negative-time values) and the other half after it (positive-time values). The FFT input buffer for this windowed waveform contains the positive-time values at the beginning and the negative-time values at the end, while the rest of the buffer (in the middle) is zero-padded.

Hop Size, Overlap Factor
The hop size H determines the time advance of the analysis window, i.e. the time difference (in samples) between two adjacent analysis windows. The hop size is usually normalized to the window length, so that it takes the value 1 when it equals the window length M (in samples). The analysis windows can be overlapped in order to have more analysis points and therefore finer time resolution over the input waveform.

The inverse of the normalized hop size, M/H, is called the overlap factor (if H ≥ M the analysis windows do not overlap). For example, if H = M = 1024 and fs = 44100 Hz, the time resolution over the input waveform is 1024/44100 ≈ 23 ms; if the overlap factor is 8 (H = 128 samples), the time resolution becomes 2.9 ms. A greater overlap factor generally gives better analysis results, but also a greater computational cost. Hence the overlap factor has to be chosen according to the characteristics of the input waveform: fast-changing waveforms need more overlap. There are some general criteria for determining an efficient overlap factor. The most general is to choose the overlap so that all the data are equally weighted, as in the case of overlap-add synthesis presented later. Another criterion is to choose the overlap factor according to the nature of the window function, that is, the overlapping windows should add up exactly to a constant value, e.g. 1. For a rectangular window this is easy to obtain: the hop size can simply be M/i, with i any positive integer. If consecutive analysis windows add up to a constant, no amplitude deviation is possible, hence successive windowing operations do not amplitude-modulate the input waveform.

To summarize, the STFT is applied to a stream of input samples and results in a series of frames that, one after another, produce a time-varying spectrum, giving the impression of a continuous spectrum. The four parameters to choose when designing an efficient STFT are:
• window shape
• window length
• hop size / overlap factor
• FFT size
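The four parameters listed above can be seen together in a short sketch. This is only an illustration under stated assumptions, not the implementation used in this thesis: it assumes NumPy, the function name `stft_zero_phase` is hypothetical, and the values in the usage lines (M = 1023, N = 2048, H = 128) are chosen only to match the orders of magnitude discussed in the text. Each frame is windowed with an odd-length window, placed in the FFT buffer in zero-phase fashion (positive-time half at the beginning, negative-time half at the end, zeros in the middle) and transformed.

```python
import numpy as np

def stft_zero_phase(x, window, hop, n_fft):
    """STFT frames with zero-phase windowing and zero-padding.

    window: analysis window of odd length M
    hop:    hop size H in samples
    n_fft:  FFT size N (N >= M)
    """
    m = len(window)
    if m % 2 != 1 or n_fft < m:
        raise ValueError("window length must be odd and not exceed n_fft")
    half = (m - 1) // 2
    frames = []
    for start in range(0, len(x) - m + 1, hop):
        seg = x[start:start + m] * window
        buf = np.zeros(n_fft)
        buf[:half + 1] = seg[half:]      # centre sample and positive-time half
        buf[n_fft - half:] = seg[:half]  # negative-time half goes at the end
        frames.append(np.fft.rfft(buf))  # spectrum of this frame
    return np.array(frames)

fs = 44100
x = np.random.default_rng(0).standard_normal(fs)   # one second of noise
spectra = stft_zero_phase(x, np.hanning(1023), hop=128, n_fft=2048)
print(spectra.shape, "time step:", 128 / fs, "s")
```

With this buffer arrangement a cosine centred in the frame yields a spectral peak with (numerically) zero phase, which is the reason the odd, zero-phase window is preferred for phase detection.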

4.3.4 The Inverse Short Time Fourier Transform & Overlap-Add Resynthesis

The ISTFT is the process by which the original waveform can be reconstructed from the frequency-domain analysis data produced by the STFT. This is the typical feature of a synthesis process, so this section presents the use of the ISTFT in the particular synthesis method called overlap-add resynthesis. But first, recall the definition of the STFT of the m-th frame, which the ISTFT inverts:

$$X_m(\omega_k) = e^{-j\omega_k mR} \sum_{n=-N/2}^{N/2-1} x(n+mR)\, w(n)\, e^{-j\omega_k n}$$

where $w(n)$ is the analysis window, $R$ is the hop size in samples (called H in the previous section) and $\omega_k$ is the frequency of the k-th bin.

The overlap-add resynthesis method, due to Allen and Rabiner (1977), states that each windowed segment of the original waveform can be reconstructed from its spectral components by applying the ISTFT to each frame. It takes the magnitude and phase values of each spectrum and generates a time-domain waveform with the same envelope as the analysis window used to compute the STFT. Each resynthesized time-domain segment is then overlapped and added to reconstruct the original waveform. In theory, the overlap-add process is an identity operation (i.e. the reconstructed signal equals the original) only if the overlapped and added windows sum to a constant. In that case the signal could be passed through the STFT and back through the ISTFT any number of times; in practice, however, even good implementations of the STFT lose a small amount of information at each pass, so the identity holds only approximately.

OA resynthesis is not the only method for resynthesizing the original waveform from the STFT; many others are possible. The weighted overlap-add method is similar to OA; the difference lies in the transformation applied to the window function before resynthesis. The analysis window and the synthesis window must preserve the identity property, which is achieved by respecting the relationship:

$$\sum_{n=-\infty}^{\infty} w[m - nH] = c$$

where c is a constant. The synthesis window is needed when, before resynthesis, a transformation is applied to the phase spectrum, which can create phase discontinuities at the frame boundaries.

Oscillator-bank resynthesis (also called sinusoidal additive resynthesis, SAR) is another method, in which the analysis data (magnitude and phase) are converted into synthesis data (amplitude and frequency) in order to drive one oscillator for each frequency component; the oscillator outputs are then summed to recreate the original signal. This method is free from the add-to-constant rule of OA, because the converted spectrum is more robust against any digital processing transformations applied before synthesis. The SAR method can be applied to the filterbank interpretation of the STFT by matching each frequency bin to a sine wave and then summing all the sine waves for synthesis. [54]
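The constant-sum condition and the resulting identity property can be verified with a few lines of code. The sketch below assumes NumPy; the function names `overlap_add_sum` and `ola_roundtrip` are hypothetical, and the periodic Hann window of even length 1024 with hop 512 is an assumption made for simplicity (the text above prefers odd lengths for phase analysis). It checks that the shifted windows add up to a constant and that analysis followed by overlap-add resynthesis returns the input, away from the edges of the signal.

```python
import numpy as np

def overlap_add_sum(window, hop, n_frames):
    """Sum of n_frames copies of the window, each shifted by hop samples."""
    m = len(window)
    total = np.zeros(hop * (n_frames - 1) + m)
    for i in range(n_frames):
        total[i * hop:i * hop + m] += window
    return total

def ola_roundtrip(x, window, hop):
    """Windowed FFT analysis followed by inverse FFT and overlap-add."""
    m = len(window)
    y = np.zeros(len(x))
    for start in range(0, len(x) - m + 1, hop):
        spectrum = np.fft.rfft(x[start:start + m] * window)  # analysis frame
        y[start:start + m] += np.fft.irfft(spectrum, n=m)    # resynthesis
    return y

m, hop = 1024, 512
hann = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(m) / m)  # periodic Hann
x = np.random.default_rng(0).standard_normal(hop * 20)
y = ola_roundtrip(x, hann, hop)
print("window sum constant:", np.allclose(overlap_add_sum(hann, hop, 10)[m:-m], 1.0))
print("max reconstruction error:", np.max(np.abs(y[m:-m] - x[m:-m])))
```

The first and last window lengths of the signal are excluded from the comparison because there fewer windows overlap, so the sum has not yet reached its constant value.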

4.4 Constant-Q analysis

Constant-Q filterbank analysis is another technique that descends from harmonic analysis, like the Fourier transform. It has been in use since the late 1970s and has inspired other techniques such as the bounded-Q transform, the auditory transform and the wavelet transform. Q, the ratio between the center frequency and the bandwidth of a filter, and the filterbank, a set of filters connected in parallel, were introduced in chapter 2. Both concepts are now applied to constant-Q analysis, as an alternative to strict Fourier analysis. Our aim in this thesis is to demonstrate its suitability for a specific sound-analysis task, onset detection, applied to the case of ·O M M· (next chapter), and to apply its perceptually grounded approach to the recognition of the sound that has been played (final chapter).

Figure 4.6: Spacing of filters for the STFT (filterbank view) on the top and for the constant-Q filterbank on the bottom. The advantage of the constant-Q filterbank method is clear: it places the filters linearly against log(frequency), which is similar to the frequency response of the human ear.

The constant-Q transform has advantages over the Fourier transform that lie in musical aspects. The STFT computes frequency components (frequency bins) on a linear scale, that is, it expresses frequencies with a fixed resolution or bandwidth. The drawback is that this frequently results in inadequate resolution at low musical frequencies and excessive resolution at high frequencies. The frequency resolution is set by the window length, which must be chosen to fit the resolution needed (i.e. the lowest frequency content to be resolved). Moreover, high frequency resolution means poor time resolution and vice versa: there is always a tradeoff between time and frequency resolution in STFT-based analysis.

The problem can be illustrated with an example. Suppose the sampling frequency is fs = 44100 Hz and the window length is N = 1024 samples. The frequency bins that can be analyzed are 512, equally spaced over the bandwidth, i.e. from 0 to fs/2 = 22.05 kHz. Increasing the sample rate, e.g. to fs = 96 kHz, will not increase the frequency resolution of the analysis but will only widen the bandwidth up to 48 kHz. To increase the frequency resolution, one must choose a longer window. In the limit, to obtain a frequency resolution of 1 Hz, a window length of 44100 samples must be chosen, sacrificing the time resolution, which becomes 1 s! Conversely, if the time resolution needed is 1 ms, i.e. to analyze 1000 events per second, the window length should be 44 samples, so the frequency resolution will be about 1000 Hz!

Now consider a practical example that introduces the need for a different analysis tool. Suppose the task is to resolve (analyze separately) the fundamental frequencies of the notes of a piano, and suppose the two lowest notes are spaced, for example, 2.5 Hz apart. The analysis window must then be about N = 16384 samples long: for fs = 44100 Hz the frequency resolution is fs/N ≈ 2.7 Hz. This not only results in poor time resolution (about 370 ms); the real problem is that this resolution is far finer than needed for the higher notes, where the spacing between notes is much larger than 2.5 Hz. In addition, what was said in a previous chapter about the perception of frequency is completely neglected here. However, since the STFT is performed via the FFT, the computation time is extremely low and real-time implementation is not a problem, although much of the data is useless and must be discarded after analysis. The complexity reduction achieved by the FFT algorithm is therefore the principal reason behind the wide use of the STFT in sound analysis.

The constant-Q transform is an alternative to the fixed-resolution frequency representation of the Fourier transform. In a constant-Q transform the bandwidth of each frequency bin varies proportionally with its center frequency. In the next section we will see a typical implementation of constant-Q analysis for musical purposes, applied to the case of a piano. But first, consider the waterfall spectrograms in figures 4.7 and 4.8, taken from Brown [10]. The two pictures clearly show the advantage of representing musical signals with the constant-Q transform. It is especially clear when compared with the previous image (figure 4.4), which shows an STFT waterfall spectrum.
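The arithmetic behind these examples can be reproduced with a small sketch. The helper names `stft_resolution` and `piano_key_frequency` are hypothetical, and the piano tuning (88 keys, equal temperament, key 49 = A4 = 440 Hz) is an assumption introduced here only to show how unevenly the notes are spaced in Hz; it is not the 2.5 Hz figure used as an example in the text.

```python
def stft_resolution(fs, n):
    """Return (frequency resolution in Hz, time resolution in seconds)
    for a window of n samples at sampling frequency fs."""
    return fs / n, n / fs

def piano_key_frequency(key, a4=440.0):
    """Frequency of piano key 1..88, assuming equal temperament, key 49 = A4."""
    return a4 * 2.0 ** ((key - 49) / 12.0)

for n in (1024, 16384, 44100, 44):
    df, dt = stft_resolution(44100, n)
    print(f"N = {n:5d}: resolution {df:8.2f} Hz, frame duration {dt * 1000:7.2f} ms")

low_spacing = piano_key_frequency(2) - piano_key_frequency(1)
high_spacing = piano_key_frequency(88) - piano_key_frequency(87)
print(f"spacing of the two lowest keys:  {low_spacing:.2f} Hz")
print(f"spacing of the two highest keys: {high_spacing:.2f} Hz")
```

Under the tuning assumed here the spacing between adjacent notes grows by more than two orders of magnitude from the bottom to the top of the keyboard, while the STFT offers the same resolution everywhere.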

4.4.1 Implementation of Constant-Q Analysis

An example of a constant-Q transform implementation is the 1/24-octave filter bank. The constant-Q filter bank and its similarity to the auditory system have been explored in [42] and [38]. Various schemes for implementing constant-Q spectral analysis outside a musical context have been published, for example that of Gambardella, who proposed an inverse function to bring the constant-Q representation back to the time domain. This is important if manipulation of the signal in the spectral domain, followed by transformation back to the time domain, is desired.

Figure 4.7: Waterfall spectrogram of a constant-Q transform of a violin glissando from 587 Hz to 880 Hz (D5 to A5). Taken from Judith Brown's Calculation of a constant Q spectral transform. [A glissando is a glide from one pitch to another. It is an Italianized musical term derived from the French glisser, to glide. It is also where the pianist slides up the piano with his or her hands. From Wikipedia.]

For musical analysis, we would like frequency components corresponding to the quarter-tone spacing of the equal-tempered scale⁴. The frequency of the k-th spectral component is thus:

$$f_k = \left(2^{1/24}\right)^k f_{min}$$

where f varies from $f_{min}$ to an upper frequency chosen below the Nyquist frequency. The minimum frequency $f_{min}$ can be chosen to be the lowest frequency about which information is desired, e.g. a frequency just below that of the G string for calculations on sound produced by a violin.

⁴ Equal temperament is a musical temperament, or a system of tuning, in which every pair of adjacent notes has an identical frequency ratio. In equal temperament tunings an interval, usually the octave, is divided into a series of equal steps (equal frequency ratios).

Figure 4.8: Waterfall spectrogram of a constant-Q transform of a flute playing a diatonic scale from 262 Hz to 523 Hz (C4 to C5). Taken from Judith Brown's Calculation of a constant Q spectral transform. [In music theory, a diatonic scale is a seven note musical scale comprising five whole steps and two half steps, in which the half steps are maximally separated. From Wikipedia.]

The resolution or bandwidth $\delta f$ of the discrete Fourier transform is equal to the sampling rate divided by the window size (the number of samples analyzed in the time domain). In order for the ratio of frequency to bandwidth to be constant (constant Q), the window size must vary inversely with frequency. More precisely, for quarter-tone resolution the required quality factor is:

$$Q = f/\delta f = f/(0.029 f) \approx 34$$

so that the bandwidth is $\delta f = f/Q$. With a sampling frequency $f_s = 1/T$, where T is the sample period, the length of the window in samples at frequency $f_k$ is:

$$N[k] = f_s/\delta f_k = (f_s/f_k)\, Q$$

Note also from this equation that the window contains Q complete cycles of each frequency $f_k$, since the period in samples is $f_s/f_k$. This makes physical sense: in order to distinguish between $f_{k+1}$ and $f_k$ when their ratio is, e.g., $2^{1/24} \approx 34/33$, we must look at no fewer than 33 cycles. It is also interesting, for comparison, to consider the conventional discrete Fourier transform in terms of the quality factor $Q = f/\delta f$. We find that $f/\delta f$ is equal to the coefficient number k, which is the number of periods of that frequency in the fixed window.

The constant-Q transform has demonstrated good results in sound analysis; especially for the identification of musical notes, it proves to be a more appropriate spectral representation thanks to its geometrically spaced frequency channels. However, it should not be considered a good starting point for musical synthesis, because of its problematic inverse function: inverse functions have been proposed but not successfully implemented for musical purposes. This method will be used, in the final chapter, to achieve the goal of onset detection.
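The relations for $f_k$, Q and N[k] can be put together in a brute-force sketch. This is an illustration under stated assumptions rather than the implementation adopted in this thesis: it assumes NumPy, the function name `constant_q` is hypothetical, the Hamming window is an assumption, and each bin is evaluated directly as a windowed sum over N[k] samples (with N[k] rounded up to an integer), following the general form of the direct calculation described by Brown [10].

```python
import numpy as np

def constant_q(x, fs, f_min, n_bins, bins_per_octave=24):
    """Direct constant-Q spectrum of the first samples of x.

    Returns (centre frequencies f_k, complex spectrum), one value per bin.
    """
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)  # about 34 for quarter tones
    freqs = f_min * 2.0 ** (np.arange(n_bins) / bins_per_octave)
    if len(x) < int(np.ceil(q * fs / f_min)):
        raise ValueError("signal shorter than the longest analysis window")
    spectrum = np.zeros(n_bins, dtype=complex)
    for k, f_k in enumerate(freqs):
        n_k = int(np.ceil(q * fs / f_k))      # N[k] = (fs / f_k) Q
        n = np.arange(n_k)
        # Window of Q cycles at f_k; the Hamming shape is an assumption.
        kernel = np.hamming(n_k) * np.exp(-2j * np.pi * q * n / n_k)
        spectrum[k] = np.dot(x[:n_k], kernel) / n_k
    return freqs, spectrum

fs = 44100
t = np.arange(8192) / fs
tone = np.sin(2 * np.pi * 440.0 * t)
freqs, spectrum = constant_q(tone, fs, f_min=220.0, n_bins=48)
peak = int(np.argmax(np.abs(spectrum)))
print("strongest bin:", peak, "centre frequency:", freqs[peak], "Hz")
```

The window length shrinks as the centre frequency rises, so the high bins are computed from far fewer samples than the low ones; this is the time/frequency trade-off of the STFT resolved bin by bin. The direct sum is slow compared with the FFT and is shown only to make the definitions concrete.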


Chapter 5 Real-Time Audio Applications
To generate¹ and process acoustical signals is to compose music, more directly than inscribing ink on paper.
Curtis Roads [44].

For our purpose, which is to analyze the musical flow of the robotic orchestra, we need a flexible and extensible platform. For this reason we discarded a priori the use of numerical analysis software such as Matlab/Octave (we did use them, but for other purposes). Although they offer many packages (free, in the case of GNU Octave) tuned for sound analysis and processing ([63], [47] and [32]), we need to integrate the analysis into the Show Control System² of the Orchestra, and for a musical context other software is recommended. One of the pillars of this category is Max/MSP, the software already used for the development of the ·O M M· SCS. In this chapter the main software packages for real-time audio applications are described, from basic to advanced features (the latter in the case of Max/MSP, the software we chose for our purpose). At the end of the chapter we give an overview of the typical applications these packages are designed for, and a comparison between text-based (Unix-style, terminal-like window) and graphical software.

¹ Musical sound synthesis.
² A show control system (in entertainment) is a generic term for a system, possibly very complex, whose main function is to coordinate all the different systems (audio, video, MIDI, OSC, ...) controlling the hardware by which a show is formed. In our case the SCS coordinates the robot musicians, the gestural controller and possibly other features (still to be experimented).

5.1 Max/MSP

Max/MSP owes the first part of its name to Max Mathews³, who in 1957 wrote the first computer program specifically for sound generation⁴. Max was also the original name of the software, developed by Miller S. Puckette at IRCAM in the mid-1980s and commercially distributed since the early 1990s. MSP (standing for Max Signal Processing, or for the initials of Miller S. Puckette) is a package for real-time DSP, added to the software in 1997. Because of its graphical (but minimal) nature, Max/MSP differs from most MUSIC-N languages: Max can be considered a visual programming language. Visual programming lets you graphically connect objects with patch cords to design interactive software. This resembles the usual way of designing programs with flowcharts or more modern techniques such as UML, with one difference: in a flowchart the blocks represent code that still has to be written, whereas in Max the code is already written. Since Max uses icons to represent objects written in a high-level language, Max is a meta-language, following the paradigm "programs can write programs". Max/MSP distinguishes between two levels of timing: that of an "event" scheduler and that of the DSP (similar to the distinction between control-rate and audio-rate processes in Csound, a direct descendant of MUSIC). With Max you can also control external hardware, read data from sensors, and exchange audio and data with other software, besides generating and analyzing sounds and creating musical instruments, video and animation. All these features make Max a popular choice for composing interactive media works, above all thanks to its approachable graphical interface, its extensive bindings to media processes and protocols, and its open-ended philosophy. A short description of the principal Max features follows.

The patcher windows
MAX is designed to look familiar to composers who have worked with patchable synthesizers. What you see on the screen is a lot of boxes with lines connecting them. The boxes are called objects and the lines are patch cords. Each object represents a process, and the results of the process are passed along the patch cord to the next object. Ultimately, there is an object that sends MIDI data, audio or video out. Each window full of objects is called a patcher. Several patchers may be open at once and they can all be active, even if their window is hidden. Patchers can be saved, and then entered as an object in another patcher. There is also a patcher object, which can be opened up and filled with objects that will continue to work after the patcher object is folded up again. The action flows from the top down. When an object is tweaked by the user or MIDI comes in, a message is sent to any connected objects, which react with messages of their own. Only one thing happens at a time, but it is all so fast that it seems instantaneous. When a pathway branches, messages are sent to right destinations before left.

Figure 5.1: Max 5 patcher window

³ Max Vernon Mathews is unequivocally considered one of the pioneers of computer music. He received his Sc.D. in 1954 from the Massachusetts Institute of Technology (MIT); while working at Bell Labs he created MUSIC, the first computer-based programming language for synthesis (1957).
⁴ MUSIC 1

Objects
The name of the object represents what it does. There are a few hundred objects included with Max, ranging in complexity from simple math to full featured sequencers. Arguments, if present, specify initial values for the object to work with. Data comes into the object via the inlets, and results are put out the outlets. Each inlet or outlet on an object has a specific meaning. This will be displayed in a flag as the mouse passes by (further details are in the manual). Usually, input to the left inlet triggers the operation of the object. For instance, the delay object (as shown) will send a bang message out the outlet 500 milliseconds after a bang is received in the left inlet. Data applied to the right inlet will change the delay time.

Messages
Data bytes sent down the patch cords are called messages, which fall into one of the following types:
• int: a number without a decimal point.
• float: a number with a decimal point.
• symbol: a character string such as “stop” that may be understood by certain objects. A symbol may be followed by further information in the message.
• list: several of the above, separated by spaces. The first element of a list must be a number.
• bang: a message that triggers the action of an object.
Audio signals are sent in yellow patch cords. These are little packets of data, but sent so fast as to be effectively continuous. Jitter signals are sent via green patch cords. Jitter messages are names of matrices that hold data for jit.objects to process. Every object responds to a variety of messages. If a message won’t work, a warning will appear in the Max window.

Max windows
The Max window contains information sent from Max (like error messages) or things you might like to print. It’s sort of a terminal window.

Max runtime
Max is not required to run a finished patch. Anyone can download Max/MSP Runtime for free, which will run patches but not edit them. There is also a process for converting patches into stand-alone applications.

Pure Data
PD is the open-source twin of Max, released under a BSD license and developed by the same author, Miller Puckette, since 1996. It offers the same capabilities as Max, with small differences explained by the author in [42] and [38].

5.2 CSound

CSound is one of the better-known textual interfaces for computer music composition. CSound was originally written by Barry Vercoe at MIT in 1985, based on languages of the Music-N family, and continues to be developed today. At its core, CSound is “designed around the notion that the composer creates a synthesis orchestra and a score that references the orchestra.”

Figure 5.2: Max 5 window

Csound files were originally processed in non-real-time to render sonic output, in a “process referred to as ‘sound rendering’ as analogous to the process of ‘image rendering’ in the world of computer graphics.” [55]. Csound instruments are defined in the orchestra file as directed graphs of unit generator types (called ‘opcodes’). Flexible sound routing can be achieved using control and audio busses via the Zak opcodes. The distinction between audio rate and control rate is explicit in CSound through the a-rate and k-rate notations. The strong separation of synthesis and temporal event definition imposes a strict limitation on the scope for algorithmic composition: new synthesis processes cannot be defined in response to temporal events, and new temporal events cannot occur in response to the synthesis output. “Csound is very powerful for certain tasks (sound synthesis) while not particularly suited to others (data management and manipulation, etc.).” [55]

5.3 SuperCollider

SuperCollider is a high-level music programming language, designed specifically for dynamic and generative structures and for the synthesis of computer music. It can be applied to many different approaches to composition and improvisation rather than to any particular preconceived model. It features an application-specific high-level programming language, SCLang (inspired by C++), with extensive data-description and functional programming capabilities, and support functions for common musical needs. SuperCollider also features several libraries of unit generators for signal processing. Sample-rate and control-rate distinctions are made explicit via the .ar and .kr notation. A key distinction from CSound is that code can be evaluated in real time as the program runs. SuperCollider is ideal for algorithmic composition. Since version 3.0 (the currently available version), graphs of unit generators are defined textually and compiled at runtime into dynamic libraries (‘SynthDefs’) to be loaded as instruments (‘synths’) by the synthesis engine (‘SCServer’), all under the control of the language. The separation of language and synthesis into distinct processes in version 3.0 introduces compilation and performance optimizations, but also implies limitations in the degree of temporal control: “Because instruments are compiled into code, it is not possible to generate patches programmatically at the time of the event as one could in SC2. In SC2, an audio trigger could suspend the signal processing, run some composition code, and then resume signal processing. In SC Server, messaging between the engines causes a certain amount of latency.” [34] SuperCollider 3.0 therefore represents a return to the CSound model of orchestra and score, in which, however, the score is procedural rather than declarative.

5.4 ChucK

ChucK represents one of the few contemporary options that avoid latency in the procedural control of synthesis. It also provides a library of unit generators to be freely instantiated and connected into graphs within ChucK scripts. The authors refer to ChucK as ‘strongly timed’, which can be defined as follows:
• supports sample-accurate events
• defines no control rate (or supports dynamically arbitrary control rates)
• supports concurrent functional logic
• control logic can be placed at any granularity relative to synthesis
• supports run-time interaction and script execution
Like SuperCollider’s SCLang, the ChucK language was written especially for the ChucK software. It is a high-level interpreted programming language.

Comparison between Graphical User Interface (GUI) and Textual Language Interface (TLI) Musical Software

GUI +
• Easier to view and input quantitatively rich data such as control envelopes
• Common tasks can be immediately and intuitively represented
• Interaction can be more rapid

GUI −
• GUI interfaces tend to be more specific
• Complex data-structures, if made visible, can be visually overwhelming
• Precise qualitative specification can be difficult at fine granularity

TLI +
• Compact description of complex data-structures
• High degree of precision & control
• Textual elements may more easily refer to or embed each other

TLI −
• Tiresome to specify by data-entry when precision is not required
• Simple tasks may require detailed code
• Interaction can be time-consuming, particularly if text must be compiled

Table 5.1: Musical software for realtime synthesis and control

Name | Creator | Typical purposes | First release date | Recent release (2009) | License | Development status
Max/MSP | Miller Puckette | Realtime synthesis, hardware control | mid-1980s | Max 5.0.7 | Commercial software (Cycling’74) | Mature
Pure Data | Miller Puckette | Realtime synthesis, hardware control | 1990s | pd-extended (0.41.4), pd-vanilla (0.42.5) | BSD-like | Stable
Csound | Barry Vercoe | Realtime synthesis, algorithmic compositionᵃ, audio rendering | 1986 | Csound 5.10 | LGPL | Mature
SuperCollider | James McCartney | Realtime synthesis, live-codingᵇ, algorithmic composition | 1996 | SC 3.3.1 | GPL | Stable
ChucK | Ge Wang and Perry Cook | Realtime synthesis, live-coding, algorithmic composition | 2004 | ChucK 1.2.1.2 (dracula) | GPL | Immature

ᵃ Algorithmic composition is the technique of using algorithms to create music.
ᵇ Live-coding is the name given to the process of writing code to modify software in realtime as part of a performance. Most generally, writing (parts of) programs while they run. It is sometimes known as "interactive programming" or "on-the-fly programming".

Chapter 6 Perceptual Onset Detection
In the first part of chapter 2 we presented the attributes of sound as perceived through the human auditory system. We now focus on a particular aspect of this mechanism: the perception of time events, especially those related to the initial portion of sounds. By applying the computer music tools discussed in chapters 3 and 4, namely digital filters and constant-Q analysis, we will try to simulate the functions of the auditory system in order to achieve event detection. The topic of our research is widely treated in the literature and must be preceded by some fundamental definitions, first of all those of the attack and the onset of a sound. Then some of the most advanced techniques for onset and attack detection will be presented and, finally, our method based on bonk~ for Max/MSP will complete this chapter. First of all, however, the project we are working on is introduced at a glance.

6.1 The Curious Case of ·O M M·

Two robots, loud sounds, and one computer: the reason I decided to write this thesis. The project ·O M M· consists of a show in which a performer conducts two robot drummers. Many people worked on it for more than one year, developing the electromechanical parts of the robots at the LIM laboratories and the software, in Max/MSP, that controls them. The two robotic percussionists (there were ten in the original concept) are directed by the performer through a gestural controller, the GipsyMIDI1 exoskeleton. The ·O M M· orchestra, named after the Italian futurist Filippo Tommaso Marinetti2 in the centenary of Futurism, is ready for its launch in November, after having been presented in October 2008. In the organization of this thesis, this chapter is the right place to introduce the project, before discussing the recognition of sound onsets. Figure 6.1 shows the ensemble of the show control system (SCS) of ·O M M· .

Figure 6.1: The two robots on the sides, SCS + the performer in the middle.

The two robotic drummers, each with two arms, can play at up to 120 bpm with each arm (i.e. up to four simultaneous3 hits) a score sent to them via MIDI4 from a computer. On the same computer a complex Max/MSP patch processes the input data received from the exoskeleton, calibrates it, and lets the output modify the rhythmic pattern in real time5. Since the robots play big and cumbersome drums, two oil cans of regular size, the music produced consists of loud percussive sounds, in the style of the French band Les Tambours du Bronx.

1 http://www.sonalog.com/framesets/gypsymidi_frame.htm
2 Filippo Tommaso Marinetti (1876 – 1944) was an Italian poet and musician, founder of the Futurist movement. He wrote the Futurist Manifesto, published in the French newspaper Le Figaro in 1909. He also introduced the presence "on stage" of humanoid forms of life, mechanical bodies considered a primordial example of "robot" (ten years before Karel Čapek introduced the concept in his play Rossum’s Universal Robots).
3 Perceptually simultaneous. A minimum delay (less than 15 ms) between the mechanical actuation of the two arms of a robot must be ensured between two electrical transmissions.
4 Musical Instrument Digital Interface.
5 Quasi-real-time, to be precise.


6.2

From Transient to Attack and Onset Definitions

Every musical signal can be subdivided6 into smaller units of sound. The two subdivisions considered here are the transient and the steady-state portions of a sound. The transient portion is located at the origin of a sound and is caused by a stimulus (e.g. a chord is played on a guitar, a stick strikes a drum) that produces a sudden change in the perceptual attributes. The duration of the transient is assumed to be very short, essentially related to the duration of the stimulus. The steady-state portion, which follows the transient, acts as a kind of support to the sound: it can be considered the natural evolution of the transient according to the instrument design and the environment. Figure 6.2 shows this separation for the case of a drum hit. To give more precise definitions of the terms used in figure 6.2, we quote Bello [4]:
• transient: period during which the excitation is applied and then damped
• attack: time-lag during which the amplitude of a sound increases
• onset: single instant chosen to mark the start of the transient
Another interesting (and more intuitive) definition of onset is given by Bilmes in [8]: the onset is the point when a musical event becomes audible. He also refers to the onset as the attack time. Indeed, attack and onset are sometimes used as synonyms. To resolve this ambiguity we quote Bilmes again: “each musical event in percussive music is called a drum stroke or just a stroke, it corresponds to a hand or stick hitting a drum, two sticks hitting together, two hands hitting together, a stick hitting a bell, etc. The attack time of a drum stroke is its onset time, therefore the terms attack and onset have synonymous meanings, and are used interchangeably.” This is not a risky assumption, because it has been shown [26] that the time between zero and maximum energy in a percussive musical event is almost instantaneous.
From the point of view of onset recognition, the simplest case is therefore that of the drum, where the sounds generated are percussive events, well characterized by the noticeable change in sound parameters associated with the strokes. Hence, transients are normally considered the principal part of a percussive sound; once they have been characterized, the steady-state portions can be derived, although approximately, by applying an appropriate synthesis stage which recreates the slow decay at the estimated resonance of the drum.

Figure 6.2: On the top, the waveform corresponding to a hit of a robot percussionist of ·O M M· . On the bottom, the intensity profile of the hit (using Praat), where onset, attack and the transient/steady state separation are highlighted.

The actual meaning of onset, which comes from psychoacoustics, allows us to define the transient according to the behavior of the three perceptual attributes presented in chapter 2. In correspondence with a transient we can observe the following behavior:
• abrupt change in perceived loudness
• abrupt change in perceived pitch
• abrupt variation of the perceived timbre
For the purpose of this thesis, the transient/steady-state separation is left out of the following discussion (see [56] for further detail) and attention is focused on the transient region, in particular on onset and attack recognition. But first of all, let us spend a few words on the importance of onset detection, often not made explicit, in most musical software applications.

Figure 6.3: From top to bottom: waveform, static spectrum (FFT) and time-varying spectrum (STFT). From right to left: one hit of ·O M M· robot, one hit of snare drum.

6 This subdivision is normally called "segmentation". Segmentation (also used in many applications other than sound) is an operation which preserves the temporal structure of the signal and can be used to identify, separate and organize the smallest rhythmic events found in the audio signal. It is a complex operation and can be carried out in several ways (choice of the smallest unit) and in several domains (time, frequency, time-frequency, complex), according to the goal. The main applications are transcription, onset detection, and BPM and tempo calculation. Transcription allows the musical flow to be represented by the notes from which it is generated; for this reason it is one of the most explored and advanced areas of study.

Importance of onset detection in musical applications
Several commercial software packages for musical applications exploit onset detection for many special purposes, very often not explicitly. A quick survey shows that onset detection techniques are particularly important in all of the following applications: cut and paste (audio editing), audio/video synchronization, estimation of rhythm and temporal features (audio analysis), compression, content delivery7, indexing8, music information retrieval9, time-stretching10 and pitch-shifting11 (audio processing). Moreover, in other musical applications, detected and modeled onsets can be used for further musical processing or even synthesis [50], [3]. Detected onsets can, for instance, trigger a synthesis algorithm in order to create a new musical piece from the percussive rhythm found in a recording.

6.3

General Scheme for Onset Detection

Several studies and different approaches have been proposed for the onset detection task. The three general steps found in almost all techniques [4] are presented here in order:
1. preprocessing: before analysis, transformations are sometimes applied to the sound (e.g. compression and limiting) in order to emphasize or de-emphasize some characteristics of the signal. Preprocessing is usually considered optional, but it can be extremely helpful, for example to accentuate sudden changes in the waveform. When high-speed performance is required, i.e. in real-time applications, preprocessing is normally avoided.
2. reduction: the process through which the complex signal, here considered as a sum of sinusoids or oscillators, is simplified for analysis (e.g. by subsampling). The simplified signal must reflect the local structure of the original and should enhance the transient characteristics while de-emphasizing the steady state. This operation is critical and has been proposed in many forms, which fall into two categories: methods that make explicit use of signal features (i.e. energy, frequency, phase) and methods based on probabilistic models which approximate the behavior of the signal. The function obtained after reduction of the original signal is called the detection function [4] or observation function [50].
3. peak-picking: the final stage of onset detection is traditionally entrusted to a peak-picking algorithm, which localizes onsets as local maxima (i.e. peaks) of the detection function. This stage is also critical, because it depends on the quality of the detection function and must be robust against possible false detections.
7 Content delivery describes the delivery of audio "content" through a delivery medium such as the Internet; onset detection may improve, for example, the organization of sound into UDP packets.
8 Indexing is a feature of databases. While creating a database of sounds, onset detection may be important to determine similar characteristics to be used to subdivide the database into categories.
9 MIR, see http://en.wikipedia.org/wiki/Music_information_retrieval
10 Time stretching is a way to change the speed or duration of an audio signal without affecting its pitch.
11 Pitch shifting is a way to change the pitch of a signal without changing its length.


It is not difficult to imagine that overlapping sounds, noise, musical effects (e.g. vibrato and tremolo12) or modulations are just some examples of the difficulties that a peak-picking stage can encounter. That is why the final decision of the peak-picking algorithm may, in certain cases, be preceded by post-processing and thresholding stages. In the next sections we adopt the term detection function, described in point 2; every detection scheme has its own detection function. Our method based on bonk∼ will follow a discussion of the modern techniques, widely treated in the literature and well summarized in [4], [50] and [31].

Choice of the Appropriate Detection Function

A brief guideline is proposed here, before discussing our method for recognizing onsets in ·O M M· . It can be useful in choosing the appropriate method for the onset detection task. The choice strongly depends on the sound to be analyzed. Good practice usually requires a balance of complexity between preprocessing, construction of the detection function, and peak-picking [4]. In this summary the methods (in order of complexity) are followed by references to the texts where implementation details can be found.
• If the signal is strongly percussive (e.g. drums), time-domain methods are adequate (i.e. methods based on thresholding the amplitude).
• Spectral methods, such as those based on phase distributions and spectral difference, perform relatively well on strongly pitched transients. [4]
• The complex-domain spectral difference seems to be a good choice in general, at the cost of a slight increase in computational complexity. [21][56]
• When precise time localization is required, wavelet methods can be useful, possibly in combination with another method. [21][22]
• If a high computational load is acceptable and a suitable training set is available, statistical methods give the best overall results and are less dependent on a particular choice of parameters. See [4] for an introduction and [24][2] for more detail.
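As an illustration of the simplest case in the guideline above (thresholding followed by peak-picking on a detection function), a minimal sketch in Python is given here. The function name pick_peaks, the fixed threshold and the min_gap parameter are assumptions made for this example only; they are not taken from any of the cited methods.

```python
def pick_peaks(detection, threshold, min_gap=1):
    """Return indices of local maxima of `detection` that exceed `threshold`.

    `min_gap` (an assumption of this sketch) is the minimum number of
    samples between two reported onsets, to avoid double triggers.
    """
    onsets = []
    for i in range(1, len(detection) - 1):
        is_peak = detection[i - 1] < detection[i] >= detection[i + 1]
        if is_peak and detection[i] > threshold:
            if not onsets or i - onsets[-1] >= min_gap:
                onsets.append(i)
    return onsets


# Toy detection function (hypothetical values): two clear peaks, one ripple.
detection = [0.0, 0.1, 0.9, 0.2, 0.1, 0.15, 0.1, 0.8, 0.3, 0.0]
print(pick_peaks(detection, threshold=0.5))
```

With a low threshold the small ripple is also reported, which shows why the quality of the detection function and the choice of the threshold are both critical for this stage.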
12 Vibrato and tremolo are two important musical effects. Vibrato is produced, in singing and on musical instruments, by a regular pulsating change of pitch, and is used to add expression and vocal-like qualities to instrumental music. Tremolo usually refers to periodic variations in the amplitude of a musical note (or of the singing voice). The depth and speed of vibrato/tremolo determine the amount and speed of the pitch/amplitude changes. With the singing voice it is difficult to achieve separate variations in pitch and amplitude, which usually occur at the same time; that is why the two terms are sometimes confused. In digital signal processing, vibrato and tremolo are easier to achieve separately.


6.3.1

Energy Based Approach

Log Energy Approach

Since a change in energy is the most prominent variation in a musical flow during transients, the energy-based approach can be considered the easiest way to detect events. Usually the introduction of a new note (e.g. that of a piano) leads to an increase in the energy of the signal, and in the specific case of strongly percussive note attacks (e.g. the case of ·O M M· ) this increase is very sharp. For this reason the energy method proves to be a useful and efficient approach for many onset detection applications, in particular for detecting percussive transients. The local energy of a frame of the signal x[n] is defined as:

\[ E[n] = \frac{1}{N} \sum_{m=-N/2}^{N/2-1} \left( x[n+m] \right)^2 h[m] \]

where h[m] is a window of length N, centered at m = 0. Taking the first difference13 of E[n] produces a detection function whose peaks can be localized in time by a peak-picking algorithm to find the onset locations. An improvement, which follows from psychoacoustic knowledge, is to consider that loudness is perceived logarithmically. Hence, computing the first difference of log E[n] roughly simulates the ear’s perception of loudness. These are usually considered the simplest approaches to note onset detection.

Spectral Difference

This idea can be extended to obtain a more appropriate detection function by considering frames of the STFT. We recall that a generic STFT frame of a waveform is given by:

\[ X[n,k] = \sum_{m=-\infty}^{\infty} x[m]\, h[n-m]\, e^{-j\omega_k m} \]

where k = 0,1,...,N − 1 is the frequency bin index and h is the finite-length sliding window14. As in the previous method, we can now take the first difference between the magnitudes of consecutive STFT frames, that is:

\[ \delta X = \sum_{k=1}^{N} \left( |X[n,k]| - |X[n-1,k]| \right) \]

13 The first difference is the difference between two consecutive samples; in this case each sample describes the energy content of the sampled waveform.
14 See chapter 4 for detail on the STFT.


This measure, known as the spectral difference, can be used to build an efficient onset detection function. Energy-based algorithms are fast and easy to implement, but they become less effective with non-percussive sounds or when the transient energy comes from overlapping (and more complex, e.g. strongly pitched) sounds.

High Frequency Content

This technique is very interesting and was successfully applied by Masri in [33]. The considerations on which it is grounded were proposed by Rodet and Jaillet: energy increases linked to transients tend to appear as broadband events and, since the energy of the signal is usually concentrated at low frequencies, changes due to transients are more noticeable at high frequencies. To emphasize this, the spectrum obtained by the STFT can be weighted preferentially toward high frequencies before summing, to obtain a weighted energy measure. The following formula was proposed:

\[ E[n] = \frac{1}{N} \sum_{k=-N/2}^{N/2-1} W_k \, |X[n,k]|^2 \]

where W_k is the (frequency-dependent) weighting function. Masri proposed W_k = |k|, called high frequency content (HFC), a linear weighting function by which each frequency bin gives a contribution proportional to its frequency. For the reason given above, HFC should be considered one of the pillars of energy-based onset detection. However, the HFC method does not take into account the temporal features of the waveform, which could be equally important. That is why alternative methods (e.g. those considering phase spectrum information) should also be considered.
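The three energy-based detection functions of this section (first difference of the log energy, spectral difference and HFC) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: all function names are hypothetical, a naive DFT of an unwindowed frame stands in for the STFT, and a small floor is added before the logarithm to avoid log(0).

```python
import cmath
import math


def local_energy(x, n, window):
    """E[n]: windowed mean of squared samples around index n (zero-padded)."""
    N = len(window)
    total = 0.0
    for i, h in enumerate(window):
        m = i - N // 2                      # m runs from -N/2 to N/2 - 1
        if 0 <= n + m < len(x):
            total += x[n + m] ** 2 * h
    return total / N


def first_difference(values):
    """Difference between consecutive values of a sequence."""
    return [b - a for a, b in zip(values, values[1:])]


def log_energy_detection(x, window, hop, floor=1e-12):
    """First difference of log E[n], evaluated every `hop` samples.

    `floor` is an assumption of this sketch: it keeps log() finite in silence.
    """
    energies = [local_energy(x, n, window) for n in range(0, len(x), hop)]
    return first_difference([math.log(e + floor) for e in energies])


def dft_magnitudes(frame):
    """Magnitudes |X[k]| of a naive O(N^2) DFT (enough for a sketch)."""
    N = len(frame)
    return [abs(sum(frame[m] * cmath.exp(-2j * math.pi * k * m / N)
                    for m in range(N)))
            for k in range(N)]


def spectral_difference(prev_mags, mags):
    """Sum over bins of |X[n,k]| - |X[n-1,k]|."""
    return sum(b - a for a, b in zip(prev_mags, mags))


def high_frequency_content(mags):
    """Weighted energy with W_k = |k|, bins indexed from -N/2 to N/2 - 1."""
    N = len(mags)
    total = 0.0
    for k in range(N):
        signed_k = k if k < N // 2 else k - N   # map DFT index to -N/2..N/2-1
        total += abs(signed_k) * mags[k] ** 2
    return total / N
```

On a test signal made of silence followed by a sudden burst, the log-energy difference peaks at the frame where the burst starts, while a frame containing only high-frequency energy gives a much larger HFC value than a frame containing only a constant offset.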

6.3.2

Phase Based Approach

The use of phase spectra for onset detection is a relatively recent introduction [20]. Let us see how it works. The starting points are the definition of the unwrapped phase15 and its application to the case of a stationary sinusoid (i.e. one extracted from the steady-state portion of the signal). For a steady-state sinusoid extracted from a single frame of the STFT, the phase in that frame and the phase in the previous frame are used to calculate a value for the instantaneous frequency. An estimate of the instantaneous frequency for the k-th bin of frame n is:

\[ f_k(n) = \left( \frac{\varphi_k(n) - \varphi_k(n-1)}{2\pi h} \right) f_s \]

where h is the hop size between windows and f_s is the sampling rate. For a stationary sinusoid the instantaneous frequency is expected to be approximately constant over adjacent windows. This is equivalent to saying that the phase increment between adjacent windows remains approximately constant, which is expressed by:

\[ \varphi_k(n) - \varphi_k(n-1) \simeq \varphi_k(n-1) - \varphi_k(n-2) \]

15 The unwrapped phase is the phase that is allowed to vary beyond the limited range between 0 and 2π. It is calculated from the instantaneous frequency in the following manner:

\[ \omega(t) = \varphi'(t) = \frac{d}{dt}\varphi(t) \quad \text{(instantaneous angular frequency)} \]
\[ f(t) = \frac{1}{2\pi}\varphi'(t) \quad \text{(instantaneous frequency in Hz)} \]
\[ \varphi(t) = 2\pi \int_0^t f(\tau)\, d\tau + \varphi(0) \quad \text{(the } 2\pi\text{-unwrapped phase)} \]

Equivalently, the phase deviation can be defined as the second difference of the phase:

\[ \Delta\varphi_k(n) = \varphi_k(n) - 2\varphi_k(n-1) + \varphi_k(n-2) \simeq 0 \]

During a transient region the instantaneous frequency is not usually well defined, and hence the phase deviation tends to be large, as illustrated in figure 6.4. Bello proposes a method that analyzes the instantaneous distribution (in the sense of a probability distribution or histogram) of phase deviations across the frequency domain. During the steady-state part of a sound the deviations tend to zero, thus the distribution is strongly peaked around this value. During attack transients the values increase, widening and flattening the distribution. However, this method, although showing some improvement for complex signals, is susceptible to phase distortion and to noise introduced by the phases of components with no significant energy. Finally, why not combine the phase and energy approaches? Again Bello gives the answer, proposing this solution to the detection task in [5].

Phase, energy, and phase/energy approaches are not the only methods applied to onset detection. Several other methods have been proposed, in particular stochastic and statistical methods (see [2] for an example and [4] for a comparison), very different from those above. A Deterministic Plus Stochastic model (as described by Serra in [54]) for onset detection was recently presented by Gifford and Brown in [24]. The perceptually based approach that we used for onset detection is introduced in section 6.4.
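The phase deviation defined in this section can be sketched as follows. The code is an illustrative assumption, not Bello's implementation: the function names are hypothetical, a naive DFT with a rectangular window is used, and the second difference is mapped back to the principal interval instead of explicitly unwrapping the phase.

```python
import cmath
import math


def bin_phase(frame, k):
    """Phase of DFT bin k of `frame` (rectangular window, naive DFT)."""
    N = len(frame)
    value = sum(frame[m] * cmath.exp(-2j * math.pi * k * m / N)
                for m in range(N))
    return cmath.phase(value)


def principal_argument(angle):
    """Map an angle to its principal value in [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def phase_deviation(x, k, frame_len, hop):
    """Second difference of the phase of bin k over successive frames."""
    phases = [bin_phase(x[start:start + frame_len], k)
              for start in range(0, len(x) - frame_len + 1, hop)]
    return [principal_argument(phases[n] - 2 * phases[n - 1] + phases[n - 2])
            for n in range(2, len(phases))]


# Stationary sinusoid centred on bin 4 of a 32-sample frame.
steady = [math.cos(2 * math.pi * 4 * n / 32) for n in range(256)]
# The same sinusoid preceded by silence, i.e. with an abrupt start.
abrupt = [0.0] * 100 + steady

print(max(abs(d) for d in phase_deviation(steady, 4, 32, 5)))
print(max(abs(d) for d in phase_deviation(abrupt, 4, 32, 5)))
```

For the stationary sinusoid the deviation stays numerically close to zero in every frame, whereas the frames that straddle the abrupt start show a clearly non-zero deviation.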


Figure 6.4: Unwrapped phase deviation between two adjacent analysis frames. ∆ϕn,k is the unwrapped phase deviation. In the simplest case of a steady-state sinusoid the phase deviation is approximately zero across all analysis frames, while during a transient the phase deviation should be large and easy to detect.

6.4

Introduction to the Perceptual Based Approach to Onset Detection

Perceptually based approaches have demonstrated their strength in the detection of both pitched and non-pitched sounds16, as summarized in [13]. Since the perceptual attributes of sound are subject to judgment that varies from person to person, perceptual onset detection should be considered a good method whenever it succeeds in approximating the onset detection mechanism of the human ear. The different underlying approach may explain possible divergences in results compared with other methods; usually the aim here is not to find the best and most efficient algorithm that recognizes every onset, but only the perceptually meaningful onsets. At the level of the human auditory system, mechanisms which encode both time and frequency effects determine the subjective perception of sound onsets. The principal limits are imposed by time resolution and frequency masking effects (both explained in chapter 2). Moreover, overlapping pitched and non-pitched sounds (even percussive pitched sounds) can obscure the perception of pitch and also delay or obscure one or more adjacent onsets. Let us see an example before introducing the perceptual method applied to ·O M M· , based on bonk∼ for Max/MSP.

Band-Wise Processing

Scheirer [51] was the first to clearly demonstrate that an onset detection algorithm should follow the human auditory system by treating frequency bands separately and combining the results at the end. An earlier system described by Bilmes was similar, but it used only a high-frequency and a low-frequency band, which he himself judged not very effective [8]. Scheirer [51] described a psychoacoustic demonstration on beat perception which shows that certain kinds of signal simplification can be performed without affecting the perceived rhythmic content of a musical signal. When the signal is divided into at least four frequency bands and the corresponding bands of a noise signal are controlled by the amplitude envelopes of the musical signal, the noise signal has a rhythmic percept which is significantly similar to that of the original signal. On the other hand, this does not hold if only one band is used, in which case the original signal is no longer recognizable from its simplified form (detection function).

The method proposed by Klapuri [30] is the most significant example of the successful application of psychoacoustic knowledge to onset detection. It uses the band-wise processing principle introduced by Bilmes and Scheirer. The procedure is the following:
1. the overall loudness of the signal is normalized to a 70 dB level (preprocessing) using the model of equal loudness contours17,
2. a filterbank divides the signal into 21 non-overlapping bands and, in each band, the onset components are detected and their time and intensity are determined,
3. in the final phase, the onset components are combined to yield onsets.
This method uses psychoacoustic models in onset component detection, in time and intensity determination, and in combining the results. The design of the filterbank is the core of the system: Klapuri proposed a filterbank which approximates the layout of the critical bands and covers the frequencies from 44 Hz to 18 kHz. This is obtained with 21 filters, where the lowest three are one-octave band-pass filters and the remaining eighteen are third-octave band-pass filters18. All subsequent calculations can be done one band at a time, which reduces the memory requirements of the algorithm for long input signals, assuming that parallel processing is not desired. The output of each filter is full-wave rectified and then decimated by a factor of 180 to ease the following computations. Amplitude envelopes are calculated by convolving the band-limited signals with a 100 ms half-Hanning (raised cosine) window. This window performs much the same energy integration as the human auditory system, preserving sudden changes but masking rapid modulation [30].

16 Pitched and non-pitched is a common way to distinguish sounds with strongly harmonically related component frequencies (pitched sounds, e.g. piano, violin...) from sounds with sparse and unrelated partials (non-pitched sounds, e.g. drums...). At a perceptual level, the difference is that when a pitched sound occurs, a pitch is clearly associated with the fundamental frequency of the sound. Non-pitched sounds trigger different mechanisms in the human auditory system; sometimes a pitch is still perceived, but it does not actually correspond to the fundamental frequency.
17 See chapter 2 for more on equal loudness contours.
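The envelope extraction applied to each band (rectification, decimation by 180 and smoothing with a 100 ms half-Hanning window) can be sketched as follows. This is a minimal illustration under assumptions that are not stated in the text: the input is taken to be already band-limited by the filterbank, each block is averaged before decimation, the window is the decaying half of a Hanning window applied causally, and all function names are hypothetical.

```python
import math


def half_hanning(length):
    """Decaying half of a Hanning (raised cosine) window, peak at index 0."""
    return [0.5 * (1.0 + math.cos(math.pi * i / length)) for i in range(length)]


def rectify(signal):
    """Full-wave rectification."""
    return [abs(s) for s in signal]


def decimate(signal, factor):
    """Reduce the rate by `factor`, averaging each block of samples."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal) - factor + 1, factor)]


def convolve_causal(signal, window):
    """Causal convolution: each output depends on current and past inputs."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for i, w in enumerate(window):
            if n - i >= 0:
                acc += signal[n - i] * w
        out.append(acc)
    return out


def band_envelope(band_signal, fs, factor=180, window_ms=100.0):
    """Amplitude envelope of one band-limited signal."""
    reduced = decimate(rectify(band_signal), factor)
    window_len = max(1, round(window_ms / 1000.0 * fs / factor))
    return convolve_causal(reduced, half_hanning(window_len))
```

Because the window is applied causally and has its maximum at the most recent sample, a sudden increase in the band signal appears immediately in the envelope, while fast modulations are averaged over the length of the window.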

6.5

Onset Detection in ·O M M·

From a Listener's Point of View

One of the critical points in a mechanical orchestra is to synchronize the execution of the musical score among real and virtual instruments; this problem is handled by the Show Control System (SCS) of ·O M M· . The human ear is particularly sensitive to timing; hence, with physical devices such as the electromechanical arms of the robots, we have to ensure the timing expected by the listener. In our system the variable delays affecting the orchestra while playing must be overcome to make a synchronized execution possible. The variation of the delays strongly depends on the requested note intensity and on the imposed rhythm. Each robot basically receives a message on a serial line (MIDI), stating what kind of hit should be executed. In addition to the message generation and data transmission delay, usually negligible from a human point of view, there is the delay introduced by the physical movement of the arms. That is, the robot arm must be positioned at the correct height and hits the drum after a time that depends on the distance of the arm from the drum and on the acceleration with which it is driven. Furthermore, the delays do not only vary in a predictable manner: problems related to vibrations not absorbed between one hit and the next can also occur. This could cause unwanted changes in the loudness and pitch19 perceived by the listener. That is why we need a perceptually based approach to analyze (in real time) the sound produced by the robot and its response to the different, digitally applied, stimuli. What we need is to measure the time elapsed between the digital command and the perceived strike on the can, and to compensate for it during the robotic performance. For this purpose we decided to create a delay matrix, in which each entry represents the delay of a note to be executed. Once this delay matrix is available, we can take into account the delay required for a note to be completed and anticipate the execution of that note by exactly that delay. To calculate the delay we needed to provide an onset detector stage in our Show Control System. We know exactly the time of the MIDI event, hence what is missing is the detection of the executed note.

18 Octave and third-octave filters are presented in chapter 3.
19 Since the sound produced by ·O M M· can be considered non-pitched, we do not provide a pitch detector stage in the orchestra, although a pitch may be perceived. We tried to understand this behaviour with the fiddle∼ and pitch∼ external objects for Max/MSP.

Integration in the SCS

We needed to integrate the onset detection stage into the SCS, developed in Max/MSP and running on an Apple laptop. We initially tried to implement a new method, in order to become familiar with development in Max. It was based on an envelope follower with a variable threshold applied to it: when the amplitude envelope of the signal exceeds the threshold, an onset is detected. Because it proved inefficient in practice, this method was soon set aside and other methods were explored. We found a very interesting approach in the bonk∼ object, an external for Max/MSP available as open source on the web. The original code was written by the same author of Max, Miller Puckette, for Pure Data in 1989. It has since been revised by other people over the years: Ted Apel ported bonk∼ to the Max/MSP platform and later Barry Threw applied the latest modification (2008). The version we used is called bonk∼ 1.4, found in M. Puckette's repository, with permission to apply changes.
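The delay compensation described in this section can be sketched as a small data structure. This is a hypothetical illustration, not the Max/MSP patch used in the SCS: the class name, the choice of indexing the matrix by arm and hit type, the averaging of repeated measurements and the numeric values in the example are all assumptions.

```python
class DelayMatrix:
    """Average measured delay (seconds) for each (arm, hit type) pair."""

    def __init__(self):
        self._sums = {}
        self._counts = {}

    def record(self, arm, hit_type, command_time, onset_time):
        """Store one measurement: detected onset time minus MIDI command time."""
        key = (arm, hit_type)
        self._sums[key] = self._sums.get(key, 0.0) + (onset_time - command_time)
        self._counts[key] = self._counts.get(key, 0) + 1

    def delay(self, arm, hit_type):
        """Mean delay for the pair, or 0.0 if it was never measured."""
        key = (arm, hit_type)
        if key not in self._counts:
            return 0.0
        return self._sums[key] / self._counts[key]

    def send_time(self, arm, hit_type, target_time):
        """Time at which the command must be sent so the hit lands on target."""
        return target_time - self.delay(arm, hit_type)


# Hypothetical measurements: command sent at 1.00 s, onset detected at 1.12 s, etc.
matrix = DelayMatrix()
matrix.record("left", "loud", command_time=1.00, onset_time=1.12)
matrix.record("left", "loud", command_time=2.00, onset_time=2.08)
print(matrix.send_time("left", "loud", target_time=5.0))
```

The onset detector provides the onset_time values, while the command times are known exactly from the MIDI events, which is why the detector is the only missing element of the compensation scheme.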

6.5.1

The bonk∼ Method

The bonk∼ method is essentially based on a specialization of the constant-Q filterbank analysis, called bounded-Q analysis. This method has the advantage of drastically reducing the complexity of the constant-Q transform, as described in [17] after [29] and [11]. In this kind of analysis the value of Q is limited (bounded) to approximately 5, and a small number of filters is sufficient to obtain a filterbank that gives at least the same results as a constant-Q analysis. In addition, the bounded-Q analysis takes advantage of an FFT-like algorithm applied within each frequency channel. This is possible because the octaves are geometrically spaced but, within each octave, the frequency bins are equally spaced, as shown in figure 6.5. With a proper number of channels per octave this channel distribution becomes a good approximation of the geometric scale. Puckette in [38] says: the bonk∼ object was written for dealing with sound sources for which sinusoidal decomposition breaks down; the first application has been to drums and percussion. That is our case. The bandwidths of the filters subdivide the sound spectrum into regions which are approximately tuned to the critical bands, in a similar manner to the Klapuri approach above. This should mimic the behavior of the auditory system well.


Figure 6.5: Graphical representation of the bounded-Q filterbank. Only the octaves are geometrically spaced; within each octave the spacing between analysis bins is linear. This allows the application of an FFT-like algorithm to calculate the spectrum of each component.

We found that an implementation with 15 (non-overlapping) filters was successful in our case. See table 6.1 for details of the filters used for the band-wise analysis. In this table the spacing of two filters per octave can be easily recognized, except where prohibited (the first two filters do not respect this spacing20). The details of the filterbank implementation can be found in the appendix of this thesis. The final stage, which we called the peak-picking stage above, works in bonk∼ essentially through the definition of a growth function.

6.5.2

Result of the Analysis in ·O M M·

After several months spent debugging, optimizing and adapting the source code to our needs, we subjected it to several tests. These tests, performed with sounds recorded at the LIM laboratories in Verres (AO), demonstrated the soundness of this approach. We recorded 3 tracks of 300 sounds each, then cut and pasted them to obtain five scores, called soundtracks in the result summary. The soundtracks, very different from each other, were realized at different bpm21, from 100 to 120 (the maximum value that a single robot of ·O M M· can reach). These tracks, created with Ableton Live, were sent to bonk∼ by a dedicated Max patch built for testing. The sounds are sent to bonk∼ via a special object for Max, called Elastic, which allows the pitch and tempo of the execution to be varied. The testing patch is presented in the appendix of the thesis.

20 See chapter 2 for the description of critical bands.
21 Beats Per Minute.


Table 6.1: Filterbank design in our method based on bonk∼

Filter number   fc [Hz]     Bandwidth [Hz]   Filter points   Number of hops   Hop size
 1                 86          32.25         1024              1              512
 2                150.5        32.25         1024              1              512
 3                220.59       37.84          873              1              436
 4                312.18       53.32          617              2              308
 5                441.18       75.68          436              3              218
 6                623.93      107.07          308              5              154
 7                882.36      151.36          218              8              109
 8               1247.86      214.14          154             12               77
 9               1764.72      302.72          109             17               55
10               2495.72      428.28           77             25               39
11               3529.44      605.44           55             36               27
12               4991.44      856.56           39             52               19
13               7059.31     1211.31           27             72               14
14               9983.31     1712.69           19            101               10
15              14118.19     2422.19           14            145                7
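The spacing of two filters per octave and the bounded value of Q can be checked directly from the values of table 6.1. The short script below is only a verification sketch; it uses the centre frequencies and bandwidths of filters 3 to 15 (the first two filters do not follow the spacing) and the usual definition Q = fc / bandwidth, which is an assumption about how the table values relate.

```python
import math

# Centre frequencies and bandwidths (Hz) of filters 3 to 15 in table 6.1.
fc = [220.59, 312.18, 441.18, 623.93, 882.36, 1247.86, 1764.72,
      2495.72, 3529.44, 4991.44, 7059.31, 9983.31, 14118.19]
bw = [37.84, 53.32, 75.68, 107.07, 151.36, 214.14, 302.72,
      428.28, 605.44, 856.56, 1211.31, 1712.69, 2422.19]

# Quality factor of each filter: Q = fc / bandwidth.
q_values = [f / b for f, b in zip(fc, bw)]

# Ratio between consecutive centre frequencies; two filters per octave
# correspond to a ratio of sqrt(2).
ratios = [b / a for a, b in zip(fc, fc[1:])]

print("Q min/max:", round(min(q_values), 2), round(max(q_values), 2))
print("ratio min/max:", round(min(ratios), 4), round(max(ratios), 4))
print("sqrt(2):", round(math.sqrt(2), 4))
```

With the table values, the ratio between consecutive centre frequencies stays very close to the square root of 2 and Q is nearly the same for all thirteen filters, which is what a two-filters-per-octave, bounded-Q design implies.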

The final configuration of bonk∼ gave great results: all the hits produced by the orchestra can be located in time and the value of CDR (Correct Detection Result), proposed in [30], was very easy to calculate. The CDR is given by:

\[ \mathrm{CDR} = \frac{\text{total} - \text{undetected} - \text{false detected}}{\text{total}} \cdot 100\% \]

and the results of the analysis are reported in Table 6.2. Only above 110 bpm does bonk∼ fail in some cases, that is, spurious onsets are reported (but all the provided onsets are recognized).


Table 6.2: Results in detecting onsets of the five soundtracks created for analysis purposes, played at different bpm.

·O M M· soundtrack (bpm)   Total Onsets   Detected   Undetected   False Detected   CDR [%]
100                        120            120        0            0                100
105                        80             80         0            0                100
110                        90             90         0            0                100
115                        120            120        0            2                98
120                        60             60         0            5                95
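As a small worked example of the CDR definition, the sketch below evaluates the formula for one soundtrack. The function name `cdr_percent` and the choice to return 0 for an empty score are assumptions made here for illustration.

```c
/* Correct detection figure in percent, following the CDR formula:
   (total - undetected - false detected) / total * 100. */
double cdr_percent(int total, int undetected, int false_detected)
{
    if (total <= 0)
        return 0.0;   /* assumption: an empty score is reported as 0 */
    return 100.0 * (total - undetected - false_detected) / total;
}
```

For the 115 bpm soundtrack (120 onsets, none undetected, 2 false detections) this gives about 98.3%, which rounds to the 98 listed in the table.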

6.6

From Onset Analysis to Sound Classification

In this last section, we aim to demonstrate that the bonk∼ analysis used for onset detection can be extended to the more ambitious task of sound classification. As mentioned before, ·O M M· robots are able to produce three different sounds, which differ substantially in intensity and duration. What we present in this section is the ability of bonk∼ to predict which of the three sounds has been played, using only the loudness content of each frequency band during onset detection. This, if true, would mean that the spectral analysis performed by bonk∼, especially when it detects an onset, is enough to predict which kind of sound has been produced by the orchestra. Puckette has provided another tool in bonk∼, called learn mode, which stores into a template the pattern given by the output of each filter used for the analysis. The learn mode works essentially in this way:

1. enter learn mode and choose the number of equal
2. start the bonk∼ analysis and stop it after all the provided onsets are recognized
3. store the spectral template into a file and read it
4. exit learn mode and continue to analyze the signal
5. bonk∼ will report the number corresponding to the sound which best fits the spectral template read from file
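The last step of the procedure, choosing the stored template that best fits the spectrum measured at an onset, can be sketched as follows. This is an illustration under stated assumptions and not the matching rule used inside bonk∼: we assume one non-negative value per filter and rank templates by cosine similarity. The names `best_template` and `NBANDS` are introduced here.

```c
#include <stddef.h>

#define NBANDS 15   /* one value per filter, as in Table 6.1 */

/* Returns the index of the stored template closest to `spectrum`,
   or -1 if no usable template exists. */
int best_template(const float spectrum[NBANDS],
                  const float templates[][NBANDS], size_t ntemplates)
{
    int best = -1;
    double best_score = -1.0;

    for (size_t t = 0; t < ntemplates; t++) {
        double dot = 0.0, norm2 = 0.0;

        for (int k = 0; k < NBANDS; k++) {
            dot += (double)spectrum[k] * templates[t][k];
            norm2 += (double)templates[t][k] * templates[t][k];
        }
        if (norm2 <= 0.0)
            continue;   /* skip empty templates */
        /* For non-negative band values, ranking by dot^2 / |T|^2 gives the
           same order as cosine similarity, since |spectrum| is the same
           for every template; this avoids a square root. */
        double score = dot * dot / norm2;
        if (score > best_score) {
            best_score = score;
            best = (int)t;
        }
    }
    return best;
}
```

With three templates, one per sound A, B and C, the returned index plays the role of the number reported in step 5.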

6.6.1

Learning Results

Results of bonk∼ learning for the case of ·O M M·:


Table 6.3: Numerical results in detecting onsets and recognizing the three sounds (A/B/C) produced by ·O M M·

·O M M· soundtrack (bpm)   Total Onsets   Total A/B/C Notes   Total A/B/C Notes Recognized   A/B/C Notes Confused
100                        120            30/60/30            25/64/31                       5/5/0
105                        80             30/30/20            25/32/23                       5/6/4
110                        90             33/26/31            32/28/30                       2/1/1
115                        120            100/12/8            97/16/7                        4/1/3
120                        60             25/20/15            30/15/15                       4/0/7

The result is quite unexpected: more than 80% correct correspondence on average (in particular cases higher than 95%) was found, simply by looking at the onset analysis results.
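The percentages quoted here can be related to Table 6.3 with a short calculation. The sketch below assumes that the "confused" column counts the misclassified notes of each class, which is our reading of the table and is not stated explicitly; `recognition_percent` is a name introduced for illustration.

```c
/* Percentage of notes not reported as confused, for one soundtrack.
   total[k] and confused[k] hold the counts for sounds A, B and C. */
double recognition_percent(const int total[3], const int confused[3])
{
    int n = 0, c = 0;

    for (int k = 0; k < 3; k++) {
        n += total[k];
        c += confused[k];
    }
    return n > 0 ? 100.0 * (n - c) / n : 0.0;
}
```

Under this reading, the 110 bpm soundtrack (33/26/31 notes, 2/1/1 confused) gives about 95.6%, and the 105 bpm soundtrack (30/30/20 notes, 5/6/4 confused) gives about 81%.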


Chapter 7 Conclusion
A specific component of the human ear, the basilar membrane inside the cochlea, located in the inner ear, is responsible for detecting the frequency components of a sound. Along this thin membrane, 32 mm long, each frequency causes oscillations around a specific point. The mechanical properties of the cochlea (wide and stiff at the base, narrower and much less stiff at the end), in which the basilar membrane is located, produce a roughly logarithmic decrease in bandwidth as we move linearly away from the cochlear opening (the oval window). Therefore we propose a different approach to sound analysis, known as the constant-Q filterbank method. This method is typically implemented with a bank of band-pass filters (filterbank) with a constant Q ratio. We recall that Q is the ratio between the center frequency and the bandwidth of a filter. This method mimics the behaviour of the auditory system in detecting frequency, i.e. the filters are linearly spaced along a logarithmic frequency axis. This can be obtained with filters that keep their Q ratio constant. It is established that Q must be chosen approximately equal to 37 to perform a rigorous scan over the frequency range from 20 Hz to 20 kHz, and that several filters (at least one hundred) must be used. We compared our task with the one addressed by an external library for Max/MSP, called bonk∼, developed by the author of Max/MSP himself, Miller Puckette. We found in it a very interesting approach, very helpful for our purpose. The code has been revised by other people over the years (the original code dates from 1989) and the version we have taken into account is bonk∼ 3.0. The code was found on M. Puckette's repository, with permission to apply changes. The bonk∼ method works essentially on a specialization of the constant-Q filterbank analysis, called bounded-Q analysis. This method has the advantage of reducing the complexity of the constant-Q transform.
In this kind of analysis the value of Q is limited (bounded) to approximately 5, and a small number of filters is enough to obtain at least the same results. We found that the implementation of 15 non-overlapping filters was successful for our case. The bandwidths of the filters subdivide the sound spectrum into regions which are approximately tuned around musical octaves, thus respecting what seems to be the response of the auditory system. This method, implemented for the first time in the late 80s, has shown good results in various fields of musical analysis, in particular segmentation and transcription.

After several months spent debugging, optimizing and adapting the source code to our needs, we subjected it to several tests. These tests, performed with sounds recorded at the LIM laboratories in Verres (AO), have demonstrated the soundness of this approach. All the hits produced by the orchestra can be located in time. Moreover, we also trained bonk∼ to recognize which kind of sound has been produced, obtaining more than 85% successful correspondences. The percentage of recognized hits indicates the validity of the approach; other musical applications can be foreseen inside or outside ·O M M·.


Appendix A MSP, anatomy of the object
Source 1: main.c
/**
    @page chapter_msp_anatomy Anatomy of a MSP Object

    An MSP object that handles audio signals is a regular Max object with a few extras.
    Refer to the <a href="plussz~_8c-source.html">plussz~</a> example project source as we detail these additions.
    plussz~ is simply an object that adds 1 to a signal, identical in function to the regular MSP +~ object
    if you were to give it an argument of 1.

    Here is an enumeration of the basic tasks:

    1) additional header files

    After including ext.h and ext_obex.h, include z_dsp.h
    @code
    #include "z_dsp.h"
    @endcode

    2) C structure declaration

    The C structure declaration must begin with a #t_pxobject, not a #t_object:
    @code
    typedef struct _mydspobject
    {
        t_pxobject m_obj;
        // rest of the structure's fields
    } t_mydspobject;
    @endcode

    3) initialization routine

    When creating the class with class_new(), you must have a free function. If you have nothing special to do,
    use dsp_free(), which is defined for this purpose. If you write your own free function, the first thing it
    should do is call dsp_free(). This is essential to avoid crashes when freeing your object when audio
    processing is turned on.
    @code
    c = class_new("mydspobject", (method)mydspobject_new, (method)dsp_free, sizeof(t_mydspobject), NULL, 0);
    @endcode

    After creating your class with class_new(), you must call class_dspinit(), which will add some standard
    method handlers for internal messages used by all signal objects.
    @code
    class_dspinit(c);
    @endcode

    Your signal object needs a method that is bound to the symbol "dsp" -- we'll detail what this method does
    below, but the following line needs to be added while initializing the class:
    @code
    class_addmethod(c, (method)mydspobject_dsp, "dsp", A_CANT, 0);
    @endcode

    4) new instance routine

    The new instance routine must call dsp_setup(), passing a pointer to the newly allocated object pointer plus
    a number of signal inlets the object will have. If the object has no signal inlets, you may pass 0.
    The plusz~ object (as an example) has a single signal inlet:
    @code
    dsp_setup((t_pxobject *)x, 1);
    @endcode

    dsp_setup() will make the signal inlets (as proxies) so you need not make them yourself.

    If your object will have audio signal outputs, they need to be created in the new instance routine with
    outlet_new(). However, you will never access them directly, so you don't need to store pointers to them as
    you do with regular outlets. Here is an example of creating two signal outlets:
    @code
    outlet_new((t_object *)x, "signal");
    outlet_new((t_object *)x, "signal");
    @endcode

    5) The dsp method and perform routine

    The dsp method specifies the signal processing function your object defines along with its arguments.
    Your object's dsp method will be called when the MSP signal compiler is building a sequence of operations
    (known as the DSP Chain) that will be performed on each set of audio samples. The operation sequence
    consists of a pointers to functions (called perform routines) followed by arguments to those functions.

    The dsp method is declared as follows:
    @code
    void mydspobject_dsp(t_mydspobject *x, t_signal **sp, short *count);
    @endcode

    To add an entry to the DSP chain, your dsp method uses dsp_add(). The dsp method is passed an array of
    signals (#t_signal pointers), which contain pointers to the actual sample memory your object's perform
    routine will be using for input and output. The array of signals starts with the inputs (from left to
    right), followed by the outputs. For example, if your object has two inputs (because your new instance
    routine called dsp_setup(x, 2)) and three outputs (because your new instance created three signal outlets),
    the signal array sp would contain five items as follows:
    @code
    sp[0] // left input
    sp[1] // right input
    sp[2] // left output
    sp[3] // middle output
    sp[4] // right output
    @endcode

    The #t_signal data structure (defined in z_dsp.h), contains two important elements: the s_n field, which is
    the size of the signal vector, and s_vec, which is a pointer to an array of 32-bit floats containing the
    signal data. All t_signals your object will receive have the same size. This size is not necessarily the
    same as the global MSP signal vector size, because your object might be inside a patcher within a poly~
    object that defines its own size. Therefore it is important to use the s_n field of a signal passed to your
    object's dsp method.

    You can use a variety of strategies to pass arguments to your perform routine via dsp_add(). For simple
    unit generators that don't store any internal state between computing vectors, it is sufficient to pass the
    inputs, outputs, and vector size. For objects that need to store internal state between computing vectors
    such as filters or ramp generators, you will pass a pointer to your object, whose data structure should
    contain space to store this state.

    The plus1~ object does not need to store internal state. It passes the input, output, and vector size to
    its perform routine. The plus1~ dsp method is shown below:

    @code
    void plus1_dsp(t_plus1 *x, t_signal **sp, short count)
    {
        dsp_add(plus1_perform, 3, sp[0]->s_vec, sp[1]->s_vec, sp[0]->s_n);
    }
    @endcode

    The first argument to dsp_add() is your perform routine, followed by the number of additional arguments you
    wish to copy to the DSP chain, and then the arguments.

    The perform routine is not a "method" in the traditional sense. It will be called within the callback of an
    audio driver, which, unless the user is employing the Non-Real Time audio driver, will typically be in a
    high-priority thread. Thread protection inside the perform routine is minimal. You can use a clock, but you
    cannot use qelems or outlets. The design of the perform routine is somewhat unlike other Max methods. It
    receives a pointer to a piece of the DSP chain and it is expected to return the location of the next
    perform routine on the chain. The next location is determined by the number of arguments you specified for
    your perform routine with your call to dsp_add(). For example, if you will pass three arguments, you need
    to return w + 4.

    Here is the plus1 perform routine:

    @code
    t_int *plus1_perform(t_int *w)
    {
        t_float *in, *out;
        int n;

        in = (t_float *)w[1];       // get input signal vector
        out = (t_float *)w[2];      // get output signal vector
        n = (int)w[3];              // vector size

        while (n--)                 // perform calculation on all samples
            *out++ = *in++ + 1.;

        return w + 4;               // must return next DSP chain location
    }
    @endcode

    6) Free function

    The free function for the class must either be dsp_free() or it must be written to call dsp_free() as shown
    in the example below:

    @code
    void mydspobject_free(t_mydspobject *x)
    {
        dsp_free((t_pxobject *)x);
        // can do other stuff here
    }
    @endcode
*/
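The calling convention described in the listing above (a perform routine reads its arguments from `w[1]`, `w[2]`, `w[3]` and returns `w + 4`) can be tried outside Max with a small stand-alone sketch. The typedefs below are stand-ins chosen for illustration and are not the definitions from z_dsp.h; `run_chain` and `chain_add` are hypothetical helpers that only imitate what the MSP scheduler and dsp_add() do.

```c
#include <stdint.h>

/* Stand-ins for the SDK types; assumptions for illustration only. */
typedef intptr_t t_int;
typedef float t_float;
typedef t_int *(*t_perfroutine)(t_int *w);

/* Same layout as plus1_perform: w[0] holds the routine itself,
   w[1] the input vector, w[2] the output vector, w[3] the vector size. */
t_int *plus1_perform(t_int *w)
{
    t_float *in = (t_float *)w[1];
    t_float *out = (t_float *)w[2];
    int n = (int)w[3];

    while (n--)
        *out++ = *in++ + 1.f;
    return w + 4;   /* one slot for the routine plus three arguments */
}

/* Hypothetical scheduler: calls each routine in turn until a zero slot.
   Storing a function pointer in an integer slot is implementation-defined
   but works on common platforms. */
void run_chain(t_int *chain)
{
    while (*chain)
        chain = ((t_perfroutine)*chain)(chain);
}

/* Hypothetical helper: writes one chain entry and returns the next free slot. */
t_int *chain_add(t_int *slot, t_perfroutine f, t_float *in, t_float *out, int n)
{
    slot[0] = (t_int)f;
    slot[1] = (t_int)in;
    slot[2] = (t_int)out;
    slot[3] = (t_int)n;
    return slot + 4;
}
```

Chaining two entries (buffer a to b, then b to c) and terminating the chain with a zero slot adds 2 to every sample of a, which shows why each routine must return the location of the next one.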


Appendix B bonk∼ source code
No substantial modification has been applied to the original bonk∼ code. Earlier modifications to the original were made by Barry Threw for the latest version of bonk∼, the one we used1.

1 Bonk3 can be found in M. Puckette's repository (ask him) or in other unspecified locations on the web. The one we used was found on Barry Threw's website, but is no longer available.

B.1

The bonk∼ Method
Source 2: main.c

/*
###########################################################################
# bonk~ - a pd and Max/MSP external
# by miller puckette and ted appel
# http://crca.ucsd.edu/~msp/
# Max/MSP port by barry threw (me@barrythrew.com)
# http://www.barrythrew.com
# San Francisco, CA 2008
# for Kesumo - http://www.kesumo.com
# Max 5 optimized version for loud percussive sounds, by Zengi. BETA version
# Turin, June 2009
###########################################################################
// bonk~ detects attacks in an audio signal
###########################################################################
This software is copyrighted by Miller Puckette and others. The following
terms (the "Standard Improved BSD License") apply to all files associated with
the software unless explicitly disclaimed in individual files:

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

1. Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above
   copyright notice, this list of conditions and the following
   disclaimer in the documentation and/or other materials provided
   with the distribution.
3. The name of the author may not be used to endorse or promote
   products derived from this software without specific prior
   written permission.

THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR
BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
THE POSSIBILITY OF SUCH DAMAGE.
*/

/*
dolist:
decay and other times in msec // still to do
*/

#include <math.h>
#include <stdio.h>
#include <string.h>

//#ifdef NT
//#pragma warning (disable: 4305 4244)
//#endif

#include "ext.h"
#include "z_dsp.h"
#include "math.h"
#include "ext_support.h"
#include "ext_proto.h"
#include "ext_obex.h"

typedef double t_floatarg;      /* from m_pd.h */

#define flog log
#define fexp exp
#define fsqrt sqrt
#define t_resizebytes(a, b, c) t_resizebytes((char *)(a), (b), (c))

void *bonk_class;
#define getbytes t_getbytes
#define freebytes t_freebytes

//BONK ATTRIBUTE SETTINGS, YOU CAN OVERRIDE THEM IN MAX PATCH
#define DEFNPOINTS 1024
#define MAXCHANNELS 2           /* 8 */
#define MINPOINTS 64
#define DEFPERIOD 256           /* 128 */
#define DEFNFILTERS 15
#define DEFHALFTONES 6
#define DEFOVERLAP 1
#define DEFFIRSTBIN 2           /* modified, was 1 */
#define DEFHITHRESH 10          /* 5 */
#define DEFLOTHRESH 5           /* 2.5 */


#define DEFMASKTIME 4
#define DEFMASKDECAY 0.7
#define DEFDEBOUNCEDECAY 0
#define DEFMINVEL 7
#define DEFATTACKBINS 1
#define MAXATTACKWAIT 4

//DATA STRUCTURES
typedef struct _filterkernel
{
    int k_filterpoints;
    int k_hoppoints;
    int k_skippoints;
    int k_nhops;
    float k_centerfreq;         /* center frequency, bins */
    float k_bandwidth;          /* bandwidth, bins */
    float *k_stuff;
} t_filterkernel;

// filterbank structure, implements a linked list of filters with struct _filterbank *b_next.
typedef struct _filterbank
{
    int b_nfilters;             /* number of filters in bank */
    int b_npoints;              /* input vector size */
    float b_halftones;          /* filter bandwidth in halftones */
    float b_overlap;            /* overlap; default 1 for 1/2-power pts */
    float b_firstbin;           /* freq of first filter in bins, default 1 */
    t_filterkernel *b_vec;      /* filter kernels */
    int b_refcount;             /* number of bonk~ objects using this */
    struct _filterbank *b_next; /* next in linked list */
} t_filterbank;

/* 1.3 review */
#define MAXNFILTERS 50
#define MASKHIST 8

static t_filterbank *bonk_filterbanklist;

typedef struct _hist
{
    float h_power;
    float h_before;
    float h_outpower;
    int h_countup;
    float h_mask[MASKHIST];
} t_hist;

typedef struct template
{
    float t_amp[MAXNFILTERS];
} t_template;

typedef struct _insig
{
    t_hist g_hist[MAXNFILTERS]; /* history for each filter */
    void *g_outlet;             /* outlet for raw data */
    float *g_inbuf;             /* buffered input samples */
    t_float *g_invec;           /* new input samples */
} t_insig;

typedef struct _bonk


{
    t_pxobject x_obj;
    void *obex;
    void *x_cookedout;
    void *x_clock;
    short x_vol;

    /* parameters */
    int x_npoints;              /* number of points in input buffer */
    int x_period;               /* number of input samples between analyses */
    int x_nfilters;             /* number of filters requested */
    float x_halftones;          /* nominal halftones between filters */
    float x_overlap;
    float x_firstbin;

    float x_hithresh;           /* threshold for total growth to trigger */
    float x_lothresh;           /* threshold for total growth to re-arm */
    float x_minvel;             /* minimum velocity we output */
    float x_maskdecay;
    int x_masktime;
    int x_useloudness;          /* use loudness spectra instead of power */
    float x_debouncedecay;
    float x_debouncevel;
    double x_learndebounce;     /* debounce time (in "learn" mode only) */
    int x_attackbins;           /* number of bins to wait for attack */

    t_filterbank *x_filterbank;
    t_hist x_hist[MAXNFILTERS];
    t_template *x_template;
    t_insig *x_insig;
    int x_ninsig;
    int x_ntemplate;
    int x_infill;
    int x_countdown;
    int x_willattack;
    int x_attacked;
    int x_debug;
    int x_learn;
    int x_learncount;           /* countup for "learn" mode */
    int x_spew;                 /* if true, always generate output! */
    int x_maskphase;            /* phase, 0 to MASKHIST-1, for mask history */
    float x_sr;                 /* current sample rate in Hz. */
    int x_hit;                  /* next "tick" called because of a hit, not a poll */
} t_bonk;

//PROTOTYPES
// prototypes for methods: need a method for each incoming message
static void *bonk_new(t_symbol *s, long ac, t_atom *av);
static void bonk_tick(t_bonk *x);
static void bonk_doit(t_bonk *x);
static t_int *bonk_perform(t_int *w);
static void bonk_dsp(t_bonk *x, t_signal **sp);
void bonk_assist(t_bonk *x, void *b, long m, long a, char *s);
static void bonk_free(t_bonk *x);
void bonk_setup(void);
void main();

// methods for threshold and other features
static void bonk_thresh(t_bonk *x, t_floatarg f1, t_floatarg f2);
static void bonk_print(t_bonk *x, t_floatarg f);


static void bonk_bang(t_bonk *x);
static void bonk_learn(t_bonk *x, int n);
static void bonk_forget(t_bonk *x);

// methods for reading and writing templates
static void bonk_write(t_bonk *x, t_symbol *s);
static void bonk_read(t_bonk *x, t_symbol *s);

// methods for attribute setters
void bonk_minvel_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_lothresh_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_hithresh_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_masktime_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_maskdecay_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_debouncedecay_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_debug_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_spew_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_useloudness_set(t_bonk *x, void *attr, long ac, t_atom *av);
void bonk_attackbins_set(t_bonk *x, void *attr, long ac, t_atom *av);

float qrsqrt(float f);
double clock_getsystime();
double clock_gettimesince(double prevsystime);
char *strcpy(char *s1, const char *s2);

// clock function
static void bonk_tick(t_bonk *x);

#define HALFWIDTH 0.75  /* half peak bandwidth at half power point in bins */

//CONSTANT Q FILTERBANK IMPLEMENTATION
static t_filterbank *bonk_newfilterbank(int npoints, int nfilters,
    float halftones, float overlap, float firstbin)
{
    int i, j;
    float cf, bw, h, relspace;
    t_filterbank *b = (t_filterbank *)getbytes(sizeof(*b));
    b->b_npoints = npoints;
    b->b_nfilters = nfilters;
    b->b_halftones = halftones;
    b->b_overlap = overlap;
    b->b_firstbin = firstbin;
    b->b_refcount = 0;
    b->b_next = bonk_filterbanklist;
    bonk_filterbanklist = b;
    b->b_vec = (t_filterkernel *)getbytes(nfilters * sizeof(*b->b_vec));

    // in a constant Q filterbank, the spacing between filters is implemented in this way
    h = exp((log(2.)/12.)*halftones);   /* specced interval between filters */
    // h=halftones*5;
    relspace = (h - 1)/(h + 1);         /* nominal spacing-per-f for filterbank */
    // relspace=h/2;
    cf = firstbin;                      // first center freq of the filterbank
    // bandwidth
    bw = cf * relspace * overlap;
    if (bw < HALFWIDTH)
        bw = HALFWIDTH;
    // creates (i) filters, MAX(i)=50.
    // stops creating filters when cf exceeds npoints/2, returns i.
    for (i = 0; i < nfilters; i++)


{

276

        float *fp, newcf, newbw;
        float normalizer = 0;
        int filterpoints, skippoints, hoppoints, nhops;

        filterpoints = 0.5 + npoints * HALFWIDTH/bw;
        // filterpoints = 0.5 + npoints/bw;
        if (cf > npoints/2)
        {
            post("bonk~: only using %d filters (ran past Nyquist)", i+1);
            break;
        }
        if (filterpoints < 4)
        {
            post("bonk~: only using %d filters (kernels got too short)", i+1);
            break;
        }
        else if (filterpoints > npoints)
            filterpoints = npoints;

        hoppoints = 0.5 + 0.5 * npoints * HALFWIDTH/bw;
        // hoppoints = 0.5 + 0.5 * npoints/bw;
        nhops = 1. + (npoints - filterpoints)/(float)hoppoints;
        skippoints = 0.5 * (npoints - filterpoints - (nhops-1) * hoppoints);

        // Fill the kernel of the filters in filterbank->filterkernel
        b->b_vec[i].k_stuff =
            (float *)getbytes(2 * sizeof(float) * filterpoints);
        b->b_vec[i].k_filterpoints = filterpoints;
        b->b_vec[i].k_nhops = nhops;
        b->b_vec[i].k_hoppoints = hoppoints;
        b->b_vec[i].k_skippoints = skippoints;
        b->b_vec[i].k_centerfreq = cf;
        b->b_vec[i].k_bandwidth = bw;

        // BANDPASS FILTER DESIGN (to be checked)
        for (fp = b->b_vec[i].k_stuff, j = 0; j < filterpoints; j++, fp += 2)
        {
            float phase = j * cf * (2*3.14159/npoints);
            float wphase = j * (2*3.14159/filterpoints);
            float window = sin(0.5*wphase);
            fp[0] = window * cos(phase);
            fp[1] = window * sin(phase);
            normalizer += window;
            // post("cosphase %.2f sinphase %.2f wphase %.2f window %.2f norm %.2f fp0 %.2f fp1 %.2f",
            //     cos(phase), sin(phase), wphase, window, normalizer, fp[0], fp[1]);
        }
        normalizer = 1/(normalizer * nhops);
        for (fp = b->b_vec[i].k_stuff, j = 0; j < filterpoints; j++, fp += 2)
            fp[0] *= normalizer, fp[1] *= normalizer;
        // fp now points past the end of the kernel, so the first tap is
        // printed through k_stuff (and by value, as %f expects)
        post("i %d cf %.2f bw %.2f Q %.2f nhops %d, hop %d, skip %d, npoints %d, normalizer %.8f fp0 %.6f, fp1 %.6f",
            i, cf, bw, cf/bw, nhops, hoppoints, skippoints, filterpoints,
            normalizer, b->b_vec[i].k_stuff[0], b->b_vec[i].k_stuff[1]);

        newcf = (cf + bw/overlap)/(1 - relspace);
        newbw = newcf * overlap * relspace;
        if (newbw < HALFWIDTH)
        {
            newbw = HALFWIDTH;
            newcf = cf + 2 * HALFWIDTH / overlap;
        }
        cf = newcf;
        bw = newbw;
    }
    // sets to 0 the remaining filters, if fewer than 50 filters are used
    for (; i < nfilters; i++)
        b->b_vec[i].k_stuff = 0, b->b_vec[i].k_filterpoints = 0;
    return (b);
}

static void bonk_freefilterbank(t_filterbank *b)
{
    t_filterbank *b2, *b3;
    int i;
    if (bonk_filterbanklist == b)
        bonk_filterbanklist = b->b_next;
    else for (b2 = bonk_filterbanklist; b3 = b2->b_next; b2 = b3)
        if (b3 == b)
    {
        b2->b_next = b3->b_next;
        break;
    }
    for (i = 0; i < b->b_nfilters; i++)
        if (b->b_vec[i].k_stuff)
            freebytes(b->b_vec[i].k_stuff,
                b->b_vec[i].k_filterpoints * sizeof(float));
    freebytes(b, sizeof(*b));
}

static void bonk_donew(t_bonk *x, int npoints, int period, int nsig,
    int nfilters, float halftones, float overlap, float firstbin,
    float samplerate)
{
    int i, j;
    t_hist *h;
    float *fp;
    t_insig *g;
    t_filterbank *fb;
    for (j = 0, g = x->x_insig; j < nsig; j++, g++)
    {
        /* note: since i starts at 0, the test i-- is false at once,
           so the body of this loop never runs */
        for (i = 0, h = g->g_hist; i--; h++)
        {
            h->h_power = h->h_before = 0, h->h_countup = 0;
            for (j = 0; j < MASKHIST; j++)
                h->h_mask[j] = 0;
        }
        /* we ought to check for failure to allocate memory here */
        g->g_inbuf = (float *)getbytes(npoints * sizeof(float));
        for (i = npoints, fp = g->g_inbuf; i--; fp++) *fp = 0;
    }
    if (!period) period = npoints/2;
    x->x_npoints = npoints;
    x->x_period = period;
    x->x_ninsig = nsig;
    x->x_nfilters = nfilters;
    x->x_halftones = halftones;
    x->x_template = (t_template *)getbytes(0);
    x->x_ntemplate = 0;
    x->x_infill = 0;
    x->x_countdown = 0;
    x->x_willattack = 0;
    x->x_attacked = 0;
    x->x_maskphase = 0;
    x->x_debug = 0;
    x->x_hithresh = DEFHITHRESH;
    x->x_lothresh = DEFLOTHRESH;
    x->x_masktime = DEFMASKTIME;
    x->x_maskdecay = DEFMASKDECAY;
    x->x_learn = 0;
    x->x_learndebounce = clock_getsystime();
    x->x_learncount = 0;
    x->x_debouncedecay = DEFDEBOUNCEDECAY;
    x->x_minvel = DEFMINVEL;
    x->x_useloudness = 0;
    x->x_debouncevel = 0;
    x->x_attackbins = DEFATTACKBINS;
    x->x_sr = samplerate;
    x->x_filterbank = 0;
    x->x_hit = 0;
    for (fb = bonk_filterbanklist; fb; fb = fb->b_next)
        if (fb->b_nfilters == x->x_nfilters &&
            fb->b_halftones == x->x_halftones &&
            fb->b_firstbin == firstbin &&
            fb->b_overlap == overlap &&
            fb->b_npoints == x->x_npoints)
    {
        fb->b_refcount++;
        x->x_filterbank = fb;
        break;
    }
    if (!x->x_filterbank)
        x->x_filterbank = bonk_newfilterbank(npoints, nfilters,
            halftones, overlap, firstbin),
            x->x_filterbank->b_refcount++;
}

static void bonk_tick(t_bonk *x)
{
    t_atom at[MAXNFILTERS], *ap, at2[3];
    int i, j, k, n;
    t_hist *h;
    float *pp, vel = 0, temperature = 0;
    float *fp;
    t_template *tp;
    int nfit, ninsig = x->x_ninsig, ntemplate = x->x_ntemplate,
        nfilters = x->x_nfilters;
    t_insig *gp;
#ifdef _MSC_VER
    float powerout[MAXNFILTERS*MAXCHANNELS];
#else
    float *powerout = alloca(x->x_nfilters * x->x_ninsig * sizeof(*powerout));
#endif

    for (i = ninsig, pp = powerout, gp = x->x_insig; i--; gp++)
    {
        for (j = 0, h = gp->g_hist; j < nfilters; j++, h++, pp++)
        {
            float power = h->h_outpower;
            float intensity = *pp =
                (power > 0 ? 100. * qrsqrt(qrsqrt(power)) : 0);
            vel += intensity;
            temperature += intensity * (float)j;
            // post("power %.12f intensity %.6f", power, intensity);
        }
    }

    if (vel > 0) temperature /= vel;
    else temperature = 0;
    vel *= 0.5 / ninsig;    /* fudge factor */
    if (x->x_hit)
    {
        /* if hit nonzero it's a clock callback. if in "learn" mode update
           the template list; in any event match the hit to known
           templates. */
        if (vel < x->x_debouncevel)
        {
            if (x->x_debug)
                post("bounce cancelled: vel %f debounce %f",
                    vel, x->x_debouncevel);
            return;
        }
        if (vel < x->x_minvel)
        {
            if (x->x_debug)
                post("low velocity cancelled: vel %f, minvel %f",
                    vel, x->x_minvel);
            return;
        }
        x->x_debouncevel = vel;
        if (x->x_learn)
        {
            double lasttime = x->x_learndebounce;
            double msec = clock_gettimesince(lasttime);
            if ((!ntemplate) || (msec > 200))
            {
                int countup = x->x_learncount;
                /* normalize to 100 */
                float norm;
                for (i = nfilters * ninsig, norm = 0, pp = powerout; i--; pp++)
                    norm += *pp * *pp;
                if (norm < 1.0e-15) norm = 1.0e-15;
                norm = 100.f * qrsqrt(norm);
                /* check if this is the first strike for a new template */
                if (!countup)
                {
                    int oldn = ntemplate;
                    x->x_ntemplate = ntemplate = oldn + ninsig;
                    x->x_template = (t_template *)t_resizebytes(x->x_template,
                        oldn * sizeof(x->x_template[0]),
                        ntemplate * sizeof(x->x_template[0]));
                    for (i = ninsig, pp = powerout; i--; oldn++)
                        for (j = nfilters, fp = x->x_template[oldn].t_amp;
                            j--; pp++, fp++)
                                *fp = *pp * norm;
                }
                else
                {
                    int oldn = ntemplate - ninsig;
                    if (oldn < 0) post("bonk_tick bug");
                    for (i = ninsig, pp = powerout; i--; oldn++)
                    {
                        for (j = nfilters, fp = x->x_template[oldn].t_amp;
                            j--; pp++, fp++)
                                *fp = (countup * *fp + *pp * norm)
                                    / (countup + 1.0f);
                    }
                }
                countup++;

                if (countup == x->x_learn) countup = 0;
                x->x_learncount = countup;
            }
            else return;
        }
        x->x_learndebounce = clock_getsystime();
        if (ntemplate)
        {
            float bestfit = -1e30;
            int templatecount;
            nfit = -1;
            for (i = 0, templatecount = 0, tp = x->x_template;
                templatecount < ntemplate; i++)
            {
                float dotprod = 0;
                for (k = 0, pp = powerout;
                    k < ninsig && templatecount < ntemplate;
                    k++, tp++, templatecount++)
                {
                    for (j = nfilters, fp = tp->t_amp; j--; fp++, pp++)
                    {
                        if (*fp < 0 || *pp < 0) post("bonk_tick bug 2");
                        dotprod += *fp * *pp;
                    }
                }
                if (dotprod > bestfit)
                {
                    bestfit = dotprod;
                    nfit = i;
                }
            }
            if (nfit < 0) post("bonk_tick bug");
        }
        else nfit = 0;
    }
    else nfit = -1;    /* hit is zero; this is the "bang" method. */

    x->x_attacked = 1;
    if (x->x_debug)
        post("bonk out: number %d, vel %f, temperature %f",
            nfit, vel, temperature);
    SETFLOAT(at2, nfit);
    SETFLOAT(at2+1, vel);
    SETFLOAT(at2+2, temperature);
    outlet_list(x->x_cookedout, 0, 3, at2);

    for (n = 0, gp = x->x_insig + (ninsig-1),
        pp = powerout + nfilters * (ninsig-1);
        n < ninsig; n++, gp--, pp -= nfilters)
    {
        float *pp2;
        for (i = 0, ap = at, pp2 = pp; i < nfilters; i++, ap++, pp2++)
        {
            ap->a_type = A_FLOAT;
            ap->a_w.w_float = *pp2;
        }
        outlet_list(gp->g_outlet, 0, nfilters, at);
    }
}
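The two computations at the heart of bonk_tick, the velocity/temperature summary and the template match, can be sketched outside Max. This is a minimal sketch under stated assumptions: both helpers are hypothetical (they do not appear in bonk~), a single input channel is assumed, and the per-filter intensities are taken as given rather than derived from power with `qrsqrt`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: from the per-filter intensities of one input
   channel, compute the velocity (half the summed intensity, i.e. the
   listing's fudge factor with ninsig = 1) and the "temperature"
   (intensity-weighted mean filter index). */
static void vel_and_temperature(const float *intensity, int nfilters,
                                float *vel, float *temperature)
{
    float sum = 0, weighted = 0;
    int j;

    assert(intensity != NULL && vel != NULL && temperature != NULL);
    for (j = 0; j < nfilters; j++)
    {
        sum += intensity[j];
        weighted += intensity[j] * (float)j;
    }
    *temperature = (sum > 0 ? weighted / sum : 0);
    *vel = 0.5f * sum;
}

/* Hypothetical helper: return the index of the template (rows of
   `nfilters` floats stored back to back) whose dot product with
   `spectrum` is largest, or -1 when there are no templates. */
static int best_template(const float *templates, int ntemplate,
                         int nfilters, const float *spectrum)
{
    float bestfit = -1e30f;
    int nfit = -1, i, j;

    assert(spectrum != NULL && nfilters > 0);
    for (i = 0; i < ntemplate; i++)
    {
        float dotprod = 0;
        for (j = 0; j < nfilters; j++)
            dotprod += templates[i * nfilters + j] * spectrum[j];
        if (dotprod > bestfit)
        {
            bestfit = dotprod;
            nfit = i;
        }
    }
    return nfit;
}
```

Because the stored templates are normalized to a fixed length during learning, the largest dot product picks the template whose spectral shape is closest to that of the incoming hit.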

// report the attack
static void bonk_doit(t_bonk *x)
{
    int i, j, ch, n;
    t_filterkernel *k;
    t_hist *h;
    float growth = 0, *fp1, *fp3, *fp4, hithresh, lothresh;
    int ninsig = x->x_ninsig, nfilters = x->x_nfilters,
        maskphase = x->x_maskphase, nextphase, oldmaskphase;
    t_insig *gp;
    nextphase = maskphase + 1;
    if (nextphase >= MASKHIST)
        nextphase = 0;
    x->x_maskphase = nextphase;
    oldmaskphase = nextphase - x->x_attackbins;
    if (oldmaskphase < 0)
        oldmaskphase += MASKHIST;
    if (x->x_useloudness)
        hithresh = qrsqrt(qrsqrt(x->x_hithresh)),
            lothresh = qrsqrt(qrsqrt(x->x_lothresh));
    else hithresh = x->x_hithresh, lothresh = x->x_lothresh;
    for (ch = 0, gp = x->x_insig; ch < ninsig; ch++, gp++)
    {
        for (i = 0, k = x->x_filterbank->b_vec, h = gp->g_hist;
            i < nfilters; i++, k++, h++)
        {
            float power = 0, maskpow = h->h_mask[maskphase];
            float *inbuf = gp->g_inbuf + k->k_skippoints;
            int countup = h->h_countup;
            int filterpoints = k->k_filterpoints;
            /* if the user asked for more filters than fit under the
               Nyquist frequency, some filters won't actually be filled in
               so we skip running them. */
            if (!filterpoints)
            {
                h->h_countup = 0;
                h->h_mask[nextphase] = 0;
                h->h_power = 0;
                continue;
            }
            // for each filter:
            /* run the filter repeatedly, sliding it forward by hoppoints,
               for nhop times */
            for (fp1 = inbuf, n = 0; n < k->k_nhops;
                fp1 += k->k_hoppoints, n++)
            {
                float rsum = 0, isum = 0;
                for (fp3 = fp1, fp4 = k->k_stuff, j = filterpoints; j--;)
                {
                    // calculating the power for each filter
                    // g = the input buffer
                    // fp4 = fp[0] and fp[1] (real and imaginary kernel taps)
                    float g = *fp3++;
                    rsum += g * *fp4++;
                    isum += g * *fp4++;
                }
                power += rsum * rsum + isum * isum;
                // post("power %.12f", power);
                // check whether the power values posted to the Max window
                // can be decimated
            }

            if (!x->x_willattack) h->h_before = maskpow;

            if (power > h->h_mask[oldmaskphase])
            {
                if (x->x_useloudness)
                    growth += qrsqrt(qrsqrt(
                        power/(h->h_mask[oldmaskphase] + 1.0e-15))) - 1.f;
                else growth += power/(h->h_mask[oldmaskphase] + 1.0e-15) - 1.f;
                // post("power %.12f h->h_mask[oldmaskphase] %.12f growth %.12f",
                //     power, h->h_mask[oldmaskphase], growth);
            }
            if (!x->x_willattack && countup >= x->x_masktime)
                maskpow *= x->x_maskdecay;

            if (power > maskpow)
            {
                maskpow = power;
                countup = 0;
            }
            countup++;
            h->h_countup = countup;
            h->h_mask[nextphase] = maskpow;
            h->h_power = power;
        }
    }
    if (x->x_willattack)    // an attack has been detected
    {
        // however we won't actually report the attack until the spectrum
        // stops growing: growth must decrease below lothresh.
        if (x->x_willattack > MAXATTACKWAIT || growth < x->x_lothresh)
        {
            /* if haven't yet, and if not in spew mode, report a hit */
            if (!x->x_spew && !x->x_attacked)
            {
                for (ch = 0, gp = x->x_insig; ch < ninsig; ch++, gp++)
                    for (i = nfilters, h = gp->g_hist; i--; h++)
                        h->h_outpower = h->h_mask[nextphase];
                x->x_hit = 1;
                // clock_delay(clock to schedule, n (ms)) sets a clock to go
                // off n milliseconds from the current logical time;
                // here it schedules the execution of the clock at once
                clock_delay(x->x_clock, 0);
            }
        }
        if (growth < x->x_lothresh)
            x->x_willattack = 0;
        else x->x_willattack++;
    }
    else if (growth > x->x_hithresh)
    {
        if (x->x_debug) post("attack: growth = %f", growth);
        x->x_willattack = 1;
        x->x_attacked = 0;
        for (ch = 0, gp = x->x_insig; ch < ninsig; ch++, gp++)
            for (i = nfilters, h = gp->g_hist; i--; h++)
                h->h_mask[nextphase] = h->h_power, h->h_countup = 0;
    }

    // spew mode always outputs data for every performed analysis
    /* if in "spew" mode just always output */
    if (x->x_spew)
    {
        for (ch = 0, gp = x->x_insig; ch < ninsig; ch++, gp++)
            for (i = nfilters, h = gp->g_hist; i--; h++)
                h->h_outpower = h->h_power;
        x->x_hit = 0;
        clock_delay(x->x_clock, 0);
    }
    x->x_debouncevel *= x->x_debouncedecay;
}
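The detection logic of bonk_doit reduces to a growth measure followed by a two-threshold decision. The following is a minimal sketch, assuming the linear-power branch (no loudness scaling); `spectral_growth`, `attack_state` and `attack_step` are hypothetical names, and the `attacked` flag is set at report time here, whereas the listing sets it later in bonk_tick.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: summed relative growth of each band's power
   over its older mask value; bands that did not grow contribute 0. */
static float spectral_growth(const float *power, const float *oldmask,
                             int nfilters)
{
    float growth = 0;
    int i;

    assert(power != NULL && oldmask != NULL);
    for (i = 0; i < nfilters; i++)
        if (power[i] > oldmask[i])
            growth += power[i] / (oldmask[i] + 1.0e-15f) - 1.f;
    return growth;
}

/* Hypothetical detector state: willattack counts the frames elapsed
   since growth first exceeded the high threshold. */
typedef struct
{
    int willattack;
    int attacked;
} attack_state;

/* Advance the detector by one analysis frame; returns 1 on the frame
   where the attack should be reported, 0 otherwise. */
static int attack_step(attack_state *s, float growth, float hithresh,
                       float lothresh, int maxwait)
{
    int report = 0;

    assert(s != NULL);
    if (s->willattack)
    {
        /* report once growth has fallen back, or after waiting too long */
        if ((s->willattack > maxwait || growth < lothresh) && !s->attacked)
        {
            report = 1;
            s->attacked = 1;
        }
        if (growth < lothresh) s->willattack = 0;
        else s->willattack++;
    }
    else if (growth > hithresh)
    {
        s->willattack = 1;
        s->attacked = 0;
    }
    return report;
}
```

The gap between the two thresholds is what delays the report until the spectrum has stopped growing, so the reported spectrum is taken near the peak of the attack rather than at its onset.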

// 4//PERFORM ROUTINE
// It receives a pointer to a piece of the DSP chain and it is expected to
// return the location of the next perform routine on the chain.
// The next location is determined by the number of arguments specified for
// the perform routine with the call to dsp_add().
// For example, if we pass three arguments, we need to return w + 4.
static t_int *bonk_perform(t_int *w)
{
    t_bonk *x = (t_bonk *)(w[1]);
    int n = (int)(w[2]);    // vector size
    int onset = 0;
    if (x->x_countdown >= n)
        x->x_countdown -= n;
    else
    {
        int i, j, ninsig = x->x_ninsig;
        t_insig *gp;
        if (x->x_countdown > 0)
        {
            n -= x->x_countdown;
            onset += x->x_countdown;
            x->x_countdown = 0;
        }
        while (n > 0)
        {
            int infill = x->x_infill;
            int m = (n < (x->x_npoints - infill) ?
                n : (x->x_npoints - infill));
            for (i = 0, gp = x->x_insig; i < ninsig; i++, gp++)
            {
                float *fp = gp->g_inbuf + infill;
                t_float *in1 = gp->g_invec + onset;
                for (j = 0; j < m; j++)
                    *fp++ = *in1++;
            }
            infill += m;
            x->x_infill = infill;
            // when the input is filled with npoints samples, bonk_doit!
            if (infill == x->x_npoints)
            {
                bonk_doit(x);

                /* shift or clear the input buffer and update counters */
                if (x->x_period > x->x_npoints)
                    x->x_countdown = x->x_period - x->x_npoints;
                else x->x_countdown = 0;
                if (x->x_period < x->x_npoints)
                {
                    int overlap = x->x_npoints - x->x_period;
                    float *fp1, *fp2;
                    for (n = 0, gp = x->x_insig; n < ninsig; n++, gp++)
                        for (i = overlap, fp1 = gp->g_inbuf,
                            fp2 = fp1 + x->x_period; i--;)
                                *fp1++ = *fp2++;
                    x->x_infill = overlap;
                }
                else x->x_infill = 0;
            }
            n -= m;
            onset += m;
        }
    }
    return (w+3);
}
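The buffering scheme of the perform routine (fill a window of npoints samples, analyse it, then keep the last npoints - period samples for the next window) can be sketched independently of MSP. This is a minimal single-channel sketch under stated assumptions: the `framer` type and `framer_feed` are hypothetical, the window is capped at 64 samples, and a counter stands in for the call to bonk_doit.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical framer state for one input channel. */
typedef struct
{
    float buf[64];   /* analysis window (npoints <= 64 in this sketch) */
    int npoints;     /* window length in samples */
    int period;      /* hop between successive analyses */
    int infill;      /* number of samples currently held in buf */
    int countdown;   /* samples to skip before refilling (period > npoints) */
    int nanalyses;   /* completed windows; stands in for bonk_doit() */
} framer;

/* Feed n samples; every time the window fills up, count one analysis
   and shift (or clear) the buffer according to the hop size. */
static void framer_feed(framer *f, const float *in, int n)
{
    int onset = 0;

    assert(f != NULL && in != NULL);
    assert(f->npoints > 0 && f->npoints <= 64 && f->period > 0);
    while (n > 0)
    {
        int m;
        if (f->countdown > 0)
        {
            int skip = (f->countdown < n ? f->countdown : n);
            f->countdown -= skip;
            onset += skip;
            n -= skip;
            continue;
        }
        m = f->npoints - f->infill;
        if (m > n) m = n;
        memcpy(f->buf + f->infill, in + onset, m * sizeof(float));
        f->infill += m;
        onset += m;
        n -= m;
        if (f->infill == f->npoints)
        {
            f->nanalyses++;
            if (f->period < f->npoints)
            {
                /* keep the tail of the window for the next analysis */
                int overlap = f->npoints - f->period;
                memmove(f->buf, f->buf + f->period, overlap * sizeof(float));
                f->infill = overlap;
            }
            else
            {
                f->infill = 0;
                f->countdown = f->period - f->npoints;
            }
        }
    }
}
```

With npoints = 4 and period = 2, eight input samples produce three analyses regardless of how the samples are split into signal vectors, which is the property the perform routine relies on.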

// 3//DSP METHOD
// From MAX 5 API: The dsp method specifies the signal processing function
// your object defines along with its arguments.
// The object's dsp method will be called whenever the MSP signal compiler is
// building a sequence of operations (known as the DSP Chain) that will be
// performed on each set of audio samples.
// The operation sequence consists of pointers to functions (called perform
// routines) followed by arguments to those functions.
static void bonk_dsp(t_bonk *x, t_signal **sp)
{
    int i, n = sp[0]->s_n, ninsig = x->x_ninsig;
    t_insig *gp;

    x->x_sr = sp[0]->s_sr;

    for (i = 0, gp = x->x_insig; i < ninsig; i++, gp++)
        gp->g_invec = (*(sp++))->s_vec;

    // adds your object's perform method to the DSP call chain with
    // dsp_add(object's perform routine, #, ...) and specifies the arguments
    // it will be passed.
    // The perform routine is used for processing audio.
    // # = the number of arguments that will follow
    // ... = the arguments
    dsp_add(bonk_perform, 2, x, n);
}
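The rule stated in the comments (a perform routine registered with k arguments returns w + k + 1) is easier to see on a toy chain. The sketch below is not the Max/MSP implementation: `word_t`, `add_perform`, `scale_perform` and `run_chain` are hypothetical stand-ins, and storing function pointers in an integer array relies on the usual implementation-defined pointer/integer conversion.

```c
#include <assert.h>
#include <stdint.h>

typedef intptr_t word_t;                      /* stand-in for Max's t_int */
typedef word_t *(*perform_fn)(word_t *w);

/* Toy routine with two arguments: w[0] holds the routine itself,
   w[1] an accumulator pointer and w[2] an increment, so the next
   chain entry starts at w + 3. */
static word_t *add_perform(word_t *w)
{
    long *acc = (long *)w[1];
    *acc += (long)w[2];
    return w + 3;
}

/* Toy routine with three arguments (accumulator, numerator,
   denominator): the next chain entry starts at w + 4. */
static word_t *scale_perform(word_t *w)
{
    long *acc = (long *)w[1];
    assert(w[3] != 0);
    *acc = *acc * (long)w[2] / (long)w[3];
    return w + 4;
}

/* Walk a zero-terminated chain, letting each routine return the
   location of the next one. */
static void run_chain(word_t *chain)
{
    while (*chain)
        chain = ((perform_fn)*chain)(chain);
}
```

In this layout a routine that returned the wrong offset would make the walker interpret an argument as a function pointer, which is why the return value must match the argument count given at registration.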

static void bonk_thresh(t_bonk *x, t_floatarg f1, t_floatarg f2)
{
    if (f1 > f2)
        post("bonk: warning: low threshold greater than hi threshold");
    x->x_lothresh = (f1 <= 0 ? 0.0001 : f1);
    x->x_hithresh = (f2 <= 0 ? 0.0001 : f2);
}

static void bonk_print(t_bonk *x, t_floatarg f)
{
    int i;
    post("thresh %f %f", x->x_lothresh, x->x_hithresh);
    post("mask %d %f", x->x_masktime, x->x_maskdecay);
    post("attack-bins %d", x->x_attackbins);
    post("debounce %f", x->x_debouncedecay);
    post("minvel %f", x->x_minvel);
    post("spew %d", x->x_spew);
    post("useloudness %d", x->x_useloudness);

    post("number of templates %d", x->x_ntemplate);
    if (x->x_learn) post("learn mode");
    if (f != 0)
    {
        int j, ninsig = x->x_ninsig;
        t_insig *gp;
        for (j = 0, gp = x->x_insig; j < ninsig; j++, gp++)
        {
            t_hist *h;
            if (ninsig > 1) post("input %d:", j+1);
            for (i = x->x_nfilters, h = gp->g_hist; i--; h++)
                post("pow %f mask %f before %f count %d",
                    h->h_power, h->h_mask[x->x_maskphase],
                    h->h_before, h->h_countup);
        }
        post("filter details (frequencies are in units of %.2f-Hz. bins):",
            x->x_sr);
        for (j = 0; j < x->x_nfilters; j++)
            post("%2d cf %.2f bw %.2f nhops %d hop %d skip %d npoints %d",
                j,
                x->x_filterbank->b_vec[j].k_centerfreq,
                x->x_filterbank->b_vec[j].k_bandwidth,
                x->x_filterbank->b_vec[j].k_nhops,
                x->x_filterbank->b_vec[j].k_hoppoints,
                x->x_filterbank->b_vec[j].k_skippoints,
                x->x_filterbank->b_vec[j].k_filterpoints);
    }
    if (x->x_debug) post("debug mode");
}

static void bonk_forget(t_bonk *x)
{
    int ntemplate = x->x_ntemplate, newn = ntemplate - x->x_ninsig;
    if (newn < 0) newn = 0;
    x->x_template = (t_template *)t_resizebytes(x->x_template,
        x->x_ntemplate * sizeof(x->x_template[0]),
        newn * sizeof(x->x_template[0]));
    x->x_ntemplate = newn;
    x->x_learncount = 0;
}

static void bonk_bang(t_bonk *x)
{
    int i, ch;
    x->x_hit = 0;
    t_insig *gp;
    for (ch = 0, gp = x->x_insig; ch < x->x_ninsig; ch++, gp++)
    {
        t_hist *h;
        for (i = 0, h = gp->g_hist; i < x->x_nfilters; i++, h++)
            h->h_outpower = h->h_power;
    }
    bonk_tick(x);
}

static void bonk_read(t_bonk *x, t_symbol *s)
{
    FILE *fd = fopen(s->s_name, "r");
    float vec[MAXNFILTERS];
    int i, ntemplate = 0, remaining;
    float *fp, *fp2;
    if (!fd)
    {
        post("%s: open failed", s->s_name);
        return;
    }
    x->x_template = (t_template *)t_resizebytes(x->x_template,
        x->x_ntemplate * sizeof(t_template), 0);
    while (1)
    {
        for (i = x->x_nfilters, fp = vec; i--; fp++)
            if (fscanf(fd, "%f", fp) < 1) goto nomore;
        x->x_template = (t_template *)t_resizebytes(x->x_template,
            ntemplate * sizeof(t_template),
            (ntemplate + 1) * sizeof(t_template));
        for (i = x->x_nfilters, fp = vec,
            fp2 = x->x_template[ntemplate].t_amp; i--;)
                *fp2++ = *fp++;
        ntemplate++;
    }
nomore:
    if (remaining = (ntemplate % x->x_ninsig))
    {
        /* the two %d conversions need their arguments */
        post("bonk_read: %d templates not a multiple of %d; dropping extras",
            ntemplate, x->x_ninsig);
        x->x_template = (t_template *)t_resizebytes(x->x_template,
            ntemplate * sizeof(t_template),
            (ntemplate - remaining) * sizeof(t_template));
        ntemplate = ntemplate - remaining;
    }
    post("bonk: read %d templates\n", ntemplate);
    x->x_ntemplate = ntemplate;
    fclose(fd);
}

static void bonk_write(t_bonk *x, t_symbol *s)
{
    FILE *fd = fopen(s->s_name, "w");
    int i, ntemplate = x->x_ntemplate;
    t_template *tp = x->x_template;
    float *fp;
    if (!fd)
    {
        post("%s: couldn't create", s->s_name);
        return;
    }
    for (; ntemplate--; tp++)
    {
        for (i = x->x_nfilters, fp = tp->t_amp; i--; fp++)
            fprintf(fd, " %6.2f", *fp);
        fprintf(fd, "\n");
    }
    post("bonk: wrote %d templates\n", x->x_ntemplate);
    fclose(fd);
}

// free function
static void bonk_free(t_bonk *x)
{
    int i, ninsig = x->x_ninsig;
    t_insig *gp = x->x_insig;
#ifdef MSP
    dsp_free((t_pxobject *)x);
#endif
    for (i = 0, gp = x->x_insig; i < ninsig; i++, gp++)
        freebytes(gp->g_inbuf, x->x_npoints * sizeof(float));
    clock_free(x->x_clock);
    if (!--(x->x_filterbank->b_refcount))
        bonk_freefilterbank(x->x_filterbank);
}

// 1// INITIALIZATION ROUTINE
void main()
{
    t_class *c;
    t_object *attr;
    long attrflags = 0;
    t_symbol *sym_long = gensym("long"), *sym_float32 = gensym("float32");

    // CLASS DEFINITION
    // creates the new class with class_new(name, mnew, mfree, size (in bytes)
    // of the data structure, mmenu, type (most often A_GIMME, 0)).
    // mnew = the instance creation function
    // mfree = the instance free function
    // mmenu = the function called when the user creates a new object of the
    //     class from the Patch window's palette (UI objects only), 0L if
    //     you're not defining a UI object
    // type = a standard Max type. The final argument of the type list should
    //     be a 0. Generally, obex objects have a single type argument,
    //     A_GIMME, followed by a 0.
    c = class_new("bonk3~", (method)bonk_new, (method)bonk_free,
        sizeof(t_bonk), (method)0L, A_GIMME, 0);
    // calcoffset computes the byte offset of the obex field from the
    // beginning of the t_bonk structure; class_obexoffset_set registers
    // that offset with the class.
    class_obexoffset_set(c, calcoffset(t_bonk, obex));

    // NEW ATTRIBUTES
    // creates a (new) attribute with attr_offset_new(name, type, flags for
    // setting/querying the attribute, get method (NULL is the default
    // method), set method, byte offset).
    // adds the attribute to the class with class_addattr()
    attr = attr_offset_new("npoints", sym_long, attrflags,
        (method)0L, (method)0L, calcoffset(t_bonk, x_npoints));
    class_addattr(c, attr);
    attr = attr_offset_new("hop", sym_long, attrflags,
        (method)0L, (method)0L, calcoffset(t_bonk, x_period));
    class_addattr(c, attr);
    attr = attr_offset_new("nfilters", sym_long, attrflags,
        (method)0L, (method)0L, calcoffset(t_bonk, x_nfilters));
    class_addattr(c, attr);
    attr = attr_offset_new("halftones", sym_float32, attrflags,
        (method)0L, (method)0L, calcoffset(t_bonk, x_halftones));
    class_addattr(c, attr);
    attr = attr_offset_new("overlap", sym_float32, attrflags,
        (method)0L, (method)0L, calcoffset(t_bonk, x_overlap));
    class_addattr(c, attr);
    attr = attr_offset_new("firstbin", sym_float32, attrflags,
        (method)0L, (method)0L, calcoffset(t_bonk, x_firstbin));
    class_addattr(c, attr);

    attr = attr_offset_new("minvel", sym_float32, attrflags,
        (method)0L, (method)bonk_minvel_set, calcoffset(t_bonk, x_minvel));

    class_addattr(c, attr);
    attr = attr_offset_new("lothresh", sym_float32, attrflags,
        (method)0L, (method)bonk_lothresh_set, calcoffset(t_bonk, x_lothresh));
    class_addattr(c, attr);
    attr = attr_offset_new("hithresh", sym_float32, attrflags,
        (method)0L, (method)bonk_hithresh_set, calcoffset(t_bonk, x_hithresh));
    class_addattr(c, attr);
    attr = attr_offset_new("masktime", sym_long, attrflags,
        (method)0L, (method)bonk_masktime_set, calcoffset(t_bonk, x_masktime));
    class_addattr(c, attr);
    attr = attr_offset_new("maskdecay", sym_float32, attrflags,
        (method)0L, (method)bonk_maskdecay_set,
        calcoffset(t_bonk, x_maskdecay));
    class_addattr(c, attr);
    attr = attr_offset_new("debouncedecay", sym_float32, attrflags,
        (method)0L, (method)bonk_debouncedecay_set,
        calcoffset(t_bonk, x_debouncedecay));
    class_addattr(c, attr);
    attr = attr_offset_new("debug", sym_long, attrflags,
        (method)0L, (method)bonk_debug_set, calcoffset(t_bonk, x_debug));
    class_addattr(c, attr);
    attr = attr_offset_new("spew", sym_long, attrflags,
        (method)0L, (method)bonk_spew_set, calcoffset(t_bonk, x_spew));
    class_addattr(c, attr);
    attr = attr_offset_new("useloudness", sym_long, attrflags,
        (method)0L, (method)bonk_useloudness_set,
        calcoffset(t_bonk, x_useloudness));
    class_addattr(c, attr);
    attr = attr_offset_new("attackbins", sym_long, attrflags,
        (method)0L, (method)bonk_attackbins_set,
        calcoffset(t_bonk, x_attackbins));
    class_addattr(c, attr);

    // METHODS
    // adds a method to the class with
    // class_addmethod(class pointer, m, name, type, 0)
    // m = the function called when the method is invoked
    class_addmethod(c, (method)bonk_dsp, "dsp", A_CANT, 0);

    class_addmethod(c, (method)bonk_bang, "bang", A_CANT, 0);
    class_addmethod(c, (method)bonk_forget, "forget", 0);
    class_addmethod(c, (method)bonk_learn, "learn", A_LONG, 0);
    class_addmethod(c, (method)bonk_thresh, "thresh", A_FLOAT, A_FLOAT, 0);
    class_addmethod(c, (method)bonk_print, "print", A_DEFFLOAT, 0);
    class_addmethod(c, (method)bonk_read, "read", A_DEFSYM, 0);
    class_addmethod(c, (method)bonk_write, "write", A_DEFSYM, 0);
    class_addmethod(c, (method)bonk_assist, "assist", A_CANT, 0);

    // adds special obex methods
    class_addmethod(c, (method)object_obex_dumpout, "dumpout", A_CANT, 0);
    class_addmethod(c, (method)object_obex_quickref, "quickref", A_CANT, 0);

    // adds some standard method handlers for internal messages used by all
    // MSP objects with class_dspinit(class pointer)
    class_dspinit(c);

    // registers a previously defined object class with
    // class_register(name_space, class pointer). This function is required,
    // and should be called at the end of main().
    // name_space = the desired class's name space. Typically, #CLASS_BOX for
    // obex classes, or #CLASS_NOBOX for classes which will only be used
    // internally
    class_register(CLASS_BOX, c);
    bonk_class = c;

    post("\n");
    post("BonkOMM~ v1.0 - detects attacks in audio signals");
    post("Zengi revision");
    post("Original by Miller Puckette and Ted Appel, http://crca.ucsd.edu/~msp/");
    post("\n");
}

// 2//NEW INSTANCE ROUTINE
static void *bonk_new(t_symbol *s, long ac, t_atom *av)
{
    short j;
    t_bonk *x;

    //CREATE INSTANCE
    // creates instance of the object class by allocating memory with object_alloc(class pointer).
    // Its use is required with obex-class objects, inside the object's new instance routine.
    if (x = (t_bonk *)object_alloc(bonk_class)) {
        void *object_alloc(t_class *c);
        t_insig *g;

        x->x_npoints = DEFNPOINTS;
        x->x_period = DEFPERIOD;
        x->x_nfilters = DEFNFILTERS;
        x->x_halftones = DEFHALFTONES;
        x->x_firstbin = DEFFIRSTBIN;
        x->x_overlap = DEFOVERLAP;
        x->x_ninsig = 1;
        x->x_hithresh = DEFHITHRESH;
        x->x_lothresh = DEFLOTHRESH;
        x->x_masktime = DEFMASKTIME;
        x->x_maskdecay = DEFMASKDECAY;
        x->x_debouncedecay = DEFDEBOUNCEDECAY;
        x->x_minvel = DEFMINVEL;
        x->x_attackbins = DEFATTACKBINS;
        if (!x->x_period) x->x_period = x->x_npoints / 2;
        x->x_template = (t_template *)getbytes(0);
        x->x_ntemplate = 0;
        x->x_infill = 0;
        x->x_countdown = 0;
        x->x_willattack = 0;
        x->x_attacked = 0;
        x->x_maskphase = 0;
        x->x_debug = 0;
        x->x_learn = 0;
        x->x_learndebounce = clock_getsystime();
        x->x_learncount = 0;


        x->x_useloudness = 0;
        x->x_debouncevel = 0;
        x->x_sr = sys_getsr(); /* gets the sample rate */

        /* something useful for debug
        if (ac) {
            switch (av[0].a_type) {
                case A_LONG: x->x_ninsig = av[0].a_w.w_long; break;
            }
        }
        if (x->x_ninsig < 1) x->x_ninsig = 1;
        if (x->x_ninsig > MAXCHANNELS) x->x_ninsig = MAXCHANNELS;
        */


        // takes an atom list and properly sets any attributes described within. The simplest way to do this is to use the function attr_args_process(object whose attributes will be processed, ac, av)
        // ac=The count of t_atoms in av
        // av=An atom list
        // The function attr_args_process(x, ac, av) is typically used in the object's new instance routine to conveniently process attribute arguments.
        attr_args_process(x, ac, av);

        x->x_insig = (t_insig *)getbytes(x->x_ninsig * sizeof(*x->x_insig));

        //CREATE INLET
        // creates the signal inlets with dsp_setup((cast to t_pxobject)object pointer, nsignals), so you need not make them yourself!
        // nsignals=The number of signal/proxy inlets to create for the object. If the object has no signal inlets, you may pass 0.
        dsp_setup((t_pxobject *)x, x->x_ninsig);

        //CREATE OUTLETS
        // stores the dumpout outlet in the obex with the generic function object_obex_store(object pointer, key, val). The dumpout outlet is the one used by attributes to report data in response to 'get' queries.
        // key=A symbolic name for the data to be stored
        // val=A t_object *, to be stored in the obex, referenced under the key
        // The generic case is normally adapted to be used as follows: object_obex_store(x, _sym_dumpout, outlet_new(x, NULL));
        // creates new outlets with outlet_new(object, s).
        // s=A C-string specifying the message that will be sent out this outlet, or NULL to indicate the outlet will be used to send various messages.
        object_obex_store(x, gensym("dumpout"), outlet_new(x, NULL));

        // creates an outlet that will ALWAYS send the list message with listout((t_object *)object).
        x->x_cookedout = listout((t_object *)x);

        for (j = 0, g = x->x_insig + x->x_ninsig - 1; j < x->x_ninsig; j++, g--) {
            g->g_outlet = listout((t_object *)x);
        }

        //CLOCK
        // creates a new Clock object with clock_new(object pointer, (method)fn). This function is normally called in the new instance routine.
        // fn=Function to be called when the clock goes off, this function must be called object_tick.
        // Clock object is used as interface to the Max scheduler.


        x->x_clock = clock_new(x, (method)bonk_tick);

        bonk_donew(x, x->x_npoints, x->x_period, x->x_ninsig, x->x_nfilters,
                   x->x_halftones, x->x_overlap, x->x_firstbin, sys_getsr());
    }
    return (x);
}

/* Attribute setters. */
void bonk_minvel_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        float f = atom_getfloat(av);
        if (f < 0) f = 0;
        x->x_minvel = f;
    }
}

void bonk_lothresh_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        float f = atom_getfloat(av);
        if (f > x->x_hithresh)
            post("bonk: warning: low threshold greater than hi threshold");
        x->x_lothresh = (f <= 0 ? 0.0001 : f);
    }
}

void bonk_hithresh_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        float f = atom_getfloat(av);
        if (f < x->x_lothresh)
            post("bonk: warning: low threshold greater than hi threshold");
        x->x_hithresh = (f <= 0 ? 0.0001 : f);
    }
}

void bonk_masktime_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        int n = atom_getlong(av);
        x->x_masktime = (n < 0) ? 0 : n;
    }
}

void bonk_maskdecay_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        float f = atom_getfloat(av);
        f = (f < 0) ? 0 : f;
        f = (f > 1) ? 1 : f;
        x->x_maskdecay = f;
    }
}

void bonk_debouncedecay_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        float f = atom_getfloat(av);
        f = (f < 0) ? 0 : f;
        f = (f > 1) ? 1 : f;
        x->x_debouncedecay = f;
    }
}

void bonk_debug_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        int n = atom_getlong(av);
        x->x_debug = (n != 0);
    }
}

void bonk_spew_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        int n = atom_getlong(av);
        x->x_spew = (n != 0);
    }
}

void bonk_useloudness_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        int n = atom_getlong(av);
        x->x_useloudness = (n != 0);
    }
}

void bonk_attackbins_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        int n = atom_getlong(av);
        n = (n < 1) ? 1 : n;
        n = (n > MASKHIST) ? MASKHIST : n;
        x->x_attackbins = n;
    }
}
/* end attr setters */

void bonk_assist(t_bonk *x, void *b, long m, long a, char *s)
{
    if (m == ASSIST_INLET)
        strcpy(s, "(Signal) Audio Input, Analysis Attributes");
    else if (m == ASSIST_OUTLET) {
        switch (a) {
            case 0: strcpy(s, "(List) Raw Filter Amplitudes"); break;
            case 1: strcpy(s, "(List) Instrument Number, Loudness, Temperature"); break;
            case 2: strcpy(s, "Dump"); break;
        }
    }
}

static void bonk_learn(t_bonk *x, int n)
{
    if (n < 0) n = 0;
    if (n)
    {
        x->x_template = (t_template *)t_resizebytes(x->x_template,
            x->x_ntemplate * sizeof(x->x_template[0]), 0);
        x->x_ntemplate = 0;
    }
    x->x_learn = n;
    x->x_learncount = 0;
}

/* get current system time */
double clock_getsystime()
{
    return gettime();
}

/* get the elapsed time since the given system time, in milliseconds */
double clock_gettimesince(double prevsystime)
{
    return ((gettime() - prevsystime));
}

float qrsqrt(float f)
{
    return 1/sqrt(f);
}
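The attribute setters in this listing share one pattern: check that an argument was supplied, read it, force it into its valid range, and store it. The range step can be tried on its own, outside Max. The following is a minimal standalone sketch of that step only; the helper names clamp_unit, clamp_bins and floor_threshold are hypothetical and do not appear in the bonk~ source, and SKETCH_MASKHIST is a placeholder value standing in for MASKHIST, whose real value is defined elsewhere in the listing.

```c
#include <assert.h>

/* Placeholder for MASKHIST; the value 8 is an assumption made for this sketch. */
#define SKETCH_MASKHIST 8

/* Range step used for decay-style attributes: force f into [0, 1]. */
float clamp_unit(float f)
{
    f = (f < 0) ? 0 : f;
    f = (f > 1) ? 1 : f;
    return f;
}

/* Range step used for the attack-bins attribute: force n into [1, SKETCH_MASKHIST]. */
int clamp_bins(int n)
{
    assert(SKETCH_MASKHIST >= 1);
    n = (n < 1) ? 1 : n;
    n = (n > SKETCH_MASKHIST) ? SKETCH_MASKHIST : n;
    return n;
}

/* Range step used for thresholds: non-positive values become a small positive floor. */
float floor_threshold(float f)
{
    return (f <= 0) ? 0.0001f : f;
}
```

Keeping the range logic in small pure functions like these would make it testable without a running patcher; the listing instead inlines the same expressions in each setter.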


Figure B.1: Max patcher window showing our test patch realized to analyze the ·O M M· sounds with bonk∼ 3.0


Appendix C Writing Externals for Max/MSP with Xcode
Max 5 and Xcode 3.x

This appendix introduces the Max/MSP environment and shows how it can be extended by creating external objects. The tutorial by Zicarelli [62], among others, has been taken into account. Since no existing document was found on how to write externals for Max 5, the three tutorials consulted have been adapted.

Writing an external object for Max means writing a shared library in C that is loaded and called by the “master environment” and in turn calls helpful routines back in the master environment. You create a class, or template, for the behavior of an object. Instances of this class “do the work” of the object when they are sent messages. Your external object definition will:
• Define the class: its data structure, size, and how instances are to be created and destroyed
• Define functions (called methods) that will respond to various messages, performing some action

The term externals covers all the external objects of Max/MSP, i.e. those not included in the software distribution. An external can therefore be any object you create yourself (in a suitable programming language) or one developed by somebody else. In Chapters 4 and 5, bonk∼, an external object originally developed by Miller Puckette, is extensively analyzed and modified for the purpose of onset detection. An external is written to add one or more specific tasks to the logic and arithmetic units or to the DSP chain of the software.

First, we downloaded the Max 5 Software Development Kit (SDK) from cycling74.com, which includes the frameworks, the API reference and some examples. The frameworks contain the header files where the Max/MSP standard functions and structs are declared. The various parts of the frameworks are described in the API reference. So, when creating a new external or modifying an existing one, you must follow the Max/MSP API reference. Since most objects are written in C, we now describe the process of developing objects in C, although externals can also be created in other programming languages. We used Xcode (version 3.1.2), at the time of writing the latest version of the native IDE of Apple Mac OS X, to develop the external. Let’s look at an Xcode example to understand the most significant contents of a project:

Figure C.1: Xcode main window

The Source folder includes the source code you develop, typically a single file named yourobject.c. The External Frameworks and Libraries folder is where the MaxAPI and MaxAudioAPI (MSP) frameworks are added. The Product folder contains the external created by compiling the source code, while the Target entries hold the compiler options. Each object created appears as a single file with the .mxo extension, but it only looks like a file, because it hides further contents inside. This is what Mac OS calls a "bundle", or simply a package. A bundle contains a set of files and folders, like the ones shown in this view:

Figure C.2: A bundle

You can create either Max or MSP externals, depending on your requirements, with some differences in structure: essentially, MSP externals are those that involve audio DSP, such as the one used in this thesis, while Max externals are logic and arithmetic objects. To use the external in the Max patcher window, you have to add the .mxo package, produced by building the source code, to the msp-external (or max-external) folder in the application folder. You can do this automatically by telling Xcode where to build your object in the build target. To better understand what targets are, you can think of them as the compiler options. Most of the options are predefined when compiling an external for Max/MSP with Xcode. To configure the target manually, you can write a file with the .xcconfig extension and add it to the project; you can then use it in the target field in Xcode.

The source code of a Max external has three basic components:
1. the entry point, as the main() function
2. the description of the object, as the structs
3. the definition of the functionality (the behaviour), as the methods

Some methods and elements of the structs are required by Max and are explained in the Max API reference. The development of the source code can be summarized in five points:
1. including the right header files (usually ext.h and ext_obex.h for MSP objects)
2. declaring a C structure for your object
3. writing an initialization routine called main that defines the class
4. writing a new instance routine that creates a new instance of the class, when someone makes one or types its name into an object box
5. writing methods (or message handlers) that implement the behavior of the object.
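The five points above can be sketched without the Max SDK, to show only the shape of an external: a struct, a class that maps message names to methods, a new instance routine, and message handlers. This is a minimal standalone illustration, not Max code. Every name in it (t_sketch, sketch_addmethod, sketch_class_init, sketch_new, sketch_send) is hypothetical and merely plays the role that the SDK's own types and functions play in a real external; point 1 is represented here by standard C headers instead of ext.h and ext_obex.h.

```c
/* Point 1: include the headers (standard C headers stand in for the SDK ones). */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_METHODS 8

/* Point 2: the C structure describing one instance of the object. */
typedef struct sketch_object {
    long count;                 /* internal state changed by messages */
} t_sketch;

typedef void (*sketch_method)(t_sketch *x, long arg);

/* The "class": a table that binds message names to methods. */
typedef struct {
    const char   *names[SKETCH_MAX_METHODS];
    sketch_method methods[SKETCH_MAX_METHODS];
    int           nmethods;
} t_sketch_class;

t_sketch_class sketch_class;

/* Point 5: methods (message handlers) implementing the behavior. */
void sketch_bang(t_sketch *x, long arg) { (void)arg; x->count += 1; }
void sketch_set(t_sketch *x, long arg)  { x->count = arg; }

/* Registers one method under a message name (hypothetical helper). */
void sketch_addmethod(t_sketch_class *c, sketch_method m, const char *name)
{
    assert(c->nmethods < SKETCH_MAX_METHODS);
    c->names[c->nmethods] = name;
    c->methods[c->nmethods] = m;
    c->nmethods++;
}

/* Point 3: the initialization routine that defines the class
   (the role main() plays in a real external). */
void sketch_class_init(void)
{
    sketch_class.nmethods = 0;
    sketch_addmethod(&sketch_class, sketch_bang, "bang");
    sketch_addmethod(&sketch_class, sketch_set, "set");
}

/* Point 4: the new instance routine; returns NULL if allocation fails. */
t_sketch *sketch_new(void)
{
    t_sketch *x = malloc(sizeof *x);
    if (x)
        x->count = 0;
    return x;
}

/* Sends a message to an instance; returns 1 if a method handled it, 0 otherwise. */
int sketch_send(t_sketch *x, const char *msg, long arg)
{
    for (int i = 0; i < sketch_class.nmethods; i++) {
        if (strcmp(sketch_class.names[i], msg) == 0) {
            sketch_class.methods[i](x, arg);
            return 1;
        }
    }
    return 0;
}
```

A caller would run sketch_class_init() once, create an instance with sketch_new(), and then drive it with calls such as sketch_send(x, "bang", 0). In a real external the environment performs the lookup and dispatch itself; the sketch spells it out only to make the class/instance/message relationship visible.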


Bibliography
[1] Cycling ’74. Max 5 API Reference, 2009.
[2] Samer Abdallah and Mark Plumbley. Unsupervised onset detection: A probabilistic approach using ICA and a hidden Markov classifier, 2003.
[3] James Beauchamp. Analysis, Synthesis, and Perception of Musical Sounds: The Sound of Music (Modern Acoustics and Signal Processing). Springer, December 2006.
[4] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler. A tutorial on onset detection in music signals. Speech and Audio Processing, IEEE Transactions on, 13(5):1035–1047, 2005.
[5] J. P. Bello, C. Duxbury, M. Davies, and M. Sandler. On the use of phase and energy for musical onset detection in the complex domain. Signal Processing Letters, IEEE, 11(6):553–556, 2004.
[6] Juan Pablo Bello and Mark Sandler. Phase-based note onset detection for music signals. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’03), pages 441–444. IEEE Computer Society, 2003.
[7] Luciano Berio. Intervista sulla musica. Laterza, 1981.
[8] J. Bilmes. Timing is of the essence: Perceptual and computational techniques for representing, learning, and reproducing expressive timing in percussive rhythm. Master’s thesis, MIT, Cambridge, MA, 1993.
[9] Paul Brossier. Automatic Annotation of Musical Audio for Interactive Applications. PhD thesis, Queen Mary University of London, UK, August 2006.
[10] Judith C. Brown. Calculation of a constant Q spectral transform. J. Acoust. Soc. Am., 1991.
[11] Judith C. Brown and Miller S. Puckette. An efficient algorithm for the calculation of a constant Q transform. J. Acoust. Soc. Am., 1992.
[12] Ryan J. Cassidy and J. O. Smith III. Auditory filter bank lab.
[13] Nick Collins. A comparison of sound onset detection algorithms with emphasis on psychoacoustically motivated detection functions. In AES Convention 118, pages 28–31, 2005.
[14] P. Cosi, G. De Poli, and G. Lauzzana. Auditory modelling and self-organizing neural networks for timbre classification. Journal of New Music Research, 23(1):71–98, March 1994.
[15] Roger B. Dannenberg. Nyquist Reference Manual. Carnegie Mellon University School of Computer Science, Pittsburgh, PA 15213, U.S.A., 2007.
[16] Alain de Cheveigné. Pitch perception models - a historical review. Technical report, CNRS - Ircam, Paris, 2004.
[17] Filipe Diniz, Iuri Kothe, Sergio L. Netto, and Luiz W. P. Biscainho. High-selectivity filter banks for spectral analysis of music signals. EURASIP Journal on Advances in Signal Processing, 2007.
[18] C. Dodge and T. Jerse. Computer music: synthesis, composition and performance. Thomson Learning, 1985.
[19] Carlo Drioli and Nicola Orio. Elementi di acustica e psicoacustica, 1999.
[20] C. Duxbury, M. Sandler, and M. Davis. A hybrid approach to musical note onset detection. In Proc. Digital Audio Effects Workshop (DAFx), 2002.
[21] Chris Duxbury, Juan Pablo Bello, Mike Davies, Mark Sandler, and Mark S. Complex domain onset detection for musical signals. In Proc. Digital Audio Effects Workshop (DAFx), 2003.
[22] Chris Duxbury, Juan Pablo Bello, Mark Sandler, and Mike Davies. A comparison between fixed and multiresolution analysis for onset detection in musical signals. In Proc. Digital Audio Effects Workshop (DAFx), 2004.
[23] Ichiro Fujinaga. Max/MSP Externals Tutorial, 2005.
[24] Toby Gifford and Andrew R. Brown. Listening for noise: An approach to percussive onset detection. In The Australasian Computer Music Conference, 2008.
[25] M. Gimenes, E. R. Miranda, and C. Johnson. A memetic approach to the evolution of rhythms in a society of software agents. In Proceedings of the 10th Brazilian Symposium of Musical Computation (SBCM), Belo Horizonte (Brazil), 2005.
[26] John William Gordon. Perception of Attack Transients in Musical Tones. PhD thesis, CCRMA, Department of Music, Stanford University, 1984.
[27] Paul Gurnig. An Introduction to Writing Externs in C for Max/MSP. University of Chicago, 2005.
[28] Kurt Jacobson. A metric for music similarity derived from psychoacoustic features in digital music signals. PhD thesis, University of Miami, 2006.
[29] K. L. Kashima and B. Mont-Reynaud. The bounded-Q approach to time-varying spectral analysis. Tech. Rep. STAN M-28, Stanford University, Department of Music, 1985.
[30] A. Klapuri. Sound onset detection by applying psychoacoustic knowledge. In ICASSP ’99: Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 3089–3092, Washington, DC, USA, 1999. IEEE Computer Society.
[31] Alexandre Lacoste and Douglas Eck. A supervised classification algorithm for note onset detection. EURASIP J. Appl. Signal Process., 2007(1):153, January 2007.

[32] Kai Lassfolk and Jaska Uimonen. Spectutils, an audio signal analysis and visualization toolkit for GNU Octave. In 11th Int. Conference on Digital Audio Effects (DAFx-08), 2008.
[33] Paul Masri. Computer Modeling of Sound for Transformation and Synthesis of Musical Signals. PhD thesis, University of Bristol, UK, 1996.
[34] James McCartney. Rethinking the computer music language: SuperCollider. In Rethinking the Computer Music Language: SuperCollider, volume 26, pages 61–68, Cambridge, MA, USA, 2002. MIT Press.
[35] Jon McCormack. A developmental model for generative media. In Advances in Artificial Life, pages 88–97. 2005.
[36] E. R. Miranda. Computer Sound Design: Synthesis Techniques and Programming. Focal Press, 2002.
[37] Eduardo R. Miranda. Artificial phonology: Disembodied humanoid voice for composing music with surreal languages. Leonardo Music Journal, 15(1):8–16, 2005.
[38] M. S. Puckette, T. Apel, and David Zicarelli. Real-time audio analysis tools for Pd and MSP. In Proceedings of the ICMC, 1998.
[39] Miller Puckette. Is there life after MIDI? ICMA, 1994.
[40] Miller Puckette. Max at seventeen. Comput. Music J., 26(4):31–43, 2002.
[41] Miller Puckette. The Theory and Technique of Electronic Music. World Scientific Publishing Co. Pte. Ltd., 2007.
[42] Miller S. Puckette. Pure Data: recent progress. In Pure Data: recent progress, 1997.
[43] Arunan Ramalingam and Sridhar Krishnan. Gaussian mixture modeling of short-time Fourier transform features for audio fingerprinting. IEEE Transactions on Information Forensics and Security, 1(4):457–463, December 2006.
[44] Curtis Roads. The Computer Music Tutorial. The MIT Press, February 1996.
[45] Curtis Roads. Microsound. The MIT Press, 2004.
[46] D. Rocchesso and F. Fontana. The Sounding Object. Mondo Estremo, 2003.
[47] Davide Rocchesso. Introduction to Sound Processing. GNU Free Documentation License, 2003.
[48] Davide Rocchesso. Programmazione visuale, versione 1.3, 2007.
[49] Davide Rocchesso. Sound to sound, sense to sense, 2008.
[50] X. Rodet and F. Jaillet. Detection and modeling of fast attack transients. In International Computer Music Conference (ICMC), pages 30–33, 2001.
[51] E. D. Scheirer. Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America, 103(1):588–601, 1998.
[52] X. Serra. Musical Sound Modeling with Sinusoids plus Noise, pages 91–122. Swets and Zeitlinger, 1997.
[53] Xavier Serra. Parshl: An analysis/synthesis program for non-harmonic sounds based on a sinusoidal representation, 1985.

[54] Xavier Serra. A System for Sound Analysis/Transformation/Synthesis Based on a Deterministic Plus Stochastic Decomposition. PhD thesis, Stanford University, 1989.
[55] Barry Vercoe. The Csound Book: Perspectives in Software Synthesis, Sound Design, Signal Processing, and Programming. The MIT Press, March 2000.
[56] Tony S. Verma and Teresa H. Y. Meng. Extending spectral modeling synthesis with transient modeling synthesis. Comput. Music J., 24(2):47–59, 2000.
[57] Gil Weinberg and Scott Driscoll. Robot-human interaction with an anthropomorphic percussionist. In CHI ’06: Proceedings of the SIGCHI conference on Human Factors in computing systems, pages 1229–1232, New York, NY, USA, 2006. ACM.
[58] Gil Weinberg and Scott Driscoll. Toward robotic musicianship. Comput. Music J., 30(4):28–45, 2006.
[59] Gil Weinberg and Scott Driscoll. The interactive robotic percussionist: new developments in form, mechanics, perception and interaction design. In HRI ’07: Proceedings of the ACM/IEEE international conference on Human-robot interaction, pages 97–104, New York, NY, USA, 2007. ACM.
[60] Gil Weinberg, Mark Godfrey, Alex Rae, and John Rhoads. A real-time genetic algorithm in human-robot musical improvisation. Computer Music Modeling and Retrieval: Sense of Sounds, pages 351–359, 2008.
[61] Stephen Wilson. Information Arts: Intersections of Art, Science, and Technology (Leonardo Books). The MIT Press, April 2003.
[62] David Zicarelli. Writing External Objects for Max 4.0 and MSP 2.0. Cycling ’74, 2001.
[63] Udo Zölzer, Xavier Amatriain, Daniel Arfib, Jordi Bonada, Giovanni De Poli, Pierre Dutilleux, Gianpaolo Evangelista, Florian Keiler, Alex Loscos, Davide Rocchesso, Mark Sandler, Xavier Serra, and Todor Todoroff. DAFX: Digital Audio Effects. John Wiley & Sons, May 2002.
