
DEVELOPING SYSTEMS FOR IMPROVISATION BASED ON LISTENING

Doug Van Nort, Pauline Oliveros, Jonas Braasch


School of Architecture and Department of the Arts
Rensselaer Polytechnic Institute, Troy, NY
vannod2@rpi.edu

ABSTRACT

In this paper we discuss our approach to designing improvising music systems whose intelligence is centered around careful listening, particularly to qualities related to timbre and texture. Our interest lies in systems that can make contextual decisions based on the overall character of the entire sound field, as well as on the specific shape and contour created by each player. We describe the paradigm of expanded instrument systems that we in turn build upon, endowing these with the ability to listen, recognize and transform performers' sound in a contextually relevant fashion. This musical context is defined by our free improvisation trio, which we briefly describe. The different modules of our current system are described, including a new tool for real-time sound analysis and a method for sonic texture recognition.

1. INTRODUCTION

When a group of musicians engages in free improvisation, there is no a priori set of rules that the players use to communicate their intent. Rather, the musical language that players develop is acquired throughout the course of a performance or rehearsal session. While particular harmonic, melodic and rhythmic structures from various styles might be presented, the intent behind this presentation is completely open and unknowable beforehand, meaning that such intent must be found through careful listening to these qualities but also, more fundamentally, to the nature of the sound itself. Therefore, understanding the intention behind a sound event arises from careful attention to the shape and trajectory of salient sound features, to the state of the overall sound field and to a memory of recent actions taken by fellow improvisers. This need to define style as a function of the immediate musical situation has prompted Derek Bailey to refer to free improvisation as "non-idiomatic" [2]. It has been our experience as improvising musicians that structuring the musical interaction by fundamental listening principles, rather than by musical stylistic rules, is an effective way to engage in the act of free improvisation. We therefore have focused our work in interactive system development on building an involved machine listening system that is coupled with musical decision making, and is designed with certain modes of listening and musical reaction in mind.

In particular, our work is strongly informed by the practice of Deep Listening [11], including the notions of focal and global modes of attention. The former is concerned with discovering a particular sonic event and following it as it shifts over time. The latter is concerned with listening to the entirety of the sound field: everything that one can perceive. The key to our use of these principles is that it is not one or the other, but rather the process of moving between these modes, and the dynamics of balancing the two, that are embedded in our work. Meanwhile, a second structuring principle arises from particulars of our musical aesthetic. Our interest is in a blending of acoustic and electronic sound such that the source and cause of an event can be obscured, where gestures are traded between musicians within the sound field. Achieving a structured listening in such a performance practice, with a necessary level of abstraction and receptiveness, provides listening challenges similar to those of electroacoustic music, and we find e.g. Denis Smalley's principles of Gesture and Texture [17] to be appropriate metaphors for further defining our system's listening process. The former refers to an energy-motion trajectory that provides a sense of human intention and direction, while the latter refers both to the inner detail of a given sound event and to the quality of the entire sound field. Our work is based on the conviction that these principles of focal/global and gesture/texture provide an excellent framework for developing a system based on a close coupling of listening, recognition and creative interaction that speaks to the self-organizing and emergent nature of form in free improvisation.

2. RELATED WORK

In the realm of real-time interactive systems, there have been many examples of work focused on creating an autonomous partner for musical improvisation. One very prominent system is George Lewis' Voyager [9], which converts pitch to MIDI data so that various internal processes can make decisions depending on the harmonic, melodic and rhythmic content before producing symbolic output to interact with the human performer. A more recent system, developed by Hsu [8], focuses on timbral parameters as well as on analysis of their gestural contour: parameter curves for pitch, loudness, noisiness, roughness, etc. are extracted, with captured sequences stored in a database. Our system shares the conviction that sonic gestures [20] are essential to recognize in human-machine interaction for improvisation. However, we differ in that the recognition itself is on-line in our system, and continual adaptation to the anticipated gesture is used as an element of the machine intelligence.

Taking off in a different direction, Pachet [14] has focused on creating a system that is both interactive and which learns musical style. This approach is based on variable Markov chains that learn a tree structure through analysis of input MIDI data. The goal is focused on imitating and predicting musical sequences in a given style. The author claims that the system can "fool" a listener in the short term, but fails to convince on the scale of an entire piece. A similar approach is found in the OMAX system [1], which learns variable-length musical sequences and can understand phrases as being recombinations or variations of past stylistic tendencies. These approaches are interesting in their lack of a priori style assumptions. However, the ability to produce creative output is more crucial to our work than a system that presents variations on a current theme, and so we approach such methods as one part of the learning paradigm, to be augmented by creative and contextual decision-making. On the other end of the novel-output spectrum, Blackwell and Young [5] have focused on the use of evolutionary techniques, such as a swarm algorithm, to guide the spontaneity of the improvisation. Recognizing the benefits of these two approaches, we utilize a combination of similar techniques, structured both by the aforementioned listening principles and by our particular approach to engaging with the computer through transformation of the performers' audio in real time.

3. PARADIGM OF EXPANDED INSTRUMENTS

The Expanded Instrument System (EIS) [13] began at the end of the 1950s, and was informed by listening to the latency between the record and playback heads on reel-to-reel tape recorders. This monitoring of signals led to improvisatory manipulation of amplitudes, which in turn afforded changes in sound quality. Using feedback within and between tracks on a stereo machine increased the possibilities, while adding a second machine with a shared tape afforded a second and longer time delay, and more routing possibilities, as described in [11]. Though limited by the available hardware, through control of time delays, amplitude and signal routing this rudimentary EIS produced music with changing timbres and spatial qualities, such as "Bye Bye Butterfly" and "I of IV" [12], all improvised in the studio by the composer.

Improvisation, listening and system development continually informed and increased the sophistication of EIS. In 1967 musical instruments were included as sound sources, allowing them to be colored by delays, layering and volume changes. Moving away from unwieldy tape machines, in 1983 tape delay was replaced by digital delay. Modulation of the delayed signals was introduced so that a signal could be processed and returned with a new shape. Delayed signals could also now be shaped via voltage control from a foot pedal. These innovations led to the desire for more delays, and the need for a better performance interface in order to control the many signal paths [10]. With the transition of EIS to the computer [7], continuous foot gestures remained an integral control paradigm, while a variety of machine-driven modulations were programmed in order to shape the sound in tandem with this performer action. These same modulatory gestures were similarly applied to spatialization of the sound output, as EIS was also conceived as a multi-channel system, and gestural movement of sounds is an integral part of the system aesthetic.

In 2001, algorithmic control of delay times, modulations and spatialization was implemented. This allowed for meta-control of up to 40 delay times, with fluctuations ranging from milliseconds to one minute, as well as of the type and depth of modulations. This higher level of control allowed a performer to create a sparse to dense field of moving sound that demanded focal and global listening, as each new sound would exist in some form in the future and also become a part of the past. In this sense, EIS is a "time machine" allowing the performer to expand, listen and react, making improvisatory decisions across all three times and multiple parameters simultaneously. The performer reacts to material that is a combination of human and machine modulations. EIS has always acted as such an improvising partner, pushing the performer to make instantaneous decisions and to deal with the consequences musically.

3.1. EIS: Present and Future

Currently, EIS provides algorithmic controls that can be set to learn and repeat patterns of control data; modulation forms can be drawn and edited by the user, as can spatialization patterns. Nevertheless, performer control on the fly is still daunting, and higher-level control is continually being re-considered. Changes in EIS parameters are governed by 50/50 randomization, a chaos generator or a random event generator, and so the performer is challenged to develop ever more extensive abilities of intuition regarding what to play and when to play it. The result is a fluctuating mix of live instrumental input with machine feedback that has the feel of human partnering, mirroring the performer's own sounds interactively.
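The EIS implementation itself lives in Max/MSP; purely as a rough illustration of this style of algorithmic control, the following Python sketch drives a small bank of delay lines whose delay times drift under either a logistic-map chaos generator or sparse random events. The class name, parameter ranges and coefficients here are illustrative assumptions, not part of EIS.

```python
import numpy as np

class ModulatedDelayBank:
    """A bank of delay lines whose delay times drift under algorithmic control.
    Illustrative sketch only; EIS itself is implemented in Max/MSP."""

    def __init__(self, n_delays=4, max_delay_s=10.0, sr=44100):
        self.sr = sr
        self.max_len = int(max_delay_s * sr)
        self.buffers = np.zeros((n_delays, self.max_len))
        self.write_pos = 0
        # current delay time (in samples) for each line
        self.delays = np.random.randint(sr // 100, self.max_len, size=n_delays)
        self.chaos = np.random.rand(n_delays)  # state of a logistic map per line

    def update_control(self, mode="chaos"):
        """Nudge each delay time once per block, by chaos or by random events."""
        if mode == "chaos":
            self.chaos = 3.9 * self.chaos * (1.0 - self.chaos)  # logistic map
            target = (self.chaos * (self.max_len - 1)).astype(int)
        else:  # sparse random events: occasionally jump to a new delay time
            jump = np.random.rand(len(self.delays)) < 0.05
            target = np.where(jump,
                              np.random.randint(1, self.max_len, size=len(self.delays)),
                              self.delays)
        # glide toward the target to avoid abrupt clicks
        self.delays = (0.95 * self.delays + 0.05 * target).astype(int)
        self.delays = np.clip(self.delays, 1, self.max_len - 1)

    def process_block(self, block):
        """Write the input block and return input plus the sum of delayed signals."""
        out = np.zeros_like(block)
        for n, x in enumerate(block):
            idx = (self.write_pos - self.delays) % self.max_len
            out[n] = x + 0.5 * self.buffers[np.arange(len(self.delays)), idx].sum()
            self.buffers[:, self.write_pos] = x
            self.write_pos = (self.write_pos + 1) % self.max_len
        return out

# usage: feed audio block by block, updating the control layer between blocks
bank = ModulatedDelayBank()
audio = np.random.randn(44100) * 0.1          # stand-in for a live input signal
for start in range(0, len(audio), 1024):
    bank.update_control(mode="chaos")
    _ = bank.process_block(audio[start:start + 1024])
```

The point of the sketch is only the division of labor: a slow control layer perturbs the delay parameters while the audio layer keeps running, so the performer hears material returning at times and in shapes that are partly machine-determined.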

The current delay processor module has up to twenty delays. Modulator types may be selected by the user, turned off, or changed randomly, with user control over the rate of switching. The number of active delays may be set manually, controlled by one of the aforementioned random generators, or driven by learned control patterns. A virtual Lexicon PSP-42 module allows for gestural control of modulation with a foot pedal, and introduces its own unique modulation and envelope shapes as well as feedback gain. In addition to this gestural shaping and time component, there is a space component to EIS that currently includes a module for reverberation and spatialization in 4 to 8 channels, with the shape of spatial movement selected from ten different geometrical patterns or drawn by hand. The spatial parameters may be algorithmically controlled as well. A large body of work has been produced using EIS and its interfaces, through many different designs from analog to digital, in continual evolution from 1960 to the present.

Figure 1. Current EIS Screen Interface.

3.2. GREIS

The first author has developed a performance system that takes off from the EIS paradigm and, while having many key differences, shares a similar intention of interacting with the immediate past in a way that is nonlinear and which introduces new gestural modulations into the musical mix. This system is called GREIS (pronounced as "grace"), for Granular-feedback Expanded Instrument System (an homage to EIS), and was developed in Max/MSP at the external and abstraction level. One of the prime differences between EIS and GREIS is in the player's mode of interacting with the system, and consequently in the time scale on which the interaction takes place. For EIS, the second author most often performs on accordion as primary instrument, with EIS serving to propel this content through time and space after a duration often in the range of 20-60 seconds. GREIS, meanwhile, has been developed for more direct, bi-manual interaction via Wacom tablet and various controllers. Through the process of acting more immediately on the system parameters, the focus has shifted to shorter time delays (e.g. from milliseconds to 10 seconds) that influence the timbral quality of the immediate sound output. Regardless, both systems can and do traverse both time scales in performance.

A second and more profound difference is that GREIS has focused on the notion of "scrubbing" captured sound with the dominant hand as it enters one of the system buffers, while modulating the output with pen articulations and left-hand control. This approach extends the notion of the computer-mediated modulating delay lines of EIS, which can be thought of as a turntablistic "scratching" of the captured audio, with a sound that ranges from a gentle phasing to a dramatic screeching, both of which alter the pitch content as they provide a new gesture in response. The use of scrubbing in GREIS returns the modulation to the musician (whose hands are free in this context), so that material can be shaped into a new sonic gesture by capturing sound and re-shaping its amplitude envelope and other features over time. This is furthered by the use of a combined granular and time/frequency approach that allows for extreme time stretching and pitch shifting in a decoupled fashion, but also allows for control of the textural nature of the signal in a dynamic fashion, placing focus on a combined gesture/texture shaping.

The system achieves this control by treating "grains" in a general sense, meaning both temporal and spectral units that may be shaped by the tablet and propagated through the system in a large feedback delay network. This is achieved through a combination of two granular synthesis modules and two phase-vocoder based modules. The granular technique introduces two separate processing channels: one controlled by an MSP phasor object at signal rate, and one controlled by a Max metro/gate system at control rate. The former allows for very smooth stretching with minimal artifacts, while the latter allows for grains to be separated into streams, each processed in a unique way. In GREIS, each of these temporal grains has a feedback delay and pitch shifting applied (allowing for a very profound "texturization") before each individual stream is sent into the primary feedback matrix (lower right of figure 2). The control-rate grain streams are configurable to 8, 16 or 32 separate streams per processing module. The signal-rate and control-rate grain channels can be mixed separately, and an involved mapping structure is used in order to transition between processing types and synthesis parameters, resulting in the ability to either hide or accentuate digital artifacts [19]. The spectral grain processing is based on a combination of gabor modules and the super phase vocoder [16]. This allows for partials, noise and transients to be separated in real time as the sound is scrubbed, and each of these is propagated through the primary system matrix as well. Important to GREIS' ability to shift between different sonic shapings and textural transformations is that the control interface remains the same while only the state of "temporal" or "spectral" is changed (in the three identical modules of figure 2, this is reflected in the upper-left drop-down menu).
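GREIS itself is built from Max/MSP externals and abstractions, but the kind of granular scrubbing and decoupled time stretching described above can be illustrated with a short Python sketch; the grain size, hop and window choice below are assumptions made for the example.

```python
import numpy as np

def granular_scrub(buffer, positions, grain_size=2048, hop=512):
    """Resynthesize audio by overlap-adding windowed grains taken from `buffer`
    at the (possibly non-monotonic) sample positions given in `positions`.
    One grain is emitted per output hop, so slowing the position trajectory
    stretches time without changing pitch. Illustrative sketch only."""
    window = np.hanning(grain_size)
    out = np.zeros(len(positions) * hop + grain_size)
    for k, pos in enumerate(positions):
        start = int(np.clip(pos, 0, len(buffer) - grain_size))
        grain = buffer[start:start + grain_size] * window
        out[k * hop:k * hop + grain_size] += grain
    return out

# usage: capture two seconds of material, then scrub through it at 1/4 speed
sr = 44100
captured = np.sin(2 * np.pi * 220 * np.arange(2 * sr) / sr)   # stand-in for captured input
n_grains = 400
scrub_path = np.linspace(0, len(captured) * 0.25, n_grains)    # slow, hand-like trajectory
stretched = granular_scrub(captured, scrub_path)
```

Because the read position is decoupled from the output rate, the same mechanism serves both gentle time stretching (a slowly advancing path) and more dramatic scrubbing (a hand-drawn, non-monotonic path).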

Figure 2. GREIS performance interface.

The final key aspect of the system is that it merges this hands-on approach to shaping new sonic gestures (while in the process breaking apart their textural components) with the EIS approach of computer-mediated gestural modulations. The large-scale modulations are thus placed in the feedback matrix, allowing for an interplay between human-based shapings, which leverage the power of time stretching and parametric resynthesis, and surprising new delay-line modulations that further reshape the already-transformed sound, arriving at unexpected times. This delayed output can then pass back into the granular-feedback chain to be further re-shaped, and to remain in memory longer as a new sonic object. Either multi-dimensional Wacom tablet gestures or the resultant scrubbed audio may be saved in memory, and the performer can recall these at a later point in the performance. Finally, a higher-level control layer of directed randomness introduces noise into the sound processing parameters as well as into the matrix routing and per-grain filtering, while the behavior of the modulated delay lines may be controlled probabilistically.

3.3. Towards an Intelligent Response

In both EIS and GREIS, many surprising sounds can arise that are a re-presentation of performer actions. These results are most often an interesting commentary on the sound quality as well as on the musical structure. Both systems (and particularly EIS) have been used in countless recordings and performance contexts, and are regarded as having a high degree of musicality and interesting musical interaction potential by those who have performed with them. Having said this, both rely on a performance memory and decision making that arise from a well-tuned (in regards to range and timing) set of random generators, whose effectiveness lies in having a behavior that can be anticipated and manipulated by performers. Therefore, while the system presents the musical material in a new context to be considered by performers, it does not listen to the sonic and musical context in order to present material in a new way as a result. Such intelligent decision making, based on our listening principles, is what we strive to introduce. Some of our designs to this end will be discussed from this point forward, after describing the performance context that provides the musical problem and the source of data that inspires and directs our system improvements.

4. TRIPLE POINT

Triple Point (figure 3) is our improvising trio whose core instrumentation is GREIS, soprano saxophone and digital accordion synthesizer. The name refers to the point of equilibrium on a phase plot, which is a metaphor for how we create dialogue as performers. Our musical interaction is centered around an interplay between proper acoustics, modeled acoustics (from the Roland V-accordion) and electronics. The GREIS performer captures the sound of the other players in real time, either transforming it to create new sonic gestures or holding it for return in the near future. The V-accordion player changes between timbres and bends the presets of the instrument through idiosyncratic use of the virtual instrument, while the saxophonist explores extended techniques including long circular-breathing tones and multiphonics. This mode of interaction has resulted in situations where the acoustic or electronic source is indistinguishable without careful listening, while at other times this becomes wildly apparent. This morphing is based on timbral transformations of the acoustic players, but is just as much a product of these players playing into the process of transformation as it occurs. Examples of this performance style can be found on the album Sound Shadows [18].

Figure 3. Triple Point Performing at Roulette in New York.

The blending of these disparate sources tends to have two distinct modes: one in which separate gestural contours arise that are distinct but whose sources are difficult to follow, and one in which the sources fuse into a singular texture that either moves in a coherent fashion or lacks a linear direction. This, again, is guided purely by focal and global listening in the moment, and involves an awareness of the motion between gestural dynamics and textural sustain. The group learns to follow these recognizable musical states: striated gestures, gesture/texture mixture, complete texture, or disjointed sources (a sort of "noise" in this context). This observation on our playing style has motivated several sub-systems for machine listening and analysis that we have since endowed with additional sound transformation capabilities, resulting in an intelligent agent. While Triple Point at its core is the above trio instrumentation, we at times also become a quartet or even a quintet by including EIS or the agent as additional performers. These extra performers are in turn noted on pieces or recordings, in order to allow the listener to navigate the interaction and intent behind the spectra of human/machine and acoustic/electronic at varying levels of sonic and musical structure.

5. RECOGNITION OF SONIC GESTURES

The above mode of interaction and set of musical states are a product of inter-performer interactions, but also of the gestural modulations presented by GREIS and/or EIS, and so to invert this process we look to build a system that recognizes different sonic gestures. This aspect of the system, first presented in [21], begins with analysis of sound features focused on combined timbre and texture recognition, followed by a Hidden Markov Model (HMM) based approach. While HMMs have become commonplace for speech recognition, and even for recognition of musical timbre, most often the interest lies in an out-of-time or a posteriori act of classification. By contrast, our work specializes to the uncertainty of free improvisation by attempting to recognize intent. We search for the most likely type of sonic gesture that is being played, with a continuous degree of certainty about this understanding. We are less interested in musical retrieval itself than in using the process of retrieval in the way that an improvising performer does, continually updating their expectation.

5.1. Feature Extraction

Our choice of features is focused on the ability to differentiate between the sounds made by a performer along pitch, dynamics, timbre and texture dimensions. We further need features that work well for the wide palette of physical models from the V-accordion, but that are also sensitive to the rich acoustics of extended saxophone technique. At the same time these features need to make sense as a sequence of states for the resultant HMM. We have found that global spectral features related to timbre do well for a gross characterization of different gestures, while finer qualitative differences can be distinguished by looking at features that react to textural differences, which can be considered as spectro-temporal features that are more separable in time and act over a larger time scale than timbre (e.g. less than 20 ms vs. 20-400 ms). Rather than pitch, we look more generally at pitchness and temporal regularity. These classes of sound features are thus decomposable as global spectral, pitch strength and textural. To this end we have found spectral centroid and spectral deviation most effective (in our trio) for the first group; frequency, energy, periodicity and the ratio of correlation coefficients derived from a normalized autocorrelation pitch model for the second; and finally the 3rd and 4th order moments of the LPC residual for the final category. These react well to the "inner" textural detail, while the approach in section 6 detects the global texture of the overall auditory scene.
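A minimal, offline sketch of this feature set is given below in Python for a single frame of audio. The LPC order, window and frame length are our own assumptions for the example; the actual real-time extraction in our system runs in Max/MSP.

```python
import numpy as np

def frame_features(x, sr=44100, lpc_order=12):
    """Compute a small feature vector for one audio frame:
    spectral centroid/deviation, autocorrelation periodicity,
    and 3rd/4th order moments of the LPC residual. Sketch only."""
    x = x - np.mean(x)

    # --- global spectral features ---
    mag = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    p = mag / (np.sum(mag) + 1e-12)
    centroid = np.sum(freqs * p)
    deviation = np.sqrt(np.sum(((freqs - centroid) ** 2) * p))

    # --- pitch strength from normalized autocorrelation ---
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    ac = ac / (ac[0] + 1e-12)
    lag_min = int(sr / 1000.0)              # skip very short lags
    periodicity = np.max(ac[lag_min:len(x) // 2])

    # --- LPC residual moments (textural features) ---
    r = ac[:lpc_order + 1]                  # overall scale cancels in the solve
    R = np.array([[r[abs(i - j)] for j in range(lpc_order)] for i in range(lpc_order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(lpc_order), r[1:lpc_order + 1])
    residual = np.convolve(np.concatenate(([1.0], -a)), x)[:len(x)]
    res = (residual - residual.mean()) / (residual.std() + 1e-12)
    skewness = np.mean(res ** 3)            # 3rd order moment
    kurtosis = np.mean(res ** 4)            # 4th order moment

    return np.array([centroid, deviation, periodicity, skewness, kurtosis])

# usage on a 46 ms frame of a test signal
sr = 44100
frame = np.sin(2 * np.pi * 440 * np.arange(2048) / sr) + 0.1 * np.random.randn(2048)
print(frame_features(frame, sr))
```

Each frame of such a vector then becomes one observation in the state sequences used by the recognition stage described next.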

5.2. Sonic Gestural Analysis

The initial process of recognition is triggered by onset and offset messages, using an adaptive-threshold version of the onset detection method described in [3]. This operates in the complex spectral domain to provide a continuous onset function that can be thresholded for on/offsets. The system can look for sharp changes in any of the above features, and we have found that detecting onsets in dynamics, fundamental frequency and spectral centroid is most effective for possible event segmentation. The detection of such an event triggers the recognition process, but the final determination of an event's significance is the result of the recognition stage. This component is built using combined HMM/dynamic time warping [4] and the FTM package for Max/MSP.

The recognition system is trained on example sonic gestures, which are represented as a left-to-right topology of states, each comprised of the multidimensional feature set. This defines a sort of gestural dictionary against which all incoming gestures can be compared. These gestures should be orthogonal in some musical sense, which is taken care of either a priori by exemplary training, or by on-line learning as described in section 7. Once the system is trained on a set of gestures, it compares any segmented input gesture to its internal lexicon of gestural archetypes, providing a vector of probabilities, one for each member. While a standard classification system would look only to the most likely member, we in fact examine the entire probability vector and its dynamic form as it changes continuously in real time. This information is key to defining dynamic attention, as it is used to represent the system's expectation of what a sonic gesture is at any given moment, which in turn is continually being updated.

In particular, we extract the normalized probability, the maximally-likely gesture, and the deviation between the maximum and the few highest values. These latter values each give some indication of how strongly the system believes that the performed gesture is one that it knows; both are needed in order to know the uniqueness as well as the strength of a recognition. In addition to these instantaneous recognition values, we define a confidence measure based on a set of leaky integrators applied to the maximum and deviation values [21], which reflects a building up of confidence in the likelihood of a given sonic gesture over time. Therefore, the system can change its mind abruptly, or become more certain over time in a continuous fashion. Changes in system output follow accordingly, by mapping these extracted probability features into sound transformations or higher-level musical state changes.
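The sketch below illustrates, with made-up probabilities and smoothing coefficients, how such a confidence measure can be accumulated from the stream of per-gesture probability vectors: leaky integrators smooth the instantaneous maximum and the deviation between the top candidates.

```python
import numpy as np

class GestureConfidence:
    """Track recognition confidence from a stream of per-gesture probability
    vectors, using leaky integrators on the maximum likelihood and on the
    deviation between the top candidates. Coefficients are illustrative."""

    def __init__(self, leak=0.9):
        self.leak = leak
        self.smooth_max = 0.0
        self.smooth_dev = 0.0

    def update(self, probs):
        probs = np.asarray(probs, dtype=float)
        probs = probs / (probs.sum() + 1e-12)          # normalized probability
        order = np.argsort(probs)[::-1]
        best = order[0]                                 # maximally-likely gesture
        deviation = probs[order[0]] - probs[order[1:4]].mean()  # gap to runners-up
        # leaky integration: confidence builds up (or decays) over time
        self.smooth_max = self.leak * self.smooth_max + (1 - self.leak) * probs[best]
        self.smooth_dev = self.leak * self.smooth_dev + (1 - self.leak) * deviation
        confidence = self.smooth_max * self.smooth_dev
        return best, confidence

# usage: the recognizer keeps favoring gesture 2, so confidence grows over time
tracker = GestureConfidence()
for _ in range(20):
    frame_probs = [0.1, 0.15, 0.6, 0.15]               # hypothetical HMM output
    best, conf = tracker.update(frame_probs)
print(best, round(conf, 3))
```

A sudden change in the winning gesture pulls the smoothed values down before they can rebuild, which is exactly the "change its mind abruptly or become more certain over time" behavior described above.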
The final key component of this part of our listening agent is that we utilize several gestural spaces in parallel, each covering a distinct time scale (most often three scales, from note level up to a 10-second phrase). This is partially practical, as the HMM model-comparing process works best when the dictionary of gestures exists on a single time scale, but also because a note-level sonic gesture is musically very different from a phrase-level one, and needs to be understood accordingly. The process of traversing the different time scales is quite general, and can be extended to more layers: an onset (i.e. the detection of an event) triggers recognition at the smallest possible level; if the accumulated confidence does not pass a given threshold within an adaptive time frame (determined by the average length of the gestural dictionary), then the system begins to try to make sense of the event as a higher level of gesture. This further allows for only one recognition process to be enacted at any given time, reducing the overall CPU load.

Whether or not a gesture is perfectly recognized, if the system detects a certain level of similarity it can make decisions as to what output processing should occur. This output is a product of the mapping defined by the user, the instantaneous probability values and the smoothed confidence value. The overall musical output also takes into consideration the global sonic texture, which is a parallel listening process.

6. LISTENING TO SONIC TEXTURE

We also endow our system with the ability to listen to the global auditory scene in regards to its textural quality. Following the musical analysis presented in [20], we consider texture in our work to be any sustained sonic phenomenon that possesses global stationarity (of amplitude and frequency content) while having local micro-variations. There is no clear separation between timbral events and textural sustain on the one hand, nor, on the other, between the local temporal variation or "grain" [6] of a sound texture and larger-scale modulations such as tremolo or vibrato. These phenomena differ along a continuum in regards to the separability of their temporal fluctuation. Our work explores the possibility of creating a machine listening system that partitions the incoming audio in an adaptive and signal-driven way in order to listen along this continuum. To this end, we base our system on the nonlinear time-frequency analysis technique called Empirical Mode Decomposition (EMD) [23], which separates a signal into a set of (not necessarily orthogonal) intrinsic mode functions (IMFs) that are created not by differing spectral content per se (though this is weakly a by-product), but by different levels of temporal modulation. The technique is aimed at factoring out components by virtue of their similar amplitude/frequency-modulated content. We leave the details of our analysis to [20], but after separating the signal into this set of IMFs, these are clustered by virtue of their signal properties, including the use of a roughness algorithm based on the work of [15], an analysis of envelope modulation, and a measure created by the first author known as temporal fine structure, defined on each signal block L of size N as

TFS(L) = \frac{\sum_{i=LN+1}^{(L+1)N} \left( x_p[i] - \frac{x_p[i+1] + x_p[i] + x_p[i-1]}{3} \right)^2}{RMS(L)}    (1)

where RMS is the root mean square for the given signal block and x_p is the instantaneous power-amplitude of the signal.
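Read directly from equation (1), a block-wise computation of this measure might look as follows; approximating the instantaneous power-amplitude by the squared signal, and the choice of block size, are assumptions of this sketch.

```python
import numpy as np

def temporal_fine_structure(x, block_size=1024):
    """Blockwise temporal fine structure (TFS) as in equation (1):
    the squared deviation of the instantaneous power-amplitude from its
    three-point local mean, summed over each block and normalized by the
    block RMS. The power-amplitude is approximated here by the squared signal."""
    x = np.asarray(x, dtype=float)
    xp = x ** 2                                   # stand-in for instantaneous power-amplitude
    local_mean = (np.roll(xp, -1) + xp + np.roll(xp, 1)) / 3.0
    fine = (xp - local_mean) ** 2
    n_blocks = len(x) // block_size
    tfs = np.zeros(n_blocks)
    for L in range(n_blocks):
        sl = slice(L * block_size, (L + 1) * block_size)
        rms = np.sqrt(np.mean(x[sl] ** 2)) + 1e-12
        tfs[L] = np.sum(fine[sl]) / rms
    return tfs

# usage: a noisy signal shows far more temporal fine structure than a smooth tone
sr = 44100
t = np.arange(sr) / sr
print(temporal_fine_structure(np.sin(2 * np.pi * 220 * t))[:3])
print(temporal_fine_structure(np.random.randn(sr))[:3])
```

Smooth, tonal IMFs yield small TFS values, while noisy or "dusty" components yield large ones, which is what makes the measure useful for separating qualitatively different kinds of signal grain.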

After separating out the time-frequency IMF components and extracting these features (chosen precisely because they have proven effective in separating out qualitatively different types of signal "grain" [20]), we apply an expectation-maximization (EM) algorithm, which learns the dynamics of these different spectral regions in EMD-space (i.e. groupings based on the type of temporal modulatory structure), and further detects the boundaries between textural changes. That is, when a global texture changes to a new scene, the algorithm notes this boundary. The full details of our approach are presented in [22], but the key idea is that we train the system on a set of exemplary musical textures (i.e. recordings of our music that we characterize as complete textural sustain). The system analyzes the input audio and classifies it in terms of the number and quality of each textural sub-group. It cannot be known that a texture has changed until after the fact, but we make use of our EM algorithm (the "E" part, which is based on a Kalman filter) in order to maintain a running prediction of a potential change in texture on the receding horizon.

Therefore, our global texture-listening system is centered around listening to the sound from all players, deciding on the number and type of sub-textures the music is comprised of (e.g. long beating modulations, white noise and "dust" noise), and giving a short-term prediction of when a change in texture will occur based on this. Currently, we analyze the number and type of scenes in real time, utilizing externals we have written in Max for analysis, including the EMD technique, as pictured in figure 4. The example patch shows a mixture of two sine waves and white noise; note how the external is able to separate the three components. In contrast to linear techniques such as the STFT, when the sine waves drift closer and cause beating, this modulation is identified as a separate IMF, much as a listener is able to detect this phenomenon as distinct from the tones proper. We use this global texture-listening system as a complement to the sonic gesture analysis, with a memory and learning that is updated in a way that is heavily based on current musical information.

Figure 4. emd external for Empirical Mode Decomposition.

7. LEARNING, EPISODIC MEMORY AND DYNAMIC ATTENTION

With the ability to segment and recognize individual sonic gestures as well as the overall musical texture in place, we believe our design allows for a machine listening that is attentive to changing gesture/texture content, as well as to focal and global modes of listening that are particularly relevant in free improvisation as well as in electroacoustic music. However, there is still the need to balance attention between these modes, to learn from the environment, and to create some form of memory that is engaging for a fellow improviser. The first issue is very much an open question for us, and we currently address it by mapping the output of gesture recognition into low-level parameters or higher meta-parameters, and the output of texture recognition into high-level state changes.

Meanwhile, the issue of learning is addressed differently for these two modes of listening. The learning of sonic gestures is an adaptive process that mimics the EIS and GREIS paradigms of a moving window of delay-line memory (e.g. 60 seconds in length). We treat this buffer as the system's short-term episodic memory, from which we learn new gestures. For this on-line learning, we segment the entire delay-line buffer using the same onset analysis, and treat each potential event as a multidimensional feature vector. After building a K-Nearest Neighbor (KNN) tree, we define a novel gesture using one of two methods (the first of which is sketched in code at the end of this section):

1) If an event is discovered that is sufficiently far, in sound feature space, from the other members of the known database, it is included in the gestural lexicon. If a maximum threshold is reached, members that are closer to the centroid of the gesture space are thrown away.

2) After clustering the KNN tree, the gestures that are closest to the centroid of each cluster define a new gesture space.

Using this method of learning, we either privilege recent performance memory completely as the relevant set of articulations to compare against, or we mix new gestures with those that we define a priori as being important. By contrast, we train the system on the set of sonic textural scenes entirely a priori. This is necessary due to the nature of the learning algorithm, but we have also found it to be sufficient given the global and generalized nature of that listening task. Both of these parallel listening tasks are involved with the transition from short-term episodic to semantic memory, and define the immediate musical context.
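A minimal sketch of the first of the two methods above is given below, under the assumption that each segmented event is summarized as a fixed-length feature vector and that plain Euclidean distance is used; the thresholds and sizes are illustrative.

```python
import numpy as np

class GestureLexicon:
    """Short-term episodic memory of gesture feature vectors. A segmented event
    is added when it lies sufficiently far from its nearest known neighbors;
    once a maximum size is reached, the member closest to the centroid of the
    lexicon is discarded. Distances and thresholds are illustrative."""

    def __init__(self, novelty_threshold=1.5, max_size=32):
        self.novelty_threshold = novelty_threshold
        self.max_size = max_size
        self.members = []

    def consider(self, features, k=3):
        features = np.asarray(features, dtype=float)
        if self.members:
            dists = np.sort([np.linalg.norm(features - m) for m in self.members])
            novelty = dists[:k].mean()           # mean distance to k nearest neighbors
            if novelty < self.novelty_threshold:
                return False                     # too close to known gestures
        self.members.append(features)
        if len(self.members) > self.max_size:
            centroid = np.mean(self.members, axis=0)
            closest = int(np.argmin([np.linalg.norm(m - centroid) for m in self.members]))
            self.members.pop(closest)            # prune the least distinctive member
        return True

# usage: random vectors standing in for events segmented from the delay-line buffer
lexicon = GestureLexicon()
added = sum(lexicon.consider(np.random.randn(5)) for _ in range(100))
print(added, "events admitted as novel gestures")
```

Pruning toward the centroid keeps the lexicon biased toward the most distinctive recent articulations, which is the behavior we want from a short-term episodic memory.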

The final issue is that of longer-term episodic memory. For this, we currently leverage the factor oracle pattern-learning algorithm from the OMAX system [1]. This work is focused on learning a sequence of audio-level or MIDI events in order to learn stylistic trends without any assumption of style beforehand. One powerful aspect of this algorithm is that it allows a succession of similar audio events to be identified as a single unit, with a precise time-stamp, and to reference past material that has structural similarity in regards to a left-to-right temporal sequence of audio events. In our approach we build such a structure over the duration of a performance, based on spectral features. The features related to gesture and texture are stored in parallel for each distinct sound event, so that when the players' input audio relates to the stored short-term memory, the system can decide to act on the longer-term memory that is precisely time-stamped and structured by the factor oracle algorithm. This oracle tree structure can then be navigated by the system in order to re-present audio, reacting to the currently recognized musical context. We believe that recalling a set of transitions between gestures and between general textural states is an appropriate level of structure to apply in listening during our own music-making, and in many other forms of improvisation as well.

8. MUSICAL ACTIONS

A full discussion of the musical transformations of our agent is beyond the scope of this paper; we have instead focused on the listening, memory and reaction aspects of our work, as well as on their real-time implementations. However, the key classes of musical action that we design for are directed spontaneity, gestural transformations and scene-aware transformations. The first of these was discussed in [21], and is based on the use of probability and confidence values from gesture recognition to drive the selection, crossover, mutation and rate of a genetic algorithm (see the sketch below). The population space consists of low-level parameters related to granular synthesis as well as the selection of routing in a delay matrix. The second set of actions can be thought of as a sonic gestural analysis/synthesis: rather than applying quasi-random modulations to audio in a delay-line buffer, we scan the system's episodic memory and select audio for transformation based on its gestural contour. The type of transformation (granular, additive-based or spectral envelope modulation) and the specific actions are a product of the input sonic gesture. Finally, the highest level of musical actions, including the density of events, the switching speed between actions and the overall level, is determined by the relative strength and ratio of the gesture/texture recognition.
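As a rough sketch of this first class of action (the actual mappings and parameter sets are specific to GREIS and are described in [21]), the following fragment evolves a small population of normalized synthesis-parameter vectors, with the mutation amount and replacement rate driven by the recognition confidence; the fitness function here is only a placeholder for a musically-defined selection criterion.

```python
import numpy as np

def evolve_parameters(population, confidence, rng=np.random.default_rng(0)):
    """One generation of a toy genetic algorithm over synthesis parameter vectors.
    High recognition confidence keeps the population close to its best members
    (small mutations, few replacements); low confidence injects more randomness."""
    fitness = -np.abs(population.mean(axis=1) - 0.5)      # placeholder fitness
    order = np.argsort(fitness)[::-1]
    parents = population[order[:len(population) // 2]]

    mutation_amount = 0.5 * (1.0 - confidence)             # spontaneity vs. certainty
    n_replace = max(1, int(len(population) * (1.0 - confidence)))

    children = []
    for _ in range(n_replace):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, population.shape[1])          # single-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        child += rng.normal(0.0, mutation_amount, size=child.shape)
        children.append(np.clip(child, 0.0, 1.0))

    population[order[-n_replace:]] = children               # replace the weakest members
    return population

# usage: 16 individuals, each a vector of 6 normalized granular/routing parameters
pop = np.random.default_rng(1).random((16, 6))
pop = evolve_parameters(pop, confidence=0.8)
```

The essential design choice is simply that uncertainty in listening translates into spontaneity in action, while confident recognition keeps the output close to material the system already trusts.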

9. CONCLUSION AND FUTURE WORK

Though focused on the musical activities of Triple Point, we believe our approach to system design, based on listening and on a dual awareness of both gesture and texture, is highly relevant to human/machine free improvisation in general. We have found our combined analysis, recognition and generative output approach to be very satisfying in extending our musical practice. Future work includes an extension into real-time global texture recognition and prediction, as well as meaningful recognition of larger musical trajectories at the level of an entire improvisation.

10. REFERENCES

[1] G. Assayag, G. Bloch, M. Chemillier, A. Cont, and S. Dubnov, "Omax brothers: A dynamic topology of agents for improvisation learning," in ACM Multimedia Workshop on Audio and Music Computing for Multimedia, 2006.
[2] D. Bailey, Improvisation: Its Nature and Practice in Music. Da Capo Press, 1993.
[3] J. P. Bello, C. Duxbury, M. Davies, and M. Sandler, "On the use of phase and energy for musical onset detection in the complex domain," IEEE Signal Processing Letters, vol. 11, pp. 553-556, 2004.
[4] F. Bevilacqua, F. Guedy, N. Schnell, E. Flety, and N. Leroy, "Wireless sensor interface and gesture-follower for music pedagogy," in Proceedings of the 7th Int. Conf. on New Interfaces for Musical Expression, 2007, pp. 124-129.
[5] T. Blackwell and M. Young, "Self-organised music," Organised Sound, vol. 9, no. 2, pp. 123-136, 2004.
[6] M. Chion, Guide des Objets Sonores. INA/Buchet-Chastel, 1983.
[7] D. Gamper and P. Oliveros, "A performer-controlled live sound-processing system: New developments and implementations of the Expanded Instrument System," Leonardo Music Journal, vol. 8, pp. 33-38, 1998.
[8] W. Hsu, "Managing gesture and timbre for analysis and instrument control in an interactive environment," in Proceedings of the 2006 Int. Conf. on New Interfaces for Musical Expression, 2006, pp. 376-379.
[9] G. Lewis, "Too many notes: Computers, complexity and culture in Voyager," Leonardo Music Journal, vol. 10, pp. 33-39, 2000.
[10] P. Oliveros and Panaiotis, "Expanded Instrument System (EIS)," in Proc. of the 1991 International Computer Music Conference (ICMC 91), Montreal, QC, Canada, 1991, pp. 404-407.
[11] P. Oliveros, Software for People. Center for Music Experiment and Related Research, University of California at San Diego, 1979.
[12] P. Oliveros, Electronic Works 1965/66. Paradigm Records PD04, 1997; tracks originally released on vinyl in 1965-66.
[13] P. Oliveros, "The Expanded Instrument System: Introduction and brief history," in The Future of Creative Technologies, Journal of the Institute of Creative Technologies. Leicester, UK: De Montfort University, 2008, pp. 21-23.
[14] F. Pachet, "The Continuator: Musical interaction with style," Journal of New Music Research, vol. 32, no. 3, pp. 333-341, 2003.
[15] D. Pressnitzer and S. McAdams, "An effect of the coherence between envelopes across frequency regions on the perception of roughness," in Psychophysics, Physiology and Models of Hearing. London: World Scientific, 1999, pp. 105-108.
[16] Real-Time Musical Interactions and Analysis/Synthesis Teams, http://www.ircam.fr/imtr.html?L=1, last accessed April 15, 2010.
[17] D. Smalley, "Spectromorphology: Explaining sound-shapes," Organised Sound, vol. 2, no. 2, pp. 107-126, 1997.
[18] Triple Point and S. Dempster, Sound Shadows. Deep Listening DL-DD1, 2009.
[19] D. Van Nort and M. Wanderley, "Control strategies for navigation of complex sonic spaces," in Proc. of the 2007 International Conference on New Interfaces for Musical Expression (NIME 07), 2007, pp. 379-382.
[20] D. Van Nort, "Instrumental listening: sonic gesture as design principle," Organised Sound, vol. 14, no. 2, pp. 177-187, 2009.
[21] D. Van Nort, J. Braasch, and P. Oliveros, "A system for musical improvisation combining sonic gesture recognition and genetic algorithms," in Proceedings of the 6th Sound and Music Computing Conference, Porto, Portugal, 2009, pp. 131-136.
[22] D. Van Nort, J. Braasch, and P. Oliveros, "Sound texture analysis based on a dynamical systems model and empirical mode decomposition," 2009, submitted for publication.
[23] Z. Wu and N. E. Huang, "A study of the characteristics of white noise using the empirical mode decomposition method," Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 460, no. 2046, pp. 1597-1611, 2004.

