
VirtuaLatin - Towards a Musical Multi-Agent System

David Murray-Rust, Alan Smaill and Manuel Contreras Maya


Centre for Intelligent Systems and their Applications
University of Edinburgh
D.S.Murray-Rust@sms.ed.ac.uk
Abstract
This project investigates the use of multi-agent systems for musical accompaniment.¹ It
details the construction and analysis of a percussive agent, able to add timbales accompaniment
to pre-recorded salsa music. We propose, implement and test a novel representational
structure directed towards latin music, and develop a music listening system designed to
build up these high level representations. We develop a generative system which uses expert
knowledge and high level representations to manage a set of behaviours which combine and
alter templates in a musically sensitive manner.
Overall, we find that the agent is capable of creating accompaniment which is indistinguishable
from human playing to the general public, and difficult for domain experts to
identify.

1 Introduction
There are many definitions of what an agent is; [4] attempts to give a set of key properties,
and comes up with:
Situatedness - the ability to perceive and affect an environment
Autonomy - capability for action without intervention, and control over internal state.
Flexibility - which comprises Responsiveness, Pro-activity and Sociability.
If we consider a group of musicians playing together, we can see that similar qualities
are necessary for satisfactory interaction: one cannot play with someone unless one can
apprehend their playing and produce playing of one's own (situatedness); one cannot stop
in the middle of a piece of music and wait until told what to do next (autonomy); and a
musician who does not display flexibility will not be enjoyable to play with. It is clear,
then, that there is a commonality between the qualities needed for successful musicianship
and agenthood.
There are several ways in which agents could be used for music. One natural breakdown
is to model each player in an ensemble as an agent. This is the approach taken in the current
project. An alternative would be to model a single musician as a collection of agents, as in
Minsky’s Society of Mind [7] model of cognition. A variety of approaches to musical agents
are:
• Pachet [10] uses a collection of agents, each representing an individual percussive sound,
to evolve traditional rhythms. The agents here do not communicate as such - they all
have access to a global score, to which they can alter their own contribution and analyse
the contributions of others.
¹ A more in-depth report can be found in [9].

Proceedings of the Sixth International Conference on Computational Intelligence and Multimedia Applications (ICCIMA’05)
0-7695-2358-7/05 $20.00 © 2005 IEEE
[Figure 1 appeared here: the timbalero agent (MIDI file input, analysis, high level musical
representation, rhythm libraries, domain knowledge and generative subsystems) exchanges
individual musical output with the other musician agents (trumpet, piano, other instruments)
and human input via the Conductor, which produces the collated MIDI file output.]

Figure 1. Overview of System Structure

• Dixon [2] uses a set of agents to perform a single function: beat tracking. Here again,
the agents do not communicate; they allow multiple hypotheses about beat position to
be examined concurrently.
• Miranda [8] creates a society of agents who attempt to create a shared repertoire of
sounds. Each agent uses a set of internal parameters to voice a sound, analogous to
positions of the tongue, lips and vocal cords. Another agent will attempt to mimic
this, with good imitations being kept, and poor ones adjusted or discarded.
• In [12], Wulfhorst, Nakayama and Vicari describe a multi agent system (MAS) in
which each agent corresponds to a musician. These agents communicate using MIDI
messages, and attempt to follow changes in tempo, and react to harmonic information.
They work in real time, and can interact with human performers.

2 VirtuaLatin - A Rhythmic Agent


Following on from the convergence noted above between musicianship and agenthood, we
look at the development of a multi-agent system capable of forming part of a heterogeneous
ensemble containing both human and mechanical players. We consider the scenario of a
timbales player joining a latin band, and learning a song to play with the band. We look at
music which keeps a constant structure - at the level of verses and choruses (or in this case
son sections and montunos). This allows the system to build up a high level representation
of a song’s structure, and is in keeping with the way latin bands perform.
The current system is not real-time, and works with pre-recorded MIDI files. The input
data is metricised, but not quantized - note timings are expressed in relation to a beat, but
they do not have to be an integer multiple of some subdivision of the beat.
We break down the functions of our agent system into:
• A “music listening” system, which uses the output of the other musicians to build up
a high level representation of the piece.
• A high level representation of the music, and a memory of what has been played
• An output system which uses these representations to create music
• A low-level music transmission infrastructure
The overall structure of the agent system is given in Figure 1. As only one active agent
is currently designed, a set of simple agents is created, each of which echoes a single part of
the MIDI file to be accompanied. Every musical agent creates output one bar at a time.
This output is then sent to the Conductor, who collates the input and sends it to all the
listening musicians.
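This bar-by-bar exchange between the simple agents and the Conductor can be sketched as follows. The class and method names (`EchoAgent`, `Conductor.step`, the `Bar` structure) are hypothetical, intended only to illustrate the collate-and-broadcast cycle described above, not the paper's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Bar:
    """One bar of music produced by a single agent (hypothetical structure)."""
    agent: str
    notes: list = field(default_factory=list)

class EchoAgent:
    """A simple agent that echoes one pre-recorded part, one bar at a time."""
    def __init__(self, name, part):
        self.name = name
        self._bars = iter(part)
        self.heard = []          # collated bars received back from the Conductor

    def play_bar(self):
        return Bar(self.name, next(self._bars))

    def hear(self, bars):
        self.heard.append(bars)

class Conductor:
    """Collects one bar from every agent, then broadcasts the collated result."""
    def __init__(self, agents):
        self.agents = agents

    def step(self):
        bars = [agent.play_bar() for agent in self.agents]  # one bar per agent
        for agent in self.agents:                           # broadcast collation
            agent.hear(bars)
        return bars
```

In this sketch every agent sees every other agent's output each cycle, which is what lets the listening system described below analyse the full ensemble.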
3 Representations
We take Lerdahl and Jackendoff’s Generative Theory of Tonal Music [5] as inspiration for
a customised latin music representation, combined with ideas from the “Latin Real” book
[3] (from the “Real Book” series). These books contain scores for popular songs, typically
giving chord sequences, section markers, rhythmic indications, cues and phrasing (certain
notes which all of the band accent). A score such as this could be given to a percussionist
who was joining a band, and they would then expect to be able to play along with the band
with only minimal rehearsal.
Since an arbitrary hierarchical tree of groups (à la GTTM) is likely to be difficult to deal
with, we pick out certain structural levels of interest. A Song is composed of Sections, each
Section is composed of Segments and each Segment is composed of Bars. Working from a
set of structural assumptions, we arrive at:
Bar A bar is exactly four beats long, contains one chord, and may contain phrasing
information.
Segment A segment is an integer number of bars, and has a certain rhythmic style, exam-
ples being Son Montuno, Rumba and Bomba. There are other specialised rhythms,
such as Phrasing Only (where only phrased notes are played), Tacet and Timbales
Solo.
Section A section has a single, defined rôle in the piece. In salsa music typical rôles are Son
(the quiet sections at the beginning), Montuno (the high point of the piece, with vocal
improvisation), Mambo (instrumental solos over repeated backing) and Intro/Outro.
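A minimal sketch of this hierarchy as Python dataclasses follows. The field names (`style`, `role`, `phrasing`) are assumptions drawn from the descriptions above, not the paper's actual data structures:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Bar:
    chord: str                       # exactly one chord per bar
    phrasing: Optional[list] = None  # optional phrased-note positions
    beats: int = 4                   # well-formedness: a bar is four beats long

@dataclass
class Segment:
    style: str                       # e.g. "Son Montuno", "Rumba", "Bomba"
    bars: List[Bar] = field(default_factory=list)

@dataclass
class Section:
    role: str                        # e.g. "Son", "Montuno", "Mambo", "Intro"
    segments: List[Segment] = field(default_factory=list)

@dataclass
class Song:
    sections: List[Section] = field(default_factory=list)

    def n_bars(self):
        """Total bar count across all sections and segments."""
        return sum(len(seg.bars)
                   for sec in self.sections for seg in sec.segments)
```

Fixing the tree at exactly four levels, rather than allowing arbitrary nesting, is what makes the well-formedness and preference rules below tractable.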
Our structural assumptions allow us to create a set of Well Formedness rules, which
dictate what may and may not be a Bar, Segment or Section. Since there are many legal
ways to segment a piece of music, we need a way to choose preferred segmentations, so
we introduce a set of Preference Criteria, from which we derive Preference Rules for the
various levels. For instance, the preference criterion of maximising reusability translates to
a preference rule which attempts to find units whose parameters are the same as others.
4 Perception
When music is received from the other agents, it is passed to the music listening system
(Figure 2). Two waves of feature extraction are performed; this allows for a set of features
which is built on previously extracted features. Features are extracted from complete bars
of music, following the assumption that this is the finest level of structural detail.
Broadly speaking, the features used are as follows:
Activity The number of active players, and their levels of activity. This helps determine
section boundaries and type of section.
Harmonic Information The current chord and tonality are extracted using a modified
version of the Parncutt chord recogniser [11], in combination with a key finding
algorithm due to Longuet-Higgins [6], as described in [11]. Several small modifications
were made to adapt this to the polyphonic environment.
Rhythmic Information The degree of phrasing, or rhythmic coherence, is extracted by
two separate but similar algorithms. Both algorithms divide the bar into small segments
and then quantize each note onset to the nearest segment boundary. The first
algorithm computes the ratio of subdivisions where everyone plays to the subdivisions
where some people play; the second algorithm sums the "agreement"
(|n_playing/n_players − 1/2|) for each subdivision. The scores from the two
algorithms are then compared to threshold values to decide whether the bar has some
phrasing, no phrasing or is entirely phrasing.

[Figure 2 (Music Listening Subsystem) and Figure 3 (Generative System) appeared here.
Figure 3 shows the generative pipeline: the Representation and Memory feed Basic Rhythm
Selection, followed by the Ornamentation behaviours (Phrasing, Fills, Chatter), then Output.]
Parallelism The main reason for performing the harmonic analysis detailed above is that
it gives a lot of information about the structure of the piece. Using the principle
that patterns which occur frequently are likely to be structural units at some level,
we create a tree of chord sequences, and prefer boundaries which are the start of
common sequences.
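The two phrasing algorithms above might be sketched as below. The subdivision count, the choice to sum the agreement term only over active subdivisions, and all names are assumptions made for illustration:

```python
def phrasing_scores(onsets_by_player, n_subdiv=16):
    """Compute the two rhythmic-coherence scores for one bar.

    onsets_by_player: one list of onset times in [0, 1) per player,
    expressed as fractions of the bar.
    """
    n_players = len(onsets_by_player)
    # Quantize each onset to the nearest subdivision boundary.
    hits = [{round(t * n_subdiv) % n_subdiv for t in onsets}
            for onsets in onsets_by_player]
    # How many players hit each subdivision.
    counts = [sum(1 for h in hits if s in h) for s in range(n_subdiv)]

    # Algorithm 1: subdivisions where everyone plays / where anyone plays.
    all_play = sum(1 for c in counts if c == n_players)
    some_play = sum(1 for c in counts if c > 0)
    ratio = all_play / some_play if some_play else 0.0

    # Algorithm 2: summed "agreement" |n_playing/n_players - 1/2|,
    # here taken only over active subdivisions (an assumption).
    agreement = sum(abs(c / n_players - 0.5) for c in counts if c > 0)
    return ratio, agreement
```

For a bar where all players hit exactly the same onsets the ratio is 1.0; scattered, uncoordinated onsets drive both scores down, which is what the thresholding step then classifies.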
We implement a set of rules based on our Well-Formedness and Preference rules (in a
similar manner to the GTTM [5]). Once the entire piece has been heard, each rule is run
for each bar of the piece. Some rules can force a boundary at a particular bar, while other
rules give a numeric score for the bar. The scores from all the rules are summed, and when
the score for a particular bar exceeds a certain threshold, that bar is considered to be a
boundary.
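The force-or-score scheme can be sketched as follows; the rule signature and threshold value are illustrative assumptions, not taken from the paper:

```python
def find_boundaries(n_bars, rules, threshold=1.0):
    """Run every rule on every bar; a bar is a boundary if any rule
    forces it, or if the summed scores exceed the threshold.

    rules: callables mapping bar_index -> (forced: bool, score: float).
    """
    boundaries = []
    for bar in range(n_bars):
        forced = False
        total = 0.0
        for rule in rules:
            f, score = rule(bar)
            forced = forced or f
            total += score
        if forced or total > threshold:
            boundaries.append(bar)
    return boundaries
```

The appeal of this design is that weak evidence from several preference rules can accumulate into a boundary decision, while hard well-formedness constraints bypass the vote entirely.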
Once we have calculated the boundary points within the piece, we can create a repre-
sentation of it. We create appropriate structures for each Bar, Segment and Section in the
piece, and use the features we have already extracted from the music heard to fill in the
various attributes of these structures. This leads to a representation that the timbalero can
use to accompany a song.
5 Output
The generative system is outlined in Figure 3. Basic rhythm selection uses the agent’s
representation of the song, along with some simple logic to select a template pattern from
a small library. This is in keeping with timbales playing in general - there is a relatively
small set of standard rhythms which a player will choose from. It is a deterministic process
- for a given representation, the timbalero will always select the same basic rhythm for each
bar.
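A minimal illustration of this deterministic lookup, with an invented template library (the style and pattern names are placeholders, not the paper's actual library):

```python
# Hypothetical style library mapping segment style -> timbales template.
TEMPLATES = {
    "son_montuno": "cascara",
    "rumba": "guaguanco_bell",
    "mambo": "mambo_bell",
}

def select_rhythm(segment_style, library=TEMPLATES, default="cascara"):
    """Deterministic: the same representation always yields the same
    template, mirroring a timbalero's small set of standard rhythms."""
    return library.get(segment_style, default)
```

Keeping selection deterministic means all variation between performances comes from the ornamentation stage, not from the choice of basic rhythm.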
Ornamentation is a more complex issue:

Phrasing adds accents dictated by the score
Chatter alters the template to provide interest at structurally relevant points.
Fills are used to emphasise changes of section, or other structurally relevant features.
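One way to picture these behaviours is as a pipeline of bar transformers applied to the selected template. The (position, velocity) note encoding, the 16-subdivision grid and the particular fill pattern are all invented for illustration:

```python
def apply_behaviours(template, ctx, behaviours):
    """Apply each ornamentation behaviour in turn to a template bar.

    A bar is a list of (position, velocity) pairs on a 16-subdivision grid;
    ctx carries structural context from the representation.
    """
    bar = list(template)
    for behaviour in behaviours:
        bar = behaviour(bar, ctx)
    return bar

def phrasing(bar, ctx):
    # Accent (maximum MIDI velocity) any position the score marks as phrased.
    return [(pos, 127 if pos in ctx.get("phrased", ()) else vel)
            for pos, vel in bar]

def fills(bar, ctx):
    # At a section boundary, replace the last beat (positions 12-15)
    # with a simple two-note fill.
    if ctx.get("section_boundary"):
        return [n for n in bar if n[0] < 12] + [(12, 100), (14, 100)]
    return bar
```

Because each behaviour only sees the bar and the structural context, new behaviours (or whole style libraries) can be slotted into the pipeline without changing the others.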
It is the combination of template patterns with methods to alter them to fit the current
context and the use of formal structures at appropriate points which allows the system to
function as a realistic musician. Since this project is aimed at a specific style of music, the
logic to do with rhythm and ornamentation selection is contained within style libraries; this
means that it could be adapted to deal with other styles of music.
6 Results
6.1 Music Listening System
The first point of analysis of the agent is its ability to “listen” to the music correctly, and
build up a structural representation. It was found here that many elements of the process
worked correctly, but it was not possible to create a full representation.
The phrasing detection algorithms performed well, finding correct patterns, and labelling
bars appropriately. However, phrasing during instrumental solos was often missed.
The chordal analysis ran up against a major problem: although the chords encountered
lasted for about a bar each, chord transitions often do not match bar lines; most commonly,
the chord changes on the fourth beat of the bar, leading to mislabelling. When looking
for repeated patterns in the sequence of chords, it was common to find appropriate periodic
patterns, but these were often shifted by one or two bars from the "correct" start points.
Dissection of the piece is at present incomplete. The rules for Segment boundaries find
many in the correct place, but also many in places which are plausible, but incorrect.
Although tuning of rule weights might help this, it is clear that more musical knowledge
must be incorporated to give a robust system. There are no rules at present to differentiate
different sections - this would require a deeper stylistic analysis of the musical form.
6.2 The Salsa Challenge - A Musical Turing Test
The listening tests were designed to test the musicality of the timbalero; due to the
incompleteness of the analysis section, handmade high level representations were used.
Two groups of listeners were tested: the general public, and a set of domain experts (the
salsa band Cambiando, and another experienced conga player).
Two versions of Mi Tierra (Gloria Estefan) were recorded, one played by the virtual
timbalero, the other played by the author, using a MIDI drum kit rearranged to approximate
timbales. The use of a MIDI kit allowed completely identical sounds to be used - a recording
of timbales would be relatively easy to distinguish from one composed of triggered sounds.
The human playing was quantized, and obvious mistakes were edited, so there were no
obvious signs of human error.
No statistical evidence was found that the general public could tell the difference between
the two versions, and listeners were equally divided with regard to which version they
preferred. A higher proportion of the domain experts correctly identified the computer
player, but there were not enough data for a strong conclusion. All subjects indicated a
degree of uncertainty, with the experts expressing more uncertainty than the general public.
Cited features that gave away the virtual player include similarity of fills, sounding too
polished and following the marked phrasing too closely.

7 Conclusions and Future Work
This system addresses a certain type of situation - it is not designed to model human
characteristics, but rather to take a direct approach to creating high quality musical output.
As such, it would be useful for situations such as auto-accompaniment or computer games,
where innovation is secondary to consistent, acceptable output. It is also conceivable that
it could be used as a basic compositional tool - one could direct a piece of music at a high
level (in terms of chords, verses and choruses etc), have the system create a realisation of
this, and then refine the output as necessary.
There are some clear limitations with the work at present:
• The lack of real time operation makes it impossible for musicians to play with the
system.
• The structure of the representation is inflexible. It would be useful to be able to
handle concepts such as "a horn solo which continues until the piano gives a lead-in",
i.e. to have variable-length sections.
• The perception subsystem is not currently capable of building up representations
accurately.
The closest system for comparison is described in [12]. The two approaches tackle com-
plementary aspects of the same problem; in this work, the focus is on high quality, musically
sensitive output, while in [12] the emphasis is on real-time, flexible playing. A system which
combined qualities of the two approaches would be an exciting development. It is encour-
aging that this system was able to fool the general public in its current form, as there are
many easy alterations which would improve the quality of output with minimal changes.
References
[1] Kenny R. Coventry and Tim Blackwell. Pragmatics in language and music. In Matt Smith,
Alan Smaill, and Geraint A. Wiggins, editors, Music Education: An Artificial Intelligence
Approach, Workshops in Computing. Springer, 1994.
[2] Simon Dixon. A lightweight multi-agent musical beat tracking system. In Pacific Rim Inter-
national Conference on Artificial Intelligence, pages 778–788, 2000.
[3] Chuck Sher, editor. The Latin Real Book. Sher Music Co., 1999.
[4] N. R. Jennings, K. Sycara, and M. Wooldridge. A roadmap of agent research and development.
Journal of Autonomous Agents and Multi-Agent Systems, 1(1):7–38, 1998.
[5] Fred Lerdahl and Ray Jackendoff. A Generative Theory of Tonal Music. MIT Press, 1983.
[6] H. C. Longuet-Higgins. Letter to a musical friend. The Musical Review, 23:244–248, 271–280, 1962.
[7] Marvin Minsky. The society of mind. Simon & Schuster, Inc., 1986.
[8] E. R. Miranda. Emergent sound repertoires in virtual societies. Computer Music Journal (MIT
Press), 26(2):77–90, 2002.
[9] D. Murray-Rust. VirtuaLatin - agent based percussive accompaniment, 2003.
http://www.inf.ed.ac.uk/publications/thesis/online/IM030053.pdf.
[10] F. Pachet. Rhythm as emerging structure. In Proceedings of ICMC, Berlin, 2000. ICMA.
[11] Robert Rowe. Machine Musicianship. MIT Press, 2001.
[12] Rodolfo Daniel Wulfhorst, Lauro Nakayama, and Rosa Maria Vicari. A multiagent approach
for musical interactive systems. In Proceedings of the second international joint conference on
Autonomous agents and multiagent systems, pages 584–591. ACM Press, 2003.

