
Computer Music Analysis 

David Gerhard
School of Computing Science
Simon Fraser University
Burnaby, BC
V5A 1S6
email: dbg@cs.sfu.ca
Simon Fraser University, School of Computing Science
Technical Report CMPT TR 97-13
Keywords: Automatic Music Transcription, Musical Grammars,
Computer Music, Psychoacoustics.
July 21, 1998

Abstract

Computer music analysis is investigated, with specific reference to the current research fields of automatic music transcription, human music perception, pitch determination, note and stream segmentation, score generation, time-frequency analysis techniques, and musical grammars. Human music perception is investigated from two perspectives: the computational model perspective desires an algorithm that perceives the same things that humans do, regardless of how the program accomplishes this, and the physiological model perspective desires an algorithm that models exactly how humans perceive what they perceive.

* This research is partially supported by The Natural Sciences and Engineering Research Council of Canada and by a grant from The BC Advanced Systems Institute.

1 Introduction

A great deal of work in computer music deals with synthesis: using computers to compose and perform music. An interesting example of this type of work is Russell Ovens' thesis, titled "An Object-Oriented Constraint Satisfaction System Applied to Music Composition" [Oven88]. In contrast, this report deals with analysis: using computers to analyze performed and recorded music. The interest in this field comes from the realization that decoding a musical representation into sound is fairly straightforward, but translating an arbitrary sound into a score is a much more difficult task. This is the problem of Automatic Music Transcription (AMT), which has been looked at since the late 1970's. AMT consists of translating an unknown and arbitrary audio signal into a fully notated piece of musical score. A subset of this problem is monophonic music transcription, where a single melody line played on a single instrument
in controlled conditions is translated into a single note sequence, often stored in a MIDI track [1]. Monophonic music transcription was solved in 1986 with the publication of Martin Piszczalski's Ph.D. thesis, which will be discussed in Section 3.0.1 on page 6.

AMT is a subset of Music Perception, a field of research attempting to model the way humans hear music. Psychological studies have shown that many of the processing elements in the human auditory perceptual system do the same thing as processing elements in the human visual perceptual system, and researchers have recently begun to draw from past research in computer vision to develop computer audition. Not every processing element is transferable, however, and the differences between vision and audition are clear. There are millions of each of the four types of vision sensors (L, M, and S cones, and rods), while there are only two auditory sensors, the left and right ears. Some work has been done on using visual processing techniques to aid in audition (see Section 2.1 on page 4 and Section 4.2.2 on page 13), but there is more to be gained from cautiously observing the connections between these two fields.

Many research areas relate to computer music and AMT. Most arise from breaking AMT down into more manageable sub-problems, while some sprouted from other topics and studies. Six areas of current research work will be presented in this report:

- Transcription
  - Music perception
  - Pitch determination
  - Segmentation
  - Score generation
- Related Topics
  - Time-frequency analysis
  - Musical grammars

[1] MIDI will be discussed in Section 6.1 on page 15.

Part I

Transcription

The ultimate goal of much of the computer music analysis research has been the development of a system which would take an audio file as input and produce a full score as output. This is a task that a well-trained human can perform, but not in real time. The person, unless extremely gifted, requires access to the same audio file many times, paying attention to a different instrument each time. A monophonic melody line would perhaps require a single pass, while a full symphony might not be fully transcribable even with repeated auditions. Work presented in [Tang95] suggests that if two instruments of similar timbres play "parallel" musical passages at the same time, these instruments will be inseparable. An example of this is the string section of an orchestra, which is often heard as a single melody line when all the instruments are playing together.

Some researchers have decided to work on a complementary problem, that of extracting errors from performed music [Sche95]. Instead of listening to the music and writing the score, the computer listens to the music, follows the score, and identifies the differences between what the ideal music should be and what the performed music is. These differences could be due to expression in the performance, such as vibrato, or to errors in the performance, such as incorrect notes. A computer system that could do this would be analogous to a novice music student who knows how to read music and can listen to a piece and follow along in the music, but cannot yet transcribe a piece. Such a system gives us a stepping stone to the full problem of transcription.
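The score-following comparison described above can be sketched as a minimal note-by-note alignment between a score and a performance. This is only an illustrative sketch, not the system of [Sche95]; the `(pitch, duration)` tuple representation, the `diff_performance` helper, and the use of MIDI note numbers are all assumptions made here for the example.

```python
# Illustrative sketch: flag differences between a score and a performance.
# Each note is a (MIDI pitch, duration in beats) tuple -- an assumed
# representation, not one used by any system described in this report.

def diff_performance(score, performed):
    """Return (index, score_note, performed_note) for each mismatch."""
    differences = []
    for i, (expected, actual) in enumerate(zip(score, performed)):
        if expected != actual:
            differences.append((i, expected, actual))
    return differences

score = [(60, 1.0), (62, 1.0), (64, 2.0)]      # C4, D4, E4
performed = [(60, 1.0), (63, 1.0), (64, 2.0)]  # D4 played as D#4

print(diff_performance(score, performed))
# -> [(1, (62, 1.0), (63, 1.0))]
```

A real score follower must also cope with timing deviations, extra notes, and omissions, which calls for a genuine alignment technique (such as dynamic time warping) rather than this position-by-position comparison.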
2 Music Perception

There have been two schools of thought concerning automatic music transcription, one revolving around computational models and one revolving around psychological models.

The psychological model researchers take the "clear-box" approach, assuming that the ultimate goal of music transcription research is to figure out how humans hear, perceive and understand music. A working system is not as important as little bits of system that accurately (as close as we can tell) model the human perceptual system. This attitude is valuable because by modeling a working system (the human auditory system) we can develop a better artificial system, and gain philosophical insight into how it is that we hear things.

In contrast, the computational model researchers take the "black-box" approach, assuming that a music transcription system is acceptable as long as it transcribes music. They are more concerned with making a working machine and less concerned with modeling the human perceptual system. This attitude is valuable because in making a working system we can then work backward and ask "How is this like or unlike the way we hear?" The problem is that the use of self-evolving techniques like neural nets and genetic algorithms limits our ability as humans to understand what the computer is doing.

These two fields of research rarely work together and are more often at odds with each other. Each does have valuable insights to gain from the other, and an interdisciplinary approach, using results from both fields, would be more likely to succeed.

Aside: Computational Models. A point of dispute in interdisciplinary research has often been the notion of a computational model for explaining human psychological phenomena. Computer scientists and engineers have been using computational models for many years to explain natural phenomena, but when we start going inside the mind, it hits a little closer to home.

When a scientific theory tries to explain some natural phenomenon, the goal is to be able to predict that phenomenon in the future. If we can set up a model that will predict the way the world works, and we use a computer algorithm to do this prediction, we have a computational model. The argument is that these algorithms do not do the same thing that is happening in the world, even though they predict what is happening in the world. Kinematics models do not take into account quantum effects in their calculations, and since they do not model the way the world really works, they are not valuable, some would say. These models do explain and predict motion accurately within a given domain, and whether or not they model the complete nature of the universe, they do explain and predict natural phenomena.

This is less easy to accept when dealing with our own minds. We want to know the underlying processes that are going on in the brain, so how useful is a theory, even if it is good at predicting the way we work, if we don't know how close it is to our underlying processes? Wouldn't it be better to develop a theory of the mind from the opposite viewpoint and say that a model is valid if it simulates the way we work from the inside, if it concurs with how the majority of people react to a certain stimulus? This is very difficult because psychological testing cannot isolate individual processes in the brain; it can only observe input and output of the brain as a whole, and that is a system we cannot hope to model at present.

The other problem with this is one of pragmatics. Occam's razor says that when theories explain something equally well, the simpler theory is more likely to be correct. A simpler theory is more likely to be easier to program as well, and so simpler theories tend to come out of computational models. Of course, we must be careful that the theories we compare do in fact explain the phenomena equally well, and we must be aware that the simplest
computational model we get is not necessarily the same processing that goes on in the mind.

This brings another advantage of computational models, their testability. Psychological models can be tested, but it is much more difficult to control such experiments, because all systems in the brain are working together at the same time. In computational models, only the processing we are interested in is present, and that bit of processing can be tested with completely repeatable results.

2.1 Auditory Scene Analysis

Albert Bregman's landmark book in 1990 [Breg90] presented a new perspective in human music perception. Until then, much work had been done in the organization of human visual perception, but little had been done on the auditory side of things, and what little there was concentrated on general concepts like loudness and pitch. Bregman realized that there must be processes going on in our brains that determine how we hear sounds, how we differentiate between sounds, and how we use sound to build a "picture" of the world around us. The term he used for this picture is "the auditory scene".

The classic problem in auditory scene analysis is the "cocktail party" situation, where you are in a room with many conversations going on, some louder than the one you are engaged in, and there is background noise such as music, clanking glasses, and pouring drinks. Amid all this cacophony, humans can readily filter out what is unimportant and pay attention to the conversation at hand. Humans can track a single auditory stream, such as a person speaking, through frequency changes and amplitude changes. The noise around may be much louder than your conversation, and still you have little trouble understanding what the other person is saying. For a recent attempt to solve this problem, see [GrBl95].

An analogy that shows just how much processing is done in the auditory system is the lake analogy. Imagine digging two short trenches up from the shore of a lake, and then stretching handkerchiefs across the trenches. The human auditory system is then like determining how many boats are on the lake, what kind of engines are running in them, which direction they are going, which one is closer, if any large objects have been recently thrown in the lake, and almost anything else, merely from observing the motion of the handkerchiefs. When we bring the problem out to our conscious awareness, it seems impossible, and yet we do this all the time every day without thinking about it.

Figure 1: The Face-Vase Illusion. [figure not reproduced]

Bregman shows that there are many phenomena going on in the processing of auditory signals that are similar to those in visual perception. Exclusive allocation indicates that properties belong to only one event. When it is not clear which event that property applies to, the system breaks down and illusions are perceived. The most common visual example of this is the famous "face-vase" illusion, Figure 1, where background and foreground are ambiguous, and it is not clear whether the boundary belongs to the vase or the two faces. This phenomenon occurs in audition as well. In certain circumstances, musical notes can be ambiguous. Depending on what follows a suspended chord, the chord can be perceived as both major and minor, until the ambiguity is removed by resolving the chord.

Apparent motion occurs in audition as it does in vision. When a series of lights are flashed on and off in a particular sequence, it seems like there is a single light traveling
along the line. If the lights are flashed too slowly or they are too far apart, the illusion breaks down, and the individual lights are seen turning on and off. In audition, a similar kind of streaming occurs, in two dimensions. If a series of notes are of a similar frequency, they will tend to stream together, even if there are notes of dissimilar frequencies interspersed. A sequence that goes "Low-High-Low-High..." will be perceived as two streams, one high and one low, in two circumstances. If the tempo is fast enough, the notes seem closer together and clump into streams. Also, if the difference between the "Low" and the "High" frequencies is large enough, the streaming will also occur. If the tempo is slow and the frequencies do not differ by much, however, the sequence will be perceived as one stream going up and down in rapid succession.

There are more examples of the link between vision and audition in Bregman's book, as well as further suggestions for a model of human auditory perception. Most of the book explains experiments and results that reinforce his theories.

2.2 Axiomatization of Music Perception

In 1995, Andranick Tanguiane presented the progress of a model of human music perception that he had been working on for many years [Tang95]. The model attempts to explain human music perception in terms of a small set of well-defined axioms. The axioms that Tanguiane presents are:

Axiom 1 (Logarithmic Pitch) The frequency axis is logarithmically scaled.

Axiom 2 (Insensitivity to the phase of the signal) Only discrete power spectra are considered.

Axiom 3 (Grouping Principle) Data can be grouped with respect to structural identity.

Axiom 4 (Simplicity Principle) Data are represented in the least complex way in the sense of Kolmogorov (least memory stored).

Axiom 1 stems from the well-recognized fact that a note with a frequency twice that of some reference note is an octave higher than the reference note. An octave above "A" at 440 Hz would be "A" at 880 Hz, and then "A" at 1760 Hz and so on. This is a logarithmic scale, and has been observed and documented exhaustively.

Axiom 2 is a bit harder to recognize. Two different time signals will produce the same auditory response if the power frequency spectra are the same. The phase spectrum of a signal indicates where in the cycle each individual sinusoid starts. As long as the sinusoidal components are the same, the audio stimulus will sound the same whether the individual components are in phase or not.

The third axiom is an attempt to describe the fact that humans group audio events in the same gestalt manner as other perceptual constructs. Researchers in musical grammars have identified this, and there is more material on this phenomenon in Section 2.1 on page 4.

The last axiom suggests that while we may hear very complicated rhythms and harmonies in a musical passage, this is done in mental processing, and the mental representation used is that which uses the least memory.

Tanguiane argues that all of these axioms are necessary because without any one of them, it would be impossible for humans to recognize chords as collections of tones. The fact that we are able to perceive complex tones as individual acoustical units, claims Tanguiane, argues for the completeness of the axiom set. This perception is only possible when the musical streams are not parallel.
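Axiom 1's logarithmic scale can be made concrete with a short sketch: on a logarithmic frequency axis, every doubling of frequency covers the same distance, one octave, or 12 equal-tempered semitones. The function name and the choice of A at 440 Hz as the reference are assumptions made here for illustration.

```python
import math

# Illustrative sketch of Axiom 1: perceived pitch distance depends on
# frequency *ratio*, so pitch is logarithmic in frequency.

def semitones_above_a440(freq_hz):
    """Distance in equal-tempered semitones above the reference A at 440 Hz."""
    return 12 * math.log2(freq_hz / 440.0)

# Each doubling of frequency adds exactly 12 semitones (one octave):
print(semitones_above_a440(880.0))   # -> 12.0
print(semitones_above_a440(1760.0))  # -> 24.0
```

The same ratio between any two frequencies always maps to the same semitone distance, which is exactly what a logarithmically scaled frequency axis expresses.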
2.3 Discussion

Research has approached the problem of music perception and transcription from two different directions. In order to fully understand music enough to develop a computational model, we must recognize the psychological processing that is going on. Early work in automatic music transcription was not concerned with the psychological phenomena, but with getting some sort of information from one domain to another. This was a valuable place to start, but in its present state, the study of music perception and transcription has much to gain from the study of the psychology of audition.

3 Beginnings

In the mid 70's, a few brave researchers tried to tackle the whole transcription problem by insisting on a limited domain. Piszczalski and Galler presented a system that tried to transcribe musical sounds to a musical score [PiGa77]. The intent was to take the presented system and develop it toward a full transcription system. In order to make the system functional, they required the input audio to be monophonic, from a flute or a recorder, producing frequency spectra which are easily analyzed.

3.0.1 Piszczalski and Galler

At this early point in the research, Piszczalski and Galler recognized the importance of breaking the problem down into stages. They divided their system into three components, working bottom-up. First, a signal processing component identifies the amplitudes and starting and stopping times of the component frequencies. The second stage takes this information and formulates note hypotheses for the identified intervals. Finally, this information is analyzed in the context of notation to identify beam groups, measures, etc., and a score is produced.

A windowed Fourier transform is used to extract frequency information. Three stages of heuristics are used to identify note starts and stops. The system first recognizes fundamental frequencies below or above the range of hearing as areas of silence. More difficult boundaries are then identified by fluctuations in the lower harmonics and abrupt changes in fundamental frequency amplitude. The second phase averages the frequencies within each perceived "note" to determine the pitch, which is then compared to a base frequency. In the final phase, a new base frequency is determined from fluctuations in the notes, and grouping, beaming and other notational tasks are performed.

Since this early paper, Piszczalski has proposed a computational model of music transcription in his 1986 Ph.D. thesis [Pisz86]. The thesis describes the progress and completion of the system started in 1977. There is also a thorough history of AMT up to 1986. This thesis is a good place to start for those just getting in to computer music research. It is now more than 10 years old and much has been accomplished since then in some areas, but monophonic music transcription remains today more or less where Piszczalski left it in 1986.

3.0.2 Moorer

In 1977, James Moorer presented a paper [Moor77] on work which would turn out to be the first attempt at polyphonic music transcription. Piszczalski's work required the sound input to be a single instrument and a single melody line, while Moorer's new system allowed for two instruments playing together. Restrictions on the input to this system were tighter than in Piszczalski's work: there could only be two instruments, they must both be of a type such that the pitches are voiced (no percussive instruments) and piecewise constant (no vibrato or glissando), and the notes being played together must be such that the fundamental frequencies and harmonics of the notes do not overlap. This means that no note pairs are allowed where the fundamental frequencies are small whole number ratios of each other.

Moorer claims that these restrictions, apart from the last one, are merely for convenience and can easily be removed in a larger system. He recognized that the fundamental frequency restriction is difficult to overcome, as most common musical intervals are on whole number frequency ratios. The other restrictions he identified have been shown to be more difficult to overcome than he first thought. Research into percussive musical transcription has shown that it is sufficiently different from voiced transcription to merit independent work, and glissando and vibrato have proved to be difficult problems. Allowing more than two instruments in Moorer's system would require a very deep restriction on the frequencies.

These restrictions cannot be lifted without a redesign of the system, because notes an octave, or even a major third, apart have fundamental frequencies related by small whole number ratios, and thus cannot be separated by his method.

Moorer uses an autocorrelation function instead of a Fourier transform to determine the period of the signal, which is used to determine the pitch ratio of the two notes being played [2]. The signal is then separated into noise segments and harmonic segments, by assigning a quality measure to each part of the signal. In the harmonic portions, note hypotheses are formed based on fundamental frequencies and their integer multiples. Once hypotheses of the notes are confirmed, the notes are grouped into two melody lines (which do not cross) by first finding areas where the two notes overlap completely, and then filling in the gaps by heuristics and trial and error. Moorer then uses an off-the-shelf program to create a score out of the two melody lines.

[2] Pitch and period are not necessarily interchangeable. For a discussion, see Section 4.1.2 on page 10.

3.1 Breakdown

The early attempts at automatic music transcription have shown that the problem must be limited and restricted to be solved, and the partial solutions cannot easily be expanded to a full transcription system. Many researchers have chosen to break the problem down into more manageable sub-problems. Each researcher has her own ideas as to how the problem should be decomposed, and three prominent ones are presented here.

3.1.1 Piszczalski and Galler

In [PiGa77], discussed above, Piszczalski and Galler propose three components, breaking the process down temporally as well as computationally.

1. (low-level) Determine the fundamental frequency of the signal at every point, as well as where the frequencies start and stop.

2. (intermediate-level) Infer musical notes from frequencies determined in stage 1.

3. (high-level) Add notational information such as key signature and time signature, as well as bar lines and accidentals.

Piszczalski's 1986 thesis proposes a larger and more specific breakdown, in terms of the intermediate representations. In this breakdown, there are eight representations, suggesting seven processing stages. The proposed data representations are, from lowest level to highest level:

- Time waveform: the original continuous series of analog air pressures representing the sound.

- Sampled signal: a series of discrete voltages representing the time waveform at every sample time.

- Digital spectrogram: the sinusoid spectrum of the signal at each time window.

- Partials: the estimation of the frequency position of each peak in the digital spectrogram.

- Pitch candidates: possibilities for the pitch of each time frame, derived from the partials.

- Pitch and amplitude contours: a description of how the pitch and amplitude vary with time.

- Average pitch and note duration: discrete acoustic events which are not yet assigned a particular chromatic note value.

- Note sequence: the final representation.

The first four representations fit into stage 1 of the above breakup, the next three fit into stage 2, and the last one fits into stage 3. In his system, Piszczalski uses pre-programmed notation software to take the note sequence and create a graphical score. The problem of inferring high-level characteristics, such as the key signature, from the note sequence has been researched and will be covered in Section 6.2 on page 16.

3.1.2 Moorer

James Moorer's proposed breakdown is similar to that of Piszczalski and Galler, in that it separates frequency determination (pitch detection) from note determination, but it assigns a separate processing segment to the identification of note boundaries, which Piszczalski and Galler group together with pitch determination. Moorer uses a pre-programmed score generation tool to do the final notation. Indeed, this is a part of automatic music transcription which has been, for the most part, solved since the time of Moorer and his colleagues, and there exist many commercial software programs today that will translate a MIDI signal (essentially a note sequence) into a score. For more on MIDI see Section 6.1 on page 15.

3.1.3 Tanguiane

In 1988, Andranick Tanguiane published a paper on recognition of chords [Tang88]. Before this, chord recognition had been approached from the perspective of polyphonic music: break down the chord into its component notes. Tanguiane didn't believe that humans actually did this dissection for individual chords, and so his work on music recognition concentrated on the subproblems that were parts of music. He did work on chord recognition, separate from rhythm recognition, separate again from melody recognition. The division between the rhythm component and the tone component has been identified in later work on musical grammars, and will be discussed in Section 8 on page 18.

3.2 An Adopted Breakdown

For the purposes of this report, AMT will be broken down into the following sub-categories:

1. Pitch determination: identification of the pitch of the note or notes in a piece of music. Work has been done on instantaneous pitch detection as well as pitch tracking.

2. Segmentation: breaking the music into parts. This includes identification of note boundaries, separation of chords into notes, and dividing the musical information into rhythm and tone information.

3. Score generation: taking the segmented information and producing a score. Depending on how much processing was done in the segmentation section, this
could be as simple as taking fully defined note combinations and sequences and printing out the corresponding score. It could also include identification of key and time signature.

4 Pitch Determination

Pitch determination has been called pitch extraction and fundamental frequency identification, among a variety of other titles; however, pitch and frequency are not exactly the same thing, as will be discussed later. The task is this: given an audio signal, what is the musical pitch associated with the signal at any given time? This problem has been applied to speech recognition as well, since some languages such as Chinese rely on pitch as well as phonemes to convey information. Indeed, spoken English relies somewhat on pitch to convey emotional or insinuated information. A sentence whose pitch increases at the end is interpreted as a question.

In monophonic music, the note being played has a pitch, and that pitch is related to the fundamental frequency of the quasi-periodic signal that is the musical tone. In polyphonic music, there are many pitches acting at once, and so a pitch detector may identify one of those pitches, or a pitch that represents the combination of tones but is not present in any of them separately. While pitch is indispensable information for transcription, more features must be considered when polyphonic music is being transcribed.

Pitch following and spectrographic analysis deal with the continuous time-varying pitch across time. As with instantaneous pitch determination, many varied algorithms exist for pitch tracking. Some of these are modified image processing algorithms, since a time-varying spectrum has three dimensions (frequency, time and amplitude) and thus can be considered an image, with time corresponding to width, frequency corresponding to height, and amplitude corresponding to pixel value.

Pitch determination techniques have been understood for many years, and while improvements to the common algorithms have been made, few new techniques have been identified.

4.1 Instantaneous Pitch Techniques

Detecting the pitch of a signal is not as easy as detecting the period of oscillation. Depending on the instrument, the fundamental frequency may not be the pitch, or the lowest frequency component may not have the highest amplitude.

4.1.1 Period Detectors

Natural music signals are pseudo-periodic, and can be modeled by a strictly periodic signal time-warped by an invertible function [ChSM93]. They repeat, but each cycle is not exactly the same as the previous, and the cycles tend to change in a smooth way over time. It is still meaningful to discuss the period of such a signal, because while each cycle is not the exact duplicate of the previous, they differ only by a small amount (within a musical note), and the distance from one peak to the next can be considered one cycle. The period of a pseudo-periodic signal is how often the signal "repeats" itself in a given time window.

Period detectors seek to estimate exactly how fast a pseudo-periodic signal is repeating itself. The period of the signal is then used to estimate the pitch, through more complex techniques described below.

Fourier Analysis. The "old standard" when discussing the frequency of a signal. A signal is decomposed into component sinusoids, each of a particular frequency and amplitude. If enough sinusoids are used, the signal can be reconstructed within a given error limit. The problem is that the discrete Fourier transform centers sinusoids around a
given base frequency. The exact period of the signal must then be inferred by examining the Fourier components. The common algorithm used to calculate the Fourier spectrum is the fast Fourier transform (FFT).

Chirp Z Transform. A method presented in [Pisz86], it uses a charge-coupled device (CCD) as an input. The output is a frequency spectrum of the incoming signal, and in theory should be identical to the output of a Fourier transform on the same signal. The method is extremely fast when implemented in hardware, but performs much slower than the FFT when simulated in software. The CCD acts as a delay line, creating a variable filter. The filter coefficients in the delay line can be set up to produce a spectrum, or used as a more general filter bank.

Cepstrum Analysis. This technique uses the Fourier transform described above, with another layer of processing. The log magnitude of the Fourier coefficients is taken, and then inverse Fourier-transformed. The result is, in theory, a large peak at the frequency of the original signal. This technique sometimes needs tweaking as well.

Filter Banks. Similar to Fourier analysis, this technique uses small bandpass filters to determine how much of each frequency band is in the signal. By varying the center frequency of the filters, one can accurately determine the frequency that passes the largest component, and therefore the period of the signal. This is the most psychologically faithful model, because the inner ear acts as a bank of filters, providing output to the brain through a number of orthogonal frequency channels.

Autocorrelation. A communication systems technique, this consists of seeing how similar a signal is to itself at each point. The process can be visualized as follows: take a copy of the signal and hold it up to the original. Move it along, and at each point make a measurement of how similar the signals are. There will be a spike at "0", meaning that the signals are exactly the same when there is no time difference, but after a little movement, there should be another spike where one cycle lines up with the previous cycle. The period can be determined by the location of the first spike after 0.

The problem with most of these techniques is that they assume a base frequency, and all higher components are multiples of the first. Thus, if the frequency of the signal does not lie exactly on the frequency of one of the components (for example, on one of the frequency channels in a bank of filters), then the result is a mere approximation, and not an exact value for the period of the signal.

4.1.2 Pitch from Period

A common assumption is that the pitch of a signal is directly given by its period. For simple signals such as sinusoids this is correct, in that the tone we hear is directly related to how fast the sinusoid cycles. In natural music, however, many factors influence the period of the signal apart from the actual pitch of the tone within the signal. Such factors include the instrument being played, reverberation, and background noise. The difference between period and pitch is this: a periodic signal at 440 Hz has a pitch of "A", but a period of about 0.00227 seconds.

A technique proposed in [Pisz86] to extract the pitch from the period consists of formulating hypotheses, scoring them, and selecting the highest-scoring candidate as the fundamental frequency of the note. Candidates are selected by comparing pairs of frequency components to see if they form small whole-number ratios with respect to other frequency components. All pairs of partials are processed in this way, and the result is a measure of pitch strength versus fundamental frequency.

4.1.3 Recent Research

Xavier Rodet has been doing work with music transcription and speech recognition since
before 1987. His fundamental frequency estimation work has been done with Boris Doval, and they have used techniques such as Hidden Markov Models and neural nets. They have worked with frequency tracking as well as estimation.

In [DoRo93], Doval and Rodet propose a system for the estimation of the fundamental frequency of a signal, based on a probabilistic model of pseudo-periodic signals previously proposed by them in [DoRo91]. They consider pitch and fundamental frequency to be two different entities which sometimes hold the same value.

The problem they address is the misnaming of the fundamental frequency by a computer when a human can easily identify it. There are cases where a human observer has difficulty identifying the fundamental frequency of a signal, and in such cases they do not expect the algorithm to perform well. The set of partials that the algorithm observes is estimated by a Fourier transform, and consists of signal partials (making up the pseudo-periodic component) and noise partials (representing room noise or periodic signals not part of the main signal being observed).

Doval and Rodet's probabilistic model consists of a number of random variables including a fundamental frequency, an amplitude envelope, the presence or absence of specific harmonics, the probability density of specific partials, and the number and probability of other partials and noise partials. Partials are first classified as harmonic or not, and then a likelihood for each fundamental frequency is calculated based on the classification of the corresponding partials. The fundamental frequency with maximum likelihood is chosen as the fundamental frequency for that time frame. This paper also presented work on frequency tracking; see Section 4.2 on page 12.

In 1994 Quiros and Enriquez published a work on loose harmonic matching for pitch estimation [QuEn94]. Their paper describes a pitch-to-MIDI converter which searches for evenly spaced harmonics in the spectrum. While the system by itself works well, the authors present a pre-processor and a post-processor to improve performance. The pre-processor minimizes noise and aberrant frequencies, and the post-processor uses fuzzy neural nets to determine the pitch from the fundamental frequency.

The system uses a "center of gravity" type measurement to more accurately determine the location of the spectral peaks. Since the original signal is only pseudo-periodic, an estimation of the spectrum of the signal is used, based on a given candidate frequency. The error between the spectrum estimation and the true spectrum will be minimal where the candidate frequency is most likely to be correct.

Ray Meddis and Lowel O'Mard presented a system for extracting pitch that attempts to do the same thing as the ears do [MeOM95]. Their system observes the auditory input at many frequency bands simultaneously, the same way that the inner ear transforms the sound wave into frequency bands using a filter bank. The information present in each of these bands can then be compared, and the pitch extracted. This is a useful method because it allows auditory events to be segmented in terms of their pitch, using onset characteristics. Two channels whose input begins at the same time are likely to be recognizing the same source, and so information from both channels can be used to identify the pitch.

4.1.4 Multi-Pitch Estimation for Speech

Dan Chazan, Yoram Stettiner and David Malah presented a paper on multi-pitch estimation [ChSM93]. The goal of their work was to segment a signal containing multiple speakers into individuals, using the pitch of each speaker as a hook. They represent the signal using a sum of quasiperiodic signals, with a separate warping function for each quasiperiodic signal, or speaker.

It is unclear if this work can be extended to music recognition, because only the separation of the speakers was the goal. Octave errors were not considered, and the actual pitch of the signal was secondary to the signal separation. Work could be done to augment the separation procedure with a more robust or more accurate pitch estimation algorithm. The idea of a multi-pitch estimator is attractive to researchers in automatic music transcription, as such a system would be able to track and measure the overlapping pitches of polyphonic music.

4.1.5 Discussion

There is work currently being done on pitch detection and frequency detection techniques, but most of this work merely applies new numerical or computational techniques to the original algorithms. No really new ideas seem pending, and the work being done now consists of increasing the speed of the existing algorithms.

If a technique could be found that accurately determines the fundamental frequency without requiring an estimation from the period or the spectrum, it would change this field of research considerably. As it stands, the prevalent techniques are estimators, and require checking a number of candidates for the most likely frequency.

Frequency estimators and pitch detectors work well only on monophonic music. Once a signal has two or more instruments playing at once, determining the pitch from the frequency becomes much more difficult, and monophonic techniques such as spectrum peak detection fail. Stronger techniques such as multi-resolution analysis must be used here, and these topics will be discussed in Section 7 on page 17.

4.2 Pitch Tracking

Determining the instantaneous frequency or pitch of a signal may be a more difficult problem than needs to be solved. No time frame is independent of its neighbors, and for pseudo-periodic signals within a single note, very little change occurs from one time frame to the next. Tracking algorithms use the knowledge acquired in the last frame to help estimate the pitch or frequency in the present frame.

4.2.1 From the Spectrogram

Most of the pitch tracking techniques that are in use or under development today stem from pitch determination techniques, and these use the spectrogram as a basis. Individual time frames are linked together and information is passed from one to the next, creating a pitch contour. Windowing techniques smooth the transition from one frame to the next, and interpolation means that not every time frame needs to be analyzed. Key ("index") frames may be considered, and frames between these key frames need be processed only if changes occur between the key frames. These frames must be close enough together not to miss any rapid changes.

While improvements have been made on this idea (see [DoNa94]), the basic premise remains the same: use the frequency obtained in the last frame as an initial approximation for the frequency in the present frame.

In [DoRo93], presented earlier, a section on fundamental frequency tracking is presented where the authors suggest the use of Hidden Markov Models. Their justification is that their fundamental frequency model is probabilistic. A discrete-time continuous-state HMM is used, with the optimal state sequence being found by the Viterbi algorithm. In their model, a state corresponds to an interval of the histogram. The conclusion they come to is that it is possible to use HMMs on a probabilistic model to track the frequency
across time frames. HMMs are also used in [DeGR93], where partials are tracked instead of the fundamental frequency, and the ultimate goal is sound synthesis. A natural sound is analyzed using Fourier methods, and noise is stripped. The partials are identified and tracked, and a synthetic sound is generated. This application is related not to music transcription but rather to music compression; however, the tracking of sound partials instead of the fundamental frequency could prove a useful tool.

4.2.2 From Image Processing

The time-varying spectrogram can be considered an image, and thus image processing techniques can be applied. This analogy has its roots in psychology, where the similarity between visual and audio processing has been observed in human perception. This is discussed further in Section 2 on page 3.

In the spectrum of a single time frame, the pitch is represented as a spike or a peak for most of the algorithms mentioned above. If the spectra from consecutive frames were lined up, forming the third dimension of time, the result would be a ridge representing the time-varying pitch. Edge-following and ridge-following techniques are common in image processing, and could be applied to the time-varying spectra to track the pitch. The reader is referred to [GoWo92] for a treatment of these algorithms in image processing. During the course of this research, no papers were found indicating the application of these techniques to pitch tracking. This may be a field worthy of exploration.

5 Segmentation

There are two types of segmentation in music transcription. A polyphonic music piece is segmented into parallel pitch streams, and each pitch stream is segmented into sequential acoustic events, or notes. If there are five instruments playing concurrently, then five different notes should be identified for each time frame. For convenience, we will refer to the note-by-note segmentation in time simply as segmentation, and we will refer to instrument melody line segmentation as separation. Thus, we separate the polyphonic music into monophonic melody streams, and then we segment these melody streams into notes.

Separation is the difference between monophonic and polyphonic music transcription. If a reliable separation system existed, then one could simply separate the polyphonic music into monophonic lines and use monophonic techniques. Research has been done on source separation using microphone arrays, identifying the different sources by the delay between microphones; however, it is possible to segment polyphonic sound even when all of the sound comes from one source. This happens when we hear the flute part or the oboe part of a symphony stored on CD and played through a set of speakers. For this reason, microphone array systems will not be presented in this report, though arrays consisting of exactly two microphones could be considered physiologically correct, since the human system is binaural.

5.1 Piszczalski

The note segmentation section in Piszczalski's thesis takes the pitch sequence generated by the previous section as input. Several heuristics are used to determine note boundaries. The system begins with the boundaries easiest to perceive, and if unresolved segments exist, moves on to more computationally complex algorithms.

The first heuristic for note boundaries is silence. This is perceived by the machine as a period of time where the associated amplitude of the pitch falls below a certain threshold. Silence indicates the beginning or ending of a note, depending on whether the pitch amplitude is falling into the silence or rising out of
the silence.

The next heuristic is pitch change. If the perceived pitch changes rapidly from one time frame to the next, it is likely that there is a note boundary there. Piszczalski's system uses a logarithmic scale independent of absolute tuning, with a change of one half of a chromatic step over 50 milliseconds indicating a note boundary.

These two heuristics are assumed to identify the majority of note boundaries. Other algorithms are put in place to prevent inaccurate boundary identifications. Octave pitch jumps are subjected to further scrutiny because they are often the result of instrument harmonics rather than note changes. Other scrutinizing heuristics include the rejection of frequency glitches and amplitude crevices, where the pitch or the amplitude changes sufficiently to register a note boundary but then rapidly changes back to its original level.

The next step is to decide on the pitch and duration for each note. Time frames with the same pitch are grouped together, and a region-growing algorithm is used to pick up any stray time frames containing pitch. Aberrant pitches in these frames are associated with the preceding note and their frequency is ignored. The pitch of the note is then determined by averaging the pitch of all time frames in the note, and the duration is determined by counting the number of time frames in the note and finding the closest appropriate note duration (half, quarter, etc.). Piszczalski's claim is that the system generates less than ten percent false positives or false negatives.

5.2 Smith

In 1994, Leslie Smith presented a paper to the Journal of New Music Research, discussing his work on sound segmentation, inspired by physiological research and auditory scene analysis [Smit94]. The system is not confined to the separation of musical notes, and uses onset and offset filters, searching for the beginnings and endings of sounds. The system works on a single audio stream, which corresponds to monophonic music.

Smith's system is based on a model of the human auditory system, and in particular the cochlea. The implication is that while the system is more difficult to develop, the final goal is less a working system and more an understanding of how the human system works.

The first stage of Smith's system is to filter the sound and acquire the spectra. This closely models the human process: it is known that the cochlea is an organ that converts time waveforms into frequency waveforms.

One might ask why not use a model of the human system as a first stage in any computational pitch perception algorithm. The reason is that the cochlea uses 32 widely spaced frequency channels [Smit94]. The processing necessary to go from 32 channels to an exact pitch is very complicated, more so than the algorithms that approximate a pitch from the hundreds of channels in a Fourier spectrum. Until we know how the brain interprets the information in these channels, pitch extraction might as well use the richer information available in modern spectrographic techniques.

The output of the 32 filter bands is summed to give an approximation of the total signal energy on the auditory nerve, and this signal is used to do the onset/offset filtering. Theories have been stated that human onset/offset perception is based on frequency and amplitude, either excitatory or inhibitory. Smith's simplification is to interpret all the frequencies at once, and even he considers this too simple. It is evident, however, that until we know how the brain interprets the different frequencies to produce a single onset/offset signal, this simplification is acceptable, and instructive.

The onset/offset filters themselves are drawn from image processing, and use a convolution function across the incoming signal. This requires some memory of the signal, but psychological studies have shown that human
audio perception does rely on memory more than on the instantaneous value of the pressure on the eardrum. The beginning of a sound is identified when the output of this convolution rises above a certain threshold, but the end of the sound is more difficult to judge. Sounds that end sharply are easy, but as a sound drifts off, the boundary is less obvious. Smith suggests placing the end of the sound at the next appropriate sound beginning. However, this disagrees with Bregman's theory that a boundary can correspond to exactly one sound, and is ambiguous if applied to more than one sound.

Smith's work is intended to model the human perceptual system and to be useful on any sound. He mentions music often, because it is a sound that is commonly separated into events (notes), but the work is not directly applicable to monophonic music segmentation yet. Further study on the frequency-dependent nature of the onset/offset filters of the human could lead to much more accurate segmentation procedures, as well as a deeper understanding of our own perceptual systems.

5.3 Neural Oscillators

A number of papers have recently been presented using neural nets for segmentation, specifically [Wang95], [NGIO95], and [BrCo95], as well as others in that volume. The neural net model commonly used for this task is the neural oscillator. The hypothesis is that neural oscillators are one of the structures in the brain that help us pay attention to only one stream of audition when there is much auditory noise going on. The model is built from single oscillators, each consisting of a feedback loop between an excitatory neuron and an inhibitory neuron. The oscillator output quickly alternates between high values and low values.

The inputs to the oscillator network are the frequency channels that are employed within the ear. Delay is introduced within the lines connecting the inputs to the oscillator network, and throughout the network itself.

When an example stream of tones "High-Low-High-Low..." is presented to the network, the high tones trigger one set of frequency channels, and the low tones trigger another set of channels. If the tones are temporally close enough together, the oscillators do not have time to relax back to their original state from the previous high input and are triggered again, thus following the auditory stream. If the time between pulses is long enough, then the oscillators relax from the high tone and are excited by the low tone, making the stream seem to oscillate between high and low tones.

6 Score Generation

Once the pitch sequence has been determined and the note boundaries established, it seems an easy task to place those notes on a staff and be finished. Many commercial software programs exist to translate a note sequence, usually a MIDI file, into a musical score, but a significant problem which is still not completely solved is determining the key signature, time signature, measure boundaries, accidentals and dynamics that make a musical score complete.

6.1 MIDI

MIDI was first made commercially available in 1983 and since then has become a standard for transcribing music. The MIDI protocol was developed in response to the large number of independent interfaces that keyboard and electrical instrument manufacturers were coming up with. As the saying goes, "The nice thing about standards is that there are so many to choose from." In order to reduce the number of interfaces in the industry, MIDI, yet another standard, was introduced. It did, however, become widely accepted, and while most
keyboards still use their own internal interface protocol, they often have MIDI as an external interface option as well.

MIDI stands for Musical Instrument Digital Interface, and is both an information transfer protocol and a hardware specification. The communications protocol of the MIDI system represents each musical note transition as a message. Messages for note beginnings and note endings are used, and other messages include instrument changes, voice changes and other administrative messages. Messages are passed from a controlling sequencer to MIDI instruments or sound modules over serial asynchronous cables. Polyphonic music is represented by a number of overlapping monophonic tracks, each with its own voice.

Many developments have been added to MIDI 1.0 since 1983, and the reader is referred to [Rums94] for a more complete treatment.

6.1.1 MIDI and Transposition

Many researchers have realized the importance of a standard note representation. If a system can be developed that will translate from sound to MIDI, then anything MIDI-esque can then be done. The MIDI code can be played through any MIDI-capable keyboard, and it can be transposed, edited and displayed. The fact that MIDI is based on note onsets and offsets suggests to some researchers that transcription research should concentrate on the beginnings and endings of notes. Between the note boundaries, within the notes themselves, very little of interest is happening, and nothing is happening that a transcription system is trying to preserve, unless dynamics are of concern.

What MIDI doesn't store is information about the key signature, the time signature, measure placement and other information that is on a musical score. This information is easily inferred by an educated human listener, but computers still have problems. James Moorer's initial two-part transcription system assumed a C major key signature, and placed accidentals wherever notes were off of the C major scale. Most score generation systems today do just that: they assume a user-defined key and time signature, and place notes on the score according to these definitions.

The importance of correctly representing the key and time signatures is shown in [Long94], where two transcriptions of the same piece are presented side by side. An experienced musician has no problem reading the correct score, but has difficulty recognizing the incorrect one, because the melody is disguised by misrepresenting its rhythm and tonality.

6.2 Key Signature

The key signature, appearing at the beginning of a piece of music, indicates which notes will be flat or sharp throughout the piece. Once the notes are identified, one can identify which notes are constantly sharp or flat, and then assign these to the key signature. Key changes in the middle of the piece are difficult for a computer to judge because most algorithms look at the piece as a whole. Localized statistics could solve this problem, but current systems are still not completely accurate.

6.3 Time Signature

Deciding where bar lines go in a piece and how many beats are in each bar is a much more difficult problem. There are an infinite number of ways of representing the rhythmical structure of a single piece of music. An example given in [Long94] suggests that a sequence of 6 evenly spaced notes could be interpreted as a single bar of 6/8 time, three pairs, two triplets, a full bar followed by a half bar of 4/4 time, or even between beats. Longuet-Higgins suggests the use of musical grammars to solve this problem, which will be described in Section 8 on page 18.
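The key-finding procedure described in Section 6.2 (identify which notes are consistently sharp or flat, and assign these to the key signature) can be sketched in a few lines. The sketch below is illustrative only; it is not taken from any system cited in this report, and the function names are invented. Each candidate major key is scored by how many note occurrences fall on its scale, and the best-scoring tonic is reported.

```python
# Hypothetical sketch of whole-piece key guessing: count pitch classes,
# then score each candidate major key by how much of the piece lies
# inside that key's scale. Not code from any cited system.
from collections import Counter

MAJOR_SCALE = {0, 2, 4, 5, 7, 9, 11}     # semitone offsets of a major scale
NAMES = ["C", "C#", "D", "D#", "E", "F",
         "F#", "G", "G#", "A", "A#", "B"]

def guess_major_key(midi_notes):
    """Return the major key whose scale covers the most note occurrences."""
    counts = Counter(n % 12 for n in midi_notes)
    def in_key(tonic):
        # total occurrences whose pitch class lies on this tonic's scale
        return sum(c for pc, c in counts.items()
                   if (pc - tonic) % 12 in MAJOR_SCALE)
    best = max(range(12), key=in_key)
    return NAMES[best]

# A melody using only the notes of G major (one sharp, F#):
print(guess_major_key([67, 69, 71, 72, 74, 76, 78, 79]))  # G
```

As noted in Section 6.2, such a whole-piece tally cannot detect mid-piece key changes; a real system would also have to separate relative major and minor keys, which share the same accidentals.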
Part II
Related Topics

7 Time-frequency Analysis

Most of the pitch detection and pitch tracking techniques discussed in Section 4 rely on methods of frequency analysis that have been around for a long time. Fourier techniques, pitch detectors and cepstrum analysis, for example, all look at frequency as one scale, separate from time. A frequency spectrum is valid for the full time frame being considered, and if the windowing is not done well, spectral information "leaks" into the neighboring frames. The only way to get completely accurate spectral information is to take the Fourier transform (or your favorite spectral method) of the entire signal, and then all local information about the time signal is lost. Similarly, when looking at the time waveform, one is aware of exactly what is happening at each instant, but no information is available about the frequency components.

An uncertainty principle is at work here. The more one knows about the frequency of a signal, the less that frequency can be localized in time. The options so far have been complete frequency or complete time, using the entire signal or some small window of the signal. Is it possible to look at frequency and time together? Investigating frequency components at a more localized time, without the need for windowing, would increase the accuracy of the spectral methods and allow more specific processing.

7.1 Wavelets

The Fourier representation of a signal, and in fact any spectrum-type representation, uses sinusoids to break down the signal. This is why spectral representations are limited to the frequency domain and cannot be localized in time: the sinusoids used to break down the signal are valid across the entire time spectrum. If the base functions were localized in time, the resulting decomposition would contain both time information and frequency information.

The wavelet is a signal that is localized in both time and frequency. Because of the uncertainty-type relation that holds between time and frequency, the localization cannot be absolute, but in both the time domain and the frequency domain, a wavelet decays to zero above and below the center time/frequency. For a mathematical treatment of wavelets and the wavelet transform, the reader is referred to [Daub90] and [Daub92].

The wavelet transform consists of decomposing the signal into a sum of wavelets of different scales. It has three dimensions: location in time of the wavelet, scale of the wavelet (location in frequency) and amplitude. The wavelet transform allows a time-frequency representation of the signal being decomposed, which means that information about the time location is available without windowing. Another way to look at it is that windowing is built in to the algorithm.

Researchers have speculated that wavelets could be designed to resemble musical notes. Notes have a specific frequency and a specific location in time, as well as an amplitude envelope that characterizes the wavelet. If a system could be developed to model musical notes as wavelets, then a wavelet transform would be a transcription of the musical piece. A musical score is a time-frequency representation of the music: time is represented by the forward progression through the score from left to right, and frequency is represented by the location of the note on the score.

Mladen Wickerhauser contributed an article about audio signal compression using wavelets [Wick92] in a wavelet applications book. This work does not deal directly with music applications; however, it does include a treatment of the mathematics involved. Transcription of music can be considered lossy compression, in that
the musical score representation can be used to construct an audio signal that is a recognizable approximation of the original audio file (i.e. without interpretation or errors generated during performance). The wavelet transform has also been applied as a pre-processor for sound systems, to clean up the sound and remove noise from the signal [SoWK95].

7.2 Pielemeier and Wakefield

William Pielemeier and Greg Wakefield presented a work in 1996 [PiWa96] discussing a high-resolution time-frequency representation. They argue that windowed Fourier transforms, while producing reliable estimates of frequency, are often less than what is required for musical analysis. Calculation of the attack of a note requires very accurate and short-time information about the waveform, and this information is lost when a windowed Fourier transform produces averaged information for each window. They present a system called the Modal distribution, which they show to decrease the time averaging caused by windowing. For a full treatment, please see [PiWa96].

8 Musical Grammars

It has been theorized that music is a natural language like any other, and that the set of rules that describe it fits somewhere in the Chomsky hierarchy of grammars. The questions are: where in the hierarchy does it fit, and what does the grammar look like? Is a grammar for 12-semitone, octaval western music different from a grammar for pentatonic Oriental music, or decametric East-Indian music? Within western music, are there different grammars for classical and modern music? Top 40 and Western? Can an opera be translated to a ballad as easily (or with as much difficulty) as German can be translated to French?

8.1 Lerdahl and Jackendoff

In 1983, Fred Lerdahl, a composer, and Ray Jackendoff, a linguist, published a book that was the result of work aimed at a challenge issued in 1973. The book is called "A Generative Theory of Tonal Music", and the challenge was one presented by Leonard Bernstein, who advocated the search for a "musical grammar" after being inspired by Chomskian-type grammars for natural language. Several other authors responded to the challenge, including Irving Singer and David Epstein, who formed a faculty seminar on Music, Linguistics and Ethics at MIT in 1974.

The book begins by presenting a detailed introduction to the concept of musical grammar, from the point of view of linguistics and the artistic interpretation of music. Rhythmic grouping is discussed in the first few chapters, and tonal grammar is discussed in the last few chapters. The intent is not to present a complete grammar of all western music, but to suggest a thorough starting point for further investigations. The differences between a linguistic grammar and a musical grammar are presented in detail, and an interesting point is made that a musical grammar can have grammatical rules and preferential rules, where a number of grammatically correct structures are ranked in preference. The difference between a masterpiece and an uninteresting etude is the adherence to preferential rules: both pieces are "grammatically" correct, but one somehow sounds better.

8.1.1 Rhythm

In terms of rhythmic structure, Lerdahl and Jackendoff begin by discussing the concept of a grouping hierarchy. It seems that music is grouped into motives, themes, phrases, periods and the like, each being bigger than and encompassing one or more of the previous groups. So a period can consist of a number of complete phrases, each being composed of
a number of complete themes and so on. This kind of grouping is more psychologically correct than sorting the piece by repetition and similarity. While one can identify similar passages in a piece quite easily, the natural grouping is hierarchical.

Where accents fall in a piece is another important observation which aids in the perception of the musical intent. Accents tend to oscillate in a self-similar strong-weak-strong-weak pattern. There is also a definite connection between the accents and the groups: larger groups encompassing many subgroups begin with very strong accents, and smaller groups begin with smaller accents.

The full set of well-formedness rules and preferential rules for the rhythm of a musical passage is presented in Appendix A. There are two sets of well-formedness and preferential rules for the rhythm of a piece: grouping rules and metrical structure rules.

8.1.2 Reductions

The rules quoted in Appendix A provide structure for the rhythm of a musical passage, presented here to provide the flavor of the grammar developed by Lerdahl and Jackendoff. To parse the tonality of a passage, the concept of reduction is needed. Much discussion and motivation stems from the reduction hypothesis, presented in [LeJa83] as:

    The listener attempts to organize all the pitch-events into a single coherent structure, such that they are heard in a hierarchy of relative importance.

A set of rules for well-formedness and preference are presented for various aspects of reduction, including time-span reduction and prolongational reduction, but in addition to the rules, a tree structure is used to investigate the reduction of a piece. Analogies can be drawn to the tree structures used in the analysis of grammatical sentences; however, they are not the same. The intermediate forms at different levels of the trees, if translated, would form grammatically correct musical pieces. The aim of reduction is to bit by bit strip away the flourishes and transitions that make a piece interesting until a single pitch-event remains. This single fundamental pitch-event is usually the first or last event in the group, but this is not necessary for the reduction to be valid. As with linguistic reductions, the goal is first to ensure that a sentence (or passage) is grammatically correct, but more importantly, to discover the linguistic (or musical) properties associated with each word (or pitch-event). Who is the subject and who is the object of "Talk"? Is this particular C chord being used as a suspension or a resolution? It is these questions that a grammar of music tries to answer, rather than "Is this piece of music grammatically correct in our system?"

8.2 Longuet-Higgins

In a paper discussing Artificial Intelligence [Long94], Christopher Longuet-Higgins presented a generative grammar for metrical rhythms. His work convinces us that there is a close link between musical grammars and the transcription of music. If one has a grammar of music and one knows that the piece being transcribed is within the musical genre that this grammar describes, then rules can be used to resolve ambiguities in the transcription, just as grammar rules are used to resolve ambiguities in speech recognition. He calls his grammar rules "realization rules"; they are reproduced here for a 4/4-type rhythm.

    1/2-unit → 1/2-note or 1/2-rest or 2 × (1/4-units)
    1/4-unit → 1/4-note or 1/4-rest or 2 × (1/8-units)
    1/8-unit → 1/8-note or 1/8-rest or 2 × (1/16-units)
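These productions, together with the terminal 1/16-unit rule, have a directly recursive reading: a unit is either a single note or rest of its own length, or two consecutive units of half that length. The sketch below is a minimal illustration of that reading, not Longuet-Higgins' own program; the encoding of durations as fractions of a whole note is our assumption.

```python
# Minimal sketch (our own illustration) of the realization rules:
# a unit of length d is realized either by a single note or rest of
# length d, or by two consecutive units of length d/2, with
# subdivision stopping at the 1/16 level.

SMALLEST = 1 / 16

def realizes(durations, unit=1 / 2):
    """True if the sequence of note/rest durations realizes one unit."""
    if len(durations) == 1:
        return durations[0] == unit      # d-unit -> d-note or d-rest
    if unit <= SMALLEST:
        return False                     # no subdivision below 1/16
    # d-unit -> 2 x (d/2-units): split where the first half is exactly filled
    for i in range(1, len(durations)):
        if sum(durations[:i]) == unit / 2:
            return (realizes(durations[:i], unit / 2)
                    and realizes(durations[i:], unit / 2))
    return False

print(realizes([1/4, 1/8, 1/8]))   # a quarter plus two eighths fill a half: True
print(realizes([1/4, 1/8, 1/16]))  # underfills the half unit: False
```

As the text notes, 3/4-type rhythms, dotted notes and other anomalies would need further productions, for example a rule splitting a unit into three.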
    1/16-unit → 1/16-note or 1/16-rest

Different rules would be needed for 3/4-type rhythms, for example, and to allow for dotted notes and other anomalies. The general idea, however, of repeatedly breaking down the rhythm into smaller segments until an individual note or rest is encountered is insightful and simple.

He also discussed tonality, but does not present a generative theory of tonality to replace or augment Lerdahl and Jackendoff's. Discussions are made instead about resolution of musical ambiguity even when the notes are known. He compares the resolution of a chord sequence to the resolution of a Necker cube, which is a two-dimensional image that looks three-dimensional, as seen in Figure 2. It is difficult for an observer to be sure which side of the cube is facing out, just as it is difficult to be sure of the nature of a chord without a tonal context. He insists that a tonal grammar is essential for resolving ambiguities.

Figure 2: A Necker Cube.

8.3 The Well-Tempered Computer

In 1994, at the same conference where [Long94] was presented, Mark Steedman presented a paper with insights into the psychological method by which people listen to and understand music, and these insights move toward a computational model of human music perception [Stee94]. A "musical pitch space" is presented, adapted from an earlier work by Longuet-Higgins.

Later in the paper Steedman presents a section entitled "Towards a grammar of melodic tonality", where he draws on the work of Lerdahl and Jackendoff. Improvements are made that simplify the general theory. In some cases, claims Steedman, repeated notes can be considered as a single note of the cumulative duration, with the same psychological effect and the same "grammatical" rules holding. In a similar case, scale progressions can be treated as single-note jumps. The phrase he uses is a rather non-committal "seems more or less equivalent to". A good example of the difficulties involved in this concept is the fourth movement of Beethoven's Choral symphony, Number 9 [Beet25]. In this piece, there is a passage which contains what a listener would expect to be two quarter notes, and in fact this is often heard, but the score has a single half note. Similar examples can be made with chromatic runs in the same movement. Part of the reason that the two quarter notes are expected and often heard is that the theme is presented elsewhere in the piece, with two quarter notes instead of the half note.

This is the way that the passage is written:

    [musical example: the notated passage, containing a single half note]

but when played, it can sound like this:

    [musical example: the passage as it can be heard, with two quarter notes]

or like this:

    [musical example: another possible hearing of the passage]

depending on the interpretation of the listener. Since different versions of the score can be heard in the same passage, it is tempting to say that the two versions are grammatically equivalent, and that the brain cannot tell one from the other. However, it is more accurate to say that the brain is being fooled by the instrumentalist. We are not confused, saying "I don't know if it is one way or the other"; we are
sure that it is one way and not the other, but we are not in agreement with our colleagues as to which way it is.

The substitutions suggested by Steedman can be made in some cases, but not in all cases, and the psychological similarity does not seem to be universal. It is important to discover how reliable these substitutions are before inserting them into a general theory of musical tonality.

8.4 Discussions

There seems to be such a similarity between music understanding and language understanding that solutions in one field can be used analogically to solve problems in the other field. What needs to be investigated is exactly how close these two fields really are. The similarities are numerous. The human mind receives information in the form of air pressure on the eardrums, and converts that information into something intelligible, be it a linguistic sentence or a musical phrase. Humans are capable of taking a written representation (text or score) and converting it into the appropriate sound waves. There are rules that language follows, and there seem to be rules that music follows; it is clear that music can be unintelligible to us, the best example being a random jumble of notes.

Identification of music that sounds good and music that doesn't is learned through example, just as language is. A human brought up in the western tradition of music is likely not to understand why a particular East Indian piece is especially heart-wrenching. On the other hand, music does not convey semantics. There is no rational meaning presented with music as there is with language. There are no nouns or verbs, no subject or object in music. There is, however, emotional meaning that is conveyed. There are specific rules that music follows, and there is an internal mental representation that a listener compares any new piece of music to.

9 Conclusions

Since Piszczalski's landmark work in 1986, many aspects of Computer Music Analysis have changed considerably, while little progress has been made in other areas. Transcription of monophonic music was solved before 1986, and since then improvements to algorithms have been made, but no really revolutionary leaps. Jimmy Kapadia's M.Sc. Thesis defended in 1995 had little more than Piszczalski's ideas from 1986 in it [Kapa95]. Hidden Markov Models have been applied to pitch tracking, new methods in pitch perception have been implemented, and cognition and perception have been applied to the task of score generation. Polyphonic music recognition, however, remains a daunting task to researchers. Small subproblems have been solved, and insight gained from computer vision and auditory scene analysis, but the problem remains open.

Much work remains to be done in the field of key signature and time signature recognition, and a connection needs to be drawn between the independent research in musical grammars and music transcription, before a complete working system that even begins to model the human system is created.

The research needed to solve the music understanding problem seems to be distributed throughout many other areas. Linguistics, psychoacoustics and image processing have much to teach us about music. Perhaps there are more areas of research that are also worth investigating.

10 References

[Beet25] Beethoven, Ludwig Van. Symphony No. 9 D Minor, Op. 125, Edition Eulenburg. London: Ernst Eulenburg Ltd., 1925. First Performed: Vienna, 1824.

[BrCo94] Brown, Guy J. and Cooke, Martin. "Perceptual Grouping of Musical Sounds:
A Computational Model." J. New Music Research, Vol. 23, 1994, pp 107-132.

[BrCo95] Brown, Guy J. and Cooke, Martin. "Temporal Synchronization in a Neural Oscillator Model of Primitive Auditory Scene Analysis." IJCAI Workshop on Computational Auditory Scene Analysis, Montreal, Quebec, August 1995, pp 40-47.

[BrPu92] Brown, Judith C. and Puckette, Miller S. "Fundamental Frequency Tracking in the Log Frequency Domain Based on Pattern Recognition." J. Acoust. Soc. Am., Vol. 92, No. 4, October 1992, p 2428.

[Breg90] Bregman, Albert S. Auditory Scene Analysis. Cambridge: MIT Press, 1990.

[ChSM93] Chazan, Dan, Stettiner, Yoram and Malah, David. "Optimal Multi-Pitch Estimation Using the EM Algorithm for Co-Channel Speech Separation." IEEE-ICASSP 1993, Vol. II, pp 728-731.

[Daub90] Daubechies, Ingrid. "The Wavelet Transform, Time-Frequency Localization and Signal Analysis." IEEE Trans. Information Theory, Vol. 36, No. 5, 1990, pp 961-1005.

[Daub92] Daubechies, Ingrid. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, 1992.

[DeGR93] Depalle, Ph., García, G. and Rodet, X. "Tracking of Partials for Additive Sound Synthesis Using Hidden Markov Models." IEEE-ICASSP 1993, Vol. I, pp 225-228.

[DoNa94] Dorken, Erkan and Nawab, S. Hamid. "Improved Musical Pitch Tracking Using Principal Decomposition Analysis." IEEE-ICASSP 1994, Vol. II, pp 217-220.

[DoRo91] Doval, Boris and Rodet, Xavier. "Estimation of Fundamental Frequency of Musical Sound Signals." IEEE-ICASSP 1991, pp 3657-3660.

[DoRo93] Doval, Boris and Rodet, Xavier. "Fundamental Frequency Estimation and Tracking Using Maximum Likelihood Harmonic Matching and HMMs." IEEE-ICASSP 1993, Vol. I, pp 221-224.

[GoWo92] Gonzales, R. and Woods, R. Digital Image Processing. Don Mills: Addison Wesley, 1992.

[GrBl95] Grabke, Jörn and Blauert, Jens. "Cocktail-Party Processors Based on Binaural Models." IJCAI Workshop on Computational Auditory Scene Analysis, Montreal, Quebec, August 1995, pp 105-110.

[KaHe95] Kapadia, Jimmy H. and Hemdal, John F. "Automatic Recognition of Musical Notes." J. Acoust. Soc. Am., Vol. 98, No. 5, November 1995, p 2957.

[Kapa95] Kapadia, Jimmy H. Automatic Recognition of Musical Notes. M.Sc. Thesis, University of Toledo, August 1995.

[Kata96] Katayose, Haruhiro. "Automatic Music Transcription." Denshi Joho Tsushin Gakkai Shi, Vol. 79, No. 3, 1996, pp 287-289.

[Kuhn90] Kuhn, William B. "A Real-Time Pitch Recognition Algorithm for Music Applications." Computer Music Journal, Vol. 14, No. 3, Fall 1990, pp 60-71.

[Lane90] Lane, John E. "Pitch Detection Using a Tunable IIR Filter." Computer Music Journal, Vol. 14, No. 3, Fall 1990, pp 46-57.

[LeJa83] Lerdahl, Fred and Jackendoff, Ray. A Generative Theory of Tonal Music. Cambridge: MIT Press, 1983.
[Long94] Longuet-Higgins, H. Christopher. "Artificial Intelligence and Music Cognition." Phil. Trans. R. Soc. Lond. A, Vol. 343, 1994, pp 103-113.

[Mcge89] McGee, W. F. "Real-Time Acoustic Analysis of Polyphonic Music." Proceedings ICMC 1989, pp 199-202.

[MeCh91] Meillier, Jean-Louis and Chaigne, Antoine. "AR Modeling of Musical Transients." IEEE-ICASSP 1991, pp 3649-3652.

[MeOM95] Meddis, Ray and O'Mard, Lowel. "Psychophysically Faithful Methods for Extracting Pitch." IJCAI Workshop on Computational Auditory Scene Analysis, Montreal, Quebec, August 1995, pp 19-25.

[Moor77] Moorer, James A. "On the Transcription of Musical Sound by Computer." Computer Music Journal, November 1977, pp 32-38.

[Moor84] Moorer, James A. "Algorithm Design for Real-Time Audio Signal Processing." IEEE-ICASSP 1984, pp 12.B.3.1-12.B.3.4.

[NGIO95] Nakatani, T., Goto, M., Ito, T. and Okuno, H. "Multi-Agent Based Binaural Sound Stream Segregation." IJCAI Workshop on Computational Auditory Scene Analysis, Montreal, Quebec, August 1995, pp 84-90.

[Oven88] Ovans, Russell. An Object-Oriented Constraint Satisfaction System Applied to Music Composition. M.Sc. Thesis, Simon Fraser University, 1988.

[PiGa79] Piszczalski, Martin and Galler, Bernard. "Predicting Musical Pitch from Component Frequency Ratios." J. Acoust. Soc. Am., Vol. 66, No. 3, September 1979, pp 710-720.

[Pisz77] Piszczalski, Martin. "Automatic Music Transcription." Computer Music Journal, November 1977, pp 24-31.

[Pisz86] Piszczalski, Martin. A Computational Model for Music Transcription. Ph.D. Thesis, University of Stanford, 1986.

[PiWa96] Pielemeier, William J. and Wakefield, Gregory H. "A High-Resolution Time-Frequency Representation for Musical Instrument Signals." J. Acoust. Soc. Am., Vol. 99, No. 4, April 1996, pp 2383-2396.

[QuEn94] Quiros, Francisco J. and Enríquez, Pablo F-C. "Real-Time, Loose-Harmonic Matching Fundamental Frequency Estimation for Musical Signals." IEEE-ICASSP 1994, Vol. II, pp 221-224.

[Rich90] Richard, Dominique M. "Gödel Tune: Formal Models in Music Recognition Systems." Proceedings ICMC 1990, pp 338-340.

[Road85] Roads, Curtis. "Research in Music and Artificial Intelligence." ACM Computing Surveys, Vol. 17, No. 5, June 1985.

[Rowe93] Rowe, Robert. Interactive Music Systems. Cambridge: MIT Press, 1993.

[Rums94] Rumsey, Francis. MIDI Systems and Control. Toronto: Focal Press, 1994.

[SaJe89] Sano, Hajime and Jenkins, B. Keith. "A Neural Network Model for Pitch Perception." Computer Music Journal, Vol. 13, No. 3, Fall 1989, pp 41-48.

[Sche95] Scheirer, Eric. "Using Musical Knowledge to Extract Expressive Performance Information from Audio Recordings." IJCAI Workshop on Computational Auditory Scene Analysis, Montreal, Quebec, August 1995, pp 153-160.
[Smit94] Smith, Leslie S. "Sound Segmentation Using Onsets and Offsets." J. New Music Research, Vol. 23, 1994, pp 11-23.

[SoWK95] Solbach, L., Wöhrmann, R. and Kliewer, J. "The Complex-Valued Wavelet Transform as a Pre-processor for Auditory Scene Analysis." IJCAI Workshop on Computational Auditory Scene Analysis, Montreal, Quebec, August 1995, pp 118-124.

[Stee94] Steedman, Mark. "The Well-Tempered Computer." Phil. Trans. R. Soc. Lond. A, Vol. 343, 1994, pp 115-131.

[Tang88] Tanguiane, Andranick. "An Algorithm For Recognition of Chords." Proceedings ICMC 1988, pp 199-210.

[Tang93] Tanguiane, Andranick. Artificial Perception and Music Recognition. Lecture Notes in Artificial Intelligence 746, Germany: Springer-Verlag, 1993.

[Tang95] Tanguiane, Andranick. "Towards Axiomatization of Music Perception." J. New Music Research, Vol. 24, 1995, pp 247-281.

[Wang95] Wang, DeLiang. "Stream Segregation Based on Oscillatory Correlation." IJCAI Workshop on Computational Auditory Scene Analysis, Montreal, Quebec, August 1995, pp 32-39.

[Wick92] Wickerhauser, Malden V. "Acoustic Signal Compression with Wavelet Packets." in Wavelets - A Tutorial in Theory and Applications. C. K. Chui (ed.), Academic Press, 1992, pp 679-700.

Part III
Appendices

A Musical Grammar Rules from [LeJa83]

Grouping Well-Formedness Rules:

GWFR1: Any contiguous sequence of pitch-events, drum beats, or the like can constitute a group, and only contiguous sequences can constitute a group.

GWFR2: A piece constitutes a group.

GWFR3: A group may contain smaller groups.

GWFR4: If a group G1 contains part of a group G2, it must contain all of G2.

GWFR5: If a group G1 contains a smaller group G2, then G1 must be exhaustively partitioned into smaller groups.

Grouping Preference Rules:

GPR1: Avoid analyses with very small groups - the smaller, the less preferable.

GPR2 (Proximity): Consider a sequence of four notes n1 n2 n3 n4. All else being equal, the transition n2-n3 may be heard as a group boundary if
a. (Slur/Rest) the interval of time from the end of n2 to the beginning of n3 is greater than that from the end of n1 to the beginning of n2 and that from the end of n3 to the beginning of n4, or if
b. (Attack-Point) the interval of time between the attack points of n2 and n3 is greater than that between the attack points of n1 and n2 and that between the attack points of n3 and n4.
GPR3 (Change): Consider a sequence of four notes n1 n2 n3 n4. All else being equal, the transition n2-n3 may be heard as a group boundary if
a. (Register): the transition n2-n3 involves a greater intervallic distance than both n1-n2 and n3-n4, or if
b. (Dynamics): the transition n2-n3 involves a change in dynamics and n1-n2 and n3-n4 do not, or if
c. (Articulation): the transition n2-n3 involves a change in articulation and n1-n2 and n3-n4 do not, or if
d. (Length): n2 and n3 are of different lengths and both pairs n1-n2 and n3-n4 do not differ in length.
(One might add further cases to deal with such things as change in timbre or instrumentation.)

GPR4 (Intensification): Where the effects picked out by GPRs 2 and 3 are relatively more pronounced, a larger-level boundary may be placed.

GPR5 (Symmetry): Prefer grouping analyses that most closely approach the ideal subdivision of groups into two parts of equal length.

GPR6 (Parallelism): Where two or more segments of the music can be construed as parallel, they preferably form parallel parts of groups.

GPR7 (Time-Span and Prolongational Stability): Prefer a grouping structure that results in more stable time-span and/or prolongational reductions.

Metrical Well-Formedness Rules:

MWFR1: Every attack point must be associated with a beat at the smallest metrical level at that point in the piece.

MWFR2: Every beat at a given level must also be a beat at all smaller levels present at that point in the piece.

MWFR3: At each metrical level, strong beats are spaced either two or three beats apart.

MWFR4: The tactus and immediately larger metrical levels must consist of beats equally spaced throughout the piece. At subtactus metrical levels, weak beats must be equally spaced between the surrounding strong beats.

Metrical Preference Rules:

MPR1 (Parallelism): Where two or more groups can be construed as parallel, they preferably receive parallel metrical structure.

MPR2 (Strong Beat Early): Weakly prefer a metrical structure in which the strongest beat in a group appears early in the group.

MPR3 (Event): Prefer a metrical structure in which beats of level Li that coincide with the inception of pitch-events are strong beats of Li.

MPR4 (Stress): Prefer a metrical structure in which beats of level Li that are stressed are strong beats of Li.

MPR5 (Length): Prefer a metrical structure in which a relatively strong beat occurs at the inception of either
a. a relatively long pitch-event,
b. a relatively long duration of a dynamic,
c. a relatively long slur,
d. a relatively long pattern of articulation,
e. a relatively long duration of a pitch in the relevant levels of the time-span reduction, or
f. a relatively long duration of a harmony in the relevant levels of the time-span reduction (harmonic rhythm).
MPR6 (Bass): Prefer a metrically stable bass.

MPR7 (Cadence): Strongly prefer a metrical structure in which cadences are metrically stable; that is, strongly avoid violations of local preference rules within cadences.

MPR8 (Suspension): Strongly prefer a metrical structure in which a suspension is on a stronger beat than its resolution.

MPR9 (Time-Span Interaction): Prefer a metrical analysis that minimizes conflict in the time-span reduction.

MPR10 (Binary Regularity): Prefer metrical structures in which at each level every other beat is strong.
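Several of the rules above are concrete enough to state directly in code. The sketch below is our own hypothetical encoding, not from [LeJa83]: it tests GPR2's two clauses for notes given as (onset, duration) pairs, and checks MWFR2 and MWFR3 over a metrical grid given as lists of beat times, smallest level first.

```python
# Hypothetical illustration of GPR2 and MWFR2/MWFR3; the note and grid
# encodings are our own, not Lerdahl and Jackendoff's.

def gpr2_boundary(n1, n2, n3, n4):
    """GPR2: may the transition n2-n3 be heard as a group boundary?
    Notes are (onset, duration) pairs."""
    notes = (n1, n2, n3, n4)
    # a. (Slur/Rest): gap from the end of one note to the start of the next
    gaps = [notes[i + 1][0] - (notes[i][0] + notes[i][1]) for i in range(3)]
    # b. (Attack-Point): interval between successive attack points
    iois = [notes[i + 1][0] - notes[i][0] for i in range(3)]
    return ((gaps[1] > gaps[0] and gaps[1] > gaps[2]) or
            (iois[1] > iois[0] and iois[1] > iois[2]))

def metrically_well_formed(levels):
    """MWFR2 and MWFR3 over lists of beat times, smallest level first."""
    for smaller, larger in zip(levels, levels[1:]):
        if not set(larger) <= set(smaller):
            return False                 # MWFR2: beats must nest across levels
        idx = [smaller.index(b) for b in larger]
        if any(j - i not in (2, 3) for i, j in zip(idx, idx[1:])):
            return False                 # MWFR3: strong beats 2 or 3 beats apart
    return True

# A long rest between n2 and n3 suggests a boundary there (GPR2a):
print(gpr2_boundary((0.0, 0.5), (0.5, 0.5), (2.0, 0.5), (2.5, 0.5)))  # True

# A duple grid (eighths, quarters, halves) is well formed:
print(metrically_well_formed([[0, 1, 2, 3, 4, 5, 6, 7], [0, 2, 4, 6], [0, 4]]))  # True
```

The preference rules, unlike the well-formedness rules, only rank competing analyses, so in a full system functions like `gpr2_boundary` would contribute evidence to a scoring scheme rather than return a hard verdict.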