STATISTICS in
MUSICOLOGY
Jan Beran
MT6.B344 2003
781.2—dc21 2003048488
This book contains information obtained from authentic and highly regarded sources. Reprinted material
is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable
efforts have been made to publish reliable data and information, but the author and the publisher cannot
assume responsibility for the validity of all materials or for the consequences of their use.
Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic
or mechanical, including photocopying, microfilming, and recording, or by any information storage or
retrieval system, without prior permission in writing from the publisher.
The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for
creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC
for such copying.
Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation, without intent to infringe.
Preface
5 Hierarchical methods
5.1 Musical motivation
5.2 Basic principles
5.3 Specific applications in music
7 Circular statistics
7.1 Musical motivation
7.2 Basic principles
7.3 Specific applications in music
9 Discriminant analysis
9.1 Musical motivation
9.2 Basic principles
9.3 Specific applications in music
10 Cluster analysis
10.1 Musical motivation
10.2 Basic principles
10.3 Specific applications in music
11 Multidimensional scaling
11.1 Musical motivation
11.2 Basic principles
11.3 Specific applications in music
List of figures
References
Physical equations for sound waves only describe the propagation of air
pressure. They do not provide, by themselves, an understanding of how
and why certain sounds are connected, nor do they tell us anything (at
least not directly) about the effect on the audience. As far as structure is
concerned, one may even argue – for the sake of argument – that music does
not necessarily need “physical realization” in the form of a sound. Musicians
are able to hear music just by looking at a score. Beethoven (Figures 1.3,
1.16) composed his ultimate masterpieces after he lost his hearing. Thus,
on an abstract level, music can be considered as an organized structure
that follows certain laws. This structure may or may not express feelings
of the composer. Usually, the structure is communicated to the audience
by means of physical sounds – which in turn trigger an emotional expe-
rience of the audience (not necessarily identical with the one intended by
the composer). The structure itself can be analyzed, at least partially, us-
ing suitable mathematical structures. Note, however, that understanding
the mathematical structure does not necessarily tell us anything about the
effect on the audience. Moreover, any mathematical structure used for ana-
lyzing music describes certain selected aspects only. For instance, studying
symmetries of motifs in a composition by purely algebraic means ignores
psychological, historical, perceptual, and other important issues. Ideally, all
relevant scientific disciplines would need to interact to gain a broad under-
standing. A further complication is that the existence of a unique “truth”
is by no means certain (and is in fact rather unlikely). For instance, a
composition may contain certain structures that are important for some
listeners but are ignored by others. This problem became apparent in the
early 20th century with the introduction of 12-tone music. The general
public was not ready to perceive the complex structures of dodecaphonic
music and was rather appalled by the seemingly chaotic noise, whereas a
minority of “specialized” listeners was enthusiastic. Another example is the
1.3.2 Campanology
A rather peculiar example of group theory “in action” (though perhaps
rather trivial mathematically) is campanology or change ringing (Fletcher
1956, Wilson 1965, Price 1969, White 1983, 1985, 1987, Stewart 1992). The
art of change ringing started in England in the 10th century and is still
performed today. The problem that is to be solved is as follows: there are
k swinging bells in the church tower. One starts playing a melody that
consists of a certain sequence in which the bells are played, each bell be-
ing played only once. Thus, the initial sequence is a permutation of the
numbers 1, ..., k. Since it is not interesting to repeat the same melody over
and over, the initial melody has to be varied. However, the bells are very
heavy, so that it is not easy to change their timing. Each variation
is therefore restricted: from one “round” to the next, each bell may move
by at most one position, i.e. one or more disjoint pairs of adjacent
bells exchange their positions. Thus, for instance, if k = 4 and the pre-
vious sequence was (1, 2, 3, 4), then the permissible permutations are
(2, 1, 3, 4), (1, 3, 2, 4), (1, 2, 4, 3), and (2, 1, 4, 3). A further, mainly aesthetic
restriction is that no sequence should be repeated, except that the last one
is identical with the initial sequence. A typical solution to this problem is,
for instance, the “Plain Bob” that starts with (1, 2, 3, 4), (2, 1, 4, 3), (2, 4, 1, 3), ...
and continues until all permutations in S4 have been visited.
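As a small illustration (not from the book): the quoted Plain Bob opening moves from (1, 2, 3, 4) to (2, 1, 4, 3), i.e. two disjoint adjacent pairs swap at once, so the working rule is that each bell may move by at most one position per change. The permissible changes under this rule can be enumerated by brute force:

```python
def changes(seq):
    """All rows reachable from seq in one change: one or more
    disjoint pairs of adjacent bells exchange positions (each bell
    moves by at most one place)."""
    out = []
    def rec(i, cur, swapped):
        if i == len(seq):
            if swapped:
                out.append(tuple(cur))
            return
        rec(i + 1, cur + [seq[i]], swapped)       # bell i stays in place
        if i + 1 < len(seq):                      # bells i and i+1 swap
            rec(i + 2, cur + [seq[i + 1], seq[i]], True)
    rec(0, [], False)
    return out

# from (1, 2, 3, 4): the three single swaps named in the text,
# plus the double swap (2, 1, 4, 3) with which the Plain Bob begins
moves = changes((1, 2, 3, 4))
```

For k = 4 this yields exactly four permissible successors of any row.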
1.3.6 Transformations
For suitably chosen integers p1 , p2 , p3 , p4 , consider the four-dimensional
module M = Zp1 × Zp2 × Zp3 × Zp4 over Z where the coordinates rep-
resent onset time, pitch (well-tempered tuning if p2 = 12), duration, and
volume. Transformations in this space play an essential role in music. A se-
lection of historically relevant transformations used by classical composers
is summarized in Table 1.1 (also see Figure 1.13).
Generally, one may say that affine transformations are most important,
and among these the invertible ones. In particular, it can be shown that each
symmetry of Z12 can be written as a product (in the group of symmetries
Symm(Z12 )) of the following musically meaningful transformations:
• Multiplication by − 1 (inversion);
• Multiplication by 5 (ordering of notes according to the circle of fourths);
• Addition of 3 (transposition by a minor third);
• Addition of 4 (transposition by a major third).
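As a sketch (not from the book), this factorization claim can be checked by brute force: encoding each transformation as an affine map x ↦ ax + b on Z12 and closing the four generators under composition yields the full group of 48 invertible affine maps.

```python
# the four generating transformations, encoded as affine maps
# x -> a*x + b (mod 12), written as pairs (a, b)
generators = [
    (11, 0),  # multiplication by -1 (inversion)
    (5, 0),   # multiplication by 5 (circle of fourths)
    (1, 3),   # addition of 3 (transposition by a minor third)
    (1, 4),   # addition of 4 (transposition by a major third)
]

def compose(f, g):
    """(f o g)(x) = a_f*(a_g*x + b_g) + b_f (mod 12)."""
    (af, bf), (ag, bg) = f, g
    return ((af * ag) % 12, (af * bg + bf) % 12)

# close the generators under composition
group = {(1, 0)}
while True:
    new = {compose(f, g) for f in group | set(generators)
                         for g in group | set(generators)} - group
    if not new:
        break
    group |= new

# group now contains all 48 invertible affine maps a*x + b,
# i.e. those with a in {1, 5, 7, 11}
```

The closure has 48 elements, matching the order of the group of affine symmetries of Z12.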
All these transformations have been used by composers for many centuries.
Some examples of apparent similarities between groups of notes (or motifs)
are shown in Figures 1.10 through 1.12. In order not to clutter the pic-
tures, only a small selection of similar motifs is marked. In dodecaphonic
and serial music, transformation groups have been applied systematically
(see e.g. Figure 1.9). For instance, in Schönberg’s Orchestervariationen op.
Figure 1.10 Notes of “Air” by Henry Purcell. (For better visibility, only a small
selection of related “motifs” is marked.)
Figure 1.12 Notes of op. 68, No. 2 from “Album für die Jugend” by Robert Schu-
mann. (For better visibility, only a small selection of related “motifs” is marked.)
Figure 1.14 Graphical representation of pitch and onset time in Z271 together with
instrumentation of polygonal areas. (Excerpt from Śānti – Piano concert No. 2
by Jan Beran, col legno CD 20062; courtesy of col legno, Germany.)
[Figure: log(tempo) versus onset time for the 1947, 1963, and 1965 performances.]
• Bar chart: If the data can assume only a few different values, or if the data
are qualitative (i.e. we only record which category an item belongs to), then
one can plot the possible values or names of categories on the x-axis and
the corresponding (relative) frequencies on the vertical axis.
• Q-q-plot for comparing two data sets x1, ..., xn and y1, ..., ym: 1. Define
a certain number of points 0 < p1 < ... < pk ≤ 1 (the standard choice is
pi = (i − 0.5)/N, where N = min(n, m)). 2. Plot the pi-quantiles (i = 1, ..., N)
of the y-observations versus those of the x-observations. Alternative
plots for comparing distributions are discussed e.g. in Ghosh and Beran
(2000) and Ghosh (1996, 1999).
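As a sketch (function and data names are our own), the quantile pairs for such a q-q-plot can be computed as follows; note that np.quantile interpolates between order statistics, which may differ slightly from some textbook quantile definitions:

```python
import numpy as np

def qq_points(x, y):
    """Quantile pairs for a q-q-plot of two samples, using
    p_i = (i - 0.5)/N with N = min(n, m)."""
    N = min(len(x), len(y))
    p = (np.arange(1, N + 1) - 0.5) / N
    qx = np.quantile(x, p)
    qy = np.quantile(y, p)
    return qx, qy  # plot qy against qx

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * rng.normal(size=150) + 1   # same shape as x, shifted and scaled
qx, qy = qq_points(x, y)
# for distributions of the same shape the points fall near a straight line
```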
[Figure: log(tempo) versus onset time, one curve per performance (Argerich, Arrau, Askenaze, Brendel, Bunin, Capova, Cortot (3 recordings), Curzon, Davies, Demus, Eschenbach, Gianoli, Horowitz (3 recordings), Katsaris, Klien, Krust, Kubalek, Moiseiwitsch, Ney, Novaes, Ortiz, Schnabel, Shelley, Zak).]
sample considered here, pianists of the “modern era” tend to make a much
stronger distinction between A and A′′ in terms of slow tempi. The only
exceptions (outliers in the left boxplot) are Moiseiwitsch, Horowitz’s
first performance, and Ashkenazy (outlier in the right boxplot). The
comparison of skewness and kurtosis in Figures 2.4g and h also indicates that
“modern” pianists seem to prefer occasional extreme ritardandi. The only
exception in the “early 20th century group” is Artur Schnabel, with an
extreme skewness of −2.47 and a kurtosis of 7.04.
Direct comparisons of tempo distributions are shown in Figures 2.5a–f.
[Figure 2.5a: q-q-plot Demus (1960) – Ortiz (1988); Figure 2.5b: q-q-plot Demus (1960) – Cortot (1935); Figure 2.5c: q-q-plot Ortiz (1988) – Argerich (1983); further panels involve Horowitz (1963), Cortot (1947), and Krust.]
[Figure 2.6a: J.S. Bach – Fugue 1; Figure 2.6b: W.A. Mozart – KV 545; relative frequencies of notes number i, i in [1+j, 16+j] (j = 0, ..., 64).]
Figure 2.6 Frequencies of notes 0,1,...,11 for moving windows of onset-length 16.
[Figure 2.7 panels: frequencies plotted against (Notes − Tonic) mod 12.]
Figure 2.7 Frequencies of notes 0,1,...,11 for moving windows of onset-length 16.
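The moving-window note frequencies underlying Figures 2.6 and 2.7 can be sketched as follows (a toy example with an invented note sequence, not the book's data):

```python
import numpy as np

def pitch_class_freqs(pitches, tonic, window=16, step=1):
    """Relative frequencies of (note - tonic) mod 12 in moving
    windows of `window` successive notes."""
    pc = (np.asarray(pitches) - tonic) % 12
    rows = []
    for j in range(0, len(pc) - window + 1, step):
        counts = np.bincount(pc[j:j + window], minlength=12)
        rows.append(counts / window)
    return np.array(rows)   # one row of 12 relative frequencies per window

# hypothetical toy sequence (MIDI pitch numbers), tonic C = 60
notes = [60, 62, 64, 65, 67, 69, 71, 72, 71, 69, 67, 65,
         64, 62, 60, 59, 60, 64, 67, 72]
F = pitch_class_freqs(notes, tonic=60)
# each row sums to 1; plotting the rows against 0,...,11 gives
# curves of the kind shown in Figures 2.6 and 2.7
```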
2.4.1 Definitions
Correlation
where ui denotes the rank of xi among the x−values and vi is the rank
of yi among the y−values. In (2.3) and (2.4) it is assumed that sx , sy ,
su and sv are not zero. Recall that these definitions imply the following
properties: a) −1 ≤ r, rSp ≤ 1; b) r = 1, if and only if yi = βo + β1 xi
and β1 > 0 (exact linear relationship with positive slope); c) r = −1, if
and only if yi = βo + β1 xi and β1 < 0 (exact linear relationship with
negative slope); d) rSp = 1, if and only if xi > xj implies yi > yj (strictly
monotonically increasing relationship); e) rSp = −1, if and only if xi >
xj implies yi < yj (strictly monotonically decreasing relationship); f) r
measures the strength (and sign) of the linear relationship; g) rSp measures
the strength (and sign) of monotonicity; h) if the data are realizations of a
bivariate random variable (X, Y), then r is an estimate of the population
correlation ρ = cov(X, Y)/√(var(X) var(Y)) where cov(X, Y) = E[XY] −
E[X]E[Y], var(X) = cov(X, X) and var(Y) = cov(Y, Y). When using
these measures of dependence one should bear in mind that each of them
measures a specific type of dependence only, namely linear and monotonic
dependence respectively. Thus, a Pearson or Spearman correlation near
or equal to zero does not necessarily mean independence. Note also that
correlation can be interpreted in a geometric way as follows: defining the
n−dimensional vectors x = (x1 , ..., xn )t and y = (y1 , ..., yn )t , r is equal to
the standardized scalar product between x and y, and is therefore equal to
the cosine of the angle between these two vectors.
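A minimal sketch of both correlation measures (our own helper functions), illustrating properties f) and g): a strictly monotone but nonlinear relation gives rSp = 1 while r < 1.

```python
import numpy as np

def pearson(x, y):
    """r as the standardized scalar product of the centered vectors,
    i.e. the cosine of the angle between them."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def spearman(x, y):
    """r_Sp: the Pearson correlation of the ranks u_i, v_i
    (no ties assumed in this sketch)."""
    rank = lambda a: np.argsort(np.argsort(a)) + 1.0
    return pearson(rank(x), rank(y))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(x)                # strictly increasing but not linear
r, r_sp = pearson(x, y), spearman(x, y)
# r_sp = 1 exactly (monotone relationship), while r < 1 (not linear)
```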
A special type of correlation is interesting for time series. Time series are
data that are taken in a specific ordered (usually temporal) sequence. If
Y1 , Y2 , ..., Yn are random variables observed at time points i = 1, ..., n, then
one would like to know whether there is any linear dependence between
observations Yi and Yi−k , i.e. between observations that are k time units
apart. If this dependence is the same for all time points i, and the expected
value of Yi is constant, then the corresponding population correlation can
be written as a function of k only (see Chapter 4):

ρ(k) = cov(Yi, Yi+k)/√(var(Yi) var(Yi+k)).   (2.5)

In the sample version, the covariance is estimated by s² = n⁻¹ Σ (yi − ȳ)(yi+k − ȳ);
note that here the summation stops at i = n − k, since yi+k is only observed
for i ≤ n − k.
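A sample version of (2.5) can be sketched as follows (our own implementation; the MA(1) series is invented for illustration):

```python
import numpy as np

def acf(y, max_lag):
    """Sample autocorrelation: the summation stops at n - k, and the
    overall mean and variance are used at every lag."""
    y = np.asarray(y, float)
    n, ybar = len(y), y.mean()
    c0 = ((y - ybar) ** 2).sum() / n
    return np.array([((y[:n - k] - ybar) * (y[k:] - ybar)).sum() / n / c0
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(1)
e = rng.normal(size=500)
y = e[1:] + 0.8 * e[:-1]       # MA(1): dependence only at lag 1
rho = acf(y, 3)
# rho[0] = 1; rho[1] is near 0.8/(1 + 0.64) ~ 0.49; rho[2], rho[3] near 0
```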
Regression
In addition to measuring the strength of dependence between two variables,
one is often interested in finding an explicit functional relationship. For
instance, it may be possible to express the response variable y in terms of an
explanatory variable x by y = g(x, ε) where ε is a variable representing the
part of y that is unexplained. More specifically, we may have, for example,
an additive relationship y = g(x) + ε or a multiplicative equation y =
g(x)eε . The simplest relationship is given by the simple linear regression
equation
y = βo + β1 x + ε (2.9)
where ε is assumed to be a random variable with E(ε) = 0 (and usually
finite variance σ² = var(ε) < ∞). Thus, the data are yi = βo + β1 xi + εi (i =
1, ..., n), where the εi's are generated by the same zero-mean distribution.
Often the εi's are also assumed to be uncorrelated or even independent – this
is, however, not a necessary assumption. An obvious estimate of the unknown
parameters βo and β1 is obtained by minimizing the total sum of squared
errors

SSE = SSE(bo, b1) = Σ (yi − bo − b1 xi)² = Σ ri²(bo, b1)   (2.10)
with respect to bo , b1 . The solution is found by setting the partial derivatives
with respect to bo and b1 equal to zero. A more elegant way to find the
solution is obtained by interpreting the problem geometrically: defining the
n-dimensional vectors 1 = (1, ..., 1)t, b = (bo, b1)t and the n × 2 matrix X
with columns 1 and x, we have SSE = ||y − bo 1 − b1 x||² = ||y − Xb||²
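Both routes to the least squares solution – the partial derivatives of SSE, and the geometric projection of y onto the columns of X – can be sketched on simulated data (all names and numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
y = 1.5 + 0.7 * x + rng.normal(scale=0.3, size=50)   # true (b0, b1) = (1.5, 0.7)

# 1) closed form obtained from the partial derivatives of SSE
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

# 2) geometric view: project y onto the column space of X = [1, x]
X = np.column_stack([np.ones_like(x), x])
b_geom = np.linalg.lstsq(X, y, rcond=None)[0]

# both routes give the same (b0, b1), close to the true values
```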
Regression smoothing
A more general, but more difficult, approach to modeling a functional re-
lationship is to impose less restrictive assumptions on the function g. For
instance, we may assume
y = g(x) + ε (2.17)
with g being a twice continuously differentiable function. Under suitable
additional conditions on x and ε it is then possible to estimate g from
observed data by nonparametric smoothing. As a special example consider
observations yi taken at time points i = 1, 2, ..., n. A standard model is
yi = g(ti ) + εi (2.18)
where ti = i/n, and the εi are independent identically distributed (iid) random
variables with E(εi) = 0 and σ² = var(εi) < ∞. The reason for using
standardized time ti ∈ [0, 1] is that this way g is observed on an increasingly
fine grid. This makes it possible to ultimately estimate g(t) for all values
of t by using neighboring values ti , provided that g is not too “wild”. A
simple estimate of g can be obtained, for instance, by a weighted average
(kernel smoothing)
ĝ(t) = Σi wi yi   (2.19)

where wi = K((t − ti)/b) / Σj K((t − tj)/b), with b > 0, and a kernel function
K ≥ 0 such that K(u) = K(−u), K(u) = 0 for |u| > 1, and ∫₋₁¹ K(u) du = 1.
The role of b is to restrict the observations
that influence the estimate to a small window of neighboring time points.
For instance, the rectangular kernel K(u) = ½·1{|u| ≤ 1} yields the sample
mean of the observations yi in the “window” n(t − b) ≤ i ≤ n(t + b). An even
more elegant formula can be obtained by approximating the Riemann sum
(nb)⁻¹ Σj K((t − tj)/b) by the integral ∫₋₁¹ K(u) du = 1:

ĝ(t) = Σi wi yi = (nb)⁻¹ Σi K((t − ti)/b) yi   (2.21)
In this case, the sum of the weights is not exactly equal to one, but asymp-
totically (as n → ∞ and b → 0 such that nb3 → ∞) this error is negligible.
It can be shown that, under fairly general conditions on g and ε, ĝ con-
verges to g, in a certain sense that depends on the specific assumptions (see
e.g. Gasser and Müller 1979, Gasser and Müller 1984, Härdle 1991, Beran
and Feng 2002, Wand and Jones 1995, and references therein).
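Formula (2.21) can be sketched directly (a toy example with a rectangular kernel and simulated data, not the book's tempo data):

```python
import numpy as np

def kernel_smooth(y, t, b):
    """Kernel estimate ghat(t) = (nb)^{-1} * sum_i K((t - t_i)/b) * y_i
    as in (2.21), with the rectangular kernel K(u) = 0.5 * 1{|u| <= 1}
    and standardized time t_i = i/n."""
    n = len(y)
    ti = np.arange(1, n + 1) / n
    w = 0.5 * (np.abs((t - ti) / b) <= 1)
    return (w @ y) / (n * b)

rng = np.random.default_rng(3)
n = 400
ti = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * ti) + rng.normal(scale=0.2, size=n)  # g(t) = sin(2*pi*t)

t_grid = np.linspace(0.1, 0.9, 9)   # interior points, away from the borders
ghat = np.array([kernel_smooth(y, t, b=0.05) for t in t_grid])
# ghat is close to sin(2*pi*t) on the interior grid
```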
An alternative to kernel smoothing is local polynomial fitting (Fan and
Gijbels 1995, 1996; also see Feng 1999). The idea is to fit a polynomial
locally, i.e. to data in a small neighborhood of the point of interest. This
can be formulated as a weighted least squares problem as follows:
ĝ(t) = β̂o   (2.22)

where β̂ = (β̂o, β̂1, ..., β̂p)t solves the local least squares problem

β̂ = arg min_a Σi K((ti − t)/b) ri²(a).   (2.23)
Here ri = yi − [ao + a1 (ti − t) + ... + ap (ti − t)p ], K is a kernel as above and
b > 0 is the bandwidth defining the window of neighboring observations.
It can be shown that asymptotically, a local polynomial smoother can be
written as a kernel estimator (Ruppert and Wand 1994). A difference only
occurs at the borders (t close to 0 or 1), where, in contrast to the local
polynomial estimate, the kernel smoother has to be modified. The reason
is that observations are no longer symmetrically spaced in the window
t ± b. A major advantage of local polynomials is that they automatically
provide estimates of derivatives, namely ĝ′(t) = β̂1, ĝ″(t) = 2β̂2, etc. Kernel
smoothing can also be used for estimation of derivatives; however different
(and rather complicated) kernels have to be used for each derivative (Gasser
and Müller 1984, Gasser et al. 1985). A third alternative is so-called wavelet
smoothing.
... where the kernel K is such that K(u, v) = K(−u, v) = K(u, −v) ≥ 0, and
∫∫ K(u, v) du dv = 1. Usually, b1 = b2 = b and K(u, v) has compact support.
Interpolation
Often a process may be generated in continuous time, but is observed at
discrete time points. One may then wish to guess the values of the points
Statistical inference
In this section, correlation, linear regression, nonparametric smoothing,
and interpolation were introduced in an informal way, without exact dis-
cussion of probabilistic assumptions and statistical inference. All these
techniques can be used in an informal way to explore possible structures
without specific model assumptions. Sometimes, however, one wishes to
obtain more solid conclusions by statistical tests and confidence intervals.
There is an enormous literature on statistical inference in regression, in-
cluding nonparametric approaches. For selected results see the references
given above. For nonparametric methods also see Wand and Jones (1995),
Simonoff (1996), Bowman and Azzalini (1997), Eubank (1999) and refer-
ences therein.
[Figure panels: correlations of acceleration curves – b) correlations of Cortot (1935) with other performances; d) maximal correlations of Horowitz (1947) with other performances.]

Figure 2.13 Smoothed tempo curves ĝ1(t) = (nb1)⁻¹ Σ K((t − ti)/b1) yi (b1 = 8).
Figure 2.14 Smoothed tempo curves ĝ2(t) = (nb2)⁻¹ Σ K((t − ti)/b2) [yi − ĝ1(t)] (b2 = 1).
Figure 2.15 Smoothed tempo curves ĝ3(t) = (nb3)⁻¹ Σ K((t − ti)/b3) [yi − ĝ1(t) − ĝ2(t)] (b3 = 1/8).
Figure 2.16 Smoothed tempo curves – residuals ê(t) = yi − ĝ1 (t) − ĝ2 (t) − ĝ3 (t).
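The decomposition behind Figures 2.13–2.16 – smoothing with a large bandwidth, then smoothing the residuals with successively smaller bandwidths – can be sketched as follows (a toy "tempo curve"; the bandwidths here are illustrative fractions of standardized time, not the book's 8, 1, 1/8):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 256
t = np.arange(1, n + 1) / n
# toy "tempo curve": slow trend + local ritardando-like wiggles + noise
y = 2 * t * (1 - t) + 0.3 * np.sin(12 * np.pi * t) + rng.normal(scale=0.05, size=n)

def smooth(y, b):
    """Moving average over the rectangular window |t_i - t_j| <= b;
    weights are renormalized, so the borders are handled too."""
    ti = np.arange(1, len(y) + 1) / len(y)
    W = (np.abs(ti[:, None] - ti[None, :]) <= b).astype(float)
    return (W / W.sum(axis=1, keepdims=True)) @ y

# successive smoothing of residuals with decreasing bandwidths,
# in the spirit of ghat_1, ghat_2, ghat_3
g1 = smooth(y, 0.25)
g2 = smooth(y - g1, 0.05)
g3 = smooth(y - g1 - g2, 0.01)
resid = y - g1 - g2 - g3
# by construction, y = g1 + g2 + g3 + resid exactly
```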
Figure 2.17 Melodic indicator – local polynomial fits together with first and second
derivatives.
Figure 2.18 Tempo curves (Figure 2.3) – first derivatives obtained from local
polynomial fits (span 24/32).
Figure 2.19 Tempo curves (Figure 2.3) – second derivatives obtained from local
polynomial fits (span 8/32).
Figure 2.26 R. Schumann, Träumerei op. 15, No. 7 – tempo by Cortot and
Horowitz at sharpening onset times.
Figure 2.27 R. Schumann, Träumerei op. 15, No. 7 – tempo “derivatives” for
Cortot and Horowitz at sharpening onset times.
Most methods for analyzing multivariate data are based on these two statis-
tics. One of the main tools consists of dimension reduction by suitable pro-
jections, since it is easier to find and visualize structure in low dimensions.
These techniques go far beyond descriptive statistics. We therefore post-
pone the discussion of these methods to Chapters 8 to 11. Another set of
methods consists of visualizing individual multivariate observations. The
main purpose is a simple visual identification of similarities and differences
between observations, as well as search for clusters and other patterns.
Typical examples are:
• Faces: xi =(xi1 , ..., xip )t is represented by a face with features depending
on the values of the corresponding coordinates. For instance, the face function
in S-Plus has the following correspondence between coordinates and
feature parameters: xi,1 = area of face; xi,2 = shape of face; xi,3 = length
of nose; xi,4 = location of mouth; xi,5 = curve of smile; xi,6 = width of
mouth; xi,7 = location of eyes; xi,8 = separation of eyes; xi,9 = angle
of eyes; xi,10 = shape of eyes; xi,11 = width of eyes; xi,12 = location of
pupil; xi,13 = location of eyebrow; xi,14 = angle of eyebrow; xi,15 = width
of eyebrows.
• Stars: Each coordinate is represented by a ray in a star, the length of
the ray corresponding to the value of the coordinate. More specifically, a
star for a data vector xi = (xi1, ..., xip)t is constructed as follows:
1. Scale xi to the range [0, r]: 0 ≤ x1j, ..., xnj ≤ r;
2. Draw p rays at angles ϕj = 2π(j − 1)/p (j = 1, ..., p); for a star with
Figure 2.29 b) Chernoff faces for the same compositions as in figure 2.29a, after
permuting coordinates.
Figure 2.31 Star plots of p∗j = (p6 , p11 , p4 , p9 , p2 , p7 , p12 , p5 , p10 , p3 , p8 )t for com-
positions from the 13th to the 20th century.
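The star construction described above can be sketched as follows (our own function; it returns ray endpoints rather than drawing them):

```python
import numpy as np

def star_vertices(X, r=1.0):
    """Ray endpoints for star plots of the rows of X (n x p):
    each column is scaled to [0, r] across the sample, and ray j of
    observation i has length X_scaled[i, j] at angle
    phi_j = 2*pi*(j - 1)/p."""
    X = np.asarray(X, float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    S = r * (X - lo) / np.where(hi > lo, hi - lo, 1.0)   # scaled to [0, r]
    p = X.shape[1]
    phi = 2 * np.pi * np.arange(p) / p
    # endpoint of ray j: (length*cos(phi_j), length*sin(phi_j))
    rays = np.stack([np.cos(phi), np.sin(phi)], axis=-1)  # shape (p, 2)
    return S[:, :, None] * rays[None, :, :]               # shape (n, p, 2)

# three hypothetical observations with p = 4 features
X = np.array([[1.0, 0.0, 2.0, 5.0],
              [3.0, 1.0, 0.0, 0.0],
              [2.0, 2.0, 1.0, 2.5]])
V = star_vertices(X)   # (3, 4, 2): n stars, p rays, (x, y) endpoints
```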
and

Eup = {(t∗j, max{x(t) : (t, x(t)) ∈ Cj}), j = 1, ..., n}.
In other words, for each onset time, the lowest and highest note are se-
lected to define the lower and upper envelope respectively. In the example
below, we consider interval steps ∆y(ti ) = y(ti+1 ) − y(ti ) mod 12 for the
upper envelope of a composition with onset times t1 , ..., tn and pitches
y(t1 )..., y(tn ). A simple aspect of melodic and harmonic structure is the
question in which sequence intervals are likely to occur. Here, we look at
the empirical two-dimensional distribution of (∆y(ti ), ∆y(ti+1 )). For each
pair (i, j) (−11 ≤ i, j ≤ 11; i, j ≠ 0), we count the number nij of occurrences
and define Nij = log(nij + 1). (The value 0 is excluded here, since repetitions
of a note – or transposition by an octave – are less interesting.) If only
the type of interval and not its direction is of interest, then i, j assume the
values 1 to 11 only. A useful representation of Nij can be obtained by
a symbol plot. In Figures 2.32 and 2.33, the x- and y-coordinates corre-
spond to i and j respectively. The radius of a circle with center (i, j) is
proportional to Nij . The compositions considered here are: a) J.S. Bach:
Präludium No. 1 from ”Das Wohltemperierte Klavier”; b) W.A. Mozart :
Sonata KV 545, (beginning of 2nd Movement); c) A. Scriabin: Prélude op.
51, No. 4; and d) F. Martin: Prélude No. 6. For Bach’s piece, there is a clear
clustering in three main groups in the first plot (there are almost never two
successive interval steps downwards) and a horseshoe-like pattern for absolute
intervals. The clear negative correlation in Mozart’s first plot and the
concentration on a few selected interval sequences are remarkable. A negative
correlation in the plots of interval steps with sign can also be found
for Scriabin and Martin. However, considering only the types of intervals
without their sign, the number and variety of interval sequences that are
used relatively frequently is much higher for Scriabin and even more for
Martin. For Martin, the plane of absolute intervals (Figure 2.33d) is filled
almost uniformly.
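The counts Nij behind these symbol plots can be sketched as follows. The text does not fully specify how the mod-12 reduction yields signed steps in −11, ..., 11, so the signed reduction used here is one plausible reading (the melody fragment is invented):

```python
import numpy as np

def interval_transitions(pitches):
    """N_ij = log(n_ij + 1), where n_ij counts successive interval-step
    pairs (dy_i, dy_{i+1}). Here the absolute interval is reduced mod 12
    and the direction (sign) is kept -- one plausible reading of the
    text -- and pairs involving a step of 0 are excluded."""
    y = np.asarray(pitches)
    d = np.diff(y)
    d = np.sign(d) * (np.abs(d) % 12)   # signed steps in -11, ..., 11
    N = np.zeros((23, 23))              # index i + 11 for step i
    for a, b in zip(d[:-1], d[1:]):
        if a != 0 and b != 0:
            N[a + 11, b + 11] += 1
    return np.log(N + 1)                # circle radius at position (i, j)

# hypothetical upper-envelope fragment (MIDI pitch numbers)
melody = [60, 64, 62, 65, 64, 60, 67, 65, 64, 62, 60]
N = interval_transitions(melody)
```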
Figure 2.34 Symbol plot with x = pj5, y = pj7 and radius of circles proportional
to pj1.
Figure 2.35 Symbol plot with x = pj5, y = pj7 and radius of circles proportional
to pj6. (Color figures follow page 152.)
pj1 (diminished second) and height pj6 (augmented fourth). Using the same
colors for the names as above, a similar clustering as in the circle-plot can
be observed. The picture not only visualizes a clear four-dimensional rela-
tionship between pj1 , pj5 , pj6 and pj7 , but also shows that these quantities
are related to the time period.
Figure 2.36 Symbol plot with x = pj5 , y = pj7 . The rectangles have width pj1
(diminished second) and height pj6 (augmented fourth). (Color figures follow page
152.)
Figure 2.37 Symbol plot with x = pj5 , y = pj7 , and triangles defined by pj1 (di-
minished second), pj6 (augmented fourth) and pj10 (diminished seventh). (Color
figures follow page 152 .)
Figure 2.38 Names plotted at locations (x, y) = (pj5 , pj7 ). (Color figures follow
page 152.)
Let I1 be the information needed to identify the set Vj to which v belongs.
Then the total information needed for identifying (encoding) elements of
V is
log2 N = I1 + I2 (3.3)

On the other hand, Σ_{j=1}^{k} pj log2 N = log2 N, so that we obtain Shannon's
famous formula

I1 = − Σ_{j=1}^{k} pj log2 (pj ) (3.4)
I1 is also called Shannon information. Shannon information is thus the
expected information about the occurrence of the sets V1 , ..., Vk contained in
a randomly chosen element from V . Note that the term “information” can
be used synonymously for “uncertainty”: the information obtained from
a random experiment diminishes uncertainty by the same amount. The
derivation of Shannon information is credited to Shannon (1948) and, in-
dependently, Wiener (1948). In physics, an analogous formula is known as
entropy and is a measure of the disorder of a system (see Boltzmann 1896,
figure 3.1).
Shannon’s formula can also be derived by postulating the following prop-
erties for a measure of information of the outcome of a random experiment:
let V1 , ..., Vk be the possible outcomes of a random experiment and denote
by pj = P (Vj ) the corresponding probabilities. Then a measure of infor-
mation, say I, obtained by the outcome of the random experiment should
have the following properties:
1. Function of probabilities: I = I(p1 , ..., pk ), i.e. I depends on the proba-
bilities pj only;
2. Symmetry: I(p1 , ..., pk ) = I(pπ(1) , ..., pπ(k) ) for any permutation π;
3. Continuity: I(p, 1 − p) is a continuous function of p (0 ≤ p ≤ 1);
4. Definition of unit: I(1/2, 1/2) = 1;
Shannon information has an obvious upper bound that follows from Jensen's
inequality: recall that Jensen's inequality states that for a convex function
g and weights wj ≥ 0 with Σ wj = 1 we have

g(Σ_j wj xj ) ≤ Σ_j wj g(xj ).
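As a small numerical check of formula (3.4) and the Jensen bound, the following sketch (plain Python, our own illustration, not from the text) computes I1 for a few distributions; the function name shannon_information is ours.

```python
import math

def shannon_information(p):
    """Shannon information I1 = -sum_j p_j log2(p_j) of a probability vector."""
    assert abs(sum(p) - 1.0) < 1e-9
    return -sum(pj * math.log2(pj) for pj in p if pj > 0)

# The unit of property 4: a fair coin carries exactly one bit.
print(shannon_information([0.5, 0.5]))              # 1.0

# Jensen's inequality yields the upper bound I1 <= log2(k),
# attained exactly by the uniform distribution on k sets:
k = 8
print(shannon_information([1 / k] * k))             # 3.0
p = [0.7, 0.1, 0.1, 0.05, 0.05]
print(shannon_information(p) <= math.log2(len(p)))  # True
```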
This definition is plausible, because for a process with unit variance, f has
the same properties as a probability distribution and can be interpreted as
a distribution on frequencies. The process Xt is uncorrelated if and only if
f is constant, i.e. if f is the uniform distribution on [−π, π]. Exactly in this
case entropy is maximal, and knowledge of past observations does not help
to predict future observations. On the other hand, if f has one or more
Specific indicators
A possible objection to weight functions as defined above is that only in-
formation about pitch and onset time is used. A score, however, usually
contains much more symbolic information that helps musicians to read it
correctly. For instance, melodic phrases are often connected by a phras-
ing slur, notes are grouped by beams, separate voices are made visible by
suitable orientation of note stems, etc. Ideally, structural indicators should
take into account such additional information. An improved indicator that
takes into account knowledge about musical “motifs” can be defined for
example as follows:
Definition 31 Let M = {(τ1 , y1 ), ..., (τk , yk )}, τ1 < τ2 < ... < τk be a
“motif ” where y denotes pitch and τ onset time. Given a composition K ⊂
T ×Z ⊂ Z2 , define for each score-onset time ti ∈ T (i = 1, ..., n) and
u ∈ {1, ..., k}, the shifted motif
M (ti , u) = {(ti + τ1 − τu , y1 ), ..., (ti + τk − τu , yk )}
and define ru (ti ) to be the sample correlation between the composition's
pitches at the onsets of M (ti , u) and y = (y1 , ..., yk ). If M (ti , u) ⊄ K, then
set ru (ti ) = 0.
Disregarding the position within a motif, we can now define overall motivic
indicators (or weights), for instance by
wd,mean (ti ) = g(Σ_{u=1}^{k} du (ti )) (3.14)
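A toy implementation of the correlation in Definition 31 may make this concrete (our own sketch; the composition is simplified to a monophonic map from onset time to pitch, and all names are ours):

```python
import math

def motif_correlation(K, motif, ti, u):
    """r_u(t_i): sample correlation between the composition's pitches at the
    onset times of the shifted motif M(t_i, u) and the motif pitches
    y_1, ..., y_k; set to 0 when the shifted motif falls outside K."""
    taus = [tau for tau, _ in motif]
    ys = [y for _, y in motif]
    onsets = [ti + tau - taus[u] for tau in taus]
    if any(t not in K for t in onsets):
        return 0.0
    xs = [K[t] for t in onsets]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sx = math.sqrt(sum((v - mx) ** 2 for v in xs))
    sy = math.sqrt(sum((v - my) ** 2 for v in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (sx * sy)

# The motif (0,60),(1,62),(2,64) appears transposed at onset time 10:
K = {0: 60, 1: 62, 2: 64, 10: 67, 11: 69, 12: 71, 20: 64, 21: 60, 22: 62}
motif = [(0, 60), (1, 62), (2, 64)]
print(round(motif_correlation(K, motif, 10, 0), 6))  # 1.0: exact transposed match
print(round(motif_correlation(K, motif, 20, 0), 6))  # -0.5
print(motif_correlation(K, motif, 5, 0))             # 0.0: motif not inside K
```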
Clearly, as r tends to zero, µr,h can only increase and therefore has a
limit. The limit can be either zero (if µr,h = 0 already), infinity, or a finite
number. This leads to the following definition:
Definition 34 A function h for which
0 < µh (A) < ∞
is called an intrinsic function of A.
Consider, for example, a simple shape in the plane such as a circle with
radius R. The area of the circle A can be measured by covering it by small
circles of radius r and evaluating µh (A) using the function h(r) = πr2 .
It is well known that limr→0 µr,h (A) exists and is equal to µh (A) = πR2 .
On the other hand, if we took h(r) = πrα with α < 2, then µh (A) = ∞,
whereas for α > 2, µh (A) = 0. For standard sets, such as circles, rectangles,
triangles, cylinders, etc., it is generally true that the intrinsic function for a
set A with topological dimension dT = d is given by (Hausdorff 1919)
h(r) = hd (r) = [{Γ(1/2)}d /Γ(1 + d/2)] rd . (3.21)
Figure 3.2 Fractal pictures (by Céline Beran, computer generated.) (Color figures
follow page 152 .)
where X1 , X2 , ... is a stationary discrete time process with zero mean and
a1 , a2 , ... a sequence of positive normalizing constants such that log an →
∞. Then there exists an H > 0 such that for any u > 0, limn→∞ (anu /an ) =
uH , Zt is self-similar with self-similarity parameter H, and Zt has station-
ary increments.
The self-similarity parameter therefore also makes sense for processes that
are not exactly self-similar themselves, since it is defined by the rate n−H
needed to standardize partial sums. Moreover, H is related to the fractal
dimension; the exact relationship between H and the fractal dimension,
however, depends on some other properties of the process as well. For
instance, sample paths of (univariate) Gaussian self-similar processes,
so-called fractional Brownian motion (see Chapter 4), have, with probability
one, a fractal dimension of 2 − H with possible values of H in the inter-
val (0, 1). Thus, the closer H is to 1, the more a sample path is similar
to a simple geometric line with dimension one. On the other hand, as H
approaches zero, a typical sample path fills up most of the plane so that
the dimension approaches two. Practically, H can be determined from an
observed series X1 , ..., Xn , for example by maximum likelihood estimation.
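As an illustration of how H governs the scaling of partial sums, here is a simple aggregated-variance estimate, a graphical alternative to the maximum likelihood approach mentioned above (the code and function names are our own sketch, not a method from the text): for self-similar increments, the variance of block means of size m behaves like m^(2H−2), so H can be read off a log-log regression.

```python
import math, random

def aggregated_variance_H(x, block_sizes):
    """Slope-based estimate of H: for self-similar increments the variance
    of non-overlapping block means of size m scales like m**(2H - 2), so a
    least-squares line through (log m, log Var) has slope 2H - 2."""
    logs_m, logs_v = [], []
    for m in block_sizes:
        means = [sum(x[i:i + m]) / m for i in range(0, len(x) - m + 1, m)]
        mu = sum(means) / len(means)
        v = sum((y - mu) ** 2 for y in means) / (len(means) - 1)
        logs_m.append(math.log(m))
        logs_v.append(math.log(v))
    k = len(logs_m)
    mx, my = sum(logs_m) / k, sum(logs_v) / k
    slope = (sum((a - mx) * (b - my) for a, b in zip(logs_m, logs_v))
             / sum((a - mx) ** 2 for a in logs_m))
    return 1 + slope / 2

random.seed(1)
iid = [random.gauss(0, 1) for _ in range(20000)]
h_hat = aggregated_variance_H(iid, [4, 8, 16, 32, 64])
print(round(h_hat, 2))  # close to 0.5 for uncorrelated noise
```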
For a thorough discussion of self-similar and related processes and sta-
tistical methods see e.g. Beran (1994). Further references on fractals apart
from those given above are, for instance, Edgar (1990), Falconer (1990),
Peitgen and Saupe (1988), Stoyan and Stoyan (1994), and Tricot (1995).
A cautionary remark should be made at this point: in view of theorem 11,
the fact that we do find self-similarity in aggregated time series is hardly
surprising and can therefore not be interpreted as something very special
that would distinguish the particular series from other data. What may
be special at most is which particular value of H is obtained and which
particular self-similar process the normalized aggregated series converges
to.
Let x(ti ) be the upper and y(ti ) the lower envelope of a composition at
score-onset times ti (i = 1, ..., n). To investigate the shape of the melodic
where f̂ is obtained from the observed data x(2;12) (t1 ), ..., x(2;12) (tn ) by
kernel estimation.
2. E2 : Same as E1 , but using x(2) (t1 ), ..., x(2) (tn ) instead.
3. E3 = − ∫∫ f̂ (x, y) log2 f̂ (x, y) dx dy (3.30)
Figure 3.6 Comparison of entropies 9 and 10 for Bach, Schumann, and Scri-
abin/Martin.
Figures 3.7, and 3.9 through 3.11 show the “omnibus” metric, melodic,
and harmonic weight functions for Bach’s Canon cancricans, Schumann’s
op. 15/2 and 7, and for Webern's Variations op. 27. For Bach's composi-
tion, the almost perfect symmetry around the middle of the composition
can be seen. Moreover, the metric curve exhibits a very regular up-and-
down movement. Schumann's curves, in particular the melodic one, show clear pe-
riodicities. This appears to be quite typical for Schumann and becomes
even clearer when plotting a kernel-smoothed version of the curves (here
a bandwidth of 8/8 was used). Interestingly, this type of pattern can also
be observed for Webern. In view of the historic development of 12-tone
music as a logical continuation of harmonic freedom and romantic ges-
ture achieved in the 19th and early 20th centuries, this similarity is not
completely unexpected. Finally, note that a relationship between metric,
Figure 3.7 Metric, melodic, and harmonic global indicators for Bach’s Canon
cancricans.
melodic and harmonic structure cannot be seen directly from the “raw”
curves. However, smoothed weights as shown in the figures above reveal
clear connections between the three weight functions. This is even the case
for Webern, in spite of the absence of tonality.
As gn → eitλ , the integrals In converge to a random variable I, in the sense
that

limn→∞ E[(I − In )2 ] = 0.

The random variable I is then denoted by ∫ exp(itλ)dZ(λ). The spectral
This means that the variance is a sum of contributions that are due to the
frequencies λj (1 ≤ j ≤ k). A sample path of Xt cannot be distinguished
from a deterministic periodic function because, once a sample path is
realized, the randomly selected amplitudes Aj are fixed.
Finally, it should be noted that not all frequencies are observable when
observations are taken at discrete time points t = 1, 2, ..., n. The smallest
identifiable period is 2, which corresponds to a highest observable frequency
of 2π/2 = π. The largest identifiable period is n/2, which corresponds to
the smallest frequency 4π/n. As n increases, the lowest frequency tends to
zero, but the highest does not change. In other words, the resolution at the
highest frequencies does not improve with increasing sample size.
To obtain more general models, one may wish to relax the condition
of stationarity. An asymptotic concept of local stationarity is defined in
Dahlhaus (Dahlhaus 1996a,b, 1997): a sequence of stochastic processes Xt,n
with “=” meaning almost sure (a.s.) equality, µ(u) continuous, and there
exists a 2π−periodic function A : [0, 1] ×R → C such that A(u, −λ) =
Ā(u, λ), A(u, λ) is continuous in u, and
sup_{t,λ} |A(t/n, λ) − At,n (λ)| ≤ cn−1 (4.15)
(a.s.) for some constant c < ∞. Intuitively, this means that for n large
enough, the observed process can be approximated locally in a small time
window t ± ε by the stationary process ∫ exp(itλ)A(t/n, λ) dZX (λ). The
order n−1 of the approximation is chosen such that most standard estima-
tion procedures, such as maximum likelihood estimation, can be applied
locally and their usual properties (e.g. consistency, asymptotic normality)
still hold. Under smoothness conditions on A one can prove that a mean-
ingful “evolving” spectral density fX (u, λ) (u ∈ (0, 1)) exists such that
fX (u, λ) = limn→∞ (2π)−1 Σ_{k=−∞}^{∞} cov(X[u·n−k/2],n , X[u·n+k/2],n ) e−ikλ (4.16)

The function fX (u, λ) is called evolutionary spectral density. Note that, for
fixed u,

limn→∞ cov(X[u·n−k/2],n , X[u·n+k/2],n ) = γX (k) = ∫_{−π}^{π} exp(ikλ)fX (u, λ)dλ.
Thumfart (1995) carries this concept over to series with discrete spectra.
A simplified definition can be given as follows: a sequence of stochastic
processes Xt,n (n ∈ N ) is said to have a discrete evolutionary spectrum
FX (u, λ), if
t " t t
Xt,n = µ( ) + Aj ( )eiλj ( n )t (4.17)
n n
j∈M
The reason why the frequency range extends to (−∞, ∞), instead of [−π, π],
is that in continuous time, by definition, arbitrarily small frequencies are
observable.
Suppose now that Yτ is observed at discrete time points t = j · ∆τ , i.e.
we observe
Xt = Yj·∆τ (4.21)
Then we can write

Xt = ∫_{−∞}^{∞} eij(∆τ λ) dZY (λ) = Σ_{u=−∞}^{∞} ∫_{−π/∆τ +(2π/∆τ )u}^{π/∆τ +(2π/∆τ )u} eij(∆τ λ) dZY (λ) (4.22)

= Σ_{u=−∞}^{∞} ∫_{−π/∆τ}^{π/∆τ} eij(∆τ λ) dZY (λ + (2π/∆τ )u) = ∫_{−π/∆τ}^{π/∆τ} eitλ dZX (λ) (4.23)

where

dZX (λ) = Σ_{u=−∞}^{∞} dZY (λ + (2π/∆τ )u) (4.24)

Moreover, if Yτ has spectral density fY , then the spectral density of Xt is

fX (λ) = Σ_{u=−∞}^{∞} fY (λ + (2π/∆τ )u) (4.25)

for λ ∈ [−π/∆τ , π/∆τ ]. This result can be interpreted as follows: a frequency λ >
π/∆τ can be written as λ = λo − (2π/∆τ )j for some j ∈ N where λo is in
the interval [−π/∆τ, π/∆τ ]. The contributions of the two frequencies λ and
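The aliasing identity behind (4.25) can be checked numerically: a frequency above π/∆τ produces exactly the same samples as its alias in [−π/∆τ, π/∆τ]. (A small illustrative sketch; the numbers are arbitrary.)

```python
import math

dtau = 0.1                        # sampling interval Delta-tau
nyq = math.pi / dtau              # highest observable frequency pi / dtau
lam = nyq + 5.0                   # a frequency above the observable range
lam_o = lam - 2 * math.pi / dtau  # its alias, lying in [-pi/dtau, pi/dtau]

high = [math.cos(lam * j * dtau) for j in range(20)]
alias = [math.cos(lam_o * j * dtau) for j in range(20)]
# At the sampling times j*dtau the two frequencies are indistinguishable,
# since their arguments differ by integer multiples of 2*pi:
print(max(abs(a - b) for a, b in zip(high, alias)) < 1e-9)  # True
```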
To eliminate a certain frequency band [a, b] one thus needs a linear filter
such that A(λ) ≡ 0 in this interval.
Equation (4.27) also helps to construct and simulate time series mod-
els with desired spectral densities: a series with spectral density fY (λ) =
(2π)−1 |A(λ)|2 can be simulated by passing a series of independent obser-
vations Xt through the filter A(λ). Note that, in reality, one can use only
a finite number of terms in the filter so that only an approximation can be
achieved.
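A minimal sketch of this simulation idea (our own illustration): white noise passed through the two-term filter with coefficients (1, 0.8) yields a series with spectral density (2π)−1 |1 + 0.8 e−iλ |2 , i.e. an MA(1) process with autocovariances γ(0) = 1.64 and γ(1) = 0.8.

```python
import random

random.seed(7)
a = [1.0, 0.8]    # filter coefficients a_0, a_1
n = 50000
eps = [random.gauss(0, 1) for _ in range(n + 1)]
# Pass the white noise through the finite (truncated) filter:
y = [a[0] * eps[t] + a[1] * eps[t - 1] for t in range(1, n + 1)]

# Compare sample autocovariances with the theoretical values 1.64 and 0.8:
m = sum(y) / n
g0 = sum((v - m) ** 2 for v in y) / n
g1 = sum((y[t] - m) * (y[t + 1] - m) for t in range(n - 1)) / n
print(round(g0, 1), round(g1, 1))  # approximately 1.6 and 0.8
```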
with the generalized binomial coefficient

(d choose k) = Γ(d + 1)/(Γ(k + 1)Γ(d − k + 1)).

The spectral density of (1 − B)m Xt is

fX (λ) = (σε2 /2π) |1 − eiλ |−2δ |ψ(eiλ )|2 /|ϕ(eiλ )|2 . (4.37)
The fractional differencing parameter δ plays an important role. If δ =
0, then (1 − B)m Xt is an ordinary ARIMA(p, 0, q) process, with spectral
density such that fX (λ) converges to a finite value fX (0) as λ → 0 and
the covariances decay exponentially, i.e. |γX (k)| ≤ Cak for some 0 < C <
∞, 0 < a < 1. The process is therefore said to have short memory. For
δ > 0, fX has a pole at the origin of the form fX (λ) ∝ λ−2δ as λ → 0, and
γX (k) ∝ k2δ−1 so that

Σ_{k=−∞}^{∞} γX (k) = ∞.
This case is also known as long memory, since autocorrelations decay very
slowly (see Beran 1994). On the other hand, if δ < 0, then fX (λ) ∝ λ−2δ
converges to zero at the origin and

Σ_{k=−∞}^{∞} γX (k) = 0.
This is called antipersistence, since for large lags there is a negative corre-
lation. The fractional differencing parameter δ, or d = δ + m, is also called
long-memory parameter, and is related to the fractal or Hausdorff dimen-
sion dH (see Chapter 3). For an extended discussion of long-memory and
antipersistent processes see e.g. Beran (1994) and references therein.
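To make the hyperbolic decay concrete, the coefficients of (1 − B)δ = Σk πk Bk can be computed by the standard recursion π0 = 1, πk = πk−1 (k − 1 − δ)/k; they decay like k−δ−1, in contrast to the exponential decay of ARMA weights. (An illustrative sketch, not taken from the text.)

```python
def frac_diff_weights(delta, n):
    """Coefficients pi_k of the fractional difference operator (1 - B)**delta,
    via the recursion pi_0 = 1, pi_k = pi_{k-1} * (k - 1 - delta) / k."""
    w = [1.0]
    for k in range(1, n):
        w.append(w[-1] * (k - 1 - delta) / k)
    return w

delta = 0.3
w = frac_diff_weights(delta, 1001)
print(w[1])  # -delta = -0.3
# Hyperbolic decay pi_k ~ const * k**(-delta - 1): doubling k multiplies
# the coefficient by roughly 2**(-delta - 1).
print(round(w[1000] / w[500], 3), round(2 ** (-delta - 1), 3))  # both about 0.406
```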
where Ut is stationary.
9. Harmonic or seasonal trend model:

Xt = Σ_{j=0}^{p} αj cos λj t + Σ_{j=0}^{p} βj sin λj t + Ut (4.42)

with Ut stationary
10. Nonparametric trend model:

Xt,n = g(t/n) + Ut (4.43)

with g : [0, 1] → R a “smooth” function (e.g. twice continuously differen-
tiable) and Ut stationary.
11. Semiparametric fractional autoregressive model, SEMIFAR(p, d, q) (Be-
ran 1998, Beran and Ocker 1999, 2001, Beran and Feng 2002a,b):
(1 − B)δ ϕ(B){(1 − B)m Xt − g(st )} = Ut (4.44)
where h is the joint density function of (X1 , ..., Xn ). If observations are dis-
crete, then h is the joint probability P (X1 = x1 , ..., Xn = xn ). Equivalently,
we may maximize the log-likelihood L(x1 , ..., xn ; θ) = log h(x1 , ..., xn ; θ).
Under fairly general regularity conditions, θ̂ is asymptotically consistent, in
the sense that it converges in probability to θo . In other words, limn→∞ P (|θ̂ −
θo | > ε) = 0 for all ε > 0. In the case of a Gaussian time series with spectral
density fX (λ; θ), we have
L(x1 , ..., xn ; θ) = −(1/2)[n log 2π + log |Σn | + (x − x̄)t Σn−1 (x − x̄)] (4.46)
where x = (x1 , ..., xn )t , x̄ = x̄ · (1, 1, ..., 1)t , and |Σn | is the determinant of
the covariance matrix of (X1 , ..., Xn )t with elements [Σn ]ij = cov(Xi , Xj ).
Since under general conditions n−1 log |Σn | converges to (2π)−1 times the
integral of log fX (Grenander and Szegö 1958), and the (j, l)th element of
Σn−1 can be approximated by ∫ fX−1 (λ) exp{i(j − l)λ}dλ, an approximation
to θ̂ can be obtained by the so-called Whittle estimator θ̃ (Whittle 1953;
also see e.g. Fox and Taqqu 1986, Dahlhaus 1987) that minimizes
Ln (θ) = (4π)−1 ∫_{−π}^{π} [log fX (λ; θ) + I(λ)/fX (λ; θ)] dλ (4.47)
An alternative approximation for Gaussian processes is obtained by using
an autoregressive representation of the type Xt = Σ_{j=1}^{∞} bj Xt−j + ϵt , where
ϵt are independent identically distributed zero mean normal variables with
variance σϵ2 . This leads to minimizing the sum of the squared residuals as
explained below in Equation (4.50) (see e.g. Box and Jenkins 1970, Beran
1995).
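To fix ideas, here is a toy version of the Whittle estimator (our own sketch, with all names ours): (4.47) is discretized over the Fourier frequencies and minimized over a grid, for an AR(1) model with the innovation variance treated as known for simplicity.

```python
import cmath, math, random

random.seed(3)
# Simulate an AR(1) series X_t = phi X_{t-1} + eps_t with phi = 0.6
phi_true, n = 0.6, 512
x, prev = [], 0.0
for _ in range(n + 100):          # 100 burn-in values are discarded
    prev = phi_true * prev + random.gauss(0, 1)
    x.append(prev)
x = x[100:]

def periodogram(x):
    """Periodogram I(lambda_j) at the Fourier frequencies 2*pi*j/n."""
    n = len(x)
    out = []
    for j in range(1, n // 2):
        lam = 2 * math.pi * j / n
        d = sum(x[t] * cmath.exp(-1j * lam * t) for t in range(n))
        out.append((lam, abs(d) ** 2 / (2 * math.pi * n)))
    return out

I = periodogram(x)

def whittle(phi):
    """Whittle objective (4.47), discretized over Fourier frequencies, for the
    AR(1) spectral density f(lam) = (1/2pi) / |1 - phi e^{-i lam}|^2."""
    tot = 0.0
    for lam, Ij in I:
        f = (1 / (2 * math.pi)) / abs(1 - phi * cmath.exp(-1j * lam)) ** 2
        tot += math.log(f) + Ij / f
    return tot

grid = [k / 100 for k in range(-95, 96)]
phi_hat = min(grid, key=whittle)
print(phi_hat)  # close to 0.6
```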
and σ̂ε2 = n−1 Σ e2t (η̂). Under mild regularity conditions, as n tends to
infinity, the distribution of √n(θ̂ − θ) tends to a normal distribution N (0, V )
with covariance matrix V = 2B −1 where B is a p × p matrix with
elements
Bij = (2π)−1 ∫_{−π}^{π} [∂/∂θi log f (λ; θ)][∂/∂θj log f (λ; θ)] dλ
(see e.g. Box and Jenkins 1970, Beran 1995).
The estimation method above assumes that the order of the model, i.e.
the length p of the parameter vector θ, is known. This is not the case in
general so that p has to be estimated from data. Information theoretic con-
siderations (based on definitions discussed in Section 3.1) lead to Akaike’s
famous criterion (AIC; Akaike 1973a,b)
p̂ = arg minp {−2 log(likelihood) + 2p} (4.51)
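A small illustration of (4.51) (our own sketch): for Gaussian AR models, −2 log(likelihood) equals n log σ̂2 up to an additive constant, where σ̂2 is the innovation variance of the fitted AR(p); the Levinson-Durbin recursion delivers σ̂2 directly from the sample autocovariances.

```python
import math, random

random.seed(5)
# Simulate an AR(2) series X_t = 0.5 X_{t-1} - 0.3 X_{t-2} + eps_t
n = 2000
x, x1, x2 = [], 0.0, 0.0
for _ in range(n + 100):
    xt = 0.5 * x1 - 0.3 * x2 + random.gauss(0, 1)
    x.append(xt)
    x1, x2 = xt, x1
x = x[100:]

def acov(x, k):
    """Sample autocovariance at lag k (divisor n)."""
    m = sum(x) / len(x)
    return sum((x[t] - m) * (x[t + k] - m) for t in range(len(x) - k)) / len(x)

def innovation_variance(gamma, p):
    """Levinson-Durbin recursion: innovation variance of the Yule-Walker AR(p) fit."""
    phi, v = [], gamma[0]
    for k in range(1, p + 1):
        kk = (gamma[k] - sum(phi[j] * gamma[k - 1 - j] for j in range(k - 1))) / v
        phi = [phi[i] - kk * phi[k - 2 - i] for i in range(k - 1)] + [kk]
        v *= 1 - kk * kk
    return v

gamma = [acov(x, k) for k in range(8)]
# AIC = -2 log(likelihood) + 2p reduces, up to a constant, to
# n log(sigma_hat^2) + 2p for Gaussian AR models:
aic = {p: len(x) * math.log(innovation_variance(gamma, p)) + 2 * p
       for p in range(1, 7)}
print(min(aic, key=aic.get))  # typically selects the true order 2
```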
The bias depends only on the function g, and is thus independent of the
error process. The variance, on the other hand, is a function of the co-
variances γU (k) = cov(Ut , Ut+k ), or equivalently the spectral density fU .
with Ut stationary. Note that, theoretically, this model can also be un-
derstood as a stationary process with jumps in the spectral distribution
FX (see Section 4.2.1). Given ω = (ω1 , ..., ωp )t , the parameter vector θ =
(α1 , ..., αp , β1 , ..., βp )t can be estimated by the least squares or, more gen-
erally, weighted least squares method,
θ̂ = arg minθ Σ_{t=1}^{n} w(t/n)[xt − Σ_{j=1}^{p} (αj cos ωj t + βj sin ωj t)]2 (4.61)
and

β̂j = [Σ_{t=1}^{n} w(t/n) xt sin ω̂j t] / [Σ_{t=1}^{n} w(t/n)] (4.64)
Note that (4.64) means that we look for the k largest peaks in the (w-
tapered) periodogram. Under quite general assumptions, the asymptotic
distribution of the estimates can be shown to be as follows: the vectors
Zn,j = [√n(α̂j − αj ), √n(β̂j − βj ), n3/2 (ω̂j − ωj )]t
(j = 1, ..., p) are asymptotically mutually independent, each having a 3-
dimensional normal distribution with expected value zero and covariance
matrix C(ωj ) that depends on fU (ωj ) and the weight function w. The
formulas for C are as follows (Irizarry 1998, 2000, 2001, 2002):
C(ωj ) = [4πfU (ωj )/(α2j + βj2 )] V (ωj ) (4.65)
Figure 4.2 Zoomed piano sound wave – shaded area in Figure 4.1.
spectrogram)

I(t, λ) = [2π Σ_{j=1}^{n} W2 ((t − j)/(nb))]−1 |Σ_{j=1}^{n} W ((t − j)/(nb)) e−iλj xj |2 (4.81)
where W : R → R+ is a weight function such that W (u) = 0 for |u| > 1 and
b > 0 is a bandwidth that determines how large the window (block) is, i.e.
how many consecutive observations are considered to correspond approx-
imately to a harmonic regression model with fixed coefficients αj , βj and
stationary noise Ut . This is illustrated in color Figure 4.7 for a harpsichord
sound, with W (u) = 1{|u| ≤ 1}. Intense pink corresponds to high values of
I(t, λ). Figures 4.6a through d show explicitly the change in I(t, λ) between
four different blocks. Since the note was played “staccato”, the sound wave
is very short, namely about 0.1 seconds. Nevertheless, there is a change in
the spectrum of the sound, with some of the higher harmonics fading away.
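The moving-window periodogram (4.81) with the rectangular weight W(u) = 1{|u| ≤ 1} is easy to implement directly (a toy sketch on a synthetic signal, not the harpsichord data; all names are ours):

```python
import cmath, math

def local_periodogram(x, t, nb, lam, W=lambda u: 1.0 if abs(u) <= 1 else 0.0):
    """Moving-window ("spectrogram") estimate I(t, lambda) of Equation (4.81):
    a periodogram computed from the weighted block of observations around t."""
    num = sum(W((t - j) / nb) * cmath.exp(-1j * lam * j) * x[j]
              for j in range(len(x)))
    den = 2 * math.pi * sum(W((t - j) / nb) ** 2 for j in range(len(x)))
    return abs(num) ** 2 / den

# A toy signal whose frequency changes halfway: the local periodogram at
# lam = 0.3 is large in the first half and small in the second.
x = ([math.cos(0.3 * t) for t in range(500)]
     + [math.cos(1.2 * t) for t in range(500, 1000)])
early = local_periodogram(x, 250, 100, 0.3)
late = local_periodogram(x, 750, 100, 0.3)
print(early > 10 * late)  # True
```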
Apart from the relative amplitudes of partials, most musical sounds in-
Figure 4.6 Harpsichord sound – periodogram plots for different time frames (mov-
ing windows of time points).
musical perception, the clarinet player was not out of tune, because the
deviation from 441 Hz was less than 0.76 Hz, which corresponds to 0.03
semitones. According to experimental studies, the human ear cannot
distinguish notes that are 0.03 semitones apart (Pierce 1983/1992).
2. Physical models (see e.g. Fletcher and Rossing 1991) postulate the fol-
lowing relationships between the fundamental frequency and partials:
for a “harmonic instrument” such as the clarinet, one expects
ωj = j · ω1 ,
whereas for a “plucked string instrument”, such as the guitar, one should
have
ωj ≈ c j2 · ω1
where c is a constant determined by properties of the strings. The ex-
periment described in Irizarry (2001) supports the assumption for the
clarinet in the sense that, in general, the 95%-confidence intervals for
the difference ωj − jω1 contained 0. For the guitar, his findings suggest
a relationship of the form ωj ≈ c(a + j)2 ω1 with a ≠ 0.
which Slaney and Lyon call “correlogram”. This is in fact an estimated local
autocovariance at lag k for section j and the time-segment with midpoint
u. The “Slaney-Lyon-correlogram” thus essentially characterizes the local
autocovariance structure of the resulting nerve impulse series. Thumfart
(1995) shows formally how, and under which conditions, this model can
be defined within the framework of processes with a discrete evolutionary
spectrum. He also suggests a simple method for estimating pitch (the fun-
damental frequency) at local time u by setting ω̂1 (u) = 2π/kmax (u) where
kmax (u) = arg maxk C(k, u) and C(k, u) = Σ_{j=1}^{86} c(k, j, u).
In other words, ω̃1 corresponds to the Fourier frequency where the first
peak of the periodogram occurs. Because of the restriction to Fourier fre-
quencies, the periodogram may have two adjacent peaks and the estimate is
in general too inaccurate. An empirical interpolation formula is suggested
by the authors to obtain an improved estimate ω̂1 . A comparison with har-
monic regression is not made, however, so that it is not clear how well the
interpolation works in comparison.
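The Fourier-grid pitch estimate described above can be sketched as follows (our own toy version; the grid spacing 2π/n bounds the attainable accuracy, which is what motivates the interpolation step):

```python
import cmath, math

def fourier_pitch(x):
    """Pitch estimate restricted to the Fourier frequencies 2*pi*j/n: the
    frequency of the periodogram's largest peak (the largest peak stands in
    for the 'first peak' here, since the fundamental dominates)."""
    n = len(x)
    I = []
    for j in range(1, n // 2):
        lam = 2 * math.pi * j / n
        d = sum(x[t] * cmath.exp(-1j * lam * t) for t in range(n))
        I.append(abs(d) ** 2)
    jmax = max(range(len(I)), key=lambda j: I[j])
    return 2 * math.pi * (jmax + 1) / n

# Harmonic signal with fundamental 0.6 and weaker partials at 1.2 and 1.8:
n = 400
x = [math.sin(0.6 * t) + 0.5 * math.sin(1.2 * t) + 0.2 * math.sin(1.8 * t)
     for t in range(n)]
print(round(fourier_pitch(x), 3))  # within the grid spacing 2*pi/400 of 0.6
```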
Given a procedure for pitch identification, an automatic note separation
procedure can be defined. This is a procedure that identifies time points in
a sound signal where a new note starts. The interesting result in Weihs et
al. is that automatic note separation works better for amateur singers than
for professionals. The reason may be the absence of vibrato in amateur
voices. In a third step, Weihs et al. address the question of how to as-
sess computationally the purity of intonation based on a vocal time series.
This is done using discriminant analysis. The discussion of these results is
therefore postponed to Chapter 9.
Figure 4.8 A harpsichord sound wave (a), logarithm of squared amplitudes (b),
histogram of the series (c) and its periodogram on log-scale (d) together with fitted
SEMIFAR-spectrum.
are similar to before, namely dˆ = 0.51 ([0.20, 0.81]) for Bach and 0.33
([0.24, 0.42]) for Paganini.
Hierarchical methods
Suppose that we have two time series Yt , Xt and we wish to model the re-
lationship between Yt and Xt . The simplest model is simple linear regression
Yt = βo + β1 Xt + εt (5.1)
Yt,2 = β02 + Σ_{j=1}^{M} βj2 Xt,j + εt,2

...

Yt,L = β0L + Σ_{j=1}^{M} βjL Xt,j + εt,L .
The collection of time series {X1,j , ..., Xn,j } (j = 1, ..., M ) is called a hier-
archical decomposition of Xt . The HIREG-model is then defined by (5.2).
If εt (t = 1, 2, ...) are independent, then usual techniques of multiple linear
regression can be used (see e.g. Plackett 1960, Rao 1973, Ryan 1996, Sri-
vastava and Sen 1997, Draper and Smith 1998). In case of correlated errors
εt , appropriate adjustments of tests, confidence intervals, and parameter
selection techniques must be made. The main assumption in the HIREG
model is that we know which bandwidths to use. In some cases this may
indeed be true. For instance, if there is a three-four meter at the begin-
ning of a musical score, then bandwidths that are divisible by three are
plausible.
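The HIREG idea can be sketched in a few lines (a hypothetical illustration, ours, with moving averages standing in for the bandwidth-b smoothers): the response is regressed jointly on several smoothed versions of the same explanatory series.

```python
import random

random.seed(2)

def moving_average(x, b):
    """Simple moving-average smoother with half-width b (a stand-in for a
    kernel smoother with bandwidth b)."""
    return [sum(x[max(0, t - b):t + b + 1]) / len(x[max(0, t - b):t + b + 1])
            for t in range(len(x))]

n = 300
x = [random.gauss(0, 1) for _ in range(n)]
# Hypothetical hierarchical decomposition: e.g. beat level and bar level.
x3, x9 = moving_average(x, 3), moving_average(x, 9)
# The response depends on the bar-level component only:
y = [2.0 * x9[t] + random.gauss(0, 0.1) for t in range(n)]

# Ordinary least squares of y on (1, x3, x9) via the normal equations:
cols = [[1.0] * n, x3, x9]
A = [[sum(c1[t] * c2[t] for t in range(n)) for c2 in cols] for c1 in cols]
b = [sum(c[t] * y[t] for t in range(n)) for c in cols]
for c in range(3):                               # Gaussian elimination
    piv = max(range(c, 3), key=lambda r: abs(A[r][c]))
    A[c], A[piv] = A[piv], A[c]
    b[c], b[piv] = b[piv], b[c]
    for r in range(c + 1, 3):
        f = A[r][c] / A[c][c]
        A[r] = [u - f * v for u, v in zip(A[r], A[c])]
        b[r] -= f * b[c]
beta = [0.0] * 3
for c in reversed(range(3)):                     # back substitution
    beta[c] = (b[c] - sum(A[c][j] * beta[j] for j in range(c + 1, 3))) / A[c][c]
print([round(v, 1) for v in beta])  # roughly [0.0, 0.0, 2.0]
```

The regression correctly attributes the dependence to the bar-level regressor, which is exactly what the HIREG decomposition is meant to reveal.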
K((t − ti )/b)
for j = M +1, ..., 2M. Under suitable assumptions, the estimate θ̂ is asymp-
totically normal. More specifically, set
hi (t; θo ) = g(t; bi ) (i = 1, ..., M ) (5.13)
Then, as n → ∞, lim inf |ar − as | > 0, and lim inf |br − bs | > 0 for all
r ≠ s ∈ {1, ..., M }.
(A3) x(ti ) = ξ(ti ) where ξ : [0, T ] → R is a function in C[0, T ], T < ∞.
(A4) The set of time points converges to a set A that is dense in [0, T ].
Then we have (Beran and Mazzola 1999b):
Theorem 12 Let Θ1 and Θ2 be compact subsets of R and R+ respectively,
Θ = (Θ1 )M × (Θ2 )M and let η = (1/2) min{1, 1 − 2d}. Suppose that (A1), (A2), (A3)
where

ak = < g, ϕk > = ∫ g(x)ϕk (x)dx (5.25)

and

bjk = < g, ψjk > = ∫ g(x)ψjk (x)dx (5.26)

wavelet decomposition g(t) = Σk ak ϕk (t) + Σj,k bj,k ψj,k (t). For 0 = cM +1 <
cM < ... < c1 < co = ∞ let

g(t; ci−1 , ci ) = Σ_{ci ≤|ak |<ci−1} ak ϕk (t) + Σ_{ci ≤|bj,k |<ci−1} bj,k ψj,k (t).
The definition means that the time series Y (t) is decomposed into orthog-
onal components that are proportional to certain “bands” in the wavelet
decomposition of the explanatory series X(t) – the bands being defined by
the size of wavelet coefficients. As for HISMOOTH models, the parameter
vector θ = (β, η)t can be estimated by nonlinear least squares regression.
To illustrate how HIWAVE-models may be used, consider the following
simulated example: let xi = g(ti ) (i = 1, ..., 1024) as in the previous ex-
ample. The function g is decomposed into g(t) = g(t; ∞, η1 ) + g(t; η1 , 0) =
g1 (t) + g2 (t) where η1 is such that 50 wavelet coefficients of g are larger
than or equal to η1 . Figure 5.2 shows g, g1 , and g2 . A simulated series of response
variables, defined by Y (ti ) = 2g1 (ti ) + εi (t = 1, ..., 1024) with independent
zero-mean normal errors εi with variance σε2 = 100, is shown in Figure 5.3b.
A comparison of the two scatter plots in Figures 5.3c and d shows a much
clearer dependence between y and g1 as compared to y versus x = g. Figure
5.3e illustrates that there is no relationship between y and g2 . Finally, the
time-frequency plot in Figure 5.3f indicates that the main periodic behavior
occurs for t ∈ {701, ..., 900}. The difficulty in practice is that the correct
decomposition of x into g1 and the redundant component g2 is not known
a priori. Figure 5.4 shows y and the HIWAVE-curve β̂o + β̂1 g(ti ; ∞, η̂1 ) (for
graphical reasons the fitted curve is shifted vertically) fitted by nonlinear
least squares regression. Apparently, the algorithm identified η̂1 and hence
the relevant time span [701, 900] quite exactly, since g(ti ; ∞, η̂1 ) corresponds
to the sum of the largest 51 wavelet components. The estimated coefficients
are β̂o = −0.36 and β̂1 = 1.95. If we assume (incorrectly of course) that
η̂1 has been known a priori, then we can give confidence intervals for both
parameters as in linear least squares regression. These intervals are gen-
erally too short, since they do not take into account that η̂1 is estimated.
However, if a null hypothesis is not rejected using these intervals, then it
will not be rejected by the correct test either. In our case, the linear re-
gression confidence intervals for βo and β1 are [−0.96, 0.24] and [1.81, 2.09]
respectively, and thus contain the true values βo = 0 and β1 = 2.
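The decomposition by coefficient size can be sketched with a hand-rolled Haar transform (our own illustration; the function g of Figure 5.2 is replaced by a toy step-plus-wiggle signal): keeping the largest coefficients recovers the band g(t; ∞, η1), while the discarded band carries the small oscillations.

```python
import math

def haar_transform(x):
    """Orthonormal Haar wavelet transform of a signal of length 2**J."""
    coeffs, s = [], list(x)
    while len(s) > 1:
        d = [(s[2 * i] - s[2 * i + 1]) / math.sqrt(2) for i in range(len(s) // 2)]
        s = [(s[2 * i] + s[2 * i + 1]) / math.sqrt(2) for i in range(len(s) // 2)]
        coeffs = d + coeffs          # store details coarse-to-fine
    return s + coeffs                # [overall scaling coefficient, details...]

def haar_inverse(c):
    s, rest = [c[0]], c[1:]
    while rest:
        d, rest = rest[:len(s)], rest[len(s):]
        s = [v for a, b in zip(s, d)
             for v in ((a + b) / math.sqrt(2), (a - b) / math.sqrt(2))]
    return s

def largest_band(x, keep):
    """Reconstruction from the `keep` largest wavelet coefficients: the
    analogue of the band g(t; infinity, eta_1)."""
    c = haar_transform(x)
    thresh = sorted((abs(v) for v in c), reverse=True)[keep - 1]
    return haar_inverse([v if abs(v) >= thresh else 0.0 for v in c])

# Step of height 10 plus a small oscillation:
x = [(10.0 if t < 64 else 0.0) + math.sin(2.1 * t) for t in range(128)]
g1 = largest_band(x, 4)
# The 4 largest coefficients capture the step; the residual is the wiggle.
print(round(sum(g1[:64]) / 64, 1), round(sum(g1[64:]) / 64, 1))  # about 10.0 and 0.0
```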
two (Figure 5.12), five (Figure 5.13) and ten (Figure 5.14) best basis func-
tions. The plots show interesting and plausible similarities and differences.
Particularly striking are Cortot’s 4-bar oscillations, Horowitz’s “seismic”
local fluctuations, the relatively unbalanced tempo with a few extreme
tempo variations for Eschenbach, Klien, Ortiz, and Schnabel, the irregular
shapes for Moisewitsch, and also a strong similarity between Horowitz1 and
Moisewitsch with respect to the general shape (Figure 5.12).
Figure 5.11 Wavelet coefficients for Cortot’s and Horowitz’s three performances.
Figure 5.12 Tempo curves – approximation by most important 2 best basis func-
tions.
Figure 5.13 Tempo curves – approximation by most important 5 best basis func-
tions.
Figure 5.14 Tempo curves – approximation by the 10 most important best basis
functions.
and
p_j^(n) = P(X_{t+n} = j) = [π^t M^n]_j. (6.5)
This implies
q_jj = 0 for f_jj < 1
and
q_jj = 1 for f_jj = 1.
A simple way of checking whether a state is persistent or not is given by
Theorem 13 The following holds for a Markov chain:
i) A state j is transient ⇔ q_jj = 0 ⇔ Σ_{n=1}^∞ p_jj^(n) < ∞
ii) A state j is persistent ⇔ q_jj = 1 ⇔ Σ_{n=1}^∞ p_jj^(n) = ∞.
The condition on Σ_{n=1}^∞ p_ii^(n) can be simplified further for irreducible Markov
chains:
Definition 43 A Markov chain is called irreducible, if for each i, j ∈ S,
p_ij^(n) > 0 for some n.
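Irreducibility can be checked numerically: p_ij^(n) > 0 for some n exactly when state j is reachable from state i in the directed graph with an edge i → j wherever p_ij > 0. A minimal sketch (the example chain is an illustrative assumption, not from the text):

```python
import numpy as np

def is_irreducible(M):
    """True if for each pair (i, j) there is some n with p_ij^(n) > 0,
    i.e. the directed graph of positive transition probabilities is
    strongly connected."""
    k = M.shape[0]
    # adjacency matrix plus self-loops; powers count walks of length <= k
    A = (M > 0).astype(np.int64) + np.eye(k, dtype=np.int64)
    R = np.linalg.matrix_power(A, k)
    return bool(np.all(R > 0))

# Once the chain leaves state 2 it can never return, so it is reducible:
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.5, 0.0],
              [0.2, 0.2, 0.6]])
```

For this M, `is_irreducible(M)` is False, whereas any chain whose graph of positive entries is strongly connected returns True.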
Irreducibility means that wherever we start, any state j can be reached in
due time with positive probability. This excludes the possibility of being
caught forever in a certain subset of S. With respect to persistent and tran-
sient states, the situation simplifies greatly for irreducible Markov chains:
Theorem 14 Suppose that Xt (t = 0, 1, ...) is an irreducible Markov chain.
Then one of the following possibilities is true:
or in matrix form,
π t M = π. (6.10)
This means that if we start with distribution π, then the distribution of all
subsequent X_t's is again π.
The next question is to what extent the initial distribution influences the
dynamic behavior (probability distribution) into the infinite future. A possible
complication is that the process may be periodic in the sense that one
may return to certain states periodically:
Definition 45 A state j is said to have period τ , if
p_jj^(n) > 0
implies that n is a multiple of τ .
For an irreducible Markov chain, all states have the same period. Hence,
the following definition is meaningful:
Definition 46 An irreducible Markov chain is called periodic if τ > 1, and
it is called aperiodic if τ = 1.
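The period of a state can also be computed numerically as the greatest common divisor of the return times n with p_jj^(n) > 0. A sketch (truncating the search at n_max is a practical heuristic, not part of the definition; the example chain is hypothetical):

```python
from math import gcd
import numpy as np

def period(M, j, n_max=100):
    """Period of state j: gcd of all n <= n_max with p_jj^(n) > 0."""
    d = 0
    P = np.eye(M.shape[0])
    for n in range(1, n_max + 1):
        P = P @ M          # P is now M^n
        if P[j, j] > 0:
            d = gcd(d, n)  # gcd(0, n) = n, so the first return initializes d
    return d

# A two-state "flip-flop" returns to state 0 only at even times: period 2
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])
```

For this M, `period(M, 0)` is 2; replacing the entries by 0.5 everywhere gives an aperiodic chain with period 1.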
It can be shown that for an aperiodic Markov chain, there is at most one
stationary distribution and, if there is one, then the initial distribution does
not play any role ultimately:
Theorem 15 If Xt (t = 0, 1, ...) is an aperiodic irreducible Markov chain
for which a stationary distribution π exists, then the following holds:
(i) the Markov chain is persistent;
(ii) lim_{n→∞} p_ij^(n) = π_j > 0 for all i, j;
(iii) the stationary distribution π is unique.
In the other case of an aperiodic irreducible Markov chain for which no
stationary distribution exists, we have
lim_{n→∞} p_ij^(n) = 0
for all i, j. Note that this is even the case if the Markov chain is persistent.
One then can classify irreducible aperiodic Markov chains into three classes:
Σ_{n=1}^∞ p_ij^(n) = ∞
and
μ_j = Σ_{n=1}^∞ n f_jj^(n) = ∞
for all i, j, and the average number of steps till the process returns to
state j is given by
μ_j = π_j^{-1}.
For Markov chains with a finite state space, the results simplify further:
Theorem 17 If X_t is an irreducible aperiodic Markov chain with a finite
state space, then the following holds:
(i) X_t is persistent
(ii) a unique stationary distribution π = (π_1, ..., π_k)^t exists and is the so-
lution of
π^t (I − M) = 0,  (0 ≤ π_j ≤ 1, Σ π_j = 1) (6.11)
where I is the k × k identity matrix.
Note that Σ_j M_ij = Σ_j p_ij = 1 so that Σ_j (I − M)_ij = 0, i.e. the matrix
(I − M) is singular. (If this were not the case, then the only solution to the
system of linear equations would be 0 so that no stationary distribution
would exist.) Thus, there are infinitely many solutions of (6.11). However,
there is only one solution that satisfies the conditions 0 ≤ π_j ≤ 1 and
Σ π_j = 1.
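Numerically, a convenient way to impose the constraints is to replace one of the linearly dependent equations in π^t(I − M) = 0 by the normalization Σ π_j = 1. A sketch in NumPy (the two-state chain is an illustrative assumption, not from the text):

```python
import numpy as np

def stationary_distribution(M):
    """Solve pi^t (I - M) = 0 subject to sum(pi) = 1.

    Since (I - M) is singular, one equation is redundant; we replace the
    last one by the normalization constraint.
    """
    k = M.shape[0]
    A = (np.eye(k) - M).T   # (I - M)^t pi = 0, written row-wise
    A[-1, :] = 1.0          # replace last equation by sum(pi_j) = 1
    b = np.zeros(k)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

# A simple two-state chain; its stationary distribution is (0.6, 0.4)
M = np.array([[0.8, 0.2],
              [0.3, 0.7]])
pi = stationary_distribution(M)
```

One can verify that `pi @ M` reproduces `pi`, as required by (6.10).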
and the stationary distribution π of the Markov chain with transition ma-
trix M̂ = (p̂_ij)_{i,j=1,...,11} is estimated by solving the system of linear equa-
tions
π^t (I − M̂) = 0
as described above. Figures 6.3a through l show the resulting values of π̂_j
(joined by lines). For each composition, the components π̂_j are plotted against
j. For visual clarity, points at neighboring states j and j−1 are connected. The
figures illustrate how the characteristic shape of π changed in the course of
the last 500 years. The most dramatic change occurred in the 20th century
with a “flattening” of the peaks. Starting with Scriabin, a pioneer of atonal
music though still rooted in the romantic style of the late 19th century, this
is most extreme for the compositions by Schönberg, Webern, Takemitsu,
and Messiaen. On the other hand, Prokoffieff’s “Visions fugitives” exhibit
clear peaks, but at varying locations. The estimated stationary distributions
can also be used to perform a cluster analysis. Figure 6.4 shows the result
of the single linkage algorithm with the Manhattan norm (see Chapter
10). To make the names legible, only a subsample of the data was used. An
almost perfect separation between Bach and composers from the classical
and romantic periods can be seen.
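The entries p̂_ij of M̂ are the observed relative transition frequencies. A minimal sketch of this estimation step (the toy state sequence is hypothetical; in the text the states are the 11 torus-distance classes):

```python
import numpy as np

def estimate_transition_matrix(states, k):
    """MLE of transition probabilities: p_hat_ij = n_ij / n_i, where
    n_ij counts observed one-step transitions i -> j."""
    counts = np.zeros((k, k))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1.0   # leave never-visited states as zero rows
    return counts / rows

# Hypothetical short sequence over 3 states:
seq = [0, 1, 0, 1, 1, 2, 0, 1, 0]
M_hat = estimate_transition_matrix(seq, 3)
```

Each visited row of `M_hat` sums to one, so `M_hat` is a valid transition matrix whose stationary distribution can then be computed as in (6.11).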
Figure 6.4 Cluster analysis based on stationary Markov chain distributions for
compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rach-
maninoff.
can be observed. The differences are even more visible when comparing in-
dividual composers. This is illustrated in Figures 6.9a and b where Bach’s
and Schumann’s log(π̂1 /π̂3 ) and log(π̂2 /π̂3 ) are compared, and in Figures
6.10a through f where the median and lower and upper quartiles of π̂j are
plotted against j. Finally, Figure 6.11 shows the plots of log(π̂1 /π̂3 ) and
log(π̂2 /π̂3 ) against the date of birth.
country separately. Only 70% of the data are used for estimation. The
remaining 30% are used for validation of a classification rule defined as
follows: a melody is assigned to country j, if the corresponding likelihood
(calculated using the country’s hidden Markov model) is the largest. Not
surprisingly, the authors conclude that the most reliable distinction can be
made between Irish and non-Irish songs.
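The classification rule can be sketched as follows: compute, for each country’s hidden Markov model, the likelihood of the melody via the forward algorithm, and assign the melody to the model with the largest value. The model names and toy parameters below are hypothetical; the study’s actual models are estimated from the folk-song corpora:

```python
import numpy as np

def log_likelihood(obs, pi0, A, B):
    """Scaled forward algorithm: log P(obs | HMM) with initial state
    distribution pi0, transition matrix A, and emission matrix B
    (rows: hidden states, columns: observed symbols)."""
    alpha = pi0 * B[:, obs[0]]
    ll = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        ll += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return ll

def classify(melody, models):
    """Assign the melody to the model (country) with largest likelihood."""
    return max(models, key=lambda c: log_likelihood(melody, *models[c]))

# Hypothetical one-state models: the 'Irish' model favors symbol 0
models = {
    'Irish': (np.array([1.0]), np.array([[1.0]]), np.array([[0.9, 0.1]])),
    'other': (np.array([1.0]), np.array([[1.0]]), np.array([[0.1, 0.9]])),
}
```

With these toy models, a melody consisting mostly of symbol 0 is assigned to 'Irish'.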
Figure 6.6 Comparison of log odds ratios log(π̂_1/π̂_2) of stationary Markov chain
distributions of torus distances, for composers grouped by date of birth (before
1720, 1720–1800, 1800–1880, 1880 and later).
Figure 6.7 Comparison of log odds ratios log(π̂_1/π̂_3) of stationary Markov chain
distributions of torus distances, for composers grouped by date of birth (before
1720, 1720–1800, 1800–1880, 1880 and later).
Figure 6.8 Comparison of log odds ratios log(π̂_2/π̂_3) of stationary Markov chain
distributions of torus distances, for composers grouped by date of birth (before
1720, 1720–1800, 1800–1880, 1880 and later).
Figure 6.9 Comparison of log odds ratios log(π̂_1/π̂_3) and log(π̂_2/π̂_3) of stationary
Markov chain distributions of torus distances.
Figure 6.11 Log odds ratios log(π̂_1/π̂_3) and log(π̂_2/π̂_3) plotted against date of
birth of composer.
Circular statistics
Many phenomena in music are circular. The best known examples are re-
peated rhythmic patterns, the circles of fourths and fifths, and scales mod-
ulo octave in the well-tempered system. In the circle of fourths, for example,
one progresses by steps of a fourth and arrives, after 12 steps, at the ini-
tial starting point modulo octave. It is not immediately clear whether and
how to “calculate” in such situations, and what type of statistical proce-
dures may be used. The theory of circular statistics has been developed
to analyze data on circles where angles have a meaning. Originally, this
was motivated by data in biology (e.g. direction of bird flight), meteorol-
ogy (e.g. direction of wind), and geology (e.g. magnetic fields). Here we
give a very brief introduction, mostly to descriptive statistics. For an ex-
tended account of methods and applications of circular statistics see, for
instance, Mardia (1972), Batschelet (1981), Watson (1983), Fisher (1993),
and Jammalamadaka and SenGupta (2001). In music, circular methods can
be applied to situations where angles measure a meaningful distance be-
tween points on the circle and arithmetic operations in the sense of circular
data are well defined.
C̄_p = C_p/n,  S̄_p = S_p/n,  R̄_p = R_p/n (7.7)
and
ϕ̄(p) = arctan(S_p/C_p) + π·1{C_p < 0} + 2π·1{C_p > 0, S_p < 0} (7.8)
Then
m_p = C̄_p + i S̄_p = R̄_p e^{iϕ̄(p)} (7.9)
is called the pth trigonometric sample moment.
For p = 1, this definition yields
m_1 = C̄_1 + i S̄_1 = R̄_1 e^{iϕ̄(1)}
with C̄_1 = C̄, S̄_1 = S̄, R̄_1 = R̄ and ϕ̄(1) = ϕ̄ as before. Similarly, we have
Definition 52 Let
C_p^o = Σ_{i=1}^n cos p(ϕ_i − ϕ̄(1)),  S_p^o = Σ_{i=1}^n sin p(ϕ_i − ϕ̄(1)) (7.10)
C̄_p^o = C_p^o/n,  S̄_p^o = S_p^o/n (7.11)
ϕ̄^o(p) = arctan(S_p^o/C_p^o) + π·1{C_p^o < 0} + 2π·1{C_p^o > 0, S_p^o < 0} (7.12)
Then
m_p^o = C̄_p^o + i S̄_p^o = R̄_p^o e^{iϕ̄^o(p)} (7.13)
is called the pth centered trigonometric (sample) moment m_p^o, centered rel-
ative to the mean direction ϕ̄(1).
Note, in particular, that Σ_i sin(ϕ_i − ϕ̄(1)) = 0 so that m_1^o = R̄_1. An overview
of descriptive measures of center and variability is given in Table 7.1.
Mean deviation (a measure of variability): D_n = π − (1/n) Σ_{i=1}^n |π − |ϕ_i − M_n||
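For p = 1 the quantities above reduce to the familiar mean direction and mean resultant length. A sketch of their computation; note that `np.arctan2` implements exactly the branch corrections of (7.8), up to the choice of interval:

```python
import numpy as np

def mean_direction_stats(phi):
    """Mean direction phibar and mean resultant length Rbar of a sample
    of angles phi (in radians), following (7.7)-(7.8)."""
    n = len(phi)
    Cbar = np.cos(phi).sum() / n
    Sbar = np.sin(phi).sum() / n
    Rbar = np.hypot(Cbar, Sbar)
    # arctan2 plus mod 2*pi reproduces the indicator corrections in (7.8)
    phibar = np.arctan2(Sbar, Cbar) % (2 * np.pi)
    return phibar, Rbar

# Two angles symmetric about 0: mean direction 0, Rbar = cos(0.1)
phi = np.array([0.1, -0.1])
direction, Rbar = mean_direction_stats(phi)
```

For strongly concentrated samples Rbar is close to 1; for directions spread uniformly over the circle it is close to 0.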
or
r_ϕ(k) = det(n^{-1} Σ_{i=1}^{n−k} x_i x_{i+k}^t) / det(n^{-1} Σ_{i=1}^n x_i x_i^t) (7.18)
F(u) = P(0 ≤ ϕ ≤ u) = (u/2π)·1{0 ≤ u < 2π},
f(u) = F′(u) = (1/2π)·1{0 ≤ u < 2π}.
In this case, µp = ρp = 0, the mean direction µϕ is not defined, and the
circular standard deviation σ and dispersion δ are infinite. This expresses
the fact that there is no preference for any direction and variability is
therefore maximal.
F(u) = [(ρ/π) sin(u − µ) + u/(2π)]·1{0 ≤ u < 2π}
and
f(u) = (1/2π)(1 + 2ρ cos(u − µ))·1{0 ≤ u < 2π}
where 0 ≤ ρ ≤ 1/2. In this case, µ_ϕ = µ, ρ_1 = ρ, µ_p = 0 (p ≥ 1) and
δ = 1/(2ρ^2). An interesting property is that this distribution tends to the
uniform distribution as ρ → 0.
Wrapped distribution:
is the modified Bessel function of the first kind and order 0. In this case,
we have µ_ϕ = µ, ρ_1 = I_1/I_0, δ = (κ I_1/I_0)^{-1}, µ_{p,C} = I_p/I_0 and µ_{p,S} = 0
(p ≥ 1) where
I_p = Σ_{j=0}^∞ [1/((j + p)! j!)] (κ/2)^{2j+p}
is a modified Bessel function of order p. For κ → 0, the M (µ, κ)-distribution
converges to U ([0, 2π)), and for κ → ∞ we obtain a point mass in the
direction µϕ .
Mixture distribution:
All distributions above are unimodal. Distributions with more than one
mode can be modeled, for instance, by mixture distributions
f_ϕ(u) = p_1 f_{ϕ,1}(u) + ... + p_m f_{ϕ,m}(u)
where 0 ≤ p_1, ..., p_m ≤ 1, Σ p_i = 1, and the f_{ϕ,j} are different circular probabil-
ity densities.
• D. Scarlatti: Sonatas Kirkpatrick No. 49, 125, 222, 345, 381, 412, 440,
541
• B. Bartók (Figure 7.1): Bagatelles No. 1–3, Sonata for Piano (2nd move-
ment)
The same as above can be carried out for intervals between successive
notes (Figure 7.5). Figure 7.6 shows that, again, variability is much lower
for Bartók and Prokoffieff.
where Λ = diag(λ_1, λ_2, ..., λ_p) is a diagonal matrix, the λ_j are the eigen-
values, and the columns a^(j) of A are the corresponding orthonormal eigenvec-
tors of B, i.e. we have
Ba^(j) = λ_j a^(j) (8.2)
|a^(j)|^2 = [a^(j)]^t a^(j) = 1, and [a^(j)]^t a^(l) = 0 for j ≠ l (8.3)
In matrix form equation (8.3) means that A is an orthogonal matrix, i.e.
A^t A = I (8.4)
where I denotes the identity matrix with I_jj = 1 and I_jl = 0 (j ≠ l).
This result can now be applied to the covariance matrix of a random
vector X = (X_1, ..., X_p)^t:
Theorem 19 Let X be a p-dimensional random vector with expected value
E(X) = µ and p × p covariance matrix Σ. Then
Σ = AΛA^t (8.5)
where the columns a^(j) of A are eigenvectors of Σ and Λ is a diagonal
matrix with eigenvalues λ_1, ..., λ_p ≥ 0.
In particular, we may permute the sequence of the X-components such that
the eigenvalues are ordered. We thus obtain:
Theorem 20 Let X be a p-dimensional random vector with expected value
E(X) = µ and a p × p covariance matrix Σ. Then there exists an orthogonal
matrix A such that
Σ = AΛA^t (8.6)
where the columns a^(j) of A are eigenvectors of Σ and Λ is a diagonal
matrix with eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0. Moreover, the covariance
matrix of the transformed vector
Z = A^t (X − µ) (8.7)
If λ̂j is large, then the observed jth principal components zj (1), ..., zj (n)
have a large sample variance so that the observed values are scattered far
apart.
Since the eigenvalues λ_i are ordered according to their size, we may there-
fore hope that the proportion of total variation
P(q) = (λ_1 + ... + λ_q) / Σ_{i=1}^p λ_i (8.19)
is close to one for a low value of q. If this is the case, then one may re-
duce the dimension of the random vector considerably without losing much
information. For data, we plot P̂(q) = (λ̂_1 + ... + λ̂_q)/Σ λ̂_i versus q and
judge by eye from which point on the increase in P̂(q) is not worth the
price of adding additional dimensions. Alternatively, we may plot the con-
tribution of each eigenvalue, λ̂_j/Σ λ̂_i or λ̂_j itself, against j. This is the
so-called scree graph. More formal tests, e.g. for testing which eigenvalues
are nonzero or for comparing different eigenvalues, are available, however
mostly under the rather restrictive assumption that the distribution of X
is multivariate normal (see e.g. Mardia et al. 1979, Ch. 8.3.2).
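The quantities P̂(q) of (8.19) follow directly from the eigendecomposition of the sample covariance matrix. A sketch (the simulated data set is purely illustrative):

```python
import numpy as np

def pca_explained(X):
    """Eigenvalues of the sample covariance matrix (descending) and the
    cumulative proportions of total variation P(q) of (8.19)."""
    S = np.cov(X, rowvar=False)
    lam = np.linalg.eigvalsh(S)[::-1]   # eigvalsh returns ascending order
    lam = np.clip(lam, 0.0, None)       # guard against tiny negative values
    P = np.cumsum(lam) / lam.sum()
    return lam, P

# Simulated 3-dimensional data where one direction dominates the variance
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.5])
lam, P = pca_explained(X)
```

Here most of the total variation is captured by the first component, so one would keep q = 1 or q = 2 after inspecting the scree graph.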
In addition to the scree plot, the decision on the number of principal
components is often also based on the (possibly subjective) interpretability
of the components. The interpretation of principal components may be
based on the coefficients a_j^(i) and/or on the correlation between Z_j and
the coordinates of the original random vector X = (X_1, ..., X_p)^t. Note that
since E(ZX^t) = E(A^t XX^t) = A^t Σ = A^t AΛA^t = ΛA^t, var(X_k) = σ_kk and
8.2.5 Plots
One of the main difficulties with high-dimensional data is that they cannot
be represented directly in a two-dimensional display. Principal components
provide a possible solution to this problem. The situation is particularly
simple if the first two principal components explain most of the variability.
In that case, the original data (x1 (i), ..., xp (i))t (i = 1, 2, ..., n) may be re-
placed by the first two principal components (z1 (i), z2 (i))t (i = 1, 2, ..., n).
Thus, z2 (i) is plotted against z1 (i). If more than two principal components
are needed, then the plot of z2 (i) versus z1 (i) provides at least a partial view
of the data structure, and further projections can be viewed by corresponding
scatter plots of other components, or by symbol plots as described in Chap-
ter 2. The scatter plots can be useful for identifying structure in the data.
In particular, one may detect unusual observations (outliers) or clusters of
similar observations.
Figure 8.1 Tempo curves for Schumann’s Träumerei: skewness for the eight parts
A1, A2, A1′, A2′, B1, B2, A1″, A2″ for 28 performances, plotted against the number of
the part.
where M is the median and Q1 , Q2 are the lower and upper quartile respec-
tively. Figure 8.1 shows ηj (i) plotted against i. An apparent pattern is the
generally strong negative skewness in B2 . (Recall that negative skewness
can be created by extreme ritardandi.) Apart from that, however, Figure
8.1 is difficult to interpret directly. Principal component analysis helps to
find more interesting features. Figure 8.3 shows the loadings for the first
four principal components which explain more than 80% of the variability
(see Figure 8.2). The loadings can be interpreted as follows: the first com-
ponent corresponds to a weighted average emphasizing the skewness values
in the first half of the piece. The 28 performances apparently differ most
with respect to η_j(i) during the first 16 bars of the piece (parts A1, A2,
A1′, A2′). The second most important distinction between pianists is charac-
terized by the second component. This component compares skewness for
the A-parts with the values in B1 and B2 . The third component essentially
Figure 8.2 Variances of the principal components Comp. 1 to Comp. 8 (scree
plot).
compares the first with the second half. Finally, the fourth component es-
sentially compares the odd with the even numbered parts, excluding the
end A1″, A2″. Components two to five are displayed in Figure 8.4, with z_2
and z_3 on the x- and y-axis respectively and rectangles representing z_4 and
z_5. Note in particular that Cortot and Horowitz mainly differ with respect
to the third principal component. Horowitz has a more extreme difference
in skewness between the first and second halves of the piece. Also striking
are the “outliers” Brendel, Ortiz, and Gianoli. The overall skewness, as
represented by the first component, is quite extreme for Brendel and Ortiz.
For comparison, their tempo curves are plotted in Figure 8.5 together with
Cortot’s and Horowitz’s first performances. In view of the PCA one may
now indeed see that in the tempo curves of Brendel and Ortiz there is a
strong contrast between small tempo variations applied most of the time
and occasional strong local ritardandi.
Figure 8.3 Loadings of the first four principal components.
Figure 8.4 PCA of skewness – symbol plot of principal components 2–5.
Figure 8.5 Tempo curves (Brendel, Gianoli, Horowitz1).
Figure 8.9 Entropies – symbol plot of the first four principal components.
Discriminant analysis
Thus, correct classification for individuals from group Gi occurs with prob-
ability pii and misclassification with probability 1 − pii . A rule r with
correct-classification-probabilities pii is said to be at least as good as a rule
r̃ with probabilities p̃ii , if pii ≥ p̃ii for all i. If there is at least one “ > ”
sign, then r is better. If there is no better rule than r, then r is called
admissible. Consider now a Bayes rule r with probabilities pij . Is there any
better rule than r? Suppose that r̃ is better. Then
Σ_i π_i p_ii < Σ_i π_i p̃_ii.
On the other hand,
Σ_i π_i p̃_ii = Σ_i ∫_{ψ̃_i} π_i f_i(x) dx
≤ Σ_i ∫_{ψ̃_i} max_j {π_j f_j(x)} dx = ∫ max_j {π_j f_j(x)} dx.
Since r is a Bayes rule, we have
max_j {π_j f_j(x)} = Σ_i 1{x ∈ ψ_i} π_i f_i(x)
The rule becomes particularly simple if fi are normal with unknown means
µi and equal covariance matrices Σ1 = Σ2 = ... = Σ. Let x̄i be the sample
mean and Σ̂i the sample covariance matrix for observations from group Gi .
Estimating the common covariance matrix Σ by
Σ̂ = (n_1 Σ̂_1 + n_2 Σ̂_2 + ... + n_p Σ̂_p)/(n − p)
where n_i is the number of observations from G_i and n = n_1 + ... + n_p, the
ML-rule allocates x to G_i, if
(x − µ_i)^t Σ^{-1} (x − µ_i) = min_{j=1,...,p} (x − µ_j)^t Σ^{-1} (x − µ_j) (9.12)
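A sketch of this allocation rule with the pooled covariance estimate Σ̂ as above; the two toy groups are assumptions for illustration, not data from the text:

```python
import numpy as np

def pooled_covariance(groups):
    """Sigma_hat = (n1*S1 + ... + np*Sp) / (n - p), where n_i * S_i is
    the within-group scatter matrix of group i."""
    p = len(groups)
    n = sum(len(g) for g in groups)
    S = sum(len(g) * np.cov(g, rowvar=False, bias=True) for g in groups)
    return S / (n - p)

def lda_allocate(x, means, Sigma_hat):
    """ML-rule of (9.12): allocate x to the group whose mean has the
    smallest Mahalanobis distance under the common covariance."""
    Sinv = np.linalg.inv(Sigma_hat)
    d = [float((x - m) @ Sinv @ (x - m)) for m in means]
    return int(np.argmin(d))

# Two hypothetical well-separated groups in the plane:
g1 = np.array([[0.0, 0.0], [0.2, -0.1], [-0.1, 0.1]])
g2 = np.array([[3.0, 3.0], [3.2, 2.9], [2.9, 3.1]])
Sigma = pooled_covariance([g1, g2])
group = lda_allocate(np.array([0.1, 0.2]), [g1.mean(0), g2.mean(0)], Sigma)
```

A point near the first group's mean is allocated to group 0, and one near the second mean to group 1.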
data. In principle this is easy, since the corresponding estimates can simply
be plugged into the formula for pii . The observed data that are used for
estimation are also called “training sample”. A problem with these esti-
mates is, however, that the search for the optimal discriminant rule was
done with the same data. Therefore, p̂11 will tend to be too optimistic (i.e.
too large), unless n is very large. The same is true for any method that
estimates classification probabilities from the training data. One way to
avoid this is to partition the data set randomly into a “training” sample
that is used for estimation of the discriminant rule, and a disjoint “val-
idation” sample that is used for estimation of classification probabilities.
Obviously, this can only be done for large enough data sets. For recently
developed computational methods of validation, such as bootstrap, see e.g.
Efron (1979), Läuter (1985), Fukunaga (1990), Hirst (1996), LeBlanc and
Tibshirani (1996), Davison and Hinkley (1997), Chernick (1999), Good
(2001).
Figure 9.2 Linear discriminant analysis of compositions before and after 1800,
with the training sample. The data used for the discriminant rule consist of
x = (p_5, E). (Axes: entropy versus P(Subdominant).)
a straight line. Figure 9.2 shows the estimated partitioning line together
with the training sample (o = before 1800, x = after 1800). Apparently, the
two groups can indeed be separated quite well by the estimated straight
line. This is quite surprising, given the simplicity of the two variables. As
expected, however, the partition is not perfect, and it does not seem to be
possible to improve it by more complicated partitioning lines. To assess how
well the rule may indeed classify, we consider 50 other compositions that
were not used for estimating the discriminant rule. Figure 9.3 shows that
the rule works well, since almost all observations in the validation sample
are classified correctly. An unusual composition is Bartók’s Bagatelle No.
3 which lies far on the left in the “wrong” group.
The partitioning can be improved if the time periods of the two groups
are chosen farther apart. This is done in figures 9.3a and b with Group
1 = “Early Music to Baroque” and 2 = “Romantic to 20th century”. (A
beautiful example of early music is displayed in Figure 9.6; also see Fig-
ures 9.7 and 9.8 for portraits of Brahms and Wagner.) Figure 9.4 shows
the corresponding plot of the partition together with the data (n = 72).
Compositions not used in the estimation are shown in Figure 9.5. Again,
the rule works well, except for Bartók’s third Bagatelle.
Figure 9.3 Linear discriminant analysis of compositions before and after 1800,
with the validation sample. The data used for the discriminant rule consist of
x = (p_5, E). (Axes: entropy versus P(Subdominant); Bartók’s Bagatelle No. 3 is
marked.)
Figure 9.4 Linear discriminant analysis of “Early Music to Baroque” and “Ro-
mantic to 20th Century”. The points (“o” and “×”) belong to the training sample.
The data used for the discriminant rule consist of x = (p_5, E).
Figure 9.5 Linear discriminant analysis of “Early Music to Baroque” and “Ro-
mantic to 20th century”. The points (“o” and “×”) belong to the validation sam-
ple. The data used for the discriminant rule consist of x = (p_5, E).
Figure 9.6 Graduale written for an Augustinian monastery of the diocese Kon-
stanz, 13th century. (Courtesy of Zentralbibliothek Zürich.) (Color figures follow
page 152.)
Cluster analysis
where
h(η) = Π_{j=1}^p |Σ̂_j(η)|^{n_j(η)} (10.5)
Computationally this means that the function h(η) is evaluated for all
groupings η of the observations x1 , ..., xn , and the estimate is the grouping
and the corresponding (n − i) × (n − i) distance matrix D^(i) with elements
d_jl^(i) (j, l = 1, ..., n − i).
5. If
max_{j,l=1,...,n−i} d_jl^(i) ≤ d_o (10.8)
then stop. Otherwise, set i = i + 1 and go to step 3.
Note in particular that for the final clusters, the maximal distance within
each cluster is at most do . As a result, the final clusters tend to be very
“compact”. A related method is the so-called nearest neighbor single link-
age algorithm. It is identical with the above except that distance between
clusters is defined as the minimal distance between points in the two clus-
ters. This can lead to so-called “chaining” in the form of elongated clusters.
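The agglomerative scheme described above can be sketched as follows; this is a plain O(n³) reference implementation with d_o as merge threshold, not the book's code, and the one-dimensional toy points are assumptions:

```python
import numpy as np

def complete_linkage(points, d0):
    """Repeatedly merge the two clusters with the smallest complete-
    linkage distance (maximum pairwise distance between their points),
    as long as that distance does not exceed d0."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        return max(np.linalg.norm(points[i] - points[j])
                   for i in a for j in b)

    while len(clusters) > 1:
        pairs = [(dist(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        dmin, a, b = min(pairs)
        if dmin > d0:       # merging would exceed the threshold: stop
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Two tight groups on the line, far apart:
pts = np.array([[0.0], [0.1], [5.0], [5.2]])
out = complete_linkage(pts, d0=1.0)
```

For these points the algorithm returns the two clusters {0, 1} and {2, 3}; replacing `max` by `min` in `dist` gives the single linkage variant mentioned in the text.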
1-3; Piano Sonata (2nd Mv.); 17) O. Messiaen (1908-1992): Vingt regards
sur l’Enfant-Jésus No. 3; 18) T. Takemitsu (1930-1996): Rain tree sketch No. 1.
Figure 10.1 shows the result of complete linkage clustering of the vectors
(ζ_1, ..., ζ_11)^t, based on the Euclidean distance and d_o = 5. The most striking fea-
ture is the clear separation of “early music” from the rest. Moreover, the
20th century composers considered here are in a separate cluster, except
for Bartók’s Bagatelle No. 3 (and Debussy, who may be considered as be-
longing to the 19th and 20th centuries). In contrast, clusters provided by a
single linkage algorithm are less easy to interpret. Figure 10.2 illustrates a
typical result of this method, namely long narrow clusters where the maxi-
mal distance within a cluster can be quite large. In our example this does
Figure 10.5 Complete linkage clustering of entropies.
10.3.2 Entropies
Consider entropies as defined in Chapter 3. More specifically, we define
for each composition a vector y = (E_1, ..., E_10)^t. After standardization of
each coordinate, cluster analysis is applied to the following compositions by
J.S. Bach: Cello Suites No. I to VI (1st movement from each); Preludes and
Fugues No. 1 and 8 from “Das Wohltemperierte Klavier” (each separately).
The complete linkage algorithm leads to a clear separation of the Cello
Suites from “Das Wohltemperierte Klavier” displayed in Figure 10.5.
who are represented more than once in the sample, so that the consistency
of their performances can be checked empirically. Figure 10.6 also shows
that Cortot is somewhat of an “outlier”, since his cluster separates from
all other pianists at the top level.
Figure 10.7 Complete linkage clustering of HISMOOTH-fits to tempo curves.
Figure 10.8 Symbol plot of HISMOOTH bandwidths for tempo curves. The radius
of each circle is proportional to a constant plus log b_3; the horizontal and vertical
axes are equal to b_1 and b_2 respectively. The letters A–F indicate where at least
one observation from the corresponding cluster occurs.
Multidimensional scaling
Figure 11.3 Fragment of a graduale from the 14th century. (Courtesy of Zentral-
bibliothek Zürich.)
Figure 11.5 Freddy (by J.B.) and Johannes Brahms (1833-1897) going for a
drink. (Caricature from a contemporary newspaper; courtesy of Zentralbibliothek
Zürich.)
Figure 2.14: Smoothed tempo curves ĝ_2(t) = (nb_2)^{-1} Σ_i K((t − t_i)/b_2)[y_i − …