STATISTICS in
MUSICOLOGY
Jan Beran
MT6.B344 2003
781.2—dc21 2003048488
This book contains information obtained from authentic and highly regarded sources. Reprinted material
is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable
efforts have been made to publish reliable data and information, but the author and the publisher cannot
assume responsibility for the validity of all materials or for the consequences of their use.
Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic
or mechanical, including photocopying, microfilming, and recording, or by any information storage or
retrieval system, without prior permission in writing from the publisher.
The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for
creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC
for such copying.
Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation, without intent to infringe.
Preface
5 Hierarchical methods
5.1 Musical motivation
5.2 Basic principles
5.3 Specific applications in music
7 Circular statistics
7.1 Musical motivation
7.2 Basic principles
7.3 Specific applications in music
9 Discriminant analysis
9.1 Musical motivation
9.2 Basic principles
9.3 Specific applications in music
10 Cluster analysis
10.1 Musical motivation
10.2 Basic principles
10.3 Specific applications in music
11 Multidimensional scaling
11.1 Musical motivation
11.2 Basic principles
11.3 Specific applications in music
List of figures
References
Physical equations for sound waves only describe the propagation of air
pressure. They do not provide, by themselves, an understanding of how
and why certain sounds are connected, nor do they tell us anything (at
least not directly) about the effect on the audience. As far as structure is
concerned, one may even argue – for the sake of argument – that music does
not necessarily need “physical realization” in the form of a sound. Musicians
are able to hear music just by looking at a score. Beethoven (Figures 1.3,
1.16) composed his ultimate masterpieces after he lost his hearing. Thus,
on an abstract level, music can be considered as an organized structure
that follows certain laws. This structure may or may not express feelings
of the composer. Usually, the structure is communicated to the audience
by means of physical sounds – which in turn trigger an emotional expe-
rience of the audience (not necessarily identical with the one intended by
the composer). The structure itself can be analyzed, at least partially, us-
ing suitable mathematical structures. Note, however, that understanding
the mathematical structure does not necessarily tell us anything about the
effect on the audience. Moreover, any mathematical structure used for ana-
lyzing music describes certain selected aspects only. For instance, studying
symmetries of motifs in a composition by purely algebraic means ignores
psychological, historical, perceptual, and other important issues. Ideally, all
relevant scientific disciplines would need to interact to gain a broad under-
standing. A further complication is that the existence of a unique “truth”
is by no means certain (and is in fact rather unlikely). For instance, a
composition may contain certain structures that are important for some
listeners but are ignored by others. This problem became apparent in the
early 20th century with the introduction of 12-tone music. The general
public was not ready to perceive the complex structures of dodecaphonic
music and was rather appalled by the seemingly chaotic noise, whereas a
minority of “specialized” listeners was enthusiastic. Another example is the
1.3.2 Campanology
A rather peculiar example of group theory “in action” (though perhaps
rather trivial mathematically) is campanology or change ringing (Fletcher
1956, Wilson 1965, Price 1969, White 1983, 1985, 1987, Stewart 1992). The
art of change ringing started in England in the 10th century and is still
performed today. The problem that is to be solved is as follows: there are
k swinging bells in the church tower. One starts playing a melody that
consists of a certain sequence in which the bells are played, each bell be-
ing played only once. Thus, the initial sequence is a permutation of the
numbers 1, ..., k. Since it is not interesting to repeat the same melody over
and over, the initial melody has to be varied. However, the bells are very
heavy, so that it is not easy to change their timing. Each variation
is therefore restricted: from one “round” to the next, each bell may move
by at most one position, i.e. one or more disjoint pairs of adjacent
bells exchange their positions. Thus, for instance, if k = 4 and the pre-
vious sequence was (1, 2, 3, 4), then the permissible permutations are
(2, 1, 3, 4), (1, 3, 2, 4), (1, 2, 4, 3), and (2, 1, 4, 3). A further, mainly aesthetic
restriction is that no sequence should be repeated, except that the last one
is identical with the initial sequence. A typical solution to this problem is,
for instance, the “Plain Bob” that starts with (1, 2, 3, 4), (2, 1, 4, 3), (2, 4, 1, 3), ...
and continues until all permutations in S4 have been visited.
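As a small illustration (not from the book): the quoted Plain Bob opening moves from (1, 2, 3, 4) to (2, 1, 4, 3), i.e. two disjoint adjacent pairs swap at once, so the working rule is that each bell may move by at most one position per change. The permissible changes under this rule can be enumerated by brute force:

```python
def changes(seq):
    """All rows reachable from seq in one change: one or more
    disjoint pairs of adjacent bells exchange positions (each bell
    moves by at most one place)."""
    out = []
    def rec(i, cur, swapped):
        if i == len(seq):
            if swapped:
                out.append(tuple(cur))
            return
        rec(i + 1, cur + [seq[i]], swapped)       # bell i stays in place
        if i + 1 < len(seq):                      # bells i and i+1 swap
            rec(i + 2, cur + [seq[i + 1], seq[i]], True)
    rec(0, [], False)
    return out

# from (1, 2, 3, 4): the three single swaps named in the text,
# plus the double swap (2, 1, 4, 3) with which the Plain Bob begins
moves = changes((1, 2, 3, 4))
```

For k = 4 this yields exactly four permissible successors of any row.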
1.3.6 Transformations
For suitably chosen integers p1 , p2 , p3 , p4 , consider the four-dimensional
module M = Zp1 × Zp2 × Zp3 × Zp4 over Z where the coordinates rep-
resent onset time, pitch (well-tempered tuning if p2 = 12), duration, and
volume. Transformations in this space play an essential role in music. A se-
lection of historically relevant transformations used by classical composers
is summarized in Table 1.1 (also see Figure 1.13).
Generally, one may say that affine transformations are most important,
and among these the invertible ones. In particular, it can be shown that each
symmetry of Z12 can be written as a product (in the group of symmetries
Symm(Z12 )) of the following musically meaningful transformations:
• Multiplication by − 1 (inversion);
• Multiplication by 5 (ordering of notes according to the circle of fourths);
• Addition of 3 (transposition by a minor third);
• Addition of 4 (transposition by a major third).
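As a sketch (not from the book), this factorization claim can be checked by brute force: encoding each transformation as an affine map x ↦ ax + b on Z12 and closing the four generators under composition yields the full group of 48 invertible affine maps.

```python
# the four generating transformations, encoded as affine maps
# x -> a*x + b (mod 12), written as pairs (a, b)
generators = [
    (11, 0),  # multiplication by -1 (inversion)
    (5, 0),   # multiplication by 5 (circle of fourths)
    (1, 3),   # addition of 3 (transposition by a minor third)
    (1, 4),   # addition of 4 (transposition by a major third)
]

def compose(f, g):
    """(f o g)(x) = a_f*(a_g*x + b_g) + b_f (mod 12)."""
    (af, bf), (ag, bg) = f, g
    return ((af * ag) % 12, (af * bg + bf) % 12)

# close the generators under composition
group = {(1, 0)}
while True:
    new = {compose(f, g) for f in group | set(generators)
                         for g in group | set(generators)} - group
    if not new:
        break
    group |= new

# group now contains all 48 invertible affine maps a*x + b,
# i.e. those with a in {1, 5, 7, 11}
```

The closure has 48 elements, matching the order of the group of affine symmetries of Z12.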
All these transformations have been used by composers for many centuries.
Some examples of apparent similarities between groups of notes (or motifs)
are shown in Figures 1.10 through 1.12. In order not to clutter the pic-
tures, only a small selection of similar motifs is marked. In dodecaphonic
and serial music, transformation groups have been applied systematically
(see e.g. Figure 1.9). For instance, in Schönberg’s Orchestervariationen op.
Figure 1.10 Notes of “Air” by Henry Purcell. (For better visibility, only a small
selection of related “motifs” is marked.)
Figure 1.12 Notes of op. 68, No. 2 from “Album für die Jugend” by Robert Schu-
mann. (For better visibility, only a small selection of related “motifs” is marked.)
Figure 1.14 Graphical representation of pitch and onset time in Z271 together with
instrumentation of polygonal areas. (Excerpt from Śānti – Piano concert No. 2
by Jan Beran, col legno CD 20062; courtesy of col legno, Germany.)
[Figure: log(tempo) versus onset time for the 1947, 1963, and 1965 performances.]
• Bar chart: If the data can assume only a few different values, or if the data
are qualitative (i.e. we only record which category an item belongs to), then
one can plot the possible values or names of categories on the x-axis and
the corresponding (relative) frequencies on the vertical axis.
• Q-q-plot for comparing two data sets x1, ..., xn and y1, ..., ym: 1. Define
a certain number of points 0 < p1 < ... < pk ≤ 1 (the standard choice is
pi = (i − 0.5)/N, where N = min(n, m)). 2. Plot the pi-quantiles (i = 1, ..., N)
of the y-observations versus those of the x-observations. Alternative
plots for comparing distributions are discussed e.g. in Ghosh and Beran
(2000) and Ghosh (1996, 1999).
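As a sketch (function and data names are our own), the quantile pairs for such a q-q-plot can be computed as follows; note that np.quantile interpolates between order statistics, which may differ slightly from some textbook quantile definitions:

```python
import numpy as np

def qq_points(x, y):
    """Quantile pairs for a q-q-plot of two samples, using
    p_i = (i - 0.5)/N with N = min(n, m)."""
    N = min(len(x), len(y))
    p = (np.arange(1, N + 1) - 0.5) / N
    qx = np.quantile(x, p)
    qy = np.quantile(y, p)
    return qx, qy  # plot qy against qx

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * rng.normal(size=150) + 1   # same shape as x, shifted and scaled
qx, qy = qq_points(x, y)
# for distributions of the same shape the points fall near a straight line
```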
[Figure: log(tempo) versus onset time, one curve per performance (Argerich, Arrau, Askenaze, Brendel, Bunin, Capova, Cortot (3 recordings), Curzon, Davies, Demus, Eschenbach, Gianoli, Horowitz (3 recordings), Katsaris, Klien, Krust, Kubalek, Moiseiwitsch, Ney, Novaes, Ortiz, Schnabel, Shelley, Zak).]
sample considered here, pianists of the “modern era” tend to make a much
stronger distinction between A and A′′ in terms of slow tempi. The only
exceptions (outliers in the left boxplot) are Moiseiwitsch, Horowitz’s
first performance, and Ashkenazy (outlier in the right boxplot). The
comparison of skewness and kurtosis in Figures 2.4g and h also indicates that
“modern” pianists seem to prefer occasional extreme ritardandi. The only
exception in the “early 20th century group” is Artur Schnabel, with an
extreme skewness of −2.47 and a kurtosis of 7.04.
Direct comparisons of tempo distributions are shown in Figures 2.5a–f.
[Figure 2.5a: q-q-plot Demus (1960) – Ortiz (1988); Figure 2.5b: q-q-plot Demus (1960) – Cortot (1935); Figure 2.5c: q-q-plot Ortiz (1988) – Argerich (1983); further panels involve Horowitz (1963), Cortot (1947), and Krust.]
[Figure 2.6a: J.S. Bach – Fugue 1; Figure 2.6b: W.A. Mozart – KV 545; relative frequencies of notes number i, i in [1+j, 16+j] (j = 0, ..., 64).]
Figure 2.6 Frequencies of notes 0,1,...,11 for moving windows of onset-length 16.
[Figure 2.7 panels: frequencies plotted against (Notes − Tonic) mod 12.]
Figure 2.7 Frequencies of notes 0,1,...,11 for moving windows of onset-length 16.
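The moving-window note frequencies underlying Figures 2.6 and 2.7 can be sketched as follows (a toy example with an invented note sequence, not the book's data):

```python
import numpy as np

def pitch_class_freqs(pitches, tonic, window=16, step=1):
    """Relative frequencies of (note - tonic) mod 12 in moving
    windows of `window` successive notes."""
    pc = (np.asarray(pitches) - tonic) % 12
    rows = []
    for j in range(0, len(pc) - window + 1, step):
        counts = np.bincount(pc[j:j + window], minlength=12)
        rows.append(counts / window)
    return np.array(rows)   # one row of 12 relative frequencies per window

# hypothetical toy sequence (MIDI pitch numbers), tonic C = 60
notes = [60, 62, 64, 65, 67, 69, 71, 72, 71, 69, 67, 65,
         64, 62, 60, 59, 60, 64, 67, 72]
F = pitch_class_freqs(notes, tonic=60)
# each row sums to 1; plotting the rows against 0,...,11 gives
# curves of the kind shown in Figures 2.6 and 2.7
```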
2.4.1 Definitions
Correlation
where ui denotes the rank of xi among the x−values and vi is the rank
of yi among the y−values. In (2.3) and (2.4) it is assumed that sx , sy ,
su and sv are not zero. Recall that these definitions imply the following
properties: a) −1 ≤ r, rSp ≤ 1; b) r = 1, if and only if yi = βo + β1 xi
and β1 > 0 (exact linear relationship with positive slope); c) r = −1, if
and only if yi = βo + β1 xi and β1 < 0 (exact linear relationship with
negative slope); d) rSp = 1, if and only if xi > xj implies yi > yj (strictly
monotonically increasing relationship); e) rSp = −1, if and only if xi >
xj implies yi < yj (strictly monotonically decreasing relationship); f) r
measures the strength (and sign) of the linear relationship; g) rSp measures
the strength (and sign) of monotonicity; h) if the data are realizations of a
bivariate random variable (X, Y), then r is an estimate of the population
correlation ρ = cov(X, Y)/√(var(X) var(Y)) where cov(X, Y) = E[XY] −
E[X]E[Y], var(X) = cov(X, X) and var(Y) = cov(Y, Y). When using
these measures of dependence one should bear in mind that each of them
measures a specific type of dependence only, namely linear and monotonic
dependence respectively. Thus, a Pearson or Spearman correlation near
or equal to zero does not necessarily mean independence. Note also that
correlation can be interpreted in a geometric way as follows: defining the
n−dimensional vectors x = (x1 , ..., xn )t and y = (y1 , ..., yn )t , r is equal to
the standardized scalar product between x and y, and is therefore equal to
the cosine of the angle between these two vectors.
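A minimal sketch of both correlation measures (our own helper functions), illustrating properties f) and g): a strictly monotone but nonlinear relation gives rSp = 1 while r < 1.

```python
import numpy as np

def pearson(x, y):
    """r as the standardized scalar product of the centered vectors,
    i.e. the cosine of the angle between them."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def spearman(x, y):
    """r_Sp: the Pearson correlation of the ranks u_i, v_i
    (no ties assumed in this sketch)."""
    rank = lambda a: np.argsort(np.argsort(a)) + 1.0
    return pearson(rank(x), rank(y))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(x)                # strictly increasing but not linear
r, r_sp = pearson(x, y), spearman(x, y)
# r_sp = 1 exactly (monotone relationship), while r < 1 (not linear)
```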
A special type of correlation is interesting for time series. Time series are
data that are taken in a specific ordered (usually temporal) sequence. If
Y1 , Y2 , ..., Yn are random variables observed at time points i = 1, ..., n, then
one would like to know whether there is any linear dependence between
observations Yi and Yi−k , i.e. between observations that are k time units
apart. If this dependence is the same for all time points i, and the expected
value of Yi is constant, then the corresponding population correlation can
be written as a function of k only (see Chapter 4):

ρ(k) = cov(Yi, Yi+k)/√(var(Yi) var(Yi+k)).   (2.5)

In the sample version, the covariance is estimated by s² = n⁻¹ Σ (yi − ȳ)(yi+k − ȳ);
note that here the summation stops at i = n − k, since yi+k is only observed
for i ≤ n − k.
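A sample version of (2.5) can be sketched as follows (our own implementation; the MA(1) series is invented for illustration):

```python
import numpy as np

def acf(y, max_lag):
    """Sample autocorrelation: the summation stops at n - k, and the
    overall mean and variance are used at every lag."""
    y = np.asarray(y, float)
    n, ybar = len(y), y.mean()
    c0 = ((y - ybar) ** 2).sum() / n
    return np.array([((y[:n - k] - ybar) * (y[k:] - ybar)).sum() / n / c0
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(1)
e = rng.normal(size=500)
y = e[1:] + 0.8 * e[:-1]       # MA(1): dependence only at lag 1
rho = acf(y, 3)
# rho[0] = 1; rho[1] is near 0.8/(1 + 0.64) ~ 0.49; rho[2], rho[3] near 0
```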
Regression
In addition to measuring the strength of dependence between two variables,
one is often interested in finding an explicit functional relationship. For
instance, it may be possible to express the response variable y in terms of an
explanatory variable x by y = g(x, ε) where ε is a variable representing the
part of y that is unexplained. More specifically, we may have, for example,
an additive relationship y = g(x) + ε or a multiplicative equation y =
g(x)eε . The simplest relationship is given by the simple linear regression
equation
y = βo + β1 x + ε (2.9)
where ε is assumed to be a random variable with E(ε) = 0 (and usually
finite variance σ² = var(ε) < ∞). Thus, the data are yi = βo + β1 xi + εi (i =
1, ..., n), where the εi's are generated by the same zero-mean distribution.
Often the εi's are also assumed to be uncorrelated or even independent – this
is, however, not a necessary assumption. An obvious estimate of the unknown
parameters βo and β1 is obtained by minimizing the total sum of squared
errors

SSE = SSE(bo, b1) = Σ (yi − bo − b1 xi)² = Σ ri²(bo, b1)   (2.10)
with respect to bo , b1 . The solution is found by setting the partial derivatives
with respect to bo and b1 equal to zero. A more elegant way to find the
solution is obtained by interpreting the problem geometrically: defining the
n-dimensional vectors 1 = (1, ..., 1)t, b = (bo, b1)t and the n × 2 matrix X
with columns 1 and x, we have SSE = ||y − bo 1 − b1 x||² = ||y − Xb||²
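Both routes to the least squares solution – the partial derivatives of SSE, and the geometric projection of y onto the columns of X – can be sketched on simulated data (all names and numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
y = 1.5 + 0.7 * x + rng.normal(scale=0.3, size=50)   # true (b0, b1) = (1.5, 0.7)

# 1) closed form obtained from the partial derivatives of SSE
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

# 2) geometric view: project y onto the column space of X = [1, x]
X = np.column_stack([np.ones_like(x), x])
b_geom = np.linalg.lstsq(X, y, rcond=None)[0]

# both routes give the same (b0, b1), close to the true values
```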
Regression smoothing
A more general, but more difficult, approach to modeling a functional re-
lationship is to impose less restrictive assumptions on the function g. For
instance, we may assume
y = g(x) + ε (2.17)
with g being a twice continuously differentiable function. Under suitable
additional conditions on x and ε it is then possible to estimate g from
observed data by nonparametric smoothing. As a special example consider
observations yi taken at time points i = 1, 2, ..., n. A standard model is
yi = g(ti ) + εi (2.18)
where ti = i/n, and the εi are independent identically distributed (iid) random
variables with E(εi) = 0 and σ² = var(εi) < ∞. The reason for using
standardized time ti ∈ [0, 1] is that this way g is observed on an increasingly
fine grid. This makes it possible to ultimately estimate g(t) for all values
of t by using neighboring values ti , provided that g is not too “wild”. A
simple estimate of g can be obtained, for instance, by a weighted average
(kernel smoothing)
ĝ(t) = Σi wi yi   (2.19)

where wi = K((t − ti)/b) / Σj K((t − tj)/b), with b > 0, and a kernel function
K ≥ 0 such that K(u) = K(−u), K(u) = 0 for |u| > 1, and ∫₋₁¹ K(u) du = 1.
The role of b is to restrict the observations
that influence the estimate to a small window of neighboring time points.
For instance, the rectangular kernel K(u) = ½·1{|u| ≤ 1} yields the sample
mean of the observations yi in the “window” n(t − b) ≤ i ≤ n(t + b). An even
more elegant formula can be obtained by approximating the Riemann sum
(nb)⁻¹ Σj K((t − tj)/b) by the integral ∫₋₁¹ K(u) du = 1:

ĝ(t) = Σi wi yi = (nb)⁻¹ Σi K((t − ti)/b) yi   (2.21)
In this case, the sum of the weights is not exactly equal to one, but asymp-
totically (as n → ∞ and b → 0 such that nb3 → ∞) this error is negligible.
It can be shown that, under fairly general conditions on g and ε, ĝ con-
verges to g, in a certain sense that depends on the specific assumptions (see
e.g. Gasser and Müller 1979, Gasser and Müller 1984, Härdle 1991, Beran
and Feng 2002, Wand and Jones 1995, and references therein).
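Formula (2.21) can be sketched directly (a toy example with a rectangular kernel and simulated data, not the book's tempo data):

```python
import numpy as np

def kernel_smooth(y, t, b):
    """Kernel estimate ghat(t) = (nb)^{-1} * sum_i K((t - t_i)/b) * y_i
    as in (2.21), with the rectangular kernel K(u) = 0.5 * 1{|u| <= 1}
    and standardized time t_i = i/n."""
    n = len(y)
    ti = np.arange(1, n + 1) / n
    w = 0.5 * (np.abs((t - ti) / b) <= 1)
    return (w @ y) / (n * b)

rng = np.random.default_rng(3)
n = 400
ti = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * ti) + rng.normal(scale=0.2, size=n)  # g(t) = sin(2*pi*t)

t_grid = np.linspace(0.1, 0.9, 9)   # interior points, away from the borders
ghat = np.array([kernel_smooth(y, t, b=0.05) for t in t_grid])
# ghat is close to sin(2*pi*t) on the interior grid
```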
An alternative to kernel smoothing is local polynomial fitting (Fan and
Gijbels 1995, 1996; also see Feng 1999). The idea is to fit a polynomial
locally, i.e. to data in a small neighborhood of the point of interest. This
can be formulated as a weighted least squares problem as follows:
ĝ(t) = β̂o   (2.22)

where β̂ = (β̂o, β̂1, ..., β̂p)t solves the local least squares problem

β̂ = arg min_a Σi K((ti − t)/b) ri²(a).   (2.23)
Here ri = yi − [ao + a1 (ti − t) + ... + ap (ti − t)p ], K is a kernel as above and
b > 0 is the bandwidth defining the window of neighboring observations.
It can be shown that asymptotically, a local polynomial smoother can be
written as a kernel estimator (Ruppert and Wand 1994). A difference only
occurs at the borders (t close to 0 or 1), where, in contrast to the local
polynomial estimate, the kernel smoother has to be modified. The reason
is that observations are no longer symmetrically spaced in the window
t ± b. A major advantage of local polynomials is that they automatically
provide estimates of derivatives, namely ĝ′(t) = β̂1, ĝ″(t) = 2β̂2, etc. Kernel
smoothing can also be used for estimation of derivatives; however different
(and rather complicated) kernels have to be used for each derivative (Gasser
and Müller 1984, Gasser et al. 1985). A third alternative is so-called wavelet
smoothing.
... where the kernel K is such that K(u, v) = K(−u, v) = K(u, −v) ≥ 0, and
∫∫ K(u, v) du dv = 1. Usually, b1 = b2 = b and K(u, v) has compact support.
Interpolation
Often a process may be generated in continuous time, but is observed at
discrete time points. One may then wish to guess the values of the points
Statistical inference
In this section, correlation, linear regression, nonparametric smoothing,
and interpolation were introduced in an informal way, without exact dis-
cussion of probabilistic assumptions and statistical inference. All these
techniques can be used in an informal way to explore possible structures
without specific model assumptions. Sometimes, however, one wishes to
obtain more solid conclusions by statistical tests and confidence intervals.
There is an enormous literature on statistical inference in regression, in-
cluding nonparametric approaches. For selected results see the references
given above. For nonparametric methods also see Wand and Jones (1995),
Simonoff (1996), Bowman and Azzalini (1997), Eubank (1999) and refer-
ences therein.
[Figure panels: correlations of acceleration curves – b) correlations of Cortot (1935) with other performances; d) maximal correlations of Horowitz (1947) with other performances.]

Figure 2.13 Smoothed tempo curves ĝ1(t) = (nb1)⁻¹ Σ K((t − ti)/b1) yi (b1 = 8).
Figure 2.14 Smoothed tempo curves ĝ2(t) = (nb2)⁻¹ Σ K((t − ti)/b2) [yi − ĝ1(t)] (b2 = 1).
Figure 2.15 Smoothed tempo curves ĝ3(t) = (nb3)⁻¹ Σ K((t − ti)/b3) [yi − ĝ1(t) − ĝ2(t)] (b3 = 1/8).
Figure 2.16 Smoothed tempo curves – residuals ê(t) = yi − ĝ1 (t) − ĝ2 (t) − ĝ3 (t).
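The decomposition behind Figures 2.13–2.16 – smoothing with a large bandwidth, then smoothing the residuals with successively smaller bandwidths – can be sketched as follows (a toy "tempo curve"; the bandwidths here are illustrative fractions of standardized time, not the book's 8, 1, 1/8):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 256
t = np.arange(1, n + 1) / n
# toy "tempo curve": slow trend + local ritardando-like wiggles + noise
y = 2 * t * (1 - t) + 0.3 * np.sin(12 * np.pi * t) + rng.normal(scale=0.05, size=n)

def smooth(y, b):
    """Moving average over the rectangular window |t_i - t_j| <= b;
    weights are renormalized, so the borders are handled too."""
    ti = np.arange(1, len(y) + 1) / len(y)
    W = (np.abs(ti[:, None] - ti[None, :]) <= b).astype(float)
    return (W / W.sum(axis=1, keepdims=True)) @ y

# successive smoothing of residuals with decreasing bandwidths,
# in the spirit of ghat_1, ghat_2, ghat_3
g1 = smooth(y, 0.25)
g2 = smooth(y - g1, 0.05)
g3 = smooth(y - g1 - g2, 0.01)
resid = y - g1 - g2 - g3
# by construction, y = g1 + g2 + g3 + resid exactly
```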
Figure 2.17 Melodic indicator – local polynomial fits together with first and second
derivatives.
Figure 2.18 Tempo curves (Figure 2.3) – first derivatives obtained from local
polynomial fits (span 24/32).
Figure 2.19 Tempo curves (Figure 2.3) – second derivatives obtained from local
polynomial fits (span 8/32).
Figure 2.26 R. Schumann, Träumerei op. 15, No. 7 – tempo by Cortot and
Horowitz at sharpening onset times.
Figure 2.27 R. Schumann, Träumerei op. 15, No. 7 – tempo “derivatives” for
Cortot and Horowitz at sharpening onset times.
Most methods for analyzing multivariate data are based on these two statis-
tics. One of the main tools consists of dimension reduction by suitable pro-
jections, since it is easier to find and visualize structure in low dimensions.
These techniques go far beyond descriptive statistics. We therefore post-
pone the discussion of these methods to Chapters 8 to 11. Another set of
methods consists of visualizing individual multivariate observations. The
main purpose is a simple visual identification of similarities and differences
between observations, as well as search for clusters and other patterns.
Typical examples are:
• Faces: xi =(xi1 , ..., xip )t is represented by a face with features depending
on the values of the corresponding coordinates. For instance, the face function
in S-Plus has the following correspondence between coordinates and
feature parameters: xi,1 = area of face; xi,2 = shape of face; xi,3 = length
of nose; xi,4 = location of mouth; xi,5 = curve of smile; xi,6 = width of
mouth; xi,7 = location of eyes; xi,8 = separation of eyes; xi,9 = angle
of eyes; xi,10 = shape of eyes; xi,11 = width of eyes; xi,12 = location of
pupil; xi,13 = location of eyebrow; xi,14 = angle of eyebrow; xi,15 = width
of eyebrows.
• Stars: Each coordinate is represented by a ray in a star, the length of
the ray corresponding to the value of the coordinate. More specifically, a
star for a data vector xi = (xi1, ..., xip)t is constructed as follows:
1. Scale xi to the range [0, r]: 0 ≤ x1j, ..., xnj ≤ r;
2. Draw p rays at angles ϕj = 2π(j − 1)/p (j = 1, ..., p); for a star with
Figure 2.29 b) Chernoff faces for the same compositions as in figure 2.29a, after
permuting coordinates.
Figure 2.31 Star plots of p∗j = (p6 , p11 , p4 , p9 , p2 , p7 , p12 , p5 , p10 , p3 , p8 )t for com-
positions from the 13th to the 20th century.
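The star construction described above can be sketched as follows (our own function; it returns ray endpoints rather than drawing them):

```python
import numpy as np

def star_vertices(X, r=1.0):
    """Ray endpoints for star plots of the rows of X (n x p):
    each column is scaled to [0, r] across the sample, and ray j of
    observation i has length X_scaled[i, j] at angle
    phi_j = 2*pi*(j - 1)/p."""
    X = np.asarray(X, float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    S = r * (X - lo) / np.where(hi > lo, hi - lo, 1.0)   # scaled to [0, r]
    p = X.shape[1]
    phi = 2 * np.pi * np.arange(p) / p
    # endpoint of ray j: (length*cos(phi_j), length*sin(phi_j))
    rays = np.stack([np.cos(phi), np.sin(phi)], axis=-1)  # shape (p, 2)
    return S[:, :, None] * rays[None, :, :]               # shape (n, p, 2)

# three hypothetical observations with p = 4 features
X = np.array([[1.0, 0.0, 2.0, 5.0],
              [3.0, 1.0, 0.0, 0.0],
              [2.0, 2.0, 1.0, 2.5]])
V = star_vertices(X)   # (3, 4, 2): n stars, p rays, (x, y) endpoints
```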
and

Eup = {(t∗j, max{x(t) : (t, x(t)) ∈ Cj}), j = 1, ..., n}.
In other words, for each onset time, the lowest and highest note are se-
lected to define the lower and upper envelope respectively. In the example
below, we consider interval steps ∆y(ti ) = y(ti+1 ) − y(ti ) mod 12 for the
upper envelope of a composition with onset times t1 , ..., tn and pitches
y(t1 )..., y(tn ). A simple aspect of melodic and harmonic structure is the
question in which sequence intervals are likely to occur. Here, we look at
the empirical two-dimensional distribution of (∆y(ti ), ∆y(ti+1 )). For each
pair (i, j) (−11 ≤ i, j ≤ 11; i, j ≠ 0), we count the number nij of occurrences
and define Nij = log(nij + 1). (The value 0 is excluded here, since repetitions
of a note – or transposition by an octave – are less interesting.) If only
the type of interval and not its direction is of interest, then i, j assume the
values 1 to 11 only. A useful representation of Nij can be obtained by
a symbol plot. In Figures 2.32 and 2.33, the x- and y-coordinates corre-
spond to i and j respectively. The radius of a circle with center (i, j) is
proportional to Nij . The compositions considered here are: a) J.S. Bach:
Präludium No. 1 from ”Das Wohltemperierte Klavier”; b) W.A. Mozart :
Sonata KV 545, (beginning of 2nd Movement); c) A. Scriabin: Prélude op.
51, No. 4; and d) F. Martin: Prélude No. 6. For Bach’s piece, there is a clear
clustering in three main groups in the first plot (there are almost never two
successive interval steps downwards) and a horseshoe-like pattern for absolute
intervals. The clear negative correlation in Mozart’s first plot and the
concentration on a few selected interval sequences are remarkable. A negative
correlation in the plots of interval steps with sign can also be found
for Scriabin and Martin. However, considering only the types of intervals
without their sign, the number and variety of interval sequences that are
used relatively frequently is much higher for Scriabin and even more for
Martin. For Martin, the plane of absolute intervals (Figure 2.33d) is filled
almost uniformly.
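The counts Nij behind these symbol plots can be sketched as follows. The text does not fully specify how the mod-12 reduction yields signed steps in −11, ..., 11, so the signed reduction used here is one plausible reading (the melody fragment is invented):

```python
import numpy as np

def interval_transitions(pitches):
    """N_ij = log(n_ij + 1), where n_ij counts successive interval-step
    pairs (dy_i, dy_{i+1}). Here the absolute interval is reduced mod 12
    and the direction (sign) is kept -- one plausible reading of the
    text -- and pairs involving a step of 0 are excluded."""
    y = np.asarray(pitches)
    d = np.diff(y)
    d = np.sign(d) * (np.abs(d) % 12)   # signed steps in -11, ..., 11
    N = np.zeros((23, 23))              # index i + 11 for step i
    for a, b in zip(d[:-1], d[1:]):
        if a != 0 and b != 0:
            N[a + 11, b + 11] += 1
    return np.log(N + 1)                # circle radius at position (i, j)

# hypothetical upper-envelope fragment (MIDI pitch numbers)
melody = [60, 64, 62, 65, 64, 60, 67, 65, 64, 62, 60]
N = interval_transitions(melody)
```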
Figure 2.34 Symbol plot with x = pj5, y = pj7 and radius of circles proportional
to pj1.
Figure 2.35 Symbol plot with x = pj5, y = pj7 and radius of circles proportional
to pj6. (Color figures follow page 152.)
pj1 (diminished second) and height pj6 (augmented fourth). Using the same
colors for the names as above, a similar clustering as in the circle-plot can
be observed. The picture not only visualizes a clear four-dimensional rela-
tionship between pj1 , pj5 , pj6 and pj7 , but also shows that these quantities
are related to the time period.
Figure 2.36 Symbol plot with x = pj5 , y = pj7 . The rectangles have width pj1
(diminished second) and height pj6 (augmented fourth). (Color figures follow page
152.)
Figure 2.37 Symbol plot with x = pj5 , y = pj7 , and triangles defined by pj1 (di-
minished second), pj6 (augmented fourth) and pj10 (diminished seventh). (Color
figures follow page 152 .)
Figure 2.38 Names plotted at locations (x, y) = (pj5 , pj7 ). (Color figures follow
page 152.)
Let I1 be the information needed to identify the set Vj to which v belongs.
Then the total information needed for identifying (encoding) elements of
V is
log2 N = I1 + I2 (3.3)

On the other hand, Σ_{j=1}^{k} pj log2 N = log2 N, so that we obtain Shannon's
famous formula

I1 = − Σ_{j=1}^{k} pj log2 (pj ) (3.4)
I1 is also called Shannon information. Shannon information is thus the
expected information about the occurrence of the sets V1 , ..., Vk contained in
a randomly chosen element from V . Note that the term “information” can
be used synonymously for “uncertainty”: the information obtained from
a random experiment diminishes uncertainty by the same amount. The
derivation of Shannon information is credited to Shannon (1948) and, in-
dependently, Wiener (1948). In physics, an analogous formula is known as
entropy and is a measure of the disorder of a system (see Boltzmann 1896,
figure 3.1).
Shannon’s formula can also be derived by postulating the following prop-
erties for a measure of information of the outcome of a random experiment:
let V1 , ..., Vk be the possible outcomes of a random experiment and denote
by pj = P (Vj ) the corresponding probabilities. Then a measure of infor-
mation, say I, obtained by the outcome of the random experiment should
have the following properties:
1. Function of probabilities: I = I(p1 , ..., pk ), i.e. I depends on the proba-
bilities pj only;
2. Symmetry: I(p1 , ..., pk ) = I(pπ(1) , ..., pπ(k) ) for any permutation π;
3. Continuity: I(p, 1 − p) is a continuous function of p (0 ≤ p ≤ 1);
4. Definition of unit: I(1/2, 1/2) = 1;
Shannon information has an obvious upper bound that follows from Jensen's
inequality: recall that Jensen's inequality states that for a convex function
g and weights wj ≥ 0 with Σ wj = 1 we have

g(Σ_j wj xj ) ≤ Σ_j wj g(xj ).
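As a small numerical check of formula (3.4) and the Jensen bound, the following sketch (plain Python, our own illustration, not from the text) computes I1 for a few distributions; the function name shannon_information is ours.

```python
import math

def shannon_information(p):
    """Shannon information I1 = -sum_j p_j log2(p_j) of a probability vector."""
    assert abs(sum(p) - 1.0) < 1e-9
    return -sum(pj * math.log2(pj) for pj in p if pj > 0)

# The unit of property 4: a fair coin carries exactly one bit.
print(shannon_information([0.5, 0.5]))              # 1.0

# Jensen's inequality yields the upper bound I1 <= log2(k),
# attained exactly by the uniform distribution on k sets:
k = 8
print(shannon_information([1 / k] * k))             # 3.0
p = [0.7, 0.1, 0.1, 0.05, 0.05]
print(shannon_information(p) <= math.log2(len(p)))  # True
```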
This definition is plausible, because for a process with unit variance, f has
the same properties as a probability distribution and can be interpreted as
a distribution on frequencies. The process Xt is uncorrelated if and only if
f is constant, i.e. if f is the uniform distribution on [−π, π]. Exactly in this
case entropy is maximal, and knowledge of past observations does not help
to predict future observations. On the other hand, if f has one or more
Specific indicators
A possible objection to weight functions as defined above is that only in-
formation about pitch and onset time is used. A score, however, usually
contains much more symbolic information that helps musicians to read it
correctly. For instance, melodic phrases are often connected by a phras-
ing slur, notes are grouped by beams, separate voices are made visible by
suitable orientation of note stems, etc. Ideally, structural indicators should
take into account such additional information. An improved indicator that
takes into account knowledge about musical “motifs” can be defined for
example as follows:
Definition 31 Let M = {(τ1 , y1 ), ..., (τk , yk )}, τ1 < τ2 < ... < τk be a
“motif ” where y denotes pitch and τ onset time. Given a composition K ⊂
T ×Z ⊂ Z2 , define for each score-onset time ti ∈ T (i = 1, ..., n) and
u ∈ {1, ..., k}, the shifted motif
M (ti , u) = {(ti + τ1 − τu , y1 ), ..., (ti + τk − τu , yk )}
and define ru (ti ) to be the sample correlation between the composition's
pitches at the onsets of M (ti , u) and y = (y1 , ..., yk ). If M (ti , u) ⊄ K, then
set ru (ti ) = 0.
Disregarding the position within a motif, we can now define overall motivic
indicators (or weights), for instance by
wd,mean (ti ) = g(Σ_{u=1}^{k} du (ti )) (3.14)
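A toy implementation of the correlation in Definition 31 may make this concrete (our own sketch; the composition is simplified to a monophonic map from onset time to pitch, and all names are ours):

```python
import math

def motif_correlation(K, motif, ti, u):
    """r_u(t_i): sample correlation between the composition's pitches at the
    onset times of the shifted motif M(t_i, u) and the motif pitches
    y_1, ..., y_k; set to 0 when the shifted motif falls outside K."""
    taus = [tau for tau, _ in motif]
    ys = [y for _, y in motif]
    onsets = [ti + tau - taus[u] for tau in taus]
    if any(t not in K for t in onsets):
        return 0.0
    xs = [K[t] for t in onsets]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sx = math.sqrt(sum((v - mx) ** 2 for v in xs))
    sy = math.sqrt(sum((v - my) ** 2 for v in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (sx * sy)

# The motif (0,60),(1,62),(2,64) appears transposed at onset time 10:
K = {0: 60, 1: 62, 2: 64, 10: 67, 11: 69, 12: 71, 20: 64, 21: 60, 22: 62}
motif = [(0, 60), (1, 62), (2, 64)]
print(round(motif_correlation(K, motif, 10, 0), 6))  # 1.0: exact transposed match
print(round(motif_correlation(K, motif, 20, 0), 6))  # -0.5
print(motif_correlation(K, motif, 5, 0))             # 0.0: motif not inside K
```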
Clearly, as r tends to zero, µr,h can only increase and therefore has a
limit. The limit can be either zero (if µr,h = 0 already), infinity, or a finite
number. This leads to the following definition:
Definition 34 A function h for which
0 < µh (A) < ∞
is called an intrinsic function of A.
Consider, for example, a simple shape in the plane such as a circle with
radius R. The area of the circle A can be measured by covering it by small
circles of radius r and evaluating µh (A) using the function h(r) = πr2 .
It is well known that limr→0 µr,h (A) exists and is equal to µh (A) = πR2 .
On the other hand, if we took h(r) = πrα with α < 2, then µh (A) = ∞,
whereas for α > 2, µh (A) = 0. For standard sets, such as circles, rectangles,
triangles, cylinders, etc., it is generally true that the intrinsic function for a
set A with topological dimension dT = d is given by (Hausdorff 1919)
h(r) = hd (r) = [{Γ(1/2)}d /Γ(1 + d/2)] rd . (3.21)
Figure 3.2 Fractal pictures (by Céline Beran, computer generated.) (Color figures
follow page 152 .)
where X1 , X2 , ... is a stationary discrete time process with zero mean and
a1 , a2 , ... a sequence of positive normalizing constants such that log an →
∞. Then there exists an H > 0 such that for any u > 0, limn→∞ (anu /an ) =
uH , Zt is self-similar with self-similarity parameter H, and Zt has station-
ary increments.
The self-similarity parameter therefore also makes sense for processes that
are not exactly self-similar themselves, since it is defined by the rate n−H
needed to standardize partial sums. Moreover, H is related to the fractal
dimension; the exact relationship between H and the fractal dimension,
however, depends on some other properties of the process as well. For
instance, sample paths of (univariate) Gaussian self-similar processes,
so-called fractional Brownian motion (see Chapter 4), have, with probability
one, a fractal dimension of 2 − H with possible values of H in the inter-
val (0, 1). Thus, the closer H is to 1, the more a sample path is similar
to a simple geometric line with dimension one. On the other hand, as H
approaches zero, a typical sample path fills up most of the plane so that
the dimension approaches two. Practically, H can be determined from an
observed series X1 , ..., Xn , for example by maximum likelihood estimation.
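As an illustration of how H governs the scaling of partial sums, here is a simple aggregated-variance estimate, a graphical alternative to the maximum likelihood approach mentioned above (the code and function names are our own sketch, not a method from the text): for self-similar increments, the variance of block means of size m behaves like m^(2H−2), so H can be read off a log-log regression.

```python
import math, random

def aggregated_variance_H(x, block_sizes):
    """Slope-based estimate of H: for self-similar increments the variance
    of non-overlapping block means of size m scales like m**(2H - 2), so a
    least-squares line through (log m, log Var) has slope 2H - 2."""
    logs_m, logs_v = [], []
    for m in block_sizes:
        means = [sum(x[i:i + m]) / m for i in range(0, len(x) - m + 1, m)]
        mu = sum(means) / len(means)
        v = sum((y - mu) ** 2 for y in means) / (len(means) - 1)
        logs_m.append(math.log(m))
        logs_v.append(math.log(v))
    k = len(logs_m)
    mx, my = sum(logs_m) / k, sum(logs_v) / k
    slope = (sum((a - mx) * (b - my) for a, b in zip(logs_m, logs_v))
             / sum((a - mx) ** 2 for a in logs_m))
    return 1 + slope / 2

random.seed(1)
iid = [random.gauss(0, 1) for _ in range(20000)]
h_hat = aggregated_variance_H(iid, [4, 8, 16, 32, 64])
print(round(h_hat, 2))  # close to 0.5 for uncorrelated noise
```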
For a thorough discussion of self-similar and related processes and sta-
tistical methods see e.g. Beran (1994). Further references on fractals apart
from those given above are, for instance, Edgar (1990), Falconer (1990),
Peitgen and Saupe (1988), Stoyan and Stoyan (1994), and Tricot (1995).
A cautionary remark should be made at this point: in view of theorem 11,
the fact that we do find self-similarity in aggregated time series is hardly
surprising and can therefore not be interpreted as something very special
that would distinguish the particular series from other data. What may
be special at most is which particular value of H is obtained and which
particular self-similar process the normalized aggregated series converges
to.
Let x(ti ) be the upper and y(ti ) the lower envelope of a composition at
score-onset times ti (i = 1, ..., n). To investigate the shape of the melodic
where f̂ is obtained from the observed data x(2;12) (t1 ), ..., x(2;12) (tn ) by
kernel estimation.
2. E2 : Same as E1 , but using x(2) (t1 ), ..., x(2) (tn ) instead.
3. E3 = − ∫∫ f̂ (x, y) log2 f̂ (x, y) dx dy (3.30)
Figure 3.6 Comparison of entropies 9 and 10 for Bach, Schumann, and Scri-
abin/Martin.
Figures 3.7, and 3.9 through 3.11 show the “omnibus” metric, melodic,
and harmonic weight functions for Bach’s Canon cancricans, Schumann’s
op. 15/2 and 7, and for Webern's Variations op. 27. For Bach's composi-
tion, the almost perfect symmetry around the middle of the composition
can be seen. Moreover, the metric curve exhibits a very regular up-and-
down movement. Schumann's curves, in particular the melodic one, show clear pe-
riodicities. This appears to be quite typical for Schumann and becomes
even clearer when plotting a kernel-smoothed version of the curves (here
a bandwidth of 8/8 was used). Interestingly, this type of pattern can also
be observed for Webern. In view of the historic development of 12-tone
music as a logical continuation of harmonic freedom and romantic ges-
ture achieved in the 19th and early 20th centuries, this similarity is not
completely unexpected. Finally, note that a relationship between metric,
Figure 3.7 Metric, melodic, and harmonic global indicators for Bach’s Canon
cancricans.
melodic and harmonic structure cannot be seen directly from the “raw”
curves. However, smoothed weights as shown in the figures above reveal
clear connections between the three weight functions. This is even the case
for Webern, in spite of the absence of tonality.
As gn → eitλ , the integrals In converge to a random variable I, in the sense
that

limn→∞ E[(I − In )2 ] = 0.

The random variable I is then denoted by ∫ exp(itλ)dZ(λ). The spectral
This means that the variance is a sum of contributions that are due to the
frequencies λj (1 ≤ j ≤ k). A sample path of Xt cannot be distinguished
from a deterministic periodic function because, once a sample path is
realized, the randomly selected amplitudes Aj are fixed.
Finally, it should be noted that not all frequencies are observable when
observations are taken at discrete time points t = 1, 2, ..., n. The smallest
identifiable period is 2, which corresponds to a highest observable frequency
of 2π/2 = π. The largest identifiable period is n/2, which corresponds to
the smallest frequency 4π/n. As n increases, the lowest frequency tends to
zero, but the highest does not change. In other words, the resolution at the
highest frequencies does not improve with increasing sample size.
To obtain more general models, one may wish to relax the condition
of stationarity. An asymptotic concept of local stationarity is defined in
Dahlhaus (Dahlhaus 1996a,b, 1997): a sequence of stochastic processes Xt,n
with “=” meaning almost sure (a.s.) equality, µ(u) continuous, and there
exists a 2π−periodic function A : [0, 1] ×R → C such that A(u, −λ) =
Ā(u, λ), A(u, λ) is continuous in u, and
sup_{t,λ} |A(t/n, λ) − At,n (λ)| ≤ cn−1 (4.15)
(a.s.) for some constant c < ∞. Intuitively, this means that for n large
enough, the observed process can be approximated locally in a small time
window t ± ε by the stationary process ∫ exp(itλ)A(t/n, λ) dZX (λ). The
order n−1 of the approximation is chosen such that most standard estima-
tion procedures, such as maximum likelihood estimation, can be applied
locally and their usual properties (e.g. consistency, asymptotic normality)
still hold. Under smoothness conditions on A one can prove that a mean-
ingful “evolving” spectral density fX (u, λ) (u ∈ (0, 1)) exists such that
fX (u, λ) = limn→∞ (2π)−1 Σ_{k=−∞}^{∞} cov(X[u·n−k/2],n , X[u·n+k/2],n ) e−ikλ (4.16)

The function fX (u, λ) is called evolutionary spectral density. Note that, for
fixed u,

limn→∞ cov(X[u·n−k/2],n , X[u·n+k/2],n ) = γX (k) = ∫_{−π}^{π} exp(ikλ)fX (u, λ)dλ.
Thumfart (1995) carries this concept over to series with discrete spectra.
A simplified definition can be given as follows: a sequence of stochastic
processes Xt,n (n ∈ N ) is said to have a discrete evolutionary spectrum
FX (u, λ), if
t " t t
Xt,n = µ( ) + Aj ( )eiλj ( n )t (4.17)
n n
j∈M
The reason why the frequency range extends to (−∞, ∞), instead of [−π, π],
is that in continuous time, by definition, arbitrarily small frequencies are
observable.
Suppose now that Yτ is observed at discrete time points t = j · ∆τ , i.e.
we observe
Xt = Yj·∆τ (4.21)
Then we can write

Xt = ∫_{−∞}^{∞} eij(∆τ λ) dZY (λ) = Σ_{u=−∞}^{∞} ∫_{−π/∆τ +(2π/∆τ )u}^{π/∆τ +(2π/∆τ )u} eij(∆τ λ) dZY (λ) (4.22)

= Σ_{u=−∞}^{∞} ∫_{−π/∆τ}^{π/∆τ} eij(∆τ λ) dZY (λ + (2π/∆τ )u) = ∫_{−π/∆τ}^{π/∆τ} eitλ dZX (λ) (4.23)

where

dZX (λ) = Σ_{u=−∞}^{∞} dZY (λ + (2π/∆τ )u) (4.24)

Moreover, if Yτ has spectral density fY , then the spectral density of Xt is

fX (λ) = Σ_{u=−∞}^{∞} fY (λ + (2π/∆τ )u) (4.25)

for λ ∈ [−π/∆τ , π/∆τ ]. This result can be interpreted as follows: a frequency λ >
π/∆τ can be written as λ = λo − (2π/∆τ )j for some j ∈ N where λo is in
the interval [−π/∆τ, π/∆τ ]. The contributions of the two frequencies λ and
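The aliasing identity behind (4.25) can be checked numerically: a frequency above π/∆τ produces exactly the same samples as its alias in [−π/∆τ, π/∆τ]. (A small illustrative sketch; the numbers are arbitrary.)

```python
import math

dtau = 0.1                        # sampling interval Delta-tau
nyq = math.pi / dtau              # highest observable frequency pi / dtau
lam = nyq + 5.0                   # a frequency above the observable range
lam_o = lam - 2 * math.pi / dtau  # its alias, lying in [-pi/dtau, pi/dtau]

high = [math.cos(lam * j * dtau) for j in range(20)]
alias = [math.cos(lam_o * j * dtau) for j in range(20)]
# At the sampling times j*dtau the two frequencies are indistinguishable,
# since their arguments differ by integer multiples of 2*pi:
print(max(abs(a - b) for a, b in zip(high, alias)) < 1e-9)  # True
```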
To eliminate a certain frequency band [a, b] one thus needs a linear filter
such that A(λ) ≡ 0 in this interval.
Equation (4.27) also helps to construct and simulate time series mod-
els with desired spectral densities: a series with spectral density fY (λ) =
(2π)−1 |A(λ)|2 can be simulated by passing a series of independent obser-
vations Xt through the filter A(λ). Note that, in reality, one can use only
a finite number of terms in the filter so that only an approximation can be
achieved.
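A minimal sketch of this simulation idea (our own illustration): white noise passed through the two-term filter with coefficients (1, 0.8) yields a series with spectral density (2π)−1 |1 + 0.8 e−iλ |2 , i.e. an MA(1) process with autocovariances γ(0) = 1.64 and γ(1) = 0.8.

```python
import random

random.seed(7)
a = [1.0, 0.8]    # filter coefficients a_0, a_1
n = 50000
eps = [random.gauss(0, 1) for _ in range(n + 1)]
# Pass the white noise through the finite (truncated) filter:
y = [a[0] * eps[t] + a[1] * eps[t - 1] for t in range(1, n + 1)]

# Compare sample autocovariances with the theoretical values 1.64 and 0.8:
m = sum(y) / n
g0 = sum((v - m) ** 2 for v in y) / n
g1 = sum((y[t] - m) * (y[t + 1] - m) for t in range(n - 1)) / n
print(round(g0, 1), round(g1, 1))  # approximately 1.6 and 0.8
```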
with the generalized binomial coefficient

(d choose k) = Γ(d + 1)/(Γ(k + 1)Γ(d − k + 1)).

The spectral density of (1 − B)m Xt is

fX (λ) = (σε2 /2π) |1 − eiλ |−2δ |ψ(eiλ )|2 /|ϕ(eiλ )|2 . (4.37)
The fractional differencing parameter δ plays an important role. If δ =
0, then (1 − B)m Xt is an ordinary ARIMA(p, 0, q) process, with spectral
density such that fX (λ) converges to a finite value fX (0) as λ → 0 and
the covariances decay exponentially, i.e. |γX (k)| ≤ Cak for some 0 < C <
∞, 0 < a < 1. The process is therefore said to have short memory. For
δ > 0, fX has a pole at the origin of the form fX (λ) ∝ λ−2δ as λ → 0, and
γX (k) ∝ k2δ−1 so that

Σ_{k=−∞}^{∞} γX (k) = ∞.
This case is also known as long memory, since autocorrelations decay very
slowly (see Beran 1994). On the other hand, if δ < 0, then fX (λ) ∝ λ−2δ
converges to zero at the origin and

Σ_{k=−∞}^{∞} γX (k) = 0.
This is called antipersistence, since for large lags there is a negative corre-
lation. The fractional differencing parameter δ, or d = δ + m, is also called
long-memory parameter, and is related to the fractal or Hausdorff dimen-
sion dH (see Chapter 3). For an extended discussion of long-memory and
antipersistent processes see e.g. Beran (1994) and references therein.
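To make the hyperbolic decay concrete, the coefficients of (1 − B)δ = Σk πk Bk can be computed by the standard recursion π0 = 1, πk = πk−1 (k − 1 − δ)/k; they decay like k−δ−1, in contrast to the exponential decay of ARMA weights. (An illustrative sketch, not taken from the text.)

```python
def frac_diff_weights(delta, n):
    """Coefficients pi_k of the fractional difference operator (1 - B)**delta,
    via the recursion pi_0 = 1, pi_k = pi_{k-1} * (k - 1 - delta) / k."""
    w = [1.0]
    for k in range(1, n):
        w.append(w[-1] * (k - 1 - delta) / k)
    return w

delta = 0.3
w = frac_diff_weights(delta, 1001)
print(w[1])  # -delta = -0.3
# Hyperbolic decay pi_k ~ const * k**(-delta - 1): doubling k multiplies
# the coefficient by roughly 2**(-delta - 1).
print(round(w[1000] / w[500], 3), round(2 ** (-delta - 1), 3))  # both about 0.406
```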
where Ut is stationary.
9. Harmonic or seasonal trend model:

Xt = Σ_{j=0}^{p} αj cos λj t + Σ_{j=0}^{p} βj sin λj t + Ut (4.42)

with Ut stationary
10. Nonparametric trend model:

Xt,n = g(t/n) + Ut (4.43)

with g : [0, 1] → R a “smooth” function (e.g. twice continuously differen-
tiable) and Ut stationary.
11. Semiparametric fractional autoregressive model, SEMIFAR(p, d, q) (Be-
ran 1998, Beran and Ocker 1999, 2001, Beran and Feng 2002a,b):
(1 − B)δ ϕ(B){(1 − B)m Xt − g(st )} = Ut (4.44)
where h is the joint density function of (X1 , ..., Xn ). If observations are dis-
crete, then h is the joint probability P (X1 = x1 , ..., Xn = xn ). Equivalently,
we may maximize the log-likelihood L(x1 , ..., xn ; θ) = log h(x1 , ..., xn ; θ).
Under fairly general regularity conditions, θ̂ is asymptotically consistent, in
the sense that it converges in probability to θo . In other words, limn→∞ P (|θ̂ −
θo | > ε) = 0 for all ε > 0. In the case of a Gaussian time series with spectral
density fX (λ; θ), we have
L(x1 , ..., xn ; θ) = −(1/2)[n log 2π + log |Σn | + (x − x̄)t Σn−1 (x − x̄)] (4.46)
where x = (x1 , ..., xn )t , x̄ = x̄ · (1, 1, ..., 1)t , and |Σn | is the determinant of
the covariance matrix of (X1 , ..., Xn )t with elements [Σn ]ij = cov(Xi , Xj ).
Since under general conditions n−1 log |Σn | converges to (2π)−1 times the
integral of log fX (Grenander and Szegö 1958), and the (j, l)th element of
Σn−1 can be approximated by ∫ fX−1 (λ) exp{i(j − l)λ}dλ, an approximation
to θ̂ can be obtained by the so-called Whittle estimator θ̃ (Whittle 1953;
also see e.g. Fox and Taqqu 1986, Dahlhaus 1987) that minimizes
Ln (θ) = (4π)−1 ∫_{−π}^{π} [log fX (λ; θ) + I(λ)/fX (λ; θ)] dλ (4.47)
An alternative approximation for Gaussian processes is obtained by using
an autoregressive representation of the type Xt = Σ_{j=1}^{∞} bj Xt−j + ϵt , where
ϵt are independent identically distributed zero mean normal variables with
variance σϵ2 . This leads to minimizing the sum of the squared residuals as
explained below in Equation (4.50) (see e.g. Box and Jenkins 1970, Beran
1995).
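To fix ideas, here is a toy version of the Whittle estimator (our own sketch, with all names ours): (4.47) is discretized over the Fourier frequencies and minimized over a grid, for an AR(1) model with the innovation variance treated as known for simplicity.

```python
import cmath, math, random

random.seed(3)
# Simulate an AR(1) series X_t = phi X_{t-1} + eps_t with phi = 0.6
phi_true, n = 0.6, 512
x, prev = [], 0.0
for _ in range(n + 100):          # 100 burn-in values are discarded
    prev = phi_true * prev + random.gauss(0, 1)
    x.append(prev)
x = x[100:]

def periodogram(x):
    """Periodogram I(lambda_j) at the Fourier frequencies 2*pi*j/n."""
    n = len(x)
    out = []
    for j in range(1, n // 2):
        lam = 2 * math.pi * j / n
        d = sum(x[t] * cmath.exp(-1j * lam * t) for t in range(n))
        out.append((lam, abs(d) ** 2 / (2 * math.pi * n)))
    return out

I = periodogram(x)

def whittle(phi):
    """Whittle objective (4.47), discretized over Fourier frequencies, for the
    AR(1) spectral density f(lam) = (1/2pi) / |1 - phi e^{-i lam}|^2."""
    tot = 0.0
    for lam, Ij in I:
        f = (1 / (2 * math.pi)) / abs(1 - phi * cmath.exp(-1j * lam)) ** 2
        tot += math.log(f) + Ij / f
    return tot

grid = [k / 100 for k in range(-95, 96)]
phi_hat = min(grid, key=whittle)
print(phi_hat)  # close to 0.6
```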
and σ̂ε2 = n−1 Σ e2t (η̂). Under mild regularity conditions, as n tends to
infinity, the distribution of √n(θ̂ − θ) tends to a normal distribution N (0, V )
with covariance matrix V = 2B −1 where B is a p × p matrix with
elements
Bij = (2π)−1 ∫_{−π}^{π} [∂/∂θi log f (λ; θ)][∂/∂θj log f (λ; θ)] dλ
(see e.g. Box and Jenkins 1970, Beran 1995).
The estimation method above assumes that the order of the model, i.e.
the length p of the parameter vector θ, is known. This is not the case in
general so that p has to be estimated from data. Information theoretic con-
siderations (based on definitions discussed in Section 3.1) lead to Akaike’s
famous criterion (AIC; Akaike 1973a,b)
p̂ = arg minp {−2 log(likelihood) + 2p} (4.51)
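A small illustration of (4.51) (our own sketch): for Gaussian AR models, −2 log(likelihood) equals n log σ̂2 up to an additive constant, where σ̂2 is the innovation variance of the fitted AR(p); the Levinson-Durbin recursion delivers σ̂2 directly from the sample autocovariances.

```python
import math, random

random.seed(5)
# Simulate an AR(2) series X_t = 0.5 X_{t-1} - 0.3 X_{t-2} + eps_t
n = 2000
x, x1, x2 = [], 0.0, 0.0
for _ in range(n + 100):
    xt = 0.5 * x1 - 0.3 * x2 + random.gauss(0, 1)
    x.append(xt)
    x1, x2 = xt, x1
x = x[100:]

def acov(x, k):
    """Sample autocovariance at lag k (divisor n)."""
    m = sum(x) / len(x)
    return sum((x[t] - m) * (x[t + k] - m) for t in range(len(x) - k)) / len(x)

def innovation_variance(gamma, p):
    """Levinson-Durbin recursion: innovation variance of the Yule-Walker AR(p) fit."""
    phi, v = [], gamma[0]
    for k in range(1, p + 1):
        kk = (gamma[k] - sum(phi[j] * gamma[k - 1 - j] for j in range(k - 1))) / v
        phi = [phi[i] - kk * phi[k - 2 - i] for i in range(k - 1)] + [kk]
        v *= 1 - kk * kk
    return v

gamma = [acov(x, k) for k in range(8)]
# AIC = -2 log(likelihood) + 2p reduces, up to a constant, to
# n log(sigma_hat^2) + 2p for Gaussian AR models:
aic = {p: len(x) * math.log(innovation_variance(gamma, p)) + 2 * p
       for p in range(1, 7)}
print(min(aic, key=aic.get))  # typically selects the true order 2
```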
The bias depends only on the function g, and is thus independent of the
error process. The variance, on the other hand, is a function of the co-
variances γU (k) = cov(Ut , Ut+k ), or equivalently the spectral density fU .
with Ut stationary. Note that, theoretically, this model can also be un-
derstood as a stationary process with jumps in the spectral distribution
FX (see Section 4.2.1). Given ω = (ω1 , ..., ωp )t , the parameter vector θ =
(α1 , ..., αp , β1 , ..., βp )t can be estimated by the least squares or, more gen-
erally, weighted least squares method,
θ̂ = arg minθ Σ_{t=1}^{n} w(t/n)[xt − Σ_{j=1}^{p} (αj cos ωj t + βj sin ωj t)]2 (4.61)
and

β̂j = [Σ_{t=1}^{n} w(t/n) xt sin ω̂j t] / [Σ_{t=1}^{n} w(t/n)] (4.64)
Note that (4.64) means that we look for the k largest peaks in the (w-
tapered) periodogram. Under quite general assumptions, the asymptotic
distribution of the estimates can be shown to be as follows: the vectors
Zn,j = [√n(α̂j − αj ), √n(β̂j − βj ), n3/2 (ω̂j − ωj )]t
(j = 1, ..., p) are asymptotically mutually independent, each having a 3-
dimensional normal distribution with expected value zero and covariance
matrix C(ωj ) that depends on fU (ωj ) and the weight function w. The
formulas for C are as follows (Irizarry 1998, 2000, 2001, 2002):
C(ωj ) = [4πfU (ωj )/(α2j + βj2 )] V (ωj ) (4.65)
Figure 4.2 Zoomed piano sound wave – shaded area in Figure 4.1.
spectrogram)

I(t, λ) = [2π Σ_{j=1}^{n} W2 ((t − j)/(nb))]−1 |Σ_{j=1}^{n} W ((t − j)/(nb)) e−iλj xj |2 (4.81)
where W : R → R+ is a weight function such that W (u) = 0 for |u| > 1 and
b > 0 is a bandwidth that determines how large the window (block) is, i.e.
how many consecutive observations are considered to correspond approx-
imately to a harmonic regression model with fixed coefficients αj , βj and
stationary noise Ut . This is illustrated in color Figure 4.7 for a harpsichord
sound, with W (u) = 1{|u| ≤ 1}. Intense pink corresponds to high values of
I(t, λ). Figures 4.6a through d show explicitly the change in I(t, λ) between
four different blocks. Since the note was played “staccato”, the sound wave
is very short, namely about 0.1 seconds. Nevertheless, there is a change in
the spectrum of the sound, with some of the higher harmonics fading away.
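The moving-window periodogram (4.81) with the rectangular weight W(u) = 1{|u| ≤ 1} is easy to implement directly (a toy sketch on a synthetic signal, not the harpsichord data; all names are ours):

```python
import cmath, math

def local_periodogram(x, t, nb, lam, W=lambda u: 1.0 if abs(u) <= 1 else 0.0):
    """Moving-window ("spectrogram") estimate I(t, lambda) of Equation (4.81):
    a periodogram computed from the weighted block of observations around t."""
    num = sum(W((t - j) / nb) * cmath.exp(-1j * lam * j) * x[j]
              for j in range(len(x)))
    den = 2 * math.pi * sum(W((t - j) / nb) ** 2 for j in range(len(x)))
    return abs(num) ** 2 / den

# A toy signal whose frequency changes halfway: the local periodogram at
# lam = 0.3 is large in the first half and small in the second.
x = ([math.cos(0.3 * t) for t in range(500)]
     + [math.cos(1.2 * t) for t in range(500, 1000)])
early = local_periodogram(x, 250, 100, 0.3)
late = local_periodogram(x, 750, 100, 0.3)
print(early > 10 * late)  # True
```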
Apart from the relative amplitudes of partials, most musical sounds in-
Figure 4.6 Harpsichord sound – periodogram plots for different time frames (mov-
ing windows of time points).
musical perception, the clarinet player was not out of tune, because the
deviation from 441 Hz was less than 0.76 Hz, which corresponds to 0.03
semitones. According to experimental studies, the human ear cannot
distinguish notes that are 0.03 semitones apart (Pierce 1983/1992).
2. Physical models (see e.g. Fletcher and Rossing 1991) postulate the fol-
lowing relationships between the fundamental frequency and partials:
for a “harmonic instrument” such as the clarinet, one expects
ωj = j · ω1 ,
whereas for a “plucked string instrument”, such as the guitar, one should
have
ωj ≈ c j2 · ω1
where c is a constant determined by properties of the strings. The ex-
periment described in Irizarry (2001) supports the assumption for the
clarinet in the sense that, in general, the 95%-confidence intervals for
the difference ωj − jω1 contained 0. For the guitar, his findings suggest
a relationship of the form ωj ≈ c(a + j)2 ω1 with a ≠ 0.
which Slaney and Lyon call “correlogram”. This is in fact an estimated local
autocovariance at lag k for section j and the time-segment with midpoint
u. The “Slaney-Lyon-correlogram” thus essentially characterizes the local
autocovariance structure of the resulting nerve impulse series. Thumfart
(1995) shows formally how, and under which conditions, this model can
be defined within the framework of processes with a discrete evolutionary
spectrum. He also suggests a simple method for estimating pitch (the fun-
damental frequency) at local time u by setting ω̂1 (u) = 2π/kmax (u) where
kmax (u) = arg maxk C(k, u) and C(k, u) = Σ_{j=1}^{86} c(k, j, u).
In other words, ω̃1 corresponds to the Fourier frequency where the first
peak of the periodogram occurs. Because of the restriction to Fourier fre-
quencies, the periodogram may have two adjacent peaks and the estimate is
in general too inaccurate. An empirical interpolation formula is suggested
by the authors to obtain an improved estimate ω̂1 . A comparison with har-
monic regression is not made, however, so that it is not clear how well the
interpolation works in comparison.
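The Fourier-grid pitch estimate described above can be sketched as follows (our own toy version; the grid spacing 2π/n bounds the attainable accuracy, which is what motivates the interpolation step):

```python
import cmath, math

def fourier_pitch(x):
    """Pitch estimate restricted to the Fourier frequencies 2*pi*j/n: the
    frequency of the periodogram's largest peak (the largest peak stands in
    for the 'first peak' here, since the fundamental dominates)."""
    n = len(x)
    I = []
    for j in range(1, n // 2):
        lam = 2 * math.pi * j / n
        d = sum(x[t] * cmath.exp(-1j * lam * t) for t in range(n))
        I.append(abs(d) ** 2)
    jmax = max(range(len(I)), key=lambda j: I[j])
    return 2 * math.pi * (jmax + 1) / n

# Harmonic signal with fundamental 0.6 and weaker partials at 1.2 and 1.8:
n = 400
x = [math.sin(0.6 * t) + 0.5 * math.sin(1.2 * t) + 0.2 * math.sin(1.8 * t)
     for t in range(n)]
print(round(fourier_pitch(x), 3))  # within the grid spacing 2*pi/400 of 0.6
```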
Given a procedure for pitch identification, an automatic note separation
procedure can be defined. This is a procedure that identifies time points in
a sound signal where a new note starts. The interesting result in Weihs et
al. is that automatic note separation works better for amateur singers than
for professionals. The reason may be the absence of vibrato in amateur
voices. In a third step, Weihs et al. address the question of how to as-
sess computationally the purity of intonation based on a vocal time series.
This is done using discriminant analysis. The discussion of these results is
therefore postponed to Chapter 9.
Figure 4.8 A harpsichord sound wave (a), logarithm of squared amplitudes (b),
histogram of the series (c) and its periodogram on log-scale (d) together with fitted
SEMIFAR-spectrum.
are similar to before, namely dˆ = 0.51 ([0.20, 0.81]) for Bach and 0.33
([0.24, 0.42]) for Paganini.
Hierarchical methods
Suppose that we have two time series Yt , Xt and we wish to model the re-
lationship between Yt and Xt . The simplest model is simple linear regression
Yt = βo + β1 Xt + εt (5.1)
Yt,2 = β02 + Σ_{j=1}^{M} βj2 Xt,j + εt,2

...

Yt,L = β0L + Σ_{j=1}^{M} βjL Xt,j + εt,L .
The collection of time series {X1,j , ..., Xn,j } (j = 1, ..., M ) is called a hier-
archical decomposition of Xt . The HIREG-model is then defined by (5.2).
If εt (t = 1, 2, ...) are independent, then usual techniques of multiple linear
regression can be used (see e.g. Plackett 1960, Rao 1973, Ryan 1996, Sri-
vastava and Sen 1997, Draper and Smith 1998). In case of correlated errors
εt , appropriate adjustments of tests, confidence intervals, and parameter
selection techniques must be made. The main assumption in the HIREG
model is that we know which bandwidths to use. In some cases this may
indeed be true. For instance, if there is a three-four meter at the begin-
ning of a musical score, then bandwidths that are divisible by three are
plausible.
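The HIREG idea can be sketched in a few lines (a hypothetical illustration, ours, with moving averages standing in for the bandwidth-b smoothers): the response is regressed jointly on several smoothed versions of the same explanatory series.

```python
import random

random.seed(2)

def moving_average(x, b):
    """Simple moving-average smoother with half-width b (a stand-in for a
    kernel smoother with bandwidth b)."""
    return [sum(x[max(0, t - b):t + b + 1]) / len(x[max(0, t - b):t + b + 1])
            for t in range(len(x))]

n = 300
x = [random.gauss(0, 1) for _ in range(n)]
# Hypothetical hierarchical decomposition: e.g. beat level and bar level.
x3, x9 = moving_average(x, 3), moving_average(x, 9)
# The response depends on the bar-level component only:
y = [2.0 * x9[t] + random.gauss(0, 0.1) for t in range(n)]

# Ordinary least squares of y on (1, x3, x9) via the normal equations:
cols = [[1.0] * n, x3, x9]
A = [[sum(c1[t] * c2[t] for t in range(n)) for c2 in cols] for c1 in cols]
b = [sum(c[t] * y[t] for t in range(n)) for c in cols]
for c in range(3):                               # Gaussian elimination
    piv = max(range(c, 3), key=lambda r: abs(A[r][c]))
    A[c], A[piv] = A[piv], A[c]
    b[c], b[piv] = b[piv], b[c]
    for r in range(c + 1, 3):
        f = A[r][c] / A[c][c]
        A[r] = [u - f * v for u, v in zip(A[r], A[c])]
        b[r] -= f * b[c]
beta = [0.0] * 3
for c in reversed(range(3)):                     # back substitution
    beta[c] = (b[c] - sum(A[c][j] * beta[j] for j in range(c + 1, 3))) / A[c][c]
print([round(v, 1) for v in beta])  # roughly [0.0, 0.0, 2.0]
```

The regression correctly attributes the dependence to the bar-level regressor, which is exactly what the HIREG decomposition is meant to reveal.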
K((t − ti )/b)
for j = M +1, ..., 2M. Under suitable assumptions, the estimate θ̂ is asymp-
totically normal. More specifically, set
hi (t; θo ) = g(t; bi ) (i = 1, ..., M ) (5.13)
Then, as n → ∞, lim inf |ar − as | > 0, and lim inf |br − bs | > 0 for all
r ≠ s ∈ {1, ..., M }.
(A3) x(ti ) = ξ(ti ) where ξ : [0, T ] → R is a function in C[0, T ], T < ∞.
(A4) The set of time points converges to a set A that is dense in [0, T ].
Then we have (Beran and Mazzola 1999b):
Theorem 12 Let Θ1 and Θ2 be compact subsets of R and R+ respectively,
Θ = (Θ1 )M × (Θ2 )M and let η = (1/2) min{1, 1 − 2d}. Suppose that (A1), (A2), (A3)
where

ak = < g, ϕk > = ∫ g(x)ϕk (x)dx (5.25)

and

bjk = < g, ψjk > = ∫ g(x)ψjk (x)dx (5.26)

wavelet decomposition g(t) = Σk ak ϕk (t) + Σj,k bj,k ψj,k (t). For 0 = cM +1 <
cM < ... < c1 < co = ∞ let

g(t; ci−1 , ci ) = Σ_{ci ≤|ak |<ci−1} ak ϕk (t) + Σ_{ci ≤|bj,k |<ci−1} bj,k ψj,k (t).
The definition means that the time series Y (t) is decomposed into orthog-
onal components that are proportional to certain “bands” in the wavelet
decomposition of the explanatory series X(t) – the bands being defined by
the size of wavelet coefficients. As for HISMOOTH models, the parameter
vector θ = (β, η)t can be estimated by nonlinear least squares regression.
To illustrate how HIWAVE-models may be used, consider the following
simulated example: let xi = g(ti ) (i = 1, ..., 1024) as in the previous ex-
ample. The function g is decomposed into g(t) = g(t; ∞, η1 ) + g(t; η1 , 0) =
g1 (t) + g2 (t) where η1 is such that 50 wavelet coefficients of g are larger
than or equal to η1 . Figure 5.2 shows g, g1 , and g2 . A simulated series of response
variables, defined by Y (ti ) = 2g1 (ti ) + εi (t = 1, ..., 1024) with independent
zero-mean normal errors εi with variance σε2 = 100, is shown in Figure 5.3b.
A comparison of the two scatter plots in Figures 5.3c and d shows a much
clearer dependence between y and g1 as compared to y versus x = g. Figure
5.3e illustrates that there is no relationship between y and g2 . Finally, the
time-frequency plot in Figure 5.3f indicates that the main periodic behavior
occurs for t ∈ {701, ..., 900}. The difficulty in practice is that the correct
decomposition of x into g1 and the redundant component g2 is not known
a priori. Figure 5.4 shows y and the HIWAVE-curve β̂o + β̂1 g(ti ; ∞, η̂1 ) (for
graphical reasons the fitted curve is shifted vertically) fitted by nonlinear
least squares regression. Apparently, the algorithm identified η̂1 and hence
the relevant time span [701, 900] quite exactly, since g(ti ; ∞, η̂1 ) corresponds
to the sum of the largest 51 wavelet components. The estimated coefficients
are β̂o = −0.36 and β̂1 = 1.95. If we assume (incorrectly of course) that
η̂1 has been known a priori, then we can give confidence intervals for both
parameters as in linear least squares regression. These intervals are gen-
erally too short, since they do not take into account that η̂1 is estimated.
However, if a null hypothesis is not rejected using these intervals, then it
will not be rejected by the correct test either. In our case, the linear re-
gression confidence intervals for βo and β1 are [−0.96, 0.24] and [1.81, 2.09]
respectively, and thus contain the true values βo = 0 and β1 = 2.
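The decomposition by coefficient size can be sketched with a hand-rolled Haar transform (our own illustration; the function g of Figure 5.2 is replaced by a toy step-plus-wiggle signal): keeping the largest coefficients recovers the band g(t; ∞, η1), while the discarded band carries the small oscillations.

```python
import math

def haar_transform(x):
    """Orthonormal Haar wavelet transform of a signal of length 2**J."""
    coeffs, s = [], list(x)
    while len(s) > 1:
        d = [(s[2 * i] - s[2 * i + 1]) / math.sqrt(2) for i in range(len(s) // 2)]
        s = [(s[2 * i] + s[2 * i + 1]) / math.sqrt(2) for i in range(len(s) // 2)]
        coeffs = d + coeffs          # store details coarse-to-fine
    return s + coeffs                # [overall scaling coefficient, details...]

def haar_inverse(c):
    s, rest = [c[0]], c[1:]
    while rest:
        d, rest = rest[:len(s)], rest[len(s):]
        s = [v for a, b in zip(s, d)
             for v in ((a + b) / math.sqrt(2), (a - b) / math.sqrt(2))]
    return s

def largest_band(x, keep):
    """Reconstruction from the `keep` largest wavelet coefficients: the
    analogue of the band g(t; infinity, eta_1)."""
    c = haar_transform(x)
    thresh = sorted((abs(v) for v in c), reverse=True)[keep - 1]
    return haar_inverse([v if abs(v) >= thresh else 0.0 for v in c])

# Step of height 10 plus a small oscillation:
x = [(10.0 if t < 64 else 0.0) + math.sin(2.1 * t) for t in range(128)]
g1 = largest_band(x, 4)
# The 4 largest coefficients capture the step; the residual is the wiggle.
print(round(sum(g1[:64]) / 64, 1), round(sum(g1[64:]) / 64, 1))  # about 10.0 and 0.0
```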
two (Figure 5.12), five (Figure 5.13) and ten (Figure 5.14) best basis func-
tions. The plots show interesting and plausible similarities and differences.
Particularly striking are Cortot’s 4-bar oscillations, Horowitz’s “seismic”
local fluctuations, the relatively unbalanced tempo with a few extreme
tempo variations for Eschenbach, Klien, Ortiz, and Schnabel, the irregular
shapes for Moisewitsch, and also a strong similarity between Horowitz1 and
Moisewitsch with respect to the general shape (Figure 5.12).
Figure 5.11 Wavelet coefficients for Cortot’s and Horowitz’s three performances.
Figure 5.12 Tempo curves – approximation by most important 2 best basis func-
tions.
Figure 5.13 Tempo curves – approximation by most important 5 best basis func-
tions.
Figure 5.14 Tempo curves – approximation by the 10 most important best basis
functions.
and
p_j^(n) = P(X_{t+n} = j) = [π^t M^n]_j. (6.5)
This implies
q_jj = 0 for f_jj < 1
and
q_jj = 1 for f_jj = 1.
A simple way of checking whether a state is persistent or not is given by
Theorem 13 The following holds for a Markov chain:
i) A state j is transient ⇔ q_jj = 0 ⇔ Σ_{n=1}^∞ p_jj^(n) < ∞
ii) A state j is persistent ⇔ q_jj = 1 ⇔ Σ_{n=1}^∞ p_jj^(n) = ∞.
The condition on Σ_{n=1}^∞ p_ii^(n) can be simplified further for irreducible Markov
chains:
Definition 43 A Markov chain is called irreducible, if for each i, j ∈ S,
p_ij^(n) > 0 for some n.
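Irreducibility can be checked numerically: p_ij^(n) > 0 for some n exactly when state j is reachable from state i in the directed graph with an edge i → j wherever p_ij > 0. A minimal sketch (the example chain is an illustrative assumption, not from the text):

```python
import numpy as np

def is_irreducible(M):
    """True if for each pair (i, j) there is some n with p_ij^(n) > 0,
    i.e. the directed graph of positive transition probabilities is
    strongly connected."""
    k = M.shape[0]
    # adjacency matrix plus self-loops; powers count walks of length <= k
    A = (M > 0).astype(np.int64) + np.eye(k, dtype=np.int64)
    R = np.linalg.matrix_power(A, k)
    return bool(np.all(R > 0))

# Once the chain leaves state 2 it can never return, so it is reducible:
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.5, 0.0],
              [0.2, 0.2, 0.6]])
```

For this M, `is_irreducible(M)` is False, whereas any chain whose graph of positive entries is strongly connected returns True.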
Irreducibility means that wherever we start, any state j can be reached in
due time with positive probability. This excludes the possibility of being
caught forever in a certain subset of S. With respect to persistent and tran-
sient states, the situation simplifies greatly for irreducible Markov chains:
Theorem 14 Suppose that Xt (t = 0, 1, ...) is an irreducible Markov chain.
Then one of the following possibilities is true:
or in matrix form,
π t M = π. (6.10)
This means that if we start with distribution π, then the distribution of all
subsequent X_t's is again π.
The next question is to what extent the initial distribution influences the
dynamic behavior (probability distribution) into the infinite future. A possible
complication is that the process may be periodic in the sense that one
may return to certain states periodically:
Definition 45 A state j is said to have period τ , if
p_jj^(n) > 0
implies that n is a multiple of τ .
For an irreducible Markov chain, all states have the same period. Hence,
the following definition is meaningful:
Definition 46 An irreducible Markov chain is called periodic if τ > 1, and
it is called aperiodic if τ = 1.
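The period of a state can also be computed numerically as the greatest common divisor of the return times n with p_jj^(n) > 0. A sketch (truncating the search at n_max is a practical heuristic, not part of the definition; the example chain is hypothetical):

```python
from math import gcd
import numpy as np

def period(M, j, n_max=100):
    """Period of state j: gcd of all n <= n_max with p_jj^(n) > 0."""
    d = 0
    P = np.eye(M.shape[0])
    for n in range(1, n_max + 1):
        P = P @ M          # P is now M^n
        if P[j, j] > 0:
            d = gcd(d, n)  # gcd(0, n) = n, so the first return initializes d
    return d

# A two-state "flip-flop" returns to state 0 only at even times: period 2
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])
```

For this M, `period(M, 0)` is 2; replacing the entries by 0.5 everywhere gives an aperiodic chain with period 1.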
It can be shown that for an aperiodic Markov chain, there is at most one
stationary distribution and, if there is one, then the initial distribution does
not play any role ultimately:
Theorem 15 If Xt (t = 0, 1, ...) is an aperiodic irreducible Markov chain
for which a stationary distribution π exists, then the following holds:
(i) the Markov chain is persistent;
(ii) lim_{n→∞} p_ij^(n) = π_j > 0 for all i, j;
(iii) the stationary distribution π is unique.
In the other case of an aperiodic irreducible Markov chain for which no
stationary distribution exists, we have
lim_{n→∞} p_ij^(n) = 0
for all i, j. Note that this is even the case if the Markov chain is persistent.
One then can classify irreducible aperiodic Markov chains into three classes:
Σ_{n=1}^∞ p_ij^(n) = ∞
and
μ_j = Σ_{n=1}^∞ n f_jj^(n) = ∞
for all i, j, and the average number of steps till the process returns to
state j is given by
μ_j = π_j^{-1}.
For Markov chains with a finite state space, the results simplify further:
Theorem 17 If X_t is an irreducible aperiodic Markov chain with a finite
state space, then the following holds:
(i) X_t is persistent
(ii) a unique stationary distribution π = (π_1, ..., π_k)^t exists and is the so-
lution of
π^t (I − M) = 0,  (0 ≤ π_j ≤ 1, Σ π_j = 1) (6.11)
where I is the k × k identity matrix.
Note that Σ_j M_ij = Σ_j p_ij = 1 so that Σ_j (I − M)_ij = 0, i.e. the matrix
(I − M) is singular. (If this were not the case, then the only solution to the
system of linear equations would be 0 so that no stationary distribution
would exist.) Thus, there are infinitely many solutions of (6.11). However,
there is only one solution that satisfies the conditions 0 ≤ π_j ≤ 1 and
Σ π_j = 1.
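Numerically, a convenient way to impose the constraints is to replace one of the linearly dependent equations in π^t(I − M) = 0 by the normalization Σ π_j = 1. A sketch in NumPy (the two-state chain is an illustrative assumption, not from the text):

```python
import numpy as np

def stationary_distribution(M):
    """Solve pi^t (I - M) = 0 subject to sum(pi) = 1.

    Since (I - M) is singular, one equation is redundant; we replace the
    last one by the normalization constraint.
    """
    k = M.shape[0]
    A = (np.eye(k) - M).T   # (I - M)^t pi = 0, written row-wise
    A[-1, :] = 1.0          # replace last equation by sum(pi_j) = 1
    b = np.zeros(k)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

# A simple two-state chain; its stationary distribution is (0.6, 0.4)
M = np.array([[0.8, 0.2],
              [0.3, 0.7]])
pi = stationary_distribution(M)
```

One can verify that `pi @ M` reproduces `pi`, as required by (6.10).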
and the stationary distribution π of the Markov chain with transition ma-
trix M̂ = (p̂_ij)_{i,j=1,...,11} is estimated by solving the system of linear equa-
tions
π^t (I − M̂) = 0
as described above. Figures 6.3a through l show the resulting values of π̂_j
(joined by lines). For each composition, the components π̂_j are plotted against
j. For visual clarity, points at neighboring states j and j−1 are connected. The
figures illustrate how the characteristic shape of π changed in the course of
the last 500 years. The most dramatic change occurred in the 20th century
with a “flattening” of the peaks. Starting with Scriabin, a pioneer of atonal
music though still rooted in the romantic style of the late 19th century, this
is most extreme for the compositions by Schönberg, Webern, Takemitsu,
and Messiaen. On the other hand, Prokoffieff’s “Visions fugitives” exhibit
clear peaks, but at varying locations. The estimated stationary distributions
can also be used to perform a cluster analysis. Figure 6.4 shows the result
of the single linkage algorithm with the Manhattan norm (see Chapter
10). To make the names legible, only a subsample of the data was used. An
almost perfect separation between Bach and composers from the classical
and romantic periods can be seen.
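The entries p̂_ij of M̂ are the observed relative transition frequencies. A minimal sketch of this estimation step (the toy state sequence is hypothetical; in the text the states are the 11 torus-distance classes):

```python
import numpy as np

def estimate_transition_matrix(states, k):
    """MLE of transition probabilities: p_hat_ij = n_ij / n_i, where
    n_ij counts observed one-step transitions i -> j."""
    counts = np.zeros((k, k))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1.0   # leave never-visited states as zero rows
    return counts / rows

# Hypothetical short sequence over 3 states:
seq = [0, 1, 0, 1, 1, 2, 0, 1, 0]
M_hat = estimate_transition_matrix(seq, 3)
```

Each visited row of `M_hat` sums to one, so `M_hat` is a valid transition matrix whose stationary distribution can then be computed as in (6.11).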
Figure 6.4 Cluster analysis based on stationary Markov chain distributions for
compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rach-
maninoff.
can be observed. The differences are even more visible when comparing in-
dividual composers. This is illustrated in Figures 6.9a and b where Bach’s
and Schumann’s log(π̂1 /π̂3 ) and log(π̂2 /π̂3 ) are compared, and in Figures
6.10a through f where the median and lower and upper quartiles of π̂j are
plotted against j. Finally, Figure 6.11 shows the plots of log(π̂1 /π̂3 ) and
log(π̂2 /π̂3 ) against the date of birth.
country separately. Only 70% of the data are used for estimation. The
remaining 30% are used for validation of a classification rule defined as
follows: a melody is assigned to country j, if the corresponding likelihood
(calculated using the country’s hidden Markov model) is the largest. Not
surprisingly, the authors conclude that the most reliable distinction can be
made between Irish and non-Irish songs.
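The classification rule can be sketched as follows: compute, for each country’s hidden Markov model, the likelihood of the melody via the forward algorithm, and assign the melody to the model with the largest value. The model names and toy parameters below are hypothetical; the study’s actual models are estimated from the folk-song corpora:

```python
import numpy as np

def log_likelihood(obs, pi0, A, B):
    """Scaled forward algorithm: log P(obs | HMM) with initial state
    distribution pi0, transition matrix A, and emission matrix B
    (rows: hidden states, columns: observed symbols)."""
    alpha = pi0 * B[:, obs[0]]
    ll = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        ll += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return ll

def classify(melody, models):
    """Assign the melody to the model (country) with largest likelihood."""
    return max(models, key=lambda c: log_likelihood(melody, *models[c]))

# Hypothetical one-state models: the 'Irish' model favors symbol 0
models = {
    'Irish': (np.array([1.0]), np.array([[1.0]]), np.array([[0.9, 0.1]])),
    'other': (np.array([1.0]), np.array([[1.0]]), np.array([[0.1, 0.9]])),
}
```

With these toy models, a melody consisting mostly of symbol 0 is assigned to 'Irish'.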
Figure 6.6 Comparison of log odds ratios log(π̂_1/π̂_2) of stationary Markov chain
distributions of torus distances, for composers grouped by date of birth (before
1720, 1720–1800, 1800–1880, 1880 and later).
Figure 6.7 Comparison of log odds ratios log(π̂_1/π̂_3) of stationary Markov chain
distributions of torus distances, for composers grouped by date of birth (before
1720, 1720–1800, 1800–1880, 1880 and later).
Figure 6.8 Comparison of log odds ratios log(π̂_2/π̂_3) of stationary Markov chain
distributions of torus distances, for composers grouped by date of birth (before
1720, 1720–1800, 1800–1880, 1880 and later).
Figure 6.9 Comparison of log odds ratios log(π̂_1/π̂_3) and log(π̂_2/π̂_3) of stationary
Markov chain distributions of torus distances.
Figure 6.11 Log odds ratios log(π̂_1/π̂_3) and log(π̂_2/π̂_3) plotted against date of
birth of composer.
Circular statistics
Many phenomena in music are circular. The best known examples are re-
peated rhythmic patterns, the circles of fourths and fifths, and scales mod-
ulo octave in the well-tempered system. In the circle of fourths, for example,
one progresses by steps of a fourth and arrives, after 12 steps, at the ini-
tial starting point modulo octave. It is not immediately clear whether and
how to “calculate” in such situations, and what type of statistical proce-
dures may be used. The theory of circular statistics has been developed
to analyze data on circles where angles have a meaning. Originally, this
was motivated by data in biology (e.g. direction of bird flight), meteorol-
ogy (e.g. direction of wind), and geology (e.g. magnetic fields). Here we
give a very brief introduction, mostly to descriptive statistics. For an ex-
tended account of methods and applications of circular statistics see, for
instance, Mardia (1972), Batschelet (1981), Watson (1983), Fisher (1993),
and Jammalamadaka and SenGupta (2001). In music, circular methods can
be applied to situations where angles measure a meaningful distance be-
tween points on the circle and arithmetic operations in the sense of circular
data are well defined.
C̄_p = C_p/n,  S̄_p = S_p/n,  R̄_p = R_p/n (7.7)
and
ϕ̄(p) = arctan(S_p/C_p) + π·1{C_p < 0} + 2π·1{C_p > 0, S_p < 0} (7.8)
Then
m_p = C̄_p + i S̄_p = R̄_p e^{iϕ̄(p)} (7.9)
is called the pth trigonometric sample moment.
For p = 1, this definition yields
m_1 = C̄_1 + i S̄_1 = R̄_1 e^{iϕ̄(1)}
with C̄_1 = C̄, S̄_1 = S̄, R̄_1 = R̄ and ϕ̄(1) = ϕ̄ as before. Similarly, we have
Definition 52 Let
C_p^o = Σ_{i=1}^n cos p(ϕ_i − ϕ̄(1)),  S_p^o = Σ_{i=1}^n sin p(ϕ_i − ϕ̄(1)) (7.10)
C̄_p^o = C_p^o/n,  S̄_p^o = S_p^o/n (7.11)
ϕ̄^o(p) = arctan(S_p^o/C_p^o) + π·1{C_p^o < 0} + 2π·1{C_p^o > 0, S_p^o < 0} (7.12)
Then
m_p^o = C̄_p^o + i S̄_p^o = R̄_p^o e^{iϕ̄^o(p)} (7.13)
is called the pth centered trigonometric (sample) moment m_p^o, centered rel-
ative to the mean direction ϕ̄(1).
Note, in particular, that Σ_i sin(ϕ_i − ϕ̄(1)) = 0 so that m_1^o = R̄_1. An overview
of descriptive measures of center and variability is given in Table 7.1.
Mean deviation (a measure of variability): D_n = π − (1/n) Σ_{i=1}^n |π − |ϕ_i − M_n||
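For p = 1 the quantities above reduce to the familiar mean direction and mean resultant length. A sketch of their computation; note that `np.arctan2` implements exactly the branch corrections of (7.8), up to the choice of interval:

```python
import numpy as np

def mean_direction_stats(phi):
    """Mean direction phibar and mean resultant length Rbar of a sample
    of angles phi (in radians), following (7.7)-(7.8)."""
    n = len(phi)
    Cbar = np.cos(phi).sum() / n
    Sbar = np.sin(phi).sum() / n
    Rbar = np.hypot(Cbar, Sbar)
    # arctan2 plus mod 2*pi reproduces the indicator corrections in (7.8)
    phibar = np.arctan2(Sbar, Cbar) % (2 * np.pi)
    return phibar, Rbar

# Two angles symmetric about 0: mean direction 0, Rbar = cos(0.1)
phi = np.array([0.1, -0.1])
direction, Rbar = mean_direction_stats(phi)
```

For strongly concentrated samples Rbar is close to 1; for directions spread uniformly over the circle it is close to 0.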
or
r_ϕ(k) = det(n^{-1} Σ_{i=1}^{n−k} x_i x_{i+k}^t) / det(n^{-1} Σ_{i=1}^n x_i x_i^t) (7.18)
F(u) = P(0 ≤ ϕ ≤ u) = (u/2π)·1{0 ≤ u < 2π},
f(u) = F′(u) = (1/2π)·1{0 ≤ u < 2π}.
In this case, µp = ρp = 0, the mean direction µϕ is not defined, and the
circular standard deviation σ and dispersion δ are infinite. This expresses
the fact that there is no preference for any direction and variability is
therefore maximal.
F(u) = [(ρ/π) sin(u − µ) + u/(2π)]·1{0 ≤ u < 2π}
and
f(u) = (1/2π)(1 + 2ρ cos(u − µ))·1{0 ≤ u < 2π}
where 0 ≤ ρ ≤ 1/2. In this case, µ_ϕ = µ, ρ_1 = ρ, µ_p = 0 (p ≥ 1) and
δ = 1/(2ρ^2). An interesting property is that this distribution tends to the
uniform distribution as ρ → 0.
Wrapped distribution:
is the modified Bessel function of the first kind and order 0. In this case,
we have µ_ϕ = µ, ρ_1 = I_1/I_0, δ = (κ I_1/I_0)^{-1}, µ_{p,C} = I_p/I_0 and µ_{p,S} = 0
(p ≥ 1) where
I_p = Σ_{j=0}^∞ [1/((j + p)! j!)] (κ/2)^{2j+p}
is a modified Bessel function of order p. For κ → 0, the M (µ, κ)-distribution
converges to U ([0, 2π)), and for κ → ∞ we obtain a point mass in the
direction µϕ .
Mixture distribution:
All distributions above are unimodal. Distributions with more than one
mode can be modeled, for instance, by mixture distributions
f_ϕ(u) = p_1 f_{ϕ,1}(u) + ... + p_m f_{ϕ,m}(u)
where 0 ≤ p_1, ..., p_m ≤ 1, Σ p_i = 1, and the f_{ϕ,j} are different circular probabil-
ity densities.
• D. Scarlatti: Sonatas Kirkpatrick No. 49, 125, 222, 345, 381, 412, 440,
541
• B. Bartók (Figure 7.1): Bagatelles No. 1–3, Sonata for Piano (2nd move-
ment)
The same as above can be carried out for intervals between successive
notes (Figure 7.5). Figure 7.6 shows that, again, variability is much lower
for Bartók and Prokoffieff.
where Λ = diag(λ_1, λ_2, ..., λ_p) is a diagonal matrix, the λ_j are the eigen-
values, and the columns a^(j) of A are the corresponding orthonormal eigenvec-
tors of B, i.e. we have
Ba^(j) = λ_j a^(j) (8.2)
|a^(j)|^2 = [a^(j)]^t a^(j) = 1, and [a^(j)]^t a^(l) = 0 for j ≠ l (8.3)
In matrix form equation (8.3) means that A is an orthogonal matrix, i.e.
A^t A = I (8.4)
where I denotes the identity matrix with I_jj = 1 and I_jl = 0 (j ≠ l).
This result can now be applied to the covariance matrix of a random
vector X = (X_1, ..., X_p)^t:
Theorem 19 Let X be a p-dimensional random vector with expected value
E(X) = µ and p × p covariance matrix Σ. Then
Σ = AΛA^t (8.5)
where the columns a^(j) of A are eigenvectors of Σ and Λ is a diagonal
matrix with eigenvalues λ_1, ..., λ_p ≥ 0.
In particular, we may permute the sequence of the X-components such that
the eigenvalues are ordered. We thus obtain:
Theorem 20 Let X be a p-dimensional random vector with expected value
E(X) = µ and a p × p covariance matrix Σ. Then there exists an orthogonal
matrix A such that
Σ = AΛA^t (8.6)
where the columns a^(j) of A are eigenvectors of Σ and Λ is a diagonal
matrix with eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0. Moreover, the covariance
matrix of the transformed vector
Z = A^t (X − µ) (8.7)
If λ̂j is large, then the observed jth principal components zj (1), ..., zj (n)
have a large sample variance so that the observed values are scattered far
apart.
Since the eigenvalues λ_i are ordered according to their size, we may there-
fore hope that the proportion of total variation
P(q) = (λ_1 + ... + λ_q) / Σ_{i=1}^p λ_i (8.19)
is close to one for a low value of q. If this is the case, then one may re-
duce the dimension of the random vector considerably without losing much
information. For data, we plot P̂(q) = (λ̂_1 + ... + λ̂_q)/Σ λ̂_i versus q and
judge by eye from which point on the increase in P̂(q) is not worth the
price of adding additional dimensions. Alternatively, we may plot the con-
tribution of each eigenvalue, λ̂_j/Σ λ̂_i or λ̂_j itself, against j. This is the
so-called scree graph. More formal tests, e.g. for testing which eigenvalues
are nonzero or for comparing different eigenvalues, are available, however
mostly under the rather restrictive assumption that the distribution of X
is multivariate normal (see e.g. Mardia et al. 1979, Ch. 8.3.2).
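The quantities P̂(q) of (8.19) follow directly from the eigendecomposition of the sample covariance matrix. A sketch (the simulated data set is purely illustrative):

```python
import numpy as np

def pca_explained(X):
    """Eigenvalues of the sample covariance matrix (descending) and the
    cumulative proportions of total variation P(q) of (8.19)."""
    S = np.cov(X, rowvar=False)
    lam = np.linalg.eigvalsh(S)[::-1]   # eigvalsh returns ascending order
    lam = np.clip(lam, 0.0, None)       # guard against tiny negative values
    P = np.cumsum(lam) / lam.sum()
    return lam, P

# Simulated 3-dimensional data where one direction dominates the variance
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.5])
lam, P = pca_explained(X)
```

Here most of the total variation is captured by the first component, so one would keep q = 1 or q = 2 after inspecting the scree graph.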
In addition to the scree plot, the decision on the number of principal
components is often also based on the (possibly subjective) interpretability
of the components. The interpretation of principal components may be
based on the coefficients a_j^(i) and/or on the correlation between Z_j and
the coordinates of the original random vector X = (X_1, ..., X_p)^t. Note that
since E(ZX^t) = E(A^t XX^t) = A^t Σ = A^t AΛA^t = ΛA^t, var(X_k) = σ_kk and
8.2.5 Plots
One of the main difficulties with high-dimensional data is that they cannot
be represented directly in a two-dimensional display. Principal components
provide a possible solution to this problem. The situation is particularly
simple if the first two principal components explain most of the variability.
In that case, the original data (x1 (i), ..., xp (i))t (i = 1, 2, ..., n) may be re-
placed by the first two principal components (z1 (i), z2 (i))t (i = 1, 2, ..., n).
Thus, z2 (i) is plotted against z1 (i). If more than two principal components
are needed, then the plot of z2 (i) versus z1 (i) provides at least a partial view
of the data structure, and further projections can be viewed by corresponding
scatter plots of other components, or by symbol plots as described in Chap-
ter 2. The scatter plots can be useful for identifying structure in the data.
In particular, one may detect unusual observations (outliers) or clusters of
similar observations.
Figure 8.1 Tempo curves for Schumann’s Träumerei: skewness for the eight parts
A1, A2, A1′, A2′, B1, B2, A1″, A2″ for 28 performances, plotted against the number of
the part.
where M is the median and Q1 , Q2 are the lower and upper quartile respec-
tively. Figure 8.1 shows ηj (i) plotted against i. An apparent pattern is the
generally strong negative skewness in B2 . (Recall that negative skewness
can be created by extreme ritardandi.) Apart from that, however, Figure
8.1 is difficult to interpret directly. Principal component analysis helps to
find more interesting features. Figure 8.3 shows the loadings for the first
four principal components which explain more than 80% of the variability
(see Figure 8.2). The loadings can be interpreted as follows: the first com-
ponent corresponds to a weighted average emphasizing the skewness values
in the first half of the piece. The 28 performances apparently differ most
with respect to η_j(i) during the first 16 bars of the piece (parts A1, A2,
A1′, A2′). The second most important distinction between pianists is charac-
terized by the second component. This component compares skewness for
the A-parts with the values in B1 and B2 . The third component essentially
Figure 8.2 Variances of the principal components Comp. 1 to Comp. 8 (scree
plot).
compares the first with the second half. Finally, the fourth component es-
sentially compares the odd with the even numbered parts, excluding the
end A1″, A2″. Components two to five are displayed in Figure 8.4, with z_2
and z_3 on the x- and y-axis respectively and rectangles representing z_4 and
z_5. Note in particular that Cortot and Horowitz mainly differ with respect
to the third principal component. Horowitz has a more extreme difference
in skewness between the first and second halves of the piece. Also striking
are the “outliers” Brendel, Ortiz, and Gianoli. The overall skewness, as
represented by the first component, is quite extreme for Brendel and Ortiz.
For comparison, their tempo curves are plotted in Figure 8.5 together with
Cortot’s and Horowitz’s first performances. In view of the PCA one may
now indeed see that in the tempo curves of Brendel and Ortiz there is a
strong contrast between small tempo variations applied most of the time
and occasional strong local ritardandi.
Figure 8.3 Loadings of the first four principal components.
Figure 8.4 PCA of skewness – symbol plot of principal components 2–5.
Figure 8.5 Tempo curves (Brendel, Gianoli, Horowitz1).
Figure 8.9 Entropies – symbol plot of the first four principal components.
Discriminant analysis
Thus, correct classification for individuals from group Gi occurs with prob-
ability pii and misclassification with probability 1 − pii . A rule r with
correct-classification-probabilities pii is said to be at least as good as a rule
r̃ with probabilities p̃ii , if pii ≥ p̃ii for all i. If there is at least one “ > ”
sign, then r is better. If there is no better rule than r, then r is called
admissible. Consider now a Bayes rule r with probabilities pij . Is there any
better rule than r? Suppose that r̃ is better. Then
Σ_i π_i p_ii < Σ_i π_i p̃_ii.
On the other hand,
Σ_i π_i p̃_ii = Σ_i ∫_{ψ̃_i} π_i f_i(x) dx
≤ Σ_i ∫_{ψ̃_i} max_j {π_j f_j(x)} dx = ∫ max_j {π_j f_j(x)} dx.
Since r is a Bayes rule, we have
max_j {π_j f_j(x)} = Σ_i 1{x ∈ ψ_i} π_i f_i(x)
The rule becomes particularly simple if fi are normal with unknown means
µi and equal covariance matrices Σ1 = Σ2 = ... = Σ. Let x̄i be the sample
mean and Σ̂i the sample covariance matrix for observations from group Gi .
Estimating the common covariance matrix Σ by
Σ̂ = (n_1 Σ̂_1 + n_2 Σ̂_2 + ... + n_p Σ̂_p)/(n − p)
where n_i is the number of observations from G_i and n = n_1 + ... + n_p, the
ML-rule allocates x to G_i, if
(x − µ_i)^t Σ^{-1} (x − µ_i) = min_{j=1,...,p} (x − µ_j)^t Σ^{-1} (x − µ_j) (9.12)
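A sketch of this allocation rule with the pooled covariance estimate Σ̂ as above; the two toy groups are assumptions for illustration, not data from the text:

```python
import numpy as np

def pooled_covariance(groups):
    """Sigma_hat = (n1*S1 + ... + np*Sp) / (n - p), where n_i * S_i is
    the within-group scatter matrix of group i."""
    p = len(groups)
    n = sum(len(g) for g in groups)
    S = sum(len(g) * np.cov(g, rowvar=False, bias=True) for g in groups)
    return S / (n - p)

def lda_allocate(x, means, Sigma_hat):
    """ML-rule of (9.12): allocate x to the group whose mean has the
    smallest Mahalanobis distance under the common covariance."""
    Sinv = np.linalg.inv(Sigma_hat)
    d = [float((x - m) @ Sinv @ (x - m)) for m in means]
    return int(np.argmin(d))

# Two hypothetical well-separated groups in the plane:
g1 = np.array([[0.0, 0.0], [0.2, -0.1], [-0.1, 0.1]])
g2 = np.array([[3.0, 3.0], [3.2, 2.9], [2.9, 3.1]])
Sigma = pooled_covariance([g1, g2])
group = lda_allocate(np.array([0.1, 0.2]), [g1.mean(0), g2.mean(0)], Sigma)
```

A point near the first group's mean is allocated to group 0, and one near the second mean to group 1.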
data. In principle this is easy, since the corresponding estimates can simply
be plugged into the formula for pii . The observed data that are used for
estimation are also called “training sample”. A problem with these esti-
mates is, however, that the search for the optimal discriminant rule was
done with the same data. Therefore, p̂11 will tend to be too optimistic (i.e.
too large), unless n is very large. The same is true for any method that
estimates classification probabilities from the training data. One way to
avoid this is to partition the data set randomly into a “training” sample
that is used for estimation of the discriminant rule, and a disjoint “val-
idation” sample that is used for estimation of classification probabilities.
Obviously, this can only be done for large enough data sets. For recently
developed computational methods of validation, such as bootstrap, see e.g.
Efron (1979), Läuter (1985), Fukunaga (1990), Hirst (1996), LeBlanc and
Tibshirani (1996), Davison and Hinkley (1997), Chernick (1999), Good
(2001).
Figure 9.2 Linear discriminant analysis of compositions before and after 1800,
with the training sample. The data used for the discriminant rule consist of
x = (p_5, E). (Axes: entropy versus P(Subdominant).)
a straight line. Figure 9.2 shows the estimated partitioning line together
with the training sample (o = before 1800, x = after 1800). Apparently, the
two groups can indeed be separated quite well by the estimated straight
line. This is quite surprising, given the simplicity of the two variables. As
expected, however, the partition is not perfect, and it does not seem to be
possible to improve it by more complicated partitioning lines. To assess how
well the rule may indeed classify, we consider 50 other compositions that
were not used for estimating the discriminant rule. Figure 9.3 shows that
the rule works well, since almost all observations in the validation sample
are classified correctly. An unusual composition is Bartók’s Bagatelle No.
3 which lies far on the left in the “wrong” group.
The partitioning can be improved if the time periods of the two groups
are chosen farther apart. This is done in figures 9.3a and b with Group
1 = “Early Music to Baroque” and 2 = “Romantic to 20th century”. (A
beautiful example of early music is displayed in Figure 9.6; also see Fig-
ures 9.7 and 9.8 for portraits of Brahms and Wagner.) Figure 9.4 shows
the corresponding plot of the partition together with the data (n = 72).
Compositions not used in the estimation are shown in Figure 9.5. Again,
the rule works well, except for Bartók’s third Bagatelle.
Figure 9.3 Linear discriminant analysis of compositions before and after 1800,
with the validation sample. The data used for the discriminant rule consist of
x = (p_5, E). (Axes: entropy versus P(Subdominant); Bartók’s Bagatelle No. 3 is
marked.)
Figure 9.4 Linear discriminant analysis of “Early Music to Baroque” and “Ro-
mantic to 20th Century”. The points (“o” and “×”) belong to the training sample.
The data used for the discriminant rule consist of x = (p_5, E).
Figure 9.5 Linear discriminant analysis of “Early Music to Baroque” and “Ro-
mantic to 20th century”. The points (“o” and “×”) belong to the validation sam-
ple. The data used for the discriminant rule consist of x = (p_5, E).
Figure 9.6 Graduale written for an Augustinian monastery of the diocese Kon-
stanz, 13th century. (Courtesy of Zentralbibliothek Zürich.) (Color figures follow
page 152.)
Cluster analysis
where
h(η) = Π_{j=1}^p |Σ̂_j(η)|^{n_j(η)} (10.5)
Computationally this means that the function h(η) is evaluated for all
groupings η of the observations x1 , ..., xn , and the estimate is the grouping
and the corresponding (n − i) × (n − i) distance matrix D^(i) with elements
d_jl^(i) (j, l = 1, ..., n − i).
5. If
max_{j,l=1,...,n−i} d_jl^(i) ≤ d_o (10.8)
then stop. Otherwise, set i = i + 1 and go to step 3.
Note in particular that for the final clusters, the maximal distance within
each cluster is at most do . As a result, the final clusters tend to be very
“compact”. A related method is the so-called nearest neighbor single link-
age algorithm. It is identical with the above except that distance between
clusters is defined as the minimal distance between points in the two clus-
ters. This can lead to so-called “chaining” in the form of elongated clusters.
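The agglomerative scheme described above can be sketched as follows; this is a plain O(n³) reference implementation with d_o as merge threshold, not the book's code, and the one-dimensional toy points are assumptions:

```python
import numpy as np

def complete_linkage(points, d0):
    """Repeatedly merge the two clusters with the smallest complete-
    linkage distance (maximum pairwise distance between their points),
    as long as that distance does not exceed d0."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        return max(np.linalg.norm(points[i] - points[j])
                   for i in a for j in b)

    while len(clusters) > 1:
        pairs = [(dist(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        dmin, a, b = min(pairs)
        if dmin > d0:       # merging would exceed the threshold: stop
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Two tight groups on the line, far apart:
pts = np.array([[0.0], [0.1], [5.0], [5.2]])
out = complete_linkage(pts, d0=1.0)
```

For these points the algorithm returns the two clusters {0, 1} and {2, 3}; replacing `max` by `min` in `dist` gives the single linkage variant mentioned in the text.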
1-3; Piano Sonata (2nd Mv.); 17) O. Messiaen (1908-1992): Vingt regards
sur l’Enfant-Jésus No. 3; 18) T. Takemitsu (1930-1996): Rain tree sketch No. 1.
Figure 10.1 shows the result of complete linkage clustering of the vectors
(ζ_1, ..., ζ_11)^t, based on the Euclidean distance and d_o = 5. The most striking fea-
ture is the clear separation of “early music” from the rest. Moreover, the
20th century composers considered here are in a separate cluster, except
for Bartók’s Bagatelle No. 3 (and Debussy, who may be considered as be-
longing to the 19th and 20th centuries). In contrast, clusters provided by a
single linkage algorithm are less easy to interpret. Figure 10.2 illustrates a
typical result of this method, namely long narrow clusters where the maxi-
mal distance within a cluster can be quite large. In our example this does
Figure 10.5 Complete linkage clustering of entropies.
10.3.2 Entropies
Consider entropies as defined in Chapter 3. More specifically, we define
for each composition a vector y = (E_1, ..., E_10)^t. After standardization of
each coordinate, cluster analysis is applied to the following compositions by
J.S. Bach: Cello Suites No. I to VI (1st movement from each); Preludes and
Fugues No. 1 and 8 from “Das Wohltemperierte Klavier” (each separately).
The complete linkage algorithm leads to a clear separation of the Cello
Suites from “Das Wohltemperierte Klavier” displayed in Figure 10.5.
who are represented more than once in the sample, so that the consistency
of their performances can be checked empirically. Figure 10.6 also shows
that Cortot is somewhat of an “outlier”, since his cluster separates from
all other pianists at the top level.
Figure 10.7 Complete linkage clustering of HISMOOTH-fits to tempo curves.
Figure 10.8 Symbol plot of HISMOOTH bandwidths for tempo curves. The radius
of each circle is proportional to a constant plus log b_3; the horizontal and vertical
axes are equal to b_1 and b_2 respectively. The letters A–F indicate where at least
one observation from the corresponding cluster occurs.
Multidimensional scaling
Figure 11.3 Fragment of a graduale from the 14th century. (Courtesy of Zentral-
bibliothek Zürich.)
Figure 11.5 Freddy (by J.B.) and Johannes Brahms (1833-1897) going for a
drink. (Caricature from a contemporary newspaper; courtesy of Zentralbibliothek
Zürich.)
Figure 2.14: Smoothed tempo curves ĝ_2(t) = (nb_2)^{-1} Σ_i K((t − t_i)/b_2)[y_i − …