Proc. of the 11th Int. Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, September 1-4, 2008

A UNIFYING FRAMEWORK FOR ONSET DETECTION, TEMPO ESTIMATION AND PULSE CLARITY PREDICTION

Olivier Lartillot, Tuomas Eerola, Petri Toiviainen, Jose Fornari*

Finnish Center of Excellence in Interdisciplinary Music Research, University of Jyväskylä, Finland
<first.last>
ABSTRACT

An overview of studies dealing with onset detection and tempo extraction is reformulated under the new conceptual and computational framework defined by MIRtoolbox [1]. Each approach can be specified as a flowchart of general high-level signal-processing operators that can be tuned along diverse options. This framework encourages more advanced combinations of the different approaches and offers the possibility of comparing multiple approaches within a single optimized flowchart. In addition, a composite model explaining pulse clarity judgments is decomposed into a set of independent factors related to various musical dimensions. To evaluate the pulse clarity model, 25 participants rated the pulse clarity of one hundred excerpts from movie soundtracks. The mapping between the model predictions and the ratings was carried out via regressions.

1. INTRODUCTION

MIRtoolbox is a Matlab toolbox offering an extensive set of signal-processing operators and musical feature extractors [1]. The objective is to design a tool capable of analyzing a large range of musical dimensions from an extensive set of audio files. The first public version, released last year, contains the core of the framework, enabling a broad overview of the musical dimensions investigated in computational music analysis. The aim of our current research is mainly to improve the set of tools by integrating a large range of approaches proposed in the research community.

This paper focuses on the joint questions of onset extraction and tempo estimation. A synthetic overview of studies in this domain is reformulated using the operators defined in MIRtoolbox. Section 2 presents various methods to compute the onset detection curve, and section 3 deals with the description of that curve, in particular the detection of the onsets themselves. The estimation of tempo from the onset curve is dealt with in section 4.
Throughout this review, each approach is modeled in terms of a flowchart of general high-level signal-processing operators available in MIRtoolbox, with multiple options and parameters to be tuned accordingly. This framework encourages more advanced combinations of the different approaches and offers the possibility of comparing multiple approaches within a single optimized flowchart. In section 5, a composite model explaining pulse clarity judgments is decomposed into a set of independent factors related to various musical dimensions.
This work has been supported by the European Commission (NEST project “Tuning the Brain for Music", code 028570) and by the Academy of Finland (project number 119959).

To evaluate the pulse clarity model, 25 participants rated the pulse clarity of one hundred excerpts from movie soundtracks. The mapping between the model predictions and the ratings, discussed in section 6, was carried out via regressions.

2. COMPUTING THE ONSET DETECTION FUNCTION

2.1. Preprocessing

First of all, the audio signal is loaded from a file:

a = miraudio('myfile.wav') (1)

The audio signal can be segmented into characteristic and similar regions based on novelty [2] by calling the mirsegment operator [1]:

a = mirsegment(a) (2)

When the tempo can be assumed to remain stable within each segment [3], command (2) automatically ensures that the tempo will be computed for each segment separately.

2.2. Filterbank decomposition

The estimation of the onset positions generally requires a decomposition of the audio waveform into particular frequency regions. The simplest method consists in discarding the high-frequency components by filtering the signal with a band-pass filter [4]:

a = mirfilter(a,'Scale',50,20000) (3)
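The band-limiting step can be illustrated with a crude FFT brick-wall band-pass in NumPy. This is only a sketch of the idea behind the mirfilter call above, not its implementation; a real filter would use a proper IIR/FIR design.

```python
import numpy as np

def bandpass_fft(x, sr, lo=50.0, hi=20000.0):
    """Crude FFT brick-wall band-pass: zero out all spectral bins outside
    [lo, hi] Hz and transform back. Illustrative only."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    X[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(X, n=len(x))

# Mixture of a 10 Hz and a 100 Hz sinusoid; the pass band keeps only 100 Hz.
sr = 1000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 100 * t)
y = bandpass_fft(x, sr, lo=50.0, hi=200.0)
```

Since both test frequencies fall exactly on FFT bins here, the low component is removed to numerical precision.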

More subtle models require a multi-channel decomposition of the signal mimicking auditory processes. This can be done through filterbank decomposition [5, 6]:

b = mirfilterbank(a,'CriticalBand','Scale',44,18000) (4)

where more precise specifications can optionally be indicated. Alternatively, the decomposition can be performed via a time-frequency representation computed through a short-time Fourier transform [7, 3]:

s = mirspectrum(a,'Frame','FFT','WinLength',.023,'s','Hop',50,'%') (5)
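The frame decomposition of (5) can be sketched in plain NumPy: Hann-windowed frames of 23 ms with 50% hop, transformed to magnitude spectra. This is an illustrative re-implementation, not MIRtoolbox code; the function name is ours.

```python
import numpy as np

def stft_magnitude(x, sr, win_len=0.023, hop_ratio=0.5):
    """Magnitude spectrogram: Hann-windowed frames of `win_len` seconds
    advanced by `hop_ratio` of the frame length, as in command (5)."""
    n = int(round(win_len * sr))
    hop = max(1, int(round(n * hop_ratio)))
    window = np.hanning(n)
    frames = [x[i:i + n] * window for i in range(0, len(x) - n + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))  # (n_frames, n//2+1)

# One second of a 440 Hz tone; its energy should concentrate near 440 Hz.
sr = 11025
t = np.arange(sr) / sr
spec = stft_magnitude(np.sin(2 * np.pi * 440 * t), sr)
freqs = np.fft.rfftfreq(int(round(0.023 * sr)), 1 / sr)
```

The frequency resolution at this window length is about 43 Hz, so the peak bin lands within one bin of 440 Hz.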


further recombined, for instance, into critical bands¹:

b = mirspectrum(s,'CriticalBand','Scale',50,20000) (6)
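The band recombination of (6) amounts to summing spectrum bins falling into each band. The sketch below groups bins into log-spaced bands as a crude stand-in; true critical bands follow the Bark scale, and the function name is ours, not MIRtoolbox's.

```python
import numpy as np

def band_energies(mag_frame, sr, n_bands=10, fmin=50.0, fmax=20000.0):
    """Recombine the bins of one magnitude-spectrum frame into a few
    log-spaced bands between fmin and fmax (simplified critical-band
    grouping; illustrative only)."""
    n_fft = 2 * (len(mag_frame) - 1)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    edges = np.geomspace(fmin, min(fmax, sr / 2), n_bands + 1)
    return np.array([mag_frame[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

# A flat spectrum frame: band energies then just count the bins per band.
bands = band_energies(np.ones(513), 44100)
```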

The filterbank is sometimes split into subbands sent to distinct analyses [8], for instance:

[l m h] = mirfilterbank(b,'Split',[1200 11000]) (7)

The high-frequency component h, related to the frequency range above 11000 Hz, is discarded; the middle register m, for frequencies above 1200 Hz, is applied to energy-based analyses (paragraph 2.3):

om = mironsets(m,'Energy') (8)

whereas the low component l is applied to frequency-based analyses (paragraph 2.4):

ol = mironsets(l,'Freq') (9)

2.3. Energy-based strategy

These strategies focus on the variation of amplitude or energy of the signal.

2.3.1. Envelope extraction

The description of this temporal evolution of energy results from an envelope extraction, basically through rectification (or squaring) of the signal, low-pass filtering, and finally downsampling, using the following command:

od = mirenvelope(x) (10)

where x can be either the undecomposed audio signal a, the filterbank-decomposed b², or the middle-frequency band m. Further refinement improves the peak picking: first the logarithm of the signal is computed [5]³:

od = mirenvelope(od,'Log') (11)

and the result is differentiated and half-wave rectified⁴:

od = mirenvelope(od,'Diff','HWR') (12)

Some approaches advocate the use of a smoothed differentiator FIR instead, based on exponential weighting⁵ [8] (available as a parameter 'Smooth' of the 'Diff' option).

2.3.2. Energy

Another strategy consists in computing the (root-)mean-square energy of each successive frame of the signal [9, 8, 10]:

od = mirrms(x,'Frame') (13)

2.3.3. Recombining bands

The channels are sometimes recombined immediately after the envelope extraction [11]:

od = sum(od,'Band') (14)

In order to allow the detection of periodicities scattered over several bands, while still optimizing the analysis accuracy, the recombination can be partial, by summing adjacent bands in groups [7], resulting for instance in four wider-frequency channels:

od = sum(od,'Bands',4) (15)

2.4. Frequency-based strategy

Frequency-based methods also start from the time-frequency representation, as in (5), but this time analyze each successive frame separately. High-frequency content [12] can be highlighted via linear weighting:

s = mirspectrum(s,'HFC') (16)

The contrast between successive frames is observed through differentiation, leading to a spectral flux:

od = mirflux(s) (17)

where diverse distances can be specified using the 'Dist' parameter, such as the L1-norm [12] or the L2-norm [8]. Components contributing to a decrease of energy can be ignored [8] using the 'Inc' option. Instead of simple differentiation, an FIR differentiator filter [13] can be specified. Each distance between successive frames can be normalized by the total energy of the first frame ('Norm' option) in order to ensure better adaptiveness to volume variability [8]. Besides, the computation can be performed in the complex domain ('Complex' option) in order to include the phase information [14].

The novelty curve designed for musical segmentation, as mentioned in section 2.1, can actually be considered as a more refined way of evaluating the distance between frames [15]. We notice in particular that the use of novelty on multi-pitch extraction results [16] leads to particularly good results when estimating onsets from violin soli (see Figures 1-4):

f = mirpitch(a,'Frame') (18)
od = mirnovelty(f) (19)

2.5. Post-processing

If necessary, the onset detection function can be smoothed through low-pass filtering [15]:

od = smooth(od) (20)

¹ This second call of the command mirspectrum does not mean that a second FFT is computed; it just indicates a further operation on a mirspectrum object already computed.
² It should be mentioned that if x is a mirspectrum object, command (10) should include the 'Band' keyword in order to specify that the envelope should be computed along bands, and not along the spectrum decomposition within each frame.
³ A µ-law compression [7] can be specified as well, using the 'Mu' option.
⁴ A weighted average of the original envelope and its differentiated version [7] can be obtained using the 'Weight' option.
⁵ The logarithmic transformation might exempt from this loss of information, though.
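The energy-based chain of section 2.3.1, rectification, low-pass filtering, downsampling, then half-wave rectified differentiation as in (10)-(12), can be sketched in plain NumPy. This is an illustrative stand-in, not the mirenvelope implementation; the smoothing window and downsampling factor are arbitrary choices.

```python
import numpy as np

def onset_curve(x, sr, smooth_win=100, down=16):
    """Envelope-based onset detection curve: full-wave rectification,
    moving-average low-pass filtering, downsampling, then half-wave
    rectified differentiation (simplified mirenvelope-style chain)."""
    env = np.abs(x)                                    # rectification
    kernel = np.ones(smooth_win) / smooth_win
    env = np.convolve(env, kernel, mode='same')        # low-pass smoothing
    env = env[::down]                                  # downsampling
    d = np.diff(env)                                   # differentiation
    return np.maximum(d, 0.0)                          # half-wave rectification

# A single 50 ms tone burst: the curve should peak at the burst's start.
sr = 8000
x = np.zeros(sr)
x[2000:2400] = np.sin(2 * np.pi * 200 * np.arange(400) / sr)
od = onset_curve(x, sr)
```

The maximum of od falls near sample 2000 / 16 = 125 of the downsampled curve, i.e. at the note attack.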

In order to adapt further computation (such as peak picking or periodicity estimation) to the local context, the onset detection curve can be detrended by removing the median [17, 13, 15]:

od = detrend(od,'Median') (21)
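The median detrending of (21) can be sketched as subtracting a running median and half-wave rectifying the residual. This is our reading of the 'Median' option, not MIRtoolbox code, and the window width is an arbitrary choice.

```python
import numpy as np

def detrend_median(od, width=9):
    """Subtract a running median of `width` samples from the onset curve
    (edge-padded), then half-wave rectify, so that only peaks rising
    above the local context survive."""
    pad = width // 2
    padded = np.pad(od, pad, mode='edge')
    med = np.array([np.median(padded[i:i + width]) for i in range(len(od))])
    return np.maximum(od - med, 0.0)

# A flat curve with one spike: the baseline is removed, the spike remains.
curve = np.array([1, 1, 1, 5, 1, 1, 1, 1, 1], float)
flat = detrend_median(curve)
```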



Figure 1: Multi-pitch extraction from a violin solo recording.

Figure 2: Frame-decomposed generalized and enhanced autocorrelation function [16] used for the multi-pitch extraction.

Figure 3: Similarity matrix computed from the frame-decomposed autocorrelation function.
Figure 4: Novelty curve estimated along the diagonal of the similarity matrix [2], and onset detection (circles) featuring one false positive (the second onset) and one false negative (around t = 12.5 s).



and the onset detection curve can be half-wave rectified as well:

od = hwr(od) (22)

3. NON-PERIODIC CHARACTERIZATIONS OF THE ONSET DETECTION CURVE

3.1. Articulation

The Low-Energy Rate, commonly defined as the percentage of frames within a file that have an RMS energy lower than the mean RMS energy across that file [18], can be generalized to any kind of onset detection curve:

art = mirlowenergy(od) (23)

An estimation of the articulation can be obtained by computing the Average Silence Ratio [19], which can be formalized as a low-energy rate where the threshold (here, a silence threshold) is set to a fraction of the mean RMS energy:

asr = mirlowenergy(od,'Threshold',.5) (24)

3.2. Onset detection

Onsets can then be associated with local maxima of the onset detection curve:

o = peaks(od) (25)

The onsets found in the different bands can be combined together:

o = sum(o,'Band') (26)

As one onset, when scattered over several bands, produces a series of onsets that are not necessarily exactly synchronous, an alignment is performed by selecting the major peak within a 50-ms neighborhood [5]⁶. When combining the hybrid subband scheme defined in equations (8-9):

o = om + ol (27)

onsets from the higher frequency band, offering better time resolution, are preferably chosen via a weighted sum [8].

3.3. Attack characterization

If the note onset temporal position is estimated using an energy-based strategy (section 2.3), some characteristics related to the attack phase can be assessed as well. If the note onset positions are found at local maxima of the energy curve (amplitude envelope or RMS in particular), they can be considered as ending positions of the related attack phases. A complete determination of the attack therefore requires an estimation of its starting position, through an extraction of the preceding local minimum using an appropriately smoothed version of the energy curve. Figure 5 shows the output of the command:

at = mironsets('ragtime.wav','Attacks') (28)

The characteristics of the attack phase can then be its duration or its mean slope [20]. Figure 6 shows the output of the command:

as = mirattackslope(at) (29)

If, on the contrary, the note onset positions are found at local maxima of the temporal derivative of the energy curve [7, 8], then the attack slope can be directly identified with the values of those local maxima.

4. TEMPO ESTIMATION

4.1. Pulsation estimation

The periodicity of the onset curve can be assessed in various ways.

4.1.1. Fourier transform

An FFT can be computed in separate bands, leading to a so-called fluctuation pattern [21]⁷:

p = mirspectrum(od,'FFT','Band') (30)
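The FFT-based pulsation estimate of (30) can be sketched for a single band: take the magnitude spectrum of the mean-removed onset curve and read the frequency axis in BPM. An illustrative sketch, not MIRtoolbox code.

```python
import numpy as np

def periodicity_spectrum(od, fs, max_bpm=300):
    """Magnitude spectrum of the (mean-removed) onset curve, with the
    frequency axis converted to beats per minute and restricted to
    plausible tempi. Single-band sketch of the fluctuation-pattern idea."""
    od = np.asarray(od, float)
    mag = np.abs(np.fft.rfft(od - od.mean()))
    bpm = np.fft.rfftfreq(len(od), 1.0 / fs) * 60.0
    keep = bpm <= max_bpm
    return bpm[keep], mag[keep]

# A half-wave rectified 2 Hz pulsation sampled at 50 Hz: 120 BPM.
fs = 50
t = np.arange(0, 10, 1.0 / fs)
od = np.maximum(np.sin(2 * np.pi * 2 * t), 0.0)
bpm, mag = periodicity_spectrum(od, fs)
```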

Similarly, the spectral product removes the harmonics of the periodicities [13]:

p = mirspectrum(od,'FFT','Prod') (31)

4.1.2. Autocorrelation function

More often, periodicity is estimated via autocorrelation [22]⁸:

p = mirspectrum(od,'Autocor') (32)
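The autocorrelation route of (32), followed by the lag-to-BPM conversion discussed in section 4.2, can be sketched as follows. This is an illustrative NumPy implementation under the assumption of a single onset curve, not the MIRtoolbox one; the BPM range is an arbitrary prior.

```python
import numpy as np

def tempo_from_autocor(od, fs, bpm_range=(40, 200)):
    """Estimate tempo by locating the autocorrelation maximum of the
    mean-removed onset curve among plausible beat lags, then converting
    the winning lag to BPM."""
    od = np.asarray(od, float) - np.mean(od)
    ac = np.correlate(od, od, mode='full')[len(od) - 1:]
    lmin = int(fs * 60 / bpm_range[1])    # shortest allowed beat period
    lmax = int(fs * 60 / bpm_range[0])    # longest allowed beat period
    lag = lmin + int(np.argmax(ac[lmin:lmax + 1]))
    return 60.0 * fs / lag

# A 2 Hz pulsation sampled at 100 Hz should come out near 120 BPM.
fs = 100
t = np.arange(0, 10, 1 / fs)
od = np.maximum(np.sin(2 * np.pi * 2 * t), 0.0)
bpm_est = tempo_from_autocor(od, fs)
```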

Alternatively, the autocorrelation phase matrix shows the distribution of autocorrelation energy in phase space [10]: ph = mirautocorphase(od) (33)

Metrically-salient lags can then be emphasized by computing the Shannon entropy of the phase distribution of each lag: p = entropy(ph,’Lag’) (34)

An emphasis toward the best-perceived periodicities can be obtained by multiplying the autocorrelation function (or the spectrum) with a resonance curve [23, 10] ('Resonance' option).

4.1.3. Comb filters

Another strategy commonly used for periodicity estimation is based on a bank of comb filters [6, 7]:

ph = mirspectrum(od,'Comb') (35)
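The comb-filter idea of (35) can be sketched with one feedback comb resonator per candidate period: the filter whose delay matches the pulsation accumulates the most output energy. A minimal sketch under arbitrary constants, not the comb-filter design of [6, 7].

```python
import numpy as np

def comb_energies(od, periods, alpha=0.9):
    """Feed the onset curve through one feedback comb filter per candidate
    period p (y[n] = x[n] + alpha * y[n-p]) and return the output energies.
    The best-matching period resonates most strongly."""
    energies = []
    for p in periods:
        y = np.zeros(len(od))
        for n in range(len(od)):
            y[n] = od[n] + (alpha * y[n - p] if n >= p else 0.0)
        energies.append(np.sum(y ** 2))
    return np.array(energies)

# An impulse train of period 40 samples: the period-40 comb wins.
impulses = np.zeros(200)
impulses[::40] = 1.0
e = comb_energies(impulses, [30, 40, 50])
```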

⁶ More subtle combination processes have been proposed [5], based on detailed auditory modeling, but they are not integrated in the toolbox yet.

⁷ Following the discussion initiated in footnote 2, the 'Band' option is explicitly mentioned here, as the fluctuation pattern is usually computed from a time-frequency representation. The 'Band' keyword will not be mentioned in the following commands, for clarity's sake.
⁸ In MIRtoolbox 1.0, mirspectrum was strictly related to the FFT, whereas mirautocor was related to the autocorrelation function. In the new version, mirspectrum should be understood as a general representation of the distribution of energy along frequencies, implemented by various methods.


Figure 5: Onset detection with determination of the attack phases.
Figure 6: Slope of the attack phases extracted in the previous example.

4.1.4. Event-wise

Periodicities can also be estimated from actual onset positions, either detected from the onset detection curve (o, computed in section 3.2), or from onset dates read from MIDI files:

o = mironsets('myfile.mid') (36)

The periodicities can be displayed as a histogram showing the distribution of all possible inter-onset intervals [24]:

h = mirhisto(mirioi(o,'All')) (37)

which can be represented in the frequency domain:

p = mirspectrum(h) (38)

Alternatively, the MIDI file can be transformed into an onset detection curve by summing Gaussian kernels located at the onset points of each note [23]. The onset detection curve can then be fed to the same analyses as for audio files, as presented at the beginning of this section.

4.2. Peak picking

The previous paragraphs gave an overview of diverse methods for the estimation of rhythmic periodicity: FFT, autocorrelation function, comb filter outputs, histogram. Following the unifying view encouraged in the MIRtoolbox framework, all these diverse representations can be considered as one single periodicity spectrum p, which can be further analyzed as follows. The periodicity estimations in separate bands can be summed before the peak picking:

p = sum(p,'Band') (39)

The main pulse can be estimated by extracting the global maximum of the periodicity spectrum:

mp = peak(p) (40)

The summation of the peaks across bands can also be performed after the peak picking [25], with a clustering of close peaks and a summation of the clusters:

mp = sum(mp,'Band','Tolerance',.02,'s') (41)

More refined tempo estimation methods are available as well. For instance, three peaks can be collected for each periodicity spectrum, and if a multiplicity is found between their lags, the fundamental is selected [13]. Similarly, harmonics of a series of candidate lag values can be searched for in the autocorrelation function [10]. Finally, the peaks in the autocorrelation function can be converted into BPM using the mirtempo operator:

t = mirtempo(mp) (42)

5. MODELING PULSE CLARITY

The computations developed in the previous sections help to describe the metrical content of a musical work in terms of tempo. But further analyses may produce additional important information related to rhythm. In particular, one important way of describing musical genres and particular works relates to the amount of pulsation, more precisely to the clarity of its expression. An understanding of pulse clarity may yield new ways to improve automated genre classification in particular.

5.1. Previous work

At least one previous work has studied this dimension [26], termed beat strength. The proposed solution is based on the computation



of the autocorrelation function of the onset detection curve decomposed into frames:

p = mirspectrum(o,'Autocor','Frame') (43)

The three best periodicities are extracted [11]:

t = peaks(p,'Total',3) (44)

These periodicities, or more precisely their related autocorrelation coefficients, are collected into a histogram:

h = mirhisto(t) (45)

From the histogram, two estimates of beat strength are proposed: the SUM measure sums all the bins of the histogram:

SUM = sum(h) (46)

whereas the PEAK measure divides the maximum value of the histogram by its mean value:

PEAK = peak(h)/mean(h) (47)

This approach is therefore aimed at understanding the global metrical aspect of an extensive musical piece. Our study, on the contrary, focuses on an understanding of the short-term characteristics of the rhythmical pulse. Indeed, even musical excerpts less than a few seconds long can easily convey to listeners a strong sense of rhythmicity. The analysis of each successive local context can then be extended to the global scope through usual statistical techniques.

5.2. Statistical description of the autocorrelation curve

For that purpose, the analysis focuses on the autocorrelation function p itself, as defined in equation (43), and tries to extract from it any information related to the dominance of the pulsation. The most evident descriptor is the amplitude of the main peak, hence the global maximum of the curve:

MAX = max(p) (48)

It seems that the global minimum is usually (inversely) related to the importance of the main pulsation:

MIN = min(p) (49)

The kurtosis of the main peak describes its distinctness:

KURT = kurtosis(p) (50)

The entropy of the autocorrelation function indicates the quantity of pulsation information conveyed:

ENTR = entropy(p) (51)

Another hypothesis is that the faster a tempo is, the more clearly it is perceived by the listeners (due to the increased density of events):

TEMP = mirtempo(p) (52)

5.3. Harmonic relations between pulsations

The clarity of a pulse seems to decrease if pulsations with no harmonic relations coexist. We propose to formalize this idea as follows. First, a certain number of peaks is selected from the autocorrelation curve p⁹:

pp = peaks(p) (53)

Let the list of peak lags be P = {l_i}_{i∈[0,N]}, and let the first peak l_0 be the one considered as the main pulsation, as determined in paragraph 4.2. The list of peak amplitudes is {p(l_i)}_{i∈[0,N]}. A peak is considered inharmonic if the remainder of the Euclidean division of its lag by the lag of the main peak (and of the inverted division as well) is significantly high. This defines the set of inharmonic peaks H:

H = { i ∈ [0, N] | l_i mod l_0 ∈ [α l_0, (1−α) l_0] and l_0 mod l_i ∈ [α l_i, (1−α) l_i] } (54)

where α is a constant, tuned to .15 in our implementation. The degree of harmonicity is hence decreased by the cumulation of the autocorrelation coefficients of the non-harmonic peaks:

HARM = exp( − Σ_{i∈H} p(l_i) / (β p(l_0)) ) (55)

where β is another constant, set to 4.

5.4. Non-periodic accounts of pulse clarity

Other descriptors have been added that do not relate directly to the periodicity of the pulses, but indicate factors of energy variability that could contribute to the perception of a clear pulsation. Some factors defined in section 3 have been included:

• the articulation ARTI, based on the Average Silence Ratio (24),
• the attack slope ATAK (3.3).

Finally, a variability factor VAR sums the amplitude differences between successive local extrema of the onset detection curve. The whole flowchart of operators required for the estimation of the pulse clarity factors is shown in Figure 7.

6. MAPPING MODEL PREDICTIONS TO LISTENERS' RATINGS

In order to assess the validity of the models predicting pulse clarity judgments presented in the previous section, an experimental protocol has been designed. In a listening test, 25 participants rated the pulse clarity of one hundred excerpts from movie soundtracks. In parallel, the same musical database has been fed to the diverse pulse clarity models presented in the previous section. The mapping between the model predictions and the listeners' ratings was finally carried out via regressions.

⁹ If no value is given for the 'Total' parameter, by default all the local maxima offering sufficient contrast with their neighboring local minima are selected.
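The descriptors of sections 5.2 and 5.3 can be sketched together in NumPy. This is an illustrative transcription of equations (48)-(51) and (54)-(55), not MIRtoolbox code; as a simplification, the kurtosis is computed here on the whole curve rather than on the main peak only, and the entropy treats the clipped, normalized curve as a probability distribution.

```python
import numpy as np

def pulse_clarity_factors(lags, amps, p, alpha=0.15, beta=4.0):
    """Pulse clarity descriptors from an autocorrelation curve p and its
    peak lags/amplitudes (main peak first). MAX/MIN: eqs (48)-(49);
    KURT: excess kurtosis of the whole curve (simplified eq (50));
    ENTR: Shannon entropy in bits (eq (51)); HARM: eqs (54)-(55)."""
    p = np.asarray(p, float)
    q = np.clip(p, 1e-12, None)
    q = q / q.sum()                                   # treat as distribution
    z = (p - p.mean()) / p.std()
    l0, p0 = lags[0], amps[0]
    inharm = 0.0                                      # eq (54): sum over H
    for li, pi in zip(lags[1:], amps[1:]):
        r1, r2 = li % l0, l0 % li
        if (alpha * l0 <= r1 <= (1 - alpha) * l0
                and alpha * li <= r2 <= (1 - alpha) * li):
            inharm += pi
    return {
        'MAX': p.max(),
        'MIN': p.min(),
        'KURT': np.mean(z ** 4) - 3.0,
        'ENTR': -np.sum(q * np.log2(q)),
        'HARM': float(np.exp(-inharm / (beta * p0))),
    }

# Main peak at lag 100; lag 200 is harmonic, lag 137 is not.
curve = np.array([0.1, 0.2, 1.0, 0.2, 0.1])
f = pulse_clarity_factors([100, 200, 137], [1.0, 0.8, 0.5], curve)
```

Only the lag-137 peak enters H, so HARM = exp(-0.5 / 4).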



VAR = mirpulseclarity(o,'Variability')
AS = mirattackslope(od)
o = mironsets(od,'Detect','Yes')
od = mironsets(a,'Detect','No')
p = mirspectrum(o,'Autocor')
ART = mirlowenergy(od,'Threshold',.5)
MAX = max(p)
MIN = min(p)
ENTR = entropy(p)
pp = mirpeaks(p)
HARM = mirpulseclarity(pp,'Harmony')
mp = mirpeak(p)
TEMP = mirtempo(mp)
KURT = kurtosis(mp)

Figure 7: Flowchart of operators of the compound pulse clarity model.

6.1. Model optimizations

One problem raised by the computational framework presented in this paper relates to the high number of degrees of freedom that have to be specified when choosing the proper onset detection curve and periodicity evaluation method. In the public version of the toolbox, default strategies and parameter values will be specified. The choice of these default settings will result from an evaluation of the performance offered by the various approaches. Due to the combinatorial complexity of the possible configurations, we are designing optimization tools that systematically sample the set of possible solutions and produce a large number of flowcharts, which are progressively run on musical databases and compared with ground-truth data. The pulse clarity experiment described in this section is a first attempt towards this goal.

6.2. Pre-processing of the statistical variables

As a prerequisite to the statistical mapping, listeners' ratings and model predictions need to be normalized. The mapping routine mirmap includes an optimization algorithm that automatically finds optimal Box-Cox transformations [27] of the data, ensuring that their distributions become sufficiently Gaussian.

6.3. Results

The major factors correlating with the ratings are indicated in Table 1. The best predictor is the global maximum of the autocorrelation function, with a correlation of .46 with the ratings, followed by the kurtosis of the main peak and by the global minimum. The pulse harmonicity factor shows a correlation of .43 with the ratings, but is also correlated up to .48 with the other aforementioned factors. The envelope variability factor shows a correlation of .41. Multiple regressions are being attempted in current work.
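The Box-Cox normalization step of section 6.2 can be sketched as a grid search over the transform's profile log-likelihood. This is a generic Box-Cox fit [27], not mirmap's actual optimization routine, which may differ.

```python
import numpy as np

def boxcox(x, lam):
    """Box-Cox transform of strictly positive data."""
    return np.log(x) if lam == 0 else (x ** lam - 1) / lam

def best_boxcox_lambda(x, lambdas=np.arange(-20, 21) / 10.0):
    """Pick the lambda maximizing the Box-Cox profile log-likelihood:
    a Jacobian term plus a fitted-variance term."""
    x = np.asarray(x, float)
    best, best_ll = None, -np.inf
    for lam in lambdas:
        y = boxcox(x, lam)
        ll = (lam - 1) * np.sum(np.log(x)) - 0.5 * len(x) * np.log(y.var())
        if ll > best_ll:
            best, best_ll = lam, ll
    return best

# Lognormal data: the log transform (lambda near 0) should win.
rng = np.random.default_rng(0)
x = np.exp(rng.normal(size=500))
lam = best_boxcox_lambda(x)
```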

Table 1: Major factors correlating with pulse clarity ratings.

rank | factor | correlation with ratings | max. correlation with previous factors
1 | MAX | .46 | -
2 | KURT | .46 | .3
3 | MIN | .44 | .5
4 | HARM | .43 | .48
5 | VAR | .41 | .45

7. CONCLUSION

The new version of MIRtoolbox enabling the operations presented in this paper will be released during the summer, at the following address:

8. REFERENCES

[1] O. Lartillot and P. Toiviainen, "A Matlab toolbox for musical feature extraction from audio," in Proc. Digital Audio Effects (DAFx-07), Bordeaux, France, Sep. 10-15 2007, pp. 237-244.
[2] J. Foote and M. Cooper, "Media segmentation using self-similarity decomposition," in Proc. SPIE Storage and Retrieval for Multimedia Databases, 2003, number 5021, pp. 167-175.
[3] C. Uhle, "Tempo induction by investigating the metrical structure of music using a periodicity signal that relates to the tatum period," Available at, accessed March 26, 2008.
[4] M. Alghoniemy and A. H. Tewfik, "Rhythm and periodicity detection in polyphonic music," in Proc. IEEE Third Workshop Multimedia Sig. Proc., Copenhagen, Denmark, Sep. 13-15 1999, pp. 185-190.
[5] A. Klapuri, "Sound onset detection by applying psychoacoustic knowledge," in Proc. Intl. Conf. on Acoust. Speech Sig. Proc., Phoenix, Arizona, Mar. 15-19 1999, pp. 3089-3092.
[6] E. D. Scheirer, "Tempo and beat analysis of acoustic musical signals," J. Acoust. Soc. Am., vol. 103, no. 1, pp. 588-601, 1998.
[7] A. Klapuri, A. Eronen, and J. Astola, "Analysis of the meter of acoustic musical signals," IEEE Trans. Audio Speech Language Proc., vol. 14, no. 1, pp. 342-355, 2006.
[8] C. Duxbury, M. Sandler, and M. Davies, "A hybrid approach to musical note onset detection," in Proc. Digital Audio Effects (DAFx-02), Hamburg, Germany, Sep. 26-28 2002, pp. 33-38.
[9] A. Friberg, E. Schoonderwaldt, and P. N. Juslin, "CUEX: An algorithm for extracting expressive tone variables from audio recordings," Acustica / Acta Acustica, vol. 93, pp. 411-420, 2007.
[10] D. Eck and N. Casagrande, "Finding meter in music using an autocorrelation phase matrix and Shannon entropy," in Proc. Intl. Conf. on Music Information Retrieval, London, UK, Sep. 11-15 2005, pp. 504-509.
[11] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. Speech Audio Proc., vol. 10, no. 5, pp. 293-302, 2002.
[12] P. Masri, Computer Modeling of Sound for Transformation and Synthesis of Musical Signal, Ph.D. thesis, University of Bristol, 1996.
[13] M. Alonso, B. David, and G. Richard, "Tempo and beat estimation of musical signals," in Proc. Intl. Conf. on Music Information Retrieval, Barcelona, Spain, Oct. 10-14 2004, pp. 158-163.
[14] J. P. Bello, C. Duxbury, M. Davies, and M. Sandler, "On the use of phase and energy for musical onset detection in the complex domain," IEEE Sig. Proc. Letters, vol. 11, no. 6, pp. 553-556, 2004.
[15] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. Sandler, "A tutorial on onset detection in music signals," IEEE Trans. Speech Audio Proc., vol. 13, no. 5, pp. 1035-1047, 2005.
[16] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Trans. Speech Audio Proc., vol. 8, no. 6, pp. 708-716, 2000.
[17] M. Davies and M. Plumbley, "Comparing mid-level representations for audio based beat tracking," in Proc. Digital Music Res. Network Summer Conf., Glasgow, July 23-24 2005.
[18] J. J. Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. Digital Audio Effects (DAFx-03), London, UK, Sep. 8-11 2003, pp. 344-349.
[19] Y. Feng, Y. Zhuang, and Y. Pan, "Popular music retrieval by detecting mood," in Proc. Intl. ACM SIGIR Conf. on Res. Dev. Information Retrieval, Toronto, Canada, Jul. 28-Aug. 1 2003, pp. 375-376.
[20] G. Peeters, "A large set of audio features for sound description (similarity and classification) in the CUIDADO project (version 1.0)," Tech. Rep., Ircam, 2004.
[21] E. Pampalk, A. Rauber, and D. Merkl, "Content-based organization and visualization of music archives," in Proc. Intl. ACM Conf. on Multimedia, 2002, pp. 570-579.
[22] J. C. Brown, "Determination of the meter of musical scores by autocorrelation," J. Acoust. Soc. Am., vol. 94, no. 4, pp. 1953-1957, 1993.
[23] P. Toiviainen and J. S. Snyder, "Tapping to Bach: Resonance-based modeling of pulse," Music Perception, vol. 21, no. 1, pp. 43-80, 2003.
[24] F. Gouyon, S. Dixon, E. Pampalk, and G. Widmer, "Evaluating rhythmic descriptors for musical genre classification," in Proc. AES Intl. Conf., London, UK, June 17-19 2004, pp. 196-204.
[25] S. Dixon, E. Pampalk, and G. Widmer, "Classification of dance music by periodicity patterns," in Proc. Intl. Conf. on Music Information Retrieval, Baltimore, USA, Oct. 26-30 2003, pp. 504-509.
[26] G. Tzanetakis, G. Essl, and P. Cook, "Human perception and computer extraction of musical beat strength," in Proc. Digital Audio Effects (DAFx-02), Hamburg, Germany, Sep. 26-28 2002, pp. 257-261.
[27] G. E. P. Box and D. R. Cox, "An analysis of transformations," J. Roy. Stat. Soc. Series B, vol. 26, pp. 211-252, 1964.