
AN INTEGRATED FRAMEWORK FOR ONSET DETECTION, TEMPO ESTIMATION AND PULSE CLARITY PREDICTION

ABSTRACT

An integrated framework for onset detection and tempo extraction is proposed that unifies a large range of approaches from the literature. Each approach is modeled as a flowchart of general high-level signal-processing operators (such as filterbank decomposition, envelope extraction, spectral flux, post-processing and peak picking). These operators can be modified according to various options. The proposed framework encourages more advanced combinations of the different approaches and allows the comparison of multiple approaches under a single optimized flowchart. Various descriptions of the onset detection curve are compiled, such as articulation and attack characterization. As an application and extension of the framework, a composite model explaining pulse clarity judgments is decomposed into a set of independent factors related to various musical dimensions. To evaluate the pulse clarity model, 25 participants rated the pulse clarity of one hundred excerpts from movie soundtracks. The mapping between the model predictions and the ratings was carried out via regressions.

1 INTRODUCTION

This paper focuses on the joint questions of onset extraction and tempo estimation. A synthetic overview of studies in this domain is formulated and implemented in Matlab using MIRtoolbox [18]. Section 2 presents various techniques for computing the onset detection curve, and section 3 deals with the description of the curve, in particular the detection of the onsets themselves. The estimation of tempo from the onset curve is dealt with in section 4. Throughout this review, each approach is modeled as a flowchart of general high-level signal-processing operators with multiple options and parameters to be tuned accordingly. This framework encourages more advanced combinations of the different approaches and offers the possibility of comparing multiple approaches under a single optimized flowchart.

In section 5, a composite model explaining pulse clarity judgments is decomposed into a set of independent factors related to various musical dimensions. To evaluate the pulse clarity model, 25 participants rated the pulse clarity of one hundred excerpts from movie soundtracks. The mapping between the model predictions and the ratings, discussed in section 6, was carried out via regressions.

2 COMPUTING THE ONSET DETECTION FUNCTION

As a first step, the audio signal can be segmented into successive homogeneous parts based on novelty [13]. Tempo, when assumed to remain stable within each segment [27], is then computed in each segment separately.

2.1 Filterbank decomposition

The estimation of the onset positions generally requires a decomposition of the audio waveform along particular frequency regions.

• The simplest method discards high-frequency components by filtering the signal with a narrow bandpass filter [1].
• Subtler models require a multi-channel decomposition of the signal mimicking auditory processes. This can be done through filterbank decomposition [16, 22] or via a time-frequency representation computed by means of STFT [17, 27], which can be further recombined into critical bands.
• The filterbank is sometimes split into subbands sent to distinct analyses [10]. The high-frequency component, covering the frequency range above 11000 Hz, is discarded; the middle register, for frequencies above 1200 Hz, is sent to energy-based analyses (paragraph 2.2); and the low component is directed to frequency-based analyses (paragraph 2.3).

2.2 Energy-based strategies

These strategies focus on the variation of amplitude or energy of the signal.

• The temporal evolution of energy is described through an envelope extraction, basically by rectification (or squaring) of the signal, low-pass filtering, and finally downsampling. Further refinement improves the peak picking: first the logarithm of the signal is computed [16] (footnote 1), and the result is differentiated and half-wave rectified (footnote 2). Some approaches advocate the use of a smoothed FIR differentiator instead, based on exponential weighting (footnote 3) [10].
• Another strategy consists in computing the root-mean-square (RMS) energy of each successive frame of the signal [14, 10, 11].

The channels are sometimes recombined immediately after the envelope extraction [26]. In order to allow the detection of periodicities scattered along several bands, while still optimizing the analysis accuracy, the recombination can be partial, by summing adjacent bands in groups [17], resulting for instance in four wider-frequency channels.

1 A µ-law compression [17] can be specified as well.
2 A weighted average of the original envelope and its differentiated version [17] can be obtained.

Figure 2. Frame-decomposed generalized and enhanced autocorrelation function [24] used for the multi-pitch extraction in Fig. 1.

[Plot: similarity matrix; both axes give the temporal location of frame centers (in s.).]
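The envelope-based chain of section 2.2 (rectification, low-pass filtering, downsampling, logarithm, differentiation, half-wave rectification) can be illustrated with a minimal Python sketch. This is not the MIRtoolbox implementation: the function name onset_envelope, the one-pole low-pass filter, and the default values of cutoff_hz, hop and floor are assumptions chosen only for illustration.

```python
import math

def onset_envelope(signal, sample_rate, cutoff_hz=10.0, hop=64, floor=1e-6):
    """Illustrative envelope-based onset detection curve (hypothetical helper)."""
    # One-pole low-pass coefficient (assumed filter; any low-pass would do).
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    smoothed, state = [], 0.0
    for x in signal:
        # Full-wave rectification followed by low-pass filtering.
        state += alpha * (abs(x) - state)
        smoothed.append(state)
    # Downsampling: keep one envelope value every `hop` samples.
    envelope = smoothed[::hop]
    # Logarithm, with a floor so that silence does not produce log(0).
    log_env = [math.log(max(v, floor)) for v in envelope]
    # First-order difference, then half-wave rectification.
    return [max(b - a, 0.0) for a, b in zip(log_env, log_env[1:])]
```

Applied to one second of silence followed by a sustained tone, the resulting curve is zero during the silence and shows its largest value at the frame where the tone starts.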

2.3 Frequency-based strategies

Frequency-based methods start from the time-frequency representation (STFT in particular), this time analyzing each frame in succession. High-frequency content [19] can be highlighted via linear weighting. The contrast between successive frames is measured, leading to a so-called spectral flux, where various distances can be specified, such as L1-norm [19] or L2-norm [10] distances. Components contributing to a decrease of energy, which do not contribute to the onset attack phases, are often ignored [10]. Instead of computing a distance between frames, an FIR differentiator filter [2] can be used. Each distance between successive frames can be normalized by the total energy of the first of these frames in order to adapt better to amplitude variability [10]. In addition, the computation can be performed in the complex domain in order to include the phase information [3]. The novelty curve designed for musical segmentation, as mentioned in section 2, can actually be considered as a more refined way of evaluating the distance between frames [4]. We notice in particular that the use of novelty on multi-pitch extraction results [24] leads to particularly good results when estimating onsets from violin soli (see Figures 1-4).
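As a rough illustration of the spectral flux options listed above (choice of L1 or L2 distance, discarding of energy decreases, normalization by the energy of the first frame), the following Python sketch uses a naive DFT so that it stays self-contained. The function names, the frame and hop sizes, and the exact form of the normalization are assumptions, not the settings of any of the cited methods.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitude of a naive DFT (bins 0 .. n/2); fine for short frames."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def spectral_flux(signal, frame_len=64, hop=32, norm="L2",
                  increases_only=True, normalize=False):
    """Distance between magnitude spectra of successive frames (sketch)."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    spectra = [magnitude_spectrum(signal[s:s + frame_len]) for s in starts]
    flux = []
    for prev, cur in zip(spectra, spectra[1:]):
        diffs = [c - p for p, c in zip(prev, cur)]
        if increases_only:
            # Ignore components whose energy decreases between frames.
            diffs = [max(d, 0.0) for d in diffs]
        if norm == "L1":
            dist = sum(abs(d) for d in diffs)
        else:
            dist = math.sqrt(sum(d * d for d in diffs))
        if normalize:
            # Assumed normalization: divide by the energy of the first frame.
            energy = sum(p * p for p in prev)
            if energy > 0.0:
                dist /= energy
        flux.append(dist)
    return flux
```

With a signal made of silence followed by a stationary sinusoid, the flux is zero within the silence, peaks on the frames that straddle the start of the sinusoid, and returns to nearly zero once the spectrum is stable.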
[Plot: pitch; vertical axis: coefficient value (in Hz).]

Figure 3. Similarity matrix computed from the frame-decomposed autocorrelation function in Fig. 2.

2.4 Post-processing

If necessary, the onset detection function can be smoothed through low-pass filtering [4]. In order to adapt further computation (such as peak picking or periodicity estimation) to the local context, the onset detection curve can be detrended by removing the median [2, 4, 8].

3 NON-PERIODIC CHARACTERIZATIONS OF THE ONSET DETECTION CURVE

3.1 Articulation

The Low-Energy Rate, commonly defined as the percentage of frames within a file that have an RMS energy lower than the mean RMS energy across that file [7], can be generalized to any kind of onset detection curve. A simple estimation of articulation (related to the staccato vs. legato polarity) is given by the Average Silence Ratio [12], which can be formalized as a low-energy rate where the threshold (here, the silence threshold) is set to a fraction of the mean RMS energy.

3.2 Onset detection

Onsets can be related to local maxima of the onset detection curve. The onsets found in the different bands can be combined. As one onset, when scattered along several bands, produces a series of onsets that are not necessarily

Figure 1. Multi-pitch extraction from a violin solo recording. [Horizontal axis: temporal location of events (in s.).]

3 The logarithmic transformation might exempt from this lossy operation, though.

[Plot: novelty; coefficient value against temporal location of events (in s.).]
[Plot: attack slope; coefficient value against temporal location of events (in s.).]

Figure 6. Slope of the attack phases of the beginning of a performance of Schumann’s Kinderszene II.

4 TEMPO ESTIMATION

4.1 Pulsation estimation

The periodicity of the onset curve can be assessed in various ways.

• An FFT can be computed in separate bands, leading to a so-called fluctuation pattern [20]. Alternatively, harmonics of periodicities can be removed by means of a spectral product transformation [2].
• More often, periodicity is estimated via autocorrelation [6]. Alternatively, the autocorrelation phase matrix shows the distribution of autocorrelation energy in phase space [11]. Metrically salient lags can then be emphasized by computing the Shannon entropy of the phase distribution of each lag.
• An emphasis towards the best perceived periodicities can be obtained by multiplying the autocorrelation function (or the spectrum) with a resonance curve [23, 11].
• Another strategy commonly used for periodicity estimation is based on a bank of comb filters [22, 17].
• Periodicities can be estimated from actual onset positions as well, either detected from the onset detection curve (section 3.2) or read as onset dates from MIDI files. The periodicities are displayed with a histogram showing the distribution of all the possible inter-onset intervals [15].
• Alternatively, the MIDI file can be transformed into an onset detection curve by summing Gaussian kernels located at the onset points of each note [23]. The onset detection curve can then be fed to the same analyses as for audio files, as presented at the beginning of this section.

4.2 Peak picking

The main pulse can be estimated by extracting the global maximum from the spectrum. The periodicity estimations on separate bands can be summed before the peak picking, which is the case for most approaches, or after the peak picking [9], with a clustering of the close peaks and summation of the clusters.
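To make the autocorrelation route concrete, a minimal Python sketch is given below: it autocorrelates an onset detection curve, picks the global maximum within a range of plausible lags, and converts that lag to beats per minute. The function names, the tempo range of 40 to 200 BPM, and the absence of any resonance weighting are simplifying assumptions, not the settings of the cited approaches.

```python
def autocorrelation(curve, max_lag):
    """Unnormalized autocorrelation of the mean-removed curve, lags 0..max_lag."""
    mean = sum(curve) / len(curve)
    centered = [v - mean for v in curve]
    return [sum(a * b for a, b in zip(centered, centered[lag:]))
            for lag in range(max_lag + 1)]

def estimate_tempo(curve, frame_rate, min_bpm=40.0, max_bpm=200.0):
    """Tempo (BPM) from the strongest autocorrelation lag (sketch)."""
    # Convert the assumed tempo range into a range of lags, in frames.
    min_lag = max(1, int(round(frame_rate * 60.0 / max_bpm)))
    max_lag = min(int(round(frame_rate * 60.0 / min_bpm)), len(curve) - 1)
    if max_lag < min_lag:
        raise ValueError("curve too short for the requested tempo range")
    acf = autocorrelation(curve, max_lag)
    # Peak picking: global maximum within the admissible lags.
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: acf[lag])
    # A lag of best_lag frames corresponds to 60 * frame_rate / best_lag BPM.
    return 60.0 * frame_rate / best_lag
```

For an onset curve sampled at 100 frames per second with one impulse every 50 frames, the strongest admissible lag is 50 frames, that is a period of 0.5 s and a tempo of 120 BPM. Because the autocorrelation is not normalized by the overlap length, shorter lags are slightly favored over their multiples in this sketch.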

exactly synchronous, an alignment is performed by selecting the major peak within a 50-ms neighborhood [16] (footnote 4). When combining the hybrid subband scheme (paragraph 2.1), onsets from the higher frequency band, offering better time resolution, are preferably chosen via a weighted sum [10].

Figure 4. Novelty curve [13] estimated along the diagonal of the similarity matrix in Fig. 3, and onset detection (circles) featuring one false positive (the second onset) and one false negative (around time t = 12.5 s.).

3.3 Attack characterization

If the note onset temporal position is estimated using an energy-based strategy (section 2.2), some characteristics related to the attack phase can be assessed as well.

• If the note onset positions are found at local maxima of the energy curve (amplitude envelope or RMS in particular), they can be considered as ending positions of the related attack phases. A complete determination of each attack therefore requires an estimation of the starting position, by detecting the preceding local minimum in an appropriately smoothed version of the energy curve. Figure 5 shows the onset detection of the beginning of a performance of Schumann’s Kinderszene II, with determination of the attack phases. The attack phase can then be characterized by its duration or its mean slope [21]. Figure 6 shows the slope of the attack phases extracted from the same example.
• If, on the contrary, the note onset positions are found at local maxima of the temporal derivative of the energy curve [17, 10], then the attack slope can be directly identified with the values of the local maxima.

Figure 5. Onset detection of Schumann’s Kinderszene II with determination of the attack phases. [Plot: onset curve (envelope); amplitude against time (s).]

4 Subtler combination processes have been proposed [16], based on detailed auditory modeling, but are not integrated in the framework yet.
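The first attack characterization of section 3.3 (attack phases running from a local minimum of the energy curve to the following local maximum, described by duration and mean slope) can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the energy curve passed in has already been smoothed appropriately.

```python
def attack_phases(envelope):
    """(start, end) index pairs: latest local minimum, then next local maximum."""
    phases, start = [], 0
    for i in range(1, len(envelope) - 1):
        if envelope[i - 1] >= envelope[i] < envelope[i + 1]:
            # Local minimum: candidate starting position of the next attack.
            start = i
        elif envelope[i - 1] < envelope[i] >= envelope[i + 1]:
            # Local maximum: taken as the ending position of the attack.
            phases.append((start, i))
    return phases

def attack_descriptors(envelope, frame_rate):
    """Duration (in s) and mean slope of each attack phase (sketch)."""
    descriptors = []
    for start, end in attack_phases(envelope):
        duration = (end - start) / frame_rate
        slope = (envelope[end] - envelope[start]) / duration
        descriptors.append((duration, slope))
    return descriptors
```

For instance, with the envelope [0.0, 0.5, 1.0, 0.4, 0.2, 0.6, 0.3] sampled at 10 frames per second, two attacks are found: one spanning frames 0 to 2 and one spanning frames 4 to 5.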

More refined tempo estimations are available as well. For instance, three peaks can be collected for each periodicity spectrum, and if a multiplicity is found between their lags, the fundamental is selected [2]. Similarly, harmonics of a series of candidate lag values can be searched in the autocorrelation function [11]. Finally, the peaks in the autocorrelation can be converted into BPM.

5 MODELING PULSE CLARITY

The computations developed in the previous sections help to describe the metrical content of a musical work in terms of tempo. Further analyses can provide additional important information related to rhythm. In particular, one important way of describing musical genre is related to the amount of pulsation, or more precisely to the clarity of its expression. The understanding of pulse clarity may in particular yield new ways to improve automated genre classification.

5.1 Previous work

At least one previous work has studied this dimension [25], termed beat strength. The proposed solution is based on the computation of the autocorrelation function of the onset detection curve decomposed into frames, from which the three best periodicities are extracted [26]. These periodicities, or more precisely their related autocorrelation coefficients, are collected into a histogram. From the histogram, two estimations of beat strength are proposed:

• the SUM measure sums all the bins of the histogram, whereas
• the PEAK measure divides the maximum value by the mean amplitude.

This approach is therefore aimed at understanding the global metrical aspect of an extensive musical piece. Our study, on the contrary, focuses on the short-term characteristics of rhythmical pulse. Indeed, even a musical excerpt less than a few seconds long can easily convey to listeners a sense of rhythmicity of varying strength. The analysis of each successive local context can then be extended to the global scope through usual statistical techniques.
5.2 Statistical description of the autocorrelation curve

For that purpose, the analysis focuses on the autocorrelation function itself and tries to extract from it any information related to the dominance of the pulsation.

• The most evident descriptor is the amplitude of the main peak, i.e., the global maximum MAX of the curve.
• The global minimum MIN gives another aspect of the importance of the main pulsation.
• The kurtosis of the main peak KURT describes its distinctness.
• The entropy of the autocorrelation function ENTR indicates the quantity of pulsation information conveyed.
• Another hypothesis is that the faster a tempo TEMP is, the more clearly it is perceived by the listeners (due to the increased density of events).

5.3 Harmonic relations between pulsations

The clarity of a pulse seems to decrease if pulsations with no harmonic relations coexist. We propose to formalize this idea as follows. First a certain number of peaks (footnote 5) are selected from the autocorrelation curve $r$. Let the list of peak lags be $P = \{l_i\}_{i \in [0, N]}$, and let the first peak $l_0$ be the one considered as the main pulsation, as determined in paragraph 4.2. The list of peak amplitudes is $\{r(l_i)\}_{i \in [0, N]}$. A peak is considered inharmonic if the remainder of the Euclidean division of its lag by the lag of the main peak (and of the inverted division as well) is significantly high. This defines the set of inharmonic peaks $H$:

$$ H = \left\{ i \in [0, N] \;\middle|\; \begin{array}{l} l_i \in [\alpha l_0, (1 - \alpha) l_0] \pmod{l_0} \\ l_0 \in [\alpha l_i, (1 - \alpha) l_i] \pmod{l_i} \end{array} \right\} \qquad (1) $$

where $\alpha$ is a constant tuned to .15 in our implementation. The degree of harmonicity is hence decreased by the cumulation of the autocorrelation coefficients of the non-harmonic peaks:

$$ \mathrm{HARM} = \exp\left( -\frac{1}{\beta} \, \frac{\sum_{i \in H} r(l_i)}{r(l_0)} \right) \qquad (2) $$

where $\beta$ is another constant set to 4.

5.4 Non-periodic accounts of pulse clarity

Other descriptors have been added that do not relate directly to the periodicity of the pulses, but indicate factors of energy variability that could contribute to the perception of clear pulsation. Some factors defined in section 3 have been included:

• the articulation ARTI, based on the Average Silence Ratio (paragraph 3.1),
• the attack slope ATAK (paragraph 3.3).

Finally, a variability factor VAR sums the amplitude differences between successive local extrema of the onset detection curve. The whole flowchart of operators required for the estimation of the pulse clarity factors is indicated in Figure 7.
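Equations (1) and (2) can be illustrated with a short Python sketch. It reads the two stacked conditions of equation (1) as a conjunction, which is an interpretation of the definition in section 5.3 (under this reading a peak at half or twice the main lag is not counted as inharmonic). The function names are hypothetical, and the peak lags and amplitudes are assumed to be given with the main pulsation first.

```python
import math

def inharmonic_peaks(lags, alpha=0.15):
    """Indices of the peaks in the set H of equation (1); lags[0] is the main peak."""
    l0 = lags[0]

    def high_remainder(a, b):
        # True when the remainder of a / b lies in [alpha * b, (1 - alpha) * b].
        r = a % b
        return alpha * b <= r <= (1.0 - alpha) * b

    # Assumed reading: both the division and the inverted division must fail
    # to be close to an integer ratio for the peak to count as inharmonic.
    return [i for i, li in enumerate(lags)
            if high_remainder(li, l0) and high_remainder(l0, li)]

def pulse_harmonicity(lags, amplitudes, alpha=0.15, beta=4.0):
    """HARM of equation (2), from peak lags and autocorrelation amplitudes."""
    inharmonic_sum = sum(amplitudes[i] for i in inharmonic_peaks(lags, alpha))
    return math.exp(-inharmonic_sum / (beta * amplitudes[0]))
```

As an example, with peak lags [50, 100, 25, 75] and amplitudes [1.0, 0.8, 0.5, 0.4], only the peak at lag 75 is inharmonic with respect to the main lag 50, so that HARM = exp(-0.4 / 4) = exp(-0.1).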
5 If no value is specified, by default all the local maxima offering sufficient contrast with their related local minima are selected.

[Flowchart labels: onset detection curve; detected onsets; autocorrelation function; important peaks; main peak; envelope variability VAR; attack slope AS; low-energy rate ART; global maximum MAX; global minimum MIN; entropy ENTR; kurtosis KURT; pulse harmonicity HARM; tempo TEMP.]

Figure 7. Flowchart of operators of the compound pulse clarity model.

6 MAPPING MODEL PREDICTIONS TO LISTENERS’ RATINGS

In order to assess the validity of the models predicting pulse clarity judgments presented in the previous section, an experimental protocol has been designed. In a listening test, 25 participants rated the pulse clarity of one hundred excerpts from movie soundtracks. In parallel, the same musical database was fed to the various pulse clarity models presented in the previous section. The mapping between the model predictions and the listeners’ ratings was finally carried out via regressions.

6.1 Pre-processing of the statistical variables

As a prerequisite to the statistical mapping, the listeners’ ratings and the model predictions need to be normalized. The mapping routine includes an optimization algorithm that automatically finds optimal Box-Cox transformations [5] of the data, ensuring that their distributions become sufficiently Gaussian.

6.2 Results

The major factors correlating with the ratings are indicated in Table 1. The best predictor is the global maximum of the autocorrelation function, with a correlation of .46 with the ratings, followed by the kurtosis of the main peak and by the global minimum. The pulse harmonicity factor shows a correlation of .43 with the ratings, but is also correlated up to .48 with the other aforementioned factors. The envelope variability factor shows a correlation of .41. Multiple regressions are being attempted in ongoing work.

Table 1. Major factors correlating with pulse clarity ratings.

rank  factor  correlation between factor and ratings  maximal correlation with previous factors
1     MAX     .46                                     -
2     KURT    .46                                     .3
3     MIN     .44                                     .5
4     HARM    .43                                     .48
5     VAR     .41                                     .45

6.3 Model optimization

One problem raised by the computational framework presented in this paper is the high number of degrees of freedom that have to be specified when choosing the proper onset detection curve and periodicity evaluation method. The choice of default settings will result from an evaluation of the performance offered by the various approaches. Due to the combinatorial complexity of possible configurations, we are designing optimization tools that systematically sample the set of possible solutions and produce a large number of flowcharts, progressively loaded with musical databases and compared with the ground-truth data. The pulse clarity experiment described in this section is a first attempt towards this goal.

7 REFERENCES

[1] M. Alghoniemy and A. H. Tewfik. "Rhythm and periodicity detection in polyphonic music", Proc. IEEE Third Workshop Multimedia Sig. Proc., 185–190, 1999, Copenhagen, Denmark, Sep. 13-15.
[2] M. Alonso, B. David and G. Richard. "Tempo and beat estimation of musical signals", Proc. Intl. Conf. on Music Information Retrieval, 158–163, 2004, Barcelona, Spain, Oct. 10-14.
[3] J. P. Bello, C. Duxbury, M. Davies and M. Sandler. "On the use of phase and energy for musical onset detection in complex domain", IEEE Sig. Proc. Letters, 11, 6, 2004, June, 553–556.
[4] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies and M. Sandler. "A tutorial on onset detection in music signals", Tr. Speech Audio Proc., 13, 5, 2005, September, 1035–1047.
[5] G. E. P. Box and D. R. Cox. "An analysis of transformations", J. Roy. Stat. Soc., B, 26, 1964, 211–246.
[6] J. C. Brown. "Determination of the meter of musical scores by autocorrelation", J. Acoust. Soc. Am., 94, 4, 1993, 1953–1957.
[7] J. J. Burred and A. Lerch. "A hierarchical approach to automatic musical genre classification", Proc. Digital Audio Effects (DAFx-03), 344–349, 2003, London, UK, Sep. 8-11.
[8] M. Davies and M. Plumbley. "Comparing mid-level representations for audio based beat tracking", Proc. Digital Music Res. Network Summer Conf., 2005, Glasgow, July 23-24.
[9] S. Dixon, E. Pampalk and G. Widmer. "Classification of dance music by periodicity pattern", Proc. Intl. Conf. on Music Information Retrieval, 504–509, 2003, London, UK, Oct. 26-20.
[10] C. Duxbury, M. Sandler and M. Davies. "A hybrid approach to musical note onset detection", Proc. Digital Audio Effects (DAFx-02), 33–38, 2002, Hamburg, Germany, Sep. 26-28.
[11] D. Eck and N. Casagrande. "Finding Meter in Music Using an Autocorrelation Phase Matrix and Shannon Entropy", Proc. Intl. Conf. on Music Information Retrieval, 504–509, 2005, London, UK, Sep. 11-15.
[12] Y. Feng, Y. Zhuang and Y. Pan. "Popular music retrieval by detecting mood", Proc. Intl. ACM SIGIR Conf. on Res. Dev. Information Retrieval, 375–376, 2003, Toronto, Canada, Jul. 28-Aug. 1.
[13] J. Foote and M. Cooper. "Media Segmentation using Self-Similarity Decomposition", Proceedings of SPIE Storage and Retrieval for Multimedia Databases, 5021, 167–175, 2003.
[14] A. Friberg, E. Schoonderwaldt and P. N. Juslin. "CUEX: An algorithm for extracting expressive tone variables from audio recordings", Acustica / Acta Acustica, 93, 2007, 411–420.
[15] F. Gouyon, S. Dixon, E. Pampalk and G. Widmer. "Evaluating rhythmic descriptors for musical genre classification", Proc. AES Intl. Conf., 2004, 196–204, London, UK, June 17-19.
[16] A. Klapuri. "Sound onset detection by applying psychoacoustic knowledge", Proc. Intl. Conf. on Acoust. Speech Sig. Proc., 3089–3092, 1999, Phoenix, Arizona, Mar. 15-19.
[17] A. Klapuri, A. Eronen and J. Astola. "Analysis of the meter of acoustic musical signals", IEEE Trans. Audio Speech Language Proc., 14, 1, 2006, 342–355.
[18] O. Lartillot and P. Toiviainen. "MIR in Matlab (II): A toolbox for musical feature extraction from audio", Proc. Intl. Conf. on Music Information Retrieval, 127–130, 2007, Wien, Austria, Sep. 23-27.
[19] P. Masri. "Computer modeling of Sound for Transformation and Synthesis of Musical Signal", University of Bristol, 1996.
[20] E. Pampalk, A. Rauber and D. Merkl. "Content-based Organization and Visualization of Music Archives", Proc. Intl. ACM Conf. on Multimedia, 2002, 570–579.
[21] G. Peeters. "A large set of audio features for sound description (similarity and classification) in the CUIDADO project (version 1.0)", Ircam, 2004.
[22] E. D. Scheirer. "Tempo and beat analysis of acoustic musical signals", J. Acoust. Soc. Am., 103, 1, 1998, January, 588–601.
[23] P. Toiviainen and J. S. Snyder. "Tapping to Bach: Resonance-based modeling of pulse", Music Perception, 21, 1, 2003, 43–80.
[24] T. Tolonen and M. Karjalainen. "A Computationally Efficient Multipitch Analysis Model", IEEE Trans. Speech Audio Proc., 8, 6, 2000, 708–716.
[25] G. Tzanetakis, G. Essl and P. Cook. "Human perception and computer extraction of musical beat strength", Proc. Digital Audio Effects (DAFx-02), 257–61, 2002, Hamburg, Germany, Sep. 26-28.
[26] G. Tzanetakis and P. Cook. "Musical genre classification of audio signals", IEEE Trans. Speech Audio Proc., 10, 5, 2002, 293–302.
[27] C. Uhle. "Tempo induction by investigating the metrical structure of music using a periodicity signal that relates to the tatum period", http://www.music-ir.org/evaluation/mirex-results/articles/tempo/uhle.pdf.