
On-Line Melody Extraction From Polyphonic Audio Using Harmonic Cluster Tracking

Vipul Arora and Laxmidhar Behera, Senior Member, IEEE

Abstract—Extraction of the predominant melody from musical performances containing various instruments is one of the most challenging tasks in the fields of music information retrieval and computational musicology. This paper presents a novel framework which estimates the predominant vocal melody in real time by tracking the various sources with the help of harmonic clusters (combs) and then determining the predominant vocal source using the harmonic strength of the source. The novel on-line harmonic comb tracking approach complies with both structural and temporal constraints simultaneously. It relies upon the strong higher harmonics for robustness against distortion of the first harmonic due to low-frequency accompaniments, in contrast to existing methods which track the pitch values. The predominant vocal source identification depends upon the novel idea of source-dependent filtering of the recognition score, which allows the algorithm to be implemented on-line. The proposed method, although on-line, is shown to significantly outperform our implementation of a state-of-the-art offline method for vocal melody extraction. Evaluations also show the reduction in octave error and the effectiveness of the novel score filtering technique in enhancing the performance.

Index Terms—Music information retrieval, pitch tracking, spectral harmonics, vocal melody estimation.

Manuscript received December 27, 2011; revised April 12, 2012, July 09, 2012, September 21, 2012, and October 10, 2012; accepted October 15, 2012. Date of publication November 15, 2012; date of current version December 31, 2012. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Laurent Daudet.
V. Arora is with the Department of Electrical Engineering, Indian Institute of Technology, Kanpur 208016, India (e-mail: vipular@iitk.ac.in).
L. Behera is with the Department of Electrical Engineering, Indian Institute of Technology, Kanpur 208016, India, and also with the Intelligent Systems Research Centre, School of Computing and Intelligent Systems, University of Ulster, Northern Ireland BT48 7JL, U.K. (e-mail: lbehera@iitk.ac.in; l.behera@ulster.ac.uk).
Digital Object Identifier 10.1109/TASL.2012.2227731

I. INTRODUCTION

The human auditory system has a remarkable ability to focus effectively on the sound from a particular source and of a particular nature within a mixture of sounds. Mathematically, however, this problem is extremely hard due to various constraints, such as having too few degrees of freedom. For speech signals, this problem is known as the cocktail party problem. In musical signals, it has led to many research topics such as instrument separation, melody transcription, etc. In general, songs have singing voices accompanied by various pitched as well as percussion instruments. In this work, we focus our attention on estimating the melody of the singing voice from a single singer in the presence of pitched and percussion accompaniments from a monaural, i.e., single-channel, audio recording. Here the term polyphonic music is used to denote single-voice multi-accompaniment music, as opposed to its conventional meaning. Also, we ignore the linguistic information in the song and consider only the melody information. The system is oriented towards extracting the predominant melody in real time, which has various applications in enhancing computer participation in live music performances, query-based music search, music teaching tools, etc.

Generally, but not always, pitch has a one-to-one correspondence with the fundamental frequency [1]. There has been a lot of work in fundamental frequency (F0) estimation, but recently melody transcription from polyphonic music has become an active research topic [2], [3], with various applications in music information retrieval. In contrast to monophonic sounds, whose structure in various representational spaces has been very well studied [4], polyphonic sounds contain sources with overlapping structures, giving rise to complex patterns. Frequently, the F0s or the harmonics of various pitched sources overlap. Moreover, percussion instruments introduce short-time high-energy bursts.

The estimation of F0 is carried out by representing the signal in various forms and estimating the features which correspond to F0. A general architecture underlying most F0 estimation systems can be outlined as shown in Fig. 1. The information spread in the audio signal is transformed into a suitable representation space, where it can be conveniently clustered into various subspaces such that each subspace represents one source and can be extracted reliably. Static constraints are used to analyze and cluster the information in a single time frame. Dynamic constraints model the evolution of subspaces over time and help in clustering the source information over successive time frames. The clustered subspaces then give multiple F0s corresponding to various sources, out of which one particular source, which is vocal in this work, is selected using a harmonic strength criterion and instrument-specific constraints.

Fig. 1. General framework for the vocal melody extraction.
Time domain methods are based on identifying the periodically repeating structure in the temporal representation of the sound waveform, and the static constraints involve estimating the F0 from auto-correlation [5] and difference-function [1] based features. Computational Auditory Scene Analysis (CASA) based methods [6] segregate the audio into various streams using filter banks, inspired by the psycho-acoustic cues used in modeling the human auditory system.

In many cases, frequency domain methods involve a short-time Fourier transform (STFT) based representation space. Static constraints involve various ways to model a harmonic source spectrum. [7]–[9] use non-negative matrix factorization to decompose a spectrum into a dictionary of spectral components. Others consider clustering the harmonic spectral peaks as belonging to different sources. Many works [10], [11] formulate a scalar harmonic salience score which depends upon the power of the harmonics and the deviation of the harmonic peaks from whole-number multiples of the estimated F0. To reduce the computations, several works [12]–[16] consider only the frequency, amplitude and phase of the peaks in the spectrum, without considering their shape. These peaks are called sinusoidal components or partials, and this modelling approach is known as sinusoidal analysis. [14] discusses and compares various F0 estimation methods based on formulating harmonic salience functions using sinusoidal components. [15] develops a two-way mismatch criterion for harmonic-based pitch detection, which is further used by [16] for vocal melody extraction from polyphonic music.

The dynamic constraints deal with the temporal evolution of the harmonic structure of a source over consecutive spectra. This step enhances the efficiency of locating the harmonic sources even if they become weak in amplitude with time. It also helps in grouping the harmonic structures in consecutive spectra as belonging to the same source. [17], [18] use hidden Markov models to model the temporal evolution of the features derived from the spectrum. The Harmonic Temporal Clustering scheme used in [19] defines a probabilistic model to jointly estimate the F0 and its temporal evolution, thereby combining the static and dynamic constraints into a single step. Many works [8], [9], [16], [20] use dynamic programming for finding the F0 trajectories of the various sources. [21] models the melodic source with harmonic GMMs and tracks them in time using the Kalman filter framework and dynamic programming. [22] uses the idea of multi-agent based tracking, where an F0 probability density function (pdf) is computed at each instant and the peaks of this pdf are tracked in time.

In sinusoidal modelling, some harmonics are often distorted or lost due to low-frequency accompaniment interference, while the higher harmonics are usually less affected. If, in such a time frame, the algorithm can depend upon these unaffected harmonics, the F0 tracking can be made much more robust. For this task, the individual partials have to be tracked in time. Virtanen [23] presents a peak continuation algorithm for tracking individual peaks.

In this paper, we represent the audio signal in terms of a partial space which consists of the sinusoidal components. We see the F0 estimation task as a two-level clustering of this partial space into various sound sources: first statically, at the spectral level, i.e., in each temporal window, and second dynamically, i.e., over successive temporal windows. We develop a unified framework which does both these tasks simultaneously so as to be implementable on-line. This approach is inspired by the Kalman filter framework, which takes care of both the structural as well as the dynamic constraints of a system for estimating its state trajectory. A harmonically related cluster of partials is termed a comb. With each comb aiming to track the partials from a single source, there are several combs simultaneously tracking the various pitched sources. While other works use the F0 and salience values for the dynamic constraints, our work relies upon directly tracking the higher harmonics.

The next task is to identify one melody trajectory as the predominant (vocal) one, for which a few approaches have been used by researchers. [24] uses features derived from individual partials for identifying the instrument in polyphonic music. But the human voice shows a large variety of variations due to age, gender, style and interpersonal voice characteristics, for which this method is yet to be tested. Many researchers [14] use the harmonic strength (salience function), quantified in various ways, as a criterion to determine the predominant melody. [25] uses the idea of temporal instability of the vocal pitch contour, due to involuntary jitter, for eliminating loud pitched instruments having stable F0 contours. [26] also uses the jitter information to enhance the melodic component. [27] uses the Fourier transform of the estimated pitch contour for vocal/instrument classification based on vibrato characteristics. In this work, we develop a novel vocal melody selection scheme using a series of filters, with an aim to make the system implementable in real time.

There has been very little work in real-time pitch detection [22], [28]–[30]. Real-time processing requires each temporal frame to be processed only once. It also constrains the computational complexity and the memory requirements of the algorithm. The real-time melody extraction algorithm of [22] tracks the F0 trajectories, whereas our algorithm also depends upon the higher harmonics for tracking.

The novel contributions of this work include (i) a unified framework for real-time melody extraction, which (ii) depends on strong higher harmonics for tracking, and (iii) a filter-based vocal source selection scheme. To the best of our knowledge, these concepts have not been used previously for melody extraction. The overall flowchart of the proposed system is shown in Fig. 2.

Fig. 2. Overview of the proposed system.

Sections II, III and IV describe the major modules of our system (as in Fig. 2), starting from the extraction of the partial space, then the harmonic source tracking module and finally the vocal source identification module, respectively. Section V gives a comparative evaluation of the performance of this work against another state-of-the-art system, as well as the justification for some novel ideas presented in this work, using standard music databases. The conclusion follows in Section VI.

II. EXTRACTION OF PARTIAL SPACE

As the first step, the signal has to be transformed into a representation space which contains most of the relevant information of the various pitched sources in the polyphony. This is achieved by considering the frequency and amplitude of all the partials (peaks in the magnitude spectrum) in the discrete Fourier transform (DFT) space. This representation was chosen because it is computationally fast and hence suitable for real-time processing.

An N-point short-time Fourier transform (STFT) of the monaural music recording is computed, using a sliding Hanning window with a temporal length of 80 ms. N is chosen to be of the order of the sampling frequency. Choosing such a large N does not increase the frequency resolution (which depends upon the window length), but it reduces the discretization error to less than 1 Hz. Musical signals are largely periodic and vocal pitches change slowly in rhythm; still, we choose a shorter window length so as to get good estimates even when the pitch is making a transition from one note to another. Only the magnitude of the STFT is considered, ignoring the phase information.

The next step is to extract the peaks (partials) in the spectrum. There are various ways to improve the partial estimation accuracy. Dressler [31] uses a multi-resolution FFT for computing the STFT at various time-frequency resolutions. For better estimation of frequency and amplitude, some use parabolic interpolation, based on the fact that the spectral main lobes (in dB) can be approximated as parabolas, while others use instantaneous frequency information based on the phase vocoder concept. Salamon et al. [14] give a comparative evaluation of these partial extraction algorithms in the context of melody extraction. However, our work uses the simplest of all, i.e., the amplitude and frequency of only the local maxima in the spectrum. A local maximum is a point whose amplitude is greater than that of its immediate neighbors on the frequency axis.

The analysis range is restricted to half the sampling frequency or 5 kHz, whichever is lower, as this is the region where the vocal harmonics with significant amplitude are found. The amplitudes and frequencies of these partials, indexed in decreasing order of amplitude, form the whole space for further analysis. This space is termed the partial space. The next task is to cluster this space into subspaces corresponding to different sources.
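To make the above concrete, the following sketch (our own NumPy illustration, not the authors' C implementation; the function and variable names are ours) extracts the partial space from a single 80 ms frame: a zero-padded magnitude spectrum is computed, the local maxima are picked as partials, the range is capped at min(fs/2, 5 kHz), and the partials are sorted by decreasing amplitude.

```python
import numpy as np

def extract_partial_space(frame, fs, nfft=None, fmax_cap=5000.0):
    """Return (freqs, amps) of spectral local maxima, sorted by decreasing amplitude.

    frame : one 80 ms chunk of the monaural signal (samples).
    nfft  : zero-padded FFT size, of the order of the sampling rate, so that the
            frequency grid spacing fs/nfft stays below 1 Hz (as described above).
    """
    if nfft is None:
        nfft = int(2 ** np.ceil(np.log2(fs)))        # nfft ~ fs  ->  grid spacing < 1 Hz
    windowed = frame * np.hanning(len(frame))        # sliding Hanning window
    mag = np.abs(np.fft.rfft(windowed, n=nfft))      # only the magnitude is kept
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)

    # analysis range: half the sampling frequency or 5 kHz, whichever is lower
    keep = freqs <= min(fs / 2.0, fmax_cap)
    mag, freqs = mag[keep], freqs[keep]

    # local maxima: amplitude greater than both immediate neighbours on the frequency axis
    peaks = np.where((mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:]))[0] + 1
    order = peaks[np.argsort(mag[peaks])[::-1]]      # index in decreasing order of amplitude
    return freqs[order], mag[order]
```

For an 80 ms frame at 16 kHz (1280 samples), nfft becomes 16384, giving a grid spacing of about 0.98 Hz, consistent with the sub-1 Hz discretization error mentioned above.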

III. SOURCE TRACKING SYSTEM

This section explains how the partial space is clustered into source subspaces and how these clusters are tracked in time, using harmonic as well as dynamic constraints. We call these clusters combs to signify their harmonic structure. The main modules of the source tracking system are illustrated in Fig. 3.

Fig. 3. Block schematic of the harmonic source tracking system.

The state of a comb at a given instant (time frame) contains the frequencies and amplitudes of the fixed number of partials (harmonics) associated with it. The time index is omitted at several places below for ease of representation. In the following discussion, two likelihood functions will be used, a static one, (1), and a dynamic one, (2). Eq. (1) ensures that, for the static likelihood to be high, the distance between the partial frequency and the predicted frequency should be small and the partial amplitude should be large. This likelihood is called static because it is used in the context of a single spectrum. On the other hand, (2) ensures that, for the dynamic likelihood to be high, the distances between both the frequencies and the partial amplitudes should be small. This likelihood is called dynamic, as it is generally used in the context of the dynamic evolution of the spectra, such that the reference frequency and amplitude are the values predicted from the spectra at the previous instants. The human auditory perception of pitch as well as amplitude is close to logarithmic; hence logarithmic distance measures have been used [23]. Weighting parameters control the relative importance of the frequency and amplitude terms in both functions, and they are set to different values in the different equations where the likelihoods are used.
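The extracted text does not reproduce the exact expressions (1) and (2); the following sketch shows one plausible form consistent with the description above, using log-frequency and log-amplitude distances weighted by two parameters. The actual weighting and normalization in the paper may differ.

```python
import numpy as np

def static_likelihood(f, a, f_pred, w_f=1.0, w_a=1.0):
    """Assumed form of (1): high when the partial frequency f is close to the predicted
    frequency f_pred (log distance) and the partial amplitude a is large."""
    return -w_f * abs(np.log(f / f_pred)) + w_a * np.log(a)

def dynamic_likelihood(f, a, f_pred, a_pred, w_f=1.0, w_a=1.0):
    """Assumed form of (2): high when both the frequency and the amplitude are close
    (in the log sense) to the values predicted from the previous frames."""
    return -w_f * abs(np.log(f / f_pred)) - w_a * abs(np.log(a / a_pred))
```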
A. Comb Tracking

The state of an active comb, which is tracking a cluster of partials, has to be updated at the next time instant. This task is accomplished using a prediction update and a measurement update, as in the Kalman filter framework [32].

Prediction update: The a priori state of the comb is estimated using first-order prediction of the frequency and amplitude of every harmonic, as in (3) and (4). A pair of prediction coefficients is used, and the change term is defined in the same way for both quantities, with the frequency and the amplitude substituted in turn.
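A minimal sketch of the prediction update is given below. The exact prediction coefficients in (3) and (4), and whether the extrapolation is carried out in the linear or the log domain, are not recoverable from the extracted text, so they are assumptions here.

```python
def predict_comb_state(prev, prev2, alpha=0.5):
    """A priori estimate of a comb state from its last two states.

    prev, prev2 : lists of (frequency, amplitude) pairs, one per tracked harmonic,
                  at instants t-1 and t-2.
    alpha       : assumed prediction coefficient (illustrative value only).
    """
    predicted = []
    for (f1, a1), (f2, a2) in zip(prev, prev2):
        f_hat = f1 + alpha * (f1 - f2)   # first-order extrapolation of the frequency
        a_hat = a1 + alpha * (a1 - a2)   # first-order extrapolation of the amplitude
        predicted.append((f_hat, max(a_hat, 0.0)))
    return predicted
```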


Measurement update: The partial space at the current instant is used as the measurement to obtain the a posteriori estimate of the comb state. To tackle situations when the first harmonic is distorted by accompaniment interference, the system relies upon the higher harmonics for the state update. To update the state, several leader harmonics are chosen. Each leader harmonic helps in defining a new potential state for the comb. The potential state with the maximum likelihood is used to finally update the comb state. All the steps of the measurement update are outlined in Fig. 4 and the detailed procedure is explained below.

Fig. 4. Block schematic of the measurement update step in comb tracking.

A harmonic is selected as a leader harmonic if it is strong in amplitude. The criterion used is that its amplitude should be greater than a constant fraction of that of the first harmonic for the latest four available states of the comb, as in (5).

Consider one harmonic from the set of all leader harmonics of the comb at the current instant. The new potential state defined by this leader is estimated as follows. The leader's own harmonic in the potential state is found from the observed partials less than two semitones apart from its a priori estimate, as in (6). For the other harmonics, the constraints of harmonic structure are used, based on the F0 calculated from the leader harmonic: it is expected that the frequencies of the other harmonics will be harmonically related to the F0 derived from the leader harmonic frequency and, assuming that the harmonic envelope of a source varies slowly with time, that the amplitudes of the other harmonics will change by the same ratio as that of the leader harmonic, as in (7). In this way, all the harmonics of the potential state defined by this leader are calculated, and the same procedure is used to find all the potential states for the comb.

The winner potential state, which finally updates the comb, is chosen as the one which maximizes the static likelihood, as in (8). For the comb at the current instant, the F0 is then calculated from the frequencies and amplitudes of the a posteriori state, as in (9).

In this way, the comb tracking step updates the states of all the sources, taking care of both the structural as well as the temporal constraints simultaneously.
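The following sketch strings the measurement-update steps together: leader selection, one potential state per leader, and winner selection by static likelihood. It simplifies several details (the leader criterion is applied to the current predicted state rather than the latest four states, and the assumed likelihood form from the earlier sketch is reused), so it is an illustration of the procedure, not the authors' implementation.

```python
import numpy as np

SEMITONE = 2 ** (1 / 12.0)

def best_partial(f_hat, part_f, part_a, w_f=1.0, w_a=1.0):
    """Best observed partial within two semitones of f_hat, scored by the assumed
    static likelihood (close in log frequency, large amplitude). Returns (idx, score)."""
    near = np.where((part_f > f_hat / SEMITONE**2) & (part_f < f_hat * SEMITONE**2))[0]
    if len(near) == 0:
        return None, -np.inf
    s = -w_f * np.abs(np.log(part_f[near] / f_hat)) + w_a * np.log(part_a[near] + 1e-12)
    j = int(np.argmax(s))
    return int(near[j]), float(s[j])

def measurement_update(pred_state, part_f, part_a, frac=0.5):
    """Sketch of the measurement update for one comb (simplified).

    pred_state : a priori state, list of (freq, amp) per harmonic; index 0 is the 1st harmonic.
    """
    a1 = pred_state[0][1]
    leaders = [h for h, (_, a) in enumerate(pred_state) if a >= frac * a1]  # strong harmonics lead

    best_state, best_score = pred_state, -np.inf
    for h in leaders:
        f_hat, a_hat = pred_state[h]
        j, _ = best_partial(f_hat, part_f, part_a)        # leader matched to an observed partial
        if j is None:
            continue
        f0 = part_f[j] / (h + 1)                          # F0 implied by the leader harmonic
        ratio = part_a[j] / max(a_hat, 1e-12)             # leader's amplitude change ratio
        # harmonic constraint: k-th harmonic at k*f0, amplitude scaled like the leader's
        potential = [((k + 1) * f0, a * ratio) for k, (_, a) in enumerate(pred_state)]
        # winner = potential state whose harmonics best match the observed partials
        scores = [best_partial(f, part_f, part_a)[1] for f, _ in potential]
        score = sum(s for s in scores if np.isfinite(s))
        if score > best_score:
            best_state, best_score = potential, score
    return best_state
```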

B. Comb Initialization

If any of the three strongest (in amplitude) partials in the partial space is not being tracked by one of the already active combs, then a new comb is initialized, with its first harmonic taking the frequency and amplitude of that partial. Note that in the algorithm this step is performed after all the active comb states have been updated at the current instant. There is a maximum number of combs, each of which tracks a fixed number of harmonics. If all the combs are active, then an existing comb is terminated to start the new comb.

The other harmonics of the new comb are found using the harmonic constraint, i.e., the h-th harmonic is predicted to be located at a frequency equal to h times the F0, where the frequency of the initializing partial is considered as the F0 for this step. For each of the remaining harmonics, the partial is chosen as in (10); Eq. (10) selects the partial that is close to the predicted harmonic frequency and has a large amplitude.
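A sketch of the initialization step under the assumptions above; the number of harmonics per comb and the matching tolerance are illustrative values, not the paper's.

```python
import numpy as np

def maybe_initialize_comb(part_f, part_a, active_combs, n_harm=10, tol=0.03):
    """Start a new comb if one of the three strongest partials is not yet being tracked.

    active_combs : list of comb states, each a list of (freq, amp) pairs.
    n_harm       : number of harmonics tracked per comb (illustrative value).
    tol          : relative frequency tolerance for "already tracked" / harmonic matching.
    """
    tracked = np.array([f for comb in active_combs for f, _ in comb]) if active_combs else np.array([])
    for i in np.argsort(part_a)[::-1][:3]:               # three strongest partials
        f0, a0 = part_f[i], part_a[i]
        if tracked.size and np.any(np.abs(tracked - f0) < tol * f0):
            continue                                      # this partial is already being tracked
        comb = [(f0, a0)]                                 # the strongest partial becomes the 1st harmonic
        for h in range(2, n_harm + 1):
            target = h * f0                               # harmonic constraint: h-th harmonic near h*F0
            near = np.where(np.abs(part_f - target) < tol * target)[0]
            if len(near):
                j = near[np.argmax(part_a[near])]         # favour close, large-amplitude partials
                comb.append((part_f[j], part_a[j]))
            else:
                comb.append((target, 0.0))                # placeholder when no partial is found
        return comb
    return None
```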

C. Comb Termination

The playing of any instrument continues for a short period of time and then breaks. Correspondingly, our combs should also get terminated when the source power decays below a certain level. A comb gets terminated if the sum of its partial amplitudes falls below a threshold fraction of the amplitude of the strongest partial in the partial space, as in (11).

There is also the possibility that two combs start tracking the same source, due to a tracking error. In such a case, when two combs have the same partial frequencies and amplitudes for all harmonics at the latest two time instants, we terminate the one with the smaller temporal length. We consider two time instants so as to avoid termination when the F0 trajectories of two sources collide, as we are using the last two states in the prediction step.

To make this system more specific to a particular instrument, i.e., vocal in this work, we specify some more constraints.
1) The vocal F0 range is limited.
2) The first harmonic is very predominant.
Under the first constraint, the F0 candidates are searched for only in the limited range in the comb initialization step. Also, combs which have an F0 outside this range are terminated, because this work is not concerned with tracking the accompaniments. The second constraint is required only for the initialization of a comb and is not needed once the comb is initialized, as the algorithm can depend upon the higher harmonics for tracking.
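The termination test of (11) reduces to a simple threshold check; the fraction used below is illustrative, not the value from the paper.

```python
def should_terminate(comb, strongest_partial_amp, thresh_frac=0.1):
    """Terminate a comb when the sum of its partial amplitudes falls below a fraction
    of the strongest partial in the current partial space."""
    return sum(a for _, a in comb) < thresh_frac * strongest_partial_amp
```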
IV. VOCAL COMB IDENTIFICATION

Among the combs present at the current time instant, the one which corresponds to the vocal source has to be identified.

To determine the vocal contour, the harmonic strength criterion is used. The vocal recognition score of a comb at the current instant is defined as its harmonic strength, as in (12).

While other researchers have used recognition criteria over individual frames, we use the knowledge that a comb tracks the same source to smooth the score over time by using a first-order linear filter, represented in the z-domain as in (13). This reinforces the probability of selecting a source as vocal at the next time instant, too, if it is selected as vocal at the current time instant. Thus, it helps in identifying a vocal comb even if it has less salience at a particular instant, due to a momentary rise in accompaniment strength (e.g., during an accompaniment onset). The filter coefficient is a positive constant less than unity.

Many times it is seen that some loud pitched instruments have a greater score and hence deteriorate the recognition quality. To reduce their score, we develop a filter based on the idea presented in [25] that these instruments mostly have a stable pitch contour, whereas the vocal pitch contour has an involuntary instability called jitter. The stability of the pitch contour is quantified using its standard deviation (SD), calculated over a finite number of previous instants (here 20). If the SD is less than a threshold, the recognition score is attenuated using the filter in (14). This weakens the instrumental comb strength. Sometimes the vocal pitch contour also happens to have an SD a little less than the threshold. Pruning a comb with a low SD altogether, as in [25], severely reduces the accuracy, but attenuating such combs proportionately (as in (14)) makes the vocal combs compete better.

The vocal comb is selected as the one having the maximum score after filtering among all the combs present at time t, as in (15).

The scheme for obtaining the score for vocal comb recognition is illustrated in Fig. 5. To summarize, the scheme consists of calculating the harmonic salience score for each comb, filtering this score through a first-order smoothing filter and then a jitter-based filter. The comb having the maximum score after filtering is selected as the one tracking the vocal source.

Fig. 5. Block schematic for obtaining the score for vocal comb identification.

The algorithm for the entire melody extraction scheme is given in the form of pseudo-code in Fig. 6.
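The following sketch shows one plausible realization of the score filtering chain of (12)–(15). The smoothing coefficient, SD threshold and attenuation form are assumptions on our part, since the exact expressions are not reproduced in the text above.

```python
import numpy as np

def filtered_score(raw_score, prev_filtered, f0_history, beta=0.9, sd_thresh=2.0, atten=0.1):
    """One plausible realization of the score filtering (all constants illustrative).

    raw_score     : harmonic-strength (salience) score of the comb in the current frame.
    prev_filtered : output of the smoothing filter at the previous frame.
    f0_history    : last ~20 F0 values of this comb (Hz), used for the jitter test.
    """
    smoothed = raw_score + beta * prev_filtered                # first-order recursive smoothing (assumed form of (13))
    sd = np.std(f0_history)                                    # pitch-contour stability over ~20 frames
    if sd < sd_thresh:                                         # stable contour -> likely an instrument
        smoothed *= atten + (1.0 - atten) * (sd / sd_thresh)   # proportional attenuation (assumed form of (14))
    return smoothed

# selection in the spirit of (15): the vocal comb is the one with the maximum filtered score, e.g.
# vocal_idx = int(np.argmax([scores[k] for k in range(n_combs)]))
```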
V. EVALUATION

The proposed scheme of harmonic cluster tracking for melody extraction (HCTM henceforth) accomplishes the melody extraction task in two steps, namely harmonic source tracking and vocal pitch selection. So we evaluate the performance accuracies for melody extraction both without and with vocal source selection. The former tells whether one of the combs is tracking the vocal melody; this performance is not compared with any existing system because this work is not concerned with multi-F0 tracking. The latter tells the tracking accuracy of the complete HCTM system.

We used fixed values for the score filtering parameters. The values of the various other parameters chosen for our implementation of HCTM are shown in Table I.

These parameters were chosen heuristically for one singer, viz. Kenshin, and applied as such to the other singers, without singer-specific tuning.

We compare the performance of our system with the state-of-the-art Two-Way Mismatch Dual Pitch tracking (TWMDP) algorithm developed by Rao & Rao [16]. The TWMDP system follows the general melody estimation architecture, i.e., computing the salience for candidate F0-pairs and then using dynamic programming for trajectory smoothness. This algorithm has been shown to be competitive amongst state-of-the-art systems, such as the algorithms by Li and Wang [33], Klapuri and Ryynanen [34], or Goto [35]. We used our own implementation of TWMDP, with parameter tuning as described in [16]. It should be noted that our implementation of TWMDP does not seem to perform quite as well as the original algorithm when tested on the same dataset as used in [16]. Our implementation has a drop in performance of about 8% in accuracy as compared to the results reported by [16] on its third dataset (details given later in Table III). However, the results reported in this article should still be a good indication of the performance of our on-line algorithm compared to one state-of-the-art offline reference system. We used a fixed set of parameters for evaluation over all the data. The F0 search range was fixed to 60–1100 Hz for both algorithms and for all datasets.

Fig. 6. Pseudo-code of the HCTM algorithm.

TABLE I
PARAMETER VALUES FOR HCTM

TABLE II
DATA DESCRIPTION

We evaluate the performance of these algorithms on three datasets. The first dataset consists of six different singers, 4 male and 2 female, from the MIR-1K database [36]. The dataset consists of audio files sampled at 16 kHz, each of which contains two parallel tracks: the vocal track, sung by amateur singers, and the accompaniment track, extracted from Chinese pop karaoke. The total duration and the vocal section length (when the vocal source is active) of the audio clips for each singer are given in Table II. The size of this dataset is quite large, consisting of 2340 s (39 minutes) of total audio, with 1740 s (29 minutes) of vocal audio duration. The singing voice and the music accompaniment are mixed at three different SNRs of -5 dB, 0 dB and 5 dB (with the accompaniment considered as noise) for evaluation.

The second dataset consists of 9 vocal songs (5 male and 4 female) from the LabROSA training data for the MIREX 05 Audio Melody Extraction Competition (available from http://labrosa.ee.columbia.edu/projects/melody/). The audio files are re-sampled to 16 kHz.

The third dataset consists of excerpts from two North Indian classical vocal performances, sung by a male and a female. The same dataset has been used by Rao & Rao in [16]. These performances contain the voice, a drone instrument, tonal percussion and a loud secondary melodic instrument, the harmonium, which is similar to the accordion. The clips are re-sampled to 16 kHz.

The description of the data durations for all the datasets is given in Table II. Ground truth pitch values are given at an interval of 10 ms for all three datasets. Both the TWMDP and HCTM algorithms compute the melody estimates at hop intervals equal to that of the ground truth data. Each clip in the datasets consists of vocal as well as non-vocal regions, the latter marked with a pitch value of 0 Hz. We evaluate the estimation accuracy only for the vocal regions, ignoring the non-vocal regions. There have already been works on separating the vocal and non-vocal accompaniment regions in an audio clip [37], and this topic is not dealt with in the present work.

TABLE III
PITCH AND CHROMA ACCURACIES

The estimated melody is considered correct if it is less than half a semitone away from the ground truth pitch value. The accuracy is reported in two ways [38]. Raw pitch accuracy (RPA) is the probability of giving the correct pitch value. Raw chroma accuracy (RCA) allows octave errors and is the probability that the estimated pitch value, when mapped into the same octave as that of the ground truth pitch value, is identified as the correct pitch value.
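Both metrics can be computed directly from the 10 ms pitch tracks; the helper below (our own, not the authors' evaluation script) follows the definitions just given.

```python
import numpy as np

def rpa_rca(est_f0, ref_f0):
    """Raw pitch accuracy (RPA) and raw chroma accuracy (RCA) over the vocal frames.

    est_f0, ref_f0 : arrays of estimated / ground-truth pitch (Hz), one value per 10 ms hop;
                     non-vocal frames carry 0 in the ground truth.
    A frame is correct when the estimate lies within half a semitone (50 cents) of the
    reference; for RCA the error is first folded into a single octave.
    """
    est_f0, ref_f0 = np.asarray(est_f0, float), np.asarray(ref_f0, float)
    voiced = ref_f0 > 0                           # evaluate only where the vocal source is active
    est, ref = est_f0[voiced], ref_f0[voiced]
    valid = est > 0
    cents = np.full(est.shape, np.inf)
    cents[valid] = 1200.0 * np.abs(np.log2(est[valid] / ref[valid]))
    folded = np.full(est.shape, np.inf)
    folded[valid] = np.minimum(cents[valid] % 1200.0, 1200.0 - cents[valid] % 1200.0)
    return float(np.mean(cents <= 50.0)), float(np.mean(folded <= 50.0))
```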

A. Results
1) Melody Extraction Accuracy: Table III shows the pitch and chroma accuracies for melody extraction by the TWMDP and HCTM algorithms for all the datasets, with dataset 1 evaluated at the different SNRs of -5 dB, 0 dB and 5 dB for all six singers. Fig. 7 illustrates the overall comparison of the HCTM with the TWMDP algorithm over the entire dataset 1, consisting of all six singers, at the different SNRs. For dataset 1, at 0 dB and 5 dB SNR, both the raw pitch and the raw chroma accuracies of the complete HCTM system are significantly higher than those of the TWMDP, for all the singers. At -5 dB SNR, the raw pitch accuracy of the complete HCTM system is larger than that of the TWMDP for all the singers. Better results are obtained with the HCTM over the other two datasets as well. The overall accuracies, compared illustratively in Fig. 7, show that both the overall pitch and chroma accuracies are better for the HCTM than for the TWMDP. All these results show that the HCTM significantly outperforms our implementation of the TWMDP algorithm. To check the statistical significance of these results, we performed a paired-sample t-test to compare the accuracies of the two algorithms, and found the p-values to be less than 0.05, which implies that the improvements due to the HCTM are statistically significant. Also, the same set of parameters was used for all three datasets, which shows the robustness of the HCTM.

Fig. 7. Overall pitch and chroma accuracies (%) for all the singers of dataset 1, at different SNRs.

2) Effectiveness of Score Filtering: To show the importance of the vocal recognition score filtering, Fig. 8 shows the effect of the filters on the scores of two combs, one of which tracks the vocal pitch while the other tracks a strong accompaniment pitch. The score for the instrumental comb (comb 1) spikes at some points and becomes larger than that for the vocal comb (comb 2), leading to mis-classification (Fig. 8(b)). This identification error is reduced by using the first-order filter, which effectively uses the information that a comb tracks the same source (Fig. 8(c)). The error is further reduced by the jitter filter (Fig. 8(d)), since the instrumental F0 contour is stable and has a low SD.

To evaluate the overall effect of score filtering (both the smoothing and the jitter filters), Fig. 9 presents the accuracy of the HCTM method with and without the use of the score filters. This shows that the capability of the recognition score in indicating the predominant melody is improved when we incorporate source information to enhance the score, by filtering out noisy fluctuations over individual sources and by using the jitter information when the predominant melody is vocal.
Fig. 8. (a) F0 contours for two combs, 1 (instrumental, solid) and 2 (vocal, dotted); (b) unfiltered vocal recognition score: the score for comb 1 spikes and becomes larger than that of comb 2, causing mis-identification; (c) score filtered using the first-order filter: mis-identification reduced; (d) final score filtered with both filters: mis-identification further reduced. Here the SD for comb 1 was very small, hence its score is reduced to almost zero.

Fig. 9. Pitch and chroma accuracies (%) without and with score filtering for various singers (at 0 dB SNR).

The effect of the filters in removing the accompaniment interference is clearly illustrated by these figures. The use of the score filters for identifying the vocal comb helps in making the HCTM algorithm implementable on-line while maintaining high accuracy.

3) Computational Requirements: The computational requirements for a real-time implementation were further analyzed. The algorithm was compiled as C code in Ubuntu and run on an Intel Core 2 Quad 2.83 GHz CPU with 3 GB RAM. The real-time factor, defined as the ratio of the processing time to the signal duration, was found to be approximately 0.1. Also, the algorithm does not have heavy memory requirements, as the comb state prediction step needs the comb states at only the last four instants and the jitter filter needs to store the F0 values for only the latest 20 instants.

4) Parameter Configuration: To study the effect of parameter variation on the accuracy, we vary a few important parameters one by one, while keeping the others fixed to their original values. The RPA and RCA, both without and with vocal source selection, for dataset 2 and with different parameter values while modifying only one parameter at a time, are shown in Fig. 10.

Fig. 10. Effect of parameter variation on the RPA and RCA without vocal selection, and the RPA and RCA with vocal selection. The varied parameter is mentioned in the caption of the corresponding subfigure.

These results show that modifying these parameters does not drastically affect the performance of HCTM; however, some tuning may improve the performance. The jitter filter can be sensitive to singer variation, as a large threshold can attenuate the vocal recognition score for a trained singer, who can control the unwanted jitter much better than an amateur singer.

B. Discussion

1) Source Tracking Accuracy: In the TWMDP framework, the number of F0-units increases at an exponential rate with the number of F0 contours to be tracked; e.g., for dual-F0 tracking, the number of F0-pairs is approximately the square of the number of candidate F0s. Hence, tracking a larger number of contours is costly. In our HCTM framework, by contrast, the number of combs is the same as the number of contours to be tracked. The high accuracies of HCTM without vocal selection (Table III) show that the source tracking algorithm is indeed tracking the vocal pitch.

The prediction-based source tracking in HCTM is, in principle, better able to handle collisions

between the vocal and the instrumental pitch contours. After a collision, the prediction step helps the comb keep tracking the same source that it was tracking just before the collision. On the contrary, in TWMDP, at the time of such a collision the dual-F0 state is not able to find the second F0 and hence starts tracking a spurious F0, resulting in a decrease in accuracy.

Most of the algorithms following the general architecture (including TWMDP) first find the salience and then apply smoothness constraints. The HCTM algorithm, in contrast, simultaneously uses the smoothness constraints as well as the harmonic-based salience, in the form of source comb tracking. It then computes the score for determining the vocal F0, filtered over time for each individual source. This reduces errors when the accompaniments momentarily become stronger than the vocal source, e.g., during onsets. It also helps in persistently selecting a source as vocal by filtering out the fluctuations in the recognition score. Another important advantage is that HCTM also relies upon timbral features for tracking, although in a naive way, as the measurement update equations ((6), (7)) tend to minimize the distance between successive spectral envelopes of a comb. However, the architecture allows more sophisticated ways of capturing timbral features to be incorporated easily.

Notably, the accuracy of HCTM is higher than that of TWMDP even though the former uses online processing whereas the latter uses offline dynamic programming for determining the melody. A typical example of melody extraction using the two algorithms is given in Fig. 11. It can be seen that HCTM gives smoother estimates of the F0 contours because it relies on multiple harmonics for F0 estimation. Also, some combs of HCTM can be seen tracking the higher harmonics of the vocal comb, but the octave error is taken care of by the identification score, as explained below.

Fig. 11. Example of the vocal melody (shown by dark crosses) extracted by TWMDP and HCTM, along with other F0s (shown by dots), for a 4 s long excerpt from the song titled 'Kenshin_2_04' at 0 dB SNR. The actual vocal melody (ground truth) is shown by light circles.

2) Octave Error: Many works [10], including TWMDP, use an F0 likelihood criterion of the form of (16), where the overall likelihood of an F0 candidate is a sum, over its predicted harmonic count, of a function which measures the likelihood of an individual partial peak being the corresponding harmonic. So the overall likelihood is measured over a variable number of partials, since the predicted harmonic count depends on the candidate F0. Such a formulation is prone to octave error. For example, when the even harmonics are stronger than the odd harmonics, the likelihood of the sub-octave candidate F0/2 may become larger than that of F0, resulting in F0/2 being selected as the pitch in place of F0. We note that this problem occurs because the predicted harmonic count for F0/2 is simply twice that for F0.
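For concreteness, one common form of such a criterion is sketched below in our own notation; the paper's exact definition of (16) is not reproduced in the extracted text, so this is an assumption.

```latex
S(F) \;=\; \sum_{h=1}^{N(F)} g\bigl(f_h,\, a_h \,;\, hF\bigr),
\qquad
N(F) \;\approx\; \Bigl\lfloor \frac{F_{\max}}{F} \Bigr\rfloor .
```

With a non-negative g, every term that contributes to S(F0) also contributes to S(F0/2), since the even harmonics of F0/2 coincide with the harmonics of F0, while S(F0/2) additionally sums roughly N(F0) more terms at the half-integer positions. This is one way the variable harmonic count can bias such a criterion towards the sub-octave.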
However, the HCTM algorithm uses a fixed number of harmonics per comb. This effectively prevents the problem if this number is large enough to capture most of the energy in the vocal spectrum, which is generally carried by the first few harmonics. The use of a fixed harmonic count is also supported empirically by the trained weight models in [12], where the weight of the harmonics, for a weighted-sum based harmonic salience function, decreases with increasing harmonic index, showing that harmonics with a large harmonic index contribute less to the salience function and hence can be dropped.

Another reason for the octave error is the distortion of the first harmonic. This can make the first harmonic unavailable for selection as a candidate F0, or may reduce its salience, because a small error in its frequency gets multiplied by integer factors while matching with the higher harmonic partials. TWMDP takes care of this problem by taking various integer sub-multiples of a well-formed harmonic partial, which, however, increases the number of computations. The HCTM algorithm is capable of depending upon higher harmonics for tracking and hence tackles this problem in a computationally efficient way.

The reduction in octave errors can be seen in the results (Table III) in the form of a reduced difference between the raw pitch and raw chroma accuracies for the HCTM algorithm. Especially for the singers Fdps and Geniusturtle, the chroma accuracies of the TWMDP and the complete HCTM algorithm are close, but the HCTM pitch accuracy is much better than that of the former. The overall results (Fig. 7) also illustrate the same effect, as the difference in the RPAs for the two algorithms is quite large as

compared to that between the RCAs for the two. This shows that the HCTM is better able to reduce octave errors.

3) Comb Initialization: HCTM simply uses a fixed number (three here) of maximum-amplitude partials for comb initialization, under the assumption that the first harmonic is the strongest of all the harmonics. Even if it is not the strongest, but is among the top three partials, the algorithm is still able to initialize a new comb. The effectiveness of this unsophisticated but fast approach for vocal sources can be seen in the high accuracies obtained without vocal selection. However, some instrumental sounds may not satisfy this constraint, for which this method can be improved by using other functions, such as the inverse Fourier transform, or by using the partials having a high static likelihood (as in (10)), but at the cost of extra computations.

VI. CONCLUSION

In this work, we have described a harmonic cluster tracking system for tracking various harmonic sources in polyphonic music and, among them, identifying the predominant vocal melody based on various heuristics.

The novel contributions of this work are:
(i) Unified approach: Most of the previous approaches apply the static and dynamic constraints separately, applying the static constraints first, followed by the dynamic ones in the form of dynamic programming. Our approach is a unified one, using both these constraints simultaneously at every step of tracking. Thus, each frame is traversed only once, which is required for on-line processing.
(ii) Tracking strong higher harmonics: While previous approaches track the 'F0 trajectories,' using the Viterbi algorithm etc., our algorithm depends upon the 'strong higher harmonics' for tracking.
(iii) Vocal selection filters: Instead of the commonly used dynamic programming based offline methods, our approach uses score filters which select the vocal source based on strength and jitter constraints.
(iv) Real-time method: The proposed method is implementable in real time (10 times faster than real time), which makes it suitable for various novel applications.

The algorithm presented in this work was submitted to the Music Information Retrieval Evaluation eXchange (MIREX) 2012 campaign for the automatic melody extraction task [39] (the submitted code is available for research purposes at http://home.iitk.ac.in/~lbehera/isl/AB1.rar); the results are available at http://www.music-ir.org/mirex/wiki/2012:MIREX2012_Results. The HCTM algorithm was extended to track instrumental melodies as well, by using the inverse Fourier transform (IFT) of the log-magnitude spectrum in the initialization and predominant source identification functions. Initialization was based on the top peaks in the IFT, instead of the STFT. For predominant source identification, the R.H.S. of (12) was multiplied by the inverse Fourier transform of the log-magnitude spectrum at the current time instant, evaluated at the index corresponding to the comb's F0.
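The weighting term used in this MIREX extension is essentially a cepstral value. A sketch of how it can be computed is given below; the exact index and normalization used by the authors are not specified in the text above, so the fs/F0 lag is an assumption.

```python
import numpy as np

def cepstral_weight(frame, fs, f0, nfft=None):
    """IFT of the log-magnitude spectrum, evaluated at the lag (quefrency) corresponding
    to a comb's F0 (illustrative; the authors' exact indexing may differ)."""
    if nfft is None:
        nfft = int(2 ** np.ceil(np.log2(len(frame))))
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=nfft)) + 1e-12
    cep = np.fft.irfft(np.log(mag), n=nfft)      # real cepstrum: IFT of the log-magnitude spectrum
    lag = int(round(fs / f0))                    # index corresponding to the period 1/F0
    return cep[lag]
```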
Notably, the accuracies achieved by the submitted HCTM algorithm were better than those achieved by the version of the TWMDP algorithm submitted by Rao & Rao [40] in 2009, over the same datasets (http://www.music-ir.org/mirex/wiki/2009:Audio_Melody_Extraction_Results). The achieved RPA and RCA are shown in Table IV.

TABLE IV
MIREX AUDIO MELODY EXTRACTION RESULTS

The evaluation results clearly indicate that the proposed on-line real-time system is able to achieve significant accuracy levels as compared to the existing state-of-the-art offline method. The current approach identifies the desired sound source based mainly on the harmonic strength, but using timbral features for this task may further improve the accuracy.

ACKNOWLEDGMENT

The authors would like to thank Dr. P. Rao and Dr. V. Rao for sharing their audio dataset for the experiments reported in this paper. The authors are grateful to the reviewers for helpful comments in improving this manuscript.
REFERENCES

[1] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," J. Acoust. Soc. Amer., vol. 111, no. 4, pp. 1917–1930, 2002.
[2] A. Klapuri, "Automatic music transcription as we know it today," J. New Music Res., vol. 33, no. 3, pp. 269–282, 2004.
[3] A. Klapuri and M. Davy, Signal Processing Methods for Music Transcription. Secaucus, NJ: Springer-Verlag, 2006.
[4] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice. Upper Saddle River, NJ: Prentice-Hall, 2002.
[5] P. Boersma and D. Weenink, Praat: Doing Phonetics by Computer, 2004.
[6] G. Hu and D. L. Wang, "A tandem algorithm for pitch estimation and voiced speech segregation," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 8, pp. 2067–2079, Nov. 2010.
[7] E. Vincent, N. Berlin, and R. Badeau, "Harmonic and inharmonic non-negative matrix factorization for polyphonic pitch transcription," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 2008, pp. 109–112.
[8] J. Durrieu, G. Richard, B. David, and C. Fevotte, "Source/filter model for unsupervised main melody extraction from polyphonic audio signals," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 3, pp. 564–575, Mar. 2010.
[9] J. Durrieu, B. David, and G. Richard, "A musically motivated mid-level representation for pitch estimation and musical audio source separation," IEEE J. Sel. Topics Signal Process., vol. 5, no. 6, pp. 1180–1191, Oct. 2011.
[10] J. Wise, J. Caprio, and T. Parks, "Maximum likelihood pitch estimation," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-24, no. 5, pp. 418–423, Oct. 1976.
[11] C. Yeh, A. Roebel, and X. Rodet, "Multiple fundamental frequency estimation and polyphony inference of polyphonic music signals," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 6, pp. 1116–1126, Aug. 2010.
[12] A. Klapuri, "Multiple fundamental frequency estimation by summing harmonic amplitudes," in Proc. 7th Int. Symp. Music Inf. Retrieval (ISMIR'06), Oct. 2006, pp. 216–221.
[13] K. Dressler, "Pitch estimation by the pair-wise evaluation of spectral peaks," in Proc. AES 42nd Int. Conf., 2011.
[14] J. Salamon, E. Gomez, and J. Bonada, "Sinusoid extraction and salience function design for predominant melody estimation," in Proc. Int. Conf. Digital Audio Effects (DAFx), 2011, pp. 73–80.
[15] R. C. Maher and J. W. Beauchamp, "Fundamental frequency estimation of musical signals using a two-way mismatch procedure," J. Acoust. Soc. Amer., vol. 95, no. 4, pp. 2254–2263, 1994.
[16] V. Rao and P. Rao, "Vocal melody extraction in the presence of pitched accompaniment in polyphonic music," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 2145–2154, Nov. 2010.
[17] B. Doval and X. Rodet, "Fundamental frequency estimation and tracking using maximum likelihood harmonic matching and HMMs," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 1993, vol. 1, pp. 221–224.
[18] M. P. Ryynänen and A. Klapuri, "Automatic transcription of melody, bass line, and chords in polyphonic music," Comput. Music J., vol. 32, no. 3, pp. 72–86, Sep. 2008.
[19] J. Wu, E. Vincent, S. A. Raczynski, T. Nishimoto, N. Ono, and S. Sagayama, "Polyphonic pitch estimation and instrument identification by joint modeling of sustained and attack sounds," IEEE J. Sel. Topics Signal Process., vol. 5, no. 6, pp. 1124–1132, Oct. 2011.
[20] W. H. Liao, A. W. Y. Su, C. Yeh, and A. Roebel, "On the use of perceptual properties for melody estimation," in Proc. Int. Conf. Digital Audio Effects (DAFx-11), Paris, France, 2011, pp. 141–145.
[21] H. Kameoka, T. Nishimoto, and S. Sagayama, "Multi-pitch trajectory estimation of concurrent speech based on harmonic GMM and nonlinear Kalman filtering," in Proc. Int. Conf. Spoken Lang. Process., Oct. 2004, vol. 1, pp. 2433–2436.
[22] M. Goto, "A real-time music scene description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals," Speech Commun. (ISCA J.), vol. 43, no. 4, pp. 311–329, 2004.
[23] T. Virtanen, "Audio signal modeling with sinusoids plus noise," M.S. thesis, Dept. of Information Technol., Tampere Univ. of Technol., Tampere, Finland, 2000.
[24] J. G. A. Barbedo and G. Tzanetakis, "Musical instrument classification using individual partials," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 1, pp. 111–122, Jan. 2011.
[25] V. Rao, S. Ramakrishnan, and P. Rao, "Singing voice detection in polyphonic music using predominant pitch," in Proc. INTERSPEECH'09, 2009, pp. 1131–1134.
[26] H. Tachibana, T. Ono, N. Ono, and S. Sagayama, "Melody extraction in music audio signal by melodic component enhancement and pitch tracking," MIREX, 2009.
[27] C. L. Hsu, D. Wang, and J. S. R. Jang, "A trend estimation algorithm for singing pitch detection in musical recordings," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2011, pp. 393–396.
[28] P. de la Cuadra and A. Master, "Efficient pitch detection techniques for interactive music," in Proc. Int. Comput. Music Conf., 2001.
[29] P. McLeod, "Fast, accurate pitch detection tools for music analysis," Ph.D. dissertation, Univ. of Otago, Dunedin, New Zealand, 2008.
[30] P. Verma and P. Rao, "Real-time melodic accompaniment system for Indian music using TMS320C6713," in Proc. Int. Conf. VLSI Design and Embedded Syst., 2012.
[31] K. Dressler, "Sinusoidal extraction using an efficient implementation of a multi-resolution FFT," in Proc. Int. Conf. Digital Audio Effects (DAFx), 2006, pp. 247–252.
[32] G. Welch and G. Bishop, "An introduction to the Kalman filter," Tech. Rep., Chapel Hill, NC, 1995.
[33] Y. Li and D. L. Wang, "Separation of singing voice from music accompaniment for monaural recordings," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1475–1487, May 2007.
[34] M. P. Ryynanen and A. Klapuri, "Polyphonic music transcription using note event modeling," in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust., Oct. 2005, pp. 319–322.
[35] M. Goto, "PreFEst: A predominant-F0 estimation method for polyphonic musical audio signals," MIREX, 2005.
[36] C. L. Hsu and J. S. R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 2, pp. 310–319, Feb. 2010.
[37] W.-H. Tsai and H.-M. Wang, "Automatic singer recognition of popular music recordings via estimation and modeling of solo vocal signals," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 1, pp. 330–341, Jan. 2006.
[38] G. E. Poliner, D. P. W. Ellis, A. F. Ehmann, E. Gomez, S. Streich, and B. Ong, "Melody transcription from music audio: Approaches and evaluation," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1247–1256, May 2007.
[39] V. Arora and L. Behera, "Online melody extraction: MIREX 2012," MIREX, 2012.
[40] V. Rao and P. Rao, "Melody extraction using harmonic matching," MIREX, 2009.

Vipul Arora received the B.Tech. degree in electrical engineering from the Indian Institute of Technology (IIT), Kanpur, India. Currently, he is working towards the Ph.D. degree at IIT Kanpur. His research interests include music information retrieval and semantic signal processing.

Laxmidhar Behera (S'92–M'03–SM'03) received the B.Sc. (engineering) and M.Sc. (engineering) degrees from NIT Rourkela in 1988 and 1990, respectively, and the Ph.D. degree from IIT Delhi. He worked as an assistant professor at BITS Pilani during 1995–1999 and pursued postdoctoral studies at the German National Research Center for Information Technology, GMD, Sankt Augustin, Germany, during 2000–2001. He is currently working as a professor in the Department of Electrical Engineering, IIT Kanpur. He joined the Intelligent Systems Research Centre (ISRC), University of Ulster, United Kingdom, as a reader on sabbatical from IIT Kanpur during 2007–2009. He has also worked as a visiting researcher/professor at FHG, Germany, and ETH Zurich, Switzerland. He has more than 150 papers to his credit, published in refereed journals and presented in conference proceedings. His research interests include intelligent control, robotics, information processing, neural networks, and cognitive modeling. He is a senior member of the IEEE.
