EURASIP Journal on Audio, Speech, and Music Processing
Volume 2010, Article ID 523791, 15 pages
doi:10.1155/2010/523791
Research Article
CorrelationBased Amplitude Estimation of Coincident Partials
in Monaural Musical Signals
Jayme Garcia Arnal Barbedo
1
and George Tzanetakis
2
1
Department of Communications, FEEC, UNICAMP C.P. 6101, CEP: 13.083852, Campinas, SP, Brazil
2
Department of Computer Science, University of Victoria, Columbia, Canada V8W 3P6
Correspondence should be addressed to Jayme Garcia Arnal Barbedo, jbarbedo@gmail.com
Received 12 January 2010; Revised 29 April 2010; Accepted 5 July 2010
Academic Editor: Mark Sandler
Copyright © 2010 J. G. A. Barbedo and G. Tzanetakis. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
This paper presents a method for estimating the amplitude of coincident partials generated by harmonic musical sources
(instruments and vocals). It was developed as an alternative to the commonly used interpolation approach, which has several
limitations in terms of performance and applicability. The strategy is based on the following observations: (a) the parameters of
partials vary with time; (b) such a variation tends to be correlated when the partials belong to the same source; (c) the presence
of an interfering coincident partial reduces the correlation; and (d) such a reduction is proportional to the relative amplitude of
the interfering partial. Besides the improved accuracy, the proposed technique has other advantages over its predecessors: it works
properly even if the sources have the same fundamental frequency, it is able to estimate the ﬁrst partial (fundamental), which is not
possible using the conventional interpolation method, it can estimate the amplitude of a given partial even if its neighbors suﬀer
intense interference from other sources, it works properly under noisy conditions, and it is immune to intraframe permutation
errors. Experimental results show that the strategy clearly outperforms the interpolation approach.
1. Introduction
The problem of source separation of audio signals has
received increasing attention in the last decades. Most of the
eﬀort has been devoted to the determined and overdeter
mined cases, in which there are at least as many sensors as
sources [1–4]. These cases are, in general, mathematically
more treatable than the underdetermined case, in which
there are fewer sensors than sources. However, most real
world audio signals are underdetermined, many of them
having only a single channel. This has motivated a number
of proposals dealing with this kind of problem. Most of such
proposals try to separate speech signals [5–9], speech from
music [10–12], or a singing voice from music [13]. Only
recently methods trying to deal with the task of separating
diﬀerent instruments in monaural musical signals have been
proposed [14–18].
One of the main challenges faced in music source sepa
ration is that, in real musical signals, simultaneous sources
(instruments and vocals) normally have a high degree of
correlation and overlap both in time and frequency, as a
result of the underlying rules normally followed by western
music (e.g., notes with integer ratios of pitch intervals). The
high degree of correlation prevents many existing statistical
methods from being used, because those normally assume
that the sources are statistically independent [14, 15, 18].
The use of statistical tools is further limited by the also very
common assumption that the sources are highly disjoint in
the timefrequency plane [19, 20], which does not hold when
the notes are harmonically related.
An alternative that has been used by several authors is
the sinusoidal modeling [21–23], in which the signals are
assumed to be formed by the sum of a number of sinusoids
whose parameters can be estimated [24].
In many applications, only the frequency and amplitude
of the sinusoids are relevant, because the human hearing is
relatively insensitive to the phase [25]. However, estimating
the frequency in the context of musical signals is often
challenging, since the frequencies do not remain steady with
time, especially in the presence of vibrato, which manifests
2 EURASIP Journal on Audio, Speech, and Music Processing
0
380 400 420 440 460
Frequency (Hz)
480
0.5
1
1.5
A
b
s
o
l
u
t
e
m
a
g
n
i
t
u
d
e
2
2.5
3
3.5
(a)
(b)
4
×10
4
Figure 1: Magnitude spectrumshowing: (a) an example of partially
colliding partials, and (b) an example of coincident partials.
as frequency and amplitude modulation. Using very short
time windows to perform the analysis over a period in
which the frequencies would be expected to be relatively
steady also does not work, as this procedure results in a very
coarse frequency resolution due to the wellknown time
frequency tradeoﬀ. The problem is even more evident in the
case of coincident partials, because diﬀerent partials vary in
diﬀerent ways around a common frequency, making it nearly
impossible to accurately estimate their frequencies. However,
in most cases the band within which the partials are located
can be determined instead. Since the phase is usually ignored
and the frequency often cannot be reliably estimated due to
the time variations, it is the amplitude of individual partials
that can provide the most useful information to eﬃciently
separate coincident partials.
For the remainder of this paper, the term partial will
refer to a sinusoid with a frequency that varies with time. As
a result, the frequency band occupied by a partial during a
period of time will be given by the range of such a variation.
It is also important to note that the word partial can be
both used to indicate part of an individual source (isolated
harmonic), or part of the whole mixture—in this case, the
merging of two or more coincident partials would also be
called a partial. Partials referring to the mixture will be called
mixturepartials whenever the context does not resolve this
ambiguity.
The sinusoidal modeling technique can successfully esti
mate the amplitudes when the partials of diﬀerent sources do
not collide, but it loses its eﬀectiveness when the frequencies
of the partials are close. The expression colliding partials
refers here to the cases in which two partials share at least part
of the spectrum (Figure 1(a)). The expression coincident
partials, on the other hand, is used when the colliding
partials are mostly concentrated in the same spectral band
(Figure 1(b)). In the ﬁrst case, the partials may be separated
enough to generate some eﬀects that can be explored to
resolve them, but in the second case they usually merge in
such a way they act as a single partial. In this work, two
partials will be considered coincident if their frequencies are
separated by less than 5% for frequencies below 500 Hz, and
by less than 25 Hz for frequencies above 500 Hz—according
to tests carried out previously, those values are roughly the
thresholds for which traditional techniques to resolve close
sinusoids start to fail. A small number of techniques to
resolve colliding partials have been proposed, and only a few
of them can deal with coincident partials.
Most techniques proposed in the literature can only
reliably resolve colliding partials if they are not coincident.
Klapuri et al. [26] explore the amplitude modulation result
ing from two colliding partials to resolve their amplitudes.
If more than two partials collide, the standard interpolation
approach as described later is used instead. Virtanen and
Klapuri [27] propose a technique that iteratively estimates
phases, amplitudes, and frequencies of the partials using a
leastsquare solution. Parametric approaches like this one
tend to fail when the partials are very close, because some of
the matrices used to estimate the parameters tend to become
singular. The same kind of problem can occur in the strategy
proposed by Tolonen [16], which uses a nonlinear least
squares estimation to determine the sinusoidal parameters
of the partials. Every and Szymanski [28] employ three
ﬁlter designs to separate partly overlapping partials. The
method does not work properly when the partials are mostly
concentrated in the same band. Hence, it cannot be used to
estimate the amplitudes of coincident or almost coincident
partials.
There are a few proposals that are able to resolve
coincident partials, but they only work properly under cer
tain conditions. An eﬃcient method to separate coincident
partials based on the similarity of the temporal envelopes was
proposed by Viste and Evangelista [29], but it only works
for multichannel mixtures. Duan et al. [30] use an average
harmonic structure (AHS) model to estimate the amplitudes
of coincident partials. To work properly, this method requires
that, at least for some frames, the partials be suﬃciently
disjoint so their individual features can be extracted. Also,
the technique does not work when the frequencies of the
sources have octave relations. Woodruﬀ et al. [31] propose a
technique based on the assumptions that harmonics of the
same source have correlated amplitude envelopes and that
phase diﬀerences can be predicted from the fundamental
frequencies. The main limitation of the technique is that it
depends on very accurate pitch estimates.
Since most of these elaborated methods usually have lim
ited applicability, simpler and less constrained approaches
are often adopted instead. Some authors simply attribute all
the content to a single source [32], while others use a simple
interpolation approach [33–35]. The interpolation approach
estimates the amplitude of a given partial that is known to
be colliding with another one by linearly interpolating the
amplitudes of other partials belonging to the same source.
Several partials can be used in such an interpolation but,
according to Virtanen [25], normally only the two adjacent
ones are used, because they tend to be more correlated to
the amplitude of the overlapping partial. The advantage of
such a simple approach is that it can be used in almost
every case, with the only exceptions being those in which
the sources have the same fundamental frequency. On the
other hand, it has three main shortcomings: (a) it assumes
EURASIP Journal on Audio, Speech, and Music Processing 3
that both adjacent partials are not signiﬁcantly changed
by the interference of other sources, which is often not
true; (b) the ﬁrst partial (fundamental) cannot be estimated
using this procedure, because there is no previous partial
to be used in the interpolation; (c) the assumption that the
interpolation of the partials is a good estimate only holds for
a few instruments and, for the cases in which a number of
partials are practically nonexistent, such as a clarinet with
odd harmonics, the estimates can be completely wrong.
This paper presents a more reﬁned alternative to the
interpolation approach, using some characteristics of the
harmonic audio signals to provide a better estimate for the
amplitudes of coincident partials. The proposal is based on
the hypothesis that the frequencies of the partials of a given
source will vary in approximately the same fashion over time.
In a short description, the algorithm tracks the frequency
of each mixture partial over time, and then uses the results
to calculate the correlations among the mixture partials.
The results are used to choose a reference partial for each
source, by determining which is the mixture partial that is
more likely to belong exclusively to that source, that is, the
partial with minimum interference from other sources. The
inﬂuence of each source over each mixture partial is then
determined by the correlation of the mixture partials with
respect to the reference partials. Finally, this information is
used to estimate how the amplitude of each mixture partial
should be split among its components.
This proposal has several advantages over the interpola
tion approach.
(a) Instead of relying in the assumption that both
neighbor partials are interferencefree, the algorithm
depends only on the existence of one partial strongly
dominated by each source to work properly, and
relatively reliable estimates are possible even if this
condition is not completely satisﬁed.
(b) The algorithm works even if the sources have the
same fundamental frequency (F0)—tests comparing
the spectral envelopes of a large number of pairs
of instruments playing the same note and having
the same RMS level, revealed that in 99.2% of the
cases there was at least one partial whose energy was
more than ﬁve times greater than the energy of its
counterpart.
(c) The ﬁrst partial (fundamental) can be estimated.
(d) There are no intraframe permutation errors, mean
ing that, assuming the amplitude estimates within a
frame are correct, they will always be assigned to the
correct source.
(e) The estimation accuracy is much greater than that
achieved by the interpolation approach.
In the context of this work, the term source refers
to a sound object with harmonic frequency structure.
Therefore, a vocal or an instrument generating a given note is
considered a source. This also means that the algorithmis not
able to deal with sound sources that do not have harmonic
characteristics, like percussion instruments.
The paper is organized as follows. Section 2 presents
the preprocessing. Section 3 describes all steps of the algo
rithm. Section 4 presents the experiments and corresponding
results. Finally, Section 5 presents the conclusions and ﬁnal
remarks.
2. Preprocessing
Figure 2 shows the basic structure of the algorithm. The
ﬁrst three blocks, which represent the preprocessing, are
explained in this section. The last four blocks represent
the core of the algorithm and are described in Section 3.
The preprocessing steps described in the following are fairly
standard and have shown to be adequate for supporting the
algorithm.
2.1. Adaptive Frame Division. The ﬁrst step of the algorithm
is dividing the signal into frames. This step is necessary
because the amplitude estimation is made in a frameby
frame basis. The best procedure here is to set the boundaries
of each frame at the points where an onset [36, 37] (newnote,
instrument or vocal) occurs, so the longest homogeneous
frames are considered. The algorithm works better if the
onsets themselves are not included in the frame, because
during the period they occur, the frequencies may vary
wildly, interfering with the partial correlation procedure
described in Section 3.3. The algorithm presented in this
paper does not include an onsetdetection procedure in order
to avoid cascaded errors, which would make it more diﬃcult
to analyze the results. However, a study about the eﬀects
of onset misplacements on the accuracy of the algorithm is
presented in Section 4.5.
To cope with partial amplitude variations that may occur
within a frame, the algorithm includes a procedure to divide
the original frame further, if necessary. The ﬁrst condition
for a new division is that the duration of the note be at
least 200 ms, since dividing shorter frames would result in
frames too small to be properly analyzed. If this condition is
satisﬁed, the algorithm divides the original frame into two
frames, the ﬁrst one having a 100ms length, and the second
one comprising the remainder of the frame. The algorithm
then measures the RMS ratio between the frames according
to
R
RMS
=
min(r
1
, r
2
)
max(r
1
, r
2
)
, (1)
where r
1
and r
2
are the RMS of the ﬁrst and second new
frames, respectively. R
RMS
will always assume a value between
zero and one. The RMS values were used here because
they are directly related to the actual amplitudes, which are
unknown at this point.
The R
RMS
value is then stored and a new division is
tested, now with the ﬁrst new frame being 105ms long and
the second being 5 ms shorter than it was originally. This
new R
RMS
value is stored and new divisions are tested by
successively increasing the length of the ﬁrst frame by 5 ms
and reducing the second one by 5 ms. This is done until the
4 EURASIP Journal on Audio, Speech, and Music Processing
Signal Estimates
Division into
frames
F0
estimation
Partial
ﬁltering
Frame
subdivision
Segmental
frequency
estimation
Partial
correlation
Amplitude
estimation
procedure
Figure 2: Algorithm general structure.
resulting second frame is 100ms long or shorter. If the lowest
R
RMS
value obtained is below 0.75 (empirically determined),
this indicates a considerable amplitude variation within
the frame, and the original frame is deﬁnitely divided
accordingly. If, as a result of this newdivision, one or both the
new frames have a length greater than 200 ms, the procedure
is repeated and new divisions may occur. This is done until
all frames are smaller than 200ms, or until all possible R
RMS
values are above 0.75.
Some results using diﬀerent ﬁxed frame lengths are
presented in Section 4.
2.2. F0 Estimation and Partial Location. The position of the
partials of each source is directly linked to their fundamental
frequency (F0). The ﬁrst versions of the algorithm included
the multiple fundamental frequencies estimator proposed by
Klapuri [38]. A common consequence of using supporting
tools in an algorithm is that the errors caused by ﬂaws
inherent to those supporting tools will propagate throughout
the rest of the algorithm. Fundamental frequency errors are
indeed a problem in the more general context of sound
source separation, but since the scope of this paper is limited
to the amplitude estimation, errors coming from third
party tools should not be taken into account in order to
avoid contamination of the results. On the other hand, if all
information provided by the supporting tools is assumed to
be known, all errors will be due to the proposed algorithm,
providing a more meaningful picture of its performance.
Accordingly, it is assumed that a hypothetical sound source
separation algorithm would eventually reach a point in
which the amplitude estimation would be necessary—to
reach this point, such an algorithm would maybe depend on
a reliable F0 estimator, but this is a problem that does not
concern this paper, so the correct fundamental frequencies
are assumed to be known.
Although F0 errors are not considered in the main tests,
it is instructive to discuss some of the impacts that F0
errors would have in the algorithm proposed here. Such a
discussion is presented in the following, and some practical
tests are presented in Section 4.6.
When the fundamental frequency of a source is mises
timated, the direct consequence is that a number of false
partials (partials that do not exist in the actual signal, but
that are detected by the algorithm due to F0 estimation
error) will be considered and/or a number of real partials
will be ignored. F0 errors may have signiﬁcant impact in the
estimation of the amplitudes of correct partials depending
on the characteristics of the error. Higher octave errors, in
which the detected F0 is actually a multiple of the correct one,
have very little impact on the estimation of correct partials.
This is because that, in this case, the algorithm will ignore
a number of partials, but those that are taken into account
are actual partials. Problems may arise when the algorithm
considers false partials, which can happen both in the case of
lower octave errors, in which the detected F0 is a submultiple
of the correct one, and in the case of nonoctave errors—this
last situation is the worst because most considered partials
are actually false, but fortunately this is the less frequent kind
of error. When the positions of those false partials coincide
with the positions of partials belonging to sources whose F0
were correctly identiﬁed, some problems may happen. As will
be seen in Section 3.4, the proposed amplitude estimation
procedure depends on the proper choice of reference partials
for each instrument, which are used as a template to estimate
the remaining ones. If the ﬁrst reference partial to be chosen
belongs to the instrument for which the F0 was misestimated,
that has little impact on the amplitude estimation of the
real partials. On the other hand, if the ﬁrst reference partial
belongs to the instrument with the correct F0, then the
entire amplitude estimation procedure may be disrupted.
The reasons for this behavior are presented in Section 4.6,
together with some results that illustrate how serious is the
impact of such a situation over the algorithm performance.
The discussion above is valid for signiﬁcant F0 estimation
errors—precision errors, in which the estimated frequency
deviates by at most a few Hertz from the actual value, are
easily compensated by the algorithm as it uses a search width
of 0.1 · F0 around the estimated frequency to identify the
correct position of the partial.
As can be seen, considerable impact on the proposed
algorithmwill occur mostly in the case of lower octave errors,
since they are relatively common and result in a number
of false partials—a study about this impact is presented in
Section 4.6.
To work properly, the algorithm needs a good estimate
of where each partial is located—the location or position of
a partial, in the context of this work, refers to the central
frequency of the band occupied by that partial (see deﬁnition
of partial in the introduction). Simply, taking multiples of F0
sometimes work, but the inherent inharmonicity [39, 40] of
some instruments may cause this approach to fail, especially
if one needs to take several partials into consideration. To
make the estimation of each partial frequency more accurate,
an algorithm was created—the algorithm is fed with the
frames of the signal and it outputs the position of the partials.
The steps of the algorithm for each F0 are the following:
(a) The expected (preliminary) position of each partial
(p
n
) is given by p
n−1
+ F0, with p
0
= 0.
(b) The shorttime discrete Fourier transform (STDFT)
is calculated for each frame, from which the magni
tude spectrum M is extracted.
EURASIP Journal on Audio, Speech, and Music Processing 5
(c) The adjusted position of the current partial ( p
n
) is
given by the highest peak in the interval [p
n
−s
w
, p
n
+
s
w
] of M, where s
w
= 0.1 · F0 is the search width.
This search width contains the correct position of the
partial in nearly 100% of the cases; a broader search
region was avoided in order to reduce the chance of
interference from other sources. If the position of the
partial is less than 2s
w
apart from any partial position
calculated previously for other source, and they are
not coincident (less than 5% or 25 Hz apart), the
positions of both partials are recalculated considering
s
w
equal to half the frequency distance among the two
partials.
When two partials are coincident in the mixed signal,
they often share the same peak, in which case steps (a) to
(c) will determine not their individual positions, but their
combined position, which is the position of the mixture
partial. Sometimes coincident partials may have discernible
separate peaks; however, they are so close that the algorithm
can take the highest one as the position of the mixture partial
without problem. After the positions of all partials related
to all fundamental frequencies have been estimated, they are
grouped into one single set containing the positions of all
mixture partials. The procedure described in this section has
led to partial frequency estimates that are within 5% from
the correct value (inferred manually) in more than 90% of
the cases, even when a very large number of partials are
considered.
2.3. Partial Filtering. The mixture partials for which the
amplitudes are to be estimated are isolated by means of a
ﬁlterbank. In real signals, a given partial usually occupies
a certain band of the spectrum, which can be broader or
narrower depending on a number of factors like instrument,
musician, and environment, among others. Therefore, a ﬁlter
with a narrow passband may be appropriate for some kinds
of sources, but may ignore relevant parts of the spectrum for
others. On the other hand, a broad passband will certainly
include the whole relevant portion of the spectrum, but may
also include spurious components resulting from noise and
even neighbor partials. Experiments have indicated that the
most appropriate band to be considered around the peak
of a partial is given by the interval [0.5 · (p
n−1
+ p
n
), 0.5 ·
(p
n
+ p
n+1
)], where p
n
is the frequency of the partial under
analysis, and p
n−1
and p
n+1
are the frequencies of the closest
partials with lower and higher frequencies, respectively.
The ﬁlterbank used to isolate the partials is composed by
thirdorder elliptic ﬁlters, with a passband ripple of 1 dB and
stopband attenuation of 80 dB. This kind of ﬁlter was chosen
because of its steep rolloﬀ. Finite impulse response (FIR)
ﬁlters were also tested, but the results were practically the
same, with a considerably greater computational complexity.
As commented before, this method is intended to be
used in the context of sound source separation, whose
main objective is to resynthesize the sources as accurately as
possible. Estimating the amplitudes of coincident partials is
an important step toward such an objective, and ideally the
amplitudes of all partials should be estimated. In practice,
however, when partials have very low energy, noise plays
an important role, making it nearly impossible to extract
enough information to perform a meaningful estimate. As
a result of those observations, the algorithm only takes into
account partials whose energy—obtained by the integration
of the power spectrum within the respective band—is at
least 1% of the energy of the most energetic partial. Mixture
partials follow the same rules; that is, they will be considered
only if they have at least one percent of the energy the
strongest partial—thus, the energy of an individual partial
in a mixture may be below the 1% limit. It is important
to notice that partials below −20 dB from the strongest one
may, in some cases, be relevant. Such a hard lower limit for
the partial energy is the best current solution for the problem
of noisy partials, but alternative strategies are currently under
investigation. In order to avoid that a partial be considered in
certain frames and not in others, if a given F0 keeps the same
in consecutive frames, the number of partials considered by
the algorithm is also kept the same.
3. The Proposed Algorithm
3.1. Frame Subdivision. The resulting frames after the
ﬁltering are subdivided into 10ms subframes, with no
overlap (overlapping the subframes did not improve the
results). Longer subframes were not used because they may
not provide enough points for the subsequent correlation
calculation (see Section 3.3) to produce meaningful results.
On the other hand, if the subframe is too short and
the frequency is low, only a fraction of a period may
be considered in the frequency estimation described in
Section 3.2, making such estimation either unreliable, or
even impossible.
3.2. Partial Trajectory Estimation. The frequency of each
partial is expected to ﬂuctuate over the analysis frame,
which have a length of at least 100 ms. Also, it is expected
that partials belonging to a given source will have similar
frequency trajectories, which can be explored to match
partials to that particular source. The 10ms subframes
resulting from the division described in Section 3.1 are used
to estimate such a trajectory. The frequency estimation for
each 10ms subframe is performed in the time domain by
taking the ﬁrst and last zerocrossing, measuring the distance
d in seconds and the number of cycles c between those zero
crossings, and then determining the frequency according to
f = c/d. The exact position of the zerocrossing is given by
z
c
= p
1
+
a
1
 ·
_
p
2
− p
1
_
a
1
 + a
2

, (2)
where p
1
and p
2
are, respectively, the positions in seconds
of the samples immediately before and immediately after the
zerocrossing, and a
1
and a
2
are the amplitudes of those same
samples. Once the frequencies for each 10ms subframe are
calculated, they are accumulated into a partial trajectory.
6 EURASIP Journal on Audio, Speech, and Music Processing
−35
50 150 250 350 450 550 650 750 850
Frequency (Hz)
950
−30
−25
−20
P
e
r
f
o
r
m
a
n
c
e
(
i
n
%
o
f
t
h
e
m
e
a
n
a
c
c
u
r
a
c
y
)
−15
−10
−5
0
5
Figure 3: Eﬀect of the frequency on the accuracy of the amplitude
estimates.
It is worth noting that there are more accurate tech
niques to estimate a partial trajectory, like the normalized
crosscorrelation [41]. However, replacing the zerocrossing
approach by the normalized crosscorrelation resulted in
almost the same overall amplitude estimation accuracy
(mean error values diﬀer by less than 1%), probably due
to artiﬁcial ﬂuctuations in the frequency trajectory that are
introduced by the zerocrossing approach. Therefore, any of
the approaches can be used without signiﬁcant impact on
the accuracy. The use of the zerocrossings, in this context,
is justiﬁed by the low computational complexity associated.
The use of subframes as small as 10ms has some
important implications in the estimation of low frequencies.
Since at least two zerocrossings are necessary for the
estimates, the algorithm cannot deal with frequencies below
50 Hz. Also, below 150 Hz the partial trajectory shows some
ﬂuctuations that may not be present in higher frequency
partials, thus reducing the correlation between partials and,
as a consequence, the accuracy of the algorithm. Figure 3
shows the eﬀect of the frequency on the accuracy of the
amplitude estimates. In the plot, the vertical scale indicates
how better or worse is the performance for that frequency
with respect to the overall accuracy of the accuracy, in
percentage. As can be seen, for 100 Hz the accuracy of the
algorithm is 16% below average, and the accuracy drops
rapidly as lower frequencies are considered. However, as will
be seen in Section 4, the accuracy for such low frequencies is
still better than that achieved by the interpolation approach.
3.3. Partial Trajectory Correlation. The frequencies estimated
for each subframe are arranged into a vector, which
generates trajectories like those shown in Figure 4. One
trajectory is generated for each partial. The next step is
to calculate the correlation between each possible pair of
trajectories, resulting in N(N−1)/2 correlation values, where
N is the number of partials.
3.4. Amplitude Estimation Procedure. The main hypothe
sis motivating the procedure described here is that the
partial frequencies of a given instrument or vocal vary
approximately in the same way with time. Therefore, it is
hypothesized that the correlation between the trajectories of
two mixture partials will be high when they both belong
exclusively to a single source, with no interference fromother
partials. Conversely, the lowest correlations are expected to
occur when the mixture partials are completely related to
diﬀerent sources. Finally, when one partial results from a
given source A (called reference), and the other one results
from the merge of partials coming both from source A and
from other sources S, intermediary correlation values are
expected. More than that, it is assumed that the correlation
values will be proportional to the ratio a
A
/a
S
in the second
mixture partial, where a
A
is the amplitude of source A partial
and a
S
is the amplitude of the mixture partial with the source
A partial removed. If a
A
is much larger than a
S
, it is said that
the partial from source A dominates that band.
Lemma 1. Let A
1
= X
1
+N
1
and A
2
= X
2
+N
2
be independent
random variables, and let A
3
= aA
1
+ bA
2
be a random
variable representing their weighted sum. Also, let X
1
, X
2
also
be independent randomvariables, and N
1
and N
2
be zeromean
independent random variables. Finally, let
ρ
X,Y
=
cov(X, Y)
σ
X
σ
Y
=
E
__
X −μ
X
__
Y −μ
Y
__
σ
X
σ
Y
(3)
be the correlation coeﬃcient between two random variables X
and Y with expected values μ
X
and μ
Y
and standard deviations
σ
X
and σ
Y
. Then,
ρ
A1,A3
ρ
A2,A3
=
a
b
_
σ
2
X1
+ σ
2
N1
σ
2
X2
+ σ
2
N2
_
⎛
⎝
_
σ
2
X2
+ σ
2
N2
_
σ
2
X1
+ σ
2
N1
⎞
⎠
. (4)
Assuming that σ
2
N1
σ
2
X1
, σ
2
N2
σ
2
X2
, and σ
X1
≡ σ
X2
, (4)
reduces to
ρ
A1,A3
ρ
A2,A3
=
a
b
.
(5)
For proof, see the appendix.
The lemma stated above can be directly applied to
the problem presented in this paper, as explained in the
following. First, a model is deﬁned in which the nth partial
P
n
of an instrument is given by P
n
(t) = n · F0(t), where
F0(t) is the timevarying fundamental frequency and t is
the time index. In this idealized case, all partial frequency
trajectories would vary in perfect synchronism. In practice, it
is observed that the partial frequency trajectories indeed tend
to vary together, but factors like instrument characteristics,
room acoustics, and reverberation, among others, introduce
disturbances that prevent a perfect match between the
trajectories. Those disturbances can be modeled as noise,
so now P
n
(t) = n · F0(t) + N(t), where N is the noise.
If we consider both the fundamental frequency variations
F0(t) and the noisy disturbances N(t) as random variables,
the lemma applies—in this context, A
1
is the frequency
trajectory of a partial of instrument 1, given by the sum of
the ideal partial frequency trajectory X
1
and the disturbance
N
1
; A
2
is the frequency trajectory of a partial of instrument
2, which collides with the partial of instrument 1; A
3
is
the partial frequency trajectory resulting from the sum of
EURASIP Journal on Audio, Speech, and Music Processing 7
946
0 50 100 150 200 250 300 350
Second partial trajectory  instrument A
Time (ms)
400
947
948
F
r
e
q
u
e
n
c
y
(
H
z
)
949
950
951
952
953
(a)
1419
0 50 100 150 200 250 300 350
Third partial trajectory  instrument A
Time (ms)
400
1420
1421
1422 F
r
e
q
u
e
n
c
y
(
H
z
)
1423
1424
1425
1426
1427
1428
1429
(b)
955.5
0 50 100 150 200 250 300 350
Second partial trajectory  instrument B
Time (ms)
400
956
956.5
F
r
e
q
u
e
n
c
y
(
H
z
)
957
957.5
958
958.5
959
959.5
(c)
944
0 50 100 150 200 250 300 350
Second partial trajectory  mixture
Time (ms)
400
945
946
947 F
r
e
q
u
e
n
c
y
(
H
z
)
948
949
950
951
952
953
954
(d)
Figure 4: Trajectories (a) and (b) come from partials belonging to the same source, thus having very similar behaviors. Trajectory (c)
corresponds to a partial from another source. Trajectory (d) corresponds to a mixture partial; its characteristics result from the combination
of each partial trends, as well as from phase interactions between the partials. The correlation procedure aims to quantify how close the
mixture trajectory is from the behavior expected for each source.
the colliding partials. According to the lemma, the shape
of A
3
is the sum of the trajectories A
1
and A
2
weighted by
the corresponding amplitudes (a and b). In practice, this
assumption holds well when one of the partials has a much
larger amplitude than the other one. When the partials have
similar amplitudes, the resulting frequency trajectory may
diﬀer from the weighted sum. This is not a serious problem
because such a diﬀerence is normally mild, and the algorithm
was designed to explore exactly the cases in which one partial
dominates the other ones.
It is important to emphasize that some possible ﬂaws in
the model above were not overlooked: there are not many
samples to infer the model, the random variables are not IID
(independent and identically distributed), and the mixing
model is not perfect. However, the lemma and assumptions
stated before have as main objective to support the use of
crosscorrelation to recover the mixing weights, for which
purpose they hold suﬃciently well—this is conﬁrmed by a
number of empirical experiments illustrated in Figures 4 and
5, which show how the correlation varies with respect to the
amplitude ratio between the reference source Aand the other
sources. Figure 5 was generated using the database described
in the beginning of Section 4, in the following way:
(a) A partial from source A is taken as reference (h
r
).
(b) A second partial of source A is selected (h
a
), together
with a partial of same frequency from source B (h
b
).
(c) Mixture partials (h
m
) are generated according to w ·
h
a
+ (1 − w) · h
b
, where w varies between zero and
one and represents the dominance of source A, as
represented in the horizontal axis of Figure 5. When
w is zero, source A is completely absent, and when
w is one, the partial from source A is completely
dominant.
(d) The correlation values between the frequency tra
jectories of h
r
and h
m
are calculated and scaled in
such a way the normalized correlations are 0 and
1 when w = 0 and w = 1, respectively. The
scaling is performed according to (6), where C
i j
is the
correlation to be normalized, C
min
is the correlation
between the partial from source A and the mixture
when w = 0, and C
max
is the correlation between the
partial from source A and the mixture when w = 0—
in this case C
max
is always equal to one.
8 EURASIP Journal on Audio, Speech, and Music Processing
−1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
Dominance of reference partial
0.9
−0.5
0
1
N
o
r
m
a
l
i
s
e
d
c
o
r
r
e
l
a
t
i
o
n
0.5
1
1.5
2
Figure 5: Relation between correlation of the frequency trajectories
and partial ratio.
If the hypothesis hold perfectly, the normalized corre
lation would have always the same value of w (solid line
in Figure 5). As can be seen in Figure 5, the hypothesis
holds relatively well in most cases; however, there are some
instruments (particularly woodwinds) for which this tends
to fail. Further investigation will be necessary in order to
determine why this happens only for certain instruments.
The amplitude estimation procedure described next was
designed to mitigate the problems associated to the cases in
which the hypotheses tend to fail. As a result, the strategy
works fairly well if the hypotheses hold (partially or totally)
for at least one of the sources.
The amplitude estimation procedure can be divided into
two main parts: determination of reference partials and the
actual amplitude estimation, as described next.
3.4.1. Determination of Reference Partials. This part of the
algorithm aims to ﬁnd the partials that best represent each
source in the mixture. The objective is to ﬁnd the partials
that are less aﬀected by sources other than the one it should
represent. The use of reference partials for each source
guarantees that the estimated amplitudes within a frame will
be correctly grouped. As a result, no intraframe permutation
errors can occur. It is important to highlight that this paper
is devoted to be problem of estimating the amplitudes for
individual frames. A subsequent problem would be taking
all framewise amplitude estimates within the whole signal
and assign them to the correct sources. A solution for this
problem based on musical theory and continuity rules is
expected to be investigated in the future.
In order to illustrate how the reference partials are
determined, consider a hypothetical signal generated by
two simultaneous instruments playing the same note. Also,
consider that all mixture partials after the ﬁfth have negligible
amplitudes. Table 1 shows the frequency correlation values
between the partials of this hypothetical signal, as well as
the amplitude of each mixture partial. The values between
parentheses are the warped correlation values, calculated
according to
C
i j
=
C
i j
−C
min
C
max
−C
min
, (6)
where C
i j
is the correlation value (between partials i and
j) to be warped, and C
min
and C
max
are the minimum and
maximum correlation values for that frame. As a result, all
correlation values now lie between 0 and 1, and the relative
diﬀerences among the correlation values are reinforced.
The values in Table 1 are used as example to illustrate
each step of the procedure to determine the amplitude of
each source and partial. Although the example considers
mixtures of only two instruments, the rules are valid for any
number of simultaneous instruments.
(a) If a given source has some partials that do not
coincide with any other partial, which is determined
using the results of the partial positioning procedure
described in Section 2.2, the most energetic among
such partials is taken as reference for that source. If
all sources have at least one of such “clean” partials
to be taken as reference, the algorithm skips directly
to the amplitude estimation. If at least one source
satisﬁes the “clean partial” condition, the algorithm
skips to item (d), and the most energetic reference
partial is taken as the global reference partial G. Items
(b) and (c) only take place if no source satisﬁes such a
condition, which is the case of the hypothetical signal.
(b) The two mixture partials that result in the greatest
correlation are selected (ﬁrst and third in Table 1).
Those are the mixture partials for which the fre
quency variations are more alike, which indicates that
they both belong mostly to a same source. In this case,
possible coincident partials have small amplitudes
compared to the dominant partials.
(c) The most energetic among those two partials is
chosen both as the global reference Gand as reference
for the corresponding source, as the partial with
greatest amplitude probably has the most deﬁned
features to be compared to the remaining ones. In the
example given by Table 1, the ﬁrst partial is taken as
reference R
1
for instrument 1 (R
1
= 1).
(d) In this step, the algorithm chooses the reference
partials for the remaining sources. Let I
G
be the
source of partial G, and let I
C
be the current source
for which the reference partial is to be determined.
The reference partial for I
C
is chosen by taking the
mixture partial that result in the lowest correlation
with respect to G, provided that the components of
such mixture partial belong only to I
C
and I
G
(if no
partial satisﬁes this condition, item (e) takes place).
As a result, the algorithm selects the mixture partial
in which I
C
is more dominant with respect to I
G
. In
the example shown in Table 1, the fourth partial has
the lowest correlation with respect to G(−0.3), being
taken as reference R
2
for instrument 2 (R
2
= 4).
EURASIP Journal on Audio, Speech, and Music Processing 9
Table 1: Illustration of the amplitude estimation procedure. If the last row is removed, the table is a matrix showing the correlations between
the mixture partials, and the values between parentheses are the warped correlation values according to (6). Thus, the regular and warped
correlations between partials 1 and 2 are, respectively, 0.2 and 0.62. As can be seen, the lowest correlation value overall will have a warped
correlation of 0, and the highest correlation value is warped to 1; all other correlations will have intermediate warped value. The last row in
the table reveals the amplitude of each one of the mixture partials.
Partial 1 2 3 4 5
1 — 0.2 (0.62) 0.5 (1.0) −0.3 (0.0) 0 (0.37)
2 — — 0.1 (0.5) −0.1 (0.25) −0.2 (0.12)
3 — — — −0.2 (0.12) −0.2 (0.12)
4 — — — — 0.1 (0.5)
5 — — — — —
Amp. 0.7 0.9 0.4 0.5 0.3
(e) This item takes place if all mixture partials are
composed by at least three instruments. In this
case, the mixture partial that result in the lowest
correlation with respect to G is chosen to represent
the partial least aﬀected by I
G
. The objective now is
to remove from the process all partials signiﬁcantly
inﬂuenced by I
G
. This is carried out by removing
all partials whose warped correlation values with
respect to R
1
are greater than half the largest warped
correlation value of R
1
. In the example given by
Table 1, the largest warped correlation would be 1,
and partials 2 and 3 would be removed accordingly.
Then, items (a) to (d) are repeated for the remaining
partials. If more than two instruments still remain
in the process, item (e) takes place once more, and
the process continues until all reference partials have
been determined.
3.4.2. Amplitude Estimation. The reference partials for each
source are now used to estimate the relative amplitude to be
assigned to each partial of each source, according to
A
s
(i) =
C
i,Rs
N
n=1
C
i,Rn
, (7)
where A
s
indicate the relative amplitude to be assigned to
source s in the mixture partial, n is the index of the source
(considering only the sources that are part of that mixture),
and C
i j
is the warped correlation value between partials i
and j. The warped correlation were used because, as pointed
out before, they enhance the relative diﬀerences among the
correlations. As can be seen in (7), the relative amplitudes
to be assigned to the partials in the mixture are directly
proportional to the warped correlations of the partial with
respect to the reference partials. This reﬂects the hypothesis
that higher correlation values indicate a stronger relative
presence of a given instrument in the mixture. Table 2 shows
the relative partial amplitudes for the example given by
Table 1.
As can be seen, both (6) and (7) are heuristic. They were
determined empirically by a thorough observation of the
data and exhaustive tests. Other strategies, both heuristic and
statistical, were tested, but this simple approach resulted in a
performance comparable to those achieved by more complex
strategies.
In the following, the relative partial amplitudes are used
to extract the amplitudes of each individual partial from the
mixture partial (values between parentheses). In the exam
ple, the amplitude of the mixture partial is assumed to be
equal to the sum of the amplitudes of the coincident partials.
This would only hold if the phases of coincident partials were
aligned, which in practice does not occur. Ideally, amplitude
and phase should be estimated together to produce accurate
estimates. However, the characteristics of the algorithmmade
it necessary the adoption of simpliﬁcations and assumptions
that, if uncompensated, might result in inaccurate estimates.
To compensate (at least partially) the phase being neglected
in previous steps of the algorithm, some further processing is
necessary: a rough estimate of which amplitude the mixture
would have if the phases were actually perfectly aligned is
obtained by summing the amplitudes estimated using part of
the algorithm proposed by Yeh and Roebel [42] in Sections
2.1 and 2.2 of their paper. This rough estimate is, in general,
larger than the actual amplitude of the mixture partial. This
diﬀerence between both amplitudes is a rough measure of the
phase displacement between the partials. To compensate for
such a phase displacement, a weighting factor given by w =
A
r
/A
m
, where A
r
is the rough amplitude estimate and A
m
is
the actual amplitude of the mixture partial and is multiplied
to the initial zerophase partial amplitude estimates. This
procedure improves the accuracy of the estimates by about
10%.
As a ﬁnal remark, it is important to emphasize that
the amplitudes within a frame are not constant. In fact,
the proposed method explores the frequency modulation
(FM) of the signals, and FM is often associated with
some kind of amplitude modulation (AM). However, the
intraframe amplitude variations are usually small (except
in some cases of strong vibrato), making it reasonable
to estimate an average amplitude instead of detecting the
exact amplitude envelope, which would be a task close to
impossible.
10 EURASIP Journal on Audio, Speech, and Music Processing
Table 2: Relative and corresponding eﬀective partial amplitudes (between parentheses). The relative amplitudes reveal which percentage
of the mixture partial should be assigned to each source, hence the sum in each column is always 1 (100%). The eﬀective amplitudes are
obtained by multiplying the relative amplitudes by the mixture partial amplitudes shown in the last row of Table 1, hence the sum of each
column in this case is equal to the amplitudes shown in the last row of Table 1.
Partial 1 2 3 4 5
Inst. 1 1 (0.7) 0.71 (0.64) 0.89 (0.36) 0 (0) 0.43 (0.13)
Inst. 2 0 (0) 0.29 (0.26) 0.11 (0.04) 1 (0.5) 0.57 (0.17)
4. Experimental Results
The mixtures used in the tests were generated by summing
individual notes taken from the instrument samples present
in the RWC database [43]. Eighteen instruments of several
types (winds, bowed strings, plucked strings, and struck
strings) were considered—mixtures including both vocals
and instruments were tested separately, as described in
Section 4.7. In total, 40156 mixtures of two instruments,
three, four and ﬁve instruments were used in the tests.
The mixtures of two sources are composed by instruments
playing in unison (same note), and the other mixtures
include diﬀerent octave relations (including unison). A
mixture can be composed by the same kind of instrument.
Those settings were chosen in order to test the algorithm
with the hardest possible conditions. All signals are sampled
at 44.1 kHz, and have a minimum duration of 800 ms. Next
subsections present the main results according to diﬀerent
performance aspects.
4.1. Overall Performance and Comparison with Interpolation
Approach. Table 3 shows the mean RMS amplitude error
resulting from the amplitude estimation of the ﬁrst 12
partials in mixtures with two to ﬁve instruments (I2 to I5 in
the ﬁrst column). The error is given in dB and is calculated
according to
error =
E
abs
A
max
, (8)
where E
abs
is the absolute error between the estimate and
the correct amplitude, and A
max
is the amplitude of the
most energetic partial. The error values for the interpolation
approach were obtained by taking an individual instrument
playing a single note, and then measuring the error between
the estimate resulting from the interpolation of the neighbor
partials and the actual value of the partial. This represents
the ideal condition for the interpolation approach, since the
partials are not disturbed at all by other sources. The inherent
dependency of the interpolation approach on clean partials
makes its use very limited in real situations, especially if
several instruments are present. This must be taken into
consideration when comparing the results in Table 3.
In Table 3, the partial amplitudes of each signal were
normalized so the most energetic partial has a RMS value
equal to 1. No noise besides that naturally occurring in the
recordings was added, and the RMS values of the sources
have a 1 : 1 ratio.
The results for higher partials are not shown in Table 3
in order to improve the legibility of the results. Additionally,
their amplitudes are usually small, and so is their absolute
error, thus including their results would not add much
information. Finally, due to the rules deﬁned in Section 2.2,
normally only a fewpartials above the twelfth are considered.
As a consequence, higher partials will have much less results
to be averaged, thus their results are less signiﬁcant. Only
one line was dedicated to the interpolation approach because
the ideal conditions adopted in the tests make the number of
instruments in the mixture irrelevant.
The total errors presented in Table 3 were calculated
taking only the 12 ﬁrst partials into consideration. The
remaining partials were not considered because their only
eﬀect would be reducing the total error value.
Before comparing the techniques, there are some impor
tant remarks to be made about the results shown in Table 3.
As can be seen, for both techniques the mean errors are
smaller for higher partials. This is not because they are
more eﬀective in those cases, but because the amplitudes
of higher partials tend to be smaller, and so does the error,
since it is calculated having the most energetic partial as
reference. As a response, new error rates—called modiﬁed
mean error—were calculated for twoinstrument mixtures
using as reference the average amplitude of the partials, as
shown in Table 4—the error values for the other mixtures
were omitted because they have approximately the same
behavior. The modiﬁed errors are calculated as in (8), but
in this case A
max
is replaced by the average amplitude of the
12 partials.
As stated before, the results for the interpolation
approach were obtained under ideal conditions. Also, it
is important to note that the ﬁrst partial is often the
most energetic one, resulting in greater absolute errors.
Since the interpolation procedure cannot estimate the ﬁrst
partial, it is not part of the total error. In real situations
with diﬀerent kinds of mixtures present, the results for the
interpolation approach could be signiﬁcantly worse. As can
be seen in Table 3, although facing harder conditions, the
proposed strategy outperforms the interpolation approach
even when dealing with several simultaneous instruments.
This indicates that the relative improvement achieved by the
proposed algorithmwith respect to the interpolation method
is signiﬁcant.
As expected, the best results were achieved for mixtures
of two instruments. The accuracy degrades when more
instruments are considered, but meaningful estimates can be
obtained for up to ﬁve simultaneous instruments. Although
the algorithm can, in theory, deal with mixtures of six
or more instruments, in such cases the spectrum tends to
become too crowded for the algorithm to work properly.
EURASIP Journal on Audio, Speech, and Music Processing 11
Table 3: Mean error comparison between the proposed algorithm and the interpolation approach (in dB).
Partial 1 2 3 4 5 6 7 8 9 10 11 12 Total
I2 −7.1 −8.5 −9.7 −11.2 −12.3 −13.7 −14.8 −15.9 −17.0 −17.5 −18.3 −19.2 −12.0
I3 −5.4 −6.7 −7.9 −9.4 −10.3 −11.9 −12.8 −13.9 −14.8 −15.6 −16.2 −17.0 −10.2
I4 −4.8 −6.1 −7.4 −8.8 −9.9 −11.2 −12.4 −13.4 −14.1 −15.0 −15.6 −16.0 −9.6
I5 −4.5 −5.8 −7.0 −8.4 −9.5 −10.8 −12.0 −12.9 −13.6 −14.5 −14.8 −15.2 −9.2
Interp.
a
−4.6 −6.2 −7.7 −8.9 −9.9 −10.3 −10.6 −11.0 −10.8 −10.5 −10.5 −8.6
a
First partial cannot be estimated.
Table 4: Modiﬁed mean error values in dB.
Partial 1 2 3 4 5 6 7 8 9 10 11 12 Total
Prop. −5.4 −5.3 −5.8 −5.9 −6.0 −6.1 −6.4 −6.3 −6.6 −6.4 −6.6 −6.8 −6.1
Analyzing speciﬁcally Table 4, it can be observed that the
performance of the proposed method is slightly better for
higher partials.
This is because the mixtures in Table 4 were generated
using instruments playing the same notes, and higher partials
in that kind of mixture are more likely to be strongly
dominated by one of the instruments—most instruments
have strong low partials, so they will all have signiﬁcant
contributions in the lower partials of the mixture. Mixture
partials that are strongly dominated by a single instrument
normally result in better amplitude estimates, because they
correlate well with the reference partials, explaining the
results shown in Table 4.
From this point to the end of Section 4, all results were
obtained using twoinstrument mixtures—other mixtures
were not included to avoid redundancy.
4.2. Performance Under Noisy Conditions. Table 5 shows the
performance of the proposal when the signals are corrupted
by additive white noise. The results were obtained by
artiﬁcially summing the white noise to the mixtures of two
signals used in Section 4.1.
As can be seen, the performance is only weakly aﬀected
by noise. The error rates only begin to rise signiﬁcantly
close to 0 dB but, even under such an extremely noisy
condition, the error rate is only 25% greater than that
achieved without any noise. Such a remarkable robustness to
noise probably happens because, although noise introduces
a random factor in the frequency tracking described in
Section 3.2, the frequency variation tendencies are still able
to stand out.
4.3. Inﬂuence of RMS Ratio. Table 6 shows the performance
of the proposal for diﬀerent RMS ratios between the sources.
The signals were generated as described in Section 4.1, but
scaling one of the sources to result in the RMS ratios shown
in Table 6. As can be seen, the RMS ratio between the sources
has little impact on the performance of the strategy.
4.4. Length of the Frames. As stated in Section 2.1, the best
way to get the most reliable results is to divide the signal in
frames with variable lengths according to the occurrence of
onsets. Therefore, it is useful to determine how dependent
the performance is to the frame length. Table 7 shows the
results for diﬀerent ﬁxed frame lengths. The signals were
generated by simply taking the twopartial mixtures used in
Section 4.1 and truncating the frames to the lengths shown
in Table 7.
As expected, the performance degrades as shorter frames
are considered because there is less information available,
making the estimates less reliable. The interpolation results
are aﬀected in almost the same way, which indicates that
this is indeed a matter of lack of information, and not
a problem related to the characteristics of the algorithm.
Future algorithm improvements may include a way of
exploring the information contained in other frames to
counteract the damaging eﬀects of using short frames.
4.5. Onset Errors. This section analyses the eﬀects of onset
misplacements. The following kinds of onset location errors
may occur.
(a) Small errors: errors smaller than 10% of the frame
length have little impact in the accuracy of the
amplitude estimates. If the onset is placed after the
actual position, a small section of the actual frame
will be discarded, in which case there is virtually no
loss. If the onset is placed before the actual position,
a small section of other note may be considered,
slightly aﬀecting the correlation values. This kind of
mistake increases the amplitude estimation error in
about 2%.
(b) Large errors, estimated onset placed after the actual
position: the main consequence of this kind of
mistake is that fewer points are available in the
calculation of the correlations, which has a relatively
mild impact in the accuracy. For instruments whose
notes decay with time, like piano and guitar, a more
damaging consequence is that the most relevant part
of the signal may not be considered in the frame.
The main problem here is that after the note decays
by a certain amount, the frequency ﬂuctuations in
diﬀerent partials may begin to decorrelate. Therefore,
12 EURASIP Journal on Audio, Speech, and Music Processing
Table 5: Mean error values in dB for diﬀerent noise levels.
SNR (dB) 60+ 50 40 30 20 10 0
Error (dB) −12.0 −12.0 −12.0 −12.0 −11.9 −11.8 −11.0
Table 6: Mean error in dB for diﬀerent RMS ratios.
Ratio 1 : 1 1 : 0.9 1 : 0.7 1 : 0.5 1 : 0.3
Error (dB) −12.0 −12.0 −12.0 −12.2 −11.8
Table 7: Mean error in dB for diﬀerent frame lengths.
Length (s) 1 0.5 0.2 0.1
Proposal −12.0 −11.7 −11.3 −10.8
Interpol. −8.6 −8.4 −8.0 −7.6
if the strongest part of the note is not considered,
the results tend to be worse. Figure 6 shows the
dependency of the RMSE values on the extent of the
onset misplacements. The results shown in the ﬁgure
were obtained exactly in the same way as those in
Section 4.1, but deliberately misplacing the onsets to
reveal the eﬀects of this kind of error.
(c) Large errors, estimated onset placed before the actual
position: in this case, a part of the signal that does
not contain the new note is considered. The eﬀect of
this kind of error is that many points that should not
be considered in the correlation calculation are taken
into account. As can be seen in Figure 6, the larger is
the error, the worse is the amplitude estimate.
There are other kinds of onset errors besides
positioning—missing and spurious onsets. The analysis
of those kinds of errors is analog to that presented to the
onset location errors. The eﬀect of spurious onset is that
the note will be divided into additional segments, so there
will be fewer points available for the calculation, and the
observations presented in item (b) hold. In the case of
missing onset, two segments containing diﬀerent notes will
be considered, in a situation that is similar to that discussed
in item (c).
4.6. Impact of Lower Octave Errors. As stated in Section 2.2,
lower octave errors may aﬀect the accuracy of the proposed
algorithm when this is cascaded with a F0 estimator. Table 8
shows the algorithmaccuracy for the actual partials when the
estimated F0 for one of them is one, two or three octaves
below the actual value. As commented before, the lower
octave errors usually introduce a number of low correlations
that are mistakenly taken into account in the amplitude
estimation procedure described in Section 3.4. Since the
choice of the ﬁrst reference partial is based on the highest
correlation, it is not usually aﬀected by the lower octave
error. If this ﬁrst reference partial belongs to the instrument
for which the fundamental frequency was misestimated, the
choice of the other reference partial will also not be aﬀected,
because in this case all potential partials actually belong to
the instrument. Also, all correlations considered in this case
0 10 20
Error after actual onset positionsustained instruments
Error after actual onset positiondecaying instruments
Error before actual onset position
30 40 50 60
Position error in percentage of the actual ideal frame
70
−11
−12
−10
80
M
e
a
n
e
r
r
o
r
(
d
B
)
−9
−8
−7
−6
Figure 6: Impact of onset misplacements in the accuracy of the
proposed algorithm.
are valid, as they are related to the ﬁrst reference partial. As a
result, the accuracy of amplitude estimates is not aﬀected.
Problems occur when the ﬁrst reference partial belongs
to the instrument whose F0 was correctly estimated. In this
case, several false potential partials will be considered in the
process to determine the second reference partial, which is
chosen based on the lowest correlation. Since those false
partials are expected to have very low correlations with
respect to all other partials, the chance of one of them
being taken as reference is high. In this case, all the process
is disrupted and the amplitude estimates are likely to be
wrong. This explains the deterioration of the results shown in
Table 8. Those observations can be extended to mixtures with
any number of instruments, and the higher is the number of
F0 misestimates, the worse will be the results.
4.7. Separating Vocals. The mixtures used in this section were
generated by taking vocal samples from the RWC instrument
database [43] and adding an accompanying instrument or
another vocal. Therefore, all mixtures are composed by two
sources, with at least one being a vocal sample. Several notes
of all 18 instruments were considered, and they were all
scaled to have the same RMS value as the vocal sample.
Table 9 shows the results when the algorithm was applied to
separate vocals from the accompanying instruments (second
row) and from other vocals (third row). The vocals were
produced by male and female adults with diﬀerent pitch
ranges (soprano, alto, tenor, baritone and bass), and consist
of vowels being verbalized in a sustained way.
The results shown in Table 9 refer only to the vocal
sources, and the conditions are the same as those used to
generate Table 3.
EURASIP Journal on Audio, Speech, and Music Processing 13
Table 8: Mean error in dB for some lower octave errors.
Partial 1 2 3 4 5 6 7 8 9 10 11 12 Total
no error −7.1 −8.5 −9.7 −11.2 −12.2 −13.6 −14.8 −15.9 −17.1 −17.3 −18.2 −18.2 −12.0
1 octave −5.9 −7.3 −8.5 −10.1 −11.1 −12.5 −13.6 −14.7 −15.9 −15.9 −17.1 −17.3 −10.8
2 octaves −5.3 −6.7 −7.9 −9.5 −10.4 −11.9 −13.1 −14.1 −15.5 −15.4 −16.4 −16.3 −10.4
3 octaves −4.8 −6.2 −7.3 −9.1 −9.9 −11.4 −12.5 −13.6 −14.9 −14.9 −15.7 −15.9 −9.8
Table 9: Mean error in dB for vocal signals.
Partial 1 2 3 4 5 6 7 8 9 10 11 12 Total
V × I −7.2 −8.2 −9.9 −11.2 −12.5 −13.5 −13.9 −14.3 −14.3 −14.2 −14.2 −14.3 −11.5
V × V −6.9 −7.9 −9.6 −10.8 −11.9 −13.0 −13.2 −13.8 −13.6 −13.5 −13.5 −13.6 −11.1
As can be seen, the results for vocal sources are only
slightly worse than those achieved for musical instruments.
This indicates that the algorithm is also suitable for dealing
with vocal signals. Future work will try to extend the
technique to the speech separation problem.
4.8. Final Remarks. The problemof estimating the amplitude
of coincident partials is a very diﬃcult one. More than
that, this is a technology in its infancy. In that context,
many of the solutions adopted did not perform perfectly,
and there are some pathological cases in which the method
tends to fail completely. However, the algorithm performs
reasonably well in most cases, which shows its potentiality.
Since this is a technology far from mature, each part
of the algorithm will probably be under scrutiny in the
near future. The main motivation for this paper was to
propose a completely diﬀerent way of tackling the problemof
amplitude estimation, highlighting its strong characteristics
and pointing out the aspects that still need improvement.
In short, this paper was intended to be a starting point
in the development of a new family of algorithms capable
of overcoming some the main diﬃculties currently faced
by both amplitude estimation and sound source separation
algorithms.
5. Conclusions
This paper presented a new strategy to estimate the
amplitudes of coincident partials. The proposal has several
advantages over its predecessors, such as better accuracy, the
ability to estimate the ﬁrst partial, reliable estimates even
if the instruments are playing the same note, and so forth.
Additionally, the strategy is robust to noise and is able to deal
with any number of simultaneous instruments.
Although it presents a better performance than its
predecessor, there is still room for improvement. Future
versions may include new procedures to reﬁne the estimates,
like using the information from previous frames to verify
the consistency of the current estimates. The extension of
the technique to the speech separation problem is currently
under investigation.
Appendix
Proof of Lemma 1
Let A
1
= X
1
+ N
1
and A
2
= X
2
+ N
2
, where X
1
and X
2
are
random variables representing the actual partial frequency
trajectory, and N
1
and N
2
are random variables representing
the zeromean, independent noise associated. Then, A
3
=
aA
1
+ bA
2
= aX
1
+ bX
2
+ aN
1
+ bN
2
and
ρ
A1,A3
ρ
A2,A3
=
E(A
1
A
3
) −E(A
1
)E(A
3
)
σ
A1
σ
A3
·
σ
A2
σ
A3
E(A
2
A
3
) −E(A
2
)E(A
3
)
,
(A.1)
In the following, each term of (A.1) is expanded:
E(A
1
· A
3
) = E[(X
1
+ N
1
) · (aX
1
+ bX
2
+ aN
1
+ bN
2
)]
= aE
_
X
2
1
_
+ bE(X
1
)E(X
2
) + aE
_
N
2
1
_
.
(A.2)
The terms aE
2
(X
1
) and aE
2
(N
1
) are then summed and
subtracted to the expression, keeping it unaltered. This is
done to explore the relation σ
2
X
= E(X
2
) − E
2
(X) and,
assuming that E(N
1
) = 0 and E(N
2
) = 0, the expression
becomes
E(A
1
· A
3
) = aE
_
X
2
1
_
+ bE(X
1
)E(X
2
) + aE
_
N
2
1
_
+ aE
2
(X
1
) −aE
2
(X
1
)
. ¸, .
=0
+ aE
2
(N
1
) −aE
2
(N
1
)
. ¸, .
=0
= aE
_
X
2
1
_
−aE
2
(X
1
) + bE(X
1
)E(X
2
) + aE
_
N
2
1
_
−aE
2
(N
1
) + aE
2
(N
1
)
. ¸, .
=0
+ aE
2
(X
1
)
= aσ
2
X1
+ bE(X
1
)E(X
2
) + aσ
2
N1
+ aE
2
(X
1
).
(A.3)
Following the same steps,
E(A
2
· A
3
) = bσ
2
X2
+ aE(X
1
)E(X
2
) + bσ
2
N2
+ bE
2
(X
2
). (A.4)
14 EURASIP Journal on Audio, Speech, and Music Processing
The other terms are given by
E(A
1
) · E(A
3
) = E(X
1
)(aE(X
1
) + bE(X
2
))
= aE
2
(X
1
) + bE(X
1
)E(X
2
),
E(A
2
) · E(A
3
) = E(X
2
)(aE(X
1
) + bE(X
2
))
= bE
2
(X
2
) + aE(X
1
)E(X
2
).
(A.5)
Substituting all these terms in (A.1) and eliminating self
cancelling terms, the expression becomes
ρ
A1,A3
ρ
A2,A3
=
a ·
_
σ
2
X1
+ σ
2
N1
_
b ·
_
σ
2
X2
+ σ
2
N2
_
·
σ
A2
σ
A1
. (A.6)
Considering that σ
2
A1
= σ
2
X1
+ σ
2
N1
and σ
2
A2
= σ
2
X2
+ σ
2
N2
,
then
ρ
A1,A3
ρ
A2,A3
=
a ·
_
σ
2
X1
+ σ
2
N1
_
b ·
_
σ
2
X2
+ σ
2
N2
_
·
_
σ
2
X2
+ σ
2
N2
_
σ
2
X1
+ σ
2
N1
. (A.7)
If σ
2
N1
σ
2
X1
and σ
2
N2
σ
2
X2
, then
ρ
A1,A3
ρ
A2,A3
=
a
b
·
σ
2
X1
σ
2
X2
·
σ
X2
σ
X1
=
a
b
·
σ
X1
σ
X2
. (A.8)
Aditionally if σ
X1
≡ σ
X2
,
ρ
A1,A3
ρ
A2,A3
=
a
b
.
(A.9)
Acknowledgments
Special thanks are extended to Foreign Aﬀairs and Inter
national Trade Canada for supporting this work under its
PostDoctoral Research Fellowship Program (PDRF). The
authors also would like to thank Dr. Sudhakar Ganti for his
help. Work was performed while the ﬁrst author was with
the Department of Computer Science, University of Victoria,
Canada.
References
[1] K. Kokkinakis and A. K. Nandi, “Multichannel blind decon
volution for source separation in convolutive mixtures of
speech,” IEEE Transactions on Audio, Speech and Language
Processing, vol. 14, no. 1, pp. 200–212, 2006.
[2] H. Saruwatari, T. Kawamura, T. Nishikawa, A. Lee, and K.
Shikano, “Blind source separation based on a fastconvergence
algorithm combining ICA and beamforming,” IEEE Transac
tions on Audio, Speech and Language Processing, vol. 14, no. 2,
pp. 666–678, 2006.
[3] T. Kim, H. T. Attias, S.Y. Lee, and T.W. Lee, “Blind source
separation exploiting higherorder frequency dependencies,”
IEEE Transactions on Audio, Speech and Language Processing,
vol. 15, no. 1, pp. 70–79, 2007.
[4] S. C. Douglas, M. Gupta, H. Sawada, and S. Makino, “Spatio
temporal FastICA algorithms for the blind separation of
convolutive mixtures,” IEEE Transactions on Audio, Speech and
Language Processing, vol. 15, no. 5, pp. 1511–1520, 2007.
[5] N. Roman and D. Wang, “Pitchbased monaural segregation
of reverberant speech,” Journal of the Acoustical Society of
America, vol. 120, no. 1, pp. 458–469, 2006.
[6]
¨
O. Yilmaz and S. Rickard, “Blind separation of speech
mixtures via timefrequency masking,” IEEE Transactions on
Signal Processing, vol. 52, no. 7, pp. 1830–1846, 2004.
[7] M. H. Radfar and R. M. Dansereau, “Singlechannel speech
separation using soft mask ﬁltering,” IEEE Transactions on
Audio, Speech and Language Processing, vol. 15, no. 8, pp. 2299–
2310, 2007.
[8] R. Saab,
¨
O. Yilmaz, M. J. McKeown, and R. Abugharbieh,
“Underdetermined anechoic blind source separation via
q

basispursuit With q 1,” IEEE Transactions on Signal
Processing, vol. 55, no. 8, pp. 4004–4017, 2007.
[9] A. A¨ıssaElBey, N. LinhTrung, K. AbedMeraim, A.
Belouchrani, and Y. Grenier, “Underdetermined blind sepa
ration of nondisjoint sources in the timefrequency domain,”
IEEE Transactions on Signal Processing, vol. 55, no. 3, pp. 897–
907, 2007.
[10] A. A¨ıssaElBey, K. AbedMeraim, and Y. Grenier, “Blind
separation of underdetermined convolutive mixtures using
their timefrequency representation,” IEEE Transactions on
Audio, Speech and Language Processing, vol. 15, no. 5, pp. 1540–
1550, 2007.
[11] M. K. I. Molla and K. Hirose, “Singlemixture audio source
separation by subspace decomposition of hilbert spectrum,”
IEEE Transactions on Audio, Speech and Language Processing,
vol. 15, no. 3, pp. 893–900, 2007.
[12] A. Abramson and I. Cohen, “Singlesensor audio source
separation using classiﬁcation and estimation approach and
GARCH modeling,” IEEE Transactions on Audio, Speech and
Language Processing, vol. 16, no. 8, pp. 1528–1540, 2008.
[13] Y. Li and D. Wang, “Separation of singing voice from music
accompaniment for monaural recordings,” IEEE Transactions
on Audio, Speech and Language Processing, vol. 15, no. 4, pp.
1475–1487, 2007.
[14] T. Virtanen, “Monaural sound source separation by non
negative matrix factorization with temporal continuity and
sparseness criteria,” IEEE Transactions on Audio, Speech and
Language Processing, vol. 15, no. 3, pp. 1066–1074, 2007.
[15] L. Benaroya, F. Bimbot, and R. Gribonval, “Audio source
separation with a single sensor,” IEEE Transactions on Audio,
Speech and Language Processing, vol. 14, no. 1, pp. 191–199,
2006.
[16] T. Tolonen, “Methods for separation of harmonic sound
sources using sinusoidal modeling,” in Proceedings of the Audio
Engineering Society Convention, May 1999, preprint 4958.
[17] K. Itoyama, M. Goto, K. Komatani, T. Ogata, and H. G.
Okuno, “Integration and adaptation of harmonic and inhar
monic models for separating polyphonic musical signals,”
in Proceedings of IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP ’07), pp. 57–60, April
2007.
[18] L. Benaroya, L. M. Donagh, F. Bimbot, and R. Gribonval, “Non
negative sparse representation for Wiener based source separa
tion with a single sensor,” in Proceedings of IEEE International
Conference on Acoustics, Speech, and Signal Processing (ICASSP
’03), pp. 613–616, April 2003.
[19] J. J. Burred and T. Sikora, “On the use of auditory rep
resentations for sparsitybased sound source separation,” in
Proceedings of the 5th International Conference on Informa
tion, Communications and Signal Processing, pp. 1466–1470,
December 2005.
EURASIP Journal on Audio, Speech, and Music Processing 15
[20] Z. He, S. Xie, S. Ding, and A. Cichocki, “Convolutive
blind source separation in the frequency domain based on
sparse representation,” IEEE Transactions on Audio, Speech and
Language Processing, vol. 15, no. 5, pp. 1551–1563, 2007.
[21] M. Ito and M. Yano, “Sinusoidal modeling for nonstationary
voiced speech based on a local vector transform,” Journal of the
Acoustical Society of America, vol. 121, no. 3, pp. 1717–1727,
2007.
[22] J. McAulay and T. F. Quatieri, “Speech analysis/synthesis
based on a sinusoidal representation,” IEEE Transactions on
Acoustics, Speech, and Signal Processing, vol. 34, no. 4, pp. 744–
754, 1986.
[23] J. O. Smith and X. Serra, “PARSHL: an analysis/synthesis
program for nonharmonic sounds based on a sinusoidal
representation,” in Proceedings of the International Computer
Music Conference (ICMC ’87), pp. 290–297, 1987.
[24] X. Serra, “Musical sound modeling with sinusoids plus noise,”
in Musical Signal Processing, C. Roads, S. Pope, A. Picialli, and
G. D. Poli, Eds., pp. 91–122, Swets & Zeitlinger, 1997.
[25] T. Virtanen, Sound source separation in monaural music
signals, Ph.D. dissertation, Tampere University of Technology,
Finland, 2006.
[26] A. Klapuri, T. Virtanen, and J.M. Holm, “Robust multipitch
estimation for the analysis and ma nipulation of polyphonic
musical signals,” in Proceedings of the COSTG6 Conference on
Digital Audio Eﬀects, pp. 141–146, 2000.
[27] T. Virtanen and A. Klapuri, “Separation of harmonic sounds
using linear models for the overtone series,” in Proceedings of
IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP ’02), pp. 1757–1760, May 2002.
[28] M. R. Every and J. E. Szymanski, “Separation of synchronous
pitched notes by spectral ﬁltering of harmonics,” IEEE Trans
actions on Audio, Speech and Language Processing, vol. 14, no.
5, pp. 1845–1856, 2006.
[29] H. Viste and G. Evangelista, “A method for separation of
overlapping partials based on similarity of temporal envelopes
in multichannel mixtures,” IEEE Transactions on Audio, Speech
and Language Processing, vol. 14, no. 3, pp. 1051–1061, 2006.
[30] Z. Duan, Y. Zhang, C. Zhang, and Z. Shi, “Unsupervised
singlechannel music source separation by average harmonic
structure modeling,” IEEE Transactions on Audio, Speech and
Language Processing, vol. 16, no. 4, pp. 766–778, 2008.
[31] J. Woodruﬀ, Y. Li, and D. L. Wang, “Resolving overlapping
harmonics for monaural musical sound separation using pitch
and common amplitude modulation,” in Proceedings of the
International Conference on Music Information Retrieval, pp.
538–543, 2008.
[32] J. J. Burred and T. Sikora, “Monaural source separation from
musical mixtures based on timefrequency timbre models,” in
Proceedings of the International Conference on Music Informa
tion Retrieval, pp. 149–152, 2007.
[33] R. C. Maher, “Evaluation of a method for separating digitized
duet signals,” Journal of the Audio Engineering Society, vol. 38,
no. 12, pp. 956–979, 1990.
[34] T. Virtanen and A. Klapuri, “Separation of harmonic sound
sources using sinusoidal modeling,” in Proceedings of IEEE
International Conference on Acoustics, Speech, and Signal
Processing (ICASSP ’00), pp. 765–768, June 2000.
[35] M. Gainza, B. Lawlor, and E. Coyle, “Harmonic sound source
separation using FIR comb ﬁlters,” in Proceedings of the Audio
Engineering Society Convention, 2004, preprint 6312.
[36] H. Thornburg, R. J. Leistikow, and J. Berger, “Melody extrac
tion and musical onset detection via probabilistic models
of framewise STFT peak data,” IEEE Transactions on Audio,
Speech and Language Processing, vol. 15, no. 4, pp. 1257–1272,
2007.
[37] G. Hu and D. Wang, “Auditory segmentation based on onset
and oﬀset analysis,” IEEE Transactions on Audio, Speech and
Language Processing, vol. 15, no. 2, pp. 396–405, 2007.
[38] A. Klapuri, “Multiple fundamental frequency estimation by
summing harmonic amplitudes,” in Proceedings of the Interna
tional Conference on Music Information Retrieval, pp. 216–221,
2006.
[39] J. Rauhala, H.M. Lehtonen, and V. V¨ alim¨ aki, “Fast automatic
inharmonicity estimation algorithm,” Journal of the Acoustical
Society of America, vol. 121, no. 5, pp. EL184–EL189, 2007.
[40] J. C. Brown, “Frequency ratios of spectral components of
musical sounds,” Journal of the Acoustical Society of America,
vol. 99, no. 2, pp. 1210–1218, 1996.
[41] M. Wu, D. L. Wang, and G. J. Brown, “A multipitch tracking
algorithm for noisy speech,” IEEE Transactions on Speech and
Audio Processing, vol. 11, no. 3, pp. 229–241, 2003.
[42] C. Yeh and A. Roebel, “The expected amplitude of overlapping
partials of harmonic sounds,” in 2009 IEEE International
Conference on Acoustics, Speech, and Signal Processing (ICASSP
’09), pp. 3169–3172, twn, April 2009.
[43] M. Goto, “Development of the RWC music database,” in
Proceedings of the 18th International Congress on Acoustics (ICA
’04), pp. 553–556, 2004.
In this work. For the remainder of this paper. Duan et al. Every and Szymanski [28] employ three ﬁlter designs to separate partly overlapping partials. the partials be suﬃciently disjoint so their individual features can be extracted. the standard interpolation approach as described later is used instead. it cannot be used to estimate the amplitudes of coincident or almost coincident partials. the partials may be separated enough to generate some eﬀects that can be explored to resolve them.5 0 380 400 420 440 Frequency (Hz) 460 480 (a) (b) Figure 1: Magnitude spectrum showing: (a) an example of partially colliding partials. If more than two partials collide.2 ×104 EURASIP Journal on Audio. as frequency and amplitude modulation. is used when the colliding partials are mostly concentrated in the same spectral band (Figure 1(b)). It is also important to note that the word partial can be both used to indicate part of an individual source (isolated harmonic). Klapuri et al. Also. Since most of these elaborated methods usually have limited applicability.5 2 1. making it nearly impossible to accurately estimate their frequencies. To work properly. because diﬀerent partials vary in diﬀerent ways around a common frequency. and (b) an example of coincident partials. the frequency band occupied by a partial during a period of time will be given by the range of such a variation. and . On the other hand. the term partial will refer to a sinusoid with a frequency that varies with time. amplitudes. the technique does not work when the frequencies of the sources have octave relations. and only a few of them can deal with coincident partials. An eﬃcient method to separate coincident partials based on the similarity of the temporal envelopes was proposed by Viste and Evangelista [29]. In the ﬁrst case. The expression coincident partials. [31] propose a technique based on the assumptions that harmonics of the same source have correlated amplitude envelopes and that phase diﬀerences can be predicted from the fundamental frequencies. A small number of techniques to resolve colliding partials have been proposed. Some authors simply attribute all the content to a single source [32]. two partials will be considered coincident if their frequencies are separated by less than 5% for frequencies below 500 Hz. but it only works for multichannel mixtures. The method does not work properly when the partials are mostly concentrated in the same band. it has three main shortcomings: (a) it assumes 4 3. it is the amplitude of individual partials that can provide the most useful information to eﬃciently separate coincident partials. at least for some frames. There are a few proposals that are able to resolve coincident partials. Speech. in most cases the band within which the partials are located can be determined instead. Since the phase is usually ignored and the frequency often cannot be reliably estimated due to the time variations.5 Absolute magnitude 3 2. those values are roughly the thresholds for which traditional techniques to resolve close sinusoids start to fail. [30] use an average harmonic structure (AHS) model to estimate the amplitudes of coincident partials. As a result. or part of the whole mixture—in this case. the merging of two or more coincident partials would also be called a partial. but it loses its eﬀectiveness when the frequencies of the partials are close. The main limitation of the technique is that it depends on very accurate pitch estimates. and Music Processing by less than 25 Hz for frequencies above 500 Hz—according to tests carried out previously. according to Virtanen [25]. The advantage of such a simple approach is that it can be used in almost every case. [26] explore the amplitude modulation resulting from two colliding partials to resolve their amplitudes. Using very short time windows to perform the analysis over a period in which the frequencies would be expected to be relatively steady also does not work. Virtanen and Klapuri [27] propose a technique that iteratively estimates phases. The interpolation approach estimates the amplitude of a given partial that is known to be colliding with another one by linearly interpolating the amplitudes of other partials belonging to the same source. on the other hand. simpler and less constrained approaches are often adopted instead. with the only exceptions being those in which the sources have the same fundamental frequency. normally only the two adjacent ones are used. because they tend to be more correlated to the amplitude of the overlapping partial. which uses a nonlinear leastsquares estimation to determine the sinusoidal parameters of the partials. Parametric approaches like this one tend to fail when the partials are very close. while others use a simple interpolation approach [33–35]. but in the second case they usually merge in such a way they act as a single partial. The same kind of problem can occur in the strategy proposed by Tolonen [16]. However.5 1 0. and frequencies of the partials using a leastsquare solution. as this procedure results in a very coarse frequency resolution due to the wellknown timefrequency tradeoﬀ. Woodruﬀ et al. Hence. Partials referring to the mixture will be called mixturepartials whenever the context does not resolve this ambiguity. because some of the matrices used to estimate the parameters tend to become singular. Most techniques proposed in the literature can only reliably resolve colliding partials if they are not coincident. Several partials can be used in such an interpolation but. this method requires that. The sinusoidal modeling technique can successfully estimate the amplitudes when the partials of diﬀerent sources do not collide. The expression colliding partials refers here to the cases in which two partials share at least part of the spectrum (Figure 1(a)). but they only work properly under certain conditions. The problem is even more evident in the case of coincident partials.
The ﬁrst condition for a new division is that the duration of the note be at least 200 ms. Section 2 presents the preprocessing. and the second one comprising the remainder of the frame. which would make it more diﬃcult to analyze the results. so the longest homogeneous frames are considered. This paper presents a more reﬁned alternative to the interpolation approach. The algorithm presented in this paper does not include an onsetdetection procedure in order to avoid cascaded errors. because there is no previous partial to be used in the interpolation. are explained in this section. instrument or vocal) occurs.5. the algorithm depends only on the existence of one partial strongly dominated by each source to work properly. respectively. The last four blocks represent the core of the algorithm and are described in Section 3. In a short description. which are unknown at this point. RRMS will always assume a value between zero and one. 37] (new note. This also means that the algorithm is not able to deal with sound sources that do not have harmonic characteristics. This step is necessary because the amplitude estimation is made in a framebyframe basis. Finally. The ﬁrst step of the algorithm is dividing the signal into frames. (c) The ﬁrst partial (fundamental) can be estimated. Adaptive Frame Division. This new RRMS value is stored and new divisions are tested by successively increasing the length of the ﬁrst frame by 5 ms and reducing the second one by 5 ms. the ﬁrst one having a 100ms length. the partial with minimum interference from other sources. like percussion instruments. and relatively reliable estimates are possible even if this condition is not completely satisﬁed. interfering with the partial correlation procedure described in Section 3. Section 4 presents the experiments and corresponding results. Finally. a study about the eﬀects of onset misplacements on the accuracy of the algorithm is presented in Section 4. To cope with partial amplitude variations that may occur within a frame. (b) the ﬁrst partial (fundamental) cannot be estimated using this procedure. Speech. (d) There are no intraframe permutation errors. they will always be assigned to the correct source. 2. The RMS values were used here because they are directly related to the actual amplitudes. The algorithm then measures the RMS ratio between the frames according to RRMS = min(r1 . The proposal is based on the hypothesis that the frequencies of the partials of a given source will vary in approximately the same fashion over time. if necessary. now with the ﬁrst new frame being 105ms long and the second being 5 ms shorter than it was originally. this information is used to estimate how the amplitude of each mixture partial should be split among its components.EURASIP Journal on Audio. meaning that. The inﬂuence of each source over each mixture partial is then determined by the correlation of the mixture partials with respect to the reference partials. In the context of this work. Therefore. 3 The paper is organized as follows. The best procedure here is to set the boundaries of each frame at the points where an onset [36. Section 3 describes all steps of the algorithm. (e) The estimation accuracy is much greater than that achieved by the interpolation approach. This is done until the .3. since dividing shorter frames would result in frames too small to be properly analyzed. using some characteristics of the harmonic audio signals to provide a better estimate for the amplitudes of coincident partials. Section 5 presents the conclusions and ﬁnal remarks. the frequencies may vary wildly. revealed that in 99. (b) The algorithm works even if the sources have the same fundamental frequency (F0)—tests comparing the spectral envelopes of a large number of pairs of instruments playing the same note and having the same RMS level. max(r1 .2% of the cases there was at least one partial whose energy was more than ﬁve times greater than the energy of its counterpart. The results are used to choose a reference partial for each source. 2. a vocal or an instrument generating a given note is considered a source. This proposal has several advantages over the interpolation approach. The ﬁrst three blocks. because during the period they occur. that is. the algorithm includes a procedure to divide the original frame further. the term source refers to a sound object with harmonic frequency structure. Preprocessing Figure 2 shows the basic structure of the algorithm. for the cases in which a number of partials are practically nonexistent. (a) Instead of relying in the assumption that both neighbor partials are interferencefree. The algorithm works better if the onsets themselves are not included in the frame. r2 ) (1) where r1 and r2 are the RMS of the ﬁrst and second new frames. assuming the amplitude estimates within a frame are correct. The preprocessing steps described in the following are fairly standard and have shown to be adequate for supporting the algorithm. by determining which is the mixture partial that is more likely to belong exclusively to that source. the estimates can be completely wrong. If this condition is satisﬁed. and Music Processing that both adjacent partials are not signiﬁcantly changed by the interference of other sources.1. However. which represent the preprocessing. (c) the assumption that the interpolation of the partials is a good estimate only holds for a few instruments and. The RRMS value is then stored and a new division is tested. r2 ) . and then uses the results to calculate the correlations among the mixture partials. the algorithm divides the original frame into two frames. which is often not true. such as a clarinet with odd harmonics. the algorithm tracks the frequency of each mixture partial over time.
the procedure is repeated and new divisions may occur. but fortunately this is the less frequent kind of error. taking multiples of F0 sometimes work. refers to the central frequency of the band occupied by that partial (see deﬁnition of partial in the introduction).6. providing a more meaningful picture of its performance. such an algorithm would maybe depend on a reliable F0 estimator. this indicates a considerable amplitude variation within the frame. (b) The shorttime discrete Fourier transform (STDFT) is calculated for each frame. As will be seen in Section 3.6. To work properly. that has little impact on the amplitude estimation of the real partials. Higher octave errors. considerable impact on the proposed algorithm will occur mostly in the case of lower octave errors. but this is a problem that does not concern this paper. If the lowest RRMS value obtained is below 0. or until all possible RRMS values are above 0. some problems may happen. in which the detected F0 is a submultiple of the correct one. The reasons for this behavior are presented in Section 4. in which the detected F0 is actually a multiple of the correct one. The ﬁrst versions of the algorithm included the multiple fundamental frequencies estimator proposed by Klapuri [38]. As can be seen. are easily compensated by the algorithm as it uses a search width of 0. Fundamental frequency errors are indeed a problem in the more general context of sound source separation. 40] of some instruments may cause this approach to fail. This is done until all frames are smaller than 200ms. To make the estimation of each partial frequency more accurate.4. the direct consequence is that a number of false partials (partials that do not exist in the actual signal. have very little impact on the estimation of correct partials. the algorithm needs a good estimate of where each partial is located—the location or position of a partial. but since the scope of this paper is limited to the amplitude estimation. When the positions of those false partials coincide with the positions of partials belonging to sources whose F0 were correctly identiﬁed. in this case. but those that are taken into account are actual partials. in which the estimated frequency deviates by at most a few Hertz from the actual value.6. all errors will be due to the proposed algorithm. the proposed amplitude estimation procedure depends on the proper choice of reference partials for each instrument. On the other hand. . together with some results that illustrate how serious is the impact of such a situation over the algorithm performance. The position of the partials of each source is directly linked to their fundamental frequency (F0).75. and some practical tests are presented in Section 4.4 Signal Division into frames EURASIP Journal on Audio. so the correct fundamental frequencies are assumed to be known. an algorithm was created—the algorithm is fed with the frames of the signal and it outputs the position of the partials. Problems may arise when the algorithm considers false partials. especially if one needs to take several partials into consideration. since they are relatively common and result in a number of false partials—a study about this impact is presented in Section 4. F0 Estimation and Partial Location. Although F0 errors are not considered in the main tests. When the fundamental frequency of a source is misestimated. but that are detected by the algorithm due to F0 estimation error) will be considered and/or a number of real partials will be ignored. The discussion above is valid for signiﬁcant F0 estimation errors—precision errors. and in the case of nonoctave errors—this last situation is the worst because most considered partials are actually false.1 · F0 around the estimated frequency to identify the correct position of the partial. 2. The steps of the algorithm for each F0 are the following: (a) The expected (preliminary) position of each partial (pn ) is given by pn−1 + F0. with p0 = 0. if the ﬁrst reference partial belongs to the instrument with the correct F0.2. the algorithm will ignore a number of partials. This is because that. it is instructive to discuss some of the impacts that F0 errors would have in the algorithm proposed here. errors coming from thirdparty tools should not be taken into account in order to avoid contamination of the results. as a result of this new division. If. Speech.75 (empirically determined). Simply. it is assumed that a hypothetical sound source separation algorithm would eventually reach a point in which the amplitude estimation would be necessary—to reach this point. but the inherent inharmonicity [39. which can happen both in the case of lower octave errors. which are used as a template to estimate the remaining ones. F0 errors may have signiﬁcant impact in the estimation of the amplitudes of correct partials depending on the characteristics of the error. If the ﬁrst reference partial to be chosen belongs to the instrument for which the F0 was misestimated. resulting second frame is 100ms long or shorter. and the original frame is deﬁnitely divided accordingly. and Music Processing Segmental frequency estimation Amplitude Estimates estimation procedure F0 estimation Partial ﬁltering Frame subdivision Partial correlation Figure 2: Algorithm general structure. then the entire amplitude estimation procedure may be disrupted. if all information provided by the supporting tools is assumed to be known. Accordingly. On the other hand. from which the magnitude spectrum M is extracted. Some results using diﬀerent ﬁxed frame lengths are presented in Section 4. in the context of this work. A common consequence of using supporting tools in an algorithm is that the errors caused by ﬂaws inherent to those supporting tools will propagate throughout the rest of the algorithm. Such a discussion is presented in the following. one or both the new frames have a length greater than 200 ms.
Partial Filtering. and environment. It is important to notice that partials below −20 dB from the strongest one may. musician. whose main objective is to resynthesize the sources as accurately as possible. Estimating the amplitudes of coincident partials is an important step toward such an objective. but the results were practically the same. but their combined position. the algorithm only takes into account partials whose energy—obtained by the integration of the power spectrum within the respective band—is at least 1% of the energy of the most energetic partial. The frequency estimation for each 10ms subframe is performed in the time domain by taking the ﬁrst and last zerocrossing. where pn is the frequency of the partial under analysis. 3. respectively. that is.2. Frame Subdivision. When two partials are coincident in the mixed signal. they are accumulated into a partial trajectory. a ﬁlter with a narrow passband may be appropriate for some kinds of sources. this method is intended to be used in the context of sound source separation. the number of partials considered by the algorithm is also kept the same. As commented before. where sw = 0. and pn−1 and pn+1 are the frequencies of the closest partials with lower and higher frequencies. and then determining the frequency according to f = c/d. The ﬁlterbank used to isolate the partials is composed by thirdorder elliptic ﬁlters. which can be explored to match partials to that particular source. with a considerably greater computational complexity. but alternative strategies are currently under investigation. On the other hand. a broad passband will certainly include the whole relevant portion of the spectrum. even when a very large number of partials are considered. which have a length of at least 100 ms. they will be considered only if they have at least one percent of the energy the strongest partial—thus. a given partial usually occupies a certain band of the spectrum. among others. . with no overlap (overlapping the subframes did not improve the results). noise plays an important role. they are grouped into one single set containing the positions of all mixture partials. As a result of those observations. and Music Processing (c) The adjusted position of the current partial ( pn ) is given by the highest peak in the interval [pn − sw . 0. Experiments have indicated that the most appropriate band to be considered around the peak of a partial is given by the interval [0. they are so close that the algorithm can take the highest one as the position of the mixture partial without problem. The mixture partials for which the amplitudes are to be estimated are isolated by means of a ﬁlterbank. Therefore. a1  + a2  zc = p 1 + (2) where p1 and p2 are. The 10ms subframes resulting from the division described in Section 3. when partials have very low energy. Finite impulse response (FIR) ﬁlters were also tested. If the position of the partial is less than 2sw apart from any partial position calculated previously for other source. and ideally the amplitudes of all partials should be estimated. The resulting frames after the ﬁltering are subdivided into 10ms subframes.1 · F0 is the search width. the positions in seconds of the samples immediately before and immediately after the zerocrossing. On the other hand.1 are used to estimate such a trajectory. a broader search region was avoided in order to reduce the chance of interference from other sources. and a1 and a2 are the amplitudes of those same samples. pn + sw ] of M. however. The procedure described in this section has led to partial frequency estimates that are within 5% from the correct value (inferred manually) in more than 90% of the cases. the energy of an individual partial in a mixture may be below the 1% limit. Once the frequencies for each 10ms subframe are calculated. but may ignore relevant parts of the spectrum for others.1. Mixture partials follow the same rules. in which case steps (a) to (c) will determine not their individual positions. making it nearly impossible to extract enough information to perform a meaningful estimate. Partial Trajectory Estimation. Speech. In real signals. it is expected that partials belonging to a given source will have similar frequency trajectories. and they are not coincident (less than 5% or 25 Hz apart). if a given F0 keeps the same in consecutive frames. making such estimation either unreliable.5 · (pn−1 + pn ). The frequency of each partial is expected to ﬂuctuate over the analysis frame. in some cases. respectively. only a fraction of a period may be considered in the frequency estimation described in Section 3. 2.2. be relevant. This kind of ﬁlter was chosen because of its steep rolloﬀ. with a passband ripple of 1 dB and stopband attenuation of 80 dB.3) to produce meaningful results. if the subframe is too short and the frequency is low.5 · (pn + pn+1 )]. 5 however. which can be broader or narrower depending on a number of factors like instrument. In order to avoid that a partial be considered in certain frames and not in others. In practice. but may also include spurious components resulting from noise and even neighbor partials. Also.3. The Proposed Algorithm 3. the positions of both partials are recalculated considering sw equal to half the frequency distance among the two partials. Longer subframes were not used because they may not provide enough points for the subsequent correlation calculation (see Section 3. measuring the distance d in seconds and the number of cycles c between those zerocrossings.EURASIP Journal on Audio. After the positions of all partials related to all fundamental frequencies have been estimated. or even impossible. Such a hard lower limit for the partial energy is the best current solution for the problem of noisy partials. 3. Sometimes coincident partials may have discernible separate peaks. which is the position of the mixture partial. This search width contains the correct position of the partial in nearly 100% of the cases. The exact position of the zerocrossing is given by a1  · p2 − p1 . they often share the same peak.
Speech. Therefore.A3 b σX2 + σN2 σX1 + σN1 2 Assuming that σN1 reduces to 2 2 σX1 . it is be the correlation coeﬃcient between two random variables X and Y with expected values μX and μY and standard deviations σX and σY . below 150 Hz the partial trajectory shows some ﬂuctuations that may not be present in higher frequency partials. In practice. and reverberation. Finally.A3 a σX1 + σN1 ⎝ σX2 + σN2 ⎠ = . intermediary correlation values are expected. 2 2 2 2 ρA1 . replacing the zerocrossing approach by the normalized crosscorrelation resulted in almost the same overall amplitude estimation accuracy (mean error values diﬀer by less than 1%). First. In the plot. Those disturbances can be modeled as noise. The main hypothesis motivating the procedure described here is that the partial frequencies of a given instrument or vocal vary approximately in the same way with time. Let A1 = X1 +N1 and A2 = X2 +N2 be independent random variables. with no interference from other partials. Since at least two zerocrossings are necessary for the estimates. the accuracy for such low frequencies is still better than that achieved by the interpolation approach. It is worth noting that there are more accurate techniques to estimate a partial trajectory. Amplitude Estimation Procedure. A3 is the partial frequency trajectory resulting from the sum of . probably due to artiﬁcial ﬂuctuations in the frequency trajectory that are introduced by the zerocrossing approach. let X1 . Therefore. However. resulting in N(N − 1)/2 correlation values. A2 is the frequency trajectory of a partial of instrument 2. and σX1 ≡ σX2 . As can be seen. Figure 3 shows the eﬀect of the frequency on the accuracy of the amplitude estimates. 3. a model is deﬁned in which the nth partial Pn of an instrument is given by Pn (t) = n · F0(t). any of the approaches can be used without signiﬁcant impact on the accuracy. A1 is the frequency trajectory of a partial of instrument 1. in percentage. as a consequence. and let A3 = aA1 + bA2 be a random variable representing their weighted sum.A3 b (5) For proof. the vertical scale indicates how better or worse is the performance for that frequency with respect to the overall accuracy of the accuracy. Finally. for 100 Hz the accuracy of the algorithm is 16% below average. but factors like instrument characteristics. Y ) E X − μX Y − μY = σX σY σX σY (3) 5 0 −5 −10 −15 −20 −25 −30 −35 50 150 250 350 450 550 650 Frequency (Hz) 750 850 950 Figure 3: Eﬀect of the frequency on the accuracy of the amplitude estimates. the algorithm cannot deal with frequencies below 50 Hz. If we consider both the fundamental frequency variations F0(t) and the noisy disturbances N(t) as random variables. However. which collides with the partial of instrument 1. which generates trajectories like those shown in Figure 4.A3 a = . ρA2 . thus reducing the correlation between partials and. room acoustics. the accuracy of the algorithm. The use of the zerocrossings. when one partial results from a given source A (called reference). it is observed that the partial frequency trajectories indeed tend to vary together. given by the sum of the ideal partial frequency trajectory X1 and the disturbance N1 . where aA is the amplitude of source A partial and aS is the amplitude of the mixture partial with the source A partial removed.Y = cov(X. and the accuracy drops rapidly as lower frequencies are considered. Partial Trajectory Correlation. The frequencies estimated for each subframe are arranged into a vector. all partial frequency trajectories would vary in perfect synchronism. among others. (4) ρA1 . as will be seen in Section 4. and N1 and N2 be zeromean independent random variables. The next step is to calculate the correlation between each possible pair of trajectories. Also. it is assumed that the correlation values will be proportional to the ratio aA /aS in the second mixture partial. where N is the number of partials. Then. Also. More than that. see the appendix.6 Performance (in % of the mean accuracy) EURASIP Journal on Audio. 3. the lemma applies—in this context. The lemma stated above can be directly applied to the problem presented in this paper. in this context. let ρX. If aA is much larger than aS . so now Pn (t) = n · F0(t) + N(t). where N is the noise. like the normalized crosscorrelation [41].3. introduce disturbances that prevent a perfect match between the trajectories. The use of subframes as small as 10ms has some important implications in the estimation of low frequencies. 2 2 2 2 ρA2 . as explained in the following. is justiﬁed by the low computational complexity associated. it is said that the partial from source A dominates that band.4. the lowest correlations are expected to occur when the mixture partials are completely related to diﬀerent sources. Conversely. Lemma 1. In this idealized case. X2 also be independent random variables. One trajectory is generated for each partial. and Music Processing hypothesized that the correlation between the trajectories of two mixture partials will be high when they both belong exclusively to a single source. and the other one results from the merge of partials coming both from source A and from other sources S. where F0(t) is the timevarying fundamental frequency and t is the time index. σN2 ⎛ ⎞ (4) 2 σX2 .
the resulting frequency trajectory may diﬀer from the weighted sum. However. and the algorithm was designed to explore exactly the cases in which one partial dominates the other ones. Speech. the shape of A3 is the sum of the trajectories A1 and A2 weighted by the corresponding amplitudes (a and b).5 0 50 100 150 200 250 Time (ms) (d) 300 350 400 (c) Figure 4: Trajectories (a) and (b) come from partials belonging to the same source. source A is completely absent. and the mixing model is not perfect. thus having very similar behaviors.mixture 958 957.instrument A Third partial trajectory . its characteristics result from the combination of each partial trends. which show how the correlation varies with respect to the amplitude ratio between the reference source A and the other sources. According to the lemma. in the following way: (a) A partial from source A is taken as reference (hr ). for which purpose they hold suﬃciently well—this is conﬁrmed by a number of empirical experiments illustrated in Figures 4 and 5. . and Cmax is the correlation between the partial from source A and the mixture when w = 0— in this case Cmax is always equal to one.instrument B 300 350 400 Frequency (Hz) 951 1429 1428 1427 1426 1425 1424 1423 1422 1421 1420 1419 0 50 100 150 200 250 Time (ms) 300 350 400 (b) 954 953 952 951 950 949 948 947 946 945 944 Second partial trajectory . (c) Mixture partials (hm ) are generated according to w · ha + (1 − w) · hb .EURASIP Journal on Audio. It is important to emphasize that some possible ﬂaws in the model above were not overlooked: there are not many samples to infer the model.5 957 956. When the partials have similar amplitudes. the lemma and assumptions stated before have as main objective to support the use of crosscorrelation to recover the mixing weights. When w is zero. (d) The correlation values between the frequency trajectories of hr and hm are calculated and scaled in such a way the normalized correlations are 0 and 1 when w = 0 and w = 1. Cmin is the correlation between the partial from source A and the mixture when w = 0. This is not a serious problem because such a diﬀerence is normally mild. where w varies between zero and one and represents the dominance of source A. as well as from phase interactions between the partials. Trajectory (d) corresponds to a mixture partial.5 959 Second partial trajectory . the random variables are not IID (independent and identically distributed). Trajectory (c) corresponds to a partial from another source. this assumption holds well when one of the partials has a much larger amplitude than the other one. the colliding partials.instrument A 7 953 952 Frequency (Hz) 950 949 948 947 946 0 50 100 150 200 250 Time (ms) (a) 959. The scaling is performed according to (6).5 956 955.5 0 50 100 150 200 250 Time (ms) 300 350 400 Frequency (Hz) Frequency (Hz) 958. The correlation procedure aims to quantify how close the mixture trajectory is from the behavior expected for each source. the partial from source A is completely dominant. Figure 5 was generated using the database described in the beginning of Section 4. (b) A second partial of source A is selected (ha ). In practice. as represented in the horizontal axis of Figure 5. where Ci j is the correlation to be normalized. and Music Processing Second partial trajectory . and when w is one. together with a partial of same frequency from source B (hb ). respectively.
Cmax − Cmin (6) 0 0. In order to illustrate how the reference partials are determined.3 0.2 0. Those are the mixture partials for which the frequency variations are more alike.4 0. As a result. As a result.8 Dominance of reference partial 0. The values in Table 1 are used as example to illustrate each step of the procedure to determine the amplitude of each source and partial. The reference partial for IC is chosen by taking the mixture partial that result in the lowest correlation with respect to G. and Cmin and Cmax are the minimum and maximum correlation values for that frame. however. The amplitude estimation procedure can be divided into two main parts: determination of reference partials and the actual amplitude estimation. which is determined using the results of the partial positioning procedure described in Section 2. If all sources have at least one of such “clean” partials to be taken as reference. the hypothesis holds relatively well in most cases. It is important to highlight that this paper is devoted to be problem of estimating the amplitudes for individual frames. The values between . and the most energetic reference partial is taken as the global reference partial G. As a result. If at least one source satisﬁes the “clean partial” condition. Table 1 shows the frequency correlation values between the partials of this hypothetical signal. where Ci j is the correlation value (between partials i and j) to be warped. the normalized correlation would have always the same value of w (solid line in Figure 5). The amplitude estimation procedure described next was designed to mitigate the problems associated to the cases in which the hypotheses tend to fail. (d) In this step. no intraframe permutation errors can occur. As can be seen in Figure 5. The objective is to ﬁnd the partials that are less aﬀected by sources other than the one it should represent. the fourth partial has the lowest correlation with respect to G(−0. the algorithm selects the mixture partial in which IC is more dominant with respect to IG . A solution for this problem based on musical theory and continuity rules is expected to be investigated in the future. A subsequent problem would be taking all framewise amplitude estimates within the whole signal and assign them to the correct sources. Items (b) and (c) only take place if no source satisﬁes such a condition. (c) The most energetic among those two partials is chosen both as the global reference G and as reference for the corresponding source. the algorithm chooses the reference partials for the remaining sources. provided that the components of such mixture partial belong only to IC and IG (if no partial satisﬁes this condition. Let IG be the source of partial G. the rules are valid for any number of simultaneous instruments. (a) If a given source has some partials that do not coincide with any other partial. which is the case of the hypothetical signal. consider that all mixture partials after the ﬁfth have negligible amplitudes. As a result.5 1 0.5 0 −0. In the example shown in Table 1. Further investigation will be necessary in order to determine why this happens only for certain instruments. the algorithm skips directly to the amplitude estimation. which indicates that they both belong mostly to a same source. and let IC be the current source for which the reference partial is to be determined.6 0. the ﬁrst partial is taken as reference R1 for instrument 1 (R1 = 1).8 2 Normalised correlation 1.2. Also.9 1 Figure 5: Relation between correlation of the frequency trajectories and partial ratio.5 0. being taken as reference R2 for instrument 2 (R2 = 4). and Music Processing parentheses are the warped correlation values. This part of the algorithm aims to ﬁnd the partials that best represent each source in the mixture. If the hypothesis hold perfectly. Although the example considers mixtures of only two instruments. calculated according to Ci j = Ci j − Cmin . 3. Speech. as described next.1 0.4. the strategy works fairly well if the hypotheses hold (partially or totally) for at least one of the sources.7 0. as the partial with greatest amplitude probably has the most deﬁned features to be compared to the remaining ones. item (e) takes place). In the example given by Table 1. The use of reference partials for each source guarantees that the estimated amplitudes within a frame will be correctly grouped. as well as the amplitude of each mixture partial. consider a hypothetical signal generated by two simultaneous instruments playing the same note.1.3). the most energetic among such partials is taken as reference for that source. Determination of Reference Partials. there are some instruments (particularly woodwinds) for which this tends to fail. the algorithm skips to item (d).5 −1 EURASIP Journal on Audio. and the relative diﬀerences among the correlation values are reinforced. (b) The two mixture partials that result in the greatest correlation are selected (ﬁrst and third in Table 1). all correlation values now lie between 0 and 1. possible coincident partials have small amplitudes compared to the dominant partials. In this case.
2 and 0. respectively. according to As (i) = Ci. as pointed out before. in general.62) — — — — 0. where Ar is the rough amplitude estimate and Am is the actual amplitude of the mixture partial and is multiplied to the initial zerophase partial amplitude estimates.3 (e) This item takes place if all mixture partials are composed by at least three instruments. and the highest correlation value is warped to 1.Rs N n=1 Ci. if uncompensated. They were determined empirically by a thorough observation of the data and exhaustive tests.2 (0. This reﬂects the hypothesis that higher correlation values indicate a stronger relative presence of a given instrument in the mixture. and the process continues until all reference partials have been determined. In this case.2 (0.2 (0.EURASIP Journal on Audio.1 and 2.3 (0.5) — 0.7 2 0. To compensate for such a phase displacement. and partials 2 and 3 would be removed accordingly. which would be a task close to impossible. the lowest correlation value overall will have a warped correlation of 0. n is the index of the source (considering only the sources that are part of that mixture). the proposed method explores the frequency modulation (FM) of the signals. As a ﬁnal remark. To compensate (at least partially) the phase being neglected in previous steps of the algorithm. As can be seen. Table 2 shows the relative partial amplitudes for the example given by Table 1.5) — — — 0. 1 — — — — — 0. the largest warped correlation would be 1.0) 0. the relative partial amplitudes are used to extract the amplitudes of each individual partial from the mixture partial (values between parentheses). both heuristic and statistical. Partial 1 2 3 4 5 Amp. This rough estimate is.5 (1.0) −0. which in practice does not occur. items (a) to (d) are repeated for the remaining partials. it is important to emphasize that the amplitudes within a frame are not constant.25) −0. amplitude and phase should be estimated together to produce accurate estimates. both (6) and (7) are heuristic. might result in inaccurate estimates. the table is a matrix showing the correlations between the mixture partials. This is carried out by removing all partials whose warped correlation values with respect to R1 are greater than half the largest warped correlation value of R1 . However. In the example given by Table 1.12) 0. However. the relative amplitudes to be assigned to the partials in the mixture are directly proportional to the warped correlations of the partial with respect to the reference partials. some further processing is necessary: a rough estimate of which amplitude the mixture would have if the phases were actually perfectly aligned is obtained by summing the amplitudes estimated using part of the algorithm proposed by Yeh and Roebel [42] in Sections 2. (7) where As indicate the relative amplitude to be assigned to source s in the mixture partial. In the example.4. Other strategies. 0.2 of their paper.5 5 0 (0. 3. making it reasonable to estimate an average amplitude instead of detecting the exact amplitude envelope. the intraframe amplitude variations are usually small (except in some cases of strong vibrato). the amplitude of the mixture partial is assumed to be equal to the sum of the amplitudes of the coincident partials. and Music Processing 9 Table 1: Illustration of the amplitude estimation procedure. Speech. and the values between parentheses are the warped correlation values according to (6).9 3 0.2 (0. were tested.4 4 −0.62. Thus. The objective now is to remove from the process all partials signiﬁcantly inﬂuenced by IG . all other correlations will have intermediate warped value. Amplitude Estimation.1 (0. This procedure improves the accuracy of the estimates by about 10%. In the following. This diﬀerence between both amplitudes is a rough measure of the phase displacement between the partials. The warped correlation were used because. the regular and warped correlations between partials 1 and 2 are.12) −0. larger than the actual amplitude of the mixture partial.12) — — 0. This would only hold if the phases of coincident partials were aligned. Then. As can be seen.1 (0. a weighting factor given by w = Ar /Am . . As can be seen in (7). Ideally. In fact.Rn . If more than two instruments still remain in the process. and FM is often associated with some kind of amplitude modulation (AM). The reference partials for each source are now used to estimate the relative amplitude to be assigned to each partial of each source. they enhance the relative diﬀerences among the correlations.1 (0. item (e) takes place once more. the mixture partial that result in the lowest correlation with respect to G is chosen to represent the partial least aﬀected by IG . but this simple approach resulted in a performance comparable to those achieved by more complex strategies.2. and Ci j is the warped correlation value between partials i and j. The last row in the table reveals the amplitude of each one of the mixture partials. the characteristics of the algorithm made it necessary the adoption of simpliﬁcations and assumptions that.37) −0. If the last row is removed.
Overall Performance and Comparison with Interpolation Approach.29 (0.2. The accuracy degrades when more instruments are considered. Amax (8) where Eabs is the absolute error between the estimate and the correct amplitude. Also. Since the interpolation procedure cannot estimate the ﬁrst partial. The eﬀective amplitudes are obtained by multiplying the relative amplitudes by the mixture partial amplitudes shown in the last row of Table 1. since the partials are not disturbed at all by other sources. All signals are sampled at 44. the proposed strategy outperforms the interpolation approach even when dealing with several simultaneous instruments. the results for the interpolation approach could be signiﬁcantly worse. especially if several instruments are present. Eighteen instruments of several types (winds. The error is given in dB and is calculated according to error = Eabs . and the other mixtures include diﬀerent octave relations (including unison). Although the algorithm can. Speech. This represents the ideal condition for the interpolation approach. and so is their absolute error. and Amax is the amplitude of the most energetic partial.57 (0. hence the sum of each column in this case is equal to the amplitudes shown in the last row of Table 1. In real situations with diﬀerent kinds of mixtures present.1. but meaningful estimates can be obtained for up to ﬁve simultaneous instruments.11 (0. as shown in Table 4—the error values for the other mixtures were omitted because they have approximately the same behavior. thus including their results would not add much information. normally only a few partials above the twelfth are considered. and then measuring the error between the estimate resulting from the interpolation of the neighbor partials and the actual value of the partial. The mixtures of two sources are composed by instruments playing in unison (same note).26) 3 0. in such cases the spectrum tends to become too crowded for the algorithm to work properly. Before comparing the techniques. This must be taken into consideration when comparing the results in Table 3. Finally. resulting in greater absolute errors.5) 5 0.89 (0. and struck strings) were considered—mixtures including both vocals and instruments were tested separately. higher partials will have much less results to be averaged. and Music Processing Table 2: Relative and corresponding eﬀective partial amplitudes (between parentheses). hence the sum in each column is always 1 (100%). Table 3 shows the mean RMS amplitude error resulting from the amplitude estimation of the ﬁrst 12 partials in mixtures with two to ﬁve instruments (I2 to I5 in the ﬁrst column).10 EURASIP Journal on Audio.36) 0. A mixture can be composed by the same kind of instrument. for both techniques the mean errors are smaller for higher partials.71 (0. new error rates—called modiﬁed mean error—were calculated for twoinstrument mixtures using as reference the average amplitude of the partials.7) 0 (0) 2 0. Next subsections present the main results according to diﬀerent performance aspects. Experimental Results The mixtures used in the tests were generated by summing individual notes taken from the instrument samples present in the RWC database [43]. plucked strings.7. As a consequence. 1 Inst. As stated before.64) 0. No noise besides that naturally occurring in the recordings was added. their amplitudes are usually small. 40156 mixtures of two instruments. and so does the error. Those settings were chosen in order to test the algorithm with the hardest possible conditions. . but in this case Amax is replaced by the average amplitude of the 12 partials.17) 4. The inherent dependency of the interpolation approach on clean partials makes its use very limited in real situations. and the RMS values of the sources have a 1 : 1 ratio. four and ﬁve instruments were used in the tests. The modiﬁed errors are calculated as in (8). The total errors presented in Table 3 were calculated taking only the 12 ﬁrst partials into consideration.43 (0. as described in Section 4. As can be seen. 4. As can be seen in Table 3. In total. but because the amplitudes of higher partials tend to be smaller. The results for higher partials are not shown in Table 3 in order to improve the legibility of the results. thus their results are less signiﬁcant. 2 1 1 (0. and have a minimum duration of 800 ms. due to the rules deﬁned in Section 2. in theory.04) 4 0 (0) 1 (0. As expected. In Table 3. As a response. Partial Inst. it is not part of the total error. since it is calculated having the most energetic partial as reference. there are some important remarks to be made about the results shown in Table 3. the results for the interpolation approach were obtained under ideal conditions. it is important to note that the ﬁrst partial is often the most energetic one. the partial amplitudes of each signal were normalized so the most energetic partial has a RMS value equal to 1.13) 0. The error values for the interpolation approach were obtained by taking an individual instrument playing a single note. The relative amplitudes reveal which percentage of the mixture partial should be assigned to each source. Only one line was dedicated to the interpolation approach because the ideal conditions adopted in the tests make the number of instruments in the mixture irrelevant. deal with mixtures of six or more instruments. the best results were achieved for mixtures of two instruments. although facing harder conditions. The remaining partials were not considered because their only eﬀect would be reducing the total error value. This indicates that the relative improvement achieved by the proposed algorithm with respect to the interpolation method is signiﬁcant. three. bowed strings. This is not because they are more eﬀective in those cases.1 kHz. Additionally.
6 partial cannot be estimated.5 Total −12.5 −8. As can be seen. in which case there is virtually no loss.7 −6.9 6 −13.6 −14. the frequency ﬂuctuations in diﬀerent partials may begin to decorrelate. Table 4: Modiﬁed mean error values in dB. even under such an extremely noisy condition.5.0 −16. Speech.1 −5. The interpolation results are aﬀected in almost the same way. The signals were generated as described in Section 4. like piano and guitar.0 −14. 4. Therefore. Inﬂuence of RMS Ratio.0 −15. explaining the results shown in Table 4.3.2.3 −10. As expected. Mixture partials that are strongly dominated by a single instrument normally result in better amplitude estimates. (b) Large errors.3 3 −5. Table 7 shows the results for diﬀerent ﬁxed frame lengths. the RMS ratio between the sources has little impact on the performance of the strategy. and not a problem related to the characteristics of the algorithm. The results were obtained by artiﬁcially summing the white noise to the mixtures of two signals used in Section 4. 4.4 8 −6. but scaling one of the sources to result in the RMS ratios shown in Table 6.5 −15.5 12 −19. and higher partials in that kind of mixture are more likely to be strongly dominated by one of the instruments—most instruments have strong low partials.1 and truncating the frames to the lengths shown in Table 7.8 −14.3 −16.8 −4.5 −6.7 −11. Length of the Frames. all results were obtained using twoinstrument mixtures—other mixtures were not included to avoid redundancy.4 −12.9 −13. Partial I2 I3 I4 I5 Interp.9 −10. so they will all have signiﬁcant contributions in the lower partials of the mixture.7 −7.6 −11.2 −10.4 −7.0 −6.8 Total −6.8 −12.2 −9.0 −10. it is useful to determine how dependent the performance is to the frame length. the performance is only weakly aﬀected by noise.9 −13.9 5 −6. This is because the mixtures in Table 4 were generated using instruments playing the same notes. Such a remarkable robustness to noise probably happens because.9 −7.8 −8.3 8 −15. a First 11 1 −7.2 −9.2 4 −11. the frequency variation tendencies are still able to stand out. As stated in Section 2.8 11 −18.8 −12.4 −7.6 12 −6. The signals were generated by simply taking the twopartial mixtures used in Section 4.1 Analyzing speciﬁcally Table 4. although noise introduces a random factor in the frequency tracking described in Section 3. . it can be observed that the performance of the proposed method is slightly better for higher partials. The error rates only begin to rise signiﬁcantly close to 0 dB but.4.6 −15. 1 −5.2.5 a 2 −8. For instruments whose notes decay with time.3 −9.7 5 −12.8 −9.9 7 −14. If the onset is placed after the actual position.8 −10. 4. Table 6 shows the performance of the proposal for diﬀerent RMS ratios between the sources. Performance Under Noisy Conditions.8 −4.6 3 −9. (a) Small errors: errors smaller than 10% of the frame length have little impact in the accuracy of the amplitude estimates. slightly aﬀecting the correlation values. which has a relatively mild impact in the accuracy. the error rate is only 25% greater than that achieved without any noise. and Music Processing Table 3: Mean error comparison between the proposed algorithm and the interpolation approach (in dB). The following kinds of onset location errors may occur. which indicates that this is indeed a matter of lack of information.6 10 −6.9 −9. the performance degrades as shorter frames are considered because there is less information available.0 −10. a small section of other note may be considered. The main problem here is that after the note decays by a certain amount.4 2 −5. the best way to get the most reliable results is to divide the signal in frames with variable lengths according to the occurrence of onsets.8 4 −5.1 7 −6.0 6 −6.6 9 −17.4 −8.3 9 −6.4 11 −6.9 −11. This section analyses the eﬀects of onset misplacements.2 −17. If the onset is placed before the actual position. making the estimates less reliable. a small section of the actual frame will be discarded.1. because they correlate well with the reference partials.1 −5. Future algorithm improvements may include a way of exploring the information contained in other frames to counteract the damaging eﬀects of using short frames. Table 5 shows the performance of the proposal when the signals are corrupted by additive white noise.2 −15. Onset Errors. Therefore. This kind of mistake increases the amplitude estimation error in about 2%.EURASIP Journal on Audio. From this point to the end of Section 4.1.4 −12.5 −10.2 −8.1. 4.0 10 −17. As can be seen.1 −13.2 −10. estimated onset placed after the actual position: the main consequence of this kind of mistake is that fewer points are available in the calculation of the correlations. a more damaging consequence is that the most relevant part of the signal may not be considered in the frame.6 −9. Partial Prop.4 −4.0 −14.
are valid.0 1 : 0. SNR (dB) Error (dB) 60+ −12.8 −7 −8 −9 −10 −11 −12 Table 7: Mean error in dB for diﬀerent frame lengths.0 0. The eﬀect of this kind of error is that many points that should not be considered in the correlation calculation are taken into account. The analysis of those kinds of errors is analog to that presented to the onset location errors.0 1 : 0. baritone and bass). several false potential partials will be considered in the process to determine the second reference partial. it is not usually aﬀected by the lower octave error. This explains the deterioration of the results shown in Table 8.2 −11.0 −8. The eﬀect of spurious onset is that the note will be divided into additional segments. two or three octaves below the actual value. Table 8 shows the algorithm accuracy for the actual partials when the estimated F0 for one of them is one.0 Table 6: Mean error in dB for diﬀerent RMS ratios. Impact of Lower Octave Errors. 4. the choice of the other reference partial will also not be aﬀected. There are other kinds of onset errors besides positioning—missing and spurious onsets. and they were all scaled to have the same RMS value as the vocal sample.0 1 : 0. the worse is the amplitude estimate. and the observations presented in item (b) hold. If this ﬁrst reference partial belongs to the instrument for which the fundamental frequency was misestimated. Since the choice of the ﬁrst reference partial is based on the highest correlation.1 −10.4 0. In this case. (c) Large errors. the lower octave errors usually introduce a number of low correlations that are mistakenly taken into account in the amplitude estimation procedure described in Section 3. Speech. Those observations can be extended to mixtures with any number of instruments. As can be seen in Figure 6. the larger is the error.6 0. . Figure 6 shows the dependency of the RMSE values on the extent of the onset misplacements. but deliberately misplacing the onsets to reveal the eﬀects of this kind of error. a part of the signal that does not contain the new note is considered.3 −8. the chance of one of them being taken as reference is high. and the higher is the number of F0 misestimates. estimated onset placed before the actual position: in this case.3 −11.5 −11.1. the accuracy of amplitude estimates is not aﬀected. because in this case all potential partials actually belong to the instrument. The mixtures used in this section were generated by taking vocal samples from the RWC instrument database [43] and adding an accompanying instrument or another vocal.0 50 −12. all correlations considered in this case 10 20 30 40 50 60 70 Position error in percentage of the actual ideal frame 80 Error after actual onset positionsustained instruments Error after actual onset positiondecaying instruments Error before actual onset position Figure 6: Impact of onset misplacements in the accuracy of the proposed algorithm. Several notes of all 18 instruments were considered. so there will be fewer points available for the calculation. and Music Processing Table 5: Mean error values in dB for diﬀerent noise levels.9 −12. In this case.8 0 −11. Table 9 shows the results when the algorithm was applied to separate vocals from the accompanying instruments (second row) and from other vocals (third row).8 −7. in a situation that is similar to that discussed in item (c).2 1 : 0. 1 −12. all the process is disrupted and the amplitude estimates are likely to be wrong. as they are related to the ﬁrst reference partial. two segments containing diﬀerent notes will be considered.4. 4. The vocals were produced by male and female adults with diﬀerent pitch ranges (soprano. and consist of vowels being verbalized in a sustained way.5 −12.6 0 if the strongest part of the note is not considered. The results shown in the ﬁgure were obtained exactly in the same way as those in Section 4.0 30 −12. Problems occur when the ﬁrst reference partial belongs to the instrument whose F0 was correctly estimated. As commented before.0 20 −11.7 −8. As stated in Section 2. with at least one being a vocal sample. In the case of missing onset.6.12 EURASIP Journal on Audio. Since those false partials are expected to have very low correlations with respect to all other partials. As a result.2. Also.9 10 −11. the results tend to be worse. which is chosen based on the lowest correlation. Separating Vocals. all mixtures are composed by two sources. Length (s) Proposal Interpol. the worse will be the results.0 40 −12. The results shown in Table 9 refer only to the vocal sources. Mean error (dB) −6 Ratio Error (dB) 1:1 −12. and the conditions are the same as those used to generate Table 3. lower octave errors may aﬀect the accuracy of the proposed algorithm when this is cascaded with a F0 estimator. Therefore.7. tenor.7 −12. alto.
7 −6.1 −10.3 4 −11. The proposal has several advantages over its predecessors. highlighting its strong characteristics and pointing out the aspects that still need improvement. where X1 and X2 are random variables representing the actual partial frequency trajectory. 2 2 E(A2 · A3 ) = bσX2 + aE(X1 )E(X2 ) + bσN2 + bE2 (X2 ).1 5 −12.A3 E(A1 A3 ) − E(A1 )E(A3 ) = ρA2 .4) . Since this is a technology far from mature.0 7 −13. + aE2 (X1 ) − aE2 (X1 ) + aE2 (N1 ) − aE2 (N1 ) =0 =0 2 2 = aE X1 − aE2 (X1 ) + bE(X1 )E(X2 ) + aE N1 − aE2 (N1 ) + aE2 (N1 ) + aE2 (X1 ) =0 = 2 aσX1 2 + bE(X1 )E(X2 ) + aσN1 + aE2 (X1 ). In that context.5 −7.9 11 −18.8 9 −14. A3 = aA1 + bA2 = aX1 + bX2 + aN1 + bN2 and ρA1 .A3 σA 1 σA 3 · 4.5 −13.5 8 −15. many of the solutions adopted did not perform perfectly.9 6 −13.6 10 −14. The extension of the technique to the speech separation problem is currently under investigation. Conclusions This paper presented a new strategy to estimate the amplitudes of coincident partials. This is 2 done to explore the relation σX = E(X 2 ) − E2 (X) and. σA 2 σA 3 . Although it presents a better performance than its predecessor. the ability to estimate the ﬁrst partial. Partial V×I V×V 1 −7.3 −13.2 −6.9 −15. The problem of estimating the amplitude of coincident partials is a very diﬃcult one.5 −14.3 −13. and Music Processing Table 8: Mean error in dB for some lower octave errors.8 −13.EURASIP Journal on Audio. Partial no error 1 octave 2 octaves 3 octaves 1 −7. independent noise associated. (A. Additionally.9 −11. and N1 and N2 are random variables representing the zeromean.5 12 −14.2 8 −14. E(A2 A3 ) − E(A2 )E(A3 ) (A. and so forth.9 −9. Future versions may include new procedures to reﬁne the estimates.3 −13.7 −14. reliable estimates even if the instruments are playing the same note.0 −10. this is a technology in its infancy. Future work will try to extend the technique to the speech separation problem.4 −9. such as better accuracy.1 −12.7 12 −18. Final Remarks. this paper was intended to be a starting point in the development of a new family of algorithms capable of overcoming some the main diﬃculties currently faced by both amplitude estimation and sound source separation algorithms. However.8 Table 9: Mean error in dB for vocal signals.8. assuming that E(N1 ) = 0 and E(N2 ) = 0. which shows its potentiality.9 2 −8.7 −8.3 −6. More than that. This indicates that the algorithm is also suitable for dealing with vocal signals.5 −9.1 −16.1 −9.2 −17.5 11 −14.8 −10.9 −14.6 Total −11. (A. (A.5 −11.2 −13.9 −15.4 −15. each part of the algorithm will probably be under scrutiny in the near future.9 Total −12.2) The terms aE2 (X1 ) and aE2 (N1 ) are then summed and subtracted to the expression.5 −7.9 3 −9.9 −7.3 −4.3) Following the same steps. the results for vocal sources are only slightly worse than those achieved for musical instruments.9 −13.2 −10.9 10 −17.8 5 −12. Then. the strategy is robust to noise and is able to deal with any number of simultaneous instruments.2 −7.3 −16. the algorithm performs reasonably well in most cases.6 −12.4 −9.2 −11. Appendix Proof of Lemma 1 Let A1 = X1 + N1 and A2 = X2 + N2 .6 9 −17. like using the information from previous frames to verify the consistency of the current estimates.3 −15. each term of (A.2 −13.2 −10.8 13 2 −8.1 As can be seen.2 −17.1 −13.6 4 −11.1) is expanded: E(A1 · A3 ) = E[(X1 + N1 ) · (aX1 + bX2 + aN1 + bN2 )] 2 2 = aE X1 + bE(X1 )E(X2 ) + aE N1 .5 −11.4 7 −14. The main motivation for this paper was to propose a completely diﬀerent way of tackling the problem of amplitude estimation. keeping it unaltered.9 −5. In short.5 −11.6 −13.1 −15. the expression becomes 2 2 E(A1 · A3 ) = aE X1 + bE(X1 )E(X2 ) + aE N1 5. there is still room for improvement. Speech.1 −5. and there are some pathological cases in which the method tends to fail completely.2 3 −9.3 −15.4 −14.1) In the following.9 6 −13.
200–212. Kawamura. 3. 2007.” IEEE Transactions on Audio. Attias. Speech and Language Processing.A3 b (A. LinhTrung. “Blind source separation based on a fastconvergence algorithm combining ICA and beamforming. “Audio source separation with a single sensor. 2006. ¨ [8] R. J.” in Proceedings of the Audio Engineering Society Convention. pp. Molla and K. [4] S. “Underdetermined anechoic blind source separation via q 1. I. and Signal Processing (ICASSP ’03). Sudhakar Ganti for his help. Donagh. and T. [19] J. and Music Processing [5] N.8) Aditionally if σX1 ≡ σX2 .W. K. Communications and Signal Processing. pp. A. vol. The authors also would like to thank Dr.” IEEE Transactions on Audio. “Singlechannel speech separation using soft mask ﬁltering. 897– 907. “Underdetermined blind separation of nondisjoint sources in the timefrequency domain. M. vol. Saruwatari. pp. 57–60. Speech and Language Processing. Sikora. K. Hirose. the expression becomes 2 2 a · σX1 + σN1 ρA1 . pp. G. “Integration and adaptation of harmonic and inharmonic models for separating polyphonic musical signals.” Journal of the Acoustical Society of America. Okuno. EURASIP Journal on Audio. [10] A. pp. Benaroya. and K.1) and eliminating selfcancelling terms. 1540– 1550. vol. pp. and Y. [9] A. December 2005. 5. pp. and R. Speech. [12] A. “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. 8. Canada. vol. 16. 1511–1520. 1528–1540. 2006.7) 2 ρA1 .” IEEE Transactions on Audio. 1466–1470. and S.” IEEE Transactions on Audio. vol. 1. ı Belouchrani. May 1999. H. 8. Speech and Language Processing. [13] Y. K.” in Proceedings of IEEE International Conference on Acoustics. 3. E(A2 ) · E(A3 ) = E(X2 )(aE(X1 ) + bE(X2 )) = bE2 (X2 ) + aE(X1 )E(X2 ). 15.A3 σA 1 b · σX2 + σN2 (A. Speech and Language Processing. McKeown. Speech and Language Processing. 3. Speech. pp. Radfar and R. References [1] K. Nandi. 1. Gupta. Speech and Signal Processing (ICASSP ’07). Abramson and I. 2007. . T. University of Victoria. Ogata. ρA2 .” IEEE Transactions on Audio. (A.5) Substituting all these terms in (A. Speech and Language Processing. pp. “On the use of auditory representations for sparsitybased sound source separation. 2004.” IEEE Transactions on Audio.” IEEE Transactions on Signal Processing. no. [16] T. no. 120. 52. no. no. Tolonen. N. AbedMeraim.” in Proceedings of the 5th International Conference on Information. Bimbot.” in Proceedings of IEEE International Conference on Acoustics. no. “Methods for separation of harmonic sound sources using sinusoidal modeling. 14. 2006. “Spatiotemporal FastICA algorithms for the blind separation of convolutive mixtures. no. “Singlemixture audio source separation by subspace decomposition of hilbert spectrum. pp. vol. 2007. and H. pp. vol. Speech and Language Processing. vol. no.” IEEE Transactions on Signal Processing.” IEEE Transactions on Audio. H. [14] T. “Blind ı separation of underdetermined convolutive mixtures using their timefrequency representation. then 2 2 2 2 a · σX1 + σN1 σX2 + σN2 ρA1 . S. [11] M. Dansereau. 5. 2007. Lee. F. preprint 4958. T. AbedMeraim. F. A. 15. M. 2007. 15. 4. T. no. 2. “Singlesensor audio source separation using classiﬁcation and estimation approach and GARCH modeling.14 The other terms are given by E(A1 ) · E(A3 ) = E(X1 )(aE(X1 ) + bE(X2 )) = aE2 (X1 ) + bE(X1 )E(X2 ). then (A. pp. K. vol. 2007. 15. pp. Shikano. “Pitchbased monaural segregation of reverberant speech. “Non negative sparse representation for Wiener based source separation with a single sensor. no. 15. 1830–1846. Nishikawa. [17] K. Gribonval. vol. 70–79. Speech and Language Processing. 613–616. 55.” IEEE Transactions on Audio. M. 1. 4004–4017. 191–199. Komatani. T. no. “Blind source separation exploiting higherorder frequency dependencies. “Blind separation of speech mixtures via timefrequency masking. Cohen. A¨ssaElBey. · 2 2 2 2 ρA2 . Benaroya. M. Kim. “Separation of singing voice from music accompaniment for monaural recordings. 2007. Burred and T. Grenier. and R. H. 14. vol. vol.” IEEE Transactions on Audio. Gribonval. 15.” IEEE Transactions on Audio. Lee. no. J. Lee.A3 b · σX2 + σN2 σX1 + σN1 2 If σN1 2 2 σX1 and σN2 2 σX2 .6) 2 2 2 2 2 2 Considering that σA1 = σX1 + σN1 and σA2 = σX2 + σN2 . Rickard. and R. Itoyama. 1475–1487. C. M. 2299– 2310. Speech and Language Processing. vol. Wang. pp. Work was performed while the ﬁrst author was with the Department of Computer Science. Saab. ρA1 . April 2007. [7] M. Douglas. [15] L. Yilmaz. A¨ssaElBey. = 2 2 ρA2 . no. Li and D. 1066–1074. [18] L. Virtanen. K. 458–469. vol.Y. 55.” IEEE Transactions on Signal basispursuit With q Processing.A3 a σX σX a σX = · 21 · 2 = · 1 . Goto. 2007. 8. ρA2 . “Multichannel blind deconvolution for source separation in convolutive mixtures of speech. April 2003. pp. 666–678. vol. Abugharbieh. 14.” IEEE Transactions on Audio. no.9) Acknowledgments Special thanks are extended to Foreign Aﬀairs and International Trade Canada for supporting this work under its PostDoctoral Research Fellowship Program (PDRF). 1. Wang. Sawada. L. [2] H.A3 a = . 7. [3] T. Roman and D. and Y. no. pp. O. Kokkinakis and A. 2007. pp. 2006. Yilmaz and S.A3 σA · 2. Speech and Language Processing. Grenier. no. Speech and Language Processing. Bimbot.A3 = . 2008. Makino. 15. pp.A3 b σX2 σX1 b σX2 (A. 893–900. ¨ [6] O.
[27] T. Eds. [30] Z. Coyle. pp.” in Proceedings of the Audio Engineering Society Convention. [23] J. Burred and T. Quatieri. Li. “Sinusoidal modeling for nonstationary voiced speech based on a local vector transform. Finland. vol. “Development of the RWC music database. 396–405. 2006. Speech.EURASIP Journal on Audio. “A method for separation of overlapping partials based on similarity of temporal envelopes in multichannel mixtures. Wang.” in Proceedings of IEEE International Conference on Acoustics. vol. Every and J. Brown.” Journal of the Acoustical Society of America. pp. “Convolutive blind source separation in the frequency domain based on sparse representation. C. vol.” IEEE Transactions on Audio. 1257–1272. and G. “Melody extraction and musical onset detection via probabilistic models of framewise STFT peak data. 1717–1727. Gainza.” IEEE Transactions on Audio.. S. pp. S. “A multipitch tracking algorithm for noisy speech. Duan. no. 2000. pp. 2004. Speech. no. pp. no. “Multiple fundamental frequency estimation by summing harmonic amplitudes. pp. [34] T. Tampere University of Technology. Thornburg. [25] T.” IEEE Transactions on Audio. pp. [29] H. Holm. pp. Cichocki. Zhang. 229–241. Roebel. 14. R. 5. pp. “Monaural source separation from musical mixtures based on timefrequency timbre models. [37] [38] [39] [40] [41] [42] [43] . Speech and Language Processing.” in Proceedings of the 18th International Congress on Acoustics (ICA ’04). Roads. J.M. V¨ lim¨ ki. D. 2. 538–543. 3.” in 2009 IEEE International Conference on Acoustics. Speech. 2008. J. [26] A. 2004. 1986. “Separation of harmonic sounds using linear models for the overtone series. no. 1757–1760. O. vol. Sikora. pp. 121. Shi. vol. Szymanski. 15 Speech and Language Processing. 3. Speech and Language Processing. D. no. and J. 3169–3172. 99. Ito and M. vol. preprint 6312. A. 11. June 2000. and Signal Processing (ICASSP ’09). 15. 2007. C. [32] J. M. vol. 765–768. Serra. Swets & Zeitlinger. R. Klapuri. pp. G. 2007. 2006. “Separation of harmonic sound sources using sinusoidal modeling. [22] J. Yeh and A.D. 2003.” in Proceedings of the COSTG6 Conference on Digital Audio Eﬀects. [36] H. and G. 149–152. 12. Ph.M. Yano. “Speech analysis/synthesis based on a sinusoidal representation. no. Maher. 553–556. pp. pp. He. 121. and J. and Signal Processing. and A. no. 2. “Evaluation of a method for separating digitized duet signals. vol. and E. 15. and V.” Journal of the Audio Engineering Society. Rauhala. Virtanen. 2006. J. Picialli.” in Proceedings of the International Conference on Music Information Retrieval. Xie. “Auditory segmentation based on onset and oﬀset analysis. 2008. J. vol. pp.” IEEE Transactions on Acoustics. Evangelista. Wu. Goto. Lawlor.” IEEE Transactions on Audio. Speech. no. pp. A. 3. [35] M. 4. Wang.” in Proceedings of the International Computer Music Conference (ICMC ’87). “Harmonic sound source separation using FIR comb ﬁlters. McAulay and T. Hu and D. 4. “PARSHL: an analysis/synthesis program for nonharmonic sounds based on a sinusoidal representation. pp. Y.” IEEE Transactions on Audio. Lehtonen. Woodruﬀ. Zhang. Klapuri. Serra. [28] M. L. H. pp. 1997. no.” in Musical Signal Processing. vol. 2006.” IEEE Transactions on Speech and Audio Processing. 1551–1563.” in Proceedings of the International Conference on Music Information Retrieval. 38. [31] J. S. 15. Speech and Signal Processing (ICASSP ’02). pp. “Fast automatic a a inharmonicity estimation algorithm. Brown. “Resolving overlapping harmonics for monaural musical sound separation using pitch and common amplitude modulation. J. 2007. 34. 5. “Unsupervised singlechannel music source separation by average harmonic structure modeling. 744– 754. [21] M. 1051–1061. Speech and Language Processing. pp. no. Pope. 1996. twn. pp. April 2009. C. M.” IEEE Transactions on Audio. 4. “Frequency ratios of spectral components of musical sounds. C. Virtanen and A. 956–979. 2007. E. 14. 1210–1218. EL184–EL189. “The expected amplitude of overlapping partials of harmonic sounds. Speech and Language Processing. Smith and X. 5. Poli. 2007. Klapuri. Virtanen and A.” Journal of the Acoustical Society of America. 2007. “Musical sound modeling with sinusoids plus noise.nipulation of polyphonic musical signals. and Z.” in Proceedings of the International Conference on Music Information Retrieval. pp. C. pp. Ding. 1987. 16. Sound source separation in monaural music signals. and Signal Processing (ICASSP ’00). “Robust multipitch estimation for the analysis and ma. 1990. [24] X. Viste and G. no. no. F. Leistikow. 1845–1856. and D. 216–221. May 2002. and Music Processing [20] Z. Speech and Language Processing. B.” in Proceedings of IEEE International Conference on Acoustics. T. Wang.” Journal of the Acoustical Society of America. “Separation of synchronous pitched notes by spectral ﬁltering of harmonics. Virtanen. Y. 766–778. 91–122. [33] R. L. 290–297. Klapuri. vol. Berger. 141–146. dissertation. vol.