2. Learn a classifier function f (x) ∈ [0, 1] which returns higher values for segments more likely to be part of the target object.

3. Classify segments with the learned function f (x).

4. Combine each of the classifications of f (x), F (Xi) = C[f (xi1), f (xi2), . . . , f (xin)] ∈ {0, 1}. The combinator C[·] returns a positive result if the average of f (x) for all segments is greater than 0.5 (as per [7]).
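A minimal sketch of this combination step in Python (ours, for illustration; it assumes f is an already trained segment classifier returning values in [0, 1]):

```python
import numpy as np

def combine(f, segments):
    """Step 4: combine per-segment scores into an example-level label.

    Returns 1 (target present) when the average of f(x) over all
    segments of the example exceeds 0.5, and 0 otherwise, as per [7].
    """
    scores = [f(x) for x in segments]  # each f(x) lies in [0, 1]
    return int(np.mean(scores) > 0.5)
```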
We chose this approach because it has shown promise, even compared to more sophisticated approaches, in similar problems, such as learning to recognize visual objects from cluttered data [7]. We describe the extraction operator in Section 2.1 and the segment classifiers f (x) in Section 2.2.

2.2 Segment Classifiers

We compare three classifiers: Extra Trees [8], which, as we will explain, should be fairly appropriate for this task; a very simple classifier, the K-nearest neighbors algorithm [4]; and a fairly sophisticated classifier, the Support Vector Machine [2] with a radial basis kernel. This comparison is detailed in Section 4.

It is important to recognize the noisy nature of the labeling data: it is possible that a segment from a positive example, which would be labeled positively, contains no energy from the target instrument. Since labels are weak, a positive example for “saxophone” could be a jazz recording containing many portions where the saxophone is silent.
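For concreteness, the three segment classifiers could be instantiated as follows, for instance with scikit-learn (our choice; the paper does not name an implementation, and the hyperparameter settings here are illustrative placeholders, not the values used in the experiments):

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# The three segment classifiers compared in Section 4.
classifiers = {
    "extra.tree": ExtraTreesClassifier(n_estimators=100),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "svm": SVC(kernel="rbf", probability=True),  # radial basis kernel
}

# Each classifier is fit on labeled per-segment feature vectors with
# clf.fit(X_train, y_train); the score f(x) is then the positive-class
# probability, clf.predict_proba(X)[:, 1].
```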
The results in [3] suggest that methods such as bagging or randomization work well under such noisy labeling conditions. The Extra Tree algorithm [8] is an ensemble method that uses randomized decision trees. It differs from other randomized decision trees in that, with the lowest parameter setting, the algorithm completely randomizes both the split and the attribute selected at each decision node, yielding a tree completely independent of the training data (Extra Tree is short for extremely randomized tree). The construction of a single Extra Tree is summarized below.
1. Let the subset of training examples be S, the feature-split pool size be K, and the minimum subset size be nm.

2. If |S| < nm, if all feature values are constant, or if all examples in S share the same label:

(a) Define this tree’s output f (x) as the value of the most frequent label in S.

3. Otherwise:

(a) Select K split si and feature fi pairs uniformly at random.

(b) Select the best split-feature pair smax, fmax by entropy measures.

(c) Let the output f (x) be the output of the right tree if fmax < smax for x, and the output of the left tree if fmax ≥ smax.

(d) Create the right tree, with S equal to the subset of examples with fmax < smax.

(e) Create the left tree, with S equal to the subset of examples with fmax ≥ smax.
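A compact sketch of this construction (ours, not the reference implementation of [8]; it assumes a feature matrix X and non-negative integer class labels y, and omits refinements such as attribute sampling without replacement):

```python
import numpy as np
from collections import Counter

def entropy(y):
    """Shannon entropy of an integer label array."""
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def build_extra_tree(X, y, K, n_min):
    """Grow one Extra Tree following steps 1-3 above (sketch)."""
    # Step 2: stop on small subsets, constant features, or pure labels.
    if len(y) < n_min or np.all(X == X[0]) or len(set(y)) == 1:
        return ("leaf", Counter(y).most_common(1)[0][0])  # step 2(a)
    # Step 3(a): draw K (feature, split) pairs uniformly at random.
    feats = np.random.randint(X.shape[1], size=K)
    splits = [np.random.uniform(X[:, f].min(), X[:, f].max()) for f in feats]

    def gain(f, s):
        # Entropy reduction from splitting on feature f at threshold s.
        right = X[:, f] < s
        if right.all() or not right.any():
            return -np.inf  # degenerate split, unusable
        return entropy(y) - (right.mean() * entropy(y[right])
                             + (1 - right.mean()) * entropy(y[~right]))

    # Step 3(b): keep the best split-feature pair by entropy measures.
    gains = [gain(f, s) for f, s in zip(feats, splits)]
    best = int(np.argmax(gains))
    if gains[best] == -np.inf:  # all K draws were degenerate
        return ("leaf", Counter(y).most_common(1)[0][0])
    f_max, s_max = feats[best], splits[best]
    right = X[:, f_max] < s_max
    # Steps 3(c)-(e): recurse; examples with f_max < s_max go right.
    return ("node", f_max, s_max,
            build_extra_tree(X[right], y[right], K, n_min),
            build_extra_tree(X[~right], y[~right], K, n_min))
```

An ensemble prediction then averages f (x) over many trees grown this way.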
Results in [8] suggest that with appropriate parameter settings, as detailed in Section 4.1.1, the Extra Tree ensemble outperforms other random tree ensemble methods.
3 DATA

For our initial tests we wanted to tightly control the difficulty of our examples, allowing us to observe how each of our classifiers degrades as the data becomes more challenging. So, rather than use commercial recordings, we trained and tested on synthesized mixtures of randomly selected notes from four instruments. Each positive example contains a mixture of instruments, one of which corresponds to the label the system should learn. Thus, a positive example for “Cello” might contain a Cello, Violin and Oboe. The four instrument classes used for our source sounds were Cello, Violin, Oboe, and Bb Clarinet. Instrument recordings were taken from a set of instrument samples made available by the University of Iowa (http://theremin.music.uiowa.edu/index.html). The ranges of pitches used in our experiments for the instruments were as follows: for Cello we used notes from C2 to C7, for Violin from G3 to Db7, for Clarinet from D3 to B6, and for Oboe from Bb3 to C6. These were played at three different dynamic levels: pp, mf and ff. All samples were encoded as linear PCM 16-bit wav files at a sampling rate of 44100 Hz. The Iowa recordings were segmented into individual notes. Training and testing examples were constructed as follows.
1. Select a set of instruments.

2. For each instrument, create a melody by randomly picking a series of note recordings from the database.

3. For each instrument melody, intersperse the notes with silences of random length from 0 to some maximum value.

4. Adjust instrument volumes to obfuscate the target instrument to the desired level.

5. Sum the melodies to create a mixture.
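The following sketch (ours, with hypothetical names; it assumes the note recordings for each melody have already been randomly drawn from the database, as in step 2) shows how such a mixture might be assembled from per-note sample arrays:

```python
import numpy as np

SR = 44100  # sampling rate of the note recordings (Hz)

def make_melody(notes, max_silence, rng):
    """Steps 2-3: chain note recordings, separated by random silences."""
    parts = []
    for note in notes:  # each note: 1-D float array sampled at SR
        parts.append(note)
        parts.append(np.zeros(int(rng.uniform(0, max_silence) * SR)))
    return np.concatenate(parts)

def make_mixture(target_notes, distractor_note_lists, s_t, s_d, a_d, rng):
    """Steps 1-5: build one example. s_t and s_d are the maximum silences
    for target and distractor melodies; a_d scales distractor amplitude."""
    melodies = [make_melody(target_notes, s_t, rng)]
    melodies += [a_d * make_melody(notes, s_d, rng)  # step 4
                 for notes in distractor_note_lists]
    mix = np.zeros(max(len(m) for m in melodies))
    for m in melodies:
        mix[:len(m)] += m  # step 5: sum the melodies
    return mix
```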
Examples were a minimum of 10 seconds long. The average length of an example was 11.5 seconds and the longest example was 17.3 seconds. To vary the level of target obfuscation we varied the maximum spacing between notes in the target instrument melody (st) and notes in the distractor instrument melodies (sd), as well as the relative amplitude of the distractor notes (ad), across three target obfuscation conditions. These values were selected so that we would obtain a range of target to mixture ratios, as defined in Section 4.2. For easy obfuscation, st = 0.5, sd = 1 and ad = 1; for medium obfuscation, st = 1.5, sd = 1 and ad = 1; and for hard obfuscation, st = 1.5, sd = 0.5, and ad = 1.5. Thus, for easy obfuscation, target instrument notes were more frequent than distractors and of equal amplitude. For hard obfuscation, distractors were significantly more frequent and louder than targets.

Conditions varied along three dimensions: target instrument (one of the four instruments), ensemble of distractor instruments (one to two of the other instruments in all possible combinations), and target obfuscation (from easy to hard). This resulted in a total of 4 × 6 × 3 = 72 data conditions. We created 10 positive examples and 10 negative examples for each of the conditions, yielding a total of 1440 example audio files.

Some typical correctly and incorrectly classified examples for the mixed condition (from Section 4.1) can be found at the following URL: http://www.cs.northwestern.edu/~dfl909/ismir2008sounds

4 EVALUATION

In our evaluation we sought to answer the following questions.
1. What effect does training on mixtures, rather than isolated notes, have on accuracy when identifying instruments in novel mixtures of sound?

2. How does accuracy degrade across our three segment classifiers as the target class is increasingly obfuscated by the distractors in the mixture?

3. How does performance using weak labels compare to performance using stronger labels?
We begin our evaluation by addressing questions 1 and 2 in Section 4.1. Question 3 is addressed in Section 4.2. In the first (mixed) training condition, we used the following procedure for each target instrument class T:

1. Train the learner on the examples from 5 of the 6 possible ensembles of distractors using target T, totaling 50 positive and 50 negative examples.

2. Test on the remaining 20 examples from the final ensemble condition.

3. Repeat steps 1-2, excluding a different condition each time.

In the second (isolated) training condition, we used this procedure for each target instrument class T and each target obfuscation condition.
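A schematic of this leave-one-ensemble-out loop (our sketch; fit_classifier and evaluate are hypothetical placeholders for the training and scoring routines):

```python
def leave_one_ensemble_out(examples_by_ensemble):
    """Steps 1-3 above: hold out each distractor ensemble in turn.

    examples_by_ensemble maps each of the 6 ensemble conditions to its
    20 examples (10 positive, 10 negative) for a fixed target T.
    """
    accuracies = []
    for held_out in examples_by_ensemble:
        train = [ex for cond, exs in examples_by_ensemble.items()
                 if cond != held_out for ex in exs]   # 100 examples
        test = examples_by_ensemble[held_out]         # 20 examples
        model = fit_classifier(train)   # placeholder training routine
        accuracies.append(evaluate(model, test))      # placeholder scorer
    return sum(accuracies) / len(accuracies)
```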
[Figure: proportion correct (0.3 to 0.9) for the extra.tree, svm, and knn classifiers under easy, medium, and hard obfuscation; isolated training condition.]

We also wanted to see how a more exact labeling of segments would affect our results. To do this we determined the ratio of the target instrument to the total mixture (R) in each segment selected for classification. This was found using the following equation:

R = ‖proj_t m‖² / ‖m‖²    (1)

The linear projection of the target onto the mixture, proj_t m, is used as an estimate of the target that remains in the mixture, as per [14]. An R of 0.5 means half the energy in a
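Concretely, Equation (1) can be computed as in the following sketch (our reconstruction, which reads proj_t m as the orthogonal projection of the mixture signal m onto the target signal t):

```python
import numpy as np

def target_to_mixture_ratio(t, m):
    """Equation (1): share of the mixture's energy explained by the target.

    t, m: time-domain signals of equal length as 1-D arrays. R equals
    the squared cosine of the angle between t and m, so it lies in [0, 1].
    """
    proj = (np.dot(m, t) / np.dot(t, t)) * t  # proj_t m
    return np.dot(proj, proj) / np.dot(m, m)
```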
strument to be identified, through, for instance, source separation methods [15], missing feature approaches [5] or weighting of features [10]. Here we presented an alternative approach: rather than being smart about how we use isolated training data, we have simply used a variety of mixtures as our training data.

This simple method makes classification possible in mixtures of random notes using only weakly labeled data, when it is not possible to do so when trained with isolated training data, without one of the above-mentioned methods. We believe this is because the training data, in the mixed case, is more representative of the testing data, even when the training mixtures do not use the same set of instruments as the testing mixtures. It is possible that the use of mixtures during training could be used in conjunction with these other methods to further improve performance.

We found that when using more strongly labeled data, our results did not improve much (Section 4.2). We believe the most likely explanation of this result is that we have hit a performance ceiling for the chosen feature set. This may also explain the similar performance of Extra Trees and the SVM. In future work, now that we have identified a potentially useful training method and learning algorithm for this task, we will begin to explore a more effective set of features.

We will also need to confirm these results on more musical arrangements and more challenging recordings (e.g. varied articulation, echoes, non-instrumental background, etc.). Since our algorithm does not a priori depend on these simplifications, we suspect the conclusions drawn here will be meaningful in more challenging environments.

6 ACKNOWLEDGEMENTS

We’d like to thank Josh Moshier for his time preparing instrument data. This work was funded by National Science Foundation Grant number IIS-0643752.

7 REFERENCES

[1] P. Carbonetto, G. Dorko, C. Schmid, H. Kück, and N. de Freitas. Learning to recognize objects with little supervision. International Journal of Computer Vision, 2007.

[2] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[3] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 2000.

[4] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, chapter 4.4. Wiley-Interscience, 2000.

[5] J. Eggink and G. J. Brown. A missing feature approach to instrument identification in polyphonic music. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03), volume 5, pages 553–556, 2003.

[6] S. Essid, G. Richard, and B. David. Instrument recognition in polyphonic music based on automatic taxonomies. IEEE Transactions on Audio, Speech and Language Processing, 14(1):68–80, 2006.

[7] P. Geurts, R. Maree, and L. Wehenkel. Segment and combine: a generic approach for supervised learning of invariant classifiers from topologically structured data. In Machine Learning Conference of Belgium and The Netherlands (Benelearn), 2006.

[8] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

[9] P. Herrera-Boyer, A. Klapuri, and M. Davy. Automatic classification of pitched musical instrument sounds. In A. Klapuri and M. Davy, editors, Signal Processing Methods for Music Transcription, pages 163–200. Springer Science+Business Media LLC, New York, NY, 2006.

[10] T. Kitahara, M. Goto, K. Komatani, T. Ogata, and H. G. Okuno. Instrument identification in polyphonic music: Feature weighting to minimize influence of sound overlaps. EURASIP Journal on Advances in Signal Processing, 2007.

[11] A. Livshin and X. Rodet. Musical instrument identification in continuous recordings. In 7th International Conference on Digital Audio Effects, 2004.

[12] K. D. Martin and Y. E. Kim. Musical instrument identification: A pattern-recognition approach. In 136th Meeting of the Acoustical Society of America, 1998.

[13] J. Menke and T. R. Martinez. Using permutations instead of Student’s t distribution for p-values in paired-difference algorithm comparisons. In IEEE International Joint Conference on Neural Networks, volume 2, 2004.

[14] E. Vincent, R. Gribonval, and C. Fevotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech and Language Processing, 14(4):1462–1469, 2006.

[15] E. Vincent and X. Rodet. Instrument identification in solo and ensemble music using independent subspace analysis. In International Conference on Music Information Retrieval, 2004.