1
Classification
In our case:
(1) the items are audio signals (e.g. sounds, songs, excerpts);
(2) their characteristics are the features we extract from them (MFCC,
chroma, centroid);
(3) the classes (e.g. instruments, genres, chords) fit the problem definition
2
Example
200 sounds of 2 different kinds (red and blue); 2 features extracted per sound
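To make this toy setup concrete, here is a small NumPy sketch that builds such a dataset; the feature values are made up for illustration and are not taken from the lecture figures:

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 "red" and 100 "blue" sounds, each described by 2 features
# (hypothetical values, e.g. spectral centroid and zero-crossing rate)
red  = rng.normal(loc=[0.30, 0.10], scale=0.05, size=(100, 2))
blue = rng.normal(loc=[0.45, 0.25], scale=0.05, size=(100, 2))

X = np.vstack([red, blue])           # 200 x 2 feature matrix
y = np.array([0] * 100 + [1] * 100)  # labels: 0 = red, 1 = blue
```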
3
Example
4
Example
5
Example
6
Classification of music signals
Artist ID
Genre Classification
Music/Speech Segmentation
Music Emotion Recognition
Transcription of percussive instruments
Chord recognition
Re-purposing of machine learning methods that have been successfully used
in related fields (e.g. speech, image processing)
7
A music classifier
8
Feature set (recap)
9
Feature set (what to look for?)
(1) can be robustly estimated from audio (e.g. spectral envelope vs onset rise
times in polyphonies)
(2) relevant to the classification task (e.g. MFCC vs chroma for instrument ID) ->
noisy features make classification more difficult!
Classes are never fully described by a point in the feature space but by the
distribution of a sample population
10
Features and training
11
Feature set (what to look for?)
12
Feature set (what to look for?)
(4) low(er) dimensional feature space -> Classification becomes more difficult
as the dimensionality of the feature space increases.
13
Feature distribution
14
Feature distribution
N(x_l;\, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_l - \mu)^2}{2\sigma^2}}

\mu = \frac{1}{L}\sum_{l=1}^{L} x_l

\sigma^2 = \frac{1}{L}\sum_{l=1}^{L} (x_l - \mu)^2
15
Feature distribution
C_x = \frac{1}{L}\sum_{l=1}^{L} (x_l - \mu)(x_l - \mu)^T
16
Feature distribution
17
Data normalization
To avoid bias towards features with wider range, we can normalize all to have
zero mean and unit variance:
\tilde{x} = (x - \mu)/\sigma, \qquad \tilde{\mu} = 0, \quad \tilde{\sigma} = 1
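A minimal sketch of this normalization, assuming a feature matrix X with one row per sound and one column per feature (names are illustrative):

```python
import numpy as np

def zscore(X):
    """Normalize each feature (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard against constant features
    return (X - mu) / sigma, mu, sigma
```

At test time, new samples should be normalized with the mu and sigma estimated on the training set.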
18
PCA
19
PCA
Unit variance
20
PCA
How to choose A?
C_y = \frac{1}{L} Y Y^T = \frac{1}{L}(AX)(AX)^T = A\left(\frac{1}{L} X X^T\right) A^T = A\, C_x\, A^T
Any symmetric matrix (such as Cx) is diagonalized by an orthogonal matrix E
of its eigenvectors
21
PCA
In MATLAB (code listing from http://www.snl.salk.edu/~shlens/pub/notes/pca.pdf)
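The listing the slide refers to is in MATLAB; a rough NumPy equivalent of the same eigen-decomposition approach (a sketch, assuming the data matrix X is D features by L observations, as in the covariance formula above):

```python
import numpy as np

def pca(X):
    """PCA by eigen-decomposition of the covariance matrix.

    X: D x L data matrix. Returns A (rows = principal directions,
    sorted by decreasing variance) and Y = A @ Xc, the rotated data.
    """
    D, L = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)   # remove the mean
    Cx = (Xc @ Xc.T) / L                     # D x D covariance matrix
    evals, E = np.linalg.eigh(Cx)            # eigen-decomposition of symmetric Cx
    order = np.argsort(evals)[::-1]          # reorder by decreasing eigenvalue
    A = E[:, order].T                        # each row is a principal direction
    return A, A @ Xc
```

Keeping only the first M rows of A before projecting gives the dimensionality reduction described on the next slide.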
22
Dimensionality reduction
We can then use an M x D subset of this reordered matrix (its top M rows) for PCA, such that
the result corresponds to an approximation using the M most significant
eigenvectors
This is equivalent to projecting the data into the few directions that maximize
variance
23
Discrimination
Let us define:
\mu: the global mean of all feature vectors; \mu_k: the mean of class k

Between-class scatter matrix:
S_b = \sum_{k=1}^{K} \frac{L_k}{L}\,(\mu_k - \mu)(\mu_k - \mu)^T
24
Discrimination
trace{S_b} measures the average distance between the class means and the global mean
across all classes; S_w, the within-class scatter matrix, is the corresponding class-size-weighted
sum of the per-class covariance matrices.

J_0 = \frac{\mathrm{trace}\{S_b\}}{\mathrm{trace}\{S_w\}}

J_0 is high when samples from a class are well clustered around their mean (small
trace{S_w}), and/or when different classes are well separated (large trace{S_b}).
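A minimal sketch of these quantities, assuming X is an L x D matrix of feature vectors and y an array of integer class labels (function name is illustrative):

```python
import numpy as np

def separability_J0(X, y):
    """trace(Sb) / trace(Sw) for feature matrix X (L x D) and labels y."""
    L, D = X.shape
    mu = X.mean(axis=0)                           # global mean
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for k in np.unique(y):
        Xk = X[y == k]
        Lk = len(Xk)
        mk = Xk.mean(axis=0)                      # mean of class k
        d = (mk - mu)[:, None]
        Sb += (Lk / L) * (d @ d.T)                # between-class scatter
        diffs = Xk - mk
        Sw += (Lk / L) * (diffs.T @ diffs) / Lk   # within-class scatter
    return np.trace(Sb) / np.trace(Sw)
```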
25
Feature selection
We can try all possible M-long feature combinations and select the one that
maximizes J0 (or any other class separability measure)
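A brute-force sketch of this search, reusing the separability_J0 helper from the previous sketch:

```python
from itertools import combinations

def select_features(X, y, M):
    """Evaluate every M-long feature subset and keep the one maximizing J0."""
    best_subset, best_score = None, -float("inf")
    for subset in combinations(range(X.shape[1]), M):
        score = separability_J0(X[:, list(subset)], y)  # or any other separability measure
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```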
26
Feature selection
Good for eliminating bad features, but nothing guarantees that the optimal (F-1)-
dimensional feature vector originates from the optimal F-dimensional one.
27
Feature selection
28
LDA
LDA is similar to PCA, but the eigenanalysis is performed on the matrix S_w^{-1} S_b
instead of C_x
Then we can use only the top M rows of A, where M < rank(S_w^{-1} S_b)
LDA projects the data into a few directions maximizing class separability
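A minimal sketch of this projection, assuming Sb and Sw have been computed as in the scatter-matrix sketch above and that Sw is invertible:

```python
import numpy as np

def lda_projection(Sb, Sw, M):
    """Top-M rows of the eigenvector matrix of inv(Sw) @ Sb."""
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))  # eigenanalysis of inv(Sw) @ Sb
    order = np.argsort(evals.real)[::-1]                   # decreasing eigenvalue
    A = evecs[:, order].real.T                             # rows = discriminative directions
    return A[:M]                                           # M should not exceed rank(inv(Sw) @ Sb)
```

Projecting the L x D data with these rows, e.g. X @ lda_projection(Sb, Sw, M).T, yields the M-dimensional discriminative representation.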
29
Classification
We have:
A taxonomy of classes
Goals:
Learn class models from the data
Classify new instances using these models
Strategies:
Supervised: models learned by example
Unsupervised: models are uncovered from unlabeled data
30
Instance-based learning
Nearest-neighbor classification:
Measures distance between new sample and all samples in the training set
Selects the class of the closest training sample
k-nearest neighbors (k-NN) classifier:
Measures distance between new sample and all samples in the training set
Identifies the k nearest neighbors
Selects the class that occurs most often among those k neighbors.
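A minimal k-NN sketch with Euclidean distance, assuming Xtrain (N x D) and ytrain hold the training set and x is a new feature vector:

```python
import numpy as np

def knn_classify(Xtrain, ytrain, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(Xtrain - x, axis=1)   # distance to every training sample
    nearest = np.argsort(dists)[:k]              # indices of the k nearest neighbors
    labels, counts = np.unique(ytrain[nearest], return_counts=True)
    return labels[np.argmax(counts)]             # most frequent class among them
```

With k = 1 this reduces to plain nearest-neighbor classification.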
31
Instance-based learning
32
Instance-based learning
33
Instance-based learning
34
Instance-based learning
35
Probability
Let us assume that the observations X and classes Y are random variables
Product rule: P(X, Y) = P(Y | X)\, P(X) = P(X | Y)\, P(Y)
37
Probability
P(\mathrm{class}_i \mid x) = \frac{P(x \mid \mathrm{class}_i)\, P(\mathrm{class}_i)}{P(x)}

where P(\mathrm{class}_i) is the prior probability of class i.
38
Probabilistic Classifiers
Classification: finding the class with the highest probability given the
observation x
Since P(x) is the same for all classes, this is equivalent to maximizing P(x|classi)P(classi) over i
From the training data we can learn the likelihood P(x|classi) and the prior
P(classi)
39
Gaussian Mixture Model
P(x \mid \mathrm{class}_i) = \sum_{k=1}^{K} w_{ik}\, N(x;\, \mu_{ik}, C_{ik})

N(x;\, \mu, C_x) = \frac{1}{(2\pi)^{D/2}\, |C_x|^{1/2}}\, e^{-\frac{1}{2}(x - \mu)^T C_x^{-1} (x - \mu)}
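A direct NumPy transcription of these two formulas (a sketch; weights, mus and covs stand for the w_ik, mu_ik and C_ik of one class model):

```python
import numpy as np

def gaussian_pdf(x, mu, C):
    """Multivariate normal density N(x; mu, C) for a D-dimensional x."""
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(C))
    return np.exp(-0.5 * diff @ np.linalg.solve(C, diff)) / norm

def gmm_likelihood(x, weights, mus, covs):
    """P(x | class) as a weighted sum of K Gaussian components."""
    return sum(w * gaussian_pdf(x, m, C) for w, m, C in zip(weights, mus, covs))
```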
40
Gaussian Mixture Model
41
Gaussian Mixture Model
42
K-means
J = \sum_{l=1}^{L} \sum_{k=1}^{K} r_{lk}\, \|x_l - \mu_k\|_2^2

where r_{lk} = 1 if sample x_l is assigned to cluster k, and 0 otherwise.
43
K-means
44
K-means
Randomly select K centers -> assign each point to the cluster of its closest center ->
recalculate the centers -> repeat until the assignments stop changing.
*from http://www.autonlab.org/tutorials/kmeans11.pdf
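A plain NumPy sketch of this loop (random initialization, no refinements):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Basic k-means on an L x D data matrix X; returns centers and assignments."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # random initial centers
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)                     # assign each point to its closest center
        new_centers = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                else centers[k] for k in range(K)])   # recalculate centers
        if np.allclose(new_centers, centers):                 # stop when centers no longer move
            break
        centers = new_centers
    return centers, assign
```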
45
K-means
Many possible improvements (see, e.g., Dan Pelleg and Andrew Moore's work):
does not always converge to the optimal solution -> run k-means multiple
times with different random initializations
sensitive to initial centers -> start with a random datapoint as the first center;
pick each next center as the datapoint farthest from its closest center
sensitive to the choice of K -> find the K that minimizes the Schwarz criterion
(see Moore's tutorial; a selection sketch follows below):

\sum_{l=1}^{L} \|x_l - \mu_{k(l)}\|_2^2 + (DK)\log L

where \mu_{k(l)} is the center closest to x_l and D is the feature dimensionality.
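A small sketch of that selection rule, assuming the distortion (sum of squared distances to the assigned centers) has already been computed for each candidate K, e.g. with the k-means sketch above:

```python
import numpy as np

def best_k(distortions, D, L):
    """Pick the K minimizing distortion + (D * K) * log(L).

    distortions: dict mapping a candidate K to the total squared distance
    of the corresponding k-means run; D: feature dimensionality; L: number of samples.
    """
    return min(distortions, key=lambda K: distortions[K] + D * K * np.log(L))
```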
46
EM Algorithm
\log p(X \mid \mu, C, w) = \sum_{l=1}^{L} \log\left\{ \sum_{k=1}^{K} w_k\, N(x_l;\, \mu_k, C_k) \right\}
47
EM Algorithm
1. Initialize \mu_k, C_k and w_k
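The remaining updates follow the standard EM recipe for a GMM; here is a rough sketch of one iteration, reusing the gaussian_pdf helper from the GMM sketch above (all names illustrative):

```python
import numpy as np

def em_step(X, weights, mus, covs):
    """One EM iteration for a K-component GMM on the L x D data matrix X."""
    L, D = X.shape
    K = len(weights)

    # E-step: responsibility of component k for each sample
    resp = np.array([[weights[k] * gaussian_pdf(x, mus[k], covs[k])
                      for k in range(K)] for x in X])          # L x K
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means and covariances
    Nk = resp.sum(axis=0)                                      # effective counts per component
    weights = Nk / L
    mus = (resp.T @ X) / Nk[:, None]
    covs = []
    for k in range(K):
        diff = X - mus[k]
        covs.append((resp[:, k, None] * diff).T @ diff / Nk[k])
    return weights, mus, covs
```

Repeating this step never decreases the log-likelihood above, so it is iterated until the improvement becomes negligible.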
48
EM Algorithm
50
MAP Classification
After learning the likelihood and the prior during training, we can classify new
instances based on MAP classification:
\arg\max_i \left[\, P(x \mid \mathrm{class}_i)\, P(\mathrm{class}_i) \,\right]
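A minimal sketch of this decision rule, assuming one GMM per class (as learned above) and reusing the gmm_likelihood helper from the GMM sketch:

```python
import numpy as np

def map_classify(x, class_models, priors):
    """Return the index i maximizing P(x | class_i) * P(class_i).

    class_models: list of (weights, mus, covs) tuples, one GMM per class;
    priors: list of prior probabilities P(class_i).
    """
    scores = [gmm_likelihood(x, *model) * prior
              for model, prior in zip(class_models, priors)]
    return int(np.argmax(scores))
```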
51
References
This lecture borrows heavily from Emmanuel Vincent's lecture notes on instrument
classification (QMUL - Music Analysis and Synthesis) and from Anssi Klapuri's
lecture notes on Audio Signal Classification (ISMIR 2004 Graduate School: http://
ismir2004.ismir.net/graduate.html)
Duda, R.O., Hart, P.E. and Stork, D.G. Pattern Classification (2nd Ed). John Wiley &
Sons (2000)
Witten, I. and Frank, E. Data Mining: Practical Machine Learning Tools and
Techniques. Morgan Kaufmann (2005)
52