You are on page 1of 15

Music Genre Classification using Machine

Learning Techniques

CS 698 - Computational Audio

Hareesh Bahuleyan
Problem Statement
● Music genres are a way to classify
music based on rhythmic structure,
harmonic content and
instrumentation

● Automatically recognition
○ Organize digital libraries
○ Provide recommendations
Data
Google Audio Set
● 2.1 Million audio samples (of 10 seconds)
● 527 classes of sounds
● Selected 7 labels

● Not the actual audio, just the YouTubeIDs,


start and end times
● 880 KB per wav file,
● Approximately 34 GB data
Convolutional Neural
Networks
MEL Spectrograms
● 2D colormap representation of the signal
● STFT: Window size = 2048, Hop size = 512, Hann window function, Number
of MEL bins = 96
CNN - Image Classification
● Consider spectrogram as an image and train a CNN classifier
● Matrix of pixel values - 3 channel RGB input
Convolution Block
Convolution Pooling Non-Linear Activation

Source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
VGG-16

● Transfer Learning
○ Weights of conv base are fixed
● Fine Tuning
○ Both conv base and feed-forward network are trainable
Feature Engineering
Approaches
Feature Extraction

Time Domain Frequency Domain Classifiers


1. Mean 1. MEL Frequency Cepstral 1. Logistic Regression
Coefficients (MFCCs)
2. Variance 2. Random Forest
2. Chroma Features
3. Skewness 3. Support Vector
3. Spectral Centroid Machines
4. Kurtosis
4. Spectral Band-widths 4. Extreme Gradient
5. Zero Crossing Rate Boosting
5. Spectral Contrast
6. Root Mean Square
Energy 6. Spectral Roll-offs

7. Tempo

● Total Number of Features = 97


Spectral Features
● Spectral Centroid

● Spectral Band-width

● Spectral Contrast
○ Divide spectrum into frequency bands
○ Maximum magnitude - Minimum magnitude in each band
● Spectral Roll-off
○ Frequency below which 85% of the total energy in the spectrum lies
● Chroma Features
○ 12-element feature vector
○ Indicates how much energy of each pitch class, {C, C#, D, D#, E, ..., B}
Results
Comparison of Models
● Metrics: Accuracy | F-score | AUC

Baseline uses flatten


vector of pixels

Ensembling classifiers
is beneficial
Feature Importance Study
Keep only
most
important
top N
features

Time domain vs.


Frequency domain
Confusion Matrix

Good at predicting some


classes. Eg: Rock

Many mis-classifications
for Rhythm blues, Pop
genre

Classes are also


unbalanced
Thank You

You might also like