
ICME 2011 Oral Presentation

2011/07/14

Automatic Transcription of Piano Music by Sparse Representation of Magnitude Spectra


Cheng-Te Lee, Yi-Hsuan Yang, and Homer Chen
National Taiwan University
Speaker: Cheng-Te Lee

Outline
I. Introduction
II. Proposed System
III. Performance Analysis & Demo

I. Introduction

Automatic Transcription
Music signal (in WAVE format) -> Musical score (in MIDI format)

Goal: convert a music signal into a musical score
Main drawbacks of previous work
  Training data is difficult to generate
  The spectral shape of each note is assumed to be constant

Spectral Shape of Piano Sound


Spectra of note C4 (MIDI number 60) produced by 6 pianos

ADSR Model
Attack, Decay, Sustain, Release
The spectral shape of a note varies with time
[Figure: note C4 in the time domain with its ADSR phases marked, and its spectra over time (by frame)]

Design Consideration
Exploit an online repository of piano notes as the database, so that the transcription
  works without generating training data
  adapts to a new piano easily
  adopts the ADSR model
[Figure: keyboard, database of individual piano notes, input signal, synthesized mixture]

II. Proposed System

System Overview
[Figure: system block diagram]
Input: WAVE file
Processing blocks: volume normalization, frame decomposition, FFT analysis, tuning factor estimation, note candidate selection, sparse representation computation, noise elimination, HMM postprocessing
Database blocks: piano sound database, database tuning, tuned piano sound database
Output: MIDI file
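
The first three processing blocks can be sketched as follows. This is a minimal illustration, not the system from the talk: peak normalization is an assumed stand-in for the volume normalization step, a naive DFT replaces the FFT to stay dependency-free, and the 2000 Hz sampling rate and test sine are toy values. The 100 ms frame length and 10 ms hop size are the values reported in the frame-level evaluation.

```python
import cmath
import math


def normalize_volume(samples):
    """Peak-normalize so the largest absolute sample is 1.0 (assumed method)."""
    peak = max(abs(s) for s in samples)
    return [s / peak for s in samples] if peak > 0 else list(samples)


def split_frames(samples, frame_len, hop):
    """Cut the signal into overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2, computed naively in O(N^2)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


SAMPLE_RATE = 2000                        # Hz, toy value so the naive DFT stays fast
FRAME_LEN = round(0.100 * SAMPLE_RATE)    # 100 ms frame
HOP = round(0.010 * SAMPLE_RATE)          # 10 ms hop

# 0.3 s of a quiet 250 Hz sine as a stand-in for a recording.
signal = [0.5 * math.sin(2 * math.pi * 250 * t / SAMPLE_RATE) for t in range(600)]

frames = split_frames(normalize_volume(signal), FRAME_LEN, HOP)
spectrum = magnitude_spectrum(frames[0])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(frames), peak_hz)
```

Each frame's magnitude spectrum plays the role of the vector y in the sparse representation step.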


Note Candidate Selection


Octave notes are easily mistaken for each other because they have similar spectra
Avoid octave errors by note candidate selection
  Leverage the harmonic structure of piano sounds
Spectra of note C4 (MIDI number 60) of two pianos: one with a strong fundamental, one with a weak fundamental
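
The slides do not spell out the selection rule, so the following is only one plausible harmonic-structure test, under stated assumptions: a key is kept as a candidate when at least `min_hits` of its first six harmonics lie near a spectral peak. The function names, the 3% tolerance, and the thresholds are all hypothetical, and string inharmonicity is ignored. Counting harmonics rather than requiring the fundamental lets a weak-fundamental C4 survive, while C3 (an octave below) is rejected because its odd harmonics are absent.

```python
def midi_to_hz(midi_number):
    """Equal-tempered frequency with A4 (MIDI 69) = 440 Hz."""
    return 440.0 * 2.0 ** ((midi_number - 69) / 12.0)


def harmonic_hits(peaks_hz, f0, n_harmonics=6, rel_tol=0.03):
    """Count how many of the first n_harmonics of f0 lie near a spectral peak."""
    hits = 0
    for k in range(1, n_harmonics + 1):
        target = k * f0
        if any(abs(p - target) <= rel_tol * target for p in peaks_hz):
            hits += 1
    return hits


def select_candidates(peaks_hz, min_hits=5):
    """Keep piano keys (MIDI 21..108) whose harmonic series is mostly present."""
    return [m for m in range(21, 109)
            if harmonic_hits(peaks_hz, midi_to_hz(m)) >= min_hits]


c4 = midi_to_hz(60)                                   # about 261.63 Hz
strong_fundamental = [k * c4 for k in range(1, 7)]    # peaks at harmonics 1..6
weak_fundamental = [k * c4 for k in range(2, 7)]      # fundamental peak missing

strong_candidates = select_candidates(strong_fundamental)
weak_candidates = select_candidates(weak_fundamental)
print(strong_candidates, weak_candidates)
```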


Illustration of Candidate Selection


[Figure: candidate selection illustrated for the strong-fundamental and weak-fundamental cases]

Sparse Representation Computation


Sparsity of Played Notes


A piano has 88 keys in total, but the keys played at any one time are a sparse subset of them
  Only 4 voiced notes at a time on average

Sparse Representation
Problem formulation
$$x^* = \arg\min_x \|x\|_0 \quad \text{subject to } y = Ax \tag{1}$$

y: vector of the magnitude spectrum of a frame
A: matrix of bases; each column of A is the magnitude spectrum of a note candidate
x*: vector of sparse representation coefficients

Illustration of Sparse Representation


$$x^* = \arg\min_x \|x\|_0 \quad \text{subject to } y = Ax \tag{1}$$

[Figure: y (frame spectrum) = A (spectra of note candidates) times x* (coefficient vector)]

Solving (1) is NP-hard in general
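
To see why (1) is combinatorial, here is a minimal exhaustive-search sketch on a toy dictionary. Everything in it is an assumption for illustration: the 6-bin, 4-note matrix `A`, the helper names, and the use of normal equations to fit each candidate support. It tries supports of growing size until one reproduces y exactly, which is only feasible for tiny problems; with 88 keys there are already millions of supports of size 4.

```python
import math
from itertools import combinations


def solve_square(M, b):
    """Gauss-Jordan elimination with partial pivoting; None if singular."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-12:
            return None
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def l0_brute_force(A, y, tol=1e-9):
    """Return the sparsest x with y = Ax by trying supports of size 1, 2, ..."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    for k in range(1, n + 1):
        for support in combinations(range(n), k):
            S = [cols[j] for j in support]
            gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in S] for ci in S]
            rhs = [sum(a * b for a, b in zip(ci, y)) for ci in S]
            coef = solve_square(gram, rhs)      # least-squares fit on this support
            if coef is None:
                continue
            resid = [y[i] - sum(c * col[i] for c, col in zip(coef, S))
                     for i in range(m)]
            if max(abs(r) for r in resid) < tol:
                x = [0.0] * n
                for j, c in zip(support, coef):
                    x[j] = c
                return x
    return None


# Toy dictionary: rows are frequency bins, columns are 4 hypothetical notes.
A = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.5, 0.0, 1.0, 0.0],
     [0.0, 0.5, 0.0, 1.0],
     [0.0, 0.0, 0.5, 0.0],
     [0.0, 0.0, 0.0, 0.5]]
y = [1.0, 0.0, 0.5, 0.8, 0.0, 0.4]    # note 0 at gain 1.0 plus note 3 at gain 0.8

x_star = l0_brute_force(A, y)
supports_of_size_4 = math.comb(88, 4)  # number of 4-key subsets of a piano
print(x_star, supports_of_size_4)
```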



Sparse Representation (contd)


$$x^* = \arg\min_x \|x\|_0 \quad \text{subject to } y = Ax \tag{1}$$

If the solution of (1) is sparse enough, it is close to the solution of the l1-regularized problem

$$x^* = \arg\min_x \|y - Ax\|_2 + \lambda \|x\|_1$$

where $\lambda$ is the regularization weight. This problem can be solved in polynomial time, $O(n^{1.2})$
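
A minimal sketch of solving an l1-regularized problem follows. Note the assumptions: it minimizes the common squared-error (lasso) form, 0.5 * ||y - Ax||_2^2 + lam * ||x||_1, and uses ISTA (iterative soft-thresholding) as a simple swapped-in solver; neither is claimed to be what the authors used. The toy dictionary `A`, the weight `lam = 0.05`, and the iteration count are arbitrary illustration values.

```python
import math


def soft_threshold(v, t):
    """Shrink v toward zero by t; values within [-t, t] become zero."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def ista(A, y, lam, n_iter=500):
    """Minimize 0.5*||y - Ax||_2^2 + lam*||x||_1 by iterative soft-thresholding."""
    m, n = len(A), len(A[0])
    # Squared Frobenius norm: a safe upper bound on the gradient's Lipschitz constant.
    lipschitz = sum(a * a for row in A for a in row)
    x = [0.0] * n
    for _ in range(n_iter):
        resid = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        grad = [sum(A[i][j] * resid[i] for i in range(m)) for j in range(n)]
        x = [soft_threshold(x[j] - grad[j] / lipschitz, lam / lipschitz)
             for j in range(n)]
    return x


# Toy dictionary: rows are frequency bins, columns are 4 hypothetical notes.
A = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.5, 0.0, 1.0, 0.0],
     [0.0, 0.5, 0.0, 1.0],
     [0.0, 0.0, 0.5, 0.0],
     [0.0, 0.0, 0.0, 0.5]]
y = [1.0, 0.0, 0.5, 0.8, 0.0, 0.4]    # note 0 at gain 1.0 plus note 3 at gain 0.8

x_hat = ista(A, y, lam=0.05)
active_notes = [j for j, v in enumerate(x_hat) if abs(v) > 1e-6]
print(active_notes, [round(v, 3) for v in x_hat])
```

On this toy input the recovered support is the two notes that were mixed, with gains shrunk slightly toward zero by the l1 penalty.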



HMM Post-Processing
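Model each note with a two-state (on/off) HMM (88 HMMs for the 88 keys on a piano)
Given a frame sequence $X = x_1 x_2 \ldots x_n$, $t \in [1, n]$
Maximize [equation not recoverable]
Because [equation not recoverable]
  (annotation: estimated from sparse representation coefficient)
so we maximize [equation not recoverable]
  (annotation: learnt from MIDI files)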
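
Decoding a two-state on/off HMM for one key can be sketched with the Viterbi algorithm. This is an assumption-laden illustration, not the talk's exact formulation: `p_on` stands in for a per-frame "note is on" score derived from the sparse coefficients, and `p_stay` is a made-up self-transition probability where the talk learns transitions from MIDI files.

```python
import math


def viterbi_on_off(p_on, p_stay=0.9, prior_on=0.5):
    """Most likely on/off (1/0) state sequence for one key.

    p_on: per-frame probabilities strictly between 0 and 1.
    """
    states = (0, 1)

    def emit(state, p):
        return math.log(p if state == 1 else 1.0 - p)

    def trans(a, b):
        return math.log(p_stay if a == b else 1.0 - p_stay)

    score = [math.log(1.0 - prior_on) + emit(0, p_on[0]),
             math.log(prior_on) + emit(1, p_on[0])]
    backpointers = []
    for p in p_on[1:]:
        new_score, ptr = [], []
        for s in states:
            best_prev = max(states, key=lambda a: score[a] + trans(a, s))
            ptr.append(best_prev)
            new_score.append(score[best_prev] + trans(best_prev, s) + emit(s, p))
        score = new_score
        backpointers.append(ptr)

    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return path[::-1]


p_on = [0.1, 0.2, 0.9, 0.8, 0.3, 0.9, 0.85, 0.1, 0.6, 0.1]
thresholded = [1 if p > 0.5 else 0 for p in p_on]
smoothed = viterbi_on_off(p_on)
print(thresholded)
print(smoothed)
```

Compared with plain thresholding, the decoded sequence fills the one-frame dropout inside the note and discards the isolated one-frame activation after it.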

Result of HMM Post-Processing


[Figure: (a) before HMM post-processing, (b) after HMM post-processing; legend: true positive, false positive, false negative, true negative]

III. Performance Analysis & Demo


Frame-Level Evaluation
70.2% F-measure
  10 one-minute-long classical music recordings
  Each frame is 100 ms long; the hop size is 10 ms
  59,910 frames, 211,082 notes, average polyphony 3.54

Significant improvement over two state-of-the-art systems
  Under the one-tailed t-test (p-value < 0.05)

System                  F-measure   Precision   Recall
Proposed system         70.2%       74.4%       66.5%
Klapuri's system [2]    62.2%       72.4%       54.6%
Marolt's system [1]     66.1%       78.6%       57.1%
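
For reference, the three metrics can be computed as below. This sketch assumes the usual frame-level convention, where each (frame, pitch) pair is one decision; the toy reference and estimate sets are made up. The last lines check that the proposed system's reported precision and recall reproduce its reported F-measure.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f_measure(precision, recall)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy piano rolls as sets of (frame_index, midi_pitch) pairs.
reference = {(0, 60), (0, 64), (1, 60), (1, 64), (2, 67)}
estimate = {(0, 60), (1, 60), (1, 64), (2, 65), (2, 67)}

tp = len(reference & estimate)   # active in both
fp = len(estimate - reference)   # reported but not played
fn = len(reference - estimate)   # played but missed
toy_scores = precision_recall_f(tp, fp, fn)

# Consistency check on the reported frame-level numbers of the proposed system.
proposed_f = f_measure(0.744, 0.665)
print(toy_scores, round(100 * proposed_f, 1))
```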

[1] M. Marolt, "A connectionist approach to automatic transcription of polyphonic piano music," IEEE Trans. Multimedia, vol. 6, no. 3, pp. 439-449, 2004.
[2] A. Klapuri, "Multiple fundamental frequency estimation by summing harmonic amplitudes," in Proc. ISMIR, Victoria, Canada, pp. 216-221, Oct. 2006.

Note-Level Evaluation
73.0% F-measure
  Only the onsets of notes are considered
  An onset counts as correct within 100 ms of the ground-truth onset
  4,937 notes

Significant improvement over the best system of MIREX 2010 F0 tracking [3]

System              F-measure   Precision   Recall
Proposed system     73.0%       74.6%       71.6%
Yeh's system [3]    67.1%       57.2%       81.1%
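
Onset-only scoring with a 100 ms tolerance can be sketched as a one-to-one matching between estimated and reference notes. The greedy matching strategy, the function name, and the toy note lists are assumptions for illustration; the talk does not state which matching procedure was used.

```python
def match_onsets(reference, estimated, tolerance=0.1):
    """Greedy one-to-one matching of (onset_seconds, midi_pitch) pairs.

    An estimated note is a true positive if an unmatched reference note has
    the same pitch and an onset within `tolerance` seconds.
    """
    unmatched = sorted(reference)
    tp = 0
    for onset, pitch in sorted(estimated):
        for i, (ref_onset, ref_pitch) in enumerate(unmatched):
            if ref_pitch == pitch and abs(ref_onset - onset) <= tolerance:
                del unmatched[i]      # each reference note is matched at most once
                tp += 1
                break
    fp = len(estimated) - tp
    fn = len(unmatched)
    return tp, fp, fn


reference = [(0.00, 60), (0.50, 64), (1.00, 67), (1.50, 72)]
estimated = [(0.03, 60), (0.58, 64), (1.20, 67), (1.52, 72), (1.80, 48)]

tp, fp, fn = match_onsets(reference, estimated)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, round(f_score, 3))
```

Here the note estimated 200 ms late is both a false positive and a false negative, and the spurious low note is a false positive.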

[3] C. Yeh and A. Roebel, "Multiple-F0 estimation for MIREX 2010," Music Information Retrieval Evaluation eXchange, 2010. [Online]. Available: http://www.music-ir.org/mirex/abstracts/2010/AR1.pdf

Analysis of System Components


Number of Base Elements


Because we adopt the ADSR model, there is more than one base element for each note
The F-measure improves from 64.6% (88 base elements) to 70.2% (646 base elements)

Conclusion
We have presented an automatic transcription system that
  exploits the sparse nature of played keys
  adapts to a new piano easily
  adopts the ADSR model to improve accuracy

Significant improvement over state-of-the-art systems

Live Demo
Song                                                  Composer    F-measure
Prelude and Fugue No. 2 in C minor                    Bach        78.2%
Sonata No. 8 "Pathetique" in C minor, 3rd movement    Beethoven   74.6%
Moments Musicaux No. 4                                Schubert    67.0%
Sonata K.333 in Bb major, 1st movement                Mozart      78.4%

[Audio: original recording and transcription result for each song]

Thanks for your attention


Q&A
