2011/07/14
Outline
- Introduction
- Proposed System
- Performance Analysis & Demo
I. Introduction
Automatic Transcription
Music signal (in WAVE format) -> Musical score (in MIDI format)
- Goal: convert a music signal into a musical score
- Main drawbacks of previous work
  - Training data is difficult to generate
  - The spectral shape of each note is assumed to be constant
ADSR Model
- Attack, Decay, Sustain, Release
- The spectral shape of a note varies with time
[Figure: note C4 in the time domain, with the attack, decay, sustain, and release segments marked; horizontal axis: frame]
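The claim that the spectral shape of a note varies with time can be illustrated with a toy model. The sketch below is not taken from the presented system: it assumes that harmonic k of a note decays as exp(-k * decay_per_harmonic * t), so higher harmonics fade faster, and all parameter values are hypothetical.

```python
import math

def harmonic_amplitudes(t, n_harmonics=5, decay_per_harmonic=1.5):
    """Toy model (assumption): harmonic k decays as exp(-k * decay_per_harmonic * t)."""
    return [math.exp(-k * decay_per_harmonic * t) for k in range(1, n_harmonics + 1)]

def normalize(values):
    """Scale a vector to unit sum so that only its shape remains."""
    total = sum(values)
    return [v / total for v in values]

# Spectral shape shortly after the attack and one second later (hypothetical times).
early = normalize(harmonic_amplitudes(0.05))
late = normalize(harmonic_amplitudes(1.0))

# The share of the fundamental grows over time, so the two shapes differ.
shape_change = sum(abs(a - b) for a, b in zip(early, late))
```

Under this assumption a single fixed spectrum per note cannot match both the early and the late frames, which is the limitation the ADSR model is meant to address.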
Design Consideration
- Exploit an online repository of piano notes as the database, so that the transcription
  - works without generating training data
  - adapts to a new piano easily
- Adopt the ADSR model
[Figure: keyboard, input signal, and synthesized mixture]
System Overview
WAVE file -> Volume normalization -> Frame decomposition -> FFT analysis -> Noise elimination -> Sparse representation computation -> HMM postprocessing -> MIDI file
Other blocks in the diagram: Database, Tuning
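The first three stages of the pipeline (volume normalization, frame decomposition, spectral analysis) can be sketched as follows. This is a minimal illustration under stated assumptions, not the presented implementation: peak normalization, the rectangular window, the naive DFT standing in for an FFT, and the toy 1 kHz sample rate are all assumptions. Only the 100 ms frame length and 10 ms hop size come from the evaluation setup.

```python
import cmath
import math

def normalize_volume(samples):
    """Scale the signal so its largest absolute sample is 1 (peak normalization, an assumption)."""
    peak = max(abs(s) for s in samples)
    return [s / peak for s in samples] if peak > 0 else list(samples)

def decompose_frames(samples, sample_rate, frame_ms=100, hop_ms=10):
    """Cut the signal into overlapping frames of frame_ms, advancing by hop_ms."""
    frame_len = sample_rate * frame_ms // 1000
    hop = sample_rate * hop_ms // 1000
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]

def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) DFT standing in for an FFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

# Toy input: 0.2 s of a 50 Hz sine sampled at 1 kHz (hypothetical values).
sample_rate = 1000
signal = [0.5 * math.sin(2 * math.pi * 50 * t / sample_rate) for t in range(200)]

frames = decompose_frames(normalize_volume(signal), sample_rate)
spectrum = magnitude_spectrum(frames[0])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * sample_rate / len(frames[0])
```

Each magnitude spectrum produced this way plays the role of the vector y in the sparse representation step.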
[Figure: weak fundamental]
Sparse Representation
Problem formulation
\[ x^* = \arg\min_x \|x\|_0 \quad \text{subject to } y = Ax \qquad (1) \]

- y: vector of the magnitude spectrum of a frame
- A: matrix of bases; each column of A is the magnitude spectrum of a note candidate
- x^*: vector of sparse representation coefficients
[Figure: frame spectrum y and coefficient vector x^* for problem (1)]
If the solution of (1) is sparse enough, it is close to the solution of the l1-regularized problem

\[ x^* = \arg\min_x \|y - Ax\|_2 + \|x\|_1 \]
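An l1-regularized problem of this kind can be solved with many methods; the slides do not say which solver the system uses. The sketch below uses iterative soft thresholding (ISTA) as a stand-in and, as an assumption, minimizes the common squared-error form 0.5 * ||y - Ax||_2^2 + lam * ||x||_1 with an explicit weight lam, which is not shown on the slide. The matrix A and the spectrum y are hypothetical toy values.

```python
import math

def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def ista(A, y, lam, n_iter=1000):
    """Minimize 0.5 * ||y - A x||_2^2 + lam * ||x||_1 by iterative soft thresholding."""
    At = [list(col) for col in zip(*A)]
    # Step size 1/L: the squared Frobenius norm of A is a safe upper bound
    # on the largest eigenvalue of A^T A.
    L = sum(a * a for row in A for a in row)
    x = [0.0] * len(At)
    for _ in range(n_iter):
        residual = [p - q for p, q in zip(matvec(A, x), y)]
        grad = matvec(At, residual)
        stepped = [xi - g / L for xi, g in zip(x, grad)]
        # Soft thresholding shrinks every coefficient toward zero by lam / L.
        x = [math.copysign(max(abs(s) - lam / L, 0.0), s) for s in stepped]
    return x

# Columns of A: toy 4-bin "spectra" of three note candidates (hypothetical values).
A = [[1.0, 0.0, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.5, 1.0],
     [0.0, 0.0, 0.5]]
y = [1.0, 0.5, 1.0, 0.5]   # frame spectrum = candidate 0 + candidate 2

x_star = ista(A, y, lam=0.01)
active = [i for i, c in enumerate(x_star) if c > 0.5]
```

In this toy case the coefficient of the unused candidate is driven to zero, while the two played candidates keep coefficients slightly below 1 because of the shrinkage.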
HMM Post-Processing
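- Model each note with a two-state (on/off) HMM (88 HMMs for the 88 keys of a piano)
- Given a frame sequence X = x_1 x_2 ... x_n with t ∈ [1, n], choose each note's state sequence by maximizing an objective that combines
  - a term estimated from the sparse representation coefficients
  - a term learnt from MIDI files
[Equations not recoverable from the slide]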
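Decoding a two-state on/off HMM for one note can be sketched with the Viterbi algorithm. The exact objective on the slide is not recoverable, so this is an illustration under assumptions: p_on[t] stands for a per-frame "note is on" probability derived somehow from the sparse representation coefficients, and the transition and prior probabilities, which the system learns from MIDI files, are replaced by hypothetical constants.

```python
import math

def safe_log(p):
    """Logarithm with a floor to avoid log(0)."""
    return math.log(max(p, 1e-12))

def viterbi_on_off(p_on, p_stay_off=0.9, p_stay_on=0.9, prior_on=0.5):
    """Most likely off(0)/on(1) state sequence for one note."""
    # trans[q][s]: probability of moving from state q to state s (hypothetical values).
    trans = [[p_stay_off, 1.0 - p_stay_off], [1.0 - p_stay_on, p_stay_on]]

    def frame_score(t, s):
        return safe_log(p_on[t] if s == 1 else 1.0 - p_on[t])

    score = [safe_log(1.0 - prior_on) + frame_score(0, 0),
             safe_log(prior_on) + frame_score(0, 1)]
    back = []
    for t in range(1, len(p_on)):
        new_score, pointers = [], []
        for s in (0, 1):
            candidates = [score[q] + safe_log(trans[q][s]) for q in (0, 1)]
            best = 0 if candidates[0] >= candidates[1] else 1
            new_score.append(candidates[best] + frame_score(t, s))
            pointers.append(best)
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical per-frame "on" probabilities with a dip in the middle of a note.
p_on = [0.05, 0.15, 0.85, 0.4, 0.95, 0.8, 0.15, 0.05]
smoothed = viterbi_on_off(p_on)
thresholded = [1 if p > 0.5 else 0 for p in p_on]
```

Compared with simple thresholding, the decoded path keeps the note on across the weak fourth frame, which is the kind of smoothing the post-processing stage is meant to provide.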
Frame-Level Evaluation
- 70.2% F-measure
- 10 one-minute-long classical music recordings
- Each frame is 100 ms long; the hop size is 10 ms
- 59,910 frames, 211,082 notes, average polyphony 3.54
[Figure: results chart citing [1] and [2]; other values shown: 66.1%, 78.6%, 57.1%]
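A frame-level score of this kind is usually computed by counting, frame by frame, the notes that are correctly detected, spuriously detected, or missed. The slides do not spell out the exact definition used, so the sketch below assumes the common one; the piano-roll data is hypothetical.

```python
def frame_level_scores(reference, estimate):
    """Precision, recall, F-measure over per-frame sets of active MIDI note numbers."""
    tp = fp = fn = 0
    for ref, est in zip(reference, estimate):
        tp += len(ref & est)   # notes active in both
        fp += len(est - ref)   # notes reported but not played
        fn += len(ref - est)   # notes played but not reported
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Three frames of a hypothetical ground truth and transcription.
reference = [{60, 64}, {60, 64}, {67}]
estimate = [{60}, {60, 64, 67}, {67}]
precision, recall, f_measure = frame_level_scores(reference, estimate)
```

Here four notes are matched, one is spurious, and one is missed, so precision, recall, and F-measure all equal 0.8.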
[1] M. Marolt, "A connectionist approach to automatic transcription of polyphonic piano music," IEEE Trans. Multimedia, vol. 6, no. 3, pp. 439-449, 2004.
[2] A. Klapuri, "Multiple fundamental frequency estimation by summing harmonic amplitudes," in Proc. ISMIR, Victoria, Canada, pp. 216-221, Oct. 2006.
Note-Level Evaluation
- 73.0% F-measure
- Only the onsets of notes are considered
  - A detected onset is correct if it lies within 100 ms of the ground-truth onset
  - 4937 notes
- Higher F-measure than the best system of the MIREX 2010 F0 tracking task [3]: 73.0% vs. 67.1%

                    F-measure   Precision   Recall
Proposed system     73.0%       74.6%       71.6%
Yeh's system [3]    67.1%       57.2%       81.1%
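Onset-based scoring with a 100 ms tolerance can be sketched as a greedy one-to-one matching between detected and ground-truth notes. The matching rule is an assumption (the slides give only the tolerance), in particular the requirement that the pitch must also agree; the note lists are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def onset_scores(reference, estimate, tolerance=0.1):
    """Score (midi_pitch, onset_seconds) notes; each reference note matches at most once."""
    unmatched = sorted(reference, key=lambda note: note[1])
    hits = 0
    for pitch, onset in sorted(estimate, key=lambda note: note[1]):
        for i, (ref_pitch, ref_onset) in enumerate(unmatched):
            if ref_pitch == pitch and abs(ref_onset - onset) <= tolerance:
                hits += 1
                del unmatched[i]
                break
    precision = hits / len(estimate) if estimate else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall, f_measure(precision, recall)

# Hypothetical notes: one onset is 200 ms late and one note is spurious.
reference = [(60, 0.00), (64, 0.50), (67, 1.00)]
estimate = [(60, 0.04), (64, 0.70), (67, 1.05), (72, 1.50)]
precision, recall, f = onset_scores(reference, estimate)

# Harmonic mean of the reported precision and recall of the proposed system.
reported = f_measure(0.746, 0.716)
```

For the toy notes this gives a precision of 0.5 and a recall of 2/3. Applying the same harmonic-mean formula to the reported precision of 74.6% and recall of 71.6% gives about 0.73.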
[3] C. Yeh and A. Roebel, "Multiple-F0 estimation for MIREX 2010," Music Information Retrieval Evaluation eXchange, 2010. [Online]. Available: http://www.music-ir.org/mirex/abstracts/2010/AR1.pdf
Conclusion
We have presented an automatic transcription system that
- exploits the sparse nature of played keys
- adapts to a new piano easily
- adopts the ADSR model to improve accuracy
Live Demo
Song                                                  Composer    F-measure
Prelude and Fugue No. 2 in C minor                    Bach        78.2%
Sonata No. 8 "Pathetique" in C minor, 3rd movement    Beethoven   74.6%
Moments Musicaux No. 4                                Schubert    67.0%
Sonata K.333 in Bb major, 1st movement                Mozart      78.4%
[Audio: original recording and transcription result for each piece]