Deep Scattering Spectrum (Sound)

Published by: LaCasaIda.org on Apr 30, 2013
Copyright:Attribution Non-commercial

Deep Scattering Spectrum
Joakim Andén, Stéphane Mallat
Abstract
A scattering transform defines a locally translation invariant representation which is stable to time-warping deformations. It extends MFCC representations by computing modulation spectrum coefficients of multiple orders, through cascades of wavelet convolutions and modulus operators. Second-order scattering coefficients characterize transient phenomena such as attacks and amplitude modulation. A frequency transposition invariant representation is obtained by applying a scattering transform along log-frequency. State-of-the-art classification results are obtained for musical genre and phone classification on the GTZAN and TIMIT databases, respectively.
Keywords:
Audio classification, deep neural networks, MFCC, modulation spectrum, wavelets.
1 Introduction
A major difficulty of audio representations for classification is the multiplicity of information at different time scales: pitch and timbre at the scale of milliseconds, the rhythm of speech and music at the scale of seconds, and the music progression over minutes and hours. Mel-frequency cepstral coefficients (MFCCs) are efficient local descriptors at time scales up to 25 ms. Capturing larger structures up to 500 ms is however necessary in most applications. This paper studies the construction of stable, invariant signal representations over such larger time scales. We concentrate on audio applications, but introduce a generic scattering representation for classification, which applies to many signal modalities beyond audio [12].
Spectrograms compute locally invariant descriptors over time intervals limited by a window. Section 2 shows that high-frequency spectrogram coefficients are not stable to variability due to time-warping deformations, which occur in most signals, particularly in audio. MFCCs average spectrogram values over mel-frequency bands, which improves stability to time warping but also removes information. Over time intervals larger than 25 ms, the information loss becomes too large, which is why MFCCs are limited to such short time intervals.
This work is supported by the ANR 10-BLAN-0126 and ERC InvariantClass 320959 grants.
arXiv:1304.6763v1 [cs.SD] 24 Apr 2013
Figure 1: (a) Spectrogram $\log |\hat{x}(t,\omega)|$ for a harmonic signal $x(t)$ (centered in $t_0$) followed by $\log |\hat{x}_\tau(t,\omega)|$ for $x_\tau(t) = x((1-\epsilon)t)$ (centered in $t_1$), as a function of $t$ and $\omega$. The right graph plots $\log |\hat{x}(t_0,\omega)|$ (blue) and $\log |\hat{x}_\tau(t_1,\omega)|$ (red) as a function of $\omega$. Their partials do not overlap at high frequencies. (b) Mel-frequency spectrogram $\log Mx(t,\omega)$ followed by $\log Mx_\tau(t,\omega)$. The right graph plots $\log Mx(t_0,\omega)$ (blue) and $\log Mx_\tau(t_1,\omega)$ (red) as a function of $\omega$. With a mel-scale frequency averaging, the partials of $x$ and $x_\tau$ overlap at all frequencies.
Modulation spectrum decompositions [2,17,23,26,32,36,37,40,42] characterize the temporal evolution of mel-frequency spectrograms over larger time scales, with autocorrelation or Fourier coefficients. However, this modulation spectrum also suffers from instability to time-warping deformation, which impedes classification performance.
Section 3 shows that the information lost by mel-frequency spectral coefficients can be recovered with multiple layers of wavelet coefficients, which are stable to time-warping deformations. A scattering transform [31] computes such a cascade of wavelet transforms and modulus non-linearities. Its computational structure is similar to a convolutional deep neural network [3,14,21,24,25,27,34], but involves no learning. It outputs time-averaged coefficients, providing informative signal invariants over potentially large time scales.
A scattering transform has striking similarities with physiological models of the cochlea and of the auditory pathway [11,15], also used for audio processing [33]. Its energy conservation and other mathematical properties are reviewed in Section 4. An approximate inverse scattering transform is introduced in Section 5, with numerical examples. Section 6 relates the amplitude of scattering coefficients to audio signal properties. These coefficients provide accurate measurements of frequency intervals between harmonics and also characterize the amplitude modulation of voiced and unvoiced sounds. The logarithm of scattering coefficients linearly separates audio components related to pitch, formant and timbre.
Frequency transpositions form another important source of audio variability, which should be kept or removed depending upon the classification task. For example, speaker-independent phone recognition requires some frequency transposition invariance, while frequency localization is necessary for speaker identification. Section 7 shows that cascading a scattering transform along log-frequency yields a transposition invariant representation which is stable to frequency deformation.
 
Scattering representations have proved useful for image [5,39] and audio [1,4,10] classification. Section 8 explains how to adapt and optimize the amount of time and frequency invariance for each signal class, at the supervised learning stage. A time and frequency scattering representation is used for musical genre classification over the GTZAN database, and for phone classification over the TIMIT corpus. State-of-the-art results are obtained with a Gaussian kernel SVM applied to scattering feature vectors. All figures and results are reproducible using a MATLAB software package, available at
.
2 Mel-frequency Spectrum
Section 2.1 shows that high-frequency spectrogram coefficients are not stable to time-warping deformation. The mel-frequency spectrogram stabilizes these coefficients by averaging them along frequency, but loses information. To analyze this information loss, Section 2.2 relates the mel-frequency spectrogram to the amplitude output of a filter bank which computes a wavelet transform.
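The stabilizing effect of frequency averaging can be illustrated numerically. The following is a minimal sketch, not the paper's implementation: it assumes a single Gaussian-windowed partial at 2000 Hz, a dilation $x_\tau(t) = x((1-\epsilon)t)$ with $\epsilon = 0.02$, and a hypothetical bank of Gaussian bands whose width is proportional to their center frequency, used here as a crude stand-in for a mel filter bank at high frequencies. The names `tone`, `band_average` and `rel_dist`, as well as all numerical values, are illustrative choices.

```python
import numpy as np

fs = 8192.0                          # hypothetical sampling rate (Hz)
t = np.arange(-1.0, 1.0, 1.0 / fs)   # 2-second time axis
sigma = 0.05                         # width of the Gaussian envelope (s)
eps = 0.02                           # dilation: x_tau(t) = x((1 - eps) * t)


def tone(time, freq_hz):
    """Gaussian-windowed partial at freq_hz, evaluated on `time`."""
    envelope = np.exp(-0.5 * (time / sigma) ** 2)
    return envelope * np.cos(2.0 * np.pi * freq_hz * time)


def energy_spectrum(x):
    """Squared Fourier modulus on the non-negative DFT frequencies."""
    return np.abs(np.fft.rfft(x)) ** 2


def band_average(energy, freqs, q=8, f_min=100.0, n_bands=40):
    """Sum the energy over Gaussian bands of relative width 1/q.

    Assumption: these constant-Q bands only mimic the behaviour of
    mel bands at high frequencies; they are not a true mel filter bank.
    """
    centers = f_min * 2.0 ** (np.arange(n_bands) / q)
    widths = centers / q
    weights = np.exp(
        -0.5 * ((freqs[None, :] - centers[:, None]) / widths[:, None]) ** 2
    )
    return weights @ energy


def rel_dist(a, b):
    """Euclidean distance between a and b, relative to the norm of a."""
    return np.linalg.norm(a - b) / np.linalg.norm(a)


freqs = np.fft.rfftfreq(t.size, d=1.0 / fs)
e_x = energy_spectrum(tone(t, 2000.0))
e_tau = energy_spectrum(tone((1.0 - eps) * t, 2000.0))

raw = rel_dist(e_x, e_tau)                        # narrow peaks, no overlap
averaged = rel_dist(band_average(e_x, freqs),
                    band_average(e_tau, freqs))   # wide bands, large overlap
print(f"raw: {raw:.2f}, band-averaged: {averaged:.2f}")
```

Under these assumptions the dilation moves the partial from 2000 Hz to 1960 Hz, which is far more than its own bandwidth of a few Hz but well within a band of width 250 Hz. The raw spectra therefore differ strongly while the band-averaged ones stay close, at the price of no longer distinguishing the two frequencies.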
2.1 Fourier Invariance and Deformation Instability
Let $\hat{x}(\omega) = \int x(u)\, e^{-i\omega u}\, du$ be the Fourier transform of $x$. If $x_c(t) = x(t - c)$ then $\hat{x}_c(\omega) = e^{-ic\omega}\, \hat{x}(\omega)$. The Fourier transform modulus is thus invariant to translation:

$$|\hat{x}_c(\omega)| = |\hat{x}(\omega)|. \qquad (1)$$

A spectrogram localizes this translation invariance with a window $\phi$ of duration $T$ such that $\int \phi(u)\, du = 1$. It is defined by

$$|\hat{x}(t,\omega)| = \left| \int x(u)\, \phi(u - t)\, e^{-i\omega u}\, du \right|. \qquad (2)$$

If $|c| \ll T$ then one can verify that $|\hat{x}_c(t,\omega)| \approx |\hat{x}(t,\omega)|$.
Suppose that $x$ is not just translated but time-warped to give $x_\tau(t) = x(t - \tau(t))$ with $|\tau'(t)| < 1$. A representation $\Phi(x)$ is said to be stable to deformation if the Euclidean norm $\|\Phi(x) - \Phi(x_\tau)\|$ is small when the deformation is small. The deformation size is measured by $\sup_t |\tau'(t)|$. If it vanishes then it is a "pure" translation without deformation. Stability is formally defined as a Lipschitz continuity condition relative to this metric. It means that there exists $C > 0$ such that for all $\tau$ with $\sup_t |\tau'(t)| < 1$

$$\|\Phi(x) - \Phi(x_\tau)\| \leq C\, \sup_t |\tau'(t)|\, \|x\|. \qquad (3)$$

A Fourier modulus representation $\Phi(x) = |\hat{x}|$ is not stable to deformation because high frequencies are severely distorted by small deformations. For example, let us consider a small dilation $\tau(t) = \epsilon t$ with $0 < \epsilon \ll 1$. Since $\tau'(t) = \epsilon$, the Lipschitz continuity condition (3) becomes

$$\big\| |\hat{x}| - |\hat{x}_\tau| \big\| \leq C\, \epsilon\, \|x\|. \qquad (4)$$
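The translation invariance (1) and the failure of the bound (4) at high frequencies can be checked on a discrete example. The sketch below is only an illustration under stated assumptions: the Fourier transform is replaced by a unitary DFT, translation by a circular shift, and the test signal is a Gaussian-windowed sinusoid. The sampling rate, envelope width, dilation factor and the names `tone`, `fourier_modulus` and `dilation_ratio` are hypothetical choices, not taken from the paper.

```python
import numpy as np

fs = 8192.0                          # hypothetical sampling rate (Hz)
t = np.arange(-1.0, 1.0, 1.0 / fs)   # 2-second time axis
sigma = 0.05                         # width of the Gaussian envelope (s)
eps = 0.02                           # small dilation, tau(t) = eps * t


def tone(time, freq_hz):
    """Gaussian-windowed sinusoid at freq_hz, evaluated on `time`."""
    envelope = np.exp(-0.5 * (time / sigma) ** 2)
    return envelope * np.cos(2.0 * np.pi * freq_hz * time)


def fourier_modulus(x):
    """Modulus of the unitary DFT, so that its norm equals the norm of x."""
    return np.abs(np.fft.fft(x, norm="ortho"))


def dilation_ratio(freq_hz):
    """Left-hand side of (4) divided by ||x|| for a tone at freq_hz."""
    x = tone(t, freq_hz)
    x_tau = tone((1.0 - eps) * t, freq_hz)   # x_tau(t) = x((1 - eps) t)
    diff = fourier_modulus(x) - fourier_modulus(x_tau)
    return np.linalg.norm(diff) / np.linalg.norm(x)


# (1): a circular translation leaves the Fourier modulus unchanged.
x = tone(t, 440.0)
shifted = np.roll(x, 300)
translation_error = np.max(np.abs(fourier_modulus(x) - fourier_modulus(shifted)))

# (4): the same dilation is mild at low frequency, severe at high frequency.
ratio_low = dilation_ratio(10.0)
ratio_high = dilation_ratio(2000.0)
print(translation_error, ratio_low, ratio_high)
```

With these values the dilation shifts the tone by $\epsilon$ times its frequency, that is 0.2 Hz at 10 Hz and 40 Hz at 2000 Hz, while the spectral width of the windowed tone stays at a few Hz. At 2000 Hz the two spectra no longer overlap, so the ratio stays of order one however small the constant $C\,\epsilon$ is, which is the instability that (4) is meant to expose.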
