Professional Documents
Culture Documents
Ltfatnote 045
Ltfatnote 045
Abstract—A Matlab/GNU Octave toolbox for phase (signal) II. O FFLINE ALGORITHMS
reconstruction from the short-time Fourier transform (STFT)
magnitude is presented. The toolbox provides an implementation In this section, we will only consider signals of finite length
of various, conceptually different algorithms ranging from the L with periodic boundaries. The STFT of such signals is also
well known Griffin-Lim algorithm and its derivatives to the very called the (finite) discrete Gabor transform (DGT) [3] and
recent ones. The list includes real-time capable algorithms which we will adopt such name in this paper. The DGT coefficient
are also implemented in real-time audio demos running directly matrix c ∈ CM ×N of a signal f ∈ CL with respect to a window
in Matlab/GNU Octave. The toolbox is well-documented, open-
source and it is available under the GPL3 license. In this paper, g ∈ CL can be obtained as [4]
we give an overview of the algorithms contained in the toolbox L−1
and discuss their properties.
X
c(m, n) = f (l + na)g(l)e−i2πml/M (1)
l=0
I. I NTRODUCTION for m = 0, . . . , M −1 and n = 0, . . . , N −1, M is the number
In many STFT-based audio processing applications, the of frequency channels, N = L/a number of time shifts, a is the
direct synthesis from the STFT magnitude (or spectrogram) hop size in samples. The overline denotes complex conjugation.
is unavoidable. Since the missing information is the phase of Index (l+na) is assumed to be evaluated modulo L; effectively
the complex coefficients, the task is also often called phase introducing the aforementioned periodic boundaries. In other
retrieval or phase reconstruction. To date, many algorithms have words, the signal f and the window g are considered to be
been proposed, but they have mostly been published without single periods of L-periodic signals. Note that in order not
reference implementation or means for reproducing the results. to introduce undesirable effects caused by the phase of the
Moreover, the algorithms have often been formulated and/or window itself, we assume the original, nonperiodized window
evaluated for a single STFT setting using a fixed window for to be whole-point symmetric and centered around the origin.
the analysis and synthesis. In this paper, we present the Phase The complex coefficients can be expressed in polar coordinates
Retrieval Toolbox (PHASERET), which tries to rid the user as
of all hindrances mentioned above and others connected to
implementation and evaluation of the algorithms. The phase s(m, n) = c(m, n) φ(m, n) = arg c(m, n) , (2)
reconstruction algorithms can be divided into two main classes. s denoting magnitude and φ denoting their phase. The log of
Algorithms in the first class require the knowledge of the the magnitude s (m, n) = log(s(m, n)) is usually visualised
log
magnitude component of the entire signal (offline only) while in the form of a spectrogram. Using matrices, we can write
algorithms in the second class are capable of running in the c = F∗ f , where c ∈ CMN denotes vectorized c such that
vec g vec
real-time setting i.e. to process STFT frames in the frame- c (p) = c(m, n) for p = m + nM and F∗ ∈ CM N ×L is
vec g
by-frame manner. Therefore, the paper is divided into two a conjugate transpose of a “fat” (M N > L) matrix F with
g
parts each devoted to one of the classes. The acronyms in the columns in the form of modulated and circularly shifted
the typewriter font used in the titles and in the text refer to versions of g (see Fig. 1)
the actual function names from the toolbox. For a detailed
description of the functions, please refer to the documentation. gm,n (l) = g(l − na)ei2πm(l−na)/M ,
where (l − na) is again evaluated modulo L. Equation (1)
A. Toolbox Organization
can be therefore equivalently expressed using inner products
The toolbox is mostly written in the Matlab scripting lan-
guage compatible with GNU Octave. The main computational c(m, n) = f , gm,n (3)
functions are written in the C language for greater speed L−1
X
and they are accessed trough MEX functions. The toolbox = f (l)g(l − na)e−i2πm(l−na)/M . (4)
depends on the Large Time-Frequency Analysis Toolbox [1, 2] l=0
(LTFAT) and adopts its conventions and the GPLv3 license. The As long as the matrix Fg has full row rank (linearly independent
documentation is available at http://ltfat.github.io/phaseret/doc rows), the signal can be recovered using f = F cvec where
g
e
and the GitHub development page containing GIT reposi-
∗
−1
∗
tory, release packages and the issue tracker is available at Fge = Fg Fg Fg is the left inverse of Fg such that
∗ ∗
http://github.com/ltfat/phaseret. F g
e F g = F g F g
e = I. The fundamental property of invertible
Gabor systems is that Fge has the same structure as Fg with
Z. Průša* is with the Acoustics Research Institute, Austrian Academy g
e being the synthesis window, defined by
of Sciences, Wohllebengasse 12–14, 1040 Vienna, Austria, email:
zdenek.prusa@oeaw.ac.at (corresponding address), −1
This work was supported by the Austrian Science Fund (FWF): Y 551–N13 e = Fg F∗g
g g. (5)
2 THIS IS A PREPRINT OF A PAPER PRESENTED AT THE AES INTERNATIONAL CONFERENCE ON SEMANTIC AUDIO, JUNE 2017, ERLANGEN, GERMANY
Algorithm 1: (Fast) Griffin-Lim algorithm (the q-th power is performed elementwise) and expressing its
Input: Initial coefficients cvec,0 , initial α0 . gradient explicitly as
Output: Coefficients with new phase cvec,i .
∇G(f ) =
1 tvec,0 = cvec,0 ; " #
2 for i = 1, 2, . . . do
q q2 −1 (15)
∗
2qR Fg Fg f − sqvec · F∗g f · F∗g f
3 tvec,i = PC2 PC1 cvec,i−1 ;
4 cvec,i = tvec,i + αi−1 (tvec,i − tvec,i−1 );
5 Update αi ; (R denotes the real-part operator) allows usage of generic
6 end
algorithms for unconstrained minimization. The authors suggest
the l-BFGS algorithm as it does not require explicit second
order derivatives. Further, they suggested to pick q = 2/3. Note
B. Le Roux’s Modifications of GLA – legla that the function decolbfgs requires the implementation of
Le Roux et al. [9] proposed several modifications of the lBFGS from the minFunc package [18].
Griffin-Lim algorithm. First, they recognized a structure in the
projection (9) and proposed its approximation using a truncated
kernel. Second, since this way the projection (9) can be done D. Phase Gradient Heap Integration – pghi
coefficient-wise and so can be done (11), the authors proposed Průša et al. [10] proposed a phase reconstruction method
to do on-the-fly updates i.e. to reuse the just updated coefficient based on the relationship between the magnitude and the
for computing others within the same iteration. Third, they phase gradients. Assuming the Gaussian window is used, the
proposed a modified phase update (omitting contribution of the phase gradient can be estimated using the magnitude gradient
coefficient being updated). The projection (9) can be written and the phase is then recovered by employing an adaptive
in a form of (non-commutative) circular twisted convolution integration procedure. The estimate of the phase derivative in
[16] the frequency direction φω (m, n) and in the time direction
(cvec \hvec ) (m + nM ) = φt (m, n) expressed solely from the magnitude (pre-scaled for
M −1,N the integration) can be written as
X −1 (13)
c(j, k)h(m − j, n − k)ei2π(n−k)ja/M φω (m, n) =
j,k=0 γ
− slog (m, n + 1) − slog (m, n − 1)
where the kernel is obtained as hvec = F∗g g e. Clearly, there 2aM
is only lcm(a, M )/a (least common multiple) unique kernel (16)
φt (m, n) =
modulations which can be precomputed. Typically, the kernel aM
h is essentially supported around the origin and it can be slog (m + 1, n) − slog (m − 1, n) + 2πam/M
2γ
truncated with only a minor loss of precision. Further, the
kernel is conjugate symmetric, which can be exploited to reduce with γ being the “time-frequency” ratio of the Gaussian window.
the number of multiplications. Note that the modified phase Given phase at a single point φ(m, n), the phase of its four
update can be obtained simply by setting h(0, 0) = 0. The neighbors is computed as
algorithm is summarized as Alg. 2. Note that the acceleration 1
technique from [15] can be used here as well. φ(m + 1, n) ← φ(m, n) + φω (m, n) + φω (m + 1, n) ,
2
1
Algorithm 2: Le Roux’s version of GLA φ(m, n + 1) ← φ(m, n) + φt (m, n) + φt (m, n + 1) ,
2
Input: Initial coefficients cvec,1 . 1
φ(m − 1, n) ← φ(m, n) − φω (m, n) + φω (m − 1, n) ,
Output: Coefficients with new phase cvec,i . 2
1 Initialize and order array I of indices p to be processed; 1
φ(m, n − 1) ← φ(m, n) − φt (m, n) + φt (m, n − 1) .
2 for i = 1, 2, . . . do 2
3 for all p ∈ I do Further phase spreading is guided by the coefficient magnitude
4 cvec,i (p) ← such that the phase of significant coefficients along spectrogram
svec (p) cvec,i \hvec (p)/ cvec,i \hvec (p);
ridges is computed first. For non-Gaussian windows, the
5 end approximate γ can be obtained from pghi_findgamma.
6 end
C. Unconstrained Optimization – decolbfgs When dealing with real-time algorithms, the finite length
signal model from the previous section is no longer suitable. In
Decorsiere et al. [17] proposed to take a direct optimization order to model streams of audio data, it is a common practice to
approach. Writing the objective function as work with infinite length discrete time signals. In this section,
2
we will reformulate STFT for such signals and use it in the
∗ q q
G(f ) =
Fg f − svec
(14) description of the algorithms. Given a discrete time signal f
4 THIS IS A PREPRINT OF A PAPER PRESENTED AT THE AES INTERNATIONAL CONFERENCE ON SEMANTIC AUDIO, JUNE 2017, ERLANGEN, GERMANY
and a compactly supported window g, the STFT at time index Algorithm 3: SPSI
na denoted as cn is given by Input: Magnitude of n-th frame sn , phase of the previous
X frame φn−1 .
cn (m) = f (l + na)g(l)e−i2πml/M (17) Output: Phase of the current frame φn .
l∈I
1 Identify peaks sn (mpeak ) and their regions of influence
Algorithm 4: (GS)RTISI-LA [3] Søndergaard, P. L., Finite discrete Gabor analysis, Ph.D.
Input: Number of look-ahead frames NLA , magnitude of thesis, Technical University of Denmark, 2007, available
the current and the look-ahead frames from: http://ltfat.github.io/notes/ltfatnote003.pdf.
sn , . . . , sn+NLA [4] Søndergaard, P. L., “Gabor frames by Sampling and
Output: Time frame fn and its coefficients cn . Periodization,” Adv. Comput. Math., 27(4), pp. 355 –373,
1 Initialize fn+NLA to zeros.; 2007, doi:10.1007/s10444-005-9003-y.
2 for i = 1, 2, . . . do [5] Portnoff, M. R., “Implementation of the digital phase
3 for k = NLA , . . . , 0 do vocoder using the fast Fourier transform,” IEEE Transac-
4 Compute fen+NLA using (21).; tions on Acoustics, Speech, and Signal Processing, 24(3),
pp. 243–248, 1976, ISSN 0096-3518, doi:10.1109/TASSP.
5 t ← Vgk fen+NLA ;
n+k 1976.1162810.
6 cn+k (m) ← sn+k (m) t(m)/ t(m); [6] Søndergaard, P. L., “Efficient Algorithms for the Dis-
7 Compute fn+k using (19); crete Gabor Transform with a long FIR window,” J.
8 end Fourier Anal. Appl., 18(3), pp. 456–470, 2012, doi:
9 end 10.1007/s00041-011-9210-5.
[7] Griffin, D. and Lim, J., “Signal estimation from modified
short-time Fourier transform,” Acoustics, Speech and
IV. OTHER ALGORITHMS Signal Processing, IEEE Transactions on, 32(2), pp. 236–
243, 1984, ISSN 0096-3518, doi:10.1109/TASSP.1984.
The toolbox is primarily focused on processing audio signals 1164317.
which consist of tens of thousands of samples per second. We [8] Strang, G., Introduction to Linear Algebra, Wellesley-
have intentionally not included implementation of algorithms Cambridge Press, 5 edition, 2016, ISBN 978-09802327-
which cannot efficiently handle signals at least few seconds 7-6.
long or which have some other serious drawback. In this section, [9] Le Roux, J., Kameoka, H., Ono, N., and Sagayama,
we mention other approaches and give a reason why they are S., “Fast Signal Reconstruction from Magnitude STFT
not practical for processing audio signals. Spectrogram Based on Spectrogram Consistency,” in Proc.
A popular approach is reformulating the problem as a convex, 13th Int. Conf. on Digital Audio Effects (DAFx-10), pp.
matrix rank minimization problem [24, 25, 26]. Doing so 397–403, 2010.
however squares the size of the original problem and only very [10] Průša, Z., Balazs, P., and Søndergaard, P. L., “A Nonitera-
short signals can be handled efficiently. The authors themselves tive Method for Reconstruction of Phase from STFT Mag-
evaluate the algorithms on signals shorter than 100 samples. nitude,” IEEE/ACM Trans. on Audio, Speech, and Lang.
An approach by Bouvrie and Ezzat [27] is based on solving a Process., 25(5), 2017, doi:10.1109/TASLP.2017.2678166.
non-linear system of equations for each time frame. It however [11] Nakamura, T. and Kameoka, H., “Fast signal reconstruc-
relies on using a rectangular window for analysis which is tion from magnitude spectrogram of continuous wavelet
known to have bad frequency selectivity. transform based on spectrogram consistency,” Proc. 17th
Worth mentioning is an algorithm by Eldar et. al. [28] which Int. Conf. Digital Audio Effects DAFx (DAFx-14), 2014,
is only applicable to signals which are sparse in the original pp. 129–135, 2014.
domain. Such assumption is not realistic when considering real [12] Dittmar, C. and Müller, M., “Towards transient restoration
world audio signals. in score-informed audio decomposition,” Proc. 18th Int.
Conf. Digital Audio Effects DAFx (DAFx-15), 2015, pp.
145–152, 2015.
V. ACKNOWLEDGEMENT
[13] Sturmel, N. and Daudet, L., “Iterative phase reconstruction
This work was supported by the Austrian Science Fund of Wiener filtered signals,” in Acoustics, Speech and
(FWF): Y 551–N13. Signal Processing (ICASSP), 2012 IEEE International
Conference on, pp. 101–104, 2012, ISSN 1520-6149, doi:
10.1109/ICASSP.2012.6287827.
R EFERENCES
[14] Sturmel, N. and Daudet, L., “Informed Source Separation
[1] Søndergaard, P. L., Torrésani, B., and Balazs, P., “The Using Iterative Reconstruction,” IEEE Transactions on
Linear Time Frequency Analysis Toolbox,” Interna- Audio, Speech, and Language Processing, 21(1), pp. 178–
tional Journal of Wavelets, Multiresolution Analysis and 185, 2013, ISSN 1558-7916, doi:10.1109/TASL.2012.
Information Processing, 10(4), pp. 1–27, 2012, doi: 2215597.
10.1142/S0219691312500324. [15] Perraudin, N., Balazs, P., and Søndergaard, P. L., “A
[2] Průša, Z., Søndergaard, P. L., Holighaus, N., Wiesmeyr, fast Griffin-Lim algorithm,” in Applications of Signal
C., and Balazs, P., “The Large Time-Frequency Analysis Processing to Audio and Acoustics (WASPAA), IEEE
Toolbox 2.0,” in Sound, Music, and Motion, Lecture Workshop on, pp. 1–4, 2013, ISSN 1931-1168, doi:
Notes in Computer Science, pp. 419–442, Springer 10.1109/WASPAA.2013.6701851.
International Publishing, 2014, ISBN 978-3-319-12975-4, [16] Werther, T., Matusiak, E., Eldar, Y. C., and Subbana, N. K.,
doi:{10.1007/978-3-319-12976-1\ 25}. “A Unified Approach to Dual Gabor Windows,” IEEE
6 THIS IS A PREPRINT OF A PAPER PRESENTED AT THE AES INTERNATIONAL CONFERENCE ON SEMANTIC AUDIO, JUNE 2017, ERLANGEN, GERMANY