Lecture # 5 Session 2003

Speech Signal Representation
• Fourier Analysis
  – Discrete-time Fourier transform
  – Short-time Fourier transform
  – Discrete Fourier transform
• Cepstral Analysis
  – The complex cepstrum and the cepstrum
  – Computational considerations
  – Cepstral analysis of speech
  – Applications to speech recognition
  – Mel-Frequency cepstral representation
• Performance Comparison of Various Representations

6.345 Automatic Speech Recognition (2003)


Discrete-Time Fourier Transform

$$X(e^{j\omega}) = \sum_{n=-\infty}^{+\infty} x[n]\, e^{-j\omega n}$$

$$x[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega})\, e^{j\omega n}\, d\omega$$

• Sufficient condition for convergence: $\sum_{n=-\infty}^{+\infty} |x[n]| < +\infty$

• Although $x[n]$ is discrete, $X(e^{j\omega})$ is continuous and periodic with period $2\pi$.
• Convolution/multiplication duality:

$$y[n] = x[n] * h[n] \iff Y(e^{j\omega}) = X(e^{j\omega})\, H(e^{j\omega})$$

$$y[n] = x[n]\, w[n] \iff Y(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\theta})\, X(e^{j(\omega-\theta)})\, d\theta$$
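The convolution half of the duality, and the $2\pi$ periodicity, can be checked numerically on short sequences. The following is a minimal sketch using only the Python standard library; the sequences `x` and `h`, the test frequency `omega`, and the helper names `dtft` and `convolve` are illustrative choices, not part of the lecture.

```python
import cmath
import math

def dtft(x, omega):
    """Evaluate X(e^{j omega}) = sum_n x[n] e^{-j omega n} for a finite x starting at n = 0."""
    return sum(xn * cmath.exp(-1j * omega * n) for n, xn in enumerate(x))

def convolve(x, h):
    """Linear convolution y[n] = sum_k x[k] h[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

# Arbitrary illustrative sequences and frequency.
x = [1.0, 2.0, 0.5, -1.0]
h = [0.5, 0.25, 0.125]
y = convolve(x, h)
omega = 0.7

# Convolution in time should equal multiplication in frequency.
duality_error = abs(dtft(y, omega) - dtft(x, omega) * dtft(h, omega))

# X(e^{j omega}) should repeat every 2*pi.
periodicity_error = abs(dtft(x, omega) - dtft(x, omega + 2 * math.pi))

print(duality_error, periodicity_error)
```

Both printed values should be at the level of floating-point round-off.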


Short-Time Fourier Analysis (Time-Dependent Fourier Transform)

[Figure: signal $x[m]$ with sliding windows $w[50-m]$, $w[100-m]$, $w[200-m]$ positioned at $n = 50$, $n = 100$, $n = 200$]

$$X_n(e^{j\omega}) = \sum_{m=-\infty}^{+\infty} w[n-m]\, x[m]\, e^{-j\omega m}$$

• If $n$ is fixed, then it can be shown that:

$$X_n(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\theta})\, e^{j\theta n}\, X(e^{j(\omega+\theta)})\, d\theta$$

• The above equation is meaningful only if we assume that $X(e^{j\omega})$ represents the Fourier transform of a signal whose properties continue outside the window, or simply that the signal is zero outside the window.
• In order for $X_n(e^{j\omega})$ to correspond to $X(e^{j\omega})$, $W(e^{j\omega})$ must resemble an impulse with respect to $X(e^{j\omega})$.
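The defining sum for $X_n(e^{j\omega})$ can be evaluated directly for one frame position and one frequency. The sketch below assumes a 40-point rectangular window and a test cosine at $\omega_0 = \pi/4$; both are illustrative choices, as is the helper name `stft_frame`. Because the window spans a whole number of periods, the on-peak magnitude is exactly half the window length for every $n$.

```python
import cmath
import math

def stft_frame(x, w, n, omega):
    """X_n(e^{j omega}) = sum_m w[n - m] x[m] e^{-j omega m}, with w[k] = 0 outside 0 <= k < len(w)."""
    total = 0j
    for m, xm in enumerate(x):
        k = n - m
        if 0 <= k < len(w):
            total += w[k] * xm * cmath.exp(-1j * omega * m)
    return total

# Illustrative signal: a cosine with period 8 samples.
omega0 = math.pi / 4
x = [math.cos(omega0 * m) for m in range(300)]
w = [1.0] * 40  # rectangular window covering exactly 5 periods

# Magnitude at the signal frequency for the three window positions of the figure.
on_peak = {n: abs(stft_frame(x, w, n, omega0)) for n in (50, 100, 200)}

# Magnitude at a frequency where the windowed cosine has no energy.
off_peak = abs(stft_frame(x, w, 100, math.pi / 2))

print(on_peak, off_peak)
```

A stationary signal gives the same magnitude at every frame position; only the phase of $X_n(e^{j\omega})$ depends on $n$.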

Rectangular Window

$$w[n] = 1, \quad 0 \le n \le N-1$$

Hamming Window

$$w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$

[Figure: Comparison of Windows]

[Figure: Comparison of Windows (cont'd)]
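One way to compare the two windows numerically is to look at how much of each window's spectrum leaks far from the main lobe. The sketch below builds both windows from their definitions and measures the largest relative magnitude for $\omega \ge 8\pi/N$. The window length `N = 51`, the cutoff $8\pi/N$, and the 401-point frequency grid are assumptions made for illustration only.

```python
import cmath
import math

def rectangular(N):
    """w[n] = 1 for 0 <= n <= N-1."""
    return [1.0] * N

def hamming(N):
    """w[n] = 0.54 - 0.46 cos(2 pi n / (N-1)) for 0 <= n <= N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def relative_magnitude(w, omega):
    """|W(e^{j omega})| normalised by its value at omega = 0."""
    W = sum(wn * cmath.exp(-1j * omega * n) for n, wn in enumerate(w))
    return abs(W) / abs(sum(w))

N = 51
rect, ham = rectangular(N), hamming(N)

# Frequency grid starting beyond the main lobe of both windows.
start = 8 * math.pi / N
grid = [start + i * (math.pi - start) / 400 for i in range(401)]

rect_leak = max(relative_magnitude(rect, om) for om in grid)
ham_leak = max(relative_magnitude(ham, om) for om in grid)

print(20 * math.log10(rect_leak), 20 * math.log10(ham_leak))  # leakage in dB
```

The Hamming window trades a wider main lobe for much lower sidelobes, so its leakage figure should come out well below that of the rectangular window.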

[Figure: A Wideband Spectrogram of the utterance "Two plus seven is less than ten"]

[Figure: A Narrowband Spectrogram of the utterance "Two plus seven is less than ten"]

Discrete Fourier Transform

$x[n]$ ($N$ points) $\iff$ $X[k]$ ($M$ points)

$$X[k] = X(z)\Big|_{z = e^{j\frac{2\pi k}{M}}} = \sum_{n=0}^{N-1} x[n]\, e^{-j\frac{2\pi k}{M} n}$$

$$x[n] = \frac{1}{M} \sum_{k=0}^{M-1} X[k]\, e^{j\frac{2\pi k}{M} n}$$

In general, the number of input points, $N$, and the number of frequency samples, $M$, need not be the same.
• If $M > N$, we must zero-pad the signal
• If $M < N$, we must time-alias the signal
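The two cases above can be sketched as follows: an $M$-point DFT of an $N$-point sequence is computed after zero-padding ($M > N$) or after folding the sequence modulo $M$ ($M < N$), and in both cases the result should equal the DTFT sampled at $\omega = 2\pi k / M$. The sequence `x` and the helper names are illustrative assumptions.

```python
import cmath
import math

def dtft(x, omega):
    """X(e^{j omega}) for a finite sequence starting at n = 0."""
    return sum(xn * cmath.exp(-1j * omega * n) for n, xn in enumerate(x))

def dft(x, M):
    """M-point DFT of an N-point sequence: zero-pad if M > N, time-alias if M < N."""
    N = len(x)
    if M >= N:
        xm = list(x) + [0.0] * (M - N)
    else:
        # Time aliasing: add together all samples whose indices agree modulo M.
        xm = [sum(x[n] for n in range(r, N, M)) for r in range(M)]
    return [sum(xm[n] * cmath.exp(-2j * math.pi * k * n / M) for n in range(M))
            for k in range(M)]

x = [1.0, -2.0, 3.0, 0.5, 4.0]  # N = 5, arbitrary values

def max_sampling_error(M):
    """Largest difference between the M-point DFT and the DTFT sampled at 2*pi*k/M."""
    X = dft(x, M)
    return max(abs(X[k] - dtft(x, 2 * math.pi * k / M)) for k in range(M))

err_zero_pad = max_sampling_error(8)    # M > N
err_time_alias = max_sampling_error(3)  # M < N

print(err_zero_pad, err_time_alias)
```

Time aliasing preserves the frequency samples, but when $M < N$ the original $x[n]$ can no longer be recovered from $X[k]$.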

[Figure: Examples of Various Spectral Representations]

Cepstral Analysis of Speech

[Figure: source-filter model; voiced or unvoiced excitation $u[n]$ drives $H(z)$ to produce $s[n]$]

• The speech signal is often assumed to be the output of an LTI system; i.e., it is the convolution of the input and the impulse response.
• If we are interested in characterizing the signal in terms of the parameters of such a model, we must go through the process of de-convolution.
• Cepstral analysis is a common procedure used for such de-convolution.

Cepstral Analysis

• Cepstral analysis for convolution is based on the observation that:

$$x[n] = x_1[n] * x_2[n] \iff X(z) = X_1(z)\, X_2(z)$$

Taking the complex logarithm of $X(z)$ gives:

$$\log\{X(z)\} = \log\{X_1(z)\} + \log\{X_2(z)\} = \hat{X}(z)$$

• If the complex logarithm is unique, and if $\hat{X}(z)$ is a valid z-transform, then

$$\hat{x}[n] = \hat{x}_1[n] + \hat{x}_2[n]$$

The two convolved signals will be additive in this new, cepstral domain.
• If we restrict ourselves to the unit circle, $z = e^{j\omega}$, then:

$$\hat{X}(e^{j\omega}) = \log|X(e^{j\omega})| + j \arg\{X(e^{j\omega})\}$$

It can be shown that one approach to dealing with the problem of uniqueness is to require that $\arg\{X(e^{j\omega})\}$ be a continuous, odd, periodic function of $\omega$.

Cepstral Analysis (cont'd)

• To the extent that $\hat{X}(z) = \log\{X(z)\}$ is valid,

$$\hat{x}[n] = \frac{1}{2\pi} \int_{-\pi}^{+\pi} \hat{X}(e^{j\omega})\, e^{j\omega n}\, d\omega = \frac{1}{2\pi} \int_{-\pi}^{+\pi} \log\{X(e^{j\omega})\}\, e^{j\omega n}\, d\omega \quad \text{(complex cepstrum)}$$

$$c[n] = \frac{1}{2\pi} \int_{-\pi}^{+\pi} \log|X(e^{j\omega})|\, e^{j\omega n}\, d\omega \quad \text{(cepstrum)}$$

• It can easily be shown that $c[n]$ is the even part of $\hat{x}[n]$.
• If $\hat{x}[n]$ is real and causal, then $\hat{x}[n]$ can be recovered from $c[n]$. This is known as the Minimum Phase condition.
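Both bullets can be illustrated on a simple minimum-phase sequence. The sketch below assumes $x[n] = \delta[n] - a\,\delta[n-1]$ with $a = 0.5$, for which the power series of $\log(1 - a z^{-1})$ gives $\hat{x}[n] = -a^n/n$ for $n \ge 1$ and zero otherwise. The integrals are approximated by a 128-point inverse DFT. Since the real part of $X(e^{j\omega})$ stays positive for this choice, the principal value of the logarithm already has continuous phase. All names and parameter values are illustrative.

```python
import cmath
import math

a = 0.5   # zero at z = a, inside the unit circle, so x[n] is minimum phase
N = 128   # number of frequency samples used to approximate the integrals

def X(omega):
    """DTFT of x[n] = delta[n] - a * delta[n - 1]."""
    return 1.0 - a * cmath.exp(-1j * omega)

def inverse_dft_sample(values, n):
    """(1/N) sum_k values[k] e^{j 2 pi k n / N} for a single index n."""
    count = len(values)
    return sum(v * cmath.exp(2j * math.pi * k * n / count)
               for k, v in enumerate(values)) / count

log_X = [cmath.log(X(2 * math.pi * k / N)) for k in range(N)]          # complex log
log_mag = [math.log(abs(X(2 * math.pi * k / N))) for k in range(N)]    # log magnitude

def xhat(n):
    """Complex cepstrum; negative n is read modulo N."""
    return inverse_dft_sample(log_X, n % N).real

def c(n):
    """Cepstrum; negative n is read modulo N."""
    return inverse_dft_sample(log_mag, n % N).real

for n in range(1, 6):
    even_part = 0.5 * (xhat(n) + xhat(-n))
    print(n, xhat(n), -a ** n / n, c(n), even_part, 2 * c(n))
```

For each $n > 0$ the printout should show $\hat{x}[n]$ matching $-a^n/n$, $c[n]$ matching the even part of $\hat{x}[n]$, and $2c[n]$ reproducing $\hat{x}[n]$, which is the recovery that the causal (minimum-phase) case allows.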

An Example

$$p[n] = \delta[n] + \alpha\, \delta[n-N], \quad 0 < \alpha < 1$$

$$P(z) = 1 + \alpha z^{-N}$$

$$\hat{P}(z) = \log P(z) = \log\left(1 + \alpha z^{-N}\right) = \log\left(1 - (-\alpha)(z^N)^{-1}\right) = \sum_{n=1}^{\infty} (-1)^{n+1} \frac{\alpha^n}{n} (z^N)^{-n}$$

$$\hat{P}(z) = \sum_{n=1}^{\infty} (-1)^{n+1} \frac{\alpha^n}{n} z^{-nN}$$

$$\hat{p}[n] = \sum_{r=1}^{\infty} (-1)^{r+1} \frac{\alpha^r}{r}\, \delta[n - rN]$$
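The closed-form result can be compared with a numerically computed complex cepstrum. In the sketch below, `alpha = 0.5`, `delay = 4` (the $N$ of the example) and the DFT size `M = 128` are assumed values. Because $0 < \alpha < 1$, the real part of $P(e^{j\omega})$ is always positive, so the principal logarithm needs no phase unwrapping here.

```python
import cmath
import math

alpha = 0.5
delay = 4    # the echo delay, called N in the example
M = 128      # DFT size, chosen large so that aliasing is negligible

# Samples of P(e^{j omega}) = 1 + alpha e^{-j omega delay} at omega = 2 pi k / M.
P = [1.0 + alpha * cmath.exp(-2j * math.pi * k * delay / M) for k in range(M)]
log_P = [cmath.log(Pk) for Pk in P]

# Inverse DFT of the complex log spectrum approximates the complex cepstrum.
p_hat = [
    (sum(log_P[k] * cmath.exp(2j * math.pi * k * n / M) for k in range(M)) / M).real
    for n in range(M)
]

def series_term(r):
    """Coefficient of delta[n - r*delay] in the closed form: (-1)^(r+1) alpha^r / r."""
    return (-1) ** (r + 1) * alpha ** r / r

for r in (1, 2, 3):
    print(r * delay, p_hat[r * delay], series_term(r))
```

The computed cepstrum should show alternating-sign impulses at multiples of the delay with amplitudes $\alpha$, $-\alpha^2/2$, $\alpha^3/3$, and values near zero elsewhere.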

[Figure: An Example (cont'd)]

Computational Considerations

• We now replace the Fourier transform expressions by the discrete Fourier transform expressions:

$$X_p[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j\frac{2\pi}{N} kn}, \quad 0 \le k \le N-1$$

$$\hat{X}_p[k] = \log\{X_p[k]\}, \quad 0 \le k \le N-1$$

$$\hat{x}_p[n] = \frac{1}{N} \sum_{k=0}^{N-1} \hat{X}_p[k]\, e^{j\frac{2\pi}{N} kn}, \quad 0 \le n \le N-1$$

• $\hat{X}_p[k]$ is a sampled version of $\hat{X}(e^{j\omega})$. Therefore,

$$\hat{x}_p[n] = \sum_{r=-\infty}^{\infty} \hat{x}[n + rN]$$

• Likewise:

$$c_p[n] = \sum_{r=-\infty}^{\infty} c[n + rN]$$

where,

$$c_p[n] = \frac{1}{N} \sum_{k=0}^{N-1} \log|X_p[k]|\, e^{j\frac{2\pi}{N} kn}, \quad 0 \le n \le N-1$$

• To minimize aliasing, $N$ must be large.
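The aliasing relation for $c_p[n]$ can be demonstrated with a sequence whose cepstrum is known in closed form. The sketch assumes $x[n] = \delta[n] - a\,\delta[n-1]$ with $a = 0.9$, for which the series expansion of $\log|1 - a e^{-j\omega}|$ gives $c[0] = 0$ and $c[n] = -a^{|n|}/(2|n|)$ otherwise. The DFT sizes 8 and 256 and the truncation of the aliasing sum at $|r| \le 60$ are illustrative choices.

```python
import cmath
import math

a = 0.9

def c_true(n):
    """Closed-form cepstrum of x[n] = delta[n] - a*delta[n-1]."""
    return 0.0 if n == 0 else -(a ** abs(n)) / (2 * abs(n))

def c_p(n, N):
    """DFT-based cepstrum: (1/N) sum_k log|X_p[k]| e^{j 2 pi k n / N}."""
    total = 0j
    for k in range(N):
        Xk = 1.0 - a * cmath.exp(-2j * math.pi * k / N)  # X_p[k] for this x[n]
        total += math.log(abs(Xk)) * cmath.exp(2j * math.pi * k * n / N)
    return (total / N).real

# Error of the DFT-based cepstrum at n = 1 for a small and a large DFT size.
err_small = abs(c_p(1, 8) - c_true(1))
err_large = abs(c_p(1, 256) - c_true(1))

# The small-N value should equal the aliased sum  sum_r c[1 + 8 r]  (truncated here).
aliased_sum = sum(c_true(1 + 8 * r) for r in range(-60, 61))
alias_formula_error = abs(c_p(1, 8) - aliased_sum)

print(err_small, err_large, alias_formula_error)
```

With $N = 8$ the slowly decaying cepstrum wraps around and visibly corrupts $c_p[1]$; with $N = 256$ the wrapped terms are negligible.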

Cepstral Analysis of Speech

• For voiced speech:

$$s[n] = p[n] * g[n] * v[n] * r[n] = p[n] * h_v[n] = \sum_{r=-\infty}^{\infty} h_v[n - rN_p]$$

• For unvoiced speech:

$$s[n] = w[n] * v[n] * r[n] = w[n] * h_u[n]$$

• Contributions to the cepstrum due to periodic excitation will occur at integer multiples of the fundamental period.
• Contributions due to the glottal waveform (for voiced speech), vocal tract, and radiation will be concentrated in the low quefrency region, and will decay rapidly with $n$.
• Deconvolution can be achieved by multiplying the cepstrum with an appropriate window, $l[n]$.

[Block diagram: $s[n]$ is multiplied by $w[n]$ to give $x[n]$; $D_*[\,\cdot\,]$ maps $x[n]$ to $\hat{x}[n]$; $\hat{x}[n]$ is multiplied by $l[n]$ to give $\hat{y}[n]$; $D_*^{-1}[\,\cdot\,]$ maps $\hat{y}[n]$ to $y[n]$]

where $D_*$ is the characteristic system that converts convolution into addition.
• Thus cepstral analysis can be used for pitch extraction and formant tracking.
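As a rough illustration of pitch extraction, the sketch below synthesises a voiced-like signal by repeating a damped resonance every `Np` samples, applies a Hamming window, computes the cepstrum via the DFT, and picks the largest cepstral value away from the low-quefrency region. The impulse response `h`, the period `Np = 40`, the window length, the DFT size, and the search range 20 to 100 are all hypothetical values chosen for the demonstration; real speech would need more care.

```python
import cmath
import math

Np = 40            # assumed pitch period in samples
L, N = 240, 256    # assumed analysis window length and DFT size

# Hypothetical vocal-tract-like impulse response: a single damped resonance.
h = [0.9 ** n * math.cos(0.3 * math.pi * n) for n in range(60)]

# Voiced model s[n] = sum_r h[n - r*Np], with pulses at n = 0, Np, 2*Np, ...
s = [sum(h[n - r] for r in range(0, n + 1, Np) if n - r < len(h)) for n in range(L)]

# Tapering (Hamming) window applied to the analysis frame.
x = [s[n] * (0.54 - 0.46 * math.cos(2 * math.pi * n / (L - 1))) for n in range(L)]

# Log magnitude spectrum on an N-point grid (small constant guards against log(0)).
log_mag = []
for k in range(N):
    Xk = sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(L))
    log_mag.append(math.log(abs(Xk) + 1e-12))

def cepstrum(n):
    """c_p[n] = (1/N) sum_k log|X_p[k]| e^{j 2 pi k n / N}."""
    return (sum(log_mag[k] * cmath.exp(2j * math.pi * k * n / N)
                for k in range(N)) / N).real

# Search for the excitation peak above the low-quefrency (vocal tract) region.
search = range(20, 101)
pitch_estimate = max(search, key=cepstrum)

print(pitch_estimate)
```

Keeping only the low-quefrency part of the cepstrum instead (a short window $l[n]$ around $n = 0$) would retain the smooth spectral envelope, which is the basis for formant tracking.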

[Figure: Example of Cepstral Analysis of Vowel (Rectangular Window)]

[Figure: Example of Cepstral Analysis of Vowel (Tapering Window)]

[Figure: Example of Cepstral Analysis of Fricative (Rectangular Window)]

[Figure: Example of Cepstral Analysis of Fricative (Tapering Window)]

The Use of Cepstrum for Speech Recognition

Many current speech recognition systems represent the speech signal as a set of cepstral coefficients, computed at a fixed frame rate. In addition, the time derivatives of the cepstral coefficients have also been used.
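The slide does not say how the time derivatives are computed. One common choice, assumed here purely for illustration, is a linear regression over a few neighbouring frames, $\Delta c_t = \sum_{k=1}^{K} k\,(c_{t+k} - c_{t-k}) \,/\, (2\sum_{k=1}^{K} k^2)$, with edge frames replicated. The function name `delta`, the half-width `K = 2`, and the synthetic frames are hypothetical.

```python
def delta(frames, K=2):
    """Regression-based time derivative of a sequence of cepstral vectors.

    frames: list of equal-length lists, one cepstral vector per frame.
    Frames beyond either end are replaced by the first or last frame.
    """
    T = len(frames)
    denom = 2 * sum(k * k for k in range(1, K + 1))
    out = []
    for t in range(T):
        vec = []
        for i in range(len(frames[t])):
            num = sum(
                k * (frames[min(t + k, T - 1)][i] - frames[max(t - k, 0)][i])
                for k in range(1, K + 1)
            )
            vec.append(num / denom)
        out.append(vec)
    return out

# Synthetic trajectory: each coefficient changes linearly with the frame index.
frames = [[2.0 * t, -0.5 * t] for t in range(10)]
deltas = delta(frames)

print(deltas[5])
```

For a linear trajectory the interior frames return the slope of each coefficient; the first and last `K` frames are biased by the edge replication.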

Statistical Properties of Cepstral Coefficients (Tohkura, 1987)

From a digit database (100 speakers) over dial-up telephone lines.


Signal Representation Comparisons

• Many researchers have compared cepstral representations with Fourier-, LPC-, and auditory-based representations.
• Cepstral representation typically out-performs Fourier- and LPC-based representations.

Example: Classification of 16 vowels using ANN (Meng, 1991)

[Figure: bar chart of testing accuracy (%) on clean and noisy data for four acoustic representations: Auditory Model, MFSC, MFCC, DFT]

• Performance of various signal representations cannot be compared without considering how the features will be used, i.e., the pattern classification techniques used (Leung, et al., 1993).

. durations.g.Things to Ponder.. • Are there other spectral representations that we should consider (e.g. models of the human auditory system)? • What about representing the speech signal in terms of phonetically motivated attributes (e. what are the appropriate methods for modelling them)? 6. fundamental frequency contours)? • How do we make use of these (sometimes heterogeneous) features for recognition (i..e. formants.345 Automatic Speech Recognition (2003) Speech Signal Representaion 27 ...

" IEEE Trans. B. ASSP. Vol. Chigier. 4. Meng. 1987. Tohkura. MIT EECS. 10." IEEE Trans. Mermelstein.. 6.. Y.. 1993. ICASSP. No. and Glass.. 4. 1980. S. P. No.345 Automatic Speech Recognition (2003) Speech Signal Representaion 28 . “A Weighted Cepstral Distance Measure for Speech Recognition. The Use of Distinctive Features for Automatic Speech Recognition. 357-366. and Davis. Vol. J." Proc. SM Thesis. ASSP-28. 680-683.. ASSP. “Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. 2. “A Comparative Study of Signal Represention and Classiﬁcation Techniques for Speech Recognition.. 1414-1422. H. 1991. H.References 1. ASSP-35. 3. Leung. Vol. II.