
Digital Speech Processing
Lectures 7-8

Time Domain Methods in Speech Processing

General Synthesis Model

(Figure: source/system synthesis block diagram. Labels include the voiced-sound amplitude and pitch period, the unvoiced-sound amplitude, the vocal tract parameters (log areas, reflection coefficients, formants, vocal tract polynomial, articulatory parameters, ...), and the radiation model R(z) = 1 - \alpha z^{-1}.)

Estimation tasks implied by the model: pitch detection, voiced/unvoiced/silence detection, gain estimation, vocal tract parameter estimation, glottal pulse shape, radiation model.

General Analysis Model

(Figure: the speech signal s[n] enters a speech analysis model whose outputs are the pitch period T[n], the glottal pulse shape g[n], the voiced amplitude AV[n], the voiced/unvoiced/silence (V/U/S) switch, the unvoiced amplitude AU[n], the vocal tract impulse response v[n], and the radiation characteristic r[n].)

• All analysis parameters are time-varying at rates commensurate with the information in the parameters.
• We need algorithms for estimating the analysis parameters and their variations over time.

Overview

(Figure: speech or music x[n] → Signal Processing → a representation of the speech A(x, t): formants, reflection coefficients, voiced-unvoiced-silence, pitch, sounds of the language, speaker identification, emotions.)

• time domain processing => direct operations on the speech waveform
• frequency domain processing => direct operations on a spectral representation of the signal

(Figure: x[n] → system → zero crossing rate, level crossing rate, energy, autocorrelation.)

• simple processing
• enables various types of feature estimation

Basics

(Figure: speech waveform with voiced (V), unvoiced (U), and silence (S) regions and pitch periods P1, P2 marked.)

• 8 kHz sampled speech (bandwidth < 4 kHz)
• properties of speech change with time
• the excitation goes from voiced to unvoiced
• the peak amplitude varies with the sound being produced
• pitch varies within and across voiced sounds
• there are periods of silence where background signals are seen
• the key issue is whether we can create simple time-domain processing methods that enable us to measure/estimate speech representations reliably and accurately

Fundamental Assumptions

• properties of the speech signal change relatively slowly with time (5-10 sounds per second)
  – over very short (5-20 msec) intervals => uncertainty due to the small amount of data, varying pitch, transitions between sounds, varying amplitude
  – over medium length (20-100 msec) intervals => uncertainty due to changes in sound quality and rapid transients in speech
  – over long (100-500 msec) intervals => uncertainty due to the large number of sound changes
• there is always uncertainty in short-time measurements and estimates from speech signals

Compromise Solution

• "short-time" processing methods => short segments of the speech signal are "isolated" and "processed" as if they were short segments from a "sustained" sound with fixed (non-time-varying) properties
  – this short-time processing is periodically repeated for the duration of the waveform
  – these short analysis segments, or "analysis frames," almost always overlap one another
  – the result of the short-time processing can be a single number (e.g., an estimate of the pitch period within the frame) or a set of numbers (an estimate of the formant frequencies for the analysis frame)
  – the end result of the processing is a new, time-varying sequence that serves as a new representation of the speech signal

Frame-by-Frame Processing in Successive Windows

(Figure: Frames 1-5 sliding across the waveform with 75% frame overlap.)

75% frame overlap => frame length = L, frame shift = R = L/4

Frame 1 = {x[0], x[1], ..., x[L-1]}
Frame 2 = {x[R], x[R+1], ..., x[R+L-1]}
Frame 3 = {x[2R], x[2R+1], ..., x[2R+L-1]}
...

Frame 1: samples 0, 1, ..., L-1
Frame 2: samples R, R+1, ..., R+L-1
Frame 3: samples 2R, 2R+1, ..., 2R+L-1
Frame 4: samples 3R, 3R+1, ..., 3R+L-1
...
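As a concrete illustration of this framing scheme, here is a minimal sketch (ours, not from the lecture; it assumes NumPy is available and the name frame_signal is arbitrary) that splits a signal into overlapping frames of length L with shift R:

```python
import numpy as np

def frame_signal(x, L, R):
    """Split x into overlapping frames of length L with frame shift R.

    Frame m covers samples m*R .. m*R + L - 1, matching
    Frame 1 = {x[0], ..., x[L-1]}, Frame 2 = {x[R], ..., x[R+L-1]}, ...
    """
    x = np.asarray(x)
    n_frames = 1 + max(0, len(x) - L) // R
    return np.stack([x[m * R : m * R + L] for m in range(n_frames)])

# Example: 75% overlap means R = L // 4
x = np.arange(1000)
frames = frame_signal(x, L=400, R=100)
print(frames.shape)     # (7, 400)
print(frames[1][0])     # 100 -> Frame 2 starts at sample R
```

With R = L/4 the frames overlap by 75%, as above; R = L/2 gives the 50% overlap case on the next slide.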

Frame-by-Frame Processing in Successive Windows

(Figure: Frames 1-4 with 50% frame overlap.)

• Speech is processed frame-by-frame in overlapping intervals until the entire region of speech is covered by at least one such frame
• The results of the analysis of individual frames are used to derive model parameters in some manner
• The representation goes from the time samples x[n], n = 0, 1, 2, ..., L, to the parameter vectors f[m], m = 0, 1, 2, ..., where n is the time index and m is the frame index

Frames and Windows

F_S = 16,000 samples/second
L = 641 samples (equivalent to a 40 msec frame (window) length)
R = 240 samples (equivalent to a 15 msec frame (window) shift)
Frame rate of 66.7 frames/second

Short-Time Processing

(Figure: speech waveform x[n] → short-time processing → speech representation f[m].)

x[n] = samples at an 8000/sec rate (e.g., 2 seconds of 4 kHz bandlimited speech, 0 ≤ n ≤ 16000)
f[m] = {f_1[m], f_2[m], ..., f_L[m]} = vectors at a 100/sec rate, 1 ≤ m ≤ 200, where L is the size of the analysis vector (e.g., 1 for a pitch period estimate, 12 for autocorrelation estimates, etc.)

Generic Short-Time Processing

Q_{\hat n} = \left. \left( \sum_{m=-\infty}^{\infty} T(x[m])\, \tilde w[n-m] \right) \right|_{n=\hat n}

(Figure: x[n] → T(·), a linear or non-linear transformation → T(x[n]) → \tilde w[n], a window sequence (usually finite length) → Q_n̂.)

• Q_n̂ is a sequence of local weighted average values of the sequence T(x[n]) at time n = n̂
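The generic operation above can be sketched directly as a convolution of T(x[n]) with the window (a minimal illustration of ours, assuming NumPy; T and the window are supplied by the caller):

```python
import numpy as np

def short_time(x, T, window):
    """Generic short-time processing: Q[n] = sum_m T(x[m]) * w[n - m].

    Returns Q at every sample n; in practice Q is then decimated
    to the frame rate.
    """
    return np.convolve(T(np.asarray(x, dtype=float)), window, mode="same")

# Example: short-time energy, T(x) = x^2, 201-point rectangular window
x = np.random.randn(8000)
E = short_time(x, T=lambda v: v ** 2, window=np.ones(201))
```

Passing T(x) = x^2 with a rectangular window gives the short-time energy of the next slides; T(x) = |x| gives the short-time magnitude.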

Short-Time Energy

E = \sum_{m=-\infty}^{\infty} x^2[m]

– this is the long-term definition of signal energy
– there is little or no utility of this definition for time-varying signals

E_{\hat n} = \sum_{m=\hat n - L + 1}^{\hat n} x^2[m] = x^2[\hat n - L + 1] + \dots + x^2[\hat n]

– the short-time energy in the vicinity of time n̂

T(x) = x^2, \qquad \tilde w[n] = 1,\ 0 \le n \le L-1;\ = 0 \ \text{otherwise}

Computation of Short-Time Energy

(Figure: the window \tilde w[n-m] covers samples n-L+1 through n.)

• the window jumps/slides across the sequence of squared values, selecting the interval for processing
• what happens to E_n̂ as the sequence jumps by 2, 4, 8, ..., L samples? (E_n̂ is a lowpass function, so it can be decimated without loss of information; why is E_n̂ lowpass?)
• the effects of decimation depend on L: if L is small, then E_n̂ is a lot more variable than if L is large (the window bandwidth changes with L!)

Effects of Window

Q_{\hat n} = \left. T(x[n]) * \tilde w[n] \right|_{n=\hat n} = \left. x'[n] * \tilde w[n] \right|_{n=\hat n}

• \tilde w[n] serves as a lowpass filter on T(x[n]), which often has a lot of high frequencies (most non-linearities introduce significant high-frequency energy; think of what x[n] \cdot x[n] does in frequency)
• often we extend the definition of Q_n̂ to include a pre-filtering term, so that x[n] itself is first filtered to a region of interest:

(Figure: x[n] → linear filter → x̂[n] → T(·) → T(x̂[n]) → \tilde w[n] → Q_n̂ = Q_n at n = n̂.)

Short-Time Energy

• serves to differentiate voiced and unvoiced sounds in speech from silence (background signal)
• the natural definition of the energy of the weighted signal is:

E_{\hat n} = \sum_{m=-\infty}^{\infty} \left( x[m]\, \tilde w[\hat n - m] \right)^2

(the sum of squares of a portion of the signal)
– concentrates the measurement at sample n̂, using the weighting \tilde w[\hat n - m]

E_{\hat n} = \sum_{m=-\infty}^{\infty} x^2[m]\, \tilde w^2[\hat n - m] = \sum_{m=-\infty}^{\infty} x^2[m]\, h[\hat n - m], \qquad h[n] = \tilde w^2[n]

(Figure: x[n] at rate F_S → (·)^2 → x^2[n] → h[n] → E_{\hat n} = E_n at n = n̂, decimated to rate F_S / R.)

Short-Time Energy Properties

• depends on the choice of h[n], or equivalently, the window \tilde w[n]
  – if the window duration is very long and of constant amplitude (\tilde w[n] = 1, n = 0, 1, ..., L-1), E_n would not change much over time and would not reflect the short-time amplitudes of the sounds of the speech
  – very long duration windows correspond to narrowband lowpass filters
  – we want E_n to change at a rate comparable to the changing sounds of the speech => this is the essential conflict in all speech processing: we need a short-duration window to be responsive to rapid sound changes, but short windows will not provide sufficient averaging to give a smooth and reliable energy function

Windows

• consider two windows \tilde w[n]:
  – rectangular window:
    h[n] = 1, 0 ≤ n ≤ L-1, and 0 otherwise
  – Hamming window (raised cosine window):
    h[n] = 0.54 - 0.46 cos(2πn/(L-1)), 0 ≤ n ≤ L-1, and 0 otherwise
  – the rectangular window gives equal weight to all L samples in the window (n, ..., n-L+1)
  – the Hamming window gives most weight to the middle samples and tapers off strongly at the beginning and the end of the window

Rectangular and Hamming Windows

(Figure: rectangular and Hamming windows, L = 21 samples.)

Window Frequency Responses

• rectangular window:

H(e^{j\Omega T}) = \frac{\sin(\Omega L T / 2)}{\sin(\Omega T / 2)}\, e^{-j\Omega T (L-1)/2}

• the first zero occurs at f = F_S/L = 1/(LT) (or \Omega = 2\pi/(LT)) => the nominal cutoff frequency of the equivalent "lowpass" filter
• Hamming window:

\tilde w_H[n] = 0.54\, \tilde w_R[n] - 0.46 \cos\big(2\pi n/(L-1)\big)\, \tilde w_R[n]

• can decompose the Hamming window frequency response into a combination of three terms
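These properties are easy to check numerically. The sketch below (ours, assuming NumPy) estimates the first spectral null and the peak sidelobe level of a rectangular and a Hamming window of the same length; the numbers come out roughly consistent with the 14 dB and 40 dB attenuation figures quoted on the next slide.

```python
import numpy as np

L = 101
n = np.arange(L)
rect = np.ones(L)
hamm = 0.54 - 0.46 * np.cos(2 * np.pi * n / (L - 1))

NFFT = 16384
freqs = np.fft.rfftfreq(NFFT)                     # normalized frequency f / Fs
for name, w in [("rectangular", rect), ("Hamming", hamm)]:
    mag = np.abs(np.fft.rfft(w, NFFT))
    null_idx = int(np.argmax(np.diff(mag) > 0))   # first point where |H| starts rising
    first_null = freqs[null_idx]                  # ~ nominal lowpass cutoff
    sidelobe_db = 20 * np.log10(mag[null_idx:].max() / mag[0])
    print(f"{name:12s} first null ~ {first_null:.4f}*Fs (1/L = {1/L:.4f}), "
          f"peak sidelobe {sidelobe_db:.1f} dB")
```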

RW and HW Frequency Responses

• log magnitude responses of the RW and HW
• the bandwidth of the HW is approximately twice the bandwidth of the RW
• attenuation of more than 40 dB for the HW outside the passband, versus 14 dB for the RW
• the stopband attenuation is essentially independent of L, the window duration => increasing L simply decreases the window bandwidth
• L needs to be larger than a pitch period (or severe fluctuations will occur in E_n), but smaller than a sound duration (or E_n will not adequately reflect the changes in the speech signal)
• there is no perfect value of L, since a pitch period can be as short as 20 samples (500 Hz pitch at a 10 kHz sampling rate) for a high-pitched child or female, and up to 250 samples (40 Hz pitch at a 10 kHz sampling rate) for a low-pitched male; a compromise value of L on the order of 100-200 samples for a 10 kHz sampling rate is often used in practice

Window Frequency Responses

(Figure: frequency responses of rectangular windows with L = 21, 41, 61, 81, 101 and Hamming windows with L = 21, 41, 61, 81, 101.)

Short-Time Energy

Short-time energy computation:

E_{\hat n} = \sum_{m=-\infty}^{\infty} \left( x[m]\, \tilde w[\hat n - m] \right)^2 = \sum_{m=-\infty}^{\infty} x^2[m]\, \tilde w^2[\hat n - m]

For an L-point rectangular window, \tilde w[m] = 1, m = 0, 1, ..., L-1, giving

E_{\hat n} = \sum_{m=\hat n - L + 1}^{\hat n} x^2[m]
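A small sketch of this computation (ours, assuming NumPy), evaluating E_n̂ only at the frame times n̂ = 0, R, 2R, ... rather than at every sample; the window, L, and R values in the example are illustrative:

```python
import numpy as np

def short_time_energy(x, window, R):
    """E_n̂ = sum_m (x[m] w[n̂ - m])^2, computed once per frame shift R."""
    x = np.asarray(x, dtype=float)
    L = len(window)
    n_frames = 1 + (len(x) - L) // R
    E = np.empty(n_frames)
    for m in range(n_frames):
        seg = x[m * R : m * R + L]      # window position for frame m
        E[m] = np.sum((seg * window) ** 2)
    return E

# 40 msec Hamming window, 10 msec shift at Fs = 10 kHz
Fs, L, R = 10000, 400, 100
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(L) / (L - 1))
x = np.random.randn(2 * Fs)             # stand-in for a speech signal
E = short_time_energy(x, hamming, R)    # one value per 10 msec frame
```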

Short-Time Energy using RW/HW

(Figure: E_n̂ computed with rectangular and Hamming windows for L = 51, 101, 201, 401.)

• as L increases, the plots tend to converge (however, you are smoothing sound energies)
• short-time energy provides the basis for distinguishing voiced from unvoiced speech regions and, for medium-to-high SNR recordings, can even be used to find regions of silence/background signal

Short-Time Energy for AGC

Can use an IIR filter to define short-time energy, e.g.,

• time-dependent energy definition:

\sigma^2[n] = \frac{ \sum_{m=-\infty}^{n} x^2[m]\, h[n-m] }{ \sum_{m=0}^{\infty} h[m] }

• consider the impulse response of a filter of the form:

h[n] = \alpha^{n-1} u[n-1] = \alpha^{n-1},\ n \ge 1; \qquad = 0,\ n < 1

• giving

\sigma^2[n] = \sum_{m=-\infty}^{\infty} (1-\alpha)\, x^2[m]\, \alpha^{n-m-1}\, u[n-m-1]

Recursive Short-Time Energy

u[n-m-1] implies the condition n-m-1 \ge 0, or m \le n-1, giving

\sigma^2[n] = \sum_{m=-\infty}^{n-1} (1-\alpha)\, x^2[m]\, \alpha^{n-m-1} = (1-\alpha)\left( x^2[n-1] + \alpha\, x^2[n-2] + \dots \right)

For the index n-1 we have

\sigma^2[n-1] = \sum_{m=-\infty}^{n-2} (1-\alpha)\, x^2[m]\, \alpha^{n-m-2} = (1-\alpha)\left( x^2[n-2] + \alpha\, x^2[n-3] + \dots \right)

thus giving the relationship

\sigma^2[n] = \alpha\, \sigma^2[n-1] + (1-\alpha)\, x^2[n-1]

which defines an Automatic Gain Control (AGC) of the form

G[n] = \frac{G_0}{\sigma[n]}
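A minimal sketch of this recursion and the resulting AGC (ours, assuming NumPy; the values of alpha, G0, and the small floor sigma2_init are illustrative assumptions):

```python
import numpy as np

def recursive_energy_agc(x, alpha=0.99, G0=1.0, sigma2_init=1e-6):
    """sigma^2[n] = alpha*sigma^2[n-1] + (1-alpha)*x^2[n-1]; G[n] = G0/sigma[n]."""
    x = np.asarray(x, dtype=float)
    sigma2 = np.empty(len(x))
    prev = sigma2_init                    # small floor to avoid divide-by-zero
    for n in range(len(x)):
        sigma2[n] = prev
        prev = alpha * prev + (1.0 - alpha) * x[n] ** 2   # becomes sigma^2[n+1]
    gain = G0 / np.sqrt(np.maximum(sigma2, sigma2_init))
    return sigma2, gain * x               # gain-equalized signal

# Example: a tone whose amplitude jumps by 10x is roughly leveled by the AGC
t = np.arange(20000) / 10000.0
x = np.sin(2 * np.pi * 100 * t) * np.where(t < 1.0, 0.1, 1.0)
sigma2, y = recursive_energy_agc(x, alpha=0.999)
```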

Recursive Short-Time Energy

\sigma^2[n] = x^2[n] * h[n], \qquad h[n] = (1-\alpha)\, \alpha^{n-1} u[n-1]

\sigma^2(z) = X^2(z)\, H(z)

H(z) = \sum_{n=-\infty}^{\infty} h[n]\, z^{-n} = \sum_{n=-\infty}^{\infty} (1-\alpha)\, \alpha^{n-1} u[n-1]\, z^{-n} = \sum_{n=1}^{\infty} (1-\alpha)\, \alpha^{n-1} z^{-n}

Substituting m = n-1:

H(z) = \sum_{m=0}^{\infty} (1-\alpha)\, \alpha^{m} z^{-(m+1)} = (1-\alpha)\, z^{-1} \sum_{m=0}^{\infty} \alpha^{m} z^{-m} = \frac{(1-\alpha)\, z^{-1}}{1 - \alpha z^{-1}} = \sigma^2(z) / X^2(z)

giving the recursion

\sigma^2[n] = \alpha\, \sigma^2[n-1] + (1-\alpha)\, x^2[n-1]

Recursive Short-Time Energy

(Figure: x[n] → (·)^2 → x^2[n] → z^{-1} → gain (1-\alpha) → adder → \sigma^2[n], with the feedback path \sigma^2[n-1] through z^{-1} and gain \alpha.)

\sigma^2[n] = \alpha\, \sigma^2[n-1] + (1-\alpha)\, x^2[n-1]

Recursive Short-Time Energy

(Figure.)

Use of Short-Time Energy for AGC

(Figure.)

Use of Short-Time Energy for AGC

(Figure: AGC behavior for α = 0.9 and α = 0.99.)

Short-Time Magnitude

• short-time energy is very sensitive to large signal levels due to the x^2[n] terms
  – consider a new definition of 'pseudo-energy' based on the average signal magnitude (rather than energy):

M_{\hat n} = \sum_{m=-\infty}^{\infty} |x[m]|\, \tilde w[\hat n - m]

  – a weighted sum of magnitudes, rather than a weighted sum of squares

(Figure: x[n] at rate F_S → |·| → |x[n]| → \tilde w[n] → M_n̂ = M_n at n = n̂, decimated to rate F_S / R.)

• the computation avoids multiplication of the signal with itself (the squared term)
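A sketch of the short-time magnitude (ours, assuming NumPy), using the same framing as the energy computation above:

```python
import numpy as np

def short_time_magnitude(x, window, R):
    """M_n̂ = sum_m |x[m]| w[n̂ - m], computed once per frame shift R."""
    x = np.abs(np.asarray(x, dtype=float))
    L = len(window)
    n_frames = 1 + (len(x) - L) // R
    return np.array([np.sum(x[m * R : m * R + L] * window)
                     for m in range(n_frames)])
```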

Short-Time Magnitudes

(Figure: M_n̂ for L = 51, 101, 201, 401.)

• differences between E_n and M_n are noticeable in unvoiced regions
• the dynamic range of M_n is approximately the square root of the dynamic range of E_n => level differences between voiced and unvoiced segments are smaller
• E_n and M_n can be sampled at a rate of 100/sec for window durations of 20 msec or so => an efficient representation of signal energy/magnitude

Short-Time Energy and Magnitude (Rectangular Window)

(Figure: E_n̂ and M_n̂ for L = 51, 101, 201, 401.)

Short-Time Energy and Magnitude (Hamming Window)

(Figure: E_n̂ and M_n̂ for L = 51, 101, 201, 401.)

Other Lowpass Windows

• can replace the RW or HW with any lowpass filter
• the window should be positive, since this guarantees that E_n and M_n will be positive
• FIR windows are computationally efficient since they can slide by R samples with no loss of information (what should R be?)
• can even use an infinite-duration window if its z-transform is a rational function, e.g.,

h[n] = a^n,\ n \ge 0,\ 0 < a < 1; \qquad = 0,\ n < 0

H(z) = \frac{1}{1 - a z^{-1}}, \qquad |z| > |a|

Other Lowpass Windows

• this simple lowpass filter can be used to implement E_n and M_n recursively as:

E_n = a\, E_{n-1} + (1-a)\, x^2[n] \quad \text{(short-time energy)}

M_n = a\, M_{n-1} + (1-a)\, |x[n]| \quad \text{(short-time magnitude)}

• need to compute E_n or M_n every sample and then down-sample to a 100/sec rate
• the recursive computation has a non-linear phase, so the delay cannot be compensated exactly
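These two recursions are just first-order IIR filters applied to x^2[n] and |x[n]|, so they can be sketched with a standard filtering routine (ours, assuming SciPy's lfilter is available; the value of a and the decimation factor R are illustrative):

```python
import numpy as np
from scipy.signal import lfilter        # assumes SciPy is installed

def recursive_energy_magnitude(x, a=0.99, R=100):
    """E_n = a*E_{n-1} + (1-a)*x^2[n], M_n = a*M_{n-1} + (1-a)*|x[n]|,
    computed every sample and then down-sampled by R."""
    x = np.asarray(x, dtype=float)
    b, den = [1.0 - a], [1.0, -a]       # y[n] = a*y[n-1] + (1-a)*u[n]
    E = lfilter(b, den, x ** 2)
    M = lfilter(b, den, np.abs(x))
    return E[::R], M[::R]               # e.g. R = Fs/100 for a 100/sec rate
```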

Short-Time Average Zero-Crossing Rate

zero crossing => successive samples have different algebraic signs

(Figure: waveform with its zero crossings marked.)

• the zero-crossing rate is a simple measure of the 'frequency content' of a signal; this is especially true for narrowband signals (e.g., sinusoids)
• a sinusoid at frequency F_0 sampled at rate F_S has F_S/F_0 samples per cycle, with two zero crossings per cycle, giving an average zero-crossing rate of

z_1 = 2 \ \text{crossings/cycle} \times (F_0/F_S) \ \text{cycles/sample} = 2 F_0 / F_S \ \text{crossings/sample}

(i.e., z_1 is proportional to F_0)

z_M = M\,(2 F_0 / F_S) \ \text{crossings per } M \text{ samples}

Sinusoid Zero-Crossing Rates

Assume the sampling rate is F_S = 10,000 Hz.

1. An F_0 = 100 Hz sinusoid has F_S/F_0 = 10,000/100 = 100 samples/cycle, or z_1 = 2/100 crossings/sample, or z_{100} = (2/100) × 100 = 2 crossings per 10 msec interval
2. An F_0 = 1000 Hz sinusoid has F_S/F_0 = 10,000/1000 = 10 samples/cycle, or z_1 = 2/10 crossings/sample, or z_{100} = (2/10) × 100 = 20 crossings per 10 msec interval
3. An F_0 = 5000 Hz sinusoid has F_S/F_0 = 10,000/5000 = 2 samples/cycle, or z_1 = 2/2 crossings/sample, or z_{100} = (2/2) × 100 = 100 crossings per 10 msec interval
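A quick numerical check of these three cases (a sketch of ours, assuming NumPy), counting sign changes over a 10 msec interval; a small phase offset keeps samples from landing exactly on zero:

```python
import numpy as np

Fs = 10000
n = np.arange(Fs)                               # 1 second of samples
for F0 in (100, 1000, 5000):
    x = np.sin(2 * np.pi * F0 * n / Fs + 0.1)   # small phase avoids exact zeros
    s = np.where(x >= 0, 1, -1)
    crossings = np.abs(np.diff(s)) // 2         # 1 at each sign change
    print(F0, crossings[:100].sum())            # per 10 msec: ~2, ~20, ~100
```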

Zero Crossings for Sinusoids

(Figure: a 100 Hz sinewave and a 100 Hz sinewave with a DC offset of 0.75, with their zero-crossing counts (ZC = 8 and ZC = 9) over a 200-sample interval.)

Zero Crossings for Noise

(Figure: random Gaussian noise with ZC count 252, and the same noise with a DC offset of 0.75 with ZC count 122, over the same interval.)

ZC Rate Definitions

Z_{\hat n} = \frac{1}{2 L_{\text{eff}}} \sum_{m=\hat n - L + 1}^{\hat n} \left| \operatorname{sgn}(x[m]) - \operatorname{sgn}(x[m-1]) \right| \tilde w[\hat n - m]

\operatorname{sgn}(x[n]) = 1,\ x[n] \ge 0; \qquad = -1,\ x[n] < 0

For a simple rectangular window:

\tilde w[n] = 1,\ 0 \le n \le L-1;\ = 0 \ \text{otherwise}; \qquad L_{\text{eff}} = L

Z_n̂ has the same form as E_n̂ or M_n̂.
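A sketch of this definition with a rectangular window, so L_eff = L (ours, assuming NumPy):

```python
import numpy as np

def zero_crossing_rate(x, L, R):
    """Z_n̂ = (1/2L) * sum over the window of |sgn(x[m]) - sgn(x[m-1])|,
    i.e. the average number of zero crossings per sample in each frame."""
    x = np.asarray(x, dtype=float)
    s = np.where(x >= 0, 1.0, -1.0)          # sgn() as defined above
    d = np.abs(np.diff(s, prepend=s[0]))     # 2 at each sign change, else 0
    n_frames = 1 + (len(x) - L) // R
    return np.array([d[m * R : m * R + L].sum() / (2.0 * L)
                     for m in range(n_frames)])

# Example: 1000 Hz tone at Fs = 10 kHz -> Z ~ 0.2 crossings/sample (z1 = 2F0/Fs)
Fs = 10000
x = np.sin(2 * np.pi * 1000 * np.arange(Fs) / Fs + 0.1)
Z = zero_crossing_rate(x, L=100, R=100)
```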

ZC Normalization

The formal definition of Z_n̂ is:

Z_{\hat n} = z_1 = \frac{1}{2L} \sum_{m=\hat n - L + 1}^{\hat n} \left| \operatorname{sgn}(x[m]) - \operatorname{sgn}(x[m-1]) \right|

interpreted as the number of zero crossings per sample. For most practical applications, we need the rate of zero crossings per fixed interval of M samples, which is

z_M = z_1 \cdot M = \text{rate of zero crossings per } M\text{-sample interval}

Thus, for an interval of τ sec, corresponding to M = τ F_S = τ / T samples:

F_S = 8,000 Hz, T = 125 μsec, τ = 10 msec, M = 80 samples
F_S = 10,000 Hz, T = 100 μsec, τ = 10 msec, M = 100 samples
F_S = 16,000 Hz, T = 62.5 μsec, τ = 10 msec, M = 160 samples

(Zero crossings per 10 msec interval as a function of sampling rate.)

ZC Normalization

For a 1000 Hz sinewave as input, using a 40 msec window length (L), with various values of the sampling rate (F_S), we get the following:

F_S      L     z_1    M     z_M
8000     320   1/4    80    20
10000    400   1/5    100   20
16000    640   1/8    160   20

Thus we see that the normalized (per-interval) zero-crossing rate z_M is independent of the sampling rate and can be used as a measure of the dominant energy in a band.

ZC and Energy Computation

(Figure: ZC and energy computed with a Hamming window of duration L = 201 samples (12.5 msec at F_S = 16 kHz) and with L = 401 samples (25 msec at F_S = 16 kHz).)

ZC Rate Distributions

(Figure: distributions of the ZC rate for unvoiced and voiced speech; frequency axis from 1 kHz to 4 kHz.)

• Unvoiced speech: the dominant energy component is at about 2.5 kHz; energy is mainly above 1.5 kHz
• Voiced speech: the dominant energy component is at about 700 Hz; energy is mainly below 1.5 kHz
• the mean ZC rate for unvoiced speech is 49 per 10 msec interval
• the mean ZC rate for voiced speech is 14 per 10 msec interval

ZC Rates for Speech

• 15 msec windows
• 100/sec sampling rate on the ZC computation

Short-Time Energy, Magnitude, and ZC

(Figure.)

Issues in ZC Rate Computation

• for the zero-crossing rate to be accurate, we need zero DC in the signal => we need to remove offsets, hum, and noise => use a bandpass filter to eliminate DC and hum
• can quantize the signal to 1 bit for the computation of the ZC rate
• can apply the concept of the ZC rate to bandpass-filtered speech to give a 'crude' spectral estimate in narrow bands of speech (this kind of gives an estimate of the strongest frequency in each narrow band of speech)

Summary of Simple Time Domain Measures

(Figure: x[n] → linear filter → T(·) → \tilde w[n] → Q_n̂.)

Q_{\hat n} = \sum_{m=-\infty}^{\infty} T(x[m])\, \tilde w[\hat n - m]

1. Energy:

E_{\hat n} = \sum_{m=\hat n - L + 1}^{\hat n} x^2[m]\, \tilde w[\hat n - m]

(E_n̂ can be downsampled at a rate commensurate with the window bandwidth)

2. Magnitude:

M_{\hat n} = \sum_{m=\hat n - L + 1}^{\hat n} |x[m]|\, \tilde w[\hat n - m]

3. Zero-crossing rate:

Z_{\hat n} = \frac{1}{2L} \sum_{m=\hat n - L + 1}^{\hat n} \left| \operatorname{sgn}(x[m]) - \operatorname{sgn}(x[m-1]) \right| \tilde w[\hat n - m]

where \operatorname{sgn}(x[m]) = 1 for x[m] \ge 0 and -1 for x[m] < 0.

Short-Time Autocorrelation

For a deterministic signal, the autocorrelation function is defined as:

\Phi[k] = \sum_{m=-\infty}^{\infty} x[m]\, x[m+k]

For a random or periodic signal, the autocorrelation function is:

\Phi[k] = \lim_{L \to \infty} \frac{1}{2L+1} \sum_{m=-L}^{L} x[m]\, x[m+k]

If x[n] = x[n+P], then \Phi[k] = \Phi[k+P] => the autocorrelation function preserves periodicity.

Properties of \Phi[k]:
1. \Phi[k] is even: \Phi[k] = \Phi[-k]
2. \Phi[k] is maximum at k = 0: |\Phi[k]| \le \Phi[0] for all k
3. \Phi[0] is the signal energy (or power, for random signals)

Periodic Signals

• for a periodic signal we have (at least in theory) \Phi[P] = \Phi[0], so the period of a periodic signal can be estimated as the location of the first maximum of \Phi[k] at a non-zero lag
  – this means that the autocorrelation function is a good candidate for speech pitch detection algorithms
  – it also means that we need a good way of measuring the short-time autocorrelation function for speech signals

Short-Time Autocorrelation

A reasonable definition for the short-time autocorrelation is:

R_{\hat n}[k] = \sum_{m=-\infty}^{\infty} x[m]\, \tilde w[\hat n - m]\; x[m+k]\, \tilde w[\hat n - k - m]

1. select a segment of speech by windowing
2. compute the deterministic autocorrelation of the windowed speech

Symmetry:

R_{\hat n}[k] = R_{\hat n}[-k] = \sum_{m=-\infty}^{\infty} x[m]\, x[m-k]\, \big[ \tilde w[\hat n - m]\, \tilde w[\hat n + k - m] \big]

Defining a filter of the form \tilde w_k[\hat n] = \tilde w[\hat n]\, \tilde w[\hat n + k], this enables us to write the short-time autocorrelation as:

R_{\hat n}[k] = \sum_{m=-\infty}^{\infty} x[m]\, x[m-k]\, \tilde w_k[\hat n - m]

The value of R_n̂[k] at time n̂ for the k-th lag is obtained by filtering the sequence x[n̂] x[n̂-k] with a filter with impulse response \tilde w_k[n̂].

Short-Time Autocorrelation

R_{\hat n}[k] = \sum_{m=-\infty}^{\infty} \big( x[m]\, \tilde w[\hat n - m] \big)\, \big( x[m+k]\, \tilde w[\hat n + k - m] \big)

(Figure: the two windows span samples n̂-L+1 through n̂ and n̂+k-L+1 through n̂+k.)

=> L points are used to compute R_n̂[0]
=> L - k points are used to compute R_n̂[k]
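A minimal sketch of the short-time autocorrelation and its use for pitch estimation (ours, assuming NumPy; the test signal, window, and the 50-500 Hz search range are illustrative assumptions, not values from the lecture):

```python
import numpy as np

def short_time_autocorr(x, n_hat, window, max_lag):
    """R_n̂[k] for k = 0..max_lag: window a segment ending at n̂, then
    compute its deterministic autocorrelation (L-k products at lag k)."""
    L = len(window)
    seg = x[n_hat - L + 1 : n_hat + 1] * window
    return np.array([np.dot(seg[: L - k], seg[k:]) for k in range(max_lag + 1)])

def pitch_from_autocorr(R, Fs, fmin=50.0, fmax=500.0):
    """Illustrative pitch estimate: largest peak of R[k] in a plausible lag range."""
    kmin, kmax = int(Fs / fmax), int(Fs / fmin)
    k = kmin + np.argmax(R[kmin : kmax + 1])
    return Fs / k

Fs = 10000
t = np.arange(Fs) / Fs
x = np.sign(np.sin(2 * np.pi * 125 * t))        # crude 125 Hz "voiced" stand-in
L = 401
w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(L) / (L - 1))
R = short_time_autocorr(x, n_hat=2000, window=w, max_lag=250)
print(pitch_from_autocorr(R, Fs))               # ~125 Hz
```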

Short-Time Autocorrelation

(Figure.)

Examples of Autocorrelations

• autocorrelation peaks occur at k = 72, 144, ... => 140 Hz pitch
• \Phi[P] < \Phi[0] since the windowed speech is not perfectly periodic
• over a 401-sample window (40 msec of signal), pitch period changes occur, so P is not perfectly defined
• the HW tapers the signal so strongly that the estimates of periodicity are much less clear, making the signal look like a non-periodic signal
• there is no strong peak for unvoiced speech

Voiced (Female), L = 401 (Magnitude)

(Figure: the pitch period T_0 = N_0 T is marked on the time axis (t = nT) and the fundamental F_0 = 1/T_0 on the normalized frequency axis (F/F_S).)

Voiced (Female), L = 401 (Log Magnitude)

(Figure: T_0 marked on the time axis (t = nT) and F_0 = 1/T_0 on the normalized frequency axis (F/F_S).)

Voiced (Male), L = 401

(Figure: the pitch period T_0 marked in time and the third harmonic 3F_0 = 3/T_0 marked in frequency.)

Unvoiced, L = 401

(Figure.)

Unvoiced, L = 401

(Figure.)

Effects of Window Size

• the choice of L, the window duration:
  – small L => the pitch period is almost constant within the window
  – large L => clear periodicity is seen within the window

(Figure: autocorrelations for L = 401, 251, 125.)

• as k increases, the number of window points decreases, reducing the accuracy and size of R_n[k] for large k => there is a taper of the form R[k] = 1 - k/L, |k| < L, shaping the autocorrelation (this is the autocorrelation of an L-point rectangular window)
• allow L to vary with the detected pitch periods (so that at least 2 full periods are included)

Modified Autocorrelation

We want to solve the problem of the differing number of samples for each different k term in R_n̂[k], so we modify the definition as follows:

\hat R_{\hat n}[k] = \sum_{m=-\infty}^{\infty} x[m]\, \tilde w_1[\hat n - m]\; x[m+k]\, \tilde w_2[\hat n - m - k]

where \tilde w_1 is a standard L-point window and \tilde w_2 is an extended window of duration L + K samples, K being the largest lag of interest.

We can rewrite the modified autocorrelation as:

\hat R_{\hat n}[k] = \sum_{m=-\infty}^{\infty} x[\hat n + m]\, \hat w_1[m]\; x[\hat n + m + k]\, \hat w_2[m + k]

where \hat w_1[m] = \tilde w_1[-m] and \hat w_2[m] = \tilde w_2[-m].

For rectangular windows we choose:

\hat w_1[m] = 1,\ 0 \le m \le L-1; \qquad \hat w_2[m] = 1,\ 0 \le m \le L-1+K

giving

\hat R_{\hat n}[k] = \sum_{m=0}^{L-1} x[\hat n + m]\, x[\hat n + m + k], \qquad 0 \le k \le K

so L samples are always used in the computation of \hat R_n̂[k] for every k.
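A sketch of the modified (cross-correlation) computation with rectangular windows (ours, assuming NumPy); note that every lag k uses exactly L products:

```python
import numpy as np

def modified_autocorr(x, n_hat, L, K):
    """R̂_n̂[k] = sum_{m=0}^{L-1} x[n̂+m] x[n̂+m+k], 0 <= k <= K.
    Requires L + K samples of x starting at index n̂."""
    x = np.asarray(x, dtype=float)
    seg = x[n_hat : n_hat + L]           # fixed L-sample reference segment
    ext = x[n_hat : n_hat + L + K]       # extended segment of L + K samples
    return np.array([np.dot(seg, ext[k : k + L]) for k in range(K + 1)])
```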

Examples of Modified Autocorrelation

(Figure: the L-point window \hat w_1 and the extended (L+K)-point window \hat w_2, spanning samples 0 to L-1 and 0 to L+K-1.)

• \hat R_n[k] is a cross-correlation, not an auto-correlation
• \hat R_n[k] \ne \hat R_n[-k]
• \hat R_n[k] will have a strong peak at k = P for periodic signals and will not fall off for large k

Examples of Modified Autocorrelation

(Figure.)

Examples of Modified AC

(Figure: modified autocorrelations for a fixed value of L = 401, and for values of L = 401, 251, 125.)

Waterfall Examples

(Figures: waterfall displays over successive analysis frames; three slides.)

Short-Time AMDF

The belief is that, for periodic signals of period P, the difference function

d[n] = x[n] - x[n-k]

will be approximately zero for k = 0, \pm P, \pm 2P, \dots. For realistic speech signals, d[n] will be small at k = P, but not zero. Based on this reasoning, the short-time Average Magnitude Difference Function (AMDF) is defined as:

\gamma_{\hat n}[k] = \sum_{m=-\infty}^{\infty} \big|\, x[\hat n + m]\, \tilde w_1[m] - x[\hat n + m - k]\, \tilde w_2[m-k] \,\big|

with \tilde w_1[m] and \tilde w_2[m] being rectangular windows. If both are the same length, then \gamma_n̂[k] is similar to the short-time autocorrelation, whereas if \tilde w_2[m] is longer than \tilde w_1[m], then \gamma_n̂[k] is similar to the modified short-time autocorrelation (or covariance) function. In fact it can be shown that

\gamma_{\hat n}[k] \approx \sqrt{2}\, \beta[k] \left[ \hat R_{\hat n}[0] - \hat R_{\hat n}[k] \right]^{1/2}

where \beta[k] varies between 0.6 and 1.0 for different segments of speech.
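A sketch of the AMDF with rectangular windows (ours, assuming NumPy); for pitch estimation one looks for the deepest valley of γ_n̂[k] rather than a peak:

```python
import numpy as np

def amdf(x, n_hat, L, K):
    """gamma_n̂[k] = sum_{m=0}^{L-1} |x[n̂+m] - x[n̂+m-k]|, 1 <= k <= K.
    Valleys (not peaks) of the AMDF indicate candidate pitch periods.
    Requires n̂ >= K and n̂ + L <= len(x)."""
    x = np.asarray(x, dtype=float)
    seg = x[n_hat : n_hat + L]
    return np.array([np.abs(seg - x[n_hat - k : n_hat - k + L]).sum()
                     for k in range(1, K + 1)])
```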

AMDF for Speech Segments

(Figure.)

Summary

• Short-time parameters in the time domain:
  – short-time energy
  – short-time average magnitude
  – short-time zero-crossing rate
  – short-time autocorrelation
  – modified short-time autocorrelation
  – short-time average magnitude difference function