
Speech Analysis

http://svr-www.eng.cam.ac.uk/~ajr/SpeechAnalysis/node1.html
Linear Prediction analysis
Linear prediction analysis of speech is historically one of the most important speech
analysis techniques. The basis is the source-filter model, where the filter is constrained to
be an all-pole linear filter. This amounts to performing a linear prediction of the next
sample as a weighted sum of past samples:

\[ \hat{s}(n) = \sum_{k=1}^{p} a_k s(n-k) \]

This linear filter has the transfer function:

\[ H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}} \]

A good introductory article is [8], and this subject is also covered well in [1, 2, 3].
Motivation from lossless tubes
The transfer function of a lossless tube can be described by an all-pole model. This is also
a reasonable approximation to speech formed by the excitation of the vocal tract by
glottal pulses (although the glottal pulses are not spectrally flat).
Figure 37: The lossless tube model of speech production
But:
The vocal tract is not built of cylinders
The vocal tract is not lossless
The vocal tract has a side passage (the nasal cavity)
Fricatives (e.g. /s/ and /sh/) are generated near the lips
Nevertheless, with sufficient parameters the LP model can make a reasonable
approximation to the spectral envelope for all speech sounds.
Parameter estimation
Given N samples of speech, we would like to compute estimates of the $a_k$ that result in the
best fit. One reasonable way to define ``best fit'' is in terms of mean squared error. These
can also be regarded as the ``most probable'' parameters if it is assumed that the distribution of
errors is Gaussian and that a priori there were no restrictions on the values of $a_k$.
The error at any time, $e(n)$, is:

\[ e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k) \]

Hence the summed squared error, E, over a finite window of length N is:

\[ E = \sum_{n=0}^{N-1} e^2(n) = \sum_{n=0}^{N-1} \Big( s(n) - \sum_{k=1}^{p} a_k s(n-k) \Big)^2 \qquad (67) \]

The minimum of E occurs when the derivative is zero with respect to each of the
parameters, $a_k$. As can be seen from equation 67, the value of E is quadratic in each of the
$a_k$, therefore there is a single solution. Very large positive or negative values of $a_k$ must
lead to poor prediction, and hence the solution must be a minimum.
Figure 38: Schematic showing the single minimum of a quadratic
Hence differentiating equation 67 with respect to $a_i$ and setting it equal to zero gives the set
of p equations:

\[ \frac{\partial E}{\partial a_i} = -2 \sum_{n=0}^{N-1} s(n-i) \Big( s(n) - \sum_{k=1}^{p} a_k s(n-k) \Big) = 0 \qquad (69) \]

Rearranging equation 69 gives:

\[ \sum_{n=0}^{N-1} s(n-i) s(n) = \sum_{k=1}^{p} a_k \sum_{n=0}^{N-1} s(n-i) s(n-k), \quad i = 1, \ldots, p \qquad (70) \]

Define the covariance matrix $\boldsymbol{\Phi}$ with elements $\phi_{ik}$:

\[ \phi_{ik} = \sum_{n=0}^{N-1} s(n-i) s(n-k) \]

Now we can write equation 70 as:

\[ \psi_i = \sum_{k=1}^{p} a_k \phi_{ik}, \quad i = 1, \ldots, p, \qquad \text{where } \psi_i = \phi_{i0} \]

or in matrix form:

\[ \begin{pmatrix} \phi_{11} & \cdots & \phi_{1p} \\ \vdots & \ddots & \vdots \\ \phi_{p1} & \cdots & \phi_{pp} \end{pmatrix} \begin{pmatrix} a_1 \\ \vdots \\ a_p \end{pmatrix} = \begin{pmatrix} \psi_1 \\ \vdots \\ \psi_p \end{pmatrix} \]

or simply:

\[ \boldsymbol{\Phi} \mathbf{a} = \boldsymbol{\psi} \]

Hence the covariance method solution is obtained by matrix inversion:

\[ \mathbf{a} = \boldsymbol{\Phi}^{-1} \boldsymbol{\psi} \]

Note that $\boldsymbol{\Phi}$ is symmetric, i.e. $\phi_{ik} = \phi_{ki}$, and this symmetry can be exploited in
inverting $\boldsymbol{\Phi}$ (see [9]).
These equations reference the samples $s(-p), \ldots, s(N-1)$.
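As a concrete sketch of the covariance-method solution just derived, the following Python fragment builds the matrix of $\phi_{ik}$ terms and solves $\boldsymbol{\Phi}\mathbf{a} = \boldsymbol{\psi}$ directly (the function name and the NumPy dependency are choices of this example, not part of the notes):

```python
import numpy as np

def lp_covariance(s, p, start, N):
    """Covariance method: minimise the squared prediction error over
    n = start .. start+N-1, using the true samples before the window."""
    # phi[i, k] = sum_n s(n-i) s(n-k), for i, k = 0 .. p
    phi = np.empty((p + 1, p + 1))
    for i in range(p + 1):
        for k in range(p + 1):
            phi[i, k] = sum(s[n - i] * s[n - k] for n in range(start, start + N))
    # solve Phi a = psi, where psi_i = phi[i, 0]
    return np.linalg.solve(phi[1:, 1:], phi[1:, 0])
```

With start >= p the sums only ever index valid samples, and for a noiseless autoregressive signal the recovered coefficients are exact.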
The autocorrelation method
When dealing with windowed speech we need to take into account the boundary effects
in order to avoid large prediction errors at the edges. We can refine the range over which we
perform the least squares minimisation in equation 67 and make use of the fact that samples
are zero outside of the window to rewrite $\phi_{ik}$ as:

\[ \phi_{ik} = \sum_{n=0}^{N-1-(i-k)} s(n) s(n+i-k) \]

Now $\phi_{ik}$ is only dependent on the difference, $i-k$, and may be written in terms of the
autocorrelation function, $r(\cdot)$:

\[ \phi_{ik} = r(|i-k|), \qquad r(l) = \sum_{n=0}^{N-1-l} s(n) s(n+l) \]

Now $\boldsymbol{\Phi}$ is Toeplitz:

\[ \boldsymbol{\Phi} = \begin{pmatrix} r(0) & r(1) & \cdots & r(p-1) \\ r(1) & r(0) & \cdots & r(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r(p-1) & r(p-2) & \cdots & r(0) \end{pmatrix} \]

Efficient methods exist to invert such matrices, one of which is Durbin's algorithm.
Denoting the values of the LP parameters at iteration i by $a_j^{(i)}$ and the residual energy by
$E^{(i)}$, with $E^{(0)} = r(0)$, for i = 1, 2, ..., p:

\[ k_i = \frac{r(i) - \sum_{j=1}^{i-1} a_j^{(i-1)} r(i-j)}{E^{(i-1)}} \]
\[ a_i^{(i)} = k_i \]
\[ a_j^{(i)} = a_j^{(i-1)} - k_i a_{i-j}^{(i-1)}, \quad j = 1, \ldots, i-1 \]
\[ E^{(i)} = (1 - k_i^2) E^{(i-1)} \]
For example, for the signal of figure 21:
Therefore on the first iteration:
And on the second iteration:
The $k_i$ parameters are known as the reflection coefficients. Note that:
All intermediate solutions $a^{(1)}, \ldots, a^{(p)}$ are calculated
This method also provides the reflection coefficients
The resulting filter is guaranteed to be stable
The value of the squared prediction residual, $E^{(i)}$, is also computed and is
guaranteed to decrease (or remain constant) on each iteration
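Durbin's recursion can be written out compactly; the sketch below follows the update equations above (illustrative NumPy code, not from the original notes):

```python
import numpy as np

def durbin(r, p):
    """Durbin's recursion for the Toeplitz autocorrelation normal equations.
    r: autocorrelations r(0)..r(p). Returns predictor coefficients a_1..a_p,
    reflection coefficients k_1..k_p and the residual energy E."""
    a = np.zeros(p + 1)
    k = np.zeros(p + 1)
    E = r[0]
    for i in range(1, p + 1):
        k[i] = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / E
        prev = a.copy()
        a[i] = k[i]
        for j in range(1, i):
            a[j] = prev[j] - k[i] * prev[i - j]
        E *= 1.0 - k[i] ** 2
    return a[1:], k[1:], E
```

Because each step multiplies the residual energy by $(1 - k_i^2)$, the energy is non-increasing, and $|k_i| < 1$ whenever the autocorrelation matrix is positive definite.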
The covariance method
The covariance method uses the real values of the samples outside the analysis window
(rather than assuming them to be zero).
Note that:
This method can be used on much smaller sample sequences (as end
discontinuities are less of a problem)
There is no guarantee of stability (but you can check for instability)
Commonly used in ``open'' and ``closed'' phase analysis
This is just a special case of the general least squares problem.
Pre-emphasis
The LP filter so far presented attempts to fit an all-pole model using the least-
mean-squares distance metric.
The lower formants contain more energy and therefore are preferentially modelled
with respect to the higher formants.
A pre-emphasis filter,

\[ H_{pre}(z) = 1 - a z^{-1} \]

is often used to boost the higher frequencies. Typically a fixed value such as $a = 0.95$, or
the optimal pre-emphasis $a = r(1)/r(0)$, is used.
If reconstructing the speech, the inverse filter should be used:

\[ H_{de}(z) = \frac{1}{1 - a z^{-1}} \]
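A minimal sketch of pre-emphasis and its inverse in pure Python (the default coefficient 0.95 is an assumption of this example):

```python
def pre_emphasis(s, a=0.95):
    """y(n) = s(n) - a s(n-1): a one-tap FIR filter boosting high frequencies."""
    return [s[0]] + [s[n] - a * s[n - 1] for n in range(1, len(s))]

def de_emphasis(y, a=0.95):
    """Inverse (IIR) filter for reconstruction: s(n) = y(n) + a s(n-1)."""
    s = []
    for n, yn in enumerate(y):
        s.append(yn + (a * s[n - 1] if n > 0 else 0.0))
    return s
```

Applying `de_emphasis` after `pre_emphasis` with the same coefficient reconstructs the original samples exactly.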
The LP spectrum
The transfer function 1/H(z) is an FIR whitening filter for the speech. The frequency
response for this can be computed as the FT of the filter coefficients, then inverted to
give the frequency response of H(z). Figure 39 shows the LP spectrum for the example
segment of speech.
Figure 39: The LP spectrum
Note:
The formants are very peaked
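The computation described above (FT of the whitening-filter coefficients, then inverted) can be sketched in Python; the helper below is hypothetical and assumes the predictor coefficients $a_1 \ldots a_p$ and a gain G are already available:

```python
import numpy as np

def lp_spectrum(a, gain, n_fft=512):
    """|H| at n_fft//2 + 1 frequencies: FFT of the whitening filter
    A(z) = 1 - sum_k a_k z^{-k}, inverted and scaled by the gain."""
    A = np.fft.rfft(np.concatenate(([1.0], -np.asarray(a, float))), n_fft)
    return gain / np.abs(A)
```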
Gain computation
Figure 39 gives the amplitude response of an LP filter for any single frequency. The
source-filter excitations are spectrally flat, hence the overall gain can be computed.
Durbin's algorithm gives the gain as $G^2 = E^{(p)}$.
This may also be calculated as:

\[ G^2 = r(0) - \sum_{k=1}^{p} a_k r(k) \]

Power in an impulse train separated by N samples is 1/N
Power in a Gaussian distributed variable with zero mean and unit variance is 1
Power in a random number uniformly distributed on [-1,1] is:

\[ \int_{-1}^{1} \tfrac{1}{2} x^2 \, dx = \tfrac{1}{3} \]
The lattice filter implementation
Direct implementation of the IIR filter can lead to instabilities if the $a_k$ are quantised. The filter
is stable provided $|k_i| < 1$ for all i - hence the $k_i$ can be quantised and the result is
guaranteed to be stable.
We can either convert back from the $k_i$ to the $a_k$, or implement the IIR filter as a lattice and use the $k_i$
values directly - useful if working on a limited precision DSP chip (e.g. in a GSM phone).
Figure 40: The lattice filter
This is analogous with the lossless tube model:
Each filter section is one section of the tube
The forward wave is partially reflected backwards
The backward wave is partially reflected forwards
Hence the terminology of ``reflection coefficients''.
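A direct simulation of the synthesis lattice, using the same coefficient conventions as Durbin's recursion above (sign conventions for lattice diagrams vary between texts; this is one common form, written as an illustrative sketch):

```python
def lattice_synthesis(x, k):
    """All-pole lattice filter driven by reflection coefficients
    k[0] = k_1 .. k[p-1] = k_p; equivalent to filtering by 1/A(z)."""
    p = len(k)
    b = [0.0] * p              # b[i] holds the delayed backward wave b_i(n-1)
    y = []
    for sample in x:
        f = sample             # forward wave entering the highest-order section
        for i in range(p, 0, -1):
            f = f + k[i - 1] * b[i - 1]        # f_{i-1} = f_i + k_i b_{i-1}(n-1)
            if i < p:
                b[i] = b[i - 1] - k[i - 1] * f  # b_i(n) = b_{i-1}(n-1) - k_i f_{i-1}
        b[0] = f               # b_0(n) = f_0(n)
        y.append(f)
    return y
```

For p = 2 this is algebraically identical to the direct form with $a_1 = k_1(1 - k_2)$ and $a_2 = k_2$, i.e. the step-up recursion run in reverse.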
The Itakura distance measure
Consider the case where a speech signal, $s(n)$, is passed through the linear predictor, $\mathbf{a}$,
corresponding to a template. The residual mean squared error, E, is:

\[ E = \sum_{n} \Big( s(n) - \sum_{k=1}^{p} a_k s(n-k) \Big)^2 = \mathbf{a}'^{T} \boldsymbol{\Phi}\, \mathbf{a}' \]

where $\mathbf{a}' = (1, -a_1, \ldots, -a_p)^T$ is the augmented coefficient vector. So we can define a distance
as the log ratio of this residual energy to the frame's own minimum residual energy.
In the autocorrelation case, let $\mathbf{a}'_{ref}$ be the augmented vector of ``template'' or ``reference'' LP coefficients
and $\mathbf{a}'_{obs}$ the augmented vector of ``observed'' or ``unknown''
LP coefficients; then:

\[ d_I = \log \frac{\mathbf{a}'^{T}_{ref}\, \mathbf{R}\, \mathbf{a}'_{ref}}{\mathbf{a}'^{T}_{obs}\, \mathbf{R}\, \mathbf{a}'_{obs}} \]

where $\mathbf{R}$ is the Toeplitz autocorrelation matrix of the observed frame.
In the autocorrelation case this can be computed with similar complexity to the Euclidean
distance.
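The distance is a ratio of two quadratic forms, which the following sketch evaluates directly (illustrative NumPy code; `a_ref` and `a_obs` are the augmented vectors $(1, -a_1, \ldots, -a_p)$):

```python
import numpy as np

def itakura_distance(a_ref, a_obs, R):
    """Itakura distance between augmented LP coefficient vectors, given
    the autocorrelation matrix R of the observed frame."""
    num = a_ref @ R @ a_ref   # residual energy of the reference predictor
    den = a_obs @ R @ a_obs   # minimum residual energy of the frame itself
    return float(np.log(num / den))
```

Since `a_obs` minimises the residual energy for its own frame, the distance is zero for identical predictors and non-negative otherwise.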
The LP cepstrum
The cepstrum parameters may be computed directly from the LP parameters using the
following recursion (with $a_n = 0$ for $n > p$):

\[ c_n = a_n + \sum_{k=1}^{n-1} \frac{k}{n}\, c_k\, a_{n-k}, \quad 1 \le n \le p \]
\[ c_n = \sum_{k=n-p}^{n-1} \frac{k}{n}\, c_k\, a_{n-k}, \quad n > p \]
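The recursion translates directly into code (illustrative Python, not from the notes):

```python
def lp_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c_1..c_{n_ceps} from LP coefficients a_1..a_p
    via the standard recursion (a_n taken as 0 for n > p)."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)        # c[0] unused; 1-based indexing
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```

A quick sanity check: for a single pole $a_1 = r$, $\log H(z)$ expands to $\sum_n (r^n/n) z^{-n}$, so the recursion must return $c_n = r^n/n$.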
Log area ratios
The ratio of the areas, $A_i$, of the sections in the lossless tube may be directly computed
from the reflection coefficients:

\[ \frac{A_{i+1}}{A_i} = \frac{1 + k_i}{1 - k_i} \]

Hence, assuming a constant area at the glottis, the total cross section of the vocal tract
could be calculated:
The method is subject to all the LP assumptions
Better normalisation methods need to be employed so as to get realistic cross
sections
With further fiddling this method can be useful - for example as an aid to the deaf
Alternatively, this provides the reflection coefficients if we knew the areas.
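A sketch of the area recursion in Python. Note that the sign convention for $k_i$ varies between texts (this example assumes $k_i = (A_{i+1} - A_i)/(A_{i+1} + A_i)$), and the glottis area is an arbitrary normalisation, as the notes point out:

```python
import math

def areas_from_reflection(k, glottis_area=1.0):
    """Tube section areas from reflection coefficients, assuming the
    convention k_i = (A_{i+1} - A_i) / (A_{i+1} + A_i)."""
    areas = [glottis_area]
    for ki in k:
        areas.append(areas[-1] * (1.0 + ki) / (1.0 - ki))
    return areas

def log_area_ratios(k):
    """g_i = log(A_{i+1} / A_i) = log((1 + k_i) / (1 - k_i))."""
    return [math.log((1.0 + ki) / (1.0 - ki)) for ki in k]
```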
The roots of the predictor polynomial
The denominator of the transfer function may be factored:

\[ 1 - \sum_{k=1}^{p} a_k z^{-k} = \prod_{k=1}^{p} (1 - z_k z^{-1}) \]

where the $z_k$ are a set of complex numbers defining the roots, with angular frequency:

\[ \omega_k = \arg(z_k) \]

and amplitude:

\[ |z_k| \]

If $z_k$ is close to the unit circle then the root represents a formant.
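Factoring the predictor polynomial and picking roots near the unit circle can be sketched as follows (illustrative NumPy code; the bandwidth estimate from the pole radius is a standard approximation not stated in the notes):

```python
import numpy as np

def formant_candidates(a, fs):
    """Factor A(z) = 1 - sum_k a_k z^{-k} and convert each complex root
    to a candidate formant frequency (Hz) and an approximate bandwidth
    (Hz) derived from the pole radius."""
    roots = np.roots(np.concatenate(([1.0], -np.asarray(a, float))))
    roots = roots[np.imag(roots) > 0]          # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bws = -np.log(np.abs(roots)) * fs / np.pi
    order = np.argsort(freqs)
    return freqs[order], bws[order]
```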
Line spectral pairs
The LP polynomial $A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}$ can be decomposed into:

\[ P(z) = A(z) + z^{-(p+1)} A(z^{-1}) \]
\[ Q(z) = A(z) - z^{-(p+1)} A(z^{-1}) \]

Now:
All the roots of P(z) and Q(z) lie on the unit circle
The roots of P(z) and Q(z) are interspersed
P(z) corresponds to the vocal tract with the glottis closed and Q(z) with the glottis
open
Very useful in speech coding
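The decomposition amounts to adding and subtracting the reversed coefficient array of A(z) (illustrative NumPy sketch):

```python
import numpy as np

def lsp_polynomials(a):
    """Form P(z) = A(z) + z^-(p+1) A(1/z) and Q(z) = A(z) - z^-(p+1) A(1/z)
    from A(z) = 1 - sum_k a_k z^{-k}. Coefficient arrays are ordered from
    the z^0 term downwards."""
    A = np.concatenate(([1.0], -np.asarray(a, float)))
    P = np.concatenate((A, [0.0])) + np.concatenate(([0.0], A[::-1]))
    Q = np.concatenate((A, [0.0])) - np.concatenate(([0.0], A[::-1]))
    return P, Q
```

For a stable A(z), numerically checking the root magnitudes of P and Q confirms the unit-circle property.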
Perceptual Linear Prediction
A combination of DFT and LP techniques is perceptual linear prediction (PLP) [10].
Figure 41: Perceptual Linear Prediction
Spectral Analysis
The typical window length is 20 ms. For a 10 kHz sampling frequency, 200 speech samples
are used, padded with 56 zeros, Hamming windowed, transformed with an FFT and converted to a power
spectral density.
Critical-band spectral resolution
The power spectrum is warped onto a Bark scale using the approximation:

\[ \Omega(\omega) = 6 \ln\left( \frac{\omega}{1200\pi} + \sqrt{\left(\frac{\omega}{1200\pi}\right)^2 + 1} \right) \]

The Bark-scaled spectrum is convolved with the power spectrum of the critical-band
filter. This simulates the frequency resolution of the ear, which is approximately
constant on the Bark scale.
This convolution reduces the spectral resolution
The smoothed Bark-scale spectrum is down-sampled by resampling every 1 Bark
(0 - 5 kHz maps to 0 - 16.9 Bark).
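The Bark warping above can be written as a one-liner, since $\ln(x + \sqrt{x^2 + 1}) = \operatorname{asinh}(x)$ with $x = \omega/1200\pi = f/600$ (Python sketch):

```python
import math

def hz_to_bark(f_hz):
    """PLP Bark warping: Omega = 6 ln(x + sqrt(x^2 + 1)) = 6 asinh(f / 600)."""
    return 6.0 * math.asinh(f_hz / 600.0)
```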
Equal loudness pre-emphasis
Need to compensate for the non-equal perception of loudness at different
frequencies
Pre-emphasise by an equal-loudness curve
The inverse of this curve is plotted in figure 42 on a dB scale.
Figure 42: The equal-loudness curve at 40 dB
Intensity-loudness power law
Perceived loudness, $L(\omega)$, is approximately the cube root of the intensity, $I(\omega)$:

\[ L(\omega) = I(\omega)^{1/3} \]

Not true for very loud or very quiet sounds
A reasonable approximation for speech
Autoregressive modelling
Apply the IDFT to get the equivalent of the autocorrelation function
Fit an LP model
Optionally transform into cepstral coefficients
Discussion
Computes a simple auditory spectrum
Computational requirements similar to LP or FFT analysis
Better spectral modelling than LP
Mel-scaled cepstral coefficients (MFCCs) are historically more popular
Provides a more robust representation than MFCCs.
References
1. L. R. Rabiner and R. W. Schafer. Digital Processing of Speech Signals. Prentice Hall, Englewood Cliffs, New Jersey, 1978.
2. Frank Fallside and William A. Woods. Computer Speech Processing. Prentice-Hall, Englewood Cliffs, NJ, 1985.
3. John R. Deller, John G. Proakis, and John H. L. Hansen. Discrete-Time Processing of Speech Signals. Maxwell Macmillan International, 1993.
4. Emmanuel C. Ifeachor and Barrie W. Jervis. Digital Signal Processing: A Practical Approach. Addison-Wesley, 1993.
5. B. C. J. Moore. An Introduction to the Psychology of Hearing. Academic Press, London, second edition, 1982.
6. Steven F. Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-27(2):113-120, April 1979.
7. M. Berouti, R. Schwartz, and J. Makhoul. Enhancement of speech corrupted by acoustic noise. In Proc. ICASSP, pages 208-211, 1979.
8. J. Makhoul. Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4):561-580, April 1975.
9. William H. Press, Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge, England, 1988.
10. H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87:1738-1752, 1990.
11. W. Hess. Pitch Determination of Speech Signals. Springer-Verlag, 1983.
