• Speech analysis is used to extract features from a speech signal s(n) that are directly pertinent to different applications, while suppressing redundant aspects of the speech.
• It transforms s(n) into another signal, a set of signals, or a set of parameters, with the objective of simplification and data reduction.
• An efficient representation for speech recognition would be a set of parameters that is consistent across speakers, yielding similar values for the same phonemes uttered by different speakers, while varying reliably between different phonemes.

Methods of Speech Analysis
• The time domain (operating directly on the speech waveform)
• The frequency domain (operating after a spectral transformation of the speech)

Short-Time Speech Analysis
• Speech is dynamic and time-varying.
• Speech analysis assumes that the signal properties change relatively slowly with time.
• During speech production, the vocal tract shape and the type of excitation may not change much over short durations (up to roughly 100 ms for vowels, but only about 10 ms for stops). A short time window can therefore be treated as stationary, and the parameters extracted from it are averages over the window.
• To track the dynamics, successive analysis windows are allowed to overlap.

Windowing
• Windowing is the multiplication of the speech signal s(n) by a window w(n), which yields a set of speech samples weighted by the shape of the window.
• Windows have a finite length, typically 20-30 ms, so as to include 2-3 pitch periods.
• The window can be shifted to examine any part of s(n).
• The choice of window size trades off three requirements: the window must be (1) short enough to satisfy the stationarity assumption, (2) long enough to limit the number of frames per second, and (3) long enough to estimate the desired parameters.

Shape of Analysis Window

Time-Domain Parameters
• Advantage: simple to calculate and easy to interpret physically.
• Several speech features relevant for coding and recognition are obtained through temporal analysis, e.g., energy (or amplitude), voicing, and F0.
• Energy can be used to segment speech in automatic recognition systems, and must be replicated when synthesizing speech. Accurate voicing and F0 estimation are crucial for many speech coders.
• Temporal features such as the zero-crossing rate and autocorrelation provide inexpensive spectral detail without formal spectral techniques.

Signal Analysis in the Time Domain

Short-Time Average Energy and Magnitude
• Q(n) corresponds to the short-time energy or the short-time average magnitude when the operator T in Equation (1) is squaring or the absolute value, respectively.
• Energy emphasizes high amplitudes (since the signal is squared in calculating Q(n)), while the magnitude measure avoids such emphasis and is simpler to calculate.
• Such measures can help segment speech into smaller phonetic units, approximately corresponding to syllables or phonemes.
• The large variation in amplitude between voiced and unvoiced speech, as well as the smaller variations between phonemes with different manners of articulation, permits segmentation based on the energy Q(n) in automatic recognition systems.

Short-Time Average Zero-Crossing Rate (ZCR)

Short-Time Autocorrelation Function
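The short-time time-domain measures above (windowed framing, energy, magnitude, zero-crossing rate, autocorrelation) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the method behind these slides. Because Equation (1) is not reproduced here, the sketch assumes it has the common general form Q(n) = Σ_m T[s(m)] w(n − m), so that T = squaring gives the short-time energy and T = absolute value gives the short-time average magnitude. The Hamming window, the 8 kHz sampling rate, the 25 ms window with a 10 ms shift, the test signal, and all function names (`hamming`, `frames`, `short_time_energy`, etc.) are hypothetical illustrative choices. The ZCR counts sign changes per sample, and the autocorrelation is computed on the windowed frame.

```python
import math
from typing import List, Sequence


def hamming(N: int) -> List[float]:
    """Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), n = 0..N-1 (assumed choice)."""
    if N == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def rectangular(N: int) -> List[float]:
    """Rectangular window: every sample weighted equally."""
    return [1.0] * N


def frames(s: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Cut s(n) into frames of frame_len samples starting hop samples apart;
    frames overlap whenever hop < frame_len."""
    return [list(s[i:i + frame_len]) for i in range(0, len(s) - frame_len + 1, hop)]


# The windows above are symmetric, so applying w directly to a frame is the
# same as weighting by the time-reversed w(n - m) in the assumed Q(n) form.
def short_time_energy(frame: Sequence[float], w: Sequence[float]) -> float:
    """Q(n) with T = squaring: window-weighted sum of s(m)^2."""
    return sum(x * x * wi for x, wi in zip(frame, w))


def short_time_magnitude(frame: Sequence[float], w: Sequence[float]) -> float:
    """Q(n) with T = absolute value: window-weighted sum of |s(m)|."""
    return sum(abs(x) * wi for x, wi in zip(frame, w))


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Sign changes between adjacent samples divided by the frame length
    (rectangular window; a zero sample is treated as positive)."""
    signs = [1 if x >= 0 else -1 for x in frame]
    crossings = sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    return crossings / len(frame)


def short_time_autocorrelation(frame: Sequence[float], w: Sequence[float],
                               max_lag: int) -> List[float]:
    """R(k) = sum_m x(m) * x(m - k) for k = 0..max_lag, where x is the windowed frame."""
    x = [s * wi for s, wi in zip(frame, w)]
    return [sum(x[m] * x[m - k] for m in range(k, len(x))) for k in range(max_lag + 1)]


if __name__ == "__main__":
    fs = 8000                       # hypothetical sampling rate (Hz)
    frame_len = fs * 25 // 1000     # 25 ms window -> 200 samples
    hop = fs * 10 // 1000           # 10 ms shift -> 80 samples, so windows overlap
    # Hypothetical signal: 100 ms of silence, then a 200 Hz tone (period 40 samples).
    s = [0.0] * 800 + [math.sin(2 * math.pi * 200 * n / fs) for n in range(1600)]
    w = hamming(frame_len)
    all_frames = frames(s, frame_len, hop)
    for i, f in enumerate(all_frames):
        print(f"frame {i:2d}: E={short_time_energy(f, w):8.3f} "
              f"M={short_time_magnitude(f, w):8.3f} ZCR={zero_crossing_rate(f):.3f}")
    R = short_time_autocorrelation(all_frames[-1], w, 40)
    print(f"last frame: R(20)/R(0)={R[20] / R[0]:.2f}  R(40)/R(0)={R[40] / R[0]:.2f}")
```

Running the script prints zero energy, magnitude and ZCR for the leading silent frames and a clear jump once the tone begins, which is the kind of contrast that energy-based segmentation relies on. For the final (tonal) frame, the autocorrelation is negative at half the 40-sample period and positive at the full period, which is why autocorrelation peaks are useful for F0 estimation.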