Time-Frequency Analysis and Wavelet Transform Tutorial: Time-Frequency Analysis for Voiceprint (Speaker) Recognition

Wen-Wen Chang (張雯雯)
E-mail: r01942035@ntu.edu.tw
Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan, ROC

Contents

Abstract
1. Speaker Recognition
   1.1 Introduction of speaker recognition
   1.2 General structure of speaker recognition system
2. Human Speech Production
   2.1 Vocal folds
   2.2 Vocal tract
3. Feature Extraction
   3.1 Short-term spectral features
       3.1.1 Mel-Frequency Cepstral Coefficients (MFCC)
       3.1.2 Linear Predictive Cepstral Coefficients (LPCC)
       3.1.3 Mel-Frequency Discrete Wavelet Coefficients (MFDWC)
   3.2 Voice source features
       3.2.1 Wavelet Octave Coefficients of Residues (WOCOR)
   3.3 Spectro-temporal features
   3.4 Prosodic features
   3.5 High-level features
4. Speaker Modeling
Conclusion
References

Abstract

Speaker recognition aims to identify a speaker by his/her speech samples. By extracting speaker-specific features from the speech samples and using these features to model the speaker, the recognition task can be done. Because the frequency content of the speech signal varies with time, time-frequency analysis and the wavelet transform are important for feature extraction in speaker recognition. In this tutorial, the structure of speaker recognition systems and several feature extraction techniques are introduced.

1. Speaker Recognition

1.1 Introduction of speaker recognition

The goal of automatic speaker recognition is to identify the speaker by extraction, characterization and recognition of the speaker-specific information contained in the speech signal. Of all the biometrics for the recognition of individuals (DNA, fingerprint, face recognition), voice is a compelling biometric, because speech is a natural signal to produce and it does not require a specialized input device. Also, the telephone system provides a ubiquitous, familiar network of sensors for obtaining and delivering the speech signal.

Generally, speaker recognition involves two major tasks: speaker identification and speaker verification. The speaker identification task is to determine who the speaker is among a group of known speakers: the voice sample of the test speaker is compared to all the known speaker models to find the model with the closest match. The speaker verification (authentication) task is to determine whether the speaker is the person he or she claims to be: the voice sample of the test speaker is compared to the target speaker model, and if the likelihood is above the threshold, the test speaker is accepted. Figure 1 illustrates these two tasks.

Speaker recognition methods can also be divided into text-dependent and text-independent methods. In a text-dependent system, the recognition system has prior knowledge of the text to be spoken (a user-specific pass-phrase or a system-prompted phrase) and expects the user to be cooperative. The performance of such a system is often better than that of a text-independent system because of the prior knowledge of the text. In a text-independent system, the system does not know the text to be spoken by the user. Text-independent recognition is more difficult but also more flexible; for example, the speaker recognition task can be done while the test speaker is conducting other speech interactions (background verification).

Figure 1

1.2 General structure of speaker recognition system

The basic structures of the speaker identification system and the speaker verification system are shown in Figure 2. The front-end processing module generally includes silence detection, pre-emphasis, and feature extraction. Silence detection is performed to remove non-speech portions from the speech signal. Pre-emphasis is needed because the high-frequency components of the speech signal have small amplitude with respect to the low-frequency components; a high-pass filter is utilized to emphasize the high-frequency components. The feature extraction process transforms the raw signal into feature vectors in which speaker-specific properties are emphasized and statistical redundancies are suppressed. In the enrollment mode, a speaker model is trained using the feature vectors of the target speaker; the speaker modeling is based on the feature vectors extracted in the front-end processing. In the recognition mode, the feature vectors extracted from the test speaker's speech are compared against the models in the system database and a score is computed for decision making.

Figure 2
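As a concrete illustration, the pre-emphasis step is typically implemented as a first-order high-pass filter. The following is a minimal Python/NumPy sketch; the coefficient 0.97 is a typical, assumed value, not one fixed by this tutorial.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order high-pass (pre-emphasis) filter: y[n] = x[n] - alpha * x[n-1].

    Boosts the high-frequency components relative to the low-frequency ones.
    alpha = 0.97 is a commonly used value (an assumption here).
    """
    return np.append(x[0], x[1:] - alpha * x[:-1])
```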

2. Human Speech Production

Normal human speech is produced when air is exhaled from the lungs, and the oscillation of the vocal folds modulates this air flow into a pulsed air stream, called the glottal pulses. This pulsed wave then passes through the vocal tract, and its frequency content is modified by the resonances of the vocal tract. The vocal folds and the vocal tract are two important parts in speech production.

Figure 3

Figure 4

2.1 Vocal folds

The vocal folds are the sources for speech production in humans. They generate two kinds of speech sounds: voiced and unvoiced. The vocal folds vibrate when creating a voiced sound, while they do not for an unvoiced sound.

The frequency of the vocal folds vibration is called the fundamental frequency. The vibration of the vocal folds depends on the tension exerted by the muscles and on the mass and length of the vocal folds. These characteristics vary between speakers, and thus can be utilized for speaker recognition. The features that characterize the vocal folds oscillation are called voice source features.

2.2 Vocal tract

The vocal tract comprises the speech production organs above the vocal folds, which consist of the oral tract (tongue, pharynx, palate, lips, and jaw) and the nasal tract. When the glottal pulse signal generated by the vibration of the vocal folds passes through the vocal tract, it is modified: the vocal tract works as a filter, and its frequency response depends on the resonances of the vocal tract. The vocal tract resonances, also called formants, are the peaks of the spectral envelope. The lowest resonance frequency is called the first formant, the next the second formant, and so on. Figure 5 shows a spectral envelope and its formants. The resonance frequencies (formants) are inversely proportional to the vocal tract length: male speakers usually have longer vocal tracts, thus their formants are lower; for female speakers and children, the vocal tract length is shorter, thus the formants are higher. Figure 6 demonstrates this phenomenon. The vocal tract shape can be estimated from the spectral shape of the speech signal, such as the formant locations and the spectral tilt. In speaker recognition, the features derived from the vocal tract characteristics are the most commonly used. These features can be obtained from the spectrogram of the speech signal, and thus are categorized as short-term spectral features.

Figure 5

Figure 6

3. Feature Extraction

The speaker-specific characteristics of speech can be categorized into physical and learned. The physical characteristics are the shapes and sizes of the speech production organs, like the vocal folds and the vocal tract. The learned characteristics include rhythm, intonation style, accent, choice of vocabulary and so on. Figure 7 shows a summary of features from the viewpoint of their physical interpretation.

For speaker recognition, good features should have large between-speaker variability and small within-speaker variability, and should be robust against noise and distortion. Also, the dimension of the features should be low, because otherwise the computation cost would be high, and statistical models such as the Gaussian mixture model (GMM) cannot handle high-dimensional data.

The features for speaker recognition can be divided into: (1) short-term spectral features, (2) voice source features, (3) spectro-temporal features, (4) prosodic features, and (5) high-level features. The short-term spectral features are the simplest and most discriminative, so they are most commonly used in speaker recognition. State-of-the-art speaker recognition systems often combine these features, attempting to achieve more accurate recognition results.

Figure 7

3.1 Short-term spectral features

The short-term spectral features convey information about the spectral envelope, like the locations and magnitudes of the peaks (formants) in the spectrum. The spectral envelope contains information about the speaker's vocal tract characteristics, hence it is commonly used for speaker recognition. Figure 8 shows the spectral envelopes of two different speakers (one male, one female).

Figure 8

In most spectral analysis of the speech signal, short-term spectral analysis is used to obtain the spectrogram. It is quite similar to the Short-Time Fourier Transform (STFT). It is assumed that although the speech signal is non-stationary, it is stationary over a short duration of time. Short-term spectral analysis is done by framing the speech signal; the frame

width is about 20-30 milliseconds. The process of short-term spectral analysis is illustrated in Figure 9.

Figure 9

◆ Framing the Signal

The speech signal is broken down into short frames. Each frame contains N sample points of the speech signal. The width of the frame is generally about 30 ms, with an overlap of about 20 ms (i.e., the frames are shifted by about 10 ms).

◆ Windowing

The framed signal is multiplied by a window function, which is used to smooth the signal for the computation of the DFT. The DFT computation makes the assumption that the input signal repeats over and over; if there is a discontinuity between the first point and the last point of the signal, artifacts occur in the DFT spectrum, as shown in Figure 10. By multiplying by a window function that smoothly attenuates both ends of the signal towards zero, these unwanted artifacts can be avoided. The Hamming window is usually used in speech spectral analysis because its spectrum falls off rather quickly, so the resulting frequency resolution is better, which is suitable for detecting formants.
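A minimal sketch of the framing and windowing steps in Python/NumPy, assuming the 30 ms frame and 10 ms shift described above:

```python
import numpy as np

def frame_and_window(x, fs, frame_ms=30, shift_ms=10):
    """Split the signal into overlapping frames and apply a Hamming window."""
    frame_len = int(fs * frame_ms / 1000)   # e.g. 240 samples at fs = 8 kHz
    shift = int(fs * shift_ms / 1000)       # 10 ms shift -> about 20 ms overlap
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    window = np.hamming(frame_len)          # attenuates both frame ends to near zero
    return np.stack([x[i * shift : i * shift + frame_len] * window
                     for i in range(n_frames)])
```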

Figure 10

Figure 11

There are many short-term spectral features that convey information about the spectral envelope of the speech signal, such as MFCC (Mel-Frequency Cepstral Coefficients), LPCC (Linear Predictive Cepstral Coefficients), and MFDWC (Mel-Frequency Discrete Wavelet Coefficients). Figure 12 shows the estimation of the spectral envelope using cepstral analysis and linear prediction separately.

Figure 12

3.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

The Mel-Frequency Cepstral Coefficients (MFCC) are the most commonly used features in speaker recognition. They combine the advantages of cepstrum analysis with a perceptual frequency scale based on critical bands. The steps for computing the MFCCs from the speech signal are as follows; the algorithm is shown in Figure 13:

1. Framing the signal
2. Windowing
3. FFT
4. Mel-frequency warping
5. Computing the cepstral coefficients

Figure 13

After segmenting the speech signal into overlapping frames, the frequency response of each frame is computed by the Discrete Fourier Transform (DFT). Then the spectrogram of the speech signal is obtained. Figure 14 illustrates the computation of the spectrogram by short-term spectral analysis.
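Continuing the framing sketch above, the spectrogram can be obtained by taking the DFT of each windowed frame; the FFT size of 512 is an assumed value, not one specified here.

```python
import numpy as np

# `frames` is the (n_frames, frame_len) array from the framing/windowing sketch.
n_fft = 512                                        # assumed FFT size
spectrum = np.fft.rfft(frames, n=n_fft, axis=1)    # DFT of every frame
power_spec = np.abs(spectrum) ** 2 / n_fft         # power spectrum per frame
# Stacking the rows of `power_spec` over time gives the (power) spectrogram.
```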

Figure 14

◆ Mel-Frequency Warping

Mel (melody) is a unit of pitch. The mel-frequency scale, based on human auditory perception experiments, is approximately linear up to a frequency of 1000 Hz and then becomes close to logarithmic for higher frequencies. Figure 15 shows the plot of pitch (mel) versus frequency.

Figure 15

It is observed that the human ear acts as a set of filters that concentrate on only certain frequency components. Thus the human auditory system can be modeled by a set of band-pass filters which are uniformly spaced on the mel-frequency scale. Since the relationship between the frequency scale and the mel-frequency scale is nonlinear, these filters are non-uniformly spaced on the frequency scale, with more filters in the low-frequency regions and fewer filters in the high-frequency regions. Figure 16 shows a 24-band mel-frequency filter bank. Figure 17 shows the power spectrum of a frame passing through the 24-band mel-frequency filter bank.

Figure 16
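A sketch of constructing such a triangular mel filter bank. It assumes the common 2595·log10(1 + f/700) form of the mel scale and an 8 kHz sampling rate; neither value is fixed by this tutorial.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)   # a common form of the mel scale

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=512, fs=8000):
    """Triangular filters uniformly spaced on the mel scale (non-uniform in Hz)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):            # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank
```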

Figure 17

◆ Computing the Cepstral Coefficients

Our goal is to obtain the spectral envelope because it conveys the information about the formants. Cepstrum analysis can be used to extract the spectral envelope from the spectrum; the cepstrum can be considered as the spectrum of the log spectrum. The spectrum $X[k]$ equals the spectral envelope $H[k]$ multiplied by the spectral details $E[k]$:

$$X[k] = H[k]\,E[k]$$

In order to separate the spectral envelope and the spectral details from the spectrum, take the log of both sides:

$$\log X[k] = \log H[k] + \log E[k]$$

Then the log spectrum is the sum of a smooth signal (the spectral envelope) and a fast-varying signal (the spectral details). Thus the spectral envelope can be obtained from the low-frequency components of the spectrum of the log spectrum, i.e., the low-frequency cepstrum coefficients. The concept of cepstrum analysis is illustrated in Figure 18.

◆ Mel-Frequency Cepstral Coefficients (MFCC)

The MFCC features are obtained by taking the log of the outputs of the mel-frequency filter bank and then applying the Discrete Cosine Transform (DCT). The final MFCC feature vectors are obtained by retaining about the 12-15 lowest DCT coefficients.
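Putting the filter-bank, log-compression, and DCT stages together, a minimal sketch using the `power_spec` and `mel_filterbank` from the previous sketches (keeping 13 coefficients is an assumed, typical choice):

```python
import numpy as np
from scipy.fftpack import dct

mel_energies = power_spec @ mel_filterbank().T       # filter-bank outputs per frame
log_mel = np.log(np.maximum(mel_energies, 1e-10))    # log compression (floor avoids log(0))
mfcc = dct(log_mel, type=2, axis=1, norm='ortho')[:, :13]   # keep the lowest coefficients
```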

Figure 18

3.1.2 Linear Predictive Cepstral Coefficients (LPCC)

Linear Predictive Coding (LPC) is an alternative method for spectral envelope estimation. This method is also known by the names all-pole model or autoregressive (AR) model. It has a good intuitive interpretation both in the time domain (adjacent samples are correlated) and in the frequency domain (the all-pole spectrum corresponds to the resonance structure). The signal $s[n]$ is predicted by a linear combination of its past values. The predictor equation is defined as

$$\hat{s}[n] = \sum_{k=1}^{p} a_k\, s[n-k]$$

Here $s[n]$ is the signal, $a_k$ are the predictor coefficients, and $\hat{s}[n]$ is the predicted signal. The prediction error signal, or residual, is defined as

$$e[n] = s[n] - \hat{s}[n]$$

The coefficients $a_k$ are determined by minimizing the residual energy $E\!\left[e[n]^2\right]$ using the Levinson-Durbin algorithm. As shown in Figure 19, $s[n]$ is the speech signal, $e[n]$ is the voice source (glottal pulses), and $H(z)$ is the response of the vocal tract filter.
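A sketch of the Levinson-Durbin recursion operating on a frame's autocorrelation sequence. Note the sign convention: internally the routine builds the polynomial $A(z) = 1 - \sum_k a_k z^{-k}$ and returns the predictor coefficients $a_k$ of the tutorial's predictor equation.

```python
import numpy as np

def lpc(frame, order):
    """Levinson-Durbin recursion: returns the predictor coefficients a_k of
    s_hat[n] = sum_k a_k * s[n-k], computed from the autocorrelation sequence."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]  # r[0], r[1], ...
    A = np.zeros(order + 1)       # A(z) = 1 - sum_k a_k z^-k, stored as [1, -a_1, ...]
    A[0] = 1.0
    err = r[0]                    # residual energy
    for i in range(1, order + 1):
        acc = r[i] + np.dot(A[1:i], r[i - 1:0:-1])
        k = -acc / err            # reflection coefficient
        A_new = A.copy()
        A_new[1:i] += k * A[i - 1:0:-1]
        A_new[i] = k
        A = A_new
        err *= (1.0 - k * k)      # updated residual energy
    return -A[1:]                 # predictor coefficients a_1 .. a_p
```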

Figure 19

$$e[n] = s[n] - \hat{s}[n] = s[n] - \sum_{k=1}^{p} a_k\, s[n-k]$$

$$E(z) = S(z)\left[1 - \sum_{k=1}^{p} a_k z^{-k}\right]$$

$$H(z) = \frac{S(z)}{E(z)}$$

Thus the spectral model representing the vocal tract is

$$H(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$

The predictor coefficients $a_k$ are rarely used as features directly; instead, they are transformed into the more robust Linear Predictive Cepstral Coefficient (LPCC) features. A recursive algorithm proposed by Rabiner and Juang can be used for computing the cepstral coefficients from the LPC coefficients. However, unlike the MFCCs, the LPCCs are not based on a perceptual frequency scale such as the mel-frequency scale. This led to the development of Perceptual Linear Predictive (PLP) analysis.
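A sketch of the standard LPC-to-cepstrum recursion, $c_m = a_m + \sum_{k=1}^{m-1} (k/m)\, c_k\, a_{m-k}$, usually attributed to Rabiner and Juang; the gain term $c_0$ is omitted here for simplicity.

```python
import numpy as np

def lpcc(a, n_ceps):
    """Convert predictor coefficients a = [a_1, ..., a_p] into cepstral
    coefficients c_1 .. c_{n_ceps} with the standard recursion."""
    p = len(a)
    c = np.zeros(n_ceps + 1)                   # c[0] (log-gain term) left as zero
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):      # only terms with 1 <= m-k <= p survive
            acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]
```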

3.1.3 Mel-Frequency Discrete Wavelet Coefficients (MFDWC)

Mel-Frequency Discrete Wavelet Coefficients are computed in a similar way to the MFCC features. The only difference is that a Discrete Wavelet Transform (DWT) is used to replace the DCT in the last step. Figure 20 shows the algorithms for MFCC and MFDWC.

Figure 20

MFDWCs have been used in speaker verification, and it was shown that they give better performance than the MFCCs in noisy environments. An explanation for this improvement is that the DWT allows good localization in both the time and frequency domains.

3.2 Voice source features

Voice source features characterize the voice source (the glottal pulse signal), namely the excitation generated by the vocal folds, such as the glottal pulse shape and the fundamental frequency. These features depend on the source of the speech, so they are less sensitive to the content of speech than short-term spectral features. The voice source features are not as discriminative as vocal tract features, but fusing these two complementary feature types (short-term spectral features and voice source features) can improve recognition accuracy.

These features cannot be directly measured from the speech signal, because the voice source signal is modified when passing through the vocal tract. The voice source signal is extracted from the speech signal by assuming that the voice source and the vocal tract are independent of each other. The vocal tract filter can first be estimated using the linear prediction model described in subsection 3.1.2. Then the voice source signal, which is the residual of the linear prediction model, can be estimated by inverse filtering the speech signal:

$$E(z) = S(z)\,\frac{1}{H(z)}$$

Here $S(z)$ is the speech signal, $E(z)$ is the voice source signal, and $H(z)$ is the response of the vocal tract filter. Figure 21 shows the voice source signal extracted from the speech signal using linear prediction inverse filtering.
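A minimal sketch of LP inverse filtering with SciPy, using the predictor coefficients from the earlier Levinson-Durbin sketch:

```python
import numpy as np
from scipy.signal import lfilter

def lp_residual(frame, a):
    """Apply the inverse filter A(z) = 1 - sum_k a_k z^-k to a speech frame;
    the output is the LP residual e[n], an estimate of the voice source."""
    return lfilter(np.concatenate(([1.0], -a)), [1.0], frame)
```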

Figure 21

3.2.1 Wavelet Octave Coefficients of Residues (WOCOR)

The algorithm for computing the Wavelet Octave Coefficients of Residues (WOCOR) features is shown in Figure 22. First, the pitch of the speech signal is estimated using cepstrum analysis. The speech signal is then divided into overlapping frames, as in short-term spectral analysis. Linear prediction is performed on each frame, and the voice source signal is obtained by inverse filtering. With pitch-synchronous analysis, a wavelet transform is applied to every two pitch cycles of the linear prediction residual signal to obtain the WOCOR features.

Figure 22
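The exact WOCOR grouping and normalization follow Zheng et al. [6]; the sketch below only illustrates the pitch-synchronous wavelet step on the LP residual using PyWavelets, taking one norm per octave subband. The Daubechies-4 wavelet and the 4 decomposition levels are assumptions, not the paper's exact settings.

```python
import numpy as np
import pywt

def wocor_like(residual, pitch_period, wavelet='db4', levels=4):
    """Pitch-synchronous wavelet analysis of the LP residual (illustrative only).

    residual     : LP residual of a voiced segment
    pitch_period : estimated pitch period in samples (from cepstrum analysis)
    """
    seg_len = 2 * pitch_period                 # analysis window = two pitch cycles
    feats = []
    for start in range(0, len(residual) - seg_len + 1, pitch_period):
        seg = residual[start:start + seg_len] * np.hamming(seg_len)
        coeffs = pywt.wavedec(seg, wavelet, level=levels)   # octave subbands
        feats.append([np.linalg.norm(c) for c in coeffs])   # one magnitude per subband
    return np.array(feats)
```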

3.3 Spectro-temporal features

Spectro-temporal features refer to features extracted from the frequency content of the subbands of the speech signal spectrogram. Spectro-temporal signal details contain useful speaker-specific information such as formant transitions and energy modulations; an example is shown in Figure 23. Recently, some research has proposed the modulation features, which are extracted by representing the non-stationary speech signal as a sum of amplitude modulated (AM) and frequency modulated (FM) signals.

Figure 23

A common way to incorporate some temporal information into the short-term spectral features is to use the first and second order differences of the feature vectors. The first and second order differences of the MFCCs are called the delta and delta-delta cepstral coefficients. These coefficients are usually appended to the original MFCC coefficients at the frame level (e.g., 12 MFCCs with delta and delta-delta coefficients, implying 36 features per frame). Figure 24 shows the MFCC and its first and second order differences.

Figure 24
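A sketch of the delta computation with the standard regression formula $d_t = \sum_{n=1}^{N} n\,(c_{t+n} - c_{t-n}) / \left(2\sum_{n=1}^{N} n^2\right)$; applying the same operator twice gives the delta-delta coefficients. The window half-width N = 2 is an assumed, typical choice.

```python
import numpy as np

def deltas(feats, N=2):
    """First-order difference (delta) features; feats is (n_frames, n_coeffs)."""
    T = feats.shape[0]
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feats, ((N, N), (0, 0)), mode='edge')   # replicate edge frames
    d = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
    return d / denom

# Appending deltas and delta-deltas triples the base dimension (e.g. 12 -> 36):
# full = np.hstack([mfcc, deltas(mfcc), deltas(deltas(mfcc))])
```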

3.4 Prosodic features

In linguistics, prosody refers to syllable stress, intonation patterns, speaking rate and the rhythm of speech. Unlike the short-term spectral features, the prosodic features span long segments like syllables, words, and utterances, in order to capture long-term information (syllable stress, speaking rate, intonation patterns, etc.) from the speech signal. Prosody may convey various characteristics of the speaker, like differences in speaking style, language background, sentence type, and emotions, to mention a few.

3.5 High-level features

High-level features attempt to capture conversation-level characteristics of speakers, such as the speaker's characteristic vocabulary, called idiolect, i.e., the kind of words the speaker tends to use in conversation. For example, the phrases frequently used by a speaker, like "uh-huh", "you know", "oh yeah", can be used for recognition. The idea in high-level modeling is to convert each utterance into a sequence of tokens, where the co-occurrence patterns of the tokens characterize speaker differences.

4. Speaker Modeling

During enrollment, speech from a speaker is passed through the front-end processing steps described above, and the feature vectors are used to create a speaker model. A brief description of some of the most prevalent speaker modeling techniques follows.

◆ Hidden Markov Models (HMM)

For text-dependent applications, whole phrases or phonemes may be modeled using multi-state left-to-right HMMs, while for text-independent applications, single-state HMMs, also known as Gaussian Mixture Models (GMMs), are used.
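As a concrete illustration of GMM-based modeling, a minimal enrollment and scoring sketch with scikit-learn's GaussianMixture; the 32 components and diagonal covariances are assumed, typical choices, not values fixed by this tutorial.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Enrollment: fit a GMM to the target speaker's feature vectors
# (`train_feats` is assumed to be an (n_frames, n_coeffs) array, e.g. MFCCs).
gmm = GaussianMixture(n_components=32, covariance_type='diag', max_iter=200)
gmm.fit(train_feats)

# Recognition: the score of a test utterance is its average per-frame
# log-likelihood under the model; verification compares this to a threshold
# (often after normalizing against a background model).
score = gmm.score(test_feats)
```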

◆ Neural Networks (NN)

A potential advantage of neural networks is that feature extraction and speaker modeling can be combined into a single network, enabling joint optimization of the (speaker-dependent) feature extractor and the speaker model.

◆ Support Vector Machines (SVM)

The support vector machine (SVM) is a binary classifier which models the decision boundary between two classes, and it can be used as a classifier in speaker verification. In speaker verification, one class consists of the target speaker's training feature vectors (labeled as +1), and the other class consists of the training feature vectors from an impostor (background) population (labeled as -1). Using the labeled training feature vectors, the SVM finds a boundary that maximizes the margin of separation between these two classes, as illustrated in Figure 25.

Figure 25

Conclusion

The fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling, were briefly introduced in this tutorial. Research on speaker recognition methods and techniques has been undertaken for over four decades, and it continues to be an active area. Actual speaker recognition systems are very complicated; some factors, like noise and channel effects, also need to be considered.

References

[1] H. Beigi, "Fundamentals of Speaker Recognition," Springer, New York, 2011. ISBN: 978-0-387-77591-3.
[2] T. Kinnunen and H. Li, "An Overview of Text-Independent Speaker Recognition: From Features to Supervectors," Speech Communication, vol. 52, no. 1, pp. 12-40, January 2010. ISSN 0167-6393.
[3] D. A. Reynolds, "An overview of automatic speaker recognition technology," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, pp. IV-4072-IV-4075, 13-17 May 2002.
[4] J. P. Campbell, Jr., "Speaker recognition: a tutorial," Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462, Sep. 1997.
[5] Lin-Shan Lee (李琳山), "Digital Speech Processing" course slides [Online]. Available: http://speech.ee.ntu.edu.tw/
[6] N. Zheng, P. C. Ching, and T. Lee, "Time-frequency analysis of vocal source signal for speaker recognition," in Proc. INTERSPEECH, 2004.
