ARTIFICIAL BANDWIDTH EXTENSION OF SPEECH

COURSE SGN-1650 AND SGN-1656, 2010–2011

In this work, we implement a simple speech bandwidth extension system that converts a narrowband speech signal into a wideband signal. It is recommended to pass the course SGN-4010 Speech Processing Methods before selecting this exercise. You should be familiar with the basic (speech) signal processing methods, such as windowing, the Fourier transform, and linear prediction.

1 Introduction

1.1 Artificial bandwidth extension

In digital signal processing, signals are bandlimited with respect to the sampling frequency used. For instance, if a sampling frequency of 8kHz is used, the highest possible frequency component in the signal is 4kHz (the Nyquist frequency). In analog telephone speech, the speech bandwidth has traditionally been limited to 300–3400Hz. Most of the information in speech lies below the upper boundary of 3.4kHz, and even though the very low frequencies are not transmitted, the human hearing system can detect the speech fundamental frequency from the harmonic components present in the signal. For simplicity, the terms narrowband speech and wideband speech are used here to refer to speech signals with bandwidths of 4kHz (sampling frequency 8kHz) and 8kHz (sampling frequency 16kHz), respectively.

The amount of information in narrowband speech is smaller than in wideband speech, and the perceived speech quality is thus lower. To achieve wideband speech quality without actually transmitting wideband signals, algorithms for artificial bandwidth extension (ABE, BWE) have been developed. These algorithms convert original narrowband signals into artificial wideband signals by estimating the missing high-frequency content based on the existing low-frequency content.

In this work we implement a simple ABE system that utilizes a source-filter model of speech. Each narrowband signal frame is decomposed into a source part and a filter part, and the parts are extended separately. The vocal tract is modeled as an all-pole filter and the filter coefficients are estimated using linear prediction (LP); the model residual is used as a source signal. The vocal tract model is extended using the most suitable wideband model taken from a codebook, and the residual signal by time domain zero-insertion. The created signal is added to a resampled and delayed version of the original narrowband signal to form an artificial wideband signal.
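As a preview of the per-frame processing chain described above, here is a minimal sketch in Python/NumPy (the exercise itself is done in Matlab). The function names are invented here, and the "extensions" (reusing the narrowband filter, zero-inserting the residual) are trivial placeholders for the codebook lookup and source extension developed later in the work:

```python
import numpy as np
from scipy.signal import lfilter

def lpc_coeffs(frame, order):
    """LP coefficients [1, a1, ..., ap] via the autocorrelation method."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:len(frame) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:order + 1])       # Yule-Walker normal equations
    return np.concatenate(([1.0], a))

def extend_frame(frame_nb, order=10):
    """One ABE step: split a narrowband frame into filter (LP) and source
    (residual) parts, 'extend' both, and resynthesize at twice the rate.
    The extensions below are placeholders, not the real codebook method."""
    a = lpc_coeffs(frame_nb, order)
    residual = lfilter(a, [1.0], frame_nb)        # inverse filtering: X = Y * A
    src_wb = np.zeros(2 * len(frame_nb))
    src_wb[::2] = residual                        # zero-insertion (source extension)
    a_wb = a                                      # placeholder for codebook lookup
    return lfilter([1.0], a_wb, src_wb)           # synthesis: Y = X / A

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(200) / fs
frame = np.sin(2 * np.pi * 440 * t) + 0.01 * rng.standard_normal(200)
frame_wb = extend_frame(frame)                    # twice as many samples as input
```

The autocorrelation method used here guarantees a stable synthesis filter 1/A(z), which is why it is the standard choice for LP analysis of speech.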

1.2 Linear prediction of speech

Linear prediction (LP) is one of the most important tools used in digital speech processing. Speech production can be modeled as a source-filter system where a source signal produced by the vocal cords is filtered by a vocal tract filter with resonances at the formant frequencies. For a recap, see www.cs.tut.fi/kurssit/SGN-4010/LP_en.pdf. The vocal tract can be modeled as a pth-order all-pole filter 1/A(z):

1/A(z) = 1/(1 + a1 z^-1 + ... + ap z^-p)    (1)

where the filter coefficients a1, ..., ap are estimated using linear prediction. Figure 1 illustrates the spectrum of the all-pole filter 1/A(z) estimated from a short speech frame; the thin line represents the amplitude spectrum (absolute value of the discrete Fourier transform, DFT) of the frame. The speech frame Y(z) (now in the frequency domain) is formed by filtering the residual signal X(z) with the vocal tract all-pole filter 1/A(z) (remember that convolution/filtering in the time domain corresponds to multiplication in the frequency domain):

Y(z) = X(z)/A(z)
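The time/frequency correspondence above can be checked numerically. The hedged Python/NumPy sketch below (the course itself uses Matlab) filters a toy residual through an example all-pole filter 1/A(z) in the time domain and compares the result against division by the DFT of A(z); with a well-damped A(z) and enough zero-padding, the two agree up to negligible time-aliasing:

```python
import numpy as np
from scipy.signal import lfilter

a = np.array([1.0, -0.5, 0.25])    # example A(z); roots well inside the unit circle
rng = np.random.default_rng(1)
x = rng.standard_normal(64)        # a toy "residual" signal

# Time domain: synthesis filtering through the all-pole filter 1/A(z)
y_time = lfilter([1.0], a, x)

# Frequency domain: Y(z) = X(z)/A(z), evaluated on a zero-padded DFT grid
nfft = 256                         # padding keeps circular aliasing negligible
Y = np.fft.fft(x, nfft) / np.fft.fft(a, nfft)
y_freq = np.real(np.fft.ifft(Y))[:len(x)]
```

Here the agreement holds because the impulse response of 1/A(z) has decayed to nothing well before the 256-sample DFT length; for filters with slowly decaying responses, a longer FFT would be needed.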


Figure 1: Frame LP amplitude spectrum (thick line) and amplitude spectrum (thin line) for a Finnish vowel 'y'; amplitude in dB over 0–8000 Hz.

In Matlab, use function lpc to compute LP coefficients of a given order:

a = lpc(frame, order);  % Estimate LP coefficients

The residual signal X(z) is formed by filtering the frame Y(z) with the vocal tract inverse filter A(z):

X(z) = Y(z) A(z)

For speech coding purposes, the LP polynomial A(z) can be decomposed into line spectral frequencies (LSF). The idea is to decompose A(z) into two polynomials that have their roots on the unit circle. An LSF coefficient vector Ω corresponding to A(z) of order p consists of p root angle (frequency) values ωi: Ω = (ω1, ω2, ..., ωp). LSFs have good quantization and interpolation properties and are thus widely used in speech coding. A more detailed derivation of LSFs can be found at http://www.cs.tut.fi/sgn/arg/8003102/syn_en.pdf.

The LSF representation of the previous LP spectrum is given in Figure 2. The thick line represents the LP spectrum and the frequency values of the thin lines represent the LSF coefficient values.

Figure 2: Frame LP spectrum (thick line) and corresponding LSF values (thin lines) for a Finnish vowel 'y'.

In Matlab, LP-to-LSF and LSF-to-LP conversions can be computed using functions poly2lsf and lsf2poly:

w = poly2lsf(a);  % Convert LP coefficients into LSF coefficients
a = lsf2poly(w);  % Convert LSF coefficients into LP coefficients

Note that the values of vector w are between 0 and π (whereas in Figure 2 the LSF values are scaled to be between 0 and 8kHz).
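Since NumPy/SciPy ship no direct poly2lsf equivalent, the decomposition described above can be sketched by hand in Python (a hedged illustration; in the exercise itself Matlab's poly2lsf/lsf2poly are used, and `poly_to_lsf` is a name invented here). A(z) is split into a symmetric polynomial P(z) = A(z) + z^-(p+1) A(1/z) and an antisymmetric polynomial Q(z) = A(z) - z^-(p+1) A(1/z); for a minimum-phase A(z) their roots lie on the unit circle, and the sorted positive root angles are the LSFs:

```python
import numpy as np

def poly_to_lsf(a):
    """Line spectral frequencies of an LP polynomial a = [1, a1, ..., ap]."""
    a = np.asarray(a, dtype=float)
    ar = a[::-1]
    P = np.concatenate((a, [0.0])) + np.concatenate(([0.0], ar))  # symmetric part
    Q = np.concatenate((a, [0.0])) - np.concatenate(([0.0], ar))  # antisymmetric part
    angles = []
    for poly in (P, Q):
        w = np.angle(np.roots(poly))
        # Keep only 0 < w < pi: drops the fixed roots at z = +/-1
        # and the negative-frequency halves of the conjugate pairs.
        angles.extend(w[(w > 1e-9) & (w < np.pi - 1e-9)])
    return np.sort(angles)

# Example: a stable 4th-order LP polynomial built from resonances
# at angles 0.5 and 2.0 rad (radii 0.8 and 0.7)
a = np.poly([0.8 * np.exp(1j * 0.5), 0.8 * np.exp(-1j * 0.5),
             0.7 * np.exp(1j * 2.0), 0.7 * np.exp(-1j * 2.0)]).real
lsf = poly_to_lsf(a)   # p = 4 strictly increasing values in (0, pi)
```

The strictly increasing ordering ω1 < ω2 < ... < ωp is exactly the stability check used later in the codebook construction.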

2 Work assignment

In this work, we build a simple speech bandwidth extension system. The system should read a narrowband speech signal and process it framewise so that each frame is first decomposed into an all-pole filter and a source signal. The filter and the source signal parts are extended separately. The filter part is extended using a codebook built in Section 2.3. The extended parts are combined using filtering to form a frame that contains the artificial high-frequency components missing from the narrowband frame. Extended time domain frames are catenated using overlap-add (see http://www.cs.tut.fi/kurssit/SGN-4010/ikkunointi_en.pdf). The extended signal is added to an interpolated and delayed version of the original narrowband signal.

2.1 What to include in the report?

The report of the work should contain the commented Matlab codes of your bandwidth extension system and the answers to the questions, including the plotted figures. Write the whole report in a single file and send it to hanna.silen@tut.fi. Include the names and student numbers of the group members!

2.2 Getting started

The speech files can be downloaded from:

http://www.speech.cs.cmu.edu/cmu_arctic/packed/cmu_us_slt_arctic-0.95-release.tar.bz2

The package is extracted in Lintula by typing:

tar -xjf cmu_us_slt_arctic-0.95-release.tar.bz2

The wideband speech files should now be in folder cmu_us_slt_arctic/wav/. Wavefiles arctic_a0001.wav, ..., arctic_a0100.wav are used as LSF codebook training data. As test data, any of the wavefiles excluded from the training data can be used, e.g. arctic_a0501.wav.

Before starting to build the ABE system, let's go through some basic speech processing Matlab functions. Read the selected test wideband speech signal into Matlab:

[ywb,fs] = wavread('cmu_us_slt_arctic/wav/arctic_b0001.wav');

and plot the time domain signal and spectrogram:

figure, plot(ywb);
figure, specgram(ywb, 512, fs, kaiser(500, 5), 475);

Question 1: What is the sampling frequency in the downloaded wideband signal? What is the highest frequency component of this band-limited signal?

Create a narrowband signal by downsampling the wideband signal:

ynb = decimate(ywb, 2);  % Downsample (wideband -> narrowband)

By default, function decimate filters out the high-frequency content before downsampling, thus preventing aliasing.

Question 2: Plot the created time domain signal and its spectrogram (note the new sampling frequency). What can you say about the frequency content of the narrowband signal compared to the wideband signal? Listen to the signals (soundsc); are there any audible differences between them?

Increase the sampling frequency by upsampling the narrowband signal:

yus = resample(ynb, 2, 1);  % Upsample signal

Question 3: Plot the time domain signal and spectrogram. Compare the spectrogram to the narrowband spectrogram. How do they differ?

Add a zero after each sample of the narrowband signal:

yf = zeros(length(ynb)*2, 1);
yf(1:2:end) = ynb;

Question 4: Plot the signal spectrogram and listen to the signal. How has the signal changed?

2.3 LSF codebook construction

The first step in building our ABE system is the construction of the LSF codebook. The codebook stores narrowband-wideband representation pairs, and it is used in the extension phase of Section 2.4 for finding a suitable wideband representation for the spectral envelope based on the known narrowband representation. In the bandwidth extension, a suitable wideband LSF vector is found based on the corresponding narrowband representation.

Figure 3: The LSF codebook consisting of narrowband-wideband representation pairs (NB LSF vector i paired with WB LSF vector i, i = 1, ..., N).

To construct the LSF codebook, we are going to need both narrowband and wideband LP spectra. The narrowband signals can be formed by decimating the existing wideband signals (decimate); in this case you do not need to take care of the anti-aliasing filtering. The following processing should be repeated for every training data wavefile.

Pre-emphasis

Before LP analysis, filter the signals using a pre-emphasis filter. For wideband signals, use a FIR filter H(z) = 1 - 0.95 z^-1:

ywb = filter([1 -0.95], 1, ywb);  % Filter signal ywb

The frequency response of the wideband pre-emphasis filter is illustrated in Figure 4.

Figure 4: Frequency response of the wideband pre-emphasis filter (magnitude in dB and phase in degrees over 0–8000 Hz).

For the narrowband signal, use a pre-emphasis filter whose DFT on the frequency band 0–4kHz (sampling frequency is 8kHz) is identical to the DFT of the wideband filter on the band 0–4kHz (sampling frequency is 16kHz). This can be done easily in the frequency domain:

pdf.’symmetric’).nfft).% FFT length for the wideband signal nfft = 2^nextpow2(length(ywb)).200). 5 . The resulting matrix centroids of matrix clcentr are used as wideband entries of the codebook.005).25*nfft 0. Vector clidx contains the cluster index for every original wideband LSF vector. The wideband vectors are clustered using k-means clustering and the resulting cluster centroids are used as wideband codebook entries. A detailed description about k-means clustering can be found at http://www. % Filtering in time domain corresponds % to multiplication in frequency domain Ynb = Ynb(:).fi/∼jupeto/jht lectnotes eng.tut.clcentr] = kmeans(codevec_wb. The number of clusters was now set to 200 and may be varied. Note that the sampling frequency (fs) for the narrowband signal is 8kHz and for the wideband signal 16kHz. Use filter order 10 for narrowband and 18 for wideband signals. % Inverse DFT: frequency domain -> time domain ynb = ifft(Ynb. tmp = hanning(fs*0. . form a mean vector (of size 1x10) of the corresponding narrowband LSF vectors. Windowing Window the signals using the following 25ms window and no overlapping between adjacent frames: awinlen = round(fs*0.5*nfft). Cluster the wideband LSF vectors using Matlab function kmeans: [clidx. % Wideband filter DFT (length nfft) % Note: DFT includes frequencies 0-16kHz (sampling frequency 16kHz) Hwb = fft([1 -0. % Narrowband signal DFT (length nfft/2) Ynb = fft(ynb. Start by estimating of the LP coefficients (lpc) and convert them further into LSF coefficients (poly2lsf).025). LSF computing Compute framewise LSF coefficients for the narrowband and wideband signals.75*nfft+1:nfft]).cs. For each cluster.95]. % Narrowband filter DFT (length nfft/2) % Note: DFT includes frequencies 0-8kHz (sampling frequency 8kHz) Hnb = Hwb([1:0. Use this mean vector as a key for the corresponding cluster.0. < ωp ). 
where the matrix rows correspond to LSF vectors of individual frames N being the total number of frames in the training data.1). Check that the clustered LSF vectors result in stable LP filters (ω1 < ω2 < . LSF clustering The collected LSF vectors are not used as codebook entries as such. Repeat the framewise processing for each training wavefile and store the results in two matrices of size Nx10 and Nx18. awinfun = [tmp(1:length(tmp)/2). ones(awinlen-length(tmp).*Hnb(:). . tmp(length(tmp)/2+1:end)].
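The codebook-building steps can be prototyped in Python with SciPy's k-means (a hedged sketch; the exercise uses Matlab's kmeans on real LSF data). Here random sorted vectors merely stand in for the Nx18 and Nx10 LSF matrices, and 20 clusters replace 200 to keep the toy example fast:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Stand-ins for the LSF matrices gathered from the training files:
# N frames, wideband LSFs (order 18) with matching narrowband LSFs (order 10).
rng = np.random.default_rng(0)
N = 500
codevec_wb = np.sort(rng.uniform(0.0, np.pi, size=(N, 18)), axis=1)
codevec_nb = np.sort(rng.uniform(0.0, np.pi, size=(N, 10)), axis=1)

# Cluster the wideband LSF vectors
# (Matlab equivalent: [clidx, clcentr] = kmeans(codevec_wb, 200))
n_clusters = 20
clcentr, clidx = kmeans2(codevec_wb, n_clusters, minit='points')

# For each cluster, the mean of its narrowband LSF vectors becomes the
# narrowband "key"; the wideband centroid is the paired codebook entry.
keys_nb = np.array([codevec_nb[clidx == k].mean(axis=0)
                    for k in range(n_clusters)])

# Extension-phase lookup (Section 2.4): pick the narrowband key with the
# minimum Euclidean distance and return its paired wideband entry.
q = codevec_nb[0]
best = np.argmin(np.linalg.norm(keys_nb - q, axis=1))
wb_entry = clcentr[best]
```

Note that sorted random vectors are only a shape-compatible stand-in; with real LSF data the stability check ω1 < ω2 < ... < ωp should be applied to the resulting centroids.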

2.4 Extension of a narrowband test signal

Write a Matlab code that extends a given narrowband test signal artificially into a wideband signal. The basic idea is to create a signal that contains the frequencies that are missing from the original narrowband signal. This signal, having its energy mainly on the frequency band 4–8kHz, is then added to an interpolated version of the original narrowband signal that has most of its energy on the band 0–4kHz. A block diagram of the system is given in Figure 5.

Figure 5: Bandwidth extension procedure. The narrowband signal undergoes LP analysis; the source signal and spectral envelope are extended separately; waveform generation produces the extended signal, which is summed with a resampled and delayed narrowband signal to form the artificial wideband signal.

The frames are decomposed into source signal and filter parts using LP analysis, and the parts are extended separately. The frame waveform is reconstructed by filtering the extended source signal using the extended filter. After scaling, the frames are joined together using overlap-add. Note that the extension causes a delay in the signal, and therefore also the resampled narrowband signal must be delayed to synchronize the signals.

Pre-emphasis

As in the codebook construction, the input signal is first filtered using a pre-emphasis filter. Use the same narrowband pre-emphasis filter as before.

Overlap-add

The filtered signal is extended framewise and the extended frames are catenated using the overlap-add technique. A 25ms analysis window and a 10ms synthesis window are used; the time difference between adjacent frames is 5ms in both analysis and synthesis. Start the processing by windowing the incoming signal. In analysis, use the same window function as earlier (awinfun). In synthesis, reconstruct a 25ms time domain speech frame and use a 10ms Hanning window to extract a segment around the center of the frame. Join the windowed segments using an overlap of 5ms. An example code for overlap-add is available at http://www.cs.tut.fi/kurssit/SGN-4010/ikkunointi_en.pdf; you can modify this code or write your own implementation. Note that the sampling frequency of our ABE system is 8kHz in analysis and 16kHz in synthesis!

Extension of the source and filter parts

Decompose each narrowband frame using LP analysis into source and filter parts: first compute the all-pole filter coefficients a1, ..., ap for the frame and then form the model residual signal as was explained in Section 1.2. Use the same model order as in the codebook training. You can operate either in the time domain (use filtering) or in the frequency domain (use multiplication/division).

Create a wideband source signal that has its energy mainly on the frequency band 4–8kHz. Use the narrowband source signal as a basis for the wideband source signal:

1. Increase the sampling rate of the narrowband source signal using time domain zero-insertion (or spectral mirroring in the frequency domain). This will create a signal whose spectrum on the band 4–8kHz is a mirrored copy of the spectrum on the band 0–4kHz (check the spectrogram).

2. Using the signal in (1), create a signal whose energy is mainly on the band 4–8kHz (check the spectrogram).

3. Use this signal as a wideband source signal.

Use the codebook to extend the spectral envelope:

1. Convert the LP coefficients of the current narrowband frame into LSFs.

2. Find the narrowband codebook entry with the minimum Euclidean distance to the current LSF vector. Select the corresponding wideband entry to be used as the wideband spectral envelope representation.

3. Convert the selected wideband LSF coefficients into LP coefficients for waveform synthesis.

Waveform synthesis

Reconstruct the wideband frame by filtering the created wideband source signal with the selected wideband all-pole filter. Scale the frame energy according to the energy of the original narrowband frame. First, compute energy Ecb of the frequency band 0–4kHz for the frame that results from filtering an interpolated version of the narrowband source signal (sampling frequency of the interpolated signal is 16kHz) by the selected wideband all-pole filter. Then compute energy Enb of the original narrowband frame (sampling frequency is 8kHz). Multiply each sample of the frame by the scaling factor sqrt(Enb/Ecb).

Use overlap-add to join the scaled time domain frames. Remove the effect of the pre-emphasis filtering by filtering the signal with the inverse filter of the wideband pre-emphasis filter H(z): 1/H(z) = 1/(1 - 0.95 z^-1). In Matlab:

sig = filter(1, [1 -0.95], sig);

Increase the sampling frequency of the original narrowband signal from 8kHz to 16kHz using command resample. Delay this signal with the total delay of your ABE system. Add the resampled and delayed signal to the artificially extended signal.

Question 5: Plot the spectrogram of the resulting signal. How does it differ from the spectrograms of the original wideband and narrowband signals? Listen to the signal. Are there audible differences compared to the original narrowband and wideband signals?

Question 6: What are the most dominant artifacts caused by the extension? Could they be avoided somehow?