
Speech Processing Project

Linear Predictive Coding Using a Voice-Excited Vocoder

ECE 5525
Osama Saraireh
Fall 2005
Dr. Veton Kepuska
The basic form of a pitch-excited LPC vocoder is shown below.

The speech signal is low-pass filtered to no more than one half of the system sampling
frequency, and then A/D conversion is performed. The speech is processed on a
frame-by-frame basis, where the analysis frame length can be variable. For each frame, a
pitch period estimate is made along with a voicing decision. A linear predictive
coefficient analysis is performed to obtain an inverse model A(z) of the speech spectrum.
In addition, a gain parameter G, representing some function of the speech energy, is
computed. An encoding procedure is then applied to transform the analyzed parameters into
an efficient set of transmission parameters, with the goal of minimizing the degradation
in the synthesized speech for a specified number of bits. Knowing the transmission frame
rate and the number of bits used for each transmission parameter, one can compute a
noise-free channel transmission bit rate.

At the receiver, the transmitted parameters are decoded into quantized versions of
the coefficient analysis and pitch estimation parameters. An excitation signal for
synthesis is then constructed from the transmitted pitch and voicing parameters. The
excitation signal then drives a synthesis filter 1/A(z) corresponding to the analysis
model A(z). Either before or after synthesis, the gain is used to match the synthetic
speech energy to the actual speech energy. The digital samples ŝ(n) are then passed
through a D/A converter and low-pass filtered, with a filter similar to the one at the
input of the system, to generate the synthetic speech s(t).
Linear predictive coding (LPC) of speech

The linear predictive coding (LPC) method for speech analysis and synthesis is
based on modeling the vocal tract as a linear all-pole (IIR) filter having the system
transfer function

$$H(z) = \frac{G}{1 + \sum_{k=1}^{p} a_p(k)\, z^{-k}}$$

[Figure: simple speech production model]

where p is the number of poles, G is the filter gain, and $a_p(k)$ are the parameters
that determine the poles. Two mutually exclusive excitation functions are used to
model voiced and unvoiced speech sounds. On a short-time basis, voiced speech
is considered periodic with a fundamental frequency F0 and a pitch period 1/F0,
which depends on the speaker. Hence, voiced speech is generated by exciting the all-pole
filter model with a periodic impulse train. Unvoiced sounds, on the other hand, are
generated by exciting the all-pole filter with the output of a random noise generator.
The fundamental difference between these two types of speech sounds comes
from the way they are produced. Voiced sounds are produced by the vibration of the
vocal cords, and the rate at which the vocal cords vibrate dictates the pitch of the
sound. Unvoiced sounds, on the other hand, do not rely on the vibration of the vocal
cords. They are created by constrictions of the vocal tract: the vocal cords remain open
and the constrictions force air out to produce the unvoiced sounds.

Given a short segment of a speech signal, say about 20 ms or 160 samples at a
sampling rate of 8 kHz, the speech encoder at the transmitter must determine the proper
excitation function, the pitch period for voiced speech, the gain, and the coefficients
$a_p(k)$. The block diagram below describes the encoder/decoder for linear predictive
coding. The parameters of the model are determined adaptively from the data,
encoded into a binary sequence, and transmitted to the receiver. At the receiver, the
speech signal is then synthesized from the model and the excitation signal.

The parameters of the all-pole filter model are determined from the speech
samples by means of linear prediction. To be specific the output of the Linear Prediction
filter is

$$\hat{s}(n) = -\sum_{k=1}^{p} a_p(k)\, s(n-k)$$

and the corresponding error between the observed sample s(n) and the predicted value
ŝ(n) is

$$e(n) = s(n) - \hat{s}(n)$$

By minimizing the sum of the squared error we can determine the pole parameters $a_p(k)$
of the model. Differentiating this sum with respect to each of the parameters and
equating the result to zero gives a set of p linear equations

$$\sum_{k=1}^{p} a_p(k)\, r_{ss}(m-k) = -r_{ss}(m), \qquad m = 1, 2, \ldots, p$$

where $r_{ss}(m)$ represents the autocorrelation of the sequence s(n), defined as

$$r_{ss}(m) = \sum_{n=0}^{N} s(n)\, s(n+m)$$

The equations above can be expressed in matrix form as

$$\mathbf{R}_{ss}\,\mathbf{a} = -\mathbf{r}_{ss}$$

where $\mathbf{R}_{ss}$ is a p×p autocorrelation matrix, $\mathbf{r}_{ss}$ is a p×1
autocorrelation vector, and $\mathbf{a}$ is a p×1 vector of model parameters.
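The matrix form can be checked numerically on a signal whose model is known. The following is a minimal sketch, not part of the report's code: it assumes the sign convention A(z) = 1 + Σ a_p(k) z^-k (the one implied by `filter([1 A'],1,frameData)` in the listing), a hypothetical second-order model `aTrue`, and a test frame that is simply the impulse response of 1/A(z). All variable names are illustrative.

```matlab
% Sketch: solve the normal equations Rss*a = -rss directly for one frame.
p     = 2;                       % model order (assumed for the demo)
aTrue = [-1.2; 0.81];            % known a_p(1..p); poles at radius 0.9 (stable)
N     = 400;                     % frame length in samples (assumed)

% Test frame: impulse response of the all-pole filter 1/A(z)
s = filter(1, [1, aTrue.'], [1; zeros(N-1, 1)]);

% Autocorrelation r_ss(m) for lags m = 0..p (samples outside the frame are zero)
r = zeros(p+1, 1);
for m = 0:p
    r(m+1) = sum(s(1:N-m) .* s(1+m:N));
end

Rss  = toeplitz(r(1:p));         % p-by-p symmetric Toeplitz autocorrelation matrix
aEst = -(Rss \ r(2:p+1));        % solution of Rss*a = -rss

disp([aTrue aEst]);              % true and estimated coefficients side by side
```

For this idealized frame the estimate should agree with `aTrue` to within numerical precision, because the impulse response has decayed to a negligible level by the end of the frame. On real speech the direct solve and the Levinson-Durbin recursion give the same coefficients; the recursion is preferred because it exploits the Toeplitz structure and also yields the PARCOR coefficients.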

[row, col] = size(data);
if col==1, data = data'; end            % work on a row vector

nframe = 0;
msfr = round(sr/1000*fr);               % frame increment: convert ms to samples
msfs = round(sr/1000*fs);               % frame size: convert ms to samples
duration = length(data);
speech = filter([1 -preemp], 1, data)'; % pre-emphasize speech
msoverlap = msfs - msfr;
ramp = [0:1/(msoverlap-1):1]';          % rising ramp of the trapezoidal overlap window

for frameIndex=1:msfr:duration-msfs+1   % frame rate = 20 ms
    frameData = speech(frameIndex:(frameIndex+msfs-1)); % frame size = 30 ms
    nframe = nframe+1;
    autoCor = xcorr(frameData);         % autocorrelation of the frame
    autoCorVec = autoCor(msfs+[0:L]);   % keep lags 0..L

These equations can be solved in MATLAB by using the Levinson-Durbin algorithm.

% Levinson's method
err(1) = autoCorVec(1);
k(1) = 0;
A = [];
for index=1:L
    numerator = [1 A.']*autoCorVec(index+1:-1:2);
    denominator = -1*err(index);
    k(index) = numerator/denominator;        % PARCOR coeffs
    A = [A+k(index)*flipud(A); k(index)];
    err(index+1) = (1-k(index)^2)*err(index);
end

The gain parameter of the filter can be obtained from the input-output relationship

$$s(n) = -\sum_{k=1}^{p} a_p(k)\, s(n-k) + G\,x(n)$$

where x(n) represents the input sequence.

In terms of the error sequence, this becomes

$$G\,x(n) = s(n) + \sum_{k=1}^{p} a_p(k)\, s(n-k) = e(n)$$

Then

$$G^2 \sum_{n=0}^{N-1} x^2(n) = \sum_{n=0}^{N-1} e^2(n)$$

If the input excitation is normalized to unit energy by design, then

$$G^2 = G^2 \sum_{n=0}^{N-1} x^2(n) = \sum_{n=0}^{N-1} e^2(n) = r_{ss}(0) + \sum_{k=1}^{p} a_p(k)\, r_{ss}(k)$$

where $G^2$ is set equal to the residual energy resulting from the least-squares
optimization.
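The gain relation can be verified numerically. The sketch below is an illustration under stated assumptions, not the report's code: the frame is synthetic (a hypothetical second-order all-pole filter driven by a deterministic pseudo-random sequence), the coefficients follow the convention A(z) = 1 + Σ a_p(k) z^-k, and the prediction error is taken over the zero-padded frame, which is what the autocorrelation method assumes.

```matlab
% Sketch: G^2 = r_ss(0) + sum_k a_p(k) r_ss(k) equals the residual energy.
N = 160;  p = 2;                          % frame length and model order (assumed)
n = (0:N-1).';
x = sin(n.^2);                            % deterministic stand-in for an excitation
s = filter(1, [1 -1.2 0.81], x);          % synthetic "speech" frame

% Autocorrelation for lags 0..p and solution of the normal equations
r = zeros(p+1, 1);
for m = 0:p
    r(m+1) = sum(s(1:N-m) .* s(1+m:N));
end
a  = -(toeplitz(r(1:p)) \ r(2:p+1));

G2 = r(1) + a.' * r(2:p+1);               % r_ss(0) + sum a_p(k) r_ss(k)
G  = sqrt(G2);                            % LPC gain for this frame

e       = conv([1; a], s);                % prediction error, n = 0..N+p-1
eEnergy = sum(e.^2);                      % residual energy

fprintf('G^2 = %.6f, residual energy = %.6f\n', G2, eEnergy);
```

The two printed numbers should coincide up to rounding. In the listing, the same quantity is available for free as the final prediction error of the Levinson recursion, which is why the gain is taken as `sqrt(err(L+1))`.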

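% filter response (debugging aid, disabled)
if 0
    gain=0;
    cft=0:(1/255):1;                    % normalized frequency in cycles per sample
    for index=1:L+1                     % all L+1 entries of [1; A]
        gain = gain + aCoeff(index,nframe)*exp(-1i*2*pi*cft).^(index-1);
    end
    gain = abs(1./gain);                % magnitude of 1/A at each frequency
    spec(:,nframe) = 20*log10(gain(1:128))';
    plot(20*log10(gain));
    title(nframe);
    drawnow;
end

% same response from the filter's impulse response (to check the above)
if 0
    impulseResponse = filter(1, aCoeff(:,nframe), [1 zeros(1,255)]);
    freqResp = 20*log10(abs(fft(impulseResponse)));
    plot(freqResp);
end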

Once the LPC coefficients are computed, we can determine whether the input
speech frame is voiced and, if it is, what the pitch is. We can
determine the pitch by computing the following sequence in MATLAB:

$$r_e(n) = \sum_{k=-p}^{p} r_a(k)\, r_{ss}(n-k)$$

where $r_a(k)$ is defined as

$$r_a(k) = \sum_{i=0}^{p-|k|} a_p(i)\, a_p(i+|k|), \qquad a_p(0) = 1$$

which is the autocorrelation sequence of the prediction coefficients. The pitch
is detected by finding the peak of the normalized sequence $r_e(n)/r_e(0)$ in the time
interval that corresponds to 3 to 15 ms in the 20 ms sampling frame. If the value of
this peak is at least 0.25, the frame of speech is considered voiced, with a pitch
period equal to the value $n = N_p$ at which $r_e(N_p)/r_e(0)$ is a maximum.

If the peak value is less than 0.25, the frame of speech is considered unvoiced and the
pitch is set to zero.
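The decision rule described above (search 3 to 15 ms, threshold 0.25) can be sketched as follows. This is an illustrative sketch, not the report's implementation: it computes the autocorrelation directly from a residual frame instead of through r_a and r_ss, it assumes an 8 kHz sampling rate and a 240-sample frame, and the test residual is an idealized impulse train with a hypothetical period of 64 samples.

```matlab
% Sketch: voiced/unvoiced decision from the normalized residual autocorrelation.
fsHz   = 8000;                        % sampling rate in Hz (assumed)
N      = 240;                         % frame length in samples (assumed)
period = 64;                          % pitch period of the test signal (8 ms)

e = zeros(N, 1);
e(1:period:N) = 1;                    % idealized voiced residual: impulse train

re = conv(e, flipud(e));              % autocorrelation; lag 0 sits at index N
re = re(N:end) / re(N);               % keep lags 0..N-1, normalize by r_e(0)

lagMin = round(0.003 * fsHz);         % 3 ms  -> 24 samples
lagMax = round(0.015 * fsHz);         % 15 ms -> 120 samples
[peak, idx] = max(re(lagMin+1 : lagMax+1));   % re(k+1) holds lag k

if peak >= 0.25
    pitch = lagMin + idx - 1;         % voiced: pitch period in samples
else
    pitch = 0;                        % unvoiced
end
```

Restricting the search to 3 to 15 ms keeps the detector away from the lag-0 peak and from lags that are implausible for speech. Note that the pitch code in the listing uses a different, simpler rule (the two largest autocorrelation values and a 0.01 threshold).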

errSig = filter([1 A'],1,frameData); % find excitation noise


G(nframe) = sqrt(err(L+1)); % gain
autoCorErr = xcorr(errSig); % calculate pitch & voicing information
[B,I] = sort(autoCorErr);
num = length(I);
if B(num-1) > .01*B(num)
pitch(nframe) = abs(I(num) - I(num-1));
else
pitch(nframe) = 0;
end

The values of the LPC coefficients, the pitch period, and the type of excitation are then
transmitted to the receiver. The decoder synthesizes the speech signal by passing the
proper excitation through the all-pole filter model of the vocal tract.

Typically the pitch period requires 6 bits, the gain parameter is represented with 5 bits
after its dynamic range is compressed logarithmically, and the prediction coefficients
normally require 8-10 bits each for accuracy reasons. This accuracy is very important in
LPC because a small change in the prediction coefficients can result in a large change in
the pole positions of the filter model, which can make the model unstable. This is
overcome by using the PARCOR (partial correlation) method.
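As a worked example of the bit-rate calculation, the figures quoted above can be combined per frame. This is a rough sketch under assumptions that the text does not state explicitly: one bit is reserved for the voiced/unvoiced flag, the model order is 10 and the frame increment is 20 ms (the values used in the MATLAB code).

```matlab
% Sketch: noise-free channel bit rate from the per-parameter bit allocation.
order        = 10;       % number of prediction coefficients (as in the code)
frameStepMs  = 20;       % frame increment in ms (as in proclpc)
bitsPitch    = 6;        % pitch period
bitsGain     = 5;        % log-compressed gain
bitsVoicing  = 1;        % assumption: one bit for the excitation type
bitsPerCoeff = [8 10];   % lower and upper figure quoted in the text

bitsPerFrame = bitsPitch + bitsGain + bitsVoicing + order * bitsPerCoeff;
framesPerSec = 1000 / frameStepMs;
bitRate      = bitsPerFrame * framesPerSec;      % bits per second

fprintf('%d to %d bits/frame, %d to %d bit/s\n', bitsPerFrame, bitRate);
```

Under these assumptions each frame costs 92 to 112 bits, which at 50 frames per second corresponds to 4600 to 5600 bit/s.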
Is the speech frame voiced or unvoiced?

Once the LPC coefficients are computed, we can determine whether the input speech
frame is voiced and, if so, what the pitch is.

If the speech frame is judged to be voiced, an impulse train is used to represent it,
with nonzero taps occurring every pitch period. A pitch-detection algorithm is used to
determine the correct pitch period (or frequency); the autocorrelation function is
used to estimate the pitch period. However, if the frame is unvoiced, white noise
is used to represent it and a pitch period of T = 0 is transmitted. Therefore, either
white noise or an impulse train becomes the excitation of the LPC synthesis filter.
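The choice of excitation can be illustrated for a single synthesis frame. The sketch below is not the report's `synlpc` function, which is not listed here; the frame length, pitch, gain and filter coefficients are hypothetical values, and the excitation is normalized to unit energy as in the gain derivation.

```matlab
% Sketch: build the excitation for one frame and run the synthesis filter.
frameLen = 160;                   % samples per frame (assumed: 20 ms at 8 kHz)
pitch    = 40;                    % decoded pitch period in samples; 0 = unvoiced
G        = 0.5;                   % decoded gain (illustrative value)
A        = [1 -1.2 0.81];         % decoded [1 a_p(1) ... a_p(p)] (illustrative)

if pitch > 0
    exc = zeros(frameLen, 1);
    exc(1:pitch:frameLen) = 1;    % voiced: one nonzero tap per pitch period
else
    exc = randn(frameLen, 1);     % unvoiced: white noise
end
exc = exc / sqrt(sum(exc.^2));    % normalize the excitation to unit energy

frame = filter(G, A, exc);        % synthesis filter G/A(z) for this frame
```

A complete decoder would also carry the filter state and the position of the last pitch pulse from one frame to the next, which this single-frame sketch leaves out.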

Two types of LPC vocoders were implemented in MATLAB.

The plain LPC vocoder diagram is shown below:


% LPC vocoder

function [ outspeech ] = speechcoder1( inspeech )

% Parameters:
% inspeech : wave data with sampling rate Fs
% (Fs can be changed underneath if necessary)
% Returns:
% outspeech : wave data with sampling rate Fs
% (coded and resynthesized)

if ( nargin ~= 1)
    error('argument check failed');
end;

Fs = 16000; % sampling rate in Hertz (Hz)
Order = 10; % order of the model used by LPC

% encode the speech using LPC
[aCoeff, resid, pitch, G, parcor, stream] = proclpc(inspeech, Fs, Order);

% decode/synthesize speech using LPC and impulse trains as excitation
outspeech = synlpc(aCoeff, pitch, Fs, G);

Results:

[Figure: residual plot]

[Figure: LPC gain vs. frames]

[Figure: original speech signal]

[Figure: output speech spectrum using LPC vocoder]

Voice-excited LPC vocoder (utilizing DCT for high compression rate/low bits)

The input speech signal in each frame is filtered with the estimated transfer function of
the LPC analyzer, that is, the inverse filter A(z). This filtered signal is called the
residual.

To achieve a high compression rate, the discrete cosine transform (DCT) of the residual
signal can be employed. The DCT concentrates most of the energy of the signal in the
first few coefficients. Thus one way to compress the signal is to transmit only the
coefficients that contain most of the energy.
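The truncation idea can be tried on a single frame. The sketch below is an illustration only: it builds the orthonormal DCT-II matrix explicitly so that it does not depend on a toolbox (this is the transform that MATLAB's `dct` computes), and it uses a smooth synthetic frame as a stand-in for a residual. The frame length of 480 and the 50 retained coefficients mirror the numbers in the listing; everything else is assumed.

```matlab
% Sketch: keep the first K DCT coefficients of a frame and measure kept energy.
N = 480;  K = 50;                 % frame length and number of coefficients kept
n = 0:N-1;
k = (0:N-1).';

% Orthonormal DCT-II matrix: row k is the k-th cosine basis function
C = sqrt(2/N) * cos(pi * k * (2*n + 1) / (2*N));
C(1, :) = C(1, :) / sqrt(2);      % scale the first row for orthonormality

x = exp(-n.'/100) .* cos(2*pi*n.'/80);   % smooth test frame (stand-in signal)

X    = C * x;                     % forward DCT
Xk   = [X(1:K); zeros(N-K, 1)];   % discard all but the first K coefficients
xHat = C.' * Xk;                  % inverse DCT of the truncated coefficients

kept = sum(Xk.^2) / sum(X.^2);    % fraction of the energy that survives
fprintf('kept %.1f %% of the frame energy\n', 100*kept);
```

Because the transform is orthonormal, the energy in the coefficients equals the energy in the frame, so `kept` measures directly how much of the signal the truncation preserves. How large that fraction is for a real LPC residual depends on the signal.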

function [ outspeech ] = speechcoder2( inspeech )


% Parameters:
% inspeech : wave data with sampling rate Fs
% (Fs can be changed underneath if necessary)
% Returns:
% outspeech : wave data with sampling rate Fs
% (coded and resynthesized)

if ( nargin ~= 1)
error('argument check failed');
end;
Fs = 16000; % sampling rate in Hertz (Hz)
Order = 10; % order of the model used by LPC

% encoded the speech using LPC


[aCoeff, resid, pitch, G, parcor, stream] = proclpc(inspeech, Fs, Order);

% perform a discrete cosine transform on the residual
resid = dct(resid);
[a,b] = size(resid);

% keep only the first 50 DCT coefficients of each frame; this can be done
% because most of the energy of the signal is conserved in these coefficients
resid = [ resid(1:50,:); zeros(a-50,b) ];

% quantize the data
resid = uencode(resid,4);
resid = udecode(resid,4);

% perform an inverse DCT
resid = idct(resid);

% add some noise to the signal to make it sound better
noise = [ zeros(50,b); 0.01*randn(a-50,b) ];
resid = resid + noise;

% decode/synthesize speech using LPC and the compressed residual as excitation


outspeech = synlpc2(aCoeff, resid, Fs, G);

Results:

[Figure: noise added to the signal to make it sound better]

[Figure: residual with the added noise (resid + noise)]

[Figure: original speech signal]

[Figure: reconstructed signal using voice-excited LPC vocoder]


MATLAB files :
clear all;

% Osama Saraireh
% Speech processing
% Dr. Veton Kepuska
% FIT Fall 2005
a = input('please load the speech signal as a .wav file ', 's');
Inputsoundfile = a;
[inspeech, Fs] = audioread(Inputsoundfile); % read the wave file (wavread in older MATLAB releases)
outspeech1 = speechcoder1(inspeech);        % plain LPC vocoder
outspeech2 = speechcoder2(inspeech);        % voice-excited LPC vocoder

% plot results
figure(1);
subplot(3,1,1);
plot(inspeech);
grid;
subplot(3,1,2);
plot(outspeech1);
grid;
subplot(3,1,3);
plot(outspeech2);
grid;
disp('Press any key to play the original sound file');
pause;
soundsc(inspeech, Fs);
disp('Press any key to play the LPC compressed file!');
pause;
soundsc(outspeech1, Fs);
disp('Press a key to play the voice-excited LPC compressed sound!');
pause;
soundsc(outspeech2, Fs);

function [aCoeff,resid,pitch,G,parcor,stream] = proclpc(data,sr,L,fr,fs,preemp)

% data   - The speech samples (vector).
% sr     - Sampling rate in Hz.
% L      - The order of the analysis. Defaults to 10.
% fr     - Frame time increment in ms. Defaults to 20 ms.
% fs     - Frame size in ms. Defaults to 30 ms.
% preemp - Pre-emphasis coefficient. Defaults to 0.9378.
% aCoeff - The LPC analysis results, one column [1; A] per frame.
% resid  - The LPC residual, one column per frame.
% pitch  - Calculated by finding the peak in the residual's autocorrelation
%          for each frame.
% G      - The LPC gain for each frame.
% parcor - The PARCOR coefficients.
% stream - The LPC analysis' residual or excitation signal as one long vector.
if (nargin<3), L = 10; end
if (nargin<4), fr = 20; end
if (nargin<5), fs = 30; end
if (nargin<6), preemp = .9378; end

[row col] = size(data);


if col==1 data=data'; end

nframe = 0;
msfr = round(sr/1000*fr); % Convert ms to samples
msfs = round(sr/1000*fs); % Convert ms to samples
duration = length(data);
speech = filter([1 -preemp], 1, data)'; % Preemphasize speech
msoverlap = msfs - msfr;
ramp = [0:1/(msoverlap-1):1]'; % Compute part of window

for frameIndex=1:msfr:duration-msfs+1 % frame rate=20ms


frameData = speech(frameIndex:(frameIndex+msfs-1)); % frame size=30ms
nframe = nframe+1;
autoCor = xcorr(frameData); % Compute the autocorrelation
autoCorVec = autoCor(msfs+[0:L]);

% Levinson's method
err(1) = autoCorVec(1);
k(1) = 0;
A = [];
for index=1:L
numerator = [1 A.']*autoCorVec(index+1:-1:2);
denominator = -1*err(index);
k(index) = numerator/denominator; % PARCOR coeffs
A = [A+k(index)*flipud(A); k(index)];
err(index+1) = (1-k(index)^2)*err(index);
end

aCoeff(:,nframe) = [1; A];


parcor(:,nframe) = k';

% filter response (debugging aid, disabled)
if 0
    gain=0;
    cft=0:(1/255):1;                    % normalized frequency in cycles per sample
    for index=1:L+1                     % all L+1 entries of [1; A]
        gain = gain + aCoeff(index,nframe)*exp(-1i*2*pi*cft).^(index-1);
    end
    gain = abs(1./gain);                % magnitude of 1/A at each frequency
    spec(:,nframe) = 20*log10(gain(1:128))';
    plot(20*log10(gain));
    title(nframe);
    drawnow;
end

% Calculate the filter response from the filter's impulse
% response (to check above).
if 0
impulseResponse = filter(1, aCoeff(:,nframe), [1 zeros(1,255)]);
freqResponse = 20*log10(abs(fft(impulseResponse)));
plot(freqResponse);
end

errSig = filter([1 A'],1,frameData); % find excitation noise

G(nframe) = sqrt(err(L+1)); % gain


autoCorErr = xcorr(errSig); % calculate pitch & voicing information
[B,I] = sort(autoCorErr);
num = length(I);
if B(num-1) > .01*B(num)
pitch(nframe) = abs(I(num) - I(num-1));
else
pitch(nframe) = 0;
end

% normalize the residual and overlap-add the frames
% (improves the compressed sound quality)
resid(:,nframe) = errSig/G(nframe);
if(frameIndex==1) % add residual frames using a trapezoidal window
    stream = resid(1:msfr,nframe);
else
    stream = [stream;
              overlap+resid(1:msoverlap,nframe).*ramp;
              resid(msoverlap+1:msfr,nframe)];
end
if(frameIndex+msfr+msfs-1 > duration)
    stream = [stream; resid(msfr+1:msfs,nframe)];
else
    overlap = resid(msfr+1:msfs,nframe).*flipud(ramp);
end
end
stream = filter(1, [1 -preemp], stream)';

Speech Model one

LPC Vocoder :

function [ outspeech ] = speechcoder1( inspeech )

% Parameters:
% inspeech : wave data with sampling rate Fs

% outputs:
% outspeech : wave data with sampling rate Fs
% (coded and resynthesized)

if ( nargin ~= 1)
error('argument check failed');
end;

Fs = 8000; % sampling rate in Hertz (Hz)


Order = 10; % order of the model used by LPC

% encoded the speech using LPC


[aCoeff, resid, pitch, G, parcor, stream] = proclpc(inspeech, Fs, Order);

% decode/synthesize speech using LPC and impulse-trains as excitation


outspeech = synlpc(aCoeff, pitch, Fs, G);

% Voice-excited LPC vocoder

function [ outspeech ] = speechcoder2( inspeech )


% Parameters:
% inspeech : wave data with sampling rate Fs
% (Fs can be changed underneath if necessary)

% output:
% outspeech : wave data with sampling rate Fs
% (coded and resynthesized)

if ( nargin ~= 1)
error('argument check failed');
end;

Fs = 16000; % sampling rate in Hertz (Hz)


Order = 10; % order of the model used by LPC

% encoded the speech using LPC


[aCoeff, resid, pitch, G, parcor, stream] = proclpc(inspeech, Fs, Order);

% perform a discrete cosine transform on the residual
resid = dct(resid);
[a,b] = size(resid);

% keep only the first 50 DCT coefficients of each frame; this can be done
% because most of the energy of the signal is conserved in these coefficients
resid = [ resid(1:50,:); zeros(a-50,b) ];

% quantize the data
resid = uencode(resid,4);
resid = udecode(resid,4);

% perform an inverse DCT
resid = idct(resid);

% add some noise to the signal to make it sound better
noise = [ zeros(50,b); 0.01*randn(a-50,b) ];
resid = resid + noise;

% decode/synthesize speech using LPC and the compressed residual as excitation
outspeech = synlpc2(aCoeff, resid, Fs, G);
