
EE6641 Analysis and Synthesis of Audio Signals

Lab5 : Linear Prediction Speech Processing

Student ID: 102061144
In this lab, we implement the linear prediction model that the teacher introduced in the course. The main formula of linear prediction is:

x[n] = Σ_{k=1}^{p} a_k x[n-k] + e[n]

In this formula, e[n] is called the prediction error, or the glottal source; we can easily see that the smaller the error, the better the result. The parameters a_k are the linear prediction coefficients.
Through this procedure, we could recover the shape of our vocal tract, or even its reflectance.
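The prediction formula can be sketched in a few lines of Python with NumPy. This is only a toy check with a made-up AR(2) signal (not the lab's MATLAB code): when the true coefficients are known, filtering with them recovers the white source as the residual.

```python
import numpy as np

def lp_residual(x, a):
    """Prediction error e[n] = x[n] - sum_k a_k * x[n-k]."""
    e = x.copy()
    for k in range(1, len(a) + 1):
        e[k:] -= a[k - 1] * x[:-k]
    return e

# Toy check: an AR(2) signal predicted with its true coefficients
rng = np.random.default_rng(0)
n = 1000
src = rng.standard_normal(n)      # white noise stands in for the glottal source
x = np.zeros(n)
for i in range(n):
    x[i] = src[i]
    if i >= 1:
        x[i] += 1.3 * x[i - 1]
    if i >= 2:
        x[i] -= 0.6 * x[i - 2]

e = lp_residual(x, np.array([1.3, -0.6]))
print(np.allclose(e[2:], src[2:]))   # the residual recovers the source
```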
Adjust the frame length and the order of LP. Listen to the resulting excitation
signal excitat. Find out what range of these two parameters results in successful
removal of the original vowel quality.

Our method is to fix the frame length first and observe the mean of the absolute value of the excitation. In the beginning we tried to judge by ear whether the removal was successful, but that method is too subjective. We therefore use the mean as the criterion and try to make it as small as possible. Some of the results are below:
(1) Frame length fixed at 0.016 s (pronunciation of /a/):

P     Mean of e_n
16    0.12955
30    0.1237
45    0.12428
60    0.12204
70    0.11888
80    0.11785
150   0.09912
In this case we found that if we keep increasing the linear prediction order p, the mean of the excitation error keeps decreasing too. However, if we listen to the excitation, after p reaches about 60 only a buzzing sound is left, as the teacher mentioned in the video.
(2) Frame length fixed at 0.032 s (pronunciation of /a/):

P     Mean of e_n
16    0.062
30    0.058
45    0.059
60    0.058
70    0.0569
80    0.0567
150   0.04
In this case we also find that increasing the frame length decreases the mean of the excitation. To check this, we also decreased the frame length to 0.008 s, and the result was as we expected. Therefore, to get the most successful result we just need to increase both the frame length and the linear prediction order. However, we suspect there must be a limit; we cannot keep increasing both parameters forever.
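The sweep above can be reproduced in outline. The following Python sketch uses a synthetic AR(2) signal as a stand-in for our recording (levinson and mean_abs_excitation are our own helper names, not library functions): it frames the signal, fits LP coefficients with the Levinson-Durbin recursion, and reports the mean absolute excitation for a few orders.

```python
import numpy as np

def levinson(r, p):
    """Levinson-Durbin recursion: LP coefficients a_1..a_p from autocorrelation r."""
    a = np.zeros(p)
    err = r[0]
    for i in range(p):
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err
        a[:i] = a[:i] - k * a[:i][::-1]   # update previous coefficients
        a[i] = k
        err *= (1.0 - k * k)
    return a

def mean_abs_excitation(x, sr, frame_len_s, p):
    """Frame the signal, fit LP per frame, return the mean of |e[n]| over all frames."""
    N = int(frame_len_s * sr)
    errs = []
    for start in range(0, len(x) - N + 1, N):
        frame = x[start:start + N] * np.hamming(N)
        # Biased autocorrelation estimate of the windowed frame
        r = np.array([frame[:N - k] @ frame[k:] for k in range(p + 1)])
        a = levinson(r, p)
        e = frame.copy()
        for k in range(1, p + 1):
            e[k:] -= a[k - 1] * frame[:-k]
        errs.append(np.mean(np.abs(e)))
    return float(np.mean(errs))

# Synthetic "vowel": an AR(2) process stands in for the recorded signal
rng = np.random.default_rng(0)
x = np.zeros(16000)
for i in range(2, len(x)):
    x[i] = 1.3 * x[i - 1] - 0.6 * x[i - 2] + rng.standard_normal()

results = {p: mean_abs_excitation(x, 16000, 0.016, p) for p in (1, 2, 16)}
for p, m in results.items():
    print(p, round(m, 4))
```

For this toy signal the error drops once the order reaches the true model order, mirroring the diminishing returns we saw in the tables above.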
Investigate how stationary the estimated K parameters are as they inevitably
vary across frames. Since the parameters are intimately related to the shape of
the vocal tract, do they remain more consistent if you try to sustain your vowel
quality by holding your articulators still, such as your jaw position and your
tongue position? How about your pitch? Does it help to keep Kcoeff stationary
by keeping your voice at the same pitch?

In this question, I tried several methods to check the stationarity of the K parameters. The first is to keep my voice low and at the same pitch, then calculate the mean of the whole K-parameter matrix as the comparison baseline. After getting this, we subtract the mean from the K-parameter matrix column by column (frame by frame), and calculate the average of the difference at the end. The code is below:
% How stationary are the K parameters?
Kcoeffs_mean = mean(Kcoeffs, 2);        % per-coefficient mean over all frames
Kcoeffs_diff = zeros(size(Kcoeffs));
for i = 1:size(Kcoeffs, 2)
    Kcoeffs_diff(:, i) = Kcoeffs(:, i) - Kcoeffs_mean;
end
avg_diff = mean(Kcoeffs_diff, 2);       % average difference across frames
plot(avg_diff);
hold on;
The result is below: (red line is zero)

Same pitch (low) Same pitch (high)

The next experiment raises the pitch of the /a/ pronunciation; the method is the same as what we mentioned above. The figure is below:

Increasing pitch
From these experiments we find that if we hold the pitch steady, whether high or low, the K parameters do not vary much. The most notable difference is that when we raise the pitch during the recording, the deviation looks roughly symmetric about zero compared with the previous case. We don't know how to explain this symmetry, but we believe that holding the tongue and jaw still at the same position helps keep the K parameters stationary!
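The comparison can be illustrated numerically with toy data (random stand-in values, not our recordings; the linear drift is only a crude model of a pitch glide): nearly stationary reflection coefficients give a small average absolute deviation from the per-coefficient mean, while drifting ones give a clearly larger one.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy K-parameter matrix: 10 coefficients x 40 frames, nearly stationary
Kcoeffs = 0.5 + 0.01 * rng.standard_normal((10, 40))
dev = Kcoeffs - Kcoeffs.mean(axis=1, keepdims=True)   # subtract per-coefficient mean

# The same measure on coefficients that drift across frames (toy "pitch glide")
drifting = Kcoeffs + np.linspace(0.0, 0.3, 40)        # slow per-frame drift
dev_drift = drifting - drifting.mean(axis=1, keepdims=True)

print(np.abs(dev).mean(), np.abs(dev_drift).mean())   # small vs. clearly larger
```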
Write your own code to estimate the frequency of the first three formants of
each of the vowels. Perform pairwise comparison between:
/a/ and /i/
/i/ and /u/
In this question, we wrote code to find the peaks. If we just use the findpeaks function in MATLAB, there are problems with small peaks and false peaks.
Therefore, after discussing with our classmates, we decided to group nearby peaks together and pick out the true peak of each group. The main idea of our code is to take the mean spectrum over frames as the curve whose peaks we search. First we use the MATLAB function to find the peaks; since this yields many false peaks, we have to filter them ourselves. We set a frequency distance threshold to group peaks. In the code, the flag check has two states: check equal to zero means the current peaks do not belong to the same group; otherwise, they are in the same group.
We then keep updating the peak of the current group with the maximum one, which becomes the true peak. The code is below:
% Average the LP spectra across frames and find candidate peaks
H_mean = mean(Htmp, 2);
[pks, locs] = findpeaks(H_mean);
check = 0;            % 1 while we are inside a group of nearby peaks
pks_new  = [];
locs_new = [];
for i = 1:size(pks,1)-1
    if ff(locs(i+1)) - ff(locs(i)) < 200   % frequency distance less than 200 Hz
        if check == 0                      % a new group starts here
            check = 1;
            [p_tmp, index] = max([pks(i), pks(i+1)]);
            i_tmp = i - 1 + index;
        else                               % still in the same group
            [p_tmp, index] = max([p_tmp, pks(i+1)]);
            if index == 2
                i_tmp = i + 1;
            end
        end
    else
        if check == 0                      % isolated peak: keep it directly
            pks_new  = [pks_new; pks(i)];
            locs_new = [locs_new; locs(i)];
        else                               % the group ends: keep its maximum
            check = 0;
            pks_new  = [pks_new; p_tmp];
            locs_new = [locs_new; locs(i_tmp)];
        end
    end
end

% Keep the three largest merged peaks and sort them by frequency
peaks = [pks_new, locs_new];
[~, index_sort] = sort(peaks(:,1), 'descend');
i_final = sort(index_sort(1:3));
f = ff(peaks(i_final, 2));
After finding the true peaks, we sort them by magnitude and pick the largest three as the first three formants, though this method is not robust enough. If there is a better solution, we will fix this part in the future.
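The same grouping idea can be sketched in Python with SciPy's find_peaks (merge_close_peaks is our own name; the spectrum here is synthetic, with a deliberately split peak near 700 Hz, and 200 Hz is the same threshold as in our MATLAB code):

```python
import numpy as np
from scipy.signal import find_peaks

def merge_close_peaks(spectrum, freqs, min_dist_hz=200.0):
    """Keep only the largest peak within each run of peaks closer than min_dist_hz."""
    locs, _ = find_peaks(spectrum)
    merged, group = [], [locs[0]]
    for loc in locs[1:]:
        if freqs[loc] - freqs[group[-1]] < min_dist_hz:
            group.append(loc)                              # same group of peaks
        else:
            merged.append(max(group, key=lambda l: spectrum[l]))
            group = [loc]
    merged.append(max(group, key=lambda l: spectrum[l]))   # flush the last group
    return np.array(merged)

# Synthetic LP-style spectrum: a split peak near 700 Hz plus two clean formant peaks
freqs = np.linspace(0, 4000, 512)
spectrum = (np.exp(-((freqs - 700) / 50) ** 2)
            + 0.9 * np.exp(-((freqs - 850) / 50) ** 2)    # close peak, gets merged
            + 0.8 * np.exp(-((freqs - 1200) / 50) ** 2)
            + 0.6 * np.exp(-((freqs - 2500) / 50) ** 2))
locs = merge_close_peaks(spectrum, freqs)
top3 = np.sort(locs[np.argsort(spectrum[locs])[-3:]])      # three largest, by frequency
print(np.round(freqs[top3]))
```

The split peak pair at 700/850 Hz collapses into one formant candidate, leaving three peaks near 700, 1200, and 2500 Hz.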
The following figures show the averaged spectra of the pronunciations /a/, /i/, and /u/; in these figures the detected peaks are almost all the true peaks.

/a/ /i/

After getting this result, we also compare with the reference formant frequencies we found for the pronunciations /a/, /i/, and /u/. We can easily see that our /a/ and /i/ are close to the reference frequencies. However, our /u/ is a little different from the reference; we think this is due to person-to-person variation, and if we look at the overall spectral trend, it is still similar to the reference, too.
Then we use the peaks we found to calculate f2 - f1 vs. f1 and f3 - f1 vs. f1, to see the difference between the vowels. The reason we also look at f3 is that f1 and f2 are usually close together; we want to see, when f1 and f2 of two different vocal tracts are close, what the f3 of those two vocal tracts looks like.
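The quantities we plot are simple differences of the per-frame formant estimates. A minimal sketch (the formant values here are made up for illustration, not measured from our recordings):

```python
import numpy as np

# Hypothetical per-frame formant estimates (Hz) for one vowel; in our report
# these come from the peak-picking code above
formants = np.array([[700, 1220, 2600],
                     [690, 1180, 2550],
                     [710, 1250, 2620]])   # rows: frames; cols: F1, F2, F3
f1  = formants[:, 0]
d21 = formants[:, 1] - f1     # F2 - F1, plotted against F1
d31 = formants[:, 2] - f1     # F3 - F1, plotted against F1
print(np.column_stack([f1, d21, d31]))
```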
/a/ and /i/

/a/ f2 - f1 v.s. f1 /i/ f2 - f1 v.s. f1

/a/ f3 - f1 v.s. f1 /i/ f3 - f1 v.s. f1

/i/ and /u/

/i/ f2 - f1 v.s. f1 /u/ f2 - f1 v.s. f1

/i/ f3 - f1 v.s. f1 /u/ f3 - f1 v.s. f1


After getting these results, we can compare with the figure in the teacher's ppt from the course, shown below:

Though the result is not exactly the same, the positions are relatively consistent. If we plot them into one figure, the result is below:
We think this result is very similar to the figure above; the positions are relatively correct.
In this homework, I think we have learned how to compute the linear prediction and how to obtain its parameters. However, we are still a little curious about the implementation, and we hope we can implement it successfully in our final project. What we learned most from this homework is how to analyze the results and the meaning of each parameter!
If we can combine it with other emotional effects in our final project, we think the result will be fun and useful!