Algorithms in Signal Processors
Audio and Video Applications
DSP Project Course 2011
using Texas Instruments TMS320C6713 DSK and TMS320DM6437
Dept. of Electrical and Information Technology, Lund University, Sweden
Contents
I Guitar tuner
P V Soumya, A Norrgren, C-J Waldeck, F Brosjö
1 Introduction
2 Theory
2.1 Harmonics
2.2 Algorithms
2.2.1 Fourier Transform
2.2.2 Cross Correlation
3 Method
3.1 Analysis of the guitar sound
3.2 Implementation
3.2.1 Echo
3.2.2 Detection
3.2.3 Process
3.2.4 DFT
3.2.5 SmallXcorr
3.2.6 Matlab GUI
4 Results and Discussion
4.1 Matlab testing
4.2 Implementation
4.3 Post-testing
5 Conclusion and further development
II Pitch Estimation
Jonas Rosenqvist, Kim Smidje, Henrik Nilsson, Johan Mattsson
1 Introduction
2 Theory
3 Methods
3.1 Time domain
3.2 Frequency domain
4 Implementation
5 Problems encountered
6 Conclusion
7 References
III Vocoder
Mattias Danielsson, Andre Ericsson, Kujtim Iljazi, Babak Rajabian
1 Introduction
2 Theory
2.1 Overall description of our vocoder model
2.2 The highpass filter
2.3 The autocorrelation function
2.4 The Levinson-Durbin recursion
2.5 The IIR lattice filter
3 Implementation
4 Testing and debugging
5 Results and conclusions
IV Reverberation
R. Tullberg, R. Mittipalli, S. Abdu-Rahman, T. Isacsson
1 Introduction
1.1 Reverberation
2 Theory
2.1 Reverb Algorithm
2.2 Reverberation time
2.3 Delay elements z^{-m_i}
2.4 Damping Filters h_i(z)
2.5 Diffusion Matrix A
2.6 Gains b_i and c_i
2.7 Tonal Correction Filter t(z)
3 Implementation
3.1 Realtime versus non-realtime implementation
3.2 Diffusion Matrix
3.3 Software Optimizations
3.3.1 Matrix Multiplication
3.3.2 Circular Buffers
3.4 Compiler Optimization
3.5 Hardware Memory Considerations
4 Result
4.1 Experimental setup
4.2 Results
5 Discussion and Conclusion
5.1 General
5.2 Non-realtime versus realtime implementation
5.3 Improvements
V Speech Recognition Using MFCC
Harshavardhan Kittur, Kaoushik Raj Ramamoorthy, Manivannan Ethiraj, Mohan Raj Gopal
1 Introduction
1.1 Why Speech recognition?
1.2 Common problems found in designing such a system
1.3 Tools Used
2 Theory
2.1 Speech Recognition Algorithm
2.1.1 Feature Extraction
2.1.2 Feature Matching
2.2 Mel Frequency Cepstrum Coefficients
3 Implementation
3.1 Level detection
3.2 Frame blocking
3.3 Windowing
3.4 Fast Fourier transform
3.5 Power spectrum calculation
3.6 Mel-frequency wrapping
3.7 Log-energy spectrum
3.8 Mel-frequency cepstral coefficients
3.9 Comparison in the feature matching phase
4 Implementation in MATLAB
5 Implementation in DSP Board
6 Tests and Results
7 Conclusion
VI Face Detection, Tracking and Recognition
Asheesh Mishra, Mohammed Ibraheem, Shashikant Patil
1 Abstract
2 Introduction
3 YCbCr Color Space Model
4 Image Filtering for Noise reduction
5 Edge Detection
6 Face Detection and Tracking
7 Face Recognition
8 Problems faced
9 Conclusion and Future work
VII Circular Object Detection
Ajosh K Jose, Qazi Omar Farooq, Sherine Thomas, Sreejith P Raghavan
1 Introduction
2 Theory
2.1 Edge Detection
2.1.1 Smoothing
2.1.2 Finding gradients
2.1.3 Non-maximum suppression
2.1.4 Double thresholding
2.1.5 Edge tracking by hysteresis
2.2 Circular Object Detection
3 Implementation
4 Conclusion & Future Work
Part I
Guitar tuner
P V Soumya, A Norrgren, C-J Waldeck, F Brosjö
Abstract
This report covers the development of a guitar tuner based on the Texas Instruments TMS320C6713 DSK signal processing board. First the theory about guitar strings and their harmonic patterns is covered, along with a description of the different mathematical algorithms used to tune them. This is followed by the analysis of the guitar sound and the method used to implement the tuner on the DSP board. A great deal of the report deals with problems associated with the memory of the board, together with the solutions developed to work around them. A working guitar tuner was then made. Suggestions for possible improvements are presented in the final section, which concludes this project.
1 Introduction
When thinking of pitch estimation one of the project group members came
to think of a problem he experiences in tuning his guitars. When a string
on some guitars is tuned, the other strings change pitch as well due to the
higher stress on the guitar caused by the tension. This makes tuning a
tough and time consuming procedure, since the strings have to be tuned
separately and many times to achieve a stable pitch for all of them. A way
to address this issue is to have a tuner that allows the user to overlook all
strings at the same time; thereby being able to correct changes to the other
strings instantaneously. When searching for this type of tuner only one
was found on the commercial market, a TC Electronic polytune [4]. This
was our source of inspiration for this project in pitch estimation. The goal
of this project is first to be able to determine the pitch of a single string
and eventually expanding it to being able to estimate the pitch for multiple
strings simultaneously and presenting the result to the user.
2 Theory
2.1 Harmonics
A note played on a string instrument consists of a fundamental frequency, called the pitch, and a number of harmonics. The frequencies of these harmonics are multiples of the pitch and are the same for every instrument tuned to the same pitch. What differentiates the instruments and gives them their unique sound is the amplitude pattern of these harmonics. These patterns depend on many factors, such as the length, thickness and material of the string. A guitar has six strings tuned to different pitches. Table 1 shows all strings with their corresponding pitches and first two harmonics.
String nbr   Note   f_0 (Hz)   2f_0 (Hz)   3f_0 (Hz)
6            E2      82.407    164.814     247.221
5            A2     110.000    220.000     330.000
4            D3     146.830    293.660     440.490
3            G3     196.000    392.000     588.000
2            B3     246.940    493.880     740.820
1            E4     329.630    659.260     988.890

Table 1: Guitar string frequencies
The pattern of the harmonic amplitudes is not the same for different strings and guitars, which gives each guitar its specific sound. Figure 1 shows the frequency pattern of the E4 string on a Hagström Viking.
Figure 1: Spectrum of E4
In the chromatic scale, which is used globally, an octave is divided into 12 pitches that are one semitone apart. Between each semitone there are 100 equally spaced increments called cents [2]. This unit is commonly used to measure the accuracy of instrument tuners. A good-quality commercial tuner usually has an accuracy of between ±1 and ±3 cents.
2.2 Algorithms
2.2.1 Fourier Transform
The discrete Fourier transform maps a signal between the time and frequency domains. A portion of a signal in the time domain is analysed and the frequency components are extracted together with their corresponding amplitudes. This is done using equation 1, which has a linear frequency scale:
X_k = \sum_{n=0}^{N-1} x_n e^{-j 2\pi nk/N}, \quad k = 0, 1, ..., N-1   (1)
In this case the frequency scale of octaves is logarithmic: an increase of one octave corresponds to a doubling of the frequency. On a linear frequency scale the number of frequency increments per octave therefore grows with frequency, which makes the resolution better for higher octaves than for lower ones. To obtain uniform accuracy over the entire frequency spectrum, the frequency scale is made logarithmic. This is done by replacing k in equation 1 with (2):
k = f_0 \cdot B^{i/N}, \quad i = 0, 1, ..., N-1   (2)

where B is an arbitrary base and f_0 is the starting frequency. The base, together with the number of points, determines the span of the scale.
The information about the DFT and the logarithmic frequency scale was found in Martin Stridh's doctoral thesis, Signal Characterization of Atrial Arrhythmias using the Surface ECG [3].
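As a minimal sketch of equation 2, the logarithmically spaced analysis frequencies can be precomputed once. The function name is illustrative; the parameter values are those used later in section 4.1 (f_0 = 72.1 Hz, B = 15, N = 1500):

```c
#include <math.h>

/* Build the logarithmic frequency scale of equation 2 (sketch).
   f0: starting frequency, B: base, N: number of points. */
void build_log_freqs(float f0, float B, int N, float freq[])
{
    for (int i = 0; i < N; i++)
        freq[i] = f0 * powf(B, (float)i / N);   /* k_i = f0 * B^(i/N) */
}
```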
2.2.2 Cross Correlation
A cross correlation is used to find similarities between two discrete signals. The signal components at each index are multiplied with each other and summed up. This is done repeatedly while one of the signals is shifted in relation to the other. The cross correlation of the signals x and y is given by equation 3:
the signals x and y is given by equation 3.
r
xy
(n) =
¸
l
x(l) · y(l −n) (3)
Provided that the guitar is roughly in tune, the reference spectrum should have a high correlation near the centre index. To save time and calculations there is no need to do a full correlation, so it was limited to 20 steps around the centre index. The correlation results in an array where the index with the highest value represents the best match between the signals.
The reference spectrum is composed of only three ones, representing the fundamental and the first two harmonics. The correlation can then be sped up by removing all the multiplications and most of the additions, so that all that is left is the sum of the three values in the spectrum where the reference is one. If the spectrum is of length 1500, an ordinary scalar product requires 1500 multiplications and 1499 additions. The improved correlation requires only two additions and no multiplications, saving 2997 operations for every scalar product. Roughly estimated, this saves a total of 700 000 operations for all six strings.
Figure 2: Correlation of the spectra in figure 1 (correlation coefficient versus shift)
3 Method
3.1 Analysis of the guitar sound
The project began with recordings of the guitar strings from a Hagström Viking semi-acoustic electric guitar. The sounds were then analysed using Matlab's built-in FFT function to get an idea of what the frequency spectrum would look like. The spectrum was found to vary a lot between different strings and pitches, and it was quickly concluded that the frequency positions of the harmonic peaks were of higher significance than their amplitudes. Since the goal is to tune multiple strings simultaneously, a normal autocorrelation, or even a cepstrum, that might have been used for one string, could not be used. Instead another method was tested.
By cross correlating the frequency spectrum of all strings sounding together with the spectrum of each individual string in turn, a separate correlation for each string could be obtained. This method was first tested using recorded sounds, which resulted in a very messy graph. The amplitude differences between the harmonics of the reference sounds made it hard to get a clear result. To get rid of this issue, the references were constructed rather than recorded, making them quantised and noise free. In this way we could get the exact frequencies for the pitch and harmonics of each string, and the correlation was done against the frequency pattern rather than the amplitudes of the harmonics.
Since the frequency distance between the pitch and the harmonics changes when the string is out of tune, a linear frequency scale would be hard to use: the correlation would yield multiple peaks depending on whether the pitch or one of the harmonics matched perfectly, as shown in figure 3 A. This would compromise the accuracy, since it would be better if both the pitch and the harmonics matched at the same time, as in figure 3 B.

Figure 3: A Linear correlation B Logarithmic correlation

To obtain an adequate accuracy, a specialised Fourier transform with a logarithmic frequency scale was needed. This solves the problem because, when the cross correlation is made, the pitch and the harmonics match at the same time.
The aim was to construct a tuner with relatively high accuracy, and with the limited memory an accuracy of ±3 cents was chosen. This affected the choice of resolution, and thereby the base of the logarithmic frequency scale. It corresponds to an acceptable error of ±1 step in the correlation.
3.2 Implementation
The program was constructed from a number of different functions: a main function, where the necessary parameters and arrays are initialized and constructed, and a set of interrupt-driven processing functions. Since the DSP uses software interrupts, no loop is needed to run the program. A software interrupt called echo is activated when the input buffer is full. It calls the detection function, which registers the input amplitude of the signal and triggers a software interrupt if the signal exceeds a threshold level. That interrupt runs the process function, which calls the different functions needed to process the signal. Further explanation of these functions follows.
Figure 4: Flow chart over the algorithm
3.2.1 Echo
Echo is called by interrupt when the input buffer is full; it collects the values from the input buffer and passes the buffer on to the detection function. When the DSP is starting up it has a lot of random values on the input that must be ignored, so a counter is used to skip the first 128 calls to detection. The counter is also reset when the data has been processed, to ignore the tail of a signal and prevent misreadings.
3.2.2 Detection
The detection function first calculates the mean power of the input. If the value is higher than a predefined threshold, the detection function goes into a buffering mode where it stores every packet until the sample buffer is filled. When the buffer is full, the buffering mode is deactivated, the process flag is set and the process interrupt is triggered.
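A minimal sketch of this gating logic is given below; the packet and buffer sizes, the threshold value and the function names are assumptions for illustration, not the project's actual code:

```c
#define PKT       128      /* samples per input packet (assumed)   */
#define SAMPLES   4096     /* total sample buffer length (assumed) */
#define THRESHOLD 1.0e4f   /* mean-power trigger level (assumed)   */

static float sample_buf[SAMPLES];
static int   fill      = 0;
static int   buffering = 0;

/* Returns 1 when the sample buffer is full and processing may start. */
int detection(const float in[PKT])
{
    if (!buffering) {
        float power = 0.0f;
        for (int n = 0; n < PKT; n++)
            power += in[n] * in[n];
        if (power / PKT > THRESHOLD)
            buffering = 1;                /* signal detected */
        else
            return 0;
    }
    for (int n = 0; n < PKT && fill < SAMPLES; n++)
        sample_buf[fill++] = in[n];       /* buffering mode */
    if (fill == SAMPLES) {
        buffering = 0;                    /* deactivate buffering mode */
        fill = 0;
        return 1;                         /* set the process flag */
    }
    return 0;
}
```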
3.2.3 Process
Process gathers all the functions needed to process the signal. The first step is to perform a DFT, which is covered in the section below. The result is then correlated with the reference array for each individual string. The returned values represent the indices of the maximum correlation and the corresponding value in relation to the maximum possible correlation value.
3.2.4 DFT
The DFT is based on the normal Fourier transform summation using a double for-loop. The frequency array used was explained in the theory section, and this is the only difference from a normal DFT. Since the DSP compiler does not support complex numbers, the summation had to be done in two separate variables, one for the real and one for the imaginary part of the complex result. The magnitude is normalized to reduce the risk of overflow and then stored in the output array.
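The summation can be sketched as below, with the real and imaginary parts accumulated in separate variables; the names and the sample rate constant are illustrative assumptions:

```c
#include <math.h>

#define FS 8000.0f   /* sample rate in Hz (assumed for illustration) */
#define PI 3.14159265358979f

/* DFT on a logarithmic frequency grid (sketch).  freq[] holds the
   analysis frequencies in Hz from equation 2, x[] the N input
   samples; mag[] receives the normalized magnitudes. */
void log_dft(const float x[], int N, const float freq[], int bins, float mag[])
{
    for (int i = 0; i < bins; i++) {
        float re = 0.0f, im = 0.0f;
        for (int n = 0; n < N; n++) {
            float w = 2.0f * PI * freq[i] * (float)n / FS;
            re += x[n] * cosf(w);        /* real part of the sum      */
            im -= x[n] * sinf(w);        /* imaginary part, e^{-jwn}  */
        }
        mag[i] = sqrtf(re * re + im * im) / (float)N;  /* normalized */
    }
}
```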
3.2.5 SmallXcorr
The small cross correlation function only calculates a small part of a normal correlation. It was chosen to shift only 20 steps to the left and to the right around the centre element, in other words 41 steps in total. The small correlation is done using a for-loop running from -20 to 20 which sums the elements where the reference frequencies are, and this is done for the references of all six strings. A second loop then goes through the resulting correlation arrays to find the index of the maximum value.
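A sketch of this reduced correlation for one string is shown below; the array names and the bounds checking are assumptions, but the 41-lag loop and the three-term sums follow the description above:

```c
#define SHIFTS 20   /* +/-20 lags around the centre, 41 in total */

/* Sparse cross correlation (sketch).  ref[3] holds the spectrum bin
   indices of the fundamental and first two harmonics of one string.
   Returns the lag with the maximum correlation; 0 means in tune. */
int small_xcorr(const float spectrum[], int len, const int ref[3])
{
    int   best_lag = 0;
    float best_val = -1.0f;
    for (int s = -SHIFTS; s <= SHIFTS; s++) {
        float sum = 0.0f;
        for (int h = 0; h < 3; h++) {
            int idx = ref[h] + s;
            if (idx >= 0 && idx < len)
                sum += spectrum[idx];   /* additions only, no multiplies */
        }
        if (sum > best_val) { best_val = sum; best_lag = s; }
    }
    return best_lag;
}
```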
3.2.6 Matlab GUI
A graphical user interface was made using Matlab's GUIDE tool. The program consists of a table, two buttons and a timer. To access the DSP a built-in function called ccsdsp was used, which makes it possible to load and run the project on the board from within Matlab. The RUN button uploads the program to the DSP and runs it, and the STOP button stops the DSP and closes the program. The table is updated every second using a timer interrupt.
Figure 5: Graphical User Interface
4 Results and Discussion
4.1 Matlab testing
The first thing we did was to record the sound from all strings and take a DFT to get an idea of what the spectra would look like. The result, using Matlab's built-in FFT function, gave us the spectrum shown in figure 6 A. As can be seen there are a lot of different peaks with varying amplitude, and it is difficult to distinguish between the fundamentals and the harmonics. In a spectrum for one string, as visible in figure 6 B, this difference is much clearer. The work continued by implementing a DFT algorithm in Matlab to ensure its functionality. The initial results were good; however it was soon realized, as mentioned earlier, that the distance between the frequency increments had to be logarithmic to yield appropriate accuracy at lower frequencies. The implementation of this was fairly simple in Matlab and did not generate any problems out of the ordinary. The function was tested against Matlab's built-in FFT function and resulted in very similar data. Some slight variations were found, but these could very well be attributed to round-off errors.
Figure 6: A Spectrum of all strings B Spectrum of D3
The correlation algorithm was also implemented and tested against Matlab's correlation function. There were some minor differences, most probably caused by round-off errors, since the comparison was made against values displayed with Matlab's format short, which shows only four decimals, rather than the full-length floats used in C. Since the correlation algorithm handles a discrete frequency array, the best correlation can sometimes fall between two indices and thereby result in a double peak in the correlated data. This could be avoided by using a higher frequency resolution; however, the amount of memory and the time needed for the additional calculations set a limit to this resolution. It was also important to get the frequency points in the array as close as possible to the known frequencies used as reference values. Otherwise there would be an error because of the displacement from the correct value, and the tuner would always have an offset. The frequency array, calculated using Stridh's formula described in the theory section, had an initial frequency of 72.1 Hz, 1500 points and a base of 15.
4.2 Implementation
The implementation on the DSP board was straight forward and did not
generate so many problems at first. As mentioned above, the DSP com-
piler did not support the complex.h package, but a suitable work around to
this problem has already been covered. After implementing the necessary
functions without any major issues, the program was tested and resulted in
confusion. The values were not at all consistent with the expected values
generated in Matlab. After many hours spent on error correction it was
found that the memory was over written in some way and replaced the re-
sult values with memory addresses. This turned out to be because of lacking
P V Soumya, A Norrgren, C-J Waldeck, F Brosj¨o 11
internal memory that we were expected to have, and many arrays had to
be moved to the external memory. The memory configuration of the board
was fairly hard to understand, especially the amount of memory that was
available in the different memory banks.
After moving almost all of the arrays to the external memory, the functions started to produce correct results, but very slowly. The time for one cycle was too long: between 10 and 20 seconds. To decrease this time the algorithms were analysed repeatedly, looking for means of improvement. After much testing a new correlation algorithm was created that only used the points of interest in the reference arrays rather than correlating every point. This drastically reduced the processing time to a few seconds, which is still a bit too long but acceptable.
4.3 Post-testing
The system was now complete and some post-testing was done. The Hagström guitar was used to test the tuner, and tuning one string worked well. This is shown in figure 7, where the 5th and 6th strings are tuned individually. Prior to the test the strings had been tuned with a commercial TC Electronic PolyTune tuner [4]. As seen, the strings get the tuning value 1, which implies that their pitch is slightly high. This is due to the low resolution of the tuner, and the guitar should be viewed as in tune for values between -1 and 1.
When tuning all strings the first result is most often wrong, most likely as a result of sampling too early, while the strings are still unstable after the strum, see figure 8 A. The second sampling of the same strum usually has a more accurate shift, though the accuracy of the tuning, based on the amplitude of the correlation, is lower, as visible in figure 8 B. The cause could also be that the different strings have different amplitudes and different sustain, causing the spectrum to be uneven. This is clearly something to continue working on; the timing of the tuning has to be perfected. The data presented after a tuning was also more unstable the more strings that were included. The most likely cause is the enormous amount of frequency data, which results in correlation matches in other places than intended. This could easily be improved with more memory and processing power, which would allow longer reference arrays with more harmonics.
Figure 7: A Sixth string E2, B Fifth string A2
Figure 8: A All strings 1, B All strings 2
5 Conclusion and further development
As it is now, the tuner works quite well for tuning one string at a time, but becomes much less accurate when more strings are tuned. This can be due to strong correlation matches at more than one point in the spectrum, because of the large number of fundamentals and harmonics in the sampled signal. Another reason may be that the sound level from some of the strings has dropped in amplitude before the sampling has started, lowering the probability that the right string is tuned. This might be counteracted by increasing the frequency resolution of the tuner and the number of harmonics in the reference arrays.
One way to suppress some noise might be to use windowing functions; however, we discovered that a Hamming window did not noticeably improve the frequency spectrum. More testing with windowing functions should be able to clean up the spectra by narrowing the peaks, making the correlation more accurate.
Since we have limited memory and computing power, we had to limit the number of frequency points to 1500 to be able to get a result in a reasonable time. This yields an uncertainty at the higher frequencies, where the step size becomes too large. In further development this can be helped by using more frequency points and also by reducing the base, in order to bring the points closer to each other. This would improve the accuracy over the entire spectrum, not just at the higher frequencies.
A great deal of time could be spent on optimization for faster performance, allowing longer arrays and higher resolution as well as quicker results. As of now the result is presented after a few seconds, which is a bit too slow in our opinion. A goal for further development would be to reduce this time to under a second, to make real time tuning bearable.
References
[1] Vaughn Aubuchon, Vaughn's Music Note Frequency Chart, http://www.vaughns-1-pagers.com/music/musical-note-frequencies.htm (2011-02-28).
[2] Hyperphysics, http://hyperphysics.phy-astr.gsu.edu/hbase/music/cents.html (2011-03-03).
[3] Martin Stridh, Signal Characterization of Atrial Arrhythmias using the Surface ECG, Vol. 33, ISSN 1402-8662, 2003.
[4] TC Electronic, http://www.tcelectronic.com/polytune.asp (2011-03-03).
Part II
Pitch Estimation
Jonas Rosenqvist, Kim Smidje, Henrik Nilsson, Johan Mattsson
Abstract
This report covers the implementation of a digital signal processing algorithm for the TMS320C6713 by Texas Instruments. The algorithm is written in C using the Code Composer Studio IDE and aims to determine the dominant frequency of a given input audio signal. This is achieved by applying the cepstrum transform, an extension of the discrete Fourier transform involving additional manipulation of each sample in the frequency domain. This is followed by an inverse Fourier transform, which yields a signal in what is known as the quefrency domain, in which one searches for the highest amplitude while disregarding certain intervals. From this the dominant frequency can be extracted, and the closest pure tone, as well as the distance to it, is presented to the user.
1 Introduction
Pitch estimation is, just as the name says, an estimation of the pitch. There are several techniques that can be used, with different advantages and disadvantages, but common to all of them is that they require a lot of computations. In this report we focus on the cepstrum algorithm, which is built from a collection of mathematical tools. The cepstrum algorithm has the advantage of being faster than, for example, autocorrelation, which makes it possible to compute frequencies at a higher sample rate. The reason why the cepstrum is fast is that it is computed using only the fast Fourier transform, its inverse, and the absolute value and logarithm functions, all of which have a fairly low time complexity. Our working process was to first solve the problem in a familiar environment, namely Matlab, and after that go deeper into Code Composer.
2 Theory
The Fourier transform transforms a signal in the time domain, i.e. the amplitude as a function of time, into a signal in the frequency domain, i.e. the amplitude as a function of frequency. This is visualised in Figure 1 and Figure 2. In the top plot of Figure 1, two sinusoidal signals with different phase, amplitude and frequency are shown, and the lower plot is the sum of these signals. The result of applying the Fourier transform to the sum of the signals is displayed in Figure 2. One sees here that the value of the Fourier transform for a given value of the variable f corresponds to the amplitude of the sinusoidal component with that frequency in the original signal. Since the original signal is the sum of only two pure sinusoids with different amplitudes, the Fourier transform consists of only two peaks, each representing the amplitude of one of the sinusoids. For all other values of f, F(f) is zero because those frequencies are absent, or to put it differently, have an amplitude of zero in the original signal.
The cepstrum, whose name comes from reversing the first four letters of the word "spectrum", is a method which separates the frequencies in a tone or sound in order to determine which of them is the fundamental frequency. The cepstrum method uses several other operations in order to reach its goal, and its mathematical representation is:

F^{-1}(\log_{10}(|F(x)|))

That is, the signal is transformed with the discrete Fourier transform, the absolute value of that result is taken and converted to the logarithmic scale, and lastly the samples are transformed back into the time domain.
Figure 1: Top: Two sinusoidal signals with different amplitudes and phases.
Bottom: The sum of the two signals
The reason for using both the absolute value and the logarithm is to emphasize lower frequencies, to make sure that the dominant peak comes from the fundamental frequency and not from one of the overtones. The independent variable of the result is called quefrency, which is measured in seconds, though not in the sense of a signal in the time domain. Because a convolution in the time domain becomes a multiplication in the frequency domain, and the logarithm turns that multiplication into a sum, the contributing signals become additive, which is an important property of cepstra. The quefrency-domain result will therefore be the sum of the contributions of all the recorded signals. After the cepstrum has been calculated, the highest peak in the window corresponds to the sought frequency. For instance, if a peak in the cepstrum diagram appears at index X (dimensionless), this corresponds to the frequency obtained by dividing the sample rate (measured in Hz) by X. The peaks in the cepstrum occur as a result of the periodicity of the signal or sound [1].
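The pipeline can be illustrated with the sketch below. It uses a naive O(N^2) transform for self-containment (a real implementation would use an FFT), assumes N <= 512, and exploits the fact that the log-magnitude spectrum is real and even, so the inverse transform reduces to a cosine sum:

```c
#include <math.h>

#define PI 3.14159265358979f

/* Cepstrum pitch estimate (illustrative sketch).  x[]: N samples
   (N <= 512), fs: sample rate in Hz, cutoff: number of low-quefrency
   bins to discard to avoid false peaks.  Returns the pitch in Hz. */
float cepstrum_pitch(const float x[], int N, float fs, int cutoff)
{
    static float logmag[512];
    for (int k = 0; k < N; k++) {              /* log10 |DFT(x)| */
        float re = 0.0f, im = 0.0f;
        for (int n = 0; n < N; n++) {
            float w = 2.0f * PI * (float)k * (float)n / (float)N;
            re += x[n] * cosf(w);
            im -= x[n] * sinf(w);
        }
        logmag[k] = log10f(sqrtf(re * re + im * im) + 1e-12f);
    }
    int   best     = cutoff;                   /* peak search in the */
    float best_val = -1.0e30f;                 /* quefrency domain   */
    for (int q = cutoff; q < N / 2; q++) {
        float c = 0.0f;
        for (int k = 0; k < N; k++)
            c += logmag[k] * cosf(2.0f * PI * (float)k * (float)q / (float)N);
        if (c > best_val) { best_val = c; best = q; }
    }
    return fs / (float)best;                   /* index -> frequency */
}
```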
Figure 2: The power spectrum of the two sinusoidal signals
3 Methods
There are two types of algorithms that can be used to estimate frequencies: algorithms in the time domain and algorithms in the frequency domain.
3.1 Time domain
One very simple approach in the time domain is to look at the zero crossings of the signal, that is, where the signal goes from a positive value to a negative value or the other way around. In one period the signal crosses zero twice, so by measuring the times at which this happens a rough estimate of the pitch can be calculated. This approach is not very robust, since for signals consisting of multiple sinusoids with different periods the result will not be close to the real frequency. Other time domain methods, such as autocorrelation, take a different approach. As the name says, autocorrelation finds the correlation of an input signal with a lagged copy of itself; by computing this correlation one lag at a time we can build another graph, which hopefully will look like a sinusoid. The main problem with autocorrelation is at higher frequencies, where the number of additions becomes overwhelming. This is the reason why autocorrelation is mostly used in the low to mid frequency range [1].
autocorrelation[k] = \frac{1}{N-k} \sum_{n=k}^{N} signal[n] \cdot signal[n-k]
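A direct sketch of this formula for 0-based arrays (summing n = k to N-1):

```c
/* Normalized autocorrelation estimate (sketch of the formula above). */
void autocorr(const float sig[], int N, int maxlag, float r[])
{
    for (int k = 0; k <= maxlag; k++) {
        float sum = 0.0f;
        for (int n = k; n < N; n++)
            sum += sig[n] * sig[n - k];
        r[k] = sum / (float)(N - k);
    }
}
```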
3.2 Frequency domain
Frequency domain methods take an input in the time domain and compute its frequency spectrum. The input signal's spectrum covers the whole frequency range, but the dominant frequency will have the highest peak. The main advantage of frequency domain methods is the use of the fast Fourier transform, which makes the computations fast and reliable. There are a number of different algorithms that operate in the frequency domain, for example the kepstrum, cepstrum, power cepstrum and maximum likelihood methods. To be able to compute the estimated pitch, the input signal needs to be divided into smaller parts, and the estimate is computed for each part. The disadvantage of dividing the input signal is the loss of resolution, since the estimated frequency depends on the sampling rate and the length of the divided input signal. It is still possible to get a good resolution: if, for example, the sampling rate is 8000 Hz and the length of the divided input signal is 8000 samples, then every integer frequency can be represented [1]. The estimated frequency is obtained as:

\text{estimated frequency} = \frac{\text{sample rate}}{\text{index of maximum value}}
4 Implementation
In order to evaluate the algorithm's ability to correctly detect the dominant frequency, we decided to first implement it in Matlab. In addition to the group being more experienced with Matlab than with the C language, it allows for much faster implementation thanks to the high-level development environment, with many of the crucial algorithms, such as the fast Fourier transform, already implemented. It was also at this stage that we estimated the appropriate cut-off level in the quefrency domain, as well as a suitable signal sample size, by experimenting with various audio signals. By not discarding enough initial values in the quefrency domain one runs the risk of finding a false dominant frequency. On the other hand, if too many values are ignored, one might miss the true dominant frequency. The sample size must cover a large enough time frame to detect the lowest possible frequency, and at the same time not be too big with regard to the limited memory and the realtime requirements. Using a sample rate of 32000 samples/second and a vector of length 512, corresponding to a window of 16 ms, we found that we got acceptable results while still being able to detect frequencies as low as approximately 240 Hz. We loaded an audio signal generated by a horn, which had the frequency 123.47 Hz. That frequency corresponds to a B2 note, which means the note B in the second octave.

Figure 3: Power spectrum from the B2 horn

When running our Matlab pitch detection program, we get two plots: the first represents the power spectrum of the B2 horn and the second shows its cepstrum. Given the cepstrum, the estimated frequency can be calculated to approximately 126 Hz, which can be considered a reasonable deviation from the correct answer. Since we aimed our range at the fourth octave, we did the same with a C4 tone, which gave the plot shown below. The peak is located at index 124, which gives a frequency of circa 258 Hz.
The implementation in Code Composer was very similar to what we did in Matlab. Through the DSP library the FFT and IFFT algorithms were made available, and other functions such as log and abs were introduced into the program with the math.h library. To increase the accuracy of the program we chose to keep the 10 latest values and only present their median, to remove any potential outliers. After some trial and error we chose a cut-off point at 20 samples, as that gave the right results in our chosen octave.
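The median smoothing can be sketched as follows; the ring buffer and the insertion sort are illustrative choices, not necessarily the project's code:

```c
#include <string.h>

#define HIST 10   /* number of recent estimates to keep */

static float hist[HIST];
static int   pos = 0;

/* Store a new estimate and return the median of the last 10, so a
   single outlier cannot disturb the presented pitch. */
float median_update(float estimate)
{
    float tmp[HIST];
    hist[pos] = estimate;
    pos = (pos + 1) % HIST;
    memcpy(tmp, hist, sizeof tmp);
    for (int i = 1; i < HIST; i++) {     /* insertion sort, fine for 10 */
        float v = tmp[i];
        int j = i - 1;
        while (j >= 0 && tmp[j] > v) { tmp[j + 1] = tmp[j]; j--; }
        tmp[j + 1] = v;
    }
    return 0.5f * (tmp[HIST / 2 - 1] + tmp[HIST / 2]);  /* even count */
}
```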
Results
Figure 4: The cepstrum plot with folding
Figure 5: Highest peak at index 124

Table 1 shows the frequency estimation for 17 different frequencies that cover evenly spaced intervals of the third and fifth octaves as well as the whole fourth octave in the frequency range 260-520 Hz. In addition we have added all the pure notes in the fourth octave. The fourth column of the table shows that the errors in the fourth octave are very small (an average of 1.5 Hz) and rapidly increase as we move outside it.
Input frequency (Hz)   Output frequency (Hz)   Note   Deviation (Hz)
240                    260                             20
262                    262                     C4       0
280                    280                              0
293                    292                     D4      -1
320                    320                              0
330                    328                     E4      -2
349                    346                     F4      -3
360                    358                             -2
392                    390                     G4      -2
400                    400                              0
440                    438                     A4      -2
480                    476                             -4
494                    492                     B4      -2
520                    524                              4
523                    524                     C5       1
560                    560                              0
600                    602                              2
640                    640                              0
680                    680                              0
720                    726                              6
760                    760                              0
800                    800                              0
840                    842                              2
880                    888                              8

Table 1: Frequency estimation results
We could get correct results as low as 180 Hz and as high as 2000 Hz, but at these values the reliability suffers and one sometimes ends up with an overtone: the right note but the wrong octave.
5 Problems encountered
The biggest problems we encountered were during the implementation in C. Mainly, the DSP library provided by Texas Instruments caused big problems. First, just setting the class path right was tricky, but after looking into the reference guide and getting some help from Frida we got it right. After solving the class path a new problem appeared: how to use the function DSPF_sp_fftSPxSP provided by the DSP library. Apparently there is a pitfall in C when converting from unsigned short to float, which forces you to first cast from unsigned short to short and after that to float. If this is not done correctly there will be values like -0, and those will be interpreted as the maximum float value.
The last issue we had to resolve was that some frequencies seemed impossible to get good estimates for, and the estimates fluctuated a lot. What we did was to use the median of the last 10 values instead of a regular average. This cancels out the fluctuations but introduces other disadvantages, for example if the frequency of the input signal varies very quickly; in reality, however, this should not be a problem.
6 Conclusion
The estimation was good for frequencies in the range 240-880 Hz; outside this range the errors become too large. The reason for this is that the input signal is divided into smaller parts. Another limitation of our program is that if the input signal changes frequency every 16 ms, the output will just be the median of the last 10 estimated frequencies.
7 References
[1] Roads, Curtis (1996). The Computer Music Tutorial, Part 4: Sound Analysis.
[2] Norton, Michael; Karczub, Denis (2003). Fundamentals of Noise and Vibration Analysis for Engineers, Cambridge University Press.
[3] Frequencies of Musical Notes, http://www.phy.mtu.edu/~suits/notefreqs.html
Part III
Vocoder
Mattias Danielsson, Andre Ericsson, Kujtim Iljazi, Babak Rajabian
Abstract
This report is based on a project whose aim was for the students to become more experienced in programming signal processing algorithms on a Texas Instruments TMS320C6713 DSK. The project was to program a musically oriented LPC vocoder. A highpass FIR filter was used for prefiltering. The Levinson-Durbin recursion was used to model the voice with an IIR lattice filter structure; to do so, the autocorrelation of the voice was needed. A synthesizer was needed to provide the carrier signal, which is the key to changing the voice. The vocoder was programmed so that the voice controls the level of the carrier signal. It is recommended to take the course Optimum Signal Processing before reading this report.
1 Introduction
The purpose of this project was to program a Texas Instruments TMS320C6713 DSK, in C using Code Composer Studio v3.3, as an LPC-based (Linear Predictive Coding) vocoder used as a musical instrument. The vocoder is meant to be used together with a synthesizer. When the performer presses a key on the synthesizer and speaks into the microphone (both plugged into the vocoder), a synthesized vocal sound is heard from the loudspeakers. The character of the synthesized vocal sound depends on what sound the synthesizer is set to produce. The big advantage is that the vocoder is compatible with basically every musical instrument that can produce a sound with a constant sustain level and rich frequency content.
2 Theory
2.1 Overall description of our vocoder model
Figure 1: Our vocoder model
Our vocoder model is shown in figure 1 above. A sampled speech signal from the microphone is first filtered through a highpass filter, in order to get rid of the low frequency content of the voice (the higher frequency content of the voice spectrum is what defines the vocal tract, as explained in the second lecture of this course). Otherwise the unnecessary low frequency content of the voice would also be modeled. Thereafter a block of samples from the highpass filter output is built up in order to calculate the autocorrelation values. With the autocorrelation values we can estimate the filter coefficients for the all-pole model (IIR filter), which represents a model of the vocal tract.
From the sample block from the highpass filter, the maximum value can be obtained to represent the amplitude of the speech signal. This value is multiplied with the normalized carrier signal (which is always a signal between -1 and 1). This way the sound level of the carrier signal is controlled by the voice signal. The signal from the IIR model is the modified voice. The last step is to invert the effect of the highpass filter applied at the beginning, by filtering the output signal from the IIR model with a lowpass filter.
2.2 The highpass filter
An FIR filter is defined by the following equation:

H(z) = \sum_{k=0}^{p} b_p(k) z^{-k}   (1)
An FIR filter structure of this type is used to implement a highpass
filter.
2.3 The autocorrelation function
The autocorrelation function determines how much a signal relates to itself at different time lags. The estimate of the autocorrelation function is given below:

r_x(k) = \frac{1}{N-k} \sum_{n=k}^{N} x(n) x^*(n-k)   (2)
The autocorrelation function must be normalized in order to prevent over-
flow and to work properly with the Levinson-Durbin recursion.
\rho_x(k) = \frac{r_x(k)}{r_x(0)}   (3)
2.4 The Levinson-Durbin recursion
The Levinson-Durbin recursion is an algorithm used to find an all-pole model from a sequence of autocorrelation values. It calculates both the regular IIR filter coefficients, a(j), and the reflection coefficients for an IIR lattice filter, \Gamma_j. The Levinson-Durbin algorithm is described in [1] and repeated in Table 1 below.
1. Initialize the recursion
   (a) a_0(0) = 1
   (b) \epsilon_0 = \rho_x(0)
2. For j = 0, 1, ..., p-1
   (a) \gamma_j = \rho_x(j+1) + \sum_{i=1}^{j} a_j(i) \rho_x(j-i+1)
   (b) \Gamma_{j+1} = -\gamma_j / \epsilon_j
   (c) For i = 1, 2, ..., j
       a_{j+1}(i) = a_j(i) + \Gamma_{j+1} a_j^*(j-i+1)
   (d) a_{j+1}(j+1) = \Gamma_{j+1}
   (e) \epsilon_{j+1} = \epsilon_j [1 - |\Gamma_{j+1}|^2]
3. b(0) = \sqrt{\epsilon_p}

Table 1: The Levinson-Durbin recursion
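For the real-valued case (where the conjugate in step 2(c) can be dropped), the recursion translates to C roughly as follows; the fixed maximum order is an assumption of the sketch:

```c
#include <math.h>

#define MAXORDER 16   /* upper bound on the model order (assumed) */

/* Levinson-Durbin recursion, real-valued case (sketch of Table 1).
   rho[0..p]:    normalized autocorrelation values
   a[0..p]:      output direct-form coefficients, a[0] = 1
   refl[0..p-1]: output reflection coefficients Gamma_1..Gamma_p
   Returns b(0) = sqrt(eps_p). */
float levinson_durbin(const float rho[], int p, float a[], float refl[])
{
    float tmp[MAXORDER + 1];
    float eps = rho[0];                       /* eps_0 = rho_x(0) */
    a[0] = 1.0f;
    for (int j = 0; j < p; j++) {
        float gamma_j = rho[j + 1];           /* step 2(a) */
        for (int i = 1; i <= j; i++)
            gamma_j += a[i] * rho[j - i + 1];
        float G = -gamma_j / eps;             /* step 2(b) */
        refl[j] = G;
        for (int i = 1; i <= j; i++)          /* step 2(c) */
            tmp[i] = a[i] + G * a[j - i + 1];
        for (int i = 1; i <= j; i++)
            a[i] = tmp[i];
        a[j + 1] = G;                         /* step 2(d) */
        eps *= 1.0f - G * G;                  /* step 2(e) */
    }
    return sqrtf(eps);                        /* step 3 */
}
```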
2.5 The IIR lattice filter
The IIR lattice filter structure is an alternative realization of the IIR filter. Instead of the regular filter coefficients, a(k), it uses the reflection coefficients \Gamma_k. This structure has "the same advantages of modularity, simple tests for stability and decreased sensitivity to parameter quantization effects" [1]. A single stage of an IIR lattice filter structure is shown in figure 2, and the difference equations describing it are shown below.
e_j^+(n) = e_{j+1}^+(n) - \Gamma_{j+1} e_j^-(n-1)   (4)

e_{j+1}^-(n) = e_j^-(n-1) + \Gamma_{j+1}^* e_j^+(n)   (5)
Figure 2: Single stage of an IIR lattice filter
A complete p:th order IIR lattice filter can then be derived from the difference equations and the single stage above, as shown in figure 3.
Figure 3: p:th order IIR lattice filter
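For the real-valued case, one sample can be pushed through the lattice as sketched below, with g[] holding the delayed backward errors e_j^-(n-1); the names are illustrative:

```c
/* One sample through a p:th order all-pole IIR lattice (sketch).
   G[0..p-1]: reflection coefficients Gamma_1..Gamma_p.
   g[0..p-1]: state, the delayed backward errors e_j^-(n-1). */
float iir_lattice(float x, const float G[], float g[], int p)
{
    float f = x;                         /* e_p^+(n), the input        */
    for (int j = p - 1; j >= 0; j--) {
        f -= G[j] * g[j];                /* equation (4): e_j^+(n)     */
        if (j + 1 < p)
            g[j + 1] = g[j] + G[j] * f;  /* equation (5): e_{j+1}^-(n) */
    }
    g[0] = f;                            /* e_0^-(n) = e_0^+(n)        */
    return f;                            /* filter output y(n)         */
}
```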
3 Implementation
The implementation of the vocoder was done in the C programming language using Code Composer Studio. A code template used in the second lab of this course was used to make starting easier. There are a lot of configuration bits that can be set to change the parameters of the AD/DA converter, but these were left at their default values, except for the sampling rate, which was changed to 8 kHz so as not to break any realtime performance. Using the knowledge that speech is somewhat frequency stationary during 20 ms (which we read in a report on a similar speech modeling project in this course), the sampling frequency of 8 kHz results in a buffer size of 160 samples. One thing to keep in mind is that the PIP buffers are filled with unsigned shorts that have to be cast to short and then to float to be able to do calculations with increased precision. The input voltage to the AD converter also has to be taken into account: for example, if you want to use an old analogue synthesizer to generate the carrier signal, you have to make sure that the output from the synthesizer is below the reference voltage of the AD converter. To implement the highpass filter (an FIR filter) for any size, a general, already written function by Texas Instruments was used, even though the highpass filter was only of order one.
Our highpass filter is defined by the following equation:

H_{HP}(z) = 1 - 0.98 z^{-1}   (6)
The order of the filter modeling the vocal tract was chosen to be 8. This makes the size of the normalized autocorrelation vector 9 (to make an all-pole model of order n using the Levinson-Durbin recursion you need n+1 autocorrelation values). At first a regular IIR filter was used to model the vocal tract, but it was later replaced by an IIR lattice filter structure, as described in the results section. To make the voice control the level of the carrier signal, the largest absolute value is found in the block of 160 samples of the highpass-filtered speech signal. This value is multiplied with the normalized carrier signal, obtained by dividing the carrier signal by 32000, since the absolute maximum value that the samples we work with can take is 32000.
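A sketch of this envelope control per 160-sample block is given below; the function and constant names are assumptions:

```c
#include <math.h>

#define BLOCK 160   /* samples per block at 8 kHz / 20 ms */

/* Scale the normalized carrier block by the peak absolute value of
   the highpass-filtered speech block (sketch). */
void apply_envelope(const float speech[BLOCK], float carrier[BLOCK])
{
    float peak = 0.0f;
    for (int n = 0; n < BLOCK; n++) {
        float v = fabsf(speech[n]);
        if (v > peak)
            peak = v;                    /* largest absolute speech value */
    }
    for (int n = 0; n < BLOCK; n++)      /* normalize to [-1, 1], then */
        carrier[n] = carrier[n] / 32000.0f * peak;   /* apply envelope */
}
```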
4 Testing and debugging
For testing the different blocks of the vocoder we took a pragmatic approach. Knowing the expected behaviour of the blocks, suitable test signals easily revealed whether the blocks were working correctly or not. For the highpass filter we used sounds having both high and low frequency content and listened to the filtered signal. Code Composer Studio also has special commands enabling printouts of internal variables and block outputs while running the code.
For the autocorrelation we used a sine signal and looked at the resulting output vector. The autocorrelation was strictly descending in value for increasing lag shifts. The maximum lag shift in our case is 8, resulting in 9 autocorrelation coefficients (lags 0 through 8). We also tested the autocorrelation with white noise; this resulted in low and unpredictable autocorrelation values for lag shifts greater than zero.
The Levinson-Durbin algorithm was tested by feeding white noise into an IIR filter whose coefficients we had set ourselves. The output of this filter was used as input to our autocorrelation block, and the output of the autocorrelation was sent to the Levinson-Durbin algorithm, whose output should then be estimates of the filter coefficients of the IIR filter. Our Levinson-Durbin algorithm produced estimates that varied around the values we had preset. The reason for the variation around the correct filter coefficient values was the short input block length of 160 samples, which follows from the sample rate of 8 kHz and the speech block duration of 20 ms. To test our complete vocoder system we used a recorded voice sample on the left channel and different square wave audio sources of increasing frequency on the right channel. The different sources on the right channel could be mixed and amplified at will in the audio program Audacity.
5 Results and conclusions
The first test of our complete vocoder system resulted in a low sound level. This was overcome by using amplified speakers (we didn't raise the output gain with the parameter in the program because we didn't know how to change the gain parameter during runtime). The reason for the low sound level is probably that no amplification is done in the AD/DA converter. When using our regular IIR filter for filtering the carrier, the result was loud and painful sound level spikes, later discovered to be caused by unstable filter coefficients. Trying different fixes to the IIR filter resulted in some improvements but no complete absence of the painful sound spikes. The ordinary IIR filter was therefore replaced by a lattice IIR filter. After altering the Levinson-Durbin algorithm so that the reflection coefficients could not exceed an absolute value of one, there were no spikes in the output signal. The pitch of the generated speech was also tested by changing the frequency of the square wave carrier. The speech pitch changed satisfactorily with the frequency of the carrier, making us happy with the result. As always there are different changes, choices and improvements one can make in system design and implementation, but we are satisfied with our choices. We never had time to test carrier signals from real synthesizers before the deadline for this report, but we will show this in our demonstration.
References
[1] Monson H. Hayes, Statistical digital signal processing and modeling,
John Wiley and Sons, Inc, 1996
[2] http://en.wikipedia.org/wiki/Vocoder
Part IV
Reverberation
R. Tullberg, R. Mittipalli, S. Abdu-Rahman, T. Isacsson
Abstract
In this project the challenge was to implement a digital reverb on a Texas Instruments TMS320C6713 DSK development board. Jean-Marc Jot's Feedback Delay Network algorithm was used as the reverberation algorithm. Different parameters in the algorithm had to be identified and tuned experimentally. To meet the realtime constraints imposed by CPU and memory speeds, various hardware and software optimizations had to be employed. To aid development, the algorithm was also implemented as a non-realtime version in Matlab. This was beneficial both as a reference design and as a tool for parameter tuning and code analysis. The finished application produces a smooth reverb sound running without glitches at a CPU consumption of approximately 60-65%.
1 Introduction
1.1 Reverberation
Sound waves travelling in a room are reflected when they hit walls or other obstacles. The reflections themselves go on to hit walls and obstacles and get reflected again, and so on. This phenomenon is called reverberation. Sounds are enriched and colored by these reflections due to the tendency of air and obstacles to dampen higher frequencies to a greater extent than lower frequencies.
A reverberated sound consists of three main parts. The sound that travels directly from the source to the listener is called the direct sound. Reflected copies of the sound are delayed by some time, depending on the physical properties of the surroundings, such as room size, material and wall surface, before reaching the listener. The earliest of these reflections are called early reflections and extend roughly 60 to 100 ms after the initial direct sound, depending on the size of the room [1] (see figure 1).
In time, as reflections are reflected and multiplied again and again, they become indistinguishable as separate echoes to the listener. This last part, called late reverberation, starts at about 100 ms and can go on for several seconds in a large enough room or concert hall.
Figure 1: Simplified image of the direct sound and 1st and 2nd order early reflections
Figure 2 shows the early part, consisting of several early reflections, followed by the late, decaying reverberated part.
Figure 2: Impulse response of the actual implemented reverb, showing early reflections and late reverberations
2 Theory
2.1 Reverb Algorithm
Early in the project we settled on the first of the algorithms developed by Jean-Marc Jot. We had read that, while computationally expensive, it produces an impressive reverberated sound with rich echo density [1]. The defining characteristic of this algorithm is the feedback delay network that Jot introduced to model late reverberation.

Figure 3: Jot's FDN algorithm and how it fits into the overall reverb implementation
2.2 Reverberation time
The time for a sound to attenuate by 60 dB in a reverberant space is called the reverberation time, T_r. Sound is attenuated because the surfaces in the room absorb the energy of the sound waves and their reflections. The reasoning behind the 60 dB attenuation requirement is that it is the difference between the intensity of a common orchestra, 100 dB, and the background noise of an ordinary room, 40 dB [5]. The two most common formulas for approximating T_r are the Eyring formula [1] (0.163 in both formulas corrected to 0.161 [2]):

T_r = \frac{0.161 \cdot V}{-A \cdot \ln(1-s) + 4 \cdot \delta_a \cdot V}   (1)

and the Sabine equation [1]:

T_r = \frac{0.161 \cdot V}{s \cdot A}   (2)

where V is the room volume, A is the room surface area, s is an average absorption coefficient and \delta_a is the frequency dependent attenuation constant of air. Once T_r is calculated, the attenuation can be derived and its frequency dependency established as [1]:

T_r(\omega) = -\frac{3 \cdot T}{\log_{10}(\gamma(\omega))}   (3)

where T is the sampling period and \gamma(\omega) is the attenuation per sample period as a function of the frequency \omega.
2.3 Delay elements z^{-m_i}
The z^{-m_i} delays model the time it takes for a reflection to reach the listener and/or another obstacle or wall. The delay in samples is the delay time multiplied by the sampling frequency, for example:

m_{16} = 100 \text{ ms} \cdot 48 \text{ kHz} = 4800   (4)

The different delay values, expressed in sample units, are recommended to be mutually prime. This avoids superposition of harmonically related sound waves, which causes unpleasant resonances, so called flutter echoes [3].
2.4 Damping Filters h_i(z)
Starting from the input x(n) in figure 3, the signal is copied into, in our case, 16 different lines, delayed m_i samples, and then filtered by the h_i(z) filters. These lowpass filters model the real world's attenuation due to absorption, reflection and spreading in walls and other obstacles. High frequency components are attenuated to a greater extent than lower frequencies, as described in section 2.2. The filters are expressed as follows in the frequency domain [1]:
h_i(z) = g_i \frac{1 - a_i}{1 - a_i z^{-1}}   (5)

where

g_i = 10^{\frac{-3 m_i T}{T_r(dc)}},

a_i = \frac{\ln 10}{4} \log_{10}(g_i) \left(1 - \frac{1}{\alpha^2}\right),

\alpha = \frac{T_r(\text{Nyquist})}{T_r(\text{dc})},

with T_r(Nyquist) and T_r(dc) being the times it takes for the highest and lowest frequencies, respectively, to decay by 60 dB.
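As a sketch under these definitions, the coefficients for one delay line could be computed as follows (the names are illustrative):

```c
#include <math.h>

/* Damping filter coefficients for delay line i (sketch of eq. (5)).
   m_i: delay in samples, T: sampling period in seconds,
   Tr_dc / Tr_ny: reverberation times at DC and at Nyquist. */
void damping_coeffs(float m_i, float T, float Tr_dc, float Tr_ny,
                    float *g_i, float *a_i)
{
    float alpha = Tr_ny / Tr_dc;
    *g_i = powf(10.0f, -3.0f * m_i * T / Tr_dc);
    *a_i = (logf(10.0f) / 4.0f) * log10f(*g_i)
         * (1.0f - 1.0f / (alpha * alpha));
}
```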
2.5 Diffusion Matrix A
When a sound wave hits an obstacle in a room, its reflection is scattered across the room, hitting other obstacles, which in turn scatter the new reflections, and so on. Each time, the reflections are redistributed among the walls and obstacles in the room. The element responsible for this redistribution in Jot's algorithm is the diffusion matrix. It takes its inputs from the n delay lines and redistributes them back into the same delay lines. Since the damping or attenuation of the sound waves is handled by the damping filters h_i(z), the diffusion matrix should only redistribute the energy and neither amplify nor attenuate it. In other words, it should be both stable and lossless, both of which are fulfilled if the matrix is unitary, in the case of a complex valued matrix, or orthogonal, in the case of a matrix containing only real values.
2.6 Gains b_i and c_i
These two vectors are simple gains used to achieve different effects. We
simply set all the elements of vector b to 1/16 to make sure the output
did not clip, as the input was copied 16 times and then summed. The c
vector, often used to achieve stereo spread or other cross-channel
effects, was left unused, the equivalent of setting all its elements to a
value of one.
2.7 Tonal Correction Filter t(z)
As the outputs of the h_i(z) filters have lost some of the higher
frequencies, they tend not to be an accurate representation of the
original signal. The solution is to place an inverted version of our
lowpass filters, called a tonal correction filter, before the output, to
equalize the modal energy irrespective of the reverberation time in each
filter[4].

t(z) = \frac{1 - b z^{-1}}{1 - b} \qquad (6)

where

b = \frac{1 - \alpha}{1 + \alpha}
3 Implementation
3.1 Realtime versus non-realtime implementation
To gain an understanding of the algorithm, a reference prototype was
initially developed in Matlab. When this algorithm produced satisfying
results it was adapted to the realtime environment of the TMS320C6713
DSK, where constraints on CPU and memory usage were the next challenges:
to keep up with a sound input sampled at a given frequency, our
application had to process each chunk of samples before the next chunk
arrived.
3.2 Diffusion Matrix
An unlimited number of matrices fulfill the condition of being unitary,
when containing complex values, or orthogonal, when only containing real
values, so other considerations were taken into account when choosing the
diffusion matrix. For example, a better echo density is achieved the more
non-zero elements there are in a matrix[1]. However, the more non-zero
elements a matrix contains, the more multiplications have to be performed.
So naturally a matrix that lends itself to optimization when multiplied
with a vector is preferred.
Doing some research, one such matrix[6] was found:

A = \frac{1}{2}
\begin{bmatrix}
A_4 & -A_4 & -A_4 & -A_4 \\
-A_4 & A_4 & -A_4 & -A_4 \\
-A_4 & -A_4 & A_4 & -A_4 \\
-A_4 & -A_4 & -A_4 & A_4
\end{bmatrix} \qquad (7)

where A_4 is a Hadamard matrix of the 4th order[7]:

A_4 = \frac{1}{2}
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & -1 & 1 & -1 \\
1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1
\end{bmatrix} \qquad (8)
This matrix has the triple benefit of being orthogonal (A = A^{-1}),
containing only non-zero elements of equal magnitude, and being, as we
shall see later, easy to optimize.
3.3 Software Optimizations
3.3.1 Matrix Multiplication
Normally, multiplying a vector of size n by an n × n matrix requires n²
multiplications and n(n − 1) additions. However, some matrices have
beneficial properties that make multiplying by them easier. The diffusion
matrix described in section 3.2 is one such matrix. To start with, the
matrix consists of only positive and negative ones, aside from a scalar
that can be factored out, 1/4 in this case. Thus, the vector elements need
only be multiplied with the scalar after having been summed according to
the signs in each matrix column. This reduces the number of
multiplications to n, or 16 in our case. Furthermore, the regularity of
the A_4 matrix allows us to calculate intermediate sums that can be reused
instead of having to do each addition separately[6]. For example, when
multiplying a vector x of size 4 with an A_4 matrix, the following
intermediate values are calculated:
a = x_1 + x_2, \quad b = x_1 - x_2, \quad c = x_3 + x_4, \quad d = x_3 - x_4

and the resulting vector becomes:

Y_{1\text{-}4} = \begin{bmatrix} a + c \\ a - c \\ b + d \\ b - d \end{bmatrix}
In our case this was done for the 16-element input vector in groups of
four, so that one such vector was calculated for each of the four
sub-vectors. These were organized in the following manner:
B = \begin{bmatrix}
Y_{1\text{-}4} & -Y_{5\text{-}8} & -Y_{9\text{-}12} & -Y_{13\text{-}16} \\
Y_{5\text{-}8} & -Y_{1\text{-}4} & -Y_{9\text{-}12} & -Y_{13\text{-}16} \\
Y_{9\text{-}12} & -Y_{5\text{-}8} & -Y_{1\text{-}4} & -Y_{13\text{-}16} \\
Y_{13\text{-}16} & -Y_{5\text{-}8} & -Y_{9\text{-}12} & -Y_{1\text{-}4}
\end{bmatrix}
Finally, each row was summed to get the final result vector B of the
matrix multiplication. The resulting operation count is 16 + 16 + 16 · 3 =
80 additions, instead of the usual 16 · 15 = 240. So by choosing a certain
type of matrix, the number of multiplications could be reduced from 256 to
16 and the, admittedly cheaper, additions from 240 to 80.
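The following C sketch illustrates the technique; it is a reconstruction based on the equations above, not the project's exact code, and the function names had4 and diffuse16 are our own:

/* 4-point butterfly sharing the intermediate sums a, b, c, d;
   the output ordering follows the grouping in the report. */
static void had4(const float x[4], float y[4])
{
    float a = x[0] + x[1], b = x[0] - x[1];
    float c = x[2] + x[3], d = x[2] - x[3];
    y[0] = a + c;  y[1] = a - c;  y[2] = b + d;  y[3] = b - d;
}

/* Multiply a 16-element vector by the diffusion matrix of eq. (7):
   one butterfly per 4-element block, then the +A4/-A4 block pattern
   folded into a single combination step, with the factored-out
   scalar applied last. */
void diffuse16(const float x[16], float out[16])
{
    float y[4][4];  /* y[k] = unscaled butterfly of block k */
    int   k, i;

    for (k = 0; k < 4; k++)
        had4(&x[4 * k], y[k]);

    for (k = 0; k < 4; k++) {
        for (i = 0; i < 4; i++) {
            /* +A4 on the diagonal block, -A4 on the others:
               y_k minus the other blocks = 2*y_k - total sum. */
            float s = 2.0f * y[k][i]
                    - (y[0][i] + y[1][i] + y[2][i] + y[3][i]);
            out[4 * k + i] = 0.25f * s;
        }
    }
}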
3.3.2 Circular Buffers
Every input to the filters h_i(z) is delayed m_i samples; the outputs from
the filters are then fed to the diffusion matrix and summed before being
sent to the tonal correction filter. Because of the long delay between the
time a value is calculated and the time when it finally reaches the output
and can be discarded, arrays had to be used to store the values. These
arrays were implemented as circular buffers, with the size of buffer i
equal to the sample delay length m_i.
Circular buffers are governed by a pointer into the array. The pointer's
position is incremented each iteration, and when it reaches the end of the
array it is reset to the beginning. Values are read from the position in
the buffer indicated by the pointer, and after the pointer is moved to the
next position, a newly calculated value is written to that new position,
making sure it won't be read until the pointer has traversed the whole
buffer, which happens exactly m_i iterations later.
Using circular buffers reduces CPU load by just reading or writing to
the array element at the pointer position instead of having to move all the
elements of the array one position forward for each iteration. In addition to
the delays, the predelay line was also implemented as a circular buffer.
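A minimal C sketch of such a delay line (illustrative, not the project code): the value written m_i iterations earlier is read out, the new sample overwrites it, and the pointer wraps at the end of the buffer:

typedef struct {
    float    *buf;   /* storage of length len        */
    unsigned  len;   /* delay length m_i in samples  */
    unsigned  pos;   /* current read/write position  */
} delay_line;

static float delay_step(delay_line *d, float in)
{
    float out = d->buf[d->pos];  /* value stored len samples ago */
    d->buf[d->pos] = in;         /* overwrite with the new value */
    if (++d->pos == d->len)      /* wrap at the end of the array */
        d->pos = 0;
    return out;
}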
3.4 Compiler Optimization
Compiling the program with default options and running it on the DSP board
at 48 kHz resulted in very high CPU utilization; so high, in fact, that
the low-priority analysis module had trouble reporting any CPU load
information back to the host. To remedy this we tried the different
compiler optimization levels and settled on -O2, which gave an equally
good CPU load as -O3 but without any potential increase in program size.
This is as expected, since -O3 mostly deals with the inlining of
functions[8] and, aside from the interrupt-triggered process function, our
program only calls a function once to set some initial global variables.
Figure 4: CPU load when compiled with optimization level O1
Figure 5: CPU load when compiled with optimization level O2
Figure 6: CPU load when compiled with optimization level O3

3.5 Hardware Memory Considerations
The predelay buffer and the 16 delay line buffers, duplicated for each of
the two stereo channels, need a memory space of roughly 0.5 MB depending
on the configured delay lengths and predelay, where each single-sample
delay requires 4 bytes (the size of the float data type) per channel.
This large amount of data, which has to be stored for later, delayed
processing, prohibited the use of the relatively small internal memory of
256 KB[9]. Instead, the onboard SDRAM memory with its larger capacity of
8 MB[10] was used. An external heap of 983,040 (0xF0000) bytes was
declared in the memory configuration utility in Code Composer Studio,
with the additional space allowing some headroom in setting the predelay
and m_i delay lengths. The 34 buffer arrays were then allocated on it
using MEM_calloc calls.
While the SDRAM memory worked well, we were careful not to allocate
anything there that we didn't have to, as the internal memory is faster
than the SDRAM memory[9].
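As an illustration of the allocation step, here is a sketch in plain C using the standard calloc; on the board, the equivalent DSP/BIOS heap call was used instead, and the delay lengths below are placeholders, not the values used in the project:

#include <stdlib.h>

#define NUM_LINES 16

/* Placeholder delay lengths in samples (one channel shown). */
static const unsigned m[NUM_LINES] = {
    3001, 3163, 3331, 3499, 3677, 3851, 4021, 4201,
    4391, 4567, 4751, 4931, 5101, 5281, 5443, 5641
};

static float *delay_buf[NUM_LINES];

int alloc_delay_lines(void)
{
    int i;
    for (i = 0; i < NUM_LINES; i++) {
        /* calloc zero-fills, so the delay lines start out silent. */
        delay_buf[i] = calloc(m[i], sizeof(float));
        if (delay_buf[i] == NULL)
            return -1;  /* out of heap space */
    }
    return 0;
}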
4 Result
4.1 Experimental setup
Experimentation was conducted in part using the prototype in Matlab, where
files of differing sound material were treated with the reverb algorithm;
sound files of 100% wet and user-defined wet-ratio signals were generated
and compared to the originals, both by ear and by comparing plots of the
dry, wet and mixed signals. The bulk of the algorithm's parameters, such
as the A matrix and the m_i delay lengths, were chosen and permanently set
at this stage.
Once the realtime implementation was in place and working, we proceeded to
use recordings of anechoic music as input to the DSP board to finetune
parameters, such as room dimensions, α, and T_r(Nyquist), to taste.
4.2 Results
We achieved most of the goals that we set. The reverb is nice sounding, in
no small part due to the large number of delay lines, and as described in
3.4 the CPU load is within a satisfactory range when running at the
maximum sampling frequency of the development board. The goals that
weren't achieved were minor ones, like the implementation of a host-based
user interface.
The learning experience has also been huge, and has given us additional
skills in different areas, such as signal processing, C programming and
optimized algorithms.
5 Discussion and Conclusion
5.1 General
When modelling a reverberation application, in essence trying to replicate
a physical environment as closely as possible, the physical properties
have to be known: the volume of the room, its total surface area, the
absorption of the materials used in the room, and what other elements in
the room could absorb energy and/or reflect the rays. One way of doing
this is to measure an impulse response of the room, by setting up
microphones in the room and firing a starter gun to produce the impulse.
The response could be used to design a system which mimics the room
perfectly, but the response is so big and complex that realtime processing
of it would prove impossible.
So replicating rooms with an algorithm is the way to go, but with no idea
of what kind of room we were trying to replicate, nor what kind of
absorption the room would present, we had problems knowing what we were
looking for. As the work progressed, we made these parameters configurable
and didn't bother too much with accurately reproducing real rooms, but
focused on the audible results and on getting the program to actually
work.
The hardest part was understanding the limitations of the DSP board's
memory configuration and the related special commands: understanding what
the compiler was trying to tell you, why things didn't work at all, why
global variables weren't seen by subfunctions, and why Code Composer
Studio's own flavor of the C language seems to have math functions that
aren't exact replicas of the C math library.
The hours spent trying to understand Code Composer Studio and sifting
through the wealth of help information provided were the biggest drawback
of the project.
5.2 Non-realtime versus realtime implementation
As mentioned earlier, a non-realtime version of the algorithm was
implemented in Matlab. Besides the obvious advantage of having a prototype
that all members of the group could reference while working on the
realtime implementation, it was also beneficial not to have to deal with
constraints on memory and CPU while trying to get a grip on the algorithm
itself. However, there were drawbacks. Realtime issues naturally did not
manifest during the prototyping phase, and it was also hard to measure the
benefits of optimizing the matrix multiplication in Matlab, since Matlab
is already heavily optimized for that purpose.
The main disadvantage of realtime processing is the lack of infinite, or
at least comfortably large, processing power. Because of the time
restrictions, the calculations had to be optimized and, where that was not
enough, cut down at the cost of quality. In comparison with an offline
implementation, a lot more time is spent on making a realtime application
work: allocating memory correctly, optimizing calculations and using the
DSP's built-in functions are just some of the aspects that had to be
addressed. The advantage, though, is realtime playback, which lets a
reverb be used in live music rigs and recording software so the user can
hear the effect while playing.
5.3 Improvements
A user interface was on our to-do list but was never implemented due to
lack of time. This interface would contain controls for parameters like
wet/dry ratio, room size, gain, and the like. The interface would ideally
be able to communicate with the board in real time, using a subfunction
that calculates all the necessary parameters from the user input.
References
[1] Lilja, Ola. Algorithms for Reverberation - Theory and Implementation.
Master's thesis, LTH, 2002.
[2] Wikipedia: Reverberation.
http://en.wikipedia.org/wiki/Reverberation#Sabine_equation.
Visited: 2011-03-01.
[3] Rocchesso, Davide. Introduction to Sound Processing, section 3.6.2
Reverberation. 2003.
[4] Tonal Correction Filter.
https://ccrma.stanford.edu/~jos/Reverb/Tonal_Correction_Filter.html
[5] Reverberation Time.
http://hyperphysics.phy-astr.gsu.edu/hbase/acoustic/revtim.html
[6] Campbell, Spencer. An Implementation of a Feedback Delay Network -
Final Project Report. 2008-12-09.
http://twentyhertz.com/618_FinalProjectReport_SpencerCampbell.pdf
[7] Wikipedia: Hadamard matrix.
http://en.wikipedia.org/wiki/Hadamard_matrix.
Visited: 2011-02-14.
[8] Gough, Brian J. An Introduction to GCC - for the GNU compilers gcc and
g++, section 6.4 Optimization levels. 2005.
http://www.network-theory.co.uk/docs/gccintro/index.html
[9] Chassaing, Rulph; Reay, Donald. Digital Signal Processing and
Applications with the TMS320C6713 and TMS320C6416 DSK, section 3.2
The TMS320C6x Architecture. 2008.
[10] Spectrum Digital. TMS320C6713 DSK Technical Reference, 506735-0001
Rev. A. May 2003.
Part V
Speech Recognition Using MFCC
Harshavardhan Kittur, Kaoushik Raj Ramamoorthy,
Manivannan Ethiraj, Mohan Raj Gopal
Abstract
In this project, we present one of the techniques for extracting a
feature set from a speech signal, and implement it in a speech
recognition algorithm using the TMS320C6713 DSK board. The key is to
convert the speech waveform into some type of parametric representation
for further analysis and processing. A wide range of techniques exist
for parametrically representing the speech signal for the speech
recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency
Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known
and most popular, and is used in this project.
1 Introduction
1.1 Why Speech recognition?
Speech is the primary means of communication between people. For reasons
ranging from the realization of human speech capabilities to the desire to
automate simple tasks inherently requiring human-machine interaction,
research in automatic speech recognition has attracted a great deal of
attention over the past few decades. Although there are numerous ways to
model a speech signal and perform speech recognition in both hardware and
software, no such system is stable for all kinds of speakers in the world.
Our interest is to find out the intricacies of designing such a system by
implementing it on a TMS320C6713 DSK board.
1.2 Common problems found in designing such a system
• People from different parts of the world pronounce words differently,
and the rate at which they speak affects the implementation of a speech
modelling system.
• Speech is usually continuous in nature and word boundaries are not
clearly defined.
• The rate of error in the recognition system depends on the amount of
data stored in the system by training. When the number of words in
the database is large and consists of similar sounding words (rhyming
words), there is a good probability that one word is recognized as the
other.
• Noise is generally a major factor in speech recognition and has to
be carefully analysed while designing a system. A noisy environment
limits the system performance.
1.3 Tools Used
• MATLAB
• Code Composer Studio
• TMS320C6713 DSK Board
• Hi-Fi Microphone
• Stereo Speakers
Figure 1: Feature extraction using MFCC
Figure 2: Feature matching using MFCC
2 Theory
2.1 Speech Recognition Algorithm
At the highest level there are a number of ways to do the complex task of
speech recognition, but the basic principles are feature extraction and
feature matching. Feature extraction is the process that extracts a small
amount of data from the voice signal that can later be used to represent
each speaker. Feature matching involves the actual procedure of
identifying the unknown speaker by comparing extracted features from the
voice input with the ones from a set of known speakers. The block diagrams
are shown in Figures 1 and 2.
2.1.1 Feature Extraction
In the feature extraction phase the speech can be parameterized by various
methods such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum
Coefficients (MFCC), and others. MFCC, which is used in this project, is
perhaps the best known and most popular. MFCC takes human perceptual
sensitivity with respect to frequency into consideration, and is
therefore well suited for speech recognition. MFCCs are based on the known
variation of the human ear's critical bandwidths with frequency: filters
spaced linearly at low frequencies and logarithmically at high frequencies
capture the phonetically important characteristics of speech. This is
expressed in the mel-frequency scale, which has linear frequency spacing
below 700 Hz and logarithmic spacing above 700 Hz.
2.1.2 Feature Matching
The feature matching phase involves the use of the Euclidean distance. In
mathematics, the Euclidean distance or Euclidean metric is simply the
ordinary distance between two points that one would measure with a ruler,
which can be derived by repeated application of the Pythagorean theorem.
By using this formula as the distance, Euclidean space becomes a metric
space. The distance is a measurement of how similar two user templates
are: it captures how much the extracted features differ between the
compared templates, which is why we chose this method of comparison.
2.2 Mel Frequency Cepstrum Coefficients
These are derived from a type of cepstral representation of the audio clip
(a cepstrum is essentially a "spectrum of a spectrum"). The difference
between the cepstrum and the mel-frequency cepstrum (MFC) is that in the
MFC, the frequency bands are positioned logarithmically (on the mel
scale), which approximates the human auditory system's response more
closely than the linearly spaced frequency bands obtained directly from
the FFT or DCT. This can allow for better processing of data, for example
in audio compression. However, unlike the sonogram, MFCCs lack an
outer-ear model and hence cannot represent perceived loudness accurately.
MFCCs are commonly derived as follows:
1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the log amplitudes of the spectrum obtained above onto the Mel
scale, using triangular overlapping windows.
3. Take the Discrete Cosine Transform of the list of Mel log-amplitudes,
as if it were a signal.
4. The MFCCs are the amplitudes of the resulting spectrum.
3 Implementation
Both the training and the recognition system are the same up to the point
where the MFCC coefficients are found. The training phase stores the
coefficients, and the recognition phase compares the currently recorded
coefficients with the stored ones. The steps implemented to complete our
design are listed below, and the block diagram is shown in figure 3.
3.1 Level detection
When the speaker utters a word, the system has to perform silence
detection and capture only the speech signal. The start of an input speech
signal is identified based on a prestored threshold value: speech is
captured when it exceeds the threshold and is passed on to the framing
stage. The sampling frequency of our system is 8 kHz and the speech is
captured for about 1 s, which leaves us with 8192 samples.
3.2 Frame blocking
It is assumed that recorded speech is piecewise stationary, meaning the
signal is stationary for short periods of time. By taking advantage of
this property, we divide the captured signal into a fixed number of
overlapping frames (156 samples of overlap) of length 256 samples. Each
frame thus consists of 256 samples of speech signal, and the subsequent
frame starts 100 samples into the previous frame. This technique is called
framing.
3.3 Windowing
After framing, windowing is applied to prevent spectral leakage. A Hamming
window with 256 coefficients is used, since the frame length is 256
samples. It is also convenient to combine this step with the frame
blocking step.
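A C sketch of the combined framing and windowing steps (our own illustration, not the project code), using the 256-sample frames and 100-sample frame advance described above:

#include <math.h>

#define FRAME_LEN 256  /* samples per frame                  */
#define HOP       100  /* frame advance (156-sample overlap) */
#define PI 3.14159265358979323846

/* Copy one frame starting at sample 'start' and apply a Hamming
   window; frames are taken at start = 0, HOP, 2*HOP, ... while
   enough samples remain. */
static void get_frame(const float *speech, int start,
                      float frame[FRAME_LEN])
{
    int n;
    for (n = 0; n < FRAME_LEN; n++) {
        float w = 0.54f - 0.46f * cosf(2.0f * PI * n / (FRAME_LEN - 1));
        frame[n] = speech[start + n] * w;
    }
}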
3.4 Fast Fourier transform
The FFT converts the time-domain speech signal into the frequency domain,
yielding a complex signal. Speech is a real signal, but its FFT has both
real and imaginary components. We apply a 256-point radix-2 FFT to each
frame, giving a total of 8 FFT stages. The FFT algorithm from Rulph
Chassaing's book [1] is used in our implementation.
3.5 Power spectrum calculation
The power in the frequency domain is calculated by summing the squares of
the real and imaginary components of the signal. The second half of the
samples in each frame is ignored, since it is symmetric to the first half
(the speech signal being real).

Figure 3: Our Speech Recognition System
3.6 Mel-frequency wrapping
Triangular filters designed on the mel-frequency scale form a bank of
filters that approximates the human ear. The power signal is applied to
this bank of filters to determine the frequency content across each
filter. Twenty filters are chosen, uniformly spaced on the mel-frequency
scale between 0 and 4 kHz. The mel-frequency spectrum is computed by
multiplying the signal spectrum with the set of triangular filters
designed using the mel scale. For a given frequency f, the mel value of
the frequency is given by

B(f) = 1125 \cdot \ln\left(1 + \frac{f}{700}\right) \ \text{mels} \qquad (1)
The frequency edge of each filter is computed by substituting the
corresponding mel value. Once the edge frequencies and the center
frequencies of the filters are found, boundary points are computed to
determine the transfer function of each filter. The transfer function of
the triangular filters is given below:

H(k, m) =
\begin{cases}
0 & \text{if } f[k] < f_c[m-1], \\
\dfrac{f[k] - f_c[m-1]}{f_c[m] - f_c[m-1]} & \text{if } f_c[m-1] \le f[k] < f_c[m], \\
\dfrac{f[k] - f_c[m+1]}{f_c[m] - f_c[m+1]} & \text{if } f_c[m] \le f[k] < f_c[m+1], \\
0 & \text{if } f[k] \ge f_c[m+1]
\end{cases} \qquad (2)
where f[k] = k \cdot f_s / N is the frequency of the k-th sample, and N is
the number of samples in each frame (256 in our case). The width
(resolution) of each filter is given by:

\phi = \frac{\phi_{max} - \phi_{min}}{M + 1} \qquad (3)

where \phi_{min} is the lowest and \phi_{max} the highest frequency of the
filter bank, on the mel scale. The center frequencies on the mel scale are
given by \phi_c[m] = m \cdot \phi for m \in [1, 20], and the center
frequencies on the frequency scale by

f_c[m] = 700 \cdot (10^{\phi_c[m]/2595} - 1).
Once the filter transfer functions are obtained, we can apply this filter
bank to the power spectrum to obtain the mel spectrum. This step is
basically a frequency-warping operation where the spectrum is rescaled
according to the mel scale, as elaborated in the equation below:

\text{Mel\_spectrum}[m] = \sum_{k=0}^{N-1} \text{Power\_spectrum}[k] \cdot H[k, m] \qquad (4)
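As an illustration of eq. (4), a C sketch (our own; the filterbank H is assumed to be precomputed from the center frequencies above, and the base-10 form of the mel scale is used together with its exact inverse, the 1125·ln form of eq. (1) being the approximate natural-log equivalent):

#include <math.h>

#define NFFT  256         /* FFT length per frame         */
#define NBINS (NFFT / 2)  /* non-mirrored spectrum bins   */
#define M     20          /* number of triangular filters */

/* Mel scale and its inverse, used to place the filter centers f_c[m]. */
static double hz_to_mel(double f) { return 2595.0 * log10(1.0 + f / 700.0); }
static double mel_to_hz(double p) { return 700.0 * (pow(10.0, p / 2595.0) - 1.0); }

/* Eq. (4): accumulate the power spectrum through the triangular
   filterbank; the logarithm of section 3.7 is applied afterwards. */
static void mel_spectrum(const float power[NBINS],
                         const float H[M][NBINS], float mel[M])
{
    int m, k;
    for (m = 0; m < M; m++) {
        float acc = 0.0f;
        for (k = 0; k < NBINS; k++)
            acc += power[k] * H[m][k];
        mel[m] = acc;
    }
}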
3.7 Log-energy spectrum
Once the mel spectrum is obtained, we take the logarithm of the resulting
signal. The log function is basically an amplitude compression: components
with small amplitude are boosted relative to those with large amplitude,
which follows loudness perception more closely.
This is given by: \text{Log\_energy\_spectrum}[m] = \ln(\text{Mel\_spectrum}[m])
3.8 Mel-frequency cepstral coefficients
The log mel spectrum is converted back to the time domain: the discrete
cosine transform (DCT) of the log mel spectrum yields the MFCCs. We can
use the DCT since the power spectrum and log mel spectrum are real
signals.
3.9 Comparison in the feature matching phase
Once we have the MFCCs, they characterize the particular speaker and word,
and are stored during the training phase. During the recognition or
feature matching phase, the coefficients are determined again for the
uttered word, and recognition is carried out by analyzing the Euclidean
distance with respect to the stored coefficients and applying a threshold
calibrated to increase the word recognition rate.
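A C sketch of this matching step (illustrative; the array names and sizes are hypothetical): the stored template with the smallest Euclidean distance to the test vector is selected. The squared distance is used, since the minimum is the same and the square root can be skipped:

#include <float.h>

#define NCOEFF 20  /* coefficients per template (example size) */
#define NWORDS 6   /* trained words: Cat, Dog, ...             */

static int match_word(const float test[NCOEFF],
                      const float train[NWORDS][NCOEFF])
{
    int w, i, best = -1;
    float best_d = FLT_MAX;
    for (w = 0; w < NWORDS; w++) {
        float d = 0.0f;
        for (i = 0; i < NCOEFF; i++) {
            float diff = test[i] - train[w][i];
            d += diff * diff;  /* squared Euclidean distance */
        }
        if (d < best_d) { best_d = d; best = w; }
    }
    return best;  /* index of the closest stored word */
}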
4 Implementation in MATLAB
An initial feasible algorithm was implemented in MATLAB to emulate the
different speech recognition steps that would later be implemented on the
DSP board. We normalized the recorded speech signal and implemented the
MFCC algorithm. Observing the various plots, we came to the conclusion
that the recorded speech signal can be of varying length (time).
Therefore, instead of time-warping the speech signal into a standard time
frame, we decided that the speech is to be spoken for a fixed duration on
the DSP board. The maximal duration was set to 100 frames of 256 samples
each (with 156-sample overlap), after repeated trials by different
speakers, to encompass all speech parameters. The steps that make up our
MATLAB implementation are shown in figures 4 and 5.
Figure 4: Recorded signal and Signal after silence detection
Figure 5: Mel-spectrum and MFCC for all frames
5 Implementation in DSP Board
The DSP board has limited on-chip memory (192K internal RAM). We are
required to make the generated code fit into this memory along with the
stack and heap space. This posed a difficult situation for us; hence we
used only a minimal set of variables, both global and local. Further, we
made sure that the sequential steps in the algorithm operated on a pointer
to each variable instead of creating copies of it. Important constants
were stored in the program memory (as #define preprocessor directives) and
other variables were instructed to be stored on the heap or stack (via
#pragma preprocessor directives).
6 Tests and Results
In the speech training phase, the experimental setup was placed in a
controlled environment where the noise was minimal and its effects could
be disregarded. The training vectors (time-averaged MFCC coefficients) are
obtained for different words like Cat, Dog, Elephant, Hippopotamus, Mouse
and Tiger; it is also easy to add other words to the training system. The
training vectors are then stored in a header file to be compared with the
test vectors. In the speech recognition phase, the training vectors are
compared with the test vector using the Euclidean distance method. The
identified word is displayed using an ordinary printf statement in Code
Composer Studio. There is a 90% match for certain speakers and less than
50% for some speakers. The words 'Cat' and 'Dog' have higher recognition
rates than the other four words.
7 Conclusion
We have successfully implemented an MFCC system for extracting features
from voices. We were also able to identify a word uttered by different
speakers using the extracted features. Although the results obtained were
not quite as expected, the amount of knowledge gained during this project
is exceptional. We learned the techniques of implementing a speech
recognition system, using MFCC in particular. We also learned to use Code
Composer Studio and DSP/BIOS. This project enhanced our experience of
working in MATLAB and C; our MATLAB implementation helped us a lot in
completing the final implementation on the DSK board. We also learned to
use LaTeX in the process of completing this report. Apart from the
technical aspects, we learned to manage our time with proper planning.
Overall, this project was challenging and a good experience for us.
References
[1] Rulph Chassaing. Digital Signal Processing and Applications with the
C6713 and C6416 DSK. John Wiley & Sons, 2005.
[2] Sigurdur Sigurdsson, Kaare Brandt Petersen and Tue Lehn-Schiøler. Mel
Frequency Cepstral Coefficients: An Evaluation of Robustness of MP3
Encoded Music. Proceedings of the Seventh International Conference on
Music Information Retrieval (ISMIR), 2006.
[3] Adarsh K.P., A. R. Deepak, Diwakar R., Karthik R. Implementation of a
Voice-Based Biometric System. Thesis submitted at R.V. College of
Engineering, India, 2007.
Part VI
Face Detection, Tracking and
Recognition
Asheesh Mishra, Mohammed Ibraheem, Shashikant Patil
1 Abstract
Face detection, tracking and recognition in video is a computationally
intensive procedure: it requires the processing of each and every pixel,
depending on the desired final output. We used the capabilities of the
TMS320DM6437 DSP board from Texas Instruments, built around the DaVinci
DSP processor, which performs all computations on fixed-point numbers. To
facilitate our implementation we used Code Composer Studio (CCS), also
from Texas Instruments, which comes along with the board. TI provides
various built-in functions to get started with video projects; for
example, the videopreview.c example file contains various basic functions
for processing the pixels in the current frame. We made use of those
functions to understand how the system actually works and how the pixel
values can be manipulated. There are various parameters with which a human
face can be detected efficiently within a frame of the incoming video data
stream, such as edge detection, skin detection, etc. We used skin
detection as our parameter for implementing face detection and tracking in
video, because it achieves good detection efficiency and allows the
various features (eyes, nose, lips, ears) to be extracted for further
processing in our face recognition algorithm.
2 Introduction
Face recognition is becoming increasingly important with the availability
of cameras and the need for automated processing of video to serve many
purposes. DSP provides the power to process video and extract meaningful
information from it. The TMS320DM6437 together with Code Composer Studio
gives algorithm developers the ability to concentrate on developing
powerful and efficient algorithms in less time, with many helpful
utilities. The system we developed in our project consists of four main
stages:
1 - Capture an image.
2 - Face Detection.
3 - Face Tracking.
4 - Face Recognition.
The first stage captures an image frame from the streaming video input to
the TI TMS320DM6437 and processes it to detect the face region in the
captured image. This part of the image is then sent to the PCA module,
which generates the feature vector and compares it with the pre-stored
vectors in the database to find the nearest match (the face recognition
stage). These stages are shown in the system block diagram in fig. 1.
3 YCbCr Color Space Model
A color space is simply a format for representing color, brightness,
luminance and saturation in one way or another. A thorough understanding
of YCbCr was mandatory in our video processing project. Here, Y is the
luminance component, whereas Cb and Cr are the blue and red chrominance
components. Luma (Y) is basically responsible for the brightness of an
image and greatly influences its perception. The chrominance components
are responsible for the color composition of an image. As far as our
project was concerned, we mostly had to work with the chroma components of
an image.
The pixels carry the YCbCr information in 4:2:2 format: every other value
in the video data stream is a Y component, and Cb and Cr components each
appear once every four values.
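As an illustration, a C sketch that walks one line of 4:2:2 data; the Cb Y0 Cr Y1 byte order assumed here is one common packing, and the actual order depends on how the capture driver is configured:

void process_line(unsigned char *line, int width)
{
    int x;
    for (x = 0; x < width; x += 2) {     /* two pixels per group */
        unsigned char cb = line[2 * x + 0];
        unsigned char y0 = line[2 * x + 1];
        unsigned char cr = line[2 * x + 2];
        unsigned char y1 = line[2 * x + 3];
        /* ... inspect or modify the components here ... */
        (void)cb; (void)y0; (void)cr; (void)y1;
    }
}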
4 Image Filtering for Noise reduction
Image filters are used to remove undesired image details, for example by
smoothing the image, making it more suitable for edge detection or further
processing of the image frame. There are various good image filtering
algorithms available for reducing noise, such as Gaussian noise filters,
median image filters and others; any good book on image processing can be
consulted for details. We made use of a median-style image filter to
reduce the noise in our project, because we wanted to remove undesired
pixels whose values are similar to skin color. The filter works on the
principle of the 8-neighborhood: if a pixel differs from its neighboring
pixels by more than a certain threshold, it is set to the average value of
those pixels, so that false edges are not detected during further
processing.
The main disadvantage of implementing this was that the process consumes a
lot of time and hence slows down the overall performance. To accommodate
this feature we had to reduce the actual frame rate, processing only a few
frames to speed up the entire process.
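A C sketch of the 8-neighborhood rule described above (our own illustration; the threshold is an arbitrary example and border pixels are skipped for simplicity):

#define THRESH 30  /* example threshold */

void denoise(const unsigned char *in, unsigned char *out,
             int width, int height)
{
    int x, y, dx, dy;
    for (y = 1; y < height - 1; y++) {
        for (x = 1; x < width - 1; x++) {
            int sum = 0, avg, p;
            for (dy = -1; dy <= 1; dy++)
                for (dx = -1; dx <= 1; dx++)
                    if (dx != 0 || dy != 0)
                        sum += in[(y + dy) * width + (x + dx)];
            avg = sum / 8;          /* neighborhood mean */
            p   = in[y * width + x];
            /* Replace outliers by the neighborhood mean. */
            out[y * width + x] = (p - avg > THRESH || avg - p > THRESH)
                                     ? (unsigned char)avg
                                     : (unsigned char)p;
        }
    }
}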
5 Edge Detection
The simplest approach after detecting a face is to extract some features
from the detected face so that we can match it with the next face to be
recognized. We initially considered edge detection the simplest approach
to feature extraction.
Different algorithms for edge detection are listed below:
• Canny edge detection .
• Other first-order methods.
• Thresholding and linking.
• Edge Thinning.
• Phase congruency based edge detection.
When implementing, we found that the thresholding and linking method could
be implemented successfully both in simulation and on the real-time
TMS320DM6437 system. The reason for using edge detection was mainly to
extract features from the detected face, such as eyes, nose and mouth.
Since we implemented Principal Component Analysis (PCA) for the
recognition part, this step may not be of much use in that sense, but it
actually helped the system become more robust in recognizing only faces,
rather than recognizing other parts of the human body as a face.
Basically, each pixel is compared with its neighboring pixels against some
threshold value. The threshold value depends on many factors, so it has to
be adjusted to the ideal setting for successful differentiation and edge
detection.
6 Face Detection and Tracking
Among all the tasks, detecting the face for further processing is an
important one: the face needs to be separated from the background. Many
algorithms are available for this, for example:
• Binary pattern-classification.
• Skin color to find face segments using static background and lighting
condition.
• Window-sliding technique using background pattern.
• Eye blinking pattern detection.
• Appearance, face and movement detection.
While studying the method in Ref. [1], we found the implementation based
on skin color, with a static background and appropriate lighting
conditions in a lab environment, the most suitable and successful for
detecting face segments. It gave the expected results for detecting the
face in the entire frame.
To detect the face in video, we first had to store the current frame in a
temporary array, which we then manipulate. The luminance (Y) component is
set to a particular value that differentiates it from skin color; setting
it to 0xFF gives comparable results, but the point is that the entire Y
component should have the same value.
Since we are going to differentiate the face from the background, the
chroma components become much more important. Most skin color falls within
specific ranges of the chroma red component, so we first set a specific
limit on chroma blue to make the entire procedure more robust. The skin
color then falls in the range 0x8A to 0x8C of chroma red. We set all the
pixels falling in this range to a specific color, and all others to the
same value as we had given the luma component.
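A minimal C sketch of the skin test (the Cr band 0x8A-0x8C is from the text above, while the Cb limits are assumed placeholder values, since the report does not state them):

#define CB_MIN 0x70  /* assumed Cb limits */
#define CB_MAX 0x90

static int is_skin(unsigned char cb, unsigned char cr)
{
    /* Limit Cb first, then check the Cr skin band. */
    return cb >= CB_MIN && cb <= CB_MAX &&
           cr >= 0x8A && cr <= 0x8C;
}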
At this point the image contains only the face. This stored and modified
frame then has to be written back to the write cache buffer to be
displayed on the monitor. Some images after processing with this function
are shown below.
Fig.3 After Processing Luma(Y) and Chroma(Cb)(Cr) Components.
Next, we applied the filter to remove the undesired noise coming from
reflective surfaces; as discussed above, we used the median filter for
this.
The face is then detected by scanning the entire frame from the top left
position until the end of the frame is reached. During the scan we
searched for contiguous pixels holding the same values as skin color; when
a certain threshold is satisfied, a flag is raised to indicate that a face
has been detected in the frame, and that position is saved in a pointer
register.
Then comes the tracking part, which requires the detected face to be
tracked to its new position in each frame. A box is placed at the position
captured by the face detection part, based on the status of the raised
flag. The box is drawn by setting the values of the desired pixels (to
black) using the value in the pointer register.
Fig.4 After detection and VGA display of processed frame.(Exp-1)
Fig.5 After detection and VGA display of processed frame.(Exp-2)
7 Face Recognition
Common algorithms for face recognition include:
1. Principal Component Analysis (PCA)
2. Independent Component Analysis (ICA)
3. Support Vector Machine (SVM)
4. Hidden Markov Models (HMM)
5. Boosting and Ensemble
Among these algorithms, we found the PCA-based eigenface algorithm
[5],[9],[14] the most interesting and the most successful in the simulated
environment using Matlab [11]. PCA also has big implementation advantages,
since it reduces the dimensionality of the images that need to be stored
in the database, which helps reduce memory consumption in the real-time
system given the hardware limitations [13]. We apply the algorithm in
three stages, described as follows.
Creating the database: Before applying the PCA, we have to create the
training database that contains the faces. First we reformat each
two-dimensional image into a single image vector by concatenating its rows
(or columns) into a long vector. Then we combine the image vectors into
one matrix called the trained matrix (T-matrix).
Generating the eigenfaces: Taking the T-matrix as the input to this stage,
we calculate the following matrices to be the input to the recognition
stage:
1. M-matrix: mean values of the T-matrix (training database).
2. A-matrix: centered images, generated by subtracting the M-matrix from
the T-matrix.
3. Eigenfaces: the so-called eigenvectors; this is the feature matrix
containing the face features. We first calculate the covariance matrix by
multiplying the A-matrix by its transpose, then find the eigenvector
matrix and modify it by sorting the vectors and removing the negative
values. Finally, by multiplying the A-matrix by the modified eigenvector
matrix we get the eigenface matrix.
Face recognition process: In this process we receive the three outputs
from the previous stage as inputs, in addition to the input picture whose
face we want to recognize.
First we project the centered images into the face space by multiplying
each column of the A-matrix, representing the corresponding image, by the
eigenfaces; this gives us the projected-images matrix. After that we
project the input image using the same concept (center the image by
subtracting the mean of the T-matrix and multiply it by the eigenfaces).
Having the set of projected images and the projected input image, we
calculate the Euclidean distance between the projected input image and
each projected image in the set; the test image should have the minimum
distance to its corresponding image in the database.
The flowchart (Fig. 6) describes the process in detail.
Fig.6 Flowchart of PCA Algorithm for Face Recognition
8 Problems faced
We faced several problems when implementing the reference model on the
TMS320DM6437. The limited memory available onboard was one of the major
bottlenecks: only 192K of RAM is available, so not much data can be stored
on it. Moreover, since video data manipulation is computationally
exhausting, the slow response of the kit was also a problem during the
processing of image data. We tried to overcome this by processing only a
smaller number of frames, which improved performance a bit. The other
problem we faced was in the implementation of the recognition part. We
initially thought of implementing every part of the PCA on the board,
including the training set, and creating a database for at least three
distinct persons, but that degraded the overall performance of the system
drastically, making it virtually unrealistic. So we did the calculation
part of the PCA (finding the eigenfaces and training databases) in Matlab
itself, and compared the realtime data with these eigenfaces directly. The
TMS320DM6437 being a fixed-point processor was also a problem, since most
PCA calculations involve divisions, multiplications and square roots; we
had to use scaling (scaling up) to convert the data to fixed point and
restore it after calculation by scaling down using shift operations. For
the square root calculation we used the Babylonian method [13]. Although
we were able to implement most of the recognition phase on the kit, the
results were not as expected.
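For illustration, a generic integer version of the Babylonian square root (a sketch of the cited method, not the project's exact fixed-point routine):

unsigned isqrt(unsigned n)
{
    unsigned x, prev;
    if (n < 2)
        return n;
    x = n / 2;                /* initial guess        */
    do {
        prev = x;
        x = (x + n / x) / 2;  /* Babylonian update    */
    } while (x < prev);       /* stop when it settles */
    return prev;
}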
9 Conclusion and Future work
We were able to finish the face detection and tracking part of our project
successfully, and the results were as expected. In a highly noisy
environment (such as an unsuitable background) the performance degrades
marginally, but the system is still able to detect faces most of the time.
The recognition part was also implemented as a hybrid between the board
and Matlab, but the output results were not as expected, and due to lack
of time we could do very little to optimize it. Hence, we separated the
recognition part from the project for the time being. The project was much
more complex to implement than we initially thought, but the kind of
exposure we got during the implementation phase was very satisfying. The
project also honed our skills in embedded C and Code Composer Studio.
References
[1] A.N. Rajagopalan, K. Sandeep. "Human Face Detection in Cluttered Color
Images using Skin Color and Edge Information", Indian Conference on
Computer Vision, Graphics and Image Processing, Dec. 2002.
[2] Sanjay Kr. Singh, D. S. Chauhan, Mayank Vatsa, Richa Singh. "A Robust
Skin Color Based Face Detection Algorithm", Tamkang Journal of Science and
Engineering, Vol. 6, No. 4, 2003, pp. 227-234.
[3] Minku Kang. PCA-based Face Recognition in an Embedded Module for Robot
Application, ICROS-SICE International Joint Conference 2009.
[4] William H. Press. The Art of Scientific Computing, Cambridge
University Press, 3rd edition, ISBN-13: 978-0521880688.
[5] http://www.mathworks.com/matlabcentral/fileexchange/17032-pca-based-face-recognition-system
[6] http://www.eit.lth.se/fileadmin/eit/courses/eti121/Seminar/lect1_2011.pdf
[7] http://www.eit.lth.se/fileadmin/eit/courses/eti121/Seminar/lect2_2011.pdf
[8] http://www.csus.edu/indiv/p/pangj/aresearch/video_compression/ref/report_summer09_Shriram%20_face_detection.pdf
[9] http://www.eit.lth.se/fileadmin/eit/courses/eti121/Reports/ASP_Reports_2010.pdf
[10] http://www.face-rec.org/algorithms/
[11] http://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors
[12] http://cswww.essex.ac.uk/mv/allfaces/faces94.html
[13] http://en.wikipedia.org/wiki/Face_detection
[14] http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Part VII
Circular Object Detection
Ajosh K Jose, Qazi Omar Farooq, Sherine Thomas, Sreejith
P Raghavan
Abstract
This project deals with the implementation of a circular object
detection method on the DSP TMS320DM6437 Evaluation Module. The aim
is to detect a circular moving object against a fixed background.
The first step is to create the reference frame by capturing the
first frame sent from the video camera. Successive frames are then
subtracted from the reference frame to find the moving object, and
further processing is done only on the area of the moving object,
which reduces the processing time required for each successive
frame. In the second step, the Canny edge detection algorithm is
employed to extract the edges of the moving object. Finally, the
object is checked for circular shape using a modified circular Hough
transform. If a circular object is detected in the frame, it is
marked in the video.
1 Introduction
Real-time image processing applications are now widely used due to the
very fast advancement of technology. With the introduction of portable
devices with stringent resource limitations, the image processing
algorithms used in such systems need to be chosen wisely. Image processing
algorithms are also widely used in medical imaging, surveillance systems
and digital cinema, and the rapid change in video and image processing
standards introduces additional complexity and the need for higher
throughput. In this project we familiarize ourselves with various
algorithms used in image processing. Detecting moving objects is an
important task in video surveillance: if the shape of a moving object can
be detected automatically, the work of manual safety monitoring can be
reduced.
Our project implementation detects circular moving objects in real time.
Here we study the methods and challenges involved in detecting an object
correctly. After successful implementation of the circular object
detection system, the algorithm can be further improved to detect objects
of more complex shape. We are using the TMS320DM6437 evaluation module for
implementing our project.
2 Theory
Figure 1 shows the steps involved in our circular object detection
implementation. Circular object detection is implemented in the following
steps:
• Moving object detection
• Edge detection
• Circular object detection
In the first step, moving objects in the frame are detected by the
background subtraction method, which greatly reduces the processing time
required by the subsequent steps. To reduce the processing time further,
every 15th frame is processed instead of consecutive frames. The
background frame is the first captured frame, which is stored in memory.
Each new frame is then subtracted from the background frame to detect
moving objects, which narrows down the area of interest. In the next step,
an edge detection algorithm is applied to that particular area to detect
the edges of the object. In the final step, the circular object detection
algorithm is applied to the edge-detected input to find the circular
objects in the region.
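A C sketch of the background subtraction step (our own illustration; the threshold is a placeholder):

#define DIFF_THRESH 25  /* example threshold */

void motion_mask(const unsigned char *frame,
                 const unsigned char *background,
                 unsigned char *mask, int npixels)
{
    int i;
    for (i = 0; i < npixels; i++) {
        int d = (int)frame[i] - (int)background[i];
        /* Mark pixels that differ noticeably from the reference. */
        mask[i] = (d > DIFF_THRESH || d < -DIFF_THRESH) ? 255 : 0;
    }
}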
Figure 1: Block Diagram of the steps
2.1 Edge Detection
Edge detection is an important step in image processing. Its main
objective is to reduce the amount of data to be processed while keeping
the structural content intact. Several algorithms have been proposed for
edge detection, and their processing times and outputs vary considerably.
The main edge detection algorithms are:
• Prewitt Method
• Canny Method
• Sobel Method
• Roberts Method
• Laplacian of Gaussian Method
• Zero-Cross Method
Please refer to [1] for more details about edge detection algorithms. We
did an initial study in Matlab to find the most suitable algorithm for
edge detection; the results of this study are shown in figure 2.
From the comparison it was clear that the Canny edge detection algorithm
would be the most suitable, because of the fine details available in its
output, which are needed for tracking a moving object. The actual problem
with applying the Canny edge detection algorithm to the complete frame,
however, is the large processing time required. To reduce the processing
time, we added the background subtraction method: we subtract the static
background from the current frame to find the actual area of interest, and
by applying the Canny edge detection algorithm only to that particular
area, the processing time can be reduced considerably.
The input to the Canny edge detection algorithm is the grayscale image.
The Canny edge detection algorithm consists of five steps:
Figure 2: Edge detection algorithm outputs
• Smoothing
• Finding gradients
• Non-maximum suppression
• Double thresholding
• Edge tracking by hysteresis
A brief overview of the different steps of the Canny edge detection
algorithm is given below. For a detailed description, please refer to [2]
or [3].
2.1.1 Smoothing
Smoothing is done to reduce the noise level in the image; usually a
Gaussian filter is employed for this step. It helps to remove unwanted
edges detected due to noise present in the image. The image is smoothed by
applying a Gaussian filter with a standard deviation of 1.4; the Gaussian
matrix is shown in figure 3.
Smoothing takes a long processing time due to the matrix multiplications
involved. In this project we skipped this step, since we process only the
moving object and the effect on the edge-detected image was found to be
small.
Figure 3: Gaussian Matrix
Figure 4: Sobel Matrix
2.1.2 Finding gradients
By finding the gradients, the edges where the grayscale intensity varies
the most are determined. This is done by applying the Sobel matrices to
each pixel in the image. The Sobel matrices, G_x and G_y, are shown in
figure 4. The grayscale sum is calculated by the equation

\text{sum} = |G_x \cdot \text{pixelvalue}| + |G_y \cdot \text{pixelvalue}| \qquad (1)

and the grayscale angle is calculated by

\text{angle} = \arctan\left(\frac{G_y \cdot \text{pixelvalue}}{G_x \cdot \text{pixelvalue}}\right) \qquad (2)

The output after applying the Sobel matrices is shown in figure 5; it is
clearly visible that all the edges in the image are highlighted.
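A C sketch of this step for one interior pixel (our own illustration; atan2 is used for the angle so that the quadrant is handled correctly):

#include <math.h>
#include <stdlib.h>

static const int GX[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
static const int GY[3][3] = { { 1, 2, 1}, { 0, 0, 0}, {-1,-2,-1} };

/* Convolve the 3x3 neighborhood of (x, y) with the Sobel matrices,
   then form the |.|+|.| magnitude of eq. (1) and the angle of eq. (2). */
void sobel_at(const unsigned char *img, int width, int x, int y,
              int *magnitude, float *angle)
{
    int gx = 0, gy = 0, i, j;
    for (j = -1; j <= 1; j++)
        for (i = -1; i <= 1; i++) {
            int p = img[(y + j) * width + (x + i)];
            gx += GX[j + 1][i + 1] * p;
            gy += GY[j + 1][i + 1] * p;
        }
    *magnitude = abs(gx) + abs(gy);
    *angle = atan2f((float)gy, (float)gx);
}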
2.1.3 Non-maximum suppression
Figure 5: Image after finding the gradients

For non-maximum suppression, the angle calculated in the previous step is
used. The angle is rounded to the nearest 45 degrees, which determines the
gradient direction of each pixel: either 0, 45, 90 or 135 degrees. The
strength of the current pixel is then compared with the pixels in the
positive and negative gradient directions; if the current pixel is
stronger than both, it is chosen and the other values are suppressed.
Thus, in this step, all gradient edges that are local maxima are selected.
Figure 6 shows the output image after non-maximum suppression.
2.1.4 Double thresholding
In double thresholding, the remaining edges are classified into strong and
weak edges. Strong edges are retained and will be part of the final edges;
weak edges are checked further in the next step.
2.1.5 Edge tracking by hysteresis
In edge tracking by hysteresis, all weak edges are checked for a
connection with strong pixels in their neighbourhood. A weak edge
connected to any of the strong pixels is considered part of the edge and
is retained; the other pixels are discarded. Figure 7 shows the output
image after hysteresis.
Figure 6: Image after non maximum suppression
Figure 7: Image after Hysterisis
Figure 8: Circle detection method
2.2 Circular Object Detection
A circular object is detected by applying the circular Hough transform
(CHT) algorithm to the edge-detected frame: the CHT is applied to the
output of the Canny edge detector to find the edges of circles in the
image. The algorithm is based on the equation

(x_1 - x_0)^2 + (y_1 - y_0)^2 = r^2 \qquad (3)

All pixels are searched for the possibility of a circle with a radius
within a particular limit. Six pixels are sampled to determine the circle:
if all six pixels carry edge information at the particular radius, a
circle is considered detected. Figure 8 shows the method used for
determining the circle. For a detailed description of the CHT, please
refer to [4], [5] and [6].
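A C sketch of the six-point test described above (our own illustration; a binary edge map is assumed as input):

#include <math.h>

#define PI_F 3.14159265f

/* Sample six points on a circle of radius r around (x0, y0); the
   candidate is accepted only if every sample lands on an edge pixel. */
int is_circle(const unsigned char *edge, int width, int height,
              int x0, int y0, int r)
{
    int k;
    for (k = 0; k < 6; k++) {
        float th = 2.0f * PI_F * (float)k / 6.0f;
        int x = x0 + (int)(r * cosf(th) + 0.5f);
        int y = y0 + (int)(r * sinf(th) + 0.5f);
        if (x < 0 || x >= width || y < 0 || y >= height)
            return 0;               /* sample outside the frame */
        if (edge[y * width + x] == 0)
            return 0;               /* sample missed the edge   */
    }
    return 1;
}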
3 Implementation
The project was implemented in two steps. First, the algorithm was tried
in Matlab to determine its efficiency; second, the Matlab implementation
was converted into a C implementation. Code Composer Studio was used for
compiling and downloading the code. The hardware tools used for testing
the project included:
• Video camera
• DM6437 evaluation board
• Television
The DM6437 platform offers an interface in the framework through which we
can access the input video stream frame by frame. The pixels are in YCbCr
format, where Y is the luma component and Cb and Cr are the chroma
components, with the ratio 4:2:2. For this project we considered a PAL
system, so the frame size is 720x576. The processing consists of reading
the frame buffer, updating the frame data and writing it back into the
buffer.
4 Conclusion & Future Work
Circular object detection was successfully implemented and tested. It was
a nice experience working on this project, which introduced us to the
world of programming DSP processors. We became familiar with different
signal processing algorithms in this course, and the two labs that were
part of the course were helpful in getting to know the tools and the DSP
kit. Working with the DM6437 and Code Composer Studio was a nice
experience, although we felt the processing power of the DM6437 evaluation
kit is not enough for handling the complex algorithms used in video
processing.
Due to the lack of enough processing time, we were not able to implement
more reliable algorithms for circular object detection on the DM6437
processor, and edge detection was implemented only in the selected area
where a moving object is detected. As future work we are planning to
optimize our current implementation and add more reliable circular object
detection algorithms that can detect circular objects with different
radii. Also, if the processing power permits, we would like to implement
more complex algorithms, such as human hand detection and tracking of hand
movement.
References
[1] Raman Maini, Himanshu Aggarwal. Study and Comparison of Various Image
Edge Detection Techniques. International Journal of Image Processing
(IJIP), Volume (3): Issue (1).
[2] John Canny. A Computational Approach to Edge Detection. IEEE
Transactions on Pattern Analysis and Machine Intelligence,
PAMI-8(6):679-698, Nov. 1986.
[3] Canny Edge Detection Implementation Tutorial. Laboratory of Computer
Vision and Media Technology, Advanced Image Processing, Aalborg
University.
[4] Mohamed Rizon. Object Detection using Circular Hough Transform.
American Journal of Applied Sciences, 2005.
[5] Marcin Smereka, Ignacy Duleba. Circular Object Detection Using a
Modified Hough Transform. Int. J. Appl. Math. Comput. Sci., 2008.
[6] Mohamed Roushdy. Detecting Coins with Different Radii based on Hough
Transform in Noisy and Deformed Image. GVIP Journal, Volume 7, Issue 1,
April 2007.
[7] Project Report 2010, ETI121, Algorithms in Signal Processing Course.


. . . . . Sreejith P Raghavan 1 Introduction 2 Theory 2. . . . . . . . Sherine Thomas. Tracking and Recognition 61 61 62 63 63 63 64 67 68 69 Asheesh Mishra. . . . . . . . . . . . .2 Finding gradients . . . . .VI Face Detection. . . .2 Circular Object Detection . . . . . .1 Edge Detection . . . . . . . 3 Implementation 4 Conclusion & Future Work . . 2. . . .1. . . . . 2. . . . . .1. Shashikant Patil 1 Abstract 2 Introduction 3 YCbCr. .1 Smoothing . . . . . .1. . . . . .3 Non-maximum suppression 2. . . . . . . . . . . . . . . . . . . . . .Color Space Model 4 Image Filtering for Noise reduction 5 Edge Detection 6 Face Detection and Tracking 7 Face Recognition 8 Problem faced 9 Conclusion and Future work VII Circular Object Detection 71 72 72 73 74 75 75 76 76 78 78 79 Ajosh K Jose. . . . . . Mohammed Ibraheem. . .1. . . 2. . . .5 Edge tracking by hysteresis 2.1. . . . . .4 Double thresholding . . . . 2. Qazi Omar Farooq. . . . . . . . . . . vi . . . . . . . . . . . . . . . . . .

A working guitar tuner was then made. which concludes this project. A great deal of the report handles problems associated with the memory of the board. Suggestions of possible improvements are presented in the final section. First the theory about the guitar strings and it’s harmonic patterns is covered. along with a description of the different mathematical algorithms used to tune them. C-J Waldeck. F Brosj¨ o Abstract This report handles the development of a guitar tuner based on the Texas Instrument TMS320C6713DSK signal processing board. together with solutions developed to work around them. 1 .Part I Guitar tuner P V Soumya. This is followed by the analysis of the guitar sound and the method used implement the tuner on the DSP-board. A Norrgren.

. the other strings change pitch as well due to the higher stress on the guitar caused by the tension. When a string on some guitars is tuned. since the strings have to be tuned separately and many times to achieve a stable pitch for all of them. thereby being able to correct changes to the other strings instantaneously. a TC Electronic polytune [4]. When searching for this type of tuner only one was found on the commercial market. This was our source of inspiration for this project in pitch estimation.2 Guitar tuner 1 Introduction When thinking of pitch estimation one of the project group members came to think of a problem he experiences in tuning his guitars. The goal of this project is first to be able to determine the pitch of a single string and eventually expanding it to being able to estimate the pitch for multiple strings simultaneously and presenting the result to the user. A way to address this issue is to have a tuner that allows the user to overlook all strings at the same time. This makes tuning a tough and time consuming procedure.

String nbr 6 5 4 3 2 1 Octave E2 A2 D3 G3 B3 E4 f0 82.814 220. In table 1 all strings with their corresponding pitches and first two harmonics are shown. In figure 1 the frequency pattern of the E4 string on a Hagstr¨m Viking is shown.830 196.940 329. F Brosj¨ o 3 2 2.630 2f0 164.890 Table 1: Guitar frequencies The pattern of the harmonic amplitudes are not the same for different strings and guitars which gives each guitar its specific sound.820 988. What differs the instruments and gives them their unique sound is the amplitude pattern for these harmonics.000 293. A Norrgren.000 246.000 146. thickness and material of the string.000 740. o Figure 1: Spectrum of E4 .260 3f0 247. These patterns depends on many factors like the length.660 392. called the pitch.221 330.000 493.P V Soumya. A guitar has six strings tuned to different pitches.1 Theory Harmonics A note played on a string instrument consists of a fundamental frequency. The frequencies of these harmonics are multiples of the pitch and are the same for every instrument tuned to the same pitch. C-J Waldeck.407 110.880 659.490 588.000 440. and a number of harmonics.

the frequency scale is made logarithmic. This is done using equation 1 which has a linear frequency scale.2 Cross Correlation A cross correlation is used to find similarities between two different discrete signals. This is done by replacing k in equation 1 with (2): k = f0 · B i/N i = 0. 2.2 2. The cross correlation of the signals x and y is given by equation 3. . This makes the accuracy better for higher frequencies. which are one semitone apart. To handle this smoothly a logarithmic frequency scale can be implemented. an octave is divided into 12 pitches. N − 1 (1) In this case the frequency scale of octaves is logarithmic. rxy (n) = l x(l) · y(l − n) (3) . Since the distance between the octaves is increasing for higher frequencies the number of increments between each octave is increasing in the linear frequency scale. Xk = N −1 −2πnk/N n=0 xn e k = 0. To obtain a linear accuracy over the entire frequency spectrum. The information about DFT and the logarithmic frequency scale was found in Martin Stridh doctoral thesis.. The base determines the size of the scale along with the number of points. 1. N − 1 (2) Where B is an arbitrary base and f0 is the starting frequency. Between each semitone there are 100 equally spaced increments called cents [2].1 Algorithms Fourier Transform The Fourier transform is a discrete transform between the time and frequency domain. 1.. A commercial good quality tuner usually have an accuracy of between +/-1 and +/-3 cents. meaning that an increase of one octave corresponds to a doubling of the frequency.. .4 Guitar tuner The chromatic scale that is used globally..2. Signal Characterization of Atrial Arrhythmias using the Surface ECG [3]. This unit is commonly used to measure the accuracy of instrument tuners.2. This is done by multiplying the signal components on each index with each other and summing them up. A portion of a signal in time domain is analysed and the frequency components are extracted with their corresponding amplitude.. This is done repeatedly where one of the signals are shifted in relation to the other.. 2.

The correlation speed can then be improved by removing all the multiplications and most of the additions.1 Method Analysis of the guitar sound The project began with recordings of the guitar strings from a Hagstr¨m o Viking semi acoustic electric guitar. The correlation results in an array where the index with highest value represents the best match between the signals. and it was quickly concluded that the frequency positions of the harmonic peaks were of higher significance than their amplitude. . A Norrgren.P V Soumya. or even a cepstrum. Roughly estimated this saves a total of 700 000 operations with the improved correlation for all the six strings. could not be used. Instead another method was tested. that might have been used for one string. If the spectrum is of the length 1500 an ordinary scalar product requires 1500 multiplications and 1499 additions. so all that is left is the sum of the three values in the spectrum where the reference is one. a normal auto correlation. The sounds were then analysed using Matlab’s built in FFT-function to get an idea of how the frequency spectrum would look like. Since the goal is to tune multiple strings simultaneously. C-J Waldeck. 2. where the ones represents the fundamental and the two harmonics. To save time and calculations there is no need to do a full correlation and so it was limited to 20 steps around the centre index.5 0 −40 −30 −20 −10 0 Shift 10 20 30 40 Figure 2: Correlation of the spectra in figure 1 3 3. The reference spectra is only composed of three ones.5 2 Correlation coefficient 1. F Brosj¨ o 5 Provided that the guitar is roughly in tune the reference spectrum should have a high correlation near the centre index. That means that it saves 2997 operations for every scalar product. The spectrum was found to vary a lot between different strings and pitches. With the improved correlation it only requires two additions and no multiplications.5 1 0.

the references were constructed rather than recorded. This would compromise the accuracy since it would be better if both the pitch and the harmonics matched at the same time. This came to effect the choice of resolution and thereby the base to the logarithmic frequency scale. To get rid of this issue. The aim was to construct a tuner with relatively high accuracy.5 2 Amplitude Amplitude −15 −10 −5 0 Shift 5 10 15 20 2 1. In this way we could get the exact frequencies for the pitch and harmonics of each string. an accuracy of ±3 cents was chosen.5 0. the pitch and the harmonics will match at the same time.6 Guitar tuner By cross correlating the frequency spectra from the coincident strings with the spectra from the individual strings in turn. and with the limited memory. To obtain an adequate accuracy.5 1 1 0. This corresponds to an acceptable error in the correlation of ±1 steps in the correlation. as in figure 3 B. This solves the problem because when the cross correlation is made. as shown in figure 3 A. the linear frequency scale would be hard to use because the correlation would yield multiple peaks depending on if the pitch or the harmonics matched perfectly. .5 2. and the correlation was done to the frequency pattern rather than the amplitude of the harmonics. This method was tested using recorded sounds. there was a need Cross correlation for linear frequency scale 3 3 Cross correlation for logarithmic frequency scale 2. a separate correlation for each string could be obtained. which resulted in a very messy graph. Since it was realised that the frequency difference between the pitch and the harmonics would be changed when the string is not tuned.5 0 −20 0 −20 −15 −10 −5 0 Shift 5 10 15 20 Figure 3: A Linear correlation B Logarithmic correlation for a specialised Fourier transform with a logarithmic frequency scale. to be quantised and noise free.5 1. The amplitude difference of the different harmonics of the reference sounds made it hard to get a clear result.

The counter is then reset when the data has been processed. . no loop is needed to run the program. to ignore the end of a signal and prevent misreadings.2 Implementation The program was constructed using a number of different functions. A main function where the necessary parameters and arrays were initialized and constructed. A software interrupt called echo is activated when the input buffer is full.2. A Norrgren.1 Echo Echo is called by interrupt when the input buffer is full and collects the values from the input buffer and passes on the buffer to the detection function. The interrupt runs the process which calls the different functions needed to process the signal. Since the DSP uses software interrupts. F Brosj¨ o 7 3. Figure 4: Flow chart over the algorithm 3. Further explanation of these functions follow. This calls the detection which registers the input amplitude of the signal and triggers a software interrupt if the signal exceeds a threshold level. A counter was then used to ignore the first 128 calls to detection.P V Soumya. C-J Waldeck. When the DSP is starting up is has a lot of random values on the input that must be ignored.

The magnitude of these are normalized to reduce the risk of overflow and then stored in the output array. The result is then correlated together with the reference array for the individual strings. The returning values represents the indices of the maximum correlation and the corresponding value in relation to the maximum possible correlation value. It was chosen to only shift 20 steps to the left and to the right around the centre element.3 Process Process gathers all the functions needed to perform the processing of the signal. the detection function goes into an buffering mode where it samples every package until the sample buffer is filled.2.5 SmallXcorr The small cross correlation function is made so that it only calculates a small part of a normal correlation. The first step is to perform a DFT which is covered in the section below.2.2. When this is done a second loop goes through the resulting correlation arrays to find the index of the maximum value.2.2 Detection Guitar tuner The detection function first calculates the mean power of the input. and this is the only difference from a normal DFT. one for the real and one for the imaginary part of the complex result. 3. the process flag is set and the interrupt process is called.4 DFT The Discrete-FT is based on a normal Fourier transform summation using a double for-loop. When the last step is done the buffering mode is deactivated. 3. 3. . Since the DSP does not have support for the complex numbers. The small correlation is done using a for-loop which runs from -20 to 20 where it sums the elements where the reference frequencies are. the summation had to be done in two separate variables.8 3. In other words 41 steps in total. The frequency array used was explained in the theory section. If the value is higher than a predefined threshold value. which is done for the references of all six strings.

The function was tested against Matlab’s built in FFT-function and resulted in very similar data. using Matlab’s built in FFT-function. F Brosj¨ o 3. The implementation of this was fairly simple in Matlab and did not generate any problems out of the ordinary.6 MatLab GUI 9 A graphical user interface was done using Matlab’s GUI Guide. Figure 5: Graphical User Interface 4 4. A Norrgren. In a spectrum for one string this difference. as mentioned prior. The result. The program consists of a table.P V Soumya. . C-J Waldeck.1 Results and Discussion Matlab testing The first thing we did was to record the sound from all strings and take a DFT to get an idea of how the spectra would look like. however it was soon realized. is much clearer. but could very well be contributed to round off errors. that the distance between the frequency increments had to be logarithmic to yield appropriate accuracy in lower frequencies. To access the DSP a built in function called ccsdsp was used. The initial results were good. Some slight variations were found. As can be seen there is a lot of different peaks with varying amplitude and it’s difficult to distinguish between the fundamentals and the harmonics. which makes it possible to load and run the project on the board from within Matlab. The RUN button uploads the program to the DSP and runs it and the STOP button stops the DSP and closes the program. The table is updated every second using a timer interrupt. gave us the spectra shown in figure 6A. as visible in figure 6. two buttons and a timer. The work continued by implementing a DFT-algorithm in Matlab to ensure its functionality.2.

and thereby result in a double peak in the correlated data.h package. the best correlation can sometimes be between two indexes. however the amount of memory and the time needed for additional calculations set a limit to this resolution. 4.10 Guitar tuner Figure 6: A Spectrum of all strings B Spectrum of D3 The correlation algorithm was also implemented and tested against Matlab’s correlation function. After implementing the necessary functions without any major issues.1 Hz. but a suitable work around to this problem has already been covered. This could be avoided by using a higher frequency resolution. 1500 points and a base of 15. the DSP compiler did not support the complex. had an initial frequency of 72. The values used for the frequency array. After many hours spent on error correction it was found that the memory was over written in some way and replaced the result values with memory addresses. It was also important to get the frequency points in the array as close as possible to the known frequencies to be used as reference values. and the tuner would always have an offset. This is due to that Matlab does not use full length floats like C. There were some minor differences that most probably are because of round off errors. Since the correlation algorithm handles a discrete frequency array. calculated using Stridh’s formula described in the theory section. The values were not at all consistent with the expected values generated in Matlab.2 Implementation The implementation on the DSP board was straight forward and did not generate so many problems at first. but a user defined length. As mentioned above. in our case the format short containing four decimals. Otherwise there would be an error because of displacement from the correct value. This turned out to be because of lacking . the program was tested and resulted in confusion.

especially the amount of memory that was available in the different memory banks. and many arrays had to be moved to the external memory.P V Soumya. The time to go through one cycle was too long. By moving almost all of the arrays to the external memory the functions started to work better with correct results. This is clearly something to continue working on. and the guitar should be viewed as tuned for values between -1 and 1. The strings has prior to the test been tuned with a TC electronic polytune commercial tuner [4]. F Brosj¨ o 11 internal memory that we were expected to have. The data presented after a tuning was also more unstable the more strings that were included. most likely as a result of sampling to early when the strings are still unstable after the strum. As seen the strings get the tuning value of 1. The memory configuration of the board was fairly hard to understand. This is due to low resolution of the tuner. A Norrgren. . 4. see figure 8 A. The cause could also be that the different strings have different amplitudes and different sustain. the timing of the tuning has to be perfected. causing the spectrum to be uneven. based on the amplitude of the correlation however is lower. This could easily be improved with more memory and processing power that would allow longer reference arrays with more harmonics. The most likely cause is the enormous amount of frequency data which results in correlation matches in other places than the intended. This is still a bit too long but acceptable. This drastically improved the time to process the signal to a few seconds. The accuracy of the tuning. When tuning all strings the first result is most often wrong. The second sampling of the same strum usually have a more accurate shift. visible in figure 8 B.3 Post-testing The system was now complete and some post testing was done. To decrease this time the algorithms were analysed many times to look for means of improvement. however very slow. it took between 10 and 20 seconds. After much testing a new correlation algorithm was created that only used the points of interest in the reference arrays rather than correlating every point. o This is shown in figure 7 where the 5th and 6th strings are tuned individually. C-J Waldeck. The Hagstr¨m guitar was used to test the tuner and tuning one string worked well. which implies that their pitch is slightly high.

in order to make the points come closer to each other. making correlation more accurate. B Fifth string A2 Figure 8: A All strings 1. This would improve the accuracy for the entire spectra. B All strings 2 5 Conclusion and further development As it is now the tuner works quite well for tuning of one string at a time. which is good. however we discovered that a hamming window did not actually improve the frequency spectrum in a noticeable way. A great deal of time could be spent on optimization for faster performance. Since we have limited memory and computing power we had to limit the number of frequency points to 1500 to be able to get a result in a reasonable time. This might be counteracted by increasing the frequency resolution of the tuner and the number of harmonics in the reference arrays. One way to make sure that some noise is suppressed might be to use windowing functions. Another reason may be that the sound level from some of the strings has dropped in amplitude before the sampling has started. This yields an uncertainty in the higher frequencies where the step size becomes to large. This can be due to too good correlation matches in more than one point in the spectrum because of the large amount of fundamentals and harmonics in the sampled signal. but get much less accurate when more strings are tuned. and not just the higher frequencies. thereby causing a lower probability that the right string is tuned.12 Guitar tuner Figure 7: A Sixth string E2. As for now. In further development this can be helped by using more frequency points and also by reducing the base. the result is presented in a few seconds which is a bit too slow . allowing longer arrays and higher resolution as well as quicker results. More testing with windowing functions should be able to clean up the spectra by narrowing the peaks.

2003 [4] TC Electronics. http://hyperphysics. 33.edu/hbase/music/cents. http://www.htm (2011-02-28).asp (2011-03-03) .P V Soumya. References [1] Vaughn Aubuchon. Vol.tcelectronic. [2] Hyperphysics.com/polytune. Signal Characterization of Atrial Arrhythmias using the Surface ECG.gsu.html (2011-03-03) [3] Martin Stridh.vaughns-1-pagers. C-J Waldeck. This Vaughns Music Note Frequency Chart http://www. A goal for further development would be to reduce this time to under a second to make real time tuning bearable. F Brosj¨ o 13 in our opinion. ISSN 1402-8662.phy-astr. A Norrgren.com/music/musical-notefrequencies.

14 Guitar tuner .

disregarding certain intervals. This is achieved by applying the cepstrum transform which is an extension of the Discrete Fourier Transform. The algorithm is written in C using Code Composer Studio IDE and it aims to determine the dominant frequency for a given input audio signal. From here the dominant frequency can be extracted and the closest pure tone as well as the distance to it is presented to the user. This is followed by an inverse Fourier transform which yields a signal in what is known as the quefrency domain. Johan Mattsson Abstract This report covers the implementation of a digital signal processing algorithm for the TMTS320C6713 by Texas Instruments. Henrik Nilsson. in where you search for the highest amplitude. involving additional manipulation of each sample in the frequency domain.15 Part II Pitch Estimaiton Jonas Rosenqvist. Kim Smidje. .

. For all other values of f. But the common thing is that they all require a lot of computations. converts it into the logarithmic scale and lastly transforms the samples back into the time domain. as well as the absolute and logarithmic functions. two sinusoidal signals with different phase. the top plot. In this report we will focus on the cepstrum algorithm which is a collection of mathematical tools. namely in Matlab. The result of applying the Fourier transform on to the sum of the signals is displayed in Figure 2. is a method which distinguish frequencies in a tone or sound in order to determine which frequency is the ground frequency of them all. The reason why cepstrum is fast comes from the fact that it is computed using only the Fast Fourier Transform. an estimation of the pitch. i. its inverse. 2 Theory The Fourier transform transforms a signal in the time domain. The cepstrum algorithm has the advantage that its faster than for example autocorrelation which gives the possibility to compute the frequencies from a faster sample rate. amplitude and frequency are shown and the lower plot its the sum of these signals. amplitude as a function of frequency. then the absolute values of that result. into a signal in the frequency domain. the Fourier transform consists of only two peaks. The working process we chose to solve the problem was to first solve the problem in a familiar environment. This is visualised in Figure 1 and Figure 2. all of which have a fairly low time complexity. The cepstrum method uses several different other operations in order to reach its goal and its mathematical representation is: F −1 (log10 (|F (x) |)) What it does is that it takes the signal and samples it in the discrete Fourier transform. the amplitude as a function of time.16 Pitch Estimaiton 1 Introduction Pitch estimation is just as the name says. i.e. In Figure 1. have an amplitude of zero in the original signal. Since the original signal is made up only of the sum of two pure sinusoid with different amplitudes. which comes from reversing the four first letters in the word “spectrum”. One sees here that the value of the Fourier transform for a given value of the variable f corresponds to the amplitude of a sinusoid component with that frequency in the original signal. each of them representing the amplitude of the two sinusoid. There are some techniques that can be used with different advantages and disadvantages.e. or to put it differently. F(f) has a value of zero because those frequencies are absent. and after that go deeper into Code Composer. Cepstrum.

Jonas Rosenqvist. Kim Smidje. The peaks in the cepstra occurs as a result of the periodicity of the signal or the sound. Bottom: The sum of the two signals The reason for using both the absolute and the logarithmic functions are because you want to emphasize lower frequencies to make sure that the dominant peak comes from the ground frequency and not from one of the over tones. which is an important property for the cepstra (the spectra for the cepstrum). Johan Mattsson 17 Figure 1: Top: Two sinusoidal signals with different amplitudes and phases. the signals will be additive. though not in the sense of a signal in the time domain. this would respond to the frequency derived from taking the sample rate (measured in Hz) divided with X. For instance if a pea! k in the cepstrum diagram would appear at point X (dimensionless). After the quefrency has been calculated.[1] . The result after the transformation is called quefrency. The quefrency will therefor be the sum of all the signals which are recorded. the highest peak in the window will correspond to the right frequency. Henrik Nilsson. which is measured in seconds. Because of the convolution occurring with the FFT.

As the name says autocorrelation tries to find the correlation of an input signal with the input signal with some lag.18 Pitch Estimaiton Figure 2: The power spectrum of the two sinusoidal signals 3 Methods There’s two types of algorithms that can be used to obtain estimated frequencies. Algorithms in the time domain and algorithms in the frequency domain. The main problem with the autocorrelation is for higher frequency’s because the number of additions will become overwhelming.[1] .1 Time domain One very simple approach that could be used in the time domain is to look at the zero crossings in the signal. This is the reason why autocorrelation is mostly used in the low to mid frequency range. 3. This approach isn’t that robust since for signals that consist of multiple sinus signals with different periods the result will not be close to the real frequency. by comparing the cross correlation 1 point a time we can start building another graph which hopefully will look like a sinus signal. That is when the signal goes from a high value to a low value and the other way around. In one period the signal will cross zero two times and by measuring at what times this is done a rough estimate of the pitch can be calculated. Other methods in the time domain such as autocorrelation has a different approach.

Jonas Rosenqvist, Kim Smidje, Henrik Nilsson, Johan Mattsson

19

autocorrelation [k] =

1 N −k

N

signal [n] ∗ signal [n − k]
n=k

3.2

Frequency domain

Frequency domain methods takes an input in the time domain and computes the frequency spectrum. The input signal displayed in the frequency spectrum will cover the whole spectrum but the dominant frequency will have the highest peak. The main advantage of frequency methods is the use of the fast Fourier transform which makes the computations fast and reliable. There are a number of different algorithms that perform operations in the frequency domain for example kepstrum, cepstrum, power cepstrum and maximum likelihood. To be able to compute the estimated pitch the input signal needs to be divided into smaller parts and computed for each part. The disadvantage of dividing the input signal is the loss of resolution, since the estimated frequency depends on the sampling rate and the length of the divided input signal. It is possible to still get a good resolution if for example the sampling rate is 8000 and the lenght of the divided input signal is 8000 then each frequency can be represented.[1] sample rate index of (maximumvalue)

4

Implementation

In order to evaluate the algorithms ability to correctly detect the dominant frequency we decided to first implement it in Matlab. In addition to the group being more experienced working with Matlab as opposed to the C language, it allows for much faster implementation thanks to the high level developing environment with many of the crucial algorithms, such as the fast Fourier transform, already implemented. It was also at this stage that we estimated the appropriate cut-off level in the quefrency domain, as well as suitable signal sample size, by experimenting with various audio signals. By not discarding enough initial values in the quefrency domain one run the risk of finding a false dominant frequency. On the other hand, if too many values are ignored, one might miss the true dominant frequency. The sample size must cover a large enough time frame to detect the lowest possible frequency, and that the same time not be too big in regards to the limited memory and real time requirements. Using a sample rate of 32000 samples/second and a vector of length 512, corresponding to a window of 16 ms, we found that we got acceptable results while still being able to detect frequencies as low as approximately 240 Hz. We loaded an audio signal generated by a horn, which had the frequency 123.47 Hz. That frequency corresponds to

20

Pitch Estimaiton

Figure 3: Power spectrum from the B2 horn a B2 note, which means the note B in the second octave. When running our Matlab pitch detection program, we get these plots where the first plot represent the power spectrum from the B2 horn and the second plot shows the cepstrum of the B2 horn. Given the cepstrum, the estimated frequency can be calculated to approximately 126 Hz, which can be considered as a reasonable deflection from the correct answer. Since we aimed to make our range designed for fourth octave we did the same with C4 , then we got the plot shown below. If the peak is located at index 124, which gives us the frequency circa 258 Hz. Implementation in Code composer was very similar to what we did in Matlab. With the use of the DSP library the algorithms FFT and IFFT was made available. Other functions as log and abs was introduced into the program with the libraries math.h. To increase the accuracy of the program we chose to take the 10 latest values and only present the median of these to remove any potential outliers. After some trial and error we chose a cut off point at 20 samples as that gave us the right results in our chosen octave.

Results
Table 1 shows the frequency estimation for 17 different frequencies that covers evenly spaced intervals of the third and fifth as well as the whole fourth octave in the frequency range of 260-520 Hz. In addition to this we’ve added

Jonas Rosenqvist, Kim Smidje, Henrik Nilsson, Johan Mattsson

21

Figure 4: The cepstrum plot with folding

Figure 5: Highest peak at index 124

22

Pitch Estimaiton

all the pure notes in the fourth octave.The fourth column of the table shows that the errors in the fourth octave are very small (average of 1.5 Hz), and rapidly increase as we move outside it. Input frequency (Hz) 240 262 280 293 320 330 349 360 392 400 440 480 494 520 523 560 600 640 680 720 760 800 840 880 Output frequency (Hz) 260 262 280 292 320 328 346 358 390 400 438 476 492 524 524 560 602 640 680 726 760 800 842 888 Note C4 D4 E4 F4 G4 A4 B4 C5 Deviation (Hz) 20 0 0 -1 0 -2 -3 -2 -2 0 -2 -4 -2 4 1 0 2 0 0 6 0 0 0 8

We could get the right results as low as 180 Hz and as high as 2000 Hz but at these values the reliability suffers and you sometimes end up with an over tone, the right note but wrong octave.

5

Problems encountered

The biggest problems which we encountered where during the implementation in C. Mainly the DSP library provided by Texas Instruments caused big problems. First just setting the class path right was kind of tricky but after looking into the reference guide and some help from Frida we got it right. After solving the class path a new problem was introduced, how to use the function DSPF sp fftSPxSP which was provided by the DSP library. Apparently there is a bug in C when converting from unsigned short to float

Curtis (1996). If this isn’t done correctly there will be values like -0 and those will be interpreted as the maximum float value. Cambridge University Press.html . Part 4: Sound analysis [2] Norton.mtu. Fundamentals of Noise and Vibration Analysis for Engineers. The reason for this is that the input signal is divided into smaller parts. This cancel out the fluctuation but introduce other disadvantages for example if the frequency of the input signal varies very fast.Jonas Rosenqvist. Michael. Henrik Nilsson. Karczub. [3] Frequencies of Musical Notes . Johan Mattsson 23 which forces you to first cast to from unsigned short to short and after that cast to float. What we did was to use the median of the last 10 values instead of just the regular average.http://www. Kim Smidje. The Computer Music Tutorial. 7 References [1] Roads. The last issue we had to resolve was that some frequencies seemed impossible to get good estimations for and the estimations fluctuated a lot. Another limitation of our program is if the input signal changes frequency every 16 ms then the output will just be the median frequency of the last 10 frequencies. Denis (2003). but on the other hand in reality this should not be a problem. 6 Conclusion The estimation was good for frequencies in the range 240 .880 Hz before and after this range the errors becomes too large.edu/˜suits/notefreqs.phy.

24 Pitch Estimaiton .

25 Part III Vocoder Mattias Danielsson. . A synthesizer was needed as a carrier signal which was the key to change the voice. In order to do so the autocorrelation was needed from the voice. The Levinson-Durbin recursion was used to model the voice using an IIR lattice filter structure. The vocoder was programmed so that the voice was controlling the level of the carrier signal. Andre Ericsson. A highpass filter was used for prefiltering (FIR filter). The course Optimum Signal Processing is recommended to take before reading this report. Kujtim Iljazi. The project was to program a musical based LPC-vocoder. Babak Rajabian Abstract This report is based on a project by students to become more experienced in programming signal processing algorithms on a Texas Instruments TMS320C6713 DSK.

26 Vocoder 1 Introduction The purpose of this project was to program a Texas Instruments TMS320C6713 DSK in Code Composer Studio v3. low frequency content of the voice will be modeled. 2 2. The characteristic from the synthesized vocal sound is dependent on what sound the synthesizer is set to produce. With the autocorrelation values we can estimate the filter coefficients for . in order to get rid of the low frequency content in the voice (the higher frequency content in the voice spectrum is what defines the vocal tract. When the performer presses down a key on the synthesizer and speaks into the microphone (both plugged in to the vocoder) a synthesized vocal sound is heard from the loudspeakers. The vocoder is programmed to be used together with a synthesizer. which explained on the second lecture of this course).3 in C to an LPC-based (Linear Prediction Coding) vocoder used as a musical instrument. Thereafter a block of samples is built from the highpass filter in order to calculate the autocorrelation values. A sampled speech signal from the microphone is first filtered through a high pass filter. The big advantage is that the vocoder is compatible with basically every musical instrument that can produce a sound with a constant sustain level and rich frequency content. Otherwise the unnecessary.1 Theory Overall description of our vocoder model Figure 1: Our vocoder model Our vocoder model is shown in figure 1 above.

The Levinson-Durbin algorithm is described in [1] and repeted in the table at the top of the next page. ρx (k) = rx (k) rx (0) (3) 2. It calculates both regular IIR-filter coefficients.Mattias Danielsson.2 The highpass filter An FIR filter is simply defined by the following equation p H(z) = k=0 bp (k)z −k (1) An FIR filter structure of this type is used to implement a highpass filter. Kujtim Iljazi. a(j) . From the sample block from the high pass filter the maximum value can be obtained to represent the amplitude of the speech signal. rx (k) = 1 N −k N x(n)x∗ (n − k) n=k (2) The autocorrelation function must be normalized in order to prevent overflow and to work properly with the Levinson-Durbin recursion.3 The autocorrelation function hej The autocorrelation function is a function to determine how much a signal relates to itself at different time lags. and the reflection coefficients for an IIR lattice filter. . 2. The last step that needs to be done is to invert the effect of the high pass filter that was implemented at the beginning by filtering the output signal from the IIR-model with a low pass filter. Γj . The signal from the IIR-model is the modified voice. This way the sound level of the carrier signal is basically controlled by the voice signal. Andre Ericsson. 2. The estimation of the autocorrelation function is given below. Babak Rajabian 27 the all-pole model (IIR-filter) which should represent a model for the vocal tract. This value is multiplied with the normalized carrier signal (which is always a signal between -1 and 1).4 The Levinson-durbin recursion The Levinson-Durbin recursion is an algorithm used to find an all-pole model by using a sequence of autocorrelation values.

. simple tests for stability and decreased sensitivity to parameter quantization effects”.28 Vocoder 1. e+ (n) = e+ (n) − Γj+1 e− (n − 1) j j+1 j e− (n) = e− (n − 1) + Γ∗ e+ (n) j+1 j j+1 j (4) (5) . it uses the reflection coefficients Γk . 1. p − 1 j 0 = ρx (0) (a) γj = ρx (j + 1) + i=1 aj (i)ρx (j − i + 1) j (b) Γj+1 = −γj / (c) For i = 1. j aj+1 (i) = aj (i) + Γj+1 a∗ (j − i + 1) j (d) aj+1 (j + 1) = Γj+1 (e) 3. Instead of using the regular filter coefficients. a(k) . For j = 0... b(0) = √ p j+1 = j [1 − |Γj+1 |2 ] Table 1: The Levinson-Durbin recursion 2.. . . Initialize the recursion (a) a0 (0) = 1 (b) 2.. This structure has ”the same advantages of modularity. 2. A single stage of an IIR lattice filter structure is shown on the next page and it’s difference equations describing it is shown below..5 The IIR lattice filter The IIR lattice filter structure is an alternative structure of the IIR-filter.

Babak Rajabian 29 Figure 2: Single stage of an IIR lattice filter A complete p:th order IIR lattice filter can than be derived from the difference equations from the previous page and the figure above as shown in the figure below. Figure 3: p:th order IIR lattice filter . Kujtim Iljazi.Mattias Danielsson. Andre Ericsson.

98z −1 (6) The order of the filter modeling the vocal tract was chosen to 8. There are a lot of configuration bits that can be set to change the parameters of the AD/DA converter but these were left at their default value as found except for the sampling rate that was changed to 8 kHz. Our highpass filter is defined by the following equation HHP = 1 − 0. Using the knowledge that speech is somewhat frequency stationary during 20 ms (which was read on a report based on a similar speech modeling project in this course) and the sampling frequency of 8 kHz results in the buffersize of 160 samples. For example if you want to use an old analogue synthesizer to generate the carrier signal you have to make sure that the output from the synthesizer is below the reference voltage for the AD-converter. A code template used in the second lab in this course was used to make starting easier. This results in the size of the normalized autocorrelation vector to be 9 (to make an all-pole model using the Levinson-Durbin recursion of order n you need n+1 autocorrelation values). because the absolute maximum value that the samples we work with can obtain is 32000. . This value is multiplied with the normalized carrier signal which is in an interval of values between -1 and 1. To make the implementation of the highpass filter (an FIR filter) of any size a general and already written function by Texas Instrument was used even though the order of the highpass filter was only of order one. To make the voice control the level of the carrier signal. At first a regular IIR-filter was used to model the vocal tract but was later replaced by an IIR lattice filter structure which is described later in the results section.30 Vocoder 3 Implementation The implementation of the vocoder was done in the C programming language using Code Composer Studio. the absolute largest value is found from the block of 160 samples from the highpass filtered speech signal. not to break any realtime performance. which was done by dividing the carrier signal by 32000. One thing to keep in mind is that the PIP buffers are filled with unsigned shorts that have to be type casted to short and then to float to be able to do calculations in increased precision. Also the input voltage to the AD-converter has to be taken into account.

Mattias Danielsson. To test our complete vocoder system we used a recorded voice sample on one the left channel and different square wave audio sources increasing in frequency on the right channel. The output of the Levinson-Durbin algorithm should be estimates of the filter coefficients in the IIR filter. Our Levinson-Durbin algorithm produced estimates that varied around the values preset by us. The autocorrelation was strictly descending in value for increasing lag shifts. The Levinson-Durbin algorithm was tested by feeding white noise into an IIR filter where we ourselves have set the filter coefficients. Andre Ericsson. In code composer studio there is also special commands enabling printout of internal variables and output from blocks while running the code. Babak Rajabian 31 4 Testing and debugging For testing of the different blocks in the vocoder we took a pragmatic approach. . The different sources on the right channel could be mixed and amplified at will in the sound program Audacity. resulting in 9 autocorrelation coefficients. For the autocorrelation we used a sine signal and looked at the resulting output vector. The reason for the variation around the correct filter coefficient values was the short input block length of 160. Kujtim Iljazi. Knowing the behaviour of the blocks. For the highpass filter we used sounds having both high and low frequency content and listened to the filtered signal. The maximum lag shift in our case is 9. different signals revealing easily if the blocks were working correctly or not. The output of this filter was used as input to our autocorrelation block and the output of the autocorrelation was sent to the Levinson-Durbin algorithm. We also tested the autocorrelation with white noise and this resulted in a low autocorrelation and not predictable values for lag shifts greater than zero. The value of 160 is the result of the sample rate of 8 KHz and speech duration block of 20 ms used.

The pitch of the generated speech was also tested by changing the frequency of the square wave carrier. The reason for the low sound level is probably that no amplification is done in the AD/DA converter. We never had time to test carrier signals from real synthesizers before the deadline for the report but we will show it in our demonstration. The speech pitch changed satisfactory when changing the frequency of the carrier making us happy with the result. . This was overcome by using amplified speakers (we did’nt gain the output with the parameter in the program because we did’nt know how to change the gain parameter during runtime). When using our regular IIR filter for filtering of the carrier. After altering the Levinson-Durbin algorithm so that the reflection coefficients did not exceed an absolute value of one resulted in no spikes in the output signal. this resulted in loud and painful sound level spikes later discovered to be caused by unstable filter coefficients. Trying different fixes to the IIR filter resulted in some improvements but no complete absence of the painful sound spikes. As always there are different changes. The ordinary IIR filter was later replaced by a lattice IIR filter. choices and improvements you can make in system design and implementation but we are satisfied with our choices.32 Vocoder 5 Results and conclusions The first test of our complete vocoder system resulted in low sound level.

org/wiki/Vocoder . 1996 [2] http://en.Mattias Danielsson.wikipedia. Kujtim Iljazi. Statistical digital signal processing and modeling. Hayes. John Wiley and Sons. Inc. Babak Rajabian 33 References [1] Monson H. Andre Ericsson.

34 Vocoder .

R.35 Part IV Reverberation R. the challenge was to implement a digital reverb on a Texas Instruments TMS320C6713 DSK development board. Abdu-Rahman. Different parameters in the algorithm had to be identified and tuned experimentally. T. To meet realtime constraints imposed by CPU and memory speeds various hardware and software optimizations had to be employed. Mittipalli. To aid development. S. JeanMarc Jot’s Feedback Delay Network algorithm was used as reverberation algorithm. . Isacsson Abstract In this project. Tullberg. the algorithm was also implemented as a non-realtime version in Matlab. The finished application produces a smooth reverb sound running without glitches at a CPU consumption of approximately 60-65%. This was beneficial both as a reference design as well as a tool for parameter tuning and code analysis.

36 Reverberation 1 1. The reflections go on themselves to hit walls and obstacles and get reflected again and so on. as reflections are reflected and multiplied again and again. In time. The earliest iterations of these reflections are called early reflections and extend roughly 60 to 100 ms. after the initial direct sound (see figure 1). Sounds are enriched and colored by these reflections due to airs and obstacles tendency to dampen higher frequencies to a greater extent than lower frequencies. This last part called late reverberations starts at 100 ms and can go on for several seconds in a large enough room or concert hall.1 Introduction Reverberation Sound waves travelling in a room are reflected when they hit walls or other obstacles. such as room size. before reaching the listener. followed by the late reverberated decaying part. consisting of a several of early delays. material and wall surface. A reverberated sound consists of three main parts: The sound that travels directly from the source to the listener is called the direct sound. . they become indistinguishable as separate echoes to the listener. Figure 1: Simplified image of direct sound and 1:st and 2:d order early reflections Figure 2 shows the early part. Reflected copies of the sound are delayed some time depending on the physical properties of the surroundings. This phenomenon is called reverberation. depending on the size of the room[1].

R. Abdu-Rahman. S. Tullberg. T. Isacsson 37 Figure 2: Impulse Response of actual implemented reverb showing early reflections and late reverberations . Mittipalli. R.

2 Reverberation time The time for a sound to attenuate by 60 dB in a reverberant space is called the reverberation time. 40 dB[5]. while computationally expensive.163 in both formulas corrected to 0.161[2]): 0. We had read that. Sound is attenuated because of the surfaces in the room. The two most common formulas for the approximation of T r is the Eyring formula[1] (0. as they absorb the energy of the sound waves and their reflections.38 Reverberation 2 2.1 Theory Reverb Algorithm Early on in the project we settled on the first of the algorithms developed by Jean-Marc Jot. 100 dB. Figure 3: Jot’s FDN algorithm and how it fits in the overall reverb implementation 2. T r . The defining characteristic of this algorithm being the feedback delay network that Jot had introduced to model late reverberations. it produced an impressive reverberated sound with rich echo density[1]. and the background noise of an ordinary room.161 · V A · ln(1 − s) + 4 · δa · V Tr = (1) . The reasoning behind the 60 dB attenuation requirement is that the difference between the intensity of a common orchestra.

3 Delay elements z −mi The z −mi delays model the time it takes for a reflection to reach the listener and/or another obstacle or wall. These lowpass filters model the real worlds attenuation due to absorption. 16 different lines. for example: m16 = 100ms48kHz = 4800 (4) The different delay values are recommended to be mutually prime. the attenuation can be derived and frequency dependency established as[1]: Tr (ω) = − 3·T log10 (γ(ω)) (3) where T is the sampling period and γ(ω) is the attenuation per sample period as a function of the frequency ω. the signal is copied into. T. when expressed in sample units. and then filtered by the hi (z) filters. This is to avoid superpositioning of harmonically related sound waves causing unpleasant resonances. delayed mi samples. 2. OnceT r is calculated.161 · V s·A 39 (2) where V is the room volume. Tullberg. 4 α . reflection and spreading in walls and other obstacles.2. Mittipalli. Isacsson and the Sabine equation[1]: Tr = 0. ai = ln 10 1 log10 (gi )(1 − 2 ). R.R. 2. s an average absorption coefficient and δa is the frequency dependent attenuation constant of air. The delay in samples is the delay time in milliseconds multiplied by sampling frequency. so called flutter echoes[3]. Abdu-Rahman. The filters are expressed as follows in the frequency domain[1]: hi (z) = gi where (1 − ai ) (1 − ai z −1 ) − (5) 3miT gi = 10 Tr (dc) .4 Damping Filters hi (z) Starting from input x(n) in figure 3. High frequency components are attenuated to a greater extent than lower frequencies as described in 2. in our case. A is the room surface area. S.

Each time. The solution to this is to place an inverted version of our low pass filters. Tr (dc) with Tr (N yquist) and Tr (dc) being the time it takes for the highest and lowest frequencies respectively to decay by 60 dB. 2. the reflections are redistributed among the walls and obstacles in the room.6 Gains bi and ci These two vectors are simple gains used to achieve different effects. it should be both stable and lossless. We 1 simply set all the elements of vector b to to make sure the output did 16 not clip as the input was copied 16 times and then summed. the equivalent of setting all its elements to a value of one. often used to achieve stereo spread or other cross channel effects. t(z) = where b= 1−α 1+α 1 − bz −1 1−b (6) . before output. both of which are fulfilled if the matrix is unitary. In other words. The element responsible for this redistribution in Jot’s algorithm is the diffusion matrix. 2. The c vector. in case of a matrix containing only real values.5 Diffusion Matrix A When a sound wave is reflected when hitting an obstacle in a room it is scattered across the room hitting other obstacles. was left unused. 2. in case of a complex valued matrix. which in turn scatter the new reflections across the room hitting other obstacles and so on. called a tonal correction filter. to equalize the modal energy irrespective of the reverberation time in each filter[4]. Since the damping or attenuation of sound waves is handled by the damping hi (z) filters. or orthogonal. It takes its inputs from the n delay lines and redistributes them back into the same delay lines.40 Reverberation α= Tr (N yquist) .7 Tonal Correction Filter t(z) As the outputs of the hi (z) filters have lose some of the higher frequencies they tend to not be an accurate representation of the original signal. the diffusion matrix should only redistribute the energy and neither amplify nor attenuate it.

3 Implementation

3.1 Realtime versus non-realtime implementation
To gain an understanding of the algorithm, a reference prototype was initially developed in Matlab. When this algorithm produced satisfying results it was adapted to the realtime environment on the TMS320C6713 DSK, where constraints on CPU and memory usage were the next challenges, with the obvious requirement that, to be able to keep up with a sound input sampled at a certain frequency, our application had to process a certain amount of samples before the next chunk of samples arrived.

3.2 Diffusion Matrix
A potentially unlimited number of matrices fulfill the condition of being unitary, when containing complex values, or orthogonal, when only containing real values, so other considerations were taken into account when choosing the diffusion matrix. For example, a better echo density is achieved the more non-zero elements there are in a matrix[1]. However, the more non-zero elements a matrix contains, the more multiplications have to be performed. So naturally a matrix that lends itself to optimization when multiplied with a vector is preferred. Doing some research, one such matrix[6] was found:

            1 |  A4  −A4  −A4  −A4 |
    A  =    - | −A4   A4  −A4  −A4 |    (7)
            2 | −A4  −A4   A4  −A4 |
              | −A4  −A4  −A4   A4 |

where A4 is a Hadamard matrix of the 4:th order[7]:

            1 |  1   1   1   1 |
    A4 =    - |  1  −1   1  −1 |    (8)
            2 |  1   1  −1  −1 |
              |  1  −1  −1   1 |

This matrix has the triple benefits of being orthogonal (A = A^(−1)), containing only non-zero elements of equal magnitude, and being, as we later shall see, easy to optimize.

3.3 Software Optimizations

3.3.1 Matrix Multiplication
Normally, multiplying a vector of size n by a matrix of dimension n requires n² multiplications and n·(n − 1) additions. However, some matrices have beneficial properties that make multiplying by them easier. The diffusion matrix described in section 3.2 is one such matrix.

To start with, the matrix consists of only positive and negative ones, aside from a scalar that can be factored out. Thus, the vector elements need only be multiplied with the scalar after having been summed according to the signs in each matrix column. This reduces the number of multiplications to n, or 16 in our case. Furthermore, the regularity of the A4 matrix allows us to calculate intermediate sum values that can be reused, instead of having to do each addition separately[6]. For example, when multiplying a vector x of size 4 with an A4 matrix, the following intermediate values are calculated:

    a = x1 + x2
    b = x1 − x2
    c = x3 + x4
    d = x3 − x4

and the resulting vector becomes:

           | a + c |
    Y1-4 = | a − c |
           | b + d |
           | b − d |

In our case this was done for the 16 element input vector in groups of four, so that four resulting vectors were calculated, one for each sub input vector. These were organized in the following manner:

        |  Y1-4   −Y5-8   −Y9-12   −Y13-16 |
    B = | −Y1-4    Y5-8   −Y9-12   −Y13-16 |
        | −Y1-4   −Y5-8    Y9-12   −Y13-16 |
        | −Y1-4   −Y5-8   −Y9-12    Y13-16 |

Finally, each row was summed to get the final result vector B of the matrix multiplication. The resulting operation count is 16 + 16 + 16·3 = 80 additions, instead of the usual 16·15 = 240. So by choosing a certain type of matrix, the number of multiplications could be reduced from 256 to 16 and the, admittedly cheaper, additions from 240 to 80.
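
The butterfly scheme above translates directly into C. The sketch below is our own reconstruction of the optimization, with our own function names:

    /* Multiply a 16-element vector by the diffusion matrix A of eq. (7).
     * Each 4-element block is first transformed by the (unscaled) A4
     * butterflies of eq. (8); the blocks are then combined according to
     * the sign pattern of A, and the factored-out scalar 1/4 is applied
     * last, leaving only 16 multiplications in total. */
    static void a4_butterfly(const float *x, float *y)
    {
        float a = x[0] + x[1];
        float b = x[0] - x[1];
        float c = x[2] + x[3];
        float d = x[2] - x[3];
        y[0] = a + c;
        y[1] = a - c;
        y[2] = b + d;
        y[3] = b - d;
    }

    void diffusion_multiply(const float x[16], float out[16])
    {
        float y[16];   /* Y1-4, Y5-8, Y9-12, Y13-16 */
        int blk, i;

        for (blk = 0; blk < 4; blk++)
            a4_butterfly(&x[4 * blk], &y[4 * blk]);

        for (i = 0; i < 4; i++) {
            /* column sum reused by all four output blocks */
            float tot = y[i] + y[4 + i] + y[8 + i] + y[12 + i];
            for (blk = 0; blk < 4; blk++) {
                /* diagonal block enters positively, the rest negatively */
                float yk = y[4 * blk + i];
                out[4 * blk + i] = 0.25f * (yk - (tot - yk));
            }
        }
    }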

3.3.2 Circular Buffers
Every input to the filters hi(n) is delayed mi samples; the outputs from the filters are then fed to the diffusion matrix and summed before being sent to the tonal correction filter. Because of the long delay between the time a value is calculated and the time when it finally reaches the output and can be discarded, arrays had to be used to store the values. These arrays were implemented as circular buffers, with the size of buffer i equal to the sample delay length mi. Circular buffers are governed by a pointer to the array. Values are read from the position in the buffer indicated by the pointer, and after the pointer is moved to the next position in the buffer, a newly calculated value is written to that new position, making sure it won't be read until the pointer has traversed the whole buffer, which happens exactly mi iterations later. The pointer's position is incremented each iteration, and when it reaches the end of the array it is reset to the beginning. Using circular buffers reduces CPU load, since we just read or write the array element at the pointer position instead of having to move all the elements of the array one position forward in each iteration. In addition to the delays, the predelay line was also implemented as a circular buffer.
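
A minimal C version of such a delay line might look as follows. This is our sketch, not the project code; it writes in place before advancing the pointer, which gives the same mi-sample delay as the ordering described above:

    /* One delay line of length len (= mi) as a circular buffer. Each call
     * returns the sample stored len iterations earlier and overwrites it
     * with the new input. */
    typedef struct {
        float *buf;   /* len elements, zero-initialized */
        int    len;
        int    pos;   /* current pointer position */
    } delay_line;

    float delay_process(delay_line *d, float in)
    {
        float out = d->buf[d->pos];   /* oldest value in the buffer */
        d->buf[d->pos] = in;          /* will not be read again until the
                                         pointer has traversed the buffer */
        if (++d->pos >= d->len)       /* wrap around at the end */
            d->pos = 0;
        return out;
    }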

3.4 Compiler Optimization
Compiling the program with default options and running it on the DSP board at 48 kHz resulted in very high CPU utilization, so high in fact that the low priority analysis module had trouble reporting any CPU load information back to the host. To remedy this we tried the different compiler optimization levels and settled on -O2, which gave an equally good CPU load as -O3 but without any potential increase in program size. This is as expected, since -O3 mostly deals with the inlining of functions[8] and, aside from the interrupt triggered process function, our program only calls a function once, to set some initial global variables.

Figure 4: CPU load when compiled with optimization level O1

Figure 5: CPU load when compiled with optimization level O2

Figure 6: CPU load when compiled with optimization level O3

3.5 Hardware Memory Considerations
The predelay buffer and the 16 delay line buffers, duplicated for each of the two stereo channels, need a memory space of roughly 0.5 MB depending on the configured delay lengths and predelay, where each single sample delay requires 4 bytes (the size of the float data type) per channel. This large amount of data that has to be stored for later, delayed processing prohibited the use of the relatively small internal memory of 256 KB[9]. Instead the onboard SDRAM memory, with its larger capacity of 8 MB[10], was used. An external heap of 983,040 (0xF0000) bytes was declared in the memory configuration utility in Code Composer Studio, the additional space allowing for some headroom in setting the predelay and mi delay lengths. The 34 buffer arrays were then allocated to it using MEM_calloc commands. While the SDRAM memory worked great, we were careful not to allocate anything there that we didn't have to, as the internal memory is faster than the SDRAM memory[9].
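
Assuming the DSP/BIOS MEM module, the allocation might have looked roughly like the sketch below; the segment identifier EXTERNALHEAP is a hypothetical name for the 0xF0000-byte heap declared in the configuration:

    #include <std.h>
    #include <mem.h>

    extern Int EXTERNALHEAP;   /* heap segment id from the configuration */

    /* Zero-initialized delay buffer of mi float samples on the external
     * SDRAM heap, 8-byte aligned. */
    float *alloc_delay_buffer(int mi)
    {
        return (float *)MEM_calloc(EXTERNALHEAP, mi * sizeof(float), 8);
    }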

4 Result

4.1 Experimental setup
Experimentation was conducted in part using the prototype in Matlab, where files of differing sound material were treated with the reverb algorithm and sound files of 100% wet and user-defined wet ratio signals were generated and compared to the originals, both by ear and by comparing plots of the dry, wet and mixed signals. Once the realtime implementation was in place and working, we proceeded to use recordings of anechoic music as input to the DSP board to finetune parameters, to taste. The bulk of the algorithm's parameters, such as the A matrix and the m delay lengths, were chosen and permanently set at this stage. Other parameters, such as α and Tr(Nyquist), we made configurable.

4.2 Results
We achieved most of the goals that we sought; the ones that weren't achieved were minor ones, like the implementation of a host based user interface. The reverb is nice sounding, in no small part due to the large number of delay lines, and as described in 3.4 the CPU load is within a satisfactory range running at the maximum sampling frequency of the development board. Also, the learning experience has been huge, and has gained us additional skills in different areas, such as signal processing, optimized algorithms and C programming.

5 Discussion and Conclusion

5.1 General
When modelling a reverberation application, in essence trying to replicate a physical environment as much as possible, the physical properties have to be known: the volume of the room, its total surface area, the absorption of the materials used in the room, what other elements in the room could absorb energy and/or reflect the rays, and so on. One way of doing this is ray tracing, where an impulse response of the room is obtained by setting up microphones in the room and then firing a starter gun to produce the impulse. The response could be used to design a system which mimics the room perfectly, but the response is so big and complex that realtime processing would be impossible. So replicating rooms with an algorithm is the way to go. With no idea of what kind of room we were trying to replicate, however, nor what kinds of absorption the room would present, we had problems knowing what we were looking for.

So as the work progressed we didn't bother too much with accurately reproducing real rooms, but focused on the audible results and getting the program to actually work.

5.2 Non-realtime versus realtime implementation
As mentioned earlier, a version of the algorithm was implemented in Matlab as a non-realtime version, using a subfunction that calculates all the necessary parameters from a theoretical user input. Besides the obvious advantage of having a prototype that all members of the group can reference while working on the realtime implementation, it was also beneficial not to have to deal with constraints on memory and CPU while trying to get a grip on the algorithm itself. However, there were drawbacks. Realtime issues naturally did not manifest during the prototyping phase, and it was also hard to measure the benefits of optimizing the matrix multiplication in Matlab, since Matlab is already heavily optimized for that purpose.

The main disadvantage of realtime processing is the lack of infinite, or at least comfortably large, processing power. Because of the time restrictions on the processing, the calculations had to be optimized and, if that was not enough, cut down and quality restricted. The advantage, though, is having playback in realtime, like using a reverb in live music rigs and recording software to let the user hear the effect while playing. In comparison to an offline implementation, a lot more time is spent on making a realtime application work: allocating memory right, optimizing calculations and using the DSP's built-in functions are just some of the aspects that had to be addressed. The hardest part was understanding the limitations of the DSP board's memory configuration and the related special commands; understanding what the compiler was trying to tell you; why things didn't work at all; why global variables weren't seen by subfunctions; and that Code Composer Studio's own flavor of the C language seems to have math functions that aren't exact replicas of those in the C math library. The hours spent on trying to understand Code Composer Studio and sifting through the wealth of help information provided was the biggest drawback of the project.

5.3 Improvements
A user interface was on our todo list, but was never implemented due to lack of time. This interface would contain controls for parameters like wet/dry ratio, gain, room size, and the like. The interface would ideally be able to communicate with the board in real time.

References

[1] Lilja, Ola. Algorithms for Reverberation - Theory and Implementation. Master Thesis, LTH, 2002.
[2] Wikipedia: Reverberation. http://en.wikipedia.org/wiki/Reverberation#Sabine_equation Visited: 2011-03-01.
[3] Rocchesso, Davide. Introduction to Sound Processing, section 3.6.2 Reverberation, 2003.
[4] Tonal Correction Filter. https://ccrma.stanford.edu/~jos/Reverb/Tonal_Correction_Filter.html
[5] Reverberation Time. http://hyperphysics.phy-astr.gsu.edu/hbase/acoustic/revtim.html
[6] Campbell, Spencer. An Implementation of a Feedback Delay Network - Final Project Report, 2008-12-09. http://twentyhertz.com/618_FinalProjectReport_SpencerCampbell.pdf
[7] Wikipedia: Hadamard matrix. http://en.wikipedia.org/wiki/Hadamard_matrix Visited: 2011-02-14.
[8] Gough, Brian J. An Introduction to GCC - for the GNU compilers gcc and g++, section 6.4 Optimization levels, 2005. http://www.network-theory.co.uk/docs/gccintro/index.html
[9] Chassaing, Rulph; Reay, Donald. Digital Signal Processing and Applications With The TMS320C6713 And TMS320C6416 DSK, section 3.2 The TMS320C6x Architecture, 2008.
[10] Spectrum Digital. TMS320C6713 DSK Technical Reference, 506735-0001 Rev. A, May 2003.


Part V
Speech Recognition Using MFCC

Harshavardhan Kittur, Manivannan Ethiraj, Kaoushik Raj Ramamoorthy, Mohan Raj Gopal

Abstract
In this project, we present one of the techniques to extract the feature set from a speech signal and implement it in a speech recognition algorithm using the TMS320C6713 DSK board. The key is to convert the speech waveform into some type of parametric representation for further analysis and processing. A wide range of techniques exist for parametrically representing the speech signal for the speech recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known and most popular, and is used in this project.

1 Introduction

1.1 Why Speech recognition?
Speech is the primary means of communication between people. For reasons ranging from the realization of human speech capabilities to the desire to automate simple tasks that inherently require human-machine interaction, research in automatic speech recognition has attracted a great deal of attention over the past few decades. Although there are numerous ways to model a speech signal and perform speech recognition in both hardware and software, no such system is stable for all kinds of speakers in the world. Our interest is to find out the intricacies of designing such a system by implementing it on a TMS320C6713 DSK board.

1.2 Common problems found in designing such a system
• People from different parts of the world pronounce words differently. Also, the rate at which they speak affects the implementation of a speech modelling system.
• Speech is usually continuous in nature and word boundaries are not clearly defined.
• Noise is generally a major factor in speech recognition and has to be carefully analysed while designing a system. A noisy environment limits the system performance.
• The rate of error in the recognition system depends on the amount of data stored in the system by training. When the number of words in the database is large and consists of similar sounding words (rhyming words), there is a good probability that one word is recognized as the other.

1.3 Tools Used
• MATLAB
• Code Composer Studio
• TMS320C6713 DSK Board
• Hi-Fi Microphone
• Stereo Speakers

2 Theory

2.1 Speech Recognition Algorithm
At the highest level there are a number of ways to do the complex task of speech recognition, but the basic principles are feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure of identifying the unknown speaker by comparing the extracted features from the voice input with those from a set of known speakers. The block diagrams are shown in Figures 1 and 2.

Figure 1: Feature extraction using MFCC

Figure 2: Feature matching using MFCC

2.1.1 Feature Extraction
In the feature extraction phase the speech can be parameterized by various methods such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others.

MFCC, which is used in this project, is perhaps the best known and most popular. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies capture the phonetically important characteristics of speech.

2.1.2 Mel Frequency Cepstrum Coefficients
These coefficients are derived from a type of cepstral representation of the audio clip (a cepstrum is nothing but a "spectrum-of-a-spectrum"). The difference between the cepstrum and the Mel-frequency cepstrum (MFC) is that in the MFC the frequency bands are positioned logarithmically (on the mel scale), which approximates the human auditory system's response more closely than the linearly-spaced frequency bands obtained directly from the FFT or DCT. This can allow for better processing of data, for example in audio compression. However, unlike the sonogram, MFCCs lack an outer ear model and hence cannot represent perceived loudness accurately.

MFCCs are commonly derived as follows:
1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the log amplitudes of the spectrum obtained above onto the Mel scale, using triangular overlapping windows.
3. Take the Discrete Cosine Transform of the list of Mel log-amplitudes, as if it were a signal.
4. The MFCCs are the amplitudes of the resulting spectrum.

MFCCs thus take human perception sensitivity with respect to frequencies into consideration, expressed in the mel-frequency scale, which is linear below 700 Hz and logarithmic above 700 Hz, and are therefore well suited for speech recognition.

2.2 Feature Matching
The feature matching phase involves the use of the Euclidean distance. In mathematics, the Euclidean distance or Euclidean metric is simply the ordinary distance between two points that one would measure with a ruler, which can be proven by repeated application of the Pythagorean theorem. By using this formula as distance, Euclidean space becomes a metric space. The distance is a measurement of how similar two user templates are: it quantifies the degree of dissimilarity between the compared feature vectors, and therefore we chose this method of comparison.

3 Implementation
Both the training and the recognition systems are the same up to the point where we find the MFCC coefficients. The training phase stores the coefficients, and the recognition phase compares the currently recorded coefficients with the stored ones. The steps that are implemented to complete our design are listed below, and the block diagram is shown in figure 3.

3.1 Level detection
When the speaker says a word, the system has to do silence detection and capture only the speech signal. The start of an input speech signal is identified based on a prestored threshold value. Speech is captured when it exceeds the threshold and is passed on to the framing stage. It is easier to combine this step with the frame blocking step.

3.2 Frame blocking
It is assumed that recorded speech is piecewise stationary, which means the signal is stationary for short periods of time. The sampling frequency for our system is 8 kHz and the speech is captured for 1 sec, which leaves us with 8192 samples. We divide the captured signal into a fixed number of overlapping frames (156 samples overlap) of sample length 256. Each frame consists of 256 samples of speech signal, and the subsequent frame starts from the 100th sample of the previous frame. This technique is called framing.

3.3 Windowing
After framing, windowing is applied to prevent spectral leakage. A Hamming window with 256 coefficients is used.

3.4 Fast Fourier transform
The FFT converts the time-domain speech signal into the frequency domain to yield a complex signal. We apply a 256-point radix-2 FFT to each frame; since the frame length is 256 samples, the total number of stages in the FFT is 8. The FFT algorithm in Rulph Chassaing's book [1] is used in our implementation. Speech is a real signal, but its FFT has both real and imaginary components.

3.5 Power spectrum calculation
The power in the frequency domain is calculated by summing the squares of the real and imaginary components of the signal. The second half of the samples in each frame is ignored, since the speech signal is real and the spectrum is therefore symmetric to the first half. By taking advantage of this property, the amount of computation is halved.
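
As an illustration of steps 3.2-3.5, a C sketch of the framing, windowing and power spectrum stages is given below. This is our own sketch, not the project code; the FFT itself (which fills the re/im arrays) is assumed to come from the radix-2 routine mentioned above:

    #include <math.h>

    #define FRAME_LEN  256
    #define FRAME_STEP 100            /* next frame starts 100 samples later */
    #define PI 3.14159265358979f

    /* Copy one frame out of the captured signal and apply a 256-point
     * Hamming window. */
    void frame_and_window(const float *speech, int frame_no, float *frame)
    {
        int n;
        for (n = 0; n < FRAME_LEN; n++) {
            float w = 0.54f - 0.46f * cosf(2.0f * PI * n / (FRAME_LEN - 1));
            frame[n] = speech[frame_no * FRAME_STEP + n] * w;
        }
    }

    /* Power spectrum from the FFT output; only the first half of the bins
     * is kept, since the input signal is real. */
    void power_spectrum(const float *re, const float *im, float *pow_spec)
    {
        int k;
        for (k = 0; k < FRAME_LEN / 2; k++)
            pow_spec[k] = re[k] * re[k] + im[k] * im[k];
    }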

Figure 3: Our Speech Recognition System

3.6 Mel-frequency wrapping
Triangular filters are designed using the Mel-frequency scale, with a bank of filters approximating the human ear. Twenty filters are chosen, uniformly spaced on the Mel-frequency scale between 0 and 4 kHz. The Mel-frequency spectrum is computed by multiplying the signal spectrum with this set of triangular filters. For a given frequency f, the mel of the frequency is given by

    B(f) = 1125 · ln(1 + f/700) mels    (1)

The edge frequencies of each filter are computed by substituting the corresponding mel values; once the edge frequencies and the center frequencies of a filter are found, these boundary points determine the transfer function of the filter. The transfer function of the triangular filters is given below:

    H(k, m) = 0                                        if f[k] < fc[m−1]
    H(k, m) = (f[k] − fc[m−1]) / (fc[m] − fc[m−1])     if fc[m−1] ≤ f[k] < fc[m]    (2)
    H(k, m) = (f[k] − fc[m+1]) / (fc[m] − fc[m+1])     if fc[m] ≤ f[k] < fc[m+1]
    H(k, m) = 0                                        if f[k] ≥ fc[m+1]

where f[k] is the frequency of the k:th sample, given by k·fs/N, and N is the number of samples in each frame (256 in our case). The width (resolution) of the filters is given by:

    φ = (φmax − φmin) / (M + 1)    (3)

where φmin is the lowest frequency of the filter bank and φmax is the highest frequency of the filter bank. The center frequencies on the mel scale are given by φc[m] = m · φ for m ∈ [1, 20], and the center frequencies on the frequency scale are given as

    fc[m] = 700 · (10^(φc[m]/2595) − 1)

Once the filter transfer functions are obtained, we can apply this filter bank to the power spectrum to obtain the mel-spectrum. This step is basically a frequency-warping operation where we change the frequency content of the signal based on the mel scale. This is elaborated in the equation below:

    Mel_spectrum[m] = Σ(k=0..N−1) Power_spectrum[k] · H[k, m]    (4)
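
The filter bank of equations (1)-(4) can be sketched in C as below. This is our own illustration; it uses the 2595·log10 form of the mel scale throughout (the natural-logarithm form in eq. (1) describes the same scale):

    #include <math.h>

    #define NFILT 20
    #define NFFT  256

    static float hz_to_mel(float f) { return 2595.0f * log10f(1.0f + f / 700.0f); }
    static float mel_to_hz(float m) { return 700.0f * (powf(10.0f, m / 2595.0f) - 1.0f); }

    /* Triangular response of eq. (2) at frequency f for a filter with the
     * given lower, center and upper edge frequencies. */
    static float tri(float f, float lo, float c, float hi)
    {
        if (f < lo || f >= hi) return 0.0f;
        return (f < c) ? (f - lo) / (c - lo) : (hi - f) / (hi - c);
    }

    /* Mel spectrum of eq. (4): weighted sums of the power spectrum.
     * fs is the sampling frequency (8 kHz in our system). */
    void mel_spectrum(const float *pow_spec, float fs, float *mel)
    {
        float dmel = hz_to_mel(fs / 2.0f) / (NFILT + 1);   /* eq. (3) */
        int m, k;

        for (m = 1; m <= NFILT; m++) {
            float lo = mel_to_hz((m - 1) * dmel);
            float c  = mel_to_hz(m * dmel);
            float hi = mel_to_hz((m + 1) * dmel);
            mel[m - 1] = 0.0f;
            for (k = 0; k < NFFT / 2; k++)
                mel[m - 1] += pow_spec[k] * tri(k * fs / NFFT, lo, c, hi);
        }
    }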

3.7 Log-energy spectrum
Once the mel-spectrum is obtained, we take the log-spectrum of the resulting signal. The log function is basically an amplitude compression where the lower values are boosted and the higher values are kept almost constant. This is given by:

    Log_energy_spectrum[m] = ln(Mel_spectrum[m])

3.8 Mel-frequency cepstral coefficients
The log mel spectrum is converted back to time: the discrete cosine transform (DCT) of the log mel spectrum yields the MFCCs. We use the DCT since the power spectrum and the log-mel spectrum are real signals. The resulting coefficients characterize the particular speaker and word, and are stored during the training phase.

3.9 Comparison in the feature matching phase
Once we have the MFCCs, matching can be performed. During the recognition or feature matching phase, the coefficients are again determined for the uttered word, and recognition is carried out by analyzing the Euclidean distance with respect to the stored coefficients, with an appropriate threshold calibrated to increase the word recognition rate.

4 Implementation in MATLAB
An initial feasible algorithm was implemented in MATLAB to emulate the different steps of the speech recognition that would later be implemented on the DSP board. Observing the various plots, we came to the conclusion that the recorded speech signal can be of varying length (time). Therefore, instead of time-warping the speech signal into a standard time domain, we decided that the speech is to be spoken for a fixed time duration on the DSP board. The maximal duration was set to 100 frames after repeated trials by different speakers, to encompass all speech parameters. We normalized the recorded speech signal and implemented the MFCC algorithm. The steps that encompass our MATLAB implementation are shown in figures 4 and 5.
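
A C sketch of the processing in steps 3.8 and 3.9 is given below; it is our own illustration, and the number of cepstral coefficients kept (NCEPS) is our assumption, not a figure from the report:

    #include <math.h>

    #define NFILT 20
    #define NCEPS 13                 /* assumed; the report does not state it */
    #define PI 3.14159265358979f

    /* MFCCs as the DCT of the log mel spectrum (step 3.8). */
    void mfcc_from_logmel(const float *log_mel, float *mfcc)
    {
        int n, m;
        for (n = 0; n < NCEPS; n++) {
            mfcc[n] = 0.0f;
            for (m = 0; m < NFILT; m++)
                mfcc[n] += log_mel[m] * cosf(PI * n * (m + 0.5f) / NFILT);
        }
    }

    /* Squared Euclidean distance used in the feature matching phase
     * (step 3.9); the stored template with the smallest distance wins,
     * subject to the acceptance threshold. */
    float euclid_dist2(const float *a, const float *b, int len)
    {
        float d = 0.0f;
        int i;
        for (i = 0; i < len; i++) {
            float diff = a[i] - b[i];
            d += diff * diff;
        }
        return d;
    }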

Figure 4: Recorded signal and signal after silence detection

Figure 5: Mel-spectrum and MFCC for all frames

5 Implementation on the DSP Board
The DSP board has limited on-chip memory (192K internal RAM), and we are required to fit the generated op-code into this memory along with the stack and heap space. This posed a difficult situation for us, hence we used only a minimal set of variables, both global and local. Further, we made sure that the sequential steps in the algorithm operated on pointers to the variables instead of creating copies of them. Important constants were stored in the program memory (as #define pre-processor directives) and other variables were instructed to be stored in the heap or stack (as #pragma pre-processor directives).

6 Tests and Results
In the speech training phase, the experimental setup was placed in a controlled environment where the noise was minimal and its effects could be disregarded. The training vectors (time-averaged MFCC coefficients) are obtained for different words like Cat, Dog, Elephant, Hippopotamus, Mouse and Tiger. These training vectors are stored in a header file to be compared with the test vectors. In the speech recognition phase, the training vectors are compared with the test vector using the Euclidean distance method, and the identified word is displayed using a normal printf statement in Code Composer Studio. There is a 90% match for certain speakers and less than 50% for some speakers. The words 'Cat' and 'Dog' have a higher recognition rate than the other four words. It is also possible to easily add other words to our training system.

7 Conclusion
We have successfully implemented an MFCC system for extracting features from voices, and we were able to identify words from different speakers using the extracted features. We learned the techniques of implementing a speech recognition system, using MFCC in particular. Our MATLAB implementation helped us a lot in completing the final implementation on the DSK board. We also learned to use Code Composer Studio and DSP/BIOS, and this project enhanced our experience of working in MATLAB and C. We learned to use LaTeX in the process of completing our report. Although the results obtained were not always as expected, the amount of knowledge obtained during this project is exceptional. Apart from the technical aspects, we learned to manage our time with proper planning. Overall, this project was challenging and was a good experience for us.

References

[1] Rulph Chassaing. Digital Signal Processing and Applications with the C6713 and C6416 DSK. John Wiley & Sons, 2005.
[2] Sigurdur Sigurdsson, Kaare Brandt Petersen and Tue Lehn-Schiøler. Mel Frequency Cepstral Coefficients: An Evaluation of Robustness of MP3 Encoded Music. Proceedings of the Seventh International Conference on Music Information Retrieval (ISMIR), 2006.
[3] Adarsh K.P., Deepak R., Diwakar R., Karthik R. Implementation of a Voice-Based Biometric System. Thesis submitted at R.V. College of Engineering, India, 2007.


Part VI
Face Detection, Tracking and Recognition

Asheesh Mishra, Mohammed Ibraheem, Shashikant Patil

1 Abstract
Face detection, tracking and recognition in video is a computationally intensive procedure, because it requires the processing of each and every pixel. We used the capabilities of the DSP board TMS320DM6437 from Texas Instruments, USA. The board uses the state of the art DaVinci DSP processor, also from Texas Instruments, which does all its computations on fixed point numbers. There are various parameters on which detection within a frame of the incoming video data stream can be based, such as edge detection or skin detection, depending on the desired final output; we used skin detection as our parameter to implement face detection and tracking in video. With it we can efficiently detect a human face, achieve better detection efficiency, and extract various features like the eyes, lips, nose and ears for further processing in our face recognition algorithm. To facilitate the implementation of our project we used Code Composer Studio (CCS), which comes along with the board. TI provides various built-in functions to get started with video projects, like the videopreview.c example file, which contains basic functions for processing the pixels in the current frame. We made use of those functions to understand how the system actually works and how we can manipulate the pixel values.

2 Introduction
Face recognition is becoming ever more important with the availability of cameras and the need for automated processing of videos to serve many purposes. A DSP gives the power to process video and extract meaningful information from it, and the TMS320DM6437 with Code Composer Studio gives algorithm developers the means to concentrate on developing powerful and efficient algorithms in less time, with many helping utilities. The system that we developed in our project consists of four main stages:

1 - Capture an image.
2 - Face Detection.
3 - Face Tracking.
4 - Face Recognition.

These stages are shown in the system block diagram in fig. 1. The first stage is to capture an image frame from the streaming video input to the TI TMS320DM6437 and process it to detect the face region in the captured image; this part of the image is then sent to the PCA module to generate the feature vector, which is compared with the pre-stored vectors in the database to find the nearest match (the face recognition stage).

3 YCbCr Color Space Model
A color space is simply a format for representing color, brightness, luminance and saturation in one way or another. Here, Y is the luminance component, whereas Cb and Cr are the blue and red color chrominance components. Luma (Y) is basically responsible for the brightness in an image and greatly influences the perception of an image, while the chrominance components are responsible for its color composition. Every pixel contains the YCbCr information in the 4:2:2 format: every other value in the series of video data is a Y component, and every fourth value is a Cb or Cr component. A thorough understanding of YCbCr was mandatory in our video processing project. As far as our project was concerned, we typically had to work mostly with the chroma components of an image.

4 Image Filtering for Noise reduction
Image filters are used to remove undesired image details, for example by smoothening the image. Here, we wanted to reduce the undesired pixels whose values are similar to skin color, so that false edges are not detected during further processing. There are various good image filtering algorithms available for reducing noise, like Gaussian noise filters, median image filters and others; one can always consult a good book on image processing for details. We made use of the median image filter to reduce the noise in our project. The filter works on the principle of the 8-neighborhood: if a pixel differs from its neighboring pixels by more than a certain threshold, it is set to the average value of those pixels. The main disadvantage of implementing this was that the process consumes a lot of time and hence slows down the overall performance of the final output. To accommodate this feature we had to reduce the actual frame rate and process only a few frames, to speed up the entire process.
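
The report calls this a median filter but describes an average-replacement rule; the following C sketch follows that description. It is our own illustration, with an arbitrary threshold value:

    #include <stdlib.h>

    #define THRESH 30   /* deviation threshold; value chosen for illustration */

    /* 8-neighborhood filter: a pixel deviating from the mean of its eight
     * neighbours by more than THRESH is replaced by that mean. Operates on
     * one 8-bit plane of width w and height h; src and dst must differ. */
    void neighborhood_filter(const unsigned char *src, unsigned char *dst,
                             int w, int h)
    {
        int x, y, dx, dy;
        for (y = 1; y < h - 1; y++) {
            for (x = 1; x < w - 1; x++) {
                int sum = 0, mean, p;
                for (dy = -1; dy <= 1; dy++)
                    for (dx = -1; dx <= 1; dx++)
                        if (dx != 0 || dy != 0)
                            sum += src[(y + dy) * w + (x + dx)];
                mean = sum / 8;
                p = src[y * w + x];
                dst[y * w + x] = (unsigned char)(abs(p - mean) > THRESH ? mean : p);
            }
        }
    }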

5 Edge Detection
The simplest approach after detecting a face is to extract some features from the detected face so that we can match them against the next face to be recognized. Detecting edges was, we thought, the simplest approach to feature extraction: each pixel is compared with its neighboring pixels against some threshold value. The reason for using edge detection was mainly to extract features from the detected face, like the eyes, nose and mouth. Since we have implemented Principal Component Analysis (PCA) for the recognition part, this step may not be of much use in that sense, but it actually helped the system become more robust in recognizing only faces. Different algorithms for edge detection are listed below:
• Canny edge detection
• Edge thinning
• Thresholding and linking
• Phase congruency based edge detection
• Other first-order methods
When implementing, we found that the thresholding and linking method could be implemented successfully both in the simulation and on the real-time TMS320DM6437 system. The threshold value depends on many factors, so it has to be adjusted to the ideal setting for successful differentiation and edge detection.

6 Face Detection and Tracking
Among all frames, it is an important task to detect the face for further processing, rather than recognizing other parts of the human body as a face; the face needs to be separated from the rest of the background. There are many algorithms available for this, like:
• Binary pattern-classification
• Window-sliding technique using background pattern
• Eye blinking pattern detection
• Skin color to find face segments using static background and lighting condition
• Appearance, face and movement detection

While studying the methods in Ref [1], we found the implementation of skin color detection of face segments, using a static background and appropriate lighting conditions in a lab environment, the most suitable and successful. Most skin color falls within a specific range of the chroma red component; in our case the skin color falls in the range 0x8A to 0x8C of chroma red. Since we are going to differentiate the face from the background, the chroma components therefore become much more important. To detect the face in video, we first had to store the current frame in a temporary array, which we then manipulate. To make the entire procedure more robust, we also had to apply a filter to remove the undesired noise coming from reflective surfaces; we made use of the median filter for this. We then set all the pixels falling in the skin color range to a specific color: the luminance (Y) component is set to a particular value which differentiates it from skin color, and we also set a specific limit to chroma blue, where setting it to 0xFF gave comparable results. Since, ideally, the image now consists of only the face, the entire Y component of the face region should hold the same value. Finally, this stored and modified frame has to be written back to the write cache buffer to be displayed on the monitor. This gave the expected results for detecting the face within the entire frame. Some images after processing this function are shown below.

Figure 3: After processing the Luma (Y) and Chroma (Cb)(Cr) components

Next, we had to locate the face in the frame, which is done by scanning the entire frame from the top left position until the end of the frame is reached. During the scanning we searched for continuous pixels holding the same values as skin color.

When a certain threshold is satisfied, a flag is raised to indicate that a face has been detected in the frame, and its position is saved in a pointer register. Then comes the tracking part, which requires that the detected face be tracked to its new position in each frame. Based on the status of the raised flag, a box is placed at the position captured from the face detection part; the box was drawn by setting the values of all the desired pixels to black, using the value in the pointer register.

Figure 4: After detection and VGA display of processed frame (Exp-1)

Figure 5: After detection and VGA display of processed frame (Exp-2)
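
A C sketch of the skin-marking pass might look as follows. It is our own illustration; the exact interleaving of the 4:2:2 data (here assumed to be Cb, Y0, Cr, Y1 per macropixel) depends on the capture configuration:

    #define CR_LO 0x8A   /* skin range of the chroma red component */
    #define CR_HI 0x8C
    #define MARK  0xFF   /* luma value used to mark skin pixels */

    /* Scan one interleaved 4:2:2 frame of n_pairs macropixels and force
     * the luma of skin-colored pixels to MARK. */
    void mark_skin(unsigned char *frame, int n_pairs)
    {
        int i;
        for (i = 0; i < n_pairs; i++) {
            unsigned char *p = frame + 4 * i; /* p[0]=Cb p[1]=Y0 p[2]=Cr p[3]=Y1 */
            if (p[2] >= CR_LO && p[2] <= CR_HI) {
                p[1] = MARK;
                p[3] = MARK;
            }
        }
    }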

7 Face Recognition
Common algorithms for face recognition include:
1. Principal Component Analysis (PCA)
2. Independent Component Analysis (ICA)
3. Support Vector Machine (SVM)
4. Hidden Markov Models (HMM)
5. Boosting and Ensemble
Among these algorithms, we found the PCA based eigenface algorithm [9],[5],[14] (see also Ref [11]) the most interesting and successful in the simulated environment using Matlab. PCA also has big advantages in the implementation, since it reduces the dimensions of the images that need to be stored in the database, and of course that helps us reduce the memory resources consumed in the real time system, given the hardware limitations [13]. We apply the algorithm in three stages, described as follows.

1. Creating the database: Before applying the PCA, we have to create the training database that contains the faces. First we reformat each image from a two dimensional image into a single image vector, by concatenating each row or column into one long vector. Then we combine the image vectors into one matrix called the trained matrix (T-matrix).

2. Generating the eigenfaces: Taking the T-matrix as input to this stage, we calculate the following matrices, which become the input to the recognition stage:
1. M-matrix: the mean values of the T-matrix (training database).
2. A-matrix: the centered images, generated by subtracting the M-matrix from the T-matrix.
3. Eigenfaces: the so-called eigenvectors; this features matrix contains the face features. We first calculate the covariance matrix by multiplying the A-matrix by its transpose, then find the eigenvector matrix and modify it by sorting the eigenvectors and removing the negative values. Finally, by multiplying the A-matrix by the modified eigenvector matrix we get the eigenfaces matrix.

3. Face recognition process: In this process we receive the three outputs from the previous stage as inputs, in addition to the input picture whose face we want to recognize. First we project the centered images into the face space by multiplying each column in the A-matrix, representing the corresponding image, by the eigenfaces; this gives us the projected images matrix. After that we project the input image using the same concept (center the image by subtracting the mean of the T-matrix and multiply it by the eigenfaces). Having the projected image set and the projected input image, we calculate the Euclidean distance between the projected input image and each projected image in the set; the test image should have the minimum distance to its corresponding image in the database.
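
The projection and matching of stage 3 reduce to a few dot products and distance comparisons. The C sketch below is our own illustration of the idea, with our own naming and array layout:

    #include <float.h>

    /* Project a centered image onto n_eig eigenfaces (stored row-wise,
     * n_pix values per eigenface). */
    void project(const float *eigenfaces, const float *img_centered,
                 int n_pix, int n_eig, float *proj)
    {
        int e, p;
        for (e = 0; e < n_eig; e++) {
            proj[e] = 0.0f;
            for (p = 0; p < n_pix; p++)
                proj[e] += eigenfaces[e * n_pix + p] * img_centered[p];
        }
    }

    /* Return the index of the stored projection closest (in Euclidean
     * distance) to the projected input image. */
    int nearest_face(const float *proj_db, const float *proj,
                     int n_db, int n_eig)
    {
        int best = 0, i, e;
        float best_d = FLT_MAX;
        for (i = 0; i < n_db; i++) {
            float d = 0.0f;
            for (e = 0; e < n_eig; e++) {
                float diff = proj_db[i * n_eig + e] - proj[e];
                d += diff * diff;
            }
            if (d < best_d) { best_d = d; best = i; }
        }
        return best;
    }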

The flowchart (Fig. 6) describes the algorithm in detail.

Figure 6: Flowchart of PCA Algorithm for Face Recognition

8 Problems faced
We faced several problems when implementing the reference model on the TMS320DM6437. The limited memory available onboard was one of the major bottlenecks: it has only 192K of RAM available, so not much data can be stored on it. Moreover, video data manipulation is very computationally exhaustive, so implementing every part of the PCA on the board, including the training set, as we initially intended, would have degraded the overall performance of the system drastically and proved virtually unrealistic. We therefore separated out the recognition part and implemented the calculation part of the PCA (like finding the eigenfaces and the training database) in Matlab itself, comparing the real-time data with these eigenfaces directly; the recognition part was thus implemented as a hybrid between the board and Matlab. The TMS320DM6437 being a fixed point processor was also a problem, since most calculations in the PCA use divisions, square roots and multiplications. We had to use a scaling technique, scaling the data up to convert it to fixed point and scaling it back down after calculation using shift operations; for the square root calculation we used the Babylonian method [13]. The slow response of the kit during processing of image data was another problem; we tried to overcome it by processing only a fewer number of frames, which improves the performance a bit.
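
The Babylonian method referenced above iterates x ← (x + n/x)/2. An integer C version, as a sketch of the kind of routine that can replace sqrt() on a fixed point processor:

    /* Integer square root by the Babylonian (Newton) iteration: average
     * the running estimate with n divided by it until the estimate stops
     * shrinking. Returns floor(sqrt(n)). */
    unsigned int isqrt_babylonian(unsigned int n)
    {
        unsigned int x = n, y;
        if (n < 2)
            return n;
        y = (x + n / x) / 2;
        while (y < x) {
            x = y;
            y = (x + n / x) / 2;
        }
        return x;
    }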

9 Conclusion and Future work
We were able to finish the face detection and tracking part of our project successfully, and the results were found to be as expected. In a highly noisy environment (like an improper background) the performance degrades marginally, but the system is still able to detect most of the time. Although we were able to implement most of the recognition phase on the kit, including creating a database for at least three distinct persons, the output results were not as expected, and due to lack of time we could do very little to optimize the recognition part. The project was much more complex to implement than we initially thought, but the exposure we got during the implementation phase was very satisfying, and the project also honed our skills in the embedded C language and Code Composer Studio.

References

[1] K. Sandeep, A.N. Rajagopalan. "Human Face Detection in Cluttered Color Images using Skin Color and Edge Information". Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2002.
[2] Sanjay Kr. Singh, D.S. Chauhan, Mayank Vatsa, Richa Singh. "A Robust Skin Color Based Face Detection Algorithm". Tamkang Journal of Science and Engineering, Vol. 6, No. 4, pp. 227-234, 2003.

[3] Minku Kang. PCA-based Face Recognition in an Embedded Module for Robot Application. ICROS-SICE International Joint Conference, 2009.
[4] William H. Press. The Art of Scientific Computing, 3rd edition. Cambridge University Press. ISBN-13: 978-0521880688.
[5] http://www.mathworks.com/matlabcentral/fileexchange/17032-pca-based-face-recognition-system
[6] http://www.eit.lth.se/fileadmin/eit/courses/eti121/Seminar/lect2_2011.pdf
[7] http://www.eit.lth.se/fileadmin/eit/courses/eti121/Seminar/lect1_2011.pdf
[8] http://www.csus.edu/indiv/p/pangj/aresearch/video_compression/ref/report_summer09_Shriram%20_face_detection.pdf
[9] http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
[10] http://www.face-rec.org/algorithms/
[11] http://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors
[12] http://cswww.essex.ac.uk/mv/allfaces/faces94.html
[13] http://en.wikipedia.org/wiki/Face_detection
[14] http://www.eit.lth.se/fileadmin/eit/courses/eti121/Reports/ASP_Reports_2010.pdf

Part VII
Circular Object Detection

Ajosh K Jose, Qazi Omar Farooq, Sherine Thomas, Sreejith P Raghavan

Abstract
This project deals with the implementation of a circular object detection method on the DSP TMS320DM6437 Evaluation Module. The aim is to detect a circular moving object when the background is kept fixed. The first step is to create the reference frame by capturing the first frame sent from the video camera. The successive frames are then subtracted from the reference frame to obtain the moving object, and further processing is done only on the area of the moving object, which reduces the processing time required for each successive frame. In the second step, the Canny edge detection algorithm is employed to extract the edges of the moving object. Finally, the object is checked for circular shape using a modified Circular Hough Transform. If a circular object is detected in the frame, it is marked in the video.

The main objective of edge detection is to reduce the amount of data which is to be processed by keeping the structural content intact. From the comparison it was clear that canny edge detection algorithm will be more suitable because of the fine details available in the output which will be needed for tracking a moving object. we added the method of background substraction. We did an intial study using Matlab to find the most suitable algorithm for edge detection. Sreejith P Raghavan73 Figure 1: Block Diagram of the steps 2. For reducing the processing time. So by applying the canny edge detection algorithm on that particular area the processing time can be reduced considerabily. The main edge detection algorithms are • Prewitt Method • Canny Method • Sobel Method • Roberts Method • Laplacian of Gaussian Method • Zero-Cross Method Please refer [1] for more details about edge detection algorithms.Ajosh K Jose. In this method we substract the static background from the current frame to detect the actual area of interest. Sherine Thomas. The processing time and output of these algorithms vary very much.1 Edge Detection Edge detection is an important step in image processing. There are several algorithms proposed by different people for edge detection. Canny Edge detection algorithm consists of five steps. Qazi Omar Farooq. The input to the Canny edge detection algorithm is the gray scaled image. They are . The results of this study is shown in figure 2. But the actual problem of applying the canny edge detection algorithm to the complete frame is the large processing time required.

Smoothing takes long processing time due to the matrix multiplication involved. . For detailed description. 2.1. please refer [2] or [3]. Usually gaussian filter is employed for this step. The gaussian matrix is shown in the figure 3.4. The image is smoothened by applying a Gaussian filter with a standard deviation of 1.1 Smoothing Smoothing is done to reduce the noise level in the image. In this project we skipped this step since we are processing only the moving object and the effect on the edge detected image was found to be less. This step helps to remove unwanted edges detected due to the noise present in the image.74 Circular Object Detection Figure 2: Edge detection algorithm outputs • Smoothing • Finding gradients • Non-maximum suppression • Double thresholding • Edge tracking by hysteresis A brief idea about different steps of canny edge detection algorithm is given below.

1.1.3 Non-maximum suppression For suppressing the non-maximum. It will be either . Qazi Omar Farooq. angle = Gx · pixelvalue Gy · pixelvalue (1) (2) The output after applying the sobel matrix is shown in the figure 5. The sobel matrices are shown in figure 4. It is clearly visible that all the edges in the image are highlighted. Then the gray scale sum is calculated by the equation. the edges where the gray scale intensity varies most is determined. The angle obtained is rounded to the nearest 45 degree with which the gradient direction of all the pixels is determined. the angle calculated from the previous step is used. 2. sum = |Gx · pixelvalue| + |Gy · pixelvalue| The gray scale angle is calculated by the equation.2 Finding gradients By finding the gradients.Ajosh K Jose. Sreejith P Raghavan75 Figure 3: Gaussian Matrix Figure 4: Sobel Matrix 2. This is done by applying sobel matrix to each pixel in the image. The matrices consists of Gx and Gy matrix. Sherine Thomas.

Thus in this step all the gradient edges with local maxima will be selected. the remaining edges are classified into strong and weak edges. The other pixels will be cancelled. it will be considered as part of the edge and will be retained.90 or 135 degrees. Strong edges will be retained and they will be part of the edges.1.5 Edge tracking by hysteresis For edge tracking by hysteresis. If it is connected to any of the strong pixels. Figure 6 shows the output image after non-maximum suppression. Then the current pixel strength is compared with the positive and negative gradient direction. all the weak edges are checked for connection with strong pixels in the neighbourhood. Weak edges will be further checked in the next step. then it is choosen and the other values will be suppressed. .76 Circular Object Detection Figure 5: Image after finding the gradients 0. If the current pixel have more strength than the positive and negative gradient direction. 2.4 Double thresholding For double thresholding.1.45. Figure 7 shows the output image after doing the hysteresis. 2.

Sreejith P Raghavan77 Figure 6: Image after non maximum suppression Figure 7: Image after Hysterisis . Sherine Thomas.Ajosh K Jose. Qazi Omar Farooq.

Six pixels are searched for determing the circle. Circular Hough Transform algorithm is applied to the output of Canny edge detected image to find the edges of circles in the image. Code composer studio was used for compiling and downloading the code. (x1 − x0 )2 + (y1 − y0 )2 = r2 (3) All the pixels will be searched for the possibility of a circle with a radius of particular limit. In the first step the algorithm was tried in matlab to determine the efficiency. The pixels are in . 3 Implementation The implementation of the project was done in two steps. • Video Camera • DM-6437 Evaluation board • Television The platform DM6437 offers an interface in the framework. Figure 8 shows the method used for determining the circle. For detailed description on CHT please refer [4]. This algorithm is based on the equation. the matlab implementation was converted into a C implementation. from which we can access the input video stream frame by frame. [5] and [6].78 Circular Object Detection Figure 8: Circle detection method 2. If all the six pixels have edge information within the particular radius.2 Circular Object Detection Circular object is detected by applying circular Hough transform (CHT) algorithm on the edge detected frame. In the second step. The hardware tools used for testing the project included. then it is considered as a circle.

Allborg University [4] Mohamed Rizon. So the frame size is 720x576. Nov. American Journal of Applied Sciences. Sreejith P Raghavan79 YCbCr format. Different algorithms used in signal processing were familiarized in this course. For this project we took into consideration a PAL system. Due to the lack of enough processing time. where Y is the luma component and Cb& Cr are the chroma components with the ratio 4:2:2. International Journal of Image Processing (IJIP). References [1] Raman Maini. Also if the processing power permits. 2005. PAMI-8(6):679698. The processing consists of reading the frame buffer and updating the frame data and writing it back into the buffer. Also edge detection was implemented only in the selected area where a moving object is detected. We felt the processing power of DM6437 Evaluation kit is not enough for handling complex algorithms used in video processing. Volume (3): Issue (1) [2] John Canny. Himanshu Aggarwal. Advanced image processing. It was a nice experience working with this project which introduced us to the world of programming DSP processors. Study and Comparison of Various Image Edge Detection Techniques. Object Detection using Circular Hough Transform.Ajosh K Jose. Labortary of computer vision and media technology. So as future work we are planning to optimize our current implementation and add more reliable algorithms for circular object detection which can detect circular objects with different radius. Working with DM6437 and code composer studio was a nice experience. we were not able to implement more reliable algorithms for circular object detection in the DM6437 processor. Pattern Analysis and Machine Intelligence. Dr. . we would like to implement more complex algorithms like human hand detection and tracking the movement of hand. 4 Conclusion & Future Work Circular object detection was successfully implemented and tested. Qazi Omar Farooq. 1986. IEEE Transactions on. A computational approach to edge detection. [3] Canny Edge Detection implementation tutorial. The two labs which were done as part of this course were helpful in familiarizing the tool and the DSP kit. Sherine Thomas.

Int. [6] Mohamed Roushdy. April. 2008. Ignacy Duleba. 2007. ETI121. [7] Project Report 2010. Circular Object Detection Using A Modified Hough Transform. J. Comput. Detecting Coins with Different Radii based on Hough Transform in Noisy and Deformed Image. Sci. Algorithm in Signal Processing Course. . GVIP Journal.. Issue 1. Math. Appl. Volume 7.80 Circular Object Detection [5] Marcin Smereka.

Sign up to vote on this title