Face Recognition On FPGA: Spring Term Report
Ramzi Madi
200200055

Robin Lahoud
200200271

Bassem Sawan
200200267

Supervisor
Prof. Mazen Saghir
TABLE OF CONTENTS
1. INTRODUCTION ......................................................................................................... 5
1.1 Problem Definition .............................................................................................. 5
1.2 Applications.......................................................................................................... 5
1.3 Motivation and Objectives ................................................................................ 7
2 LITERATURE REVIEW ............................................................................................... 7
2.1 Still-Image versus Video.................................................................................... 7
2.2 Algorithms for Face Recognition.................................................................... 8
2.2.1 Principal Component Analysis ..................................................... 8
2.2.2 Linear Discriminant Analysis ........................................................................ 9
2.2.3 Independent Component Analysis ............................................................ 10
2.2.4 Neural Networks ........................................................................................... 11
2.2.5 Genetic Algorithms....................................................................................... 12
2.3 FPGA Implementation of Face Recognition............................................... 14
2.3.1 FPGA Implementation using PCA ............................................................. 14
2.3.2 FPGA Implementation Using Composite/Modular PCA......................... 15
2.3.3 FPGA Implementation using Artificial Neural Networks......................... 17
2.3.4 FPGA Implementation using Genetic Algorithm...................................... 18
2.3.5 FPGA Implementation using Evolutionary Reconfigurable Architecture .......... 20
2.4 Issues with Face Recognition........................................................................ 21
3. DESIGN....................................................................................................................... 23
3.1 System Specifications ..................................................................................... 23
3.1.1 Algorithm........................................................................................................ 23
3.1.2 Inputs and Outputs....................................................................................... 24
3.1.3 Timing Constraints ....................................................................................... 25
3.2 System Description........................................................................................... 25
3.3 Hardware FPGA Components........................................................................ 27
3.3.1 V2MB-1000 Overview.................................................................................. 27
3.3.2 MicroBlaze Processor.................................................................................. 29
3.3.3 OPB Interface ............................................................................................... 29
3.3.4 BRAM Controller........................................................................................... 30
3.3.5 External Memory Controller........................................................................ 30
3.3.6 Ethernet MAC Controller/Driver ................................................................. 30
3.3.7 UARTLite Controller/Driver ......................................................................... 31
3.3.8 On-board Hardware Multipliers .................................................................. 31
3.4 Software FPGA Components ......................................................................... 32
3.5 PC Application Components.......................................................................... 33
3.6 Memory Management ....................................................................................... 33
3.7 PCA Algorithm ................................................................................................... 35
3.7.1 Training Phase.............................................................................................. 35
3.7.2 Recognition Phase ....................................................................................... 36
3.8 Project Budget ................................................................................................... 37
4. IMPLEMENTATION .................................................................................................. 38
4.1 Modeling Algorithm in MATLAB.................................................................... 38
4.1.1 Implementation Details................................................................................ 38
LIST OF FIGURES
Figure 1: Applications of Face Recognition .......... 6
Figure 2: Face Recognition Algorithms ............... 13
Figure 3: System Block Diagram ...................... 26
Figure 4: V2MB-1000 Board ........................... 28
Figure 5: P160 Communications Module ................ 29
Figure 6: Ethernet CAT5 Cable ....................... 37
Figure 7: RS-232 Serial Cable ....................... 37
Figure 8: Sample Face ............................... 38
Figure 9: Average Face .............................. 38
Figure 10: Eigenfaces ............................... 59
Figure 11: FPGA Comparison .......................... 68
Figure 12: Total Performance ........................ 68
LIST OF TABLES
Table 1: MATLAB Functions Used ...................... 38
Table 2: MATLAB Implementation Results .............. 39
Table 3: C Functions Used ........................... 41
Table 4: Implementation Descriptions ................ 64
Table 5: Implementation 1 Results ................... 65
Table 6: Implementation 2 Results ................... 66
Table 7: Implementation 3 Results ................... 66
1. INTRODUCTION
1.1 Problem Definition
Face recognition is a form of biometric identification that relies on data acquired
from the face of an individual. This data, which can be either two-dimensional or three-dimensional in nature, is compared against a database of individuals. In recent years, face
recognition has gained popularity among researchers all over the world. With
applications ranging from security to entertainment, face recognition is an important
subset of biometrics.
In real world applications, it is desirable to have a stand-alone, embedded face
recognition system. The reason is that such systems provide a higher level of robustness,
hardware optimization, and ease of integration. As such, we have chosen the FPGA as a
reconfigurable platform to carry out our implementation. Ultimately, the stand-alone
system may be implemented on an ASIC, a dedicated processor, or even an FPGA chip,
depending on the trade-offs in speed, portability, and reconfigurability.
1.2 Applications
Face recognition systems have gained a great deal of popularity due to the wide
range of applications in which they have proved useful. Broadly, these applications fall
into two main categories: commercial applications and research applications.
From a commercial standpoint, face recognition is practical in security systems
for law enforcement situations. It is in places like airports and international borders that
the need arises for a face recognition system that identifies individuals. Another
application of face recognition is the protection of privacy, obviating the need for
[Figure 1: Applications of Face Recognition — a diagram dividing applications into commercial (security, entertainment, unified PIN) and research (computer vision, pattern recognition) branches]
approach, delegating the more mathematically intensive tasks to the hardware while
controlling the algorithm procedure in software. Our aim is to achieve a speed up in the
process of recognition through the use of multiple parallelized components on the FPGA
while maintaining high accuracy in the results.
2 LITERATURE REVIEW
2.1 Still-Image versus Video
In the literature, two main forms of face recognition exist: still-image-based face
recognition and video-based face recognition. Still image face recognition relies on
classifying an individual based on a single image obtained from a still shot camera.
Conversely, video based face recognition relies on a sequence of frames to extract more
information about the face of a subject.
An inherent advantage of using still-image-based face recognition over video-based systems is that the images are of higher resolution. As a result, current face
recognition algorithms are able to recognize a face more accurately. Furthermore, still-image-based recognition is useful in controlled environments where pose and
illumination are relatively fixed. One example of such an environment is taking a
subject's photograph at an airport check-in [1]. The disadvantages of still-image-based
face recognition occur when such a controlled environment is not easily attainable. An
example of this scenario would be a security camera used to identify a subject in a public
place. In this case, video-based recognition yields better results.
The clear advantage of video-based face recognition occurs in situations where
the image resolution is low and the video feed is continuous. Video-based algorithms
capitalize on both spatial and temporal variations in a subject's face. Nevertheless, a
natural disadvantage is the low resolution of the images being captured [1]. Since an
individual might be located at a distance, the pixels that represent this individual's face
might not constitute a sufficient information base for the algorithm to operate correctly.
Hence, the need for the two different approaches occurs in different situations.
limited. The algorithm basically involves projecting a face onto a face space, which
captures the maximum variation among faces in a mathematical form.
During the training phase, each face image is represented as a column vector, with
each entry corresponding to an image pixel. These image vectors are then normalized
with respect to the average face. Next, the algorithm finds the eigenvectors of the
covariance matrix of normalized faces by using a speedup technique that reduces the
number of multiplications to be performed. This eigenvector matrix is then multiplied by
each of the face vectors to obtain their corresponding face space projections. Lastly, the
recognition threshold is computed by using the maximum distance between any two face
projections [2].
In the recognition phase, a subject's face is normalized with respect to the average
face and then projected onto face space using the eigenvector matrix. Next, the Euclidean
distance is computed between this projection and all known projections. The minimum
value of these comparisons is selected and compared with the threshold calculated during
the training phase. Based on this, if the value is greater than the threshold, the face is
new. Otherwise, it is a known face [2].
to lighting, pose, and expression variations [3]. The drawback is that this algorithm is
significantly more complicated than PCA.
As an input, LDA takes in a set of faces with multiple images for each individual.
These images are labeled and divided into within-classes and between-classes. The
former captures variations within the image of the same individual while the latter
captures variation among classes of individuals. LDA thus calculates the within-class
scatter matrix and the between-class scatter matrix, defined by two respective
mathematical formulas. Next, the optimal projection is chosen such that it maximizes
the ratio of the determinant of the between-class scatter matrix of the projected samples
to the determinant of the within-class scatter matrix of the projected samples [3]. This
ensures that the between-class variations are assigned higher weight than the within-class
variations. To prevent the within-class scatter matrix from being singular, PCA is usually
applied to the initial image set. Finally, a well-known mathematical formula is used to
determine the class to which the target face belongs. Since we have reduced the weight of
within-class variation, the results will be relatively insensitive to such variations.
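In symbols, this is the standard Fisher criterion (the explicit formula is not reproduced in the source; the notation below is the conventional one): the optimal projection satisfies

```latex
W_{\mathrm{opt}} = \arg\max_{W} \frac{\left| W^{T} S_{B} W \right|}{\left| W^{T} S_{W} W \right|}
```

where $S_B$ is the between-class scatter matrix and $S_W$ is the within-class scatter matrix of the training images.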
the phase spectrum. Indeed, it is the phase spectrum that contains information which
humans use to identify faces [4].
The ICA implementation of face recognition relies on the infomax algorithm and
represents the input as an n-dimensional random vector. This random vector is then
reduced using PCA, without losing the higher order statistics. Then, the ICA algorithm
finds the covariance matrix of the result and obtains its factorized form. Finally,
whitening, rotation, and normalization are performed to obtain the independent
components that constitute the face space of the individuals. Since the higher order
relationships between pixels are used, ICA is robust in the presence of noise. Thus,
recognition is less sensitive to lighting conditions, changes in hair, make-up, and facial
expression [4].
exponentially increases the training time, while not including enough results in poor
recognition rates. Once this neural network is formed for each person, it must be trained
to recognize that person. The most common training method is the back propagation
algorithm [8]. This algorithm sets the weights of the connections between neurons such
that the neural network exhibits high activity for inputs that belong to the person it
represents and low activity for others. During the recognition phase, a reduced image is
placed at the input of each of these networks, and the network with the highest numerical
output would represent the correct match.
The main problem with neural networks is that there is no clear method to find the
initial network topologies. Since training takes a long time, experimenting with such
topologies becomes a difficult task [8]. Another issue that arises when neural networks are
used for face recognition is that of online training. Unlike PCA, where an individual may
be added by computing a projection, a neural network must be trained to recognize an
individual. This is a time consuming task not well suited for real-time applications.
are further subdivided into F-tables and T-tables, where each image occupies a row in the
table. Initially, the rows in the F-tables and T-tables do not match. However, by gradually
changing some of the F-table values to don't-cares, some rows end up matching with
each other. Hence, the F-table obtains the generalization ability. The evolution process
ensures that the modified F-table includes as many rows in the T-table as possible. Once
evolution is complete, the modifications that result in the best fitness are chosen for each
category (target person and unknown people) and applied to the F-table [9].
During the recognition phase, the input image is passed through the tables that
correspond to both categories. Two counters keep track of the number of pixel matches in
each of the categories and the counter with the highest value classifies the input face as
belonging to the corresponding category [9]. The obvious drawback of this algorithm is
that entire tables have to be created whenever a new individual is to be detected. As in
neural networks, the scalability of this algorithm is hindered by the exponential
complexity involved when training for multiple target faces.
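The pixel-matching step with don't-cares can be sketched in software as follows (the cited work realizes this with AND-gate planes in hardware; the mask encoding and names here are assumptions for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Count how many 8-bit pixels of the input match a table row.
 * Mask bits cleared to 0 mark "don't-care" bit positions, so those
 * bits are ignored in the comparison. */
size_t count_matches(const uint8_t *input, const uint8_t *row,
                     const uint8_t *mask, size_t npixels) {
    size_t matches = 0;
    for (size_t i = 0; i < npixels; i++)
        if (((input[i] ^ row[i]) & mask[i]) == 0)
            matches++;
    return matches;
}
```

Running this count for both categories (target person and unknown people) and picking the larger counter mirrors the two hardware counters described above.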
[Figure 2: Face Recognition Algorithms — a taxonomy covering Principal Component Analysis, Linear Discriminant Analysis, Independent Component Analysis, Neural Networks, and Genetic Algorithms]
The face recognition system achieved a recognition time of 212 µs. The image
size was 20 × 20 pixels and the FPGA board used was a Xilinx Virtex-II Pro (XC2VP7)
clocked at 100 MHz. The system made use of approximately 18% of the gates available
on the FPGA. At a more detailed level, the bit width was 8 bits for the input face, 7 bits
for the Eigenface, and 18 bits for the Eigenspace [5].
serves two main functions. The first function is that of an accumulator that reads results
from the processing elements and accumulates the results in registers (one register for
each of the twenty processing lanes). Secondly, the classification block finds the face
with the minimum distance to the face under test and stores its index [7].
The above-mentioned system was implemented on an Altera FPGA (synthesized with Quartus)
clocked at 91 MHz. It was able to recognize a face from a database of 1000 images in 11
milliseconds. The performance of this implementation can be attributed to the parallel
hardware blocks used in performing the necessary calculations for the algorithm. Further
to this, the design can be scaled for larger databases by simply adding more processing
elements in parallel. This will yield an even higher throughput of data and improved
performance for larger sized databases [7].
Another FPGA implementation strategy that yields some good performance
results with Composite PCA relies on two process blocks, 16 pairs of which are
connected in parallel for high throughput calculations. The first block reads in the
eigenvectors and the test image and performs the necessary multiplications. This result is
then passed to the second processing block, which computes the distance using a reduced
formula designed to simplify the hardware implementation of distance calculations. All
16 blocks are connected to a distance grouper and a comparator, used to eliminate all
redundant distance calculations and find the smallest distance, respectively [6].
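The cited "reduced formula" is not spelled out in the source; one common hardware simplification (an assumption here, not necessarily the exact form used in [6]) expands the squared Euclidean distance,

```latex
\| x - e_i \|^{2} = \| x \|^{2} - 2\, x^{T} e_i + \| e_i \|^{2}
```

and observes that $\|x\|^2$ is common to every candidate, so only $\|e_i\|^2 - 2\,x^T e_i$ needs to be computed per face, and the square root can be skipped entirely since it preserves the ordering of distances.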
The above hardware design was implemented on an Altera FPGA (synthesized with Quartus II
and clocked at 100 MHz) and was able to perform face recognition on a database of 10 faces
in 3.88 milliseconds. A total of 7,820 logic elements were used, 2,348 of which were flip-flops. Again, performance can be attributed to the highly parallel nature of the hardware
design and the composite algorithm used.
Both the C program and the training images were stored on BRAM, and the system
included peripherals such as OPB UART and the OPB GPIO bus. In order to increase the
speed of the neuron updating process, the HUM contains 4 parallel update units that are
capable of updating 4 neurons at a time. A finite state machine controls the floating-point
multipliers and the output is stored on 4 local registers. The MicroBlaze then reads off
these registers continuously until the update process is complete [8].
The results of the experiment showed that the HUM occupied 42% of the FPGA
and the MicroBlaze occupied 7%. Feed-forward and backward computations took around
20 ms to complete, MicroBlaze software updating took 173 ms, and HUM hardware
updating took 1.4 ms. The speed-up in updating was over 10x while the speed-up over a
software implementation was around 1.7x. This demonstrates that the algorithm contains
inherent parallelism which cannot be exploited with a general-purpose processor [8].
unknown person. Since the input image is fed to all the AND gates simultaneously, the
matching process is carried out in parallel [9].
During the training process, 8-bit images were used to represent the faces. To
implement the chromosome evolution technique, the byte that represents each pixel was
manipulated at 8 different levels. The first level replaces the least significant bit with a
don't-care, and each successive level adds a don't-care to the next least significant bit. The
last level consists of all eight bits replaced by don't-cares. Moreover, to test the
implementation, a database of 100 images was used, representing 5 individuals in 20
different poses. The dimensions of the original images were 240 by 240 pixels, but they
were preprocessed and reduced to 8 by 8 pixels. Also, the F-tables and T-tables were
each assigned 10 of the 20 poses for the individuals [9].
To synthesize the circuits, a logic synthesizer was employed. The average number
of gates required for each person, including the counter and maximum detector units,
amounted to 1,334. The presence of don't-cares allowed the gate count to be
reduced to less than one tenth of the original. Using an FPGA board with a Xilinx XC4010 chip, the
identification accuracy of the system was 97.2% and identification took place within 1
µs. This is due to the intrinsic hardware parallelism found in the AND gate planes.
Furthermore, fault tolerance tests were performed on the system. Random stuck-at-0 and
stuck-at-1 faults were injected at the outputs of the AND gates, and an accuracy upwards of 90% was
maintained even with a stuck-at faulty gate ratio of 18%. Additionally, the system
exhibited graceful degradation as more stuck-at faults were introduced [9].
4 separate SRAM banks that can be accessed simultaneously by the FPGA. As such, 4
different images could be processed in parallel. The EM module was implemented by a
hybrid parallel genetic algorithm processor [10].
For testing, 386 images of 39 people were stored in the database. Each image
consisted of 128 × 128 grayscale pixels. It took approximately 1,000 iterations on the
images in order for the optimal filter combination to be obtained. After evolution, face
recognition rates increased by 63.4% when using images with poor illumination and
noise. When noise was added to the image, the rate increased by 36.5%. These figures
demonstrate the robustness of the system to changes [10].
on creating a separate linear illumination subspace. This is similar to the space created to
capture face variations, except that it captures lighting variations [1].
Pose variation also impairs the face recognition process. Pose variation becomes
especially pronounced when it is combined with illumination changes. One solution to
the pose variation problem involves obtaining images with multiple views of an
individual. In this case, multiple poses are available during both training and recognition.
During the recognition process, each pose is aligned with a similar pose in the database to
achieve correct classification. The obvious drawback is that multiple views of an
individual are not always available. A more popular solution involves using multiple
poses during training but only a single pose during recognition. One such implementation
creates an Eigenspace for each pose to achieve pose-invariant recognition [1].
The problem of facial expression variation is also common in the literature. If
only one image of an individual is available, recognition accuracy drops considerably.
However, if many images are available, algorithms like PCA can absorb these changes. It
is important to note that during expression changes, parts of the face remain largely
unchanged. As a result, algorithms that segment the face are more robust to these
variations [1]. Many databases available today contain training images with multiple
expressions, and face recognition systems have been capable of making accurate image
classifications despite expression variations.
Lastly, it is important to discuss face detection in the context of the face
recognition problem. The need for face detection arises when one or more faces must be
extracted from an image. Furthermore, face detection and extraction is essential to reduce
external factors that might hinder the recognition process. One common method of face
detection relies on the use of Haar classifiers. These classifiers sweep through the image
and apply several filters to detect the presence of a face. Another method, mentioned
earlier, relies on skin color to detect a face.
As such, face recognition is a growing field with potential applications in security,
entertainment, and personal identification. The recognition algorithms can be grouped
into mathematical/statistical (PCA, ICA, LDA) algorithms and biological (NN, GA)
algorithms. Many of these algorithms have been implemented by several researchers on
FPGA boards with high recognition rates and recognition times within the margin of real-time applications. However, long training times and the scalability of face recognition have
been recurring concerns in all of these implementations. Finally, common face
recognition problems include illumination changes, pose variations, and the issue of face
detection and extraction.
3. DESIGN
3.1 System Specifications
3.1.1 Algorithm
Having researched the various algorithms for face recognition, we found that the
two most popular hardware implementations are PCA and Neural Networks. As stated
before, the advantages of PCA are its robustness, parallelizability, and relative simplicity.
Its disadvantages are its sensitivity to lighting and pose variations. On the other hand, the
Neural Networks approach provides strong accuracy but limits the number of individuals
that can be included in the database due to the long training periods involved.
We have chosen to adopt the PCA algorithm for face recognition for several
reasons. Firstly, the environment that will be used to obtain the individual face images is
controlled and hence lighting and pose variation effects can be minimized. Secondly,
since a face can be subdivided into multiple regions, pattern recognition can be applied in
parallel, resulting in faster face recognition. Lastly, PCA allows us to quickly add
individuals to the face database, making it better suited for real time applications.
On the FPGA end, the data files are received, parsed, and stored in DDR memory.
Then, the recognition stage runs on the MicroBlaze core with the assistance of on-board
multipliers and the results are displayed on the HyperTerminal through a serial interface.
Moreover, the V2MB-1000 board contains an RS232 port that allows for serial
communications, and a JTAG port, which is connected to the parallel port of a PC so that
bit stream configurations can be downloaded to the FPGA.
The board we are using comes with the P160 Communications Module-2
expansion. This module provides us with several different functions, but our use of the
board is restricted to Ethernet. This function consists of a Broadcom chip and an RJ45
connector to which the Ethernet cable is hooked up.
MicroBlaze core as well as by input devices or interrupts. The outputs consist of the
MicroBlaze processor and any output indicators that may be used.
that receives a frame from the Ethernet port and stores it in a specified memory location
on the board. In addition, we were able to add the EMAC core to the system using the
corresponding pin constraints and correct signal matching.
back from DDR memory. This concept will be discussed further in the implementation
section of this report.
of the current face and the projection of each of the faces. In mathematical terms, the
magnitude of the difference between each pair of size-M vectors must be computed.
Although the operation is mainly subtraction, we did not design a custom hardware
comparator since we realized that calculating the distance to all the face projections is not
a bottleneck. For this operation, the face projection is stored in BRAM whereas the list of
projections is found in DDR memory, where they were dumped during the initialization
phase of the system. Lastly, the code on the MicroBlaze must then transmit the above
results to the HyperTerminal through the serial interface for the user to see. As mentioned
earlier, these results include the projection distances and the ID of the recognized face.
total of 720Kbits or approximately 90KB. It is located on the Virtex-II FPGA itself, and
thus has the fastest access time compared to all other types of memory on the board.
External memory is essentially a 16M × 16 DDR memory that provides us with 32MB of
storage space. This memory lies on the board external to the FPGA, and thus has a longer
access time. Ideally, we would have opted to store all data in BRAM memory, but due to
the constraint in size, we are forced to store the data initially in External memory. Below
are the memory requirements assuming a 150 × 125 pixel image:
Target face:
Number of entries = 150 × 125
Bits per entry = 32 (since of type Xuint32)
Total memory for target face = 150 × 125 × 32 = 600,000 bits = 75 KB
Average face:
Number of entries = 150 × 125
Bits per entry = 32 (since of type Xuint32)
Total memory for average face = 150 × 125 × 32 = 600,000 bits = 75 KB
Projections Matrix:
Number of entries = 51 × 51 (assuming database contains 51 individuals)
Bits per entry = 32 (since of type Xuint32)
Total memory for projections = 51 × 51 × 32 = 83,232 bits = 10 KB
Eigenvector matrix:
Number of entries = 51 × 150 × 125 (assuming database contains 51 individuals)
Bits per entry = 32 (since of type Xuint32)
Total memory for Eigenvectors = 51 × 150 × 125 × 32 = 30,600,000 bits = 3,825 KB
$$A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}$$
2. Next, the matrix is normalized by subtracting from each column a column that
represents the average face (the mean of all the faces):

$$\bar{A} = \begin{pmatrix} a_{11} - m_1 & \cdots & a_{1n} - m_1 \\ \vdots & \ddots & \vdots \\ a_{m1} - m_m & \cdots & a_{mn} - m_m \end{pmatrix}$$
3. We then want to compute the covariance matrix of $A$, which is $A A^T$, but since the
operation is very mathematically intensive, we use a shortcut:

$$L = A^T A$$
4. To obtain $U$, the matrix of covariance eigenvectors, we find $V$, the matrix of
eigenvectors of $L$, and calculate:

$$U = A V$$
5. Each face is then projected to face space:

$$\Omega = U^T A$$
6. We next compute the threshold value for comparison:

$$\theta = \max \left\{ \left\| \Omega_i - \Omega_j \right\| \right\}, \quad i, j = 1 \ldots n$$
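The shortcut in steps 3 and 4 rests on a standard observation (stated here for completeness; the source describes it only as a speedup technique): eigenvectors of the small matrix $L$ yield eigenvectors of the full covariance matrix, since

```latex
L v_i = A^{T} A\, v_i = \lambda_i v_i
\;\Longrightarrow\;
A A^{T} (A v_i) = \lambda_i (A v_i)
```

so $u_i = A v_i$ is an eigenvector of $A A^T$ with the same eigenvalue. Because $L$ is $n \times n$ (number of faces) rather than $m \times m$ (number of pixels), far fewer multiplications are needed.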
3.7.2 Recognition Phase
1. We represent the target face as a column vector:

$$r = \begin{pmatrix} r_1 \\ \vdots \\ r_m \end{pmatrix}$$
2. The target face is then normalized:

$$\bar{r} = \begin{pmatrix} r_1 - m_1 \\ \vdots \\ r_m - m_m \end{pmatrix}$$
3. Next, the face is projected to face space:

$$\Omega = U^T \bar{r}$$
4. We then find the Euclidean distance between the target projection and each of the
projections in the database:

$$\epsilon_i^2 = \left\| \Omega - \Omega_i \right\|^2, \quad i = 1 \ldots n$$
5. Finally, we decide if the face is known or not by selecting the smallest distance and
comparing it to the threshold $\theta$. If it is greater, then the face is new. Otherwise, the
face is a match.
V2MB-1000 Development Kit (+ P160 Comm. Module-2) with ISE Foundation and
JTAG cable: $2995.00 (This product was provided by the American University of
Beirut.)
4. IMPLEMENTATION
4.1 Modeling Algorithm in MATLAB
4.1.1 Implementation Details
A free database of faces, non-faces, and new faces was used to test the
implementation. The database of faces consists of 51 images, each with
dimensions of 150 × 125 pixels, represented as row vectors. Each pixel contains an 8-bit
grayscale value representing 1 of 256 possible shades of gray. The MATLAB
implementation followed the algorithm details outlined above and used built-in
MATLAB functions to achieve functionality. Some of these functions are outlined in the
table below.
MATLAB Function    Description
mean(A)            Calculates the mean of matrix A
A'                 Calculates the transpose of matrix A
eigs(A, k)         Determines the first k eigenvectors and eigenvalues of A
dist(A, B)         Determines the Euclidean distance between matrices A and B
Table 1: MATLAB Functions Used
[Figure 8: Sample Face and Figure 9: Average Face — 150 × 125 pixel grayscale images]
Person #    Distance to Face Index 3
1           2.8476 × 10^7
1           2.7966 × 10^7
2           0.0000 × 10^7
2           0.1591 × 10^7
2           0.4335 × 10^7
3           1.9659 × 10^7
3           2.0871 × 10^7
3           2.1734 × 10^7
Table 2: MATLAB Implementation Results
The first line of the code creates a file ID that we will write to in 'wb' (write
binary) mode. Next, we write the matrix m_towrite to the file ID created in the
previous line. The matrix is written using 32-bit floating-point representation. Finally, we
close the file that we have created in the last line.
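The MATLAB snippet described above is not reproduced here; the equivalent open/write/close sequence in C (with a hypothetical function name) writes the matrix as raw 32-bit floats:

```c
#include <stdio.h>

/* Write rows*cols 32-bit floats to a binary file, mirroring the MATLAB
 * fopen('wb') / fwrite(..., 'float32') / fclose sequence described above.
 * Returns 0 on success, -1 on failure. */
int write_matrix(const char *path, const float *m, size_t rows, size_t cols) {
    FILE *fid = fopen(path, "wb");       /* open in write-binary mode */
    if (!fid) return -1;
    size_t written = fwrite(m, sizeof(float), rows * cols, fid);
    fclose(fid);                         /* close the file */
    return written == rows * cols ? 0 : -1;
}
```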
C Function         Description
Matrix Transpose   Calculates the transpose of a matrix
Matrix Average     Finds the average row in a matrix
Matrix Multiply    Multiplies two matrices
Matrix Subtract    Subtracts a row from every matrix row
Vector Distance    Calculates the Euclidean vector distance
Eigenvectors       Finds the Eigenvectors/Eigenvalues of a matrix
Read Images        Reads image database binary file
Read Test Face     Reads test face binary file
Table 3: C Functions Used
The eigenvectors function was obtained from a freely available internet source
and it is modeled after the algorithm outlined in Numerical Recipes in C [11]. This
algorithm computes the Eigenvalues and Eigenvectors of a real symmetric matrix using
Jacobi rotations. Once we completed the implementation of the library of functions, we
could then proceed with the implementation of the training stage itself.
At this stage, all the binary data for the faces in the database is available for use
and is stored in an array called "database". The first matrix we had to calculate was the
average. To do so, we used the matrix_average function, which finds the average pixel
values by summing the pixels of the 51 faces at each position and dividing by 51. The
function takes in the "database" array and returns "average", a single vector of size
FACESIZE.
matrix_average(database,NUMFACES,FACESIZE,average);
Once we obtain the average, we must normalize the entire database by subtracting
the vector "average" from every face vector in the "database" array. Normalization thus
expresses how similar each face in the database is to the average face. The function call
is shown below:
matrix_subtract(database,NUMFACES,FACESIZE,average,database);
The "database" array now contains all the normalized face vectors. From this
point on, we use these normalized vectors rather than the original ones. The next step in
the algorithm would be to compute database_transpose × database, the covariance
matrix. Since this would require a huge number of multiplications and produce a huge
array, a trick is used in which we compute database × database_transpose instead. First,
to transpose the "database" array, we created a simple function that replaces every row
with a column. The function call is shown below:
matrix_transpose(database,NUMFACES,FACESIZE,database_trans);
The details of all of these functions are available in the appendix section for
further reference. Once we obtain the transpose, we may perform the multiplication
operation. Below is the function call; it takes database as the first operand and
database_trans as the second, and stores the product in matrix L.

matrix_multiply(database, NUMFACES, FACESIZE, database_trans, FACESIZE, NUMFACES, L);
The above two operations are part of the trick used to avoid the huge array that
would result from multiplying database_transpose × database. They are intermediate
operations that lead to obtaining the eigenvectors of the original matrix
"database". The last step of this alternative method is to compute the eigenvectors of the
original matrix by multiplying database_transpose by the eigenvectors obtained in the
previous operation. This yields the eigenvectors of the covariance matrix. The function
call is shown below:

matrix_multiply(database_trans, FACESIZE, NUMFACES, eigenvectors, NUMFACES, NUMFACES, eigenvectors_orig);
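In matrix notation, the trick rests on a standard eigenface identity. Writing A for the NUMFACES × FACESIZE matrix of normalized faces (the symbol A is introduced here only for brevity):

```latex
% A is the N x F matrix of normalized faces (N = NUMFACES, F = FACESIZE).
% The small matrix L has size N x N:
L = A A^{T} \in \mathbb{R}^{N \times N}, \qquad L\,v_{i} = \lambda_{i}\,v_{i}.
% Multiplying by A^T shows that A^T v_i is an eigenvector of the huge
% F x F covariance matrix A^T A, with the same eigenvalue:
(A^{T}A)\,(A^{T}v_{i}) \;=\; A^{T}(A A^{T})\,v_{i} \;=\; A^{T} L\,v_{i} \;=\; \lambda_{i}\,(A^{T}v_{i}).
```

This is why it suffices to diagonalize the small 51 × 51 matrix L and then multiply by database_transpose, rather than diagonalize the 18,750 × 18,750 covariance matrix directly.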
This operation in effect highlights the key features of every face by projecting it
onto the face space. As such, when a new face is brought in to the system for recognition,
determining whether it is a match would take a much smaller amount of time. Once we
completed the calculation of the average, eigenvectors_orig, and projections matrices, we
decided to truncate the elements of the arrays. That is, up until this point, all the arrays
were of type float. However, since floating point calculations take significantly longer
than integer calculations, truncating the digits after the decimal point and changing the
type to integer saves a lot of computation. Before doing so, we compared the values
obtained in both the integer and floating point cases and found that the error due to
truncation is negligible (less than 0.01%). To do so, we simply created new integer
arrays, and copied the floating point values. This automatically truncates anything after
the decimal point. This process is shown below.
/* allocate the integer arrays before copying into them */
int *average_int = malloc(FACESIZE * sizeof(int));
int **eigenvectors_orig_int = malloc(FACESIZE * sizeof(int *));
int **projections_int = malloc(NUMFACES * sizeof(int *));
for(i=0;i<FACESIZE;i++)
eigenvectors_orig_int[i] = malloc(NUMFACES * sizeof(int));
for(i=0;i<NUMFACES;i++)
projections_int[i] = malloc(NUMFACES * sizeof(int));
/* assigning a float to an int truncates everything after the decimal point */
for(i=0;i<FACESIZE;i++)
average_int[i] = (int)average[i];
for(i=0;i<FACESIZE;i++)
for(j=0;j<NUMFACES;j++)
eigenvectors_orig_int[i][j] = (int)eigenvectors_orig[i][j];
for(i=0;i<NUMFACES;i++)
for(j=0;j<NUMFACES;j++)
projections_int[i][j] = (int)projections[i][j];
At this point, the above three arrays are ready to be written to a file and sent
through the Ethernet port to the FPGA.
We next write to the binary file by invoking the fwrite function. For the
projection and eigenvector matrices, however, care was taken to ensure that the indexing
of the matrices corresponds with the storage format. That is, the binary files store data
linearly, so consistency must be maintained when unwrapping two-dimensional data into
a linear space. By maintaining this indexing consistency at the receiving end, we were
able to reproduce these two-dimensional matrices without loss or corruption of data. The
following illustrates one example of this:
for(i=0;i<FACESIZE;i++)
fwrite(eigenvectors_orig_int[i], NUMFACES*sizeof(int), 1, f_eigenvectors);
Finally, the file is closed and the process is repeated for all 4 segments of binary
data that are required by the recognition stage.
The next step involved reading the 4 binary files that were produced by the C
code and sequentially formatting and storing them into packets. In C#, opening and
closing a file stream for reading amounts to the following statements:
FileStream fs = File.OpenRead("b_testface_int");
BinaryReader br = new BinaryReader(fs);
br.Close();
fs.Close();
In order to receive frames into the FPGA, we also needed to create a reception
buffer that we called RxFrameBuf. RxFrameBuf is an array with size 1500 (which is the
maximum size in bytes of the frames we will be sending). After all the necessary
initializations are made, the program enters a while loop in which it will receive frames.
Since we know exactly how many bytes we need to send from the PC to the board, we
used this value as a limit for looping. One important task that we had to incorporate into
the program was to filter out certain frames that do not contain data from the training
phase. That is, the Windows operating system on the PC randomly sends broadcast
frames across the Ethernet port. We had to ensure that such broadcast frames were not
confused with data frames. We will explain how this was done shortly, but for now it is
important to note that the variable GoodFrameCount in the code represents the number of
actual training phase data frames received. Upon entering the loop, we used the
XEmac_RecvFrameSS function to receive a data frame as follows:
Length = XEmac_RecvFrameSS(EMAC_BASEADDR, (Xuint8 *)RxFrameBuf);
The parameters of this function are the base address of the device and the
buffer where the frames are stored. The XEmac_RecvFrameSS function we used is a
modification of the function XEmac_RecvFrame available in the xemac_l.h library.
XEmac_RecvFrameSS begins by polling until the receive buffer is no longer empty, at
which point a frame is ready for retrieval:
check = XEmac_mIsRxEmpty(BaseAddress);
while (check==XTRUE)
check = XEmac_mIsRxEmpty(BaseAddress);
Next, it finds the length of the received frame by checking the address location
of the last byte and using the base address of the device to calculate the difference.
Finally, the function filters out the broadcast frames mentioned previously. It does so by
checking that the destination MAC address of the frame matches the MAC address of the
FPGA. In a broadcast frame, the destination address is FF:FF:FF:FF:FF:FF. If the
frame's destination MAC address does not match the MAC address of the FPGA, the
function returns a length of -1 and the frame is discarded in the main function.
After the XEmac_RecvFrameSS function, the program goes into a loop in which
it reads the individual bytes of the frame from the receive buffer in order to store the
frame in DDR memory. Below is the piece of code responsible for retrieving the bytes
and storing them in memory.
for (i=14; i<1014; i+=4)
{
rec1 = (Xuint32) RxFrameBuf[i];
rec2 = (Xuint32) RxFrameBuf[i+1];
rec3 = (Xuint32) RxFrameBuf[i+2];
rec4 = (Xuint32) RxFrameBuf[i+3];
rec2 <<= 8;
rec3 <<= 16;
rec4 <<= 24;
word = rec1 | rec2 | rec3 | rec4;
XDdr_mWriteReg (MEM_BASEADDR, memcount*4, word);
memcount++;
}
We begin reading at the 15th byte in the buffer, since the first 14 bytes hold the
destination MAC address, the source MAC address, and the length of the frame. The
actual data starts at the 15th byte and ends at the 1014th byte. We read the bytes from the
receive buffer four at a time, since every word that will be stored in memory is 32 bits
(i.e., 4 bytes) wide. We read RxFrameBuf[i], RxFrameBuf[i+1], RxFrameBuf[i+2], and
RxFrameBuf[i+3] into variables, then concatenate the four bytes into one word: the most
significant byte is shifted left by 24 places with zeros inserted, the next most significant
byte by 16 places, and the two least significant bytes adjusted similarly. After shifting,
we OR the four bytes together to obtain one 32-bit word.
they matched. At this point, we were certain that all the data needed was being sent and
correctly stored in memory.
// normalization stage
for (i=0; i<TESTFACE_SIZE; i+=4)
{
r1 = XDdr_mReadReg (MEM_BASEADDR, TESTFACE_BASEADDR+i);
r2 = XDdr_mReadReg (MEM_BASEADDR, AVGFACE_BASEADDR+i);
r1 = r1 - r2;
XDdr_mWriteReg (MEM_BASEADDR, TESTFACE_BASEADDR+i, r1);
}
The loop iterates over all values of TESTFACE_SIZE, which is defined as the size
of the test face in bytes, or 75,000 bytes. In the for loop, r1 and r2 are the respective
values of the test face and the average face stored in memory locations
TESTFACE_BASEADDR and AVGFACE_BASEADDR offset by the iteration value, i,
and the base address of memory, MEM_BASEADDR. These two values represent the
memory addresses corresponding to the first elements of the test face and the average
face, respectively. Finally, the result is stored in place of the original test face in memory.
In the projection stage of the algorithm, we are multiplying the normalized test
face by the matrix of covariance eigenvectors. In order to accomplish this matrix
multiplication, the outer loop iterates over the number of faces in the database, and the
inner loop iterates over the size of the test face stored. During each inner loop iteration,
we obtain the corresponding values for the eigenvector matrix and the test face from
DDR memory using the appropriate index arithmetic.
We then multiply r1 by r2 and accumulate the product. Once the inner loop runs
to completion, the accumulated sum is stored in the corresponding array location and the
accumulator is reset to 0 for the next outer loop iteration.
In the final stage of the PCA recognition phase, the Euclidean distance between
the projection computed earlier and each of the projections stored in the projections
matrix has to be calculated. Since the projections matrix has size NUMFACES ×
NUMFACES, both the outer and inner loops iterate over NUMFACES. Inside, a
projection value is read from memory and from it we subtract the corresponding value in
the projections array. Next, this difference is cast to a floating-point value, squared, and
accumulated. As an illustration of the distance calculation, below is a sample of the
inner-loop computation that finds the squared values for the Euclidean distance:
After this is done, the Euclidean distance is found by calling the sqrt function on
the accumulated value. We then compare this distance to the smallest distance calculated
so far; if it is smaller, we set it as the new minimum distance and proceed. Finally, we
display the minimum distance value and the corresponding face index, which indicates
the face to which the smallest distance belongs. Below is the code for finding the
minimum distance:
if (i==0)
{
min = fdist;
imark = i;
}
else
{
if (fdist < min)
{
min = fdist;
imark = i;
}
}
On the FPGA side, we declared and initialized an OPB timer instance:

XTmrCtr timer;
XTmrCtr_Initialize(&timer, XPAR_OPB_TIMER_0_DEVICE_ID);
In order to measure the number of execution cycles a specific operation takes, the
timer has to be reset prior to starting it. Following the completion of the operations, the
timer is stopped and the value is read. The following code illustrates this:
XTmrCtr_Reset(&timer,0);
XTmrCtr_Start(&timer,0);
// operations to be measured
XTmrCtr_Stop(&timer,0);
cycles = XTmrCtr_GetValue(&timer,0);
On the PC end, measuring the time involved obtaining a header file from an
open-source project and using its timing functions to count the clock cycles completed
during runtime. Below are the commands we used to measure the execution time of the
corresponding stages on the PC:
LARGE_INTEGER start_ticks, end_ticks, cputime, ticksPerSecond;
QueryPerformanceFrequency(&ticksPerSecond);
QueryPerformanceCounter(&start_ticks);
// RECOGNITION STAGE CODE GOES HERE
QueryPerformanceCounter(&end_ticks);
cputime.QuadPart = end_ticks.QuadPart - start_ticks.QuadPart;
printf("\tElapsed CPU time test: %.9f sec\n",
(float)cputime.QuadPart / (float)ticksPerSecond.QuadPart);
5. CRITICAL APPRAISAL
5.1 Researching Face Recognition
Throughout this project, we faced several decisions to ensure that our face
recognition system design would work out as planned. In the early stages of the project,
we had only a vague idea of what we wanted to implement. We started gaining
experience in the field of face recognition by reading and researching numerous
publications.
that it is better to simulate a system using high-level tools before delving into details.
Indeed, the MATLAB simulation made our transition to C very smooth and strengthened
our understanding of the concepts at hand.
After we coded and thoroughly tested our functions, we proceeded to code the
actual PCA algorithm. Now that we had developed a library of custom functions,
stage must be stored in external memory, as the working memory of the FPGA is not
large enough to accommodate the requisite data. One of the problems we faced was that
DDR is word-addressable while the binary files are byte-addressable. As a result, we had
to concatenate every 4 bytes into a 32-bit word, with the proper endianness, and write it
to DDR.
Another problem we faced on the FPGA involved the use of the float data type.
Initially, we had planned on executing the entire PCA algorithm using floats, so
naturally we wrote some experimental C code on the FPGA that adds and prints two
floating-point numbers. Unfortunately, we were not obtaining correct results, because the
xil_printf statement, which prints to the HyperTerminal, does not support printing floats.
As a result, we had to resort to printing the float values as decimal integers, converting
them to hexadecimal, and using an online utility to obtain the corresponding floating-
point value.
Still, we obtained wrong results. After several days of wrestling with the problem,
we discovered that the decimal values represented the first 32 bits of a 64 bit
representation rather than a standalone 32-bit floating point representation. This taught us
a major lesson, that not everything comes easily, and that several things require spending
time with the code and experimenting with different methods.
multiplication. Using the documentation available with the board and some application
notes, we learned how to instantiate the hardware multipliers through VHDL code. We
then proceeded to learn how to add a custom IP core on the board and interface it with C
code. After overcoming many problems related to satisfying the timing constraints of the
on-board clock and hardware synthesis, we were able to use the core to perform fast
multiplication.
After even more research into the issue of using hardware multipliers, we
discovered that by altering the parameters of the MicroBlaze instance, we could route all
multiplication instructions in C to the on-board hardware multipliers. This eliminated the
need for the custom IP core and drastically improved our initial performance analysis.
This is because we had removed the overhead of transferring data to and from the custom
core in addition to delays incurred by the control signals needed to regulate the multiply
and accumulate process.
find a freely available one. We were looking for a program that sends frames at the
Ethernet level rather than the IP level. After some more searching, we found a C# source
file containing functions for sending raw Ethernet packets. We then modified the code to
send the data from the training stage.
At the FPGA end, we wrote the code to receive frames and specified the
necessary Ethernet settings. When we connected the two ends together, we used print
statements and the built-in hardware debugger on the FPGA end to make sure we were
receiving packets. At first, no packets were appearing on the receiving end. After revising
our initial Ethernet settings, we discovered that the receiver settings were kept in reset
mode. After resolving this issue, we discovered that we were receiving incorrect data in
the packets. By using the hardware debugger, we found that all packets we were
receiving had the broadcast address as their destination address. The reason for this
anomaly was that Windows was sending broadcast packets through Ethernet. When we
pinpointed the problem, we simply coded a filter that discards all unwanted packets.
subtraction, and writing the results back to memory. Similarly, the projection and
distance calculations involved several modifications for memory access. Throughout,
most of the hurdles we faced were related to indexing problems. This is due to the fact
that matrix multiplication involves two-dimensional entities which are stored linearly in
memory. As a result, we had to make sure that the indexing for matrix element access
was correct. To do so, we had to run the code several times and print out the results until
they matched the values obtained on the PC.
Other issues we faced in the PCA implementation on the FPGA concerned the
data types we were using. For example, the function that retrieves data from memory
stores the word in a variable of type Xuint32, an unsigned 32-bit integer. After
performing a subtraction (such as in the distance calculation), however, a negative result
stored in this unsigned type would yield wrong values. We corrected this by assigning
the result of the subtraction to an Xint32 instead of an Xuint32. We also faced other
problems related to data types such as float and long int. These problems taught us much
about the importance of understanding the nature of the data used in the system.
functionality. This proved to be impossible. As such, we resorted to testing each phase of
the algorithm individually and measuring its performance. Through this process, we
learned the importance and value of our limited memory resources and how to optimize
our implementation to use these resources efficiently.
When trying to measure the performance of each of our implementations, we
faced some difficulties in finding the correct tools for this purpose. On the FPGA side,
we had to add the OPB Timer module to count the execution cycles. This proved to
further limit our memory resources and force us to run the simulation on each of the
algorithm stages individually. On the host PC side, the regular C libraries did not provide
us with measurements that were accurate enough. Therefore, we had to resort to using
some open source functions and libraries to find accurate measurements for each of the
stages of the algorithm.
To sum up, it is evident that the past two semesters have been extremely fruitful
in terms of the amount of knowledge and experience acquired. Although at first we were
overwhelmed by the magnitude of this project, we discovered that breaking down our
problems into smaller pieces yielded quick and effective solutions. Moreover, we learned
a great deal about hardware implementations and FPGA programming, thereby widening
the scope of our applied knowledge.
6. RESULTS
6.1 Methodology Overview
After completing the implementation phase of our project, we moved on to the
analysis and performance assessment of our results. This involved obtaining several
execution time performance metrics and using them to interpret the relative efficiency of
our system.
As stated earlier, speed is of prime importance when it comes to the process of
recognizing a face. As such, the next most reasonable step involved obtaining a temporal
breakdown of the recognition phase. Specifically, the recognition phase can be broken
down into the following components:
Normalization
Projection
Distance Calculation
              Implementation 1     Implementation 2          Implementation 3
Device        Acer Laptop          Virtex-II Board           Virtex-II Board
Processor     1.7 GHz Centrino     100 MHz MicroBlaze        100 MHz MicroBlaze
Environment   MS Visual C++        Xilinx Platform Studio    Xilinx Platform Studio
Multiplier    Software             Programmable Gates        Dedicated Hardware

Table 4: Implementation Descriptions
the fact that this phase involves high computational demand in the form of a matrix
multiplication operation.
6.2 PC Implementation
In order to test our hypothesis, we first timed the recognition stage on
Implementation 1. This entailed importing additional libraries, along with predefined
time functions. We then started/ended the timer before/after each of the 3 stages of our
algorithm, and obtained the following results:
Phase                  Execution Time        Clock Cycles Elapsed
Normalization          0.235 milliseconds    399,500 clock cycles
Projection             49.5 milliseconds     84,150,000 clock cycles
Distance Calculation   3.32 milliseconds     5,644,000 clock cycles
TOTAL                  53.055 milliseconds   90,193,500 clock cycles

Table 5: Implementation 1 Results
In the above table, the number of clock cycles elapsed was obtained by
multiplying the execution time by 1.7 GHz, the speed of the Centrino processor.
Although not all of these clock cycles are spent on the recognition stage itself (some
sustain operating system functions), they must still be included. The reason is that, in
reality, a system implemented on a PC would run on an operating system and incur the
overhead of OS calls.
IP core interfaces with the OPB bus on the FPGA. Once regeneration was completed, we
measured the execution of each of the 3 stages for recognition. It is important to note that
unlike in Implementation 1, the opb_timer measures execution in terms of clock cycles as
opposed to units of time. Thus, we had to divide the number of clock cycles by 100 MHz
to obtain the execution time.
Phase                  Clock Cycles Elapsed        Execution Time
Normalization          1,558,397 clock cycles      15.5 milliseconds
Projection             211,744,171 clock cycles    2.12 seconds
Distance Calculation   1,474,767 clock cycles      14.7 milliseconds
TOTAL                  214,777,335 clock cycles    2.15 seconds

Table 6: Implementation 2 Results
Phase                  Clock Cycles Elapsed       Execution Time
Normalization          1,550,129 clock cycles     15.5 milliseconds
Projection             77,361,175 clock cycles    774 milliseconds
Distance Calculation   1,152,310 clock cycles     11.5 milliseconds
TOTAL                  80,063,614 clock cycles    801 milliseconds

Table 7: Implementation 3 Results
Firstly, comparing the results of Implementations 2 and 3, we notice that using
the dedicated hardware multipliers in Implementation 3 resulted in a significant speed-up.
The number of clock cycles in the projection phase of Implementation 3 is almost 63%
lower than in Implementation 2. As expected, the normalization phases of both FPGA
implementations were practically identical, since no multiplications take place in that
phase.
Lastly, there was approximately a 22% speed-up in the distance calculation in
Implementation 3 over Implementation 2, since this phase inherently involves squaring
values (i.e., multiplying values by themselves). The speed-up was not as high as in the
projection phase because the distance calculation phase is not purely
multiplication-intensive.
[Figure: FPGA Comparison — clock cycles per recognition stage (Normalization, Projection, Distance) for Implementations 2 and 3]
Looking closely at the results of the first and third implementations, we notice
that the FPGA execution time is slower than that of the software implementation. This is
primarily because the PC we ran the software implementation on has a 1.7 GHz
processor, versus the 100 MHz processor running on the FPGA. However, if we take the
number of clock cycles in an absolute sense, the third (hardware) implementation took
approximately 10% fewer clock cycles to execute than the software implementation.
Comparing performance in terms of clock cycles therefore shows the third
implementation to be the fastest, as shown in the graph below.
[Figure: Total Performance — total recognition-phase clock cycles for Implementations 1, 2, and 3]
The justification for a clock-cycle-based comparison stems from the fact that, in
practice, multiple hardware units would be used in parallel to run the face recognition
algorithm. Furthermore, in a production implementation, an FPGA board with a faster
processor core would be used to speed up the algorithm. Lastly, the algorithm could be
implemented as an ASIC, yielding a further increase in performance.
Thus far, our performance measurements have shown that utilizing the on-board
hardware multipliers greatly improves the performance of our system. The device
utilization summary reveals that there is ample room to make further use of the available
hardware multiplier units (the MULT18X18s): only 10 out of 40 (25%) are in use.
Therefore, due to the availability of these multipliers and the nature of matrix
multiplication, future work in this field could be centered on trying to utilize this resource
in order to parallelize the process of matrix multiplication and further improve the
performance of this application.
time. The same applies to applications in the home security industry. Fast face
recognition systems would allow for instant detection and entrance into the home.
In addition to investigating ways to speed up the recognition process, we also
focused on creating a system that is sustainable and upgradeable. Faces can be added to
the existing database with great ease, and re-computing the new data is instantaneous.
The fact that we used a Field Programmable Gate Array to implement our system is one
of the key advantages over other systems. It provides for easy upgrading of the system
simply through the modification of code that runs on the core processor. Moreover, new
cores and new features would cost very little since the system we created leaves available
a huge amount of programmable gates for future modifications.
On the ethical side, face recognition treads on sensitive territory regarding the
privacy of individuals. Many individuals prefer more discreet forms of identification and
detection that do not rely on such direct biometric measurements.
However, the privacy of an individual can be sustained by ensuring that the process is
automated and that the images captured are stored securely on a server. In this manner,
we can capitalize on the benefits of face recognition while preserving individual privacy.
Finally, from an economic vantage point, our project is an investment into
research that could result in millions of dollars of savings. Automation and speed-up will
lead to a lower need for human intervention, thereby cutting costs across several
frontiers. Nevertheless, the start-up cost for such a project is quite steep as it involves
revamping the entire security infrastructure that permeates modern life.
8. CONCLUSION
Over the course of the past two terms, we have researched the field of face
recognition, familiarized ourselves with the FPGA, and modeled the PCA algorithm in
both MATLAB and C. We next developed the system requirements of our intended
design and created a block diagram depicting the interconnection among the various
components of our system. Lastly, we implemented the algorithm on the FPGA, complete
with Ethernet, DDR memory, and on-board hardware multipliers. Profiling the code
revealed that matrix multiplication was the most time-consuming part of the algorithm
and that the on-board multipliers yield the best performance.
Our system can be further enhanced in several different ways. For example, a
friendly user interface can be created to improve software usability. Performance can be
further enhanced by employing hardware multipliers running in parallel and by
improving the clock speed of the soft core processor on the FPGA board. Having pieced
together the face recognition system over several months of milestones and setbacks, we
learned some valuable lessons. We hope that this system provides some additional insight
into the field of face recognition and contributes to the development of the field.
9. REFERENCES
[1]
[2]
[3]
[4]
[5]
[6] R. Gottumukkal and K. V. Asari, "System Level Design of Real Time Face
Recognition Architecture Based on Composite PCA," Proc. GLSVLSI 2003,
2003, pp. 157-160.
[7] H. T. Ngo, R. Gottumukkal, and V. K. Asari, "A Flexible and Efficient Hardware
Architecture for Real-Time Face Recognition Based on Eigenface," Proc. IEEE
Computer Society Annual Symposium on VLSI: New Frontiers in VLSI Design
(ISVLSI'05), 2005, pp. 280-281.
[8]
[9]
[10]
[11] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical
Recipes in C: The Art of Scientific Computing, 2nd ed., Cambridge University
Press, 1992.
10. APPENDIX
10.1 PCA Code in MATLAB
function [distances] = pca(A,test_face,k)
fprintf(1,'Computing average face...\n');
average_face = mean(A);
num_of_faces = size(A,1);
fprintf(1,'Computing vector differences...\n');
for i = 1:num_of_faces
faces_diff(i,:) = A(i,:) - average_face;
end;
fprintf(1,'Computing L matrix...\n');
L = faces_diff * faces_diff';
fprintf(1,'Computing Eigenvectors of L...\n');
[V,D] = eigs(L,k);
fprintf(1,'Extracting Eigenvectors of covariance matrix...\n');
eigenvec_u = faces_diff' * V;
%fprintf(1,'Normalizing eigenvectors...\n');
%z = sum(eigenvec_u,1);
%eigenvec_u = eigenvec_u ./ z (ones(size(eigenvec_u,1), 1) ,:);
%eigenvectors = eigenvec_u;
fprintf(1,'Computing face projections...\n');
projections = faces_diff * eigenvec_u;
fprintf(1,'Testing a face...\n');
%test_face = B(3,:);
test_norm = test_face - average_face;
test_proj = test_norm * eigenvec_u;
distances = dist(projections, test_proj');
void matrix_print(float** matrix, int height, int width)
{
int i,j;
for(i=0;i<height;i++)
{
for(j=0;j<width;j++)
printf ("%f\t",matrix[i][j]);
printf("\n");
}
printf("\n");
}
void vector_print(float* vector, int size)
{
int i;
for(i=0;i<size;i++)
printf ("%f\t",vector[i]);
printf("\n\n");
}
void matrix_transpose(float** matrix_in, int height_in, int width_in, float** matrix_out)
{
int i,j;
for(i=0;i<height_in;i++)
for(j=0;j<width_in;j++)
matrix_out[j][i] = matrix_in[i][j];
}
void matrix_average(float** matrix_in, int vector_num, int vector_size, float* matrix_out)
{
int i,j;
float temp;
for(i=0;i<vector_size;i++)
{
temp = 0;
for(j=0;j<vector_num;j++)
temp += matrix_in[j][i];
matrix_out[i] = temp/vector_num;
}
}
void matrix_multiply(float** matrix1, int height1, int width1,
float** matrix2, int height2, int width2,
float** matrix_out)
{
int i,j,k;
float temp;
for(i=0;i<height1;i++)
for(j=0;j<width2;j++)
{
temp = 0;
for(k=0;k<width1;k++)
temp = temp + matrix1[i][k]*matrix2[k][j];
matrix_out[i][j] = temp;
}
}
void read_database(float** db, int height, int width)
{
FILE * pFile;
long lSize;
float* buffer;
int i,j;
// open file
pFile = fopen ( "database.txt" , "rb" );
// obtain file size
fseek (pFile , 0 , SEEK_END);
lSize = ftell (pFile);
rewind (pFile);
// allocate memory to contain the whole file
buffer = (float*) malloc (lSize);
// copy the file into the buffer.
fread (buffer,1,lSize,pFile);
for(i=0;i<height;i++)
for(j=0;j<width;j++)
db[i][j] = buffer[j*height+i];
// release the buffer and close the file
free (buffer);
fclose (pFile);
}
void read_testface(float* face, int size)
{
FILE * pFile;
long lSize;
float* buffer;
int i;
// open file
pFile = fopen ( "testface.txt" , "rb" );
// obtain file size
fseek (pFile , 0 , SEEK_END);
lSize = ftell (pFile);
rewind (pFile);
// allocate memory to contain the whole file
buffer = (float*) malloc (lSize);
// copy the file into the buffer.
fread (buffer,1,lSize,pFile);
for(i=0;i<size;i++)
face[i] = buffer[i];
// close file
fclose (pFile);
}
void eigsrt(double *d, double **v, int n)
/* Given the eigenvalues d[1..n] and eigenvectors v[1..n][1..n],
this routine sorts the eigenvalues into descending order of
magnitude and rearranges the columns of v correspondingly. The
method is straight insertion. */
{
int k, j, i;
double p;
for (i = 1; i < n; i++) {
p = d[k=i];
for (j = i + 1; j <= n; j++)
if (fabs(d[j]) >= fabs(p))
p = d[k=j];
if (k != i) {
d[k] = d[i];
d[i] = p;
for (j = 1; j <= n; j++) {
p = v[j][i];
v[j][i] = v[j][k];
v[j][k] = p;
}
}
}
}
void tred2(double **a, int n, double *d, double *e)
/* Householder reduction of a real, symmetric matrix
a[1..n][1..n]. On output, a is replaced by the orthogonal
matrix Q effecting the transformation. d[1..n] returns the
diagonal elements of the tridiagonal matrix, and e[1..n] the
off-diagonal elements, with e[1] = 0. Several statements, as
noted in comments, can be omitted if only eigenvalues are to
be found, in which case a contains no useful information on
output. Otherwise they are to be included. */
{
int l, k, j, i;
double scale, hh, h, g, f;
for (i = n; i>= 2; i--) {
l = i - 1;
h = scale = 0.0;
if (l > 1) {
for (k = 1; k <= l; k++)
scale += fabs(a[i][k]);
if (scale == 0.0)
/* skip transformation */
e[i] = a[i][l];
else {
for (k = 1; k <= l; k++) {
a[i][k] /= scale; /* use scaled a's for transformation*/
h += a[i][k] * a[i][k]; /* form sigma in h */
}
f = a[i][l];
g = (f >= 0.0 ? -sqrt(h) : sqrt(h));
e[i] = scale * g;
h -= f * g;
/* Now h is equation (11.2.4) */
a[i][l] = f-g;
/* Store u in the ith row of a. */
f = 0.0;
for (j = 1; j <= l; j++) {
/* Next statement can be omitted if eigenvectors not wanted
*/
a[j][i] = a[i][j] / h; /* Store u/H in ith column of a. */
g = 0.0;
/* Form an element of Au in g. */
for (k = 1; k <= j; k++)
g += a[j][k] * a[i][k];
for (k = j+1; k <= l; k++)
g += a[k][j] * a[i][k];
e[j] = g/h;
/* Form element of p in temporarily
unused element of e */
/* Page 2 */
f += e[j] * a[i][j];
}
hh = f / (h + h); /* Form K, equation (11.2.11). */
for (j = 1; j <= l; j++) { /* Form q and store in e
overwriting p */
f = a[i][j];
/* Note that e[l] = e[i-1] survives */
e[j] = g = e[j] - hh * f;
for (k = 1; k <= j; k++) /* Reduce a, equation (11.2.13) */
a[j][k] -= (f * e[k] + g * a[i][k]);
}
}
} else
e[i] = a[i][l];
d[i] = h;
}
/* Next statement can be omitted if eigenvectors not wanted */
d[1] = 0.0;
e[1] = 0.0;
/* Contents of this loop can be omitted if eigenvectors not wanted
except for statement d[i] = a[i][i]; */
for (i = 1; i <= n; i++) { /* Begin accumulation of
transformation matrices */
l = i - 1;
if (d[i]) {
/* This block skipped when i = 1 */
for (j = 1; j <= l; j++) {
g = 0.0;
for (k = 1; k <= l; k++) /* Use u and u/H stored in a to form
PQ */
g += a[i][k] * a[k][j];
for (k = 1; k <= l ; k++)
a[k][j] -= g * a[k][i];
}
}
d[i] = a[i][i];
/* This statement remains */
a[i][i] = 1.0;
for (j = 1; j <= l; j++)
a[j][i] = a[i][j] = 0.0;
}
}
void tqli(double *d, double *e, int n, double **z)
/* QL algorithm with implicit shifts, to determine the eigenvalues
and eigenvectors of a real, symmetric, tridiagonal matrix previously
reduced by tred2. On input, d[1..n] contains the diagonal elements
and e[1..n] the subdiagonal elements (e[1] arbitrary). On output, d
returns the eigenvalues and z the eigenvectors. */
{
int m, l, iter, i, k;
double s, r, p, g, f, dd, c, b;
for (i = 2; i <= n; i++)
e[i-1] = e[i];
e[n] = 0.0;
for (l = 1; l <= n; l++) {
iter = 0;
do {
for (m = l; m <= n-1; m++) {
/* Look for a single small subdiagonal element to split the matrix */
dd = fabs(d[m]) + fabs(d[m+1]);
if ((double)(fabs(e[m]) + dd) == dd)
break;
}
if (m != l) {
if (iter++ == 30)
printf("Too many iterations in tqli\n");
g = (d[l+1] - d[l]) / (2.0 * e[l]); /* form shift */
r = pythag(g, 1.0);
g = d[m] - d[l] + e[l] / (g + SIGN(r, g));
s = c = 1.0;
p = 0.0;
/* Page 2 */
/* A plane rotation as in the original QL, followed by Givens
rotations to restore tridiagonal form. */
for (i = m-1; i >= l; i--) {
f = s * e[i];
b = c * e[i];
e[i+1] = (r = pythag(f,g));
/* recover from underflow */
if (r == 0.0) {
d[i+1] -= p;
e[m] = 0.0;
break;
}
s = f/r;
c = g/r;
g = d[i+1] - p;
r = (d[i] - g) * s + 2.0 * c * b;
d[i+1] = g + (p = s * r);
g = c * r - b;
/* Next loop can be omitted if eigenvectors not wanted */
/* Form eigenvectors */
for (k = 1; k <= n; k++) {
f = z[k][i+1];
z[k][i+1] = s * z[k][i] + c * f;
z[k][i] = c * z[k][i] - s * f;
}
}
if (r == 0.0 && i >= l)
continue;
d[l] -= p;
e[l] = g;
e[m] = 0.0;
}
} while (m != l);
}
}
/***initializations****/
database = malloc(NUMFACES*sizeof(float*));
for(i=0;i<NUMFACES;i++)
database[i] = malloc(FACESIZE*sizeof(float));
average = malloc(FACESIZE*sizeof(float));
L = malloc(NUMFACES*sizeof(float*));
for(i=0;i<NUMFACES;i++)
L[i] = malloc(NUMFACES*sizeof(float));
database_trans = malloc(FACESIZE*sizeof(float*));
for(i=0;i<FACESIZE;i++)
database_trans[i] = malloc(NUMFACES*sizeof(float));
eigenvalues = malloc(NUMFACES*sizeof(float));
eigenvectors = malloc(NUMFACES*sizeof(float*));
for(i=0;i<NUMFACES;i++)
eigenvectors[i] = malloc(NUMFACES*sizeof(float));
eigenvectors_orig = malloc(FACESIZE*sizeof(float*));
for(i=0;i<FACESIZE;i++)
eigenvectors_orig[i] = malloc(NUMFACES*sizeof(float));
projections = malloc(NUMFACES*sizeof(float*));
for(i=0;i<NUMFACES;i++)
projections[i] = malloc(NUMFACES*sizeof(float));
test_face = malloc(1*sizeof(float*));
test_face[0] = malloc(FACESIZE*sizeof(float));
test_projection = malloc(1*sizeof(float*));
test_projection[0] = malloc(NUMFACES*sizeof(float));
/*****pca training*****/
// obtain database
read_images(database,NUMFACES, FACESIZE);
// find average face
matrix_average(database,NUMFACES,FACESIZE,average);
// normalize database
matrix_subtract(database,NUMFACES,FACESIZE,average,database);
// compute L matrix
matrix_transpose(database,NUMFACES,FACESIZE,database_trans);
matrix_multiply(database,NUMFACES,FACESIZE,database_trans,FACESIZE,NUMFACES,L);
// compute eigenvectors of L
eig(L,NUMFACES,eigenvalues,eigenvectors);
// derive eigenvectors of original matrix
matrix_multiply(database_trans,FACESIZE,NUMFACES,eigenvectors,NUMFACES,NUMFACES,eigenvectors_orig);
// compute face projections
matrix_multiply(database,NUMFACES,FACESIZE,eigenvectors_orig,FACESIZE,NUMFACES,projections);
/***pca recognition****/
// obtain test face
read_testface(test_face[0],FACESIZE);
// normalize test face
matrix_subtract(test_face,1,FACESIZE,average,test_face);
// project test face
matrix_multiply(test_face,1,FACESIZE,eigenvectors_orig,FACESIZE,NUMFACES,test_projection);
// compute minimum distance
for(i=0;i<NUMFACES;i++)
{
if(i==0)
min = vector_distance(test_projection[0],projections[i],NUMFACES);
else
{
temp_min = vector_distance(test_projection[0],projections[i],NUMFACES);
if(temp_min < min)
{
min = temp_min;
i_mark = i;
}
}
}
printf("The minimum distance belongs to face %i and has a value of %f\n",i_mark+1, min);
}
Console.WriteLine(count.ToString());
br.Close();
fs.Close();
count = 0;
// read binary file
fs = File.OpenRead("b_avg_int");
br = new BinaryReader(fs);
// destination mac
packet[0] = 0x01;
packet[1] = 0x06;
packet[2] = 0x07;
packet[3] = 0x08;
packet[4] = 0x09;
packet[5] = 0x04;
// source mac
packet[6] = 0x00;
packet[7] = 0x56;
packet[8] = 0x00;
packet[9] = 0xFF;
packet[10] = 0x02;
packet[11] = 0xC5;
// length of data bytes
packet[12] = 0x03;
packet[13] = 0xE8;
for (i = 0; i < 75000; i++)
{
packet[14 + i % DATA_SIZE] = br.ReadByte();
if (i % DATA_SIZE == 999)
{
rawether.DoWrite(packet);
count++;
}
for (j = 0; j < 150000; j++) ;
}
Console.WriteLine(count.ToString());
br.Close();
fs.Close();
count = 0;
// read binary file
fs = File.OpenRead("b_eigen_int");
br = new BinaryReader(fs);
// destination mac
packet[0] = 0x01;
packet[1] = 0x06;
packet[2] = 0x07;
packet[3] = 0x08;
packet[4] = 0x09;
packet[5] = 0x04;
// source mac
packet[6] = 0x00;
packet[7] = 0x56;
packet[8] = 0x00;
packet[9] = 0xFF;
packet[10] = 0x02;
packet[11] = 0xC5;
// length of data bytes
packet[12] = 0x03;
packet[13] = 0xE8;
for (i = 0; i < 3825000; i++)
{
packet[14 + i % DATA_SIZE] = br.ReadByte();
if (i % DATA_SIZE == 999)
{
rawether.DoWrite(packet);
count++;
}
for (j = 0; j < 150000; j++) ;
}
Console.WriteLine(count.ToString());
br.Close();
fs.Close();
count = 0;
// read binary file
fs = File.OpenRead("b_proj_int");
br = new BinaryReader(fs);
// destination mac
packet[0] = 0x01;
packet[1] = 0x06;
packet[2] = 0x07;
packet[3] = 0x08;
packet[4] = 0x09;
packet[5] = 0x04;
// source mac
packet[6] = 0x00;
packet[7] = 0x56;
packet[8] = 0x00;
packet[9] = 0xFF;
packet[10] = 0x02;
packet[11] = 0xC5;
// length of data bytes
packet[12] = 0x03;
packet[13] = 0xE8;
for (i = 0; i < 10404; i++)
{
packet[14 + i % DATA_SIZE] = br.ReadByte();
if (i % DATA_SIZE == 999)
{
rawether.DoWrite(packet);
count++;
}
for (j = 0; j < 150000; j++) ;
}
Console.WriteLine(count.ToString());
br.Close();
fs.Close();
#define TESTFACE_BASEADDR 0
#define TESTFACE_SIZE 75000
#define AVGFACE_BASEADDR 75000
#define AVGFACE_SIZE 75000
#define EIG_BASEADDR 150000
#define EIG_SIZE 3825000
#define PROJ_BASEADDR 3975000
#define PROJ_SIZE 10404
/* remaining extracted values whose #define names were lost:
   14, 6, 1500, 0x40c00000, 0x22000000 */
#define TOTAL_FRAMES 3986
#define NUMFACES 51
// PROTOTYPES
int XEmac_RecvFrameSS(Xuint32 BaseAddress, Xuint8 *FramePtr);
void wait(Xuint32 time);
// normalization stage
XTmrCtr_Reset(&timer,0);
XTmrCtr_Start(&timer,0);
for (i=0; i<TESTFACE_SIZE; i+=4)
{
r1 = XDdr_mReadReg (MEM_BASEADDR, TESTFACE_BASEADDR+i);
r2 = XDdr_mReadReg (MEM_BASEADDR, AVGFACE_BASEADDR+i);
r1 = r1 - r2;
XDdr_mWriteReg (MEM_BASEADDR, TESTFACE_BASEADDR+i, r1);
}
XTmrCtr_Stop(&timer,0);
cycles = XTmrCtr_GetValue(&timer,0);
xil_printf("Normalization Cycles: %d\r\n",cycles);
// projection stage
XTmrCtr_Reset(&timer,0);
XTmrCtr_Start(&timer,0);
for (i=0; i<NUMFACES; i++)
{
product = 0;
for (j=0; j<TESTFACE_SIZE; j+=4)
{
r1 = XDdr_mReadReg (MEM_BASEADDR, EIG_BASEADDR +
i*4+j*NUMFACES);
r2 = XDdr_mReadReg (MEM_BASEADDR,
TESTFACE_BASEADDR+j);
product += r1 * r2;
}
projections[i] = product;
}
XTmrCtr_Stop(&timer,0);
cycles = XTmrCtr_GetValue(&timer,0);
xil_printf("Projection Cycles: %d\r\n",cycles);
// distances
XTmrCtr_Reset(&timer,0);
XTmrCtr_Start(&timer,0);
for (i=0; i<NUMFACES;i++)
{
ftemp = 0;
for (j=0; j<NUMFACES; j++)
{
r1 = XDdr_mReadReg (MEM_BASEADDR,
PROJ_BASEADDR+(i*NUMFACES+j)*4);
int1 = r1 - projections[j];
ftest = (Xfloat32) int1;
ftemp += ftest*ftest;
}
//fdist = sqrt(ftemp);
fdist = ftemp;
if (i==0)
{
min = fdist;
imark = i;
}
else
{
if (fdist < min)
{
min = fdist;
imark = i;
}
}
}
if (FramePtr[0] == 1 &&
FramePtr[1] == 6 &&
FramePtr[2] == 7 &&
FramePtr[3] == 8 &&
FramePtr[4] == 9 &&
FramePtr[5] == 4)
{
//printf ("Received a GOOD Packet\r\n");
return Length;
}
else
return -1;
}
void wait(Xuint32 time)
{
Xuint32 cnt = 0;
while(cnt<time)
{
cnt++;
}
}
285006939.446492
287533372.651606
289339502.859201
289201806.462897
271901321.489990
270497789.759402
264169974.413589
255327852.287155
258682538.304327
251213559.123769
259579475.746799
263522708.902784
243460224.373596
256767062.230726
249083248.619941
78699745.431568
257229726.685719
264791717.983083
303273149.042238
307864942.616162
300581205.147086
293359560.382787
298626547.239392
291808943.826539
302661000.071662
301379863.871029
308336091.791721
332560286.526925
345542685.107667
346003889.433077
The minimum distance belongs to face 3 and has a value of 0.000000
***
Integer Implementation : Projection Distance Calculations
284225955.814634
279126687.096212
620494.066290
15590958.052778
42948163.259742
196103692.379704
208225158.720017
216861641.784979
91093926.417311
289572508.175456
286698981.540504
285883839.142958
133857089.836859
113775500.038695
95048959.642352
63801517.932157
70310480.676750
72818549.508288
108058501.418556
108302177.848746
126181027.633667
284607217.509605
287127633.331032
288925692.704150
288760773.338336
271457671.284751
270053651.838463
263708795.309670
254867805.414352
258215407.944628
250731364.175974
259102093.073329
263054021.925713
242983661.512971
256275114.290106
248603954.983011
78496933.796400
256721561.955371
264280157.572359
302767434.835988
307361433.381023
300074824.514051
292915766.134634
298183088.693402
291376106.729249
302256564.576121
300979732.587588
307934681.917593
332104472.690896
345087533.640911
345549694.756852
The minimum distance belongs to face 3 and has a value of 620494.066290
Press any key to continue