Face Recognition On FPGA: Spring Term Report
Ramzi Madi
200200055

Robin Lahoud
200200271

Bassem Sawan
200200267

Supervisor
Prof. Mazen Saghir
TABLE OF CONTENTS
1. INTRODUCTION ......................................................................................................... 5
1.1 Problem Definition .............................................................................................. 5
1.2 Applications.......................................................................................................... 5
1.3 Motivation and Objectives ................................................................................ 7
2 LITERATURE REVIEW ............................................................................................... 7
2.1 Still-Image versus Video.................................................................................... 7
2.2 Algorithms for Face Recognition.................................................................... 8
2.2.1 Principal Component Analysis ..................................................... 8
2.2.2 Linear Discriminant Analysis ........................................................................ 9
2.2.3 Independent Component Analysis ............................................................ 10
2.2.4 Neural Networks ........................................................................................... 11
2.2.5 Genetic Algorithms....................................................................................... 12
2.3 FPGA Implementation of Face Recognition............................................... 14
2.3.1 FPGA Implementation using PCA ............................................................. 14
2.3.2 FPGA Implementation Using Composite/Modular PCA......................... 15
2.3.3 FPGA Implementation using Artificial Neural Networks......................... 17
2.3.4 FPGA Implementation using Genetic Algorithm...................................... 18
2.3.5 FPGA Implementation using Evolutionary Reconfigurable Architecture .......... 20
2.4 Issues with Face Recognition........................................................................ 21
3. DESIGN....................................................................................................................... 23
3.1 System Specifications ..................................................................................... 23
3.1.1 Algorithm........................................................................................................ 23
3.1.2 Inputs and Outputs....................................................................................... 24
3.1.3 Timing Constraints ....................................................................................... 25
3.2 System Description........................................................................................... 25
3.3 Hardware FPGA Components........................................................................ 27
3.3.1 V2MB-1000 Overview.................................................................................. 27
3.3.2 MicroBlaze Processor.................................................................................. 29
3.3.3 OPB Interface ............................................................................................... 29
3.3.4 BRAM Controller........................................................................................... 30
3.3.5 External Memory Controller........................................................................ 30
3.3.6 Ethernet MAC Controller/Driver ................................................................. 30
3.3.7 UARTLite Controller/Driver ......................................................................... 31
3.3.8 On-board Hardware Multipliers .................................................................. 31
3.4 Software FPGA Components ......................................................................... 32
3.5 PC Application Components.......................................................................... 33
3.6 Memory Management ....................................................................................... 33
3.7 PCA Algorithm ................................................................................................... 35
3.7.1 Training Phase.............................................................................................. 35
3.7.2 Recognition Phase ....................................................................................... 36
3.8 Project Budget ................................................................................................... 37
4. IMPLEMENTATION .................................................................................................. 38
4.1 Modeling Algorithm in MATLAB.................................................................... 38
4.1.1 Implementation Details................................................................................ 38
LIST OF FIGURES
Figure 1: Applications of Face Recognition .......... 6
Figure 2: Face Recognition Algorithms ............... 13
Figure 3: System Block Diagram ...................... 26
Figure 4: V2MB-1000 Board ........................... 28
Figure 5: P160 Communications Module ................ 29
Figure 6: Ethernet CAT5 Cable ....................... 37
Figure 7: RS-232 Serial Cable ....................... 37
Figure 8: Sample Face ............................... 38
Figure 9: Average Face .............................. 38
Figure 10: Eigenfaces ............................... 59
Figure 11: FPGA Comparison .......................... 68
Figure 12: Total Performance ........................ 68
LIST OF TABLES
Table 1: MATLAB Functions Used ...................... 38
Table 2: MATLAB Implementation Results .............. 39
Table 3: C Functions Used ........................... 41
Table 4: Implementation Descriptions ................ 64
Table 5: Implementation 1 Results ................... 65
Table 6: Implementation 2 Results ................... 66
Table 7: Implementation 3 Results ................... 66
1. INTRODUCTION
1.1 Problem Definition
Face recognition is a form of biometric identification that relies on data acquired
from the face of an individual. This data, which can be either two-dimensional or three-dimensional in nature, is compared against a database of individuals. In recent years, face
recognition has gained popularity among researchers all over the world. With
applications ranging from security to entertainment, face recognition is an important
subset of biometrics.
In real world applications, it is desirable to have a stand-alone, embedded face
recognition system. The reason is that such systems provide a higher level of robustness,
hardware optimization, and ease of integration. As such, we have chosen the FPGA as a
reconfigurable platform to carry out our implementation. Ultimately, the stand-alone
system may be implemented on an ASIC, a dedicated processor, or even an FPGA chip,
depending on the trade-offs in speed, portability, and reconfigurability.
1.2 Applications
Face recognition systems have gained a great deal of popularity due to the wide
range of applications in which they have proved useful. Broadly, these applications fall
into two main categories: commercial applications and research applications.
From a commercial standpoint, face recognition is practical in security systems
for law enforcement situations. It is in places like airports and international borders that
the need arises for a face recognition system that identifies individuals. Another
application of face recognition is the protection of privacy, obviating the need for
[Figure 1: Applications of Face Recognition — a diagram dividing applications into commercial (security, entertainment, unified PIN) and research (computer vision, pattern recognition) branches]
approach, delegating the more mathematically intensive tasks to the hardware while
controlling the algorithm procedure in software. Our aim is to achieve a speed up in the
process of recognition through the use of multiple parallelized components on the FPGA
while maintaining high accuracy in the results.
2 LITERATURE REVIEW
2.1 Still-Image versus Video
In the literature, two main forms of face recognition exist: still-image-based face
recognition and video-based face recognition. Still image face recognition relies on
classifying an individual based on a single image obtained from a still shot camera.
Conversely, video based face recognition relies on a sequence of frames to extract more
information about the face of a subject.
An inherent advantage of using still-image-based face recognition over video-based systems is that the images are of higher resolution. As a result, current face
recognition algorithms are able to recognize a face more accurately. Furthermore, still-image-based recognition is useful in controlled environments where pose and
illumination are relatively fixed. One example of such an environment is taking a
subject's photograph at an airport check-in [1]. The disadvantages of still-image-based
face recognition occur when such a controlled environment is not easily attainable. An
example of this scenario would be a security camera used to identify a subject in a public
place. In this case, video-based recognition yields better results.
The clear advantage of video-based face recognition occurs in situations where
the image resolution is low and the video feed is continuous. Video-based algorithms
capitalize on both spatial and temporal variations in a subject's face. Nevertheless, a
natural disadvantage is the low resolution of the images being captured [1]. Since an
individual might be located at a distance, the pixels that represent this individual's face
might not constitute a sufficient information base for the algorithm to operate correctly.
Hence, the need for the two different approaches occurs in different situations.
limited. The algorithm basically involves projecting a face onto a face space, which
captures the maximum variation among faces in a mathematical form.
During the training phase, each face image is represented as a column vector, with
each entry corresponding to an image pixel. These image vectors are then normalized
with respect to the average face. Next, the algorithm finds the eigenvectors of the
covariance matrix of normalized faces by using a speedup technique that reduces the
number of multiplications to be performed. This eigenvector matrix is then multiplied by
each of the face vectors to obtain their corresponding face space projections. Lastly, the
recognition threshold is computed by using the maximum distance between any two face
projections [2].
In the recognition phase, a subject's face is normalized with respect to the average
face and then projected onto face space using the eigenvector matrix. Next, the Euclidean
distance is computed between this projection and all known projections. The minimum
value of these comparisons is selected and compared with the threshold calculated during
the training phase. Based on this, if the value is greater than the threshold, the face is
new. Otherwise, it is a known face [2].
to lighting, pose, and expression variations [3]. The drawback is that this algorithm is
significantly more complicated than PCA.
As an input, LDA takes in a set of faces with multiple images for each individual.
These images are labeled and divided into within-classes and between-classes. The
former captures variations within the image of the same individual while the latter
captures variation among classes of individuals. LDA thus calculates the within-class
scatter matrix and the between-class scatter matrix, defined by two respective
mathematical formulas. Next, the optimal projection is chosen such that it maximizes
the ratio of the determinant of the between-class scatter matrix of the projected samples
to the determinant of the within-class scatter matrix of the projected samples [3]. This
ensures that the between-class variations are assigned higher weight than the within-class
variations. To prevent the within-class scatter matrix from being singular, PCA is usually
applied to the initial image set. Finally, a well-known mathematical formula is used to
determine the class to which the target face belongs. Since we have reduced the weight of
within-class variation, the results will be relatively insensitive to such variations.
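In symbols, this is the standard Fisher criterion (the explicit formula is not reproduced in the source; the notation below is the conventional one): the optimal projection satisfies

```latex
W_{\mathrm{opt}} = \arg\max_{W} \frac{\left| W^{T} S_{B} W \right|}{\left| W^{T} S_{W} W \right|}
```

where $S_B$ is the between-class scatter matrix and $S_W$ is the within-class scatter matrix of the training images.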
the phase spectrum. Indeed, it is the phase spectrum that contains information which
humans use to identify faces [4].
The ICA implementation of face recognition relies on the infomax algorithm and
represents the input as an n-dimensional random vector. This random vector is then
reduced using PCA, without losing the higher order statistics. Then, the ICA algorithm
finds the covariance matrix of the result and obtains its factorized form. Finally,
whitening, rotation, and normalization are performed to obtain the independent
components that constitute the face space of the individuals. Since the higher order
relationships between pixels are used, ICA is robust in the presence of noise. Thus,
recognition is less sensitive to lighting conditions, changes in hair, make-up, and facial
expression [4].
exponentially increases the training time, while not including enough results in poor
recognition rates. Once this neural network is formed for each person, it must be trained
to recognize that person. The most common training method is the back propagation
algorithm [8]. This algorithm sets the weights of the connections between neurons such
that the neural network exhibits high activity for inputs that belong to the person it
represents and low activity for others. During the recognition phase, a reduced image is
placed at the input of each of these networks, and the network with the highest numerical
output would represent the correct match.
The main problem with neural networks is that there is no clear method to find the
initial network topologies. Since training takes a long time, experimenting with such
topologies becomes a difficult task [8]. Another issue that arises when neural networks are
used for face recognition is that of online training. Unlike PCA, where an individual may
be added by computing a projection, a neural network must be trained to recognize an
individual. This is a time consuming task not well suited for real-time applications.
are further subdivided into F-tables and T-tables, where each image occupies a row in the
table. Initially, the rows in the F-tables and T-tables do not match. However, by gradually
changing some of the F-table values to don't-cares, some rows end up matching with
each other. Hence, the F-table obtains the generalization ability. The evolution process
ensures that the modified F-table includes as many rows in the T-table as possible. Once
evolution is complete, the modifications that result in the best fitness are chosen for each
category (target person and unknown people) and applied to the F-table [9].
During the recognition phase, the input image is passed through the tables that
correspond to both categories. Two counters keep track of the number of pixel matches in
each of the categories and the counter with the highest value classifies the input face as
belonging to the corresponding category [9]. The obvious drawback of this algorithm is
that entire tables have to be created whenever a new individual is to be detected. As in
neural networks, the scalability of this algorithm is hindered by the exponential
complexity involved when training for multiple target faces.
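The pixel-matching step with don't-cares can be sketched in software as follows (the cited work realizes this with AND-gate planes in hardware; the mask encoding and names here are assumptions for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Count how many 8-bit pixels of the input match a table row.
 * Mask bits cleared to 0 mark "don't-care" bit positions, so those
 * bits are ignored in the comparison. */
size_t count_matches(const uint8_t *input, const uint8_t *row,
                     const uint8_t *mask, size_t npixels) {
    size_t matches = 0;
    for (size_t i = 0; i < npixels; i++)
        if (((input[i] ^ row[i]) & mask[i]) == 0)
            matches++;
    return matches;
}
```

Running this count for both categories (target person and unknown people) and picking the larger counter mirrors the two hardware counters described above.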
[Figure 2: Face Recognition Algorithms — a taxonomy covering Principal Component Analysis, Linear Discriminant Analysis, Independent Component Analysis, Neural Networks, and Genetic Algorithms]
The face recognition system achieved a recognition time of 212 µs. The image
size was 20 × 20 pixels and the FPGA board used was a Xilinx Virtex-II Pro (XC2VP7)
clocked at 100 MHz. The system made use of approximately 18% of the gates available
on the FPGA. At a more detailed level, the bit width was 8 bits for the input face, 7 bits
for the Eigenface, and 18 bits for the Eigenspace [5].
serves two main functions. The first function is that of an accumulator that reads results
from the processing elements and accumulates the results in registers (one register for
each of the twenty processing lanes). Secondly, the classification block finds the face
with the minimum distance to the face under test and stores its index [7].
The above-mentioned system was implemented on an Altera FPGA (synthesized with Quartus)
clocked at 91 MHz. It was able to recognize a face from a database of 1000 images in 11
milliseconds. The performance of this implementation can be attributed to the parallel
hardware blocks used in performing the necessary calculations for the algorithm. Further
to this, the design can be scaled for larger databases by simply adding more processing
elements in parallel. This will yield an even higher throughput of data and improved
performance for larger sized databases [7].
Another FPGA implementation strategy that yields some good performance
results with Composite PCA relies on two process blocks, 16 pairs of which are
connected in parallel for high throughput calculations. The first block reads in the
eigenvectors and the test image and performs the necessary multiplications. This result is
then passed to the second processing block, which computes the distance using a reduced
formula designed to simplify the hardware implementation of distance calculations. All
16 blocks are connected to a distance grouper and a comparator, used to eliminate all
redundant distance calculations and find the smallest distance, respectively [6].
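The cited "reduced formula" is not spelled out in the source; one common hardware simplification (an assumption here, not necessarily the exact form used in [6]) expands the squared Euclidean distance,

```latex
\| x - e_i \|^{2} = \| x \|^{2} - 2\, x^{T} e_i + \| e_i \|^{2}
```

and observes that $\|x\|^2$ is common to every candidate, so only $\|e_i\|^2 - 2\,x^T e_i$ needs to be computed per face, and the square root can be skipped entirely since it preserves the ordering of distances.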
The above hardware design was implemented on an Altera FPGA (synthesized with Quartus II
and clocked at 100 MHz) and was able to perform face recognition on a database of 10 faces
in 3.88 milliseconds. A total of 7,820 logic elements were used, 2,348 of which were flip-flops. Again, performance can be attributed to the highly parallel nature of the hardware
design and the composite algorithm used.
Both the C program and the training images were stored on BRAM, and the system
included peripherals such as OPB UART and the OPB GPIO bus. In order to increase the
speed of the neuron updating process, the HUM contains 4 parallel update units that are
capable of updating 4 neurons at a time. A finite state machine controls the floating-point
multipliers and the output is stored on 4 local registers. The MicroBlaze then reads off
these registers continuously until the update process is complete [8].
The results of the experiment showed that the HUM occupied 42% of the FPGA
and the MicroBlaze occupied 7%. Feed-forward and backward computations took around
20 ms to complete, MicroBlaze software updating took 173 ms, and HUM hardware
updating took 1.4 ms. The speed-up in updating was over 10x while the speed-up over a
software implementation was around 1.7x. This demonstrates that the algorithm contains
inherent parallelism which cannot be exploited with a general-purpose processor [8].
unknown person. Since the input image is fed to all the AND gates simultaneously, the
matching process is carried out in parallel [9].
During the training process, 8-bit images were used to represent the faces. To
implement the chromosome evolution technique, the byte that represents each pixel was
manipulated at 8 different levels. The first level replaces the least significant bit with a
don't-care, and each successive level adds a don't-care to the next least significant bit. The
last level consists of all eight bits replaced by don't-cares. Moreover, to test the
implementation, a database of 100 images was used, representing 5 individuals in 20
different poses. The dimensions of the original images were 240 by 240 pixels, but they
were preprocessed and reduced to 8 by 8 pixels. Also, the F-tables and T-tables were
each assigned 10 of the 20 poses for the individuals [9].
To synthesize the circuits, a logic synthesizer was employed. The average number
of gates required for each person, including the counter and maximum detector units,
amounted to 1,334. The presence of don't-cares allowed the gate count to be
reduced to less than one tenth of the original. Using an FPGA board with a Xilinx XC4010 chip, the
identification accuracy of the system was 97.2% and identification took place within 1
µs. This is due to the intrinsic hardware parallelism found in the AND gate planes.
Furthermore, fault tolerance tests were performed on the system. Random stuck-at-0 and
stuck-at-1 faults were injected at the outputs of the AND gates, and an accuracy upwards of 90% was
maintained even with a stuck-at faulty gate ratio of 18%. Additionally, the system
exhibited graceful degradation as more stuck-at faults were introduced [9].
4 separate SRAM banks that can be accessed simultaneously by the FPGA. As such, 4
different images could be processed in parallel. The EM module was implemented by a
hybrid parallel genetic algorithm processor [10].
For testing, 386 images of 39 people were stored in the database. Each image
consisted of 128 × 128 grayscale pixels. It took approximately 1,000 iterations on the
images in order for the optimal filter combination to be obtained. After evolution, face
recognition rates increased by 63.4% when using images with poor illumination and
noise. When noise was added to the image, the rate increased by 36.5%. These figures
demonstrate the robustness of the system to changes [10].
on creating a separate linear illumination subspace. This is similar to the space created to
capture face variations, except that it captures lighting variations [1].
Pose variation also impairs the face recognition process. Pose variation becomes
especially pronounced when it is combined with illumination changes. One solution to
the pose variation problem involves obtaining images with multiple views of an
individual. In this case, multiple poses are available during both training and recognition.
During the recognition process, each pose is aligned with a similar pose in the database to
achieve correct classification. The obvious drawback is that multiple views of an
individual are not always available. A more popular solution involves using multiple
poses during training but only a single pose during recognition. One such implementation
creates an Eigenspace for each pose to achieve pose-invariant recognition [1].
The problem of facial expression variation is also common in the literature. If
only one image of an individual is available, recognition accuracy drops considerably.
However, if many images are available, algorithms like PCA can absorb these changes. It
is important to note that during expression changes, parts of the face remain largely
unchanged. As a result, algorithms that segment the face are more robust to these
variations [1]. Many databases available today contain training images with multiple
expressions, and face recognition systems have been capable of making accurate image
classifications despite expression variations.
Lastly, it is important to discuss face detection in the context of the face
recognition problem. The need for face detection arises when one or more faces must be
extracted from an image. Furthermore, face detection and extraction is essential to reduce
external factors that might hinder the recognition process. One common method of face
detection relies on the use of Haar classifiers. These classifiers sweep through the image
and apply several filters to detect the presence of a face. Another method, mentioned
earlier, relies on skin color to detect a face.
As such, face recognition is a growing field with potential applications in security,
entertainment, and personal identification. The recognition algorithms can be grouped
into mathematical/statistical (PCA, ICA, LDA) algorithms and biological (NN, GA)
algorithms. Many of these algorithms have been implemented by several researchers on
FPGA boards with high recognition rates and recognition times within the margin of real-time applications. However, long training times and the scalability of face recognition have
been recurring concerns in all of these implementations. Finally, common face
recognition problems include illumination changes, pose variations, and the issue of face
detection and extraction.
3. DESIGN
3.1 System Specifications
3.1.1 Algorithm
Having researched the various algorithms for face recognition, we found that the
two most popular hardware implementations are PCA and Neural Networks. As stated
before, the advantages of PCA are its robustness, parallelizability, and relative simplicity.
Its disadvantages are its sensitivity to lighting and pose variations. On the other hand, the
Neural Networks approach provides strong accuracy but limits the number of individuals
that can be included in the database due to the long training periods involved.
We have chosen to adopt the PCA algorithm for face recognition for several
reasons. Firstly, the environment that will be used to obtain the individual face images is
controlled and hence lighting and pose variation effects can be minimized. Secondly,
since a face can be subdivided into multiple regions, pattern recognition can be applied in
parallel, resulting in faster face recognition. Lastly, PCA allows us to quickly add
individuals to the face database, making it better suited for real time applications.
On the FPGA end, the data files are received, parsed, and stored in DDR memory.
Then, the recognition stage runs on the MicroBlaze core with the assistance of on-board
multipliers and the results are displayed on the HyperTerminal through a serial interface.
Moreover, the V2MB-1000 board contains an RS232 port that allows for serial
communications, and a JTAG port, which is connected to the parallel port of a PC so that
bit stream configurations can be downloaded to the FPGA.
The board we are using comes with the P160 Communications Module-2
expansion. This module provides us with several different functions, but our use of the
board is restricted to Ethernet. This function consists of a Broadcom chip and an RJ45
connector to which the Ethernet cable is hooked up.
MicroBlaze core as well as by input devices or interrupts. The outputs consist of the
MicroBlaze processor and any output indicators that may be used.
that receives a frame from the Ethernet port and stores it in a specified memory location
on the board. In addition, we were able to add the EMAC core to the system using the
corresponding pin constraints and correct signal matching.
back from DDR memory. This concept will be discussed further in the implementation
section of this report.
of the current face and the projection of each of the faces. In mathematical terms, the
magnitude of the difference between each pair of size-M vectors must be computed.
Although the operation is mainly subtraction, we did not design a custom hardware
comparator since we realized that calculating the distance to all the face projections is not
a bottleneck. For this operation, the face projection is stored in BRAM whereas the list of
projections is found in DDR memory, where they were dumped during the initialization
phase of the system. Lastly, the code on the MicroBlaze must then transmit the above
results to the HyperTerminal through the serial interface for the user to see. As mentioned
earlier, these results include the projection distances and the ID of the recognized face.
total of 720Kbits or approximately 90KB. It is located on the Virtex-II FPGA itself, and
thus has the fastest access time compared to all other types of memory on the board.
External memory is essentially a 16M × 16 DDR memory that provides us with 32MB of
storage space. This memory lies on the board external to the FPGA, and thus has a longer
access time. Ideally, we would have opted to store all data in BRAM memory, but due to
the constraint in size, we are forced to store the data initially in External memory. Below
are the memory requirements assuming a 150 × 125 pixel image:
Target face:
Number of entries = 150 × 125
Bits per entry = 32 (since of type Xuint32)
Total memory for target face = 150 × 125 × 32 = 600,000 bits = 75 KB
Average face:
Number of entries = 150 × 125
Bits per entry = 32 (since of type Xuint32)
Total memory for average face = 150 × 125 × 32 = 600,000 bits = 75 KB
Projections Matrix:
Number of entries = 51 × 51 (assuming database contains 51 individuals)
Bits per entry = 32 (since of type Xuint32)
Total memory for projections = 51 × 51 × 32 = 83,232 bits = 10 KB
Eigenvector matrix:
Number of entries = 51 × 150 × 125 (assuming database contains 51 individuals)
Bits per entry = 32 (since of type Xuint32)
Total memory for Eigenvectors = 51 × 150 × 125 × 32 = 30,600,000 bits = 3,825 KB
$$A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}$$
2. Next, the matrix is normalized by subtracting from each column a column that
represents the average face (the mean of all the faces):

$$\bar{A} = \begin{pmatrix} a_{11} - m_1 & \cdots & a_{1n} - m_1 \\ \vdots & \ddots & \vdots \\ a_{m1} - m_m & \cdots & a_{mn} - m_m \end{pmatrix}$$
3. We then want to compute the covariance matrix of $A$, which is $A A^T$, but since the
operation is very mathematically intensive, we use a shortcut:

$$L = A^T A$$
4. To obtain $U$, the matrix of covariance eigenvectors, we find $V$, the matrix of
eigenvectors of $L$, and calculate:

$$U = A V$$
5. Each face is then projected to face space:

$$\Omega = U^T A$$
6. We next compute the threshold value for comparison:

$$\theta = \max \left\{ \left\| \Omega_i - \Omega_j \right\| \right\}, \quad i, j = 1 \ldots n$$
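The shortcut in steps 3 and 4 rests on a standard observation (stated here for completeness; the source describes it only as a speedup technique): eigenvectors of the small matrix $L$ yield eigenvectors of the full covariance matrix, since

```latex
L v_i = A^{T} A\, v_i = \lambda_i v_i
\;\Longrightarrow\;
A A^{T} (A v_i) = \lambda_i (A v_i)
```

so $u_i = A v_i$ is an eigenvector of $A A^T$ with the same eigenvalue. Because $L$ is $n \times n$ (number of faces) rather than $m \times m$ (number of pixels), far fewer multiplications are needed.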
3.7.2 Recognition Phase
1. We represent the target face as a column vector:

$$r = \begin{pmatrix} r_1 \\ \vdots \\ r_m \end{pmatrix}$$
2. The target face is then normalized:

$$\bar{r} = \begin{pmatrix} r_1 - m_1 \\ \vdots \\ r_m - m_m \end{pmatrix}$$
3. Next, the face is projected to face space:

$$\Omega = U^T \bar{r}$$
4. We then find the Euclidean distance between the target projection and each of the
projections in the database:

$$\epsilon_i^2 = \left\| \Omega - \Omega_i \right\|^2, \quad i = 1 \ldots n$$
5. Finally, we decide if the face is known or not by selecting the smallest distance and
comparing it to the threshold $\theta$. If it is greater, then the face is new. Otherwise, the
face is a match.
V2MB-1000 Development Kit (+ P160 Comm. Module-2) with ISE Foundation and
JTAG cable: $2995.00 (This product was provided by the American University of
Beirut.)
4. IMPLEMENTATION
4.1 Modeling Algorithm in MATLAB
4.1.1 Implementation Details
A free database of faces, non-faces, and new faces was used to test the
implementation. The database of faces consists of 51 images, each with
dimensions of 150 × 125 pixels, represented as row vectors. Each pixel contains an 8-bit
grayscale value representing 1 of 256 possible shades of gray. The MATLAB
implementation followed the algorithm details outlined above and used built-in
MATLAB functions to achieve functionality. Some of these functions are outlined in the
table below.
MATLAB Function    Description
mean(A)            Calculates the mean of matrix A
A'                 Calculates the transpose of matrix A
eigs(A, k)         Determines the first k eigenvectors and eigenvalues of A
dist(A, B)         Determines the Euclidean distance between matrices A and B
Table 1: MATLAB Functions Used
[Figure 8: Sample Face and Figure 9: Average Face — 150 × 125 pixel grayscale images]
Person #    Distance to Face Index 3
1           2.8476 × 10^7
1           2.7966 × 10^7
2           0.0000 × 10^7
2           0.1591 × 10^7
2           0.4335 × 10^7
3           1.9659 × 10^7
3           2.0871 × 10^7
3           2.1734 × 10^7
Table 2: MATLAB Implementation Results
The first line of the code creates a file ID that we will write to in 'wb' (write
binary) mode. Next, we write the matrix m_towrite to the file ID created in the
previous line. The matrix is written using 32-bit floating-point representation. Finally, we
close the file that we have created in the last line.
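The MATLAB snippet described above is not reproduced here; the equivalent open/write/close sequence in C (with a hypothetical function name) writes the matrix as raw 32-bit floats:

```c
#include <stdio.h>

/* Write rows*cols 32-bit floats to a binary file, mirroring the MATLAB
 * fopen('wb') / fwrite(..., 'float32') / fclose sequence described above.
 * Returns 0 on success, -1 on failure. */
int write_matrix(const char *path, const float *m, size_t rows, size_t cols) {
    FILE *fid = fopen(path, "wb");       /* open in write-binary mode */
    if (!fid) return -1;
    size_t written = fwrite(m, sizeof(float), rows * cols, fid);
    fclose(fid);                         /* close the file */
    return written == rows * cols ? 0 : -1;
}
```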
C Function         Description
Matrix Transpose   Calculates the transpose of a matrix
Matrix Average     Finds the average row in a matrix
Matrix Multiply    Multiplies two matrices
Matrix Subtract    Subtracts a row from every matrix row
Vector Distance    Calculates the Euclidean vector distance
Eigenvectors       Finds the Eigenvectors/Eigenvalues of a matrix
Read Images        Reads image database binary file
Read Test Face     Reads test face binary file
Table 3: C Functions Used
The eigenvectors function was obtained from a freely available internet source
and it is modeled after the algorithm outlined in Numerical Recipes in C [11]. This
algorithm computes the Eigenvalues and Eigenvectors of a real symmetric matrix using
Jacobi rotations. Once we completed the implementation of the library of functions, we
could then proceed with the implementation of the training stage itself.
At this stage, all the binary data for the faces in the database is available for use
and is stored in an array called "database". The first matrix we had to calculate was the
average. To do so, we used the matrix_average function, which finds the average pixel
values by summing the pixels of the 51 faces at each position and dividing by 51. The
function takes in the "database" array and returns "average", a single vector of size
FACESIZE.
matrix_average(database,NUMFACES,FACESIZE,average);
Once we obtain the average, we must normalize the entire database by subtracting
the vector "average" from every face vector in the "database" array. Normalization thus
expresses how similar each face in the database is to the average face. The function call
is shown below:
matrix_subtract(database,NUMFACES,FACESIZE,average,database);
The "database" array now contains all the normalized face vectors. From this
point on, we use these normalized vectors rather than the original ones. The next step in
the algorithm would be to compute database_transpose × database, the covariance
matrix. Since this would require a huge number of multiplications and produce a huge
array, a trick is used in which we compute database × database_transpose instead. First,
to transpose the "database" array, we created a simple function that replaces every row
with a column. The function call is shown below:
matrix_transpose(database,NUMFACES,FACESIZE,database_trans);
The details of all of these functions are available in the appendix section for
further reference. Once we obtain the transpose, we may perform the multiplication
operation. Below is the function call; it takes database as the first operand and
database_trans as the second, and stores the product in matrix L.

matrix_multiply(database, NUMFACES, FACESIZE, database_trans, FACESIZE, NUMFACES, L);
The above two operations are part of the trick used to avoid the huge array that
would result from multiplying database_transpose × database. They are intermediate
operations that lead to obtaining the eigenvectors of the original matrix
"database". The last step of this alternative method is to compute the eigenvectors of the
original matrix by multiplying database_transpose by the eigenvectors obtained in the
previous operation. This yields the eigenvectors of the covariance matrix. The function
call is shown below:

matrix_multiply(database_trans, FACESIZE, NUMFACES, eigenvectors, NUMFACES, NUMFACES, eigenvectors_orig);
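In matrix notation, the trick rests on a standard eigenface identity. Writing A for the NUMFACES × FACESIZE matrix of normalized faces (the symbol A is introduced here only for brevity):

```latex
% A is the N x F matrix of normalized faces (N = NUMFACES, F = FACESIZE).
% The small matrix L has size N x N:
L = A A^{T} \in \mathbb{R}^{N \times N}, \qquad L\,v_{i} = \lambda_{i}\,v_{i}.
% Multiplying by A^T shows that A^T v_i is an eigenvector of the huge
% F x F covariance matrix A^T A, with the same eigenvalue:
(A^{T}A)\,(A^{T}v_{i}) \;=\; A^{T}(A A^{T})\,v_{i} \;=\; A^{T} L\,v_{i} \;=\; \lambda_{i}\,(A^{T}v_{i}).
```

This is why it suffices to diagonalize the small 51 × 51 matrix L and then multiply by database_transpose, rather than diagonalize the 18,750 × 18,750 covariance matrix directly.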
This operation in effect highlights the key features of every face by projecting it
onto the face space. As such, when a new face is brought in to the system for recognition,
determining whether it is a match would take a much smaller amount of time. Once we
completed the calculation of the average, eigenvectors_orig, and projections matrices, we
decided to truncate the elements of the arrays. That is, up until this point, all the arrays
were of type float. However, since floating point calculations take significantly longer
than integer calculations, truncating the digits after the decimal point and changing the
type to integer saves a lot of computation. Before doing so, we compared the values
obtained in both the integer and floating point cases and found that the error due to
truncation is negligible (less than 0.01%). To do so, we simply created new integer
arrays, and copied the floating point values. This automatically truncates anything after
the decimal point. This process is shown below.
/* allocate the integer arrays before copying into them */
int *average_int = malloc(FACESIZE * sizeof(int));
int **eigenvectors_orig_int = malloc(FACESIZE * sizeof(int *));
int **projections_int = malloc(NUMFACES * sizeof(int *));
for(i=0;i<FACESIZE;i++)
eigenvectors_orig_int[i] = malloc(NUMFACES * sizeof(int));
for(i=0;i<NUMFACES;i++)
projections_int[i] = malloc(NUMFACES * sizeof(int));
/* assigning a float to an int truncates everything after the decimal point */
for(i=0;i<FACESIZE;i++)
average_int[i] = (int)average[i];
for(i=0;i<FACESIZE;i++)
for(j=0;j<NUMFACES;j++)
eigenvectors_orig_int[i][j] = (int)eigenvectors_orig[i][j];
for(i=0;i<NUMFACES;i++)
for(j=0;j<NUMFACES;j++)
projections_int[i][j] = (int)projections[i][j];
At this point, the above three arrays are ready to be written to a file and sent
through the Ethernet port to the FPGA.
We next write to the binary file by invoking the fwrite function. For the
projection and eigenvector matrices, however, care was taken to ensure that the indexing
of the matrices corresponds with the storage format. That is, the binary files store data
linearly, so consistency must be maintained when unwrapping two-dimensional data into
a linear space. By maintaining this indexing consistency at the receiving end, we were
able to reproduce these two-dimensional matrices without loss or corruption of data. The
following illustrates one example of this:
for(i=0;i<FACESIZE;i++)
fwrite(eigenvectors_orig_int[i], NUMFACES*sizeof(int), 1, f_eigenvectors);
Finally, the file is closed and the process is repeated for all 4 segments of binary
data that are required by the recognition stage.
The next step involved reading the 4 binary files that were produced by the C
code and sequentially formatting and storing them into packets. In C#, opening and
closing a file stream for reading amounts to the following statements:
FileStream fs = File.OpenRead("b_testface_int");
BinaryReader br = new BinaryReader(fs);
br.Close();
fs.Close();
In order to receive frames into the FPGA, we also needed to create a reception
buffer that we called RxFrameBuf. RxFrameBuf is an array with size 1500 (which is the
maximum size in bytes of the frames we will be sending). After all the necessary
initializations are made, the program enters a while loop in which it will receive frames.
Since we know exactly how many bytes we need to send from the PC to the board, we
used this value as a limit for looping. One important task that we had to incorporate into
the program was to filter out certain frames that do not contain data from the training
phase. That is, the Windows operating system on the PC randomly sends broadcast
frames across the Ethernet port. We had to ensure that such broadcast frames were not
confused with data frames. We will explain how this was done shortly, but for now it is
important to note that the variable GoodFrameCount in the code represents the number of
actual training phase data frames received. Upon entering the loop, we used the
XEmac_RecvFrameSS function to receive a data frame as follows:
Length = XEmac_RecvFrameSS(EMAC_BASEADDR, (Xuint8 *)RxFrameBuf);
The parameters of this function are the base address of the device and the
buffer where the frames are stored. The XEmac_RecvFrameSS function we used is a
modification of the function XEmac_RecvFrame available in the xemac_l.h library.
XEmac_RecvFrameSS begins by polling until the receive buffer is no longer empty, at
which point a frame is ready for retrieval:
check = XEmac_mIsRxEmpty(BaseAddress);
while (check==XTRUE)
check = XEmac_mIsRxEmpty(BaseAddress);
Next, it finds the length of the received frame by checking the address location
of the last byte and using the base address of the device to calculate the difference.
Finally, the function filters out the broadcast frames mentioned previously. It does so by
checking that the destination MAC address of the frame matches the MAC address of the
FPGA. In a broadcast frame, the destination address is FF:FF:FF:FF:FF:FF. If the
frame's destination MAC address does not match the MAC address of the FPGA, the
function returns a length of -1 and the frame is discarded in the main function.
After the XEmac_RecvFrameSS function, the program goes into a loop in which
it reads the individual bytes of the frame from the receive buffer in order to store the
frame in DDR memory. Below is the piece of code responsible for retrieving the bytes
and storing them in memory.
for (i=14; i<1014; i+=4)
{
rec1 = (Xuint32) RxFrameBuf[i];
rec2 = (Xuint32) RxFrameBuf[i+1];
rec3 = (Xuint32) RxFrameBuf[i+2];
rec4 = (Xuint32) RxFrameBuf[i+3];
rec2 <<= 8;
rec3 <<= 16;
rec4 <<= 24;
word = rec1 | rec2 | rec3 | rec4;
XDdr_mWriteReg (MEM_BASEADDR, memcount*4, word);
memcount++;
}
We begin reading at the 15th byte in the buffer, since the first 14 bytes hold the
destination MAC address, the source MAC address, and the length of the frame. The
actual data starts at the 15th byte and ends at the 1014th byte. We read the bytes from the
receive buffer four at a time, since every word that will be stored in memory is 32 bits
(i.e., 4 bytes) wide. We read RxFrameBuf[i], RxFrameBuf[i+1], RxFrameBuf[i+2], and
RxFrameBuf[i+3] into variables, then concatenate the four bytes into one word: the most
significant byte is shifted left by 24 places with zeros inserted, the next most significant
byte by 16 places, and the two least significant bytes adjusted similarly. After shifting,
we OR the four bytes together to obtain one 32-bit word.
they matched. At this point, we were certain that all the data needed was being sent and
correctly stored in memory.
// normalization stage
for (i=0; i<TESTFACE_SIZE; i+=4)
{
r1 = XDdr_mReadReg (MEM_BASEADDR, TESTFACE_BASEADDR+i);
r2 = XDdr_mReadReg (MEM_BASEADDR, AVGFACE_BASEADDR+i);
r1 = r1 - r2;
XDdr_mWriteReg (MEM_BASEADDR, TESTFACE_BASEADDR+i, r1);
}
The loop iterates over all values of TESTFACE_SIZE, which is defined as the size
of the test face in bytes, or 75,000 bytes. In the for loop, r1 and r2 are the respective
values of the test face and the average face stored in memory locations
TESTFACE_BASEADDR and AVGFACE_BASEADDR offset by the iteration value, i,
and the base address of memory, MEM_BASEADDR. These two values represent the
memory addresses corresponding to the first elements of the test face and the average
face, respectively. Finally, the result is stored in place of the original test face in memory.
In the projection stage of the algorithm, we are multiplying the normalized test
face by the matrix of covariance eigenvectors. In order to accomplish this matrix
multiplication, the outer loop iterates over the number of faces in the database, and the
inner loop iterates over the size of the test face stored. During each inner loop iteration,
we obtain the corresponding values for the eigenvector matrix and the test face from
DDR memory using the appropriate index arithmetic.
We then multiply r1 by r2 and accumulate the product. Once the inner loop runs
to completion, the accumulated sum is stored in the corresponding array location and the
accumulator is reset to 0 for the next outer loop iteration.
In the final stage of the PCA recognition phase, the Euclidean distance between
the projection computed earlier and each of the projections stored in the projections
matrix has to be calculated. Since the projections matrix has size NUMFACES ×
NUMFACES, both the outer and inner loops iterate over NUMFACES. Inside, a
projection value is read from memory and from it we subtract the corresponding value in
the projections array. Next, this difference is cast to a floating-point value, squared, and
accumulated. As an illustration of the distance calculation, below is a sample of the
inner-loop computation that finds the squared values for the Euclidean distance:
After this is done, the Euclidean distance is found by calling the sqrt function on
the accumulated value. We then compare this distance to the smallest distance calculated
so far; if it is smaller, we set it as the new minimum distance and proceed. Finally, we
display the minimum distance value and the corresponding face index, which indicates
the face to which the smallest distance belongs. Below is the code for finding the
minimum distance:
if (i==0)
{
min = fdist;
imark = i;
}
else
{
if (fdist < min)
{
min = fdist;
imark = i;
}
}
On the FPGA side, we declared and initialized an OPB timer instance:

XTmrCtr timer;
XTmrCtr_Initialize(&timer, XPAR_OPB_TIMER_0_DEVICE_ID);
In order to measure the number of execution cycles a specific operation takes, the
timer has to be reset prior to starting it. Following the completion of the operations, the
timer is stopped and the value is read. The following code illustrates this:
XTmrCtr_Reset(&timer,0);
XTmrCtr_Start(&timer,0);
// operations to be measured
XTmrCtr_Stop(&timer,0);
cycles = XTmrCtr_GetValue(&timer,0);
On the PC end, measuring the time involved obtaining a header file from an
open-source project and using its timing functions to count the clock cycles completed
during runtime. Below are the commands we used to measure the execution time of the
corresponding stages on the PC:
LARGE_INTEGER start_ticks, end_ticks, cputime, ticksPerSecond;
QueryPerformanceFrequency(&ticksPerSecond);
QueryPerformanceCounter(&start_ticks);
// RECOGNITION STAGE CODE GOES HERE
QueryPerformanceCounter(&end_ticks);
cputime.QuadPart = end_ticks.QuadPart - start_ticks.QuadPart;
printf("\tElapsed CPU time test: %.9f sec\n",
(float)cputime.QuadPart / (float)ticksPerSecond.QuadPart);
5. CRITICAL APPRAISAL
5.1 Researching Face Recognition
Throughout this project, we faced several decisions to ensure that our face
recognition system design would work out as planned. In the early stages of the project,
we had only a vague idea of what we wanted to implement. We started gaining
experience in the field of face recognition by reading and researching numerous
publications.
that it is better to simulate a system using high-level tools before delving into details.
Indeed, the MATLAB simulation made our transition to C very smooth and strengthened
our understanding of the concepts at hand.
After we coded and thoroughly tested our functions, we proceeded to code the
actual PCA algorithm. Now that we had developed a library of custom functions,
stage must be stored in external memory, as the working memory of the FPGA is not
large enough to accommodate the requisite data. One of the problems we faced was that
DDR is word-addressable while the binary files are byte-addressable. As a result, we had
to concatenate every 4 bytes into a 32-bit word, with the proper endianness, and write it
to DDR.
Another problem we faced on the FPGA involved the use of the float data type.
Initially, we had planned on executing the entire PCA algorithm using floats, so
naturally we wrote some experimental C code on the FPGA that adds and prints two
floating-point numbers. Unfortunately, we were not obtaining correct results, because the
xil_printf statement, which prints to the HyperTerminal, does not support printing floats.
As a result, we had to resort to printing the float values as decimal integers, converting
them to hexadecimal, and using an online utility to obtain the corresponding floating-
point value.
Still, we obtained wrong results. After several days of wrestling with the problem,
we discovered that the decimal values represented the first 32 bits of a 64 bit
representation rather than a standalone 32-bit floating point representation. This taught us
a major lesson, that not everything comes easily, and that several things require spending
time with the code and experimenting with different methods.
multiplication. Using the documentation available with the board and some application
notes, we learned how to instantiate the hardware multipliers through VHDL code. We
then proceeded to learn how to add a custom IP core on the board and interface it with C
code. After overcoming many problems related to satisfying the timing constraints of the
on-board clock and hardware synthesis, we were able to use the core to perform fast
multiplication.
After even more research into the issue of using hardware multipliers, we
discovered that by altering the parameters of the MicroBlaze instance, we could route all
multiplication instructions in C to the on-board hardware multipliers. This eliminated the
need for the custom IP core and drastically improved our initial performance analysis.
This is because we had removed the overhead of transferring data to and from the custom
core in addition to delays incurred by the control signals needed to regulate the multiply
and accumulate process.
find a freely available one. We were looking for a program that sends frames at the
Ethernet level rather than the IP level. After some more searching, we found a C# source
file containing functions for sending raw Ethernet packets. We then modified the code to
send the data from the training stage.
At the FPGA end, we wrote the code to receive frames and specified the
necessary Ethernet settings. When we connected the two ends together, we used print
statements and the built-in hardware debugger on the FPGA end to make sure we were
receiving packets. At first, no packets were appearing on the receiving end. After revising
our initial Ethernet settings, we discovered that the receiver settings were kept in reset
mode. After resolving this issue, we discovered that we were receiving incorrect data in
the packets. By using the hardware debugger, we found that all packets we were
receiving had the broadcast address as their destination address. The reason for this
anomaly was that Windows was sending broadcast packets through Ethernet. When we
pinpointed the problem, we simply coded a filter that discards all unwanted packets.
subtraction, and writing the results back to memory. Similarly, the projection and
distance calculations involved several modifications for memory access. Throughout,
most of the hurdles we faced were related to indexing problems. This is due to the fact
that matrix multiplication involves two-dimensional entities which are stored linearly in
memory. As a result, we had to make sure that the indexing for matrix element access
was correct. To do so, we had to run the code several times and print out the results until
they matched the values obtained on the PC.
Other issues we faced in the PCA implementation on the FPGA concerned the
data types we were using. For example, the function that retrieves data from memory
stores the word in a variable of type Xuint32, an unsigned 32-bit integer. After
performing a subtraction (such as in the distance calculation), however, a negative result
stored in this unsigned type would yield wrong values. We corrected this by assigning
the result of the subtraction to an Xint32 instead of an Xuint32. We also faced other
problems related to data types such as float and long int. These problems taught us much
about the importance of understanding the nature of the data used in the system.
functionality. This proved to be impossible. As such, we resorted to testing each phase of
the algorithm individually and measuring its performance. Through this process, we
learned the importance and value of our limited memory resources and how to optimize
our implementation to use these resources efficiently.
When trying to measure the performance of each of our implementations, we
faced some difficulties in finding the correct tools for this purpose. On the FPGA side,
we had to add the OPB Timer module to count the execution cycles. This proved to
further limit our memory resources and force us to run the simulation on each of the
algorithm stages individually. On the host PC side, the regular C libraries did not provide
us with measurements that were accurate enough. Therefore, we had to resort to using
some open source functions and libraries to find accurate measurements for each of the
stages of the algorithm.
To sum up, it is evident that the past two semesters have been extremely fruitful
in terms of the amount of knowledge and experience acquired. Although at first we were
overwhelmed by the magnitude of this project, we discovered that breaking down our
problems into smaller pieces yielded quick and effective solutions. Moreover, we learned
a great deal about hardware implementations and FPGA programming, thereby widening
the scope of our applied knowledge.
6. RESULTS
6.1 Methodology Overview
After completing the implementation phase of our project, we moved on to the
analysis and performance assessment of our results. This involved obtaining several
execution time performance metrics and using them to interpret the relative efficiency of
our system.
As stated earlier, speed is of prime importance when it comes to the process of
recognizing a face. As such, the next most reasonable step involved obtaining a temporal
breakdown of the recognition phase. Specifically, the recognition phase can be broken
down into the following components:
Normalization
Projection
Distance Calculation
              Implementation 1     Implementation 2          Implementation 3
Device        Acer Laptop          Virtex-II Board           Virtex-II Board
Processor     1.7 GHz Centrino     100 MHz MicroBlaze        100 MHz MicroBlaze
Environment   MS Visual C++        Xilinx Platform Studio    Xilinx Platform Studio
Multiplier    Software             Programmable Gates        Dedicated Hardware

Table 4: Implementation Descriptions
the fact that this phase involves high computational demand in the form of a matrix
multiplication operation.
6.2 PC Implementation
In order to test our hypothesis, we first timed the recognition stage on
Implementation 1. This entailed importing additional libraries, along with predefined
time functions. We then started/ended the timer before/after each of the 3 stages of our
algorithm, and obtained the following results:
Phase                  Execution Time        Clock Cycles Elapsed
Normalization          0.235 milliseconds    399,500 clock cycles
Projection             49.5 milliseconds     84,150,000 clock cycles
Distance Calculation   3.32 milliseconds     5,644,000 clock cycles
TOTAL                  53.055 milliseconds   90,193,500 clock cycles

Table 5: Implementation 1 Results
In the above table, the number of clock cycles elapsed was obtained by
multiplying the execution time by 1.7 GHz, the speed of the Centrino processor.
Although not all of these clock cycles are spent on the recognition stage itself (some
sustain operating system functions), they must still be included. The reason is that, in
reality, a system implemented on a PC would run on an operating system and incur the
overhead of OS calls.
IP core interfaces with the OPB bus on the FPGA. Once regeneration was completed, we
measured the execution of each of the 3 stages for recognition. It is important to note that
unlike in Implementation 1, the opb_timer measures execution in terms of clock cycles as
opposed to units of time. Thus, we had to divide the number of clock cycles by 100 MHz
to obtain the execution time.
Phase                  Clock Cycles Elapsed        Execution Time
Normalization          1,558,397 clock cycles      15.5 milliseconds
Projection             211,744,171 clock cycles    2.12 seconds
Distance Calculation   1,474,767 clock cycles      14.7 milliseconds
TOTAL                  214,777,335 clock cycles    2.15 seconds

Table 6: Implementation 2 Results
Phase                  Clock Cycles Elapsed       Execution Time
Normalization          1,550,129 clock cycles     15.5 milliseconds
Projection             77,361,175 clock cycles    774 milliseconds
Distance Calculation   1,152,310 clock cycles     11.5 milliseconds
TOTAL                  80,063,614 clock cycles    801 milliseconds

Table 7: Implementation 3 Results
Firstly, comparing the results of Implementations 2 and 3, we notice that using
the dedicated hardware multipliers in Implementation 3 resulted in a significant speed-up.
The number of clock cycles in the projection phase of Implementation 3 is almost 63%
lower than in Implementation 2. As expected, the normalization phases of both FPGA
implementations were practically identical, since no multiplications take place in that
phase.
Lastly, there was approximately a 22% speed-up in the distance calculation in
Implementation 3 over Implementation 2, since this phase inherently involves squaring
values (i.e., multiplying values by themselves). The speed-up was not as high as in the
projection phase because the distance calculation phase is not purely
multiplication-intensive.
[Figure: FPGA Comparison — clock cycles per recognition stage (Normalization, Projection, Distance) for Implementations 2 and 3]
Looking closely at the results of the first and third implementations, we notice
that the FPGA execution time is slower than that of the software implementation. This is
primarily because the PC we ran the software implementation on has a 1.7 GHz
processor, versus the 100 MHz processor running on the FPGA. However, if we take the
number of clock cycles in an absolute sense, the third (hardware) implementation took
approximately 10% fewer clock cycles to execute than the software implementation.
Comparing performance in terms of clock cycles therefore shows the third
implementation to be the fastest, as shown in the graph below.
[Figure: Total Performance — total recognition-phase clock cycles for Implementations 1, 2, and 3]
The justification for a clock-cycle-based comparison stems from the fact that, in
practice, multiple hardware units would be used in parallel to run the face recognition
algorithm. Furthermore, in a production implementation, an FPGA board with a faster
processor core would be used to speed up the algorithm. Lastly, the algorithm could be
implemented as an ASIC, yielding a further increase in performance.
Thus far, our performance measurements have shown that utilizing the on-board
hardware multipliers greatly improves the performance of our system. The device
utilization summary reveals that there is ample room to make further use of the available
hardware multiplier units (the MULT18X18s): only 10 out of 40 (25%) are in use.
Therefore, due to the availability of these multipliers and the nature of matrix
multiplication, future work in this field could be centered on trying to utilize this resource
in order to parallelize the process of matrix multiplication and further improve the
performance of this application.
time. The same applies to applications in the home security industry. Fast face
recognition systems would allow for instant detection and entrance into the home.
In addition to investigating ways to speed up the recognition process, we also
focused on creating a system that is sustainable and upgradeable. Faces can be added to
the existing database with great ease, and re-computing the new data is instantaneous.
The fact that we used a Field Programmable Gate Array to implement our system is one
of the key advantages over other systems. It provides for easy upgrading of the system
simply through the modification of code that runs on the core processor. Moreover, new
cores and new features would cost very little since the system we created leaves available
a huge amount of programmable gates for future modifications.
On the ethical side, face recognition treads on sensitive territory regarding the
privacy of individuals. Many individuals prefer more discreet forms of identification and
detection that do not rely on such direct biometric measurements.
However, the privacy of an individual can be sustained by ensuring that the process is
automated and that the images captured are stored securely on a server. In this manner,
we can capitalize on the benefits of face recognition while preserving individual privacy.
Finally, from an economic vantage point, our project is an investment into
research that could result in millions of dollars of savings. Automation and speed-up will
lead to a lower need for human intervention, thereby cutting costs across several
frontiers. Nevertheless, the start-up cost for such a project is quite steep as it involves
revamping the entire security infrastructure that permeates modern life.
8. CONCLUSION
Over the course of the past two terms, we have researched the field of face
recognition, familiarized ourselves with the FPGA, and modeled the PCA algorithm in
both MATLAB and C. We next developed the system requirements of our intended
design and created a block diagram depicting the interconnection among the various
components of our system. Lastly, we implemented the algorithm on the FPGA, complete
with Ethernet, DDR memory, and on-board hardware multipliers. Profiling the code
revealed that matrix multiplication was the most time-consuming part of the algorithm
and that the on-board multipliers yield the best performance.
Our system can be further enhanced in several different ways. For example, a
friendly user interface can be created to improve software usability. Performance can be
further enhanced by employing hardware multipliers running in parallel and by
improving the clock speed of the soft core processor on the FPGA board. Having pieced
together the face recognition system over several months of milestones and setbacks, we
learned some valuable lessons. We hope that this system provides some additional insight
into the field of face recognition and contributes to the development of the field.
9. REFERENCES
[1]
[2]
[3]
[4]
[5]
[6] R. Gottumukkal and K. V. Asari, "System Level Design of Real Time Face
Recognition Architecture Based on Composite PCA," Proc. GLSVLSI 2003,
2003, pp. 157-160.
[7] H. T. Ngo, R. Gottumukkal, and V. K. Asari, "A Flexible and Efficient Hardware
Architecture for Real-Time Face Recognition Based on Eigenface," Proc. IEEE
Computer Society Annual Symposium on VLSI: New Frontiers in VLSI Design
(ISVLSI'05), 2005, pp. 280-281.
[8]
[9]
[10]
[11] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical
Recipes in C: The Art of Scientific Computing, 2nd ed., Cambridge University
Press, 1992.
10. APPENDIX
10.1 PCA Code in MATLAB
function [distances] = pca(A,test_face,k)
fprintf(1,'Computing average face...\n');
average_face = mean(A);
num_of_faces = size(A,1);
fprintf(1,'Computing vector differences...\n');
for i = 1:num_of_faces
faces_diff(i,:) = A(i,:) - average_face;
end;
fprintf(1,'Computing L matrix...\n');
L = faces_diff * faces_diff';
fprintf(1,'Computing Eigenvectors of L...\n');
[V,D] = eigs(L,k);
fprintf(1,'Extracting Eigenvectors of covariance matrix...\n');
eigenvec_u = faces_diff' * V;
%fprintf(1,'Normalizing eigenvectors...\n');
%z = sum(eigenvec_u,1);
%eigenvec_u = eigenvec_u ./ z (ones(size(eigenvec_u,1), 1) ,:);
%eigenvectors = eigenvec_u;
fprintf(1,'Computing face projections...\n');
projections = faces_diff * eigenvec_u;
fprintf(1,'Testing a face...\n');
%test_face = B(3,:);
test_norm = test_face - average_face;
test_proj = test_norm * eigenvec_u;
distances = dist(projections, test_proj');
void matrix_print(float** matrix, int height, int width)
{
int i,j;
for(i=0;i<height;i++)
{
for(j=0;j<width;j++)
printf ("%f\t",matrix[i][j]);
printf("\n");
}
printf("\n");
}
void vector_print(float* vector, int size)
{
int i;
for(i=0;i<size;i++)
printf ("%f\t",vector[i]);
printf("\n\n");
}
void matrix_transpose(float** matrix_in, int height_in, int width_in, float** matrix_out)
{
int i,j;
for(i=0;i<height_in;i++)
for(j=0;j<width_in;j++)
matrix_out[j][i] = matrix_in[i][j];
}
void matrix_average(float** matrix_in, int vector_num, int vector_size, float* matrix_out)
{
int i,j;
float temp;
for(i=0;i<vector_size;i++)
{
temp = 0;
for(j=0;j<vector_num;j++)
temp += matrix_in[j][i];
matrix_out[i] = temp/vector_num;
}
}
void matrix_multiply(float** matrix1, int height1, int width1,
float** matrix2, int height2, int width2,
float** matrix_out)
{
int i,j,k;
float temp;
for(i=0;i<height1;i++)
for(j=0;j<width2;j++)
{
temp = 0;
for(k=0;k<width1;k++)
temp = temp + matrix1[i][k]*matrix2[k][j];
matrix_out[i][j] = temp;
}
}
void read_database(float** db, int height, int width)
{
FILE * pFile;
long lSize;
float* buffer;
int i,j;
// open file
pFile = fopen ( "database.txt" , "rb" );
// obtain file size
fseek (pFile , 0 , SEEK_END);
lSize = ftell (pFile);
rewind (pFile);
// allocate memory to contain the whole file
buffer = (float*) malloc (lSize);
// copy the file into the buffer.
fread (buffer,1,lSize,pFile);
for(i=0;i<height;i++)
for(j=0;j<width;j++)
db[i][j] = buffer[j*height+i];
// release the buffer and close the file
free (buffer);
fclose (pFile);
}
void read_testface(float* face, int size)
{
FILE * pFile;
long lSize;
float* buffer;
int i;
// open file
pFile = fopen ( "testface.txt" , "rb" );
// obtain file size
fseek (pFile , 0 , SEEK_END);
lSize = ftell (pFile);
rewind (pFile);
// allocate memory to contain the whole file
buffer = (float*) malloc (lSize);
// copy the file into the buffer.
fread (buffer,1,lSize,pFile);
for(i=0;i<size;i++)
face[i] = buffer[i];
// close file
fclose (pFile);
}
void eigsrt(double *d, double **v, int n)
/* Given the eigenvalues d[1..n] and eigenvectors v[1..n][1..n],
this routine sorts the eigenvalues into descending order of
magnitude and rearranges the columns of v correspondingly. The
method is straight insertion. */
{
int k, j, i;
double p;
for (i = 1; i < n; i++) {
p = d[k=i];
for (j = i + 1; j <= n; j++)
if (fabs(d[j]) >= fabs(p))
p = d[k=j];
if (k != i) {
d[k] = d[i];
d[i] = p;
for (j = 1; j <= n; j++) {
p = v[j][i];
v[j][i] = v[j][k];
v[j][k] = p;
}
}
}
}
void tred2(double **a, int n, double *d, double *e)
/* Householder reduction of a real, symmetric matrix
a[1..n][1..n]. On output, a is replaced by the orthogonal
matrix Q effecting the transformation. d[1..n] returns the
diagonal elements of the tridiagonal matrix, and e[1..n] the
off-diagonal elements, with e[1] = 0. Several statements, as
noted in comments, can be omitted if only eigenvalues are to
be found, in which case a contains no useful information on
output. Otherwise they are to be included. */
{
int l, k, j, i;
double scale, hh, h, g, f;
for (i = n; i>= 2; i--) {
l = i - 1;
h = scale = 0.0;
if (l > 1) {
for (k = 1; k <= l; k++)
scale += fabs(a[i][k]);
if (scale == 0.0)
/* skip transformation */
e[i] = a[i][l];
else {
for (k = 1; k <= l; k++) {
a[i][k] /= scale; /* use scaled a's for transformation*/
h += a[i][k] * a[i][k]; /* form sigma in h */
}
f = a[i][l];
g = (f >= 0.0 ? -sqrt(h) : sqrt(h));
e[i] = scale * g;
h -= f * g;
/* Now h is equation (11.2.4) */
a[i][l] = f-g;
/* Store u in the ith row of a. */
f = 0.0;
for (j = 1; j <= l; j++) {
/* Next statement can be omitted if eigenvectors not wanted
*/
a[j][i] = a[i][j] / h; /* Store u/H in ith column of a. */
g = 0.0;
/* Form an element of Au in g. */
for (k = 1; k <= j; k++)
g += a[j][k] * a[i][k];
for (k = j+1; k <= l; k++)
g += a[k][j] * a[i][k];
e[j] = g/h;
/* Form element of p in temporarily
unused element of e */
/* Page 2 */
f += e[j] * a[i][j];
}
hh = f / (h + h); /* Form K, equation (11.2.11). */
for (j = 1; j <= l; j++) { /* Form q and store in e
overwriting p */
f = a[i][j];
/* Note that e[l] = e[i-1] survives */
e[j] = g = e[j] - hh * f;
for (k = 1; k <= j; k++) /* Reduce a, equation (11.2.13) */
a[j][k] -= (f * e[k] + g * a[i][k]);
}
}
} else
e[i] = a[i][l];
d[i] = h;
}
/* Next statement can be omitted if eigenvectors not wanted */
d[1] = 0.0;
e[1] = 0.0;
/* Contents of this loop can be omitted if eigenvectors not wanted
except for statement d[i] = a[i][i]; */
for (i = 1; i <= n; i++) { /* Begin accumulation of
transformation matrices */
l = i - 1;
if (d[i]) {
/* This block skipped when i = 1 */
for (j = 1; j <= l; j++) {
g = 0.0;
for (k = 1; k <= l; k++) /* Use u and u/H stored in a to form
PQ */
g += a[i][k] * a[k][j];
for (k = 1; k <= l ; k++)
a[k][j] -= g * a[k][i];
}
}
d[i] = a[i][i];
/* This statement remains */
a[i][i] = 1.0;
for (j = 1; j <= l; j++)
a[j][i] = a[i][j] = 0.0;
}
}
void tqli(double *d, double *e, int n, double **z)
/* QL algorithm with implicit shifts, to determine the eigenvalues
and eigenvectors of a real, symmetric, tridiagonal matrix previously
reduced by tred2. On input, d[1..n] contains the diagonal elements
and e[1..n] the subdiagonal elements (e[1] arbitrary). On output, d
returns the eigenvalues and z the eigenvectors. */
{
int m, l, iter, i, k;
double s, r, p, g, f, dd, c, b;
for (i = 2; i <= n; i++)
e[i-1] = e[i];
e[n] = 0.0;
for (l = 1; l <= n; l++) {
iter = 0;
do {
for (m = l; m <= n-1; m++) {
/* Look for a single small subdiagonal element to split the matrix */
dd = fabs(d[m]) + fabs(d[m+1]);
if ((double)(fabs(e[m]) + dd) == dd)
break;
}
if (m != l) {
if (iter++ == 30)
printf("Too many iterations in tqli\n");
g = (d[l+1] - d[l]) / (2.0 * e[l]); /* form shift */
r = pythag(g, 1.0);
g = d[m] - d[l] + e[l] / (g + SIGN(r, g));
s = c = 1.0;
p = 0.0;
/* Page 2 */
/* A plane rotation as in the original QL, followed by Givens
rotations to restore tridiagonal form. */
for (i = m-1; i >= l; i--) {
f = s * e[i];
b = c * e[i];
e[i+1] = (r = pythag(f,g));
/* recover from underflow */
if (r == 0.0) {
d[i+1] -= p;
e[m] = 0.0;
break;
}
s = f/r;
c = g/r;
g = d[i+1] - p;
r = (d[i] - g) * s + 2.0 * c * b;
d[i+1] = g + (p = s * r);
g = c * r - b;
/* Next loop can be omitted if eigenvectors not wanted */
/* Form eigenvectors */
for (k = 1; k <= n; k++) {
f = z[k][i+1];
z[k][i+1] = s * z[k][i] + c * f;
z[k][i] = c * z[k][i] - s * f;
}
}
if (r == 0.0 && i >= l)
continue;
d[l] -= p;
e[l] = g;
e[m] = 0.0;
}
} while (m != l);
}
}
/***initializations****/
database = malloc(NUMFACES*sizeof(float*));
for(i=0;i<NUMFACES;i++)
database[i] = malloc(FACESIZE*sizeof(float));
average = malloc(FACESIZE*sizeof(float));
L = malloc(NUMFACES*sizeof(float*));
for(i=0;i<NUMFACES;i++)
L[i] = malloc(NUMFACES*sizeof(float));
database_trans = malloc(FACESIZE*sizeof(float*));
for(i=0;i<FACESIZE;i++)
database_trans[i] = malloc(NUMFACES*sizeof(float));
eigenvalues = malloc(NUMFACES*sizeof(float));
eigenvectors = malloc(NUMFACES*sizeof(float*));
for(i=0;i<NUMFACES;i++)
eigenvectors[i] = malloc(NUMFACES*sizeof(float));
eigenvectors_orig = malloc(FACESIZE*sizeof(float*));
for(i=0;i<FACESIZE;i++)
eigenvectors_orig[i] = malloc(NUMFACES*sizeof(float));
projections = malloc(NUMFACES*sizeof(float*));
for(i=0;i<NUMFACES;i++)
projections[i] = malloc(NUMFACES*sizeof(float));
test_face = malloc(1*sizeof(float*));
test_face[0] = malloc(FACESIZE*sizeof(float));
test_projection = malloc(1*sizeof(float*));
test_projection[0] = malloc(NUMFACES*sizeof(float));
/*****pca training*****/
// obtain database
read_images(database,NUMFACES, FACESIZE);
// find average face
matrix_average(database,NUMFACES,FACESIZE,average);
// normalize database
matrix_subtract(database,NUMFACES,FACESIZE,average,database);
// compute L matrix
matrix_transpose(database,NUMFACES,FACESIZE,database_trans);
matrix_multiply(database,NUMFACES,FACESIZE,database_trans,FACESIZE,NUMFACES,L);
// compute eigenvectors of L
eig(L,NUMFACES,eigenvalues,eigenvectors);
// derive eigenvectors of original matrix
matrix_multiply(database_trans,FACESIZE,NUMFACES,eigenvectors,NUMFACES,NUMFACES,eigenvectors_orig);
// compute face projections
matrix_multiply(database,NUMFACES,FACESIZE,eigenvectors_orig,FACESIZE,NUMFACES,projections);
/***pca recognition****/
// obtain test face
read_testface(test_face[0],FACESIZE);
// normalize test face
matrix_subtract(test_face,1,FACESIZE,average,test_face);
// project test face
matrix_multiply(test_face,1,FACESIZE,eigenvectors_orig,FACESIZE,NUMFACES,test_projection);
// compute minimum distance
for(i=0;i<NUMFACES;i++)
{
if(i==0)
min = vector_distance(test_projection[0],projections[i],NUMFACES);
else
{
temp_min = vector_distance(test_projection[0],projections[i],NUMFACES);
if(temp_min < min)
{
min = temp_min;
i_mark = i;
}
}
}
printf("The minimum distance belongs to face %i and has a value of %f\n",i_mark+1, min);
}
Console.WriteLine(count.ToString());
br.Close();
fs.Close();
count = 0;
// read binary file
fs = File.OpenRead("b_avg_int");
br = new BinaryReader(fs);
// destination mac
packet[0] = 0x01;
packet[1] = 0x06;
packet[2] = 0x07;
packet[3] = 0x08;
packet[4] = 0x09;
packet[5] = 0x04;
// source mac
packet[6] = 0x00;
packet[7] = 0x56;
packet[8] = 0x00;
packet[9] = 0xFF;
packet[10] = 0x02;
packet[11] = 0xC5;
// length of data bytes
packet[12] = 0x03;
packet[13] = 0xE8;
for (i = 0; i < 75000; i++)
{
packet[14 + i % DATA_SIZE] = br.ReadByte();
if (i % DATA_SIZE == 999)
{
rawether.DoWrite(packet);
count++;
}
for (j = 0; j < 150000; j++) ;
}
Console.WriteLine(count.ToString());
br.Close();
fs.Close();
count = 0;
// read binary file
fs = File.OpenRead("b_eigen_int");
br = new BinaryReader(fs);
// destination mac
packet[0] = 0x01;
packet[1] = 0x06;
packet[2] = 0x07;
packet[3] = 0x08;
packet[4] = 0x09;
packet[5] = 0x04;
// source mac
packet[6] = 0x00;
packet[7] = 0x56;
packet[8] = 0x00;
packet[9] = 0xFF;
packet[10] = 0x02;
packet[11] = 0xC5;
// length of data bytes
packet[12] = 0x03;
packet[13] = 0xE8;
for (i = 0; i < 3825000; i++)
{
packet[14 + i % DATA_SIZE] = br.ReadByte();
if (i % DATA_SIZE == 999)
{
rawether.DoWrite(packet);
count++;
}
for (j = 0; j < 150000; j++) ;
}
Console.WriteLine(count.ToString());
br.Close();
fs.Close();
count = 0;
// read binary file
fs = File.OpenRead("b_proj_int");
br = new BinaryReader(fs);
// destination mac
packet[0] = 0x01;
packet[1] = 0x06;
packet[2] = 0x07;
packet[3] = 0x08;
packet[4] = 0x09;
packet[5] = 0x04;
// source mac
packet[6] = 0x00;
packet[7] = 0x56;
packet[8] = 0x00;
packet[9] = 0xFF;
packet[10] = 0x02;
packet[11] = 0xC5;
// length of data bytes
packet[12] = 0x03;
packet[13] = 0xE8;
for (i = 0; i < 10404; i++)
{
packet[14 + i % DATA_SIZE] = br.ReadByte();
if (i % DATA_SIZE == 999)
{
rawether.DoWrite(packet);
count++;
}
for (j = 0; j < 150000; j++) ;
}
Console.WriteLine(count.ToString());
br.Close();
fs.Close();
#define TESTFACE_BASEADDR 0
#define TESTFACE_SIZE 75000
#define AVGFACE_BASEADDR 75000
#define AVGFACE_SIZE 75000
#define EIG_BASEADDR 150000
#define EIG_SIZE 3825000
#define PROJ_BASEADDR 3975000
#define PROJ_SIZE 10404
/* remaining extracted values whose #define names were lost:
   14, 6, 1500, 0x40c00000, 0x22000000 */
#define TOTAL_FRAMES 3986
#define NUMFACES 51
// PROTOTYPES
int XEmac_RecvFrameSS(Xuint32 BaseAddress, Xuint8 *FramePtr);
void wait(Xuint32 time);
// normalization stage
XTmrCtr_Reset(&timer,0);
XTmrCtr_Start(&timer,0);
for (i=0; i<TESTFACE_SIZE; i+=4)
{
r1 = XDdr_mReadReg (MEM_BASEADDR, TESTFACE_BASEADDR+i);
r2 = XDdr_mReadReg (MEM_BASEADDR, AVGFACE_BASEADDR+i);
r1 = r1 - r2;
XDdr_mWriteReg (MEM_BASEADDR, TESTFACE_BASEADDR+i, r1);
}
XTmrCtr_Stop(&timer,0);
cycles = XTmrCtr_GetValue(&timer,0);
xil_printf("Normalization Cycles: %d\r\n",cycles);
// projection stage
XTmrCtr_Reset(&timer,0);
XTmrCtr_Start(&timer,0);
for (i=0; i<NUMFACES; i++)
{
product = 0;
for (j=0; j<TESTFACE_SIZE; j+=4)
{
r1 = XDdr_mReadReg (MEM_BASEADDR, EIG_BASEADDR +
i*4+j*NUMFACES);
r2 = XDdr_mReadReg (MEM_BASEADDR,
TESTFACE_BASEADDR+j);
product += r1 * r2;
}
projections[i] = product;
}
XTmrCtr_Stop(&timer,0);
cycles = XTmrCtr_GetValue(&timer,0);
xil_printf("Projection Cycles: %d\r\n",cycles);
// distances
XTmrCtr_Reset(&timer,0);
XTmrCtr_Start(&timer,0);
for (i=0; i<NUMFACES;i++)
{
ftemp = 0;
for (j=0; j<NUMFACES; j++)
{
r1 = XDdr_mReadReg (MEM_BASEADDR,
PROJ_BASEADDR+(i*NUMFACES+j)*4);
int1 = r1 - projections[j];
ftest = (Xfloat32) int1;
ftemp += ftest*ftest;
}
//fdist = sqrt(ftemp);
fdist = ftemp;
if (i==0)
{
min = fdist;
imark = i;
}
else
{
if (fdist < min)
{
min = fdist;
imark = i;
}
}
}
if (FramePtr[0] == 1 &&
FramePtr[1] == 6 &&
FramePtr[2] == 7 &&
FramePtr[3] == 8 &&
FramePtr[4] == 9 &&
FramePtr[5] == 4)
{
//printf ("Received a GOOD Packet\r\n");
return Length;
}
else
return -1;
}
void wait(Xuint32 time)
{
Xuint32 cnt = 0;
while(cnt<time)
{
cnt++;
}
}
285006939.446492
287533372.651606
289339502.859201
289201806.462897
271901321.489990
270497789.759402
264169974.413589
255327852.287155
258682538.304327
251213559.123769
259579475.746799
263522708.902784
243460224.373596
256767062.230726
249083248.619941
78699745.431568
257229726.685719
264791717.983083
303273149.042238
307864942.616162
300581205.147086
293359560.382787
298626547.239392
291808943.826539
302661000.071662
301379863.871029
308336091.791721
332560286.526925
345542685.107667
346003889.433077
The minimum distance belongs to face 3 and has a value of 0.000000
***
Integer Implementation : Projection Distance Calculations
284225955.814634
279126687.096212
620494.066290
15590958.052778
42948163.259742
196103692.379704
208225158.720017
216861641.784979
91093926.417311
289572508.175456
286698981.540504
285883839.142958
133857089.836859
113775500.038695
95048959.642352
63801517.932157
70310480.676750
72818549.508288
108058501.418556
108302177.848746
126181027.633667
284607217.509605
287127633.331032
288925692.704150
288760773.338336
271457671.284751
270053651.838463
263708795.309670
254867805.414352
258215407.944628
250731364.175974
259102093.073329
263054021.925713
242983661.512971
256275114.290106
248603954.983011
78496933.796400
256721561.955371
264280157.572359
302767434.835988
307361433.381023
300074824.514051
292915766.134634
298183088.693402
291376106.729249
302256564.576121
300979732.587588
307934681.917593
332104472.690896
345087533.640911
345549694.756852
The minimum distance belongs to face 3 and has a value of 620494.066290
Press any key to continue