You are on page 1of 6

The FERET Verification Testing Protocol for Face Recognition Algorithms

Syed A. Rizvi1 P. Jonathon Phillips,2 and Hyeonjoon Moon3 , Department of Applied Sciences College of Staten Island of City University of New York, Staten Island, NY 10314
2 1

National Institute of Standards and Technology Gaithersburg, MD 20899,

Department of Electrical and Computer Engineering State University of New York at Buffalo, Amherst, NY 14260


Two critical performance characterizations of biometric algorithms, including face recognition, are identification and verification. Identification performance of face recognition algorithms on the FERET tests has been previously reported. In this paper we report on verification performance obtained from the Sept96 FERET test. The databases used to develop and test algorithms are usually smaller than the databases that will be encountered in applications. We examine the effects of size of the database on performance for both identification and verification.

1. Introduction
Identification and verification of a person’s identity are two potential areas for applications of face recognition systems. In identification applications, systems identity an unknown face in an image; i.e., searching an electronic mugbook for the identity of suspect. In verification applications, systems confirm the claimed identity of a face presented to the system. Proposed applications for verification systems include; controling access to buildings and computer terminals, confirming identities at automatic teller machines (ATMs), and verifying identities of passport holders at immigration ports of entry. These application have the potential to influence and impact our everyday life. 
This work was performed as part of the Face Recognition Technology (FERET) program, which is sponsored by the U.S. Department of Defense Counterdrug Technology Development Program. Portions of this were done while Jonathon Phillips was at the U.S. Army Research Laboratory (ARL). Please direct correspondence to Jonathon Phillips.

For systems to be successfully fielded, it is critical that their performance is known. However, to date the performance of most algorithms has only been reported on identification tasks, which implies that characterization on identification tasks holds for verification. For face recognition systems to successfully meet the demands of verification applications it is necessary to develop test and scoring protocols that specifically addresses these applications. These scoring protocols also have the potential to allow for a quantitative comparison between competing biometrics. An evaluation protocol generally consists of two parts. In the first part, an algorithm is executed on a test set of images and the output from executing the algorithm is written to a file(s). This produces the raw results. In the second part, the raw results are scored, which produces performance statistics. If the protocol is properly designed, the performance statistics can be computed for both identification and verification scenarios. The Sept96 FERET evaluation method is such a protocol [7], which used images from the FERET database of facial images [8]. The Sept96 FERET test is the latest in a series of FERET tests to measure the progress, assess the state-of-the-art, identify strengths and weakness of individual algorithms, and point out future directions of research in face recognition. Prior analysis of the FERET results has concentrated on identification scenarios. In this paper we present (1) a verification analysis method for the Sept96 FERET test, and (2) results for verification. The databases used to develop and test algorithms are usually smaller than the databases that will be encountered in applications. We examine the effects of size of the database on performance for both identification and verification.

Also.) There are two versions of the September 1996 test. Two algorithms from the MIT Media Laboratory (MIT) were tested. a different facial expression was requested for the second frontal image. The virtual gallery and probe set technique allows us to characterize algorithm performance for identification and verification. e. For each image qi in the query set Q.) In the second version the eye coordinates are given. To maintain a degree of consistency throughout the database. and calculate the variation in algorithm performance with the different galleries of size 100. G T . performance can be broken out by different categories of images. A gallery G is a virtual gallery if G is a proper subset of the target set.. an algorithm reports the similarity si k between qi and each image tk in the target set T . and the location and setup did not change during a session. Using this as a starting point. there was variation from session to session. The target and query sets are the same for each version. The Sept96 test was administered in September 1996 and in March 1997 (table 1 tells when the groups were tested). (The gallery is the set of known individuals. FB probes or duplicate probes. [13]. collected under relatively unconstrained conditions. We introduce this terminology to distinguish these sets from the gallery and probe sets that are used in computing performance statistics. Thus. The output for each query image qi is sorted by the similarity scores si  . .2. the performance level of face recognition algorithms in general improves overtime. The Sept96 FERET test The Sept96 FERET testing protocol was designed so that algorithm performance can be computed for (1) identification and verification evaluation protocols. The images in the query set are the unknown facial images to be identified. is that for any two images si and tk . and the collection of probes is called the probe set. The algorithm tested in September 1996 was based on the Etemad and Chellappa [3] and in March 1997 on Zhao et al. All the images in the target set were frontal images. From the output files. For a given gallery G and probe set P . for 14. The database contains 1199 individuals and 365 duplicate sets of images. For some people. This is because performance of the algorithms from a particular group will improve. and also. Two frontal views were taken (FA and FB). Images of an individual were acquired in sets of 5 to 11 images. people and determine how performance changes as the size of the gallery increases.. The development portion of the database consisted of 503 sets of images. Locating the faces in the images must be done automatically. This allowed a detailed analysis of performance on multiple galleries and probe sets.) In the Sept96 protocol. The facial images were collected in 15 sessions between August 1993 and July 1996. However.e. The first version requires that the algorithms be fully automatic. (We do present results in this paper using the rotated or digitally modified images. Sessions lasted one or two days. The first was the algorithm that was tested in March 1995 [6]. we know si k . algorithm performance can be computed for virtual galleries and probe sets. This algorithm was retested so that improvement since March 1995 could be determined. the photographer used the same physical setup in each photography session. we implemented and tested normalized correlation and our version of eigenfaces [10]. and (2) a variety of different galleries and probe sets [7]. (In the fully automatic version the testees are given a list of the images in the target and query sets. We designed the digitally modified images to test the effects of illumination and scale. an algorithm is given two sets of images: the target set and the query set. of Southern California were the only groups to take the fully automatic test. The remaining images were sequestered by the Government.    200 300 1000   . over two years elapsed between their first and most recent sittings. When comparing performance among algorithms one must note when the test were administered. We can create a gallery of 100 people and estimate the algorithm’s performance at recognizing people in this gallery. we can then create virtual galleries of . The key property of the new protocol. the performance scores are computed by examination of the similarity measures si k such that qi 2 P and tk 2 G . because the Media Lab and U. The University of Maryland (UMd) took the test in September 1996 and March 1997. The images were taken from the FERET database of facial images [8]. P is a virtual probe set if P Q. which allows for greater flexibility in scoring. the target set contained 3323 images and the query set 3816 images. 1564 sets of images were in the database. the test output contains every target image matched with itself. which were released to researchers. an algorithm outputs the similarity measure si k for all images tk in the target set. By July 1996. Except for a set of rotated and digitally modified images. The target set is given to the algorithm as the set of known facial images. For each query image qi . : : :. To provide a performance baseline.126 total images. In the September 1996 FERET test. because the equipment had to reassembled for each session. i. We only report results for the semi-automatic case (eye coordinates given). Similarly. The second algorithm was based on more recent work [5]. Another avenue of investigation is to create n different galleries of size 100. with some subjects being photographed multiple times. An image of an unknown face presented to the algorithm is called a probe. The query set consisted of all the images in the target set plus rotated images and digitally modified images. the target and query sets are the same.g.

For a probe pk 2 P . 6] U. this decision rule maximizes the hit rate for a false alarm rate . Mathematically. an algorithm is presented with an image p that claims to be the same person as in image g. The gallery hit and false alarm rates at the cut-off c. The probability of responding “yes” given an event “nig” has occurred is termed the false alarm rate. For a given algorithm. two possible events are defined: Either a probe pk is in the gallery (event “ig”) or pk is not in the gallery (event “nig”). The verification ROC is computed by varying c from . P Y jig . P Y jnig . the choice of a suitable Hit and False Alarm Rate pair would probably depend on a particular application. Verification Model In our verification model.1 to 1. The verification protocol considers one gallery image at a time. By varying c from it’s minimum to maximum value we obtain a set of hit and false alarm rate pairs. p 6 g. the algorithm determines if p  g . of Southern California (USC) [12] Eye Coordinates Given ARL PCA [10] ARL Correlation Excalibur Corp. 4].i Y jig and Pc. Hit Rate P Y jig (1) = : =1 = .) We get overall system performance by summing the results for I individual detection problems. P N jig P Y jnig . That is. In our model. For a given cut-off c. The observer responds “yes” if si k  c and “no” if si k c. For each probe p. otherwise.” into two intervals. the system decides if pk  gi for all i and k.) The verification problem for a single individual is equivalent to the detection problem with a gallery G fgg and probe set P . A given cut-off.i Y jig Pc Y jig = jGj i=1 j j   (3) and 1 Pc Y jnig = jGj  + XP j Gj =   and False Alarm Rate = P Y jnig: (2) which is the average of Pc. the equal error rate is often quoted. which we call the gallery G . we need performance over a population of people. (Carlsbad. and the false alarm rate.i Y jig and Pc. : : : . K g with all gallery images. I g. “ig” and “nig. =1 Our decisions will be made by a Neyman-Pearson observer. This measures system performance for a single person (who’s face is in g ). are computed by              G 1 X Pc.i Y jnig . However. Test Date Version of test Group September 1996 March 1997 Baseline Fully Automatic MIT [5. where the gallery consists of a single image gi and the probe set P . respectively.i Y jnig over the gallery G. the system allows access to a group of people fgi i . The verification rate. c. (We assume that there is only one image per person in the gallery. divides both distributions. The probability of responding “yes” (Y) given an event “ig” has occurred is termed the hit rate. i=1 c. List of groups that took the Sept96 test broken out by versions taken and dates administered. However. By the Neyman-Pearson theorem [4]. [11] UMd [3] USC 3.   (4)   =   . We briefly review the following terminology commonly used in signal detection theory [2] that will be used in the subsequent analysis. The equal error rate is the point where the Incorrect Rejection and False Alarm Rates are equal. are the proportions of the “ig” and “nig” distributions in the interval si k  c. : : : . denoted by Pc Y jig and Pc Y jnig . [9] Rutgers U. that is. (If p and g are images of the same person then we write p  g. A plot of these pairs is the receiver operating characteristic (ROC) (also known as the relative operating characteristic) [2. CA) MIT Michigan State U. the verification and false alarm rates for an image gi are denoted by Pc. The system is tested by comparing each probe on the probe set P fpk k . The system either accepts or does not accept the claim.Table 1. for performance evaluation and comparison among algorithm. We reduce this to I detection problems.i Y jnig.

7 0. the gallery consisted of 1196 individuals with one FA image per person. the universe of probes in the gallery and not in the gallery.0 0. (Gallery size = 1196. we say that verification is “probe” centric. For the results in this section. However. For each gallery size. Cumulative match scores report for what fraction of the probes the correct answer is in the top k . number of probes = 3120.8 0. we randomly generated galleries of size 200. then for pk to be correctly identified. ROC for FB probes. and we say that identification is “gallery” centric.5 False alarm rate False alarm rate Figure 1. Effects of Gallery and Probe Set Size Calculation of the system hit and false alarm rates is done by averaging the hit and false rates of individual gallery images.4 0. 5. then the average rank k score mk k            =    = = = .0 0. the hit and false alarm rates for an individual gallery image are dependent on the probe set. we average the cumulative match scores.2 0. An alternate presentation of the information in the ROC is a plot of the incorrect rejection and false alarm rates as a function of the similarity measure s (fig 3).4 0. Figure 1 shows the ROC performance for FB probes.) Figure 2. number of probes = 3120.2 0. From these curves one can exam how sensitive algorithm performance is to changes in s. (Gallery size = 1196. 400. figure 2 shows the ROC performance for duplicate probes.0 Probability of correct verification 0. si k si0 k for all i0 6 i.8 UMD Mar 97 USC Mar 97 Excalibur Corp. Thus. A concern for verification is how performance is affected by an increase in the size of the probe sets. Let K be the size of a probe set and Rk the number probes that the correct answer is in the top k .1 0.5 UMD Mar 97 USC Mar 97 Excalibur Corp. The performance was computed for the following two probe sets: FB images: the second frontal image taken from the same image set as the gallery image. Therefore the identification rates for a probe are dependent on the gallery.9 0.) 4. these curves cannot be used to directly compare performance among algorithms. If m` Rk =K for G` . The mk scores report performance for a single gallery and probe set. etc). Verification Results In this section we report the results of applying the verification to the Sept96 FERET test scores. different hair length. MIT Sep 96 Rutgers MSU MIT Mar 95 UMD Sep 96 ARL Eigenface ARL Cor 0.9 Probability of correct verification 0.7 0. To summarize performance for a sample population. 600. When the size of the probe set increases (sampling from Pig and Pnig ). if pk  gi . ROC for duplicate probes.5 0. For identification.1 0. with k on the horizontal axis and mk on the vertical axis. Thus. scores are usually reported a cumulative match scores. and 800. Duplicate: is an image taken in a different session (a different date) or taken under special circumstances (such as the subject was wearing glasses. the accuracy of the estimates of P si k jig . MIT Sep 96 Rutgers MSU MIT Mar 95 UMD Sep 96 ARL Eigenface ARL Cor 0. The scores were calculated using a PCA algorithm with histogram equalization preprocessing and the L1 metric for recognition. For identification.4 0. P si k jnig .6 0. Performance is determined by P si k jig and P si k jnig . we generated 100 galleries and scored them against duplicate probe sets.3 0. but not on other members of the gallery. The results are presented in a graph.3 0. Therefore.0 1. The fraction reported is mk Rk =K . and the resulting ROC increase. which are generated by sampling from the Pig and Pnig . The individual rates are a function of the distances between gi and all the probes in P . To study the effects of gallery size on performance.1. mean performance is not effected by the size of the gallery or probe set.

5 0. This raises the hypothesis that distributions of h for the different-sized galleries. The average was computed from 100 galleries for each size gallery.1 0.8 0. Performance of the ARL Eigenface algorithms in terms of incorrect rejection and false alarm rates for the duplicates test. which we will denote by h .4 9. the null hypoth1 N Pm.5 0. h is the distribution of m` for galleries G` (m` is a relative score version of m` ). The results show that in all cases. For each set of galleries we calculated 100 cumulative matches scores. 400.4 0. and 800. In relative performance. 400.2 0.2 0.0 0. 800.   = 100 1.1 0. .3 S 9.2 9.5 0.6 Incorrect Rejection Rate False Alarm Rate Cumulative Match Score 0.4 0. When comparing galleries of different sizes. we performed all pairwise comparisons between different-sized galleries. 15 [1]). Therefore.3 0.0 0.0 gallery size 200 400 600 800 0 10 20 30 40 50 Rank 60 70 80 90 100 Figure 3.0 0. Figure 4 reports the average cumulative match scores for galleries of size 200.6 0.2 0.0 gallery size 200 400 600 800 0 1 2 3 4 5 6 7 8 Percentage of Rank 9 10 11 12 Figure 5.0 9.0 9. mk is the average cumulative match scores for an algorithm. for each there are 100 relative match scores and these scores have a distribution. it is insightful to compare relative performance of galleries.1 0.) The curve k.5 9. 600.6 0.. Average cumulative rank scores for galleries of size 200. (Gallery size = 100.7 0. ` ` k Figure 4.1 9. Figure 5 reports the results from figure 4 as average cumulative percent scores.8 0.7 Probability 1.9 0.1. We chose the null hypothesis to be that the distributions for h05 were equal for all gallery sizes. The average percent score for is m mb N=100c . where N . The cumulative match scores in figure 4 rescaled so that the rank is a percentage of the gallery. 600.7 0. Scores are for duplicate probe sets. number of duplicate probes belonging to gallery = 309. i.3 0.9 0. The average cumulative score reports performance as a function of absolute rank.e.3 0.4 0. To test this hypothesis. Figure 4 shows a decrease in performance as gallery size increases.6 0. We performed the hypothesis testing with the bootstrap permutation test on the means of h05 (chap. we measure the fraction of probes that are in the top . k The graph of the average cumulative percent scores in figure 5 are very close in the value. number of probes 2924.9 0.8 Cumulative Match Score  e =  0.

18(8):831–836. we need to able to predict the effects on performance of increases in the size of the gallery and probe set. Mammone. Swets and J. 17(7):775–779. 1993. [12] L. J. Wiskott. Swets. 1994. Nastar. [8] P. J. Cognitive Neuroscience. one needs to note when the tests where administered. By knowing their weaknesses. John Wiley & Sons Ltd. Rizvi. Weng. This probably does not have much impact on tests such as the FERET tests where the gallery and probe set are from the same database. 17(7):696–710. Face recognition by elasric bunch graph matching. Egan.. Fellous. 1966. In R. Phillips. Tibshirani. Moghaddam and A. This allows for an independent assessment of algorithms in a key potential application area of face recognition. [3] K. Chapman Hall. Krishnaswamy. pages 520–536. This shows that algorithm development is a dynamic process and evaluations such as FERET make an important contribution to computer vision. We have empirically shown for one algorithm that expected relative cumulative match scores are not a function of gallery size. 1975. We repeated the procedure for h10 . PAMI. and S. 1997. P. IEEE Trans. The FERET evaluation methodology for face-recognition algorithms. in many applications this assumptions may not hold and its impact needs to investigated. They let researchers know the strengths of their algorithms and were improvements could be made.-M. C. Turk and A. Chellappa. We have shown that performance for verification is “probe” centric. The FERET database and evaluation procedure for facerecognition algorithms. 1996. J. and P. J. Academic Press. [10] M. and reported results for this protocol. [6] B. [2] J. 1997. N. 6. Chellappa. [4] D. Face recognition using transform coding of gray scale projection projections and the neural tree network. [9] D. Phillips. Signal Detection Theory and Psychophysics. Therefore. and C. References [1] B. Artifical Neural Networks with Applications in Speech and Vision. 1996. PAMI. the null hypothesis cannot be rejected. Green and J. H. Chapman and Hall. 1997. Kruger. However. In 3rd International Conference on Automatic Face and Gesture Recognition. for identification it is “gallery” centric. which produced the same conclusion. Etemad and R. to appear. In Proceedings Computer Vision and Pattern Recognition 97. Pentland. Wilder. Bayesian face recognition using deformable intensity surfaces. researchers knew where to concentrate their efforts to improve performance. Moghaddam. 1991. Thus. Probabilistic visual learning for object detection. of Maryland’s performance between September 1996 and March 1997. Image and Vision Computing Journal. Moon. Discriminant analysis of principal components for face recognition. performance on identification may not be indicative of performance on verification and there is a need for separate evaluation protocols. Pentland. Zhao. Signal Detection Theory and ROC Analysis. In Proceedings Computer Vision and Pattern Recognition 96. H. whereas. pages 137–143. P. 3(1):71–86. 1998. J. 1996. Using discriminant eigenfeatures for image retrieval. Rauss. PAMI. IEEE Trans. one see the improvement for both algorithms. and A. By comparing MIT’s performance between March 1995 and September 1996. Efron and R. pages 638–645. The databases used to develop and test algorithms are usually smaller than the databases that will be encountered in applications.esis cannot be rejected. An Introduction to the Bootstrap. For verification we have shown that the accuracy of performance estimates increase as the size of the probe set increases. IEEE Trans. [7] P. and the U. Another lesson is that when comparing algorithms. [5] B. 1998. . Rauss. Discriminant analysis for recognition of human face images. [11] J. J. Wechsler. Conclusion We have devised a verification test protocol for the Sept96 FERET test. Pentland. editor. Huang. In ICASSP ’96. von der Malsburg. Eigenfaces for recognition. pages 2148–2151. [13] W. R. J. and A.