
Support Vector Machines for Human Face Detection

Ignas Kukenys, Brendan McCane

This paper describes an attempt to build a component-based face detector using support vector machine classifiers. We present current results and outline plans for the future work required to achieve sufficient speed and accuracy to use SVM classifiers in an online face recognition system. We take a straightforward approach in implementing our own SVM classifier with a Gaussian kernel that detects eyes in grayscale images, a first step towards a component-based face detector. Details on the design of an iterative bootstrapping process are provided, and we show which training parameter values tend to give the best results. Conclusions drawn from our work to date are consistent with previous research, and the problems encountered are to be expected by anyone building an object detection system: SVM classifiers with large numbers of support vectors are slow, and accuracy depends largely on the quality and variety of the training data.

Categories and Subject Descriptors

I.5 [Computing Methodologies]: Pattern Recognition

Keywords

SVM, support vector machines, face detection

1. INTRODUCTION

In the last fifteen years support vector machines (SVMs) have become a widely studied and applied classification technique, successfully used for object detection and human face detection in particular [3, 4, 6]. A number of approaches following the original work of Vapnik et al. [1] were suggested to improve training speed, classification speed and accuracy, including an improved sequential minimal optimization (SMO) algorithm by Platt [5], approximation by reduced set vectors by Burges [2], reduced set vector cascades by Romdhani [6] and a component-based approach by Heisele [4]. The objective of our research is to combine existing approaches in an attempt to build a face recognition system capable of fast and accurate classification of a high number of faces.

This paper presents results of the early stages of the planned work and describes some problems to be addressed in the later stages. The planned approach for face detection as a first step for face recognition is similar to that described by Heisele [3, 4]. We consider faces as a combination of handpicked components (eyes, lips) and train separate general SVM classifiers for the different components. A simple linear SVM classifier can then be used to determine if combinations of positions and sizes of the found components represent a face. If a face is found in the image, the known positions of the components would be used to run SVM classifiers trained on known faces in an attempt to recognize the face. Our current implementation trains a classifier that detects eyes as one of the face components. We consider only gray pixel values of images, but use a larger sample (a 20x20 pixel patch for one eye, compared to the 9x7 pixels used by Heisele [3]). We chose the often used Gaussian kernel for non-linear SVMs and built an iterative training process where false positives of the previous iteration are included as negative samples in the training set for the next iteration. The immediately obvious problem with SVMs is the computational complexity of classification when the number of support vectors is large, and the next planned step is to adapt the reduced set approach to increase the speed. The next section gives an overview of support vector machine theory. In section 3 we describe the currently implemented system in more detail. Section 4 presents results of the current implementation, while section 5 outlines planned future work and points of interest for research.

2. SUPPORT VECTOR MACHINES

2.1 Linear case


Consider a set of l vectors {x_i}, x_i ∈ R^n, 1 ≤ i ≤ l, representing the input samples, and a set of labels {y_i}, y_i ∈ {±1}, that divide the input samples into two classes, positive and negative. If the two classes are linearly separable, there exists a separating hyperplane (w, b) defining the function

f(x) = ⟨w · x⟩ + b,  (1)


NZCSRSC 08 Christchurch New Zealand

and sgn(f(x)) shows on which side of the hyperplane x rests, in other words, the class of x. The vector w of the separating hyperplane can be expressed as a linear combination of the x_i (often called a dual representation of w) with weights α_i:

w = Σ_i α_i y_i x_i.  (2)


The dual representation of the decision function f(x) is then:

f(x) = Σ_i α_i y_i ⟨x_i · x⟩ + b.  (3)


Training a linear SVM means finding the embedding strengths {α_i} and the offset b such that the hyperplane (w, b) separates the positive samples from the negative ones with a maximal margin. Notice that not all input vectors {x_i} might be used in the dual representation of w; those vectors x_i that have weight α_i > 0 and form w are called support vectors.
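As a concrete illustration, the dual decision function (3) can be evaluated in a few lines. The support vectors, weights and offset below are invented for illustration; a trained SVM would supply real values.

```python
import numpy as np

def linear_decision(x, support_vectors, alphas, labels, b):
    """Evaluate f(x) = sum_i alpha_i * y_i * <x_i, x> + b (eq. 3)."""
    return sum(a * y * np.dot(xi, x)
               for a, y, xi in zip(alphas, labels, support_vectors)) + b

# Two made-up support vectors straddling the hyperplane x1 = 0.
sv = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
alphas = [1.0, 1.0]
labels = [+1, -1]
b = 0.0

print(np.sign(linear_decision(np.array([2.0, 1.0]), sv, alphas, labels, b)))  # positive side
```

The sign of the returned value gives the predicted class, exactly as sgn(f(x)) does in the text.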

Commonly used kernels include polynomial kernels K(x, y) = ⟨x · y⟩^d and the Gaussian kernel K(x, y) = exp(−||x − y||² / (2σ²)). In our implementation we use the Gaussian kernel; however, one of the interesting points for further research is how to choose an optimal kernel for the given input data.
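A minimal sketch of the two kernels mentioned above; the default σ matches the range explored later in the paper for raw gray values, and the degree d is a free parameter.

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    """Polynomial kernel <x . y>^d."""
    return np.dot(x, y) ** d

def gaussian_kernel(x, y, sigma=1200.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

x = np.full(400, 128.0)  # a flat gray 20x20 patch as a 400-vector
print(gaussian_kernel(x, x))  # identical inputs -> kernel value 1.0
```

Note that the Gaussian kernel of a vector with itself is always 1, which makes its outputs easy to interpret as similarities.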

3. TRAINING

3.1 Input data

We chose eyes as our first component, since they are a very distinctive feature of the human face. Eyes were manually marked in 1066 training photos of human faces and extracted into 2132 positive samples, 20x20 pixels each, represented by 400-dimensional vectors containing the gray values of the unaltered image. At the current stage we did not add any artificially generated positive samples, as several approaches suggest, but in future work we plan to include samples with varying brightness, contrast and rotation. Negative samples for the first iteration of training are generated randomly at varying scales from a set of images without faces and from images of faces excluding the areas that contain positive samples (eyes). We chose the number of negative samples to be 3 to 10 times the number of positive samples.
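The conversion from a marked 20x20 patch to a 400-dimensional sample vector can be sketched as follows; the function name and the random test image are illustrative, not part of the paper's implementation.

```python
import numpy as np

PATCH = 20  # patch side length used in the paper

def patch_to_sample(image, top, left):
    """Extract a PATCH x PATCH window and flatten its gray values
    into a 400-dimensional sample vector."""
    patch = image[top:top + PATCH, left:left + PATCH]
    return patch.astype(np.float64).ravel()

# Illustrative use on a random grayscale image.
image = np.random.randint(0, 256, size=(100, 100))
sample = patch_to_sample(image, 10, 30)
print(sample.shape)  # (400,)
```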


2.2 Non-linear case

In real-life problems it is rarely the case that the positive and negative samples are linearly separable. Non-linear support vector classifiers map the input space X into a feature space F via a usually non-linear map Φ : X → F, x ↦ Φ(x), and solve the linear separation problem in the feature space by finding the weights α_i of the dual expression of the separating hyperplane's vector w:

w = Σ_i α_i y_i Φ(x_i),  (4)


while the decision function f(x) takes the form

f(x) = Σ_i α_i y_i ⟨Φ(x_i) · Φ(x)⟩ + b.  (5)

Usually F is a high-dimensional space where the images of the training samples are highly separable, but working directly in such a space would be computationally expensive. However, we can choose a space F which is induced by a kernel K, defined by a kernel function K(x, y) that computes the dot product in F, K(x, y) = ⟨Φ(x) · Φ(y)⟩. The decision function (5) can then be computed by just using the kernel function, and it can also be shown that finding the maximum margin separating hyperplane is equivalent to solving the following optimization problem:

maximize  Σ_i α_i − (1/2) Σ_{1≤i,j≤l} α_i α_j y_i y_j K(x_i, x_j),
subject to  0 ≤ α_i ≤ C, 1 ≤ i ≤ l,  and  Σ_i α_i y_i = 0,  (6)

where the positive C is a parameter showing the trade-off between margin maximization and training error minimization. Thus, knowing the kernel function K, we avoid working directly in the feature space F. After solving (6), the offset b can be chosen so that the margins between the hyperplane and the two classes of sample images are equal. We then have our decision function

sgn(f(x)) = sgn( Σ_i α_i y_i K(x_i, x) + b ).  (7)

3.2 SVM parameters

For our SVM we chose the Gaussian kernel and varied the parameter σ between 800 and 1600 (equivalent to approximately 3.1 and 6.3 respectively for pixel values normalised from [0, 255] to [0, 1], i.e. σ/255). The bound C on the α_i weights was chosen to be 10. Consistent with previous research [6, 4], the SVM classifier performed best with σ values between 1200 and 1600 (close to 5-6.3 for normalised input data). For a comparison of functions with different σ parameters, refer to section 4.

The first-iteration SVM classifier function is trained with the initial data as described above. The obtained function is then run on the training images with and without faces, and the false positives together with the negative support vectors form the negative sample set for the next iteration. The positive sample set remains the same throughout all iterations. Depending on the number of negative samples, the accuracy of the classifier increases significantly during the first 4-6 iterations. For example, a classifier with σ = 1200 has around 300 support vectors after the first iteration and finds 50-500 false positives per image in the training set, while after the fourth iteration it has over 1300 support vectors and finds only 0-10 false positives per image in the training set (see figure 3).
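The iterative bootstrapping scheme above can be sketched as follows. `train_svm` and `find_false_positives` are hypothetical stand-ins for the real training and image-scanning steps; only the structure of the loop comes from the text.

```python
def bootstrap(positives, negatives, train_svm, find_false_positives, iterations=4):
    """Iteratively retrain, feeding each round's false positives and
    negative support vectors back in as the next negative set.
    The positive set stays fixed throughout, as in the paper."""
    classifier = None
    for _ in range(iterations):
        classifier = train_svm(positives, negatives)
        false_positives, negative_svs = find_false_positives(classifier)
        negatives = false_positives + negative_svs  # next round's negatives
    return classifier
```

With stub functions in place of real training and scanning, the loop can be exercised end to end, which is how the sketch below is tested.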


Detection process

To detect objects in an image, a classifier window is usually run over the image at all possible scales and positions. We chose a range of scales at which eyes are likely to occur in images. For a photo to be scanned, a pyramid of scaled images is calculated, with the scale increasing 1.1 times with each step within the likely range. The classifier window is then run over each image in the pyramid, with a step of 2 pixels in both directions. Evaluation of the functions revealed that with a scale step of 1.1 the classifier window misses some of the objects as it skips over them, as can be seen by manually evaluating likely positions. However, the current step size is a good trade-off between accuracy and speed, while future work should improve both.

Figure 1: Numbers of false positives and false negatives with different kernel parameter σ. The function with σ = 1600 performed best and found 88% of all eyes in the evaluation set.
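The scanning schedule described above can be sketched as follows. The 1.1 scale factor, 20x20 window and 2-pixel stride are the paper's; the scale range in the example is an assumption.

```python
SCALE_STEP = 1.1  # pyramid scale factor from the paper
WINDOW = 20       # classifier window side
STRIDE = 2        # step in pixels, both directions

def pyramid_scales(min_scale=1.0, max_scale=3.0):
    """Scales at which the window sweeps the image (range assumed)."""
    scales = []
    s = min_scale
    while s <= max_scale:
        scales.append(s)
        s *= SCALE_STEP
    return scales

def window_positions(width, height):
    """Top-left corners visited by the classifier window."""
    return [(x, y)
            for y in range(0, height - WINDOW + 1, STRIDE)
            for x in range(0, width - WINDOW + 1, STRIDE)]

print(len(pyramid_scales()), len(window_positions(100, 100)))
```

Even on a small 100x100 image this yields over a thousand window positions per pyramid level, which is consistent with the tens of thousands of patches per photo mentioned below.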


Speed of training

The current implementation of the training process is not efficient, taking 3 to 5 days to train 4 iterations of a classifier function. A significant amount of that time is used for testing the trained functions on the input image set to obtain false positives. For example, a function with 1000 support vectors classifies one 20x20 pixel patch in around 7 ms, but with tens of thousands of patches per image and over a thousand images, the testing phase takes more than a day. Planned future work should dramatically reduce the time needed for testing, as a cascaded reduced set vectors approach can result in more than a hundred-fold decrease in the average number of support vectors used to evaluate each image patch [6].

Figure 2: Eye detector in action. After the 3rd iteration of training, the classifier still produces several false positives on an evaluation image (area with high detail around the child's fingers).
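The testing-time figure quoted above can be sanity-checked with a quick back-of-envelope calculation, using the paper's approximate numbers (7 ms per patch, over 50,000 patches per image, around a thousand images):

```python
# Rough estimate of one testing pass over the training image set.
ms_per_patch = 7
patches_per_image = 50_000
images = 1_000

total_days = ms_per_patch * patches_per_image * images / 1000 / 3600 / 24
print(round(total_days, 1))  # roughly 4 days -> "more than a day"
```

The result, around four days per testing pass, is consistent with the 3-5 day training times reported above.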



To evaluate the SVM classifiers we chose an evaluation set of 25 random frontal face photos with a total of 50 eyes. The locations of the eyes were manually marked in these images, and any window classified correctly within a 10 pixel margin was considered a correct hit. Figure 1 shows the total numbers of false positives and false negatives for five functions with varying kernel parameter σ.

We attribute the rates of false negatives to a lack of variation in the positive training data: the currently used images of faces come from one source, feature only around 20 different people, and have little variation in lighting conditions and rotation angles. We plan to expand the training set with both images from other sources and samples artificially created by altering the images. Additionally, some eyes in the evaluation set were missed by the detector because the image pyramid skipped over the scale at which the eyes would otherwise be detected. This problem will also be addressed in future work.

The seemingly high numbers of false positives are of less concern: a classifier evaluates over 50,000 patches per image, reporting 0 to 10 of them falsely as eyes. Some of the false positives still mark an eye but fall outside the 10 pixel margin in the automated testing; others typically present themselves around one or two areas in the image (figure 2). Since overlapping positives would be merged, the penalty for including the false hits while evaluating possible component combinations to form a face should not be large, and we expect the probability of detecting a face that includes a false component to be very small.

Figure 3: Changes in the number of support vectors and false positives over the evaluation set, classifier with σ = 1200.
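The 10 pixel hit criterion used in this evaluation can be sketched as a simple test; the coordinates below are made-up examples, not data from the evaluation set.

```python
MARGIN = 10  # pixels, as in the evaluation protocol

def is_hit(detection, marked_eyes):
    """A detection counts as a correct hit if it lies within MARGIN
    pixels (in both axes) of some manually marked eye location."""
    return any(abs(detection[0] - ex) <= MARGIN and
               abs(detection[1] - ey) <= MARGIN
               for ex, ey in marked_eyes)

eyes = [(120, 80), (160, 82)]          # hypothetical marked locations
print(is_hit((125, 85), eyes))         # within the margin -> hit
print(is_hit((140, 110), eyes))        # too far -> false positive
```

Detections that mark a real eye but fall outside this margin are scored as false positives, which accounts for some of the false-positive counts reported above.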

5. CONCLUSIONS AND FUTURE WORK

Our straightforward implementation of a support vector machine revealed the expected drawbacks. The accuracy of an SVM classifier depends on the quality of the training data; specifically, sufficient variation in the training samples is required to cover the different possible angles, lighting conditions and partial occlusions in the images to be detected. Furthermore, the performance of a classifier with several thousand support vectors is unsatisfactory for use in online systems. We plan to implement a cascaded reduced set vectors approach [6] as a first step towards increasing the speed of the classifier, and to improve the training process by adding artificial variations to the input data to address accuracy.

Further steps towards a face detecting system are choosing other components of the human face, training classifiers to detect those components, and then combining the component detectors into a single face detector. Additional opportunities to increase the accuracy of SVM classifiers may exist in the way the feature space for the training data is defined. While polynomial and Gaussian kernels have proved to give satisfactory results, we are interested in the possibility of finding an optimal kernel for particular training data. At this point of our research it remains unclear whether a similar component approach could be used for a face recognition system with a large database of faces. Specifically, it is not obvious that the same components used for face detection would be sufficient to distinguish between different faces.

REFERENCES

[1] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In COLT '92: Proceedings of the fifth annual workshop on Computational learning theory, pages 144-152, New York, NY, USA, 1992. ACM Press.
[2] C. J. C. Burges. Simplified support vector decision rules. In International Conference on Machine Learning, pages 71-77, 1996.
[3] B. Heisele, T. Poggio, and M. Pontil. Face detection in still gray images, 2000.
[4] B. Heisele, T. Serre, and T. Poggio. A component-based framework for face detection and identification. International Journal of Computer Vision, 74(2):167-181, August 2007.
[5] J. C. Platt. Using analytic QP and sparseness to speed training of support vector machines. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, Cambridge, MA, 1999. MIT Press.
[6] S. Romdhani, P. Torr, and B. Schölkopf. Efficient face detection by a cascaded support-vector machine expansion. Royal Society of London Proceedings Series A, 460:3283-3297, November 2004.