Asim Shankar asim@cse.iitk.ac.in Priyendra Singh Deshwal priyesd@iitk.ac.in April 2002 Under the supervision of Dr. Amitabha Mukherjee, amit@cse.iitk.ac.in
Report submitted in partial fulfillment of requirements of the course CS397 Special topics in Computer Science to the Computer Science and Engineering Department, Indian Institute of Technology, Kanpur
ABSTRACT
Face detection and recognition in images is one of the long-standing problems studied by the computer-vision community. Such a system has numerous applications, including automated security systems, census and intelligence systems. In this report, we present our experience with two of the most successful techniques available today ([rowley98], [cvpr97face]) and discuss extensions of this work to other interesting applications.
1 TABLE OF CONTENTS
2 Introduction
3 Generic Approach
    3.1 The sliding window
    3.2 Image pre-processing
    3.3 Bootstrapping
    3.4 Training Set description
4 The Neural Network Technique
    4.1 Network Structure
    4.2 Results
    4.3 Other species
    4.4 Different Network Architectures
        4.4.1 Fully connected network
        4.4.2 Two outputs
5 Support Vector Machines
    5.1 Introduction to SVMs
    5.2 SVM learning parameters
    5.3 Results of training
6 Implementation Details
7 Neural Nets and SVMs - A comparison
8 Further Directions
9 References
10 Resources

Table of Figures
Figure 1 - Image pre-processing
Figure 2 - Constructing 20x20 training image from original
Figure 3 - Basic structure of neural network (Taken from [rowley98])
Figure 4 - Results of Neural Network on pictures taken by us
Figure 5 - Results of neural network on "standard" pictures
Figure 6 - Results on a "fully-connected" network
2 Introduction
Classification algorithms have traditionally worked by reducing the object in question to a small set of meaningful features. In many cases, however, this is not feasible. Face detection, for example, involves a concept (the face) that cannot be reduced to a manageable, quantifiable set of features whose basis or eigen-features can be found. Since the relevant features for the concept are not known a priori, the feature vectors are typically large (such as the grey values of each pixel in the image). Under such circumstances, the approach taken is to learn the solution from a large set of examples. We look into a neural-network-based technique (Rowley et al., [rowley98]) and a support-vector-machine-based technique (Osuna et al., [cvpr97face]), both of which take in the large feature vector and attempt to classify it.
3 Generic Approach
The problem in question: given an arbitrary image, detect and mark the faces in it.
These steps are applied to each 20x20 window and not the image as a whole.
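The per-window processing can be sketched as follows. This is a minimal illustration and not the authors' IPL-based code: it assumes an 8-bit grey-level image held in a NumPy array, uses histogram equalization (the pre-processing operation named in the implementation section) as the only step, and the function names and the step size are hypothetical.

```python
import numpy as np

WINDOW = 20  # window side in pixels, as used in the report


def equalize(window):
    """Histogram-equalize one grey-level window with values in 0..255."""
    hist = np.bincount(window.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # cumulative count at the darkest grey level present
    denom = window.size - cdf_min
    if denom == 0:                     # flat window: nothing to equalize
        return np.zeros_like(window, dtype=np.uint8)
    lut = np.round((cdf - cdf_min) / denom * 255).clip(0, 255).astype(np.uint8)
    return lut[window]                 # map every pixel through the lookup table


def sliding_windows(image, step=2):
    """Yield (row, col, pre-processed 20x20 window) for every window position."""
    rows, cols = image.shape
    for r in range(0, rows - WINDOW + 1, step):
        for c in range(0, cols - WINDOW + 1, step):
            yield r, c, equalize(image[r:r + WINDOW, c:c + WINDOW])
```

Each yielded window can be flattened to a 400-element vector and passed to the classifier. Because equalization is done per window, a face in a dark corner of the image is normalized independently of the rest of the picture.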
3.3 Bootstrapping
Generating a training set for the SVM or neural network is challenging because it is difficult to place characteristic non-face images in the training set. Getting a representative sample of face images is not much of a problem; choosing the right combination of non-face images from the immensely large set of such images, however, is a complicated task. For this purpose, after each training session, non-faces incorrectly detected as faces are added to the training set for the next session. This bootstrap method avoids the need for a huge set of non-face images in the training set, many of which may not influence the training.
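The bootstrap loop described above can be sketched as below. This is only an illustration under stated assumptions: `train` is a hypothetical callable that takes the face and non-face samples and returns a classifier `clf(sample) -> bool`, `scenery` is a hypothetical pool of windows known to contain no faces, and the stopping rule (`max_rounds`, or no false detections left) is our own choice.

```python
def bootstrap(train, faces, non_faces, scenery, max_rounds=5):
    """Retrain repeatedly, each time adding the false detections as negatives."""
    negatives = list(non_faces)
    pool = list(scenery)               # face-free windows not yet used for training
    clf = train(faces, negatives)
    for _ in range(max_rounds):
        # Any window from the face-free pool that is classified as a face is an error.
        false_positives = [w for w in pool if clf(w)]
        if not false_positives:
            break
        pool = [w for w in pool if not clf(w)]
        negatives.extend(false_positives)
        clf = train(faces, negatives)  # next training session
    return clf, negatives
```

Only the windows the current classifier gets wrong are added, so the negative set stays small while concentrating on the examples that actually shape the decision boundary.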
3.4 Training Set description
Initially, for negative samples, random images were created and added to the training set. The training set was subsequently enlarged by bootstrapping with scenery images and falsely detected windows. To make the system somewhat invariant to changes such as rotation of the face, random transformations (rotation by 15 degrees, mirroring) were applied to images in the training set. The final training set (including bootstrapping) had 8982 input vectors.
4 The Neural Network Technique
4.1 Network Structure
The neural network is a two-layer (one hidden, one output) feed-forward network. There are 400 input neurons, 26 hidden neurons and 1 output neuron. The hidden neurons are not connected to all the input neurons. Their connections are as follows:
- The input image is divided into a 2x2 grid. 4 of the hidden neurons take input from only one cell of this grid each.
- The input image is divided into a 4x4 grid. 16 of the hidden neurons take input from only one cell of this grid each. This division into grids should help in detecting local features (eyes, nose) important for face detection.
- The input image is divided into 6 horizontal stripes (each of height 5 pixels, so there is some overlap between stripes). The remaining 6 hidden neurons take input from one stripe each. This should aid in the detection of features such as a pair of eyes or the mouth.
The idea is that the hidden neurons taking square (grid) inputs would detect individual features while the horizontal stripes would detect pairs of eyes and the mouth.
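The restricted connectivity can be expressed as a 0/1 mask over the input-to-hidden weight matrix. The sketch below is an illustration, not the authors' library code: the stripe spacing of 3 rows is an assumption (it is one spacing that yields 6 overlapping stripes of height 5 on a 20-row window), and the `tanh` activation and all function names are hypothetical.

```python
import numpy as np

SIDE = 20  # the input window is 20x20, i.e. 400 input neurons


def receptive_fields():
    """Return a (26, 400) 0/1 mask; row h marks the inputs feeding hidden neuron h."""
    fields = []
    # 4 neurons on a 2x2 grid of 10x10 cells, then 16 neurons on a 4x4 grid of 5x5 cells.
    for cell in (10, 5):
        for r in range(0, SIDE, cell):
            for c in range(0, SIDE, cell):
                m = np.zeros((SIDE, SIDE), dtype=np.uint8)
                m[r:r + cell, c:c + cell] = 1
                fields.append(m.ravel())
    # 6 horizontal stripes of height 5; the step of 3 rows is an assumption.
    for r in range(0, SIDE - 5 + 1, 3):
        m = np.zeros((SIDE, SIDE), dtype=np.uint8)
        m[r:r + 5, :] = 1
        fields.append(m.ravel())
    return np.array(fields)


def forward(x, w_hidden, b_hidden, w_out, b_out, mask):
    """One forward pass; masked-out weights never contribute to a hidden neuron."""
    hidden = np.tanh((w_hidden * mask) @ x + b_hidden)
    return float(np.tanh(w_out @ hidden + b_out))
```

The mask has 4 x 100 + 16 x 25 + 6 x 100 = 1400 non-zero entries; adding the 26 hidden-to-output connections gives 1426 edges in total.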
4.2 Results
Here we present some of the results obtained (green rectangles mark detected faces). There are some false detections, which should be reduced by adding them to the training set (more bootstrapping). The same face is also often detected several times. The remedy is to draw a single bounding rectangle around the multiply detected regions. We implemented a primitive collapsing technique, which still needs to be refined.
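One simple way to collapse multiple detections is to merge overlapping rectangles into their common bounding rectangle until no two rectangles overlap. The sketch below illustrates that idea only; it is not the collapsing technique implemented by the authors, and the rectangle layout `(x0, y0, x1, y1)` and function names are assumptions.

```python
def overlaps(a, b):
    """True if rectangles a and b, given as (x0, y0, x1, y1), share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def collapse(rects):
    """Merge overlapping rectangles into bounding rectangles until none overlap."""
    merged = [tuple(r) for r in rects]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlaps(merged[i], merged[j]):
                    a, b = merged[i], merged[j]
                    merged[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del merged[j]
                    changed = True
                    break
            if changed:
                break                  # restart the scan after every merge
    return merged
```

For example, `collapse([(0, 0, 20, 20), (2, 2, 22, 22), (100, 100, 120, 120)])` leaves two rectangles: the bounding box of the first two and the third one unchanged. A drawback of this simple scheme is that two different faces standing close together may be merged into one rectangle.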
4.4.1 Fully connected network
After reading about the network described above, an obvious question was how restricting the connections between the input and hidden neurons affects the network. Rowley's structure has 1426 edges, whereas connecting all 400 inputs to all 26 hidden neurons and all 26 hidden neurons to the output neuron gives 10426 edges. To answer this question, we trained a fully connected network on the same training set. The results were quite similar, but the fully connected network took much longer to process an image, since it has 9000 more edges (about 7.3 times as many). Since the slower performance did not translate into more accurate detection, we concluded that Rowley's construction was quite appropriate.
4.4.2 Two outputs
The networks above, with only one output, gave a few false detections and on rare occasions missed a face. A common strategy in many neural-network-based classifiers is a two-output system. Some believe that neural networks work better with sparse input/output schemes. We therefore tried a two-output system, where the first output measures how likely the given image is to be a face and the second output measures how likely it is not to be a face. Again, this structure seemed to be no better than the original, more compact network with one output.
5 Support Vector Machines
5.1 Introduction to SVMs
Figure 7 - QP eqn. whose solutions are the support vectors (from [cvpr97face])
It turns out that only a small number of coefficients are different from zero. Since every coefficient corresponds to a particular data point, the solution is determined by the data points associated with the non-zero coefficients. These are the support vectors, the only points relevant to the solution of the problem; all other data points can be deleted from the data set without affecting the solution. Intuitively, support vectors are the data points lying on the border between the two classes.
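For reference, the coefficients discussed above are the solution of a quadratic program that, in the usual textbook notation, can be written as below. The symbols are assumptions and may differ from the form shown in [cvpr97face]: $x_i$ are the training vectors, $y_i \in \{-1, +1\}$ their labels, $\alpha_i$ the coefficients and $C$ the penalty for misclassified points.

```latex
\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \, y_i y_j \, x_i^{\top} x_j \\
\text{subject to}\quad & \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \dots, n \\[4pt]
f(x) &= \operatorname{sign}\Big( \sum_{i \,:\, \alpha_i > 0} \alpha_i y_i \, x_i^{\top} x + b \Big)
\end{aligned}
```

The decision function $f$ sums only over points with $\alpha_i > 0$, which is why the remaining data points can be discarded once training is done.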
Figure 8 - Separating hyperplanes (a) small margin (b) larger margin, better classifier [taken from [cvpr97face]]
In the real world, we are unlikely to find problems that can actually be solved by a linear classifier. To extend the technique to non-linear decision surfaces, we project the original vector into a higher-dimensional feature space. The problem then becomes the choice of features that map the original vector into this space. For this we use kernel functions K(x,y), which compute the inner product of two vectors in the feature space without constructing the mapped vectors explicitly. See [vapnik95svnets] for more details.
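The role of the kernel can be illustrated with a small sketch. The kernels and their parameter values below are common textbook choices, not necessarily the ones used in this work, and all function names are hypothetical. The explicit map `phi` is included only to show, for 2-dimensional inputs, that the degree-2 polynomial kernel equals an inner product in a 6-dimensional feature space.

```python
import numpy as np


def linear_kernel(x, y):
    return float(np.dot(x, y))


def poly_kernel(x, y, degree=2):
    return float((np.dot(x, y) + 1.0) ** degree)


def rbf_kernel(x, y, gamma=0.5):
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(d, d)))


def phi(x):
    """Explicit feature map matching poly_kernel with degree=2 for 2-D inputs."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2])


def decision(x, support_vectors, labels, alphas, bias, kernel):
    """SVM decision value: positive means 'face', negative means 'non-face'."""
    return sum(a * y * kernel(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + bias
```

For a 400-dimensional input window the explicit feature space of even a low-degree polynomial kernel is very large, whereas evaluating K(x,y) costs little more than one dot product.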
6 Implementation Details
In the course of studying the face detection techniques described above, we did a substantial amount of implementation work. We tried to write reusable and pluggable code so that future work can easily build upon our engines.
Intel's Image Processing Library (IPL) was used for image processing and manipulation (histogram equalization, window extraction, scaling etc.). Input vectors were then created from the scaled, processed windows. The application also assists in creating the training set: it allows features (eyes, nose, mouth) to be labeled, transforms the face to a 20x20 window based on the selected features, rotates the image randomly, pre-processes it and then writes it to a training set file.
A neural network library (see Resources) was created for the corresponding technique. The network was trained on a compute server (as the training set was large) and the trained network was then plugged into the GUI for testing. The SVM engine used was SVM-light (see Resources).
All training engines are compatible with both Linux and Windows. The GUI is currently written for Windows systems. The code is free for use, in the hope that it will save a significant amount of time for anyone trying to build on this work. Please feel free to contact the authors for these applications.
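7 Neural Nets and SVMs - A comparison
[Table: outputs of the two classifiers on 11 test windows; the window images are not reproduced]
Neural network    SVM
Yes (0.97)        Yes (1.02)
Yes (0.82)        Yes (0.89)
Yes (0.59)        No!! (-0.12)
Yes (0.86)        Yes (0.39)
Yes (0.99)        Yes (2.01)
Yes (0.87)        Yes (1.11)
No (0.020)        No (-2.9)
No (0.001)        No (-3.9)
No (0.00001)      No (-5.6)
No (0.00002)      No (-4.5)
No (0.040)        No (-2.3)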
8 Further Directions
The face detection problem has many applications in security systems, automated census, intelligence systems etc. Of particular interest to us, however, is the field of video summarization. The idea is that, given a video sequence, we first identify the faces in the frames and then use the identified faces for motion tracking and face recognition. With this, we may be able to comment textually on the movement of persons across a scene.
While the face detection technique described above can be applied to video, a major hindrance is its speed, or rather lack of it, on large images. Running it over a large set of frames in a video would make the system prohibitively slow. However, we can use properties of video to ease this problem. For example, using background subtraction techniques we can reduce the number of regions in a frame where a face is likely to be found; instead of looking at all windows in each frame, we look only at the regions of interest. Testing the feasibility and performance of such a system would be the next logical step.
Furthermore, the detection scheme described in this report deals only with full-frontal facial images; profile views and occluded faces are not handled. Profile views can be detected using the same technique, possibly using the eye, nose and ear positions to standardize the training set and then applying the training schemes described above.
We surveyed such techniques as the first step in video summarization. The next steps would be to:
- Label every scene with the characters present in it, and then
- Label every scene with the actions of each actor (bend, walk, move hand etc.)
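The background-subtraction idea mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not a tested part of this work: the background is a single static grey-level frame, the difference threshold of 25 grey levels is arbitrary, and the function name is hypothetical.

```python
import numpy as np


def moving_region(frame, background, threshold=25):
    """Bounding box (r0, c0, r1, c1) of pixels that differ from the background, or None."""
    # Cast to a signed type first so that the subtraction of uint8 values cannot wrap around.
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    mask = diff > threshold
    if not mask.any():
        return None                    # nothing moved: no windows need to be scanned
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(cols[0]), int(rows[-1]) + 1, int(cols[-1]) + 1
```

The face detector would then be run only on windows inside the returned box, for example on `frame[r0:r1, c0:c1]`, rather than on every window of every frame.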
9 References
[rowley98] Neural network based Face Detection. Henry Rowley, Shumeet Baluja, Takeo Kanade. CMU. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 20, number 1, pages 23-38, January 1998. (http://www2.cs.cmu.edu/afs/cs.cmu.edu/user/har/Web/faces.html)
[cvpr97face] Training support vector machines: An application to Face Detection. Edgar Osuna, Robert Freund, Federico Girosi. MIT. 1997. (http://citeseer.nj.nec.com/osuna97training.html)
[sung94examplebased] Example-based learning for Human Face Detection. Kah-Kay Sung, Tomaso Poggio. MIT. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 20, number 1, pages 39-51, January 1998. (http://citeseer.nj.nec.com/sung94examplebased.html)
[rowley97] Rotation Invariant Neural Network-Based Face Detection. H. Rowley, S. Baluja, and T. Kanade. Technical report CMU-CS-97-201, Computer Science Department, Carnegie Mellon University, December 1997.
[vapnik95svnets] Support vector networks. C. Cortes and V. Vapnik. Machine Learning, 20:1-25, 1995.
T. Joachims, Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (ed.), MIT-Press, 1999.
10 Resources
CMU image test set for face detection: http://vasc.ri.cmu.edu/IUS/eyes_usr17/har/har1/usr0/har/faces/test/
BIOID Face database: http://www.bioid.com/technology/facedatabase.html
Annie, Artificial Neural Network library for C++: http://home.iitk.ac.in/student/asim/annie/
SVM-light, Support Vector Machine training and classification software: http://svmlight.joachims.org/
Intel Performance Libraries, Image Processing Library: http://www.intel.com/software/products/perflib/ipl