
Method for Generating a Three-Dimensional Representation From a Single Two-Dimensional Image and its Practical Uses

John D. Ives, Inc.
211 Warren Street, Newark, NJ 07013
Tel 973-623-7900

Abstract
Over the last several years, facial recognition researchers and commercial enterprises have been developing new techniques both to reduce the error rate that has hampered widespread adoption of facial recognition systems and to identify individuals in a broader range of scenarios than ever before. Spurred by advances in computer vision techniques, computational power, and the overarching need to deploy robust systems, facial recognition integrators and end-users have begun to embrace 3D data and its derivatives [3]. CyberExtruder's approach (patented and patents pending [1]) provides a fast, automated, and scalable means to solve the most difficult facial recognition problems. By utilizing multiple views created from single legacy images, generating front-facing images from off-angle images, and leveraging detailed parametric information regarding an image's content, composition and quality [6], stakeholders have the means to craft reliable facial recognition systems. Through the conversion of 2D facial images to 3D facial models (3D morphable models), CyberExtruder provides a demonstrable enhancement to the accuracy of facial recognition matching engines. This is accomplished through the use of multiple variations in pose, lighting, and expression. These techniques have been independently tested by two of the leading facial recognition vendors and found to improve the performance of their respective solutions.

CyberExtruder's software reconstruction algorithms use training data to reconstruct the human head from a 2D image. Since only a single 2D image is acquired, there are none of the problems encountered with laser scanners or stereo systems. The use of training data ensures that the depth variables are estimated and fitted to the 2D data accurately, while the complete modeling of the camera and its relation to the head in the image removes the requirement that the head look directly along the camera's optical axis. In addition to these benefits, the algorithms are fast and completely automatic. Since there is no manual user intervention, the algorithms can run unattended on a server, producing accurate 3D heads from single input images in under a second.

There are a few commercially available 3D head reconstruction software packages that produce head meshes based on the projection of a preformed generic mesh of the human head onto a 2D image. While this method can be fast, it forces the depth variable to be consistent with the generic mesh, hence the 'z' (depth) vertex values are always the same regardless of the input image. To provide more realism, some systems utilize more than one preformed head mesh. While this approach potentially improves the resulting depth information, it can never produce accurate depth information. Additionally, the 2D image onto which the preformed mesh is projected must represent the human head looking directly along the camera's optical axis; any head rotation will skew the results and produce unreliable 3D data.

CyberExtruder's algorithms were trained on large 2D and 3D data sets comprising human heads from all walks and stages of life. Representing a true cross-section of human head variance was an enormous task performed from late 2004 to early 2006 in our laboratory in New Jersey. After collecting the data, statistical models capable of representing this variation were created and the variables were separated into subsets with distinct meanings. The subsets represented include lighting, age, facial expression, facial hair, and identity. Since these subsets are separable, we can determine, extract, or change these variables as each deployment might require.

For example, this technique can be used to remove a person's broad smile in an image to ensure that the picture complies with an agency's directive that image evidence should depict neutral expressions. This type of quality assurance step helps ensure that the facial recognition system compares like with like (two neutral expressions) and thus achieves the highest confidence matching score. Another relevant example would be to determine the expression observed on a subject's face and transform the result into the corresponding FACS [13] (Facial Action Coding System) score. A passive and automated evaluation of each passenger's expression as they wait in line gives overburdened security officers a new tool to help filter through the masses waiting to board an airplane. Another practical application of this technique would be to fit an automobile with a sensor that can alert drivers when they appear to be on the cusp of falling asleep at the wheel.

Certain facial recognition system operators will benefit from the ability to leverage legacy databases containing images like those specified by the US Immigration and Naturalization Service [12]. In this example, the 2D images must contain a three-quarter view of the applicant, a requirement that renders these images unusable by facial recognition software. Similarly, the same tool can act as a quality control filter, ensuring that images being added to a database all fall within defined specifications, i.e. all neutral faces with consistent lighting, looking along the camera's optical axis.

The Problem
The FRVT 2002 results published by the National Institute of Standards and Technology (NIST) described how the top three vendors produced results of 80% Verification with 20% Error at a false acceptance rate (FAR) of 0.1%. In the same report, NIST stated that 3D morphable models [10] significantly improved the performance of all but one of the tested matching engines. In 2005, the Face Recognition Grand Challenge (FRGC) [11] was established to improve upon FRVT 2002. FRGC started with the performance baseline established in FRVT 2002 (FAR of 0.1%; 80% Verification; 20% Error) and selected a performance goal of 98% Verification and 2% Error at a FAR of 0.1%. The designers of the challenge felt that this order-of-magnitude reduction in error was necessary [5] if facial recognition was going to see widespread adoption. To achieve this goal, NIST designed experiments that would leverage new ideas and computer vision techniques. Cited techniques included recognition from three-dimensional (3D) scans, recognition from high resolution still images, recognition from multiple still images, multi-modal face recognition, multi-algorithm approaches, and preprocessing algorithms to correct for illumination and pose variations [11].

FRGC experiment 1 measured performance on the classic face recognition problem: recognition from frontal facial images taken under controlled illumination. The maximum score for this experiment was 99% with a median score of 91%. FRGC experiment 2 was designed to examine the effect of multiple still images on performance. In this experiment, each biometric sample consists of the four controlled images of a person taken in a subject session. The maximum score for this experiment was 99% with a median score of 99%. FRGC experiment 3 looked at performance when both the enrolled and query images are 3D. The maximum score for experiment 3 was 97%, which shows the potential of 3D facial imagery in facial recognition systems. The conclusion again was that 3D potentially resolves facial recognition accuracy issues, but the challenge did not address the means for generating the 3D models.
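To make the "Verification at a FAR of 0.1%" figures concrete, the following minimal sketch shows one common way such a number can be computed from match scores. It is not the FRVT or FRGC evaluation code; the function name and the score lists are invented for illustration, and higher scores are assumed to mean a better match.

```python
import math

def verification_rate_at_far(genuine, impostor, target_far):
    """Return (threshold, verification rate) with the impostor accept rate held at or below target_far."""
    ranked = sorted(impostor, reverse=True)
    # Number of impostor comparisons we may wrongly accept (small epsilon guards float rounding).
    allowed = math.floor(target_far * len(ranked) + 1e-9)
    # Accept only scores strictly above the first impostor score that must be rejected.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for score in genuine if score > threshold)
    return threshold, accepted / len(genuine)

# Synthetic scores, invented purely for illustration.
impostor_scores = [i / 1000 for i in range(1000)]
genuine_scores = [0.9995] * 8 + [0.5, 0.9]
threshold, rate = verification_rate_at_far(genuine_scores, impostor_scores, 0.001)
print(f"threshold={threshold}, verification={rate:.0%}, error={1 - rate:.0%}")
```

With these made-up scores, one impostor comparison in a thousand is accepted (a FAR of 0.1%), and eight of the ten genuine comparisons clear the threshold, which mirrors the shape of the 80% Verification and 20% Error baseline quoted above.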

The AutoMesh Solution

CyberExtruder's AutoMesh combines statistical, projective and lighting algorithms based on training databases to produce an automatic 2D to 3D head creator. The algorithms at the core of AutoMesh have been trained on a database of 2D images containing a cross-section of human heads with a wide range of rotations and lighting conditions. These algorithms allow AutoMesh to accurately locate all anatomically relevant parts of the human head in a 2D image. AutoMesh also contains algorithms trained on carefully constructed 3D laser scans that encompass a cross-section of the human population with all relevant facial expressions. These algorithms are combined with perspective projection and real world lighting models to enable AutoMesh to provide an optimum 3D parametric representation of the person in the 2D input image. Owing to the depth and breadth of the 2D and 3D training databases, AutoMesh is capable of representing the variation in the shape and albedo (the extent to which an object reflects light) of the human head with relatively few parameters while maintaining great accuracy.

These parametric representations can be separated into meaningful subsets of parameters (lighting, age, facial hair, expression and identity), each with its own use. In this way, the application of AutoMesh to a 2D image results in a rapidly produced, highly accurate reconstruction of that person's head in 3D plus a set of parameters that represent the person's identity, expression, facial hair and age. Parametric output relative to the environmental lighting and virtual camera is also measured and classified. The resulting 3D model can be manipulated and used in many ways. For example, expressions can be evaluated, added or removed; lighting can be normalized or adjusted; facial hair can be detected, removed or added; and so on. The 3D model can also be used to produce n 2D views or to compose a neutral expression with consistent lighting for 2D facial recognition.

AutoMesh outperforms other methods of creating 3D models because it parametrically represents the whole population of human head shape, hair and albedo variation. Because of this, it does not rely on a generic preformed head mesh; rather, it fits the optimum 3D shape, hair, albedo, lighting and camera parameters to the given image. The 3D model produced by AutoMesh is true 3D, not 2.5D, and the subject in the input image does not have to look along the camera's optical axis. Input expression, hair, pose, lighting and age can be normalized, added or subtracted. AutoMesh can typically process an image in 0.6 seconds on a 2.6 GHz server, runs as a background service and does not require user intervention.
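The idea of separable parameter subsets can be illustrated with a toy example. The sketch below assumes a simple linear model (a mean shape plus weighted basis vectors, in the general spirit of 3D morphable models); this document does not state the actual form of the AutoMesh model, and every name, dimension and value here is hypothetical.

```python
def reconstruct(mean, bases, params):
    """Linear model: mean plus, for every subset, coefficient times basis vector."""
    shape = list(mean)
    for subset, coeffs in params.items():
        for coeff, basis in zip(coeffs, bases[subset]):
            for i, component in enumerate(basis):
                shape[i] += coeff * component
    return shape

def neutralize(params, subset):
    """Return a copy of params with one subset (e.g. expression) reset to zero."""
    out = {name: list(values) for name, values in params.items()}
    out[subset] = [0.0] * len(out[subset])
    return out

# Toy three-number "shape"; a real head model would have thousands of vertices.
mean = [0.0, 0.0, 0.0]
bases = {
    "identity": [[1.0, 0.0, 0.0]],
    "expression": [[0.0, 1.0, 0.0]],
}
fitted = {"identity": [2.0], "expression": [0.5]}  # hypothetical fitted parameters

smiling = reconstruct(mean, bases, fitted)
neutral = reconstruct(mean, bases, neutralize(fitted, "expression"))
print(smiling, neutral)
```

Because the expression coefficients live in their own subset, resetting them changes only the expression component of the result and leaves the identity component untouched, which is the property that makes normalizing an image to a neutral expression possible.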


Multi-View System (MVS) Module

A 2D facial recognition system that has to compare an image of a person's face that is turned away from center with another image that is correctly posed and lit will typically have difficulty matching the two images correctly. The MVS module in AutoMesh gives system integrators the ability to render almost any permutation of an original image. Since the 2D image is transformed into a model in 3D space, a practically unlimited number of possibilities exist to re-pose the subject and re-render an image that is more acceptable to the facial recognition system. Legacy database images [12], for example, can be converted into 3D and then rendered back into 2D at an optimal pose. In addition to adjustments to a subject's pose or expression, information pertaining to the camera (such as focal length or zoom) can be altered; ambient lighting and light color can also be remodeled. MVS supports individual facial recognition engines by letting them match images with similar compositions without having to compensate for these factors themselves.
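Re-posing a 3D model and rendering it back to 2D can be sketched with a rotation followed by a pinhole (perspective) projection. This is a generic illustration, not the MVS implementation: the function names are hypothetical, the vertical axis is taken to be y, and the camera is assumed to sit at the origin looking along +z with the head centered camera_distance away.

```python
import math

def rotate_about_vertical(point, angle_deg):
    """Rotate a 3D point about the vertical (y) axis, i.e. turn the head side to side."""
    x, y, z = point
    a = math.radians(angle_deg)
    return (x * math.cos(a) + z * math.sin(a), y, -x * math.sin(a) + z * math.cos(a))

def project(point, focal_length, camera_distance):
    """Pinhole projection of a point whose model origin lies camera_distance in front of the camera."""
    x, y, z = point
    depth = z + camera_distance
    if depth <= 0:
        raise ValueError("point is behind the camera")
    return (focal_length * x / depth, focal_length * y / depth)

def render_view(vertices, angle_deg, focal_length, camera_distance):
    """Re-pose every vertex, then project it to 2D image coordinates."""
    return [project(rotate_about_vertical(v, angle_deg), focal_length, camera_distance)
            for v in vertices]

# One illustrative vertex rendered frontally and after a 90 degree head turn.
frontal = render_view([(1.0, 0.0, 0.0)], 0.0, 2.0, 4.0)
turned = render_view([(1.0, 0.0, 0.0)], 90.0, 2.0, 4.0)
print(frontal, turned)
```

Changing angle_deg re-poses the subject, while changing focal_length or camera_distance corresponds to the camera adjustments mentioned above; a full renderer would also handle lighting, texture and hidden surfaces.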

ICS Diagnostic Utility Module

In 2003 CyberExtruder was asked by Passports Australia to evaluate several image sets that had produced subpar results in a pilot facial recognition deployment. The question posed was: "What is it about the test images in the test sets that's producing FAR/FRR scores that are opposite to the expected results?" CyberExtruder's ICS diagnostic application is the product of those early investigative processes and is now a state-of-the-art tool intended to assist system designers and integrators in the fine tuning of their facial recognition installations by providing quantitative feedback about image sets. Simply put, the ICS diagnostic application helps to determine which factors present in a set of images are causing a reduction in performance. Since the process that CyberExtruder applies to a 2D image during the conversion to a 3D model evaluates and classifies hundreds of parameters, the parametric information itself can be used to understand and quantify those aspects of an image that could be affecting the performance of a matching system. Examples of parametric fields are:

Subject's rotation through the X axis relative to the virtual camera (head turned side to side)
Subject's rotation through the Y axis relative to the virtual camera (head nodded up or down)
Subject's rotation through the Z axis relative to the virtual camera (head tilted shoulder to shoulder)
The magnitude of a subject's facial expression (from a frown, through a neutral expression, to a broad smile)
Whether the subject's eyes are open or closed
Male-ness vs. female-ness
Ranges of ethnicity (Asian to Caucasian to Black)
Intrinsic lighting and contrast attributes seen in the image

In all, there are currently 415 parameters for which a value can be assigned in a facial image. Once quantified, the parametric information can be subjected to a series of statistical tests aimed at determining whether an attribute seen in the image is likely to be having a negative impact on matching scores.
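As an illustration of how per-image parameters might be screened against matching scores, the sketch below computes a Pearson correlation between each parameter and the match score and flags parameters with a strong negative correlation. The document does not specify which statistical tests ICS applies, so this is only one plausible, simplified choice; the parameter names, values and the -0.5 cutoff are all invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)

def suspect_parameters(parameters, scores, cutoff=-0.5):
    """Names of parameters whose correlation with the match score is at or below cutoff."""
    return sorted(name for name, values in parameters.items()
                  if pearson(values, scores) <= cutoff)

# Hypothetical values for five images; none of this is real ICS output.
match_scores = [0.95, 0.90, 0.80, 0.60, 0.40]
image_parameters = {
    "abs_head_turn_deg": [0, 10, 20, 30, 40],
    "expression_magnitude": [0.2, 0.8, 0.1, 0.9, 0.5],
    "eyes_open": [1, 1, 1, 1, 1],
}
flagged = suspect_parameters(image_parameters, match_scores)
print(flagged)
```

In this made-up set only the head turn is flagged, since scores fall steadily as the turn angle grows. A production tool would need larger samples, significance testing and handling of non-linear effects before drawing conclusions.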

3D Database
The 3D data collection effort began in August of 2004 as part of a software development project that CyberExtruder was undertaking. As of January 2006, the dataset contained 649 subjects and is the largest of its kind in the world [2, 4]. Each subject was scanned with a Cyberware head scanner while exhibiting variations in expression and pose (neutral expression, angry, sad, smile (no teeth showing), broad smile (showing teeth), looking up (head tilted), looking down (head tilted)) for a total of 5,192 3D scans. Each subject was also photographed in the same 8 pose and expression modes. The photographs were taken with three Canon EOS 20D digital cameras linked by PocketWizard transceivers. The cameras were positioned so that a full frontal image, a profile (90° relative to the camera's optical axis) and a three-quarter view (45° relative to the camera's optical axis) were simultaneously captured under controlled indoor lighting conditions. This resulted in 24 high resolution (8 megapixels each) pictures per subject for a total dataset of 15,576 images. Of the 649 subjects, 141 were also available to sit for a five-minute digital video session recorded on a JVC JY-HD10U HDV camcorder. The subjects were prompted to recite combinations of numbers and facial expressions.

In July 2006, an additional 90 subjects were photographed while being coached by Dr. Paul Ekman. These subjects were photographed while exhibiting specific facial expressions such as anger, sadness, fear and surprise. The same camera setup was used as in the previous data collection effort, resulting in over 2,100 images. Of those images, 1,313 clearly depict specific FACS [13] evaluated facial expressions. Due to the nature of the expressions, it was not possible to collect laser scanned data.

In concert with the collection of the imagery data, in-depth demographic data was also captured. While not all subjects provided all the requested details, the vast majority supplied name and current address; gender, height and weight; date, time and place of birth; race and both parents' ethnic heritages; as well as personal anecdotal information.
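The dataset totals quoted above follow directly from the number of subjects, pose and expression modes, and cameras. The short check below reproduces that arithmetic; the variable names are ours.

```python
subjects = 649   # subjects in the dataset as of January 2006
modes = 8        # pose and expression modes per subject
cameras = 3      # frontal, three-quarter and profile cameras

scans = subjects * modes                  # one 3D scan per subject per mode
photos_per_subject = modes * cameras      # every mode captured by all three cameras
photos = subjects * photos_per_subject    # total 2D photographs

print(scans, photos_per_subject, photos)
```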

1. J. Ives and T. Parr, Apparatus and method for generating a three-dimensional representation from a two-dimensional image, WIPO WO02/95677, March 2000
2. R. Gross, Face Databases, Handbook of Face Recognition, Stan Z. Li and Anil K. Jain, ed., Springer-Verlag, February 2005
3. K. Chang, K. Bowyer, P. Flynn, Face Recognition Using 2D and 3D Facial Data, Workshop on Multimodal User Authentication (MMUA), December 2003, Santa Barbara, California
4. R. Campbell and P. Flynn, A WWW-Accessible 3D Image and Model Database for Computer Vision Research, Empirical Evaluation Methods in Computer Vision, K.W. Bowyer and P.J. Phillips (eds.), IEEE Computer Society Press, pp. 148-154, 1998
5. P. J. Phillips, Presentation at the Third FRGC Workshop, National Institute of Standards and Technology, February 16, 2005
6. P. Belhumeur, Ongoing Challenges in Face Recognition, Frontiers of Engineering: Reports on Leading-Edge Engineering from the 2005 Symposium
7. P.J. Phillips, P.J. Grother, R.J. Micheals, D.M. Blackburn, E. Tabassi, and J.M. Bone, Face recognition vendor test 2002: Evaluation report, Tech. Rep. NISTIR 6965, National Institute of Standards and Technology, 2003
8. V. Blanz and T. Vetter, Face Recognition based on fitting a 3D Morphable Model, IEEE Trans. on Pattern Analysis and Machine Intelligence 25 (9), pp. 1063-1074, 2003
9. V. Blanz, P. Grother, P. J. Phillips, T. Vetter, Face Recognition based on Frontal Views generated from Non-Frontal Images, IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2005
10. V. Blanz, T. Vetter, Generating Frontal Views from Single, Non-Frontal Images, In: Face Recognition Vendor Test 2002, Technical Appendices, National Institute of Standards and Technology (NIST), NISTIR 6965, Appendix O, 2003
11. P.J. Phillips, P.J. Flynn, T. Scruggs, K. W. Bowyer, W. Worek, Preliminary Face Recognition Grand Challenge Results, National Institute of Standards and Technology, 7th International Conference on Automatic Face and Gesture Recognition, April 2006
12. Immigration and Naturalization Service, Form M-378 (6-92)
13. P. Ekman, W.V. Friesen, The Facial Action Coding System: A Technique for the Measurement of Facial Movement, San Francisco: Consulting Psychologists Press, 1978
14. S. Li, A. Jain, ed., Handbook of Face Recognition, Springer-Verlag, February 2005
15. P. Ekman, Emotions Revealed, Times Books, NY, 2003
16. M. Bartlett, G. Littlewort, I. Fasel, J. Susskind, J. Movellan, Dynamics of Facial Expression Extracted Automatically from Video, Computer Vision and Image Understanding, Special Issue on Face Processing in Video, 2005
17. Y. Tian, T. Kanade, J. Cohn, Recognizing Action Units for Facial Expression Analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 2, Feb. 2001


For more information or a demonstration of the software and techniques described within this document, please contact Jack Ives directly via email at