
Human detection in video

Dmitry Golomidov
Advisor: Jianbo Shi

April 18, 2008


This paper analyzes and compares a number of existing approaches to human detection in video. Even though a number of papers have already been written on the topics of skin and face detection, this paper goes beyond those analyses, culminating in an open source package aimed at anyone working with image processing and object recognition. I provide a comprehensive, up-to-date open source package containing implementations of algorithms based on the most recent research on human presence detection. The package includes libraries for skin, face, and eye detection, which together can be used for detecting human presence in video. The system is built so that it can be applied both to real-time data (at a lower detection rate) and to static data (images, video) for offline processing at a higher detection rate.

1 Introduction to human detection project

Human detection in video can be considered one of the most fundamental challenges in the field of image and video processing. It is also one of the most researched areas. What attracts researchers is the complexity of the problem due to large data variation: every person looks different, and a combination of many features makes up a unique model. Solving a more complex problem may give birth to a technique capable of performing the identification task on a different object class with less variance. And yet quite often a technical problem requires automatic human detection. The general trend in this area can be summarized as a "divide and conquer" approach: there are distinct parts of the "human model" that researchers are interested in. These parts include the face and facial features (eyes, mouth, nose); other features include skin, hands, and body structure (a composite model of connected parts: head, torso, limbs). They form the separate components of a human detection pipeline: preprocessing, skin detection, face detection, and human figure detection.
Not only is the task of object detection itself interesting, but so are the techniques that can ease this task. Such techniques include background subtraction, which reduces the area of the image one might be interested in, and color space transformation, which may yield better results for specific tasks like skin detection. These techniques form a separate signal processing category.

1.1 Skin detection

The existing skin detection research can be divided into three directions (Vezhnevets, 2005): explicitly defined skin regions, nonparametric skin distribution modeling, and parametric skin distribution modeling. The first method is often extended to explicitly defining the non-skin region as well. The rules can also be constructed automatically from training data; most of the work on this method was carried out by Gomez and Morales (2002) and Gómez (2002). The self-organizing map approach of Brown et al. (2001) can be considered another example of automatically learning a skin color distribution model and dividing colors into skin and non-skin regions. The second method, nonparametric modeling, uses a Bayesian approach to learn skin probabilities from a training set. The most fundamental work was done by Jones and Rehg (1999), which compared Gaussian mixture modeling and a histogram-based approach, both trained on the large Compaq skin data set containing nearly 1 billion hand-labeled pixels.

Most research uses skin filters based on the fact that skin color depends on a combination of blood (red) and melanin (yellow, brown) (Forsyth and Fleck, 1999), has a restricted range of hues, and is not deeply saturated. In a two-step process, such a filter first picks the pixels that are likely to be skin, and afterwards the regions of pixels around them are examined for consistency and texture structure.
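As an illustration of such an explicit-rule filter, the first step (pixel-wise hue/saturation screening) might look like the following sketch; the thresholds here are invented for demonstration and are not taken from the cited papers:

```python
import colorsys

def is_skin(r, g, b):
    """Toy explicit-rule skin test; thresholds are illustrative only."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    # Skin hues cluster near red/orange and are not deeply saturated.
    return (h < 0.14 or h > 0.95) and 0.2 < s < 0.7 and v > 0.35

print(is_skin(220, 170, 140))  # a typical light skin tone -> True
print(is_skin(40, 90, 200))    # a blue pixel -> False
```

A real filter would then apply the second step described above, checking the surviving pixel regions for consistency and texture.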

Another approach is nonparametric skin color distribution modeling. The key is to create a skin probability map based on a relatively simple "skin/not-skin" color filter, which is applied to every color in the image color space separately. Although this approach is very fast, it requires a lot of space due to the separate analysis for each color, so a clustering technique is used to reduce the space requirements. The clustering technique is essentially the same, with the only difference that the color space is divided into several clusters.
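A minimal sketch of this idea, with toy training data and a coarse 8-bin quantization playing the role of the clustering step:

```python
import numpy as np

# Nonparametric (histogram) skin model: quantize RGB into coarse bins
# and estimate P(skin | color) from labeled pixel counts.
BINS = 8  # 8 bins per channel -> an 8^3 lookup table

def bin_index(px):
    r, g, b = (int(c) * BINS // 256 for c in px)
    return r, g, b

skin_counts = np.zeros((BINS, BINS, BINS))
nonskin_counts = np.zeros((BINS, BINS, BINS))

# Toy training data: a few hand-labeled pixels.
for px in [(210, 160, 130), (200, 150, 120)]:
    skin_counts[bin_index(px)] += 1
for px in [(30, 80, 190), (10, 10, 10)]:
    nonskin_counts[bin_index(px)] += 1

# P(skin | color), with a tiny term to avoid division by zero.
prob_map = skin_counts / (skin_counts + nonskin_counts + 1e-9)

def skin_probability(px):
    return prob_map[bin_index(px)]

print(skin_probability((205, 155, 125)))  # falls in a bin seen as skin
```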

There is also parametric skin color distribution modeling (Zhu et al., 2004). It uses the same cluster approach with some defined shape, so that it detects skin texture within that cluster; the data are afterwards analyzed using different statistical models and Gaussian distributions. Yet another approach is to define the skin color cluster explicitly and label as skin anything that satisfies the set of rules defining that cluster. The most up-to-date comprehensive overview (Kakumanu et al., 2007) has determined that the nonparametric skin color distribution model is the most accurate one, which can be explained by the fact that the other two approaches rely on approximation rules.

1.2 Face detection

Traditionally the face detection task can be implemented with two very distinct approaches: feature-based and image-based. The image-based approach makes use of any information that the actual image can provide, such as color ranges and intensities. The feature-based approach uses information obtained from a particular set of features corresponding to a given image, such as edges, relative pixel positioning, and points of interest. Most existing face detection algorithms work with grayscale images; without the color information, face detection becomes essentially a pattern recognition task.

One major branch of research involves using neural networks for the specific pattern recognition task, usually combined with clustering techniques as described in Sung (1996). Two prototypes, face and non-face, are produced as a result of clustering a set of training images. After that, the difference (or distance) between a given image and the two prototypes is computed via MLP (multi-layer perceptron) classification. By choosing the shortest distance to a prototype, an image is classified as either containing a face or not.
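A much-simplified sketch of the prototype-distance idea, using plain Euclidean distance in place of the MLP-computed distance and toy 4-pixel "patches":

```python
import numpy as np

# Cluster centroids stand in for the face / non-face prototypes.
face_patches = np.array([[0.9, 0.2, 0.9, 0.5], [0.8, 0.3, 0.85, 0.4]])
nonface_patches = np.array([[0.1, 0.1, 0.2, 0.1], [0.2, 0.0, 0.1, 0.3]])

face_proto = face_patches.mean(axis=0)
nonface_proto = nonface_patches.mean(axis=0)

def classify(patch):
    # Assign the label of the nearest prototype.
    d_face = np.linalg.norm(patch - face_proto)
    d_non = np.linalg.norm(patch - nonface_proto)
    return "face" if d_face < d_non else "non-face"

print(classify(np.array([0.85, 0.25, 0.9, 0.45])))  # near face prototype
```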

Another approach uses SVMs (support vector machines). This technique is extremely suitable for solving two-class pattern recognition tasks. It determines the separating hyperplane with maximum distance to the closest points in the dataset, thus establishing a boundary between the two sets. Since for complex patterns like faces the two classes clearly cannot be linearly separated using a couple of simple features, a non-linear transformation is applied to the data points. This transformation maps the data points into a high-dimensional feature space in which the two classes can be separated by a hyperplane. Osuna et al. (1997) use this idea to detect frontal faces with a polynomial SVM classifier, as does the approach by Heisele et al. (2000).
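The effect of the non-linear mapping can be shown in miniature: points at -1 and 1 (one class) and 0 (the other) are not separable by a threshold on x alone, but a hypothetical feature map x -> (x, x^2) lifts them into a plane where a line separates them. The weights below are chosen by hand for illustration, not learned:

```python
import numpy as np

xs = np.array([-1.0, 0.0, 1.0])
labels = np.array([1, -1, 1])  # not linearly separable on the line

phi = np.stack([xs, xs**2], axis=1)  # explicit feature map x -> (x, x^2)

# A separating hyperplane in feature space: w . phi(x) + b = 0.
w, b = np.array([0.0, 2.0]), -1.0
pred = np.sign(phi @ w + b)
print((pred == labels).all())  # the lifted points are separable
```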
The component-based approach is possibly the closest to the way humans detect things. It consists of identifying some basic components of a human face (eyes, nose, mouth) and combining them into a connected system that can be matched against a template, as used by Wiskott et al. Other methods, like those of Rikert et al. (1999) and Schneiderman and Kanade (1998), make use of statistical modeling.

2 Implementation
2.1 General design
I used Matlab as a general tool to test the algorithms. Intel's OpenCV package was used to obtain the video stream from the camera.

2.2 My skin detection

I started with a very simple skin detector using the red and green color components (since red and yellow are the main colors describing skin, as mentioned above). I constructed a 400x240 image, to which I will later refer as the skin patch. It was constructed from five different images found on the Internet that contained skin regions. I manually selected and cropped these regions and combined them into one patch so that I only had skin pixels to work with. In the early stages of my project I used this skin patch to build and test simple techniques; I am planning to extend the skin patch to include more pictures and more skin types. After that I plotted the color distribution model on an RG chart and manually selected the region that skin color would fall into.

I used this approach as a draft for later stages, because the skin model is more complex than a rectangular region in RG space. But even at this stage I got some decent results by calling 'skin' everything that fell into this region. I ran an algorithm that analyzed an input image pixel by pixel, assigning logical 1 to pixels that satisfied the 'skin' condition and logical 0 to those that did not. The results varied from detecting almost all pixels correctly to completely missing all of them and interpreting non-skin textures as skin. I learned the paramount importance of a good skin classifier: in order to get better results, one must build as descriptive and accurate a skin model as possible.

After the first attempt to design a skin detector, I decided to repeat the experiment using a different color space, choosing HSV for my second experiment. Matlab provides the rgb2hsv function that performs the translation from RGB to HSV color space. I was mostly interested in how different my results would be in another color space. This second experiment gave better results than the first one on the same picture set: it turned out that hue and saturation gave a better classification description for the skin patch due to their invariance to the high intensity of white lights.
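Python's standard colorsys module performs the same conversion as rgb2hsv and can illustrate this invariance: scaling a pixel's brightness leaves hue and saturation unchanged (the pixel values here are made up):

```python
import colorsys

# The same colour at two brightness levels.
dim = colorsys.rgb_to_hsv(0.4, 0.3, 0.25)
bright = colorsys.rgb_to_hsv(0.8, 0.6, 0.5)  # every channel doubled

# Hue and saturation match; only value (brightness) differs.
print(round(dim[0], 3) == round(bright[0], 3))
print(round(dim[1], 3) == round(bright[1], 3))
```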

Skin detection using a single Gaussian distribution is one of the best-known methods for skin detection. Its advantages are real-time speed and low memory requirements; its main disadvantage is lower accuracy. This method is a step up from methods that just determine hard boundaries for the skin color interval, since it uses a normal distribution over skin color regions. In order to increase the accuracy of skin detection, I decided to use a 3D classifier, counting the number of occurrences of each skin color and convolving the counts with a Gaussian distribution afterwards. I implemented a function that takes an input skin patch and analyzes it pixel by pixel, storing counts in a 3D accumulator array whose size I determine manually. After that I form a 1D Gaussian vector of some predefined size and standard deviation and convolve it with its transpose, and once again with the original vector, in order to obtain a separable 3D Gaussian filter. After applying this filter to the accumulator array I get a normal distribution of the data derived from my original skin patch. The model is then normalized and stored as a .mat file, so it can be retrieved later and used in other methods.
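A NumPy sketch of this accumulate-then-smooth construction; the bin count, kernel size, and sigma are arbitrary placeholders for the manually chosen values mentioned above:

```python
import numpy as np

BINS = 16

def gaussian_kernel(size=5, sigma=1.0):
    x = np.arange(size) - size // 2
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def smooth_separable(acc, kernel):
    # Convolving along each axis in turn is equivalent to a full 3D
    # Gaussian convolution, because the kernel is separable.
    out = acc.astype(float)
    for axis in range(3):
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), axis, out)
    return out

acc = np.zeros((BINS, BINS, BINS))
acc[8, 6, 5] = 100  # pretend 100 skin pixels fell into this colour bin

model = smooth_separable(acc, gaussian_kernel())
model /= model.sum()  # normalize to a probability distribution
print(model[8, 6, 5] > model[8, 6, 7])  # mass stays concentrated at the peak
```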

For the next part of my project I decided to use an improved version of the skin patch, so I added skin color regions from 20 other pictures found on the Internet, which resulted in a 582x585-pixel image. Furthermore, I decided to experiment with all the color spaces mentioned in Kakumanu et al. (2007), namely perceptual color spaces (HSI, HSV, HSL, TSL), orthogonal color spaces (YCbCr, YIQ, YUV, YES), and perceptually uniform color spaces (CIE-Lab and CIE-Luv).

Some of the conversion functions perform faster than others; so far TSL and YIQ are the slowest, but I will be working on code optimization later on. In order to compare the results of detecting skin in different color spaces, I had to create a skin model in each of them, so I used the function described above to produce 13 different color model .mat files, including standard RGB and normalized RGB. The results of this experiment were the following: the RGB, normalized RGB, and YIQ color spaces produced better skin probability maps than the others; XYZ, HSV, YUV, and YES were the next best at detecting skin, followed by the Lab and Luv color spaces. The TSL and HSL procedures need to be improved, since due to the nature of the TSL conversion it sometimes returns an imaginary component in its output, which needs to be handled. But overall the results were satisfying, and I was pleased with the accuracy of the top-performing color spaces. I also wrote inverse operations for the color space conversions, since it takes less time to convert a model from some color space to RGB than to convert a picture with bigger dimensions into that color space. This increased the speed of algorithm execution.

After the skin probability map was obtained, I had to filter out some of the pixels falsely detected by the algorithm and also include the ones it misclassified. First of all, I had to convert the probability image into a binary one by deciding on a threshold. All probability values lower than the threshold are discarded as non-skin and become 0s; the rest are considered skin and become 1s on the binary map. In order to automate the choice of threshold, I used a Matlab script that classified the pixels as skin/non-skin based on a very low threshold and then repeated the procedure, increasing the threshold until it reached a defined maximum level. Meanwhile, a list of differences between the current and previous binary maps was maintained, counting the number of pixels previously considered skin and the number of "new" skin pixels. After that I find the iteration with the minimal difference and use the threshold associated with that iteration. This process of dynamically adapting the threshold to a given image was extremely useful, since the skin detection algorithm will be tested on images with different illumination conditions and different skin color regions.
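The threshold-adaptation loop can be sketched as follows; the step size and bounds are invented for illustration:

```python
import numpy as np

def adaptive_threshold(prob_map, lo=0.05, hi=0.95, step=0.05):
    """Pick the threshold at which the binary skin map changes least
    between consecutive steps (a sketch of the procedure above)."""
    thresholds = np.arange(lo, hi, step)
    prev = prob_map >= thresholds[0]
    best_t, best_diff = thresholds[0], np.inf
    for t in thresholds[1:]:
        cur = prob_map >= t
        diff = np.count_nonzero(prev != cur)  # pixels that flipped
        if diff < best_diff:
            best_diff, best_t = diff, t
        prev = cur
    return best_t

# A bimodal toy probability map: background near 0, skin near 1.
pm = np.array([[0.02, 0.05, 0.9], [0.03, 0.88, 0.95]])
t = adaptive_threshold(pm)
print(0.05 < t < 0.9)  # the stable threshold falls in the gap
```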

After obtaining a binary skin map, I had to segment the skin regions. A skin region can be defined as a closed region of pixels, or a set of connected components (Cai, Goshtasby, and Yu, "Detecting Human Faces in Color Images"), in the image. Knowing that some pixels were still misclassified and some were still missing from the map, I decided to apply the two morphological operations of erosion and dilation. Erosion is a function that "shrinks" the image by translating a structuring element over the binary skin probability map. Erosion is dilation's dual; therefore eroding and then dilating by the same factor results in noise reduction during the erosion step and skin region enhancement during the dilation step.
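This erode-then-dilate pair amounts to a morphological opening. A pure-NumPy sketch with a 4-connected structuring element (the element choice is my assumption; the text does not specify one):

```python
import numpy as np

def erode(mask):
    # A pixel survives erosion only if it and all 4-neighbours are set.
    p = np.pad(mask, 1)
    return (mask & p[:-2, 1:-1] & p[2:, 1:-1]
            & p[1:-1, :-2] & p[1:-1, 2:])

def dilate(mask):
    # A pixel is set after dilation if it or any 4-neighbour was set.
    p = np.pad(mask, 1)
    return (mask | p[:-2, 1:-1] | p[2:, 1:-1]
            | p[1:-1, :-2] | p[1:-1, 2:])

skin = np.zeros((7, 7), dtype=bool)
skin[2:6, 2:6] = True   # a solid skin blob
skin[0, 0] = True       # a lone misclassified pixel

opened = dilate(erode(skin))  # erosion kills noise, dilation restores bulk
print(opened[0, 0], opened[3, 3])
```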

Since I will be using skin detection for my face detector in the future, I decided to restrict the skin probability map once more by requiring a skin region to have at least one hole inside it (an eye, in the case of a face).

2.3 My face detection

Face detection is a major part of my system. The first approach that I implemented was based on my Matlab code for skin region detection. I assumed for simplicity that any skin region with one or more holes in it was a face region, and also that each labeled skin region (the procedure is described in the skin region segmentation section) was a face. These two major assumptions were made initially for simplicity and testing purposes. As it turned out, these assumptions meant the procedure performed well only on a very narrow band of image types, so they became major limitations of the technique. Furthermore, these two assumptions were hard to resolve, given that the entire face detection method was based solely on skin region segmentation.

The binary skin probability map obtained with the procedure described in the previous chapter needed further processing before it could be used by my face detection method. I needed to distinguish the skin regions from each other by labeling them. During the labeling stage each pixel is examined along with its 8 neighboring pixels (all the possible neighbors of a pixel). If any of the neighboring pixels already has a label, we mark the pixel we are currently examining with that label; otherwise we introduce a new label. One way of producing the labeled regions in Matlab is the bwlabeln function. In order to make the resulting picture easily readable, I associated each label with a particular color, so that on the skin region map I could clearly see the distinctly colored regions. This was done with the label2rgb function in Matlab.
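A small stand-in for bwlabeln, implementing the 8-neighbor labeling described above with a breadth-first flood fill:

```python
import numpy as np
from collections import deque

def label_regions(mask):
    """8-connected component labelling of a binary skin map."""
    labels = np.zeros(mask.shape, dtype=int)
    next_label = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue  # already part of a labeled region
        next_label += 1
        labels[start] = next_label
        q = deque([start])
        while q:
            y, x = q.popleft()
            for dy in (-1, 0, 1):      # visit all 8 neighbours
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                            and mask[ny, nx] and not labels[ny, nx]):
                        labels[ny, nx] = next_label
                        q.append((ny, nx))
    return labels, next_label

m = np.array([[1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 0, 1]], dtype=bool)
labels, n = label_regions(m)
print(n)  # two 8-connected regions
```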

Once the labeled skin regions were obtained, they needed to be examined topologically. In order to reduce the number of false positives, a further restriction on what constitutes a face region was made: to be considered for further processing as a valid face region, a region had to have at least one hole in it. This was done bearing in mind that eye and mouth regions are omitted by the skin detector, and only the region surrounding them is labeled as skin. One hole is sufficient, so that even profile face images showing only one hole (for example, one eye) are considered.

Since the labeled skin regions can now be processed separately, we examine them one at a time. The number of holes in a region can be determined by subtracting its Euler number from 1. This calculation comes from the definition of the Euler number, E = C - H, where C is the number of connected components (1 in our case, since we are examining one skin region at a time) and H is the number of holes in it. The following lines of Matlab code implement this procedure:

Listing 1: Computing the number of holes in each labeled skin region

eul = regionprops(labelskin, 'EulerNumber');
eulerregs = cat(1, eul.EulerNumber);
holes = 1 - eulerregs;
In an attempt to further refine the face detection rate, a simple aspect ratio test is performed, which considers a region valid only if the aspect ratio of the skin region is between 1 and 3. These values were obtained by averaging the results of an aspect ratio test performed on a face database.
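The aspect ratio test can be sketched as a bounding-box check; using height/width as the ratio is my assumption about how it was measured:

```python
import numpy as np

def aspect_ratio_ok(region_mask, lo=1.0, hi=3.0):
    """Bounding-box aspect ratio test with the 1-3 range quoted above."""
    ys, xs = np.nonzero(region_mask)
    h = ys.max() - ys.min() + 1
    w = xs.max() - xs.min() + 1
    return lo <= h / w <= hi

face_like = np.ones((30, 20), dtype=bool)  # ratio 1.5: plausible face
arm_like = np.ones((80, 10), dtype=bool)   # ratio 8.0: rejected
print(aspect_ratio_ok(face_like), aspect_ratio_ok(arm_like))
```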

In order to study a region further, its center of mass and orientation had to be found so that a template face image could be fitted into the skin region. These calculations were made by extensively using the regionprops Matlab function, which, depending on the second parameter passed to it, yields such useful features as the orientation and center of mass (centroid) of the region. It also finds a bounding box into which the face template can be fitted. These templates, superimposed on the original image, show the locations of the detected faces.

This procedure employs a multi-step refinement technique. Due to its dependence on multiple matrix operations performed on skin regions, it cannot be used in real-time face detection, but it does yield fairly consistent, good results provided that the two assumptions described above are satisfied. The procedure can be used offline on sequences of images extracted from a video; the resulting images can afterwards be assembled into a resulting video sequence.

Although the procedure did not perform in real time, its refinement steps helped to identify the skin regions more accurately, resulting in overall better performance for a face detection technique using skin color as the main face feature.

2.3.1 AdaBoost

A lesson learned from the procedure described above is that in order to detect faces, the algorithm should employ higher-level feature extraction rather than examining properties of individual pixels.

A fundamentally alternative approach to the problem of face detection was proposed by Papageorgiou et al. (1998), who suggested using a set of alternative features also known as Haar-like features. A simple rectangular Haar-like feature can be defined as the difference of the sums of pixel values over areas inside a given rectangle, with this rectangle being at any position within the image and at any scale.

Viola and Jones (2001) defined 2-, 3-, and 4-rectangle features, named after the number of rectangles within the feature. Together these features indicate certain characteristics of the image, such as a change in texture or the position of a border between a dark and a light region. Viola and Jones also introduced the concept of an integral image, defined as a two-dimensional look-up table in the form of a matrix of the same size as the original image. Each element of this matrix contains the sum of all pixels located above and to the left of the corresponding pixel in the original image. This property of the integral image allows the sum of the pixels over any rectangle of the original image to be computed using only four lookups:

sum = pt4 - pt3 - pt2 + pt1.

Depending on how a feature is defined, it takes a different number of lookups to evaluate it (6 for 2-rectangle, 8 for 3-rectangle, and 9 for 4-rectangle features), so once the integral image is computed, Haar-like features can be computed in constant time. Later, Lienhart and Maydt (2002), in their extended set of Haar-like features, introduced the new concept of a tilted (45-degree) Haar-like feature, which helps describe an object better.
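The integral image and the four-lookup rectangle sum can be sketched in a few lines; the padded zero row and column make the formula valid at the image border:

```python
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img[:y, :x]; the zero padding on top/left
    # lets the four-lookup formula work at the border.
    return np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def rect_sum(ii, top, left, h, w):
    # sum = pt4 - pt3 - pt2 + pt1, the four corner lookups.
    return (ii[top + h, left + w] - ii[top, left + w]
            - ii[top + h, left] + ii[top, left])

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
# A 2-rectangle Haar-like feature: left half minus right half of a box.
feature = rect_sum(ii, 0, 0, 4, 2) - rect_sum(ii, 0, 2, 4, 2)
print(feature)  # -> -16 (the right half of this image is brighter)
```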
A classifier for an object is constructed using the AdaBoost algorithm. AdaBoost uses the intuitive approach of combining multiple diverse and independent decisions (each of which only has to be more accurate than a random guess), thus cancelling random errors and reinforcing the correct decisions.
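A miniature AdaBoost over threshold "stumps" on a 1D feature shows the two ingredients named above, upweighting mistakes and weighting classifiers by their error; this is a toy, not the OpenCV implementation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, 1, -1, -1, -1])

w = np.ones(len(x)) / len(x)  # uniform sample weights to start
stumps = []
for _ in range(3):
    # Pick the threshold stump with the lowest weighted error.
    thr, err = min(((t, (w * (np.where(x < t, 1, -1) != y)).sum())
                    for t in x + 0.5), key=lambda p: p[1])
    err = max(err, 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # classifier weight
    pred = np.where(x < thr, 1, -1)
    w *= np.exp(-alpha * y * pred)         # misclassified samples grow
    w /= w.sum()
    stumps.append((thr, alpha))

def classify(v):
    # Weighted vote of the learned stumps.
    return np.sign(sum(a * (1 if v < t else -1) for t, a in stumps))

print(classify(2.5), classify(5.5))
```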

The OpenCV package provides an implementation of the AdaBoost algorithm. The documentation provided with the package barely described the process of training a classifier. I used a tutorial, webseng (2008), aimed at teaching how to create a classifier for pedestrian detection. Although the tutorial contained most of the information needed to carry out the training procedure, I had to experiment with the code and databases for some time to train my classifier.

As the tutorial suggested, I maintained two sets of data: positive and negative. For the positive dataset I chose the UMIST Face Database, since its data covered head rotation and I was aiming to classify even rotated faces. For the negative set I used a dataset (suggested by Naotoshi Seo) with about 3500 pictures. I had to delete some of them because they contained faces, the dataset having been used for similar eye detection research. This is a very important step: one has to make sure the negative set does not contain the object that needs to be detected. After that I had 2984 negative images (background images not containing faces) and 615 cropped images from the UMIST database.

I created two .data files containing the paths to all images in each dataset. Creating test samples from the dataset was not an easy task, since there are three different options for generating the description file. The createsamples executable was built from the sources provided with the OpenCV package. It allows test samples to be created either from images in the negative dataset without applying distortions, or from one image with distortions, with an extra option of providing the ground truth table for the face locations. I used the command: createsamples -img me.jpg -num 20 -bg negatives.dat -info test.dat -maxidev 100 -maxzangle 0.3 -maxxangle 0.6 -vec samples.vec

The resulting file samples.vec can be examined by executing createsamples -vec samples.vec and specifying the dimensions of the output pictures. The generated .vec files can then be merged into a bigger .vec file; I had to repeat the procedure over 1000 times, creating 5 samples at a time and merging the results into a bigger .vec file. Once samples.vec is created, we can run a program compiled from the OpenCV source code on it to train the classifier for face detection. Kuranov et al., in their empirical analysis of boosting algorithms, use 5000 positive and 3000 negative images and a sample size of 20x20, which according to them yielded better results. I ran the following command: opencv-haartraining -data cascade -vec result/samples.vec -bg result/negatives.dat -nstages 30 -nsplits 2 -minhitrate 0.999 -maxfalsealarm 0.5 -npos 7000 -neg 2984 -w 20 -h 20 -mem 1300 -mode ALL. The mode ALL uses the extended tilted Haar-like features mentioned above. During the learning procedure, AdaBoost assigns weights to the samples and readjusts them after each iteration, reweighting them so that the learning focuses on the samples that the most recently learned classifier got wrong. The algorithm iteratively combines the classifiers and weights them according to their error rates. The training error converges to 0 during the learning process. The haartraining procedure produces an output file which contains information about the classifier. These classifiers can afterwards be combined into a cascade with the createcascade program, since the false positive rate of cascaded classifiers is lower.

2.3.2 Enhancements

Although the trained classifier did not have a high detection rate, its performance was enhanced with some additional techniques. In addition to detecting faces with the trained classifier, I ran a similar procedure to detect eyes; the cascade was obtained from Santana et al. (2008). The two cascades coupled together significantly improved the detection rate. Additional cascades can be utilized, namely mouth and nose detection.

Furthermore, once a facial region is detected and eyes are found inside that region, a number of points inside the region are found with the cvGoodFeaturesToTrack function; these can then be used to track the features, as long as the number and location of the points are persistent. Tracking is done with the Lucas-Kanade optical flow algorithm. The location of the points in the picture gives additional evidence that is used to detect a human face.

In addition to the procedure described above, a head-and-shoulders cascade provided by M. Castrillón-Santana can be used to identify a human figure. The bounding box is then estimated by multiplying the height of the detected head by a known aspect ratio constant.

One more technique that can improve performance is background elimination. OpenCV provides an example of silhouette tracking; I augmented the code so that the resulting function returns the actively moving objects and subtracts the still ones from the picture, thus reducing the number of possible false-positive outliers.
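The moving/still distinction can be illustrated with simple frame differencing, a cruder stand-in for OpenCV's silhouette-tracking example; the threshold is arbitrary:

```python
import numpy as np

def moving_mask(prev_frame, cur_frame, thresh=25):
    """Pixels that change little between frames count as background."""
    diff = np.abs(cur_frame.astype(int) - prev_frame.astype(int))
    return diff > thresh

prev = np.full((4, 4), 100, dtype=np.uint8)  # still grey background
cur = prev.copy()
cur[1:3, 1:3] = 180  # a moving object brightened this 2x2 patch
mask = moving_mask(prev, cur)
print(mask.sum())  # only the four changed pixels survive
```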
2.3.3 Result

The result of my work is a set of functions written in C/C++, together with .so libraries compiled from Matlab, that can be used in external projects in related fields of study. I have also implemented a GUI in Python that allows the user to interact with the program by turning additional enhancement procedures on and off and toggling between different detection procedures. The interactive design also allows switching between online detection with a video stream obtained from a web camera and manually selecting a desired image or video file, and, if so desired, saving the output in a suitable format. Certain restrictions apply to real-time performance: due to their high number of processing steps, the skin detection techniques cannot be performed in real time.

2.3.4 Challenges

During the work I encountered both technical and knowledge-based challenges. The knowledge-based challenges mostly involved a lack of understanding of some concept, which is why I had to spend a significant amount of time doing research. The hardest part here was selecting the information relevant to my research from the many papers and publications on face and skin detection.

Technical challenges included the problems of compiling all the libraries together and building an application. This was partially resolved by compiling the procedures written in Matlab into loadable libraries. The interaction between the GUI written in Python and the C++ functions was done with the Boost.Python library.

References

Brown, D. A., Craw, I., and Lewthwaite, J. (2001). A SOM based approach to skin detection with application in real time systems. In British Machine Vision Conference, Poster Session 2 and Demonstrations.

Forsyth, D. A. and Fleck, M. M. (1999). Automatic detection of human nudes. International Journal of Computer Vision, 32(1):63–77.

Gómez, G. (2002). On selecting colour components for skin detection. In ICPR (2), pages 961–964.

Gomez, G. and Morales, E. F. (2002). Automatic feature construction and a simple rule induction.

Heisele, B., Poggio, T., and Pontil, M. (2000). Face detection in still gray images. MIT AI Memo.

Jones, M. J. and Rehg, J. M. (1999). Statistical color models with application to skin detection. In CVPR, pages 1274–1280. IEEE Computer Society.

Kakumanu, P., Makrogiannis, S., and Bourbakis, N. G. (2007). A survey of skin-color modeling and detection methods. Pattern Recognition, 40(3):1106–1122.

Lienhart, R. and Maydt, J. (2002). An extended set of Haar-like features for rapid object detection. In ICIP.

Osuna, E. E., Freund, R., and Girosi, F. (1997). Support vector machines: Training and applications.

Papageorgiou, C., Oren, M., and Poggio, T. (1998). A general framework for object detection. In ICCV, pages 555–562.

Rikert, T. D., Jones, M. J., and Viola, P. (1999). A cluster-based statistical model for object detection. In International Conference on Computer Vision, pages 1046–1053.

Santana, M. C., Déniz-Suárez, O., Antón-Canalís, L., and Lorenzo-Navarro, J. (2008). Face and facial feature detection evaluation: performance evaluation of public domain Haar detectors for face and facial feature detection. In Ranchordas, A. and Araújo, H., editors, VISAPP (2), pages 167–172. INSTICC - Institute for Systems and Technologies of Information, Control and Communication.

Schneiderman, H. and Kanade, T. (1998). Probabilistic modeling of local appearance and spatial relationships for object recognition. In CVPR, pages 45–51. IEEE Computer Society.

Sung, K. K. (1996). Learning and example selection for object and pattern detection.

Vezhnevets, V. (2005). A comparative assessment of pixel-based skin detection methods. Graphics and Media Laboratory.

Viola, P. and Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proc. CVPR, 1:511–518.

webseng (2008). openCV Haartraining Tutorial.

Wiskott, L., Fellous, J.-M., Kruger, N., and von der Malsburg, C. Face recognition by elastic bunch graph matching. Pages 129–132.

Zhu, Q., Wu, C.-T., Cheng, K.-T., and Wu, Y.-L. (2004). An adaptive skin model and its application to objectionable image filtering. In Schulzrinne, H., Dimitrova, N., Sasse, M. A., Moon, S. B., and Lienhart, R., editors, Proceedings of the 12th ACM International Conference on Multimedia, October 10-16, 2004, New York, NY, USA, pages 56–63. ACM.