
Análise Forense de Documentos Digitais
(Digital Document Forensics)

Prof. Dr. Anderson Rocha


anderson.rocha@ic.unicamp.br
http://www.ic.unicamp.br/~rocha

Reasoning for Complex Data (RECOD) Lab.


Institute of Computing, Unicamp
Av. Albert Einstein, 1251 - Cidade Universitária
CEP 13083-970 Campinas/SP - Brasil
Outline
The pornography vs. nude detection problem

Related work

Skin-based techniques and how they work

Visual Words Techniques and how they work

Working with Videos

Space-temporal Visual Words

Extending the Solutions to Violence Detection and General Action Recognition

A. Rocha, 2012 Digital Document Forensics 3


Introduction

The slides herein are a composition of contributions from several authors:
Michael J. Jones, James M. Rehg, Li Fei-Fei, Ana Paula Lopes,
Sandra E. F. de Avila, Anderson N. A. Peixoto, Rodrigo S.
Oliveira, Marcelo de M. Coelho, Arnaldo Araújo, Fillipe
Dias, Guillermo Cámara Chávez, Eduardo Valle and many
others.

I deeply thank all of them for making their resources available



Introduction

The importance of pornography detection is attested by the large literature on the subject.

Moreover, web filtering is essential to avoid offensive content, such as adult or pornographic material.

There are industry software products that block websites with offensive content (e.g., CyberPatrol, NetNanny, K9 Web Protection).



Introduction

Also, there are software products that scan a computer for pornographic content (e.g., SurfRecon, Porn Detection Stick, PornSeer Pro).

The latter pornography-detection software is readily available for evaluation purposes.

NuDetective (Brazilian Federal Police)




State-of-the-art

Most work regarding the detection of pornographic material has been done in the image domain.

The vast majority of those works is based on the detection of human skin. For example, [Fleck et al., 1996; Forsyth and Fleck, 1996, 1997, 1999] proposed to detect skin regions in an image and match them to human bodies by applying geometric grouping rules.



State-of-the-art
[Jones and Rehg, 2002] focused on the detection of human skin by constructing RGB color histograms from a large database of skin and non-skin pixels, which makes it possible to estimate the skin probability of a pixel from its color.

[Rowley et al., 2006] used Jones and Rehg's skin color histograms in a system deployed in Google's Safe Search.

[Zuo et al., 2010] proposed a patch-based skin color detector that verifies whether all the pixels in a small patch correspond to human skin tones.



State-of-the-art

Few methods have explored other possibilities.

Bag-of-Words models [Sivic and Zisserman, 2003] have been employed for many complex visual classification tasks, including pornography detection in images and videos.

[Deselaers et al., 2008] first proposed a BoW model to filter pornographic images, which greatly improved the efficiency of pornographic-image identification.



State-of-the-art

Lopes et al. developed a BoW-based approach using the HueSIFT color descriptor to classify images [Lopes et al., 2009b] and videos [Lopes et al., 2009a] containing pornography.

[Ulges and Stahl, 2011] introduced color-enhanced visual-word features in the YUV color space to classify child pornography.



State-of-the-art

[Steel, 2012] proposed a pornographic-image recognition method based on visual words, using mask-SIFT in a cascading classification system.

Those previous works have explored only bags of static features.



State-of-the-art

Very few works have explored spatiotemporal features or other motion information for the detection of pornography.

[Tong et al., 2005] proposed a method to estimate the period of a signal in order to classify periodic motion patterns.

[Endeshaw et al., 2008] developed a fast method for detecting indecent video content using repetitive-movement analysis.



State-of-the-art

[Jansohn et al., 2009] introduced a framework that combines keyframe-based methods with a statistical analysis of MPEG-4 motion vectors.

[Valle et al., 2012] compared the use of several features, including spatiotemporal local descriptors, for pornography detection.



Parallel

There is a vast amount of multimedia data nowadays

Filtering improper multimedia material by its visual content is needed.



Parallel

Skin detectors

Precise skin detection is not a trivial task

Generic geometrical models vs. various body poses

BoVF representation...

...has great success in object recognition tasks and...

...is robust to several variations and occlusion



Statistical Color Models with
Applications in Skin Detection
Michael J. Jones and James M. Rehg
Cambridge Research Laboratory
Compaq Computer Corporation
One Cambridge Center

Springer Intl. Journal of Computer Vision, Vol. 46, No. 1, 2002


Introduction

The paper describes the construction of skin and non-skin statistical color models;

Uses 1 billion training pixels from Web images;

Shows a skin classifier with an accuracy of 88%;

The system was used to detect naked people in Web images;



Histogram Color Models

Images are organized in two sets:

Generic Training Set;

Used to compute a general histogram density;

Classifier Training Set;

Used to build the skin and non-skin models;

Manually separated into subsets containing skin and not containing skin;

Skin pixels are manually labeled;



General Color Model

Observations:

1. Most colors fall on or near the gray line;


2. Black and white are by far the most frequent colors, with white occurring slightly more frequently;
3. There is a marked skew in the distribution toward the red corner of the color cube.



Skin and Non-skin Color Models

Observations:

1. Non-skin model is the general model without skin pixels (10% of pixels);
2. There is a significant degree of separation between the skin and non-skin models;



Skin Detection Using Color Models

Given skin and non-skin histogram models, we can construct a skin pixel classifier;

Such a classifier is extremely useful in two contexts:

Detection and recognition of faces and figures;

Image indexing and retrieval;



Skin Detection Using Color Models
A skin pixel classifier is derived through the standard likelihood ratio approach: a pixel with color rgb is labeled skin if

P(rgb | skin) / P(rgb | non-skin) ≥ Θ,

where Θ is a threshold that trades off correct detections against false positives.
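The likelihood-ratio rule above can be sketched in a few lines of Python (a minimal illustration, not the paper's code; `BINS`, `build_histogram` and `is_skin` are hypothetical names, and 32 bins per channel is the size the paper reports as best):

```python
import numpy as np

# Minimal sketch of a histogram-based skin classifier in the spirit of
# Jones and Rehg; names and data layout are illustrative.
BINS = 32  # 32 bins per RGB channel

def build_histogram(pixels):
    """Quantize Nx3 RGB pixels into a BINS^3 histogram, normalized to a density."""
    idx = (np.asarray(pixels) // (256 // BINS)).astype(int)
    hist = np.zeros((BINS, BINS, BINS))
    np.add.at(hist, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    return hist / hist.sum()

def is_skin(pixel, p_skin, p_nonskin, theta=0.4):
    """Likelihood-ratio test: skin iff P(rgb|skin) / P(rgb|non-skin) >= theta."""
    r, g, b = (np.asarray(pixel) // (256 // BINS)).astype(int)
    num, den = p_skin[r, g, b], p_nonskin[r, g, b]
    if den == 0:
        return bool(num > 0)  # color never observed as non-skin
    return bool(num / den >= theta)
```

Classification is then two table lookups per pixel, which is why the histogram model is fast compared with evaluating a mixture of Gaussians.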


Histogram-based Skin Classifier

Qualitative observations (Θ = 0.4):

The classifier does a good job of detecting skin in most of


these examples;

In particular, the skin labels form dense sets whose shape


often resembles that of the true skin pixels;

The detector tends to fail on highly saturated or shadowed


skin;



Histogram-based Skin Classifier

More qualitative observations:

The example photos also show the performance of the detector on non-skin pixels.

In photos such as the house (lower right) or the flowers (upper right), the false detections are sparse and scattered.

More problematic are images with wood or copper-colored metal, such as the kitchen scene (upper left) or the railroad tracks (lower left).

These photos contain colors which often occur in the skin model and are difficult to discriminate reliably. This results in fairly dense sets of false positives.



Histogram-based Skin Classifier

Detection rate: fraction of skin pixels that were classified correctly.

False positive rate: fraction of non-skin pixels which are mistakenly classified as skin.



Receiver Operating Characteristic Curves (ROC curves)

(figure: ROC curves for the variants labeled Original, =7 and =9 on the slide)


Histogram-based Skin Classifier

More quantitative observations:

The performance of the skin classifier is surprisingly good considering the unconstrained nature of Web images;

The best classifier (size 32) can detect roughly 80% of skin pixels with a false positive rate of 8.5%, or 90% correct detections with 14.2% false positives;

Its equal error rate is 88%.



Comparison to GMM Classifier

One advantage of mixture models is that they can be made to generalize well on small amounts of training data;

One possible benefit of a large dataset is the ability to use density models such as histograms, which are computationally simpler to learn and apply;

Mixture models were trained on the dataset and their classification performance was compared to that of the histogram models.



Comparison to GMM Classifier

Two separate mixture models were trained, one for the skin class and one for the non-skin class;

Each model used 16 Gaussians;

The models were trained using a parallel implementation of the standard EM algorithm.
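The comparison can be reproduced in spirit with scikit-learn (a sketch on synthetic data; `GaussianMixture` stands in for the paper's parallel EM implementation, and the RGB cluster means are made up):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Sketch: one 16-component GMM per class, classified by log-likelihood ratio.
# Synthetic pixel clouds; the paper used ~1 billion labeled web-image pixels.
rng = np.random.default_rng(0)
skin = rng.normal([180, 120, 100], 20, size=(500, 3))    # made-up "skin" RGB cloud
nonskin = rng.normal([80, 140, 80], 60, size=(500, 3))   # made-up background cloud

gmm_skin = GaussianMixture(n_components=16, random_state=0).fit(skin)
gmm_non = GaussianMixture(n_components=16, random_state=0).fit(nonskin)

def skin_score(rgb):
    """Log-likelihood ratio log P(rgb|skin) - log P(rgb|non-skin)."""
    x = np.atleast_2d(rgb).astype(float)
    return float(gmm_skin.score_samples(x)[0] - gmm_non.score_samples(x)[0])
```

Classification compares the score against a threshold, exactly as in the histogram case; the cost difference is that all 16 Gaussians per class must be evaluated for every color value.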





Comparison to GMM Classifier

The mixture of Gaussians model is significantly more expensive to train than the histogram models;

It took 24 hours to train both the skin and non-skin mixture models using 10 Alpha workstations in parallel. In contrast, the histogram models could be constructed in a matter of minutes on a single workstation;

The mixture model is also slower to use during classification, since all of the Gaussians must be evaluated when computing the probability of a single color value;



Comparison to GMM Classifier

In contrast, the histogram model yields a fast classifier, since only two table lookups are required to compute the probability of skin;

From the standpoint of storage space, however, the mixture model is a much more compact representation of the data.





Applications

Person Detector;

Adult Image Detector;

Incorporating Text Features into Adult Image Detection.





Applications - Person Detector

The goal is to determine whether or not an input image contains one or more people;

A decision tree classifier is trained on a simple feature vector computed from the output of the skin detector:
1. Percentage of pixels detected as skin;
2. Average probability of the skin pixels;
3. Size in pixels of the largest connected component of skin;
4. Number of connected components of skin;
5. Percent of colors with no entries in the skin and non-skin histograms.

Results:
83.2% (correctly classified person images)
71.3% (correctly classified non-person images)
77.5% (correctly classified images)
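Four of the five features above can be sketched from a binary skin mask using SciPy's connected-component labeling (an illustrative implementation, not the paper's; the fifth feature, colors absent from both histograms, is omitted, and `skin_features` is a hypothetical name):

```python
import numpy as np
from scipy import ndimage

def skin_features(skin_mask, skin_prob):
    """Four of the detector's features, from a boolean skin mask and a
    per-pixel skin-probability map."""
    labels, n = ndimage.label(skin_mask)          # connected skin components
    sizes = ndimage.sum(skin_mask, labels, range(1, n + 1)) if n else []
    return [
        float(skin_mask.mean()),                                  # % pixels detected as skin
        float(skin_prob[skin_mask].mean()) if skin_mask.any() else 0.0,
        float(max(sizes)) if n else 0.0,                          # largest component size
        int(n),                                                   # number of components
    ]
```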



Applications - Adult Image Detector

There is a strong correlation between images with large patches of skin and adult or pornographic images;

All of these services currently operate by maintaining lists of objectionable URLs and newsgroups and require constant manual updating;

An image-based scheme has the potential advantage of applying equally to all images, without the need for updating.



Applications - Adult Image Detector

The same approach as with person detection is used;

A neural network classifier is used;

The feature vector includes two additional elements corresponding to the height and width of the image;

These two were added based on informal observations that adult images are often sized to frame a standing or reclining figure.



Applications - Adult Image Detector

Results:
85.8% correctly detected adult images
7.5% mistakenly classified non-adult images





Applications - Adult Image Detector

The adult image detector just described can be combined with a text-based classifier (obtained from AltaVista) which uses the text occurring on a Web page to determine if it is pornographic;

The text-based detector uses the standard approach of matching words on the page against a manually specified list of objectionable words, and classifying based on the match frequencies;

The text-based classifier on its own achieves 84.9% correct detections of adult images with 1.1% false positives.



Applications - Adult Image Detector

The color-based detector is combined with the text-based detector by using an OR of the two classifiers, i.e., an image is labeled adult if either classifier labels it adult.



Comments

Color distributions for skin and non-skin pixel classes learned from web images can be used as an accurate pixel-wise skin detector;

The key is the use of a very large labeled dataset to capture the effects of the unconstrained imaging environment represented by web photos;

Visualization studies show a surprising degree of separability in the skin and non-skin color distributions;

They also reveal that the general distribution of color in web images is strongly biased by the presence of skin pixels.



Comments

One possible advantage of using a large dataset is that simple learning rules may give good performance;

A pixel-wise skin detector can be used to detect images containing naked people, which tend to produce large connected regions of skin;

It is shown that a detection rate of 88% can be achieved with a false alarm rate of 11.3%, using a seven-element feature vector and a neural network classifier;

The results suggest that skin color is a very powerful cue for detecting people in unconstrained imagery.



Before continuing...
Bag-of-words models (BoW Models)
Li Fei-Fei (Stanford Univ.)
Related Work
Early bag of words models: mostly texture recognition
Cula et al. 2001; Leung et al. 2001; Schmid 2001; Varma et al. 2002,
2003; Lazebnik et al. 2003
Hierarchical Bayesian models for documents (pLSA, LDA, etc.)
Hofmann 1999; Blei et al., 2004; Teh et al. 2004
Object categorization
Dorko et al. 2004; Csurka et al. 2003; Sivic et al. 2005; Sudderth et al.
2005;
Natural scene categorization
Fei-Fei et al. 2005



Object → Bag of words
Analogy to documents

Of all the sensory impressions proceeding to the


brain, the visual experiences are the dominant
ones. Our perception of the world around us is
based essentially on the messages that reach the
brain from our eyes. For a long time it was thought
that the retinal image was transmitted point by
point to visual centers in the brain; the cerebral
cortex was a movie screen, so to speak, upon
which the image in the eye was projected. Through
the discoveries of Hubel and Wiesel we now know
that behind the origin of the visual perception in the
brain there is a considerably more complicated
course of events. By following the visual impulses
along their path to the various cell layers of the
optical cortex, Hubel and Wiesel have been able to
demonstrate that the message about the image
falling on the retina undergoes a step-wise analysis
in a system of nerve cells stored in columns. In this
system each cell has its specific function and is
responsible for a specific detail in the pattern of the
retinal image.
Analogy to documents

(figure: the same Hubel and Wiesel passage with its salient terms highlighted — sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical, nerve, image, Hubel, Wiesel — forming its bag of words)

Analogy to documents

(figure: a news passage on China's trade surplus with its salient terms highlighted — China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, value — a different bag of words)
Learning and recognition

(diagram — learning: feature detection & representation → codewords dictionary → image representation → category models (and/or) classifiers; recognition: feature detection & representation → image representation → category decision)

Representation

(diagram: 1. feature detection & representation → 2. codewords dictionary → 3. image representation)
1. Feature detection and representation

Regular grid
  Vogel et al. 2003
  Fei-Fei et al. 2005

Interest point detector
  Csurka et al. 2004
  Fei-Fei et al. 2005
  Sivic et al. 2005

Other methods
  Random sampling (Ullman et al. 2002)
  Segmentation-based patches (Barnard et al. 2003)
1. Feature detection and representation

Detect patches [Mikolajczyk and Schmid 02; Matas et al. 02; Sivic et al. 03] → Normalize patch → Compute SIFT descriptor [Lowe 99]

Slide credit: Josef Sivic

2. Codewords dictionary formation

Vector quantization (Slide credit: Josef Sivic)

(figure: image patch examples of codewords — Fei-Fei et al. 2005; Sivic et al. 2005)


3. Image representation

(figure: histogram of codeword frequencies)
Learning and Recognition

(diagram: codewords dictionary → category models (and/or) classifiers → category decision)
Nude Detection in Video
using Bag-of-Visual-Features
Representing Images as BoVF

1) Point selection
2) Point description
3) Vocabulary discovery
4) Cluster association
5) Histogram computation
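The five steps can be sketched with k-means clustering (a minimal illustration; random vectors stand in for HueSIFT descriptors, and all names are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 1-2) Point selection and description: random 32-D vectors stand in for
# HueSIFT descriptors extracted from 10 frames (50 points each).
descriptors = [rng.normal(size=(50, 32)) for _ in range(10)]

# 3) Vocabulary discovery: cluster all descriptors into k visual words.
k = 8
vocab = KMeans(n_clusters=k, n_init=10, random_state=0).fit(np.vstack(descriptors))

def bovf_histogram(desc):
    """4) Assign each descriptor to its nearest word; 5) count word frequencies."""
    words = vocab.predict(desc)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()
```

Each image (or frame) is thus reduced to a fixed-length, normalized word-frequency histogram, regardless of how many points it contains.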



Detecting Nudity from Videos
Lopes et al. 2009 - Approach

(pipeline: Frames Extraction → HueSIFT Description → PCA Transformation → BoVF Construction → SVM Classification → Voting Scheme → Video Classification)
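The final voting stage can be sketched as follows (an assumption: a simple majority vote over per-frame SVM labels, which is one natural reading of the pipeline):

```python
from collections import Counter

def classify_video(frame_labels):
    """Majority vote: the video label is the most frequent frame-level label."""
    return Counter(frame_labels).most_common(1)[0][0]
```

For example, `classify_video(["nude", "non-nude", "nude"])` yields `"nude"`.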



Experimental Results
Database

179 segments

Nude sequences

From 43 to 308 frames long

Non-nude sequences

From 50 to 278 frames long

http://www.dcc.ufmg.br/nudeDetection



Experimental Results
Experimental Setup

Frame selection - sample rates of 1/15 and 1/30

BoVF creation

10,000 random HueSIFT points

Vocabulary size: 60, 120 and 180

Linear SVM classifier

10^-5 ≤ C ≤ 10^5

30 runs of 5-fold cross-validation





General Comments

93.2% of correct classification

Identified misclassification causes:
  Background colors close to skin tones
  Presence of large skin areas
  Illumination variations



An Evaluation on Color Invariant
Based Spatiotemporal Features for
Action Recognition
Fillipe Dias Moreira de Souza, DCC/UFMG
Prof. Dr. Guillermo Cámara Chávez, DECOMP/UFOP (co-advisor)
Prof. Dr. Arnaldo de Albuquerque Araújo, DCC/UFMG (advisor)
Theory on STIP

Extension of the 2D Harris corner detector to 3D (space + time);

Invariant to multiple scales.

(figure: interest points localized along the x, y and time axes)
Theory on STIP

Given a sequence of images f(x, y, t) = f(·), its scale-space representation is obtained by convolving f with a Gaussian kernel;

From this representation, a multiscale 3D Harris matrix (the spatiotemporal extension of the 2D Harris matrix) is computed, and a saliency measurement selects the interest points.
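For reference, the standard Laptev (2005) formulation of these quantities can be written as follows (an assumption here, since the original equations are not on the slides as extracted):

```latex
% Scale-space representation of the image sequence f(x, y, t)
L(\cdot;\, \sigma^2, \tau^2) = g(\cdot;\, \sigma^2, \tau^2) * f(\cdot)

% Multiscale 3D Harris matrix: Gaussian-smoothed second-moment matrix
\mu = g(\cdot;\, s\sigma^2, s\tau^2) *
\begin{pmatrix}
L_x^2    & L_x L_y & L_x L_t \\
L_x L_y  & L_y^2   & L_y L_t \\
L_x L_t  & L_y L_t & L_t^2
\end{pmatrix}

% Saliency measurement: interest points are local maxima of H
H = \det(\mu) - k \, \operatorname{trace}^3(\mu)
```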
Theory on STIP (simulating STIP processing)

STIP detects spatiotemporal interest points (IPs) in the input video;

STIP describes each IP in terms of histograms of optical flow (HoF) and histograms of oriented gradients (HoG);

The two descriptors of each IP are concatenated: IPn = IPnHoG + IPnHoF;

The set of concatenated descriptors, one per interest point, represents the video content (STIP output).
Color-based STIPs

HueSTIP: STIP also uses color histograms to describe local spatiotemporal features;

The hue histogram has 36 bins;

For each IP, the HoG, HoF and hue histograms are concatenated: IPn = IPnHoG + IPnHoF + IPnHue.
Experimental Results
Violence Detection in Videos
Using Spatio-Temporal Features
Fillipe D. M. de Souza, Guillermo C. Chávez,
Eduardo Valle Jr. and Arnaldo de A. Araújo

1 NPDI/DCC/ICEx/UFMG
2 ICEB/UFOP
3 IC/UNICAMP


What's new in this work?

(figure: example interest actions — elbow up, left arm moving, knee lifting)


What's new in this work?
Data representation:
  Use of local spatio-temporal features as representative patterns for the interest actions;
  Encoded as bags of visual words.



The Method Overview


Preliminary Step

Temporal Video Segmentation:

Input Video → Video Partitioning → Video Shots (input to STIP)

Frame Sample Extraction → sample of frames (input to SIFT)




Local interest point detection in space-time

Spatio-temporal domain features (Laptev 2005)

(figure: interest points in x, y and time)


Data representation by visual words

(figures: a video showing an agitated crowd; a video showing two people fighting)




Feature Clustering
Discovering visual patterns (so-called visual words)



Data representation by visual words

Visual codebook formed by 8 visual words:
  4 spatio-temporal descriptions (visual words) of non-violent content
  4 spatio-temporal descriptions (visual words) of violent content





Data representation by visual
words
Sets of spatio-temporal features computed for each video → bags-of-visual-words representation







Learning a classifier to detect the
interest event
Support Vector Machine (SVM, LibSVM, Ref.)
  Linear kernel (simple, fast, good results);
  LibSVM default parameters;
  Binary problem (only 2 classes);
  Good generalization capability;
  Input: annotated bags of visual words;
  Output: classification model.
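This stage can be sketched with scikit-learn's linear SVM standing in for LibSVM (a sketch under stated assumptions: synthetic, separable 8-bin histograms stand in for real STIP bags of visual words):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic, linearly separable bags of visual words: "violent" videos put
# their mass on words 4-7, "non-violent" ones on words 0-3 (made-up data).
rng = np.random.default_rng(0)
violent = rng.dirichlet(np.ones(8) * 0.3, size=40) + [0, 0, 0, 0, .5, .5, .5, .5]
nonviolent = rng.dirichlet(np.ones(8) * 0.3, size=40) + [.5, .5, .5, .5, 0, 0, 0, 0]

X = np.vstack([violent, nonviolent])
y = [1] * 40 + [0] * 40                 # 1 = violent, 0 = non-violent

clf = LinearSVC().fit(X, y)             # linear kernel, default parameters
```

A linear kernel keeps training and prediction fast, which matters when every shot of every video must be classified.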



Non-violent Dataset Sample



Violent Dataset Sample



Experimental Results
Codebook 100 - SIFT

(%)           Violent   Non-Violent
Violent        80.09      19.91
Non-Violent    14.65      85.35

Codebook 100 - STIP

(%)           Violent   Non-Violent
Violent        99.54       0.46
Non-Violent     0.00     100.00



Comments

We proposed a method with:
  Local spatio-temporal features
  Visual words representation
  Supervised learning with SVM
  Comparison between spatio-temporal and spatial features



Conclusions

We are just starting to solve this problem

Many challenges still remain

Efficiency is also an important issue



References

Valle, E., Avila, S., da Luz Jr., A., de Souza, F., Coelho, M., and de A. Araújo, A. (2012). Detecting undesirable content in video social networks (to appear). In Brazilian Symposium on Information and Computer System Security.

Zuo, H., Hu, W., and Wu, O. (2010). Patch-based skin color detection and its application to pornography image filtering. In International Conference on World Wide Web (WWW), pages 1227--1228.



References
Deselaers, T., Pimenidis, L., and Ney, H. (2008). Bag-of-visual-words models for adult image classification and filtering. In International Conference on Pattern Recognition (ICPR), pages 1--4.

Endeshaw, T., Garcia, J., and Jakobsson, A. (2008). Fast classification of indecent video by low
complexity repetitive motion detection. In IEEE Applied Imagery Pattern Recognition Workshop,
pages 1--7.

Fleck, M., Forsyth, D. A., and Bregler, C. (1996). Finding naked people. In European Conference on
Computer Vision (ECCV), pages 593--602.

Forsyth, D. A. and Fleck, M. M. (1996). Identifying nude pictures. In IEEE Workshop on Applications of
Computer Vision (WACV), pages 103--108.

Forsyth, D. A. and Fleck, M. M. (1997). Body plans. In Conference on Computer Vision and Pattern
Recognition (CVPR), pages 678--683.



References
Forsyth, D. A. and Fleck, M. M. (1999). Automatic detection of human nudes. International Journal of Computer Vision (IJCV), 32(1):63--77.

Jansohn, C., Ulges, A., and Breuel, T. M. (2009). Detecting pornographic video content by combining image features with motion information. In ACM International Conference on Multimedia, pages 601--604.

Jones, M. J. and Rehg, J. M. (2002). Statistical color models with application to skin detection.
International Journal of Computer Vision (IJCV), 46(1):81--96.

Lopes, A., Avila, S., Peixoto, A., Oliveira, R., Coelho, M., and de A. Araújo, A. (2009a). Nude detection in video using bag-of-visual-features. In Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI), pages 224--231.

Lopes, A., Avila, S., Peixoto, A., Oliveira, R., and de A. Araújo, A. (2009b). A bag-of-features approach based on Hue-SIFT descriptor for nude detection. In European Signal Processing Conference (EUSIPCO), pages 1552--1556.



References

Rowley, H. A., Jing, Y., and Baluja, S. (2006). Large scale image-based adult-content filtering. In International Conference on Computer Vision Theory and Applications (VISAPP), pages 290--296.

Sivic, J. and Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In International Conference on Computer Vision (ICCV), volume 2.

Steel, C. (2012). The mask-sift cascading classifier for pornography detection. In World Congress on
Internet Security (WorldCIS), pages 139--142.

Tong, X., Duan, L., Xu, C., Tian, Q., Hanqing, L., Wang, J., and Jin, J. (2005). Periodicity detection of local motion. In IEEE International Conference on Multimedia and Expo (ICME), pages 650--653.

Ulges, A. and Stahl, A. (2011). Automatic detection of child pornography using color visual words. In International Conference on Multimedia Retrieval (ICMR), pages 1--6.



Obrigado!
Thank you!
