
Principled Asymmetric Boosting Approaches

to Rapid Training and Classification

in Face Detection

presented by

Minh-Tri Pham
Ph.D. Candidate and Research Associate
Nanyang Technological University, Singapore
Outline

• Motivation
• Contributions
– Automatic Selection of Asymmetric Goal
– Fast Weak Classifier Learning
– Online Asymmetric Boosting
– Generalization Bounds on the Asymmetric Error

• Future Work
• Summary
Problem

Application

Application
Face recognition

Application
3D face reconstruction

Application
Camera auto-focusing

Application
Windows face logon
• Lenovo Veriface Technology
Appearance-based Approach
• Scan the image with a probe window patch (x,y,s)
  – at different positions and scales
  – Binary-classify each patch into
    • face, or
    • non-face
• Desired output state:
  – (x,y,s) containing a face

Most popular approach:
• Viola-Jones '01-'04, Li et al. '02, Wu et al. '04, Brubaker et al. '04, Liu et al. '04, Xiao et al. '04,
• Bourdev-Brandt '05, Mita et al. '05, Huang et al. '05-'07, Wu et al. '05, Grabner et al. '05-'07,
• and many more
Appearance-based Approach
• Statistics:
  – 6,950,440 patches in a 320x240 image
  – P(face) < 10^-5

• Key requirement:
  – A very fast classifier
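
The patch count follows from exhaustively enumerating window positions and scales. A minimal Python sketch of this enumeration; the 24x24 base window, scale step and shift step are illustrative assumptions, not the exact settings behind the 6,950,440 figure:

  # Sketch: count probe windows (x, y, s) in a 320x240 image.
  # Base window size and step sizes are assumed values for illustration only.
  def count_patches(img_w=320, img_h=240, base=24, scale_step=1.25, shift=1.0):
      total, scale = 0, 1.0
      while base * scale <= min(img_w, img_h):
          w = int(round(base * scale))                 # current window side
          step = max(1, int(round(shift * scale)))     # shift grows with the scale
          nx = (img_w - w) // step + 1
          ny = (img_h - w) // step + 1
          total += nx * ny
          scale *= scale_step
      return total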
A very fast classifier
• Cascade of non-face rejectors:

  [Diagram: F1 –pass→ F2 –pass→ … –pass→ FN –pass→ face;
   a reject at any stage sends the patch to non-face]
A very fast classifier
• Cascade of non-face rejectors:

  [Diagram: F1 –pass→ F2 –pass→ … –pass→ FN –pass→ face;
   a reject at any stage sends the patch to non-face]

• F1, F2, …, FN : asymmetric classifiers
  – FRR(Fk) ≈ 0
  – FAR(Fk) as small as possible (e.g. 0.5 – 0.8)
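
A minimal sketch of how such a cascade is evaluated on one patch; the stage classifiers are assumed to be callables that return pass/reject:

  # Sketch: evaluate a cascade of non-face rejectors on one patch.
  # A patch is labelled "face" only if it passes every stage F_k;
  # most patches are rejected by the early stages, which keeps detection fast.
  def cascade_classify(patch, stages):
      for F in stages:            # stages = [F1, F2, ..., FN]
          if not F(patch):        # F(patch) -> True (pass) or False (reject)
              return "non-face"
      return "face"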
Non-face Rejector
• A strong combination of weak classifiers:

  [Diagram: F1 computes f1,1 + f1,2 + … + f1,K and tests "> θ?";
   yes → pass, no → reject]

  – f1,1, f1,2, …, f1,K : weak classifiers
  – θ : threshold
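
A minimal sketch of one rejector under these definitions, i.e. pass iff f1,1(x) + … + f1,K(x) > θ:

  # Sketch: one non-face rejector F1 as a thresholded sum of weak scores.
  def stage_pass(x, weak_classifiers, theta):
      score = sum(f(x) for f in weak_classifiers)   # each f returns a real-valued score
      return score > theta                          # True = pass, False = reject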
Boosting

  [Diagram: Weak Classifier Learner 1 (Stage 1) feeds Weak Classifier Learner 2 (Stage 2);
   examples wrongly classified at one stage are emphasized for the next stage,
   correctly classified examples are de-emphasized.
   Legend: negative example, positive example]
Asymmetric Boosting
• Weight positives γ times more than negatives

  [Diagram: Weak Classifier Learner 1 (Stage 1) and Weak Classifier Learner 2 (Stage 2),
   as in the boosting diagram above, with positive examples weighted γ times more.
   Legend: negative example, positive example]
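
A minimal Python sketch of the asymmetric weighting; the exponential update is an AdaBoost-style assumption for illustration, since the slide does not fix a particular boosting variant:

  import numpy as np

  # Sketch: positives start with gamma times the weight of negatives, and
  # misclassified examples are up-weighted after each weak classifier.
  def init_weights(labels, gamma):
      # labels: +1 for face, -1 for non-face
      w = np.where(labels == 1, gamma, 1.0)
      return w / w.sum()

  def reweight(w, labels, predictions, alpha):
      # predictions in {-1, +1}; alpha is the weak classifier's vote weight
      w = w * np.exp(-alpha * labels * predictions)   # wrong -> larger, right -> smaller
      return w / w.sum()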
Non-face Rejector
• A strong combination of weak classifiers:

  [Diagram: F1 computes f1,1 + f1,2 + … + f1,K and tests "> θ?";
   yes → pass, no → reject]

  – f1,1, f1,2, …, f1,K : weak classifiers
  – θ : threshold
Weak classifier
• Classify a Haar-like feature value

  [Diagram: input patch → feature value v → classify v → score]
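
A minimal sketch of such a weak classifier, assuming the feature extractor, threshold and polarity were already chosen by the weak-classifier learner:

  # Sketch: a decision stump on a single Haar-like feature value.
  def weak_classify(patch, feature, threshold, polarity=1):
      v = feature(patch)                    # scalar Haar-like feature value
      return 1 if polarity * v > polarity * threshold else -1   # +1 face-like, -1 not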
Main issues
• Requires too much intervention from experts
A very fast classifier
• Cascade of non-face rejectors:

  [Diagram: F1 –pass→ F2 –pass→ … –pass→ FN –pass→ face;
   a reject at any stage sends the patch to non-face]

• F1, F2, …, FN : asymmetric classifiers
  – FRR(Fk) ≈ 0
  – FAR(Fk) as small as possible (e.g. 0.5 – 0.8)

How to choose bounds for FRR(Fk) and FAR(Fk)?
Asymmetric Boosting
• Weight positives γ times more than negatives

How to choose γ?

  [Diagram: Weak Classifier Learner 1 (Stage 1) and Weak Classifier Learner 2 (Stage 2).
   Legend: negative example, positive example]
Non-face Rejector
• A strong combination of weak classifiers:

  [Diagram: F1 computes f1,1 + f1,2 + … + f1,K and tests "> θ?";
   yes → pass, no → reject]

  – f1,1, f1,2, …, f1,K : weak classifiers
  – θ : threshold

How to choose θ?
Main issues
• Requires too much intervention from experts

• Very long learning time


Weak classifier
• Classify a Haar-like feature value

  [Diagram: input patch → feature value v → classify v → score]

  10 minutes to learn a weak classifier

Main issues
• Requires too much intervention from experts

• Very long learning time
  – To learn a face detector (≈ 4,000 weak classifiers):
    • 4,000 * 10 minutes ≈ 1 month

• Only suitable for objects with small shape variance
Outline

• Motivation
• Contributions
– Automatic Selection of Asymmetric Goal
– Fast Weak Classifier Learning
– Online Asymmetric Boosting
– Generalization Bounds on the Asymmetric Error

• Future Work
• Summary
Detection with Multi-exit
Asymmetric Boosting

CVPR’08 poster paper:


Minh-Tri Pham, Viet-Dung D. Hoang, and Tat-Jen Cham. Detection with Multi-exit Asymmetric
Boosting. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), Anchorage, Alaska, 2008.
• Won Travel Grant Award
Problem overview
• Common appearance-based approach:

  [Diagram: F1 –pass→ F2 –pass→ … –pass→ FN –pass→ object;
   a reject at any stage sends the patch to non-object]

  – F1, F2, …, FN : boosted classifiers

  [Diagram: F1 computes f1,1 + f1,2 + … + f1,K and tests "> θ?";
   yes → pass, no → reject]

  – f1,1, f1,2, …, f1,K : weak classifiers
  – θ : threshold
Objective

  [Diagram: F1 computes f1,1 + f1,2 + … + f1,K and tests "> θ?";
   yes → pass, no → reject]

  F1(x) = sign( Σ_{i=1..K} f1,i(x) − θ )

• Find f1,1, f1,2, …, f1,K, and θ such that:
  – FAR(F1) ≤ α0
  – FRR(F1) ≤ β0
  – K is minimized (K is proportional to F1's evaluation time)
Existing trends (1)

Idea
• For k from 1 until convergence:
  – Let F1(x) = sign( Σ_{i=1..k} f1,i(x) )
  – Learn the new weak classifier f1,k(x):
      fˆ1,k = argmin over f1,k of [ FAR(F1) + FRR(F1) ]
  – Let F1(x) = sign( Σ_{i=1..k} f1,i(x) − θ )
  – Adjust θ to see if we can achieve FAR(F1) ≤ α0 and FRR(F1) ≤ β0 (sketched below):
    • Break the loop if such a θ exists

Issues
• Weak classifiers are sub-optimal w.r.t. the training goal.
• Too many weak classifiers are required in practice.
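
A minimal sketch of the threshold-adjustment step in this trend; the boosted scores and labels are assumed to be precomputed on the training set:

  import numpy as np

  # Sketch: sweep theta over the boosted scores and stop when both targets are met.
  def find_threshold(scores, labels, alpha0, beta0):
      # scores[n] = sum_i f_{1,i}(x_n); labels: +1 face, -1 non-face
      for theta in np.unique(scores):
          passed = scores > theta
          far = np.mean(passed[labels == -1])     # negatives wrongly passed
          frr = np.mean(~passed[labels == 1])     # positives wrongly rejected
          if far <= alpha0 and frr <= beta0:
              return theta                        # such a theta exists: break the loop
      return None                                 # otherwise keep adding weak classifiers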
Existing trends (2)

Idea
• For k from 1 until convergence:
  – Let F1(x) = sign( Σ_{i=1..k} f1,i(x) )
  – Learn the new weak classifier f1,k(x):
      fˆ1,k = argmin over f1,k of [ FAR(F1) + γ·FRR(F1) ]
  – Break the loop if FAR(F1) ≤ α0 and FRR(F1) ≤ β0

Pros
• Reduces FRR at the cost of increasing FAR
  – acceptable for cascades
• Fewer weak classifiers

Cons
• How to choose γ?
• Much longer training time

Solution to the con
• Trial and error:
  – choose γ such that K is minimized.
Our solution

Learn every weak classifier f1,k(x) using the same asymmetric goal:

  fˆ1,k = argmin over f1,k of [ FAR(F1) + γ·FRR(F1) ],   where γ = α0 / β0.

Why?
Because…
• Consider two desired bounds (or targets) for learning a boosted classifier FM(x):
  – Exact bound:        FAR(FM) ≤ α0  and  FRR(FM) ≤ β0                 (1)
  – Conservative bound: FAR(FM) + (α0/β0)·FRR(FM) ≤ α0                  (2)
• (2) is more conservative than (1) because (2) => (1).

  [Figure: two ROC plots (FAR vs. FRR) showing the exact bound and the conservative bound.
   Left panel, weak classifiers learned with γ = 1: the operating points H1, H2, … need on
   the order of 200 weak classifiers to reach the conservative bound. Right panel, γ = α0/β0:
   the operating points Q1, Q2, … reach it after roughly 40 weak classifiers.]

At γ = α0/β0, for every new weak classifier learned, the ROC operating
point moves the fastest toward the conservative bound.
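
A minimal sketch of weak-classifier selection under this fixed asymmetric goal; eval_far_frr is an assumed helper that returns FAR(F1) and FRR(F1) with a candidate weak classifier appended:

  # Sketch: pick the candidate weak classifier that minimizes FAR + gamma * FRR,
  # with gamma fixed to alpha0 / beta0 (the stage's error targets).
  def select_weak_classifier(candidates, eval_far_frr, alpha0, beta0):
      gamma = alpha0 / beta0
      best, best_cost = None, float("inf")
      for f in candidates:
          far, frr = eval_far_frr(f)
          cost = far + gamma * frr
          if cost < best_cost:
              best, best_cost = f, cost
      return best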
Implication

  [Diagram: F1 computes f1,1 + f1,2 + … + f1,K and tests "> θ?";
   yes → pass, no → reject]

  F1(x) = sign( Σ_{i=1..K} f1,i(x) − θ )

• When the ROC operating point lies within the conservative bound:
  – FAR(F1) ≤ α0
  – FRR(F1) ≤ β0
  – Conditions met, therefore θ = 0 suffices (no threshold adjustment is needed).
Multi-exit Boosting
A method to train a single boosted classifier with multiple exit nodes:

  [Diagram: a single chain of weak classifiers f1, f2, …, f8 ending in "object"; nested boosted
   classifiers F1, F2, F3 share this chain, and an exit node at the end of each makes a
   pass/reject decision (reject → non-obj).
   Legend: a plain node is a weak classifier; an exit node is a weak classifier followed by a
   decision to continue or reject.]

• Features:
  • Weak classifiers are trained with the same goal: γ = α0/β0.
  • Every pass/reject decision is guaranteed with FAR ≤ α0 and FRR ≤ β0.
  • The classifier is a cascade.
  • The score is propagated from one node to the next (see the sketch below).

• Main advantages:
  • Weak classifiers are learned (approximately) optimally.
  • No training of multiple boosted classifiers.
  • Far fewer weak classifiers are needed than in traditional cascades.
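
A minimal sketch of evaluating a multi-exit boosted classifier on one patch; the rejection thresholds at the exit nodes are assumed given (θ = 0 under the scheme above):

  # Sketch: a single chain of weak classifiers with rejection checks at exit nodes.
  # The accumulated score is propagated from one node to the next.
  def multi_exit_classify(x, weak_classifiers, exit_thresholds):
      # exit_thresholds: {index of exit node: rejection threshold}
      score = 0.0
      for i, f in enumerate(weak_classifiers):
          score += f(x)
          if i in exit_thresholds and score <= exit_thresholds[i]:
              return "non-object"        # rejected at this exit node
      return "object"                    # passed every exit node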
Results
Goal (γ) vs. number of weak classifiers (K)

• Toy problem: learn a (single-exit) boosted classifier F for classifying face/non-face
  patches such that FAR(F) < 0.8 and FRR(F) < 0.01
  – Empirically best goal: γ ∈ [10, 100].
  – Our method chooses: γ = 0.8 / 0.01 = 80.

• Similar results were obtained in tests with other desired error rates.
Ours vs. Others (in Face Detection)

• Fast StatBoost is used as the base method for fast-training a weak classifier.

  Method                           No. of weak classifiers   No. of exit nodes   Total training time
  Viola-Jones [3]                  4,297                     32                  6h20m
  Viola-Jones [4]                  3,502                     29                  4h30m
  Boosting chain [7]               959                       22                  2h10m
  Nested cascade [5]               894                       20                  2h
  Soft cascade [1]                 4,871                     4,871               6h40m
  Dynamic cascade [6]              1,172                     1,172               2h50m
  Multi-exit Asymmetric Boosting   575                       24                  1h20m
Ours vs. Others (in Face Detection)
• MIT+CMU Frontal Face Test set:
Conclusion

• Multi-exit Asymmetric Boosting trains every weak classifier approximately optimally.
  – Better accuracy
  – Much fewer weak classifiers
  – Significantly reduced training time

• No more trial-and-error for training a boosted classifier
Outline

• Motivation
• Contributions
– Automatic Selection of Asymmetric Goal
– Fast Weak Classifier Learning
– Online Asymmetric Boosting
– Generalization Bounds on the Asymmetric Error

• Future Work
• Summary
Fast Training and Selection of
Haar-like Features using Statistics

ICCV’07 oral paper:


Minh-Tri Pham and Tat-Jen Cham. Fast Training and Selection of Haar Features using Statistics in
Boosting-based Face Detection. In Proc. International Conference on Computer Vision (ICCV), Rio de
Janeiro, Brazil, 2007.
• Won Travel Grant Award
• Won Second Prize, Best Student Paper in Year 2007 Award, Pattern Recognition and Machine
Intelligence Association (PREMIA), Singapore
Motivation

• Face detectors today
  – Real-time detection speed

  …but…

  – Weeks of training time
Why is Training so Slow?
A view of a face detector training algorithm

  for weak classifier m from 1 to M:
      update weights                      – O(N)
      for feature t from 1 to T:
          compute N feature values        – O(N)
          sort N feature values           – O(N log N)
          train feature classifier        – O(N)
      select best feature classifier      – O(T)

• Time complexity: O(MNT log N)
  – 15 ms to train a feature classifier
  – 10 minutes to train a weak classifier
  – 27 days to train a face detector

  Factor   Description                            Common value
  N        number of examples                     10,000
  M        number of weak classifiers in total    4,000 - 6,000
  T        number of Haar-like features           40,000
Why Should the Training Time be Improved?
• Tradeoff between time and generalization
  – E.g. training becomes 100 times slower if we increase both N and T by 10 times

• Trial and error to find key parameters for training
  – Much longer training time needed

• Online-learning face detectors have the same problem
Existing Approaches to Reduce the Training Time
• Sub-sample the Haar-like feature set
  – Simple, but loses generalization

• Use histograms and real-valued boosting (B. Wu et al. '04)
  – Pro: reduces O(MNT log N) to O(MNT)
  – Con: raises overfitting concerns:
    • Real AdaBoost is not known to be overfitting-resistant
    • A weak classifier may overfit if too many histogram bins are used

• Pre-compute the feature values' sorting orders (J. Wu et al. '07)
  – Pro: reduces O(MNT log N) to O(MNT)
  – Con: requires huge memory storage
    • For N = 10,000 and T = 40,000, a total of 800 MB is needed.
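      (One plausible accounting, assuming one 16-bit index per (example, feature) pair:
       10,000 * 40,000 * 2 bytes = 800 MB.)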
Why is Training so Slow?
A view of a face detector training algorithm

  for weak classifier m from 1 to M:
      update weights                      – O(N)
      for feature t from 1 to T:
          compute N feature values        – O(N)
          sort N feature values           – O(N log N)
          train feature classifier        – O(N)
      select best feature classifier      – O(T)

• Time complexity: O(MNT log N)
  – 15 ms to train a feature classifier
  – 10 min to train a weak classifier
  – 27 days to train a face detector

• Bottleneck:
  – At least O(NT) to train a weak classifier

• Can we avoid O(NT)?

  Factor   Description                            Common value
  N        number of examples                     10,000
  M        number of weak classifiers in total    4,000 - 6,000
  T        number of Haar-like features           40,000
Our Proposal

• Fast StatBoost: train feature classifiers using statistics rather than the input data
  – Con:
    • Less accurate
      … but not critical for a feature classifier
  – Pro:
    • Much faster training time:
      constant time instead of linear time
Fast StatBoost
• Train feature classifiers using statistics:
  – Assumption: the feature value v(t) is normally distributed given the face class c
  – Closed-form solution for the optimal threshold

  [Figure: two 1D Gaussians over the feature value, one for the Face class and one for the
   Non-face class, with the optimal threshold between them]

• Fast linear projection of the statistics of a window's integral image into 1D statistics
  of a feature value:

    μ(t) = m_J^T g(t),      σ(t)^2 = g(t)^T Σ_J g(t)

  – μ(t), σ(t)^2 : mean and variance of the feature value v(t)
  – J : random vector representing a window's integral image
  – m_J, Σ_J : mean vector and covariance matrix of J
  – g(t) : Haar-like feature, a sparse vector with fewer than 20 non-zero elements

  ⟹ constant time to train a feature classifier
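
A minimal sketch of training one feature classifier from these statistics. It assumes the per-class statistics (total weight, mean vector, covariance matrix of J) are available as a dict and places the threshold where the two weighted Gaussians intersect; the variable names and the numpy-array inputs are illustrative:

  import numpy as np

  # Sketch: project class-conditional integral-image statistics onto a sparse
  # Haar-like feature g(t), then solve for the Gaussian-intersection threshold.
  def train_feature_classifier(g_idx, g_val, stats):
      # g_idx, g_val: numpy arrays with the indices and values of g(t)'s non-zeros
      # stats[c] = (z_c, m_c, S_c): total weight, mean vector, covariance of class c
      params = {}
      for c in (+1, -1):
          z, m, S = stats[c]
          mu = g_val @ m[g_idx]                           # mean of v(t), O(1) w.r.t. N
          var = g_val @ S[np.ix_(g_idx, g_idx)] @ g_val   # variance of v(t)
          params[c] = (z, mu, max(var, 1e-12))
      (zp, mp, vp), (zn, mn, vn) = params[+1], params[-1]
      # Intersection of the two weighted Gaussians: a*v^2 + b*v + c0 = 0
      a = 0.5 * (1.0 / vn - 1.0 / vp)
      b = mp / vp - mn / vn
      c0 = 0.5 * (mn ** 2 / vn - mp ** 2 / vp) + np.log((zp / zn) * np.sqrt(vn / vp))
      roots = np.roots([a, b, c0]) if abs(a) > 1e-12 else np.array([-c0 / b])
      return np.real(roots)                               # candidate threshold(s)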


Fast StatBoost
• The integral image statistics are obtained directly from the weighted input data
  – Input: N training integral images and their current weights w(m):

      { (w1(m), J1, c1), (w2(m), J2, c2), …, (wN(m), JN, cN) }

  – For each class c we compute:
    • Sample total weight:       ẑ_c = Σ_{n: cn=c} wn(m)
    • Sample mean vector:        m̂_c = (1/ẑ_c) Σ_{n: cn=c} wn(m) Jn
    • Sample covariance matrix:  Σ̂_c = (1/ẑ_c) Σ_{n: cn=c} wn(m) Jn Jn^T − m̂_c m̂_c^T
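
A minimal sketch of extracting these weighted, class-conditional statistics; array shapes are illustrative, and in practice the covariance is computed with fast BLAS routines:

  import numpy as np

  # Sketch: weighted per-class statistics of the integral images at round m.
  def class_statistics(J, c, w):
      # J: (N, d) integral images (flattened); c: (N,) labels in {+1,-1}; w: (N,) weights
      stats = {}
      for label in (+1, -1):
          sel = (c == label)
          wj, Jc = w[sel], J[sel]
          z = wj.sum()                                        # sample total weight
          m = (wj[:, None] * Jc).sum(axis=0) / z              # sample mean vector
          S = (wj[:, None] * Jc).T @ Jc / z - np.outer(m, m)  # sample covariance matrix
          stats[label] = (z, m, S)
      return stats

These (z, m, S) triples are exactly what the projection step on the previous slide consumes.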
Fast StatBoost
A view of our face detector training algorithm

  for weak classifier m from 1 to M:
      …
      update weights                             – O(N)
      extract statistics of the integral image   – O(Nd²)
      for feature t from 1 to T:
          project statistics into 1D             – O(1)
          train feature classifier               – O(1)
      select best feature classifier             – O(T)
      …

• To train a weak classifier:
  – Extract the class-conditional integral image statistics
    • Time complexity: O(Nd²)
    • The factor d² is negligible because fast algorithms exist, hence in practice: O(N)
  – Train T feature classifiers by projecting the statistics into 1D
    • Time complexity: O(T)
  – Select the best feature classifier
    • Time complexity: O(T)

• Time complexity per weak classifier: O(N + T)

  Factor   Description                            Common value
  N        number of examples                     10,000
  M        number of weak classifiers in total    4,000 - 6,000
  T        number of Haar-like features           40,000
  d        number of pixels of a window           300-500
Experimental Results
• Setup
  – Intel Pentium IV 2.8 GHz
  – 19 feature types ⟹ 295,920 Haar-like features

  [Figure: the nineteen Haar-like feature types used in our experiments – edge, corner,
   diagonal line, line, and center-surround features]

• Time for extracting the statistics:
  – Main factor: the covariance matrices
  – GotoBLAS: 0.49 seconds per matrix

• Time for training T features:
  – 2.1 seconds

⟹ Total training time: 3.1 seconds per weak classifier with 300K features
  • Existing methods: up to 10 minutes with 40K features or fewer
Experimental Results
• Comparison with Fast AdaBoost (J. Wu et al. '07), the fastest known implementation of
  Viola-Jones' framework:

  [Plot: training time of a weak classifier (seconds) versus number of features T, from 0 to
   300,000, comparing Fast AdaBoost and Fast StatBoost]
Experimental Results
• Performance of a cascade:

  Method                    Total training time   Memory requirement
  Fast AdaBoost (T=40K)     13h 20m               800 MB
  Fast StatBoost (T=40K)    02h 13m               30 MB
  Fast StatBoost (T=300K)   03h 02m               30 MB

  [Figure: ROC curves of the final cascades for face detection]


Conclusions

• Fast StatBoost: uses statistics instead of the input data to train feature classifiers

• Time:
  – Reduces the face detector training time from up to a month to about 3 hours
  – Significant gains in both N and T with little increase in training time
    • Due to O(N+T) per weak classifier

• Accuracy:
  – Even better accuracy for the face detector
    • Due to many more Haar-like features being explored
Outline

• Motivation
• Contributions
– Automatic Selection of Asymmetric Goal
– Fast Weak Classifier Learning
– Online Asymmetric Boosting
– Generalization Bounds on the Asymmetric Error

• Future Work
• Summary
Weak classifier
• Cascade of non-face rejectors:
Outline

• Motivation
• Contributions
– Automatic Selection of Asymmetric Goal
– Fast Weak Classifier Learning
– Online Asymmetric Boosting
– Generalization Bounds on the Asymmetric Error

• Future Work
• Summary
Summary

• Online Asymmetric Boosting
  – Integrates asymmetric boosting with online learning

• Fast Training and Selection of Haar-like Features using Statistics
  – Dramatically reduces training time from weeks to a few hours

• Multi-exit Asymmetric Boosting
  – Approximately minimizes the number of weak classifiers
Thank You
