
Principled Asymmetric Boosting Approaches

to Rapid Training and Classification

in Face Detection

presented by

Minh-Tri Pham
Ph.D. Candidate and Research Associate
Nanyang Technological University, Singapore
Outline

• Motivation
• Contributions
– Automatic Selection of Asymmetric Goal
– Fast Weak Classifier Learning
– Online Asymmetric Boosting
– Generalization Bounds on the Asymmetric Error

• Future Work
• Summary
Problem

Application

Application
Face recognition

Application
3D face reconstruction

Application
Camera auto-focusing

Application
Windows face logon
• Lenovo Veriface Technology
Appearance-based Approach
• Scan the image with a probe window patch (x,y,s)
  – at different positions and scales
  – Binary-classify each patch into
    • face, or
    • non-face
• Desired output state:
  – (x,y,s) containing a face

Most popular approach:
• Viola-Jones '01-'04, Li et al. '02, Wu et al. '04, Brubaker et al. '04, Liu et al. '04, Xiao et al. '04,
• Bourdev-Brandt '05, Mita et al. '05, Huang et al. '05-'07, Wu et al. '05, Grabner et al. '05-'07,
• and many more
Appearance-based Approach
• Statistics:
  – 6,950,440 patches in a 320x240 image
  – P(face) < 10^-5

• Key requirement:
  – A very fast classifier
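
The patch count follows from exhaustively enumerating window positions and scales. A minimal Python sketch of this enumeration; the 24x24 base window, scale step and shift step are illustrative assumptions, not the exact settings behind the 6,950,440 figure:

  # Sketch: count probe windows (x, y, s) in a 320x240 image.
  # Base window size and step sizes are assumed values for illustration only.
  def count_patches(img_w=320, img_h=240, base=24, scale_step=1.25, shift=1.0):
      total, scale = 0, 1.0
      while base * scale <= min(img_w, img_h):
          w = int(round(base * scale))                 # current window side
          step = max(1, int(round(shift * scale)))     # shift grows with the scale
          nx = (img_w - w) // step + 1
          ny = (img_h - w) // step + 1
          total += nx * ny
          scale *= scale_step
      return total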
A very fast classifier
• Cascade of non-face rejectors:

  [Diagram: F1 –pass→ F2 –pass→ … –pass→ FN –pass→ face;
   a reject at any stage sends the patch to non-face]
A very fast classifier
• Cascade of non-face rejectors:

  [Diagram: F1 –pass→ F2 –pass→ … –pass→ FN –pass→ face;
   a reject at any stage sends the patch to non-face]

• F1, F2, …, FN : asymmetric classifiers
  – FRR(Fk) ≈ 0
  – FAR(Fk) as small as possible (e.g. 0.5 – 0.8)
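
A minimal sketch of how such a cascade is evaluated on one patch; the stage classifiers are assumed to be callables that return pass/reject:

  # Sketch: evaluate a cascade of non-face rejectors on one patch.
  # A patch is labelled "face" only if it passes every stage F_k;
  # most patches are rejected by the early stages, which keeps detection fast.
  def cascade_classify(patch, stages):
      for F in stages:            # stages = [F1, F2, ..., FN]
          if not F(patch):        # F(patch) -> True (pass) or False (reject)
              return "non-face"
      return "face"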
Non-face Rejector
• A strong combination of weak classifiers:

  [Diagram: F1 computes f1,1 + f1,2 + … + f1,K and tests "> θ?";
   yes → pass, no → reject]

  – f1,1, f1,2, …, f1,K : weak classifiers
  – θ : threshold
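
A minimal sketch of one rejector under these definitions, i.e. pass iff f1,1(x) + … + f1,K(x) > θ:

  # Sketch: one non-face rejector F1 as a thresholded sum of weak scores.
  def stage_pass(x, weak_classifiers, theta):
      score = sum(f(x) for f in weak_classifiers)   # each f returns a real-valued score
      return score > theta                          # True = pass, False = reject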
Boosting

  [Diagram: Weak Classifier Learner 1 (Stage 1) feeds Weak Classifier Learner 2 (Stage 2);
   examples wrongly classified at one stage are emphasized for the next stage,
   correctly classified examples are de-emphasized.
   Legend: negative example, positive example]
Asymmetric Boosting
• Weight positives γ times more than negatives

  [Diagram: Weak Classifier Learner 1 (Stage 1) and Weak Classifier Learner 2 (Stage 2),
   as in the boosting diagram above, with positive examples weighted γ times more.
   Legend: negative example, positive example]
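
A minimal Python sketch of the asymmetric weighting; the exponential update is an AdaBoost-style assumption for illustration, since the slide does not fix a particular boosting variant:

  import numpy as np

  # Sketch: positives start with gamma times the weight of negatives, and
  # misclassified examples are up-weighted after each weak classifier.
  def init_weights(labels, gamma):
      # labels: +1 for face, -1 for non-face
      w = np.where(labels == 1, gamma, 1.0)
      return w / w.sum()

  def reweight(w, labels, predictions, alpha):
      # predictions in {-1, +1}; alpha is the weak classifier's vote weight
      w = w * np.exp(-alpha * labels * predictions)   # wrong -> larger, right -> smaller
      return w / w.sum()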
Non-face Rejector
• A strong combination of weak classifiers:

  [Diagram: F1 computes f1,1 + f1,2 + … + f1,K and tests "> θ?";
   yes → pass, no → reject]

  – f1,1, f1,2, …, f1,K : weak classifiers
  – θ : threshold
Weak classifier
• Classify a Haar-like feature value

  [Diagram: input patch → feature value v → classify v → score]
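
A minimal sketch of such a weak classifier, assuming the feature extractor, threshold and polarity were already chosen by the weak-classifier learner:

  # Sketch: a decision stump on a single Haar-like feature value.
  def weak_classify(patch, feature, threshold, polarity=1):
      v = feature(patch)                    # scalar Haar-like feature value
      return 1 if polarity * v > polarity * threshold else -1   # +1 face-like, -1 not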
Main issues
• Requires too much intervention from experts
A very fast classifier
• Cascade of non-face rejectors:

  [Diagram: F1 –pass→ F2 –pass→ … –pass→ FN –pass→ face;
   a reject at any stage sends the patch to non-face]

• F1, F2, …, FN : asymmetric classifiers
  – FRR(Fk) ≈ 0
  – FAR(Fk) as small as possible (e.g. 0.5 – 0.8)

How to choose bounds for FRR(Fk) and FAR(Fk)?
Asymmetric Boosting
• Weight positives γ times more than negatives

How to choose γ?

  [Diagram: Weak Classifier Learner 1 (Stage 1) and Weak Classifier Learner 2 (Stage 2).
   Legend: negative example, positive example]
Non-face Rejector
• A strong combination of weak classifiers:

  [Diagram: F1 computes f1,1 + f1,2 + … + f1,K and tests "> θ?";
   yes → pass, no → reject]

  – f1,1, f1,2, …, f1,K : weak classifiers
  – θ : threshold

How to choose θ?
Main issues
• Requires too much intervention from experts

• Very long learning time


Weak classifier
• Classify a Haar-like feature value

  [Diagram: input patch → feature value v → classify v → score]

  10 minutes to learn a weak classifier

Main issues
• Requires too much intervention from experts

• Very long learning time
  – To learn a face detector (≈ 4,000 weak classifiers):
    • 4,000 * 10 minutes ≈ 1 month

• Only suitable for objects with small shape variance
Outline

• Motivation
• Contributions
– Automatic Selection of Asymmetric Goal
– Fast Weak Classifier Learning
– Online Asymmetric Boosting
– Generalization Bounds on the Asymmetric Error

• Future Work
• Summary
Detection with Multi-exit
Asymmetric Boosting

CVPR’08 poster paper:


Minh-Tri Pham, Viet-Dung D. Hoang, and Tat-Jen Cham. Detection with Multi-exit Asymmetric
Boosting. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), Anchorage, Alaska, 2008.
• Won Travel Grant Award
Problem overview
• Common appearance-based approach:

  [Diagram: F1 –pass→ F2 –pass→ … –pass→ FN –pass→ object;
   a reject at any stage sends the patch to non-object]

  – F1, F2, …, FN : boosted classifiers

  [Diagram: F1 computes f1,1 + f1,2 + … + f1,K and tests "> θ?";
   yes → pass, no → reject]

  – f1,1, f1,2, …, f1,K : weak classifiers
  – θ : threshold
Objective

  [Diagram: F1 computes f1,1 + f1,2 + … + f1,K and tests "> θ?";
   yes → pass, no → reject]

  F1(x) = sign( Σ_{i=1..K} f1,i(x) − θ )

• Find f1,1, f1,2, …, f1,K, and θ such that:
  – FAR(F1) ≤ α0
  – FRR(F1) ≤ β0
  – K is minimized (K is proportional to F1's evaluation time)
Existing trends (1)

Idea
• For k from 1 until convergence:
  – Let F1(x) = sign( Σ_{i=1..k} f1,i(x) )
  – Learn the new weak classifier f1,k(x):
      fˆ1,k = argmin over f1,k of [ FAR(F1) + FRR(F1) ]
  – Let F1(x) = sign( Σ_{i=1..k} f1,i(x) − θ )
  – Adjust θ to see if we can achieve FAR(F1) ≤ α0 and FRR(F1) ≤ β0 (sketched below):
    • Break the loop if such a θ exists

Issues
• Weak classifiers are sub-optimal w.r.t. the training goal.
• Too many weak classifiers are required in practice.
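
A minimal sketch of the threshold-adjustment step in this trend; the boosted scores and labels are assumed to be precomputed on the training set:

  import numpy as np

  # Sketch: sweep theta over the boosted scores and stop when both targets are met.
  def find_threshold(scores, labels, alpha0, beta0):
      # scores[n] = sum_i f_{1,i}(x_n); labels: +1 face, -1 non-face
      for theta in np.unique(scores):
          passed = scores > theta
          far = np.mean(passed[labels == -1])     # negatives wrongly passed
          frr = np.mean(~passed[labels == 1])     # positives wrongly rejected
          if far <= alpha0 and frr <= beta0:
              return theta                        # such a theta exists: break the loop
      return None                                 # otherwise keep adding weak classifiers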
Existing trends (2)

Idea
• For k from 1 until convergence:
  – Let F1(x) = sign( Σ_{i=1..k} f1,i(x) )
  – Learn the new weak classifier f1,k(x):
      fˆ1,k = argmin over f1,k of [ FAR(F1) + γ·FRR(F1) ]
  – Break the loop if FAR(F1) ≤ α0 and FRR(F1) ≤ β0

Pros
• Reduces FRR at the cost of increasing FAR
  – acceptable for cascades
• Fewer weak classifiers

Cons
• How to choose γ?
• Much longer training time

Solution to the con
• Trial and error:
  – choose γ such that K is minimized.
Our solution

Learn every weak classifier f1,k(x) using the same asymmetric goal:

  fˆ1,k = argmin over f1,k of [ FAR(F1) + γ·FRR(F1) ],   where γ = α0 / β0.

Why?
Because…
• Consider two desired bounds (or targets) for learning a boosted classifier FM(x):
  – Exact bound:        FAR(FM) ≤ α0  and  FRR(FM) ≤ β0                 (1)
  – Conservative bound: FAR(FM) + (α0/β0)·FRR(FM) ≤ α0                  (2)
• (2) is more conservative than (1) because (2) => (1).

  [Figure: two ROC plots (FAR vs. FRR) showing the exact bound and the conservative bound.
   Left panel, weak classifiers learned with γ = 1: the operating points H1, H2, … need on
   the order of 200 weak classifiers to reach the conservative bound. Right panel, γ = α0/β0:
   the operating points Q1, Q2, … reach it after roughly 40 weak classifiers.]

At γ = α0/β0, for every new weak classifier learned, the ROC operating
point moves the fastest toward the conservative bound.
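
A minimal sketch of weak-classifier selection under this fixed asymmetric goal; eval_far_frr is an assumed helper that returns FAR(F1) and FRR(F1) with a candidate weak classifier appended:

  # Sketch: pick the candidate weak classifier that minimizes FAR + gamma * FRR,
  # with gamma fixed to alpha0 / beta0 (the stage's error targets).
  def select_weak_classifier(candidates, eval_far_frr, alpha0, beta0):
      gamma = alpha0 / beta0
      best, best_cost = None, float("inf")
      for f in candidates:
          far, frr = eval_far_frr(f)
          cost = far + gamma * frr
          if cost < best_cost:
              best, best_cost = f, cost
      return best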
Implication

  [Diagram: F1 computes f1,1 + f1,2 + … + f1,K and tests "> θ?";
   yes → pass, no → reject]

  F1(x) = sign( Σ_{i=1..K} f1,i(x) − θ )

• When the ROC operating point lies within the conservative bound:
  – FAR(F1) ≤ α0
  – FRR(F1) ≤ β0
  – Conditions met, therefore θ = 0 suffices (no threshold adjustment is needed).
Multi-exit Boosting
A method to train a single boosted classifier with multiple exit nodes:

  [Diagram: a single chain of weak classifiers f1, f2, …, f8 ending in "object"; nested boosted
   classifiers F1, F2, F3 share this chain, and an exit node at the end of each makes a
   pass/reject decision (reject → non-obj).
   Legend: a plain node is a weak classifier; an exit node is a weak classifier followed by a
   decision to continue or reject.]

• Features:
  • Weak classifiers are trained with the same goal: γ = α0/β0.
  • Every pass/reject decision is guaranteed with FAR ≤ α0 and FRR ≤ β0.
  • The classifier is a cascade.
  • The score is propagated from one node to the next (see the sketch below).

• Main advantages:
  • Weak classifiers are learned (approximately) optimally.
  • No training of multiple boosted classifiers.
  • Far fewer weak classifiers are needed than in traditional cascades.
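
A minimal sketch of evaluating a multi-exit boosted classifier on one patch; the rejection thresholds at the exit nodes are assumed given (θ = 0 under the scheme above):

  # Sketch: a single chain of weak classifiers with rejection checks at exit nodes.
  # The accumulated score is propagated from one node to the next.
  def multi_exit_classify(x, weak_classifiers, exit_thresholds):
      # exit_thresholds: {index of exit node: rejection threshold}
      score = 0.0
      for i, f in enumerate(weak_classifiers):
          score += f(x)
          if i in exit_thresholds and score <= exit_thresholds[i]:
              return "non-object"        # rejected at this exit node
      return "object"                    # passed every exit node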
Results
Goal (γ) vs. number of weak classifiers (K)

• Toy problem: learn a (single-exit) boosted classifier F for classifying face/non-face
  patches such that FAR(F) < 0.8 and FRR(F) < 0.01
  – Empirically best goal: γ ∈ [10, 100].
  – Our method chooses: γ = 0.8 / 0.01 = 80.

• Similar results were obtained in tests with other desired error rates.
Ours vs. Others (in Face Detection)

• Fast StatBoost is used as the base method for fast-training a weak classifier.

  Method                           No. of weak classifiers   No. of exit nodes   Total training time
  Viola-Jones [3]                  4,297                     32                  6h20m
  Viola-Jones [4]                  3,502                     29                  4h30m
  Boosting chain [7]               959                       22                  2h10m
  Nested cascade [5]               894                       20                  2h
  Soft cascade [1]                 4,871                     4,871               6h40m
  Dynamic cascade [6]              1,172                     1,172               2h50m
  Multi-exit Asymmetric Boosting   575                       24                  1h20m
Ours vs. Others (in Face Detection)
• MIT+CMU Frontal Face Test set:
Conclusion

• Multi-exit Asymmetric Boosting trains every weak classifier approximately optimally.
  – Better accuracy
  – Much fewer weak classifiers
  – Significantly reduced training time

• No more trial-and-error for training a boosted classifier
Outline

• Motivation
• Contributions
– Automatic Selection of Asymmetric Goal
– Fast Weak Classifier Learning
– Online Asymmetric Boosting
– Generalization Bounds on the Asymmetric Error

• Future Work
• Summary
Fast Training and Selection of
Haar-like Features using Statistics

ICCV’07 oral paper:


Minh-Tri Pham and Tat-Jen Cham. Fast Training and Selection of Haar Features using Statistics in
Boosting-based Face Detection. In Proc. International Conference on Computer Vision (ICCV), Rio de
Janeiro, Brazil, 2007.
• Won Travel Grant Award
• Won Second Prize, Best Student Paper in Year 2007 Award, Pattern Recognition and Machine
Intelligence Association (PREMIA), Singapore
Motivation

• Face detectors today
  – Real-time detection speed

  …but…

  – Weeks of training time
Why is Training so Slow?
A view of a face detector training algorithm

  for weak classifier m from 1 to M:
      update weights                      – O(N)
      for feature t from 1 to T:
          compute N feature values        – O(N)
          sort N feature values           – O(N log N)
          train feature classifier        – O(N)
      select best feature classifier      – O(T)

• Time complexity: O(MNT log N)
  – 15 ms to train a feature classifier
  – 10 minutes to train a weak classifier
  – 27 days to train a face detector

  Factor   Description                            Common value
  N        number of examples                     10,000
  M        number of weak classifiers in total    4,000 - 6,000
  T        number of Haar-like features           40,000
Why Should the Training Time be Improved?
• Tradeoff between time and generalization
  – E.g. training becomes 100 times slower if we increase both N and T by 10 times

• Trial and error to find key parameters for training
  – Much longer training time needed

• Online-learning face detectors have the same problem
Existing Approaches to Reduce the Training Time
• Sub-sample the Haar-like feature set
  – Simple, but loses generalization

• Use histograms and real-valued boosting (B. Wu et al. '04)
  – Pro: reduces O(MNT log N) to O(MNT)
  – Con: raises overfitting concerns:
    • Real AdaBoost is not known to be overfitting-resistant
    • A weak classifier may overfit if too many histogram bins are used

• Pre-compute the feature values' sorting orders (J. Wu et al. '07)
  – Pro: reduces O(MNT log N) to O(MNT)
  – Con: requires huge memory storage
    • For N = 10,000 and T = 40,000, a total of 800 MB is needed.
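      (One plausible accounting, assuming one 16-bit index per (example, feature) pair:
       10,000 * 40,000 * 2 bytes = 800 MB.)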
Why is Training so Slow?
A view of a face detector training algorithm

  for weak classifier m from 1 to M:
      update weights                      – O(N)
      for feature t from 1 to T:
          compute N feature values        – O(N)
          sort N feature values           – O(N log N)
          train feature classifier        – O(N)
      select best feature classifier      – O(T)

• Time complexity: O(MNT log N)
  – 15 ms to train a feature classifier
  – 10 min to train a weak classifier
  – 27 days to train a face detector

• Bottleneck:
  – At least O(NT) to train a weak classifier

• Can we avoid O(NT)?

  Factor   Description                            Common value
  N        number of examples                     10,000
  M        number of weak classifiers in total    4,000 - 6,000
  T        number of Haar-like features           40,000
Our Proposal

• Fast StatBoost: train feature classifiers using statistics rather than the input data
  – Con:
    • Less accurate
      … but not critical for a feature classifier
  – Pro:
    • Much faster training time:
      constant time instead of linear time
Fast StatBoost
• Train feature classifiers using statistics:
  – Assumption: the feature value v(t) is normally distributed given the face class c
  – Closed-form solution for the optimal threshold

  [Figure: two 1D Gaussians over the feature value, one for the Face class and one for the
   Non-face class, with the optimal threshold between them]

• Fast linear projection of the statistics of a window's integral image into 1D statistics
  of a feature value:

    μ(t) = m_J^T g(t),      σ(t)^2 = g(t)^T Σ_J g(t)

  – μ(t), σ(t)^2 : mean and variance of the feature value v(t)
  – J : random vector representing a window's integral image
  – m_J, Σ_J : mean vector and covariance matrix of J
  – g(t) : Haar-like feature, a sparse vector with fewer than 20 non-zero elements

  ⟹ constant time to train a feature classifier
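
A minimal sketch of training one feature classifier from these statistics. It assumes the per-class statistics (total weight, mean vector, covariance matrix of J) are available as a dict and places the threshold where the two weighted Gaussians intersect; the variable names and the numpy-array inputs are illustrative:

  import numpy as np

  # Sketch: project class-conditional integral-image statistics onto a sparse
  # Haar-like feature g(t), then solve for the Gaussian-intersection threshold.
  def train_feature_classifier(g_idx, g_val, stats):
      # g_idx, g_val: numpy arrays with the indices and values of g(t)'s non-zeros
      # stats[c] = (z_c, m_c, S_c): total weight, mean vector, covariance of class c
      params = {}
      for c in (+1, -1):
          z, m, S = stats[c]
          mu = g_val @ m[g_idx]                           # mean of v(t), O(1) w.r.t. N
          var = g_val @ S[np.ix_(g_idx, g_idx)] @ g_val   # variance of v(t)
          params[c] = (z, mu, max(var, 1e-12))
      (zp, mp, vp), (zn, mn, vn) = params[+1], params[-1]
      # Intersection of the two weighted Gaussians: a*v^2 + b*v + c0 = 0
      a = 0.5 * (1.0 / vn - 1.0 / vp)
      b = mp / vp - mn / vn
      c0 = 0.5 * (mn ** 2 / vn - mp ** 2 / vp) + np.log((zp / zn) * np.sqrt(vn / vp))
      roots = np.roots([a, b, c0]) if abs(a) > 1e-12 else np.array([-c0 / b])
      return np.real(roots)                               # candidate threshold(s)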


Fast StatBoost
• The integral image statistics are obtained directly from the weighted input data
  – Input: N training integral images and their current weights w(m):

      { (w1(m), J1, c1), (w2(m), J2, c2), …, (wN(m), JN, cN) }

  – For each class c we compute:
    • Sample total weight:       ẑ_c = Σ_{n: cn=c} wn(m)
    • Sample mean vector:        m̂_c = (1/ẑ_c) Σ_{n: cn=c} wn(m) Jn
    • Sample covariance matrix:  Σ̂_c = (1/ẑ_c) Σ_{n: cn=c} wn(m) Jn Jn^T − m̂_c m̂_c^T
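
A minimal sketch of extracting these weighted, class-conditional statistics; array shapes are illustrative, and in practice the covariance is computed with fast BLAS routines:

  import numpy as np

  # Sketch: weighted per-class statistics of the integral images at round m.
  def class_statistics(J, c, w):
      # J: (N, d) integral images (flattened); c: (N,) labels in {+1,-1}; w: (N,) weights
      stats = {}
      for label in (+1, -1):
          sel = (c == label)
          wj, Jc = w[sel], J[sel]
          z = wj.sum()                                        # sample total weight
          m = (wj[:, None] * Jc).sum(axis=0) / z              # sample mean vector
          S = (wj[:, None] * Jc).T @ Jc / z - np.outer(m, m)  # sample covariance matrix
          stats[label] = (z, m, S)
      return stats

These (z, m, S) triples are exactly what the projection step on the previous slide consumes.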
Fast StatBoost
A view of our face detector training algorithm

  for weak classifier m from 1 to M:
      …
      update weights                             – O(N)
      extract statistics of the integral image   – O(Nd²)
      for feature t from 1 to T:
          project statistics into 1D             – O(1)
          train feature classifier               – O(1)
      select best feature classifier             – O(T)
      …

• To train a weak classifier:
  – Extract the class-conditional integral image statistics
    • Time complexity: O(Nd²)
    • The factor d² is negligible because fast algorithms exist, hence in practice: O(N)
  – Train T feature classifiers by projecting the statistics into 1D
    • Time complexity: O(T)
  – Select the best feature classifier
    • Time complexity: O(T)

• Time complexity per weak classifier: O(N + T)

  Factor   Description                            Common value
  N        number of examples                     10,000
  M        number of weak classifiers in total    4,000 - 6,000
  T        number of Haar-like features           40,000
  d        number of pixels of a window           300-500
Experimental Results
• Setup
  – Intel Pentium IV 2.8 GHz
  – 19 feature types ⟹ 295,920 Haar-like features

  [Figure: the nineteen Haar-like feature types used in our experiments – edge, corner,
   diagonal line, line, and center-surround features]

• Time for extracting the statistics:
  – Main factor: the covariance matrices
  – GotoBLAS: 0.49 seconds per matrix

• Time for training T features:
  – 2.1 seconds

⟹ Total training time: 3.1 seconds per weak classifier with 300K features
  • Existing methods: up to 10 minutes with 40K features or fewer
Experimental Results
• Comparison with Fast AdaBoost (J. Wu et al. '07), the fastest known implementation of
  Viola-Jones' framework:

  [Plot: training time of a weak classifier (seconds) versus number of features T, from 0 to
   300,000, comparing Fast AdaBoost and Fast StatBoost]
Experimental Results
• Performance of a cascade:

  Method                    Total training time   Memory requirement
  Fast AdaBoost (T=40K)     13h 20m               800 MB
  Fast StatBoost (T=40K)    02h 13m               30 MB
  Fast StatBoost (T=300K)   03h 02m               30 MB

  [Figure: ROC curves of the final cascades for face detection]


Conclusions

• Fast StatBoost: uses statistics instead of the input data to train feature classifiers

• Time:
  – Reduces the face detector training time from up to a month to about 3 hours
  – Significant gains in both N and T with little increase in training time
    • Due to O(N+T) per weak classifier

• Accuracy:
  – Even better accuracy for the face detector
    • Due to many more Haar-like features being explored
Outline

• Motivation
• Contributions
– Automatic Selection of Asymmetric Goal
– Fast Weak Classifier Learning
– Online Asymmetric Boosting
– Generalization Bounds on the Asymmetric Error

• Future Work
• Summary
Weak classifier
• Cascade of non-face rejectors:
Outline

• Motivation
• Contributions
– Automatic Selection of Asymmetric Goal
– Fast Weak Classifier Learning
– Online Asymmetric Boosting
– Generalization Bounds on the Asymmetric Error

• Future Work
• Summary
Summary

• Online Asymmetric Boosting
  – Integrates asymmetric boosting with online learning

• Fast Training and Selection of Haar-like Features using Statistics
  – Dramatically reduces training time from weeks to a few hours

• Multi-exit Asymmetric Boosting
  – Approximately minimizes the number of weak classifiers
Thank You
