
# Regression-based Learning of Human Actions from Video using HOF-LBP Flow Patterns

Binu M Nair, Vijayan K Asari


## Objectives: To develop a human action recognition framework which

- Does not require sequence length normalization
- Can classify human actions from 10-15 frames (for real-time operation)
- Accounts for variation in the speed of an action

## Overview of proposed algorithm

- Define and extract suitable motion descriptors based on the optical flow at each frame.
- Using the extracted motion descriptors, define an action manifold for each class, capturing the variation of motion over the course of the sequence.
- Classify the test sequence using the learned neural networks.

## Proposed Methodology

1. Motion representation using Histograms of Oriented Flow and Local Binary Flow Patterns (HOF-LBP): a motion descriptor is computed from the optical flow at each frame of the video sequence.
2. Computation of a reduced posture space: an action manifold is computed for each action class using Principal Component Analysis (PCA).
3. Modeling of the action manifolds using Generalized Regression Neural Networks (GRNN).

## Motion Representation using Histogram of Flow Patterns

The HOF descriptor gives information about the extent of motion on a local scale and the direction of motion.

Algorithm:

1. Compute the optical flow (u, v) between consecutive frames at each location (x, y).
2. Compute the magnitude and direction images from the optical flow.
3. Divide them into blocks.
4. At each block, compute the histogram of flow: a weighted histogram of the flow direction, with the weights being the corresponding magnitudes.
5. Concatenate across blocks to get the HOF descriptor.

These are local distributions which change during the course of an action sequence.
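The per-block weighted histogram above can be sketched in numpy. The 5×5 grid matches the HOF(5,5) descriptor used later; the 9-bin direction quantization and the name `hof_descriptor` are illustrative assumptions, not the authors' exact parameters.

```python
import numpy as np

def hof_descriptor(u, v, grid=(5, 5), bins=9):
    """HOF sketch: per-block histogram of flow direction, weighted by
    flow magnitude, concatenated across blocks."""
    mag = np.hypot(u, v)                      # flow magnitude image
    ang = np.arctan2(v, u)                    # flow direction in [-pi, pi]
    h, w = u.shape
    bh, bw = h // grid[0], w // grid[1]
    desc = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            m = mag[i*bh:(i+1)*bh, j*bw:(j+1)*bw].ravel()
            a = ang[i*bh:(i+1)*bh, j*bw:(j+1)*bw].ravel()
            # magnitude-weighted histogram of direction for this block
            hist, _ = np.histogram(a, bins=bins, range=(-np.pi, np.pi), weights=m)
            desc.append(hist)
    return np.concatenate(desc)               # grid[0]*grid[1]*bins values

# uniform rightward flow: all weight lands in the bin containing angle 0
u = np.ones((50, 50)); v = np.zeros((50, 50))
d = hof_descriptor(u, v)
print(d.shape)  # (225,)
```

For this uniform flow field the descriptor sums to the total magnitude, since every pixel contributes its magnitude to exactly one direction bin.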

## Motion Representation using Local Binary Flow Patterns

The aim is to extract the relationship between the flow vectors in different regions of the body. This textural context can be extracted by applying the Local Binary Pattern encoding to the optical flow magnitude and direction images:

LBP_{P,R}(x_c, y_c) = Σ_{p=0}^{P−1} s(g_p − g_c) · 2^p,  where s(z) = 1 if z ≥ 0, else 0

Here g_c is the value at the center pixel and g_p are the values at the P sampled neighbors. A sampling grid of (P, R) = (16, 2) is used, where P refers to the number of neighbors and R to the radius of the neighborhood.

The concatenation of the HOF descriptor with the LBP histograms of the flow magnitude and direction constitutes the action feature:

Action Feature = HOF(5,5) + LBP(16,2) on magnitude + LBP(16,2) on direction
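A minimal sketch of the LBP encoding on a flow-magnitude image, with the (P, R) = (16, 2) sampling from the slide. The nearest-pixel sampling (instead of bilinear interpolation), the 64-bin code histogram, and the function names are simplifying assumptions.

```python
import numpy as np

def lbp_image(img, P=16, R=2):
    """LBP sketch: threshold P circular neighbors at radius R against the
    center pixel and pack the resulting bits into an integer code."""
    codes = np.zeros(img.shape, dtype=np.int64)
    for p in range(P):
        theta = 2 * np.pi * p / P
        dy = int(np.rint(-R * np.sin(theta)))   # nearest-pixel offsets
        dx = int(np.rint(R * np.cos(theta)))
        nb = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
        codes += (nb >= img).astype(np.int64) << p
    return codes

def lbp_histogram(img, P=16, R=2, bins=64):
    """Normalized histogram of LBP codes, used as the texture descriptor."""
    hist, _ = np.histogram(lbp_image(img, P, R), bins=bins, range=(0, 2**P))
    return hist / hist.sum()

mag = np.random.default_rng(0).random((40, 40))  # stand-in flow magnitude
h = lbp_histogram(mag)
print(h.shape)  # (64,)
```

Running the same encoder on the flow-direction image and concatenating both histograms with the HOF descriptor gives the combined action feature.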

## Computation of Reduced Posture Space using PCA

The aim is to perform regression analysis on the set of action features. The action features are the regressors/input variables to a regression function. The selection of the response/output variables should:

- Bring out the variations in the regressors with respect to time
- Be invariant to time: selecting time itself as the response is not a solution

[Figure: frames of a sequence plotted as points in a 2-D reduced posture space, from Frame 1 to Frame k.]

A multivariate time series of (regressor, response) pairs for each action class corresponds to an action manifold (the reduced posture space). The frames of an action sequence are then treated as points on a particular manifold.

One method to treat multivariate time-series data is Principal Component Analysis, or Empirical Orthogonal Function (EOF) analysis: the time series is represented as a linear combination of time-independent orthogonal basis functions (eigenvectors) with time-varying amplitudes (eigen coefficients).

## Computation of Reduced Posture Space using PCA for an Action Class

EOF analysis: let X = [x_1, x_2, …, x_N] be observed at times t_1, t_2, …, t_N. Then each observation can be written as

x(t_n) = Σ_{k=1}^{d} c_k(t_n) v_k

where the v_k are time-independent orthogonal basis functions (eigenvectors) and the c_k(t_n) are time-varying amplitudes (eigen coefficients). PCA of the N × D data matrix yields both.

Extending this to the motion feature set F = [f_1, f_2, …, f_N] of an action class having a total of N frames:

- We get time-independent basis functions, the eigenvectors V = [v_1, v_2, …, v_d]
- We get time-dependent coefficients C = [c_1, c_2, …, c_N]
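The EOF decomposition amounts to a PCA of the frame-by-feature matrix. A sketch via SVD, with a hypothetical function name (`eof_decompose`) and toy dimensions:

```python
import numpy as np

def eof_decompose(X, d):
    """EOF/PCA sketch: factor an N x D time series into d time-independent
    eigenvectors (basis functions) and N x d time-varying coefficients."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centred data: rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Xc_basis = Vt[:d].T      # (D, d) eigenvectors v_k
    C = Xc @ V                   # (N, d) coefficients c_k(t_n)
    return mean, V, C

rng = np.random.default_rng(1)
X = rng.random((60, 225))        # e.g. 60 frames of HOF-LBP features
mean, V, C = eof_decompose(X, d=10)
recon = mean + C @ V.T           # rank-10 approximation of the sequence
print(V.shape, C.shape)  # (225, 10) (60, 10)
```

Keeping all components reconstructs the sequence exactly; truncating to d components gives the reduced posture space.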

## Modeling of Action Manifolds using Generalized Regression Neural Networks

The GRNN is used to learn the functional mapping between the action features and the eigen coefficients for each action class.

- Based on the radial basis function network
- Fast, one-pass training scheme
- The number of pattern nodes depends on the number of training samples
- K-means clustering is applied before training to reduce the training sample size

Given the training pairs F = {f_n : 1 ≤ n ≤ N} and C = {c_n : 1 ≤ n ≤ N} for a class, clustering reduces these to a smaller set of representative (center, coefficient) pairs before the network is built.

## Modeling of Action Manifolds using Generalized Regression Neural Networks

If there are Q clusters obtained from the training pairs (f_n, c_n), the GRNN estimate of the coefficient vector for a test feature f is

ĉ(f) = [ Σ_{i=1}^{Q} c̄_i · exp(−D_i² / 2σ²) ] / [ Σ_{i=1}^{Q} exp(−D_i² / 2σ²) ],  with D_i² = (f − μ_i)ᵀ(f − μ_i)

where μ_i is the i-th cluster center, c̄_i is the mean coefficient vector of the i-th cluster, and σ is the smoothing parameter.

[Figure: GRNN architecture with an input layer, a pattern layer of Gaussian units exp(−D_i²/2σ²), and summation/output layers.]
## Classification of Test Sequence

Algorithm (testing):

1. Compute the HOF-LBP motion feature for each frame of the test sequence (partial: 15 frames, or full: 60-80 frames).
2. Project the test features onto the eigen basis of each action class.
3. Estimate the projections for each action class by applying the feature set to the trained GRNN model.
4. Correct class = argmin_m ‖projections_m − estimations_m‖

The model which gives the smallest difference between the eigen-space projections and the GRNN estimations determines the correct class.
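The argmin rule above can be sketched with a toy two-class setup; the `models` structure and per-class estimator callables are assumptions for illustration, not the authors' interface.

```python
import numpy as np

def classify(test_feats, models):
    """Pick the class minimizing ||eigen-space projections - GRNN estimates||.
    `models` maps class name -> (mean, V, estimator); names are hypothetical."""
    best, best_dist = None, np.inf
    for name, (mean, V, estimator) in models.items():
        proj = (test_feats - mean) @ V               # projections onto eigen basis
        est = np.vstack([estimator(f) for f in test_feats])
        dist = np.linalg.norm(proj - est)            # projection/estimate mismatch
        if dist < best_dist:
            best, best_dist = name, dist
    return best, best_dist

# toy demo: the "walk" estimator reproduces the projections exactly,
# the "run" estimator is biased, so "walk" wins the argmin
D, d = 4, 2
mean, V = np.zeros(D), np.eye(D)[:, :d]
feats = np.ones((3, D))
models = {
    "walk": (mean, V, lambda f: f @ V),
    "run":  (mean, V, lambda f: f @ V + 5.0),
}
label, dist = classify(feats, models)
print(label)  # walk
```

Because each class's GRNN is trained only on its own manifold, its estimates track the projections well only when the test sequence belongs to that class.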

## Results (Weizmann database: 10 actions, 9 individuals)

Testing strategy: leave 9 sequences out of training.

Actions: a1 bend, a2 jump in place (pjump), a3 jumping jack (jack), a4 jump forward, a5 run, a6 side, a7 wave1, a8 skip, a9 wave2, a10 walk.

[Figure: confusion matrix over the ten action classes. Diagonal accuracies range from 75% to 100%, with most classes at or near 100%; each off-diagonal confusion (22%, 12%, 5%, 21%, 1%) is concentrated in a single other class.]
## Results (Weizmann robustness sequences)

Robustness sequences: normal walk, walking with a bag, with a dog, with a briefcase, with a pole, with a skirt, knees up, limping, moonwalk, and legs occluded.

Distance of each test sequence to the best and second-best matching action models (the slide also reports the median distance to all actions):

| Test Seq | 1st Best | 2nd Best |
|---|---|---|
| Swinging a bag | 3.094 | 3.9390 |
| Carrying a briefcase | 2.170 | 3.6418 |
| Walking with a dog | 2.338 | 3.8249 |
| Knees up | 3.270 | 4.0910 |
| (label missing) | 2.922 | 3.8217 |
| (label missing) | 2.132 | 3.6633 |
| Occluded legs | 2.594 | 2.6249 |
| (label missing) | 2.624 | 3.6338 |
| Walking with a pole | 2.945 | 3.8801 |
| Walking with a skirt | 2.159 | 3.5401 |

## Results (walking sequences at varying viewing directions)

Best and second-best matching action models, with the median distance to all actions, for walking sequences at each viewing direction:

| Test Seq | 1st Best | 2nd Best | Median to all actions |
|---|---|---|---|
| Dir. 0 | Walk 1.7606 | Skip 2.3435 | 3.6550 |
| Dir. 9 | Walk 1.6975 | Skip 2.3138 | 3.6286 |
| Dir. 18 | Walk 1.7342 | Skip 2.2600 | 3.6066 |
| Dir. 27 | Walk 1.7314 | Skip 2.3225 | 3.5359 |
| Dir. 36 | Walk 1.7721 | Skip 2.3296 | 3.5050 |
| Dir. 45 | Walk 1.7750 | Skip 2.2099 | 3.4217 |
| Dir. 54 | Walk 1.7796 | Skip 2.1169 | 3.3996 |
| Dir. 63 | Walk 1.9683 | Skip 2.3181 | 3.2095 |
| Dir. 72 | Walk 2.2900 | Skip 2.4930 | 3.3460 |
| Dir. 81 | Side 2.6917 | Side 2.8095 | 3.7771 |

## Conclusions/Inferences

- Motion information is used directly for recognition.
- Misclassifications are not spread across action classes; each occurs between at most two actions.
- The method does not rely heavily on the silhouette mask; only an approximate mask is required.
- Actions can be identified from a set of 10-15 frames.
- The method can be used in a higher-level activity recognition system where scores for the primitive actions are available.

Thank You
Questions?