
Learning Hierarchical Invariant Spatio-Temporal Features for Human Action and Activity Recognition

Binu M Nair, Vijayan K Asari

07/08/2014

Introduction

Applications of activity/action recognition:

Gaming (Kinect)

Autonomous visual control of fighter jets through air-crew hand gestures

Research objectives

To detect and recognize harmful activities of individuals of interest from a set/pair of surveillance cameras at long range.

Motivation: security personnel monitoring a crowded environment and locating suspicious activities

The security personnel create a temporary signature of each person in the scene (type of clothing, body shape, etc.)

They identify the action of each person (walking, running, etc.)

They locate the individual performing a suspicious action and then closely observe what he or she is doing (from the joint movements, etc.)

Introduction

An automated system performing these tasks requires four components:

Automatic pedestrian unique-ID tagger

Knowing roughly what each person looks like, as a security officer would

Human action recognition

Recognizing what action each person performs: walking, running, bending, etc.

Automatic detection and tracking of specific body joints

Closely examining what a particular individual (performing a suspicious action) is doing

Inference of the performed activity through joint-trajectory analysis based on context

E.g., bending down to place a suitcase, pick up a box, or tie a shoe lace

Motivation

Need a real-time system:

Recognizes an action or activity from 15-20 frames of a streaming video

Does not depend on the initialization of action/gait-cycle states (starting/ending points of an action cycle)

Is invariant to the speed of motion

Applications

Air-crew hand-gesture recognition for autonomous visual control of fighter jets

Deciding whether to follow a person based on their activity in surveillance

Typical Data Flow for a Generic Action Recognition System

[Block diagram: Video -> Feature Extraction -> Action Segmentation -> Action Learning -> Action Model Database -> Action Classification]

Feature extraction: posture/motion cues (hierarchical invariant features)

Action segmentation: segmenting out action instances consistent with the training set

Action learning and classification: learn statistical models to classify new feature observations (based on PCA and Generalized Regression Neural Networks)
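The PCA-GRNN classification step can be sketched as follows. This is an illustrative sketch only, not the authors' exact model: the GRNN here is the standard Nadaraya-Watson kernel-regression form, and the number of components and the smoothing parameter sigma are arbitrary choices for the example.

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA on training features X (n_samples x n_dims)."""
    mean = X.mean(axis=0)
    # SVD of the centered data gives the principal directions in Vt
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project features onto the learned PCA subspace."""
    return (X - mean) @ components.T

def grnn_predict(X_train, y_train, x, sigma=0.5):
    """Generalized Regression Neural Network: a kernel-weighted
    average of the training targets, with Gaussian weights."""
    d2 = np.sum((X_train - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.dot(w, y_train) / (np.sum(w) + 1e-12)
```

In a one-model-per-action-class setup, one such PCA-GRNN pair would be trained per class, and a new feature observation assigned to the class whose model reconstructs or regresses it best.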

Feature Extraction and Feature Fusion

[Block diagram: Input Frame -> Masked Region -> Optical Flow (magnitude/direction) -> Hierarchical Histogram of Oriented Flow (one HOF over the full region N, plus four HOFs over the N/2 sub-regions) + Quantized Local Binary Pattern on Optical Flow + R-Transform -> Feature Fusion -> Action Feature]
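The hierarchical HOF idea, i.e., one flow histogram over the whole region followed by histograms over its quadrants, can be sketched as below. This is a minimal sketch under assumed parameters: the bin count, number of pyramid levels, and normalization are illustrative and may differ from the paper's actual configuration.

```python
import numpy as np

def hof(flow_x, flow_y, bins=8):
    """Histogram of oriented flow: bin flow directions, weighted by magnitude."""
    mag = np.hypot(flow_x, flow_y)
    ang = np.arctan2(flow_y, flow_x) % (2 * np.pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
    s = hist.sum()
    return hist / s if s > 0 else hist

def hierarchical_hof(flow_x, flow_y, levels=2, bins=8):
    """Concatenate HOFs over a spatial pyramid: the whole region (N),
    then its four quadrants (N/2), and so on per level."""
    feats = []
    for level in range(levels):
        k = 2 ** level
        for fx_rows, fy_rows in zip(np.array_split(flow_x, k, axis=0),
                                    np.array_split(flow_y, k, axis=0)):
            for fx, fy in zip(np.array_split(fx_rows, k, axis=1),
                              np.array_split(fy_rows, k, axis=1)):
                feats.append(hof(fx, fy, bins))
    return np.concatenate(feats)
```

With 2 levels and 8 bins this yields (1 + 4) x 8 = 40 elements; the 3-level, 140-element HHOF mentioned later would use a different bin/level configuration.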

Assumption: HHOF, LBFP, and RT are independent of each other.

They can therefore be concatenated one after the other to form the complete feature vector (as in feature fusion for biometric systems).
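Under the independence assumption, the fusion step is just concatenation. A minimal sketch, using the feature dimensions given on the feature-selection slide (the random vectors stand in for real descriptors):

```python
import numpy as np

# Stand-in per-frame descriptors; dimensions follow the feature-set slide
# (3-level HHOF: 140, 2-level LBFP: 295, 2-level R-Transform: 180).
hhof = np.random.rand(140)
lbfp = np.random.rand(295)
rt = np.random.rand(180)

# Since the three cues are assumed independent, fusion is a simple
# concatenation into one action feature vector of 140+295+180 = 615 elements.
action_feature = np.concatenate([hhof, lbfp, rt])
```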

Feature Selection

Feature set: 3-level HHOF (140 elements), 2-level LBFP (295 elements), 2-level R-Transform (180 elements): total feature set of 615 elements

Using the full set over-fits the regression model for each action class: the model is tuned to irrelevant and redundant feature elements, lowering accuracy.

Methodology: Fast Correlation-Based Feature Selection (FCBF)

Identify relevant features with large correlation values

Remove redundant features and choose a subset of features

Correlation measure based on information theory:

Symmetrical Uncertainty (SU) between two random variables X and Y:
SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y))

H(X): entropy of X; IG(X|Y): information gained about X from the knowledge provided by Y
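The SU measure used by FCBF can be computed directly from its definition. A small sketch for discrete-valued features (real-valued feature elements would first need to be discretized):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(X) of a discrete sequence, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), where
    IG(X|Y) = H(X) - H(X|Y) is the information gain."""
    hx, hy = entropy(x), entropy(y)
    # H(X|Y) = H(X, Y) - H(Y), with the joint entropy taken over pairs
    hxy = entropy(list(zip(x, y)))
    ig = hx - (hxy - hy)
    denom = hx + hy
    return 2.0 * ig / denom if denom > 0 else 0.0
```

SU is 1 for perfectly dependent variables and 0 for independent ones, which is what lets FCBF rank features against the class label and against each other on the same scale.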

Algorithm (Training / Testing)

RESULTS

Weizmann Dataset

10 different actions performed by 9 different persons

Low-resolution video at 30 fps

Static background

Testing strategy: leave 10 out (the sequences corresponding to one person)

Partial sequences: 15 frames with an overlap of 10 frames
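The partial-sequence windowing (15 frames, 10-frame overlap, hence a stride of 5) can be sketched as:

```python
def partial_sequences(frames, window=15, overlap=10):
    """Split a frame sequence into overlapping partial sequences,
    e.g. 15-frame windows with a 10-frame overlap (stride 5)."""
    stride = window - overlap
    return [frames[i:i + window]
            for i in range(0, len(frames) - window + 1, stride)]
```

Each window is classified independently, which is what removes the dependence on knowing where an action cycle starts or ends.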

Robustness Test (Test for Deformity)

[Sample frames: normal walk, with bag, with briefcase, with dog, knees up, limping, moonwalk, legs occluded, with pole, with skirt]

Test Seq               1st Best      2nd Best      Median to all actions
Swinging a bag         Walk  2.508   Skip  3.094   3.939
Carrying a briefcase   Walk  1.866   Skip  2.170   3.641
Walking with a dog     Walk  1.806   Skip  2.338   3.824
Knees up               Walk  2.894   Side  3.270   4.091
Limping man            Walk  2.224   Skip  2.922   3.821
Sleepwalking           Walk  1.892   Skip  2.132   3.663
Occluded legs          Walk  1.883   Skip  2.594   2.624
Normal walk            Walk  1.886   Skip  2.624   3.633
Occluded by a pole     Walk  2.149   Skip  2.945   3.880
Walking in a skirt     Walk  1.855   Skip  2.159   3.540

Cambridge Hand Gesture Dataset

9 different hand gestures

Different combinations of shape and motion

5 different illumination conditions

KTH Action Dataset

6 human actions

25 subjects

4 different scenarios

600 sequences divided into 2391 subsequences

Low resolution: 160x120 at 25 fps

11/10/2014    Binu M Nair

Results on 4 sets using the proposed feature set

Results on all sets with STIP features

UCF Sports Dataset

High resolution: 720x480

200 video sequences

Contains 9 actions

Challenges:

Complex and varying backgrounds

Wide range of scenes and viewpoint variations

Tested on 8 actions: dive, golf swing, lift, ride, run, skate, swing, and walk

Tested with a window size of 15 frames and an overlap of 10


Future Work in Action Recognition

Testing on the UCF ARG dataset

Multi-view human action dataset

Set of actions: boxing, carrying, clapping, digging, jogging, open-close trunk, running, throwing, walking, waving

Challenges:

Different resolutions across cameras

Different kinds of features

Thank You
Questions?