
HAR-Depth

Human Action Recognition Using Sequential Learning and Depth Estimated History Images
Presented by
Sarthak Anil (TCR18CS053)
Guided by
Prof. Rahamathulla K
APJ Abdul Kalam Technological University

Department of Computer Science and Engineering
Government Engineering College, Thrissur
Thrissur - 680009
Introduction

The HAR paradigm faces challenges in recognizing actions efficiently due to the
inter-class similarities and intra-class variations among the action classes. The
temporal relationship in an action video is an important cue for recognizing the
action type.
The HAR paradigm can be made more effective if depth information is available.

This seminar explores estimating depth from action frames and using it in
conjunction with a sequential learning network to achieve higher accuracy.
Objectives

● Estimate depth information for RGB action frames.


● Model the temporal information between the action frames.
● Reduce overfitting when training on DHI images.
● Fuse the learning scores from both networks to produce the final recognition score.
● Analyse the results.
Method

The work is divided into two streams: a sequential learning stream and a shape
learning stream.

● Deep Bidirectional LSTM (DBiLSTM) Network for HAR


● Depth History Image (DHI)
● Transfer Learning and Data Augmentation for DHI
Method
Deep Bidirectional LSTM (DBiLSTM) Network for HAR

● Combination of two LSTM cells in forward and backward directions.


● Input is the collection of features extracted from each action frame.
○ AlexNet
● Three BiLSTM layers are placed in series, with dropout in each layer (see the sketch below).
● LSTM learns long-term sequences by
○ overcoming the vanishing gradient problem
○ regulating the cell state through non-linear gating units: the input gate,
output gate, and forget gate.
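A minimal sketch of this stack, assuming PyTorch. The hidden sizes (150/125/100) follow the setup slide later in this seminar; the feature dimension (4096-d AlexNet features), dropout probability, and class count are illustrative.

```python
import torch
import torch.nn as nn

class DBiLSTM(nn.Module):
    """Three stacked bidirectional LSTM layers with dropout, fed with
    per-frame CNN features (e.g. AlexNet fc7 vectors)."""
    def __init__(self, feat_dim=4096, hidden=(150, 125, 100), num_classes=6, p_drop=0.5):
        super().__init__()
        dims = [feat_dim] + [2 * h for h in hidden]   # x2: forward + backward outputs
        self.layers = nn.ModuleList(
            nn.LSTM(dims[i], hidden[i], batch_first=True, bidirectional=True)
            for i in range(3)
        )
        self.dropout = nn.Dropout(p_drop)
        self.fc = nn.Linear(2 * hidden[-1], num_classes)

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        for lstm in self.layers:
            x, _ = lstm(x)
            x = self.dropout(x)
        return self.fc(x[:, -1])           # classify from the last time step
```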
DBiLSTM
[Figure: basic memory cell structure of LSTM]
Depth History Image (DHI)

● Used to describe the shape of an action


● Depth information extracted using medium transmission map
○ DehazeNet
● Estimated depth frames are projected onto a single X-Y plane to
calculate the DHI
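A minimal sketch of these two steps, assuming the standard haze model behind DehazeNet (t = e^(−βd), so d = −ln t / β) and a per-pixel maximum projection over time; the exact depth scaling and accumulation rule should follow the HAR-Depth paper.

```python
import numpy as np

def depth_from_transmission(t, beta=1.0):
    """Haze model: t(x) = exp(-beta * d(x))  =>  d(x) = -ln(t(x)) / beta.
    `t` is the medium transmission map estimated by DehazeNet;
    `beta` is an assumed scattering coefficient."""
    return -np.log(np.clip(t, 1e-6, 1.0)) / beta

def depth_history_image(depth_frames):
    """Collapse a sequence of estimated depth frames (T x H x W) onto a single
    X-Y plane.  Here the projection is a per-pixel maximum over time; the exact
    accumulation rule may differ from the paper."""
    dhi = np.zeros_like(depth_frames[0], dtype=np.float32)
    for d in depth_frames:
        dhi = np.maximum(dhi, d)
    return dhi
```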
Cont’d: Flowchart of DHI
Transfer Learning and Data Augmentation for DHI

● Why?
○ To reduce the overfitting problem
● What caused it in the first place?
○ Lack of training data (the extracted DHI images)
Transfer Learning

● The process of transferring learned weights from a pre-trained network and retraining them on the new task.
● The final FC layer, softmax layer, and classification layer are replaced to suit the current task (see the sketch below).
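A minimal sketch, assuming PyTorch/torchvision; the class count is illustrative.

```python
import torch.nn as nn
from torchvision import models

num_classes = 6   # illustrative, e.g. the six KTH action classes

# Load AlexNet with ImageNet weights and swap the final fully connected layer
# so that it produces one score per action class; the earlier layers keep their
# pre-trained weights and are fine-tuned on the DHI images.
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
net.classifier[6] = nn.Linear(net.classifier[6].in_features, num_classes)
```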
Data Augmentation

● Technique to increase the number of samples internally


● The network sees the same image as a different sample through data variation
○ Rotation
○ X/Y offset
○ Mirroring
● Increases the efficiency of the network by reducing overfitting (see the sketch below)
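A minimal sketch, assuming torchvision transforms; the rotation angle and offset range are illustrative.

```python
from torchvision import transforms

# Random rotation, X/Y offset (translation) and mirroring, applied on the fly so
# the network sees a slightly different version of each DHI at every epoch.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                      # rotation (illustrative range)
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # X/Y offset
    transforms.RandomHorizontalFlip(p=0.5),                     # mirroring
    transforms.ToTensor(),
])
```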
Score Fusion

● Combining the learning scores of the two networks
● s_p : score from the shape learning stream
● s_q : score from the sequential learning stream
● α : the fusion parameter
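The fusion equation itself is not reproduced on the slide; assuming the usual convex combination of the two stream scores:

s = α · s_p + (1 − α) · s_q,   0 ≤ α ≤ 1

The class with the highest fused score s is taken as the final prediction.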
Cont’d

The effect of α on recognition accuracy is studied to choose its optimal value.


Training Procedure

● Selecting the right number of BiLSTM Layers


● Loss function
● Network optimizer
BiLSTM Layers

Experimental analysis of classification accuracy versus training time.

Three layers are selected as a balance between performance and training time.
Network Optimization
● Stochastic gradient descent with momentum (SGDM) optimizer

● W, b, and η are the weights, biases, and learning rate of the network

● dW and db are the derivatives of the cost function with respect to W and b
● vdW and vdb are the corresponding velocity (momentum) terms

● The momentum parameter β is chosen as 0.9 or above
○ reduces vertical oscillations
○ speeds up the training procedure
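With the variables listed above, one common form of the SGDM update (the exponentially weighted average form implied by β ≈ 0.9) is:

vdW = β · vdW + (1 − β) · dW        vdb = β · vdb + (1 − β) · db
W = W − η · vdW                     b = b − η · vdb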
Loss
● Calculated at the classification layer after each feed-forward pass.
● The cross-entropy loss is used.

● C : total number of action classes
● p : scores generated by the softmax layer
● l : one-hot label vector from the classification layer
● each l_i is either 0 or 1
● The negative sign counters the negative value of the log function.
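Putting these quantities together, the cross-entropy loss for one sample is presumably the standard form

Loss = − Σ_{i=1..C} l_i · log(p_i)

and, since l is one-hot, only the term for the true class contributes.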
Dataset
● Dataset Used
○ KTH dataset
○ UCF Sports dataset
○ JHMDB dataset
○ UCF101 dataset
○ HMDB51 dataset
● Setup
○ The mid-scale datasets are trained independently
○ The KTH dataset is trained first
■ The weights of the KTH-trained network are reused for the UCF Sports and
JHMDB datasets
Setup

● BiLSTM network
○ Three BiLSTM units used in sequence
○ hidden units
■ 150 in first layer
■ 125 in second layer
■ 100 in third layer
○ Other parameters
■ mini-batch size 8,
■ maximum number of epochs 300
■ initial learning rate 0.001
○ All the values are chosen empirically.
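A minimal training-loop sketch (PyTorch assumed) wiring these values together: mini-batch size 8, up to 300 epochs, initial learning rate 0.001, and SGDM with β = 0.9. The data here is a dummy stand-in, and `DBiLSTM` refers to the earlier sketch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in data: 64 clips x 30 frames x 4096-d AlexNet features, 6 classes.
features = torch.randn(64, 30, 4096)
labels = torch.randint(0, 6, (64,))
loader = DataLoader(TensorDataset(features, labels), batch_size=8, shuffle=True)

model = DBiLSTM(feat_dim=4096, hidden=(150, 125, 100), num_classes=6)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(300):                  # maximum number of epochs
    for x, y in loader:                   # mini-batches of 8
        optimizer.zero_grad()
        loss = criterion(model(x), y)     # cross-entropy at the classification layer
        loss.backward()
        optimizer.step()
```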
Setup

● Shape learning network


○ Layers of AlexNet are used
■ maximum number of epochs 300
■ initial learning rate 0.0001.
● Parameters used for comparison
○ classification accuracy (CA),
○ kappa parameter (k)
○ precision (P).
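A minimal sketch of how these three metrics can be computed from a confusion matrix, assuming macro-averaged precision; the paper's exact averaging convention may differ.

```python
import numpy as np

def evaluate(conf):
    """conf: confusion matrix with rows = true classes, columns = predicted classes."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    ca = np.trace(conf) / n                                     # classification accuracy (CA)
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n ** 2   # expected chance agreement
    kappa = (ca - pe) / (1 - pe)                                # kappa parameter (k)
    precision = np.mean(np.diag(conf) / np.maximum(conf.sum(axis=0), 1e-12))  # precision (P)
    return ca, kappa, precision
```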
Parameter Evaluation
● Effect of Batch Size
○ the number of training samples passed through the network in one iteration
Parameter Evaluation

● Sample Complexity
○ the number of samples per class
Parameter Evaluation

● Class Complexity
○ number of action classes in the training dataset
Result & Analysis
● Ablation Study
○ Different Streams of HAR-Depth Network
Result & Analysis
● Comparison with earlier reported methods (small-scale datasets)
Result & Analysis

● Confusion matrices
○ small-scale datasets
Result & Analysis
● Comparison with earlier reported methods (mid-scale datasets)
Result & Analysis

● Confusion matrices
○ mid-scale datasets
Conclusion

● Familiarized with different concepts related to action recognition

● Introduced a two-stream HAR-Depth network for HAR.

● DHIs are constructed to provide better shape representation

● The DBiLSTM stream learns the sequential information and the DHI stream learns the shape information
Conclusion

● Transfer learning and data augmentation techniques reduce overfitting

● The HAR-Depth network performs better in terms of recognition accuracy

● The technique faces challenges when two closely related fast actions must be distinguished, e.g. ‘riding horse’ and ‘running’

● Ablation studies and the parameter sensitivity, sample complexity, and class complexity analyses suggest that the proposed HAR-Depth performs well

References
S. Sahoo, U. Srinivasu, and S. Ari, "3D features for human action recognition with semi-supervised learning", IET Image Process., vol. 13, no. 6, pp. 983-990,
2019.

Y. Yuan, Y. Zhao and Q. Wang, "Action recognition using spatial-optical data organization and sequential learning framework", Neurocomputing, vol. 315, pp.
221-233, 2018.

S. Megrhi, M. Jmal, W. Souidene and A. Beghdadi, "Spatio-temporal action localization and detection for human action recognition in big dataset", J. Visual
Commun. Image Representation, vol. 41, pp. 375-390, 2016.

Y. Shi, Y. Tian, Y. Wang and T. Huang, "Sequential deep trajectory descriptor for action recognition with three-stream CNN", IEEE Trans. Multimedia, vol. 19,
no. 7, pp. 1510-1520, 2017.

Y. Bin, Y. Yang, F. Shen, N. Xie, H. T. Shen and X. Li, "Describing video with attention-based bidirectional LSTM", IEEE Trans. Cybern., vol. 49, no. 7, pp. 2631-2641,
Jul. 2018.

Y. M. Lui and J. R. Beveridge, "Tangent bundle for human action recognition", Proc. IEEE Int. Conf. Autom. Face Gesture Recognit. Workshops, pp. 97-102,
2011.

K. P. Chou et al., "Robust feature-based automated multi-view human action recognition system", IEEE Access, vol. 6, pp. 15283-15296, 2018.

S. Samanta and B. Chanda, "Space-time facet model for human activity classification", IEEE Trans. Multimedia, vol. 16, no. 6, pp. 1525-1535, Oct. 2014.

G. Yu, N. A. Goussies, J. Yuan and Z. Liu, "Fast action detection via discriminative random forest voting and top-K subvolume search", IEEE Trans. Multimedia,
vol. 13, no. 3, pp. 507-517, Jun. 2011.

G. Gkioxari and J. Malik, "Finding action tubes", Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 759-768, 2015.
References
Y. Qin, L. Mo and B. Xie, "Feature fusion for human action recognition based on classical descriptors and 3D convolutional networks", Proc. IEEE Int. Conf.
Sens. Technol., pp. 1-5, 2017.

X. Peng, C. Zou, Y. Qiao and Q. Peng, "Action recognition with stacked Fisher vectors", Proc. Eur. Conf. Comput. Vis., pp. 581-595, 2014.

G. Singh, S. Saha, M. Sapienza, P. H. Torr and F. Cuzzolin, "Online real-time multiple spatiotemporal action localisation and prediction", Proc. IEEE Int. Conf.
Comput. Vis., pp. 3637-3646, 2017.

H. Gammulle, S. Denman, S. Sridharan, and C. Fookes, "Two stream LSTM: A deep fusion framework for human action recognition", Proc. IEEE Winter
Conf. Appl. Comput. Vis., pp. 177-186, 2017.

Y. Xu, L. Wang, J. Cheng, H. Xia and J. Yin, "DTA: Double LSTM with temporal-wise attention network for action recognition", Proc. IEEE Int. Conf. Comput.
Commun., pp. 1676-1680, 2017.

X. Peng and C. Schmid, "Multi-region two-stream R-CNN for action detection", Proc. Eur. Conf. Comput. Vis., pp. 744-759, 2016.

A. Ullah, J. Ahmad, K. Muhammad, M. Sajjad and S. W. Baik, "Action recognition in video sequences using deep bi-directional LSTM with CNN features", IEEE
Access, vol. 6, pp. 1155-1166, 2017.

A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks", Proc. Adv. Neural Inf. Process. Syst., pp. 1097-
1105, 2012.

B. Cai, X. Xu, K. Jia, C. Qing and D. Tao, "DehazeNet: An end-to-end system for single image haze removal", IEEE Trans. Image Process., vol. 25, no. 11, pp. 5187-
5198, 2016.
References
C. Schuldt, I. Laptev and B. Caputo, "Recognizing human actions: A local SVM approach", Proc. IEEE Int. Conf. Pattern Recognit., vol. 3, pp.
32-36, 2004.

M. D. Rodriguez, J. Ahmed and M. Shah, "Action mach a spatio-temporal maximum average correlation height filter for action recognition",
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1-8, 2008.

H. Jhuang, J. Gall, S. Zuffi, C. Schmid and M. J. Black, "Towards understanding action recognition", Proc. IEEE Int. Conf. Comput. Vis., pp.
3192-3199, 2013.

K. Soomro, A. R. Zamir and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild", Aug. 2012, [online] Available:
https://arxiv.org/abs/1212.0402.

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio and T. Serre, "HMDB: A large video database for human motion recognition", Proc. IEEE Int. Conf.
Comput. Vis., pp. 2556-2563, 2011.

Y. Song, S. Tang, Y. T. Zheng, T. S. Chua, Y. Zhang and S. Lin, "A distribution based video representation for human action recognition", Proc.
IEEE Int. Conf. Multimedia Expo, pp. 772-777, 2010.

A.-A. Liu, Y.-T. Su, W.-Z. Nie and M. Kankanhalli, "Hierarchical clustering multi-task learning for joint human action grouping and
recognition", IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 1, pp. 102-114, Jan. 2016.

X. Lu, H. Yao, S. Zhao, X. Sun and S. Zhang, "Action recognition with multi-scale trajectory-pooled 3D convolutional descriptors", Multimedia
Tools Appl., vol. 78, no. 1, pp. 507-523, 2019.
References
L. Wang, Y. Qiao and X. Tang, "Action recognition with trajectory-pooled deep-convolutional descriptors", Proc. IEEE Conf. Comput. Vis.
Pattern Recognit., pp. 4305-4314, 2015.

C. Feichtenhofer, A. Pinz and A. Zisserman, "Convolutional two-stream network fusion for video action recognition", Proc. IEEE Conf.
Comput. Vis. Pattern Recognit., pp. 1933-1941, 2016.

S. Zhao, Y. Liu, Y. Han, R. Hong, Q. Hu and Q. Tian, "Pooling the convolutional layers in deep convnets for video action recognition", IEEE
Trans. Circuits Syst. Video Technol., vol. 28, no. 8, pp. 1839-1849, Aug. 2018.

Y. Yang, R. Liu, C. Deng and X. Gao, "Multi-task human action recognition via exploring super-category", Signal Process., vol. 124, pp. 36-44,
2016.

M. Xin, H. Zhang, H. Wang, M. Sun and D. Yuan, "Arch: Adaptive recurrent-convolutional hybrid networks for long-term action recognition",
Neurocomputing, vol. 178, pp. 87-102, 2016.

M. Sekma, M. Mejdoub and C. B. Amar, "Human action recognition based on multi-layer fisher vector encoding method", Pattern Recognit.
Lett., vol. 65, pp. 37-43, 2015.
Thank You
