The HAR paradigm faces challenges in recognizing actions efficiently due to the presence of inter-class similarities and intra-class variations among the action classes. The temporal relationship within an action video is an important cue for recognizing the action type.
The HAR paradigm would certainly be more efficient if depth information were available.
This seminar intends to explore estimating depth from action frames and using it in conjunction with other learning networks to achieve higher accuracy.
Objectives
● The work is divided into two streams: a sequential learning stream and a shape learning stream.
Data Augmentation
● Why?
○ To reduce overfitting (see the sketch below)
● What caused it in the first place?
○ Lack of training data (the extracted DHI images)
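As a rough illustration of this step, the sketch below augments DHI-style images with standard torchvision transforms; the specific transforms and the 227-pixel crop size are assumptions, since the slides do not list them.

```python
# Illustrative augmentation pipeline for the DHI training images.
# The chosen transforms and the 227x227 crop are assumptions; the
# slides only state that augmentation was used to curb overfitting.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                    # mirror actions left/right
    transforms.RandomRotation(10),                        # small random tilt
    transforms.RandomResizedCrop(227, scale=(0.8, 1.0)),  # scale/translation jitter
    transforms.ToTensor(),
])
```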
Transfer Learning
● The process of transferring learned weights from a pre-trained network and retraining them.
● The final FC layer, the softmax layer, and the classification layer are replaced to suit the current task.
● These three layers were selected as a balance between performance and training time (see the sketch below).
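A minimal sketch of this replace-the-head step, assuming a torchvision AlexNet backbone (the backbone is not named on this slide) and a hypothetical num_classes. In PyTorch the softmax and classification stages are folded into the cross-entropy loss, so only the final FC layer is swapped explicitly.

```python
# Transfer learning sketch: load pre-trained weights and replace the
# final classifier head. AlexNet and num_classes are assumptions.
import torch.nn as nn
from torchvision import models

num_classes = 10  # hypothetical; set to the action dataset's class count
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Swap the last fully connected layer so the output matches the new task;
# softmax + classification are handled by nn.CrossEntropyLoss at train time.
net.classifier[6] = nn.Linear(net.classifier[6].in_features, num_classes)
```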
Network Optimization
● Stochastic gradient descent with momentum (SGDM) optimizer
● BiLSTM network (see the sketch below)
○ Three BiLSTM units used in sequence
○ Hidden units
■ 150 in the first layer
■ 125 in the second layer
■ 100 in the third layer
○ Other parameters
■ Mini-batch size: 8
■ Maximum number of epochs: 300
■ Initial learning rate: 0.001
○ All values were chosen empirically.
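A minimal PyTorch sketch of the three-layer BiLSTM stack with the hidden sizes and optimizer settings listed above; the input feature size, number of classes, and momentum value are assumptions not stated on the slide.

```python
# DBiLSTM sketch: three stacked bidirectional LSTMs (150/125/100 hidden
# units) followed by a classifier. input_size, num_classes, and the
# momentum value are assumptions; the other settings follow the slide.
import torch
import torch.nn as nn

class DBiLSTM(nn.Module):
    def __init__(self, input_size=4096, num_classes=10):
        super().__init__()
        # Each layer consumes the concatenated forward/backward outputs
        # (2 * hidden units) of the layer before it.
        self.bilstm1 = nn.LSTM(input_size, 150, batch_first=True, bidirectional=True)
        self.bilstm2 = nn.LSTM(2 * 150, 125, batch_first=True, bidirectional=True)
        self.bilstm3 = nn.LSTM(2 * 125, 100, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 100, num_classes)

    def forward(self, x):            # x: (batch, time, features)
        x, _ = self.bilstm1(x)
        x, _ = self.bilstm2(x)
        x, _ = self.bilstm3(x)
        return self.fc(x[:, -1, :])  # classify from the final time step

model = DBiLSTM()
# SGDM with the slide's initial learning rate; mini-batch size 8 and up to
# 300 epochs would be applied by the surrounding training loop.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
```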
Parameter Evaluation Setup
● Sample complexity (see the sketch below)
○ The number of samples per class
● Class complexity
○ The number of action classes in the training dataset
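To make the sample-complexity setup concrete, here is a small helper that draws a fixed number of training samples per class; all names here are illustrative, not from the slides.

```python
# Illustrative sample-complexity setup: keep at most n_per_class training
# samples of each action class. Function and variable names are hypothetical.
import random
from collections import defaultdict

def subset_per_class(labels, n_per_class, seed=0):
    """Return sample indices with at most n_per_class entries per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    chosen = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        chosen.extend(idxs[:n_per_class])
    return chosen
```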
Result & Analysis
● Ablation study
○ Different streams of the HAR-Depth network
● Comparison with earlier reported methods (small-scale dataset)
● Confusion matrices (small-scale dataset)
● Comparison with earlier reported methods (mid-scale dataset)
● Confusion matrices (mid-scale dataset; see the sketch below)
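For reproducibility, confusion matrices of the kind reported here can be computed with scikit-learn; the label values below are placeholders, not results from the slides.

```python
# Sketch of computing a confusion matrix for the recognized actions.
# y_true / y_pred are placeholder values, not reported results.
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 2, 0]   # ground-truth action class indices
y_pred = [0, 1, 2, 2, 0]   # classes predicted by the network
print(confusion_matrix(y_true, y_pred))
```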
Conclusion
● The DBiLSTM stream learns sequential information, while the DHI stream learns shape information (a fusion sketch follows).
● The technique faces challenges when two closely related fast actions must be distinguished.
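As a rough illustration of how the two streams' outputs could be combined, the sketch below averages their per-class scores; equal-weight late fusion is an assumption, since the slides do not state the fusion rule.

```python
# Illustrative late fusion of the two streams' class scores.
# Equal weighting is an assumption; the slides do not give the rule.
import numpy as np

def fuse_predictions(seq_scores: np.ndarray, shape_scores: np.ndarray) -> int:
    """Average softmax scores from the sequential (DBiLSTM) and shape (DHI)
    streams and return the index of the winning action class."""
    fused = 0.5 * seq_scores + 0.5 * shape_scores
    return int(np.argmax(fused))
```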
References
Y. Yuan, Y. Zhao and Q. Wang, "Action recognition using spatial-optical data organization and sequential learning framework", Neurocomputing, vol. 315, pp. 221-233, 2018.
S. Megrhi, M. Jmal, W. Souidene and A. Beghdadi, "Spatio-temporal action localization and detection for human action recognition in big dataset", J. Visual Commun. Image Representation, vol. 41, pp. 375-390, 2016.
Y. Shi, Y. Tian, Y. Wang and T. Huang, "Sequential deep trajectory descriptor for action recognition with three-stream CNN", IEEE Trans. Multimedia, vol. 19, no. 7, pp. 1510-1520, 2017.
Y. Bin, Y. Yang, F. Shen, N. Xie, H. T. Shen and X. Li, "Describing video with attention-based bidirectional LSTM", IEEE Trans. Cybern., vol. 49, no. 7, pp. 2631-2641, Jul. 2018.
Y. M. Lui and J. R. Beveridge, "Tangent bundle for human action recognition", Proc. IEEE Int. Conf. Autom. Face Gesture Recognit. Workshops, pp. 97-102, 2011.
K. P. Chou et al., "Robust feature-based automated multi-view human action recognition system", IEEE Access, vol. 6, pp. 15283-15296, 2018.
S. Samanta and B. Chanda, "Space-time facet model for human activity classification", IEEE Trans. Multimedia, vol. 16, no. 6, pp. 1525-1535, Oct. 2014.
G. Yu, N. A. Goussies, J. Yuan and Z. Liu, "Fast action detection via discriminative random forest voting and top-K subvolume search", IEEE Trans. Multimedia, vol. 13, no. 3, pp. 507-517, Jun. 2011.
G. Gkioxari and J. Malik, "Finding action tubes", Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 759-768, 2015.
Y. Qin, L. Mo and B. Xie, "Feature fusion for human action recognition based on classical descriptors and 3D convolutional networks", Proc. IEEE Int. Conf. Sens. Technol., pp. 1-5, 2017.
X. Peng, C. Zou, Y. Qiao and Q. Peng, "Action recognition with stacked Fisher vectors", Proc. Eur. Conf. Comput. Vis., pp. 581-595, 2014.
G. Singh, S. Saha, M. Sapienza, P. H. Torr and F. Cuzzolin, "Online real-time multiple spatiotemporal action localisation and prediction", Proc. IEEE Int. Conf. Comput. Vis., pp. 3637-3646, 2017.
H. Gammulle, S. Denman, S. Sridharan and C. Fookes, "Two stream LSTM: A deep fusion framework for human action recognition", Proc. IEEE Winter Conf. Appl. Comput. Vis., pp. 177-186, 2017.
Y. Xu, L. Wang, J. Cheng, H. Xia and J. Yin, "DTA: Double LSTM with temporal-wise attention network for action recognition", Proc. IEEE Int. Conf. Comput. Commun., pp. 1676-1680, 2017.
X. Peng and C. Schmid, "Multi-region two-stream R-CNN for action detection", Proc. Eur. Conf. Comput. Vis., pp. 744-759, 2016.
A. Ullah, J. Ahmad, K. Muhammad, M. Sajjad and S. W. Baik, "Action recognition in video sequences using deep bi-directional LSTM with CNN features", IEEE Access, vol. 6, pp. 1155-1166, 2017.
A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks", Proc. Adv. Neural Inf. Process. Syst., pp. 1097-1105, 2012.
B. Cai, X. Xu, K. Jia, C. Qing and D. Tao, "DehazeNet: An end-to-end system for single image haze removal", IEEE Trans. Image Process., vol. 25, no. 11, pp. 5187-5198, 2016.
C. Schuldt, I. Laptev and B. Caputo, "Recognizing human actions: A local SVM approach", Proc. IEEE Int. Conf. Pattern Recognit., vol. 3, pp. 32-36, 2004.
M. D. Rodriguez, J. Ahmed and M. Shah, "Action MACH: A spatio-temporal maximum average correlation height filter for action recognition", Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1-8, 2008.
H. Jhuang, J. Gall, S. Zuffi, C. Schmid and M. J. Black, "Towards understanding action recognition", Proc. IEEE Int. Conf. Comput. Vis., pp. 3192-3199, 2013.
K. Soomro, A. R. Zamir and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild", Aug. 2012, [online] Available: https://arxiv.org/abs/1212.0402.
H. Kuehne, H. Jhuang, E. Garrote, T. Poggio and T. Serre, "HMDB: A large video database for human motion recognition", Proc. IEEE Int. Conf. Comput. Vis., pp. 2556-2563, 2011.
Y. Song, S. Tang, Y. T. Zheng, T. S. Chua, Y. Zhang and S. Lin, "A distribution based video representation for human action recognition", Proc. IEEE Int. Conf. Multimedia Expo, pp. 772-777, 2010.
A.-A. Liu, Y.-T. Su, W.-Z. Nie and M. Kankanhalli, "Hierarchical clustering multi-task learning for joint human action grouping and recognition", IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 1, pp. 102-114, Jan. 2016.
X. Lu, H. Yao, S. Zhao, X. Sun and S. Zhang, "Action recognition with multi-scale trajectory-pooled 3D convolutional descriptors", Multimedia Tools Appl., vol. 78, no. 1, pp. 507-523, 2019.
L. Wang, Y. Qiao and X. Tang, "Action recognition with trajectory-pooled deep-convolutional descriptors", Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4305-4314, 2015.
C. Feichtenhofer, A. Pinz and A. Zisserman, "Convolutional two-stream network fusion for video action recognition", Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1933-1941, 2016.
S. Zhao, Y. Liu, Y. Han, R. Hong, Q. Hu and Q. Tian, "Pooling the convolutional layers in deep ConvNets for video action recognition", IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 8, pp. 1839-1849, Aug. 2018.
Y. Yang, R. Liu, C. Deng and X. Gao, "Multi-task human action recognition via exploring super-category", Signal Process., vol. 124, pp. 36-44, 2016.
M. Xin, H. Zhang, H. Wang, M. Sun and D. Yuan, "ARCH: Adaptive recurrent-convolutional hybrid networks for long-term action recognition", Neurocomputing, vol. 178, pp. 87-102, 2016.
M. Sekma, M. Mejdoub and C. B. Amar, "Human action recognition based on multi-layer Fisher vector encoding method", Pattern Recognit. Lett., vol. 65, pp. 37-43, 2015.
Thank You