
TSNet: Deep Network for Human Action Recognition in Hazy Videos
Sachin Chaudhary and Subrahmanyam Murala
Computer Vision and Pattern Recognition Lab,
Indian Institute of Technology Ropar, INDIA

Fig. 1: The process of the proposed approach for HAR in hazy videos. [Architecture: the hazy input video is de-hazed; C2MSNet estimates the transmission map; rank pooling over the transmission maps yields the Dynamic Depth Image, and rank pooling over the haze-free video yields the Dynamic Appearance Image; CNN features extracted from both images are concatenated and passed to a fully connected layer with softmax.]

Abstract:

 Due to their poor quality, haze-degraded videos are difficult to analyze for human activities using the existing state-of-the-art methods.
 A new two-level saliency based end-to-end network (TSNet) for HAR in hazy videos is proposed.
 The concept of rank pooling given in [2] is further utilized to efficiently represent the temporal saliency of the video.
 The transmission map information is utilized to fix the spatial saliency in each frame.
 A new hazy video dataset is generated from two benchmark datasets, namely HMDB51 and UCF101, by adding synthetic haze.

Outline of Proposed TSNet:

Fig. 2: Overview of the problem of HAR in hazy videos. (A) Sample optical flow (OF), dynamic image (DI) and dynamic optical flow (DOF) estimation from a haze-free video: (a) DI of the normal RGB frame; (b) DOF of the normal RGB frame. (B) Sample OF, DI and DOF estimation from a hazy video: (a) DI of the hazy frame; (b) DOF of the hazy frame. (C) The proposed method of estimating spatial and temporal saliency in a hazy video: (a) DI of the de-hazed frame; (b) DI of the transmission map (TrMap). The figure shows a clearly visible difference in the OF, DI and DOF of the normal, hazy and de-hazed frames.

Spatial Saliency Detection:
 Generally, the actor in a video is in the foreground and hence has a different depth than the background objects.
 Using the transmission map, the actor can be segregated from the background.
 Haze itself is a clue for estimating the scene transmission map.
 C2MSNet [1] is used to estimate the scene transmission map.
 The transmission map is used to define the spatial saliency (see the sketch below).
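The poster does not spell out how the transmission map is turned into a spatial-saliency mask, so the following is only an illustrative sketch: it assumes a C2MSNet-style transmission map t ∈ (0, 1] in which nearer scene points have larger values, and uses a simple percentile threshold (the 70th percentile is a made-up choice) to separate the foreground actor from the background.

```python
import numpy as np

def spatial_saliency_mask(trmap: np.ndarray, percentile: float = 70.0) -> np.ndarray:
    """Binary spatial-saliency mask from a scene transmission map.

    Under the optical haze model t(x) = exp(-beta * d(x)), larger
    transmission means a nearer scene point, so the foreground actor
    tends to have higher t than the background.
    """
    t = (trmap - trmap.min()) / (trmap.max() - trmap.min() + 1e-8)  # normalize to [0, 1]
    thresh = np.percentile(t, percentile)                           # illustrative cutoff
    return (t >= thresh).astype(np.float32)
```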
Temporal Saliency Detection:
 The concept of the dynamic image, inspired by [2], is used to estimate the temporal saliency.
 The dynamic image is essentially a summary of the appearance and dynamics of the whole video.
 Thus, the Appearance Dynamic Image is obtained from the haze-free video rather than the hazy video (see Fig. 1).
 In addition, the concept of the Depth Dynamic Image is proposed.
 The transmission maps obtained during de-hazing are used to compute an effective Depth Dynamic Image (see the rank pooling sketch below).
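Rank pooling [2] admits a closed-form approximation: for frames t = 1..T, the dynamic image is the weighted sum Σ_t α_t I_t with α_t = 2t − T − 1. A minimal sketch under that approximation, applicable both to de-hazed frames (Appearance Dynamic Image) and to transmission maps (Depth Dynamic Image); the rescaling to an 8-bit image is an assumption for feeding the result to a CNN.

```python
import numpy as np

def dynamic_image(frames: np.ndarray) -> np.ndarray:
    """Approximate rank pooling over a video, following [2].

    frames: (T, H, W, C) array in temporal order. The coefficients
    alpha_t = 2t - T - 1 weight later frames positively and earlier
    frames negatively, summarizing appearance and dynamics in one image.
    """
    T = frames.shape[0]
    alphas = 2.0 * np.arange(1, T + 1) - T - 1            # shape (T,)
    di = np.tensordot(alphas, frames.astype(np.float64), axes=1)
    di = (di - di.min()) / (di.max() - di.min() + 1e-8)   # rescale to [0, 1]
    return (255.0 * di).astype(np.uint8)                  # 8-bit image for the CNN

# Appearance Dynamic Image: dynamic_image(dehazed_frames)
# Depth Dynamic Image:      dynamic_image(transmission_maps[..., None])
```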
Learning Human Actions:
 Fig. 1 depicts the procedure of the proposed human action recognition; a sketch of the two-stream feature extraction and classification follows this list.
 Appearance based features:
• The de-hazed video frames are utilized for the appearance based features.
• VGG-19 is used to extract the appearance based features.
 Depth based features:
• A network similar to the one used for the appearance based features is utilized; the input here is the Depth Dynamic Image.
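A minimal PyTorch sketch of the pipeline implied by Fig. 1: VGG-19 features from the Appearance Dynamic Image and the Depth Dynamic Image are concatenated and classified by a fully connected layer with softmax. The pooling choice, feature dimensions and single-layer head are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models  # torchvision >= 0.13 for the weights API

class TSNetHead(nn.Module):
    """Two-stream VGG-19 feature extraction + concatenation + FC + softmax."""

    def __init__(self, num_classes: int):
        super().__init__()
        weights = models.VGG19_Weights.IMAGENET1K_V1
        self.appearance = models.vgg19(weights=weights).features  # appearance stream
        self.depth = models.vgg19(weights=weights).features       # depth stream
        self.pool = nn.AdaptiveAvgPool2d(1)        # 512-d descriptor per stream
        self.fc = nn.Linear(2 * 512, num_classes)  # fully connected layer (Fig. 1)

    def forward(self, app_di: torch.Tensor, depth_di: torch.Tensor) -> torch.Tensor:
        # Inputs: (B, 3, 224, 224); a 1-channel Depth Dynamic Image can be
        # replicated to 3 channels beforehand.
        f_app = self.pool(self.appearance(app_di)).flatten(1)
        f_dep = self.pool(self.depth(depth_di)).flatten(1)
        logits = self.fc(torch.cat([f_app, f_dep], dim=1))  # concatenation
        return logits.softmax(dim=1)                        # softmax (Fig. 1)
```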
Fig. 3: Qualitative analysis of the hazy vs. haze-free video frames. The OF is calculated between the frame shown here and the previous frame of that video.

Experimental Results and Analysis:
Datasets:
 UCF101: 101 action classes containing 13,000 videos.
 HMDB51: 51 action classes containing 6,766 videos.
 Synthetic haze is added using the optical model (a minimal sketch follows).
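The optical (atmospheric scattering) model commonly used for haze synthesis is I(x) = J(x)t(x) + A(1 − t(x)) with transmission t(x) = exp(−β d(x)). A minimal sketch under that model; the depth source, scattering coefficient β and atmospheric light A below are illustrative assumptions, as the poster does not specify them.

```python
import numpy as np

def add_synthetic_haze(J: np.ndarray, depth: np.ndarray,
                       beta: float = 1.0, A: float = 0.9) -> np.ndarray:
    """Render a hazy frame with the optical model I = J*t + A*(1 - t).

    J:     clean frame as floats in [0, 1], shape (H, W, 3)
    depth: scene depth map, shape (H, W), larger = farther
    beta:  scattering coefficient (haze density); illustrative value
    A:     global atmospheric light; illustrative value
    """
    t = np.exp(-beta * depth)[..., None]  # transmission map, shape (H, W, 1)
    return J * t + A * (1.0 - t)
```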
Table I: Decrement in the average recognition rate (ARR) of the existing methods when applied to the hazy datasets. DE: decrement, HF: haze-free.

Methods                  HMDB51 HF   Hazy   DE    UCF101 HF   Hazy   DE
DI [2]                   57.3        50.2   7.1   86.6        82.1   4.5
SI + DI + OF + DOF [2]   71.5        65.2   6.3   95.0        88.7   6.3
TS [4]                   59.4        55.1   4.3   88.0        84.6   3.4
TSF [3]                  65.4        60.2   5.2   92.5        86.3   6.2
STResNet + IDT [5]       70.3        66.2   4.1   94.6        89.3   5.3

Table II: ARR of the proposed method with different input combinations.

Methods              Hazy-HMDB51   Hazy-UCF101
Hazy frame           50.3          82.1
Hazy frame + OF      61.2          86.4
Hazy frame + TrMap   68.7          96.1

Table III: Comparison of the average recognition rate of the proposed method with some of the existing methods.

Methods                  Hazy-HMDB51   Hazy-UCF101
DI [2]                   50.2          82.1
SI + DI + OF + DOF [2]   65.2          88.7
TS [4]                   55.1          84.6
TSF [3]                  60.2          86.3
STResNet + IDT [5]       66.2          89.3
Proposed Method          68.7          96.1

References:
[1] A. Dudhane and S. Murala, "C2MSNet: A Novel Approach for Single Image Haze Removal," in IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 1397-1404.
[2] H. Bilen, B. Fernando, E. Gavves, and A. Vedaldi, "Action Recognition with Dynamic Image Networks," IEEE Trans. Pattern Anal. Mach. Intell., 2017.
[3] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional Two-Stream Network Fusion for Video Action Recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[4] K. Simonyan and A. Zisserman, "Two-Stream Convolutional Networks for Action Recognition in Videos," in Advances in Neural Information Processing Systems, 2014, pp. 568-576.
[5] C. Feichtenhofer, A. Pinz, and R. P. Wildes, "Spatiotemporal Residual Networks for Video Action Recognition," in Advances in Neural Information Processing Systems, 2016, pp. 3468-3476.
