and the network class predictions are averaged to produce a more robust estimate of the class probabilities. To produce video-level predictions we opted for the simplest approach of averaging individual clip predictions over the duration of each video. We expect more elaborate techniques to further improve performance but consider these to be outside the scope of this paper.

Feature histogram baselines. In addition to comparing CNN architectures among each other, we also report the accuracy of a feature-based approach. Following a standard bag-of-words pipeline we extract several types of features at all frames of our videos, discretize them using k-means vector quantization, and accumulate the words into histograms with spatial pyramid encoding and soft quantization (a sketch of the quantization step follows this paragraph). Every histogram is normalized to sum to 1 and all histograms are concatenated into a 25,000-dimensional video-level feature vector. Our features are similar to those of Yang & Toderici [27] and consist of local features (HOG [5], Texton [24], Cuboids [8], etc.) extracted both densely and at sparse interest points, as well as global features (such as Hue-Saturation, Color moments, and the number of faces detected). As a classifier we use a multilayer neural network with Rectified Linear Units followed by a Softmax classifier. We found that a multilayer network performs consistently and significantly better than linear models in separate validation experiments. Furthermore, we performed extensive cross-validation across many of the network's hyperparameters by training multiple models and choosing the one with the best performance on a validation set. The tuned hyperparameters include the learning rate, weight decay, the number of hidden layers (between 1 and 2), dropout probabilities, and the number of nodes in both layers.
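To make the quantization-and-histogram step above concrete, here is a minimal Python sketch. It assumes scikit-learn's KMeans, uses hard assignment, and omits the spatial pyramid and soft quantization; the vocabulary size and helper names are illustrative, since the paper does not publish code.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors: np.ndarray, k: int = 1000) -> KMeans:
    # Learn a visual vocabulary by k-means vector quantization;
    # `descriptors` has shape (num_descriptors, descriptor_dim).
    return KMeans(n_clusters=k, n_init=3).fit(descriptors)

def bow_histogram(descriptors: np.ndarray, codebook: KMeans) -> np.ndarray:
    # Hard-assign each descriptor to its nearest visual word and build
    # an L1-normalized word histogram. The paper additionally uses soft
    # quantization and a spatial pyramid, both omitted here.
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

# Histograms from each feature type (HOG, Texton, Cuboids, ...) would then
# be concatenated into the ~25,000-dimensional video-level feature vector.
```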
Figure 5: Examples that illustrate qualitative differences between the single-frame network and the Slow Fusion (motion-aware) network, shown in the same color scheme as Figure 4. A few classes are easier to disambiguate with motion information (left three).
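Both the Sports-1M evaluation above and the UCF-101 protocol below produce video-level predictions by averaging clip-level class probabilities. A minimal sketch of that reduction (the clip and class counts are merely illustrative):

```python
import numpy as np

def video_prediction(clip_probs: np.ndarray) -> np.ndarray:
    # Average the per-clip softmax distributions into a single
    # video-level class distribution; `clip_probs` has shape
    # (num_clips, num_classes).
    return clip_probs.mean(axis=0)

# Illustrative usage: 20 sampled clips, 487 Sports-1M classes.
clip_probs = np.random.dirichlet(np.ones(487), size=20)
video_probs = video_prediction(clip_probs)
top_class = int(np.argmax(video_probs))
```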
4.2. Transfer Learning Experiments on UCF-101

The results of our analysis on the Sports-1M dataset indicate that the networks learn powerful motion features. A natural question that arises is whether these features also generalize to other datasets and class categories. We examine this question in detail by performing transfer learning experiments on the UCF-101 [22] Activity Recognition dataset. The dataset consists of 13,320 videos belonging to 101 categories that are separated into 5 broad groups: Human-Object interaction (Applying eye makeup, brushing teeth, hammering, etc.), Body-Motion (Baby crawling, push ups, blowing candles, etc.), Human-Human interaction (Head massage, salsa spin, haircut, etc.), Playing Instruments (flute, guitar, piano, etc.) and Sports. This grouping allows us to separately study the performance improvements on Sports classes relative to classes from unrelated videos that are less numerous in our training data.

Transfer learning. Since we expect that CNNs learn more generic features at the bottom of the network (such as edges, local shapes) and more intricate, dataset-specific features near the top of the network, we consider the following scenarios for our transfer learning experiments (a sketch of the corresponding freezing logic follows the list):

Fine-tune top layer. We treat the CNN as a fixed feature extractor and train a classifier on the last 4096-dimensional layer, with dropout regularization. We found that keeping each unit active with a probability as low as 10% was effective.

Fine-tune top 3 layers. Instead of only retraining the final classifier layer, we also retrain both fully connected layers. We initialize with a fully trained Sports CNN and then begin training the top 3 layers. We introduce dropout before all trained layers, with a keep probability as low as 10%.

Fine-tune all layers. In this scenario we retrain all network parameters, including all convolutional layers at the bottom of the network.

Train from scratch. As a baseline we train the full network from scratch on UCF-101 alone.
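The four scenarios above differ only in which parameters are left trainable. The following PyTorch-style sketch of the freezing logic is an illustration, not the authors' implementation (the original experiments ran in a custom distributed framework); it assumes the model exposes an nn.Sequential classifier holding the two fully connected layers plus the final softmax layer.

```python
import torch.nn as nn

def set_trainable(model: nn.Module, scenario: str) -> None:
    # Freeze or unfreeze parameters for the four transfer scenarios.
    # Assumes `model.classifier` is an nn.Sequential holding the two
    # fully connected layers plus the final classifier layer, and
    # `model.features` holds the convolutional stack.
    for p in model.parameters():
        p.requires_grad = scenario in ("all", "scratch")
    if scenario == "top":
        # Retrain only the last (classifier) layer.
        for p in model.classifier[-1].parameters():
            p.requires_grad = True
    elif scenario == "top3":
        # Retrain both fully connected layers plus the classifier.
        for p in model.classifier.parameters():
            p.requires_grad = True
    # "scratch" additionally implies random re-initialization (not shown),
    # and dropout with a keep probability as low as 0.1 would be placed
    # before each retrained layer, per the text above.
```

In every transfer scenario the final layer would also be replaced by a freshly initialized 101-way classifier to match the UCF-101 label set.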
Results. To prepare UCF-101 data for classification we sampled 50 clips from every video and followed the same evaluation protocol as for Sports across the 3 suggested folds. We reached out to the authors of [22] to obtain the YouTube video IDs of UCF-101 videos, but unfortunately these were not available and hence we cannot guarantee that the Sports-1M dataset has no overlap with UCF-101. However, these concerns are somewhat mitigated as we only use a few sampled clips from every video.

We use the Slow Fusion network in our UCF-101 experiments as it provides the best performance on Sports-1M. The results of the experiments can be seen in Table 3. Interestingly, retraining the softmax layer alone does not perform best (possibly because the high-level features are too specific to sports), and the other extreme of fine-tuning all layers is also not adequate (likely due to overfitting). Instead, the best performance is obtained by taking a balanced approach and retraining the top few layers of the network. Lastly, training the entire network from scratch consistently leads to massive overfitting and dismal performance.

Performance by group. We further break down our performance by the 5 broad groups of classes present in the UCF-101 dataset. We compute the average precision of every class and then compute the mean average precision over the classes in each group. As can be seen from Table 4, a large fraction of our performance can be attributed to the Sports categories in UCF-101, but the other groups still display impressive performance considering that the only way such frames can appear in the training data is through label noise. Moreover, the gain in performance from retraining only the top layer to retraining the top 3 layers is almost entirely due to improvements on non-Sports categories: Sports performance only decreases from 0.80 to 0.79, while mAP improves on all other categories.
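The group breakdown above reduces to averaging per-class average precision within each group. A minimal sketch, assuming scikit-learn's average_precision_score and a hypothetical class_to_group mapping from class index to group name:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def group_mean_ap(y_true: np.ndarray, y_score: np.ndarray,
                  class_to_group: dict[int, str]) -> dict[str, float]:
    # Average precision per class, then the mean of the APs over the
    # classes of each broad UCF-101 group. `y_true` holds binary labels
    # and `y_score` predicted scores, both (num_videos, num_classes).
    per_class_ap = {
        c: average_precision_score(y_true[:, c], y_score[:, c])
        for c in range(y_true.shape[1])
    }
    grouped: dict[str, list[float]] = {}
    for c, group in class_to_group.items():
        grouped.setdefault(group, []).append(per_class_ap[c])
    return {g: float(np.mean(aps)) for g, aps in grouped.items()}
```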
5. Conclusions

We studied the performance of convolutional neural networks in large-scale video classification. We found that CNN architectures are capable of learning powerful features from weakly-labeled data that far surpass feature-based methods in performance, and that these benefits are surprisingly robust to the details of the connectivity of the architectures in time. Qualitative examination of network outputs and confusion matrices reveals interpretable errors.

Our results indicate that while the performance is not particularly sensitive to the architectural details of the connectivity in time, a Slow Fusion model consistently performs better than the early and late fusion alternatives. Surprisingly, we find that a single-frame model already displays very strong performance, suggesting that local motion cues may not be critically important, even for a dynamic dataset such as Sports. An alternative theory is that more careful treatment of camera motion may be necessary (for example by extracting features in the local coordinate system of a tracked point, as seen in [25]), but this requires significant changes to the CNN architecture that we leave for future work. We also identified mixed-resolution architectures that consist of a low-resolution context stream and a high-resolution fovea stream as an effective way of speeding up CNNs without sacrificing accuracy.

Our transfer learning experiments on UCF-101 suggest that the learned features are generic and generalize to other video classification tasks. In particular, we achieved the highest transfer learning performance by retraining the top 3 layers of the network.

In future work we hope to incorporate broader categories into the dataset to obtain more powerful and generic features, to investigate approaches that explicitly reason about camera motion, and to explore recurrent neural networks as a more powerful technique for combining clip-level predictions into global video-level predictions.

Acknowledgments: We thank Saurabh Singh, Abhinav Shrivastava, Jay Yagnik, Alex Krizhevsky, Quoc Le, Jeff Dean and Rajat Monga for helpful discussions.

References

[1] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential deep learning for human action recognition. In Human Behavior Understanding, pages 29–39. Springer, 2011.
[2] D. Ciresan, A. Giusti, J. Schmidhuber, et al. Deep neural networks segment neuronal membranes in electron microscopy images. In NIPS, 2012.
[3] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 35(8), 2013.
[4] C. Couprie, C. Farabet, L. Najman, and Y. LeCun. Indoor semantic segmentation using depth information. International Conference on Learning Representations, 2013.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, 2005.
[6] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, 2012.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[8] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[10] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. PAMI, 35(1):221–231, 2013.
[11] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[12] I. Laptev. On space-time interest points. IJCV, 64(2-3):107–123, 2005.
[13] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[14] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[16] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos "in the wild". In CVPR, 2009.
[17] J. C. Niebles, C.-W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, pages 392–405. Springer, 2010.
[18] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. arXiv preprint arXiv:1403.6382, 2014.
[19] P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. In ICPR, 2012.
[20] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[21] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
[22] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[23] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In ECCV. Springer, 2010.
[24] M. Varma and A. Zisserman. A statistical approach to texture classification from single images. IJCV, 62(1-2):61–81, 2005.
[25] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR. IEEE, 2011.
[26] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, C. Schmid, et al. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
[27] W. Yang and G. Toderici. Discriminative tag learning on YouTube videos with latent sub-tags. In CVPR, 2011.
[28] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901, 2013.