
Abstract

We propose to learn semantic spatio-temporal embeddings for videos to support high-level video analysis. The first step of the proposed embedding employs a deep architecture consisting of two channels of convolutional neural networks (capturing appearance and local motion), each followed by a Gated Recurrent Unit (GRU) encoder that captures the longer-term temporal structure of the CNN features. The resultant spatio-temporal representation (a vector) is then mapped, via a multilayer perceptron, into the word2vec semantic embedding space, yielding a semantic interpretation of the video vector that supports high-level analysis. We demonstrate the usefulness and effectiveness of this new video representation through experiments on action recognition, zero-shot video classification, and 'word-to-video' retrieval on the UCF-101 dataset.
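The pipeline the abstract describes (two-channel CNN features, per-channel GRU encoding, MLP projection into a word2vec-like space, then nearest-label matching for zero-shot classification) can be sketched in NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the weights are random rather than trained, the CNN features are stand-in random vectors, the dimensions are tiny compared to a real model, and the class names and label embeddings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_params(d_in, d_h):
    """Random (untrained) single-layer GRU weights, for illustration only."""
    return {k: rng.normal(scale=0.3, size=(d_h, d_in if k[0] == "W" else d_h))
            for k in ["Wz", "Uz", "Wr", "Ur", "Wh", "Uh"]}

def gru_encode(frames, p):
    """Run a GRU over per-frame feature vectors; return the final hidden state
    as a fixed-length encoding of the whole sequence."""
    h = np.zeros(p["Uz"].shape[0])
    for x in frames:
        z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)            # update gate
        r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)            # reset gate
        cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))   # candidate state
        h = (1 - z) * h + z * cand
    return h

D_CNN, D_H, D_SEM, T = 8, 4, 5, 10  # toy sizes; a real model's are far larger

# Stand-ins for per-frame CNN features from the two channels.
appearance = rng.normal(size=(T, D_CNN))  # appearance-stream features
motion = rng.normal(size=(T, D_CNN))      # local-motion-stream features

# One GRU encoder per channel; concatenate the two encodings into one
# spatio-temporal video vector.
video_vec = np.concatenate([gru_encode(appearance, gru_params(D_CNN, D_H)),
                            gru_encode(motion, gru_params(D_CNN, D_H))])

# MLP (one hidden layer) mapping the video vector into the semantic space.
W1 = rng.normal(scale=0.3, size=(6, 2 * D_H))
W2 = rng.normal(scale=0.3, size=(D_SEM, 6))
semantic = W2 @ np.tanh(W1 @ video_vec)

# Zero-shot classification: pick the class whose (hypothetical) word
# embedding is most cosine-similar to the projected video vector.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

class_embs = {c: rng.normal(size=D_SEM)
              for c in ["basketball", "surfing", "typing"]}
pred = max(class_embs, key=lambda c: cosine(semantic, class_embs[c]))
print(pred)
```

Because the target space is a word embedding space, the same nearest-neighbor machinery also supports 'word-to-video' retrieval: embed a query word, then rank videos by the cosine similarity of their projected vectors to it.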

Original language: English (US)
Title of host publication: 2016 23rd International Conference on Pattern Recognition, ICPR 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 6
ISBN (Electronic): 9781509048472
State: Published - Jan 1 2016
Event: 23rd International Conference on Pattern Recognition, ICPR 2016 - Cancun, Mexico
Duration: Dec 4 2016 - Dec 8 2016

Publication series

Name: Proceedings - International Conference on Pattern Recognition



ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition


Title: Video2vec: Learning semantic spatio-temporal embeddings for video representation
