With the sustaining bloom of multimedia data, Zero-shot Learning (ZSL) techniques have attracted much attention in recent years for its ability to train learning models that can handle 'unseen' categories. Existing ZSL algorithms mainly take advantages of attribute-based semantic space and only focus on static image data. Besides, most ZSL studies merely consider the semantic embedded labels and fail to address domain shift problem. In this paper, we purpose a deep two-output model for video ZSL and action recognition tasks by computing both spatial and temporal features from video contents through distinct Convolutional Neural Networks (CNNs) and training a Multi-layer Perceptron (MLP) upon extracted features to map videos to semantic embedding word vectors. Moreover, we introduce a domain adaptation strategy named 'ConSSEV' - by combining outputs from two distinct output layers of our MLP to improve the results of zero-shot learning. Our experiments on UCF101 dataset demonstrate the purposed model has more advantages associated with more complex video embedding schemes, and outperforms the state-of-the-art zero-shot learning techniques.

Original languageEnglish (US)
Title of host publication2016 IEEE International Conference on Image Processing, ICIP 2016 - Proceedings
PublisherIEEE Computer Society
Number of pages5
ISBN (Electronic)9781467399616
StatePublished - Aug 3 2016
Event23rd IEEE International Conference on Image Processing, ICIP 2016 - Phoenix, United States
Duration: Sep 25 2016Sep 28 2016

Publication series

NameProceedings - International Conference on Image Processing, ICIP


Other23rd IEEE International Conference on Image Processing, ICIP 2016
Country/TerritoryUnited States


  • Action recognition
  • Convolutional neural network
  • Multi-layer perceptron
  • Zero-shot learning

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
  • Signal Processing


Dive into the research topics of 'Recognizing unseen actions in a domain-adapted embedding space'. Together they form a unique fingerprint.

Cite this