Deep key clips-video feature fusion framework for action recognition

Chao Li, Yue Ming, Yuan Shen, Hui Yu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Action recognition is crucial for many computer vision applications. Recently, deep learning has brought breakthroughs in action recognition performance. However, videos contain many redundant frames carrying similar information, which makes it difficult to capture discriminative spatiotemporal features for long-term actions. In this paper, we propose a novel framework for action recognition: the Deep Key Clips-Video feature fusion framework. First, we propose a key clip selection algorithm based on background subtraction, which uses the image average gradient to select key clips for training. Then, we superimpose the key frames to generate historical contour images, effectively aggregating long-term information about the actions. Key video clips and historical contour images are fed to a 3D convolutional network and a 2D convolutional network respectively, which extract clip-level and long-term video-level features. Finally, we fuse these two sub-networks to improve recognition accuracy. We conduct experiments on two mainstream action recognition datasets, UCF-101 and HMDB-51. Compared with state-of-the-art methods, the experimental results demonstrate the effectiveness of our proposed network for action recognition.
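The key clip selection and contour superimposition steps can be illustrated with a rough NumPy sketch. It is not the authors' implementation: the abstract does not specify the background model, the scoring rule, or how frames are superimposed. The sketch assumes a per-pixel median background, one common definition of average gradient (mean of sqrt((gx^2 + gy^2) / 2)), and a per-pixel maximum for superimposition. All function names and parameters are hypothetical.

```python
import numpy as np


def foreground(frames):
    """Background subtraction with a per-pixel median background (assumption)."""
    frames = np.asarray(frames, dtype=np.float64)
    background = np.median(frames, axis=0)
    return np.abs(frames - background)


def average_gradient(img):
    """Average gradient of a 2D image: mean of sqrt((gx^2 + gy^2) / 2)."""
    gy, gx = np.gradient(img)
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))


def select_key_clips(frames, clip_len=16, top_k=3):
    """Return sorted indices of the top_k non-overlapping clips, scored by
    the mean average gradient of their background-subtracted frames."""
    fg = foreground(frames)
    n_clips = len(fg) // clip_len
    scores = np.array([
        np.mean([average_gradient(f) for f in fg[i * clip_len:(i + 1) * clip_len]])
        for i in range(n_clips)
    ])
    best = np.argsort(scores)[::-1][:top_k]
    return sorted(int(i) for i in best)


def historical_contour_image(frames, key_clips, clip_len=16):
    """Superimpose the foreground of all frames in the selected clips by
    taking the per-pixel maximum (assumption)."""
    fg = foreground(frames)
    selected = np.concatenate(
        [fg[i * clip_len:(i + 1) * clip_len] for i in key_clips]
    )
    return selected.max(axis=0)
```

For a grayscale video stored as an array of shape (num_frames, height, width), `select_key_clips` gives the clip indices that would go to the 3D network, and `historical_contour_image` gives a single 2D image for the 2D network. The fusion of the two sub-networks is not sketched here because the abstract gives no detail on it.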

Original language: English
Title of host publication: Proceedings - 2019 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 6
ISBN (Electronic): 9781538692141
ISBN (Print): 9781538692158
Publication status: Published - 15 Aug 2019
Event: 2019 IEEE International Conference on Multimedia and Expo Workshops - Shanghai, China
Duration: 8 Jul 2019 → 12 Jul 2019

Conference: 2019 IEEE International Conference on Multimedia and Expo Workshops
Abbreviated title: ICMEW 2019


  • Action recognition
  • Convolution networks
  • Key clips
  • Long term actions
  • Video level


