A discriminative deep model with feature fusion and temporal attention for human action recognition

Jiahui Yu, Hongwei Gao, Wei Yang, Weihong Chin, Naoyuki Kubota, Zhaojie Ju

Research output: Contribution to journal › Article › peer-review



Activity recognition, which aims to accurately distinguish human actions in complex environments, plays a key role in human-robot/computer interaction. However, long-lasting and similar actions cause poor feature-sequence extraction and thus reduce recognition accuracy. We propose a novel discriminative deep model (D3D-LSTM) based on 3D-CNN and LSTM for both single-target and interaction action recognition to improve spatiotemporal processing performance. Our model has several notable properties: 1) a real-time feature fusion method obtains a more representative feature sequence through composition of local mixtures, enhancing the ability to discriminate similar actions; 2) an improved attention mechanism focuses on each frame individually by assigning different weights in real time; 3) an alternating optimization strategy is proposed to obtain the model parameters with the best performance. Because the proposed D3D-LSTM model is efficient enough to be used as a detector that recognizes various activities, a Real-set database is collected to evaluate action recognition in complex real-world scenarios. For long-term relations, we update the present memory state via a weight-controlled attention module that enables the memory cell to store better long-term features. A densely connected bimodal model makes the local perceptrons of the 3D-Conv motion-aware and stores better short-term features. The proposed D3D-LSTM model has been evaluated through a series of experiments on the Real-set and open-source datasets, i.e., SBU-Kinect and MSR-Action-3D. Experimental results show that the proposed D3D-LSTM model achieves new state-of-the-art results, raising the average accuracy on SBU-Kinect to 92.40% and on MSR-Action-3D to 95.40%.
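The abstract describes an attention mechanism that weights each frame individually before the sequence is pooled. The paper's exact formulation is not given here, so the following is only a minimal NumPy sketch of generic softmax temporal attention; the scoring vector `w` and the mean-pooled output are illustrative assumptions, not the authors' design.

```python
import numpy as np

def temporal_attention(frame_features, w):
    """Assign a softmax weight to each frame, then pool the sequence.

    frame_features: (T, D) array, one D-dim feature vector per frame.
    w: (D,) scoring vector (hypothetical learnable parameter).
    Returns the per-frame weights (T,) and the pooled feature (D,).
    """
    scores = frame_features @ w                     # per-frame relevance
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax weights
    return alpha, alpha @ frame_features            # weighted pooling

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))                # 8 frames, 16-dim features
alpha, pooled = temporal_attention(feats, rng.standard_normal(16))
```

Frames with higher scores contribute more to the pooled representation, which is the intuition behind suppressing uninformative frames in long-lasting or similar actions.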
Original language: English
Pages (from-to): 43243-43255
Number of pages: 13
Journal: IEEE Access
Publication status: Published - 2 Mar 2020


Keywords
  • Human action recognition
  • RGB-D
  • attention mode
  • real-time feature fusion
  • dataset


