Adaptive spatiotemporal representation learning for skeleton-based human action recognition

Jiahui Yu, Hongwei Gao, Yongquan Chen, Dalin Zhou, Jinguo Liu, Zhaojie Ju

Research output: Contribution to journal › Article › peer-review

Abstract

How do humans recognize an action or an interaction in the real world? Because viewing perspectives are diverse, even humans find it challenging to identify a regular activity when they observe it from an uncommon perspective. We argue that discriminative spatiotemporal information remains an essential cue for human action recognition. Most existing skeleton-based methods learn an optimal representation based on hand-crafted criteria, which require large amounts of labelled data and considerable human effort. This paper introduces adaptive skeleton-based neural networks that learn an optimal spatiotemporal representation automatically in a data-driven manner. First, an adaptive skeleton representation transformation method (ASRT) is proposed to model view-variation data without hand-crafted criteria. Next, a novel attentional LSTM encapsulated with 3D convolution (C3D-LSTM) enables memory blocks to learn both short-term frame dependencies and long-term relations, so the proposed model can more accurately understand long-term or complex actions. Furthermore, a data-enhancement-driven end-to-end training scheme is proposed to train key parameters with fewer training samples. Enhanced by the learned high-performance spatiotemporal representation, the proposed model achieves state-of-the-art performance on five challenging benchmarks.
Original language: English
Article number: 4
Pages (from-to): 1654-1665
Number of pages: 12
Journal: IEEE Transactions on Cognitive and Developmental Systems
Volume: 14
Issue number: 4
Early online date: 29 Nov 2021
DOIs
Publication status: Published - 1 Dec 2022

Keywords

  • human motion recognition
  • deep learning
  • attentional LSTM
  • skeleton representation
