Abstract
How do humans recognize an action or an interaction in the real world? Because viewing perspectives vary widely, it is challenging even for humans to identify a familiar activity observed from an uncommon perspective. We argue that discriminative spatiotemporal information remains an essential cue for human action recognition. Most existing skeleton-based methods learn representations under hand-crafted criteria, which require large amounts of labelled data and considerable human effort. This paper introduces adaptive skeleton-based neural networks that learn an optimal spatiotemporal representation automatically, in a data-driven manner. First, an adaptive skeleton representation transformation method (ASRT) is proposed to model view-varying data without hand-crafted criteria. Next, powered by a novel attentional LSTM encapsulated with 3D convolution (C3D-LSTM), the model's memory blocks effectively learn both short-term frame dependencies and long-term relations, so the proposed model can understand long or complex actions more accurately. Furthermore, a data-enhancement-driven end-to-end training scheme is proposed to train the key parameters with fewer training samples. Equipped with the learned high-performance spatiotemporal representation, the proposed model achieves state-of-the-art performance on five challenging benchmarks.
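To make the C3D-LSTM idea concrete, the PyTorch cell below is a minimal sketch, assuming the attentional LSTM replaces the fully connected gate transforms of a standard LSTM with a shared 3D convolution over a skeleton tensor and adds soft attention over spatiotemporal locations before the memory update. It is not the authors' implementation; the class name `C3DLSTMCell`, the `(coords, frames, joints, persons)` tensor layout, and all shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class C3DLSTMCell(nn.Module):
    """Sketch of a conv-LSTM-style cell with 3D-convolutional gates and
    soft attention, in the spirit of the C3D-LSTM described above.
    Shapes and design choices are assumptions, not the paper's code."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # One shared 3D conv emits all four gate pre-activations at once.
        self.gates = nn.Conv3d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=pad)
        # Attention scores over (frames, joints, persons) from the hidden state.
        self.attn = nn.Conv3d(hidden_channels, 1, kernel_size, padding=pad)

    def forward(self, x, state):
        # x: (batch, coords, frames, joints, persons) skeleton tensor.
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        scores = self.attn(h)                              # (B, 1, T, V, M)
        a = torch.softmax(scores.flatten(2), dim=-1).view_as(scores)
        c = f * c + i * (a * torch.tanh(g))    # attention-weighted update
        h = o * torch.tanh(c)
        return h, (h, c)

# Example: 25-joint skeletons over a 5-frame window, one person.
cell = C3DLSTMCell(in_channels=3, hidden_channels=32)
x = torch.randn(8, 3, 5, 25, 1)
h = c = torch.zeros(8, 32, 5, 25, 1)
out, (h, c) = cell(x, (h, c))
```

Computing all four gates with a single convolution is a common ConvLSTM efficiency trick; the 3D kernel lets each gate see a short temporal window and neighbouring joints at once, which is one plausible way short-term frame dependencies could be captured inside the memory block.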
| Original language | English |
|---|---|
| Article number | 4 |
| Pages (from-to) | 1654-1665 |
| Number of pages | 12 |
| Journal | IEEE Transactions on Cognitive and Developmental Systems |
| Volume | 14 |
| Issue number | 4 |
| Early online date | 29 Nov 2021 |
| DOIs | |
| Publication status | Published - 1 Dec 2022 |
Keywords
- human motion recognition
- deep learning
- attentional LSTM
- skeleton representation