How do humans recognize an action or an interaction in the real world? Due to the diversity of viewing perspectives, it is a challenge for humans to identify a regular activity when they observe it from an uncommon perspective. We argue that discriminative spatiotemporal information remains an essential cue for human action recognition. Most existing skeleton-based methods learn optimal representation based on the human-crafted criterion that requires many labelled data and much human effort. This paper introduces adaptive skeleton-based neural networks to learn optimal spatiotemporal representation automatically through a data-driven manner. First, an adaptive skeleton representation transformation method (ASRT) is proposed to model view-variation data without hand-crafted criteria. Next, powered by a novel attentional LSTM (C3D-LSTM) encapsulated with 3D-convolution, the proposed model could effectively enable memory blocks to learn short-term frame dependency and long-term relations. Hence, the proposed model can more accurately understand long-term or complex actions. Furthermore, a data enhancement driven end-to-end training scheme is proposed to train key parameters under fewer training samples. Enhanced by learned high-performance spatiotemporal representation, the proposed model achieves state-of-the-art performance on five challenging benchmarks.
|Journal||IEEE Transactions on Cognitive and Developmental Systems|
|Early online date||29 Nov 2021|
|Publication status||Early online - 29 Nov 2021|
- human motion recognition
- deep learning
- attentional LSTM
- skeleton representation