Vision-based human motion analysis with deep learning

  • Wei Zeng

Student thesis: Doctoral Thesis


Vision-based human motion analysis aims to automatically identify human behaviour from a given image or a sequence of images. It has many real-world applications, such as video-based surveillance, human-robot interaction, sign language recognition, video-assisted medical diagnosis and gesture recognition in the gaming industry. This project focuses on two sub-tasks of human motion analysis, namely human pose estimation and human action recognition. Human pose estimation is the task of learning an accurate human pose from a single image, from multiple images taken from different views, or from a few consecutive images, while human action recognition is the task of recognising a target action from a video or from another sequence of dynamic images.

Recent advances in deep learning have stimulated progress in both human pose estimation and human action recognition. However, 3D human pose estimation and the handling of temporal dependency in action recognition remain challenging problems. This thesis offers alternative approaches to these problems. For human pose estimation, an architecture based on an autoencoder, along with a novel three-stage learning scheme, is proposed. The first and second stages of this approach exploit the power of the autoencoder to learn latent representations from both pose vector and image inputs. The performance of the proposed approach has been demonstrated on a publicly available dataset. It has been shown to provide competitive performance compared with other recently published methods, in spite of the limitations of the hardware used for the experiments.

For action recognition, an architecture which combines the power of a 3D convolutional DenseNet and a recurrent neural network is proposed. The 3D architecture distinguishes itself by incorporating (2+1)D spatio-temporal convolution to facilitate optimisation, and by adopting a convolutional GRU. A further novelty of this architecture is that it alternates between dense blocks and recurrent GRU layers to repeatedly leverage the power of the (2+1)D DenseNet and the recurrent GRU. The proposed architecture is evaluated on both RGB and RGB-D datasets, where data fusion schemes are employed to combine the multi-modal data. Experiments on both RGB and RGB-D datasets show that the proposed model achieves competitive results. In particular, this performance is achieved without pretraining the model on any other dataset and without using computationally expensive optical flow as a second input stream. Our method also has a relatively small number of parameters compared to other methods. This demonstrates the effectiveness and efficiency of our proposed method.
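As a rough illustration of the (2+1)D idea mentioned above, the sketch below compares the weight count of a full 3D convolution with that of a factorised version, in which a spatial 1×d×d convolution is followed by a temporal t×1×1 convolution. The abstract does not give layer sizes, so the channel counts, the kernel sizes and the rule for choosing the intermediate channel width are all assumptions made for illustration (the width rule is one commonly used in the (2+1)D literature), and bias terms are ignored.

```python
# Illustrative sketch only: kernel sizes, channel counts and the
# intermediate-width rule are assumptions, not taken from the thesis.

def conv3d_params(c_in, c_out, t=3, d=3):
    """Weights in a full t x d x d 3D convolution (no bias)."""
    return c_in * c_out * t * d * d


def conv2plus1d_params(c_in, c_out, c_mid, t=3, d=3):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    spatial = c_in * c_mid * d * d   # spatial step maps c_in -> c_mid
    temporal = c_mid * c_out * t     # temporal step maps c_mid -> c_out
    return spatial + temporal


def matched_mid_channels(c_in, c_out, t=3, d=3):
    """Intermediate width that keeps the factorised block near the 3D weight budget."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


if __name__ == "__main__":
    c_in, c_out = 64, 64  # hypothetical layer width
    c_mid = matched_mid_channels(c_in, c_out)
    print("3D conv weights:    ", conv3d_params(c_in, c_out))
    print("intermediate width: ", c_mid)
    print("(2+1)D conv weights:", conv2plus1d_params(c_in, c_out, c_mid))
```

With the width chosen this way, the factorised block stays at roughly the same weight budget as the 3D convolution while allowing an extra non-linearity between the spatial and temporal steps, which is the usual argument for why the factorisation is easier to optimise.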
Date of Award: 2019
Original language: English
Supervisor: Honghai Liu (Supervisor)
