Attention-driven Learning for Social Interaction Recognition

  • Gongyue Zhang

Student thesis: Doctoral Thesis


"Attention-driven" can refer to the study of participants’ behavior through gaze and their interest goals, or it can refer to the attention mechanism in deep learning. In unconstrained environments, where participants are free to move without wearing any devices, estimating head pose, gaze, and facial expressions faces significant challenges due to various factors such as different eye appearances, eyelid occlusion, large head movements, varying angles of view, and lighting conditions. To address the lack of views and improve accuracy, we propose a multi-view gaze estimation approach based on head pose. Additionally, we introduce a high-definition network coupled with a high-definition RGB camera to enhance the success rate of facial expression recognition by incorporating an attention mechanism. To optimize recognition in dynamic videos, we further propose the temporal STNet to enhance the recognition experience. Moreover, considering the limited source data available for children and the wide variety of children’s facial expressions, it is challenging to implement a pre-trained model effectively, especially for ASD (Autism Spectrum Disorder) children. To overcome this limitation, we introduce transfer learning as a solution for detection tasks with limited source data.
First, a head pose estimation model was designed based on multi-view fusion. ResNet50 serves as the backbone network. During forward propagation, feature maps from different views are obtained, and the key information from the other views' feature maps is fused in while the original feature maps are retained. The fused features are used as the input to the subsequent stages, which output the three Euler angles of the head pose. The model is validated on the Dpose dataset, where the highest classification accuracy, 92.07%, is obtained with four-view features. A gaze estimation model was also designed based on multi-information fusion. When forward propagation reaches the fully connected layer, the pupil centre information from the keypoint detection module and the head pose data from the head pose estimation module are concatenated with the fully connected layer to realise multi-information fusion and obtain the two deflection angles of the eyes. This model is validated on the MPIIGaze dataset, where it achieves an average angular error of 4.9°, better than the GazeNet model.
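The fusion step and the angular error metric described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the pitch/yaw naming of the two deflection angles, and the angle-to-vector convention are all assumptions made for illustration.

```python
import math

def fuse_features(image_features, pupil_centre, head_pose):
    """Concatenate image features, pupil centre (x, y) and head pose (three Euler angles)."""
    return list(image_features) + list(pupil_centre) + list(head_pose)

def linear(inputs, weights, bias):
    """Plain fully connected layer: one output per weight row."""
    return [b + sum(w * x for w, x in zip(row, inputs))
            for row, b in zip(weights, bias)]

def gaze_to_vector(pitch, yaw):
    """Convert (pitch, yaw) in radians to a unit 3-D direction (assumed convention)."""
    return (-math.cos(pitch) * math.sin(yaw),
            -math.sin(pitch),
            -math.cos(pitch) * math.cos(yaw))

def angular_error_deg(pred, target):
    """Angle in degrees between predicted and target gaze directions."""
    a = gaze_to_vector(*pred)
    b = gaze_to_vector(*target)
    dot = sum(x * y for x, y in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))

# Hypothetical values: 2 image features + pupil centre + head pose = 7 inputs.
fused = fuse_features([0.1, 0.2], (0.5, 0.5), (0.0, 0.1, 0.2))
weights = [[0.1] * 7, [-0.1] * 7]  # illustrative weights, not trained
pitch, yaw = linear(fused, weights, [0.0, 0.0])
print(angular_error_deg((pitch, yaw), (0.0, 0.0)))
```

Averaging `angular_error_deg` over a test set would give a mean angular error of the kind reported for MPIIGaze, under the assumed conventions.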
Secondly, facial expression recognition is studied on high-definition static images and dynamic videos. For static images, this thesis combines an attention mechanism with a high-resolution network and designs the SE-HRNet model, which attends to the importance of each channel while maintaining high-resolution information. Depthwise separable convolution is used to reduce the number of network parameters, and FRN is used to optimise the normalisation method; an accuracy of 97.65% is obtained on the CK+ dataset. For dynamic videos, this thesis combines a convolutional neural network with a long short-term memory (LSTM) network. Temporal and spatial information are merged, and the receptive field of the network is enlarged using dilated convolution. An accuracy of 98.6% is obtained on the CK+ dataset.
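The two convolution choices mentioned above can be made concrete with a little arithmetic. The sketch below uses the standard textbook formulas, ignores bias terms, and uses layer sizes chosen purely for illustration; none of the numbers are taken from the thesis.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

def effective_kernel_size(k, dilation):
    """Span covered by a k-tap kernel whose taps are spaced `dilation` apart."""
    return k + (k - 1) * (dilation - 1)

# Illustrative layer: 3 x 3 kernel, 64 input channels, 128 output channels.
print(standard_conv_params(3, 64, 128))        # 73728
print(depthwise_separable_params(3, 64, 128))  # 8768
print(effective_kernel_size(3, 2))             # 5
```

For this example layer the separable form needs roughly an eighth of the weights, and a dilation of 2 lets a 3 x 3 kernel cover a 5 x 5 window without adding parameters.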
Thirdly, different transfer strategies are used to compare the transfer capabilities of various deep neural network models on a database of children’s facial expression images. The classification performance of models transferred from heterogeneous domains to the children’s facial expression recognition task is compared. When only the top-layer parameters of the VGG-Face model are trained, the recognition rate reaches 79.8%, which is of great value in practical applications. The advantages of domain adaptation methods for children’s facial expression classification data with ten or fewer samples per category were also studied. Our results suggest that domain adaptation structures with pre-trained weights learned from adult facial expression data improve model performance. With the domain adaptation method, the recognition rate is 85.1%, better than fine-tuning alone.
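The idea of training only the top layer of a pre-trained model can be sketched on a toy problem. Everything in the sketch is hypothetical: a tiny fixed linear-plus-ReLU layer stands in for the pre-trained backbone (it is not VGG-Face), the head is a logistic-regression classifier, and the data are made up.

```python
import math

def frozen_backbone(x, weights):
    """Fixed feature extractor: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_head(samples, labels, backbone_w, epochs=200, lr=0.1):
    """Fit a logistic-regression head on frozen features; backbone_w is never updated."""
    feats = [frozen_backbone(x, backbone_w) for x in samples]
    head_w = [0.0] * len(backbone_w)
    head_b = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for f, y in zip(feats, labels):
            p = sigmoid(head_b + sum(w * fi for w, fi in zip(head_w, f)))
            total -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
            grad = p - y  # gradient of the cross-entropy loss w.r.t. the logit
            head_w = [w - lr * grad * fi for w, fi in zip(head_w, f)]
            head_b -= lr * grad
        losses.append(total / len(feats))
    return head_w, head_b, losses

# Toy data: two classes, two samples each (illustrative only).
backbone_w = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
samples = [(1.0, 0.0), (0.9, 0.2), (0.0, 1.0), (0.1, 0.8)]
labels = [1, 1, 0, 0]
head_w, head_b, losses = train_head(samples, labels, backbone_w)
print(losses[0], losses[-1])
```

Because only `head_w` and `head_b` receive gradient updates, the number of trainable parameters stays small, which is the usual motivation for this strategy when few labelled samples are available.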
To summarize, the main contribution of the thesis is the development and improvement of several algorithms for head pose estimation, gaze estimation and facial expression detection, optimised through transfer learning for a limited sample of participants.
Date of Award: 9 Jan 2023
Original language: English
Awarding Institution
  • University of Portsmouth
Supervisors: Honghai Liu, Zhaojie Ju & Mo Adda
