Abstract
With the advent of autonomous driving technology, in-vehicle driver monitoring systems are becoming increasingly important. Such systems can capture the driver's pose and can be used in many driver monitoring research programs. However, recovering 3D driver pose from 2D data has been challenging due to depth ambiguity and self-occlusion. This paper presents a spatial-temporal model for video-based 2D-to-3D human pose estimation, called GTFormer, which combines a graph convolutional network (GCN) with a transformer. To improve the effectiveness of extracting spatial-temporal information, the task is divided into two stages. In the first stage, a masked pose modeling method is introduced as a self-supervised subtask: human joints randomly masked in time and space are taken as input, which gives the model a good initialization. In the second stage, the pre-trained model is coupled with a regression head to predict the 3D pose of the current frame. The network uses a transformer module as the temporal feature extractor and a module combining the self-attention mechanism and GCN as the spatial feature extractor. GTFormer achieves an MPJPE of 33.84 mm with ground truth as input and 44.50 mm with CPN detections as input on the Human3.6M dataset. Experimental results show that GTFormer achieves state-of-the-art performance at 1725 FPS. We performed quantitative experiments on the indoor dataset Human3.6M and the in-vehicle dataset Drive&Act; the results show that the proposed method performs well for both indoor human pose estimation and driver pose estimation in the confined space of a car.
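To make the first-stage masked pose modeling concrete, the sketch below illustrates the general idea of that self-supervised pretraining step in PyTorch. It is a minimal sketch under stated assumptions, not the paper's implementation: all names (`MaskedPoseEncoder`, `random_mask`), shapes, and hyperparameters are hypothetical, and the GCN-plus-attention spatial module the paper describes is omitted, leaving only a temporal transformer over per-frame joint embeddings.

```python
# Minimal sketch of masked-pose-modeling pretraining as described in the abstract.
# All module names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class MaskedPoseEncoder(nn.Module):
    """Toy stand-in for the GTFormer encoder: a transformer over time applied to
    per-frame joint embeddings (the paper's GCN/attention spatial module is
    omitted here for brevity)."""
    def __init__(self, num_joints=17, dim=64, depth=2, heads=4):
        super().__init__()
        self.embed = nn.Linear(num_joints * 2, dim)        # flatten 2D joints per frame
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=depth)
        self.reconstruct = nn.Linear(dim, num_joints * 2)  # pretraining head

    def forward(self, x):
        # x: (batch, frames, joints, 2) sequence of 2D keypoints
        b, t, j, c = x.shape
        h = self.embed(x.reshape(b, t, j * c))
        h = self.temporal(h)
        return self.reconstruct(h).reshape(b, t, j, c)

def random_mask(x, ratio=0.3):
    """Zero out a random subset of joints across time and space,
    returning the masked sequence and the boolean mask."""
    mask = torch.rand(x.shape[:3]) < ratio                 # (batch, frames, joints)
    x_masked = x.clone()
    x_masked[mask] = 0.0
    return x_masked, mask

# One self-supervised pretraining step: reconstruct the masked joints.
model = MaskedPoseEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
poses2d = torch.randn(8, 27, 17, 2)                        # dummy batch of 2D sequences
masked, mask = random_mask(poses2d)
pred = model(masked)
loss = ((pred - poses2d)[mask] ** 2).mean()                # loss only on masked joints
loss.backward()
opt.step()
```

In the second stage described in the abstract, such a pre-trained encoder would be reused with a regression head that outputs 3D joint positions for the current frame instead of the reconstruction head.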
Original language | English |
---|---|
Journal | IEEE Transactions on Intelligent Vehicles |
Early online date | 12 Oct 2023 |
DOIs | |
Publication status | Early online - 12 Oct 2023 |
Keywords
- 3D driver body pose estimation
- GCN
- transformer
- pre-training
- autonomous driving system