Frequency-enhanced spatio-temporal criss-cross attention for video-based 3D human pose estimation

Xianfeng Cheng, Zhaojie Ju, Qing Gao*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Recent transformer-based solutions have achieved remarkable success in video-based 3D human pose estimation. However, the computational cost grows quadratically with the number of joints and frames when computing the joint-to-joint affinity matrix. In this work, we propose a novel Frequency-Enhanced Spatio-Temporal Criss-Cross Transformer model (FSTCFormer) that addresses this computational challenge from two perspectives: a compact frequency-domain representation of the input data and a learned decomposition of the attention computation. FSTCFormer first applies a discrete cosine transform (DCT) and low-pass filtering to the input sequence to extract the low-frequency coefficients, which capture trend information and compress the input data. The Frequency-Enhanced Spatio-Temporal Criss-Cross (FSTC) block then uniformly divides the compressed features along the channel dimension into two partitions and performs spatial and temporal attention on each partition separately. A learnable Freq MLP is introduced into the attention computation to further exploit the frequency-domain data. By concatenating the outputs of its attention layers, each FSTC block models the interactions among joints within the same frame, joints along the same trajectory, and joints fused across multiple frames. Extensive experiments on the Human3.6M dataset demonstrate that FSTCFormer achieves a better speed-accuracy trade-off than state-of-the-art (SOTA) methods.
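As a rough illustration of the pipeline described in the abstract, the NumPy sketch below compresses a pose sequence with a DCT along the time axis and then applies criss-cross attention over two channel partitions (spatial attention over joints, temporal attention over frames). All names (dct_compress, criss_cross_block, keep_ratio) and tensor shapes are illustrative assumptions, not the authors' implementation; the learnable Freq MLP and the final pose decoding are omitted.

```python
import numpy as np
from scipy.fft import dct


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    # Plain scaled dot-product self-attention over the rows of x: (N, C).
    # The (N, N) affinity matrix is the quadratic-cost term the paper targets.
    scores = softmax(x @ x.T / np.sqrt(x.shape[-1]), axis=-1)
    return scores @ x


def dct_compress(seq, keep_ratio=0.25):
    # seq: (T, J, C) per-joint features over T frames.
    # DCT along time; keep only the lowest-frequency (trend) coefficients.
    coeffs = dct(seq, axis=0, norm="ortho")
    k = max(1, int(seq.shape[0] * keep_ratio))
    return coeffs[:k]  # compressed sequence: (k, J, C)


def criss_cross_block(z):
    # z: (k, J, C). Split channels into two partitions; run spatial attention
    # (over joints, per frame) on one partition and temporal attention
    # (over frames, per joint) on the other, then concatenate the outputs.
    k, J, C = z.shape
    z_spatial, z_temporal = z[..., : C // 2], z[..., C // 2:]
    out_s = np.stack([self_attention(z_spatial[t]) for t in range(k)])
    out_t = np.stack([self_attention(z_temporal[:, j]) for j in range(J)], axis=1)
    return np.concatenate([out_s, out_t], axis=-1)


# Toy usage: 81 frames, 17 joints, 32-dim features.
seq = np.random.randn(81, 17, 32)
z = dct_compress(seq)        # frequency-domain compression: (20, 17, 32)
y = criss_cross_block(z)     # criss-cross spatio-temporal attention: (20, 17, 32)
print(z.shape, y.shape)
```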

Original language: English
Title of host publication: ICARM 2024 - 2024 9th IEEE International Conference on Advanced Robotics and Mechatronics
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 984-989
Number of pages: 6
ISBN (Electronic): 9798350385724, 9798350385717
ISBN (Print): 9798350385731
DOIs
Publication status: Published - 18 Oct 2024
Event: 9th IEEE International Conference on Advanced Robotics and Mechatronics, ICARM 2024 - Tokyo, Japan
Duration: 8 Jul 2024 – 10 Jul 2024

Publication series

Name: International Conference on Advanced Robotics and Mechatronics
Publisher: IEEE
ISSN (Print): 2993-4982
ISSN (Electronic): 2993-4990

Conference

Conference: 9th IEEE International Conference on Advanced Robotics and Mechatronics, ICARM 2024
Country/Territory: Japan
City: Tokyo
Period: 8/07/24 – 10/07/24

Keywords

  • Three-dimensional displays
  • Accuracy
  • Frequency-domain analysis
  • Computational modeling
  • Pose estimation
  • Transformers
  • Frequency estimation
  • Computational efficiency
  • Trajectory
  • Discrete cosine transforms
