TY - GEN
T1 - Frequency-enhanced spatio-temporal criss-cross attention for video-based 3D human pose estimation
AU - Cheng, Xianfeng
AU - Ju, Zhaojie
AU - Gao, Qing
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024/10/18
Y1 - 2024/10/18
N2 - Recent transformer-based solutions have achieved remarkable success in video-based 3D human pose estimation. However, the computational cost grows quadratically with the number of joints and frames when computing the joint-to-joint affinity matrix. In this work, we propose a novel Frequency-Enhanced Spatio-Temporal Criss-Cross Transformer model (FSTCFormer) to address the computational challenge from two perspectives: leveraging the compact frequency-domain representation of the input data and a relevant learned decomposition. Our FSTCFormer first performs a discrete cosine transform (DCT) and filtering on the input sequence to extract low-frequency coefficients, capturing trend information and compressing the input data. Then, the Frequency-Enhanced Spatio-Temporal Criss-Cross (FSTC) block uniformly divides the compressed features along the channel dimension into two partitions and separately performs spatial and temporal attention on each partition. A learnable Freq MLP is introduced during the attention computation to further enhance the utilization of frequency-domain data. By concatenating the outputs of the attention layers, each FSTC block can model the interactions among joints within the same frame, joints along the same trajectory, and joints fused across multiple frames. Extensive experiments on the Human3.6M dataset demonstrate that our FSTCFormer achieves a better trade-off between speed and accuracy than state-of-the-art (SOTA) methods.
AB - Recent transformer-based solutions have achieved remarkable success in video-based 3D human pose estimation. However, the computational cost grows quadratically with the number of joints and frames when computing the joint-to-joint affinity matrix. In this work, we propose a novel Frequency-Enhanced Spatio-Temporal Criss-Cross Transformer model (FSTCFormer) to address the computational challenge from two perspectives: leveraging the compact frequency-domain representation of the input data and a relevant learned decomposition. Our FSTCFormer first performs a discrete cosine transform (DCT) and filtering on the input sequence to extract low-frequency coefficients, capturing trend information and compressing the input data. Then, the Frequency-Enhanced Spatio-Temporal Criss-Cross (FSTC) block uniformly divides the compressed features along the channel dimension into two partitions and separately performs spatial and temporal attention on each partition. A learnable Freq MLP is introduced during the attention computation to further enhance the utilization of frequency-domain data. By concatenating the outputs of the attention layers, each FSTC block can model the interactions among joints within the same frame, joints along the same trajectory, and joints fused across multiple frames. Extensive experiments on the Human3.6M dataset demonstrate that our FSTCFormer achieves a better trade-off between speed and accuracy than state-of-the-art (SOTA) methods.
KW - Three-dimensional displays
KW - Accuracy
KW - Frequency-domain analysis
KW - Computational modeling
KW - Pose estimation
KW - Transformers
KW - Frequency estimation
KW - Computational efficiency
KW - Trajectory
KW - Discrete cosine transforms
UR - http://www.scopus.com/inward/record.url?scp=85208044256&partnerID=8YFLogxK
U2 - 10.1109/ICARM62033.2024.10715852
DO - 10.1109/ICARM62033.2024.10715852
M3 - Conference contribution
AN - SCOPUS:85208044256
SN - 9798350385731
T3 - International Conference on Advanced Robotics and Mechatronics
SP - 984
EP - 989
BT - ICARM 2024 - 2024 9th IEEE International Conference on Advanced Robotics and Mechatronics
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 9th IEEE International Conference on Advanced Robotics and Mechatronics, ICARM 2024
Y2 - 8 July 2024 through 10 July 2024
ER -