TY - GEN
T1 - Fine-grained action recognition using cross-modal attention network for human-robot sign language interaction
AU - Hu, Jing
AU - Gao, Qing
AU - Cheng, Xianfeng
AU - Li, Xuerui
AU - Ju, Zhaojie
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2026/1/28
Y1 - 2026/1/28
AB - With the growing demand for barrier-free communication among the deaf and mute, human-robot sign language interaction has gradually gained attention as an auxiliary tool. Action recognition is a crucial source of information for robots to understand human behavior, enabling them to recognize signs and interact naturally with deaf people. However, existing Human-Robot Interaction (HRI) technologies based on action recognition mainly focus on coarse-grained human movements and fail to capture and respond to nuanced actions in real-world scenarios. Additionally, multi-modal action recognition often integrates modalities through early or late fusion without exploring the relationships between them, which loses some correlated information. To enable robots to adapt to diverse scenarios and gain a more nuanced understanding of human behavior, we propose a fine-grained action recognition framework using a Cross-Modal Attention (CMA) network based on RGB and skeleton data. First, holistic features including the face, hands, and body are extracted by a pose estimator, effectively representing intricate human actions. Next, to fully leverage the extracted fine-grained features, the skeleton is represented as heatmap volumes. Finally, a Cross-Attention Interaction (CAI) module is designed to explore the intrinsic connections between RGB and skeleton, letting each modality learn the other's advantageous features in the deep layers of feature extraction and thereby achieving information interaction. In addition, HRI experiments are conducted on the large-scale fine-grained action dataset WLASL2000. In this HRI system, the robotic arm responds by performing sign language aligned with the human actions identified by CMA, showcasing the practicality and effectiveness of the proposed model in real-world scenarios.
UR - https://www.scopus.com/pages/publications/105033158012
U2 - 10.1109/SMC58881.2025.11343724
DO - 10.1109/SMC58881.2025.11343724
M3 - Conference contribution
AN - SCOPUS:105033158012
SN - 9798331533595
T3 - Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics
SP - 7447
EP - 7453
BT - 2025 IEEE International Conference on Systems, Man, and Cybernetics
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2025
Y2 - 5 October 2025 through 8 October 2025
ER -