
Fine-grained action recognition using cross-modal attention network for human-robot sign language interaction

Jing Hu, Qing Gao*, Xianfeng Cheng, Xuerui Li, Zhaojie Ju

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

With the growing demand for barrier-free communication for deaf and mute people, human-robot sign language interaction has gained attention as an assistive tool. Action recognition is a crucial source of information for robots to understand human behavior, enabling them to recognize signs and interact naturally with deaf people. However, existing Human-Robot Interaction (HRI) technologies based on action recognition mainly focus on coarse-grained human movements and fail to capture and respond to nuanced actions in real-world scenarios. In addition, multi-modal action recognition often relies on early or late fusion to integrate modalities without exploring the relationships between them, which loses some correlated information. To enable robots to adapt to diverse scenarios and understand human behavior at a finer level, we propose a fine-grained action recognition framework using a Cross-Modal Attention network (CMA) based on RGB and skeleton data. First, holistic features including face, hand, and body are extracted by a pose estimator to represent intricate human actions. Next, to fully leverage the extracted fine-grained features, the skeleton is represented as heatmap volumes. Finally, a Cross-Attention Interaction (CAI) module is designed to explore the intrinsic connections between RGB and skeleton, so that each modality learns the other's advantageous features in the deep layers of feature extraction, thereby achieving information interaction. HRI experiments are also conducted on the large-scale fine-grained action dataset WLASL2000. In this HRI system, the robotic arm responds by performing sign language aligned with the human actions identified by CMA, demonstrating the practicality and effectiveness of the proposed model in real-world scenarios.
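The two technical steps named in the abstract, rendering the skeleton as heatmap volumes and letting RGB and skeleton features attend to each other, can be illustrated with a minimal NumPy sketch. The abstract does not specify the CAI module's internals, so this is not the authors' implementation: it assumes Gaussian keypoint heatmaps and generic scaled dot-product cross-attention with a residual update, and every function name, shape, and parameter below is hypothetical.

```python
import numpy as np


def keypoints_to_heatmaps(keypoints, height, width, sigma=2.0):
    """Render K keypoints given as (x, y, confidence) rows into K Gaussian heatmaps.

    Assumed representation: one heatmap per joint, peaked at the joint location
    and scaled by its confidence. Returns an array of shape (K, H, W).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    x = keypoints[:, 0, None, None]
    y = keypoints[:, 1, None, None]
    c = keypoints[:, 2, None, None]
    return c * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Tokens of one modality (queries) attend over tokens of the other (context)."""
    q = queries @ w_q  # (Nq, d)
    k = context @ w_k  # (Nc, d)
    v = context @ w_v  # (Nc, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (Nq, Nc), rows sum to 1
    return weights @ v, weights


def cross_attention_interaction(rgb, skel, params):
    """Hypothetical bidirectional exchange: each stream is updated from the other."""
    rgb_update, _ = cross_attention(rgb, skel, *params["rgb_from_skel"])
    skel_update, _ = cross_attention(skel, rgb, *params["skel_from_rgb"])
    return rgb + rgb_update, skel + skel_update  # residual connections (assumed)


# Toy data: 8 RGB tokens and 5 skeleton tokens with feature size 16 (all assumed).
rng = np.random.default_rng(0)
d = 16
rgb_tokens = rng.normal(size=(8, d))
skel_tokens = rng.normal(size=(5, d))
params = {
    name: [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]
    for name in ("rgb_from_skel", "skel_from_rgb")
}
rgb_out, skel_out = cross_attention_interaction(rgb_tokens, skel_tokens, params)

# Two keypoints repeated over 3 frames, stacked into a (K, T, H, W) heatmap volume.
kps = np.array([[4.0, 6.0, 1.0], [10.0, 2.0, 0.5]])
volume = np.stack([keypoints_to_heatmaps(kps, 16, 16) for _ in range(3)], axis=1)
```

In this sketch the attention weights have shape (number of query tokens, number of context tokens), so each RGB token receives a weighted mix of skeleton features and vice versa, rather than the two streams being concatenated once as in early or late fusion.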

Original language: English
Title of host publication: 2025 IEEE International Conference on Systems, Man, and Cybernetics
Subtitle of host publication: Navigating Frontiers: Smart Systems for a Dynamic World, SMC 2025 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 7447-7453
Number of pages: 7
ISBN (Electronic): 9798331533588, 9798331533571
ISBN (Print): 9798331533595
DOIs
Publication status: Published - 28 Jan 2026
Event: 2025 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2025 - Hybrid, Vienna, Austria
Duration: 5 Oct 2025 to 8 Oct 2025

Publication series

Name: Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics
ISSN (Print): 1062-922X
ISSN (Electronic): 2577-1655

Conference

Conference: 2025 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2025
Country/Territory: Austria
City: Hybrid, Vienna
Period: 5/10/25 to 8/10/25
