Recognizing video activities in the wild via view-to-scene joint learning

Jiahui Yu, Yifan Chen, Xuna Wang, Xu Cheng, Zhaojie Ju, Yingke Xu*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Recognizing video actions in the wild is challenging for visual control systems. In-the-wild videos contain actions not seen in the training data, recorded from varying viewpoints and scenes yet sharing the same labels. Most existing methods address this challenge by developing complex frameworks to extract spatiotemporal features. To achieve view robustness and scene generalization cost-effectively, we explore view consistency and joint scene understanding. Based on this, we propose a neural network (called Wild-VAR) that learns view and scene information jointly without any 3D pose ground-truth labels, a new approach to recognizing video actions in the wild. Unlike most existing methods, we first propose a Cubing module that self-learns body consistency between views rather than comprehensive image features, boosting generalization performance in cross-view settings. Specifically, we map 3D representations to multiple 2D features and then adopt a self-adaptive scheme to constrain the 2D features from different perspectives. Moreover, we propose temporal neural networks (called T-Scene) to build a recognition framework, enabling Wild-VAR to flexibly learn scenes across time, including key interactors and context, in video sequences. Extensive experiments show that Wild-VAR consistently outperforms state-of-the-art methods on four benchmarks. Notably, with only half the computation cost, Wild-VAR improves accuracy by 2.2% and 1.3% on the Kinetics-400 and Something-Something V2 datasets, respectively.

Note to Practitioners: In human-robot interaction tasks, video action recognition is a prerequisite for visual control. In real applications, humans move freely in 3D space, which produces significant changes in the capture viewpoint and constantly changing scenes. Deep neural networks are limited by the perspectives and scenarios contained in the training data, so most existing methods are effective only for identifying actions from two to four fixed views against a single background. As a result, existing models are often difficult to generalize to unconstrained application environments. Human view and video scene understanding are usually treated separately. Inspired by the human visual system, this paper proposes a cost-efficient view-to-scene video processing method. In real-world applications, this lightweight method can be integrated into robots to help identify human behavior in complex environments. Fewer parameters mean the method can be easily transferred to different types of behaviors, and the reduced computational cost enables real-time performance under limited hardware conditions.
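As a rough illustration of the view-consistency idea sketched in the abstract (this is not the authors' Wild-VAR or Cubing implementation; the module name, tensor shapes, pooling-based projection, and cosine penalty below are assumptions made only for the sketch), the following PyTorch-style snippet projects a 3D feature volume onto three orthogonal 2D feature maps and penalizes disagreement between them after a shared embedding:

    # Hypothetical sketch of a view-consistency constraint (not the published Wild-VAR code).
    # Assumption: a 3D feature volume (B, C, D, H, W) is projected to three 2D "views"
    # by average-pooling along each spatial axis; a shared 1x1 embedding then lets us
    # compare the views and penalize inconsistency between them.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ViewConsistencySketch(nn.Module):
        def __init__(self, channels: int, embed_dim: int = 128):
            super().__init__()
            # Shared 1x1 convolution that embeds each projected 2D view into a common space.
            self.embed = nn.Conv2d(channels, embed_dim, kernel_size=1)

        def forward(self, volume: torch.Tensor) -> torch.Tensor:
            # volume: (B, C, D, H, W) 3D representation of the body/scene.
            views = [
                volume.mean(dim=2),  # project along depth  -> (B, C, H, W)
                volume.mean(dim=3),  # project along height -> (B, C, D, W)
                volume.mean(dim=4),  # project along width  -> (B, C, D, H)
            ]
            # Embed each view, then pool to one global descriptor per view: (B, embed_dim).
            descriptors = [self.embed(v).mean(dim=(2, 3)) for v in views]
            # Consistency loss: pairwise cosine distance between normalized view descriptors.
            loss = volume.new_zeros(())
            for i in range(len(descriptors)):
                for j in range(i + 1, len(descriptors)):
                    a = F.normalize(descriptors[i], dim=1)
                    b = F.normalize(descriptors[j], dim=1)
                    loss = loss + (1.0 - (a * b).sum(dim=1)).mean()
            return loss

In a training loop, such a term would simply be added to the recognition loss with a weighting factor, encouraging the 2D features derived from different perspectives of the same 3D representation to agree.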

Original language: English
Pages (from-to): 5816-5827
Number of pages: 12
Journal: IEEE Transactions on Automation Science and Engineering
Volume: 22
Early online date: 23 Jul 2024
DOIs
Publication status: Published - 7 Mar 2025

Keywords

  • 3D pose
  • Human motion analysis
  • human-machine interactions
  • video understanding
