TY - JOUR
T1 - Recognizing video activities in the wild via view-to-scene joint learning
AU - Yu, Jiahui
AU - Chen, Yifan
AU - Wang, Xuna
AU - Cheng, Xu
AU - Ju, Zhaojie
AU - Xu, Yingke
N1 - Publisher Copyright:
© 2004-2012 IEEE.
PY - 2025/3/7
Y1 - 2025/3/7
N2 - Recognizing video actions in the wild is challenging for visual control systems. In-the-wild videos contain actions not seen in the training data, recorded from various angles and in varied scenes yet sharing the same labels. Most existing methods address this challenge by developing complex frameworks to extract spatiotemporal features. To achieve view robustness and scene generalization cost-effectively, we explore view consistency and joint scene understanding. Based on this, we propose a neural network (called Wild-VAR) that learns view and scene information jointly without any 3D pose ground-truth labels, a new approach to recognizing video actions in the wild. Unlike most existing methods, we first propose a Cubing module to self-learn body consistency between views instead of comprehensive image features, boosting generalization performance in cross-view settings. Specifically, we map 3D representations to multiple 2D features and then adopt a self-adaptive scheme to constrain 2D features from different perspectives. Moreover, we propose a temporal neural network (called T-Scene) to build the recognition framework, enabling Wild-VAR to flexibly learn scenes across time, including key interactors and context, in video sequences. Extensive experiments show that Wild-VAR consistently outperforms state-of-the-art methods on four benchmarks. Notably, with only half the computation cost, Wild-VAR improves accuracy by 2.2% and 1.3% on the Kinetics-400 and Something-Something V2 datasets, respectively. Note to Practitioners—In human-robot interaction tasks, video action recognition is a prerequisite for visual control. In real applications, humans move freely in 3D space, which results in significant changes in camera viewpoint and constantly changing scenes. Deep neural networks are limited by the perspectives and scenarios contained in the training data, so most existing methods are effective only for identifying actions from 2-4 fixed views against a single background. Existing models therefore often fail to generalize to unconstrained application environments. In addition, human view and video scene understanding are usually treated separately. Inspired by the human visual system, this paper proposes a cost-efficient view-to-scene video processing method. In real-world applications, this lightweight method can be integrated into robots to help recognize human behavior in complex environments. The smaller parameter count allows the method to be easily transferred to different types of behaviors, and the reduced computational cost enables real-time performance under limited hardware conditions.
AB - Recognizing video actions in the wild is challenging for visual control systems. In-the-wild videos contain actions not seen in the training data, recorded from various angles and in varied scenes yet sharing the same labels. Most existing methods address this challenge by developing complex frameworks to extract spatiotemporal features. To achieve view robustness and scene generalization cost-effectively, we explore view consistency and joint scene understanding. Based on this, we propose a neural network (called Wild-VAR) that learns view and scene information jointly without any 3D pose ground-truth labels, a new approach to recognizing video actions in the wild. Unlike most existing methods, we first propose a Cubing module to self-learn body consistency between views instead of comprehensive image features, boosting generalization performance in cross-view settings. Specifically, we map 3D representations to multiple 2D features and then adopt a self-adaptive scheme to constrain 2D features from different perspectives. Moreover, we propose a temporal neural network (called T-Scene) to build the recognition framework, enabling Wild-VAR to flexibly learn scenes across time, including key interactors and context, in video sequences. Extensive experiments show that Wild-VAR consistently outperforms state-of-the-art methods on four benchmarks. Notably, with only half the computation cost, Wild-VAR improves accuracy by 2.2% and 1.3% on the Kinetics-400 and Something-Something V2 datasets, respectively. Note to Practitioners—In human-robot interaction tasks, video action recognition is a prerequisite for visual control. In real applications, humans move freely in 3D space, which results in significant changes in camera viewpoint and constantly changing scenes. Deep neural networks are limited by the perspectives and scenarios contained in the training data, so most existing methods are effective only for identifying actions from 2-4 fixed views against a single background. Existing models therefore often fail to generalize to unconstrained application environments. In addition, human view and video scene understanding are usually treated separately. Inspired by the human visual system, this paper proposes a cost-efficient view-to-scene video processing method. In real-world applications, this lightweight method can be integrated into robots to help recognize human behavior in complex environments. The smaller parameter count allows the method to be easily transferred to different types of behaviors, and the reduced computational cost enables real-time performance under limited hardware conditions.
KW - 3D pose
KW - Human motion analysis
KW - human-machine interactions
KW - video understanding
UR - http://www.scopus.com/inward/record.url?scp=86000591245&partnerID=8YFLogxK
U2 - 10.1109/TASE.2024.3431128
DO - 10.1109/TASE.2024.3431128
M3 - Article
AN - SCOPUS:86000591245
SN - 1545-5955
VL - 22
SP - 5816
EP - 5827
JO - IEEE Transactions on Automation Science and Engineering
JF - IEEE Transactions on Automation Science and Engineering
ER -