TY - JOUR
T1 - Video object detection considering dynamic neighborhood feature multiplexing
AU - Yu, Jiahui
AU - Chen, Yifan
AU - Wang, Xuna
AU - Chen, Long
AU - Chen, Hang
AU - Zhou, Dalin
AU - Xu, Yingke
AU - Ju, Zhaojie
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2025/6/5
Y1 - 2025/6/5
N2 - Video object detection is essential for human-interaction applications, including bimanual manipulation sensing (BMS). In practical applications, video detection performance still needs improvement because it is restricted by long-range spatiotemporal dependency analysis. How do humans sense bimanual manipulation in videos, especially in deteriorated clips? We argue that humans analyze current clips based on earlier memory, namely, long-term spatial and temporal dependencies (LTSTD). However, most existing methods have yet to report significant results because they explore these dependencies only to a limited extent. For future applications, developing an easy-to-integrate module is generally preferable to designing a complex end-to-end framework. Therefore, in this article we propose a dynamic neighborhood feature multiplexing (DNFM) mechanism for online video object detection, which learns LTSTD in a flexible and robust way and boosts existing detection results. Specifically, we develop dynamic memory enhancement neural networks for better long-term feature aggregation with negligible additional computation cost. We multiplex each frame's features to aggregate key enhanced representations under the guidance of dynamic memory recall. DNFM improves various well-known detectors on BMS and other challenging detection tasks, with particular attention devoted to “low-quality” frame detection. Experimental results show that DNFM achieves state-of-the-art detection performance while demonstrating an easy-to-integrate operation for boosting video object detection results.
AB - Video object detection is essential for human-interaction applications, including bimanual manipulation sensing (BMS). In practical applications, video detection performance still needs improvement because it is restricted by long-range spatiotemporal dependency analysis. How do humans sense bimanual manipulation in videos, especially in deteriorated clips? We argue that humans analyze current clips based on earlier memory, namely, long-term spatial and temporal dependencies (LTSTD). However, most existing methods have yet to report significant results because they explore these dependencies only to a limited extent. For future applications, developing an easy-to-integrate module is generally preferable to designing a complex end-to-end framework. Therefore, in this article we propose a dynamic neighborhood feature multiplexing (DNFM) mechanism for online video object detection, which learns LTSTD in a flexible and robust way and boosts existing detection results. Specifically, we develop dynamic memory enhancement neural networks for better long-term feature aggregation with negligible additional computation cost. We multiplex each frame's features to aggregate key enhanced representations under the guidance of dynamic memory recall. DNFM improves various well-known detectors on BMS and other challenging detection tasks, with particular attention devoted to “low-quality” frame detection. Experimental results show that DNFM achieves state-of-the-art detection performance while demonstrating an easy-to-integrate operation for boosting video object detection results.
KW - Attention
KW - enhancement
KW - interaction
KW - video detection
UR - https://www.scopus.com/pages/publications/105007675922
U2 - 10.1109/TSMC.2025.3572123
DO - 10.1109/TSMC.2025.3572123
M3 - Article
AN - SCOPUS:105007675922
SN - 2168-2216
JO - IEEE Transactions on Systems, Man, and Cybernetics: Systems
JF - IEEE Transactions on Systems, Man, and Cybernetics: Systems
ER -