Abstract
Global temporal information and local semantic information are essential cues for high-performance online video object detection. However, despite their promising detection accuracy in most cases, most state-of-the-art approaches suffer from two limitations: ineffective background/scale suppression and inadequate mining of temporal information between frames. Moreover, many existing methods learn temporal information from only a single frame. In this article, we propose an attentional global–local information learning network, one of the first attempts to fully exploit both types of information across frames. Attention maps are used in a novel way to transfer temporal contexts between frames, which also effectively alleviates the adverse effects of scale changes. Furthermore, built on a carefully designed framework, the proposed detector makes effective use of multilevel feature extraction. With these contributions, the proposed detector achieves state-of-the-art performance on challenging benchmarks. Finally, practical experiments are conducted on a space human–robot interaction platform.
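For illustration only, the sketch below shows one plausible way attention maps could transfer temporal context from a previous frame's features into the current frame, as the abstract describes. It is a minimal PyTorch sketch under assumed shapes and module names (`TemporalAttention`, `key_dim`, the 38×38 SSD-style feature level), not the authors' released implementation.

```python
# Hypothetical sketch of attention-based temporal context transfer
# between video frames; names and shapes are assumptions, not the
# paper's published code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalAttention(nn.Module):
    """Aggregate a previous frame's features into the current frame
    via a dot-product attention map (one illustrative variant)."""

    def __init__(self, channels: int, key_dim: int = 64):
        super().__init__()
        self.query = nn.Conv2d(channels, key_dim, kernel_size=1)
        self.key = nn.Conv2d(channels, key_dim, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, curr: torch.Tensor, prev: torch.Tensor) -> torch.Tensor:
        b, c, h, w = curr.shape
        q = self.query(curr).flatten(2).transpose(1, 2)   # (B, HW, key_dim)
        k = self.key(prev).flatten(2)                     # (B, key_dim, HW)
        v = self.value(prev).flatten(2).transpose(1, 2)   # (B, HW, C)
        # Attention map: how each current-frame location attends to the
        # previous frame; being location-to-location, it is insensitive
        # to where (and at what scale) an object reappears.
        attn = F.softmax(q @ k / k.shape[1] ** 0.5, dim=-1)  # (B, HW, HW)
        context = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return curr + context  # residual fusion of temporal context


# Usage: fuse backbone features of two consecutive frames.
feat_t = torch.randn(1, 256, 38, 38)   # current frame (one feature level)
feat_prev = torch.randn(1, 256, 38, 38)  # previous frame
fused = TemporalAttention(256)(feat_t, feat_prev)
print(fused.shape)  # torch.Size([1, 256, 38, 38])
```

In a multilevel detector such as SSD, one such module could be applied per feature level, which is one way the "multilevel feature extraction" mentioned above might be combined with cross-frame attention.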
Original language | English |
---|---|
Journal | IEEE Transactions on Human-Machine Systems |
Early online date | 10 Feb 2022 |
Publication status | Early online - 10 Feb 2022 |
Keywords
- Video object detection
- SSD
- Attention model
- Space human-robot interaction