Global temporal information and local semantic information are essential cues for high-performance online object detection in videos. However, despite their promising detection accuracy in most cases, most state-of-the-art approaches have following two limitations: invalid background/scale suppression and inadequate temporal information mining between frames. Many jobs currently focus on temporal information learning based on a single frame. In this article, we propose an attentional global–local information learning network; this is one of the first attempts to fully use both types of information between frames. Attention maps are creatively utilized to transfer temporal contexts between frames. This also effectively alleviates the adverse effects of scale changes. Furthermore, empowered by a detailed framework, a proposed detector effectively uses multilevel feature extraction. Given these contributions, the proposed detector achieves state-of-the-art performance on challenging benchmarks. Finally, practical experiments are conducted on a space human–robot interaction platform.
- Video object detection
- Attention model
- Space human-robot interaction