Abstract
Although Siamese trackers have become increasingly prevalent in visual tracking, they are easily disrupted by semantic distractors in complex environments, which results in the underutilization of feature information. When multiple disturbances act together, the performance of many trackers often degrades severely. To address this problem, this paper presents a robust Stereoscopic Transformer network for improving tracking performance. Our method uses a hybrid attention mechanism and is composed of a channel feature awareness network (CFAN), a global channel attention network (GCAN), and a multi-level feature enhancement unit (MFEU). Concretely, CFAN focuses on specific channel information, highlighting the target features it contains and weakening the semantic distractor features. As an intermediate hub, GCAN is mainly responsible for establishing the global feature dependencies between the search region and the template, while selecting the relevant channel features to improve the discriminative ability of the model. MFEU enhances multi-level feature information to facilitate feature representation learning. Finally, a Transformer-based Siamese tracker (named VTST) is proposed to provide an efficient tracking representation, which offers advantages across a variety of challenging attributes. Experiments show that our method outperforms state-of-the-art trackers on multiple benchmarks with a real-time running speed of 56.0 fps.
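The abstract describes channel-wise attention only at a high level. As a rough illustration of the general idea of emphasizing some feature channels and weakening others, the following is a minimal sketch of a generic, parameter-free channel gate in plain Python. It is not the paper's CFAN or GCAN: the function name `channel_gate`, the nested-list feature layout, and the use of a bare sigmoid over the pooled value are all assumptions made for illustration. A trained network would place learned layers between the pooling step and the gate.

```python
import math


def channel_gate(features):
    """Reweight feature channels by a gate derived from their global average.

    features: list of C channels, each a list of H rows of W floats
              (hypothetical layout chosen for this sketch).
    Returns (weights, gated), where weights has one value in (0, 1) per
    channel and gated has the same shape as features.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]
    # Gate: a sigmoid maps each descriptor to (0, 1). This stands in for
    # the learned layers a real attention module would use (assumption).
    weights = [1.0 / (1.0 + math.exp(-p)) for p in pooled]
    # Excite: scale every value in a channel by that channel's weight.
    gated = [
        [[w * value for value in row] for row in channel]
        for channel, w in zip(features, weights)
    ]
    return weights, gated


if __name__ == "__main__":
    # Two 2x2 channels: one with zero response, one with strong response.
    feats = [
        [[0.0, 0.0], [0.0, 0.0]],
        [[2.0, 2.0], [2.0, 2.0]],
    ]
    w, out = channel_gate(feats)
    print(w)
```

In this sketch, a channel with a larger average response receives a larger weight, so its values are preserved more strongly than those of a weakly responding channel.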
Original language | English |
---|---|
Number of pages | 18 |
Journal | IEEE Transactions on Automation Science and Engineering |
Early online date | 3 Oct 2023 |
DOIs | |
Publication status | Early online - 3 Oct 2023 |
Keywords
- visual tracking
- complex environments
- stereoscopic transformer
- hybrid attention mechanism