Repformer: a robust shared-encoder dual-pipeline transformer for visual tracking

Fengwei Gu, Jun Lu*, Chengtao Cai, Qidan Zhu, Zhaojie Ju*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Siamese-based trackers have achieved outstanding tracking performance. In complex scenarios, however, these trackers struggle to adequately integrate valuable target feature information, which degrades tracking performance. In this paper, a novel shared-encoder dual-pipeline Transformer architecture is proposed to achieve robust visual tracking. The proposed method integrates several main components based on a hybrid attention mechanism: the shared encoder, the functionally complementary feature enhancement pipelines, and the pipeline feature fusion head. The shared encoder processes template features and provides useful target feature information to the feature enhancement pipelines. The feature enhancement pipelines enhance feature information, establish feature dependencies between the template and the search region, and make full use of global information. To further correlate the global information, the pipeline feature fusion head integrates the feature information from the feature enhancement pipelines. Finally, we propose a robust Siamese-based tracker, Repformer, which incorporates a concise tracking prediction network to obtain efficient tracking representations. Experiments show that our tracking method surpasses numerous state-of-the-art trackers on multiple tracking benchmarks, with a running speed of 57.3 fps.
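
The data flow described in the abstract (shared encoder, two feature enhancement pipelines, fusion head) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not specify the layers, so every function name here is hypothetical, the attention has no learned weights, the split of work between the two pipelines (template cross-attention versus attention over joint template and search tokens) is an assumption, and the averaging fusion is a placeholder for whatever the fusion head actually learns.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out


def residual(x, y):
    """Element-wise sum of two token lists (residual connection)."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def shared_encoder(template):
    # Assumption: the encoder is self-attention over template tokens.
    return residual(template, attention(template, template, template))


def cross_pipeline(search, encoded_template):
    # Assumption: one pipeline lets search tokens attend to the template.
    return residual(search, attention(search, encoded_template, encoded_template))


def global_pipeline(search, encoded_template):
    # Assumption: the other pipeline attends over template and search jointly.
    joint = encoded_template + search
    return residual(search, attention(search, joint, joint))


def fusion_head(a, b):
    # Placeholder fusion: average the two pipeline outputs.
    return [[(x + y) / 2 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def track_features(template, search):
    """Run both pipelines on the shared encoder output and fuse them."""
    encoded = shared_encoder(template)
    return fusion_head(cross_pipeline(search, encoded),
                       global_pipeline(search, encoded))
```

Calling `track_features(template, search)` with two lists of same-dimension token vectors returns one fused vector per search-region token. In a real tracker these would feed a prediction network, which this sketch omits.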

Original language: English
Pages (from-to): 20581–20603
Journal: Neural Computing and Applications
Volume: 35
Early online date: 22 Jul 2023
Publication status: Published - 1 Oct 2023

Keywords

  • Hybrid attention mechanism
  • Pipeline feature fusion head
  • Transformer architecture
  • Visual tracking
