In recent years, there have been many multimodal works in the field of remote sensing, and most of them have achieved good results in the task of land-cover classification. However, multi-scale information is seldom considered in the multi-modal fusion process. Secondly, the multimodal fusion task rarely considers the application of attention mechanism, resulting in a weak representation of the fused feature. In order to better use the multimodal data and reduce the losses caused by the fusion of different modalities, we proposed a TRMSF (Transformer and Multi-scale fusion) network for land-cover classification based on HSI (hyperspectral images) and LiDAR (Light Detection and Ranging) images joint classification. The network enhances multimodal information fusion ability by the method of attention mechanism from Transformer and enhancement using multi-scale information to fuse features from different modal structures. The network consists of three parts: multi-scale attention enhancement module (MSAE), multimodality fusion module (MMF) and multi-output module (MOM). MSAE enhances the ability of feature representation from extracting different multi-scale features of HSI, which are used to fuse with LiDAR feature, respectively. MMF integrates the data of different modalities through attention mechanism, thereby reducing the loss caused by the data fusion of different modal structures. MOM optimizes the network by controlling different outputs and enhances the stability of the results. The experimental results show that the proposed network is effective in multimodality joint classification.
- hyperspectral image
- cross-modal data fusion