Attention-Guided Feature Fusion for Ship Detection in SAR Images
Abstract
Ship detection through synthetic aperture radar (SAR) imagery is essential for maritime monitoring, security, and navigation. However, SAR images present several challenges, including noise, low contrast, and varying ship sizes. In this paper, we propose an architecture that incorporates ResNet-50 as the backbone for multi-scale feature extraction, which is further enhanced using a convolutional block attention module for refining spatial and channel-wise features. Feature combination and attention fusion mechanisms are employed to integrate critical features while suppressing irrelevant information, and detection heads are optimized for precise bounding box regression and classification. This design ensures improved efficiency and accuracy, demonstrating significant advancements in SAR ship detection performance. Extensive experiments conducted on the SAR ship detection dataset and the high-resolution SAR image dataset yielded AP50 scores of 98.1% and 91.8%, respectively, outperforming several state-of-the-art detectors, thus validating the effectiveness of the proposed method.
I. Introduction
Traditional synthetic aperture radar (SAR) ship detection methods primarily rely on manual feature extraction. Among them, constant false alarm rate (CFAR) [1] detection stands out as one of the most widely used. It models background clutter using statistical distributions and computes an adaptive threshold to determine whether a pixel belongs to the target region. Although different CFAR-based algorithms employ various background modeling strategies, CFAR remains susceptible to interference from complex environments, and its detection performance in nearshore scenes is poor.
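To make the adaptive-threshold idea concrete, the following is a minimal sketch of cell-averaging CFAR, one common CFAR variant. It is a simplification under stated assumptions: it scans a 1-D list of power samples rather than a 2-D image, it assumes exponentially distributed clutter power, and the function name and the parameters `num_train`, `num_guard`, and `pfa` are illustrative choices that do not come from this paper or from [1].

```python
from statistics import mean


def ca_cfar_1d(power, num_train=8, num_guard=2, pfa=1e-3):
    """Cell-averaging CFAR over a 1-D list of power samples (illustrative sketch).

    Returns the indices of the cells whose power exceeds the adaptive threshold.
    Assumes exponentially distributed clutter power (an assumption of this sketch).
    """
    n_total = 2 * num_train  # training cells on both sides of the cell under test
    # Threshold multiplier that yields the desired false alarm probability
    # for exponentially distributed clutter.
    alpha = n_total * (pfa ** (-1.0 / n_total) - 1.0)
    half = num_train + num_guard
    detections = []
    for i in range(half, len(power) - half):
        # Training cells exclude the guard cells next to the cell under test.
        left = power[i - half : i - num_guard]
        right = power[i + num_guard + 1 : i + half + 1]
        clutter_level = mean(left + right)
        if power[i] > alpha * clutter_level:
            detections.append(i)
    return detections


# Flat clutter with one strong return at index 25.
samples = [1.0] * 50
samples[25] = 100.0
print(ca_cfar_1d(samples))
```

Because the threshold is scaled by the local clutter estimate, a strong scatterer inside the training window (for example, a nearby pier or a second ship) raises the threshold for its neighbors, which is one reason such detectors degrade in nearshore scenes.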
With the development of deep learning (DL), convolutional neural network (CNN)-based object detection methods have undergone rapid advancements, emerging as a significant area of research. These methods can be classified into one-stage and two-stage approaches. One-stage detectors, such as YOLO (you only look once) [2, 3], SSD (single shot multibox detector) [4], and RetinaNet [5], directly predict object categories and regress location coordinates, obtaining bounding box positions and class probabilities simultaneously without generating proposal regions. In contrast, two-stage detectors generate candidate region proposals in the first stage and perform classification and regression in the second stage. The region-based CNN (R-CNN) [6] introduced this two-step pipeline by applying CNNs to region proposals. Fast R-CNN [7] improved the efficiency of the method by sharing features across proposals, and Faster R-CNN [8] further advanced the framework by introducing a region proposal network for end-to-end training. Furthermore, Mask R-CNN [9] extended Faster R-CNN [8] to support instance segmentation. These two-stage methods offer higher accuracy at the cost of increased computational complexity, whereas one-stage algorithms offer fast and straightforward end-to-end training but are generally less accurate. A common limitation of both families is their reliance on the final-level feature map for object prediction, which neglects valuable information from lower-level features and thereby hinders the detection of small targets. To address this shortcoming, the feature pyramid network (FPN) [10] was introduced, which extracts features at multiple levels and makes independent predictions from different feature layers to improve detection performance. Overall, these methods primarily focus on detecting prominent objects, such as large-scale ships in offshore SAR scenes.
However, the presence of complex backgrounds makes the detection of multi-scale ships in inshore SAR scenes more challenging. As a result, small-scale ships are often overlooked owing to the emphasis on large-scale ships and interference from complex surroundings.
The motivation for this research arises from the need to improve the accuracy and efficiency of ship detection systems under challenging conditions. By addressing the limitations of current models, such as reduced effectiveness in detecting ships of varying sizes and reliance on complex feature representations, this study develops an improved network architecture capable of handling the unique challenges of SAR ship detection. This could contribute to better maritime safety, efficient ship monitoring, and enhanced environmental surveillance in both offshore and inshore SAR scenarios. By achieving these objectives, the research aims to significantly advance the capabilities of SAR ship detection systems, thereby improving accuracy, even in complex maritime environments.
To address the abovementioned challenges, we propose a novel network architecture, as illustrated in Fig. 1. The design follows an end-to-end trainable structure, enabling seamless integration and optimized performance. This architecture is specifically tailored to handle the unique characteristics of SAR imagery, such as speckle noise, low contrast, and complex maritime environments. Each component (backbone, attention module, feature fusion, and detection head) is carefully designed to enhance the robustness and accuracy of SAR ship detection. The proposed framework introduces several innovations that are specifically aimed at addressing the inherent difficulties in SAR-based ship detection. These contributions include:
• Integration of the convolutional block attention module for enhanced feature refinement: The architecture incorporates the convolutional block attention module (CBAM) after the ResNet-50 [11] backbone. CBAM applies both channel-wise and spatial attention mechanisms to emphasize important features while suppressing irrelevant background noise, which is crucial for handling the noisy and cluttered nature of SAR images.
• Feature combination and attention fusion mechanism: After extracting features from multiple levels of the backbone, the architecture employs a novel attention fusion mechanism to combine them. This module ensures that important information across different feature scales is prioritized, which is essential for detecting ships of varying sizes and orientations in SAR imagery.
• Detection head for SAR ship detection: The detection heads employed in this study are specifically designed to optimize bounding box regression and ship classification in SAR imagery. By leveraging the refined features from the attention fusion stage, they improve precision in localizing ships and reduce false positives, which are prevalent in SAR datasets due to noise and complex backgrounds.
The remainder of this paper is organized as follows: Section II describes both traditional ship detection methods and recent advances in DL techniques for detecting small targets, Section III elaborates on the proposed method in detail, Section IV reports the experimental results, and the conclusions are presented in Section V.
II. Related Work
Ship detection has been extensively studied for decades. Traditional methods, in general, adopt a multistage coarse-to-fine process for detection, which is often divided into two steps (region proposal and region refinement) that rely on handcrafted features. Along these lines, Chen et al. [12] introduced an adaptive superpixel-level CFAR detector for dense ship detection in nearshore areas, employing a nonlocal topology strategy to distinguish clutter from mixed superpixels and adaptively select pure superpixels for threshold determination. Superpixel-level segmentation [13, 14] has also been implemented to enhance the detection performance of traditional CFAR-based methods. However, the handcrafted features used in these approaches often result in poor generalization, thus compromising the robustness of the overall detection performance [15].
Recently, DL has achieved remarkable progress in object detection efficiency and accuracy. As a result, many researchers have introduced DL-based methods into the study of SAR ship detection. For instance, Jiao et al. [16] integrated FPN with Faster R-CNN for multi-scale ship detection, while Wang et al. [17] enhanced SSD using semantic aggregation and attention modules to simultaneously detect ships and estimate their orientation. Additionally, the incorporation of ship context information into SSD feature maps [18] was found to improve the detection accuracy of inshore ships. However, these methods primarily focus on prominent objects, such as large-scale ships in offshore SAR scenes, and they struggle to detect multi-scale ships in inshore environments due to complex background interference. To address this issue, Wang et al. [19] proposed a hierarchical feature pyramid to enhance small object detection, while Zhang et al. [20] leveraged contextual information to improve small ship recognition in nearshore regions. Nonetheless, these methods still struggle with balancing accuracy and computational efficiency.
Encouraged by the success of CNN-based object detection methods, early studies combined CNNs with traditional techniques, using CNN-generated bounding boxes as guard windows for CFAR [21] and implementing sea-land separation through a fully convolutional network to model the sea clutter distribution [22]. With the increase in SAR ship datasets, recent DL approaches have been able to further enhance detection accuracy and efficiency. To improve multi-scale detection, some researchers have focused on network structural connections. For example, Jiao et al. [16] employed dense connections to extract features at different scales, while Zhao et al. [23] applied attention mechanisms, constructing a feature pyramid using receptive field blocks and CBAM [24]. For faster detection, Zhang et al. [25] proposed a network with multiple feature extraction modules and lightweight strategies. Meanwhile, Lin et al. [26] introduced the squeeze and excitation rank Faster R-CNN (SER Faster R-CNN), which employs a multi-scale feature cascade strategy and the squeeze and excitation mechanism to improve feature quality and minimize redundancy, thereby enhancing detection performance. However, many existing SAR ship detection methods have continued to focus only on large vessels in offshore environments while overlooking the challenges of detecting small ships in nearshore areas. This limitation arises from the heavy reliance on high-level feature maps, which lack sufficient spatial resolution for detecting small targets. Some researchers have attempted to address this issue by incorporating multi-scale feature fusion techniques. For instance, Yang et al. [27] integrated a Swin Transformer into a CNN-based detection framework, achieving superior performance in multi-scale ship detection. Additionally, Wang et al. [28] applied hybrid CNN–transformer models, combining the advantages of both architectures to attain improved robustness in complex maritime scenarios. These instances suggest that transformer-based approaches hold significant promise for SAR ship detection, but further exploration is required to optimize their efficiency and scalability.
III. Proposed Method
This section describes the proposed SAR ship detection framework, which features multiple stages: feature extraction using a backbone network, attention-based feature refinement, feature fusion, and detection heads for object localization. The proposed model involves five key steps: Backbone → CBAM → Concatenate Feature → Fusion Feature → Detection Head.
1. Overall Architecture
The network architecture depicted in Fig. 1 addresses the SAR ship detection problem by leveraging a feature-enhanced backbone and implementing a multi-scale feature prioritization strategy. First, the backbone extracts hierarchical feature maps from the SAR input image, which are then refined using CBAMs. These modules emphasize crucial spatial and channel-wise information, thus improving the model’s ability to focus on essential ship features across varying scales. Next, the concatenated multi-scale features undergo a fusion process, which enables robust feature integration that enhances the network’s discriminative capability. Finally, detection heads are used to process the fused features, ensuring precise ship localization while minimizing false positives.
ResNet-50 was selected as the backbone in this study owing to its well-known ability to balance performance and computational efficiency, as demonstrated in various remote sensing and SAR-based applications. This also allowed us to focus on the core contributions of the multi-scale attention and feature fusion processes rather than architectural backbone tuning. CBAM was chosen as the attention module for its lightweight yet effective enhancement of both spatial and channel-wise features. This is particularly beneficial in SAR imagery, where small ships are often obscured by speckle noise and cluttered backgrounds, and where low contrast and varying ship sizes further complicate detection.
The proposed framework is a one-stage object detector, where detection heads directly predict bounding boxes from fused multi-scale features without relying on a separate region proposal stage. Such a design allows for efficient end-to-end training, which is ideal for real-time or large-scale SAR ship detection applications. Approaches reported in prior works include the attention receptive pyramid network, which leverages multi-scale receptive fields and attention in a two-stage manner, and hybrid CNN–transformer models, which combine local and global context via transformer layers. In contrast, the approach described in this study adopts a lightweight yet effective multi-branch CBAM-enhanced fusion strategy. This enables adaptive emphasis on salient features while maintaining spatial and semantic diversity without the need for complex modules. Additionally, the end-to-end, one-stage YOLO-style detection heads employed in this study allow direct dense prediction from the fused features, which increases detection speed and simplifies deployment.
2. Feature Extraction
The backbone is a deep convolutional neural network that is responsible for extracting hierarchical features from input images. In this study, we used ResNet-50 [11] to generate feature maps at three scales (C3, C4, and C5), which correspond to low-, mid-, and high-level semantic representations with decreasing spatial dimensions and increasing depth. Each feature map derived from the backbone was passed through a CBAM to refine its spatial and channel-wise features, as shown in Fig. 2. CBAM is an efficient and lightweight attention mechanism designed to enhance feature extraction in DL models. It sequentially applies channel attention and spatial attention to refine feature maps, thereby helping the network focus on important information while suppressing irrelevant noise. Unlike most prior studies, which have applied CBAM only to high-level backbone outputs, we integrated CBAMs after each of the C3, C4, and C5 stages. This multi-branch attention strategy preserved both low-level and high-level spatial-channel cues across multiple scales, thus enhancing the detection of small and subtle ship targets.
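As a quick check on what the three backbone levels look like, the short sketch below computes the shapes of the C3, C4, and C5 feature maps for a given input size. The channel counts (512, 1024, 2048) and strides (8, 16, 32) are those of the standard ResNet-50; the 512 × 512 input size is only an assumed example, since the input resolution used in this study is not stated here, and `feature_shapes` is a hypothetical helper name.

```python
import math

# Standard ResNet-50 stages: name -> (number of 2x downsamplings, output channels).
# Three halvings give stride 8 (C3), four give stride 16 (C4), five give stride 32 (C5).
RESNET50_STAGES = {"C3": (3, 512), "C4": (4, 1024), "C5": (5, 2048)}


def feature_shapes(height, width):
    """Return {stage: (channels, h, w)} for an input image of the given size."""
    shapes = {}
    for name, (halvings, channels) in RESNET50_STAGES.items():
        h, w = height, width
        for _ in range(halvings):
            # Each stride-2 layer in ResNet-50 maps a size n to ceil(n / 2).
            h, w = math.ceil(h / 2), math.ceil(w / 2)
        shapes[name] = (channels, h, w)
    return shapes


# Assumed example input size, not a setting reported in the paper.
for stage, shape in feature_shapes(512, 512).items():
    print(stage, shape)
```

The output makes the trade-off explicit: C3 keeps the finest spatial grid, which matters for small ships, while C5 has the most channels and the coarsest grid.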
Following CBAM refinement, the multi-scale features were concatenated and passed through a nonlinear fusion block consisting of convolutional layers and activation functions, enabling learnable feature interactions across resolutions. The two attention modules that make up CBAM are as follows:
• Channel attention module (CAM): It learns the importance of different feature channels by applying global average pooling and max pooling, followed by a shared multilayer perceptron to generate attention weights.
• Spatial attention module (SAM): It highlights important spatial regions using average and max pooling along the channel axis, followed by applying a convolution layer to refine attention maps.
Thus, by integrating CBAM into DL architectures, models can achieve better feature representation, leading to improved performance in object detection and image classification tasks.
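The two attention modules described above can be sketched in PyTorch as follows. This is a minimal illustration rather than the implementation used in this study: the reduction ratio of 16 and the 7 × 7 spatial kernel follow the defaults of the original CBAM design [24] and are assumptions here, as are the class names.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """CAM: global average and max pooling followed by a shared MLP."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP implemented with 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)  # per-channel weights, shape (N, C, 1, 1)


class SpatialAttention(nn.Module):
    """SAM: average and max pooling along the channel axis followed by a convolution."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx = torch.amax(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (N, 1, H, W)


class CBAM(nn.Module):
    """Applies channel attention first, then spatial attention."""

    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.channel = ChannelAttention(channels, reduction)
        self.spatial = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.channel(x)
        return x * self.spatial(x)


# The refined map has the same shape as the input feature map.
features = torch.randn(2, 64, 16, 16)
print(CBAM(64)(features).shape)
```

Because the module preserves the input shape, one instance per backbone level (with `channels` set to that level's depth) can be dropped in after C3, C4, and C5 without changing the rest of the network.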
3. Combination of Features
The output feature maps from each CBAM block were concatenated to combine complementary information, ensuring the preservation of both high-resolution details (useful for detecting small ships) and high-level semantic context (useful for identifying large ships and background differentiation). After concatenation, a feature fusion module was used to further process the concatenated features to improve consistency and robustness. This module included convolutional layers for dimension reduction, ReLU (rectified linear unit) activation functions to introduce nonlinearity, and batch normalization to stabilize training. Ultimately, a unified feature map with richer and more discriminative representations was obtained.
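A possible form of the concatenation and fusion step is sketched below in PyTorch. Several details are assumptions made only for illustration, because they are not specified in the text: the coarser maps are upsampled to the C3 resolution with nearest-neighbor interpolation before concatenation, the fusion block uses one 1 × 1 and one 3 × 3 convolution, and the output width of 256 channels and the class name `ConcatFusion` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConcatFusion(nn.Module):
    """Concatenate three feature levels and fuse them with conv + BN + ReLU."""

    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.fuse = nn.Sequential(
            # 1x1 convolution for dimension reduction of the concatenated channels.
            nn.Conv2d(sum(in_channels), out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            # 3x3 convolution to mix information across neighboring positions.
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, c3, c4, c5):
        # Assumption: align all levels to the finest (C3) resolution before concatenation.
        size = c3.shape[-2:]
        c4 = F.interpolate(c4, size=size, mode="nearest")
        c5 = F.interpolate(c5, size=size, mode="nearest")
        return self.fuse(torch.cat([c3, c4, c5], dim=1))


# Small channel counts keep the example cheap to run.
fusion = ConcatFusion(in_channels=(8, 16, 32), out_channels=4)
fused = fusion(torch.randn(2, 8, 16, 16), torch.randn(2, 16, 8, 8), torch.randn(2, 32, 4, 4))
print(fused.shape)
```

Aligning to the finest level keeps the high-resolution detail that small ships depend on, at the cost of a larger fused map; aligning to a coarser level would be cheaper but would discard some of that detail.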
4. Detection Head for Ship Localization
The final fused feature map was passed through detection heads consisting of multiple convolutional layers for ship localization and classification. The detection heads produced bounding box predictions that localized the detected objects, together with confidence scores indicating detection certainty. During inference, we applied greedy non-maximum suppression (NMS) with an intersection over union (IoU) threshold of 0.5, which is standard for one-stage object detectors. Greedy NMS was chosen for its simplicity and speed, which are well suited to the sparse and low-overlap distribution of ship targets in SAR images.
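Greedy NMS at an IoU threshold of 0.5 can be written in a few lines. The sketch below is a plain-Python illustration that assumes axis-aligned boxes in `(x1, y1, x2, y2)` format; the function names are illustrative and the box format is an assumption, not a detail taken from this study.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # A box survives only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # [0, 2]
```

In the example, the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the distant third box is kept. With sparse, well-separated ships, most boxes face few comparisons, so this simple procedure stays fast.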
IV. Experimental Results
1. Evaluation Metrics
To comprehensively evaluate the performance of the detection methods, we chose the F1-score and average precision (AP) at IoU = 0.5 as the evaluation metrics. IoU, a standard measure of the accuracy of object detection algorithms, quantifies the overlap between the predicted and ground truth bounding boxes and is expressed as follows:
$$\mathrm{IoU} = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}$$
where $B_p$ denotes the predicted bounding box, $B_{gt}$ denotes the ground truth bounding box, and $|\cdot|$ denotes area.
In the context of evaluation metrics, true positive refers to the number of cases in which the model correctly identifies a positive instance consistent with the ground truth. Meanwhile, false positive represents the number of cases in which the model erroneously predicts a positive outcome for an instance that is negative. Conversely, false negative corresponds to the number of instances in which the model fails to detect a true positive, instead classifying it as negative.
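Precision and recall are computed from these counts, and the F1-score, their harmonic mean, summarizes the performance of a model. They are defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively.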
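The counting behind these metrics can be illustrated with a short sketch. It assumes that the IoU between every prediction and every ground truth box has already been computed and that each ground truth box may be matched to at most one prediction; this one-to-one matching rule is a common convention assumed here, and the function names are illustrative.

```python
def count_matches(iou_matrix, num_gt, iou_threshold=0.5):
    """Count (TP, FP, FN) from an IoU matrix.

    iou_matrix[i][j] is the IoU between prediction i and ground truth j, with
    predictions ordered by descending confidence.
    """
    matched = set()
    tp = 0
    for row in iou_matrix:
        # Best still-unmatched ground truth box for this prediction.
        candidates = [(value, j) for j, value in enumerate(row) if j not in matched]
        if candidates:
            best_iou, best_j = max(candidates)
            if best_iou >= iou_threshold:
                matched.add(best_j)
                tp += 1
    return tp, len(iou_matrix) - tp, num_gt - tp


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score from the three counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two predictions and two ships: one good match, one false alarm, one missed ship.
tp, fp, fn = count_matches([[0.68, 0.0], [0.0, 0.1]], num_gt=2)
print(tp, fp, fn, precision_recall_f1(tp, fp, fn))
```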
2. Datasets and Experimental Settings
We employed the SAR ship detection dataset (SSDD) [31] and the high-resolution SAR images dataset (HRSID) [32] to evaluate the proposed method. SSDD contains a set of 1,160 images derived from RADARSAT-2, TerraSAR-X, and Sentinel-1 images, with resolutions ranging from 1 to 15 m and image sizes varying from 214 × 160 to 668 × 526. We randomly selected 930 images as the training set, with the remaining 230 images considered the test set. Meanwhile, the HRSID dataset comprises 99 Sentinel-1B images, 36 TerraSAR-X images, and 1 TanDEM-X image. Table 1 provides more details on these datasets. In particular, HRSID comprises several typical samples that are difficult to detect, such as ships under severe speckle noise, densely distributed small ships, and densely packed parallel ships at ports.
Moreover, this dataset is characterized by large-scale scenes with sparse and small ship targets, making it well suited for evaluating SAR ship detection models. From HRSID, 3,642 images were used for training, and 1,962 images were considered for testing. A few of the SAR images are illustrated in Fig. 3. The dataset also contains ships captured under diverse imaging conditions and orientations, which provide natural variation.
Preliminary tests conducted using simple augmentations, such as flips and rotations, did not yield noticeable improvements, consistent with prior observations pertaining to SAR ship detection tasks. Therefore, and to maintain consistency with prior work trained on the same datasets in their original form, we did not apply additional data augmentation. The ResNet-50 backbone was initialized with weights pretrained on the ImageNet dataset, using models from the Torchvision library. Experiments were conducted on a machine equipped with an NVIDIA GeForce GPU with 10 GB of memory running CUDA 11.8. All algorithms were developed using the PyTorch DL framework. The configuration of the optimizer is presented in Table 2. The batch size was set to 8 based on experimental evaluation: larger batches yielded no significant accuracy gains, while smaller batches resulted in less stable optimization. At the output of the detector, we applied greedy NMS with an IoU threshold of 0.5, consistent with standard practice in YOLO-based and other one-stage detectors. For evaluation, the average precision at the IoU threshold of 0.5 (AP@0.5) was considered the primary metric, given that it remains the most widely used benchmark for ship detection tasks in SAR imagery.
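For readers unfamiliar with how AP@0.5 is obtained from a list of detections, the following sketch computes the area under the precision-recall curve. It assumes that each detection has already been labeled as a true or false positive at IoU = 0.5 and uses all-point interpolation of the curve; the exact AP protocol used in this study is not stated, so this variant and the function name are assumptions.

```python
def average_precision(detections, num_gt):
    """Area under the interpolated precision-recall curve.

    detections: list of (confidence, is_true_positive) pairs over the whole test set.
    num_gt: total number of ground truth ships.
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Three detections for two ships: hit, false alarm, hit.
print(average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2))
```

Unlike the F1-score, which depends on a single confidence threshold, AP summarizes the whole ranking of detections, which is why it is commonly used as the primary metric.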
3. Evaluation Methods
We evaluated the proposed method by comparing its evaluation metrics with those attained using other methods, including Faster R-CNN [8], PVT-SAR [29], KeyShip [30], and LPST-Det [27]. Table 3 presents the quantitative results of all the methods applied to the two datasets. The best results are highlighted in red, and the second-best results are highlighted in blue. The proposed model achieved the best AP50 and F1-score values for both datasets, exhibiting the best overall performance. Compared to Faster R-CNN, the proposed method improves AP50 by 8.4% and 13.6% for the SSDD and HRSID datasets, respectively. This highlights the model’s effectiveness while also pointing to potential improvements, such as refining confidence thresholds or enhancing feature extraction, for maintaining precision at higher recall levels. The experimental results, as shown in Fig. 4, highlight the superior performance of the proposed method, especially the significant improvements in AP50 across the SSDD and HRSID datasets.
Table 3. Comparison of the results of the ship object detection methods for the SSDD and HRSID datasets (unit: %)
To further validate the effectiveness of the proposed method, Fig. 5 presents a visual comparison of its detection performance with that of other previously reported methods. Although the resolution of the SSDD dataset is low and there are several sources of background clutter interference, the proposed model still accurately detected the position of the ships. Furthermore, Fig. 6 presents the ship detection results obtained by implementing the proposed method on the HRSID dataset, demonstrating detection performance across various scenarios, including docked ships, offshore vessels, and ships in deep-sea environments. The evaluation metrics were computed with a confidence threshold of 0.5. Fig. 7 presents confidence and P–R curves, indicating that ships of various scales, including densely clustered vessels near the shore, were accurately detected. Additionally, small and sparsely distributed ships in deep-sea regions were effectively identified. Overall, the proposed method achieved a low missed detection rate and a low false alarm rate.
We trained our model five times using different random seeds (0, 1, 42, 77, and 99) under identical settings. The AP@0.5 metric attained a mean of 91.8%, a standard deviation of ±0.31%, and a 95% confidence interval, confirming the consistency and reliability of the results.
In SAR data, ship targets are often obscured by speckle noise and background clutter. Therefore, CBAM was used to enhance feature representation by applying channel attention to prioritize semantically important features (e.g., high backscatter regions from ships) and spatial attention to localize target areas, while also suppressing irrelevant background noise. Table 4 shows that removing CBAM led to a 3.63% drop in AP50, highlighting its contribution to detection performance.
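The mean, standard deviation, and 95% confidence interval over repeated runs can be computed as in the sketch below. The five AP values in the example are placeholders chosen only for illustration and are not the per-seed results of this study; the interval uses the Student-t critical value for four degrees of freedom, which is an assumed (though common) choice for five runs.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student-t critical value for 4 degrees of freedom (five runs).
T_CRIT_95_DF4 = 2.776


def mean_std_ci(values, t_crit=T_CRIT_95_DF4):
    """Return (mean, sample standard deviation, (ci_low, ci_high))."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation, n - 1 in the denominator
    half_width = t_crit * s / sqrt(len(values))
    return m, s, (m - half_width, m + half_width)


# Placeholder AP@0.5 values (%), NOT the results reported in the paper.
placeholder_ap = [91.4, 91.7, 91.8, 92.0, 92.1]
m, s, (low, high) = mean_std_ci(placeholder_ap)
print(f"mean={m:.2f}, std={s:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With only five runs, the t-based interval is noticeably wider than one based on the normal approximation, so reporting which critical value was used makes the interval easier to interpret.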
Fig. 5. Detection results for the SSDD dataset: (a) input image, (b) ground truth, (c) Faster R-CNN, (d) LPST-Det, (e) KeyShip, and (f) the proposed method.
Fig. 6. Detection results of the proposed method for the HRSID dataset at IoU = 0.5: (a) docked ships, (b) offshore vessels, (c) and (d) ships in deep-sea environments.
V. Conclusion
The SAR ship detection framework proposed in this study effectively enhances feature representation by integrating CBAM into the backbone network. This attention mechanism refines feature extraction at multiple levels, improving the model’s ability to distinguish ships from the background. By concatenating and fusing these enriched features before passing them to the detection head, the proposed approach ensures accurate ship localization. The final detection results demonstrate the model’s ability to identify multiple ships in SAR images with high precision, making it a promising solution for robust and efficient maritime surveillance.
Although the proposed method achieves high-precision detection in standard scenarios, challenges such as false positives in cluttered environments and false negatives under occlusion persist. To address these issues in future work, we intend to incorporate Bayesian networks to model predictive uncertainty and explore self-supervised learning to reduce dependence on labeled datasets. These improvements are likely to enhance the generalization, robustness, and applicability of the proposed framework to diverse real-world maritime surveillance conditions.
Notes
This work was supported in part by the Priority Research Center Program through the National Research Foundation under Grant RS-2019-NR040074, and in part by the Korea Basic Science Institute (National Research Facilities and Equipment Center) Grant funded by the Ministry of Education under Grant RS-2022-NF000835.
References
Biography
Bao-Tran Nguyen Thi (ORCID: https://orcid.org/0000-0002-5308-977X) received her B.S. degree in statistics from Ton Duc Thang University, Vietnam, in 2017, and her M.S. degree in information and telecommunication engineering from Soongsil University, Seoul, South Korea, in 2023. Since 2024, she has been pursuing her Ph.D. degree in information and communication engineering at Kongju National University, South Korea. Her primary research interests include image processing and synthetic aperture radar (SAR) imaging, with a particular focus on object detection and image denoising.
Ic-Pyo Hong (ORCID: https://orcid.org/0000-0003-1875-5420) received his B.S., M.S., and Ph.D. degrees in electronics engineering from Yonsei University, Seoul, South Korea, in 1994, 1996, and 2000, respectively. From 2000 to 2003, he worked at the Information and Communication Division, Samsung Electronics Company, Suwon, South Korea, where he was a senior engineer with CDMA Mobile Research. He was a visiting scholar at Texas A&M University, College Station, TX, USA, in 2006, and at Syracuse University, Syracuse, NY, USA, in 2012. Since 2003, he has been with the Department of Smart Information and Technology Engineering, Kongju National University, Cheonan, South Korea, where he is currently a professor. His research interests include numerical techniques in electromagnetics, periodic electromagnetic structures, and their applications in wireless communication.
