Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

The Segment Anything Model (SAM) has attracted considerable attention for its impressive performance and shows potential in medical image segmentation. Compared with SAM's native point and bounding-box prompts, text prompts offer a simpler and more efficient alternative in the medical field, yet this approach remains relatively underexplored. In this paper, we propose a SAM-based framework that integrates a pre-trained vision-language model to generate referring prompts, with SAM handling the segmentation task. The outputs of multimodal models such as CLIP serve as input to SAM's prompt encoder. A critical challenge stems from the inherent complexity of medical text descriptions: they typically encompass anatomical characteristics, imaging modalities, and diagnostic priorities, resulting in information redundancy and semantic ambiguity. To address this, we propose a text decomposition-recomposition strategy. First, clinical narratives are parsed into atomic semantic units (e.g., appearance, location, pathology). These units are then recombined into optimized text expressions. We employ a cross-attention module over the multiple texts to interact with the joint features, ensuring that the model focuses on the features corresponding to effective descriptions. To validate the method, we conducted experiments on several datasets. Compared with native SAM driven by geometric prompts, our model shows improved performance and usability.
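
No code accompanies the abstract on this page, so the following is a minimal sketch, assuming a PyTorch setup with the Hugging Face CLIP weights, of the idea described above: several decomposed text descriptions are encoded with CLIP, projected into SAM-style 256-dimensional sparse prompt tokens, and mixed with a cross-attention layer so that informative descriptions can dominate. The module name TextPromptAdapter, the 256-d prompt width, and the example descriptions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): CLIP text embeddings for several
# decomposed descriptions are projected into SAM-style sparse prompt tokens
# and mixed with cross-attention before being handed to SAM's mask decoder.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPTokenizer

class TextPromptAdapter(nn.Module):
    """Turns multiple CLIP text embeddings into SAM-style sparse prompt tokens (assumed design)."""
    def __init__(self, clip_dim: int = 512, prompt_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(clip_dim, prompt_dim)
        # Cross-attention among the decomposed descriptions (appearance,
        # location, pathology, ...) lets the model weight the useful ones.
        self.cross_attn = nn.MultiheadAttention(prompt_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(prompt_dim)

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        # text_embeds: (batch, num_texts, clip_dim)
        tokens = self.proj(text_embeds)                      # (B, T, prompt_dim)
        attended, _ = self.cross_attn(tokens, tokens, tokens)
        return self.norm(tokens + attended)                  # sparse prompt tokens

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Toy decomposed descriptions of one target structure (placeholders).
texts = ["hypoechoic nodule", "left thyroid lobe", "irregular margin"]
inputs = tokenizer(texts, padding=True, return_tensors="pt")
with torch.no_grad():
    embeds = clip.get_text_features(**inputs)                # (num_texts, 512)

adapter = TextPromptAdapter()
prompt_tokens = adapter(embeds.unsqueeze(0))                 # (1, num_texts, 256)
# In a full pipeline these tokens would stand in for (or accompany) the sparse
# embeddings SAM's prompt encoder produces for points and boxes.
print(prompt_tokens.shape)
```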

Source: http://dx.doi.org/10.1109/JBHI.2025.3607023

Publication Analysis

Top Keywords

prompts sam (8)
medical image (8)
image segmentation (8)
prompts (5)
leveraging multi-text (4)
multi-text joint (4)
joint prompts (4)
sam (4)
sam robust (4)
medical (4)

Similar Publications

Purpose: Recent developments in computational pathology have been driven by advances in vision foundation models (VFMs), particularly the Segment Anything Model (SAM). This model facilitates nuclei segmentation through two primary methods: prompt-based zero-shot segmentation and the use of cell-specific SAM models for direct segmentation. These approaches enable effective segmentation across a range of nuclei and cells.
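
As a concrete illustration of the prompt-based zero-shot route mentioned above, a minimal sketch using the public segment-anything API is shown below; the checkpoint file, image path, and click coordinates are placeholders rather than values from the cited study.

```python
# Minimal sketch of prompt-based zero-shot segmentation with the public
# segment-anything API. Checkpoint, image path, and the click location are
# placeholders, not taken from the cited study.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# Load a histology patch as an RGB uint8 array.
image = cv2.cvtColor(cv2.imread("patch.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One positive click roughly centred on a nucleus of interest.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[120, 96]]),
    point_labels=np.array([1]),          # 1 = foreground, 0 = background
    multimask_output=True,
)
nucleus_mask = masks[np.argmax(scores)]  # keep the highest-scoring proposal
```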

Objective: Accurate segmentation of breast lesions, especially small ones, remains challenging in digital mammography due to complex anatomical structures and low-contrast boundaries. This study proposes DVF-YOLO-Seg, a two-stage segmentation framework designed to improve feature extraction and enhance small-lesion detection performance in mammographic images.

Methods: The proposed method integrates an enhanced YOLOv10-based detection module with a segmentation stage based on the Visual Reference Prompt Segment Anything Model (VRP-SAM).

Accurate and generalizable object segmentation in ultrasound imaging remains a significant challenge due to anatomical variability, diverse imaging protocols, and limited annotated data. In this study, we propose a prompt-driven vision-language model (VLM) that integrates Grounding DINO with SAM2 to enable object segmentation across multiple ultrasound organs. A total of 18 public ultrasound datasets, encompassing the breast, thyroid, liver, prostate, kidney, and paraspinal muscle, were utilized.
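
A minimal sketch of such a Grounding DINO to SAM2 hand-off appears below, following the public APIs of both repositories; the config names, checkpoint paths, thresholds, and the text query are placeholder assumptions rather than the study's settings.

```python
# Minimal sketch: Grounding DINO proposes text-conditioned boxes, SAM2 refines
# them into masks. Paths, thresholds, and the query are placeholders.
import torch
from groundingdino.util.inference import load_model, load_image, predict
from groundingdino.util import box_ops
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("ultrasound.png")   # image_source: RGB numpy array

boxes, logits, phrases = predict(
    model=dino, image=image,
    caption="thyroid nodule",
    box_threshold=0.35, text_threshold=0.25,
)

# Grounding DINO returns normalized cxcywh boxes; convert to pixel xyxy.
h, w, _ = image_source.shape
boxes_xyxy = box_ops.box_cxcywh_to_xyxy(boxes) * torch.tensor([w, h, w, h])

sam2 = build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml", "sam2.1_hiera_large.pt")
predictor = SAM2ImagePredictor(sam2)
predictor.set_image(image_source)

# One mask per detected box; predict() returns (masks, scores, low-res logits).
masks = [
    predictor.predict(box=box.numpy(), multimask_output=False)[0]
    for box in boxes_xyxy
]
```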

Foreign Object Debris (FOD) on airport pavements poses a serious threat to aviation safety, making accurate detection and interpretable scene understanding crucial for operational risk management. This paper presents an integrated multi-modal framework that combines an enhanced YOLOv7-X detector, a cascaded YOLO-SAM segmentation module, and a structured prompt engineering mechanism to generate detailed semantic descriptions of detected FOD. Detection performance is improved through the integration of Coordinate Attention, Spatial-Depth Conversion (SPD-Conv), and a Gaussian Similarity IoU (GSIoU) loss.
