Toward Fine-Grained 3-D Visual Grounding Through Referring Textual Phrases.

Zhihao Yuan , Xu Yan , Zhuo Li , Xuhao Li , Yao Guo , Shuguang Cui , Zhen Li

IEEE Trans Neural Netw Learn Syst

Published: June 2025

Recent progress in 3-D scene understanding has explored 3-D visual grounding (3DVG), which localizes a target object from a language description. However, existing methods only consider the dependency between the entire sentence and the target object, ignoring the fine-grained relationships between contextual phrases and nontarget objects. In this article, we extend 3DVG to a more fine-grained task, called 3-D phrase-aware grounding (3DPAG). The 3DPAG task aims to localize the target objects in a 3-D scene by explicitly identifying all phrase-related objects and then reasoning over the contextual phrases. To tackle this problem, we manually labeled about 227 K phrase-level annotations using a self-developed platform, from 88 K sentences of widely used 3DVG datasets, i.e., Natural Reference in 3-D (Nr3D), Spatial Reference in 3-D (Sr3D), and ScanRefer. Using our datasets, we can extend previous 3DVG methods to the fine-grained phrase-aware scenario. This is achieved through the proposed phrase-object alignment (POA) optimization and phrase-specific pretraining (PSP), which also boost conventional 3DVG performance. Extensive results confirm significant improvements: the previous state-of-the-art method achieves overall accuracy gains of 3.9%, 3.5%, and 4.6% on Nr3D, Sr3D, and ScanRefer, respectively. Our datasets and platform are available at https://github.com/CurryYuan/PhraseRefer.

DOI: http://dx.doi.org/10.1109/TNNLS.2025.3571959

Similar Publications

Optically Controlled Memristor Enabling Synergistic Sensing-Memory-Computing for Neuromorphic Vision Systems.

Adv Mater

September 2025

Key Laboratory of Brain-Like Neuromorphic Devices and Systems of Hebei Province, College of Electronic and Information Engineering, Hebei University, Baoding, 071002, China.

Jianhui Zhao , Dingxin Liu , Kangbo Zhao , Jianning Wang , Yufei Shang

Neuromorphic visual devices hold considerable promise for integration into neuromorphic vision systems that combine sensing, memory, and computing. This potential arises from their synergistic benefits in optical signal detection and neuro-inspired computational processes. However, current devices face challenges such as insufficient light/dark resistance ratios, mismatched transient photo-response, and volatile retention characteristics, limiting their adaptability to complex artificial vision systems.

Perspectives on eye care access and telemedicine-based glaucoma screening among Latine individuals with limited English proficiency.

AJO Int

October 2025

Department of Ophthalmology & Visual Sciences, University of Michigan Medical School, 1000 Wall Street, Ann Arbor, MI, 48105, USA.

Norma E Del Risco , Mildred Silva Zuccaro , Jade J Livingston , Michele Heisler , Harry Levine

Purpose: The Michigan Screening and Intervention for Glaucoma and Eye Health through Telemedicine Program (MI-SIGHT) was developed to facilitate access to glaucoma and eye disease screening and improve attendance at recommended follow-up in underserved communities. MI-SIGHT offered free eye disease screenings, low-cost glasses, and, for those who screened positive for glaucoma, personalized education and language-concordant coaching grounded in motivational interviewing. The primary aims of this study were 1) to explore barriers to eye care among Latine participants with limited English proficiency (LEP) who screened positive for glaucoma, 2) to understand whether and how the MI-SIGHT program facilitated access to care, and 3) to understand participant experience in MI-SIGHT to inform the development of future interventions.

De-MSI: A Deep Learning-Based Data Denoising Method to Enhance Mass Spectrometry Imaging by Leveraging the Chemical Prior Knowledge.

Anal Chem

September 2025

State Key Laboratory of Environmental and Biological Analysis, Hong Kong Baptist University, Hong Kong SAR 999077, China.

Lei Guo , Chengyi Xie , Xin Diao , Thomas Ka Yam Lam , Yanhui Zhong

Mass spectrometry imaging (MSI) is a label-free technique that enables the visualization of the spatial distribution of thousands of ions within biosamples. Data denoising is the computational strategy aimed at enhancing the MSI data quality, providing an effective alternative to experimental methods. However, due to the complex noise pattern inherent in MSI data and the difficulty in obtaining ground truth from noise-free data, achieving reliable denoised images remains challenging.

Rhotic Acquisition Is More Rapid in Biofeedback Than Motor-Based Treatment for Residual Speech Sound Disorder: Primary Outcome of a Randomized Controlled Trial.

J Speech Lang Hear Res

September 2025

Department of Communication Sciences & Disorders, Montclair State University, Bloomfield, NJ.

Tara McAllister , Jonathan L Preston , Nina R Benway , Jennifer Hill , Marcela P Lara

Purpose: Residual speech sound disorder (RSSD) is a high-prevalence condition that can limit children's academic and social participation, with negative consequences for overall well-being. Previous studies have described visual biofeedback as a promising option for RSSD, but results have been inconclusive due to study design limitations and small sample sizes.

Method: In a preregistered randomized controlled trial, 108 children aged 9-15 years with RSSD affecting American English /ɹ/ were randomly assigned to receive treatment incorporating visual biofeedback (subdivided into ultrasound and visual-acoustic types) or a comparison condition of motor-based treatment consistent with current best practices in speech therapy.

Improving Generalized Visual Grounding with Instance-aware Joint Learning.

IEEE Trans Pattern Anal Mach Intell

September 2025

Ming Dai , Wenxuan Cheng , Jiang-Jiang Liu , Lingfeng Yang , Zhenhua Feng

Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical visual grounding paradigm by accommodating multi-target and non-target scenarios. Specifically, GREC focuses on accurately identifying all referential objects at the coarse bounding box level, while GRES aims to achieve fine-grained pixel-level perception. However, existing approaches typically treat these tasks independently, overlooking the benefits of jointly training GREC and GRES to ensure consistent multi-granularity predictions and streamline the overall process.
