General 3D Vision-Language Model With Fast Rendering and Pre-Training Vision-Language Alignment.

Kangcheng Liu, Yong-Jin Liu, Baoquan Chen

IEEE Trans Pattern Anal Mach Intell

Published: September 2025

Prevailing vision-language models have achieved remarkable progress in 3D scene understanding, but they are trained in a closed-set setting with full labels. The major bottleneck for current 3D scene recognition in robotics is that these models cannot recognize novel classes beyond the training categories, which limits real-world applications such as robot manipulation and navigation. Meanwhile, current state-of-the-art 3D scene understanding approaches require a large number of high-quality labels to train neural networks and perform well only in the fully supervised setting. A framework is therefore urgently needed that applies to both 3D point cloud segmentation and detection, particularly when labels are scarce. This work presents a general and straightforward framework for 3D scene understanding when labeled scenes are limited. To extract knowledge about novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy that distills meaningful information from large-scale vision-language models and benefits open-vocabulary scene understanding tasks. To leverage boundary information, we propose a novel boundary-aware energy-based loss that benefits from region-level boundary predictions. To encourage latent instance discrimination while maintaining efficiency, we propose an unsupervised region-level semantic contrastive learning scheme for point clouds, which uses confident predictions of the neural network to discriminate the intermediate feature embeddings at multiple stages.
In the limited reconstruction setting, our approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark for both semantic segmentation and instance segmentation. WS3D++ also achieves state-of-the-art data-efficient learning performance on the other large-scale real-scene indoor and outdoor datasets, S3DIS and SemanticKITTI. Extensive experiments on both indoor and outdoor scenes demonstrate the effectiveness of our approach in both data-efficient learning and open-world few-shot learning.
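
The abstract gives no formulas for the region-level contrastive scheme, so the following is only a minimal sketch of the general idea of using confident predictions to discriminate region embeddings. It is not the WS3D++ loss. It assumes one embedding and one softmax vector per region, and the function name `region_contrastive_loss`, the confidence threshold, and the temperature are all hypothetical choices made for illustration. The loss is a generic pseudo-label InfoNCE: regions that share a confident predicted class are pulled together, and all other confident regions act as negatives.

```python
import numpy as np

def region_contrastive_loss(features, probs, conf_threshold=0.9, temperature=0.1):
    """Generic pseudo-label InfoNCE over region embeddings (illustrative only).

    features: (R, D) array, one embedding per region (assumed input layout).
    probs:    (R, C) array, softmax class probabilities per region.
    """
    # Keep only regions whose top prediction is confident enough.
    keep = probs.max(axis=1) >= conf_threshold
    feats = features[keep]
    labels = probs[keep].argmax(axis=1)
    n = feats.shape[0]
    if n < 2:
        return 0.0

    # Cosine similarity between L2-normalised embeddings, scaled by temperature.
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    logits = feats @ feats.T / temperature

    # Exclude self-similarity, then take a numerically stable row-wise log-softmax.
    self_mask = np.eye(n, dtype=bool)
    logits = np.where(self_mask, -np.inf, logits)
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Positives: other confident regions that share the same pseudo-label.
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    has_pos = pos.any(axis=1)
    if not has_pos.any():
        return 0.0

    # Mean log-likelihood of the positives for each anchor that has at least one.
    pos_sum = np.where(pos, log_prob, 0.0).sum(axis=1)
    per_anchor = pos_sum[has_pos] / pos.sum(axis=1)[has_pos]
    return float(-per_anchor.mean())
```

With this sketch, the loss is small when embeddings of regions with the same confident pseudo-label are already close, and large when the pseudo-labels cut across the embedding clusters. Applying such a loss to intermediate features at several network stages, as the abstract describes, would amount to calling it once per stage and summing the results; how the paper weights or combines the stages is not stated here.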

DOI: http://dx.doi.org/10.1109/TPAMI.2025.3566593
