Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Making a model accurately understand and follow natural language instructions while performing actions consistent with world knowledge is a key challenge in robot manipulation. This challenge mainly involves reasoning over fuzzy human instructions and adhering to physical knowledge, so an embodied intelligent agent must be able to model world knowledge from its training data. However, most existing vision-and-language robot manipulation methods operate in unrealistic simulators and language settings and lack explicit modeling of world knowledge. To bridge this gap, we introduce a novel and simple robot manipulation framework called Surfer. Built on a world model, Surfer treats robot manipulation as the state transition of the visual scene and decouples it into two parts: action and scene. The model's generalization to new instructions and new scenes is then enhanced by explicitly modeling action and scene prediction over multimodal information. In addition, we built a robot manipulation simulation platform that supports physical execution based on the MuJoCo physics engine; it automatically generates demonstration training data and test data, substantially reducing labor costs. To comprehensively and systematically evaluate the visual-language understanding and physical execution of manipulation models, we also created a robot manipulation benchmark called SeaWave. It contains four visual-language manipulation tasks of increasing difficulty and provides a standardized testing platform for embodied AI agents in multimodal environments. Overall, we hope Surfer can surf freely on the robot SeaWave benchmark. Extensive experiments show that Surfer consistently and significantly outperforms all baselines across all manipulation tasks. On average, Surfer achieves a success rate of 54.74% on the four defined levels of manipulation tasks, exceeding the best baseline's 51.07%. The simulator, code, and benchmarks are released at https://pzhren.github.io/Surfer.
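The decoupled design described in the abstract, predicting the action and the next scene as two separate branches over fused multimodal features, can be illustrated with a minimal sketch. The following PyTorch code is an assumption-laden illustration, not Surfer's actual architecture: the module names, feature dimensions, and the 7-DoF action dimension are all hypothetical.

```python
import torch
import torch.nn as nn

class WorldModelSketch(nn.Module):
    """Minimal sketch of a decoupled world model for manipulation.

    Treats manipulation as a state transition of the visual scene and
    splits prediction into an action head and a scene head, roughly
    following the decoupling the abstract describes. All names and
    sizes here are illustrative assumptions, not Surfer's design.
    """

    def __init__(self, vision_dim=512, text_dim=512, hidden_dim=512,
                 action_dim=7):
        super().__init__()
        # Fuse the current visual state with the language instruction.
        self.fusion = nn.Sequential(
            nn.Linear(vision_dim + text_dim, hidden_dim),
            nn.ReLU(),
        )
        # Action branch: predict the next low-level action (e.g., a
        # 7-DoF end-effector command; the dimension is an assumption).
        self.action_head = nn.Linear(hidden_dim, action_dim)
        # Scene branch: predict the embedding of the next visual state.
        self.scene_head = nn.Linear(hidden_dim, vision_dim)

    def forward(self, vision_feat, text_feat):
        h = self.fusion(torch.cat([vision_feat, text_feat], dim=-1))
        return self.action_head(h), self.scene_head(h)

model = WorldModelSketch()
vision = torch.randn(1, 512)   # current scene embedding
text = torch.randn(1, 512)     # instruction embedding
action, next_scene = model(vision, text)
print(action.shape, next_scene.shape)  # torch.Size([1, 7]) torch.Size([1, 512])
```

Training both heads jointly would supervise the scene transition as well as the action, which is the mechanism the abstract credits for better generalization to new instructions and scenes.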


Source
http://dx.doi.org/10.1109/TNNLS.2025.3594117

Publication Analysis

Top Keywords

robot manipulation: 24
manipulation tasks: 12
manipulation: 11
ability model: 8
training data: 8
explicit modeling: 8
action scene: 8
difficulty levels: 8
robot: 6
surfer: 5

Similar Publications

A soft micron accuracy robot design and clinical validation for retinal surgery.

Microsyst Nanoeng

September 2025

Department of Ophthalmology, Key Laboratory of Precision Medicine for Eye Diseases of Zhejiang Province, Center for Rehabilitation Medicine, Zhejiang Provincial People's Hospital (Affiliated People's Hospital, Hangzhou Medical College), Hangzhou, 314408, China.

Retinal surgery is one of the most delicate and complex operations, approaching or even exceeding the physiological limits of the human hand. Robots have demonstrated the ability to filter hand tremor and scale motion, which is promising for microsurgery. Here, we present a novel soft micron accuracy robot (SMAR) for retinal surgery that achieves more precise and safer operation.


Corrigendum to "Robotic manipulations of single cells using a large-volume piezoelectric micropipette with nanoliter precision" [Colloid. Surf. B Biointerfaces 256 (2025) 114972].

Colloids Surf B Biointerfaces

September 2025

Nanobiosensorics Laboratory, Institute of Technical Physics and Materials Science, HUN-REN Centre for Energy Research, Budapest, Hungary; Nanobiosensorics Group, Institute of Biophysics, HUN-REN Biological Research Centre, Szeged, Hungary.


Tunable Optical Metamaterial Enables Steganography, Rewriting, and Multilevel Information Storage.

Nanomicro Lett

September 2025

State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, 110016, People's Republic of China.

In the realm of secure information storage, optical encryption has emerged as a vital technique, particularly with the miniaturization of encryption devices. However, many existing systems lack the necessary reconfigurability and dynamic functionality. This study presents a novel approach through the development of dynamic optical-to-chemical energy conversion metamaterials, which enable enhanced steganography and multilevel information storage.


In unstructured environments, robots struggle to grasp irregular, fragile objects efficiently and accurately. To address this, the paper introduces a soft robotic hand tailored for such settings and enhances You Only Look Once v5s (YOLOv5s), a lightweight detection algorithm, to achieve efficient grasping. A rapid pneumatic-network-based soft finger structure, broadly applicable to variously placed irregular objects, is designed, along with a mathematical model, validated through simulations, that links the bending angle of the fingers to the input gas pressure.
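As a rough illustration of how such a pressure-to-bending-angle model might be obtained, the sketch below fits a polynomial to calibration data and inverts it numerically to choose a commanded pressure. The data values, the quadratic form, and the pressure range are all hypothetical assumptions for illustration; the paper's actual model is not reproduced here.

```python
import numpy as np

# Hypothetical calibration data for a pneumatic soft finger:
# input gas pressure (kPa) vs. measured bending angle (degrees).
# These values are illustrative, not taken from the paper.
pressure_kpa = np.array([0, 20, 40, 60, 80, 100])
angle_deg = np.array([0.0, 8.5, 19.2, 32.8, 49.1, 68.4])

# Fit a quadratic pressure-to-angle model: angle ~ a*p**2 + b*p + c.
coeffs = np.polyfit(pressure_kpa, angle_deg, deg=2)
angle_model = np.poly1d(coeffs)

# Invert numerically: find the pressure that produces a target angle,
# so a controller can command a desired finger curvature.
target_angle = 30.0
candidates = np.linspace(0, 100, 1001)
best_p = candidates[np.argmin(np.abs(angle_model(candidates) - target_angle))]
print(f"commanded pressure for {target_angle} deg: {best_p:.1f} kPa")
```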
