Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Making a model accurately understand and follow natural language instructions while performing actions consistent with world knowledge is a key challenge in robot manipulation. This challenge mainly involves reasoning over fuzzy human instructions and adhering to physical knowledge, so an embodied intelligent agent must be able to model world knowledge from its training data. However, most existing vision-and-language robot manipulation methods operate in unrealistic simulators and language settings and lack explicit modeling of world knowledge. To bridge this gap, we introduce a novel and simple robot manipulation framework called Surfer. Built on a world model, Surfer treats robot manipulation as the state transition of the visual scene and decouples it into two parts: action and scene. The model's generalization to new instructions and new scenes is then enhanced by explicitly modeling action and scene prediction over multimodal information. In addition, we built a robot manipulation simulation platform that supports physical execution based on the MuJoCo physics engine; it automatically generates demonstration training data and test data, substantially reducing labor costs. To comprehensively and systematically evaluate the visual-language understanding and physical execution of manipulation models, we also created a robot manipulation benchmark called SeaWave. It contains four visual-language manipulation tasks of increasing difficulty and provides a standardized testing platform for embodied AI agents in multimodal environments. Overall, we hope Surfer can surf freely on the robot SeaWave benchmark. Extensive experiments show that Surfer consistently and significantly outperforms all baselines across all manipulation tasks. On average, Surfer achieves a success rate of 54.74% on the four defined levels of manipulation tasks, exceeding the best baseline's 51.07%. The simulator, code, and benchmarks are released at https://pzhren.github.io/Surfer.
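The decoupled design described in the abstract, predicting the action and the next scene as two separate branches over fused multimodal features, can be illustrated with a minimal sketch. The following PyTorch code is an assumption-laden illustration, not Surfer's actual architecture: the module names, feature dimensions, and the 7-DoF action dimension are all hypothetical.

```python
import torch
import torch.nn as nn

class WorldModelSketch(nn.Module):
    """Minimal sketch of a decoupled world model for manipulation.

    Treats manipulation as a state transition of the visual scene and
    splits prediction into an action head and a scene head, roughly
    following the decoupling the abstract describes. All names and
    sizes here are illustrative assumptions, not Surfer's design.
    """

    def __init__(self, vision_dim=512, text_dim=512, hidden_dim=512,
                 action_dim=7):
        super().__init__()
        # Fuse the current visual state with the language instruction.
        self.fusion = nn.Sequential(
            nn.Linear(vision_dim + text_dim, hidden_dim),
            nn.ReLU(),
        )
        # Action branch: predict the next low-level action (e.g., a
        # 7-DoF end-effector command; the dimension is an assumption).
        self.action_head = nn.Linear(hidden_dim, action_dim)
        # Scene branch: predict the embedding of the next visual state.
        self.scene_head = nn.Linear(hidden_dim, vision_dim)

    def forward(self, vision_feat, text_feat):
        h = self.fusion(torch.cat([vision_feat, text_feat], dim=-1))
        return self.action_head(h), self.scene_head(h)

model = WorldModelSketch()
vision = torch.randn(1, 512)   # current scene embedding
text = torch.randn(1, 512)     # instruction embedding
action, next_scene = model(vision, text)
print(action.shape, next_scene.shape)  # torch.Size([1, 7]) torch.Size([1, 512])
```

Training both heads jointly would supervise the scene transition as well as the action, which is the mechanism the abstract credits for better generalization to new instructions and scenes.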


Source
http://dx.doi.org/10.1109/TNNLS.2025.3594117

Publication Analysis

Top Keywords

robot manipulation: 24
manipulation tasks: 12
manipulation: 11
ability model: 8
training data: 8
explicit modeling: 8
action scene: 8
difficulty levels: 8
robot: 6
surfer: 5

Similar Publications

A soft micron accuracy robot design and clinical validation for retinal surgery.

Microsyst Nanoeng

September 2025

Department of Ophthalmology, Key Laboratory of Precision Medicine for Eye Diseases of Zhejiang Province, Center for Rehabilitation Medicine, Zhejiang Provincial People's Hospital (Affiliated People's Hospital, Hangzhou Medical College), Hangzhou, 314408, China.

Retinal surgery is one of the most delicate and complex operations, approaching or even exceeding the physiological limits of the human hand. Robots have demonstrated the ability to filter hand tremor and scale motion, which is promising for microsurgery. Here, we present a novel soft micron accuracy robot (SMAR) for retinal surgery that achieves more precise and safer operation.


Corrigendum to "Robotic manipulations of single cells using a large-volume piezoelectric micropipette with nanoliter precision" [Colloid. Surf. B Biointerfaces 256 (2025) 114972].

Colloids Surf B Biointerfaces

September 2025

Nanobiosensorics Laboratory, Institute of Technical Physics and Materials Science, HUN-REN Centre for Energy Research, Budapest, Hungary; Nanobiosensorics Group, Institute of Biophysics, HUN-REN Biological Research Centre, Szeged, Hungary.


Tunable Optical Metamaterial Enables Steganography, Rewriting, and Multilevel Information Storage.

Nanomicro Lett

September 2025

State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, 110016, People's Republic of China.

In the realm of secure information storage, optical encryption has emerged as a vital technique, particularly with the miniaturization of encryption devices. However, many existing systems lack the necessary reconfigurability and dynamic functionality. This study presents a novel approach through the development of dynamic optical-to-chemical energy conversion metamaterials, which enable enhanced steganography and multilevel information storage.


In unstructured environments, robots struggle to grasp irregular, fragile objects efficiently and accurately. To address this, the paper introduces a soft robotic hand tailored for such settings and enhances You Only Look Once v5s (YOLOv5s), a lightweight detection algorithm, to achieve efficient grasping. A rapid pneumatic-network-based soft finger structure, broadly applicable to variously placed irregular objects, is designed, along with a mathematical model, validated through simulations, that links the bending angle of the fingers to the input gas pressure.
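As a rough illustration of how such a pressure-to-bending-angle model might be obtained, the sketch below fits a polynomial to calibration data and inverts it numerically to choose a commanded pressure. The data values, the quadratic form, and the pressure range are all hypothetical assumptions for illustration; the paper's actual model is not reproduced here.

```python
import numpy as np

# Hypothetical calibration data for a pneumatic soft finger:
# input gas pressure (kPa) vs. measured bending angle (degrees).
# These values are illustrative, not taken from the paper.
pressure_kpa = np.array([0, 20, 40, 60, 80, 100])
angle_deg = np.array([0.0, 8.5, 19.2, 32.8, 49.1, 68.4])

# Fit a quadratic pressure-to-angle model: angle ~ a*p**2 + b*p + c.
coeffs = np.polyfit(pressure_kpa, angle_deg, deg=2)
angle_model = np.poly1d(coeffs)

# Invert numerically: find the pressure that produces a target angle,
# so a controller can command a desired finger curvature.
target_angle = 30.0
candidates = np.linspace(0, 100, 1001)
best_p = candidates[np.argmin(np.abs(angle_model(candidates) - target_angle))]
print(f"commanded pressure for {target_angle} deg: {best_p:.1f} kPa")
```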
