Text-Guided Visual Representation Optimization for Sensor-Acquired Video Temporal Grounding.

Yun Tian, Xiaobo Guo, Jinsong Wang, Xinyue Liang

Sensors (Basel)

School of Optoelectronic Engineering, Changchun University of Science and Technology, Changchun 130022, China.

Published: July 2025

Video temporal grounding (VTG) aims to localize a semantically relevant temporal segment within an untrimmed video based on a natural language query. The task remains difficult because of cross-modal semantic misalignment, which is largely attributed to redundant visual content in sensor-acquired video streams, linguistic ambiguity, and discrepancies between modality-specific representations. Most existing approaches rely on intra-modal feature modeling, processing video and text independently throughout the representation learning stage. This isolation undermines semantic alignment because it neglects cross-modal interactions. In practice, a natural language query typically corresponds to specific spatiotemporal content in video signals collected through camera-based sensing systems: a particular sequence of frames and the salient subregions within them. We propose a text-guided visual representation optimization framework designed to improve semantic interpretation of video signals captured by visual sensors. The framework uses textual information to focus on the relevant spatiotemporal video content, thereby narrowing the cross-modal gap. Built upon the unified cross-modal embedding space provided by CLIP, our model structures representations of video data from sensing devices and introduces two dedicated modules that semantically refine visual representations along the spatial and temporal dimensions. First, we design a Spatial Visual Representation Optimization (SVRO) module to learn spatial information within individual frames. It selects salient patches related to the text, capturing finer-grained visual details. Second, we introduce a Temporal Visual Representation Optimization (TVRO) module to learn temporal relations across frames. A temporal triplet loss is employed in TVRO to enhance attention on text-relevant frames and capture clip semantics. Additionally, a self-supervised contrastive loss is introduced at the clip-text level to improve inter-clip discrimination by maximizing semantic variance during training. Experiments on three widely used benchmark datasets, Charades-STA, ActivityNet Captions, and TACoS, demonstrate that our method outperforms state-of-the-art methods across multiple metrics.
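
The abstract names two mechanisms without giving their formulas: selecting text-relevant patches within a frame (SVRO) and a temporal triplet loss over frames (TVRO). The following is a minimal sketch of how such mechanisms are commonly written, not the authors' implementation. It assumes patch, frame, and text embeddings already live in a shared space (such as one produced by CLIP), uses cosine similarity for relevance, top-k selection for patches, and a hinge-style margin loss for the triplet term. The function names, the `k` and `margin` parameters, and the use of a boolean mask marking frames inside the ground-truth segment are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def select_salient_patches(patch_emb, text_emb, k):
    """Illustrative text-guided patch selection (hypothetical helper).

    patch_emb: (P, D) patch embeddings of one frame.
    text_emb:  (D,) query embedding in the same space.
    Returns the indices of the k patches with the highest cosine
    similarity to the text, plus all similarity scores.
    """
    sims = l2_normalize(patch_emb) @ l2_normalize(text_emb)
    top_idx = np.argsort(-sims)[:k]
    return top_idx, sims


def temporal_triplet_loss(frame_emb, text_emb, inside_mask, margin=0.2):
    """Illustrative hinge-style triplet loss over frames (an assumption,
    not the paper's exact formulation).

    The text acts as the anchor, frames inside the annotated segment are
    positives, and frames outside it are negatives. Every positive should
    be at least `margin` more similar to the text than every negative.
    """
    sims = l2_normalize(frame_emb) @ l2_normalize(text_emb)
    pos = sims[inside_mask]
    neg = sims[~inside_mask]
    if pos.size == 0 or neg.size == 0:
        return 0.0  # no valid (positive, negative) pair to compare
    # Pairwise hinge over all (positive, negative) combinations.
    gaps = margin - pos[:, None] + neg[None, :]
    return float(np.maximum(gaps, 0.0).mean())
```

In this sketch the loss is zero once every frame inside the segment scores at least `margin` higher than every frame outside it, and it grows as text-irrelevant frames become more similar to the query than the relevant ones.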

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12349264
DOI: http://dx.doi.org/10.3390/s25154704
