VirNucPro: an identifier for the identification of viral short sequences using six-frame translation and large language models.

Brief Bioinform

The College of Life Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China.

Published: May 2025



Citations: 20

Article Abstract

Viruses are ubiquitous in nature, yet our understanding of them remains limited. High-throughput sequencing technology facilitates the unbiased revelation of genetic composition in samples; however, viral sequences typically make up a small proportion of the entire sequencing data, making it challenging to accurately identify the few or fragmented viral sequences present in a sample. The limited features and information provided by short sequences result in insufficient resolution of viral sequences by existing models. Therefore, we propose a new model, VirNucPro, for short viral sequence identification. Based on a six-frame translation strategy and large language models, we combine nucleotide and amino acid sequence information to enhance feature extraction for short sequences, achieving high accuracy in identifying short viral sequences. Ablation experiments compared the contributions of nucleotide and amino acid sequence features to the model, confirming that the introduced amino acid features significantly contribute to the classification results. Our model outperforms others, such as GCNFrame, DeepVirFinder, DETIRE, and Virtifier, which have demonstrated good performance in identifying short viral sequences of 300 and 500 bp. Our model demonstrates excellent performance on carefully created real-world datasets. Additionally, it can scan for prophage regions within long bacterial fragments, offering a wide range of applications. The code is available at: https://github.com/Li-Jing-1997/VirNucPro.
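The six-frame translation strategy mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the VirNucPro repository: the function names (`reverse_complement`, `translate`, `six_frame_translation`), the use of the standard genetic code, and the choice to map codons containing ambiguous bases to `X` are all assumptions made for illustration. The idea is that a short read yields three forward and three reverse-complement reading frames, each of which gives an amino acid sequence that a protein language model could embed alongside the nucleotide sequence.

```python
# Illustrative sketch only; names and choices here are assumptions,
# not the VirNucPro implementation.

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (N is left as N)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return the six translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Stop codons are kept as `*` in this sketch; whether frames containing stop codons are kept, split, or discarded before embedding is a design choice the abstract does not specify.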


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12086996
DOI: http://dx.doi.org/10.1093/bib/bbaf224

Publication Analysis

Top Keywords

viral sequences: 20
short sequences: 12
short viral: 12
amino acid: 12
sequences: 8
six-frame translation: 8
large language: 8
language models: 8
nucleotide amino: 8
acid sequence: 8

Similar Publications

IBDV-SSA, a novel molecular approach for the recovery of infectious bursal disease virus whole genomes from FTA cards.

Microbiol Spectr

September 2025

United States Department of Agriculture, Agricultural Research Service (USDA-ARS), Southeast Poultry Research Laboratories, US National Poultry Research Center, Athens, Georgia, USA.

Infectious bursal disease (IBD), a highly contagious viral disease in young chickens, poses significant economic losses due to high mortality and immunosuppression. While IBD virus (IBDV) virulence is influenced by multiple genes, whole-genome sequencing (WGS) of IBDV is crucial for defining the strain pathotype and clinical profile. Flinders Technology Associates (FTA) cards are convenient for field sample collection, but their filter paper matrix can hinder nucleic acid recovery, impacting sequencing efficiency.


Unrelated pathogens, including viruses and bacteria, use a common short linear motif (SLiM) to interact with cellular kinases of the RSK (p90 S6 ribosomal kinase) family. Such a "DDVF" (D/E-D/E-V-F) SLiM occurs in the leader (L) protein encoded by picornaviruses of the genus Cardiovirus, including Theiler's murine encephalomyelitis virus (TMEV), Boone cardiovirus (BCV), and Encephalomyocarditis virus (EMCV). The L-RSK complex is targeted to the nuclear pore, where RSK triggers hyperphosphorylation of FG-nucleoporins, thereby disrupting nucleocytoplasmic trafficking.


Background: Mixed-phenotype acute leukemia (MPAL) is a rare acute leukemia for which data are currently not available to guide therapy. It has a poor outcome, particularly in elderly patients.

Case Presentation: We report the successful use of venetoclax/azacitidine as treatment for a treatment-naive elderly patient with early T-cell precursor (ETP)/myeloid MPAL.


Introduction: Metastatic colorectal cancer (mCRC) exhibits significant heterogeneity in molecular profiles, influencing treatment response and patient outcomes. Mutations in v-raf murine sarcoma viral oncogene homolog B1 (BRAF) and rat sarcoma (RAS) family genes are commonly observed in mCRC. Though originally thought to be mutually exclusive, recent data have shown that patients may present with concomitant BRAF and RAS mutations, posing unique challenges and implications for clinical management.


Introduction: The Zika virus (ZIKV) envelope (E) protein is critical for viral replication and host interactions. Although glycosylation of the E protein is known to influence viral infectivity and immune evasion, the specific functional roles of E protein glycosylation in ZIKV infectivity in mosquito cells remain unclear.

Methods: In this study, we generated a deglycosylation mutant ZIKV with a T156I substitution in the E protein and investigated its effects on viral replication and viral-host interactions in mosquito C6/36 cells.
