Transinger: Cross-Lingual Singing Voice Synthesis via IPA-Based Phonetic Alignment.

Chen Shen, Lu Zhao, Cejin Fu, Bote Gan, Zhenlong Du

Sensors (Basel)

College of Computer and Information Engineering (College of Artificial Intelligence), Nanjing Tech University, Nanjing 211816, China.

Published: June 2025

Although Singing Voice Synthesis (SVS) has revolutionized audio content creation, handling global linguistic diversity remains a challenge. Current SVS research has barely explored cross-lingual generalization, because fragmented, language-specific phoneme encodings (e.g., Pinyin, ARPA) hinder unified phonetic modeling. To address this challenge, we built a four-language dataset based on GTSinger's speech data, using the International Phonetic Alphabet (IPA) for consistent phonetic representation and applying precise segmentation and calibration for improved quality. In particular, we propose a novel method of decomposing IPA phonemes into letters and diacritics, enabling the model to learn the underlying rules of pronunciation and generalize better. A dynamic IPA adaptation strategy further enables the application of learned phonetic representations to unseen languages. Based on VISinger2, we introduce Transinger, a cross-lingual synthesis framework. Transinger improves phoneme representation learning by precisely modeling pronunciation, which enables compositional generalization to unseen languages. It also integrates Conformer and RVQ techniques to optimize information extraction and generation. Objective and subjective experiments confirm that Transinger significantly outperforms state-of-the-art singing synthesis methods in terms of cross-lingual generalization. These results demonstrate that multilingual aligned representations can markedly enhance model learning efficacy and robustness, even for languages not seen during training. Moreover, splitting IPA phonemes into letters and diacritics allows the model to learn pronunciation more effectively, resulting in a qualitative improvement in generalization.
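The idea of decomposing an IPA phoneme into letters and diacritics can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `split_ipa_phoneme` is hypothetical, and the rule used here (Unicode combining marks and modifier letters count as diacritics, everything else as a base letter) is an assumption made for illustration only.

```python
import unicodedata
from typing import List, Tuple


def split_ipa_phoneme(phoneme: str) -> Tuple[List[str], List[str]]:
    """Split one IPA phoneme into base letters and diacritics.

    Assumption (not from the paper): Unicode combining marks (category M*)
    and modifier letters (category Lm, e.g. aspiration or length marks)
    are treated as diacritics; all other characters are base letters.
    """
    letters: List[str] = []
    diacritics: List[str] = []
    # NFD separates precomposed characters into base letter + combining mark.
    for ch in unicodedata.normalize("NFD", phoneme):
        category = unicodedata.category(ch)
        if category.startswith("M") or category == "Lm":
            diacritics.append(ch)
        else:
            letters.append(ch)
    return letters, diacritics


if __name__ == "__main__":
    # Aspirated t, nasalized a, long a.
    for phoneme in ["t\u02b0", "\u00e3", "a\u02d0"]:
        print(phoneme, split_ipa_phoneme(phoneme))
```

Under this assumed rule, phonemes that share a base letter or a diacritic also share sub-units, which is one plausible way a model could reuse what it has learned when it meets an unseen combination.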

DOI: http://dx.doi.org/10.3390/s25133973

Publication Analysis

Top Keywords

singing voice

voice synthesis

cross-lingual generalization

ipa phonemes

phonemes letters

letters diacritics

unseen languages

cross-lingual synthesis

pronunciation effectively

synthesis
