Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Large language model (LLM)-based speech synthesis has been widely adopted for zero-shot speech synthesis. However, such models require large-scale data and share the limitations of previous autoregressive speech models, including slow inference and a lack of robustness. This article proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC). We verified that hierarchical speech synthesis frameworks can significantly improve the robustness and expressiveness of synthetic speech. Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot scenarios. For TTS, we adopt the text-to-vec (TTV) framework, which generates a self-supervised speech representation and an F0 representation from text representations and prosody prompts. HierSpeech++ then generates speech from the generated vector, F0, and a voice prompt. We further introduce a highly efficient speech super-resolution (SpeechSR) framework that upsamples speech from 16 to 48 kHz. The experimental results demonstrate that the hierarchical variational autoencoder can be a strong zero-shot speech synthesizer, outperforming LLM-based and diffusion-based models. Moreover, we achieve the first human-level-quality zero-shot speech synthesis. Audio samples and source code are available at https://github.com/hierspeechpp/code.
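As a rough illustration of the three-stage pipeline the abstract describes, the Python sketch below chains TTV, the hierarchical synthesizer, and SpeechSR. Every name here is a hypothetical placeholder rather than the repository's actual API; the three models are passed in as callables, and only the data flow is meant to be accurate.

# Minimal sketch of the three-stage flow; ttv, hierspeechpp, and speechsr
# are placeholder callables standing in for the actual models (assumed
# names, not the published code's API).
def synthesize(text, prosody_prompt, voice_prompt, ttv, hierspeechpp, speechsr):
    # Stage 1: TTV maps text plus a prosody prompt to a self-supervised
    # speech representation and an F0 contour.
    semantic_vec, f0 = ttv(text, prosody_prompt)
    # Stage 2: the hierarchical synthesizer renders a 16 kHz waveform from
    # the representation, F0, and a voice prompt carrying speaker identity.
    wav_16k = hierspeechpp(semantic_vec, f0, voice_prompt)
    # Stage 3: SpeechSR upsamples the waveform from 16 kHz to 48 kHz.
    return speechsr(wav_16k)

In a staged design like this, voice conversion can plausibly reuse stages 2 and 3 by feeding a speech-derived representation in place of the TTV output, which is consistent with the abstract's joint TTS/VC framing.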

Source: http://dx.doi.org/10.1109/TNNLS.2025.3584944

Publication Analysis

Top Keywords

zero-shot speech (24)
speech synthesis (24)
speech (15)
hierarchical variational (8)
strong zero-shot (8)
speech synthesizer (8)
synthetic speech (8)
zero-shot (6)
synthesis (6)
hierspeech++ bridging (4)

Similar Publications

Pre-trained models have become widely adopted for their strong zero-shot performance, often minimizing the need for task-specific data. However, specialized domains like medical speech recognition still benefit from tailored datasets. We present ADMEDVOICE, a novel Polish medical speech dataset, collected using a high-quality text corpus and diverse recording conditions to reflect real-world scenarios.



Learning multi-modal representations by watching hundreds of surgical video lectures.

Med Image Anal, October 2025

University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France; IHU Strasbourg, Strasbourg, France.

Recent advancements in surgical computer vision applications have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning without relying on manual annotations.
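Pairing video clips with their spoken narration as a supervisory signal, as this snippet describes, is commonly realized with a symmetric contrastive (CLIP-style) objective. The PyTorch sketch below shows that generic loss over pre-computed embeddings; it illustrates the general technique, not this paper's exact objective, and all names are assumptions.

import torch
import torch.nn.functional as F

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, dim) embeddings of paired clips/narrations.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)  # i-th clip matches i-th text
    # Symmetric cross-entropy: match clip -> narration and narration -> clip.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

For example, clip_style_loss(torch.randn(8, 512), torch.randn(8, 512)) returns a scalar loss; minimizing it pulls each clip embedding toward its own narration and away from the other pairs in the batch.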


Purpose: Development of aphasia therapies is limited by clinician shortages, patient recruitment challenges, and funding constraints. To address these barriers, we introduce a novel method (ABCD) for simulating goal-driven natural spoken dialogues between two conversational artificial intelligence (AI) agents: an AI clinician (Re-Agent) and an AI patient (AI-Aphasic), which vocally mimics aphasic errors. Using ABCD, we simulated response elaboration training between both agents with stimuli varying in semantic constraint (high via pictures, low via topics).
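A goal-driven dialogue between two agents like this can be simulated with a simple turn-taking loop. The sketch below is a generic illustration under that assumption; query_llm and both prompt arguments are hypothetical stand-ins, not components of the paper's method.

# Generic two-agent turn-taking loop; query_llm is a hypothetical callable
# (system_prompt, history) -> reply string standing in for any
# chat-completion call, not the paper's implementation.
def simulate_dialogue(query_llm, clinician_prompt, patient_prompt, stimulus, turns=3):
    history = [f"Stimulus: {stimulus}"]
    for _ in range(turns):
        clinician_turn = query_llm(clinician_prompt, history)  # elicit elaboration
        history.append(f"Clinician: {clinician_turn}")
        patient_turn = query_llm(patient_prompt, history)      # respond with simulated aphasic errors
        history.append(f"Patient: {patient_turn}")
    return history

Each iteration yields one clinician-patient exchange seeded by the stimulus (for example, a picture description or a conversation topic).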


Zero-shot speaker adaptation seeks to clone the voices of previously unseen speakers from only a few seconds of their speech. Nevertheless, existing zero-shot multi-speaker text-to-speech (TTS) systems still exhibit significant gaps in synthesized speech quality and speaker similarity between unseen and seen speakers. To close these gaps, this study introduces an efficient zero-shot speaker-adaptive TTS model, DiffGAN-ZSTTS.
