Large language model (LLM)-based speech synthesis has been widely adopted for zero-shot speech synthesis. However, such models require large-scale data and share the limitations of previous autoregressive speech models, including slow inference and a lack of robustness. This article proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC). We verified that hierarchical speech synthesis frameworks can significantly improve the robustness and expressiveness of synthetic speech. Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot scenarios. For TTS, we adopt the text-to-vec (TTV) framework, which generates a self-supervised speech representation and an F0 representation from text representations and prosody prompts. HierSpeech++ then generates speech from the generated vector, the F0, and a voice prompt. We further introduce a highly efficient speech super-resolution (SpeechSR) framework that upsamples speech from 16 to 48 kHz. The experimental results demonstrate that a hierarchical variational autoencoder can be a strong zero-shot speech synthesizer, outperforming LLM-based and diffusion-based models. Moreover, we achieve the first human-level-quality zero-shot speech synthesis. Audio samples and source code are available at https://github.com/hierspeechpp/code.
DOI: http://dx.doi.org/10.1109/TNNLS.2025.3584944
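To make the three-stage pipeline described in the abstract concrete, here is a minimal runnable sketch of its inference flow (text → TTV → hierarchical decoder → SpeechSR). All function names and shapes are illustrative assumptions, not the authors' actual API; each stage is stubbed with a placeholder that a real model would replace (see the linked repository for the real implementation).

```python
import numpy as np

def text_to_vec(text: str, prosody_prompt: np.ndarray):
    """Stub TTV: text + prosody prompt -> self-supervised representation and F0.
    Assumes ~4 frames per character and a flat 120 Hz pitch contour."""
    n_frames = max(1, len(text)) * 4
    ssl_vec = np.zeros((n_frames, 256))   # content representation (hypothetical dim)
    f0 = np.full(n_frames, 120.0)         # pitch contour in Hz
    return ssl_vec, f0

def hierspeech_decoder(ssl_vec, f0, voice_prompt):
    """Stub hierarchical VAE decoder: content + pitch + speaker prompt -> 16 kHz audio.
    Assumes a 20 ms hop (320 samples at 16 kHz)."""
    return np.zeros(len(f0) * 320)

def speech_sr(wav_16k):
    """Stub SpeechSR: 16 kHz -> 48 kHz (3x upsampling placeholder)."""
    return np.repeat(wav_16k, 3)

prompt = np.zeros(16000)                  # 1 s reference clip (placeholder)
wav_48k = speech_sr(hierspeech_decoder(*text_to_vec("hello", prompt), prompt))
print(wav_48k.shape)
```

The key design point the sketch mirrors is the hierarchy: content and pitch are predicted first from text, speaker identity is injected only at the waveform stage, and super-resolution is a separate lightweight module.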
Sci Data
August 2025
Medical University of Gdańsk, Department of Hypertension and Diabetology, Gdańsk, Poland.
Pre-trained models have become widely adopted for their strong zero-shot performance, often minimizing the need for task-specific data. However, specialized domains like medical speech recognition still benefit from tailored datasets. We present ADMEDVOICE, a novel Polish medical speech dataset, collected using a high-quality text corpus and diverse recording conditions to reflect real-world scenarios.
Med Image Anal
October 2025
University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France; IHU Strasbourg, Strasbourg, France.
Recent advancements in surgical computer vision applications have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning without relying on manual annotations.
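One common way to turn narrated lectures into supervision without manual labels is CLIP-style contrastive alignment of paired video and text embeddings. The sketch below is a generic illustration of that idea under standard assumptions (PyTorch, symmetric InfoNCE), not the paper's actual objective or architecture.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, dim) from separate encoders;
    # row i of each matrix comes from the same lecture segment.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature       # pairwise cosine similarities
    targets = torch.arange(len(v))       # matching pairs sit on the diagonal
    # Symmetric InfoNCE: match videos to texts and texts to videos.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = video_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```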
J Speech Lang Hear Res
July 2025
Department of Computer Science and Engineering, University of South Florida, Tampa.
Purpose: Development of aphasia therapies is limited by clinician shortages, patient recruitment challenges, and funding constraints. To address these barriers, we introduce ABCD, a novel method for simulating goal-driven natural spoken dialogues between two conversational artificial intelligence (AI) agents: an AI clinician (Re-Agent) and an AI patient (AI-Aphasic) that vocally mimics aphasic errors. Using ABCD, we simulated response elaboration training between the two agents with stimuli varying in semantic constraint (high via pictures, low via topics).
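The core mechanism is an alternating two-agent loop around a stimulus. Below is a minimal sketch of such a loop; respond() is a stub standing in for an LLM-plus-TTS pipeline, and the role prompts and fixed turn budget are illustrative assumptions rather than ABCD's actual design.

```python
def respond(role: str, history: list[str], stimulus: str) -> str:
    """Stub agent: a real system would call an LLM conditioned on the role
    (clinician applies response elaboration training; patient injects
    aphasic-style errors) and speak the result via TTS."""
    return f"[{role} reply to: {history[-1] if history else stimulus}]"

def simulate_dialogue(stimulus: str, max_turns: int = 6) -> list[str]:
    history: list[str] = []
    for turn in range(max_turns):
        role = "clinician" if turn % 2 == 0 else "patient"
        history.append(respond(role, history, stimulus))
    return history

for line in simulate_dialogue("picture: a boy kicking a ball"):
    print(line)
```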
Sci Rep
February 2025
School of Information Engineering, Minzu University of China, Beijing, 100081, China.
Zero-shot speaker adaptation seeks to clone the voices of previously unseen speakers from only a few seconds of their speech. Nevertheless, existing zero-shot multi-speaker text-to-speech (TTS) systems still exhibit significant gaps in synthesized speech quality and speaker similarity between unseen and seen speakers. To address these challenges and improve synthesized speech quality and speaker similarity for unseen speakers, this study introduces DiffGAN-ZSTTS, an efficient zero-shot speaker-adaptive TTS model.
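A typical skeleton for this setting: summarize the short reference clip into a fixed speaker embedding and condition the synthesizer on it. The sketch below shows only that generic conditioning pattern with placeholder stubs; it does not reproduce DiffGAN-ZSTTS's diffusion/GAN components, and all names and dimensions are assumptions.

```python
import numpy as np

def speaker_embedding(ref_wav: np.ndarray) -> np.ndarray:
    """Stub speaker encoder: real systems use a pretrained speaker-verification
    network (e.g., a d-vector/x-vector model) to summarize identity."""
    return np.tanh(np.random.default_rng(0).normal(size=192))

def tts(text: str, spk_emb: np.ndarray, sr: int = 22050) -> np.ndarray:
    """Stub acoustic model + vocoder conditioned on the speaker embedding."""
    return np.zeros(sr * max(1, len(text) // 10))

ref = np.zeros(3 * 16000)                 # ~3 s of unseen-speaker audio
wav = tts("Cloning an unseen voice.", speaker_embedding(ref))
```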