Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Large language model (LLM)-based speech synthesis has been widely adopted for zero-shot speech synthesis. However, such models require large-scale data and share the limitations of previous autoregressive speech models, including slow inference and a lack of robustness. This article proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC). We verified that hierarchical speech synthesis frameworks can significantly improve the robustness and expressiveness of synthetic speech. Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot scenarios. For TTS, we adopt the text-to-vec (TTV) framework, which generates a self-supervised speech representation and an F0 representation from text representations and prosody prompts. HierSpeech++ then generates speech from the generated vector, F0, and a voice prompt. We further introduce a highly efficient speech super-resolution (SpeechSR) framework that upsamples speech from 16 to 48 kHz. The experimental results demonstrate that the hierarchical variational autoencoder can be a strong zero-shot speech synthesizer, outperforming LLM-based and diffusion-based models. Moreover, we achieve the first human-level-quality zero-shot speech synthesis. Audio samples and source code are available at https://github.com/hierspeechpp/code.
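As a rough illustration of the three-stage pipeline the abstract describes, the Python sketch below chains TTV, the hierarchical synthesizer, and SpeechSR. Every name here is a hypothetical placeholder rather than the repository's actual API; the three models are passed in as callables, and only the data flow is meant to be accurate.

# Minimal sketch of the three-stage flow; ttv, hierspeechpp, and speechsr
# are placeholder callables standing in for the actual models (assumed
# names, not the published code's API).
def synthesize(text, prosody_prompt, voice_prompt, ttv, hierspeechpp, speechsr):
    # Stage 1: TTV maps text plus a prosody prompt to a self-supervised
    # speech representation and an F0 contour.
    semantic_vec, f0 = ttv(text, prosody_prompt)
    # Stage 2: the hierarchical synthesizer renders a 16 kHz waveform from
    # the representation, F0, and a voice prompt carrying speaker identity.
    wav_16k = hierspeechpp(semantic_vec, f0, voice_prompt)
    # Stage 3: SpeechSR upsamples the waveform from 16 kHz to 48 kHz.
    return speechsr(wav_16k)

In a staged design like this, voice conversion can plausibly reuse stages 2 and 3 by feeding a speech-derived representation in place of the TTV output, which is consistent with the abstract's joint TTS/VC framing.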

Source: http://dx.doi.org/10.1109/TNNLS.2025.3584944

Publication Analysis

Top Keywords

zero-shot speech (24)
speech synthesis (24)
speech (15)
hierarchical variational (8)
strong zero-shot (8)
speech synthesizer (8)
synthetic speech (8)
zero-shot (6)
synthesis (6)
hierspeech++ bridging (4)

Similar Publications

Pre-trained models have become widely adopted for their strong zero-shot performance, often minimizing the need for task-specific data. However, specialized domains like medical speech recognition still benefit from tailored datasets. We present ADMEDVOICE, a novel Polish medical speech dataset, collected using a high-quality text corpus and diverse recording conditions to reflect real-world scenarios.



Learning multi-modal representations by watching hundreds of surgical video lectures.

Med Image Anal, October 2025

University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France; IHU Strasbourg, Strasbourg, France.

Recent advancements in surgical computer vision applications have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning without relying on manual annotations.
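Pairing video clips with their spoken narration as a supervisory signal, as this snippet describes, is commonly realized with a symmetric contrastive (CLIP-style) objective. The PyTorch sketch below shows that generic loss over pre-computed embeddings; it illustrates the general technique, not this paper's exact objective, and all names are assumptions.

import torch
import torch.nn.functional as F

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, dim) embeddings of paired clips/narrations.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)  # i-th clip matches i-th text
    # Symmetric cross-entropy: match clip -> narration and narration -> clip.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

For example, clip_style_loss(torch.randn(8, 512), torch.randn(8, 512)) returns a scalar loss; minimizing it pulls each clip embedding toward its own narration and away from the other pairs in the batch.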


Purpose: Development of aphasia therapies is limited by clinician shortages, patient recruitment challenges, and funding constraints. To address these barriers, we introduce a novel method (ABCD) for simulating goal-driven natural spoken dialogues between two conversational artificial intelligence (AI) agents: an AI clinician (Re-Agent) and an AI patient (AI-Aphasic), which vocally mimics aphasic errors. Using ABCD, we simulated response elaboration training between both agents with stimuli varying in semantic constraint (high via pictures, low via topics).
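A goal-driven dialogue between two agents like this can be simulated with a simple turn-taking loop. The sketch below is a generic illustration under that assumption; query_llm and both prompt arguments are hypothetical stand-ins, not components of the paper's method.

# Generic two-agent turn-taking loop; query_llm is a hypothetical callable
# (system_prompt, history) -> reply string standing in for any
# chat-completion call, not the paper's implementation.
def simulate_dialogue(query_llm, clinician_prompt, patient_prompt, stimulus, turns=3):
    history = [f"Stimulus: {stimulus}"]
    for _ in range(turns):
        clinician_turn = query_llm(clinician_prompt, history)  # elicit elaboration
        history.append(f"Clinician: {clinician_turn}")
        patient_turn = query_llm(patient_prompt, history)      # respond with simulated aphasic errors
        history.append(f"Patient: {patient_turn}")
    return history

Each iteration yields one clinician-patient exchange seeded by the stimulus (for example, a picture description or a conversation topic).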


Zero-shot speaker adaptation seeks to clone the voices of previously unseen speakers from only a few seconds of their speech. Nevertheless, existing zero-shot multi-speaker text-to-speech (TTS) systems still exhibit significant gaps in synthesized speech quality and speaker similarity between unseen and seen speakers. To close these gaps, this study introduces an efficient zero-shot speaker-adaptive TTS model, DiffGAN-ZSTTS.
