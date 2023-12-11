Researchers from Tsinghua University and Microsoft Research Asia have developed Bridge-TTS, a text-to-speech system that replaces the noisy Gaussian prior used in conventional diffusion-based methods with a clean, deterministic one. The new prior is derived from the latent representation of the text input and provides strong structural information about the target.

The key contribution of Bridge-TTS is a fully tractable Schrödinger bridge that connects the ground-truth mel-spectrogram and the clean prior. Whereas diffusion models run a data-to-noise process, Bridge-TTS runs a data-to-data process, so generation starts from a prior that already carries information about the target rather than from pure noise. This leads to better synthesis quality and sampling efficiency.
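
To make the contrast concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the simplest possible reference process (zero drift, constant diffusion coefficient g), in which case a bridge pinned at both ends is an ordinary Brownian bridge. The function names, the toy noising rule in diffusion_forward, and the random arrays standing in for a mel-spectrogram and a text-derived prior are all hypothetical.

```python
import numpy as np

def diffusion_forward(x0, t, rng):
    """Data-to-noise (toy rule): at t=0 return the data, at t=1 pure Gaussian noise."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(1.0 - t) * x0 + np.sqrt(t) * noise

def bridge_marginal(x0, x1, t, g, rng):
    """Data-to-data: sample a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    Assumes zero drift and constant diffusion g, a simplification chosen
    for illustration and not the noise schedule used in the paper.
    """
    mean = (1.0 - t) * x0 + t * x1          # straight-line interpolation
    std = g * np.sqrt(t * (1.0 - t))        # vanishes at both endpoints
    return mean + std * rng.standard_normal(x0.shape)

# Hypothetical stand-ins: an 80-bin "mel-spectrogram" and a nearby "prior".
rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 120))
prior = mel + 0.3 * rng.standard_normal(mel.shape)

noisy_end = diffusion_forward(mel, 1.0, rng)           # unrelated to mel
bridge_end = bridge_marginal(mel, prior, 1.0, 1.0, rng)  # exactly the prior
```

At t = 1 the toy diffusion process has discarded the data entirely, while the bridge lands exactly on the prior, which is the sense in which the endpoint stays informative.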

The team evaluated Bridge-TTS on the LJ-Speech dataset. Compared with previous diffusion-based TTS approaches such as Grad-TTS, Bridge-TTS performed better in both 50-step and 1000-step synthesis. It also outperformed strong fast TTS models in few-step scenarios.

To achieve these results, the researchers built a fully tractable Schrödinger bridge between paired data. The bridge uses a flexible reference stochastic differential equation (SDE), which the team explains theoretically and explores empirically across its design space. They also examined how noise schedule, model parameterization, and sampling technique affect TTS quality, adopting asymmetric noise schedules, data prediction, and first-order bridge samplers.
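
As a rough illustration of how data prediction and a first-order sampler fit together, here is a sketch under the same simplifying assumption of zero drift and a constant diffusion coefficient g (so it does not reproduce the asymmetric schedules mentioned above). The sampler starts at the prior and steps backward in time; at each step a predictor estimates the clean target, and the next state is drawn from the Brownian bridge between that estimate and the current state. The names first_order_bridge_sampler and predict_x0 are hypothetical, and the oracle predictor stands in for a trained network.

```python
import numpy as np

def first_order_bridge_sampler(x1, predict_x0, num_steps, g, rng):
    """Step from the prior x1 (t=1) toward the data end (t=0).

    predict_x0(x, s) is a data-prediction model: it returns an estimate
    of the clean target given the current state x at time s.
    Assumes zero drift and constant diffusion g (illustrative only).
    """
    times = np.linspace(1.0, 0.0, num_steps + 1)
    x = x1.copy()
    for s, t in zip(times[:-1], times[1:]):      # s > t, and s is never 0
        x0_hat = predict_x0(x, s)
        # Brownian bridge between x0_hat (time 0) and x (time s), read at t.
        mean = (t / s) * x + (1.0 - t / s) * x0_hat
        std = g * np.sqrt(t * (s - t) / s)       # zero on the final step
        x = mean + std * rng.standard_normal(x.shape)
    return x

# Hypothetical usage with an oracle predictor in place of a trained network.
rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 120))
prior = mel + 0.3 * rng.standard_normal(mel.shape)
sample = first_order_bridge_sampler(prior, lambda x, s: mel, 4, 0.5, rng)
```

With a perfect predictor the final step returns the target exactly, whatever the number of steps; with a learned predictor, more steps give the model more chances to refine its estimate, which is the trade-off behind the few-step and many-step comparisons.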

Bridge-TTS delivers both strong generation quality and fast inference from a single training session. It outperformed several existing models, including Grad-TTS, FastGrad-TTS, FastSpeech 2, and CoMoSpeech.

This work opens up possibilities for applications such as voice assistants, audiobook narration, and accessibility tools for individuals with speech impairments. Further research in this direction could improve how machines generate human-like speech.