State-of-the-art Zero-shot Speech Synthesis with VALL-E

published on 19 January 2023

A giant step towards personalised speech applications?

Today, we are taking a look at VALL-E - an innovative neural codec language model for text-to-speech synthesis (TTS) that was recently published by Microsoft.

VALL-E significantly outperforms the state-of-the-art in zero-shot TTS. Impressively, VALL-E is able to generate highly personalised speech with just a 3-second recording of an unseen speaker.

We also made a video about VALL-E - you can watch it here.

Overview of VALL-E

Figure: overview of the VALL-E pipeline.

The key to VALL-E's success is its use of neural codec language models (see the figure). These models are trained using discrete codes derived from an off-the-shelf neural audio codec model.

Neural audio codec models encode audio samples into discrete codes (bottom left of the figure) and decode discrete codes back into audio samples (top right of the figure). Representing speech as discrete codes allows VALL-E to treat text-to-speech synthesis as a conditional language modeling task (predicting acoustic codes conditioned on the phoneme transcription), rather than as a continuous signal regression task.
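To make the discrete-code idea concrete, here is a minimal sketch of turning a waveform into codes and back with an off-the-shelf neural codec. The VALL-E paper uses Meta's EnCodec for this step; the snippet follows the usage documented in the encodec Python package, and the audio file path is a placeholder.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load a pretrained 24 kHz EnCodec model and pick a target bandwidth,
# which determines how many codebooks (quantizers) are used.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Load a speech sample and convert it to the codec's expected format.
wav, sr = torchaudio.load("speech_sample.wav")  # placeholder path
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)  # add a batch dimension

# Encode: the waveform becomes a short sequence of discrete codes.
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([codes for codes, _ in encoded_frames], dim=-1)
print(codes.shape)  # (batch, n_codebooks, n_frames), integer token IDs

# Decode: the same discrete codes are mapped back to a waveform.
with torch.no_grad():
    reconstructed = model.decode(encoded_frames)
```

These integer token IDs are what VALL-E's language model is trained to predict, in place of raw waveforms or spectrograms.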

During pre-training, the VALL-E team used a massive dataset of 60,000 hours of English speech. This allows VALL-E to learn the nuances of human speech and to generate highly personalised speech from just a 3-second recording of an unseen speaker. During inference, given a phoneme sequence and a 3-second enrolled recording, the neural codec language model outputs a sequence of discrete codes, and the audio codec decoder then synthesises the high-quality output speech.
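The inference flow can be summarised in a few lines of Python. This is only a high-level sketch: `phonemizer`, `codec`, and `codec_lm` are placeholder objects we are assuming for illustration, not parts of a released VALL-E implementation.

```python
# High-level sketch of the VALL-E inference flow described above.
# `phonemizer`, `codec`, and `codec_lm` are assumed placeholders,
# not components of any published package.

def synthesise(text: str, prompt_wav, phonemizer, codec, codec_lm):
    """Generate speech in the voice of the 3-second prompt recording."""
    # 1. Convert the input text into a phoneme sequence.
    phonemes = phonemizer(text)

    # 2. Encode the short enrolment recording into discrete acoustic codes;
    #    these serve as the acoustic prompt.
    prompt_codes = codec.encode(prompt_wav)

    # 3. The neural codec language model predicts the acoustic codes of the
    #    target speech, conditioned on the phonemes and the acoustic prompt
    #    (conditional language modeling over discrete codes).
    target_codes = codec_lm.generate(phonemes, prompt_codes)

    # 4. The codec decoder converts the predicted codes back into audio.
    return codec.decode(target_codes)
```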

Results of VALL-E

Experiments on diverse datasets show that VALL-E outperforms the state-of-the-art zero-shot TTS system in terms of both speech naturalness and speaker similarity. VALL-E can even preserve the emotion and acoustic environment of the prompt recording in the synthesised speech. You can listen for yourself at this demo.

One interesting aspect of VALL-E is its in-context learning capability: it adapts to an unseen speaker purely through prompting, with no fine-tuning, much like GPT-3 does for text. This makes VALL-E a natural fit alongside other generative AI models such as GPT-3 and opens up many exciting new applications of speech synthesis, such as speech editing and content creation. We are certainly excited to see how VALL-E will be used!

To sum it up: VALL-E is a significant advance in the field of TTS. Its ability to synthesise personalised speech in a zero-shot setting, to support speech editing, and more is made possible by its large and diverse training data and its unique architecture. With this new approach, the team has paved the way for further advances in audio synthesis.

If you are looking to start a project in NLP using Large Language Models and prompting methods, and would like advice from a team of PhD-level experts who have been working in the area for 5+ years, get in touch with us at The Global NLP Lab. Looking forward to talking to you!
