
My voice is no longer my password

Text-to-speech model can preserve speaker's emotional tone and acoustic environment.

Benj Edwards - Jan 9, 2023 10:15 pm UTC

An AI-generated image of a person's silhouette.


Ars Technica

On Thursday, Microsoft researchers announced a new text-to-speech AI model called VALL-E that can closely simulate a person's voice when given a three-second audio sample. Once it learns a specific voice, VALL-E can synthesize audio of that person saying anything—and do it in a way that attempts to preserve the speaker's emotional tone.

Its creators speculate that VALL-E could be used for high-quality text-to-speech applications; for speech editing, where a recording of a person could be altered from a text transcript (making them say something they originally didn't); and for audio content creation when combined with other generative AI models like GPT-3.

Microsoft calls VALL-E a "neural codec language model," and it builds on a technology called EnCodec, which Meta announced in October 2022. Unlike other text-to-speech methods that typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. It essentially analyzes how a person sounds, breaks that information into discrete components (called "tokens") using EnCodec, and uses its training data to match what it "knows" about how that voice would sound if it spoke phrases outside of the three-second sample. Or, as Microsoft puts it in the VALL-E paper:

To synthesize personalized speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively. Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder.
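The process described above can be sketched in a few lines of code. This is a minimal, purely illustrative toy (Microsoft has not released VALL-E's code, and every name and number below is an assumption, not the real model or EnCodec API): a language model over discrete codec tokens is conditioned on the enrollment clip's tokens and the target phonemes, then autoregressively emits new tokens that a codec decoder would turn into audio. A cheap arithmetic stand-in replaces the real Transformer's next-token distribution so the sketch runs without trained weights.

```python
import random

CODEC_VOCAB_SIZE = 1024  # assumed codebook size, for illustration only

def generate_acoustic_tokens(prompt_tokens, phonemes, n_new_tokens, seed=0):
    """Autoregressively sample codec tokens conditioned on a prompt.

    In the real model the next-token distribution comes from a trained
    Transformer; here a deterministic arithmetic "signal" stands in for
    the logits so the sketch is runnable without any weights.
    """
    rng = random.Random(seed)
    context = list(prompt_tokens)
    # Crude stand-in for phoneme conditioning: reduce the phoneme prompt
    # to a single number that biases every sampling step.
    cond = sum(ord(c) for p in phonemes for c in p)
    output = []
    for _ in range(n_new_tokens):
        # Mix recent context and the phoneme signal with randomness.
        signal = (sum(context[-4:]) * 31 + cond) % CODEC_VOCAB_SIZE
        next_token = (signal + rng.randrange(CODEC_VOCAB_SIZE)) % CODEC_VOCAB_SIZE
        output.append(next_token)
        context.append(next_token)
    return output

# The 3-second enrollment clip, already encoded to discrete codec tokens.
enrolled = [17, 802, 455, 91]
phonemes = ["HH", "EH", "L", "OW"]  # phonemized target text "hello"
tokens = generate_acoustic_tokens(enrolled, phonemes, n_new_tokens=8, seed=42)
# In the real system, `tokens` would go to the neural codec decoder
# to produce the final waveform.
```

The key structural point the sketch captures is that both the speaker identity (via the enrolled tokens) and the content (via the phonemes) shape every generated token, which is why three seconds of audio is enough to steer the voice.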

Microsoft trained VALL-E's speech synthesis capabilities on an audio library, assembled by Meta, called LibriLight. It contains 60,000 hours of English language speech from more than 7,000 speakers, mostly pulled from LibriVox public domain audiobooks. For VALL-E to generate a good result, the voice in the three-second sample must closely match a voice in the training data.

On the VALL-E example website, Microsoft provides dozens of audio examples of the AI model in action. Among the samples, the "Speaker Prompt" is the three-second audio provided to VALL-E that it must imitate. The "Ground Truth" is a pre-existing recording of that same speaker saying a particular phrase for comparison purposes (sort of like the "control" in the experiment). The "Baseline" is an example of synthesis provided by a conventional text-to-speech synthesis method, and the "VALL-E" sample is the output from the VALL-E model.

A block diagram of VALL-E provided by Microsoft researchers.



While generating those results, the researchers fed only the three-second "Speaker Prompt" sample and a text string (what they wanted the voice to say) into VALL-E. So compare the "Ground Truth" sample to the "VALL-E" sample. In some cases, the two are very close. Some VALL-E results sound computer-generated, but others could potentially be mistaken for human speech, which is the goal of the model.

In addition to preserving a speaker's vocal timbre and emotional tone, VALL-E can also imitate the "acoustic environment" of the sample audio. For example, if the sample came from a telephone call, the audio output will simulate the acoustic and frequency properties of a telephone call in its synthesized output (that's a fancy way of saying it will sound like a telephone call, too). And Microsoft's samples (in the "Synthesis of Diversity" section) demonstrate that VALL-E can generate variations in voice tone by changing the random seed used in the generation process.
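The seed-driven diversity mentioned above is a general property of sampling-based generators, and can be shown with a toy snippet (the function name and vocabulary size here are illustrative assumptions, not the VALL-E API): the seed fixes the random sampling path, so the same seed reproduces the same token sequence, while a different seed almost certainly yields a different "take" on the same text and speaker prompt.

```python
import random

def sample_tokens(seed, length=6, vocab=1024):
    """Sample a sequence of codec-style tokens under a fixed seed."""
    rng = random.Random(seed)
    return [rng.randrange(vocab) for _ in range(length)]

take_a = sample_tokens(seed=1)
take_b = sample_tokens(seed=2)        # a different seed: a different take
take_a_again = sample_tokens(seed=1)  # same seed: identical to take_a
```

This is why Microsoft's "Synthesis of Diversity" samples sound like different recordings of the same speaker rather than identical copies: each seed picks a different but equally valid path through the model's output distribution.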

Perhaps owing to VALL-E's potential to fuel mischief and deception, Microsoft has not provided VALL-E code for others to experiment with, so we could not test VALL-E's capabilities. The researchers seem aware of the potential social harm this technology could bring. In the paper's conclusion, they write:

"Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models."
