Meta’s new artificial intelligence does for speech generation what DALL-E does for image generation.
Meta has unveiled a new generative artificial intelligence tool called Voicebox, which can perform a range of tasks such as converting text to speech, removing noise, and editing audio. One of the model’s notable features is that it can take an audio sample in one language and reproduce the speaker’s style in a foreign language. In short, Voicebox aims to do for speech what ChatGPT and DALL-E do for text and image generation.
Voicebox is a text-to-speech model that Meta describes as “a non-autoregressive flow-matching model for context- and text-conditioned audio infilling.” The model was trained on more than 50,000 hours of recorded speech; Meta specifically used audiobook audio in English, French, Spanish, German, Polish, and Portuguese.
Among the model’s most important capabilities is cross-lingual style transfer: carrying a speaker’s style from one language into another. To use this feature, you give Voicebox a two-second sample of your voice along with a passage of text in English, French, Spanish, German, Polish, or Portuguese, and ask the AI to read the text aloud in one of those languages. The company says its model can render virtually any text from one of these languages into another while preserving the speaker’s style in the target language.
What other capabilities does Meta’s Voicebox AI model have?
The wide range of training data helps the system produce speech that sounds more conversational. “Our results show that speech recognition models trained on voices created with Voicebox perform nearly as well as models trained on real voices,” says Meta. The Voicebox-generated voices caused only about a 1 percent degradation in error rate, compared to 45 to 70 percent for speech from other text-to-speech (TTS) models.
The Voicebox model can edit recordings, remove noise from conversations, and even correct mispronounced words. Meta’s researchers say that, for example, a user can identify which part of an audio file is noisy and then ask the AI to regenerate that segment.
The Voicebox model does not need a large volume of labeled input data, thanks to Meta’s new training method called “Flow Matching.” Benchmark results show that it substantially outperforms the best existing text-to-speech systems on word error rate (1.9% versus 5.9%) and is up to 20 times faster.
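The article does not spell out how Flow Matching works, but the core training idea behind flow-matching models in general can be sketched briefly: interpolate between a noise sample and a data point at a random time, and train a model to predict the constant velocity of that straight-line path. The snippet below is a minimal illustrative sketch under those general assumptions, not Meta’s actual implementation; the toy vector standing in for an audio feature frame and the function name are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(x1, rng):
    """Sample one training example for (conditional) flow matching:
    interpolate between a noise sample x0 and a data point x1 at a
    random time t, and return the interpolant plus the target velocity."""
    x0 = rng.standard_normal(x1.shape)   # noise sample
    t = rng.uniform()                    # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1         # straight-line interpolant
    v_target = x1 - x0                   # constant velocity of that path
    return t, xt, v_target

# Toy "data point" standing in for an audio feature frame
x1 = rng.standard_normal(4)
t, xt, v = flow_matching_pair(x1, rng)
# A neural network would be trained to regress v from (xt, t) plus
# conditioning inputs (in Voicebox's case, the audio context and text).
```

At generation time, such a model starts from noise and integrates the learned velocity field from t = 0 to t = 1 to produce a sample, which is what makes the approach non-autoregressive and fast.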
However, neither the Voicebox model nor its source code is publicly available. Meta has acknowledged that, because of the potential risks, it does not intend to release the model to the general public. For now, it has published only a preliminary research paper on the model, but it hopes the technology will eventually help people with vocal cord problems, power NPCs in games, and improve voice assistants.