Meta, formerly known as Facebook, has unveiled a tool that could usher in a new era for voice assistants. Voicebox AI, Meta’s latest innovation, is a generative model that produces speech from text input, potentially making voice assistants more intelligent and efficient. While the company has not yet released the program or its source code, the technology has the potential to reshape voice technology.
Voicebox AI operates on a similar principle to ChatGPT and DALL-E, but instead of generating text or images, it generates speech. The system is trained on a vast dataset comprising 50,000 hours of unfiltered recorded speech and the accompanying transcripts, drawn from publicly available audiobooks in English, French, Spanish, German, Polish, and Portuguese.
This diverse dataset allows Voicebox AI to produce “more conversational speech,” bridging language gaps and facilitating smoother interactions between users and voice assistants. The company asserts that speech recognition models trained on synthetic speech generated by Voicebox AI perform nearly as well as those trained on real speech.
One notable achievement is that Voicebox AI outperforms Microsoft’s VALL-E in text-to-speech conversion. It does better on both intelligibility, with a 1.9% word error rate compared to VALL-E’s 5.9% (lower is better), and audio similarity, with a score of 0.681 versus VALL-E’s 0.580 (higher is better). It achieves these results while being up to 20 times faster.
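For readers unfamiliar with the intelligibility metric, word error rate is typically measured by transcribing the synthesized audio with a speech recognizer and counting the word substitutions, insertions, and deletions needed to turn that transcript back into the original text, divided by the number of words in the original. The short Python sketch below illustrates the standard calculation. It is not Meta’s evaluation code; the function name and the sample sentences are made up for illustration, and the lowercase-and-split normalization is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name), not Meta's evaluation code.
    """
    # Assumption: lowercase and split on whitespace; real evaluations
    # usually apply more careful text normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one substituted word out of six.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"{wer:.1%}")  # prints 16.7%
```

A lower value means the recognizer recovered more of the intended words, which is why a smaller word error rate indicates more intelligible speech.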
Voicebox AI offers various valuable features, including the capacity to edit audio, remove noise, and correct mispronunciations. Users can pinpoint distorted segments of speech, trim them, and instruct the model to rectify those segments.
The development methodology behind Voicebox AI is noteworthy. Meta employs a novel technique known as Flow Matching for training speech synthesis from scratch, promising further advancements in this field.
Despite the significant breakthrough, Meta has chosen not to release the Voicebox program or its source code to the public. The company cites concerns about potential misuse as the reason behind this decision.
Researchers behind the project envision various applications for this technology in the future. These include prosthetics for patients with damaged vocal cords, more natural voices for in-game NPCs (non-player characters), and improved digital assistants.
It’s worth noting that Meta has taken both open and guarded approaches to AI technology. It released its LLaMA AI language model to the AI research community, but then encountered issues with unauthorized downloads and distribution. Meta has also introduced SAM, an AI image segmentation model that can identify specific objects in images or videos based on user cues, and has separately published open-source code and a dataset for its Animated Drawings AI project.
In essence, Meta’s Voicebox AI represents a significant leap in the field of voice technology, promising more natural and efficient interactions with voice assistants. While it remains in the company’s vault for now, its potential applications are vast and may reshape the future of voice-driven AI.