Thursday, June 13, 2024

Meta announces Voicebox, a generative model for multiple voice synthesis tasks


Last week, Meta Platforms’ artificial intelligence research arm introduced Voicebox, a machine learning model that can generate speech from text. What sets Voicebox apart from other text-to-speech models is its ability to perform many tasks that it has not been trained for, including editing, noise removal, and style transfer.

The model was trained using a special method developed by Meta researchers. While Meta has not released Voicebox because of ethical concerns about misuse, the initial results are promising and could power many applications in the future.

‘Flow Matching’

Voicebox is a generative model that can synthesize speech in six languages: English, French, Spanish, German, Polish, and Portuguese. Like large language models, it has been trained on a very general task that can be used for many applications. But while LLMs try to learn the statistical regularities of words and text sequences, Voicebox has been trained to learn the patterns that map voice audio samples to their transcripts.

Such a model can then be applied to many downstream tasks with little or no fine-tuning. “The goal is to build a single model that can perform many text-guided speech generation tasks through in-context learning,” Meta’s researchers write in their paper (PDF) describing the technical details of Voicebox.


The model was trained using Meta’s “Flow Matching” technique, which is more efficient and generalizable than the diffusion-based learning methods used in other generative models. The technique enables Voicebox to “learn from varied speech data without those variations having to be carefully labeled.” Without the need for manual labeling, the researchers were able to train Voicebox on 50,000 hours of speech and transcripts from audiobooks.
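As a rough illustration of the idea, the Python sketch below builds a single training example using the optimal-transport path from the general flow matching literature: a noise sample is blended with a data sample at a time t, and a model would be trained to predict the velocity that carries the blend toward the data. The function names, the SIGMA_MIN value, and the toy vectors are assumptions made for illustration; whether Voicebox uses exactly this form is not stated here.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant, chosen only for illustration


def flow_matching_example(x1, t, x0):
    """Return (x_t, target_velocity) for one conditional flow matching example.

    x1: data sample (for example one frame of audio features), list of floats
    x0: noise sample of the same length
    t:  time in [0, 1]; t=0 is pure noise, t=1 is almost exactly the data
    """
    scale = 1.0 - (1.0 - SIGMA_MIN) * t
    x_t = [scale * n + t * d for n, d in zip(x0, x1)]
    target = [d - (1.0 - SIGMA_MIN) * n for n, d in zip(x0, x1)]
    return x_t, target


def squared_error(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    x1 = [0.5, -1.0, 2.0]                 # toy "data" vector
    x0 = [rng.gauss(0, 1) for _ in x1]    # Gaussian noise
    t = rng.random()
    x_t, target = flow_matching_example(x1, t, x0)
    # A real model would predict the velocity from (x_t, t, context);
    # a constant zero prediction stands in for it here.
    print(squared_error([0.0] * len(target), target))
```

The training target depends only on the noise and data samples, not on any hand-written label, which is consistent with the claim that varied speech data can be used without careful annotation.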

The model uses “text-guided speech infilling” as its training objective, which means it must predict a segment of speech given its surrounding audio and the complete text transcript. In practice, during training the model is given an audio sample and its corresponding text. Parts of the audio are then masked, and the model tries to generate the masked part using the surrounding audio and the transcript as context. By doing this over and over, the model learns to generate natural-sounding speech from text in a generalizable way.
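The masking procedure described above can be sketched in a few lines of Python. This is a simplified illustration, not Meta's implementation: each audio frame is represented by a single float, masked frames are zeroed out, and a plain mean squared error on the masked frames stands in for the actual training loss. All function names are hypothetical.

```python
import random


def mask_span(num_frames, span_fraction, rng):
    """Pick one contiguous span to hide; True marks a masked frame."""
    span_len = max(1, int(num_frames * span_fraction))
    start = rng.randrange(0, num_frames - span_len + 1)
    return [start <= i < start + span_len for i in range(num_frames)]


def build_infilling_example(frames, transcript, mask):
    """Split one utterance into model inputs and the segment to reconstruct."""
    # The model sees the unmasked audio plus the full transcript.
    context = [0.0 if m else f for f, m in zip(frames, mask)]
    # The masked frames are what it has to generate.
    target = [f for f, m in zip(frames, mask) if m]
    return {"context": context, "transcript": transcript,
            "mask": mask, "target": target}


def masked_loss(predicted, frames, mask):
    """Mean squared error computed only over the masked frames."""
    errors = [(p - f) ** 2 for p, f, m in zip(predicted, frames, mask) if m]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    rng = random.Random(0)
    frames = [float(i) for i in range(1, 11)]   # toy 10-frame utterance
    mask = mask_span(len(frames), 0.5, rng)
    example = build_infilling_example(frames, "hello world", mask)
    print(example["mask"])
    print(masked_loss(example["context"], frames, mask))
```

Because the loss only counts the hidden frames, the model is rewarded for using the transcript and the surrounding audio rather than for copying what it can already see.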

Replicating voices across languages, editing out mistakes in speech, and more

Unlike generative models that are trained for a specific application, Voicebox can perform many tasks that it has not been trained for. For example, the model can use a two-second voice sample to generate speech for new text. Meta says this capability can be used to bring speech to people who are unable to speak, or to customize the voices of non-playable game characters and virtual assistants.

Voicebox also performs style transfer in different ways. For example, you can provide the model with two audio and text samples. It will use the first audio sample as a style reference and modify the second to match the voice and tone of the reference. Interestingly, the model can do the same thing across different languages, which could be used to “help people communicate in a natural, authentic way, even if they don’t speak the same languages.”

The model can also do a variety of editing tasks. For example, if a dog barks in the background while you’re recording your voice, you can provide the audio and transcript to Voicebox and mask out the segment with the background noise. The model will use the transcript to generate the missing portion of the audio without the background noise.

The same technique can be used to edit speech. For example, if you have misspoken a word, you can mask that portion of the audio sample and pass it to Voicebox along with a transcript of the edited text. The model will generate the missing part with the new text in a way that matches the surrounding voice and tone.
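To make the editing workflow concrete, here is a minimal Python sketch of how such a request might be assembled. It assumes word-level timestamps are already available from some alignment tool and that audio is handled at a fixed frame rate; both are assumptions, as are the function name and the returned dictionary layout. The sketch also keeps the masked span at its original length, which a real system may not do.

```python
import math


def edit_request(words, word_index, new_word, frames_per_second, num_frames):
    """Build a mask and an edited transcript for replacing one spoken word.

    words: list of (text, start_seconds, end_seconds) tuples, assumed to
           come from a separate forced-alignment step.
    """
    _, start, end = words[word_index]
    first = math.floor(start * frames_per_second)
    last = min(num_frames, math.ceil(end * frames_per_second))
    # True marks the frames the model should regenerate.
    mask = [first <= i < last for i in range(num_frames)]
    transcript = " ".join(
        new_word if i == word_index else text
        for i, (text, _, _) in enumerate(words)
    )
    return {"mask": mask, "transcript": transcript}


if __name__ == "__main__":
    words = [("the", 0.0, 0.25), ("cat", 0.25, 0.5), ("sat", 0.5, 0.75)]
    request = edit_request(words, 1, "dog", frames_per_second=100, num_frames=80)
    print(request["transcript"])
    print(sum(request["mask"]), "frames masked")
```

The same mask-building step would apply to the noise-removal case: the span covering the dog bark is masked, and the transcript is passed through unchanged.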

One of the interesting applications of Voicebox is voice sampling. The model can generate diverse speech samples from a single text sequence. This capability can be used to generate synthetic data to train other speech processing models. “Our results show that speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech, with 1 percent error rate degradation as opposed to 45 to 70 percent degradation with synthetic speech from previous text-to-speech models,” Meta writes.
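For readers unfamiliar with the metric, the short Python sketch below shows how a degradation figure of this kind could be computed, assuming the percentages are relative increases in word error rate over a model trained on real speech. The error rates used are made-up numbers chosen only to reproduce the arithmetic; they are not results from Meta's paper.

```python
def relative_degradation(baseline_wer, synthetic_wer):
    """Relative increase in word error rate, in percent."""
    return (synthetic_wer - baseline_wer) / baseline_wer * 100.0


if __name__ == "__main__":
    baseline = 10.0  # hypothetical WER (%) when training on real speech
    # Hypothetical WERs (%) when training on synthetic speech instead.
    print(relative_degradation(baseline, 10.1))  # about 1 percent worse
    print(relative_degradation(baseline, 14.5))  # about 45 percent worse
```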

Voicebox has limits too. Since it has been trained on audiobook data, it does not transfer well to conversational speech that is casual and contains non-verbal sounds. It also doesn’t provide full control over different attributes of the generated speech, such as voice style, tone, emotion, and acoustic condition. The Meta research team is exploring ways to overcome these limitations in the future.

Model not released

There is growing concern about the threats of AI-generated content. For example, cybercriminals recently tried to scam a woman by calling her and using an AI-generated voice to impersonate her grandson. Advanced speech synthesis systems such as Voicebox could be used for similar purposes or other nefarious deeds, such as creating fake evidence or manipulating real audio.

“As with other powerful new AI innovations, we recognize that this technology brings the potential for misuse and unintended harm,” Meta wrote on its AI blog. Due to these concerns, Meta did not release the model but provided details on the architecture and training process in the technical paper. The paper also contains details about a classifier model that can detect speech and audio generated by Voicebox to mitigate the risks of using the model.
