Technology developed by Princeton University computer scientists may do for audio recordings of the human voice what word processing software did for the written word and Adobe Photoshop did for images.

“VoCo” software, still in the research stage, makes it easy to add or replace a word in an audio recording of a human voice by simply editing a text transcript of the recording. New words are automatically synthesized in the speaker’s voice — even if they don’t appear anywhere else in the recording.

The system uses a sophisticated algorithm to learn and recreate the sound of a particular voice. It could one day make editing podcasts and narration in videos much easier, or in the future, create personalized robotic voices that sound natural, according to co-developer Adam Finkelstein, a professor of computer science at Princeton. Or people who have lost their voices due to injury or disease might be able to recreate their voices through a robotic system, but one that sounds natural.

An earlier version of VoCo was announced in November 2016. A paper describing the current VoCo development will be published in the July issue of the journal Transactions on Graphics (an open-access preprint is available).

How it works (technical description)

VoCo’s user interface looks similar to other audio editing software such as the podcast editing program Audacity, with a waveform of the audio track and cut, copy and paste tools for editing. But VoCo also augments the waveform with a text transcript of the track and allows the user to replace or insert new words that don’t already exist in the track by simply typing in the transcript. When the user types the new word, VoCo updates the audio track, automatically synthesizing the new word by stitching together snippets of audio from elsewhere in the narration. VoCo is is based on an optimization algorithm that searches the voice recording and chooses the best possible combinations of phonemes (partial word sounds) to build new words in the user’s voice. To do this, it needs to find the individual phonemes and sequences of them that stitch together without abrupt transitions. It also needs to be fitted into the existing sentence so that the new word blends in seamlessly. Words are pronounced with different emphasis and intonation depending on where they fall in a sentence, so context is important. For clues about this context, VoCo looks to an audio track of the sentence that is automatically synthesized in artificial voice from the text transcript — one that sounds robotic to human ears. This recording is used as a point of reference in building the new word. VoCo then matches the pieces of sound from the real human voice recording to match the word in the synthesized track — a technique known as “voice conversion,” which inspired the project name, VoCo. In case the synthesized word isn’t quite right, VoCo offers users several versions of the word to choose from. The system also provides an advanced editor to modify pitch and duration, allowing expert users to further polish the track. To test how effective their system was a producing authentic sounding edits, the researchers asked people to listen to a set of audio tracks, some of which had been edited with VoCo and other that were completely natural. The fully automated versions were mistaken for real recordings more than 60 percent of the time. The Princeton researchers are currently refining the VoCo algorithm to improve the system’s ability to integrate synthesized words more smoothly into audio tracks. They are also working to expand the system’s capabilities to create longer phrases or even entire sentences synthesized from a narrator’s voice.

Fake news videos?



A key use for VoCo might be in intelligent personal assistants like Apple’s Siri, Google Assistant, Amazon’s Alexa, and Microsoft’s Cortana, or for using movie actors’ voices from old films in new ones, Finkelstein suggests.

But there are obvious concerns about fraud. It might even be possible to create a convincing fake video. Video clips with different facial expressions and lip movements (using Disney Research’s FaceDirector, for example) could be edited in and matched to associated fake words and other audio (such as background noise and talking), along with green screen to create fake backgrounds.

With billions of people now getting their news online and unfiltered, augmented-reality coming, and hacking way out of control, things may get even weirder. …

Zeyu Jin, a Princeton graduate student advised by Finkelstein, will present the work at the Association for Computing Machinery SIGGRAPH conference in July. The work at Princeton was funded by the Project X Fund, which provides seed funding to engineers for pursuing speculative projects. The Princeton researchers collaborated with scientists Gautham Mysore, Stephen DiVerdi, and Jingwan Lu at Adobe Research. Adobe has not announced availability of a commercial version of VoCo, or plans to integrate VoCo into Adobe Premiere Pro (or FaceDirector).



Abstract of VoCo: Text-based Insertion and Replacement in Audio Narration

Editing audio narration using conventional software typically involves many painstaking low-level manipulations. Some state of the art systems allow the editor to work in a text transcript of the narration, and perform select, cut, copy and paste operations directly in the transcript; these operations are then automatically applied to the waveform in a straightforward manner. However, an obvious gap in the text-based interface is the ability to type new words not appearing in the transcript, for example inserting a new word for emphasis or replacing a misspoken word. While high-quality voice synthesizers exist today, the challenge is to synthesize the new word in a voice that matches the rest of the narration. This paper presents a system that can synthesize a new word or short phrase such that it blends seamlessly in the context of the existing narration. Our approach is to use a text to speech synthesizer to say the word in a generic voice, and then use voice conversion to convert it into a voice that matches the narration. Offering a range of degrees of control to the editor, our interface supports fully automatic synthesis, selection among a candidate set of alternative pronunciations, fine control over edit placements and pitch profiles, and even guidance by the editors own voice. The paper presents studies showing that the output of our method is preferred over baseline methods and often indistinguishable from the original voice.