What is Real? Voice Cloning Now a Reality

By Donald McLaughlin Published on November 19, 2019

During the 2016 election, The Washington Post made public the infamous recording of Donald Trump’s crass remark about grabbing women, caught on a hot mic during an Access Hollywood interview. No one, including Trump, ever tried to deny that it was actually him saying that back in 2005. We had the audio recording and we knew it was him.

Now let us suppose that another audio recording surfaces, this time with President Trump saying something damaging that he never actually said. Sure, it’s his voice, and it’s recorded. Only, it’s not him, and he never said it. Well, technically it is his voice, but saying something he never said. Welcome to the wonderful world of voice cloning (VC).

In the same way that digital technology can be used to alter photos, VC can be used to create new speech in someone’s actual voice. All that’s needed is a five-second sample of the target speaker, and you can make them say anything!

Creating Speech

… Researchers … are now claiming that they can clone a person’s voice using just a 5-second clip. They explain that this can be done because they have trained a neural network, what we often call artificial intelligence or machine learning, on hours and hours of a wide variety of speakers so that it understands how we speak and then it can take a 5 second clip from an individual it has not heard before and clone a voice and get them to say things that were not in the clip.


The researchers from Cornell University call this “Text-To-Speech Synthesis.” The idea is simple. First, using the tools of artificial intelligence, “train” the computer to recognize human speech and speech patterns from thousands of speakers. Second, have it generate a “mel spectrogram,” a map of how much energy the speech carries in each frequency range from moment to moment, for the words you want spoken. Third, convert that spectrogram into an audio waveform in the individual speaker’s voice. Once the computer is trained, all you need is a 5-second sample of any speaker, and from that you can make them say anything you wish.
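For readers curious what a “mel spectrogram” actually looks like as data, here is a minimal sketch in plain Python. It is not the researchers’ code. It only computes a mel spectrogram from a synthetic 440 Hz test tone, whereas a voice cloning system would use a trained neural network to predict one from text. The function names, the sample rate, and the frame and band sizes are all assumptions chosen for illustration.

```python
import cmath
import math


def hz_to_mel(hz):
    # Common (HTK-style) formula mapping frequency in Hz to the mel scale.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    # Apply a Hann window, then a naive DFT (slow, but dependency-free).
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        spectrum.append(abs(total) ** 2)
    return spectrum


def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters whose centers are evenly spaced on the mel scale.
    max_mel = hz_to_mel(sample_rate / 2)
    points_hz = [mel_to_hz(max_mel * i / (n_mels + 1))
                 for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    filters = []
    for m in range(1, n_mels + 1):
        lo, center, hi = points_hz[m - 1], points_hz[m], points_hz[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= center:
                row.append((f - lo) / (center - lo))
            elif center < f < hi:
                row.append((hi - f) / (hi - center))
            else:
                row.append(0.0)
        filters.append(row)
    return filters


def mel_spectrogram(samples, sample_rate, n_fft=128, hop=64, n_mels=10):
    # One row per time frame, one column per mel band.
    filters = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        power = power_spectrum(samples[start:start + n_fft])
        frames.append([sum(w * p for w, p in zip(row, power))
                       for row in filters])
    return frames


SAMPLE_RATE = 8000  # assumed; real speech systems typically use higher rates
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(800)]
spec = mel_spectrogram(tone, SAMPLE_RATE)
loudest_band = max(range(len(spec[0])), key=lambda m: spec[0][m])
print(len(spec), "frames x", len(spec[0]), "mel bands; loudest band:", loudest_band)
```

Each row of `spec` is a snapshot in time and each column is a band of frequencies, spaced the way human hearing spaces them. The third step described above would turn a grid like this back into sound.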

The Cornell team has provided astonishing examples of how this sounds. You can listen to some of them in this two-minute video explaining the research. They provide actual recordings of 5-second samples from 5 different speakers, and then they create multiple phrases and sentences, completely synthesized, that sound remarkably like the actual speaker. Go ahead, I’ll wait here while you check it out.

Do you find the idea that anyone could potentially be made to say anything a bit scary? Certainly the implications for court cases relying on audio recording evidence get murky. In the age of “fake news,” things can go to a whole new level. So, yeah, it is a bit scary. Granted, the technology hasn’t advanced to the point where trained listeners and researchers couldn’t tell the false from the genuine, but how far away can the day be when that becomes problematic?

Some Constructive Applications

As with most technologies, it’s not all just the dark side. There are some legitimate, and even exciting, possibilities for how such technology could be used. Suppose you’re the post-production team for a hit TV show. You’re on a tight deadline to finish up all the audio editing for broadcast. You discover a small but noticeable issue with one of the lines spoken by an actress. Not to worry: just apply the VC algorithms, type in the correct line, and no one will be the wiser.


Or, imagine being a songwriter and having a library of actual famous singers’ vocals. You want to hear your new song sung by Celine Dion. Celine has allowed samples of her vocals to be made available for audio productions, samples you can buy legitimately. Type in your lyrics, play your melody and just like that, Celine is singing your new hit song! This isn’t as far-fetched as it sounds. A company called East-West Quantum Leap has something similar for choirs. East-West’s libraries are used by composers the world over for soundtracks, movie trailers, TV shows and other audio productions. Imagine doing the same thing with individual singers.

The brave new world of technology offers many good things, but always with the capacity to be employed with ill intent. How far are we from no longer being able to tell the real from the fake? Perhaps the movie The Matrix wasn’t so far off to ask “what is real?”
