Acting is what voice talents should do even when they read a boring e-learning course. They should sound committed, interested, and as if they truly believe and understand what they are saying. When they act, putting heart and mind into the text, voice actors show a winning edge over TTS (text-to-speech) technology: as intelligent, trained speakers, they are the ones who can add meaning and emotion to the lines they read.

TTS technology has made impressive strides in recent years thanks to more sophisticated algorithms and a continuous boost in computing power. Feed lines into TTS software and you can get remarkably human-like speech, called a synthetic voice, that can actually get the message across. So the old question comes up again… Will the machine replace the human? We are not there yet, but don’t underestimate the future possibilities of this technology.

More and more clients are asking studios to supply “machine voices” instead of “human, natural voices” for simple IVR, phone prompts, and voices on vending machines and toys, because “they sound OK” and they are getting really cheap. In fact, once you get familiar with the recent advances of this technology, you will realise the potential of TTS.

The modus operandi of TTS consists of recording hundreds of hours of varied speech with an actor; the machine then decomposes the sentences into phonemes and processes them through a complex pipeline.

This is how TTS works, according to Acapela, one of the majors of this industry. Yes, some passages are laughable or even scary… so it is worth a listen.
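As a caricature, the decompose-and-concatenate step described above can be sketched in a few lines of Python. Everything here is invented for illustration: the mini pronunciation dictionary, the unit names, and the string "audio" stand in for the huge dictionaries and recorded audio units a real system uses.

```python
# Toy sketch of concatenative TTS: look up each word's phonemes in a
# (hypothetical) pronunciation dictionary, then stitch together the
# pre-recorded audio unit for each phoneme.

# Invented mini pronunciation dictionary; real systems use tens of
# thousands of entries plus letter-to-sound rules for unknown words.
PHONEME_DICT = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def text_to_phonemes(text):
    """Decompose a sentence into a flat phoneme sequence."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(PHONEME_DICT.get(word, ["?"]))  # "?" = unknown word
    return phonemes

def synthesize(phonemes, unit_bank):
    """Concatenate the recorded unit for each phoneme (strings stand in for audio)."""
    return "+".join(unit_bank.get(p, "silence") for p in phonemes)

# Pretend each phoneme maps to one recorded audio unit from the actor.
unit_bank = {p: f"unit_{p}" for word in PHONEME_DICT.values() for p in word}

phonemes = text_to_phonemes("hello world")
audio = synthesize(phonemes, unit_bank)
```

What this toy cannot do is exactly the article's point: it strings units together with no notion of context, so nothing tells it where to rise, stress, or slow down.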

Well, there is a missing element in TTS that makes the difference. It’s called prosody, which broadly speaking means the speech information that is related to context: pitch, pace, stress, duration, amplitude, and even voice gestures.

Linguists say that prosody is actually “a parallel channel for communication, carrying some information that cannot be simply deduced from the lexical channel. All para-linguistic information contained in prosody are transmitted by muscle motions, and in most of them, the recipient can perceive, fairly directly, the motions of the speaker.”

So by acting as the script requires, we voice talents shouldn’t be afraid of that intelligent monster called TTS. Prosody is also the key human ingredient that gives real life to a text, through things that are not necessarily vocal, such as hand gestures, eyebrow movements and facial expressions.

TTS bumps into a big problem with certain sentences, namely questions. In most Western European languages, a question is marked by a higher pitch at some point, usually on the last words: those words are extended and the pitch rises a bit. But how about Russian? There, the pitch does not rise at the end; a question is identified instead by a strong stress on a single key word.
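The contrast can be sketched as a toy pitch-contour function. The rules are drastically simplified to exactly what the paragraph above states (rising pitch on the final word vs stress on one key word); real prosody models are far richer.

```python
def question_contour(words, language, stressed_word=None):
    """Assign a toy pitch label to each word of a question.

    "english" style: pitch rises on the final word(s).
    "russian" style: a strong stress lands on one key word instead.
    Hugely simplified; for illustration only.
    """
    if language == "english":
        # Rise on the last word, neutral pitch before it.
        return ["high" if i >= len(words) - 1 else "mid"
                for i, _ in enumerate(words)]
    if language == "russian":
        # No final rise; mark the single stressed key word.
        return ["stressed" if w == stressed_word else "mid" for w in words]
    raise ValueError("unsupported language in this toy model")

english = question_contour(["are", "you", "coming"], "english")
russian = question_contour(["ty", "idyosh", "domoy"], "russian",
                           stressed_word="idyosh")
```

A TTS engine that only knows the "rise at the end" rule will read the Russian question with the wrong melody, which is the kind of context a human speaker handles without thinking.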

At the voice-over studio PrimeVoices, they have tested the available TTS technologies, producing different VOs with different speech-synthesis products. The results were not convincing: the tests failed to provide audio articulate and clear enough for clients. After attending three editions of the Mobile World Congress in Barcelona, I also found that the available software is not making real progress. Major European carriers such as France Telecom (Orange), Telecom Italia and Telefónica have outsourced this solution, only to realise the limits of TTS. As a result, the operators are no longer investing in this technology. Only Google seems to be involved right now.

Why is that? What happened to put TTS progress on hold? Developers realised that to have something usable and commercially viable you need both man and machine. As with machine translation, you need a human to tweak, correct and improve. You build a speech memory that can be used automatically once there is a critical mass (hundreds of hours of recordings), but in the end they still have to call the voice talent regularly to supply missing words or expressions the machine can’t reproduce properly. For an operator, producer, studio or project manager, the savings made by calling the voice talent less often (you have to call that voice anyway) are wasted on post-production costs. Currently the cost is around 0.15 USD per word, which equals the rate of a voice talent for non-commercial reading.
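The back-of-the-envelope arithmetic makes the point concrete. Only the 0.15 USD per word figure comes from the text; the script length and the assumption that a human reads non-commercial copy at the same per-word rate are illustrative.

```python
# Cost comparison for a hypothetical 1,000-word script.
words = 1000
tts_postproduction_per_word = 0.15  # USD, figure cited in the article
human_rate_per_word = 0.15          # USD, assumed equal for non-commercial reading

tts_cost = words * tts_postproduction_per_word
human_cost = words * human_rate_per_word
# The two totals come out the same, so the "savings" of TTS vanish
# into post-production.
```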

TTS also has a big problem with products like e-learning. Despite tweaking to make the robot sound articulate, the resulting flow of words is monotonous, which works against keeping the audience’s attention. After a minute or two you drift off and stop following the training.

So no worries yet, we still have a chance. But algorithms are fast learners, and practice makes perfect for a machine too, so one day you might find yourself working side by side with a TTS engine churning out speech in your own voice. Studios will still need you to feed the machine, especially with complicated words, brand names, foreign names, and that feeling and emotion that only a human can give.

Meanwhile, dare to act as well as you can, in the right dose, because you shouldn’t overact either. By acting with natural prosody you will certainly beat the machine.

What do you think? Do you expect that the machine will take over some parts of the VO industry?