“Cleared to Chicago O’Hare, Compton 3 Foxtrot departure, Squawk 5133.”

“Winds 240 at 10. Visibility 5 miles. Clouds at 5,000 feet. Temperature 14, dewpoint 9. Landing and departing runway 28. Taxiway sierra closed.”

Odds are, you didn’t read that in a monotone. Neither do real air traffic controllers. While controllers have a reputation for remaining calm under pressure, they still adjust their pitch and tone of voice as they issue instructions.

Yet when it comes to virtual air traffic controllers, simulation avatars often speak in robotic voices. Since simulations seek to imitate real-life processes, this is less than ideal.

How is simulation technology used?

One common use of simulation technology is simulation-based learning or training. Simulation training (or virtual reality training) has proven to be a reliable approach across a wide range of industries.

As technology and design continue to improve, these virtual training worlds have become increasingly convincing. With advanced 3D views and detailed graphic displays, users are quickly transported to different worlds.

Yet natural-sounding text-to-speech synthesis has yet to make substantial inroads into simulation and training environments. Lifelike, emotive synthetic voice is key to advancing these experiences for users.

What is the current solution?

The current solution is typically concatenative synthesis. Concatenative synthesis involves recording hundreds of hours of actual human voices speaking in many different styles. After the voices are recorded, short segments are rearranged to form complete sentences. This time-consuming process allows a simulation to work across different scenarios.
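As a rough illustration (not any production system), concatenative synthesis can be thought of as looking up pre-recorded audio units and splicing them together. The `unit_db` dictionary below is a hypothetical stand-in for a database built from hours of studio recordings:

```python
# Toy sketch of concatenative synthesis: "audio" is just a list of
# sample values, and each word maps to a pre-recorded unit.
# unit_db stands in for a database built from hours of recordings.
unit_db = {
    "cleared": [0.1, 0.3, 0.2],
    "to": [0.4, 0.1],
    "land": [0.2, 0.5, 0.3, 0.1],
}

def synthesize(text):
    """Splice recorded units together; fail on anything not in the database."""
    samples = []
    for word in text.lower().split():
        if word not in unit_db:
            # The limitation discussed above: a new phrase requires
            # a voice actor to record a new unit first.
            raise KeyError(f"no recording for {word!r}")
        samples.extend(unit_db[word])
    return samples

audio = synthesize("Cleared to land")
```

The failure branch is the crux: any word outside the recorded inventory simply cannot be spoken until a human records it.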

Why is this an issue?

According to Mark Huckvale of the Department of Phonetics and Linguistics at University College London, “Listeners are sensitive to the intentions of a speaker, perhaps through some coherence in how information is encoded in the signal. We can gauge quite easily whether we are listening to a recorded announcement rather than a person talking solely to us.” [1]

This awareness can detract from the learning and training that would otherwise take place in a simulation. With the current state-of-the-art solution, if a creator wants to add a unique phrase, or a sound that has not been recorded, the voice actor must record the new material and add it to the database. This is time-consuming and inefficient.

Or say a creator wants the simulation to switch from normal dialog to a voice expressing anger or happiness. In this case, an actor would need to record additional segments to reflect these emotions. Even seemingly simple requests such as these require additional data collection and significant preparation.

What is the alternative?

One alternative to concatenative synthesis is parametric text-to-speech (TTS), which involves building a completely computer-generated voice. While this can be more cost-effective, there is a trade-off: because the voice is entirely computer-generated, it often sounds robotic. This is where emotive synthetic voice comes in.
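A minimal sketch of the parametric idea, using only the standard library: instead of stored recordings, a handful of parameters (here just pitch and duration) drive a signal generator. Real parametric systems model far more than this (spectral envelope, voicing, prosody), but the principle is the same, and so is the weakness, since a bare generated tone sounds nothing like a person:

```python
import math

def synthesize_parametric(f0_hz, duration_s, sample_rate=8000):
    """Generate a bare-bones 'voice' entirely from parameters:
    a sine wave at the requested fundamental frequency (pitch)."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * f0_hz * t / sample_rate) for t in range(n)]

# Any pitch or length can be produced on demand, with no recordings needed,
# but the result is obviously synthetic -- hence the robotic sound.
calm = synthesize_parametric(f0_hz=110, duration_s=0.5)
urgent = synthesize_parametric(f0_hz=180, duration_s=0.5)
```

The appeal is clear from the function signature: changing the voice means changing a number, not booking a recording session.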

Since natural sounding avatars are needed to create valuable virtual experiences, training environments can benefit from a solution that provides more realistic voices — voices that are able to express different emotions and vary based on context.

What is the benefit of emotive synthetic voices?

The benefit of emotive synthetic voices is twofold: the synthetic voice has a realistic emotional range, and it is cost-effective to produce.

Emotional range. Genuine voice outputs can advance simulations by building truly innovative communication experiences. By manipulating the emotion and cadence of a synthetic voice, the user experience improves. Whether the voice is happy, serious, or even a whisper, emotive synthetic voices can convey the correct meaning and intent.
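One way to picture “manipulating the emotion and cadence” of a voice: each emotion maps to a set of prosody parameters that a synthesizer would apply when rendering the text. The preset names and values below are illustrative assumptions, not taken from any real TTS engine:

```python
# Hypothetical prosody presets: pitch is scaled relative to a neutral
# baseline, rate relative to normal speaking speed, volume in [0, 1].
EMOTION_PRESETS = {
    "neutral": {"pitch_scale": 1.0,  "rate_scale": 1.0,  "volume": 0.8},
    "happy":   {"pitch_scale": 1.15, "rate_scale": 1.1,  "volume": 0.9},
    "serious": {"pitch_scale": 0.9,  "rate_scale": 0.85, "volume": 0.8},
    "whisper": {"pitch_scale": 1.0,  "rate_scale": 0.9,  "volume": 0.2},
}

def prosody_for(text, emotion):
    """Return the rendering parameters a synthesizer would use for `text`,
    falling back to a neutral delivery for unknown emotions."""
    preset = EMOTION_PRESETS.get(emotion, EMOTION_PRESETS["neutral"])
    return {"text": text, **preset}

plan = prosody_for("Taxiway Sierra closed.", "whisper")
```

Switching from a calm readback to an urgent warning becomes a one-word change in the request rather than a new recording session.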

Cost-effective. With synthetic TTS, there is no need to worry about voice actors and audio banks. Synthetic voices can be generated in real time, making it possible to produce a seamless, cost-effective voice with an emotional range applicable in a wide variety of contexts.

Emotive synthetic voices have untapped potential. If applied correctly, emotive synthetic voice will change the world of simulation.

Do you see value in applying emotive synthetic voice to simulations?