Real-time machine translation of spoken languages used to be relegated to the realm of science fiction. However, advances in wearables—and "hearables"—are making such devices a reality. What are the bottlenecks and stumbling blocks for an effective "universal translator"?

One of the most basic animal traits is communication. Humans have developed this skill to a high level, transferring complex information quickly and fairly efficiently through spoken language. The problem is that we don’t all speak the same language, which creates a solid barrier to communication.

Science fiction writers solved the problem of communicating with interstellar species in a number of “wave your hand at the problem” ways. "Star Trek" has its “universal translator,” which relies on reading brain waves to understand the other creature’s speech. Perhaps the most intriguing solution of all was created by Douglas Adams in his novel "The Hitchhiker’s Guide to the Galaxy." The “Babel fish” is a symbiotic creature that you stick inside your ear, whereupon it translates all other languages.

Wouldn’t it be great if you could stick something in your ear—though perhaps not a living creature—that would translate all languages for you in real time?

The rise of digital “hearables”

The digital electronics revolution has finally brought us close to a real-life Babel fish. Advances in digital signal processing, beam-steering microphone arrays, and wireless communications have led to the development of in-ear devices that do more than just help people with impaired hearing. You can answer your phone, measure your heart rate, and listen to music. Researchers have even created a headset that gives visually impaired users verbal instructions to navigate through buildings and other complex settings.

But can these digital devices translate spoken conversation in real time? The answer is, sort of.

In order to understand the problem of wearable (or even just mobile) devices for language translation, it helps to break down the required process into discrete tasks. (See Figure 1.)

Figure 1: Tasks required for spoken language translation
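The four tasks in Figure 1 can be sketched as a simple pipeline. This is a minimal illustration, not a real implementation: every function body here is a hypothetical stub (the noise threshold, the fixed transcription, and the toy dictionary are all invented for the example), standing in for the DSP, speech-recognition, and translation engines a real device would use.

```python
def acquire_sound(raw_audio: list[float]) -> list[float]:
    """Task 1: isolate the speaker, e.g. by suppressing low-level ambient noise."""
    threshold = 0.1  # assumed noise floor, purely illustrative
    return [s for s in raw_audio if abs(s) > threshold]

def speech_to_text(audio: list[float]) -> str:
    """Task 2: pattern-match the audio to words (stubbed with a fixed phrase)."""
    return "it is raining cats and dogs"

def translate(text: str, source: str, target: str) -> str:
    """Task 3: convert the text between languages (stubbed with a toy lookup)."""
    toy_dictionary = {
        ("en", "es"): {"it is raining cats and dogs": "está lloviendo a cántaros"},
    }
    return toy_dictionary[(source, target)].get(text, text)

def text_to_speech(text: str) -> bytes:
    """Task 4: synthesize audio for playback (stubbed as encoded bytes)."""
    return text.encode("utf-8")

def translate_speech(raw_audio: list[float], source: str = "en", target: str = "es") -> bytes:
    """Chain the four tasks of Figure 1 end to end."""
    audio = acquire_sound(raw_audio)
    text = speech_to_text(audio)
    translated = translate(text, source, target)
    return text_to_speech(translated)
```

Note that each stage feeds the next, which is exactly why latency accumulates: the playback at the end cannot begin until every earlier task has finished.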

Acquiring the sound is much more than simply recording a sound sample. The system needs to focus on the sounds of the subject speaking, which means that ambient noise must be reduced or eliminated, including the spoken words of anyone else in the conversation. Microphones located in the ear may be well suited to picking up speech from someone in front of the subject, as augmented hearing applications require, but that is not an optimal position for picking up the subject's own speech. As a result, some devices rely on a headset with a boom microphone, or on a handheld device held near the subject's mouth.

The speech-to-text step is the most complex and demanding task in the chain. It must isolate the sounds of the subject's speech and then use pattern matching to determine the words that were spoken. This isn't easy, because in natural speech the words run together. If you used speech-to-text dictation software 20 years ago, you know those systems worked best with "discrete speech," meaning you had to pause. Briefly. Between. Each. Word. They also required endless training sessions to learn your individual speech patterns. Newer technology, in contrast, can separate words that are strung together in natural speech.
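The "discrete speech" constraint of those older systems can be illustrated with a toy segmenter: if the speaker pauses between words, word boundaries are trivially found at the silences. The audio representation here (amplitude samples, with 0.0 meaning silence) is an assumption made for the sake of the sketch; the hard problem modern recognizers solve is doing this when there are no silences to split on.

```python
def split_on_silence(samples: list[float], silence: float = 0.0) -> list[list[float]]:
    """Return runs of non-silent samples, one run per 'word'."""
    words: list[list[float]] = []
    current: list[float] = []
    for s in samples:
        if s == silence:
            # a pause ends the current word, if any
            if current:
                words.append(current)
                current = []
        else:
            current.append(s)
    if current:
        words.append(current)
    return words
```

With pauses, the split is unambiguous; remove them and this approach finds one giant "word," which is why continuous speech recognition needs statistical pattern matching instead.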


Most systems do make the speech recognition task a bit easier by requiring that you specify the source language. This eliminates the difficult step of trying to identify the language on top of translating it.

Once the system converts the speech into text, it is a comparatively simple problem to convert the words from one language to another. It is not a trivial task, however, because spoken language is often incomplete and filled with colloquial usage. The translation process needs to fill in the gaps and recognize when “raining cats and dogs” is not a literal statement.
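One way to picture why translation is more than word substitution is to imagine an idiom table consulted before any word-by-word lookup. Everything here is invented for illustration (the idiom entries, the tiny dictionary, the function name); real translation engines use far more sophisticated statistical and neural methods, but the ordering principle (recognize the idiom first, then fall back to literal translation) is the point.

```python
# Hypothetical English-to-Spanish tables, purely illustrative.
IDIOMS = {"raining cats and dogs": "lloviendo a cántaros"}
WORDS = {"it": "eso", "is": "está", "raining": "lloviendo"}

def translate_en_to_es(phrase: str) -> str:
    """Replace known idioms first, then translate remaining words literally."""
    for idiom, rendering in IDIOMS.items():
        if idiom in phrase:
            phrase = phrase.replace(idiom, rendering)
    # unknown words pass through unchanged rather than failing
    return " ".join(WORDS.get(word, word) for word in phrase.split())
```

Without the idiom pass, "raining cats and dogs" would come out as a literal (and baffling) statement about falling animals.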

Finally, the translated text must be converted to sound so that the listener can hear the translation spoken in the target language. Again, this is relatively easy (relatively!). Getting cadence and inflection right matters, however, because it makes the translation less tiring to listen to.

What could possibly go wrong?

While it’s fairly simple to lay out these four steps, execution is much more difficult. It’s a bit like saying, “To get to the moon, build a space capsule, put it on a rocket, and launch it.” There are lots of bottlenecks and challenges in creating a digital Babel fish that fits in your ear.

First and foremost, not everyone's ears are the same shape. Fit is an important factor, especially if you're going to wear these devices for hours at a time for business or travel. Some hearing devices rely on custom-fitted earpieces to create a comfortable and effective fit, but this boosts the cost of the device significantly. Others offer earpieces in a range of sizes, leaving users to select the best-fitting option.

Speaking of “hours at a time,” that raises another design bottleneck: battery life. An in-ear device is by definition a miniature product. There is not a lot of room for batteries or other forms of energy storage. Traditional hearing aids are essentially simple amplifiers, so their tiny batteries last for a long time. Trying to handle all the digital processing tasks listed in Figure 1 requires a lot more power, making it a challenge for a device to come up with a useful battery life. As a result, many devices rely on handheld form factors that can support larger batteries.

Another design approach is a wireless Bluetooth connection to a smartphone, which has a larger battery. The system can then offload some processing tasks from the earpiece to the phone. The trade-off, of course, is that this can significantly shorten the time the smartphone lasts between charges.

Another bottleneck is what I call the "half-duplex problem." If you've ever watched an old classic movie about aircraft pilots, you'll remember that they had to say "Over" at the end of each transmission so the party at the other end knew it was their turn to talk. The radio system allowed two-way communication, but only in one direction at a time. This is an archaic concept for us; we're used to phones that let both parties speak at the same time (whether or not they can actually hear each other).

All "real-time" language translators so far are half-duplex systems. You specify the source and target languages, and then you speak. The system translates the speech and plays it back in the target language. Only then can the other party respond, and the process repeats in reverse. This may be fine for asking directions to the nearest bus stop, but it is limiting if you need to negotiate a business deal or enjoy a dinner conversation.
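The half-duplex exchange can be sketched as a loop that fully translates and plays back one utterance before the direction of translation reverses. The `translate` stub here is a hypothetical stand-in for the real engine; what the sketch captures is the strict turn-taking.

```python
def translate(text: str, source: str, target: str) -> str:
    """Placeholder for a real translation engine; just labels the direction."""
    return f"[{source}->{target}] {text}"

def half_duplex_conversation(utterances: list[str],
                             lang_a: str = "en",
                             lang_b: str = "fr") -> list[str]:
    """Each utterance is fully translated and played before the turn swaps."""
    source, target = lang_a, lang_b
    played = []
    for utterance in utterances:
        played.append(translate(utterance, source, target))
        source, target = target, source  # swap direction: the other party's turn
    return played
```

Nothing in the loop allows the two parties to speak simultaneously; a full-duplex translator would need to run two of these pipelines at once while keeping the audio streams separated.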

This leads to perhaps the biggest bottleneck of all: the delay in data transfer, or latency. You may be most familiar with the effect when you watch a remote interview on television. The host asks a question, which is followed by a pause while the remote guest listens to the delayed sound, and finally the guest starts to answer. This delay may be less than a second, but it is disruptive and definitely degrades the conversation quality.

Most of these hearable or mobile translation devices suffer from significant latency problems. Part of it comes from the digital sound processing, but the biggest culprit is the speech-to-text portion. As mentioned above, this task typically relies on machine learning engines and other AI techniques to process the data. Often, multiple AI systems work independently and then “vote” on what they collectively think is the “right” answer for the conversion. Given the enormous datasets required for most languages, a lot of complex data processing must be handled to achieve this.
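The "voting" idea can be illustrated with a simple majority vote over candidate transcriptions. The engine outputs below are invented, and real systems weight candidates by confidence scores rather than counting them, but the sketch shows the basic mechanism.

```python
from collections import Counter

def vote(candidates: list[str]) -> str:
    """Return the transcription proposed by the most engines."""
    return Counter(candidates).most_common(1)[0][0]

# Hypothetical outputs from three independent recognition engines:
engines = [
    "where is the bus stop",
    "where is the bus stop",
    "wear is the bus top",
]
best = vote(engines)
```

Running several engines and reconciling their answers improves accuracy, but it also multiplies the processing work, which is part of why this step dominates the latency budget.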

Even with the miracles of miniaturized digital electronics, all of these complex steps are an enormous task to give to something that is small enough to stick in your ear. Most smartphones bog down trying to perform this conversion!

As a result, many systems rely on transmitting the sound data up to the cloud (through a smartphone connection) where it is processed by AI systems. Then the converted text is translated and transmitted back down again. This takes time, and it is not uncommon for current hearable and handheld translators to have one- or two-second delays before they start to play the translation.
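A back-of-the-envelope budget shows how a cloud round trip adds up to the one- to two-second delays mentioned above. Every number here is an illustrative assumption, not a measurement of any particular product.

```python
# Assumed per-stage delays for one cloud-assisted translation, in milliseconds.
LATENCY_MS = {
    "capture_and_dsp": 100,        # on-device sound processing
    "uplink_to_cloud": 150,        # phone to cloud over 4G or Wi-Fi
    "speech_to_text": 800,         # the dominant cost, per the discussion above
    "translation": 200,
    "downlink_and_playback": 150,
}

total_ms = sum(LATENCY_MS.values())
print(f"total delay: {total_ms / 1000:.1f} s")
```

Under these assumptions the total lands at 1.4 seconds, squarely in the one- to two-second range, and the speech-to-text stage alone accounts for more than half of it.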

Two seconds may not sound like much. But try it for yourself: hold a conversation with someone and insert a two-second pause between the moment one person stops talking and the other starts. It feels a lot longer than you might expect. Being able to translate words does not automatically enable a natural conversation.

When you rely on the cloud to handle the heavy lifting of speech-to-text conversion, it’s difficult to avoid the latency problem. Replicating that processing power and data storage even in a handheld device—let alone something in your ear—is a massive and potentially expensive challenge.

Are we there yet?

With all those technical challenges, you might be surprised to learn that there are products already on the market that do a pretty good job of translating spoken languages, with list prices ranging from just over $100 to almost $800.

Perhaps one of the best-known products in this segment is Google Pixel Buds. These wireless earbuds do most of the things you’d expect of wireless buds, but they also provide real-time language translation. This only works with a Google Pixel or Pixel 2 smartphone, and the feature relies on the free Google Translate app. To use the system, you hand the phone to the other person. That person speaks into the phone using their language, and you hear the translation in the earbuds. For the other direction, your speech is picked up by the buds’ mics, and it is transmitted to the phone for translation and playback through the phone’s speaker.

Dash Pro wireless earphones from Bragi also have a speech translation feature. It relies on the free iTranslate app to do the translating, which means that the earphones must be paired with a smartphone (iOS or Android). Note that iTranslate does some limited translating when offline.

The Pilot from Waverly Labs is another wireless earbud solution. It relies on a free app (iOS or Android) that uses a cloud service to process the translation. To hold a conversation, you give one of your earpieces to the other party and you each run the app on your own phones. (Some people may be put off a bit by the idea of sharing something as personal as an earbud.)

Some devices rely on a different design. For example, the TRAGL has a boom mic attached to an earpiece, with a front-facing speaker that plays the translation of your speech to the other party in the conversation. (No sharing of hardware required.) It captures the person's speech and plays back the translation through your earpiece. This system also relies on a smartphone app (iOS and Android) and cloud support, but the device does provide limited support for offline translation of some languages. The product is currently offered through an Indiegogo campaign with shipments scheduled before the end of the year.

Other products skipped the wireless earbud design and rely instead on a dedicated handheld device. One successful product is the Travis, which has sold more than 110,000 units. The company announced a new version of the device called the Travis Touch, which has an improved interface and larger touchscreen. According to the company, the system relies on 16 different translation engines at once to get the best results. It requires a high-speed Internet connection. The Touch is also the subject of an Indiegogo campaign and is scheduled to ship in the fall.

Two other handheld devices looking for crowdfunding support are IU and Mesay 2.0. The IU relies on a separate app running on a smartphone; it is much like a small Bluetooth speaker with a microphone and some buttons for the user interface. It is scheduled to ship at the end of the year. The Mesay has an LCD screen that displays the original and translated text and can even be used as a Wi-Fi hotspot. It is already shipping.

Can you hear me now?

Do we have a digital Babel fish yet? Sadly, none of these products is as proficient as the science fiction benchmark, but we're making great progress. 4G and Wi-Fi connectivity provide fast access to the cloud resources that most of these products require. Until we can pack a lot more processing power into a hearable device, you can expect these products to rely on smartphone support and a connection to online resources.

And we have a way to go before you can expect to have a "normal" conversation using real-time speech translation. The pauses and the awkwardness of taking turns are not natural, but they are a small price to pay compared with the time required to learn a new language. These products may not be ideal, but they can already deliver significant value to travelers and businesspeople who need to make themselves understood to others who speak a different language.

A translator in your ear: Lessons for leaders