On EDN’s Sensors Design Center, we spend a lot of time looking out for new and interesting sensing technologies and approaches. Here’s a good one: VocalZoom is an Israeli startup that has invented a means of optically converting human voice to digital signals that paradoxically get more accurate in the presence of loud ambient noise.

This makes it ideal for emergency services communication, as well as its initial target applications of consumer automotive, headsets, smartphones, security, and just about any voice-recognition application you can think of.

The problem with current voice-based human-to-machine communication (HMC) systems is that they’re optimized for humans, not machines. They use acoustic microphones that detect all sounds, so we waste precious power and time on fancy noise-cancellation algorithms trying to filter out background noise, while optimizing for natural, pleasant sound reproduction that is intelligible to humans.

For phone conversations, this approach is functional, as humans have an advantage: we can use context and experience to almost subconsciously fill in missed words and phrases in a high-noise environment. Machines, on the other hand, need to distinguish every single word and phrase in order to act as intended. Errors due to background noise are not tolerated and the machine either performs the wrong function or asks for a repeat of the instruction.

While humans may have a leg up on machines, intelligibility suffers and hit rates drop for both as soon as background noise increases, leading to either frustrating phone conversations or mistaken voice commands (Figure 1).

Figure 1 In a moving vehicle with windows open and speakers on, voice-command hit rates typically drop to 0%. VocalZoom claims its HMC sensor can maintain a hit rate of over 90% in that same environment.

“Everyone wants voice [recognition, command, and control] but the key challenge is background noise and the unpredictability of the environment,” said Rammy Bahalul, vice president of sales and business development at VocalZoom. He pointed out that while voice-recognition software can be trained for accents and other speech patterns, “it can’t be trained to background noise.”

To completely isolate the spoken word from the environment, VocalZoom turned to a low-cost, low-power implementation of the principles of interferometry for its HMC sensor. It uses a laser to measure low-level vibrations on the surface of the face or behind the ear that are a direct result of the spoken word.

Typically costing in the thousands of dollars, or millions in the case of military systems, interferometers detect vibrations down to the nanometer over ranges of up to a mile or more by detecting phase differentials between a source and a reflected wave. Classic “spy” applications include eavesdropping on conversations by measuring window vibration.

To bring interferometry down to more affordable and consumer-friendly levels, the team sacrificed distance, bringing it to 1 meter, and used a Class 1 user-safe VCSEL laser that can be directed at the face to detect the vibration. The vibration modulates the phase of the reflected beam and algorithms embedded in a custom ASIC are used to give the final output via I2S interface (Figure 2).

Figure 2 The HMC sensor uses a simpler, proprietary, interferometry technique and a Class 1 VCSEL laser that can be directed at the face. The vibration modulates the phase of the reflected beam and algorithms embedded in a custom ASIC are used to give the final output via I2S interface.
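VocalZoom's on-chip algorithms are proprietary, but the underlying principle can be illustrated with the textbook quadrature-demodulation scheme used in laser interferometry: surface displacement shifts the round-trip optical phase by 4π/λ per meter, and an arctangent of the in-phase and quadrature detector outputs, followed by phase unwrapping, recovers the displacement. The sketch below simulates this under assumed values (an 850 nm VCSEL wavelength, a 200 Hz facial vibration of 100 nm amplitude, ideal noiseless detectors):

```python
import numpy as np

# Sketch of quadrature interferometric demodulation. The wavelength,
# vibration parameters, and ideal I/Q detector model are assumptions
# for illustration; VocalZoom's actual ASIC algorithm is proprietary.
fs = 48_000                     # sample rate, Hz
t = np.arange(0, 0.01, 1 / fs)  # 10 ms of signal
wavelength = 850e-9             # assumed VCSEL wavelength, m

# Facial-surface displacement: a 200 Hz voice-band vibration, 100 nm amplitude
x = 100e-9 * np.sin(2 * np.pi * 200 * t)

# Round-trip optical phase: each meter of displacement adds 4*pi/lambda rad
theta = 4 * np.pi * x / wavelength

# Ideal quadrature detector outputs (in-phase and quadrature components)
i_sig = np.cos(theta)
q_sig = np.sin(theta)

# Demodulate: arctangent recovers the wrapped phase, unwrapping removes
# 2*pi jumps, and rescaling converts phase back to displacement
recovered = np.unwrap(np.arctan2(q_sig, i_sig)) * wavelength / (4 * np.pi)

print(np.max(np.abs(recovered - x)))  # residual error, essentially zero
```

Because only the laser spot on the talker's face contributes to the phase, airborne sounds never enter this signal path, which is why the article's "barking dogs and sirens" simply do not appear in the output.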

The spoken word is literally derived straight from the face: ambient sounds such as barking dogs, other voices, cars, and sirens are not even detected.

Paradoxically, as ambient noise increases, the accuracy actually gets better. In a typical voice-recognition system, the hit rate in a quiet environment is 80%, for an error rate of 20%. However, once the system is brought into the street, the hit rate can drop to 60% for words, and worse for sentences. According to Bahalul, VocalZoom’s technology can keep the hit rate at 90 to 97%.
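The "worse for sentences" point follows from simple probability: under the simplifying assumption that word errors are independent, a sentence is recognized only if every word in it is, so sentence-level accuracy decays geometrically with length. A quick sketch makes the gap between a 60% and a 97% word hit rate vivid:

```python
# Sentence-level hit rate under an assumed independent-word-error model.
# This is an illustrative simplification, not VocalZoom's measurement method.
def sentence_hit_rate(word_hit_rate: float, n_words: int) -> float:
    """Probability an n-word sentence is recognized with no word errors."""
    return word_hit_rate ** n_words

for p in (0.97, 0.90, 0.80, 0.60):
    print(f"word hit rate {p:.0%}: "
          f"5-word sentence -> {sentence_hit_rate(p, 5):.1%}")
```

At a 60% word hit rate, a five-word command succeeds well under one time in ten, while at 97% it still succeeds about 86% of the time, which is why even modest per-word gains matter so much for command-and-control.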

Playing in VocalZoom’s favor is the Lombard effect, whereby we reflexively increase our vocal effort as ambient noise increases, thereby increasing the facial vibration. This provides, in effect, a higher signal-to-noise ratio as the background sounds are still not detected, but the user’s facial vibration level has increased.

This has huge implications for emergency services applications where sirens, fire, crumbling structures, competing conversations, and other noises can drown out the spoken word.

As Bahalul relates it, one potential customer put VocalZoom’s technology in an acoustic room with 120 dB SPL (sound pressure level) noise to compare it to an acoustic microphone. “The acoustic got saturated, our optical sensor got voice very clearly,” he said.

The advantages of the system go beyond better mobile phone voice conversations and more accurate and consistently responsive voice command and control of machines (Figure 3 ). It can also be used for proximity detection and to measure heart rate. Also, because of the peculiarities of each person’s voice and corresponding facial vibrations, it can also be used for biometric security purposes.



Figure 3 The applications of the VocalZoom technology go far beyond voice recognition, command, and control into proximity sensing and biometrics.

“This will change the way people talk to machines,” said Bahalul, emphasizing the primary application. However, he also noted that it could replace between $10 and $20 worth of sensor components on a smartphone, including proximity detection, speech recognition, and biometrics – with inherent “proof of life” features – while also providing better noise reduction and saving power through voice triggering.

The sensor itself consumes power in the milliwatt range, said Bahalul, and costs in “the single dollars.” The laser is around $1 and the ASIC is sub-$1. The first prototype systems should be ready in Q3 and Bahalul expects first products to be shipping in early 2017.

The company is working with most voice-recognition software systems and headset manufacturers, and is also working on a car mirror integration approach and with MEMS manufacturers who are interested in combining VocalZoom’s technology with classic acoustic audio.

Short term, the company expects to increase its range to 2 m, while incorporating multiple lasers to sample various facial surfaces to optimize performance in cases where the user may have their face partially covered by a beard or scarf, for example.
