September 15, 2014

Current voice interfaces are horrible. They try to imitate intelligence and fail so badly that they make computers even more frustrating to use. Most people have no idea how to use things like Siri. They ask it perfectly reasonable questions like, “Who invented the lightbulb?” and it responds with junk from Google. The answer to that particular question is not straightforward–at least three people, independently, invented different types of lightbulbs–and computers are really bad at deciphering and communicating absolute answers from complex information like that. Because they are still so frustratingly limited, Siri and Google Now are simply not yet ready to exist. They are not artificially intelligent by any stretch of the imagination.

That being said, current voice recognition technology is incredibly good at certain things. It’s great at detecting and transcribing words, listening for specific commands, and making matches against expected inputs. So why does literally no software take advantage of voice technology in the way it works best? For example, it boggles my mind that I cannot do the following things:

When I’m inputting my home address in a web browser (on mobile or desktop), I should be able to tap the “State” dropdown and just say “California” and have it select that option for me.

When I highlight the browser address bar, I should be able to just say “The Economist” and have it automatically find the address in my favorites and go there.

When I open the home screen on my phone, I should be able to just say “Instagram” and have that app open.

On iOS, when I get a notification that covers the top of the screen, I should be able to just say “ignore” and have the notification instantly disappear.

When I click the “To” field in a mail app or in Gmail, I should be able to just say a person’s name and have it fill in automatically (and maybe show me a dropdown to select which email address to send to).

All of these possible cases have one important thing in common–something specific has to happen before the voice control works. I have to tap a dropdown and then say “California”. I have to have just received a text message to say “ignore” and have it disappear. The reason current voice interfaces suck is that they force the speaker to consciously enter a “voice” mode and then create context around the action they want the computer to perform. This makes no sense; the computer should just always be listening for potential commands within the context of whatever the user is doing. Siri has no context, which makes it very difficult for the computer to accurately predict what the user wants to do. It also makes it very difficult for a user to know when using Siri would be helpful. But within a specific context, there are only a few possibilities, which makes voice control far easier to use and much easier for a developer to build.
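
To make the "only a few possibilities" point concrete, here is a rough sketch of what context-limited matching might look like. It assumes some speech recognizer has already turned the audio into a transcript; the function name `match_in_context`, the option lists, and the 0.75 cutoff are all made up for illustration, and the fuzzy matching is just Python's standard `difflib`, not anything Siri or Google Now actually does.

```python
from difflib import get_close_matches
from typing import Optional, Sequence


def match_in_context(
    transcript: str,
    options: Sequence[str],
    cutoff: float = 0.75,  # hypothetical threshold; would need tuning
) -> Optional[str]:
    """Return the option that best matches the transcript, or None.

    `options` is whatever the current UI context allows: the entries
    of a dropdown, the installed app names, the actions available on
    a notification, and so on.
    """
    # Map lowercase forms back to the original spelling of each option.
    normalized = {option.lower(): option for option in options}
    heard = transcript.strip().lower()

    # Exact match first: the common case when recognition is good.
    if heard in normalized:
        return normalized[heard]

    # Otherwise tolerate small transcription errors, but only against
    # the handful of inputs this context expects.
    close = get_close_matches(heard, list(normalized), n=1, cutoff=cutoff)
    return normalized[close[0]] if close else None


# Hypothetical contexts: a "State" dropdown and a notification banner.
STATE_DROPDOWN = ["California", "Colorado", "Connecticut"]
NOTIFICATION_ACTIONS = ["ignore", "reply", "open"]
```

With the dropdown focused, `match_in_context("california", STATE_DROPDOWN)` selects "California", while saying "ignore" in that same context matches nothing and can simply be dropped. The same word means something only when a notification is on screen, which is the whole point: the context does most of the work, so the matching itself can stay this simple.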

In general, voice interfaces would work great for things that are easy to vocally describe but hard to do on input devices like touch screens, keyboards, and mice. Like selecting “California” from a dropdown list. Or paging through icons to find an app to open. I only listed five possible use cases above, but I think they would dramatically improve the computing experience. Imagine what other things could be done.
