For Voice User Interfaces (VUIs) to have any chance of success, the future direction of Voice User Experience (VUX) will be strongly tied to physical, not software, constraints. Three requirements stand out: 1) at least 100 words per minute (wpm) of input, 2) close to 200 wpm of output, and 3) under 250 ms response time. We are nowhere close.

Voice User Experience

We have just updated the Sentieo Skill on the Alexa Skill Store, where it now ranks among the best Finance skills (higher if you strip out all the bitcoin noise). We thought we might share a few thoughts on our experience redesigning the Sentieo product, first created for Desktop and then Mobile, for a radically different interface.

With the linguistic abstraction infrastructure finally in place to separate voice software engineering (executing specific intents with data and integrations) from language processing (a natural monopoly: parsing general human speech into specific intents and vice versa), and with supportive hardware as a bonus, there will undoubtedly be a wealth of development, with the ecosystem benefits deservedly accruing to Amazon. Ours was roughly the 2,000th skill to hit the Alexa store, three months after it crossed the 1,000 mark.

This, together with the recent attention on chatbots, has predictably prompted all sorts of manic speculation, including “the Death of the GUI”, but that discussion is premature until key issues in the development of the Voice User Interface are addressed. Simply put, apart from simple hands-free convenience, we haven’t figured out where the VUI absolutely dominates. You see this every time your bank gives you the option to “speak to a human representative”.

In this very real sense, the VUI is a solution in search of a problem. We aren’t even very good at the solution yet: we are terrible at transcribing accents and abbreviations; context management and intent disambiguation are a mess; input mappings are naturally many-to-one while output tends toward one-to-one; and we haven’t even tried our hand at “nontextual” verbal data like voice recognition (multiple speakers), sarcasm, humor, and tone. And let’s not even talk about privacy issues in practical implementations.

There is a common implicit assumption that those problems go away as research and infrastructure in language processing improve, but in fact the meta-problems endemic to voice software engineering are perhaps even harder to solve because they run into physical “laws”. Even if we do the voice equivalent of assuming a spherical frictionless cow, and assume every utterance is perfectly translated into intent, there are still terminally intractable problems with the field of voice that, for want of a better term, we will call UI efficiency (although there are formal definitions of this).
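To make the efficiency argument concrete, here is a rough back-of-envelope comparison of how long a fixed amount of information takes to move through different input/output channels. The channel rates below are our own illustrative assumptions (figures commonly cited in HCI discussions), not numbers from this post; only the 100/200 wpm targets come from the opening paragraph.

```python
# Back-of-envelope "UI efficiency" sketch. Rates are illustrative
# assumptions, not measurements; the VUI targets (100 wpm in,
# 200 wpm out) come from the opening paragraph of this post.
RATES_WPM = {
    "typing (average user)": 40,      # assumed
    "speaking (today's input)": 150,  # assumed
    "silent reading (GUI output)": 250,  # assumed
    "text-to-speech (today's output)": 150,  # assumed
    "VUI input target": 100,
    "VUI output target": 200,
}

def seconds_to_convey(words: int, wpm: float) -> float:
    """Time in seconds to push `words` through a channel at `wpm`."""
    return words / wpm * 60

# How long does a 50-word exchange take over each channel?
for channel, wpm in RATES_WPM.items():
    print(f"{channel:>30}: {seconds_to_convey(50, wpm):5.1f} s")
```

The point of the arithmetic: a 50-word answer read silently off a screen takes roughly 12 seconds, while the same answer spoken aloud takes around 20, and no amount of better language modeling changes the speed at which humans speak and listen.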