Your User’s Text is Like a Click

Let’s say you have a simple application, capable of executing four functions:

- check_weather(): Show the current weather and tomorrow’s forecast.

- check_calendar(): Show today’s appointments.

- call_mom(): Call your mother.

- tell_joke(): Tell a corny joke.

Imagine writing a GUI for this application from scratch. The user moves the cursor and clicks. On each click, you get a pair of numbers, representing the cursor’s position. So, each user action gives you a vector of two real values. Your application knows the bounding-boxes of its four buttons. For each vector, you check whether the point falls within the bounds of one of your buttons. If it does, you trigger the button-press animation, and execute the appropriate function.
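
In code, that hit-test might look something like the sketch below. The button names come from our four functions, but the bounding boxes are invented purely for illustration:

```python
# A minimal hit-test sketch. Bounding boxes are (x_min, y_min, x_max, y_max)
# in normalised screen coordinates, and are invented for illustration.
BUTTONS = {
    "check_weather": (0.0, 0.5, 0.5, 1.0),
    "check_calendar": (0.5, 0.5, 1.0, 1.0),
    "call_mom": (0.0, 0.0, 0.5, 0.5),
    "tell_joke": (0.5, 0.0, 1.0, 0.5),
}

def handle_click(x, y):
    """Resolve a click at (x, y) to a button, if any."""
    for name, (x_min, y_min, x_max, y_max) in BUTTONS.items():
        if x_min <= x <= x_max and y_min <= y <= y_max:
            # Here a real GUI would play the button-press animation
            # and call the corresponding function.
            return name
    return None  # the click missed every button
```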

To write a LUI from scratch, we must take the user’s text and resolve it to a vector of numbers. We must figure out “where” the user has “clicked”. Typically, we map each word to an arbitrary ID, so if we recognise a 5,000-word vocabulary, we can regard each word as a distinct point in a 5,000-dimensional space. We then reduce this space to a denser space of, say, 300 dimensions. This takes us much of the way to resolving the text to a manageable meaning vector.
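
Sketched in numpy, the pipeline looks roughly like this. The tiny vocabulary and the random projection are stand-ins: a real system would have thousands of entries and learned or pre-trained embeddings, and averaging is just one common way to pool word vectors into a single meaning vector:

```python
import numpy as np

# A stand-in vocabulary; a real system might have ~5,000 entries.
VOCAB = {"check": 0, "weather": 1, "show": 2, "calendar": 3, "call": 4, "mom": 5}
# Dense word vectors: one 300-dimensional row per word. Random here, where a
# real system would use learned or pre-trained embeddings.
EMBEDDINGS = np.random.randn(len(VOCAB), 300)

def text_to_vector(text):
    """Map the user's text to a single dense "meaning" vector."""
    ids = [VOCAB[w] for w in text.lower().split() if w in VOCAB]
    if not ids:
        return np.zeros(300)  # nothing recognised: the "click" missed entirely
    return EMBEDDINGS[ids].mean(axis=0)  # average the recognised words' vectors
```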

To illustrate this, let’s recognise a vocabulary of a few words, and assign a real value to each word. To map the user’s text to two dimensions, we’ll take the first word we recognise as the x coordinate, and the last word we recognise as the y coordinate.
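
A minimal version of that function might look like this. The per-word values are chosen purely so that the example outputs below come out as shown:

```python
# Each recognised word maps to a single real value; unknown words are ignored.
# The values are chosen only to reproduce the example outputs below.
LEXICON = {
    "check": 0.3, "weather": 0.3,
    "show": 0.1, "calendar": 0.7,
    "say": 0.1, "funny": 0.1,
    "call": 0.9, "mom": 0.9,
}

def get_coords(text):
    """Map the user's text to an (x, y) point: first and last recognised words."""
    words = [w for w in text.lower().split() if w in LEXICON]
    return (LEXICON[words[0]], LEXICON[words[-1]])
```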

The simple function above lets us represent the “meaning” of the user’s text as a pair of real values:

get_coords(“check the weather”) → (0.3, 0.3)

get_coords(“show my calendar”) → (0.1, 0.7)

get_coords(“say something funny”) → (0.1, 0.1)

get_coords(“call mom”) → (0.9, 0.9)

We can complete the analogy between the LUI and the GUI by plotting out these values, and proposing some boundaries for our “buttons” — our application’s actions.
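
Continuing the sketch, and reusing get_coords from above, the “buttons” become rectangular regions in meaning space. The boundaries here are invented for illustration:

```python
# Rectangular "button" regions in meaning space: (x_min, y_min, x_max, y_max).
# The boundaries are invented for illustration.
ACTIONS = {
    "tell_joke": (0.0, 0.0, 0.2, 0.2),
    "check_weather": (0.2, 0.2, 0.4, 0.4),
    "check_calendar": (0.0, 0.6, 0.2, 0.8),
    "call_mom": (0.8, 0.8, 1.0, 1.0),
}

def resolve(text):
    """Find which action's region the user's text "clicks" inside, if any."""
    x, y = get_coords(text)
    for action, (x_min, y_min, x_max, y_max) in ACTIONS.items():
        if x_min <= x <= x_max and y_min <= y <= y_max:
            return action
    return None  # the utterance didn't land on any "button"
```

So resolve(“check the weather”) lands in the check_weather region, just as a click inside that button’s bounds would trigger it in the GUI.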

With a GUI, there’s no trouble determining the coordinates of the click, and you never have to think about resolving the click event to a particular button, if any. That stuff just happens — it’s taken care of for you. With a LUI, you have to pay attention to these details. You can do a better or worse job at this, but the “gold standard” — the holy grail of all your machine learning efforts — will only ever give you something you’ve been taking for granted all along in a GUI. Of course, you have a vastly bigger, multi-dimensional canvas on which the user can “click”, and each click can give you richly structured data. But you still have to paint buttons, forms, navigation menus, etc. onto this canvas. You’re still wiring a UI to some fixed underlying set of capabilities.

Consider a dialog like this:

- Hi! How can I help you today?

- I’m looking for car insurance.

- Are you an existing policy-holder?

- No this is my first car

Let’s say that under the hood, the user’s final utterance triggers the function car_insurance.non_holder.tell(), which prints a wall of text. The LUI here gives the user a hierarchical menu, whose options are determined by the underlying domain. In GUI design, the problems posed by nested menus are well known, and it’s easy to imagine analogous problems for a LUI.
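
One way to picture that hierarchical menu is as a nested mapping from dialog choices to handlers. Only the car_insurance.non_holder.tell path comes from the dialog above; the sibling branch is invented for illustration:

```python
# A nested "menu" of handlers, mirroring the dialog's choices.
# Only the car_insurance -> non_holder -> tell path comes from the example;
# the sibling branch is invented for illustration.
INTENT_TREE = {
    "car_insurance": {
        "non_holder": {"tell": lambda: print("...wall of text for first-time buyers...")},
        "holder": {"tell": lambda: print("...wall of text for existing policy-holders...")},
    },
}

def trigger(path):
    """Walk the tree along the classified path and call the leaf handler."""
    node = INTENT_TREE
    for step in path:
        node = node[step]
    node()

# The dialog above ends by triggering:
# trigger(("car_insurance", "non_holder", "tell"))
```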

If you’re looking at the top of a nested menu, how do you know what the leaves of the tree are? And if you know you need a particular leaf, how do you reliably guess how to navigate to it? The LUI prompts give you more text, so the context might sometimes be clearer. On the other hand, the range of options available is not always enumerated, and your intent might be misclassified.

My point here is that a linguistic user interface (LUI) is just an interface. Your application still needs a conceptual model, and you definitely still need to communicate that conceptual model to your users. So, ask yourself: if this application had a GUI, what would that GUI look like?

A GUI version of Siri would probably give you a home-screen with pages of form elements, one for each of Siri’s sub-applications. There would also be a long list of buttons, to trigger Siri’s atomic “easter egg” functions. Note that the GUI to Siri would not simply be the home-screen of iOS. If that were true, then Siri would be mapping your utterance to a sequence of touch events and user inputs.

When you say, “Tell my mother I love her”, Siri executes the command sms(“my mother”, “I love her”). It definitely doesn’t execute a sequence of user actions that “pilots” your iPhone the way you do. Trying to do that would be insane. Siri is just an app, with its own conceptual model of actions you’re likely to want to perform. It exposes those actions to you via a LUI rather than a GUI.