But as an engineer and product person, the Echo really whets my appetite because you can build apps for it. Well, they’re called “Skills”.

Amazon have quietly been building a solid library of third-party skills. There are now over 200, including new additions from Uber, Spotify and Domino’s. And they’re clearly taking their new ecosystem seriously: on the developer/platform side there’s a new VP and a $100m Alexa Fund. On the consumer side there’s, well, a Super Bowl ad.

So to learn more, I built a Skill.

Here are some observations from my experience:

Getting Started

Building voice interfaces is no easy task, let alone building a framework that handles arbitrary commands for third-party apps, but the Alexa Skills Kit (as the programming interface is called) is an impressive bit of software.

As a developer you specify the ‘intents’ your Skill supports (think of these like controllers in a Rails app or Activities in an Android app), then specify the various phrases people might use to invoke that intent. You also specify any variables that you expect as part of the incantation. These can be standard types (dates, numbers, places), enums you specify, or arbitrary literals (not recommended, but sometimes necessary). Your code then gets passed clean structured data to act upon.
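To make that concrete, here’s a rough sketch in Python of what your endpoint receives once Alexa has matched an utterance to an intent. The intent name (“TubeStatusIntent”) and the “Line” slot are hypothetical examples of mine, and the JSON is trimmed down from what the Skills Kit actually sends:

```python
import json

# A trimmed-down Skill request: by the time it reaches your code,
# Alexa has already resolved the utterance into an intent plus slots.
request = json.loads("""
{
  "request": {
    "type": "IntentRequest",
    "intent": {
      "name": "TubeStatusIntent",
      "slots": {
        "Line": { "name": "Line", "value": "Victoria" }
      }
    }
  }
}
""")

def handle(event):
    """Dispatch on the intent name and act on the clean slot data."""
    intent = event["request"]["intent"]
    slots = {name: s.get("value") for name, s in intent["slots"].items()}
    if intent["name"] == "TubeStatusIntent":
        # A real Skill would query a live status API here.
        return f"There are no delays on the {slots['Line']} line."
    return "Sorry, I didn't understand that."

print(handle(request))  # → There are no delays on the Victoria line.
```

The nice part is that all the messy speech-to-structure work happens before your code runs; you only ever see the resolved intent and slot values.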

This programming model is flexible enough to make most things possible, but there are a few limitations. It’d be great to have a programmatic way of updating the enum values of custom slot types — either via an API, or by having the Skills Kit read and cache the values from JSON served at a URL. I’d also like to see an expanded list of built-in types: it currently only supports US cities, for example.

These nits aside, it’s clear a lot of work has gone into the ASK, and it’s super-easy to build pretty complex voice-driven interfaces really quickly.

Needs better support for asynchronous tasks

Not everything happens in an instant. Today, when you ask Alexa something, she can only reply with one block of speech. This works great if the Skill you’re interacting with has the answers ready in an instant, but that’s not always the case.

Imagine your Skill calls an API which takes 5 seconds to respond — not all that uncommon for complex operations. There’ll be an awkward 5-second pause after you pose the question before you hear a response. Granted, you know something’s happening as the Echo’s blue lights pulse in the meantime. But it’d be a much better experience if Alexa offered Skills the ability to respond immediately with something like “OK, let me look that up for you”, and then a few seconds later with the actual response.

A great use case is a hypothetical Lyft app. When you order a ride, it might take 10–60 seconds for real drivers in the real world to accept the job. In Lyft’s app, this latency is papered over with a spinner. But to make this experience work on the Echo, a Skill needs to be able to reply instantly (“OK, let me get you a ride”), then keep you updated (“I’m still trying to connect you with a driver…”), before letting you know: “I got you a ride. Your car will arrive in 4 minutes”. That experience isn’t possible today, and it desperately needs to be: it would enable a whole class of semi-asynchronous, long-running Skills.
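To sketch what I mean, here’s the shape of that interaction in plain Python. Everything here is hypothetical — today a Skill gets exactly one reply per request, so there’s no real API behind any of this; the function just returns the sequence of speech responses Alexa would deliver over time:

```python
def order_ride(driver_accepts_after=25, update_every=10, timeout=60):
    """Hypothetical long-running Skill interaction (times in seconds).

    Returns the list of speech responses, in order: an instant
    acknowledgement, periodic progress updates, and a final answer.
    """
    speech = ["OK, let me get you a ride."]  # reply immediately
    elapsed = 0
    while elapsed < timeout:
        elapsed += update_every
        if elapsed >= driver_accepts_after:  # simulated driver-accepted event
            speech.append("I got you a ride. Your car will arrive in 4 minutes.")
            return speech
        speech.append("I'm still trying to connect you with a driver...")
    speech.append("Sorry, I couldn't find you a driver.")
    return speech

for line in order_ride():
    print(line)
```

The key point isn’t the fake ride-hailing logic; it’s that a Skill needs a channel for more than one utterance per request, spread out over tens of seconds.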

Notifications Notifications Notifications

Today, Alexa can only respond to commands you utter. There’s no way for the Echo to notify you that something happened; you always have to ask. But events and alerts are critical ingredients in some of the most useful experiences.

Take that Lyft example again — wouldn’t it be useful if Alexa could tell you when your ride was one minute away? What if Alexa could let you know that the pizza you ordered had been dispatched, or for that matter, remind you that your latest Amazon order would be delivered sometime this afternoon? None of that’s possible today.

Now, I can totally understand why this isn’t in for the v1 — tasteful notifications are hard to get right — but the issues are all solvable. Access to notifications needs to be tightly controlled to prevent abuse, but Amazon already has a certification scheme in place for Skills. It’d also make sense to have a low per-Skill quota to prevent over-use. As a user, I’d also want to be able to set do-not-disturb periods to prevent interruptions.

I really hope the folks at Amazon are actively working on notifications right now. They’d dramatically expand the universe of what’s possible.

Access to long-form & streaming audio

Right now, Skills are able to play short (<30 sec) audio clips. This is really designed for audio branding — perhaps a sound trademark. But I can imagine whole classes of experiences that become possible if Skills are able to access live audio streams or play long files.

For example, I’d love to be able to ask Alexa to start streaming the sound from our baby monitor when we put our daughter to sleep. I’d love people to build Skills which access long-form audio content beyond podcasts — for example LBC’s back catalogue of programming stretching back nearly 10 years.

The built-in apps (TuneIn, Spotify, Pandora, Audible etc) are all able to play >30 sec audio files, and connect to live audio streams. It’d be great to see the same abilities made available to third-party Skills too.

Multiroom

Perhaps this is the ultimate first-world problem: I’d like my ambient voice-activated virtual assistant to be in every room of my home. Yes, I know, I’m lucky enough to have a home with enough distance between rooms that you can’t be heard properly from one to the next, let alone lucky enough to have an ambient voice-activated virtual assistant. But I’ve come to expect, no, rely on, Alexa’s presence, so much so that I’m confused when I walk into the bedroom and can’t verbally add diapers to our shopping list.

First, it’d be great if, in a multiple-Echo home, Alexa were smart enough that only the nearest device responded: think of the Echo’s beam-forming mic on steroids. We only have one Echo, so I can’t check, but I suspect that’s not how it works today.

It’d be even better if multiple Echos could work together. I’d love to be able to say, from the kitchen: “Alexa, play a lullaby in the Nursery”. Yes, I’m that good a dad.

A more natural invocation model for third-party Skills

While built-in apps like Amazon’s own or Pandora can be invoked with natural phrases like “Alexa, is it going to rain today?”, or “Alexa, play some Gregory Porter”, third-party Skills have a more rigid invocation format:

Alexa, ask Tube Status if there are any delays

Alexa, ask Automatic where my car is

Alexa, ask TV Shows when is American Idol on?

Alexa, ask|tell|open {skill name} to|for|about|if|whether {some command}

This results in some pretty awkward sentences, and the formal structure interrupts the illusion that you’re talking to a truly smart assistant. To really make Skills shine, Alexa needs to be clever enough to figure out what you’re asking, and delegate to the right Skill. The commands above should be as simple as:

Alexa, are there any delays on the Tube?

Alexa, where’s my car?

Alexa, when is American Idol on?

Now, again, I totally get why this is the state today — the formal structure makes it much easier for Alexa’s brain to invoke the right Skill and pass your command to it in a structured way. But we’re shooting for amazing here — and being able to invoke Skills using natural language and arbitrary sentence structure is critical to the illusion Alexa purveys.
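That trade-off is easy to see in code. With the rigid grammar, delegating to the right Skill is little more than a string match against a registry of installed skill names. This is my own toy sketch, not how Alexa actually works; the registry and helper are hypothetical:

```python
import re

# Hypothetical registry of installed third-party skill invocation names.
SKILLS = {"tube status", "automatic", "tv shows"}

def route(utterance):
    """Split a rigid 'Alexa, ask <skill> <command>' invocation into
    (skill, command), or return None if it doesn't fit the grammar."""
    m = re.match(r"^Alexa, (?:ask|tell|open) (.+)$", utterance, re.IGNORECASE)
    if not m:
        return None
    rest = m.group(1)
    for skill in SKILLS:
        if rest.lower().startswith(skill + " "):
            command = rest[len(skill):].strip()
            # Drop an optional connective word (to/for/about/if/whether).
            command = re.sub(r"^(?:to|for|about|if|whether)\s+", "",
                             command, flags=re.IGNORECASE)
            return skill, command
    return None

print(route("Alexa, ask Tube Status if there are any delays"))
# → ('tube status', 'there are any delays')
print(route("Alexa, where's my car?"))
# → None
```

The natural phrasings fall straight through to `None`: without the “ask {skill}” scaffold, there’s nothing mechanical to match on, and Alexa would instead need genuine language understanding to work out which Skill you meant.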

Audio Out

The Echo is a really great little speaker — at least as good as the other Bluetooth speakers in its price range, and they just stream Bluetooth audio. But it’s not Hi-Fi. For me, the Echo is missing a line-out jack that I can wire into a proper set of speakers to play back streaming audio.

Of course, I could still use a laptop/phone/AirPlay to stream Spotify to my Hi-Fi but it’s testament to how awesome Alexa’s interaction model is that I want to use the Echo to control everything. Given that the current hardware doesn’t have an audio jack, a quick fix would be to let Alexa control another Spotify client — kind of like Spotify Connect in reverse. Or I could hack it —