It was a serious effort, so we’ve decided to share some of our experiences and impressions. Here is a subjective summary of what integrating with each of these platforms feels like and our impressions of the resulting skills. If you’re interested in integrating with some of these platforms, reading this might help you prepare for what’s to come.

Let’s talk about symptoms

Our scenario is no cat facts app. Our app is a symptom checker. This means that the task of the bot is to understand your health complaints and play the role of a diagnostician, which involves an actual conversation. This includes asking questions that help disambiguate your descriptions, but also performing a digital differential diagnosis. What follows is a series of dynamically chosen questions rather than a fixed conversation script.

Our web-based chatbot implements the same conversation flow: collecting chief complaints, gathering basic demographic data, and then asking questions driven by our diagnostic engine.

The standard approach to building bots assumes that language understanding boils down to guessing the intent behind each user’s message, and possibly also understanding which entities are addressed (usually called “filling slots”). These entities may be city names or items from the user’s tiny database. This model works well when dealing with questions like “What’s the weather in Berlin?”, but it didn’t fit our scenario. First, getting a health check-up is not a simple intent with fixed slots to fill. This meant that frameworks such as Rasa, Dialogflow or LUIS were of little use to us. We had to take full control of the conversation management and implement our custom bot framework to handle this.

What’s more, while the task is admittedly not open-domain, our domain is a subset of GP-level diagnostics, which means that we should be able to understand more than a thousand possible complaints, and each of these could be expressed in various ways. If you’re curious how we approach understanding users’ descriptions, our blog post on Alexa integration may shed some light.

All of these considerations mean that the scenario described here is somewhat different from a typical one advertised by most “build your chatbot with us” services. We had implemented all of the actual logic for handling conversations on our own and deployed the code onto our servers (Infermedica Bot Framework). We didn’t expect the voice platforms to guess user intents, manage conversation turns, etc. We expected them to turn users’ speech into text, pass the text on to us as it was, and let us do the rest of the job. It was also helpful when a platform was able to display formatted text replies on the user’s device, if it had a display.
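To make this division of labour concrete, here is a minimal sketch of how such a setup could look. It is not the actual Infermedica Bot Framework code: the `Reply` type, the `echo_core` stand-in and the `handle_turn` function are all hypothetical names made up for illustration. The point is only that the channel adapter passes the transcript through untouched and leaves every decision to the conversation core.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Reply:
    """One reply produced by the conversation core (hypothetical type)."""
    text: str                       # what a screen would show
    speech: str = ""                # what the device would say
    buttons: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Fall back to the display text when no separate speech is given.
        if not self.speech:
            self.speech = self.text


# Assumed shape of the conversation core: (session id, raw text) -> replies.
Core = Callable[[str, str], List[Reply]]


def echo_core(session_id: str, utterance: str) -> List[Reply]:
    """Stand-in for the real conversation logic; it only echoes the text."""
    return [Reply(text=f"You said: {utterance}")]


def handle_turn(core: Core, session_id: str, transcript: str) -> Dict[str, object]:
    """What a thin channel adapter does: no intents, no slots, just text."""
    replies = core(session_id, transcript)
    return {
        "speech": " ".join(r.speech for r in replies),
        "display": [r.text for r in replies],
        "buttons": [b for r in replies for b in r.buttons],
    }
```

Each voice platform then only needs a small adapter that extracts the transcript from its own request format and converts the returned dictionary into its own response format.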

Getting your hands dirty

Before proceeding with the real integration, we tried deploying a toy skill on each of these platforms that would allow us to see the full text of a user’s utterance and post a simple reply. Here is a summary of our experience with this. If you’re not interested in technical details, feel free to skip this section entirely.

The main difficulty when approaching each of the platforms was to plow through the documentation, which usually focused on simplistic one-intent bots. We had to follow a less documented path that allowed us to take full control. Please note that these platforms are evolving rapidly, and by the time you’re reading this, many of the issues may have already been solved.

Alexa and AMAZON.LITERAL

We first deployed the skill on Alexa. At that time it was almost impossible to avoid relying on its intents-and-slots model, and it took nothing short of a hack to gain access to transcriptions of whole utterances. The hack was to use the AMAZON.LITERAL slot type and one generic intent. That hack is now being retired, but a couple of other solutions are offered; you’ll probably have the best luck using the AMAZON.SearchQuery slot type. Once you’re past this point, the rest is fairly simple to get up and running. You can either talk JSON on your own or use a convenient framework such as Flask-Ask. The data structures required by Amazon are well organized and mostly self-explanatory.
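If you choose to talk JSON on your own, the catch-all trick can be sketched roughly as follows. This is an illustration rather than our production code: the intent name `CatchAllIntent` and the slot name `utterance` are hypothetical and would be whatever you define in your interaction model. The request and response fields follow the general shape of the Alexa custom skill JSON interface, so please check them against the current reference before relying on them.

```python
from typing import Any, Dict, Optional

# Hypothetical names; define your own in the skill's interaction model.
CATCH_ALL_INTENT = "CatchAllIntent"
UTTERANCE_SLOT = "utterance"


def extract_utterance(event: Dict[str, Any]) -> Optional[str]:
    """Return the text captured by the catch-all slot, or None if absent."""
    request = event.get("request") or {}
    if request.get("type") != "IntentRequest":
        return None
    intent = request.get("intent") or {}
    if intent.get("name") != CATCH_ALL_INTENT:
        return None
    slot = (intent.get("slots") or {}).get(UTTERANCE_SLOT) or {}
    return slot.get("value")


def build_response(speech: str, end_session: bool = False) -> Dict[str, Any]:
    """Wrap plain text in a minimal skill response that keeps the session open."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": end_session,
        },
    }
```

Keeping `shouldEndSession` set to false is what lets the skill carry on a multi-turn conversation instead of answering once and closing.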

Fast prototyping on Cortana

Cortana was the simplest to get started with, and we can recommend it for fast skill prototyping. Admittedly, Microsoft loves to spawn countless abstraction layers which can overwhelm a daring developer (we’ve encountered Azure, Microsoft Bot Framework, My Knowledge Store, botlets, and LUIS), but once you get started, most of these can be easily dismissed. We managed to avoid services such as LUIS and get hold of the original user text quickly. Also, you might get the impression that using the JS or C# SDK is the only possibility. Microsoft seems to be trying hard to hide the fact that their Bot Framework has a convenient REST API, and you don’t need to use a dedicated library or any of the suggested programming languages. What’s great about Cortana is that it’s quick to deploy a working version to your own Microsoft account. The data structures are admittedly somewhat complex, but this is the cost of being able to include formatted text and quick response buttons in your app’s replies.
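As a rough illustration of the REST route, the sketch below builds a reply to an incoming message activity without any SDK. The function name `build_reply` is ours, and the field names (`speak`, `inputHint`, `suggestedActions`, `textFormat`) and the reply URL pattern reflect our understanding of the Bot Framework connector API, so treat them as assumptions to verify against the current documentation. Authentication (a bearer token in the `Authorization` header) and the actual HTTP call are left out.

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import quote


def build_reply(incoming: Dict[str, Any], text: str, speak: str,
                buttons: List[str]) -> Tuple[str, Dict[str, Any]]:
    """Return the URL to POST to and the reply activity for an incoming message."""
    base = incoming["serviceUrl"].rstrip("/")
    conversation_id = quote(incoming["conversation"]["id"], safe="")
    activity_id = quote(incoming["id"], safe="")
    url = f"{base}/v3/conversations/{conversation_id}/activities/{activity_id}"

    activity: Dict[str, Any] = {
        "type": "message",
        "from": incoming["recipient"],        # the bot becomes the sender
        "recipient": incoming["from"],
        "conversation": incoming["conversation"],
        "replyToId": incoming["id"],
        "text": text,                         # shown on screen
        "textFormat": "markdown",
        "speak": speak,                       # read aloud
        "inputHint": "expectingInput",        # keep the microphone open
    }
    if buttons:
        # Quick-reply buttons that send their value back as a user message.
        activity["suggestedActions"] = {
            "actions": [{"type": "imBack", "title": b, "value": b} for b in buttons]
        }
    return url, activity
```

The user's original text is simply `incoming["text"]`, which is all we needed from the platform.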

The weird world of Google

Integrating with Google Assistant was the harshest experience. The documentation was sparse and tried to trick you into using Dialogflow at every opportunity. The data exchange formats were unusually complex, and some features were described only for the Node.js library (while we strongly preferred using REST APIs for every channel to avoid any additional dependencies). The device simulator seemed to work differently from actual Android devices, and getting properly formatted output on the simulator didn’t guarantee that it would be rendered that way on your phone. It was even hard to get the skill invoked (sometimes you had to refer to it as “my test app”, and sometimes you had to use its proper invocation name). Fortunately, Google’s tech support was very responsive and helpful, which allowed us to get past these difficulties in a reasonable time.

Speech recognition quality

We’ve been amazed by the accuracy of speech recognition on Google. Common words or specific medical terminology, good English or a strong Polish accent — everything works just fine. This alone should be a reason to take Google Assistant seriously as the most promising platform out there.

Cortana is the second best for us. Most of its transcriptions were correct. A nice feature is the way the recognition process is visualized: you get word-by-word predictions where the word being understood is depicted using a neat waveform icon. We like this a lot, as this way our skill doesn’t get the blame for errors at the speech recognition level. A significant drawback, though, is the lengthy pauses between conversation turns. While it’s important to give speakers enough time to finish their sentence, in our opinion this goes too far and the result is an unnaturally slow conversation pace.

As for Alexa, we were quite disappointed to see how poorly it performed with common medical terms. It seems that its underlying voice recognition service is heavily primed for its most common use cases: shopping, news, and pop culture. It’s hard to get many of our phrases understood correctly. For instance, “abdominal pain” is often understood as “add domino pain” (Domino’s pizza is likely to blame), “dizzy” gives way to “Disney”, and “female” battles with “email”. To the best of my knowledge, there is no way to select a domain-specific language model or to train one. This is especially bad on headless devices — you can have the best possible Natural Language Understanding technology, but the user will still think that this is your failure to understand simple language. We hope this will improve soon.

Rich text replies accompanying speech

Voice interaction can be enriched by presenting some information on screen or even allowing the user to tap on quick-reply buttons. This can be especially helpful in the case of health information, which usually involves complex vocabulary.

Each of these platforms supports some range of devices equipped with a screen. To be honest, we didn’t explore this possibility on Alexa, as at the time of our integration such devices were non-existent or barely emerging. So, I will refrain from giving any opinion on Alexa Display Interface until we’ve tried it.

Cortana seems to have been designed with voice+screen interaction in mind. It can display plain text, simple text formatting (using a subset of markdown), some special cards, and quick-reply buttons. It took us some fiddling to put the desired sequences of replies under one rich-formatted card, but the format is quite flexible and in the end it worked for us. We didn’t manage to embed smaller images, but being able to include colorful emoji was good enough.

Actions on Google also supports graphical replies, which include both simple replies and some special cards. But as soon as you start designing a slightly more complex interaction, things begin to get nasty. The platform imposes cryptic constraints on the sequence of possible reply types, which are hard to understand and follow. For instance, each sequence of replies to a user’s utterance needs to start with a plain text bubble, and there can be at most two such bubbles per turn. Even if you want to reply with a rich-formatted reply card, you need to start with a plain text bubble. Voice responses are to be distributed among plain text bubbles only, and it takes a lot of finesse to get speech counterparts of the replies which are displayed as rich text. If you’ve already designed an underlying conversation flow, you’ll need to spend quite some time implementing a translator that will force your sequence of replies (both speech and text for screens) to adhere to these requirements. And the built-in simulator is hardly helping, as what you see ain’t what you are getting. Again, we hope this is another aspect that will improve over time. At any rate, the final user experience on Google is so rewarding that it’s still definitely worth it to grit your teeth and get the job done.
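To give an idea of what such a translator involves, here is a simplified sketch. It does not produce Google's actual response JSON. It works on a hypothetical internal `Reply` type and enforces only the rules mentioned above: the turn starts with a plain text bubble, there are at most two such bubbles, and all speech is carried by the bubbles rather than by cards.

```python
from dataclasses import dataclass
from typing import List

MAX_BUBBLES = 2  # limit on plain text bubbles per turn, as described above


@dataclass
class Reply:
    kind: str      # "text" or "card" (hypothetical internal types)
    display: str   # what appears on screen
    speech: str    # what should be said aloud


def fit_to_constraints(replies: List[Reply]) -> List[Reply]:
    """Reshape a free-form reply sequence so that it obeys the rules above."""
    out: List[Reply] = []
    bubbles: List[Reply] = []
    for r in replies:
        if r.kind == "text":
            if len(bubbles) < MAX_BUBBLES:
                bubble = Reply("text", r.display, r.speech)
                bubbles.append(bubble)
                out.append(bubble)
            else:
                # No bubbles left: fold the extra text into the last one.
                last = bubbles[-1]
                last.display = f"{last.display} {r.display}"
                last.speech = f"{last.speech} {r.speech}"
        else:
            if not bubbles:
                # A card cannot come first, so lead with its spoken version.
                bubble = Reply("text", r.speech, r.speech)
                bubbles.append(bubble)
                out.append(bubble)
            else:
                # Cards are silent; move their speech to the latest bubble.
                bubbles[-1].speech = f"{bubbles[-1].speech} {r.speech}"
            out.append(Reply("card", r.display, ""))
    return out
```

Note that folding extra text into an earlier bubble can change the on-screen order relative to a card, which is one example of the compromises this kind of translation forces on you.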

Summary

Although automatic voice recognition has been around for many years, voice assistant platforms are a very recent development. It would be unreasonable to assume that they operate without glitches, and this is what you’re going to face as a developer. Nevertheless, we’ve managed to integrate our symptom checker app with all three major voice platforms and build an interesting user experience using the available features. Incidentally, Infermedica is the first company to offer symptom checkers on all these platforms. Feel free to check it out and judge for yourself whether we have succeeded.

Where to find Symptomate: