About a year and a half ago, I began to develop debilitating Repetitive Strain Injury (RSI) while working at bambuu. I’ve chronicled what has and hasn’t helped me, but there’s one thing I only briefly mentioned in the original blog post that I feel deserves further exploration: voice coding, by which I mean using your voice as an input device to write code. I feel voice coding has huge but unrealized potential, both for allowing people who are injured to code again and for helping people who are at risk.

RSI happens when you do the same thing over and over again, in this case hammering on the keyboard. It’s obvious that people who are now unable to type could benefit from coding hands-free, but everyone might benefit from going hands-free a day or two a week. Mixing it up keeps us healthy.

With that out of the way, what I want to do here is paint, in broad strokes, a picture of the voice coding landscape.

I assume most people who know of voice coding found out from Tavis Rudd’s excellent talk at PyCon, where he explains his own setup using Dragon NaturallySpeaking along with a language he created himself. The talk is just below and the demo starts at 8:50. I’d suggest you watch at least a minute of the demo to see how it looks and sounds.

Tavis Rudd’s talk at PyCon. Over 250,000 views!

For those of you who prefer reading: Tavis maps one-syllable words to special characters and actions, so “slap” means space and “dash some random thing” outputs “some-random-thing” in the editor. Navigation works through commands such as “up 10”, which jumps ten lines up.
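
To make the mapping concrete, here’s a minimal sketch in Python of how such one-word commands could be turned into text. It is not Tavis’ actual grammar: the `interpret` function and both tables are hypothetical names, “slap” and “dash” come from the description above, and treating “snake” as snake_case is an assumption.

```python
# Hypothetical command tables; only "slap" and "dash" come from the talk
# as described above. Everything else is illustrative.
SYMBOLS = {"slap": " "}

FORMATTERS = {
    "dash": "-".join,   # "dash some random thing" -> "some-random-thing"
    "snake": "_".join,  # assumption: "snake" joins words with underscores
}


def interpret(utterance: str) -> str:
    """Turn one recognized utterance into the text to type."""
    words = utterance.lower().split()
    if not words:
        return ""
    head, rest = words[0], words[1:]
    if head in FORMATTERS:
        # The first word picks the formatter, the rest is the payload.
        return FORMATTERS[head](rest)
    if head in SYMBOLS and not rest:
        return SYMBOLS[head]
    # Anything unrecognized falls back to plain dictation.
    return " ".join(words)
```

Calling `interpret("dash some random thing")` gives `"some-random-thing"`, and anything that isn’t a known command is simply passed through as dictated text.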

Mapping one-syllable words to actions seems unnecessarily limiting to me, though. We’re missing out on all the rich meaning a sentence can carry when, instead of saying sentences like “Create a new dictionary called my dictionary”, we say things like “pam snake my dictionary ak par”. It’s not even much shorter, and it’s much harder to understand. More on that later!



Tavis accomplishes his impressive setup with Dragon NaturallySpeaking (a speech recognition program), some hefty Emacs magic and a Dragonfly grammar he wrote himself.

Dragonfly is an open source framework for associating actions with voice commands. A Dragonfly grammar is a way to hook into Dragon NaturallySpeaking’s speech recognition and execute code for specific commands, usually by emulating keyboard actions.

An example of what a Dragonfly grammar looks like. This one is very simple and lets you navigate up and down and press space.
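
To show the idea without needing Dragon or Dragonfly installed, here’s a rough sketch in plain Python of what such a grammar boils down to: a table of spoken patterns, each tied to the key presses it should emulate. This is deliberately not Dragonfly’s API. `RULES` and `keys_for` are hypothetical names, the key names are placeholders, and it assumes the recognizer hands over numbers as digits (“up 10”, not “up ten”).

```python
import re

# Hypothetical rule table: spoken pattern -> function returning key names.
# A real grammar would send these keys to the editor instead of returning them.
RULES = [
    (re.compile(r"^up(?: (\d+))?$"), lambda n: ["Up"] * n),
    (re.compile(r"^down(?: (\d+))?$"), lambda n: ["Down"] * n),
    (re.compile(r"^slap$"), lambda n: ["Space"]),
]


def keys_for(utterance: str) -> list:
    """Return the key presses for one utterance, or [] if nothing matches."""
    text = utterance.strip().lower()
    for pattern, action in RULES:
        match = pattern.match(text)
        if match:
            # The repeat count is optional and defaults to a single press.
            count = int(match.group(1)) if match.lastindex else 1
            return action(count)
    return []
```

With this table, `keys_for("up 3")` returns three `"Up"` presses, while a bare `"down"` returns one. Returning key names rather than sending them keeps the sketch testable without touching the keyboard.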

I think this is how a lot of people have attempted to voice-code. At least, there’s a wide variety of personal projects taking this approach on GitHub, such as code-by-voice and dragonfly-modules. However, using someone else’s grammar is usually difficult: they’re rarely well documented and, as Tavis’ talk shows, usually rely on arbitrary words that are hard to memorize. You could of course write your own, but if you’re starting off with voice coding to prevent RSI, it seems cruel that you’d have to program the solution yourself.

So what ways are there to get started with voice coding? There are a few, and we’ve already skimmed the surface of the first, so let’s dive in.

The Do-It-Yourself approach

The DIY approach consists of writing your own Dragonfly grammars, and modding your editor with appropriate shortcuts and extensions to make your life easier. Generally DIY projects seem to use Dragon NaturallySpeaking to capture the speech, but Dragonfly does support Windows Speech Recognition as well.

Using Dragon has a wide variety of advantages, though. It comes with built-in support for dictating documents, writing emails and navigating around in Windows. It even supports web browsing (though the experience is sometimes so-so).

There’s a disadvantage here, however: Dragon NaturallySpeaking only runs on Windows (though projects like Aenea exist that let you use it from Linux by relaying commands from a Windows machine or VM).

For macOS the scenery is a little different. There is a version of Dragon for macOS, Dragon Professional, but it’s supposedly less capable of things like navigating around. So while it’ll work for voice coding, you’re still worse off than with Dragon NaturallySpeaking on Windows.

With the DIY approach you get a lot of freedom, but you also get very little out of the box unless you’re on Windows. If you’re interested in pursuing this angle, some good kickoff points are the dragonfly-modules git repository and these blog posts.

Out of the box Solutions

Maybe you’re not willing to do it yourself. The good news is that out-of-the-box solutions exist. The bad news is that there aren’t very many and they’re quite similar. Let me try to give you an overview.

Drop-in Dragonfly macros:

Some people have taken the DIY approach and been nice enough to put their macros on GitHub. Here are a few. Be aware that they’re usually fairly tightly coupled to the person’s workflow and sparsely documented.

Caster in particular stands out from the crowd as it seems to be a feature-rich voice coding toolkit, and perhaps more importantly — it has actual documentation. Unfortunately it looks like it hasn’t been updated for a few years.

VoiceCode

There’s an old project called VoiceCode by the National Research Council of Canada.

It seems to have been a full solution for coding by voice that anyone could pick up and use. Looking at the source code, it’s close to Tavis’ setup, building on Dragon and a heavily customized Emacs. Unfortunately, it seems to have been abandoned over five years ago, and the documentation pages aren’t hosted anymore.

Voicecode.io

Then there’s voicecode at voicecode.io, which confusingly enough has the exact same name. This is a Mac-only system that promises not only to let you code by voice but also to increase your productivity. It uses SmartNav to replace your mouse, and under the hood it runs Dragon to convert speech to code. You can see a demonstration here.

If you view the demonstration, you’ll notice that it is similar to the way Tavis does it, with lots of strange, arbitrary one-syllable words that map to commands. Even with these shortcomings, I think voicecode.io is currently the most feature-rich out-of-the-box voice coding experience. It’s only for Mac so far, though, and it comes with a reasonably hefty $300 price tag. And that’s without Dragon or SmartNav. A full setup will probably cost you around a thousand dollars.

Silvius

Silvius is the offspring of Dragon NaturallySpeaking and Aenea. It uses Kaldi, an open source speech recognition toolkit, and works both online and on small embedded devices. Silvius works by piping the microphone output to a server, which responds with the sentences it recognizes. The recognized speech is then run through a grammar that produces virtual keystrokes. Looking at Silvius, you might think it’d be slow since the audio has to take a round trip to the server, but it actually seems surprisingly snappy.
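
The flow described here (audio goes to a server, sentences come back, a grammar turns them into keystrokes) can be sketched in a few lines of Python. This is only an illustration of the pipeline’s shape under heavy assumptions, not Silvius’ actual code or protocol: every name below is hypothetical, and `fake_server` stands in for the network round trip and the Kaldi decoder by pretending the audio bytes are already text.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def recognize(chunks: Iterable[bytes], send: Callable[[bytes], str]) -> Iterator[str]:
    """Send each audio chunk to the server; yield every non-empty sentence."""
    for chunk in chunks:
        sentence = send(chunk)
        if sentence:
            yield sentence


def to_keystrokes(sentence: str, grammar: Dict[str, List[str]]) -> List[str]:
    """Map command words to named keys; spell any other word letter by letter."""
    keys: List[str] = []
    for word in sentence.lower().split():
        keys.extend(grammar.get(word, list(word)))
    return keys


def fake_server(chunk: bytes) -> str:
    """Stand-in for the real server: treats the audio bytes as ASCII text."""
    return chunk.decode("ascii")


GRAMMAR = {"slap": ["space"]}  # hypothetical one-entry grammar


def run(chunks: Iterable[bytes]) -> List[str]:
    """Run the whole pipeline and collect the virtual keystrokes."""
    keys: List[str] = []
    for sentence in recognize(chunks, fake_server):
        keys.extend(to_keystrokes(sentence, GRAMMAR))
    return keys
```

Because `recognize` is a generator, keystrokes can be produced as soon as each sentence comes back rather than after the whole recording, which is one plausible reason a server round trip doesn’t have to feel slow.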

I think the strong innovation in Silvius is that it relies on a platform-agnostic speech recognition framework. In the end, that might allow for something that works across all platforms.

Vocola

Vocola is a voice command language that lets you map voice commands to keyboard commands and other functions, in a way that’s very reminiscent of AutoHotkey. I don’t think it’s particularly well suited for code, but it might be very good for surrounding tasks, e.g. opening applications.

An example of a few Vocola commands

Speech Recognition Engines

I like to refer to the thing that powers the actual speech recognition as the speech recognition engine. E.g. Tavis uses Dragonfly to execute the keybindings that result in his actual code, but the engine translating speech to text is Dragon NaturallySpeaking.

The general pattern is that most commercial software uses Dragon NaturallySpeaking for speech recognition, as it appears to have the best accuracy of the available options, but it also comes with a hefty price tag of up to $300.

Some notable exceptions are Dragonfly, which also supports Windows Speech Recognition, and Silvius, which uses Kaldi, seemingly the only offline platform-agnostic framework at the moment. A few days ago, on November 29th, Mozilla also launched the first release of Deep Speech and Common Voice, which I hope will become a viable alternative as well.

Complementary modes of interaction

I think there’s a lot of potential in voice as an input, but I’m not sure it’ll get us all the way there. If desktops were designed for voice, we could probably get all the way, but as of right now we still need to interact with regular desktop programs. Most of them are GUI-based, which means we’ll need to emulate a mouse.

There are a few possible ways to do this.

Descriptive voice commands

For example, “Click the red button called send”. It’s possible to use visual recognition to work out which element the user means. Currently I don’t think anyone is taking this approach; the closest I could find was the SpeechStart+ add-on for Dragon, which lets the user enumerate clickable elements in most programs and then click on them by voice.

Non-hand-controlled mouse

We’ve already been introduced to SmartNav, and a mouse controlled by muscles other than those in the hands is an appealing option. Current alternatives are Headmouse Nano, Camera Mouse and SmartNav.

Eye tracking

You’d think the most natural replacement for the mouse would be eye tracking, since we usually look at what we click on. Tobii has been making strides in this area, first with the Tobii EyeX and now with the Eye Tracker 4C. However, our eyes naturally drift around, so eye tracking will never be able to achieve the pixel precision that a mouse can. Still, I think eye tracking will soon be good enough for most tasks.

(I’ve previously co-authored an academic paper about combining speech recognition and eye tracking for coding. Contact me if you want a copy. There’s also a quick demo here.)

Unsolved Challenges

As we’ve seen demonstrated a few times in this blog post, it’s definitely possible to navigate inside a file and output code with voice. As I’ve written about before, for me it’s usually navigation between symbols and files that is lacking, but I think this is only because the work hasn’t been done yet. I’m hoping that we’ll get there soon.