Voice Fill is now available at the Firefox Add-ons website for all Firefox users. Test Pilot users will be automatically migrated to this version.

Last year, Mozilla launched several parallel efforts to build capability around voice technologies. While work such as the Common Voice and DeepSpeech projects took aim at creating a foundation for future open source voice recognition technology, the Voice Fill experiment in Test Pilot took a more direct approach by building voice-based search into Firefox to learn if such a feature would be valuable to Firefox users. We also wanted to push voice research at Mozilla by contributing general tooling and training data to add value to future voice projects.

How it went down

The Firefox Emerging Technologies team approached Test Pilot with an idea for an experiment that would let users fill in any form element on the web with voice input.

An early prototype

As a technical feat, the early prototypes were quite impressive, but we identified two major usability issues that had to be overcome. First, adding voice control to every site with a text input would mean debugging our implementation across an impossibly large number of websites, any of which could break the experiment in random and hard-to-repair ways. Second, because users must opt into voice controls in Firefox on a per-site basis, this early prototype would have required users to fiddle with browser permissions wherever they wanted to engage with voice input.

In order to overcome these challenges, the Test Pilot and Emerging Technologies teams worked together to identify a minimum scope for our experiment. Voice Fill would focus on voice-based search as its core use case and would only be available to users through the Google, DuckDuckGo, and Yahoo search engines. Users visiting these sites would see a microphone button indicating that Voice Fill was available, and could click the button to trigger a voice search.

Animation showing the Voice Fill interface

From an engineering standpoint, the Voice Fill WebExtension add-on worked by letting users activate microphone input on specific search engine pages. Once triggered, an overlay appeared on the page prompting the user to record their voice via the standard getUserMedia browser API. We used a WebExtension content script to inject the Voice Fill interface into search pages, and the Lottie library, which renders After Effects animations exported as JSON, to power the awesome mic animations provided by our super talented visual designer.
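Restricting the experiment to a handful of search engines maps naturally onto the WebExtension content script model, where scripts are injected only on pages matching declared URL patterns. Here is a minimal sketch of what the relevant manifest.json section might look like; the match patterns and file name are illustrative, not taken from the actual Voice Fill source:

```json
{
  "content_scripts": [
    {
      "matches": [
        "*://www.google.com/*",
        "*://duckduckgo.com/*",
        "*://search.yahoo.com/*"
      ],
      "js": ["voice-fill.js"]
    }
  ]
}
```

With a declaration like this, Firefox injects the script only on the listed search engines, so the mic button never has to be debugged against the rest of the web.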

Voice Fill relied on an Emscripten module based on WebRTC C code to handle voice activity detection and register events for things like loudness and silence during voice recording. After recording, samples were analyzed by an open source speech recognition engine called Kaldi. Kaldi is highly configurable, but essentially works by taking snippets of speech, then using a speech model (we used a legacy version of the Api.ai model in our experiment) to convert each snippet into best guesses at text along with a confidence rating for each guess. For example, I might say “Pizza” and Kaldi might guess “Pizza” with 97% confidence, “Piazza” with 85% confidence, and “Pit saw” with 60% confidence.
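Voice activity detection can be pictured with a much simpler, purely energy-based sketch. The actual add-on used an Emscripten build of WebRTC's detector; the frame size, threshold, and function names below are illustrative placeholders, not the real implementation:

```javascript
// Toy voice activity detector: splits audio samples into frames and
// classifies each frame as "speech" or "silence" by its RMS energy.
// frameSize and threshold are illustrative, not Voice Fill's values.
function detectActivity(samples, frameSize = 4, threshold = 0.1) {
  const events = [];
  for (let start = 0; start < samples.length; start += frameSize) {
    const frame = samples.slice(start, start + frameSize);
    // Root-mean-square energy of the frame.
    const rms = Math.sqrt(
      frame.reduce((sum, s) => sum + s * s, 0) / frame.length
    );
    events.push(rms >= threshold ? "speech" : "silence");
  }
  return events;
}

// A burst of loud samples followed by near-silence:
const signal = [0.5, -0.4, 0.6, -0.5, 0.01, -0.02, 0.01, 0.0];
console.log(detectActivity(signal)); // → [ 'speech', 'silence' ]
```

A real detector like WebRTC's is far more robust (it models speech spectrally rather than just measuring loudness), but the event stream it produces is the same shape: a sequence of speech/silence decisions that tells the UI when the user has stopped talking.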

Search results in Voice Fill

Depending on the confidence generated for any given speech sample, Voice Fill did one of the following for each analyzed voice sample.

If the topmost confidence rating was high enough, or the difference between the first and second confidence scores for a result was large enough, Voice Fill triggered a search automatically.

If the topmost confidence rating was below a certain threshold, or if the top two confidence ratings were tightly clustered, we showed a list of possible search terms for the user to choose from.

If Kaldi returned no suggestions, we displayed a very pretty error screen and asked the user to try again.
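In code, that three-way decision might look something like the sketch below. The actual thresholds Voice Fill used aren't documented here, so HIGH_CONFIDENCE and MIN_GAP are illustrative placeholders:

```javascript
// Illustrative thresholds, not Voice Fill's actual values.
const HIGH_CONFIDENCE = 0.95;
const MIN_GAP = 0.10;

// `guesses` is a list of { text, confidence } pairs sorted best-first,
// as a Kaldi-style decoder would return them.
function decide(guesses) {
  if (guesses.length === 0) {
    return { action: "show-error" }; // pretty error screen, ask to try again
  }
  const [best, runnerUp] = guesses;
  const gap = runnerUp ? best.confidence - runnerUp.confidence : Infinity;
  if (best.confidence >= HIGH_CONFIDENCE || gap >= MIN_GAP) {
    return { action: "search", term: best.text }; // search automatically
  }
  return { action: "choose", terms: guesses.map(g => g.text) }; // user picks
}

// The "Pizza" example from above decides to search automatically:
const guesses = [
  { text: "Pizza", confidence: 0.97 },
  { text: "Piazza", confidence: 0.85 },
  { text: "Pit saw", confidence: 0.60 },
];
console.log(decide(guesses)); // → { action: 'search', term: 'Pizza' }
```

The interesting design choice is the second branch: even a middling top score triggers an automatic search if it beats the runner-up decisively, since a large gap means the decoder wasn't genuinely torn between candidates.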

What did we learn?

One of the big goals of the Test Pilot program is to assess market fit for experimental concepts, and it was pretty clear from the start that Voice Fill was not the most attractive experiment for the Test Pilot audience.

Voice Fill has fewer daily users than our other active experiments

The graph above shows the average number of Firefox profiles with each of our four add-on-based experiments installed over the last two months. While the other three sit in the 15 to 20k user range, Voice Fill, in orange, has significantly fewer users.

This lack of market fit bears out when we look at how users engaged with Voice Fill on the Test Pilot website during the first two weeks of January, when Mozilla’s marketing department ran a promotion for Test Pilot. The chart below shows how many Test Pilot users clicked on each experiment installation button (or in the case of Send, clicked the button that links to the Send website). Again, Voice Fill garnered significantly less direct user attention than other experiments.

The pitch for Voice Fill was less attractive than for other experiments

So Voice Fill didn’t set the world on fire, but by shipping in Test Pilot, we were able to determine that a pure speech-to-text search function may not be the most highly sought-after Firefox feature without undertaking the complex task of building a massive service for every Firefox user.

As mentioned above, Voice Fill is one part of an effort to improve open source voice recognition tools at Mozilla. While it had a modest overall user base, Voice Fill gave us a large corpus of data on which to conduct comparative analysis.

Over its lifespan, Voice Fill users produced nearly one hundred thousand requests, resulting in more than one hundred ten hours of audio. A comparative analysis of the Voice Fill corpus using different speech models gave us insight into how to benchmark the performance of future voice-based efforts.

We conducted our analysis by running the Voice Fill corpus through Voice Fill’s Api.ai speech model, the open source DeepSpeech model built by Mozilla, the Kaldi Aspire model, and Google’s Speech API.

The chart below shows the average amount of time each of these models needed to decode samples in our corpus. In terms of raw speed, the Api.ai model used in Voice Fill performed quite well relative to DeepSpeech and Aspire. The Google comparison here is not quite apples-to-apples, since its average time includes a call to Google’s Cloud API, whereas the other three analyses were conducted on a local cluster.

Average time to process each sample by speech model

Next, we wanted to know how many of the words Google’s Speech API identified were also identified by the other models. The chart below shows the total number of words in the corpus for which each model matched the results generated by Google’s Speech API. Here, Api.ai matched forty-six thousand words with Google, Aspire matched forty-two thousand, and DeepSpeech matched just thirty thousand. DeepSpeech lags behind, but it’s worth noting that it’s by far the newest of these models. While it has a long way to go to catch up to Google’s proprietary model, it’s quite impressive for such a young open source effort.
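The word-matching comparison can be pictured with a toy function like the one below. This positional count is a deliberate simplification (a real evaluation would typically align transcripts with an edit-distance algorithm before counting matches), and none of the names here come from the actual analysis pipeline:

```javascript
// Count how many words in a reference transcript are matched,
// position-for-position, by a candidate transcript. A crude stand-in
// for the word-agreement analysis described above.
function matchedWords(reference, candidate) {
  const ref = reference.toLowerCase().split(/\s+/);
  const cand = candidate.toLowerCase().split(/\s+/);
  let matches = 0;
  for (let i = 0; i < Math.min(ref.length, cand.length); i++) {
    if (ref[i] === cand[i]) matches++;
  }
  return matches;
}

// "sourced" differs from "source", so 3 of 4 words match:
console.log(matchedWords("open source voice search",
                         "open sourced voice search")); // → 3
```

Summing such counts over the whole corpus, with Google's output as the reference, yields per-model agreement totals like the forty-six thousand figure cited for Api.ai.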

While we can’t be sure exactly why Google’s model outperforms the others in this instance, the qualitative feedback from Test Pilot suggests that our users’ accents might be one factor.

We limited promotion of Voice Fill to English-speaking Test Pilot users, but did not restrict the experiment by geography. As a result, many users told us that their accents seemed to prevent Voice Fill from accurately interpreting voice samples. This is another limitation that would prevent us from shipping Voice Fill in Firefox in its current form: our users came from all over the world, and the model we used simply does not account for the literal diversity of voices among Firefox users.

What happens next?

Voice Fill is leaving Test Pilot, but it will remain available to all users of Firefox at the Firefox Add-ons website. We know from user feedback that Voice Fill provides accessibility benefits to some of its users and we are delighted to continue to support this use case.

All of the samples collected in Voice Fill will be used to help train and improve the DeepSpeech open source speech recognition model.

Additionally, the proxy service we built to let Voice Fill speak to our speech recognition back end means that future voice-based experiments and services at Mozilla can share a common infrastructure. This service is already being used by the Mozilla IoT Gateway, an open source connector for smart devices.

We’re also exploring improvements to the way Firefox add-ons handle user media. The approaches available to us in Voice Fill were limited, and may have contributed to the diminished usability of the experiment.

Thank you to everyone who participated in the Voice Fill experiment in Test Pilot, and thanks in particular to Faramarz Rashed and Andre Natal on the Mozilla Emerging Technologies team for spearheading Voice Fill!