Speech JavaScript API Specification

Editors:
Bjorn Bringert, Google Inc.
Satish Sampath, Google Inc.
Glen Shires, Google Inc.

Abstract

This specification defines a JavaScript API to enable web developers to incorporate speech recognition and synthesis into their web pages. It enables developers to use scripting to generate text-to-speech output and to use speech recognition as an input for forms, continuous dictation and control. The JavaScript API allows web pages to control activation and timing and to handle results and alternatives.

It is a fully-functional subset of the specification proposed in the HTML Speech Incubator Group Final Report [1]. Specifically, this subset excludes the underlying transport protocol and the proposed additions to HTML markup, and defines a simplified subset of the JavaScript API. This subset supports the majority of use cases and sample code in the Incubator Group Final Report. This subset does not preclude future standardization of additions to the markup, API or underlying transport protocols, and indeed the Incubator Report defines a potential roadmap for such future work.

Status of This Document

This document is an API proposal from Google Inc. to the Web Applications (WEBAPPS) Working Group.

All feedback is welcome.

No working group is yet responsible for this specification. This is just an informal proposal at this time.

1 Conformance requirements

All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [RFC2119]

Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.

Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)

User agents may impose implementation-specific limits on otherwise unconstrained inputs, e.g. to prevent denial of service attacks, to guard against running out of memory, or to work around platform-specific limitations.

Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification, as this specification uses that specification's terminology. [WEBIDL]

2 Introduction

This section is non-normative.

The JavaScript Speech API aims to enable web developers to provide, in a web browser, speech-input and text-to-speech output features that are typically not available when using standard speech-recognition or screen-reader software. The API itself is agnostic of the underlying speech recognition and synthesis implementation and can support both server-based and client-based/embedded recognition and synthesis. The API is designed to enable both brief (one-shot) speech input and continuous speech input. Speech recognition results are provided to the web page as a list of hypotheses, along with other relevant information for each hypothesis.

This specification is a subset of the API defined in the HTML Speech Incubator Group Final Report. That report is entirely informative since it is not a standards track document. This document is intended to be the basis of a standards track document, and therefore defines portions of that report to be normative. All other portions of that report may be considered informative with regards to this document, and provide an informative background to this document.

3 Use Cases

This section is non-normative.

This specification supports the following use cases, as defined in Section 4 of the Incubator Report.

Voice Web Search

Speech Command Interface

Domain Specific Grammars Contingent on Earlier Inputs

Continuous Recognition of Open Dialog

Domain Specific Grammars Filling Multiple Input Fields

Speech UI present when no visible UI need be present

Voice Activity Detection

Hello World

Speech Translation

Speech Enabled Email Client

Dialog Systems

Multimodal Interaction

Speech Driving Directions

Multimodal Video Game

Multimodal Search

Rerecognition

Temporal Structure of Synthesis to Provide Visual Feedback

4 Security and privacy considerations

User agents must only start speech input sessions with explicit, informed user consent. User consent can include, for example:

A user click on a visible speech input element which has an obvious graphical representation showing that it will start speech input.

Accepting a permission prompt shown as the result of a call to SpeechReco.start.

Consent previously granted to always allow speech input for this web page.

User agents must give the user an obvious indication when audio is being recorded.

In a graphical user agent, this could be a mandatory notification displayed by the UA as part of its chrome and not accessible by the web page. This could for example be a pulsating/blinking record icon as part of the browser chrome/address bar, an indication in the status bar, an audible notification, or anything else relevant and accessible to the user. This UI element must also allow the user to stop recording.



In a speech-only user agent, the indication may for example take the form of the system speaking the label of the speech input element, followed by a short beep.

The user agent may also give the user a longer explanation the first time speech input is used, to let the user know what it is and how they can adjust their privacy settings to disable speech recording if required.

To minimize the chance of users unwittingly allowing web pages to record speech without their knowledge, implementations must abort an active speech input session if the web page loses input focus to another window or to another tab within the same user agent.

Implementation considerations

This section is non-normative.

Spoken password inputs can be problematic from a security perspective, but it is up to the user to decide if they want to speak their password.

Speech input could potentially be used to eavesdrop on users. Malicious web pages could use tricks such as hiding the input element or otherwise making the user believe that it has stopped recording speech while continuing to do so. They could also potentially style the input element to appear as something else and trick the user into clicking it. An example of styling the file input element can be seen at http://www.quirksmode.org/dom/inputfile.html. The above recommendations are intended to reduce the risk of such attacks.

5 API Description

This section is normative.

The SpeechReco interface is the scripted web API for controlling a given recognition.

IDL

[Constructor]
interface SpeechReco {
    attribute SpeechGrammarList grammars;
    attribute DOMString lang;
    attribute boolean continuous;

    void start();
    void stop();
    void abort();

    attribute Function onaudiostart;
    attribute Function onsoundstart;
    attribute Function onspeechstart;
    attribute Function onspeechend;
    attribute Function onsoundend;
    attribute Function onaudioend;
    attribute Function onresult;
    attribute Function onnomatch;
    attribute Function onresultdeleted;
    attribute Function onerror;
    attribute Function onstart;
    attribute Function onend;
};
SpeechReco implements EventTarget;

interface SpeechInputError {
    const unsigned short OTHER = 0;
    const unsigned short NO_SPEECH = 1;
    const unsigned short ABORTED = 2;
    const unsigned short AUDIO_CAPTURE = 3;
    const unsigned short NETWORK = 4;
    const unsigned short NOT_ALLOWED = 5;
    const unsigned short SERVICE_NOT_ALLOWED = 6;
    const unsigned short BAD_GRAMMAR = 7;
    const unsigned short LANGUAGE_NOT_SUPPORTED = 8;

    readonly attribute unsigned short code;
    readonly attribute DOMString message;
};

interface SpeechInputAlternative {
    readonly attribute DOMString transcript;
    readonly attribute float confidence;
    readonly attribute any interpretation;
};

interface SpeechInputResult {
    readonly attribute unsigned long length;
    getter SpeechInputAlternative item(in unsigned long index);
    readonly attribute boolean final;
};

interface SpeechInputResultList {
    readonly attribute unsigned long length;
    getter SpeechInputResult item(in unsigned long index);
};

interface SpeechInputResultEvent : Event {
    readonly attribute SpeechInputResult result;
    readonly attribute SpeechInputError error;
    readonly attribute short resultIndex;
    readonly attribute SpeechInputResultList resultHistory;
};

[Constructor]
interface SpeechGrammar {
    attribute DOMString src;
    attribute float weight;
};

[Constructor]
interface SpeechGrammarList {
    readonly attribute unsigned long length;
    getter SpeechGrammar item(in unsigned long index);
    void addFromUri(in DOMString src, optional float weight);
    void addFromString(in DOMString string, optional float weight);
};
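For illustration, a page might translate the SpeechInputError codes into user-facing messages in an onerror handler. The following sketch is not part of the specification: the describeSpeechError helper, the SpeechInputErrorCode lookup table, and the message strings are all hypothetical, with only the numeric codes taken from the SpeechInputError interface.

```javascript
// Numeric error codes, mirroring the constants on SpeechInputError.
var SpeechInputErrorCode = {
  OTHER: 0,
  NO_SPEECH: 1,
  ABORTED: 2,
  AUDIO_CAPTURE: 3,
  NETWORK: 4,
  NOT_ALLOWED: 5,
  SERVICE_NOT_ALLOWED: 6,
  BAD_GRAMMAR: 7,
  LANGUAGE_NOT_SUPPORTED: 8
};

// Hypothetical helper: map an error code to a message the page could
// show to the user.
function describeSpeechError(code) {
  switch (code) {
    case SpeechInputErrorCode.NO_SPEECH: return "No speech was detected.";
    case SpeechInputErrorCode.ABORTED: return "Speech input was aborted.";
    case SpeechInputErrorCode.AUDIO_CAPTURE: return "Audio capture failed.";
    case SpeechInputErrorCode.NETWORK: return "A network error occurred.";
    case SpeechInputErrorCode.NOT_ALLOWED: return "Speech input was not allowed.";
    case SpeechInputErrorCode.SERVICE_NOT_ALLOWED: return "The speech service is not allowed.";
    case SpeechInputErrorCode.BAD_GRAMMAR: return "A grammar could not be loaded.";
    case SpeechInputErrorCode.LANGUAGE_NOT_SUPPORTED: return "The language is not supported.";
    default: return "An unknown speech error occurred.";
  }
}

// In a browser implementing SpeechReco, a page could then write:
//   sr.onerror = function(event) {
//     showMessage(describeSpeechError(event.error.code));
//   };
```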

The TTS interface is the scripted web API for controlling a text-to-speech output.

IDL

[Constructor]
interface TTS {
    attribute DOMString text;
    attribute DOMString lang;
    readonly attribute boolean paused;
    readonly attribute boolean ended;

    // methods to drive the speech interaction
    void play();
    void pause();
    void stop();

    attribute Function onstart;
    attribute Function onend;
};
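One common pattern the onend event enables is chaining several utterances. The sketch below is illustrative only: StubTTS is a hypothetical stand-in for the browser-provided TTS object (so the queueing logic is self-contained and runnable anywhere), and it fires onend synchronously, whereas a real implementation would fire it asynchronously when playback finishes.

```javascript
// Hypothetical stand-in for the browser's TTS object; a real page would
// use `new TTS()` instead.
function StubTTS() {
  this.text = "";
  this.lang = "";
  this.onend = null;
  this.spoken = [];  // records each "spoken" text, for illustration
}
StubTTS.prototype.play = function() {
  this.spoken.push(this.text);
  // A real implementation fires onend asynchronously after playback.
  if (this.onend) this.onend();
};

// Speak a list of phrases one after another by starting the next
// utterance from the previous utterance's onend handler.
function speakAll(tts, phrases, lang) {
  var i = 0;
  function next() {
    if (i >= phrases.length) return;
    tts.text = phrases[i++];
    tts.lang = lang;
    tts.onend = next;
    tts.play();
  }
  next();
}

var tts = new StubTTS();
speakAll(tts, ["Hello.", "How are you?"], "en-US");
```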

6 Examples

This section is non-normative.

Using speech recognition to perform a web search.

Web search by voice with auto-submit

<script type="text/javascript">
var sr = new SpeechReco();
sr.onresult = function(event) {
  var q = document.getElementById("q");
  q.value = event.result[0].transcript;
  q.form.submit();
}
</script>
<form action="http://www.example.com/search">
  <input type="search" id="q" name="q">
  <input type="button" value="Speak" onclick="sr.start()">
</form>

Using speech synthesis.

TTS

<script type="text/javascript">
var tts = new TTS();
function speak(text, lang) {
  tts.text = text;
  tts.lang = lang;
  tts.play();
}
speak("Hello world.", "en-US");
</script>
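As a further sketch, the n-best list of SpeechInputAlternative entries can drive a "Did you say...?" confirmation flow: accept the top transcript when its confidence clears a threshold, otherwise fall back to asking the user. The chooseTranscript helper and the threshold value below are illustrative, not defined by this specification; plain objects with transcript and confidence properties stand in for SpeechInputAlternative so the function runs outside a browser.

```javascript
// Pick a transcript from an n-best list of {transcript, confidence}
// objects (mirroring SpeechInputAlternative). Returns the most confident
// transcript when it meets the threshold, otherwise null so the page can
// ask "Did you say ...?" using the alternatives.
function chooseTranscript(alternatives, threshold) {
  if (alternatives.length === 0) return null;
  var best = alternatives[0];
  for (var i = 1; i < alternatives.length; i++) {
    if (alternatives[i].confidence > best.confidence) best = alternatives[i];
  }
  return best.confidence >= threshold ? best.transcript : null;
}

var result = [
  { transcript: "kittens", confidence: 0.92 },
  { transcript: "mittens", confidence: 0.41 }
];
console.log(chooseTranscript(result, 0.8));   // → "kittens" (accepted)
console.log(chooseTranscript(result, 0.95));  // → null (needs confirmation)
```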

This API supports all of the examples in the HTML Speech Incubator Group Final Report that are within the scope of the JavaScript API and are relevant to the Section 3 Use Cases, with minimal or no changes. Specifically, the following are supported from Section 7.1.7.

Speech Web Search JS API Only (except for non-essential aspects: serviceURI and speedVsAccuracy)

Web search by voice, with auto-submit

Web search by voice, with "Did you say..."

Speech translator

Speech shell

Turn-by-turn navigation

Domain Specific Grammars Contingent on Earlier Inputs

Speech Enabled Email Client (except for non-essential aspects: serviceURI and speedVsAccuracy)

Simple Multimodal Example JS API Only

Speech XG Translating Example

Acknowledgments

The members of the HTML Speech Incubator Group, and the corresponding Final Report, created the basis for this proposal.

References

[RFC2119]
Key words for use in RFCs to Indicate Requirement Levels, S. Bradner. IETF, March 1997.

[WEBIDL]
Web IDL, C. McCormack. W3C.

[1]
HTML Speech Incubator Group Final Report, W3C Incubator Group Report, 2011.