T. V. Raman

A desktop is a workspace that one uses to organize the tools of one's trade. Graphical desktops provide rich visual interaction for performing day-to-day computing tasks; the goal of the audio desktop is to enable similar efficiencies in an eyes-free environment. Thus, the primary goal of an audio desktop is to use the expressiveness of auditory output (both verbal and nonverbal) to enable the end user to perform a full range of computing tasks:

Ready access to local documents on the client and global documents on the Web

The Emacspeak audio desktop was motivated by the following insight: to provide effective auditory renderings of information, one needs to start from the actual information being presented, rather than a visual presentation of that information. This had earlier led me to develop AsTeR, Audio System For Technical Readings (http://emacspeak.sf.net/raman/aster/aster-toplevel.html). The primary motivation then was to apply the lessons learned in the context of aural documents to user interfaces—after all, the document is the interface.

The primary goal was not to merely carry the visual interface over to the auditory modality, but rather to create an eyes-free user interface that is both pleasant and productive to use.

Contrast this with the traditional screen-reader approach where GUI widgets such as sliders and tree controls are directly translated to spoken output. Though such direct translation can give the appearance of providing full eyes-free access, the resulting auditory user interface can be inefficient to use.

These prerequisites meant that the environment selected for the audio desktop needed:

I started implementing Emacspeak in October 1994. The target environments were a Linux laptop and my office workstation. To produce speech output, I used a DECTalk Express (a hardware speech synthesizer) on the laptop and a software version of the DECTalk on the office workstation.

The most natural way to design the system to leverage both speech options was to first implement a speech server that abstracted away the distinction between the two output solutions. The speech server abstraction has withstood the test of time well; I was able to add support for the IBM ViaVoice engine later, in 1999. Moreover, the simplicity of the client/server API has enabled open source programmers to implement speech servers for other speech engines.

Emacspeak speech servers are implemented in the TCL language. The speech server for the DECTalk Express communicated with the hardware synthesizer over a serial line. As an example, the command to speak a string of text was a proc that took a string argument and wrote it to the serial device. A simplified version of this looks like:
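A minimal sketch of such a proc, assuming the open serial-device channel is held in a global tts(write) variable (the variable name is an assumption for illustration, not a quote of the Emacspeak source):

    proc tts_say {text} {
        global tts
        # Write the text straight to the serial device opened for the DECTalk Express.
        puts -nonewline $tts(write) "$text"
    }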

The speech server for the software DECTalk implemented an equivalent, simplified tts_say version that looks like:
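A correspondingly hedged sketch, in which the work is delegated to a command exposed by the engine's C library rather than a write to a device:

    proc tts_say {text} {
        # Hand the text to the in-process DECTalk engine via its C binding.
        _say "$text"
    }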

where _say calls the underlying C implementation provided by the DECTalk software.

The net result of this design was to create separate speech servers for each available engine, where each speech server was a simple script that invoked TCL's default read-eval-print loop after loading in the relevant definitions. The client/server API therefore came down to the client (Emacspeak) launching the appropriate speech server, caching this connection, and invoking server commands by issuing appropriate procedure calls over this connection.
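On the Emacs side, this amounts to little more than starting the server as a subprocess and writing procedure calls to its standard input. A minimal sketch, using hypothetical names (my-tts-process, my-tts-send) and an assumed server path rather than the actual Emacspeak identifiers:

    ;; Launch the TCL speech server as a subprocess and cache the connection.
    (defvar my-tts-process
      (start-process "speech-server" nil "tclsh" "/path/to/speech-server.tcl")
      "Cached handle to the running speech server.")

    ;; Invoke a server command by writing a TCL procedure call to its stdin.
    (defun my-tts-send (command)
      "Send COMMAND, a TCL procedure call, to the speech server."
      (process-send-string my-tts-process (concat command "\n")))

    ;; Example: (my-tts-send "tts_say {Hello world}")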

Notice that so far I have said nothing explicit about how this client/server connection was opened; this late binding proved beneficial later when it came to making Emacspeak network-aware. Thus, the initial implementation worked by the Emacspeak client communicating to the speech server using stdio. Later, making this client/server communication go over the network required the addition of a few lines of code that opened a server socket and connected stdin/stdout to the resulting connection.
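A hedged sketch of what those few lines might look like on the TCL side; the dispatcher, the proc names, and the port number are all illustrative assumptions rather than the Emacspeak code:

    # Minimal stand-in for the command dispatcher: read one line from the
    # socket and evaluate it, just as the stdio loop evaluated lines from stdin.
    proc handle-command {sock} {
        if {[gets $sock line] < 0} {
            close $sock
            return
        }
        eval $line
    }

    # Accept a network client and serve commands from the socket instead of
    # stdin; the port number (2222) is purely illustrative.
    proc accept-connection {sock addr port} {
        fconfigure $sock -buffering line
        fileevent $sock readable [list handle-command $sock]
    }

    socket -server accept-connection 2222
    vwait forever   ;# enter the TCL event loop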

Thus, designing a clean client/server abstraction, and relying on the power of Unix I/O, has made it trivial to later run Emacspeak on a remote machine and have it connect back to a speech server running on a local client. This enables me to run Emacspeak inside screen on my work machine, and access this running session from anywhere in the world. Upon connecting, I have the remote Emacspeak session connect to a speech server on my laptop, the audio equivalent of setting up X to use a remote display.

31.2. Speech-Enabling Emacs

The simplicity of the speech server abstraction described above meant that version 0 of the speech server was running within an hour after I started implementing the system. This meant that I could then move on to the more interesting part of the project: producing good quality spoken output. Version 0 of the speech server was by no means perfect; it was improved as I built the Emacspeak speech client.

A Simple First-Cut Implementation

A friend of mine had pointed me at the marvels of Emacs Lisp advice a few weeks earlier. So when I sat down to speech-enable Emacs, advice was the natural choice. The first task was to have Emacs automatically speak the line under the cursor whenever the user pressed the up/down arrow keys.

In Emacs, all user actions invoke appropriate Emacs Lisp functions. In standard editing modes, pressing the down arrow invokes the function next-line, while pressing the up arrow invokes previous-line. To speech-enable these commands, version 0 of Emacspeak implemented the following rather simple advice fragment:

    (defadvice next-line (after emacspeak)
      "Speak line after moving."
      (when (interactive-p) (emacspeak-speak-line)))

The emacspeak-speak-line function implemented the necessary logic to grab the text of the line under the cursor and send it to the speech server. With the previous definition in place, Emacspeak 0.0 was up and running; it provided the scaffolding for building the actual system.
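The up-arrow case is handled analogously; a sketch of the corresponding fragment (Emacspeak's actual definition may differ in detail) is:

    (defadvice previous-line (after emacspeak)
      "Speak line after moving."
      (when (interactive-p) (emacspeak-speak-line)))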

Iterating on the First-Cut Implementation

The next iteration returned to the speech server to enhance it with a well-defined eventing loop. Rather than simply executing each speech command as it was received, the speech server queued client requests and provided a launch command that caused the server to execute queued requests. The server used the select system call to check for newly arrived commands after sending each clause to the speech engine. This enabled immediate silencing of speech; with the somewhat naïve implementation described in version 0 of the speech server, the command to stop speech would not take immediate effect, since the speech server would first process previously issued speak commands to completion.

With the speech queue in place, the client application could now queue up arbitrary amounts of text and still get a high degree of responsiveness when issuing higher-priority commands such as requests to stop speech. Implementing an event queue inside the speech server also gave the client application finer control over how text was split into chunks before synthesis. This turns out to be crucial for producing good intonation structure. The rules by which text should be split up into clauses vary depending on the nature of the text being spoken. As an example, newline characters in programming languages such as Python are statement delimiters and determine clause boundaries, but newlines do not constitute clause delimiters in English text. For instance, a clause boundary is inserted after each line when speaking the following Python code:

    i=1
    j=2

See the section "Augmenting Emacs to create aural display lists," later in this chapter, for details on how Python code is distinguished and its semantics are transferred to the speech layer.

With the speech server now capable of smart text handling, the Emacspeak client could become more sophisticated with respect to its handling of text. The emacspeak-speak-line function turned into a library of speech-generation functions that implemented the following steps:

Parse text to split it into a sequence of clauses.

Preprocess text—e.g., handle repeated strings of punctuation marks.

Carry out a number of other functions that got added over time.

Queue each clause to the speech server, and issue the launch command.

From here on, the rest of Emacspeak was implemented using Emacspeak as the development environment. This has been significant in how the code base has evolved. New features get tested immediately, and badly implemented features can render the entire system unusable. Lisp's incremental code development fits naturally with the former; to cover the latter, the Emacspeak code base has evolved to be "bushy"—i.e., most parts of the higher-level system are mutually independent and depend on a small core that is carefully maintained.

A Brief advice Tutorial

Lisp advice is key to the Emacspeak implementation, and this chapter would not be complete without a brief overview. The advice facility allows one to modify existing functions without changing the original implementation. What's more, once a function f has been modified by advice m, all calls to function f are affected by the advice.

advice comes in three flavors:

before
The advice body is run before the original function is invoked.

after
The advice body is run after the original function has completed.

around
The advice body is run instead of the original function. The around advice can call the original function if desired.

All advice forms get access to the arguments of the adviced function; in addition, around and after get access to the return value computed by the original function. The Lisp implementation achieves this magic by:

Caching the original implementation of the function

Evaluating the advice form to generate a new function definition

Storing this definition as the adviced function

Thus, when the advice fragment shown in the earlier section "A Simple First-Cut Implementation" is evaluated, Emacs' original next-line function is replaced by a modified version that speaks the current line after the original next-line function has completed its work.
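The around flavor is the one that is hardest to visualize from the description alone. Here is a small illustrative fragment that is not part of Emacspeak: it wraps a hypothetical command my-command so that the original definition still runs (via ad-do-it) while the advice measures how long it took:

    (defadvice my-command (around timing pre act)
      "Run the original my-command and report how long it took."
      (let ((start (current-time)))
        ad-do-it                          ; run the original definition in place
        (message "my-command took %.3f seconds"
                 (float-time (time-subtract (current-time) start)))))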

Generating Rich Auditory Output

At this point in its evolution, here is what the overall design looked like:

Emacs' interactive commands are speech-enabled or adviced to produce auditory output.

advice definitions are collected into modules, one each for every Emacs application being speech-enabled.

The advice forms forward text to core speech functions. These functions extract the text to be spoken and forward it to the tts-speak function.

The tts-speak function produces auditory output by preprocessing its text argument and sending it to the speech server.

The speech server handles queued requests to produce perceptible output.

Text is preprocessed by placing the text in a special scratch buffer. Buffers acquire specialized behavior via buffer-specific syntax tables that define the grammar of buffer contents and buffer-local variables that affect behavior. When text is handed off to the Emacspeak core, all of these buffer-specific settings are propagated to the special scratch buffer where the text is preprocessed. This automatically ensures that text is meaningfully parsed into clauses based on its underlying grammar.

Audio formatting using voice-lock

Emacs uses font-lock to syntactically color text. For creating the visual presentation, Emacs adds a text property called face to text strings; the value of this face property specifies the font, color, and style to be used to display that text. Text strings with face properties can be thought of as a conceptual visual display list. Emacspeak augments these visual display lists with personality text properties whose values specify the auditory properties to use when rendering a given piece of text; this is called voice-lock in Emacspeak. The value of the personality property is an Aural CSS (ACSS) setting that encodes various voice properties—e.g., the pitch of the speaking voice. Notice that such ACSS settings are not specific to any given TTS engine. Emacspeak implements ACSS-to-TTS mappings in engine-specific modules that take care of mapping high-level aural properties—e.g., pitch or pitch-range—to engine-specific control codes. The next few sections describe how Emacspeak augments Emacs to create aural display lists and how it processes these aural display lists to produce engine-specific output.

Augmenting Emacs to create aural display lists

Emacs modules that implement font-lock call the Emacs built-in function put-text-property to attach the relevant face property. Emacspeak defines an advice fragment that advices the put-text-property function to add in the corresponding personality property when it is asked to add a face property. Note that the value of both display properties (face and personality) can be lists; values of these properties are thus designed to cascade to create the final (visual or auditory) presentation. This also means that different parts of an application can progressively add display property values.

The put-text-property function has the following signature:

    (put-text-property START END PROPERTY VALUE &optional OBJECT)

The advice implementation is:

    (defadvice put-text-property (after emacspeak-personality pre act)
      "Used by emacspeak to augment font lock."
      (let ((start (ad-get-arg 0)) ;; Bind arguments
            (end (ad-get-arg 1))
            (prop (ad-get-arg 2)) ;; name of property being added
            (value (ad-get-arg 3))
            (object (ad-get-arg 4))
            (voice nil)) ;; voice it maps to
        (when (and (eq prop 'face) ;; avoid infinite recursion
                   (not (= start end)) ;; non-nil text range
                   emacspeak-personality-voiceify-faces)
          (condition-case nil ;; safely look up face mapping
              (progn
                (cond
                 ((symbolp value)
                  (setq voice (voice-setup-get-voice-for-face value)))
                 ((ems-plain-cons-p value)) ;; pass on plain cons
                 ((listp value)
                  (setq voice
                        (delq nil
                              (mapcar #'voice-setup-get-voice-for-face value))))
                 (t (message "Got %s" value)))
                (when voice ;; voice holds list of personalities
                  (funcall emacspeak-personality-voiceify-faces
                           start end voice object)))
            (error nil)))))

Here is a brief explanation of this advice definition:

Bind arguments
First, the function uses the advice built-in ad-get-arg to locally bind a set of lexical variables to the arguments being passed to the adviced function.

Personality setter
The mapping of faces to personalities is controlled by the user-customizable variable emacspeak-personality-voiceify-faces. If non-nil, this variable specifies a function with the following signature:

    (emacspeak-personality-put START END PERSONALITY OBJECT)

Emacspeak provides different implementations of this function that either append or prepend the new personality value to any existing personality properties.

Guard
Along with checking for a non-nil emacspeak-personality-voiceify-faces, the function performs additional checks to determine whether this advice definition should do anything. The function continues to act if:

The text range is non-nil.

The property being added is a face.

The first of these checks is required to avoid edge cases where put-text-property is called with a zero-length text range. The second ensures that we attempt to add the personality property only when the property being added is face. Notice that failure to include this second test would cause infinite recursion, because the eventual put-text-property call that adds the personality property also triggers the advice definition.

Get mapping
Next, the function safely looks up the voice mapping of the face (or faces) being applied. If applying a single face, the function looks up the corresponding personality mapping; if applying a list of faces, it creates a corresponding list of personalities.

Apply personality
Finally, the function checks that it found a valid voice mapping and, if so, calls emacspeak-personality-voiceify-faces with the set of personalities saved in the voice variable.

Audio-formatted output from aural display lists

With the advice definitions from the previous section in place, text fragments that are visually styled acquire a corresponding personality property that holds an ACSS setting for audio formatting the content. The result is to turn text in Emacs into rich aural display lists. This section describes how the output layer of Emacspeak is enhanced to convert these aural display lists into perceptible spoken output.

The Emacspeak tts-speak module handles text preprocessing before finally sending it to the speech server. As described earlier, this preprocessing comprises a number of steps, including:

Applying pronunciation rules

Processing repeated strings of punctuation characters

Splitting text into appropriate clauses based on context

Converting the personality property into audio formatting codes

This section describes the tts-format-text-and-speak function, which handles the conversion of aural display lists into audio-formatted output. First, here is the code for the function tts-format-text-and-speak:

    (defsubst tts-format-text-and-speak (start end)
      "Format and speak text between start and end."
      (when (and emacspeak-use-auditory-icons
                 (get-text-property start 'auditory-icon)) ;; queue icon
        (emacspeak-queue-auditory-icon (get-text-property start 'auditory-icon)))
      (tts-interp-queue (format "%s\n" tts-voice-reset-code))
      (cond
       (voice-lock-mode ;; audio format only if voice-lock-mode is on
        (let ((last nil) ;; initialize
              (personality (get-text-property start 'personality)))
          (while (and (< start end) ;; chunk at personality changes
                      (setq last (next-single-property-change
                                  start 'personality (current-buffer) end)))
            (if personality ;; audio format chunk
                (tts-speak-using-voice personality (buffer-substring start last))
              (tts-interp-queue (buffer-substring start last)))
            (setq start last ;; prepare for next chunk
                  personality (get-text-property last 'personality)))))
       ;; no voice-lock: just send the text
       (t (tts-interp-queue (buffer-substring start end)))))

The tts-format-text-and-speak function is called one clause at a time, with arguments start and end set to the start and end of the clause. If voice-lock-mode is turned on, this function further splits the clause into chunks at each point in the text where there is a change in value of the personality property. Once such a transition point has been determined, tts-format-text-and-speak calls the function tts-speak-using-voice, passing the personality to use and the text to be spoken. This function, described next, looks up the appropriate device-specific codes before dispatching the audio-formatted output to the speech server:

    (defsubst tts-speak-using-voice (voice text)
      "Use voice VOICE to speak text TEXT."
      (unless (or (eq 'inaudible voice) ;; not spoken if voice inaudible
                  (and (listp voice) (member 'inaudible voice)))
        (tts-interp-queue
         (format "%s%s %s\n"
                 (cond
                  ((symbolp voice)
                   (tts-get-voice-command
                    (if (boundp voice) (symbol-value voice) voice)))
                  ((listp voice)
                   (mapconcat
                    #'(lambda (v)
                        (tts-get-voice-command
                         (if (boundp v) (symbol-value v) v)))
                    voice " "))
                  (t ""))
                 text
                 tts-voice-reset-code))))

The tts-speak-using-voice function returns immediately if the specified voice is inaudible. Here, inaudible is a special personality that Emacspeak uses to prevent pieces of text from being spoken. The inaudible personality can be used to advantage when selectively hiding portions of text to produce more succinct output. If the specified voice (or list of voices) is not inaudible, the function looks up the speech codes for the voice and queues the result of wrapping the text to be spoken between voice-code and tts-reset-code to the speech server.

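As a small usage sketch (assuming the definitions above are loaded and a speech server is running), a chunk of audio-formatted output can be produced directly:

    ;; Illustrative only: speak one chunk with the voice-bolden personality
    ;; (a predefined overlay described in the next section), then queue a
    ;; plain chunk that is spoken in the default voice.
    (tts-speak-using-voice 'voice-bolden "this phrase is emphasized")
    (tts-interp-queue "and this phrase uses the default voice")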
Using Aural CSS (ACSS) for Styling Speech Output

I first formalized audio formatting within AsTeR, where rendering rules were written in a specialized language called Audio Formatting Language (AFL). AFL structured the available parameters in auditory space—e.g., the pitch of the speaking voice—into a multidimensional space, and encapsulated the state of the rendering engine as a point in this multidimensional space. AFL provided a block-structured language that encapsulated the current rendering state by a lexically scoped variable, and provided operators to move within this structured space.

When these notions were later mapped to the declarative world of HTML and CSS, dimensions making up the AFL rendering state became Aural CSS parameters, provided as accessibility measures in CSS2 (http://www.w3.org/Press/1998/CSS2-REC). Though designed for styling HTML (and, in general, XML) markup trees, Aural CSS turned out to be a good abstraction for building Emacspeak's audio formatting layer while keeping the implementation independent of any given TTS engine. Here is the definition of the data structure that encapsulates ACSS settings:

    (defstruct acss
      family gain left-volume right-volume
      average-pitch pitch-range stress richness punctuations)

Emacspeak provides a collection of predefined voice overlays for use within speech extensions. Voice overlays are designed to cascade in the spirit of Aural CSS. As an example, here is the ACSS setting that corresponds to voice-monotone:

    [cl-struct-acss nil nil nil nil nil 0 0 nil all]

Notice that most fields of this acss structure are nil—that is, unset. The setting creates a voice overlay that:

Sets pitch to 0 to create a flat voice.

Sets pitch-range to 0 to create a monotone voice with no inflection.

Sets punctuations to all so that all punctuation marks are spoken.

This setting is used as the value of the personality property for audio formatting comments in all programming language modes. Because its value is an overlay, it can interact effectively with other aural display properties. As an example, if portions of a comment are displayed in a bold font, those portions can have the voice-bolden personality (another predefined overlay) added; this results in setting the personality property to a list of two values: (voice-bolden voice-monotone). The final effect is for the text to get spoken with a distinctive voice that conveys both aspects of the text: namely, a sequence of words that are emphasized within a comment.
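Because acss is defined with defstruct, Lisp generates a make-acss constructor that accepts keyword arguments; here is an illustrative way (not how Emacspeak's predefined overlays are actually declared) to build a comparable flat overlay:

    ;; Construct an ACSS overlay that zeroes pitch-range and speaks all
    ;; punctuation; any field left out of the call stays nil, i.e., unset.
    (make-acss :pitch-range 0 :punctuations 'all)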

Adding Auditory Icons

Rich visual user interfaces contain both text and icons. Similarly, once Emacspeak had the ability to speak intelligently, the next step was to increase the bandwidth of aural communication by augmenting the output with auditory icons. Auditory icons in Emacspeak are short sound snippets (no more than two seconds in duration) and are used to indicate frequently occurring events in the user interface. As an example, every time the user saves a file, the system plays a confirmatory sound. Similarly, opening or closing an object (anything from a file to a web site) produces a corresponding auditory icon. The set of auditory icons was arrived at iteratively and covers common events such as objects being opened, closed, or deleted.

This section describes how these auditory icons are injected into Emacspeak's output stream. Auditory icons are produced by the following user interactions:

To cue explicit user actions

To add additional cues to spoken output

Auditory icons that confirm user actions—e.g., a file being saved successfully—are produced by adding an after advice to the various Emacs built-ins. To provide a consistent sound and feel across the Emacspeak desktop, such extensions are attached to code that is called from many places in Emacs. Here is an example of such an extension, implemented via an advice fragment:

    (defadvice save-buffer (after emacspeak pre act)
      "Produce an auditory icon if possible."
      (when (interactive-p)
        (emacspeak-auditory-icon 'save-object)
        (or emacspeak-last-message
            (message "Wrote %s" (buffer-file-name)))))

Extensions can also be implemented via an Emacs-provided hook. As explained in the brief advice tutorial given earlier, advice allows the behavior of existing software to be extended or modified without having to modify the underlying source code. Emacs is itself an extensible system, and well-written Lisp code has a tradition of providing appropriate extension hooks for common use cases. As an example, Emacspeak attaches auditory feedback to Emacs' default prompting mechanism (the Emacs minibuffer) by adding the function emacspeak-minibuffer-setup-hook to Emacs' minibuffer-setup-hook:

    (defun emacspeak-minibuffer-setup-hook ()
      "Actions to take when entering the minibuffer."
      (let ((inhibit-field-text-motion t))
        (when emacspeak-minibuffer-enter-auditory-icon
          (emacspeak-auditory-icon 'open-object))
        (tts-with-punctuations 'all (emacspeak-speak-buffer))))

    (add-hook 'minibuffer-setup-hook 'emacspeak-minibuffer-setup-hook)

This is a good example of using built-in extensibility where available. However, Emacspeak uses advice in a lot of cases because the Emacspeak requirement of adding auditory feedback to all of Emacs was not originally envisioned when Emacs was implemented. Thus, the Emacspeak implementation demonstrates a powerful technique for discovering extension points. Lack of an advice-like feature in a programming language often makes experimentation difficult, especially when it comes to discovering useful extension points. This is because software engineers are faced with the following trade-off:

Make the system arbitrarily extensible (and arbitrarily complex)

Guess at some reasonable extension points and hardcode these

Once extension points are implemented, experimenting with new ones requires rewriting existing code, and the resulting inertia often means that over time, such extension points remain mostly undiscovered. Lisp advice, and its Java counterpart Aspects, offer software engineers the opportunity to experiment without worrying about adversely affecting an existing body of source code.

Producing Auditory Icons While Speaking Content

In addition to using auditory icons to cue the results of user interaction, Emacspeak uses auditory icons to augment what is being spoken. Examples of such auditory icons include:

A short icon at the beginning of paragraphs

The auditory icon mark-object when moving across source lines that have a breakpoint set on them

Auditory icons are implemented by attaching the text property emacspeak-auditory-icon, with a value equal to the name of the auditory icon to be played, to the relevant text. As an example, commands to set breakpoints in the Grand Unified Debugger Emacs package (GUD) are adviced to add the property emacspeak-auditory-icon to the line containing the breakpoint. When the user moves across such a line, the function tts-format-text-and-speak queues the auditory icon at the right point in the output stream.
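A hedged sketch of what such an advice fragment might look like for GUD's gud-break command; the details here are illustrative rather than a copy of the Emacspeak source:

    ;; Illustrative only: after setting a breakpoint, mark the current line so
    ;; that speaking it later queues the mark-object auditory icon.
    (defadvice gud-break (after emacspeak pre act)
      "Attach an auditory icon to the line containing the new breakpoint."
      (when (interactive-p)
        (put-text-property (line-beginning-position) (line-end-position)
                           'emacspeak-auditory-icon 'mark-object)))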