
Disfluency has been in the news recently, for two reasons: the deployment of filled pauses in an automated conversation by Google Duplex, and a cross-linguistic study of "slowing down" in speech production before nouns vs. verbs.

Lance Ulanoff, "Did Google Duplex just pass the Turing Test?", Medium 5/8/2018:

I think it was the first “Um.” That was the moment when I realized I was hearing something extraordinary: A computer carrying out a completely natural and very human-sounding conversation with a real person. And it wasn’t just a random talk. […]

Duplex made the call and, when someone at the salon picked up, the voice AI started the conversation with: “Hi, I’m calling to book a woman’s hair cut appointment for a client, um, I’m looking for something on May third?”

Frank Seifart et al., "Nouns slow down speech: evidence from structurally and culturally diverse languages", PNAS 2018:

When we speak, we unconsciously pronounce some words more slowly than others and sometimes pause. Such slowdown effects provide key evidence for human cognitive processes, reflecting increased planning load in speech production. Here, we study naturalistic speech from linguistically and culturally diverse populations from around the world. We show a robust tendency for slower speech before nouns as compared with verbs. Even though verbs may be more complex than nouns, nouns thus appear to require more planning, probably due to the new information they usually represent. This finding points to strong universals in how humans process language and manage referential information when communicating linguistically.

For a more authoritative account of the Google Duplex service, see Yaniv Leviathan [yes, really] and Yossi Matias, "Google Duplex: An AI System for Accomplishing Real-World Tasks Over the Phone", Google AI Blog 5/8/2018. And the University of Zurich press release for Seifart et al. is "Nouns slow down our speech", 5/14/2018:

Speakers hesitate or make brief pauses filled with sounds like "uh" or "uhm" mostly before nouns. Such slow-down effects are far less frequent before verbs, as UZH researchers working together with an international team have now discovered by looking at examples from different languages.

When we speak, we unconsciously pronounce some words more slowly than others, and sometimes we make brief pauses or throw in meaningless sounds like "um." Such slow-down effects provide key evidence on how our brains process language. They point to difficulties when planning the utterance of a specific word.

A small sample of the buzz: "Google’s AI sounds like a human on the phone — should we be worried?"; "Service Workers Forced to Act Like Robots Meet Their Match — Surprise! It’s a robot that pretends to be human, courtesy of Google"; "Hello, Google Duplex? No Artificially Intelligent Calls, Please"; "What If A Robot Wrote This Article?".

I don't have time for much this morning, but I'd like to make a few quick points.

First, the only examples we have so far of Google Duplex conversations are supplied by the authors, who tell us that "This summer, we’ll start testing the Duplex technology within the Google Assistant". The examples we've seen have no doubt been selected to show the system off at its best, and should not be trusted to present typical examples, much less problematic ones. I say this as someone who worked in industry on text-to-speech synthesis: I've been there.

Modern speech synthesis is generally excellent, but conversational interaction remains a serious challenge. Google Duplex will initially be limited to specific sorts of interactions, "to help users make restaurant reservations, schedule hair salon appointments, and get holiday hours over the phone". But even so, I guarantee that once the public has access to the service, we'll hear some less impressive (and therefore perhaps less concerning) examples.

Second, the work on speech rate and silent or filled pauses is less novel than the coverage suggests. The idea that pauses (filled or not) reflect uncertainty about following material is obvious and well documented. A couple of research reports from 20-30 years ago, among hundreds:

Stanley Schachter et al., "Speech Disfluency and the Structure of Knowledge", Journal of Personality and Social Psychology 1991:

It is generally accepted that filled pauses (“uh,” “er,” and “um”) indicate time out while the speaker searches for the next word or phrase. It is hypothesized that the more options, the more likely that a speaker will say “uh.” The academic disciplines differ in the extent to which their subject matter and mode of thought require a speaker to choose among options. The more formal, structured, and factual the discipline, the fewer the options. It follows that lecturers in the humanities should use more filled pauses during lectures than social scientists and that natural scientists should use fewest of all. Observations of lecturers in 10 academic disciplines indicate that this is the case. That this is due to subject matter rather than to self-selection into disciplines is suggested by observations of this same set of lecturers all speaking on a common subject. In this circumstance, the academic disciplines are identical in the number of filled pauses used.

Elizabeth Shriberg and Andreas Stolcke, "Word predictability after hesitations: A corpus-based study", ICSLP 1996:

We ask whether lexical hesitations in spontaneous speech tend to precede words that are difficult to predict. We define predictability in terms of both transition probability and entropy, in the context of an N-gram language model. Results show that transition probability is significantly lower at hesitation transitions, and that this is attributable to both the following word and the word history. In addition, results suggest that fluent transitions in sentences with a hesitation elsewhere are significantly more likely than transitions in fluent sentences to contain out-of-vocabulary words and novel word combinations. Such findings could be used to improve statistical language modeling for spontaneous-speech applications.

So I have a question: Nouns in context are on average less predictable than verbs — is there anything left for a part-of-speech variable to explain once conditional entropy has been factored in? I don't think that anyone has tested this, but it would be fairly easy to check.
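The check I have in mind could be sketched along these lines (all data here are synthetic, and the variable names are illustrative, not from any of the studies cited above): fit a regression of pre-word slowdown on both a predictability measure, such as surprisal from a language model, and a part-of-speech indicator, and see whether the part-of-speech coefficient survives.

```python
# Illustrative sketch only: does a noun/verb indicator explain pre-word
# slowdown once predictability (surprisal) is controlled for?
# All data are simulated; a real test would use corpus speech-rate
# measurements and surprisal estimates from an n-gram or neural LM.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical per-token variables:
is_noun = rng.integers(0, 2, n)          # 1 = noun, 0 = verb
# Suppose nouns tend to be less predictable (higher surprisal, in bits):
surprisal = 6.0 + 2.0 * is_noun + rng.normal(0.0, 1.5, n)
# Simulate slowdown driven *only* by surprisal, not by POS directly:
slowdown = 10.0 * surprisal + rng.normal(0.0, 5.0, n)

# Joint regression: slowdown ~ intercept + surprisal + is_noun
X = np.column_stack([np.ones(n), surprisal, is_noun])
beta, *_ = np.linalg.lstsq(X, slowdown, rcond=None)
print("intercept, surprisal, noun coefficients:", np.round(beta, 2))
# Under this simulation the noun coefficient should be near zero:
# once surprisal is in the model, POS has nothing left to explain.
```

If, on real data, the noun coefficient stayed reliably above zero after conditioning on predictability, that would be evidence for a genuine part-of-speech effect rather than a predictability effect in disguise.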

As for the deployment of disfluencies in artificial conversational interaction, there's been lots of research implying that they belong there, e.g. Brennan & Williams, "The Feeling of Another's Knowing: Prosody and Filled Pauses as Cues to Listeners about the Metacognitive States of Speakers", Journal of Memory and Language 1995; Swerts et al., "Filled pauses as markers of discourse structure", 1996.

With respect to um and uh in particular, here's a useful quote from Herb Clark and Jean Fox Tree, "Using uh and um in spontaneous speaking", Cognition 2002:

Uh and um have long been called filled pauses in contrast to silent pauses (see Goldman-Eisler, 1968; Maclay & Osgood, 1959). The unstated assumption is that they are pauses (not words) that are filled with sound (not silence). Yet it has long been recognized that uh and um are not on a par with silent pauses. In one view, they are symptoms of certain problems in speaking. In a second view, they are non-linguistic signals for dealing with certain problems in speaking. And in a third view, they are linguistic signals – in particular, words of English. If uh and um are words, as we will argue, it is misleading to call them filled pauses. To be neutral and yet retain a bit of their history, we will call them fillers.

And finally, the following article on the Seifart publication ends with an annoying falsehood — Alan Burdick, "Why Nouns Slow Us Down, and Why Linguistics Might Be in a Bubble", The New Yorker 5/15/2018:

In recent years, scientists have grown concerned that much of the literature on human psychology and behavior is derived from studies carried out in Western, educated, industrialized, rich, democratic countries. These results aren’t necessarily indicative of how humans as a whole actually function. Linguistics may face a similar challenge—the science is in a bubble, talking to itself. “This is what makes people like me realize the unique value of small, often endangered languages and documenting them for as long as they can still be observed,” Seifart said. “In a few generations, they will not be spoken anymore.” In the years to come, as society grows more complex, the number of nouns available to us may grow exponentially. The diversity of its speakers, not so much.

"Linguistics … is in a bubble, talking to itself", limited to "studies carried out in Western, educated, industrialized, rich, democratic countries"? What ignorant nonsense.

From missionaries like Antonio Ruiz de Montoya in the 17th century, through scholars like Sir William Jones in the 18th century and Wilhelm von Humboldt in the early 19th century, the innumerable philologists of the later 19th century, and anthropological linguists like Franz Boas and Edward Sapir in the early 20th century, to most of my colleagues in the field today, linguistics as a field has always devoted a large fraction of its efforts to documenting and understanding languages, endangered and otherwise, that are not "Western, educated, industrialized, rich".

Frank Seifart certainly knows this. So I'm going to pin this one squarely on Alan Burdick, since I'm all too familiar with the way that journalists behave when they have a (true or false) generalization in their sights. For an example documented by a journalist who was mistreated in this standard way by a fellow practitioner, see "Down with journalists!", 6/27/2005.
