The Panic About Kindle’s Text to Speech: Still Silly

This article attempts to explain why my and some other authors’ sanguine attitude toward the new Kindle’s Text-to-Speech capability is misguided (or more, “right response, incorrect reasoning”); in essence the argument is that we’re only looking at how computerized voice reading sounds now, as opposed to how it will sound in the future, when it’ll be easy to instruct computers how to do inflections and all that.

This is a nice try, but, no.

1. First, on a personal, your-mileage-may-vary note, it seems to me that people generally buy the audio version of a book or the text version, rather than both; personally speaking, as a writer I don’t generally expect someone to buy more than one version of my work in any event. So the “Oh noes! Since they have the Kindle version, they won’t buy the audio version!” concern is, shall we say, not high on my list of things to worry about.

2. Has it escaped the general notice of folks that the company that is putting out the Kindle is also the company that owns Audible.com? Yes, Amazon owns both, and I don’t really see the company trying to put one section of itself out of business with the other. Indeed, one of the things I would be very surprised not to see at some point in the near future is Amazon doing a Kindle/Audible bundle: Say, buy the Kindle version of Zoe’s Tale and they’ll throw in the actual audiobook version for $10 or so, which would make the whole package about the same cost as a hardcover. Then if Amazon is actually really smart, they’ll find a way to do audio indexing, so you can highlight a word in the text and have the audiobook version pick up right from there. And so on. This works grandly for me, because then I get twice the royalties.

Yes, other eBook reader makers might also offer text-to-speech capability, and they aren’t Amazon. That said, I imagine if Amazon does this sort of bundling, other eBook sellers will find a way as well, and then the field is leveled again.

3. I understand geeks have unlimited faith in their ability to manipulate technology, but developing a computerized audio voice that actually delivers a performance rather than a recitation is not simply a matter of “how to emphasize certain words and phrases, probably through some kind of XML-based markup standard.” This is a fairly unsophisticated way of looking at how language works, and in particular how it works in fiction, narrative and exposition. We authors are crafty types and we often use language in unexpected ways, and I doubt very seriously you could create software that would accurately discern correct intonation at all times, or even be able to tell when one person was talking in dialogue as opposed to another.

If you tried to build software that could heuristically discern what emphasis to put on what words where and when in all cases, as well as differentiate between characters (and their own ways of inflection, intonation, speaking, etc.), not only would the code base be HUGE, but in point of fact you would have developed some damn impressive AI, and I for one would welcome our new book-reading computer overlords. If this software couldn’t manage this task completely, or did it imperfectly, you’d have an audio version of the Uncanny Valley, in which the “almost but not quite” nature of the audio performance would be self-defeating. I’m not sure there’s an interest in doing this in any event from any of the eBook companies, but particularly from Amazon, which has a direct interest in upselling another, superior audio product.

So that’s dealt with. But what if instead of trying to birth a book-reading AI you instead and somewhat more simply created markup related specifically to a work (say, a markup specifically meant to read Zoe’s Tale)? Well, then what you’ve got there is very definitely a derivative work, and you’ll hear from my lawyers. To my mind there’s a substantial difference between a computer voice reading text which a consumer has already purchased, which to my mind is not a derivative work, and a computer voice reading audio under directions specific to a work, which certainly is. Not to mention that this markup would be created by someone who is a computer programmer, whose skills, while no doubt formidable, are likely not to be consonant with the skills required to give a book an audio performance that sounds authentic.

The author of the article linked to above imagines wikis where people write “inflection scripts” for their favorite works, and while that’s certainly possible, I also suspect the folks who would frequent those wikis are the same sorts who currently frequent warez sites and the like; i.e., people who don’t buy things anyway and are sufficiently geekoidial that they’re happy to load their own scripts rather than have Amazon (or whichever seller) do it for a relatively modest fee. These aren’t most people, nor will they be most people any time soon.

In either case there’s an easier and likely cheaper way to generate an audio file from a book that sounds and “feels” like a human: Give it to an actual human to perform, the performance of which is a derivative work.

In short: I’m not at all convinced that realistic and engaging computerized audio will be possible at any point in the near or even middle future without requiring a clear and obvious derivative work to generate it. When it is possible, I suspect AI will be at a point where it will also be able to generate actual novels, and then, of course, I will retire, to spend my remaining days being pleasured by my sexbots, until they plug me into the mainframe to use my brain cycles for sewage maintenance and I slip comfortably into the hive mind.

Naturally people are free to disagree with me on any of these points; that’s fine. Suffice it to say that, for all the reasons above, I’m not in the least concerned about computerized text readings, in terms of how they affect my career or my rights.