In 2001, I got an itch to write a book. Like many people, I naïvely thought, “I have a book or two in me,” as if writing a book were as easy as putting pen to paper. It turns out to be very time-consuming, and that’s after you’ve spent countless hours learning and researching and organizing your topic of choice. But I marched on and wrote or co-wrote 10 books in a five-year period. I’m a glutton for punishment.

My day job during that time was programming. I’ve been programming for 16 years. My whole career I’ve focused on automating the un-automatable — essentially making computers do things people never thought they could do. By the time I started on my 10th book, I got another kind of itch — I wanted to automate my writing career. I was getting bored with the tedium of writing books, and the money wasn’t that good.

But that’s absurd, right? How can a computer possibly write something coherent and informative, much less entertaining? The “how can a computer possibly do X?” questions are the ones I’ve spent my career trying to answer. So, I set out on a quest to create software that could write. It took more effort than writing 10 books put together, but after building a team of 12 people, we were able to use our software to generate more than 100,000 sports-related stories in a nine-month period.

Before I get into specifics with what our software produces, I think it’s worth highlighting some of the attributes that make software a great candidate to be a writer:

Software doesn’t get writer’s block, and it can work around the clock.

Software can’t unionize or file class-action lawsuits over low pay (something many of the content farms have had to deal with).

Software doesn’t get bored and start wondering how to automate itself.

Software can be reprogrammed, refactored and improved — continuously.

Software can benefit from the input of multiple people. This is unlike traditional writing, which tends to be a solitary event (+1 if you count the editor).

Perhaps most importantly, software can access and analyze significantly more data than a single person (or even a group of people) can on their own.

Software isn’t a panacea, though. Not all content can be easily automated (yet). The type of content my company, Automated Insights, has automated is quantitatively oriented. That’s the trick. We’ve automated content by applying meaning to numbers, to data. Sports was the first category we tackled. Sports by their nature are very data heavy. By our internal estimates, 70% of all sports-related articles analyze numbers in one form or another.

Our technology combines a large database of structured data, a real-time feed of stats, a large database of phrases, and algorithms that tie it all together to produce articles from two to eight paragraphs long. The algorithms look for interesting patterns in the data to determine what to write about.
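To make the pattern-then-phrase idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the actual Automated Insights system: the `Game` record, the thresholds (a 20-point margin for a blowout, 3 points for a close game, 30 points for a big individual night), the phrase templates and the sample teams are all invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Game:
    """Hypothetical structured record for one finished game."""
    home: str
    away: str
    home_score: int
    away_score: int
    top_scorer: str
    top_points: int


# Hypothetical phrase bank: one template per detectable pattern.
PHRASES = {
    "blowout": "{winner} rolled past {loser} {w}-{l}.",
    "close": "{winner} edged {loser} {w}-{l}.",
    "normal": "{winner} beat {loser} {w}-{l}.",
    "big_night": "{player} led the way with {points} points.",
}


def find_patterns(game: Game) -> list[str]:
    """Return the names of the patterns worth writing about, in story order."""
    margin = abs(game.home_score - game.away_score)
    if margin >= 20:          # assumed threshold for a blowout
        patterns = ["blowout"]
    elif margin <= 3:         # assumed threshold for a close game
        patterns = ["close"]
    else:
        patterns = ["normal"]
    if game.top_points >= 30:  # assumed threshold for a standout performance
        patterns.append("big_night")
    return patterns


def write_recap(game: Game) -> str:
    """Turn the detected patterns into sentences using the phrase bank."""
    # Assumes the game did not end in a tie.
    home_won = game.home_score > game.away_score
    facts = {
        "winner": game.home if home_won else game.away,
        "loser": game.away if home_won else game.home,
        "w": max(game.home_score, game.away_score),
        "l": min(game.home_score, game.away_score),
        "player": game.top_scorer,
        "points": game.top_points,
    }
    # str.format ignores the facts a given template does not use.
    return " ".join(PHRASES[p].format(**facts) for p in find_patterns(game))


# Invented sample data, for illustration only.
print(write_recap(Game("Home U", "Away State", 88, 65, "Player A", 31)))
```

A production system would presumably draw on far more patterns and many alternative phrasings per pattern so that stories do not repeat; the sketch keeps one template per pattern so the output stays predictable.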

In November of 2010, we launched the StatSheet Network, a collection of 345 websites (one for every Division-I NCAA Basketball team) that were fully automated. Check out my favorite team: UNC Tar Heels.



Software mines data to construct short game recaps.

We included the typical kind of stats you’d expect on a basketball site, but also embedded visualizations and our fully automated articles. We automated 14 different types of stories, everything from game recaps and previews to players of the week and historical retrospectives. Recently, we launched similar sites for every MLB team (check out the Detroit Tigers site), and soon we are launching sites for every NFL and NCAA Football team.

Sports is only one of many different categories we are working on. We’ve also done work in finance, real estate and a few other data-intensive industries. However, don’t limit your thinking on what’s possible. We get a steady stream of requests from non-obvious industries, such as pharmaceutical clinical trials and even domain name registrars. Any area with large datasets where people are trying to derive meaning from the data is a potential candidate for our technology.

Automation plus human, not automation versus human

Creating software that can write long-form narratives is very difficult, full of all sorts of interesting artificial intelligence, machine learning and natural language problems. But with the right mix of talent (and funding), we’ve been able to do it. It really does take a keen understanding of how software and the written word can work together.

I often hear it suggested that software-generated prose must be very bland and stilted. That’s only the case if the folks behind the software write bland and stilted prose. Software can be just as opinionated as any writer.

A common, and funny, question I get from journalists is: “When will you automate me out of a job?” I find the question humorous because built into it is the assumption that if our software can write the perfect story on a particular topic, then no one else should attempt to write about it. That’s just not going to happen. What’s happening instead is that media companies are using our software to help scale their businesses. Initially, that takes the form of generating stories on topics a media outlet didn’t have the resources to cover. In other cases, it means putting our stories through an editorial process that customizes the content to the specific needs of the publisher. You still need humans for that. There will be less of a need for folks to spend their time writing purely quantitative pieces, but that should be liberating. Now, they can focus on the more qualitative, value-added commentary that humans are inherently good at. Quantitative stories can, and probably should, be mostly automated because computers are better at that.

Software will make hyperlocal content possible and even profitable. Many companies have tried to solve the “hyperlocal problem” with minimal success. It’s just too hard to scale content creation out to every town in the U.S. (or the world, for that matter). For certain categories (e.g. high school sports), software-generated content makes perfect sense. You’ll see automated content play a big role here in the coming years.

Software-generated books?

Because I’ve been so focused on running Automated Insights, I haven’t had time to write any new books recently. I suggested to a colleague that we should turn our software loose and have it write my next book. He looked at me and asked, “How can it possibly do that?” That’s what I like to hear.

But is a software-generated book even feasible? Our software can create eight paragraphs now, but is it possible to create eight chapters’ worth of content? The answer is “yes,” but not quite the same kind of technical books I used to write, at least right now. It would be easy for us to extend our technology to write even longer pieces. That’s not the issue. Our software is good at quantitative analysis using structured data.

The kinds of books I used to write were not based on data; they were qualitative in nature. I pulled from my experience and did supplemental research, made a judgment on the best way to perform a task, then documented it. We are in the early stages of building software that will do more qualitative analysis such as this, but that’s a much harder challenge. The main advantage of software writing today is automating repetitive types of content. This is less applicable to books.

In the near term, the writers at O’Reilly and elsewhere have nothing to worry about. But I wouldn’t count out automation in the long term.

Associated box score photo on home and category pages via Wikipedia.
