In 2013, Robert Galbraith — an aspiring author — finished his first novel, The Cuckoo’s Calling. It had all the trappings of a great story: a suspicious death, a private investigator haunted by his past, intrigue, pace, and misdirection. There was only one problem: literary agents, the gatekeepers of the publishing industry, kept rejecting the book — often without even looking at it.

Galbraith eventually opted to publish The Cuckoo’s Calling through an acquaintance of sorts. Interestingly, on the day that Galbraith was discovered to be none other than J.K. Rowling — author of the Harry Potter series — sales of The Cuckoo’s Calling increased by over 150,000 percent. In other words, Galbraith had chops — but the publishing industry failed to see it.

As an aspiring writer myself, I find Galbraith’s tale all too familiar. And it illustrates a major pain point in the life cycle of stories: literary publishing is a $20 billion industry…with a bottleneck problem. Much like hiring managers, literary agents can receive thousands of pitches every year, and there is no easy way to triage them. This, in turn, means that writers often send their materials to every agent they can find. The end result is a nasty and increasingly impersonal feedback loop that builds unnecessary costs and delays into publishing.

To break this bottleneck, I’ve used my time as an Insight Data Science Fellow to build the AIgent, a web-based neural network that connects writers to representation. Using nothing more than a book’s synopsis, the AIgent can surface similar books, genre tags, and sales proxies. This will allow agents to filter their inboxes to better focus on the pitches that best align with their portfolios. Likewise, it will help writers identify agents who are a good fit. Ultimately this will mean a more streamlined, inclusive, and personal publishing industry.

The AIgent was built with BERT, Google’s state-of-the-art language model. In this article, I will discuss the construction of the AIgent, from data collection to model assembly. I will also cover simple extensions of the AIgent, its cross-media potential, its power as an unbiased, content-based recommender system, and its capacity to increase fairness in content acquisition. Lastly, I will discuss my own experience as an AIgent beta-tester. On the day I finished my last novel, I did not begin emailing agents in search of representation. Instead, I built the AIgent.

Building the AIgent

The AIgent construction pipeline.

Data Collection

The AIgent leverages book synopses and book metadata. The former is a short block of text, generally between 50 and 400 words in length. The latter is any type of external data attached to a book — for example, genre tags, ratings, and sales.

To my knowledge, the most extensive repository of synopses and metadata is Goodreads. For my purposes, the most interesting data on Goodreads comes in the form of genre/content tags. A given book may be associated with hundreds of different tags, ranging from broad genre labels, like ‘Mystery’ or ‘Historical’, to specific content tags like ‘Werewolves’ and ‘Vampires’.

To collect these genre tags and other metadata, I took advantage of the well-documented Goodreads API. Unfortunately, that API does not permit collection of synopses. To get around this, an enterprising and motivated individual might use Scrapy and Beautiful Soup to scrape synopses. To build the AIgent, I started with synopses and metadata from 100,000 books.
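The Goodreads API returns XML payloads, and a book’s genre/content tags live in its popular-shelves listing. As a minimal sketch, the snippet below parses tags out of a made-up, simplified stand-in for such a response (real payloads are much larger, and the exact structure here is an assumption, not a verbatim API document):

```python
import xml.etree.ElementTree as ET

# A made-up, simplified stand-in for a Goodreads book payload;
# real responses are larger, but tags arrive as <shelf> elements.
SAMPLE_RESPONSE = """
<GoodreadsResponse>
  <book>
    <title>Example Novel</title>
    <popular_shelves>
      <shelf name="mystery" count="812"/>
      <shelf name="historical" count="305"/>
      <shelf name="werewolves" count="41"/>
    </popular_shelves>
  </book>
</GoodreadsResponse>
"""

def extract_shelves(xml_text):
    """Return (tag, count) pairs from a book XML payload."""
    root = ET.fromstring(xml_text)
    shelves = root.findall(".//popular_shelves/shelf")
    return [(s.get("name"), int(s.get("count"))) for s in shelves]

tags = extract_shelves(SAMPLE_RESPONSE)
print(tags)  # [('mystery', 812), ('historical', 305), ('werewolves', 41)]
```

Keeping the shelf counts alongside the names is useful later: they give a rough signal of how strongly readers associate a book with each tag.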

Features: DistilBERT Text Embeddings

Once I had a raw dataset, I could begin engineering features and building a natural language processing (NLP) model. Ultimately, I wanted this model to do two things:

1. Build a rich representation of a text synopsis for external comparisons
2. Mathematically describe the relationship between a book’s synopsis and its metadata

To get there, I needed to do a bit of pre-processing on my synopses (i.e. features) and metadata (i.e. labels).

The most powerful approach for the first task is to use a ‘language model’ (LM), i.e. a statistical model of natural language. These models use a variety of approaches to transform text into numerical representations, which can then be used for downstream tasks such as classification and semantic parsing.
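Once a synopsis has been mapped to a vector, comparing two books reduces to geometry: nearby vectors mean similar content. A toy sketch with hand-made three-dimensional vectors (real embeddings have hundreds of dimensions, and these numbers are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" standing in for LM output.
noir_mystery = [0.9, 0.1, 0.2]
cozy_mystery = [0.8, 0.3, 0.1]
space_opera  = [0.1, 0.9, 0.7]

print(cosine_similarity(noir_mystery, cozy_mystery))  # high, ~0.97
print(cosine_similarity(noir_mystery, space_opera))   # much lower, ~0.30
```

This is exactly the kind of downstream comparison that lets a tool surface “similar books” from synopses alone.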

The last five years have seen an incredible increase in the power of LMs. As in the world of image classification, some of the greatest advances in LMs have come with transfer learning, and with unsupervised neural nets that build ‘general’ models of language. This class of model includes OpenAI’s generative text model GPT-2, which produces eerily human-like text at the sentence to short paragraph level.

More relevant to the AIgent is Google’s BERT model, a task-agnostic (i.e. general-purpose) LM that has thus far been extended to over 100 languages and achieves state-of-the-art results on a long list of language tasks, including sequence classification. If you have not used BERT before, this Colab notebook is a great place to get started. I tested several different flavors of BERT for use as synopsis classifiers before settling on the DistilBERT model from Hugging Face. It’s much faster than the full BERT model without sacrificing much in the way of performance.