A book description is an advertisement. It attempts, within a fleeting window of opportunity, to persuade a reader to buy a book. For the many books that struggle to connect with readers, a compelling sales pitch is vital. Writing isn’t easy, though, and finding the perfect words to connect with an audience is a process fraught with guesswork. The task of choosing the best words for a book description becomes particularly difficult for books marketed to the K-12 demographic, where it is unclear whether a young reader, a parent, or an educator is making the decision to buy a book.

During my Insight Data Science Fellowship, I focused my efforts on using natural language processing (NLP) to build a tool that would alleviate some of that uncertainty for writers. In making VerbiAge, I sought to address the problem of writing descriptions for books targeting age groups between kindergarten and 12th grade (K-12).

The final product

VerbiAge is a web app that helps writers leverage NLP and machine learning when writing book descriptions targeting one of four grade bands: kindergarten through 2nd grade (K-2), 3rd through 5th grade (3–5), middle school (6–8), or high school (9–12). The app makes predictions using a model trained on thousands of book descriptions, giving writers a sense of how their word choices stack up against existing descriptions. Additionally, the app’s user interface provides real-time, side-by-side annotation of the user’s text. The animation below shows how a writer can use this immediate feedback to rapidly iterate on word choice.

Writing a book description in VerbiAge for an illustrated book about a dog for high school age readers.

The data

The underlying model implemented by the app is a text classifier. For this particular use case, finding labels for training and testing turned out to be simple. Book descriptions and lists of recommended reading by grade level are both publicly available from the California Department of Education, which has published lists of recommended books across four grade categories spanning K-12, with roughly 2,500 book descriptions per category.

With roughly the same number of descriptions in each of the four categories, the classes in the data were effectively balanced.

Training a classifier

The workflow I used for training the classifier is typical for a text classification problem. First, I removed stop words and punctuation from the contents of each book description. Then, because my focus was on word choice, I tokenized by unigrams. The book descriptions were subsequently vectorized using tf-idf (term frequency–inverse document frequency) into a 10,247 × 24,002 document-term matrix. In the target vector, I encoded the four categorical labels (K-2, 3–5, 6–8, and 9–12) as integers. Finally, I split the data into an 80% training set and a 20% test set in preparation for training a multinomial logistic regression model.
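A minimal sketch of this preprocessing and vectorization step using scikit-learn and NLTK might look like the following (the file name, column names, cleaning details, and stratified split are assumptions for illustration, not the project’s actual code):

```python
import string

import pandas as pd
from nltk.corpus import stopwords  # may require nltk.download("stopwords")
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical dataset: one book description and one grade band per row.
df = pd.read_csv("book_descriptions.csv")  # columns: "description", "grade_band"

stop_words = set(stopwords.words("english"))

def clean(text):
    # Strip punctuation and drop stop words before unigram tokenization.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in stop_words)

# Vectorize the cleaned descriptions with tf-idf over unigrams.
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(df["description"].apply(clean))

# Encode the four grade bands (K-2, 3-5, 6-8, 9-12) as integers.
y = LabelEncoder().fit_transform(df["grade_band"])

# 80/20 train/test split, stratified to preserve the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```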

At this point, I had one major concern about training a model on this data: overfitting. Overfitting was a concern because my design matrix had more than twice as many features as observations. Too many features can degrade the predictive power of a classifier. Each feature used by a model is a dimension along which an observation is represented, and the density of training data decreases exponentially with higher dimensionality. This sparsity makes it easy to fit a separating hyperplane that does a good job of classifying the training data but fails miserably at classifying the test data. This is often referred to as the curse of dimensionality.

To mitigate the effects of sparsity on the trained model, I generated and inspected two curves that describe a model’s training: the validation curve and the learning curve.

Validation curve. The mean accuracy and 95% confidence interval are plotted from 3-fold cross validation.

The validation curve depicts the bias-variance tradeoff, and I demarcated (with a vertical blue line) where the model’s tuned hyperparameter, C, falls along the curve. C is the inverse of the strength of the L2 regularization penalty, and I used grid search with cross-validation to find its optimal value. To the right of the blue line is the regime of overfitting (lower bias and higher variance) that is addressable by regularization. To the left (higher bias and lower variance), strengthening regularization only serves to worsen the training score without any appreciable improvement to the cross-validation score.
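A sketch of that hyperparameter search and the validation curve, reusing the training split from the earlier snippet (the grid bounds are illustrative; with more than two classes and the default lbfgs solver, scikit-learn fits a multinomial logistic regression):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, validation_curve

C_range = np.logspace(-3, 3, 13)

# Grid search over C, the inverse of the L2 regularization strength.
search = GridSearchCV(
    LogisticRegression(penalty="l2", max_iter=1000),
    param_grid={"C": C_range},
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)
best_C = search.best_params_["C"]

# Training and cross-validation accuracy at each value of C,
# the two traces plotted in the validation curve above.
train_scores, cv_scores = validation_curve(
    LogisticRegression(penalty="l2", max_iter=1000),
    X_train,
    y_train,
    param_name="C",
    param_range=C_range,
    cv=3,
    scoring="accuracy",
)
```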

Learning curve. The mean accuracy and 95% confidence interval are plotted from 3-fold cross validation.

The learning curve shows how the model’s performance changes with respect to the number of observations in the training set. The plateauing of the cross-validation score (green) indicates that adding more observations to the training data does not increase the predictive power of the model.
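The corresponding scikit-learn call, again as a sketch that reuses `best_C` from the grid search above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Mean accuracy as a function of training-set size, with 3-fold CV.
train_sizes, train_scores, cv_scores = learning_curve(
    LogisticRegression(penalty="l2", C=best_C, max_iter=1000),
    X_train,
    y_train,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=3,
    scoring="accuracy",
)
# A plateau in cv_scores.mean(axis=1) suggests that more data alone
# won't improve the model's predictive power.
```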

Model performance and explanation

Having completed these checks, I proceeded to evaluate the model’s performance on the test set. The trained model classified with accuracy = 0.526, precision = 0.506, recall = 0.526, and F1 score = 0.512. Furthermore, inspection of the confusion matrix revealed that incorrect predictions tended to neighbor the true labels. In other words, the classifier was unlikely to grossly misclassify a book description for kindergarteners as being for high schoolers, or vice versa.

Confusion matrix.
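These numbers can be reproduced with scikit-learn’s metrics. A sketch, continuing from the earlier snippets; the use of weighted averaging for the multiclass precision, recall, and F1 scores is an assumption:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Fit the tuned model and evaluate on the held-out 20% test set.
model = LogisticRegression(penalty="l2", C=best_C, max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average="weighted"))
print("recall   :", recall_score(y_test, y_pred, average="weighted"))
print("F1 score :", f1_score(y_test, y_pred, average="weighted"))

# Rows are true labels, columns are predictions; errors clustering near
# the diagonal mean misclassifications tend to neighbor the true label.
print(confusion_matrix(y_test, y_pred))
```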

The most relevant words for each of the four labels also passed a sanity check. For example, “illustrations,” “watercolors,” and “animals” were found to be relevant for K-2, while “novel,” “modern,” and “adult” were found to be more relevant for high schoolers.

Most important words by label. “Importance” refers to the coefficients in the model.
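With a linear model, these per-label word importances fall straight out of the coefficient matrix. A sketch, assuming the fitted vectorizer and model from the earlier snippets:

```python
import numpy as np

feature_names = np.array(vectorizer.get_feature_names_out())
labels = ["K-2", "3-5", "6-8", "9-12"]

# model.coef_ has shape (n_classes, n_features); the largest positive
# coefficients for a class are the words most indicative of that label.
for idx, label in enumerate(labels):
    top = np.argsort(model.coef_[idx])[::-1][:10]
    print(label, feature_names[top].tolist())
```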

VerbiAge, however, needed to do more than predict the target K-12 level of a book description. To be helpful for writers, it also needed to explain the model’s predictions. To extract this explanation, I used the LIME package, which linearly approximates a model’s local behavior in the vicinity of the prediction being explained. I included the coefficients from this explanation in the API endpoint that serves the model’s predictions and plotted them on the app’s frontend using react-vis.
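A sketch of that LIME call, chaining the vectorizer and classifier from the earlier snippets so the explainer can map raw text to class probabilities (the sample description is hypothetical):

```python
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline

# LIME perturbs the input text and fits a local linear approximation
# to the pipeline's predicted probabilities.
pipeline = make_pipeline(vectorizer, model)
explainer = LimeTextExplainer(class_names=["K-2", "3-5", "6-8", "9-12"])

description = "A beautifully illustrated story about a loyal dog and his friends."
explanation = explainer.explain_instance(
    clean(description),
    pipeline.predict_proba,
    num_features=10,
    top_labels=1,
)

# (word, weight) pairs for the predicted class, ready to serialize in
# the API response and plot on the frontend with react-vis.
predicted = explanation.top_labels[0]
print(explanation.as_list(label=predicted))
```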

Conclusion

VerbiAge demonstrates that NLP and machine learning can help writers with specific writing tasks. I was able to use the predictions and explanations provided by the trained model to build a web app that gives real-time feedback during a writing task. The model presented here was trained solely on tf-idf values of unigrams to help writers target book descriptions to K-12 students; many opportunities remain for additional feature engineering, dimensionality reduction, and alternative use cases for the tool.