Ok, you got me: I love music and TV series. What you probably don’t know is that I’m a stubborn reader. I can’t list all of the books I’ve read in my life, but I can answer the question “what is your favourite book?”.

It is One Hundred Years of Solitude by Gabriel García Márquez. One Hundred Years of Solitude is THE book: it contains everything you need, or none of it; it is a fantastic novel; it is practically infinite in its content; it is the definition of time. Moreover, it represents my solitude.

I found a very nice edition some days ago in France. I can’t read it anymore but curiosity always kills the cat, so I was wondering whether it is possible to analyze it by means of some sort of textual analysis.

I did it, and this post is the result. Let’s jump to the Introduction!

Introduction

I analyzed the book in two ways: a textual analysis and a sort of graph theory analysis. The former treats the words of the book as its atomic elements. The latter builds on the former to create a graph (or network) modelling the interactions between the characters in the book.

Before starting, a disclaimer: this post does not claim to be a scientific approach to the problems it touches. It is driven only by a lot of curiosity.

The data

Before starting, we need something. Yeah, you guessed right: we need the book in some textual format. We ask, Google answers: here’s a .txt file containing the English translation of the original book.

The text mining was carried out using Python and the nltk library. Visualizations were made with matplotlib and Tableau. The graph was built using the networkx library and visualized with Gephi.

The main steps of the approach are the following:

reading the book and tokenizing it using Python and nltk; performing the textual analysis with the same tools; building a graph on top of the extracted information about the characters.
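To give an idea of the first step, here is a minimal sketch of a tokenizer. It is an assumption-laden stand-in, not the code from the repository: it uses only the standard library and a regular expression where the post relies on nltk, and the name tokenize is made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Illustrative stand-in for an nltk tokenizer: a token is a run of
    letters (accented ones included), optionally joined by internal
    hyphens or apostrophes, so "forty-seven" stays a single token.
    """
    return re.findall(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*", text.lower())

# A short sample sentence stands in for the whole .txt file.
sample = "Colonel Aureliano Buendía was forty-seven."
tokens = tokenize(sample)
print(tokens)
```

With the real book, the sample string would be replaced by the content of the .txt file read from disk.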

The code is available on Github.

The textual analysis

Let’s start with the textual analysis, that is: take all of the text, throw it at the computer and see if something comes out. It turns out that something interesting really did come out!

Symbols, words, lexical richness

One simple question: how many symbols, total and distinct words does the book contain? Márquez, for his last draft of the book, pressed the keys of his typewriter 809644 times. One Hundred Years of Solitude contains 144739 words and the number of distinct words is 11027. There are more than 11k different words!

We can extract a nice measure called lexical richness, which is the ratio between the number of distinct words and the total number of words. In this case, we get 11027 / 144739 = 0.07618540959934779, which means that the distinct words amount to about 7.6% of all the words in the text!
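The ratio is easy to reproduce. The sketch below assumes the book has already been turned into a list of tokens; the function name lexical_richness is hypothetical and the toy token list is only there to make the example runnable.

```python
def lexical_richness(tokens):
    """Ratio of distinct tokens to total tokens (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Toy example: 4 distinct words out of 6 tokens.
toy_tokens = ["time", "and", "solitude", "and", "time", "again"]
print(lexical_richness(toy_tokens))

# Plugging in the counts reported in the post gives roughly 7.6%.
book_ratio = 11027 / 144739
print(round(book_ratio, 4))
```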

Words dispersion

Márquez entitled his work One Hundred Years of Solitude. That could make you think that the word solitude appears a lot in the text. We can check by looking at the dispersion of the word across the whole text. I’m curious, so I also looked at four other words: time, love, life and death.

The plot shows the dispersion of the words in the text. Intuitively, a blue line marks a point in the book where the word occurs; where there is no line, the word does not occur. From the plot we can see that the old fox Márquez uses solitude only a few times, but makes the reader fall into the concept of time by using that word practically throughout the whole book. Furthermore, we see that love is the main theme from the end of the fifth generation of the Buendías until the end of the book.
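Behind a dispersion plot there is nothing more than the list of positions at which each word occurs. Here is a small sketch of that computation, assuming a list of tokens as input; word_offsets is a hypothetical helper, while the actual drawing can be delegated to nltk, whose Text objects offer a dispersion_plot method.

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs."""
    wanted = {word.lower() for word in targets}
    offsets = {word: [] for word in wanted}
    for position, token in enumerate(tokens):
        token = token.lower()
        if token in wanted:
            offsets[token].append(position)
    return offsets

# Toy token list; with the real book each list would hold many positions.
toy_tokens = "time and love and time".split()
offsets = word_offsets(toy_tokens, ["solitude", "time", "love"])
for word, positions in sorted(offsets.items()):
    print(word, positions)
```

Each position becomes one vertical line in the plot, so a word with an empty list simply leaves its row blank.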

Hapax legomena

Erh… what? A hapax legomenon (plural: hapax legomena) is a word that occurs only once within a context, either in the written record of an entire language, in the works of an author, or in a single text.

One Hundred Years of Solitude contains 4741 hapax legomena. Here are 40 of them, picked at random:

epaulets, upsetting, civilization, motilón, marshal, domain, gluttons, despised, secretary, consulting, vise, forty-seven, modify, parrot, thirty-five, jeopardize, highest-flying, docility, wishing, rebuffs, wisely, walter, piglets, cans, dainties, demented, ports, chalice, mitigated, paragraphs, riddle, huts, alexandria, shuttered, consummate, adulterous, hoof, drugstore, tap-dancing, fabric.
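Finding them boils down to counting. A minimal sketch, assuming a list of tokens: the helper name hapax_legomena is invented for this example, and nltk users can get the same list from the hapaxes method of a FreqDist.

```python
import random
from collections import Counter

def hapax_legomena(tokens):
    """Return, in alphabetical order, the tokens that occur exactly once."""
    counts = Counter(tokens)
    return sorted(word for word, n in counts.items() if n == 1)

toy_tokens = ["parrot", "time", "riddle", "time", "hoof", "parrot"]
hapaxes = hapax_legomena(toy_tokens)
print(hapaxes)

# Pick a few at random, never asking for more than are available.
print(random.sample(hapaxes, k=min(2, len(hapaxes))))
```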

Collocations

Ehm… what? I’m a computer scientist, come on! A collocation is a sequence of words or terms that co-occur more often than would be expected by chance.

In One Hundred Years of Solitude, we find the following collocations:

josé arcadio, aureliano segundo, colonel aureliano, aureliano buendía, arcadio buendía, gerineldo márquez, santa sofía, pietro crespi, petra cotes, pilar ternera, colonel gerineldo, arcadio segundo, amaranta úrsula, chestnut tree, banana company, mauricio babilonia, apolinar moscote, father nicanor, prudencio aguilar, many years.
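To make "more often than would be expected by chance" concrete, here is a rough sketch that ranks adjacent word pairs by pointwise mutual information (PMI). This is a swapped-in, simplified scoring chosen for illustration, not necessarily what nltk does internally; the function name, the stopword set and the thresholds are all assumptions.

```python
import math
from collections import Counter

def top_collocations(tokens, stopwords=frozenset(), min_count=2, top_n=20):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scored = []
    for (first, second), count in bigrams.items():
        # Skip rare pairs and pairs involving a stopword.
        if count < min_count or first in stopwords or second in stopwords:
            continue
        # PMI = log2(P(pair) / (P(first) * P(second))), with all
        # probabilities approximated by counts divided by `total`.
        pmi = math.log2(count * total / (unigrams[first] * unigrams[second]))
        scored.append((pmi, count, first, second))
    scored.sort(reverse=True)
    return [f"{first} {second}" for _, _, first, second in scored[:top_n]]

toy_tokens = ("josé arcadio met pilar ternera and "
              "josé arcadio left the town and the river").split()
print(top_collocations(toy_tokens, stopwords={"and", "the"}))
```

The min_count threshold matters: without it, PMI tends to favour pairs of very rare words that happen to appear together once.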

Network analysis

Well, now the fun begins! The idea behind this kind of analysis is to model the relationships between the characters of One Hundred Years of Solitude as a network (a graph).

Before starting, we need to define what a graph is and what the relationships between characters are.

A network is just a set of objects connected to each other by some sort of relationship. As an example, you, my dear reader, and your friends make a network: you’re entities which are connected by past experiences, common interests, etc. Throughout this post, I may call these entities vertices and the relationships edges.

Regarding the relationships between characters: we cannot automatically extract different kinds of relationships. So we say that a relationship exists between two characters A and B if both occur in the text and B appears at most 30 words after an occurrence of A. Furthermore, I considered only the characters with at least one interaction.
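The 30-word rule can be sketched in a few lines. The version below is only an illustration under some loud assumptions: each character is a single token (multi-word names would have to be merged beforehand), edges are undirected, and repeated co-occurrences are counted as a weight. The name cooccurrence_edges is hypothetical.

```python
from collections import Counter

def cooccurrence_edges(tokens, characters, window=30):
    """Count pairs (A, B) where B occurs at most `window` tokens after A."""
    names = set(characters)
    # Keep only the positions where a character is mentioned.
    mentions = [(i, tok) for i, tok in enumerate(tokens) if tok in names]
    edges = Counter()
    for idx, (pos_a, a) in enumerate(mentions):
        for pos_b, b in mentions[idx + 1:]:
            if pos_b - pos_a > window:
                break  # mentions are in order, so later ones are farther
            if a != b:
                # Sort the pair so that (A, B) and (B, A) are one edge.
                edges[tuple(sorted((a, b)))] += 1
    return edges

toy_tokens = ["úrsula", "said", "to", "aureliano"] + ["x"] * 40 + ["amaranta"]
toy_characters = {"úrsula", "aureliano", "amaranta", "rebeca"}
edges = cooccurrence_edges(toy_tokens, toy_characters)
print(dict(edges))
```

Characters without any interaction never show up in an edge, so they are dropped for free. The resulting triples (A, B, weight) can then be loaded into networkx, for example with Graph.add_weighted_edges_from.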

Oh my, we have a graph!

You were waiting for it! Here’s the graph modelling the relationships between the characters of One Hundred Years of Solitude: