Introduction

After a week-long trip to Milano some months ago, I found myself in the waiting area of an airport. It was a peaceful day, and the sun was bursting into the place through a few huge windows. As usual, the bookstore was calling me affectionately, so I couldn’t resist the urge to go in and take a look. My attention was caught by the cover of Norwegian Wood, a book by Haruki Murakami. Although I knew who he was, I had never read a book by him until that moment. I bought a copy and started reading it. I’m not a literary critic, so I won’t comment on the book itself. However, I really liked it and, of course, I had the urge to analyze it using stuff from my very personal world: computer science.

Lil’ disclaimer: the analysis is spoiler-free. Except for some notes, I consciously omitted all of the possible spoilers. Therefore, some parts of the analysis might sound flimsy to whoever already knows the story.

The book, the data and whatever

The data is the full text of the book. To obtain it, I typed all of the text myself into a .txt file. I’m not being sarcastic here, obviously, although sometimes I just type whatever I’m looking for into Google and get lucky in finding exactly what I need.

The overall analysis has been carried out using Python. As always, the code is available here.

The whatever part of this section would be the methodology used to analyze the book. However, we skip that boring part and concentrate on what we want to do (or, actually, what I did). We want to get a bird’s eye view of the book by inspecting its textual data. To start, we analyze the book at the level of symbols and words: we count things, we divide other things and collect more things. Then, we talk about the women and men appearing in the book. Last but not least, we bring up the art of sentiment analysis, borrowing some stuff from it.

Symbols, words, lexical richness

Let’s start by swimming into the pool of symbols (where a symbol is just a typewritten character). Norwegian Wood is 632093 characters long, excluding boilerplate such as page numbers, the title, etc.¹ Focusing on words instead, we have a length of 118378 words. Note that this number counts every word, repetitions included. The number of distinct words is exactly 7145. This gives a lexical richness (the ratio of distinct words to the total number of words) of about 0.0604; the higher this number, the richer the text, although it is easy to see that a lexical richness of 1 is unfeasible for a whole book, since it would require never repeating a word. In other words, distinct words represent about 6.04% of the whole text!
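To make the counting concrete, here is a minimal sketch of how these quantities could be computed. It is not the code behind the numbers above: the regex tokenizer and the sample sentence are assumptions made only for illustration, and a different tokenizer would give slightly different counts.

```python
import re

def tokenize(text):
    """Lowercase the text and keep alphabetic runs (plain apostrophes allowed) as words."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def lexical_richness(tokens):
    """Ratio of distinct words to total words; 0.0 for an empty text."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Hypothetical sample sentence, not taken from the book
sample = "The rain fell and the wind rose and the rain stopped."
tokens = tokenize(sample)

print(len(sample))                          # number of symbols (characters)
print(len(tokens), len(set(tokens)))        # 11 words, 7 of them distinct
print(round(lexical_richness(tokens), 4))   # 0.6364
```

A short sentence scores much higher than a whole book because there is little room for repetition, which is why lexical richness is only comparable between texts of similar length.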

So far, the analysis still looks flat. We can add more things to our soup of data and information.

One of the most interesting aspects we can capture is that of collocations. Simply put, a collocation is a sequence of words that occur together more often than would be expected by chance. These are the collocations found in Norwegian Wood:

storm trooper; dining hall; long time; said reiko; take care; per cent; shook head; hey watanabe; tell truth; said midori; kobayashi bookshop; norwegian wood; straight away; record shop; ami hostel; foreign ministry; stuff like; sunday morning; pretty much; first time

What captured my interest here is that Storm Trooper is treated as a collocation, although it is a name. The fact that it is a two-word name does the trick. Expressions such as “Hey Watanabe!” or “take care” also appear, and even the title of the book (which originally refers to the Beatles’ song), which occurs exactly once in the text.
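One common way to quantify “more often than would be expected by chance” is pointwise mutual information (PMI), which compares how often two words appear side by side with how often they would if they were independent. The sketch below is an assumption about how such a ranking could be done, not the method that produced the list above; the function name, the `min_count` cutoff and the sample tokens are all hypothetical. Pairs such as “shook head” in the list also suggest that function words were removed beforehand, which this sketch does not do.

```python
import math
from collections import Counter

def top_collocations(tokens, min_count=2, top_n=5):
    """Rank adjacent word pairs by (approximate) pointwise mutual information."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # pairs seen only once get unreliable, inflated scores
        # PMI ~ log2( P(w1 w2) / (P(w1) * P(w2)) ), using total tokens as the base
        pmi = math.log2(count * total / (unigrams[w1] * unigrams[w2]))
        scored.append((pmi, w1, w2))
    scored.sort(reverse=True)
    return [f"{w1} {w2}" for _, w1, w2 in scored[:top_n]]

# Hypothetical token stream, not taken from the book
tokens = ("storm trooper woke me up and storm trooper did his "
          "exercises and i woke up late").split()
print(top_collocations(tokens))  # ['storm trooper']
```

This also shows why a two-word name is picked up so easily: its two halves almost never appear apart, so the pair scores very high.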

Also very interesting are the hapax legomena. Odd name aside², a hapax legomenon is a word that occurs only once within a context: in this case, a word that occurs only once within the book. There are a total of 3060 hapax legomena in the whole book. Here are fifty of them, drawn at random:

fringe, increase, odour, yukio, viewing, shattering, dreamless, intervals, flimsy, sociable, cycles, de, chickens, graveyard, brim, manner, elder, metres, centipedes, gear, gangsters, virgo, launched, cruising, camp, neatness, windowsills, remoter, sunbathe, stuffy, frond, twinkle, riots, violated, halt, incorporated, involve, sabots, agonizing, plain, shitpiles, cheats, graceful, dabbing, heel, sorts, tucked, conducive, smoulder, tottori

Yes, Tottori (鳥取市 Tottori-shi) is the capital city of the Tottori Prefecture in Japan. And no, Totoro is not from Tottori.
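Finding words that occur only once comes down to counting every word and keeping those with a count of one. A minimal sketch, assuming the text has already been split into lowercase tokens (the sample tokens are made up):

```python
import random
from collections import Counter

def hapax_legomena(tokens):
    """Return, in alphabetical order, the words that occur exactly once."""
    counts = Counter(tokens)
    return sorted(word for word, n in counts.items() if n == 1)

# Hypothetical token stream, not taken from the book
tokens = "the frond and the brim and a twinkle".split()
hapaxes = hapax_legomena(tokens)
print(hapaxes)  # ['a', 'brim', 'frond', 'twinkle']

# Draw a few at random, in the same spirit as the fifty shown above
picked = random.sample(hapaxes, k=min(3, len(hapaxes)))
print(picked)
```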

Figure 1: the ten most common words in the text.

Finally, one more thing from the word point of view. I picked out the 10 most common words in the text and plotted them. You can see them in Figure 1, where the x-axis indicates the word and the y-axis the frequency (that is, the number of times a word appears in the text). Although the figure as a whole is not very informative, three particular words deserve attention: Naoko, Midori, and Reiko.
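A ranking like this can be sketched with a simple counter. Since character names can only reach the top ten if words such as “the” and “and” are left out, the sketch assumes a stopword filter; the tiny stopword set and the sample tokens below are hypothetical and only for illustration.

```python
from collections import Counter

# Hypothetical stopword set, far shorter than a real one
STOPWORDS = {"the", "a", "and", "i", "to", "of", "was", "she", "he", "it", "in", "said"}

def most_common_words(tokens, n=10, stopwords=STOPWORDS):
    """Return the n most frequent (word, count) pairs, ignoring stopwords."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)

# Hypothetical token stream, not taken from the book
tokens = "naoko said the rain was cold and midori said naoko was in the rain".split()
for word, count in most_common_words(tokens, n=2):
    print(word, count)
```

The resulting pairs are exactly what a bar chart needs: the words go on the x-axis and the counts on the y-axis.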

Female and Male Characters in Norwegian Wood

Naoko, Midori and Reiko are the three main female characters of Norwegian Wood. Fluidly connected to Watanabe, they inhabit different parts of the book at different times. Each of them has a peculiar role and a different degree of importance. Curiosity came knocking at my door, and I decided I couldn’t avoid analyzing their appearances.

First things first: Naoko, Midori, and Reiko appear in the book 476, 349 and 306 times respectively. Each number is the count of occurrences of the respective name in the text. It is relatively safe to assume that whenever a character appears, whether in front of the narrator or not, she is introduced at least once by name.

To visualise their appearances, I made a dispersion plot. As the name suggests, a dispersion plot shows how a word is dispersed across a whole text, i.e., every position at which the word appears. The dispersion plot for the words Naoko and Midori is shown in Figure 2. The number beside each name represents its frequency.

Figure 2: dispersion plot for the words “Naoko” and “Midori”.
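The data behind a dispersion plot is just the list of token positions at which each word occurs; the length of each list is the word’s frequency. Here is a minimal sketch under that reading, with a hypothetical function name and made-up tokens. A two-word name such as Storm Trooper would need matching on pairs of tokens, which this sketch leaves out.

```python
def dispersion_offsets(tokens, targets):
    """Map each target word (lowercased) to the token positions where it occurs."""
    wanted = {t.lower() for t in targets}
    offsets = {t: [] for t in wanted}
    for position, token in enumerate(tokens):
        word = token.lower()
        if word in wanted:
            offsets[word].append(position)
    return offsets

# Hypothetical token stream, not taken from the book
tokens = "Naoko walked ahead . Midori laughed . Naoko stopped".split()
offsets = dispersion_offsets(tokens, ["Naoko", "Midori"])

for name, positions in offsets.items():
    # frequency = number of positions; each position would be one tick on the plot
    print(name, len(positions), positions)
```

Plotting each position as a vertical tick on a horizontal line, one line per word, gives the kind of picture shown in the figures.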

What is actually interesting here is that up to the halfway point of the book, Naoko and Midori do not overlap at all but rather occupy distinct parts. From the halfway point to the end, instead, the two characters are fused together in Watanabe’s thoughts: around the 80% mark, the narrative concentrates on both, then ramps up to a dense part focused on Naoko. Everything leads to the end, where the last word is left to Midori.

Figure 3: dispersion plot for the words “Naoko”, “Midori” and “Reiko”.

I also thought it would be interesting to add Reiko’s occurrences to those of Naoko and Midori, given Reiko’s bond with Naoko. Figure 3 shows the result. In general, Reiko’s and Naoko’s occurrences overlap, except in the first part, where Naoko has not yet met Reiko.

Figure 4: dispersion plot for the words “Storm Trooper” and “Nagasawa”.

Still, Norwegian Wood also contains some male characters. Here, I decided to depict the presence of two of Watanabe’s friends, namely Storm Trooper and Nagasawa. Both can be considered minor characters, although, in my opinion, they add some interesting peculiarities to the story. Their dispersion plot is shown in Figure 4. The resulting frequencies are really informative: they correctly depict both characters as minor; however, the gaps might suggest something interesting… (spoiler-free, remember?)

Sentiment

The last part of the analysis looks at the sentiment expressed throughout the book by computing a sentiment value for each sentence. Sentiment analysis is fairly new to me, so I tried to keep everything in a consistent state.

To keep the sentiment analysis simple, I used VADER. VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is worth pointing out that this is only one possible experiment: I’m not sure that VADER is the best approach for analyzing text from a book, but it is worth a shot.

To kick off the analysis, I computed the sentiment value for each sentence of the book. VADER computes three polarity scores: positive, negative and neutral. It also provides the compound score, a metric obtained by summing the lexicon ratings of the words and normalizing the result to lie between -1 (most extreme negative) and +1 (most extreme positive). A compound score c denotes a positive sentiment if c ≥ 0.05, a negative one if c ≤ -0.05, and a neutral one if it lies strictly between these two values.
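The thresholds translate directly into a small function. This sketch only covers the labelling step and takes the compound score as given; the function name is hypothetical.

```python
def classify_compound(c, threshold=0.05):
    """Label a compound score using the +/-0.05 thresholds described above."""
    if c >= threshold:
        return "positive"
    if c <= -threshold:
        return "negative"
    return "neutral"

# Made-up compound scores, only to exercise the three branches
for score in (0.62, 0.0, -0.05):
    print(score, classify_compound(score))
```

With the vaderSentiment package, the score for a sentence would typically come from `SentimentIntensityAnalyzer().polarity_scores(sentence)["compound"]`; which distribution of VADER was used for this post is not stated here, so take that as an assumption.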

Figure 5: compound mean value throughout the whole book.

Figure 5 shows the mean compound value over a window of 15 sentences throughout the whole book; that is, for every 15 sentences, the mean value is computed and plotted. The x-axis shows the n-th sentence of the book, while the y-axis shows the mean compound value. It is interesting to observe some points emerging in the negative zone and, in particular, the negative one near the end. Note that these values do not represent a single sentence but an aggregation of them, so those negative points might reflect negative parts of the book. To remain spoiler-free, I do not report the sentences related to that negative climax. Go read the book!
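The aggregation can be sketched as follows, reading “for every 15 sentences” as consecutive, non-overlapping windows. That reading is an assumption (a sliding window would be another option), and the function name and scores are made up.

```python
def window_means(scores, size=15):
    """Mean of each consecutive, non-overlapping window of `size` scores.

    The last window may be shorter and is averaged over what it contains.
    """
    means = []
    for start in range(0, len(scores), size):
        chunk = scores[start:start + size]
        means.append(sum(chunk) / len(chunk))
    return means

# Made-up per-sentence compound scores, with a small window for readability
scores = [1, 1, 1, -1, -1, -1, 0.5]
print(window_means(scores, size=3))  # [1.0, -1.0, 0.5]
```

Averaging smooths out single sentences, so a point only falls into the negative zone when a whole stretch of the text leans negative.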

Conclusion

After getting caught up in Norwegian Wood, I decided to analyse it following my previous habits, and this post shows all of the information gathered throughout the process. I went from counting words to grasping sentiment-related insights. Overall, using programming languages such as Python for this stuff is always cool. The sentiment analysis is not particularly convincing right now, so it surely needs a more appropriate approach. Please share your comments (if you have any)!