Starting with the word count (wc) command gives me the following:

Ligin% wc frankenstein.txt

7243 74952 421503 frankenstein.txt

The book contains 7243 lines, 74952 words, and 421503 bytes, which for plain ASCII text is the same as the number of characters.
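If you want the three counts separately, for example to use them in a script, wc can be asked for them one at a time. Here is a minimal sketch that uses a tiny made-up sample file instead of the real book (the file and its contents are my own assumption, purely for illustration). Note that -c counts bytes:

```shell
# Build a tiny stand-in text (hypothetical sample, not the real book).
sample=$(mktemp)
printf 'It was a dreary night.\nI beheld the wretch.\n' > "$sample"

# Reading from stdin keeps the file name out of wc's output;
# tr strips the padding some wc implementations add.
lines=$(wc -l < "$sample" | tr -d ' ')
words=$(wc -w < "$sample" | tr -d ' ')
bytes=$(wc -c < "$sample" | tr -d ' ')

echo "$lines lines, $words words, $bytes bytes"
rm -f "$sample"
```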

The word count gives a fair idea of what kind of book this is. A typical novel has 40,000 words or more, a novella 17,500 to 39,999, a novelette 7,500 to 17,499, and a short story fewer than 7,500. Since this book has more than 70,000 words, we can assume it is a novel.
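The thresholds above are easy to turn into a small helper. This is only a sketch: the function name classify_length is my own invention, and the cut-offs are simply the ones quoted in the paragraph above.

```shell
# classify_length COUNT: name the kind of book for a given word count,
# using the thresholds quoted in the text.
classify_length() {
    count=$1
    if [ "$count" -ge 40000 ]; then
        echo "novel"
    elif [ "$count" -ge 17500 ]; then
        echo "novella"
    elif [ "$count" -ge 7500 ]; then
        echo "novelette"
    else
        echo "short story"
    fi
}

classify_length 74952   # the word count wc reported for frankenstein.txt
```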

For consistency in the rest of the analysis, I convert all the characters to lower case and put each word on its own line:

Ligin% tr 'A-Z' 'a-z' < frankenstein.txt | tr -sc 'a-z' '\n'

Digging further into the contents of the book, the sort and uniq programs give the most frequently used words and the number of times each appears.

Ligin% tr 'A-Z' 'a-z' < frankenstein.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -nr
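The same pipeline can be wrapped in a function so it is easy to rerun on other texts. This is a sketch under my own assumptions: the name word_freq, the optional "top N" argument, and the one-line sample text are all made up for illustration.

```shell
# word_freq FILE [N]: print the N most frequent words in FILE (default 10).
word_freq() {
    tr 'A-Z' 'a-z' < "$1" |     # fold to lower case
        tr -sc 'a-z' '\n' |     # every run of non-letters becomes one newline
        grep -v '^$' |          # drop the empty line a leading non-letter can produce
        sort | uniq -c |        # count identical words
        sort -k1,1nr -k2 |      # highest count first, ties alphabetical
        head -n "${2:-10}"
}

# Try it on a made-up one-line sample.
sample=$(mktemp)
printf 'The cat saw the dog. The dog ran!\n' > "$sample"
top=$(word_freq "$sample" 2 | awk '{print $1 ":" $2}' | paste -sd ' ' -)
echo "$top"
rm -f "$sample"
```

On the real book, word_freq frankenstein.txt 20 would list the twenty most common words.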

The most used word turns out to be “the”, which appears 4195 times, but that does not give any insight into the book. Anyway, I saved the word list, unsorted, into a file named frank.words.

Ligin% tr 'A-Z' 'a-z' < frankenstein.txt | tr -sc 'a-z' '\n' > frank.words

Now, using grep, I can find the most used words of any given length.

Ligin% grep -w "[a-z]\{12\}" frank.words | sort | uniq -c | sort -nr | more
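Since I kept changing the number in that command, a small function saves some typing. This is a sketch: the name words_of_length and the sample word list are my own assumptions. It uses grep -x (match the whole line), which gives the same result as -w here because the file holds one all-letter word per line.

```shell
# words_of_length FILE N: frequency table of the N-letter words in FILE,
# where FILE holds one lower-case word per line (like frank.words).
words_of_length() {
    grep -x "[a-z]\{$2\}" "$1" | sort | uniq -c | sort -k1,1nr -k2
}

# Try it on a made-up word list.
wordfile=$(mktemp)
printf '%s\n' my father my friend father my me > "$wordfile"
six=$(words_of_length "$wordfile" 6 | awk '{print $1 ":" $2}' | paste -sd ' ' -)
echo "$six"
rm -f "$wordfile"
```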

As I changed the length, I found some interesting words and their frequencies.

2850 i

1391 a

1776 my

867 me

608 he

136 myself

134 father

71 friend

45 horror

39 months

36 geneva

34 spirit

59 clerval

55 justine

54 friends

51 cottage

76 feelings

44 creature

38 thoughts

27 murderer

92 elizabeth

65 miserable

37 mountains

21 vengeance

34 discovered

32 sensations

39 countenance

28 endeavoured

27 frankenstein

18 conversation

14 wretchedness

14 tranquillity

14 circumstances

7 disappointment

6 notwithstanding

The longest single word has 16 letters.
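Instead of raising the length by hand until grep finds nothing, the longest word can be found directly. A rough sketch, assuming a one-word-per-line file like frank.words; the function name longest_words and the sample list are made up for illustration.

```shell
# longest_words FILE: print "LENGTH WORD" for every distinct word in FILE
# (one word per line) that has the maximum length.
longest_words() {
    awk '{ print length($0), $0 }' "$1" |    # prefix each word with its length
        sort -k1,1nr -k2 -u |                # longest first, duplicates removed
        awk 'NR == 1 { max = $1 } $1 == max' # keep only the rows tied for longest
}

# Try it on a made-up word list with a tie for longest.
wordfile=$(mktemp)
printf '%s\n' my clerval father justine clerval > "$wordfile"
longest=$(longest_words "$wordfile" | paste -sd ',' -)
echo "$longest"
rm -f "$wordfile"
```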

By pairing each word with the one that follows it, we can analyze a different kind of frequency distribution over the text. Such a pair of adjacent words is called a bigram.

First I create a new file, frank.nextwords, which holds the same words as frank.words shifted up by one line,

Ligin% tail -n +2 frank.words > frank.nextwords

and by pasting the two files side by side and sorting, we can form a bigram file, frank.bigram.

Ligin% paste frank.words frank.nextwords| sort | uniq -c > frank.bigram
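The two steps can also be done in one go without the intermediate file. This is a sketch under my own assumptions: the function name bigrams and the sample words are made up, and unlike the commands above it drops the final row, where the last word has no successor.

```shell
# bigrams FILE: count adjacent word pairs in FILE (one word per line),
# most frequent pair first.
bigrams() {
    # tail -n +2 skips the first word, so paste lines each word up with its successor.
    tail -n +2 "$1" | paste "$1" - |
        awk -F '\t' '$2 != ""' |    # the last word has no successor; drop that row
        sort | uniq -c | sort -k1,1nr -k2
}

# Try it on a made-up word list.
wordfile=$(mktemp)
printf '%s\n' i was i had i was > "$wordfile"
top_pair=$(bigrams "$wordfile" | head -n 1 | awk '{print $1 ":" $2 "_" $3}')
pair_count=$(bigrams "$wordfile" | wc -l | tr -d ' ')
echo "$top_pair ($pair_count distinct pairs)"
rm -f "$wordfile"
```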

Now, sorting the bigrams by frequency,

Ligin% sort -nr frank.bigram | more

I was able to refine my guesses.

228 i was

219 i had

100 as i

90 my father

49 my heart

46 my eyes

41 the cottage

36 i thought

35 my mind

31 my friend

31 my dear

So this is what I guess the book is about:

It’s told from a first-person point of view (maybe Frankenstein’s), more like a diary. The narrator is very close to his or her father. Some other characters are Elizabeth, Justine, and Clerval. The narrator is close to his or her friends too. There is a cottage involved, probably where the narrator lives. Someone is accused of a murder and there is some sort of horror spreading around, which makes it reasonable to guess that this could be a mystery novel.

To know how close I got, I need to sit down and read it. Since it involves horror and murder, I had better not read it at bedtime ;)

If you have read it before, please point out whether I’m anywhere close to the book’s actual content.

Sayonara