For the last couple of days, I have been thinking about writing up my recent experience using raw bash commands and regex to mine text. Of course, there are more sophisticated tools and libraries for processing text without writing so many lines of code. For example, Python has the built-in regex module ‘re,’ which has many rich features for processing text, and ‘BeautifulSoup’ has nice built-in features for cleaning raw web pages. I use these tools for faster processing of large text corpora and when I feel too lazy to write code.

Most of the time, though, I prefer the command line. I feel at home on the command line, especially when working with text data. In this tutorial, I use bash commands and regex to process raw, messy text data. I assume readers have a basic familiarity with regex and bash commands.

I show how bash commands like ‘grep,’ ‘sed,’ ‘tr,’ ‘column,’ ‘sort,’ ‘uniq,’ and ‘awk’ can be combined with regex to process raw, messy text and extract information. As an example, I use the complete works of Shakespeare provided by Project Gutenberg in cooperation with World Library, Inc.

Look at the file first

The complete works of Shakespeare can be downloaded from this link. I downloaded the entire collection and put it into a text file: “shakes.txt.” All right, now let’s get started by looking at the file size:

ls -lah shakes.txt ### Display:

-rw-r--r--@ 1 sabber staff 5.6M Jun 15 09:35 shakes.txt

‘ls’ is the bash command that lists all the files and folders in a directory. The ‘-l’ flag displays the file type, owner, group, size, date, and filename. The ‘-a’ flag displays all files, including hidden ones. The ‘-h’ flag, one of my favorites, displays file sizes in a human-readable format. The size of shakes.txt is 5.6 megabytes.

Explore the text

Okay, now let’s read the file to see what’s in it. I use the ‘less’ and ‘tail’ commands to explore parts of the file. The names of these commands hint at what they do. ‘less’ is used to view the contents of a text file one screen at a time. It is similar to ‘more’ but has the extended capability of allowing both forward and backward navigation through the file. The ‘-N’ flag displays line numbers. Similarly, ‘tail’ shows the last few lines of the file.

less -N shakes.txt ### Display:

1 <U+FEFF>

2 Project Gutenberg’s The Complete Works of William Shakespeare, by William

3 Shakespeare

4

5 This eBook is for the use of anyone anywhere in the United States and

6 most other parts of the world at no cost and with almost no restrictions

7 whatsoever. You may copy it, give it away or re-use it under the terms



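The tail end of the file can be checked the same way with ‘tail.’ A minimal sketch (run here on a tiny stand-in file so the snippet works anywhere; on the real corpus you would point it at shakes.txt):

```shell
# Stand-in file: three lines, the last one mimicking trailing license text
printf 'play text 1\nplay text 2\nEnd of the Project Gutenberg notice\n' > sample_tail.txt

# Show only the final line of the file (use a larger -n for more context)
tail -n 1 sample_tail.txt
```

On the stand-in file this prints the last line, `End of the Project Gutenberg notice`.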
It looks like the first couple of lines are not Shakespeare’s work but some information about Project Gutenberg. Similarly, there are some lines at the end of the file unrelated to Shakespeare’s work. So I delete all the unnecessary lines from the file using ‘sed’ as below:

cat shakes.txt | sed -e '149260,149689d' | sed -e '1,141d' > shakes_new.txt

The above code snippet deletes lines 149260 to 149689 at the tail and then deletes the first 141 lines. The unwanted lines include some information about legal rights, Project Gutenberg, and the table of contents of the work.
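Rather than finding those line numbers by paging through the file, ‘grep -n’ can locate them. A sketch on a stand-in file; the ‘*** START/END ***’ marker wording is an assumption here, so check the exact phrasing in your copy of the file:

```shell
# Stand-in file with Gutenberg-style marker lines around the body text
printf 'header\n*** START OF THIS PROJECT GUTENBERG EBOOK ***\nbody\n*** END OF THIS PROJECT GUTENBERG EBOOK ***\nlicense\n' > sample_markers.txt

# -n prefixes each match with its line number, giving the cut points for sed
grep -n 'PROJECT GUTENBERG' sample_markers.txt
```

Each matching line comes back prefixed with its line number, which is exactly what the ‘sed’ ranges above need.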

Basic Analysis

Now let’s compute some statistics on the file using a pipe (‘|’) and ‘awk’.

cat shakes_new.txt | wc | awk '{print "Lines: " $1 "\tWords: " $2 "\tCharacter: " $3 }' ### Display

Lines: 149118 Words: 956209 Character: 5827807

In the above code, I first extract the entire text of the file using ‘cat’ and then pipe it into ‘wc’ to count the number of lines, words, and characters. Finally, I use ‘awk’ to display the information. The counting and displaying can be done in tons of other ways. Feel free to explore other possible options.
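One such option is letting ‘awk’ do all three counts by itself, with no ‘wc’ in the pipeline. A sketch on a small stand-in file (this assumes plain ASCII text, since ‘wc’ counts bytes by default while awk’s length() counts characters):

```shell
# Stand-in file: 2 lines, 6 words, 19 characters (newlines included)
printf 'to be or\nnot to be\n' > sample_counts.txt

# NR counts lines, summing NF counts words, and length($0) + 1 counts the
# characters on each line plus its trailing newline
awk '{w += NF; c += length($0) + 1} END {print "Lines: " NR "\tWords: " w "\tCharacter: " c}' sample_counts.txt
```

On the stand-in file this prints the same three numbers that `wc` would report.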

Text processing

Now it’s time to clean the text for further analysis. Cleaning includes converting the text to lowercase, removing all digits, removing all punctuation, and removing high-frequency words (stop words). Processing is not limited to these steps; it depends on the purpose. Since I intend to show some basic text processing, I only focus on the above steps.

First, I convert all uppercase characters to lowercase, then remove all digits and punctuation. To perform the processing, I use the bash command ‘tr,’ which translates or deletes characters in a text stream.

cat shakes_new.txt | tr 'A-Z' 'a-z' | tr -d '[:punct:]' | tr -d '[:digit:]' > shakes_new_cleaned.txt

The code snippet above first converts the entire text to lowercase and then removes all punctuation and digits. The results:

### Display before:

1 From fairest creatures we desire increase,

2 That thereby beauty’s rose might never die,

3 But as the riper should by time decease,

4 His tender heir might bear his memory:

5 But thou contracted to thine own bright eyes,

6 Feed’st thy light’s flame with self-substantial fuel,

7 Making a famine where abundance lies,

8 Thy self thy foe, to thy sweet self too cruel:

9 Thou that art now the world’s fresh ornament,

10 And only herald to the gaudy spring,

11 Within thine own bud buriest thy content,

12 And, tender churl, mak’st waste in niggarding:

13 Pity the world, or else this glutton be,

14 To eat the world’s due, by the grave and thee.

### Display after:

1 from fairest creatures we desire increase

2 that thereby beautys rose might never die

3 but as the riper should by time decease

4 his tender heir might bear his memory

5 but thou contracted to thine own bright eyes

6 feedst thy lights flame with selfsubstantial fuel

7 making a famine where abundance lies

8 thy self thy foe to thy sweet self too cruel

9 thou that art now the worlds fresh ornament

10 and only herald to the gaudy spring

11 within thine own bud buriest thy content

12 and tender churl makst waste in niggarding

13 pity the world or else this glutton be

14 to eat the worlds due by the grave and thee

Tokenization is one of the basic preprocessing steps in natural language processing. Tokenization can be performed at either the word or the sentence level. In this tutorial, I show how to tokenize the file into words. In the code below, I first extract the cleaned text using ‘cat,’ then use ‘tr’ with its two flags ‘s’ and ‘c’ to put every word on its own line.

cat shakes_new_cleaned.txt | tr -sc 'a-z' '\n' > shakes_tokenized.txt ### Display (First 10 words)

1 from

2 fairest

3 creatures

4 we

5 desire

6 increase

7 that

8 thereby

9 beautys

10 rose
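What makes this work is the flag pair: ‘-c’ complements the set, so ‘tr’ matches every character that is not a lowercase letter, and ‘-s’ squeezes the resulting run of newlines into a single one. A quick way to see this is to run the same invocation on a short string:

```shell
# Every run of non-letters (spaces, commas) collapses into one newline,
# so each word lands on its own line
echo 'to be, or not to be' | tr -sc 'a-z' '\n'
```

This prints the six words `to be or not to be`, one per line; without ‘-s’ the comma-plus-space would produce a blank line.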

Now that we have all the words tokenized, we can answer questions like: what is the most (or least) frequent word in the entire body of Shakespeare’s work? To do this, I first use the ‘sort’ command to sort all the words, then use the ‘uniq’ command with the ‘-c’ flag to count the frequency of each word. ‘uniq -c’ is similar to ‘groupby’ in Pandas or SQL. Finally, I sort the words by frequency in either ascending (least frequent first) or descending (most frequent first) order.

cat shakes_tokenized.txt | sort | uniq -c | sort -nr > shakes_sorted_desc.txt ### Display

29768 the

28276 and

21868 i

20805 to

18650 of

15933 a

14363 you

13191 my

11966 in

11760 that

cat shakes_tokenized.txt | sort | uniq -c | sort -n > shakes_sorted_asc.txt ### Display

1 aarons

1 abandoner

1 abatements

1 abatfowling

1 abbominable

1 abaissiez

1 abashd

1 abates

1 abbeys

1 abbots

The above results reveal some interesting observations. For example, the ten most frequent words are pronouns, prepositions, conjunctions, or articles. If we want to find more abstract information about the work, we have to remove all the stop words (prepositions, pronouns, conjunctions, modal verbs, etc.). It also depends on the purpose of the analysis; one might be interested only in prepositions, in which case it’s fine to keep them. On the other hand, the least frequent words include ‘aarons,’ ‘abandoner,’ and ‘abatements.’

Removing stop words

In the next step, I show how to use ‘awk’ to remove all the stop words on the command line. In this tutorial, I use NLTK’s list of English stop words, to which I have added a couple more words. Details of the following code can be found in this StackOverflow answer. Details of the different options of awk can also be found in the awk manual (‘man awk’ on the command line).

awk 'FNR==NR{for(i=1;i<=NF;i++)w[$i];next}(!($1 in w))' stop_words.txt shakes_tokenized.txt > shakes_stopwords_removed.txt
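If the one-liner looks cryptic, here is the same two-file idiom on a pair of tiny hand-made files, which makes the two passes easy to follow: while reading the first file (where FNR==NR holds), every stop word is stored as a key of the array w; for the second file, a line is printed only if its word is not a key of w.

```shell
# Tiny stop-word list and token list, one word per line
printf 'the\nand\nto\n' > stop_demo.txt
printf 'the\nking\nand\nlord\n' > tokens_demo.txt

# First pass fills array w with stop words; second pass filters by membership
awk 'FNR==NR{for(i=1;i<=NF;i++)w[$i];next}(!($1 in w))' stop_demo.txt tokens_demo.txt
```

Only `king` and `lord` survive the filter; `the` and `and` are dropped because they appear in the first file.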

Alright, after removing the stop words, let’s sort the words in ascending and descending order as above.

cat shakes_stopwords_removed.txt | sort | uniq -c | sort -nr > shakes_sorted_desc.txt ### Display most frequent

3159 lord

2959 good

2924 king

2900 sir

2634 come

2612 well

2479 would

2266 love

2231 let

2188 enter

cat shakes_stopwords_removed.txt | sort | uniq -c | sort -n > shakes_sorted_asc.txt ### Display least frequent

1 aarons

1 abandoner

1 abatements

1 abatfowling

1 abbominable

1 abaissiez

1 abashd

1 abates

1 abbeys

1 abbots

We see the most frequent word used by Shakespeare is ‘lord,’ followed by ‘good.’ The word ‘love’ is also among the most frequent words. The least frequent words remain the same. A linguistics or literature student may interpret this information or gain better insight from these simple analytics.

Let’s discuss

Now that we are done with some basic processing and cleaning, in the next tutorial I will discuss how to perform some more advanced analytics. Until then, if you have any questions, feel free to ask. Please leave a comment if you see any typos or mistakes, or if you have better suggestions. You can reach out to me: