I have been playing in the last few days with some tools to analyze online texts, and I have been using NLTK (Natural Language Toolkit), which is a platform for building Python programs to work with human language data.

NLP, or natural language processing, is the science of enabling the computer to understand human language, derive meaning from it, and generate natural language. It sits at the intersection of computer science, artificial intelligence, and linguistics.

Lately I have been working on a quite similar project: implementing an intelligent chatbot on my Raspberry Pi. I will be posting that experiment on my blog, but for now, let's learn how to use NLTK to analyze text.

The analysis in this tutorial is based on 348 files, but it is still approximate and is intended as an educational tool to learn the basics, nothing more. NLTK is a great tool, but it remains software designed to evolve over time and become more efficient. You may find some small classification errors, but they are negligible compared to the overall result.

The texts were downloaded from the American Rhetoric site. I have not read the content of each speech, but the overwhelming majority of these texts are words spoken by Obama during his speeches. Anything said by somebody else is negligible compared to the overall result of this analysis.

Downloading Content

The first step is downloading the content. I used this simple Python script:

#!/usr/bin/env python
# coding: utf8

from goose import Goose
import urllib
import lxml.html
import codecs

def get_links(url, domain):
    connection = urllib.urlopen(url)
    dom = lxml.html.fromstring(connection.read())
    # Select the url in href for all <a> tags (links)
    for link in dom.xpath('//a/@href'):
        if link.startswith("speech") and link.endswith("htm"):
            yield domain + link

def get_text(url):
    g = Goose()
    article = g.extract(url=url)
    with codecs.open(article.link_hash + ".speech", "w", "utf-8-sig") as text_file:
        text_file.write(article.cleaned_text)

if __name__ == "__main__":
    link = "http://www.americanrhetoric.com/barackobamaspeeches.htm"
    domain = "http://www.americanrhetoric.com/"
    for i in get_links(link, domain):
        get_text(i)
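Note that this script targets Python 2: urllib.urlopen moved to urllib.request in Python 3, and the original goose package is Python 2 only (its Python 3 port is goose3).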

Concatenating the downloaded speeches into a single file is the second step:

import os

for file in os.listdir("."):
    if file.endswith(".speech"):
        os.system("cat " + file + " >> all.speeches")
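The cat call assumes a Unix-like system. If you prefer a portable alternative, a pure-Python sketch of the same step (my own, not from the original script) could be:

import codecs
import os

# Write all .speech files into a single all.speeches file
with codecs.open("all.speeches", "w", "utf-8-sig") as out_file:
    for name in os.listdir("."):
        if name.endswith(".speech"):
            with codecs.open(name, "r", "utf-8-sig") as in_file:
                out_file.write(in_file.read())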

Then we create what are called tokens in NLTK jargon:

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import codecs

# NLTK's English stopword list (run nltk.download("stopwords") once beforehand)
english_stopwords = stopwords.words("english")

with codecs.open("all.speeches", "r", "utf-8-sig") as text_file:
    r = text_file.read()

# Remove punctuation
tokenizer = RegexpTokenizer(r'\w+')
_tokens = tokenizer.tokenize(r)

# Get clean tokens (drop English stopwords)
tokens = [t for t in _tokens if t.lower() not in english_stopwords]
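As a quick sanity check (my own addition, not part of the original script), you can compare the token counts before and after stopword removal:

print("%d tokens, %d after stopword removal" % (len(_tokens), len(tokens)))
print(tokens[:10])  # peek at the first few cleaned tokens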

Analyzing Content

The Lexical Diversity

According to Wikipedia, the lexical diversity of a given text is defined as the ratio of the number of different unique word stems (types) to the total number of words (tokens).
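With the tokens built above, a minimal sketch of this measure (my own illustration, using NLTK's PorterStemmer; the variable names are assumptions) could look like this:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stems = [stemmer.stem(t.lower()) for t in tokens]

# Lexical diversity: number of unique word stems over total number of words
lexical_diversity = len(set(stems)) / float(len(stems))
print(lexical_diversity)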