All the news

Clustering 143,000 articles with KMeans.

I recently curated this dataset to explore some algorithmic approximation of the categories that make up our news, a thing that at different times I have both read and created. If you had tens of thousands of articles from a spread of outlets that seem more or less representative of our national news landscape and you turned them into structured data, and you put a gun to that data’s head and coerced it into groups, what would those groups be?

I decided the best balance of simplicity and efficacy would be to use unsupervised clustering methods and let the data sort itself, however crudely (and categories, no matter what algorithm they’re derived from, will almost always be crude, as there’s no reason the media can’t be infinitesimally taxonomized). For a variety of reasons (local memory constraints, ability, recommendations from those more learned), I chose to run a bag-of-words through KMeans — in other words, if every word becomes its own dimension and each article a single datapoint, what clusters of articles will form? If you’re itching to skip to the “so what” and/or don’t care about code, scroll down until you see bold letters telling you not to. The code is here if anyone wants to peer-review this and tell me if/where I screwed up and/or give me suggestions.

Because KMeans is non-deterministic, results vary; the clusters change a small bit between runs, but, having done many of them now, I can attest that they don’t do so substantially. The results here are more or less inherent to the data.
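(A small aside: if you want identical clusters from run to run, scikit-learn's KMeans, which is what's used below, takes a random_state argument that pins the initialization. A minimal sketch, with an arbitrary seed:)

from sklearn.cluster import KMeans

#fixing the seed makes repeated runs reproducible
clf = KMeans(n_clusters=10, random_state=42)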

A brief overview

This is how the data looks:

The number of articles in the dataset

These publications were chosen based on a completely unscientific process of Cartesian introspection whereby I looked inside myself and came out with a rough summary of what I believed was a reasonable sample of our national news landscape. The method for grabbing these articles went more or less like this:

Grab the link for each publication’s archived homepage or RSS feed for the past year-and-a-half from the invaluable archive.org (so invaluable that I gave them money afterwards to say thanks (if anyone there is reading, thanks for existing)).

Scrape every article from every link on that archived homepage using a very-very-hacked-together web scraper cobbled together with BeautifulSoup.

Do the uninteresting data stuff.

Clean the filthy data.

Etc.
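The scraper itself isn't reproduced in this post, but the gist of that scraping step looked something like the sketch below. The snapshot URL, link handling and tag choices here are placeholders, not the actual ones used:

import requests
from bs4 import BeautifulSoup

#hypothetical example: pull article text from one archived homepage snapshot
snapshot_url = 'https://web.archive.org/web/20160813000000/http://www.example-outlet.com/'
soup = BeautifulSoup(requests.get(snapshot_url).text, 'html.parser')

#grab every link on the archived homepage (in reality you'd resolve relative
#URLs and filter for actual article pages)
links = [a['href'] for a in soup.find_all('a', href=True)]

for link in links:
    article_soup = BeautifulSoup(requests.get(link).text, 'html.parser')
    #crude extraction: join the text of every paragraph tag
    content = ' '.join(p.get_text() for p in article_soup.find_all('p'))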

All of which is to say, the data here comprises articles from these publications, mainly from the beginning of 2016 to July 2017, that were featured on the homepage or in the RSS feed; in other words, this was not an insatiable vacuuming of an entire domain as it existed on August 13, 2016.

Stemming the corpus

This is a snippet of the data:

I made a judgment call (and you can judge this judgment) to remove proper nouns from the corpus. The thinking was: A) there are too many of them, B) they don’t tell us much about the content and manner of the writing, and C) they’re another form of noise when attempting to boil down the essence of these categories. As you’ll see later, Python’s NLTK is a great package, but it isn’t perfect, and some proper nouns remained in the corpus after attempting to scrub them out.

Additionally, I decided to remove digits/numbers. And there’s a line in there that deals with a tendency for NLTK’s tokenizer to sometimes leave in periods at the end of words, something I maintain is a bug.

The stemming process looks like:

import nltk
import re
import sys
import time
from nltk import pos_tag
from nltk.stem import PorterStemmer
from nltk import word_tokenize
from collections import Counter

stemmer = PorterStemmer()
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

progress = 0 #for keeping track of where the function is

def stem(x):
    global start
    global progress
    dirty = word_tokenize(x)
    tokens = []
    for word in dirty:
        if word.strip('.') == '': #this deals with the tokenizer bug
            pass
        elif re.search(r'\d{1,}', word): #getting rid of digits
            pass
        else:
            tokens.append(word.strip('.'))
    tokens = pos_tag(tokens) #tag each token with its part of speech
    progress += 1
    stems = ' '.join(stemmer.stem(key.lower()) for key, value in tokens
                     if value != 'NNP') #getting rid of proper nouns
    end = time.time()
    sys.stdout.write('\r {} percent, {} position, {} per second '.format(
        str(float(progress / len(articles))), str(progress),
        (1 / (end - start)))) #lets us see how much time is left
    start = time.time()
    return stems

start = time.time()
articles['stems'] = articles.content.apply(lambda x: stem(x))

The result is that an article goes from this:

Queen Elizabeth II made her first public appearance in almost a month on Sunday, allaying concerns about her health after she missed Christmas and New Year’s Day church services because of what Buckingham Palace described as a persistent cold. The queen, who will turn 91 in April, attended services at St. Mary Magdalene Church in Sandringham...

To this:

made her first public appear in almost a month on , allay concern about her health after she miss and s church servic becaus of what describ as a persist cold the queen , who will turn in , attend servic at in...

Stemming the words drastically cuts down on the size of the vocabulary. Instead of “realize” and “realized” being considered different words and being assigned their own dimensions, they’re reduced to their shared stem, “realiz”. This reduces noise so that the algorithm doesn’t place weight on, say, a publication’s preference for the past tense over the present, or treat plural nouns as different vocabulary items from their singular forms, and so on.
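For a quick sense of what the stemmer does to individual words (a toy illustration, separate from the pipeline above):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

#different inflections collapse to the same stem
print(stemmer.stem('realize'), stemmer.stem('realized'), stemmer.stem('realizing')) #all 'realiz'
print(stemmer.stem('services'), stemmer.stem('service')) #both 'servic'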

Creating a vocabulary

Now for a head count, tallying every single one of these stems throughout the corpus, which can then be taken and turned into a dataframe to be used for the document-term matrix and vocabulary.

from collections import Counter

all_words = Counter()
start = time.time()
progress = 0

def count_everything(x):
    global start
    global all_words
    global progress
    x = x.split(' ')
    for word in x:
        all_words[word] += 1
    progress += 1
    end = time.time()
    sys.stdout.write('\r {} percent, {} position, {} per second '.format(
        str(float(progress / len(articles))), progress, (1 / (end - start))))
    start = time.time()

for item in articles.stems:
    count_everything(item)

which is then transferred to a new dataframe:

import pandas as pd

allwordsdf = pd.DataFrame(columns = ['words', 'count'])
allwordsdf['count'] = pd.Series(list(all_words.values()))
allwordsdf['words'] = pd.Series(list(all_words.keys()))
allwordsdf.index = allwordsdf['words']

which gives, at the head of the dataframe:

Republican’s inclusion is as good an example as any of how the part-of-speech tagger doesn’t completely scrub out proper nouns. But ignoring that, the corpus is now a dataframe with each term in the vocabulary as an item in the index, which will be useful in the near future.

One challenge when dealing with internet-derived text data is that non-words, like combinations of characters and symbols (e.g., “@username”, “#hashtags”, words conjoined with ellipses like “welll…”), appear with relative frequency. Rather than find and clean each one, I decided to keep only the words that appear in NLTK’s complete English corpus. Just how complete that corpus is may be up for debate among linguists, but at 236,736 words, it’s sizable. We’ll finish streamlining our dataframe by first stemming the entire English corpus and then comparing that corpus to our own:

from nltk.corpus import words

#stem the words in the NLTK corpus so that they're equivalent to the words in the allwordsdf dataframe
nltkstems = [stemmer.stem(word) for word in words.words()]

#make a new dataframe with the stemmed NLTK words
nltkwords = pd.DataFrame()
nltkwords['words'] = nltkstems

#keep only the words that appear in the stemmed NLTK corpus
allwordsdf = allwordsdf[allwordsdf['words'].isin(nltkwords['words'])]

This cuts down the total vocabulary size from 89,216 to 34,527. It takes care of every last bit of noise in the vocabulary, and it took weeks before I considered this solution.

Vectorizing the words

A TfIdf (term frequency-inverse document frequency) vectorizer, roughly speaking, gives a value for each word in each article, weighted by how common that word is across the whole corpus. The document frequency, which sits in the denominator, is derived from the word’s frequency in the entire dataset. Take the word “perspicacious”, a word that, because of its many superior substitutes in English, is a bullshit word that we luckily scarcely see. Because of that scarcity, its document frequency, or denominator, is low. If it occurred 15 times in a single article, its Tf value, or numerator, would be high. So its TfIdf value would be a large numerator over a small denominator, yielding a high number. So in our many-thousand-dimensional space, the article would have a large value on the “perspicacious” dimension. (This, of course, is to say nothing about normalizing vectors and the other tasks involved in finding TfIdf values.)
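To make that intuition concrete, here’s a toy example, separate from the pipeline above, running sklearn’s TfidfVectorizer on three tiny made-up documents; the rare word ends up with a comparatively high weight in the one document that contains it:

from sklearn.feature_extraction.text import TfidfVectorizer

#three made-up mini-documents; "perspicacious" appears only in the third
toy_docs = ['the queen attended church services',
            'the queen missed church services',
            'one perspicacious observer attended church']

toy_vec = TfidfVectorizer()
toy_matrix = toy_vec.fit_transform(toy_docs)

vocab = toy_vec.vocabulary_
print(toy_matrix[2, vocab['perspicacious']]) #high weight: rare across documents
print(toy_matrix[2, vocab['church']])        #lower weight: appears in every document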

Including stopwords (a list of words for the algorithm to ignore) doesn’t matter as much when using this type of vectorizer, since words that appear everywhere are given low values anyway. But it’s useful nonetheless, since at the very least it lowers memory usage and reduces the already very high dimensionality of our space. Additionally, setting a floor on word counts ensures that incredibly uncommon words that happen to appear all in one article don’t form flukish clusters of their own. I chose words above the 40th quantile. At first blush, that seems quite high, until you look at what that quantile contains:

allwordsdf[allwordsdf['count'] == allwordsdf['count'].quantile(.4)][:10]

So the 40th quantile includes words with only 9 occurrences in the entire corpus — extremely low and thus not inclined to be informative. Why not the 50th or 60th quantile? Because a number has to be chosen somewhere, and it may as well be this one.

On to creating the stopwords, vectorizer vocabulary and vectorizer. Writing both stopwords and vocabulary may be redundant; I’m adding both for good measure, and because we need the vocab list later.

from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = list(allwordsdf[(allwordsdf['count'] >= allwordsdf['count'].quantile(.995)) |
                            (allwordsdf['count'] <= allwordsdf['count'].quantile(.4))]['words'])

vecvocab = list(allwordsdf[(allwordsdf['count'] < allwordsdf['count'].quantile(.995)) &
                           (allwordsdf['count'] > allwordsdf['count'].quantile(.4))]['words'])

vec = TfidfVectorizer(stop_words = stopwords, vocabulary = vecvocab, tokenizer=None)

Now to transform the dataframe:

vec_matrix = vec.fit_transform(articles['stems'])

Which produces a matrix of shape (142570, 20193), or about 20,000 words.

Dimensionality reduction

How many dimensions to reduce our 20,193-dimension matrix to is difficult to answer. Sklearn’s official recommendation states, “For Latent Semantic Analysis [what we’re doing here], a value of 100 is recommended.” I’ve clustered this data with all 20,193 dimensions, and I’ve clustered it with 100 dimensions, and I’ve clustered it with three dimensions, and each time, the clusters seem independent of how many dimensions there are. This ultimately comes down to cutting down on processing time, and since the wisdom of the people who created the package dictates 100 dimensions, 100 dimensions it is.

from sklearn.decomposition import TruncatedSVD

pca = TruncatedSVD(n_components=100)
vec_matrix_pca = pca.fit_transform(vec_matrix)
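As a rough sanity check (not something the clustering below depends on), TruncatedSVD exposes how much of the original variance those 100 components retain:

#rough check of how much variance the 100 components capture
print(pca.explained_variance_ratio_.sum())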

Clustering

Even more difficult to answer is how many clusters to assign to the data. The data is packed too tightly together for hierarchical clustering or any algorithm that determines the number of clusters on its own, so with KMeans the number has to be chosen up front; I chose ten as a starting point.
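One way to pressure-test that choice (an optional aside rather than part of the original run) is to fit KMeans at a few candidate values of k and compare inertia and silhouette scores; the silhouette is computed on a sample here because the full matrix is large:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

#optional: compare a few candidate cluster counts on the reduced matrix
for k in (5, 10, 15, 20):
    candidate = KMeans(n_clusters=k).fit(vec_matrix_pca)
    score = silhouette_score(vec_matrix_pca, candidate.labels_, sample_size=10000)
    print(k, candidate.inertia_, score)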

from sklearn.cluster import KMeans

clf10 = KMeans(n_clusters=10, verbose = 0)
clf10.fit(vec_matrix_pca)

Now to assign the labels we’ve just created to the original dataframe for grouping, visualizing and analyzing:

articles['labels'] = clf10.labels_

We can take a look at what percentage of each publication’s articles was assigned to each label:

labelsdf = articles.groupby(['publication', 'labels']).count()

pubslist = list(articles['publication'].unique())

labelsdf['percentage'] = 0

#for each (publication, label) pair, compute that publication's share of articles
for pub in pubslist:
    for label in range(10):
        try:
            labelsdf.loc[(pub, label), 'percentage'] = (labelsdf.loc[(pub, label), 'id'] /
                                                        labelsdf.loc[pub, 'id'].sum())
        except KeyError: #a publication may have no articles with a given label
            pass

labelsdf = labelsdf.reset_index()[['publication', 'labels', 'percentage']]

which gives each publication’s share of articles per label, and so on.
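(As an aside, the same percentages can be had more concisely with a groupby and a transform; a sketch, written to a separate dataframe so it doesn’t clobber labelsdf:)

#share of each publication's articles assigned to each label, computed directly
labelshare = (articles.groupby(['publication', 'labels'])
                      .size()
                      .groupby(level='publication')
                      .transform(lambda x: x / x.sum())
                      .rename('percentage')
                      .reset_index())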

If you were scrolling down, this is where to stop

These graphs were produced in RStudio with Plotly, a sometimes spotty but still useful package for data viz.