Scattertext is a Python package that lets you interactively visualize how two categories of text are different from each other (Kessler 2017). Most of the work I’ve done on Scattertext focuses on how you can visualize the differences in how single words (and bigrams) are used with different frequencies across categories.

This blog post focuses on how we can extend these techniques to longer phrases, using PyTextRank and Phrasemachine. The first part of this post gives an overview of how Scattertext visualizations work, the second part describes how to integrate PyTextRank into Scattertext and two different ways of using its scores, and the third part discusses how to integrate Phrasemachine.

Introduction to Scattertext

Figure 1. An example Scattertext plot showing how Democrats and Republicans differed in word frequency ranks across speeches in their 2012 nominating conventions. The bluer a term, the higher its association score for Democrats; the redder a term, the higher its association score for Republicans. See http://jasonkessler.github.io/demo_dense_rank.html for an interactive version of this plot.

In Figure 1, we can see how word usage differs between Democrats and Republicans (among speakers in the 2012 political conventions). Each point corresponds to a word: the higher up a word is on the y-axis, the more it was used by Democrats, and the further right, the more it was used by Republicans. Frequently used terms, like conjunctions, articles, and prepositions, appear in the upper right-hand corner, while infrequently used terms, like “optimism” or “ballot,” appear in the lower left-hand corner.

The positions shown correspond to the dense ranks of term frequencies within each class. This means that the most frequently used term in a category is plotted next to the second most frequently used term, even if one term is used many times more than its neighbor. This prevents large gaps from appearing in the plot and makes it more readable as a whole.
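To make the dense-rank idea concrete, here is a small sketch using pandas (the term counts below are made up for illustration). Ties share a rank, and ranks are consecutive integers, so a term used 1,000 times sits directly next to one used 400 times:

```python
import pandas as pd

# Hypothetical frequency counts for a few terms in one category.
freqs = pd.Series({'the': 1000, 'and': 400, 'america': 400, 'optimism': 3})

# Dense ranking: ties share a rank and ranks are consecutive, so 'the'
# (rank 1) plots right next to 'and'/'america' (rank 2), even though it
# is used far more often. This is what keeps the plot free of large gaps.
dense_ranks = freqs.rank(method='dense', ascending=False).astype(int)
print(dense_ranks.to_dict())
# {'the': 1, 'and': 2, 'america': 2, 'optimism': 3}
```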

Finally, an algorithm unique to Scattertext is used to determine which points are labeled and which points aren’t. This labeling happens client-side, and users can interactively see which terms correspond to which points if they mouse-over them. Clicking reveals snippets from documents showing how each term was used in context.

Using PyTextRank

PyTextRank, created by Paco Nathan, is an implementation of a modified version of the TextRank algorithm (Mihalcea and Tarau 2004). It uses a graph centrality algorithm to extract a scored list of the most prominent phrases in a document. Here, the phrases are named entities recognized by spaCy. As of spaCy version 2.2, these come from an NER system trained on OntoNotes 5.

To use it, build a corpus as normal, but make sure you use spaCy to parse each document, as opposed to a built-in whitespace_nlp -type tokenizer. Note that adding PyTextRank to the spaCy pipeline is not needed, since it will be run separately by the PyTextRankPhrases object. We'll reduce the number of phrases displayed in the chart to 2,000 using the AssociationCompactor . The phrases generated will be treated like non-textual features, since their document scores do not correspond to word counts.

import pytextrank, spacy
import scattertext as st
import numpy as np

nlp = spacy.load('en')

convention_df = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: df.text.apply(nlp),
    party=lambda df: df.party.apply(
        {'democrat': 'Democratic',
         'republican': 'Republican'}.get
    )
)

corpus = st.CorpusFromParsedDocuments(
    convention_df,
    category_col='party',
    parsed_col='parse',
    feats_from_spacy_doc=st.PyTextRankPhrases()
).build().compact(
    st.AssociationCompactor(2000, use_non_text_features=True)
)

Note that the terms present in the corpus are named entities, and, as opposed to frequency counts, their scores are the eigencentrality scores assigned to them by the TextRank algorithm. Running corpus.get_metadata_freq_df('') will return, for each category, the sums of terms' TextRank scores. The dense ranks of these scores will be used to construct the scatter plot.

term_category_scores = corpus.get_metadata_freq_df('')
print(term_category_scores)
'''
                Democratic  Republican
term
our future        1.113434    0.699103
your country      0.314057    0.000000
their home        0.385925    0.000000
our government    0.185483    0.462122
our workers       0.199704    0.210989
her family        0.540887    0.405552
our time          0.510930    0.410058
...
'''

Before we construct the plot, let’s create some helper variables. Since the aggregate TextRank scores aren’t particularly interpretable, we’ll display the per-category rank of each score in the metadata_description field. These will be displayed after a term is clicked.

term_ranks = np.argsort(
    np.argsort(-term_category_scores, axis=0),
    axis=0
) + 1

metadata_descriptions = {
    term: '<br/>' + '<br/>'.join(
        '<b>%s</b> TextRank score rank: %s/%s' % (
            cat,
            term_ranks.loc[term, cat],
            corpus.get_num_metadata()
        )
        for cat in corpus.get_categories()
    )
    for term in corpus.get_metadata()
}

We can construct term scores in a couple of ways. One is a standard dense-rank difference, the score used in most of the two-category contrastive plots here, which will give us the most category-associated phrases. Another is to use the maximum category-specific score, which will give us the most prominent phrases in each category, regardless of their prominence in the other category. We’ll take both approaches in this tutorial; let’s compute the second kind of score, the category-specific prominence, below.

category_specific_prominence = term_category_scores.apply(
    lambda row: (row.Democratic
                 if row.Democratic > row.Republican
                 else -row.Republican),
    axis=1
)

Now we’re ready to output this chart. Note that we use a dense_rank transform, which places identically scored phrases atop each other. We use category_specific_prominence as scores, and set sort_by_dist to False to ensure the phrases displayed on the right-hand side of the chart are ranked by their scores and not by their distance to the upper-left or lower-right corners. Since matching phrases are treated as non-text features, and in order to ensure that they’re searchable when clicked, we encode them as single-phrase topic models, setting topic_model_preview_size to 0 to indicate that the topic model list shouldn’t be shown. Finally, we set use_full_doc to True to ensure the full documents are displayed. Note that documents will be displayed in order of their phrase-specific scores.

html = st.produce_scattertext_explorer(
    corpus,
    category='Democratic',
    minimum_term_frequency=0,
    pmi_threshold_coefficient=0,
    width_in_pixels=1000,
    transform=st.dense_rank,
    metadata=corpus.get_df()['speaker'],
    scores=category_specific_prominence,
    sort_by_dist=False,
    use_non_text_features=True,
    topic_model_term_lists={term: [term] for term in corpus.get_metadata()},
    topic_model_preview_size=0,
    metadata_descriptions=metadata_descriptions,
    use_full_doc=True
)

Figure 2. Phrases are extracted by PyTextRank. The two axes are the dense ranks of phrase centrality scores as computed by PyTextRank. Terms are scored based on the sum of their category-specific eigenvector centrality as calculated by PyTextRank. The category association is simply determined by which category’s average score is highest. Interactive version: http://jasonkessler.github.io/demo_pytextrank_prominence.html

The most associated terms in each category make some sense, at least on a post hoc analysis. When referring to (then) Governor Romney, Democrats used his surname “Romney” in their most central mentions of him, while Republicans used the more familiar and humanizing “Mitt.” As for President Obama, the phrase “Obama” didn’t show up as a top term in either category, but the first name “Barack” was one of the most central phrases in Democratic speeches, mirroring “Mitt.”

Alternatively, we can use the dense rank difference in scores to color phrase-points and determine the top phrases displayed on the right-hand side of the chart. Instead of setting scores to the category-specific prominence scores, we set term_scorer=st.RankDifference() to inject a way of determining term scores into the scatter plot creation process.

html = st.produce_scattertext_explorer(
    corpus,
    category='Democratic',
    minimum_term_frequency=0,
    pmi_threshold_coefficient=0,
    width_in_pixels=1000,
    transform=st.dense_rank,
    use_non_text_features=True,
    metadata=corpus.get_df()['speaker'],
    term_scorer=st.RankDifference(),
    sort_by_dist=False,
    topic_model_term_lists={term: [term] for term in corpus.get_metadata()},
    topic_model_preview_size=0,
    metadata_descriptions=metadata_descriptions,
    use_full_doc=True
)

Figure 3. Phrases are extracted by PyTextRank. Here, terms are scored the same way as in Figure 1, based on the difference in terms’ dense ranks. Interactive version: https://jasonkessler.github.io/demo_pytextrank_rankdiff.html

Using Phrasemachine

Phrasemachine, by Abe Handler (Handler et al. 2016), uses regular expressions over sequences of part-of-speech tags to identify noun phrases. This has an advantage over spaCy’s NP-chunking in that it tends to isolate meaningful, large noun phrases which are free of appositives.
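To make the regex-over-tags idea concrete, here is a toy sketch (not Phrasemachine itself): each token’s coarse POS tag is encoded as a single character, and a simplified noun-phrase pattern is matched over the resulting tag string. The tag mapping and the pattern below are illustrative assumptions; Phrasemachine’s actual grammar (Handler et al. 2016) is richer, e.g. (A|N)*N(PD*(A|N)*N)*.

```python
import re

# Illustrative mapping from coarse POS tags to one-character codes.
TAG_TO_CHAR = {'ADJ': 'A', 'NOUN': 'N', 'PROPN': 'N', 'ADP': 'P', 'DET': 'D'}

def simple_noun_phrases(tagged_tokens):
    """tagged_tokens: list of (word, coarse_pos) pairs.
    Returns multiword spans matching (A|N)*N over the tag string."""
    tag_str = ''.join(TAG_TO_CHAR.get(tag, 'O') for _, tag in tagged_tokens)
    phrases = []
    for m in re.finditer(r'[AN]*N', tag_str):
        words = [w for w, _ in tagged_tokens[m.start():m.end()]]
        if len(words) > 1:  # keep multiword phrases only
            phrases.append(' '.join(words))
    return phrases

tokens = [('the', 'DET'), ('large', 'ADJ'), ('noun', 'NOUN'),
          ('phrases', 'NOUN'), ('of', 'ADP'), ('speeches', 'NOUN')]
print(simple_noun_phrases(tokens))  # ['large noun phrases']
```

Notice how the determiner “the” and the single noun “speeches” are excluded, while the adjective-noun-noun run is kept whole; that is the intuition behind Phrasemachine’s large, appositive-free noun phrases.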

As opposed to PyTextRank, we’ll just use counts of these phrases, treating them like any other term.

We’ll select the 4000 most category-associated phrases from the corpus, and plot them using their dense-ranked category-specific frequencies. We’ll use the difference in dense ranks as the scoring function.