In honor of today being April 20th, I thought it would be interesting to do some NLP on Reddit comments about marijuana (shout out to Yufeng Guo for this idea!). My teammate Felipe Hoffa has conveniently made all Reddit comments available in BigQuery, so I took a subset of those comments and ran them through Google’s Natural Language API.

The comment data

I looked only at comments from the November 2016 table since a lot of marijuana legislation was passed during that time

Of 71 million comments during November 2016 49,717 comments (.07%) contained the words ‘marijuana’ or ‘cannabis’ .

. For quality control, I looked only at comments with Reddit score > 100. I also narrowed the dataset to English comments. That left me with 12,789 comments to analyze. Here’s my BigQuery schema:

Number of comments by day in November 2016

What did the volume of pot-related comments throughout November look like?

We see a big spike in the number of comments around the election, which makes sense: seven states changed their marijuana laws during the 2016 election.

What adjectives do people use when they talk about pot?

The Natural Language API’s syntax analysis method tells us the part of speech of each word in a sentence, which makes it easy to get all of the adjectives used in these comments. Which adjectives were used most frequently in conjunction with ‘marijuana’ and ‘cannabis’?

I also wondered if the top adjectives used each day changed throughout the month. Making use of BigQuery’s handy partitioning I used this query to find the top 3 adjectives used each day:

SELECT * FROM (

SELECT day, adjective, count(*) c,

row_number() over(partition by day order by count(*) desc) seqnum

FROM `sara-bigquery.reddit_pot_nlp.adj_day`

GROUP BY 1, 2

ORDER BY day, c desc

)

WHERE seqnum <= 3

ORDER BY day

Here’s a snippet of the output from November 5th to the 10th (the 4th column is the number of mentions for a word, and the last column is seqnum from the query above):

We can see that ‘good’, ‘more’, ‘other’ and legal were part of the vocabulary throughout the month, but during the election the adjectives ‘medical’ and ‘recreational’ came into play.

What else do people talk about when they talk about pot?

The Natural Language API’s analyzeEntities endpoint can tell us! It will extract any known entities from our text, along with their Wikipedia URL if it exists. Here are the most common entities mentioned in conjunction with marijuana on Reddit:

Get started with your own text

Want to do this sort of thing on text that’s not related to marijuana? That’s cool too. This NL analysis can be applied to any sort of text — news articles, customer service feedback, speech transcriptions, and more.

You can try out the Natural Language API directly in the browser with your own text. For example, here’s the results of syntax annotation on one of the comments from my dataset above:

If you want to follow the approach I used in this analysis, here are the steps:

Get the Reddit data you’d like to analyze from BigQuery. Download the results as JSON to a local file

Reading from the local file, send the text to the NL API and write the NL output to a separate file. You can do this from your favorite language, I used Node.js — code here.

Upload your newline delimited JSON file to BigQuery, it can automatically detect the schema:

Go forth and process your text!

Extra: training a custom text model with word embeddings

Machine learning APIs are pretty magical in that they take your data and give you back detailed information about it — all without requiring you to think about what’s happening under the hood.

You can think of it kind of like ordering a pastry: you provide money and a bakery gives you back a chocolate chip cookie. This will result in a delicious cookie a lot of the time, but sometimes you may decide you want to add white chocolate chips or toffee bits to your cookie. In this case you’re going to have to customize it, which likely involves making it yourself even if it may ruin some of the magic. After you’ve gotten over this fear you can either use a recipe someone else has written (BYO ingredients) or live dangerously and write your own.

The machine learning equivalent of making your cookie from scratch is building and training a model with your own data. I’ve recently started exploring this side of machine learning by looking at word2vec: a model for learning the relationships between words in a dataset. As the name implies, it takes words as input and outputs word vectors: a vector representation of a word in a set of text.

Using the word2vec sample on GitHub and this tutorial, I fed the model a subset of the Reddit comment data (~55k words, 8k unique) to generate a visualization of word embeddings:

Word embeddings essentially map words to vectors by extracting the meaningful words from a large text dataset and determining the semantic relationship between them. We can see in the visualization above that the model did a pretty good job clustering related words together: ‘rights’, ‘laws’, and ‘legalization’ are grouped together, as is ‘marijuana’ and ‘cannabis’.

Next steps? I could use this trained model to predict the topic of a comment or generate a new comment. I’m just getting started with custom models, so I’d love to hear your feedback and suggestions! Leave a comment or find me on Twitter Sara Robinson.

Want to do more cannabis analysis?

My colleague Allen Day has made genomics data for 1000 cannabis strains available in BigQuery. Details here, or head over to BigQuery to check out the public dataset.