At HumanGeo, we love Elasticsearch and we love social media. Elasticsearch lends itself well to a variety of interesting ways to process the vast amount of content social media produces. But like most things on the internet, keeping up with slang and trends in social media text can be an increasingly difficult barrier to analyzing the data (so can getting beyond your teenage years). So how do we get past this barrier? If the web is so powerful, can't we use it to help us understand what's really being said?

Enter Urban Dictionary. Usually, if you're looking for answers, UD might be the last place on the internet you want to look unless you have a large jug of mindbleach waiting on standby. Aside from proving that the internet is a cold, dark place, Urban Dictionary has a large amount of crowd-sourced data that can help us get some insight into today's communication medium, whether it's 140 characters or beyond.

In this post, my goal is to 1) collect a bunch of data from Urban Dictionary, 2) index it in such a way that I can use it to "decipher" lousy internet slang, and 3) query it with "normal" terms and get extended results.

The Data

To get started, we needed the words themselves. To do this, I built a simple web scraper to crawl UD and extract the words. Here's a snippet that pulls the words out of the DOM using Python, via Requests and Beautiful Soup.

import requests
from bs4 import BeautifulSoup

WORD_LINK = 'http://www.urbandictionary.com/popular.php?character={0}'

def make_alphabet_soup(self, letter, link=WORD_LINK):
    '''Make soup from the list of words for a letter on the page.'''
    r = requests.get(link.format(letter))
    soup = BeautifulSoup(r.text)
    return soup

def parse_words(self, letter, soup=None):
    '''Scrape the webpage and return the words present.'''
    if not soup:
        soup = self.make_alphabet_soup(letter)
    word_divs = soup.find(id='columnist').find_all('a')
    words = [div.text for div in word_divs]
    return words

This is the basic building block, but I extended from there. For every word I grabbed, I threw it against the Urban Dictionary API and got a definition.

# Redacted
API_LINK = 'http://ud_api.com'

def define(self, word, url=API_LINK):
    '''Send a request with the given word to the UD JSON API.'''
    r = requests.get(url, params={'term': word}, timeout=5.0)
    j = r.json()
    # Add our search term to the document for context
    j.update({'word': word})
    return j
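Tying the two pieces together is straightforward. Here's a hypothetical driver (the function names and JSON-lines output are my own sketch, not the exact code I ran) that walks the alphabet, scrapes each letter's popular words, and looks each one up against the API:

```python
import string

def collect(letters=string.ascii_lowercase, parse_words=None, define=None):
    '''Hypothetical driver: scrape the popular-word list for each letter,
    then look every word up against the UD API, yielding one document
    per word. parse_words and define are the scraper methods above.'''
    for letter in letters:
        for word in parse_words(letter):
            yield define(word)

# Usage sketch: dump everything to a JSON-lines file for later indexing.
# import json
# with open('ud_words.jsonl', 'w') as f:
#     for doc in collect(parse_words=scraper.parse_words, define=scraper.define):
#         f.write(json.dumps(doc) + '\n')
```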

Using this method, I ended up with about 100k "popular" words, as defined by UD. An example response from the API looks something like:

{
  "tags": ["black", "ozzy", "sabbath", "black sabbath", "geezer",
           "metal", "osbourne", "tony", "bill", "butler"],
  "result_type": "exact",
  "list": [
    {
      "defid": 772739,
      "word": "Iommi",
      "author": "Matthew McDonnell",
      "permalink": "http://iommi.urbanup.com/772739",
      "definition": "Iommi = a Godlike person. A master of their chosen craft. Someone or something extremely cool",
      "example": "Example 1. Hey rick, that motorcyle stunt you did was really Iommi!\r\n\r\nExample 2. That guy is SO Iommi!\r\n\r\nExample 3. Be Iommi, man!",
      "thumbs_up": 57,
      "thumbs_down": 3,
      "current_vote": ""
    }
  ],
  "sounds": []
}

Now that I had the data, it was time to make something out of it.

The Process

With our data in hand, it's time to utilize Elasticsearch. More specifically, it's time to take advantage of the Synonym Token Filter when indexing data into Elasticsearch.

A quick interjection about indexing: this is a good time to talk about "the guts" of how data gets indexed into Elasticsearch. If you don't specify your mappings before indexing data, you can get unexpected results if you're not familiar with the mapping/analysis process. By default, string data is tokenized at index time, which is great for full-text search but not when we want exact matches against multi-word phrases. For example, if I search for exactly "brown fox" in my index (say, an exact match against my query string), I will also get results for the sentence "John Brown was attacked by a fox." You can read more about that behavior here. A good strategy is to create a subfield of "word", such as ".raw", that is set to not_analyzed in your mapping.
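For reference, that ".raw" strategy is a multi-field mapping. Here's a sketch of what it looks like in the 1.x-era string mappings this post uses (the field name "word" is just for illustration):

```python
# Sketch of a multi-field mapping: "word" is analyzed for full-text
# search, while "word.raw" is stored verbatim for exact matching.
raw_subfield_mapping = {
    "properties": {
        "word": {
            "type": "string",  # analyzed by default
            "fields": {
                "raw": {
                    "type": "string",
                    "index": "not_analyzed"  # no tokenization: exact match only
                }
            }
        }
    }
}
```

Queries against "word" then behave like normal full-text search, while queries against "word.raw" only match the whole original value.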

Using the data we collected, we can generate the Solr synonym file required by the token filter. To do this, I used the "tags" area of each definition. This definitely is not a set of synonyms (sometimes you just get a bunch of racism and filth), but it does provide (potentially) related words for the original word. For example, here are the tags for the word "internet":

"facebook"
"web"
"computer"
"myspace"
"lol"
"google"
"online"
"porn"
"youtube"
"internets"

I mean, they're not wrong. Here's an example of adding the mapping I used on the "test" index in the "name" field:

justin@macbook ~/p/urban> curl -XPOST "http://localhost:9200/test" -d '
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "whitespace",
            "filter": ["synonym"]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "/tmp/solr-synonyms.txt"
          }
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "synonym"
        }
      }
    }
  }
}'
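The file referenced by "synonyms_path" uses Solr's simple comma-separated format, where terms on one line are treated as equivalent. A minimal sketch of generating it from the scraped documents (the docs iterable and output path are assumptions, not the exact code I used):

```python
def synonym_line(word, tags):
    '''Build one Solr synonym line: comma-separated, equivalent terms.'''
    terms = [word] + [t for t in tags if t and t != word]
    return ', '.join(terms)

# Usage sketch: one line per scraped Urban Dictionary document.
# with open('/tmp/solr-synonyms.txt', 'w') as f:
#     for doc in docs:
#         f.write(synonym_line(doc['word'], doc.get('tags', [])) + '\n')
```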

The Search

Now that we have our index set up, it's time to put a search in action. Until I went down this rabbit hole, I had no idea calling something Iommi was a thing (it probably isn't). As someone who likes Black Sabbath, I want to find other words in my index that are totally Iommi. Using the mapping I specified above, I indexed a few sample documents with the "name" field set to tags that UD relates to Iommi, as well as some bogus filler. Example tags (and no, I did not make this example up):

"sabbath"
"black sabbath"
"geezer"
"metal"

Our query (in Sense, against the 'test' index), and the results:

POST _search
{
  "query": {
    "term": {
      "name": "iommi"
    }
  }
}
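If you're not working in Sense, the same query can be sent from Python with Requests. A sketch, assuming a local node at localhost:9200 and the "test" index from above:

```python
def build_term_query(field, value):
    '''Build the same term query shown above as a Python dict.'''
    return {'query': {'term': {field: value}}}

# Usage sketch (assumes a running local node with the 'test' index):
# import requests
# r = requests.post('http://localhost:9200/test/_search',
#                   json=build_term_query('name', 'iommi'))
# hits = r.json()['hits']['hits']
```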

Awesome! This installment is more about showcasing how the filter works, so it isn't entirely practical. Look out for a future installment where we use real social media data to do "extended search" and display results with Elasticsearch's highlighting feature, to show a practical example of this in action.