The 140 characters in a tweet don’t leave much room for context. To understand a tweet, you often need to understand the who, what, and where behind it. The lab’s Soft-Boiled challenge spent some time looking at the “where,” taking an automated approach to estimating the location of users or messages on Twitter (a problem referred to as geo-inferencing).

We found two general approaches to inferring locations on Twitter: based on the content of the messages and based on the structure of the social network. Both approaches use messages with known locations to estimate something about messages without a known location.

Soft-Boiled set out to implement one geo-inferencing method from each category, in a scalable way. Our algorithms are implemented in Python, using Spark as a distributed computing environment, which let us share the same code between algorithm development and analysis. Our implementation is geared towards Twitter data, which unfortunately can’t be included alongside our code due to Twitter’s terms of service, but you can use the Twitter API to pull a sample of data and explore the algorithms. Pull requests are always welcome! Now back to the methods:

Social Network-Based Methods

One of the most interesting findings from early analysis of online social networks was that friends on these networks tend to be geographically close to each other. Using just this fact, you can start to estimate the location of users based on where their friends are. A recent paper surveys many network-based methods; one such approach is described below:

Training

Construct a social network graph from Twitter using bi-directional @mentions. An @mention occurs when one user mentions another user in a tweet. Restricting the graph to bi-directional @mentions mitigates the effect of one-sided relationships, such as those with celebrities or news outlets. The figure below shows how one-sided communication in the graph is pruned away, leaving only bi-directional connections.
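As a sketch, pruning to bi-directional edges only needs a reciprocity check over the mention pairs; the pair extraction and names here are illustrative, not our actual Spark pipeline:

```python
def bidirectional_edges(mentions):
    """Keep only reciprocated @mentions as undirected edges.

    `mentions` is an iterable of (sender, mentioned_user) pairs extracted
    from tweets; one-sided links (e.g. fans mentioning a celebrity) drop out.
    """
    directed = set(mentions)
    # The a < b condition emits each undirected edge exactly once.
    return {(a, b) for (a, b) in directed if (b, a) in directed and a < b}

edges = bidirectional_edges([
    ("alice", "bob"), ("bob", "alice"),  # mutual: kept
    ("carol", "cnn"),                    # one-sided: pruned
])
```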

Estimate the home locations of users. For users who have sent a sufficient number of tweets with a known location, estimate their “home” location as the tweet location that minimizes the total distance to the remaining points. The example below shows the location estimated for user B from all of that user’s located tweets:
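That distance-minimizing point is a medoid over the user’s tweet locations. A minimal sketch using great-circle distance (the minimum-tweet threshold here is illustrative, not a tuned value):

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def home_location(points, min_points=3):
    """Medoid: the observed tweet location minimizing total distance
    to the other located tweets. Returns None for users with too few
    located tweets."""
    if len(points) < min_points:
        return None
    return min(points, key=lambda p: sum(haversine_km(p, q) for q in points))
```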

Using the known home locations and the @mention graph from step 1, we apply label propagation to infer unknown locations. Spatial label propagation, depicted below, examines each node in the graph from step 1 and estimates a user’s location as the friend’s location that minimizes the median distance to the other friends. In the diagram, user A’s location is estimated as user B’s location, since user B minimizes the median distance to all of the other friends. The median distance is used to handle outliers, and because the estimate is always an actual friend’s location, the algorithm can’t place a user in the middle of the ocean.

An improvement discussed in a follow-on paper adds a constraint on how large that median distance can be: if your friends are too spread out, the algorithm doesn’t attempt to estimate your location. Repeat the label propagation step for the desired number of iterations.
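A single propagation pass can be sketched as below; the adjacency format, the pluggable distance function, and the 100 km cutoff default are illustrative stand-ins for our Spark implementation:

```python
import statistics

def propagate_once(edges, locations, dist, max_median_km=100.0):
    """One spatial label propagation pass (illustrative sketch).

    edges:     undirected adjacency, {user: set of friends}
    locations: {user: location} for users located so far
    dist:      distance between two locations (e.g. great-circle km)

    Each unlocated user takes the located friend's point whose median
    distance to the other located friends' points is smallest, unless
    that median exceeds max_median_km (the dispersion constraint from
    the follow-on paper).
    """
    new_locations = dict(locations)
    for user, friends in edges.items():
        if user in locations:
            continue
        pts = [locations[f] for f in friends if f in locations]
        if len(pts) < 2:
            continue  # too few located friends to estimate anything
        def median_to_others(i):
            return statistics.median(
                dist(pts[i], pts[j]) for j in range(len(pts)) if j != i)
        best = min(range(len(pts)), key=median_to_others)
        if median_to_others(best) <= max_median_km:
            new_locations[user] = pts[best]
    return new_locations
```

Running the pass repeatedly, feeding each round’s estimates back in as known locations, gives the iteration loop described above.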

Prediction

Every iteration ends with a list of users and the best estimate of their locations. A user’s location cannot be estimated if the user doesn’t have a sufficient number of connections with known or estimated locations.

Content-Based Methods

Content-based methods rely on the words a user writes in their messages and profile, rather than on the structure of the social network. The simplest method is to look at the language of a message and predict the most likely country for that language. That simple model correctly predicts the country of origin 97% of the time for tweets in Japanese, but unfortunately only 62% of the time for tweets in English.
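As a sketch, that baseline is just a table lookup from language code to modal country. The table below is hypothetical (mapping English to the US is our illustrative guess); in practice it would be estimated from tweets with known locations:

```python
# Hypothetical language -> most-likely-country table; a real one would be
# estimated by counting located tweets per language.
MOST_LIKELY_COUNTRY = {"ja": "JP", "en": "US"}

def baseline_country(lang_code):
    """Predict the country of origin from the message language alone."""
    return MOST_LIKELY_COUNTRY.get(lang_code)
```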

Content-based methods take a variety of approaches. Some turn geo-inferencing into a classification problem, estimating a city or country, while others approach it as a regression problem and estimate latitude/longitude directly. These methods use a multitude of underlying machine learning techniques to make those predictions. We will discuss an approach that estimates latitude/longitude directly using Gaussian Mixture Models (GMMs). A GMM estimates the distribution of a variable as a weighted combination of Gaussian distributions. One interesting byproduct of the probabilistic nature of this model is that an estimate of confidence is built into the prediction. The GMM-based approach described below attempts to estimate the geographic distribution of every word in a corpus, then locate a message by combining the distributions of its words.

Training

For each tweet with a location, split (tokenize) the text into words. For each word, across all tweets, find all of the locations from which that word was used, and fit a GMM to that distribution of locations. Finally, for each word-GMM, calculate the distances between the most likely point predicted by the GMM and the true locations; this model error on the training data will be used as a scale factor.
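The training steps can be sketched with scikit-learn’s GaussianMixture. The whitespace tokenizer, occurrence threshold, component count, and Euclidean-in-degrees error below are all simplifications of what a real implementation would use:

```python
from collections import defaultdict

import numpy as np
from sklearn.mixture import GaussianMixture

def train_word_gmms(tweets, min_occurrences=10, n_components=2):
    """Fit a per-word GMM over the tweet locations where each word appears.

    `tweets` is an iterable of (text, (lat, lon)) pairs. Returns a dict of
    word -> (fitted GMM, median training error used as a scale factor).
    """
    word_points = defaultdict(list)
    for text, latlon in tweets:
        for word in set(text.lower().split()):  # naive tokenizer
            word_points[word].append(latlon)

    models = {}
    for word, pts in word_points.items():
        if len(pts) < min_occurrences:
            continue  # too rare to fit a stable distribution
        X = np.asarray(pts)
        gmm = GaussianMixture(n_components=min(n_components, len(pts)),
                              covariance_type="full", random_state=0).fit(X)
        # Most likely point: mean of the heaviest component, a cheap proxy
        # for the density maximum.
        peak = gmm.means_[np.argmax(gmm.weights_)]
        errors = np.linalg.norm(X - peak, axis=1)
        models[word] = (gmm, float(np.median(errors)))
    return models
```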

Prediction

For a given tweet, tokenize the text into words. For each word in the message, look up the word-GMM and error associated with that word, then combine the per-word predictions into a prediction for the message. We could simply multiply the probability distributions for each word together, but that would give equal weight to “the” (uninformative) and “#SF” (very informative). Instead, the authors suggest using the inverse of the error to the fourth power as a weighting factor when combining the GMMs. This penalizes words that are used over broad geographical ranges; the exponent controls how much influence the error has. Additionally, to produce an estimate of the model’s certainty, take the most likely point and use the properties of GMMs to estimate the probability that the true location is within some radius of it, for example 100 km.
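One simple way to apply those weights is to score a coarse grid of candidate points with per-word log-densities weighted by 1/error⁴; the grid search and the weight floor below are our simplifications, not the paper’s method:

```python
import numpy as np

def predict_location(text, models, candidate_grid):
    """Score candidate points by a weighted sum of per-word GMM
    log-likelihoods (i.e. a weighted product of densities).

    `models` maps word -> (fitted GaussianMixture, training error), as a
    training step like the one above would produce. Returns the best
    candidate point, or None if no word in the message is known.
    """
    grid = np.asarray(candidate_grid)
    combined = np.zeros(len(grid))
    matched = 0
    for word in set(text.lower().split()):
        if word not in models:
            continue
        gmm, err = models[word]
        weight = 1.0 / max(err, 1e-6) ** 4  # floor avoids division by zero
        combined += weight * gmm.score_samples(grid)
        matched += 1
    if matched == 0:
        return None
    return tuple(grid[np.argmax(combined)])
```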

What we found

Soft-Boiled not only created code to produce an estimated location but also a confidence, as described above, that a message is within some radius (e.g., an 80% chance that the true location is within 100 km).

In our testing, we found that the network-based methods provided better accuracy than content-based methods, and they were also much faster to run. Content-based methods were slower, but they were able to estimate locations for a much larger percentage of users. The primary cost of the content-based methods is building the GMMs and evaluating the probability mass covered by some radius for the confidence estimate.

We were also able to create a hybrid algorithm, in which content-based methods produce an initial estimate of users’ locations that is then refined using a network-based method. The hybrid gave us the ability to tune accuracy and coverage to match the application.

Ultimately, inferring location using either of the classes of algorithm in the literature — or a hybrid of the two — is feasible and can be performed in a scalable and performant manner.