
Project Objective

Untappd's API usage restrictions disallow analytics and data-mining use cases, so instead I'm going to explore tweets of beer and brewery check-ins shared from the Untappd app to find some implicit trends in how users share their activity.

Exploratory Analysis

```r
library(tidyverse)
library(rtweet)
library(stringr)
library(wesanderson)
library(maps)
library(tidytext)
library(dumas)  # http://jasdumas.github.io/dumas/
library(wordcloud)
library(magrittr)
library(emo)
```

All social media shares from the Untappd app include their own short URL ‘untp.beer’, which makes the search query criteria identifiable using the search_tweets() function.

```r
untp <- search_tweets(
  q = "untp.beer",
  n = 18000,
  include_rts = FALSE,
  retryonratelimit = TRUE,
  geocode = lookup_coords("usa"),
  lang = "en"
)
```

Let’s take a peek at the text of the tweet!

```r
head(untp$text)
```

```
## [1] "I just earned the 'Tower of Beer (Level 20)' badge on @untappd! https://t.co/Cv5kwOxquc"
## [2] "I just earned the 'Belgian Holiday (Level 31)' badge on @untappd! https://t.co/soGN8iihKw"
## [3] "I just earned the 'Flan-didly-anders (Level 3)' badge on @untappd! https://t.co/RFY0CKXlaI"
## [4] "I just earned the 'Hopped Down (Level 16)' badge on @untappd! https://t.co/9BWWwsEX1O"
## [5] "I just earned the 'Master' badge on @untappd! https://t.co/plOOnWFovi"
## [6] "Just added Victory at Sea by @BallastPoint to our menu https://t.co/7ify5kXSyA"
```

Let’s explore some descriptive stats about the 18,000 tweets that were extracted.

How many unique users are in the data set?

```r
n_distinct(untp$user_id)
```

```
## [1] 5867
```

What is the range of dates in the data set?

```r
paste(as.Date(min(untp$created_at)), as.Date(max(untp$created_at)), sep = " to ")
```

```
## [1] "2018-02-04 to 2018-02-06"
```

My initial assumption was that all the tweets would be posted from the app, but it seems there is a little bit of cross-posting going on from Facebook and some 🤓 nerds who have set up IFTTT applet recipes.

```r
count(untp, source)
```

```
## # A tibble: 4 x 2
##   source                   n
##   <chr>                <int>
## 1 Facebook                 6
## 2 Flamingo for Android     1
## 3 IFTTT                    6
## 4 Untappd              17987
```

How many types of these check-ins are shared?

There are a few different types of tweet structures that can be shared from the Untappd app, as noticed in the text sample above. They include:

- Earning badges (i.e. tweets that contain ‘I just earned the…’ or even the word ‘badge’)
- Added review text (i.e. text which ends in a ‘-’ before the default template of ‘Drinking a’)
- Default check-ins (i.e. tweets that begin with ‘Drinking a’)
- Brewery offering updates (i.e. tweets that begin with ‘Just added …’ for new beers added)

Granular account settings make it easy to share check-in info to specific linked social media accounts.

Let’s detect the string patterns and create a new column in the data set to house them.

```r
untp %<>%
  mutate(structure_type = case_when(
    str_detect(text, "I just earned the") ~ "badge achievement",
    str_detect(text, "^Drinking")         ~ "default check-in",
    str_detect(text, "^Just added")       ~ "brewery update",
    TRUE                                  ~ "additional review"
  ))
```
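As a quick sanity check, the same `case_when()` logic can be applied to a few hypothetical tweets, one per template (these sample strings are made up for illustration):

```r
library(stringr)
library(dplyr)

# hypothetical sample tweets, one matching each structure type
samples <- c(
  "I just earned the 'Master' badge on @untappd!",
  "Drinking a Victory at Sea by @BallastPoint",
  "Just added Victory at Sea by @BallastPoint to our menu",
  "Smooth and roasty - Drinking a Victory at Sea by @BallastPoint"
)

# the order of the clauses matters: the review tweet also contains "Drinking",
# but not at the start of the string, so it falls through to the catch-all
case_when(
  str_detect(samples, "I just earned the") ~ "badge achievement",
  str_detect(samples, "^Drinking")         ~ "default check-in",
  str_detect(samples, "^Just added")       ~ "brewery update",
  TRUE                                     ~ "additional review"
)
#> [1] "badge achievement" "default check-in"  "brewery update"    "additional review"
```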

Given that a single check-in can result in multiple badges, and therefore multiple social shares, it makes sense that badge tweets outnumber the other check-in types.

```r
untp %>%
  count(structure_type) %>%
  mutate(structure_type = fct_reorder(structure_type, n)) %>%
  ggplot(aes(x = structure_type, y = n)) +
  geom_bar(stat = "identity", fill = wes_palette("BottleRocket1", 1)) +
  theme_minimal() +
  labs(title = "Count of Tweet Types for Untappd Twitter Shares",
       subtitle = "Most users have shared their badge achievements",
       y = "Count of Tweets",
       x = "")
```

What kind of badges are users earning?

The ‘Brew Bowl LII’ badge is the most popular badge earned during this Super Bowl weekend. Coincidentally, I visited a brewery this weekend and earned this badge as well.

The badge names are located between single quotes, so we’ll have to extract out that text to learn what specific badges are earned.

```r
untp %<>% mutate(badge_type = str_extract(text, "(?<=').*?(?=')"))
```
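The lookaround pattern grabs the shortest run of characters between the first pair of single quotes. For example, on a made-up badge tweet mirroring the template seen earlier:

```r
library(stringr)

# hypothetical badge share, mirroring the template seen in the text sample
x <- "I just earned the 'Brew Bowl LII' badge on @untappd!"

# (?<=') looks behind for a quote, .*? matches lazily, (?=') looks ahead for the closing quote
str_extract(x, "(?<=').*?(?=')")
#> [1] "Brew Bowl LII"
```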

There were numerous ‘Middle of the Road’ badges awarded of various levels. The description of the Level 5 of that badge is:

Looking for more kick than a session beer, but want to be able to stay for a few rounds? You have to keep it in the middle. That’s 25 beers with an ABV greater than 5% and less than 10%.

So it would appear that this is a popular ABV range among users.

```r
untp %>%
  count(badge_type) %>%
  dplyr::filter(!is.na(badge_type)) %>%
  arrange(desc(n)) %>%
  top_n(25) %>%
  ggplot(aes(x = fct_reorder(badge_type, n), y = n)) +
  geom_bar(stat = "identity", fill = wes_palette("GrandBudapest1", 1)) +
  theme_minimal() +
  coord_flip() +
  labs(title = "Count of Badge Types for Untappd Twitter Shares",
       subtitle = "Brew Bowl LII was the most awarded badge this weekend",
       y = "Count of Tweets",
       x = "")
```

Where do people check-in from?

There is a lot of activity on the east coast (Pennsylvania being the top place at the time of running this analysis) and in metros across the U.S.! Untappd is based in North Carolina, so it’s interesting not to see as much activity there as expected, but users can have their location settings turned off for privacy in Twitter. It may also indicate users not filling out all the check-in details, such as purchase or drinking location. Given the plethora of missing values, it often seems users are drinking and checking in at home and may want to mask their location.

```r
untp %>%
  count(place_full_name) %>%
  arrange(desc(n)) %>%
  top_n(21) %>%
  data.frame()
```

```
##       place_full_name     n
## 1                <NA> 14703
## 2   Pennsylvania, USA    73
## 3        Portland, OR    69
## 4        Florida, USA    56
## 5     Los Angeles, CA    45
## 6  Cape Girardeau, MO    44
## 7         Houston, TX    39
## 8         Phoenix, AZ    35
## 9       New York, USA    34
## 10        Anaheim, CA    32
## 11         Dallas, TX    32
## 12   Philadelphia, PA    32
## 13        Chicago, IL    31
## 14         Boston, MA    29
## 15       Brooklyn, NY    29
## 16         Austin, TX    26
## 17      San Diego, CA    25
## 18      Nashville, TN    22
## 19          Tempe, AZ    22
## 20       O'Fallon, MO    21
## 21      Charlotte, NC    20
## 22      Manhattan, NY    20
## 23       Monterey, CA    20
## 24        Raleigh, NC    20
## 25      Virginia, USA    20
```

There are a few off the map, but the general distribution is effectively visualized with one yellow dot per tweet.

```r
## create lat/lng variables using all available tweet and profile geo-location data
untp <- lat_lng(untp)

## plot state boundaries
par(mar = c(0, 0, 0, 0))
maps::map("state", lwd = .25)

## plot lat and lng points onto state map
with(untp, points(lng, lat, pch = 20, cex = .75, col = wes_palette("Cavalcanti1", 1)))
```

What are the most popular breweries?

Instead of crafting an ugly regex solution to extract the brewery from the tweet text, the mentions_screen_name column is actually a decent proxy for the brewery, provided the brewery has a Twitter presence. I really enjoy Tree House Brewery’s Green IPA, and it is nice to see that many others have tried beers from their brewery.

```r
# mentions_screen_name is the brewery that produced the checked-in beer
untp %>%
  unnest(brewery = mentions_screen_name) %>%
  count(brewery) %>%
  dplyr::filter(brewery != 'untappd') %>%
  arrange(desc(n)) %>%
  top_n(25) %>%
  ggplot(aes(x = fct_reorder(brewery, n), y = n)) +
  geom_bar(stat = "identity", fill = wes_palette("Moonrise2", 1)) +
  theme_minimal() +
  coord_flip() +
  labs(title = "Top 25 Breweries for Untappd Twitter Shares",
       subtitle = "",
       y = "Count of Tweets",
       x = "",
       caption = "by @handle occurrences")
```

How many pictures of beer are shared?

The #photo hashtag is included with a tweet from Untappd when the user attaches a photo. The rest of the hashtags seem to align with the Super Bowl festivities.

```r
untp %>%
  unnest(hashtags) %>%
  count(hashtags) %>%
  dplyr::filter(!is.na(hashtags)) %>%
  arrange(desc(n)) %>%
  top_n(15)
```

```
## # A tibble: 15 x 2
##    hashtags                 n
##    <chr>                <int>
##  1 photo                 2939
##  2 brewbowl              2498
##  3 ibelieveinIPA           62
##  4 SuperBowl               40
##  5 FlyEaglesFly            33
##  6 FirstSqueeze            26
##  7 craftbeer               24
##  8 birthdaybrew            19
##  9 UntapTheStack           17
## 10 BrainDeadBottleShare    16
## 11 beerandfood             15
## 12 Superbowl               12
## 13 flyeaglesfly            10
## 14 superbowl               10
## 15 GoEagles                 9
```

Did people share more during the Super Bowl game?

There was definitely a spike on Sunday as the Super Bowl was starting (this post has since been re-run with newer data), and it’s noticeable on Monday as well!

```r
## plot time series of tweets
ts_plot(untp, "1 hours") +
  ggplot2::theme_minimal() +
  ggplot2::theme(plot.title = ggplot2::element_text(face = "bold")) +
  ggplot2::labs(
    x = NULL, y = NULL,
    title = "Frequency of Untappd Twitter statuses from the past 2 days",
    subtitle = "Twitter status (tweet) counts aggregated using one-hour intervals",
    caption = "
Source: Data collected from Twitter's REST API via rtweet"
  )
```

Sentiment Analysis

To prep for text analysis, we want to separate out the text before the dash (-), which is how the tweet is structured when users add additional text to their beer review share.

```r
# the first vector is the review, second is the beer/brewery, third is the url
untp_text <- str_split(untp$text[untp$structure_type == 'additional review'], "-|—")

# unnest these unnamed lists; if they were named I would have used purrr::map_df()
# https://stackoverflow.com/a/24496537/4143444
review <- sapply(untp_text, "[", 1)
beer   <- sapply(untp_text, "[", 2)
user   <- untp %>%
  dplyr::filter(structure_type == 'additional review') %>%
  select(starts_with("screen_name"))

untp_text <- data.frame(review, beer, user$screen_name)
```
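To see what the split produces, here is a single made-up review tweet run through the same pattern; the dash-separated pieces map to review, beer/brewery, and URL:

```r
library(stringr)

# hypothetical review tweet: user text, then the default template, then the short URL
x <- "Smooth and roasty - Drinking a Victory at Sea by @BallastPoint - https://t.co/abc123"

# split on either a hyphen or an em dash; surrounding whitespace is kept
pieces <- str_split(x, "-|—")[[1]]

length(pieces)
#> [1] 3
trimws(pieces[1])
#> [1] "Smooth and roasty"
```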

Let’s remove the hashtags, the ‘Drinking a’, and any punctuation from the beer column, and create a new column.

```r
untp_text %<>%
  mutate(clean_beer = str_replace(beer, "#[a-zA-Z0-9]{1,}", "")) %>%
  mutate(clean_beer = str_replace_all(clean_beer, "Drinking a |Drinking an ", "")) %>%
  mutate(clean_beer = trimws(str_replace_all(clean_beer, "[[:punct:]]", "")))
```
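Traced on one hypothetical beer string, the three cleaning steps behave like this:

```r
library(stringr)

# hypothetical second piece of a split review tweet
beer <- " Drinking a Victory at Sea by @BallastPoint #photo"

step1 <- str_replace(beer, "#[a-zA-Z0-9]{1,}", "")               # drop the hashtag
step2 <- str_replace_all(step1, "Drinking a |Drinking an ", "")  # drop the template
trimws(str_replace_all(step2, "[[:punct:]]", ""))                # drop punctuation, trim
#> [1] "Victory at Sea by BallastPoint"
```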

Now that we have the reviews and beer/breweries separated, I wonder if there are any commonly reviewed beers that are shared on Twitter? (I want to keep the beer with the brewery at this point in case the same beer name appears at different breweries)

```r
untp_text %>%
  count(clean_beer) %>%
  arrange(desc(n)) %>%
  top_n(15) %>%
  dplyr::filter(clean_beer %notin% c("DC", "A")) %>%
  ggplot(aes(x = fct_reorder(clean_beer, n), y = n)) +
  geom_bar(stat = "identity", fill = wes_palette("BottleRocket2", 1)) +
  theme_minimal() +
  coord_flip() +
  labs(title = "Top 15 Beers for Untappd Twitter Shares",
       subtitle = "",
       y = "Count of Tweets",
       x = "")
```

Translating the additional review text as proxy for empirical reviews

These tweets don’t indicate the numerical value users rated each beer on the app’s 0.0 to 5.0 scale (0.25 increments), so I’m going to try to derive some of the users’ opinions about the top 20 beers from tweets that have additional review text, using the tidytext package. I believe there is going to be some sentiment linked to the Super Bowl, and some selection bias from users most likely sharing only preferred beers.

```r
top_beers <- untp_text %>%
  count(clean_beer) %>%
  arrange(desc(n)) %>%
  top_n(20) %>%
  dplyr::filter(clean_beer %notin% c("DC", "A")) %>%
  select(clean_beer)

# subset the data into topic (beers) and review (text) for tokenization
untp_text_tiny <- untp_text[, c("clean_beer", "review")]

# need to inner_join top beers with the reviews
merge_untp <- untp_text_tiny %>% inner_join(top_beers)

# tokenize the reviews and remove some of the specific football words
untp_text_tiny <- merge_untp %>%
  mutate(review = as.character(review)) %>%
  unnest_tokens(word, review) %>%
  dplyr::filter(word %notin% c("super", "superbowl", "superbowlsunday"))
```

```r
untp_sentiment <- untp_text_tiny %>%
  inner_join(get_sentiments("bing")) %>%
  count(clean_beer, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

head(untp_sentiment)
```

```
## # A tibble: 6 x 4
##   clean_beer                                           negative positive
##   <chr>                                                   <dbl>    <dbl>
## 1 Barrel                                                      2        1
## 2 Bourbon Barrel                                              3        3
## 3 Canadian Breakfast Stout CBS 2017 by foundersbrewing        0        2
## 4 Dallas Sucks by weyerbacher                                 1        1
## 5 Double Dry                                                  2        2
## 6 Drifter Pale Ale by widmerbrothers                          2        0
## # ... with 1 more variables: sentiment <dbl>
```

‘Drifter Pale Ale’ by Widmer Brothers Brewing has the lowest associated sentiment and is currently rated 3.43 on Untappd, while ‘Stone Loral & Dr. Rudi’s Inevitable Adventure’ by Stone Brewing has the highest associated sentiment and is currently rated 3.81.

Sentiment, when paired with frequency of rating, could definitely be an indicator to other beer drinkers or breweries of how people liked a particular beer, without the numerical rating system.

```r
ggplot(untp_sentiment, aes(x = clean_beer, y = sentiment, fill = clean_beer)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  theme_minimal() +
  labs(x = "")
```

Now for fun, I want to visualize the most common words from the reviews in a word cloud. The largest word being ‘beer’ is the most obvious, given these are Untappd reviews. The other words appear to be popular hashtags from the Super Bowl, such as ‘flyeaglesfly’.

```r
untp_text_tiny %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, colors = wes_palette("Zissou1")))
```

Notes: