A scatter plot of every bar in Wisconsin (there are a lot of them).

I know, I know... it's a population map, but anyone who has been to Wisconsin knows this is an apt way to describe the state's population.

A blog has to start somewhere. In the future, I hope to show interesting, provocative, or educational data analyses, but for now, this is just a recent fun project. My goal was to make something similar to this map of UK bars. In this post, I'll walk you through the steps I took to make it.

Data Sourcing

To get a list of bars and their coordinates, I eventually landed on collecting the information from yellowpages.com. The search results seemed to be more complete than other options like the Yelp API. I was okay with the occasional missing or inaccurate entry, and lacking an "official" list of bars in Wisconsin, this approach would have to suffice.

A simple search of "taverns" and "Wisconsin" returned about 6,300 results, so it seemed like I'd have a good list. Unfortunately, the first roadblock came when every page after 100 simply displayed the results from page 100. I guess broad searches are only tolerated up to a point. The search area needed to be smaller, so I decided to take a systematic approach and go by ZIP code. Of course this introduced duplicate results, but those were easily removed later.

Retrieving the latitude and longitude took some digging, but the information was there in the HTML: each results page passes the coordinates to Google to render its map. I was able to pull them out of the JavaScript on the page.

The Code

The scraping was done in Python with Beautiful Soup.

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup
import json
import re
import time
import csv

# Import a downloaded list of Wisconsin ZIP codes
zips = pd.read_csv('WI Zips.csv')['ZIP Code']
session = requests.session()
```

I had a couple of helper functions. The biggest issue was extracting coordinates from the JavaScript, but that was reasonably straightforward.

```python
def get_yp_url(zipcode, page):
    url = 'http://www.yellowpages.com/search?search_terms=taverns&geo_location_terms={}&page={}'
    return url.format(int(zipcode), page)


def get_coords_from_javascript(scripts):
    '''
    :param scripts: a list of javascript blocks from webpage
    :return: List of geographic coordinates
    '''
    locs = []  # List of locations to be returned
    # Regex to find the javascript with lat/long information
    pattern = re.compile(r'YPU = (.*?);')
    for script in scripts:
        if len(pattern.findall(str(script.string))) == 1:
            data = pattern.findall(str(script.string))
            down = json.loads(data[0])
            try:
                locs = down['expandedMapListings']
                if len(locs) == 0:
                    break
            except KeyError:
                break
    return locs
```

The outer container for writing our data to csv:

```python
writefile = 'Wisconsin_lat_long.csv'
with open(writefile, 'w') as f1:
    writer = csv.writer(f1, delimiter=',', lineterminator='\n')
    # The scraping loop below runs inside this `with` block
```

And then I looped through each ZIP code, performed a search, and went through all of the result pages. Latitudes and longitudes were written to CSV, and I added a time delay for slightly more responsible web scraping.

```python
for zipcode in zips:
    for page in range(1, 30):
        url = get_yp_url(zipcode, page)
        print(url)
        s = session.get(url)
        soup = BeautifulSoup(s.text, 'lxml')
        # Get all javascript blocks from page
        scripts = soup.findAll('script')
        locs = get_coords_from_javascript(scripts)
        if len(locs) == 0:
            break
        for loc in locs:
            writer.writerow([loc['name'], loc['zip'], loc['latitude'], loc['longitude']])
            print(loc['name'], loc['zip'], loc['latitude'], loc['longitude'])
        print('{}-------------{}'.format(zipcode, page))
        time.sleep(2)
```

This wasn't an efficient way to get the data by any means, but it worked well enough.
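Much of the wasted effort came from overlapping ZIP-code searches returning the same bars again and again. If I were doing it over, duplicates could be skipped at write time by tracking a set of already-seen listings. A quick sketch with made-up listings (the real dedup happens later in pandas):

```python
# Made-up listings illustrating the same bar appearing under two ZIP searches
listings_by_zip = {
    53001: [{'name': "Greg's Tap", 'latitude': 43.618343, 'longitude': -87.951965}],
    53002: [{'name': "Greg's Tap", 'latitude': 43.618343, 'longitude': -87.951965},
            {'name': "Nap's Place", 'latitude': 43.645, 'longitude': -88.002}],
}

seen = set()
unique = []
for zipcode, locs in listings_by_zip.items():
    for loc in locs:
        key = (loc['name'], loc['latitude'], loc['longitude'])
        if key in seen:
            continue  # already captured by an earlier ZIP search
        seen.add(key)
        unique.append(loc)

# unique now holds one entry per bar, ready for writer.writerow(...)
```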

Data Cleaning and Plotting

Now that I had a csv with geographic information, there were only a couple more steps. The data had plenty of missing values and duplicates, which needed to be removed:

| Name | ZIP | Latitude | Longitude |
|---|---|---|---|
| Lynn's Creekside Bar & Grill | 53001 | 43.59045 | -88.050026 |
| Times Remembered Inc | 53001 | 43.615665 | -87.952675 |
| The Whey Side Saloon Hall & Charcoal Grill | 53001 | 43.61909 | -87.952675 |
| Greg's Tap | 53001 | 43.618343 | -87.951965 |
| Grandma & Grandpa's | 53001 | | |
| Lake House Sports Pub & Gril | 53073 | | |
| Nap's Place | 53073 | | |
| Laack's Tavern & Ballroom | 53085 | | |
| Racers Hall | 53073 | | |
| Harbor Lights Resort Pub | 53011 | 43.64997 | -88.009674 |
| BENN THERE PUB | 53011 | 43.65839 | -88.006744 |
| Sipp's Bar and Grill | 53011 | 43.65839 | -88.006744 |

I also filtered the results by legitimate Wisconsin ZIP codes, since some sneaky bars in Minnesota and Illinois were trying to get in.
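The ZIP filter below is a simple range check, which keeps anything between 53000 and 55000. A stricter alternative would be to match against the downloaded ZIP list directly. A sketch with made-up rows, where the hypothetical 53999 stands in for a ZIP that falls inside the range but isn't actually assigned to Wisconsin:

```python
import pandas as pd

# Made-up rows for illustration; the real data comes from the scraped CSV
df = pd.DataFrame({'Name': ['Isthmus Tap', 'Ghost Bar', 'Border Bar'],
                   'Zip': [53703, 53999, 55101]})

# Coarse range filter: keeps 53999 even though it isn't a real WI ZIP
in_range = df.query('Zip > 53000 & Zip < 55000')

# Stricter: keep only ZIPs that appear in the downloaded list
wi_zips = {53703}  # in practice: set(pd.read_csv('WI Zips.csv')['ZIP Code'])
strict = df[df['Zip'].isin(wi_zips)]
```

For this dataset the range check was good enough, since neighboring states' ZIP prefixes mostly fall outside 53000-55000 anyway.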

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = (pd.read_csv('Wisconsin_lat_long.csv',
                  delimiter=",",
                  encoding="windows-1252",
                  header=None,
                  names=['Name', 'Zip', 'Lat', 'Long'])
        .drop_duplicates()
        .dropna()
        .query('Zip > 53000 & Zip < 55000'))
```

Once the data was clean, I fiddled with plotting for quite a while before I had something I was happy with. I had a few requirements:

The colors needed to be green and gold (naturally)

I wanted a gradient effect for each point

I wanted the rural taverns to be visible without the urban areas becoming over-saturated messes.

I ended up "cheating" to get the gradient by plotting multiple times with different transparencies and point sizes. Here's the detail when you zoom in on Madison, WI. You can see the gradient effect as well as the isthmus between lakes Mendota and Monona.

Detailed view of the bars in Madison, Wisconsin

The plotting was done with the following commands:

```python
green = r'#203731'
gold = r'#FFB612'

plt.figure(figsize=(120, 120))
plt.subplot('111', axisbg=green)
plt.scatter(df.Long, df.Lat, alpha=3/10, lw=0, edgecolors=None, s=200, color=gold, marker="o")
plt.scatter(df.Long, df.Lat, alpha=5/10, lw=0, edgecolors=None, s=135, color=gold, marker="o")
plt.scatter(df.Long, df.Lat, alpha=7/10, lw=0, edgecolors=None, s=45, color=gold, marker="o")
plt.scatter(df.Long, df.Lat, alpha=9/10, lw=0, edgecolors=None, s=20, color=gold, marker="o")
plt.scatter(df.Long, df.Lat, alpha=10/10, lw=0, edgecolors=None, s=12, color=r'#FFFFFF', marker="o")
plt.xlim([-94, -86])
plt.ylim([42, 47])
plt.show()
```

And there we have it:

For the finishing touches I added a simple banner in Photoshop. I think it turned out well, and it will look great on canvas, probably in my basement. The print is on order as I post this! I'm certain that not every state would be recognizable from a map of its bars.

Next steps:

Get this printed and on my wall.

Create the ultimate Wisconsin pub crawl as a travelling salesman problem. I say that in jest. There might be a few too many points to make it computationally feasible.

More data posts.
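On the pub-crawl idea: an exact traveling salesman solution over thousands of bars really would be infeasible, but a greedy nearest-neighbor heuristic produces a (far from optimal) route cheaply. A sketch with toy coordinates, assuming the real input would be the scraped lat/long pairs:

```python
import math

def nearest_neighbor_route(points):
    """Greedy nearest-neighbor tour: always visit the closest unvisited point.
    Fast and simple, but can be well short of the optimal TSP route."""
    remaining = list(points)
    route = [remaining.pop(0)]
    while remaining:
        last = route[-1]
        nxt = min(remaining, key=lambda p: math.dist(last, p))
        remaining.remove(nxt)
        route.append(nxt)
    return route

# Toy (lat, long) pairs; nearby Madison-ish points get chained first
bars = [(43.00, -89.40), (44.50, -88.00), (43.05, -89.41), (43.10, -89.40)]
route = nearest_neighbor_route(bars)
```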

UPDATE: I think this really ties the room together :)

Thanks for reading.

The code is available here.