Context

During the summer of 2019, I wrote a script to gather data from two Google API services. I needed to estimate the market size of Chinese restaurants in the United States. And it cost me a LOT of money.

The Plan

The scripting plan was as follows:

1. Spin up an Ubuntu Compute Engine instance on GCP, since I didn't want to lose sleep running the script on my own computer.
2. Use Google's Places API to perform searches on specific text queries, for instance "Chinese restaurants in Boston".
3. Accumulate the data in a Python dictionary, convert it into a pandas DataFrame, and export it to CSV for visualization in Tableau.
4. Use the visualizations to make strategic recommendations.
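The Text Search request described above can be sketched like this. This is a minimal illustration, not the script itself; the API key and query string are placeholders, not real values:

```python
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"  # placeholder -- not a real key
params = {"query": "Chinese restaurants in Boston", "key": API_KEY}
base = "https://maps.googleapis.com/maps/api/place/textsearch/json"
url = base + "?" + urlencode(params)  # urlencode escapes the spaces
print(url)
# A real call would then be: requests.get(url).json()["results"]
```

Each response page carries at most 20 places, which matters a great deal later in this story.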

Below are some of the visualizations generated in Tableau, and here are the data and the script. Yes, I've just given you $1,200 worth of data. For free. You're welcome.

Looks fine, right? How could this information cost so much money?

Chinese restaurants around the Greater Boston Area

Chinese restaurants around the Manhattan New York Area

Chinese restaurants around the San Francisco Bay Area

Well, these visualizations were generated after I realized the damage.

The Script

Here is the code I used to generate the data. Can you spot the error in logic?

```python
# import libraries
import requests
import pandas as pd
import numpy as np
from time import sleep
from support import nbs

# gapi (the Places API key) is assumed to be defined elsewhere in the original script

# Initializing
restaurants = []
rating = []
reviews = []
priceLevel = []
address = []
placeID = []
lat = []
lng = []

for country in ["san fransisco", "boston", "los angeles", "new york"]:
    # request setup
    ts_base = "https://maps.googleapis.com/maps/api/place/textsearch/json?"
    ts_query = "query=" + "restaurants in chinatown {}&".format(country).replace(" ", "%20")
    ts_location = "location=42.3500641,-71.0624052&radius=50000&type=restaurant&"
    ts_other = "&key=" + gapi
    ts_nextPage = ""
    ts_gurl = ts_base + ts_query + ts_location + ts_other
    ts_response = requests.get(ts_gurl).json()
    print(ts_gurl)
    print("Starting textsearch...")

    for j in range(0, 5):
        print("- ts page {}".format(j))
        print("\t- Total places: {}".format(len(placeID)))

        # Extract through initial list
        for i in ts_response["results"]:
            if i["place_id"] not in placeID:
                rating.append(i["rating"]) if "rating" in i else rating.append(np.nan)
                restaurants.append(i["name"]) if "name" in i else restaurants.append(np.nan)
                reviews.append(i["user_ratings_total"]) if "user_ratings_total" in i else reviews.append(np.nan)
                priceLevel.append(i["price_level"]) if "price_level" in i else priceLevel.append(np.nan)
                address.append(i["vicinity"]) if "vicinity" in i else address.append(np.nan)
                placeID.append(i["place_id"]) if "place_id" in i else placeID.append(np.nan)
                lat.append(i["geometry"]["location"]["lat"]) if "geometry" in i else lat.append(np.nan)
                lng.append(i["geometry"]["location"]["lng"]) if "geometry" in i else lng.append(np.nan)

        # Perform nearby search
        placeID, restaurants, rating, priceLevel, address, lat, lng, reviews = nbs(
            gapi, placeID, restaurants, rating, priceLevel, address, lat, lng, reviews)

        # Iterate to next page
        if "next_page_token" in ts_response:
            sleep(np.random.normal(5, 0.1, 1))
            ts_nextPage = "&pagetoken=" + ts_response["next_page_token"]
            ts_gurl = ts_base + ts_other + ts_nextPage
            ts_response = requests.get(ts_gurl).json()
        else:
            print(" - ts next_page_token not found in {}, response: {}".format(j, ts_response["status"]))
            break

    print("text search scraping done...!")

    data_ts = {"Name": restaurants, "Rating": rating, "Reviews": reviews,
               "PriceLevel": priceLevel, "Address": address, "placeId": placeID,
               "lat": lat, "lng": lng}
    dfts = pd.DataFrame(data=data_ts)
    dfts.to_csv("{}_chinatown.csv".format(country).replace(" ", ""))
```

A Critical Error

The most critical error came from a constraint of the Google Places Text Search API. When you submit a text query with the relevant parameters, the API returns a maximum of 60 results (three pages of up to 20). Nothing more.
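The cap can be illustrated with a toy paging loop. The `fetch_page` function below is a stand-in for the real endpoint (no network involved); it mimics the documented behavior of 20 results per page and no `next_page_token` after the third page:

```python
def fetch_page(token=None):
    """Stand-in for the Text Search endpoint: 20 results per page,
    and no next_page_token after the third page."""
    page_number = {"": 1, "t1": 2, "t2": 3}[token or ""]
    page = {"results": ["place_{}_{}".format(page_number, i) for i in range(20)]}
    if page_number < 3:
        page["next_page_token"] = "t{}".format(page_number)
    return page

results, token = [], None
while True:
    page = fetch_page(token)
    results.extend(page["results"])
    token = page.get("next_page_token")
    if token is None:
        break

print(len(results))  # 60 -- the ceiling, no matter how many places exist
```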

It is plausible that some cities do not have more than 60 Chinese restaurants, but I couldn't make that assumption for big metropolitan areas like Los Angeles.

To get around this, I modified the script to use Google's Nearby Search API, which lets you search around a geographical coordinate. The script would first gather the initial list of up to 60 locations from the Text Search, then run a Nearby Search around each of those restaurants until no new results came back within a 1-mile radius.
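A Nearby Search request can be sketched the same way as the Text Search one. Again this is an illustration, not the original script: the key, coordinates, and keyword below are placeholders:

```python
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"            # placeholder
lat, lng = 42.3500641, -71.0624052  # coordinates of one seed restaurant

base = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"
params = {
    "location": "{},{}".format(lat, lng),
    "radius": 1609,        # metres -- roughly a 1-mile radius
    "type": "restaurant",
    "keyword": "chinese",  # illustrative keyword, not the script's exact one
    "key": API_KEY,
}
url = base + "?" + urlencode(params)
print(url)
# Each page of requests.get(url).json()["results"] holds up to 20 places
```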

The Result

The result was that the script generated almost 1,000 times more data than expected. Why did this happen?

The for loop essentially iterated over an ever-growing list of locations maintained by a support function, nbs. Below is that function; if you follow the logic, you'll see that it performs a Nearby Search for every item on the list, on every iteration of the doubly nested for loop.
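To see why this compounds so quickly, here is a toy model of the runaway loop. The growth rate of 5 new places per search is an assumption for illustration, not a measured number:

```python
# Each Text Search page triggers a Nearby Search for EVERY place
# collected so far, and each Nearby Search adds more places.
NEW_PLACES_PER_SEARCH = 5  # assumed growth rate, for illustration only

places = 20    # seeds from the first Text Search page
api_calls = 1  # the initial Text Search request
for page in range(5):        # mirrors the outer range(0, 5) page loop
    api_calls += places      # one Nearby Search per known place
    places += places * NEW_PLACES_PER_SEARCH

print(places, api_calls)  # 155520 places, 31101 billable requests
```

Even with modest assumptions, five iterations turn 20 seed restaurants into six-figure place counts and tens of thousands of billable requests.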

```python
# support.py -- relies on the same imports as the main script
import requests
import numpy as np
from time import sleep


def nbs(gapi, placeID, restaurants, rating, priceLevel, address, lat, lng, reviews):
    # nearbysearch, assume 100 metre radius
    nbs_base = "https://maps.googleapis.com/maps/api/place/nearbysearch/json?"
    nbs_key = "keyword=" + "chinatown restaurants".replace(" ", "%20")
    nbs_other = "&key=" + gapi
    nbs_nextPage = ""

    for a in range(len(placeID)):
        # - For each placeID
        # - Do nbs
        # - Iterate through each page of nbs, insert non-duplicates
        nbs_location = "location={},{}&radius=100&type=restaurant&".format(lat[a], lng[a])
        nbs_gurl = nbs_base + nbs_location + nbs_key + nbs_other
        nbs_response = requests.get(nbs_gurl).json()

        # Iterate through each page of nbs
        for b in nbs_response["results"]:
            # If the place_id of the current result is not in the list, add it
            if b["place_id"] not in placeID:
                rating.append(b["rating"]) if "rating" in b else rating.append(np.nan)
                restaurants.append(b["name"]) if "name" in b else restaurants.append(np.nan)
                reviews.append(b["user_ratings_total"]) if "user_ratings_total" in b else reviews.append(np.nan)
                priceLevel.append(b["price_level"]) if "price_level" in b else priceLevel.append(np.nan)
                address.append(b["vicinity"]) if "vicinity" in b else address.append(np.nan)
                placeID.append(b["place_id"]) if "place_id" in b else placeID.append(np.nan)
                lat.append(b["geometry"]["location"]["lat"]) if "geometry" in b else lat.append(np.nan)
                lng.append(b["geometry"]["location"]["lng"]) if "geometry" in b else lng.append(np.nan)

        if "next_page_token" in nbs_response:
            sleep(np.random.normal(5, 0.1, 1))
            nbs_nextPage = "&pagetoken=" + nbs_response["next_page_token"]
            nbs_gurl = nbs_base + nbs_other + nbs_nextPage
            for c in requests.get(nbs_gurl).json()["results"]:
                if c["place_id"] not in placeID:
                    rating.append(c["rating"]) if "rating" in c else rating.append(np.nan)
                    restaurants.append(c["name"]) if "name" in c else restaurants.append(np.nan)
                    reviews.append(c["user_ratings_total"]) if "user_ratings_total" in c else reviews.append(np.nan)
                    priceLevel.append(c["price_level"]) if "price_level" in c else priceLevel.append(np.nan)
                    address.append(c["vicinity"]) if "vicinity" in c else address.append(np.nan)
                    placeID.append(c["place_id"]) if "place_id" in c else placeID.append(np.nan)
                    lat.append(c["geometry"]["location"]["lat"]) if "geometry" in c else lat.append(np.nan)
                    lng.append(c["geometry"]["location"]["lng"]) if "geometry" in c else lng.append(np.nan)
        else:
            print(" - nbs next_page_token not found in {}, response: {}".format(a, nbs_response["status"]))
            break

    return [placeID, restaurants, rating, priceLevel, address, lat, lng, reviews]
```

Even looking at the code today, after cringing in my sleep for many nights, it’s still confusing.

The $1200 Learning Opportunity

If I had to do the whole thing again, I would consider the following:

- Instead of jumping straight into the code, which is something I love to do, I would get over my ego and take the time to thrash out the pseudocode first.
- Using GCP is awesome: you can run a script without worrying about losing Wi-Fi, a slow computer, or nights' worth of sleep, because it all runs in the cloud. The downside is that you really have no idea what it's doing unless you're absolutely sure the script is safe.
- I would head over to Budgets & Alerts and set up a budget before doing anything on GCP.

Set up a GCP budget before having fun

Hope this was helpful to you, and please share it with your friends so that they don't make the same mistake!

Finally, if something like this happens to you and you're a student, get in touch with Google and ask them for a student discount. Haha.


