Intro

I recently migrated a set of applications, scripts, databases and other odds-and-ends from the 'desktop' to an AWS 'serverless' solution, and now I never want to build another desktop app again. Building a desktop Python application with a Tkinter GUI is a really horrible experience, and I don't recommend it.

My AWS solution is actually a collection of a few different applications that can be chained together using API Gateways, SNS Topics and Lambda Functions. There are several ways to access the services, and several ways to get information back. More on that in a bit...

Geocoding

First, let me attempt to explain what geocoding is. The common use of the term describes the process of converting a street address to Latitude / Longitude coordinates. This process isn't trivial. It requires a huge database of every street address in the entire US, along with the coordinates for those addresses. Conservatively, there are a couple thousand such address lists for the US alone, updated monthly. Then it requires some clever querying to account for variations in street addresses, and for erroneous or missing address components supplied by an end user searching for an address. Google Maps handles this incredibly well: if you search for a partial address, it usually figures out what you were looking for.

Once you have those coordinates, you can figure out lots of other things about an address by using geospatial databases that contain the geometry data for places on a map: countries, states, counties, cities, Congressional districts, etc. Essentially, this comes down to queries against a geospatial database designed to determine whether coordinates fall inside geometric shapes, the distances between coordinates, and so on. I'm sure I really oversimplified this.
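
The core point-in-polygon test behind those queries can be sketched in plain Python (PostGIS does this far more robustly with its spatial functions); the "county" boundary here is a made-up rectangle, not real geometry:

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: count how many polygon edges a ray cast
    from the point crosses; an odd count means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the point's latitude?
        if (y1 > lat) != (y2 > lat):
            # Longitude where the edge crosses that latitude
            cross_lon = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < cross_lon:
                inside = not inside
    return inside

# A toy "county" boundary as (lon, lat) pairs -- invented for illustration
county = [(-97.0, 32.5), (-96.5, 32.5), (-96.5, 33.0), (-97.0, 33.0)]
print(point_in_polygon(-96.8, 32.7, county))  # True  (inside the box)
print(point_in_polygon(-95.0, 32.7, county))  # False (outside)
```

A real database does this against polygons with thousands of vertices, with spatial indexes so it doesn't have to test every shape.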

Finally, once you have determined the geographic area an address exists in, you can start layering in as many associated data points as you like. These are usually industry-specific, and feed into a larger workflow. For example, if you wanted to know how many people live in a US Congressional district, you could save that data point, and any time a geocoded address landed in that district you could retrieve that data.

My Industry-Specific Use Case

My geocoding solution has an industry-specific purpose, and there are data points that have to be retrieved in order to determine what to do next in the process. This is proprietary data, so for the sake of simplicity I'm skipping over it for now and focusing on the system architecture. In order to get access to that proprietary information, I have to start with a street address (or a huge list of addresses), get it geocoded to coordinate points, find those points in a geospatial database, and determine exactly where they are. This example will focus on some basic geographic data points: the US State, the County, and the incorporated City (or whether the address sits in an unincorporated county area instead).

GeocodeHero

The core component in the system is a little app I dubbed the GeocodeHero. This is a self-contained application with an API interface used to pass in street addresses, and send back all the final data points from the geocoding process.

This is the core application that handles the geocoding process from start to finish. Users submit addresses via the API, which is handled by the AWS API Gateway. The API sends the request down the chain to a Lambda function running Python (because I like Python). The Lambda does its magic and sends its response back out through the API Gateway.
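
The shape of that Lambda looks roughly like this. The request/response structure is the standard API Gateway proxy integration; the geocoding itself is stubbed out, and the field names are my own illustration, not the real payload:

```python
import json

def lambda_handler(event, context):
    """Minimal shape of a Lambda behind an API Gateway proxy integration.
    API Gateway delivers the request body as a JSON string in event["body"]
    and expects a dict with statusCode/body back."""
    payload = json.loads(event.get("body") or "{}")
    address = payload.get("address", "")

    # The real function geocodes here; stubbed out for illustration.
    result = {"address": address, "lat": None, "lng": None}

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(result),
    }

# Simulating what API Gateway would send:
event = {"body": json.dumps({"address": "123 Main St, Springfield"})}
response = lambda_handler(event, None)
print(response["statusCode"])  # 200
```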

This function performs two critical processes. The first is to translate the address into coordinate points. For this, I rely on outside service providers because I don't have a database of every street address in the US. I use Geocod.io as my primary provider, with the Google Maps API as a fallback. If the starting address is malformed or incorrect, I can compare the two services to see which one got a better match and blend their results together. If Geocod.io can't figure out what I'm looking for, I take the Google Maps suggestion and try again.
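
The fallback pattern looks roughly like this. The provider functions are placeholders for the real Geocod.io and Google Maps API calls, and the confidence threshold is invented for illustration:

```python
def geocode_with_fallback(address, primary, fallback, min_confidence=0.8):
    """Try the primary provider; if it fails or returns a low-confidence
    match, compare against the fallback provider and keep the better result."""
    try:
        result = primary(address)
    except Exception:
        result = None
    if result and result.get("confidence", 0) >= min_confidence:
        return result
    backup = fallback(address)
    # Keep the better-scoring match if both providers returned something.
    if result and result.get("confidence", 0) >= backup.get("confidence", 0):
        return result
    return backup

# Stand-ins for the real provider calls:
def fake_geocodio(addr):
    return {"lat": 32.70, "lng": -96.80, "confidence": 0.4, "source": "geocodio"}

def fake_google(addr):
    return {"lat": 32.71, "lng": -96.81, "confidence": 0.9, "source": "google"}

best = geocode_with_fallback("123 Main St", fake_geocodio, fake_google)
print(best["source"])  # google (its match scored higher here)
```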

The second part of the process is the geospatial database query. Once I have the coordinates for an address, I run it through my RDS PostgreSQL (PostGIS-enabled) database on AWS. My database is loaded with a variety of geometries and data points that can be retrieved and sent back in the response. For starters, the basic response gives the following information: the State, the County, the City (if it's an incorporated city), the nearest incorporated city (if it's in an unincorporated county area), and all the counties that border the county this address is in. Coming along for the ride are a bunch of other proprietary data points associated with the State, County and City.
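
The heart of that lookup is an ST_Contains query against the loaded geometries. The table and column names below are invented for illustration; the real schema is more involved:

```python
# Hypothetical schema: a `counties` table with a PostGIS `geom` column.
# ST_Contains tests whether the county polygon contains the point;
# ST_SetSRID/ST_MakePoint build the point in WGS84 (SRID 4326).
# Note ST_MakePoint takes (lng, lat) -- x before y.
COUNTY_LOOKUP_SQL = """
    SELECT state_name, county_name
    FROM counties
    WHERE ST_Contains(
        geom,
        ST_SetSRID(ST_MakePoint(%(lng)s, %(lat)s), 4326)
    );
"""

def find_county(cursor, lat, lng):
    """Run the point-in-polygon lookup through an open psycopg2 cursor."""
    cursor.execute(COUNTY_LOOKUP_SQL, {"lat": lat, "lng": lng})
    return cursor.fetchone()
```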

Technically, this isn't entirely 'serverless', as the RDS database is an instance on its own. The API Gateway and the Lambda are, however.

Map-Maker, Map-Maker, Make Me a Map

The next application in the chain is the Map Maker, which hasn't gotten a clever name yet. The purpose of this app is to create a single static HTML page that delivers a map of the requested address, with specific geometry layers to help the end user understand where this address exists.

I chose to use a package in Python called Folium which can be used to pull geometry data directly from my RDS database and plot it on a Leaflet map. This map pulls in base layers from a variety of sources, and there is a ton of customization available. The best part is that it can be served up as a static page via S3. I don't need to create a complicated web front end. I can just send the URL to the map along with the rest of the geocoding results. In order to minimize the storage on S3, I set my files to auto expire and "die" after a day.
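
The one-day expiry is just an S3 lifecycle rule on the map bucket. The bucket name, rule ID and key prefix below are placeholders:

```python
# S3 lifecycle configuration that expires objects one day after creation.
# Applied once to the bucket that holds the generated map pages.
lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-maps-after-one-day",  # placeholder rule ID
            "Filter": {"Prefix": "maps/"},      # assumed key prefix
            "Status": "Enabled",
            "Expiration": {"Days": 1},
        }
    ]
}

# With boto3 this would be applied roughly like:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-map-bucket",  # placeholder bucket name
#     LifecycleConfiguration=lifecycle_config,
# )
```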

The great part about making this app (and others) modular is that I can use it in a variety of ways. The underlying queries and the geometry that gets returned can be customized any number of ways based on requirements. If I want to send a request to it directly and just make a map from a set of coordinates I can do that. In this scenario, my other applications are the ones making the API requests as part of the chain.

An API is cool. But a Slack Bot is cooler.

I wanted to give access to the entire system to my Slack users who weren't interfacing via an API call. This would allow them to submit addresses via a 'slash' command and the Slack Bot would automatically retrieve all the geocoded information and send it back to the user. On the Slack account side, this involved setting up an "App", a "Bot User" and a "Slash Command" that all work together.

The AWS side of the Slack solution ended up being the trickiest to put together. The Slash Command is easy to link to AWS because the only thing required is a URL endpoint for Slack to talk to. That endpoint is just an API Gateway URL in AWS. Simple enough. Normally I would drop in a Lambda to receive the request, do all the processing and send the response back. Slack, however, requires that a response come back from an outgoing Slash Command within 3 seconds, or it considers the request a failure and times out. A response, in this case, can simply be a 200 (OK) status code. Best case, I can get my full processing cycle inside of 2 seconds, but I thought that cut it too close and built in a failsafe.

The Laziest Lambda Ever

The API Gateway receives the request from the user and sends it along to a Lambda function, as usual. This Lambda does something different, though. It hands off the request entirely to an AWS SNS block, then sends back a message to the Slack user that the request was received. The Lambda literally says "OK, I received it, but it's not my job" and passes it down the line. It's the laziest Lambda ever. It's fast, though - the response is instant and Slack is happy.
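
A sketch of the lazy Lambda, with the SNS publish injected so the shape is clear; in the real thing the publisher is `boto3.client("sns").publish` and the topic ARN (a placeholder here) comes from configuration:

```python
import json

def lazy_handler(event, context, publish=None):
    """Hand the raw Slack request off to SNS and answer immediately,
    well inside Slack's 3-second response window."""
    if publish is None:
        import boto3
        sns = boto3.client("sns")
        publish = lambda msg: sns.publish(
            TopicArn="arn:aws:sns:...:geocode-requests",  # placeholder ARN
            Message=msg,
        )
    publish(json.dumps({"slack_payload": event.get("body", "")}))
    # Slack treats any timely 200 as success; the real answer comes later.
    return {"statusCode": 200, "body": "Working on it..."}

# Exercising it with a captured publish instead of a real SNS client:
sent = []
resp = lazy_handler({"body": "text=123+Main+St"}, None, publish=sent.append)
print(resp["statusCode"])  # 200
```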

The SNS block that receives the forwarded request from the lazy Lambda is basically another lazy middle-man. It holds the request (published to what SNS calls a Topic) and lets other Lambdas know it has it. Those Lambdas are "subscribers," in that they get notified any time the SNS middle-man has something to work on. This chain technique is a clever way to set up a bunch of little worker-bee Lambdas to do stuff.

The hard working Lambda is the one that subscribes to the SNS Topic. That Lambda takes the request, unpacks it, and makes the API connections to the other applications to get the data it needs. It packages it all up and sends a Slack message back to the Slack user.
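
The worker Lambda receives the standard SNS event envelope, with the published message nested under Records[n].Sns.Message. The unpacking looks roughly like this, with the downstream API calls and the Slack posting stubbed out:

```python
import json

def worker_handler(event, context, post_to_slack=print):
    """Unpack the SNS envelope, do the real work, and message the user.
    post_to_slack is injected here; the real one POSTs to the Slack API."""
    results = []
    for record in event["Records"]:
        payload = json.loads(record["Sns"]["Message"])
        address = payload.get("slack_payload", "")
        # The real Lambda calls the GeocodeHero and Map Maker APIs here;
        # stubbed to a placeholder string for illustration.
        result = f"Geocoded: {address}"
        post_to_slack(result)
        results.append(result)
    return results

# A minimal fake of the SNS event structure:
event = {"Records": [{"Sns": {"Message": json.dumps(
    {"slack_payload": "text=123+Main+St"})}}]}
messages = []
worker_handler(event, None, post_to_slack=messages.append)
print(messages[0])  # Geocoded: text=123+Main+St
```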

The Slash Command can be initiated from anywhere inside Slack since the bot is an application that is "always on". Results from the geocode process are then posted on a Channel called "Geocoding". Doing it this way allows the user to review past geocode results.

The Dreaded Cold Start

If you've ever used a Lambda, you have probably come across the "cold start" problem, or read about it online. Lambdas are serverless container instances that don't behave like 'traditional' EC2 cloud systems. They can be turned on and off.

The awesome thing about Lambdas is AWS automatically manages them for you. The not-so-awesome thing about Lambdas is AWS automatically manages them for you.

Lambdas can scale up, but they can also completely shut down. AWS will turn off instances that aren't being used, and as soon as they are needed, AWS will turn them back on. This "cold start" process is not instant. It can take 5-10 seconds on the low side, and an entire chain of Lambdas that need waking up could take 30 seconds to a full minute before it's functional. This can be a real drag on the user experience. Luckily, there is a pretty simple solution: an AWS CloudWatch Event can be set up to "ping" a Lambda on a schedule. In my case, I set up an Event to keep the Lazy Lambda awake at all times. This Lambda never shuts down and never has to go through a cold start. This solution works perfectly for a small-scale deployment like this.
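
On the Lambda side, the keep-warm ping just needs to be recognized and short-circuited so the scheduled invocations stay cheap. CloudWatch scheduled events arrive with "aws.events" as their source, which makes them easy to spot:

```python
def handler(event, context):
    """Short-circuit CloudWatch keep-warm pings before doing real work.
    Scheduled CloudWatch Events invoke the function with source=aws.events."""
    if event.get("source") == "aws.events":
        return {"warmed": True}  # nothing to do; we're just staying awake
    # ...normal request handling would go here...
    return {"warmed": False, "handled": True}

print(handler({"source": "aws.events"}, None))  # {'warmed': True}
print(handler({"body": "real request"}, None))  # takes the real work path
```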

Putting all the pieces together, we have the full architecture of the GeocodeHero, Slack Bot and Map Maker, along with a traditional API service. Pieces are upgradeable on their own, modular, and the whole system ends up being very flexible.

Upcoming Features in the Works

The traditional API access into the application allows for batch address geocoding, and a final spreadsheet can be compiled by the end-user, client-side. One major feature being rolled in next is the ability to upload a file with addresses, and get back a nice polished Excel spreadsheet with all the data points. This file upload and delivery can be managed a number of different ways, and you can be sure the Slack Bot will be one of them.

Wrap Up

Hope you found this informative or entertaining. I find working with the serverless components on AWS to be a lot of fun, and it's relatively easy to understand what is happening. A more traditional EC2 deployment with the proper load balancers and auto scaling groups would certainly accomplish all of this too. While that setup is more complicated, the individual Lambdas could likely be rolled into a single application, simplifying the code-base a bit. Maybe I'll try that for my next build.