Working with Spatial Databases and R

And how to trick your database into using the index.

A map of Time Warner Cable Serviceability from broadbandmap.gov

Intro

This blog post describes how we achieved much faster and more accurate estimates of historical serviceability for a prospect customer by using the GIS capabilities in SQL Server, and how we were able to speed up our queries through cunningly tricking the query optimizer into using the index.

Context

At Red Ventures, we have partners in several business verticals, one of which the home services industry — television, internet, phones, etc. We both market and acquire customers on behalf of these partners. One metric we’re very interested in is a prospect customer’s propensity to be serviceable for a particular service or partner — if you move into a new neighborhood, you have to figure out which of the cable companies has actually laid the wire that can reach your house and bring you internet.

Starting with the breakup of Ma Bell in 1982, ISP serviceability in the US is segregated geographically — your house can be serviced by either company Time Warner Cable or Comcast, but not both. We partner with several ISPs, and so when a new customer gives us their address we have two options:

Run a serviceability check for this address against multiple ISPs, and advise the customer as to who they are serviceable for. Query our historical data for an estimated serviceability propensity (“of the 5 closest houses to you that we know about, 4 are serviceable for company X, so we think there’s an 80% chance you too are serviceable for company X”).

Option (1) provides us with the definitive truth — they either are or are not serviceable for this company. However, option (2) is much faster — often service checks take 30-60 seconds to run, and if you’re waiting on a website watching that wheel spin for 60 seconds, you’re probably leaving that website pretty soon.

However, traditional SQL isn’t optimized for questions like “give me the 5 nearest houses to this address.” To do that you have to do calculations like the one below for every address in your database (you can probably get around some of the latency with an index on the lat and long columns on your table, but meh).

Disclaimer: I do not have any sort of DBA or software engineering background. But I google real good. So I started looking into GIS database capabilities to see if we could find databases that have functions like that built in.