One of the things I love about my job at Dia&Co is that I’ve had personal interactions with the founders, Nadia Boujarwah and Lydia Gilbert, on several occasions. (My friend at Google can’t say the same about Sergey and Larry!)

Shortly after I started at Dia, Lydia presented a slide deck that provided a high-level overview of our business. I was struck by one of her charts, which compared two “heat maps”: one showing the spatial distribution of the U.S. population, the other the spatial distribution of our customers. The distributions were visually similar, underscoring her point that Dia is a company of national reach that serves our customer wherever she lives.

But, since I like numbers as well as visuals, I started to wonder: how would one prove mathematically or statistically that these two distributions were indeed the same? And, if they weren’t, could the differences provide us better insight into our customer base?

Approach

The problem can be cast as a traditional supervised learning problem in machine learning. There are as many observations as there are geographic regions. For instance, grouping at the state level, there would be 50 observations, whereas grouping at the zip code level, there would be ~43k observations. Within each geographic region, the goal is to predict the number of customers residing within the region, using an indicator variable (representing the region itself) as well as features about the region (e.g., population, median income, median age, etc.). If the coefficient of the indicator is significant, this implies that our customers are over- or under-represented in this geographic region, relative to what our model expects. Since the model prediction is based on the population of the region, this provides an indirect means of assessing whether the two distributions (population and customers) are similar. Likewise, if the coefficient of a feature (e.g., median age) is significant, then this might imply that our customers are older or younger than the U.S. population at large. Thus, such a model could provide better insight into our customer base.

But there are also a few reasons why a naive machine learning approach might not work. First, we are trying to predict a non-negative, discrete value (“number of customers”), so many traditional techniques, such as linear regression, are inappropriate. Second, we’d like to normalize by the “reference” distribution that we’re trying to compare to. If we believe that the number of customers in a particular region should be proportional to the population of that region, then the prediction should somehow be normalized by the population, instead of using population as a machine learning-style feature in the model. Third, our technique should be as suitable for “small” data (e.g., 50 observations) as for “medium” data (e.g., 43k observations). Fourth, it should automatically produce confidence/credible intervals, so that we can rigorously evaluate the significance of various features. And fifth, it would also be nice to be able to pool geography at multiple levels (so, for instance, including state-level as well as zip code-level predictors) without inducing high variance in our coefficient estimates.

Although this represents a stringent list of requirements, social scientists and statisticians have developed a robust set of tools for tackling this problem. The basic idea is to use an appropriate generalized linear model (in this case, either Poisson or binomial) plus Bayesian inference. For example, Andrew Gelman and coworkers used this approach to analyze the practice of “stop and frisk” in New York City. They investigated whether the spatial distribution of stops matched the spatial distribution of the previous year’s arrests, and what features of the geographic region (police precinct) were responsible for the discrepancies that existed.

Case Study

I’ll illustrate the approach with an example close to my heart. I live in East Harlem, and I’ve noticed (anecdotally) that there seem to be very few proper grocery stores (as opposed to convenience stores/bodegas) within walking distance of my apartment. In fact, the NYC “Green Cart” program exists to provide underserved areas (like East Harlem) access to fresh fruits and vegetables. So my hypothesis was that East Harlem had too few grocery stores, relative to the number of people living there.

This problem struck me as being highly similar to the customer distribution problem. There are two heat maps: one of the NYC population, and the other of NYC grocery stores. Are the grocery stores where they “should” be? And, if they aren’t, why not?

Data

A few words on the data: I limited my study to the borough of Manhattan. I obtained grocery store data from Yelp, using their excellent Python API. I attempted to limit the search to “proper” grocery stores, and ended up with 531 stores in Manhattan. (Of course, I’m ignoring the differences between these stores, such as quality and selection, an issue I’ll touch upon at the end of this post.) I merged that data with Census data (median income, population) at the zip code level. I also ignored zip codes with fewer than 100 people: these seem to be areas that get a lot of mail, but where hardly anyone lives. This left me with 44 zip codes (observations) for the analysis — not a lot of data! Finally, I used shape files for zip code boundaries from the Census to plot my results.
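As a sketch of this preparation step (with made-up values and hypothetical column names, not the actual files from the repo), the store counts and Census features can be joined at the zip code level like so:

```python
import pandas as pd

# Tiny stand-ins for the Yelp and Census tables (values are made up)
stores = pd.DataFrame({"zip_code": ["10029", "10029", "10017", "10171"]})
census = pd.DataFrame({
    "zip_code": ["10029", "10017", "10171", "10065"],
    "population": [76_000, 16_000, 50, 30_000],  # 10171: mail-heavy, few residents
    "median_income": [36_000, 110_000, 95_000, 120_000],
})

# Count stores per zip code, then attach Census features,
# keeping zip codes that have zero stores.
counts = stores.groupby("zip_code").size().rename("n_stores").reset_index()
df = census.merge(counts, on="zip_code", how="left")
df["n_stores"] = df["n_stores"].fillna(0).astype(int)

# Drop zip codes where hardly anyone lives (population < 100)
df = df[df["population"] >= 100].reset_index(drop=True)
print(df[["zip_code", "n_stores"]])
```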

Model

I implemented the analysis as a Poisson regression: I’m trying to predict a discrete count variable (the number of stores in a zip code) using a Poisson distribution. The rate parameter of the Poisson is where the regressors enter. There is an indicator variable, accounting for the effects of the zip code alone, as well as a linear dependence on a feature: the (log) median income. I examined models both with and without the income term. The “offset”/“base rate” of the Poisson is the population of the zip code — thus, if the two distributions were indeed proportional, the Poisson rate parameter would be identical for all zip codes.
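In symbols (a sketch of the setup just described, with y_i the store count and pop_i the population of zip code i):

```latex
y_i \sim \mathrm{Poisson}(\lambda_i), \qquad
\log \lambda_i = \log(\mathrm{pop}_i) + \alpha_i + \beta \, \log(\mathrm{income}_i)
```

Here log(pop_i) is the offset, α_i is the indicator coefficient for zip code i, and the model without the income term simply drops the β term. If the store and population distributions were exactly proportional, all the α_i would be equal.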

Code

In pymc3, the model is simple to code up (after the data has been cleaned up and placed in a pandas DataFrame). (All of the data is available on GitHub, and the model analysis can be found in this Jupyter notebook.) Here’s an example for the model with the income term:

Results

After running the model, I examined the credible intervals of the indicator variable for my zip code. It revealed that there was nothing special about where I live. (Rats!) But plotting the means for all of the indicator variables shows an interesting spatial pattern (although the effect is weaker after controlling for income):
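The check itself is just a quantile computation on the posterior draws. For example, with a hypothetical array of samples for one zip code’s indicator coefficient:

```python
import numpy as np

# Hypothetical posterior draws for one zip code's indicator coefficient
rng = np.random.default_rng(0)
draws = rng.normal(loc=0.1, scale=0.3, size=4000)

lo, hi = np.percentile(draws, [2.5, 97.5])  # 95% credible interval
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")
print("special?", not (lo <= 0.0 <= hi))  # zero inside the interval -> nothing special
```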

Model without income term (whiter = fewer stores; redder = more stores)

Model with income term (whiter = fewer stores; redder = more stores)

Here, the redder areas indicate zip codes that are over-indexed in grocery stores, relative to their population, and the whiter areas are those that are under-indexed. It appears that downtown and midtown Manhattan are over-indexed, while uptown Manhattan (including places such as the Upper West and East Sides) is the opposite. One way to test whether this effect is significant would be to pool geography at another level (downtown/midtown/uptown), and check the credible intervals of the indicator variables of these coarser regions. This is convenient to do with a hierarchical approach.

Discussion

In the end, the analysis might indicate that there’s nothing nefarious or unusual about the distribution of grocery stores in Manhattan. In particular, they seem to be over-indexed in areas where people work (as opposed to live), which seems to make sense. (Judging from the 6 pm line at Fairway, I’m not the only one who buys groceries after work.) One important takeaway, from the modeling perspective, might be that the Census population is not an appropriate base rate, and that it should somehow be blended with the “working population”.

Finally, some caveats. I intend for this study to be an illustration of an interesting technique, not a perfectly rigorous analysis. One weakness of the model is that the Poisson distribution doesn’t have independent mean and variance parameters, so it’s often prone to “overdispersion”. Other types of regression, such as negative binomial regression, can overcome this limitation. It would also be interesting to stratify grocery stores based on quality, thereby incorporating businesses like bodegas/convenience stores as well.

If you’re the type of person who enjoys statistical modeling, whether in work or life (or both!), please consider joining us at Dia! We’re hiring!