Editor’s note: Opinions and views expressed in this post do not necessarily reflect the views of #ODSC Let’s face it, if you’re American and you celebrate Christmas...

Editor’s note: Opinions and views expressed in this post do not necessarily reflect the views of #ODSC

Let’s face it, if you’re American and you celebrate Christmas chances are you have visited Target, Wal-Mart or shopped at amazon.com this holiday season. Even if you haven’t shopped there you probably are getting something from one of these retailers…let’s hope it’s a good old fashioned data science book. The only other gift worth having is a ticket to the next ODSC!

Combined these retailers made $187B in Q4 last year. To support that much product moving around the country there has to be an extensive logistics network. I wanted to explore this unseen backbone of US consumption so I painstakingly gathered warehouse information from various sites. Not surprisingly complete, and unified data was hard to come by. I suspect each retailer wants to keep their distribution philosophy as secret as possible. That would avoid competing for the same labor pool in an area, defining the cost of real estate and distance to population centers. Still with some late nights I persevered and unified almost complete data, largely manually collecting and trying to cross validate in multiple sources.

In this post we will pretend Santa has to visit each of the warehouses so the Elves’ toys can be disseminated to all the children and good data scientists in the US. Along the way you will learn some basic mapping and routing techniques you can use the next time you have geospatial data.

Set Up

This work is an example of “GIS” which stands for geospatial information systems. Admittedly, I am no expert in this field, but find it fascinating to visualize geospatial data. I like to use a basic package maps during EDA for geospatial data. It has quick and easy, although somewhat unappealing map information for plotting. Then I load both ggplot2 and ggthemes which are among my favorite R visualization packages. Next, the TSP package is a simple package for “traveling salesperson problems.” This type of problem is one where you are trying to minimize the distance traveled from among multiple points in space. As you can imagine this is complex and has real world implications. Delivery companies, or in our case, Santa, do not want to waste effort by picking a suboptimal route. In this post, we suppose Santa is like a traveling salesman that has to efficiently travel between Amazon, Target and Wal-Mart warehouses because he only has a day to do it.

Lastly, I change the number of decimals that R will print to be 15 using options . When working with specific latitude and longitude values it is helpful during EDA. I also use set.seed so your code will create the same visualizations as I do for the TSP map.

library(maps)

library(ggplot2)

library(ggthemes)

library(TSP)

options(digits=15)

set.seed(2016)

Next, go get the data here and load it using read.csv . When parsing the data frame the vector classes are incorrect so I use as.numeric to change the $sq_ft . You may want to change the others as you explore the data more.

warehouses<-read.csv('warehouse_stats_w_latlon.csv')

warehouses$sq_ft<-as.numeric(as.character(warehouses$sq_ft))

Exploring the Warehouse Locations

It’s always a good idea to perform basic EDA on a new data set. I usually start with head to examine a few rows. This code will print the top 10 rows to your console but the default is 6. You will notice the column names and also some NA values. This data was challenging to obtain so in some cases I used NA or it was returned when I geocoded the location for lat/lon.

head(warehouses,10)

Since I collected multiple retailers you may be interested in the distribution among the companies. Using table will tally the retailers. I also use barplot in the second line to visualize the frequency of warehouses. Surprisingly Amazon has more warehouses than Wal-Mart. This despite the fact that Wal-Mart does about 4x the annual revenue. Clearly the organizations think about distribution efficiency differently.

table(warehouses$retailer)

barplot(table(warehouses$retailer))

Amazon has more warehouses even when adding Sam’s Club to the Wal-Mart.

A popular base function is summary which can be applied to an entire data set. However, a more nuanced way to explore data is to summarize the features by another feature. Summary can be applied in this manner using tapply along with the vectors $sq_ft and $retailer . The tapply function applies summary to each cell of the intersections of the vectors. The result is a list summarizing the square footage of warehouses for each retailer.

tapply(warehouses$sq_ft, warehouses$retailer, summary)

Comparing the median values from the code above shows that Wal-Mart is committed to larger warehouses resulting less of a need compared to Amazon. Specifically, amazon’s median warehouse size is 449,000 compared to Wal-Mart’s 880,000.

To get more feature interactions the aggregate function can be helpful. This code will provide the mean average square foot by retailer and by warehouse type. In the second parameter pass in a list containing the features for subsetting.

aggregate(warehouses$sq_ft, by=list(warehouses$retailer,warehouses$type), FUN=mean,na.rm = T)

This code lets you compare the size of each retailer’s warehouse according to its function. For example the table is a portion of the aggregate result comparing average “food” distribution center size for the three retailers.

Retailer Type Sq_Ft Amazon Food 317,589 Target Food 431,600 Wal-Mart Food 802,195

Again, you can see that Wal-Mart is committed to large warehouse centers by type.

During your EDA you should have observed multiple NA’s. Often geospatial functions fail with NA lat/lon values. A quick way to exclude them is with subset. Since a latitude that is NA will also have a longitude NA this single line of code will remove any rows that meet the logical declaration for either lat or lon. The complete.cases function is my declaration and returns a TRUE if the row contains no NA values.

complete.latlon<-subset(warehouses,

complete.cases(warehouses$lat)==T)

Quick Map

To get a quick EDA map visual I like to use the map function. First plot the line map and specifying “state” to get the lower 48. Next add points to represent each fulfillment center. The X values are longitude and the Y are latitude. Lastly, I tell R to plot 4 colors and in 4 different shapes. This is because there are 4 different retailers, Wal-Mart, Sam’s Club, Target and Amazon.

map('state')

points(complete.latlon$lon,complete.latlon$lat, col =c('orange','blue','red','yellow'),pch = c(16,17,18,19))

The result isn’t pretty but during EDA I am less concerned with aesthetics. The point is that there are dense East coast and West coast clusters no matter the retailer. This type of quick map is used to find some insights but isn’t really presentation worthy.

A quick map showing a lot of Eastern warehouses and none in Idaho, Montana, Wyoming, and the Dakotas..

The Traveling Santa Problem

Santa will be busy and not have much time to get all these gifts to the warehouses around the US. To help him Mrs. Clause, a Kaggle Grand Master, knows that data science can help. To get started, the ETSP function accepts and an X and Y vector within a matrix. This problem is a Euclidean Traveling Salesperson Problem (ETSP) since Santa can fly straight measured as Euclidean distance between points. In “reality” Earth is a sphere so a great circles approach is really what airlines would use. Remember these points represent each warehouse location and the goal is to to minimize the distance he has to travel so. The solve_TSP function applies heuristics solutions to the problem of minimizing the distance traveled. Here I selected two_opt which essentially looks to minimize how often Santa would backtrack over his route. A good article on the approach is here. Another approach in solve_TSP is “nn” which stands for nearest neighbor. In this case the function selects the nearest warehouse in turn but allows for Santa crossing over his path again.

latlon.m<-ETSP(cbind(complete.latlon$lat, complete.latlon$lon))

santa.tour<-solve_TSP(latlon.m, method = 'two_opt')

To examine the result, print santa.tour in your console. This will print the class and length of Santa’s trip but this is in Euclidean distances of lat/lon so it’s not that helpful. Instead use plot to see the entire route firsthand. In this example I did not specify a start= value. Technically Santa would be coming from the North Pole which is 90,0 but I didn’t bother adding it ☺

santa.tour

plot(latlon.m, santa.tour)

The basic Santa Tour plot isn’t very informative!

To get the order of warehouses that Santa will visit call as.interger on the tour object. This extracts the index number from santa.tour . In this example, Santa should start at row 81, 80, 82 and so on. You can see the order by calling route.order in your console.

route.order<-as.integer(santa.tour)

route.order

A quick way to see the first 6 routes on Santa’s journey is below. I am indexing complete.latlon by the 1:6 values of route.order . Instead of returning the entire column I am specifying the location vector with 4. The next table shows his first 6 stops of the night.

complete.latlon[route.order[1:6],4]

Stop Location 1 3100 North-I 27, Plainview, TX, 79072 2 700 Westport Parkway, (Haslet) Fort Worth. Texas. 76177- 4513 3 15201 Heritage Parkway, Fort Worth, Texas, 76177-2517 4 15101 N. Beach St., Fort Worth, TX, 76177 5 5300 Westport Parkway , Fort Worth, TX, 76177 6 4601 Gold Spike Drive, Fort Worth, Texas, 76106

The start of Santa’s Journey to all US Warehouses is in Texas..

An easy way to reorder the original data frame to account for the traveling salesperson solution is with match . To start add a column, $index , which has values 1 to the number of rows in the data set. This makes the row position a declared vector. Next, using match within square brackets pass in the route order and $index . Since the values are shared match will rearrange the rows into the efficient delivery tour.

complete.latlon$index<-seq(1,nrow(complete.latlon))

complete.latlon<-complete.latlon[match(route.order, complete.latlon$index),]

Once the data is reordered you can compare the first 6 warehouse locations with the code below.

head(complete.latlon$location)

Since the basic plot wasn’t informative I want to create an actual map of Santa’s journey. First, load the US state map to get the spatial information with map_data . This code will invoke the “maps” library to get the plotting data.

state.map <- map_data("state")

Next, create a base layer ggplot . Then add a polygon layer containing the state level map data in state.map with x =long and y=lat . I chose grey as the color so the state outlines would not add visual clutter. The next layer is geom_path . The geom_path function will plot a line in the order of the data frame rows. From the package documentation,” geom_path connects the observations in the order in which they appear in the data.” In contrast, “ geom_line connects them in order of the variable on the x axis. Since this is a linear journey I have to use geom_path . The next plot layer uses xlim and ylim to limit the viewable section. This can be useful if you have outlier lat/lon points. Next, I add a title with ggtitle , apply my favorite plot theme, theme_gdocs , and then remove a lot of the visual elements so you can focus on the plot and Santa’s journey. These last 4 layers, limits, title, theme and blank elements are not mandated by make the plot easier to follow.

ggplot() +

geom_polygon(data = state.map,

aes(x=long,y=lat,group=group),

colour = "grey", fill = NA) +

geom_path(data=complete.latlon,

aes(x=lon, y=lat), col="red") +

xlim(-125,-65) + ylim(25,50) +

ggtitle("Santa Deliveries") +

theme_gdocs() +

theme(panel.border = element_blank(),

axis.text = element_blank(),

line = element_blank(),

axis.title = element_blank())

A route Santa could take to deliver goods to Target, Wal-Mart and Amazon warehouses.

Conclusion

You can trace Santa’s path starting in Texas and ending in Nevada. Remember this TSP method sought to minimize crossing paths but it was unavoidable in some cases. However, the result seems intuitive as Santa starts in Texas migrates to the lower Midwest then over to the Southeast. From there he starts up the East coast before heading West across the upper Midwest and then darting down to Southern California. Looping north he covers the Pacific Northwest before ending in Nevada. Those must be some tired Reindeer!

Now that Santa has made his rounds I want to know the nearest warehouse to my location so I can go pick up my goods…assuming I am on the nice list! In the next post I will show you how to geocode an address, find the closest warehouse to that address, get a route from google maps and then get a satellite image of the building that will have my holiday gifts from Santa’s elves. Lastly, I will create an interactive map with all warehouse information from Wal-Mart, Sam’s Club, Target, and Amazon so you can explore the data yourself.

©ODSC 2016