How does the Bay Area Commute?

Defining Transit Service Areas with Unsupervised Machine Learning

For this project, I wanted to answer a question: can we define the service areas of the transit agencies in the San Francisco Bay Area using only commute data, and if so, what would they look like?

The answer is yes, and it looks something like this:

Here’s how I got there.

Background

The Bay Area is unusual among major metropolitan areas in its multi-polar nature. This has long been reflected in our transit systems — we have around 28 transit systems serving the nine-county Bay Area (ignoring once-a-day Amtrak interstate rail services). Roughly speaking, these fall into three classes:

Inter-regional transit: Capitol Corridor, San Joaquins, Altamont Corridor Express, future High Speed Rail. These systems serve destinations outside of the nine-county Bay Area.

Regional transit: BART, Caltrain, SMART, Golden Gate Ferry, San Francisco Bay Ferry. These systems serve two or more of the local transit districts listed below.

Local transit: Muni, AC Transit, VTA, SamTrans, Golden Gate Transit, Marin Transit, Dumbarton Express, County Connection, Santa Rosa City Bus, Tri Delta Transit, Wheels, SolTrans, Sonoma County Transit, WestCAT, Fairfield and Suisun Transit (FAST), Solano Express, Vine, City Coach, Petaluma Transit, Rio Vista Delta Breeze.

Bay Area Transit Agencies (Source: SPUR, Seamless Transit, April 2015)

The fragmentation of Bay Area transit results in a poor experience for transit riders, who are forced to navigate multiple fare systems and informational materials in order to get to where they are going. It’s often naively suggested that the solution is simply to merge all the agencies together, but this would be politically challenging and likely to create more problems than it solves. As transit consultant Jarrett Walker notes, “A very leftist San Francisco supervisor once asked me why San Francisco should give up any control of its transit to suburb-dominated boards who simply wouldn’t understand San Francisco’s density-driven need for a very high per-capita level of service. And he had a point.”

Many of the issues caused by transit fragmentation are detailed in SPUR’s Seamless Transit report, which discusses how these issues might be solved through better interagency cooperation, without merging all the transit agencies together into one unwieldy entity. Other countries have taken a similar approach; for example, the Rhine-Ruhr region of Germany is analogous to the Bay Area, with a similar population, similar distances between major cities, and similar multipolarity. An organization called a Verkehrsverbund is responsible for coordinating services; the region’s transit system has one map, one fare structure, and one brand, despite having 27 different operators.

A service map for the Rhein-Ruhr Verkehrsverbund

So if we can paper over the cracks in our transit system with a Verkehrsverbund, we don’t need to worry about fragmentation, right? Well, not quite. One issue the Verkehrsverbund does not solve is the problem of defining sensible boundaries for the transit systems under its jurisdiction. At these boundaries, passengers must transfer to another system to continue their journey. One example is the San Francisco/San Mateo county line, where Muni’s 14-Mission bus terminates and SamTrans’ ECR bus continues the journey down El Camino Real. Another is the San Mateo/Santa Clara county line, where the ECR bus terminates and VTA’s 522 bus continues down El Camino Real to San Jose.

While boundaries are inconvenient, they are also necessary; it would not be practical to run those three bus routes as one 60-mile combined route, nor would San Francisco be happy with San Jose making decisions about its bus service, or vice versa. If a line has to be drawn somewhere, it should be drawn in a place where as few riders as possible are inconvenienced by the transition. Can we improve the definition of local transit service areas to better match the actual commuting patterns of residents? An understanding of the commuting patterns of the Bay Area should allow us to suggest transit agency mergers when appropriate, and to keep the agencies separate when they are not.

Methodology

Kaggle has posted the commute flows from every census tract in the United States to every other census tract in the United States, with the 2010 US Census as the original source. The ‘flow’ values from this data, i.e. the number of commuters travelling from a given origin census tract to a given destination census tract, can be used to calculate the distances between census tracts in feature space.

We define the distance d(i,j) in feature space between census tracts i and j as:

where:

This implies that the feature space distance between two census tracts that have no commuters flowing between them is equal in quantity to the maximum commute flow in the dataset, i.e. those census tracts are the furthest apart in feature space. It also means that the pair of census tracts with the biggest commute flow has a feature space distance of ~1, i.e. those census tracts are the closest in feature space.
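To make those two properties concrete, here is a minimal sketch. It assumes, purely for illustration, the form d(i, j) = F_max / (f(i, j) + 1), where f(i, j) is the commute flow from tract i to tract j and F_max is the largest flow in the dataset. That form satisfies both properties described above, but it is an assumption, as are the function name and the toy tract labels.

```python
def flow_distances(flows, tracts):
    """Convert directed commute flows into feature-space distances.

    flows: dict mapping (origin, destination) -> number of commuters.
    Assumed form (illustrative only): d(i, j) = F_max / (flow(i, j) + 1).
    """
    f_max = max(flows.values())
    return {
        (i, j): f_max / (flows.get((i, j), 0) + 1)  # missing pair = zero flow
        for i in tracts
        for j in tracts
        if i != j
    }


# Toy flows between three hypothetical tracts.
flows = {("A", "B"): 99, ("B", "A"): 9}
d = flow_distances(flows, ["A", "B", "C"])

closest = d[("A", "B")]   # pair with the biggest flow: just under 1
furthest = d[("B", "C")]  # no commuters: equal to the maximum flow
```

Under this assumed form, the pair with the largest flow lands just below 1, and any pair with no commuters lands exactly at the maximum flow, which is the behaviour the paragraph above describes.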

The directionality of commute flows presented a challenge when creating the distance matrix. Clustering algorithms generally take a dataset’s positions in feature space, calculate the distance matrix, and then establish the linkage; however, for this problem, we are defining the distance matrix directly from the source data. This results in the unusual property that the feature space distance from A to B is likely to be different from the feature space distance from B to A, as more commuters will commute in one direction than the other. For this project, we decided to examine origin-to-destination commute flows only, as this resulted in clusters that were clearly defined in both feature space and real space, while the inverse resulted in clusters that overlapped significantly in real space.
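The asymmetry is easy to see on a toy example. The sketch below builds a directed distance matrix (rows are origins, columns are destinations) and its transpose, which stands in for the destination-to-origin view. The distance form F_max / (flow + 1), the helper names, and the tract labels are all assumptions made for illustration.

```python
def directed_distance_matrix(flows, tracts):
    """Square matrix with entry [o][d] = assumed distance from origin o to destination d."""
    f_max = max(flows.values())
    index = {tract: k for k, tract in enumerate(tracts)}
    n = len(tracts)
    # Pairs with no recorded flow start at the maximum distance; the diagonal is 0.
    matrix = [[0.0 if r == c else float(f_max) for c in range(n)] for r in range(n)]
    for (origin, dest), count in flows.items():
        if origin != dest:
            matrix[index[origin]][index[dest]] = f_max / (count + 1)
    return matrix


def transpose(matrix):
    """Swap the origin and destination roles."""
    return [list(row) for row in zip(*matrix)]


tracts = ["suburb", "downtown", "port"]
flows = {("suburb", "downtown"): 499, ("downtown", "suburb"): 24}

od = directed_distance_matrix(flows, tracts)  # origin -> destination
do = transpose(od)                            # destination -> origin

is_symmetric = od == do  # False: far more people commute suburb -> downtown
```

In this toy case the suburb is very close to downtown when measured in the commute direction, and much further away when measured in reverse, so the two matrices can produce different clusterings.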

Once we had calculated the distance between every pair of census tracts, we fed this into SciPy’s hierarchical clustering algorithm to determine the relationship between census tracts in the Bay Area. (For this analysis, the Bay Area is defined as all census tracts located between 37 and 38.5 degrees latitude and between -123 and -121.5 degrees longitude.) The number of clusters is not determined in advance, but the validity and consistency of the clusters can be assessed by calculating the average silhouette score across all data points in the sample. This is shown below for varying values of k:
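As a rough sketch of that pipeline, the snippet below clusters a small precomputed distance matrix with SciPy and scores each value of k with a hand-rolled average silhouette. Several things here are assumptions rather than details of the actual analysis: the toy matrix is symmetric (which SciPy's condensed format requires), the linkage method is "average", and the helper name mean_silhouette is made up.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def mean_silhouette(dist, labels):
    """Average silhouette score computed from a square distance matrix."""
    labels = np.asarray(labels)
    scores = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False  # a point is not its own neighbour
        if not same.any():
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        a = dist[i, same].mean()  # mean distance within the point's own cluster
        b = min(  # mean distance to the nearest other cluster
            dist[i, labels == other].mean()
            for other in set(labels.tolist())
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# Toy data: six points on a line forming two obvious groups.
x = np.array([0.0, 1.0, 3.0, 20.0, 22.0, 25.0])
dist = np.abs(x[:, None] - x[None, :])

# linkage() takes the condensed (upper-triangle) form of the matrix.
tree = linkage(squareform(dist), method="average")

labels_by_k = {}
scores = {}
for k in (2, 3):
    labels_by_k[k] = fcluster(tree, t=k, criterion="maxclust")
    scores[k] = mean_silhouette(dist, labels_by_k[k])
```

On this toy data the score is highest at k=2, where the cut matches the two obvious groups, and drops at k=3, where one group is forced to split. Plotting the same score across a range of k gives a chart like the one below.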

Bay Area Silhouette Scores

The existence of a dual peak in the silhouette score chart is indicative of both regional and local coherence in commuting patterns. Optimal values of k are found at 3, and again at 9, 10, and 11.
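Picking out those peaks can also be done programmatically. The sketch below finds values of k whose silhouette score is higher than both neighbours; the scores in it are made-up placeholder values chosen only to produce a two-peak shape, not the results of this analysis, and the function name is hypothetical.

```python
def local_peaks(scores_by_k):
    """Return the k values whose score is strictly higher than both neighbouring k."""
    ks = sorted(scores_by_k)
    peaks = []
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        if scores_by_k[k] > scores_by_k[prev_k] and scores_by_k[k] > scores_by_k[next_k]:
            peaks.append(k)
    return peaks


# Placeholder scores (illustrative only, NOT the Bay Area results).
toy_scores = {
    2: 0.30, 3: 0.42, 4: 0.35, 5: 0.28, 6: 0.25, 7: 0.27,
    8: 0.31, 9: 0.36, 10: 0.37, 11: 0.36, 12: 0.29,
}

peaks = local_peaks(toy_scores)
```

A strict comparison like this reports only the top of a broad peak, so nearby values of k with almost the same score are still worth inspecting by eye.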

Let’s start by looking at k=3:

Bay Area Census Tract Clustering (k=3)

This clustering represents the first internally cohesive sub-regional division of the Bay Area. Firstly, Napa and Solano counties split off at k=2; then, San Jose and Silicon Valley split off at k=3. The remaining Bay Area covers everywhere you can get to on a BART train, plus as far as Redwood City to the south and Santa Rosa to the north.

The next level of viable clustering occurs at higher values of k. Let’s start with k=9:

Bay Area Census Tract Clustering (k=9)

Intuitively, this looks good; we have a clear separation between all the logical sub-areas of the Bay Area. These can roughly be described as: Sonoma/Marin, Solano/Napa, San Francisco/San Mateo, Oakland/Berkeley/Richmond, Walnut Creek/Concord/Antioch, the Tri-Valley area, Hayward/Fremont, and San Jose/Silicon Valley.

Let’s go up to k=10:

Bay Area Census Tract Clustering (k=10)

Now, San Francisco and San Mateo have split, with SF keeping Daly City, Colma, and Pacifica.

Finally, let’s look at k=11:

Bay Area Census Tract Clustering (k=11)

Here, the southern East Bay has split into Hayward/Castro Valley to the north, and Union City/Newark/Fremont to the south.

It’s notable that the first two clusters to break off (Napa/Solano and San Jose) have not subdivided any further, indicating strong internal consistency. (At k=15, San Jose splits from the rest of Santa Clara County; Napa and Solano remain connected to each other until very high values of k.)

Conclusions

Inter-Regional Transit

No conclusions on inter-regional transit are drawn, as inter-regional transit primarily serves business and leisure travellers, and so is unsuited to analysis using commuting data. It is likely that these services would be best aggregated and administered at the state level, due to the distances covered.

Regional Transit

San Jose is a different animal to the rest of the Bay Area. As an auto-dominated metro area, a lot of its commute patterns consist of car travel from the bedroom communities of south and east San Jose to the office parks of north San Jose and Silicon Valley. It would be a mistake to assume Silicon Valley is tied to the rest of the Bay Area; the Google Buses may have a lot of notoriety, but they serve relatively few commuters.

Solano and Napa counties are also disconnected from the rest of the Bay Area, despite their geographic proximity; there are pitifully few transit options to the rest of the region, with the 60–70 minute Vallejo Ferry being the only direct connection to San Francisco. The situation could be improved by adding commuter rail to these counties heading north from Richmond across a new bridge to Vallejo, and then branching to Sonoma, Napa, and Fairfield. Ideally this would run directly to SF through a new Transbay Tube, but could also work as a feeder to BART in the near term.

Despite the above, there isn’t any reason to divide the Bay Area for the purposes of regional transit. Caltrain serves both the SF-focused Bay Area and the South Bay, and soon BART will do the same. Both of these services are integral to the regional transit system and cannot be divided in two. And while Solano/Napa is more isolated from the SF-focused Bay Area than Sonoma/Marin, this is largely due to I-80 congestion and poor transit options, problems which can and should be remedied. All regional transit providers should be merged into one system serving regional travel across the nine-county Bay Area.

Local Transit

While the clusters match the service areas of these agencies surprisingly well, there are a few changes that could be made, roughly based on the k=10 map. Where multiple options can be justified, I’ve chosen the option resulting in the least change from the status quo.

Golden Gate Transit, Marin Transit, Petaluma Transit, and Santa Rosa City Bus should all merge to form a North Bay transit agency. (Note: there’s an argument for including SMART in this agency, but I would argue that it should remain a regional agency for ease of integration with Golden Gate Ferry.)

Vine, FAST, SolTrans, and City Coach should all merge to form a Solano/Napa transit agency. Rio Vista Delta Breeze could possibly be included as well.

County Connection and Tri Delta Transit should merge.

Wheels should take over service in San Ramon and Danville from County Connection, as these cities have stronger commuting ties to Dublin/Pleasanton than to Walnut Creek.

AC Transit should absorb WestCAT to the north. The dividing line between the two services in the Richmond area is entirely artificial.

Bus service from Bay Fair BART station south to the Alameda/Santa Clara county line, currently provided by a mix of AC Transit, Union City Transit, and Dumbarton Express, should be consolidated into a new agency servicing the southern East Bay. This change in conjunction with the above would allow AC Transit to focus on becoming an urban transit system serving the Oakland/Berkeley/Richmond conurbation.

Muni should take over service from SamTrans in Daly City and Colma. The bulk of San Bruno Mountain and the (literal) dead zone of the Colma cemeteries form a natural dividing line between San Francisco and San Mateo County, a little south of where the line was drawn on the map. Pacifica would also benefit from direct service to San Francisco.

VTA should take over service from SamTrans in Menlo Park and East Palo Alto. The town of Atherton, too wealthy and too sparsely populated to support transit ridership, forms the barrier between the San Mateo service area and Silicon Valley.

Here is the end result. I’ve used some creative license for the system names.

Of course, rationalizing service areas does not help unless we also rationalize the fare system and branding. Stay tuned for a future post on what a Bay Area Verkehrsverbund might look like.

Questions? Comments?

Find me on Twitter or LinkedIn

The code behind this analysis is available on GitHub. Thanks to Alex Golden Cuevas and Bonnie Shen for helping write the code; additional thanks to Alex Golden Cuevas and Brandon Ananias Martin-Anderson for reviewing this post before publication.

Bonus Maps

Although I’m not familiar enough with other regions to prescribe transit reorganization recommendations, performing similar clustering analyses on other multi-polar regions afforded some interesting insights.

Southern California

The silhouette score is positive and flat up until k=12, after which it falls off quickly. Interestingly, at k=3, Los Angeles County is more connected to San Diego than to Orange County or the Inland Empire. LA County splits in two at k=9, primarily as the highlands vs. the basin, with the highlands also claiming Beverly Hills and the coastal cities.

Southern California Silhouette Scores

Southern California Census Tract Clustering (k=3)

Southern California Census Tract Clustering (k=6)

Southern California Census Tract Clustering (k=9)

Southern California Census Tract Clustering (k=12)

Northeast Corridor

There is a lot to unpack here. Interestingly, state lines have a significant impact on where people work, even when these do not follow natural features. There is a clear cut-off at the land border between New Jersey and New York, and also between Maryland and Delaware. Philadelphia/Camden/Trenton is neatly divided into four based on the rivers that flow through the area, some of which are also state lines. Boston and Providence buck the trend and remain clustered despite being in different states, while Baltimore is split between DC to the south and Harrisburg to the north.

Northeast Corridor Silhouette Scores