On March 30, 2015, the Premier of Victoria took to social media to announce that an election promise had been kept — that Victoria’s public transport timetables would soon be available in Google Maps.

Six months later, Melbourne’s timetable data remains conspicuously absent from Google’s services. Liam Mannix from Fairfax wrote an article seeking an explanation from Public Transport Victoria and Google for the continuing absence.

In short, Google blames PTV, and PTV blames Google.

IT projects — hell, projects in general — that cross organizational boundaries always have challenges. But, for once, the item under dispute — whether the public transport timetable data released by PTV is sufficient for Google to use — is both publicly available and relatively simple to understand. For once, we can see for ourselves who’s at fault.

So I did just that: I looked at the data release. And, while it’s not completely black and white, the vast majority of the blame appears to lie with PTV.

When there is no standard, invent one — GTFS

Technically, what the Andrews government promised was to make Melbourne’s public transport timetable data downloadable from the web in a format known as GTFS. Google — or anybody else who wants it — can download the “feed” and use it to display timetable information on a website, in a mobile app, or in pretty much any other form somebody can be bothered writing a program to make happen.

Originally developed in 2005 through a collaboration between Google and Portland, Oregon’s public transit agency, the GTFS standard provided a way to share timetable data electronically. It’s just a text document that describes a set of files (archived together in a single ZIP file) which constitute a GTFS “feed”, along with the rules for what each file should contain and how it should be formatted.
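For the curious, the core files in a feed look something like this. The file names come from the specification itself; the one-line descriptions are my own gloss:

```
agency.txt       the operators running the services
stops.txt        every stop and station, with its location
routes.txt       the lines riders recognize, e.g. the Pakenham line
trips.txt        each scheduled run of a vehicle along a route
stop_times.txt   the time each trip arrives at and departs each stop
calendar.txt     which days of the week each set of trips operates
```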

Originally called the Google Transit Feed Specification, it was renamed the General Transit Feed Specification in 2010. As often happens once a standard data format exists, plenty of other organizations found other uses for it: not just services aimed at public transport users, but also software used by public transport agencies themselves, such as travel demand forecasting tools.

Technically, GTFS is quite a simple standard. Each file within a feed is just a “comma-separated value” file. Logically, they’re like an Excel spreadsheet, with rows and columns. The complexity in the standard comes in the rules specifying what each row of each file contains. For instance, in a GTFS feed, the file “stops.txt” contains a list of stops, with one row for each stop. Easy enough. But there are up to 13 columns of information that need to be supplied for each of these stops, and the rules for these are quite involved.

Have a look at this section from the GTFS standard describing how stations, stops within a station, and stops outside a station are distinguished:

The rules for describing the different types of stops in a public transport network in GTFS (screenshot from the Google Transit website)
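To make the distinction concrete, here is an invented stops.txt fragment showing a station and two platforms within it. The IDs and coordinates are illustrative, not taken from PTV’s data:

```
stop_id,stop_name,stop_lat,stop_lon,location_type,parent_station
1000,Flinders Street Station,-37.8183,144.9671,1,
1001,Flinders Street Station Platform 1,-37.8183,144.9673,0,1000
1002,Flinders Street Station Platform 2,-37.8184,144.9673,0,1000
```

A location_type of 1 marks the row as a station, the platform rows point back to it through parent_station, and an ordinary street stop simply leaves both fields empty.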

This is both complex and tedious, but it’s the kind of tedious complexity you have to deal with if you’re trying to represent a real-world system electronically. It’s also entirely routine for anyone designing IT systems.

Two standards — the text and the tool

To assist public transit authorities to provide usable GTFS data, Google provides a validation tool — a computer program, in other words — which can be used to check whether their data is formatted according to the standard.

This kind of tool is a very handy thing; checking a large feed against the rules by hand would be enormously tedious and error-prone, and an automated check catches many mistakes a human would miss. But it creates a problem that is all too common in IT: what happens if there’s a difference between what the standard says and what the validation tool accepts as correct?

As we’ll see, this issue comes up when trying to figure out why PTV and Google can’t agree on their data release, but it’s hardly the only problem.

A splintered feed — the first problem

I downloaded the GTFS “data feed” from the Victorian Government’s data.vic open data website. It rapidly became clear that whoever was responsible for putting the feed together hadn’t properly followed the GTFS standard.

The first major flaw in the PTV data release is that it does not follow the file layout required by the standard.

The GTFS standard specifies a single feed: all of the files collected in one place, with exactly one version of each file.

PTV’s release doesn’t do this. It contains 11 numbered folders, each of which contains another ZIP file, along with a test script for running the validation tool. Each of the ZIP files represents a GTFS feed covering part of the PTV network. There is no ambiguity here; this is simply not following the rules.
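Based on what’s actually in the download, the difference looks roughly like this (the inner names are placeholders; the structure is the point):

```
What the standard expects: one feed, one copy of each file
    feed.zip
        agency.txt, stops.txt, routes.txt, trips.txt, stop_times.txt, calendar.txt, ...

What PTV published: eleven separate feeds bundled into one download
    ptv_release.zip
        1/ <sub-feed covering one slice of the network>.zip
        2/ <another sub-feed>.zip
        ...
        11/ <another sub-feed>.zip
        <test script for running the validation tool>
```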

Each directory seems to contain the timetable data from some part of Victoria’s transport network, but from an outsider’s perspective some of the groupings appear arbitrary. For instance, folder 1 contains the data from V/Line trains to and from places like Albury and Maryborough, but also appears to contain timetable data for the Metro services from Pakenham and Sunbury. Yet some of the Sunbury services are in folder 2.

Duplicated field data and an ambiguous standard

The second problem is that PTV has not followed the GTFS rules as to the contents of the data fields in some files — though this is where the ambiguity between the written standard and what the validation tool accepts comes in.

If you run Google’s validation tool over folder 1 of the PTV feed, it identifies hundreds of problems. A few relate to stations which have no services stopping at them, but the biggest issue is the way that line names have been handled.

The vast majority of the warnings relate to the way that PTV has handled train line names. The “routes.txt” file contains two fields called “route_short_name” and “route_long_name”. For instance, for the Pakenham line trains, PTV’s file sets “route_short_name” to “Pakenham” and “route_long_name” to “Pakenham — City (Flinders Street)”. The validator complains that the “route_short_name” is too long and that the “route_long_name” shouldn’t contain the route_short_name, as they are often displayed together.

At this point, bells started ringing. I went back to Liam Mannix’s article, and one of the key interview quotes from PTV started to make a bit more sense:

“Google has additional data requirements beyond GTFS for its transit application, and PTV is working with Google to provide that data in a timely manner.”

What seems to have happened is that the team at PTV responsible for the GTFS release cared only about the written GTFS standard, which is far less explicit than the validator about what a route’s short name should look like:

“The route_short_name contains the short name of a route. This will often be a short, abstract identifier like “32”, “100X”, or “Green” that riders use to identify a route, but which doesn’t give any indication of what places the route serves. At least one of route_short_name or route_long_name must be specified, or potentially both if appropriate. If the route does not have a short name, please specify a route_long_name and use an empty string as the value for this field.”

“The route_long_name contains the full name of a route. This name is generally more descriptive than the route_short_name and will often include the route’s destination or stop. At least one of route_short_name or route_long_name must be specified, or potentially both if appropriate. If the route does not have a long name, please specify a route_short_name and use an empty string as the value for this field.”

If PTV wanted to make data that met the more exacting requirements of the validator as well as the written GTFS specification, it wouldn’t have been hard. What they should have done is leave either the “route_long_name” or the “route_short_name” field blank. Not rocket science. But is it their responsibility?
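Concretely, for the Pakenham line row described above, the change amounts to something like this (reconstructed from the validator’s complaints, not copied from PTV’s file):

```
# routes.txt as released (other columns omitted):
route_short_name,route_long_name
Pakenham,Pakenham — City (Flinders Street)

# what the spec text quoted above prescribes instead:
route_short_name,route_long_name
,Pakenham — City (Flinders Street)
```

Leaving the short name empty satisfies the written specification and should silence both of the validator’s complaints.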

A game of IT chicken

While the splintering of the data into multiple ZIP files is, without a doubt, not compliant with the GTFS standard, the issue with the route name fields is more complicated. Was it enough for PTV to simply write the data to their own interpretation of the written GTFS standard and, if Google’s validation tool was more finicky, suggest that’s Google’s problem?

In a word, nope. From this outsider’s perspective, it looks like the action of an organization that doesn’t understand IT.

Inevitably, complex specifications have ambiguities — assumptions that are clear to those who wrote the standards, but that might not be shared by those reading them from a different perspective. It’s extremely common for computer programs which work with them to differ in subtle ways from “the spec”.

And, where you have a gorilla and a minnow discussing the issue, the minnow ends up adjusting what they do to what the gorilla says the spec is. In this circumstance, Google is the gorilla, and their expectation that public transit authorities will provide data in the form their tools accept is entirely reasonable.

Google deals with hundreds of public transport bodies around the world, and they will have written software to periodically fetch each body’s GTFS feed and incorporate it into their systems. To deal with PTV’s idiosyncratically formatted data, they would have to write a fair bit of additional code especially for the PTV dataset. Multiply that kind of special-casing across every transport authority around the world, and it becomes a costly nightmare to maintain.

By contrast, all PTV has to do is filter out some redundant information and write a fairly short and simple program to combine their multiple datasets into one. Frankly, it should have taken them, at most, a couple of additional weeks, not most of a year. That they haven’t is indicative of an unfortunate combination of incompetence and bloody-mindedness.
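For a sense of what that program involves, here is a rough Python sketch of the “combine the sub-feeds into one” step. The folder layout and file names are assumptions based on the release described above, and a production version would also need to de-duplicate stops shared between sub-feeds and strip the redundant route names. But the point stands: this is an afternoon of work, not a year of it.

```
import csv
import io
import zipfile
from pathlib import Path

# Columns that identify things; they must not collide between sub-feeds.
ID_COLUMNS = {"agency_id", "stop_id", "route_id", "trip_id", "service_id",
              "shape_id", "fare_id", "zone_id", "parent_station"}

def merge_feeds(subfeed_zips, output_path):
    rows_by_file = {}     # e.g. "stops.txt" -> list of row dicts
    columns_by_file = {}  # e.g. "stops.txt" -> ordered union of column names

    for index, feed_path in enumerate(subfeed_zips, start=1):
        prefix = f"{index}-"
        with zipfile.ZipFile(feed_path) as zf:
            for name in zf.namelist():
                if not name.endswith(".txt"):
                    continue
                with zf.open(name) as raw:
                    reader = csv.DictReader(io.TextIOWrapper(raw, "utf-8-sig"))
                    cols = columns_by_file.setdefault(name, [])
                    for col in reader.fieldnames or []:
                        if col not in cols:
                            cols.append(col)
                    for row in reader:
                        # Namespace identifiers so sub-feed 1's IDs can't
                        # clash with sub-feed 2's.
                        for col in ID_COLUMNS & row.keys():
                            if row[col]:
                                row[col] = prefix + row[col]
                        rows_by_file.setdefault(name, []).append(row)

    with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as out:
        for name, rows in rows_by_file.items():
            buffer = io.StringIO()
            writer = csv.DictWriter(buffer, fieldnames=columns_by_file[name],
                                    extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)
            out.writestr(name, buffer.getvalue())

if __name__ == "__main__":
    # Assumed layout: numbered folders 1..11, each holding one sub-feed ZIP.
    sub_feeds = sorted(Path(".").glob("*/*.zip"))
    merge_feeds(sub_feeds, "combined_gtfs.zip")
```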

It’s not just Google who would have to deal with PTV’s idiosyncrasies. Most others who write software that handles GTFS data will, explicitly or implicitly, assume that “GTFS” means “GTFS as Google defines it”. Everybody else wanting to use PTV’s dataset will similarly have to code around the data’s quirks. Many companies running global operations simply will not bother.

And what of Myki?

In the greater scheme of things, the extraordinary difficulties PTV have had with the comparatively simple task of getting GTFS timetable data online are a minor annoyance. But the timetable feed isn’t the only customer-facing IT project that PTV have on the go.

The deservedly much-maligned Myki ticketing system operating contract is about to expire. Three bidders are currently competing for the new contract, due to be signed some time in 2016.

If PTV can’t get what is essentially a collection of spreadsheets to have data in the right columns, how in the hell are they going to manage the immensely more complex IT procurement task that is a Myki contract?