Transport for London (TfL) is combining two distinct data sources to help reduce station overcrowding and to help ensure its customers are able to ride the Underground at peak times.

Robert Duff, data scientist at TfL, explained to an audience of data scientists at the EARL Conference in London, organised by technology specialist Mango Solutions, how “every journey matters” to the organisation that manages both the Underground and other elements of the transport system in London, such as buses, boats, trams and cable cars. Duff and his colleagues must use information to help manage the flow of almost 31 million journeys across London daily:

That gives us an abundance of data – we’re data-rich; that’s sometimes good and sometimes bad. The challenge is making sure we always do something intelligent, but extract the insight from it.

Duff said the challenge with using data to create insight is one that will be familiar to data scientists in other organisations: data sources are often independent and databases can sit on different platforms, be that on the cloud or in legacy systems. His aim is to bring sources together to help solve business challenges at TfL.

The issue he addressed at the conference was the problem of station overcrowding, which can lead to station closures on the Underground network at peak travel times. By bringing together two disparate data sources – ticketing data and train information data – Duff and his colleagues are providing insight to TfL bosses that can be used to improve operational efficiency and the customer experience.

Using ticketing data

Duff explained how ticketing data is taken from the Oyster and contactless cards that passengers use to enter the Tube network. TfL can see how passengers enter and leave the network. He demonstrated how the data can be displayed to show how each station has unique patterns of passenger flow.

Canary Wharf and Oxford Circus are known as tidal stations, which are extremely busy around 5pm. Wembley Park, meanwhile, is an event station, with huge spikes for football matches and music concerts. When stations become too busy, passengers are held outside due to overcrowding. Duff said data plays a key role here:

Our challenge is to use ticketing data to find a better way to report on closures that are primarily driven by overcrowding and congestion – and we can now actually start detecting those closure times.

Historically, station staff have collected this information, but their priority is to make sure stations are running safely. So TfL has created a new digital technique, based on Oyster and contactless ticketing data, that captures when a station is closed to passengers:

We obtained six months’ worth of data. We do this on an ongoing basis and aggregate our findings to a 15-minute level. We treat each day as a signal and we generate characteristics. We then apply multivariate outlier detection to find the 30 most typical days from which we generate a ‘normal’ baseline. We then detect any new signal against our baseline to determine closure periods. We can produce heatmaps for stations where we may find holes in the typical entry demand flow – this can be attributed to closures due to overcrowding.
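The steps Duff describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not TfL's implementation: the three day-level characteristics (total entries, peak value, peak interval), the standardised distance used in place of a specific multivariate outlier method, and the `drop` and `min_demand` thresholds are all hypothetical choices, and the counts are made-up toy numbers.

```python
from statistics import median, pstdev

def day_features(signal):
    """Summarise one day's 15-minute entry counts (features are an assumption)."""
    peak = max(signal)
    return [sum(signal), peak, signal.index(peak)]

def typical_days(days, k):
    """Return the k days whose features lie closest to the median day."""
    feats = [day_features(d) for d in days]
    cols = list(zip(*feats))
    centre = [median(c) for c in cols]
    spread = [pstdev(c) or 1.0 for c in cols]  # fall back to 1.0 if a feature never varies

    def distance(f):
        # Standardised Euclidean distance: a simple stand-in for outlier scoring
        return sum(((x - m) / s) ** 2 for x, m, s in zip(f, centre, spread)) ** 0.5

    ranked = sorted(range(len(days)), key=lambda i: distance(feats[i]))
    return [days[i] for i in ranked[:k]]

def baseline(days):
    """Per-interval median across the typical days."""
    return [median(col) for col in zip(*days)]

def closure_intervals(signal, base, drop=0.3, min_demand=50):
    """Intervals where demand is normally high but entries fall below drop * baseline."""
    return [i for i, (x, b) in enumerate(zip(signal, base))
            if b >= min_demand and x < drop * b]

# Toy data: 8 intervals per day instead of 96, 5 typical days instead of 30
normal = [10, 40, 120, 200, 180, 90, 30, 10]
history = [[v + d for v in normal] for d in range(-3, 4)]  # seven ordinary days
history.append([v * 3 for v in normal])                    # one unusual event day
base = baseline(typical_days(history, k=5))
today = [10, 40, 120, 20, 15, 90, 30, 10]                  # a hole at the peak
print(closure_intervals(today, base))  # prints [3, 4]
```

The event day is excluded from the baseline because its features sit far from the median day, and the two flagged intervals are the kind of hole in entry demand that would show up on a heatmap.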

Duff said TfL uses these station logs to work out which stations have the most overcrowding. The organisation can then work out the chances of passengers being held outside these stations at certain times on particular days.

Exploiting train information data

Duff’s colleague Rahulan Chandrasekaran, now an economist at the Department for Transport and formerly a transport planner at TfL, explained how train movement data is captured via track circuits across the Underground network. These circuits record the frequency and movement of trains. Chandrasekaran said this data can be translated into CSV and R formats so scientists can search for key variables of interest:

The reason we want to do this is because we want to combine it with our ticketing data to address the issue of overcrowding. It's not enough to be able to get into a station – you’ve got to be able to get on a train. And not being able to get on the first train that arrives on the platform is a pretty common thing in the morning and in the evening rush hour. It's important because it's significant and it affects people's travel choices. This is a problem within TfL that we’ve tried to tackle for some time, but we now have a method for doing so.

Chandrasekaran said their approach uses a research technique known as trip assignment. He gave the example of data that shows the platform-to-gate walk-time of passengers at Clapham Common station. This database from winter 2018 includes 20,000 trips. TfL uses that distribution data today to estimate the probable trip of passengers for each train.
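One way to picture trip assignment is as a lookup of observed walk times against train departures. The sketch below is an assumption-laden illustration rather than TfL's method: the function name, the sample walk times and the departure times are invented, and it assumes a passenger boards the first train to depart after reaching the platform, which ignores exactly the left-behind passengers the analysis is trying to measure.

```python
from bisect import bisect_left
from collections import Counter

def train_probabilities(entry_time, walk_times, departures):
    """Share of observed walk times that put a passenger on each train.

    All times are in seconds; departures must be sorted ascending.
    Assumes the passenger boards the first train departing at or after
    their arrival on the platform (a simplifying assumption).
    """
    counts = Counter()
    for walk in walk_times:
        arrival = entry_time + walk
        idx = bisect_left(departures, arrival)  # first departure >= arrival
        if idx < len(departures):
            counts[departures[idx]] += 1
    return {dep: counts[dep] / len(walk_times) for dep in departures}

# Hypothetical walk-time samples (gate to platform) and train departures
walk_samples = [60, 75, 90, 90, 105, 120, 150, 180]
departures = [100, 220, 340]
probs = train_probabilities(30, walk_samples, departures)
print(probs)  # prints {100: 0.125, 220: 0.875, 340: 0.0}
```

With a real walk-time distribution of thousands of trips, the same idea gives each gate entry a probability for every candidate train, and those probabilities can be summed to estimate the load each train picks up.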

The results can then be visualised in different ways. Chandrasekaran gave the example of data from the morning period between 7am and 9.30am at 15-minute intervals. The results on-screen showed the probability of someone being able to enter the station and then board a northbound Northern Line Tube train. The number of passengers waiting on platforms was higher as the train moved northwards, becoming much greater at stations with narrow platforms, such as Clapham Common, and at the key central station of London Bridge.

Applying new insight in the business

TfL is using the insight from train information data in combination with ticketing data to improve the accuracy of its reporting. Duff said this insight means TfL can help its staff to reduce overcrowding and potential station closures:

We really want to know what station is going to be next in line so that we can make sure it doesn't become busy like Oxford Circus. We can spot concerning trends over time and act on them. And there's some interesting strategies at the very busy stations, where they throttle the flow and open fewer gates to keep the flow moving and that’s really helped. So we can support different mitigation strategies through this data. We also have a greater appreciation of the drivers behind station control, such as station layout and train service performance. We need to make sure that these insights are fed into future designs.

Duff said the data-led approach provides benefits in terms of customer experience, too: