Transport for London is using data science to improve the quality of its service by identifying the causes of disruption to trains and infrastructure on the London Underground and predicting when these failures will emerge.

To help Transport for London (TfL) live up to its slogan of "keeping London moving", all its trains, stations, signals, tracks, and escalators need to be operational every day. A small issue with any of them could cause a big disruption.

To improve their reliability and reduce maintenance costs, a team of three data scientists and a group of reliability analysts has been exploring what is driving failures and how they can be mitigated.

They are working on how predictive maintenance could drive down costs and improve the service. A project currently in production on the Central line analyses condition monitoring events to predict when a motor is about to fail.

"The expected savings from that is approximately £3 million a year, a very significant amount, because this type of failure is very costly and there was already a lot of scheduled maintenance happening in order to prevent it," Akis Tsiotsios, a data scientist at TfL, explained at AI Congress.

Keeping an old metro running

The tube became the world's first underground railway when it started running in 1863 between Paddington and Farringdon Street, a section that now forms part of the Circle, Metropolitan and Hammersmith & City lines.

It remains one of the busiest metro networks in the world. Every morning, 538 trains run between 270 stations, and a total of 1.4 billion journeys are expected across the network this year covering 86 million kilometres, the equivalent of 110 trips to the moon and back.

The Victorian infrastructure and the aging carriages that run along it need regular maintenance to limit disruption to these journeys.

Around half of all delays are caused by problems with TfL assets. The cost of maintaining them takes up 59 percent of the body's budget.

External factors can also cause disruption to the service. In one data science project, TfL investigated how weather conditions affect fleet reliability by finding correlations between failures and temperature, humidity and rain.

The team considered using a model of failure probability that aggregates all systems, but decided that a more powerful option would be to identify how individual subsystem components drive different susceptibilities.

This analysis resulted in a heat map that indicates the impact of each factor on every system and component.

High temperatures were found to be the main culprit behind failures. Low temperatures also had a significant impact.
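A heat map of this kind can be approximated by correlating failure counts for each subsystem against each weather factor. The sketch below is only an illustration, not TfL's method: the subsystem names, the daily figures and the choice of a Pearson coefficient are all assumptions made for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical daily observations; every value is invented for illustration.
weather = {
    "temperature_c": [4, 9, 15, 22, 28, 31],
    "humidity_pct": [80, 75, 60, 55, 40, 35],
    "rain_mm": [5, 2, 0, 1, 0, 0],
}
failures = {
    "doors": [3, 2, 1, 2, 4, 6],
    "traction_motor": [1, 1, 1, 2, 3, 5],
}

def correlation_matrix(failures, weather):
    """Return {subsystem: {factor: r}}, one cell per heat-map square."""
    return {
        subsystem: {
            factor: round(pearson(readings, counts), 2)
            for factor, readings in weather.items()
        }
        for subsystem, counts in failures.items()
    }

matrix = correlation_matrix(failures, weather)
for subsystem, row in matrix.items():
    print(subsystem, row)
```

A linear coefficient understates a relationship in which both high and low temperatures raise the failure rate, as in the invented "doors" series. Comparing failure rates across temperature bands would be one way to capture that shape.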

The researchers gave this feedback to their stakeholders to help with their decision-making around maintenance and upgrades.

Their ultimate aim was to identify what causes all the failures to the assets in order to conduct preventive maintenance.

Choosing the best data analytics model

To understand the causes of failures, the team studies datasets on TfL assets, failures, maintenance, service operation and external issues such as weather. The factors behind the failures include temperature, departure location, utilisation and maintenance rate.

They analyse the impact that each has on the failure rate, the level of the impact based on the frequency at which it occurs, and the cost of the failure with which it's correlated.
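One way to make that comparison concrete is to fold the three quantities into a single annual cost figure per factor. This is a minimal sketch under stated assumptions: the `Factor` fields, the multiplication used to combine them and all of the numbers are hypothetical, and the article does not say how TfL weighs these measures.

```python
from dataclasses import dataclass

@dataclass
class Factor:
    name: str
    extra_failures_per_occurrence: float  # assumed uplift in failures when the factor is present
    occurrences_per_year: float           # how often the factor occurs
    cost_per_failure_gbp: float           # average cost of the correlated failures

    @property
    def annual_impact_gbp(self) -> float:
        """Expected yearly cost attributable to this factor."""
        return (self.extra_failures_per_occurrence
                * self.occurrences_per_year
                * self.cost_per_failure_gbp)

# Invented figures, for illustration only.
factors = [
    Factor("high utilisation", 0.01, 300, 8_000),
    Factor("high temperature", 0.05, 40, 20_000),
    Factor("departure location", 0.02, 100, 5_000),
]

# Rank factors from largest to smallest expected annual cost.
ranked = sorted(factors, key=lambda f: f.annual_impact_gbp, reverse=True)
for factor in ranked:
    print(f"{factor.name}: about £{factor.annual_impact_gbp:,.0f} a year")
```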

"This gives us an overview of the impact of different factors that we can compare in order to understand what we should try to mitigate," said Tsiotsios.

Challenges include information silos, missing data, time limits imposed by TfL's continuous updating and renewal of its assets, and sparse data, as failures are relatively infrequent.

TfL is a large organisation that relies on safety-critical applications, so the data science team needs to collaborate with different departments and enforce effective timelines.

"We try to have a very strong engagement with our stakeholders, because each one of these projects [involves] a lot of different departments within the organisation, and we have some timelines regarding what we expect from staff and what we're trying to achieve," said Tsiotsios.

The maintenance decisions taken by staff need to minimise both failures and upkeep costs.

They could conduct maintenance periodically, based on mileage or time passed, but this would likely end up either wasting money by over-maintaining their assets, or allowing too many failures by under-maintaining them.

The data scientists thought a better option was to analyse historical failure and maintenance data to establish the probability of failures. They could then identify possible causes.

With that information, they could evaluate the cost of a failure, decide how many failures are acceptable, and set a fixed maintenance rate.

This option was an improvement but still not optimal: a fixed rate cannot prevent every failure, and many of the failures that do occur could have been avoided by better-timed maintenance.

"What we want to do is to maintain each asset independently right before the specific asset was about to fail," said Tsiotsios.

"We're talking about predictive maintenance, and the question here is how can we predict when a failure for a specific system type is about to happen."

Predictive maintenance at TfL

TfL can make predictive maintenance possible by analysing remote condition monitoring data that the body already collects.

Sensors on some TfL assets continuously monitor their underlying condition and recognise when an event on the track takes place.

There are hundreds of these events, ranging from when a door is closed to when a train passes a certain speed.

If a door fails, for example, there will be symptoms that arise before that failure occurs.

"The idea here is that the patterns of events before the failure should reflect these symptoms," said Tsiotsios.

"In other words, the patterns of events before a failure should be significantly different to the patterns of events during normal or healthy operations."

To model all this data, they built a machine learning classifier that can distinguish between these different patterns.

An algorithm can then evaluate the patterns of events that took place within the previous days or hours, and predict whether a failure is about to happen.
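The idea can be illustrated with a deliberately simple stand-in. The article does not say which model TfL uses, so the sketch below swaps in a nearest-centroid classifier: each window of events becomes a vector of event counts, and a new window is assigned to whichever class average it sits closest to. The event names and the training windows are hypothetical.

```python
from collections import Counter
from math import dist

# Hypothetical event types; real systems log hundreds of them.
EVENT_TYPES = ["door_closed", "door_retry", "speed_threshold", "motor_overcurrent"]

def to_vector(events):
    """Count how often each event type occurs in a window of events."""
    counts = Counter(events)
    return [counts[event] for event in EVENT_TYPES]

def centroid(vectors):
    """Average the vectors component by component."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def train(windows, labels):
    """Return one centroid per label ('healthy' or 'pre_failure')."""
    grouped = {}
    for window, label in zip(windows, labels):
        grouped.setdefault(label, []).append(to_vector(window))
    return {label: centroid(vectors) for label, vectors in grouped.items()}

def predict(model, window):
    """Pick the label whose centroid is nearest to the window's vector."""
    vector = to_vector(window)
    return min(model, key=lambda label: dist(model[label], vector))

# Invented training windows: pre-failure ones contain more retries and overcurrents.
windows = [
    ["door_closed"] * 10 + ["speed_threshold"] * 5,
    ["door_closed"] * 12 + ["speed_threshold"] * 4 + ["door_retry"],
    ["door_closed"] * 10 + ["door_retry"] * 6 + ["motor_overcurrent"] * 3,
    ["door_closed"] * 9 + ["door_retry"] * 8 + ["motor_overcurrent"] * 2
    + ["speed_threshold"] * 3,
]
labels = ["healthy", "healthy", "pre_failure", "pre_failure"]
model = train(windows, labels)

suspect = ["door_closed"] * 11 + ["door_retry"] * 5 + ["motor_overcurrent"] * 2
normal = ["door_closed"] * 10 + ["speed_threshold"] * 5
print(predict(model, suspect), predict(model, normal))
```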

Assets that the model expects to fail soon appear on an engineer's dashboard. The asset in question can then be withdrawn from service and maintained before the failure emerges.

TfL's data science projects

TfL is working on a number of data analytics experiments to improve the tube service, including the aforementioned Central line project.

This uses data that is downloaded from the manufacturer's condition monitoring system on to a server every day. An algorithm then evaluates the event patterns over the past five days and predicts whether a failure is likely to happen the next day.
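The five-day look-back can be pictured as a sliding window over daily data. The sketch below pairs each five-day run of event counts with a label saying whether a failure followed on the next day, which is the kind of training example such an algorithm would need. The function name, the single event count per day and the dates are assumptions made for illustration.

```python
from datetime import date, timedelta

WINDOW_DAYS = 5  # look-back period described in the article

def build_examples(daily_counts, failure_dates):
    """Pair each five-day window of counts with a next-day failure label.

    daily_counts: {date: count of a monitored event on that day}
    failure_dates: set of dates on which the asset failed
    """
    examples = []
    for target in sorted(daily_counts):
        # The five days immediately before the target day, oldest first.
        window = [target - timedelta(days=k) for k in range(WINDOW_DAYS, 0, -1)]
        if not all(day in daily_counts for day in window):
            continue  # skip target days without a full five days of history
        features = [daily_counts[day] for day in window]
        examples.append((features, target in failure_dates))
    return examples

# Invented data: event counts climb in the days before a failure on 8 January.
start = date(2018, 1, 1)
counts = [2, 3, 2, 4, 7, 9, 12, 3]
daily_counts = {start + timedelta(days=i): c for i, c in enumerate(counts)}
failure_dates = {date(2018, 1, 8)}

examples = build_examples(daily_counts, failure_dates)
for features, failed_next_day in examples:
    print(features, failed_next_day)
```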

They're also running a proof of concept on the Victoria line that predicts door failure through anomaly detection, and a couple of other projects that analyse signals generated from the sensors to continuously measure performance.

Another data science venture supports TfL's general effort to improve and monitor its data quality.

Many of TfL's datasets have incorrect or missing information. To address this, the data science team is using the free-text fields in which engineers describe failure symptoms and the actions taken to address them. These fields are used to train a machine learning classifier that analyses patterns in the text to predict which component failed.

So far, the algorithm has proven to be 75 percent accurate when identifying the component.

Any component that is flagged can then be inspected by an expert.
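A quality check of this kind might look something like the sketch below. The article does not name the model, so a small naive Bayes text classifier is used here as a stand-in: it learns which words go with which component, then flags any record where the recorded component disagrees with its prediction. The records, component names and function names are all hypothetical.

```python
from collections import Counter, defaultdict
from math import log

def train(records):
    """records: list of (free_text, component). Returns the model parts."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, component in records:
        class_counts[component] += 1
        word_counts[component].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the most probable component for a free-text description."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for component, n in class_counts.items():
        # Laplace smoothing so unseen word/component pairs keep a small probability.
        denominator = sum(word_counts[component].values()) + len(vocab)
        score = log(n / total)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += log((word_counts[component][word] + 1) / denominator)
        scores[component] = score
    return max(scores, key=scores.get)

def flag_for_review(model, records):
    """Return records whose recorded component disagrees with the prediction."""
    return [record for record in records if predict(model, record[0]) != record[1]]

# Invented maintenance records, for illustration only.
history = [
    ("door failed to close replaced door motor", "doors"),
    ("door stuck open adjusted door sensor", "doors"),
    ("brake pressure low replaced brake valve", "brakes"),
    ("brake pads worn replaced brake pads", "brakes"),
]
model = train(history)

new_records = [
    ("door would not close sensor adjusted", "brakes"),  # likely mis-recorded
    ("brake valve leaking replaced", "brakes"),
]
flagged = flag_for_review(model, new_records)
print(flagged)
```

The flagged record is only a prompt for a human check. The recorded value is never overwritten, which matches the quality assurance role described in the article rather than automatic data entry.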

"Our goal is not to build a machine learning tool that automatically fills in the data for us," said Tsiotsios. "We don't want to replace the domain knowledge of our engineers with a tool that is also subject to errors.

"We want to build a quality assurance tool in order to monitor data quality, in order to automatically detect when wrong data is being recorded and to build a process where we give feedback to the input team so that in the future this becomes better and better."