Introduction

I’ve been writing marathon-related blog posts for about 2 years now, describing a range of studies on different aspects of marathon running, such as the influence of age, gender, and experience on performance and pacing, and focusing on race-records from a wide range of big-city marathons around the world. To date these studies have focused on analysing marathon data with a view to gaining insights into what has happened in the past; something that is often referred to as descriptive analytics in the world of data science. Recently I have turned my attention to the future, to use this marathon data to gain insights into what might happen in the future — predictive analytics — and, in particular to make predictions about the potential of runners to achieve new personal best (PB) finish-times.

In fact, what began as a bit of data-fun in my spare-time, has now started to leak into my day-job, and this week I will present a scientific paper based on this prediction work. This is not so unusual. As a Professor in the area of artificial intelligence, machine learning, and recommender systems, a major part of my job involves publishing and presenting research ideas. It usually goes something like this:

Think of a novel idea and develop a test-system to demonstrate and evaluate it;

Write it up as a paper and submit it to a conference or journal;

Sometime later, receive a bunch of anonymous, usually critical, hopefully constructive, peer reviews, along with an acceptance or a rejection judgement;

If it’s an acceptance, then prepare and submit a final (camera-ready) ‘print’ version of the paper and, if it’s a conference, travel to present the work in person.

So, here I am in Trondheim, at the 25th International Conference on Case-Based Reasoning, where I will present a paper entitled:

Usually my attention is focused on more conventional application domains, such as e-commerce and web search, a far cry from marathons. That my running hobby has begun merge with my research is, in itself, a nice case-study on the serendipitous nature of research and the value of — to borrow an aphorism from the world of writing — researching what you know.

My plan in this post is to introduce you to this research and how it came about. In the process I will try to explain the core of the work, in as non-technical a fashion as I can, as well as presenting some initial results, along with some ideas about where we will take it next.

The Birth of an Idea

Not so long ago, as I droned on about my latest marathon data analysis to a friend, and fellow researcher, he asked whether it might be possible to use the data to predict a runner’s potential personal best finish-time for some upcoming marathon. To be clear, we were not talking about predicting any old marathon finish-time, based on a recent half-marathon or 10k race time; there are plenty of race calculators available to do this. Rather, we were interested in predicting a challenging but achievable PB time, which a runner might be capable of pushing themself to achieve. As we discussed the idea further it became more and more intriguing, not to mention challenging, for a variety of reasons:

We need to be able to predict a realistic PB finish-time for a runner, a time that will challenge them, rather than a soft target that will be readily achieved. At the same time picking a time that will be out of reach is a recipe for disaster, and liable to ruin a runner’s race if they try to achieve it. Predicting a target finish-time is just one part of what is needed. We also need to recommend a pacing plan so that the runner can be advised about how to pace themselves, throughout the race, to achieve the predicted PB. All of this needs to be tailored for the course in question as both finish-times and pacing plans are highly influenced by the twists and turns, ups and downs, of a particular race course.

We concluded that all of this might be achievable by using a machine learning technique called case-based reasoning (CBR). We were both very familiar with CBR, having developed a range of CBR systems and algorithms over the years, and, in what follows, I will try to give a sense of the system we built, and the initial results that we obtained using data from the last six years of the London Marathon.

To cut to the chase, and because this is one of my longer and more technical posts, Figure 1 presents a summary of predicted personal best improvements, for men and women, based on different current marathon times. These results are for the London Marathon, but analogous results have been obtained for Chicago, Berlin, Boston, and other cities. For example, it predicts that marathoners who are currently running London in 150-minutes should be capable of improving their times by about 5 minutes in the future, to achieve a PB of 145 minutes. As current finish-times increase, so too does the potential for improvement and, in general, men enjoy greater potential for improvement than women for a given finish-time. For example, 240-minute male finishers can expect to improve by up to 22 minutes in the future, while 240-minute women can expect improvements of about 17 minutes.

Figure 1. The predicted personal best improvements for men and women (and averaged overall all runners) for different current marathon times; all data and predictions are based on the London Marathon (2012–2017).

How did we generate these predictions? Let’s get back to the task at hand and describe exactly how we used case-based reasoning as the basis for this.

Defining the Problem

Let’s begin by carefully defining the problem we wish to solve and by making some simplifying assumptions about how we might approach it. For a given runner we wish to predict a challenging but achievable finish-time, for a given marathon course, and we also want to recommend a pacing plan to help them achieve this time. For simplicity we will assume this plan will take the form of a sequence of paces (mins/km) for each of the 5km segments/splits of the marathon (5k, 10k, 15k, …, 35k, 40k) plus the final 2.2km segment (from 40k to the finish). Moreover, we will assume that all we have to go on, when it comes to generating this prediction and plan, is a single marathon that our runner has previously run. This single marathon must include a finish-time and the split-paces. Obviously, if all we have is a single marathon, then that is not much to go on. We will come back to this later but for now it provides a starting point.

So, imagine a runner, let’s call her Ann, who has previously run the London marathon in a time of 4 hours 13 minutes (253 minutes), with a given pacing profile; that is, a given set of split-paces. We refer to this as Ann’s non-PB time, in the sense that it is what we want Ann to improve upon. We wish to recommend, to Ann, a new faster, finish-time that she should be able to achieve, together with a pacing plan to achieve it.

An aside on terminology: we will refer to a pacing plan as a set of 5k split paces suggested to a runner before her race, while a pacing profile refers to an actual set of split paces that were run in a race. So, a pacing plan is what a runner should run. A pacing profile is what a runner did run.

A Crash Course in Case-Based Reasoning

How might we begin to solve this problem? What information have we that might help, in addition to Ann’s previous marathon record? Well, we have plenty of race records for past runners of London, and some of these are repeat runners, with multiple London marathons under their belts.

Imagine a runner, Sarah, who has completed three London marathons, one in 252 minutes, one in 244 minutes, and one in 236 minutes. Sarah’s 253-minute finish-time is close to Ann’s, and Sarah went on to achieve a faster time (236 minutes) some time later. Does this suggest that Ann might also be capable of achieving a similar speed-up? It’s a reasonable start.

We might find additional support among other repeat runners with similar non-PB times to Ann, and who subsequently went on to achieve faster times. For example, we might consider the 10 or 20 runners with the most similar non-PB finish-times to Ann, who all went on to run even faster finish-times. Then we have a set of 10 or 20 faster finish-times, all by people who were similar to Ann, and we could use the average of their faster finish-times as a predicted PB for Ann, arguing that these runners were once like Ann and so, all other things being equal, it is reasonable to predict that Ann might improve in a manner similar to these runners in the future.

The basic idea, as outlined, is an example of case-based reasoning. This is a common machine learning technique, based on the intuition that, to solve some new problem (predicting a PB time for Ann) we should look to similar problems in the past (those runners who ran similar non-PB times to Ann) and use their solutions (subsequent PB times) as the basis of a prediction for Ann. That is CBR in a nutshell.

Every CBR system is based on a collection of past problems (the cases in a case base), each with its own problem description and solution component, and to solve a new problem we identify the the most similar cases, based on their problem descriptions, and adapt their solutions to fit the new problem situation.

Case-based reasoning has some nice properties. It is intuitive and its solutions are straightforward to explain to end-users, with reference to the similar cases that have been adapted. It is computationally straightforward and relatively easy to implement in code. One of the most important advantages of CBR is that its decision-making relies on examples, in the form of cases, rather than the carefully-crafted, hand-coded rules, which have been commonplace in more traditional forms of knowledge-based AI systems. This also means that it is easy for a CBR system to learn to solve new problems, by simply adding more cases to the case base. In the case of our marathon PB prediction task, if we have lots and lots of cases, covering male and female runners, of all ages and experience levels, and representing a wide range of finish-times, then we have a good chance of being able to find similar runners to act as a basis of a prediction for any given runner whose PB we wish to predict.

A key idea in case-based reasoning concerns how we determine the similarity between a new situation (Ann’s recent marathon race) and a past problem in some case (Sarah’s non-PB race). We have suggested that this might simply be calculated on the basis of just their finish-times — Sarah’s 252-minute finish is very similar to Ann’s 253-minute finish, plus they are both female — but we can do better than that. For example, we can look at the pacing profile of Ann and Sarah during their non-PB races. Suppose for the sake of argument that, although Ann and Sarah have very similar finish-times, their pacing profiles are quite different. For example, suppose Ann started fast and produced a positive-split, with the first half of her race faster than the second, where as Sarah started more cautiously, and went on to finish with a negative-split, with the second half of the race faster than the first. In contrast, Paula, another 252-minute finisher, like Sarah, has a positive-split pacing profile that is more similar to Ann’s. Accordingly, despite their identical finish-times Paula is considered to be more sinilar to Ann than Sarah, because Paula paced her race more similarly to Ann than Sarah.

In other words, when we measure similarity we use pacing information as well as finish-times and gender. The more similar another runner is to Ann, the greater the influence of their PB time on Ann’s prediction. In fact, we consider 3 ways to product a predicted PB time for a runner like Ann, from a set of similar cases:

Best — we take the fastest PB time from the most similar runners. Mean — we take the average PB time from the most similar runners. Even — we take the PB time of the similar runner with the most even pacing profile, on the basis that more even pacing is usually considered to be more optimal for the marathon distance.

In addition to predicting a PB time we also need to suggest a pacing plan to help runners achieve it, which depends greatly on the conditions of a given race course. Once again this is based on the pacing profiles of the PB races from the most similar runners to Ann, but adjusted to fit Ann’s predicted PB time. And once again, we consider 3 basic approaches:

Best — we use the pacing profile of the most similar runner. Mean — we use the average pacing profile, using the profiles of all of the similar runners. Even — we use the pacing profile of the similar runner with the most evenly paced PB.

All of this is illustrated in Figure 2. In Figure 2(a) we see Ann’s 253-minute race with a strong positive split — she started out 10% faster than her mean race-pace and finished about 7% slower — as our new input. In Figure 2(b) represent how we generate a collection of cases from the race records of people who have run at least 2 races. In this case we show Sarah’s 3 races with finish-times, 252, 244, and 236 minutes. The fastest (236 mins) is taken to be Sarah’s PB time (for the purpose of this work) and used to produce two cases for our case base: one case combines Sarah’s 252-minute non-PB time with her 236-minute PB, and the other combines her 244-minute finish with her 236-minute PB. Finally, in Figure 2(c) we see how Ann’s 253-minute race is used to retrieve a set of similar cases (Sarah’s included) to produce a predicted PB for Ann of 237 minutes, along with a new pacing plan for her. In this example, the new pacing plan is more even than Ann’s non-PB plan. It starts out a little faster than her mean race pace, but not too fast, and finishes a little slower than her mean race pace, but not too slow, and in fact the final short segment of teh race calls for Ann to push for a faster finish.

Figure 2. Using case-based reasoning to predict a marathon PB.

This summarises how we make our PB predictions, by adapting the PB times and pacing profiles of similar runners for our target runner. These similar runners all have past races that are similar to our target runner’s performance, but they all went on to achieve faster times in the future, and these faster times, and their corresponding pacing profiles, are the basis of the prediction for the target runner.

The more cases we have, the better the predictions are likely to be, and if we want to make predictions for a different marathon, let’s say Chicago, then we can generate our cases from the race records of Chicago marathoners. In fact, we might even mix-and-match our marathons, so that cases encode the race records from different courses and provide a basis for predicting a PB for some new race on course X based on past races from course Y. But we are getting ahead of ourselves and all of that is for another day. For now we will stick with London.

Testing, Testing, 1 . 2. 3.

We built this system and loaded up its case base based on repeat runners of the London marathon, using 215,000 race-records for 185,000 runners, during the period 2012–2017. In the end, the case base contained almost 13,000 cases based on the race histories of 5,390 male and female runners; we only considered runners who had completed at least 3 London marathons during the 2012–2017 period. Once finished and debugged it was time to test our system.

There are two ways to test such a system. The obvious approach (one we call an online experiment) is to deploy it and encourage runners to use it, and to report back on whether their predicted PB turned out to be accurate. The problem with this is that it can be expensive and slow: it needs to be promoted and advertised and it can take a long-time to obtain feedback from users.

The alternative is an offline experiment where we use the data we already have to test the system. One way to do this is called cross-validation and it works as follows:

Divide the cases up into training and test cases by randomly selecting 90% of the cases as training cases and the remaining 10% as test cases, making sure that none of the runners in the test cases are also in the training cases, which might otherwise contaminate our results. Next, treat each test case as a new target runner who’s PB we want to predict. Use the non-PB part of the test case to generate a predicted PB-time and pacing plan, as described above. Compare this predicted PB time and pacing plan to the actual PB time and the actual pacing plan of the test case. We compute a percentage difference, or error, between the predicted and actual PB times, and we calculate a similarity score by comparing the predicted and actual pacing plans. The bigger the error, the more different the predicted and actual PBs. The larger the similarities, the more similar the predicted and actual pacing plans. We’d like to see small prediction errors and large pacing plan similarities. Finally, we average these errors and similarities across all of the test cases and repeat for a different training/test splits; we do this for 10 different training/test splits (a 10-fold cross-validation). We perform this type of testing for each of our different strategies, Best, Mean, and Even, so that we can compare them.

In what follows we will look at the results of this testing to get a sense of (a) how accurate the PB predictions turned out to be (relative to actual PBs of the runners) and (b) how similar the suggested pacing plans were, compared to the actual pacing profiles of the PB races.

How Many Similar Cases?

One common question when developing a CBR system is, how many similar cases (k) should be retrieved to provide a basis for problem solving (PB prediction and pacing recommendation, in this instance). Figure 3 shows the (a) prediction accuracy and (b) pacing profile similarity for each of the three strategies (Best, Mean, and Even), and for different numbers of retrieved cases; for each strategy we separate the results for male and female runners (the dotted/dashed boundaries of each ‘ribbon plot’) and also show the average over all runners (the marked central lines).

Figure 3. Prediction error and pacing profile/plan similarity versus k, the number of cases retrieved.

These results indicate that the Mean strategy performs best overall, across all values of k>1. It has the lowest PB prediction error and the highest pacing profile similarity. In other words, it predicts PB times that are closest to the actual PB times of the test runners and it recommends pacing plans that are the most similar to those actually run by the runners when they achieved their PB.

The Mean strategy also improves as we include more similar runners (increasing k). For example, at k=1 the prediction error for all strategies is about 6% — that is, the predicted PB times are off by about 6% — and the average pacing profile similarity is about 0.85. But if we used the 10 most similar cases (k=10) then the prediction error for Mean falls to less than 5% and pacing plan similarity increases to about 0.88.

The Best and Even strategies perform less well. They suffer from higher prediction errors and lower pacing plan similarities for increasing k. The Best strategy suffers the most because it is simply too optimistic, predicting PBs that are far too ambitious.

What About Runner Ability?

It is interesting to notice, in the above results, how we can make more accurate predictions for female runners than for male runners. The prediction error for females is consistently lower than for males, and the pacing profile similarity is higher for female runners than for male runners, regardless of strategy. The literature suggests that women are more consistent pacers than men and, as such, tend to run more disciplined races, which may make them inherently more predictable, compared with men.

Building on this idea we might also wonder whether prediction accuracy depends on the runner’s ability level. If faster runners are likely to be more disciplined, does this also make them more predictable? Evidently it does, as the results in Figure 4 testify; incidentally, we average the results for all values of k here for simplicity.

Figure 4. Prediction error and pacing plan/profile similarity versus runner ability, based on current finish-times (minutes).

Faster runners enjoy PB predictions with lower error rates, and pacing profiles/plans with higher similarities, compared with slower runners; once again we can see the Mean strategy tends to work best for all finish-times, and how women continue to enjoy better predictions and pacing plans than men. For example, the Mean strategy can predict a PB time with an error rate of only about 2%, for runners who are currently capable of a 150-minute London finish, and the pacing plan similarity is about 0.92. By comparison, for 240-minute finishers, the PB prediction error is considerably higher, at 6%, and pacing plan similarity falls to about 0.85.

PB Ambitions

So far so good, especially for faster runners. One final matter to consider concerns the nature of the PBs encoded by cases. In this work, we pair non-PB times with PB times for London marathoners to produce our cases. Sometimes, the difference between a non-PB and a PB time is small, as runners gradually improve over time. Sometimes it is much larger; perhaps the runner in question has returned to London after a few years of steady progress, to register a big improvement on their previous time. Do small or large PB improvements matter with respect to prediction performance? Is it easier to make predictions for runners with more modest PB improvements than for runners who register much greater PBs?

Figure 5. Prediction error and pacing plan/profile similarity versus percentage PB differences.

To test this, in Figure 5, we look at the average prediction error and pacing profile similarity, based on the percentage difference between the non-PB and PB time of cases. It turns out that it is easier to make predictions for runners with more modest PB improvements, after all larger PB improvements are relatively rare in our data. For example, focusing on the Mean strategy, when the difference between a runner’s non-PB time and their PB time is 5% (regardless of ability) then our predictions are accurate to within 3% and the pacing plans tend to have similarities of 88. But for runners with larger PB improvements, say 15%, then the prediction error grows to 10% and profile similarity falls to about 0.83.

Interestingly, there comes a point when the Best strategy starts to win out, at least in terms of its prediction error. When the PB difference is greater than about 12%, then the Best strategy tends to produce PB predictions with the lowest error. But these are few and far between in practice.

For the Future …

There is a lot more to say on all of this but that’s probably enough for now. We have described an approach to helping runners to achieve a new marathon PB by suggesting challenging but achievable PB times and by recommending pacing plans to help a runner achieve such a time. We have also described an offline evaluation of this approach, which suggests it is capable of making predictions that are sufficiently accurate to be useful in practice.

Research like this is fundamentally incremental in nature. Now that we have proven the basic concept it is time to refine and iterate, and there are lots of ways to do this. At the very least we need to enrich our cases. Why rely on a single non-PB race, for example? Surely it would be better to take a runner’s race history? Yes it would. Would it also make sense to look at different machine learning techniques, other than case-based reasoning? Probably. Should we deploy it as a live system for runners to use ahead of their next marathon, and use their feedback to evaluate the approach? Certainly.

In the coming months we will continue work on this and see where it takes us. At the very least, it should be possible to improve prediction accuracy to make the system more reliable for runners of all levels of ability. But our ambitions go beyond this and, in fact, we are looking at many different ways in which machine learning can be used to help marathon runners and other athletes, not just for PB prediction, but also for injury prevention, for recovery advice, to personalise training plans, etc.

I will report back on these in the future. Until then, I would love to receive any feedback or suggestions you might have.