There’s a lot of hype these days around predictive analytics, and maybe even more hype around the topics of “real-time predictive analytics” or “predictive analytics on streaming data”. Like most things that are over-hyped, what is actually meant by the term is often lost in the noise. In this case that’s really a shame, because these terms refer to at least two different things, either or both of which may be important in a given context.

This post forms the basis of a lightning talk I’m giving (remotely) at the Real-time Big Data Meetup in Menlo Park, California. Join the group if you’re interested!

What is Machine Learning from Streaming Data?

Generally, when I hear people talking about “machine learning from streaming data”, they mean one or both of two things:

1. They want a model that takes recent history into account when it makes its predictions. A good example is the weather: if it has been sunny and 80 degrees the last two days, it is unlikely that it will be 20 and snowing the next day.

2. They want a model that is updatable. That is, they want their model to in some sense “evolve” as data streams through their infrastructure. A good example might be a retail sales model that remains accurate as the business gets larger.

These two phenomena sound like the same thing, but they are potentially very different. The central question is whether the underlying source generating the data is changing. In the case of the weather, it really isn’t (okay, okay): given the weather from the previous few days, you can usually make a pretty good guess at the weather for the next day, and that guess, given recent history, will be roughly the same from year to year. Last year’s model will work for this year.

In the case of the business, the underlying source is changing; the business is growing, and so your guess of the sales given the previous few days of sales is probably going to be different from last year. So last year’s data, when the business was small, is really not relevant to this year, when the business is large. We need to update the model (or scrap it completely and retrain) to get something that works.

The first case, where you want the prediction conditioned on history, I’m going to call time-series prediction. That problem deserves a post all its own, but suffice it to say that solutions to it revolve largely around feeding the prediction history to the model as input. That’s a massive oversimplification, but there’s plenty of information out there if you’re interested.
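To make “feeding the prediction history to the model as input” concrete, here is a minimal sketch (the function name and the toy temperature data are my own illustration, not from any particular library) of turning a series into lagged feature/target pairs that any ordinary supervised algorithm can consume:

```python
def make_lagged_features(series, n_lags):
    """Turn a univariate series into (features, target) pairs,
    where each feature vector holds the n_lags previous values."""
    X, y = [], []
    for i in range(n_lags, len(series)):
        X.append(series[i - n_lags:i])  # recent history as inputs
        y.append(series[i])             # next value as the target
    return X, y

# Daily temperatures: predict tomorrow from the last two days.
temps = [80, 81, 79, 82, 80, 78]
X, y = make_lagged_features(temps, n_lags=2)
# X[0] is [80, 81] and y[0] is 79
```

A model trained on these pairs behaves conditionally on recent history without its parameters ever needing to change.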

The second case, where you need to update the model or retrain completely, is about dealing with non-stationarity, and that’s largely what the rest of this post is going to be about. But consider this first lesson learned: Time series prediction and non-stationary data distributions are two different problems.

Some Options

When I think about the second case above, a couple of classes of approaches jump to mind:

Incremental algorithms: These are machine learning algorithms that learn incrementally over the data. That is, the model is updated each time it sees a new training instance. There are incremental versions of support vector machines and neural networks, and Bayesian networks can be made to learn incrementally.

Periodic retraining with a batch algorithm: Perhaps the more straightforward solution. Here, we simply buffer the relevant data and retrain our model “every so often”.

Note that any incremental algorithm can work in a batch setting (simply feed the instances in the batch to the algorithm one after another). The reverse, however, isn’t trivially true: many batch algorithms can only be made to work incrementally with significant effort or sacrifices in power, and some simply can’t be adapted at all.
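As one concrete illustration of the incremental setting, here is a minimal sketch using scikit-learn’s `SGDClassifier`, one off-the-shelf incremental learner (the choice of library and the synthetic data are mine, not something this post prescribes). The same loop, run once over a buffered dataset, reproduces the batch setting described above:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stream: 200 points, label is positive when x0 + x1 > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # incremental learners need the label set up front

# Update the model one mini-batch at a time, as instances "arrive".
for start in range(0, len(X), 20):
    batch = slice(start, start + 20)
    clf.partial_fit(X[batch], y[batch], classes=classes)
```

After each `partial_fit` call the model is immediately usable for prediction; there is no buffering and no explicit retraining step.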

So is the sacrifice worth it? Here are a couple of considerations to think about:

Data horizon: How quickly do you need the most recent datapoint to become part of your model? Does the next point need to modify the model immediately, or is this a case where the model needs to behave conditionally based on that point? If it is the latter, perhaps this is a time-series prediction problem rather than an incremental learning problem.

Data obsolescence: How long does it take before data becomes irrelevant to the model? Is the relevancy somehow complex? Are some older instances more relevant than some newer instances? Is relevance variable depending on the current state of the data? Good examples come from economics: generally, newer data instances are more relevant, but in some cases data from the same month or quarter of the previous year is more relevant than the previous month or quarter of the current year. Similarly, during a recession, data from previous recessions may be more relevant than newer data from a different part of the economic cycle.

With these two concerns in mind, along with your architecture and implementation constraints, you can get a pretty good idea of whether you’re looking at a problem where incremental learning is desirable or not.

Obviously, the shorter the data horizon, the more likely you are to want incremental learning. However, it’s a common mistake to confuse a short data horizon with a time-series prediction problem: If you want your model to behave differently based on the last few instances, the right thing to do is condition the behavior of the model on those instances. If you want the model to behave differently based on the last few thousand instances, you may want incremental learning.

This, however, is where the second concern rears its ugly head. Incremental learners all have some built-in parameter or assumption that controls the relevancy of old data. This parameter may or may not be modifiable, and the relationship may be complex, but the algorithm will be making some implicit assumption about how relevant old data is. This is the second lesson: be wary of the data relevance assumptions made by incremental learning algorithms!
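A deliberately tiny example of such a built-in assumption (my own simplification, not any specific learner): an exponentially weighted running mean, where the single parameter `alpha` silently fixes how fast old data loses relevance. An observation seen `k` updates ago carries weight proportional to `(1 - alpha)**k`, whether or not that schedule suits your data:

```python
def ewma_update(estimate, new_value, alpha):
    """One incremental update of a running mean.
    alpha encodes the relevance of old data: an observation
    seen k steps ago is weighted by (1 - alpha)**k."""
    return (1 - alpha) * estimate + alpha * new_value

# The underlying source shifts halfway through the stream.
stream = [10.0] * 50 + [20.0] * 50
est = stream[0]
for x in stream[1:]:
    est = ewma_update(est, x, alpha=0.1)
# by the end, the estimate has nearly converged to the new level of 20
```

With a smaller `alpha` the estimate would still be lagging well below 20; the forgetting rate is baked into the learner, not into your data.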

By contrast, retraining in batches offers lots of flexibility in this regard. It is easy to select the data for retraining, filter it by relevant criteria, or even weight it according to some relevancy function using one of the many batch training algorithms that take instance weights into account. There has even been some recent work on automatically detecting when retraining is necessary, based essentially on how different the incoming data is from the recent data that came before it.
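For instance, with a batch algorithm that accepts instance weights (scikit-learn’s `LogisticRegression` is one; the library, the synthetic data, and the exponential relevancy function are all my own illustration), the weighting can be applied explicitly at retraining time:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Buffered data: 300 instances, label is positive when x0 > 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)

# Age of each instance at retraining time (0 = newest).
age_in_days = np.arange(len(X))[::-1]

# Relevancy function: exponentially downweight older instances.
# Unlike an incremental learner's built-in decay, this is explicit
# and can be swapped for any domain-specific scheme.
weights = np.exp(-age_in_days / 100.0)

model = LogisticRegression()
model.fit(X, y, sample_weight=weights)
```

Because the relevancy function lives in your retraining code rather than inside the algorithm, a seasonal or recession-aware weighting like the economics examples above is just a different `weights` array.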

Summing Up

First of all, don’t confuse learning from streaming data with time series prediction; while the data sources from the two problems look similar, the two concerns are often orthogonal.

Incremental learning is great for two cases: First, simplicity. There’s no buffering and no explicit retraining of the model. Second, speed. You always have a model that’s up to date. You make sacrifices, however, in terms of the power of the model that you can learn and the flexibility of the model to incorporate old data to different degrees. There are also some corner cases where incremental learning is necessary, such as when data privacy demands that instances be discarded immediately after they are seen.

Periodic retraining requires more decisions and more complex implementation. However, you get all of the power of any supervised classification algorithm, and specialized tools can be built on top of it to allow you to retrain on only relevant data and only when necessary. It also offers the nice benefit of being able to plug-and-play different machine learning algorithms into your architecture with a minimum of hassle, as the learning bit is built completely from off-the-shelf algorithms.

BigML has mainly gone the second route, allowing you to easily upload new data and trigger retraining (either manually or via the BigML API) whenever you find it necessary. One thing on our to-do list is to implement some of those specialized tools on top of the BigML API, so that data can be streamed to BigML and the model is automatically retrained only when it needs to be. We’ll update you on our blog about any big changes!

EDIT: This article originally used the terms “classifier” and “predictor” more or less interchangeably; as one astute reader pointed out, they’re not really interchangeable. I’ve replaced both with the term “model”, which follows the convention we’ve been using at BigML. Sorry for any confusion that might have caused!