
Machine Learning Using Mixed Integer Programming

To many operations research (OR) professionals, machine learning (ML) is just another optimization problem with its own constraints and objectives. A quick search of these two terms returns several posts describing which situations call for one over the other as the tool of choice. However, the purpose of this post is not to pit the two against each other and tell you when one or the other comes out on top; instead, we want to demonstrate the scalability and performance challenges that arise when mixed-integer programming (MIP) is used to model some ML algorithms.

Most ML models aim to find a set of parameters that achieve a goal. For example, in clustering, one wants to characterize previously unknown groups (or clusters) such that points within said groups are closer (i.e. more similar) to each other than to points outside. In a Support Vector Machine (SVM) model, one wants to separate points of different known groups (classes) by some boundary (usually a line or hyperplane), such that the divider is able to classify current and future points with maximum discriminatory power. In linear regression, one wants to find the best parameters for a function of features to predict target values such that these parameters minimize the sum of squared errors or maximize the likelihood.
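For instance, ordinary least-squares regression is literally an optimization problem. In generic notation (feature values x[ip], targets y[i], and weights w[p] to be learned), it reads

$$\min_{w_0, w_1, \ldots, w_P} \; \sum_{i=1}^{N} \Bigl( y_i - w_0 - \sum_{p=1}^{P} w_p\, x_{ip} \Bigr)^{2}$$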

Considering that these are all about optimizing a set of parameters to maximize or minimize some criterion, it seems plausible to formulate a machine learning model using mathematical optimization (e.g. MIP, NLP, or MINLP). As stated by Bennett & Parrado-Hernandez, the fields of mathematical optimization and machine learning are closely connected:

“Optimization lies at the heart of machine learning. Most machine learning problems reduce to optimization problems.”


As OR professionals, we thought it would be interesting to use MIP to formulate a machine learning model and then compute a solution with a solver such as CBC, CPLEX, or Gurobi (possibly through a modeling layer such as CVX).

Let’s take a look at a clustering example. Assume that we are given N points, each with P features, so x[ip] is the value of feature p=1,2,…,P for the i-th point, i=1,2,…,N. We want to find C clusters such that points within a cluster are more similar to each other than to points outside it. To do this, we define a dissimilarity measure d[ij] between point i and point j.
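One common choice, which we assume here (the argument works just as well with any other metric), is the squared Euclidean distance:

$$d_{ij} = \sum_{p=1}^{P} \bigl( x_{ip} - x_{jp} \bigr)^{2}$$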

where d[ij]=0 means points i and j are similar, and the dissimilarity grows as d[ij] increases.

Now, let

y[ijk] = 1 if both point i=1,2,…,N−1 and point j=i+1,…,N are in cluster k=1,2,…,C, and 0 otherwise;

z[ik] = 1 if point i=1,2,…,N is in cluster k=1,2,…,C, and 0 otherwise.

Thus, we can formulate an MIP model to find optimal clusters as follows:
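Spelled out, one such formulation (a sketch consistent with the variable definitions above) minimizes the total within-cluster dissimilarity, with linking constraints that force y[ijk] to 1 whenever points i and j are both assigned to cluster k:

$$\begin{aligned}
\min \quad & \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \sum_{k=1}^{C} d_{ij}\, y_{ijk} \\
\text{s.t.} \quad & \sum_{k=1}^{C} z_{ik} = 1, \qquad i = 1, \ldots, N \\
& y_{ijk} \ge z_{ik} + z_{jk} - 1, \qquad i < j, \; k = 1, \ldots, C \\
& y_{ijk}, \; z_{ik} \in \{0, 1\}.
\end{aligned}$$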

This MIP model has N(N-1)C/2 inequality constraints, N equality constraints, and N(N-1)C/2+NC binary variables. So, as data grows (which is good for accuracy), this model grows as well (which is bad for performance and solve time).
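To make this concrete, here is a sketch of how such a model could be assembled in Python with PuLP (the actual script we used is linked at the end of the post; the function name and arguments here are purely illustrative):

```python
import itertools
import pulp

def build_clustering_mip(d, n_clusters):
    """Sketch of the pairwise clustering MIP; d is an N x N dissimilarity matrix."""
    n = len(d)
    pairs = list(itertools.combinations(range(n), 2))
    prob = pulp.LpProblem("clustering", pulp.LpMinimize)
    # z[i, k] = 1 if point i is assigned to cluster k
    z = pulp.LpVariable.dicts(
        "z", [(i, k) for i in range(n) for k in range(n_clusters)], cat=pulp.LpBinary)
    # y[i, j, k] = 1 if points i and j are both in cluster k
    y = pulp.LpVariable.dicts(
        "y", [(i, j, k) for i, j in pairs for k in range(n_clusters)], cat=pulp.LpBinary)
    # objective: total within-cluster dissimilarity
    prob += pulp.lpSum(d[i][j] * y[i, j, k]
                       for i, j in pairs for k in range(n_clusters))
    # each point belongs to exactly one cluster (the N equality constraints)
    for i in range(n):
        prob += pulp.lpSum(z[i, k] for k in range(n_clusters)) == 1
    # linking constraints (the N(N-1)C/2 inequalities): sharing cluster k forces y[i, j, k] to 1
    for i, j in pairs:
        for k in range(n_clusters):
            prob += y[i, j, k] >= z[i, k] + z[j, k] - 1
    return prob

# e.g. build_clustering_mip(d, 2).solve(pulp.PULP_CBC_CMD(msg=False))
```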

We ran a simple experiment to see how this MIP model behaves under different conditions. In this experiment, we clustered 10, 15, 20, and 30 points into two and four clusters. We used the open-source solver CBC, along with Gurobi, to solve the model for each data-size/cluster-count combination. Here is what we found:

As you can see, even for a very small-scale clustering problem (4 clusters and 30 data points), this model takes more than 25 minutes to fit.

One possible reason for the long runtimes is that this model contains a lot of symmetry: any relabeling of the clusters yields an equivalent solution. We can add constraints to break this symmetry, but there is no guarantee that doing so results in a significant performance improvement. In fact, when we experimented with a symmetry-breaking constraint, it actually increased the overall runtime.
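For reference, one standard family of symmetry-breaking constraints (a representative example, not necessarily the exact one we tried) restricts point i to the first i cluster labels, so that the cluster containing point 1 is always labeled cluster 1, and in general point i can only use one of the first i labels:

$$z_{ik} = 0 \qquad \text{for all } k > i$$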

Finding good symmetry-breaking constraints is a research topic in its own right. Another way of attacking this problem is to take advantage of methods such as column generation. This research paper has solved instances with more than 2,000 points and more than two clusters to optimality. Even though instances of a few thousand points may not be large enough for many applications, the paper gives good insight into the structure of optimal solutions and clever methods for finding them.

It’s also possible to reduce the runtime by using a heuristic algorithm (which is what we usually do in OR when we can’t solve a problem to optimality). As formulated here, K-means clustering is NP-hard, and solving large instances of NP-hard problems often requires heuristics. So, as OR professionals, we might try to solve a K-means-type clustering problem by developing an efficient heuristic. But here’s the interesting part: the iterative algorithm that is widely used to solve K-means clustering is, in fact, a heuristic itself! You call it an ML algorithm, we call it a heuristic for our OR problem. To-may-to, to-mah-to!
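For reference, here is a minimal NumPy sketch of that iterative heuristic, commonly known as Lloyd’s algorithm (lloyd_kmeans and its defaults are our own illustrative names):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal sketch of Lloyd's algorithm; X is an (N, P) array of points."""
    rng = np.random.default_rng(seed)
    # initialize centroids by sampling k distinct points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: attach each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged to a local optimum, not necessarily the global one
        centroids = new_centroids
    return labels, centroids
```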

Note for more meticulous readers: K-means is actually a heuristic (arguably the most popular one) for solving the Minimum Sum-of-Squares Clustering (MSSC) problem. There are other heuristics for the MSSC problem, one of which can be found here.


The MIP approach definitely produces accurate clusters, but as the data scales, computational performance becomes a major obstacle. In the future, computing power may advance to a level that makes the MIP approach efficient enough for practical use. Until then, it is a good idea to keep both ML and MIP in our toolboxes.

If you found this post interesting, you might enjoy a blog we wrote about mathematical optimization and machine learning that discusses how a machine learning model can be formulated as an optimization problem. In addition, we recommend this blog, which gives a good overview of the optimization at the core of machine learning models. If you want to dig deeper, you can find more technical articles on this topic here. Finally, you can see the Python script we used for the numerical experiment above here.