This is joint work with Vincent Stettler. The complete code is available here.

In this post, we will explore a fascinating emerging topic: using reinforcement learning to solve combinatorial optimization problems on graphs. We will see how this can be done concretely for finding solutions to the celebrated Traveling Salesman Problem. Our approach will combine neural networks that learn embeddings of random graphs with reinforcement learning to iteratively build solutions. In the end, we will obtain a single neural network that is able to produce relevant routing decisions on new random graphs.

We will build a solution iteratively throughout the article, using Python and PyTorch.

Traveling Salesmen and Pub Crawls

The Traveling Salesman Problem (TSP) consists in finding the shortest possible tour connecting a list of cities, given the matrix of distances between these cities. Here is an example of a solution (from the Wikipedia TSP article [1]):

This problem has many very concrete applications in domains such as logistics, vehicle routing, chip manufacturing, astronomy, image processing, DNA sequencing and more.

Finding an optimal solution is known to be an NP-hard problem, and there are N! possible orderings of N cities (which amounts to (N-1)!/2 distinct tours when distances are symmetric). Despite these gloomy prospects, the problem has a very rich history, and it has been one of the most studied problems in the computational sciences. There now exists a host of different techniques, both exact and heuristic, to find good solutions efficiently. The approaches range from heuristics that simulate ant colonies to extremely efficient solvers (such as Concorde [2]) that make sophisticated use of integer linear programming methods to find exact solutions.

Solvers such as Concorde can find optimal tours for problems of a few tens of thousands of cities, and close-to-optimal (within 1% or less) tours for problems with millions of cities. At the time of writing, one of the largest problem instances was solved using Concorde in 2006. It involved 85,900 interconnections (“cities”) in a VLSI application and took 136 CPU-years at the time [1]. This instance was Euclidean, meaning that the distances between cities are the Euclidean distances between their coordinates. Large non-Euclidean problems (for instance, using walking times between cities instead of Euclidean distances) have also been solved, although this is in general much more challenging.

Maybe the most important instance ever solved consists in a tour going through 49,687 pubs in the UK with the smallest walking distance (45,495,239 meters). It was computed in March 2018 [3]:

Overall, we recommend that interested readers consult [4] for more information and nice anecdotes about the TSP, its history, and how it is solved in practice today.

Why Reinforcement Learning?

In this post, we will try a completely different way to approach the TSP, using reinforcement learning. But why would we even do that, given that there already exists a host of very effective techniques for solving the TSP?

Here, our goal isn’t to beat current solvers; in fact, we won’t even come close. Rather, we will consider a different point in the tradeoff space: we will build a neural network that finds tours without any built-in knowledge of the TSP. The only signals used by our model will be the observed cumulative “rewards”, that is, how long some tours (or portions of tours) are. There is a very elegant side to this approach, as it means that the algorithm can potentially figure out the problem structure without human help.

Having the computer figure out good heuristics by itself is perhaps the most common dream of computer scientists.

In addition, the approach has a couple of other advantages:

It’s possible to easily plug in fairly arbitrary reward functions in this framework, so one can use the exact same algorithm on different variants of the TSP, and notably on variants that significantly deviate from being Euclidean. Think for instance of minimizing a tour duration, with some time being spent in each city, or with random delays attached to links and depending on certain dynamic features.

Inference speed & online use: once we have trained our neural network to output the next best city to visit, deciding where to go next just requires a forward pass through our network, which is typically very fast. This lends itself better to dynamic and online use cases than some slower approaches.

We get to apply some machine learning on yet another problem. It’s 2020 after all!

In this post, we will only consider toy problems (nothing like the pub crawl shown above), and focus on building the full solution end-to-end in a comprehensible way, rather than on getting amazing performance.

A (Very) Quick Primer on Reinforcement Learning

Reinforcement learning (RL) concerns itself with an agent taking sequences of actions in order to maximize its (cumulative) reward over time. The agent maintains its own internal state, and it interacts with its environment. In general, these interactions are often depicted in diagrams like this:

The agent can be, say, a moving robot. At each time step t, the robot’s state would hold an estimate of its position and velocity. Based on this state, the robot can choose an action (for instance, turn its right wheel by a certain amount). Once the action is done, the robot can measure its new position and velocity, which depend both on the external environment and on the action taken. In addition, it can observe a reward for the action it just took, for instance how much closer to its destination this action took it.

There are different flavours of this RL setup. For instance, in some versions the environment maintains its own state that the agent cannot fully observe. For the TSP however, we will be in the simpler case where the agent has full information, and its state corresponds to the state of the environment.

In our case, the agent is interested in finding a short tour. Its state contains the entire graph connecting all cities, the cities’ coordinates, as well as the history of visited cities. The “environment” is fully observed and trivial: it simply lets the agent make a deterministic move along a graph edge and observe the travelled distance.

Now that we have a definition of state, we can already start writing code, yay!
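A minimal sketch of such a state tuple could look like this. The field names `W`, `coords` and `partial_solution` are illustrative and may differ from the ones used in the notebook:

```python
from collections import namedtuple

# Illustrative field names; the notebook may use different ones.
State = namedtuple("State", ["W", "coords", "partial_solution"])

# Tiny 3-city example: W[i][j] is the distance between cities i and j.
coords = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
W = [
    [0.0, 3.0, 5.0],
    [3.0, 0.0, 4.0],
    [5.0, 4.0, 0.0],
]

# The partial solution is the ordered list of cities visited so far.
state = State(W=W, coords=coords, partial_solution=[0])
```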

Here we simply defined a named tuple representing the state. It contains a matrix W, which is the NxN matrix containing the distances between all the N cities. It also contains the coordinates of the nodes, as well as the current partial solution, which will be a list of visited cities. This tuple representation will be our “user-friendly” representation of the state.

We will also maintain a tensor (approximate) representation of this state, which will contain a fixed-size representation of the set of visited nodes. We omit this part from the article for simplicity, but you can find the complete code, with the imports and everything needed to run it, in a notebook here.
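To give an idea of what such a fixed-size representation could contain, here is a hedged sketch in plain Python. The choice of features (visited flag, first-node flag, last-node flag, and coordinates) and the function name `state_to_features` are assumptions made for illustration; the notebook holds the actual version.

```python
def state_to_features(coords, partial_solution):
    """Return one feature row per node: [visited, is_first, is_last, x, y].

    The feature layout is an assumption for illustration only.
    """
    visited = set(partial_solution)
    first = partial_solution[0] if partial_solution else -1
    last = partial_solution[-1] if partial_solution else -1
    return [
        [float(i in visited), float(i == first), float(i == last), x, y]
        for i, (x, y) in enumerate(coords)
    ]


# Three cities, of which city 2 was visited first and city 0 last.
features = state_to_features([(0.1, 0.2), (0.5, 0.9), (0.7, 0.3)], [2, 0])
```

Note that the row for each node has the same size regardless of how many cities were visited, and that the order of the visit is only partially captured (through the first and last flags).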

Q-learning

Q-learning refers to an algorithm for training agents so that they (hopefully) converge towards selecting optimal actions. It has been shown to be successful in different contexts — for instance, this is the method that was used by Mnih et al. in [5] to learn to play Atari games. This is also the approach that we’ll follow here.

In general, the idea of Q-learning is to learn an action-value function Q(s, a) which, for a given state s and a given action a, returns the expected cumulative reward that the agent will obtain until the end of the episode (in our case, an episode is one instance of the TSP). Once we have a well-behaved version of such a function, and if we can afford to enumerate all the possible actions, we can simply let the agent pick the action a that maximizes the estimated expected return Q(s, a) from any state s.

On a high level, in order to learn the function Q(), the idea is to run many episodes with our agent and iteratively improve the function. During each episode, when in state s, the agent can employ a greedy policy (where it takes the action a maximizing Q(s, a)), and it can also take completely random actions once in a while in order to keep exploring. Over the course of many episodes, we can store the cumulative rewards that were eventually obtained as a “consequence” of some state/action pairs (s, a). If our function Q() is differentiable (for instance, if it’s a neural net), we can then compute the gradient of the error with respect to the parameters of Q(), and use this to perform gradient descent and gradually refine the parameters. When everything goes well, this allows us to learn the function Q().

We omitted a lot of very important facts here. For instance, there is no need in practice to wait all the way to the end of an episode to estimate the target cumulative returns. Instead, we can just wait one or a few steps and then call the function Q() “recursively” to obtain an estimate of the expected cumulative rewards all the way to the end of the episode. This is really neat and a key aspect to making Q-learning scalable, but it can also cause stability challenges, as Q() is itself used to compute the targets used to train it. For more information on this topic, we refer interested readers to the excellent reinforcement learning class by David Silver [6].
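As an illustration of this “recursive” trick, the one-step target for a single experience can be sketched as follows. The function name `td_target` and the discount factor `gamma` are assumptions introduced for this example, and rewards are assumed to be negated distances.

```python
def td_target(reward, next_q_values, next_visited, gamma=1.0):
    """One-step target: r + gamma * max of Q(s', a') over the remaining actions.

    next_q_values[v] is the current estimate of Q(s', v), and next_visited is
    the set of cities already visited in s' (those are not valid actions).
    """
    remaining = [q for v, q in enumerate(next_q_values) if v not in next_visited]
    if not remaining:  # episode finished: nothing left to bootstrap from
        return reward
    return reward + gamma * max(remaining)


# Reward of -0.5 now, and the best remaining action is estimated at -1.0.
target = td_target(-0.5, [-1.0, -2.0, -0.3], next_visited={2})
```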

Now, captured in code, Q-learning for the TSP would look as follows:

First, we build an object named Q_func, which will represent our Q() function neural network (we will implement it in the next section). We also create a memory that we will use to store the state/action/reward experiences. Then, for each episode, we sample a new random and fully-connected graph (ideally, from a distribution close to that of our final problem). As long as our solution does not represent a complete tour, we select a next node, according to an ε-greedy policy: with probability ε, we select a new node at random, and with probability (1-ε), we select the node (action a) that maximizes Q(s, a). We will decrease ε over the course of training (in line 16) until we reach some minimum explore probability, in order to make the algorithm more and more greedy as Q() gets more accurate.
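The ε-greedy selection and the decay of ε described above can be sketched in a few lines. The helper names and the decay rate of 0.999 are illustrative assumptions, not values taken from the notebook.

```python
import random


def select_next_node(q_values, visited, epsilon, rng=random):
    """Pick an unvisited node: random with probability epsilon, else argmax of Q."""
    candidates = [v for v in range(len(q_values)) if v not in visited]
    if rng.random() < epsilon:
        return rng.choice(candidates)  # explore
    return max(candidates, key=lambda v: q_values[v])  # exploit


def decayed_epsilon(episode, min_epsilon=0.1, decay=0.999):
    """Exponentially decay the explore probability, floored at min_epsilon."""
    return max(min_epsilon, decay ** episode)


# With epsilon = 0 the choice is purely greedy among unvisited nodes.
greedy_choice = select_next_node([0.1, 0.9, 0.5], visited={1}, epsilon=0.0)
```

Restricting the candidates to unvisited nodes guarantees that each episode ends after exactly N selections.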

We then observe the actual reward obtained just after taking this action, which is simply the negative of the extra distance travelled to reach the next node, so that maximizing the cumulative reward minimizes the tour length (line 31). We store the experience tuple in our memory and update our state. Here, the experience tuple is defined as:
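A sketch of what such a tuple and the associated one-step reward could look like is given below. The field names and the helper `step_reward` are illustrative, and the sign convention (rewards are negated distances) is an assumption chosen so that larger rewards mean shorter tours.

```python
from collections import namedtuple

# Illustrative field names; the notebook may differ.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


def step_reward(W, partial_solution, next_node):
    """Negated distance from the last visited city to next_node."""
    return -W[partial_solution[-1]][next_node]


W = [
    [0.0, 3.0, 5.0],
    [3.0, 0.0, 4.0],
    [5.0, 4.0, 0.0],
]
solution = [0]
action = 2
experience = Experience(
    state=list(solution),
    action=action,
    reward=step_reward(W, solution, action),
    next_state=solution + [action],
)
```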

These experiences are what we’ll need to actually train our Q() function. They contain a state, an action taken while the agent was in that state, the cumulative reward obtained (after a certain number n of steps), and the next state the agent ends up in after following this action. We need to store this next state, because we’ll use it to call Q() recursively when estimating the cumulative rewards all the way to the end of the episode. In the snippet above, we just show the simpler version of n-step Q-learning where the number of steps n is set to 1.

The actual training of Q() starts at line 46. Here we sample a random batch of experiences from our memory. We retrieve the state/action pairs (the inputs of Q()) from memory, and we compute the targets by calling Q() on the next states. Finally, on line 60, we can use these targets as we would in any supervised ML task, for instance by computing the MSE loss and taking a gradient-descent step to update Q().
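To make this training step concrete, here is a self-contained sketch in PyTorch. It is not the notebook code: a plain linear layer stands in for the real Q() network, the memory is filled with random placeholder experiences, a state is reduced to the 0/1 vector of visited cities, and `gamma` is an assumed discount factor.

```python
import random

import torch
import torch.nn as nn

torch.manual_seed(0)
random.seed(0)

N = 4  # number of cities, i.e. number of possible actions
q_net = nn.Linear(N, N)  # stand-in for the real Q() network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

# Placeholder memory of (state, action, reward, next_state, done) tuples.
memory = []
for _ in range(64):
    s = torch.randint(0, 2, (N,)).float()
    a = random.randrange(N)
    s_next = s.clone()
    s_next[a] = 1.0
    memory.append((s, a, -random.random(), s_next, bool(s_next.sum() == N)))


def train_step(batch_size=16, gamma=1.0):
    batch = random.sample(memory, batch_size)
    states = torch.stack([e[0] for e in batch])
    actions = torch.tensor([e[1] for e in batch])
    rewards = torch.tensor([e[2] for e in batch])
    next_states = torch.stack([e[3] for e in batch])
    done = torch.tensor([e[4] for e in batch])

    with torch.no_grad():  # targets are treated as constants
        next_q = q_net(next_states)
        # Only unvisited cities are valid next actions.
        next_q = next_q.masked_fill(next_states.bool(), float("-inf"))
        best_next = next_q.max(dim=1).values
        # No bootstrapping once the tour is complete.
        best_next = torch.where(done, torch.zeros_like(best_next), best_next)
        targets = rewards + gamma * best_next

    # Q-value predicted for the action that was actually taken.
    predictions = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = loss_fn(predictions, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


loss_value = train_step()
```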

That’s all for the Q-learning part! Now, in the next section we’ll discuss a possible neural net architecture of the function Q().

Learning Graph Embeddings

To recap, we still need a function Q() that can accept a graph, the history of visited nodes and an action as input, and return the expected cumulative reward. As an extra requirement, we would like this function Q() to be callable with graphs of different sizes. This way, we won’t have to train a separate model for each possible graph size. This represents a challenge because it means the graph matrices W taken as input won’t always have the same size.

Here, to address this challenge and find a suitable architecture, we will follow the approach proposed in [7]. In this paper, the authors propose using an algorithm named structure2vec [8]. The idea is that Q() can be taught to represent each node in the graph by fixed-size vectors — that is, it can be taught to find a fixed-size graph embedding.

The hope is that when training this embedding jointly with the reinforcement learning task, Q() will naturally learn to build representations of nodes that convey the information (or features) required for the TSP agent to make good decisions.

In order to build fixed-size embeddings from graphs of potentially variable sizes, structure2vec employs ideas similar to belief propagation, where each node repeatedly sends “messages” along the graph edges to iteratively build its representation. Using equations, [7] describes this embedding procedure as follows:
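Written out in LaTeX, as we understand it from [7], the update reads as follows. Here N(v) denotes the neighbors of node v and w(v, u) the weight of the edge between v and u; since our node state x is a vector rather than a scalar, θ1 is taken to be a matrix.

```latex
\mu_v^{(t+1)} \leftarrow \mathrm{relu}\Big(
    \theta_1 x_v
    + \theta_2 \sum_{u \in \mathcal{N}(v)} \mu_u^{(t)}
    + \theta_3 \sum_{u \in \mathcal{N}(v)} \mathrm{relu}\big(\theta_4 \, w(v, u)\big)
\Big)
```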

where μ denotes the p-dimensional node embeddings (e.g., μ_v for a node v), which are iteratively constructed (for a certain number of iterations T) for each node v by considering:

The state x of node v (first term above). This is a vector that contains some indicator variables capturing for instance whether the node has already been visited or not, as well as its (x, y) coordinates on the map. We will use our tensor representations of the history of visited nodes mentioned earlier here.

The neighbors’ own embeddings (second term above).

The incoming edges’ weights (third term above).

Finally, to make the embedding learnable everything is stitched together by some learnable linear mappings θ, as well as some ReLU non-linearities (although different architectural options would be possible, of course).

Computing these node embeddings is the first step for the function Q(). The second step is to use the embeddings in order to compute estimated cumulative rewards for each node. Again, in equations this is expressed in [7] as:
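Using the notation of this section, and as we understand it from [7], this second step can be written as follows, where V is the set of all nodes and [ · , · ] denotes concatenation:

```latex
Q(s, v; \Theta) = \theta_5^{\top} \, \mathrm{relu}\Big(\Big[
    \theta_6 \sum_{u \in V} \mu_u^{(T)} ,\;
    \theta_7 \, \mu_v^{(T)}
\Big]\Big)
```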

Here, Q(s, v; Θ) is our Q() function, parameterized by Θ (the set of parameters θ), and v stands for the next vertex to visit, which represents the action. The value itself is computed as a non-linear function of the concatenation of both the local embedding of v, and a global aggregation of all the embeddings (the summation is over all the graph’s nodes). Finally, note that the computation is done using the “last” values of μ obtained after the T-th iteration.

In code, we can write it using PyTorch as follows:

In the initialization, we simply declare the affine maps with parameters θ (although, contrary to the equations above, we also introduce learnable biases here). The forward pass then implements the function defined by the two equations above. We implement a version that accepts batches of node states and graph tensors, and for each element in the batch, it outputs a vector of size N that contains the estimated cumulative rewards for selecting each of the nodes as a next action. Note that, formally speaking, we are implementing a vector-valued function Q(s) that returns one value for each possible action, instead of a scalar function Q(s, a), but the two versions are equivalent in terms of computation because we would always call Q(s, a) for each possible action a anyway. We opted for this vector-valued version in order to benefit from vectorization gains.

In the code above, we also propose to extend one of the affine layers (the first one) with an extra non-linearity followed by a linear layer (applied on lines 51–52). This is just to demonstrate how the basic architecture explained above can trivially be extended to make the network deeper.
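For readers who want to see the overall shape of such a network, here is a condensed sketch written from the two equations and the description in this section. It is not the notebook code: the class name, argument names, the node-state size of 5 and the number of iterations T=4 are assumptions, and the optional extra layer mentioned above is left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QNet(nn.Module):
    """Sketch of a structure2vec-style Q network (illustrative names)."""

    def __init__(self, emb_dim=5, node_dim=5, T=4):
        super().__init__()
        self.emb_dim, self.T = emb_dim, T
        # Affine maps (with biases) playing the role of theta_1 ... theta_7.
        self.theta1 = nn.Linear(node_dim, emb_dim)
        self.theta2 = nn.Linear(emb_dim, emb_dim)
        self.theta3 = nn.Linear(emb_dim, emb_dim)
        self.theta4 = nn.Linear(1, emb_dim)
        self.theta5 = nn.Linear(2 * emb_dim, 1)
        self.theta6 = nn.Linear(emb_dim, emb_dim)
        self.theta7 = nn.Linear(emb_dim, emb_dim)

    def forward(self, xv, Ws):
        # xv: (batch, N, node_dim) node states; Ws: (batch, N, N) edge weights.
        batch, n, _ = xv.shape
        conn = (Ws > 0).float()  # 1 where an edge exists, 0 elsewhere
        mu = torch.zeros(batch, n, self.emb_dim, device=xv.device)

        s1 = self.theta1(xv)  # node-state term
        edge_feats = F.relu(self.theta4(Ws.unsqueeze(3))) * conn.unsqueeze(3)
        s3 = self.theta3(edge_feats.sum(dim=2))  # edge-weight term

        for _ in range(self.T):
            s2 = self.theta2(conn.matmul(mu))  # sum of the neighbors' embeddings
            mu = F.relu(s1 + s2 + s3)

        # Global aggregation of all embeddings, repeated for every node.
        global_state = self.theta6(mu.sum(dim=1, keepdim=True)).expand(-1, n, -1)
        local_action = self.theta7(mu)
        out = F.relu(torch.cat([global_state, local_action], dim=2))
        return self.theta5(out).squeeze(dim=2)  # (batch, N): one value per node


# Usage on a batch of two random 6-node graphs.
torch.manual_seed(0)
coords = torch.rand(2, 6, 2)
Ws = torch.cdist(coords, coords)  # pairwise Euclidean distances
xv = torch.cat([torch.zeros(2, 6, 3), coords], dim=2)
q_values = QNet()(xv, Ws)
```

Because every layer acts on the embedding dimension only, the same set of weights can be applied to graphs with any number of nodes N.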

We are now done with the basic building blocks of our RL approach to the TSP! We still need some extra code for doing the actual optimization (e.g., using an MSE loss and an Adam optimizer), for selecting the best (greedy) action out of the neural net, or for generating the random graphs. We do not include all of this code here for simplicity, but you can find all of it in the accompanying notebook.

Preliminary Results

We trained this model on small 10-node random graphs. The graphs are fully connected; the nodes are placed uniformly at random on the [0,1]x[0,1] square, and the edge weights are simply all the pairwise Euclidean distances between the nodes. We trained it with an embedding dimensionality p=5, batches of 16, a decaying learning rate, and a minimum ε value of 0.1. We only did minimal hyper-parameter optimization, so our results should be taken with a grain of salt. Still, quite amazingly, over the course of training it appears that the agent is indeed increasingly taking decisions that make the tours shorter!

Moving average of the tour length over the course of training. At the beginning of training, the explore probability ε is close to 1, and the function Q() is randomly initialized, so the decisions are uniformly random. Towards the end of training, ε is smaller and Q() is more accurate, so the decisions steadily improve.
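The random graphs described above are easy to generate. A minimal sketch in plain Python follows; the helper names `random_graph` and `tour_length` are illustrative and not taken from the notebook.

```python
import math
import random


def random_graph(n=10, seed=None):
    """Place n nodes uniformly at random in the unit square.

    Returns the coordinates and the matrix of pairwise Euclidean distances.
    """
    rng = random.Random(seed)
    coords = [(rng.random(), rng.random()) for _ in range(n)]
    W = [[math.dist(a, b) for b in coords] for a in coords]
    return coords, W


def tour_length(W, tour):
    """Length of the closed tour, including the edge back to the start."""
    return sum(W[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))


coords, W = random_graph(n=10, seed=0)
identity_tour_length = tour_length(W, list(range(10)))
```

Such a `tour_length` helper is also what one would use to compare the tours found by the agent against random tours.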

Here are a few example tours:

Granted, we selected the above examples to show cases where the agent doesn’t completely fail. There are many other examples where our little agent still fails and performs similarly to random strategies. However, on average it does significantly better, and we tend to see patterns like the ones shown in the left column, where the agent seems to favour more convex-looking tours.

While we are of course nowhere close to the performance of TSP solvers (or even fairly naive heuristics), we still find these results quite amazing. At no point did we encode any rule about how to find a good tour — yet, the agent is able to take better decisions by only observing the traveled distances!

What Next?

There are a lot more things to do:

Better capturing sequences: for simplicity, our state tensors essentially capture the set of visited nodes, rather than the ordered sequence of visited nodes. This is an extremely poor representation of partial tours, which probably has no chance to perform truly well, even with a perfect algorithm. A more serious implementation should improve the way sequences are handled.

Better exploiting the graph structure: In this version of the TSP, we didn’t really exploit the graph structure because we only considered fully connected graphs, whereas usually sparser graphs can provide strong inductive biases [9]. We are interested in exploring similar approaches on adjacent problems with sparser graphs.

More tuning, engineering & capacity: We did only minimal tuning of hyper-parameters, and it is likely possible to obtain better results. We have observed, though, that tuning such RL models can be fairly tricky, with high variance in result quality and great sensitivity to the many hyper-parameters, which makes training and model evaluation expensive. Furthermore, our implementation targets ease of understanding rather than speed. For instance, it’s not optimized to run efficiently on GPUs (although it should run on a GPU). Faster code could make the experimentation loop quicker. Finally, we only played with shallow models having little capacity; models with more capacity could give better results.

Conclusions

Reinforcement learning has proved itself in different domains, notably in cases where the environments can be easily simulated. In this post, we explored how this approach can be used on a combinatorial optimization problem, the TSP. Here, the simulated environments are the random graphs of cities. To find solutions, we learn a neural network that can predict which city is best to visit next, given a new graph and a history of previously visited cities. Although there is still much work needed to make our little implementation competitive, the trained network remarkably improves its decisions over the course of training without any a priori knowledge of how to approach the TSP.

At Unit8, we see a lot of potential for marrying optimization and machine learning throughout different industrial use cases, and we are always keen on working on such hard problems. If you think you have a good use case, or if you simply have some questions, don’t hesitate to drop us a line. Also don’t hesitate to get in touch with any suggestions or if you spot bugs; in this case, we also welcome pull requests directly on GitHub.

References

[1]: https://en.wikipedia.org/wiki/Travelling_salesman_problem

[2]: http://www.math.uwaterloo.ca/tsp/concorde/index.html

[3]: http://www.math.uwaterloo.ca/tsp/pubs/

[4]: http://www.math.uwaterloo.ca/tsp/index.html

[5]: Mnih, Volodymyr, et al. “Playing Atari with deep reinforcement learning.” arXiv preprint arXiv:1312.5602. 2013.

[6]: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

[7]: Khalil, Elias, et al. “Learning combinatorial optimization algorithms over graphs.” NeurIPS. 2017.

[8]: Dai, Hanjun, Bo Dai, and Le Song. “Discriminative embeddings of latent variable models for structured data.” ICML. 2016.

[9]: Battaglia, Peter W., et al. “Relational inductive biases, deep learning, and graph networks.” arXiv preprint arXiv:1806.01261. 2018.