In continuous convex optimization, duality is often the theoretical foundation for computing the sensibility of the optimal value of a problem to one of its parameters. In the non-linear domain, it is fairly standard to assume one can compute at any point of the domain the function $f(x)$ and gradient $

abla f(x)$.

What about discrete optimization?

The first thought would be that differentiating the resolution of a discrete problem does not make sense, the information it yields since infinitesimal variations in the domain of the variables do not make sense.

However, three cases come to mind for which asking for gradients makes perfect sense:

In mixed-integer linear problems, some variables take continuous values. All linear expressions are differentiable, and every constraint coefficient, right-hand-side and objective coefficient can have an attached partial derivative. Even in pure-integer problems, the objective value will be a continuous function of the coefficients, possibly locally smooth, for which one can get the partial derivative associated with each weight. We might be interested in computing the derivative of some expression of the variables with respect to some parameters, without this expression being the objective.

For these points, some duality-based techniques and reformulations can be used, sometimes very expensive when the input size grows. One common approach is to first solve the problem, then fixing the integer variables and re-solving the continuous part of the problem to compute the dual values associated with each constraint, and the reduced cost coefficients. This leads to solving a NP-hard problem, followed by a second solution from scratch of a linear optimization problem, still, it somehow works.

More than just solving the model and computing results, one major use case is embarking the result of an optimization problem into another more complete program. The tricks developed above cannot be integrated with an automated way of computing derivatives.

Automatic Differentiation

Automatic Differentiation is far from new, but has known a gain in attention in the last decade with its used in ML, increasing the usability of the available libraries. It consists in getting an augmented information out of a function.

If a function has a type signature f: a -> b , the goal is, without modifying the function, to compute a derivative, which is also a function, which to every point in the domain, yields a linear map from domain to co-domain df: a -> (a -o b) , where a -o b denotes a linear map, regardless of underlying representation (matrix, function, …). See the talk and paper for a type-based formalism of AD if you are ok with programming language formalism.

Automatic differentiation on a pure-Julia solver

ConstraintSolver.jl is a recent project by Wikunia. As the name indicates, it is a constraint programming solver, a more Computer-Science-flavoured approach to integer optimization. As a Julia solver, it can leverage both multiple dispatch and the type system to benefit from some features for free. One example of such feature is automatic differentiation: if your function is generic enough (not relying on a specific implementation of number types, such as Float64 ), gradients with respect to some parameters can be computed by calling the function just once (forward-mode automatic differentiation).

Example problem: weighted independent set

Let us consider a classical problem in combinatorial optimization, given an undirected graph $G = (V, E)$, finding a subset of the vertices, such that no two vertices in the subset are connected by an edge, and that the total weight of the chosen vertices is maximized.

Optimization model of the weighted independent set

Formulated as an optimization problem, it looks as follows:

$$\begin{align} (\mathcal{P}): \max_{x} & \sum_{i \in V} w_i x_i \\\\

\text{s.t.} \\\\

& x_i + x_j \leq 1 \,\, \forall (i,j) \in E \\\\

& x \in \mathbb{B}^{|V|} \end{align} $$

Translated to English, this would be maximizing the weighted sum of picked vertices, which are decisions living in the $|V|$-th dimensional binary space, such that for each edge, no two vertices can be chosen. The differentiable function here is the objective value of such optimization problem, and the parameters we differentiate with respect to are the weights attached to each vertex $w_i$. We will denote it $f(w) = \max_x (\mathcal{P}_w)$.

If a vertex $i$ is not chosen in a solution, there are two cases:

the vertex has the same weight as at least one other, say $j$, such that swapping $i$ and $j$ in the selected subset does not change the optimal value. of $\mathcal{P}$. In that case, there is a kink in the function, a discontinuity of the derivative, which may not be computed correctly by automatic differentiation. This is related to the phenomenon of degeneracy in the simplex algorithm, multiple variables could be chosen equivalently to enter the base.

there is no other vertex with the same weight, such that swapping the two maintains the same objective value. In that case, the derivative is $0$, small enough variations of the weight does not change the solution nor the objective.

If a vertex $i$ is chosen in a solution, then $x_i = 1$, and the corresponding partial derivative of the weight is $\frac{\partial f(w)}{\partial w_i} = 1$.

A Julia implementation

We will import a few packages, mostly MathOptInterface.jl (MOI), the foundation for constrained optimization, the solver itself, the Test standard lib, and ForwardDiff.jl for automatic differentiation.

using Test import ConstraintSolver const CS = ConstraintSolver import MathOptInterface const MOI = MathOptInterface import ForwardDiff

Let us first write an implementation for the max-weight independent set problem. We will use a 4-vertex graph, looking as such:

The optimal answer here is to pick vertices 1 and 4 (in orange).

@testset "Max independent set MOI" begin matrix = [ 0 1 1 0 1 0 1 0 1 1 0 1 0 0 1 0 ] model = CS . Optimizer() x = [MOI . add_constrained_variable(model, MOI . ZeroOne()) for _ in 1 : 4 ] for i in 1 : 4 , j in 1 : 4 if matrix[i,j] == 1 && i < j (z, _) = MOI . add_constrained_variable(model, MOI . GreaterThan( 0.0 )) MOI . add_constraint(model, z, MOI . Integer ()) MOI . add_constraint(model, z, MOI . LessThan( 1.0 )) f = MOI . ScalarAffineFunction( [ MOI . ScalarAffineTerm( 1.0 , x[i][ 1 ]), MOI . ScalarAffineTerm( 1.0 , x[j][ 1 ]), MOI . ScalarAffineTerm( 1.0 , z), ], 0.0 ) MOI . add_constraint(model, f, MOI . EqualTo( 1.0 )) end end weights = [ 0.2 , 0.1 , 0.2 , 0.1 ] terms = [MOI . ScalarAffineTerm(weights[i], x[i][ 1 ]) for i in eachindex(x)] objective = MOI . ScalarAffineFunction(terms, 0.0 ) MOI . set(model, MOI . ObjectiveFunction{typeof(objective)}(), objective) MOI . set(model, MOI . ObjectiveSense(), MOI . MAX_SENSE) MOI . optimize!(model) # add some tests end

Why the additional code with (z, _) = MOI.add_constrained_variable(model, MOI.GreaterThan(0.0)) ? ConstraintSolver.jl does not yet support constraints of the type a x + b y <= c , but linear equality constraints are fine, so we can derive equivalent formulations by adding a slack variable z .

For this problem, the tests could be on both the solution and objective value, as follows:

@test MOI . get(model, MOI . VariablePrimal(), x[ 4 ][ 1 ]) == 1 @test MOI . get(model, MOI . VariablePrimal(), x[ 1 ][ 1 ]) == 1 @test MOI . get(model, MOI . ObjectiveValue()) ≈ 0.3

An equivalent JuMP version would look look this:

matrix = [ 0 1 1 0 1 0 1 0 1 1 0 1 0 0 1 0 ] m = Model(with_optimizer(CS . Optimizer)) x = @variable (m, x[ 1 : 4 ], Bin) for i in 1 : 4 , j in i + 1 : 4 if matrix[i,j] == 1 zcomp = @variable (m) JuMP . set_binary(zcomp) @constraint (m, x[i] + x[j] + zcomp == 1 ) end end w = [ 0.2 , 0.1 , 0.2 , 0.1 ] @objective (m, Max, dot(w, x)) optimize!(m)

Why are we not using JuMP, which is much more concise and closer to the mathematical formulation?

JuMP uses Float64 for all value types, which means we do not get the benefit of generic types, while MathOptInterface types are parameterized by the numeric type used. To be fair, maintaining type genericity on a project as large as JuMP is hard without making performance compromises. JuMP is not built of functions, but of a model object which contains a mutable state of the problem being constructed, and building an Algebraic Modelling Language without this incremental build of the model has not proved successful till now. One day, we may get a powerful declarative DSL for mathematical optimization, but it has not come yet.

Back to our problem, we now have a way to compute the optimal value and solution. Let us implement our function $f(w)$:

function weighted_stable_set(w) matrix = [ 0 1 1 0 1 0 1 0 1 1 0 1 0 0 1 0 ] model = CS . Optimizer(solution_type = Real ) x = [MOI . add_constrained_variable(model, MOI . ZeroOne()) for _ in 1 : 4 ] for i in 1 : 4 , j in 1 : 4 if matrix[i,j] == 1 && i < j (z, _) = MOI . add_constrained_variable(model, MOI . GreaterThan( 0.0 )) MOI . add_constraint(model, z, MOI . Integer ()) MOI . add_constraint(model, z, MOI . LessThan( 1.0 )) f = MOI . ScalarAffineFunction( [ MOI . ScalarAffineTerm( 1.0 , x[i][ 1 ]), MOI . ScalarAffineTerm( 1.0 , x[j][ 1 ]), MOI . ScalarAffineTerm( 1.0 , z), ], 0.0 ) MOI . add_constraint(model, f, MOI . EqualTo( 1.0 )) end end terms = [MOI . ScalarAffineTerm(w[i], x[i][ 1 ]) for i in eachindex(x)] objective = MOI . ScalarAffineFunction(terms, zero(eltype(w))) MOI . set(model, MOI . ObjectiveFunction{typeof(objective)}(), objective) MOI . set(model, MOI . ObjectiveSense(), MOI . MAX_SENSE) MOI . optimize!(model) return MOI . get(model, MOI . ObjectiveValue()) end

We can now compute the gradient in one function call with ForwardDiff:

@testset "Differentiating stable set" begin weights = [ 0.2 , 0.1 , 0.2 , 0.1 ] ∇w = ForwardDiff . gradient(weighted_stable_set, weights) @test ∇w[ 1 ] ≈ 1 @test ∇w[ 4 ] ≈ 1 @test ∇w[ 2 ] ≈ ∇w[ 3 ] ≈ 0 end

To understand how this derivative computation can work with just few function calls (proportional to the size of the input), one must dig a bit deeper in Dual Numbers. I will shamelessly refer to my slides at the Lambda Lille meetup for an example implementation in Haskell.

Why not reverse-mode?

I mentioned that the cost of computing the value & derivatives is proportional to the size of the input, which can increase rapidly for real-world problems. This is specific to so-called forward mode automatic differentiation. We will not go over the inner details of forward versus reverse. As a rule of thumb, forward-mode has less overhead, and is better when the dimension of the output far exceeds the dimension of the input, while reverse-mode is better when the dimension of the input exceeds the one of the output.

Giving reverse with Zygote a shot

Getting back to our question, the answer is rather down-to-earth, the reverse-mode I tried simply did not work there. Reverse-mode requires tracing the normal function call, building a “tape”, this means that it needs a representation of the function (as a graph or other). I gave Zygote.jl a try, which can be done by replacing ForwardDiff.gradient(f,x) with Zygote.gradient(f, x) in the snippet above. Building a representation of the function means Zygote must have a representation of all operations performed. For the moment, this is still restricted to a subset of the Julia language (which is far more complex than commonly encountered mathematical functions built as a single expression). This subset still excludes throwing and handling exceptions, which is quite present in both ConstraintSolver.jl and MathOptInterface.

I have not tried the other reverse tools for the sake of conciseness (and time), so feel free to check out Nabla.jl, ReverseDiff.jl and Tracker.jl.

How could this be improved?

A first solution could be to move the idiom of Julia from throw/try/catch to handling errors as values, using something like the Result/Either type in Scala / Haskell / Rust and corresponding libraries.

Another alternative, currently happening is to keep pushing Zygote to support more features from Julia, going in the direction of supporting differentiation of any program, as dynamic as it gets.

One last option for the particular problem of exception handling would be to be able to opt-out of input validation, with some @validate expr , with expr potentially throwing or handling an error, and a @nocheck or @nothrows macro in front of the function call, considering the function will remain on the happy path and not guaranteeing validity or error messages otherwise. This works exactly like the @boundscheck , @inbounds pair for index validation.

Conclusion, speculation, prospect

This post is already too long so we’ll stop there. The biggest highlights here are that:

In discrete problems, we also have some continuous parts.

Julia’s type system allows AD to work almost out of the box in most cases.

With JuMP and MOI, solving optimization problems is just another algorithmic building block in your Julia program, spitting out results, and derivatives if you make them.

I believe that’s why plugging in solvers developed in C/C++ is fine, but not always what we want. I would be ready to take a performance hit on the computation time of my algorithms to have some hackable, type-generic MILP solver in pure Julia.

Special mentions

Thanks a lot to Wikunia, first for developing ConstraintSolver.jl, without which none of this would have been possible, and for the open discussion on the multiple issues I posted. Don’t hesitate to check out his blog, where the whole journey from 0 to a constraint solver is documented.