Lagrange multipliers with visualizations and code

The ultimate optimization weapon, explained end-to-end.

In this story, we’re going to take an aerial tour of optimization with Lagrange multipliers. When do we need them? Whenever we have an optimization problem with constraints. Here are some examples:

A hedge fund wants to decide the proportions of stocks to include in their portfolio such that they get the maximum possible expected return, staying within some risk appetite (the risk might be measured in terms of the variance of the return for example). A school district wants to determine an allocation for various items on their lunch menu for their students. They want to minimize the cost per lunch while making sure the kids get a certain quantity of all required nutrients. A trucking company wants to transport goods from source warehouses to destination cities. Given the costs of transport for each warehouse-city pair, the total supply at each warehouse and total demand at each city, decide how much to ship from each warehouse to each city so that overall cost is minimized and demand met (constraint).

We can see that constrained optimization can solve many practical, real world problems arising in domains from logistics to finance. In the rest of the blog, we will start with vanilla optimization without constraints. Then, we will add equality constraints. Then, we will describe the solution to completely general constrained optimization problem with both equality and inequality constraints (the conditions are called KKT — Karush, Kuhn, Tucker conditions). Finally, we demonstrate the power of these conditions on some toy problems. Many people consider the book by Nocedal and Wright the bible of numerical optimization and we will roughly follow chapter 13, side-stepping rigorous proofs (which can always be read from the text) and focusing more on visual intuition.

Un-constrained optimization

In this scenario, we have some variables in our control and an objective function that depends on them. There is no constraint on the variables and the objective function is to be minimized (if it were a maximization problem, we could simply negate the objective function and it would then become a minimization problem).

At any point, for a one dimensional function, the derivative of the function points in a direction that increases it (at least for small steps). Meaning that if we have a function f(x) and the derivative f’(x) is positive, then increasing x will increase f(x) and decreasing it will decrease f(x). If we were minimizing f(x), we would just take a small step opposite to the sign of f’(x) to decrease f(x). What if we’re at a point where f’(x)=0? Then, we might have reached an optima of f(x) since there’s no where else to go.

If there are multiple variables (say x and y), we can have a derivative with each of them. If we take these two numbers and construct a 2-d vector from them, we get the gradient of the function. Now, moving along the direction of the gradient will increase the function while moving opposite to it will decrease it (for small steps).

This means that as long as the gradient is non-zero, we can’t be at a minima since we can simply take a small step along the direction opposite to the gradient and decrease the function further. This means that a necessary (but not sufficient) condition for a point minimizing the function is that the gradient must be zero at that point.

Let’s take a concrete example so we can visualize what this looks like. Consider the function f(x,y) = x²+y². This is a paraboloid and minimized when x=0 and y=0. The derivatives with respect to x and y become 2x and 2y respectively. So, the gradient becomes the vector ∇f = [2x,2y]. We can see that this is zero only when x=0 and y=0. Otherwise, the gradient points in a direction where f(x,y) will increase. So, the direction opposite the gradient will decrease f(x,y). This is shown in the figure below. The pink curve is the objective function f(x,y) which is minimized at the green point (0,0). The purple arrows are the gradients, which point in the direction where f(x,y) will increase. So to decrease it, we move in the opposite direction until we reach the green point.

Fig 1: Paraboloid with gradients. Made using https://github.com/ryu577/pyray

To summarize, when optimizing a function, f in an unconstrained optimization problem, a necessary (but not sufficient) condition to be at a local optima is:

∇f = 0

It’s like when you’re at the top of a mountain (which would be a maxima). How do you know you’re at the top? No matter which direction you step along, you end up decreasing your altitude. So you’re definitely at an optima in your local neighborhood. Now, there might be another mountain right next to you which is even higher. So, you might not be at the global optima. In fact, if we consider the surface of the Earth as our domain (and height above sea level as our objective function), you’re at a local optima if you’re at the top of any old mountain (or building) but the global optima only if that mountain is Everest. In this article, we’re going to be content with finding a local optima.

Now what if we wanted to stay within the confines of a country? That would mean constraining the space in which we can search for our optima, making this an example of constrained optimization. In a way, un-constrained optimization is just a special case of constrained optimization.

Equality constraints

For unconstrained minimization/maximization problems, we simply look for a point where the gradient is the zero vector. If the gradient is not zero, we just take a small step in the direction opposite to where it is pointing (if we’re minimizing; along it if we’re maximizing) and keep doing this until we do get to a point where it is zero and hence, there is no where else to go (this optimization method is called gradient descent). Note that we don’t have to move exactly along the gradient. As long as we move in a direction that has a positive projection (shadow) along the gradient, we end up increasing the objective function (and decreasing if we have a positive projection along the negative gradient). The figure below illustrates this. The green arrow is the gradient of the blue plane (and hence perpendicular to it) and the red arrow is the negative gradient. Since the light-blue arrows lie on the plane, the equation of the plane will yield 0 if we take a step along them. The yellow arrows have a positive shadow (projection) along the green one. So, moving along them will result in points that yield positive numbers when plugged into the equation of the plane (“increase” it). Similarly, the pink arrows have a positive shadow along the red one (reverse gradient). So, moving along them will result in points that yield negative numbers when plugged into the equation of the plane (“decrease” it).

Fig 2: Vectors on opposite sides of a plane. See text for explanation. Made using https://github.com/ryu577/pyray

For un-constrained minimization, we looked for a point where the gradient was zero. This was because if it wasn’t, we could decrease the objective function by going opposite to the gradient.

The same idea can be extended to the case when we have equality constraints. Like before, we need to find a point where we can’t find any possible direction to move where the objective function decreases. For unconstrained optimization, this simply meant that no such direction exists. When we have a constraint, there is another possibility. What if a direction that decreases the objective function exists, but the constraints forbid us from taking any step along it.

Let’s say you want to maximize the amount of money in your bank account. One way to immediately boost your earnings is to sell a kidney. But you probably have a constraint saying you’re not going to lose a vital organ. So, even though an easy path to increasing your earnings exists, your constraint prevents you from accessing it.

This means that the presence of the equality constraint actually reduces the strictness of the condition on the gradient. While it needed to be zero for a local optima without the equality constraint, it’s now okay for it to be non-zero as long as moving in any direction that has a positive shadow along it causes us to violate the constraint. This can only happen when the plane of the constraint is perpendicular to the gradient (like the plane and green arrow in figure 2).

Let’s go back to our objective function f(x,y)=x²+y². Let’s add an equality constraint, y=1. This is a plane. In figure 3 below, the objective function is in pink and the plane is blue. Since we’re constrained to stay on the plane, we can’t move in any direction that has a positive or negative shadow along the gradient of the plane (blue arrows in the figure below) since that would increase or decrease the equation of the constraint, while we want to keep it constant. The plane intersects the equation of our objective function (pink paraboloid) in a curve which is a parabola. The pink arrows in the figure below are the gradients of the objective function at various points along this parabola. If the pink arrow has a projection along the blue plane, we can just move in the direction opposite to the vector corresponding to that projection. This will keep us on the plane, ensuring we don’t violate the constraint while still reducing the objective function. However, at the green point in figure 3 below, the pink arrow (gradient of objective function) has no projection whatsoever along the blue plane. In other words, the pink arrow is parallel to the blue arrow (which is the gradient of the constraint plane).

Fig 3: Constraint gradient aligns with objective function gradient at optima. Made using https://github.com/ryu577/pyray

To decrease the objective function, we need to move in a direction that has a shadow along the negative gradient. But as soon as we do that, we’ll end up leaving the plane of the constraint. So, the constraint makes it impossible to decrease the objective function further at the green point. This means it must be a local minima. An easy way to check this condition is to require that the pink gradient of the objective function is parallel to the blue gradient of the constraint plane. And if two vectors are parallel, we can write one as a multiple of the other. Let’s call this multiple λ. If the gradient of the objective function is ∇f and that of the constraint is ∇c, the condition above is:

∇f = λ ∇c

The λ above is called the Lagrange multiplier. So, we now have a concrete condition to check for when looking for the local optima of a constrained optimization problem.

Inequality constraints

Inequality constraints mean you have to stay on one side of a boundary defining a constraint function as opposed to on it (which was the case for equality constraints). For example, staying inside the boundary of a fence. If we know how to deal with inequality constraints, we can solve any constrained optimization problem. This is because equality constraints can be converted to inequality constraints. Let’s say we require: c(x) = 0. Another way to express this is: c(x)≥0 and c(x)≤ 0. So, each equality constraint can always be replaced with two inequality constraints.

Just as constrained optimization with equality constraints can be handled with Lagrange multipliers as described in the previous section, so can constrained optimization with inequality constraints. What sets the inequality constraint conditions apart from equality constraints is that the Lagrange multipliers for inequality constraints must be positive. To see why, again consider taking a small step in a direction that has a positive component along the gradient. If we can take a step along this direction (if we are maximizing; opposite to it if we are minimizing); we can’t be at a maxima/minima. For inequality constraints, this translates to the Lagrange multiplier being positive. To see why, let’s go back to the constrained optimization problem we considered earlier (figure 3).

Minimize: f(x,y) = x²+y²

Subject to: c(x,y)=y-1=0

Now, let’s change the equality constraint to inequality. This can be done in two ways with completely different results. We can either require:

c(x,y) = y-1 ≥0. In this case, the constraint allows for anything in front of the blue plane in figure 3. It is easy to see that the green point in figure 3 continues to be a local optima. Also, since the blue arrow representing the constraints gradient and the pink arrow representing the objective function’s gradient point in the same direction, we have:

∇f = λ ∇c

with λ>0.

The other possibility is, c(x,y) = y-1≤0. Now, the feasible region becomes everything behind the blue plane. The constraint gradients will flip. So, figure 3 will end up looking like this:

Fig 4: Flipping the sign of inequality constraint from figure 3.

Note that now,

The green point is no longer the local optima since we’re free to move to (0,0); which is the yellow point in figure 4 above. At the green point, we still have ∇f=λ ∇c. since the blue vector points opposite to the pink vector, we have λ<0.

So, it is clear that for an inequality constraint, the condition ∇f=λ ∇c indicates we’re at a local optima only when λ>0.

Putting all of this together, for a general optimization problem:

Minimize f(x)

Subject to:

c_i(x) =0 for i ∈ Equality

c_i(x)≥0 for i ∈ Inequality

We get the full suite of conditions required to be at a local optima:

Lagrange multiplier condition:

∇f =∑_i λ_i ∇c_i(x) +∑_j λ_j ∇c_j(x); Eq(1)

Where i ∈ Equality and j ∈ Inequality constraints.

c_i(x)=0 for all i; Eq(2)

c_j(x)≥0 for all j; Eq(3)

λ_j ≥ 0; Eq(4)

Also note that for the two inequality constraint problems we considered, when we had y-1≥0, the green point in figure 3 was the solution. At this point, we were on the constraint plane (y-1=0). So, we actually had c(x)=0 and λ>0.

When we considered y-1≤0 on the other hand, the yellow point (x=0,y=0) in figure 4 became the local minima. This point was also the solution to the un-constrained problem. So, we simply had ∇f=0 here. Since the Lagrange condition requires ∇f = λ ∇c, we get λ ∇c = 0. Now, ∇c ≠0 at this point, which means we must have had: λ=0.

This means that if the constraint is active (c(x)=0), we should have λ≥0 while if it is not (c(x)≠ 0) we should have λ=0. So, one of them should be zero in all cases. This leads to the final condition (the complimentarity condition):

λ_j c_j(x) = 0 for all j ∈ Inequality; Eq(5)

Equations (1) through (5) are called the KKT conditions. Note that we haven’t really provided rigorous proofs for them, just constructing them based on the simple example. For a proof, the reader should refer to chapter 13 of the book by Nocedal and Wright.

A lot of people when they see these five equations feel like the problem has become even more complicated. How do these equations actually help us solve constrained optimization problems. The best way to get a feel for this is to go through some concrete examples. In the next section, we take a sample problem to which we know the answer in advance and see how the KKT conditions help us correctly identify all local optima.

Example with code

Special cases of the generalized optimization problem involve a linear objective function and linear constraints. This is called a linearly constrained linear program (LCLP). The objective function and constraints can also be quadratic and an optimization problem like this is called a quadratically constrained quadratic program (QCQP). There are software packages that are capable of solving these kinds of optimization problems for an obscenely large number of constraints (in the millions). For simpler problems with a more manageable number of constraints however, we can leverage algorithms that can solve most (more general) polynomially constrained polynomial programs. Which means that the objective function and constraints can be arbitrary polynomial functions. This is because there is a general framework for solving systems of polynomial equations called “Buchberger’s algorithm” and the KKT conditions described above can be reduced to a system of polynomial equations. I wrote a detailed blog on Buchberger’s algorithm for solving systems of polynomial equations here. There is a python library called “sympy” that uses algorithms like this behind the scenes and solves general systems of polynomial equations. So without further ado, let’s frame our first constrained optimization problem.

Equality constraints

Minimize: x³+y³

Subject to: x²+y²=1

Notice that the constraint (x²+y²=1) implies we’re on the boundary of a circle with unit radius. So, we can say: x=cos(t), y=sin(t). The objective function then becomes: sin³(t)+cos³(t). If we plot this with t, we get the following graph:

Fig 5: Objective function sin³(t)+cos³(t) plotted with t at the boundary of the constraint.

We can see that t=0, π/2 and 5π/4 correspond to local maxima while t=π/4, π and 3π/2 correpond to local maxima. Now that we know the answer in advance, let’s see if the KKT conditions described above can find these as well.

Equation (1) gives (taking derivatives of objective function and constraint):

[3x², 3y²] = λ[2x, 2y]

Equating the two components of the vectors on the two sides leads to the two equations:

3x²-2λx=0

3y²-2λy=0

Equation (2) simply requires that the equality constraint be satisfied:

x²+y²=1

And since there are no inequality constraints, we don’t need equations (3) to (6). Now, we can enter the three equations above into the symbolic equation solver the python library sympy provides.

This leads to the following result (all possible solutions to the system above with values of x, y and λ in that order):

[(-1, 0, -3/2),

(0, -1, -3/2),

(0, 1, 3/2),

(1, 0, 3/2),

(-sqrt(2)/2, -sqrt(2)/2, -3*sqrt(2)/4),

(sqrt(2)/2, sqrt(2)/2, 3*sqrt(2)/4)]

(-1,0) corresponds to t=π; (0,-1) corresponds to t=3π/2; (-sqrt(2)/2,-sqrt(2)/2) corresponds to t=5π/4 and (sqrt(2)/2,sqrt(2)/2) corresponds to t=π/4. So, we can see that all the local maxima and local minima we identified above have been identified by the KKT conditions. Now, we can simply find the maximum and minimum values of the objective function at these candidate points.

Inequality constraints

Now, let’s change our equality constraint from the problem above to an inequality constraint and see how that changes our solution.

Minimize: x³+y³

Subject to: x²+y²≤1

Where the constraint implied we could only be on the boundary of the unit circle in the previous case, we can now be anywhere inside it.

The full heat-map of the objective function within the constraint disk is plotted below (looks like a planet with a star somewhere around the top-right corner). The red arrows are the gradients of the boundary of the constraint while the black ones are the gradients of the objective function.

Fig 6: x³+y³ plotted within the disk x²+y²≤1

While the equality constrained problem was a one dimensional problem, this inequality constrained optimization problem is two dimensional. While there are only two ways to approach a point in one dimension (from left or right); there are an infinite number of ways to approach it in two dimensions. This means we need to beware of saddle points. These are points which qualify to be optima, but are not really optima since they are maxima when you approach them from one direction but minima when you approach them from another. The figure below shows what a saddle point looks like.

Fig 7: A saddle point. Maxima when you approach it from one direction, but minima when you approach it from another.

So, we need to re-evaluate all the points that were local minima or maxima in the case of the equality constraint and make sure none of them become saddle points.

Figure 5 tells us that t=0 (x=1,y=0) is a local maxima when approached along the boundary. And when we approach the point from inside the disk as well (say along the line joining x=0,y=0 to this point), the value of the objective function increases as we approach it. So, t=0 is a local maxima no matter where we approach it from. Using similar arguments (or noting the symmetry in x and y), so is t=π/2.

Similarly we can argue that t=π and t=3π/2 are local minima no matter where we approach them from inside the feasible region.

Looking at t=π/4 however, we note from figure 5 that approaching it along the boundary makes it a local minima. However, approaching it from inside the disk (along the line joining the origin to this point for example) makes it a local maxima. So, it is overall neither a local maxima nor a local minima. Such a point is called a saddle point. Using similar arguments, t=5π/4 is also a saddle point.

Now, let’s see what the KKT conditions say for our problem. Plugging the objective function and constraints into the KKT equations (1) through (5) we get:

Equations 6 (a) through (e).

To leverage polynomial equation solvers, we need to convert these to a system of polynomial equations. The first two conditions (6-(a) and (b))are already equations. The third one, x²+y²≤1 (6-(c)) is an inequality. But we can convert it into an equality by introducing a slack variable, k; x²+y²+k²=1. The last equation, λ≥0 is similarly an inequality, but we can do away with it if we simply replace λ with λ². Now, we demonstrate how to enter these into the symbolic equation solving library python provides.

Code solving the KKT conditions for optimization problem mentioned earlier.

This produces the following result (the various solutions of the system above with variables x,y,λ,k in order):

[(-1, 0, -sqrt(6)/2, 0),

(-1, 0, sqrt(6)/2, 0),

(0, -1, -sqrt(6)/2, 0),

(0, -1, sqrt(6)/2, 0),

(0, 0, 0, -1),

(0, 0, 0, 1),

(0, 1, -sqrt(6)*I/2, 0),

(0, 1, sqrt(6)*I/2, 0),

(1, 0, -sqrt(6)*I/2, 0),

(1, 0, sqrt(6)*I/2, 0),

(-sqrt(2)/2, -sqrt(2)/2, -2**(1/4)*sqrt(3)/2, 0),

(-sqrt(2)/2, -sqrt(2)/2, 2**(1/4)*sqrt(3)/2, 0),

(sqrt(2)/2, sqrt(2)/2, -2**(1/4)*sqrt(3)*I/2, 0),

(sqrt(2)/2, sqrt(2)/2, 2**(1/4)*sqrt(3)*I/2, 0)]

The capital ‘I’ in the solutions above refers to the square root of unity. We want to reject these solutions since we require λ²≥0. This means that the points that satisfy the KKT conditions are: (-1,0); (0,-1); (0,0); (-1/sqrt(2),-1/sqrt(2)). As mentioned earlier, the points (-1,0) (corresponding to t=π) and (0,-1) (corresponding to t=3π/2) are the minima. (0,0) and (-1/sqrt(2),-1/sqrt(2)) are saddle points that are also caught in the net. But note that none of the local maxima are caught. I’ll leave you with a small challenge. Change the code above so that it catches the maxima instead of the minima.