Rust has the potential to be a language for building world-class machine learning tools, but it’s currently missing some important pieces of mathematical infrastructure. BFGS is one of these missing pieces.

In many popular machine learning algorithms, the goal is to find parameters that minimize the errors between the algorithm’s predictions and the training data. This can be formulated as finding an input that minimizes a loss function. Linear regression, logistic regression, neural networks, and a couple different bayesian techniques such as maximum likelihood estimation and maximum a priori estimation can be formulated as minimization problems.

There is no general “best” way to minimize a function; different kinds of functions require different strategies. However, Python’s scipy and R’s optim both prominently feature an algorithm called BFGS. I’ll explain what BFGS stands for, the problem that it solves, and how it solves it.

Introduction

BFGS stands for Broyden-Fletcher-Goldfarb-Shanno, the names of four researchers who each independently published the algorithm in 1970. I have never heard of so many people independently discovering the same thing at once!

First, a couple definitions:

P P P : this is the number of dimensions in our problem. In a machine learning context, this is the number of parameters, not the number of data points.

: this is the number of dimensions in our problem. In a machine learning context, this is the number of parameters, not the number of data points. x x x : a possible value for the parameters

: a possible value for the parameters f ( x ) f(x) f ( x ) , the objective function: We are trying to find a value of x x x that minimizes f ( x ) f(x) f ( x ) . The input is a vector of length P P P and the output is a scalar. In Rust: Fn(Array1<f64>) -> f64

, the objective function: We are trying to find a value of that minimizes . The input is a vector of length and the output is a scalar. In Rust: g ( x ) g(x) g ( x ) , the gradient of f f f : the multidimensional derivative of f f f . The input and output are both vectors of length P P P . In Rust: Fn(Array1<f64>) -> Array1<f64>

, the gradient of : the multidimensional derivative of . The input and output are both vectors of length . In Rust: H ( x ) H(x) H ( x ) , the Hessian of f f f : the multidimensional derivative of g g g . The input is a vector of length P P P and the output is a matrix of size P × P P \times P P × P . In Rust: Fn(Array1<f64>) -> Array2<f64>

BFGS is useful when f f f:

Has no closed-form solution (otherwise, we would just solve it!)

Is convex. Intuitively, it must be roughly bowl-shaped with no tricky nooks or crannies.

Is twice-differentiable. Intuitively, it doesn’t have sharp corners or edges.

May be high-dimensional. State-of-the-art neural networks can now have billions of parameters.

In high dimensions, it is important to make use of g g g. Gradient descent is the simplest way to do this, taking tiny steps in the direction of the gradient.

Newton’s method improves on gradient descent by using a second-degree Taylor series to approximate the objective function with a quadratic bowl. Here is what the Newton update looks like:

x k + 1 = x k − H ( x k ) − 1 g ( x k ) x_{k+1} = x_k - H(x_k)^{-1}g(x_k) xk+1​=xk​−H(xk​)−1g(xk​)

Here’s how we might write this in Rust:

pub fn newton < F , G , H > ( x0 : Array1 < f64 > , f : F , g : G , h : H ) -> Array1 < f64 > where F : Fn ( & Array1 < f64 > ) -> f64 , G : Fn ( & Array1 < f64 > ) -> Array1 < f64 > , H : Fn ( & Array1 < f64 > ) -> Array2 < f64 > , { let mut x = x0 ; let mut f_x = f ( & x ); loop { let f_x_old = f_x ; x .scaled_add ( - 1.0 , h ( & x ) .inv () .dot ( g ( & x ))) f_x = f ( & x ); // If f only improved by a small amount, call it a day if stop ( f_x_old , f_x ) { return x ; } } }

This can still be improved in a couple ways, though. Notably:

Inverting a matrix is expensive, between O ( P 2 ) O(P^2) O ( P 2 ) and O ( P 3 ) O(P^3) O ( P 3 ) time

and time Computing the Hessian can be expensive in some cases

The need to pass in the Hessian could be inconvenient for users

Time for math

Feel free to skim over this section if you’d like to get to the code.

The Hessian

The Hessian is the matrix of second derivatives of f f f, where element H i , j H_{i,j} Hi,j​ is ∂ 2 f ∂ x i ∂ x j {\frac {\partial ^{2}f}{\partial x_{i}\partial x_{j}}} ∂xi​∂xj​∂2f​.

A Hessian is always square. In our case, it’s P × P P \times P P×P.

In our case, the Hessian is symmetric because ∂ ∂ x i ( ∂ f ∂ x j ) = ∂ ∂ x j ( ∂ f ∂ x i ) {\frac {\partial }{\partial x_{i}}}\left({\frac {\partial f}{\partial x_{j}}}\right)={\frac {\partial }{\partial x_{j}}}\left({\frac {\partial f}{\partial x_{i}}}\right) ∂xi​∂​(∂xj​∂f​)=∂xj​∂​(∂xi​∂f​), i.e. the order of differentiation doesn’t matter. This opens up the possibility of storing only half of H H H, although I haven’t tried to do that.

Approximating the Hessian

Since we’re only approximating the objective function anyways, we might as well approximate the Hessian as well. Methods that do this are called quasi-Newton methods. I’ll call the approximate Hessian B B B.

When minimizing, it’s important that B B B is positive definite. This means that our quadratic approximation has a single point that is a global minimum. There are no saddle points, the minimum can’t be a line or plane, and there can’t be any maxima. If B B B stops being positive definitel, we can reset it to its initial state.

The secant method

In order to approximate the Hessian, we can use the secant method, which is like Newton’s method except that instead of computing the derivative it uses finite differences to approximate it. This lets us approximate the H H H based on successive evaluations of g g g. Thus, the user never needs to compute the Hessian. Note that we are not using the secant method to approximate g g g from evaluations of f f f; we are actually calculating g g g.

The Sherman-Morrison formula

Inverting the Hessian for each step of the optimization process would be expensive. It would be much nicer if we could just work directly with the inverse Hessian, avoiding inversions at all. The Sherman-Morrison formula provides a way to apply an update to the inverse of a matrix.

( A + u v T ) − 1 = A − 1 − A − 1 u v T A − 1 1 + v T A − 1 u (A+uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^T A^{-1}}{1 + v^T A^{-1}u} (A+uvT)−1=A−1−1+vTA−1uA−1uvTA−1​

BFGS

BFGS is a quasi-Newton method that uses the secant method to approximate H H H from successive evaluations of g g g along with the Sherman-Morrison formula to efficiently perform the updates without any matrix inversions. The derivation of the update is kind of complicated but you can find it on Wikipedia.

Here is the main loop. I’ve left out some details but the full code is here.

pub fn bfgs < F , G > ( x0 : Array1 < f64 > , f : F , g : G ) -> Array1 < f64 > where F : Fn ( & Array1 < f64 > ) -> f64 , G : Fn ( & Array1 < f64 > ) -> Array1 < f64 > , { let mut x = x0 ; let mut f_x = f ( & x ); let mut g_x = g ( & x ); // Initialize the inverse approximate Hessian to the identity matrix let mut b_inv = new_identity_matrix ( x .len ()); loop { // Find the search direction let search_dir = - 1.0 * b_inv .dot ( & g_x ); // Find a good step size let epsilon = line_search (| epsilon | f ( & ( & search_dir * epsilon + & x ))); // Save the old state let f_x_old = f_x ; let g_x_old = g_x ; // Take a step in the search direction x .scaled_add ( epsilon , & search_dir ); f_x = f ( & x ); g_x = g ( & x ); // Compute deltas between old and new let y = ( & g_x - & g_x_old ) .into_shape (( p , 1 )) .unwrap (); let s = ( epsilon * search_dir ) .into_shape (( p , 1 )) .unwrap (); let sy = s .t () .dot ( & y ) .into_shape (()) .unwrap ()[()]; let ss = s .dot ( & s .t ()); if stop ( f_x_old , f_x ) { return x ; } // Update the Hessian approximation let to_add = ss * ( sy + & y .t () .dot ( & b_inv .dot ( & y ))) / sy .powi ( 2 ); let to_sub = ( b_inv .dot ( & y ) .dot ( & s .t ()) + s .dot ( & y .t () .dot ( & b_inv ))) / sy ; b_inv = b_inv + to_add - to_sub ; } }

I have tested this on a few simple functions, including the Rosenbrock function.

Limitations

BFGS is not trivially parallelizable because it involves matrix multiplication. However, particular parts of the computation might be parallelizable, e.g. using ndarray-parallel.

The memory usage of BFGS can be high if there are a lot of parameters.

Future

The line search that I have implemented could be improved quite a bit. In fact, it’s terrible!

There are also a few algorithms derived from BFGS that provide various benefits:

Limited-memory BFGS (L-BFGS): Instead of using a square matrix to represent the B B B , we can save memory by representing part of it with a low-rank matrix.

, we can save memory by representing part of it with a low-rank matrix. Bounded L-BFGS (L-BFGS-B): This improvement to limited-memory BFGS allows it to support bounded variables. There is already a Rust crate called lbfgsb-sys that includes bindings to the canonical Fortran implementation of L-BFGS-B.

Online L-BFGS: This variant of L-BFGS doesn’t need to look at all data at a time.

Want to help?

If you want to use this from rust, it’s published as the crate bfgs.

Rustaceans: I think that my code is doing some unnecessary copying and allocation, and I’d love help optimizing it.

Mathematicians: any suggestions for functions that I should use to test this code? In particular, I don’t feel confident that everything is as numerically stable as it should be.

Thanks

To my brother Eric for walking me through most of this math. I did not think that I could ever understand BFGS, but now I mostly understand BFGS. Also thanks to Max Livingston for reviewing.