This weekend, I implemented logistic regression in Rust. For me, the most interesting parts were learning how to implement a stopping condition and how to automatically set a step size.

Predicting

Logistic regression is a supervised machine learning algorithm that can be used for binary classification. The goal of binary classification is to assign each input point to one of two classes, which we’ll call C_0 and C_1. Given a data point x and parameters w, logistic regression predicts the probability that x is assigned to C_1:

P(C_1 | x) = σ(wᵀx)

The logistic function, as defined below, ensures that the result is a valid probability, i.e. between 0 and 1.

σ(a) := 1 / (1 + exp(−a))

Here’s that in Rust. I use ndarray for linear algebra, and I have written predict in vectorized form.

fn sigmoid(a: &f64) -> f64 {
    1.0 / (1.0 + (-a).exp())
}

pub fn predict(ws: &Array1<f64>, xs: &Array2<f64>) -> Array1<f64> {
    xs.dot(ws).t().map(sigmoid)
}
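As a quick sanity check, here’s a dependency-free sketch of the same idea using plain slices instead of ndarray. This is an illustrative rewrite, not the code above; the names mirror it but the types are simplified:

```rust
// Dependency-free sketch of predict, using slices instead of ndarray.
fn sigmoid(a: f64) -> f64 {
    1.0 / (1.0 + (-a).exp())
}

// For each row x in xs, compute sigma(w . x).
fn predict(ws: &[f64], xs: &[Vec<f64>]) -> Vec<f64> {
    xs.iter()
        .map(|x| sigmoid(x.iter().zip(ws).map(|(xi, wi)| xi * wi).sum()))
        .collect()
}

fn main() {
    // sigma(0) = 0.5, and sigma is symmetric around that point.
    assert!((sigmoid(0.0) - 0.5).abs() < 1e-12);
    assert!((sigmoid(2.0) + sigmoid(-2.0) - 1.0).abs() < 1e-12);

    // With w = [1, -1], the point [3, 1] gives sigma(2) > 0.5,
    // and [1, 3] gives sigma(-2) < 0.5.
    let ws = [1.0, -1.0];
    let xs = vec![vec![3.0, 1.0], vec![1.0, 3.0]];
    let ps = predict(&ws, &xs);
    assert!(ps[0] > 0.5 && ps[1] < 0.5);
    println!("{:?}", ps);
}
```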

Training

The hard part, as usual, is finding a good value for w w w. This requires two things: a loss function, and a strategy for minimizing that loss function.

In logistic regression, we minimize the cross-entropy loss between the predicted probabilities and the true labels of the training data, which is the same as maximizing the likelihood of the training data.
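Written out, with labels y_n ∈ {0, 1} and predictions ŷ_n = σ(wᵀx_n), the cross-entropy loss is the negative log-likelihood:

```latex
E(w) = -\sum_{n=1}^{N} \left[ y_n \ln \hat{y}_n + (1 - y_n) \ln (1 - \hat{y}_n) \right]
```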

Here is my implementation, which returns both the negative log loss (f) and the gradient with respect to the weights (g).

let fg = |w: &Array1<f64>| {
    let yhats = predict(w, &xs);
    // Negative log-likelihood (cross-entropy loss).
    let f = -ys
        .iter()
        .zip(yhats.iter())
        .map(|(y, yhat)| y * yhat.ln() + (1.0 - y) * (1.0 - yhat).ln())
        .sum::<f64>();
    // Gradient with respect to the weights.
    let g = (&yhats - &ys).dot(&xs);
    (f, g)
};
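The expression for g comes from differentiating the cross-entropy loss; the σ′ factors cancel against the denominators of the log terms, leaving:

```latex
\nabla E(w) = \sum_{n=1}^{N} (\hat{y}_n - y_n) \, x_n = X^T (\hat{y} - y)
```

which is exactly the `(&yhats - &ys).dot(&xs)` above.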

Minimizing

To minimize the loss function, I implemented a simple version of gradient descent. I really dislike manually tuning parameters, so I also implemented strategies to automatically choose a step size and stopping point.

To automatically choose a stopping point, I used the “ftol” criterion from L-BFGS-B: the algorithm terminates when the improvement in the objective function from one step to the next falls below a given threshold.
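As a toy illustration of the ftol rule (not the ndarray code from this post), here it is applied to fixed-step gradient descent on f(x) = x²:

```rust
// Same threshold as below, taken from scipy.optimize.
const FTOL: f64 = 2.220446049250313e-9;

// Fixed-step gradient descent on f(x) = x^2, whose gradient is 2x.
// Stops when the per-step improvement in f drops below FTOL.
fn descend(mut x: f64, step: f64) -> f64 {
    let f = |x: f64| x * x;
    let mut f_prev = f(x);
    loop {
        x -= step * 2.0 * x; // gradient step
        let f_cur = f(x);
        if f_prev - f_cur < FTOL {
            return x;
        }
        f_prev = f_cur;
    }
}

fn main() {
    let x = descend(10.0, 0.1);
    // The minimum of x^2 is at 0; the ftol rule stops us very close to it.
    assert!(x.abs() < 1e-3);
    println!("converged to {x}");
}
```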

To automatically choose a step size, I implemented backtracking line search, which was new to me but is actually pretty straightforward. This algorithm starts with a large step size and shrinks it until the actual decrease in the value of f is at least a set fraction of the decrease predicted by the gradient.

Here’s the implementation:

// This value is stolen from scipy.optimize
const FTOL: f64 = 2.220446049250313e-09;
// Wikipedia says Armijo used 0.5 in 1966
const ARMIJO_GOLDSTEIN_CONTROL: f64 = 0.5;

pub enum MinimizeInnerResult {
    ArmijoGoldsteinFailure,
    Success {
        w: Array1<f64>,
        f: f64,
        g: Array1<f64>,
    },
}
use MinimizeInnerResult::*;

fn minimize_inner<FG>(w_init: Array1<f64>, fg: FG, epsilon: f64) -> MinimizeInnerResult
where
    FG: Fn(&Array1<f64>) -> (f64, Array1<f64>),
{
    let mut w = w_init;
    let (mut f_previous, mut g_previous) = fg(&w);
    loop {
        w.scaled_add(-epsilon, &g_previous);
        let (f, g) = fg(&w);
        let expected_decrease = epsilon * norm_l2(&g);
        let actual_decrease = f_previous - f;
        if actual_decrease < expected_decrease * ARMIJO_GOLDSTEIN_CONTROL {
            return ArmijoGoldsteinFailure;
        }
        if actual_decrease < FTOL {
            return Success { w, f, g };
        }
        f_previous = f;
        g_previous = g;
    }
}

pub fn minimize<FG>(w_init: Array1<f64>, fg: FG) -> Array1<f64>
where
    FG: Fn(&Array1<f64>) -> (f64, Array1<f64>),
{
    // Start with a large step size and decrease it until we either succeed or fail
    for i in 0..20 {
        let epsilon = 2.0_f64.powi(-i);
        if let Success { w, .. } = minimize_inner(w_init.clone(), &fg, epsilon) {
            return w;
        }
    }
    panic!("Even a very small value of epsilon didn't work")
}

…and that’s pretty much it. The code for this blog post is here.

Future directions

This implementation is on the simple side, so there are many possible directions for future work, including:

Increasing the dimensions of the training data

Implementing a more advanced optimization algorithm

Adding alternative methods of working with the model, like MAP inference

Appendix: Other logistic regression implementations in Rust

AtheMathmo has actually written a blog post about writing logistic regression in Rust, which focuses on the interface exposed to users. The resulting implementation is presumably here, within rusty-machine. This implementation is in pure Rust, using the rulinalg crate for linear algebra. It supports a couple of nice gradient descent algorithms here.

rustlearn has a logistic regression implementation here, using SGD. It has its own built-in dense and sparse matrix types.