
Add new solver that supports L1 penalty to Logistic Regression #70

Open
VolodymyrOrlov opened this issue Jan 20, 2021 · 9 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

@VolodymyrOrlov
Collaborator

We need to implement a new solver that supports the L1 penalty for Logistic Regression. A good candidate is the SAGA solver. An alternative would be a coordinate descent algorithm, which is widely used in LIBLINEAR.

@NikZak

NikZak commented Oct 5, 2022

I took a brief look at this.

SAGA/SAG/SG solvers take advantage of the fact that the loss function (e.g. least-squares or cross-entropy loss) is a sum of simpler convex functions. The main idea of SAGA/SAG/SG is that at each iteration the whole gradient is not recomputed; only one term of the sum is updated. So the computation is significantly less expensive for large n (compared to gradient descent) while achieving a similar rate of convergence.
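
For concreteness, here is a minimal sketch of the plain SAGA update for one randomly drawn term j (the names are illustrative and not part of any existing API; it assumes a constant step size and ignores regularisation):

fn saga_step(
    w: &mut [f64],                // current weights
    grad_memory: &mut [Vec<f64>], // stored gradient of each term f_i
    grad_avg: &mut [f64],         // running average of the stored gradients
    grad_j_new: &[f64],           // freshly computed gradient of term j at w
    j: usize,
    step: f64,
) {
    let n = grad_memory.len() as f64;
    for k in 0..w.len() {
        // w <- w - step * ( f'_j(w) - g_j_old + mean_i(g_i) )
        w[k] -= step * (grad_j_new[k] - grad_memory[j][k] + grad_avg[k]);
        // keep the running average and the memory for term j consistent
        grad_avg[k] += (grad_j_new[k] - grad_memory[j][k]) / n;
        grad_memory[j][k] = grad_j_new[k];
    }
}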

Thus we cannot directly use the general API that is nicely designed in this lib:

pub trait FirstOrderOptimizer<T: RealNumber> {
    fn optimize<'a, X: Matrix<T>, LS: LineSearchMethod<T>>(
        &self,
        f: &F<'_, T, X>,
        df: &'a DF<'_, X>,
        x0: &X,
        ls: &'a LS,
    ) -> OptimizerResult<T, X>;
}

Unfortunately, it is not enough to know how f or df is computed. We also need to know how f and df decompose into a sum, and how each term of the sum can be computed and differentiated.

If we look at the sklearn library, the SAG and SAGA solvers are not implemented in a general way but for exactly three types of loss functions: {'log', 'squared', 'multinomial'}, including L1 and L2 regularisation. I would suggest restricting the implementation to the same three cases. The original SAGA paper probably also implements only these three (as suggested by the experiments), although it reasons about the generalized case.

Thus, choosing the SAGA solver, we would need to change the implementation of LogisticRegression (and probably Ridge regression too): change the minimize method, and perhaps add a few methods to BinaryObjectiveFunction and MultiClassObjectiveFunction to calculate f_i and df_i for the individual terms (illustrated below). Currently they only calculate the whole function f and the whole gradient df.
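
For illustration only: the per-term value and gradient that such an f_i/df_i method would have to expose for the binary log loss, f_i(w) = ln(1 + exp(-y_i * <w, x_i>)) (a standalone sketch using plain slices rather than the lib's Matrix types):

fn log_loss_term(w: &[f64], x_i: &[f64], y_i: f64) -> (f64, Vec<f64>) {
    // margin m = y_i * <w, x_i>
    let margin: f64 = y_i * w.iter().zip(x_i).map(|(wk, xk)| wk * xk).sum::<f64>();
    // f_i(w) = ln(1 + exp(-m))
    let f_i = (-margin).exp().ln_1p();
    // df_i(w) = -y_i * sigmoid(-m) * x_i
    let sig = 1.0 / (1.0 + margin.exp());
    let df_i = x_i.iter().map(|xk| -y_i * sig * xk).collect();
    (f_i, df_i)
}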

These are quite a few changes to the library, so I just wanted to get a green light from you before going ahead. I would probably proceed as briefly described above: change BinaryObjectiveFunction and MultiClassObjectiveFunction, and put some if SAGA ... else ... logic into minimize in LogisticRegression.

To save time, I suggest reading the SAG paper; it is very comprehensive (https://arxiv.org/pdf/1309.2388.pdf).

@NikZak

NikZak commented Oct 5, 2022

Actually, now I am thinking that if we add a df_i method (how to compute the gradient of the i-th term) to BinaryObjectiveFunction and MultiClassObjectiveFunction, then the SAGA solver implementation can be fairly general. It seems like it should be a separate trait (ObjectiveFunctionAsSum or something like this).
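
A rough sketch of what such a trait could look like, purely illustrative (the name is the one proposed above; the signatures just mirror the style of the FirstOrderOptimizer API quoted earlier):

pub trait ObjectiveFunctionAsSum<T: RealNumber, X: Matrix<T>> {
    /// Number of terms n in f(w) = (1/n) * sum_i f_i(w)
    fn n_terms(&self) -> usize;
    /// Value of the i-th term at w
    fn f_i(&self, w: &X, i: usize) -> T;
    /// Gradient of the i-th term at w, written into g
    fn df_i(&self, g: &mut X, w: &X, i: usize);
}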

@Mec-iS
Collaborator

Mec-iS commented Oct 5, 2022

Cool. Which basic building blocks do you suggest starting from? And which existing implementations can we take as reference?

@NikZak

NikZak commented Oct 6, 2022

I would start with the vanilla SAG/SAGA algorithm.

// epoch loop
  // loop through samples
    // extract a random sample
    // update weights
    // get current prediction
    // compute gradient for this sample given prediction
    // L2 regularisation
    // update sum of gradients
    // fit the intercept (if fit_intercept=true)
    // update gradient memory for the sample
    // L1 regularisation
  // at the end of the epoch, check stopping criteria

There is one trick to add to the above description: rescaling the weights for numerical stability (sketched below).
This is the end of vanilla SAGA for dense matrices.
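
A sketch of that rescaling trick, illustrative only: the multiplicative L2 shrinkage is folded into a scalar scale factor so the dense weight vector is not rewritten at every iteration, and the scale is materialised back into the weights before it underflows:

fn l2_shrink(wscale: &mut f64, w: &mut [f64], step: f64, alpha: f64) {
    // w <- (1 - step * alpha) * w, applied lazily through `wscale`
    *wscale *= 1.0 - step * alpha;
    if *wscale < 1e-9 {
        // fold the accumulated scale back into the weights and reset it
        for wk in w.iter_mut() {
            *wk *= *wscale;
        }
        *wscale = 1.0;
    }
}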

However, the SAGA algorithm is one of the relatively few that allow L1 regularisation, and L1 regularisation works well on sparse problems.

The above algorithm can be significantly accelerated for sparse problems (this is explained in the original SAGA paper). The idea is not to update all the weights at every step, but to update each coordinate lazily, just in time, when the current randomly drawn x is non-zero in that coordinate and the pending gradient updates for it have not been applied yet (a rough sketch follows). The accelerated version would work for dense matrices too.
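
A rough sketch of the just-in-time catch-up, simplified: constant step size, no weight rescaling, and the L1 proximal step omitted. It works because grad_avg[k] can only change when a drawn sample is non-zero in coordinate k, i.e. exactly when that coordinate gets caught up:

fn catch_up(
    w: &mut [f64],
    grad_avg: &[f64],        // running average of the stored gradients
    last_seen: &mut [usize], // iteration at which each coordinate was last caught up
    t: usize,                // current iteration counter
    step: f64,
    nz_indices: &[usize],    // non-zero feature indices of the current sample
) {
    for &k in nz_indices {
        let skipped = (t - last_seen[k]) as f64;
        // apply the average-gradient steps this coordinate skipped, all at once
        w[k] -= step * skipped * grad_avg[k];
        last_seen[k] = t;
    }
}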

The reference implementation from sklearn is complete and addresses the sparsity problem.

So I have a question. I see a DenseMatrix object in the lib but no SparseMatrix object. Is there anything special planned for sparse matrices? If not, I would probably copy the implementation from sklearn. Sklearn has a SequentialDataset object which holds, for each matrix row, only the non-zero values and the indices of those non-zero values within the row.

@Mec-iS
Collaborator

Mec-iS commented Oct 6, 2022

Ideally I wanted to stick with the CSR standard as defined in scipy, see #156 (petgraph has an implementation of it)

@NikZak

NikZak commented Oct 6, 2022

Any objections to this crate then - https://docs.rs/sprs/latest/sprs/ ? It is CSR.

The algorithm would then work this way: the optimizer would accept normal Matrices and Vectors from this lib, then under the hood transform them into CSR matrices and vectors before doing any computations.

There could be a branch: if the matrix is specified as a DenseMatrix, skip the CSR transformation and use a different algorithm. But that is probably an unnecessary complication, since sklearn handles every input homogeneously. It can certainly be done as an improvement rather than in the first shot. (A sketch of the conversion is below.)
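
A minimal sketch of the dense -> CSR conversion mentioned above using sprs, going through a triplet matrix (illustrative only, with plain nested Vecs standing in for the lib's DenseMatrix):

use sprs::{CsMat, TriMat};

fn dense_to_csr(dense: &[Vec<f64>]) -> CsMat<f64> {
    let rows = dense.len();
    let cols = dense.first().map_or(0, |r| r.len());
    let mut tri = TriMat::new((rows, cols));
    for (i, row) in dense.iter().enumerate() {
        for (j, &v) in row.iter().enumerate() {
            if v != 0.0 {
                tri.add_triplet(i, j, v); // keep only the non-zero entries
            }
        }
    }
    tri.to_csr()
}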

@Mec-iS
Collaborator

Mec-iS commented Oct 6, 2022

The usual questions to answer about a library to add as a dependency are:

  • is it well established/maintained?
  • does it include unsafe (C/C++) implementations?

sprs looks to be well supported and is pure Rust, cool. It's good for me.

Having a dense->sparse conversion is ok; there may be some memory/performance considerations to think about, but for now it should be fine.

cc: @morenol @montanalow

@NikZak

NikZak commented Oct 9, 2022

Ok, great. Working on it

@NikZak

NikZak commented Oct 14, 2022

Just to give you a heads-up: I am still on it.

So far not much code is done, but I have solidified my understanding of SGD->SAG->SAGA. I tried to write a simplified version here.

I've looked into whether SAGA is obsolete, since we're in 2022 and SAGA dates from 2014, and I found 3 promising developments based on SAGA:

  1. Prox2SAGA (Douglas-Rachford splitting on top of point-SAGA)
  2. SAGA with negative momentum
  3. ProxASAGA
    The last one is the most promising in my opinion, as it is well parallelized: it should on the one hand be fastest on modern hardware, and on the other hand play to Rust's strength of preventing data races. Some discussion here

But this is rather for future development

I found three reference implementations:

  1. the most popular one, in sklearn (because it's sklearn)
    (https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/linear_model/_sag_fast.pyx.tp)
  2. a cleaner implementation in lightning, which is also more feature-complete (e.g. it has a line-search algorithm and an auto-step-size algorithm)
    (https://github.com/scikit-learn-contrib/lightning/blob/master/lightning/impl/sag_fast.pyx)
  3. a clean and well-structured accelerated SAGA and point-SAGA implementation by the author himself
    (https://github.com/adefazio/point-saga/blob/master/optimization/saga.pyx https://github.com/adefazio/point-saga/blob/master/optimization/pointsaga.pyx)

The third implementation is the best structured and cleanest IMO, but it lacks L1 regularisation. The second one is the most complete, but it uses strange notation and the algorithm deviates in some places (e.g. here). So I'll use 1), trying to structure it more like 3), and add the missing parts from 2).

Mec-iS added this to the v0.5 milestone Oct 21, 2022