
Add new solver that supports L1 penalty to Logistic Regression #70

Open
VolodymyrOrlov opened this issue Jan 20, 2021 · 9 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

@VolodymyrOrlov
Collaborator

We need to implement a new solver that supports the L1 penalty for Logistic Regression. A good candidate is the SAGA solver. An alternative would be a coordinate descent algorithm, which is widely used in LIBLINEAR.

@NikZak

NikZak commented Oct 5, 2022

I took a brief look at this.

SAGA/SAG/SG solvers take advantage of the fact that the loss function (e.g. least-squares or cross-entropy loss) is a sum of simpler convex functions. The main idea of SAGA/SAG/SG is that at each iteration the whole gradient is not recomputed; only one term of the sum is updated. So the computation is significantly less expensive for large n (compared to gradient descent) while achieving a similar rate of convergence.
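
For concreteness, here is a minimal sketch of the plain SAGA update for one randomly drawn term j (the names are illustrative and not part of any existing API; it assumes a constant step size and ignores regularisation):

fn saga_step(
    w: &mut [f64],                // current weights
    grad_memory: &mut [Vec<f64>], // stored gradient of each term f_i
    grad_avg: &mut [f64],         // running average of the stored gradients
    grad_j_new: &[f64],           // freshly computed gradient of term j at w
    j: usize,
    step: f64,
) {
    let n = grad_memory.len() as f64;
    for k in 0..w.len() {
        // w <- w - step * ( f'_j(w) - g_j_old + mean_i(g_i) )
        w[k] -= step * (grad_j_new[k] - grad_memory[j][k] + grad_avg[k]);
        // keep the running average and the memory for term j consistent
        grad_avg[k] += (grad_j_new[k] - grad_memory[j][k]) / n;
        grad_memory[j][k] = grad_j_new[k];
    }
}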

Thus we cannot directly use the general API that is nicely designed in this lib:

pub trait FirstOrderOptimizer<T: RealNumber> {
    fn optimize<'a, X: Matrix<T>, LS: LineSearchMethod<T>>(
        &self,
        f: &F<'_, T, X>,
        df: &'a DF<'_, X>,
        x0: &X,
        ls: &'a LS,
    ) -> OptimizerResult<T, X>;
}

Unfortunately, it is not enough to know how f or df is computed. We also need to know how f and df decompose into a sum, and how each term of the sum can be computed and differentiated.

If we look at the sklearn library, the SAG and SAGA solvers are not implemented in a general way but for exactly three types of loss functions: {'log', 'squared', 'multinomial'}, including L1 and L2 regularisation. I would suggest restricting the implementation to the same three cases. The original SAGA paper probably also implements only these three (as suggested by the experiments), although it reasons about the generalized case.

Thus, choosing the SAGA solver, we would need to change the implementation of LogisticRegression (and probably Ridge regression too): change the minimize method, and perhaps add a few methods to BinaryObjectiveFunction and MultiClassObjectiveFunction to calculate f_i and df_i for the individual terms (illustrated below). Currently they only calculate the whole function f and the whole gradient df.
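
For illustration only: the per-term value and gradient that such an f_i/df_i method would have to expose for the binary log loss, f_i(w) = ln(1 + exp(-y_i * <w, x_i>)) (a standalone sketch using plain slices rather than the lib's Matrix types):

fn log_loss_term(w: &[f64], x_i: &[f64], y_i: f64) -> (f64, Vec<f64>) {
    // margin m = y_i * <w, x_i>
    let margin: f64 = y_i * w.iter().zip(x_i).map(|(wk, xk)| wk * xk).sum::<f64>();
    // f_i(w) = ln(1 + exp(-m))
    let f_i = (-margin).exp().ln_1p();
    // df_i(w) = -y_i * sigmoid(-m) * x_i
    let sig = 1.0 / (1.0 + margin.exp());
    let df_i = x_i.iter().map(|xk| -y_i * sig * xk).collect();
    (f_i, df_i)
}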

These are quite a few changes to the library, so I just wanted to get a green light from you before going ahead. I would probably proceed as briefly described above: change BinaryObjectiveFunction and MultiClassObjectiveFunction, and put some if SAGA ... else ... logic into minimize in LogisticRegression.

To save time, I suggest reading the SAG paper; it is very comprehensive (https://arxiv.org/pdf/1309.2388.pdf).

@NikZak

NikZak commented Oct 5, 2022

Actually, now I am thinking that if we add a df_i method (how to compute the gradient of the i-th term) to BinaryObjectiveFunction and MultiClassObjectiveFunction, then the SAGA solver implementation can be fairly general. It seems like it should be a separate trait (ObjectiveFunctionAsSum or something like this).
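
A rough sketch of what such a trait could look like, purely illustrative (the name is the one proposed above; the signatures just mirror the style of the FirstOrderOptimizer API quoted earlier):

pub trait ObjectiveFunctionAsSum<T: RealNumber, X: Matrix<T>> {
    /// Number of terms n in f(w) = (1/n) * sum_i f_i(w)
    fn n_terms(&self) -> usize;
    /// Value of the i-th term at w
    fn f_i(&self, w: &X, i: usize) -> T;
    /// Gradient of the i-th term at w, written into g
    fn df_i(&self, g: &mut X, w: &X, i: usize);
}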

@Mec-iS
Collaborator

Mec-iS commented Oct 5, 2022

Cool. Which basic building blocks do you suggest starting from? And which existing implementations can we take as reference?

@NikZak

NikZak commented Oct 6, 2022

I would start with the vanilla SAG/SAGA algorithm.

// epoch loop
  // loop through samples
    // extract a random sample
    // update weights
    // get current prediction
    // compute gradient for this sample given prediction
    // L2 regularisation
    // update sum of gradients
    // fit the intercept (if fit_intercept=true)
    // update gradient memory for the sample
    // L1 regularisation
  // at the end of the epoch, check stopping criteria

There is one trick to add to the above description: rescaling the weights for numerical stability (sketched below).
This is the end of vanilla SAGA for dense matrices.
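
A sketch of that rescaling trick, illustrative only: the multiplicative L2 shrinkage is folded into a scalar scale factor so the dense weight vector is not rewritten at every iteration, and the scale is materialised back into the weights before it underflows:

fn l2_shrink(wscale: &mut f64, w: &mut [f64], step: f64, alpha: f64) {
    // w <- (1 - step * alpha) * w, applied lazily through `wscale`
    *wscale *= 1.0 - step * alpha;
    if *wscale < 1e-9 {
        // fold the accumulated scale back into the weights and reset it
        for wk in w.iter_mut() {
            *wk *= *wscale;
        }
        *wscale = 1.0;
    }
}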

However, the SAGA algorithm is one of the relatively few that allow L1 regularisation, and L1 regularisation works well on sparse problems.

The above algorithm can be significantly accelerated for sparse problems (this is explained in the original SAGA paper). The idea is not to update all the weights at every step, but to update each coordinate lazily, just in time, when the current randomly drawn x is non-zero in that coordinate and the pending gradient updates for it have not been applied yet (a rough sketch follows). The accelerated version would work for dense matrices too.
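
A rough sketch of the just-in-time catch-up, simplified: constant step size, no weight rescaling, and the L1 proximal step omitted. It works because grad_avg[k] can only change when a drawn sample is non-zero in coordinate k, i.e. exactly when that coordinate gets caught up:

fn catch_up(
    w: &mut [f64],
    grad_avg: &[f64],        // running average of the stored gradients
    last_seen: &mut [usize], // iteration at which each coordinate was last caught up
    t: usize,                // current iteration counter
    step: f64,
    nz_indices: &[usize],    // non-zero feature indices of the current sample
) {
    for &k in nz_indices {
        let skipped = (t - last_seen[k]) as f64;
        // apply the average-gradient steps this coordinate skipped, all at once
        w[k] -= step * skipped * grad_avg[k];
        last_seen[k] = t;
    }
}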

The reference implementation from sklearn is complete and addresses the sparsity problem.

So I have a question. I see a DenseMatrix object in the lib but no SparseMatrix object. Is there anything special planned for sparse matrices? If not, I would probably copy the implementation from sklearn. Sklearn has a SequentialDataset object which holds, for each matrix row, only the non-zero values and the indices of those non-zero values within the row.

@Mec-iS
Collaborator

Mec-iS commented Oct 6, 2022

Ideally I wanted to stick with the CSR standard as defined in scipy, see #156 (petgraph has an implementation of it)

@NikZak

NikZak commented Oct 6, 2022

Any objections to this crate then - https://docs.rs/sprs/latest/sprs/ ? It is CSR.

The algorithm would then work this way: the optimizer would accept normal Matrices and Vectors from this lib, then under the hood transform them into CSR matrices and vectors before doing any computations.

There could be a branch: if the matrix is specified as a DenseMatrix, skip the CSR transformation and use a different algorithm. But that is probably an unnecessary complication, since sklearn handles every input homogeneously. It can certainly be done as an improvement rather than in the first shot. (A sketch of the conversion is below.)
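
A minimal sketch of the dense -> CSR conversion mentioned above using sprs, going through a triplet matrix (illustrative only, with plain nested Vecs standing in for the lib's DenseMatrix):

use sprs::{CsMat, TriMat};

fn dense_to_csr(dense: &[Vec<f64>]) -> CsMat<f64> {
    let rows = dense.len();
    let cols = dense.first().map_or(0, |r| r.len());
    let mut tri = TriMat::new((rows, cols));
    for (i, row) in dense.iter().enumerate() {
        for (j, &v) in row.iter().enumerate() {
            if v != 0.0 {
                tri.add_triplet(i, j, v); // keep only the non-zero entries
            }
        }
    }
    tri.to_csr()
}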

@Mec-iS
Collaborator

Mec-iS commented Oct 6, 2022

The usual questions to answer about a library to add as a dependency are:

  • is it well established/maintained?
  • does it include unsafe (C/C++) implementations?

sprs looks to be well supported and is pure Rust, cool. It's good for me.

Having a dense->sparse conversion is ok; there may be some memory/performance considerations to think about, but for now it should be fine.

cc: @morenol @montanalow

@NikZak

NikZak commented Oct 9, 2022

Ok, great. Working on it

@NikZak

NikZak commented Oct 14, 2022

Just to give you a heads-up: I am still on it.

So far not much code is done, but I have solidified my understanding of SGD->SAG->SAGA. I tried to write a simplified version here.

I've looked into whether SAGA is obsolete, since we're in 2022 and SAGA dates from 2014, and I found 3 promising developments based on SAGA:

  1. Prox2SAGA (Douglas-Rachford splitting on top of point-SAGA)
  2. SAGA with negative momentum
  3. ProxASAGA
    The last one is the most promising in my opinion, as it is well parallelized: it should on the one hand be fastest on modern hardware, and on the other hand play to Rust's strength of preventing data races. Some discussion here

But this is rather for future development

I found three reference implementations:

  1. the most popular one, in sklearn (because it's sklearn)
    (https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/linear_model/_sag_fast.pyx.tp)
  2. a cleaner implementation in lightning, which is also more feature-complete (e.g. it has a line-search algorithm and an auto-step-size algorithm)
    (https://github.com/scikit-learn-contrib/lightning/blob/master/lightning/impl/sag_fast.pyx)
  3. a clean and well-structured accelerated SAGA and point-SAGA implementation by the author himself
    (https://github.com/adefazio/point-saga/blob/master/optimization/saga.pyx https://github.com/adefazio/point-saga/blob/master/optimization/pointsaga.pyx)

The third implementation is the best structured and cleanest IMO, but it lacks L1 regularisation. The second one is the most complete, but it uses strange notation and the algorithm deviates in some places (e.g. here). So I'll use 1), trying to structure it more like 3), and add the missing parts from 2).

Mec-iS added this to the v0.5 milestone Oct 21, 2022