This repository has been archived by the owner on Dec 6, 2023. It is now read-only.
Why don't you initialize gradient memory with 0 and use the number of indices seen so far in the SAG algorithm, as suggested in the paper?
In the update of x in Algorithm 1, we normalize the direction d by the total number of data points n. When initializing with y_i = 0, we believe this leads to steps that are too small on early iterations of the algorithm, where we have only seen a fraction of the data points, because many y_i variables contributing to d are set to the uninformative zero-vector. Following Blatt et al. [2007], the more logical normalization is to divide d by m, the number of data points that we have seen at least once.
The SAGA paper suggests a similar procedure:
Our algorithm assumes that initial gradients are known for each f_i at the starting point x0. Instead, a heuristic may be used where during the first pass, data-points are introduced one-by-one, in a non-randomized order, with averages computed in terms of those data-points processed so far. This procedure has been successfully used with SAG [1].
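The heuristic described in both quotes can be sketched as follows. This is a minimal illustration on a least-squares objective, not the sklearn implementation: the gradient memory y_i starts at zero, and the aggregated direction is divided by m, the number of samples seen at least once, rather than by n. The function name and step-size choice are hypothetical.

```python
import numpy as np

def sag_least_squares(A, b, step, n_epochs=200, seed=0):
    """Sketch of SAG for f(x) = 0.5 * sum_i (a_i^T x - b_i)^2, with
    gradient memory initialized to zero and the running sum divided by
    m, the number of samples seen at least once (the Blatt et al.-style
    normalization discussed above), rather than by n."""
    rng = np.random.default_rng(seed)
    n, p = A.shape
    x = np.zeros(p)
    grad_mem = np.zeros((n, p))   # y_i: last stored gradient of f_i
    grad_sum = np.zeros(p)        # d = sum_i y_i, maintained incrementally
    seen = np.zeros(n, dtype=bool)
    m = 0                         # number of distinct samples seen so far
    for _ in range(n_epochs * n):
        i = rng.integers(n)
        if not seen[i]:
            seen[i] = True
            m += 1
        g_i = A[i] * (A[i] @ x - b[i])   # gradient of f_i(x) = 0.5*(a_i^T x - b_i)^2
        grad_sum += g_i - grad_mem[i]    # swap the stored y_i for the fresh gradient
        grad_mem[i] = g_i
        x -= step * grad_sum / m         # divide by m, not by n
    return x
```

With the n-normalization, the first few updates would be shrunk by the many zero y_i entries still in grad_sum; dividing by m instead makes early steps behave like averaged SGD over the samples processed so far.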
NikZak changed the title from "Why not initialize SAG/SAGA memory with 0 and divide by seen updates as in sklearn?" to "Why not initialize SAG/SAGA memory with 0 and divide by seen indices so far as in sklearn?" on Oct 14, 2022.