A bug in the implementation of AdaShift?
Hi,
Thanks for this nice and efficient implementation of AdaShift, but it seems to have a bug in the calculation of exp_avg.
It seems the following line should always be executed:

```python
exp_avg.sub_(first_grad_weight, offset_grad).mul_(beta1).add_(last_grad_weight, grad)
```
The update rule above assumes that exp_avg always holds the weighted sum of the kept gradients. If exp_avg is not updated until timestep keep_num + 1, it stays zero until then, so the value computed at timestep keep_num + 1 is wrong, and the error propagates to every subsequent estimate of exp_avg.
By contrast, updating it at every timestep ensures that exp_avg is always the weighted sum of the previously kept gradients (assuming the queue is initially filled with zeros).
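To make the invariant concrete, here is a minimal scalar sketch of the sliding-window weighted average that exp_avg is meant to maintain. The names (keep_num, beta1, first_w, last_w) and the exact weight formulas are my assumptions based on the update line quoted above, not the repository's actual code; it only illustrates why the update must run at every timestep.

```python
from collections import deque

def make_state(keep_num, beta1):
    # Assumed weighting: gradient i steps back gets weight beta1**i / z,
    # matching the sub_/mul_/add_ update line quoted above.
    z = sum(beta1 ** i for i in range(keep_num))  # normalizer
    return {
        "queue": deque([0.0] * keep_num),  # window starts filled with zeros
        "exp_avg": 0.0,
        "first_w": beta1 ** (keep_num - 1) / z,  # weight of the oldest gradient
        "last_w": 1.0 / z,                       # weight of the newest gradient
        "beta1": beta1,
    }

def step(state, grad):
    # Must run at EVERY timestep, including the first keep_num steps,
    # so exp_avg stays equal to the weighted sum of the queue contents.
    oldest = state["queue"].popleft()
    state["queue"].append(grad)
    # (exp_avg - first_w * oldest) drops the oldest term, * beta1 shifts
    # every remaining weight up one power, + last_w * grad adds the new term.
    state["exp_avg"] = (state["exp_avg"] - state["first_w"] * oldest) \
        * state["beta1"] + state["last_w"] * grad
    return state["exp_avg"]
```

Running step for every gradient keeps exp_avg identical to recomputing the weighted sum over the queue from scratch; skipping the first keep_num updates would break that equality from timestep keep_num + 1 onward, exactly as described above.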
Best regards,
Zhiming Zhou