
A bug in the implementation of AdaShift? #2

Open
ZhimingZhou opened this issue Jun 2, 2019 · 0 comments

ZhimingZhou commented Jun 2, 2019

Hi,

Thanks for this nice and efficient implementation of AdaShift. However, there seems to be a bug in the calculation of exp_avg.

It seems the following line should always be executed:
(exp_avg.sub_(first_grad_weight, offset_grad).mul_(beta1).add_(last_grad_weight, grad))

The update rule above assumes that exp_avg always holds the weighted sum of the kept gradients. If exp_avg is not updated until timestep keep_num + 1, it is still zero at that point, so the exp_avg produced at timestep keep_num + 1 is wrong, and the error propagates to all subsequent estimates of exp_avg.

By contrast, updating it at every timestep ensures that exp_avg is always the weighted sum of the currently kept gradients (assuming the queue is initially filled with zeros), as in the sketch below.
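To make the intended invariant concrete, here is a minimal, self-contained sketch (not the code from this repository) of the behaviour described above: the rolling update is executed at every timestep against a zero-initialized queue, and exp_avg stays equal to the weighted sum of the kept gradients throughout, including during the first keep_num steps. The particular weights first_grad_weight = beta1^(keep_num-1)/Z and last_grad_weight = 1/Z are an assumption chosen so that the invariant holds, and the alpha keyword form of sub_/add_ is used instead of the deprecated positional form quoted above.

```python
import torch
from collections import deque

torch.manual_seed(0)
keep_num, beta1 = 3, 0.9

# Weights chosen (an assumption for this sketch) so that the rolling update
# preserves the invariant: exp_avg == weighted sum of the queued gradients.
z = sum(beta1 ** i for i in range(keep_num))            # normalizer
first_grad_weight = beta1 ** (keep_num - 1) / z         # weight of the oldest kept gradient
last_grad_weight = 1.0 / z                              # weight of the incoming gradient

queue = deque(torch.zeros(4) for _ in range(keep_num))  # queue initially filled with zeros
exp_avg = torch.zeros(4)                                # weighted sum of an all-zero queue

for step in range(6):
    grad = torch.randn(4)
    offset_grad = queue.popleft()                       # oldest gradient leaves the queue
    queue.append(grad.clone())                          # newest gradient enters the queue
    # The rolling update, executed unconditionally at every step:
    exp_avg.sub_(offset_grad, alpha=first_grad_weight).mul_(beta1).add_(grad, alpha=last_grad_weight)
    # Invariant check: exp_avg equals the weighted sum recomputed from the queue.
    recomputed = sum(beta1 ** (keep_num - 1 - i) * g for i, g in enumerate(queue)) / z
    assert torch.allclose(exp_avg, recomputed, atol=1e-6)
```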

Best regards,
Zhiming Zhou

@ZhimingZhou ZhimingZhou changed the title A bug in the implementation of AdaShift? Thanks for this nice implementation of AdaShift. Jun 2, 2019
@ZhimingZhou ZhimingZhou changed the title Thanks for this nice implementation of AdaShift. A bug in the implementation of AdaShift? Jun 2, 2019