
A bug in the implementation of AdaShift? #2

Open
ZhimingZhou opened this issue Jun 2, 2019 · 0 comments

ZhimingZhou commented Jun 2, 2019

Hi,

Thanks for this nice and efficient implementation of AdaShift. However, there seems to be a bug in the calculation of exp_avg.

It seems the following line should always be executed:
(exp_avg.sub_(first_grad_weight, offset_grad).mul_(beta1).add_(last_grad_weight, grad))

The update rule above assumes that exp_avg always holds the weighted sum of the kept gradients. If exp_avg is not updated until timestep keep_num + 1, it is still zero at that point, so the exp_avg produced at timestep keep_num + 1 is wrong, and the error propagates to all subsequent estimates of exp_avg.

By contrast, updating it at every timestep ensures that exp_avg is always the weighted sum of the currently kept gradients (assuming the queue is initially filled with zeros), as in the sketch below.
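To make the intended invariant concrete, here is a minimal, self-contained sketch (not the code from this repository) of the behaviour described above: the rolling update is executed at every timestep against a zero-initialized queue, and exp_avg stays equal to the weighted sum of the kept gradients throughout, including during the first keep_num steps. The particular weights first_grad_weight = beta1^(keep_num-1)/Z and last_grad_weight = 1/Z are an assumption chosen so that the invariant holds, and the alpha keyword form of sub_/add_ is used instead of the deprecated positional form quoted above.

```python
import torch
from collections import deque

torch.manual_seed(0)
keep_num, beta1 = 3, 0.9

# Weights chosen (an assumption for this sketch) so that the rolling update
# preserves the invariant: exp_avg == weighted sum of the queued gradients.
z = sum(beta1 ** i for i in range(keep_num))            # normalizer
first_grad_weight = beta1 ** (keep_num - 1) / z         # weight of the oldest kept gradient
last_grad_weight = 1.0 / z                              # weight of the incoming gradient

queue = deque(torch.zeros(4) for _ in range(keep_num))  # queue initially filled with zeros
exp_avg = torch.zeros(4)                                # weighted sum of an all-zero queue

for step in range(6):
    grad = torch.randn(4)
    offset_grad = queue.popleft()                       # oldest gradient leaves the queue
    queue.append(grad.clone())                          # newest gradient enters the queue
    # The rolling update, executed unconditionally at every step:
    exp_avg.sub_(offset_grad, alpha=first_grad_weight).mul_(beta1).add_(grad, alpha=last_grad_weight)
    # Invariant check: exp_avg equals the weighted sum recomputed from the queue.
    recomputed = sum(beta1 ** (keep_num - 1 - i) * g for i, g in enumerate(queue)) / z
    assert torch.allclose(exp_avg, recomputed, atol=1e-6)
```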

Best regards,
Zhiming Zhou

@ZhimingZhou ZhimingZhou changed the title A bug in the implementation of AdaShift? Thanks for this nice implementation of AdaShift. Jun 2, 2019
@ZhimingZhou ZhimingZhou changed the title Thanks for this nice implementation of AdaShift. A bug in the implementation of AdaShift? Jun 2, 2019