Shouldn't batch norm derivatives be normalized by batch_size? #9

amithadiraju1694 opened this issue Jun 6, 2020 · 0 comments

Hey @cthorey,

I recently went through your batch normalization tutorial, "What does gradient flowing through ...". First off, thank you so much for such an amazing post on batch normalization. I was implementing batch normalization in an FC-DNN and could find only a few resources that give both the code and the derivations the way your blog does. Even though my implementation was ultimately successful, my derivations for the affine transformations were slightly off, and your post helped me track down a few bugs.

I do have one question, though, about the derivatives of beta and gamma here: CS231/assignment2/cs231n/layers.py. I was wondering whether

the dbeta values should be normalized by the training batch size, like so:

dbeta = np.sum(dout, axis=0) / batch_size

and similarly for dgamma:

dgamma = np.sum(va2 * dva3, axis=0) / batch_size
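
To make the two variants concrete, here is a minimal sketch (the function name and signature are mine, not from the repo) that computes both the standard gradients and the batch-size-normalized ones I'm proposing. Following the blog's naming, `va2` is the normalized input `x_hat` and `dva3` is the upstream gradient `dout`:

```python
import numpy as np

def batchnorm_grads_gamma_beta(dout, x_hat, batch_size=None):
    """Gradients of a batch-norm layer w.r.t. gamma and beta.

    dout  : (N, D) upstream gradient (dva3 in the blog's notation)
    x_hat : (N, D) normalized input  (va2 in the blog's notation)
    """
    dbeta = np.sum(dout, axis=0)            # standard form
    dgamma = np.sum(dout * x_hat, axis=0)   # standard form
    if batch_size is not None:
        # variant proposed in this issue: average over the batch
        dbeta = dbeta / batch_size
        dgamma = dgamma / batch_size
    return dgamma, dbeta
```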

In my own implementation I was using the full training set (a very naive implementation), and once I had the derivatives of gamma and beta, I always divided them by the number of rows in the training set. The results I got were virtually identical to the same architecture built in Keras:

[Image: Beta_Dist]

[Image: Gamma_Dist]

I looked at the CS231n notes and several other implementations of batch norm online, and none of them divide the gradients of gamma and beta by batch_size. Could you please share your thoughts on why that should be the case?

I feel they should be divided in order to normalize the gradients. I also tried not dividing the gradients of my beta and gamma, and as expected they exploded and diverged from the optimum values (my distributions for beta and gamma were way off from Keras'). I understand that when I use the entire training set I almost always have to divide by the training-set size, but I feel the same should hold when using mini-batches. Curious to hear your thoughts :)
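
Here is a toy check (all numbers made up, just for illustration) of the scaling I'm describing: with a sum-reduced loss, dbeta = np.sum(dout, axis=0) grows linearly with the batch size, while a mean-reduced loss already carries the 1/N factor inside dout and keeps dbeta bounded:

```python
import numpy as np

# Toy example: suppose the layer output is y = 1 for every example and the
# per-example loss is y**2, so dL_i/dy_i = 2 for each example.
for N in (10, 1000):
    y = np.ones((N, 1))
    dout_sum = 2 * y            # upstream grad if the loss is summed over examples
    dout_mean = dout_sum / N    # upstream grad if the loss is averaged over examples
    print(N, dout_sum.sum(axis=0), dout_mean.sum(axis=0))

# sum-reduced loss:  dbeta = [20.] then [2000.]  -> grows with N ("explodes")
# mean-reduced loss: dbeta = [2.]  then [2.]     -> independent of N
```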

Thanks again for your time! :D
