Shouldn't batch norm derivatives be normalized by batch_size? #9

amithadiraju1694 opened this issue Jun 6, 2020 · 0 comments

Hey @cthorey,

I recently went through your batch normalization tutorial, "What does gradient flowing through ...". First off, thank you so much for such an amazing post on batch normalization. I was implementing batch normalization in an FC-DNN and could find only a few resources that give both the code and the derivations the way your blog does. Even though my implementation was ultimately successful, my derivations for the affine transformations were slightly off, and your post helped me track down a few bugs.

I do have one question, though, about the derivatives of beta and gamma here: CS231/assignment2/cs231n/layers.py. I was wondering whether

the dbeta values should be normalized by the training batch size, like so:

dbeta = np.sum(dout, axis=0) / batch_size

and similarly for dgamma:

dgamma = np.sum(va2 * dva3, axis=0) / batch_size
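
To make the two variants concrete, here is a minimal sketch (the function name and signature are mine, not from the repo) that computes both the standard gradients and the batch-size-normalized ones I'm proposing. Following the blog's naming, `va2` is the normalized input `x_hat` and `dva3` is the upstream gradient `dout`:

```python
import numpy as np

def batchnorm_grads_gamma_beta(dout, x_hat, batch_size=None):
    """Gradients of a batch-norm layer w.r.t. gamma and beta.

    dout  : (N, D) upstream gradient (dva3 in the blog's notation)
    x_hat : (N, D) normalized input  (va2 in the blog's notation)
    """
    dbeta = np.sum(dout, axis=0)            # standard form
    dgamma = np.sum(dout * x_hat, axis=0)   # standard form
    if batch_size is not None:
        # variant proposed in this issue: average over the batch
        dbeta = dbeta / batch_size
        dgamma = dgamma / batch_size
    return dgamma, dbeta
```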

In my own implementation I was using the full training set (a very naive implementation), and once I had the derivatives of gamma and beta, I always divided them by the number of rows in the training set. The results I got were virtually identical to the same architecture built in Keras:

[Image: Beta_Dist]

[Image: Gamma_Dist]

I looked at the CS231n notes and several other implementations of batch norm online, and none of them divide the gradients of gamma and beta by batch_size. Could you please share your thoughts on why that should be the case?

I feel they should be divided in order to normalize the gradients. I also tried not dividing the gradients of my beta and gamma, and as expected they exploded and diverged from the optimum values (my distributions for beta and gamma were way off from Keras'). I understand that when I use the entire training set I almost always have to divide by the training-set size, but I feel the same should hold when using mini-batches. Curious to hear your thoughts :)
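
Here is a toy check (all numbers made up, just for illustration) of the scaling I'm describing: with a sum-reduced loss, dbeta = np.sum(dout, axis=0) grows linearly with the batch size, while a mean-reduced loss already carries the 1/N factor inside dout and keeps dbeta bounded:

```python
import numpy as np

# Toy example: suppose the layer output is y = 1 for every example and the
# per-example loss is y**2, so dL_i/dy_i = 2 for each example.
for N in (10, 1000):
    y = np.ones((N, 1))
    dout_sum = 2 * y            # upstream grad if the loss is summed over examples
    dout_mean = dout_sum / N    # upstream grad if the loss is averaged over examples
    print(N, dout_sum.sum(axis=0), dout_mean.sum(axis=0))

# sum-reduced loss:  dbeta = [20.] then [2000.]  -> grows with N ("explodes")
# mean-reduced loss: dbeta = [2.]  then [2.]     -> independent of N
```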

Thanks again for your time! :D
