Enable batch normalization beta #656

Open
Ttl opened this issue May 24, 2018 · 1 comment

Ttl (Contributor) commented May 24, 2018

Currently the training code sets the batch normalization center parameter to False, which disables the learnable beta bias parameters. This forces every batch norm output plane to have zero mean, which limits the internal representation of the network.
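For illustration, a minimal sketch of the kind of change meant here, assuming the training code builds its blocks with TF 1.x-style tf.layers calls (the helper name and exact arguments are hypothetical, not the project's actual code):

  import tensorflow as tf

  def conv_bn_relu(x, training):
      # Hypothetical helper for illustration only.
      y = tf.layers.conv2d(x, filters=64, kernel_size=3, padding="same",
                           use_bias=False, data_format="channels_first")
      y = tf.layers.batch_normalization(
          y,
          axis=1,        # channels_first
          center=True,   # learnable beta; the training code currently passes False
          scale=False,   # gamma stays disabled, as today
          training=training)
      return tf.nn.relu(y)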

Theoretical benefits of enabling beta are that it lets the network control the fraction of output values that are clipped by ReLU, and that it allows the network to learn an internal representation that works better with the zero padding used in convolutions.

With zero mean, about half of the output values are negative and get clipped. A learnable beta lets the optimizer adjust how much nonlinearity is applied by adding a bias to the values.
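A toy illustration of that effect with random zero-mean activations (not project code):

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.standard_normal(1_000_000)      # zero-mean batch-normalized outputs

  for beta in (-1.0, 0.0, 1.0):
      clipped = np.mean(x + beta < 0)     # fraction that ReLU zeroes out
      print(f"beta={beta:+.1f}: {clipped:.1%} clipped")
  # roughly 84%, 50% and 16% respectively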

Convolutional layers use zero padding for values outside the edges. Padding with any other constant value can be emulated by adding a bias before the convolution and another after it. The current architecture can't take advantage of this and is stuck with zero padding because there are no biases.
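A toy check of that equivalence (not project code): convolving over an input padded with a constant c matches shifting the input by -c, zero padding, convolving, and adding c times the kernel sum afterwards.

  import numpy as np
  from scipy.signal import convolve2d

  rng = np.random.default_rng(0)
  x = rng.standard_normal((8, 8))
  w = rng.standard_normal((3, 3))
  c = 0.7                                   # desired padding constant

  # Reference: convolution over the input padded with the constant c.
  ref = convolve2d(np.pad(x, 1, constant_values=c), w, mode="valid")

  # Emulation: a bias before the conv shifts the data so the zero padding
  # effectively "becomes" c, and a bias after the conv restores the offset.
  emulated = convolve2d(np.pad(x - c, 1), w, mode="valid") + c * w.sum()

  assert np.allclose(ref, emulated)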

The good news is that they can be enabled in a completely backwards-compatible way. Only the training code needs to be changed, and the change is already implemented in LZGo (leela-zero/leela-zero@5a2aca1).
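The backwards compatibility comes from the fact that, without a scale parameter, a learned beta can be folded into the batchnorm means that the weights file already stores, so the file format and the engines don't have to change. A toy check of that identity (my sketch of the folding, not necessarily how the LZGo commit exports it):

  import numpy as np

  def bn(x, mean, var, beta=0.0, eps=1e-5):
      # Batch norm without gamma, as in the current networks.
      return (x - mean) / np.sqrt(var + eps) + beta

  x = np.random.default_rng(0).standard_normal(1000)
  mean, var, beta = 0.1, 2.0, -0.4

  folded_mean = mean - beta * np.sqrt(var + 1e-5)
  assert np.allclose(bn(x, mean, var, beta), bn(x, folded_mean, var))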

Getting any benefit from enabling them probably requires retraining the network from scratch. It might be a good idea to enable them before the network size is raised again. In LZGo the beta parameters are consistently negative for almost every layer and have relatively large magnitudes, suggesting that the network does use them and that the current zero values are not optimal.

The batch norm scale parameters are also probably not redundant, due to the residual path, but they are harder to enable in a backwards-compatible way.

killerducky (Collaborator) commented:

I think this comment in transforms.cc and network_cudnn.cc means lc0 still supports this.

  // Biases are not calculated and are typically zero but some networks might
  // still have non-zero biases.
  // Move biases to batchnorm means to make the output match without having
  // to separately add the biases.
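That move is the same kind of folding in the other direction: a bias b added before batch norm is equivalent to subtracting b from the stored mean. A rough sketch of the identity (not lc0 code):

  import numpy as np

  def bn(x, mean, var, eps=1e-5):
      return (x - mean) / np.sqrt(var + eps)

  x = np.random.default_rng(1).standard_normal(1000)
  b, mean, var = 0.3, 0.1, 2.0

  # Adding the bias to the input matches shifting the batchnorm mean by -b.
  assert np.allclose(bn(x + b, mean, var), bn(x, mean - b, var))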
