Currently the training code sets the batch normalization `center` parameter to `False`, which disables the learnable beta bias parameters. This forces every batch norm output plane to have zero mean, which limits the internal representation of the network.
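To make the effect concrete, here is a minimal NumPy sketch (not the actual training code, which uses a batch normalization layer from its framework) of what the `center` parameter controls; the values are made up for illustration:

```python
import numpy as np

def batch_norm(x, beta=0.0, gamma=1.0, eps=1e-5):
    # Per-plane batch normalization: normalize to zero mean / unit
    # variance, then apply the learnable scale (gamma) and bias (beta).
    mu = x.mean()
    var = x.var()
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
plane = rng.normal(loc=3.0, scale=2.0, size=10_000)

# center=False (current training code): beta is fixed at 0,
# so every output plane is forced to have zero mean.
no_center = batch_norm(plane, beta=0.0)

# center=True: the learnable beta shifts the output mean freely.
centered = batch_norm(plane, beta=-0.7)

print(no_center.mean())  # ~0.0
print(centered.mean())   # ~-0.7
```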
Some theoretical benefits of enabling beta: it lets the optimizer control the fraction of output values that are clipped by the ReLU, and it allows the network to learn an internal representation that works better with the zero padding used in convolutions.
With zero mean, about half of the output values are negative and get clipped. A learnable beta lets the optimizer adjust the amount of nonlinearity applied by adding a bias to the values.
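A quick numerical illustration of how beta moves the clipped fraction, assuming roughly standard normal activations (the specific beta values here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
# A batch-normalized plane: zero mean, unit variance.
x = rng.standard_normal(100_000)

def clipped_fraction(values, beta):
    # Fraction of values that a ReLU zeroes out after the bias shift.
    return np.mean(values + beta < 0)

f_zero = clipped_fraction(x, beta=0.0)   # ~0.5  (forced zero mean)
f_pos = clipped_fraction(x, beta=1.0)    # ~0.16 (less clipping, more linear)
f_neg = clipped_fraction(x, beta=-1.0)   # ~0.84 (more clipping, sparser output)
print(f_zero, f_pos, f_neg)
```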
Convolutional layers use zero padding for values outside the edges. Padding with any other constant value can be emulated by adding a bias before and after the convolution. The current architecture cannot take advantage of this and is stuck with zero padding because there are no biases.
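The equivalence between constant padding and a bias before/after the convolution can be checked with a small sketch (1-D case for brevity; the 2-D case is identical by linearity):

```python
import numpy as np

def conv_same(x, w, pad_value=0.0):
    # "Same"-size 1-D convolution (correlation) with a constant border.
    k = len(w)
    border = np.full(k // 2, pad_value)
    padded = np.concatenate([border, x, border])
    return np.array([padded[i:i + k] @ w for i in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, -1.0, 0.25])
c = 2.0  # desired constant padding value

direct = conv_same(x, w, pad_value=c)

# Emulation using only zero padding: subtract c before the convolution
# (bias before) and add c * sum(w) after it (bias after). This works
# because convolution is linear and a constant input field of c
# produces a constant output of c * sum(w).
emulated = conv_same(x - c, w, pad_value=0.0) + c * w.sum()

print(np.allclose(direct, emulated))  # True
```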
The good news is that they can be enabled in a completely backwards compatible way. Only the training code needs to change, and the corresponding code is already implemented in LZGo (leela-zero/leela-zero@5a2aca1).
Getting any benefit from enabling them probably requires retraining the network from scratch, so it might be a good idea to enable them before the network size is raised again. In LZGo the beta parameters are consistently negative for almost every layer and have relatively large magnitudes, suggesting that the network does use them and that the current zero values are not optimal.
The batch norm scale parameters are probably also not redundant, despite the residual path, but they are harder to enable in a backwards compatible way.
I think this comment in transforms.cc and network_cudnn.cc means lc0 still supports this.
// Biases are not calculated and are typically zero but some networks might
// still have non-zero biases.
// Move biases to batchnorm means to make the output match without having
// to separately add the biases.
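That comment describes folding a convolution bias into the stored batch norm mean at load time, so the bias never has to be added separately. The algebra behind it, checked with made-up values:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(1000)

# Hypothetical network weights: a conv bias b followed by
# batch norm statistics (mean, var) and affine parameters.
b, mean, var, gamma, beta, eps = 0.3, 0.1, 1.5, 1.2, -0.4, 1e-5

# Applying the bias explicitly, then batch norm:
separate = gamma * ((x + b) - mean) / np.sqrt(var + eps) + beta

# Folding the bias into the batch norm mean: shift the stored
# mean by -b and drop the separate bias add entirely.
folded = gamma * (x - (mean - b)) / np.sqrt(var + eps) + beta

print(np.allclose(separate, folded))  # True
```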