Neural Networks

Different results on different batch sizes? Culprit: Batch Normalization

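We can train our model with or without Batch Norm. Batch Norm normalizes the activations of the layer it follows (if it is used after each layer, the activations are normalized after every layer). Each use performs two sub-operations: it first normalizes the activations with the mean and standard deviation of the current mini-batch, and then scales and shifts the result with two additional learned parameters, γ (scale) and β (shift), which act as a new standard deviation and mean. This extra flexibility lets the layer represent the identity transformation and preserves the network's capacity.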
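As a rough illustration of normalization followed by scaling and shifting, here is a minimal pure-Python sketch for a single feature. The function name `batch_norm` and the `eps` value are assumptions made for this example; it is not PyTorch's implementation. It also shows that setting the scale to the batch standard deviation and the shift to the batch mean recovers the original activations, which is the identity transformation mentioned above.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)  # biased (population) variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in xs]  # normalization
    return [gamma * v + beta for v in x_hat]                 # scale and shift

batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(batch)  # zero mean, roughly unit variance

# Choosing gamma = batch std and beta = batch mean undoes the normalization.
mean = sum(batch) / len(batch)
std = math.sqrt(sum((x - mean) ** 2 for x in batch) / len(batch) + 1e-5)
restored = batch_norm(batch, gamma=std, beta=mean)
```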

 

Although the parameters of the second sub-operation (scaling and shifting) are learned and converge along with the other network parameters (weights) through gradient descent, the normalization parameters of the first sub-operation (the mini-batch mean and variance) are not trained and change with the contents of every mini-batch.
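The dependence on the mini-batch contents can be seen in a small pure-Python sketch. The helper `normalize` and the example values are made up for illustration: the same sample is normalized to different values depending on which other samples share its mini-batch.

```python
import math

def normalize(xs, eps=1e-5):
    """Normalize a mini-batch of one feature with its own mean and variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

sample = 2.0
out_a = normalize([sample, 0.0, 1.0, 3.0])[0]     # sample in one mini-batch
out_b = normalize([sample, 10.0, 20.0, 30.0])[0]  # same sample, different mini-batch

print(out_a, out_b)  # the two values differ although the sample is identical
```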

 

During inference, the output should depend only on the input, the trained network parameters, and the trained shift and scale parameters. The normalization parameters used for inference must therefore be determined during training.

 

To obtain the normalization parameters used during inference, we can average the parameters calculated over many equally sized training mini-batches. Alternatively, we can track moving averages of the normalization parameters during training; these can be used both to check the model's accuracy and to set the final normalization parameters.
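The moving-average variant might look like the following pure-Python sketch. The helper `update_running`, the initial values, and the mini-batches are assumptions for illustration; the momentum of 0.1 mirrors the PyTorch default quoted later in this post, and the population variance is used for simplicity.

```python
def update_running(running, batch_value, momentum=0.1):
    """Exponential moving average: blend the new batch statistic into the estimate."""
    return (1.0 - momentum) * running + momentum * batch_value

running_mean, running_var = 0.0, 1.0  # assumed initial estimates
mini_batches = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [0.0, 2.0, 4.0]]

for xs in mini_batches:
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    running_mean = update_running(running_mean, mean)
    running_var = update_running(running_var, var)

print(running_mean, running_var)  # estimates to reuse at inference time
```

With a small momentum the estimates move slowly, so many mini-batches are needed before they settle near the statistics of the training data.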

According to this paper:

  • Batch size during testing should not matter if you use pre-computed means and variances based on the activations on the training set.
  • Another option is to compute the mean and variance from your test-set activation distribution, but still not batch-wise. This should also help to combat domain transfer issues.
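The first point can be sketched in a few lines of pure Python. The statistics `TRAIN_MEAN` and `TRAIN_VAR` are hypothetical values standing in for numbers pre-computed from training-set activations: because the normalization no longer looks at the batch, a sample gets the same output whether it is evaluated alone or inside a larger batch.

```python
import math

# Hypothetical statistics, pre-computed once from training-set activations.
TRAIN_MEAN, TRAIN_VAR, EPS = 2.0, 4.0, 1e-5

def normalize_inference(xs, mean=TRAIN_MEAN, var=TRAIN_VAR):
    """Normalize each value with fixed statistics; the batch contents are never used."""
    return [(x - mean) / math.sqrt(var + EPS) for x in xs]

sample = 3.0
alone = normalize_inference([sample])[0]                      # batch size 1
in_batch = normalize_inference([sample, 50.0, -7.0, 0.5])[0]  # batch size 4

print(alone == in_batch)  # the batch size does not change the result
```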

 

Example:

 

import torch
import torch.nn as nn

print(torch.__version__)

def get_data(device):
    # Random batch of 32 single-channel 19x19 inputs.
    # (torch.FloatTensor(32, 1, 19, 19) would return uninitialized memory.)
    return torch.randn(32, 1, 19, 19, device=device)

m = nn.Sequential(
    nn.Conv2d(1, 3, kernel_size=3, padding=1),
    nn.BatchNorm2d(3),
    nn.ReLU()
)

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
m.to(device)

m.eval()
print(" 1 ... ", m[1].training)

x = get_data(device)
y1 = m(x)  # eval mode: normalizes with the running estimates

m.train()
print(" 2 ... ", m[1].training)
y2 = m(x)  # train mode: normalizes with the statistics of this batch

err = (y1 - y2).norm().item()
if err > 1e-5:
    print("3 ... difference in results on training and eval mode :", err)

## Now the solution :
m.eval()
print("4 ... ", m[1].training)

# Turn off running statistics so that eval mode does not use the
# pre-calculated mean and variance and normalizes with batch statistics instead.
m[1].track_running_stats = False
# Recent PyTorch releases fall back to batch statistics in eval mode only
# when the running buffers are None, so clear them as well.
m[1].running_mean = None
m[1].running_var = None
y1 = m(x)

err = (y1 - y2).norm().item()
if err > 1e-5:
    print("5 ... difference in results on training and eval mode :", err)
else:
    print("eval and train results SAME")

# Results recorded on PyTorch 0.4.1 (the error value depends on the input data):
# 0.4.1
#  1 ...  False
#  2 ...  True
# 3 ... difference in results on training and eval mode : tensor(53.5466)
# 4 ...  False
# eval and train results SAME

Solution: 

From PyTorch's documentation:

The mean and standard-deviation are calculated per-dimension over the mini-batches and γ and β are learnable parameter vectors of size C (where C is the input size).

By default, during training this layer keeps running estimates of its computed mean and variance, which are then used for normalization during evaluation. The running estimates are kept with a default momentum of 0.1.

If track_running_stats is set to False, this layer then does not keep running estimates, and batch statistics are instead used during evaluation time as well.

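When using Batch Norm this way, the batch size should be at least 16, or better 32, and track_running_stats should be set to False for the batchnorm layer, either by passing track_running_stats=False to its constructor or by setting it on the existing layer (access the layer through its variable name in the model, or through its index, m[1] in this case).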

More Culprits – Softmax & Sigmoid:

When using binary cross-entropy loss, adding an extra softmax (or even a sigmoid) at the end of the network can also make the model's performance degrade drastically when the batch size changes. Both functions turn their input into probabilities, but for a single-class classifier the network emits a single value per sample, so the vector handed to the softmax or sigmoid is in fact the batch's output, and neither should be applied to it.
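For the softmax case, a small pure-Python sketch makes the batch dependence visible. The `softmax` helper and the scores are written for illustration, and the sketch assumes that the per-sample scalar outputs of a batch reach the softmax as one vector, so the normalization runs across samples. The probability assigned to the same score then shrinks as more samples join the batch.

```python
import math

def softmax(values):
    """Softmax over a list of scores; the outputs always sum to 1."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]

logit = 2.0  # the network's single output for one sample
p_small = softmax([logit, 0.0])[0]            # same sample in a batch of 2
p_large = softmax([logit, 0.0, 0.0, 0.0])[0]  # same sample in a batch of 4

print(p_small, p_large)  # the second value is smaller than the first
```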
