Design a neural network, derive the gradients of the loss function with respect to its parameters, and update the weight and bias parameters using these gradients, without the help of built-in libraries.
In this assignment, I design a neural network, derive the gradients of the loss function with respect to its parameters, and update the weight and bias parameters using these gradients. The network is applied to the classification task on the MNIST data set. Once the required details are derived, I use them to implement and train the network in code.
The MNIST data set is used in this project. It consists of 60000 training images and 10000 test images, each of dimension 28x28.
Fig: Above are some of the sample images from the MNIST data set.
For this task, a 2-layer MLP is used. The input to the MLP is the flattened 28x28 image, i.e. a 1x784 vector, and the output is a 1x10 vector representing the 10 classes (the 10 digits). The first layer has 1000 neurons and the output layer has 10 neurons. Below is the design of the neural network and the flow of values.
Fig: The figure above shows the operations between the weight and bias vectors and the given inputs in the forward propagation of the network.
Input
The input X to the network is the flattened image, a matrix of dimension 1x784 containing the value of each pixel in the image.
First Layer
The first layer of the neural network has 1000 neurons. The weight parameter matrix Uw has dimension 784x1000, since the input dimension is 1x784. Each neuron has an associated bias value, so for 1000 neurons there is a bias vector Ubias of dimension 1x1000. The output vector of the first layer, Y = (X * Uw) + Ubias, has dimension 1x1000.
ReLU Activation Layer
After the first layer, a ReLU (Rectified Linear Unit) layer is added to introduce nonlinearity. The ReLU function is defined as ReLU(x) = (x)+ = max(0, x). If the input to the ReLU is less than zero, the output is zero; for positive values, the output is the same as the input.
Fig: The above figure demonstrates how ReLU behaves for a given input. The output of the ReLU layer has the same dimension, 1x1000.
Second Layer
The second layer has 10 neurons, equal to the number of classes to classify, i.e. the 10 digits. The weight parameter matrix Vw has dimension 1000x10, since the input dimension is 1x1000. Each neuron has an associated bias value, so for 10 neurons there is a bias vector Vbias of dimension 1x10. The output vector of the second layer, Z = (YR * Vw) + Vbias, where YR is the output of the ReLU layer, has dimension 1x10; Z is the final predicted value of the network.
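To make the dimensions concrete, the following is a small NumPy sketch of this forward flow (the variable names are illustrative and the weights here are random placeholders, not the initialization used later in the report):

import numpy as np

X = np.random.rand(1, 784)                # flattened 28x28 input image
Uw = np.random.randn(784, 1000) * 0.01    # first-layer weights (placeholder values)
Ubias = np.zeros((1, 1000))               # first-layer bias
Vw = np.random.randn(1000, 10) * 0.01     # second-layer weights (placeholder values)
Vbias = np.zeros((1, 10))                 # second-layer bias

Y = X @ Uw + Ubias                        # first-layer output, shape (1, 1000)
YR = np.maximum(0, Y)                     # ReLU activation, shape (1, 1000)
Z = YR @ Vw + Vbias                       # second-layer output, shape (1, 10)
print(Y.shape, YR.shape, Z.shape)         # (1, 1000) (1, 1000) (1, 10)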
Cross Entropy Layer
The task is the classification of the 10 digits, and the loss function selected for this task is cross-entropy. Let T be the target vector and Z the prediction; then the loss function is

L = -Σi ti * log(P(Zi)),
where ti is the corresponding element in the target vector T and P(Zi) is the probability of the corresponding element in the predicted vector Z. The probabilities P(Zi) are obtained by applying the softmax function to the vector Z:

P(Zi) = exp(Zi) / Σj exp(Zj).
So, the probability of one predicted value is equal to the exponential of that value divided by the sum of the exponentials of all the predicted values. Therefore, if a predicted value is large, its probability will be high, and vice versa. From the loss function, we can observe that the loss is large when the probability P(Zi) of the target class is small, and vice versa. The lowest possible loss is zero, when P(Zi) = 1 for the target class, and the loss grows without bound as P(Zi) for the target class approaches 0.
Fig: The above figure shows the loss value as a function of the predicted probability. The loss is very high when the predicted probability is low, and vice versa.
If the network predicts a low value for the target class, the loss will be high and the network will be penalized; if it predicts a high value for the positive class in the target vector, the loss will be small.
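The following short sketch illustrates the softmax probabilities and the cross-entropy loss for a single sample (the values and the function name are illustrative, not the assignment's code):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))        # subtracting the max keeps the exponentials stable
    return e / np.sum(e)

Z = np.array([0.1, 0.2, -0.5, 0.0, 2.0, 0.3, -1.0, 0.4, 0.0, 0.1])   # predicted values
T = np.zeros(10); T[4] = 1.0                                          # target: digit 4

P = softmax(Z)                       # probabilities, sum to 1
loss = -np.sum(T * np.log(P))        # equals -log(P[4]); it shrinks as P[4] approaches 1
print(P.round(3), loss)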
Target Vector
The target vector is the ground truth created from the provided label; it is the one-hot encoded version of the image label. For example, if the label is 4, then the target vector T will be [0,0,0,0,1,0,0,0,0,0]: the index associated with digit 4 is one and the rest are zeros.
In this section, I derive the gradient of the loss function with respect to each weight and bias parameter; these gradients will be used later to update the same weights and biases during backpropagation.
Gradient of the Loss function L with respect to the output vector Z
The equation for the cross-entropy loss function is

L = -Σi ti * log(P(Zi))

where P(Zi) is the softmax of the predicted vector Z and ti is the corresponding element of the target vector T. To obtain dL/dZj, we first need the derivative of the softmax, dP(Zi)/dZj. Here we have 2 cases, i = j and i ≠ j.

Case i = j:

dP(Zi)/dZj = P(Zi) * (1 - P(Zj))

Case i ≠ j:

dP(Zi)/dZj = -P(Zi) * P(Zj)

Combining both by the chain rule we get

dL/dZj = -Σi ti * (1 / P(Zi)) * dP(Zi)/dZj

Splitting the 2 cases i = j and i ≠ j we get

dL/dZj = -tj * (1 - P(Zj)) + Σ(i≠j) ti * P(Zj) = -tj + P(Zj) * Σi ti

Since the target vector T is one-hot encoded, Σi ti = 1. Substituting this we get

dL/dZj = P(Zj) - tj, or in vector form, dL/dZ = P(Z) - T.
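The result dL/dZ = P(Z) - T can be checked numerically with finite differences; below is a small sketch of such a check (purely illustrative, not part of the assignment code):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(z, t):
    return -np.sum(t * np.log(softmax(z)))

z = np.random.randn(10)                  # a random 10-element prediction vector
t = np.zeros(10); t[4] = 1.0             # one-hot target for digit 4

analytic = softmax(z) - t                # dL/dZ = P(Z) - T

numeric = np.zeros_like(z)
eps = 1e-6
for j in range(10):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    numeric[j] = (cross_entropy(zp, t) - cross_entropy(zm, t)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # close to zero if the derivation is correct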
Gradient of the Loss function L with respect to Vw, Vbias and YR of the second layer
In this section I calculate the gradient of the loss function with respect to the second-layer weight and bias parameters and the output of the ReLU, YR, i.e. dL/dVw, dL/dVbias and dL/dYR.
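Since the second layer computes Z = (YR * Vw) + Vbias and dL/dZ = P(Z) - T from the previous section, applying the chain rule to this affine layer gives the standard expressions below (stated here for reference, with the dimensions used above):

dL/dVw = YR^T * (dL/dZ), of dimension 1000x10
dL/dVbias = dL/dZ, of dimension 1x10
dL/dYR = (dL/dZ) * Vw^T, of dimension 1x1000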
Gradient of the Loss function L with respect to the output Y of the first layer
In this section I calculate the gradient of the loss function with respect to the output Y of the first layer, i.e. dL/dY.
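Because YR = ReLU(Y) is applied element-wise, its derivative is 1 where Y is positive and 0 otherwise, so the gradient passes through only at those positions (stated here for reference):

dL/dY = (dL/dYR) ⊙ 1[Y > 0], of dimension 1x1000

where ⊙ denotes element-wise multiplication and 1[Y > 0] is 1 for positive entries of Y and 0 elsewhere.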
Gradient of the Loss function L with respect to Uw and Ubias of the first layer
In this section I calculate the gradient of the loss function with respect to the first-layer weight and bias parameters, i.e. dL/dUw and dL/dUbias.
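Since the first layer computes Y = (X * Uw) + Ubias, the same chain-rule pattern as in the second layer gives (for reference):

dL/dUw = X^T * (dL/dY), of dimension 784x1000
dL/dUbias = dL/dY, of dimension 1x1000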
Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. To find a local minimum of a function using gradient descent, we take steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.
Fig: The above figure shows how the weight parameters are updated using the gradients at each learning step, reducing the loss and moving towards the minimum.
The size of each parameter update step is controlled by a hyperparameter called the learning rate, lr.
Following are the equations to update the weight and bias parameters using the gradients found in each step:

Uw = Uw - lr * (dL/dUw)
Ubias = Ubias - lr * (dL/dUbias)
Vw = Vw - lr * (dL/dVw)
Vbias = Vbias - lr * (dL/dVbias)

The updated weight and bias parameters give the values used in the next step.
Using the details from the above sections, I implemented the network in Python. The MNIST data set was downloaded and loaded using the PyTorch library, and matrix-related calculations such as matrix multiplication, element-wise multiplication and transpose were done with NumPy.
Converting the MNIST image to a normalized input
The MNIST image values range from 0 to 255; they are normalized to the range [0, 1] by dividing the whole image matrix by 255. The 28x28 image matrix is then flattened to 1x784 to pass it as input to the network.
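A minimal sketch of this formatting step (the function name is illustrative, not necessarily the one used in the assignment code):

import numpy as np

def format_input(image):                   # image: 28x28 array with values 0..255
    x = image.astype(np.float32) / 255.0   # normalize pixel values to [0, 1]
    return x.reshape(1, 784)               # flatten to a 1x784 row vector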
Declaring and Initializing the network's weight and bias parameters
The network's first- and second-layer weight and bias parameter matrices Uw, Ub, Vw and Vb were declared with the required dimensions. Good initial values should be assigned to these matrices for good network performance. I followed a strategy similar to Xavier initialization, but simpler: the weights were initialized with random values drawn within the range [-1/sqrt(n), 1/sqrt(n)], where n is the number of inputs to that layer.
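One way to realize this initialization is sketched below (uniform sampling inside the stated range and zero biases are assumptions of this sketch, not details fixed by the report):

import numpy as np

def init_layer(n_in, n_out):
    bound = 1.0 / np.sqrt(n_in)                          # n = number of inputs to the layer
    W = np.random.uniform(-bound, bound, (n_in, n_out))  # weights within [-1/sqrt(n), 1/sqrt(n)]
    b = np.zeros((1, n_out))                             # biases assumed to start at zero
    return W, b

Uw, Ub = init_layer(784, 1000)   # first layer
Vw, Vb = init_layer(1000, 10)    # second layer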
Minibatch gradient descent
In this assignment, I implemented mini-batch gradient descent with a batch size of 100 over the data set of 60000 training samples. Gradients were calculated for each sample in the mini-batch, and the weights were updated only once per mini-batch. This speeds up training and, by adding some noise to the updates, can also help the network generalize better.
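Schematically, the batching logic looks like the sketch below. Here forward_backward is a placeholder assumed to return the four per-sample gradients, and X_train, T_train, Uw, Ub, Vw, Vb and lr are assumed to be defined; the accumulation and the single update per mini-batch are the point of the sketch:

import numpy as np

batch_size = 100
for start in range(0, num_samples, batch_size):
    sum_dUw = np.zeros_like(Uw); sum_dUb = np.zeros_like(Ub)
    sum_dVw = np.zeros_like(Vw); sum_dVb = np.zeros_like(Vb)
    for i in range(start, start + batch_size):
        dUw, dUb, dVw, dVb = forward_backward(X_train[i], T_train[i])   # placeholder
        sum_dUw += dUw; sum_dUb += dUb
        sum_dVw += dVw; sum_dVb += dVb
    # parameters are updated only once per mini-batch
    # (gradients are averaged here; summing with a smaller lr is equivalent)
    Uw -= lr * sum_dUw / batch_size; Ub -= lr * sum_dUb / batch_size
    Vw -= lr * sum_dVw / batch_size; Vb -= lr * sum_dVb / batch_size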
One-hot encoding of the labels
The labels were one-hot encoded using the function ‘get_hot_encodedLabels’ so that the predicted values can be compared with the true label.
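A possible implementation of this step is sketched below (the actual get_hot_encodedLabels in the assignment code may differ):

import numpy as np

def get_hot_encodedLabels(label, num_classes=10):
    t = np.zeros((1, num_classes))
    t[0, label] = 1.0                # e.g. label 4 -> [0,0,0,0,1,0,0,0,0,0]
    return t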
Forward propagation
Using the formatted input X and the weight and bias parameter matrices, the model's predicted value Z was calculated by forward propagation. The ReLU function is implemented in the function ‘implement_ReLU’. The predicted values were converted into probabilities using the softmax calculation, and from these probabilities the cross-entropy loss was calculated for each data sample. The total loss of the batch is calculated by the function ‘calculate_loss_bs’, and the percentage of wrongly predicted samples in each batch is calculated by the function ‘getError’.
Backward Propagation
The values of the matrices populated during the forward propagation are used to find the gradient of the loss with respect to each component. In this step, the gradients are calculated step by step from the output towards the input.
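Below is a minimal sketch of this backward pass for one sample, applying the gradient expressions derived earlier (the names are illustrative and the data is random, used only to check the shapes):

import numpy as np

X = np.random.rand(1, 784)
T = np.zeros((1, 10)); T[0, 4] = 1.0
Uw = np.random.uniform(-1/28.0, 1/28.0, (784, 1000)); Ub = np.zeros((1, 1000))
Vw = np.random.uniform(-1/np.sqrt(1000), 1/np.sqrt(1000), (1000, 10)); Vb = np.zeros((1, 10))

# forward pass, repeated so the sketch is self-contained
Y = X @ Uw + Ub
YR = np.maximum(0, Y)
Z = YR @ Vw + Vb
P = np.exp(Z - Z.max()); P /= P.sum()

# backward pass, from the output towards the input
dZ = P - T            # dL/dZ
dVw = YR.T @ dZ       # dL/dVw,    shape (1000, 10)
dVb = dZ              # dL/dVbias, shape (1, 10)
dYR = dZ @ Vw.T       # dL/dYR,    shape (1, 1000)
dY = dYR * (Y > 0)    # dL/dY, gradient passes only where Y > 0
dUw = X.T @ dY        # dL/dUw,    shape (784, 1000)
dUb = dY              # dL/dUbias, shape (1, 1000)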
Gradient Descent Step
Using the gradients calculated above, the weight and bias parameters were updated after each batch. A learning rate of 0.001 was used for the updates.
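Continuing with the gradient names from the sketch above (assumed here to be accumulated or averaged over the mini-batch), the update step reduces to a few lines:

lr = 0.001                        # learning rate used in this assignment
Uw -= lr * dUw; Ub -= lr * dUb    # first-layer weight and bias
Vw -= lr * dVw; Vb -= lr * dVb    # second-layer weight and bias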
The aim of training is to make the network predict the correct digit class for a given image. For this, the loss of the network, i.e. the mismatch between the prediction and the true label, should be reduced. Gradient descent achieves this by updating the weight and bias parameters so that the network's loss is minimized. In this section, we analyse the loss and error percentage of the network during training.
Loss versus Epoch
Below is the graph of the loss values obtained after each epoch of training. We can observe that the loss decreases after each epoch and has almost reached 0 by the 100th epoch. From this graph we can conclude that the network is learning from the training data.
Fig: The above graph shows the network training loss at each epoch.
Error Percentage versus Epoch
Below is the graph showing the percentage of images that were classified wrongly in each epoch. As the number of epochs increases, the error percentage decreases. Thus, we can conclude that the network is learning and is able to classify most of the training images correctly. The training error at the 100th epoch is 0%.
Fig: The above graph shows the network training error percentage at each epoch.