Loss functions are an important concept in deep learning. They enable the network to actually learn what we're trying to teach it.
You might think of the loss function as the part of the brain that represents the first stage of storing information into memory.
In a neural network setting, the loss function computes the error of our network's prediction. This error is propagated backwards through the entire network, slowly tweaking it to do better next time.
But that's a topic for another post.
Let's focus on how this error is actually computed. We shall call this error the loss from now on.
We shall use the following nomenclature:
p = the prediction, or the probability emitted by the network.
y = the ground truth, or what we want the network to predict.
The L1 loss is nothing but the absolute difference between p and y.
L = abs(p - y)
If p and y are one-hot vectors, the element-wise losses are summed up to give the loss for that vector.
If the network takes in batches of examples, this loss can either be summed over all examples or the mean loss can be taken.
For the sum:
final_loss = sum(L_1, L_2, ..., L_10)
For the mean:
final_loss = sum(L_1, L_2, ..., L_10) / 10
// 10 being the total number of examples in the minibatch
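As a minimal sketch of both reductions (NumPy, with made-up predictions and one-hot targets):
import numpy as np

p = np.array([[0.9, 0.1], [0.4, 0.6]])  # predictions for a minibatch of 2 examples (made-up values)
y = np.array([[1.0, 0.0], [0.0, 1.0]])  # one-hot ground truth

per_example = np.abs(p - y).sum(axis=1)  # L1 loss per example
loss_sum = per_example.sum()             # sum reduction over the minibatch
loss_mean = per_example.mean()           # mean reduction over the minibatch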
The Mean Square Loss is very similar to the L1 Loss, but instead of taking the absolute element-wise difference, we take the square of the element-wise difference.
Hence,
L = square(p - y)
The rest is the same as L1 Loss.
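A similarly minimal NumPy sketch, reusing the p and y defined above:
per_example = np.square(p - y).sum(axis=1)  # squared error per example
loss_mean = per_example.mean()              # mean squared loss over the minibatch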
This is the interesting part: which loss function should one use? It depends on the use case.
Generally, the L1 loss is said to be good at ignoring outliers in the dataset, whereas the L2 loss can become unstable if outliers are introduced.
This can be explained intuitively: the L2 loss computes the squared difference, whereas the L1 loss computes the absolute difference.
This makes the L2 loss more sensitive to an outlier, since the squared error of the outlier will typically be large compared to the error generated by the non-outlier data. The network will then work towards reducing this one large loss, making it less robust.
But that's about the only disadvantage of the L2 loss over the L1 loss. It has a considerable advantage over L1.
Stability. The L2 loss is more stable under small perturbations in the data. The L1 loss may change significantly when the data is moved by a small amount, but that is not the case with the L2 loss.
Differentiability. The L1 loss is not differentiable at 0. The L2 loss is differentiable throughout. The Smooth L1 loss is used to combat this.
There is a tradeoff while using L1 regularisation vs L2 regularisation that we shall discuss in the Regularisation blog post.
In practice it is often said: when in doubt, use the L2 loss, as it is more precise and better at minimising prediction errors.
An interesting use case where the L1 loss was used over the L2 loss is in training the Faster RCNN bounding box model, which uses a variant of the L1 loss called the Smooth L1 loss.
The Smooth L1 loss, unlike the original, is differentiable everywhere. It is given by:
// d = p - y
L = 0.5 * d^2       if abs(d) <= 1
L = abs(d) - 0.5    otherwise
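A minimal NumPy sketch of this piecewise definition, again reusing p and y from above (any real-valued predictions and targets work the same way):
def smooth_l1(d):
    d = np.abs(d)
    return np.where(d <= 1.0, 0.5 * d ** 2, d - 0.5)  # quadratic near 0, linear beyond 1

loss = smooth_l1(p - y).sum(axis=1).mean()  # per-example sum, then mean over the minibatch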
The Smooth L1 loss is an improvement over the L2 loss used for regression in RCNN. It is found to be less sensitive to outliers: training with unbounded regression targets under an L2 loss led to gradient explosion, so a very carefully tuned learning rate had to be used. Switching to the Smooth L1 loss removed this problem.
The cross entropy loss measures the performance of a classification model whose output is a probability between 0 and 1.
The cross entropy loss increases as the predicted probability diverges from the target label.
Multiclass classification means predicting over C classes, with only one target class per example.
The Cross Entropy Loss is defined as follows.
// i = index of target class
s = exp(p[i])/sum_all(exp(p)) // softmax at i
L = -log(s)
The losses are averaged across observations for each minibatch.
The cross entropy loss is also called the softmax loss function.
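As a minimal sketch (NumPy, with made-up logits), the softmax followed by the negative log for a single example:
logits = np.array([2.0, 0.5, -1.0])   # raw network outputs for C = 3 classes (made-up values)
target = 0                            # index of the target class

s = np.exp(logits) / np.exp(logits).sum()  # softmax probabilities
loss = -np.log(s[target])                  # cross entropy loss for this example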
The Cross Entropy Loss is preferred for classification and the L2 loss for regression, for two reasons:
- The cross entropy loss is compatible with the kind of output a classification task produces: it assumes the input is a probability vector over classes and penalises wrong predictions accordingly. It would do poorly in a regression scenario.
- The L2 loss computes the error in a way that is not compatible with a classification problem. It assumes the output to be a single continuous quantity, which makes it most compatible with regression tasks.
The Negative Log Likelihood (NLL) loss is the same thing as the Cross Entropy Loss. The only difference is that its input is already a list of log softmax probabilities.
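Continuing the sketch above, the same value computed from log softmax probabilities (written here via log-sum-exp):
log_s = logits - np.log(np.exp(logits).sum())  # log softmax probabilities
loss = -log_s[target]                          # same value as the cross entropy loss above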
The Hinge Loss is generally used in SVMs for classification problems.
In its original form it is non-differentiable at the point where y*p = 1:
L = max(0, (1 - y*p))
Here, instead of y being 1 or 0, it is 1 or -1.
Also, p is the raw output score, not the probability from the softmax.
Hence, if y and p are of the same sign, they give a low loss; if not, a high loss.
The hinge loss was made differentiable by using the Squared Hinge Loss.
// prev_L = previous Hinge Loss
L = square(prev_L)
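A minimal NumPy sketch of both variants, with made-up raw scores and +1/-1 labels:
scores = np.array([2.3, -0.7, 0.4])   # raw (pre-softmax) outputs (made-up values)
labels = np.array([1, -1, -1])        # targets encoded as +1 / -1

hinge = np.maximum(0.0, 1.0 - labels * scores)  # hinge loss per example
squared_hinge = hinge ** 2                      # squared hinge loss per example
loss = squared_hinge.mean()                     # mean over the minibatch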
The hinge loss is used with SVMs, and the log loss is used with deep learning networks.
I hope this blog post gave you some clarity on which loss function does what, and when to use each of them.
Cheers