
Regularizing your Neural Network (Week 1 Summary)


This is the Part 2 summary of Prof. Andrew Ng's specialization course "Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization" (Week 1).

Regularization should be used when our neural network overfits the data. (For how to tell whether our NN overfits the data, please refer to Setting up Machine Learning Application.)

Let's first see how regularization works in Logistic Regression to get the basic idea:

L2 regularization

J is the cost function with parameters w and b, which we intend to minimize. It is computed as the average of the losses on the individual training examples. Here w is the n_x-dimensional parameter vector and b is a real number. To regularize, we add an extra term: λ/2m times the squared norm of w, where λ (lambda) is the regularization parameter:

J(w, b) = (1/m) Σᵢ L(ŷ(i), y(i)) + (λ/2m)‖w‖₂²

The squared norm can simply be written as wᵀw; it is just the squared Euclidean norm of the parameter vector w. This is called L2 regularization.
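
Here is a minimal NumPy sketch of this regularized cost (not the course's official code), assuming w has shape (n_x, 1), X has shape (n_x, m), Y has shape (1, m), and lambd stands in for λ:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def regularized_cost(w, b, X, Y, lambd):
        """Logistic regression cost plus the L2 penalty (lambda/2m) * ||w||^2."""
        m = X.shape[1]                                 # number of training examples
        A = sigmoid(np.dot(w.T, X) + b)                # predictions y_hat for all examples
        cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
        l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))   # (lambda/2m) * w^T w
        return cross_entropy + l2_penalty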

L1 regularization

L1 regularization causes w to be sparse, which means the w vector will have a lot of zeros in it. Although L1 regularization compresses the model (some parameters go to zero, so the model takes less memory to store), L2 regularization is generally preferred.
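
For comparison, a one-function sketch of the L1 penalty term (using the same λ/2m scaling as above, which is one common convention; the exact scaling varies):

    import numpy as np

    def l1_penalty(w, lambd, m):
        """L1 regularization term; it tends to drive some weights exactly to zero (sparse w)."""
        return (lambd / (2 * m)) * np.sum(np.abs(w))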

Lambda is the regularization parameter, a hyperparameter to be tuned. We set it by trying a variety of values and seeing which one gives the best result.

In Python, lambda is a reserved keyword, so by convention we name the regularization parameter lambd instead.

Neural Network

In a NN, the cost function is a function of all the parameters W[1], b[1] through W[L], b[L], where capital L is the number of layers in the neural network:


J(W[1], b[1], ..., W[L], b[L]) = (1/m) Σᵢ L(ŷ(i), y(i)) + (λ/2m) Σₗ ‖W[l]‖²_F

where ‖W[l]‖²_F is the sum of the squares of all the entries of W[l]. This is called the Frobenius norm (by convention, this matrix norm is called the Frobenius norm rather than the L2 norm).
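
As a small sketch (assuming the weights are stored in a dictionary named parameters with keys "W1" ... "WL", and that the unregularized cross-entropy cost has already been computed elsewhere), the regularization term can be added like this:

    import numpy as np

    def add_frobenius_penalty(cross_entropy_cost, parameters, lambd, m, L):
        """Add (lambda/2m) * sum over layers of ||W[l]||_F^2 to the cost."""
        frobenius_sum = sum(np.sum(np.square(parameters["W" + str(l)]))
                            for l in range(1, L + 1))
        return cross_entropy_cost + (lambd / (2 * m)) * frobenius_sum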

For gradient descent with a NN, previously we would compute dW[l] using backprop, which gives us the partial derivative of J with respect to W[l] for any given layer l. We would then update W[l] as W[l] minus the learning rate α times dW[l].

This was before we added the extra regularization term to the objective. Now that we have added the regularization term, we take dW[l] from backprop and add (λ/m)W[l] to it before doing the update.

L2 regularization is also called weight decay. The update is just like ordinary gradient descent, where you update W[l] by subtracting α times the original gradient from backprop, except that the extra term means W[l] is effectively first multiplied by (1 − αλ/m), a factor slightly less than 1, so the weights "decay" a little on every step:

W[l] := W[l] − α(dW[l] + (λ/m)W[l]) = (1 − αλ/m)W[l] − α dW[l]
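
A minimal NumPy sketch of this update for a single layer, with the hypothetical names W, dW, lambd, m and alpha (not the course's official code):

    import numpy as np

    def update_with_weight_decay(W, dW, lambd, m, alpha):
        """One gradient descent step with L2 regularization (weight decay)."""
        dW_reg = dW + (lambd / m) * W   # gradient of the (lambda/2m)*||W||_F^2 term added to backprop's dW
        return W - alpha * dW_reg       # equivalent to (1 - alpha*lambd/m)*W - alpha*dW

    # Illustrative usage (shapes are arbitrary):
    W = np.random.randn(7, 3)
    dW = np.random.randn(7, 3)
    W = update_with_weight_decay(W, dW, lambd=0.7, m=1000, alpha=0.01)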

Why does regularization reduce overfitting?

Adding the regularization term to the cost function penalizes the weight matrices for being too large (this is the Frobenius norm term). The regularization parameter lambda trades off against the size of the weights: if lambda is large, the weight matrices W[l] are pushed reasonably close to zero.
With many hidden units effectively zeroed out, the NN behaves like a much simpler, smaller network that is less able to overfit. In reality the hidden units are not actually zeroed out; their impact is just reduced.

One useful tip: to debug gradient descent after implementing regularization, plot the cost function J as a function of the number of iterations. We should see the cost function J decrease monotonically after every iteration of gradient descent.
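
For example, assuming we append the value of J to a Python list called costs on every iteration, a quick matplotlib check might look like this (a hypothetical helper, not from the course):

    import matplotlib.pyplot as plt

    def plot_costs(costs):
        """Plot the regularized cost J against iteration number; it should decrease monotonically."""
        plt.plot(costs)
        plt.xlabel("iteration")
        plt.ylabel("cost J")
        plt.title("Gradient descent with L2 regularization")
        plt.show()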

Dropout Regularization

In dropout we go through each of the layers of the network and set some probability of eliminating each node in the network. In this way we end up with a much smaller network than the original, and then we do back propagation on it.

So on one iteration of gradient descent we might zero out some hidden units, and on the second iteration, where we go through the training set a second time, we may zero out a different pattern of hidden units.

For each layer we choose a probability, which we call 'keep-prob'. For the example network from the lecture, the weight matrices are:
  • W1 = 7 x 3
  • W2 = 7 x 7
  • W3 = 3 x 7
and so on..

Since W2 is the largest of the weight matrices, we may set its keep-prob lower (like 0.5), because a larger matrix means that layer is more likely to account for the overfitting. For other layers where we worry less about overfitting we can use a higher keep-prob, and for layers with the least danger of overfitting we keep it at 1.0 (literally no dropout of any unit in that layer).
To implement dropout we use a technique called 'inverted dropout'. Let's understand it with an example.
For completeness we will illustrate only the third layer, l = 3, with keep-prob = 0.8, i.e. the probability that a given hidden unit is kept. This means there is a 0.2 chance of eliminating any given unit in this layer.


We define a dropout vector d3 for layer 3. d3 is generated as a random matrix, so each unit of layer 3 has an 80% chance of its entry being 1 and a 20% chance of being 0. Then we take the activations a3 from the third layer and set:
  • a3 *= d3
This element-wise multiplication zeroes out the dropped units. In Python, d3 is typically a Boolean array (True/False rather than 1/0), but the multiplication still works because True and False are interpreted as 1 and 0. Finally we divide a3 by keep-prob to scale it back up.
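
Since the original code figure is not reproduced here, a minimal NumPy sketch of inverted dropout for layer 3 might look like this (a3's shape is only illustrative):

    import numpy as np

    keep_prob = 0.8                           # probability of keeping each unit in layer 3
    a3 = np.random.randn(7, 5)                # placeholder activations for layer 3

    d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # boolean mask, True with prob 0.8
    a3 = np.multiply(a3, d3)                  # zero out the dropped units
    a3 /= keep_prob                           # inverted step: scale up so the expected value of a3 is unchanged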

Dropout is used very frequently in computer vision because the input size is so large. The big downside of dropout regularization is that the cost function J is no longer well defined: on every iteration we randomly kill off a bunch of nodes, so plotting J to double-check the performance of gradient descent is no longer meaningful. One way around this is to turn dropout off by setting keep-prob = 1, run the code and check that the plot of J is decreasing, and then turn dropout back on after debugging.

Other Regularization Techniques to reduce overfitting

  • Data Augmentation

Add more data by flipping, tilting and zooming existing examples. These extra, somewhat fake training examples don't add much new information, but they are an inexpensive way to give our algorithm more data and therefore regularize it somewhat and reduce overfitting.
  • Early stopping
As we run gradient descent we plot either the training error or the cost function J, which should decrease monotonically. We can also plot our dev set error (the classification error, or a cost such as the logistic/log loss computed on the dev set).

We'll find that the dev set error usually moves down for a while and then starts increasing from a certain point.

So early stopping means stopping training around that halfway point, when W is still mid-sized and reasonable and has not yet grown large enough to cause overfitting (a rough sketch appears after this list).
  • L2 regularization
The alternative is L2 regularization, but its downside is that lambda (the L2 regularization parameter) has to be tried over many values, which makes it computationally expensive, whereas with early stopping the training cycle is run only once. Even so, L2 regularization is used more often and is generally preferable, because early stopping couples the job of optimizing the cost J with the job of preventing overfitting instead of treating them as separate problems.
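
As a rough sketch of early stopping (a patience-based variant, assuming hypothetical callables train_step() and dev_error() that perform one gradient descent step and return the current dev set error; this is not the course's code):

    def train_with_early_stopping(train_step, dev_error, max_iters=10000, patience=50):
        """Stop training once the dev set error has not improved for `patience` iterations."""
        best_error, best_iter, waited = float("inf"), 0, 0
        for i in range(max_iters):
            train_step()
            err = dev_error()
            if err < best_error:
                best_error, best_iter, waited = err, i, 0   # new best point on the dev set: remember it
            else:
                waited += 1
                if waited >= patience:                      # dev error has stopped improving
                    break
        return best_iter, best_error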









