Optimization Techniques popularly used in Deep Learning

Bouzouitina Hamdi
Jan 14, 2021

The principal goal of machine learning is to create a model that performs well and gives accurate predictions in a particular set of cases. In order to achieve that, we need machine learning optimization.

So optimization is the most essential ingredient in the recipe of machine learning algorithms. It starts with defining some kind of loss function/cost function and ends with minimizing it using one optimization routine or another. The choice of optimization algorithm can make the difference between getting good accuracy in hours or in days.

Machine learning optimization is the process of adjusting parameters and hyper-parameters in order to minimize the cost function using one of the optimization techniques. It is important to minimize the cost function because it measures the discrepancy between the true value of the target and what the model has predicted.

Before we go any further, we need to understand the difference between parameters and hyper-parameters of a model. These two notions are easy to confuse, but we should keep them apart.

  • You need to set hyper-parameters before starting to train the model. They include the number of clusters, the learning rate, etc. Hyper-parameters describe the structure of the model.
  • On the other hand, the parameters of the model are obtained during the training. There is no way to get them in advance. Examples are weights and biases for neural networks. This data is internal to the model and changes based on the inputs.

In this article we’ll walk through several ML optimization techniques.

Feature Scaling

Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

Techniques to perform Feature Scaling
Consider the two most important ones:

  • Min-Max Normalization: This technique re-scales a feature or observation so that its values lie between 0 and 1.
  • Standardization: A very effective technique that re-scales a feature so that it has a distribution with mean 0 and variance 1.

So let’s take this example:

   Country   Age  Salary  Purchased
0  France     44   72000          0
1  Spain      27   48000          1
2  Germany    30   54000          0
3  Spain      38   61000          0
4  Germany    40    1000          1

Original data values:
[[ 44 72000]
[ 27 48000]
[ 30 54000]
[ 38 61000]
[ 40 1000]
[ 35 58000]
[ 78 52000]
[ 48 79000]
[ 50 83000]
[ 37 67000]]

After Min-Max Scaling:
[[ 0.33333333 0.86585366]
[ 0. 0.57317073]
[ 0.05882353 0.64634146]
[ 0.21568627 0.73170732]
[ 0.25490196 0. ]
[ 0.15686275 0.69512195]
[ 1. 0.62195122]
[ 0.41176471 0.95121951]
[ 0.45098039 1. ]
[ 0.19607843 0.80487805]]

After Standardization:
[[ 0.09536935 0.66527061]
[-1.15176827 -0.43586695]
[-0.93168516 -0.16058256]
[-0.34479687 0.16058256]
[-0.1980748 -2.59226136]
[-0.56487998 0.02294037]
[ 2.58964459 -0.25234403]
[ 0.38881349 0.98643574]
[ 0.53553557 1.16995867]
[-0.41815791 0.43586695]]
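
These outputs can be reproduced with scikit-learn; a minimal sketch, assuming the ten Age/Salary pairs above are stored in a NumPy array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Age and Salary columns from the example above
x = np.array([[44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, 1000],
              [35, 58000], [78, 52000], [48, 79000], [50, 83000], [37, 67000]])

# Min-max normalization: (x - min) / (max - min), per column, values end up in [0, 1]
print(MinMaxScaler().fit_transform(x))

# Standardization: (x - mean) / std, per column, giving mean 0 and variance 1
print(StandardScaler().fit_transform(x))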

Batch normalization

We normalize the input layer by adjusting and scaling the activations. For example, when some features range from 0 to 1 and others from 1 to 1000, we should normalize them to speed up learning. If the input layer benefits from this, why not do the same for the values in the hidden layers, which change all the time during training, and potentially get a 10x or greater improvement in training speed?

So Batch Normalization is a supervised learning technique that standardizes the inter-layer outputs of a neural network, a step called normalizing. This effectively ‘resets’ the distribution of the output of the previous layer so it can be processed more efficiently by the subsequent layer.

This approach allows higher learning rates and faster learning, since normalization ensures there is no activation value that is too high or too low, and it lets each layer learn somewhat independently of the others.

Normalizing layer inputs also reduces the information lost between processing layers, which helps accuracy throughout the network.

How Does Batch Normalization Work?

To enhance the stability of a deep learning network, batch normalization affects the output of the previous activation layer by subtracting the batch mean, and then dividing by the batch’s standard deviation.

Since this shifting and scaling of outputs by randomly initialized parameters could reduce the accuracy of the weights in the next layer, stochastic gradient descent is able to undo the normalization wherever doing so lowers the loss function.

Consequently, batch normalization adds two trainable parameters to each layer: the normalized output is multiplied by a “standard deviation” parameter (gamma) and a “mean” parameter (beta) is added. In other words, batch normalization lets SGD (stochastic gradient descent) do the denormalization by changing only these two weights for each activation, instead of losing the stability of the network by changing all the weights.
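
As a rough sketch of this forward pass in NumPy (the function name and the gamma/beta initialization below are illustrative, not from any particular library):

import numpy as np

def batch_norm_forward(z, gamma, beta, epsilon=1e-8):
    # Normalize a mini-batch of layer outputs, then scale by gamma and shift by beta
    mu = np.mean(z, axis=0)                       # batch mean, per unit
    var = np.var(z, axis=0)                       # batch variance, per unit
    z_norm = (z - mu) / np.sqrt(var + epsilon)    # zero mean, unit variance
    return gamma * z_norm + beta                  # learnable scale and shift

# Example: a batch of 4 samples flowing into 3 hidden units
z = np.random.randn(4, 3) * 10 + 5
out = batch_norm_forward(z, gamma=np.ones(3), beta=np.zeros(3))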

Mini-batch gradient descent (MGD)

MGD is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate the model error and update the model coefficients.

Let us understand it like this:

Suppose I have 1,000 records and my batch size is 50. I randomly choose 50 records, compute the sum of their losses, and then pass that loss to the optimizer to find dE/dw.
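
A minimal sketch of this loop, assuming a generic compute_gradient(X_batch, y_batch, w) helper (a hypothetical function, not from any library) that returns dE/dw for the current batch:

import numpy as np

def mini_batch_gd(X, y, w, compute_gradient, learning_rate=0.01, batch_size=50, epochs=10):
    n = X.shape[0]
    for epoch in range(epochs):
        idx = np.random.permutation(n)                      # shuffle the records each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]           # pick the next 50 records
            grad = compute_gradient(X[batch], y[batch], w)  # dE/dw on this batch only
            w = w - learning_rate * grad                    # one parameter update per batch
    return w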

Advantages:

  • The model update frequency is higher than in BGD (Batch Gradient Descent): in MGD we do not wait for the entire dataset; we pass just 50, 100, 200, or 256 records at a time and then run an optimization step.
  • Batching gives the efficiency of not holding all training data in memory and simplifies the implementation: we also keep memory consumption under control, since we do not have to store losses for every record in the dataset.
  • Batch updates provide a computationally more efficient process than pure SGD (one update per example).

Disadvantages:

  • There is no guarantee of better convergence of the error.
  • Since the 50 sampled records may not represent the properties (or variance) of the entire dataset, we may not get clean convergence, i.e., we may not reach an absolute global or local minimum at any point in time.
  • While using MGD, since we take records in batches, some batches may yield one error and other batches a different one, so we have to control the learning rate ourselves. If the learning rate is very low, the convergence rate also falls; if it is too high, we will not reach an absolute global or local minimum.

Gradient descent with momentum

Stochastic gradient descent, often abbreviated SGD, is a variation of the gradient descent algorithm that calculates the error and updates the model for each example in the training dataset.

SGD has trouble navigating ravines, areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine while making only hesitant progress along the bottom towards the local optimum.

So Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction γ of the update vector of the past time step to the current update vector.

Gradient descent with momentum almost always works faster than standard gradient descent. The basic idea is to calculate an exponentially weighted average of your gradients and then use that average, rather than the raw gradient, to update your weights.

How does it work?

SGD without momentum

We start gradient descent from point ‘A’ and, after one iteration, end up at point ‘B’ on the other side of the ellipse. The next gradient step can end at point ‘C’. With each iteration of gradient descent we step towards the local optimum, but with oscillations up and down. If we use a higher learning rate, the frequency of the vertical oscillation is greater. This vertical oscillation therefore slows our gradient descent and prevents us from using a much higher learning rate.

By using exponentially weighted averages of the dW and db values, we average the vertical oscillations towards zero, since they point in both positive and negative directions. In the horizontal direction, however, all the derivatives point the same way, so the horizontal average remains quite large. This lets the algorithm take a straighter path towards the local optimum and damps out the vertical oscillations, so it reaches the local optimum in fewer iterations.

SGD with momentum

We use dW and db to update our parameters W and b during the backward propagation as follows:

W = W - learning rate * dW

b = b - learning rate * db

In momentum we take the exponentially weighted averages of dW and db, instead of using dW and db independently for each epoch.

VdW = β * VdW + (1 - β) * dW

Vdb = β * Vdb + (1 - β) * db

Where beta ‘β’ is another hyper-parameter, called momentum, with a value between 0 and 1 (a common choice is 0.9). It sets the weight given to the average of previous values versus the current value.

We’ll update our parameters after calculating the exponentially weighted averages.

W = W - learning rate * VdW

b = b - learning rate * Vdb
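
Putting these pieces together, one training step with momentum could be sketched as follows (dW and db are the gradients of the current mini-batch; beta = 0.9 is a common choice):

def momentum_step(W, b, dW, db, VdW, Vdb, learning_rate=0.01, beta=0.9):
    # Exponentially weighted averages of the gradients
    VdW = beta * VdW + (1 - beta) * dW
    Vdb = beta * Vdb + (1 - beta) * db
    # Update the parameters with the averaged gradients instead of the raw ones
    W = W - learning_rate * VdW
    b = b - learning_rate * Vdb
    return W, b, VdW, Vdb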

RMSProp optimization

RMSProp, which stands for root mean square propagation, also tries to dampen the oscillations, but in a different way than momentum. RMSProp reduces the need to tune the learning rate by hand, effectively choosing a different learning rate for each parameter, which can also accelerate gradient descent. It uses the same concept of an exponentially weighted average as gradient descent with momentum, but applies it to the squared gradients, and the parameter update is different.

How does it work?

As in the momentum example above, plain gradient descent oscillates from point ‘A’ to ‘B’ to ‘C’ on its way towards the local optimum, and this vertical oscillation prevents us from using a much higher learning rate.

In this picture, suppose the ‘bias’ is responsible for the vertical oscillations, whereas the movement in the horizontal direction comes from the ‘weight’. If we slow down the updates for the bias, the vertical oscillations are dampened, and if we update the weights with larger steps, we can still move quickly towards the local optimum.

Implementation

On each iteration we compute dW and db on the current mini-batch. Instead of averaging the gradients themselves, as momentum does, RMSProp keeps exponentially weighted averages of their squares:

  • SdW = β * SdW + (1 - β) * dW²
  • Sdb = β * Sdb + (1 - β) * db²

Where beta ‘β’ is a hyperparameter between 0 and 1 (typically around 0.9) that sets the weight between the average of previous values and the current value.

We then update the parameters, dividing each gradient by the root of its running average of squares:

  • W = W - learning rate * dW / (√SdW + ε)
  • b = b - learning rate * db / (√Sdb + ε)

Here ε is a small constant that avoids division by zero. A parameter whose gradient oscillates strongly, like the ‘bias’ above, accumulates a large SdW and so gets smaller updates, while a parameter with small, consistent gradients keeps taking relatively large steps towards the local optimum.
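
A minimal sketch of a single RMSProp step, mirroring the equations above:

import numpy as np

def rmsprop_step(W, b, dW, db, SdW, Sdb, learning_rate=0.001, beta=0.9, epsilon=1e-8):
    # Exponentially weighted averages of the squared gradients
    SdW = beta * SdW + (1 - beta) * dW ** 2
    Sdb = beta * Sdb + (1 - beta) * db ** 2
    # Divide each gradient by the root of its running mean square before stepping
    W = W - learning_rate * dW / (np.sqrt(SdW) + epsilon)
    b = b - learning_rate * db / (np.sqrt(Sdb) + epsilon)
    return W, b, SdW, Sdb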

Adam optimization

Adam can be looked at as a combination of RMSprop and Stochastic Gradient Descent with momentum. It uses the squared gradients to scale the learning rate like RMSprop and it takes advantage of momentum by using moving average of the gradient instead of gradient itself like SGD with momentum.

How does it work?

  1. First, it calculates and stores an exponentially weighted average of past gradients in VdW & Vdb (before bias correction) and VdWcorrected & Vdbcorrected (with bias correction) variables.
  2. It then calculates an exponentially weighted average of past gradient squares and stores it in SdW & Sdb (before bias correction) and SdWcorrected & Sdbcorrected (with bias correction) variables.
  3. Finally, it updates the parameters in a direction based on combining the information from “1” and “2”.

Implementation

To implement Adam optimization algorithm, we need to initialize:

VdW = 0, SdW = 0, Vdb = 0, Sdb = 0

Then on iteration t:

Compute the gradients dW and db using the current mini-batch.

Do the momentum exponentially weighted average:

  • VdW = β1 * VdW + (1 - β1) * dW
  • Vdb = β1 * Vdb + (1 - β1) * db

Then do the RMSprop update as well:

  • SdW = β2 * SdW + (1 - β2) * dW²
  • Sdb = β2 * Sdb + (1 - β2) * db²

A typical Adam implementation also applies bias correction, so we compute corrected values (VdWcorrected means VdW after correction of the bias):

  • VdWcorrected = VdW / (1 - β1^t)
  • Vdbcorrected = Vdb / (1 - β1^t)
  • SdWcorrected = SdW / (1 - β2^t)
  • Sdbcorrected = Sdb / (1 - β2^t)

Finally, we perform the update:

  • W = W - alpha * VdWcorrected / (√SdWcorrected + epsilon)
  • b = b - alpha * Vdbcorrected / (√Sdbcorrected + epsilon)

And we have:

  • alpha is the learning rate
  • beta1 is the weight used for the first moment (a common default is 0.9)
  • beta2 is the weight used for the second moment (a common default is 0.999)
  • epsilon is a small number to avoid division by zero
  • W and b are the parameters to be updated, and dW and db are their gradients
  • VdW, Vdb are the previous first moments and SdW, Sdb the previous second moments
  • t is the time step used for bias correction
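
A minimal NumPy sketch of a single Adam step for one parameter array W, following the equations above (the function name is illustrative):

import numpy as np

def adam_step(W, dW, VdW, SdW, t, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # First moment: exponentially weighted average of the gradients (momentum part)
    VdW = beta1 * VdW + (1 - beta1) * dW
    # Second moment: exponentially weighted average of the squared gradients (RMSprop part)
    SdW = beta2 * SdW + (1 - beta2) * dW ** 2
    # Bias correction, t being the 1-based iteration count
    VdW_corr = VdW / (1 - beta1 ** t)
    SdW_corr = SdW / (1 - beta2 ** t)
    # Combine both moments for the update
    W = W - alpha * VdW_corr / (np.sqrt(SdW_corr) + epsilon)
    return W, VdW, SdW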

Advantages of the Adam Algorithm

There are several advantages of the Adam Algorithm and some of them are listed below:

  • Easy to implement
  • Quite computationally efficient
  • Requires little memory space
  • Good for non-stationary objectives
  • Works well on problems with noisy or sparse gradients
  • Works well with large data sets and large parameters

Learning rate decay

The learning rate is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. Choosing the learning rate is challenging as a value too small may result in a long training process that could get stuck, whereas a value too large may result in learning a sub-optimal set of weights too fast or an unstable training process.

The learning rate may be the most important hyperparameter when configuring your neural network. Therefore it is vital to know how to investigate the effects of the learning rate on model performance and to build an intuition about the dynamics of the learning rate on model behavior.

Implementation

decayed_learning_rate = alpha / (1 + decay_rate * global_step / decay_step)

And we have:

  • alpha is the original learning rate
  • decay_rate is the weight used to determine the rate at which alpha will decay
  • global_step is the number of passes of gradient descent that have elapsed
  • decay_step is the number of passes of gradient descent that should occur before alpha is decayed further
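
A minimal sketch of this schedule, assuming a stepwise decay in which alpha only changes every decay_step passes:

def decayed_learning_rate(alpha, decay_rate, global_step, decay_step=1):
    # Inverse-time decay: alpha shrinks as more passes of gradient descent elapse
    return alpha / (1 + decay_rate * (global_step // decay_step))

# Example: alpha = 0.1, decay_rate = 1, decayed once per pass
for step in range(5):
    print(step, decayed_learning_rate(0.1, 1, step))
# prints 0.1, 0.05, 0.0333..., 0.025, 0.02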

In the end, I would like to present:

  • Advantages and Disadvantages of Gradient Descent
  • Advantages and Disadvantages of Optimization Algorithms
