Tutorial

Intro to optimization in deep learning: Gradient Descent

Updated on April 10, 2025

Ayoosh Kathuria and Shaoni Mukherjee

Deep Learning is, to a large extent, about solving massive, nasty optimization problems. A Neural Network is merely a very complicated function, consisting of millions of parameters, that represents a mathematical solution to a problem. Consider the task of image classification. AlexNet is a mathematical function that takes an array representing the RGB values of an image and produces the output as a bunch of class scores.

By training neural networks, we essentially mean minimizing a loss function. The value of this loss function gives us a measure of how far from perfect our network’s performance is on a given dataset.

Prerequisites

This is an introductory article on optimizing Deep Learning algorithms designed for beginners in this space. It requires no additional experience to follow along.

The Loss Function

For the sake of simplicity, let us assume our network has only two parameters. In practice, this number would be around a billion, but we’ll still stick to the two-parameter example throughout the post to not drive ourselves nuts while trying to visualize things. The contour of a very nice loss function may look like this.

Contour of a Loss Function

Why do I say a very nice loss function?

Because a loss function having a contour like the one above is like Santa: it doesn't exist. However, it still serves as a decent pedagogical tool for conveying some of the most important ideas about gradient descent. So, let's get to it!

The x and y axes represent the values of the two weights. The z-axis represents the value of the loss function for a particular pair of weight values. Our aim is to find the values of the weights for which the loss is minimal. Such a point is called a minima of the loss function.

You initially have randomly initialized weights, so your neural network is probably behaving like a drunk version of yourself, classifying images of cats as humans. Such a situation corresponds to point A on the contour, where the network is performing badly, and consequently, the loss is high.

We need to find a way to navigate to the bottom of the “valley” to point B, where the loss function has its minima.

Gradient Descent

We are at point A in the loss landscape when we initialize our weights. The first thing we do is to check, out of all possible directions in the x-y plane, moving along which direction brings about the steepest decline in the value of the loss function. This is the direction we have to move in, and it is exactly opposite to the direction of the gradient. The gradient, the higher dimensional cousin of the derivative, gives us the direction of steepest ascent.

To wrap your head around it, consider the following figure. At any point on our surface, we can define a plane tangential to that point. (In higher dimensions we can always define a hyperplane, but let’s stick to 3-D for now.) There are infinitely many directions on this plane. Out of them, precisely one direction gives us the direction in which the function has the steepest ascent. The gradient gives this direction. The direction opposite to it is the direction of steepest descent. This is how the algorithm gets its name: we use the gradient to perform descent, and hence it’s called Gradient Descent.

Tangential plane and the direction of steepest ascent given by the gradient

Once we have the direction we want to move in, we must decide the size of the step to take. The size of this step is called the learning rate. We must choose it carefully to ensure we can get to the minima.

If we go too fast, we might overshoot the minima and keep bouncing along the ridges of the “valley” without ever reaching it. Go too slow, and the training might turn out to be too long to be feasible at all. Even if that’s not the case, very slow learning rates make the algorithm more prone to getting stuck in a minima, which we’ll cover later in this post.

Once we have our gradient and the learning rate, we take a step, recompute the gradient at whatever position we end up at, and repeat the process.

While the direction of the gradient tells us which direction has the steepest ascent, its magnitude tells us how steep the steepest ascent/descent is. So, at the minima, where the contour is almost flat, you would expect the gradient to be almost zero. It’s precisely zero for the point of minima.

Gradient Descent in Action

Using too large a learning rate

In practice, we might never exactly reach the minima, but we keep oscillating in a flat region near it. As we oscillate in this region, the loss is almost the minimum we can achieve and doesn’t change much, since we’re just bouncing around the actual minimum. Often, we stop our iterations when the loss values haven’t improved for a pre-decided number of iterations, say, 10 or 20. When this happens, we say our training has converged, or convergence has occurred.

A Common Mistake

Let me digress for a moment. If you google for visualizations of gradient descent, you’ll probably see a trajectory that starts from a point and heads to a minimum, just like the animation presented above. However, this gives you a very inaccurate picture of gradient descent. Our trajectory is entirely confined to the x-y plane containing the weights.

Contrary to what the above animation might suggest, gradient descent doesn’t involve moving in the z-direction at all. Only the weights are free parameters, and they are described by the x and y directions. The actual trajectory we take is defined in the x-y plane, as follows.

Real Gradient Descent Trajectory

Each point in the x-y plane represents a unique combination of weights, and we want to have a set of weights described by the minima.

Basic Equations

The basic equation that describes the update rule of gradient descent is as follows:

w ← w − α · ∇L(w)

This update is performed during every iteration. Here, w is the weights vector, which lies in the x-y plane. From this vector, we subtract the gradient of the loss function with respect to the weights, multiplied by alpha, the learning rate. The gradient is a vector that gives us the direction in which the loss function has the steepest ascent. The direction of steepest descent is exactly opposite to the gradient, which is why we subtract the gradient vector from the weights vector.

If imagining vectors is a bit hard for you, think of it this way: almost the same update rule is applied to every weight of the network simultaneously. The only change is that, since we perform the update individually for each weight, the gradient in the above equation is replaced by its projection along the direction represented by that particular weight.

wᵢ ← wᵢ − α · ∂L/∂wᵢ

This update is simultaneously done for all the weights.

Before subtracting, we multiply the gradient vector by the learning rate. This represents the step that we talked about earlier. Realize that even if we keep the learning rate constant, the size of the step can change owing to changes in the magnitude of the gradient or the steepness of the loss contour. As we approach a minima, the gradient approaches zero, and we take smaller and smaller steps towards the minima.

In theory, this is good since we want the algorithm to take smaller steps when it approaches the minima. Having a step size too large may cause it to overshoot the minima and bounce between its ridges.
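To make the update rule concrete, here is a minimal sketch in NumPy. The two-parameter loss L(w) = w₁² + w₂² (whose gradient is simply 2w), the starting point, and the learning rate are all made-up values for illustration, not anything from a real network.

```python
import numpy as np

# A made-up, well-behaved two-parameter loss: L(w) = w1^2 + w2^2.
# Its gradient is 2*w, which is zero only at the minima (0, 0).
def loss(w):
    return np.sum(w ** 2)

def gradient(w):
    return 2 * w

w = np.array([4.0, -3.0])   # a "randomly initialized" starting point (point A)
alpha = 0.1                 # learning rate

for step in range(100):
    w = w - alpha * gradient(w)   # w <- w - alpha * grad L(w)

print(w, loss(w))   # w ends up very close to (0, 0), where the loss is minimal
```

Notice that as w approaches the minima, the gradient 2w shrinks, so the steps shrink too even though alpha stays fixed, which is exactly the behavior described above.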

A widely used technique in gradient descent is to have a variable rather than a fixed learning rate. Initially, we can afford a large learning rate. But later on, we want to slow down as we approach a minima. An approach that implements this strategy is called Simulated annealing, or decaying learning rate. In this, the learning rate is decayed every fixed number of iterations.
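As a rough sketch of this decaying-learning-rate strategy, the snippet below shrinks a hypothetical learning rate by a fixed factor every fixed number of iterations on the same toy loss; the decay factor and interval are placeholder values, not recommendations.

```python
import numpy as np

def gradient(w):
    return 2 * w   # gradient of the same toy loss L(w) = w1^2 + w2^2

w = np.array([4.0, -3.0])
alpha = 0.4          # start with a relatively large learning rate
decay_factor = 0.5   # shrink the learning rate by this factor...
decay_every = 20     # ...after every fixed number of iterations

for step in range(100):
    if step > 0 and step % decay_every == 0:
        alpha *= decay_factor        # decayed ("annealed") learning rate
    w = w - alpha * gradient(w)

print(w, alpha)
```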

Challenges with Gradient Descent #1: Local Minima

Okay, so far, the tale of Gradient Descent seems to be a really happy one. Well. Let me spoil that for you. Remember when I said our loss function is very nice, and such loss functions don’t exist? They don’t.

First off, neural networks are complex—they involve many non-linear transformations within the hypothesis function. As a result, the loss function doesn’t resemble a smooth, bowl-shaped curve with a single minimum to reach. In fact, those well-behaved, upward-curving loss functions are known as convex functions. However, in the case of deep neural networks, the loss landscape is rarely convex and can be much more chaotic.

They may look like this.

A loss contour with local minima and a global minima

In the above image, there exist local minima where the gradient is zero. However, we know they do not correspond to the lowest loss we can achieve; that point is the global minima. Now, if you initialize your weights at point A, you’re gonna converge to the local minima, and there’s no way gradient descent will get you out of there once you’ve converged to it.

Gradient descent is driven by the gradient, which will be zero at the base of any minima. A local minima is called so since the value of the loss function is minimum at that point in a local region. A global minima is called so since the value of the loss function is minimum there globally, across the entire domain of the loss function.

To make things worse, the loss contours may be even more complicated, given that 3-D contours like the one we are considering never actually occur in practice. In practice, our neural network may have about, give or take, 1 billion weights, giving us a roughly (1 billion + 1) dimensional function. Visualizing such high-dimensional functions is no easy task. However, thanks to the ingenuity of the deep learning community, researchers have developed techniques to represent loss function contours in 3-D. One recent paper introduces a method called Filter Normalization; while the details are beyond the scope of this post, the approach offers valuable insights into the complex nature of loss landscapes in deep networks. For instance, the paper shows a 3-D visualization of the loss contour of a VGG-56 network trained on the CIFAR-10 dataset.

Challenges with Gradient Descent #2: Saddle Points

The basic lesson we took away regarding the limitation of gradient descent was that once it arrived at a region with gradient zero, it was almost impossible to escape it regardless of the quality of the minima. Another problem we face is saddle points, which look like this.

A Saddle Point

In the earlier pic, you can also see a saddle point where two “mountains” meet.

A saddle point gets its name from the saddle of a horse, which it resembles. While it’s a minima along one direction (x), it’s a local maxima along another direction (y). Because the surface around the saddle is nearly flat along the y-direction, the gradient there is close to zero, and GD keeps oscillating to and fro along the x-direction, giving us the illusion that we have converged to a minima.
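To see how gradient descent can stall at a saddle point, consider this tiny hypothetical example on the loss L(x, y) = x² − y², which has a saddle at the origin (a minima along x, a maxima along y); the starting point and learning rate are made up for illustration.

```python
import numpy as np

# Toy saddle-shaped loss: L(x, y) = x^2 - y^2, with a saddle point at (0, 0).
def gradient(w):
    x, y = w
    return np.array([2 * x, -2 * y])

w = np.array([3.0, 0.0])   # start exactly on the x-axis, where the y-gradient is zero
alpha = 0.1

for step in range(200):
    w = w - alpha * gradient(w)

print(w)   # ends up at ~(0, 0): gradient descent settles at the saddle point,
           # even though it is not a minima of the loss
```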

Randomness to the rescue

So, how do we escape local minima and saddle points while trying to converge to a global minima? The answer is randomness.

Until now, we were doing gradient descent with the loss function created by summing loss over all possible examples of the training set. If we get into a local minima or saddle point, we are stuck. A way to help GD escape these is to use Stochastic Gradient Descent.

In stochastic gradient descent, instead of taking a step by computing the gradient of the loss function created by summing all the loss functions, we take a step by computing the gradient of the loss of only one randomly sampled (without replacement) example. In contrast to Stochastic Gradient Descent, where each example is stochastically chosen, our earlier approach processed all examples in one single batch and, therefore, is known as Batch Gradient Descent.

The update rule is modified accordingly.

w ← w − α · ∇Lⱼ(w)

Update Rule for Stochastic Gradient Descent, where Lⱼ is the loss computed on a single randomly sampled example j

This means that at every step, we are taking the gradient of a loss function that is different from our actual loss function (which is a summation of the loss of every example). The gradient of this “one-example-loss” at a particular point may point in a slightly different direction than the gradient of the “all-example-loss.”

This also means that while the gradient of the “all-example-loss” may push us into a local minima or get us stuck at a saddle point, the gradient of a “one-example-loss” might point in a different direction and might help us steer clear of these.

One could also consider a point that is a local minima for the “all-example-loss.” If we do Batch Gradient Descent, we will get stuck here since the gradient will always point to the local minima. However, if we use Stochastic Gradient Descent, this point may not lie around a local minima in the loss contour of the “one-example-loss,” allowing us to move away from it.

Even if we get stuck in a minima for the “one-example-loss,” the loss landscape for the “one-example-loss” for the next randomly sampled data point might be different, allowing us to keep moving.

When it does converge, it converges to a point that is a minima for almost all the “one-example-losses.” It’s also been empirically shown that the saddle points are extremely unstable, and a slight nudge may be enough to escape one.
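Here is a minimal sketch of stochastic gradient descent on a made-up one-weight regression problem (fitting y ≈ w·x); the data, learning rate, and epoch count are placeholders, but the loop shows the one-example update and the per-epoch shuffle that implements sampling without replacement.

```python
import numpy as np

# Hypothetical one-dimensional fit: find w so that y ≈ w * x.
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.1, 3.9, 6.2, 7.8])

def grad_one_example(w, x, y):
    # Gradient of the one-example squared loss (w*x - y)^2 with respect to w
    return 2 * (w * x - y) * x

w = 0.0       # initial weight
alpha = 0.01  # learning rate

for epoch in range(50):
    # Sampling without replacement: shuffle the examples once per epoch
    for i in np.random.permutation(len(X)):
        w = w - alpha * grad_one_example(w, X[i], Y[i])

print(w)   # ends up close to ~2, the slope that fits the data
```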

So, does this mean that, in practice, we should always perform this one-example stochastic gradient descent?

Batch Size

The short answer is no. While stochastic gradient descent (SGD) may seem ideal in theory, it’s not always practical from a computational perspective. Using standard gradient descent with a loss function that sums all individual losses, we can compute those gradients in parallel—making it faster and more efficient. On the other hand, SGD requires calculating gradients one sample at a time, which has to be done sequentially and can significantly slow things down.

So, what we do is a balancing act. Instead of using the entire dataset or just a single example to construct our loss function, we use a fixed number of examples, say, 16, 32, or 128, to form what is called a mini-batch. The word is used in contrast with processing all the examples at once, which is generally called Batch Gradient Descent. The mini-batch size is chosen to ensure enough stochasticity to ward off local minima, while leveraging enough computational power from parallel processing.
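Below is a hedged sketch of mini-batch gradient descent on the same kind of toy regression problem; the dataset, batch size of 32, and learning rate are assumptions for illustration, and the per-batch gradient is vectorized with NumPy to stand in for the parallel computation discussed above.

```python
import numpy as np

# Made-up linear-fit problem, now updated on mini-batches of examples.
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
Y = 2.0 * X + 0.1 * rng.normal(size=1000)   # noisy targets with a true slope of 2

def grad_minibatch(w, x_batch, y_batch):
    # Average gradient of the squared loss over the whole mini-batch;
    # NumPy vectorization stands in for computing per-example gradients in parallel.
    return np.mean(2 * (w * x_batch - y_batch) * x_batch)

w = 0.0
alpha = 0.05
batch_size = 32   # the mini-batch size

for epoch in range(10):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        w = w - alpha * grad_minibatch(w, X[idx], Y[idx])

print(w)   # should be close to 2.0
```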

Local Minima Revisited: They are not as bad as you think

Before you write off local minima as a bad thing, it’s worth noting that recent research suggests otherwise. In the complex loss landscape of a neural network, there are countless minima—and many of these so-called local minima can perform just as well as the global minimum.

Now, why do we call some of them “good”? Because not all local minima are created equal. Some may be “bad,” arising from noisy or outlier training examples and leading to poor generalization. But others—often referred to in the literature as optimal local minima—can deliver excellent performance and are fairly common, especially given the high-dimensional nature of deep learning models.

It’s also important to consider that many neural networks are used for classification tasks. So, even if one local minimum produces confidence scores in the range of 0.7–0.8 and the global minimum yields 0.95–0.98 for the correct labels, the predicted class will still be the same. In practice, the difference may not matter much.

A desirable property of a minima is that it should be on the flatter side. Why? Because flat minima are easy to converge to, given there’s less chance of overshooting the minima and bouncing between its ridges.

More importantly, we expect the loss surface of the test set to be slightly different from that of the training set on which we do our training. Due to this shift, the loss won’t change much for flat and wide minima, but this is not the case for narrow minima. The point we are trying to make is that flatter minima generalize better and are thus desirable.


Learning Rate Revisited

Recently, there has been a surge in research on learning rate scheduling to account for sub-optimal minima in the loss landscape. Even with a decaying learning rate, one can get stuck in a local minima. Traditionally, training is either run for a fixed number of iterations, or it is stopped once the loss hasn’t improved for, say, 10 iterations. This has been called early stopping in the literature.

A fast learning rate also helps us scoot over local minima earlier in training.

People have also combined early stopping with learning rate decay, where the learning rate is decayed every time the loss fails to improve for, say, 10 iterations, and training eventually stops once the rate falls below a decided threshold.

In recent years, cyclic learning rates have become popular. In this method, the learning rate slowly increases and then decreases, which is continued cyclically.

‘Triangular’ and ‘Triangular2’ methods for cycling learning rate proposed by Leslie N. Smith. On the left plot, min and max lr are kept the same. On the right, the difference is cut in half after each cycle. Image Credits: Hafidz Zulkifli
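A minimal sketch of the ‘Triangular’ schedule described above might look like the following; the base learning rate, maximum learning rate, and step size are placeholder values, not recommendations.

```python
import numpy as np

def triangular_lr(step, step_size=50, base_lr=0.001, max_lr=0.006):
    """Triangular cyclic learning rate (placeholder hyperparameters).
    The rate ramps linearly from base_lr up to max_lr and back down,
    over a full cycle of 2 * step_size iterations."""
    cycle = np.floor(1 + step / (2 * step_size))
    x = np.abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

# Inspect the schedule over a few cycles
for step in range(0, 301, 25):
    print(step, round(triangular_lr(step), 5))
```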

Stochastic gradient descent with warm restarts anneals the learning rate to a lower bound and then restores the learning rate to its original value.

We also have different schedules for how the learning rate declines, from exponential to cosine decay.

Cosine Annealing combined with restarts
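As a simplified sketch (ignoring the cycle-length multipliers used in the warm-restarts approach), a cosine-annealed learning rate with restarts can be computed like this; all the numbers are placeholders.

```python
import math

def cosine_annealing_lr(step, cycle_len=100, lr_min=0.001, lr_max=0.1):
    # Cosine decay from lr_max down to lr_min within each cycle; the learning
    # rate is "warm restarted" back to lr_max whenever a new cycle begins.
    t = step % cycle_len
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_len))

for step in (0, 25, 50, 99, 100, 150):
    print(step, round(cosine_annealing_lr(step), 4))
```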

A very recent paper introduces a technique called Stochastic Weight Averaging. The authors develop an approach where they first converge to a minima, cache the weights, and then restore the learning rate to a higher value. This higher learning rate propels the algorithm out of the minima to another point in the loss surface. Then the algorithm is made to converge again to another minima. This is repeated a few times. Finally, they average all the sets of cached weights and use this averaged set of weights to make predictions.

A technique called Stochastic Weight Averaging
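The snippet below is only a loose sketch of that idea on the toy loss from earlier: converge with a decaying learning rate, cache the weights, nudge the weights to mimic the jump caused by restoring a high learning rate (a crude stand-in for the real method), repeat, and finally average the cached weights. Every value here is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient(w):
    return 2 * w   # toy loss L(w) = w1^2 + w2^2 again

cached_weights = []
w = np.array([4.0, -3.0])

for cycle in range(3):
    alpha = 0.4                       # restore a high learning rate at each restart
    for step in range(50):
        alpha *= 0.9                  # decay within the cycle to converge
        w = w - alpha * gradient(w)
    cached_weights.append(w.copy())   # cache the weights at this minima
    # Crude stand-in for the "kick" the restored high learning rate provides:
    w = w + rng.normal(scale=2.0, size=2)

w_swa = np.mean(cached_weights, axis=0)   # average the cached sets of weights
print(w_swa)
```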

Conclusion

This brings us to the end of our introductory post on gradient descent—a foundational technique that has powered deep learning optimization since the breakthrough paper on backpropagation, which demonstrated how neural networks could be trained using gradients. That said, there’s one important aspect we haven’t explored yet: the challenge of pathological curvature. To address this issue, several improvements over vanilla Stochastic Gradient Descent—such as Momentum, RMSProp, and Adam—have been developed.

We highly recommend going through the ‘Further Reading’ segment to deepen your understanding and explore more advanced concepts related to gradient descent and neural network optimization.

Further Reading

  1. Visualizing the Loss Landscape of Neural Nets (Paper)
  2. Stochastic Weight Averaging (Paper)
  3. Gradient Descent Optimization Machine Learning
  4. A Review of Popular Deep Learning Architectures: AlexNet, VGG16, and GoogleNet
  5. Automatic Hyperparameter Optimization With Keras Tuner
