Gradient Descent
> [!NOTE]
> Gradient Descent is the engine that powers almost all modern Machine Learning algorithms. When an algorithm “learns,” it is actually just finding a set of parameters (weights and biases) that minimize a specific cost function. Gradient Descent is how we find that minimum.
1. Intuition: Walking Down a Mountain
Imagine you are blindfolded near the top of a mountain, and your goal is to reach the lowest valley. Because you can’t see the whole landscape, you use your feet to feel the slope of the ground right where you stand.
You take a step in the direction that slopes downward the steepest. After that step, you feel the ground again, find the new steepest downward direction, and take another step. If you keep doing this, you will eventually reach the bottom of the valley (a local or global minimum).
This is exactly how Gradient Descent works.
- The “landscape” is the Cost Function J(θ).
- Your “coordinates” on the mountain are the model’s parameters (weights θ).
- The “slope” you feel with your feet is the Gradient (the partial derivatives of the cost function with respect to each parameter).
- The size of the “step” you take is determined by the Learning Rate α.
2. Interactive Analysis: Finding the Minimum
Let’s visualize a simple 1D convex cost function. The goal is to start at an initial weight and iteratively take steps down the gradient until we reach the minimum cost.
*(Interactive demo: adjust the learning rate and step through iterations to see how the parameter updates based on the gradient. The demo starts at a cost of J = 16.00.)*
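To make one update concrete (assuming the demo’s cost is J(θ) = θ² with a starting point of θ = 4, which matches the initial cost of 16.00): with α = 0.1, the gradient at θ = 4 is 2 × 4 = 8, so

θ := 4 − 0.1 × 8 = 3.2, giving J = 3.2² = 10.24.

Repeating from θ = 3.2: the gradient is 6.4, so θ := 3.2 − 0.64 = 2.56, giving J ≈ 6.55. Each step shrinks both the gradient and the cost, which is why the steps automatically get smaller as we approach the minimum.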
3. The Rigor: The Update Rule
The mathematical update rule for Gradient Descent is incredibly simple yet powerful:
θj := θj - α × (∂ / ∂θj) J(θ0, θ1, …, θn)
Let’s break this down:
- θj: The current value of the parameter we are updating (e.g., the slope or the intercept). The `:=` means assignment (we are updating the value).
- α: The Learning Rate. This determines how large a step we take.
  - If α is too small, gradient descent will be slow but accurate.
  - If α is too large, it might overshoot the minimum, bounce back and forth, and actually diverge (cost increases).
- (∂ / ∂θj) J(θ): This is the partial derivative of the cost function with respect to the specific parameter θj. This represents the slope of the cost function along that parameter’s axis.
> [!IMPORTANT]
> A crucial implementation detail: all parameters θj must be updated simultaneously. You calculate all the new values first using the old parameters, and then assign all the new values at once.
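A minimal sketch of the simultaneous update for single-feature linear regression (the model h(x) = θ0 + θ1·x and the toy data below are illustrative, not from the original):

```python
import numpy as np

def gradients(theta0, theta1, x, y):
    errors = (theta0 + theta1 * x) - y
    d0 = errors.mean()        # ∂J/∂θ0 for J = (1/2m) Σ errors²
    d1 = (errors * x).mean()  # ∂J/∂θ1
    return d0, d1

def step(theta0, theta1, x, y, alpha):
    # Compute BOTH gradients from the OLD parameters first...
    d0, d1 = gradients(theta0, theta1, x, y)
    # ...then assign both new values at once. Updating theta0 first and
    # reusing the new value inside theta1's gradient would be the subtle
    # bug that the simultaneous-update rule prevents.
    return theta0 - alpha * d0, theta1 - alpha * d1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0  # points on the line θ0 = 1, θ1 = 2
t0, t1 = 0.0, 0.0
for _ in range(5000):
    t0, t1 = step(t0, t1, x, y, alpha=0.1)
```

After enough iterations the parameters settle at the line that generated the data (θ0 ≈ 1, θ1 ≈ 2).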
4. Flavors of Gradient Descent
The cost function J(θ) is typically an average of the errors across the training dataset. How much data we use to calculate the gradient before taking a step defines the “flavor” of gradient descent.
- Batch Gradient Descent: We calculate the error for every single example in the entire training dataset, average them, and then take one step.
- Pros: Guaranteed to converge smoothly.
- Cons: Very slow if the dataset is large (e.g., millions of rows). You have to process everything just to take one tiny step.
- Stochastic Gradient Descent (SGD): We calculate the error for exactly one randomly chosen training example and take a step immediately based only on that one example.
- Pros: Extremely fast updates. Can escape shallow local minima because the steps are noisy and erratic.
- Cons: The path to the minimum is jagged and chaotic. It never truly settles; it constantly bounces around the minimum.
- Mini-Batch Gradient Descent: The Goldilocks approach. We calculate the error over a small subset of the data (e.g., 32, 64, or 128 examples) and take a step.
- Pros: Faster than Batch GD, more stable than SGD. It allows for vectorized operations (matrix math) on GPUs, making it the standard algorithm used in Deep Learning.
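The three flavors differ only in how many examples feed each gradient computation. A sketch on a toy one-parameter model (the data and the true weight w = 3 are made-up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=1000)
Y = 3.0 * X  # noiseless data from the model y = 3x

def grad(w, xb, yb):
    # Gradient of MSE over the given batch: d/dw mean((w*x - y)²)
    return 2.0 * np.mean((w * xb - yb) * xb)

def step(w, xb, yb, alpha=0.1):
    return w - alpha * grad(w, xb, yb)

w = 0.0

# Batch GD: one step uses the entire dataset.
w_batch = step(w, X, Y)

# SGD: one step uses a single randomly chosen example.
i = rng.integers(len(X))
w_sgd = step(w, X[i:i+1], Y[i:i+1])

# Mini-batch GD: one step uses a small random subset (here 32 examples).
idx = rng.choice(len(X), size=32, replace=False)
w_mini = step(w, X[idx], Y[idx])

# A few hundred mini-batch steps are enough to recover w ≈ 3 here.
for _ in range(500):
    idx = rng.choice(len(X), size=32, replace=False)
    w = step(w, X[idx], Y[idx])
```

Note that the mini-batch gradient is a single vectorized `np.mean` over the subset, which is exactly the property that lets GPUs process it efficiently.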
5. Implementation Example
Let’s implement a simple Batch Gradient Descent from scratch in Python to find the minimum of y = x².
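The code block itself is missing from this copy; a minimal version of that example might look like the following (the starting point, learning rate, and iteration count are arbitrary choices):

```python
# Gradient descent on J(x) = x², whose derivative is J'(x) = 2x.
def gradient_descent(x_start=4.0, alpha=0.1, iterations=50):
    x = x_start
    for _ in range(iterations):
        grad = 2 * x          # slope of x² at the current point
        x = x - alpha * grad  # step in the opposite direction of the gradient
    return x

x_min = gradient_descent()
print(x_min)  # very close to 0, the true minimum of x²
```

Each update multiplies x by (1 − 2α), so with α = 0.1 the value shrinks geometrically toward 0; with α > 1 the same formula would diverge, matching the learning-rate warning above.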
6. Summary
| Concept | Description |
|---|---|
| Cost Function | Measures how bad the model’s predictions are (e.g., MSE). |
| Gradient | The direction of the steepest ascent. We move in the opposite direction (negative gradient). |
| Learning Rate | The step size. Too small = slow. Too big = divergence. |
| Convexity | A function like a bowl. It has only one global minimum. Linear/Logistic Regression cost functions are convex. Neural Networks are highly non-convex (many valleys). |