Gradient Descent
> [!NOTE]
> Gradient Descent is the engine that powers almost all modern Machine Learning algorithms. When an algorithm “learns,” it is actually just finding a set of parameters (weights and biases) that minimize a specific cost function. Gradient Descent is how we find that minimum.
1. Intuition: Walking Down a Mountain
Imagine you are blindfolded near the top of a mountain, and your goal is to reach the lowest valley. Because you can’t see the whole landscape, you use your feet to feel the slope of the ground right where you stand.
You take a step in the direction that slopes downward the steepest. After that step, you feel the ground again, find the new steepest downward direction, and take another step. If you keep doing this, you will eventually reach the bottom of the valley (a local or global minimum).
This is exactly how Gradient Descent works.
- The “landscape” is the Cost Function J(θ).
- Your “coordinates” on the mountain are the model’s parameters (weights θ).
- The “slope” you feel with your feet is the Gradient (the partial derivatives of the cost function with respect to each parameter).
- The size of the “step” you take is determined by the Learning Rate α.
2. Interactive Analysis: Finding the Minimum
Let’s visualize a simple 1D convex cost function. The goal is to start at an initial weight and iteratively take steps down the gradient until we reach the minimum cost.
*(Interactive demo: adjust the learning rate and step through iterations to see how the parameter updates based on the gradient. The demo starts at a cost of J = 16.00.)*
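To make one update concrete (assuming the demo’s cost is J(θ) = θ² with a starting point of θ = 4, which matches the initial cost of 16.00): with α = 0.1, the gradient at θ = 4 is 2 × 4 = 8, so

θ := 4 − 0.1 × 8 = 3.2, giving J = 3.2² = 10.24.

Repeating from θ = 3.2: the gradient is 6.4, so θ := 3.2 − 0.64 = 2.56, giving J ≈ 6.55. Each step shrinks both the gradient and the cost, which is why the steps automatically get smaller as we approach the minimum.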
3. The Rigor: The Update Rule
The mathematical update rule for Gradient Descent is incredibly simple yet powerful:
θj := θj - α × (∂ / ∂θj) J(θ0, θ1, …, θn)
Let’s break this down:
- θj: The current value of the parameter we are updating (e.g., the slope or the intercept). The `:=` means assignment (we are updating the value).
- α: The Learning Rate. This determines how large a step we take.
  - If α is too small, gradient descent will be slow but accurate.
  - If α is too large, it might overshoot the minimum, bounce back and forth, and actually diverge (cost increases).
- (∂ / ∂θj) J(θ): This is the partial derivative of the cost function with respect to the specific parameter θj. This represents the slope of the cost function along that parameter’s axis.
> [!IMPORTANT]
> A crucial implementation detail: all parameters θj must be updated simultaneously. You calculate all the new values first using the old parameters, and then assign all the new values at once.
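A minimal sketch of the simultaneous update for single-feature linear regression (the model h(x) = θ0 + θ1·x and the toy data below are illustrative, not from the original):

```python
import numpy as np

def gradients(theta0, theta1, x, y):
    errors = (theta0 + theta1 * x) - y
    d0 = errors.mean()        # ∂J/∂θ0 for J = (1/2m) Σ errors²
    d1 = (errors * x).mean()  # ∂J/∂θ1
    return d0, d1

def step(theta0, theta1, x, y, alpha):
    # Compute BOTH gradients from the OLD parameters first...
    d0, d1 = gradients(theta0, theta1, x, y)
    # ...then assign both new values at once. Updating theta0 first and
    # reusing the new value inside theta1's gradient would be the subtle
    # bug that the simultaneous-update rule prevents.
    return theta0 - alpha * d0, theta1 - alpha * d1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0  # points on the line θ0 = 1, θ1 = 2
t0, t1 = 0.0, 0.0
for _ in range(5000):
    t0, t1 = step(t0, t1, x, y, alpha=0.1)
```

After enough iterations the parameters settle at the line that generated the data (θ0 ≈ 1, θ1 ≈ 2).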
4. Flavors of Gradient Descent
The cost function J(θ) is typically an average of the errors across the training dataset. How much data we use to calculate the gradient before taking a step defines the “flavor” of gradient descent.
- Batch Gradient Descent: We calculate the error for every single example in the entire training dataset, average them, and then take one step.
- Pros: Guaranteed to converge smoothly.
- Cons: Very slow if the dataset is large (e.g., millions of rows). You have to process everything just to take one tiny step.
- Stochastic Gradient Descent (SGD): We calculate the error for exactly one randomly chosen training example and take a step immediately based only on that one example.
- Pros: Extremely fast updates. Can escape shallow local minima because the steps are noisy and erratic.
- Cons: The path to the minimum is jagged and chaotic. It never truly settles; it constantly bounces around the minimum.
- Mini-Batch Gradient Descent: The Goldilocks approach. We calculate the error over a small subset of the data (e.g., 32, 64, or 128 examples) and take a step.
- Pros: Faster than Batch GD, more stable than SGD. It allows for vectorized operations (matrix math) on GPUs, making it the standard algorithm used in Deep Learning.
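The three flavors differ only in how many examples feed each gradient computation. A sketch on a toy one-parameter model (the data and the true weight w = 3 are made-up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=1000)
Y = 3.0 * X  # noiseless data from the model y = 3x

def grad(w, xb, yb):
    # Gradient of MSE over the given batch: d/dw mean((w*x - y)²)
    return 2.0 * np.mean((w * xb - yb) * xb)

def step(w, xb, yb, alpha=0.1):
    return w - alpha * grad(w, xb, yb)

w = 0.0

# Batch GD: one step uses the entire dataset.
w_batch = step(w, X, Y)

# SGD: one step uses a single randomly chosen example.
i = rng.integers(len(X))
w_sgd = step(w, X[i:i+1], Y[i:i+1])

# Mini-batch GD: one step uses a small random subset (here 32 examples).
idx = rng.choice(len(X), size=32, replace=False)
w_mini = step(w, X[idx], Y[idx])

# A few hundred mini-batch steps are enough to recover w ≈ 3 here.
for _ in range(500):
    idx = rng.choice(len(X), size=32, replace=False)
    w = step(w, X[idx], Y[idx])
```

Note that the mini-batch gradient is a single vectorized `np.mean` over the subset, which is exactly the property that lets GPUs process it efficiently.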
5. Implementation Example
Let’s implement a simple Batch Gradient Descent from scratch in Python to find the minimum of y = x².
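The code block itself is missing from this copy; a minimal version of that example might look like the following (the starting point, learning rate, and iteration count are arbitrary choices):

```python
# Gradient descent on J(x) = x², whose derivative is J'(x) = 2x.
def gradient_descent(x_start=4.0, alpha=0.1, iterations=50):
    x = x_start
    for _ in range(iterations):
        grad = 2 * x          # slope of x² at the current point
        x = x - alpha * grad  # step in the opposite direction of the gradient
    return x

x_min = gradient_descent()
print(x_min)  # very close to 0, the true minimum of x²
```

Each update multiplies x by (1 − 2α), so with α = 0.1 the value shrinks geometrically toward 0; with α > 1 the same formula would diverge, matching the learning-rate warning above.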
6. Summary
| Concept | Description |
|---|---|
| Cost Function | Measures how bad the model’s predictions are (e.g., MSE). |
| Gradient | The direction of the steepest ascent. We move in the opposite direction (negative gradient). |
| Learning Rate | The step size. Too small = slow. Too big = divergence. |
| Convexity | A function like a bowl. It has only one global minimum. Linear/Logistic Regression cost functions are convex. Neural Networks are highly non-convex (many valleys). |