From Theorem to Inference
Bayes’ Theorem isn’t just for single events (like “do I have a disease?”). It is a general framework for learning parameters from data.
In Machine Learning, we often want to estimate a parameter θ (theta), such as the probability a user clicks an ad, or the mean of a normal distribution. Bayesian Inference treats θ not as a fixed number, but as a random variable with its own distribution.
1. Pillar 1: Intuition (The Coin Flip)
Imagine you find a coin on the street. You want to know if it’s fair (θ = 0.5) or rigged.
- Frequentist: You flip it 10 times. You get 10 Heads. You conclude θ = 1.0 (it will always land Heads).
- Bayesian: You flip it 10 times. You get 10 Heads. You conclude θ ≈ 0.8. Why? Because your Prior belief (“most coins are fair”) pulls the estimate away from the extreme, even in the face of strong evidence. (The exact value depends on how strong your prior is.)
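The contrast above can be reproduced with a conjugate Beta-Binomial update. This is a minimal sketch: the Beta(2, 2) prior is an assumption (a mild belief that the coin is fair), and the exact Bayesian estimate depends on how strong a prior you choose.

```python
def posterior_mean(a, b, heads, tails):
    """Posterior mean of theta under a Beta(a, b) prior after observing the flips."""
    return (a + heads) / (a + b + heads + tails)

# 10 flips, 10 Heads
mle = 10 / 10                        # Frequentist (MLE): just the raw frequency
bayes = posterior_mean(2, 2, 10, 0)  # Bayesian: prior pseudo-counts temper the estimate
print(f"MLE: {mle:.2f}, Bayesian posterior mean: {bayes:.2f}")
```

With this mild prior the estimate lands around 0.86 rather than 1.0; a stronger prior (e.g. Beta(5, 5)) would pull it further toward 0.5.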
2. The Bayesian Update Cycle
- **Prior P(θ)**: Our belief about θ before seeing data. (e.g., “The coin is probably fair”).
- **Likelihood P(Data|θ)**: How likely is the data we saw, for a specific value of θ?
- **Posterior P(θ|Data)**: Our new belief about θ after seeing the data.
3. Interactive: The Coin Flip Learner
Let’s estimate the bias of a coin (θ). θ = 0.5 means fair. θ = 1 means always Heads.
- Prior: We start with a flat line (a Uniform distribution). We have no idea whether the coin is fair or rigged.
- Action: Flip the coin.
- Result: Watch how the curve (Posterior) changes.
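The learner described above can be simulated in a few lines. This is a sketch assuming a uniform Beta(1, 1) prior and a hypothetical coin with true bias 0.7; each flip simply bumps one of the two Beta counts.

```python
import random

random.seed(42)
true_theta = 0.7   # hidden bias of the hypothetical coin (an assumption for this demo)
a, b = 1, 1        # Beta(1, 1) prior: the flat line we start from

for flip in range(20):
    if random.random() < true_theta:
        a += 1     # each Head bumps the alpha count
    else:
        b += 1     # each Tail bumps the beta count
    # After each flip the posterior is Beta(a, b); the curve sharpens as counts grow

print(f"After 20 flips: posterior Beta({a}, {b}), mean theta = {a / (a + b):.2f}")
```

Rerunning with a different seed changes the counts, but the posterior mean drifts toward the true bias as flips accumulate.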
4. MAP vs MLE
In classical statistics (Frequentist), we use Maximum Likelihood Estimation (MLE). We ask: “What value of θ maximizes the probability of the data?”
In Bayesian statistics, we use Maximum A Posteriori (MAP). We ask: “What is the most likely value of θ given the data and our prior?”
Comparison
| Method | Formula | Philosophy |
|---|---|---|
| MLE | Maximize P(Data\|θ) | Data is king. Prior is irrelevant. |
| MAP | Maximize P(Data\|θ) × P(θ) | Data updates beliefs. Prior matters for small data. |
> [!IMPORTANT]
> As the amount of data approaches infinity, the Likelihood overwhelms the Prior. Therefore, MAP converges to MLE given enough data.
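This convergence is easy to check numerically. A sketch assuming a Beta(2, 2) prior: the MAP estimate below is the mode of the Beta(a + k, b + n − k) posterior, compared against the MLE as the sample grows with a fixed 75% Heads ratio.

```python
def mle(k, n):
    """Maximum Likelihood estimate: the raw frequency of Heads."""
    return k / n

def map_estimate(k, n, a=2, b=2):
    """MAP estimate under a Beta(a, b) prior: mode of the Beta(a+k, b+n-k) posterior."""
    return (a + k - 1) / (a + b + n - 2)

for n in (4, 40, 4000):
    k = int(0.75 * n)  # keep the observed Heads ratio fixed at 75%
    print(f"n={n}: MLE={mle(k, n):.3f}, MAP={map_estimate(k, n):.3f}")
```

At n = 4 the MAP (≈ 0.667) is visibly pulled toward the prior; by n = 4000 it matches the MLE of 0.750 to three decimals.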
5. Pillar 2: Rigor (Convergence)
As we collect more data (N → ∞), the Posterior distribution becomes narrower and narrower, concentrating around the true value of θ. This is formally known as Bayesian Consistency.
- Small N: The Prior dominates. The curve is wide (High Uncertainty).
- Large N: The Likelihood dominates. The curve is a sharp spike (Low Uncertainty).
This explains why Bayesian methods are most useful in low-data regimes (like rare diseases or new user signups).
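The narrowing can be seen directly in the standard deviation of a Beta posterior. This sketch assumes a uniform Beta(1, 1) prior and a fixed observed Heads rate of 60% (both arbitrary choices for the demo).

```python
import math

def beta_std(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Width of the posterior after observing 60% Heads in N flips
for n in (10, 100, 1000):
    k = int(0.6 * n)
    print(f"N={n}: posterior std = {beta_std(1 + k, 1 + n - k):.4f}")
```

The standard deviation shrinks roughly like 1/√N: wide uncertainty for small samples, a sharp spike for large ones.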
6. Pillar 3: Hardware Reality (The Machine)
MAP is L2 Regularization (Weight Decay)
In Deep Learning, we often add a penalty term to our loss function to prevent overfitting. This is called L2 Regularization or Weight Decay.
Loss = Likelihood_Loss + λ * ||Weights||^2
It turns out, this is exactly equivalent to Bayesian MAP estimation with a Gaussian Prior!
- Assume our weights w come from a Gaussian Prior: P(w) ∝ exp(-w^2)
- We want to maximize: Posterior ∝ Likelihood × Prior
- Take the log: log(Posterior) = log(Likelihood) + log(Prior), where log(Prior) = log(exp(-w^2)) = -w^2
- So maximizing the Posterior is equivalent to maximizing: log(Likelihood) - w^2
- Or, equivalently, minimizing: Negative_Log_Likelihood + w^2
Conclusion: When you use Weight Decay in PyTorch/TensorFlow, you are secretly being a Bayesian. You are injecting a “Prior” belief that weights should be small.
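The equivalence can be verified numerically for linear regression, where the L2-regularized (ridge) solution has a closed form. This is a sketch with synthetic data; the regularization strength `lam`, the data shapes, and the noise scale are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=50)

lam = 1.0  # regularization strength (plays the role of the prior's precision)

# Ridge solution: minimizes ||y - Xw||^2 + lam * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Check: the gradient of the regularized loss vanishes at w_ridge,
# i.e. w_ridge is exactly the stationary (MAP) point of
# Negative_Log_Likelihood + lam * ||w||^2 under a Gaussian prior on w.
grad = -2 * X.T @ (y - X @ w_ridge) + 2 * lam * w_ridge
print(np.allclose(grad, 0))  # True
```

The same argument carries over to neural networks: weight decay adds the 2·λ·w term to every weight's gradient, which is precisely the gradient of the Gaussian log-prior.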
7. Python Implementation: Grid Approximation
Sometimes we cannot calculate the posterior analytically. A simple way to visualize it is Grid Approximation: divide the possible values of θ into a grid and calculate the posterior for each point.
```python
import numpy as np

def grid_approximation(prior_grid, likelihood_grid):
    """
    Calculate the posterior using grid approximation.
    prior_grid: Array of prior probabilities for each theta
    likelihood_grid: Array of likelihoods P(Data|theta)
    """
    # 1. Multiply Prior * Likelihood
    unnormalized_posterior = prior_grid * likelihood_grid
    # 2. Normalize (divide by Evidence P(Data))
    evidence = np.sum(unnormalized_posterior)
    posterior = unnormalized_posterior / evidence
    return posterior

# Example:
# We define 100 evenly spaced values for theta between 0 and 1
thetas = np.linspace(0, 1, 100)

# Uniform Prior: All values equally likely (1/100)
prior = np.ones(100) / 100

# We observed 3 Heads in 4 tosses (H, H, T, H)
# Likelihood for Binomial(k=3, n=4) is proportional to theta^3 * (1-theta)^1
# We omit the combinatorial constant as it cancels out during normalization
likelihood = (thetas**3) * ((1 - thetas)**1)

posterior = grid_approximation(prior, likelihood)

# Find the MAP estimate (theta with the highest posterior probability)
map_idx = np.argmax(posterior)
print(f"MAP Estimate for Theta: {thetas[map_idx]:.2f}")
```
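One advantage of keeping the whole posterior grid, rather than just the MAP point, is that uncertainty summaries fall out directly. This sketch computes a 95% central credible interval for the same 3-Heads-in-4-tosses data (rebuilt inline so the snippet is self-contained).

```python
import numpy as np

thetas = np.linspace(0, 1, 100)
posterior = thetas**3 * (1 - thetas)  # unnormalized; the uniform prior cancels
posterior /= posterior.sum()

# 95% central credible interval from the discrete cumulative distribution
cdf = np.cumsum(posterior)
low = thetas[np.searchsorted(cdf, 0.025)]
high = thetas[np.searchsorted(cdf, 0.975)]
print(f"95% credible interval for theta: [{low:.2f}, {high:.2f}]")
```

Unlike a frequentist confidence interval, this reads directly as "there is a 95% probability that θ lies in this range, given the data and the prior."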
8. Summary
- Inference is the process of estimating unknown parameters from data.
- The Posterior is a compromise between the Prior (what we thought) and the Likelihood (what we saw).
- MAP estimates include prior knowledge, making them robust for small datasets (regularization).
- Deep Learning Connection: L2 Regularization is mathematically equivalent to placing a Gaussian Prior on your neural network weights.