From Theorem to Inference

Bayes’ Theorem isn’t just for single events (like “do I have a disease?”). It is a general framework for learning parameters from data.

In Machine Learning, we often want to estimate a parameter θ (theta), such as the probability a user clicks an ad, or the mean of a normal distribution. Bayesian Inference treats θ not as a fixed number, but as a random variable with its own distribution.

1. Pillar 1: Intuition (The Coin Flip)

Imagine you find a coin on the street. You want to know whether it is fair (θ = 0.5) or rigged.

  • Frequentist: You flip it 10 times. You get 10 Heads. You conclude θ = 1.0 (It will always be heads).
  • Bayesian: You flip it 10 times. You get 10 Heads. You conclude θ ≈ 0.8. Why? Because your Prior belief (“Most coins are fair”) pulls the estimate away from the extreme, even with strong evidence.
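The two estimates above can be computed directly. The sketch below uses a Beta(3, 3) prior as an illustrative assumption: a mild "most coins are fair" belief centered at θ = 0.5, which happens to land the posterior mean near the ≈ 0.8 figure quoted above. (The Beta prior is conjugate to the Binomial likelihood, so the posterior has a closed form.)

```python
# MLE vs Bayesian posterior mean for 10 Heads in 10 flips.
heads, flips = 10, 10
tails = flips - heads

# Frequentist MLE: the observed fraction of Heads
mle = heads / flips  # 1.0

# Bayesian: a Beta(a, b) prior updated by Binomial data gives a
# Beta(a + heads, b + tails) posterior, whose mean is (a + heads) / (a + b + flips)
a, b = 3, 3  # illustrative "most coins are fair" prior
posterior_mean = (a + heads) / (a + b + flips)

print(f"MLE estimate:   {mle:.2f}")             # 1.00
print(f"Posterior mean: {posterior_mean:.2f}")  # 0.81
```

A stronger prior (e.g. Beta(50, 50)) would pull the estimate even closer to 0.5; a flat Beta(1, 1) prior would let it drift toward the MLE.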

2. The Bayesian Update Cycle

P(θ|Data) ∝ P(Data|θ) × P(θ)
  1. Prior P(θ): Our belief about θ before seeing data. (e.g., “The coin is probably fair”).
  2. Likelihood P(Data|θ): How likely is the data we saw, for a specific value of θ?
  3. Posterior P(θ|Data): Our new belief about θ after seeing the data.
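The cycle can be run by hand on a tiny discrete example. Suppose there are only two hypotheses for θ, a fair coin and a trick coin that always lands Heads (the prior weights here are illustrative assumptions), and we observe a single Head:

```python
# One Bayesian update over two discrete hypotheses for theta.
prior = {"fair (theta=0.5)": 0.9, "trick (theta=1.0)": 0.1}  # assumed prior
likelihood_heads = {"fair (theta=0.5)": 0.5, "trick (theta=1.0)": 1.0}

# Posterior ∝ Likelihood × Prior, then normalize over all hypotheses
unnormalized = {h: likelihood_heads[h] * prior[h] for h in prior}
evidence = sum(unnormalized.values())
posterior = {h: p / evidence for h, p in unnormalized.items()}

print(posterior)  # fair: ~0.82, trick: ~0.18
```

One Head barely moves us: the fair-coin hypothesis drops from 0.90 to about 0.82. Each additional Head would shift more weight toward the trick coin.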

3. Interactive: The Coin Flip Learner

Let’s estimate the bias of a coin (θ). θ = 0.5 means fair. θ = 1 means always Heads.

  • Prior: We start with a flat line (a Uniform Distribution). We have no idea whether the coin is fair or rigged.
  • Action: Flip the coin.
  • Result: Watch how the curve (Posterior) changes.

[Interactive widget: live plot of the Posterior Distribution P(θ|Data), with a running tally of Heads and Tails. The θ axis runs from 0 (Always Tails) through 0.5 (Fair) to 1 (Always Heads).]
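The learner above can be simulated in a few lines. This sketch assumes a hidden coin bias of 0.7 and a Uniform prior, which is the Beta(1, 1) distribution, so each flip is a one-line conjugate update:

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = 0.7  # hidden bias of the coin (assumption for the demo)

# Uniform prior = Beta(1, 1); 'a' counts Heads + 1, 'b' counts Tails + 1
a, b = 1, 1
for _ in range(50):
    heads = rng.random() < true_theta
    a += int(heads)
    b += int(not heads)

# Posterior mean and standard deviation of Beta(a, b)
mean = a / (a + b)
std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
print(f"Posterior after 50 flips: mean={mean:.2f}, std={std:.2f}")
```

After 50 flips the posterior mean sits near the true bias and the distribution has visibly narrowed, which is exactly what the interactive curve shows.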

4. MAP vs MLE

In classical statistics (Frequentist), we use Maximum Likelihood Estimation (MLE). We ask: “What value of θ maximizes the probability of the data?”

In Bayesian statistics, we use Maximum A Posteriori (MAP). We ask: “What is the most likely value of θ given the data and our prior?”

Comparison

| Method | Formula | Philosophy |
| --- | --- | --- |
| MLE | Maximize P(Data\|θ) | Data is king; the prior is irrelevant. |
| MAP | Maximize P(Data\|θ) × P(θ) | Data updates beliefs; the prior matters for small data. |

[!IMPORTANT] As the amount of data approaches infinity, the Likelihood overwhelms the Prior. Therefore, MAP converges to MLE given enough data.
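The convergence is easy to see numerically. This sketch assumes a Beta(5, 5) prior and data that stays at exactly 70% Heads at every sample size; the MAP of a Beta(a, b) posterior has the closed form (a + heads − 1) / (a + b + n − 2):

```python
# MAP converges to MLE as N grows (Beta(5, 5) prior is an assumption).
a, b = 5, 5
for n in [10, 100, 10_000]:
    heads = 7 * n // 10  # data fixed at a 70% Heads rate
    mle = heads / n
    # Closed-form MAP (mode) of the Beta(a + heads, b + tails) posterior
    map_est = (a + heads - 1) / (a + b + n - 2)
    print(f"N={n:>6}: MLE={mle:.3f}  MAP={map_est:.3f}")
```

At N = 10 the prior drags the MAP down to about 0.61; by N = 10,000 MAP and MLE agree to three decimal places.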

5. Pillar 2: Rigor (Convergence)

As we collect more data (N → ∞), the Posterior distribution becomes narrower and narrower. This is formally known as Bayesian Consistency.

  • Small N: The Prior dominates. The curve is wide (High Uncertainty).
  • Large N: The Likelihood dominates. The curve is a sharp spike (Low Uncertainty).

This explains why Bayesian methods are most useful in low-data regimes (like rare diseases or new user signups).
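The narrowing itself follows a roughly 1/√N law. A minimal check, assuming a Uniform Beta(1, 1) prior and a fixed 60% Heads rate (both illustrative), computes the posterior standard deviation at increasing sample sizes:

```python
import numpy as np

# Posterior width shrinks roughly like 1/sqrt(N).
stds = []
for n in [10, 100, 1000]:
    heads = 6 * n // 10  # data fixed at a 60% Heads rate
    a, b = 1 + heads, 1 + (n - heads)  # Beta posterior from Uniform prior
    # Standard deviation of the Beta(a, b) distribution
    std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    stds.append(std)
    print(f"N={n:>5}: posterior std = {std:.4f}")
```

Each tenfold increase in data shrinks the posterior width by roughly √10 ≈ 3.2×.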

6. Pillar 3: Hardware Reality (The Machine)

MAP is L2 Regularization (Weight Decay)

In Deep Learning, we often add a penalty term to our loss function to prevent overfitting. This is called L2 Regularization or Weight Decay.

Loss = Likelihood_Loss + λ * ||Weights||^2

It turns out that this is exactly equivalent to Bayesian MAP estimation with a Gaussian Prior!

  1. Assume the weights w come from a Gaussian Prior: P(w) ∝ exp(-λw^2), where λ absorbs the Gaussian’s 1/(2σ²) constant.
  2. We want to maximize Posterior ∝ Likelihood × Prior.
  3. Take the Log: log(Posterior) = log(Likelihood) + log(Prior) + const.
  4. log(Prior) = log(exp(-λw^2)) + const = -λw^2 + const.
  5. So maximizing the Posterior is equivalent to maximizing log(Likelihood) - λw^2.
  6. Or minimizing Negative_Log_Likelihood + λw^2.

Conclusion: When you use Weight Decay in PyTorch/TensorFlow, you are secretly being a Bayesian. You are injecting a “Prior” belief that weights should be small.
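The equivalence can be verified numerically on a toy one-weight regression. The data, noise scale, and λ value below are illustrative assumptions; the point is that the weight-decay loss and the negative log posterior with a Gaussian prior rank every candidate weight identically:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=20)
y = 2.0 * x + rng.normal(scale=0.5, size=20)  # toy data, true slope = 2
lam = 0.1

ws = np.linspace(-5, 5, 1001)  # grid of candidate weights
# Gaussian-noise NLL up to constants: sum of squared residuals
nll = np.array([np.sum((y - w * x) ** 2) for w in ws])

regularized_loss = nll + lam * ws**2                     # weight-decay view
neg_log_posterior = nll - np.log(np.exp(-lam * ws**2))   # Gaussian-prior view

best_wd = ws[np.argmin(regularized_loss)]
best_map = ws[np.argmin(neg_log_posterior)]
print(f"Weight-decay optimum: {best_wd:.2f}, MAP optimum: {best_map:.2f}")
```

The two loss arrays are identical (since -log(exp(-λw²)) = λw²), so their minimizers coincide: the same number viewed through two philosophies.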

7. Python Implementation: Grid Approximation

Sometimes we cannot calculate the posterior analytically. A simple way to visualize it is Grid Approximation: divide the possible values of θ into a grid and calculate the posterior for each point.

import numpy as np

def grid_approximation(prior_grid, likelihood_grid):
    """
    Calculate posterior using grid approximation.

    prior_grid: Array of prior probabilities for each theta
    likelihood_grid: Array of likelihoods P(Data|theta)
    """
    # 1. Multiply Prior * Likelihood
    unnormalized_posterior = prior_grid * likelihood_grid

    # 2. Normalize (divide by Evidence P(Data))
    evidence = np.sum(unnormalized_posterior)
    posterior = unnormalized_posterior / evidence

    return posterior

# Example:
# We define 100 evenly spaced values for theta between 0 and 1 (inclusive)
thetas = np.linspace(0, 1, 100)

# Uniform Prior: All values equally likely (1/100)
prior = np.ones(100) / 100

# We observed 3 Heads in 4 tosses (H, H, T, H)
# Likelihood function for Binomial(k=3, n=4) is theta^3 * (1-theta)^1
# We omit the combinatorial constant as it cancels out during normalization
likelihood = (thetas**3) * ((1 - thetas)**1)

posterior = grid_approximation(prior, likelihood)

# Find the MAP estimate (Theta with highest probability)
map_idx = np.argmax(posterior)
print(f"MAP Estimate for Theta: {thetas[map_idx]:.2f}")

8. Summary

  • Inference is the process of estimating unknown parameters from data.
  • The Posterior is a compromise between the Prior (what we thought) and the Likelihood (what we saw).
  • MAP estimates include prior knowledge, making them robust for small datasets (regularization).
  • Deep Learning Connection: L2 Regularization is mathematically equivalent to placing a Gaussian Prior on your neural network weights.