Maximum Likelihood & MAP Estimation
Estimation is the process of inferring the parameters of a probability distribution that generated a given dataset. In this chapter, we explore the two most fundamental techniques: Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation.
The core idea of Maximum Likelihood Estimation (MLE) is simple: choose the parameters that make the observed data most probable.
1. The Genesis of Likelihood
Why do we define Likelihood the way we do? It stems from two fundamental constraints: Independence and Hardware Reality.
The Independence Assumption
If we observe a set of data points D = {x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>n</sub>}, and we assume they are Independent and Identically Distributed (IID), then the joint probability of observing all of them is the product of their individual probabilities:
P(D | θ) = P(x<sub>1</sub> | θ) × P(x<sub>2</sub> | θ) × ... × P(x<sub>n</sub> | θ) = Π P(x<sub>i</sub> | θ)
We call this function L(θ) because it tells us the “Likelihood” of the parameters θ given the data.
Hardware Reality: The Need for Logs
Probabilities are numbers between 0 and 1. When you multiply many small numbers together, you get an incredibly small number.
> [!IMPORTANT] **Hardware Reality: Floating Point Underflow**
> In a standard 64-bit float (`double`), the smallest positive normalized number is approximately 2.2 × 10<sup>-308</sup>. If you multiply just 100 probabilities of 0.01 together, you get 10<sup>-200</sup>. With a dataset of 10,000 images, multiplying their probabilities will instantly result in Underflow (the computer sees 0).
>
> **The Solution:** We take the Logarithm. Since log(a × b) = log(a) + log(b), this turns dangerous multiplication into safe addition.
Therefore, we almost always maximize the Log-Likelihood l(θ) = ln(L(θ)):
l(θ) = Σ ln(P(x<sub>i</sub> | θ))
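The underflow problem, and the log fix, can be seen directly in a few lines of Python (the probability value 0.01 and the dataset size are illustrative choices, not taken from a real model):

```python
import math

# 10,000 IID observations, each assigned probability 0.01 under the model.
probs = [0.01] * 10_000

# Naive product of probabilities: underflows to exactly 0.0 in 64-bit floats.
likelihood = 1.0
for p in probs:
    likelihood *= p
print(likelihood)  # 0.0 -- underflow

# Log-likelihood: a perfectly representable sum of moderate-sized numbers.
log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)  # about -46051.7
```

The sum form loses no information: the parameters that maximize the log-likelihood are exactly those that maximize the likelihood, because the logarithm is monotonically increasing.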
2. Maximum Likelihood Estimation (MLE)
The goal of MLE is to find the parameter θ that maximizes l(θ). The process is standard calculus:
- Write down the Log-Likelihood function l(θ).
- Differentiate with respect to θ.
- Set the derivative to 0 and solve.
Example: Bernoulli Distribution (Coin Flip)
Suppose we flip a coin n times and observe k heads. The probability of heads is p.
The likelihood is:
L(p) = p<sup>k</sup>(1-p)<sup>n-k</sup>
Taking the log:
l(p) = k ln(p) + (n-k) ln(1-p)
Differentiating with respect to p and setting to 0:
dl/dp = k/p - (n-k)/(1-p) = 0
Solving this yields:
p̂ = k/n
So, the MLE estimate for p is simply the sample proportion of heads. This matches our intuition!
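As a sanity check, we can maximize this log-likelihood numerically and confirm it matches k/n. This is only a sketch: the flip counts and the grid resolution below are arbitrary illustrative choices.

```python
import math

n, k = 10, 7  # illustrative: 10 flips, 7 heads

def log_likelihood(p):
    # l(p) = k ln(p) + (n - k) ln(1 - p)
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Grid search over p in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)

print(p_hat)   # 0.7
print(k / n)   # 0.7 -- matches the closed-form MLE
```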
Interactive: Likelihood Visualizer
Visualize how the likelihood function changes as we vary the parameter μ (mean) for a fixed set of data points drawn from a Normal distribution. The goal is to find the μ that maximizes the likelihood.
Interactive Likelihood Maximizer: the blue dots are data points, the red curve is the Normal PDF with the current μ, and the green bar represents the log-likelihood (higher is better). Maximize the green bar!
3. Maximum A Posteriori (MAP)
MLE is great, but it assumes we know nothing about the parameters beforehand. What if we have prior knowledge?
For example, if we flip a coin 3 times and get 3 heads, MLE says p = 1.0. Intuitively this seems wrong: most coins are close to fair (p ≈ 0.5).
MAP incorporates a Prior Distribution P(θ) over the parameters.
Using Bayes’ Theorem:
P(θ | x) = (P(x | θ) × P(θ)) / P(x)
- P(θ | x): Posterior (what we want to maximize)
- P(x | θ): Likelihood (same as MLE)
- P(θ): Prior (our belief before seeing data)
- P(x): Evidence (constant with respect to θ)
To find the MAP estimate, we maximize:
θ<sub>MAP</sub> = argmax (L(θ) × P(θ))
Taking the log:
θ<sub>MAP</sub> = argmax (l(θ) + ln(P(θ)))
> [!TIP] MAP can be seen as “Regularized MLE”. The prior acts as a regularizer, preventing overfitting to small datasets.
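This log-domain objective can also be verified numerically. The sketch below assumes a Beta(2, 2) prior and a coarse grid (both illustrative choices); it recovers the extreme MLE from the 3-heads example and the MAP estimate pulled back towards 0.5.

```python
import math

# Data: 3 flips, 3 heads -- the case where MLE overfits to p = 1.0
n, k = 3, 3

def log_likelihood(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

def log_prior(p):
    # Beta(2, 2) prior: density proportional to p * (1 - p)
    return math.log(p) + math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]
p_mle = max(grid, key=log_likelihood)
p_map = max(grid, key=lambda p: log_likelihood(p) + log_prior(p))

print(p_mle)  # 0.999 -- pushed to the edge of the grid (the true MLE is 1.0)
print(p_map)  # 0.8   -- the Beta(5, 2) posterior mode, pulled towards 0.5
```

Note that the MAP value 0.8 agrees with the closed-form Beta posterior mode (5 − 1)/(5 + 2 − 2) = 0.8.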
Code Examples: MLE vs MAP
Here we estimate the probability of heads for a coin flip using both MLE and MAP. Note how MAP pulls the estimate towards the prior belief (0.5), acting as a regularizer.
Python

```python
# Data: 8 heads, 2 tails (small sample size)
n_heads = 8
n_tails = 2
N = n_heads + n_tails

# 1. MLE estimate: p_mle = k / n
p_mle = n_heads / N

# 2. MAP estimate with a Beta(2, 2) prior ("weakly fair" belief).
# The posterior is Beta(alpha + k, beta + n - k);
# the mode of Beta(a, b) is (a - 1) / (a + b - 2) for a, b > 1.
alpha_prior = 2
beta_prior = 2
alpha_post = alpha_prior + n_heads
beta_post = beta_prior + n_tails
p_map = (alpha_post - 1) / (alpha_post + beta_post - 2)

print(f"Data: {n_heads} Heads, {n_tails} Tails")
print(f"MLE Estimate: {p_mle:.4f} (Likely overfitted)")
print(f"MAP Estimate: {p_map:.4f} (Regularized towards 0.5)")
```
Java

```java
public class MleMapEstimation {
    public static void main(String[] args) {
        // Data: 8 heads, 2 tails
        double nHeads = 8.0;
        double nTails = 2.0;
        double N = nHeads + nTails;

        // 1. MLE estimate
        double pMle = nHeads / N;

        // 2. MAP estimate with a Beta(2, 2) prior
        double alphaPrior = 2.0;
        double betaPrior = 2.0;
        double alphaPost = alphaPrior + nHeads;
        double betaPost = betaPrior + nTails;

        // Mode of the Beta distribution: (a - 1) / (a + b - 2)
        double pMap = (alphaPost - 1) / (alphaPost + betaPost - 2);

        System.out.printf("Data: %.0f Heads, %.0f Tails%n", nHeads, nTails);
        System.out.printf("MLE Estimate: %.4f (Likely overfitted)%n", pMle);
        System.out.printf("MAP Estimate: %.4f (Regularized towards 0.5)%n", pMap);
    }
}
```
Go

```go
package main

import "fmt"

func main() {
	// Data: 8 heads, 2 tails
	nHeads := 8.0
	nTails := 2.0
	N := nHeads + nTails

	// 1. MLE estimate
	pMle := nHeads / N

	// 2. MAP estimate with a Beta(2, 2) prior
	alphaPrior := 2.0
	betaPrior := 2.0
	alphaPost := alphaPrior + nHeads
	betaPost := betaPrior + nTails

	// Mode of the Beta distribution: (a - 1) / (a + b - 2)
	pMap := (alphaPost - 1) / (alphaPost + betaPost - 2)

	fmt.Printf("Data: %.0f Heads, %.0f Tails\n", nHeads, nTails)
	fmt.Printf("MLE Estimate: %.4f (Likely overfitted)\n", pMle)
	fmt.Printf("MAP Estimate: %.4f (Regularized towards 0.5)\n", pMap)
}
```
4. Summary
| Feature | MLE | MAP |
|---|---|---|
| Goal | Maximize P(Data \| θ) | Maximize P(θ \| Data) |
| Prior | Assumes Uniform Prior | Explicit Prior P(θ) |
| Bias | Often Unbiased (asymptotically) | Biased towards Prior |
| Use Case | Large Datasets | Small Datasets, Regularization |
Next, we will explore the Bias-Variance Tradeoff to understand the quality of these estimators.