Maximum Likelihood & MAP Estimation

Estimation is the process of inferring the parameters of a probability distribution that generated a given dataset. In this chapter, we explore the two most fundamental techniques: Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation.

The core idea of Maximum Likelihood Estimation (MLE) is simple: choose the parameters that make the observed data most probable.

1. The Genesis of Likelihood

Why do we define likelihood the way we do, and why do we almost always work with its logarithm? The answers stem from two considerations: the Independence assumption and Hardware Reality.

The Independence Assumption

If we observe a set of data points D = {x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>n</sub>}, and we assume they are Independent and Identically Distributed (IID), then the joint probability of observing all of them is the product of their individual probabilities:

P(D | &theta;) = P(x<sub>1</sub> | &theta;) &times; P(x<sub>2</sub> | &theta;) &times; ... &times; P(x<sub>n</sub> | &theta;) = &Pi; P(x<sub>i</sub> | &theta;)

Viewed as a function of &theta; with the data held fixed, we call this function L(&theta;) because it tells us the “Likelihood” of the parameters &theta; given the data.

Hardware Reality: The Need for Logs

Probabilities are numbers between 0 and 1. When you multiply many small numbers together, you get an incredibly small number.

[!IMPORTANT] Hardware Reality: Floating-Point Underflow. In a standard 64-bit float (double), the smallest positive normalized number is approximately 2.2 &times; 10<sup>-308</sup>. Multiplying just 100 probabilities of 0.01 each already gives 10<sup>-200</sup>; with a dataset of 10,000 images, multiplying their probabilities results in Underflow (the computer sees exactly 0).

The solution: take the Logarithm. Since log(a &times; b) = log(a) + log(b), this turns dangerous multiplication into safe addition. And because the logarithm is monotonically increasing, maximizing the log-likelihood yields the same &theta; as maximizing the likelihood itself.

Therefore, we almost always maximize the Log-Likelihood l(&theta;) = ln(L(&theta;)):

l(&theta;) = &Sigma; ln(P(x<sub>i</sub> | &theta;))
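A minimal NumPy sketch (an illustration, not from the original text) makes the underflow problem concrete: the direct product of 1,000 probabilities of 0.01 collapses to zero, while the sum of logs stays perfectly representable.

```python
import numpy as np

# 1,000 probabilities of 0.01 each: the true product is 10^-2000,
# far below the smallest representable double (~4.9e-324).
probs = np.full(1000, 0.01)

direct = np.prod(probs)          # underflows to 0.0
log_lik = np.sum(np.log(probs))  # 1000 * ln(0.01), a perfectly ordinary float

print(direct)   # 0.0
print(log_lik)  # about -4605.17
```

Any downstream comparison of candidate parameters would be meaningless with the underflowed product, but works fine with the log-likelihood.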

2. Maximum Likelihood Estimation (MLE)

The goal of MLE is to find the parameter &theta; that maximizes l(&theta;). The process is standard calculus:

  1. Write down the Log-Likelihood function l(&theta;).
  2. Differentiate with respect to &theta;.
  3. Set the derivative to 0 and solve.

Example: Bernoulli Distribution (Coin Flip)

Suppose we flip a coin n times and observe k heads, where p is the probability of heads. The likelihood of this particular sequence is: L(p) = p<sup>k</sup>(1-p)<sup>n-k</sup> (any binomial coefficient is constant in p, so it does not affect the maximizer).

Taking the log: l(p) = k ln(p) + (n-k) ln(1-p)

Differentiating with respect to p and setting to 0: dl/dp = k/p - (n-k)/(1-p) = 0

Rearranging gives k(1-p) = (n-k)p, i.e. k = np, so: p̂ = k/n

So, the MLE estimate for p is simply the sample proportion of heads. This matches our intuition!
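As a sanity check, the closed-form result can be confirmed numerically. The sketch below (using hypothetical data n = 10, k = 7) evaluates l(p) on a fine grid and recovers p̂ ≈ k/n:

```python
import numpy as np

# Hypothetical data: n = 10 flips, k = 7 heads
n, k = 10, 7

# Log-likelihood l(p) = k ln(p) + (n-k) ln(1-p), evaluated on a grid
# (avoid the endpoints 0 and 1, where the log diverges)
p_grid = np.linspace(0.001, 0.999, 9999)
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)

p_hat = p_grid[np.argmax(log_lik)]
print(f"Grid-search MLE: {p_hat:.3f}")  # 0.700, matching k/n
```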

Interactive: Likelihood Visualizer

Visualize how the likelihood function changes as we vary the parameter &mu; (mean) for a fixed set of data points drawn from a Normal distribution. The goal is to find the &mu; that maximizes the likelihood.

[Interactive widget: Likelihood Maximizer — a slider for &mu;, with readouts for the current log-likelihood and the true &mu;]

The blue dots are data points. The red curve is the Normal PDF with the current μ. The green bar represents the log-likelihood (higher is better). Maximize the green bar!
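If you cannot run the widget, the same experiment can be sketched in a few lines of NumPy (the true &mu; = 2.0, unit variance, and sample size are illustrative assumptions): a grid search over &mu; recovers the sample mean as the maximizer.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=50)  # assumed true mu = 2.0, sigma = 1

def normal_log_lik(mu, x, sigma=1.0):
    # Sum of log N(x_i | mu, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

mus = np.linspace(0.0, 4.0, 401)  # candidate means, spacing 0.01
lls = [normal_log_lik(m, data) for m in mus]
best_mu = mus[np.argmax(lls)]

print(f"mu maximizing log-likelihood: {best_mu:.2f}")
print(f"sample mean:                  {data.mean():.2f}")
```

The two printed values agree (up to the grid spacing), previewing a standard result: for a Normal with known variance, the MLE of &mu; is the sample mean.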

3. Maximum A Posteriori (MAP)

MLE is great, but it assumes we know nothing about the parameters beforehand. What if we have prior knowledge?

For example, if we flip a coin 3 times and get 3 heads, MLE says p = 1.0. This clashes with our prior belief that most coins are roughly fair (p &asymp; 0.5).

MAP incorporates a Prior Distribution P(&theta;) over the parameters.

Using Bayes’ Theorem: P(&theta; | x) = (P(x | &theta;) &times; P(&theta;)) / P(x)

  • P(&theta; | x): Posterior (what we want to maximize)
  • P(x | &theta;): Likelihood (same as MLE)
  • P(&theta;): Prior (our belief before seeing data)
  • P(x): Evidence (constant with respect to θ)

To find the MAP estimate, we maximize: &theta;<sub>MAP</sub> = argmax<sub>&theta;</sub> (L(&theta;) &times; P(&theta;))

Taking the log: &theta;<sub>MAP</sub> = argmax<sub>&theta;</sub> (l(&theta;) + ln(P(&theta;)))

[!TIP] MAP can be seen as “Regularized MLE”. The prior acts as a regularizer, preventing overfitting to small datasets.

Code Examples: MLE vs MAP

Here we estimate the probability of heads for a coin flip using both MLE and MAP. Note how MAP pulls the estimate towards the prior belief (0.5), acting as a regularizer.

Python

import numpy as np

# Data: 8 heads, 2 tails (Small sample size)
n_heads = 8
n_tails = 2
N = n_heads + n_tails

# 1. MLE Estimate
# p_mle = k / n
p_mle = n_heads / N

# 2. MAP Estimate with Beta(2, 2) Prior ("Weakly fair belief")
# Posterior is Beta(alpha + k, beta + n - k)
# Mode of Beta(a, b) is (a-1)/(a+b-2)
alpha_prior = 2
beta_prior = 2

alpha_post = alpha_prior + n_heads
beta_post = beta_prior + n_tails

p_map = (alpha_post - 1) / (alpha_post + beta_post - 2)

print(f"Data: {n_heads} Heads, {n_tails} Tails")
print(f"MLE Estimate: {p_mle:.4f} (Likely overfitted)")
print(f"MAP Estimate: {p_map:.4f} (Regularized towards 0.5)")

Java

public class MleMapEstimation {
    public static void main(String[] args) {
        // Data: 8 heads, 2 tails
        double nHeads = 8.0;
        double nTails = 2.0;
        double N = nHeads + nTails;

        // 1. MLE Estimate
        double pMle = nHeads / N;

        // 2. MAP Estimate with Beta(2, 2) Prior
        double alphaPrior = 2.0;
        double betaPrior = 2.0;

        double alphaPost = alphaPrior + nHeads;
        double betaPost = betaPrior + nTails;

        // Mode of Beta distribution
        double pMap = (alphaPost - 1) / (alphaPost + betaPost - 2);

        System.out.printf("Data: %.0f Heads, %.0f Tails%n", nHeads, nTails);
        System.out.printf("MLE Estimate: %.4f (Likely overfitted)%n", pMle);
        System.out.printf("MAP Estimate: %.4f (Regularized towards 0.5)%n", pMap);
    }
}

Go

package main

import "fmt"

func main() {
    // Data: 8 heads, 2 tails
    nHeads := 8.0
    nTails := 2.0
    N := nHeads + nTails

    // 1. MLE Estimate
    pMle := nHeads / N

    // 2. MAP Estimate with Beta(2, 2) Prior
    alphaPrior := 2.0
    betaPrior := 2.0

    alphaPost := alphaPrior + nHeads
    betaPost := betaPrior + nTails

    // Mode of Beta distribution
    pMap := (alphaPost - 1) / (alphaPost + betaPost - 2)

    fmt.Printf("Data: %.0f Heads, %.0f Tails\n", nHeads, nTails)
    fmt.Printf("MLE Estimate: %.4f (Likely overfitted)\n", pMle)
    fmt.Printf("MAP Estimate: %.4f (Regularized towards 0.5)\n", pMap)
}

4. Summary

| Feature | MLE | MAP |
| --- | --- | --- |
| Goal | Maximize P(Data \| &theta;) | Maximize P(&theta; \| Data) |
| Prior | Assumes uniform prior | Explicit prior P(&theta;) |
| Bias | Often unbiased (asymptotically) | Biased towards the prior |
| Use case | Large datasets | Small datasets, regularization |

Next, we will explore the Bias-Variance Tradeoff to understand the quality of these estimators.