Maximum Likelihood & MAP Estimation

Estimation is the process of inferring the parameters of a probability distribution that generated a given dataset. In this chapter, we explore the two most fundamental techniques: Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation.

The core idea of Maximum Likelihood Estimation (MLE) is simple: choose the parameters that make the observed data most probable.

1. The Genesis of Likelihood

Why do we define likelihood the way we do, and why do we almost always work with its logarithm? The answers stem from two considerations: the Independence assumption and Hardware Reality.

The Independence Assumption

If we observe a set of data points D = {x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>n</sub>}, and we assume they are Independent and Identically Distributed (IID), then the joint probability of observing all of them is the product of their individual probabilities:

P(D | &theta;) = P(x<sub>1</sub> | &theta;) &times; P(x<sub>2</sub> | &theta;) &times; ... &times; P(x<sub>n</sub> | &theta;) = &Pi; P(x<sub>i</sub> | &theta;)

Viewed as a function of &theta; with the data held fixed, we call this function L(&theta;) because it tells us the “Likelihood” of the parameters &theta; given the data.

Hardware Reality: The Need for Logs

Probabilities are numbers between 0 and 1. When you multiply many small numbers together, you get an incredibly small number.

[!IMPORTANT] Hardware Reality: Floating-Point Underflow. In a standard 64-bit float (double), the smallest positive normalized number is approximately 2.2 &times; 10<sup>-308</sup>. Multiplying just 100 probabilities of 0.01 each already gives 10<sup>-200</sup>; with a dataset of 10,000 images, multiplying their probabilities results in Underflow (the computer sees exactly 0).

The solution: take the Logarithm. Since log(a &times; b) = log(a) + log(b), this turns dangerous multiplication into safe addition. And because the logarithm is monotonically increasing, maximizing the log-likelihood yields the same &theta; as maximizing the likelihood itself.

Therefore, we almost always maximize the Log-Likelihood l(&theta;) = ln(L(&theta;)):

l(&theta;) = &Sigma; ln(P(x<sub>i</sub> | &theta;))
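A minimal NumPy sketch (an illustration, not from the original text) makes the underflow problem concrete: the direct product of 1,000 probabilities of 0.01 collapses to zero, while the sum of logs stays perfectly representable.

```python
import numpy as np

# 1,000 probabilities of 0.01 each: the true product is 10^-2000,
# far below the smallest representable double (~4.9e-324).
probs = np.full(1000, 0.01)

direct = np.prod(probs)          # underflows to 0.0
log_lik = np.sum(np.log(probs))  # 1000 * ln(0.01), a perfectly ordinary float

print(direct)   # 0.0
print(log_lik)  # about -4605.17
```

Any downstream comparison of candidate parameters would be meaningless with the underflowed product, but works fine with the log-likelihood.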

2. Maximum Likelihood Estimation (MLE)

The goal of MLE is to find the parameter &theta; that maximizes l(&theta;). The process is standard calculus:

  1. Write down the Log-Likelihood function l(&theta;).
  2. Differentiate with respect to &theta;.
  3. Set the derivative to 0 and solve.

Example: Bernoulli Distribution (Coin Flip)

Suppose we flip a coin n times and observe k heads, where p is the probability of heads. The likelihood of this particular sequence is: L(p) = p<sup>k</sup>(1-p)<sup>n-k</sup> (any binomial coefficient is constant in p, so it does not affect the maximizer).

Taking the log: l(p) = k ln(p) + (n-k) ln(1-p)

Differentiating with respect to p and setting to 0: dl/dp = k/p - (n-k)/(1-p) = 0

Rearranging gives k(1-p) = (n-k)p, i.e. k = np, so: p̂ = k/n

So, the MLE estimate for p is simply the sample proportion of heads. This matches our intuition!
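As a sanity check, the closed-form result can be confirmed numerically. The sketch below (using hypothetical data n = 10, k = 7) evaluates l(p) on a fine grid and recovers p̂ ≈ k/n:

```python
import numpy as np

# Hypothetical data: n = 10 flips, k = 7 heads
n, k = 10, 7

# Log-likelihood l(p) = k ln(p) + (n-k) ln(1-p), evaluated on a grid
# (avoid the endpoints 0 and 1, where the log diverges)
p_grid = np.linspace(0.001, 0.999, 9999)
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)

p_hat = p_grid[np.argmax(log_lik)]
print(f"Grid-search MLE: {p_hat:.3f}")  # 0.700, matching k/n
```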

Interactive: Likelihood Visualizer

Visualize how the likelihood function changes as we vary the parameter &mu; (mean) for a fixed set of data points drawn from a Normal distribution. The goal is to find the &mu; that maximizes the likelihood.

[Interactive widget: Likelihood Maximizer — a slider for &mu;, with readouts for the current log-likelihood and the true &mu;]

The blue dots are data points. The red curve is the Normal PDF with the current μ. The green bar represents the log-likelihood (higher is better). Maximize the green bar!
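If you cannot run the widget, the same experiment can be sketched in a few lines of NumPy (the true &mu; = 2.0, unit variance, and sample size are illustrative assumptions): a grid search over &mu; recovers the sample mean as the maximizer.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=50)  # assumed true mu = 2.0, sigma = 1

def normal_log_lik(mu, x, sigma=1.0):
    # Sum of log N(x_i | mu, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

mus = np.linspace(0.0, 4.0, 401)  # candidate means, spacing 0.01
lls = [normal_log_lik(m, data) for m in mus]
best_mu = mus[np.argmax(lls)]

print(f"mu maximizing log-likelihood: {best_mu:.2f}")
print(f"sample mean:                  {data.mean():.2f}")
```

The two printed values agree (up to the grid spacing), previewing a standard result: for a Normal with known variance, the MLE of &mu; is the sample mean.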

3. Maximum A Posteriori (MAP)

MLE is great, but it assumes we know nothing about the parameters beforehand. What if we have prior knowledge?

For example, if we flip a coin 3 times and get 3 heads, MLE says p = 1.0. This clashes with our prior belief that most coins are roughly fair (p &asymp; 0.5).

MAP incorporates a Prior Distribution P(&theta;) over the parameters.

Using Bayes’ Theorem: P(&theta; | x) = (P(x | &theta;) &times; P(&theta;)) / P(x)

  • P(&theta; | x): Posterior (what we want to maximize)
  • P(x | &theta;): Likelihood (same as MLE)
  • P(&theta;): Prior (our belief before seeing data)
  • P(x): Evidence (constant with respect to θ)

To find the MAP estimate, we maximize: &theta;<sub>MAP</sub> = argmax<sub>&theta;</sub> (L(&theta;) &times; P(&theta;))

Taking the log: &theta;<sub>MAP</sub> = argmax<sub>&theta;</sub> (l(&theta;) + ln(P(&theta;)))

[!TIP] MAP can be seen as “Regularized MLE”. The prior acts as a regularizer, preventing overfitting to small datasets.

Code Examples: MLE vs MAP

Here we estimate the probability of heads for a coin flip using both MLE and MAP. Note how MAP pulls the estimate towards the prior belief (0.5), acting as a regularizer.

Python

import numpy as np

# Data: 8 heads, 2 tails (Small sample size)
n_heads = 8
n_tails = 2
N = n_heads + n_tails

# 1. MLE Estimate
# p_mle = k / n
p_mle = n_heads / N

# 2. MAP Estimate with Beta(2, 2) Prior ("Weakly fair belief")
# Posterior is Beta(alpha + k, beta + n - k)
# Mode of Beta(a, b) is (a-1)/(a+b-2)
alpha_prior = 2
beta_prior = 2

alpha_post = alpha_prior + n_heads
beta_post = beta_prior + n_tails

p_map = (alpha_post - 1) / (alpha_post + beta_post - 2)

print(f"Data: {n_heads} Heads, {n_tails} Tails")
print(f"MLE Estimate: {p_mle:.4f} (Likely overfitted)")
print(f"MAP Estimate: {p_map:.4f} (Regularized towards 0.5)")

Java

public class MleMapEstimation {
    public static void main(String[] args) {
        // Data: 8 heads, 2 tails
        double nHeads = 8.0;
        double nTails = 2.0;
        double N = nHeads + nTails;

        // 1. MLE Estimate
        double pMle = nHeads / N;

        // 2. MAP Estimate with Beta(2, 2) Prior
        double alphaPrior = 2.0;
        double betaPrior = 2.0;

        double alphaPost = alphaPrior + nHeads;
        double betaPost = betaPrior + nTails;

        // Mode of Beta distribution
        double pMap = (alphaPost - 1) / (alphaPost + betaPost - 2);

        System.out.printf("Data: %.0f Heads, %.0f Tails%n", nHeads, nTails);
        System.out.printf("MLE Estimate: %.4f (Likely overfitted)%n", pMle);
        System.out.printf("MAP Estimate: %.4f (Regularized towards 0.5)%n", pMap);
    }
}

Go

package main

import "fmt"

func main() {
    // Data: 8 heads, 2 tails
    nHeads := 8.0
    nTails := 2.0
    N := nHeads + nTails

    // 1. MLE Estimate
    pMle := nHeads / N

    // 2. MAP Estimate with Beta(2, 2) Prior
    alphaPrior := 2.0
    betaPrior := 2.0

    alphaPost := alphaPrior + nHeads
    betaPost := betaPrior + nTails

    // Mode of Beta distribution
    pMap := (alphaPost - 1) / (alphaPost + betaPost - 2)

    fmt.Printf("Data: %.0f Heads, %.0f Tails\n", nHeads, nTails)
    fmt.Printf("MLE Estimate: %.4f (Likely overfitted)\n", pMle)
    fmt.Printf("MAP Estimate: %.4f (Regularized towards 0.5)\n", pMap)
}

4. Summary

| Feature | MLE | MAP |
| --- | --- | --- |
| Goal | Maximize P(Data \| &theta;) | Maximize P(&theta; \| Data) |
| Prior | Assumes uniform prior | Explicit prior P(&theta;) |
| Bias | Often unbiased (asymptotically) | Biased towards the prior |
| Use case | Large datasets | Small datasets, regularization |

Next, we will explore the Bias-Variance Tradeoff to understand the quality of these estimators.