Bayes’ Theorem: The Engine of Learning

At its core, Machine Learning is about updating what we know based on new evidence. Bayes’ Theorem is the mathematical framework for exactly this process. It tells us how to update our prior beliefs after seeing data.

In this chapter, we will move beyond the abstract formula and see how Bayes’ Theorem powers real classifiers such as Spam Filters and Medical Diagnostic tools.

1. Pillar 1: Intuition (The Map)

Imagine you are exploring a dark room. You have a rough map of where the furniture might be (your Prior). As you walk, you bump into a chair (your Evidence).

  • Classical Statistics (Frequentist) asks: “Given there is a chair here, what is the probability I bumped into it?”
  • Bayesian Statistics asks: “Given I bumped into something, how should I update my map?”

Bayes’ Theorem is simply the logic of updating your map.

  1. Prior: You thought the room was empty (Low probability of furniture).
  2. Likelihood: You felt pain in your shin (High probability of pain if furniture exists).
  3. Posterior: You now believe there is a chair in front of you (High probability of furniture).

The Formula

The formula is often written as:

P(A|B) = P(B|A) × P(A) / P(B)

Where:

  • **P(A|B)** is the Posterior (What we want to know).
  • **P(B|A)** is the Likelihood (How compatible is the evidence with our hypothesis?).
  • P(A) is the Prior (What we believed before).
  • P(B) is the Evidence (Normalization constant).
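The formula maps one-to-one onto code. Here is a minimal sketch (the function name and the numbers are hypothetical, chosen only for illustration):

```python
def bayes_posterior(prior, likelihood, evidence):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Hypothetical values: P(A) = 0.3, P(B|A) = 0.8, P(B) = 0.5
print(round(bayes_posterior(prior=0.3, likelihood=0.8, evidence=0.5), 2))  # 0.48
```

Because the Likelihood (0.8) is high relative to the Evidence (0.5), the Posterior (0.48) ends up above the Prior (0.3): the data made the hypothesis more believable.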

2. Interactive: The Diagnostic Trap

Let’s visualize why low priors (rare diseases) make positive tests less reliable, even with accurate tests.

Bayesian Diagnostic Calculator

(Interactive widget: sliders set how common the disease is, the True Positive Rate P(+|Disease), and the True Negative Rate P(-|Healthy); the calculator displays the resulting Posterior Probability P(Disease | +), e.g. 16.6%, alongside a breakdown of True Positives (Actually Sick) vs. False Positives (Actually Healthy).)

[!TIP] Notice how if the disease is very rare (Prior < 1%), even a very accurate test (99% Specificity) results in a low Posterior probability. This is why doctors don’t test everyone for everything!
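The trap is easy to reproduce in a few lines of Python. This is a sketch of what the calculator computes; the example numbers are assumptions chosen to mirror the tip above:

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(Disease | +) via Bayes' Theorem.

    prevalence  -- P(Disease), the Prior
    sensitivity -- P(+ | Disease), the True Positive Rate
    specificity -- P(- | Healthy), the True Negative Rate
    """
    # Total probability of a positive test: sick and detected,
    # or healthy and falsely flagged
    p_positive = (sensitivity * prevalence
                  + (1 - specificity) * (1 - prevalence))
    return sensitivity * prevalence / p_positive

# Rare disease (1% prevalence), accurate test (99% sensitivity, 95% specificity)
print(f"{posterior_given_positive(0.01, 0.99, 0.95):.1%}")  # 16.7%
```

Even though this hypothetical test is right the vast majority of the time, a positive result corresponds to only about a one-in-six chance of actually having the disease, because healthy people vastly outnumber sick ones.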

3. Pillar 3: Hardware Reality (The Machine)

Floating-Point Underflow and Log-Probabilities

In theory, Bayes’ Theorem involves multiplying probabilities: P(A|B) = P(B|A) × P(A) / P(B). In practice, that multiplication is where computers get into trouble.

In Naive Bayes (e.g., Spam Filtering), we multiply the probability of every single word appearing in the email:

P(Spam|Email) ∝ P(Spam) × P(Word1|Spam) × P(Word2|Spam) × ... × P(WordN|Spam)

(The ∝ means “proportional to”: we drop the denominator P(Email), since it is the same constant for every class.)

If an email has 1000 words, and each word has a probability of 0.01, we are calculating 0.01^1000 = 10^-2000. The smallest positive value an IEEE 754 double can represent is about 5 × 10^-324, so the computer silently rounds the result to zero. This is called Arithmetic Underflow.
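You can watch the underflow happen directly in Python (a standalone illustration, not part of a real spam filter):

```python
import math

# 1000 words, each with probability 0.01 given Spam
word_probs = [0.01] * 1000

# 0.01 ** 1000 = 1e-2000, far below the smallest representable double
product = math.prod(word_probs)
print(product)  # 0.0 -- silent arithmetic underflow, no error raised
```

Note that nothing crashes: the probability simply vanishes, which makes this bug especially easy to miss.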

The Solution: Log-Space

To solve this, we work in Log-Space. Since log(a * b) = log(a) + log(b), multiplication becomes addition.

log(P(Spam|Email)) = log(P(Spam)) + Σ log(P(Word_i|Spam))

  • No Underflow: Logarithms of small probabilities are just large negative numbers (e.g., log(0.0001) = -9.21). Computers handle these easily.
  • Speed: Addition is generally as fast as or faster than multiplication in hardware.
  • Accuracy: We preserve precision.
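The same 1000-word example that underflowed above becomes harmless in log-space (again, just an illustrative snippet):

```python
import math

word_probs = [0.01] * 1000

# The product of these probabilities underflows to 0.0, but the
# sum of their logs is just a moderately large negative number
log_prob = sum(math.log(p) for p in word_probs)
print(log_prob)  # about -4605.17: large, negative, perfectly representable
```

A double can hold exponents down to roughly -308 in decimal, but a value like -4605.17 is nowhere near any limit, so the computation stays exact for all practical purposes.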

[!IMPORTANT] In production ML systems, we almost never work with raw probabilities. We always compute the Log-Likelihood.

4. Pillar 4: Patterns (The Toolbox)

Naive Bayes Classifier

The most common pattern using Bayes’ Theorem is the Naive Bayes classifier.

  • Problem: Calculating P(Class | Feature1, Feature2, ...) is hard because features might be correlated.
  • Assumption: Assume all features are independent given the class.
  • Result: The complex joint probability breaks down into simple products.

So the Likelihood becomes: P(F1, F2 | Class) ≈ P(F1 | Class) × P(F2 | Class)

Python Example: Log-Probabilities

Here is how we implement a robust Bayesian update using log-probabilities to avoid underflow.

```python
import math

def log_bayes_update(log_prior, log_likelihoods):
    """
    Calculate proportional log-posterior using the Naive Bayes assumption.

    log_prior: log(P(Class))
    log_likelihoods: List of log(P(Feature_i | Class))
    """
    # In Log-Space, multiplication becomes addition
    log_posterior_proportional = log_prior + sum(log_likelihoods)

    return log_posterior_proportional

# Example: Spam Filter
# Prior: P(Spam) = 0.40 -> log(0.40)
p_spam = 0.40
log_prior = math.log(p_spam)

# Likelihoods: Probabilities of seeing specific words given Spam
# P("Free"|Spam) = 0.8, P("Click"|Spam) = 0.6, P("Winner"|Spam) = 0.9
probs = [0.8, 0.6, 0.9]
log_likelihoods = [math.log(p) for p in probs]

# Calculate Log Posterior (Unnormalized)
log_result = log_bayes_update(log_prior, log_likelihoods)

print(f"Log Posterior: {log_result:.4f}")
print(f"Raw Posterior (Proportional): {math.exp(log_result):.8f}")
```
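To turn this score into an actual classification, compute the same quantity under each class and pick the larger; since log is monotonic, no exp() or normalization is needed. A sketch, where the Ham-side numbers are hypothetical:

```python
import math

def log_bayes_update(log_prior, log_likelihoods):
    # Repeated here so the snippet is self-contained
    return log_prior + sum(log_likelihoods)

# Hypothetical word likelihoods under each class
spam_probs = [0.8, 0.6, 0.9]   # P(word | Spam)
ham_probs  = [0.1, 0.3, 0.05]  # P(word | Ham)

spam_score = log_bayes_update(math.log(0.40), [math.log(p) for p in spam_probs])
ham_score  = log_bayes_update(math.log(0.60), [math.log(p) for p in ham_probs])

# argmax over the unnormalized log-posteriors decides the label
print("Spam" if spam_score > ham_score else "Ham")  # Spam
```

This is exactly what production Naive Bayes implementations do: compare log-scores across classes and never convert back to raw probabilities unless a calibrated probability is explicitly needed.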

5. Summary

  • Bayes’ Theorem updates priors with evidence to form a posterior.
  • Hardware Reality: Multiplying many small probabilities causes underflow. We use Log-Probabilities to turn multiplication into addition.
  • Naive Bayes simplifies the math by assuming feature independence, making it scalable for text classification.