Case Study: Naive Bayes Spam Classifier
[!NOTE] This module builds a complete spam filter around Naive Bayes: the probability theory, the engineering trade-offs (underflow, smoothing, caching), and a working implementation.
1. What is Naive Bayes?
Before the era of Transformers (BERT, GPT), the internet was saved from drowning in spam by a simple, elegant algorithm: Naive Bayes. Even today, it remains a baseline for text classification due to its speed and interpretability.
In this case study, we will design a Spam Filtering System that classifies emails as either Spam (Junk) or Ham (Legitimate).
[!TIP] Try it yourself: Open your email’s “Spam” folder. Notice how it catches “Urgent: You won $1,000,000” but lets “Meeting at 5pm” through. We will build the math engine behind this.
2. Requirements & Goals
Functional Requirements
- Classify: Given an email text, predict P(Spam | Text).
- Explain: The system must be interpretable (Why was this marked as spam?).
- Learn: The system should update its probabilities as new spam trends emerge.
Non-Functional Requirements
- Latency: Inference must be under 10ms per email (Real-time).
- Precision: We must minimize False Positives. Marking a legitimate job offer as spam is worse than letting a spam email through.
- Scalability: Must handle billions of emails per day (tens of thousands per second).
3. Capacity Estimation
- Traffic: 1 Billion emails/day ≈ 12,000 emails/sec.
- Vocabulary: English has ~170,000 words. We might track top 50,000 features.
- Storage:
  - We store the counts Count(Word | Spam) and Count(Word | Ham).
  - 50,000 words × 2 classes × 4 bytes (int) ≈ 400 KB.
- Conclusion: The entire model fits in the CPU's L2 cache (400 KB is larger than a typical 32–64 KB L1, but comfortably inside L2). This is why Naive Bayes is blazingly fast. It requires no complex Sharding (see System Design Module 07).
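The storage estimate can be sanity-checked in a few lines (the figures — 50,000 tracked words, 2 classes, 4-byte integer counters — come straight from the estimate above):

```python
# Back-of-envelope check of the model size estimate.
vocab_size = 50_000      # tracked features
num_classes = 2          # Spam, Ham
bytes_per_count = 4      # 32-bit integer counter

model_bytes = vocab_size * num_classes * bytes_per_count
print(f"{model_bytes / 1024:.0f} KB")  # ~391 KB, i.e. roughly 400 KB
```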
4. System APIs
We define a simple REST API for our inference service.
| Method | Endpoint | Description |
|---|---|---|
| POST | /v1/classify | Returns spam_score and verdict for a given text. |
| POST | /v1/feedback | User marks an email as “Not Spam” (updates counters). |
Response example:

```json
{
  "verdict": "SPAM",
  "confidence": 0.998,
  "top_features": ["winner", "free", "cash"]
}
```
5. Database Design
For training, we need to store the raw counts. A simple Key-Value store (Redis) or Columnar Store (Cassandra) works well.
Schema: WordCounts
| Key | Value (Spam) | Value (Ham) |
|---|---|---|
| word:money | 14,050 | 200 |
| word:meeting | 5 | 8,900 |
We also need global counters: TotalSpamEmails and TotalHamEmails to calculate Priors.
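A minimal sketch of this counter schema, using a plain dict as a stand-in for Redis hashes (the `word:<w>` key names and the two global counters mirror the schema above; the increments simulate HINCRBY/INCR-style updates):

```python
from collections import defaultdict

# Stand-in for a Redis hash per word: key -> {"spam": n, "ham": n}
word_counts = defaultdict(lambda: {"spam": 0, "ham": 0})
globals_ = {"TotalSpamEmails": 0, "TotalHamEmails": 0}

def record_email(words, label):
    """Increment per-word and global counters (HINCRBY/INCR semantics)."""
    cls = "spam" if label == "SPAM" else "ham"
    for w in words:
        word_counts[f"word:{w}"][cls] += 1
    globals_["TotalSpamEmails" if cls == "spam" else "TotalHamEmails"] += 1

record_email(["money", "now"], "SPAM")
record_email(["meeting"], "HAM")
```

In production the same writes would go through a Redis pipeline so the Training Worker can batch updates.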
6. High-Level Design
- Client sends email to Load Balancer.
- Inference Service tokenizes text and queries the Model Cache (In-Memory).
- Model Cache holds the pre-calculated Log-Probabilities.
- Feedback Loop: When users click “Report Spam”, events are sent to a Kafka topic.
- Training Worker consumes Kafka, updates DB, and recalculates Log-Probs to update the Cache.
7. Component Design: The Algorithm
Our Goal: Calculate the probability that a message is Spam given its words w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>.
7.1 Bayes’ Theorem
P(Spam | Message) = [ P(Message | Spam) · P(Spam) ] / P(Message)
We can ignore the denominator P(Message) because it is the same for both Spam and Ham calculations. We just compare the numerators.
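Comparing only the numerators is easy to demonstrate with made-up numbers (the priors and likelihoods below are illustrative, not from a real corpus):

```python
# Posterior ∝ Likelihood × Prior; the shared denominator P(Message) cancels.
p_msg_given_spam, p_spam = 0.002, 0.4    # illustrative values
p_msg_given_ham,  p_ham  = 0.0001, 0.6

spam_score = p_msg_given_spam * p_spam   # 0.0008
ham_score  = p_msg_given_ham  * p_ham    # 0.00006

verdict = "SPAM" if spam_score > ham_score else "HAM"
```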
7.2 The “Naive” Independence Assumption
Calculating P(w<sub>1</sub>, w<sub>2</sub> ... | Spam) directly is impossible (we’d need a dataset with every possible sentence).
So, we assume word independence:
P(w<sub>1</sub>, w<sub>2</sub>, …, w<sub>n</sub> | Spam) ≈ P(w<sub>1</sub>|Spam) · P(w<sub>2</sub>|Spam) · … · P(w<sub>n</sub>|Spam)
This is “Naive” because “Artificial” and “Intelligence” are clearly correlated, but we treat them as independent coin flips.
7.3 Engineering Challenge: Arithmetic Underflow
Probabilities are small numbers (e.g., 0.0001). Multiplying hundreds of them results in numbers so small that computers round them to zero.
Solution: Work in Log Space.
log(a · b) = log(a) + log(b)
Our score becomes a linear sum:
Score = log(P(Spam)) + ∑ log(P(wi | Spam))
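The underflow problem is easy to demonstrate: multiplying even a hundred small probabilities collapses to 0.0 in 64-bit floats, while the equivalent log-space sum stays perfectly representable:

```python
import math

probs = [1e-4] * 100           # 100 words, each with P(w|Spam) = 0.0001

product = 1.0
for p in probs:
    product *= p               # 1e-400 is below float64's minimum: underflows to 0.0

log_sum = sum(math.log(p) for p in probs)  # ≈ -921.03, no underflow

print(product)   # 0.0
print(log_sum)
```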
8. Reliability & Edge Cases
The Zero Probability Problem
What if the word “Bitcoin” never appeared in our Spam training set?
P("Bitcoin" | Spam) = 0.
log(0) = -∞.
The entire sum becomes -∞: a single unseen word vetoes all other evidence, and that class can never win, no matter how spammy the rest of the message is.
Solution: Laplace Smoothing
We add a small count (usually α=1) to every word count.
P(w|C) = (count(w, C) + 1) / (count(C) + V)
Where V is the vocabulary size. This ensures no probability is ever strictly zero.
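The formula maps directly to code. This sketch generalizes the +1 to an α parameter (α=1 recovers the formula above); the counts are illustrative:

```python
def smoothed_prob(word_count, class_count, vocab_size, alpha=1):
    """Laplace-smoothed P(w|C): never zero, even for unseen words."""
    return (word_count + alpha) / (class_count + alpha * vocab_size)

# "Bitcoin" never seen in Spam: count = 0, yet the probability stays positive.
p = smoothed_prob(word_count=0, class_count=10_000, vocab_size=50_000)
assert p > 0   # 1/60000, finite in log space
```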
9. System Walkthrough (Dry Run)
Let’s trace the email: “Win Money Now”
Priors:
- P(Spam) = 0.4 → log(0.4) = -0.92
- P(Ham) = 0.6 → log(0.6) = -0.51
Likelihoods (Log-Probs):
- “Win”: P(w|S) = -2.3, P(w|H) = -4.6 (Spammy)
- “Money”: P(w|S) = -2.1, P(w|H) = -5.1 (Spammy)
- “Now”: P(w|S) = -3.0, P(w|H) = -3.0 (Neutral)
Calculation:
- Spam Score: -0.92 + (-2.3) + (-2.1) + (-3.0) = -8.32
- Ham Score: -0.51 + (-4.6) + (-5.1) + (-3.0) = -13.21
Decision:
Since -8.32 > -13.21, we classify as SPAM.
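The dry run can be reproduced verbatim; the log-priors and per-word log-likelihoods below are exactly the ones listed above:

```python
log_prior = {"SPAM": -0.92, "HAM": -0.51}
log_likelihood = {
    "win":   {"SPAM": -2.3, "HAM": -4.6},
    "money": {"SPAM": -2.1, "HAM": -5.1},
    "now":   {"SPAM": -3.0, "HAM": -3.0},
}

def score(words, cls):
    """Log-prior plus the sum of per-word log-likelihoods."""
    return log_prior[cls] + sum(log_likelihood[w][cls] for w in words)

words = ["win", "money", "now"]
spam, ham = score(words, "SPAM"), score(words, "HAM")
verdict = "SPAM" if spam > ham else "HAM"
print(spam, ham, verdict)   # ≈ -8.32, ≈ -13.21, SPAM
```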
Implementation in Python
Building a Naive Bayes classifier from scratch helps understand the internals:
```python
import numpy as np

class NaiveBayes:
    def __init__(self):
        self.spam_counts = {}
        self.ham_counts = {}
        self.total_spam = 0
        self.total_ham = 0
        self.vocab = set()

    def train(self, texts, labels):
        # 1. Count word frequencies per class
        for text, label in zip(texts, labels):
            words = text.lower().split()
            if label == 'SPAM':
                self.total_spam += 1
                for w in words:
                    self.spam_counts[w] = self.spam_counts.get(w, 0) + 1
                    self.vocab.add(w)
            else:
                self.total_ham += 1
                for w in words:
                    self.ham_counts[w] = self.ham_counts.get(w, 0) + 1
                    self.vocab.add(w)

    def predict(self, text):
        words = text.lower().split()

        # 2. Calculate Priors (Log Space)
        total_docs = self.total_spam + self.total_ham
        log_p_spam = np.log(self.total_spam / total_docs)
        log_p_ham = np.log(self.total_ham / total_docs)

        # 3. Add Likelihoods (with Laplace Smoothing)
        vocab_size = len(self.vocab)
        # Denominators for P(w|C):
        # sum of all word counts in the class + vocab size (smoothing)
        total_spam_words = sum(self.spam_counts.values())
        total_ham_words = sum(self.ham_counts.values())

        for w in words:
            if w not in self.vocab:
                continue  # skip words never seen during training
            # P(w|Spam) = (count(w, Spam) + 1) / (TotalSpamWords + V)
            count_spam = self.spam_counts.get(w, 0) + 1
            prob_spam = count_spam / (total_spam_words + vocab_size)
            log_p_spam += np.log(prob_spam)
            # P(w|Ham)
            count_ham = self.ham_counts.get(w, 0) + 1
            prob_ham = count_ham / (total_ham_words + vocab_size)
            log_p_ham += np.log(prob_ham)

        return 'SPAM' if log_p_spam > log_p_ham else 'HAM'

# Usage
nb = NaiveBayes()
train_data = ["win money now", "meeting at 5pm", "free cash prize", "project report due"]
train_labels = ["SPAM", "HAM", "SPAM", "HAM"]
nb.train(train_data, train_labels)
print(f"Prediction: {nb.predict('win free cash')}")  # Output: SPAM
```
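In practice, a production baseline would typically use scikit-learn's `MultinomialNB`, which implements the same math (bag-of-words counts, Laplace smoothing via its `alpha` parameter, log-space scoring). A minimal sketch, assuming scikit-learn is installed:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_data = ["win money now", "meeting at 5pm", "free cash prize", "project report due"]
train_labels = ["SPAM", "HAM", "SPAM", "HAM"]

# CountVectorizer builds the word counts; alpha=1.0 is Laplace smoothing.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(train_data, train_labels)
print(model.predict(["win free cash"])[0])  # SPAM
```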
10. Visualizing It: The Log-Likelihood Tug of War
A useful mental model: classification is a “tug of war” over the cumulative log-score, fought word by word.
- Spammy words push the score towards SPAM (right).
- Hammy words push the score towards HAM (left).
- The cumulative score is the rope: whichever side it ends up on decides the verdict.
11. Summary
- Naive Bayes is the “Hello World” of NLP, but powerful enough for production baselines.
- Log-Space Arithmetic is mandatory for probabilistic ML to prevent underflow.
- Laplace Smoothing is the standard fix for the “Zero Frequency” bug.