Case Study: Naive Bayes Spam Classifier

[!NOTE] This module explores the core principles behind a Naive Bayes spam classifier, deriving the design from first principles and hardware constraints to build production-ready expertise.

1. What is Naive Bayes?

Before the era of Transformers (BERT, GPT), the internet was saved from drowning in spam by a simple, elegant algorithm: Naive Bayes. Even today, it remains a baseline for text classification due to its speed and interpretability.

In this case study, we will design a Spam Filtering System that classifies emails as either Spam (Junk) or Ham (Legitimate).

[!TIP] Try it yourself: Open your email’s “Spam” folder. Notice how it catches “Urgent: You won $1,000,000” but lets “Meeting at 5pm” through. We will build the math engine behind this.


2. Requirements & Goals

Functional Requirements

  1. Classify: Given an email text, predict P(Spam | Text).
  2. Explain: The system must be interpretable (Why was this marked as spam?).
  3. Learn: The system should update its probabilities as new spam trends emerge.

Non-Functional Requirements

  1. Latency: Inference must be under 10ms per email (Real-time).
  2. Precision: We must minimize False Positives. Marking a legitimate job offer as spam is worse than letting a spam email through (a thresholding sketch follows this list).
  3. Scalability: Must scale horizontally to handle billions of emails per day.
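To satisfy the precision goal, the decision rule does not have to be a bare "greater than" comparison. A common trick (an assumption here, not part of the stated requirements) is to demand that the spam score beat the ham score by a margin before flagging a message:

import math

def verdict(log_spam_score: float, log_ham_score: float,
            margin: float = math.log(3)) -> str:
    # Flag as SPAM only if spam evidence exceeds ham evidence by the margin.
    # margin = log(3) means "spam must be ~3x more likely than ham",
    # trading a little recall for fewer false positives.
    return "SPAM" if (log_spam_score - log_ham_score) > margin else "HAM"

# Scores from the dry run in Section 9:
print(verdict(-8.32, -13.21))   # SPAM (gap 4.89 > log(3) ≈ 1.10)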

3. Capacity Estimation

  • Traffic: 1 Billion emails/day ≈ 12,000 emails/sec.
  • Vocabulary: English has ~170,000 words; we might track the top 50,000 as features.
  • Storage:
  • We store counts Count(Word | Spam) and Count(Word | Ham).
  • 50,000 words × 2 classes × 4 bytes (int) ≈ 400 KB.
  • Conclusion: The entire model fits comfortably in a CPU's on-chip cache (L2/L3) and trivially in RAM. This is why Naive Bayes is blazingly fast. It requires no complex Sharding (see System Design Module 07). A quick sanity check of these numbers follows below.
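A back-of-the-envelope check of the estimates above (a minimal sketch; the 4-byte counter size is the assumption from the estimate itself):

EMAILS_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 86_400
print(f"{EMAILS_PER_DAY / SECONDS_PER_DAY:,.0f} emails/sec")   # ~11,574 emails/sec

VOCAB_SIZE = 50_000        # tracked features
NUM_CLASSES = 2            # Spam, Ham
BYTES_PER_COUNT = 4        # 32-bit integer counter
model_bytes = VOCAB_SIZE * NUM_CLASSES * BYTES_PER_COUNT
print(f"{model_bytes / 1024:.0f} KB")                          # ~391 KB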

4. System APIs

We define a simple REST API for our inference service.

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /v1/classify | Returns spam_score and verdict for a given text. |
| POST | /v1/feedback | User marks an email as “Not Spam” (updates counters). |
// Response Example
{
  "verdict": "SPAM",
  "confidence": 0.998,
  "top_features": ["winner", "free", "cash"]
}
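A minimal sketch of what the /v1/classify endpoint could look like. FastAPI is assumed here purely for illustration, and DummyModel is a stand-in for the trained in-memory Naive Bayes model implemented later in this module:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ClassifyRequest(BaseModel):
    text: str

class ClassifyResponse(BaseModel):
    verdict: str

# Stand-in for the real model, which would be loaded into memory at startup.
class DummyModel:
    def predict(self, text: str) -> str:
        return "SPAM" if "win" in text.lower() else "HAM"

model = DummyModel()

@app.post("/v1/classify", response_model=ClassifyResponse)
def classify(req: ClassifyRequest) -> ClassifyResponse:
    # Scoring is a pure in-memory lookup-and-sum, which is how the
    # <10ms latency target is met.
    return ClassifyResponse(verdict=model.predict(req.text))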

5. Database Design

For training, we need to store the raw counts. A simple Key-Value store (Redis) or Wide-Column store (Cassandra) works well.

Schema: WordCounts

| Key | Value (Spam) | Value (Ham) |
| --- | --- | --- |
| word:money | 14,050 | 200 |
| word:meeting | 5 | 8,900 |

We also need global counters: TotalSpamEmails and TotalHamEmails to calculate Priors.
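A sketch of how these counters could be maintained in Redis; the key names (word:*, total:spam, total:ham) are illustrative choices, not a prescribed schema:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_email(words: list[str], label: str) -> None:
    # Increment per-word and global counters for one labelled email.
    pipe = r.pipeline()                        # batch writes into one round trip
    for w in words:
        pipe.hincrby(f"word:{w}", label, 1)    # e.g. HINCRBY word:money spam 1
    pipe.incr(f"total:{label}")                # TotalSpamEmails / TotalHamEmails
    pipe.execute()

record_email(["win", "money", "now"], "spam")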


6. High-Level Design

  1. Client sends email to Load Balancer.
  2. Inference Service tokenizes text and queries the Model Cache (In-Memory).
  3. Model Cache holds the pre-calculated Log-Probabilities.
  4. Feedback Loop: When users click “Report Spam”, events are sent to a Kafka topic.
  5. Training Worker consumes Kafka, updates the DB, and recalculates Log-Probs to update the Cache (a sketch of this loop follows the diagram below).
Diagram: Client → Load Balancer → Spam Service (in-memory Model Cache); user reports flow to Kafka (Feedback), the Trainer consumes them, writes to Redis / DB, and asynchronously updates the cache.
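A sketch of the feedback path (step 5). The topic name, payload shape, and client library (kafka-python) are assumptions for illustration:

import json
from kafka import KafkaConsumer   # kafka-python
import redis

r = redis.Redis(decode_responses=True)
consumer = KafkaConsumer(
    "spam-feedback",                              # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for event in consumer:
    # Assumed payload: {"text": "...", "label": "spam" | "ham"}
    feedback = event.value
    label = feedback["label"]
    for w in feedback["text"].lower().split():
        r.hincrby(f"word:{w}", label, 1)          # same counters as Section 5
    r.incr(f"total:{label}")
    # A periodic job would then recompute log-probabilities and push
    # them to the in-memory Model Cache.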

7. Component Design: The Algorithm

Our Goal: Calculate the probability that a message is Spam given its words w1, w2, ..., wn.

7.1 Bayes’ Theorem

P(Spam | Message) = [ P(Message | Spam) · P(Spam) ] / P(Message)

We can ignore the denominator P(Message) because it is the same for both Spam and Ham calculations. We just compare the numerators.

7.2 The “Naive” Independence Assumption

Calculating P(w1, w2, ..., wn | Spam) directly is impossible (we’d need a dataset with every possible sentence). So, we assume word independence:

P(w1, w2, …, wn | Spam) ≈ P(w1|Spam) · P(w2|Spam) · … · P(wn|Spam)

This is “Naive” because “Artificial” and “Intelligence” are clearly correlated, but we treat them as independent coin flips.

7.3 Engineering Challenge: Arithmetic Underflow

Probabilities are small numbers (e.g., 0.0001). Multiplying hundreds of them results in numbers so small that computers round them to zero. Solution: Work in Log Space.

log(a · b) = log(a) + log(b)

Our score becomes a linear sum:

Score = log(P(Spam)) + ∑ log(P(wi | Spam))
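A quick demonstration of the underflow problem and the log-space fix (the 0.0001 likelihood and the 200-word length are just illustrative numbers):

import numpy as np

p = 0.0001                       # a typical per-word likelihood
probs = np.full(200, p)          # pretend this is a 200-word email

print(np.prod(probs))            # 0.0 -- underflows to zero in float64
print(np.sum(np.log(probs)))     # ≈ -1842.07 -- perfectly representable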


8. Reliability & Edge Cases

The Zero Probability Problem

What if the word “Bitcoin” never appeared in our Spam training set? Then P("Bitcoin" | Spam) = 0 and log(0) = -∞. The entire sum becomes -∞, and the verdict is broken no matter what the other words say.

Solution: Laplace Smoothing. We add a small count (usually α = 1) to every word count:

P(w|C) = (count(w, C) + 1) / (count(C) + V)

where count(C) is the total number of word occurrences in class C and V is the vocabulary size. This ensures no probability is ever strictly zero.
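A small sketch of the smoothed estimate in action; the counts below are invented for illustration:

import numpy as np

spam_counts = {"winner": 900, "cash": 650}    # "bitcoin" never seen in Spam
total_spam_words = 10_000                     # total word occurrences in Spam
V = 50_000                                    # vocabulary size

def log_p_word_given_spam(word: str, alpha: float = 1.0) -> float:
    count = spam_counts.get(word, 0)
    return np.log((count + alpha) / (total_spam_words + alpha * V))

print(log_p_word_given_spam("winner"))    # ≈ -4.20
print(log_p_word_given_spam("bitcoin"))   # ≈ -11.00: tiny, but finite -- no -∞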


9. System Walkthrough (Dry Run)

Let’s trace the email: “Win Money Now”

Priors:

  • P(Spam) = 0.4 → log(0.4) = -0.92
  • P(Ham) = 0.6 → log(0.6) = -0.51

Likelihoods (Log-Probs):

  • “Win”: log P(w|S) = -2.3, log P(w|H) = -4.6 (Spammy)
  • “Money”: log P(w|S) = -2.1, log P(w|H) = -5.1 (Spammy)
  • “Now”: log P(w|S) = -3.0, log P(w|H) = -3.0 (Neutral)

Calculation:

  1. Spam Score: -0.92 + (-2.3) + (-2.1) + (-3.0) = -8.32
  2. Ham Score: -0.51 + (-4.6) + (-5.1) + (-3.0) = -13.21

Decision: Since -8.32 > -13.21, we classify as SPAM.
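A one-line check of the arithmetic above:

spam_score = -0.92 + (-2.3) + (-2.1) + (-3.0)
ham_score  = -0.51 + (-4.6) + (-5.1) + (-3.0)
print(round(spam_score, 2), round(ham_score, 2))    # -8.32 -13.21
print("SPAM" if spam_score > ham_score else "HAM")  # SPAM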

Implementation in Python

Building a Naive Bayes classifier from scratch helps understand the internals:

import numpy as np

class NaiveBayes:
    def __init__(self):
        self.spam_counts = {}
        self.ham_counts = {}
        self.total_spam = 0
        self.total_ham = 0
        self.vocab = set()

    def train(self, texts, labels):
        # 1. Count word frequencies
        for text, label in zip(texts, labels):
            words = text.lower().split()
            if label == 'SPAM':
                self.total_spam += 1
                for w in words:
                    self.spam_counts[w] = self.spam_counts.get(w, 0) + 1
                    self.vocab.add(w)
            else:
                self.total_ham += 1
                for w in words:
                    self.ham_counts[w] = self.ham_counts.get(w, 0) + 1
                    self.vocab.add(w)

    def predict(self, text):
        words = text.lower().split()

        # 2. Calculate Priors (Log Space)
        total_docs = self.total_spam + self.total_ham
        log_p_spam = np.log(self.total_spam / total_docs)
        log_p_ham = np.log(self.total_ham / total_docs)

        # 3. Add Likelihoods (with Laplace Smoothing)
        vocab_size = len(self.vocab)

        # Denominators for P(w|C)
        # Sum of all word counts in Spam class + Vocab Size (Smoothing)
        total_spam_words = sum(self.spam_counts.values())
        total_ham_words = sum(self.ham_counts.values())

        for w in words:
            if w not in self.vocab: continue

            # P(w|Spam) = (count(w, Spam) + 1) / (TotalSpamWords + V)
            count_spam = self.spam_counts.get(w, 0) + 1
            prob_spam = count_spam / (total_spam_words + vocab_size)
            log_p_spam += np.log(prob_spam)

            # P(w|Ham)
            count_ham = self.ham_counts.get(w, 0) + 1
            prob_ham = count_ham / (total_ham_words + vocab_size)
            log_p_ham += np.log(prob_ham)

        return 'SPAM' if log_p_spam > log_p_ham else 'HAM'

# Usage
nb = NaiveBayes()
train_data = ["win money now", "meeting at 5pm", "free cash prize", "project report due"]
train_labels = ["SPAM", "HAM", "SPAM", "HAM"]

nb.train(train_data, train_labels)
print(f"Prediction: {nb.predict('win free cash')}") # Output: SPAM

10. Interactive Visualizer: The Log-Likelihood Tug of War

Type a message below to see how the model “fights” to classify it.

  • Red Bars: Push score towards SPAM (Right).
  • Green Bars: Push score towards HAM (Left).
  • The Rope: The cumulative score. If it crosses the center line, the verdict flips.

11. Summary

  • Naive Bayes is the “Hello World” of NLP, but powerful enough for production baselines.
  • Log-Space Arithmetic is mandatory for probabilistic ML to prevent underflow.
  • Laplace Smoothing is the standard fix for the “Zero Frequency” bug.