Case Study: Naive Bayes Spam Classifier

1. What is Naive Bayes?

Before the era of Transformers (BERT, GPT), the internet was saved from drowning in spam by a simple, elegant algorithm: Naive Bayes. Even today, it remains a baseline for text classification due to its speed and interpretability.

In this case study, we will design a Spam Filtering System that classifies emails as either Spam (Junk) or Ham (Legitimate).

Try it yourself: Open your email’s “Spam” folder. Notice how it catches “Urgent: You won $1,000,000” but lets “Meeting at 5pm” through. We will build the math engine behind this.


2. Requirements & Goals

Functional Requirements

  1. Classify: Given an email text, predict P(Spam | Text).
  2. Explain: The system must be interpretable (Why was this marked as spam?).
  3. Learn: The system should update its probabilities as new spam trends emerge.

Non-Functional Requirements

  1. Latency: Inference must be under 10ms per email (Real-time).
  2. Precision: We must minimize False Positives. Marking a legitimate job offer as spam is worse than letting a spam email through.
  3. Scalability: Must handle billions of emails per day (tens of thousands of emails per second).

3. Capacity Estimation

  • Traffic: 1 Billion emails/day ≈ 12,000 emails/sec.
  • Vocabulary: English has ~170,000 words. We might track top 50,000 features.
  • Storage:
    • We store counts Count(Word | Spam) and Count(Word | Ham).
    • 50,000 words × 2 classes × 4 bytes (int) ≈ 400 KB.
    • Conclusion: The entire model fits comfortably in the CPU cache (L2/L3)! This is why Naive Bayes is blazingly fast.
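
A quick sanity check of these estimates in code (the numbers are the assumptions above, not measurements):

```python
# Back-of-the-envelope check of the capacity estimates above.
EMAILS_PER_DAY = 1_000_000_000
VOCAB_SIZE = 50_000        # tracked features
NUM_CLASSES = 2            # Spam, Ham
BYTES_PER_COUNT = 4        # 32-bit integer counter

qps = EMAILS_PER_DAY / 86_400
model_kb = VOCAB_SIZE * NUM_CLASSES * BYTES_PER_COUNT / 1024

print(f"Traffic: ~{qps:,.0f} emails/sec")   # ~11,574 emails/sec
print(f"Model size: ~{model_kb:.0f} KB")    # ~391 KB
```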

4. System APIs

We define a simple REST API for our inference service.

| Method | Endpoint | Description |
| :--- | :--- | :--- |
| POST | /v1/classify | Returns spam_score and verdict for a given text. |
| POST | /v1/feedback | User marks an email as “Not Spam” (updates counters). |
// Response Example
{
  "verdict": "SPAM",
  "confidence": 0.998,
  "top_features": ["winner", "free", "cash"]
}
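
A hypothetical client call might look like this (the host is a placeholder and the `text` request field is an assumption; the response fields match the example above):

```python
# Hypothetical call to the classify endpoint. The host and the "text"
# request field are illustrative; response fields match the example above.
import requests

resp = requests.post(
    "https://spam-filter.example.com/v1/classify",
    json={"text": "Congratulations! You have won $1,000,000. Claim now!"},
    timeout=1.0,
)
result = resp.json()
print(result["verdict"], result["confidence"])   # e.g. SPAM 0.998
```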

5. Database Design

For training, we need to store the raw counts. A simple Key-Value store (Redis) or Columnar Store (Cassandra) works well.

Schema: WordCounts

| Key | Value (Spam) | Value (Ham) |
| :--- | :--- | :--- |
| word:money | 14,050 | 200 |
| word:meeting | 5 | 8,900 |

We also need global counters: TotalSpamEmails and TotalHamEmails to calculate Priors.
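
With Redis, each word maps naturally onto a hash holding its spam and ham counts, plus two plain keys for the global priors. A minimal update sketch using the standard redis-py client (key names follow the schema above):

```python
# Sketch: incrementing counters in Redis when a labeled email arrives.
# Keys follow the WordCounts schema above: word:<token> -> {spam, ham}.
import redis

r = redis.Redis(host="localhost", port=6379)

def record_email(tokens, label):
    """label is 'spam' or 'ham' (e.g. from the /v1/feedback endpoint)."""
    for token in set(tokens):                  # count each word once per email
        r.hincrby(f"word:{token}", label, 1)
    r.incr("TotalSpamEmails" if label == "spam" else "TotalHamEmails")

record_email(["win", "money", "now"], "spam")
```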


6. High-Level Design

  1. Client sends email to Load Balancer.
  2. Inference Service tokenizes text and queries the Model Cache (In-Memory).
  3. Model Cache holds the pre-calculated Log-Probabilities.
  4. Feedback Loop: When users click “Report Spam”, events are sent to a Kafka topic.
  5. Training Worker consumes Kafka, updates DB, and recalculates Log-Probs to update the Cache.
[Diagram: Client → Load Balancer → Spam Service → Model Cache (in-memory); POST /classify for inference; user reports flow via Kafka to the Trainer, which updates Redis/DB and asynchronously refreshes the cache.]
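
The feedback path (steps 4–5) can be a small worker process. A sketch using the kafka-python client and the `record_email` helper from Section 5 (the topic name and event fields are assumptions):

```python
# Sketch of the Training Worker: consume "Report Spam" / "Not Spam" events
# and update the counters. Topic name and event schema are assumptions.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "spam-feedback",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw),
)

for event in consumer:
    payload = event.value                     # e.g. {"tokens": [...], "label": "spam"}
    record_email(payload["tokens"], payload["label"])
    # Periodically recompute log-probabilities and push them to the Model Cache.
```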

7. Component Design: The Algorithm

Our Goal: Calculate the probability that a message is Spam given its words w1, w2, ..., wn.

7.1 Bayes’ Theorem

P(Spam | Message) = [ P(Message | Spam) · P(Spam) ] / P(Message)

We can ignore the denominator P(Message) because it is the same for both Spam and Ham calculations. We just compare the numerators.

7.2 The “Naive” Independence Assumption

Calculating P(w1, w2, ..., wn | Spam) directly is impossible (we’d need a dataset with every possible sentence). So, we assume word independence:

P(w1, w2, ..., wn | Spam) ≈ P(w1 | Spam) · P(w2 | Spam) · … · P(wn | Spam)

This is “Naive” because “Artificial” and “Intelligence” are clearly correlated, but we treat them as independent coin flips.
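
A direct (non-log) implementation of this factorization is only a few lines, assuming the per-word likelihoods have already been estimated from the counts:

```python
# Direct Naive Bayes score: prior times the product of per-word likelihoods.
# prior maps class -> P(class); p_word maps (word, cls) -> P(word | cls).
def naive_score(words, cls, prior, p_word):
    score = prior[cls]
    for w in words:
        score *= p_word[(w, cls)]
    return score

# Verdict: compare the two unnormalized scores; the shared denominator
# P(Message) cancels, exactly as argued in Section 7.1.
# verdict = "spam" if naive_score(words, "spam", prior, p_word) > naive_score(words, "ham", prior, p_word) else "ham"
```

As the next subsection explains, this direct product underflows on long messages, which is why real implementations sum logarithms instead.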

7.3 Engineering Challenge: Arithmetic Underflow

Probabilities are small numbers (e.g., 0.0001). Multiplying hundreds of them results in numbers so small that computers round them to zero. Solution: Work in Log Space.

log(a · b) = log(a) + log(b)

Our score becomes a linear sum:

Score = log(P(Spam)) + ∑ log(P(wi | Spam))
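
The same scorer in log space, assuming the probabilities have been pre-converted to log-probabilities:

```python
# Log-space Naive Bayes: a sum of logs instead of a product of probabilities.
# log_prior maps class -> log P(class); log_p_word maps (word, cls) -> log P(word | cls).
def log_score(words, cls, log_prior, log_p_word):
    return log_prior[cls] + sum(log_p_word[(w, cls)] for w in words)

def classify(words, log_prior, log_p_word):
    # The class with the higher (less negative) score wins.
    return max(("spam", "ham"), key=lambda c: log_score(words, c, log_prior, log_p_word))
```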

8. Reliability & Edge Cases

The Zero Probability Problem

What if the word “Bitcoin” never appeared in our Spam training set? Then P("Bitcoin" | Spam) = 0, and log(0) = -∞. The entire sum becomes -∞, so a single unseen word outweighs all other evidence and rules out the Spam class entirely.

Solution: Laplace Smoothing. We add a small count (usually α = 1) to every word count:

P(w | C) = (count(w, C) + 1) / (count(C) + V)

Where V is the vocabulary size. This ensures no probability is ever strictly zero.
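
A sketch of precomputing the smoothed log-probabilities from raw counts (here count(C) is taken to be the total number of word tokens observed in class C):

```python
import math

# Laplace-smoothed log-probabilities for one class.
# word_counts maps (word, cls) -> raw count; vocab is the set of tracked words.
def smoothed_log_probs(word_counts, vocab, cls, alpha=1):
    total = sum(word_counts.get((w, cls), 0) for w in vocab)
    return {
        (w, cls): math.log((word_counts.get((w, cls), 0) + alpha) / (total + alpha * len(vocab)))
        for w in vocab
    }
```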


9. System Walkthrough (Dry Run)

Let’s trace the email: “Win Money Now”

Priors:

  • P(Spam) = 0.4, so log(0.4) = -0.92
  • P(Ham) = 0.6, so log(0.6) = -0.51

Likelihoods (Log-Probs):

  • “Win”: log P(w|S) = -2.3, log P(w|H) = -4.6 (Spammy)
  • “Money”: log P(w|S) = -2.1, log P(w|H) = -5.1 (Spammy)
  • “Now”: log P(w|S) = -3.0, log P(w|H) = -3.0 (Neutral)

Calculation:

  1. Spam Score: -0.92 + (-2.3) + (-2.1) + (-3.0) = -8.32
  2. Ham Score: -0.51 + (-4.6) + (-5.1) + (-3.0) = -13.21

Decision: Since -8.32 > -13.21, we classify as SPAM.
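
The same trace, run through the log-space helper from Section 7.3 with these illustrative numbers:

```python
# Reproducing the dry run with the illustrative log-probabilities above.
log_prior = {"spam": -0.92, "ham": -0.51}
log_p_word = {
    ("win", "spam"): -2.3,   ("win", "ham"): -4.6,
    ("money", "spam"): -2.1, ("money", "ham"): -5.1,
    ("now", "spam"): -3.0,   ("now", "ham"): -3.0,
}

words = ["win", "money", "now"]
spam_score = log_prior["spam"] + sum(log_p_word[(w, "spam")] for w in words)  # -8.32
ham_score = log_prior["ham"] + sum(log_p_word[(w, "ham")] for w in words)     # -13.21
print("SPAM" if spam_score > ham_score else "HAM")                            # SPAM
```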


10. Interactive Visualizer: The Log-Likelihood Tug of War

Type a message below to see how the model “fights” to classify it.

  • Red Bars: Push score towards SPAM (Right).
  • Green Bars: Push score towards HAM (Left).
  • The Rope: The cumulative score. If it crosses the center line, the verdict flips.
[Interactive widget: HAM evidence bar, SPAM evidence bar, cumulative score, and live prediction.]

11. Summary

  • Naive Bayes is the “Hello World” of NLP, but powerful enough for production baselines.
  • Log-Space Arithmetic is mandatory for probabilistic ML to prevent underflow.
  • Laplace Smoothing is the standard fix for the “Zero Frequency” bug.

Next: Module Review →