Case Study: Naive Bayes Spam Classifier

1. What is Naive Bayes?

Before the era of Transformers (BERT, GPT), the internet was saved from drowning in spam by a simple, elegant algorithm: Naive Bayes. Even today, it remains a baseline for text classification due to its speed and interpretability.

In this case study, we will design a Spam Filtering System that classifies emails as either Spam (Junk) or Ham (Legitimate).

Try it yourself: Open your email’s “Spam” folder. Notice how it catches “Urgent: You won $1,000,000” but lets “Meeting at 5pm” through. We will build the math engine behind this.


2. Requirements & Goals

Functional Requirements

  1. Classify: Given an email text, predict P(Spam | Text).
  2. Explain: The system must be interpretable (Why was this marked as spam?).
  3. Learn: The system should update its probabilities as new spam trends emerge.

Non-Functional Requirements

  1. Latency: Inference must be under 10ms per email (Real-time).
  2. Precision: We must minimize False Positives. Marking a legitimate job offer as spam is worse than letting a spam email through.
  3. Scalability: Must handle billions of emails per day (tens of thousands of emails per second).

3. Capacity Estimation

  • Traffic: 1 Billion emails/day ≈ 12,000 emails/sec.
  • Vocabulary: English has ~170,000 words. We might track top 50,000 features.
  • Storage:
    • We store counts Count(Word | Spam) and Count(Word | Ham).
    • 50,000 words × 2 classes × 4 bytes (int) ≈ 400 KB.
    • Conclusion: The entire model fits comfortably in the CPU cache (L2/L3)! This is why Naive Bayes is blazingly fast.
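
A quick sanity check of these estimates in code (the numbers are the assumptions above, not measurements):

```python
# Back-of-the-envelope check of the capacity estimates above.
EMAILS_PER_DAY = 1_000_000_000
VOCAB_SIZE = 50_000        # tracked features
NUM_CLASSES = 2            # Spam, Ham
BYTES_PER_COUNT = 4        # 32-bit integer counter

qps = EMAILS_PER_DAY / 86_400
model_kb = VOCAB_SIZE * NUM_CLASSES * BYTES_PER_COUNT / 1024

print(f"Traffic: ~{qps:,.0f} emails/sec")   # ~11,574 emails/sec
print(f"Model size: ~{model_kb:.0f} KB")    # ~391 KB
```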

4. System APIs

We define a simple REST API for our inference service.

| Method | Endpoint | Description |
| :--- | :--- | :--- |
| POST | /v1/classify | Returns spam_score and verdict for a given text. |
| POST | /v1/feedback | User marks an email as “Not Spam” (updates counters). |
// Response Example
{
  "verdict": "SPAM",
  "confidence": 0.998,
  "top_features": ["winner", "free", "cash"]
}
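
A hypothetical client call might look like this (the host is a placeholder and the `text` request field is an assumption; the response fields match the example above):

```python
# Hypothetical call to the classify endpoint. The host and the "text"
# request field are illustrative; response fields match the example above.
import requests

resp = requests.post(
    "https://spam-filter.example.com/v1/classify",
    json={"text": "Congratulations! You have won $1,000,000. Claim now!"},
    timeout=1.0,
)
result = resp.json()
print(result["verdict"], result["confidence"])   # e.g. SPAM 0.998
```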

5. Database Design

For training, we need to store the raw counts. A simple Key-Value store (Redis) or Columnar Store (Cassandra) works well.

Schema: WordCounts

| Key | Value (Spam) | Value (Ham) |
| :--- | :--- | :--- |
| word:money | 14,050 | 200 |
| word:meeting | 5 | 8,900 |

We also need global counters: TotalSpamEmails and TotalHamEmails to calculate Priors.
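
With Redis, each word maps naturally onto a hash holding its spam and ham counts, plus two plain keys for the global priors. A minimal update sketch using the standard redis-py client (key names follow the schema above):

```python
# Sketch: incrementing counters in Redis when a labeled email arrives.
# Keys follow the WordCounts schema above: word:<token> -> {spam, ham}.
import redis

r = redis.Redis(host="localhost", port=6379)

def record_email(tokens, label):
    """label is 'spam' or 'ham' (e.g. from the /v1/feedback endpoint)."""
    for token in set(tokens):                  # count each word once per email
        r.hincrby(f"word:{token}", label, 1)
    r.incr("TotalSpamEmails" if label == "spam" else "TotalHamEmails")

record_email(["win", "money", "now"], "spam")
```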


6. High-Level Design

  1. Client sends email to Load Balancer.
  2. Inference Service tokenizes text and queries the Model Cache (In-Memory).
  3. Model Cache holds the pre-calculated Log-Probabilities.
  4. Feedback Loop: When users click “Report Spam”, events are sent to a Kafka topic.
  5. Training Worker consumes Kafka, updates DB, and recalculates Log-Probs to update the Cache.
[Diagram: Client → Load Balancer → Spam Service → Model Cache (in-memory); POST /classify for inference; user reports flow via Kafka to the Trainer, which updates Redis/DB and asynchronously refreshes the cache.]
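
The feedback path (steps 4–5) can be a small worker process. A sketch using the kafka-python client and the `record_email` helper from Section 5 (the topic name and event fields are assumptions):

```python
# Sketch of the Training Worker: consume "Report Spam" / "Not Spam" events
# and update the counters. Topic name and event schema are assumptions.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "spam-feedback",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw),
)

for event in consumer:
    payload = event.value                     # e.g. {"tokens": [...], "label": "spam"}
    record_email(payload["tokens"], payload["label"])
    # Periodically recompute log-probabilities and push them to the Model Cache.
```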

7. Component Design: The Algorithm

Our Goal: Calculate the probability that a message is Spam given its words w1, w2, ..., wn.

7.1 Bayes’ Theorem

P(Spam | Message) = [ P(Message | Spam) · P(Spam) ] / P(Message)

We can ignore the denominator P(Message) because it is the same for both Spam and Ham calculations. We just compare the numerators.

7.2 The “Naive” Independence Assumption

Calculating P(w1, w2, ..., wn | Spam) directly is impossible (we’d need a dataset with every possible sentence). So, we assume word independence:

P(w1, w2, ..., wn | Spam) ≈ P(w1 | Spam) · P(w2 | Spam) · … · P(wn | Spam)

This is “Naive” because “Artificial” and “Intelligence” are clearly correlated, but we treat them as independent coin flips.
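
A direct (non-log) implementation of this factorization is only a few lines, assuming the per-word likelihoods have already been estimated from the counts:

```python
# Direct Naive Bayes score: prior times the product of per-word likelihoods.
# prior maps class -> P(class); p_word maps (word, cls) -> P(word | cls).
def naive_score(words, cls, prior, p_word):
    score = prior[cls]
    for w in words:
        score *= p_word[(w, cls)]
    return score

# Verdict: compare the two unnormalized scores; the shared denominator
# P(Message) cancels, exactly as argued in Section 7.1.
# verdict = "spam" if naive_score(words, "spam", prior, p_word) > naive_score(words, "ham", prior, p_word) else "ham"
```

As the next subsection explains, this direct product underflows on long messages, which is why real implementations sum logarithms instead.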

7.3 Engineering Challenge: Arithmetic Underflow

Probabilities are small numbers (e.g., 0.0001). Multiplying hundreds of them results in numbers so small that computers round them to zero. Solution: Work in Log Space.

log(a · b) = log(a) + log(b)

Our score becomes a linear sum:

Score = log(P(Spam)) + ∑ log(P(wi | Spam))
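
The same scorer in log space, assuming the probabilities have been pre-converted to log-probabilities:

```python
# Log-space Naive Bayes: a sum of logs instead of a product of probabilities.
# log_prior maps class -> log P(class); log_p_word maps (word, cls) -> log P(word | cls).
def log_score(words, cls, log_prior, log_p_word):
    return log_prior[cls] + sum(log_p_word[(w, cls)] for w in words)

def classify(words, log_prior, log_p_word):
    # The class with the higher (less negative) score wins.
    return max(("spam", "ham"), key=lambda c: log_score(words, c, log_prior, log_p_word))
```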

8. Reliability & Edge Cases

The Zero Probability Problem

What if the word “Bitcoin” never appeared in our Spam training set? Then P("Bitcoin" | Spam) = 0, and log(0) = -∞. The entire sum becomes -∞, so a single unseen word outweighs all other evidence and rules out the Spam class entirely.

Solution: Laplace Smoothing. We add a small count (usually α = 1) to every word count:

P(w | C) = (count(w, C) + 1) / (count(C) + V)

Where V is the vocabulary size. This ensures no probability is ever strictly zero.
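
A sketch of precomputing the smoothed log-probabilities from raw counts (here count(C) is taken to be the total number of word tokens observed in class C):

```python
import math

# Laplace-smoothed log-probabilities for one class.
# word_counts maps (word, cls) -> raw count; vocab is the set of tracked words.
def smoothed_log_probs(word_counts, vocab, cls, alpha=1):
    total = sum(word_counts.get((w, cls), 0) for w in vocab)
    return {
        (w, cls): math.log((word_counts.get((w, cls), 0) + alpha) / (total + alpha * len(vocab)))
        for w in vocab
    }
```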


9. System Walkthrough (Dry Run)

Let’s trace the email: “Win Money Now”

Priors:

  • P(Spam) = 0.4, so log(0.4) = -0.92
  • P(Ham) = 0.6, so log(0.6) = -0.51

Likelihoods (Log-Probs):

  • “Win”: log P(w|S) = -2.3, log P(w|H) = -4.6 (Spammy)
  • “Money”: log P(w|S) = -2.1, log P(w|H) = -5.1 (Spammy)
  • “Now”: log P(w|S) = -3.0, log P(w|H) = -3.0 (Neutral)

Calculation:

  1. Spam Score: -0.92 + (-2.3) + (-2.1) + (-3.0) = -8.32
  2. Ham Score: -0.51 + (-4.6) + (-5.1) + (-3.0) = -13.21

Decision: Since -8.32 > -13.21, we classify as SPAM.
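
The same trace, run through the log-space helper from Section 7.3 with these illustrative numbers:

```python
# Reproducing the dry run with the illustrative log-probabilities above.
log_prior = {"spam": -0.92, "ham": -0.51}
log_p_word = {
    ("win", "spam"): -2.3,   ("win", "ham"): -4.6,
    ("money", "spam"): -2.1, ("money", "ham"): -5.1,
    ("now", "spam"): -3.0,   ("now", "ham"): -3.0,
}

words = ["win", "money", "now"]
spam_score = log_prior["spam"] + sum(log_p_word[(w, "spam")] for w in words)  # -8.32
ham_score = log_prior["ham"] + sum(log_p_word[(w, "ham")] for w in words)     # -13.21
print("SPAM" if spam_score > ham_score else "HAM")                            # SPAM
```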


10. Interactive Visualizer: The Log-Likelihood Tug of War

Type a message below to see how the model “fights” to classify it.

  • Red Bars: Push score towards SPAM (Right).
  • Green Bars: Push score towards HAM (Left).
  • The Rope: The cumulative score. If it crosses the center line, the verdict flips.
[Interactive widget: HAM evidence bar, SPAM evidence bar, cumulative score, and live prediction.]

11. Summary

  • Naive Bayes is the “Hello World” of NLP, but powerful enough for production baselines.
  • Log-Space Arithmetic is mandatory for probabilistic ML to prevent underflow.
  • Laplace Smoothing is the standard fix for the “Zero Frequency” bug.

Next: Module Review →