Case Study: Naive Bayes Spam Classifier
[!NOTE] This module builds a complete spam filter around Naive Bayes: the probability theory, the engineering trade-offs (underflow, smoothing, caching), and a working implementation.
1. What is Naive Bayes?
Before the era of Transformers (BERT, GPT), the internet was saved from drowning in spam by a simple, elegant algorithm: Naive Bayes. Even today, it remains a baseline for text classification due to its speed and interpretability.
In this case study, we will design a Spam Filtering System that classifies emails as either Spam (Junk) or Ham (Legitimate).
[!TIP] Try it yourself: Open your email’s “Spam” folder. Notice how it catches “Urgent: You won $1,000,000” but lets “Meeting at 5pm” through. We will build the math engine behind this.
2. Requirements & Goals
Functional Requirements
- Classify: Given an email text, predict P(Spam | Text).
- Explain: The system must be interpretable (Why was this marked as spam?).
- Learn: The system should update its probabilities as new spam trends emerge.
Non-Functional Requirements
- Latency: Inference must be under 10ms per email (Real-time).
- Precision: We must minimize False Positives. Marking a legitimate job offer as spam is worse than letting a spam email through.
- Scalability: Must handle billions of emails per day (tens of thousands per second).
3. Capacity Estimation
- Traffic: 1 Billion emails/day ≈ 12,000 emails/sec.
- Vocabulary: English has ~170,000 words. We might track top 50,000 features.
- Storage:
  - We store the counts Count(Word | Spam) and Count(Word | Ham).
  - 50,000 words × 2 classes × 4 bytes (int) ≈ 400 KB.
- Conclusion: The entire model fits in the CPU's L2 cache (400 KB is larger than a typical 32–64 KB L1, but comfortably inside L2). This is why Naive Bayes is blazingly fast. It requires no complex Sharding (see System Design Module 07).
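The storage estimate can be sanity-checked in a few lines (the figures — 50,000 tracked words, 2 classes, 4-byte integer counters — come straight from the estimate above):

```python
# Back-of-envelope check of the model size estimate.
vocab_size = 50_000      # tracked features
num_classes = 2          # Spam, Ham
bytes_per_count = 4      # 32-bit integer counter

model_bytes = vocab_size * num_classes * bytes_per_count
print(f"{model_bytes / 1024:.0f} KB")  # ~391 KB, i.e. roughly 400 KB
```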
4. System APIs
We define a simple REST API for our inference service.
| Method | Endpoint | Description |
|---|---|---|
| POST | /v1/classify | Returns spam_score and verdict for a given text. |
| POST | /v1/feedback | User marks an email as “Not Spam” (updates counters). |
Response example:

```json
{
  "verdict": "SPAM",
  "confidence": 0.998,
  "top_features": ["winner", "free", "cash"]
}
```
5. Database Design
For training, we need to store the raw counts. A simple Key-Value store (Redis) or Columnar Store (Cassandra) works well.
Schema: WordCounts
| Key | Value (Spam) | Value (Ham) |
|---|---|---|
| word:money | 14,050 | 200 |
| word:meeting | 5 | 8,900 |
We also need global counters: TotalSpamEmails and TotalHamEmails to calculate Priors.
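A minimal sketch of this counter schema, using a plain dict as a stand-in for Redis hashes (the `word:<w>` key names and the two global counters mirror the schema above; the increments simulate HINCRBY/INCR-style updates):

```python
from collections import defaultdict

# Stand-in for a Redis hash per word: key -> {"spam": n, "ham": n}
word_counts = defaultdict(lambda: {"spam": 0, "ham": 0})
globals_ = {"TotalSpamEmails": 0, "TotalHamEmails": 0}

def record_email(words, label):
    """Increment per-word and global counters (HINCRBY/INCR semantics)."""
    cls = "spam" if label == "SPAM" else "ham"
    for w in words:
        word_counts[f"word:{w}"][cls] += 1
    globals_["TotalSpamEmails" if cls == "spam" else "TotalHamEmails"] += 1

record_email(["money", "now"], "SPAM")
record_email(["meeting"], "HAM")
```

In production the same writes would go through a Redis pipeline so the Training Worker can batch updates.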
6. High-Level Design
- Client sends email to Load Balancer.
- Inference Service tokenizes text and queries the Model Cache (In-Memory).
- Model Cache holds the pre-calculated Log-Probabilities.
- Feedback Loop: When users click “Report Spam”, events are sent to a Kafka topic.
- Training Worker consumes Kafka, updates DB, and recalculates Log-Probs to update the Cache.
7. Component Design: The Algorithm
Our Goal: Calculate the probability that a message is Spam given its words w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>.
7.1 Bayes’ Theorem
P(Spam | Message) = [ P(Message | Spam) · P(Spam) ] / P(Message)
We can ignore the denominator P(Message) because it is the same for both Spam and Ham calculations. We just compare the numerators.
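Comparing only the numerators is easy to demonstrate with made-up numbers (the priors and likelihoods below are illustrative, not from a real corpus):

```python
# Posterior ∝ Likelihood × Prior; the shared denominator P(Message) cancels.
p_msg_given_spam, p_spam = 0.002, 0.4    # illustrative values
p_msg_given_ham,  p_ham  = 0.0001, 0.6

spam_score = p_msg_given_spam * p_spam   # 0.0008
ham_score  = p_msg_given_ham  * p_ham    # 0.00006

verdict = "SPAM" if spam_score > ham_score else "HAM"
```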
7.2 The “Naive” Independence Assumption
Calculating P(w<sub>1</sub>, w<sub>2</sub> ... | Spam) directly is impossible (we’d need a dataset with every possible sentence).
So, we assume word independence:
P(w<sub>1</sub>, w<sub>2</sub>, …, w<sub>n</sub> | Spam) ≈ P(w<sub>1</sub>|Spam) · P(w<sub>2</sub>|Spam) · … · P(w<sub>n</sub>|Spam)
This is “Naive” because “Artificial” and “Intelligence” are clearly correlated, but we treat them as independent coin flips.
7.3 Engineering Challenge: Arithmetic Underflow
Probabilities are small numbers (e.g., 0.0001). Multiplying hundreds of them results in numbers so small that computers round them to zero.
Solution: Work in Log Space.
log(a · b) = log(a) + log(b)
Our score becomes a linear sum:
Score = log(P(Spam)) + ∑ log(P(wi | Spam))
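The underflow problem is easy to demonstrate: multiplying even a hundred small probabilities collapses to 0.0 in 64-bit floats, while the equivalent log-space sum stays perfectly representable:

```python
import math

probs = [1e-4] * 100           # 100 words, each with P(w|Spam) = 0.0001

product = 1.0
for p in probs:
    product *= p               # 1e-400 is below float64's minimum: underflows to 0.0

log_sum = sum(math.log(p) for p in probs)  # ≈ -921.03, no underflow

print(product)   # 0.0
print(log_sum)
```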
8. Reliability & Edge Cases
The Zero Probability Problem
What if the word “Bitcoin” never appeared in our Spam training set?
P("Bitcoin" | Spam) = 0.
log(0) = -∞.
The entire sum becomes -∞: a single unseen word vetoes all other evidence, and that class can never win, no matter how spammy the rest of the message is.
Solution: Laplace Smoothing
We add a small count (usually α=1) to every word count.
P(w|C) = (count(w, C) + 1) / (count(C) + V)
Where V is the vocabulary size. This ensures no probability is ever strictly zero.
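The formula maps directly to code. This sketch generalizes the +1 to an α parameter (α=1 recovers the formula above); the counts are illustrative:

```python
def smoothed_prob(word_count, class_count, vocab_size, alpha=1):
    """Laplace-smoothed P(w|C): never zero, even for unseen words."""
    return (word_count + alpha) / (class_count + alpha * vocab_size)

# "Bitcoin" never seen in Spam: count = 0, yet the probability stays positive.
p = smoothed_prob(word_count=0, class_count=10_000, vocab_size=50_000)
assert p > 0   # 1/60000, finite in log space
```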
9. System Walkthrough (Dry Run)
Let’s trace the email: “Win Money Now”
Priors:
- P(Spam) = 0.4 → log(0.4) = -0.92
- P(Ham) = 0.6 → log(0.6) = -0.51
Likelihoods (Log-Probs):
- “Win”: P(w|S) = -2.3, P(w|H) = -4.6 (Spammy)
- “Money”: P(w|S) = -2.1, P(w|H) = -5.1 (Spammy)
- “Now”: P(w|S) = -3.0, P(w|H) = -3.0 (Neutral)
Calculation:
- Spam Score: -0.92 + (-2.3) + (-2.1) + (-3.0) = -8.32
- Ham Score: -0.51 + (-4.6) + (-5.1) + (-3.0) = -13.21
Decision:
Since -8.32 > -13.21, we classify as SPAM.
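The dry run can be reproduced verbatim; the log-priors and per-word log-likelihoods below are exactly the ones listed above:

```python
log_prior = {"SPAM": -0.92, "HAM": -0.51}
log_likelihood = {
    "win":   {"SPAM": -2.3, "HAM": -4.6},
    "money": {"SPAM": -2.1, "HAM": -5.1},
    "now":   {"SPAM": -3.0, "HAM": -3.0},
}

def score(words, cls):
    """Log-prior plus the sum of per-word log-likelihoods."""
    return log_prior[cls] + sum(log_likelihood[w][cls] for w in words)

words = ["win", "money", "now"]
spam, ham = score(words, "SPAM"), score(words, "HAM")
verdict = "SPAM" if spam > ham else "HAM"
print(spam, ham, verdict)   # ≈ -8.32, ≈ -13.21, SPAM
```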
Implementation in Python
Building a Naive Bayes classifier from scratch helps understand the internals:
```python
import numpy as np

class NaiveBayes:
    def __init__(self):
        self.spam_counts = {}
        self.ham_counts = {}
        self.total_spam = 0
        self.total_ham = 0
        self.vocab = set()

    def train(self, texts, labels):
        # 1. Count word frequencies per class
        for text, label in zip(texts, labels):
            words = text.lower().split()
            if label == 'SPAM':
                self.total_spam += 1
                for w in words:
                    self.spam_counts[w] = self.spam_counts.get(w, 0) + 1
                    self.vocab.add(w)
            else:
                self.total_ham += 1
                for w in words:
                    self.ham_counts[w] = self.ham_counts.get(w, 0) + 1
                    self.vocab.add(w)

    def predict(self, text):
        words = text.lower().split()

        # 2. Calculate Priors (Log Space)
        total_docs = self.total_spam + self.total_ham
        log_p_spam = np.log(self.total_spam / total_docs)
        log_p_ham = np.log(self.total_ham / total_docs)

        # 3. Add Likelihoods (with Laplace Smoothing)
        vocab_size = len(self.vocab)
        # Denominators for P(w|C):
        # sum of all word counts in the class + vocab size (smoothing)
        total_spam_words = sum(self.spam_counts.values())
        total_ham_words = sum(self.ham_counts.values())

        for w in words:
            if w not in self.vocab:
                continue  # skip words never seen during training
            # P(w|Spam) = (count(w, Spam) + 1) / (TotalSpamWords + V)
            count_spam = self.spam_counts.get(w, 0) + 1
            prob_spam = count_spam / (total_spam_words + vocab_size)
            log_p_spam += np.log(prob_spam)
            # P(w|Ham)
            count_ham = self.ham_counts.get(w, 0) + 1
            prob_ham = count_ham / (total_ham_words + vocab_size)
            log_p_ham += np.log(prob_ham)

        return 'SPAM' if log_p_spam > log_p_ham else 'HAM'

# Usage
nb = NaiveBayes()
train_data = ["win money now", "meeting at 5pm", "free cash prize", "project report due"]
train_labels = ["SPAM", "HAM", "SPAM", "HAM"]
nb.train(train_data, train_labels)
print(f"Prediction: {nb.predict('win free cash')}")  # Output: SPAM
```
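In practice, a production baseline would typically use scikit-learn's `MultinomialNB`, which implements the same math (bag-of-words counts, Laplace smoothing via its `alpha` parameter, log-space scoring). A minimal sketch, assuming scikit-learn is installed:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_data = ["win money now", "meeting at 5pm", "free cash prize", "project report due"]
train_labels = ["SPAM", "HAM", "SPAM", "HAM"]

# CountVectorizer builds the word counts; alpha=1.0 is Laplace smoothing.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(train_data, train_labels)
print(model.predict(["win free cash"])[0])  # SPAM
```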
10. Visualizing It: The Log-Likelihood Tug of War
A useful mental model: classification is a “tug of war” over the cumulative log-score, fought word by word.
- Spammy words push the score towards SPAM (right).
- Hammy words push the score towards HAM (left).
- The cumulative score is the rope: whichever side it ends up on decides the verdict.
11. Summary
- Naive Bayes is the “Hello World” of NLP, but powerful enough for production baselines.
- Log-Space Arithmetic is mandatory for probabilistic ML to prevent underflow.
- Laplace Smoothing is the standard fix for the “Zero Frequency” bug.