Discrete PDF CDF
[!NOTE] Key Concept: A Discrete Random Variable represents countable outcomes (e.g., number of server 500 errors, users in a queue).
Imagine you are a Site Reliability Engineer (SRE) monitoring an API. You need to know: What is the probability of receiving exactly 5 HTTP 500 errors in the next minute? Or, What is the probability of receiving at most 5 errors?
In the world of probability, we distinguish between variables that are countable (discrete, like HTTP errors) and those that are continuous (measurable, like latency in milliseconds). This chapter focuses on the former: the Discrete Probability Distributions.
We will explore two fundamental ways to describe these distributions:
- Probability Mass Function (PMF): The probability that a discrete random variable is exactly equal to some value. (Think of “Mass” as a solid weight placed exactly on a specific number).
- Cumulative Distribution Function (CDF): The probability that a discrete random variable is less than or equal to some value. (Think of a “Cumulative” snowball rolling down a number line, gathering all the probability mass it passes).
1. Probability Mass Function (PMF)
The PMF, denoted as P(X = x) or p(x), assigns a probability to each possible value x.
Properties:
- 0 ≤ P(X = x) ≤ 1
- Σx P(X = x) = 1
For example, rolling a fair die:
- P(X = 1) = 1/6
- P(X = 2) = 1/6
- …
- P(X = 6) = 1/6
2. Cumulative Distribution Function (CDF)
The CDF, denoted as F(x), is the cumulative sum of probabilities up to x.
F(x) = P(X ≤ x) = Σk ≤ x P(X = k)
Properties:
- F(x) is non-decreasing.
- 0 ≤ F(x) ≤ 1
- limx→-∞ F(x) = 0
- limx→∞ F(x) = 1
Using the die example:
- F(3) = P(X ≤ 3) = P(X=1) + P(X=2) + P(X=3) = 1/6 + 1/6 + 1/6 = 0.5
3. Common Discrete Distributions
Bernoulli Distribution
The simplest discrete distribution. Models a Bernoulli Trial with two outcomes: Success (1) with probability p, and Failure (0) with probability 1-p.
- Analogy: A single coin flip. Will a single network request succeed (1) or timeout (0)?
- PMF: P(X=k) = pk(1-p)1-k for k ∈ {0, 1}
- Mean: p
- Variance: p(1-p)
Binomial Distribution
Models the number of successes in n independent Binomial trials.
- Analogy: Flipping a coin n times. If I send 100 network requests, how many will succeed?
- Parameters: n (trials), p (probability of success).
- PMF: P(X=k) = C(n, k) × pk × (1-p)n-k
- Mean: np
- Variance: np(1-p)
Poisson Distribution
Models the number of events occurring in a fixed interval of time or space.
- Analogy: A traffic cop counting cars. How many database transactions occur per second?
- Parameter: λ (average rate).
- PMF: P(X=k) = (λke-λ) / k!
- Mean: λ
- Variance: λ
4. Hardware Reality: Integer Overflow & Log Probabilities
When implementing probability calculations on real hardware, you will quickly encounter two enemies: Integer Overflow and Underflow.
The Factorial Problem
The Binomial coefficient C(n, k) = n! / (k! (n-k)!) relies on factorials. Factorials grow terrifyingly fast.
- 12! = 479,001,600 (Fits in 32-bit int)
- 13! = 6,227,020,800 (Overflows 32-bit int)
- 20! ≈ 2.4 × 1018 (Fits in 64-bit long)
- 21! overflows 64-bit integer.
If you try to calculate combinations(100, 50) using naive factorials, your program will crash or produce garbage.
The Underflow Problem
Probabilities are often tiny numbers (e.g., P(Winning Lottery) ≈ 10-9). Multiplying many probabilities (like in Naive Bayes) results in numbers smaller than the smallest representable float (1e-324 for double), causing them to collapse to 0.
The Solution: Log-Space
To solve both, we work in Log-Space. Instead of calculating P(x), we calculate ln(P(x)).
- Multiplication becomes Addition:
ln(A * B) = ln(A) + ln(B) - Division becomes Subtraction:
ln(A / B) = ln(A) - ln(B) - Exponentiation becomes Multiplication:
ln(A^k) = k * ln(A)
For Binomial PMF:
ln(P(X=k)) = ln(C(n,k)) + k*ln(p) + (n-k)*ln(1-p)
Where ln(C(n,k)) is calculated using the Log-Gamma function (lgamma), which approximates ln((n-1)!) for continuous values, allowing us to handle huge numbers without overflow.
5. Interactive: Discrete Distribution Explorer
Visualize how changing parameters affects the shape of the distribution. Toggle between PMF (Probability at exact point) and CDF (Cumulative probability).
6. Code Implementation
Calculating Log-Probabilities to avoid overflow using lgamma.
Java
Using Apache Commons Math for logGamma.
import org.apache.commons.math3.special.Gamma;
public class LogProbExample {
// Log-Factorial approximation: ln(n!) = lgamma(n + 1)
public static double logFactorial(int n) {
return Gamma.logGamma(n + 1);
}
// Log-Combination: ln(C(n,k)) = ln(n!) - ln(k!) - ln((n-k)!)
public static double logChoose(int n, int k) {
return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
}
// Log-Binomial PMF: ln(P(X=k))
public static double logBinomialPMF(int n, double p, int k) {
return logChoose(n, k) + k * Math.log(p) + (n - k) * Math.log(1 - p);
}
public static void main(String[] args) {
int n = 1000; // Huge number of trials! 1000! overflows everything.
int k = 500;
double p = 0.5;
// Calculate in log space
double logProb = logBinomialPMF(n, p, k);
System.out.printf("Log Probability: %.4f%n", logProb);
System.out.printf("Probability: %e%n", Math.exp(logProb));
}
}
Go
Using math.Lgamma from the standard library.
package main
import (
"fmt"
"math"
)
// Log-Factorial: ln(n!) = lgamma(n + 1)
func logFactorial(n int) float64 {
res, _ := math.Lgamma(float64(n) + 1)
return res
}
// Log-Combination: ln(C(n,k))
func logChoose(n, k int) float64 {
return logFactorial(n) - logFactorial(k) - logFactorial(n-k)
}
// Log-Binomial PMF
func logBinomialPMF(n int, p float64, k int) float64 {
return logChoose(n, k) + float64(k)*math.Log(p) + float64(n-k)*math.Log(1-p)
}
func main() {
n := 1000
k := 500
p := 0.5
// 1000! is way too big for int64 or float64.
// We must use log space.
logProb := logBinomialPMF(n, p, k)
fmt.Printf("Log Probability: %.4f\n", logProb)
fmt.Printf("Probability: %e\n", math.Exp(logProb))
}
[!TIP] Production Tip: Always use
lgammaor log-probability functions provided by libraries (likescipy.statsin Python orDistributions.jlin Julia) instead of implementing factorials yourself. They are optimized for stability.
Summary
- Discrete Variables are countable. PMF = Exact, CDF = Cumulative.
- Binomial (Coin flips) & Poisson (Rare events) are key.
- Hardware Reality: Factorials overflow quickly. Use Log-Space and
lgammato handle real-world scale.