# Discrete PMF & CDF
> [!NOTE]
> **Key Concept:** A Discrete Random Variable represents countable outcomes (e.g., number of server 500 errors, users in a queue).
In the world of probability, we distinguish between variables that are countable (discrete) and those that are continuous (measurable). This chapter focuses on the former: discrete probability distributions.
We will explore two fundamental ways to describe these distributions:
- Probability Mass Function (PMF): The probability that a discrete random variable is exactly equal to some value.
- Cumulative Distribution Function (CDF): The probability that a discrete random variable is less than or equal to some value.
## 1. Probability Mass Function (PMF)
The PMF, denoted as P(X = x) or p(x), assigns a probability to each possible value x.
Properties:
- 0 ≤ P(X = x) ≤ 1
- Σ_x P(X = x) = 1
For example, rolling a fair die:
- P(X = 1) = 1/6
- P(X = 2) = 1/6
- …
- P(X = 6) = 1/6
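The die PMF above can be sketched in a few lines of Go (the helper name `diePMF` is illustrative, not from the text); the check in `main` confirms the second property, that the probabilities sum to 1:

```go
package main

import "fmt"

// diePMF returns P(X = x) for a fair six-sided die:
// 1/6 for x in 1..6, and 0 for any other value.
func diePMF(x int) float64 {
	if x >= 1 && x <= 6 {
		return 1.0 / 6.0
	}
	return 0
}

func main() {
	// Property check: probabilities over all outcomes sum to 1.
	total := 0.0
	for x := 1; x <= 6; x++ {
		total += diePMF(x)
	}
	fmt.Printf("P(X=1) = %.4f, sum over all faces = %.4f\n", diePMF(1), total)
}
```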
## 2. Cumulative Distribution Function (CDF)
The CDF, denoted as F(x), is the cumulative sum of probabilities up to x.
F(x) = P(X ≤ x) = Σ_{k ≤ x} P(X = k)
Properties:
- F(x) is non-decreasing.
- 0 ≤ F(x) ≤ 1
- lim_{x→-∞} F(x) = 0
- lim_{x→∞} F(x) = 1
Using the die example:
- F(3) = P(X ≤ 3) = P(X=1) + P(X=2) + P(X=3) = 1/6 + 1/6 + 1/6 = 0.5
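For the same fair die, the CDF is just a running sum of the PMF. A minimal Go sketch (the helper name `dieCDF` is illustrative):

```go
package main

import "fmt"

// dieCDF returns F(x) = P(X <= x) for a fair die by
// accumulating the PMF of every face up to x.
func dieCDF(x int) float64 {
	f := 0.0
	for k := 1; k <= 6 && k <= x; k++ {
		f += 1.0 / 6.0 // PMF of each face
	}
	return f
}

func main() {
	// Matches the worked example: F(3) = 3/6 = 0.5.
	fmt.Printf("F(3) = %.2f\n", dieCDF(3))
	// F saturates at 1 once x covers every outcome.
	fmt.Printf("F(6) = %.2f\n", dieCDF(6))
}
```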
## 3. Common Discrete Distributions

### Bernoulli Distribution
The simplest discrete distribution. Models a Bernoulli Trial with two outcomes: Success (1) with probability p, and Failure (0) with probability 1-p.
- PMF: P(X=k) = p^k (1-p)^(1-k) for k ∈ {0, 1}
- Mean: p
- Variance: p(1-p)
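The mean and variance formulas can be verified directly from the PMF definition. A quick sketch (the function name and the choice p = 0.3 are illustrative):

```go
package main

import "fmt"

// bernoulliPMF returns P(X = k) = p^k (1-p)^(1-k) for k in {0, 1}.
func bernoulliPMF(k int, p float64) float64 {
	if k == 1 {
		return p
	}
	return 1 - p
}

func main() {
	p := 0.3
	// Mean: sum of k * P(X=k) over k in {0, 1} -- should equal p.
	mean := 0*bernoulliPMF(0, p) + 1*bernoulliPMF(1, p)
	// Variance: sum of (k-mean)^2 * P(X=k) -- should equal p(1-p).
	variance := (0-mean)*(0-mean)*bernoulliPMF(0, p) + (1-mean)*(1-mean)*bernoulliPMF(1, p)
	fmt.Printf("mean = %.2f (p = %.2f)\n", mean, p)
	fmt.Printf("variance = %.2f (p(1-p) = %.2f)\n", variance, p*(1-p))
}
```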
### Binomial Distribution

Models the number of successes in n independent Bernoulli trials.
- Parameters: n (trials), p (probability of success).
- PMF: P(X=k) = C(n, k) × p^k × (1-p)^(n-k)
- Mean: np
- Variance: np(1-p)
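For small n, this PMF can be computed directly; building C(n, k) multiplicatively sidesteps full factorials, though the approach still breaks down at large n, which is what section 4 addresses. A sketch (the function name is illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// binomialPMF computes P(X = k) = C(n,k) * p^k * (1-p)^(n-k).
// C(n,k) is built one factor at a time, so no full factorial is formed.
func binomialPMF(n, k int, p float64) float64 {
	c := 1.0
	for i := 0; i < k; i++ {
		c = c * float64(n-i) / float64(i+1)
	}
	return c * math.Pow(p, float64(k)) * math.Pow(1-p, float64(n-k))
}

func main() {
	// P(exactly 5 heads in 10 fair flips) = C(10,5) / 2^10 = 252/1024
	fmt.Printf("P(X=5) = %.4f\n", binomialPMF(10, 5, 0.5))
}
```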
### Poisson Distribution
Models the number of events occurring in a fixed interval of time or space.
- Parameter: λ (average rate).
- PMF: P(X=k) = (λ^k e^{-λ}) / k!
- Mean: λ
- Variance: λ
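The Poisson PMF can be evaluated iteratively, folding the factorial into the loop one term at a time (the function name and the choice λ = 4 are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// poissonPMF computes P(X = k) = lambda^k * e^-lambda / k!
// by multiplying term by term, so no explicit factorial is needed.
func poissonPMF(k int, lambda float64) float64 {
	p := math.Exp(-lambda)
	for i := 1; i <= k; i++ {
		p *= lambda / float64(i)
	}
	return p
}

func main() {
	lambda := 4.0 // e.g. an average of 4 events per interval
	fmt.Printf("P(X=2) = %.4f\n", poissonPMF(2, lambda))
	// Sanity check: probabilities over a wide range of k sum to ~1.
	total := 0.0
	for k := 0; k <= 100; k++ {
		total += poissonPMF(k, lambda)
	}
	fmt.Printf("sum over k=0..100: %.6f\n", total)
}
```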
## 4. Hardware Reality: Integer Overflow & Log Probabilities
When implementing probability calculations on real hardware, you will quickly encounter two enemies: Integer Overflow and Underflow.
### The Factorial Problem
The Binomial coefficient C(n, k) = n! / (k! (n-k)!) relies on factorials. Factorials grow terrifyingly fast.
- 12! = 479,001,600 (fits in a 32-bit int)
- 13! = 6,227,020,800 (overflows a 32-bit int)
- 20! ≈ 2.4 × 10^18 (fits in a 64-bit long)
- 21! overflows a 64-bit integer.
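The wrap-around is easy to demonstrate in Go, whose signed integers overflow silently (this `fact` helper is a deliberate example of the bug, not something to reuse):

```go
package main

import "fmt"

// fact computes n! in int64, silently wrapping on overflow --
// exactly the failure mode described above.
func fact(n int) int64 {
	r := int64(1)
	for i := int64(2); i <= int64(n); i++ {
		r *= i
	}
	return r
}

func main() {
	fmt.Println("20! =", fact(20)) // 2432902008176640000: still fits
	fmt.Println("21! =", fact(21)) // wrapped around: garbage (negative)
}
```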
If you try to calculate combinations(100, 50) using naive factorials, the intermediate values overflow long before the final division, and your program will crash or silently produce garbage.
### The Underflow Problem

Probabilities are often tiny numbers (e.g., P(winning the lottery) ≈ 10^-9). Multiplying many probabilities together (as in Naive Bayes) produces values smaller than the smallest positive double (≈ 4.9 × 10^-324), at which point they collapse to 0.
### The Solution: Log-Space

To solve both problems, we work in log-space. Instead of calculating P(x), we calculate ln(P(x)).

- Multiplication becomes addition: ln(A × B) = ln(A) + ln(B)
- Division becomes subtraction: ln(A / B) = ln(A) − ln(B)
- Exponentiation becomes multiplication: ln(A^k) = k × ln(A)
For Binomial PMF:
ln(P(X=k)) = ln(C(n,k)) + k*ln(p) + (n-k)*ln(1-p)
Where ln(C(n,k)) is calculated with the Log-Gamma function (lgamma). Since Γ(n) = (n−1)! for positive integers, ln(n!) = lgamma(n + 1); the whole computation stays in floating point, so huge n never overflows.
## 5. Interactive: Discrete Distribution Explorer
Visualize how changing parameters affects the shape of the distribution. Toggle between PMF (Probability at exact point) and CDF (Cumulative probability).
## 6. Code Implementation
Calculating Log-Probabilities to avoid overflow using lgamma.
### Java Example

Using Apache Commons Math for `logGamma`.
```java
import org.apache.commons.math3.special.Gamma;

public class LogProbExample {

    // Log-factorial: ln(n!) = lgamma(n + 1)
    public static double logFactorial(int n) {
        return Gamma.logGamma(n + 1);
    }

    // Log-combination: ln(C(n,k)) = ln(n!) - ln(k!) - ln((n-k)!)
    public static double logChoose(int n, int k) {
        return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
    }

    // Log-binomial PMF: ln(P(X=k))
    public static double logBinomialPMF(int n, double p, int k) {
        return logChoose(n, k) + k * Math.log(p) + (n - k) * Math.log(1 - p);
    }

    public static void main(String[] args) {
        int n = 1000; // Huge number of trials! 1000! overflows everything.
        int k = 500;
        double p = 0.5;

        // Calculate in log space
        double logProb = logBinomialPMF(n, p, k);
        System.out.printf("Log Probability: %.4f%n", logProb);
        System.out.printf("Probability: %e%n", Math.exp(logProb));
    }
}
```
### Go Example

Using `math.Lgamma` from the standard library.
```go
package main

import (
	"fmt"
	"math"
)

// Log-factorial: ln(n!) = lgamma(n + 1)
func logFactorial(n int) float64 {
	res, _ := math.Lgamma(float64(n) + 1)
	return res
}

// Log-combination: ln(C(n,k))
func logChoose(n, k int) float64 {
	return logFactorial(n) - logFactorial(k) - logFactorial(n-k)
}

// Log-binomial PMF
func logBinomialPMF(n int, p float64, k int) float64 {
	return logChoose(n, k) + float64(k)*math.Log(p) + float64(n-k)*math.Log(1-p)
}

func main() {
	n := 1000
	k := 500
	p := 0.5

	// 1000! is far too big for int64 or float64, so we stay in log space.
	logProb := logBinomialPMF(n, p, k)
	fmt.Printf("Log Probability: %.4f\n", logProb)
	fmt.Printf("Probability: %e\n", math.Exp(logProb))
}
```
> [!TIP]
> **Production Tip:** Always use `lgamma` or the log-probability functions provided by libraries (like `scipy.stats` in Python or `Distributions.jl` in Julia) instead of implementing factorials yourself. They are optimized for numerical stability.
## Summary

- Discrete variables are countable. PMF = exact probability, CDF = cumulative probability.
- Binomial (coin flips) and Poisson (rare events) are the key distributions.
- Hardware reality: factorials overflow quickly. Use log-space and `lgamma` to handle real-world scale.