Discrete Distributions: PMF and CDF

[!NOTE] Key Concept: A Discrete Random Variable represents countable outcomes (e.g., number of server 500 errors, users in a queue).

In the world of probability, we distinguish between variables that are countable (discrete) and those that are continuous (measurable). This chapter focuses on the former: the Discrete Probability Distributions.

We will explore two fundamental ways to describe these distributions:

  1. Probability Mass Function (PMF): The probability that a discrete random variable is exactly equal to some value.
  2. Cumulative Distribution Function (CDF): The probability that a discrete random variable is less than or equal to some value.

1. Probability Mass Function (PMF)

The PMF, denoted as P(X = x) or p(x), assigns a probability to each possible value x.

Properties:

  • 0 ≤ P(X = x) ≤ 1
  • Σ_x P(X = x) = 1 (the probabilities over all possible values sum to 1)

For example, rolling a fair die:

  • P(X = 1) = 1/6
  • P(X = 2) = 1/6
  • …
  • P(X = 6) = 1/6
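As a quick sanity check, the die PMF can be written as a tiny function. This is a sketch in Go; `diePMF` is an illustrative name, not a library call:

```go
package main

import "fmt"

// diePMF returns the PMF of a fair six-sided die:
// 1/6 for faces 1..6, and 0 outside the support.
func diePMF(face int) float64 {
	if face >= 1 && face <= 6 {
		return 1.0 / 6.0
	}
	return 0
}

func main() {
	// Property check: every value is in [0,1] and the PMF sums to 1.
	total := 0.0
	for face := 1; face <= 6; face++ {
		total += diePMF(face)
	}
	fmt.Printf("P(X=3) = %.4f\n", diePMF(3))
	fmt.Printf("Sum of PMF = %.4f\n", total)
}
```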

2. Cumulative Distribution Function (CDF)

The CDF, denoted as F(x), is the cumulative sum of probabilities up to x.

F(x) = P(X ≤ x) = Σ_{k ≤ x} P(X = k)

Properties:

  • F(x) is non-decreasing.
  • 0 ≤ F(x) ≤ 1
  • lim_{x→−∞} F(x) = 0
  • lim_{x→+∞} F(x) = 1

Using the die example:

  • F(3) = P(X ≤ 3) = P(X=1) + P(X=2) + P(X=3) = 1/6 + 1/6 + 1/6 = 0.5
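The cumulative sum above is exactly a loop over the PMF. A minimal sketch in Go (`dieCDF` is an illustrative name):

```go
package main

import "fmt"

// dieCDF computes F(x) = P(X <= x) for a fair six-sided die
// by summing the PMF terms up to x.
func dieCDF(x int) float64 {
	total := 0.0
	for k := 1; k <= 6 && k <= x; k++ {
		total += 1.0 / 6.0
	}
	return total
}

func main() {
	fmt.Printf("F(0) = %.4f\n", dieCDF(0)) // below the support: 0
	fmt.Printf("F(3) = %.4f\n", dieCDF(3)) // 3/6 = 0.5, matching the example
	fmt.Printf("F(6) = %.4f\n", dieCDF(6)) // whole support: 1
}
```

Note that the function is non-decreasing in x and pinned to 0 and 1 at the extremes, matching the CDF properties listed above.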

3. Common Discrete Distributions

Bernoulli Distribution

The simplest discrete distribution. Models a Bernoulli Trial with two outcomes: Success (1) with probability p, and Failure (0) with probability 1-p.

  • PMF: P(X=k) = p^k × (1-p)^(1-k) for k ∈ {0, 1}
  • Mean: p
  • Variance: p(1-p)
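The two-outcome PMF collapses to a simple branch in code. A sketch in Go (`bernoulliPMF` is an illustrative name):

```go
package main

import "fmt"

// bernoulliPMF: P(X=k) = p^k * (1-p)^(1-k) for k in {0, 1}.
// The exponent form just selects p for success and 1-p for failure.
func bernoulliPMF(k int, p float64) float64 {
	switch k {
	case 1:
		return p
	case 0:
		return 1 - p
	}
	return 0 // outside the support
}

func main() {
	p := 0.3
	fmt.Printf("P(X=1) = %.2f\n", bernoulliPMF(1, p))
	fmt.Printf("P(X=0) = %.2f\n", bernoulliPMF(0, p))
	fmt.Printf("Mean = %.2f, Variance = %.2f\n", p, p*(1-p))
}
```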

Binomial Distribution

Models the number of successes in n independent Bernoulli trials.

  • Parameters: n (trials), p (probability of success).
  • PMF: P(X=k) = C(n, k) × p^k × (1-p)^(n-k)
  • Mean: np
  • Variance: np(1-p)
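For small n, the PMF can be computed directly from the formula. A sketch in Go (`choose` and `binomialPMF` are illustrative names; the multiplicative C(n,k) is fine here, but see the overflow discussion below for large n):

```go
package main

import (
	"fmt"
	"math"
)

// choose computes C(n, k) multiplicatively, avoiding explicit
// factorials -- adequate for small n.
func choose(n, k int) float64 {
	res := 1.0
	for i := 1; i <= k; i++ {
		res *= float64(n-k+i) / float64(i)
	}
	return res
}

// binomialPMF: P(X=k) = C(n,k) * p^k * (1-p)^(n-k).
func binomialPMF(n, k int, p float64) float64 {
	return choose(n, k) * math.Pow(p, float64(k)) * math.Pow(1-p, float64(n-k))
}

func main() {
	// Probability of exactly 5 heads in 10 fair coin flips:
	// C(10,5)/2^10 = 252/1024 ~= 0.2461.
	fmt.Printf("P(X=5) = %.4f\n", binomialPMF(10, 5, 0.5))
	fmt.Printf("Mean = %.1f\n", 10*0.5)
}
```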

Poisson Distribution

Models the number of events occurring in a fixed interval of time or space.

  • Parameter: λ (average rate).
  • PMF: P(X=k) = (λ^k × e^(−λ)) / k!
  • Mean: λ
  • Variance: λ

4. Hardware Reality: Integer Overflow & Log Probabilities

When implementing probability calculations on real hardware, you will quickly encounter two enemies: Integer Overflow and Underflow.

The Factorial Problem

The Binomial coefficient C(n, k) = n! / (k! (n-k)!) relies on factorials. Factorials grow terrifyingly fast.

  • 12! = 479,001,600 (Fits in 32-bit int)
  • 13! = 6,227,020,800 (Overflows 32-bit int)
  • 20! ≈ 2.4 × 10^18 (Fits in 64-bit long)
  • 21! overflows 64-bit integer.

If you try to calculate combinations(100, 50) using naive factorials, your program will crash or produce garbage.
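The wrap-around is easy to demonstrate. This Go sketch (`factorial32` is an illustrative name) computes factorials in 32-bit arithmetic, where Go silently wraps on overflow:

```go
package main

import "fmt"

// factorial32 computes n! in 32-bit arithmetic.
// Go integer overflow wraps silently, so past 12! the
// result is meaningless garbage rather than an error.
func factorial32(n int32) int32 {
	var f int32 = 1
	for i := int32(1); i <= n; i++ {
		f *= i
	}
	return f
}

func main() {
	fmt.Printf("12! in int32: %d\n", factorial32(12)) // 479001600 -- correct
	fmt.Printf("13! in int32: %d\n", factorial32(13)) // wrapped garbage, not 6,227,020,800
}
```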

The Underflow Problem

Probabilities are often tiny numbers (e.g., P(Winning Lottery) ≈ 10^-9). Multiplying many probabilities (like in Naive Bayes) produces values smaller than the smallest positive representable double (about 5 × 10^-324), at which point they collapse to exactly 0.
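A quick sketch of the collapse, contrasted with the log-space approach developed in the next subsection (`naiveProduct` and `logProduct` are illustrative names):

```go
package main

import (
	"fmt"
	"math"
)

// naiveProduct multiplies p by itself n times in plain float64.
func naiveProduct(p float64, n int) float64 {
	res := 1.0
	for i := 0; i < n; i++ {
		res *= p
	}
	return res
}

// logProduct computes the same product as a sum of logs.
func logProduct(p float64, n int) float64 {
	return float64(n) * math.Log(p)
}

func main() {
	// 40 factors of 1e-9 is 1e-360: far below the smallest
	// positive double (~5e-324), so the direct product hits 0.
	fmt.Printf("Direct:    %v\n", naiveProduct(1e-9, 40))
	fmt.Printf("Log space: %.2f\n", logProduct(1e-9, 40)) // a perfectly ordinary float
}
```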

The Solution: Log-Space

To solve both, we work in Log-Space. Instead of calculating P(x), we calculate ln(P(x)).

  • Multiplication becomes Addition: ln(A * B) = ln(A) + ln(B)
  • Division becomes Subtraction: ln(A / B) = ln(A) - ln(B)
  • Exponentiation becomes Multiplication: ln(A^k) = k * ln(A)

For Binomial PMF: ln(P(X=k)) = ln(C(n,k)) + k*ln(p) + (n-k)*ln(1-p)

Here ln(C(n,k)) is computed with the Log-Gamma function (lgamma). Since Γ(n+1) = n! for integer n, lgamma(n+1) gives ln(n!) directly, so we can work with astronomically large factorials without ever materializing them.


5. Interactive: Discrete Distribution Explorer

Visualize how changing parameters affects the shape of the distribution. Toggle between PMF (Probability at exact point) and CDF (Cumulative probability).


6. Code Implementation

Calculating Log-Probabilities to avoid overflow using lgamma.

Java Example

Using Apache Commons Math for logGamma.

import org.apache.commons.math3.special.Gamma;

public class LogProbExample {

    // Log-Factorial approximation: ln(n!) = lgamma(n + 1)
    public static double logFactorial(int n) {
        return Gamma.logGamma(n + 1);
    }

    // Log-Combination: ln(C(n,k)) = ln(n!) - ln(k!) - ln((n-k)!)
    public static double logChoose(int n, int k) {
        return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
    }

    // Log-Binomial PMF: ln(P(X=k))
    public static double logBinomialPMF(int n, double p, int k) {
        return logChoose(n, k) + k * Math.log(p) + (n - k) * Math.log(1 - p);
    }

    public static void main(String[] args) {
        int n = 1000;  // Huge number of trials! 1000! overflows everything.
        int k = 500;
        double p = 0.5;

        // Calculate in log space
        double logProb = logBinomialPMF(n, p, k);

        System.out.printf("Log Probability: %.4f%n", logProb);
        System.out.printf("Probability: %e%n", Math.exp(logProb));
    }
}

Go Example

Using math.Lgamma from the standard library.

package main

import (
	"fmt"
	"math"
)

// Log-Factorial: ln(n!) = lgamma(n + 1)
func logFactorial(n int) float64 {
	res, _ := math.Lgamma(float64(n) + 1)
	return res
}

// Log-Combination: ln(C(n,k))
func logChoose(n, k int) float64 {
	return logFactorial(n) - logFactorial(k) - logFactorial(n-k)
}

// Log-Binomial PMF
func logBinomialPMF(n int, p float64, k int) float64 {
	return logChoose(n, k) + float64(k)*math.Log(p) + float64(n-k)*math.Log(1-p)
}

func main() {
	n := 1000
	k := 500
	p := 0.5

	// 1000! is way too big for int64 or float64.
	// We must use log space.
	logProb := logBinomialPMF(n, p, k)

	fmt.Printf("Log Probability: %.4f\n", logProb)
	fmt.Printf("Probability: %e\n", math.Exp(logProb))
}

[!TIP] Production Tip: Always use lgamma or log-probability functions provided by libraries (like scipy.stats in Python or Distributions.jl in Julia) instead of implementing factorials yourself. They are optimized for stability.

Summary

  • Discrete Variables are countable. PMF = Exact, CDF = Cumulative.
  • Binomial (Coin flips) & Poisson (Rare events) are key.
  • Hardware Reality: Factorials overflow quickly. Use Log-Space and lgamma to handle real-world scale.