The Zoo of Distributions
1. Introduction: Describing Randomness
In the previous chapter, we learned how to calculate probabilities. Now we learn what to calculate. Most real-world random quantities follow recurring patterns, and these patterns are formalized as Probability Distributions.
A Random Variable X is a function that maps outcomes to numbers.
- Discrete: X ∈ {0, 1, 2, ...} (e.g., the number of emails you receive).
- Continuous: X ∈ ℝ (e.g., height, temperature, or the weights in a neural network).
PDF vs PMF
- PMF (Probability Mass Function): For discrete variables. It gives the probability that a discrete random variable is exactly equal to some value.
P(X=k)
- PDF (Probability Density Function): For continuous variables. The probability at a single point is technically 0. We measure probability as the Area under the curve.
$P(a \le X \le b) = \int_a^b f(x)\,dx$
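To make the PMF/PDF distinction concrete, here is a minimal Python sketch using SciPy; the particular distributions and parameters are illustrative choices, not anything fixed by the text above:

```python
from scipy import stats

# PMF: the exact probability of a discrete value.
# P(X = 3) for X ~ Binomial(n=10, p=0.5)
print(stats.binom.pmf(3, n=10, p=0.5))          # ≈ 0.1172

# PDF: a density, NOT a probability -- it can even exceed 1.
# For X ~ Normal(0, 1), the density at 0 is f(0) ≈ 0.3989.
print(stats.norm.pdf(0.0, loc=0.0, scale=1.0))

# Probabilities for continuous variables come from areas (CDF differences):
# P(-1 ≤ X ≤ 1) = F(1) - F(-1) ≈ 0.6827
print(stats.norm.cdf(1.0) - stats.norm.cdf(-1.0))
```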
2. Common Discrete Distributions
2.1 Bernoulli (p)
The “atom” of probability. A single trial with two outcomes: Success (1) or Failure (0).
- Generative Story: You flip a biased coin once.
- Parameter: p (the probability of success).
- ML Application: Logistic Regression outputs a Bernoulli probability P(Y=1|X). It models binary classification tasks like “Spam vs Ham”, as in the sketch below.
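A minimal sketch of that pipeline in plain Python; the raw score z = 1.2 is a made-up number standing in for a model's output:

```python
import math
import random

# Logistic regression turns a linear score z into a Bernoulli
# parameter p = P(Y=1|X) via the sigmoid (logistic) function.
def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

p = sigmoid(1.2)                      # hypothetical score -> p ≈ 0.77
print(f"P(spam) = {p:.2f}")

# Sampling one Bernoulli(p) trial: 1 with probability p, else 0.
label = 1 if random.random() < p else 0
print("sampled label:", label)
```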
2.2 Binomial (n, p)
The sum of n independent Bernoulli trials.
- Generative Story: You flip the same coin n times. How many heads do you get?
- Formula: $P(X=k) = \binom{n}{k} \, p^k (1-p)^{n-k}$
- ML Application: Predicting the number of conversions from n ad impressions (see the sketch below).
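The formula translates directly into code. A short sketch using made-up numbers: n = 100 impressions with a 3% conversion rate:

```python
from math import comb

# Direct implementation of the binomial PMF given above.
def binom_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly k = 5 conversions out of n = 100 impressions
# when each impression converts independently with p = 0.03.
print(binom_pmf(5, 100, 0.03))   # ≈ 0.101
```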
2.3 Poisson (λ)
Models the number of events happening in a fixed interval of time or space.
- Generative Story: Events happen independently at a constant average rate.
- Parameter:
λ(lambda, average rate). - Example: Number of API requests per second to your server.
- ML Application: Modeling count data (e.g., predicting call center volume or server load).
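A minimal sketch of the Poisson PMF, P(X=k) = λᵏ e^(−λ) / k!, with a hypothetical server averaging λ = 4 requests per second:

```python
from math import exp, factorial

# Poisson PMF: P(X = k) = λ^k · e^(-λ) / k!
def poisson_pmf(k: int, lam: float) -> float:
    return lam**k * exp(-lam) / factorial(k)

# Probability of seeing exactly 0, 4, and 10 requests in one second.
for k in (0, 4, 10):
    print(k, round(poisson_pmf(k, 4.0), 4))
# 0 -> 0.0183, 4 -> 0.1954, 10 -> 0.0053
```

Note that the most likely count sits near λ itself, while counts far from the average become rare quickly.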
3. Continuous Distributions
3.1 The Gaussian (Normal) Distribution (μ, σ²)
The “King of Distributions”. It is bell-shaped, symmetric, and defined by:
- Mean (μ): the center (Expectation).
- Variance (σ²): the spread (Uncertainty).

$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
- Why is it everywhere?: The Central Limit Theorem says that if you add up enough independent random things (regardless of their original distribution), the sum becomes approximately Gaussian.
- ML Applications:
  - Weight Initialization: We initialize Neural Network weights from N(0, 1) or with Xavier/He Normal schemes to ensure stable training (see the sketch after this list).
  - Error Analysis: In Linear Regression, we assume the noise is Gaussian: $y = mx + b + ε$, where ε ~ N(0, σ²).
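A short NumPy sketch illustrating both points; the sample counts and layer sizes (fan_in = 256, fan_out = 128) are made-up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Central Limit Theorem, empirically ---
# Sum 50 Uniform(0, 1) draws, many times: the sums look bell-shaped even
# though each term is flat, with mean 50·0.5 and variance 50·(1/12).
sums = rng.uniform(0, 1, size=(100_000, 50)).sum(axis=1)
print(sums.mean(), sums.std())   # ≈ 25.0 and ≈ sqrt(50/12) ≈ 2.04

# --- Weight initialization (hypothetical layer sizes) ---
fan_in, fan_out = 256, 128

# Naive initialization: standard normal N(0, 1).
w_naive = rng.normal(0.0, 1.0, size=(fan_in, fan_out))

# He Normal initialization: N(0, 2/fan_in), i.e. std = sqrt(2/fan_in),
# commonly paired with ReLU to keep activation variance stable per layer.
w_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

print(w_naive.std(), w_he.std())   # ≈ 1.0 and ≈ 0.088
```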
3.2 Exponential Distribution (λ)
Models the time between events in a Poisson process.
- Generative Story: How long do you have to wait for the next bus (if buses arrive randomly)?
- Parameter: λ (the rate parameter).
- Memoryless Property: P(T > t+s | T > s) = P(T > t). Past waiting time doesn’t affect future waiting time, as demonstrated in the sketch below.
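The memoryless property is easy to check by simulation. A NumPy sketch with an assumed rate of λ = 0.5 (say, buses per minute):

```python
import numpy as np

rng = np.random.default_rng(42)
lam = 0.5                                   # rate: 0.5 arrivals per minute
waits = rng.exponential(scale=1.0 / lam, size=1_000_000)

# Memorylessness: P(T > t+s | T > s) should equal P(T > t).
t, s = 2.0, 3.0
p_uncond = (waits > t).mean()
p_cond = (waits[waits > s] > t + s).mean()
print(p_uncond, p_cond)   # both ≈ e^(-λt) = e^(-1) ≈ 0.368
```

Having already waited 3 minutes (the conditional case) leaves the chance of waiting 2 more minutes unchanged, which is exactly what the two nearly identical estimates show.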
4. Interactive Visualizer: The Distribution Explorer
Select a distribution and tweak the parameters to see how the shape changes. Use the Toggle CDF control to switch between the density/mass view (PDF/PMF) and the Cumulative Distribution Function (CDF).
5. Summary
- Bernoulli: 1 coin flip.
- Binomial: n coin flips.
- Poisson: Counts per hour.
- Gaussian: The Bell Curve (The sum of everything).
- Exponential: Waiting time.