Measuring Surprise: Information & Entropy

[!NOTE] This module builds information theory up from first principles: surprisal, entropy, Huffman coding, cross-entropy loss, and KL divergence.

1. Introduction: The Bit

Information Theory is the mathematical study of Surprise. In the digital world, we measure information in Bits. But what is a bit?

It is not just a 0 or 1 in a computer. It is the amount of information required to distinguish between two equally likely alternatives.

  • Scenario A: “The sun rose today.”
    • Probability: ~100% (certainty).
    • Surprise: zero.
    • Information: 0 bits. You learned nothing new.
  • Scenario B: “I just won the Powerball lottery.”
    • Probability: 1 in 292 million.
    • Surprise: massive.
    • Information: ~28 bits. This fundamentally changes your state of knowledge.

Surprisal (Self-Information)

The Surprisal of an event x is defined as:

I(x) = -log2 P(x)
  • Fair Coin (P=0.5): − log2(0.5) = 1 bit.
  • Roll a Die (P=1/6): − log2(0.166) ≈ 2.58 bits.
  • Guess a 4-digit PIN (P=1/10000): − log2(0.0001) ≈ 13.3 bits.

[!TIP] Why log base 2? Because computers use binary. If we used base 10, the unit would be “Hartleys”. If we used natural log (e), the unit would be “Nats”.
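The surprisal formula above is a one-liner; a minimal sketch that reproduces the three examples:

```python
import math

def surprisal_bits(p: float) -> float:
    """Self-information I(x) = -log2 P(x), measured in bits."""
    return -math.log2(p)

print(surprisal_bits(0.5))    # fair coin: 1.0 bit
print(surprisal_bits(1 / 6))  # die roll: ~2.585 bits
print(surprisal_bits(1e-4))   # 4-digit PIN: ~13.288 bits
```

Swapping `math.log2` for `math.log` or `math.log10` would give the same quantities in nats or Hartleys, respectively.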


2. Entropy (H)

Entropy is the average surprisal (or expected information) of a random variable. It measures the total uncertainty, chaos, or impurity in a system.

For a probability distribution P, Entropy H is:

H(P) = - Σ P(x) · log2 P(x)

The Coin Example

Let P(Heads) = p.

  1. Fair Coin (p = 0.5):
    • You have no idea what will happen. Maximum Chaos.
    • H = 1 bit.
  2. Rigged Coin (p = 0.9):
    • You can guess “Heads” and be right 90% of the time.
    • There is less uncertainty.
    • H ≈ 0.47 bits.
  3. Fixed Coin (p = 1.0):
    • Zero uncertainty. You know it’s Heads.
    • H = 0 bits.
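The three coin cases above can be checked directly with the binary entropy function (taking 0 · log 0 = 0 by convention):

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with 0*log(0) taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.5, 0.9, 1.0):
    print(f"p={p}: H = {binary_entropy(p):.2f} bits")
```

Entropy peaks at p = 0.5 (1 bit) and falls to 0 at either certainty.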

3. Interactive Visualizer: The Surprise Game

Goal: Minimize your “Surprise Score”. You are betting on coin flips.

  • If you bet on the likely outcome (e.g., Heads when P=0.9), and you win, your surprise (cost) is low.
  • If the rare outcome happens, your surprise is HUGE.

[!TIP] Try it yourself: Move the slider to 0.90 (90% Heads).

  • Most flips give you 0.15 bits of surprise (Low).
  • Occasionally, you get Tails and suffer 3.32 bits of surprise (High).
  • The Average Surprise (Entropy) is the weighted average of these two.
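The game's average surprise is just a Monte-Carlo estimate of the entropy. A minimal simulation of the 90% coin (flip count and seed are illustrative):

```python
import math
import random

random.seed(0)
p_heads = 0.9
flips = 100_000

total_surprise = 0.0
for _ in range(flips):
    if random.random() < p_heads:
        total_surprise += -math.log2(p_heads)      # likely outcome: ~0.15 bits
    else:
        total_surprise += -math.log2(1 - p_heads)  # rare outcome: ~3.32 bits

avg = total_surprise / flips
entropy = -(p_heads * math.log2(p_heads) + (1 - p_heads) * math.log2(1 - p_heads))
print(f"average surprise = {avg:.3f} bits, entropy = {entropy:.3f} bits")
```

With enough flips the empirical average converges to H(0.9) ≈ 0.47 bits.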

4. Huffman Coding: Practical Compression

How do we use Entropy to compress files? If we know that the letter ‘e’ is very common (Low Surprisal), and ‘z’ is rare (High Surprisal), we should give ‘e’ a short code and ‘z’ a long code.

Example: “BANANA”

Frequency:

  • A: 3 (50%)
  • N: 2 (33%)
  • B: 1 (17%)

Step-by-Step Construction:

  1. Priority Queue: Sort by frequency. [(B,1), (N,2), (A,3)].
  2. Combine Smallest: Take B(1) and N(2). Create a new node (BN, 3).
    • B gets 0, N gets 1 (relative to parent).
    • Queue: [(BN, 3), (A, 3)].
  3. Combine Next Smallest: Take BN(3) and A(3). Create root (ALL, 6).
    • BN gets 0, A gets 1.

Resulting Codes:

  • A: 1 (1 bit)
  • B: 00 (2 bits)
  • N: 01 (2 bits)

Total Size:

  • Original: 6 chars × 8 bits = 48 bits.
  • Huffman: (3 × 1) + (1 × 2) + (2 × 2) = 3 + 2 + 4 = 9 bits.
  • Compression Ratio: 5.3x!
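The steps above can be sketched with a priority queue (`heapq`). Note that the 0/1 labels at each merge are arbitrary, so the exact bit patterns may differ from the walkthrough, but the code lengths, and therefore the 9-bit total, will match:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build a Huffman code table via a min-heap of (freq, tiebreak, tree)."""
    freq = Counter(text)
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # smallest frequency
        f2, _, right = heapq.heappop(heap)   # next smallest
        heapq.heappush(heap, (f1 + f2, i, (left, right)))
        i += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, str):
            codes[node] = prefix or "0"      # single-symbol edge case
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("BANANA")
encoded = "".join(codes[c] for c in "BANANA")
print(codes, len(encoded), "bits")
```

The tiebreak counter in the heap tuples keeps comparisons deterministic when frequencies are equal.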

The average code length approaches the Entropy of the source. By Shannon's source coding theorem, no lossless code can use fewer bits per symbol, on average, than the source's Entropy.


5. Cross-Entropy Loss

In Machine Learning (Classification), we don’t just care about the entropy of one distribution. We care about the difference between two. Imagine an AI trying to classify an image.

The “Dog vs Wolf” Example

  • True Label (P): The image is definitely a Dog.
  • P = [Dog: 1.0, Wolf: 0.0]
  • Model Prediction (Q): The model thinks it’s likely a Dog, but isn’t sure.
  • Q = [Dog: 0.8, Wolf: 0.2]

We want Q to look like P. We minimize Cross-Entropy:

H(P, Q) = - Σ P(x) · log(Q(x))

Since P is a “One-Hot” vector, the formula simplifies dramatically to just the negative log-probability of the correct class (using the natural log, as is standard in ML frameworks):

Loss = -ln(Q_correct)
  • Case 1 (Good Model): Q_Dog = 0.9. Loss = − ln(0.9) ≈ 0.10.
  • Case 2 (Bad Model): Q_Dog = 0.1. Loss = − ln(0.1) ≈ 2.30.

This is why Neural Networks punish confident wrong answers so heavily.
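Both cases can be checked with a few lines (natural log, matching the values above; skipping terms where P(x) = 0 avoids log(0)):

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum p_i * ln(q_i), in nats; terms with p_i = 0 contribute 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [1.0, 0.0]                        # one-hot true label: Dog
print(cross_entropy(p, [0.9, 0.1]))   # good model: ~0.105
print(cross_entropy(p, [0.1, 0.9]))   # bad model:  ~2.303
```

The loss grows without bound as the correct class's probability approaches zero, which is exactly the heavy punishment for confident wrong answers.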


6. KL Divergence (Relative Entropy)

KL Divergence measures “how much information is lost” when we use distribution Q to approximate P.

It is defined as:

D_KL(P || Q) = H(P, Q) - H(P)
  • Distance Analogy: Think of it as the “distance” between two distributions (though it is not symmetric, so it is not a true metric).
  • If P = Q, then D_KL = 0.
  • This is the core of Variational Autoencoders (VAEs), where we force the learned latent distribution to match a Standard Normal Distribution.
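Computed directly from its usual sum form (equivalent to H(P, Q) − H(P), here in bits; distributions are illustrative):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum p_i * log2(p_i / q_i), in bits; p_i = 0 terms vanish."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # information lost approximating P with Q
print(kl_divergence(p, p))  # identical distributions: 0.0
```

Note the asymmetry: D_KL(P || Q) and D_KL(Q || P) generally differ, which is why it is a divergence rather than a distance.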

7. Summary

  • Surprisal: Rare events carry more information (high bit count).
  • Entropy: The average uncertainty of a system.
  • Huffman Coding: Uses entropy to assign shorter codes to frequent symbols.
  • Cross-Entropy: The standard loss function for classification.