Shannon entropy is the only way to measure information loss that respects composition. There is no other choice.
Information loss
When a stochastic function processes a distribution, some information is lost. How do you measure that loss? Shannon defined entropy in 1948. Baez, Fritz, and Leinster proved his definition is the only one satisfying three natural properties.
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)
# Uniform distribution: maximum entropy (maximum surprise)
uniform = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
print(f"Uniform over 4: H = {entropy(uniform):.2f} bits")
# Certain outcome: zero entropy (no surprise)
certain = {1: 1.0}
print(f"Certain: H = {entropy(certain):.2f} bits")
# Biased coin: between 0 and 1
biased = {"H": 0.9, "T": 0.1}
print(f"Biased coin: H = {entropy(biased):.2f} bits")
# Fair coin: exactly 1 bit
fair = {"H": 0.5, "T": 0.5}
print(f"Fair coin: H = {entropy(fair):.2f} bits")
print()
print("More surprise = more entropy.")
print("Certain = 0 bits. Fair coin = 1 bit.")
Functoriality: the key constraint
The paper's theorem: Shannon entropy is the unique measure of information loss that is functorial (given two mild side conditions, stated below). Functorial means: the loss of a composite equals the sum of the losses of the parts.
If you process data in two steps, the total information lost is the loss from step 1 plus the loss from step 2. Together with the convex-linearity and continuity axioms below, no other definition of "loss" has this property (up to a constant factor).
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)
input_dist = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
after1 = {"A": 0.5, "B": 0.5}
after2 = {"X": 1.0}
h0, h1, h2 = entropy(input_dist), entropy(after1), entropy(after2)
print(f"Loss 1: {h0-h1:.2f}, Loss 2: {h1-h2:.2f}, Total: {h0-h2:.2f}")
print(f"Functorial: {h0-h2:.2f} = {(h0-h1)+(h1-h2):.2f} ✓")
Data processing inequality
A corollary: information can only decrease through processing. Deterministic computation loses or preserves bits; it never creates them. This is the data processing inequality (DPI), proved categorically.
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)
h_in = entropy({1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4})
h_lossy = entropy({"low": 0.3, "high": 0.7})
print(f"H(input) = {h_in:.3f}, H(lossy) = {h_lossy:.3f}")
print(f"Lost: {h_in - h_lossy:.3f} bits. DPI holds ✓")
The uniqueness theorem
The paper's main result (Theorem 4): any function F that assigns a real number to each morphism in FinProb (the category of finite probability spaces) and satisfies:
Functoriality: F(g ∘ f) = F(f) + F(g)
Convex linearity: F respects mixtures, i.e. F(λf ⊕ (1−λ)g) = λF(f) + (1−λ)F(g) (checked numerically after this list)
Continuity: small changes in probabilities → small changes in F
...must be F(f) = c · (H(input) − H(output)) for some constant c ≥ 0. Shannon entropy, up to scale.
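Axiom 2 can be checked numerically. A minimal sketch, assuming "mixture" means the λ-weighted disjoint union of two processes (the paper's convex combination of morphisms); the mix helper and the specific distributions are my choices. The entropy of a disjoint union carries an extra H(λ, 1−λ) term, but it appears in both the input and the output, so it cancels in the loss:

import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical helper: weighted disjoint union of two distributions,
# tagging outcomes so the supports never collide.
def mix(p, q, lam):
    out = {("p", x): lam * pr for x, pr in p.items()}
    out.update({("q", x): (1 - lam) * pr for x, pr in q.items()})
    return out

def F(h_in, h_out):
    return h_in - h_out  # information loss of a morphism

lam = 0.3
f_in, f_out = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}, {"A": 0.5, "B": 0.5}
g_in, g_out = {"H": 0.9, "T": 0.1}, {"X": 1.0}

mixed = F(entropy(mix(f_in, g_in, lam)), entropy(mix(f_out, g_out, lam)))
separate = lam * F(entropy(f_in), entropy(f_out)) + (1 - lam) * F(entropy(g_in), entropy(g_out))
print(f"F(mixture) = {mixed:.6f}")
print(f"lam*F(f) + (1-lam)*F(g) = {separate:.6f}")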
Scheme
; Uniqueness: F(f) = c · (H(input) - H(output))
; Shannon entropy is the ONLY such F (up to scale).
(define (log2 x) (/ (log x) (log 2)))
(define (entropy probs)
(- (apply + (map (lambda (p)
(if (> p 0) (* p (log2 p)) 0)) probs))))
(define h-in (entropy '(0.25 0.25 0.25 0.25)))
(define h-out (entropy '(0.5 0.5)))
(define F (- h-in h-out))
(display "F(f) = H(in) - H(out) = ") (display F) (newline)
; This is the ONLY measure that is:
;   1. Functorial  2. Convex-linear  3. Continuous
; Shannon didn't choose entropy. The axioms forced it.
Python
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)
h_in = entropy({1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25})
h_out = entropy({"A": 0.5, "B": 0.5})
print(f"F(f) = {h_in:.2f} - {h_out:.2f} = {h_in - h_out:.2f} bits")
print("The only functorial, convex-linear, continuous measure.")
Notation reference
Paper                 | Python              | Meaning
H(p)                  | -sum(p*log2(p))     | Shannon entropy
F(f) = c(H(p)−H(q))   | h_in - h_out        | Information loss of morphism f
FinProb               | dicts summing to 1  | Category of finite probability distributions
DPI                   | h_out <= h_in       | Data processing inequality
All examples use finite distributions with explicit probabilities. The paper's theorem holds over FinProb (finite probability spaces, with measure-preserving maps as morphisms). For example, the DPI example on this page merges four outcomes into two and checks that entropy decreases. The classical data processing inequality makes the same point for a noisy channel transmitting continuous signals: no post-processing of the received signal can increase its information about the transmitted one, regardless of the noise model. The inequality is identical. The distributions are not.
The uniqueness result requires all three axioms: drop any one and other measures become possible. Every example on this page is simplified.
Framework connection: Shannon entropy is the unique measure of the Natural Framework pipeline's information budget. DPI governs how bits flow through each stage. (The Natural Framework)
Ready for the real thing? Read the paper. Start at Theorem 4 (p.8) for the uniqueness result. 15 pages.