Shannon entropy is the only way to measure information loss that respects composition. There is no other choice.
Information loss
When a stochastic function processes a distribution, some information is lost. How do you measure that loss? Shannon defined entropy in 1948. Baez, Fritz, and Leinster proved his definition is the only one satisfying three natural properties.
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)
# Uniform distribution: maximum entropy (maximum surprise)
uniform = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
print(f"Uniform over 4: H = {entropy(uniform):.2f} bits")
# Certain outcome: zero entropy (no surprise)
certain = {1: 1.0}
print(f"Certain: H = {entropy(certain):.2f} bits")
# Biased coin: between 0 and 1
biased = {"H": 0.9, "T": 0.1}
print(f"Biased coin: H = {entropy(biased):.2f} bits")
# Fair coin: exactly 1 bit
fair = {"H": 0.5, "T": 0.5}
print(f"Fair coin: H = {entropy(fair):.2f} bits")
print()
print("More surprise = more entropy.")
print("Certain = 0 bits. Fair coin = 1 bit.")
Functoriality: the key constraint
The paper's theorem: Shannon entropy is the unique measure of information loss that is functorial (given two mild side conditions, stated below). Functorial means: the loss of a composite equals the sum of the losses of the parts.
If you process data in two steps, the total information lost is the loss from step 1 plus the loss from step 2. Together with the convex-linearity and continuity axioms below, no other definition of "loss" has this property (up to a constant factor).
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)
input_dist = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
after1 = {"A": 0.5, "B": 0.5}
after2 = {"X": 1.0}
h0, h1, h2 = entropy(input_dist), entropy(after1), entropy(after2)
print(f"Loss 1: {h0-h1:.2f}, Loss 2: {h1-h2:.2f}, Total: {h0-h2:.2f}")
print(f"Functorial: {h0-h2:.2f} = {(h0-h1)+(h1-h2):.2f} ✓")
Data processing inequality
A corollary: information can only decrease through processing. Deterministic computation loses or preserves bits; it never creates them. This is the data processing inequality (DPI), proved categorically.
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)
h_in = entropy({1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4})
h_lossy = entropy({"low": 0.3, "high": 0.7})
print(f"H(input) = {h_in:.3f}, H(lossy) = {h_lossy:.3f}")
print(f"Lost: {h_in - h_lossy:.3f} bits. DPI holds ✓")
The uniqueness theorem
The paper's main result (Theorem 4): any function F that assigns a real number to each morphism in FinProb (the category of finite probability spaces) and satisfies:
Functoriality: F(g ∘ f) = F(f) + F(g)
Convex linearity: F respects mixtures, i.e. F(λf ⊕ (1−λ)g) = λF(f) + (1−λ)F(g) (checked numerically after this list)
Continuity: small changes in probabilities → small changes in F
...must be F(f) = c · (H(input) − H(output)) for some constant c ≥ 0. Shannon entropy, up to scale.
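Axiom 2 can be checked numerically. A minimal sketch, assuming "mixture" means the λ-weighted disjoint union of two processes (the paper's convex combination of morphisms); the mix helper and the specific distributions are my choices. The entropy of a disjoint union carries an extra H(λ, 1−λ) term, but it appears in both the input and the output, so it cancels in the loss:

import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical helper: weighted disjoint union of two distributions,
# tagging outcomes so the supports never collide.
def mix(p, q, lam):
    out = {("p", x): lam * pr for x, pr in p.items()}
    out.update({("q", x): (1 - lam) * pr for x, pr in q.items()})
    return out

def F(h_in, h_out):
    return h_in - h_out  # information loss of a morphism

lam = 0.3
f_in, f_out = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}, {"A": 0.5, "B": 0.5}
g_in, g_out = {"H": 0.9, "T": 0.1}, {"X": 1.0}

mixed = F(entropy(mix(f_in, g_in, lam)), entropy(mix(f_out, g_out, lam)))
separate = lam * F(entropy(f_in), entropy(f_out)) + (1 - lam) * F(entropy(g_in), entropy(g_out))
print(f"F(mixture) = {mixed:.6f}")
print(f"lam*F(f) + (1-lam)*F(g) = {separate:.6f}")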
Scheme
; Uniqueness: F(f) = c · (H(input) - H(output))
; Shannon entropy is the ONLY such F (up to scale).
(define (log2 x) (/ (log x) (log 2)))
(define (entropy probs)
(- (apply + (map (lambda (p)
(if (> p 0) (* p (log2 p)) 0)) probs))))
(define h-in (entropy '(0.25 0.25 0.25 0.25)))
(define h-out (entropy '(0.5 0.5)))
(define F (- h-in h-out))
(display "F(f) = H(in) - H(out) = ") (display F) (newline)
; This is the ONLY measure that is:
;   1. Functorial  2. Convex-linear  3. Continuous
; Shannon didn't choose entropy. The axioms forced it.
Python
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)
h_in = entropy({1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25})
h_out = entropy({"A": 0.5, "B": 0.5})
print(f"F(f) = {h_in:.2f} - {h_out:.2f} = {h_in - h_out:.2f} bits")
print("The only functorial, convex-linear, continuous measure.")
Notation reference
Paper                 | Python              | Meaning
H(p)                  | -sum(p*log2(p))     | Shannon entropy
F(f) = c(H(p)−H(q))   | h_in - h_out        | Information loss of morphism f
FinProb               | dicts summing to 1  | Category of finite probability distributions
DPI                   | h_out <= h_in       | Data processing inequality
All examples use finite distributions with explicit probabilities. The paper's theorem holds over FinProb (finite probability spaces, with measure-preserving maps as morphisms). For example, the DPI example on this page merges four outcomes into two and checks that entropy decreases. The classical data processing inequality makes the same point for a noisy channel transmitting continuous signals: no post-processing of the received signal can increase its information about the transmitted one, regardless of the noise model. The inequality is identical. The distributions are not.
The uniqueness result requires all three axioms: drop any one and other measures become possible. Every example on this page is simplified.
Framework connection: Shannon entropy is the unique measure of the Natural Framework pipeline's information budget. DPI governs how bits flow through each stage. (The Natural Framework)
Ready for the real thing? Read the paper. Start at Theorem 4 (p.8) for the uniqueness result. 15 pages.