Mutual Information

Tags: lean4-proof, information-theory, visualization

$$I(X;Y) = H(X) + H(Y) - H(X,Y) \ge 0$$

Statement

The mutual information between discrete random variables $X$ and $Y$ is

$$I(X; Y) = H(X) + H(Y) - H(X, Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}.$$
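
The two expressions agree because the marginals of $p(x,y)$ are $p(x)$ and $p(y)$: expanding the logarithm of the ratio,

$$
\sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}
= \sum_{x,y} p(x,y)\log p(x,y) - \sum_{x,y} p(x,y)\log p(x) - \sum_{x,y} p(x,y)\log p(y)
= -H(X,Y) + H(X) + H(Y),
$$

using $\sum_y p(x,y) = p(x)$ and $\sum_x p(x,y) = p(y)$.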

Key properties:

  • $I(X; Y) \ge 0$, with equality iff $X$ and $Y$ are independent.
  • $I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$ (see the derivation after this list).
  • $I(X; X) = H(X)$ (a variable is maximally informative about itself).
  • $I(X; Y) \le \min(H(X), H(Y))$.
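
The conditional-entropy identities above follow from the chain rule $H(X,Y) = H(Y) + H(X \mid Y)$:

$$
I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) + H(Y) - \bigl(H(Y) + H(X \mid Y)\bigr) = H(X) - H(X \mid Y),
$$

and symmetrically $I(X;Y) = H(Y) - H(Y \mid X)$. Since conditional entropy of discrete variables is non-negative, this also gives the bound $I(X;Y) \le \min(H(X), H(Y))$.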

Visualization

Three joint distributions over $X, Y \in \{0, 1\}$:

Case 1: Independent ($I = 0$)

| $p(x,y)$ | $Y=0$ | $Y=1$ |
|----------|-------|-------|
| $X=0$    | 0.25  | 0.25  |
| $X=1$    | 0.25  | 0.25  |

$H(X) = H(Y) = \ln 2$; $H(X,Y) = 2\ln 2$; $I = \ln 2 + \ln 2 - 2\ln 2 = 0$.

Case 2: $Y = X$ ($I = H(X)$)

| $p(x,y)$ | $Y=0$ | $Y=1$ |
|----------|-------|-------|
| $X=0$    | 0.50  | 0.00  |
| $X=1$    | 0.00  | 0.50  |

$H(X) = H(Y) = \ln 2$; $H(X,Y) = \ln 2$; $I = \ln 2 + \ln 2 - \ln 2 = \ln 2 = H(X)$.

Case 3: Partial dependence ($0 < I < H(X)$)

| $p(x,y)$ | $Y=0$ | $Y=1$ |
|----------|-------|-------|
| $X=0$    | 0.40  | 0.10  |
| $X=1$    | 0.10  | 0.40  |

$H(X) = H(Y) = \ln 2 \approx 0.693$; $H(X,Y) \approx 1.193$; $I \approx 0.693 + 0.693 - 1.193 = 0.193$.

All three cases confirm $0 \le I \le H(X)$.
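
These three values can be reproduced numerically. Below is a minimal, self-contained Lean sketch using `Float` arithmetic; the helper `mutualInfoFloat` is purely illustrative and separate from the `Real`-valued `mutualInfo` defined in the proof section further down.

def mutualInfoFloat (p00 p01 p10 p11 : Float) : Float :=
  -- entropy contribution of one probability, with the convention 0 * log 0 = 0
  let h : Float → Float := fun x => if x ≤ 0 then 0 else -(x * Float.log x)
  let hX  := h (p00 + p01) + h (p10 + p11)   -- H(X) from the row sums
  let hY  := h (p00 + p10) + h (p01 + p11)   -- H(Y) from the column sums
  let hXY := h p00 + h p01 + h p10 + h p11   -- H(X,Y) over all four cells
  hX + hY - hXY

#eval mutualInfoFloat 0.25 0.25 0.25 0.25  -- Case 1: ≈ 0
#eval mutualInfoFloat 0.50 0.00 0.00 0.50  -- Case 2: ≈ 0.693 = ln 2
#eval mutualInfoFloat 0.40 0.10 0.10 0.40  -- Case 3: ≈ 0.193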

Proof Sketch

  1. Write $I(X;Y) = \mathrm{KL}(p(x,y) \,\|\, p(x)p(y))$, the Kullback-Leibler divergence between the joint distribution and the product of the marginals.
  2. By the log-sum inequality (a consequence of Jensen's inequality applied to the convex function $-\log$), $\mathrm{KL}(Q \,\|\, P) \ge 0$ for any distributions $Q, P$ (expanded below).
  3. Equality holds iff $p(x,y) = p(x)p(y)$ for all $x,y$, i.e., $X \perp Y$.
  4. The formula $I(X;X) = H(X)$ follows: $\mathrm{KL}(p(x,x) \,\|\, p(x)^2) = \sum_x p(x)\log\bigl(p(x)/p(x)^2\bigr) = -\sum_x p(x)\log p(x) = H(X)$.
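
Expanding step 2: writing $q(z) = p(x,y)$ and $p(z) = p(x)p(y)$ for $z = (x,y)$, Jensen's inequality for the convex function $-\log$ gives

$$
\mathrm{KL}(Q \,\|\, P) = \sum_{z} q(z)\left(-\log\frac{p(z)}{q(z)}\right)
\;\ge\; -\log\!\left(\sum_{z} q(z)\,\frac{p(z)}{q(z)}\right)
= -\log\!\left(\sum_{z \in \operatorname{supp}(q)} p(z)\right) \;\ge\; -\log 1 = 0,
$$

with equality iff $p(z)/q(z)$ is constant on the support of $q$, i.e. $Q = P$.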

Connections

Mutual information is the fundamental quantity whose maximization defines Channel Capacity, $C = \max_{P_X} I(X;Y)$. Its non-negativity follows from the same convexity argument underlying the concavity of Shannon Entropy, which also appears in Chebyshev's Inequality style arguments.

Lean4 Proof

import Mathlib.Analysis.SpecialFunctions.Log.NegMulLog

open Real

/-- Shannon entropy of a 2-point distribution with probabilities p and (1-p).
    H(p) = -p * log p - (1-p) * log (1-p). -/
noncomputable def binaryEntropy (p : ℝ) : ℝ :=
  -(p * Real.log p) - ((1 - p) * Real.log (1 - p))

/-- Mutual information for a joint distribution on a finite type,
    defined as I(X;Y) = H(X) + H(Y) - H(X,Y). -/
noncomputable def mutualInfo {α β : Type*} [Fintype α] [Fintype β]
    (joint : α → β → ℝ) : ℝ :=
  let margX : α → ℝ := fun a => ∑ b, joint a b
  let margY : β → ℝ := fun b => ∑ a, joint a b
  let hX    := -∑ a, margX a * Real.log (margX a)
  let hY    := -∑ b, margY b * Real.log (margY b)
  let hXY   := -∑ a, ∑ b, joint a b * Real.log (joint a b)
  hX + hY - hXY

/-- I(X;X) = H(X): a variable is maximally informative about itself.
    For the diagonal joint (p(a,b) = p_a when a=b, else 0):
    H(X,X) = H(X), so I = H(X) + H(X) - H(X) = H(X). -/
theorem mutual_info_self_eq_entropy {α : Type*} [DecidableEq α] [Fintype α]
    (pmf : α → ℝ) :
    let diagJoint : α → α → ℝ := fun a b => if a = b then pmf a else 0
    mutualInfo diagJoint =
      -∑ a, pmf a * Real.log (pmf a) := by
  simp [mutualInfo, Finset.sum_ite_eq', Finset.mem_univ]
  ring
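
As a small usage check (assuming the definitions above elaborate as written), the theorem can be specialized to the uniform distribution on `Fin 2`, the fair coin of Case 2:

-- Specializes the theorem to pmf = fun _ => 1/2, i.e. I(X;X) = H(X) for a fair coin.
#check mutual_info_self_eq_entropy (fun _ : Fin 2 => (1 / 2 : ℝ))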