Shannon Entropy

lean4-proof · information-theory · visualization

$$H(X) = -\sum_{i} p_i \log p_i \ge 0$$

Statement

Let $X$ be a discrete random variable taking values in $\{1, \ldots, n\}$ with probability distribution $(p_1, \ldots, p_n)$, where $p_i \ge 0$ and $\sum p_i = 1$. The Shannon entropy is

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$

(with the convention $0 \log 0 = 0$). The entropy satisfies:

  1. $H(X) \ge 0$ for all distributions.
  2. $H(X) = 0$ if and only if one $p_i = 1$ and the rest are zero.
  3. $H(X)$ is maximized at $\log n$ when $p_i = 1/n$ for all $i$ (a quick numerical check follows this list).
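
A quick numerical sanity check of these three properties, in Lean's `Float` (a sketch only, not part of the formal development; the helper `entropyF` is an illustrative name, not a Mathlib definition):

def entropyF (ps : List Float) : Float :=
  -- the `p == 0.0` guard implements the 0 log 0 = 0 convention
  ps.foldl (fun acc p => acc + (if p == 0.0 then 0.0 else -(p * Float.log p))) 0.0

#eval entropyF [1.0, 0.0, 0.0, 0.0]      -- 0.0: a point mass has zero entropy (property 2)
#eval entropyF [0.25, 0.25, 0.25, 0.25]  -- ≈ 1.386 ≈ ln 4: the uniform maximum (property 3)
#eval entropyF [0.5, 0.25, 0.25]         -- ≈ 1.040: strictly positive (property 1)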

For a Bernoulli$(p)$ source the entropy reduces to the binary entropy function

$$h(p) = -p \log p - (1-p) \log(1-p),$$

which peaks at $h(1/2) = \log 2$.
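
A direct check of the peak value from the definition (simple arithmetic, natural log):

$$h(1/2) = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2} = \log 2 \approx 0.693.$$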

Visualization

Binary entropy $h(p)$ for selected values (natural log; maximum at $p = 1/2$):

  p       -p ln p     -(1-p) ln(1-p)     h(p)
  0.00     0.000          0.000          0.000
  0.10     0.230          0.095          0.325
  0.20     0.322          0.179          0.500
  0.30     0.361          0.250          0.611
  0.40     0.366          0.306          0.673
  0.50     0.347          0.347          0.693

The curve is strictly concave and symmetric about $p = 1/2$.

ASCII sketch of $h(p)$:

h
|         *
|       *   *
|     *       *
|   *           *
| *               *
+-------------------> p
0        1/2       1
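
The table values can be reproduced with a short `Float` computation (again a numeric sketch only; `hF` is an illustrative name, not a Mathlib definition):

def hF (p : Float) : Float :=
  -- binary entropy with natural log; the guard handles p = 0 and p = 1
  let term := fun (x : Float) => if x == 0.0 then 0.0 else -(x * Float.log x)
  term p + term (1.0 - p)

#eval hF 0.3  -- ≈ 0.611, matching the table row for p = 0.30
#eval hF 0.5  -- ≈ 0.693 ≈ ln 2, the maximum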

Proof Sketch

  1. Each term $-p_i \log p_i$ equals $\text{negMulLog}(p_i)$, the function $x \mapsto -x \log x$.
  2. Mathlib proves `negMulLog_nonneg`: for $0 \le x \le 1$ we have $-x \log x \ge 0$, because $\log x \le 0$ on $[0, 1]$.
  3. A sum of non-negative terms is non-negative, so $H(X) \ge 0$.
  4. At $x = 0$ or $x = 1$ the term vanishes by continuity ($0 \cdot (-\infty) = 0$ by convention). At $p = 1/2$ the derivative of the binary entropy, $-\log p - 1 + \log(1-p) + 1 = \log\frac{1-p}{p}$, equals zero, confirming the maximum.

Connections

The concavity of entropy underlies the proof of Channel Capacity ($C = \max_{P_X} I(X;Y)$) and drives the argument in the Source Coding Theorem ($H(X) \le \mathbb{E}[\ell] < H(X) + 1$). The same log-sum inequality appears in Markov's Inequality-style bounds ($P(X \ge a) \le \mathbb{E}[X]/a$) via Jensen's inequality.

Lean4 Proof

import Mathlib.Analysis.SpecialFunctions.Log.NegMulLog
-- `binEntropy` lives outside the NegMulLog file; this import path is assumed.
import Mathlib.Analysis.SpecialFunctions.BinaryEntropy

open Real

/-- Each term -p * log p is non-negative for p in [0,1]. -/
theorem negMulLog_term_nonneg (p : ℝ) (hp0 : 0 ≤ p) (hp1 : p ≤ 1) :
    0 ≤ negMulLog p :=
  negMulLog_nonneg hp0 hp1

/-- Binary entropy h(p) = -p log p - (1-p) log(1-p) is non-negative on [0,1].
    Mathlib's `binEntropy_nonneg` covers this directly. -/
theorem binEntropy_nonneg' (p : ℝ) (hp0 : 0 ≤ p) (hp1 : p ≤ 1) :
    0 ≤ binEntropy p :=
  binEntropy_nonneg hp0 hp1
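
The two theorems above cover a single term and the binary case. Mirroring steps 1–3 of the proof sketch, the termwise bound extends to a full distribution by summing. The snippet below is a sketch rather than an existing Mathlib lemma: the theorem name and hypothesis names are illustrative, and the bound $p_i \le 1$ is taken as a hypothesis (it follows from $\sum p_i = 1$, but assuming it directly keeps the proof to one line).

import Mathlib.Analysis.SpecialFunctions.Log.NegMulLog

open Real

/-- Sketch: the entropy ∑ i, negMulLog (p i) of a distribution on `Fin n`
    is non-negative, by summing the termwise bound with `Finset.sum_nonneg`. -/
theorem entropy_sum_nonneg {n : ℕ} (p : Fin n → ℝ)
    (hp0 : ∀ i, 0 ≤ p i) (hp1 : ∀ i, p i ≤ 1) :
    0 ≤ Finset.univ.sum (fun i => negMulLog (p i)) :=
  Finset.sum_nonneg fun i _ => negMulLog_nonneg (hp0 i) (hp1 i)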