Shannon Entropy

lean4-proof · information-theory · visualization

$$H(X) = -\sum_{i} p_i \log p_i \ge 0$$

Statement

Let $X$ be a discrete random variable taking values in $\{1, \ldots, n\}$ with probability distribution $(p_1, \ldots, p_n)$, where $p_i \ge 0$ and $\sum p_i = 1$. The Shannon entropy is

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$

(with the convention $0 \log 0 = 0$). The entropy satisfies:

  1. $H(X) \ge 0$ for all distributions.
  2. $H(X) = 0$ if and only if one $p_i = 1$ and the rest are zero.
  3. $H(X)$ is maximized at $\log n$ when $p_i = 1/n$ for all $i$ (a quick numerical check follows this list).
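
A quick numerical sanity check of these three properties, in Lean's `Float` (a sketch only, not part of the formal development; the helper `entropyF` is an illustrative name, not a Mathlib definition):

def entropyF (ps : List Float) : Float :=
  -- the `p == 0.0` guard implements the 0 log 0 = 0 convention
  ps.foldl (fun acc p => acc + (if p == 0.0 then 0.0 else -(p * Float.log p))) 0.0

#eval entropyF [1.0, 0.0, 0.0, 0.0]      -- 0.0: a point mass has zero entropy (property 2)
#eval entropyF [0.25, 0.25, 0.25, 0.25]  -- ≈ 1.386 ≈ ln 4: the uniform maximum (property 3)
#eval entropyF [0.5, 0.25, 0.25]         -- ≈ 1.040: strictly positive (property 1)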

For a Bernoulli$(p)$ source the entropy reduces to the binary entropy function

$$h(p) = -p \log p - (1-p) \log(1-p),$$

which peaks at $h(1/2) = \log 2$.
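
A direct check of the peak value from the definition (simple arithmetic, natural log):

$$h(1/2) = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2} = \log 2 \approx 0.693.$$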

Visualization

Binary entropy $h(p)$ for selected values (natural log; maximum at $p = 1/2$):

  p       -p ln p     -(1-p) ln(1-p)     h(p)
  0.00     0.000          0.000          0.000
  0.10     0.230          0.095          0.325
  0.20     0.322          0.179          0.500
  0.30     0.361          0.250          0.611
  0.40     0.366          0.306          0.673
  0.50     0.347          0.347          0.693

The curve is strictly concave and symmetric about $p = 1/2$.

ASCII sketch of $h(p)$:

h
|         *
|       *   *
|     *       *
|   *           *
| *               *
+-------------------> p
0        1/2       1
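
The table values can be reproduced with a short `Float` computation (again a numeric sketch only; `hF` is an illustrative name, not a Mathlib definition):

def hF (p : Float) : Float :=
  -- binary entropy with natural log; the guard handles p = 0 and p = 1
  let term := fun (x : Float) => if x == 0.0 then 0.0 else -(x * Float.log x)
  term p + term (1.0 - p)

#eval hF 0.3  -- ≈ 0.611, matching the table row for p = 0.30
#eval hF 0.5  -- ≈ 0.693 ≈ ln 2, the maximum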

Proof Sketch

  1. Each term $-p_i \log p_i$ equals $\text{negMulLog}(p_i)$, the function $x \mapsto -x \log x$.
  2. Mathlib proves `negMulLog_nonneg`: for $0 \le x \le 1$ we have $-x \log x \ge 0$, because $\log x \le 0$ on $[0, 1]$.
  3. A sum of non-negative terms is non-negative, so $H(X) \ge 0$.
  4. At $x = 0$ or $x = 1$ the term vanishes by continuity ($0 \cdot (-\infty) = 0$ by convention). At $p = 1/2$ the derivative of the binary entropy, $-\log p - 1 + \log(1-p) + 1 = \log\frac{1-p}{p}$, equals zero, confirming the maximum.

Connections

The concavity of entropy underlies the proof of Channel Capacity ($C = \max_{P_X} I(X;Y)$) and drives the argument in the Source Coding Theorem ($H(X) \le \mathbb{E}[\ell] < H(X) + 1$). The same log-sum inequality appears in Markov's Inequality-style bounds ($P(X \ge a) \le \mathbb{E}[X]/a$) via Jensen's inequality.

Lean4 Proof

import Mathlib.Analysis.SpecialFunctions.Log.NegMulLog
-- `binEntropy` lives outside the NegMulLog file; this import path is assumed.
import Mathlib.Analysis.SpecialFunctions.BinaryEntropy

open Real

/-- Each term -p * log p is non-negative for p in [0,1]. -/
theorem negMulLog_term_nonneg (p : ℝ) (hp0 : 0 ≤ p) (hp1 : p ≤ 1) :
    0 ≤ negMulLog p :=
  negMulLog_nonneg hp0 hp1

/-- Binary entropy h(p) = -p log p - (1-p) log(1-p) is non-negative on [0,1].
    Mathlib's `binEntropy_nonneg` covers this directly. -/
theorem binEntropy_nonneg' (p : ℝ) (hp0 : 0 ≤ p) (hp1 : p ≤ 1) :
    0 ≤ binEntropy p :=
  binEntropy_nonneg hp0 hp1
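
The two theorems above cover a single term and the binary case. Mirroring steps 1–3 of the proof sketch, the termwise bound extends to a full distribution by summing. The snippet below is a sketch rather than an existing Mathlib lemma: the theorem name and hypothesis names are illustrative, and the bound $p_i \le 1$ is taken as a hypothesis (it follows from $\sum p_i = 1$, but assuming it directly keeps the proof to one line).

import Mathlib.Analysis.SpecialFunctions.Log.NegMulLog

open Real

/-- Sketch: the entropy ∑ i, negMulLog (p i) of a distribution on `Fin n`
    is non-negative, by summing the termwise bound with `Finset.sum_nonneg`. -/
theorem entropy_sum_nonneg {n : ℕ} (p : Fin n → ℝ)
    (hp0 : ∀ i, 0 ≤ p i) (hp1 : ∀ i, p i ≤ 1) :
    0 ≤ Finset.univ.sum (fun i => negMulLog (p i)) :=
  Finset.sum_nonneg fun i _ => negMulLog_nonneg (hp0 i) (hp1 i)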