Gradient Descent Convergence

Tags: lean4-proof, optimization, visualization
$$f(x_n) - f(x^*) \le \left(1 - \frac{\mu}{L}\right)^{n} \bigl(f(x_0) - f(x^*)\bigr)$$

Statement

Let $f : \mathbb{R}^n \to \mathbb{R}$ be convex and $L$-smooth (i.e., $\nabla f$ is $L$-Lipschitz). The gradient descent update with step size $\alpha = 1/L$:

$$x_{n+1} = x_n - \frac{1}{L}\, \nabla f(x_n)$$

satisfies, for the distance to the minimizer $x^*$:

$$\|x_{n+1} - x^*\|^2 \le \left(1 - \frac{\mu}{L}\right)\|x_n - x^*\|^2$$

when $f$ is additionally $\mu$-strongly convex. For plain convex $f$, the objective error decreases as $O(1/n)$.

Concrete instance. $f(x) = x^2/2$, so $L = 1$ and $x^* = 0$.

Update: $x_{n+1} = x_n - x_n = 0$: convergence in one step with step size $1/L = 1$.

With step $\alpha = 1/2$ (undershoot): $x_{n+1} = x_n/2$, geometric convergence with ratio $1/2$.
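
The undershoot update can be checked mechanically, in the style of the Lean4 Proof section below; a minimal sketch (the theorem name `gd_quadratic_undershoot` is ours):

```lean
import Mathlib.Data.Real.Basic

/-- For f(x) = x²/2 (so ∇f(x) = x) with step α = 1/2,
    one update halves the iterate: x - (1/2)x = x/2. -/
theorem gd_quadratic_undershoot (x : ℝ) :
    x - (1 / 2) * x = (1 / 2) * x := by ring
```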

Visualization

Gradient descent on $f(x) = x^2$ starting at $x_0 = 2$, step $\alpha = 0.4$:

Update rule: $x_{n+1} = x_n - 0.4 \cdot 2x_n = 0.2\, x_n$.
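
The first row of the table below can be checked in one line; a sketch involving only the stated arithmetic:

```lean
import Mathlib.Data.Real.Basic

/-- First iterate: x₁ = x₀ - 0.4 · (2x₀) = 0.4 when x₀ = 2. -/
example : (2 : ℝ) - 0.4 * (2 * 2) = 0.4 := by norm_num
```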

| $n$ | $x_n$ | $f(x_n) = x_n^2$ | $\lVert x_n - 0\rVert$ |
|----:|------:|-----------------:|-----------------------:|
| 0 | 2.000 | 4.000 | 2.000 |
| 1 | 0.400 | 0.160 | 0.400 |
| 2 | 0.080 | 0.0064 | 0.080 |
| 3 | 0.016 | 0.000256 | 0.016 |
| 4 | 0.0032 | 0.0000102 | 0.0032 |

Contraction factor per step: $x_{n+1}/x_n = 0.2$, so $\|x_n\| = 2 \cdot (0.2)^n \to 0$.

```
  f(x)
  4 |*
    |
  1 |  *
    |
0.1 |    *
    |      * * * ...
  0 +--+--+--+--+---
    0  1  2  3  4   n
```

The objective $f(x_n) = 4 \cdot (0.04)^n$ decreases geometrically.
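
The closed form $x_n = 2 \cdot (0.2)^n$ behind the table follows by induction; a minimal Lean sketch, where `gdIter` is a hypothetical helper we define for this page (not a Mathlib name):

```lean
import Mathlib.Data.Real.Basic

/-- Hypothetical helper: the iterates of gradient descent on f(x) = x²
    with step α = 0.4 and x₀ = 2, i.e. x_{n+1} = 0.2 · x_n. -/
def gdIter : ℕ → ℝ
  | 0 => 2
  | n + 1 => 0.2 * gdIter n

/-- Closed form x_n = 2 · 0.2ⁿ, by induction on n. -/
theorem gdIter_closed_form (n : ℕ) : gdIter n = 2 * 0.2 ^ n := by
  induction n with
  | zero => simp [gdIter]
  | succ n ih => simp only [gdIter, ih]; ring
```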

Proof Sketch

  1. $L$-smoothness descent lemma: $f(x_{n+1}) \le f(x_n) - \frac{1}{2L}\|\nabla f(x_n)\|^2$.
  2. Convexity: $f(x^*) \ge f(x_n) + \langle \nabla f(x_n),\, x^* - x_n \rangle$.
  3. Combine, using co-coercivity of $\nabla f$ (which follows from convexity plus $L$-smoothness): the squared distance to $x^*$ decreases by at least $\frac{1}{L^2}\|\nabla f(x_n)\|^2$ each step.
  4. For $f(x) = x^2$: $\nabla f(x) = 2x$ and $L = 2$. With step $\alpha = 1/L = 1/2$: $x_{n+1} = x_n - \frac{1}{2} \cdot 2x_n = 0$. With $\alpha = 0.4$: $x_{n+1} = x_n(1 - 0.4 \cdot 2) = 0.2\, x_n$. The contraction of step 3 becomes an exact identity here; see the sketch after this list.
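
For the quadratic instance with $x^* = 0$, the squared distance contracts by the exact factor $0.2^2 = 0.04$ per step; a minimal Lean sketch (theorem name ours):

```lean
import Mathlib.Data.Real.Basic

/-- For f(x) = x² with step α = 0.4 and x* = 0, the squared distance
    to the minimizer contracts by exactly 0.04 each step. -/
theorem gd_quadratic_sq_dist (x : ℝ) :
    (x - 0.4 * (2 * x) - 0) ^ 2 = 0.04 * (x - 0) ^ 2 := by ring
```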

Connections

  • Convex Function ($f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y)$): strong convexity and $L$-smoothness quantitatively strengthen plain convexity, and the rate $1 - \mu/L$ depends on both constants.
  • Mean Value Theorem ($\exists\, c \in (a,b) : f'(c) = \frac{f(b)-f(a)}{b-a}$): the $L$-smoothness bound $\|\nabla f(x) - \nabla f(y)\| \le L\|x-y\|$ is a Lipschitz condition on the gradient, proved via the MVT for twice-differentiable $f$.
  • Cauchy Criterion ($\forall\varepsilon>0\;\exists N:\; m,n \geq N \implies |a_m - a_n| < \varepsilon$): geometric decrease of $\|x_n - x^*\|$ implies the iterate sequence is Cauchy, hence convergent.
  • Monotone Convergence Theorem (a bounded monotone sequence of reals converges): the objective sequence $f(x_n)$ is monotone decreasing and bounded below, so it converges.
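
The monotone-convergence connection can be made concrete for the run above; a sketch reusing the hypothetical `gdIter` sequence from the Visualization section (assumes that definition is in scope):

```lean
-- Continues the earlier sketch: `gdIter` from the Visualization section.

/-- The iterates stay strictly positive. -/
theorem gdIter_pos (n : ℕ) : 0 < gdIter n := by
  induction n with
  | zero => norm_num [gdIter]
  | succ n ih => simp only [gdIter]; exact mul_pos (by norm_num) ih

/-- Each step decreases the iterate, so the objective sequence
    f(x_n) = (gdIter n)² is monotone decreasing and bounded below. -/
theorem gdIter_decreasing (n : ℕ) : gdIter (n + 1) ≤ gdIter n := by
  have h := gdIter_pos n
  simp only [gdIter]
  linarith
```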

Lean4 Proof

```lean
import Mathlib.Data.Real.Basic

/-- Gradient descent on f(x) = x²/2 with step α = 1 converges in one step.
    Starting at x₀, update x₁ = x₀ - 1 * x₀ = 0. -/
theorem gd_quadratic_one_step (x0 : ℝ) :
    x0 - 1 * x0 = 0 := by ring

/-- Geometric convergence for f(x) = x², step α = 0.4:
    x_{n+1} = x_n - 0.4 * 2 * x_n = 0.2 * x_n. -/
theorem gd_quadratic_contraction (x : ℝ) :
    x - 0.4 * (2 * x) = 0.2 * x := by ring
```