Lipschitz
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A function $f$ is $L$-Lipschitz continuous if changing the input by a little bit can't change the output by more than a proportional amount,

$$\|f(x) - f(y)\| \le L\,\|x - y\|$$

for all $x, y$ in its domain. In particular, taking $y \to x$ shows that the gradient norm is bounded by $L$ wherever $f$ is differentiable, but the definition doesn't require that $f$ be differentiable at all.
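For intuition, here's a small numpy sketch (the test function and sampling scheme are illustrative choices of mine, not anything from these notes) that empirically lower-bounds a function's Lipschitz constant by taking the largest slope over random pairs of points. A finite sample can only ever certify a lower bound, never the exact constant.

```python
import numpy as np

def lipschitz_lower_bound(f, dim, n_pairs=100_000, scale=1.0, seed=0):
    """Estimate a *lower bound* on the Lipschitz constant of f by taking
    the maximum slope ||f(x) - f(y)|| / ||x - y|| over random pairs."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_pairs, dim)) * scale
    y = rng.normal(size=(n_pairs, dim)) * scale
    num = np.linalg.norm(f(x) - f(y), axis=-1)   # ||f(x) - f(y)||
    den = np.linalg.norm(x - y, axis=-1)         # ||x - y||
    return np.max(num / den)

# Example: elementwise sin(3x) is 3-Lipschitz; the estimate approaches 3 from below.
print(lipschitz_lower_bound(lambda x: np.sin(3 * x), dim=2))
```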
Sometimes people talk about "Lipschitz smoothness" or a "$\beta$-smooth" function to mean a function whose gradient is Lipschitz continuous with constant $\beta$. Interestingly, a (closed, convex) function is $\beta$-smooth if and only if its convex dual (conjugate) is $\tfrac{1}{\beta}$-strongly convex (see, e.g., https://xingyuzhou.org/blog/notes/Lipschitz-gradient).
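As a sanity check on that duality, here's the textbook quadratic example (my own illustration, not taken from the linked note): the smoothness constant of $f$ and the strong-convexity constant of its conjugate are exact reciprocals.

```latex
% Textbook quadratic example of the smoothness / strong-convexity duality.
\begin{align*}
  f(x)   &= \tfrac{\beta}{2}\,\|x\|^2
           && \text{($\beta$-smooth: $\nabla f(x) = \beta x$ is $\beta$-Lipschitz)} \\
  f^*(y) &= \sup_x\,\big(\langle y, x\rangle - \tfrac{\beta}{2}\|x\|^2\big)
          = \tfrac{1}{2\beta}\,\|y\|^2
           && \text{($\tfrac{1}{\beta}$-strongly convex; supremum attained at $x = y/\beta$)}
\end{align*}
```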
In deep learning
Say we have a deep network in which all the layers are $L$-Lipschitz. What can we say about the expressiveness and dynamics of such networks? A few observations:
- This network can implement contractions, equivalent to "forgetting" or "erasing" information (in the extreme case, note that constant functions like $f(x) = c$ are $0$-Lipschitz). This is basically what's required for error correction, arbitrary logic gates, universal computation, etc. In particular, arbitrary operations on the unit hypercube (binary strings) are Lipschitz with a constant bounded by the hypercube's diameter, since distinct binary strings are at distance at least 1 apart, so there's no fundamental limit on computational expressivity. However, there are some limits to expressivity for specific input representations, and as a function of input size:
  - hard attention / lookups at a fixed margin require increasing Lipschitz constants to discriminate between context entries (see "Softmax is not enough", https://openreview.net/forum?id=S4JmmpnSPy)
  - sorting analog inputs (i.e., outputting their rank order) is non-Lipschitz, since the discrete output branches on arbitrarily small differences in the inputs (a digital input representation would make it Lipschitz)
- The network can only expand or "amplify" small differences in its inputs by a bounded factor: composing $d$ layers that are each $L$-Lipschitz gives a map that is at most $L^d$-Lipschitz (see the sketch after this list). In particular, if $L \le 1$ the network can't blow up small differences at all, which rules out exploding gradients (but it is still subject to vanishing gradients arising from contractions/forgetting).
- Naively we might worry that for any given $L$ we can rescale the inputs to achieve this $L$, trading geometric margin for the Lipschitz constraint without actually changing any of the dynamics. However, two factors ensure that the Lipschitz constraint in deep networks actually "bites":
  - Input and output representations (e.g. pixel values or normalized token embeddings as inputs, and logit outputs) have a fixed scale that the network can't control.
  - Internal normalization layers (BatchNorm, LayerNorm, etc.) control the scale of activations.
- Internal normalization layers and residual connections also
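To make the composition bound above concrete, here's a small PyTorch sketch (the toy architecture is my own, not anything specific from these notes) that computes the standard product-of-spectral-norms upper bound on a feedforward network's Lipschitz constant. The bound is easy to compute but generally quite loose.

```python
import torch
import torch.nn as nn

# Toy MLP with 1-Lipschitz ReLU nonlinearities between linear layers.
torch.manual_seed(0)
net = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# Standard (loose) upper bound on the network's Lipschitz constant:
# the product of the per-layer constants. For a linear layer that's the
# spectral norm (largest singular value) of its weight; ReLU contributes 1.
upper_bound = 1.0
for module in net:
    if isinstance(module, nn.Linear):
        upper_bound *= torch.linalg.matrix_norm(module.weight, ord=2).item()

print(f"product-of-spectral-norms bound: {upper_bound:.3f}")
# If every linear layer had spectral norm L, this bound would be L**4:
# exponential in depth, which is why constraining L <= 1 matters.
```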
Some advantages of Lipschitz-constrained networks:
- Provable adversarial robustness: a perturbation of norm $\epsilon$ can move the output by at most $L\epsilon$, so a sufficiently large margin between the top two logits certifies that the prediction can't be flipped (see the sketch below).
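A minimal sketch of how that certificate works, assuming we already know a global Lipschitz constant `L` for the logit map in $\ell_2$ (the helper name and the numbers are hypothetical). Since each logit moves by at most `L * eps` under a perturbation of norm `eps`, the gap between the top two logits shrinks by at most `2 * L * eps`, giving a conservative certified radius of `margin / (2 * L)`.

```python
import numpy as np

def certified_radius(logits, L):
    """Conservative l2 radius within which the argmax prediction cannot change,
    given that every logit is (at most) L-Lipschitz in the input.
    A perturbation of norm eps changes each logit by at most L*eps,
    so the top-1 vs top-2 margin shrinks by at most 2*L*eps."""
    top2 = np.sort(logits)[-2:]          # two largest logits
    margin = top2[1] - top2[0]           # gap between best and runner-up
    return margin / (2.0 * L)

# Hypothetical example: logits from a network with Lipschitz constant L = 5.
logits = np.array([2.3, -0.7, 4.1, 0.2])
print(certified_radius(logits, L=5.0))   # ~0.18: no l2 perturbation smaller
                                         # than this can flip the prediction
```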