Lipschitz

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A function $f$ is $L$-Lipschitz continuous if changing the input by a little bit can't change the output by more than a proportional amount,

$$\|f(x) - f(y)\| \le L \|x - y\|$$

for all $x, y$ in its domain. In particular, taking $y = x + \epsilon$ and letting $\epsilon \to 0$ shows that the gradient norm $\|\nabla f(x)\|$ is bounded by $L$ wherever $f$ is differentiable, but the definition doesn't require that $f$ be differentiable at all.
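
As a quick numerical illustration (a minimal sketch, assuming NumPy; the function and test points are arbitrary), we can sample random pairs and check the defining inequality, here for $\tanh$, which is 1-Lipschitz because its derivative is bounded by 1:

```python
import numpy as np

def check_lipschitz(f, L, dim=8, n_pairs=10_000, seed=0):
    """Empirically test ||f(x) - f(y)|| <= L * ||x - y|| on random pairs.

    Sampling can refute a claimed constant L, but it can't prove the
    inequality holds everywhere.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_pairs, dim))
    y = rng.normal(size=(n_pairs, dim))
    lhs = np.linalg.norm(f(x) - f(y), axis=1)
    rhs = L * np.linalg.norm(x - y, axis=1)
    return bool(np.all(lhs <= rhs + 1e-12))

print(check_lipschitz(np.tanh, L=1.0))  # True: tanh is 1-Lipschitz
print(check_lipschitz(np.tanh, L=0.5))  # almost certainly False: near 0, tanh has slope ~1
```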

Sometimes people talk about "Lipschitz smoothness" or an "$L$-smooth" function to mean a function whose gradient is Lipschitz continuous with constant $L$. Interestingly, a (closed, convex) function is $L$-smooth if and only if its convex dual is $\tfrac{1}{L}$-strongly convex (see, e.g., https://xingyuzhou.org/blog/notes/Lipschitz-gradient).
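
Spelled out (a standard statement of that duality, not taken verbatim from the linked note): for a closed, proper, convex $f$ with conjugate $f^*(u) = \sup_x \langle u, x \rangle - f(x)$,

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \;\; \forall x, y
\quad\Longleftrightarrow\quad
f^*(v) \ \ge\ f^*(u) + \langle g, v - u \rangle + \tfrac{1}{2L}\|v - u\|^2 \;\; \forall u, v,\ g \in \partial f^*(u),$$

i.e. $f$ has an $L$-Lipschitz gradient exactly when $f^*$ is $\tfrac{1}{L}$-strongly convex.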

In deep learning

Say we have a deep network in which all the layers are $L$-Lipschitz. What can we say about the expressiveness and dynamics of such networks? A few observations:

  • This network can implement contractions, equivalent to "forgetting" or "erasing" information (in the extreme case, note that constant functions like $f(x) = 0$ are Lipschitz). This is basically what's required for error correction, arbitrary logic gates, universal computation, etc. In particular, any function on the vertices of the unit hypercube (i.e., on binary strings) is Lipschitz with a constant that grows only with the size of the hypercube, so there's no fundamental limit on computational expressivity. However, there are some limits to expressivity for specific input representations, and as a function of input size:
    • hard attention / lookups at a fixed margin require a Lipschitz constant growing as $O(\log n)$ to discriminate between $n$ context entries (see "Softmax is not enough", https://openreview.net/forum?id=S4JmmpnSPy)
    • sorting analog inputs (in the sense of outputting their ranks / the sorting permutation) is non-Lipschitz, since the output flips on arbitrarily small input differences (a digital representation would be Lipschitz)
  • The network can only expand or "amplify" small differences in its inputs, up to the product of the per-layer Lipschitz constants ($L^d$ after $d$ layers that are each $L$-Lipschitz). This bounds how badly gradients can explode (with $L \le 1$ they can't grow at all), though the network is still subject to vanishing gradients arising from contractions/forgetting. A sketch of this product bound follows the list.
  • Naively we might worry that the constraint is vacuous: for any given $L$ we could rescale the inputs $x$ to achieve that $L$, trading geometric margin for the Lipschitz constraint without actually changing any of the dynamics. However, two factors ensure that the Lipschitz constraint in deep networks actually "bites":
    • Input and output representations (e.g. pixel values or normalized token embeddings as inputs, and logits as outputs) have a fixed scale that the network can't control.
    • Internal normalization layers (BatchNorm, LayerNorm, etc.) control the scale of activations.
  • Internal normalization layers and residual connections also
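
To make the composition bound concrete, here's a minimal NumPy sketch (the layer shapes are made up for illustration) that upper-bounds a ReLU MLP's Lipschitz constant by the product of its weight matrices' spectral norms; ReLU is 1-Lipschitz, and Lipschitz constants multiply under composition:

```python
import numpy as np

def mlp_lipschitz_upper_bound(weights):
    """Upper-bound the Lipschitz constant of x -> W_n relu(... relu(W_1 x)).

    ReLU is 1-Lipschitz and Lipschitz constants multiply under composition,
    so the product of spectral norms ||W_i||_2 is a valid (often loose) bound.
    """
    return float(np.prod([np.linalg.norm(W, ord=2) for W in weights]))

def relu_mlp(weights, x):
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)
    return weights[-1] @ x

rng = np.random.default_rng(0)
weights = [rng.normal(size=(64, 32)), rng.normal(size=(64, 64)), rng.normal(size=(10, 64))]

L_bound = mlp_lipschitz_upper_bound(weights)

# Empirical check: the ratio ||f(x) - f(y)|| / ||x - y|| never exceeds the bound.
x, y = rng.normal(size=32), rng.normal(size=32)
ratio = np.linalg.norm(relu_mlp(weights, x) - relu_mlp(weights, y)) / np.linalg.norm(x - y)
print(f"bound = {L_bound:.1f}, observed ratio = {ratio:.1f}")
```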

Some advantages of Lipschitz-constrained networks:

  • Provable adversarial robustness: an input perturbation $\delta$ can't change the output by more than $L\|\delta\|$, which yields certified prediction radii (see the sketch below).
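
A minimal sketch of that certificate (assuming the logit map is $L$-Lipschitz in the $\ell_2$ norm; the logits and $L$ below are made up): each logit moves by at most $L\|\delta\|$, so the top-1 vs. top-2 margin shrinks by at most $2L\|\delta\|$, and the prediction provably can't change while $\|\delta\| < \text{margin}/(2L)$.

```python
import numpy as np

def certified_radius(logits, L):
    """Radius within which the predicted class provably cannot change,
    for a classifier whose logit map is L-Lipschitz in the l2 norm.

    Each logit moves by at most L * ||delta||, so the top-1 vs top-2 gap
    shrinks by at most 2 * L * ||delta||; the prediction is stable while
    ||delta|| < margin / (2 * L).
    """
    top2 = np.sort(logits)[-2:]          # two largest logits
    margin = top2[1] - top2[0]
    return margin / (2.0 * L)

# Hypothetical example: logits from some 1.5-Lipschitz network.
print(certified_radius(np.array([2.3, -0.7, 4.1, 0.2]), L=1.5))  # ~0.6
```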