Lipschitz
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A function $f$ is $L$-Lipschitz continuous if changing the input by a little bit can't change the output by more than a proportional amount,

$$\|f(x) - f(y)\| \le L\,\|x - y\|$$

for all $x, y$ in its domain. In particular, taking $y \to x$ shows that the gradient norm is bounded by $L$ wherever $f$ is differentiable, but the definition doesn't require that $f$ be differentiable at all.
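For intuition, here's a small numpy sketch (the test function and sampling scheme are illustrative choices of mine, not anything from these notes) that empirically lower-bounds a function's Lipschitz constant by taking the largest slope over random pairs of points. A finite sample can only ever certify a lower bound, never the exact constant.

```python
import numpy as np

def lipschitz_lower_bound(f, dim, n_pairs=100_000, scale=1.0, seed=0):
    """Estimate a *lower bound* on the Lipschitz constant of f by taking
    the maximum slope ||f(x) - f(y)|| / ||x - y|| over random pairs."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_pairs, dim)) * scale
    y = rng.normal(size=(n_pairs, dim)) * scale
    num = np.linalg.norm(f(x) - f(y), axis=-1)   # ||f(x) - f(y)||
    den = np.linalg.norm(x - y, axis=-1)         # ||x - y||
    return np.max(num / den)

# Example: elementwise sin(3x) is 3-Lipschitz; the estimate approaches 3 from below.
print(lipschitz_lower_bound(lambda x: np.sin(3 * x), dim=2))
```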
Sometimes people talk about "Lipschitz smoothness" or a "$\beta$-smooth" function to mean a function whose gradient is Lipschitz continuous with constant $\beta$. Interestingly, a (closed, convex) function is $\beta$-smooth if and only if its convex dual (conjugate) is $\tfrac{1}{\beta}$-strongly convex (see, e.g., https://xingyuzhou.org/blog/notes/Lipschitz-gradient).
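As a sanity check on that duality, here's the textbook quadratic example (my own illustration, not taken from the linked note): the smoothness constant of $f$ and the strong-convexity constant of its conjugate are exact reciprocals.

```latex
% Textbook quadratic example of the smoothness / strong-convexity duality.
\begin{align*}
  f(x)   &= \tfrac{\beta}{2}\,\|x\|^2
           && \text{($\beta$-smooth: $\nabla f(x) = \beta x$ is $\beta$-Lipschitz)} \\
  f^*(y) &= \sup_x\,\big(\langle y, x\rangle - \tfrac{\beta}{2}\|x\|^2\big)
          = \tfrac{1}{2\beta}\,\|y\|^2
           && \text{($\tfrac{1}{\beta}$-strongly convex; supremum attained at $x = y/\beta$)}
\end{align*}
```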
In deep learning
Say we have a deep network in which all the layers are $L$-Lipschitz. What can we say about the expressiveness and dynamics of such networks? A few observations:
- This network can implement contractions, equivalent to "forgetting" or "erasing" information (in the extreme case, note that constant functions like $f(x) = c$ are $0$-Lipschitz). This is basically what's required for error correction, arbitrary logic gates, universal computation, etc. In particular, arbitrary operations on the unit hypercube (binary strings) are Lipschitz with a constant bounded by the hypercube's diameter, since distinct binary strings are at distance at least 1 apart, so there's no fundamental limit on computational expressivity. However, there are some limits to expressivity for specific input representations, and as a function of input size:
  - hard attention / lookups at a fixed margin require increasing Lipschitz constants to discriminate between context entries (see "Softmax is not enough", https://openreview.net/forum?id=S4JmmpnSPy)
  - sorting analog inputs (i.e., outputting their rank order) is non-Lipschitz, since the discrete output branches on arbitrarily small differences in the inputs (a digital input representation would make it Lipschitz)
- The network can only expand or "amplify" small differences in its inputs by a bounded factor: composing $d$ layers that are each $L$-Lipschitz gives a map that is at most $L^d$-Lipschitz (see the sketch after this list). In particular, if $L \le 1$ the network can't blow up small differences at all, which rules out exploding gradients (but it is still subject to vanishing gradients arising from contractions/forgetting).
- Naively we might worry that for any given $L$ we can rescale the inputs to achieve this $L$, trading geometric margin for the Lipschitz constraint without actually changing any of the dynamics. However, two factors ensure that the Lipschitz constraint in deep networks actually "bites":
  - Input and output representations (e.g. pixel values or normalized token embeddings as inputs, and logit outputs) have a fixed scale that the network can't control.
  - Internal normalization layers (BatchNorm, LayerNorm, etc.) control the scale of activations.
- Internal normalization layers and residual connections also
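To make the composition bound above concrete, here's a small PyTorch sketch (the toy architecture is my own, not anything specific from these notes) that computes the standard product-of-spectral-norms upper bound on a feedforward network's Lipschitz constant. The bound is easy to compute but generally quite loose.

```python
import torch
import torch.nn as nn

# Toy MLP with 1-Lipschitz ReLU nonlinearities between linear layers.
torch.manual_seed(0)
net = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# Standard (loose) upper bound on the network's Lipschitz constant:
# the product of the per-layer constants. For a linear layer that's the
# spectral norm (largest singular value) of its weight; ReLU contributes 1.
upper_bound = 1.0
for module in net:
    if isinstance(module, nn.Linear):
        upper_bound *= torch.linalg.matrix_norm(module.weight, ord=2).item()

print(f"product-of-spectral-norms bound: {upper_bound:.3f}")
# If every linear layer had spectral norm L, this bound would be L**4:
# exponential in depth, which is why constraining L <= 1 matters.
```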
Some advantages of Lipschitz-constrained networks:
- Provable adversarial robustness: a perturbation of norm $\epsilon$ can move the output by at most $L\epsilon$, so a sufficiently large margin between the top two logits certifies that the prediction can't be flipped (see the sketch below).
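A minimal sketch of how that certificate works, assuming we already know a global Lipschitz constant `L` for the logit map in $\ell_2$ (the helper name and the numbers are hypothetical). Since each logit moves by at most `L * eps` under a perturbation of norm `eps`, the gap between the top two logits shrinks by at most `2 * L * eps`, giving a conservative certified radius of `margin / (2 * L)`.

```python
import numpy as np

def certified_radius(logits, L):
    """Conservative l2 radius within which the argmax prediction cannot change,
    given that every logit is (at most) L-Lipschitz in the input.
    A perturbation of norm eps changes each logit by at most L*eps,
    so the top-1 vs top-2 margin shrinks by at most 2*L*eps."""
    top2 = np.sort(logits)[-2:]          # two largest logits
    margin = top2[1] - top2[0]           # gap between best and runner-up
    return margin / (2.0 * L)

# Hypothetical example: logits from a network with Lipschitz constant L = 5.
logits = np.array([2.3, -0.7, 4.1, 0.2])
print(certified_radius(logits, L=5.0))   # ~0.18: no l2 perturbation smaller
                                         # than this can flip the prediction
```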