Transformers are universal in-context function approximators
a simple and streamlined proof
This is a short technical note adapted from a presentation I gave, apropos of nothing, at the metauni SLT seminar on Thursday, December 5th, 2024.
In the seminar, I presented a simplified and streamlined version of the results from section 2 of the paper “Vocabulary in-context learning in transformers: Benefits of positional encoding,” currently under public, double-blind review at ICLR 2025 (OpenReview link).
Definitions: Transformers
A transformer is a function $T : (\mathbb{R}^{d_x} \times \mathbb{R}^{d_y})^* \times \mathbb{R}^{d_x} \to \mathbb{R}^{d_y}$, where ${}^*$ is the Kleene star. The first input to this function is a context—a sequence of pairs of vectors $(x_1, y_1), \ldots, (x_N, y_N)$ of a variable, unbounded length $N$, where each $x_i \in \mathbb{R}^{d_x}$ and each $y_i \in \mathbb{R}^{d_y}$. It then also takes a query $x \in \mathbb{R}^{d_x}$ and returns a completion $y \in \mathbb{R}^{d_y}$.
Given a particular context $C = ((x_1, y_1), \ldots, (x_N, y_N))$, the transformer can be thought of as a function from queries to completions: $T(C, \cdot) : \mathbb{R}^{d_x} \to \mathbb{R}^{d_y}$. It is this function (modulo some auxiliary dimensions here and there) that we want to show is a universal function approximator, in the sense that for any continuous function on a compact domain we can find a context such that the transformer as a function from queries to completions is approximately equal to that function.
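In code, the signature looks something like the following minimal NumPy-flavoured sketch (the names `Context` and `as_function` are mine, not the paper’s):

```python
import numpy as np

# A context is a variable-length sequence of (x_i, y_i) pairs; a "transformer"
# maps a context plus a query vector to a completion vector.
Context = list[tuple[np.ndarray, np.ndarray]]


def as_function(transformer, context: Context):
    """Fix the context, leaving a plain map from queries to completions."""
    return lambda query: transformer(context, query)
```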
It may be unclear why we are considering a transformer that operates on sequences of pairs of vectors, rather than a sequence of tokens.
- First, imagine we are studying the part of the transformer between the embedding and the unembedding. That takes us from tokens to vectors.
- Second, imagine we arbitrarily factor the residual stream of shape $\mathbb{R}^{d_x + d_y}$ as $\mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$. That gives us pairs. Well… that’s what we are doing. I don’t actually know exactly why we want pairs, except to say it makes it simpler to talk about approximating a function, and that I think this is a common approach in theoretical work on transformer expressivity. I don’t completely buy it, but let’s roll with it for today.
With that said, let’s fill out the transformer type signature with the concrete mappings we will show are universal in-context function approximators. Consider the following two “transformers”:
The softmax attention module, $T_{\mathrm{sm}}$, with the mapping
$$T_{\mathrm{sm}}\big((x_i, y_i)_{i=1}^{N}, x\big) = \sum_{i=1}^{N} \frac{\exp(\langle x, x_i \rangle)}{\sum_{j=1}^{N} \exp(\langle x, x_j \rangle)}\, y_i.$$
This is a simplified version of a single attention head within a real transformer. See that the query $x$ is compared (by inner product) to each of the $x_i$ in the context, and these comparisons are used to create a softmax distribution. Then, the completion is an average of the values $y_i$ from the context, weighted by this softmax distribution.
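Concretely, here is a minimal NumPy sketch of this mapping (my own naming, not the paper’s):

```python
import numpy as np


def softmax_attention(context, query):
    """T_sm: average the y_i, weighted by a softmax over the scores <query, x_i>."""
    xs = np.stack([x for x, _ in context])   # (N, d_x)
    ys = np.stack([y for _, y in context])   # (N, d_y)
    scores = xs @ query                      # (N,) comparisons with the query
    weights = np.exp(scores - scores.max())  # softmax (shifted for stability)
    weights /= weights.sum()
    return weights @ ys                      # convex combination of the values
```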
The elementwise attention module, $T_\sigma$, with the mapping
$$T_\sigma\big((x_i, y_i)_{i=1}^{N}, x\big) = \sum_{i=1}^{N} \sigma(\langle x, x_i \rangle)\, y_i,$$
where $\sigma : \mathbb{R} \to \mathbb{R}$ is an arbitrary activation function.
It’s not so clear to me that this second option should be called a transformer. The completion is not generated as an average weighted by a normalised probability distribution, but as a sum weighted by the outputs of an arbitrary activation function. But, it is a useful conceptual stepping stone for the proof in the softmax case—if you look closely you might already be able to anticipate the way the context plays a role similar to the weights of a feed-forward neural network.
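For comparison, the elementwise module in the same style, with `np.tanh` standing in for the arbitrary activation $\sigma$ (again a sketch with my own naming):

```python
import numpy as np


def elementwise_attention(context, query, sigma=np.tanh):
    """T_sigma: sum the y_i, weighted by sigma(<query, x_i>), with no normalisation."""
    xs = np.stack([x for x, _ in context])   # (N, d_x)
    ys = np.stack([y for _, y in context])   # (N, d_y)
    return sigma(xs @ query) @ ys            # weighted sum, not an average
```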
Even the softmax attention module is substantially simplified compared to a full modern transformer architecture, or even a single attention head within such an architecture. The key missing piece is a triple of parametric linear transformations (the query, key, and value transformations). The authors consider some such parametrised transformers but I have removed these parameters for this note because they lead to more cumbersome notation while not being necessary to see the key ideas in the proof.
Universal (in-context) function approximation
Recall the following result from Leshno, Lin, Pinkus, and Schocken (1993):
Theorem (Universal approximation, Leshno et al., 1993): Let $\sigma : \mathbb{R} \to \mathbb{R}$ be a non-polynomial, locally bounded, piecewise continuous activation function. For any continuous function $f : K \to \mathbb{R}^m$ defined on a compact domain $K \subseteq \mathbb{R}^n$, and any $\varepsilon > 0$, there exist $N \in \mathbb{N}$, $a_1, \ldots, a_N \in \mathbb{R}^n$, $b_1, \ldots, b_N \in \mathbb{R}$, and $c_1, \ldots, c_N \in \mathbb{R}^m$, such that for all $z \in K$,
$$\left\| f(z) - \sum_{i=1}^{N} c_i\, \sigma(a_i \cdot z + b_i) \right\| < \varepsilon.$$
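The theorem is an existence result, so there is nothing to compute, but a quick numerical illustration may help fix the form $\sum_{i} c_i\, \sigma(a_i \cdot z + b_i)$ in mind. The following sketch (my own, one-dimensional, and not the theorem’s construction) draws random hidden weights and fits the output weights by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on the compact domain K = [-1, 1].
f = lambda z: np.sin(3 * z) + 0.5 * z**2

# N hidden units z -> sigma(a_i * z + b_i) with random weights and biases.
N, sigma = 200, np.tanh
a = 3.0 * rng.normal(size=N)
b = 3.0 * rng.normal(size=N)

zs = np.linspace(-1.0, 1.0, 512)
H = sigma(np.outer(zs, a) + b)                 # (512, N) hidden activations

# Choose the output weights c_i by least squares on a fine grid over K.
c, *_ = np.linalg.lstsq(H, f(zs), rcond=None)

print("max |f - network| on the grid:", np.abs(H @ c - f(zs)).max())
```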
The “Vocabulary in-context learning” paper proves as a corollary of this result a universal approximation property for the above two transformers (and more complex, parameterised transformers) as a function of the context:
Corollary: For any continuous function $f : K \to \mathbb{R}^m$ defined on a compact domain $K \subseteq \mathbb{R}^n$, and any $\varepsilon > 0$, there exist $N \in \mathbb{N}$, $x_1, \ldots, x_N \in \mathbb{R}^{n+1}$, and $y_1, \ldots, y_N \in \mathbb{R}^m$, such that for all $z \in K$,
$$\left\| f(z) - T_{\mathrm{sm}}\big((x_i, y_i)_{i=1}^{N}, (z, 1)\big) \right\| < \varepsilon.$$
Moreover, the same is true for $T_\sigma$ if $\sigma$ is non-polynomial, locally bounded, and piecewise continuous.
Note that compared to the function to be approximated $f : K \to \mathbb{R}^m$ with $K \subseteq \mathbb{R}^n$, the context vectors $x_i \in \mathbb{R}^{n+1}$ have an additional dimension, and the query is $(z, 1) \in \mathbb{R}^{n+1}$ where $z \in K$ is the function input. These are the “auxiliary dimensions here and there” to which I referred earlier.
I present the proof first for the case of elementwise attention ($T_\sigma$) and then the slightly more complex proof for softmax attention ($T_{\mathrm{sm}}$).
Elementwise case: Equivalence of NNs and contexts
Proof ($T_\sigma$ case): By the theorem there exist $N$, $a_i \in \mathbb{R}^n$, $b_i \in \mathbb{R}$, and $c_i \in \mathbb{R}^m$ such that for all $z \in K$,
$$\left\| f(z) - \sum_{i=1}^{N} c_i\, \sigma(a_i \cdot z + b_i) \right\| < \varepsilon.$$
For $i = 1, \ldots, N$, put $x_i = (a_i, b_i)$ and $y_i = c_i$. Then for all $z \in K$,
$$T_\sigma\big((x_i, y_i)_{i=1}^{N}, (z, 1)\big) = \sum_{i=1}^{N} \sigma\big(\langle (z, 1), (a_i, b_i) \rangle\big)\, c_i = \sum_{i=1}^{N} c_i\, \sigma(a_i \cdot z + b_i),$$
so $\left\| f(z) - T_\sigma\big((x_i, y_i)_{i=1}^{N}, (z, 1)\big) \right\| < \varepsilon$. QED.
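Since the correspondence is an exact equality, it can be checked numerically to machine precision. A quick sketch (arbitrary network weights, my own naming):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = np.tanh

# An arbitrary single-hidden-layer network: n inputs, N hidden units, m outputs.
n, m, N = 2, 3, 50
a = rng.normal(size=(N, n))    # hidden weights a_i
b = rng.normal(size=N)         # hidden biases b_i
c = rng.normal(size=(N, m))    # output weights c_i
network = lambda z: sigma(a @ z + b) @ c

# The context from the proof: x_i = (a_i, b_i), y_i = c_i; the query is (z, 1).
context = [(np.append(a[i], b[i]), c[i]) for i in range(N)]


def elementwise_attention(context, query, sigma=np.tanh):
    xs = np.stack([x for x, _ in context])
    ys = np.stack([y for _, y in context])
    return sigma(xs @ query) @ ys


z = rng.normal(size=n)
print(np.allclose(elementwise_attention(context, np.append(z, 1.0)), network(z)))  # True
```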
Remark: The above proof reveals an equivalence between:
- biased single-hidden-layer neural networks with $n$ input units, $N$ biased hidden units, and $m$ output units, and
- transformer contexts with length $N$ and dimensions $d_x = n + 1$ and $d_y = m$.
The paper shows this correspondence also exists for any parameterised elementwise attention module, as long as the query, key, and value transformations have a certain structure that is pretty unrestrictive but not very easy to describe without introducing a lot of cumbersome notation (and that is, unfortunately, also a bit buried in the paper).
Softmax case: Approximation of NNs by contexts
Proof ($T_{\mathrm{sm}}$ case): Unfortunately, normalisation prevents a similar exact correspondence for softmax attention. But for the corollary it suffices to approximate neural networks with contexts.
In particular, the function $\exp : \mathbb{R} \to \mathbb{R}$ is non-polynomial, locally bounded, and piecewise continuous. Therefore by the theorem there exist $N$, $a_i \in \mathbb{R}^n$, $b_i \in \mathbb{R}$, and $c_i \in \mathbb{R}^m$ such that for all $z \in K$,
$$\left\| f(z) - g(z) \right\| < \frac{\varepsilon}{2}, \qquad \text{where } g(z) = \sum_{i=1}^{N} c_i \exp(a_i \cdot z + b_i).$$
I will show there exist $x_1, \ldots, x_{N+1} \in \mathbb{R}^{n+1}$ and $y_1, \ldots, y_{N+1} \in \mathbb{R}^m$ such that for all $z \in K$,
$$\left\| g(z) - T_{\mathrm{sm}}\big((x_i, y_i)_{i=1}^{N+1}, (z, 1)\big) \right\| < \frac{\varepsilon}{2},$$
from which the corollary follows by the triangle inequality.
For $i = 1, \ldots, N$, put $x_i = (a_i, b_i)$ and $y_i = e^R c_i$, and put one additional pair $x_{N+1} = (0, \ldots, 0, R)$ and $y_{N+1} = 0$, where $R > 0$ is a constant to be defined later.
Then, for all $z \in K$, writing $S(z) = \sum_{j=1}^{N} \exp(a_j \cdot z + b_j)$,
$$T_{\mathrm{sm}}\big((x_i, y_i)_{i=1}^{N+1}, (z, 1)\big) = \frac{\sum_{i=1}^{N} \exp(a_i \cdot z + b_i)\, e^R c_i + e^R \cdot 0}{S(z) + e^R} = \frac{e^R\, g(z)}{e^R + S(z)}.$$
Importantly, we have recovered the exponential neural network $g(z)$ in the numerator (up to the constant factor $e^R$). We can now factor it out:
$$T_{\mathrm{sm}}\big((x_i, y_i)_{i=1}^{N+1}, (z, 1)\big) = g(z) \cdot \frac{1}{1 + e^{-R} S(z)}.$$
Then, noting that for $z \in K$ we have $1 + e^{-R} S(z) \geq 1$,
$$\left\| g(z) - T_{\mathrm{sm}}\big((x_i, y_i)_{i=1}^{N+1}, (z, 1)\big) \right\| = \left\| g(z) \right\| \frac{e^{-R} S(z)}{1 + e^{-R} S(z)} \leq \left\| g(z) \right\| e^{-R} S(z) = \sum_{j=1}^{N} \left\| g(z) \right\| \exp(a_j \cdot z + b_j - R).$$
To finish, take $R$ large enough so that for all $z \in K$ and $j \in \{1, \ldots, N\}$,
$$\left\| g(z) \right\| \exp(a_j \cdot z + b_j - R) < \frac{\varepsilon}{2N}.$$
Such an $R$ is well-defined (finite) because $K$ is compact.
QED.
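The construction can also be checked numerically. The sketch below builds the context as in the proof above from an arbitrary exponential network (not one obtained via Leshno et al.), including the extra pair with bias $R$, and confirms that the gap shrinks as $R$ grows (naming and the particular sanity check are mine):

```python
import numpy as np

rng = np.random.default_rng(2)

# An arbitrary "exponential network" g(z) = sum_i c_i exp(a_i . z + b_i).
n, m, N = 2, 3, 20
a = 0.3 * rng.normal(size=(N, n))
b = 0.3 * rng.normal(size=N)
c = rng.normal(size=(N, m))
g = lambda z: np.exp(a @ z + b) @ c


def softmax_attention(context, query):
    xs = np.stack([x for x, _ in context])
    ys = np.stack([y for _, y in context])
    scores = xs @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ ys


def context_for(R):
    # x_i = (a_i, b_i) with y_i = exp(R) c_i, plus the extra pair ((0, R), 0)
    # whose exp(R) score dominates the softmax denominator.
    pairs = [(np.append(a[i], b[i]), np.exp(R) * c[i]) for i in range(N)]
    pairs.append((np.append(np.zeros(n), R), np.zeros(m)))
    return pairs


z = rng.normal(size=n)
for R in (5.0, 10.0, 20.0, 30.0):
    err = np.abs(softmax_attention(context_for(R), np.append(z, 1.0)) - g(z)).max()
    print(R, err)  # the gap decays roughly like exp(-R)
```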
Remark: I think the above proof for the softmax attention case is cleaner than the one in the paper (though I followed their approach).
- This proof is substantially streamlined—in the paper, the authors use the triangle inequality twice, going from $f$ to $T_{\mathrm{sm}}$ first via an exponential neural network (as above) but then also via a normalised ‘softmax neural network’ (the argument is split over two lemmas in the paper).
- The use of the triangle inequality is also left implicit in the paper’s argument(s), whereas my personal style encouraged me to frontload this strategic information so that a reader knows what bound we are trying to achieve and why.
- In each of the two steps, the authors introduce an auxiliary element into the approximating architecture: one extra unit in the softmax network to approximate the exponential network, and one extra context pair to approximate the softmax network with a context. The result is that the constructed context has length $N + 2$, where $N$ is the width of the exponential neural network derived via Leshno et al. (1993). In contrast, the above proof constructs a length-$(N+1)$ context.
- The above proof corrects a minor flaw in the paper’s proof, where the authors effectively define their constructed context as a function of the input $z$, instead of constructing one prompt that achieves the required epsilon bound for all $z \in K$ (as is to be demonstrated).
The proof in the paper is also more elaborate due to accounting for query, key, and value transforms. However, this does not prevent my streamlined approach—the above proof can easily be extended to account for those transforms.
Finite vocabularies and positional encodings
This result is actually not the main focus of the paper. That would be the following:
In section 3, the authors show that, for both elementwise attention modules and softmax attention modules, this universal approximation property does not hold if the context is restricted to be drawn from a finite alphabet (a “vocabulary”). By “finite alphabet,” I mean contexts are confined to $(V_x \times V_y)^*$ for finite $V_x \subset \mathbb{R}^{d_x}$ and $V_y \subset \mathbb{R}^{d_y}$, as opposed to $(\mathbb{R}^{d_x} \times \mathbb{R}^{d_y})^*$.
I checked this proof carefully and I think it holds, though there are some minor errors remaining in the most recent revision.
In section 4, the authors show that, in the case of elementwise attention modules, the universal approximation property can be restored in the finite-alphabet setting if one considers contexts with a dense additive positional encoding. By “dense additive positional encoding,” I mean they consider a sequence of vectors $p_1, p_2, p_3, \ldots \in \mathbb{R}^{d_x}$ such that $\{ p_t : t \in \mathbb{N} \}$ is dense in $\mathbb{R}^{d_x}$ (or some bounded subsets of $\mathbb{R}^{d_x}$ in the case of certain activation functions). Then, in the context, each $x_i$ is replaced by $x_i + p_i$. They also assume .
I haven’t studied these proofs in detail, but from these assumptions, they could scan the positional encoding sequence until they find arbitrarily close approximations of the $x_i$s in their previously-constructed real contexts, and likewise assemble arbitrarily close approximations of the $y_i$s by merging many such examples with different $y$ components. Then, since their architecture is continuous in the context, this should suffice to achieve universality again.
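As a toy, one-dimensional illustration of the density-scanning idea (entirely my own, not from the paper): the fractional parts of $t \sqrt{2}$ are dense in $[0, 1]$, so scanning positions eventually finds an additive offset within any tolerance of a target value.

```python
import numpy as np

# Dense 1-D "positional encoding": p_t = fractional part of t * sqrt(2).
alpha = np.sqrt(2.0)
target, tol = 0.123456, 1e-4   # offset we want to realise, and how closely

t = 1
while abs(t * alpha % 1.0 - target) >= tol:
    t += 1

print(t, t * alpha % 1.0)  # a position whose encoding lands within tol of the target
```

Even in one dimension, the position required grows as the tolerance shrinks, which is consistent with my guess below that the construction involves very long contexts.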
To be honest, I don’t think these results are very compelling, even aside from the fact that universal approximation is not much more than theoretically interesting.
The finite-alphabet setting is motivated by the fact that in a real language model, the input sequence is indeed assembled from a finite vocabulary of tokens. However, in a multi-layer language model, the outputs of the first layer aggregate information from multiple tokens, and through successive transformer blocks I conjecture that we can reach an exponential number of embedding vectors, bringing us closer to the dense setting.
As for the benefits of positional encoding, the assumption of density of the positional encoding is so strong as to substantially cheapen the result. I guess the proof involves some very long contexts to find the right rational approximations. Moreover, the authors were only able to prove this result for their elementwise “transformer” architecture (I would guess that the long contexts create problems via normalisation). Yet the authors force a positive framing, and even go so far as to suggest that empirical research should explore the benefits of positional encodings that enumerate the rational numbers.
Will this paper be accepted? Two of the reviewers liked the paper (low confidence accept, fair confidence weak accept). The other two reviewers recommended weak reject: one (low confidence) is not on board with the idea of optimising over contexts to begin with; the other (high confidence, very in-depth review) is on board but not impressed with the clarity of the presentation or the strength of the assumptions. I predict this paper will not be accepted.
Bibliography
Once again, this note is based on the following paper:
- Anonymous author(s), “Vocabulary in-context learning in transformers: Benefits of positional encoding,” under review at ICLR 2025. OpenReview.
The proof techniques build on foundational, classical work in universal function approximation with feed-forward neural networks by Leshno et al.:
- Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken, “Multilayer feedforward networks with a nonpolynomial activation function can approximate any function,” Neural Networks, 6(6):861–867, 1993. DOI:10.1016/S0893-6080(05)80131-5.
The observation that in-context learning is a universal approximator in the above sense has apparently been made before, using different techniques, according to the anonymous author(s):
- Aleksandar Petrov, Philip Torr, and Adel Bibi, “Prompting a pretrained transformer can be a universal approximator,” at ICML 2024. PMLR.