Transformers are universal in-context function approximators
a simple and streamlined proof
This is a short technical note adapted from a presentation I gave, apropos of nothing, at the metauni SLT seminar on Thursday, December 5th, 2024.
In the seminar, I presented a simplified and streamlined version of the results from section 2 of the paper “Vocabulary in-context learning in transformers: Benefits of positional encoding,” currently under public, double-blind review at ICLR 2025 (OpenReview link).
Definitions: Transformers
A transformer is a function $T : (\mathbb{R}^{d_x} \times \mathbb{R}^{d_y})^* \times \mathbb{R}^{d_x} \to \mathbb{R}^{d_y}$, where ${}^*$ is the Kleene star. The first input to this function is a context—a sequence of pairs of vectors $(x_1, y_1), \ldots, (x_N, y_N)$ of a variable, unbounded length $N$, where each $x_i \in \mathbb{R}^{d_x}$ and each $y_i \in \mathbb{R}^{d_y}$. It then also takes a query $x \in \mathbb{R}^{d_x}$ and returns a completion $y \in \mathbb{R}^{d_y}$.
Given a particular context $C = ((x_1, y_1), \ldots, (x_N, y_N))$, the transformer can be thought of as a function from queries to completions: $T(C, \cdot) : \mathbb{R}^{d_x} \to \mathbb{R}^{d_y}$. It is this function (modulo some auxiliary dimensions here and there) that we want to show is a universal function approximator, in the sense that for any continuous function on a compact domain we can find a context such that the transformer as a function from queries to completions is approximately equal to that function.
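In code, the signature looks something like the following minimal NumPy-flavoured sketch (the names `Context` and `as_function` are mine, not the paper’s):

```python
import numpy as np

# A context is a variable-length sequence of (x_i, y_i) pairs; a "transformer"
# maps a context plus a query vector to a completion vector.
Context = list[tuple[np.ndarray, np.ndarray]]


def as_function(transformer, context: Context):
    """Fix the context, leaving a plain map from queries to completions."""
    return lambda query: transformer(context, query)
```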
It may be unclear why we are considering a transformer that operates on sequences of pairs of vectors, rather than a sequence of tokens.
- First, imagine we are studying the part of the transformer between the embedding and the unembedding. That takes us from tokens to vectors.
- Second, imagine we arbitrarily factor the residual stream of shape $\mathbb{R}^{d_x + d_y}$ as $\mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$. That gives us pairs. Well… that’s what we are doing. I don’t actually know exactly why we want pairs, except to say it makes it simpler to talk about approximating a function, and that I think this is a common approach in theoretical work on transformer expressivity. I don’t completely buy it, but let’s roll with it for today.
With that said, let’s fill out the transformer type signature with the concrete mappings we will show are universal in-context function approximators. Consider the following two “transformers”:
The softmax attention module, $T_{\mathrm{sm}}$, with the mapping
$$T_{\mathrm{sm}}\big((x_i, y_i)_{i=1}^{N}, x\big) = \sum_{i=1}^{N} \frac{\exp(\langle x, x_i \rangle)}{\sum_{j=1}^{N} \exp(\langle x, x_j \rangle)}\, y_i.$$
This is a simplified version of a single attention head within a real transformer. See that the query $x$ is compared (by inner product) to each of the $x_i$ in the context, and these comparisons are used to create a softmax distribution. Then, the completion is an average of the values $y_i$ from the context, weighted by this softmax distribution.
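Concretely, here is a minimal NumPy sketch of this mapping (my own naming, not the paper’s):

```python
import numpy as np


def softmax_attention(context, query):
    """T_sm: average the y_i, weighted by a softmax over the scores <query, x_i>."""
    xs = np.stack([x for x, _ in context])   # (N, d_x)
    ys = np.stack([y for _, y in context])   # (N, d_y)
    scores = xs @ query                      # (N,) comparisons with the query
    weights = np.exp(scores - scores.max())  # softmax (shifted for stability)
    weights /= weights.sum()
    return weights @ ys                      # convex combination of the values
```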
The elementwise attention module, $T_\sigma$, with the mapping
$$T_\sigma\big((x_i, y_i)_{i=1}^{N}, x\big) = \sum_{i=1}^{N} \sigma(\langle x, x_i \rangle)\, y_i,$$
where $\sigma : \mathbb{R} \to \mathbb{R}$ is an arbitrary activation function.
It’s not so clear to me that this second option should be called a transformer. The completion is not generated as an average weighted by a normalised probability distribution, but as a sum weighted by the outputs of an arbitrary activation function. But, it is a useful conceptual stepping stone for the proof in the softmax case—if you look closely you might already be able to anticipate the way the context plays a role similar to the weights of a feed-forward neural network.
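For comparison, the elementwise module in the same style, with `np.tanh` standing in for the arbitrary activation $\sigma$ (again a sketch with my own naming):

```python
import numpy as np


def elementwise_attention(context, query, sigma=np.tanh):
    """T_sigma: sum the y_i, weighted by sigma(<query, x_i>), with no normalisation."""
    xs = np.stack([x for x, _ in context])   # (N, d_x)
    ys = np.stack([y for _, y in context])   # (N, d_y)
    return sigma(xs @ query) @ ys            # weighted sum, not an average
```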
Even the softmax attention module is substantially simplified compared to a full modern transformer architecture, or even a single attention head within such an architecture. The key missing piece is a triple of parametric linear transformations (the query, key, and value transformations). The authors consider some such parametrised transformers but I have removed these parameters for this note because they lead to more cumbersome notation while not being necessary to see the key ideas in the proof.
Universal (in-context) function approximation
Recall the following result from Leshno, Lin, Pinkus, and Schocken (1993):
Theorem (Universal approximation, Leshno et al., 1993): Let $\sigma : \mathbb{R} \to \mathbb{R}$ be a non-polynomial, locally bounded, piecewise continuous activation function. For any continuous function $f : K \to \mathbb{R}^m$ defined on a compact domain $K \subseteq \mathbb{R}^n$, and any $\varepsilon > 0$, there exist $N \in \mathbb{N}$, $a_1, \ldots, a_N \in \mathbb{R}^n$, $b_1, \ldots, b_N \in \mathbb{R}$, and $c_1, \ldots, c_N \in \mathbb{R}^m$, such that for all $z \in K$,
$$\left\| f(z) - \sum_{i=1}^{N} c_i\, \sigma(a_i \cdot z + b_i) \right\| < \varepsilon.$$
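The theorem is an existence result, so there is nothing to compute, but a quick numerical illustration may help fix the form $\sum_{i} c_i\, \sigma(a_i \cdot z + b_i)$ in mind. The following sketch (my own, one-dimensional, and not the theorem’s construction) draws random hidden weights and fits the output weights by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on the compact domain K = [-1, 1].
f = lambda z: np.sin(3 * z) + 0.5 * z**2

# N hidden units z -> sigma(a_i * z + b_i) with random weights and biases.
N, sigma = 200, np.tanh
a = 3.0 * rng.normal(size=N)
b = 3.0 * rng.normal(size=N)

zs = np.linspace(-1.0, 1.0, 512)
H = sigma(np.outer(zs, a) + b)                 # (512, N) hidden activations

# Choose the output weights c_i by least squares on a fine grid over K.
c, *_ = np.linalg.lstsq(H, f(zs), rcond=None)

print("max |f - network| on the grid:", np.abs(H @ c - f(zs)).max())
```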
The “Vocabulary in-context learning” paper proves as a corollary of this result a universal approximation property for the above two transformers (and more complex, parameterised transformers) as a function of the context:
Corollary: For any continuous function $f : K \to \mathbb{R}^m$ defined on a compact domain $K \subseteq \mathbb{R}^n$, and any $\varepsilon > 0$, there exist $N \in \mathbb{N}$, $x_1, \ldots, x_N \in \mathbb{R}^{n+1}$, and $y_1, \ldots, y_N \in \mathbb{R}^m$, such that for all $z \in K$,
$$\left\| f(z) - T_{\mathrm{sm}}\big((x_i, y_i)_{i=1}^{N}, (z, 1)\big) \right\| < \varepsilon.$$
Moreover, the same is true for $T_\sigma$ if $\sigma$ is non-polynomial, locally bounded, and piecewise continuous.
Note that compared to the function to be approximated $f : K \to \mathbb{R}^m$ with $K \subseteq \mathbb{R}^n$, the context vectors $x_i \in \mathbb{R}^{n+1}$ have an additional dimension, and the query is $(z, 1) \in \mathbb{R}^{n+1}$ where $z \in K$ is the function input. These are the “auxiliary dimensions here and there” to which I referred earlier.
I present the proof first for the case of elementwise attention ($T_\sigma$) and then the slightly more complex proof for softmax attention ($T_{\mathrm{sm}}$).
Elementwise case: Equivalence of NNs and contexts
Proof ($T_\sigma$ case): By the theorem there exist $N$, $a_i \in \mathbb{R}^n$, $b_i \in \mathbb{R}$, and $c_i \in \mathbb{R}^m$ such that for all $z \in K$,
$$\left\| f(z) - \sum_{i=1}^{N} c_i\, \sigma(a_i \cdot z + b_i) \right\| < \varepsilon.$$
For $i = 1, \ldots, N$, put $x_i = (a_i, b_i)$ and $y_i = c_i$. Then for all $z \in K$,
$$T_\sigma\big((x_i, y_i)_{i=1}^{N}, (z, 1)\big) = \sum_{i=1}^{N} \sigma\big(\langle (z, 1), (a_i, b_i) \rangle\big)\, c_i = \sum_{i=1}^{N} c_i\, \sigma(a_i \cdot z + b_i),$$
so $\left\| f(z) - T_\sigma\big((x_i, y_i)_{i=1}^{N}, (z, 1)\big) \right\| < \varepsilon$. QED.
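Since the correspondence is an exact equality, it can be checked numerically to machine precision. A quick sketch (arbitrary network weights, my own naming):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = np.tanh

# An arbitrary single-hidden-layer network: n inputs, N hidden units, m outputs.
n, m, N = 2, 3, 50
a = rng.normal(size=(N, n))    # hidden weights a_i
b = rng.normal(size=N)         # hidden biases b_i
c = rng.normal(size=(N, m))    # output weights c_i
network = lambda z: sigma(a @ z + b) @ c

# The context from the proof: x_i = (a_i, b_i), y_i = c_i; the query is (z, 1).
context = [(np.append(a[i], b[i]), c[i]) for i in range(N)]


def elementwise_attention(context, query, sigma=np.tanh):
    xs = np.stack([x for x, _ in context])
    ys = np.stack([y for _, y in context])
    return sigma(xs @ query) @ ys


z = rng.normal(size=n)
print(np.allclose(elementwise_attention(context, np.append(z, 1.0)), network(z)))  # True
```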
Remark: The above proof reveals an equivalence between:
- biased single-hidden-layer neural networks with $n$ input units, $N$ biased hidden units, and $m$ output units, and
- transformer contexts with length $N$ and dimensions $d_x = n + 1$ and $d_y = m$.
The paper shows this correspondence also exists for any parameterised elementwise attention module, as long as the query, key, and value transformations have a certain structure that is pretty unrestrictive but not very easy to describe without introducing a lot of cumbersome notation (and that is, unfortunately, also a bit buried in the paper).
Softmax case: Approximation of NNs by contexts
Proof ($T_{\mathrm{sm}}$ case): Unfortunately, normalisation prevents a similar exact correspondence for softmax attention. But for the corollary it suffices to approximate neural networks with contexts.
In particular, the function $\exp : \mathbb{R} \to \mathbb{R}$ is non-polynomial, locally bounded, and piecewise continuous. Therefore by the theorem there exist $N$, $a_i \in \mathbb{R}^n$, $b_i \in \mathbb{R}$, and $c_i \in \mathbb{R}^m$ such that for all $z \in K$,
$$\left\| f(z) - g(z) \right\| < \frac{\varepsilon}{2}, \qquad \text{where } g(z) = \sum_{i=1}^{N} c_i \exp(a_i \cdot z + b_i).$$
I will show there exist $x_1, \ldots, x_{N+1} \in \mathbb{R}^{n+1}$ and $y_1, \ldots, y_{N+1} \in \mathbb{R}^m$ such that for all $z \in K$,
$$\left\| g(z) - T_{\mathrm{sm}}\big((x_i, y_i)_{i=1}^{N+1}, (z, 1)\big) \right\| < \frac{\varepsilon}{2},$$
from which the corollary follows by the triangle inequality.
For $i = 1, \ldots, N$, put $x_i = (a_i, b_i)$ and $y_i = e^R c_i$, and put one additional pair $x_{N+1} = (0, \ldots, 0, R)$ and $y_{N+1} = 0$, where $R > 0$ is a constant to be defined later.
Then, for all $z \in K$, writing $S(z) = \sum_{j=1}^{N} \exp(a_j \cdot z + b_j)$,
$$T_{\mathrm{sm}}\big((x_i, y_i)_{i=1}^{N+1}, (z, 1)\big) = \frac{\sum_{i=1}^{N} \exp(a_i \cdot z + b_i)\, e^R c_i + e^R \cdot 0}{S(z) + e^R} = \frac{e^R\, g(z)}{e^R + S(z)}.$$
Importantly, we have recovered the exponential neural network $g(z)$ in the numerator (up to the constant factor $e^R$). We can now factor it out:
$$T_{\mathrm{sm}}\big((x_i, y_i)_{i=1}^{N+1}, (z, 1)\big) = g(z) \cdot \frac{1}{1 + e^{-R} S(z)}.$$
Then, noting that for $z \in K$ we have $1 + e^{-R} S(z) \geq 1$,
$$\left\| g(z) - T_{\mathrm{sm}}\big((x_i, y_i)_{i=1}^{N+1}, (z, 1)\big) \right\| = \left\| g(z) \right\| \frac{e^{-R} S(z)}{1 + e^{-R} S(z)} \leq \left\| g(z) \right\| e^{-R} S(z) = \sum_{j=1}^{N} \left\| g(z) \right\| \exp(a_j \cdot z + b_j - R).$$
To finish, take $R$ large enough so that for all $z \in K$ and $j \in \{1, \ldots, N\}$,
$$\left\| g(z) \right\| \exp(a_j \cdot z + b_j - R) < \frac{\varepsilon}{2N}.$$
Such an $R$ is well-defined (finite) because $K$ is compact.
QED.
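The construction can also be checked numerically. The sketch below builds the context as in the proof above from an arbitrary exponential network (not one obtained via Leshno et al.), including the extra pair with bias $R$, and confirms that the gap shrinks as $R$ grows (naming and the particular sanity check are mine):

```python
import numpy as np

rng = np.random.default_rng(2)

# An arbitrary "exponential network" g(z) = sum_i c_i exp(a_i . z + b_i).
n, m, N = 2, 3, 20
a = 0.3 * rng.normal(size=(N, n))
b = 0.3 * rng.normal(size=N)
c = rng.normal(size=(N, m))
g = lambda z: np.exp(a @ z + b) @ c


def softmax_attention(context, query):
    xs = np.stack([x for x, _ in context])
    ys = np.stack([y for _, y in context])
    scores = xs @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ ys


def context_for(R):
    # x_i = (a_i, b_i) with y_i = exp(R) c_i, plus the extra pair ((0, R), 0)
    # whose exp(R) score dominates the softmax denominator.
    pairs = [(np.append(a[i], b[i]), np.exp(R) * c[i]) for i in range(N)]
    pairs.append((np.append(np.zeros(n), R), np.zeros(m)))
    return pairs


z = rng.normal(size=n)
for R in (5.0, 10.0, 20.0, 30.0):
    err = np.abs(softmax_attention(context_for(R), np.append(z, 1.0)) - g(z)).max()
    print(R, err)  # the gap decays roughly like exp(-R)
```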
Remark: I think the above proof for the softmax attention case is cleaner than the one in the paper (though I followed their approach).
- This proof is substantially streamlined—in the paper, the authors use the triangle inequality twice, going from $f$ to $T_{\mathrm{sm}}$ first via an exponential neural network (as above) but then also via a normalised ‘softmax neural network’ (the argument is split over two lemmas in the paper).
- The use of the triangle inequality is also left implicit in the paper’s argument(s), whereas my personal style encouraged me to frontload this strategic information so that a reader knows what bound we are trying to achieve and why.
- In each of the two steps, the authors introduce an auxiliary element into the approximating architecture: one extra unit in the softmax network to approximate the exponential network, and one extra context pair to approximate the softmax network with a context. The result is that the constructed context has length $N + 2$, where $N$ is the width of the exponential neural network derived via Leshno et al. (1993). In contrast, the above proof constructs a length-$(N+1)$ context.
- The above proof corrects a minor flaw in the paper’s proof, where the authors effectively define their constructed context as a function of the input $z$, instead of constructing one prompt that achieves the required epsilon bound for all $z \in K$ (as is to be demonstrated).
The proof in the paper is also more elaborate due to accounting for query, key, and value transforms. However, this does not prevent my streamlined approach—the above proof can easily be extended to account for those transforms.
Finite vocabularies and positional encodings
This result is actually not the main focus of the paper. That would be the following:
In section 3, the authors show that, for both elementwise attention modules and softmax attention modules, this universal approximation property does not hold if the context is restricted to be drawn from a finite alphabet (a “vocabulary”). By “finite alphabet,” I mean contexts are confined to $(V_x \times V_y)^*$ for finite $V_x \subset \mathbb{R}^{d_x}$ and $V_y \subset \mathbb{R}^{d_y}$, as opposed to $(\mathbb{R}^{d_x} \times \mathbb{R}^{d_y})^*$.
I checked this proof carefully and I think it holds, though there are some minor errors remaining in the most recent revision.
In section 4, the authors show that, in the case of elementwise attention modules, the universal approximation property can be restored in the finite-alphabet setting if one considers contexts with a dense additive positional encoding. By “dense additive positional encoding,” I mean they consider a sequence of vectors $p_1, p_2, p_3, \ldots \in \mathbb{R}^{d_x}$ such that $\{ p_t : t \in \mathbb{N} \}$ is dense in $\mathbb{R}^{d_x}$ (or some bounded subsets of $\mathbb{R}^{d_x}$ in the case of certain activation functions). Then, in the context, each $x_i$ is replaced by $x_i + p_i$. They also assume .
I haven’t studied these proofs in detail, but from these assumptions, they could scan the positional encoding sequence until they find arbitrarily close approximations of the $x_i$s in their previously-constructed real contexts, and likewise assemble arbitrarily close approximations of the $y_i$s by merging many such examples with different $y$ components. Then, since their architecture is continuous in the context, this should suffice to achieve universality again.
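As a toy, one-dimensional illustration of the density-scanning idea (entirely my own, not from the paper): the fractional parts of $t \sqrt{2}$ are dense in $[0, 1]$, so scanning positions eventually finds an additive offset within any tolerance of a target value.

```python
import numpy as np

# Dense 1-D "positional encoding": p_t = fractional part of t * sqrt(2).
alpha = np.sqrt(2.0)
target, tol = 0.123456, 1e-4   # offset we want to realise, and how closely

t = 1
while abs(t * alpha % 1.0 - target) >= tol:
    t += 1

print(t, t * alpha % 1.0)  # a position whose encoding lands within tol of the target
```

Even in one dimension, the position required grows as the tolerance shrinks, which is consistent with my guess below that the construction involves very long contexts.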
To be honest, I don’t think these results are very compelling, even aside from the fact that universal approximation is not much more than theoretically interesting.
The finite-alphabet setting is motivated by the fact that in a real language model, the input sequence is indeed assembled from a finite vocabulary of tokens. However, in a multi-layer language model, the outputs of the first layer aggregate information from multiple tokens, and through successive transformer blocks I conjecture that we can reach an exponential number of embedding vectors, bringing us closer to the dense setting.
As for the benefits of positional encoding, the assumption of density of the positional encoding is so strong as to substantially cheapen the result. I guess the proof involves some very long contexts to find the right rational approximations. Moreover, the authors were only able to prove this result for their elementwise “transformer” architecture (I would guess that the long contexts create problems via normalisation). Yet the authors force a positive framing, and even go so far as to suggest that empirical research should explore the benefits of positional encodings that enumerate the rational numbers.
Will this paper be accepted? Two of the reviewers liked the paper (low confidence accept, fair confidence weak accept). The other two reviewers recommended weak reject: one (low confidence) is not on board with the idea of optimising over contexts to begin with; the other (high confidence, very in-depth review) is on board but not impressed with the clarity of the presentation or the strength of the assumptions. I predict this paper will not be accepted.
Bibliography
Once again, this note is based on the following paper:
- Anonymous author(s), “Vocabulary in-context learning in transformers: Benefits of positional encoding,” under review at ICLR 2025. OpenReview.
The proof techniques build on foundational, classical work in universal function approximation with feed-forward neural networks by Leshno et al.:
- Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken, “Multilayer feedforward networks with a nonpolynomial activation function can approximate any function,” Neural Networks, 6(6):861–867, 1993. DOI:10.1016/S0893-6080(05)80131-5.
The observation that in-context learning is a universal approximator in the above sense has apparently been made before, using different techniques, according to the anonymous author(s):
- Aleksandar Petrov, Philip Torr, and Adel Bibi, “Prompting a pretrained transformer can be a universal approximator,” at ICML 2024. PMLR.