
Transformers are universal in-context function approximators

a simple and streamlined proof

This is a short technical note adapted from a presentation I gave, apropos of nothing, at the metauni SLT seminar on Thursday, December 5th, 2024.

In the seminar, I presented a simplified and streamlined version of the results from section 2 of the paper “Vocabulary in-context learning in transformers: Benefits of positional encoding,” currently under public, double-blind review at ICLR 2025 (OpenReview link).

Definitions: Transformers

A transformer is a function $T : \left( \mathbb{R}^{d} \times \mathbb{R}^{d'} \right)^{\ast} \times \mathbb{R}^d \to \mathbb{R}^{d'}$ where $\ast$ is the Kleene star. The first input to this function is a context—a sequence of pairs of vectors of a variable, unbounded length $n \in \mathbb{N}$, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n),$ where each $x_i \in \mathbb{R}^{d}$ and each $y_i \in \mathbb{R}^{d'}$. It then also takes a query $x \in \mathbb{R}^d$ and returns a completion $y \in \mathbb{R}^{d'}$.

Given a particular context $(x_1, y_1), \ldots, (x_n, y_n)$, the transformer can be thought of as a function from queries to completions: $T(x_1, y_1, \ldots, x_n, y_n, -) : \mathbb{R}^d \to \mathbb{R}^{d'}.$ It is this function (modulo some auxiliary dimensions here and there) that we want to show is a universal function approximator, in the sense that for any continuous function on a compact domain, we can find a context such that the transformer, as a function from queries to completions, is approximately equal to that function.

It may be unclear why we are considering a transformer that operates on sequences of pairs of vectors, rather than on a sequence of tokens. The idea is that each pair $(x_i, y_i)$ plays the role of an (input, output) example, so the context acts as a small dataset presented in-context, and the query is a new input to be completed.

With that said, let’s fill out the transformer type signature with the concrete mappings we will show are universal in-context function approximators. Consider the following two “transformers”:

  1. The softmax attention module, $T_{\mathrm{softmax}}$, with the mapping $$T_{\mathrm{softmax}}(x_1, y_1, \ldots, x_n, y_n, x) = \sum_{i=1}^{n} \frac {\exp( x_i \cdot x )} {\sum_{j=1}^{n} \exp( x_j \cdot x )} \, y_i.$$

    This is a simplified version of a single attention head within a real transformer. See that the query $x$ is compared to previous $x_i$ values in the context, and these comparisons are used to create a softmax distribution. Then, the completion is an average of the $y_i$ values from the context, weighted by this softmax distribution.

  2. The elementwise attention module, with the mapping $$T_{\sigma}(x_1, y_1, \ldots, x_n, y_n, x) = \sum_{i=1}^{n} \sigma( x_i \cdot x ) \, y_i$$ where $\sigma : \mathbb{R} \to \mathbb{R}$ is an arbitrary activation function.

    It’s not so clear to me that this second option should be called a transformer. The completion is not generated as an average weighted by a normalised probability distribution, but by the outputs of an arbitrary activation function. Still, it is a useful conceptual stepping stone for the proof in the softmax case—if you look closely, you might already be able to anticipate how the context plays a role similar to the weights of a feed-forward neural network. (Both modules are sketched in numpy just after this list.)
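To make the two definitions concrete, here is a minimal numpy sketch of both modules. This is my own illustration, not code from the paper; the array shapes and function names are my choices.

```python
import numpy as np

def T_softmax(xs, ys, x):
    """Simplified softmax attention: xs has shape (n, d), ys has shape (n, d_out),
    x has shape (d,). Returns a completion of shape (d_out,)."""
    logits = xs @ x                          # dot products x_i . x, shape (n,)
    weights = np.exp(logits - logits.max())  # subtract max for numerical stability (softmax is shift-invariant)
    weights = weights / weights.sum()        # softmax distribution over the context
    return weights @ ys                      # weighted average of the y_i

def T_elementwise(xs, ys, x, sigma=np.tanh):
    """Elementwise attention: same shapes, but the scores sigma(x_i . x) are not normalised."""
    return sigma(xs @ x) @ ys                # sum_i sigma(x_i . x) y_i

# usage on a random context of length 5 with d = 3, d' = 2
xs, ys, x = np.random.randn(5, 3), np.random.randn(5, 2), np.random.randn(3)
print(T_softmax(xs, ys, x), T_elementwise(xs, ys, x))
```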

Even the softmax attention module is substantially simplified compared to a full modern transformer architecture, or even a single attention head within such an architecture. The key missing piece is a triple of parametric linear transformations (the query, key, and value transformations). The authors consider some such parametrised transformers, but I have removed these parameters for this note because they lead to more cumbersome notation and are not necessary to see the key ideas in the proof.

Universal (in-context) function approximation

Recall the following result from Leshno, Lin, Pinkus, and Schocken (1993):

Theorem (Universal approximation, Leshno et al., 1993): Let $\sigma : \mathbb{R} \to \mathbb{R}$ be a non-polynomial, locally bounded, piecewise continuous activation function. For any continuous function $f : \mathbb{R}^d \to \mathbb{R}^{d'}$ defined on a compact domain $\mathcal K \subset \mathbb{R}^d$, and any $\varepsilon > 0$, there exist $n \in \mathbb{N}$, $a_1, \ldots, a_n \in \mathbb{R}^{d}$, $b_1, \ldots, b_n \in \mathbb{R}$, and $c_1, \ldots, c_n \in \mathbb{R}^{d'}$, such that for all $x \in \mathcal{K}$, $$\left\lVert f(x) - \sum_{i=1}^n \sigma (a_i \cdot x + b_i)\,c_i \right\rVert_\infty < \varepsilon.$$
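To see what the theorem's approximator looks like in practice, here is a small numpy sketch in one dimension. It fits the output weights $c_i$ by least squares on top of randomly drawn $a_i, b_i$—that is, random-features regression with $\sigma = \tanh$—which illustrates the form $\sum_i \sigma(a_i \cdot x + b_i)\,c_i$ but is not the theorem's construction; the target function and the sizes are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                                    # target function on the compact domain K = [-3, 3]
K = np.linspace(-3.0, 3.0, 500)               # dense grid standing in for K

n = 200                                       # number of hidden units (= context length later)
a = rng.normal(size=n)                        # input weights a_i (here d = 1)
b = rng.uniform(-3.0, 3.0, size=n)            # biases b_i
H = np.tanh(np.outer(K, a) + b)               # hidden activations sigma(a_i x + b_i), shape (500, n)

c, *_ = np.linalg.lstsq(H, f(K), rcond=None)  # fit output weights c_i by least squares
err = np.max(np.abs(f(K) - H @ c))            # sup-norm error on the grid
print(f"max error on grid: {err:.2e}")
```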

The “Vocabulary in-context learning” paper proves as a corollary of this result a universal approximation property for the above two transformers (and more complex, parameterised transformers) as a function of the context:

Corollary: For any continuous function $f : \mathbb{R}^d \to \mathbb{R}^{d'}$ defined on a compact domain $\mathcal K \subset \mathbb{R}^d$, and any $\varepsilon > 0$, there exist $n \in \mathbb{N}$, $x_1, \ldots, x_n \in \mathbb{R}^{d+1}$, and $y_1, \ldots, y_n \in \mathbb{R}^{d'}$, such that for all $x \in \mathcal{K}$, $$\left\lVert f(x) - T_{\mathrm{softmax}}\left( x_1, y_1, \ldots, x_n, y_n, (x, 1) \right) \right\rVert_\infty < \varepsilon.$$ Moreover, the same is true for $T_{\sigma}$ if $\sigma : \mathbb{R} \to \mathbb{R}$ is non-polynomial, locally bounded, and piecewise continuous.

Note that compared to the function to be approximated, $f : \mathbb{R}^d \to \mathbb{R}^{d'}$, the context vectors $x_i$ have an additional dimension, and the query is $(x, 1) \in \mathbb{R}^{d+1}$ where $x \in \mathbb{R}^d$ is the function input. These are the “auxiliary dimensions here and there” to which I referred earlier.

I present the proof first for the case of elementwise attention ($T_{\sigma}$) and then the slightly more complex proof for softmax attention ($T_{\mathrm{softmax}}$).

Elementwise case: Equivalence of NNs and contexts

Proof ($T_{\sigma}$ case): By the theorem there exist $n \in \mathbb{N}$, $a_1, \ldots, a_n \in \mathbb{R}^{d}$, $b_1, \ldots, b_n \in \mathbb{R}$, and $c_1, \ldots, c_n \in \mathbb{R}^{d'}$ such that for all $x \in \mathcal{K}$, $$\left\lVert f(x) - \sum_{i=1}^n \sigma (a_i \cdot x + b_i)\,c_i \right\rVert_\infty < \varepsilon.$$ For $i = 1, \ldots, n$, put $x_i = (a_i, b_i) \in \mathbb{R}^{d+1}$ and $y_i = c_i \in \mathbb{R}^{d'}$. Then $$T_{\sigma}\left( x_1, y_1, \ldots, x_n, y_n, (x, 1) \right) = \sum_{i=1}^{n} \sigma( x_i \cdot (x, 1) ) \, y_i = \sum_{i=1}^{n} \sigma( a_i \cdot x + b_i ) \, c_i,$$ which is within $\varepsilon$ of $f(x)$ for all $x \in \mathcal{K}$ by the choice of the $a_i$, $b_i$, $c_i$. QED.
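This construction is easy to check numerically. The following sketch (with $\sigma = \tanh$ and arbitrary dimensions of my choosing, not taken from the paper) packs the hidden-layer weights into a context and confirms that the elementwise attention output matches the network output exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_out, n = 3, 2, 10
a = rng.normal(size=(n, d))                    # hidden-layer weights a_i
b = rng.normal(size=n)                         # hidden-layer biases b_i
c = rng.normal(size=(n, d_out))                # output weights c_i

xs = np.concatenate([a, b[:, None]], axis=1)   # context inputs x_i = (a_i, b_i), shape (n, d+1)
ys = c                                         # context outputs y_i = c_i

x = rng.normal(size=d)                         # a query point
nn_out = np.tanh(a @ x + b) @ c                # single-hidden-layer network output
ctx_out = np.tanh(xs @ np.append(x, 1.0)) @ ys # elementwise attention with lifted query (x, 1)

print(np.allclose(nn_out, ctx_out))            # True: the two agree
```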

Remark: The above proof reveals an equivalence between:

  1. single-hidden-layer neural networks with $d$ input units, $n$ biased hidden units, and $d'$ output units, and
  2. transformer contexts with length $n$ and dimensions $d+1$ and $d'$.

The paper shows this correspondence also holds for any parameterised elementwise attention module, as long as the query, key, and value transformations have a certain structure—one that is fairly unrestrictive but not easy to describe without a lot of cumbersome notation (and which is, unfortunately, a bit buried in the paper).

Softmax case: Approximation of NNs by contexts

Proof ($T_{\mathrm{softmax}}$ case): Unfortunately, normalisation prevents a similar correspondence for softmax attention. But for the corollary it suffices to approximate neural networks with contexts.

In particular, the function $\sigma = \exp$ is non-polynomial, locally bounded, and piecewise continuous. Therefore by the theorem there exist $n \in \mathbb{N}$, $a_1, \ldots, a_n \in \mathbb{R}^{d}$, $b_1, \ldots, b_n \in \mathbb{R}$, and $c_1, \ldots, c_n \in \mathbb{R}^{d'}$ such that for all $x \in \mathcal{K}$, $$\left\lVert f(x) - \sum_{i=1}^n \exp (a_i \cdot x + b_i)\,c_i \right\rVert_\infty < \frac\varepsilon 2.$$ I will show there exist $x_0, x_1, \ldots, x_{n} \in \mathbb{R}^{d+1}$ and $y_0, y_1, \ldots, y_{n} \in \mathbb{R}^{d'}$ such that for all $x \in \mathcal{K}$, $$\left\lVert \sum_{i=1}^{n} \exp (a_i \cdot x + b_i) \,c_i - T_{\mathrm{softmax}}(x_0, y_0, x_1, y_1, \ldots, x_n, y_n, (x, 1)) \right\rVert_\infty < \frac\varepsilon 2, \tag{$\star$}$$ from which the corollary follows by the triangle inequality (with a context of length $n+1$).

For $i = 1, \ldots, n$, put $x_i = (a_i, b_i - C) \in \mathbb{R}^{d+1}$ and $y_i = \exp(C)\, c_i \in \mathbb{R}^{d'}$, where $C$ is a large constant to be chosen later. Additionally, include one “sink” element $x_0 = 0 \in \mathbb{R}^{d+1}$, $y_0 = 0 \in \mathbb{R}^{d'}$. Since $\exp(x_0 \cdot (x,1)) = 1$, its only effect is to add $1$ to the softmax denominator; this is what keeps the normalisation under control (the softmax weights alone are invariant to the shift by $C$, so without the sink element the scaling by $\exp(C)$ could not be absorbed).

Then \begin{align*} T_{\mathrm{softmax}}(x_0, y_0, x_1, y_1, \ldots, x_n, y_n, (x,1)) &= \sum_{i=0}^{n} \frac {\exp( x_i \cdot (x,1) )} {\sum_{j=0}^{n} \exp( x_j \cdot (x,1) )} \, y_i \\[3ex]&= \frac {\sum_{i=1}^{n} \exp( a_i \cdot x + b_i - C )\exp(C)\, c_i} {1 + \sum_{j=1}^{n} \exp( a_j \cdot x + b_j - C )} \\[3ex]&= \frac {\sum_{i=1}^{n} \exp( a_i \cdot x + b_i)\, c_i} {1 + \sum_{j=1}^{n} \exp( a_j \cdot x + b_j - C )} \end{align*} where the sink element contributes $\exp(0) = 1$ to the denominator and, since $y_0 = 0$, nothing to the numerator.

Importantly, we have recovered the exponential neural network as the numerator. We can now factor it out: writing $D(x) = \sum_{j=1}^{n} \exp (a_j \cdot x + b_j - C)$ for the non-sink part of the denominator, $$(\text{LHS of }\star) = \left\lVert \sum_{i=1}^n \exp (a_i \cdot x + b_i)\,c_i \right\rVert_\infty \left( 1 - \frac{1}{1 + D(x)} \right).$$

Then, noting that for $z > 0$ we have $1 - \frac{1}{1+z} = \frac{z}{1+z} < z$, $$(\text{LHS of }\star) < \left\lVert \sum_{i=1}^n \exp (a_i \cdot x + b_i)\,c_i \right\rVert_\infty \left( \sum_{j=1}^{n} \exp (a_j \cdot x + b_j - C) \right).$$

To finish, take $C$ large enough so that for all $x \in \mathcal{K}$ and $j \in \lbrace 1,\ldots,n \rbrace$, $$\exp(a_j \cdot x + b_j - C) < \frac {\varepsilon} {2n \sup_{x' \in \mathcal{K}} \left\lVert \sum_{i=1}^n \exp (a_i \cdot x' + b_i)\,c_i \right\rVert_\infty}.$$ Such a $C$ is well-defined (finite) because $\mathcal{K}$ is compact, so both the supremum in the denominator and the exponents $a_j \cdot x + b_j$ are bounded over $\mathcal{K}$. Summing the $n$ terms then gives $(\text{LHS of }\star) < \varepsilon/2$, as required.

QED.
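As a numerical sanity check of the construction (again with arbitrary weights, dimensions, and values of $C$ of my own choosing, not code from the paper), the softmax attention output converges to the exponential network output as $C$ grows:

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_out, n = 3, 2, 10
a = rng.normal(size=(n, d)) * 0.3              # weights a_i of the exponential network
b = rng.normal(size=n) * 0.3                   # biases b_i
c = rng.normal(size=(n, d_out))                # output weights c_i
x = rng.normal(size=d)                         # a query point
q = np.append(x, 1.0)                          # lifted query (x, 1)

exp_nn = np.exp(a @ x + b) @ c                 # exponential network output

for C in [0.0, 5.0, 10.0, 20.0]:
    # context: sink element x_0 = 0, y_0 = 0, plus x_i = (a_i, b_i - C), y_i = exp(C) c_i
    xs = np.vstack([np.zeros(d + 1), np.concatenate([a, (b - C)[:, None]], axis=1)])
    ys = np.vstack([np.zeros(d_out), np.exp(C) * c])
    w = np.exp(xs @ q)
    t_softmax = (w / w.sum()) @ ys             # softmax attention output
    print(C, np.max(np.abs(exp_nn - t_softmax)))   # error shrinks as C grows
```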

Remark: I think the above proof for the softmax attention case is cleaner than the one in the paper (though I followed their approach).

The proof in the paper is also more elaborate because it accounts for query, key, and value transforms. However, this is not an obstacle to the streamlined approach—the above proof can easily be elaborated to account for those transforms.

Finite vocabularies and positional encodings

This result is actually not the main focus of the paper. That would be the paper's results on in-context approximation with contexts built from a finite vocabulary together with positional encodings (as in the title), which I won't reproduce here.

To be honest, I don’t think these results are very compelling, even aside from the fact that universal approximation is not much more than theoretically interesting.

Will this paper be accepted? Two of the reviewers liked the paper (low confidence accept, fair confidence weak accept). The other two reviewers recommended weak reject: one (low confidence) is not on board with the idea of optimising over contexts to begin with; the other (high confidence, very in-depth review) is on board but not impressed with the clarity of the presentation or the strength of the assumptions. I predict this paper will not be accepted.

Bibliography

Once again, this note is based on the following paper:

  1. Anonymous author(s), “Vocabulary in-context learning in transformers: Benefits of positional encoding,” under review at ICLR 2025. OpenReview.

The proof techniques build on foundational, classical work in universal function approximation with feed-forward neural networks by Leshno et al.:

  1. Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken, “Multilayer feedforward networks with a nonpolynomial activation function can approximate any function,” Neural Networks, 6(6):861–867, 1993. DOI:10.1016/S0893-6080(05)80131-5.

The observation that in-context learning is a universal approximator in the above sense has apparently been made before, using different techniques, according to the anonymous author(s):

  1. Aleksandar Petrov, Philip Torr, and Adel Bibi, “Prompting a pretrained transformer can be a universal approximator,” at ICML 2024. PMLR.