far.in.net


The tempered posterior

via the principle of maximum entropy

Most of the results in singular learning theory are developed in the context of a generalised form of Bayesian inference, in which the posterior distribution has an additional (inverse) temperature parameter. In this note, I introduce this so-called “tempered posterior” and contrast it with the usual Bayesian posterior. I also discuss various motivations for departing from the Bayesian suggestion, including a derivation of the tempered posterior using the principle of maximum entropy.

Thanks to Edmund Lau, Daniel Murfet, and Susan Wei for helpful conversations.


§Inference, two ways

The following describes an inference problem.

We begin with a family of statistical models, $W$, with distributions $\{p_w\}_{w \in W}$. Suppose one model, $w^\star \in W$, is specially designated as a “true model,” but we don’t know which one. We are given some prior belief distribution, $\pi_0 \in \Delta(W)$, representing how much credence we should initially have in each model in the family being the true model.

Suppose further we are given a data set, $D_n = x_1, \ldots, x_n$, where each sample is drawn independently from the true distribution $p_{w^\star}$. How should we update our prior belief distribution, $\pi_0$, to form a posterior, $\pi_n$, so as to incorporate the knowledge we have gained about $w^\star$ by seeing these $n$ samples from $p_{w^\star}$?

Let’s discuss two answers.

§§The Bayesian posterior

The Bayesian posterior is one classical answer to the question of how to update our beliefs in the face of new samples. First, treat the true model $w^\star$ as a latent random variable distributed according to the prior, $\mathbb{P}(w^\star = w) = \pi_0(w)$. The true model $w^\star$ is drawn and fixed before any samples are drawn; each sample is then drawn according to the conditional distribution given by that model, $\mathbb{P}(x_i = x \mid w^\star = w) = p_w(x)$.

We can then use as our posterior $\pi_n$ the converse conditional probability distribution $\mathbb{P}(w \mid D_n)$. The laws of probability, in particular Bayes’ rule, tell us how to define this conditional probability in terms of the other values we know:
$$ \pi_n(w) = \mathbb{P}(w \mid D_n) = \frac{\mathbb{P}(w)\,\mathbb{P}(D_n \mid w)}{\mathbb{P}(D_n)} = \frac{\pi_0(w) \prod_{i=1}^{n} p_w(x_i)}{\int_W \pi_0(w') \prod_{i=1}^{n} p_{w'}(x_i)\,\mathrm{d}w'}. $$
The resulting posterior is called the Bayesian posterior.

§§The general/tempered posterior

In singular learning theory, we often have cause to consider instead the so-called general posterior at inverse temperature $\beta$, also called the tempered posterior, which is given by a different update rule:
$$ \pi_n^\beta(w) \doteq \frac{\pi_0(w) \prod_{i=1}^{n} p_w(x_i)^\beta}{\int_W \pi_0(w') \prod_{i=1}^{n} p_{w'}(x_i)^\beta\,\mathrm{d}w'}. $$

The tempered posterior $\pi_n^\beta$ differs from the Bayesian posterior $\pi_n$ by the inclusion of the inverse temperature parameter $\beta \geq 0$.
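Over a finite family of models, the integral in the denominator becomes a sum, and the update rule can be sketched in a few lines of Python. The Bernoulli grid below is a hypothetical example for illustration, not from the note:

```python
import math

def tempered_posterior(prior, log_liks, beta):
    """Tempered posterior over a finite family of models.

    prior:    prior probabilities pi_0(w), one per model.
    log_liks: total data log likelihoods sum_i ln p_w(x_i), one per model.
    beta:     inverse temperature (beta = 1 recovers the Bayesian posterior).
    """
    # unnormalised log posterior: ln pi_0(w) + beta * sum_i ln p_w(x_i)
    scores = [math.log(p) + beta * ll for p, ll in zip(prior, log_liks)]
    # normalise using log-sum-exp for numerical stability
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_z) for s in scores]

# hypothetical example: three Bernoulli(theta) models, data = 7 heads in 10 flips
thetas = [0.3, 0.5, 0.7]
prior = [1 / 3, 1 / 3, 1 / 3]
log_liks = [7 * math.log(t) + 3 * math.log(1 - t) for t in thetas]

bayes = tempered_posterior(prior, log_liks, beta=1.0)   # ordinary Bayes
gentle = tempered_posterior(prior, log_liks, beta=0.2)  # weaker update
```

At $\beta = 1$ this is exactly Bayes’ rule; smaller $\beta$ pulls the posterior back toward the prior.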

In the remainder of this note, we’ll explore the interpretation of $\beta$ and different motivations and derivations of this update rule.

§Interpreting inverse temperature

For different values of $\beta$, we recover some interesting forms of inference, which reveals the role of the inverse temperature parameter as controlling the “strength” of the update, in a similar way to the number of samples.

Of course, at inverse temperature $\beta = 1$, the update rule reduces to Bayes’ rule, and we recover Bayesian inference. But there are also other interpretable examples.

For example, at inverse temperature $\beta = 0$ (infinite “temperature”), the tempered posterior reduces to the prior:
$$ \pi_n^0(w) = \frac{\pi_0(w) \prod_{i=1}^{n} p_w(x_i)^0}{\int_W \pi_0(w') \prod_{i=1}^{n} p_{w'}(x_i)^0\,\mathrm{d}w'} = \frac{\pi_0(w)}{\int_W \pi_0(w')\,\mathrm{d}w'} = \pi_0(w). $$

More generally, if $\beta = k$ for some non-negative integer $k$, $\pi_n^\beta$ is the Bayesian posterior we would get if each data point had been independently sampled $k$ times, since the product in the likelihood can be rearranged as follows:
$$ \prod_{i=1}^{n} p_w(x_i)^k = \prod_{i=1}^{n} \prod_{j=1}^{k} p_w(x_i) = \prod_{j=1}^{k} \prod_{i=1}^{n} p_w(x_i). $$

As inverse temperature is taken to infinity (approaching zero “temperature”), the tempered posterior concentrates around $\operatorname{arg\,max}_w \prod_{i=1}^n p_w(x_i)$ (that is, maximum likelihood models): let $L_n(w) = \prod_{i=1}^n p_w(x_i)$ and $L_n^\star = \max_w L_n(w)$. Observe:
$$ \begin{align*} \pi_n^\beta(w) &= \frac{\pi_0(w)\, L_n(w)^\beta}{\int_W \pi_0(w')\, L_n(w')^\beta\,\mathrm{d}w'} \\ &= \frac{(L_n^\star)^\beta\, \pi_0(w) \left(\frac{L_n(w)}{L_n^\star}\right)^\beta}{(L_n^\star)^\beta \int_W \pi_0(w') \left(\frac{L_n(w')}{L_n^\star}\right)^\beta\,\mathrm{d}w'} \\ &= \frac{\pi_0(w) \left(\frac{L_n(w)}{L_n^\star}\right)^\beta}{\int_W \pi_0(w') \left(\frac{L_n(w')}{L_n^\star}\right)^\beta\,\mathrm{d}w'}. \end{align*} $$
If $L_n(w) < L_n^\star$, then $\left(\frac{L_n(w)}{L_n^\star}\right)^\beta \to 0$ as $\beta \to \infty$, and the numerator vanishes. Otherwise (that is, if $L_n(w) = L_n^\star$), $\left(\frac{L_n(w)}{L_n^\star}\right)^\beta = 1$ and the numerator goes to $\pi_0(w)$. (The limit of the integral in the denominator depends on the number of likelihood maximisers and their local geometry.)

The above gives rise to the interpretation of the inverse temperature parameter $\beta$ as describing how dramatically to update from the prior based on the data, from “do not update at all” ($\beta = 0$) to “update so as to only believe in the maximum likelihood models” ($\beta \to \infty$).
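These limiting behaviours can be checked numerically on a toy discrete model family (a hypothetical Bernoulli grid, assumed purely for illustration):

```python
import math

def tempered_posterior(prior, log_liks, beta):
    # tempered update on a finite model grid, via log-sum-exp
    scores = [math.log(p) + beta * ll for p, ll in zip(prior, log_liks)]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_z) for s in scores]

thetas = [0.3, 0.5, 0.7]                     # three Bernoulli models
prior = [0.5, 0.3, 0.2]
log_liks = [7 * math.log(t) + 3 * math.log(1 - t) for t in thetas]

# beta = 0: no update at all; the posterior is the prior
assert all(abs(a - b) < 1e-12
           for a, b in zip(tempered_posterior(prior, log_liks, 0.0), prior))

# beta = 2: same as Bayes on each data point independently sampled twice
doubled = [2 * ll for ll in log_liks]
assert all(abs(a - b) < 1e-12
           for a, b in zip(tempered_posterior(prior, log_liks, 2.0),
                           tempered_posterior(prior, doubled, 1.0)))

# large beta: mass concentrates on the maximum-likelihood model (theta = 0.7)
post = tempered_posterior(prior, log_liks, 200.0)
assert post[2] > 0.999
```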

§Motivating the tempered posterior

You may be wondering: why should we consider an update rule other than Bayesian updating? What use are tempered posteriors with inverse temperatures other than $\beta = 1$? Doesn’t this fly in the face of all of the philosophical arguments in favour of using the laws of probability to govern our beliefs?!

Let me discuss a few motivations for why you might want to include an inverse temperature parameter in your inference (and set it to something other than $1$ in some cases).

§§Temperature and model misspecification

First, a pragmatic justification from the perspective of a practitioner. Note that the aforementioned philosophical arguments in favour of Bayesian inference rest on assumptions. For example, they assume that the statement of the inference problem above is an accurate description of the situation we face. What if, in practice, our samples are not drawn i.i.d. from $p_{w^\star}$?

When the model is misspecified in this way, the full-strength Bayesian update can leave the posterior overconfident, and updating at a fractional rate ($\beta < 1$) can be more robust. So, in practical cases falling outside of the initial assumptions of our inference problem setting, the appropriate (even the Bayesian) thing to do might be to perform a “non-Bayesian,” fractional update on the given data.

If you have a real-world inference problem, perhaps you can even think of $\beta$ as a tunable hyper-parameter, which you can set based on what seems to get good performance or efficiency in your specific class of problems.

§§Temperature as a continuous variable

Second, a pragmatic justification from the perspective of a theoretician. Generalising the posterior with a new continuous temperature parameter can enable new kinds of mathematical analysis.

For example, the number of samples $n$ is a discrete variable, so you can’t differentiate with respect to $n$. However, $\beta$ is continuous, and plays a similar role to $n$ (as discussed above). Therefore, you can achieve a similar effect to differentiating with respect to $n$ by instead differentiating with respect to $\beta$.
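As a concrete instance of this: for the normalising constant $Z_n(\beta) = \int_W \pi_0(w) \prod_{i=1}^n p_w(x_i)^\beta\,\mathrm{d}w$, a standard identity (not stated in the note, but easy to verify) is that $\frac{\mathrm{d}}{\mathrm{d}\beta} \ln Z_n(\beta)$ equals the expected log likelihood under the tempered posterior. A numerical check on a hypothetical discrete model grid:

```python
import math

thetas = [0.3, 0.5, 0.7]
prior = [1 / 3, 1 / 3, 1 / 3]
log_liks = [7 * math.log(t) + 3 * math.log(1 - t) for t in thetas]

def log_z(beta):
    # ln Z_n(beta) = ln sum_w pi_0(w) L_n(w)^beta, via log-sum-exp
    scores = [math.log(p) + beta * ll for p, ll in zip(prior, log_liks)]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def expected_log_lik(beta):
    # E[ln L_n(w)] under the tempered posterior at inverse temperature beta
    scores = [math.log(p) + beta * ll for p, ll in zip(prior, log_liks)]
    m = max(scores)
    z = m + math.log(sum(math.exp(s - m) for s in scores))
    return sum(math.exp(s - z) * ll for s, ll in zip(scores, log_liks))

# central finite difference of ln Z_n at beta = 1 matches the expectation
beta, h = 1.0, 1e-6
fd = (log_z(beta + h) - log_z(beta - h)) / (2 * h)
```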

More generally, mathematicians sometimes find that generalising a mathematical object leads to a clearer understanding of the original object, viewed within the broader, generalised context.

§§A connection to statistical mechanics

A similar (inverse) temperature parameter arises naturally in the context of statistical mechanics.

There, one studies an object similar to the tempered posterior, called the Boltzmann distribution (or Gibbs distribution); let’s denote it $\pi_{\mathrm{B}}$. The Boltzmann distribution describes the distribution of system microstates (possible configurations) $\omega \in \Omega$ in terms of their energy levels (given by a function $E : \Omega \to \mathbb{R}$). In the discrete case, here is the formula:
$$ \pi_{\mathrm{B}}(\omega) = \frac{\exp\left(-\frac{1}{k_{\mathrm{B}}T} E(\omega)\right)}{\sum_{\omega' \in \Omega} \exp\left(-\frac{1}{k_{\mathrm{B}}T} E(\omega')\right)}. $$
Here, $T$ is the system temperature ($1/T$ is the inverse temperature), and $k_{\mathrm{B}}$ is the Boltzmann constant (which allows for a conversion between units). (For the machine-learning minded, the Boltzmann distribution is a softmax over microstate energies scaled by temperature and the Boltzmann constant.)

We can compare this to the tempered posterior equation, which is, again:
$$ \pi_n^\beta(w) = \frac{\pi_0(w) \prod_{i=1}^{n} p_w(x_i)^\beta}{\int_W \pi_0(w') \prod_{i=1}^{n} p_{w'}(x_i)^\beta\,\mathrm{d}w'}. $$
To connect these two distributions, first define an energy function as follows:
$$ E_n(w) = -\ln\left(\prod_{i=1}^n p_w(x_i)\right) - \frac{1}{\beta} \ln \pi_0(w). $$
The first term is the negative log likelihood of the data set. The second is a background potential given by the negative log density of the prior, scaled by the temperature $1/\beta$. This energy function is designed such that $\exp(-\beta E_n(w)) = \pi_0(w) \prod_{i=1}^n p_w(x_i)^\beta$, and so we can rewrite the tempered posterior in terms of energy as
$$ \pi_n^\beta(w) = \frac{\exp(-\beta E_n(w))}{\int_W \exp(-\beta E_n(w'))\,\mathrm{d}w'}. $$
This closely matches the form of the Boltzmann distribution $\pi_{\mathrm{B}}$, revealing the motivation for calling $\beta$ an (inverse) temperature parameter.

(Note that $k_{\mathrm{B}}$ is missing from this story. In the statistical mechanics case it only plays the role of converting between different units used for temperature and energy, whereas in the Bayesian case our quantities are dimensionless.)
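To see the correspondence concretely, here is a small numerical check (again on a hypothetical discrete model grid) that the two forms of the unnormalised tempered posterior agree: $\pi_0(w) \prod_i p_w(x_i)^\beta$ versus $\exp(-\beta E_n(w))$.

```python
import math

thetas = [0.3, 0.5, 0.7]
prior = [1 / 3, 1 / 3, 1 / 3]
log_liks = [7 * math.log(t) + 3 * math.log(1 - t) for t in thetas]
beta = 0.7

# likelihood form of the unnormalised posterior: pi_0(w) * L_n(w)^beta
lik_form = [p * math.exp(beta * ll) for p, ll in zip(prior, log_liks)]

# energy form: exp(-beta * E_n(w)), E_n(w) = -ln L_n(w) - (1/beta) ln pi_0(w)
energies = [-ll - math.log(p) / beta for p, ll in zip(prior, log_liks)]
energy_form = [math.exp(-beta * e) for e in energies]

assert all(abs(x - y) < 1e-9 for x, y in zip(lik_form, energy_form))
```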

§Temperature as a constraint

The above connection between statistical mechanics and inference is a perfect segue to the following, final attempt to motivate the tempered posterior. A similar analogy between statistical mechanics and Shannon’s information theory led Jaynes to develop the principle of maximum entropy, which suggests choosing a posterior probability distribution by maximising information entropy subject to a constraint that the distribution is a good fit for the observed data.

In this section, we will derive the family of tempered posterior distributions with varying inverse temperature as the solutions suggested by the principle of maximum entropy with varying strengths for the requirement that the distribution is a good fit for the data.

§§The principle of maximum entropy

The principle of maximum entropy suggests that, in the context of an inference problem, the distribution we should use to represent our updated beliefs is the one that satisfies the following:

  1. The chosen distribution should be consistent with the data. Formally, let’s say we want whatever distribution we select to be one under which the particular data we saw is highly plausible, in the sense that, in expectation over the distribution, the log likelihood of the data is sufficiently high.

  2. The chosen distribution should otherwise be maximally non-committal as to which of the models in the model class explains the data. Formally, let’s say that among all distributions that attribute a high expected likelihood to the data, the chosen distribution should be the one among them with maximal entropy.

Like the principle of Bayesian inference, the principle of maximum entropy is an attempt at making a defensible choice for how to solve an inference problem. In particular, this principle can be defended on the grounds that maximising entropy amounts to making no unnecessary assumptions beyond what is required to be consistent with the data.

§§The constrained optimisation problem

Formally, we can view condition (2) as an optimisation objective and condition (1) as a constraint. Therefore, we can formalise the choice of posterior as a constrained optimisation problem over the space of probability distributions $\Delta(W)$. Actually, it’s simpler to optimise over the space of all functions, and enforce normalisation via another constraint, as follows:

Choose $P : W \to \mathbb{R}$ so as to maximise the quantity
$$ H[P] = -\int_W P(w) \ln \frac{P(w)}{\pi_0(w)}\,\mathrm{d}w $$
subject to the constraints
$$ \begin{align*} \int_W P(w) \ln\left(\prod_{i=1}^n p_w(x_i)\right) \mathrm{d}w &= L, \\ \int_W P(w)\,\mathrm{d}w &= 1, \end{align*} $$
for some hyper-parameter $L$.

Let’s break down each element of this problem statement. The candidate $P$ ranges over arbitrary functions on $W$; the second constraint ensures the chosen $P$ is normalised, and the first constraint ensures that the expected log likelihood of the data under $P$ reaches the level $L$, per condition (1). The objective $H[P]$ is the entropy of $P$ relative to the prior $\pi_0$ (that is, the negative KL divergence from $\pi_0$), so maximising it selects the least committal distribution consistent with the constraints, per condition (2).

Now that we have formulated the problem, let’s solve it!

§§Solving for the maximum entropy posterior

The first step for solving this constrained optimisation problem is to convert it to an unconstrained optimisation problem. Formulate a Lagrangian $\mathcal{L}[P]$ using Lagrange multipliers $\lambda_\beta$ and $\lambda_Z$ to transform the constraints:
$$ \begin{align*} \mathcal{L}[P] = &-\int_W P(w) \ln \frac{P(w)}{\pi_0(w)}\,\mathrm{d}w \\ &+ \lambda_\beta \left( \int_W P(w) \ln\left(\prod_{i=1}^n p_w(x_i)\right) \mathrm{d}w - L \right) \\ &+ \lambda_Z \left( \int_W P(w)\,\mathrm{d}w - 1 \right). \end{align*} $$

Simplify the Lagrangian as follows:
$$ \mathcal{L}[P] = \int_W P(w) \left( -\ln P(w) + \ln \pi_0(w) + \lambda_\beta \sum_{i=1}^n \ln p_w(x_i) + \lambda_Z \right) \mathrm{d}w - \lambda_\beta L - \lambda_Z. $$

To find the optimal $P$, first take the functional derivative. For $w \in W$,
$$ \begin{align*} \frac{\delta}{\delta P} \mathcal{L}[P](w) &= \frac{\partial}{\partial P(w)} \left( P(w) \left( -\ln P(w) + \ln \pi_0(w) + \lambda_\beta \sum_{i=1}^n \ln p_w(x_i) + \lambda_Z \right) \right) \\ &= -1 - \ln P(w) + \ln \pi_0(w) + \lambda_\beta \sum_{i=1}^n \ln p_w(x_i) + \lambda_Z. \end{align*} $$
When $P$ solves our constrained optimisation problem we must have that $\frac{\delta}{\delta P} \mathcal{L}[P] = 0$, so we can solve for $P$ as follows:
$$ \begin{align*} 0 &= -1 - \ln P(w) + \ln \pi_0(w) + \lambda_\beta \sum_{i=1}^n \ln p_w(x_i) + \lambda_Z \\ \Rightarrow \quad \ln P(w) &= \ln\left( \frac{\pi_0(w) \prod_{i=1}^n p_w(x_i)^{\lambda_\beta}}{\exp(1 - \lambda_Z)} \right) \\ \Rightarrow \quad P(w) &= \frac{\pi_0(w) \prod_{i=1}^n p_w(x_i)^{\lambda_\beta}}{\exp(1 - \lambda_Z)}. \end{align*} $$

It is clear that this $P$ is non-negative, since $\pi_0$, $p_w$, and $\exp$ are. To make sure that $P$ is a distribution, we apply our normalisation constraint $\int_W P(w)\,\mathrm{d}w = 1$. Solving for $\lambda_Z$ leads predictably to
$$ P(w) = \frac{\pi_0(w) \prod_{i=1}^n p_w(x_i)^{\lambda_\beta}}{\int_W \pi_0(w') \prod_{i=1}^n p_{w'}(x_i)^{\lambda_\beta}\,\mathrm{d}w'}. $$
Thus we have $P \in \Delta(W)$.

As for the value of $\lambda_\beta$, this will be determined by our hyper-parameter $L$ through the likelihood constraint
$$ \int_W P(w) \ln\left(\prod_{i=1}^n p_w(x_i)\right) \mathrm{d}w = L. $$
The exact relationship between $L$ and $\lambda_\beta$ depends on the details of the inference problem, and may in general be hard to state explicitly. However, we can already see from the final form of $P$ that $\lambda_\beta$ plays the role of the inverse temperature $\beta$! Therefore, we should expect $\lambda_\beta$ to range from $0$ towards $\infty$ as $L$ varies from the expected log likelihood under the prior to the maximum log likelihood realisable in this model class.
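On a discrete model grid, the relationship between $L$ and $\lambda_\beta$ can be traced numerically: the constraint value is monotonically increasing in $\lambda_\beta$ (its derivative is a posterior variance, hence non-negative), so we can solve for $\lambda_\beta$ by bisection. A sketch, assuming a hypothetical Bernoulli grid:

```python
import math

thetas = [0.3, 0.5, 0.7]
prior = [1 / 3, 1 / 3, 1 / 3]
log_liks = [7 * math.log(t) + 3 * math.log(1 - t) for t in thetas]

def expected_log_lik(beta):
    # constraint value: E_P[ln prod_i p_w(x_i)] under the tempered posterior
    scores = [math.log(p) + beta * ll for p, ll in zip(prior, log_liks)]
    m = max(scores)
    z = m + math.log(sum(math.exp(s - m) for s in scores))
    return sum(math.exp(s - z) * ll for s, ll in zip(scores, log_liks))

def solve_lambda(target, lo=0.0, hi=1000.0, tol=1e-10):
    # bisection, using that expected_log_lik is increasing in beta
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_log_lik(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

l_prior = expected_log_lik(0.0)   # expected log likelihood under the prior
l_max = max(log_liks)             # best achievable in this model class
target = 0.5 * (l_prior + l_max)  # demand a fit halfway between the extremes
lam = solve_lambda(target)
```

As $L$ is moved from `l_prior` toward `l_max`, the recovered multiplier ranges from $0$ toward $\infty$, matching the expectation above.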

§Conclusion

We have shown that the tempered posterior arises as the posterior belief distribution chosen according to the principle of maximum entropy. The inverse temperature parameter specifies exactly how plausible, in expectation, we require the data to be under our updated beliefs. Alongside the other motivations we have considered, I hope this deepens our understanding of the tempered posterior and the inverse temperature parameter.