far.in.net


The tempered posterior

via the principle of maximum entropy

Most of the results in singular learning theory are developed in the context of a generalised form of Bayesian inference, in which the posterior distribution has an additional (inverse) temperature parameter. In this note, I introduce this so-called “tempered posterior” and contrast it with the usual Bayesian posterior. I also discuss various motivations for departing from the Bayesian suggestion, including a derivation of the tempered posterior using the principle of maximum entropy.

Thanks to Edmund Lau, Daniel Murfet, and Susan Wei for helpful conversations.


§Inference, two ways

The following describes an inference problem.

We begin with a family of statistical models, $W$, with distributions $\{p_w\}_{w \in W}$. Suppose one model, $w^\star \in W$, is specially designated as a “true model,” but we don’t know which one. We are given some prior belief distribution, $\pi_0 \in \Delta(W)$, representing how much credence we should initially have in each model in the family being the true model.

Suppose further we are given a data set, $D_n = x_1, \ldots, x_n$, where each sample is drawn independently from the true distribution $p_{w^\star}$. How should we update our prior belief distribution, $\pi_0$, to form a posterior, $\pi_n$, so as to incorporate the knowledge we have gained about $w^\star$ by seeing these $n$ samples from $p_{w^\star}$?

Let’s discuss two answers.

§§The Bayesian posterior

The Bayesian posterior is one classical answer to the question of how to update our beliefs in the face of new samples. First, treat the true model $w^\star$ as a latent random variable distributed according to the prior, $\mathbb{P}(w^\star = w) = \pi_0(w)$. The true model $w^\star$ is drawn and fixed before any samples are drawn; each sample is then drawn according to the conditional distribution given by that model, $\mathbb{P}(x_i = x \mid w^\star = w) = p_w(x)$.

We can then use as our posterior $\pi_n$ the converse conditional probability distribution $\mathbb{P}(w \mid D_n)$. The laws of probability, in particular Bayes’ rule, tell us how to define this conditional probability in terms of the other values we know:
$$ \pi_n(w) = \mathbb{P}(w \mid D_n) = \frac{\mathbb{P}(w)\,\mathbb{P}(D_n \mid w)}{\mathbb{P}(D_n)} = \frac{\pi_0(w) \prod_{i=1}^{n} p_w(x_i)}{\int_W \pi_0(w') \prod_{i=1}^{n} p_{w'}(x_i)\,\mathrm{d}w'}. $$
The resulting posterior is called the Bayesian posterior.

§§The general/tempered posterior

In singular learning theory, we often have cause to consider instead the so-called general posterior at inverse temperature $\beta$, also called the tempered posterior, which is given by a different update rule:
$$ \pi_n^\beta(w) \doteq \frac{\pi_0(w) \prod_{i=1}^{n} p_w(x_i)^\beta}{\int_W \pi_0(w') \prod_{i=1}^{n} p_{w'}(x_i)^\beta\,\mathrm{d}w'}. $$

The tempered posterior $\pi_n^\beta$ differs from the Bayesian posterior $\pi_n$ by the inclusion of the inverse temperature parameter $\beta \geq 0$.
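Over a finite family of models, the integral in the denominator becomes a sum, and the update rule can be sketched in a few lines of Python. The Bernoulli grid below is a hypothetical example for illustration, not from the note:

```python
import math

def tempered_posterior(prior, log_liks, beta):
    """Tempered posterior over a finite family of models.

    prior:    prior probabilities pi_0(w), one per model.
    log_liks: total data log likelihoods sum_i ln p_w(x_i), one per model.
    beta:     inverse temperature (beta = 1 recovers the Bayesian posterior).
    """
    # unnormalised log posterior: ln pi_0(w) + beta * sum_i ln p_w(x_i)
    scores = [math.log(p) + beta * ll for p, ll in zip(prior, log_liks)]
    # normalise using log-sum-exp for numerical stability
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_z) for s in scores]

# hypothetical example: three Bernoulli(theta) models, data = 7 heads in 10 flips
thetas = [0.3, 0.5, 0.7]
prior = [1 / 3, 1 / 3, 1 / 3]
log_liks = [7 * math.log(t) + 3 * math.log(1 - t) for t in thetas]

bayes = tempered_posterior(prior, log_liks, beta=1.0)   # ordinary Bayes
gentle = tempered_posterior(prior, log_liks, beta=0.2)  # weaker update
```

At $\beta = 1$ this is exactly Bayes’ rule; smaller $\beta$ pulls the posterior back toward the prior.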

In the remainder of this note, we’ll explore the interpretation of $\beta$ and different motivations and derivations of this update rule.

§Interpreting inverse temperature

For different values of $\beta$, we recover some interesting forms of inference, which reveals the role of the inverse temperature parameter as controlling the “strength” of the update, in a similar way to the number of samples.

Of course, at inverse temperature $\beta = 1$, the update rule reduces to Bayes’ rule, and we recover Bayesian inference. But there are also other interpretable examples.

For example, at inverse temperature $\beta = 0$ (infinite “temperature”), the tempered posterior reduces to the prior:
$$ \pi_n^0(w) = \frac{\pi_0(w) \prod_{i=1}^{n} p_w(x_i)^0}{\int_W \pi_0(w') \prod_{i=1}^{n} p_{w'}(x_i)^0\,\mathrm{d}w'} = \frac{\pi_0(w)}{\int_W \pi_0(w')\,\mathrm{d}w'} = \pi_0(w). $$

More generally, if $\beta = k$ for some non-negative integer $k$, $\pi_n^\beta$ is the Bayesian posterior we would get if each data point had been independently sampled $k$ times, since the product in the likelihood can be rearranged as follows:
$$ \prod_{i=1}^{n} p_w(x_i)^k = \prod_{i=1}^{n} \prod_{j=1}^{k} p_w(x_i) = \prod_{j=1}^{k} \prod_{i=1}^{n} p_w(x_i). $$

As inverse temperature is taken to infinity (approaching zero “temperature”), the tempered posterior concentrates around $\operatorname{arg\,max}_w \prod_{i=1}^n p_w(x_i)$ (that is, maximum likelihood models): let $L_n(w) = \prod_{i=1}^n p_w(x_i)$ and $L_n^\star = \max_w L_n(w)$. Observe:
$$ \begin{align*} \pi_n^\beta(w) &= \frac{\pi_0(w)\, L_n(w)^\beta}{\int_W \pi_0(w')\, L_n(w')^\beta\,\mathrm{d}w'} \\ &= \frac{(L_n^\star)^\beta\, \pi_0(w) \left(\frac{L_n(w)}{L_n^\star}\right)^\beta}{(L_n^\star)^\beta \int_W \pi_0(w') \left(\frac{L_n(w')}{L_n^\star}\right)^\beta\,\mathrm{d}w'} \\ &= \frac{\pi_0(w) \left(\frac{L_n(w)}{L_n^\star}\right)^\beta}{\int_W \pi_0(w') \left(\frac{L_n(w')}{L_n^\star}\right)^\beta\,\mathrm{d}w'}. \end{align*} $$
If $L_n(w) < L_n^\star$, then $\left(\frac{L_n(w)}{L_n^\star}\right)^\beta \to 0$ as $\beta \to \infty$, and the numerator vanishes. Otherwise (that is, if $L_n(w) = L_n^\star$), $\left(\frac{L_n(w)}{L_n^\star}\right)^\beta = 1$ and the numerator goes to $\pi_0(w)$. (The limit of the integral in the denominator depends on the number of likelihood maximisers and their local geometry.)

The above gives rise to the interpretation of the inverse temperature parameter $\beta$ as describing how dramatically to update from the prior based on the data, from “do not update at all” ($\beta = 0$) to “update so as to only believe in the maximum likelihood models” ($\beta \to \infty$).
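These limiting behaviours can be checked numerically on a toy discrete model family (a hypothetical Bernoulli grid, assumed purely for illustration):

```python
import math

def tempered_posterior(prior, log_liks, beta):
    # tempered update on a finite model grid, via log-sum-exp
    scores = [math.log(p) + beta * ll for p, ll in zip(prior, log_liks)]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_z) for s in scores]

thetas = [0.3, 0.5, 0.7]                     # three Bernoulli models
prior = [0.5, 0.3, 0.2]
log_liks = [7 * math.log(t) + 3 * math.log(1 - t) for t in thetas]

# beta = 0: no update at all; the posterior is the prior
assert all(abs(a - b) < 1e-12
           for a, b in zip(tempered_posterior(prior, log_liks, 0.0), prior))

# beta = 2: same as Bayes on each data point independently sampled twice
doubled = [2 * ll for ll in log_liks]
assert all(abs(a - b) < 1e-12
           for a, b in zip(tempered_posterior(prior, log_liks, 2.0),
                           tempered_posterior(prior, doubled, 1.0)))

# large beta: mass concentrates on the maximum-likelihood model (theta = 0.7)
post = tempered_posterior(prior, log_liks, 200.0)
assert post[2] > 0.999
```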

§Motivating the tempered posterior

You may be wondering: why should we consider an update rule other than Bayesian updating? What use are tempered posteriors with inverse temperatures other than $\beta = 1$? Doesn’t this fly in the face of all of the philosophical arguments in favour of using the laws of probability to govern our beliefs?!

Let me discuss a few motivations for why you might want to include an inverse temperature parameter in your inference (and set it to something other than $1$ in some cases).

§§Temperature and model misspecification

First, a pragmatic justification from the perspective of a practitioner. Note that the aforementioned philosophical arguments in favour of Bayesian inference rest on assumptions. For example, they assume that the statement of the inference problem above is an accurate description of the situation we face. What if, in practice, our samples are not drawn i.i.d. from $p_{w^\star}$?

When the model is misspecified in this way, the full-strength Bayesian update can leave the posterior overconfident, and updating at a fractional rate ($\beta < 1$) can be more robust. So, in practical cases falling outside of the initial assumptions of our inference problem setting, the appropriate (even the Bayesian) thing to do might be to perform a “non-Bayesian,” fractional update on the given data.

If you have a real-world inference problem, perhaps you can even think of $\beta$ as a tunable hyper-parameter, which you can set based on what seems to get good performance or efficiency in your specific class of problems.

§§Temperature as a continuous variable

Second, a pragmatic justification from the perspective of a theoretician. Generalising the posterior with a new continuous temperature parameter can enable new kinds of mathematical analysis.

For example, the number of samples $n$ is a discrete variable, so you can’t differentiate with respect to $n$. However, $\beta$ is continuous, and plays a similar role to $n$ (as discussed above). Therefore, you can achieve a similar effect to differentiating with respect to $n$ by instead differentiating with respect to $\beta$.
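As a concrete instance of this: for the normalising constant $Z_n(\beta) = \int_W \pi_0(w) \prod_{i=1}^n p_w(x_i)^\beta\,\mathrm{d}w$, a standard identity (not stated in the note, but easy to verify) is that $\frac{\mathrm{d}}{\mathrm{d}\beta} \ln Z_n(\beta)$ equals the expected log likelihood under the tempered posterior. A numerical check on a hypothetical discrete model grid:

```python
import math

thetas = [0.3, 0.5, 0.7]
prior = [1 / 3, 1 / 3, 1 / 3]
log_liks = [7 * math.log(t) + 3 * math.log(1 - t) for t in thetas]

def log_z(beta):
    # ln Z_n(beta) = ln sum_w pi_0(w) L_n(w)^beta, via log-sum-exp
    scores = [math.log(p) + beta * ll for p, ll in zip(prior, log_liks)]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def expected_log_lik(beta):
    # E[ln L_n(w)] under the tempered posterior at inverse temperature beta
    scores = [math.log(p) + beta * ll for p, ll in zip(prior, log_liks)]
    m = max(scores)
    z = m + math.log(sum(math.exp(s - m) for s in scores))
    return sum(math.exp(s - z) * ll for s, ll in zip(scores, log_liks))

# central finite difference of ln Z_n at beta = 1 matches the expectation
beta, h = 1.0, 1e-6
fd = (log_z(beta + h) - log_z(beta - h)) / (2 * h)
```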

More generally, mathematicians sometimes find that generalising a mathematical object leads to a clearer understanding of the original object, viewed within the broader, generalised context.

§§A connection to statistical mechanics

A similar (inverse) temperature parameter arises naturally in the context of statistical mechanics.

There, one studies an object similar to the tempered posterior, called the Boltzmann distribution (or Gibbs distribution); let’s denote it $\pi_{\mathrm{B}}$. The Boltzmann distribution describes the distribution of system microstates (possible configurations) $\omega \in \Omega$ in terms of their energy levels (given by a function $E : \Omega \to \mathbb{R}$). In the discrete case, here is the formula:
$$ \pi_{\mathrm{B}}(\omega) = \frac{\exp\left(-\frac{1}{k_{\mathrm{B}}T} E(\omega)\right)}{\sum_{\omega' \in \Omega} \exp\left(-\frac{1}{k_{\mathrm{B}}T} E(\omega')\right)}. $$
Here, $T$ is the system temperature ($1/T$ is the inverse temperature), and $k_{\mathrm{B}}$ is the Boltzmann constant (which allows for a conversion between units). (For the machine-learning minded, the Boltzmann distribution is a softmax over microstate energies scaled by temperature and the Boltzmann constant.)

We can compare this to the tempered posterior equation, which is, again:
$$ \pi_n^\beta(w) = \frac{\pi_0(w) \prod_{i=1}^{n} p_w(x_i)^\beta}{\int_W \pi_0(w') \prod_{i=1}^{n} p_{w'}(x_i)^\beta\,\mathrm{d}w'}. $$
To connect these two distributions, first define an energy function as follows:
$$ E_n(w) = -\ln\left(\prod_{i=1}^n p_w(x_i)\right) - \frac{1}{\beta} \ln \pi_0(w). $$
The first term is the negative log likelihood of the data set. The second is a background potential given by the negative log density of the prior, scaled by the temperature $1/\beta$. This energy function is designed such that $\exp(-\beta E_n(w)) = \pi_0(w) \prod_{i=1}^n p_w(x_i)^\beta$, and so we can rewrite the tempered posterior in terms of energy as
$$ \pi_n^\beta(w) = \frac{\exp(-\beta E_n(w))}{\int_W \exp(-\beta E_n(w'))\,\mathrm{d}w'}. $$
This closely matches the form of the Boltzmann distribution $\pi_{\mathrm{B}}$, revealing the motivation for calling $\beta$ an (inverse) temperature parameter.

(Note that $k_{\mathrm{B}}$ is missing from this story. In the statistical mechanics case it only plays the role of converting between different units used for temperature and energy, whereas in the Bayesian case our quantities are dimensionless.)
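To see the correspondence concretely, here is a small numerical check (again on a hypothetical discrete model grid) that the two forms of the unnormalised tempered posterior agree: $\pi_0(w) \prod_i p_w(x_i)^\beta$ versus $\exp(-\beta E_n(w))$.

```python
import math

thetas = [0.3, 0.5, 0.7]
prior = [1 / 3, 1 / 3, 1 / 3]
log_liks = [7 * math.log(t) + 3 * math.log(1 - t) for t in thetas]
beta = 0.7

# likelihood form of the unnormalised posterior: pi_0(w) * L_n(w)^beta
lik_form = [p * math.exp(beta * ll) for p, ll in zip(prior, log_liks)]

# energy form: exp(-beta * E_n(w)), E_n(w) = -ln L_n(w) - (1/beta) ln pi_0(w)
energies = [-ll - math.log(p) / beta for p, ll in zip(prior, log_liks)]
energy_form = [math.exp(-beta * e) for e in energies]

assert all(abs(x - y) < 1e-9 for x, y in zip(lik_form, energy_form))
```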

§Temperature as a constraint

The above connection between statistical mechanics and inference is a perfect segue to the following, final attempt to motivate the tempered posterior. A similar analogy between statistical mechanics and Shannon’s information theory led Jaynes to develop the principle of maximum entropy, which suggests choosing a posterior probability distribution by maximising information entropy subject to a constraint that the distribution is a good fit for the observed data.

In this section, we will derive the family of tempered posterior distributions with varying inverse temperature as the solutions suggested by the principle of maximum entropy with varying strengths for the requirement that the distribution is a good fit for the data.

§§The principle of maximum entropy

The principle of maximum entropy suggests that, in the context of an inference problem, the distribution we should use to represent our updated beliefs is the one that satisfies the following:

  1. The chosen distribution should be consistent with the data. Formally, let’s say we want whatever distribution we select to be one under which the particular data we saw is highly plausible, in the sense that, in expectation over the distribution, the log likelihood of the data is sufficiently high.

  2. The chosen distribution should otherwise be maximally non-committal as to which of the models in the model class explains the data. Formally, let’s say that among all distributions that attribute a high expected likelihood to the data, the chosen distribution should be the one among them with maximal entropy.

Like the principle of Bayesian inference, the principle of maximum entropy is an attempt at making a defensible choice for how to solve an inference problem. In particular, this principle can be defended on the grounds that maximising entropy amounts to making no unnecessary assumptions beyond what is required to be consistent with the data.

§§The constrained optimisation problem

Formally, we can view condition (2) as an optimisation objective and condition (1) as a constraint. Therefore, we can formalise the choice of posterior as a constrained optimisation problem over the space of probability distributions $\Delta(W)$. Actually, it’s simpler to optimise over the space of all functions, and enforce normalisation via another constraint, as follows:

Choose $P : W \to \mathbb{R}$ so as to maximise the quantity
$$ H[P] = -\int_W P(w) \ln \frac{P(w)}{\pi_0(w)}\,\mathrm{d}w $$
subject to the constraints
$$ \begin{align*} \int_W P(w) \ln\left(\prod_{i=1}^n p_w(x_i)\right) \mathrm{d}w &= L, \\ \int_W P(w)\,\mathrm{d}w &= 1, \end{align*} $$
for some hyper-parameter $L$.

Let’s break down each element of this problem statement. The candidate $P$ ranges over arbitrary functions on $W$; the second constraint ensures the chosen $P$ is normalised, and the first constraint ensures that the expected log likelihood of the data under $P$ reaches the level $L$, per condition (1). The objective $H[P]$ is the entropy of $P$ relative to the prior $\pi_0$ (that is, the negative KL divergence from $\pi_0$), so maximising it selects the least committal distribution consistent with the constraints, per condition (2).

Now that we have formulated the problem, let’s solve it!

§§Solving for the maximum entropy posterior

The first step for solving this constrained optimisation problem is to convert it to an unconstrained optimisation problem. Formulate a Lagrangian $\mathcal{L}[P]$ using Lagrange multipliers $\lambda_\beta$ and $\lambda_Z$ to transform the constraints:
$$ \begin{align*} \mathcal{L}[P] = &-\int_W P(w) \ln \frac{P(w)}{\pi_0(w)}\,\mathrm{d}w \\ &+ \lambda_\beta \left( \int_W P(w) \ln\left(\prod_{i=1}^n p_w(x_i)\right) \mathrm{d}w - L \right) \\ &+ \lambda_Z \left( \int_W P(w)\,\mathrm{d}w - 1 \right). \end{align*} $$

Simplify the Lagrangian as follows:
$$ \mathcal{L}[P] = \int_W P(w) \left( -\ln P(w) + \ln \pi_0(w) + \lambda_\beta \sum_{i=1}^n \ln p_w(x_i) + \lambda_Z \right) \mathrm{d}w - \lambda_\beta L - \lambda_Z. $$

To find the optimal $P$, first take the functional derivative. For $w \in W$,
$$ \begin{align*} \frac{\delta}{\delta P} \mathcal{L}[P](w) &= \frac{\partial}{\partial P(w)} \left( P(w) \left( -\ln P(w) + \ln \pi_0(w) + \lambda_\beta \sum_{i=1}^n \ln p_w(x_i) + \lambda_Z \right) \right) \\ &= -1 - \ln P(w) + \ln \pi_0(w) + \lambda_\beta \sum_{i=1}^n \ln p_w(x_i) + \lambda_Z. \end{align*} $$
When $P$ solves our constrained optimisation problem we must have that $\frac{\delta}{\delta P} \mathcal{L}[P] = 0$, so we can solve for $P$ as follows:
$$ \begin{align*} 0 &= -1 - \ln P(w) + \ln \pi_0(w) + \lambda_\beta \sum_{i=1}^n \ln p_w(x_i) + \lambda_Z \\ \Rightarrow \quad \ln P(w) &= \ln\left( \frac{\pi_0(w) \prod_{i=1}^n p_w(x_i)^{\lambda_\beta}}{\exp(1 - \lambda_Z)} \right) \\ \Rightarrow \quad P(w) &= \frac{\pi_0(w) \prod_{i=1}^n p_w(x_i)^{\lambda_\beta}}{\exp(1 - \lambda_Z)}. \end{align*} $$

It is clear that this $P$ is non-negative, since $\pi_0$, $p_w$, and $\exp$ are. To make sure that $P$ is a distribution, we apply our normalisation constraint $\int_W P(w)\,\mathrm{d}w = 1$. Solving for $\lambda_Z$ leads predictably to
$$ P(w) = \frac{\pi_0(w) \prod_{i=1}^n p_w(x_i)^{\lambda_\beta}}{\int_W \pi_0(w') \prod_{i=1}^n p_{w'}(x_i)^{\lambda_\beta}\,\mathrm{d}w'}. $$
Thus we have $P \in \Delta(W)$.

As for the value of $\lambda_\beta$, this will be determined by our hyper-parameter $L$ through the likelihood constraint
$$ \int_W P(w) \ln\left(\prod_{i=1}^n p_w(x_i)\right) \mathrm{d}w = L. $$
The exact relationship between $L$ and $\lambda_\beta$ depends on the details of the inference problem, and may in general be hard to state explicitly. However, we can already see from the final form of $P$ that $\lambda_\beta$ plays the role of the inverse temperature $\beta$! Therefore, we should expect $\lambda_\beta$ to range from $0$ towards $\infty$ as $L$ varies from the expected log likelihood under the prior to the maximum log likelihood realisable in this model class.
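On a discrete model grid, the relationship between $L$ and $\lambda_\beta$ can be traced numerically: the constraint value is monotonically increasing in $\lambda_\beta$ (its derivative is a posterior variance, hence non-negative), so we can solve for $\lambda_\beta$ by bisection. A sketch, assuming a hypothetical Bernoulli grid:

```python
import math

thetas = [0.3, 0.5, 0.7]
prior = [1 / 3, 1 / 3, 1 / 3]
log_liks = [7 * math.log(t) + 3 * math.log(1 - t) for t in thetas]

def expected_log_lik(beta):
    # constraint value: E_P[ln prod_i p_w(x_i)] under the tempered posterior
    scores = [math.log(p) + beta * ll for p, ll in zip(prior, log_liks)]
    m = max(scores)
    z = m + math.log(sum(math.exp(s - m) for s in scores))
    return sum(math.exp(s - z) * ll for s, ll in zip(scores, log_liks))

def solve_lambda(target, lo=0.0, hi=1000.0, tol=1e-10):
    # bisection, using that expected_log_lik is increasing in beta
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_log_lik(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

l_prior = expected_log_lik(0.0)   # expected log likelihood under the prior
l_max = max(log_liks)             # best achievable in this model class
target = 0.5 * (l_prior + l_max)  # demand a fit halfway between the extremes
lam = solve_lambda(target)
```

As $L$ is moved from `l_prior` toward `l_max`, the recovered multiplier ranges from $0$ toward $\infty$, matching the expectation above.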

§Conclusion

We have shown that the tempered posterior arises as the posterior belief distribution chosen according to the principle of maximum entropy. The inverse temperature parameter specifies exactly how plausible, in expectation, we require the data to be under our updated beliefs. Alongside the other motivations we have considered, I hope this deepens our understanding of the tempered posterior and the inverse temperature parameter.