

Instrumental/intrinsic value ambiguity

A toy model

Friday, July 25th, 2025

In this note, I introduce a toy environment that demonstrates instrumental/intrinsic value ambiguity. An agent repeatedly faces choices between two kinds of actions: intrinsically valuable actions (which are directly rewarded), and instrumentally valuable actions (which are not directly rewarded, but enable later intrinsically valuable actions). Crucially, I show how training in a subset of environment configurations creates a situation where the agent can’t reliably distinguish between these two kinds of value.

The concept of instrumental/intrinsic value ambiguity is my attempt to isolate one class of precursors for goal misgeneralisation. This toy model is a kind of radical simplification of the situation in Matthew Barnett’s keys and chests environment, which features in prior work on goal misgeneralisation.

Contents:

  1. The apple game
  2. Formalising the apple game
  3. Parametrising the policy
  4. Evaluating all policies
  5. Solving for optimal policies
  6. Understanding the solution
  7. Value ambiguity
  8. Contextualising the model

Sections “Evaluating all policies” and “Solving for optimal policies” contain mathematical derivations and are skippable if you read “Understanding the solution”.

The apple game

[Figure: The apple game.]

Consider the environment pictured above, called the apple game. In the apple game, the agent can collect apples and subsequently consume them for reward.

  1. At each timestep, an apple appears with probability $p \in (0,1)$.
  2. If an apple is available, the agent can collect it.
  3. The agent can subsequently consume this apple for reward.
  4. If an apple appears while the agent is already holding a previously-collected apple, the agent faces a choice:
    1. It could consume the held apple, forfeiting its chance to collect the new apple.
    2. Or, it could collect the new apple, deferring the consumption of the held one.

By default, the agent receives reward only upon consuming apples, meaning consumption is intrinsically valuable, whereas collection is merely instrumentally valuable. The optimal action depends on the apple prevalence probability $p$. We’ll also consider a variant with a modified reward function that also rewards collection, in which case collection is both instrumentally and intrinsically valuable.

Formalising the apple game

Formally, the apple game is a discrete-time, infinite-horizon, exponentially discounted, fully observable Markov decision process. Actually, let’s make it a parameterised family of MDPs indexed jointly by an apple prevalence parameter $p \in (0,1)$ and a collection reward parameter $r \in [0,1]$. These parameters control the transition dynamics and reward function of the MDP.

Each MDP has six states, $\mathcal{S} = \{0,1\} \times \{0,1,2\}$, where each state is a tuple comprising (a) the number of apples available for collection at the present timestep (either $0$ or $1$), and (b) the number of previously-collected apples available for consumption at the present timestep (either $0$, $1$, or $2$).

Two actions are available to the agent, $\mathcal{A} = \{\text{collect}, \text{consume}\}$. These actions generally have the respective effects of increasing or decreasing the number of collected apples (if the appropriate numbers of apples are available for collection and/or consumption).

The initial state distribution and the transition dynamics depend on the apple prevalence parameter $p \in (0, 1)$. All episodes begin in the state $(c, 0)$, where $c = 1$ with probability $p$ and $c = 0$ otherwise. The transition function stochastically maps from state $(c, i)$ and action $a$ to state $(d, j)$ as follows. The next apple-availability component is $d = 1$ with probability $p$ and $d = 0$ otherwise, independently of everything else. The next held-apple count $j$ is determined by the action: under collect, $j = i + 1$ if $c = 1$ and $i < 2$, and $j = i$ otherwise; under consume, $j = i - 1$ if $i \geq 1$, and $j = i$ otherwise.

The reward function depends on the collection reward parameter $r \in [0,1]$. The reward for a transition from state $(c, i)$ to state $(d, j)$ with action $a$ (suppressing unused arguments) is
\begin{equation}
R_r(i, j) = \begin{cases}
r & \text{if $j > i$ (an apple was successfully collected),} \\
1 & \text{if $j < i$ (an apple was successfully consumed), or} \\
0 & \text{otherwise.}
\end{cases}
\end{equation}
When $r = 0$, collecting apples is instrumentally valuable (for reaching states from which apples can be consumed). When $r > 0$, collecting apples is also intrinsically valuable.
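To make the dynamics concrete, here is a minimal Python sketch of one environment step, matching the transition and reward rules above. The function and variable names are my own, not from any accompanying code.

```python
import random

def step(state, action, p, r):
    """One step of the apple game. state = (c, i): c apples available, i apples held."""
    c, i = state
    if action == "collect" and c == 1 and i < 2:
        j = i + 1                              # successfully collect the available apple
    elif action == "consume" and i > 0:
        j = i - 1                              # successfully consume a held apple
    else:
        j = i                                  # the action has no effect
    reward = r if j > i else (1.0 if j < i else 0.0)
    d = 1 if random.random() < p else 0        # a new apple appears with probability p
    return (d, j), reward

# Example: the interesting choice arises in state (1, 1).
next_state, reward = step((1, 1), "collect", p=0.3, r=0.0)
```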

Parametrising the policy

A policy for our family of environments is any function $\pi : \mathcal{S} \to \Delta\mathcal{A}$.

Since there are only two actions, let’s encode our action distributions by the probability of taking the collect action, making that $\pi : \mathcal{S} \to [0,1]$ instead (that is, $\pi(s)$ is the probability of taking the collect action in state $s$, and $1 - \pi(s)$ is the probability of taking the consume action in the same state).

Actually, we can simplify further. Note that in states $(0,0)$, $(0,1)$, $(0,2)$, and $(1,2)$ there is no apple available (or no room to carry the available apple). The collect action is always useless in these states, and we may as well just hard-code the consume action here, fixing $\pi(0,0) = \pi(0,1) = \pi(0,2) = \pi(1,2) = 0$. Likewise, in state $(1,0)$, there are no apples ready to consume, so we can get away with always taking the collect action, fixing $\pi(1,0) = 1$.

This leaves only state $(1,1)$, where the agent faces a choice to either take an opportunity to collect an apple, or to give up this chance in favour of immediately consuming a previously-collected apple. The right choice here depends on $p$, $r$, and $\gamma$, so let’s keep this probability variable. We can therefore parameterise our policy space with a single scalar. Abusing notation, let’s just call the scalar $\pi \in [0,1]$, representing the probability of taking the collect action in state $(1,1)$.
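In code, the reduced policy class might look something like the following sketch (again with hypothetical names), where a single scalar `pi` governs the only non-trivial state:

```python
import random

def policy_action(state, pi):
    """Scalar-parameterised policy: pi is the probability of collecting in state (1, 1)."""
    c, i = state
    if c == 1 and i == 0:
        return "collect"                                   # hard-coded: nothing to consume
    if c == 1 and i == 1:
        return "collect" if random.random() < pi else "consume"
    return "consume"                                       # collect is useless in all other states
```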

Evaluating all policies

(This section is skippable if you read “Understanding the solution”.)

Fix a discount rate $\gamma \in (0,1)$, a resource prevalence probability $p \in (0,1)$, and a collection reward $r \in [0,1]$. For policy $\pi \in [0,1]$, let $\mathcal{J}^\gamma_{p,r}(\pi)$ denote the expected return accrued by this policy in an episode of the environment with these hyperparameters. Let’s derive a closed form for $\mathcal{J}^\gamma_{p,r}(\pi)$, so that we can find the optimal $\pi \in [0,1]$ to use.

First, let $v^\pi : \mathcal{S} \to \mathbb{R}$ denote the state value function for policy $\pi$. For $i = 0, 1, 2$, define
\begin{equation*}
V_i = (1-p)\, v^\pi(0,i) + p\, v^\pi(1,i),
\end{equation*}
representing the value of a state with $i$ resources available for consumption, averaged over the chance that there will be resources available for collection in the next timestep. Note that $\mathcal{J}^\gamma_{p,r}(\pi) = V_0$.

Then by the Bellman equations and our constraints on the policy’s actions in each state, we have the following relations:
\begin{align*}
v^\pi(0,0) &= \gamma V_0 & v^\pi(1,0) &= r + \gamma V_1 \\
v^\pi(0,1) &= 1 + \gamma V_0 & v^\pi(1,1) &= \pi (r + \gamma V_2) + (1-\pi) (1 + \gamma V_0) \\
v^\pi(0,2) &= 1 + \gamma V_1 & v^\pi(1,2) &= 1 + \gamma V_1 .
\end{align*}

Combining the above with the definition of $V_i$, we can eliminate $v^\pi$, leaving the following system of equations.
\begin{align*}
V_0 &= (1-p) \gamma V_0 + p (r + \gamma V_1) \\
V_1 &= (1-p)(1+\gamma V_0) + p\pi(r+\gamma V_2) + p(1-\pi)(1+\gamma V_0) \\
V_2 &= 1 + \gamma V_1
\end{align*}

The system is linear in $V_0$, $V_1$, $V_2$, and the relevant part of the solution is
\begin{equation*}
\mathcal{J}^\gamma_{p,r}(\pi) = V_0 = \frac{(r + \gamma) p - (1 - r) (1 - \gamma) \gamma p^2 \pi}{(1 - \gamma)(1 + \gamma p) - (1 - \gamma) \gamma^2 (1 - p) p \pi} .
\end{equation*}
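As a sanity check, here is a short script (my own, not part of the post) comparing this closed form against a direct numerical solve of the linear system above:

```python
import numpy as np

def J_closed_form(pi, p, r, gamma):
    num = (r + gamma) * p - (1 - r) * (1 - gamma) * gamma * p**2 * pi
    den = (1 - gamma) * (1 + gamma * p) - (1 - gamma) * gamma**2 * (1 - p) * p * pi
    return num / den

def J_linear_solve(pi, p, r, gamma):
    # Each row rearranges one equation of the system so all V-terms sit on the left.
    A = np.array([
        [1 - (1 - p) * gamma,   -p * gamma,       0.0],
        [-(1 - p * pi) * gamma,  1.0,             -p * pi * gamma],
        [0.0,                   -gamma,            1.0],
    ])
    b = np.array([p * r, (1 - p * pi) + p * pi * r, 1.0])
    V0, V1, V2 = np.linalg.solve(A, b)
    return V0

rng = np.random.default_rng(0)
for _ in range(5):
    pi, r = rng.uniform(0, 1, size=2)
    p, gamma = rng.uniform(0.01, 0.99, size=2)
    assert np.isclose(J_closed_form(pi, p, r, gamma), J_linear_solve(pi, p, r, gamma))
```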

Solving for optimal policies

(This section is skippable if you read “Understanding the solution”.)

We now compute $\operatorname{arg\,max}_{\pi \in [0,1]} \mathcal{J}^\gamma_{p,r}(\pi)$ in terms of the values of $\gamma \in (0,1)$, $p \in (0,1)$, and $r \in [0,1]$. Note that $\mathcal{J}^\gamma_{p,r}(\pi)$ is a ratio of linear functions of $\pi$. Define
\begin{align*}
a &= (r + \gamma) p \\
b &= - (1 - r) (1 - \gamma) \gamma p^2 \\
c &= (1 - \gamma) (1 + \gamma p) \\
d &= - (1 - \gamma) \gamma^2 (1 - p) p
\end{align*}
such that $\mathcal{J}^\gamma_{p,r}(\pi) = (a + b\pi) / (c + d\pi)$. Then
\begin{equation}
\frac{\partial}{\partial\pi} \mathcal{J}^\gamma_{p,r}(\pi) = \frac{bc - ad}{(c + d\pi)^2}.
\end{equation}

Observe that the denominator in (2) is always positive given our constraints $\gamma, p \in (0,1)$, $r, \pi \in [0,1]$. Therefore $\frac{\partial}{\partial\pi} \mathcal{J}^\gamma_{p,r}(\pi)$ has the same sign as the numerator $bc - ad$, that is,
\begin{align*}
\operatorname{sign} \frac{\partial}{\partial\pi} \mathcal{J}^\gamma_{p,r}(\pi)
&= \operatorname{sign}(bc - ad) \\
&= \operatorname{sign}\left( - (1 - r) (1 - \gamma)^2 \gamma p^2 (1 + \gamma p) + (r + \gamma) (1 - \gamma) \gamma^2 (1 - p) p^2 \right) \\
&= \operatorname{sign}\left( \gamma^2 + \gamma - 1 + r - p (\gamma + \gamma^2 r) \right),
\end{align*}
where we have used that $(1-\gamma) \gamma p^2 > 0$.

Moreover, this sign is independent of $\pi \in [0,1]$. Therefore, $\mathcal{J}^\gamma_{p,r}(\pi)$ is either constant, monotonically increasing, or monotonically decreasing. We can determine the set of optimal policies in each case. Define the critical threshold
\begin{equation*}
p_{\mathrm{crit}} = \frac{\gamma^2 + \gamma - 1 + r}{\gamma + r\gamma^2}.
\end{equation*}
Then:

  1. If $p < p_{\mathrm{crit}}$, the derivative is positive, so $\mathcal{J}^\gamma_{p,r}$ is monotonically increasing in $\pi$ and the unique optimal policy is $\pi = 1$ (always collect in state $(1,1)$).
  2. If $p > p_{\mathrm{crit}}$, the derivative is negative, so $\mathcal{J}^\gamma_{p,r}$ is monotonically decreasing in $\pi$ and the unique optimal policy is $\pi = 0$ (always consume in state $(1,1)$).
  3. If $p = p_{\mathrm{crit}}$, the derivative is zero, so $\mathcal{J}^\gamma_{p,r}$ is constant and every policy $\pi \in [0,1]$ is optimal.

Understanding the solution

In summary, the key factor in determining the optimal action in state $(1,1)$ is the comparison between the resource prevalence parameter $p$ and the critical threshold
\begin{equation*}
p_{\mathrm{crit}} = \frac{\gamma^2 + \gamma - 1 + r}{\gamma + r\gamma^2}.
\end{equation*}
In particular, if $p < p_{\mathrm{crit}}$, the optimal action is to collect, and if $p > p_{\mathrm{crit}}$, the optimal action is to consume.
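The threshold is easy to play with numerically. Here is a small helper (the name is my own choice) together with the two values that come up later in this note:

```python
def p_crit(gamma, r):
    """Critical resource prevalence: collecting in state (1, 1) is optimal iff p < p_crit."""
    return (gamma**2 + gamma - 1 + r) / (gamma + r * gamma**2)

p_crit(0.8, 0.0)   # ≈ 0.55
p_crit(0.8, 1.0)   # 1.0
```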

Let’s explore this solution to understand it better, starting with our baseline scenario, where there is no collection reward, that is, $r = 0$. In this case, $p_{\mathrm{crit}}$ simplifies to
\begin{equation*}
p_{\mathrm{crit}}\big|_{r=0} = \frac{\gamma^2 + \gamma - 1}{\gamma}.
\end{equation*}

For a given discount rate, if resources are scarce enough ($p$ below the critical threshold) then the agent should defer consumption and take the rare opportunity to collect a second apple to consume later. On the other hand, if the apples are comparatively abundant ($p$ above the critical threshold), then the agent doesn’t need to worry about missing out on collecting the current apple: it might as well immediately consume the apple it already has, betting that there will be another apple soon enough.

What is the effect of introducing $r > 0$? This increases both the numerator and the denominator of the critical threshold, but the increase to the denominator is weighted by $\gamma^2$, which is smaller than $1$. The overall effect is to raise the threshold for a given discount rate $\gamma$. In some environments with abundant apples where it previously made sense to forfeit the new apple in favour of consuming the current one, the immediate reward of $r$ from collecting a new apple will be enough to offset the cost of delaying consumption, and it will become optimal to collect.

We can see this effect taken to the extreme if we put $r = 1$. Then, $p_{\mathrm{crit}}$ simplifies to
\begin{equation*}
p_{\mathrm{crit}}\big|_{r=1} = \frac{\gamma^2 + \gamma}{\gamma^2 + \gamma} = 1.
\end{equation*}
In the limiting case $p = 1$, meaning there are always apples available, all actions are optimal. Otherwise, all available apples should be collected immediately, because collecting an apple is as rewarding as consuming a previously collected apple, but has the added benefit that it facilitates later consuming both apples.

Another two corner cases to consider are when $\gamma = 0$ or $\gamma = 1$. I leave these as an exercise for the reader.

We can see these dynamics play out by visualising the critical threshold for different values of $\gamma$ and $r$. Note that $p$ is on the horizontal axis here, so the optimal action is determined by whether we are on the left or the right of the red line.
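The original figure is not reproduced here, but a rough matplotlib sketch of my own, guessing at the layout described (one panel per value of $r$, $\gamma$ on the vertical axis, and a red curve marking $p_{\mathrm{crit}}$), might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

gammas = np.linspace(0.01, 0.99, 200)
fig, axes = plt.subplots(1, 3, figsize=(9, 3), sharey=True)
for ax, r in zip(axes, [0.0, 0.5, 1.0]):
    pc = (gammas**2 + gammas - 1 + r) / (gammas + r * gammas**2)
    ax.plot(np.clip(pc, 0, 1), gammas, color="red")  # collect to the left of the line, consume to the right
    ax.set(title=f"r = {r}", xlabel="p", xlim=(0, 1), ylim=(0, 1))
axes[0].set_ylabel("gamma (discount rate)")
plt.tight_layout()
plt.show()
```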

Value ambiguity

We’re finally in a position to tie this back to instrumental/intrinsic value ambiguity.

Imagine we face a version of the apple game with $r = 0$ and $\gamma = 8/10$. In this case, note that $p_{\mathrm{crit}} = 0.55$. Therefore, if we set the resource prevalence parameter to $p < 0.55$, we can expect an agent to learn to collect in state $(1,1)$, whereas if we set the resource prevalence parameter $p > 0.55$, we can expect it to learn to consume.
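Spelling out the arithmetic behind that stated value (my own check, using the threshold formula above):
\begin{equation*}
p_{\mathrm{crit}}\big|_{r=0,\ \gamma=8/10} = \frac{(8/10)^2 + 8/10 - 1}{8/10} = \frac{0.44}{0.8} = 0.55 .
\end{equation*}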

OK, so far so good. But what if instead of specifying a fixed resource prevalence parameter pp, we want one agent to learn how to act in multiple environments with varying resource prevalences?

In this setting, the ideal agent would choose its action in state $(1,1)$ by comparing the value of the current resource prevalence parameter to the threshold $p_{\mathrm{crit}} = 0.55$. The agent needs some way to know the current value of $p$, but this isn’t a serious issue: if we have that knowledge at hand, we can give it to the agent as part of the observation; otherwise, the agent can estimate $p$ by counting apples it sees over time, and it will eventually have an accurate estimate available for decision-making. For simplicity, let’s assume the former, so the policy becomes a function $\pi : (0,1) \to [0,1]$, where the input is the resource prevalence parameter $p$ and the output is the probability of the collect action in state $(1,1)$ given that $p$.

Can we expect to train an agent with this ideal policy? It depends. It’s usually pretty easy to learn a simple function like this: it’s just a one-dimensional classifier. However, it can be hard to learn even a simple function if you don’t have the right kind of data. Imagine we train in a distribution of environments with $p$ values in the interval $(0, 0.5]$. Over that interval, the optimal action in state $(1,1)$ is always to collect, because $p < 0.55 = p_{\mathrm{crit}}$. But there are multiple valid ways to extrapolate from this interval to environments with larger $p$ values, for example:

  1. Collect when $p < 0.55$ and consume when $p > 0.55$ (the intended extrapolation).
  2. Collect when $p$ is below some higher threshold, around $0.97$, and consume above it.
  3. Collect for every $p \in (0,1)$, no matter how abundant apples are.

Whichever policy the agent would learn, the important point is that there are multiple plausible solutions, and the correct one is ambiguous from the perspective of an agent who has only seen training environments with $p \in (0, 0.5]$. At no point during training did the agent gain any information it could use to determine which of these extrapolations was intended.

Though there are many plausible generalisations to the full range of $p$ values, I chose these three as examples for a reason. If you take these three policies and ask, for the full range of resource prevalence parameters $p \in (0,1)$, for which reward function each is optimal, you get reward functions with the collection reward set to, respectively, $r = 0$, $r = 9/10$, and $r = 1$.

In the former case ($r = 0$), collecting apples has only instrumental value, whereas in the latter two cases ($r = 0.9$ and $r = 1$), collecting apples also has intrinsic value.

Hence, it’s ambiguous to the agent trained in this subset of environments whether to place merely instrumental value on collecting apples, or whether to place intrinsic value on it as well.

Contextualising the model

This toy model is as simplified as I could make it while still preserving a clear interpretation in terms of instrumental and intrinsic value. In order to capture instrumental value, it appears an environment would need at least two states or two actions, so that the agent can do one thing that is instrumental to later doing another, intrinsically valuable thing. It seems possible that you don’t need to start with a convoluted six-state set-up to get the same dynamic, though, as we discussed, the apple game really reduces to a single decision state in the end.

This is not the simplest environment in which a policy faces the challenge of generalising across a step function. For that, you could consider any one-dimensional classification problem where you only train with examples from one side of the ideal decision boundary. For a reinforcement learning version, you can translate that into an environment where the reward for taking each of two actions depends on an environment parameter, and train only in the region where the reward for the first action is higher than that of the second.
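To make that construction concrete, here is a minimal sketch under my own (arbitrary) choice of reward functions; nothing here comes from the post itself:

```python
import numpy as np

# One-step environment with a scalar parameter x in (0, 1):
# action 0 yields reward 1 - x, action 1 yields reward x,
# so the ideal decision boundary sits at x = 0.5.
rng = np.random.default_rng(0)
train_x = rng.uniform(0.0, 0.5, size=1000)            # train only on one side of the boundary
train_best_action = (train_x > 0.5).astype(int)       # identically 0 on this training range

# Any learner fit to (train_x, train_best_action) only ever sees "action 0 is best";
# its behaviour for x > 0.5 is left entirely to its inductive bias.
```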

Actually, this is also not an especially compelling example of a challenging generalisation problem. Yes, getting the inductive bias right enough to place the out-of-distribution step of a step function in the correct location looks extremely hard. However, this step function is generated by taking the max over an underlying smooth function: a learner that models the smooth returns of the two actions as functions of $p$, rather than memorising the decision boundary itself, could in principle locate the step by extrapolating those returns.

Of course, regardless of how challenging the generalisation problem really is, it might seem like we’re avoiding the obvious solution: can’t we just train in environments with $p \in (0,1)$ rather than restricting ourselves to $p \in (0, 0.5]$? That would remove the challenge of generalisation and also the value ambiguity. The toy model makes this solution look appealing, because it’s obvious that we’re avoiding training in the part of the environment distribution that conveys the key disambiguating information. Unfortunately, this can be harder in the real world:

I think it will be extremely hard, in terms of both experiment design and computational feasibility, to ensure that training covers all possible environment configurations.

You should think of this toy model as representing a small slice of such an environment configuration space. Among the many and varied other configuration parameters, we have the resource prevalence parameter $p$, which stands in for some direction along which we explore only partially, not enough to clarify whether the agent should be pursuing some sub-goal for intrinsic or instrumental reasons.