

Instrumental/intrinsic value ambiguity

A toy model

Friday, July 25th, 2025

In this note, I introduce a toy environment that demonstrates instrumental/intrinsic value ambiguity. An agent repeatedly faces choices between two kinds of actions: intrinsically valuable actions (which are directly rewarded), and instrumentally valuable actions (which are not directly rewarded, but enable later intrinsically valuable actions). Crucially, I show how training in a subset of environment configurations creates a situation where the agent can’t reliably distinguish between these two kinds of value.

The concept of instrumental/intrinsic value ambiguity is my attempt to isolate one class of precursors for goal misgeneralisation. This toy model is a kind of radical simplification of the situation in Matthew Barnett’s keys and chests environment, which features in prior work on goal misgeneralisation.

Contents:

  1. The apple game
  2. Formalising the apple game
  3. Parametrising the policy
  4. Evaluating all policies
  5. Solving for optimal policies
  6. Understanding the solution
  7. Value ambiguity
  8. Contextualising the model

Sections “Evaluating all policies” and “Solving for optimal policies” contain mathematical derivations and are skippable if you read “Understanding the solution”.

The apple game

[Figure: The apple game.]

Consider the environment pictured above, called the apple game. In the apple game, the agent can collect apples and subsequently consume them for reward.

  1. At each timestep, an apple appears with probability $p \in (0,1)$.
  2. If an apple is available, the agent can collect it.
  3. The agent can subsequently consume this apple for reward.
  4. If an apple appears while the agent is already holding a previously-collected apple, the agent faces a choice:
    1. It could consume the held apple, forfeiting its chance to collect the new apple.
    2. Or, it could collect the new apple, deferring the consumption of the held one.

By default, the agent receives reward only upon consuming apples, meaning consumption is intrinsically valuable, whereas collection is merely instrumentally valuable. The optimal action depends on the apple prevalence probability $p$. We’ll also consider a variant with a modified reward function that also rewards collection, in which case collection is both instrumentally and intrinsically valuable.

Formalising the apple game

Formally, the apple game is a discrete-time, infinite-horizon, exponentially discounted, fully observable Markov decision process. Actually, let’s make it a parameterised family of MDPs indexed jointly by an apple prevalence parameter $p \in (0,1)$ and a collection reward parameter $r \in [0,1]$. These parameters control the transition dynamics and reward function of the MDP.

Each MDP has six states, $\mathcal{S} = \{0,1\} \times \{0,1,2\}$, where each state is a tuple comprising (a) the number of apples available for collection at the present timestep (either $0$ or $1$), and (b) the number of previously-collected apples available for consumption at the present timestep (either $0$, $1$, or $2$).

Two actions are available to the agent, $\mathcal{A} = \{\text{collect}, \text{consume}\}$. These actions generally have the respective effects of increasing or decreasing the number of collected apples (if the appropriate numbers of apples are available for collection and/or consumption).

The initial state distribution and the transition dynamics depend on the apple prevalence parameter $p \in (0, 1)$. All episodes begin in the state $(c, 0)$, where $c = 1$ with probability $p$ and $c = 0$ otherwise. The transition function stochastically maps from state $(c, i)$ and action $a$ to state $(d, j)$ as follows. The next apple-availability component is $d = 1$ with probability $p$ and $d = 0$ otherwise, independently of everything else. The next held-apple count $j$ is determined by the action: under collect, $j = i + 1$ if $c = 1$ and $i < 2$, and $j = i$ otherwise; under consume, $j = i - 1$ if $i \geq 1$, and $j = i$ otherwise.

The reward function depends on the collection reward parameter $r \in [0,1]$. The reward for a transition from state $(c, i)$ to state $(d, j)$ with action $a$ (suppressing unused arguments) is
\begin{equation}
R_r(i, j) = \begin{cases}
r & \text{if $j > i$ (an apple was successfully collected),} \\
1 & \text{if $j < i$ (an apple was successfully consumed), or} \\
0 & \text{otherwise.}
\end{cases}
\end{equation}
When $r = 0$, collecting apples is instrumentally valuable (for reaching states from which apples can be consumed). When $r > 0$, collecting apples is also intrinsically valuable.
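To make the dynamics concrete, here is a minimal Python sketch of one environment step, matching the transition and reward rules above. The function and variable names are my own, not from any accompanying code.

```python
import random

def step(state, action, p, r):
    """One step of the apple game. state = (c, i): c apples available, i apples held."""
    c, i = state
    if action == "collect" and c == 1 and i < 2:
        j = i + 1                              # successfully collect the available apple
    elif action == "consume" and i > 0:
        j = i - 1                              # successfully consume a held apple
    else:
        j = i                                  # the action has no effect
    reward = r if j > i else (1.0 if j < i else 0.0)
    d = 1 if random.random() < p else 0        # a new apple appears with probability p
    return (d, j), reward

# Example: the interesting choice arises in state (1, 1).
next_state, reward = step((1, 1), "collect", p=0.3, r=0.0)
```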

Parametrising the policy

A policy for our family of environments is any function $\pi : \mathcal{S} \to \Delta\mathcal{A}$.

Since there are only two actions, let’s encode our action distributions by the probability of taking the collect action, making that $\pi : \mathcal{S} \to [0,1]$ instead (that is, $\pi(s)$ is the probability of taking the collect action in state $s$, and $1 - \pi(s)$ is the probability of taking the consume action in the same state).

Actually, we can simplify further. Note that in states $(0,0)$, $(0,1)$, $(0,2)$, and $(1,2)$ there is no apple available (or no room to carry the available apple). The collect action is always useless in these states, and we may as well just hard-code the consume action here, fixing $\pi(0,0) = \pi(0,1) = \pi(0,2) = \pi(1,2) = 0$. Likewise, in state $(1,0)$, there are no apples ready to consume, so we can get away with always taking the collect action, fixing $\pi(1,0) = 1$.

This leaves only state $(1,1)$, where the agent faces a choice to either take an opportunity to collect an apple, or to give up this chance in favour of immediately consuming a previously-collected apple. The right choice here depends on $p$, $r$, and $\gamma$, so let’s keep this probability variable. We can therefore parameterise our policy space with a single scalar. Abusing notation, let’s just call the scalar $\pi \in [0,1]$, representing the probability of taking the collect action in state $(1,1)$.
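In code, the reduced policy class might look something like the following sketch (again with hypothetical names), where a single scalar `pi` governs the only non-trivial state:

```python
import random

def policy_action(state, pi):
    """Scalar-parameterised policy: pi is the probability of collecting in state (1, 1)."""
    c, i = state
    if c == 1 and i == 0:
        return "collect"                                   # hard-coded: nothing to consume
    if c == 1 and i == 1:
        return "collect" if random.random() < pi else "consume"
    return "consume"                                       # collect is useless in all other states
```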

Evaluating all policies

(This section is skippable if you read “Understanding the solution”.)

Fix a discount rate $\gamma \in (0,1)$, a resource prevalence probability $p \in (0,1)$, and a collection reward $r \in [0,1]$. For policy $\pi \in [0,1]$, let $\mathcal{J}^\gamma_{p,r}(\pi)$ denote the expected return accrued by this policy in an episode of the environment with these hyperparameters. Let’s derive a closed form for $\mathcal{J}^\gamma_{p,r}(\pi)$, so that we can find the optimal $\pi \in [0,1]$ to use.

First, let $v^\pi : \mathcal{S} \to \mathbb{R}$ denote the state value function for policy $\pi$. For $i = 0, 1, 2$, define
\begin{equation*}
V_i = (1-p)\, v^\pi(0,i) + p\, v^\pi(1,i),
\end{equation*}
representing the value of a state with $i$ resources available for consumption, averaged over the chance that there will be resources available for collection in the next timestep. Note that $\mathcal{J}^\gamma_{p,r}(\pi) = V_0$.

Then by the Bellman equations and our constraints on the policy’s actions in each state, we have the following relations:
\begin{align*}
v^\pi(0,0) &= \gamma V_0 & v^\pi(1,0) &= r + \gamma V_1 \\
v^\pi(0,1) &= 1 + \gamma V_0 & v^\pi(1,1) &= \pi (r + \gamma V_2) + (1-\pi) (1 + \gamma V_0) \\
v^\pi(0,2) &= 1 + \gamma V_1 & v^\pi(1,2) &= 1 + \gamma V_1 .
\end{align*}

Combining the above with the definition of $V_i$, we can eliminate $v^\pi$, leaving the following system of equations.
\begin{align*}
V_0 &= (1-p) \gamma V_0 + p (r + \gamma V_1) \\
V_1 &= (1-p)(1+\gamma V_0) + p\pi(r+\gamma V_2) + p(1-\pi)(1+\gamma V_0) \\
V_2 &= 1 + \gamma V_1
\end{align*}

The system is linear in $V_0$, $V_1$, $V_2$, and the relevant part of the solution is
\begin{equation*}
\mathcal{J}^\gamma_{p,r}(\pi) = V_0 = \frac{(r + \gamma) p - (1 - r) (1 - \gamma) \gamma p^2 \pi}{(1 - \gamma)(1 + \gamma p) - (1 - \gamma) \gamma^2 (1 - p) p \pi} .
\end{equation*}
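As a sanity check, here is a short script (my own, not part of the post) comparing this closed form against a direct numerical solve of the linear system above:

```python
import numpy as np

def J_closed_form(pi, p, r, gamma):
    num = (r + gamma) * p - (1 - r) * (1 - gamma) * gamma * p**2 * pi
    den = (1 - gamma) * (1 + gamma * p) - (1 - gamma) * gamma**2 * (1 - p) * p * pi
    return num / den

def J_linear_solve(pi, p, r, gamma):
    # Each row rearranges one equation of the system so all V-terms sit on the left.
    A = np.array([
        [1 - (1 - p) * gamma,   -p * gamma,       0.0],
        [-(1 - p * pi) * gamma,  1.0,             -p * pi * gamma],
        [0.0,                   -gamma,            1.0],
    ])
    b = np.array([p * r, (1 - p * pi) + p * pi * r, 1.0])
    V0, V1, V2 = np.linalg.solve(A, b)
    return V0

rng = np.random.default_rng(0)
for _ in range(5):
    pi, r = rng.uniform(0, 1, size=2)
    p, gamma = rng.uniform(0.01, 0.99, size=2)
    assert np.isclose(J_closed_form(pi, p, r, gamma), J_linear_solve(pi, p, r, gamma))
```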

Solving for optimal policies

(This section is skippable if you read “Understanding the solution”.)

We now compute $\operatorname{arg\,max}_{\pi \in [0,1]} \mathcal{J}^\gamma_{p,r}(\pi)$ in terms of the values of $\gamma \in (0,1)$, $p \in (0,1)$, and $r \in [0,1]$. Note that $\mathcal{J}^\gamma_{p,r}(\pi)$ is a ratio of linear functions of $\pi$. Define
\begin{align*}
a &= (r + \gamma) p \\
b &= - (1 - r) (1 - \gamma) \gamma p^2 \\
c &= (1 - \gamma) (1 + \gamma p) \\
d &= - (1 - \gamma) \gamma^2 (1 - p) p
\end{align*}
such that $\mathcal{J}^\gamma_{p,r}(\pi) = (a + b\pi) / (c + d\pi)$. Then
\begin{equation}
\frac{\partial}{\partial\pi} \mathcal{J}^\gamma_{p,r}(\pi) = \frac{bc - ad}{(c + d\pi)^2}.
\end{equation}

Observe that the denominator in (2) is always positive given our constraints $\gamma, p \in (0,1)$, $r, \pi \in [0,1]$. Therefore $\frac{\partial}{\partial\pi} \mathcal{J}^\gamma_{p,r}(\pi)$ has the same sign as the numerator $bc - ad$, that is,
\begin{align*}
\operatorname{sign} \frac{\partial}{\partial\pi} \mathcal{J}^\gamma_{p,r}(\pi)
&= \operatorname{sign}(bc - ad) \\
&= \operatorname{sign}\left( - (1 - r) (1 - \gamma)^2 \gamma p^2 (1 + \gamma p) + (r + \gamma) (1 - \gamma) \gamma^2 (1 - p) p^2 \right) \\
&= \operatorname{sign}\left( \gamma^2 + \gamma - 1 + r - p (\gamma + \gamma^2 r) \right),
\end{align*}
where we have used that $(1-\gamma) \gamma p^2 > 0$.

Moreover, this sign is independent of $\pi \in [0,1]$. Therefore, $\mathcal{J}^\gamma_{p,r}(\pi)$ is either constant, monotonically increasing, or monotonically decreasing. We can determine the set of optimal policies in each case. Define the critical threshold
\begin{equation*}
p_{\mathrm{crit}} = \frac{\gamma^2 + \gamma - 1 + r}{\gamma + r\gamma^2}.
\end{equation*}
Then:

  1. If $p < p_{\mathrm{crit}}$, the derivative is positive, so $\mathcal{J}^\gamma_{p,r}$ is monotonically increasing in $\pi$ and the unique optimal policy is $\pi = 1$ (always collect in state $(1,1)$).
  2. If $p > p_{\mathrm{crit}}$, the derivative is negative, so $\mathcal{J}^\gamma_{p,r}$ is monotonically decreasing in $\pi$ and the unique optimal policy is $\pi = 0$ (always consume in state $(1,1)$).
  3. If $p = p_{\mathrm{crit}}$, the derivative is zero, so $\mathcal{J}^\gamma_{p,r}$ is constant and every policy $\pi \in [0,1]$ is optimal.

Understanding the solution

In summary, the key factor in determining the optimal action in state $(1,1)$ is the comparison between the resource prevalence parameter $p$ and the critical threshold
\begin{equation*}
p_{\mathrm{crit}} = \frac{\gamma^2 + \gamma - 1 + r}{\gamma + r\gamma^2}.
\end{equation*}
In particular, if $p < p_{\mathrm{crit}}$, the optimal action is to collect, and if $p > p_{\mathrm{crit}}$, the optimal action is to consume.
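The threshold is easy to play with numerically. Here is a small helper (the name is my own choice) together with the two values that come up later in this note:

```python
def p_crit(gamma, r):
    """Critical resource prevalence: collecting in state (1, 1) is optimal iff p < p_crit."""
    return (gamma**2 + gamma - 1 + r) / (gamma + r * gamma**2)

p_crit(0.8, 0.0)   # ≈ 0.55
p_crit(0.8, 1.0)   # 1.0
```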

Let’s explore this solution to understand it better, starting with our baseline scenario, where there is no collection reward, that is, $r = 0$. In this case, $p_{\mathrm{crit}}$ simplifies to
\begin{equation*}
p_{\mathrm{crit}}\big|_{r=0} = \frac{\gamma^2 + \gamma - 1}{\gamma}.
\end{equation*}

For a given discount rate, if resources are scarce enough ($p$ below the critical threshold) then the agent should defer consumption and take the rare opportunity to collect a second apple to consume later. On the other hand, if the apples are comparatively abundant ($p$ above the critical threshold), then the agent doesn’t need to worry about missing out on collecting the current apple: it might as well immediately consume the apple it already has, betting that there will be another apple soon enough.

What is the effect of introducing $r > 0$? This increases both the numerator and the denominator of the critical threshold, but the increase to the denominator is weighted by $\gamma^2$, which is smaller than $1$. The overall effect is to raise the threshold for a given discount rate $\gamma$. In some environments with abundant apples where it previously made sense to forfeit the new apple in favour of consuming the current one, the immediate reward of $r$ from collecting a new apple will be enough to offset the cost of delaying consumption, and it will become optimal to collect.

We can see this effect taken to the extreme if we put $r = 1$. Then, $p_{\mathrm{crit}}$ simplifies to
\begin{equation*}
p_{\mathrm{crit}}\big|_{r=1} = \frac{\gamma^2 + \gamma}{\gamma^2 + \gamma} = 1.
\end{equation*}
In the limiting case $p = 1$, meaning there are always apples available, all actions are optimal. Otherwise, all available apples should be collected immediately, because collecting an apple is as rewarding as consuming a previously collected apple, but has the added benefit that it facilitates later consuming both apples.

Another two corner cases to consider are when $\gamma = 0$ or $\gamma = 1$. I leave these as an exercise for the reader.

We can see these dynamics play out by visualising the critical threshold for different values of $\gamma$ and $r$. Note that $p$ is on the horizontal axis here, so the optimal action is determined by whether we are on the left or the right of the red line.
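The original figure is not reproduced here, but a rough matplotlib sketch of my own, guessing at the layout described (one panel per value of $r$, $\gamma$ on the vertical axis, and a red curve marking $p_{\mathrm{crit}}$), might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

gammas = np.linspace(0.01, 0.99, 200)
fig, axes = plt.subplots(1, 3, figsize=(9, 3), sharey=True)
for ax, r in zip(axes, [0.0, 0.5, 1.0]):
    pc = (gammas**2 + gammas - 1 + r) / (gammas + r * gammas**2)
    ax.plot(np.clip(pc, 0, 1), gammas, color="red")  # collect to the left of the line, consume to the right
    ax.set(title=f"r = {r}", xlabel="p", xlim=(0, 1), ylim=(0, 1))
axes[0].set_ylabel("gamma (discount rate)")
plt.tight_layout()
plt.show()
```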

Value ambiguity

We’re finally in a position to tie this back to instrumental/intrinsic value ambiguity.

Imagine we face a version of the apple game with $r = 0$ and $\gamma = 8/10$. In this case, note that $p_{\mathrm{crit}} = 0.55$. Therefore, if we set the resource prevalence parameter to $p < 0.55$, we can expect an agent to learn to collect in state $(1,1)$, whereas if we set the resource prevalence parameter $p > 0.55$, we can expect it to learn to consume.
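Spelling out the arithmetic behind that stated value (my own check, using the threshold formula above):
\begin{equation*}
p_{\mathrm{crit}}\big|_{r=0,\ \gamma=8/10} = \frac{(8/10)^2 + 8/10 - 1}{8/10} = \frac{0.44}{0.8} = 0.55 .
\end{equation*}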

OK, so far so good. But what if instead of specifying a fixed resource prevalence parameter pp, we want one agent to learn how to act in multiple environments with varying resource prevalences?

In this setting, the ideal agent would choose its action in state $(1,1)$ by comparing the value of the current resource prevalence parameter to the threshold $p_{\mathrm{crit}} = 0.55$. The agent needs some way to know the current value of $p$, but this isn’t a serious issue: if we have that knowledge at hand, we can give it to the agent as part of the observation; otherwise, the agent can estimate $p$ by counting apples it sees over time, and it will eventually have an accurate estimate available for decision-making. For simplicity, let’s assume the former, so the policy becomes a function $\pi : (0,1) \to [0,1]$, where the input is the resource prevalence parameter $p$ and the output is the probability of the collect action in state $(1,1)$ given that $p$.

Can we expect to train an agent with this ideal policy? It depends. It’s usually pretty easy to learn a simple function like this: it’s just a one-dimensional classifier. However, it can be hard to learn even a simple function if you don’t have the right kind of data. Imagine we train in a distribution of environments with $p$ values in the interval $(0, 0.5]$. Over that interval, the optimal action in state $(1,1)$ is always to collect, because $p < 0.55 = p_{\mathrm{crit}}$. But there are multiple valid ways to extrapolate from this interval to environments with larger $p$ values, for example:

  1. Collect when $p < 0.55$ and consume when $p > 0.55$ (the intended extrapolation).
  2. Collect when $p$ is below some higher threshold, around $0.97$, and consume above it.
  3. Collect for every $p \in (0,1)$, no matter how abundant apples are.

Whichever policy the agent would learn, the important point is that there are multiple plausible solutions, and the correct one is ambiguous from the perspective of an agent who has only seen training environments with $p \in (0, 0.5]$. At no point during training did the agent gain any information it could use to determine which of these extrapolations was intended.

Though there are many plausible generalisations to the full range of $p$ values, I chose these three as examples for a reason. If you take these three policies and ask, for the full range of resource prevalence parameters $p \in (0,1)$, for which reward function each is optimal, you get reward functions with the collection reward set to, respectively, $r = 0$, $r = 9/10$, and $r = 1$.

In the former case ($r = 0$), collecting apples has only instrumental value, whereas in the latter two cases ($r = 0.9$ and $r = 1$), collecting apples also has intrinsic value.

Hence, it’s ambiguous to the agent trained in this subset of environments whether to place merely instrumental value on collecting apples, or whether to place intrinsic value on it as well.

Contextualising the model

This toy model is as simplified as I could make it while still preserving a clear interpretation in terms of instrumental and intrinsic value. In order to capture instrumental value, it appears an environment would need at least two states or two actions, so that the agent can do one thing that is instrumental to later doing another, intrinsically valuable thing. It seems possible that you don’t need to start with a convoluted six-state set-up to get the same dynamic, though, as we discussed, the apple game really reduces to a single decision state in the end.

This is not the simplest environment in which a policy faces the challenge of generalising across a step function. For that, you could consider any one-dimensional classification problem where you only train with examples from one side of the ideal decision boundary. For a reinforcement learning version, you can translate that into an environment where the reward for taking each of two actions depends on an environment parameter, and train only in the region where the reward for the first action is higher than that of the second.
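To make that construction concrete, here is a minimal sketch under my own (arbitrary) choice of reward functions; nothing here comes from the post itself:

```python
import numpy as np

# One-step environment with a scalar parameter x in (0, 1):
# action 0 yields reward 1 - x, action 1 yields reward x,
# so the ideal decision boundary sits at x = 0.5.
rng = np.random.default_rng(0)
train_x = rng.uniform(0.0, 0.5, size=1000)            # train only on one side of the boundary
train_best_action = (train_x > 0.5).astype(int)       # identically 0 on this training range

# Any learner fit to (train_x, train_best_action) only ever sees "action 0 is best";
# its behaviour for x > 0.5 is left entirely to its inductive bias.
```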

Actually, this is also not an especially compelling example of a challenging generalisation problem. Yes, getting the inductive bias right enough to place the out-of-distribution step of a step function in the correct location looks extremely hard. However, this step function is generated by taking the max over an underlying smooth function: a learner that models the smooth returns of the two actions as functions of $p$, rather than memorising the decision boundary itself, could in principle locate the step by extrapolating those returns.

Of course, regardless of how challenging the generalisation problem really is, it might seem like we’re avoiding the obvious solution: can’t we just train in environments with $p \in (0,1)$ rather than restricting ourselves to $p \in (0, 0.5]$? That would remove the challenge of generalisation and also the value ambiguity. The toy model makes this solution look appealing, because it’s obvious that we’re avoiding training in the part of the environment distribution that conveys the key disambiguating information. Unfortunately, this can be harder in the real world:

I think it will be extremely hard, in terms of both experiment design and computational feasibility, to ensure that training covers all possible environment configurations.

You should think of this toy model as representing a small slice of such an environment configuration space. Among the many and varied other configuration parameters, we have the resource prevalence parameter $p$, which stands in for some direction along which we explore only partially, not enough to clarify whether the agent should be pursuing some sub-goal for intrinsic or instrumental reasons.