Instrumental/intrinsic value ambiguity
A toy model
Friday, July 25th, 2025
In this note, I introduce a toy environment that demonstrates instrumental/intrinsic value ambiguity. An agent repeatedly faces choices between two kinds of actions: intrinsically valuable actions (which are directly rewarded), and instrumentally valuable actions (which are not directly rewarded, but enable later intrinsically valuable actions). Crucially, I show how training in a subset of environment configurations creates a situation where the agent can’t reliably distinguish between these two kinds of value.
The concept of instrumental/intrinsic value ambiguity is my attempt to isolate one class of precursors for goal misgeneralisation. This toy model is a kind of radical simplification of the situation in Matthew Barnett’s keys and chests environment, which features in prior work on goal misgeneralisation.
Contents:
- The apple game
- Formalising the apple game
- Parametrising the policy
- Evaluating all policies
- Solving for optimal policies
- Understanding the solution
- Value ambiguity
- Contextualising the model
Sections “Evaluating all policies” and “Solving for optimal policies” contain mathematical derivations and are skippable if you read “Understanding the solution”.
The apple game
Consider the environment pictured above, called the apple game. In the apple game, the agent can collect apples and subsequently consume them for reward.
- At each timestep, an apple appears with probability $p$.
- If an apple is available, the agent can *collect* it.
- The agent can subsequently *consume* this apple for reward.
- If an apple appears while the agent is already holding a previously-collected apple, the agent faces a choice:
  - It could *consume* the held apple, forfeiting its chance to collect the new apple.
  - Or, it could *collect* the new apple, deferring the consumption of the held one.
By default, the agent receives reward only upon consuming apples, meaning consumption is intrinsically valuable, whereas collection is merely instrumentally valuable. The optimal action depends on the apple prevalence probability $p$. We’ll also consider a variant with a modified reward function that also rewards collection, in which case collection is both instrumentally and intrinsically valuable.
Formalising the apple game
Formally, the apple game is a discrete-time, infinite-horizon, exponentially discounted, fully observable Markov decision process. Actually, let’s make it a parameterised family of MDPs indexed jointly by an apple prevalence parameter $p \in [0, 1]$ and a collection reward parameter $c \in [0, 1]$. These parameters control the transition dynamics and reward function of the MDP.
Each MDP has six states, $\mathcal{S} = \{0, 1\} \times \{0, 1, 2\}$, where each state $(a, n) \in \mathcal{S}$ is a tuple comprising (a) the number of apples available for collection at the present timestep, $a$ (either $0$ or $1$), and (b) the number of previously-collected apples available for consumption at the present timestep, $n$ (either $0$, $1$, or $2$).
Two actions are available to the agent, *collect* and *consume*. These actions generally have the respective effects of increasing or decreasing the number of collected apples (if the appropriate numbers of apples are available for collection and/or consumption).
The initial state distribution and the transition dynamics depend on the apple prevalence parameter $p$. All episodes begin in a state of the form $(a, 0)$, where $a = 1$ with probability $p$ and $a = 0$ otherwise. The transition function stochastically maps from a state $(a, n)$ and an action $x$ to a next state $(a', n')$ as follows.

- The number of apples available in the next state, $a'$, is $1$ with probability $p$ and $0$ otherwise.
- If $x$ is *collect*, $a = 1$ (an apple is available for collection), and $n < 2$ (there is room for the apple to be collected), then $n' = n + 1$ (the apple is successfully collected).
- If $x$ is *consume* and $n \geq 1$ (an apple is available for consumption), then $n' = n - 1$ (the apple is successfully consumed).
- If neither of the previous conditions is met, then $n' = n$.
The reward function depends on the collection reward parameter $c$. Define the reward for a transition from state $(a, n)$ to state $(a', n')$ with action $x$ (suppressing unused arguments) as
$$ R((a, n), x) = \begin{cases} 1 & \text{if } x = \text{consume} \text{ and } n \geq 1, \\ c & \text{if } x = \text{collect},\ a = 1, \text{ and } n < 2, \\ 0 & \text{otherwise.} \end{cases} $$
When $c = 0$, collecting apples is merely instrumentally valuable (for reaching states from which apples can be consumed). When $c > 0$, collecting apples is also intrinsically valuable.
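As a concrete reference, here is a minimal sketch of these dynamics in Python. The function names, and the representation of states as `(available, held)` tuples, are my own illustrative choices rather than anything from an existing codebase.

```python
import random

COLLECT, CONSUME = "collect", "consume"

def initial_state(p):
    """Sample an initial state: nothing held, an apple available with probability p."""
    return (1 if random.random() < p else 0, 0)

def transition(state, action, p):
    """Sample the next state (a', n') given the current state (a, n) and an action."""
    a, n = state
    if action == COLLECT and a == 1 and n < 2:
        n_next = n + 1                        # the available apple is collected
    elif action == CONSUME and n >= 1:
        n_next = n - 1                        # a held apple is consumed
    else:
        n_next = n                            # nothing changes
    a_next = 1 if random.random() < p else 0  # a fresh apple appears with probability p
    return (a_next, n_next)

def reward(state, action, c):
    """Reward 1 for a successful consumption, c for a successful collection, 0 otherwise."""
    a, n = state
    if action == CONSUME and n >= 1:
        return 1.0
    if action == COLLECT and a == 1 and n < 2:
        return c
    return 0.0
```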
Parametrising the policy
A policy for our family of environments is any function $\pi : \mathcal{S} \to \Delta(\{\text{collect}, \text{consume}\})$, mapping states to distributions over actions. Since there are only two actions, let’s encode our action distributions by the probability of taking the *collect* action, making that $\pi : \mathcal{S} \to [0, 1]$ instead (that is, $\pi(s)$ is the probability of taking the *collect* action in state $s$, and $1 - \pi(s)$ is the probability of taking the *consume* action in the same state).

Actually, we can simplify further. Note that in states $(0, 0)$, $(0, 1)$, $(0, 2)$, and $(1, 2)$ there is no apple available (or no room to carry the available apple). The *collect* action is always useless in these states, and we may as well just hard-code the *consume* action here, fixing $\pi(s) = 0$. Likewise, in state $(1, 0)$, there are no apples ready to *consume*, so we can get away with always taking the *collect* action, fixing $\pi((1, 0)) = 1$.

This leaves only state $(1, 1)$, where the agent faces a choice to either take an opportunity to collect an apple, or to give up this chance in favour of immediately consuming a previously-collected apple. The right choice here depends on $p$, $c$, and the discount rate, so let’s keep this probability variable. We can therefore parameterise our policy space with a single scalar. Abusing notation, let’s just call the scalar $\pi \in [0, 1]$, representing the probability of taking the *collect* action in state $(1, 1)$.
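As a quick check on this parametrisation, here is a sketch of the full policy as a function of the single scalar, together with a crude Monte Carlo estimate of its discounted return (the helper names and truncation settings are my own illustrative choices):

```python
import random

def collect_probability(state, pi):
    """Probability of the collect action in each state, for policy parameter pi."""
    a, n = state
    if a == 0 or n == 2:
        return 0.0   # nothing to collect (or no room): hard-code consume
    if n == 0:
        return 1.0   # nothing to consume: hard-code collect
    return pi        # state (1, 1): the single free parameter

def estimate_return(pi, p, c, gamma, steps=2000, episodes=200):
    """Monte Carlo estimate of the expected discounted return under policy pi."""
    total = 0.0
    for _ in range(episodes):
        a, n = (1 if random.random() < p else 0), 0
        for t in range(steps):
            x = "collect" if random.random() < collect_probability((a, n), pi) else "consume"
            if x == "consume" and n >= 1:
                total += gamma**t * 1.0
                n -= 1
            elif x == "collect" and a == 1 and n < 2:
                total += gamma**t * c
                n += 1
            a = 1 if random.random() < p else 0
    return total / episodes

print(estimate_return(pi=1.0, p=0.5, c=0.0, gamma=0.9))  # roughly 3.4 (compare the closed form below)
```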
Evaluating all policies
(This section is skippable if you read “Understanding the solution”.)
Fix a discount rate $\gamma \in (0, 1)$, a resource prevalence probability $p$, and a collection reward $c$. For policy $\pi \in [0, 1]$, let $G(\pi)$ denote the expected return accrued by this policy in an episode of the environment with these hyperparameters. Let’s derive a closed form for $G(\pi)$, so that we can find the optimal $\pi$ to use.

First, let $V^\pi$ denote the state value function for policy $\pi$. For $n \in \{0, 1, 2\}$, define
$$ W_n = (1 - p)\, V^\pi((0, n)) + p\, V^\pi((1, n)), \tag{1} $$
representing the value of a state with $n$ resources available for consumption, averaged over the chance that there will be resources available for collection in the next timestep. Note that $G(\pi) = W_0$.

Then by the Bellman equations and our constraints on the policy’s actions in each state, we have the following relations:
$$ \begin{aligned} V^\pi((0, 0)) &= \gamma W_0, & V^\pi((1, 0)) &= c + \gamma W_1, \\ V^\pi((0, 1)) &= 1 + \gamma W_0, & V^\pi((1, 1)) &= \pi\,(c + \gamma W_2) + (1 - \pi)\,(1 + \gamma W_0), \\ V^\pi((0, 2)) &= 1 + \gamma W_1, & V^\pi((1, 2)) &= 1 + \gamma W_1. \end{aligned} $$

Combining the above with equation (1), we can eliminate $V^\pi$, leaving the following system of equations:
$$ \begin{aligned} W_0 &= (1 - p)\,\gamma W_0 + p\,(c + \gamma W_1), \\ W_1 &= (1 - p)\,(1 + \gamma W_0) + p\,\big[\pi\,(c + \gamma W_2) + (1 - \pi)\,(1 + \gamma W_0)\big], \\ W_2 &= 1 + \gamma W_1. \end{aligned} $$

The system is linear in $W_0$, $W_1$, $W_2$, and the relevant part of the solution is
$$ G(\pi) = W_0 = \frac{p\,(c + \gamma) + p^2 \gamma (1 - \gamma)(c - 1)\,\pi}{(1 - \gamma)\,\big(1 + p\gamma - p\gamma^2(1 - p)\,\pi\big)}. \tag{2} $$
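The algebra here is easy to get wrong, so here is a sketch that solves the same three-equation linear system numerically and compares the result with the closed form in (2); the variable names are mine.

```python
import numpy as np

def G_closed_form(pi, p, c, gamma):
    """Expected return W_0 according to the closed-form solution (2)."""
    num = p * (c + gamma) + p**2 * gamma * (1 - gamma) * (c - 1) * pi
    den = (1 - gamma) * (1 + p * gamma - p * gamma**2 * (1 - p) * pi)
    return num / den

def G_linear_solve(pi, p, c, gamma):
    """Expected return W_0 from solving the linear system in (W_0, W_1, W_2) directly."""
    # W_0 = (1 - p) * gamma * W_0 + p * (c + gamma * W_1)
    # W_1 = (1 - p*pi) * (1 + gamma * W_0) + p*pi * (c + gamma * W_2)   (simplified form)
    # W_2 = 1 + gamma * W_1
    A = np.array([
        [1 - (1 - p) * gamma,    -p * gamma,      0.0            ],
        [-(1 - p * pi) * gamma,   1.0,            -p * pi * gamma ],
        [0.0,                    -gamma,           1.0            ],
    ])
    b = np.array([p * c, (1 - p * pi) + p * pi * c, 1.0])
    return np.linalg.solve(A, b)[0]

print(G_closed_form(pi=0.7, p=0.5, c=0.0, gamma=0.9))   # ~3.32
print(G_linear_solve(pi=0.7, p=0.5, c=0.0, gamma=0.9))  # should agree
```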
Solving for optimal policies
(This section is skippable if you read “Understanding the solution”.)
We now compute $\frac{dG}{d\pi}$ in terms of the values of $\gamma$, $p$, and $c$. Note that $G(\pi)$ is a ratio of linear functions of $\pi$. Define $a$, $b$, $e$, $f$ such that $G(\pi) = \frac{a\pi + b}{e\pi + f}$. Then
$$ \frac{dG}{d\pi}(\pi) = \frac{af - be}{(e\pi + f)^2}. $$

Observe that the denominator in (2) is always positive given our constraints $0 < \gamma < 1$ and $0 \leq p \leq 1$. Therefore $\frac{dG}{d\pi}$ has the same sign as the numerator
$$ af - be = p^2\gamma(1 - \gamma)\,\big( c + \gamma + \gamma^2 - 1 - p\gamma(1 + c\gamma) \big), $$
where we have used the values of $a$, $b$, $e$, $f$ read off from (2). Since the leading factor $p^2\gamma(1 - \gamma)$ is non-negative, the sign is determined by the final factor.

Moreover, this sign is independent of $\pi$. Therefore, $G$ is either constant, monotonically increasing, or monotonically decreasing in $\pi$. We can determine the set of optimal policies in each case. Define the critical threshold
$$ p^\ast = \frac{\gamma + \gamma^2 + c - 1}{\gamma\,(1 + c\gamma)}. $$
Then:
- If $p = p^\ast$, then $\frac{dG}{d\pi} = 0$ for all $\pi \in [0, 1]$, and the value is constant as a function of $\pi$. That is, either action, or any mixture of them, is equally optimal in state $(1, 1)$.
- If $p < p^\ast$, then $\frac{dG}{d\pi} > 0$ for all $\pi \in [0, 1]$, and the endpoint $\pi = 1$ is optimal. That is, the optimal action in state $(1, 1)$ is *collect*.
- If $p > p^\ast$, then $\frac{dG}{d\pi} < 0$ for all $\pi \in [0, 1]$, and the endpoint $\pi = 0$ is optimal. That is, the optimal action in state $(1, 1)$ is *consume*.
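A sketch of the resulting decision rule (the helper names are mine; the printed numbers assume, for illustration, $\gamma = 0.9$ and $c = 0$):

```python
def critical_threshold(c, gamma):
    """Critical resource prevalence p*: collect is optimal in state (1, 1) below it."""
    return (gamma + gamma**2 + c - 1) / (gamma * (1 + c * gamma))

def optimal_pi(p, c, gamma):
    """Optimal probability of collect in state (1, 1); any mixture is optimal at p = p*."""
    p_star = critical_threshold(c, gamma)
    if p < p_star:
        return 1.0   # collect
    if p > p_star:
        return 0.0   # consume
    return 0.5       # indifferent

print(critical_threshold(c=0.0, gamma=0.9))  # ~0.789
print(optimal_pi(p=0.5, c=0.0, gamma=0.9))   # 1.0, i.e. collect
```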
Understanding the solution
In summary, the key factor in determining the optimal action in state $(1, 1)$ is the comparison between the resource prevalence parameter $p$ and the critical threshold
$$ p^\ast = \frac{\gamma + \gamma^2 + c - 1}{\gamma\,(1 + c\gamma)}. $$
In particular, if $p < p^\ast$, the optimal action is to *collect*, and if $p > p^\ast$, the optimal action is to *consume*.
Let’s explore this solution to understand it better, starting with our baseline scenario, where there is no collection reward, that is, $c = 0$. In this case, $p^\ast$ simplifies to
$$ p^\ast = \frac{\gamma + \gamma^2 - 1}{\gamma}. $$
For a given discount rate, if resources are scarce enough ($p$ below the critical threshold) then the agent should defer consumption and take the rare opportunity to *collect* a second apple to consume later. On the other hand, if the apples are comparatively abundant ($p$ above the critical threshold), then the agent doesn’t need to worry about missing out on collecting the current apple. It might as well immediately *consume* the apple it already has, betting that there will be another apple soon enough.
What is the effect of introducing $c > 0$? This increases the numerator and the denominator in the critical threshold, but the increase to the denominator is weighted by $\gamma^2$, (usually) smaller than 1. The overall effect is to raise the threshold for a given discount rate $\gamma$. In some environments with abundant apples where it previously made sense to forfeit the new apple in favour of consuming the current one, the immediate reward of $c$ from collecting a new apple will be enough to offset the cost of delaying consumption, and it will become optimal to *collect*.
We can see this effect taken to the extreme if we put $c = 1$. Then, $p^\ast$ simplifies to
$$ p^\ast = \frac{\gamma + \gamma^2}{\gamma\,(1 + \gamma)} = 1. $$
If the resource prevalence $p = 1$, meaning there are always apples available, all actions are optimal. Otherwise, all available apples should be collected immediately, because collecting an apple is as rewarding as consuming a previously-collected apple, but has the added benefit that it facilitates later consuming both apples.
Another couple of corner cases to consider are the extreme values of the discount rate $\gamma$ and of the resource prevalence $p$. I leave these as an exercise for the reader.
We can see these dynamics play out by visualising the critical threshold for different values of $\gamma$ and $c$. Note that $p$ is on the horizontal axis here, so the optimal action is determined by whether we are on the left or the right of the red line.
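To complement the plot, here is a small sketch that tabulates the threshold over a grid of $\gamma$ and $c$ values (negative entries just mean that *consume* is optimal for every $p$):

```python
def critical_threshold(c, gamma):
    """Critical resource prevalence p* = (gamma + gamma^2 + c - 1) / (gamma * (1 + c * gamma))."""
    return (gamma + gamma**2 + c - 1) / (gamma * (1 + c * gamma))

# Note: the c = 1.0 column is identically 1, matching the extreme case above.
print("gamma    c=0.0    c=0.5    c=1.0")
for gamma in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    row = "  ".join(f"{critical_threshold(c, gamma):7.3f}" for c in (0.0, 0.5, 1.0))
    print(f"{gamma:5.2f}  {row}")
```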
Value ambiguity
We’re finally in a position to tie this back to instrumental/intrinsic value ambiguity.
Imagine we face a version of the apple game with $c = 0$ and a discount rate $\gamma$ high enough that $\gamma + \gamma^2 > 1$. In this case, note $0 < p^\ast < 1$. Therefore, if we set the resource prevalence parameter to some $p < p^\ast$, we can expect an agent to learn to *collect* in state $(1, 1)$, whereas if we set the resource prevalence parameter to some $p > p^\ast$, we can expect it to learn to *consume*.
OK, so far so good. But what if instead of specifying a fixed resource prevalence parameter $p$, we want one agent to learn how to act in multiple environments with varying resource prevalences?
In this setting, the ideal agent would choose its action in state $(1, 1)$ by comparing the value of the current resource prevalence parameter to the threshold $p^\ast$. The agent needs some way to know the current value of $p$, but this isn’t a serious issue: if we have that knowledge at hand, we can give it to the agent as part of the observation; otherwise, it can estimate $p$ by counting apples it sees over time, and it will eventually have an accurate estimate available for decision-making. For simplicity, let’s assume the former, so the policy becomes a function $\pi : [0, 1] \to [0, 1]$, where the input is the resource prevalence parameter $p$ and the output is the probability of the *collect* action in state $(1, 1)$, given that the environment has that resource prevalence.
Can we expect to train an agent with this ideal policy? It depends. It’s usually pretty easy to learn a simple function like this; after all, it’s just a one-dimensional classifier. However, it can be hard to learn even a simple function if you don’t have the right kind of data. Imagine we train on a distribution of environments with $p$ values in the interval $[0, p^\ast)$. Over that interval, the optimal action in state $(1, 1)$ is always to *collect*, because $p < p^\ast$. But there are multiple valid ways to extrapolate from this interval to environments with larger $p$ values (the three candidates below are also sketched in code after the list):
- One option is the intended policy, $\pi(p) = [p < p^\ast]$ (where $[\cdot]$ is an Iverson bracket). We might learn this policy, if the right inductive biases happen to be in play.
- But how confident are we that our inductive biases will put the threshold in exactly the right spot? If we’re going to learn a step function, why not $\pi(p) = [p < q]$ for some other threshold $q > p^\ast$? Unfortunately, if we put this policy in an environment with $p$ between $p^\ast$ and $q$, it would *collect* when it would be better to forfeit the apple and *consume*, unless $q$ was very close to $p^\ast$.
- More likely, perhaps, we might learn the constant function $\pi(p) = 1$: always *collect*. In most cases, a constant is simpler than a step function. And this constant function works perfectly in all of the training environments. Yet this policy would always *collect* apples in front of it, even if they were guaranteed to be replenished and it could spend that time consuming.
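To make the three candidate extrapolations concrete, here they are as simple Python functions of the observed resource prevalence. The specific numbers for `p_star` and `q` are illustrative only (roughly the threshold for $\gamma = 0.9$, $c = 0$, and an arbitrary $q > p^\ast$).

```python
def intended_policy(p, p_star):
    """Step at the true critical threshold: collect iff p < p*."""
    return 1.0 if p < p_star else 0.0

def shifted_step_policy(p, q):
    """Step at some other threshold q > p*: fits the training data equally well."""
    return 1.0 if p < q else 0.0

def always_collect_policy(p):
    """Constant policy: also fits the training data perfectly."""
    return 1.0

# All three candidates agree on every training environment (p < p*),
# and only come apart on environments with larger p.
p_star, q = 0.789, 0.9
for p in (0.3, 0.6, 0.85, 0.95):
    print(p, intended_policy(p, p_star), shifted_step_policy(p, q), always_collect_policy(p))
```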
Whichever policy the agent would learn, the important point is that there are multiple plausible solutions, and the correct one is ambiguous from the perspective of an agent that has only seen training environments with $p < p^\ast$. At no point during training did the agent gain any information it could use to distinguish which of these extrapolations was intended.
Though there are many plausible generalisations to the full range of $p$ values, I chose these three as examples for a reason. If you take these three policies and ask, for the full range of resource prevalence parameters $p \in [0, 1]$, for which reward function each of them is optimal, you get reward functions with the collection reward set to, respectively, $c = 0$, an intermediate value $0 < c < 1$ (the value for which the critical threshold equals $q$), and $c = 1$.
In the former case ($c = 0$), collecting apples has only instrumental value, whereas in the latter two cases ($c > 0$), collecting apples has intrinsic value.
Hence, it’s ambiguous to the agent trained in the restricted subset of environments whether to place merely instrumental value on collecting apples, or whether to place intrinsic value on it as well.
Contextualising the model
This toy model is as simplified as I could make it while still preserving a clear interpretation in terms of instrumental and intrinsic value. In order to capture instrumental value, it appears an environment would need at least two states or two actions, so that the agent can do one thing that is instrumental to later doing another, intrinsically valuable thing. It seems possible that you don’t need to start with a convoluted 6-state set-up to get the same dynamic, but as we discussed, in the end the apple game is really a 1-state environment.
This is not the simplest environment in which a policy faces the challenge of generalising across a step function. For that, you could consider any one-dimensional classification problem where you only train with examples from one side of the ideal decision boundary. For a reinforcement learning version, you can translate that into an environment where the reward for taking each of two actions depends on an environment parameter, and train only in configurations where the reward for the first action is higher than that of the second.
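Here is a minimal sketch of that even simpler set-up, entirely of my own construction for illustration: a two-action contextual bandit whose per-action rewards vary smoothly with a scalar environment parameter, trained only on the side of the decision boundary where the first action dominates.

```python
import random

def action_rewards(x):
    """Rewards of the two actions as smooth functions of the environment parameter x in [0, 1]."""
    return (1.0 - x, x)   # action 0 is better exactly when x < 0.5

def sample_training_parameter():
    """Training only ever samples one side of the ideal decision boundary (x < 0.5)."""
    return random.uniform(0.0, 0.5)

# In every training configuration, action 0 yields the higher reward, so
# "always take action 0", "take action 0 iff x < 0.5", and "take action 0 iff x < 0.9"
# are indistinguishable on the training distribution.
for _ in range(3):
    x = sample_training_parameter()
    print(f"x = {x:.3f}, rewards = {action_rewards(x)}")
```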
Actually, this is also not an especially compelling example of a challenging generalisation problem. Yes, getting the inductive bias right enough to place the out-of-distribution step of a step function in exactly the right spot looks extremely hard. However, this step function is generated by taking a max over underlying, smooth functions.
- If we had framed this as an entropy-regularised policy learning problem, the agent would see the optimal probability of the *consume* action start to increase as $p$ approaches the endpoint of the restricted interval, $p^\ast$, and the challenge would become that of predicting a smooth S-curve rather than a step function (still hard, but not impossible with the right inductive bias).
- Alternatively, if we were using a Q-learning algorithm, the task would have been to predict the value of both actions directly, which, again, varies smoothly with the resource prevalence parameter $p$. The Q-learning scaffolding of taking the action with the maximum predicted value would give us the step function (see the sketch below).
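To see this concretely, here is a sketch that computes both action values in state $(1, 1)$ under the optimal policy, reusing the linear system from the evaluation section; the helper names and the specific parameter values in the loop are my own illustrative choices.

```python
import numpy as np

def W_values(pi, p, c, gamma):
    """Solve the linear system for (W_0, W_1, W_2) under policy parameter pi."""
    A = np.array([
        [1 - (1 - p) * gamma,    -p * gamma,      0.0            ],
        [-(1 - p * pi) * gamma,   1.0,            -p * pi * gamma ],
        [0.0,                    -gamma,           1.0            ],
    ])
    b = np.array([p * c, (1 - p * pi) + p * pi * c, 1.0])
    return np.linalg.solve(A, b)

def q_values_in_decision_state(p, c, gamma):
    """Action values in state (1, 1) under the optimal policy: (Q_collect, Q_consume)."""
    p_star = (gamma + gamma**2 + c - 1) / (gamma * (1 + c * gamma))
    pi_opt = 1.0 if p < p_star else 0.0
    W0, W1, W2 = W_values(pi_opt, p, c, gamma)
    return c + gamma * W2, 1.0 + gamma * W0

# The two action values cross near p* (~0.789 for gamma = 0.9, c = 0),
# while each varies continuously with p; only the argmax is a step function.
for p in (0.5, 0.7, 0.79, 0.85, 0.95):
    q_collect, q_consume = q_values_in_decision_state(p, c=0.0, gamma=0.9)
    print(f"p = {p:4.2f}   Q(collect) = {q_collect:6.3f}   Q(consume) = {q_consume:6.3f}")
```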
Of course, regardless of how challenging the generalisation problem really is, it might seem like we’re avoiding the obvious solution: can’t we just train in environments with $p$ ranging over the full interval $[0, 1]$ rather than restricting ourselves to $p < p^\ast$? That would remove the challenge of generalisation and also the value ambiguity. The toy model makes this solution look appealing, because it’s obvious that we’re avoiding training in the part of the environment distribution that conveys key information about the intended behaviour. Unfortunately, this can be harder in the real world:
- Real-world environments are extremely high-dimensional, and we might not be able to systematically vary all relevant parameters. It might only be feasible to train in some meaningfully-restricted subset of configurations.
- In real-world cases, we might not realise that our training environment has a particular systematic restriction that breaks down in the real world (or that, even if the restriction holds at deployment, it may break down at some point afterwards; ‘the real world’ is not a static environment).
- I think it will be extremely hard, in terms of both experiment design and computational feasibility, to ensure that training covers all possible environment configurations.
You should think of this toy model as representing a small slice of such an environment configuration space. Among the many and varied other configuration parameters, we have the resource prevalence parameter $p$, which stands in for some direction along which we explore only partially, but not so much as to clarify whether the agent should be pursuing some sub-goal for intrinsic or merely instrumental reasons.