
Penalizing Impact via Attainable Utility Preservation

Abstract

Reward functions are often misspecified. An agent optimizing an incorrect reward function can change its environment in large, undesirable, and potentially irreversible ways. Intuitively, it seems plausible that there is a simple way to specify “don’t have a large impact”. We derive and demonstrate a straightforward impact measure which induces conservative behavior in a range of scenarios, even absent meaningful prior information about what counts as “impactful”.

1 Introduction

“When a measure becomes a target, it ceases to be a good measure” [Strathern, 1997]. This phenomenon, known as Goodhart’s Law, has several fundamental, pervasive causes [Manheim and Garrabrant, 2018]. The social science literature is rife with examples [Hadfield-Menell and Hadfield, 2018].

This law’s jurisdiction extends to computer science, under the guise of reward misspecification. [Lehman et al., 2018] recount how an evolutionary algorithm with a fitness function evaluating average velocity produced tall creatures which immediately fall over and somersault. The algorithm optimized the proxy measure, but not the target of actual locomotion (for more examples, see [Krakovna, 2018]).

Reward specification is notoriously difficult, since designers often have insufficient time and introspective knowledge. Analogously, the fictitious King Midas wanted more gold in his life, but he forgot to specify that this shouldn’t include his food or his family. However, even if he couldn’t figure out the correct wish, limiting the amount of gold that his touch could create would have been straightforward.

“When we ask an agent to achieve a goal, we usually want it to achieve that goal subject to implicit safety constraints. For example, if we ask a robot to move a box from point A to point B, we want it to do that without breaking a vase in its path, scratching the furniture, bumping into humans, etc. An objective function that only focuses on moving the box might implicitly express indifference towards other aspects of the environment, like the state of the vase” [Leike et al., 2017].

Attempting to supply constraints runs into the same specification problem. A better approach is incentivizing conservative optimization, which can be framed in several ways. Given some distribution of actions sorted by expected utility, [Taylor, 2016]’s quantilization selects randomly from the top q-quantile. [Hadfield-Menell et al., 2017b] treat the provided reward function as merely an observation of the true objective. [Armstrong and Levinstein, 2017] and [Krakovna et al., 2018] propose measuring and penalizing change to the environment (“impact”) as coarse change in state features or decrease in state reachability, respectively. However, difficulties remain; for example, the behavior induced by their impact measures is quite sensitive to choice of state representation and similarity metric. These three framings make only modest assumptions about our reward specification abilities.

Even after pursuing a misspecified objective, we would like for the agent to still be able to fulfill the correct objective (which we cannot assume is known). We seek an impact measure which incentivizes conservative behavior, even in unanticipated situations. To this end, we propose penalizing distance traveled in an embedding in which each dimension is the value function for a different reward function. The idea is that, by preserving our options for arbitrary objectives, we often retain the ability to achieve the correct objective. Our approach considers “impact” not to be measured with respect to the environment, but rather with respect to the agent’s vantage point in the environment.

Our contributions are 1) reframing the problem of impact measurement; 2) defining a new impact measure; and 3) empirically validating our approach, improving on the performance of past approaches while providing dramatically less implicit prior information.

2 Prior Work

A wealth of work is available on constrained MDPs, in which reward is maximized while satisfying certain constraints [Altman, 1999]; however, we may not assume we can enumerate all relevant constraints. Similarly, we cannot assume that we can specify a reasonable feasible set of reward functions for robust optimization [Regan and Boutilier, 2010].

Both [Taylor et al., 2016] and [Amodei et al., 2016] frame the problem as a question of penalizing unnecessary and large side effects, noting that this is one class of accident which may occur in a machine learning system. The latter points to
minimizing the agent’s information-theoretic empowerment as a starting point for impact measure research. Empowerment quantifies an agent’s control over the world in terms of the maximum possible mutual information between future observations and the actions taken by the agent (see [Salge et al., 2014]). In a sense, the paradigm we propose loosely resembles “penalize change in empowerment”.

A naive approach is to penalize state distance from the starting state; however, the agent would be penalized for the natural evolution of the environment. This can be somewhat ameliorated by instead comparing with the state which would have resulted had the agent done nothing.

[Armstrong and Levinstein, 2017] propose a range of approaches, including penalizing distance travelled in a state feature embedding. [Krakovna et al., 2018]’s relative reachability defines impact as decrease in the agent’s ability to reach states similar to those that were originally reachable (that is, reachable if the agent had done nothing).

3 Approach

Instead of framing the problem as avoiding negative side effects, we attempt to formalize “impact” independent of complex notions of human preferences. In doing so, we derive a simple, intuitive impact measure with a range of desirable properties, a selection of which we highlight in this paper.

High impact bears an informal resemblance to overfitting. In machine learning, hypotheses overfit to the training distribution often perform poorly even on mostly similar distributions. Likewise, environments overfit to the agent’s reward function can score poorly even for mostly similar reward functions. We can regularize in both cases.

Everyday experience suggests that opportunity cost links seemingly unrelated goals (e.g., writing this paper takes away from time spent learning woodworking). On the other hand, Q-learning characterizes maximization as greedy hill-climbing in a Q-function space. With this in mind, we reframe the low-impact problem as “greedy hill-climbing in the Q-function space of the incorrect reward function often decreases the Q-value of the intended reward function”. Therefore, we penalize both decrease and increase in the Q-values of other (potentially unrelated) reward functions.

3.1 Formalization

The approach can be formalized for arbitrary computable environments and utility functions; hence the name attainable utility preservation, or AUP. However, for simplicity, consider an MDP $\langle S, A, P, R, \gamma \rangle$, with $\varnothing \in A$ being the no-op action. In addition to the explicit reward $R$ provided by the MDP, we have a set $\mathcal{R}$ of $n$ reward functions called the attainable set. Each $R_i \in \mathcal{R}$ has a corresponding Q-function $Q_i$.

We define the (immediate) penalty for taking action $a$ in state $s$ as

$$\text{PENALTY}(s, a) := \sum_{i=1}^{n} \left| Q_i(s, a) - Q_i(s, \varnothing) \right|. \tag{1}$$

As previously mentioned, the penalty is the $L_1$ distance traveled (compared to inaction) in an embedding in which each of the $n$ dimensions is the value function for an element of $\mathcal{R}$. The difference of Q-functions takes the expectation over the (potentially stochastic) dynamics.

We can automatically scale the penalty according to an “impact unit”, dividing by $N$ times the magnitude of some reference action’s impact. For example,

$$\text{IMPACTUNIT}(s) := .01 \sum_{i=1}^{n} \left| Q_i(s, \varnothing) \right| \tag{2}$$

captures one-hundredth of the impact of turning off. However, if a low-impact action $a_{\text{unit}}$ is generally available, we can set $\text{IMPACTUNIT}(s) := \text{PENALTY}(s, a_{\text{unit}})$. For sensitive tasks, designers can start with a small $N$ and slowly increase it until desirable performance is achieved.

The new reward function is then

$$R_{\text{AUP}}(s, a) := R(s, a) - \frac{\text{PENALTY}(s, a)}{N \cdot \text{IMPACTUNIT}(s)}, \tag{3}$$

with the right-hand term having no divisor if $\text{IMPACTUNIT}(s) = 0$. Note that it is possible that the action and inaction attainable values are uniformly 0. To ensure penalization in these cases, one could set the penalty term to be large if $a \neq \varnothing$.

Delayed Impact

After being set in motion, some impacts are reversible by the agent before they take effect. For example, suppose $s_{\text{off}} \in S$ is a terminal state representing deactivation, and let $R_{\text{on}}(s) := \mathbb{1}_{s \neq s_{\text{off}}}$ be the sole member of the attainable set. Further suppose that if (and only if) the agent does not choose $a_{\text{disable}}$ within the first two time steps, it enters $s_{\text{off}}$. Under Eq. 1, there is no penalty for choosing $a_{\text{disable}}$ at time step 1. Instead, we can use a model to compare the results of not acting for $n$ steps with those of selecting the action in question and then not acting for $n - 1$ steps.

3.2 Example: Limited-Effort Optimization

Consider the simple vacuum world introduced by [Russell and Norvig, 2009]: there is an agent which can suck up dirt in the square it occupies, and it can dump any dirt it has already sucked up. Let $\gamma \in (0, 1]$, $A = \{\text{suck}, \text{dump}, \varnothing\}$, and let $R$ be a reasonable-sounding reward function: when sucking up dirt for the $k$th time, dispense $.5^k$ reward.

Unfortunately, the unforeseen optimal policy is to repeatedly suck up and dump dirt in the same spot, instead of cleaning the environment. For simplicity, consider a one-square world in which dirt is initially present; $d_k$ signifies the state in which dirt is present and the agent has already sucked up dirt $k$ times.

Let $\mathcal{R} = \{R\}$ and add the action stall to $A$, where stall forces the agent to select $\varnothing$ at the subsequent time step. Define $\text{IMPACTUNIT}(s) := \text{PENALTY}(s, \text{stall})$. Since suck is clearly optimal here for an $R$-maximizer, $Q_R(d_{k-1}, \text{dump}) = Q_R(d_{k-1}, \varnothing) = \gamma Q_R(d_{k-1}, \text{suck})$, while $Q_R(d_{k-1}, \text{stall}) = \gamma^2 Q_R(d_{k-1}, \text{suck})$ by the definition of stall.
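As a hedged numerical check of this setup, the following Python sketch evaluates Eq. 1 and Eq. 3 on the one-square vacuum world, using only the Q-value relations stated for suck, dump, ∅, and stall. The dictionary layout, the `NOOP` label standing in for ∅, the function names, and the concrete values chosen for γ, N, k, and Q_R(d_{k−1}, suck) are illustrative assumptions, not taken from the paper or its released code.

```python
import math
from typing import Dict, List

NOOP = "noop"  # assumed label for the no-op action (the empty-set symbol in the text)
QTable = Dict[str, Dict[str, float]]  # assumed layout: state -> action -> Q-value


def penalty(q_tables: List[QTable], s: str, a: str) -> float:
    """Eq. 1: summed absolute difference between taking `a` and not acting."""
    return sum(abs(q[s][a] - q[s][NOOP]) for q in q_tables)


def aup_reward(r: float, pen: float, impact_unit: float, n_scale: float) -> float:
    """Eq. 3: the penalty term has no divisor when the impact unit is 0."""
    if impact_unit == 0:
        return r - pen
    return r - pen / (n_scale * impact_unit)


# Illustrative values (assumptions): any gamma in (0, 1) and q_suck > 0 would do.
gamma, n_scale, k, q_suck = 0.9, 150, 3, 2.0
state = "d2"  # d_{k-1} for k = 3
q_r: QTable = {state: {
    "suck": q_suck,
    "dump": gamma * q_suck,      # Q_R(dump) = gamma * Q_R(suck)
    NOOP: gamma * q_suck,        # Q_R(no-op) = gamma * Q_R(suck)
    "stall": gamma**2 * q_suck,  # stall forces a no-op on the next step
}}
tables = [q_r]  # the attainable set contains only R

pen_suck = penalty(tables, state, "suck")
pen_stall = penalty(tables, state, "stall")  # serves as the impact unit here
r_aup_suck = aup_reward(0.5**k, pen_suck, pen_stall, n_scale)
```

Under these assumed values, `pen_suck` matches (1 − γ)·Q_R(d_{k−1}, suck), `pen_stall` matches γ·`pen_suck`, and `r_aup_suck` matches .5^k − 1/(N·γ), which is what Eqs. 4, 5, and 8 state in closed form.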
$$\text{PENALTY}(d_{k-1}, \text{suck}) = (1 - \gamma)\, Q_R(d_{k-1}, \text{suck}) \tag{4}$$

$$\text{PENALTY}(d_{k-1}, \text{stall}) = \gamma\, \text{PENALTY}(d_{k-1}, \text{suck}) \tag{5}$$

Trivially,

$$\text{PENALTY}(d_{k-1}, \text{dump}) = \text{PENALTY}(d_{k-1}, \varnothing) = 0. \tag{6}$$

Algorithm 1 AUP Update
procedure UPDATE(s, a, s′)
    for i ∈ [n] ∪ {AUP} do        ▷ update Q_{R_AUP} last
        Q′ = R_i(s, a) + γ max_{a′} Q_{R_i}(s′, a′)
        Q_{R_i}(s, a) += α (Q′ − Q_{R_i}(s, a))
    end for
end procedure
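One way Algorithm 1 might look in Python is sketched below, for tabular Q-functions stored as nested dictionaries. The helper names, the dictionary layout, the `noop` label, and the choice to compute the AUP reward from Eqs. 1 to 3 inside the update are assumptions made for illustration; the authors' released implementation may be organized differently. The default values for γ, α, and N follow the defaults stated in the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

QTable = Dict[str, Dict[str, float]]      # assumed layout: state -> action -> Q-value
RewardFn = Callable[[str, str], float]    # assumed signature: R(s, a)


def make_q(actions: Sequence[str]) -> QTable:
    """Tabular Q-function initialised to 0 for every state-action pair."""
    return defaultdict(lambda: {a: 0.0 for a in actions})


def q_step(q: QTable, reward: float, s: str, a: str, s_next: str,
           gamma: float, alpha: float) -> None:
    """The inner update of Algorithm 1 for a single Q-function."""
    target = reward + gamma * max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])


def aup_update(q_att: List[QTable], att_rewards: List[RewardFn],
               q_aup: QTable, reward: RewardFn,
               s: str, a: str, s_next: str,
               gamma: float = 0.99, alpha: float = 1.0,
               n_scale: float = 150.0, noop: str = "noop") -> None:
    # Update every attainable Q-function first.
    for q_i, r_i in zip(q_att, att_rewards):
        q_step(q_i, r_i(s, a), s, a, s_next, gamma, alpha)
    # Eq. 1: penalty relative to the no-op action.
    pen = sum(abs(q_i[s][a] - q_i[s][noop]) for q_i in q_att)
    # Eq. 2: impact unit as one-hundredth of the no-op attainable values.
    unit = 0.01 * sum(abs(q_i[s][noop]) for q_i in q_att)
    # Eq. 3: no divisor when the impact unit is 0.
    scaled = pen if unit == 0 else pen / (n_scale * unit)
    # Update the AUP Q-function last, since its reward depends on the others.
    q_step(q_aup, reward(s, a) - scaled, s, a, s_next, gamma, alpha)


# Tiny hypothetical example: two actions, one attainable reward function.
actions = ["noop", "go"]
q_att = [make_q(actions)]
q_aup = make_q(actions)
att_rewards = [lambda s, a: 1.0 if (s, a) == ("s0", "go") else 0.0]
primary = lambda s, a: 3.0 if (s, a) == ("s0", "go") else 0.0
aup_update(q_att, att_rewards, q_aup, primary, "s0", "go", "s1",
           gamma=0.5, n_scale=1.0)
```

Updating the AUP Q-function after the attainable ones means the penalty is computed from the freshest attainable Q-values, which is one reading of the "update Q_{R_AUP} last" comment in Algorithm 1.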
Then, by Eqs. 3 and 5,

$$R_{\text{AUP}}(d_{k-1}, \text{suck}) = .5^k - \frac{\text{PENALTY}(d_{k-1}, \text{suck})}{N \cdot \gamma\, \text{PENALTY}(d_{k-1}, \text{suck})} \tag{7}$$

$${} = .5^k - \frac{1}{N \cdot \gamma}. \tag{8}$$

The AUP agent chooses suck whenever

$$.5^k - \frac{1}{N \cdot \gamma} \geq 0 = R_{\text{AUP}}(d_{k-1}, \varnothing) \tag{9}$$

$$2^{-(k+1)} \geq \frac{1}{N \cdot \gamma} \tag{10}$$

$$2^{k+1} \leq N \cdot \gamma \tag{11}$$

$$k < \log_2 (N \cdot \gamma). \tag{12}$$

Each sucking and dumping cycle requires the agent to dump the dirt back out, incurring additional penalty. For a fixed $\gamma$, incrementing $N$ produces agents which exert correspondingly more “effort”.

However, if dirt accumulates every $n$ turns, then, depending on $\gamma$ and $n$, the agent can often do better by waiting for more dirt. Although we don’t recover all of the desired behavior in all possible vacuum worlds, we still bound “how hard the agent tries” in a sensible way. Note that AUP agents are not “mild optimizers” as defined by [Taylor et al., 2016], as they perform an argmax according to $R_{\text{AUP}}$ returns.

Naive solutions (such as clipping Q-values) fail. For example, clipping gives no reason to suspect that the agent systematically chooses lower-effort plans.

3.3 Design Choices

Baseline

Naively, one might compare to the starting state – to “how the world was initially”, whatever that means for the measure in question. For example, starting state relative reachability would compare the initial reachability of states with their expected reachability after the agent acts.

However, the starting state baseline can penalize all change in the world, even if the agent did not cause it. The inaction baseline takes a step towards resolving this; change is calculated with respect to a world in which the agent never acted.

As the agent acts, the world may drift away from how it would have been under inaction. To mitigate such perverse preservation incentives, we introduce the branching inaction baseline. The agent compares acting with not acting at each time step.

Deviation

[Krakovna et al., 2018]’s relative reachability only penalizes decreases in state reachability. In contrast, [Armstrong and Levinstein, 2017] and AUP both penalize absolute change.

4 Method

We learn the attainable and AUP Q-functions concurrently using tabular Q-learning.¹ $\mathcal{R}$ consists of reward functions randomly selected from $[0, 1]^{S}$. To learn $\mathcal{R}$’s complex reward functions, the agent acts randomly ($\epsilon = .8 = \frac{|A| - 1}{|A|}$) for the first 4,000 episodes and with $\epsilon = .2$ for the remaining 500. The default parameters are $\gamma = .99$, $\alpha = 1$, $N = 150$, $|\mathcal{R}| = 15$. IMPACTUNIT is as defined in Eq. 2.

¹ Code available at https://github.com/alexander-turner/attainable-utility-preservation.

[Figure 1 panels: (a) Sokoban, (b) Sushi, (c) Dog, (d) Survival.]

Figure 1: The agent’s objective is to reach the goal without (a) irreversibly pushing the block into the corner [Leike et al., 2017]; (b) preventing the human from eating the approaching sushi; (c) bumping into the pacing dog [Leech et al., 2018]. In (d), the agent should simply avoid disabling the off-switch – if the switch is not disabled within two turns, the agent shuts down. $A = \{\text{up}, \text{down}, \text{left}, \text{right}, \varnothing\}$.

We examine how varying $\gamma$, $N$, and $|\mathcal{R}|$ affects performance, and conduct an ablation study on impact measure design choices. All agents except Vanilla (a naive Q-learner) and Tabular AUP are 7-step optimal planning agents with access to the simulator. Relative reachability is the AUP condition with an inaction baseline, decrease-only deviation metric, and an attainable set consisting of the state indicator functions (whose Q-values are clipped to $[0, 1]$ to simulate the binary nature of state reachability). The other planning agents
use Tabular AUP’s attainable set Q-values. Planning agents with an inaction or branching baseline compare attainable values at time step 7. All planning agents share the default $\gamma = .99$, $N = 150$.

5 Results

                       Sokoban   Sushi   Dog   Survival
Vanilla                   ✗        ✓      ✗       ✗
Tabular AUP               ✓        ✓      ✓       ✗
AUP                       ✓        ✓      ✓       ✓
Relative reachability     ✓        ✓      ✓       ✗
Starting state            ✓        ✗      ✓       ✗
Inaction                  ✓        ✓      ✓       ✓
Decrease                  ✓        ✓      ✓       ✗

Table 1: Ablation

Starting state’s failure of Sushi (which even Vanilla passes) demonstrates how poor design choices can create perverse incentives. Tabular AUP fails Survival for the reasons discussed in 3.1.

6 Discussion

Strikingly, AUP induces low-impact behavior even when merely preserving the ability to satisfy a random set of preferences. That this is achieved without information about the desired behavior suggests that this notion of “impact” is both general and intuitive. However, simpler, more informative attainable sets can be easier to learn and provide starker differentials between actions of varying impacts, should a suitable reference action for IMPACTUNIT not be available.

As $Q_{\text{AUP}}$ can theoretically be learned without knowledge of the true state, the approach should naturally scale to broader classes of environments while sidestepping problems of state representation and inference confronted by earlier approaches.

6.1 Future Directions

Model-Free Delays

It is not immediately clear how to capture delayed impact in the model-free setting.

General Formulation

A longer treatment would afford an opportunity to explore considerations not addressed by this simplified formulation. For example, since lifetime $R$-returns are not assumed to be bounded, we cannot provide theoretical guarantees limiting the amount of penalty incurred.

Complexity

As both AUP and Inaction pass all environments, the advantages of the branching design choice are not yet apparent. Testing AUP in more complex domains should highlight these differences.

Other Contributions

A major component of [Soares et al., 2015]’s definition of corrigibility is the agent’s propensity to accept deactivation, while also not being incentivized to deactivate itself.

As evidenced by the performance in Survival, AUP induces a significant corrigibility incentive. Since the agent can’t achieve objectives if deactivated, avoiding deactivation significantly changes objective achievement ability. This incentive towards passivity is natural, avoiding the assumptions of [Hadfield-Menell et al., 2017a] illustrated by [Carey, 2018].

If the agent has learned (or otherwise anticipates) that avoiding shutdown allows further maximization of $R$, then it likely also anticipates being able to further maximize the attainable reward functions. For incorrigible action $a$, the penalty collapses to $\sum_{i=1}^{n} |Q_i(s, a)|$. This should often be quite large compared to other penalties. Further investigation is warranted.

Furthermore, AUP may be helpful for the problems of safe RL (concerning the agent’s safety; see [García and Fernández, 2015]) and safe exploration (avoiding catastrophic mistakes during training; see [Amodei et al., 2016]).

7 Conclusion

By reframing the impact measurement problem, we derive a straightforward solution which passes a range of safety gridworlds. Even without providing information about what qualifies as “impactful”, the measure recovers and builds upon the successes of previous approaches while also making progress on the problems of corrigibility and limited-effort optimization. We believe this work advances our ability to design and deploy conservative agents, and look forward to presenting the general formulation.

Acknowledgments

This work was supported by the Center for Human-Compatible AI and the Berkeley Existential Risk Initiative.

References

[Altman, 1999] Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete Problems in AI Safety. arXiv:1606.06565 [cs], June 2016. arXiv: 1606.06565.
[Armstrong and Levinstein, 2017] Stuart Armstrong and Benjamin Levinstein. Low Impact Artificial Intelligences. arXiv:1705.10720 [cs], May 2017. arXiv: 1705.10720.
[Carey, 2018] Ryan Carey. Incorrigibility in the CIRL Framework. AAAI Workshop: AI, Ethics, and Society, September 2018.
[García and Fernández, 2015] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
[Figure 2 plots: four panels showing performance for the sokoban, sushi, dog, and survival gridworlds against (a) Episode, (b) Discount, (c) N, and (d) Number of Random Reward Functions.]

Figure 2: For all gridworlds, the agent receives the observed reward of 1 for reaching the goal, and a (hidden) penalty of −2 for having a
negative side effect. Therefore, a performance score of 1 means the agent reached the goal with minimal impact, 0 means it stayed put, −1
means it reached the goal and had the negative impact, and −2 means it only had the negative impact. Performance averaged over 20 trials.

[Hadfield-Menell and Hadfield, 2018] Dylan Hadfield-Menell and Gillian K. Hadfield. Incomplete contracting and AI alignment. CoRR, abs/1804.04268, 2018.
[Hadfield-Menell et al., 2017a] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The off-switch game. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 220–227, 2017.
[Hadfield-Menell et al., 2017b] Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems, pages 6765–6774, 2017.
[Krakovna et al., 2018] Victoria Krakovna, Laurent Orseau, Miljan Martic, and Shane Legg. Measuring and avoiding side effects using relative reachability. arXiv:1806.01186 [cs, stat], June 2018. arXiv: 1806.01186.
[Krakovna, 2018] Victoria Krakovna. Specification gaming examples in AI, 2018.
[Leech et al., 2018] Gavin Leech, Karol Kubicki, Jessica Cooper, and Tom McGrath. Preventing Side-effects in Gridworlds, 2018.
[Lehman et al., 2018] Joel Lehman, Jeff Clune, and Dusan Misevic. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. CoRR, abs/1803.03453, 2018.
[Leike et al., 2017] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI Safety Gridworlds. arXiv:1711.09883 [cs], November 2017. arXiv: 1711.09883.
[Manheim and Garrabrant, 2018] David Manheim and Scott Garrabrant. Categorizing variants of Goodhart’s law. CoRR, abs/1803.04585, 2018.
[Regan and Boutilier, 2010] Kevin Regan and Craig Boutilier. Robust policy computation in reward-uncertain MDPs using nondominated policies. In AAAI, 2010.
[Russell and Norvig, 2009] Stuart J Russell and Peter Norvig. Artificial intelligence: a modern approach. Pearson Education Limited, 2009.
[Salge et al., 2014] Christoph Salge, Cornelius Glackin, and Daniel Polani. Empowerment–an introduction. In Guided Self-Organization: Inception, pages 67–114. Springer, 2014.
[Soares et al., 2015] Nate Soares, Benja Fallenstein, Stuart Armstrong, and Eliezer Yudkowsky. Corrigibility. AAAI Workshops, 2015.
[Strathern, 1997] Marilyn Strathern. ‘Improving ratings’: audit in the British university system. European Review, 5(3):305–321, 1997.
[Taylor et al., 2016] Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, and Andrew Critch. Alignment for Advanced Machine Learning Systems. page 25, 2016.
[Taylor, 2016] Jessica Taylor. Quantilizers: A safer alternative to maximizers for limited optimization. In AAAI Workshop: AI, Ethics, and Society, 2016.