You are on page 1of 3

INSIGHTS

P OLICY FORUM
ARTIFICIAL INTELLIGENCE

Regulating advanced artificial agents


Governance frameworks should address the prospect
of AI systems that cannot be safely tested

By Michael K. Cohen1,2, Noam Kolt3,4, under control is also reflected in President So long as an agent’s rewards can
Yoshua Bengio5,6, Gillian K. Hadfield2,3,4,7, Biden’s 2023 executive order that intro- be controlled, it can be incentivized to
Stuart Russell1,2 duces reporting requirements for AI that achieve complex goals by conditioning the
could “eva[de] human control or oversight rewards appropriately. But a sufficiently

T
echnical experts and policy-makers through means of deception or obfusca- capable RL agent could take control of its

Downloaded from https://www.science.org at University of Lethbridge on April 04, 2024


have increasingly emphasized the tion” (3). Building on these efforts, now is rewards, which would give it the incentive
need to address extinction risk from the time for governments to develop regu- to secure maximal reward single-mindedly.
artificial intelligence (AI) systems latory institutions and frameworks that Constraining the influence that highly
that might circumvent safeguards specifically target the existential risks from competent agents learn to exert over their
and thwart attempts to control them advanced artificial agents. environment is likely to prove extremely
(1). Reinforcement learning (RL) agents difficult; an intelligent agent could, for ex-
that plan over a long time horizon far more RISKS FROM LTPAs ample, persuade or pay unwitting human
effectively than humans present particular RL agents function as follows: They re- actors to execute important actions on its
risks. Giving an advanced AI system the ceive perceptual inputs and take actions, behalf (5, 7).
objective to maximize its reward and, at and certain inputs are typically designated Critically, far-sighted RL agents face
some point, withholding reward from it, as “rewards.” An RL agent then aims to an incentive to develop and execute arbi-
strongly incentivizes the AI system to take select actions that it expects will lead to trarily competent long-term plans. Many
humans out of the loop, if it has the op- higher rewards. For example, by designat- AI systems are trained only to achieve
portunity. The incentive to deceive humans certain immediate outcomes, like cor-
and thwart human control arises not only rectly classifying an image. Although such
for RL agents but for long-term planning “…safety testing is likely short-sighted agents could certainly cause
agents (LTPAs) more generally. Because
empirical testing of sufficiently capable
to be either dangerous or harm, they would likely lack the incentive
to execute protracted schemes to subvert
LTPAs is unlikely to uncover these dan- uninformative.” human control.
gerous tendencies, our core regulatory Accordingly, we define an LTPA as an
proposal is simple: Developers should ing money as a reward, one could train an algorithm designed to produce plans, and
not be permitted to build sufficiently ca- RL agent to maximize profit on an online to prefer plan A to plan B, when it expects
pable LTPAs, and the resources required to retail platform (4). that plan A is more conducive to a given
build them should be subject to stringent Highly capable and far-sighted RL agents goal over a long time horizon. For exam-
controls. are likely to accrue reward very successfully. ple, an agent trained to maximize profit
Governments are turning their attention If plan A leads to more expected reward on an online retail platform, as proposed
to these risks, alongside current and antic- than plan B, sufficiently advanced RL agents by Suleyman’s “new Turing test” (4), might
ipated risks arising from algorithmic bias, would favor the former. Crucially, securing productively use such an algorithm and
privacy concerns, and misuse. At a 2023 the ongoing receipt of maximal rewards hinder attempts to interfere with its profit
global summit on AI safety, the attend- with very high probability would require the making. LTPAs include all long-horizon
ing countries, including the United States, agent to achieve extensive control over its RL algorithms, including so-called “policy
United Kingdom, Canada, China, India, environment, which could have catastrophic gradient” methods, which lack an explicit
and members of the European Union (EU), consequences (5–8). One path to maximiz- planning subroutine but are trained to
issued a joint statement warning that, as ing long-term reward involves an RL agent be as competent as possible. LTPAs also
AI continues to advance, “Substantial risks acquiring extensive resources and taking include algorithms that imitate trained
may arise from…unintended issues of con- control over all human infrastructure (5, 6), LTPAs, but not algorithms that merely imi-
trol relating to alignment with human in- which would allow it to manipulate its own tate humans. In the latter case, if plan A
tent” (2). This broad consensus concerning reward free from human interference (5). is more competent than any plan a human
the potential inability to keep advanced AI Additionally, because being shut down by could develop, and plan B is a human plan,
humans would reduce the expected reward, an algorithm imitating a human would not
1University of California, Berkeley, CA, USA. 2Center for
sufficiently capable and far-sighted agents prefer plan A to plan B. The supplemen-
Human-Compatible Artificial Intelligence, Berkeley, CA, USA.
3University of Toronto, Toronto, Ontario, Canada. 4Schwartz are likely to take steps to preclude that pos- tary materials include a taxonomy situat-
Reisman Institute for Technology and Society, Toronto, sibility (7) or if feasible, create new agents ing LTPAs among other machine learning
Ontario, Canada. 5Université de Montréal, Montréal, Québec, (unimpeded by monitoring or shutdown) to systems. Notably, there is no recognizable
Canada. 6Mila–Quebec AI Institute, Montréal, Québec,
Canada. 7Vector Institute for Artificial Intelligence, Toronto, act on their behalf (5). Progress in AI could horizon length at which risk increases
Ontario, Canada. Email: mkcohen@berkeley.edu enable such advanced behavior. sharply; accordingly, regulators will have

36 5 APRIL 2024 • VOL 384 ISSUE 6691 science.org SCIENCE


to define the length of a long time horizon tests and pause misbehavior (13). Testing could (4). This difficulty arises in part be-
according to their risk tolerance. may nonetheless be useful for detecting cause there is currently no robust scientific
Losing control of advanced LTPAs, al- some dangerous algorithmic capabilities in method for (ii); computer scientists should
though not the only existential risk from systems that cannot thwart human control. develop one quickly. Perhaps if certain re-
AI, is the class of risk that we aim to ad- Stepping back, empirical testing is a no- sources could be used to create an AI sys-
dress here—and one that necessitates new toriously ineffective tool for ensuring the tem with the short-term goal of exhibiting
forms of government intervention. safety of computational systems. For exam- a moderately dangerous capability (i.e.,
ple, extensive testing failed to reveal an er- trying to fail the safety test), that could im-
A GOVERNANCE PROPOSAL ror in the Intel Pentium’s arithmetic unit. prove our understanding of the resources
Although governments have expressed Given that both safety and validity cannot that can produce dangerous capabilities.
concern about existential risks from AI, be ensured when testing sufficiently ca- Admittedly, listing relevant dangerous
regulatory proposals do not adequately pable LTPAs, governments should estab- capabilities and estimating the resources
address this class of risk (9). The EU AI lish new regulatory bodies with the legal required to achieve these capabilities will
Act (10) canvasses a broad array of risks authority and technical capacity to prevent require considerable research. We suggest
from AI but does not single out loss of con- such agents from being built in the first that regulators err on the side of caution
trol of advanced LTPAs. We see promising place, no matter the domain. and underestimate the resources required
first steps from the US and UK—President to develop LTPAs with dangerous capabili-
Biden’s executive order on AI (3) requires DEFINING DANGEROUS CAPABILITIES ties. Systems should be considered “dan-

Downloaded from https://www.science.org at University of Lethbridge on April 04, 2024


reports on potentially uncontrollable AI How capable is “sufficiently capable”? Un- gerously capable” if they are trained with
systems, but it does not seek to constrain fortunately, we do not know. More cautious enough resources to potentially exhibit
their development or proliferation; the regulators might prevent the development those dangerous capabilities, and regula-
US and UK AI Safety Institutes are build-
ing capacity for regulators to understand
cutting-edge AI but lack the authority to Mandatory reporting and production controls for LTPAs
control it (11). To prevent unlawful development of dangerously capable long-term planning agents (LTPAs), which may
Across multiple jurisdictions, following be difficult to directly detect, reporting requirements would enable regulators to have sufficient visibility into
industry practice, the prevailing regulatory easier to observe LTPA production resources and code interacting with those resources. Though concern is
approach for AI involves empirical safety ultimately with subsets of production resources and LTPAs (“definition”), these are not easily recognizable, thus
testing, most prominently within the UK broader recognizable supersets encompassing those subsets (“implementation”) are the focus of regulation.
AI Safety Institute (2, 3, 10–12). We, how-
ever, argue that for a sufficiently capable
LTPA, safety testing is likely to be either LTPA PRODUCTION RESOURCES DANGEROUSLY CAPABLE LTPAs
dangerous or uninformative. Although we
Definition Definition
might like to empirically assess whether Information that makes it low-cost to produce LTPAs able to thwart
an agent would exploit an opportunity to a dangerously capable LTPA Code used to human control
produce LTPAs
thwart our control over it, if the agent in
Implementation Implementation
fact has such an opportunity during a test, Categories might include, e.g., large foundation LTPAs trained with sufficiently extensive
the test may be unsafe. Conversely, if it does models, AI training curriculum resources, e.g., compute, data
not have such an opportunity during a test,
the test is likely to be uninformative with Shape and size of regulatory implementation categories can be updated periodically to ensure the inclusion of new systems that meet the definition.
respect to such risks. This holds for human
agents as well as artificial agents: Consider of even weak LTPAs; however, regulators tors should not permit the development
a leader appointing a general, but worried seeking to facilitate the development of of dangerously capable LTPAs. To ensure
about a coup; if the general is clever, there merely “moderately capable” LTPAs should this occurs, regulators will need to care-
is no safe and reliable loyalty test. A candi- establish protocols to estimate in advance fully monitor and control the resources
date for the role, like an advanced artificial whether such systems might have the abil- that could be used to produce dangerously
agent, would either recognize the test and ity to game safety testing and evade human capable LTPAs. Although this would inter-
behave agreeably or, if possible, execute a control. One factor that regulators could rupt the “move fast” ethos of AI develop-
coup during the test. consider is the resources proposed to be ment, we believe caution is necessary.
If an agent is advanced enough to rec- used to train LTPAs, including compute, If dangerously capable LTPAs are at
ognize that it is being tested, then there data, and the resources used to develop some point permitted to be developed, rig-
is little reason to expect similar behav- any pretrained models that assist in LTPA orous technical and regulatory work would
ior in and out of testing. Moreover, an AI training. We propose that policy-makers need to be done first to determine if, when,
system designed to interact with complex (i) establish a list of dangerous capabili- and how to permit this. The possibility
environments (e.g., human institutions or ties, such as those described in President must also be considered that researchers
biological systems) would likely be able to Biden’s executive order (3), which include and policy-makers fail to identify any ro-
discern a simulated test environment from “high levels of performance at…deception bustly safe regulatory regimes that permit
real-world deployment (because complex or obfuscation” and “offensive cyber op- the development of dangerously capable
systems can only be simulated approxi- erations through automated vulnerability LTPAs, at least by the time that actors in
mately), thereby enabling the AI system to discovery”; and (ii) estimate the resources the private sector are able to build them.
identify when it is being tested. Although needed to develop an LTPA that exhibits It is also worth noting that there might
no current artificial agents are competent those capabilities. We do not believe that be a path to building AI systems that can
enough to thwart human control, some existing systems exhibit those capabilities, be proved mathematically to avoid certain
have already been found to identify safety and it is very difficult to predict when they dangerous behaviors (7), but such formal

SCIENCE science.org 5 APRIL 2024 • VOL 384 ISSUE 6691 37


INSIGHTS | P O L I C Y F O RU M

guarantees appear highly unlikely for any could also control the transfer of large pre- LTPAs fills an important gap, further insti-
AI systems built similarly to the most pow- trained models or other relevant resources. tutional mechanisms will likely be needed
erful systems today (13). Further, regulators could make it unlawful to mitigate the risks posed by advanced ar-
for other actors to use AI systems that fail tificial agents. j
MONITORING AND REPORTING to comply with these requirements (15).
REF ERENCES AND NOTES
Just as nuclear regulation extends to con- Taken together, controls on the develop-
1. G. Hinton et al., “Statement on AI risk” (Center
trolling uranium, AI regulation must ex- ment, use, and dissemination of produc- for AI Safety, May 2023); https://www.safe.ai/
tend to controlling the resources needed tion resources will substantially reduce the statement-on-ai-risk.
to produce dangerously capable LTPAs. We likelihood of these resources being used to 2. United Kingdom Department for Science, Innovation,
and Technology, United Kingdom Foreign,
define production resources (PRs) as any build dangerously capable LTPAs.
Commonwealth, and Development Office, United
information that makes the production of Kingdom Prime Minister’s Office, “The Bletchley
a dangerously capable LTPA cheaper than a ENFORCEMENT MECHANISMS Declaration by countries attending the AI Safety
threshold determined by regulators accord- To ensure compliance with these reporting Summit, 1-2 November 2023” (Gov.uk, November
2023); https://www.gov.uk/government/publications/
ing to their risk tolerance. Unlike uranium, requirements and usage controls, regula- ai-safety-summit-2023-the-bletchley-declaration/
a PR is not a physical resource—it could tors may need to be authorized to (i) issue the-bletchley-declaration-by-countries-attending-the-
include any AI model trained beyond a cer- legal orders that compel organizations to ai-safety-summit-1-2-november-2023.
tain compute threshold (14). Fortunately, report production resources and man- 3. J. Biden, “Executive order on the safe, secure,
and trustworthy development and use of arti-
regulators could detect such PRs by fol- date the cessation of prohibited activities; ficial intelligence” (The White House, October

Downloaded from https://www.science.org at University of Lethbridge on April 04, 2024


lowing the hardware required to produce (ii) audit an organization’s activities and, 2023); https://www.whitehouse.gov/briefing-
them. (Some of this hardware could be where necessary, restrict an organization’s room/presidential-actions/2023/10/30/
executive-order-on-the-safe-secure-and-trustworthy-
regulated as well, including semiconductor access to certain resources, such as cloud
development-and-use-of-artificial-intelligence/.
chips and data centers, but that is outside computing; (iii) impose fines on noncom- 4. M. Suleyman, The Coming Wave (Penguin, September
our focus here.) To limit the proliferation pliant organizations; and (iv) as in finan- 2023).
of PRs, expanding on Hadfield et al. (15), cial regulation, impose personal liability 5. M. K. Cohen, M. Hutter, M. A. Osborne, AI Mag. 43, 282
(2022).
Avin et al. (12), and President Biden’s ex- on key individuals in noncompliant orga- 6. S. Zhuang, D. Hadfield-Menell, Adv. Neural Inf. Process.
ecutive order (3), we propose that develop- nizations. If business leaders can be held Syst. 33, 15763 (2020).
ers be required to report (a) relevant facts to account for breaching corporate duties, 7. S. Russell, Human Compatible: AI and the Problem of
about the PR [if the PR is an AI model, then surely they should face similar conse- Control (Viking, 2019).
8. A. Turner, L. Smith, R. Shah, A. Critch, P. Tadepalli, Adv.
this might include (i) the input/output quences for irresponsibly handling one of Neural Inf. Process. Syst. 34, 23063 (2021).
properties, (ii) the data collection process the world’s most dangerous technologies. 9. N. Kolt, “Algorithmic black swans”(Washington
for training it, (iii) the training objective, University Law Review, October 2023); https://ssrn.
com/abstract=4370566.
and (iv) documented behavior in test set- REGULATORY INSTITUTIONS
10. European Commission, “Proposal for a regulation of
tings, but not typically the model weights We have addressed our discussion to “reg- the European parliament and of the council laying down
themselves]; (b) the specific machines on ulators” but have not proposed specific harmonised rules on artificial intelligence (Artificial
which the PR is stored and their locations; regulatory institutions for addressing the Intelligence Act) and amending certain Union legisla-
tive acts” (COM/2021/0206, European Commission,
(c) all code run on these machines after the risks from LTPAs. This issue will need to January 2024); https://eur-lex.europa.eu/
PR is created; and (d) all outputs of that be approached differently in different legal-content/EN/TXT/?uri=celex%3A52021PC0206.
code. With the context provided by point countries. That being said, we expect that 11. United Kingdom Department for Science, Innovation
(a), governments could monitor the code whereas other risks from AI might be ad- and Technology, “Introducing the AI Safety Institute”
(Gov.uk, November 2023); https://assets.publishing.
that interacts with PRs, allowing them to dressed primarily through domain-specific service.gov.uk/media/65438d159e05fd0014be7bd9/
detect the development of (unlawful) LT- regulation (e.g., financial regulation and introducing-ai-safety-institute-web-accessible.pdf.
PAs (see the figure). In addition, if a com- health care regulation), the risk of loss of 12. S. Avin et al., Science 374, 1327 (2021).
13. J. Lehman et al., Artif. Life 26, 274 (2020).
pany offers users application programming control of AI likely requires specialized
14. Y. Shavit, arXiv:2303.11341 [cs.LG] (2023).
interface (API) access to a PR, users should regulation and the establishment of new 15. G. Hadfield, M. Cuéllar, T. O’Reilly, “It’s time to create
be required to report the code on the us- regulatory institutions. This specialized a national registry for large AI models” (Carnegie
er’s machine that interacts with the API. regulation could nevertheless benefit from Endowment for International Peace, July 2023);
https://carnegieendowment.org/2023/07/12/
Details of the reporting requirements will the existing expertise of domain-specific it-s-time-to-create-national-registry-for-large-ai-
need to be updated in response to techno- regulators, including with developing models-pub-90180.
logical advances that lead to changes in the frameworks for monitoring PRs. Critically,
ACKNOWL EDGEMENTS
resources and processes needed to produce because the risks from LTPAs are global,
M.K.C. and N.K. contributed equally. The authors thank R.
dangerously capable LTPAs. Finally, report- regulatory efforts cannot stop at national Grosse, A. Barto, and P. Christiano for feedback on earlier
ing procedures could be complemented by borders. International cooperation is vital. versions of the manuscript. M.K.C. commenced this project
protecting and rewarding whistleblowers at Oxford University. M.K.C. and S.R. are supported by the
Open Philanthropy Foundation. N.K. is supported by a
who uncover misconduct. BROADER CONCERNS
Vanier Canada Graduate Scholarship. Y.B. is supported by a
LTPAs, of course, are not the only type of Canadian Institute for Advanced Research (CIFAR) AI Chair
PRODUCTION CONTROLS AI system that poses substantial and even and a Natural Sciences and Engineering Research Council of
Given sufficient visibility into the re- existential risks. Accordingly, we suggest Canada Discovery Grant. G.K.H. is supported by the Schwartz
Reisman Institute Chair in Technology and Society and a
sources for producing LTPAs, regulators that empirical testing, which is inadequate Canada CIFAR AI Chair at the Vector Institute for Artificial
could then prohibit the production of dan- for sufficiently advanced LTPAs, could Intelligence. G.K.H. and S.R. are supported by AI2050 Senior
gerously capable LTPAs. Developers that nevertheless substantially improve the Fellowships from Schmidt Futures.
are unsure whether a proposed AI system safety of some other types of AI. At the SUPPL EMENTARY MATE RIALS
meets the definition of dangerously capa- same time, the governance regime that we science.org/doi/10.1126/science.adl0625
ble LTPA could inquire with the relevant propose could be adapted to other AI sys-
regulator prior to development. Regulators tems. Although our proposal for governing 10.1126/science.adl0625

38 5 APRIL 2024 • VOL 384 ISSUE 6691 science.org SCIENCE

You might also like