AUTOACT: Automatic Agent Learning from Scratch via Self-Planning
arXiv:2401.05268v2 [cs.CL] 17 Jan 2024

Abstract

Language agents have achieved considerable performance on various complex tasks. Despite the incessant exploration in this field, existing language agent systems still struggle with costly, non-reproducible data reliance and face the challenge of compelling a single model to serve multiple functions. To this end, we introduce AUTOACT, an automatic agent learning framework that does not rely on large-scale annotated data or synthetic trajectories from closed-source models (e.g., GPT-4). Given limited data and a tool library, AUTOACT first automatically synthesizes planning trajectories without any assistance from humans or strong closed-source models. Then, AUTOACT leverages a division-of-labor strategy to automatically differentiate based on the target task information and the synthesized trajectories, producing a sub-agent group that collaborates to complete the task. We conduct comprehensive experiments with different LLMs, demonstrating that AUTOACT yields better or comparable performance relative to various strong baselines. We even notice that AUTOACT, when using the Llama-2-13b model, can achieve performance comparable to that of the zero-shot GPT-3.5-Turbo agent.¹

∗ Corresponding Author.
¹ Code will be available at https://github.com/zjunlp/AutoAct.

Figure 1: The basic framework of AUTOACT. Armed with just one tool library, the Meta-Agent can automatically differentiate based on the target task information and produce a sub-agent group that can collaborate to complete the task.

1 Introduction

Language agents (Wang et al., 2023b; Xi et al., 2023; Xie et al., 2023), which leverage the powerful reasoning capabilities (Qiao et al., 2023b; Zhang et al., 2023b) of Large Language Models (LLMs) to generate executable actions for observing the external world, have emerged as essential components of AI systems designed to address intricate interactive tasks (Torantulino, 2023; Osika, 2023; Nakajima, 2023; Tang et al., 2023). The process of endowing LLMs with such interactive capabilities is referred to as Agent Learning, wherein planning plays a pivotal role: it is responsible for decomposing complex tasks (Wei et al., 2022; Zhou et al., 2023a; Yao et al., 2023), invoking external tools (Shen et al., 2023; Lu et al., 2023; Qin et al., 2023), reflecting on past mistakes (Shinn et al., 2023; Madaan et al., 2023), and aggregating information from various sources to reach the final targets. Many works (Yao et al., 2023; Wang et al., 2023a; Shen et al., 2023; Chen et al., 2023c) directly prompt closed-source, off-the-shelf LLMs to plan on particular tasks. Despite their convenience and flexibility, closed-source LLMs inevitably suffer from unresolved issues: their access often comes at a steep price, and their black-box nature makes results difficult to reproduce. In light of this, some recent endeavors have shifted their focus towards imbuing open-source models with planning capabilities through fine-tuning (Chen et al., 2023a; Zeng et al., 2023; Yin et al., 2023).

However, despite the achievements of existing fine-tuning-based methods, they are not without limitations. On the one hand, training open-source models necessitates a substantial amount of annotated task data and still relies on closed-source models to synthesize planning trajectories. Fulfilling these requirements often proves difficult in many real-world scenarios, such as private personal bots or sensitive company business.
| Method | Data Acquisition | Trajectory Acquisition | Planning | Multi-Agent | Fine-Tuning | Generality | Reflection |
|---|---|---|---|---|---|---|---|
| ReAct (Yao et al., 2023) | User | Prompt | Iterative | ✗ | ✗ | ✔ | ✗ |
| Reflexion (Shinn et al., 2023) | User | Prompt | Iterative | ✗ | ✗ | ✔ | ✔ |
| Chameleon (Lu et al., 2023) | User | Prompt | Global | ✗ | ✗ | ✔ | ✗ |
| HuggingGPT (Shen et al., 2023) | User | Prompt | Global | ✗ | ✗ | ✔ | ✗ |
| AutoGPT (Torantulino, 2023) | User | Prompt | Iterative | ✗ | ✗ | ✔ | ✔ |
| BOLAA (Liu et al., 2023) | User | Prompt | Iterative | ✔ | ✗ | ✔ | ✗ |
| AgentVerse (Chen et al., 2023c) | User | Prompt | Iterative | ✔ | ✗ | ✔ | ✗ |
| Agents (Zhou et al., 2023c) | User | Prompt | Iterative | ✔ | ✗ | ✔ | ✗ |
| AgentTuning (Zeng et al., 2023) | Benchmark | GPT-4 | Iterative | ✗ | ✔ | ✗ | ✗ |
| FireAct (Chen et al., 2023a) | Benchmark | GPT-4 | Iterative | ✗ | ✔ | ✗ | ✔ |
| AUTOACT (ours) | User + Self-Instruct | Self-Planning | Iterative | ✔ | ✔ | ✔ | ✔ |

Table 1: Comparison of related works. Data Acquisition and Trajectory Acquisition refer to how training data and trajectories are obtained. Planning indicates whether each step's action is determined globally or iteratively. Multi-Agent indicates whether the framework contains multiple agents. Fine-Tuning indicates whether the method is a fine-tuning-based agent learning framework. Generality signifies whether the method is applicable to various tasks. Reflection denotes whether the planning process incorporates reflection.
On the other hand, from the perspective of agent framework design, fine-tuning-based methods compel one single language agent to learn all planning abilities, placing even greater pressure on it. This contradicts Simon's principle of bounded rationality (Mintrom, 2015), which states that "precise social division-of-labor and clear individual tasks can compensate for the limited ability of individuals to process and utilize information".

To this end, we introduce AUTOACT, an automatic agent learning framework which does not rely on large-scale annotated data or synthetic trajectories from closed-source models, while incorporating explicit individual tasks with a precise division-of-labor (see Figure 1). Given a limited set of user-provided data examples, AUTOACT starts with a Meta-Agent to obtain an augmented database through self-instruct (Wang et al., 2023c). Then, armed with a prepared tool library, the Meta-Agent can automatically synthesize planning trajectories without any assistance from humans or strong closed-source models. Finally, we propose a division-of-labor strategy that resembles cell differentiation based on the self-synthesized trajectories (genes), where the Meta-Agent acts as a stem cell (Colman, 2008) and differentiates into three sub-agents with distinct functions: task decomposition, tool invocation, and self-reflection, respectively. Our differentiation process is essentially a parameter-efficient training process on the self-synthesized trajectories with low resource consumption. We list the differences between AUTOACT and prior works in Table 1.

Experiments on complex question-answering tasks with different LLMs demonstrate that the proposed AUTOACT yields better or comparable performance relative to various strong baselines. We summarize our main contributions as follows:

• We propose AUTOACT, an automatic agent learning framework that does not rely on large-scale annotated data or synthetic trajectories from closed-source models, while adhering to the principle of bounded rationality.

• We conduct comprehensive experiments with different LLMs, demonstrating that AUTOACT yields better or comparable performance relative to various strong baselines. We even notice that when using the Llama-2-13b model, AUTOACT can achieve performance comparable to that of the zero-shot GPT-3.5-Turbo agent (OpenAI, 2022).

• Extensive empirical analysis demonstrates the effectiveness of our division-of-labor strategy and shows that the trajectory quality generated by AUTOACT outperforms that of other methods in multiple aspects.

2 AUTOACT

Overview. As shown in Figure 2, AUTOACT only requires the target task information and a language agent (which we name the Meta-Agent) to initiate its work. The Meta-Agent first augments the task data from scratch by self-instruct. Then, with a tool library available, the Meta-Agent conducts automatic agent learning by differentiating into sub-agents with distinct functionalities and enabling them to perform group task-specific planning. We name this process self-planning. Below is a detailed introduction to AUTOACT. Note that all symbols used are globally defined.
Figure 2: The overview of our proposed framework AUTOACT. We initiate with self-instruct to extend the task database from scratch. Then self-planning is applied to conduct automatic agent learning, including automatic tool selection, trajectory synthesis, self-differentiation, and group planning. Our self-differentiation is a parameter-efficient fine-tuning process to achieve resource-efficient learning.
2.1 Critical Components of AUTOACT

Meta-Agent. The Meta-Agent stands at the central position of our AUTOACT framework. It is responsible for all the preparatory work before self-differentiation and serves as the backbone model for the self-differentiation process. Given limited target task information and a pre-prepared tool library, the Meta-Agent can differentiate into an agent group capable of collaborating to accomplish the target task. In AUTOACT, the Meta-Agent can be initialized with any open-source model.

Target Task Information. In this paper, we mainly focus on agent learning from scratch, which means the task information at hand is quite limited, primarily encompassing three aspects: the task name M, the task description P, and the task data examples C. Concretely, P represents a detailed description of the task's characteristics, properties, and other relevant information. C = {(q_i, a_i)}_{i=1}^{|C|} indicates |C| question-answer example pairs of the task, where |C| is very small, so that users can effortlessly provide them (e.g., a few demonstrations). For a more in-depth view of task information, please refer to Appendix B. Note that the task information serves as the only user-provided knowledge of the task for AUTOACT to conduct automatic agent learning.

Tool Library. To facilitate our agents in automatic task planning, we provide a comprehensive tool library at their disposal. The tool library can be denoted as T = {(m_i, d_i, u_i)}_{i=1}^{|T|}, where m represents the name of each tool, d defines its functionality, u details its usage instruction, and |T| is the number of tools in the library. In our automatic procedure, the Meta-Agent has the autonomy to select appropriate tools from the tool library based on the task information. Users can also expand the tool library according to their specific needs, allowing for more flexible utilization. We list part of our tool library in Appendix C.
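As a concrete illustration, each (m, d, u) triple of the tool library can be held in a small record type and rendered into a prompt fragment. This is a minimal sketch, assuming the prompt simply enumerates each tool's name, description, and usage; the `Tool` class, the `render_tool_list` helper, and the sample entries are our own illustration (the actual tools and prompt templates are in Appendices C and D.2), with the sample tool names taken from the figures.

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str         # m: the tool's name
    description: str  # d: what the tool does
    usage: str        # u: how to invoke it

def render_tool_list(library: list[Tool]) -> str:
    """Render the tool library T as a numbered list for the selection prompt."""
    return "\n".join(
        f"{i}. {t.name}: {t.description} Usage: {t.usage}"
        for i, t in enumerate(library, start=1)
    )

# Sample entries modeled on tools that appear in the paper's figures.
library = [
    Tool("Retrieve", "Retrieve a Wikipedia page by title.", "Retrieve[entity]"),
    Tool("Lookup", "Look up a keyword in the last retrieved page.", "Lookup[keyword]"),
    Tool("BingSearch", "Search the web with a free-form query.", "BingSearch[query]"),
]
print(render_tool_list(library))
```

Users extending the library only need to append further `Tool` records; the rendered list is what the Meta-Agent sees when selecting tools.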
2.2 Starting from Scratch via Self-Instruct

To acquire a sufficient amount of task data and provide an ample training resource, it is necessary to augment the data based on the examples at hand. We accomplish this through self-instruct. Initially, the database D is set equal to the task data examples C, with C as the seed for data generation. In each round, the Meta-Agent generates new question-answer pairs by few-shot prompting, with the few-shot examples randomly sampled from D. The generated data are added to D after filtering, which excludes format-erroneous and duplicate data before inclusion. Eventually, we obtain a database D = {(q_i, a_i)}_{i=1}^{|D|}, where the amount of data |D| satisfies |D| ≫ |C|. The prompt we use for self-instruct can be seen in Appendix D.1.
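The self-instruct loop above can be sketched as follows, assuming a `meta_agent` callable that maps sampled few-shot demonstrations to newly generated question-answer pairs; the function name and the simple duplicate/format filter are our own stand-ins for the paper's filtering step.

```python
import random

def self_instruct(meta_agent, seed_examples, target_size, shots=3):
    """Grow database D from the seed examples C via few-shot generation.

    `meta_agent` is a hypothetical callable: a list of sampled (question,
    answer) demonstrations -> a list of newly generated (question, answer)
    pairs. Malformed and duplicate pairs are filtered before inclusion.
    """
    database = list(seed_examples)            # D is initialized to C
    seen = {q for q, _ in database}
    while len(database) < target_size:
        demos = random.sample(database, min(shots, len(database)))
        for q, a in meta_agent(demos):
            if q and a and q not in seen:     # format + duplicate filter
                seen.add(q)
                database.append((q, a))
    return database
```

Because freshly accepted pairs re-enter the sampling pool, later rounds draw demonstrations from an increasingly diverse D.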
2.3 Automatic Agent Learning via Self-Planning

Automatic Tool Selection. With the tool library at hand, we ask the Meta-Agent to select applicable tools for each task automatically. Specifically, we put T = {(m_i, d_i, u_i)}_{i=1}^{|T|} in the form of a tool list as part of the prompt. Along with T, the prompt also includes the task description P. Finally, we instruct the Meta-Agent to select an appropriate set of tools T_s (T_s ⊂ T) for synthesizing trajectories. The prompt we use for automatic tool selection can be seen in Appendix D.2.

Trajectories Synthesis. Without relying on closed-source models, we enable the Meta-Agent to synthesize planning trajectories on its own. Equipped with T_s, we instruct the Meta-Agent to synthesize trajectories in a zero-shot manner on the database D, adhering to the Thought-Action-Observation format defined in Yao et al. (2023). To obtain high-quality synthesized trajectories, we filter out all trajectories with reward < 1 and collect only those with exactly correct answers (reward = 1) as the training source for self-differentiation. The prompt for trajectory synthesis can be seen in Appendix D.3.
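The synthesize-and-filter step can be sketched as below, with `meta_agent` (the zero-shot rollout) and `reward_fn` (F1 or accuracy, as defined in §3) as hypothetical stand-ins for the paper's components.

```python
def synthesize_training_set(meta_agent, database, reward_fn):
    """Zero-shot Thought-Action-Observation rollouts over D, kept only when
    the final answer is exactly correct (reward == 1).

    `meta_agent` and `reward_fn` are hypothetical stand-ins: the former maps
    a question to a trajectory of (thought, action, observation) triples, and
    the latter scores the final prediction against the gold answer.
    """
    kept = []
    for question, answer in database:
        trajectory = meta_agent(question)
        prediction = trajectory[-1][1]        # the final Finish[...] action
        if reward_fn(prediction, answer) == 1:
            kept.append((question, trajectory))
    return kept                               # training source for differentiation
```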
Self-Differentiation. To establish a clear division-of-labor, we leverage the synthesized planning trajectories to differentiate the Meta-Agent into three sub-agents with distinct functionalities:

• Plan-Agent π_plan undertakes task decomposition and determines which tool to invoke in each planning loop (Eq. 2).

• Tool-Agent π_tool is responsible for how to invoke the tool (Eq. 3) by deciding the parameters for the tool invocation.

• Reflect-Agent π_reflect engages in reflection by considering all the historical trajectories and providing a reflection result (Eq. 4).

We assume that the planning loop at time t can be denoted as (τ_t, α_t, o_t), where τ denotes Thought, α signifies Action, and o represents Observation. α can be further expressed as (α^m, α^p), where α^m is the name of the action and α^p is the parameters required to perform the action. Then the historical trajectory at time t can be written as:

H_t = (τ_0, α_0, o_0, τ_1, ..., τ_{t-1}, α_{t-1}, o_{t-1}).  (1)

Eventually, supposing that the prompts of target task information, planning format requirements, and the question are all combined as S, the responsibilities of each sub-agent can be defined as:

τ_t, α_t^m = π_plan(S, T_s, H_t),  (2)

α_t^p = π_tool(S, T_s, H_t, τ_t, α_t^m),  (3)

τ^r, α^r = π_reflect(S, T_s, H),  (4)

where τ^r and α^r represent the thought and action of the reflection process respectively, and H is the planning history after finishing the answer. The trajectories can be reorganized based on the responsibilities above and fed to the Meta-Agent for self-differentiation. Our differentiation is a parameter-efficient fine-tuning process that achieves resource-efficient learning. In particular, for each sub-agent we train a specific LoRA (Hu et al., 2022).

Group Planning. After obtaining the task-specific sub-agents, any new question is processed through group planning among the sub-agents to achieve the desired outcome. Once the tool name α_t^m generated by the Plan-Agent is triggered at time t, the Tool-Agent is roused to determine the parameters α_t^p passed to the specific tool. The return result of the tool is treated as the observation o_t and handed back to the Plan-Agent. After the collaboration between the Plan-Agent and Tool-Agent finishes with a prediction, the Reflect-Agent comes on stage to reflect on the history and provide a reflection result contained in the reflection action α^r. If the reflection result indicates that the prediction is correct, the whole planning process concludes; otherwise, the Plan-Agent and Tool-Agent continue planning based on the reflection information. The specific sequence of the group planning process can be seen in the trajectory example on the right side of Figure 2.
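The group planning protocol can be sketched as one loop over the three sub-agents. The `Finish`/`Reflect` conventions follow the trajectory format in Figure 2; the callable signatures are our own simplification of Eqs. (2)-(4), and `max_rounds` is a hypothetical budget the paper does not specify.

```python
def group_planning(plan_agent, tool_agent, reflect_agent, tools, question,
                   max_rounds=8):
    """One group-planning pass; the agent callables are hypothetical stand-ins.

    plan_agent(question, history)                -> (thought, action_name)
    tool_agent(question, history, thought, name) -> action parameters
    reflect_agent(question, history)             -> (thought, "right"/"wrong")
    """
    history = []                                  # H_t as (thought, action, obs)
    for _ in range(max_rounds):
        thought, name = plan_agent(question, history)
        if name == "Finish":
            prediction = tool_agent(question, history, thought, name)
            history.append((thought, (name, prediction), None))
            _, verdict = reflect_agent(question, history)
            if verdict == "right":                # reflection accepts the answer
                return prediction
            history.append(("reflect", ("Reflect", verdict), "Please replan."))
        else:
            params = tool_agent(question, history, thought, name)
            observation = tools[name](params)     # tool result becomes o_t
            history.append((thought, (name, params), observation))
    return None                                   # no accepted answer in budget
```

Note how the rejected-reflection branch simply appends to the shared history, so the Plan-Agent's next step is conditioned on the reflection feedback.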
3 Experimental Setup

Tasks. We evaluate our method on HotpotQA (Yang et al., 2018) and ScienceQA (Lu et al., 2022).
| Backbone | Method | HotpotQA Easy | Medium | Hard | All | ScienceQA G1-4 | G5-8 | G9-12 | All |
|---|---|---|---|---|---|---|---|---|---|
| GPT-3.5-Turbo | CoT | 48.21 | 44.52 | 34.22 | 42.32 | 60.83 | 55.83 | 65.00 | 60.56 |
| GPT-3.5-Turbo | Zero-Shot Plan* | 50.71 | 45.17 | 38.23 | 44.70 | 76.67 | 61.67 | 78.33 | 72.22 |
| Llama-2-7B-chat | CoT | 35.80 | 26.69 | 18.20 | 26.90 | 59.17 | 50.00 | 59.17 | 56.11 |
| Llama-2-7B-chat | ReAct | 25.14 | 19.87 | 17.39 | 20.80 | 52.50 | 47.50 | 54.17 | 51.39 |
| Llama-2-7B-chat | Chameleon | 37.73 | 26.66 | 21.83 | 28.74 | 59.17 | 54.17 | 60.00 | 57.78 |
| Llama-2-7B-chat | Reflexion | 35.55 | 28.73 | 24.35 | 29.54 | 60.83 | 57.50 | 59.17 | 58.06 |
| Llama-2-7B-chat | BOLAA | 27.55 | 21.47 | 21.03 | 23.35 | 58.33 | 53.33 | 52.50 | 54.72 |
| Llama-2-7B-chat | FireAct | 38.83 | 30.19 | 22.30 | 30.44 | 50.83 | 53.33 | 60.00 | 54.72 |
| Llama-2-7B-chat | AUTOACT | 34.60 | 27.73 | 25.22 | 29.18 | 62.50 | 49.17 | 48.33 | 53.33 |
| Llama-2-13B-chat | CoT | 37.90 | 25.28 | 21.64 | 28.27 | 61.67 | 52.50 | 69.17 | 61.11 |
| Llama-2-13B-chat | ReAct | 28.68 | 22.15 | 21.69 | 24.17 | 57.27 | 51.67 | 65.00 | 57.98 |
| Llama-2-13B-chat | Chameleon | 40.01 | 25.39 | 22.82 | 29.41 | 69.17 | 60.83 | 73.33 | 67.78 |
| Llama-2-13B-chat | Reflexion | 44.43 | 37.50 | 28.17 | 36.70 | 67.50 | 64.17 | 73.33 | 68.33 |
| Llama-2-13B-chat | BOLAA | 33.23 | 25.46 | 25.23 | 27.97 | 60.00 | 54.17 | 65.83 | 60.00 |
| Llama-2-13B-chat | FireAct | 45.83 | 38.94 | 26.06 | 36.94 | 60.83 | 57.50 | 67.50 | 61.94 |
| Llama-2-13B-chat | AUTOACT | 47.29 | 41.27 | 32.92 | 40.49 | 70.83 | 66.67 | 76.67 | 71.39 |
| Llama-2-70B-chat | CoT | 45.37 | 36.33 | 32.27 | 37.99 | 74.17 | 64.17 | 75.83 | 71.39 |
| Llama-2-70B-chat | ReAct | 39.70 | 37.19 | 33.62 | 36.83 | 64.17 | 60.00 | 72.50 | 65.56 |
| Llama-2-70B-chat | Chameleon | 46.86 | 38.79 | 34.43 | 40.03 | 77.83 | 69.17 | 76.67 | 74.56 |
| Llama-2-70B-chat | Reflexion | 48.01 | 46.35 | 35.64 | 43.33 | 75.83 | 67.50 | 78.33 | 73.89 |
| Llama-2-70B-chat | BOLAA | 46.44 | 37.29 | 33.49 | 39.07 | 70.00 | 67.50 | 75.00 | 70.83 |
| Llama-2-70B-chat | FireAct | 50.82 | 41.43 | 35.86 | 42.70 | 72.50 | 68.33 | 75.00 | 71.94 |
| Llama-2-70B-chat | AUTOACT | 56.94 | 50.12 | 38.35 | 48.47 | 82.50 | 72.50 | 80.83 | 78.61 |

Table 2: Main results of AUTOACT compared to various baselines on HotpotQA and ScienceQA. CoT, Zero-Shot Plan, ReAct, Chameleon, Reflexion, and BOLAA are prompt-based agent learning methods without fine-tuning, while FireAct and AUTOACT are fine-tuning-based; BOLAA and AUTOACT are multi-agent frameworks, and the remaining methods are single-agent. *We compare the zero-shot plan performance of GPT-3.5-Turbo to ensure fairness in our evaluation, since our setup does not include annotated trajectory examples.
HotpotQA is a multi-hop QA task requiring rich background knowledge, where the answer is usually a short entity or yes/no. Following Liu et al. (2023), we randomly select 300 dev questions divided into three difficulty levels for evaluation, with 100 questions at each level. For HotpotQA, the reward ∈ [0, 1] is defined as the F1 score between the prediction and the ground-truth answer. ScienceQA is a multi-modal QA task spanning various scientific topics. We also divide the test set into three levels based on grade, with 120 randomly sampled examples at each level. Since ScienceQA is a multiple-choice QA task, the reward ∈ {0, 1} is exactly the accuracy. Note that due to the limitations of LMs in generating images, for ScienceQA we directly generate captions for the images during the self-instruct stage instead.

Baselines. We choose the open-source Llama-2 models (Touvron et al., 2023) as the backbones of our Meta-Agent and sub-agents. The compared baselines are as follows: 1) CoT (Wei et al., 2022), the naive Chain-of-Thought reasoning method. 2) ReAct (Yao et al., 2023), a well-known single-agent framework based on few-shot learning that performs planning and action iteratively. 3) Chameleon (Lu et al., 2023), another few-shot single-agent framework, which performs planning before action. 4) Reflexion (Shinn et al., 2023), a single-agent framework that reinforces language agents through linguistic feedback. 5) BOLAA (Liu et al., 2023), a multi-agent framework that customizes different agents through prompts. 6) FireAct (Chen et al., 2023a), a single-agent framework fine-tuned on diverse kinds of trajectories generated by GPT-4 (OpenAI, 2023). 7) GPT-3.5-Turbo (OpenAI, 2022). To ensure fairness, we maintain an equal training trajectory volume of 200 for FireAct and AUTOACT (200 synthesized data). As Reflexion provides answer-correctness labels during reflection but the other methods, including AUTOACT, do not, we test all the other methods twice and choose the correct one for evaluation. For all the prompt-based baselines, we uniformly provide two examples in the prompt.
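The HotpotQA reward described above, a token-level F1 score in [0, 1] between prediction and gold answer, can be sketched as follows; the exact answer normalization used in the paper may differ from the simple lowercasing shown here.

```python
from collections import Counter

def f1_reward(prediction: str, gold: str) -> float:
    """Token-level F1 in [0, 1] between predicted and gold answers
    (answer normalization is simplified to lowercasing here)."""
    pred = prediction.lower().split()
    ref = gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For ScienceQA the analogous reward is simply exact-match accuracy over the chosen option, yielding values in {0, 1}.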
Training Setups. We fine-tune all our models with LoRA (Hu et al., 2022) in the format proposed in Alpaca (Taori et al., 2023). Our fine-tuning framework leverages FastChat (Zheng et al., 2023) with DeepSpeed (Rasley et al., 2020). We set a learning rate of 1e-4 and a sequence length of 4096 for all model scales. The number of training epochs is 5 for the 7b and 13b models and 3 for the 70b model. The batch size is 4 for the 7b and 13b models and 1 for the 70b model. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with a cosine learning-rate scheduler. We detail the hyper-parameters for training in Appendix A.

| Method | HotpotQA | ScienceQA |
|---|---|---|
| AUTOACT | 48.47 | 78.61 |
| - reflection | 45.66 (↓2.81) | 75.28 (↓3.33) |
| - multi | 42.81 (↓5.66) | 69.72 (↓8.89) |
| - fine-tuning | 32.84 (↓15.63) | 61.94 (↓16.67) |
| - filtering | 32.51 (↓15.96) | 59.17 (↓19.44) |

Table 3: Approach ablations of AUTOACT. "- reflection" removes the Reflect-Agent in AUTOACT. "- multi" feeds all the differentiated data into one model for fine-tuning. "- fine-tuning" indicates zero-shot prompt planning with the three agents defined in AUTOACT. "- filtering" performs self-differentiation on all the trajectories generated in zero-shot planning without filtering out wrong cases.
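The scale-dependent hyper-parameters above can be collected in one place; the helper below merely restates the values from this section (the function name and dict layout are our own convenience wrapper, not part of the released code).

```python
def training_config(model_scale: str) -> dict:
    """Restate the LoRA fine-tuning hyper-parameters from the Training
    Setups section; this wrapper itself is illustrative only."""
    assert model_scale in {"7b", "13b", "70b"}
    return {
        "method": "LoRA",
        "learning_rate": 1e-4,
        "sequence_length": 4096,
        "optimizer": "AdamW",
        "lr_scheduler": "cosine",
        "epochs": 3 if model_scale == "70b" else 5,
        "batch_size": 1 if model_scale == "70b" else 4,
    }
```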
4 Results

Compare to Prompt-based Agent Learning Baselines. As shown in Table 2, the 13b and 70b models consistently outperform various prompt-based baselines. The 70b model even surpasses the agent performance of GPT-3.5-Turbo, achieving a rise of ↑3.77% on HotpotQA and ↑6.39% on ScienceQA. The performance of the 7b model is comparable to other methods to some extent, but it fails to deliver the same impressive results due to its limited self-instruct and self-planning capabilities. Therefore, whether in a single-agent or multi-agent architecture, prompt-based methods relying on few-shot demonstrations fail to precisely customize the behavior of the agent, which is also supported by the fact that FireAct widely outperforms ReAct and BOLAA in the context of iterative planning. In addition, our investigation reveals a visible disparity in open-source models between the performance of many prompt-based planning baselines (relying on various external tools) and CoT (relying on the models' intrinsic reasoning abilities). This discrepancy underscores the formidable challenge of unlocking planning capabilities by prompting.

Compare to Fine-tuning-based Agent Learning Baselines. Focusing further on FireAct in Table 2, despite the assistance of GPT-4, FireAct's approach of assigning the entire planning task to a single model proves burdensome. As a result, its performance on ScienceQA even falls short of the prompt-based global planning method, Chameleon. AUTOACT employs self-differentiation to decouple the planning process and reaches a clear division-of-labor among sub-agents for group planning, resulting in an improvement over FireAct of ↑5.77% on HotpotQA and ↑6.67% on ScienceQA with the 70b model. Additionally, AUTOACT achieves self-planning without relying on closed-source models and large-scale labeled datasets, which paves the way for automatic agent learning with open-source models from scratch. In the ablation study (§4) and human evaluation (§5), we further validate that the quality of trajectories synthesized by AUTOACT is not inferior to that of FireAct, which is trained on trajectories synthesized by GPT-4.

Single-agent Learning vs. Multi-agent Learning. Under identical settings, multi-agent architectures generally exhibit better performance than single-agent ones (ReAct vs. BOLAA, FireAct vs. AUTOACT), which aligns with Simon's theory of bounded rationality. Seemingly contrary to expectations, despite being a single-agent architecture, Chameleon outperforms BOLAA (and even FireAct on ScienceQA). However, we find that this can be attributed to the way it leverages tools. In Chameleon, the process of deciding tool parameters is considered a form of tool invocation, and specialized few-shot prompts are designed to guide the model through this process. From this aspect, Chameleon, despite being nominally a single-agent architecture, exhibits characteristics that resemble a multi-agent architecture, which does not contradict our initial conclusion. Indeed, we can also explain this from the perspective of optimization objectives. Another well-known economic principle, Goodhart's Law (Goodhart, 1984), states that "when a measure becomes a target, it ceases to be a good measure". This implies that optimizing one objective on the same agent will inevitably harm other optimization objectives to some extent. Therefore, it is not optimal to optimize all objectives on a single agent, and a multi-agent architecture happens to address this issue.
Figure 3: Performance of AUTOACT on different training data scales. (a-c) show the results of models trained on self-synthesized trajectories; (d-f) show the results of models trained on trajectories synthesized by a stronger model, where the dashed line is the baseline trained on self-synthesized trajectories. (Panels: (a) 7b-model 7b-data, (b) 13b-model 13b-data, (c) 70b-model 70b-data, (d) 7b-model 13b-data, (e) 7b-model 70b-data, (f) 13b-model 70b-data.)
Figure 4: Performance of AUTOACT based on different degrees of labor division. "One" trains a single model with all the differentiated data. "Three" represents the differentiation into three agents: plan, tool, and reflect. "Tool-Specified" further differentiates the tool-agent with one agent per tool.
However, we analyze in §5 that an excessively fine-grained division-of-labor is not the best approach, and a moderate division of labor benefits group performance.

Approach Ablations. Table 3 presents the performance of AUTOACT on the 70b model after removing certain key processes. It can be observed that the least impactful removal is - reflection. We find that in the zero-shot scenario, the model tends to be over-confident in its answers: it typically only recognizes its errors when there are obvious formatting mistakes or significant repetitions in the planning process. Consistent with previous findings, the removal of multiple agents (- multi) leads to a noticeable decrease in performance. A more exciting discovery is that the results of - multi are comparable to those of FireAct. This indirectly suggests that the trajectory quality generated by the 70b model may be no worse than that of GPT-4. As expected, the performance deteriorates after - fine-tuning, which once again confirms the previous conclusion. To demonstrate the necessity of filtering out planning-error data, we specifically remove the filtering process (- filtering) to examine the performance of AUTOACT. The results indicate that the damage caused by training on unfiltered data is even greater than that of - fine-tuning.

5 Analysis

Larger training data scale does not necessarily mean better results. We evaluate the influence of different training data scales on the performance of self-planning in Figure 3(a-c). It can be observed that the overall performance of different models steadily improves as the training data scale increases. However, it stabilizes with minimal fluctuations once the data scale exceeds 200. We speculate that this may be due to the limited
(a) ReAct
Question: The Bass Rock Lighthouse was next to what Castle?
Thought: Maybe I should retrieve Bass Rock Lighthouse.
Action: Retrieve[Bass Rock Lighthouse]
Observation: The Bass Rock Lighthouse on Bass Rock is a ...
Thought: What I want to know now is what castle is it next to.
Action: Lookup[castle]
Observation: (Result 1/1) The Bass Rock Lighthouse is built in 1902 by David Stevenson, who ... and some other buildings within the castle for the stone. ...
Thought: The Bass Rock Lighthouse is next to a castle demolished by David Stevenson.
Action: Finish[David Stevenson demolished the castle]

(b) AutoAct
Question: The Bass Rock Lighthouse was next to what Castle?
Thought: The Bass Rock Lighthouse is located on Bass Rock.
Action: Retrieve[Bass Rock]
Observation: The Bass Rock, or simply the Bass ( ) ...
Thought: The Bass Rock Lighthouse is located near a castle.
Action: Lookup[castle]
Observation: (Result 1/7) The Bass Rock, ... and 5 km (3 mi) north-east of North Berwick.
Thought: The Bass Rock Lighthouse is near North Berwick.
Action: BingSearch[North Berwick Castle]
Observation: ... Tantallon Castle. Near North Berwick, ...
Thought: The Bass Rock Lighthouse is next to Tantallon Castle.
Action: Finish[Tantallon Castle]
(Reflection omitted)

Figure 5: Case study. AUTOACT (b) successfully addresses the failure in ReAct (a) by employing a more scientific combination of tools and making more accurate tool invocations. With more planning rounds, AUTOACT (c) can validate its answers by continuing more rounds of self-verification, while this can also lead to a longer context, gradually deviating AUTOACT (d) from the original question.
Tasks and Backbones. For experimental convenience, we only evaluate our method on complex question-answering tasks with the Llama-2-chat model series. However, there are many other interactive scenarios and backbone models beyond these. Other complex tasks include web (Yao et al., 2022; Zhou et al., 2023b), household (Puig et al., 2018; Shridhar et al., 2021), robotics (Ichter et al., 2022), etc., and more backbone models include Vicuna (Zheng et al., 2023), ChatGLM (Du et al., 2022), Mistral (Jiang et al., 2023), etc. We plan to conduct further research on applying AUTOACT to a wider range of tasks and backbones in the future.

Boosting Knowledge via Self-Instruct. As analyzed in §5, the planning performance of AUTOACT can be limited by the model's ability to access internal knowledge through self-instruct. While the current phenomenon allows us to achieve

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2023c. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents. CoRR, abs/2308.10848.

Alan Colman. 2008. Human embryonic stem cells and clinical applications. Cell Research, 18(1):S171–S171.

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: General language model pretraining with autoregressive blank infilling. pages 320–335.

Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. 2023. Specializing smaller language models towards multi-step reasoning. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 10421–10430. PMLR.
C. A. E. Goodhart. 1984. Problems of Monetary Management: The UK Experience, pages 91–121. Macmillan Education UK, London.

Çaglar Gülçehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. 2023. Reinforced self-training (ReST) for language modeling. CoRR, abs/2308.08998.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.

Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2023. Large language models can self-improve. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 1051–1068. Association for Computational Linguistics.

Brian Ichter, Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Sergey Levine, Yao Lu, Carolina Parada, Kanishka Rao, Pierre Sermanet, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Mengyuan Yan, Noah Brown, Michael Ahn, Omar Cortes, Nicolas Sievers, Clayton Tan, Sichun Xu, Diego Reyes, Jarek Rettinghouse, Jornell Quiambao, Peter Pastor, Linda Luu, Kuang-Huei Lee, Yuheng Kuang, Sally Jesmonth, Nikhil J. Joshi, Kyle Jeffrey, Rosario Jauregui Ruano, Jasmine Hsu, Keerthana Gopalakrishnan, Byron David, Andy Zeng, and Chuyuan Kelly Fu. 2022. Do as I can, not as I say: Grounding language in robotic affordances. In Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand, volume 205 of Proceedings of Machine Learning Research, pages 287–318. PMLR.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. CoRR, abs/2310.06825.

Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, and Mike Lewis. 2023. Self-alignment with instruction backtranslation. CoRR, abs/2308.06259.

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. 2023. Encouraging divergent thinking in large language models through multi-agent debate. CoRR, abs/2305.19118.

Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, and Hannaneh Hajishirzi. 2022. Generated knowledge prompting for commonsense reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 3154–3169. Association for Computational Linguistics.

Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, and Silvio Savarese. 2023. BOLAA: Benchmarking and orchestrating llm-augmented autonomous agents. CoRR, abs/2308.05960.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS.

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. 2023. Chameleon: Plug-and-play compositional reasoning with large language models. CoRR, abs/2304.09842.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback. CoRR, abs/2303.17651.

Shengyu Mao, Ningyu Zhang, Xiaohan Wang, Mengru Wang, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. 2023. Editing personality for llms. CoRR, abs/2310.02168.

Michael Mintrom. 2015. Herbert A. Simon, Administrative Behavior: A Study of Decision-Making Processes in Administrative Organization. In The Oxford Handbook of Classics in Public Policy and Administration. Oxford University Press.

Yohei Nakajima. 2023. BabyAGI. https://github.com/yoheinakajima/babyagi.

OpenAI. 2022. ChatGPT: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/.

OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.

Anton Osika. 2023. GPT-Engineer. https://github.com/AntonOsika/gpt-engineer.
Joon Sung Park, Joseph C. O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST 2023, San Francisco, CA, USA, 29 October 2023 - 1 November 2023, pages 2:1–2:22. ACM.

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large language model connected with massive apis. CoRR, abs/2305.15334.

Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. 2018. Virtualhome: Simulating household activities via programs. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 8494–8502. Computer Vision Foundation / IEEE Computer Society.

Shuofei Qiao, Honghao Gui, Huajun Chen, and Ningyu Zhang. 2023a. Making language models better tool learners with execution feedback. CoRR, abs/2305.13068.

Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. 2023b. Reasoning with language model prompting: A survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 5368–5393. Association for Computational Linguistics.

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2023. ToolLLM: Facilitating large language models to master 16000+ real-world apis. CoRR, abs/2307.16789.

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, pages 3505–3506. ACM.

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI tasks with chatgpt and its friends in huggingface. CoRR, abs/2303.17580.

Noah Shinn, Beck Labash, and Ashwin Gopinath. 2023. Reflexion: Language agents with verbal reinforcement learning. CoRR, abs/2303.11366.

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew J. Hausknecht. 2021. Alfworld: Aligning text and embodied environments for interactive learning. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M. Sadler, Wei-Lun Chao, and Yu Su. 2022. Llm-planner: Few-shot grounded planning for embodied agents with large language models. CoRR, abs/2212.04088.

Xiangru Tang, Anni Zou, Zhuosheng Zhang, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. 2023. Medagents: Large language models as collaborators for zero-shot medical reasoning. CoRR, abs/2311.10537.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.

Torantulino. 2023. AutoGPT: Build & use AI agents. https://github.com/Significant-Gravitas.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023a. Voyager: An open-ended embodied agent with large language models. CoRR, abs/2305.16291.

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. 2023b. A survey on large language model based autonomous agents. CoRR, abs/2308.11432.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023c. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 13484–13508. Association for Computational Linguistics.

Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, and Yitao Liang. 2023d. Describe, explain, plan and select: Interactive planning with llms enables open-world multi-task agents. In Thirty-seventh Conference on Neural Information Processing Systems.

Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, et al. 2023e. Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models. arXiv preprint arXiv:2311.05997.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.

Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, and Tao Gui. 2023. The rise and potential of large language model based agents: A survey. CoRR, abs/2309.07864.

Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, and Tao Yu. 2023. OpenAgents: An open platform for language agents in the wild. CoRR, abs/2310.10634.

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. WizardLM: Empowering large language models to follow complex instructions. CoRR, abs/2304.12244.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2369–2380. Association for Computational Linguistics.

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. Webshop: Towards scalable real-world web interaction with grounded language agents. In NeurIPS.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

... attributed training data generator: A tale of diversity and bias. CoRR, abs/2306.15895.

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. 2022. STaR: Bootstrapping reasoning with reasoning. In NeurIPS.

Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. 2023. AgentTuning: Enabling generalized agent abilities for llms. CoRR, abs/2310.12823.

Jintian Zhang, Xin Xu, and Shumin Deng. 2023a. Exploring collaboration mechanisms for LLM agents: A social psychology view. CoRR, abs/2310.02124.

Zhuosheng Zhang, Yao Yao, Aston Zhang, Xiangru Tang, Xinbei Ma, Zhiwei He, Yiming Wang, Mark Gerstein, Rui Wang, Gongshen Liu, and Hai Zhao. 2023b. Igniting language intelligence: The hitchhiker's guide from chain-of-thought reasoning to language agents. CoRR, abs/2311.11797.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. 2023a. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2023b. Webarena: A realistic web environment for building autonomous agents. CoRR, abs/2307.13854.

Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, Shiding Zhu, Jiyu Chen, Wentao Zhang, Ningyu Zhang, Huajun Chen, Peng Cui, and Mrinmaya Sachan. 2023c. Agents: An open-source framework for autonomous language agents. CoRR, abs/2309.07870.

A Hyper-Parameters

See Table 4.
requiring multiple steps of reasoning to arrive at the final answer.

Task Data Examples:
Question: From 1969 to 1979, Arno Schmidt was the executive chef of a hotel located in which neighborhood in New York?
Answer: Manhattan

Question: Are both Shangri-La City and Ma'anshan cities in China?
Answer: yes

Task Name: ScienceQA
Task Description: This is a multimodal question-answering task that necessitates a model to utilize tools for transforming image information into textual data. Simultaneously, this task incorporates substantial background knowledge, requiring the language model to acquire external information to enhance its comprehension of the task.
Task Data Examples:
Question: Which of these states is the farthest north?
Options: (A) West Virginia (B) Louisiana (C) Arizona (D) Oklahoma
Caption: An aerial view of a painting of a forest.
Answer: A. West Virginia

Question: Identify the question that Tom and Justin's experiment can best answer.
Context: The passage below describes an experiment. Read the passage and then follow the instructions below. Tom placed a ping pong ball in a catapult, pulled the catapult's arm back to a 45° angle, and launched the ball. Then, Tom launched another ping pong ball, this time pulling the catapult's arm back to a 30° angle. With each launch, his friend Justin measured the distance between the catapult and the place where the ball hit the ground. Tom and Justin repeated the launches with ping pong balls in four more identical catapults. They compared the distances the balls traveled when launched from a 45° angle to the distances the balls traveled when launched from a 30° angle. Figure: a catapult for launching ping pong balls.
Options: (A) Do ping pong balls stop rolling along the ground sooner after being launched from a 30° angle or a 45° angle? (B) Do ping pong balls travel farther when launched from a 30° angle compared to a 45° angle?
Caption: A wooden board with a wooden head on top of it.
Answer: B. Do ping pong balls travel farther when launched from a 30° angle compared to a 45° angle?

C Tool Library

See Table 5.

D Prompt

D.1 Prompt for Self-Instruct

See Table 6.

D.2 Prompt for Tool Selection

See Table 7.

D.3 Prompt for Trajectories Synthesis

See Table 8.
Name: BingSearch
Definition: The BingSearch engine can search for rich knowledge on the internet based on keywords, which can compensate for knowledge fallacies and outdated knowledge.
Usage: BingSearch[query], which searches the exact detailed query on the Internet and returns the information relevant to the query. Be specific and precise with your query to increase the chances of getting relevant results. For example, BingSearch[popular dog breeds in the United States].

Name: Retrieve
Definition: Retrieves additional background knowledge crucial for tackling complex problems. It is especially beneficial for specialized domains like science and mathematics, providing context for the task.
Usage: Retrieve[entity], which retrieves the exact entity on Wikipedia and returns the first paragraph if it exists. If not, it will return some similar entities to retrieve. For example, Retrieve[Milhouse].

Name: Lookup
Definition: A Lookup tool returns the next sentence containing the target string in the page from the search tool, simulating Ctrl+F functionality in the browser.
Usage: Lookup[keyword], which returns the next sentence containing the keyword in the last passage successfully found by Retrieve or BingSearch. For example, Lookup[river].

Name: Image2Text
Definition: Image2Text is used to detect words in images, convert them into text by OCR, and generate captions for images. It is particularly valuable for understanding an image semantically, like identifying objects and interactions in a scene.
Usage: Image2Text[image], which generates captions for the image and detects words in the image. You are recommended to use it first to get more information about the image relevant to the question. If the question contains an image, it will return the caption and OCR text; otherwise, it will return None. For example, Image2Text[image].

Name: Text2Image
Definition: Text2Image specializes in converting textual information into visual representations, facilitating the incorporation of textual data into image-based formats within the task.
Usage: Text2Image[text], which generates an image for the provided text by using multimodal models. For example, Text2Image[blue sky].

......

Name: Code Interpreter
Definition: Code Interpreter is a tool that interprets and executes code written in Python. It analyzes the source code line by line and translates it into machine-readable instructions, or directly executes the code and returns the execution results.
Usage: Code[python], which interprets and executes Python code, providing a line-by-line analysis of the source code and translating it into machine-readable instructions. For instance, Code[print("hello world!")].
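The Lookup tool's Ctrl+F-style behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the library's actual implementation: the naive sentence splitting and the per-keyword result counter are simplifying assumptions.

```python
import re

class Lookup:
    """Minimal sketch of the Lookup tool: return the next sentence of the
    last retrieved passage that contains the keyword (Ctrl+F behaviour).
    Sentence splitting and result numbering are simplifying assumptions."""

    def __init__(self, passage: str):
        # Naive split on sentence-ending punctuation; a real tool would
        # use a proper sentence tokenizer.
        self.sentences = [s.strip()
                          for s in re.split(r"(?<=[.!?])\s+", passage)
                          if s.strip()]
        self.seen = {}  # keyword -> number of results already returned

    def __call__(self, keyword: str) -> str:
        matches = [s for s in self.sentences
                   if keyword.lower() in s.lower()]
        n = self.seen.get(keyword, 0)
        if n >= len(matches):
            return "No more results."
        self.seen[keyword] = n + 1  # repeated calls walk through the hits
        return f"(Result {n + 1}/{len(matches)}) {matches[n]}"

passage = ("The Bass Rock is an island in the Firth of Forth. "
           "The lighthouse on the rock was built in 1902. "
           "Tantallon Castle faces the rock from the mainland.")
lookup = Lookup(passage)
print(lookup("rock"))  # (Result 1/3) The Bass Rock is an island in the Firth of Forth.
```

Calling the same keyword again advances to the next matching sentence, which is why the trajectories in Figure 5 show observations like "(Result 1/7) ...".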