
Unleashing Infinite-Length Input Capacity for Large-scale Language Models with Self-Controlled Memory System


Xinnian Liang1∗†, Bing Wang1∗†, Hui Huang3†, Shuangzhi Wu2‡,
Peihao Wu2, Lu Lu2, Zejun Ma2 and Zhoujun Li1‡
1 State Key Lab of Software Development Environment, Beihang University, Beijing, China
2 ByteDance AI Lab, Beijing, China
3 Harbin Institute of Technology, Harbin, China
{xnliang,bingwang,lizj}@buaa.edu.cn; {wufurui,mazejun}@bytedance.com;

arXiv:2304.13343v1 [cs.CL] 26 Apr 2023

Abstract

Large-scale Language Models (LLMs) are constrained by their inability to process lengthy inputs. To address this limitation, we propose the Self-Controlled Memory (SCM) system to unleash infinite-length input capacity for large-scale language models. Our SCM system is composed of three key modules: the language model agent, the memory stream, and the memory controller. The language model agent iteratively processes ultra-long inputs and stores all historical information in the memory stream. The memory controller provides the agent with both long-term memory (archived memory) and short-term memory (flash memory) to generate precise and coherent responses. The controller determines which memories from archived memory should be activated and how to incorporate them into the model input. Our SCM system can be integrated with any LLM to enable it to process ultra-long texts without any modification or fine-tuning. Experimental results show that our SCM system enables LLMs that are not optimized for multi-turn dialogue to achieve multi-turn dialogue capabilities comparable to ChatGPT, and to outperform ChatGPT in scenarios involving ultra-long document summarization or long-term conversations. Additionally, we will supply a test set, which covers common long-text input scenarios, for evaluating the abilities of LLMs in processing long documents. [1][2]

* The authors contribute equally and the order is random.
† Contribution during internship at ByteDance Inc.
‡ Corresponding Author
[1] Work in progress.
[2] https://github.com/wbbeyourself/SCM4LLMs

    [User]: What are your hobbies?
    [System]: As an AI language model, I don't have hobbies like humans do. However, I am programmed to assist with various tasks and answer questions. Is there anything specific you need help with?
    [User]: Tell you my hobby: swimming, climbing mountain and traveling.
    ... ... (50+ turns of dialogue about hobby and other topics is omitted here.)
    [User]: We talk about hobbies again. What do you think is my hobby?
    [ChatGPT]: As an AI language model, I don't have access to your personal information or interests, so I can't say for sure what your hobby is. However, I can suggest some common hobbies that people enjoy such as reading, writing, painting, playing sports, cooking, gardening, playing musical instruments, and many more. You can explore different hobbies and find the one that brings you joy and fulfillment.
    [Our SCM System]: Based on our previous conversation, your hobbies include swimming, climbing mountains, and traveling. Is there anything related to those hobbies that you need assistance with?

Figure 1: An example comparing ChatGPT and our SCM system. The orange text indicates the user's current question, bold text denotes related historical information, red text indicates an erroneous response from ChatGPT, and green text denotes the correct response from our system.

1 Introduction

Recently, Large-scale Language Models (LLMs) have attracted significant attention due to their remarkable performance in a diverse range of tasks (Brown et al., 2020a; Zeng et al., 2023; Ouyang et al., 2022; Thoppilan et al., 2022; Workshop et al., 2023; Muennighoff et al., 2022). The strong foundational capabilities of LLMs, such as In-Context Learning (Brown et al., 2020a) and Chain-of-Thought reasoning (Wei et al., 2022c,b), are achieved through large-scale pre-training on massive text corpora and are a contributing factor to their success. Instruction tuning (Raffel et al., 2020; Wei et al., 2022a; Chung et al., 2022) helps LLMs comprehend natural language task descriptions, while Reinforcement Learning with Human Feedback (RLHF) (Schulman et al., 2017; Stiennon et al., 2020; Bai et al., 2022) aligns generated text with human preferences. Together, these capabilities have blurred the boundaries between natural language processing tasks, opening up many new application and research directions for LLMs.

Large Language Models (LLMs) offer numerous advantages, but their utility is hindered by two main factors: the maximum input length and the computational complexity of self-attention during the pre-training phase (Wang et al., 2020). Although some models (Press et al., 2022; OpenAI, 2022) are capable of processing long inputs, they may still struggle to capture crucial contextual information in exceptionally lengthy texts. As demonstrated in Figure 1, even ChatGPT [3] can miss essential context from preceding text because of the accumulation of historical noise, that is, irrelevant or outdated information that can hinder comprehension.

[3] In this study, we utilize OpenAI gpt-3.5-turbo-0301.

To address this limitation, we present the Self-Controlled Memory (SCM) system, which enables Large Language Models (LLMs) to process text of infinite length without any modification or additional training. The input is partitioned into segments and fed to the LLM as observations (inputs). The SCM expands the LLM with a long-term memory (archived memory), a short-term memory (flash memory), and a memory controller. The archived memory preserves all historical information, while the flash memory captures real-time memory information from previous rounds. The memory controller determines when and how to introduce archived information, allowing the LLM to efficiently handle ultra-long text without sacrificing any essential information.

To evaluate the performance of our system, we integrate the SCM with non-dialogue-optimized LLMs and successfully simulate ChatGPT. Our findings indicate that our system outperforms ChatGPT in handling ultra-long inputs or conversations. For summarization tasks, we generate a hierarchical summary of the entire archived memory until the summary length meets the user's specifications. By incorporating information from preceding text into local summaries within the memory, our approach preserves the correlations among the original content, in contrast to the conventional approach of directly generating a hierarchical summary of the entire text. Furthermore, our work is still in progress, and we plan to release a comprehensive evaluation dataset designed for long-text tasks, along with standardized human evaluations to evaluate the effectiveness of different methods.

2 Related Work

Large-scale Language Models. Large-scale Language Models (LLMs) are language models trained on massive amounts of text data, using the Transformer (Vaswani et al., 2017) architecture as their foundation. The earliest Transformer-based pre-trained language model was GPT-1 (Radford et al., 2018). Subsequently, GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020b) were developed with gradually increasing parameter sizes. Of the three, GPT-3 is the largest, with 175B parameters, and exhibits emergent abilities (Wei et al., 2022b,c), which has attracted the attention of both academia and industry.

Since then, many LLMs have emerged, including LaMDA (Thoppilan et al., 2022), PaLM (Chowdhery et al., 2022), OPT (Zhang et al., 2022a), LLaMA (Touvron et al., 2023), BLOOM (Workshop et al., 2023), Galactica (Taylor et al., 2022), and Pangu (Zeng et al., 2021; Ren et al., 2023). One of the most notable works in this line of research, which has attracted widespread industry attention and can be considered a milestone towards Artificial General Intelligence (AGI), is ChatGPT (OpenAI, 2022), which is based on InstructGPT (Ouyang et al., 2022) and optimized for multi-turn dialogue ability. ChatGPT has achieved remarkable performance and surpassed the boundaries between NLP tasks. However, current LLMs, including ChatGPT, face significant limitations when processing tasks involving extremely long inputs.

Long Text Sequence Processing. Handling long text sequences has been a persistent challenge in natural language processing tasks. This problem has become even more prominent with the advent of pre-training and LLMs, as the fixed input length during pre-training and the high cost of expanding it during the pre-training stage limit the ability to process longer inputs. Existing solutions primarily involve replacing the Attention structure during pre-training to reduce computational costs and expanding the pre-training sequence length (Beltagy et al., 2020; Zaheer et al., 2021; Guo et al., 2022; Phang et al., 2022; Dong et al., 2023). An alternative approach (Press et al., 2022) uses a special positional encoding during pre-training to enable the model to learn relative positions and handle longer input texts during inference.

However, the generalizability of these methods and their impact on downstream tasks remain uncertain. In the field of long-text summarization, there are many effective methods. Hierarchical or iterative methods have been used by Wu et al. (2021); Zhang et al. (2022b); Cao and Wang (2022) to handle long texts by decomposing a complex problem into multiple sub-problems. However, these methods fail to capture the relationships among sub-problems.

3 Methodology

The Self-Controlled Memory (SCM) system proposed in this paper aims to give large-scale language models (LLMs) the capability to store long-term memories, allowing them to process lengthy inputs and retain information after multiple interactions with the user.

[Figure 2 diagram. Labels: Memory Stream; Archived Memory; Activation Memory; Flash Memory (Memory From #T-1); Memory Controller; Agent; Observation #T ("Summarize all previous solutions for ..."); Response #T ("You can solve it as follows: 1. xx 2. xx ... ..."); New Memory; step numbers 1 to 6.]

Figure 2: The workflow of our proposed Self-Controlled Memory (SCM) system, where numbers 1-6 represent the sequential process of one iteration with new observation #T.

3.1 System Overview

In this section, we introduce the workflow of our proposed SCM system. As illustrated in Figure 2, our SCM system comprises three modules: a language model agent, a memory stream, and a memory controller. The three modules work together to process lengthy documents and provide more accurate and coherent responses. Our system workflow consists of six explicit steps, which are presented as follows:

1. Input Acquisition: The agent receives an observation in turn T (i.e., ultra-long document input or a user question), either through direct input or from an external source.

2. Memory Activation: Based on the current observation, the memory controller determines whether it is necessary to activate memory for the current user input. If memory activation is warranted, relevant memories (for detailed memory information, refer to § 3.2) are retrieved by executing steps 3 and 4. Otherwise, the process moves directly to step 5. § 3.3.1 provides a comprehensive explanation of the control flow of the memory controller.

3. Memory Retrieval: In this step, we utilize the observation as a query to identify related memories. The rank score of each memory is computed by considering two dimensions: relevance and recency. With respect to relevance, we evaluate how similar the content of the memory is to the observation. With respect to recency, we consider the time elapsed since the memory was last accessed. Subsequently, we retain the top-K ranked memories.

4. Memory Reorganization: In this step, the controller determines whether to use the original memory directly or the summarized memory. If summarized memory is chosen, the original memory will be compressed. § 3.3.2 provides a detailed explanation of the state compression process. Then, the system combines the retrieved memory in a structured manner to serve as background information for response generation.

5. Input Fusion: In this step, we carefully design a prompt that fuses the restructured memory with the present observation to serve as the model's input. A thorough description is given in § 3.4.

6. Response Generation: The model generates a response based on the result of the previous step and incorporates the current interaction, including observation
and response, into the memory stream. Please refer to § 3.4 for further details.

[Figure 3 flowchart. Node labels: Memory Controller; Observation; Is memory retrieval necessary? (Y/N); Retrieve From Archived Memory; Summary or Full Content? (Y/N); Full Content; Summarize Activated Content; Summary; Generator.]

Figure 3: Workflow of the Memory Controller.

    Given a user command, determine whether executing the command requires historical or previous information, or whether it requires recalling the conversation content. Simply answer yes (A) or no (B) without explaining the information:

    Command: [User Input]

Figure 4: English prompt for the necessity of using memory.

    Given a user command, determine if it can be executed correctly based solely on the summary historical information provided. Simply answer yes (A) or no (B), without explaining the information.

    Command: [User Input]

Figure 5: English prompt for whether or not to use the summary of memory.

3.2 Memory Stream

This section provides an overview of the internal structure of the memory stream. The memory stream stores all historical memory items in a designated location named the archived memory center, which can easily achieve high-speed access through cache storage and access tools such as Redis or Pinecone [4]. Each memory item consists of an interaction index, an observation, a system response, and an interaction embedding that represents the semantics of the current interaction. In addition, the Activation Memory stores the retrieved memory set, and the Flash Memory holds the memory of Turn T − 1.

[4] Pinecone: https://www.pinecone.io/

3.3 Memory Controller

This section discusses the reasons for using the memory controller and its workflow, as illustrated in Figure 3. There are three fundamental reasons. Firstly, not all observations, also referred to as user inputs or instructions, require access to historical memory. For example, the user instruction "Tell me a joke" does not require the retrieval of the user's history memory. However, certain user input such as "Do you remember the conclusion we made last week on the fitness diets" requires retrieving past memories. The second reason is that the amount of memory can be enormous, ranging from hundreds to thousands or even tens of thousands of items. A controller is needed to retrieve and filter the memory. The third reason is that the input length of the model is limited, and a controller is needed to choose between using the full text of the memory or a summary of the memory, as the original text can be long and may exceed the maximum length of the model. The next two subsections present the details of the controller's workflow and state compression implementation, respectively.

3.3.1 Memory Controller Workflow

The core of the controller in terms of process control is to ask two questions of the agent:

1. Is it necessary to use memory to accurately answer when executing user commands?

2. Can user commands be executed normally using only the summary of memory?

The first question prompt is shown in Figure 4, while the prompt for the second question is shown in Figure 5. Other language versions of the prompt can be found in § A.1.

If the controller determines that historical memory is needed, memory retrieval is carried out. While retrieving memories, we use the current observation (i.e. user instruction) as a query and evaluate each memory's rank score based on two factors: Recency and Relevance. Recency highly prioritizes memory items accessed
recently, reinforcing the idea that the agent's attention remains on the states of the latest interactions. The relevance factor assigns a higher score to memory items that are related to the current observation. In our implementation, we created an embedding vector for the text description of every memory through the use of a language model [5]. The cosine similarity between the embedding vector of the memory and that of the query observation is calculated to determine relevance. The rank score of each memory is the sum of its recency and relevance scores: rank_score = recency_score + relevance_score. Depending on the length limit, we designate the top k memories with the highest rank scores as activated memories, where k varies between 3 and 10.

    Below is a conversation between a user and an AI assistant. Please provide a summary of the user's question and the assistant's response in one sentence each, with separate paragraphs, while preserving key information as much as possible.

    Conversation:

    User: [user input]

    Assistant: [system response]

    Summary:

Figure 6: Prompt for dialogue memory summarization.

3.4 Response Generation

As mentioned in the introduction, SCM allows non-dialogue-optimized LLMs to emulate ChatGPT. This necessitates providing explicit instructions to prompt LLMs to use activated and flash memory to generate the desired response. Figure 7 shows an English prompt intended for extremely long multi-interaction dialogues, with other language versions included in § A.2.

    Here is a conversation between a user and an AI assistant. Please answer the user's current question based on the history of the conversation:

    History of the conversation:

    [History of Related Turn]

    Previous conversation:

    User: [previous user input]
    Assistant: [previous system response]

    ###
    User: [current user input]
    Assistant:

Figure 7: English prompt for ultra-long dialogue generation.
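As an illustration only, the ranking rule of § 3.3.1 (rank_score = recency_score + relevance_score, keep the top k) and the prompt layout of Figure 7 might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the paper does not define how recency_score is computed, so the exponential decay over elapsed turns is an assumption, as are the names `MemoryItem`, `activate_memories`, `build_prompt`, the way the "[History of Related Turn]" slot is rendered, and the toy two-dimensional embeddings that stand in for real embedding vectors.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MemoryItem:
    """One archived interaction, with the four fields listed in § 3.2."""
    index: int                  # interaction (turn) index
    observation: str            # user input of that turn
    response: str               # system response of that turn
    embedding: Sequence[float]  # interaction embedding (toy vectors here)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Relevance: cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recency_score(item_turn: int, current_turn: int, decay: float = 0.99) -> float:
    """ASSUMPTION: exponential decay in elapsed turns; the paper gives no formula."""
    return decay ** max(current_turn - item_turn, 0)


def activate_memories(memories: Sequence[MemoryItem],
                      query_embedding: Sequence[float],
                      current_turn: int,
                      k: int = 3) -> List[MemoryItem]:
    """Return the k memories with the highest recency + relevance score."""
    def rank_score(m: MemoryItem) -> float:
        return (recency_score(m.index, current_turn)
                + cosine_similarity(m.embedding, query_embedding))
    return sorted(memories, key=rank_score, reverse=True)[:k]


def build_prompt(activated: Sequence[MemoryItem],
                 flash: MemoryItem,
                 user_input: str) -> str:
    """Fill the Figure 7 template with activated memory and flash memory."""
    # ASSUMPTION: each related turn is rendered as a User/Assistant pair.
    history = "\n".join(
        f"User: {m.observation}\nAssistant: {m.response}" for m in activated
    )
    return (
        "Here is a conversation between a user and an AI assistant. "
        "Please answer the user's current question based on the history "
        "of the conversation:\n\n"
        f"History of the conversation:\n\n{history}\n\n"
        "Previous conversation:\n\n"
        f"User: {flash.observation}\nAssistant: {flash.response}\n\n"
        "###\n"
        f"User: {user_input}\nAssistant:"
    )


# Toy archive (hypothetical data) with hand-written 2-d embeddings.
archive = [
    MemoryItem(1, "My hobbies are swimming and traveling.", "Thanks for sharing!", [1.0, 0.0]),
    MemoryItem(2, "What is 2 + 2?", "4", [0.0, 1.0]),
    MemoryItem(3, "Tell me a joke.", "Here is one...", [0.1, 0.9]),
]
flash = archive[-1]  # memory of turn T - 1
activated = activate_memories(archive, query_embedding=[0.9, 0.1], current_turn=4, k=1)
print(build_prompt(activated, flash, "What do you think is my hobby?"))
```

In a real setting the embeddings would come from an embedding model and the archive from a store such as Redis or Pinecone; the sketch keeps everything in memory so the ranking arithmetic is easy to inspect.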
If the controller determines that a summary is sufficient for the instruction to be executed normally, the currently extracted memories are summarized. The details of memory summarization are given in the following subsection.

[5] We used OpenAI embedding model text-embedding-ada-002

3.3.2 Memory Summarization

Memory summarization is a crucial aspect in scenarios such as document summarization, where a single interaction or dialogue turn can have a token length exceeding 3000. It enables stacking multiple memories into an activated memory section. Figure 6 shows the English prompt that is specifically designed for memory summarization in individual interactions (i.e., dialogue tasks). In addition, other language versions of the prompt can be found in § A.3.

4 Experiments

4.1 Settings

Our framework is preliminarily evaluated in two scenarios: ultra-long dialogues and ultra-long document summarization. We conduct experiments to answer three research questions (RQs).

• RQ1. Can SCM system compete with or even outperform ChatGPT within a specific token limit?

• RQ2. Can SCM system scale to provide accurate responses to users' questions, which are related to historical contexts that date back hundreds or even thousands of turns?

• RQ3. Can SCM demonstrate generalization to other scenarios, including long document summarization?

The following experiment evaluates the performance of the text-davinci-003 model without dialogue optimization in comparison to the ChatGPT-Turbo model. The evaluation focuses on the tasks of ultra-long dialogues and ultra-long document summarization.

    [User]: I've got some good news to share... (see below; the dialogue starts here)
    [User]: I feel like I've gained weight recently and I'm planning to lose some. My goal weight is 65 kilograms.
    [System]: That's a great goal! I can help you reach your goal weight. What kind of exercise plan do you have in mind?
    [User]: I'm planning to exercise more regularly, starting with running.
    [System]: That's a great start! How often do you plan to run?
    [User]: I'm aiming to work out three times a week.
    [System]: It sounds like you have a great plan! To help you stay motivated, why don't you set goals for each session?
    ... after 10 days and 500+ turns of dialogue
    [User]: I've got some good news to share - my weight loss has been going well lately and I'm down to 68kg now.
    [System]: Congratulations! That's great news. What have you been doing to help you reach this goal?
    [User]: It's because I've recently started enjoying sports like swimming and basketball again.
    [System]: That's wonderful! It's great that you're finding activities that you enjoy and can stick with.
    [User]: Do you remember my first sport?
    [System]: Yes, you mentioned that you wanted to start running as your first sport. Are you still running?

Figure 8: Ultra-long dialogue example.

4.2 Qualitative Study

RQ1. Can SCM system compete with or even outperform ChatGPT within a specific token limit? Yes.
The example in Figure 1 includes 4000 tokens, wherein the user inquired about their hobbies, discussed 100+ turns ago with the agent. The SCM system provides an accurate response to the query, demonstrating its memory-enhanced capabilities. In contrast, it appears that ChatGPT was distracted by a considerable amount of irrelevant historical data.

RQ2. Can SCM system scale to provide accurate responses to users' questions, which are related to historical contexts that date back hundreds or even thousands of turns? Yes.
The example presented in Figure 8 illustrates an ultra-long dialogue comprising over 500 turns. At the outset, the user states that his goal is to reduce weight and that he intends to start a running regime. Subsequently, the user and the model converse daily about progress towards the weight loss goal, among other conversation topics. After ten days, the length of the dialogue reaches 10,000 tokens. The user then asks the model "Do you remember my first sport?". Our SCM system accurately responds to this question.

RQ3. Can SCM demonstrate generalization to other scenarios, including long document summarization? Yes.
Figure 9 illustrates an instance of summarizing an extremely long document. Specifically, the report pertains to the unveiling of GPT-4 by OpenAI. Summarizing documents that exceed 4,000 characters poses a challenge for conventional models, thus necessitating the splitting and individual summarization of document parts, which are then united. Nonetheless, this method can lose the dependency relationship between paragraphs. Our framework utilizes an iterative summarization procedure. While summarizing paragraphs, our approach relies on earlier relevant summary memories to generate more precise summaries. Ultimately, the framework incorporates a divide-and-conquer strategy to generate the final document summary. The final summary obtained through the divide-and-conquer method provides a comprehensive summary by utilizing information from each document block. Furthermore, our iterative summary paradigm contains a memory-enhancement feature that allows topic-specific summaries to be generated by integrating a
question-asking methodology during single block summarization. For instance, if a user poses a question such as "Please provide a summary of the technical details and evaluation findings of GPT4 in image processing", the model will access prior summary memories and extract the relevant content. We will continue to improve this aspect in the future.

Figure 9: Ultra-long iterative and hierarchical summarization example.

5 Limitations and Risks

Limitations. A lack of appropriate datasets for evaluating the handling of extremely lengthy texts has resulted in our model being validated solely through manual verification. This method, however, is inadequate for evaluating different scenarios comprehensively and objectively. Therefore, we aim to construct a specific test set that incorporates various key indicators essential for processing long texts in diverse settings. This test set will be accompanied by a manual evaluation standard to enable a more equitable comparison with relevant methods. Moreover, we will assess the efficacy of our system on more open-source models that possess single-turn instruction comprehension capability.

Risks. Our system can be attached to any LLM, which may be prone to factual errors, delusions, toxic language, and malicious responses. Consequently, we restrict the usage of our system to academic research purposes for now.

6 Conclusion and Future Work

In this paper, we propose a Self-Controlled Memory (SCM) system to extend the input length of any LLM to an unlimited length and effectively capture useful information from all historical information. This method does not require any training or modification of models and has strong applicability. We validated the effectiveness of our method through manual evaluation of ChatGPT and of the text-davinci-003 model equipped with our system, demonstrating superior performance in certain aspects of long-text scenarios compared to ChatGPT.

Our future work will focus on releasing a comprehensive test set and its manual evaluation criteria, and testing our system on various open-source models currently available.

References

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec,
Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020a. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020b. Language models are few-shot learners.

Shuyang Cao and Lu Wang. 2022. HIBRIDS: Attention with hierarchical biases for structure-aware long document summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 786–807, Dublin, Ireland. Association for Computational Linguistics.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling language modeling with pathways.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2022. Scaling instruction-finetuned language models.

Chenhe Dong, Yinghui Li, Haifan Gong, Miaoxin Chen, Junxin Li, Ying Shen, and Min Yang. 2023. A survey of natural language generation. ACM Comput. Surv., 55(8):173:1–173:38.

Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, and Yinfei Yang. 2022. LongT5: Efficient text-to-text transformer for long sequences. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 724–736, Seattle, United States. Association for Computational Linguistics.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2022. Crosslingual generalization through multitask finetuning.

OpenAI. 2022. Introducing ChatGPT.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.

Jason Phang, Yao Zhao, and Peter J. Liu. 2022. Investigating efficiently extending transformers for long input summarization.

Ofir Press, Noah Smith, and Mike Lewis. 2022. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing Huang, Yadao Wang, Weichao Wang, Pengfei Li, Xiaoda Zhang, Alexander Podolskiy, Grigory Arshinov, Andrey Bout, Irina Piontkovskaya, Jiansheng Wei, Xin Jiang, Teng Su, Qun Liu, and Jun Yao. 2023. Pangu-Σ: Towards trillion parameter language model with sparse heterogeneous computing.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms.

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. 2020. Learning to summarize from human feedback. CoRR, abs/2009.01325.

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A large language model for science.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022a. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022b. Emergent abilities of large language models. Transactions on Machine Learning Research. Survey Certification.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022c. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.

BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, et al. 2023. BLOOM: A 176B-parameter open-access multilingual language model.

Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. 2021. Recursively summarizing books with human feedback.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Manzil Zaheer, Guru Guruganesh, Avinava Dubey,
Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Joshua Ainslie, Chris Alberti, Santiago Ontanon,
Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang,
YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, and Amr Ahmed. 2021. Big bird: Transformers for
Amin Ghafouri, Marcelo Menegali, Yanping Huang, longer sequences.
Maxim Krikun, Dmitry Lepikhin, James Qin, De-
hao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang,
Roberts, Maarten Bosma, Vincent Zhao, and etc. Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu,
2022. Lamda: Language models for dialog appli- Wendi Zheng, Xiao Xia, Weng Lam Tam, Zix-
cations. uan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen,
Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Tang. 2023. GLM-130b: An open bilingual pre-
Martinet, Marie-Anne Lachaux, Timothée Lacroix, trained model. In The Eleventh International Con-
Baptiste Rozière, Naman Goyal, Eric Hambro, ference on Learning Representations (ICLR).
Faisal Azhar, Aurelien Rodriguez, Armand Joulin,
Edouard Grave, and Guillaume Lample. 2023. Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang,
Llama: Open and efficient foundation language mod- Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang
els. Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li,
Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang,
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Jianfeng Yu, Qi Guo, Yue Yu, Yan Zhang, Jin
Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Wang, Hengtao Tao, Dasen Yan, Zexuan Yi, Fang
Kaiser, and Illia Polosukhin. 2017. Attention is all Peng, Fangqing Jiang, Han Zhang, Lingfeng Deng,
you need. In Advances in Neural Information Pro- Yehong Zhang, Zhe Lin, Chao Zhang, Shaojie
cessing Systems, volume 30. Curran Associates, Inc. Zhang, Mingyue Guo, Shanzhi Gu, Gaojun Fan,
Yaowei Wang, Xuefeng Jin, Qun Liu, and Yonghong
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Tian. 2021. Pangu-α: Large-scale autoregres-
Fang, and Hao Ma. 2020. Linformer: Self-attention sive pretrained chinese language models with auto-
with linear complexity. parallel computation.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel
Artetxe, Moya Chen, Shuohui Chen, Christopher De-
wan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mi-
haylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel
Simig, Punit Singh Koura, Anjali Sridhar, Tianlu
Wang, and Luke Zettlemoyer. 2022a. Opt: Open pre-
trained transformer language models.
Yusen Zhang, Ansong Ni, Ziming Mao, Chen Henry
Wu, Chenguang Zhu, Budhaditya Deb, Ahmed
Awadallah, Dragomir Radev, and Rui Zhang. 2022b.
Summn : A multi-stage summarization framework
for long input dialogues and documents. In Proceed-
ings of the 60th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Pa-
pers), pages 1592–1604, Dublin, Ireland. Associa-
tion for Computational Linguistics.

A Prompt List
A.1 Prompt For Memory Controller

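Given a user instruction, determine whether executing it requires historical information or preceding context, or requires recalling the conversation content. Answer only yes (A) or no (B); no explanation is needed:

Instruction: [user input]

Figure 10: Prompt of the memory controller (translated from Chinese).

A.2 Prompt for Dialogue Generation

The following is a conversation between a user and an AI assistant. Based on the historical dialogue content, answer the user's current question:

Relevant historical dialogue:

[content of historical dialogue turns]

Previous turn:

[content of the previous turn]

###

User: [user question]

A:

Figure 11: Prompt of ultra-long dialogue generation (translated from Chinese).

A.3 Prompt for Dialogue State Compression

The following is a conversation between a user and an AI assistant. Write a one-sentence user summary and a one-sentence assistant summary, listed in separate paragraphs, preserving as much of the key information in the user's question and the assistant's answer as possible.

Dialogue content:

User: [user input]
Assistant: [system reply]

Summary:

Figure 12: Prompt of ultra-long dialogue summarization (translated from Chinese).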
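As an illustration of how the memory-controller prompt in A.1 might be used, the following is a minimal Python sketch. The template text is an English rendering of Figure 10. The function name `needs_memory`, the `llm` callable, and the answer-parsing rule are assumptions made for this example and are not taken from the paper.

```python
import re
from typing import Callable

# English rendering of the memory-controller prompt (Figure 10).
CONTROLLER_PROMPT = (
    "Given a user instruction, determine whether executing it requires "
    "historical information or preceding context, or requires recalling the "
    "conversation content. Answer only yes (A) or no (B); no explanation is "
    "needed:\n\n"
    "Instruction: {user_input}"
)


def needs_memory(user_input: str, llm: Callable[[str], str]) -> bool:
    """Return True if the model answers (A), i.e. memory should be consulted.

    `llm` is a hypothetical callable mapping a prompt string to a reply string.
    """
    reply = llm(CONTROLLER_PROMPT.format(user_input=user_input)).strip().upper()
    # Assumed parsing rule: prefer a parenthesised "(A)"/"(B)", otherwise
    # accept a bare leading letter. Anything else is treated as "no".
    match = re.search(r"\(([AB])\)", reply) or re.match(r"([AB])\b", reply)
    return bool(match) and match.group(1) == "A"
```

Treating an unparseable reply as "no" is one possible default; a system that prefers recall over latency could instead default to "yes".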

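The dialogue-generation prompt (A.2) and the dialogue state compression prompt (A.3) are plain templates with slots, so they can be assembled with ordinary string formatting. The sketch below uses English renderings of Figures 11 and 12. The helper names and the choice to join retrieved history turns with newlines are assumptions for illustration only.

```python
from typing import Sequence

# English rendering of the dialogue-generation prompt (Figure 11).
GENERATION_PROMPT = (
    "The following is a conversation between a user and an AI assistant. "
    "Based on the historical dialogue content, answer the user's current "
    "question:\n\n"
    "Relevant historical dialogue:\n\n{history}\n\n"
    "Previous turn:\n\n{previous_turn}\n\n"
    "###\n\n"
    "User: {question}\n\n"
    "Assistant:"
)

# English rendering of the summarization prompt (Figure 12).
SUMMARY_PROMPT = (
    "The following is a conversation between a user and an AI assistant. "
    "Write a one-sentence user summary and a one-sentence assistant summary, "
    "listed in separate paragraphs, preserving as much of the key information "
    "in the user's question and the assistant's answer as possible.\n\n"
    "Dialogue content:\n\n"
    "User: {user_input}\n"
    "Assistant: {system_reply}\n\n"
    "Summary:"
)


def build_generation_prompt(
    history: Sequence[str], previous_turn: str, question: str
) -> str:
    """Fill the generation template; history turns are joined by newlines (assumed)."""
    return GENERATION_PROMPT.format(
        history="\n".join(history),
        previous_turn=previous_turn,
        question=question,
    )


def build_summary_prompt(user_input: str, system_reply: str) -> str:
    """Fill the summarization template for a single dialogue turn."""
    return SUMMARY_PROMPT.format(user_input=user_input, system_reply=system_reply)
```

Both templates end with an open slot ("Assistant:" and "Summary:"), so the model's completion is the answer or the summary itself.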