Submission and Formatting Instructions for ICML 2024
Explicit memory is the conscious, intentional recollection of factual information, previous experiences, and concepts. Methods for explicit memory compression have been proposed by (Lanchantin et al., 2023) and (Jiang et al., 2023b). These approaches generate a summary of the preceding text and insert it into the generated text, allowing subsequent generation to draw on the summary to produce more coherent output. The downsides of this method include: 1) the generated summary occupies a significant portion of the text length, resulting in shorter generated text; 2) the process of generating a summary is itself autoregressive, leading to a substantial increase in generation time; 3) the generated summary may omit critical information, compromising the accuracy of the resulting text; and 4) a considerable amount of annotated data is required to fine-tune the model, which is costly.

In (Mu et al., 2023), a novel compression method was introduced. It inserts a ‘gist token’ between the prompt and the response and employs a specially designed mask to ensure that the response segment can only extract information from the gist token. During generation, the prompt is compressed into a gist token and the original prompt is then discarded to save resources. This approach effectively reduces memory usage. However, it is not lossless and incurs a significant loss of information. In contrast, our method achieves lossless compression of information into a sequence of ‘<m>’ tokens.

LoMA introduces a parallel compression process into the autoregressive inference process. Assuming a preset compression rate of 1/c and a memory region length of t, the model compresses every tc tokens generated. The compression process is as follows:

1. Select a sequence of tc tokens that the model has already generated or completed predicting as the reading area.

2. Insert t ‘<m>’ tokens at once after the reading area to serve as the memory area.

3. The model performs a single inference on the memory area, but discards the model’s output, retaining only the KV pairs from each layer.

4. Discard the reading area, and the model continues generating text from after the memory area.

3.2. Fine-Tune

In order to endow language models with the compressive memory capabilities introduced earlier, we need to fine-tune the model. During the fine-tuning process, we compel the model to decompress and rephrase the compressed content and require it to continue text generation based solely on the KV in the memory zone. At the same time, we must ensure that the model’s generative capability does not degrade.
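The four-step compression procedure described above can be sketched in a few lines. This is a minimal illustration under assumptions of ours: the token-list interface, the name `compress_step`, and the opaque tuple standing in for the retained per-layer KV pairs are all illustrative, not the authors' implementation.

```python
def compress_step(kv_cache, generated, t, c):
    """One LoMA compression step: fold the last t*c generated tokens
    into t '<m>' memory slots, keeping only their (stubbed) KV pairs."""
    tc = t * c
    reading_area = generated[-tc:]           # step 1: pick the reading area
    memory_area = ["<m>"] * t                # step 2: insert t '<m>' tokens
    # step 3: a single forward pass over the memory area would be run here;
    # the output is discarded and only the KV pairs are kept (stubbed below).
    kv_cache.append(("kv", tuple(reading_area)))
    # step 4: drop the reading area; generation continues after the memory area
    return generated[:-tc] + memory_area, kv_cache
```

Each step thus shortens the visible context by tc − t tokens while the information survives in the cached memory-area KV pairs, which is what bounds the context growth during long generation.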
L_{Repeat} = \sum_{i=0}^{tc-1} L_{CE}(y_{p+i}, \hat{y}_{p+i})    (2)

where y_{p+i} are the input tokens of the reading zone, \hat{y}_{p+i} are the output tokens of the model at the corresponding positions, and L_{CE} is the cross-entropy loss function. The position ids corresponding to each token in the repetition zone are also identical to the position ids at the corresponding positions in the reading zone.

3.2.1. Attention Mask

Under these requirements, we have designed an end-to-end training method. This method divides the data into segments, each consisting of tc tokens, and then adds a memory area and a repetition area after each segment, forming a chunk. All these chunks are then connected together to form the data structure shown in the diagram below.

Furthermore, during the fine-tuning process, we must carefully design the attention masks to ensure:

1. The preservation of the auto-regressive nature: when generating the i-th token, the model can only see the previous i − 1 tokens and not the subsequent tokens.

2. Each token in the memory area can see the reading area and has a bidirectional view of the memory area itself.

3. The repetition area can only access the contents of the memory area and cannot see the contents of any other area, to prevent information leakage.

Figure 1: The specific shape of the mask matrix.

The formal expression of the mask is Equ. 3, where L denotes a lower-triangular matrix of ones and I the identity matrix:

M = \begin{pmatrix} L_{tc \times tc} & 0_{tc \times t} & 0_{tc \times tc} \\ 1_{t \times tc} & 1_{t \times t} & 0_{t \times tc} \\ 0_{tc \times tc} & 1_{tc \times t} & I_{tc \times tc} \end{pmatrix}    (3)

The mask matrix is visually represented in Fig. 1.

We observed that many tokens correspond to columns with small values in the attention matrix, indicating that these contents are not important for the generation of subsequent text. Therefore, in the design of the attention mask, we allow the memory area a bidirectional field of view. This approach enables all ‘<m>’ tokens to communicate and collectively remember the content of the tc reading-area tokens, rather than requiring each ‘<m>’ token to independently remember the content of c potentially unimportant tokens.

The memory area can only access the reading area and is restricted from viewing other content, to avoid interference with the generation of its KV pairs.

During our experiments, we also found that if tokens in the repetition area are allowed to access each other, they tend to use their generative capabilities to repeat the content of the reading area, rather than utilizing the content of the memory area to recapitulate it. Therefore, we prohibit tokens in the repetition area from accessing each other.

3.2.2. Positional Embedding

To ensure the correctness of the position encoding, we need to design distinct position ids for the tokens in the
memory zone. Let us assume that the position ids of the initial tokens in the reading zone are given by Equ. 4:

p, p + 1, ..., p + tc − 1    (4)

Then, the position ids for each ‘<m>’ token in the memory zone are designed as Equ. 5:

p + c − 1, p + 2c − 1, ..., p + tc − 1    (5)

This design ensures that the position ids of each token in the memory zone correspond to the c tokens in the reading zone that it covers. The position ids used by the model to continue generating text after the memory zone start from p + tc.

3.3. Gradient

In this section, we demonstrate that LoMA can still learn the ability to compress memory during training, even without supervising the output of ‘<m>’ tokens.

Analyzing the computation process of a single-layer Transformer block, let x be the input vector to the Transformer block. The computation can be expressed as Equ. 6:

Q = W_Q x, \quad K = W_K x, \quad V = W_V x    (6)

The calculation formula for attention with a mask is Equ. 7:

Attention(Q, K, V) = Softmax\left(\frac{QK^T}{\sqrt{d_k}} \odot M\right) V    (7)

where M is an s × s mask matrix. Suppose the input sequence is organized into a reading area r of length tc, a memory area m of length t, and a repetition area p of the same length as the reading area, making the total length s = t(2c + 1). Therefore, Q, K, and V are each composed of three vectors concatenated together as in Equ. 8:

Q = \begin{pmatrix} q_r \\ q_m \\ q_p \end{pmatrix}, \quad K = \begin{pmatrix} k_r \\ k_m \\ k_p \end{pmatrix}, \quad V = \begin{pmatrix} v_r \\ v_m \\ v_p \end{pmatrix}    (8)

By substituting these expressions into the attention formula, we obtain Equ. 9:

\hat{A} = \begin{pmatrix} q_r k_r^T & q_r k_m^T & q_r k_p^T \\ q_m k_r^T & q_m k_m^T & q_m k_p^T \\ q_p k_r^T & q_p k_m^T & q_p k_p^T \end{pmatrix}    (9)

Multiplying this with the previously designed mask gives Equ. 10:

A = \hat{A} \odot M = \begin{pmatrix} q_r k_r^T \odot L & 0 & 0 \\ q_m k_r^T & q_m k_m^T & 0 \\ 0 & q_p k_m^T & q_p k_p^T \odot I \end{pmatrix}    (10)

Here, ⊙ denotes element-wise multiplication. Without loss of generality, if we ignore the scale factor \sqrt{d_k}, the output of the attention block can be expanded as Equ. 11:

O = Softmax(A)V = \begin{pmatrix} E_1 \exp(A_1) V \\ E_2 \exp(A_2) V \\ \vdots \\ E_s \exp(A_s) V \end{pmatrix}    (11)

where A_i represents the i-th row vector of A, and E_i = 1 / \sum \exp(A_i). Since we are currently only studying the propagation of gradients and are not concerned with the absolute magnitude of the gradients, we can ignore the coefficients E_i. Thus, the first tc rows of O, which represent the output of the reading area, can be expressed as Equ. 12:

O_r = \exp(q_r k_r^T \odot L) v_r    (12)

The rows from s − tc to s of O, which represent the output of the repetition area, can be expressed as Equ. 13:

O_p = \exp(q_p k_m^T) v_m + \exp(q_p k_p^T \odot I) v_p    (13)

Since we do not supervise the output of ‘<m>’ tokens, the loss consists of only two parts: L_{LM} for the reading area and L_{Repeat}. Therefore, the gradient of the loss with respect to x is:

\frac{\partial L}{\partial x} = \frac{\partial L_{LM}}{\partial O_r} \cdot \frac{\partial O_r}{\partial x} + \frac{\partial L_{Repeat}}{\partial O_p} \cdot \frac{\partial O_p}{\partial x}    (14)

It is important to note that in Equ. 14, O_p is a function of k_m and v_m, indicating that the model’s memory capability can receive supervisory signals from the loss in the repetition area.

4. Experiments

4.1. Settings

Model. We conduct experiments based on the pretrained Llama2 7B Chat model. The training framework is Megatron-LM (Shoeybi et al., 2020), using a machine equipped with 8 A100 GPUs. The global batch size is set to 32. The learning rate follows a cosine decay pattern, varying between [5e−5, 5e−6], with a warmup of 100 iterations.
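The block structure of the mask M in Equ. 3, and hence the zero pattern of A in Equ. 10, can be made concrete with a small sketch. This is pure Python with illustrative sizes; the function name `build_mask` and the 0/1 float encoding are our assumptions, not the paper's code.

```python
def build_mask(t, c):
    """Build the s x s LoMA mask M of Equ. 3 as nested lists:
    reading area (causal L), memory area (sees reading + itself,
    bidirectionally), repetition area (sees memory + its own diagonal I)."""
    tc = t * c
    s = 2 * tc + t                          # reading + memory + repetition
    M = [[0.0] * s for _ in range(s)]
    for i in range(tc):                     # reading rows: lower-triangular L
        for j in range(i + 1):
            M[i][j] = 1.0
    for i in range(tc, tc + t):             # memory rows: reading + memory
        for j in range(tc + t):
            M[i][j] = 1.0
    for i in range(tc + t, s):              # repetition rows: memory + self
        for j in range(tc, tc + t):
            M[i][j] = 1.0
        M[i][i] = 1.0                       # identity block I
    return M
```

Multiplying this mask element-wise into \hat{A} zeroes exactly the blocks shown as 0 in Equ. 10, so repetition tokens can draw only on the memory KV pairs and on themselves.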
Special Tokens. The tokenizer of Llama2 has a vocabulary length of 32,000. We added two special tokens with IDs ‘<m>’: 32000 and ‘<r>’: 32001. Accordingly, modifications are needed in the model’s embedding layer: two vectors of 4096 dimensions are added, changing the weight shape from 32000 × 4096 to 32002 × 4096. These two new vectors are initialized from a normal distribution whose mean and variance are derived from the statistics of the original weights.

Framework. Since LoMA requires a custom attention mask and position ids, we are unable to use Flash-Attention to accelerate training and need to make certain modifications to Megatron. The specific modifications are detailed in the appendix.

Data. A small portion extracted from the C4 dataset (Raffel et al., 2020) or the GSM8K dataset (Cobbe et al., 2021) is used in a loop to train the model. The data are preprocessed as described in the previous section on “Attention Mask”. Assuming \hat{s} is the preset training sequence length, we impose the following requirement:

\hat{s} \bmod (2tc + t) = 0

We divide the training space into \hat{s} / (2tc + t) = \bar{n} chunks, adding memory and repetition areas to each chunk. Assuming the sequence length input into the model is s, it is then required that:

2s + \lfloor s/R \rfloor \le \hat{s}

where R is the compression ratio. If s + \bar{n}t(c + 1) < \hat{s}, the sequence is padded to length \hat{s}.

4.2. Main Result

The rapid decrease of L_{Repeat} shows that the model quickly becomes efficient in handling and encoding the information in a compact form without significant loss of detail or accuracy.

The slower decline of L_{LM} is attributed to the vast size of the C4 dataset and to streaming only a very small part of it for training. This limited exposure to the dataset means that the model does not have the opportunity to extensively learn and adapt to the broader range of data it contains. As a result, while it becomes efficient in handling compressed information (as indicated by the rapid decrease in L_{Repeat}), its original generative capabilities (as represented by L_{LM}) do not improve as rapidly.

To verify the generalizability of LoMA, we chose compression ratios of 4 and 8, and memory lengths of 8 and 16, training on the C4 dataset for 5000 iterations, yielding the results in Fig. 3.

Then, we performed inference testing on the GSM8K dataset. We defined two metrics: total accuracy and token accuracy. Total accuracy means that the decoding of the repetition area must be completely identical to the reading area to be considered correct, while token accuracy is the ratio of the number of correct tokens to the total number of tokens. We randomly sampled 100 examples from the GSM8K test set for inference.

As Tab. 1 shows, only a very small dataset is needed for the ‘<m>’ token to nearly completely take on the task of compressing information, and it generalizes well to other datasets. It can be noted that L_{LM} in the figure does not decrease significantly. This is because the C4 dataset is too large, and our strategy was to stream-read a very small part of it without training for multiple epochs. Therefore, if there is capacity for pretraining, we recommend incorporating the LoMA mechanism during pretraining to maintain the generative capability of the model.
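The two inference metrics defined above can be written out precisely in a few lines. This is a sketch under our own assumptions: both areas are plain, non-empty token-id lists of equal length, and the function names are ours.

```python
def total_accuracy(read_tokens, repeat_tokens):
    """1.0 only if the decoded repetition area matches the reading
    area exactly, else 0.0 (per-example 'total accuracy')."""
    return 1.0 if repeat_tokens == read_tokens else 0.0

def token_accuracy(read_tokens, repeat_tokens):
    """Fraction of positions where the repetition area reproduces the
    reading area's token ('token accuracy'); assumes equal lengths."""
    matches = sum(r == p for r, p in zip(read_tokens, repeat_tokens))
    return matches / len(read_tokens)
```

Averaging each metric over the sampled examples gives the dataset-level numbers; total accuracy is the stricter of the two, since a single wrong token zeroes the example.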
Figure 2: The fine-tuning results of LLaMA-2-7B-Chat on the C4 dataset with different hyperparameters. It can be observed that the smaller the ratio or memory length, the faster the convergence of L_{Repeat}, with the ratio being the primary determining variable. Choosing a ratio of 4 or 8 seems appropriate. Note: LLM denotes L_{LM}, LRepeat denotes L_{Repeat}, and Loss denotes L. The same applies to the following figures.

Figure 3: LLaMA-2-7B-Chat, with various parameters, after streaming training for 5000 iterations on the C4 dataset.

Figure 4: The loss reduction of the LLaMA 7B model after 2000 iterations of training on the GSM8K training set.
By doing so, the model would not only excel in compressing information but also retain its effectiveness in general language generation tasks.

5. Conclusion

We propose Lossless Compressed Memory Attention (LoMA), aimed at losslessly compressing information to reduce computational consumption in long-text contexts. We divide the training space into three modules: the reading area, the memory area, and the repetition area. The memory area and repetition area are filled with two special tokens, ‘<m>’ and ‘<r>’, respectively. The attention of the ‘<m>’ tokens in the memory area is bidirectional, to better extract information from the reading area. The repetition area, containing only information from the memory area and consisting solely of ‘<r>’ tokens, carries no information of its own and is tasked with completely restating the text from the reading area. No supervisory signal is set for the memory area, but we demonstrate that the labels in the repetition area can provide indirect supervisory signals to the memory area through gradient backpropagation. We fine-tuned the LLaMA 7B model with LoMA on the C4 and GSM8K datasets, achieving convergence within 2000 iterations. Moreover, we found that information compression generalizes well: models trained on C4 can be seamlessly applied to the GSM8K dataset. We suggest adopting LoMA in pretraining to address the increasingly important long-text scenarios of the future.

References

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training Verifiers to Solve Math Word Problems, November 2021. URL http://arxiv.org/abs/2110.14168. arXiv:2110.14168 [cs].

Gupta, A., Dar, G., Goodman, S., Ciprut, D., and Berant, J. Memory-efficient Transformers via Top-$k$ Attention, June 2021. URL http://arxiv.org/abs/2106.06899. arXiv:2106.06899 [cs].

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7B, October 2023a. URL http://arxiv.org/abs/2310.06825. arXiv:2310.06825 [cs].

Jiang, H., Wu, Q., Lin, C.-Y., Yang, Y., and Qiu, L. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, December 2023b. URL http://arxiv.org/abs/2310.05736. arXiv:2310.05736 [cs].

Lanchantin, J., Toshniwal, S., Weston, J., Szlam, A., and Sukhbaatar, S. Learning to Reason and Memorize with Self-Notes, October 2023. URL http://arxiv.org/abs/2305.00833. arXiv:2305.00833 [cs].

Mu, J., Li, X. L., and Goodman, N. Learning to Compress Prompts with Gist Tokens, July 2023. URL http://arxiv.org/abs/2304.08467. arXiv:2304.08467 [cs].

O’Shea, K. and Nash, R. An Introduction to Convolutional Neural Networks, December 2015. URL http://arxiv.org/abs/1511.08458. arXiv:1511.08458 [cs].

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. ISSN 1533-7928. URL http://jmlr.org/papers/v21/20-074.html.

Ribar, L., Chelombiev, I., Hudlass-Galley, L., Blake, C., Luschi, C., and Orr, D. SparQ Attention: Bandwidth-Efficient LLM Inference, December 2023. URL http://arxiv.org/abs/2312.04985. arXiv:2312.04985 [cs].

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, March 2020. URL http://arxiv.org/abs/1909.08053. arXiv:1909.08053 [cs].

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open Foundation and Fine-Tuned Chat Models, July 2023. URL http://arxiv.org/abs/2307.09288. arXiv:2307.09288 [cs].

Yang, A., Xiao, B., Wang, B., Zhang, B., Bian, C., Yin, C., Lv, C., Pan, D., Wang, D., Yan, D., Yang, F., Deng,
F., Wang, F., Liu, F., Ai, G., Dong, G., Zhao, H., Xu,
H., Sun, H., Zhang, H., Liu, H., Ji, J., Xie, J., Dai, J.,
Fang, K., Su, L., Song, L., Liu, L., Ru, L., Ma, L., Wang,
M., Liu, M., Lin, M., Nie, N., Guo, P., Sun, R., Zhang,
T., Li, T., Li, T., Cheng, W., Chen, W., Zeng, X., Wang,
X., Chen, X., Men, X., Yu, X., Pan, X., Shen, Y., Wang,
Y., Li, Y., Jiang, Y., Gao, Y., Zhang, Y., Zhou, Z., and
Wu, Z. Baichuan 2: Open Large-scale Language Models,
September 2023. URL http://arxiv.org/abs/
2309.10305. arXiv:2309.10305 [cs].
Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti,
C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang,
L., and Ahmed, A. Big Bird: Transformers for Longer
Sequences, January 2021. URL http://arxiv.org/
abs/2007.14062. arXiv:2007.14062 [cs, stat].
Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai,
R., Song, Z., Tian, Y., Ré, C., Barrett, C., Wang, Z.,
and Chen, B. H$_2$O: Heavy-Hitter Oracle for Effi-
cient Generative Inference of Large Language Models,
July 2023. URL http://arxiv.org/abs/2306.
14048. arXiv:2306.14048 [cs].
Zhao, G., Lin, J., Zhang, Z., Ren, X., Su, Q., and
Sun, X. Explicit Sparse Transformer: Concen-
trated Attention Through Explicit Selection, Decem-
ber 2019. URL http://arxiv.org/abs/1912.
11637. arXiv:1912.11637 [cs].