Submission and Formatting Instructions for ICML 2024
Explicit memory is the conscious, intentional recollection of factual information, previous experiences, and concepts. Methods for explicit memory compression have been proposed by (Lanchantin et al., 2023) and (Jiang et al., 2023b). These approaches generate a summary of the preceding text and insert it into the generated text, allowing subsequent generation to draw on the summary to produce more coherent output. The downsides of this method include: 1) the generated summary occupies a significant portion of the text length, resulting in shorter generated text; 2) the process of generating a summary is itself autoregressive, leading to a substantial increase in generation time; 3) the generated summary may omit critical information, compromising the accuracy of the resulting text; and 4) a considerable amount of annotated data is required to fine-tune the model, which is costly.

In (Mu et al., 2023), a novel compression method was introduced. It inserts a ‘gist token’ between the prompt and the response and employs a specially designed mask to ensure that the response segment can only extract information from the gist token. During generation, the prompt is compressed into a gist token and the original prompt is then discarded to save resources. This approach effectively reduces memory usage. However, it is not lossless and incurs a significant loss of information. In contrast, our method achieves lossless compression of information into a sequence of ‘<m>’ tokens.

LoMA introduces a parallel compression process into the autoregressive inference process. Assuming a preset compression rate of 1/c and a memory region length of t, the model compresses every tc tokens generated. The compression process is as follows:

1. Select a sequence of tc tokens that the model has already generated or completed predicting as the reading area.

2. Insert t ‘<m>’ tokens at once after the reading area to serve as the memory area.

3. The model performs a single inference on the memory area, but discards the model’s output, retaining only the KV pairs from each layer.

4. Discard the reading area, and the model continues generating text from after the memory area.

3.2. Fine-Tune

In order to endow language models with the compressive memory capabilities introduced earlier, we need to fine-tune the model. During the fine-tuning process, we compel the model to decompress and rephrase the compressed content and require it to continue text generation based solely on the KV in the memory zone. At the same time, we must ensure that the model’s generative capability does not degrade.
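The four-step compression procedure described above can be sketched in a few lines. This is a minimal illustration under assumptions of ours: the token-list interface, the name `compress_step`, and the opaque tuple standing in for the retained per-layer KV pairs are all illustrative, not the authors' implementation.

```python
def compress_step(kv_cache, generated, t, c):
    """One LoMA compression step: fold the last t*c generated tokens
    into t '<m>' memory slots, keeping only their (stubbed) KV pairs."""
    tc = t * c
    reading_area = generated[-tc:]           # step 1: pick the reading area
    memory_area = ["<m>"] * t                # step 2: insert t '<m>' tokens
    # step 3: a single forward pass over the memory area would be run here;
    # the output is discarded and only the KV pairs are kept (stubbed below).
    kv_cache.append(("kv", tuple(reading_area)))
    # step 4: drop the reading area; generation continues after the memory area
    return generated[:-tc] + memory_area, kv_cache
```

Each step thus shortens the visible context by tc − t tokens while the information survives in the cached memory-area KV pairs, which is what bounds the context growth during long generation.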
L_{Repeat} = \sum_{i=0}^{tc-1} L_{CE}(y_{p+i}, \hat{y}_{p+i})    (2)

where y_{p+i} are the input tokens of the reading zone, \hat{y}_{p+i} are the output tokens of the model at the corresponding positions, and L_{CE} is the cross-entropy loss function. The position ids corresponding to each token in the repetition zone are also identical to the position ids at the corresponding positions in the reading zone.

3.2.1. Attention Mask

Under these requirements, we have designed an end-to-end training method. This method divides the data into segments, each consisting of tc tokens, and then adds a memory area and a repetition area after each segment, forming a chunk. All these chunks are then connected together to form the data structure shown in the diagram below.

Furthermore, during the fine-tuning process, we must carefully design the attention masks to ensure:

1. The preservation of the auto-regressive nature: when generating the i-th token, the model can only see the previous i − 1 tokens and not the subsequent tokens.

2. Each token in the memory area can see the reading area and has a bidirectional view of the memory area itself.

3. The repetition area can only access the contents of the memory area and cannot see the contents of any other area, to prevent information leakage.

Figure 1: The specific shape of the mask matrix.

The formal expression of the mask is Equ. 3, where L denotes a lower-triangular matrix of ones and I the identity matrix:

M = \begin{pmatrix} L_{tc \times tc} & 0_{tc \times t} & 0_{tc \times tc} \\ 1_{t \times tc} & 1_{t \times t} & 0_{t \times tc} \\ 0_{tc \times tc} & 1_{tc \times t} & I_{tc \times tc} \end{pmatrix}    (3)

The mask matrix is visually represented in Fig. 1.

We observed that many tokens correspond to columns with small values in the attention matrix, indicating that these contents are not important for the generation of subsequent text. Therefore, in the design of the attention mask, we allow the memory area a bidirectional field of view. This approach enables all ‘<m>’ tokens to communicate and collectively remember the content of the tc reading-area tokens, rather than requiring each ‘<m>’ token to independently remember the content of c potentially unimportant tokens.

The memory area can only access the reading area and is restricted from viewing other content, to avoid interference with the generation of its KV pairs.

During our experiments, we also found that if tokens in the repetition area are allowed to access each other, they tend to use their generative capabilities to repeat the content of the reading area, rather than utilizing the content of the memory area to recapitulate it. Therefore, we prohibit tokens in the repetition area from accessing each other.

3.2.2. Positional Embedding

To ensure the correctness of the position encoding, we need to design distinct position ids for the tokens in the
memory zone. Let us assume that the position ids of the initial tokens in the reading zone are given by Equ. 4:

p, p + 1, ..., p + tc − 1    (4)

Then, the position ids for each ‘<m>’ token in the memory zone are designed as Equ. 5:

p + c − 1, p + 2c − 1, ..., p + tc − 1    (5)

This design ensures that the position ids of each token in the memory zone correspond to the c tokens in the reading zone that it covers. The position ids used by the model to continue generating text after the memory zone start from p + tc.

3.3. Gradient

In this section, we demonstrate that LoMA can still learn the ability to compress memory during training, even without supervising the output of ‘<m>’ tokens.

Analyzing the computation process of a single-layer Transformer block, let x be the input vector to the Transformer block. The computation can be expressed as Equ. 6:

Q = W_Q x, \quad K = W_K x, \quad V = W_V x    (6)

The calculation formula for attention with a mask is Equ. 7:

Attention(Q, K, V) = Softmax\left(\frac{QK^T}{\sqrt{d_k}} \odot M\right) V    (7)

where M is an s × s mask matrix. Suppose the input sequence is organized into a reading area r of length tc, a memory area m of length t, and a repetition area p of the same length as the reading area, making the total length s = t(2c + 1). Therefore, Q, K, and V are each composed of three vectors concatenated together as in Equ. 8:

Q = \begin{pmatrix} q_r \\ q_m \\ q_p \end{pmatrix}, \quad K = \begin{pmatrix} k_r \\ k_m \\ k_p \end{pmatrix}, \quad V = \begin{pmatrix} v_r \\ v_m \\ v_p \end{pmatrix}    (8)

By substituting these expressions into the attention formula, we obtain Equ. 9:

\hat{A} = \begin{pmatrix} q_r k_r^T & q_r k_m^T & q_r k_p^T \\ q_m k_r^T & q_m k_m^T & q_m k_p^T \\ q_p k_r^T & q_p k_m^T & q_p k_p^T \end{pmatrix}    (9)

Multiplying this with the previously designed mask gives Equ. 10:

A = \hat{A} \odot M = \begin{pmatrix} q_r k_r^T \odot L & 0 & 0 \\ q_m k_r^T & q_m k_m^T & 0 \\ 0 & q_p k_m^T & q_p k_p^T \odot I \end{pmatrix}    (10)

Here, ⊙ denotes element-wise multiplication. Without loss of generality, if we ignore the scale factor \sqrt{d_k}, the output of the attention block can be expanded as Equ. 11:

O = Softmax(A)V = \begin{pmatrix} E_1 \exp(A_1) V \\ E_2 \exp(A_2) V \\ \vdots \\ E_s \exp(A_s) V \end{pmatrix}    (11)

where A_i represents the i-th row vector of A, and E_i = 1 / \sum \exp(A_i). Since we are currently only studying the propagation of gradients and are not concerned with the absolute magnitude of the gradients, we can ignore the coefficients E_i. Thus, the first tc rows of O, which represent the output of the reading area, can be expressed as Equ. 12:

O_r = \exp(q_r k_r^T \odot L) v_r    (12)

The rows from s − tc to s of O, which represent the output of the repetition area, can be expressed as Equ. 13:

O_p = \exp(q_p k_m^T) v_m + \exp(q_p k_p^T \odot I) v_p    (13)

Since we do not supervise the output of ‘<m>’ tokens, the loss consists of only two parts: L_{LM} for the reading area and L_{Repeat}. Therefore, the gradient of the loss with respect to x is:

\frac{\partial L}{\partial x} = \frac{\partial L_{LM}}{\partial O_r} \cdot \frac{\partial O_r}{\partial x} + \frac{\partial L_{Repeat}}{\partial O_p} \cdot \frac{\partial O_p}{\partial x}    (14)

It is important to note that in Equ. 14, O_p is a function of k_m and v_m, indicating that the model’s memory capability can receive supervisory signals from the loss in the repetition area.

4. Experiments

4.1. Settings

Model. We conduct experiments based on the pretrained Llama2 7B Chat model. The training framework is Megatron-LM (Shoeybi et al., 2020), using a machine equipped with 8 A100 GPUs. The global batch size is set to 32. The learning rate follows a cosine decay pattern, varying between [5e−5, 5e−6], with a warmup of 100 iterations.
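The block structure of the mask M in Equ. 3, and hence the zero pattern of A in Equ. 10, can be made concrete with a small sketch. This is pure Python with illustrative sizes; the function name `build_mask` and the 0/1 float encoding are our assumptions, not the paper's code.

```python
def build_mask(t, c):
    """Build the s x s LoMA mask M of Equ. 3 as nested lists:
    reading area (causal L), memory area (sees reading + itself,
    bidirectionally), repetition area (sees memory + its own diagonal I)."""
    tc = t * c
    s = 2 * tc + t                          # reading + memory + repetition
    M = [[0.0] * s for _ in range(s)]
    for i in range(tc):                     # reading rows: lower-triangular L
        for j in range(i + 1):
            M[i][j] = 1.0
    for i in range(tc, tc + t):             # memory rows: reading + memory
        for j in range(tc + t):
            M[i][j] = 1.0
    for i in range(tc + t, s):              # repetition rows: memory + self
        for j in range(tc, tc + t):
            M[i][j] = 1.0
        M[i][i] = 1.0                       # identity block I
    return M
```

Multiplying this mask element-wise into \hat{A} zeroes exactly the blocks shown as 0 in Equ. 10, so repetition tokens can draw only on the memory KV pairs and on themselves.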
Special Tokens. The tokenizer of Llama2 has a vocabulary length of 32,000. We added two special tokens with IDs ‘<m>’: 32000 and ‘<r>’: 32001. Accordingly, modifications are needed in the model’s embedding layer: two vectors of 4096 dimensions are added, changing the weight shape from 32000 × 4096 to 32002 × 4096. These two new vectors are initialized from a normal distribution whose mean and variance are derived from the statistics of the original weights.

Framework. Since LoMA requires a custom attention mask and position ids, we are unable to use Flash-Attention to accelerate training and need to make certain modifications to Megatron. The specific modifications are detailed in the appendix.

Data. A small portion extracted from the C4 dataset (Raffel et al., 2020) or the GSM8K dataset (Cobbe et al., 2021) is used in a loop to train the model. The data are preprocessed as described in the previous section on “Attention Mask”. Assuming \hat{s} is the preset training sequence length, we impose the following requirement:

\hat{s} \bmod (2tc + t) = 0

We divide the training space into \hat{s} / (2tc + t) = \bar{n} chunks, adding memory and repetition areas to each chunk. Assuming the sequence length input into the model is s, it is then required that:

2s + \lfloor s/R \rfloor \le \hat{s}

where R is the compression ratio. If s + \bar{n}t(c + 1) < \hat{s}, the sequence is padded to length \hat{s}.

4.2. Main Result

The rapid decrease of L_{Repeat} shows that the model quickly becomes efficient in handling and encoding the information in a compact form without significant loss of detail or accuracy.

The slower decline of L_{LM} is attributed to the vast size of the C4 dataset and to streaming only a very small part of it for training. This limited exposure to the dataset means that the model does not have the opportunity to extensively learn and adapt to the broader range of data it contains. As a result, while it becomes efficient in handling compressed information (as indicated by the rapid decrease in L_{Repeat}), its original generative capabilities (as represented by L_{LM}) do not improve as rapidly.

To verify the generalizability of LoMA, we chose compression ratios of 4 and 8, and memory lengths of 8 and 16, training on the C4 dataset for 5000 iterations, yielding the results in Fig. 3.

Then, we performed inference testing on the GSM8K dataset. We defined two metrics: total accuracy and token accuracy. Total accuracy means that the decoding of the repetition area must be completely identical to the reading area to be considered correct, while token accuracy is the ratio of the number of correct tokens to the total number of tokens. We randomly sampled 100 examples from the GSM8K test set for inference.

As Tab. 1 shows, only a very small dataset is needed for the ‘<m>’ token to nearly completely take on the task of compressing information, and it generalizes well to other datasets. It can be noted that L_{LM} in the figure does not decrease significantly. This is because the C4 dataset is too large, and our strategy was to stream-read a very small part of it without training for multiple epochs. Therefore, if there is capacity for pretraining, we recommend incorporating the LoMA mechanism during pretraining to maintain the generative capability of the model.
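The two inference metrics defined above can be written out precisely in a few lines. This is a sketch under our own assumptions: both areas are plain, non-empty token-id lists of equal length, and the function names are ours.

```python
def total_accuracy(read_tokens, repeat_tokens):
    """1.0 only if the decoded repetition area matches the reading
    area exactly, else 0.0 (per-example 'total accuracy')."""
    return 1.0 if repeat_tokens == read_tokens else 0.0

def token_accuracy(read_tokens, repeat_tokens):
    """Fraction of positions where the repetition area reproduces the
    reading area's token ('token accuracy'); assumes equal lengths."""
    matches = sum(r == p for r, p in zip(read_tokens, repeat_tokens))
    return matches / len(read_tokens)
```

Averaging each metric over the sampled examples gives the dataset-level numbers; total accuracy is the stricter of the two, since a single wrong token zeroes the example.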
Figure 2: The fine-tuning results of LLaMA-2-7B-Chat on the C4 dataset with different hyperparameters. It can be observed that the smaller the ratio or memory length, the faster the convergence of L_{Repeat}, with the ratio being the primary determining variable. Choosing a ratio of 4 or 8 seems appropriate. Note: LLM denotes L_{LM}, LRepeat denotes L_{Repeat}, and Loss denotes L. The same applies to the following figures.

Figure 3: LLaMA-2-7B-Chat, with various parameters, after streaming training for 5000 iterations on the C4 dataset.

Figure 4: The loss reduction of the LLaMA 7B model after 2000 iterations of training on the GSM8K training set.
By doing so, the model would not only excel in compressing information but also retain its effectiveness in general language generation tasks.

5. Conclusion

We propose Lossless Compressed Memory Attention (LoMA), aimed at losslessly compressing information to reduce computational consumption in long-text contexts. We divide the training space into three modules: the reading area, the memory area, and the repetition area. The memory area and repetition area are filled with two special tokens, ‘<m>’ and ‘<r>’, respectively. The attention of the ‘<m>’ tokens in the memory area is bidirectional, to better extract information from the reading area. The repetition area, containing only information from the memory area and consisting solely of ‘<r>’ tokens, carries no information of its own and is tasked with completely restating the text from the reading area. No supervisory signal is set for the memory area, but we demonstrate that the labels in the repetition area can provide indirect supervisory signals to the memory area through gradient backpropagation. We fine-tuned the LLaMA 7B model with LoMA on the C4 and GSM8K datasets, achieving convergence within 2000 iterations. Moreover, we found that information compression generalizes well: models trained on C4 can be seamlessly applied to the GSM8K dataset. We suggest adopting LoMA in pretraining to address the increasingly important long-text scenarios of the future.

References

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training Verifiers to Solve Math Word Problems, November 2021. URL http://arxiv.org/abs/2110.14168. arXiv:2110.14168 [cs].

Gupta, A., Dar, G., Goodman, S., Ciprut, D., and Berant, J. Memory-efficient Transformers via Top-$k$ Attention, June 2021. URL http://arxiv.org/abs/2106.06899. arXiv:2106.06899 [cs].

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7B, October 2023a. URL http://arxiv.org/abs/2310.06825. arXiv:2310.06825 [cs].

Jiang, H., Wu, Q., Lin, C.-Y., Yang, Y., and Qiu, L. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, December 2023b. URL http://arxiv.org/abs/2310.05736. arXiv:2310.05736 [cs].

Lanchantin, J., Toshniwal, S., Weston, J., Szlam, A., and Sukhbaatar, S. Learning to Reason and Memorize with Self-Notes, October 2023. URL http://arxiv.org/abs/2305.00833. arXiv:2305.00833 [cs].

Mu, J., Li, X. L., and Goodman, N. Learning to Compress Prompts with Gist Tokens, July 2023. URL http://arxiv.org/abs/2304.08467. arXiv:2304.08467 [cs].

O’Shea, K. and Nash, R. An Introduction to Convolutional Neural Networks, December 2015. URL http://arxiv.org/abs/1511.08458. arXiv:1511.08458 [cs].

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. ISSN 1533-7928. URL http://jmlr.org/papers/v21/20-074.html.

Ribar, L., Chelombiev, I., Hudlass-Galley, L., Blake, C., Luschi, C., and Orr, D. SparQ Attention: Bandwidth-Efficient LLM Inference, December 2023. URL http://arxiv.org/abs/2312.04985. arXiv:2312.04985 [cs].

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, March 2020. URL http://arxiv.org/abs/1909.08053. arXiv:1909.08053 [cs].

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open Foundation and Fine-Tuned Chat Models, July 2023. URL http://arxiv.org/abs/2307.09288. arXiv:2307.09288 [cs].

Yang, A., Xiao, B., Wang, B., Zhang, B., Bian, C., Yin, C., Lv, C., Pan, D., Wang, D., Yan, D., Yang, F., Deng,
F., Wang, F., Liu, F., Ai, G., Dong, G., Zhao, H., Xu,
H., Sun, H., Zhang, H., Liu, H., Ji, J., Xie, J., Dai, J.,
Fang, K., Su, L., Song, L., Liu, L., Ru, L., Ma, L., Wang,
M., Liu, M., Lin, M., Nie, N., Guo, P., Sun, R., Zhang,
T., Li, T., Li, T., Cheng, W., Chen, W., Zeng, X., Wang,
X., Chen, X., Men, X., Yu, X., Pan, X., Shen, Y., Wang,
Y., Li, Y., Jiang, Y., Gao, Y., Zhang, Y., Zhou, Z., and
Wu, Z. Baichuan 2: Open Large-scale Language Models,
September 2023. URL http://arxiv.org/abs/
2309.10305. arXiv:2309.10305 [cs].
Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti,
C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang,
L., and Ahmed, A. Big Bird: Transformers for Longer
Sequences, January 2021. URL http://arxiv.org/
abs/2007.14062. arXiv:2007.14062 [cs, stat].
Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai,
R., Song, Z., Tian, Y., Ré, C., Barrett, C., Wang, Z.,
and Chen, B. H$_2$O: Heavy-Hitter Oracle for Effi-
cient Generative Inference of Large Language Models,
July 2023. URL http://arxiv.org/abs/2306.
14048. arXiv:2306.14048 [cs].
Zhao, G., Lin, J., Zhang, Z., Ren, X., Su, Q., and
Sun, X. Explicit Sparse Transformer: Concen-
trated Attention Through Explicit Selection, Decem-
ber 2019. URL http://arxiv.org/abs/1912.
11637. arXiv:1912.11637 [cs].