
MEGALODON: Efficient LLM Pretraining and Inference with Unlimited Context Length

Xuezhe Ma^π∗  Xiaomeng Yang^µ∗  Wenhan Xiong^µ  Beidi Chen^κ  Lili Yu^µ
Hao Zhang^δ  Jonathan May^π  Luke Zettlemoyer^µ  Omer Levy^µ  Chunting Zhou^µ∗

^µ AI at Meta   ^π University of Southern California
^κ Carnegie Mellon University   ^δ University of California San Diego

Abstract
The quadratic complexity and weak length extrapolation of Transformers limit
their ability to scale to long sequences, and while sub-quadratic solutions like linear
attention and state space models exist, they empirically underperform Transformers
in pretraining efficiency and downstream task accuracy. We introduce MEGALODON,
a neural architecture for efficient sequence modeling with unlimited context length.
MEGALODON inherits the architecture of MEGA (exponential moving average with
gated attention), and further introduces multiple technical components to improve
its capability and stability, including the complex exponential moving average
(CEMA), the timestep normalization layer, the normalized attention mechanism,
and pre-norm with two-hop residual configuration. In a controlled head-to-head
comparison with LLAMA2, MEGALODON achieves better efficiency than the
Transformer at the scale of 7 billion parameters and 2 trillion training tokens.
MEGALODON reaches a training loss of 1.70, landing mid-way between LLAMA2-7B
(1.75) and 13B (1.67). The improvements of MEGALODON over Transformers are
robust across a range of benchmarks spanning different tasks and modalities.
Code: https://github.com/XuezheMax/megalodon

1 Introduction
In many real-world applications, such as multi-turn conversation, long-document comprehension,
and video generation, large language models (LLMs) must efficiently process long sequential data,
understand internal long-range dynamics, and generate coherent output. The Transformer architecture (Vaswani et al., 2017), despite its remarkable capabilities, faces challenges with quadratic
computational complexity and limited inductive bias for length generalization, making it inefficient
for long sequence modeling (Wang et al., 2024; Zhou et al., 2024). Even with recently proposed
distributed attention solutions (Li et al., 2023b; Liu et al., 2024), computing a single training step of a
7B parameter model over a 1M-token sequence is more than 100 times slower than performing the
equivalent computation using 256 separate sequences of 4K tokens each.
Techniques like efficient attention mechanisms (Tay et al., 2020; Ma et al., 2021) and structured state
space models (Gu et al., 2022a; Poli et al., 2023; Gu and Dao, 2023) have been introduced to overcome
these limitations, aiming to enhance scalability and performance. However, the practical application
of these methods still falls short of Transformers (Tay et al., 2022; Gu and Dao, 2023). This work
introduces an unlimited context model that outperforms the canonical Transformer architecture on
real-world language modeling.

∗ Equal Contribution. Xiaomeng Yang’s work was done at AI at Meta. Correspondence to chuntinz@meta.com

Preprint. Under review.


[Figure 1 plot: train loss (y-axis, 1.6 to 2.2) vs. training tokens in billions (x-axis, 0 to 2000) for Llama2 7B, Llama2 13B, and Megalodon 7B.]
Figure 1: Negative log-likelihood (NLL) for MEGALODON-7B, LLAMA2-7B and LLAMA2-13B w.r.t. processed tokens during training.
Table 1: Performance on standard academic benchmarks, compared to open-source base models.
We report model size, context length, and total data tokens during model pretraining. "–" indicates
that the number was not reported in the original paper.
Model       Size  Tokens  Context  MMLU  BoolQ  HellaSw  PIQA  SIQA  WinoG  Arc-e  Arc-c  NQ    TQA
Mamba       3B    0.6T    2K       26.2  71.0   71.0     78.1  –     65.9   68.2   41.7   –     –
RWKV        7B    1.1T    4K       –     –      70.8     77.3  –     68.4   74.9   46.1   –     –
MPT         7B    1T      4K       26.8  75.0   76.4     80.6  48.5  68.3   70.2   42.6   20.8  50.4
Mistral     7B    –       16K      60.1  83.2   81.3     82.2  47.0  74.2   80.0   54.9   23.2  62.5
Gemma       8B    6T      8K       64.3  83.2   81.2     81.2  51.8  72.3   81.5   53.2   23.0  63.4
LLAMA2      13B   2T      4K       54.8  81.7   80.7     80.5  50.3  72.8   77.3   49.4   31.2  65.1
LLAMA2      7B    2T      4K       45.3  77.4   77.2     78.8  48.3  69.2   75.2   45.9   25.7  58.5
MEGALODON   7B    2T      32K      49.8  80.5   77.5     80.1  49.6  71.4   79.8   53.1   25.7  60.5

We introduce MEGALODON, an improved MEGA architecture (Ma et al., 2023), which harnesses
the gated attention mechanism together with the classical exponential moving average (EMA) (Hunter, 1986)
approach (§2). To further improve the capability and efficiency of MEGALODON on large-scale long-context
pretraining, we propose multiple novel technical components. First, MEGALODON introduces
the complex exponential moving average (CEMA) component, which extends the multi-dimensional
damped EMA in MEGA to the complex domain (§3.1). Then, MEGALODON proposes the timestep
normalization layer, which generalizes the group normalization layer (Wu and He, 2018) to auto-regressive
sequence modeling tasks to allow normalization along the sequential dimension (§3.2). To
improve large-scale pretraining stability, MEGALODON further proposes normalized attention (§3.3),
together with a pre-norm with two-hop residual configuration obtained by modifying the widely adopted pre-
and post-normalization methods (§3.4). By simply chunking input sequences into fixed blocks, as is
done in MEGA-chunk (Ma et al., 2023), MEGALODON achieves linear computational and memory
complexity in both model training and inference.
Empirically, we demonstrate the potential of MEGALODON as a general architecture for modeling
long sequences by evaluating its performance across multiple scales of language modeling, as well
as on downstream domain-specific tasks. Through a direct comparison with LLAMA2, while controlling
for data and compute, MEGALODON-7B significantly outperforms the state-of-the-art variant of the
Transformer used to train LLAMA2-7B (Touvron et al., 2023), both on training perplexity (Figure 1)
and across downstream benchmarks (Table 1). Evaluations on long-context modeling, including
perplexity at context lengths up to 2M and long-context QA tasks in Scrolls (Parisotto et al.,
2020), prove MEGALODON's ability to model sequences of unlimited length. Additional experimental
results on small/medium-scale benchmarks, including LRA (Tay et al., 2021), ImageNet (Deng et al.,
2009), Speech Commands (Warden, 2018), WikiText-103 (Merity et al., 2017) and PG19 (Rae et al.,
2019), demonstrate the robust improvements of MEGALODON across scales and modalities.

2 Background: Moving Average Equipped Gated Attention (MEGA)
In this section, we set up notation, briefly review the key components of the MEGA architecture (Ma
et al., 2023), and discuss the existing problems in MEGA.
Following the notation in MEGA, we use X = {x_1, x_2, . . . , x_n} ∈ R^{n×d} and Y =
{y_1, y_2, . . . , y_n} ∈ R^{n×d} to denote the input and output sequences of length n, and assume
the representations of the input and output sequences have the same dimension d.

2.1 Multi-dimensional Damped EMA

MEGA embeds an EMA component into the calculation of the attention matrix to incorporate
inductive biases across the timestep dimension. Concretely, the multi-dimensional damped EMA first
expands each dimension of the input sequence X individually into h dimensions via an expansion
matrix β ∈ R^{d×h}, then applies damped EMA to the h-dimensional hidden space. Formally, for each
dimension j ∈ {1, 2, . . . , d}:

u_t^{(j)} = \beta_j x_{t,j}
h_t^{(j)} = \alpha_j \odot u_t^{(j)} + (1 - \alpha_j \odot \delta_j) \odot h_{t-1}^{(j)}        (1)
y_{t,j} = \eta_j^\top h_t^{(j)}

where u_t^{(j)} ∈ R^h is the expanded h-dimensional vector for the j-th dimension at timestep t,
α ∈ (0, 1)^{d×h} and δ ∈ (0, 1)^{d×h} are the decaying and damping factors, respectively, h_t^{(j)} ∈ R^h is the EMA
hidden state for the j-th dimension at timestep t, and η ∈ R^{d×h} is the projection matrix that maps the
h-dimensional hidden state back to the 1-dimensional output y_{t,j} ∈ R.
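To make the recurrence concrete, below is a minimal PyTorch sketch of Eq. (1). The parameter names and the explicit timestep loop are ours for clarity; MEGA computes this recurrence as a convolution for efficiency, so this is an illustration rather than the reference implementation.

```python
import torch

def damped_ema(x, alpha, delta, beta, eta):
    """Illustrative multi-dimensional damped EMA, following Eq. (1).

    x:     (n, d) input sequence
    alpha: (d, h) decaying factors in (0, 1)
    delta: (d, h) damping factors in (0, 1)
    beta:  (d, h) expansion of each input dimension into h dimensions
    eta:   (d, h) projection of the h-dimensional state back to a scalar
    Returns y of shape (n, d).
    """
    n, d = x.shape
    h_state = torch.zeros_like(alpha)            # h_0^(j) = 0 for every dimension j
    outputs = []
    for t in range(n):
        u = beta * x[t].unsqueeze(-1)            # u_t^(j) = beta_j * x_{t,j}, shape (d, h)
        h_state = alpha * u + (1.0 - alpha * delta) * h_state
        outputs.append((eta * h_state).sum(dim=-1))   # y_{t,j} = eta_j^T h_t^(j)
    return torch.stack(outputs)                  # (n, d)
```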

2.2 Moving Average Equipped Gated Attention

In the gated attention mechanism in MEGA, the output from EMA (1) is used to compute the
shared representation (Hua et al., 2022), because it encodes contextual information through EMA.
Subsequently, MEGA introduces the reset gate and the update gate, and computes the candidate activation
with the update gate and the residual connection. The technical details are provided in Appendix A.

2.3 Existing Problems in M EGA

To reduce the quadratic complexity of the full attention mechanism, MEGA simply splits the sequences
of queries, keys and values in (14-16) into chunks of length c. The attention in (17) is applied to
each chunk individually, yielding linear complexity O(kc^2) = O(nc), where k = n/c is the number of
chunks. Technically, the EMA sub-layer in MEGA helps capture local contextual information near each token,
mitigating the problem of losing contextual information beyond chunk boundaries in the chunk-wise attention.
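As a rough illustration of why the chunk-wise variant is linear in n, the sketch below (our own simplification, not MEGA's actual attention function, gating, or EMA sub-layer) applies causal attention independently within each length-c chunk, assuming n is a multiple of c.

```python
import torch
import torch.nn.functional as F

def chunkwise_causal_attention(q, k, v, c):
    """Attention applied independently to each chunk of length c.

    q, k, v: (n, d) with n divisible by c. Each of the n/c chunks costs
    O(c^2), so the total cost is O((n/c) * c^2) = O(n * c).
    """
    n, d = q.shape
    q, k, v = (t.reshape(n // c, c, d) for t in (q, k, v))
    scores = q @ k.transpose(-1, -2) / d ** 0.5           # (n/c, c, c)
    mask = torch.ones(c, c, dtype=torch.bool).tril()      # causal mask inside a chunk
    scores = scores.masked_fill(~mask, float("-inf"))
    out = F.softmax(scores, dim=-1) @ v                   # (n/c, c, d)
    return out.reshape(n, d)
```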
Despite the impressive successes of MEGA, it still suffers from its own problems: i) the performance of
MEGA with chunk-wise attention still falls behind that of MEGA with full attention, due to the limited
expressiveness of the EMA sub-layer in MEGA; ii) for different tasks and/or data types, there are
architectural divergences in the final MEGA architectures; for example, different normalization
layers, normalization patterns (pre-norm vs. post-norm) and attention functions (f(·) in (17)) are
applied to different data types (see Ma et al. (2023) for details); iii) there is no empirical evidence
that MEGA is scalable for large-scale pretraining.

3 MEGALODON
To address the aforementioned problems of MEGA, in this section we describe the novel technical
advancements of MEGALODON.

3.1 CEMA: Extending Multi-dimensional Damped EMA to Complex Domain

As discussed in Ma et al. (2023), the EMA component can be regarded as a simplified state space
model with a diagonal state matrix.

[Figure 2 sketch: three panels, (a) Layer Norm, (b) Group Norm, (c) Timestep Norm, each indicating the (timestep, feature) region used to compute means and variances.]
Figure 2: Normalization methods. The elements in blue or pink are the regions to compute means and variances. We omit the batch dimension for simplicity.

Directly inspired by Gu et al. (2022b), and since almost all matrices diagonalize over the complex plane,
a straightforward idea to improve the capability of EMA is to extend it to work over the complex number
system C. We propose the complex exponential moving average (CEMA), which rewrites Eq. (1) as:
h_t^{(j)} = \alpha_j (\cos\theta_j + i \sin\theta_j) \odot u_t^{(j)} + (1 - \alpha_j \odot \delta_j)(\cos\theta_j + i \sin\theta_j) \odot h_{t-1}^{(j)}
y_{t,j} = \mathrm{Re}(\eta_j^\top h_t^{(j)})        (2)

where α, δ ∈ R^{d×h} are the same real-valued parameters as in EMA. Different from EMA, η ∈ C^{d×h}
in CEMA are complex numbers, and θ_j ∈ R^h, j ∈ {1, 2, . . . , d}, are the h arguments. To uniformly
space the h arguments over the period 2π, we parameterize θ_j as:

\theta_{j,k} = \omega_j \frac{2\pi k}{h}, \quad \forall k \in \{1, 2, \ldots, h\}        (3)

where the learnable parameter ω ∈ R^d depicts the d base angles. By decaying the absolute value of
each h_t, CEMA preserves the decaying structure in the kernel weights, which is a key principle behind the
success of convolutional models on long sequence modeling (Li et al., 2023c).
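The following PyTorch sketch spells out the CEMA recurrence of Eq. (2)-(3). Parameter names, the explicit loop, and the absence of any constraint enforcing the decaying absolute value are our simplifications for illustration; this is not the paper's implementation.

```python
import math
import torch

def cema(x, alpha, delta, beta, omega, eta_real, eta_imag):
    """Illustrative complex EMA (CEMA), following Eq. (2) and (3).

    x: (n, d); alpha, delta, beta: (d, h) real parameters as in EMA;
    omega: (d,) base angles; eta_real, eta_imag: (d, h) real and imaginary
    parts of the complex projection eta in C^{d x h}.
    """
    n, d = x.shape
    h = alpha.shape[1]
    k = torch.arange(1, h + 1, dtype=torch.float32)
    theta = omega.unsqueeze(-1) * 2.0 * math.pi * k / h       # theta_{j,k}, Eq. (3)
    rot = torch.polar(torch.ones_like(theta), theta)          # cos(theta) + i*sin(theta)
    eta = torch.complex(eta_real, eta_imag)

    h_state = torch.zeros(d, h, dtype=torch.cfloat)
    outputs = []
    for t in range(n):
        u = (beta * x[t].unsqueeze(-1)).to(torch.cfloat)      # u_t^(j) = beta_j * x_{t,j}
        h_state = alpha * rot * u + (1.0 - alpha * delta) * rot * h_state
        outputs.append((eta * h_state).sum(dim=-1).real)      # y_{t,j} = Re(eta_j^T h_t^(j))
    return torch.stack(outputs)                               # (n, d)
```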

3.2 Timestep Normalization

Despite the impressive performance of Layer Normalization combined with the Transformer, it is obvious
that layer normalization cannot directly reduce the internal covariate shift along the spatial dimension
(a.k.a. the timestep or sequential dimension) (Ioffe and Szegedy, 2015). Group Normalization (Wu and
He, 2018) normalizes hidden representations both along the timestep dimension and over a subset of
the feature dimensions, which has obtained improvements over Layer Normalization on a range of
computer vision tasks. However, it cannot be directly applied to Transformers for auto-regressive
sequence modeling, due to the leakage of future information via the mean and variance across the
timestep dimension.
In MEGALODON, we extend Group Normalization to the auto-regressive case by computing the
cumulative mean and variance. Formally, suppose an input sequence X = {x_1, x_2, . . . , x_n} ∈ R^{n×d}
and k groups along the feature dimension with d_g = d/k elements per group. Then, the mean and
variance of the first group at timestep t ∈ {1, 2, . . . , n} are:

\mu_t = \frac{1}{t \cdot d_g} \sum_{i=1}^{t} \sum_{j=1}^{d_g} x_{i,j}, \qquad \sigma_t^2 = \frac{1}{t \cdot d_g} \sum_{i=1}^{t} \sum_{j=1}^{d_g} (x_{i,j} - \mu_t)^2        (4)

Figure 2 illustrates Layer Normalization and Timestep Normalization. To efficiently and precisely
calculate the cumulative mean and variance at each timestep, we provide a hardware-friendly
implementation for modern hardware (GPUs) (see Appendix B.1).
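A minimal, unoptimized sketch of the cumulative statistics in Eq. (4) is shown below; this is our own naive version, without the learned affine parameters or the hardware-friendly cumulative kernel of Appendix B.1. It normalizes each feature group with means and variances accumulated only over past and current timesteps, so no future information leaks.

```python
import torch

def timestep_norm(x, k, eps=1e-5):
    """Naive timestep normalization over an (n, d) sequence with k feature groups."""
    n, d = x.shape
    d_g = d // k
    xg = x.view(n, k, d_g)                                                # (n, k, d_g)
    count = (torch.arange(1, n + 1, dtype=x.dtype) * d_g).view(n, 1, 1)   # t * d_g
    csum = xg.sum(dim=-1, keepdim=True).cumsum(dim=0)                     # cumulative sums over time
    csum_sq = (xg ** 2).sum(dim=-1, keepdim=True).cumsum(dim=0)
    mean = csum / count                                                   # mu_t per group
    var = csum_sq / count - mean ** 2                                     # E[x^2] - E[x]^2; less stable
                                                                          # than the paper's kernel
    return ((xg - mean) / torch.sqrt(var + eps)).view(n, d)
```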

3.3 Normalized Attention in MEGALODON

Previous studies have investigated the saturation and instability issues in the original scaled dot-product
attention (17). A number of novel techniques have emerged to modify the scaled dot-product
attention, among which normalized attention mechanisms, such as (scaled-) cosine attention (Luo
et al., 2018; Liu et al., 2022) and QK-normalization (Henry et al., 2020), have stood out for their
simplicity and effectiveness.
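As a reference point for the normalized-attention family mentioned above, here is a minimal sketch of QK-normalization in the style of Henry et al. (2020). It is only an illustration of that prior technique, not the specific normalized attention mechanism MEGALODON adopts, which is defined in the remainder of this subsection.

```python
import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, scale=10.0):
    """QK-normalized (cosine) attention: unit-normalize queries and keys
    before the dot product, then apply a learned or fixed temperature `scale`.
    q, k, v: (n, d)."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    scores = scale * (q @ k.transpose(-1, -2))   # bounded cosine similarities
    return F.softmax(scores, dim=-1) @ v
```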

[Figure sketch: block diagrams of the layer configuration built around the normalized attention unit. Recoverable labels: layer input, Layer Norm / Timestep Norm, CEMA output, Q & K, V, attention, gate, FFN, and output Y, connected through residual additions.]

x
(a) Sketch of Megalodon layer. b P e No m c Two hop Res dua
F gure 3 I us ra on of he M EGALODON arch ec ure F gure (a) shows a ske ch of one M EGA -
LODON ayer F gure (b) and (c) d sp ay he configura ons of pre-norm and pre-norm w h wo-hop
res dua respec ve y

Directly inspired by these normalized attention mechanisms, we propose the normalized attention mechanism specifically designed for M EGA to improve its stability. Formally,

X′ = CEMA(X) ∈ Rn×d (5)
Z′ = X′Wz + bz ,  Z = Z′ / ∥Z′∥ ∈ Rn×z (6)
Q = κq ⊙ Z + µq ∈ Rn×z (7)
K = κk ⊙ Z + µk ∈ Rn×z (8)

where Q and K are computed by using the normalized shared representation Z instead of Z′. Note that we remove the SiLU (Ramachandran et al., 2017) activation function ϕsilu in (13), because the normalization on Z′ has incorporated non-linearity into Z. Then, the attention operation in (17) is changed to

O = fsoftmax(QK⊤) V ∈ Rn×v (9)

As we use learnable κq, κk in (7) and (8), we can remove the scaling term τ(X). In addition, we found that with the normalized attention, the softmax function fsoftmax obtains the best or at least comparable performance on different tasks and data modalities (see Appendix C). Hence, throughout this paper we use softmax as the default attention function.
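To make the normalized attention concrete, the following is a minimal PyTorch-style sketch of Eqs. (5)-(9). It is an illustrative re-implementation rather than the official M EGALODON code: the CEMA output is taken as a given input, per-chunk structure and causal masking are omitted, the L2 normalization is applied along the feature dimension (an assumption of this sketch), and names such as d_model, z_dim and v_dim are introduced only for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedAttention(nn.Module):
    def __init__(self, d_model: int, z_dim: int, v_dim: int):
        super().__init__()
        self.w_z = nn.Linear(d_model, z_dim)            # Z' = X'W_z + b_z  (Eq. 6)
        self.w_v = nn.Linear(d_model, v_dim)            # value projection, as in MEGA (Eq. 16)
        self.kappa_q = nn.Parameter(torch.ones(z_dim))  # learnable scales/offsets of Eqs. (7)-(8)
        self.mu_q = nn.Parameter(torch.zeros(z_dim))
        self.kappa_k = nn.Parameter(torch.ones(z_dim))
        self.mu_k = nn.Parameter(torch.zeros(z_dim))

    def forward(self, x: torch.Tensor, x_cema: torch.Tensor) -> torch.Tensor:
        # x: original input X; x_cema: CEMA output X' of Eq. (5); both (batch, n, d_model).
        z = F.normalize(self.w_z(x_cema), dim=-1)       # Z = Z' / ||Z'||  (Eq. 6); no SiLU needed
        q = self.kappa_q * z + self.mu_q                # Eq. (7)
        k = self.kappa_k * z + self.mu_k                # Eq. (8)
        v = F.silu(self.w_v(x))                         # value sequence, as in MEGA (Eq. 16)
        scores = q @ k.transpose(-2, -1)                # scaling term tau(X) removed
        attn = torch.softmax(scores, dim=-1)            # softmax as the default attention function
        return attn @ v                                 # Eq. (9)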

3.4 Pre-Norm with Two-hop Residual

Normalization configurations are crucial in stably training deep architectures, and pre-normalization (Xiong et al., 2020) has become the default normalization configuration because of its better convergence properties than post-normalization in the original Transformer architecture (Vaswani et al., 2017). However, extensive studies have investigated the instability issue of pre-normalization when scaling up model size (Davis et al., 2021; Liu et al., 2022). Formally, a Transformer-based block with pre-normalization can be formulated as (shown in Figure 3 (b))

Ŷ = Attention(Norm(X)) + X
Y = FFN(Norm(Ŷ )) + Ŷ
  = FFN(Norm(Ŷ )) + Attention(Norm(X)) + X (10)

where the output Y is the sum of the input X and the outputs of each component in one block. Hence, the range and/or variance of Y keeps increasing for deeper blocks, causing the instability issue. In
the original M EGA architecture, the update gate φ (19) is used for a gated residual connection (21) to
mitigate this problem (Parisotto et al., 2020; Xu et al., 2020). However, the update gate φ introduces
more model parameters and the instability issue still exists when scaling up model size to 7 billion.
M EGALODON introduces a new configuration named pre-norm with two-hop residual, which simply
re-arranges the residual connections in each block (shown in Figure 3 (c)):

Ŷ = Attention(Norm(X)) + X
Y = FFN(Norm(Ŷ )) + X (11)

where the input X is reused as the residual connection of the FFN layer. Since Ŷ is directly followed
by a normalization layer, we remove the update gate φ and use standard residual connection. The
graphical architecture of a M EGALODON sub-layer is visualized in Figure 3 (a). Note that the
Timestep Normalization is only applied before the attention layer. Before the FFN layer, we still
use Layer Normalization. The reasons are two-fold: i) Layer Normalization is faster than Timestep
Normalization; ii) the output vector of each token from the attention layer is a mixture of vectors from
contextual tokens via attention weights. Hence, normalizing the attention output along the feature
dimension is similar to indirectly normalizing along the timestep dimension.
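As a concrete illustration of the difference between Eq. (10) and Eq. (11), a minimal sketch of one block with pre-norm and two-hop residual is given below. The attention and FFN modules are placeholders for the corresponding M EGALODON components, and both normalizations are written as LayerNorm only for brevity (in M EGALODON, the one before attention is Timestep Normalization).

import torch.nn as nn

class TwoHopResidualBlock(nn.Module):
    def __init__(self, attention: nn.Module, ffn: nn.Module, d_model: int):
        super().__init__()
        self.attention = attention
        self.ffn = ffn
        self.norm_attn = nn.LayerNorm(d_model)  # Timestep Normalization in MEGALODON
        self.norm_ffn = nn.LayerNorm(d_model)   # Layer Normalization before the FFN

    def forward(self, x):
        # First hop: attention with a residual from the block input X.
        y_hat = self.attention(self.norm_attn(x)) + x
        # Second hop: the FFN residual reuses the block input X (Eq. 11),
        # instead of y_hat as in standard pre-norm (Eq. 10).
        return self.ffn(self.norm_ffn(y_hat)) + x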

3.5 4-Dimensional Parallelism in Distributed LLM Pretraining

Efficient distributed training algorithms are essential to train large-scale language models, and several
parallelization mechanisms have been introduced. The three most commonly used parallelism
strategies are data, tensor (Shoeybi et al., 2019) and pipeline parallelism (Huang et al., 2019).
However, the 3-dimensional parallelism is still insufficient to scale up the context length of LLMs (Li
et al., 2023b; Liu et al., 2024).
Benefiting from the chunk-wise attention in M EGALODON, we can efficiently parallelize it along the
new timestep/sequence dimension, which is orthogonal to all the aforementioned three parallelism
dimensions. In M EGALODON, the only communications between devices in one chunk-parallel group
are the last hidden state of CEMA and the cumulative mean and variance of Timestep Normalization
in each block. Using asynchronous communication, we can minimize the overhead of chunk
parallelization by hiding the communication costs in the computation of other components inside the
same block and/or other blocks.
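The following is a conceptual sketch, under simplifying assumptions, of how chunk parallelism can hide its communication: each rank in a chunk-parallel group holds a consecutive span of chunks and only passes a small carried-over state (the last CEMA hidden state and the running Timestep Normalization statistics) to the next rank, issued asynchronously so that it overlaps with computation that does not depend on it. This is an illustrative pattern using torch.distributed, not the actual M EGALODON training code; forward_chunk_parallel and its arguments are invented for the example.

import torch
import torch.distributed as dist

def forward_chunk_parallel(block, x_local, state_shape, group=None):
    rank = dist.get_rank(group)
    world = dist.get_world_size(group)
    device = x_local.device

    # Asynchronously receive the carried-over state from the previous rank, if any.
    prev_state = torch.zeros(state_shape, device=device)
    recv_req = dist.irecv(prev_state, src=rank - 1, group=group) if rank > 0 else None

    # ... per-chunk computation that does not depend on prev_state
    # (e.g. input projections) can run here to hide the communication latency ...

    if recv_req is not None:
        recv_req.wait()
    y_local, next_state = block(x_local, prev_state)

    # Asynchronously send the final local state to the next rank in sequence order.
    if rank < world - 1:
        dist.isend(next_state.contiguous(), dst=rank + 1, group=group)
    return y_local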

4 Experiments

To evaluate the scalability and efficiency of M EGALODON on long-context sequence modeling, we scale up M EGALODON to 7-billion model size and apply it to large-scale language model pretraining
on 2 trillion tokens. We also conduct experiments on small/medium-scale sequence modeling
benchmarks, including Long Range Arena (LRA) (Tay et al., 2021), raw speech classification on
Speech Commands (Warden, 2018), image classification on ImageNet-1K (Deng et al., 2009), and
language modeling on WikiText-103 (Merity et al., 2017) and PG19 (Rae et al., 2019).2 Empirically,
M EGALODON significantly outperforms all the state-of-the-art baseline models on these tasks across
various data modalities.

4.1 LLM Pretraining

Architectural Details In our M EGALODON-7B model, we adopt most of the architectural hyperparameters from L LAMA2-7B to ensure fair comparison: M EGALODON-7B consists of 32 blocks, with
feature dimension d = 4096. Following L LAMA2, we use the SwiGLU activation function (Shazeer,
2020) in the feed-forward layer, and rotary positional embedding (RoPE, Su et al. (2021)). We set
the attention chunk size c = 4096, which is the same as the pretraining context length in L LAMA2.
Benefiting from the attention gate (γ in (18)), we use a much smaller number of attention heads h = 4
in M EGALODON-7B, compared to h = 32 in L LAMA2-7B. In addition, we apply pre-norm with
two-hop residual (§3.4), using Timestep Normalization (§3.2) and Layer Normalization (Ba et al.,
2016), while L LAMA2 models apply pre-normalization with RMSNorm (Zhang and Sennrich, 2019).
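For reference, the architectural choices above can be summarized as a plain configuration dictionary. This is an illustration only; the field names are assumptions introduced for the example and do not correspond to an actual M EGALODON configuration schema (the RoPE base of 100,000 is discussed later in this section).

megalodon_7b_config = {
    "num_blocks": 32,
    "model_dim": 4096,
    "ffn_activation": "swiglu",            # SwiGLU feed-forward activation
    "positional_embedding": "rope",
    "rope_base": 100_000,                  # increased from 10,000 in Llama2 (see below)
    "attention_chunk_size": 4096,          # same as Llama2's pretraining context length
    "num_attention_heads": 4,              # vs. 32 in Llama2-7B, enabled by the attention gate
    "pre_attention_norm": "timestep_norm",
    "pre_ffn_norm": "layer_norm",
    "residual": "pre_norm_two_hop",
}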
2 Some results are provided in Appendix C, due to space limits.

Data and Pretraining Details We use the same mix of publicly available data from L LAMA2,
ensuring that the model is trained on exactly the same 2 trillion tokens. We also use the same
tokenizer as L LAMA2, whose vocabulary size is 32K.
We trained M EGALODON-7B using the AdamW optimizer (Loshchilov and Hutter, 2019), with
β1 = 0.9, β2 = 0.95, ϵ = 1e−8. The learning rate is 3.5e−4 and a cosine learning rate schedule is applied with a warmup of 2500 steps. We use a weight decay of 0.1 and gradient clipping of 1.0, and
no dropout is applied during training. The context length in pretraining is 32K (4 attention chunks).
The global batch size is 4M tokens, and is distributed on 256 NVIDIA A100 GPUs (16K tokens per
A100). We set data parallel size to 128, chunk parallel size to 2 and tensor parallel size to 1.
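A sketch of an equivalent optimizer and learning-rate schedule setup in PyTorch is given below, using the hyperparameters listed above. It assumes a linear warmup followed by cosine decay and is illustrative rather than the actual training script; model and total_steps are placeholders.

import math
import torch

def build_optimizer_and_scheduler(model, total_steps, warmup_steps=2500):
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=3.5e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1
    )

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)               # linear warmup (assumed)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))    # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# During training, gradients are clipped before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)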
Data and Computation Efficiency We evaluate the efficiency of M EGALODON w.r.t both the data
and computation perspectives. For data efficiency, we display the negative log-likelihood (NLL)
for M EGALODON-7B, L LAMA2-7B and L LAMA2-13B w.r.t processed tokens during training in
Figure 1. M EGALODON-7B obtains significantly better (lower) NLL than L LAMA2-7B under the
same amount of training tokens, demonstrating better data efficiency. Moreover, M EGALODON
suffers fewer training spikes than the Transformer-based architecture in L LAMA2. Note that in the first
1/4 of the pretraining process (< 500B tokens), the NLL of M EGALODON-7B is slightly worse than
L LAMA2-7B. We found that the main reason is that we increased the base θ of RoPE from 10,000 in L LAMA2 to 100,000 in M EGALODON, which slows down model convergence at the beginning of
the pretraining process. At the end, M EGALODON reaches a training loss of 1.70, landing mid-way
between L LAMA2-7B (1.75) and L LAMA2-13B (1.67).
For computation efficiency, we conduct experiments of running L LAMA2-7B and M EGALODON-7B using the same amount of computational resources and comparing their training speed under various context lengths. Specifically, we execute each experiment to train a model with global batch size 4M tokens distributed on 256 NVIDIA A100 GPUs (16K tokens per A100) and calculate the words/tokens per second (WPS) to measure the training speed. Figure 4 illustrates the average WPS per device of L LAMA2-7B and M EGALODON-7B using 4K and 32K context lengths, respectively. For L LAMA2 models, we accelerate the computation of full attention with Flash-Attention V2 (Dao, 2024). Under 4K context length, M EGALODON-7B is slightly slower (about 6%) than L LAMA2-7B, due to the introduction of CEMA and Timestep Normalization. When we scale up context length to 32K, M EGALODON-7B is significantly faster (about 32%) than L LAMA2-7B, demonstrating the computation efficiency of M EGALODON for long-context pretraining. In addition, M EGALODON-7B-32K, which utilizes chunk parallelism (§3.5), achieves about 94% utilization of M EGALODON-7B-4K.

[Figure 4: Average WPS per device, comparing Llama2-7B and Megalodon-7B under 4K and 32K context lengths (y-axis: tokens per second; bars annotated with relative speedups ×1.0, ×1.32, ×1.40 and ×1.48).]

4.2 Short-Context Evaluation on Academic Benchmarks

We compare M EGALODON-7B to L LAMA2 models on standard academic benchmarks with short contexts (< 4K tokens), closely following the settings in L LAMA2 (Touvron et al., 2023). The
benchmarks are grouped into the categories listed below:
• Commonsense Reasoning (0-shot): HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020),
SIQA (Sap et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-e and -c (Clark et al., 2018).
• World Knowledge (5-shot): NaturalQuestions (NQ, Kwiatkowski et al. (2019)) and TriviaQA
(TQA, Joshi et al. (2017)).
• Reading Comprehension (0-shot): BoolQ (Clark et al., 2019).
• Popular aggregated results (5-shot): MMLU (Hendrycks et al., 2020).
Table 1 summarizes the results of M EGALODON and L LAMA2 on these academic benchmarks,
together with other open-source base models, including MPT (MosaicML, 2023), RWKV (Peng
et al., 2023), Mamba (Gu and Dao, 2023), Mistral (Jiang et al., 2023) and Gemma (Mesnard et al.,
2024). Pretrained on the same 2T tokens, M EGALODON-7B surpasses L LAMA2-7B across all the benchmarks. On some tasks, M EGALODON-7B achieves comparable or even better performance than L LAMA2-13B. Note that Mistral-7B and Gemma-8B were pretrained on much larger datasets than M EGALODON-7B, hence the results are not directly comparable.

Model          NarrativeQA  Qasper  QMSum
Xgen           17.4         20.5    6.8
MPT            18.8         24.7    8.8
YaRN           20.9         26.2    11.4
L LAMA2        18.8         19.8    10.1
L LAMA2-L∗     23.5         28.3    14.5
M EGALODON     23.9         28.0    13.1

Table 2: Results on Scrolls. ∗ L LAMA2-L (Xiong et al., 2023) continually trains L LAMA2 on 500B tokens for length extension.

[Figure 5: Valid PPL in various context lengths, from 4K to 2M tokens.]

4.3 Long-Context Evaluation

Perplexity over Long Sequences To demonstrate the capability of M EGALODON to make use
of very long contexts to improve next-token prediction, we start by conducting the evaluation of
valid perplexity on different context lengths. Concretely, we construct a validation dataset which
consists of 1,920 selected books. Each of these books contains sequences with at least 2M tokens.
The validation dataset is constructed by first randomly shuffling all the files and then concatenating
them. Figure 5 shows the perplexity (PPL) of the validation dataset in various context lengths ranging
from 4K to 2M. We observe that the PPL decreases monotonically with context length, validating the
effectiveness and robustness of M EGALODON in modeling extremely long sequences.
Long-Context QA tasks in Scrolls Next, we evaluate M EGALODON on long-context open-book
question answering (QA) tasks in the Scrolls dataset (Shaham et al., 2022), including Narra-
tiveQA (Kočiský et al., 2018), Qasper (Dasigi et al., 2021) and QMSum (Zhong et al., 2021).
Following Xiong et al. (2023), we use a simple prompt {CONTEXT} Q: {QUESTION} A: for all
the tasks, and evaluate 0-shot F1-score on NarrativeQA, 2-shot F1-score on Qasper and 1-shot
geometric-ROUGE3 on QMSum. Table 2 lists the results of M EGALODON-7B, together with other
open-source long-context models in the scale of 7B, namely Xgen-7B-8K (Nijkamp et al., 2023),
MPT-7B-8K (MosaicML, 2023), YaRN-7B-128k (Peng et al., 2024), L LAMA2-7B-4K (Touvron
et al., 2023) and L LAMA2-7B-32K (L LAMA2-L, Xiong et al. (2023)). M EGALODON-7B obtains
the best F1 on NarrativeQA, and competitive results with L LAMA2-7B Long. It should be noted
that L LAMA2-7B Long extends the context length of L LAMA2-7B from 4K to 32K by continually
pretraining it on additional 500B tokens from long-context data.
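For clarity, the simple prompt format above ({CONTEXT} Q: {QUESTION} A:) can be constructed as in the hypothetical helper below; the way in-context examples are concatenated for the few-shot settings is an assumption made only for this example.

def build_scrolls_prompt(context: str, question: str, shots=()):
    # shots: optional (context, question, answer) triples prepended for few-shot evaluation
    prompt = ""
    for ctx, q, a in shots:
        prompt += f"{ctx} Q: {q} A: {a}\n\n"
    return prompt + f"{context} Q: {question} A:"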

4.4 Instruction Finetuning

To evaluate the generalization capability of M EGALODON on instruction following and alignment, we finetune the base model of M EGALODON-7B on a proprietary instruction-alignment dataset under a controlled setting. We did not apply any RLHF techniques to further finetune it. Table 3 summarizes the performance of chat models at the 7B scale on MT-Bench4. M EGALODON exhibits superior performance on MT-Bench compared to Vicuna (Chiang et al., 2023), and comparable performance to L LAMA2-Chat, which utilizes RLHF for further alignment finetuning. We present some outputs from instruction-finetuned M EGALODON in Appendix D.

Model             Size  MT-Bench
Vicuna            7B    6.17
L LAMA2-Chat∗     7B    6.27
Mistral-Instruct  7B    6.84
M EGALODON        7B    6.27

Table 3: MT-Bench. Comparison of chat models. ∗ L LAMA2-Chat utilizes RLHF.
3 Geometric mean of ROUGE-1, 2 and L.
4 https://klu.ai/glossary/mt-bench-eval

Table 4: (ImageNet-1K) Top-1 accuracy.

Model        #Param.  Acc.
ResNet-152   60M      78.3
ViT-B        86M      77.9
DeiT-B       86M      81.8
M EGA        90M      82.3
M EGALODON   90M      83.1

Table 5: (PG-19) Word-level perplexity.

Model                   #Param.  Val   Test
Compressive Trans.      –        43.4  33.6
Perceiver AR            975M     45.9  28.9
Block-Recurrent Trans.  1.3B     –     26.5
M EGA B YTE             1.3B     42.8  36.4
M EGALODON              1.3B     29.5  25.4

4.5 Evaluation on Medium-Scale Benchmarks

ImageNet Classification To evaluate M EGALODON on the image classification task, we conduct experiments on the ImageNet-1K (Deng et al., 2009) dataset, which consists of 1.28M training images and 50K validation images from 1000 classes. We mostly follow DeiT's approach of applying several data augmentation and regularization methods that facilitate the training process, and adopt most of the hyperparameters from Ma et al. (2023). For the classification task, we replace the timestep normalization with the standard group normalization method. Top-1 accuracy on the validation set is reported in Table 4 to assess various models. M EGALODON obtains about 1.3% accuracy improvement over DeiT-B (Touvron et al., 2021), and 0.8% improvement over M EGA (Ma et al., 2023).
Auto-regressive Language Modeling on PG-19 We also evaluate M EGALODON on auto-regressive
language modeling on the medium-scale PG19 (Rae et al., 2019) dataset. We use the same vocabulary as the Block-Recurrent Transformer (Hutchins et al., 2022) and adopt most of its hyperparameters to train a M EGALODON model with 1.3B parameters. Table 5 illustrates the word-level perplexity
(PPL) of M EGALODON on PG-19, together with previous state-of-the-art models, including Com-
pressive Transformer (Rae et al., 2020), Perceiver AR (Hawthorne et al., 2022), Block-Recurrent
Transformer (Hutchins et al., 2022) and M EGA B YTE (Yu et al., 2024). M EGALODON significantly
outperforms all the baselines.

5 Conclusion
We have introduced M EGALODON, an improved M EGA architecture with multiple novel technical
components, including complex exponential moving average (CEMA), the timestep normalization
layer, normalized attention and pre-norm with two-hop residual configuration, to improve its ca-
pability, efficiency and scalability. Through a direct comparison with L LAMA2, M EGALODON
achieves impressive improvements in both training perplexity and downstream benchmark performance.
Importantly, experimental results on long-context modeling demonstrate M EGALODON’s ability to
model sequences of unlimited length. Additional experiments on small/medium-scale benchmarks
across different data modalities illustrate the robust improvements of M EGALODON, which lead to a
potential direction of future work to apply M EGALODON for large-scale multi-modality pretraining.

Acknowledgments
We thank Sadhika Malladi, Zihao Ye, Dacheng Li and Rulin Shao for their helpful feedback and
discussion during this work.

References
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
arXiv:1607.06450, 2016.

Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In
International Conference on Learning Representations, 2018.

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical
commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence,
pages 7432–7439, 2020.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng,
Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna:
An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https:
//lmsys.org/blog/2023-03-30-vicuna/.
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina
Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint
arXiv:1905.10044, 2019.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and
Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.
arXiv preprint arXiv:1803.05457, 2018.
James W Cooley and John W Tukey. An algorithm for the machine calculation of complex fourier
series. Mathematics of computation, 19(90):297–301, 1965.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov.
Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the
57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, 2019.
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In Interna-
tional Conference on Learning Representations (ICLR-2024), 2024.
Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of
information-seeking questions and answers anchored in research papers. In Proceedings of the
2021 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies (NAACL-2021), pages 4599–4610, Online, June 2021. Association
for Computational Linguistics.
Jared Q Davis, Albert Gu, Krzysztof Choromanski, Tri Dao, Christopher Re, Chelsea Finn, and
Percy Liang. Catformer: Designing stable transformers via sensitivity analysis. In International
Conference on Machine Learning, pages 2489–2499. PMLR, 2021.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale
hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition,
pages 248–255. Ieee, 2009.
Daniel Y Fu, Tri Dao, Khaled Kamal Saab, Armin W Thomas, Atri Rudra, and Christopher Re.
Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh
International Conference on Learning Representations (ICLR-2023), 2023.
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2023.
Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured
state spaces. In International Conference on Learning Representations (ICLR-2022), 2022a.
Albert Gu, Ankit Gupta, Karan Goel, and Christopher Ré. On the parameterization and initialization
of diagonal state space models. arXiv preprint arXiv:2206.11893, 2022b.
Stephen Hanson and Lorien Pratt. Comparing biases for minimal network construction with back-
propagation. Advances in neural information processing systems, 1, 1988.
Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz
Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, et al. General-
purpose, long-context autoregressive modeling with perceiver ar. In International Conference on
Machine Learning, pages 8535–8558. PMLR, 2022.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint
arXiv:2009.03300, 2020.
Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key
normalization for transformers. In Findings of the Association for Computational Linguistics:
EMNLP 2020, pages 4246–4253, 2020.

Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc Le. Transformer quality in linear time. In
International Conference on Machine Learning (ICML-2022), pages 9099–9117. PMLR, 2022.

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong
Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, and zhifeng Chen. Gpipe: Efficient training of
giant neural networks using pipeline parallelism. In Advances in Neural Information Processing
Systems, volume 32. Curran Associates, Inc., 2019.

J Stuart Hunter. The exponentially weighted moving average. Journal of quality technology, 18(4):
203–210, 1986.

DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, and Behnam Neyshabur. Block-recurrent
transformers. Advances in neural information processing systems, 35:33248–33261, 2022.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In International conference on machine learning (ICML-2015),
pages 448–456. pmlr, 2015.

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot,
Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al.
Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly
supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association
for Computational Linguistics, 2017.

William Kahan. Pracniques: further remarks on reducing truncation errors. Communications of the
ACM, 8(1):40, 1965.

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis,
and Edward Grefenstette. The NarrativeQA reading comprehension challenge. Transactions of the
Association for Computational Linguistics, 6:317–328, 2018.

Alex Krizhevsky et al. Learning multiple layers of features from tiny images. Technical Report.
University of Toronto, 2009.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris
Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a
benchmark for question answering research. Transactions of the Association for Computational
Linguistics, 7:453–466, 2019.

Bonan Li, Yinhan Hu, Xuecheng Nie, Congying Han, Xiangjian Jiang, Tiande Guo, and Luoqi Liu.
Dropkey for vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pages 22700–22709, June 2023a.

Dacheng Li, Rulin Shao, Anze Xie, Eric P Xing, Joseph E Gonzalez, Ion Stoica, Xuezhe Ma, and Hao
Zhang. Lightseq: Sequence level parallelism for distributed training of long context transformers.
In Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and
Resource Optimization (WANT@ NeurIPS 2023), 2023b.

Yuhong Li, Tianle Cai, Yi Zhang, Deming Chen, and Debadeepta Dey. What makes convolutional
models great on long sequence modeling? In International Conference on Learning Representa-
tions (ICLR-2023), 2023c.

Drew Linsley, Junkyung Kim, Vijay Veerabadran, Charles Windolf, and Thomas Serre. Learning
long-range spatial dependencies with horizontal gated recurrent units. In S. Bengio, H. Wal-
lach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural
Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-
infinite context. In International Conference on Learning Representations (ICLR-2024), 2024.

Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng
Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of
the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Confer-
ence on Learning Representations, 2019.
Chunjie Luo, Jianfeng Zhan, Xiaohe Xue, Lei Wang, Rui Ren, and Qiang Yang. Cosine normalization:
Using cosine similarity instead of dot product in neural networks. In 27th International Conference
on Artificial Neural Networks (ICANN-2018), pages 382–391. Springer, 2018.
Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, and Luke Zettle-
moyer. Luna: Linear unified nested attention. Advances in Neural Information Processing Systems,
34:2441–2453, 2021.
Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan
May, and Luke Zettlemoyer. Mega: Moving average equipped gated attention. In The Eleventh
International Conference on Learning Representations, 2023.
Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts.
Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the
association for computational linguistics: Human language technologies, pages 142–150, 2011.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture
models. In International Conference on Learning Representations (ICLR-2017), 2017.
Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre,
Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Aakanksha
Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie
Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Char-
line Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David
Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Mu-
raru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin,
James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy
Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican,
Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael
Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige
Bailey, Paul Michel, Petko Yotov, Pier Giuseppe Sessa, Rahma Chaabouni, Ramona Comanescu,
Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian
Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Kli-
menko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu hui Chen, Zafarali Ahmed, Zhitao
Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals, Jeff Dean,
Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando
Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen
Kenealy. Gemma: Open models based on gemini research and technology, 2024.
MosaicML. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023.
Nikita Nangia and Samuel Bowman. Listops: A diagnostic dataset for latent tree learning. In Proceed-
ings of the 2018 Conference of the North American Chapter of the Association for Computational
Linguistics: Student Research Workshop, pages 92–99, 2018.
Erik Nijkamp, Tian Xie, Hiroaki Hayashi, Bo Pang, Congying Xia, Chen Xing, Jesse Vig, Semih
Yavuz, Philippe Laban, Ben Krause, Senthil Purushwalkam, Tong Niu, Wojciech Kryściński, Lidiya
Murakhovs’ka, Prafulla Kumar Choubey, Alex Fabbri, Ye Liu, Rui Meng, Lifu Tu, Meghana Bhat,
Chien-Sheng Wu, Silvio Savarese, Yingbo Zhou, Shafiq Joty, and Caiming Xiong. Xgen-7b
technical report, 2023.
Emilio Parisotto, Francis Song, Jack Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant Jayakumar,
Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, et al. Stabilizing transformers
for reinforcement learning. In International conference on machine learning, pages 7487–7498.
PMLR, 2020.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style,
high-performance deep learning library. Advances in neural information processing systems, 32,
2019.
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin
Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, et al. Rwkv: Reinventing rnns for the
transformer era. arXiv preprint arXiv:2305.13048, 2023.
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window
extension of large language models. In International Conference on Learning Representations
(ICLR-2024), 2024.
Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua
Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional
language models. In International conference on machine learning (ICML-2023). PMLR, 2023.
Dragomir R Radev, Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara. The acl
anthology network corpus. Language Resources and Evaluation, 47(4):919–944, 2013.
Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap.
Compressive transformers for long-range sequence modelling. arXiv preprint, 2019. URL
https://arxiv.org/abs/1911.05507.
Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap.
Compressive transformers for long-range sequence modeling. In International Conference on
Learning Representations (ICLR-2020), 2020.
Prajit Ramachandran, Barret Zoph, and Quoc V Le. Swish: a self-gated activation function. arXiv
preprint arXiv:1710.05941, 7(1):5, 2017.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An
adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106,
2021.
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Commonsense
reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.
Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong,
Mor Geva, Jonathan Berant, and Omer Levy. SCROLLS: Standardized CompaRison over long
language sequences. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of
the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP-2022), pages
12007–12021, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational
Linguistics.
Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catan-
zaro. Megatron-lm: Training multi-billion parameter language models using model parallelism.
arXiv preprint arXiv:1909.08053, 2019.
Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with
rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv
preprint arXiv:2009.06732, 2020.
Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao,
Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena : A benchmark for efficient
transformers. In International Conference on Learning Representations, 2021. URL https:
//openreview.net/forum?id=qVyeW-grC2k.
Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan
Narang, Vinh Q. Tran, Dani Yogatama, and Donald Metzler. Scaling laws vs model architectures:
How does inductive bias influence scaling?, 2022.

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé
Jégou. Training data-efficient image transformers & distillation through attention. In International
Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation
and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information
Processing Systems, volume 30. Curran Associates, Inc., 2017.
Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, and Armaghan
Eshaghi. Beyond the limits: A survey of techniques to extend the context length in large language
models, 2024.
Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint
arXiv:1804.03209, 2018.
B. P. Welford. Note on a method for calculating corrected sums of squares and products. Technomet-
rics, 4(3):419–420, 1962.
Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on
computer vision (ECCV-2018), pages 3–19, 2018.
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang,
Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture.
In International Conference on Machine Learning, pages 10524–10533. PMLR, 2020.
Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin,
Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of
foundation models. arXiv preprint arXiv:2309.16039, 2023.
Hongfei Xu, Qiuhui Liu, Deyi Xiong, and Josef van Genabith. Transformer with depth-wise lstm.
arXiv preprint arXiv:2007.06257, 2020.
Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, and Mike Lewis.
Megabyte: Predicting million-byte sequences with multiscale transformers. Advances in Neural
Information Processing Systems, 36, 2024.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine
really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics (ACL-2019). Association for Computational Linguistics, 2019.
Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural
Information Processing Systems, 32, 2019.
Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah,
Asli Celikyilmaz, Yang Liu, Xipeng Qiu, and Dragomir Radev. QMSum: A new benchmark for
query-based multi-domain meeting summarization. In Proceedings of the 2021 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies (NAACL-2021), pages 5905–5921, Online, June 2021. Association for Computational
Linguistics.
Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, and Denny Zhou. Trans-
formers can achieve length generalization but not robustly, 2024.

Appendix: M EGALODON: Efficient Long-Context LLM Pretraining and
Inference with Unlimited Context Length

A Background: Moving Average Equipped Gated Attention


In the gated attention mechanism in M EGA, the output from EMA (1) is used to compute the shared
representation (Hua et al., 2022) Z:
X′ = EMA(X) ∈ Rn×d (12)
Z = ϕsilu(X′Wz + bz) ∈ Rn×z (13)
where X ′ can be regarded as the updated or contextual input, because it encodes contextual infor-
mation through EMA. Then, the query and key sequences are computed by applying per-dimension
scalars and offsets to Z, and the value sequence is from the original X:
Q = κq ⊙ Z + µq ∈ Rn×z (14)
K = κk ⊙ Z + µk ∈ Rn×z (15)
V = ϕsilu(XWv + bv) ∈ Rn×v (16)
where κq , µq , κk , µk ∈ Rz are the learnable scalars and offsets of queries and keys, respectively. v
is the expanded intermediate dimension for the value sequence. The output of attention is computed
as follows:
O = f(QK⊤ / τ(X)) V ∈ Rn×v (17)

Subsequently, M EGA introduces the reset gate γ, the update gate φ, and computes the candidate
activation Ĥ and final output Y :
γ = ϕsilu(X′Wγ + bγ) ∈ Rn×v (18)
φ = ϕsigmoid(X′Wφ + bφ) ∈ Rn×d (19)
Ĥ = ϕsilu(X′Wh + (γ ⊙ O)Uh + bh) ∈ Rn×d (20)
Y = φ ⊙ Ĥ + (1 − φ) ⊙ X ∈ Rn×d (21)
with the update gate φ and the residual connection X.
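For reference, a minimal PyTorch-style sketch of this gated attention (Eqs. 12-21) is given below. It is an illustrative re-implementation under simplifying assumptions: the EMA is taken as a given module, chunking and causal masking are omitted, τ(X) is replaced by a fixed 1/√z scaling, and all names and shapes are invented for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MegaGatedAttention(nn.Module):
    def __init__(self, ema: nn.Module, d: int, z: int, v: int):
        super().__init__()
        self.ema = ema                      # X' = EMA(X)                  (Eq. 12)
        self.w_z = nn.Linear(d, z)          # shared representation        (Eq. 13)
        self.w_v = nn.Linear(d, v)          # value projection             (Eq. 16)
        self.kappa_q = nn.Parameter(torch.ones(z))
        self.mu_q = nn.Parameter(torch.zeros(z))
        self.kappa_k = nn.Parameter(torch.ones(z))
        self.mu_k = nn.Parameter(torch.zeros(z))
        self.w_gamma = nn.Linear(d, v)      # reset gate                   (Eq. 18)
        self.w_phi = nn.Linear(d, d)        # update gate                  (Eq. 19)
        self.w_h = nn.Linear(d, d)          # candidate activation         (Eq. 20)
        self.u_h = nn.Linear(v, d, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_ctx = self.ema(x)                                    # contextual input X'
        z = F.silu(self.w_z(x_ctx))                            # Eq. (13)
        q = self.kappa_q * z + self.mu_q                       # Eq. (14)
        k = self.kappa_k * z + self.mu_k                       # Eq. (15)
        v = F.silu(self.w_v(x))                                # Eq. (16)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # stand-in for tau(X)
        o = torch.softmax(scores, dim=-1) @ v                  # Eq. (17)
        gamma = F.silu(self.w_gamma(x_ctx))                    # Eq. (18)
        phi = torch.sigmoid(self.w_phi(x_ctx))                 # Eq. (19)
        h_hat = F.silu(self.w_h(x_ctx) + self.u_h(gamma * o))  # Eq. (20)
        return phi * h_hat + (1 - phi) * x                     # Eq. (21)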

B Implementation Details
B.1 Efficient Fused CUDA Operators Implementation

Fused Attention We implemented a fused attention operator to improve the efficiency, mainly
by fusing the causal mask, softmax function and dropout operation (if necessary). The fused
implementation reduces the IO costs from global memory for the attention weight. For attention
dropout, we adopt the dropout-before-softmax scheme in DropKey (Li et al., 2023a), which applies
the dropout mask on the input attention matrix of the softmax function. Concretely, we fill the values
of the attention matrix at dropout mask positions to −∞ before feeding it into the softmax function.
One important advantage of this dropout-before-softmax scheme compared to the standard attention dropout is that the computation of the gradients in back-propagation is independent of the applied dropout mask.
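The math implemented by the fused kernel is sketched below in plain PyTorch (not the CUDA operator itself): dropped positions are filled with −∞ in the attention scores before the softmax, so the backward pass does not depend on the sampled dropout mask.

import torch

def dropkey_attention_weights(scores: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    # scores: (batch, heads, q_len, k_len) attention logits, already containing any
    # causal mask as -inf entries.
    if training and p > 0.0:
        drop_mask = torch.rand_like(scores) < p
        scores = scores.masked_fill(drop_mask, float("-inf"))
    return torch.softmax(scores, dim=-1)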

Efficient FFTConv We also provide an efficient fused implementation of the FFTConv operator. Similar to FlashConv in H3 (Fu et al., 2023), we fused the real number FFT (RFFT), its inverse (IRFFT) and the element-wise multiplication, and implemented the Cooley-Tukey FFT algorithm (Cooley and Tukey, 1965) in the CUDA shared memory. Our
implementation is able to accommodate up to 16K tokens in the limited shared memory of A100.
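For reference, the computation that the fused operator accelerates is the standard FFT-based long convolution, sketched below in plain PyTorch (RFFT, element-wise multiplication, IRFFT); the fused CUDA kernel performs the same steps inside shared memory. Shapes are assumptions made for the example.

import torch

def fft_conv(x: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    # x: (batch, d, n) input sequence; kernel: (d, n) per-dimension convolution kernel.
    n = x.shape[-1]
    fft_len = 2 * n                              # zero-pad to avoid circular wrap-around
    x_f = torch.fft.rfft(x, n=fft_len)           # RFFT
    k_f = torch.fft.rfft(kernel, n=fft_len)
    y = torch.fft.irfft(x_f * k_f, n=fft_len)    # element-wise multiply + IRFFT
    return y[..., :n]                            # keep the causal part of length n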

Table 6: (Long Range Arena) Accuracy on the full suite of long range arena (LRA) tasks. Results of
previous models are reported in Ma et al. (2023).

Models ListOps Text Retrieval Image Pathfinder Path-X Avg.


Transformer 37.11 65.21 79.14 42.94 71.83 ✗ 59.24
Reformer 37.27 56.10 53.40 38.07 68.50 ✗ 50.67
Linformer 35.70 53.94 52.27 38.56 76.34 ✗ 51.36
BigBird 36.05 64.02 59.29 40.83 74.87 ✗ 55.01
Luna-256 37.98 65.78 79.56 47.86 78.55 ✗ 61.95
S4 59.10 86.53 90.94 88.48 94.01 96.07 85.86
M EGA-chunk 58.76 90.19 90.97 85.80 94.41 93.81 85.66
M EGA 63.14 90.43 91.25 90.44 96.01 97.98 88.21
M EGALODON-chunk 62.23 90.53 91.74 87.11 96.89 97.21 87.62
M EGALODON 63.79 90.48 91.76 89.42 98.13 98.17 88.63

Timestep Normalization For the TimestepNorm operator, we have an efficient implementation to improve both its speed and numerical stability. To compute the cumulative mean and variance for
each of the timesteps, our implementation limits the parallel threads used for the timestep/sequence
dimension. To improve numerical stability, we used the Welford algorithm (Welford, 1962) to
compute the cumulative mean and variance and the Kahan Summation (Kahan, 1965) to reduce the
numerical error from summation.
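A plain PyTorch version of the cumulative statistics is sketched below to illustrate the Welford-style running update (the fused operator additionally applies Kahan summation and runs in CUDA). Whether statistics are pooled over all features or within feature groups follows the Timestep Normalization definition in §3.2; here they are pooled over all features purely for simplicity, so this is an illustrative reference, not the operator itself.

import torch

def cumulative_timestep_stats(x: torch.Tensor):
    # x: (batch, n, d). Returns, for every timestep t, the mean and variance over
    # all features of timesteps 0..t, using a Welford/Chan-style running update.
    b, n, d = x.shape
    means = torch.empty(b, n, 1, dtype=x.dtype, device=x.device)
    vars_ = torch.empty(b, n, 1, dtype=x.dtype, device=x.device)
    mean = torch.zeros(b, 1, dtype=x.dtype, device=x.device)
    m2 = torch.zeros(b, 1, dtype=x.dtype, device=x.device)
    count = 0
    for t in range(n):
        chunk = x[:, t, :]                                    # d new samples at timestep t
        c_mean = chunk.mean(dim=-1, keepdim=True)
        c_m2 = ((chunk - c_mean) ** 2).sum(dim=-1, keepdim=True)
        new_count = count + d
        delta = c_mean - mean
        mean = mean + delta * (d / new_count)                 # merge running statistics
        m2 = m2 + c_m2 + delta ** 2 * (count * d / new_count)
        count = new_count
        means[:, t] = mean
        vars_[:, t] = m2 / count
    return means, vars_

def timestep_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    mean, var = cumulative_timestep_stats(x)
    return (x - mean) / torch.sqrt(var + eps)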

B.2 Plus 1 Reparameterization in Normalization Layers

In the normalization methods, two learnable parameters γ and β are introduced to scale and shift the
normalized value:
y = γ · (x − µ) / σ + β (22)
where µ and σ 2 are the mean and variance of the input x across the pre-defined dimensions. Initial-
ization of γ and β is crucial for model performance and stability. The standard implementation of
normalization layers, such as PyTorch (Paszke et al., 2019), initializes γ and β to vectors of ones and
zeros, respectively, to preserve the mean and variance of the normalized inputs at the beginning of
training.
This standard implementation, however, suffers a problem when weight decay regularization is
applied to prevent overfitting (Hanson and Pratt, 1988). Technically, the weight decay regularization
pushes the values of model parameters towards smaller magnitudes. In the context of normalization
methods, weight decay pushes the values in γ towards zero, which diverges from its initialization of
one. This may prevent the model from learning the true scale of the data distribution, and may cause
numerical stability issues as well.
To address this problem, we used the plus 1 reparameterization5 of the scale parameter γ:
y = (γ + 1) · (x − µ) / σ + β (23)
where γ is initialized to zero. Under weight decay, γ remains centered around zero, resulting in a
desirable scale of γ + 1 around one.
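A minimal sketch of a layer-norm-style module with this plus-1 reparameterization is shown below; it is illustrative and simplified relative to the actual normalization layers in M EGALODON.

import torch
import torch.nn as nn

class PlusOneLayerNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(dim))   # initialized to 0; effective scale is gamma + 1
        self.beta = nn.Parameter(torch.zeros(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mu = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return (self.gamma + 1.0) * (x - mu) / torch.sqrt(var + self.eps) + self.beta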

C Experiments on Small-Scale Benchmarks


We conducted small-scale experiments on five benchmarks across various data modalities, including
text, audio and image. To demonstrate the robustness of the M EGALODON architecture on different
tasks and data types, we used a single unified architecture with minimal architectural divergence in
all the experiments: softmax attention function, rotary positional embedding, pre-norm with two-hop
5 Similar idea in the blog: https://medium.com/@ohadrubin/exploring-weight-decay-in-layer-normalization-challenges-and-a-reparameterization-solution-ad4d12c24950

Table 7: (SC-Raw) Accuracy.

Model        #Param.  Acc.
Transformer  786K     ✗
S4           300K     97.50
M EGA        300K     96.92
M EGA (big)  476K     97.30
M EGALODON   300K     98.14

Table 8: (WikiText-103) Word-level PPL.

Model            #Param.  PPL
Transformer      247M     18.66
Transformer-XL   257M     18.30
S4               249M     20.95
M EGA            252M     18.07
M EGALODON       252M     17.23

residual, and Timestep Normalization (Group Normalization for classification). We adopt (almost) all the architectural and training hyperparameters from the corresponding experiments of the original M EGA (Ma et al., 2023).

C.1 Long Range Arena (LRA)

The Long Range Arena (LRA) benchmark (Tay et al., 2021) is designed for evaluating sequence models under the long-context scenario. It collects six tasks: ListOps (Nangia and
Bowman, 2018), byte-level text classification (Text; Maas et al. (2011)), byte-level document retrieval
(Retrieval; Radev et al. (2013)), image classification on sequences of pixels (Image; Krizhevsky et al.
(2009)), Pathfinder (Linsley et al., 2018) and its extreme long version (Path-X; Tay et al. (2021)).
These tasks consist of input sequences ranging from 1K to 16K tokens and span across a variety of
data types and modalities.
Table 6 compares M EGALODON against several baselines, including Transformer and its efficient
variants, the state space model S4 (Gu et al., 2022a), and the original M EGA model. Following Ma
et al. (2023), we also evaluate M EGALODON-chunk on each task, by setting the chunk size c = 128
for all the tasks, except Path-X where c = 4096. With chunk-wise attention, M EGALODON-chunk
substantially outperforms M EGA-chunk on all the six tasks. In addition, M EGALODON significantly
narrows the gap between chunk-wise attention and full attention.

C.2 Raw Speech Classification

To evaluate the capability of M EGALODON on the long-range modeling of speech signals, we apply
M EGALODON to classify raw speech (with length 16000), rather than using traditional preprocessing
(e.g. convert to MFCC features). Following Ma et al. (2023), we perform speech classification on the
SC10 subset of the Speech Commands dataset (Warden, 2018), and set attention chunk size c = 1000.
As shown in Table 7, our M EGALODON model with 300K parameters achieves an accuracy of 98.14,
which is significantly better than 97.50 from S4 and 96.92 from M EGA with similar model size.

C.3 Auto-regressive Language Modeling

We also evaluate M EGALODON on auto-regressive language modeling on the small-scale WikiText-103 (Merity et al., 2017) dataset. Following Baevski and Auli (2018), we adopt adaptive softmax and input embeddings, use a vocabulary of 260K tokens for WikiText-103, and train a M EGALODON model with about 250M parameters with attention chunk size c = 2048. Table 8 illustrates the word-level perplexity (PPL) of M EGALODON on WikiText-103, together with previous state-of-the-art models, including Transformer (Baevski and Auli, 2018), Transformer-XL (Dai et al., 2019), S4 (Gu et al., 2022a) and M EGA (Ma et al., 2023). M EGALODON significantly outperforms all the baselines on this benchmark.

D Model Outputs from Instruction-finetuned M EGALODON

What is the central dogma of molecular biology? What processes are involved? Who named this?

The central dogma of molecular biology describes the flow of genetic information from DNA to RNA to proteins. The three processes
involved in the central dogma are transcription, translation, and translocation.
The central dogma was first proposed by Francis Crick, James Watson, and Maurice Wilkins in the 1950s, and was later formalized by
Crick in a 1958 paper titled ”On the Central Dogma of Molecular Biology”. The term ”central dogma” was coined by Crick to describe
the flow of genetic information from DNA to RNA to proteins.

Identify and fix one incorrect fact in your previous response.

The correct name of the person who proposed the central dogma of molecular biology is James Watson, not Maurice Wilkins. James
Watson, along with Francis Crick, first proposed the central dogma in the 1950s, and Maurice Wilkins was a co-recipient of the Nobel
Prize in Physiology or Medicine in 1962 for his contributions to the discovery of the double helix structure of DNA.
Act as a math teacher. I will provide some mathematical equations or concepts, and it will be your job to explain them in
easy-to-understand terms. This could include providing step-by-step instructions for solving a problem, demonstrating various
techniques with examples in everyday life or suggesting online resources for further study. My first request is ”I need help
understanding how probability works.”

Probability is a branch of mathematics that deals with the likelihood of an event occurring. It is often expressed as a number between 0
and 1, where 0 represents an impossible event and 1 represents a certain event.
For example, if you were to roll a fair six-sided die, the probability of rolling a 1 would be 1/6, or about 17%. This is because there is
only one way for the die to land on a 1 out of the six possible outcomes.
To calculate probability, you can use the formula:
Probability = Number of ways the event can occur / Total number of possible outcomes
For example, if you were to ask ”What is the probability of flipping heads on a fair coin toss?”, the answer would be 1/2, or 50%. This is
because there is only one way for the coin to land heads-up out of the two possible outcomes (heads or tails).
There are many different ways to interpret probability, and it can be used in a variety of real-world applications such as gambling,
insurance, and statistics.
I hope this helps to give you a basic understanding of probability. Let me know if you have any questions or need further clarification.

What are the differences between Riemannian geometry and euclidean geometry?

Riemannian geometry and Euclidean geometry are both branches of mathematics that study the properties of space, but they differ in
their assumptions and methods.
Euclidean geometry is based on the assumption that the space is flat and flat Euclidean space is the simplest non-empty connected flat
metric space. It studies the properties of space using concepts such as distance, angle, and parallelism. It is named after the Greek
mathematician Euclid, who developed the foundations of the field in his book ”Elements”.
Riemannian geometry, on the other hand, is based on the assumption that the space is curved. It studies the properties of space using
concepts such as distance, angle, and parallelism, but in a more general setting, and it is named after the German mathematician Bernhard
Riemann.
In Riemannian geometry, the concept of a metric is used to measure the distance between two points in a curved space. A metric is a
mathematical function that assigns a distance to each segment in a space. The most common metric in Riemannian geometry is the
distance function induced by a Riemannian metric, which is a mathematical function that assigns a distance to each segment in a space.
In summary, Euclidean geometry is a special case of Riemannian geometry, where the metric is the flat Euclidean metric. Riemannian
geometry is a more general field that studies the properties of curved spaces, and it is used in many areas of mathematics, physics and
other sciences.
I hope this helps to give you a basic understanding of the differences between Riemannian geometry and Euclidean geometry. Let me
know if you have any questions or need further clarification.
Evaluate the following movie reviews on a scale of 1 to 5, with 1 being very negative, 3 being neutral, and 5 being very positive: 1.
This movie released on Nov. 18, 2019, was phenomenal. The cinematography, the acting, the plot - everything was top-notch. 2.
Never before have I been so disappointed with a movie. The plot was predictable and the characters were one-dimensional. In my
opinion, this movie is the worst one to have been released in 2022. 3. The movie was okay. There were some parts I enjoyed, but
there were also parts that felt lackluster. This is a movie that was released in Feb 2018 and seems to be quite ordinary. Return the
answer as a JSON array of integers.

[5, 1, 3]

Update your previous reply by including the release date as part of the JSON content.

[
{"release_date": "Nov 18, 2019", "rating": 5},
{"release_date": "2022", "rating": 1},
{"release_date": "Feb 2018", "rating": 3}
]

Figure 6: Model outputs from instruction-finetuned M EGALODON.
