Professional Documents
Culture Documents
Jiawei Zhao 1 Zhenyu Zhang 3 Beidi Chen 2 4 Zhangyang Wang 3 Anima Anandkumar * 1 Yuandong Tian * 2
Abstract
BF16
Training Large Language Models (LLMs)
presents significant memory challenges, predom- Adafactor
arXiv:2403.03507v1 [cs.LG] 6 Mar 2024
1
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Parameter-efficient fine-tuning (PEFT) techniques allow Notably, for pre-training, GaLore keeps low memory
for the efficient adaptation of pre-trained language mod- throughout the entire training, without requiring full-rank
els (PLMs) to different downstream applications without training warmup like ReLoRA. Thanks to GaLore’s mem-
the need to fine-tune all of the model’s parameters (Ding ory efficiency, for the first time it is possible to train
et al., 2022). Among them, the popular Low-Rank Adapta- LLaMA 7B from scratch on a single GPU with 24GB mem-
tion (LoRA Hu et al. (2021)) reparameterizes weight ma- ory (e.g., on NVIDIA RTX 4090), without any costly mem-
trix W ∈ Rm×n into W = W0 + BA, where W0 is a ory offloading techniques (Fig. 1).
frozen full-rank matrix and B ∈ Rm×r , A ∈ Rr×n are
GaLore is also used to fine-tune pre-trained LLMs on
additive low-rank adaptors to be learned. Since the rank
GLUE benchmarks with comparable or better results than
r ≪ min(m, n), A and B contain fewer number of train-
existing low-rank methods. When fine-tuning RoBERTa-
able parameters and thus smaller optimizer states. LoRA
Base on GLUE tasks with a rank of 4, GaLore achieves
has been used extensively to reduce memory usage for fine-
an average score of 85.89, outperforming LoRA, which
tuning in which W0 is the frozen pre-trained weight. Its
achieves a score of 85.61.
variant ReLoRA is also used in pre-training, by periodi-
cally updating W0 using previously learned low-rank adap- As a gradient projection method, GaLore is independent
tors (Lialin et al., 2023). of the choice of optimizers and can be easily plugged into
existing ones with only two lines of code, as shown in Al-
However, many recent works demonstrate the limitation of
gorithm 1. Our experiment (Fig. 3) shows that it works
such a low-rank reparameterization. For fine-tuning, LoRA
for popular optimizers such as AdamW, 8-bit Adam, and
is not shown to reach a comparable performance as full-
Adafactor. In addition, its performance is insensitive to
rank fine-tuning (Xia et al., 2024). For pre-training from
very few hyper-parameters it introduces. We also provide
scratch, it is shown to require a full-rank model training
theoretical justification on the low-rankness of gradient up-
as a warmup (Lialin et al., 2023), before optimizing in the
date, as well as the convergence analysis of GaLore.
low-rank subspace. There are two possible reasons: (1) the
optimal weight matrices may not be low-rank, and (2) the
reparameterization changes the gradient training dynamics. 2. Related Works
Our approach: To address the above challenge, we pro- Low-Rank Adaptation Hu et al. (2021) proposed Low-
pose Gradient Low-rank Projection (GaLore), a training Rank Adaptation (LoRA) to fine-tune pre-trained models
strategy that allows full-parameter learning but is more with low-rank adaptors. This method reduces the mem-
memory-efficient than common low-rank adaptation meth- ory footprint by maintaining a low-rank weight adaptor for
ods, such as LoRA. Our key idea is to leverage the slow- each layer. There are a few variants of LoRA proposed to
changing low-rank structure of the gradient G ∈ Rm×n of enhance its performance (Renduchintala et al., 2023; Sheng
the weight matrix W , rather than trying to approximate the et al., 2023; Xia et al., 2024), supporting multi-task learn-
weight matrix itself as low rank. ing (Wang et al., 2023), and further reducing the mem-
ory footprint (Dettmers et al., 2023). Lialin et al. (2023)
We first show theoretically that the gradient matrix G be-
proposed ReLoRA, a variant of LoRA designed for pre-
comes low-rank during training. Then, we propose Ga-
training, but requires a full-rank training warmup to achieve
Lore that computes two projection matrices P ∈ Rm×r
comparable performance as the standard baseline.
and Q ∈ Rn×r to project the gradient matrix G into a low-
rank form P ⊤ GQ. In this case, the memory cost of opti-
mizer states, which rely on component-wise gradient statis- Subspace Learning Recent studies have demonstrated
tics, can be substantially reduced. Occasional updates of P that the learning primarily occurs within a significantly
and Q (e.g., every 200 iterations) incur minimal amortized low-dimensional parameter subspace (Larsen et al., 2022;
additional computational cost. GaLore is more memory- Gur-Ari et al., 2018). These findings promote a spe-
efficient than LoRA as shown in Table 1. In practice, this cial type of learning called subspace learning, where the
yields up to 30% memory reduction compared to LoRA model weights are optimized within a low-rank subspace.
during pre-training. This notion has been widely used in different domains of
machine learning, including meta-learning and continual
We demonstrate that GaLore works well in both LLM pre- learning (Lee & Choi, 2018; Chaudhry et al., 2020).
training and fine-tuning. When pre-training LLaMA 7B on
C4 dataset, 8-bit GaLore, combined with 8-bit optimizers Projected Gradient Descent GaLore is closely related
and layer-wise weight updates techniques, achieves com- to the traditional topic of projected gradient descent (PGD)
parable performance to its full-rank counterpart, with less (Chen & Wainwright, 2015; Chen et al., 2019). A key dif-
than 10% memory cost of optimizer states. ference is that, GaLore considers the specific gradient form
that naturally appears in training multi-layer neural net-
2
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
works (e.g., it is a matrix with specific structures), proving Low-rank updates.. For a linear layer W ∈ Rm×n , LoRA
many of its properties (e.g., Lemma 3.1, Theorem 3.2, and and its variants utilize the low-rank structure of the update
Theorem 3.6). In contrast, traditional PGD mostly treats matrix by introducing a low-rank adaptor AB:
the objective as a general blackbox nonlinear function, and
study the gradients in the vector space only. W T = W 0 + B T AT , (5)
Gl = Jl⊤ yfl−1
⊤
− Jl⊤ Jl Wl fl−1 fl−1
⊤
p
G̃t = Mt / Vt + ϵ (4) (8)
| {z } | {z }
√
| {z }
A B C
Here G2t and Mt / Vt + ϵ means element-wise multipli-
cation and division. η is the learning rate. Together with where Jl := Jacobian(NL ) . . . Jacobian(Nl+1 ) and fl :=
W ∈ Rm×n , this takes 3mn memory. Nl (Nl−1 . . . N1 (x)).
3
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Note that for softmax objective with small logits, we can with batchsize > 1):
also prove a similar structure of backpropagated gradient, X X
and thus Theorem 3.2 can also apply. G= Ai − Bi W Ci (12)
i i
Lemma 3.3 (Gradient structure of softmax K-
loss).⊤ For
way logsoftmax loss φ(y; f ) := − log 1exp(y f) where Bi and Ci are PSD matrices, Ai , Bi and Ci have
⊤ exp(f ) , let
LA , LB and LC continuity with respect to W and ∥Wt ∥ ≤
fˆ = P1⊥ f be the zero-mean version of network output f , D. Let Rt := Pt⊤ Gt Qt , B̂itP:= Pt⊤ Bi (Wt )Pt , Ĉit :=
where P1⊥ := I − K 1
11⊤ , then we have: Q⊤
t Ci (Wt )Qt and κt := N
1
i λmin (B̂it )λmin (Ĉit ). If
we choose constant Pt = P and Qt = Q, then GaLore
−dφ = y ⊤ dfˆ − γ fˆ⊤ dfˆ/K + O(fˆ2 /K)dfˆ (9) with ρt ≡ 1 satisfies:
With this lemma, it is clear that for a reversible network As a result, if mint κt > LA + LB LC D2 , Rt → 0 and thus
f := N (x) = Jl (x)Wl fl−1 (x), the gradient Gl of Wl GaLore converges with fixed Pt and Qt .
has the following form:
Setting P and Q. The theorem tells that P and Q should
Gl = Jl P1⊥ yfl−1 − γJl⊤ P1⊥ Jl ⊤
Wl fl−1 fl−1 /K (10) project into the subspaces corresponding to the first few
| {z } | {z } | {z } largest eigenvectors of B̂it and Ĉit for faster convergence
A B C
(large κt ). While all eigenvalues of the positive semidefi-
which is consistent with the form Gl = A − BWl C. For a nite (PSD) matrix B and C are non-negative, some of them
detailed introduction to reversibility, please check the Ap- can be very small and hinder convergence (i.e., it takes a
pendix A.2. long time for Gt to become 0). With the projection P and
Q, P ⊤ Bit P and Q⊤ Cit Q only contain the largest eigen
3.3. Gradient Low-rank Projection (GaLore) subspaces of B and C, improving the convergence of Rt
and at the same time, reduces the memory usage.
Since the gradient G may have a low-rank structure, if we
While it is tricky to obtain the eigenstructure of B̂it and Ĉit
can keep the gradient statistics of a small “core” of gradient
(they are parts of Jacobian), one way is to instead use the
G in optimizer states, rather than G itself, then the memory
spectrum of Gt via Singular Value Decomposition (SVD):
consumption can be reduced substantially. This leads to
our proposed GaLore strategy: r
X
Gt = U SV ⊤ ≈ si ui vi⊤ (14)
Definition 3.4 (Gradient Low-rank Projection (GaLore)).
i=1
Gradient low-rank projection (GaLore) denotes the follow-
ing gradient update rules (η is the learning rate): Pt = [u1 , u2 , ..., ur ], Qt = [v1 , v2 , ..., vr ] (15)
T
X −1 Difference between GaLore and LoRA. While both Ga-
WT = W 0 + η G̃t , G̃t = Pt ρt (Pt⊤ Gt Qt )Q⊤
t , Lore and LoRA have “low-rank” in their names, they fol-
t=0 low very different training trajectories. For example, when
(11) r = min(m, n), GaLore with ρt ≡ 1 follows the ex-
act training trajectory of the original model, as G̃t =
where Pt ∈ Rm×r and Qt ∈ Rn×r are projection matrices. Pt Pt⊤ Gt Qt Q⊤ = Gt . On the other hand, when BA
t
reaches full rank (i.e., B ∈ Rm×m and A ∈ Rm×n ),
Different from LoRA, GaLore explicitly utilizes the low-
optimizing B and A simultaneously follows very different
rank updates instead of introducing additional low-rank
training trajectory from the original model.
adaptors and hence does not alter the training dynamics.
In the following, we show that GaLore converges under
4. GaLore for Memory-Efficient Training
a similar (but more general) form of gradient update rule
(Eqn. 6). This form corresponds to Eqn. 8 but with a larger For a complex optimization problem such as LLM pre-
batch size. training, it may be difficult to capture the entire gradient
Definition 3.5 (L-continuity). A function h(W ) has (Lip- trajectory with a single low-rank subspace. One reason is
schitz) L-continuity, if for any W1 and W2 , ∥h(W1 ) − that the principal subspaces of Bt and Ct (and thus Gt ) may
h(W2 )∥F ≤ L∥W1 − W2 ∥F . change over time. In fact, if we keep the same projection
P and Q, then the learned weights will only grow along
Theorem 3.6 (Convergence of GaLore with fixed projec- these subspaces, which is not longer full-parameter train-
tions). Suppose the gradient has the following form (Eqn. 8 ing. Fortunately, for this, GaLore can switch subspaces
4
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Initialize step t ← 0
repeat
𝑾𝟎 + ∆𝑾𝑻𝟏 + ∆𝑾𝑻𝟐 Gt ∈ Rm×n ← −∇W φt (Wt )
𝑾𝟎
if t mod T = 0 then
U, S, V ← SVD(Gt )
Pt ← U [:, : r] {Initialize left projector as m ≤ n}
else
Pt ← Pt−1 {Reuse the previous projector}
Figure 2: Learning through low-rank subspaces ∆WT1 and end if
Rt ← Pt Gt ⊤ {Project gradient into compact space}
∆WT2 using GaLore. For t1 ∈ [0, T1 − 1], W are updated by
projected gradients G̃t1 in a subspace determined by fixed Pt1 UPDATE (Rt )by Adam
and Qt1 . After T1 steps, the subspace is changed by re-computing Mt ← β1 · Mt−1 + (1 − β1 ) · Rt
Pt2 and Qt2 for t2 ∈ [T1 , T2 − 1], and the process repeats until Vt ← β2 · Vt−1 + (1 − β2 ) · Rt2
convergence. Mt ← Mt /(1 − β1t )
Vt ← Vt /(1 − β2t )
√
Nt ← Mt /( Vt + ϵ)
during training and learn full-rank weights without increas-
G̃t ← α · P Nt {Project back to original space}
ing the memory footprint. Wt ← Wt−1 + η · G̃t
t←t+1
4.1. Composition of Low-Rank Subspaces until convergence criteria met
return Wt
We allow GaLore to switch across low-rank subspaces:
Wt = W0 + ∆WT1 + ∆WT2 + . . . + ∆WTn , (16) 2020).
hP i
n−1 Pn
where t ∈ i=1 Ti , i=1 Ti and ∆WTi =
PTi −1 4.2. Memory-Efficient Optimization
η t=0 G̃t is the summation of all Ti updates within the
i-th subspace. When switching to i-th subspace at step Reducing memory footprint of gradient statistics. Ga-
t = Ti , we re-initialize the projector Pt and Qt by perform- Lore significantly reduces the memory cost of optimizer
ing SVD on the current gradient Gt by Equation 14. We il- that heavily rely on component-wise gradient statistics,
lustrate how the trajectory of G̃t traverses through multiple such as Adam (Kingma & Ba, 2014). When ρt ≡ Adam,
low-rank subspaces in Fig. 2. In the experiment section, we by projecting Gt into its low-rank form Rt , Adam’s gra-
show that allowing multiple low-rank subspaces is the key dient regularizer ρt (Rt ) only needs to track low-rank gra-
to achieving the successful pre-training of LLMs. dient statistics. where Mt and Vt are the first-order and
second-order momentum, respectively. GaLore computes
Following the above procedure, the switching frequency the low-rank normalized gradient Nt as follows:
T becomes a hyperparameter. The ablation study (Fig. 5) p
shows a sweet spot exists. A very frequent subspace change Nt = ρt (Rt ) = Mt /( Vt + ϵ). (17)
increases the overhead (since new Pt and Qt need to be
GaLore can also apply to other optimizers (e.g., Adafactor)
computed) and breaks the condition of constant projection
that have similar update rules and require a large amount of
in Theorem 3.6. In practice, it may also impact the fi-
memory to store gradient statistics.
delity of the optimizer states, which accumulate over mul-
tiple training steps. On the other hand, a less frequent
Reducing memory usage of projection matrices. To
change may make the algorithm stuck into a region that
achieve the best memory-performance trade-off, we only
is no longer important to optimize (convergence proof in
use one project matrix P or Q, projecting the gradient G
Theorem 3.6 only means good progress in the designated
into P ⊤ G if m ≤ n and GQ otherwise. We present the
subspace, but does not mean good overall performance).
algorithm applying GaLore to Adam in Algorithm 2.
While optimal T depends on the total training iterations
and task complexity, we find that a value between T = 50 With this setting, GaLore requires less memory than LoRA
to T = 1000 makes no significant difference. Thus, the during training. As GaLore can always merge ∆Wt to
total computational overhead induced by SVD is negligi- W0 during weight updates, it does not need to store a
ble (< 10%) compared to other memory-efficient training separate low-rank factorization BA. In total, GaLore re-
techniques such as memory offloading (Rajbhandari et al., quires (mn + mr + 2nr) memory, while LoRA requires
5
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
6
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Table 2: Comparison with low-rank algorithms on pre-training various sizes of LLaMA models on C4 dataset. Validation perplexity
is reported, along with a memory estimate of the total of parameters and optimizer states based on BF16 format. The actual memory
footprint of GaLore is reported in Fig. 1 and Fig. 4.
Table 3: Pre-training LLaMA 7B on C4 dataset for 150K steps. 5.2. GaLore with Memory-Efficient Optimizers
Validation perplexity and memory estimate are reported.
We demonstrate that GaLore can be applied to various
Mem 40K 80K 120K 150K learning algorithms, especially memory-efficient optimiz-
ers, to further reduce the memory footprint. We apply Ga-
8-bit GaLore 18G 17.94 15.39 14.95 14.65 Lore to AdamW, 8-bit Adam, and Adafactor optimizers
8-bit Adam 26G 18.09 15.47 14.83 14.61 (Loshchilov & Hutter, 2019; Dettmers et al., 2022; Shazeer
Tokens (B) 5.2 10.5 15.7 19.7 & Stern). We consider Adafactor with first-order statistics
to avoid performance degradation.
We evaluate them on LLaMA 1B architecture with 10K
full-rank initialization matrix. We set LoRA alpha to 32 training steps, and we tune the learning rate for each set-
and LoRA dropout to 0.05 as their default settings. ting and report the best performance. As shown in Fig. 3,
applying GaLore does not significantly affect their conver-
gence. By using GaLore with a rank of 512, the memory
ReLoRA Lialin et al. (2023) is a variant of LoRA de- footprint is reduced by up to 62.5%, on top of the mem-
signed for pre-training, which periodically merges BA into ory savings from using 8-bit Adam or Adafactor optimizer.
W , and initializes new BA with a reset on optimizer states Since 8-bit Adam requires less memory than others, we de-
and learning rate. ReLoRA requires careful tuning of merg- note 8-bit GaLore as GaLore with 8-bit Adam, and use it
ing frequency, learning rate reset, and optimizer states re- as the default method for the following experiments on 7B
set. We evaluate ReLoRA without a full-rank training model pre-training and memory measurement.
warmup for a fair comparison.
5.3. Scaling up to LLaMA 7B Architecture
For GaLore, we set subspace frequency T to 200 and scale
factor α to 0.25 across all model sizes in Table 2. For each Scaling ability to 7B models is a key factor for demonstrat-
model size, we pick the same rank r for all low-rank meth- ing if GaLore is effective for practical LLM pre-training
ods, and we apply them to all multi-head attention layers scenarios. We evaluate GaLore on an LLaMA 7B archi-
and feed-forward layers in the models. We train all mod- tecture with an embedding size of 4096 and total layers of
els using Adam optimizer with the default hyperparame- 32. We train the model for 150K steps with 19.7B tokens,
ters (e.g., β1 = 0.9, β2 = 0.999, ϵ = 10−8 ). We also using 8-node training in parallel with a total of 64 A100
estimate the memory usage based on BF16 format, includ- GPUs. Due to computational constraints, we only compare
ing the memory for weight parameters and optimizer states. 8-bit GaLore (r = 1024) with 8-bit Adam with a single
As shown in Table 2, GaLore outperforms other low-rank trial without tuning the hyperparameters. As shown in Ta-
methods and achieves comparable performance to full-rank ble 3, after 150K steps, 8-bit GaLore achieves a perplexity
training. We note that for 1B model size, GaLore even of 14.65, which is comparable to 8-bit Adam with a per-
outperforms full-rank baseline when r = 1024 instead of plexity of 14.61.
r = 512. Compared to LoRA and ReLoRA, GaLore re-
quires less memory for storing model parameters and opti- 5.4. Memory-Efficient Fine-Tuning
mizer states. A detailed training setting of each model and
our memory estimation for each method are provided in the GaLore not only achieves memory-efficient pre-training
appendix. but also can be used for memory-efficient fine-tuning. We
7
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Figure 3: Applying GaLore to different optimizers for pre-training LLaMA 1B on C4 dataset for 10K steps. Validation perplexity over
training steps is reported. We apply GaLore to each optimizer with the rank of 512 and 1024, where the 1B model dimension is 2048.
40 8-bit GaLore (retaining grad) when activation checkpointing is enabled, per-GPU token
8-bit GaLore batch size can be increased up to 4096. While the batch size
30
RTX 4090 is small per GPU, it can be scaled up with data parallelism,
20 which requires much lower bandwidth for inter-GPU com-
10 munication, compared to model parallelism. Therefore, it
is possible that GaLore can be used for elastic training (Lin
0
350M 1B 3B 7B et al.) 7B models on consumer GPUs such as RTX 4090s.
Model Size
Specifically, we present the memory breakdown in Fig. 1.
Figure 4: Memory usage for different methods at various model It shows that 8-bit GaLore reduces 37.92G (63.3%) and
sizes, evaluated with a token batch size of 256. 8-bit GaLore (re- 24.5G (52.3%) total memory compared to BF16 Adam
taining grad) disables per-layer weight updates but stores weight baseline and 8-bit Adam, respectively. Compared to 8-
gradients during training. bit Adam, 8-bit GaLore mainly reduces the memory in
two parts: (1) low-rank gradient projection reduces 9.6G
(65.5%) memory of storing optimizer states, and (2) using
fine-tune pre-trained RoBERTa models on GLUE tasks us- per-layer weight updates reduces 13.5G memory of storing
ing GaLore and compare its performance with a full fine- weight gradients.
tuning baseline and LoRA. We use hyperparameters from Throughput overhead of GaLore. We also measure the
Hu et al. (2021) for LoRA and tune the learning rate and throughput of the pre-training LLaMA 1B model with 8-bit
scale factor for GaLore. As shown in Table 4, GaLore GaLore and other methods, where the results can be found
achieves better performance than LoRA on most tasks with in the appendix. Particularly, the current implementation
less memory footprint. This demonstrates that GaLore can of 8-bit GaLore achieves 1019.63 tokens/second, which
serve as a full-stack memory-efficient training strategy for induces 17% overhead compared to 8-bit Adam imple-
both LLM pre-training and fine-tuning. mentation. Disabling per-layer weight updates for GaLore
achieves 1109.38 tokens/second, improving the throughput
5.5. Measurement of Memory and Throughput by 8.8%. We note that our results do not require offloading
While Table 2 gives the theoretical benefit of GaLore com- strategies or checkpointing, which can significantly impact
pared to other methods in terms of memory usage, we also training throughput. We leave optimizing the efficiency of
measure the actual memory footprint of training LLaMA GaLore implementation for future work.
models by various methods, with a token batch size of 256.
The training is conducted on a single device setup with-
6. Ablation Study
out activation checkpointing, memory offloading, and opti- 6.1. How many subspaces are needed during
mizer states partitioning (Rajbhandari et al., 2020). pre-training?
Training 7B models on consumer GPUs with 24G mem- We observe that both too frequent and too slow changes of
ory. As shown in Fig. 4, 8-bit GaLore requires signifi- subspaces hurt the convergence, as shown in Fig. 5(left).
cantly less memory than BF16 baseline and 8-bit Adam,
8
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Table 4: Evaluating GaLore for memory-efficient fine-tuning on GLUE benchmark using pre-trained RoBERTa-Base. We report the
average score of all tasks.
Memory CoLA STS-B MRPC RTE SST2 MNLI QNLI QQP Avg
Full Fine-Tuning 747M 62.24 90.92 91.30 79.42 94.57 87.18 92.33 92.28 86.28
GaLore (rank=4) 253M 60.35 90.73 92.25 79.42 94.04 87.00 92.24 91.06 85.89
LoRA (rank=4) 257M 61.38 90.57 91.07 78.70 92.89 86.82 92.18 91.29 85.61
GaLore (rank=8) 257M 60.06 90.82 92.01 79.78 94.38 87.17 92.20 91.11 85.94
LoRA (rank=8) 264M 61.83 90.80 91.90 79.06 93.46 86.94 92.25 91.22 85.93
9
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
by projected gradient descent: General statistical and al- Lee, Y. and Choi, S. Gradient-Based Meta-Learning with
gorithmic guarantees, September 2015. Learned Layerwise Metric and Subspace, June 2018.
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, Li, B., Chen, J., and Zhu, J. Memory Efficient Optimiz-
G., Roberts, A., Barham, P., Chung, H. W., Sutton, ers with 4-bit States. https://arxiv.org/abs/2309.01507v3,
C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, September 2023.
S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer,
Lialin, V., Shivagunde, N., Muckatira, S., and Rumshisky,
N., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B.,
A. ReLoRA: High-Rank Training Through Low-Rank
Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari,
Updates, December 2023.
G., Yin, P., Duke, T., Levskaya, A., Ghemawat, S., Dev,
S., Michalewski, H., Garcia, X., Misra, V., Robinson, Lin, H., Zhang, H., Ma, Y., He, T., Zhang, Z., Zha, S., and
K., Fedus, L., Zhou, D., Ippolito, D., Luan, D., Lim, Li, M. Dynamic Mini-batch SGD for Elastic Distributed
H., Zoph, B., Spiridonov, A., Sepassi, R., Dohan, D., Training: Learning in the Limbo of Resources. URL
Agrawal, S., Omernick, M., Dai, A. M., Pillai, T. S., Pel- http://arxiv.org/abs/1904.12043.
lat, M., Lewkowycz, A., Moreira, E., Child, R., Polozov,
O., Lee, K., Zhou, Z., Wang, X., Saeta, B., Diaz, M., Loshchilov, I. and Hutter, F. Decoupled Weight Decay Reg-
Firat, O., Catasta, M., Wei, J., Meier-Hellstern, K., Eck, ularization, January 2019.
D., Dean, J., Petrov, S., and Fiedel, N. PaLM: Scaling
Lv, K., Yang, Y., Liu, T., Gao, Q., Guo, Q., and Qiu, X.
Language Modeling with Pathways, October 2022.
Full Parameter Fine-tuning for Large Language Models
Dettmers, T., Lewis, M., Shleifer, S., and Zettlemoyer, with Limited Resources, June 2023.
L. 8-bit Optimizers via Block-wise Quantization. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S.,
arXiv:2110.02861 [cs], October 2021. Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring
the Limits of Transfer Learning with a Unified Text-to-
Dettmers, T., Lewis, M., Shleifer, S., and Zettlemoyer,
Text Transformer, September 2023.
L. 8-bit Optimizers via Block-wise Quantization, June
2022. Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. ZeRO:
Memory Optimizations Toward Training Trillion Param-
Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, eter Models, May 2020.
L. QLoRA: Efficient Finetuning of Quantized LLMs,
May 2023. Renduchintala, A., Konuk, T., and Kuchaiev, O. Tied-Lora:
Enhacing parameter efficiency of LoRA with weight ty-
Ding, N., Qin, Y., Yang, G., Wei, F., Yang, Z., Su, Y., ing, November 2023.
Hu, S., Chen, Y., Chan, C.-M., Chen, W., Yi, J., Zhao,
W., Wang, X., Liu, Z., Zheng, H.-T., Chen, J., Liu, Y., Shazeer, N. GLU Variants Improve Transformer, February
Tang, J., Li, J., and Sun, M. Delta Tuning: A Compre- 2020.
hensive Study of Parameter Efficient Methods for Pre-
Shazeer, N. and Stern, M. Adafactor: Adaptive Learning
trained Language Models, March 2022.
Rates with Sublinear Memory Cost.
Gur-Ari, G., Roberts, D. A., and Dyer, E. Gradient Descent Sheng, Y., Cao, S., Li, D., Hooper, C., Lee, N., Yang, S.,
Happens in a Tiny Subspace, December 2018. Chou, C., Zhu, B., Zheng, L., Keutzer, K., Gonzalez,
J. E., and Stoica, I. S-LoRA: Serving Thousands of Con-
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang,
current LoRA Adapters, November 2023.
S., Wang, L., and Chen, W. LoRA: Low-Rank Adapta-
tion of Large Language Models, October 2021. Tian, Y., Yu, L., Chen, X., and Ganguli, S. Understanding
self-supervised learning with dual deep networks. arXiv
Kamalakara, S. R., Locatelli, A., Venkitesh, B., Ba, J., Gal, preprint arXiv:2010.00578, 2020.
Y., and Gomez, A. N. Exploring Low Rank Training of
Deep Neural Networks, September 2022. Tian, Y., Wang, Y., Zhang, Z., Chen, B., and Du, S. Joma:
Demystifying multilayer transformers via joint dynam-
Kingma, D. P. and Ba, J. Adam: A Method for Stochastic ics of mlp and attention. ICLR, 2024.
Optimization. arXiv:1412.6980 [cs], December 2014.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi,
Larsen, B. W., Fort, S., Becker, N., and Ganguli, S. How A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P.,
many degrees of freedom do we need to train deep net- Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen,
works: A loss landscape perspective, February 2022. M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J.,
10
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N.,
Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas,
M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev,
A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J.,
Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov,
T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizen-
stein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R.,
Smith, E. M., Subramanian, R., Tan, X. E., Tang, B.,
Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z.,
Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S.,
Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T.
Llama 2: Open Foundation and Fine-Tuned Chat Mod-
els, July 2023.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and
Bowman, S. R. GLUE: A Multi-Task Benchmark and
Analysis Platform for Natural Language Understanding,
February 2019.
11
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
A. Proofs
A.1. Gradient becomes low-rank
Lemma A.1 (Gradient becomes low-rank during training). Let m ≤ n without loss of generality. The gradient update:
with constant A and PSD matrices B and C and randomly initialized W0 leads to low-rank gradient with high probability:
m 2t
X 1 − ηλi ν1
stable-rank(Gt ) ≤ 1 + O (7)
i=2
1 − ηλ1 ν1
Here ν1 = λmin (C) is the smallest eigenvalues of C and λ1 ≤ . . . ≤ λn are eigenvalues of B. Furthermore, if λ2 > λ1
and ν1 > 0, then Gt converges to rank-1 exponentially.
Proof. We have
Suppose ht,ij is the ij component of Ht , then from the equation above we have:
Then for first few rows i and columns j that correspond to large eigenvalues, ht,ij → 0 quickly and rank(Ht ) becomes
small.
To make it more precise, consider the stable rank:
∥Ht ∥2F
stable-rank(Gt ) = stable-rank(Ht ) = (21)
∥Ht ∥22
Then we have:
m X
X n
∥Ht ∥2F = (1 − ηλi νj )2t h20,ij (22)
i=1 j=1
and
n
X n
X
∥Ht ∥22 ≥ 2
Ht,1j = (1 − ηλ1 νj )2t h20,1j (23)
j=1 j=1
12
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
A.2. Reversibility
Definition A.2 (Reversiblity (Tian et al., 2020)). A network N that maps input x to output y = N (x) is reversible, if
there exists K(x; W ) so that y = K(x; W )x, and the backpropagated gradient gx satisfies gx = K ⊤ (x; W )gy , where
gy is the backpropagated gradient at the output y. Here K(x; W ) depends on the input x and weight W in the network
N.
Note that many layers are reversible, including linear layer (without bias), reversible activations (e.g., ReLU, leaky ReLU,
polynomials, etc). Furthermore, they can be combined to construct more complicated architectures:
Property 1. If N1 and N2 are reversible networks, then (Parallel) y = α1 N1 (x) + α2 N2 (x) is reversible for constants
α1 and α2 , and (Composition) y = N2 (N1 (x)) is reversible.
From this property, it is clear that ResNet architecture x + N (x) is reversible, if N contains bias-free linear layers and
reversible activations, which is often the case in practice. For a detailed analysis, please check Appendix A in (Tian et al.,
2020). For architectures like self-attention, one possibility is to leverage JoMA (Tian et al., 2024) to analyze, and we leave
for future work.
The gradient of chained reversible networks has the following structure:
Theorem 3.2 (Gradient Form of reversible models). In a chained reversible neural network N (x) :=
NL (NL−1 (. . . N1 (x))) with ℓ2 -objective φ := 12 ∥y − N (x)∥22 , the weight matrix Wl at layer l has gradient Gl of the
following form for batchsize 1:
Gl = Jl⊤ yfl−1
⊤
− Jl⊤ Jl Wl fl−1 fl−1
⊤
(8)
| {z } | {z } | {z }
A B C
exp(y ⊤ f )
Lemma A.3 (Gradient structure of softmax loss). For K-way logsoftmax loss φ(y; f ) := − log 1⊤ exp(f )
, let fˆ = P1⊥ f
be the zero-mean version of network output f , where P1⊥ := I − 1 ⊤
K 11 , then we have:
Proof. Let fˆ := P1⊥ f be the zero-mean version of network output f . Then we have 1⊤ fˆ = 0 and f = fˆ+ c1. Therefore,
we have:
!
exp(c) exp(y ⊤ fˆ)
−φ = log = y ⊤ fˆ − log(1⊤ exp(fˆ)) (32)
exp(c)1⊤ exp(fˆ)
13
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
x2
Using the Taylor expansion exp(x) = 1 + x + 2 + o(x2 ), we have:
1
1⊤ exp(fˆ) = 1⊤ (1 + fˆ + fˆ2 ) + o(fˆ2 ) = K(1 + fˆ⊤ fˆ/2K + o(fˆ2 /K)) (33)
2
So
Therefore
!
γ fˆ2
−dφ = y dfˆ − fˆ⊤ dfˆ + O
⊤
dfˆ (35)
K K
where Bi and Ci are PSD matrices, Ai , Bi and Ci have LA , LB and LC continuity P with respect to W and ∥Wt ∥ ≤ D.
Let Rt := Pt⊤ Gt Qt , B̂it := Pt⊤ Bi (Wt )Pt , Ĉit := Q⊤ 1
t Ci (Wt )Qt and κt := N i λmin (B̂it )λmin (Ĉit ). If we choose
constant Pt = P and Qt = Q, then GaLore with ρt ≡ 1 satisfies:
As a result, if mint κt > LA + LB LC D2 , Rt → 0 and thus GaLore converges with fixed Pt and Qt .
Proof. Using vec(AXB) = (B ⊤ ⊗ A)vec(X) where ⊗ is the Kronecker product, the gradient assumption can be written
as the following:
gt = at − St wt (36)
1
:= vec(Gt ) ∈ Rmn , wt := vec(Wt ) ∈ Rmn be the vectorized versions of Gt and Wt , at :=
P
where gtP N i vec(Ait ) and
1
St = N i Cit ⊗ Bit are mn-by-mn PSD matrix.
Using the same notation, it is clear to show that:
gt = at − St wt (39)
= (at − at−1 ) + (St−1 − St )wt + at−1 − St−1 wt (40)
= et + at−1 − St−1 (wt−1 + ηg̃t−1 ) (41)
= et + gt−1 − ηSt−1 g̃t−1 (42)
14
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Let
1 X 1 X ⊤
Ŝt := (Q ⊗ P )⊤ St (Q ⊗ P ) = (Q ⊗ P )⊤ (Cit ⊗ Bit )(Q ⊗ P ) = (Q Cit Q) ⊗ (P ⊤ Bit P ) (44)
N i N i
Then we have:
rt = (I − η Ŝt−1 )rt−1 + (Q ⊗ P )⊤ et (45)
⊤ ⊤
Now we bound the norm. Note that since P and Q are projection matrices with P P = I and Q Q = I, we have:
∥(Q ⊗ P )⊤ et ∥2 = ∥vec(P ⊤ Et Q)∥2 = ∥P ⊤ Et Q∥F ≤ ∥Et ∥F (46)
1 1
P P
where Et := N i (Ait − Ai,t−1 ) + N i (Bi,t−1 Wt Ci,t−1 − Bit Wt Cit ). So we only need to bound ∥Et ∥F . Note that:
Now we estimate the minimal eigenvalue of Ŝt−1 . Let λit := λmin (P ⊤ Bit P ) and ν it := λmin (Q⊤ Cit Q), then
λmin ((P ⊤ Bit P ) ⊗ (Q⊤ Cit Q)) = λit ν it and for any unit vector v:
1 X ⊤ ⊤ 1 X
v ⊤ Ŝt v = v (P Bit P ) ⊗ (Q⊤ Cit Q) v ≥
λ ν (50)
N i N i it it
Table 5: Hyperparameters of LLaMA models for evaluation. Data amount are specified in tokens.
For all methods on each size of models (from 60M to 1B), we tune their favorite learning rate from a set of
{0.01, 0.005, 0.001, 0.0005, 0.0001}, and the best learning rate is chosen based on the validation perplexity. We find
GaLore is insensitive to hyperparameters and tends to be stable with the same learning rate across different model sizes.
For all models, GaLore use the same hyperparameters, including the learning rate of 0.01, scale factor α of 0.25, and the
subspace change frequency of T of 200. We note that since α can be viewed as a fractional learning rate, most of the
modules (e.g., multi-head attention and feed-forward layers) in LLaMA models have the actual learning rate of 0.0025.
This is, still, a relatively large stable learning rate compared to the full-rank baseline, which usually uses a learning rate
≤ 0.001 to avoid spikes in the training loss.
15
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
(a) Memory estimate of weight parameters. (b) Memory estimate of optimizer states.
1
https://huggingface.co/transformers/model_doc/roberta.html
16
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Throughput
Model Size Layer Wise Methods Token Batch Size Memory Cost
#Tokens / s #Samples / s
AdamW 256 13.60 1256.98 6.33
Adafactor 256 13.15 581.02 2.92
1B ✘
Adam8bit 256 9.54 1569.89 7.90
8-bit GaLore 256 7.95 1109.38 5.59
AdamW 256 9.63 1354.37 6.81
Adafactor 256 10.32 613.90 3.09
1B ✔ Adam8bit 256 6.93 1205.31 6.07
8-bit GaLore 256 5.63 1019.63 5.13
17