Introduction
In this blog post, I will be discussing a paper from Google DeepMind in which they run a large number of training experiments on language models to find the relationship between model size, compute budget, and number of training tokens. I will also cover Meta's LLaMA models, which were trained using the results of DeepMind's experiments.
This blog is part of my series on large language models. You can view the previous post on advanced prompting techniques here: Navigating the Prompt Space: Techniques for Effective Prompt Exploration.
Figure: For optimal performance, all three factors must be scaled up alongside each other.
Chinchilla's paper ("Training Compute-Optimal Large Language Models") studies the relationship between model size, number of tokens, and compute budget. The authors come to the conclusion that current LLMs like GPT-3 (175B), Gopher (280B), and Megatron-Turing NLG (530B) are significantly undertrained: these models kept growing in parameter count while the amount of training data remained roughly constant. For compute-optimal training, they argue, the number of training tokens and the model size must be scaled equally. To establish this, they trained about 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens.
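To make the "scale equally" rule concrete, here is a minimal sketch (my own illustration, not code from the paper) that combines the standard C ≈ 6ND approximation for training FLOPs with the roughly 20-tokens-per-parameter ratio commonly derived from the Chinchilla results:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Estimate a compute-optimal model size and token count.

    Uses the common approximation C ~= 6 * N * D training FLOPs plus the
    ~20 tokens-per-parameter ratio often quoted from the Chinchilla results.
    Illustrative only, not the paper's actual fitting procedure.
    """
    # C = 6 * N * D with D = tokens_per_param * N
    # => C = 6 * tokens_per_param * N^2
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Gopher's compute budget from the paper: 5.76e23 FLOPs
n, d = chinchilla_optimal(5.76e23)
print(f"optimal params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.2f}T")
# -> roughly 69B parameters and 1.4T tokens, close to Chinchilla's 70B model
```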
After finding the relationship between the three factors, they trained a new LLM called Chinchilla, which uses the same compute budget as the 280B Gopher but has only 70B parameters and 4× more training data. Chinchilla outperforms
Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B). This result contradicts OpenAI's earlier "Scaling Laws for Neural Language Models" (Kaplan et al.): relatively smaller models can give better performance if trained on more data. Smaller models are also easier to fine-tune and have lower latency at inference. It also means that large models should not be trained to their lowest possible loss to be compute optimal.
Current LLMs
The main question of their research is: "Given a fixed FLOPs budget, how should one trade off model size and the number of training tokens?" They tried three different approaches to answer this question, all assuming a power-law relationship between compute and model size.
In the first approach, they fix the model sizes (75M, 250M, 500M, 1B, 2.5B, 5B, 10B) and vary the number of training tokens for each. For each FLOP budget they take the run with the lowest loss, and a power-law fit to these points shows that the optimal
model size for Gopher's compute budget (5.76 × 10^23 FLOPs) is 67B parameters, trained on 1.5 trillion tokens.
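The fit itself is a straight line in log-log space. A sketch of the idea, using made-up (compute, optimal size) pairs chosen to reproduce the paper's headline numbers (the real curve comes from hundreds of training runs):

```python
import numpy as np

# Hypothetical (compute, optimal model size) pairs, as approach 1 would
# collect: for each FLOP budget, the model size that reached the lowest loss.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
n_opt = np.array([8.8e7, 2.8e8, 8.8e8, 2.8e9, 8.8e9])  # synthetic, ~C^0.5 trend

# Fit N_opt = k * C^a  <=>  log N_opt = a * log C + log k
a, log_k = np.polyfit(np.log(compute), np.log(n_opt), 1)
print(f"exponent a ~ {a:.2f}")  # ~0.5 -> model size scales as sqrt(compute)

# Extrapolate to Gopher's budget of 5.76e23 FLOPs
n_gopher = np.exp(log_k) * 5.76e23 ** a
print(f"predicted optimal size: {n_gopher / 1e9:.0f}B parameters")  # ~67B
```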
In the second approach, they vary the model size for a fixed set of 9 different training FLOP counts (ranging from 10^18 to 10^21 FLOPs). This approach answers the question: "For a given FLOP budget, what is the optimal parameter count?" For training, they suggest that for a model trained on D tokens, the cosine learning-rate cycle should decay the learning rate by 10× over approximately D tokens.
Figure: IsoFLOP curves.
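Each IsoFLOP profile is a roughly parabolic curve of final loss versus model size at a fixed FLOP budget, and its minimum marks the optimal size for that budget. A sketch with synthetic loss values (not data from the paper):

```python
import numpy as np

# One IsoFLOP profile: models of different sizes, all trained with the same
# FLOP budget. Loss values here are synthetic, for illustration only.
n_params = np.array([4e8, 7e8, 1e9, 2e9, 4e9])
loss = np.array([2.35, 2.28, 2.25, 2.27, 2.34])

# Fit a parabola to loss vs. log(model size); its vertex is the estimated
# optimal size for this compute budget.
c2, c1, c0 = np.polyfit(np.log(n_params), loss, 2)
log_n_opt = -c1 / (2 * c2)  # vertex of the parabola
print(f"optimal size ~ {np.exp(log_n_opt) / 1e9:.2f}B parameters")
```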
For the third approach, they combine the final losses from the above two approaches and model them as a parametric function of the parameter count N and the number of tokens D: L(N, D) = E + A/N^α + B/D^β. They fit this functional form by minimizing a Huber loss, and it estimates the optimal model size for the Gopher FLOP budget at 40B parameters.
Figure: Parametric fit.
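A sketch of this fit using synthetic data (the constants used to generate the data are the paper's published estimates: E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28; everything else here is illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Parametric form from the paper: L(N, D) = E + A / N^alpha + B / D^beta
def predicted_loss(params, N, D):
    E, logA, logB, alpha, beta = params
    return E + np.exp(logA) / N**alpha + np.exp(logB) / D**beta

def huber(residual, delta=1e-3):
    # Quadratic near zero, linear in the tails: robust to outlier runs.
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

def objective(params, N, D, observed):
    # The paper minimizes the Huber loss of the residual in log space.
    return huber(np.log(predicted_loss(params, N, D)) - np.log(observed)).sum()

# (N, D, loss) triples would come from the ~400 runs; synthetic stand-ins here.
rng = np.random.default_rng(0)
N = rng.uniform(7e7, 1.6e10, size=400)
D = rng.uniform(5e9, 5e11, size=400)
observed = 1.69 + 406.4 / N**0.34 + 410.7 / D**0.28  # paper's fitted constants

x0 = [2.0, 5.0, 5.0, 0.5, 0.5]
bounds = [(0, 5), (0, 10), (0, 10), (0.1, 1.0), (0.1, 1.0)]
fit = minimize(objective, x0, args=(N, D, observed),
               method="L-BFGS-B", bounds=bounds)
E, logA, logB, alpha, beta = fit.x
print(f"alpha ~ {alpha:.2f}, beta ~ {beta:.2f}")  # should recover ~0.34 / ~0.28
```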
All three approaches suggest that as the compute budget increases, model size and the amount of training data should be increased in approximately equal proportions. The first and second approaches yield very similar predictions for optimal model sizes, while the third suggests that somewhat smaller models will be optimal for larger compute budgets. Chinchilla, the model they trained using these results, was trained on MassiveText with the AdamW optimizer and a SentencePiece tokenizer. Notably, at around 80% of the cosine cycle, the model trained with AdamW overtakes the same model trained with Adam.
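A minimal PyTorch sketch of this training setup; the hyperparameters are illustrative, not the paper's, but the schedule decays the learning rate by 10× over the training horizon as described above:

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the actual transformer

# AdamW (decoupled weight decay) with a cosine learning-rate schedule.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1)
total_steps = 10_000  # in practice, set from the token budget / batch size
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=2e-5  # 10x decay over the horizon
)

for step in range(total_steps):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()  # dummy objective
    loss.backward()
    optimizer.step()
    scheduler.step()
```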
LLaMA models
Meta released a collection of models ranging from 7B to 65B parameters, trained efficiently by following Chinchilla's scaling laws and using only publicly available datasets. These smaller models are cheaper at inference: LLaMA-13B outperforms GPT-3 on most benchmarks despite being 10× smaller. The models are not the fastest to train, but they are faster at inference.
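A back-of-the-envelope comparison using the common ~2N FLOPs-per-generated-token rule of thumb (my approximation, not a figure from the LLaMA paper):

```python
def inference_flops_per_token(n_params):
    """Rough rule of thumb: ~2 FLOPs per parameter per generated token."""
    return 2 * n_params

llama_13b = inference_flops_per_token(13e9)
gpt3_175b = inference_flops_per_token(175e9)
print(f"LLaMA-13B / GPT-3 cost ratio: {llama_13b / gpt3_175b:.2f}")
# ~0.07 -> roughly 13x fewer FLOPs per generated token
```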
Pre-training data
They use byte-pair encoding for tokenization. The 7B and 13B parameter models were trained on 1T tokens, while the 33B and 65B parameter models were trained on 1.4T tokens. The architecture follows the original Transformer, with a few changes borrowed from PaLM, GPT-3, and other models: pre-normalization, the SwiGLU activation function, and rotary embeddings in place of absolute positional embeddings. Training uses the AdamW optimizer and causal multi-head attention. For an efficient implementation, they prefer storing expensive activations over recomputing them during the backward pass.
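A minimal PyTorch sketch of two of these components, pre-normalization via RMSNorm and the SwiGLU feed-forward block (dimensions are illustrative; rotary embeddings and attention are omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization used by LLaMA: scale by the RMS, no mean-centering."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFeedForward(nn.Module):
    """SwiGLU: silu(x W1) * (x W3), then project back down with W2."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(1, 16, 512)
block = SwiGLUFeedForward(512, 1376)  # LLaMA uses ~(2/3) * 4 * dim hidden size
out = block(RMSNorm(512)(x))          # pre-norm: normalize before the sublayer
print(out.shape)  # torch.Size([1, 16, 512])
```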
Other benchmarks on which the LLaMA models have been tested are presented in the paper. These open-source, state-of-the-art foundation models show that relatively small models can outperform much larger ones if trained efficiently and for longer.
Closing Remarks
Chinchilla and LLaMA make a strong case for trading raw parameter count for more training data and better inference efficiency. In the next blog post, I will be covering language models like Alpaca, Vicuna, and WizardLM. One thing these models have in common is that all three are fine-tuned versions of LLaMA. I will also explain how they differ with respect to the data they collected for efficient fine-tuning.
Follow me on LinkedIn!
References
Hoffmann et al., "Training Compute-Optimal Large Language Models" (DeepMind, 2022). https://arxiv.org/abs/2203.15556
Touvron et al., "LLaMA: Open and Efficient Foundation Language Models" (Meta AI, 2023). https://arxiv.org/abs/2302.13971
Kaplan et al., "Scaling Laws for Neural Language Models" (OpenAI, 2020). https://arxiv.org/abs/2001.08361