
Training at Scale: Chinchilla Scaling Laws for Compute-Optimal Training of LLMs

Zain ul Abideen · 6 min read · Jun 26

Exploring Chinchilla’s scaling laws and Meta’s LLaMA model

Introduction
In this blog post, I will discuss a paper from Google DeepMind in which they run a large number of experiments on training large language models to find the relationship between model size, compute budget, and number of training tokens. I will also cover Meta’s LLaMA model, which was trained using the results of DeepMind’s experiments. This blog is part of my series on large language models; you can view the previous post on advanced prompting techniques here: Navigating the Prompt Space: Techniques for Effective Prompt Exploration. The Chinchilla paper draws heavily on OpenAI’s earlier scaling laws for LLMs, so I will cover the results of that paper first.

Scaling Laws for LLMs by OpenAI


In 2020, OpenAI published the paper “Scaling Laws for Neural Language Models”. They found that the loss scales as a power law with model size, dataset size, and the amount of compute used for training, while architectural details such as network depth and width have minimal effect. These relationships led them to conclude that “Larger models are significantly more sample efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.”

For optimal performance, all three factors must be scaled up alongside each other.
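To make the power-law claim concrete, here is a minimal Python sketch of loss curves of that form. The constants are roughly in line with the paper’s reported fits but should be treated as illustrative placeholders rather than authoritative values.

```python
import numpy as np

# Illustrative power-law loss curves in the spirit of "Scaling Laws for
# Neural Language Models". Constants are rough placeholders, not exact
# fitted values from the paper.
N_C, ALPHA_N = 8.8e13, 0.076   # reference parameter count and exponent
D_C, ALPHA_D = 5.4e13, 0.095   # reference token count and exponent

def loss_vs_params(n_params: float) -> float:
    """Test loss as a power law of model size, with data not a bottleneck."""
    return (N_C / n_params) ** ALPHA_N

def loss_vs_tokens(n_tokens: float) -> float:
    """Test loss as a power law of dataset size, with model size not a bottleneck."""
    return (D_C / n_tokens) ** ALPHA_D

# Each 10x increase in parameters lowers the loss by a roughly constant factor.
for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"N = {n:.0e}  ->  L ≈ {loss_vs_params(n):.3f}")
```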

Training Compute-Optimal LLMs by DeepMind


This paper was published in 2022. Its main goal was to find the relationship between three factors: model size, number of training tokens, and compute budget. The authors conclude that current LLMs such as GPT-3 (175B), Gopher (280B), and Megatron-Turing NLG (530B) are significantly undertrained: the number of parameters has grown, but the amount of training data has remained roughly constant. They argue that for compute-optimal training, model size and the number of training tokens must be scaled equally. To show this, they trained about 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens.

Chinchilla outperforms Gopher and the other large models

After establishing the relationship between the three factors, they trained a new LLM called Chinchilla, which uses the same compute budget as the 280B Gopher but has only 70B parameters and about 4 times more training data. Chinchilla outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B). This result contradicts OpenAI’s earlier scaling laws: relatively smaller models can give better performance if they are trained on more data. Smaller models are also easier to fine-tune and have lower latency at inference. These models should not be trained to their lowest possible loss to be compute-optimal.
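As a quick sanity check on the claim that Chinchilla and Gopher share a compute budget, we can use the common rule of thumb C ≈ 6·N·D training FLOPs (an approximation I am assuming here; it is not stated in the post):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    # Rule-of-thumb cost of training: ~6 FLOPs per parameter per token.
    return 6 * n_params * n_tokens

gopher = train_flops(280e9, 300e9)        # 280B params, roughly 300B tokens
chinchilla = train_flops(70e9, 1.4e12)    # 70B params, roughly 1.4T tokens

print(f"Gopher:     {gopher:.2e} FLOPs")      # ~5.0e23
print(f"Chinchilla: {chinchilla:.2e} FLOPs")  # ~5.9e23, a comparable budget
```

Both land in the same ballpark as the 5.76 × 10^23 FLOP Gopher budget quoted below.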

Current LLMs

The central question of their research is: “Given a fixed FLOPs budget, how should one trade off model size and the number of training tokens?” They tried three different approaches to answer it, all of which assume a power-law relationship between compute and optimal model size.

Approach 1: Fix model sizes and vary number of training tokens

In the first approach, they fix a set of model sizes (75M, 250M, 500M, 1B, 2.5B, 5B, 10B) and vary the number of training tokens for each. For every compute budget (FLOP count), they then read off which model size achieves the lowest loss, which traces out a training-curve envelope. Fitting a power law to this envelope, they find that the optimal model size for Gopher’s compute budget (5.76 × 10^23 FLOPs) is 67B parameters, trained on 1.5 trillion tokens.

Training curve envelope
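A minimal sketch of the idea behind this approach: collect the lowest loss achieved at each compute budget across the fixed model sizes (the training-curve envelope above), fit a power law N_opt ∝ C^a to those points, and extrapolate to Gopher’s budget. The data points below are invented, chosen only so that the extrapolation lands near the reported 67B.

```python
import numpy as np

# Hypothetical (compute, best model size) points read off a training-curve
# envelope; the real ones come from hundreds of training runs.
compute = np.array([1e18, 1e19, 1e20, 1e21])   # FLOPs
best_n = np.array([9e7, 2.8e8, 9e8, 2.8e9])    # params with lowest loss at each budget

# Fit log N_opt = a * log C + b, i.e. a power law in log-log space.
a, b = np.polyfit(np.log(compute), np.log(best_n), deg=1)

gopher_budget = 5.76e23
n_opt = np.exp(b + a * np.log(gopher_budget))
d_opt = gopher_budget / (6 * n_opt)            # assumes C ≈ 6 * N * D
print(f"a ≈ {a:.2f}, N_opt ≈ {n_opt:.2e} params, D_opt ≈ {d_opt:.2e} tokens")
```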

Approach 2: IsoFLOP profiles

In the second approach, they vary the model size for a fixed set of 9 different training FLOP counts (ranging from 10^18 to 10^21 FLOPs). This approach answers the question “For a given FLOP budget, what is the optimal parameter count?” For training, they suggest that for a model trained on 𝐷 tokens, a cosine cycle length that decays the learning rate 10× over approximately 𝐷 tokens should be used. This approach suggests that the optimal model size for Gopher’s compute budget is 63B parameters and that the number of training tokens should be 1.4 trillion.


IsoFLOP curves.
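A sketch of how an IsoFLOP profile can be turned into an optimum: for one fixed compute budget, sweep the model size (adjusting tokens so the total FLOPs stay constant), fit a parabola to the loss as a function of log model size, and read off the minimum. The losses below are invented for illustration.

```python
import numpy as np

# Hypothetical IsoFLOP profile at one fixed compute budget.
n_params = np.array([0.5e9, 1e9, 2e9, 4e9, 8e9])
loss = np.array([2.50, 2.35, 2.30, 2.33, 2.45])   # U-shaped in log(N), made up

# Fit a parabola loss ≈ c2*x^2 + c1*x + c0 with x = log(N), take its minimum.
c2, c1, c0 = np.polyfit(np.log(n_params), loss, deg=2)
x_min = -c1 / (2 * c2)
print(f"estimated optimal model size ≈ {np.exp(x_min):.2e} parameters")

# Repeating this for each of the 9 FLOP budgets gives (C, N_opt) pairs,
# which can then be fit with a power law exactly as in approach 1.
```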

Approach 3: Fitting a parametric loss function

For the third approach, they combine the final losses from the experiments of the above two approaches and model them as a parametric function of the number of model parameters and the number of training tokens. They propose a functional form and fit it by minimizing a Huber loss; this fit estimates the optimal model size for the Gopher FLOP budget to be 40B parameters.


Parametric fit
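A hedged sketch of this third approach: the paper models the final loss as L(N, D) = E + A/N^α + B/D^β and fits the five parameters by minimizing a Huber loss between predicted and observed log-losses. The synthetic data below is generated from values close to the paper’s reported fit, and the fitting details (optimizer, Huber δ, parameterization) are simplified relative to the paper.

```python
import numpy as np
from scipy.optimize import minimize

def predicted_loss(p, N, D):
    E, A, B, alpha, beta = p
    return E + A / N**alpha + B / D**beta

def huber(r, delta=1e-3):
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def objective(p, N, D, observed):
    # Compare predicted and observed losses in log space.
    return huber(np.log(predicted_loss(p, N, D)) - np.log(observed)).sum()

# Synthetic "runs": a grid of model sizes and token counts, with losses
# generated from parameters close to the paper's reported fit (illustration only).
Ns, Ds = np.meshgrid([4e8, 1e9, 4e9, 1.6e10], [8e9, 3.2e10, 1.3e11, 5e11])
N, D = Ns.ravel(), Ds.ravel()
observed = 1.69 + 406.4 / N**0.34 + 410.7 / D**0.28

init = np.array([2.0, 300.0, 300.0, 0.3, 0.3])
bounds = [(0.5, 5), (1, 2000), (1, 2000), (0.1, 1.0), (0.1, 1.0)]
fit = minimize(objective, init, args=(N, D, observed), method="L-BFGS-B", bounds=bounds)
E, A, B, alpha, beta = fit.x
print(f"E≈{E:.2f}  A≈{A:.1f}  B≈{B:.1f}  alpha≈{alpha:.2f}  beta≈{beta:.2f}")
```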

All three approaches suggest that as the compute budget increases, model size and the amount of training data should be increased in approximately equal proportions. The first and second approaches yield very similar predictions for optimal model sizes, while the third approach suggests that somewhat smaller models will be optimal at larger compute budgets. Chinchilla, the model they trained using these results, was trained on MassiveText with the AdamW optimizer and a SentencePiece tokenizer. At around 80% of the cosine learning-rate cycle, the AdamW-trained model overtakes one trained with the plain Adam optimizer.
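A minimal PyTorch-style sketch of the optimizer setup described here: AdamW with a cosine learning-rate schedule that decays the learning rate by roughly 10× over the run. The model, hyperparameters, and dummy objective are placeholders, not Chinchilla’s actual configuration.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder model and settings; illustrative only.
model = torch.nn.Linear(512, 512)
total_steps = 1_000          # in practice chosen so the cosine cycle matches the token budget

optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1, betas=(0.9, 0.95))
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps, eta_min=3e-5)  # ~10x decay

for step in range(total_steps):
    x = torch.randn(8, 512)
    loss = model(x).pow(2).mean()          # dummy objective standing in for the LM loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```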

LLaMA models
Meta released a collection of models ranging from 7B to 65B parameters, trained efficiently with guidance from Chinchilla’s scaling laws. These smaller models are cheaper at inference and were trained only on publicly available datasets. LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10× smaller. These models are not the fastest to train, but they are faster at inference.


Pre-training data

They use byte-pair encoding for tokenization. The 7B and 13B parameter models are trained on 1T tokens, while the 33B and 65B parameter models are trained on 1.4T tokens. The architecture follows the original Transformer, with a few changes borrowed from PaLM, GPT-3, and other models: pre-normalization, the SwiGLU activation function, and rotary embeddings in place of absolute positional embeddings. They use the AdamW optimizer and an efficient implementation of causal multi-head attention. To further improve training efficiency, they save the activations that are expensive to compute (such as the outputs of the linear layers) rather than recomputing them during the backward pass.
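A hedged sketch of two of these architectural choices, pre-normalization (LLaMA uses RMSNorm for this) and the SwiGLU feed-forward, written in PyTorch. Dimensions and details are simplified and not taken from the released LLaMA code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Normalize by the root-mean-square of the features (no mean subtraction)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: a SiLU-gated linear unit followed by a projection."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Pre-normalization: normalize the *input* of each sub-layer, x + f(Norm(x)),
# instead of normalizing the output as in the original Transformer.
norm, ffn = RMSNorm(512), SwiGLU(512, 1376)
x = torch.randn(2, 16, 512)
out = x + ffn(norm(x))
print(out.shape)   # torch.Size([2, 16, 512])
```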


Zero-shot performance on Common Sense Reasoning tasks.

Results on the other benchmarks on which the LLaMA models were evaluated can be found in the paper. These openly released, state-of-the-art foundation models show that relatively smaller models can outperform much larger ones if they are trained efficiently and for longer.

Closing Remarks

In conclusion, applying the Chinchilla scaling laws to the training of large language models has been a breakthrough in optimizing compute utilization and achieving efficient training. By recognizing the need to train models for longer and on more tokens, the Chinchilla scaling laws offer a compute-optimal approach that improves the performance and capabilities of large language models. The LLaMA models stand as a testament to the effectiveness of this approach, having been trained on 1 to 1.4 trillion tokens while remaining efficient at inference. In the next blog post, I will cover language models such as Alpaca, Vicuna, and WizardLM. What these three models have in common is that they are all fine-tuned versions of LLaMA; I will also explain how they differ in the data they collected for efficient fine-tuning.

Thank you for reading!

Follow me on LinkedIn!

References

1. Training Compute-Optimal Large Language Models

2. LLaMA: Open and Efficient Foundation Language Models

3. Scaling Laws for Neural Language Models

