Introduction
In this blog post, I will be discussing a paper from Google DeepMind in which they run a large number of training experiments on language models to find the relationship between model size, compute budget, and number of training tokens. I will also cover Meta's LLaMA models, which were trained using the results of DeepMind's experiments.
This blog is part of my series on large language models. You can view the previous post on advanced prompting techniques here: Navigating the Prompt Space: Techniques for Effective Prompt Exploration.
Figure: For optimal performance, all three factors must be scaled up alongside each other.
Chinchilla's paper ("Training Compute-Optimal Large Language Models") studies the relationship between model size, number of tokens, and compute budget. The authors come to the conclusion that current LLMs like GPT-3 (175B), Gopher (280B), and Megatron-Turing NLG (530B) are significantly undertrained: these models kept growing in parameter count while the amount of training data remained roughly constant. For compute-optimal training, they argue, the number of training tokens and the model size must be scaled equally. To establish this, they trained about 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens.
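To make the "scale equally" rule concrete, here is a minimal sketch (my own illustration, not code from the paper) that combines the standard C ≈ 6ND approximation for training FLOPs with the roughly 20-tokens-per-parameter ratio commonly derived from the Chinchilla results:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Estimate a compute-optimal model size and token count.

    Uses the common approximation C ~= 6 * N * D training FLOPs plus the
    ~20 tokens-per-parameter ratio often quoted from the Chinchilla results.
    Illustrative only, not the paper's actual fitting procedure.
    """
    # C = 6 * N * D with D = tokens_per_param * N
    # => C = 6 * tokens_per_param * N^2
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Gopher's compute budget from the paper: 5.76e23 FLOPs
n, d = chinchilla_optimal(5.76e23)
print(f"optimal params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.2f}T")
# -> roughly 69B parameters and 1.4T tokens, close to Chinchilla's 70B model
```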
After finding the relationship between the three factors, they trained a new LLM called Chinchilla, which uses the same compute budget as the 280B Gopher but has only 70B parameters and 4× more training data. Chinchilla outperforms
Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B). This result contradicts OpenAI's earlier "Scaling Laws for Neural Language Models" (Kaplan et al.): relatively smaller models can give better performance if trained on more data. Smaller models are also easier to fine-tune and have lower latency at inference. It also means that large models should not be trained to their lowest possible loss to be compute optimal.
Current LLMs
The main question of their research is: "Given a fixed FLOPs budget, how should one trade off model size and the number of training tokens?" They tried three different approaches to answer this question, all assuming a power-law relationship between compute and model size.
In the first approach, they fix the model sizes (75M, 250M, 500M, 1B, 2.5B, 5B, 10B) and vary the number of training tokens for each. For each FLOP budget they take the run with the lowest loss, and a power-law fit to these points shows that the optimal
model size for Gopher's compute budget (5.76 × 10^23 FLOPs) is 67B parameters, trained on 1.5 trillion tokens.
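The fit itself is a straight line in log-log space. A sketch of the idea, using made-up (compute, optimal size) pairs chosen to reproduce the paper's headline numbers (the real curve comes from hundreds of training runs):

```python
import numpy as np

# Hypothetical (compute, optimal model size) pairs, as approach 1 would
# collect: for each FLOP budget, the model size that reached the lowest loss.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
n_opt = np.array([8.8e7, 2.8e8, 8.8e8, 2.8e9, 8.8e9])  # synthetic, ~C^0.5 trend

# Fit N_opt = k * C^a  <=>  log N_opt = a * log C + log k
a, log_k = np.polyfit(np.log(compute), np.log(n_opt), 1)
print(f"exponent a ~ {a:.2f}")  # ~0.5 -> model size scales as sqrt(compute)

# Extrapolate to Gopher's budget of 5.76e23 FLOPs
n_gopher = np.exp(log_k) * 5.76e23 ** a
print(f"predicted optimal size: {n_gopher / 1e9:.0f}B parameters")  # ~67B
```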
In the second approach, they vary the model size for a fixed set of 9 different training FLOP counts (ranging from 10^18 to 10^21 FLOPs). This approach answers the question: "For a given FLOP budget, what is the optimal parameter count?" For training, they suggest that for a model trained on D tokens, the cosine learning-rate cycle should decay the learning rate by 10× over approximately D tokens.
Figure: IsoFLOP curves.
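Each IsoFLOP profile is a roughly parabolic curve of final loss versus model size at a fixed FLOP budget, and its minimum marks the optimal size for that budget. A sketch with synthetic loss values (not data from the paper):

```python
import numpy as np

# One IsoFLOP profile: models of different sizes, all trained with the same
# FLOP budget. Loss values here are synthetic, for illustration only.
n_params = np.array([4e8, 7e8, 1e9, 2e9, 4e9])
loss = np.array([2.35, 2.28, 2.25, 2.27, 2.34])

# Fit a parabola to loss vs. log(model size); its vertex is the estimated
# optimal size for this compute budget.
c2, c1, c0 = np.polyfit(np.log(n_params), loss, 2)
log_n_opt = -c1 / (2 * c2)  # vertex of the parabola
print(f"optimal size ~ {np.exp(log_n_opt) / 1e9:.2f}B parameters")
```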
For the third approach, they combine the final losses from the above two approaches and model them as a parametric function of the parameter count N and the number of tokens D: L(N, D) = E + A/N^α + B/D^β. They fit this functional form by minimizing a Huber loss, and it estimates the optimal model size for the Gopher FLOP budget at 40B parameters.
Figure: Parametric fit.
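A sketch of this fit using synthetic data (the constants used to generate the data are the paper's published estimates: E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28; everything else here is illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Parametric form from the paper: L(N, D) = E + A / N^alpha + B / D^beta
def predicted_loss(params, N, D):
    E, logA, logB, alpha, beta = params
    return E + np.exp(logA) / N**alpha + np.exp(logB) / D**beta

def huber(residual, delta=1e-3):
    # Quadratic near zero, linear in the tails: robust to outlier runs.
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

def objective(params, N, D, observed):
    # The paper minimizes the Huber loss of the residual in log space.
    return huber(np.log(predicted_loss(params, N, D)) - np.log(observed)).sum()

# (N, D, loss) triples would come from the ~400 runs; synthetic stand-ins here.
rng = np.random.default_rng(0)
N = rng.uniform(7e7, 1.6e10, size=400)
D = rng.uniform(5e9, 5e11, size=400)
observed = 1.69 + 406.4 / N**0.34 + 410.7 / D**0.28  # paper's fitted constants

x0 = [2.0, 5.0, 5.0, 0.5, 0.5]
bounds = [(0, 5), (0, 10), (0, 10), (0.1, 1.0), (0.1, 1.0)]
fit = minimize(objective, x0, args=(N, D, observed),
               method="L-BFGS-B", bounds=bounds)
E, logA, logB, alpha, beta = fit.x
print(f"alpha ~ {alpha:.2f}, beta ~ {beta:.2f}")  # should recover ~0.34 / ~0.28
```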
All three approaches suggest that as the compute budget increases, model size and the amount of training data should be increased in approximately equal proportions. The first and second approaches yield very similar predictions for optimal model sizes, while the third suggests that somewhat smaller models will be optimal for larger compute budgets. Chinchilla, the model they trained using these results, was trained on MassiveText with the AdamW optimizer and a SentencePiece tokenizer. Notably, at around 80% of the cosine cycle, the model trained with AdamW overtakes the same model trained with Adam.
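A minimal PyTorch sketch of this training setup; the hyperparameters are illustrative, not the paper's, but the schedule decays the learning rate by 10× over the training horizon as described above:

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the actual transformer

# AdamW (decoupled weight decay) with a cosine learning-rate schedule.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1)
total_steps = 10_000  # in practice, set from the token budget / batch size
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=2e-5  # 10x decay over the horizon
)

for step in range(total_steps):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()  # dummy objective
    loss.backward()
    optimizer.step()
    scheduler.step()
```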
LLaMA models
Meta released a collection of models ranging from 7B to 65B parameters, trained efficiently by following Chinchilla's scaling laws and using only publicly available datasets. These smaller models are cheaper at inference: LLaMA-13B outperforms GPT-3 on most benchmarks despite being 10× smaller. The models are not the fastest to train, but they are faster at inference.
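A back-of-the-envelope comparison using the common ~2N FLOPs-per-generated-token rule of thumb (my approximation, not a figure from the LLaMA paper):

```python
def inference_flops_per_token(n_params):
    """Rough rule of thumb: ~2 FLOPs per parameter per generated token."""
    return 2 * n_params

llama_13b = inference_flops_per_token(13e9)
gpt3_175b = inference_flops_per_token(175e9)
print(f"LLaMA-13B / GPT-3 cost ratio: {llama_13b / gpt3_175b:.2f}")
# ~0.07 -> roughly 13x fewer FLOPs per generated token
```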
Pre-training data
They use byte-pair encoding for tokenization. The 7B and 13B parameter models were trained on 1T tokens, while the 33B and 65B parameter models were trained on 1.4T tokens. The architecture follows the original Transformer, with a few changes borrowed from PaLM, GPT-3, and other models: pre-normalization, the SwiGLU activation function, and rotary embeddings in place of absolute positional embeddings. Training uses the AdamW optimizer and causal multi-head attention. For an efficient implementation, they prefer storing expensive activations over recomputing them during the backward pass.
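A minimal PyTorch sketch of two of these components, pre-normalization via RMSNorm and the SwiGLU feed-forward block (dimensions are illustrative; rotary embeddings and attention are omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization used by LLaMA: scale by the RMS, no mean-centering."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFeedForward(nn.Module):
    """SwiGLU: silu(x W1) * (x W3), then project back down with W2."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(1, 16, 512)
block = SwiGLUFeedForward(512, 1376)  # LLaMA uses ~(2/3) * 4 * dim hidden size
out = block(RMSNorm(512)(x))          # pre-norm: normalize before the sublayer
print(out.shape)  # torch.Size([1, 16, 512])
```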
Other benchmarks on which the LLaMA models have been tested are presented in the paper. These open-source, state-of-the-art foundation models show that relatively small models can outperform much larger ones if trained efficiently and for longer.
Closing Remarks
Chinchilla and LLaMA make a strong case for trading raw parameter count for more training data and better inference efficiency. In the next blog post, I will be covering language models like Alpaca, Vicuna, and WizardLM. One thing these models have in common is that all three are fine-tuned versions of LLaMA. I will also explain how they differ with respect to the data they collected for efficient fine-tuning.
Follow me on LinkedIn!
References
Hoffmann et al., "Training Compute-Optimal Large Language Models" (DeepMind, 2022). https://arxiv.org/abs/2203.15556
Touvron et al., "LLaMA: Open and Efficient Foundation Language Models" (Meta AI, 2023). https://arxiv.org/abs/2302.13971
Kaplan et al., "Scaling Laws for Neural Language Models" (OpenAI, 2020). https://arxiv.org/abs/2001.08361