
Guide to fine-tuning LLMs using PEFT and LoRA techniques
mercity.ai/blog-post/fine-tuning-llms-using-peft-and-lora

Pranav Patel

Large Language Models

Large Language Models (LLMs) like GPT are only getting larger. Even open-source
models like MPT and Falcon have reached 30 and 40 billion parameters respectively. With
this growth in size, the capabilities and complexity of these models have also increased. But
the increased complexity and model size create challenges of their own. Training larger models
requires more extensive datasets, and as the model grows, more parameters must be
tuned. This can be very compute-heavy and, as a result, costly too. This is where fine-tuning
comes in. Fine-tuning is a technique that allows for the re-purposing of pre-trained models
and can help reduce the cost and complexity of building large models from scratch.

In this blog, we will discuss advanced fine-tuning techniques like PEFT (Parameter Efficient
Fine-Tuning) and see how they can save you a ton of time and money on training massive
LLMs.

What is Fine-tuning?
Fine-tuning is the process of taking a model that is already trained on some task and then
tweaking it to perform a similar task. It is often used when a new dataset or task requires the
model to have some modifications, or when the model is not performing well on a specific
task.

For example, a model trained to generate stories can be fine-tuned to generate poems. This
is possible because the model has already learned how to generate casual language and
write stories; that skill can also be used to generate poems if the model is tweaked properly.

How does Fine-tuning work?

As mentioned, fine-tuning is tweaking an already-trained model for some other task. The way
this works is by taking the weights of the original model and adjusting them to fit a new task.

When trained, models learn to perform a specific task. For example, GPT-3 has been trained on
a massive dataset and, as a result, has learned to generate stories, poems, songs, letters,
and a lot of other things. One can take this ability of GPT-3 and fine-tune it on a specific task,
like generating answers to customer queries in a specific manner.

There are different ways and techniques to fine-tune a model, the most popular being
transfer learning. Transfer learning comes out of the computer vision world; it is the process
of freezing the weights of the initial layers of a network and only updating the weights of the
later layers. This is because the lower layers, the ones closer to the input, are responsible
for learning the general features of the training dataset, while the upper layers, closer to the
output, learn more specific information that is directly tied to generating the correct output.

Here is a quick visualization of how fine-tuning works:

Alammar, J (2018). The Illustrated Transformer [Blog post].

Why use Fine-Tuning?

As a model grows in size, it becomes more costly and time-consuming to train. Larger models
also require more training data; otherwise, they tend to overfit and generate poor
results in a production environment. Fine-tuning allows us to avoid these issues by
efficiently re-using a pre-trained model for our purposes. Here are some reasons why you
should consider fine-tuning instead of training a model from scratch:

Larger models generalize to downstream tasks well

We all know how large models like GPT-3 and GPT-4 can perform really well on complicated
tasks. This is because they have very sophisticated architectures and are trained on massive
datasets, which helps them generalize to a lot of tasks really well. These models understand
the underlying properties of language, and that helps them pick up new tasks with minimal
effort, such as a bit of prompt engineering.

But if you want to use these models for a very specific task, like building a legal contract
generator, you should probably fine-tune the model instead of relying on prompt engineering alone.
A model that performs well on a very general task like language generation provides a strong
foundation for a downstream task like generating legal contracts once it is fine-tuned.

Cheaper than training a whole model

As mentioned before, these large models can be very expensive and time-consuming to train
from scratch. It is almost always cheaper to fine-tune an already-trained model. This also allows
you to leverage what is already out there instead of doing everything yourself; most of the
time, good datasets are very hard and time-consuming to build. Open-source models like
MPT and LLaMA have already been trained and validated by some of the best researchers
out there, and it is very easy to load and train them on cloud infrastructure.

Good for online training

One of the biggest challenges in AI is to keep the model up to date with the latest data.
Models when deployed in production can start degrading in performance if not updated
regularly. For example, if you deploy an AI model to predict customer behavior in a store, it
might stop performing well once the store is restocked with products with different prices or if
they introduce new products in the store. This is a classic example of how changes in data
can drastically change the performance of a model.

Fine-tuning can help you to keep updating the model with the latest data without having to
re-train the whole model. This makes it possible to deploy models in production without much
effort and cost. This is called online learning or online training and is absolutely necessary for
any model in production.

What is PEFT?
PEFT, or Parameter Efficient Fine-Tuning, is a set of techniques for fine-tuning a large
model in the most compute- and time-efficient way possible, without losing the performance
you would see from full fine-tuning. It exists because, with models growing bigger
and bigger, like BLOOM with its whopping 176 billion parameters, it is almost
impossible to fine-tune them without spending tens of thousands of dollars. Yet it is
sometimes almost necessary to use such big models for better performance. This is where
PEFT comes in: it helps you work with such big models at a fraction of the cost.

Here are some PEFT techniques:

Why PEFT?
As mentioned above, it has become a necessity to fine-tune and use bigger models when it
comes to production-grade applications. PEFT techniques allow you to fine-tune the models
efficiently and save money and time as a result. This is done by fine-tuning only the most
important and relevant parameters in the neural network. The techniques introduce new
parameters in the network or freeze the whole model except for some parts to make it easier
to train the model.

Transfer Learning
Transfer learning is when we take some of the learned parameters of a model and use them
for some other task. This sounds similar to fine-tuning but is different. In fine-tuning, we re-
adjust all the parameters of the model, or freeze some of the weights and adjust the rest of
the parameters. In transfer learning, on the other hand, we take some of the learned parameters
from a model and use them in another network. This gives us more flexibility in terms of what
we can do. For example, we cannot change the architecture of the model when fine-tuning, which
limits us in many ways. But when using transfer learning, we use only a part of the trained
model, which we can then attach to any other model with any architecture.

How Transfer Learning Works


Transfer learning has been a common practice in the computer vision world for a very long
time now. This is because of the nature of the visual models and how they learn. In CNN
models, the early layers extract more general features like edges and curves, whereas the
later layers extract more complicated features like whole eyes and faces. This is because the
receptive field of CNNs grows as they are stacked on top of each other.

Let’s say for example you are trying to train a neural network to classify if a vehicle in front of
you is a car or a motorbike. This is a very basic task. But let’s say you have very limited data
and you don’t want to train your model too much. Here is what a basic CNN network looks
like.

There are 2 major parts of the network here, the CNN head and the later fully connected
layers. As mentioned, CNN layers extract representations of the data which then are used by
the fully connected network to classify the image. Here we can use any other CNN network
trained on a similar classification problem and use that as the CNN head for this new
problem.

Here as you can see, we are using transfer learning by using the weights of a network
pretrained to classify the car type. We are only freezing the first two layers of the CNN
network, and leaving the latter two free to be updated during the training process. This
makes sure that the CNN head of the model learns new features from the images which
might be necessary for the new task we are training the model for.
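To make this concrete, here is a minimal PyTorch sketch of the freezing step; the tiny CNN and the checkpoint name are purely illustrative, not taken from any specific model:

```python
import torch.nn as nn

# A small illustrative CNN: two "early" conv blocks that learn general features,
# two "later" conv blocks, and a fully connected classifier on top.
class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.conv2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.conv3 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.conv4 = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.conv4(self.conv3(self.conv2(self.conv1(x))))
        return self.classifier(x.flatten(1))

model = SmallCNN()
# model.load_state_dict(torch.load("car_type_classifier.pt"))  # hypothetical pretrained weights

# Freeze the first two conv blocks; leave the later blocks and the classifier trainable.
for block in (model.conv1, model.conv2):
    for param in block.parameters():
        param.requires_grad = False

print([name for name, p in model.named_parameters() if p.requires_grad])
# -> only conv3, conv4 and classifier parameters remain trainable
```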

Transfer learning is also often seen in NLP tasks with LLMs where people use the encoder
part of the transformer network from a pretrained model like T5 and train the later layers.

Adapters
Adapters were one of the first parameter-efficient fine-tuning techniques released. In the
paper, they showed that you can add more layers to the pre-existing transformer architecture
and only finetune them instead of the whole model. They showed that this technique resulted
in similar performance when compared to complete fine-tuning.


On the left is the modified transformer architecture with added adapter layers. You can
see that adapter layers are added after the attention stack and the feed-forward stack. On
the right is the architecture of the adapter layer itself. The adapter layer uses a bottleneck
architecture: it takes the input, narrows it down to a smaller-dimensional representation,
passes it through a non-linear activation function, and then scales it back up to the
dimension of the input. This makes sure that the next layer in the transformer stack is able
to receive the output generated by the adapter layer.
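A minimal sketch of such a bottleneck adapter in PyTorch could look like the following; the hidden size, bottleneck size, and placement are illustrative rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, plus a residual."""
    def __init__(self, hidden_size: int, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The output keeps the original hidden size, so the next transformer
        # sub-layer can consume it unchanged.
        return x + self.up(self.act(self.down(x)))

# During fine-tuning, only the adapter parameters are trained; everything else
# in the transformer stack stays frozen.
hidden = torch.randn(2, 16, 768)      # (batch, seq_len, hidden_size)
adapter = Adapter(hidden_size=768)
out = adapter(hidden)                 # same shape as the input
```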

In the paper, the authors show that this method of fine-tuning is comparable to full fine-
tuning while consuming much less compute and training time. They were able to get within
0.4% of full fine-tuning performance on the GLUE benchmark while adding only 3.6% more parameters.


LoRA - Low-Rank Adaptation


LoRA is a similar strategy to Adapter layers but it aims to further reduce the number of
trainable parameters. It takes a more mathematically rigorous approach. LoRA works by
modifying how the updatable parameters are trained and updated in the neural network.

Let's explain this mathematically (you can skip to the next paragraph if you are not interested). We
know that the weight matrices of a pretrained neural network are full rank, meaning no row
or column can be written as a combination of the others. But in the LoRA paper, the authors
showed that when pretrained language models are adapted to a new task, the weight updates
have a low "intrinsic dimension". In other words, the updates can be represented by a much
smaller matrix, i.e. one with a lower rank. This means that during backpropagation, the weight
update matrix has a lower rank, as most of the necessary information has already been
captured during pre-training and only task-specific adjustments need to be made during
fine-tuning.

A much simpler explanation is that during fine-tuning only a relatively small amount of change
to the weights is needed, as most of the learning is done during the pretraining phase of the
neural network. LoRA uses this observation to reduce the number of trainable parameters.

The image above gives a visual representation of what LoRA is doing. ΔW is the weight
update matrix of size A × B, i.e. the set of changes that need to be applied to the neural network
in order for it to learn a new task. This matrix can be decomposed into two smaller matrices;
we train only those two, then multiply them to get back our weight update matrix. As you
can see in the image, the two matrices share an inner dimension r, which can be understood
as the rank the weight update matrix would have if it were trained directly. The bigger
the rank, the more parameters are updated during training.
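To make the idea concrete, here is a rough sketch of a LoRA-style linear layer in PyTorch. The dimensions and initialization are illustrative (following the common convention of a zero-initialized B so training starts from the pretrained behavior), not a reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        # Frozen pretrained weight W (random here just for illustration).
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Low-rank update: delta_W = B @ A, with rank r << min(in_features, out_features).
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # trained
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))         # trained, starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight.T                                  # frozen path
        update = (x @ self.lora_A.T) @ self.lora_B.T * self.scaling  # low-rank path
        return base + update

layer = LoRALinear(in_features=500, out_features=200, r=3)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 3*500 + 200*3 = 2100 trainable params vs 100,000 frozen ones
```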

Efficiency of LoRA

The authors of the paper show that LoRA can outperform full fine-tuning while training only
about 2% of the total parameters.


As for the number of parameters LoRA trains, we can largely control that with the rank
parameter r. For example, suppose the weight update matrix has 100,000 parameters, with
dimensions A = 200 and B = 500. It can be decomposed into smaller matrices of lower
dimensions: one of size 200 x 3 and one of size 3 x 500. That gives us 200 x 3 +
3 x 500 = 2,100 trainable parameters, which is only 2.1% of the total number of
parameters. This can be reduced further, as we can decide to apply LoRA only to specific
layers.

Because the number of trained parameters is MUCH smaller than in the actual model,
the resulting files can be as small as 8 MB. This makes loading, applying, and transferring
the learned weights much easier and faster.
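In practice you rarely implement this by hand. A minimal sketch of applying LoRA with the Hugging Face peft library might look like this; the base model and the target module names are examples and depend on the architecture you are tuning:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Any causal LM from the Hub works the same way; BLOOM is just an example.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

lora_config = LoraConfig(
    r=8,                                  # rank of the update matrices
    lora_alpha=16,                        # scaling factor
    target_modules=["query_key_value"],   # which modules get LoRA (architecture-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints something like "trainable params: ... || all params: ... || trainable%: ..."
```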

You can read the LoRA paper if you want to learn more and do a deeper dive into the topic.

LoRA in Stable Diffusion

One of the most interesting use cases of LoRA is in image generation applications.
Images have an inherent style that can be visually recognized. Instead of training
massive models to get specific styles of images out of a model, users can now train only
LoRA weights and use them with techniques like DreamBooth to achieve really good quality
images with a lot of customizability.


LoRA weights can also be combined with other LoRA weights and used in a weighted
combination to generate images that carry multiple styles. You can find a ton of LoRA
adapters online, for example on CivitAI, and load them into your models.
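As a rough sketch of what this looks like with the diffusers library, assuming a recent version with the PEFT backend; the repository names, adapter names, and blend weights below are placeholders:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load two style LoRAs (placeholder repo names) and blend them with weights.
pipe.load_lora_weights("some-user/watercolor-lora", adapter_name="watercolor")
pipe.load_lora_weights("some-user/lineart-lora", adapter_name="lineart")
pipe.set_adapters(["watercolor", "lineart"], adapter_weights=[0.7, 0.3])

image = pipe("a castle on a hill, detailed illustration").images[0]
image.save("castle.png")
```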

IA3 - Infused Adapter by Inhibiting and Amplifying Inner Activations


IA3 is an adapter-based technique that is somewhat similar to LoRA. The goal of the authors
was to replicate the advantages of ICL (in-context learning, or few-shot prompting) without
the issues that come with it. ICL can get messy in terms of cost and inference time, as it requires
prompting the model with examples; longer prompts require more time and
computation to process. But ICL is perhaps the easiest way to get started working with
models.

IA3 works by introducing rescaling vectors that target the activations of the model. A total of
three vectors are introduced: l_v, l_k, and l_ff. These vectors rescale the values and keys in
the attention layers and the inner activations of the feed-forward (dense) layers. The vectors
are multiplied element-wise with the corresponding activations in the model. Once injected,
these parameters are learned during the training process, while the rest of the model remains
frozen. The learned vectors essentially rescale, or optimize, the targeted pretrained activations
for the task at hand.
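Here is a toy sketch of that element-wise rescaling; the shapes are illustrative and this is not the paper's implementation, just the core idea:

```python
import torch
import torch.nn as nn

hidden_size, ff_size = 768, 3072

# The three learned rescaling vectors; everything else in the model stays frozen.
l_k  = nn.Parameter(torch.ones(hidden_size))   # rescales attention keys
l_v  = nn.Parameter(torch.ones(hidden_size))   # rescales attention values
l_ff = nn.Parameter(torch.ones(ff_size))       # rescales the feed-forward inner activation

keys   = torch.randn(2, 16, hidden_size)       # (batch, seq, hidden) from the frozen model
values = torch.randn(2, 16, hidden_size)
ff_act = torch.randn(2, 16, ff_size)

# Element-wise rescaling of the frozen model's activations.
keys   = keys * l_k
values = values * l_v
ff_act = ff_act * l_ff

# Because the rescaling is a plain element-wise multiply, the learned vectors can
# later be folded ("merged") into the frozen weight matrices at no inference cost.
```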


So far this seems like a basic adapter-type PEFT method. But that's not all. The authors also
use three loss terms to enhance the learning process: L_LM, L_UL, and L_LN. L_LM is
the standard cross-entropy loss, which increases the likelihood of generating the correct
response. L_UL is an unlikelihood loss; this term reduces the probability of incorrect outputs
used in rank classification. Finally, L_LN is a length-normalized loss, which applies a softmax
cross-entropy loss to the length-normalized log probabilities of all output choices. Multiple
losses are used to ensure faster and better learning. Because we are trying to learn from
few-shot examples, these losses are necessary.

Now let’s talk about two very important concepts in IA3. Rank Classification and Length
Normalization.

In rank classification, a model is asked to rank a set of responses by their correctness. This
is done by calculating probability scores for the potential responses. L_UL is then used
to reduce the probability of the wrong responses and, as a result, increase the probability of
the correct response. But rank classification runs into a critical problem: responses with fewer
tokens tend to rank higher, simply because of how probability works. Fewer generated tokens
means a higher total score, since the probability of every generated token is < 1. To fix this,
the authors propose dividing the score of a response by the number of tokens in it, which
normalizes the scores. One important thing to note is that this normalization is done over log
probabilities, not raw probabilities: raw probabilities lie between zero and one, while their
logarithms are negative.
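As a rough sketch with made-up numbers, length normalization simply divides a candidate's summed token log-probabilities by its token count before ranking:

```python
import math

# Hypothetical per-token probabilities for two candidate responses.
short_answer = [0.9, 0.8]               # 2 tokens, raw log-prob sum ≈ -0.33
long_answer  = [0.9, 0.8, 0.9, 0.85]    # 4 tokens, raw log-prob sum ≈ -0.60

def length_normalized_score(token_probs):
    log_prob = sum(math.log(p) for p in token_probs)   # log probabilities are always <= 0
    return log_prob / len(token_probs)                 # divide by the number of tokens

# Without normalization the shorter answer wins simply because it has fewer tokens;
# after dividing by length, the longer (equally good) answer scores higher.
print(length_normalized_score(short_answer))   # ≈ -0.164
print(length_normalized_score(long_answer))    # ≈ -0.149 (higher, i.e. better)
```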


Efficiency of IA3
IA3, just like LoRA, reduces the number of trainable parameters. But instead of using low-rank
matrices, IA3 uses rescaling vectors. This brings the trainable parameters down to about 0.01%
of the model, compared to LoRA's > 0.1%, for the T0 model trained in the paper. Keeping the
LLM frozen also gives us the option of having multiple adapters for multiple use cases. And
because the authors used element-wise multiplication, the learned vectors can easily be merged
into the LLM weights, adding no extra inference cost.

The above figure shows that IA3 performs better than LoRA while barely affecting the FLOPs.
This makes IA3 a highly efficient and desirable technique. And because IA3 is an adapter-style
technique, just like with LoRA we can target specific parts of the model and decide where
to introduce the rescaling vectors, which helps us reduce the training time even further.

P-Tuning

The P-Tuning method aims to optimize the representation of the prompt that is passed to
the model. In the P-Tuning paper, the authors emphasize how prompt engineering is a very
strong technique when working with large language models. The P-Tuning method builds
on top of prompt engineering and tries to further improve the effectiveness of a good prompt.

P-Tuning works by creating a small encoder network that produces a soft prompt from your
passed prompt. To tune your LLM using P-Tuning, you create a prompt template that
represents your prompt, along with a context x that is used in the template to obtain a label y;
this is the approach described in the paper. The tokens used for the prompt template are
trainable, learnable parameters, called pseudo tokens. We also add a prompt encoder, which
helps us adapt the pseudo tokens to the specific task at hand. The prompt encoder is usually
a bi-LSTM network that learns the optimal representation of the prompt for the model and
then passes that representation to it. The LSTM network is attached to the original model, but
only the encoder network and the pseudo tokens are trained; the weights of the original
network remain unaffected. Once training is done, the LSTM head can be discarded, as we
are left with the learned representations h_i, which can be used directly.

In short, the prompt encoder only changes the embeddings of the passed prompt so that they
better represent the task; everything else remains unchanged.
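With the Hugging Face peft library, P-Tuning is exposed through PromptEncoderConfig. A minimal sketch, where the base model and hyperparameters are illustrative:

```python
from transformers import AutoModelForCausalLM
from peft import PromptEncoderConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # example base model

config = PromptEncoderConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,                   # number of trainable pseudo tokens
    encoder_reparameterization_type="LSTM",  # LSTM prompt encoder as in the paper (MLP is the library default)
    encoder_hidden_size=128,                 # hidden size of the prompt encoder
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the prompt encoder and pseudo tokens are trainable
```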

Efficiency of P-Tuning

In terms of efficiency, P-tuning is just as good as any other method. In the paper, the authors
show that P-Tuning was able to perform better than full fine-tuning on most of the
benchmarks. It can be said that P-Tuning is comparable to the full fine-tuning of large
language models.


But there is a core issue when it comes to P-Tuning. P-Tuning is a prompt optimization
technique: it optimizes the prompt that is passed to the larger model. This means that we
are still largely dependent on the underlying model's capabilities. If a model has not been
trained on sentiment classification, optimizing sentiment classification prompts with P-Tuning
will not do the model a lot of good. P-Tuning is an assistive technique. It is always
very important to pick a model that can already do the required task reasonably well out of
the box with some prompt engineering, and then further optimize it.

Prefix Tuning
Prefix tuning can be considered the next iteration of P-Tuning. The authors of P-Tuning later
published P-Tuning v2, which addresses the issues of the original method; in it, they adopt the
prefix tuning approach introduced in a separate paper. Prefix tuning and P-Tuning do not
differ by much but can still lead to different results. Let's dive into a deeper
explanation.


In P-Tuning, we added learnable parameters only to the input embeddings but in Prefix
Tuning we add them to all the layers of the network. This ensures that the model itself
learns more about the task it is being finetuned on. We append learnable parameters to the
prompt and to every layer activation in the transformer layers. The difference from P-Tuning
is that instead of completely modifying the prompt embeddings, we only add very few
learnable parameters at the start of the prompt at every layer. Here’s a visual explanation:

At every layer in the transformer, we concatenate a soft prompt with the input, and this soft
prompt has learnable parameters. These learnable parameters are tuned through a very small
MLP of only two fully connected layers. This is done because the authors note in the paper
that directly updating the prompt tokens is very sensitive to the learning rate and initialization.
The soft prompts increase the number of trainable parameters but substantially increase the
learning ability of the model too. The MLP can be dropped later, as we only care about the
soft prompts, which are appended to the input sequences during inference and guide the model.
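A minimal sketch of prefix tuning with the Hugging Face peft library; the base model and the prefix length are illustrative:

```python
from transformers import AutoModelForSeq2SeqLM
from peft import PrefixTuningConfig, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # example base model

config = PrefixTuningConfig(
    task_type="SEQ_2_SEQ_LM",
    num_virtual_tokens=20,   # length of the trainable prefix prepended at every layer
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the prefix parameters are trainable
```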

Efficiency of Prefix Tuning


Prefix tuning shows massive gains over P-Tuning. And as the model size increases, these
gains increase too. This is perhaps because there are more trainable parameters for larger
models. In the chart, you can see the authors compare the performance of P-Tuning, full
finetuning, and Prefix tuning. Prefix tuning performs better than or as well as P-tuning in
almost all tasks. In many cases, it performs even better than Full fine-tuning!


One big reason why prefix tuning works really well is that the number of trainable parameters
is not limited only to the input sequence. Learnable parameters are added at every layer,
making the model much more flexible. Prefix tuning, unlike P-tuning, not only affects the
prompt tokens but also the model itself. This allows the model to learn more. But this
approach is still largely based on the prompt. It is still suggested to take a model that can
perform the task and only then optimize it, as that will lead to much better results. As for
parameter count, the share of trained parameters increases substantially, from roughly 0.01%
to somewhere between 0.1% and 3% of the model's parameters. But the trained parameters
still remain small enough to be transferred and loaded easily and quickly.

Prompt Tuning
Prompt tuning was introduced in one of the first papers to build on the idea of fine-tuning only
soft prompts, and it is closely related to the ideas behind P-Tuning and Prefix Tuning. Prompt
tuning is a very simple and easy-to-implement idea. It involves prepending a specific prompt
to the input and using virtual tokens, i.e. new trainable tokens, for that specific prompt. These
new virtual tokens can be fine-tuned during training to learn a better representation of the
prompt. This means the model is tuned to understand the prompt better. Here is a comparison
of prompt tuning with full fine-tuning from the paper:


Here you can see that full model tuning requires multiple copies of the model to exist if we
want to use the model for multiple tasks. But with Prompt Tuning, you only need to store the
learned virtual tokens of the prompt tokens. So for example, if you use a prompt like “Classify
this tweet: {tweet}” the goal will be to learn new better embeddings for the prompt. And
during inference, only these new embeddings will be used to generate the outputs. This
allows the model to tune the prompt to help itself generate better outputs during inference.
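A minimal sketch of prompt tuning with the Hugging Face peft library, reusing the tweet-classification prompt as the initialization text; the base model and token count are illustrative:

```python
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, get_peft_model

model_name = "bigscience/bloomz-560m"   # example base model
model = AutoModelForCausalLM.from_pretrained(model_name)

config = PromptTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=8,
    prompt_tuning_init=PromptTuningInit.TEXT,        # initialize virtual tokens from text
    prompt_tuning_init_text="Classify this tweet:",
    tokenizer_name_or_path=model_name,
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the virtual token embeddings are trainable
```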

Efficiency of Prompt Tuning

The biggest advantage of using prompt tuning is the small size of learned parameters. The
files can be in KBs. As we can determine the dimension size and number of parameters to
use for the new tokens, we can greatly control the number of parameters we are going to
learn. In the paper, the authors show that even with a very small number of trainable tokens,
the method performs really well. And the performance only improves as bigger models are used.
You can read the paper here.


Another big advantage is that we can use the same model without any changes for multiple
tasks, as the only things being updated are the embeddings of the prompt tokens. That means
you can use the same model for a tweet classification task and for a language generation
task without any changes to the model itself, provided the model is big and sophisticated
enough to perform those tasks. But a big limitation is that the model itself doesn't learn
anything new. This is purely a prompt optimization technique, which means that if the model
has never been trained on a sentiment classification dataset, prompt tuning might not be of
much help. It is very important to note that this method optimizes the prompt, not the model.
So, if you cannot handcraft a hard prompt that does the task reasonably well, there is little
use in trying to optimize a soft prompt with prompt optimization techniques.

LoRA vs Prompt Tuning


Now that we have explored the various PEFT techniques, the question becomes whether to use
an additive technique like Adapters or LoRA, or a prompt-based technique like P-Tuning and
Prefix Tuning.

On comparing LoRA vs P-Tuning and Prefix Tuning, one can say for sure LoRA is the best
strategy in terms of getting the most out of the model. But it might not be the most efficient
based on your needs. If you want to train the model on a much different task than what it
has been trained on, LoRA is without a doubt the best strategy for tuning the model
efficiently. But if your task is more or less already understood by the model, but the challenge
is to properly prompt the model, then you should use Prompt Tuning techniques. Prompt
Tuning doesn’t modify many parameters in the model and mainly focuses on the passed
prompt instead.

One important point to note is that LoRA decomposes the weight update matrix into small
low-rank matrices and uses them to update the weights of the model. Even though the trainable
parameters are few, LoRA updates all the parameters in the targeted parts of the neural
network. In prompt tuning techniques, on the other hand, a few trainable parameters are added
to the model; this usually helps the model adjust to and understand the task better, but it does
not help the model learn genuinely new properties.

LoRA and PEFT in comparison to full fine-tuning


PEFT, or Parameter Efficient Fine-Tuning, is proposed as an alternative to full fine-tuning. For
most tasks, it has already been shown in papers that PEFT techniques like LoRA are
comparable to full fine-tuning, if not better. But if the new task you want the model to adapt to
is completely different from the tasks the model has been trained on, PEFT might not be
enough for you. The limited number of trainable parameters can result in major issues in
such scenarios.

If you are trying to build a code generation model using a text-based model like LLaMA or
Alpaca, you should probably consider fine-tuning the whole model instead of tuning the
model using LoRA. This is because the task is too different from what the model already
knows and has been trained on. Another good example of such a task is training a model,
which only understands English, to generate text in the Nepali language.

Why you should Fine-tune models for your business use case
Fine-tuning a model is an important step for any business that wants to get the most out of its
machine-learning applications. It allows you to customize the model to your specific use
case, which can lead to improved accuracy and performance. It saves time, money, and
resources by eliminating the need to build a new model from the ground up. Fine-tuning lets
you optimize the use of your proprietary data, adjusting the model to better fit your available
data, and even incorporating new data if needed. This ensures a more accurate model that
better serves your business needs. Here are some more benefits:

Customization: Fine-tuning allows you to tailor the model to your specific needs,
enhancing accuracy and performance.
Resource Efficiency: It saves time, money, and resources by eliminating the need to
build a new model from scratch.
Performance Boost: Fine-tuning enhances the performance of the pretrained model
using your unique datasets.
Data Optimization: It lets you make the most of your data, adjusting the model to better
fit your available data, and even incorporating new data if needed.

But as models grow to billions of parameters, fine-tuning itself can become a challenge. The
PEFT techniques discussed in this blog help reduce the time and resources needed to fine-tune
a model. They speed up the training process by making use of the pretrained weights and
parameters and allow you to fine-tune the model more efficiently. And with PEFT you can
easily transfer models over the internet and even use the same base model for multiple
purposes. PEFT opens up a whole new world of possibilities for businesses that want to make
the most of their machine-learning applications.

Want to Train Custom LLMs with PEFT?


If you want to build or train custom LLMs or Chatbots, we can help you fine-tune them to
your specific needs. We have done a ton of work on building custom chatbots and training
large language models. Contact us today and let us build a custom LLM that revolutionizes
your business.
