
Performance Analysis of LoRA Finetuning Llama-2

Avik Pathak
Department of Electronics and Communication Engineering,
Institute of Engineering & Management, University of Engineering & Management,
Salt Lake, Kolkata, India
avikpathak2000@gmail.com

Om Shree
Department of Electronics and Communication Engineering,
Institute of Engineering & Management, University of Engineering & Management,
Salt Lake, Kolkata, India
omshree.osj06@gmail.com

Mallika Agarwal
Department of Electronics and Communication Engineering,
Institute of Engineering & Management, University of Engineering & Management,
Salt Lake, Kolkata, India
mallikaa0214@gmail.com

Shek Diya Sarkar
Department of Information Technology,
Institute of Engineering & Management, University of Engineering & Management,
Salt Lake, Kolkata, India
sarkardiya370@gmail.com

Anupam Tiwary
Department of Information Technology,
Institute of Engineering & Management, University of Engineering & Management,
Salt Lake, Kolkata, India
anurise999@gmail.com

Abstract— Artificial Intelligence (AI) has emerged as a transformative force, particularly in the realm of Large Language Models (LLMs), which have long been in existence but recently gained substantial impact in our daily lives. Our research endeavors focused on the exploration and open-sourcing of Llama-2, a significant LLM, through fine-tuning with the Low-Rank Adaptation (LoRA) technique. The satisfactory results obtained from the LoRA fine-tuning of Llama-2 have laid a foundation for further research in this domain. Positioned as the premier open-sourced model, rivaling even GPT-4, Llama-2 holds great promise for diverse research applications. Its status as a transformer model, coupled with refined hyperparameter tuning, positions Llama-2 as a pivotal tool for research and practical applications in the foreseeable future. Leveraging a text summarization dataset, our study demonstrated the enhanced performance of the fine-tuned Llama-2, suggesting its potential for broader applications and furthering its significance in cutting-edge research. The dataset was imported from a Hugging Face space, on which we fine-tuned the model. Thus, the fine-tuned Llama-2 is truly a force to be reckoned with.

Keywords— LLMs, Llama-2, LoRA, GPT-4, Hugging Face

I. INTRODUCTION
Large language models: Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications like finance, healthcare, customer service, education, etc. [1,2]. They represent a transformative breakthrough in the field of artificial intelligence, particularly within natural language processing. These models, such as OpenAI's GPT-3, are characterized by their vast size, often comprising billions of parameters. Their ability to understand and generate human-like text has far-reaching implications for various applications, including natural language understanding, text generation, and even creative tasks. As the capabilities of LLMs continue to evolve, it becomes imperative to explore their underlying mechanisms, assess their impact on diverse domains, and consider the ethical implications associated with their deployment.

AI Transformer Architecture: The Transformer is one of the most widely used neural network architectures in natural language processing [3,6]. Transformers are the industrialized, homogenized post-deep-learning models designed for parallel computing on supercomputers [10]. Transformers can perform self-supervised learning on billions of raw, unlabeled data samples with billions of parameters.

Fig.1. Architecture of the Transformer

The AI Transformer architecture is shown in Fig.1. On the left, the input sequences enter the encoder side of the transformer, passing through an attention sublayer and a feedforward sublayer; on the right, the target outputs navigate the decoder side of the transformer, engaging with two attention sublayers and a feedforward network sublayer.

Encoder: The encoder in the transformer architecture shown in the above figure is responsible for processing the input sequences. This architecture consists of a stack of identical layers, each equipped with two primary components: self-attention mechanisms and feedforward neural networks. The self-attention mechanism allows the model to consider all positions in the input sequence simultaneously, capturing intricate dependencies and relationships. This parallel processing capability significantly enhances the encoder's ability to understand the contextual information of each element in the input sequence.
Decoder: On the other hand, the decoder is tasked with generating the output sequence based on the representations acquired from the encoder. Similar to the encoder, the decoder consists of multiple layers, each featuring self-attention mechanisms, multi-head attention, and feedforward neural networks. The self-attention mechanism in the decoder enables it to attend to different positions in the output sequence, facilitating the modelling of dependencies within the generated sequence.
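As a concrete illustration of the attention and feedforward sublayers described above, the following minimal PyTorch sketch runs one single-head scaled-dot-product self-attention step followed by a position-wise feedforward sublayer. The tensor sizes are arbitrary toy values, and multi-head projections, masking, residual connections, and layer normalization are omitted for brevity.

import torch
import torch.nn.functional as F

d_model = 64                          # embedding width (arbitrary toy size)
x = torch.randn(1, 10, d_model)       # a batch of one sequence with 10 tokens

# Single-head self-attention: every position attends to every other position.
W_q, W_k, W_v = (torch.nn.Linear(d_model, d_model) for _ in range(3))
q, k, v = W_q(x), W_k(x), W_v(x)
scores = q @ k.transpose(-2, -1) / d_model ** 0.5   # (1, 10, 10) attention scores
attn = F.softmax(scores, dim=-1) @ v                # context-mixed token representations

# Position-wise feedforward sublayer applied to the attention output.
ffn = torch.nn.Sequential(
    torch.nn.Linear(d_model, 4 * d_model),
    torch.nn.ReLU(),
    torch.nn.Linear(4 * d_model, d_model),
)
out = ffn(attn)                        # (1, 10, 64) layer output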
Llama-2: Llama 2 is a family of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Compared with Llama 1, Llama 2 increased the size of the pretraining corpus by 40%, doubled the context length of the model, and adopted grouped-query attention [4].

Hyperparameters: The model is trained using the AdamW optimizer, with β1 = 0.9, β2 = 0.95, and eps = 10^-5 [5]. We use a cosine learning rate schedule with a warmup of 2000 steps and decay the final learning rate down to 10% of the peak learning rate. We use a weight decay of 0.1 and gradient clipping of 1.0.
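For concreteness, these optimizer settings can be sketched in PyTorch as follows; the stand-in model, peak learning rate, and total step count are illustrative assumptions rather than values from the paper.

import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Stand-in model and assumed schedule sizes; the paper fine-tunes Llama-2.
model = torch.nn.Linear(4096, 4096)
peak_lr, warmup_steps, total_steps = 3e-4, 2000, 100_000

optimizer = AdamW(model.parameters(), lr=peak_lr,
                  betas=(0.9, 0.95), eps=1e-5, weight_decay=0.1)

def lr_lambda(step: int) -> float:
    """Linear warmup for 2,000 steps, then cosine decay to 10% of the peak rate."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.45 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

x = torch.randn(8, 4096)
for _ in range(3):                          # tiny illustrative loop
    loss = model(x).pow(2).mean()           # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # gradient clipping of 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()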
Fig.2. Accuracy of Llama 2 taking GPT-4 as benchmark

The reward models perform the best on the internal test sets collected based on Llama 2-Chat, as shown in the graph below, with the Helpfulness reward model performing best on the Meta Helpfulness test set and, similarly, the Safety reward model performing best on the Meta Safety test set.

As shown in Fig.2., the largest Llama 2-Chat model is competitive with ChatGPT. The Llama 2-Chat 70B model has a win rate of 36% and a tie rate of 31.5% relative to ChatGPT. The Llama 2-Chat 70B model outperforms the PaLM-bison chat model by a large percentage on our prompt set.

AI Finetuning: AI fine-tuning refers to the process of taking a pre-trained model and adjusting it to a specific task or dataset. It allows us to adapt the pretrained model to a more specific task, such as sentiment analysis or object detection.

Fig.3. Variation of win rate% across various AI models.

As shown in Fig.3., human evaluation compares Llama 2-Chat models to open- and closed-source models across ~4,000 helpfulness prompts, with three raters per prompt.
Parameter Efficient Finetuning: Large pretrained models have achieved their best performances across a wide variety of downstream natural language processing tasks through fine-tuning on task-specific labeled data [8]. But fine-tuning all the parameters and storing them separately for different tasks is expensive in terms of computation and storage overhead and might decrease the model's ability [9]. Parameter-efficient Fine-tuning (PEFT) is a technique used to improve the performance of pre-trained language models on specific downstream tasks by reusing the pre-trained model's parameters and fine-tuning them on a smaller dataset, which saves computational time and overhead compared to training the entire model from scratch.

LoRA: Low-Rank Adaptation, or LoRA, is one of the PEFT techniques; it freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks [11].
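As a minimal sketch of this idea, the snippet below attaches LoRA adapters to a causal language model with the Hugging Face peft library. The rank, scaling factor, and target modules are illustrative choices rather than the exact configuration used in this work, and the gated Llama-2 checkpoint requires accepting Meta's license on the Hugging Face Hub.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=8,                                   # rank of the update matrices (assumed)
    lora_alpha=16,                         # scaling factor for the updates (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (assumed)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)     # base weights stay frozen
model.print_trainable_parameters()         # only the LoRA matrices are trainable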
Flowchart LLAMA-2:

Fig.4. Different fine-tuning methods and their memory requirements.

As demonstrated in Fig.4., QLoRA improves over LoRA by quantizing the transformer model to 4-bit precision and using paged optimizers to handle memory spikes [7].
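A hedged sketch of such a QLoRA-style setup is shown below: the base model is loaded in 4-bit NF4 precision via bitsandbytes, and a paged optimizer is selected to absorb memory spikes. The checkpoint name and batch settings are assumptions, not the configuration reported in this paper.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16, # compute in bf16 on top of 4-bit weights
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # placeholder checkpoint
    quantization_config=bnb_cfg,
    device_map="auto",
)

args = TrainingArguments(
    output_dir="qlora-out",
    optim="paged_adamw_32bit",             # paged optimizer to handle memory spikes
    per_device_train_batch_size=1,         # assumed value
    gradient_accumulation_steps=16,        # assumed value
)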
Advantages of using LoRA: LoRA is a technique that accelerates the fine-tuning of large models while consuming less memory. To make fine-tuning more efficient, LoRA's approach is to represent the weight updates with two smaller matrices (called update matrices) through low-rank decomposition (a minimal sketch of this update follows the list below). This approach has a number of advantages:

● LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters.
● The original pre-trained weights are kept frozen, which means you can have multiple lightweight and portable LoRA models for various downstream tasks built on top of them.
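To make the two update matrices concrete, the sketch below forms a LoRA-style update ΔW = B·A for a single weight matrix and compares parameter counts; the layer size and rank are arbitrary toy values.

import torch

d, k, r = 4096, 4096, 8                      # toy layer size and a small rank r

W = torch.randn(d, k)                        # frozen pretrained weight (not trained)
A = torch.randn(r, k, requires_grad=True)    # update matrix A, initialized randomly
B = torch.zeros(d, r, requires_grad=True)    # update matrix B, zero so the update starts at 0

delta_W = B @ A                              # low-rank weight update, shape (d, k)
W_adapted = W + delta_W                      # effective weight used at inference

full = W.numel()                             # 4096 * 4096 = 16,777,216 parameters
lora = A.numel() + B.numel()                 # 8 * (4096 + 4096) = 65,536 parameters
print(f"trainable params: {lora} vs full fine-tuning: {full}")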
II. OBSERVATION AND RESULTS ANALYSIS
We tried to train the model over a dialogue and context summarization dataset from a Hugging Face space. It is open-sourced and readily available for anyone to work on. A very basic input from this dataset looks like a conversation between two people. Let's showcase an example below.

### Input:
#Person1#: Did you enjoy your weekend at the Highland Hotel? I heard it's an excellent place to stay and has good facilities.
#Person2#: I had a wonderful time. The rooms are not very big, but they are well furnished. The restaurant is excellent and reasonably priced. There's a sauna and a Jacuzzi.
#Person1#: Do they have a swimming pool?
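The paper does not name the exact Hugging Face dataset, so the following minimal sketch assumes a hypothetical dataset identifier ("your-org/dialogue-summaries") and an assumed prompt template; it only illustrates how dialogue/summary pairs like the one above would typically be loaded and formatted for fine-tuning.

from datasets import load_dataset

# "your-org/dialogue-summaries" is a placeholder, not the dataset used in the paper.
dataset = load_dataset("your-org/dialogue-summaries", split="train")

def to_prompt(example):
    """Wrap one dialogue/summary pair into an instruction-style training text."""
    return {
        "text": (
            "### Input:\n" + example["dialogue"] + "\n\n"   # assumed column names
            "### Summary:\n" + example["summary"]
        )
    }

train_data = dataset.map(to_prompt)
print(train_data[0]["text"][:200])    # inspect one formatted training example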
For this input, the untrained model output was:
#Person1#: Did you enjoy your weekend at the Highland Hotel? I heard it's an excellent place to stay and has good facilities.
#Person2#: I had a wonderful time. The rooms are not very big, but they are well furnished. The restaurant is excellent and reasonably priced. There's a sauna and a Jacuzzi.
#Person1#: Do they have a swimming pool?

After LoRA fine-tuning of the model, the output changed to:
#Person1# and #Person2# talk about the Highland Hotel.
#Person2# recommends the hotel.
#Person2# says the hotel has a sauna and a jacuzzi.
#Person2# says the hotel is well-furnished.
#Person2# says the hotel offers a discount at the weekends.
#Person2# says the hotel has a business card.
#Person2# says the hotel has a bar on the ground floor.
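The before/after comparison shown above could be reproduced along the lines of the sketch below, which generates once with the frozen base model and once with the LoRA adapter attached. The checkpoint name and adapter path are placeholders, not artifacts released with this paper.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

prompt = "### Input:\n#Person1#: Did you enjoy your weekend at the Highland Hotel?\n\n### Summary:\n"
inputs = tokenizer(prompt, return_tensors="pt")

before = base.generate(**inputs, max_new_tokens=100)             # untuned output
print(tokenizer.decode(before[0], skip_special_tokens=True))

tuned = PeftModel.from_pretrained(base, "path/to/lora-adapter")   # attach saved LoRA weights
after = tuned.generate(**inputs, max_new_tokens=100)              # fine-tuned output
print(tokenizer.decode(after[0], skip_special_tokens=True))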
Fig.4. The model evaluation of loss with the runtime.

Finally, we have our loss function curve, which shows the model reaching minimal loss at 60 epochs, as shown in Fig.4. This means the model parameters were most likely set optimally for the given performance of the model.

Thus we find that, with 60 epochs and the given training examples, we could minimize the loss function of the fine-tuned Llama-2, which shows better accuracy and lower loss than the original model, as illustrated by the responses shown earlier.

Fig.5. The evaluation with samples and steps per second

While fine-tuning, we tried to evaluate the model performance over 400 steps and 20 samples per second, as demonstrated in Fig.5.

Fig.6. The training vs learning rate and epochs curve

The AI was fine-tuned over 60 epochs with a regularized learning rate, as projected in Fig.6.

Fig.7. The model loss with increasing epochs.

The model loss is normalized with increasing epochs. It can be clearly inferred from the projection in Fig.7. that, at 60 samples per second, the model loss is normalized and the validation accuracy increases.
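For completeness, a hedged sketch of how this training setup could be wired with the Hugging Face Trainer is given below; only the 60 epochs and the 400-step evaluation interval come from the text above, while the checkpoint, dataset identifier, batch size, and learning rate are assumed placeholders. In practice, the LoRA-wrapped, quantized model from the earlier sketches would be passed in instead of the plain base model.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"               # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token             # Llama defines no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder dataset with a "text" column prepared as in the earlier sketch.
data = load_dataset("your-org/dialogue-summaries")
tokenized = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                     remove_columns=data["train"].column_names)

args = TrainingArguments(
    output_dir="llama2-lora-summarization",
    num_train_epochs=60,              # 60 epochs, as reported above
    evaluation_strategy="steps",
    eval_steps=400,                   # evaluate every 400 steps
    per_device_train_batch_size=4,    # assumed value, not stated in the paper
    learning_rate=2e-4,               # assumed value, not stated in the paper
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized.get("validation", tokenized["train"]),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()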
III. CONCLUSION
To conclude, we do find the fine-tuned Llama-2 showing a very conclusive loss function and a better context-learning ability, learning and performing better when the fine-tuning is properly executed. LoRA fine-tuning also greatly reduces the required GPU compute, which makes it accessible to low-resource research, where anybody with an idea and curiosity in AI can play with the model and fine-tune it to perform specific tasks better than the base model.
The prospect of Llama-2 is still yet to be explored to its fullest, and we invite any fellow researchers who may be interested to pursue and venture into training the model with real or synthetic data.

ACKNOWLEDGMENT
The authors are thankful to Prof. Mili Sarkar of the Institute of Engineering and Management, Kolkata, for her valuable suggestions.

REFERENCES
[1] Chang, Yupeng, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang et al. "A survey on evaluation of large language models." arXiv preprint arXiv:2307.03109 (2023).
[2] Xiong, Ruibin, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. "On layer normalization in the transformer architecture." In International Conference on Machine Learning, pp. 10524-10533. PMLR, 2020.
[3] Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv preprint arXiv:2307.09288 (2023).
[4] Ainslie, Joshua, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. "GQA: Training generalized multi-query transformer models from multi-head checkpoints." arXiv preprint arXiv:2305.13245 (2023).
[5] Loshchilov, Ilya, and Frank Hutter. "Decoupled weight decay regularization." arXiv preprint arXiv:1711.05101 (2017).
[6] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
[7] Dettmers, Tim, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. "QLoRA: Efficient finetuning of quantized LLMs." arXiv preprint arXiv:2305.14314 (2023).
[8] Chen, Jiaao, Aston Zhang, Xingjian Shi, Mu Li, Alex Smola, and Diyi Yang. "Parameter-efficient fine-tuning design spaces." arXiv preprint arXiv:2301.01821 (2023).
[9] Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." Advances in Neural Information Processing Systems 33 (2020): 1877-1901.
[10] Bommasani, Rishi, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein et al. "On the opportunities and risks of foundation models." arXiv preprint arXiv:2108.07258 (2021).
[11] Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. "LoRA: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).
