This work introduces Gemma, a family of lightweight, state-of-the-art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations.
1 See Contributions and Acknowledgments section for full author list. Please send correspondence to gemma-1-report@google.com.
[Figure omitted: bar chart of performance by score (0–60), grouped into Question Answering, Reasoning, Math / Science, and Coding.]

Figure 1 | Language understanding and generation performance of Gemma 7B across different capabilities compared to similarly sized open models. We group together standard academic benchmark evaluations by capability and average the respective scores; see Table 6 for a detailed breakdown of performance.
Following Gemini (Gemini Team, 2023), we stage training to alter the corpus mixture throughout training to increase the weight of relevant, high-quality data towards the end of training.

Instruction Tuning

We finetune Gemma 2B and 7B with supervised fine-tuning (SFT) on a mix of text-only, English-only synthetic and human-generated prompt-response pairs, and with reinforcement learning from human feedback (RLHF), with the reward model trained on labelled English-only preference data and the policy based on a set of high-quality prompts. We find that both stages are important for improved performance on downstream automatic evaluations and human preference evaluations of model outputs.
Supervised Fine-Tuning
We selected our data mixtures for supervised fine-tuning based on LM-based side-by-side evaluations (Zheng et al., 2023). Given a set of held-out prompts, we generate responses from a test model, generate responses on the same prompts from a baseline model, shuffle these randomly, and ask a larger, high-capability model to express a preference between the two responses. Different prompt sets are constructed to highlight specific capabilities, such as instruction following, factuality, creativity, and safety. Our LM-based judges employ a number of known strategies, such as chain-of-thought prompting (Wei et al., 2022) and rubrics and constitutions (Bai et al., 2022), to be aligned with human preferences.
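This protocol is simple to express in code. The sketch below is an illustrative reconstruction, not the pipeline used for Gemma: test_model, baseline_model, and judge are hypothetical callables standing in for whatever inference APIs are available, and the judge is assumed to return a single "A"/"B" preference label.

```python
import random

def side_by_side_preference(prompt, test_model, baseline_model, judge):
    """Collect one side-by-side preference between two models on a prompt.

    test_model, baseline_model, and judge are hypothetical callables standing
    in for whatever inference APIs are available; judge is assumed to return
    "A" or "B".
    """
    responses = {"test": test_model(prompt), "baseline": baseline_model(prompt)}
    order = ["test", "baseline"]
    random.shuffle(order)  # shuffle so the judge cannot exploit position
    verdict = judge(prompt, responses[order[0]], responses[order[1]])
    return order[0] if verdict == "A" else order[1]

def preference_win_rate(prompts, test_model, baseline_model, judge):
    """Fraction of prompts on which the test model's response is preferred."""
    wins = sum(
        side_by_side_preference(p, test_model, baseline_model, judge) == "test"
        for p in prompts
    )
    return wins / len(prompts)
```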
Filtering

When using synthetic data, we run several stages of filtering over it, removing examples that show certain personal information, unsafe or toxic model outputs, mistaken self-identification data, or duplicated examples. Following Gemini, we find that including subsets of data that encourage better in-context attribution, hedging, and refusals to minimize hallucinations improves performance on factuality metrics, without degrading model performance on other metrics.

The final data mixtures and supervised fine-tuning recipe, which includes tuned hyperparameters, were chosen on the basis of improving helpfulness while minimizing model harms related to safety and hallucinations.

Formatting

Instruction tuned models are trained with a specific formatter that annotates all instruction tuning examples with extra information, both at training and inference time. It has two purposes: 1) indicating roles in a conversation, such as the User role, and 2) delineating turns in a conversation, especially in a multi-turn conversation. Special control tokens are reserved in the tokenizer for this purpose. While it is possible to get coherent generations without the formatter, it will be out-of-distribution for the model, and will very likely produce worse generations.

The relevant formatting control tokens are presented in Table 3, with a dialogue example presented in Table 4.

Context                      Relevant Token
User turn                    user
Model turn                   model
Start of conversation turn   <start_of_turn>
End of conversation turn     <end_of_turn>

Table 3 | Relevant formatting control tokens used for both SFT and RLHF of Gemma models.

User:  <start_of_turn>user
       Knock knock.<end_of_turn>
       <start_of_turn>model
Model: Who's there?<end_of_turn>
User:  <start_of_turn>user
       Gemma.<end_of_turn>
       <start_of_turn>model
Model: Gemma who?<end_of_turn>

Table 4 | Example dialogue with user and model control tokens.
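The control tokens in Table 3 are enough to reproduce the dialogue format of Table 4 mechanically. The sketch below is a minimal illustration, not the released Gemma tokenizer or template code; the exact whitespace placement around the tokens is our assumption.

```python
# Control tokens from Table 3; the assembly logic is an illustrative
# reconstruction, not the released Gemma template code.
START, END = "<start_of_turn>", "<end_of_turn>"

def format_dialogue(turns):
    """Render (role, text) pairs, with role in {"user", "model"}, into the
    instruction-tuning format, ending with an open model turn so that the
    model generates the next reply."""
    parts = [f"{START}{role}\n{text}{END}\n" for role, text in turns]
    parts.append(f"{START}model\n")  # cue the model's next turn
    return "".join(parts)

# Reproduces the structure of the Table 4 example:
prompt = format_dialogue([
    ("user", "Knock knock."),
    ("model", "Who's there?"),
    ("user", "Gemma."),
])
```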
Reinforcement Learning from Human Feedback

We further finetuned the supervised fine-tuned model using RLHF (Christiano et al., 2017; Ouyang et al., 2022). We collected pairs of preferences from human raters and trained a reward function under the Bradley-Terry model (Bradley and Terry, 1952), similarly to Gemini. The policy was trained to optimize this reward function using a novel reinforcement learning algorithm. Similar to the SFT phase, and in order to tune hyperparameters and additionally mitigate reward hacking (Amodei et al., 2016; Skalse et al., 2022), we relied on a high-capacity model as an automatic rater and computed side-by-side comparisons against baseline models.

Model                 Safety                   Instr. Following
Gemma 1.1 IT 7B       63.5%                    61.2%
  95% Conf. Interval  [60.7%, 66.1%]           [59.3%, 63.0%]
  Win / Tie / Loss    51.5% / 23.9% / 24.6%    52.2% / 18.1% / 29.8%
Gemma 1.1 IT 2B       60.1%                    45.0%
  95% Conf. Interval  [57.3%, 62.8%]           [43.1%, 46.9%]
  Win / Tie / Loss    48.5% / 23.2% / 28.3%    37.1% / 15.8% / 47.1%

Table 5 | Win rate of Gemma 1.1 IT models versus Mistral 7B v0.2 Instruct with 95% confidence intervals. We report breakdowns of wins, ties, and losses, and we break ties evenly when reporting the final win rate. Gemma 1.0 results can be found in the appendix.
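For readers unfamiliar with the Bradley-Terry model used for the reward function above, the training objective on a single preference pair reduces to a logistic loss on the reward margin. The sketch below shows only that per-pair loss; the reward model architecture, batching, and any regularization are not specified in the paper and are left out.

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log-likelihood of one preference pair under the Bradley-Terry
    model, where P(chosen preferred) = sigmoid(r_chosen - r_rejected).
    r_chosen and r_rejected are scalar reward-model scores; how those scores
    are produced is not specified in the paper."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), in numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```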
Evaluation
We evaluate Gemma across a broad range of domains, using both automated benchmarks and human evaluation.

Human Preference Evaluations

In addition to running standard academic benchmarks on the finetuned models, we sent final release candidates to human evaluation studies to be compared against the Mistral v0.2 7B Instruct model (Jiang et al., 2023). On a held-out collection of around 1000 prompts asking models to follow instructions across creative writing, coding, and general instruction-following tasks, Gemma 7B IT has a 51.7% positive win rate and Gemma 2B IT has a 41.6% win rate over Mistral v0.2 7B Instruct. On a held-out collection of around 400 prompts oriented towards testing basic safety protocols, Gemma 7B IT has a 58% win rate, while Gemma 2B IT has a 56.5% win rate. We report the corresponding numbers in Table 5.
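To make the tie-breaking convention of Table 5 concrete, the sketch below computes a final win rate with ties split evenly. The counts are illustrative values back-derived from the rounded Table 5 percentages, and the normal-approximation interval is our assumption; the paper does not state how its confidence intervals were constructed, so this will not exactly reproduce them.

```python
import math

def win_rate_with_ties(wins, ties, losses):
    """Final win rate with ties split evenly, per the Table 5 convention."""
    n = wins + ties + losses
    return (wins + 0.5 * ties) / n

def normal_ci(rate, n, z=1.96):
    """Normal-approximation 95% interval; the interval construction used in
    the paper is not stated, so this choice is an assumption."""
    half = z * math.sqrt(rate * (1.0 - rate) / n)
    return max(0.0, rate - half), min(1.0, rate + half)

# Illustrative counts back-derived from the rounded Gemma 1.1 IT 7B
# instruction-following breakdown in Table 5 (52.2 / 18.1 / 29.8 over
# roughly 1000 prompts):
rate = win_rate_with_ties(522, 181, 297)   # ~0.612, matching the 61.2% entry
low, high = normal_ci(rate, 1000)
```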
Automated Benchmarks

We measure Gemma models' performance on domains including physical reasoning (Bisk et al., 2019), social reasoning (Sap et al., 2019), question answering (Clark et al., 2019; Kwiatkowski et al., 2019), coding (Austin et al., 2021; Chen et al., 2021), mathematics (Cobbe et al., 2021), commonsense reasoning (Sakaguchi et al., 2019), language modeling (Paperno et al., 2016), reading comprehension (Joshi et al., 2017), and more.

For most automated benchmarks we use the same evaluation methodology as in Gemini. Specifically, for those where we report performance compared with Mistral, we replicated the methodology from the Mistral technical report as closely as possible. These specific benchmarks are: ARC (Clark et al., 2018), CommonsenseQA (Talmor et al., 2019), Big Bench Hard (Suzgun et al., 2022), and AGI Eval (English-only) (Zhong et al., 2023). Due to restrictive licensing, we were unable to run any evaluations on LLaMA-2 and cite only those metrics previously reported (Touvron et al., 2023b).

We compare Gemma 2B and 7B models to several external open-source (OSS) LLMs across a series of academic benchmarks, reported in Table 6 and Table 7.

Table 6 | Academic benchmark results, compared to similarly sized, openly available models trained on general English text data. † Mistral reports 50.2 on a different split for MBPP; on their split, our 7B model achieves 54.5. ∗ Evaluations run by us. Note that due to restrictive licensing, we were unable to run evals on LLaMA-2; all values above were previously reported in Touvron et al. (2023b).

On MMLU (Hendrycks et al., 2020), Gemma 7B outperforms all OSS alternatives at the same or smaller scale; it also outperforms several larger models, including LLaMA2 13B. However, human expert performance is gauged at 89.8% by the benchmark authors; as Gemini Ultra is the first model to exceed this threshold, there is significant room for continued improvements to achieve Gemini and human-level performance.

Gemma models demonstrate particularly strong performance on mathematics and coding benchmarks. On mathematics tasks, which are often used to benchmark the general analytical capabilities of models, Gemma models
outperform other models by at least 10 points on GSM8K (Cobbe et al., 2021) and the more difficult MATH (Hendrycks et al., 2021) benchmark. Similarly, they outperform alternate open models by at least 6 points on HumanEval (Chen et al., 2021). They even surpass the performance of the code-fine-tuned CodeLLaMA-7B models on MBPP (CodeLLaMA achieves a score of 41.4%, where Gemma 7B achieves 44.4%).

Memorization Evaluations

Recent work has shown that aligned models may be vulnerable to new adversarial attacks that can bypass alignment (Nasr et al., 2023). These attacks can cause models to diverge, and sometimes regurgitate memorized training data in the process. We focus on discoverable memorization, which serves as a reasonable upper bound on the memorization of a model (Nasr et al., 2023) and has been the common definition used in several studies (Anil et al., 2023; Carlini et al., 2022; Kudugunta et al., 2023).

We test for memorization¹ of the Gemma pretrained models with the same methodology performed in Anil et al. (2023). We sample 10,000 documents from each corpus and use the first 50 tokens as a prompt for the model. We focus mainly on exact memorization, where we classify texts as memorized if the subsequent 50 tokens generated by the model exactly match the ground-truth continuation in the text. However, to better capture potential paraphrased memorizations, we include approximate memorization (Ippolito et al., 2022) using a 10% edit-distance threshold.

¹ Our use of "memorization" relies on the definition of that term found at www.genlaw.org/glossary.html.
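The classification step of this protocol can be sketched as follows, assuming the 50-token continuations have already been sampled from the model. Normalizing the 10% edit-distance threshold by continuation length is our assumption; Ippolito et al. (2022) should be consulted for the exact criterion.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def classify_memorization(generated, ground_truth, threshold=0.10):
    """Label a sampled 50-token continuation as exactly and/or approximately
    memorized, following the criteria described above."""
    exact = list(generated) == list(ground_truth)
    approx = edit_distance(generated, ground_truth) <= threshold * len(ground_truth)
    return exact, approx
```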
[Figure omitted: percent of exactly memorized text ("% Exact Memorized", log scale), shown separately for English web content and for all content.]

Personal Data. Perhaps of higher importance is the possibility that personal data might be memorized.
Table 8 | Safety academic benchmark results of Gemma 1.1 IT models, compared to similarly sized, openly available models. Evaluations run by us. Note that due to restrictive licensing, we were unable to run evals on LLaMA-2; we do not report previously published numbers for LLaMA-2 on TruthfulQA, as we use different, non-comparable evaluation set-ups: we use MC2, where LLaMA-2 uses GPT-Judge. Results for Gemma 1.0 IT models can be found in the appendix.
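For reference, the MC2 score mentioned in the caption is, per question, the normalized probability mass the model assigns to the set of true reference answers. A minimal sketch, assuming per-answer log-likelihoods are already computed (length-normalization conventions vary across implementations):

```python
import math

def truthfulqa_mc2(loglik_true, loglik_false):
    """MC2 score for one question: the normalized probability mass the model
    assigns to the set of true reference answers, given per-answer
    log-likelihoods under the model."""
    p_true = sum(math.exp(l) for l in loglik_true)
    p_false = sum(math.exp(l) for l in loglik_false)
    return p_true / (p_true + p_false)
```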
[Figure omitted: percent of memorized text ("% Memorized", log scale) by data source (Code, Wiki, Science, Web, Multilingual), with one panel per memorization type (exact and approximate).]

Figure 4 | Comparing exact and approximate memorization.

We find that more data is approximately memorized (note the log scale), and that this is nearly consistent across each of the different subcategories over the dataset.

Responsible Deployment

In line with previous releases of Google's AI technologies (Gemini Team, 2023; Kavukcuoglu et al., 2022), we follow a structured approach to responsible development and deployment of our models, in order to identify, measure, and manage foreseeable downstream societal impacts. As with our recent Gemini release, these are informed by prior academic literature on language model risks, ongoing engagement with experts internally and externally, and unstructured attempts to discover new model vulnerabilities.

Benefits

We believe that openness in AI science and technology can bring significant benefits. Open-sourcing is a significant driver of science and innovation, and a responsible practice in most circumstances. But this needs to be balanced against the risk of providing actors with the tools to cause harm now or in the future.

Google has long committed to providing broader access to successful research innovations (GraphCast, Transformer, BERT, T5, Word2Vec), and we believe that releasing Gemma into the AI development ecosystem will enable downstream developers to create a host of beneficial applications, in areas such as science, education and the arts. Our instruction-tuned offerings should encourage a range of developers to leverage Gemma's chat and code capabilities to support their own beneficial applications, while allowing for custom fine-tuning to specialize the model's capabilities for specific use cases. To ensure Gemma
supports a wide range of developer needs, we are also releasing two model sizes to optimally support different environments, and have made these models available across a number of platforms (see Kaggle for details). Providing broad access to Gemma in this way should reduce the economic and technical barriers that newer ventures or independent developers face when incorporating these technologies into their workstreams.

As well as serving developers with our instruction-tuned models, we have also provided access to corresponding base pretrained models. By doing so, it is our intention to encourage further AI safety research and community innovation, providing a wider pool of models available to developers to build on the various methods of transparency and interpretability research that the community has already benefited from (Pacchiardi et al., 2023; Zou et al., 2023).

Risks

In addition to bringing benefits to the AI development ecosystem, we are aware that malicious uses of LLMs, such as the creation of deepfake imagery, AI-generated disinformation, and illegal and disturbing material, can cause harm at both the individual and institutional levels (Weidinger et al., 2021). Providing access to model weights, rather than releasing models behind an API, also raises new challenges for responsible deployment.

First, we cannot prevent bad actors from fine-tuning Gemma for malicious intent, despite their use being subject to Terms of Use that prohibit the use of Gemma models in ways that contravene our Gemma Prohibited Use Policy. However, we are cognizant that further work is required to build more robust mitigation strategies against intentional misuse of open models, which Google DeepMind will continue to explore both internally and in collaboration with the AI community.

The second challenge we face is protecting developers and downstream users against the unintended behaviours of open models, including generation of toxic language or perpetuation of discriminatory social harms, model hallucinations, and leakage of personally identifiable information. When deploying models behind an API, these risks can be reduced via various filtering methods.

Mitigations

Without this layer of defense for the Gemma family of models, we have endeavoured to safeguard against these risks by filtering and measuring biases in pre-training data in line with the Gemini approach, assessing safety through standardized AI safety benchmarks, internal red teaming to better understand the risks associated with external use of Gemma, and subjecting the models to rigorous ethics and safety evaluations, the results of which can be seen in Table 8.

While we've invested significantly in improving the model, we recognize its limitations. To ensure transparency for downstream users, we've published a detailed model card to provide researchers with a more comprehensive understanding of Gemma.

We have also released a Generative AI Responsible Toolkit to support developers to build AI responsibly. This encompasses a series of assets to help developers design and implement responsible AI best practices and keep their users safe.

The relative novelty of releasing open-weights models means new uses, and misuses, of these models are still being discovered, which is why Google DeepMind is committed to the continuous research and development of robust mitigation strategies alongside future model development.

Assessment

Ultimately, given the capabilities of larger systems accessible within the existing ecosystem, we believe the release of Gemma will have a negligible effect on the overall AI risk portfolio. In light of this, and given the utility of these models for research, auditing and downstream product development, we are confident that the benefit of Gemma to the AI community outweighs the risks described.

Going Forward

As a guiding principle, Google DeepMind strives to adopt assessments and safety mitigations proportionate to the potential risks from our models.
Although we are confident that Gemma models will provide a net benefit to the community, our emphasis on safety stems from the irreversible nature of this release. As the harms resulting from open models are not yet well defined, nor does an established evaluation framework for such models exist, we will continue to follow this precedent and take a measured and cautionary approach to open model development. As capabilities advance, we may explore extended testing, staggered releases or alternative access mechanisms to ensure responsible AI development.

As the ecosystem evolves, we urge the wider AI community to move beyond simplistic 'open vs. closed' debates, and to avoid either exaggerating or minimising potential harms, as we believe a nuanced, collaborative approach to risks and benefits is essential. At Google DeepMind we're committed to developing high-quality evaluations and invite the community to join us in this effort for a deeper understanding of AI systems.

Discussion and Conclusion

Beyond state-of-the-art performance measures on benchmark tasks, we are excited to see what new use cases arise from the community, and what new capabilities emerge as we advance the field together. We hope that researchers use Gemma to accelerate a broad array of research, and that developers create beneficial new applications, user experiences, and other functionality.

Gemma benefits from many learnings of the Gemini model program, including code, data, architecture, instruction tuning, reinforcement learning from human feedback, and evaluations. As discussed in the Gemini technical report, we reiterate a non-exhaustive set of limitations to the use of LLMs. Even with great performance on benchmark tasks, further research is needed to create robust, safe models that reliably perform as intended. Example further research areas include factuality, alignment, complex reasoning, and robustness to adversarial input. As discussed by Gemini, we note the need for more challenging and robust benchmarks.
References

R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39, 1952.

N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022.

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel. PaLM: Scaling language modeling with pathways, 2022.

P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.

C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. CoRR, abs/1905.10044, 2019. URL http://arxiv.org/abs/1905.10044.

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, 2018.

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168.

J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. A. Ranzato, A. Senior, P. Tucker, K. Yang, Q. Le, and A. Ng. Large scale distributed deep networks. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/6aca97005c68f1206823815f66102863-Paper.pdf.

J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805.

Gemini Team. Gemini: A family of highly capable multimodal models, 2023.

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. CoRR, abs/2009.03300, 2020. URL https://arxiv.org/abs/2009.03300.

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. NeurIPS, 2021.

D. Ippolito, F. Tramèr, M. Nasr, C. Zhang, M. Jagielski, K. Lee, C. A. Choquette-Choo, and N. Carlini. Preventing verbatim memorization in language models gives a false sense of privacy. arXiv preprint arXiv:2210.17546, 2022.

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7B, 2023.

M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. CoRR, abs/1705.03551, 2017. URL http://arxiv.org/abs/1705.03551.

Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In Y. Bengio and Y. LeCun, editors, 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013. URL http://arxiv.org/abs/1301.3781.

M. Nasr, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo, E. Wallace, F. Tramèr, and K. Lee. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023.
Table 9 | Win rate of Gemma 1.0 IT models versus Mistral 7B v0.2 Instruct with 95% confidence intervals. We report breakdowns of wins, ties, and losses. Ties are broken evenly in the final win rate.

Table 10 | Safety academic benchmark results of Gemma 1.0 IT models, compared to similarly sized open models. Evaluations run by us. Note that due to restrictive licensing, we were unable to run evals on LLaMA-2; we do not report previously published numbers for LLaMA-2 on TruthfulQA, because we use different, non-comparable evaluation set-ups: we use MC2, where LLaMA-2 uses GPT-Judge.