Figure 2: The overall process of applying conformal prediction for uncertainty quantification in LLMs. (a) Five
distinct tasks are considered, and a dataset comprising 10,000 instances is prepared for each task. (b) Each data
instance is transformed into a multiple-choice question, and eight LLMs (LLM series) are prompted to generate
predicted probabilities for the given options. (c) Each dataset is divided into a calibration set and a test set, followed
by the application of conformal prediction to generate prediction sets for test set instances. For illustrative purposes,
demonstrations in the prompt input are excluded, and solely the process of constructing prediction sets utilizing the
LAC conformal score function is demonstrated. In addition, only four options of the question are presented.
The overall process of employing conformal prediction for uncertainty quantification in LLMs is illustrated in Figure 2. In the following sections, we first elucidate the evaluation tasks and their associated datasets, then provide details about the evaluation prompts used to extract softmax scores (i.e., predicted probabilities) from LLMs, and finally introduce the adopted evaluation metrics.

4 Evaluation Tasks and Datasets

LLMs have demonstrated remarkable capabilities across various aspects (Guo et al., 2023; Naveed et al., 2023). It is essential to develop multiple tasks to evaluate their performance comprehensively. For this purpose, we consider five typical NLP tasks: question answering, reading comprehension, commonsense inference, dialogue response selection, and document summarization. For each task, we prepare a dataset with 10,000 instances. In addition, we formulate each task as a multiple-choice question answering (MCQA) task, where the objective is to select the only correct answer out of six possible options (i.e., A, B, C, D, E, and F).

Question Answering (QA) QA is applied to evaluate an LLM's proficiency in utilizing its extensive world knowledge to provide accurate answers to a diverse range of questions. For this task, we adopt MMLU (Hendrycks et al., 2020) as the evaluation dataset. MMLU encompasses a total of 57 subjects, spanning various disciplines such as elementary mathematics, US history, computer science, and law. These subjects are further classified into four broad categories, namely humanities, social sciences, STEM, and others (business, health, misc.). For each category, we sample 2,500 instances, leading to 10,000 instances in total.

Reading Comprehension (RC) RC is used for testing an LLM's ability to understand and analyze a given context, comprehend the meaning of words and sentences, and answer questions based on the information presented in the context. It also tests the ability of LLMs to make inferences and draw conclusions from the given context. We take CosmosQA (Huang et al., 2019) as the evaluation dataset. CosmosQA focuses on reading between the lines (Norvig, 1987) over a diverse collection of people's everyday narratives that require reasoning beyond the exact text spans in the context. Due to the unavailability of ground-truth labels for the test set in CosmosQA, we sample 10,000 instances from the training and development sets.

Commonsense Inference (CI) CI is leveraged to evaluate the ability of LLMs to understand and reason about the relationships between concepts
and events based on commonsense and background knowledge. This type of testing helps to assess the generalization and reasoning abilities of LLMs beyond simple pattern recognition tasks. We employ HellaSwag (Zellers et al., 2019) as the evaluation dataset. HellaSwag focuses on commonsense natural language inference, where the target is to select the most likely follow-up to a given event description. As with CosmosQA, we sample 10,000 instances from the training and development sets of HellaSwag for the purpose of evaluation.

Dialogue Response Selection (DRS) DRS is adopted for assessing the ability of LLMs to comprehend the meaning of a given dialogue and select an appropriate response from a set of possible responses. This includes the ability to comprehend the meaning of the user's input, select appropriate responses that are relevant to the conversational context, and maintain coherence and consistency in the dialogue. We utilize the dialogue data from the HaluEval (Li et al., 2023a) benchmark as the evaluation dataset (denoted as HaluDial). This dataset consists of exactly 10,000 instances and is built upon OpenDialKG (Moon et al., 2019), a knowledge-grounded dialogue dataset.

Document Summarization (DS) DS is taken to evaluate the proficiency of LLMs in comprehending the substance and context of a given document, and in producing a succinct and cohesive summary that effectively conveys the crucial information and main ideas of the document. This requires the LLM to have a good understanding of the language, the topic, and the structure of the document. Similar to DRS, we adopt the summarization data from the HaluEval (Li et al., 2023a) benchmark as the evaluation dataset (denoted as HaluSum). This dataset also comprises precisely 10,000 instances and is derived from CNN/Daily Mail (See et al., 2017), a summarization dataset pertaining to news articles.

Out of the aforementioned five datasets, MMLU, CosmosQA, and HellaSwag originally consisted of questions with four options each, while HaluDial and HaluSum had only two options per question. To standardize the number of options, two additional choices were added to each question in HaluDial and HaluSum by randomly selecting from the choices of other questions in HaluDial and HaluSum, respectively. Furthermore, all datasets were modified to include two more options, "I don't know" and "None of the above", resulting in a total of six possible options for each question.

5 Evaluation Prompts and Metrics

As delineated in Section 3, one crucial step of leveraging conformal prediction for uncertainty quantification is to acquire the softmax score corresponding to each option. Next, we explicate the method used to elicit these scores from LLMs, followed by a description of the evaluation metrics utilized.

5.1 Prompting Strategies

Following previous works (Hendrycks et al., 2020; Gao et al., 2023; Zhang et al., 2023; Chen et al., 2023; Zheng et al., 2023), we rely on prompt engineering rather than supervised finetuning as the testing approach to evaluating the performance of LLMs on each task. However, our preliminary results show that LLMs are sensitive to prompts. In this regard, we consider three prompting strategies, namely Base Prompt, Shared Instruction Prompt, and Task-specific Instruction Prompt, to reduce the influence of LLMs' sensitivity to different prompts, thereby ensuring a fairer comparison.

• Base Prompt: This strategy directly combines the question and all of its options as the input and prompts the LLM to output the correct option with a prefix "Answer:".

• Shared Instruction Prompt: This strategy adds a general description of the task before the question, informing the LLM that the task is to solve a multiple-choice question and that there is only one correct answer out of six options.

• Task-specific Instruction Prompt: Instead of using a shared instruction, this strategy provides a task-specific instruction that briefly describes the task and the expected type of option.

These prompting strategies facilitate a systematic and standardized evaluation of the performance of LLMs. The prompt templates linked with each strategy are elaborated in Appendix D. The softmax score for each prompt is derived by subjecting the logits corresponding to each option letter (i.e., A, B, C, D, E, and F) to the softmax function. These logits are generated by the language modeling head of contemporary causal LLMs. It is worth noting that only the logits associated with the last token of the prompt input are utilized.
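To make the scoring procedure concrete, the following is a minimal sketch (not the authors' released code) of how option probabilities can be read off the last-token logits with the HuggingFace transformers API; the model name and prompt are illustrative placeholders, and the tokenization of option letters may need adjustment for a given tokenizer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model name; any causal LM on the HuggingFace Hub works the same way.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
model.eval()

OPTION_LETTERS = ["A", "B", "C", "D", "E", "F"]
# Token ids of the option letters. Whether a leading space is needed is
# tokenizer-dependent and may require adjustment for a specific model.
option_ids = [tokenizer.encode(letter, add_special_tokens=False)[0] for letter in OPTION_LETTERS]

def option_probabilities(prompt):
    """Return softmax probabilities over the six option letters,
    computed from the logits of the last prompt token."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits          # (1, seq_len, vocab_size)
    last_token_logits = logits[0, -1, :]         # logits predicting the next token
    option_logits = last_token_logits[option_ids]
    return torch.softmax(option_logits, dim=-1)  # shape (6,)

prompt = (
    "Question: What is the capital of France?\n"
    "A. Paris\nB. Lyon\nC. London\nD. Boston\nE. I don't know\nF. None of the above\n"
    "Answer:"
)
probs = option_probabilities(prompt)
print(dict(zip(OPTION_LETTERS, probs.tolist())))
```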
5.2 Evaluation Metrics

We evaluate LLMs from two perspectives, namely prediction accuracy and prediction uncertainty. For prediction accuracy, we adopt the commonly used metric of Accuracy (Acc), which measures the proportion of test instances whose true label has the highest softmax score. To evaluate prediction uncertainty, we use Set Size (SS), which calculates the average length of the prediction sets of all test instances. These prediction sets are obtained using conformal prediction, as explained in Section 3. In addition to Acc and SS, we propose a new metric, Uncertainty-aware Accuracy (UAcc), which takes into account both the prediction accuracy and uncertainty. UAcc is calculated as follows:

$$\mathrm{UAcc} = \frac{\mathrm{Acc}}{\mathrm{SS}}\sqrt{|Y|}, \qquad (6)$$

where the incorporation of $\sqrt{|Y|}$ allows UAcc to be smaller or larger than Acc, depending on the level of uncertainty in the predictions.
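As a worked illustration of Eq. (6), the sketch below computes Acc, SS, and UAcc from a list of prediction sets; the function and variable names are ours, not part of the paper's implementation:

```python
import math

def evaluate(prediction_sets, probabilities, labels, num_options=6):
    """Compute Acc, SS, and UAcc (Eq. 6) for a list of test instances.

    prediction_sets: list of sets of option indices produced by conformal prediction
    probabilities:   list of per-option probability lists (one per instance)
    labels:          list of true option indices
    """
    n = len(labels)
    # Acc: fraction of instances whose true label has the highest softmax score.
    acc = sum(1 for p, y in zip(probabilities, labels)
              if max(range(num_options), key=lambda i: p[i]) == y) / n
    # SS: average prediction-set size.
    ss = sum(len(s) for s in prediction_sets) / n
    # UAcc: accuracy divided by set size, scaled by sqrt(|Y|); can exceed 1.
    uacc = acc / ss * math.sqrt(num_options)
    return acc, ss, uacc

# Toy example: Acc = 0.5, SS = 2.0, so UAcc = 0.5 / 2.0 * sqrt(6) ≈ 0.61.
print(evaluate(
    [{0, 1}, {2, 3}],
    [[0.5, 0.2, 0.1, 0.1, 0.05, 0.05], [0.1, 0.4, 0.3, 0.1, 0.05, 0.05]],
    [0, 2],
))
```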
6 Evaluation Results

6.1 Setup

In our experiments, we set the error rate α to 0.1, implying that the prediction set should include the true label with a probability of at least 0.9. To better elicit the capabilities of LLMs, we incorporate examples or demonstrations in the prompt, adhering to the in-context learning paradigm (Dong et al., 2022). Specifically, we provide five demonstrations for the QA, RC, and CI tasks. For the DRS task, we utilize three demonstrations, while we use only one demonstration for the DS task due to constraints on input length and inference cost. The maximum input length for all tasks is set to 2048 tokens. In addition, for each task, we allocate 50% of the data as the calibration set and the remaining 50% as the test set. We report results on the test set. These results represent the average value obtained from the two conformal score functions, namely LAC and APS, as well as the three prompting strategies. It is noteworthy that while both LAC and APS fulfill the coverage guarantee requirement, they may produce prediction sets of varying sizes. By taking the average value of these score functions, we aim to mitigate the influence of different score functions on the evaluation of uncertainty, thereby ensuring a more rigorous and reliable assessment.

6.2 Evaluated Models

We select a diverse set of eight representative models (or model series) from the vast array of open-source LLMs available. These models encompass various architectures and training methodologies, allowing for a comprehensive benchmarking analysis. Specifically, the chosen models include the Llama-2 series (Touvron et al., 2023), Mistral-7B (Jiang et al., 2023), the Falcon series (Almazrouei et al., 2023) (Falcon-180B is omitted due to insufficient GPU resources), MPT-7B (Team, 2023b), the Qwen series (Bai et al., 2023), the Yi series (01.AI, 2023), the DeepSeek series (DeepSeek, 2023), and InternLM-7B (Team, 2023a). For all models, we utilize their checkpoints from the HuggingFace platform (https://huggingface.co/models).

6.3 Main Findings

In our primary experiments, we focus on LLMs with sizes ranging from 6B to 14B parameters. The outcomes are presented in Table 1. In addition to the three evaluation metrics, we also report the coverage rate, which is defined as the proportion of test instances for which the prediction set encompasses the true label. As previously mentioned, the reported results are the mean values derived from the two conformal score functions, LAC and APS. For a detailed analysis of the results pertaining to each function, please refer to Appendix C.1.

From Table 1, it is evident that in the majority of cases, the coverage rate is at least 90%, indicating that the coverage guarantee requirement has been met. Although there are cases where the coverage rate falls below 90%, the values are still in close proximity to the 90% threshold (while the theoretical guarantee of conformal prediction is rigorous, there can be minor fluctuations in practice due to finite-sample variability; Angelopoulos and Bates, 2021). The lowest coverage rate is attained by Qwen-7B on the DS task, with a value of 89.56%. Furthermore, all models achieve an average coverage rate exceeding 90% across the five tasks. These findings suggest that the generated prediction sets are meaningful, as they cover the true label with a high probability. Therefore, the size of the prediction set can serve as a reliable indicator of uncertainty.

In principle, an LLM with higher accuracy should exhibit lower uncertainty. However, the results for the SS metric reveal that, in practice, higher accuracy does not necessarily correlate with lower uncertainty. Concretely, for each task, we observe that the ranking of LLMs based on accuracy differs from that based on uncertainty, suggesting that some LLMs possessing higher accuracy actually display higher uncertainty.
LLMs | Coverage Rate (%): QA RC CI DRS DS Avg. | Acc (%) ↑: QA RC CI DRS DS Avg.
Qwen-14B 92.58 95.20 95.29 91.92 89.71 92.94 64.25(1) 91.52(1) 91.00(1) 73.90(1) 49.33(4) 74.00(1)
Yi-6B 91.30 94.06 92.35 91.36 91.38 92.09 57.57(3) 85.99(2) 76.50(2) 58.72(3) 66.06(1) 68.97(2)
Mistral-7B 93.00 92.91 89.94 91.02 92.05 91.78 60.44(2) 81.94(4) 62.93(4) 53.21(4) 62.16(2) 64.14(3)
Llama-2-13B 92.59 93.15 90.50 90.92 90.82 91.60 52.52(5) 77.23(5) 59.66(5) 52.65(5) 60.05(3) 60.42(4)
Qwen-7B 92.79 94.02 91.53 92.43 89.56 92.07 55.21(4) 83.89(3) 63.70(3) 64.04(2) 32.53(8) 59.87(5)
InternLM-7B 90.68 93.28 90.10 90.40 90.34 90.96 48.37(6) 73.86(6) 46.21(6) 43.72(6) 34.38(7) 49.31(6)
Llama-2-7B 91.37 90.69 90.97 89.60 90.04 90.53 45.60(8) 65.79(7) 43.05(7) 32.61(8) 45.60(5) 46.53(7)
DeepSeek-7B 91.18 89.95 90.16 90.89 90.21 90.48 45.65(7) 65.39(8) 42.66(8) 33.50(7) 42.15(6) 45.87(8)
MPT-7B 89.79 90.54 90.12 90.80 89.71 90.19 29.49(9) 31.69(9) 25.50(9) 24.38(10) 24.86(9) 27.18(9)
Falcon-7B 90.04 89.95 89.82 90.46 90.71 90.19 23.75(10) 24.98(10) 24.91(10) 25.86(9) 24.69(10) 24.84(10)
LLMs | SS ↓: QA RC CI DRS DS Avg. | UAcc (%) ↑: QA RC CI DRS DS Avg.
Qwen-14B 2.80(1) 1.74(1) 2.02(2) 1.94(1) 2.37(3) 2.17(1) 57.83(1) 157.52(1) 147.13(1) 97.70(1) 51.22(4) 102.28(1)
Yi-6B 3.20(4) 1.92(3) 1.88(1) 2.85(5) 1.96(1) 2.36(2) 45.18(3) 132.61(2) 103.41(2) 50.97(3) 85.49(1) 83.53(2)
Mistral-7B 2.80(1) 1.75(2) 2.48(4) 2.71(4) 2.40(4) 2.43(3) 54.60(2) 124.71(3) 62.45(4) 48.18(5) 64.25(3) 70.84(3)
Llama-2-13B 3.06(3) 2.24(6) 2.72(5) 2.55(3) 2.24(2) 2.56(4) 42.53(4) 92.46(5) 53.82(5) 50.52(4) 66.02(2) 61.07(5)
Qwen-7B 3.26(6) 2.15(4) 2.28(3) 2.51(2) 2.92(5) 2.63(5) 42.45(5) 118.10(4) 69.47(3) 64.42(2) 27.28(7) 64.34(4)
InternLM-7B 3.49(8) 2.19(5) 3.28(8) 3.63(9) 4.47(10) 3.41(8) 34.17(7) 86.84(6) 34.56(6) 29.73(6) 18.87(8) 40.83(6)
Llama-2-7B 3.20(4) 2.39(7) 3.27(7) 3.26(6) 3.30(7) 3.09(6) 34.97(6) 67.92(7) 32.25(8) 24.50(7) 33.91(5) 38.71(7)
DeepSeek-7B 3.34(7) 2.77(8) 3.06(6) 3.40(7) 3.08(6) 3.13(7) 33.63(8) 58.50(8) 34.23(7) 24.11(8) 33.52(6) 36.80(8)
MPT-7B 3.53(9) 3.46(9) 3.60(9) 3.59(8) 3.66(8) 3.57(9) 20.44(9) 22.43(9) 17.36(9) 16.66(10) 16.63(9) 18.70(9)
Falcon-7B 3.90(10) 3.60(10) 3.66(10) 3.64(10) 3.92(9) 3.75(10) 14.90(10) 17.01(10) 16.66(10) 17.41(9) 15.42(10) 16.28(10)
Table 1: Results of LLMs with sizes ranging from 6B to 14B. These results represent the mean values of LAC and APS. The "Avg." column denotes the average performance across the five tasks. The small number in parentheses next to each score indicates the ranking of the LLM for that specific task. The relative ranking of LLMs is also visually demonstrated by the depth of color, with darker colors signifying higher rankings. Note that UAcc takes values in the range [0, √|Y|], thus can be larger than 1.
Notably, two LLMs with a significant difference in accuracy may even display inverse uncertainty. For example, on the DRS task, InternLM-7B demonstrates higher accuracy than MPT-7B by 19.34 absolute points, yet it shows higher uncertainty. This pattern is also observed on the QA task with Qwen-7B and Llama-2-7B (9.61), the CI task with Qwen-14B and Yi-6B (14.50), and the DS task with InternLM-7B and Falcon-7B (9.69). In each case, the LLM with much higher accuracy exhibits greater uncertainty compared to its counterpart. These observations underscore the importance of considering uncertainty in addition to accuracy when evaluating LLMs.

Our new metric, UAcc, takes into account both accuracy and uncertainty. It can penalize LLMs with high uncertainty and reward those with low uncertainty, as exemplified by Qwen-14B on the QA task and the RC task, respectively. This unique characteristic of UAcc allows it to enlarge or shrink the relative improvement of one LLM compared to another. For example, on the QA task, Qwen-14B demonstrates an 11.60% higher accuracy than Yi-6B, but when considering UAcc, the improvement increases to 28%. Conversely, on the CI task, InternLM-7B has an 8.32% higher accuracy than DeepSeek-7B, but the improvement in UAcc is only 0.96%. Moreover, the UAcc metric can even alter the relative ranking of two LLMs. For example, on the DS task, Llama-2-13B achieves lower accuracy but higher UAcc than Mistral-7B. These results highlight the unique contribution of UAcc in providing a more nuanced evaluation of LLMs.

6.4 Effects of Model Scale

LLMs with larger sizes are usually pretrained on more data and tend to exhibit superior capabilities across various tasks. In this study, we aim to investigate how an LLM's performance changes when scaling its model size. We present the results of the Qwen model series in Figure 3. It is evident that, with the exception of Qwen-72B and Qwen-14B on the CI task, an increase in model size consistently leads to improved task performance. Concerning uncertainty, there is a general trend of decreasing uncertainty when scaling the model size from 1.8B to 14B. However, Qwen-7B displays higher uncertainty compared to Qwen-1.8B on the QA task. When further scaling the model size from 14B to 72B, the enhancements in uncertainty become less pronounced, and more variations are observed. Notably, on both the RC and DRS tasks, Qwen-72B demonstrates higher uncertainty than Qwen-14B, although Qwen-72B achieves higher accuracy. Consequently, Qwen-72B underperforms Qwen-14B in terms of UAcc on the two tasks. We provide the results of the Llama-2, Yi, DeepSeek, and Falcon series in Appendix C.2.
Figure 3: Performance comparison of different versions of the Qwen series (1.8B to 72B) on various tasks.
Figure 4: Mean performance outcomes of the Llama-2 series’ base pretrained model and the instruction-finetuned
chat model across five tasks. Chat-V1 converts inputs into a chat format. Chat-V2 shares the same format as Base.
Figure 5: Average results across five tasks of the Llama-2 series when varying the proportion of calibration data.
LLMs | Acc (%) ↑ | SS ↓ | UAcc (%) ↑
Yi-34B | 81.46 | 1.93 | 121.09
Yi-34B∗ | 80.10 | 1.88 | 119.99
Llama-2-70B | 72.24 | 2.16 | 94.62
Llama-2-70B∗ | 63.55 | 2.72 | 63.67

… tasks, demonstrates that relying solely on accuracy for benchmarking LLMs is insufficient. Instead, it is imperative to take uncertainty into account when assessing their overall performance.
Mengting Hu, Zhen Zhang, Shiwan Zhao, Minlie Huang, and Bingzhe Wu. 2023. Uncertainty in natural language processing: Sources, quantification, and applications. arXiv preprint arXiv:2306.04459.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.

Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. 2023. Generating with confidence: Uncertainty quantification for black-box large language models. arXiv preprint arXiv:2305.19187.

Bingshuai Liu, Chenyang Lyu, Zijun Min, Zhanyu Wang, Jinsong Su, and Longyue Wang. 2023. Retrieval-augmented multi-modal chain-of-thoughts reasoning for large language models. arXiv preprint arXiv:2312.01714.

Charles Lu, Yaodong Yu, Sai Praneeth Karimireddy, Michael Jordan, and Ramesh Raskar. 2023. Federated conformal predictors for distributed uncertainty quantification. In International Conference on Machine Learning, pages 22942–22964. PMLR.

Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. 2023. Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration. arXiv preprint arXiv:2306.09093.

Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic. 2021. Revisiting the calibration of modern neural networks. Advances in Neural Information Processing Systems, 34:15682–15694.

Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. 2019. OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 845–854, Florence, Italy. Association for Computational Linguistics.

Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Nick Barnes, and Ajmal Mian. 2023. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435.

Peter Norvig. 1987. A unified theory of inference for text understanding. Ph.D. thesis, University of California, Berkeley.

Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S Jaakkola, and Regina Barzilay. 2023. Conformal language modeling. arXiv preprint arXiv:2306.10193.

Rahul Rahaman et al. 2021. Uncertainty quantification and deep ensembles. Advances in Neural Information Processing Systems, 34:20063–20075.

Shauli Ravfogel, Yoav Goldberg, and Jacob Goldberger. 2023. Conformal nucleus sampling. In Findings of the Association for Computational Linguistics: ACL 2023, pages 27–34, Toronto, Canada. Association for Computational Linguistics.

Yaniv Romano, Matteo Sesia, and Emmanuel Candes. 2020. Classification with valid and adaptive coverage. Advances in Neural Information Processing Systems, 33:3581–3591.

Mauricio Sadinle, Jing Lei, and Larry Wasserman. 2019. Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, 114(525):223–234.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

InternLM Team. 2023a. InternLM: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM.

MosaicML NLP Team. 2023b. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs. www.mosaicml.com/blog/mpt-7b. Accessed: 2023-05-05.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. 2005. Algorithmic learning in a random world, volume 29. Springer.

Sridevi Wagle, Sai Munikoti, Anurag Acharya, Sara Smith, and Sameera Horawalavithana. 2023. Empirical evaluation of uncertainty quantification in retrieval-augmented language models for science. arXiv preprint arXiv:2311.09358.

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. 2021. An explanation of in-context learning as implicit Bayesian inference. In International Conference on Learning Representations.

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2023. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063.

Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. 2023a. Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond. arXiv preprint arXiv:2304.13712.

Yuchen Yang, Houqiang Li, Yanfeng Wang, and Yu Wang. 2023b. Improving the reliability of large language models by leveraging uncertainty-aware in-context learning. arXiv preprint arXiv:2310.04782.

Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. 2023. Flask: Fine-grained language model evaluation based on alignment skill sets. arXiv preprint arXiv:2307.10928.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.
Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun,
Yongkang Huang, Chong Long, Xiao Liu, Xuanyu
Lei, Jie Tang, and Minlie Huang. 2023. Safety-
bench: Evaluating the safety of large language mod-
els with multiple choice questions. arXiv preprint
arXiv:2309.07045.
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang,
Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen
Zhang, Junjie Zhang, Zican Dong, et al. 2023. A
survey of large language models. arXiv preprint
arXiv:2303.18223.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan
Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin,
Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023.
Judging llm-as-a-judge with mt-bench and chatbot
arena. arXiv preprint arXiv:2306.05685.
A Pseudo Code

Algorithm 1 Conformal Prediction for Uncertainty Quantification of LLMs
Require: An LLM M, a task-specific dataset D, a user-specified error rate α, a test data ratio β, a prompting strategy P, and a conformal score function s (either LAC or APS);
Output: Evaluation results in terms of Acc, SS, UAcc;
1: Get model predictions
2: O ← Initialized as an empty list;
3: for each instance X in D do
4:   X′ ← FormatPromptInput(X, P);
5:   L(X) ← GetLogitsOfOptions(M, X′);
6:   P(X) ← Softmax(L(X));
7:   Append P(X) to O;
8: end for
9: Dcal, Dtest ← CalibrationTestSplit(D, β);
10: Ocal, Otest ← CalibrationTestSplit(O, β);
11: Acc ← CalculateAccuracy(Dtest, Otest);
12: Apply conformal prediction
13: S ← ComputeConformalScores(Dcal, Ocal, s);
14: p̂ ← CalculateConformalThreshold(S, α);
15: B ← Initialized as an empty list;
16: for each instance X in Dtest do
17:   Get P(X) from Otest;
18:   C(X) ← CreatePredictionSet(P(X), p̂, s);
19:   if |C(X)| == 0 then
20:     C(X) ← {Option with the largest probability};
21:   end if
22:   Append C(X) to B;
23: end for
24: SS ← CalculateAverageSetSize(B);
25: UAcc ← (Acc / SS) · √|Y|, where Y denotes the option set;
26: return Acc, SS, UAcc.

We present a summary of the pseudo code for applying conformal prediction to quantify the uncertainty of LLMs in Algorithm 1. The procedure is outlined as follows:

1. For each instance, input it into the LLM to obtain the logits output for all possible options.

2. Apply the softmax function to transform these logits into probability values.

3. Divide the dataset into a calibration set and a test set.

4. Employ the user-specified error rate and the calibration set to determine the conformal threshold.

5. Generate prediction sets for instances in the test set based on the conformal threshold.

6. In the event that a prediction set is empty, select the option with the highest probability as the final prediction.

7. Calculate the evaluation metrics, namely Acc, SS, and UAcc.

In our experiments, we use a server with eight A100 40GB cards to load each LLM checkpoint and perform inference with a batch size of 1.
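The following is a minimal NumPy sketch of the conformal-prediction steps above, using the LAC score (APS would replace the score and set-construction rules); all function names are illustrative, and the finite-sample quantile correction follows standard split conformal prediction rather than the authors' exact code:

```python
import numpy as np

def lac_scores(probs, labels):
    """LAC conformal score: 1 minus the softmax probability of the true option."""
    return 1.0 - probs[np.arange(len(labels)), labels]

def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, q, method="higher")

def prediction_sets(test_probs, threshold):
    """Include every option whose LAC score (1 - prob) does not exceed the threshold.
    If a set comes out empty, fall back to the single most probable option."""
    sets = []
    for p in test_probs:
        s = set(np.where(1.0 - p <= threshold)[0].tolist())
        if not s:
            s = {int(np.argmax(p))}
        sets.append(s)
    return sets

# Toy usage with random probabilities over 6 options.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(6), size=1000)   # stand-in for LLM softmax scores
labels = rng.integers(0, 6, size=1000)         # stand-in for true options
cal_p, cal_y, test_p = probs[:500], labels[:500], probs[500:]

tau = conformal_threshold(lac_scores(cal_p, cal_y), alpha=0.1)
sets = prediction_sets(test_p, tau)
print("average set size:", np.mean([len(s) for s in sets]))
```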
B Dataset Statistics

Figure 6 presents the distribution of correct answer choices for each task. It is noteworthy that while we have incorporated options E ("I don't know") and F ("None of the above") for every question, the correct answer consistently falls within the set {A, B, C, D}. As depicted in Figure 6, the distribution of correct answers is nearly uniform across options A, B, C, and D for all tasks except the QA task. However, even on the QA task, the distribution does not exhibit a significant skew. These statistics indicate that the created datasets are suitable for rigorously evaluating the performance of LLMs.

C Further Experimental Results

C.1 Detailed Results of LAC and APS

Table 1 presents the average results derived from the two conformal score functions, namely LAC and APS. In this part, we analyze the results associated with each conformal score function. The detailed results for LAC and APS are reported in Table 3 and Table 4, respectively.

It is evident that for both conformal score functions, the ranking of LLMs based on accuracy can be different from that based on uncertainty. These results reaffirm the importance of considering uncertainty in order to evaluate the performance of LLMs in a more holistic manner. Another notable observation is the difference in the uncertainty estimations produced by LAC and APS. In general, APS tends to produce larger prediction sets. More importantly, APS can lead to a significantly different ranking of LLMs based on uncertainty compared to LAC. For example, on the CI task, Qwen-14B secures the lowest uncertainty when LAC is utilized as the conformal score function. However, when APS is employed as the conformal score function, Qwen-14B is ranked fifth. This observation suggests that it is essential to average the results of the two conformal score functions to provide a more accurate quantification of uncertainty. Last but not least, both conformal score functions can achieve high coverage rates, verifying again the rationality of relying on prediction set size to estimate uncertainty.
[Figure 6: frequency of correct answer options A–D for each of the QA, RC, CI, DRS, and DS tasks.]
Table 3: Results of LLMs with sizes ranging from 6B to 14B. These results are obtained when LAC is adopted as the conformal score function. The "Avg." column denotes the average performance across tasks. The small number in parentheses indicates the rank of the model on each task. Note that UAcc takes values in the range [0, √|Y|], thus can be larger than 1.
C.2 Effects of Model Scale (Cont.)

Figures 7-10 illustrate the performance outcomes of the Llama-2 series, Yi series, DeepSeek series, and Falcon series. It is observed that while, in general, increasing model size can lead to stronger performance, on some tasks a larger model may display weaker performance. For example, on the DS task, Llama-2-70B exhibits inferior performance compared to Llama-2-13B across all three evaluation metrics. On the CI task, although Yi-34B demonstrates higher accuracy than Yi-6B, it shows higher uncertainty.

C.3 Effects of Instruction Finetuning (Cont.)

Figures 11-13 depict the average results across five tasks of the Yi series, DeepSeek series, and Falcon series, as well as their instruction-finetuned counterparts. Recall that for the instruction-finetuned version, two approaches are adopted to prepare the prompt input. The first approach adheres to the format of the instruction data, which is denoted as Chat-V1. The second approach employs the same prompt format as the base version and is denoted as Chat-V2. For the Yi series, it is observed that both Chat-V1 and Chat-V2 consistently yield inferior performance compared to the base model concerning all three evaluation metrics. In contrast, for the DeepSeek series, both Chat-V1 and Chat-V2 exhibit enhanced performance in terms of Acc and UAcc. Nevertheless, Chat-V1 results in greater uncertainty compared to the base model for both DeepSeek-7B and DeepSeek-67B. Chat-V2 also demonstrates higher uncertainty relative to the base model for DeepSeek-7B. For the Falcon series, it is observed that both Chat-V1 and Chat-V2 consistently lead to higher uncertainty than the base model, even though Chat-V2 achieves better performance in terms of Acc and UAcc.
LLMs | Coverage Rate (%): QA RC CI DRS DS Avg. | Acc (%) ↑: QA RC CI DRS DS Avg.
Qwen-14B 94.74 98.83 99.19 93.52 89.79 95.21 64.25(1) 91.52(1) 91.00(1) 73.90(1) 49.33(4) 74.00(1)
Yi-6B 92.64 98.09 94.86 91.77 92.37 93.95 57.57(3) 85.99(2) 76.50(2) 58.72(3) 66.06(1) 68.97(2)
Mistral-7B 96.37 96.33 89.88 91.04 93.24 93.37 60.44(2) 81.94(4) 62.93(4) 53.21(4) 62.16(2) 64.14(3)
Llama-2-13B 95.12 96.39 90.68 91.25 91.18 92.92 52.52(5) 77.23(5) 59.66(5) 52.65(5) 60.05(3) 60.42(4)
Qwen-7B 95.04 98.25 93.18 94.35 89.12 93.99 55.21(4) 83.89(3) 63.70(3) 64.04(2) 32.53(8) 59.87(5)
InternLM-7B 91.98 96.23 90.24 90.18 90.06 91.74 48.37(6) 73.86(6) 46.21(6) 43.72(6) 34.38(7) 49.31(6)
Llama-2-7B 92.33 91.76 90.95 89.40 90.24 90.94 45.60(8) 65.79(7) 43.05(7) 32.61(8) 45.60(5) 46.53(7)
DeepSeek-7B 92.42 91.43 90.53 90.70 90.26 91.07 45.65(7) 65.39(8) 42.66(8) 33.50(7) 42.15(6) 45.87(8)
MPT-7B 89.35 90.32 90.06 90.72 89.94 90.08 29.49(9) 31.69(9) 25.50(9) 24.38(10) 24.86(9) 27.18(9)
Falcon-7B 90.01 89.96 89.86 90.20 90.82 90.17 23.75(10) 24.98(10) 24.91(10) 25.86(9) 24.69(10) 24.84(10)
LLMs | SS ↓: QA RC CI DRS DS Avg. | UAcc (%) ↑: QA RC CI DRS DS Avg.
Qwen-14B 3.21(1) 2.47(2) 3.03(5) 2.33(1) 2.47(3) 2.70(1) 49.12(1) 91.15(1) 73.50(2) 77.53(1) 48.95(4) 68.05(1)
Yi-6B 3.48(5) 2.72(5) 2.22(1) 2.95(4) 2.33(1) 2.74(3) 40.68(3) 77.50(3) 84.49(1) 48.95(4) 69.64(1) 64.25(2)
Mistral-7B 3.27(2) 2.25(1) 2.59(3) 2.80(3) 2.67(4) 2.71(2) 45.42(2) 89.57(2) 59.50(4) 46.62(5) 57.28(3) 59.68(3)
Llama-2-13B 3.40(4) 2.90(6) 2.86(4) 2.58(2) 2.36(2) 2.82(4) 37.88(4) 65.32(6) 51.06(5) 49.98(3) 62.38(2) 53.32(4)
Qwen-7B 3.70(8) 3.10(8) 2.53(2) 2.95(4) 2.91(5) 3.04(5) 36.56(5) 66.31(5) 61.74(3) 53.18(2) 27.36(7) 49.03(5)
InternLM-7B 3.74(9) 2.68(4) 3.39(8) 3.71(10) 4.51(10) 3.61(9) 31.65(8) 67.60(4) 33.38(7) 29.03(6) 18.71(8) 36.07(7)
Llama-2-7B 3.35(3) 2.58(3) 3.32(7) 3.27(6) 3.15(7) 3.14(6) 33.30(6) 62.48(7) 31.83(8) 24.40(7) 35.43(5) 37.49(6)
DeepSeek-7B 3.48(5) 3.03(7) 3.12(6) 3.42(7) 3.07(6) 3.23(7) 32.10(7) 52.91(8) 33.56(6) 23.97(8) 33.67(6) 35.24(8)
MPT-7B 3.53(7) 3.49(9) 3.60(9) 3.55(8) 3.69(8) 3.57(8) 20.49(9) 22.26(9) 17.35(9) 16.82(10) 16.49(9) 18.68(9)
Falcon-7B 3.89(10) 3.60(10) 3.69(10) 3.62(9) 3.91(9) 3.74(10) 14.96(10) 16.98(10) 16.56(10) 17.52(9) 15.47(10) 16.30(10)
Table 4: Results of LLMs with sizes ranging from 6B to 14B. These results are obtained when APS is adopted as the conformal score function. The "Avg." column denotes the average performance across tasks. The small number in parentheses indicates the rank of the model on each task. Note that UAcc takes values in the range [0, √|Y|], thus can be larger than 1.
Figure 7: Performance comparison of different versions of the Llama-2 series (7B to 70B) on various tasks.
C.4 Effects of Error Rate

Conformal prediction ensures that the prediction set encompasses the true label with a probability no less than 1 − α, where α denotes a user-specified error rate. In consideration of this, it is imperative to investigate the impact of varying the value of the error rate α on the prediction set and, subsequently, the estimation of uncertainty. For this purpose, we vary the value of α within the range of 0.05 to 0.3 and report the results of the Llama-2 series in Figure 14. It is observed that as the value of α increases, the coverage rate decreases monotonically. This outcome is anticipated, as a higher error rate implies a reduced probability of the true label being included in the prediction set. Nevertheless, the coverage rate consistently remains greater than 1 − α. This observation reaffirms the statistical guarantee provided by conformal prediction in generating the prediction set. It is also observed that the average set size (SS) decreases monotonically with an increase in the value of α. This observation is logical, as a larger error rate suggests that the prediction set can miss the true label with a higher probability and, consequently, have a smaller size.
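As a concrete illustration of this behaviour, the sketch below (synthetic probabilities and LAC-style scores, not the authors' code) sweeps α and reports the empirical coverage and average set size:

```python
import numpy as np

rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(6) * 0.5, size=2000)       # synthetic softmax scores over 6 options
labels = np.array([rng.choice(6, p=p) for p in probs])    # labels drawn from those scores
cal_p, cal_y, test_p, test_y = probs[:1000], labels[:1000], probs[1000:], labels[1000:]

cal_scores = 1.0 - cal_p[np.arange(1000), cal_y]          # LAC scores on the calibration set
for alpha in (0.05, 0.1, 0.2, 0.3):
    q = min(np.ceil((1000 + 1) * (1 - alpha)) / 1000, 1.0)
    tau = np.quantile(cal_scores, q, method="higher")
    sets = [np.where(1.0 - p <= tau)[0] for p in test_p]
    coverage = np.mean([y in s for s, y in zip(sets, test_y)])
    avg_size = np.mean([len(s) for s in sets])
    print(f"alpha={alpha:.2f}  coverage={coverage:.3f}  avg set size={avg_size:.2f}")
```

Larger α yields smaller sets while coverage stays near or above 1 − α, matching the trend discussed above.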
Figure 8: Performance comparison of different versions of the Yi series (6B to 34B) on various tasks.
Figure 9: Performance comparison of different versions of the DeepSeek series (7B to 67B) on various tasks.
Figure 10: Performance comparison of different versions of the Falcon series (7B to 40B) on various tasks.
In terms of UAcc, according to its definition, it is intuitive that its value will increase as the value of SS decreases. Our results corroborate this expectation. Another noteworthy observation is that, regardless of the value of α, Llama-2-70B consistently exhibits lower uncertainty than Llama-2-13B, and Llama-2-13B consistently displays lower uncertainty than Llama-2-7B. This finding is of paramount importance, as it demonstrates that although different values of the error rate α can lead to varying sizes of the prediction set, the relative rankings of different LLMs based on uncertainty remain consistent.
Figure 11: Mean performance outcomes of the Yi series’ base pretrained model and the instruction-finetuned chat
model across five tasks. Chat-V1 converts inputs into a chat format. Chat-V2 shares the same format as Base.
Figure 12: Mean performance outcomes of the DeepSeek series’ base pretrained model and the instruction-finetuned
chat model across five tasks. Chat-V1 converts inputs into a chat format. Chat-V2 shares the same format as Base.
Figure 13: Mean performance outcomes of the Falcon series’ base pretrained model and the instruction-finetuned
chat model across five tasks. Chat-V1 converts inputs into a chat format. Chat-V2 shares the same format as Base.
Figure 14: Average results across five tasks of the Llama-2 series when varying the error rate α. Note that the ideal
coverage rate should be no less than 1 − α.
… lower performance with respect to UAcc. This finding suggests that while LLMs are indeed capable of addressing multiple tasks, it remains crucial to analyze each task independently, since LLMs can …

… 2021), as shown in the following equation:

$$\mathrm{softmax}(z_i, \tau) = \frac{e^{z_i/\tau}}{\sum_{j=1}^{m} e^{z_j/\tau}}, \qquad (7)$$
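A minimal implementation of Eq. (7), useful for reproducing the temperature behaviour discussed below; the logit values are arbitrary examples:

```python
import numpy as np

def softmax_with_temperature(z, tau):
    """Temperature-scaled softmax of Eq. (7): exp(z_i / tau) / sum_j exp(z_j / tau)."""
    z = np.asarray(z, dtype=float) / tau
    z -= z.max()                       # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5, 0.2, -1.0, -1.5]
print(softmax_with_temperature(logits, tau=0.2))  # small tau -> sharp (low-entropy) distribution
print(softmax_with_temperature(logits, tau=2.0))  # large tau -> close to uniform (high entropy)
```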
[Figure 15 panels: (a) LAC; (b) APS, each showing Coverage Rate, SS, and UAcc for Llama-2-7B/13B/70B as τ varies.]
Figure 15: Average results across five tasks of the Llama-2 series when varying the softmax temperature τ .
It is observed that when using LAC as the conformal score function, we can obtain relatively stable performance in terms of both SS and UAcc. However, APS is highly sensitive to the temperature τ. When τ takes small values, APS results in high uncertainty, which is somewhat counter-intuitive, as a low temperature leads to sharp distributions and, consequently, low entropy. Intuitively, low entropy suggests low uncertainty. However, when we vary the value of τ, we do not change the LLM's accuracy. Thus, a low temperature causes the LLM to be overconfident. APS penalizes this phenomenon by producing large prediction sets (implying high uncertainty). This is a desirable property of APS. Nonetheless, we observe that when τ takes large values, APS provides low uncertainty estimation, which is undesirable. This is because when τ takes large values, the probability distribution is close to a uniform distribution (indicating high entropy), and the LLM becomes underconfident. Therefore, the uncertainty estimation should be high rather than low. In light of this, we have taken the average value of LAC and APS to achieve a more robust estimation of uncertainty in our analysis.

In summary, these observations suggest that: (I) entropy is not a reliable metric for evaluating uncertainty, (II) LAC is not very sensitive to the softmax temperature τ, (III) a small value of τ results in the LLM being overconfident, and APS penalizes this by estimating high uncertainty, and (IV) APS cannot handle the case where the LLM is underconfident, so it is crucial to take the average results of LAC and APS to achieve a robust quantification of uncertainty.

C.7 Rate of Predicted Option Being E or F
LLMs | E Rate (%): QA RC CI DRS DS Avg. | F Rate (%): QA RC CI DRS DS Avg.
Yi-34B 1.79 0.41 0.00 1.99 0.00 0.84 0.37 0.08 0.07 0.07 0.00 0.12
Qwen-72B 0.31 0.47 0.39 1.53 0.00 0.54 0.32 0.62 1.19 3.22 0.02 1.07
Qwen-14B 0.21 0.83 0.08 0.19 0.00 0.26 0.43 0.70 0.17 0.19 0.27 0.35
Llama-2-70B 0.11 0.03 0.00 0.47 0.72 0.27 0.05 0.07 0.00 0.17 0.00 0.06
DeepSeek-67B 3.46 2.95 0.75 0.81 0.00 1.60 0.04 0.00 0.00 0.00 0.02 0.01
Yi-6B 5.88 1.01 0.00 5.71 0.00 2.52 0.05 0.15 0.00 0.03 0.00 0.05
Mistral-7B 0.18 0.00 0.00 0.01 0.00 0.04 0.00 0.00 0.00 0.00 0.00 0.00
Llama-2-13B 0.99 0.05 0.00 0.00 0.00 0.21 0.16 0.00 0.00 0.00 0.00 0.03
Qwen-7B 1.25 0.51 0.00 0.32 0.00 0.42 1.01 0.74 0.00 0.04 0.00 0.36
InternLM-7B 1.05 0.00 0.03 0.07 2.71 0.77 0.00 0.00 0.00 0.00 1.61 0.32
Llama-2-7B 0.39 0.00 0.00 0.00 1.85 0.45 0.01 0.00 0.00 0.00 0.12 0.03
DeepSeek-7B 1.05 1.67 0.00 0.00 0.39 0.62 0.58 0.00 0.00 0.00 0.00 0.12
Qwen-1.8B 1.65 0.51 0.68 0.00 1.01 0.77 0.00 0.00 0.00 0.00 0.52 0.10
Falcon-40B 0.15 0.01 0.00 0.00 3.72 0.78 0.00 0.05 0.05 0.00 0.95 0.21
MPT-7B 0.05 0.00 0.00 0.00 0.04 0.02 0.00 0.00 0.00 0.00 0.00 0.00
Falcon-7B 0.16 0.05 0.00 0.01 0.03 0.05 0.00 0.01 0.00 0.00 0.10 0.02
Table 6: The ratio of test instances for which the predicted answer is option E ("I don’t know") or option F ("None
of the above"). Note that neither of them corresponds to the ground truth answer.
When preparing datasets, in order to enhance the complexity of tasks and effectively quantify uncertainty, two additional answer choices, E ("I don't know") and F ("None of the above"), are incorporated into the datasets for each task. It is important to note that neither of these options represents the correct answer for any of the questions. With this in mind, our study aims to investigate whether an LLM might predict options E or F as the answer, and if so, the number of test instances for which such predictions would be made. Table 6 presents the results of various LLMs, demonstrating that, across all LLMs, only a tiny proportion of test instances receive E or F as the predicted answer. Furthermore, we observe that for a particular LLM, there can be no questions whose predicted answer is E or F on some tasks (e.g., Mistral-7B on the RC task). Consequently, the inclusion of these additional options does not significantly impact the accuracy of the evaluation process. This indicates that our approach to incorporating options E and F effectively increases the difficulty of the tasks without compromising the reliability of the assessment. With more options in the answer choices, we can also quantify uncertainty more accurately.

C.8 Comparison of Prompting Strategies

In Section 5.1, we have introduced three prompting strategies, and our previous analyses are based on the average results obtained from these prompting strategies, aiming to reduce the influence of LLMs' sensitivities to prompts. In this subsection, we delve deeper into the comparative performance of these prompting strategies. Specifically, we conduct experiments on the DS task and report the results of Yi-34B, Qwen-72B, Llama-2-70B, and DeepSeek-67B in Table 7. It can be observed that while the performance of each LLM varies with different prompting strategies, the discrepancies are generally marginal. Of greater significance is the observation that different LLMs demonstrate preferences for different prompting strategies. For example, the base prompting strategy yields the highest accuracy for Yi-34B. Conversely, Qwen-72B performs optimally with the task-specific instruction prompting strategy, while DeepSeek-67B exhibits superior accuracy with the shared instruction prompting strategy. Similar patterns can be observed from the results of the SS metric. These observations highlight the importance of considering multiple prompting strategies when benchmarking LLMs. By averaging the results from various strategies, we can mitigate the impact of prompt sensitivities and ensure a more equitable comparison of LLM performance.

LLMs | Prompting Strategy | Acc (%) | SS
Yi-34B | Base | 73.19 | 1.40
Yi-34B | Shared Instruction | 69.87 | 1.46
Yi-34B | Task-specific Instruction | 71.35 | 1.42
Qwen-72B | Base | 58.30 | 1.70
Qwen-72B | Shared Instruction | 61.08 | 1.67
Qwen-72B | Task-specific Instruction | 62.51 | 1.61
Llama-2-70B | Base | 56.66 | 2.08
Llama-2-70B | Shared Instruction | 56.18 | 2.21
Llama-2-70B | Task-specific Instruction | 59.38 | 2.18
DeepSeek-67B | Base | 55.60 | 2.10
DeepSeek-67B | Shared Instruction | 56.66 | 2.03
DeepSeek-67B | Task-specific Instruction | 56.34 | 2.18

Table 7: Comparison of different prompting strategies on the DS task. The reported values of the SS metric are obtained using LAC as the conformal score function.

As illustrated by the example in Table 8, the three prompting strategies result in different prediction sets when APS is utilized as the conformal score function. This case study reveals that varying conformal score functions and prompting strategies can yield different measurements of uncertainty, even though the true answer is encompassed in the prediction sets. In practice, the mean results derived from these conformal score functions and prompting strategies can be used to achieve a more precise uncertainty quantification.

Question: Which of the following is thought to be implicated in the development of peripheral muscle fatigue during multiple sprint activities?
Choices:
A. An accumulation of inorganic phosphate.
B. Development of hyperosmolality in the muscles.
C. An excess of antioxidants.
D. A lack of potassium.
E. I don't know
F. None of the above
Correct Answer: A
Predicted Answer based on LAC:
Base Prompt: {A}
Shared Instruction Prompt: {A}
Task-specific Instruction Prompt: {A}
Predicted Answer based on APS:
Base Prompt: {A, B, D}
Shared Instruction Prompt: {A, F}
Task-specific Instruction Prompt: {A, E}

Table 8: An example of the prediction sets produced by the Yi-34B model on the QA task. We include results derived from both the LAC and APS score functions and the three prompting strategies. All generated prediction sets encompass the true answer.

D Prompt Templates

We provide the prompt templates employed by the three prompting strategies in Table 9, Table 10, and Table 11, respectively. For the QA task, there is no background information for each question. For the RC and CI tasks, each question has an associated contextual description, as indicated by the keyword "Context" in the prompt templates. For the DRS task and the DS task, we use "Dialogue" and "Document" rather than "Context" as the keywords to incorporate the dialogue history and document content into the prompt. When evaluating instruction-finetuned LLMs, we treat the entire prompt input as the message from users and then employ the "apply_chat_template" function to transform the prompt input into a chat format.
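The following is a rough sketch of how such a prompt can be assembled and, for chat models, converted with the tokenizer's apply_chat_template function; the wording is illustrative and does not reproduce the exact templates of Tables 9-11, and the model name is a placeholder:

```python
from transformers import AutoTokenizer

def build_base_prompt(question, options, context=None):
    """Assemble a base-style multiple-choice prompt; the exact wording of the paper's
    templates is given in Tables 9-11, so this layout is only illustrative."""
    lines = []
    if context:
        lines.append(f"Context: {context}")
    lines.append(f"Question: {question}")
    for letter, text in zip("ABCDEF", options):
        lines.append(f"{letter}. {text}")
    lines.append("Answer:")
    return "\n".join(lines)

prompt = build_base_prompt(
    "What is the capital of France?",
    ["Paris", "Lyon", "London", "Boston", "I don't know", "None of the above"],
)

# For instruction-finetuned (chat) models, the whole prompt is treated as a user
# message and converted with the tokenizer's chat template (placeholder model name).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
chat_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
)
print(chat_prompt)
```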
Table 9: Prompt template for the base prompting strategy. Note that for the QA task, there is no background
information pertaining to questions. For the RC and CI tasks, we adopt the keyword "Context" to include the
contextual information associated with each question. For the DRS and DS tasks, we employ the keywords
"Dialogue" and "Document" to incorporate the pertaining information, respectively.
Below are some examples of multiple-choice questions with six potential answers. For
each question, only one option is correct.
{Demonstrations in the same format as the question shown below except that the
true answers are provided for questions in the demonstrations}
Now make your best effort and select the correct answer for the following question. You
only need to output the option.
Table 10: Prompt template for the shared instruction prompting strategy. In this strategy, we add a shared general
task description at the beginning of the prompt. Note that for the QA task, there is no background information
pertaining to questions. For the RC and CI tasks, we adopt the keyword "Context" to include the contextual
information associated with each question. For the DRS and DS tasks, we employ the keywords "Dialogue" and
"Document" to incorporate the pertaining information, respectively.
(for the QA task) Below are some examples of multiple-choice questions about question
answering. Each question should be answered based on your world knowledge and problem
solving ability./
(for the RC task) Below are some examples of multiple-choice questions about reading
comprehension. Each question should be answered based on the given context and
commonsense reasoning when necessary./
(for the CI task) Below are some examples of multiple-choice questions about commonsense
natural language inference. For each question, there is a given context and the answer
is the option that most likely follows the context./
(for the DRS task) Below are some examples of multiple-choice questions about dialogue
response selection. For each question, the answer is the option that represents the most
suitable response for the given dialogue history, without hallucination and non-factual
information./
(for the DS task) Below are some examples of multiple-choice questions about document
summarization. For each question, the answer is the option that accurately summarizes
the given document without hallucination and non-factual information.
{Demonstrations in the same format as the question shown below except that the
true answers are provided for questions in the demonstrations}
Now make your best effort and select the correct answer for the following question. You
only need to output the option.
Table 11: Prompt template for the task-specific instruction prompting strategy. In this strategy, we add a task-specific
description at the beginning of the prompt. Note that for the QA task, there is no background information pertaining
to questions. For the RC and CI tasks, we adopt the keyword "Context" to include the contextual information
associated with each question. For the DRS and DS tasks, we employ the keywords "Dialogue" and "Document" to
incorporate the pertinent information, respectively.