OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
Abstract
Recent work has shown the immense potential of synthetically generated datasets for training
Solution Format. We use the code-interpreter format for the synthesized solutions (Figure 2). The code-interpreter format interweaves natural language reasoning with Python code blocks. It thus combines the computational precision of coding environments with the expressiveness of natural language reasoning, which is particularly suitable for mathematical tasks (Zhou et al., 2024; Gou et al., 2024). To demarcate the start and end of a code block, we use the strings <llm-code> and </llm-code>. A code block is followed by its execution block, which is demarcated by <llm-code-output> and </llm-code-output>. During inference, the model invokes the Python interpreter to run the preceding code block after generating </llm-code>, appends the execution result in between the <llm-code-output> separators, and resumes the autoregressive model inference.[6]

Approach. We use few-shot prompting to synthesize solutions for the training sets of GSM8K and MATH.

2.2 Prompting

In the previous section, we described our solution generation pipeline. A key ingredient of this pipeline is the few-shot prompt examples. We next describe the different prompting strategies explored in this work.

2.2.1 Default

We choose five representative examples of GSM8K and MATH to create the few-shot prompts for the respective datasets. For GSM8K, we use a mix of problems that require vanilla Python code and problems that are best solved using Python's sympy library. For MATH, we compose a 5-shot prompt with examples from different subjects. To reflect the diversity of reasoning paths required for MATH, we choose a mix of problems that require code-based solutions, text-based solutions, and a combination of both. The prompts used for the two datasets are shown in Appendix B.6.

For GSM8K, we sample 128 solutions per training problem, which gets a training set coverage of 99.1%. For MATH, we sample 224 solutions per training problem.

[6] During training, we don't mask the code execution output surrounded by the <llm-code-output> separators.
[7] https://github.com/NVIDIA/TensorRT-LLM
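To make the inference procedure concrete, the following is a minimal sketch of how such code-interpreter decoding could be wired up. The generate(text, stop) helper, the toy run_block sandbox, and all names here are assumptions for illustration, not the actual serving implementation.

import io
import contextlib

CODE_START = "<llm-code>"
CODE_END = "</llm-code>"
OUT_START = "<llm-code-output>"
OUT_END = "</llm-code-output>"

def run_block(code):
    # Execute one code block and capture what it prints (toy sandbox).
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # a real system would isolate and time-limit this
    return buf.getvalue().strip()

def code_interpreter_decode(prompt, generate):
    # generate(text, stop) is a hypothetical helper that continues the
    # model's generation from text until it emits stop or finishes.
    text = prompt
    while True:
        chunk = generate(text, stop=CODE_END)
        text += chunk
        if not chunk.endswith(CODE_END):
            return text  # solution finished without another code block
        # run the code between the last <llm-code> ... </llm-code> pair
        code = text.rsplit(CODE_START, 1)[1].rsplit(CODE_END, 1)[0]
        result = run_block(code)
        # splice in the execution block, then resume decoding
        text += "\n" + OUT_START + "\n" + result + "\n" + OUT_END + "\n"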
Table 2: Statistics of unique solutions generated by the prompts described in Section 2.2. Default refers to the single prompt used for the two benchmarks, Mask-Text refers to prompting the model with masked text solutions, and Subj refers to prompting with subject-specific prompts (applicable only to MATH). Coverage % refers to the percentage of problems in the training set for which there is at least one solution among the generated solutions.
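For illustration, the Coverage % metric can be computed as in the following sketch; the record layout (a "problem" field per correct solution) is an assumed schema, not the dataset's actual format.

def coverage_percent(train_problems, correct_solutions):
    # Coverage %: share of training problems with at least one
    # correct generated solution (Table 2's definition).
    solved = set(sol["problem"] for sol in correct_solutions)
    covered = sum(1 for p in train_problems if p in solved)
    return 100.0 * covered / len(train_problems)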
Figure 4: Histogram of the number of solutions for problems in a 64K downsampled subset of MATH instances in
OpenMathInstruct-1.
Table 3: Comparison of OpenMath-finetuned models with other finetuned models on mathematical reasoning benchmarks (accuracy in %). SC (k=50) denotes self-consistency decoding over 50 sampled solutions.

Size  Base Model  Model                      GSM8K  MATH  GSM-Hard  SVAMP  TabMWP  ASDiv  MAWPS
-     -           GPT-4 (Code Interpreter)    97.0  69.7    77.6     94.8    95.9   92.6   97.7
7B    Llama-2     WizardMath                  54.9  10.7     -       36.1     -      -      -
                  MetaMath                    66.4  19.4     -        -       -      -      -
                  MAmmoTH                     59.4  33.4     -       71.4     -      -      -
      CodeLlama   ToRA                        72.6  44.6    56.0     70.4    51.6   78.7   91.3
                    + SC (k=50)               76.8  52.5     -        -       -      -      -
                  OpenMath-CodeLlama          75.9  43.6    60.1     79.6    56.0   77.7   93.5
                    + SC (k=50)               84.8  55.6     -        -       -      -      -
      Mistral     MetaMath-Mistral-7B         77.7  28.2     -        -       -      -      -
                  MAmmoTH-7B-Mistral          75.0  40.0     -        -       -      -      -
                  WizardMath                  83.2  33.0     -        -       -      -      -
                  OpenMath-Mistral-7B         80.2  44.5    63.7     82.4    70.0   82.7   95.4
                    + SC (k=50)               86.9  57.2     -        -       -      -      -
13B   Llama-2     WizardMath                  63.9  14.0     -       51.9     -      -      -
                  MetaMath                    72.3  22.4     -        -       -      -      -
                  MAmmoTH                     64.7  36.3     -       73.7     -      -      -
      CodeLlama   ToRA                        75.8  48.1    60.5     75.7    65.4   81.4   92.5
                    + SC (k=50)               80.4  55.1     -        -       -      -      -
                  OpenMath-CodeLlama          78.8  45.5    61.9     78.8    59.7   81.2   93.6
                    + SC (k=50)               86.8  57.6     -        -       -      -      -
34B   CodeLlama   MAmmoTH                     72.7  43.6     -       84.3     -      -      -
                  ToRA                        80.7  51.0    63.7     80.5    70.5   84.2   93.3
                    + SC (k=50)               85.1  60.0     -        -       -      -      -
                  OpenMath-CodeLlama          80.7  48.3    64.0     83.6    66.0   82.7   94.9
                    + SC (k=50)               88.0  60.2     -        -       -      -      -
70B   Llama-2     WizardMath                  81.6  22.7     -       71.8     -      -      -
                  MetaMath                    82.3  26.6     -        -       -      -      -
                  MAmmoTH                     76.9  41.8     -       82.4     -      -      -
                  ToRA                        84.3  49.7    67.2     82.7    74.0   86.8   93.8
                    + SC (k=50)               88.3  56.9     -        -       -      -      -
                  OpenMath-Llama2             84.7  46.3    65.7     85.0    70.8   84.3   95.6
                    + SC (k=50)               90.1  58.3     -        -       -      -      -
      CodeLlama   OpenMath-CodeLlama          84.6  50.7    66.6     87.8    74.2   84.7   95.7
                    + SC (k=50)               90.8  60.4     -        -       -      -      -
Our models easily outperform both MAmmoTH and MetaMath, even when controlling for the base fine-tuned model. Since the WizardMath and ToRA finetuning datasets are not publicly available yet, OpenMathInstruct-1 presents a superior alternative to the publicly available MetaMathQA and MathInstruct datasets, which are used to fine-tune MetaMath and MAmmoTH, respectively.

With the increase in model parameters, our models continue to outperform MAmmoTH and MetaMath substantially. Compared to ToRA, with greedy decoding, we see a meaningful drop in performance on MATH, though our models are equal or better on GSM8K. With self-consistency (SC) decoding, however, our models outperform ToRA on both MATH and GSM8K. The substantial gains with SC can be attributed to the diversity of our fine-tuning data.
4.1 Ablations

We perform ablation experiments with Mistral-7B as the base model. We report results on the 1K-sized validation subsets for MATH and GSM8K created by us.

4.1.1 Fair vs Naive Downsampling

We finetune the base model on a dataset of 128K instances created by combining 64K naive or fair downsampled instances from the GSM8K and MATH portions of the data. Table 4 shows that the model fine-tuned on the data downsampled with fair sampling outperforms the one created by naive downsampling. The performance gap is particularly substantial for MATH, which suffers from a graver data imbalance than GSM8K in our corpus.

Table 4: Comparison of performance of fair vs naive sampling on our validation subsets of GSM8K and MATH.

Sampling   GSM8K   MATH
Random      74.3   35.0
Fair        75.3   37.0
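The text does not spell out the fair-downsampling algorithm, so the following is only a plausible sketch: draw solutions round-robin across problems so that heavily represented problems cannot crowd out rare ones. The per-record "problem" field is an assumed schema.

import random

def naive_downsample(instances, budget):
    # naive: pick budget problem-solution pairs uniformly at random
    return random.sample(instances, budget)

def fair_downsample(instances, budget):
    # fair (assumed): round-robin over problems, one solution at a time
    by_problem = {}
    for inst in instances:
        by_problem.setdefault(inst["problem"], []).append(inst)
    pools = [random.sample(sols, len(sols)) for sols in by_problem.values()]
    picked = []
    while len(picked) < budget and any(pools):
        for pool in pools:
            if pool and len(picked) < budget:
                picked.append(pool.pop())
    return picked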
4.1.2 Impact of Fine-Tuning Dataset Size

To determine the impact of the size of the fine-tuning dataset, we create datasets of size 128K/256K/512K by combining 64K/128K/256K fair downsampled subsets of GSM8K and MATH. Table 5 shows that the performance increases on both GSM8K and MATH with the increase in the fine-tuning dataset size. We didn't find benefit from training the models for more steps, so the performance gain is attributable to the increased data size.

Table 5: Effect of fine-tuning dataset size on performance on our validation subsets of GSM8K and MATH.

Dataset Size   GSM8K   MATH
128K            75.3   37.0
256K            79.0   38.6
512K            81.0   41.6

4.1.3 MATH-only Ablations

This section presents the ablation results for only the MATH portion of OpenMathInstruct-1. In all experiments, we finetune the base model on a 128K fair downsampled subset to control for the impact of data size.
Default vs Subject-Specific Prompting. In Section 2.2.2, we motivated using subject-specific prompts, which ultimately didn't result in much training set coverage difference. But how are the solutions generated by the combination of subject-wise prompts different from those generated by a single default prompt? To answer this, we create subsets of 128K instances generated with the default prompt and with the subject-specific prompts. Table 6 compares the finetuning performance on these two splits on our MATH validation subset. While the model trained on the subject-specific subset trails the model trained on the default subset for greedy decoding, the trend is decisively reversed for self-consistency decoding with four samples. This suggests that the subset collected with subject-specific prompts has a higher diversity of solutions than the one collected using a single prompt.

Table 6: Comparison of default vs subject-wise prompt performance on our MATH validation subset.

Prompt    Pass@1   SC (k=4)
Default     39.1     41.7
Subject     38.3     44.5

Code-Preferential Subsets. In this ablation, we determine the impact of the code-preferential solution selection strategies proposed in Section 2.4.2. Table 7 shows that code-preferential solution strategies aid the greedy decoding performance. However, the reduction in solution diversity arguably results in decreased performance with self-consistency decoding (text-based solutions are only a third of the original corpus to begin with). Based on these results, and because any-code results in a smaller finetuning dataset (512K compared to 664K with majority-code), we chose to use the any-code subset in our finetuning data blend.

Table 7: Impact of code-preferential data selection on our MATH validation subset performance.

Selection       Pass@1   SC (k=4)
Default           37.4     45.2
Majority-Code     39.8     42.6
Any-Code          39.4     42.6
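Since Section 2.4.2 is not reproduced here, the following sketch only illustrates one plausible reading of the two selection strategies, applied per problem; the is_code flag and the exact rules are assumptions.

def select_solutions(solutions, strategy):
    # solutions: the correct solutions for a single problem, each
    # assumed to carry an is_code flag; strategy names follow Table 7
    code = [s for s in solutions if s["is_code"]]
    text = [s for s in solutions if not s["is_code"]]
    if strategy == "any-code":
        # keep only code solutions whenever at least one exists
        return code if code else text
    if strategy == "majority-code":
        # prefer code solutions only when they form the majority
        return code if len(code) > len(text) else solutions
    return solutions  # default: keep everything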
5 Analysis

We analyze the performance of the ablation model trained on 512K instances from Section 4.1.2. We focus our discussion on the MATH benchmark, where this model scores 41.6% on our MATH validation subset.

Table 8: Performance split based on solution format. Solutions without the <llm-code-output> string are considered text-based.

Solution Type   Accuracy (in %)   Count
Text-based            32.0          278
Code + Text           45.3          722
Total                 41.6         1000

5.1 Performance-split by Subjects and Levels

Figure 5: Performance split by subjects and levels on our MATH validation subset. (a) Subject-wise performance; (b) Level-wise performance.
Figure 5 presents the performance split by subjects and levels on the MATH validation subset. Among subjects, we see that the model's worst performance is on geometry, which can be attributed to the lack of multi-modality in our base models (Zhou et al., 2024). We see a monotonic decrease in performance with the increase in hardness level, which is to be expected (Zhou et al., 2024). The model scores 72.4% on Level 1 problems and only 16.3% on the hardest problems, i.e., Level 5.

5.2 Error Analysis

Table 8 shows that the model performs an absolute 13.3% better when using code for answering questions in comparison to when not using it. We find that some of the errors made by text-based solutions could have been avoided by preferring code-based solutions; see Figure 15 for a sample solution where the model makes an arithmetic calculation error. This analysis provides further support for our proposal and use of code-preferred solutions from Section 2.4.2.

Table 9 presents the counts of the different error categories. For code-based solutions, we find that almost 74% of the errors in such solutions are due to reasoning errors, and the remaining 26% are attributable to execution-related issues. We present sample solutions from these error categories in Appendix B.3.

Table 9: Types of errors and their counts.

Error Type                      Count
Text Reasoning Error              189
Code Reasoning Error              292
Code Execution Error               78
Code timeout                       15
Max code executions reached        10
Total                             584
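A rough sketch of how incorrect solutions could be bucketed into Table 9's categories is given below; the category names come from Table 9, but the string-matching rules and the MAX_CODE_EXECUTIONS limit are illustrative assumptions, not the exact procedure.

MAX_CODE_EXECUTIONS = 3  # assumed limit on code blocks per solution

def classify_error(solution):
    # text-based solutions can only fail on reasoning
    if "<llm-code>" not in solution:
        return "Text Reasoning Error"
    if solution.count("<llm-code>") > MAX_CODE_EXECUTIONS:
        return "Max code executions reached"
    last_output = solution.rsplit("<llm-code-output>", 1)[-1]
    if "Traceback" in last_output:
        return "Code Execution Error"
    if "timed out" in last_output.lower():
        return "Code timeout"
    # the code ran fine but the final answer was still wrong
    return "Code Reasoning Error"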
6 Related Work

Mathematical Reasoning and LLMs. Recently, a plethora of work has been done on enhancing the mathematical reasoning capabilities of LLMs. Inference techniques such as Chain-of-Thought (Wei et al., 2022), its programmatic counterpart, Program of Thought (Gao et al., 2023; Chen et al., 2023b), Self-Consistency (Wang et al., 2023), and Self-Verification (Zhou et al., 2024) have been shown to significantly improve the reasoning capabilities of LLMs without any further training.

Pretraining language models on math-heavy content has resulted in foundation LLMs such as Minerva (Lewkowycz et al., 2022), Galactica (Taylor et al., 2022), and Llemma (Azerbayev et al., 2023) with stronger mathematical skills out-of-the-box. A more direct approach of dataset-specific training does instruction fine-tuning on problem-solution pairs derived from math reasoning datasets. Our work falls in this latter category and bears similarity with recent work such as RFT (Yuan et al., 2023), ToRA (Gou et al., 2024), MAmmoTH (Yue et al., 2024), MetaMath (Yu et al., 2024), and MathCoder (Wang et al., 2024).
We differ from the previous work along one factor or a combination of the following factors: (a) reliance on GPT-3.5/4, (b) solution format, and (c) use of ground truth text solutions in synthesizing code-based solutions.

Knowledge Distillation via Synthetic Data. Recent work exploring the use of targeted synthetic data generated by large foundation models for pretraining/instruction tuning smaller LLMs has led to tremendous progress in the reasoning skills of these smaller LLMs (Gunasekar et al., 2023; Li et al., 2023; Eldan and Li, 2023; Mukherjee et al., 2023; Xu et al., 2023; Liu et al., 2023).
7 Conclusion

We introduce OpenMathInstruct-1, a math instruction tuning dataset with 1.8M problem-solution pairs and a commercially permissive license. Compared to previous work, OpenMathInstruct-1 is at least four times bigger. The problems are taken from the training sets of the GSM8K and MATH benchmarks, and the solutions are synthesized by few-shot prompting the Mixtral model. With our proposed prompting novelty of using masked text solutions, and some brute-force scaling, we achieve training set coverage of 99.9% for the GSM8K benchmark and 93% for the challenging MATH benchmark. The quality of these synthesized solutions is illustrated by finetuning experiments, which show models achieving performance comparable to or outperforming their gpt-distilled counterparts. To support the open-source efforts in this direction, we publicly release all our fine-tuned models, code, and the OpenMathInstruct-1 dataset, along with a further 6.6M incorrect sampled solutions.
References

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems.

Ronen Eldan and Yuanzhi Li. 2023. TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv.

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. PAL: Program-aided Language Models. In ICML, pages 10764-10799. PMLR.

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. 2024. ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving. In ICLR.

Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. 2023. Textbooks Are All You Need. arXiv.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring Mathematical Problem Solving With the MATH Dataset. NeurIPS.

Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. arXiv preprint arXiv:1207.0580.

Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2023. Large Language Models Can Self-Improve. In EMNLP.

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2023. Code Llama: Open Foundation Models for Code. arXiv.

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A Large Language Model for Science.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models.

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2024. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. In ICLR.

Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. 2023. Scaling Relationship on Learning Mathematical Reasoning with Large Language Models.

Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2024. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning. In ICLR.

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. 2022. STaR: Bootstrapping Reasoning With Reasoning. In NeurIPS.

Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, and Hongsheng Li. 2024. Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification. In ICLR.
A.3 Frequency of Solutions per Problem in OpenMathInstruct-1

Figure 12 shows the histogram of the frequency of the number of solutions in OpenMathInstruct-1 for problems in GSM8K and MATH. For GSM8K, 57.4% of problems (4292 out of 7473) have more than half of the sampled solutions valid, i.e., more than 128 valid solutions out of 256 sampled solutions. For MATH, the solution frequency histogram mimics a power-law curve, with most problems having very few solutions and very few problems having many solutions. Concretely, 19% of problems (1324 out of 6978) with any solutions in OpenMathInstruct-1 have at most 10 valid solutions out of the 896 sampled solutions per problem. To further highlight the hardness of the MATH benchmark in comparison to GSM8K, just 3.1% of problems (233 out of 7500) have more than half of the sampled solutions valid (more than 448 out of 896 sampled solutions).
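The histogram data behind a Figure 12-style plot can be computed as in this small sketch; the per-record "problem" field is an assumed schema.

from collections import Counter

def solution_histogram(records):
    # records: one entry per valid problem-solution pair
    per_problem = Counter(rec["problem"] for rec in records)
    # map: number of valid solutions -> number of problems
    return Counter(per_problem.values())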
Code + Text Reasoning

Question:
Let $p(x) = \sqrt{-x}$, and $q(x) = 8x^2 + 10x - 3$. The domain of $p(q(x))$ can be written in the form $a \le x \le b$. Find $b - a$.

Generated Solution:
Let's use sympy to solve for the domain of $p(q(x))$.
<llm-code>
import sympy as sp

# define the symbols
x = sp.symbols('x')

# define the functions
p = sp.sqrt(-x)
q = 8*x**2 + 10*x - 3

# solve for the domain of p(q(x))
domain = sp.solve(q >= 0, x)

# print the domain
print(domain)
</llm-code>
<llm-code-output>
((1/4 <= x) & (x < oo)) | ((-oo < x) & (x <= -3/2))
</llm-code-output>
So the domain is $x \in [1/4, \infty) \cup (-\infty, -3/2)$.
The difference between the upper and lower bounds is $b - a = 1/4 - (-3/2)$.

Figure 8: Sample solution where the model performs the initial reasoning in code and performs the last reasoning steps to answer the question in natural language. This shows the strength of the code-interpreter solution format for mathematical reasoning.

Shortcut Solution

Question:
Jolene and Phil have four children, each with the same birthday. They gave birth to their first child exactly 15 years ago. They gave birth to their second child exactly one year after the birth of their first child. They gave birth to their third child on the fourth birthday of their second child. Two years after the birth of their third child, they gave birth to their fourth child. How old, in years, is their fourth child?

Reference Solution:
Their 1st child was born 15 years ago, and therefore is 15 years old.
Their 2nd child was born 1 year after their 15-year-old child, and therefore is 15-1=14 years old.
Their 3rd child was born 4 years after their 14-year-old child, and therefore is 14-4=10 years old.
Their 4th child was born 2 years after their 10-year-old child, and therefore is 10-2=8 years old.

Generated Solution:
Let's write down a python script to answer this problem.
<llm-code>
children = ['first', 'second', 'third', 'fourth']
child_age = [15, 14, 10, 8]
number_of_children = len(children)
children = children[:-1] # let's get rid of the youngest one since we already know that one
dictionary = dict(zip(children, child_age))
dictionary
</llm-code>
<llm-code-output>
{'first': 15, 'second': 14, 'third': 10}
</llm-code-output>
The answer is 8 years old.

Figure 9: Sample shortcut solution generated from prompting the LLM with the reference solution in context. The code initializes the variable child_age with the children's ages copied from the reference solution. The generated code then performs ad-hoc computations before answering the question in the text.

B Miscellaneous
B.1 Training Hyperparameters

Table 11 details the hyperparameters used for finetuning the different base models. Due to compute limitations, for 70B models, we limit the training to 2 epochs for Llama2 and 3 epochs for CodeLlama. For CodeLlama models, we use the learning rate of 2e-5 used by ToRA models (Gou et al., 2024). For the Mistral-7B model, we use a lower learning rate of 1e-6 based on our hyperparameter search.

Table 11: Details of training hyperparameters for fine-tuning the different base models. LR = Learning rate, TP = Tensor Parallel, PP = Pipeline Parallel.

Model           Epochs   LR     # of GPUs   TP   PP
Mistral-7B         4     1e-6       64       4    1
CodeLlama-7B       4     2e-5       64       4    1
CodeLlama-13B      4     2e-5       64       4    1
CodeLlama-34B      4     1e-5      128       8    1
Llama 2-70B        2     1e-5      256       8    2
CodeLlama-70B      3     1e-5      256       8    2

B.2 Sample Solutions

In this section, we illustrate sample solutions representative of different phenomena encountered during the creation of OpenMathInstruct-1.

• Figure 8 shows a sample solution that utilizes the strength of the code-interpreter solution format.
• Figure 10 illustrates a sample solution where the solution goes beyond answering the question, with the model generating coherent but unrelated text for the input problem.

• Figure 11 shows a sample solution where the generated solution gets the right answer but through flawed reasoning. These semantically noisy solutions are much harder to detect with simple syntactic filters. One solution might be to use models like GPT-4 to grade the generated solutions as done in recent work (Gunasekar et al., 2023). We leave the work of developing such semantic filters for future work.

Figure 10: Solution Requiring Trimming. Question: Caroline can make eleven lassis out of two mangoes. How many lassis can she make out of twelve mangoes?

Figure 11: Flawed Reasoning. Question: The areas of two squares are in the ratio 25:36. What is the ratio of their perimeters? Express your answer in the form a:b.
Figure 12: Histogram of the number of solutions for problems in GSM8K and MATH. (a) GSM8K; (b) MATH.
B.3 Error Analysis of Solutions Generated by Fine-tuned Model

In this section, we illustrate instances of the different kinds of errors made by the ablation model analyzed in Section 5.

• Figure 13 shows a sample solution where the code generated in the solution runs into an execution error. Nevertheless, the model still generates an incorrect answer to the question.

• Figure 14 demonstrates a sample where the model performs correct reasoning while generating the code. However, the model falters at copying the code output and ends up generating a new answer.

• Figure 15 illustrates a sample where the model performs correct reasoning but falters in arithmetic calculation (multiplication). Failure at arithmetic computation has been a known issue with LLMs and justifies our choice of preferring code-based solutions.

B.4 Instructions for Few-shot Data Generation

Table 12 details the instructions used for the different generation tasks.

Table 12: Instructions for prompting the model.

Task: Few-shot prompt (I)
Instruction: Here are some examples of questions and solutions followed by a new question that you need to solve. Make sure to put the answer (and only answer) inside \boxed{}.

Task: Few-shot prompt text masking (I_mask)
Instruction: Here are some examples of questions, solutions, and their masked solutions followed by a new question and solution that you need to mask. The goal is to ensure that the masked solution doesn't have any of the numerical values not mentioned in the question. So intermediate values calculated in the solution are to be masked by single letter capital variables, such as M, N.

Task: Zero-shot prompt for fine-tuned model
Instruction: System: You're an expert Python programmer and mathematician. Help the user to solve this problem using code when necessary. Make sure to put the answer (and only answer) inside \boxed{}.
B.5 Masked Text Solution Generation

We generate masked text solutions using a pipeline very similar to the solution generation pipeline. We use the following procedure:

• Generate eight candidate masked solutions.

• Filter out solutions that have very different lengths.

• Filter out solutions that have the final answer. This ensures that our masked solutions are at least masking the final answer.

• Rank solutions where higher-ranked solutions have fewer numbers. This is essentially a heuristic to ensure we pick solutions where the intermediate computation is masked out the most (a sketch of this procedure follows the list).
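A small sketch of this filtering and ranking, with the exact thresholds as assumptions rather than the pipeline's actual values:

import re
import statistics

NUM_RE = re.compile(r"\d+(?:\.\d+)?")

def rank_masked_solutions(candidates, final_answer):
    # drop candidates that still leak the final answer
    kept = [c for c in candidates if str(final_answer) not in c]
    # drop candidates with outlier lengths (0.5x-1.5x of the median
    # is an assumed threshold; the text does not give the exact rule)
    if len(kept) > 2:
        median_len = statistics.median(len(c) for c in kept)
        kept = [c for c in kept
                if 0.5 * median_len <= len(c) <= 1.5 * median_len]
    # rank: fewer leftover numbers means more intermediates masked
    return sorted(kept, key=lambda c: len(NUM_RE.findall(c)))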
Code Execution Error

Question:
The area of a triangle is 600 square feet. Find the altitude, in feet, of the triangle if the length of the corresponding base is 30 feet.

Generated Solution:
Let's use sympy to solve this problem.
<llm-code>
from sympy import symbols, Eq, solve

# define the variables
base, altitude = symbols('base altitude')

# area of the triangle
area_eq = Eq(base * altitude / 2, 600)

# length of the corresponding base
base_eq = Eq(base, 30)

# solve the equations
solutions = solve((area_eq, base_eq), (base, altitude))

# print the altitude
print(solutions[altitude])
</llm-code>
<llm-code-output>
TypeError                                 Traceback (most recent call last)
File <ipython-input-1-f95732badac7>:16
     13 solutions = solve((area_eq, base_eq), (base, altitude))
     15 # print the altitude
---> 16 print(solutions[altitude])
</llm-code-output>
So the altitude is 20.

Figure 13: Sample solution where the code generated in the solution runs into an execution error; nevertheless, the model still generates an (incorrect) answer.

Last Step Mishap

Question:
What is the 100th digit to the right of the decimal point in the decimal representation of 13/90?

Generated Solution:
We can use sympy to calculate the decimal representation of 13/90 and then extract the 100th digit.
<llm-code>
from sympy import Rational, N

# Calculate the decimal representation of 13/90
decimal_rep = N(Rational(13, 90), 100)

# Extract the 100th digit
digit = int(str(decimal_rep)[-1])

print(digit)
</llm-code>
<llm-code-output>
4
</llm-code-output>
So the 100th digit to the right of the decimal point in the decimal representation of 13/90 is 7.

Figure 14: The model performs correct reasoning till code generation. But rather than copying the code output (4), it concludes with a different answer (7).
Figure 15 (excerpt): Sample solution with an arithmetic calculation error.

Generated Solution:
We can choose a starting lineup in 12 x 11 x 10 x 9 x 8 = 11880 ways.
(The stated product is incorrect: 12 x 11 x 10 x 9 x 8 = 95040; 11880 omits the final factor of 8.)

Listing 1: Few-shot prompt for GSM8K.
Question:
Missy had a giant piggy bank in her bedroom. Every day she would search the house for change
to put in her bank. After 4 years, the bank was opened and it contained $450 in change. If
the second, third, and fourth-year she doubled the amount of money she put in the bank from
the amount she had put in the previous year, how much money, in dollars, did she put in the
bank the first year?
My solution:
Missy started with some amount of money in the first year.
She was then doubling the amount of money every year for 4 years and ended up with $450.
Let's write down an equation for this problem and solve it using sympy.
<llm-code>
from sympy import solve, symbols
first_year_money = symbols('first_year_money')
second_year_money = 2 * first_year_money
third_year_money = 2 * second_year_money
fourth_year_money = 2 * third_year_money
# Solve for first year money given that the combined saving is 450
result = solve(first_year_money + second_year_money + third_year_money + fourth_year_money -
450, first_year_money)[0]
result
</llm-code>
<llm-code-output>
30
</llm-code-output>
Thus Missy put \boxed{30} dollars in the bank the first year.
Question:
Pete has to take a 10-minute walk down to the train station and then board a 1hr 20-minute
train to LA. When should he leave if he cannot get to LA later than 0900 hours? (24-hr time)
My solution:
Since Pete needs to take a 10 minutes walk and then a 1 hour 20 minutes train ride, he will
spend a total of 1 hour and 30 minutes.
This means that he needs to leave 1 hour and 30 minutes earlier than 09:00 hours.
Subtracting 1 hour and 30 minutes from 09:00 hours we get \boxed{07:30} hours.
Question:
Mark deposited $88 in a bank. Bryan deposited $40 less than five times as much as Mark. How
much did Bryan deposit in the bank?
My solution:
Let's solve this problem using Python code.
<llm-code>
mark_deposit = 88
five_times_mark_deposit = 5 * mark_deposit
bryan_deposit = five_times_mark_deposit - 40
bryan_deposit
</llm-code>
<llm-code-output>
400
</llm-code-output>
So Bryan deposited \boxed{400} dollars in the bank.
Question:
A department store displays a 20% discount on all fixtures. What will be the new price of a
25 cm high bedside lamp that was worth $120?
My solution:
Let's solve this problem using Python code.
<llm-code>
discount_percent = 20
price_before_discount = 120
discount_portion = discount_percent / 100
discount_amount = price_before_discount * discount_portion
price_after_discount = price_before_discount - discount_amount
price_after_discount
</llm-code>
<llm-code-output>
96.0
</llm-code-output>
So the new price of the lamp is \boxed{96} dollars.
Question:
James opens up a flower shop. He needs to pay rent of $1200 a week with an additional 20% of
rent to pay for utilities and he has 2 employees per shift with the store open 16 hours a day
for 5 days a week. If he pays each employee $12.50 an hour, what are his weekly expenses to
run the store?
My solution:
The cost consists of rent, utilities, and employee salaries. Let's compute each of them
separately and then add them up.
<llm-code>
# rent cost
rent_per_week = 1200
# utility cost
utility_per_week = rent_per_week * 20 / 100
# employee cost
employee_work_hours = 16
work_days_per_week = 5
employee_work_hours_per_week = work_days_per_week * employee_work_hours
number_of_employees = 2
employee_cost_per_hour = 12.5
employees_cost_per_week = number_of_employees * employee_work_hours_per_week *
employee_cost_per_hour
# add the three to get total cost
cost_per_week = rent_per_week + utility_per_week + employees_cost_per_week
cost_per_week
</llm-code>
<llm-code-output>
3440.0
</llm-code-output>
Thus James's weekly expenses add up to \boxed{3440} dollars.
Question:
Tim has 30 less apples than Martha, and Harry has half as many apples as Tim. If Martha has
68 apples, how many apples does Harry have?
My solution:
Listing 2: Few-shot prompt for MATH.
Here are some examples of questions and solutions followed by a new question that you need to
solve.
Make sure to put the answer (and only answer) inside \boxed{}.
Question:
A parabola with equation $y=x^2+bx+c$ passes through the points $(-1,-11)$ and $(3,17)$. What
is $c$?
My solution:
Let's write down an equation for the parabola and solve for $c$ using sympy.
<llm-code>
import sympy as sp
Question:
Let $f(x)$ be an odd function. Is $f(f(x))$ even, odd, or neither?
My solution:
The definition of an odd function is that $f(-x) = -f(x)$.
Applying this to $f(f(-x))$ we get $f(f(-x)) = f(-f(x)) = -f(f(x))$.
Thus, $f(f(x))$ is an \boxed{odd} function.
Question:
At the 2007 Math Olympics, Team Canada won $17$ out of a possible $100$ medals. Which one of
the following is closest to the fraction of medals that they won? $$
\frac{1}{4} \qquad \frac{1}{5} \qquad \frac{1}{6} \qquad \frac{1}{7} \qquad \frac{1}{8}
$$
My solution:
Let's use sympy to print out the differences between the fraction of medals that Canada won
and each of the options.
<llm-code>
from sympy import Rational, Abs
# list of options
options = [Rational(1, 4), Rational(1, 5), Rational(1, 6), Rational(1, 7), Rational(1, 8)]
Question:
A rectangular box $P$ is inscribed in a sphere of radius $r$. The surface area of $P$ is 384,
and the sum of the lengths of its 12 edges is 112. What is $r$?
My solution:
Let the dimensions of $P$ be $x$, $y$, and $z$.
The sum of the box's edges is $4(x + y + z)$ and the surface area is $2xy + 2yz + 2xz$.
The diameter of the sphere is $2r$ and it's equal to the diagonal of the box.
Let's now write down the equations based on the above information and solve them using sympy.
<llm-code>
from sympy import symbols, Eq, solve
Question:
A bee starts flying from point $P_0$. She flies $1$ inch due east to point $P_1$. For $j \ge
1$, once the bee reaches point $P_j$, she turns $30^{\circ}$ counterclockwise and then flies
$j+1$ inches straight to point $P_{j+1}$. When the bee reaches $P_{2015},$ how far from $P_0$
is she, in inches?
My solution:
We can represent the rotation action via $\omega = e^{\pi i/6}$.
Let's assume the bee starts at the origin, so $P_{2015}$ is at the point \[z = 1 + 2 \omega +
3 \omega^2 + 4 \omega^3 + \dots + 2015 \omega^{2014}.\]
This is an arithmetic-geometric series which we can solve by simplifying the expression.
Alternatively, we can solve for |z| using sympy using the following code.
<llm-code>
from sympy import I, pi, exp, sqrt

# rotation by 30 degrees counterclockwise, as defined above
omega = exp(I * pi / 6)

position = 0
for i in range(2015):
    delta = (i + 1) * omega**(i)
    position += delta
Question:
If $f(x) = x^2 - 1$, what is the value of $f(-1)$?
My solution: