arXiv:2312.05708
context tuning approach in RAG-based systems.[2]

3.2 Context Tuning

To compare various context retrieval methods, we employ both text-based and vector-based retrieval baselines. We simulate different context stores by structuring context data per persona and train models to perform federated search. We use query and persona meta-signals, such as frequency, usage history, and correlation with geo-temporal features, to perform retrieval. We evaluate context retrieval using the Recall@K and Normalized Discounted Cumulative Gain (NDCG@K) metrics.

BM25: For text-based search, we use an improved version of BM25 called BM25T (Trotman et al., 2014).

Semantic Search: For vector-based search, we employ the widely adopted semantic search approach. We use GTR-T5-XL (Ni et al., 2021) to generate query and context item embeddings, which are then ranked using cosine similarity to select the top-K results. We evaluate both pre-trained and fine-tuned variants of this method.

CoT Augmentation: To increase the likelihood of semantic alignment with pertinent contextual elements, we augment the under-specified or implicit query with GPT-4-generated CoT (OpenAI, 2023).[3] We evaluate both pre-trained and fine-tuned semantic search variants utilizing CoT.

LambdaMART with RRF: Reciprocal Rank Fusion (RRF) (Cormack et al., 2009) has been shown to outperform individual rank learning methods. To leverage this advantage, we propose a lightweight planning model that uses LambdaMART (Burges, 2010) for initial ranking of data across context stores, followed by re-ranking using RRF.

[2] Refer to Appendix A for more details on data generation.
[3] Please refer to Appendix A.6 for the GPT-4 prompt used and Table 5 for CoT examples.

3.3 Tool Retrieval

While advanced ranking models can enhance the recall of tool retrieval, we employ the pre-trained GTR-T5-XL model for semantic search using cosine similarity to retrieve the top-K tools. Extending the tool retrieval process to incorporate ranking should be straightforward. We evaluate tool retrieval performance with and without context retrieval using Recall@K.

3.4 Planner

The planner's objective is to select the most appropriate tool from the retrieved tool list and generate a well-formed plan. A plan comprises an API call constructed using the chosen tool and parameters extracted from the query and retrieved context. We fine-tune OpenLLaMA-v2-7B (Touvron et al., 2023) for plan generation. To assess the planner's performance, we employ the Abstract Syntax Tree (AST) matching strategy to compute plan accuracy. A hallucination is defined as a plan generated using an imaginary tool.

4 Results

4.1 Context Retrieval

Consistent with expectations, vector-based search surpasses text-based search, as shown in Table 1. Nevertheless, both approaches struggle to retrieve relevant context for under-specified queries. Fine-tuned semantic search and CoT augmentation with pre-trained semantic search both significantly enhance retrieval performance. Notably, when fine-tuning is employed, CoT augmentation yields only marginal gains, suggesting that comparable improvements could be achieved without augmenting the input sequence with CoT.
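The retrieval pipeline described in Section 3.2, where per-store rankings are fused with RRF and then scored with Recall@K and NDCG@K, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the context stores, item identifiers, and relevance labels are hypothetical, and fixed per-store rankings stand in for a trained LambdaMART scorer.

```python
import math
from collections import defaultdict

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion (Cormack et al., 2009): each list adds
    1 / (k + rank) to an item's score, with 1-based ranks; k=60 is the
    customary smoothing constant."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items retrieved in the top K."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@K: DCG of the ranking divided by the DCG
    of an ideal ranking that lists all relevant items first."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal

# Hypothetical per-store rankings over context items from three
# simulated context stores (stand-ins for LambdaMART output).
store_rankings = [
    ["calendar:standup", "mail:invite", "notes:gift-ideas"],  # calendar store
    ["mail:invite", "calendar:standup"],                      # mail store
    ["mail:invite", "notes:gift-ideas", "calendar:standup"],  # notes store
]
fused = rrf(store_rankings)
gold = {"mail:invite", "notes:gift-ideas"}
print(fused[0])                      # mail:invite (top-ranked in two lists)
print(recall_at_k(fused, gold, 2))   # 0.5
print(round(ndcg_at_k(fused, gold, 3), 3))
```

Because RRF only consumes ranks, not raw scores, it needs no score calibration across heterogeneous context stores, which is what makes it a convenient fusion step after per-store ranking.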
Table 1: A comparison of various context retrieval methods using Recall@K and NDCG@K metrics. The context-seeking query is used as input to perform a federated search across different context stores, after which semantic search or ranking is applied.

Our proposed approach utilizing LambdaMART with RRF outperforms both fine-tuned semantic search and CoT augmentation. Additionally, we observe that for fine-tuned methods, both Recall@K and NDCG@K increase with K, whereas for pre-trained methods, NDCG@K decreases as K and Recall@K increase.

4.2 Tool Retrieval

Figure 2 illustrates the performance of tool retrieval using semantic search. Incorporating relevant context into tool retrieval consistently yields substantial gains across various values of K.

Figure 2: Evaluation of tool retrieval using Recall@K, with and without context tuning.

4.3 Planner

To establish the planner's lower bound, we remove the retrieval step, while the upper bound is set by directly utilizing context and/or tool labels, effectively assuming perfect context and tool retrieval (Table 2).

Table 2: End-to-end planner evaluation both with and without context tuning. "Lower Bound" excludes retrieval and performs direct plan generation, while "Upper Bound" assumes perfect context and tool retrieval.

5 Conclusion

Our work introduces context tuning, a novel component that enhances RAG-based planning by equipping it with essential context-seeking capabilities to address incomplete or under-specified queries. Through a systematic comparison of various retrieval methods applied to both lightweight models and LLMs, we demonstrate the effectiveness of context tuning in improving contextual understanding. Our empirical observations reveal that CoT augmentation enhances context retrieval when fine-tuning is not applied, while fine-tuning the retrieval model eliminates the need for CoT augmentation. Furthermore, we observe that context augmentation at the plan generation stage reduces hallucinations. Finally, we showcase the superiority of our proposed lightweight model using RRF with LambdaMART over the GPT-4-based system.

Limitations

The current work does not utilize conversation history, which is crucial for handling explicit multi-turn instructions that contain anaphora or ellipsis. This limitation also hinders the model's ability to effectively process and respond to complex tasks that require multi-hop context retrieval.
Additionally, the absence of conversation history impedes the model's ability to adapt to topic shifts that may occur throughout a dialogue.

Furthermore, the performance of the planner model is constrained by the length of the context window. While employing LLMs with longer context windows can enhance performance, it also increases model size and computational complexity. To address this limitation, incorporating context compression techniques could potentially improve end-to-end performance without incurring significant increases in model size.

Due to privacy constraints, we simulated real-world data by generating synthetic user profiles and personas that mirrored real-world use cases for a digital assistant.

Ethics Statement

To safeguard privacy, this study exclusively utilizes synthetically generated data, eliminating the use of real user information under ethical considerations.

Acknowledgements

We would like to thank Stephen Pulman, Barry Theobald, and Joel Moniz for their valuable feedback.

References

Christopher J.C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An overview. Microsoft Research Technical Report MSR-TR-2010-82.

Gordon V. Cormack, Charles L. A. Clarke, and Stefan Buettcher. 2009. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 758–759.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training.

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. 2022. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks.

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. 2023. Chameleon: Plug-and-play compositional reasoning with large language models. arXiv preprint arXiv:2304.09842.

Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. 2023. Query rewriting for retrieval-augmented large language models. arXiv preprint arXiv:2305.14283.

Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y. Zhao, Yi Luan, Keith B. Hall, Ming-Wei Chang, and Yinfei Yang. 2021. Large dual encoders are generalizable retrievers.

OpenAI. 2023. GPT-4 technical report.

Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, and Michael Bendersky. 2023. Large language models are effective text rankers with pairwise ranking prompting. arXiv preprint arXiv:2306.17563.

Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.

Krishna Srinivasan, Karthik Raman, Anupam Samanta, Lingrui Liao, Luca Bertelli, and Mike Bendersky. 2023. QUILL: Query intent with large language models using retrieval augmentation and multi-stage distillation. arXiv preprint arXiv:2210.15718.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Andrew Trotman, Antti Puurula, and Blake Burgess. 2014. Improvements to BM25 and language models examined. In Proceedings of the 2014 Australasian Document Computing Symposium, pages 58–65.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
A Data Generation Details

A.1 Implicit Query Dataset

For our experiments, we created a synthetic dataset to simulate realistic interactions across various applications commonly found with digital assistants. The dataset is structured to encompass a diverse range of contexts, representing different synthetic user activities and interactions.

Data Points: A total of 791 unique personas were synthesized, covering seven key applications: Mail, Calendar, Google, Music, Reminders, Notes, and Phone Calls. The final dataset contained 4,338 train and 936 test data points.

Generation Method: We utilized GPT-4 to generate the data. We ensured high diversity in the dataset through manual inspection, which is essential to accurately reflect a wide range of synthetic user personalities and interaction patterns.

Data Representation: Each data point in the dataset contains multiple contextual information fields relevant to the specific application and synthetic user's activity. An example persona in JSON format is shown in Figure 3.

Table 3 shows the distribution of context items per application in our dataset.

Table 3: Distribution of context items per application.

Application   Avg. Context Items
Mail          2.93
Calendar      5.63
Google        9.57
Notes         2.23
Music         4.38
Reminders     4.81
Phonecall     2.34

A.2 Persona Data Creation Example Prompt

I'm working on generating synthetic data for a user (also known as persona) and the persona's iPhone Data.

Here are the characteristics of the persona that we would like to generate the data for:

age: 22
favorite_music_genre: Pop
favorite_movie_genre: Romance
favorite_cuisine: Italian
favorite_sport: Tennis
profession: Software Developer
hobbies: ['Cooking', 'Swimming', 'Reading']

I want to generate data for ios App called Music with bundle id as com.apple.music.
Can you generate around 5 recently played songs?

Instructions:
1. Today's date is 2023-12-07 11:18:19.028759. Please generate any times or dates in the past 15 days.
2. 'played_time' should be in yyyy-MM-dd HH:mm:ss.SSS format.

Here is the output schema:
```
{
  "$defs": {
    "MusicAppData": {
      "properties": {
        "recent_songs": {
          "items": {"$ref": "#/$defs/Song"},
          "title": "Recent Songs",
          "type": "array"
        },
        "current_playing": {"$ref": "#/$defs/Song"}
      },
      "required": ["current_playing"],
      "title": "MusicAppData",
      "type": "object"
    },
    "Song": {
      "properties": {
        "played_time": {"default": "", "title": "Played Time", "type": "string"},
        "album_title": {"default": "", "title": "Album Title", "type": "string"},
        "artist": {"default": "", "title": "Artist", "type": "string"},
        "song_name": {"default": "", "title": "Song Name", "type": "string"},
        "id": {"default": "", "title": "Id", "type": "string"}
      },
      "title": "Song",
      "type": "object"
    }
  },
  "properties": {
    "app_name": {"default": "", "title": "App Name", "type": "string"},
    "app_bundle_id": {"default": "", "title": "App Bundle Id", "type": "string"},
    "app_data": {"$ref": "#/$defs/MusicAppData"}
  },
  "required": ["app_data"]
}
```

Do not include any explanations, only provide a RFC8259 compliant JSON response following this format without deviation.

```
arguments:
  - text: text to be sent to the group
  - contacts: list of contacts in the group

api: search_messages
description: Messages App's search_messages API is used to
  search messages by text, recipient, sender.
arguments:
  - text: text to be searched.
  - recipient: search messages by recipient name
  - sender: Search messages by sender name

Similarly, can you generate the APIs for the following
Application: {application}?
Do not include any explanations. Only provide the APIs
in YAML format as above.
```

The following table represents the distribution of APIs: