
RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models


Ronak Pradeep∗, Sahel Sharifymoghaddam∗, Jimmy Lin
David R. Cheriton School of Computer Science, University of Waterloo, Canada
{rpradeep, sahel.sharifymoghaddam, jimmylin}@uwaterloo.ca
∗ Equal contribution.

Abstract

Researchers have successfully applied large language models (LLMs) such as ChatGPT to reranking in an information retrieval context, but to date, such work has mostly been built on proprietary models hidden behind opaque API endpoints. This approach yields experimental results that are not reproducible and non-deterministic, threatening the veracity of outcomes that build on such shaky foundations. To address this significant shortcoming, we present RankVicuna, the first fully open-source LLM capable of performing high-quality listwise reranking in a zero-shot setting. Experimental results on the TREC 2019 and 2020 Deep Learning Tracks show that we can achieve effectiveness comparable to zero-shot reranking with GPT3.5 with a much smaller 7B parameter model, although our effectiveness remains slightly behind reranking with GPT4. We hope our work provides the foundation for future research on reranking with modern LLMs. All the code necessary to reproduce our results is available at https://github.com/castorini/rank_llm.

1 Introduction

The widespread availability of instruction fine-tuned large language models (LLMs) has led to an explosion of applications in various natural language processing and information retrieval tasks. In the context of text retrieval, we have seen multiple efforts focused on zero-shot listwise reranking using LLMs (Sun et al., 2023; Ma et al., 2023), but unfortunately, to date, they have all relied on proprietary models. While such models support rapid prototyping, particularly when exposed as API endpoints, the reproducibility of experimental results that build on them is suspect—both from the normative perspective of what is “good science” and the practical perspective of obtaining reliable and deterministic measurements of experimental results. It would, of course, be desirable for the community to have access to a fully open-source LLM and associated code infrastructure capable of performing high-quality reranking.

RankVicuna provides exactly this: To our knowledge, we present the first open-source large language model for zero-shot listwise document reranking. Experimental validation on test collections from the TREC 2019 and 2020 Deep Learning Tracks (Craswell et al., 2020, 2021) shows that the effectiveness of our model is on par with zero-shot reranking using GPT3.5, but slightly worse than reranking with GPT4. However, we can achieve these results with a much smaller model with only 7B parameters while still constrained to a GPT3.5 teacher. We share our model checkpoints and associated code, providing a valuable resource for the research community.

During the process of building RankVicuna, we have gained several important insights that we share: First, we confirm that proprietary LLMs are indeed effective at reranking in a zero-shot manner (Sun et al., 2023; Ma et al., 2023), although they exhibit several shortcomings. Beyond the obvious issue of non-reproducibility, results from these models are also non-deterministic, which makes them unreliable for rigorous scientific research. Additionally, proprietary LLMs occasionally fail to follow the requested format in their responses. In contrast, RankVicuna is open-source, deterministic, and always generates well-formed responses.

Second, we examine the impact of first-stage retrieval methods on downstream reranking effectiveness and find that RankVicuna consistently improves over the baseline retrieved results. We also find that with an effective first-stage retriever, even a single pass with reranking only the top 20 candidates brings an improvement similar to reranking the top 100 candidates.

Finally, our experiments shed some light on the importance of training strategies that involve data augmentation to ensure model robustness against shuffled candidates or variations in initial retrieval quality. However, we note that data augmentation techniques affect the quality of model outputs under “ideal” conditions, and thus we face an effectiveness–robustness tradeoff.

Our work lays a solid foundation for future research. By making our models and infrastructure available to the public, we hope to stimulate further exploration and innovation in reranking. We anticipate that our findings will guide researchers in developing more effective and efficient reranking models. As the demand for accurate and reliable information retrieval systems continues to grow in this age of retrieval-augmented LLMs, we expect our work to contribute to future advances.

2 Background and Related Work

Given a corpus C = {D1, D2, ..., Dn} containing a collection of documents and an information need expressed as a query q, the task of a retriever is to efficiently return a list of k documents from C that are most relevant to the query q according to some metric such as nDCG or average precision, where k ≪ |C|. The task of a reranker is to further improve the quality of the ranked list produced by the retriever or another upstream reranker, according to either the same or a different metric.
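To make the evaluation criterion mentioned above concrete, the following pure-Python sketch computes nDCG@k for a single query from a ranked list of document identifiers and graded relevance judgments. It is an illustration only, not code from the paper; the names ranking and qrels are ours.

import math

def dcg_at_k(gains, k):
    # Discounted cumulative gain over the first k positions.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranking, qrels, k=10):
    # ranking: doc ids in ranked order; qrels: {doc_id: graded relevance}.
    gains = [qrels.get(doc_id, 0) for doc_id in ranking]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: a five-document ranking scored against graded judgments.
print(ndcg_at_k(["d3", "d1", "d7", "d2", "d9"], {"d1": 3, "d2": 2, "d3": 1}, k=10))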
Retrievers and rerankers together form multi-stage ranking pipelines for text ranking, which have been studied in the context of transformer models (Nogueira et al., 2019; Gao et al., 2021) but date back well over a decade (Matveeva et al., 2006; Cambazoglu et al., 2010; Wang et al., 2011). Nogueira and Cho (2019) were the first to demonstrate the use of (encoder-only) transformer models for reranking (using BERT) with a simple cross-encoder architecture they called monoBERT. While neural rerankers had been explored extensively by researchers prior to the advent of BERT, the monoBERT model represented a significant advance in effectiveness; see Lin et al. (2021b) for a historical overview.

Following monoBERT, other researchers have explored reranking using decoder-only transformer models (Nogueira dos Santos et al., 2020) and full encoder–decoder models (Nogueira et al., 2020; Zhuang et al., 2022). These approaches are effective but require copious amounts of training data in the form of (query, relevant passage) pairs; often, the MS MARCO dataset (Bajaj et al., 2016) is used for such purposes. Most of the early work on reranking with transformers can be characterized as a pointwise approach, where the relevance of a particular candidate document is estimated independently of others.

More recently, however, researchers have addressed this shortcoming by incorporating pairwise and listwise losses in their cross-encoder approaches (Gao et al., 2021; Pradeep et al., 2022b; Zhuang et al., 2022). Using hard negatives in combination with such losses yields systems that are better at reranking in high-precision settings and that align more closely to the first-stage retriever.

In contrast, our work focuses on the zero-shot setting, where the model is not provided any task-specific supervised training (e.g., relevant query–passage pairs). We build on a recent thread of work (Sun et al., 2023; Ma et al., 2023; Qin et al., 2023) that directly uses LLMs as rerankers in a multi-stage ranking pipeline, primarily focusing on prompt engineering to accomplish the reranking task. We coin the term “prompt-decoders” (in contrast to BERT-style cross-encoders) to characterize this class of rerankers. Furthermore, since these models are neither fine-tuned nor given in-context examples, we might describe this type of reranking model as a zero-shot prompt-decoder. To use an open-source LLM as a prompt-decoder, Qin et al. (2023) adopted a pairwise approach since FLAN-UL2 is not capable of reordering a list of input documents. We find the same shortcoming to be also true for Vicuna, but we address this by using RankGPT3.5 as its teacher.

Rerankers depend on an upstream source to supply candidate documents, which can be a first-stage retriever or another reranker. In all our experiments, we rely on a first-stage retriever to generate a candidate list of documents from the corpus. Researchers have explored a variety of sparse, dense, and hybrid retrieval techniques, but these are not the focus of our study. We refer interested readers to Lin (2021) and Lin et al. (2021b) for an overview of such models.

In another relevant thread, recent work such as InPars (Bonifacio et al., 2022; Boytsov et al., 2023) and Promptagator (Dai et al., 2022) explored using LLMs to generate synthetic queries for documents to craft relevant query–document pairs as training data for retrievers or rerankers. Similarly, HyDE (Gao et al., 2023) used LLMs to augment queries by generating hypothetical documents for unsupervised dense retrieval. Relatedly, Sachan et al. (2023) proposed ART, a novel approach to training a dense passage retriever starting only with questions, which outperforms the standard reference dense retrieval model DPR (Karpukhin et al., 2020). In the emerging paradigm of generative retrieval, Pradeep et al. (2023) explored different document representation strategies and found synthetic queries to be necessary for effectiveness as the corpus size increases. However, all these approaches take advantage of large language models only indirectly.

Finally, we note that rerankers have gained additional prominence in recent months with the introduction of commercially available API endpoints. Examples include Cohere’s Rerank API1 and Microsoft’s Semantic Search API in Azure Cognitive Search.2 The existence of these production services suggests that reranking models have attained maturity beyond explorations in research laboratories, and that rerankers address a real-world problem.

1 https://cohere.com/rerank
2 https://learn.microsoft.com/en-us/azure/search/semantic-search-overview
3 Methods

3.1 Prompt Design

Recent work (Ma et al., 2023) has shown that zero-shot listwise LLM-based rerankers outperform their pointwise counterparts since the former can attend to multiple documents simultaneously to determine their relative positions in a relevance ranking. We build on this finding and define our ranking problem as follows: Given a user query q and candidate documents {D1, . . . , Dn} from the previous stage, the task is to return a reordered list of the input document identifiers that improves a retrieval metric such as nDCG.

Our prompt template for zero-shot listwise reranking is similar to the RankGPT prompt (Sun et al., 2023), but accounts for differences between Vicuna and GPT; specifically, we use the default system description for Vicuna. In addition, we modified the prompt to show that the answer can, and in many cases should, deviate from the identity ordering, [1] > [2] > . . . > [m]. The exact input prompt to Vicuna is shown in Figure 1.

We prepend the prompt with the system description, which, in Vicuna’s case, is “A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions.” We hope that aligning our model with the exact prompt setup used to train Vicuna would help generate higher-quality ranked lists for our task.

USER: I will provide you with {num} passages, each indicated by a numerical identifier []. Rank the passages based on their relevance to the search query: {query}.
[1] {passage 1}
[2] {passage 2}
...
[{num}] {passage {num}}

Search Query: {query}.

Rank the {num} passages above based on their relevance to the search query. All the passages should be included and listed using identifiers, in descending order of relevance. The output format should be [] > [], e.g., [4] > [2]. Only respond with the ranking results, do not say any word or explain.

Figure 1: User input for both RankVicuna and our replication of RankGPT.
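As an illustration of how the Figure 1 template can be instantiated for one window of candidates, consider the sketch below. The system string is the Vicuna default quoted above; the helper name build_rank_prompt is an assumption of this sketch, not an identifier from the released code.

SYSTEM = ("A chat between a curious user and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the user's questions.")

def build_rank_prompt(query: str, passages: list[str]) -> str:
    # Fill the listwise reranking template of Figure 1 for one window of passages.
    num = len(passages)
    lines = [f"I will provide you with {num} passages, each indicated by a "
             f"numerical identifier []. Rank the passages based on their relevance "
             f"to the search query: {query}.", ""]
    for i, passage in enumerate(passages, start=1):
        lines.append(f"[{i}] {passage}")
    lines += ["", f"Search Query: {query}.", "",
              f"Rank the {num} passages above based on their relevance to the search "
              "query. All the passages should be included and listed using identifiers, "
              "in descending order of relevance. The output format should be [] > [], "
              "e.g., [4] > [2]. Only respond with the ranking results, do not say any "
              "word or explain."]
    return "\n".join(lines)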
3.2 RankVicuna

We leveraged RankGPT3.5 as a teacher model for Vicuna to prompt-decode high-quality ranked lists. More specifically, we trained RankVicuna on the ranked lists generated by RankGPT3.5 for the 100K training set queries provided by Sun et al. (2023). To generate this dataset, the authors randomly sampled 100K queries from the MS MARCO v1 passage ranking training set and retrieved 20 candidates using BM25 for each query using Pyserini (Lin et al., 2021a). Then, these candidates were passed into RankGPT3.5 to generate teacher orderings, which we distill down to our student, RankVicuna. Since neither RankGPT3.5 nor RankVicuna is directly exposed to human-labeled relevant query–passage pairs, our approach can still be considered zero-shot.

To ensure higher quality and more robust trained models, we took the following additional steps (both are sketched in code after the list):

• We did not train on malformed generations. More specifically, examples with incorrect list formatting, missing document identifiers, or repetitions were excluded from the training set. This is important as we find that about 12% of the outputs were malformed, and we desire a model that consistently generates a well-formed ordering.

• Besides including the original generations provided by the teacher, which reranks the top 20 results by BM25 (Robertson and Zaragoza, 2009), we also include a condition where the input order is shuffled. Our hope is that this exposes the model to a more complex reordering task while not incurring additional data generation costs. However, we still retain the original BM25 input ordering, as we believe it is important to model “success”, given it is the closest to what the model sees during inference. All RankVicuna settings in the rest of the paper involve this data augmentation (DA) process unless specified otherwise.
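The two steps above can be made concrete with a small sketch: one function rejects malformed teacher orderings (wrong format, repetitions, missing identifiers) and another produces the shuffled-input variant of a training example. This is an illustrative reconstruction of the described filtering and augmentation under stated assumptions, not the released training code.

import random
import re

def parse_ordering(response: str, num_passages: int):
    # Return the permutation encoded in a "[3] > [1] > ..." response,
    # or None if the response is malformed (bad format, repeats, missing ids).
    if not re.fullmatch(r"\s*\[\d+\](\s*>\s*\[\d+\])*\s*", response):
        return None  # wrong format
    ids = [int(x) for x in re.findall(r"\[(\d+)\]", response)]
    if len(ids) != len(set(ids)):
        return None  # repetition
    if sorted(ids) != list(range(1, num_passages + 1)):
        return None  # missing (or out-of-range) identifiers
    return ids

def shuffled_variant(passages, teacher_ordering, seed=0):
    # Shuffle the candidate order and remap the teacher ordering accordingly.
    rng = random.Random(seed)
    perm = list(range(len(passages)))
    rng.shuffle(perm)                              # new position -> old index
    shuffled = [passages[j] for j in perm]
    old_to_new = {j + 1: i + 1 for i, j in enumerate(perm)}
    remapped = [old_to_new[i] for i in teacher_ordering]
    return shuffled, remapped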
We trained our 7B parameter RankVicuna for two epochs with an effective batch size of 128 and a learning rate of 2 × 10−5 in bfloat16. Training took roughly 80 hours on four NVIDIA RTX A6000 GPUs. The Vicuna model that served as our initial weights can be found under lmsys/vicuna-7b-v1.5 in the HuggingFace Hub. This model is instruction fine-tuned from Meta’s LLaMA-v2 model (Touvron et al., 2023).

It is worth noting that the “out-of-the-box” Vicuna model, which was not trained on the RankGPT3.5 data, completely fails at the reranking task, often simply returning an identity ordering or a malformed generation.

4 Experimental Setup

To demonstrate the effectiveness of RankVicuna, we compared it with existing representative unsupervised ranking methods (BM25 and Contriever) as well as our replications of two closed-source prompt-decoder models: LRL (Ma et al., 2023) with GPT3.5 and RankGPT (Sun et al., 2023) with both GPT3.5 and GPT4, which we refer to as RankGPT3.5 and RankGPT4, respectively. GPT3.5 refers to the model dubbed gpt-3.5-turbo in the OpenAI suite while GPT4 refers to gpt-4. We also compared RankVicuna with our replication of PRP-Sliding-10 from Qin et al. (2023), albeit with Vicuna (7B parameters). For these experiments, we used Vicuna instead of FLAN-T5 or FLAN-UL2 because we wanted an apples-to-apples comparison with the same base LLM. Additionally, we note that the FLAN mixture, used to pretrain the models, includes the MS MARCO QA task,3 thereby rendering the results suspect from the perspective of zero-shot retrieval.

3 https://github.com/google-research/FLAN/blob/e9e4ec6e2701182c7a91af176f705310da541277/flan/v2/flan_collection_info.csv#L1032

We evaluated our methods using test collections from the TREC 2019 and 2020 Deep Learning Tracks (Craswell et al., 2020, 2021), using query and relevance judgments from the passage retrieval tasks. These tasks use the MS MARCO v1 passage corpus (Bajaj et al., 2016), which contains 8.8 million passages. For convenience, we refer to these datasets as DL19 and DL20. We report effectiveness in terms of nDCG@10 and average precision at a rank cutoff of 100 (denoted MAP@100).

The context size is 4096 tokens for Vicuna and GPT3.5 and 8192 tokens for GPT4. To reorder the top 100 candidates for each query given these context sizes, we used a sliding window similar to RankGPT and LRL. In our experiments, we adopted the same values as RankGPT (window size 20, stride 10) to isolate the impact of window and stride size in our comparisons.
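A simplified sketch of one such back-to-front sliding-window pass is shown below. The callable rerank_window stands in for the underlying LLM reranker call and is an assumption of this sketch; the exact windowing details of the released implementations may differ.

def sliding_window_rerank(query, candidates, rerank_window, window_size=20, stride=10):
    # One back-to-front pass over the candidate list (RankGPT-style windowing).
    # rerank_window(query, passages) is assumed to return the same passages,
    # reordered by the LLM reranker; candidates is the current ranked list.
    ranking = list(candidates)
    end = len(ranking)
    while True:
        start = max(0, end - window_size)
        ranking[start:end] = rerank_window(query, ranking[start:end])
        if start == 0:
            break
        end -= stride
    return ranking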
Unlike RankVicuna, we (surprisingly) observe non-deterministic outputs for GPT3.5 and GPT4, even with a temperature of zero. For these two models, we report the mean over six and three runs, respectively, with 99% confidence intervals. We limited the number of GPT4 runs to three due to our computation budget.

In all our reranking experiments, we replaced any reference of the form [n] in the passages with (n) to avoid confusing the models. We also leveraged ftfy’s fix_text method to preprocess any input sent to the rerankers.
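A minimal version of this preprocessing step might look as follows; it assumes the ftfy package is installed and uses a simple regular expression for the bracket rewriting.

import re
import ftfy

def preprocess_passage(text: str) -> str:
    # Fix mojibake/odd Unicode and rewrite [n] references as (n) so that
    # bracketed numbers in the passage are not confused with ranking identifiers.
    text = ftfy.fix_text(text)
    return re.sub(r"\[(\d+)\]", r"(\1)", text)

print(preprocess_passage("The result [3] was confirmed later [12]."))
# The result (3) was confirmed later (12).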
5 Results

Table 1 compares different reranking pipelines using data from DL19 and DL20. Rows (1) and (2) report baselines using two first-stage retrievers, BM25 and Contriever (Izacard et al., 2021). The remaining rows (besides the last one) report the results of using zero-shot LLM rerankers to reorder the top 100 candidate documents retrieved by BM25. Rows (6) and (7) show scores of two variants of PRP-Sliding-10, FLAN-T5-XXL and FLAN-UL2, directly copied from Qin et al. (2023). The final row represents our best system, where we apply RankVicuna to rerank the top 100 candidates generated by SPLADE++ EnsembleDistil (Formal et al., 2021), a state-of-the-art neural first-stage sparse retrieval method.
Prev. Top-k DL19-nDCG@10 DL19-MAP@100 DL20-nDCG@10 DL20-MAP@100
(1) BM25 None |C| 0.5058 0.2476 0.4796 0.2685
(2) Contriever None |C| 0.6164 0.3163 0.5986 0.3309
(3) LRL (GPT3.5 ) BM25 100 0.6451±0.003 0.3035±0.004 0.6099±0.004 0.3496±0.004
(4) RankGPT3.5 BM25 100 0.6855±0.006 0.3335±0.002 0.6202±0.005 0.3525±0.002
(5) RankGPT4 BM25 100 0.7500±0.002 0.3703±0.004 0.7036±0.004 0.4134±0.004
(6) PRP-Sliding-10 (FLAN-T5-XXL) BM25 100 0.6700 - 0.6735 -
(7) PRP-Sliding-10 (FLAN-UL2) BM25 100 0.7265 - 0.7046 -
(8) PRP-Sliding-10 (Vicuna) BM25 100 0.5606 0.2735 0.5367 0.2990
(9) RankVicuna BM25 100 0.6682 0.3316 0.6549 0.3789
(10) RankVicuna SPLADE++ ED 100 0.7459 0.4416 0.7473 0.5183

Table 1: nDCG@10 and MAP@100 on DL19 and DL20 for different reranking pipelines, with BM25 and Contriever
as baselines. Each reranker uses the top 100 retrieved results of the previous stage as input. Rows (3–4) and row
(5) represent averages of six and three runs, respectively. We directly copied results in rows (6–7) from Qin et al.
(2023). All other results are from our own experiments.

OK Wrong Format Repetition Missing Total


RankGPT3.5 838.67 0 1.16 33.16 873
RankGPT4 830.33 40.67 1.67 0.33 873
RankVicuna 873 0 0 0 873

Table 2: The number of malformed responses for each reranking method. Reported numbers for RankGPT3.5 and RankGPT4 are averages of six and three runs, respectively.

As expected, all LLM rerankers outperform the baseline (first-stage) methods. The effectiveness of RankVicuna, with 7B parameters, is on par with the effectiveness of RankGPT3.5, with 175B parameters. Specifically, compared to its teacher RankGPT3.5, RankVicuna achieves higher scores on DL20 but slightly lower scores on DL19. Compared with another zero-shot reranking method, LRL, which uses GPT3.5, RankVicuna demonstrates considerably higher effectiveness on both DL19 and DL20.

We note that PRP-Sliding-10 (FLAN-T5-XXL), with 11B parameters, is comparable to RankVicuna both in terms of model size and effectiveness. Other than being fully open-source, our main advantage over PRP-Sliding-10 (FLAN-T5-XXL) is the prompt cost: to bring the top 10 most relevant candidates to the top of the list, PRP-Sliding-10 (FLAN-T5-XXL) requires each passage to be included in ∼40 prompts on average. In contrast, we only require two prompts for our listwise approach with a sliding window of size 20 and a stride of 10. Furthermore, training on the FLAN mixture, which includes the MS MARCO QA task, calls into question the validity of PRP-Sliding-10 (FLAN-T5-XXL) as a true zero-shot method. We suspect this to be a contributing factor to the effectiveness gap between PRP-Sliding-10 (FLAN-T5-XXL) and PRP-Sliding-10 (Vicuna).

Not surprisingly, both RankGPT4 (rumored to contain more than 1T parameters) and PRP-Sliding-10 (FLAN-UL2), with 20B parameters, outperform RankVicuna. This could be because, in addition to the differences in model sizes, the effectiveness of RankVicuna is bounded by its teacher, RankGPT3.5.

Finally, in row (10), we used RankVicuna to rerank the top 100 candidates from SPLADE++ EnsembleDistil instead of BM25. This combination achieves effectiveness on par with RankGPT4 with an open-source model that is more than two orders of magnitude smaller.

Table 2 shows the number of malformed responses generated by the RankGPT variants and RankVicuna, which we have grouped into the following categories:

1. Wrong Format: includes responses that do not follow the requested format. For example, when RankGPT4 refuses to generate a sorted list, its response falls into this category.

2. Repetition: includes responses that contain repeated document IDs.

3. Missing: includes responses with missing document IDs.

Since RankVicuna is deterministic, we report the results of a single run. For every request in this run, RankVicuna returned a correctly formatted response. In contrast, for RankGPT3.5 and RankGPT4, we averaged the results of six and three runs, respectively. Both RankGPT methods occasionally return malformed responses. Most of the malformed responses from RankGPT3.5 are missing documents in the ordered list; when malformed, RankGPT4 mostly refuses to rank. Repetition is a rare problem for both RankGPT methods.
Prev. Top-k DL19-nDCG@10 DL19-MAP@100 DL20-nDCG@10 DL20-MAP@100
(1a) BM25 None |C| 0.5058 0.2476 0.4796 0.2685
(1b) RankVicuna BM25 20 0.6164 0.2867 0.5986 0.3194
(1c) RankVicuna BM25 100 0.6682 0.3316 0.6549 0.3789
(2a) BM25 + RM3 None |C| 0.5216 0.2807 0.4896 0.2821
(2b) RankVicuna BM25 + RM3 20 0.6053 0.3110 0.5825 0.3323
(2c) RankVicuna BM25 + RM3 100 0.6588 0.3573 0.6567 0.3991
(3a) OpenAI ada2 None |C| 0.7035 0.4151 0.6759 0.4587
(3b) RankVicuna OpenAI ada2 20 0.7448 0.4398 0.7101 0.4718
(3c) RankVicuna OpenAI ada2 100 0.7374 0.4409 0.7210 0.4755
(4a) DistillBERT KD TASB None |C| 0.7210 0.4050 0.6854 0.4520
(4b) RankVicuna DistillBERT KD TASB 20 0.7588 0.4121 0.7404 0.4648
(4c) RankVicuna DistillBERT KD TASB 100 0.7551 0.4170 0.7049 0.4620
(5a) SPLADE++ ED None |C| 0.7308 0.4464 0.7197 0.4826
(5b) RankVicuna SPLADE++ ED 20 0.7532 0.4491 0.7455 0.5150
(5c) RankVicuna SPLADE++ ED 100 0.7459 0.4416 0.7473 0.5183

Table 3: nDCG@10 and MAP@100 for RankVicuna with different first-stage candidate generation methods. For
each method, reranking is performed using the top 20 or 100 candidates.

6 Ablation Studies

6.1 First-Stage Candidate Generation

To evaluate the impact of the quality and quantity of the generated candidates on the final results, we repeated our experiments with the following five first-stage retrieval methods, using either the top 20 or top 100 retrieved results: (1) BM25 (Robertson and Zaragoza, 2009), (2) BM25 + RM3 (Abdul-Jaleel et al., 2004), (3) OpenAI ada2 (Neelakantan et al., 2022; Lin et al., 2023), (4) DistillBERT KD TASB (Hofstätter et al., 2021), and (5) SPLADE++ EnsembleDistil (ED) (Formal et al., 2022). The first two represent strong traditional “bag-of-words” retrieval baselines; the others represent a sample of effective neural first-stage retrievers that are commonly seen in research studies today. OpenAI ada2 and DistillBERT KD TASB are dense retrieval methods, while SPLADE++ ED is a sparse one.
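For the BM25 baseline, candidate generation with Pyserini looks roughly like the sketch below, using a prebuilt index for the MS MARCO v1 passage corpus. The index name, query string, and helper name here are assumptions of this sketch rather than the paper's exact configuration.

from pyserini.search.lucene import LuceneSearcher

# Prebuilt Lucene index for the MS MARCO v1 passage corpus (~8.8M passages).
searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")

def bm25_candidates(query: str, k: int = 100):
    # Return the top-k (docid, score) pairs from BM25 as reranker input.
    hits = searcher.search(query, k=k)
    return [(hit.docid, hit.score) for hit in hits]

print(bm25_candidates("what is a prompt-decoder reranker", k=5))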
Our experiments show that as first-stage effectiveness increases, the additional improvement from RankVicuna decreases (see Table 3). For example, while RankVicuna over the top 100 BM25 candidates improves effectiveness by 30%–45% for all metrics, the improvement for SPLADE++ ED is only 2%–4% for the same metrics. This is a commonly noted phenomenon across multi-stage ranking systems (Pradeep et al., 2021, 2022a,b). Comparing top 20 vs. top 100 results shows that reranking more candidates generally results in a higher MAP@100. However, in cases where first-stage effectiveness is “good enough”, rows (3–5) for DL19 and rows (4–5) for DL20, reranking only the top 20 candidates achieves an nDCG@10 score on par with reranking the top 100 candidates.

6.2 Data Augmentation

Section 3.2 discussed the training process of RankVicuna, highlighting the use of data augmentation (DA) as a crucial step in our training pipeline. To recap, the DA process involves shuffling the input order of the documents and permuting the original generations provided by the teacher accordingly. This step exposes the model to a more complex reordering task, which hopefully enhances its robustness and effectiveness.

In this section, we study the dependence of RankVicuna on the order of generated candidates. We compared two versions of the model: (1) the default version trained using data augmentation (DA), and (2) a variant trained without DA. Experimental results are shown in Table 4.

Using BM25 as the first stage, our experiments show that RankVicuna without DA is less effective than RankVicuna with DA.
Prev. Top-k DL19-nDCG@10 DL19-MAP@100 DL20-nDCG@10 DL20-MAP@100
(1a) RankVicuna BM25 100 0.6682 0.3316 0.6549 0.3789
(1b) RankVicuna Shuf. BM25 100 0.6702±0.009 0.2977±0.006 0.6537±0.006 0.3553±0.006
(1c) RankVicuna SPLADE++ ED 100 0.7459 0.4416 0.7473 0.5183
(1d) RankVicuna Shuf. SPLADE++ ED 100 0.7271±0.009 0.3860±0.008 0.7071±0.007 0.4312±0.006
(2a) RankVicuna (w/o DA) BM25 100 0.6612 0.3254 0.6420 0.3612
(2b) RankVicuna (w/o DA) Shuf. BM25 100 0.5893±0.017 0.2666±0.011 0.5293±0.010 0.2754±0.007
(2c) RankVicuna (w/o DA) SPLADE++ ED 100 0.7653 0.4672 0.7536 0.5180
(2d) RankVicuna (w/o DA) Shuf. SPLADE++ ED 100 0.5893±0.010 0.3289±0.009 0.5373±0.020 0.3406±0.013

Table 4: nDCG@10 and MAP@100 of two variants of RankVicuna with different first-stage candidate generation methods. For each method, reranking is performed using the top 100 candidates from the previous step; the "Shuf." rows rerank six shuffled orderings, for which we report average metrics with 99% confidence intervals.

[Figure 2: line plot of nDCG@10 (y-axis) vs. number of sliding window passes, 0–10 (x-axis), with curves for RankVicuna on DL19, PRPVicuna on DL19, RankVicuna on DL20, and PRPVicuna on DL20.]

Figure 2: Comparing the effectiveness of RankVicuna vs. PRPVicuna on DL19 and DL20, varying the number of times the ranked list is progressively refined. The zeroth pass corresponds to the BM25 run.

When we replace BM25 with SPLADE++ ED, RankVicuna without DA outperforms RankVicuna with DA. While data augmentation can cause a small drop in effectiveness (depending on the first stage), it makes the model less vulnerable to poor-quality candidates (whether intentional or not), a vulnerability that Qin et al. (2023) observed in methods like PRP-Sliding-10 and RankGPT3.5.

To showcase this vulnerability, we provided both model variants with shuffled candidate documents (rows b and d). The results show that the model without DA exhibited a significant effectiveness drop (up to 34%) and higher variance among different runs. In contrast, the default model, which is more robust due to its exposure to a more complex reordering task, better retained its effectiveness (comparing rows b vs. a and d vs. c, respectively, for each version).

6.3 Effect of Progressive Reranking

Finally, Figure 2 compares the effectiveness of two reranking methods, RankVicuna and a variant of PRP-Sliding from Qin et al. (2023) that we call PRPVicuna, on two datasets, DL19 and DL20. The x-axis represents the number of sliding window passes, ranging from 0 to 10, and the y-axis represents the nDCG@10 score. We plot four curves, each representing a combination of a reranking method and a dataset. The solid lines show results on DL19 and the dashed lines show results on DL20. The blue lines represent the RankVicuna method and the red lines represent the PRPVicuna method (Qin et al., 2023).

We see that, for both datasets, RankVicuna consistently outperforms PRPVicuna. The nDCG@10 score for RankVicuna on DL19 starts at 0.5058 and increases to 0.6837 at the second pass, remaining relatively stable thereafter. The score for RankVicuna on DL20 follows a similar pattern, starting at 0.4796 and rising to about 0.6604 at pass four, albeit at a slower pace after the first pass. On the other hand, the nDCG@10 scores for PRPVicuna on both datasets increase gradually with each pass but remain far below RankVicuna.
This plot suggests that RankVicuna is more effective than PRPVicuna and that additional passes of the sliding window provide only a minimal effectiveness boost for RankVicuna. It is also worth noting that a single pass of reranking with both methods takes about the same time, around 30 seconds per query using a batch size of one on an RTX A6000 GPU. These results show that RankVicuna is much more efficient and converges more quickly to the best achievable results. This is likely because PRPVicuna handles only two passages at a time, whereas RankVicuna attends to 20 passages simultaneously, resulting in more effective relevance estimation.
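Progressive reranking simply repeats the single sliding-window pass sketched in Section 4; reusing the illustrative sliding_window_rerank helper from that sketch, it can be written as follows (the helper and parameter names remain assumptions of the sketch).

def progressive_rerank(query, candidates, rerank_window, num_passes=2,
                       window_size=20, stride=10):
    # Apply the sliding-window pass num_passes times; pass 0 is the first-stage run.
    ranking = list(candidates)
    for _ in range(num_passes):
        ranking = sliding_window_rerank(query, ranking, rerank_window,
                                        window_size=window_size, stride=stride)
    return ranking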
7 Conclusion

In this study, we introduce RankVicuna, a listwise zero-shot reranking approach powered by an open-source large language model, Vicuna. Experimental studies show that our model achieves effectiveness on par with much larger models. We also quantitatively demonstrated the stability of RankVicuna results compared to closed-source counterparts.

Along the way, we explored many aspects of prompt-decoder models for reranking, including the impact of first-stage retrievers on downstream effectiveness. Our work also sheds light on the importance of data augmentation for system robustness, which plays a vital role in ensuring stability in the face of document shuffling and variations in initial retrieval quality.

In summary, RankVicuna advances zero-shot reranking for information retrieval, demonstrating the potential of large language models to enhance search effectiveness, even in data-scarce settings. We are able to achieve high-quality reranking using fully open-source models, which provides a firm foundation for the rest of the research community to build on. As we further refine and expand these techniques, we anticipate exciting opportunities for integrating large language models into end-to-end information access applications.

Acknowledgments

This research was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

References

Nasreen Abdul-Jaleel, James Allan, W. Bruce Croft, Fernando Diaz, Leah Larkey, Xiaoyan Li, Donald Metzler, Mark D. Smucker, Trevor Strohman, Howard Turtle, and Courtney Wade. 2004. UMass at TREC 2004: Novelty and HARD. In Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004), Gaithersburg, Maryland.

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv:1611.09268v3.

Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira. 2022. InPars: Unsupervised dataset generation for information retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022), pages 2387–2392, Madrid, Spain.

Leonid Boytsov, Preksha Patel, Vivek Sourabh, Riddhi Nisar, Sayani Kundu, Ramya Ramanathan, and Eric Nyberg. 2023. InPars-Light: Cost-effective unsupervised training of efficient rankers. arXiv:2301.02998.

B. Barla Cambazoglu, Hugo Zaragoza, Olivier Chapelle, Jiang Chen, Ciya Liao, Zhaohui Zheng, and Jon Degenhardt. 2010. Early exit optimizations for additive machine learned ranking systems. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM 2010), pages 411–420, New York, New York.

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos. 2021. Overview of the TREC 2020 deep learning track. arXiv:2102.07662.

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2020. Overview of the TREC 2019 deep learning track. arXiv:2003.07820.

Zhuyun Dai, Vincent Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith B. Hall, and Ming-Wei Chang. 2022. Promptagator: Few-shot dense retrieval from 8 examples. arXiv:2209.11755.

Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE v2: Sparse lexical and expansion model for information retrieval. arXiv:2109.10086.

Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. 2022. From distillation to hard negative sampling: Making sparse neural IR models more effective. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022), pages 2353–2359, Madrid, Spain.

Luyu Gao, Zhuyun Dai, and Jamie Callan. 2021. Rethink training of BERT rerankers in multi-stage retrieval pipeline. In Proceedings of the 43rd European Conference on Information Retrieval (ECIR 2021).
Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2023. Precise zero-shot dense retrieval without relevance labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1762–1777, Toronto, Canada.

Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury. 2021. Efficiently teaching an effective dense retriever with balanced topic aware sampling. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pages 113–122.

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Unsupervised dense information retrieval with contrastive learning. arXiv:2112.09118.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online.

Jimmy Lin. 2021. A proposed conceptual framework for a representational approach to information retrieval. arXiv:2110.01529.

Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021a. Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pages 2356–2362.

Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2021b. Pretrained Transformers for Text Ranking: BERT and Beyond. Morgan & Claypool Publishers.

Jimmy Lin, Ronak Pradeep, Tommaso Teofili, and Jasper Xian. 2023. Vector search with OpenAI embeddings: Lucene is all you need. arXiv:2308.14963.

Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin. 2023. Zero-shot listwise document reranking with a large language model. arXiv:2305.02156.

Irina Matveeva, Chris Burges, Timo Burkard, Andy Laucius, and Leon Wong. 2006. High accuracy retrieval with multiple nested ranker. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), pages 437–444, Seattle, Washington.

Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, Madeleine Thompson, Tabarak Khan, Toki Sherbakov, Joanne Jang, Peter Welinder, and Lilian Weng. 2022. Text and code embeddings by contrastive pre-training. arXiv:2201.10005.

Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage re-ranking with BERT. arXiv:1901.04085.

Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pretrained sequence-to-sequence model. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 708–718.

Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-stage document ranking with BERT. arXiv:1910.14424.

Cicero Nogueira dos Santos, Xiaofei Ma, Ramesh Nallapati, Zhiheng Huang, and Bing Xiang. 2020. Beyond [CLS] through ranking by generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1722–1727, Online.

Ronak Pradeep, Kai Hui, Jai Gupta, Adam D. Lelkes, Honglei Zhuang, Jimmy Lin, Donald Metzler, and Vinh Q. Tran. 2023. How does generative retrieval scale to millions of passages? arXiv:2305.11841.

Ronak Pradeep, Yilin Li, Yuetong Wang, and Jimmy Lin. 2022a. Neural query synthesis and domain-specific ranking templates for multi-stage clinical trial matching. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022), pages 2325–2330, Madrid, Spain.

Ronak Pradeep, Yuqi Liu, Xinyu Zhang, Yilin Li, Andrew Yates, and Jimmy Lin. 2022b. Squeezing water from a stone: A bag of tricks for further improving cross-encoder effectiveness for reranking. In Proceedings of the 44th European Conference on Information Retrieval (ECIR 2022), Part I, pages 655–670, Stavanger, Norway.

Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. 2021. The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models. arXiv:2101.05667.

Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, and Michael Bendersky. 2023. Large language models are effective text rankers with pairwise ranking prompting. arXiv:2306.17563.

Stephen E. Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389.
Devendra Singh Sachan, Mike Lewis, Dani Yogatama, Luke Zettlemoyer, Joelle Pineau, and Manzil Zaheer. 2023. Questions are all you need to train a dense passage retriever. Transactions of the Association for Computational Linguistics, 11:600–616.

Weiwei Sun, Lingyong Yan, Xinyu Ma, Pengjie Ren, Dawei Yin, and Zhaochun Ren. 2023. Is ChatGPT good at search? Investigating large language models as re-ranking agent. arXiv:2304.09542.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.

Lidan Wang, Jimmy Lin, and Donald Metzler. 2011. A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), pages 105–114, Beijing, China.

Honglei Zhuang, Zhen Qin, Rolf Jagerman, Kai Hui, Ji Ma, Jing Lu, Jianmo Ni, Xuanhui Wang, and Michael Bendersky. 2022. RankT5: Fine-tuning T5 for text ranking with ranking losses. arXiv:2210.10634.
