
Mastering RAG: How to Select A Reranking Model


Pratik Bhavsar
Galileo Labs

8 min read March 21, 2024

Optimizing retrieval results in a RAG system requires various modules to work together in
unison. A crucial component of this process is the reranker, a module that improves the
order of documents within the retrieved set to prioritize the most relevant items. Let’s dive
into the intricacies of selecting an optimal reranker for RAG systems, including the
significance of rerankers, scenarios demanding their use, potential drawbacks, and the
diverse types available.
What is a Reranker?

https://txt.cohere.com/rerank/

A reranker, functioning as the second-pass document filter in information retrieval (IR) systems, focuses on reordering documents retrieved by the initial retriever (semantic search, keyword search, etc.). This reordering is based on query-document relevance, emphasizing the quality of document ranking. Balancing efficiency with effectiveness, rerankers prioritize the enhancement of search result quality, often involving more complex matching methods than traditional vector inner products.

Why We Need Rerankers



We know that hallucinations happen when unrelated retrieved docs are included in output
context. This is exactly where rerankers can be helpful! They rearrange document records
to prioritize the most relevant ones. This not only helps address hallucinations but also
saves money during the RAG process. Let's explore in more detail why rerankers are necessary.

Limitations of Embeddings
Let's examine why embeddings fail to adequately address retrieval challenges. Their
generalization issues present significant obstacles in real-world applications.

Limited Semantic Understanding

While embeddings capture semantic information, they often lack contrastive information.
For example, embeddings may struggle to distinguish between "I love apples" and "I used to
love apples" since both convey a similar semantic meaning.
Dimensionality Constraints

Embeddings represent documents or sentences in a relatively low-dimensional space, typically with a fixed number of dimensions (e.g., 1024). This limited space makes it challenging to encode all relevant information accurately, especially for longer documents or queries.

Generalization Issues

Embeddings must generalize well to unseen documents and queries, which is crucial for
real-world search applications. However, due to their dimensionality constraints and training
data limitations, embeddings-based models may struggle to generalize effectively beyond
the training data.

How Rerankers Work


Rerankers overcome these limitations of embeddings, which makes them valuable for retrieval applications.

Bag-of-Embeddings Approach
Early interaction models like cross encoders and late-interaction models like ColBERT adopt
a bag-of-embeddings approach. Instead of representing documents as single vectors, they
break documents into smaller, contextualized units of information.

Semantic Keyword Matching


Reranker models combine the power of strong encoder models, such as BERT, with
keyword-based matching. This allows them to capture semantic meaning at a finer
granularity while retaining the simplicity and efficiency of keyword matching.

Improved Generalization
Rerankers alleviate the generalization issues faced by traditional embeddings by focusing on smaller
contextualized units. They can better handle unseen documents and queries, leading to
improved retrieval performance in real-world scenarios.
Types of Rerankers

Rerankers have been used for years, but the field is rapidly evolving. Let's examine current
options and how they differ.

Model | Type | Performance | Cost | Example
Cross encoder | Open source | Great | Medium | BGE, sentence-transformers, Mixedbread
Multi-vector | Open source | Good | Low | ColBERT
LLM | Open source | Great | High | RankZephyr, RankT5
LLM API | Private | Best | Very High | GPT, Claude
Rerank API | Private | Great | Medium | Cohere, Mixedbread, Jina

Cross-Encoders
Cross-encoder models redefine the conventional approach by employing a classification
mechanism for pairs of data. The model takes a pair of data, such as two sentences, as
input, and produces an output value between 0 and 1, indicating the similarity between the
two items. This departure from vector embeddings allows for a more nuanced
understanding of the relationships between data points.

It's important to note that cross-encoders require a pair of "items" for every input, making
them unsuitable for handling individual sentences independently. In the context of search, a
cross-encoder is employed with each data item and the search query to calculate the
similarity between the query and the data object.

Here is a snippet of how to use state-of-the-art rerankers like BGE.

from FlagEmbedding import FlagReranker

# use_fp16 speeds up inference with a minor accuracy trade-off
reranker = FlagReranker('BAAI/bge-reranker-base', use_fp16=True)

# score a single query-passage pair
score = reranker.compute_score(['query', 'passage'])
print(score)

# score several query-passage pairs in one call
scores = reranker.compute_score([
    ['what is panda?', 'hi'],
    ['what is panda?', 'The giant panda is a bear species endemic to China.'],  # example passage
])
print(scores)

Multi-Vector Rerankers
Cross encoders perform very well, but what about alternative options that require less
compute?
Multi-vector embedding models like ColBERT feature late interaction, where the interaction between query and document representations occurs late in the process, after both have been independently encoded. This approach contrasts with early interaction models like cross-encoders, where query and document embeddings interact at earlier stages, potentially leading to increased computational complexity. With plain cosine similarity over embeddings, by contrast, there is no interaction at all.

The late interaction design allows for the pre-computation of document representations,
contributing to faster retrieval times and reduced computational demands, making it
suitable for processing large document collections.

ColBERT offers the best of both worlds, so let's see how it is implemented. Below is a code snippet from Jina's blog.

import torch

def compute_relevance_scores(query_embeddings, document_embeddings, k):
    """Compute relevance scores for top-k documents given a query."""
    # [1, num_query_terms, dim] x [k, dim, max_doc_length] -> [k, num_query_terms, max_doc_length]
    scores = torch.matmul(query_embeddings.unsqueeze(0), document_embeddings.transpose(1, 2))
    # max similarity per query term across each document's terms -> [k, num_query_terms]
    max_scores_per_query_term = scores.max(dim=2).values
    # sum over query terms -> one relevance score per document -> [k]
    total_scores = max_scores_per_query_term.sum(dim=1)
    # document indices sorted from most to least relevant
    sorted_indices = total_scores.argsort(descending=True)
    return sorted_indices

How the score is calculated using late interaction:

1. Dot Product: It computes the dot product between the query embeddings and document
embeddings. This operation is performed using torch.matmul(), which calculates the matrix
multiplication between query_embeddings.unsqueeze(0) (unsqueeze is used to add a batch
dimension) and document_embeddings.transpose(1, 2) (transposed to align dimensions for
multiplication). The result is a tensor with shape [k, num_query_terms, max_doc_length],
where each element represents the similarity score between a query term and a document
term.
2. Max pooling: A max-pooling operation is applied across document terms (dimension 2) to find the maximum similarity score per query term. This is done using scores.max(dim=2).values. The resulting tensor has shape [k, num_query_terms], where each row holds the maximum similarity score of each query term across that document's terms.
3. Total scoring: The maximum scores per query term are summed up to get the total score
for each document. This is done using .sum(dim=1), resulting in a tensor with shape [k]
containing the total relevance score for each document.
4. Sorting: The documents are sorted based on their total scores in descending order using
.argsort(descending=True).
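
As a quick sanity check, the function above can be exercised with random tensors of the expected shapes (the dimensions below are illustrative, not from the article):

# Illustrative shapes: 5 candidate documents (k=5), 32 query terms,
# 180 document terms, 128-dimensional token embeddings.
query_embeddings = torch.randn(32, 128)         # [num_query_terms, dim]
document_embeddings = torch.randn(5, 180, 128)  # [k, max_doc_length, dim]
ranking = compute_relevance_scores(query_embeddings, document_embeddings, k=5)
print(ranking)  # document indices, most relevant first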

Here is a code snippet using Jina-ColBERT, a ColBERT-style model based on JinaBERT that supports an 8k context length. The extra context length can help reduce the reliance on chunking techniques.

import torch
import numpy
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint
from colbert.modeling.colbert import colbert_score  # assumed import path for colbert_score

query = ["How to use ColBERT for indexing long documents?"]
documents = [  # example passages (shortened for illustration)
    "ColBERT is an efficient and effective passage retrieval model.",
    "Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support an 8k context length.",
    "JinaBERT is a BERT architecture that supports longer sequence lengths.",
    "Jina-ColBERT is trained on the MS MARCO passage ranking dataset.",
]

config = ColBERTConfig(query_maxlen=32, doc_maxlen=512)
ckpt = Checkpoint("jinaai/jina-colbert-v1-en", colbert_config=config)

Q_emb = ckpt.queryFromText(query)
D_emb = ckpt.docFromText(documents, bsize=32)[0]
D_mask = torch.ones(D_emb.shape[:2], dtype=torch.long)

scores = colbert_score(Q_emb, D_emb, D_mask).flatten().cpu().numpy().tolist()
ranking = numpy.argsort(scores)[::-1]
Below is the comparison between Jina-ColBERT-v1-en and the state-of-the-art late interaction model ColBERT-v2. Both show nearly similar zero-shot performance on 13 public BEIR datasets, making Jina-ColBERT-v1-en a worthy candidate for reranking.

https://huggingface.co/jinaai/jina-colbert-v1-en

LLMs for Reranking


As LLMs grow in size, surpassing 10 billion parameters, fine-tuning the reranking model
becomes progressively more challenging. Recent endeavors have aimed to tackle this issue
by prompting LLMs to improve document reranking autonomously. Broadly, these prompting
strategies fall into three categories: pointwise, listwise, and pairwise methods.

https://arxiv.org/abs/2308.07107

Pointwise Methods

Pointwise methods measure relevance between a query and a single document.


Subcategories include relevance generation and query generation, both effective in zero-
shot document reranking.
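
As an illustration, relevance generation can be as simple as asking the model for a yes/no judgement per document and mapping the answer to a score; the prompt wording and the OpenAI call below are illustrative assumptions, not taken from the paper.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def pointwise_score(query: str, document: str) -> float:
    """Score a single query-document pair with a yes/no relevance prompt."""
    prompt = (
        f"Document: {document}\n"
        f"Query: {query}\n"
        "Does the document answer the query? Answer only 'yes' or 'no'."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip().lower()
    return 1.0 if answer.startswith("yes") else 0.0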

Listwise Methods

Listwise methods directly rank a list of documents by inserting the query and a document
list into the prompt and instructing the LLMs to output the reranked document identifiers.
Due to the limited input length of LLMs, inserting all candidate documents into the prompt
is not feasible. To address this issue, these methods employ a sliding window strategy to
rerank a subset of candidate documents each time. This involves ranking from back to front
using a sliding window and re-ranking only the documents within the window at a time.
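
Here is a rough sketch of that sliding-window pass; rank_window stands in for a listwise LLM call that returns the documents in the window reordered by relevance, and the window size and step are illustrative.

def sliding_window_rerank(query, docs, rank_window, window_size=20, step=10):
    """Rerank candidates back-to-front, reordering only one window at a time."""
    docs = list(docs)
    start = max(len(docs) - window_size, 0)
    while True:
        end = min(start + window_size, len(docs))
        docs[start:end] = rank_window(query, docs[start:end])  # listwise LLM call
        if start == 0:
            break
        start = max(start - step, 0)  # slide the window toward the front
    return docs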

Pairwise Methods

In pairwise methods, LLMs receive a prompt containing a query and a document pair. They
are then directed to generate the identifier of the document deemed more relevant.
Aggregation methods like AllPairs are employed to rerank all candidate documents. AllPairs
initially generates all potential document pairs and consolidates a final relevance score for
each document. Efficient sorting algorithms like heap sort and bubble sort are typically utilized to expedite the ranking process.

This can be accomplished with the help of GPT-3.5, GPT-4, and GPT-4 Turbo, with the last offering the best tradeoff between cost and performance.
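
A bubble-sort-style pairwise pass can be sketched as follows; compare stands in for an LLM call that returns "A" or "B" for the more relevant of two documents, and a single pass bubbles the better candidates toward the top.

def pairwise_bubble_rerank(query, docs, compare, passes=1):
    """Swap adjacent documents whenever the LLM prefers the later one."""
    docs = list(docs)
    for _ in range(passes):
        for i in range(len(docs) - 1, 0, -1):
            if compare(query, docs[i - 1], docs[i]) == "B":
                docs[i - 1], docs[i] = docs[i], docs[i - 1]
    return docs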

Supervised LLM Rerankers


Supervised fine-tuning is a critical step when applying pre-trained LLMs to reranking tasks.
Due to the lack of ranking awareness during pre-training, LLMs cannot accurately measure
query-document relevance. Fine-tuning on task-specific ranking datasets, such as the MS MARCO passage ranking dataset, allows LLMs to adjust parameters for improved reranking
performance. Supervised rerankers can be categorized based on the backbone model
structure into two types.

Encoder-Decoder

Studies in this category mainly formulate document ranking as a generation task, optimizing
an encoder-decoder-based reranking model. For example, the RankT5 model is fine-tuned
to generate classification tokens for relevant or irrelevant query-document pairs.
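
A minimal sketch of this idea (not the released RankT5 code) is to feed the query-document pair to an encoder-decoder model and read the first-step logit of a designated "relevant" token; the checkpoint below is a generic placeholder, whereas RankT5 uses a model fine-tuned for ranking.

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")  # placeholder checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def encoder_decoder_score(query: str, document: str) -> float:
    """Use the logit of the 'true' token at the first decoding step as a relevance score."""
    inputs = tokenizer(f"Query: {query} Document: {document} Relevant:", return_tensors="pt")
    decoder_input_ids = torch.full((1, 1), model.config.decoder_start_token_id, dtype=torch.long)
    logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits  # [1, 1, vocab_size]
    true_id = tokenizer("true", add_special_tokens=False).input_ids[0]
    return logits[0, 0, true_id].item()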

Decoder-only

Recent attempts involve reranking documents by fine-tuning decoder-only models, such as LLaMA. Different approaches, including RankZephyr and RankGPT, propose various methods for relevance calculation.

Private Reranking APIs



https://txt.cohere.com/rerank/

Hosting and improving rerankers is often challenging. Private reranking APIs offer a
convenient solution for organizations seeking to enhance their search systems with
semantic relevance without making an infrastructure investment. Below, we look into three
notable private reranking APIs: Cohere, Jina, and Mixedbread.

Cohere

Cohere's rerank API offers rerank models tailored for English and multilingual documents,
each optimized for specific language processing tasks. Cohere automatically breaks down
lengthy documents into manageable chunks for efficient processing. The API delivers
relevance scores normalized between 0 and 1, facilitating result interpretation and threshold
determination.
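
A minimal call looks roughly like this (field names follow Cohere's Python SDK at the time of writing; check the current SDK, as response objects have changed across versions):

import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")  # placeholder key
response = co.rerank(
    model="rerank-english-v2.0",
    query="How do rerankers reduce hallucinations in RAG?",
    documents=[
        "Rerankers reorder retrieved passages so the most relevant appear first.",
        "The giant panda is a bear species endemic to China.",
    ],
    top_n=2,
)
for result in response.results:
    print(result.index, result.relevance_score)  # scores normalized to [0, 1]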

Jina

https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/

Jina's reranking API, exemplified by jina-colbert-v1-en, specializes in enhancing search results with semantic understanding and longer context lengths.
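
The API can be called over plain HTTP; the endpoint, payload fields, and model name below are assumptions based on Jina's public rerank API and should be checked against their current documentation.

import requests

response = requests.post(
    "https://api.jina.ai/v1/rerank",  # assumed endpoint
    headers={"Authorization": "Bearer YOUR_JINA_API_KEY"},  # placeholder key
    json={
        "model": "jina-colbert-v1-en",
        "query": "How do rerankers improve RAG retrieval?",
        "documents": [
            "Rerankers reorder retrieved passages by relevance.",
            "The giant panda is a bear species endemic to China.",
        ],
        "top_n": 2,
    },
)
print(response.json())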

Mixedbread

https://www.mixedbread.ai/blog/mxbai-rerank-v1
Mixedbread offers a family of reranking models with an open-source Apache 2.0 license, empowering organizations to integrate semantic relevance into their existing search infrastructure seamlessly.
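
Because the mxbai-rerank models are open weight, they can also be run locally, for example through the sentence-transformers CrossEncoder interface (the model name is assumed from Mixedbread's release; pick the size that fits your latency budget):

from sentence_transformers import CrossEncoder

model = CrossEncoder("mixedbread-ai/mxbai-rerank-base-v1")  # assumed model id
pairs = [
    ("how do rerankers work?", "A reranker reorders retrieved documents by query relevance."),
    ("how do rerankers work?", "The giant panda is a bear species endemic to China."),
]
print(model.predict(pairs))  # one relevance score per query-passage pair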

How to Select a Reranker


Several key factors need to be carefully considered when selecting a reranker to ensure
optimal performance and compatibility with system requirements.

Relevance Improvement
The primary objective of adding a reranker is to enhance the relevance of search results.
Evaluate the reranker's ability to improve the ranking in terms of retrieval metrics like NDCG
or generation metrics like attribution.
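
For reference, a bare-bones NDCG@k over graded relevance labels can be computed as below (a simplified sketch with linear gain; a dedicated evaluation library is preferable in practice):

import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked list of graded relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """Normalize DCG by the DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

print(ndcg_at_k([3, 2, 0, 1], k=4))  # labels in the order the reranker returned the documents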

Latency Considerations
Assess the additional latency introduced by the reranker to the search system. Measure the
time required for reranking documents and ensure that it remains within acceptable limits
for real-time or near-real-time search applications.

Contextual Understanding
Determine the reranker's ability to handle varying lengths of context in queries and
documents. Some rerankers may be optimized for short text inputs, while others may be
capable of processing longer sequences with rich contextual information.

Generalization Ability
Evaluate the reranker's ability to generalize across different domains and datasets. Ensure
that the reranker performs well not only on training data but also on unseen or out-of-
domain data to prevent overfitting and ensure robust performance in diverse search
scenarios.

Latest Research on Comparison of Rerankers


How should you go about all these options? Recent research, highlighted in the paper "A Thorough Comparison of Cross-Encoders and LLMs for Reranking SPLADE", sheds light on the effectiveness and efficiency of different reranking methods, especially when coupled with strong retrievers like SPLADEv3. Here are the key conclusions drawn from this research.

In-Domain vs. Out-of-Domain Performance

In the in-domain setting, differences between evaluated rerankers are not as pronounced.
However, in out-of-domain scenarios, the gap between approaches widens, suggesting that
the choice of reranker can significantly impact performance, especially across different
domains.

Impact of Reranked Document Count

Increasing the number of documents to rerank has a positive impact on the final
effectiveness of the reranking process. This highlights the importance of considering the
trade-off between computational resources and performance gains when determining the
optimal number of reranked documents.

Cross-Encoders vs. LLMs

Effective cross-encoders, when paired with strong retrievers, have shown the ability to
outperform most LLMs in reranking tasks, except for GPT-4 on some datasets. Notably,
cross-encoders offer this improved performance while being more efficient, making them
an attractive option for reranking tasks.

Evaluation of LLM-based Rerankers

Zero-shot LLM-based rerankers, including those based on OpenAI and open models, exhibit competitive effectiveness, with some even matching the performance of GPT-3.5 Turbo.
However, the inefficiency and high cost associated with these models currently limit their
practical use in retrieval systems, despite their promising performance.

Overall, the research highlights the significance of reranking methods in enhancing retrieval, with cross-encoders emerging as a particularly promising and efficient option. While LLM-based rerankers show competitive effectiveness, their practical deployment is hindered by inefficiency and high costs.

How to Evaluate Your Reranker

RAG with reranker using Langchain

Let's continue with our last RAG example, where we built a Q&A system on NVIDIA's 10-K filings. At the time, our goal was to evaluate embedding models. This time we want to see how we can evaluate a reranker.

We leverage the same data and introduce Cohere reranker in the RAG chain as shown
below.

import os

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema import StrOutputParser
from langchain_community.vectorstores import Pinecone as langchain_pinecone
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
from pinecone import Pinecone

def get_qa_chain(embeddings, index_name, emb_k, rerank_k, llm_model_name, temperature):
    # setup retriever
    pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
    index = pc.Index(index_name)
    vectorstore = langchain_pinecone(index, embeddings.embed_query, "text")
    compressor = CohereRerank(top_n=rerank_k)
    retriever = vectorstore.as_retriever(search_kwargs={"k": emb_k})

    # rerank retriever
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor, base_retriever=retriever
    )

    # setup prompt
    rag_prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "Answer the question based only on the provided context."
            ),
            ("human", "Context: '{context}' \n\n Question: '{question}'"),
        ]
    )

    # setup llm
    llm = ChatOpenAI(model_name=llm_model_name, temperature=temperature, tiktoken_model_name=llm_model_name)

    # helper function to format docs
    def format_docs(docs):
        return "\n\n".join([d.page_content for d in docs])

    # setup chain
    rag_chain = (
        {"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
        | rag_prompt
        | llm
        | StrOutputParser()
    )

    return rag_chain

We modify our rag_chain_executor function to include emb_k and rerank_k, the number of documents we want from OpenAI's text-embedding-3-small retrieval and from Cohere's rerank-english-v2.0 reranker, respectively.

# pc, documents, questions, project_name, and temperature come from the setup in the
# earlier embedding-model post
import time

import promptquality as pq
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import OpenAIEmbeddings
from pinecone import ServerlessSpec
from tqdm import tqdm

def rag_chain_executor(emb_model_name: str, dimensions: int, llm_model_name: str, emb_k: int, rerank_k: int) -> None:
    # initialise embedding model
    if "text-embedding-3" in emb_model_name:
        embeddings = OpenAIEmbeddings(model=emb_model_name, dimensions=dimensions)
    else:
        embeddings = HuggingFaceEmbeddings(model_name=emb_model_name, encode_kwargs={"normalize_embeddings": True})  # encode_kwargs assumed

    index_name = f"{emb_model_name}-{dimensions}".lower()

    # First, check if our index already exists and delete stale index
    if index_name in [index_info['name'] for index_info in pc.list_indexes()]:
        pc.delete_index(index_name)

    # create a new index
    pc.create_index(name=index_name, metric="cosine", dimension=dimensions,
                    spec=ServerlessSpec(
                        cloud="aws",
                        region="us-west-2"
                    ))
    time.sleep(10)

    # index the documents
    _ = langchain_pinecone.from_documents(documents, embeddings, index_name=index_name)
    time.sleep(10)

    # load qa chain
    qa = get_qa_chain(embeddings, index_name, emb_k, rerank_k, llm_model_name, temperature)

    # tags to be kept in galileo run
    run_name = f"{index_name}-emb-k-{emb_k}-rerank-k-{rerank_k}"
    index_name_tag = pq.RunTag(key="Index config", value=index_name, tag_type=pq.TagType.RAG)
    encoder_model_name_tag = pq.RunTag(key="Encoder", value=emb_model_name, tag_type=pq.TagType.RAG)
    llm_model_name_tag = pq.RunTag(key="LLM", value=llm_model_name, tag_type=pq.TagType.RAG)
    dimension_tag = pq.RunTag(key="Dimension", value=str(dimensions), tag_type=pq.TagType.RAG)
    emb_k_tag = pq.RunTag(key="Emb k", value=str(emb_k), tag_type=pq.TagType.RAG)
    rerank_k_tag = pq.RunTag(key="Rerank k", value=str(rerank_k), tag_type=pq.TagType.RAG)

    evaluate_handler = pq.GalileoPromptCallback(project_name=project_name, run_name=run_name)

    # run chain with questions to generate the answers
    print("Ready to ask!")
    for i, q in enumerate(tqdm(questions)):
        print(f"Question {i}: ", q)
        print(qa.invoke(q, config=dict(callbacks=[evaluate_handler])))
        print("\n\n")

    evaluate_handler.finish()

pq.login("console.demo.rungalileo.io")

Now we can go ahead and run the same sweep with the required parameters.

pq.sweep(
    rag_chain_executor,
    {
        "emb_model_name": ["text-embedding-3-small"],
        "dimensions": [384],
        "llm_model_name": ["gpt-3.5-turbo-0125"],
        "emb_k": [10],
        "rerank_k": [3],
    },
)

The outcome isn't surprising! We get a 10% increase in attribution, indicating we now possess more relevant chunks necessary to address the question. Additionally, there's a 5% improvement in Context Adherence, suggesting a reduction in hallucinations.

In most cases, we continue to seek further improvements rather than stopping at this point.
Galileo Evaluate facilitates error analysis by examining individual runs and inspecting
attributed chunks. The following illustrates the specific chunk attributed from the retrieved
chunks in the vector DB.

When we click the rerank node, we can see the total attributed chunks from the reranker and each chunk's attribution (yes/no).

This approach enables us to swiftly conduct experiments with various rerankers to determine the most effective one for our needs.

Conclusion
The importance of selecting the appropriate reranker cannot be overstated in optimizing
RAG systems and ensuring dependable search outcomes by mitigating hallucinations. A
nuanced understanding of various reranker types, including cross-encoders and multi-
vector models, underscores their pivotal role in augmenting search precision.

However, as the RAG pipeline matures, debugging complexities can hinder system
improvements. By leveraging comprehensive observability across the entire pipeline, AI
teams can build production-ready systems.