Optimizing retrieval results in a RAG system requires various modules to work together in
unison. A crucial component of this process is the reranker, a module that improves the
order of documents within the retrieved set to prioritize the most relevant items. Let’s dive
into the intricacies of selecting an optimal reranker for RAG systems, including the
significance of rerankers, scenarios demanding their use, potential drawbacks, and the
diverse types available.
All New: Evaluations for RAG & Chain applications - Learn more!
What is a Reranker?
https://txt.cohere.com/rerank/
We know that hallucinations can happen when unrelated retrieved documents are included in
the context passed to the model. This is exactly where rerankers can be helpful! They
rearrange the retrieved documents to prioritize the most relevant ones. This not only helps
address hallucinations but also saves money during the RAG process. Let’s explore this need
in more detail and why rerankers are necessary.
Limitations of Embeddings
Let's examine why embeddings fail to adequately address retrieval challenges. Their
generalization issues present significant obstacles in real-world applications.
While embeddings capture semantic information, they often lack contrastive information.
For example, embeddings may struggle to distinguish between "I love apples" and "I used to
love apples" since both convey a similar semantic meaning.
Dimensionality Constraints
Generalization Issues
Embeddings must generalize well to unseen documents and queries, which is crucial for
real-world search applications. However, due to their dimensionality constraints and training
data limitations, embedding-based models may struggle to generalize effectively beyond
the training data.
Bag-of-Embeddings Approach
Early interaction models like cross encoders and late-interaction models like ColBERT adopt
a bag-of-embeddings approach. Instead of representing documents as single vectors, they
break documents into smaller, contextualized units of information.
Improved Generalization
They alleviate generalization issues faced by traditional embeddings by focusing on smaller
contextualized units. They can better handle unseen documents and queries, leading to
improved retrieval performance in real-world scenarios.
Types of Rerankers
Rerankers have been used for years, but the field is rapidly evolving. Let's examine current
options and how they differ.
Cross encoder | Open source | Great | Medium | BGE, sentence-transformers, Mixedbread
Cross-Encoders
Cross-encoder models redefine the conventional approach by employing a classification
mechanism for pairs of data. The model takes a pair of data, such as two sentences, as
input, and produces an output value between 0 and 1, indicating the similarity between the
two items. This departure from vector embeddings allows for a more nuanced
understanding of the relationships between data points.
It's important to note that cross-encoders require a pair of "items" for every input, making
them unsuitable for handling individual sentences independently. In the context of search, a
cross-encoder is employed with each data item and the search query to calculate the
similarity between the query and the data object.
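The search-time flow described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `rerank` and `score_pair` are hypothetical names, and `toy_score_pair` is a token-overlap stand-in used only so the example runs without downloading a model. In practice the scoring callable would wrap a real cross-encoder that takes the query and one document together.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    documents: List[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and return the top_n documents."""
    scored = [(doc, score_pair(query, doc)) for doc in documents]
    # Highest score first; ties keep their original retrieval order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def toy_score_pair(query: str, document: str) -> float:
    """Stand-in scorer (NOT a real cross-encoder): Jaccard token overlap in [0, 1]."""
    q_tokens = set(query.lower().split())
    d_tokens = set(document.lower().split())
    if not q_tokens or not d_tokens:
        return 0.0
    return len(q_tokens & d_tokens) / len(q_tokens | d_tokens)


docs = [
    "Rerankers reorder retrieved documents by relevance",
    "Bananas are rich in potassium",
    "A reranker scores each query and document pair",
]
results = rerank("how does a reranker score a document", docs, toy_score_pair, top_n=2)
```

Note that the scorer is called once per candidate document for every query, which is why cross-encoders are usually applied only to a short list returned by a cheaper first-stage retriever.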
Multi-Vector Rerankers
Cross encoders perform very well, but what about alternative options that require less
compute?
Multi-vector embedding models like ColBERT feature late interaction, where the interaction
between query and document representations occurs late in the process, after both have
been independently encoded. This approach contrasts with early interaction models like
cross-encoders, where query and document embeddings interact at earlier stages,
potentially leading to increased computational complexity. With plain cosine similarity of
single embeddings for retrieval, by comparison, there is no interaction at all.
The late interaction design allows for the pre-computation of document representations,
contributing to faster retrieval times and reduced computational demands, making it
suitable for processing large document collections.
ColBERT offers the best of both worlds, so let's see how it is implemented. Below is a code
snippet from Jina’s blog.
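import torch


def compute_relevance_scores(query_embeddings, document_embeddings, k):
    """Compute relevance scores for top-k documents given a query."""
    # Dot product of every query term with every document term:
    # [1, num_query_terms, dim] x [k, dim, max_doc_length] -> [k, num_query_terms, max_doc_length]
    scores = torch.matmul(query_embeddings.unsqueeze(0), document_embeddings.transpose(1, 2))
    # Best-matching document term for each query term: [k, num_query_terms]
    max_scores_per_query_term = scores.max(dim=2).values
    # Sum over query terms to get one score per document: [k]
    total_scores = max_scores_per_query_term.sum(dim=1)
    # Document indices ordered from most to least relevant
    sorted_indices = total_scores.argsort(descending=True)
    return sorted_indices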
1. Dot Product: It computes the dot product between the query embeddings and document
embeddings. This operation is performed using torch.matmul(), which calculates the matrix
multiplication between query_embeddings.unsqueeze(0) (unsqueeze is used to add a batch
dimension) and document_embeddings.transpose(1, 2) (transposed to align dimensions for
multiplication). The result is a tensor with shape [k, num_query_terms, max_doc_length],
where each element represents the similarity score between a query term and a document
term.
2. Max pooling: Max-pooling operation is applied across document terms (dimension 2) to
find the maximum similarity score per query term. This is done using
scores.max(dim=2).values. The resulting tensor has shape [k, num_query_terms], where
each row corresponds to one document and holds, for each query term, the maximum
similarity score across that document's terms.
3. Total scoring: The maximum scores per query term are summed up to get the total score
for each document. This is done using .sum(dim=1), resulting in a tensor with shape [k]
containing the total relevance score for each document.
4. Sorting: The documents are sorted based on their total scores in descending order using
.argsort(descending=True).
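To make the four steps concrete, here is a dependency-free sketch of the same scoring on tiny hand-made vectors. The function names and the 2-dimensional "embeddings" are assumptions chosen for illustration; real ColBERT-style models produce one learned vector per token and pad documents to a common length, which this sketch does not do.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors (step 1, for one term pair)."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """For each query term take its best-matching document term, then sum (steps 1-3)."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_documents(query_vecs: List[Vector], docs_vecs: List[List[Vector]]) -> List[int]:
    """Return document indices sorted by descending total score (step 4)."""
    totals = [maxsim_score(query_vecs, doc) for doc in docs_vecs]
    return sorted(range(len(docs_vecs)), key=lambda i: totals[i], reverse=True)


# Made-up term vectors: two query terms, three candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
documents = [
    [[0.5, 0.0], [0.0, 0.25]],  # weak match on both query terms
    [[1.0, 0.0], [0.0, 1.0]],   # exact match on both query terms
    [[1.0, 0.0], [1.0, 0.0]],   # matches only the first query term
]
scores = [maxsim_score(query, doc) for doc in documents]
order = rank_documents(query, documents)
```

The third document shows why the per-term maximum matters: it matches one query term perfectly but contributes nothing for the other, so it ranks below the document that covers both.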
Jina-ColBERT is a ColBERT-style model based on JinaBERT, so it can support an 8k context
with better retrieval. The extra context length can help reduce the reliance on chunking
techniques.
https://huggingface.co/jinaai/jina-colbert-v1-en
https://arxiv.org/abs/2308.07107
Pointwise Methods
Listwise Methods
Listwise methods directly rank a list of documents by inserting the query and a document
list into the prompt and instructing the LLMs to output the reranked document identifiers.
Due to the limited input length of LLMs, inserting all candidate documents into the prompt
is not feasible. To address this issue, these methods employ a sliding window strategy to
rerank a subset of candidate documents each time. This involves ranking from back to front
using a sliding window and re-ranking only the documents within the window at a time.
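The back-to-front sliding window can be sketched as follows. This is an illustration under stated assumptions: `sliding_window_rerank`, `window_size`, and `step` are hypothetical names, and `toy_rank_window` replaces the LLM call with a lookup into made-up relevance values so the example runs offline.

```python
from typing import Callable, List

# A callable that takes (query, window of doc ids) and returns the window reordered.
RankWindow = Callable[[str, List[str]], List[str]]


def sliding_window_rerank(
    query: str,
    docs: List[str],
    rank_window: RankWindow,
    window_size: int = 4,
    step: int = 2,
) -> List[str]:
    """Rerank docs by sliding a fixed-size window from the back of the list to the front."""
    if not 1 <= step <= window_size:
        raise ValueError("step must be between 1 and window_size")
    ranked = list(docs)
    end = len(ranked)
    while True:
        start = max(0, end - window_size)
        # Only the documents inside the window are reordered on each pass.
        ranked[start:end] = rank_window(query, ranked[start:end])
        if start == 0:
            break
        end -= step
    return ranked


# Toy stand-in for the LLM: made-up relevance values, higher is better.
RELEVANCE = {"d0": 1, "d1": 2, "d2": 3, "d3": 4, "d4": 5, "d5": 6}


def toy_rank_window(query: str, window: List[str]) -> List[str]:
    return sorted(window, key=lambda doc_id: RELEVANCE[doc_id], reverse=True)


candidates = ["d0", "d1", "d2", "d3", "d4", "d5"]
reranked = sliding_window_rerank("any query", candidates, toy_rank_window)
```

Because consecutive windows overlap, strong documents near the end of the list are carried forward pass by pass. The tail is not fully sorted after a single sweep, which is usually acceptable when only the top few results are passed on.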
Pairwise Methods
In pairwise methods, LLMs receive a prompt containing a query and a document pair. They
are then directed to generate the identifier of the document deemed more relevant.
Aggregation methods like AllPairs are employed to rerank all candidate documents. AllPairs
initially generates all potential document pairs and consolidates a final relevance score for
each document. Efficient sorting algorithms like heap sort and bubble sort are typically
utilized to expedite the ranking process.
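A minimal sketch of the AllPairs-style aggregation described above, assuming a simple win count as the consolidated relevance score and one comparison per unordered pair. `all_pairs_rerank` and `prefer` are hypothetical names, documents are assumed to be unique strings, and `toy_prefer` replaces the LLM judgment with token overlap so the example runs offline.

```python
from itertools import combinations
from typing import Callable, Dict, List

# A callable that takes (query, doc_a, doc_b) and returns the preferred document.
PreferFn = Callable[[str, str, str], str]


def all_pairs_rerank(query: str, docs: List[str], prefer: PreferFn) -> List[str]:
    """Compare every document pair once and rank documents by number of wins."""
    wins: Dict[str, int] = {doc: 0 for doc in docs}
    for doc_a, doc_b in combinations(docs, 2):
        wins[prefer(query, doc_a, doc_b)] += 1
    return sorted(docs, key=lambda doc: wins[doc], reverse=True)


def toy_prefer(query: str, doc_a: str, doc_b: str) -> str:
    """Stand-in for the LLM judgment: prefer the document sharing more query tokens."""
    q_tokens = set(query.lower().split())
    overlap_a = len(q_tokens & set(doc_a.lower().split()))
    overlap_b = len(q_tokens & set(doc_b.lower().split()))
    return doc_a if overlap_a >= overlap_b else doc_b


candidates = ["cooking pasta", "reranker overview", "reranker latency benchmarks"]
ranked = all_pairs_rerank("reranker latency", candidates, toy_prefer)
```

With n candidates this makes n(n-1)/2 comparisons, which is the cost that sorting-based variants aim to reduce.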
This can be accomplished with the help of GPT-3.5, GPT-4 and GPT-4 Turbo, where the last
one offers the best tradeoff between cost and performance.
Encoder-Decoder
Studies in this category mainly formulate document ranking as a generation task, optimizing
an encoder-decoder-based reranking model. For example, the RankT5 model is fine-tuned
to generate classification tokens for relevant or irrelevant query-document pairs.
Decoder-only
https://txt.cohere.com/rerank/
Hosting and improving rerankers is often challenging. Private reranking APIs offer a
convenient solution for organizations seeking to enhance their search systems with
semantic relevance without making an infrastructure investment. Below, we look into three
notable private reranking APIs: Cohere, Jina, and Mixedbread.
Cohere
Cohere's rerank API offers rerank models tailored for English and multilingual documents,
each optimized for specific language processing tasks. Cohere automatically breaks down
lengthy documents into manageable chunks for efficient processing. The API delivers
relevance scores normalized between 0 and 1, facilitating result interpretation and threshold
determination.
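As an illustration of threshold determination on normalized scores, the sketch below filters reranked results by a cutoff. Everything here is assumed for the example: `RerankResult` is a hypothetical container (not Cohere's response type), the scores are made up, and the 0.5 threshold is arbitrary and would need tuning on real data.

```python
from typing import List, NamedTuple


class RerankResult(NamedTuple):
    """Hypothetical container: position in the candidate list plus a 0-1 score."""
    index: int
    relevance_score: float


def filter_by_threshold(results: List[RerankResult], threshold: float = 0.5) -> List[RerankResult]:
    """Keep results at or above the threshold, most relevant first."""
    kept = [r for r in results if r.relevance_score >= threshold]
    return sorted(kept, key=lambda r: r.relevance_score, reverse=True)


# Made-up scores, not real API output.
hypothetical_results = [
    RerankResult(index=0, relevance_score=0.12),
    RerankResult(index=1, relevance_score=0.91),
    RerankResult(index=2, relevance_score=0.57),
]
kept = filter_by_threshold(hypothetical_results, threshold=0.5)
```

A fixed cutoff like this only makes sense because the scores share a common 0 to 1 scale across queries.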
Jina
https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/
Mixedbread
https://www.mixedbread.ai/blog/mxbai-rerank-v1
Mixedbread offers a family of reranking models with an open-source Apache 2.0 license.
Relevance Improvement
The primary objective of adding a reranker is to enhance the relevance of search results.
Evaluate the reranker's ability to improve the ranking in terms of retrieval metrics like NDCG
or generation metrics like attribution.
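For reference, NDCG@k can be computed as in the sketch below, which compares the same candidate set before and after a hypothetical reranking step. The relevance labels are made up, the function names are illustrative, and this version uses linear gain (the label itself) rather than the exponential-gain variant.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain: each label is discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: List[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Made-up binary relevance labels for five candidates, listed in ranked order.
before_rerank = [0, 1, 0, 1, 0]  # order returned by the retriever
after_rerank = [1, 1, 0, 0, 0]   # same candidates after a hypothetical reranker

ndcg_before = ndcg_at_k(before_rerank, k=3)
ndcg_after = ndcg_at_k(after_rerank, k=3)
```

Here the reranker moves both relevant documents into the top positions, so NDCG@3 rises from roughly 0.39 to 1.0.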
Latency Considerations
Assess the additional latency introduced by the reranker to the search system. Measure the
time required for reranking documents and ensure that it remains within acceptable limits
for real-time or near-real-time search applications.
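One way to take this measurement is to time the reranking call in isolation over repeated runs, as sketched below. `measure_latency` and `toy_rerank` are hypothetical names; the toy function only sorts by length and stands in for whatever reranker is being evaluated. The p95 value uses the nearest-rank method.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(
    rerank_fn: Callable[[str, List[str]], List[str]],
    query: str,
    docs: List[str],
    runs: int = 20,
) -> Dict[str, float]:
    """Time repeated rerank calls and report mean, p95, and max latency in milliseconds."""
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        rerank_fn(query, docs)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    # Nearest-rank percentile: smallest timing with at least 95% of runs at or below it.
    p95_index = min(len(timings_ms) - 1, math.ceil(0.95 * len(timings_ms)) - 1)
    return {
        "mean_ms": statistics.mean(timings_ms),
        "p95_ms": timings_ms[p95_index],
        "max_ms": timings_ms[-1],
    }


def toy_rerank(query: str, docs: List[str]) -> List[str]:
    """Stand-in for a real reranker so the example runs offline."""
    return sorted(docs, key=len)


stats = measure_latency(toy_rerank, "example query", ["a" * n for n in range(1, 101)])
```

Reporting a tail percentile alongside the mean is useful because occasional slow calls are what users of a real-time search system tend to notice.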
Contextual Understanding
Determine the reranker's ability to handle varying lengths of context in queries and
documents. Some rerankers may be optimized for short text inputs, while others may be
capable of processing longer sequences with rich contextual information.
Generalization Ability
Evaluate the reranker's ability to generalize across different domains and datasets. Ensure
that the reranker performs well not only on training data but also on unseen or out-of-
domain data to prevent overfitting and ensure robust performance in diverse search
scenarios.
In the in-domain setting, differences between evaluated rerankers are not as pronounced.
However, in out-of-domain scenarios, the gap between approaches widens, suggesting that
the choice of reranker can significantly impact performance, especially across different
domains.
Increasing the number of documents to rerank has a positive impact on the final
effectiveness of the reranking process. This highlights the importance of considering the
trade-off between computational resources and performance gains when determining the
optimal number of reranked documents.
Effective cross-encoders, when paired with strong retrievers, have shown the ability to
outperform most LLMs in reranking tasks, except for GPT-4 on some datasets. Notably,
cross-encoders offer this improved performance while being more efficient, making them
an attractive option for reranking tasks.
Zero-shot LLM-based rerankers, including those based on OpenAI and open models, exhibit
competitive effectiveness, with some even matching the performance of GPT-3.5 Turbo.
However, the inefficiency and high cost associated with these models currently limit their
practical use in retrieval systems, despite their promising performance.
Overall, the research highlights the significance of reranking methods in enhancing retrieval,
with cross-encoders emerging as a particularly promising and efficient option. While
LLM-based rerankers show competitive effectiveness, their practical deployment is hindered
by their inefficiency and high cost.
Let's continue with our last RAG example, where we built a Q&A system on Nvidia’s 10-K
filings. At the time our goal was to evaluate embedding models. This time we want to see
how we can evaluate a reranker.
We leverage the same data and introduce Cohere reranker in the RAG chain as shown
below.
import os

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema import StrOutputParser
from langchain_community.vectorstores import Pinecone as langchain_pinecone
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
from pinecone import Pinecone


def get_qa_chain(embeddings, index_name, emb_k, rerank_k, llm_model_name, temperature):
    # setup retriever: fetch emb_k candidates from the vector store
    pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
    index = pc.Index(index_name)
    vectorstore = langchain_pinecone(index, embeddings.embed_query, "text")
    compressor = CohereRerank(top_n=rerank_k)
    retriever = vectorstore.as_retriever(search_kwargs={"k": emb_k})  # https://githu

    # rerank retriever: keep only the top rerank_k of the emb_k candidates
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor, base_retriever=retriever
    )

    # setup prompt
    rag_prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "Answer the question based only on the provided context."
            ),
            ("human", "Context: '{context}' \n\n Question: '{question}'"),
        ]
    )

    # setup llm
    # The source line is cut off after "tiktoken_mod"; that trailing argument is omitted here.
    llm = ChatOpenAI(model_name=llm_model_name, temperature=temperature)

    # helper function to format docs
    def format_docs(docs):
        return "\n\n".join([d.page_content for d in docs])

    # setup chain
    rag_chain = (
        {"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
        | rag_prompt
        | llm
        | StrOutputParser()
    )

    return rag_chain
We modify our rag_chain_executor function to include emb_k and rerank_k which are the
number of retrieved docs we want from OpenAI’s text-embedding-3-small retrieval and
Cohere’s rerank-english-v2.0 reranker.
Now we can go ahead and run the same sweep with the required parameters.
pq.sweep(
    rag_chain_executor,
    {
        "emb_model_name": ["text-embedding-3-small"],
        "dimensions": [384],
        "llm_model_name": ["gpt-3.5-turbo-0125"],
        "emb_k": [10],
        "rerank_k": [3]
    },
)
The outcome isn't surprising! We get a 10% increase in attribution, indicating we now
possess more relevant chunks necessary to address the question. Additionally, there's 5%
improvement in Context Adherence, suggesting a reduction in hallucinations.
In most cases, we continue to seek further improvements rather than stopping at this point.
Galileo Evaluate facilitates error analysis by examining individual runs and inspecting
attributed chunks. The following illustrates the specific chunk attributed from the retrieved
chunks in the vector DB.
When we click the rerank node, we can see the total attributed chunks from the reranker
and each chunk’s attribution (yes/no).
Conclusion
The importance of selecting the appropriate reranker cannot be overstated in optimizing
RAG systems and ensuring dependable search outcomes by mitigating hallucinations. A
nuanced understanding of various reranker types, including cross-encoders and multi-
vector models, underscores their pivotal role in augmenting search precision.
However, as the RAG pipeline matures, debugging complexities can hinder system
improvements. By leveraging comprehensive observability across the entire pipeline, AI
teams can build production-ready systems.