Exercise 2: RAG with Reranker for 10-K filings

Artificial Intelligence
Rohit Aggarwal
Harpreet Singh

The objective of this exercise series is to develop a prototype of a Retrieval-Augmented Generation (RAG) system capable of answering questions based on 10-K filings submitted to the U.S. Securities and Exchange Commission (SEC). The full series includes six Colab notebooks, each exploring progressively advanced concepts in RAG systems and their applications:

  • Exercise 1: Simple RAG for 10-K filings

  • Exercise 2: RAG with Reranker for 10-K filings

    Code with Explanation is posted here: Colab Notebook Link

  • Exercise 3: RAG with Query Decomposition & Tracing with LangSmith/LangFuse

  • Exercise 4: RAG with Agentic Pattern: ReAct (Reasoning + Action)

  • Exercise 5: RAG with Agentic Pattern: ReAct + Reflection

These exercises build incrementally on basic RAG, with a focus on the "why" before the "what" and the "how".

This exercise, the second in the series, illustrates how reranking improves the quality of the responses generated by a RAG system. It extends the previous exercise by adding a reranker.

We encourage readers to go through the section Reranking Retrieved Chunks using a Reranker (Cross-Encoder model) below before going through the code.

 


Reranking Retrieved Chunks using a Reranker (Cross-Encoder model)

While vector similarity search provides a good initial set of relevant chunks, it can sometimes miss nuanced semantic relationships or return chunks that are only superficially similar to the query. Consider a user asking "List out major changes that occurred in Tesla in 2023." A vector search might rank chunks discussing changes from 2022 higher than a more relevant chunk about a Director selling common stock in 2023, simply because the 2022 chunks share more semantic similarities around the concept of "changes" and "Tesla." This highlights a limitation of pure vector similarity matching.

This is where rerankers come into play, serving as a crucial refinement layer in the RAG pipeline. A reranker takes the initial set of retrieved chunks from the vector database and performs a more sophisticated, computationally intensive analysis to improve the ranking quality. The reranking process often employs cross-encoders, which are transformer models that simultaneously process both the query and a candidate chunk to produce a relevance score. This approach captures more subtle semantic relationships and contextual nuances. It can correctly identify that the Director's stock sale in 2023 is more relevant to the query than changes from 2022, despite fewer surface-level semantic similarities.

A natural question arises: why not use these more sophisticated reranker models for the initial retrieval instead of vector search? The answer lies in computational efficiency. Using a reranker as the primary retrieval mechanism would require passing each query through the model alongside millions of individual chunks in the vector database, computing similarity scores one at a time. This process would be prohibitively expensive and slow, especially for large-scale applications that need to maintain responsive query times.

This is why modern RAG systems typically employ a two-stage retrieval process that combines the best of both approaches. They first use rapid vector similarity search to quickly identify a promising set of candidates (e.g., top 100 chunks), then apply the more sophisticated reranker to this smaller set to determine the final top-k chunks (e.g., top 5-10) that will be provided as context to the language model. This hybrid approach balances computational efficiency with retrieval quality, ensuring that the system provides accurate and relevant responses while maintaining reasonable response times.
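
To make the two-stage pattern concrete, here is a minimal sketch using the sentence-transformers library. The model names, the in-memory list of chunks, and the helper function are illustrative assumptions; a production system would run the first stage against a vector database with pre-computed chunk embeddings.

```python
# Minimal two-stage retrieval sketch: fast bi-encoder search, then cross-encoder reranking.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")              # embedding model (assumed choice)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # reranking model (assumed choice)

def two_stage_retrieve(query, chunks, first_k=100, final_k=5):
    # Stage 1: vector similarity search over chunk embeddings (pre-computed in practice).
    chunk_embeddings = bi_encoder.encode(chunks, convert_to_tensor=True)
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, chunk_embeddings, top_k=first_k)[0]
    candidates = [chunks[hit["corpus_id"]] for hit in hits]

    # Stage 2: the cross-encoder scores each (query, chunk) pair jointly, then we re-sort.
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in reranked[:final_k]]
```

The final top-k chunks returned by such a function would then be passed as context to the language model, exactly as in the previous exercise.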

Technical Details

Reranking model 

The key distinction between Embedding models (typically bi-encoders) and Reranking models (typically cross-encoders) lies in how they process queries and chunks. Bi-encoders process each text independently - the query and chunk are fed through the model separately to generate their respective embeddings. These embeddings can then be compared using similarity metrics like cosine similarity. This approach allows for efficient retrieval since chunk embeddings can be pre-computed and indexed, but it limits the model's ability to capture complex interactions between the query and document.
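
As a minimal sketch of the bi-encoder flow (assuming the sentence-transformers library; the model name and example texts are illustrative), the query and the chunk are encoded separately, and only their resulting vectors are compared:

```python
# Bi-encoder sketch: independent encoding, then a similarity metric over the embeddings.
from sentence_transformers import SentenceTransformer, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")

query = "List out major changes that occurred in Tesla in 2023"
chunk = "In 2023, a Director sold shares of the company's common stock."

query_emb = bi_encoder.encode(query, convert_to_tensor=True)  # encoded on its own
chunk_emb = bi_encoder.encode(chunk, convert_to_tensor=True)  # in practice, pre-computed and indexed

print(float(util.cos_sim(query_emb, chunk_emb)))              # cosine similarity between the two vectors
```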

Cross-encoders take a fundamentally different approach by processing the query and chunk together as a single input. By concatenating the query and chunk with a separator token, the model can leverage its attention mechanisms to directly compare and contrast every term in the query with every term in the chunk. This enables the model to capture nuanced relevance patterns and contextual relationships that might be missed when processing texts independently. For example, if a query asks about "Tesla's competitors' plant locations," a cross-encoder can directly attend to chunks mentioning locations of Tesla's competitors (e.g., Ford's plant in Michigan, Rivian's facility in Illinois) while deprioritizing chunks that primarily describe Tesla's own plant locations. This results in a reranked list where chunks about competitors' locations are moved higher than those focusing on Tesla itself, better aligning with the user's intent.
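
A brief sketch of the same idea in code, again assuming sentence-transformers and an off-the-shelf MS MARCO cross-encoder (the model name and example texts are assumptions for illustration):

```python
# Cross-encoder sketch: each (query, chunk) pair is scored jointly as a single input.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Where are Tesla's competitors' plants located?"
chunks = [
    "Ford operates an assembly plant in Michigan; Rivian runs a facility in Illinois.",
    "Tesla's Gigafactory in Texas produces the Model Y and Cybertruck.",
]

# Internally the model sees query [SEP] chunk and outputs one relevance score per pair.
scores = reranker.predict([(query, chunk) for chunk in chunks])
for chunk, score in sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {chunk}")
```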

When it comes to model training objectives, embedding models and cross-encoder models serve different purposes despite often starting from the same base architectures like BERT. Embedding models are fine-tuned specifically to generate high-quality sentence or paragraph level embeddings that capture semantic meaning in a fixed-dimensional vector space. The training process typically involves contrastive learning objectives that push similar texts closer together and dissimilar texts further apart in the embedding space. Cross-encoder models, on the other hand, are fine-tuned to directly predict a relevance score given a query-document pair. Rather than generating embeddings, the model learns to output a single similarity score that indicates how well the document answers the query. This direct optimization for the ranking task typically leads to better ranking performance, though at the cost of computational efficiency since pairs must be processed together.
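
The sketch below contrasts the two training setups using the classic sentence-transformers training API; the base models, the toy training pairs, and the labels are assumptions for illustration only (newer library versions also offer trainer-based APIs).

```python
# Illustrative training-objective sketch: toy data, one epoch each.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, CrossEncoder, InputExample, losses

# Embedding model: contrastive objective over (query, relevant chunk) pairs; other
# examples in the batch act as negatives, pulling matches together and pushing
# non-matches apart in the embedding space.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
pair_examples = [
    InputExample(texts=["Tesla plant locations", "Tesla operates a Gigafactory in Texas."]),
    InputExample(texts=["Director stock sale 2023", "In 2023, a Director sold common stock."]),
]
bi_loader = DataLoader(pair_examples, shuffle=True, batch_size=2)
bi_loss = losses.MultipleNegativesRankingLoss(bi_encoder)
bi_encoder.fit(train_objectives=[(bi_loader, bi_loss)], epochs=1, show_progress_bar=False)

# Cross-encoder: directly predict a relevance score for each labeled (query, chunk)
# pair instead of producing embeddings.
reranker = CrossEncoder("distilroberta-base", num_labels=1)
scored_examples = [
    InputExample(texts=["Director stock sale 2023", "In 2023, a Director sold common stock."], label=1.0),
    InputExample(texts=["Director stock sale 2023", "Tesla expanded a factory in 2022."], label=0.0),
]
ce_loader = DataLoader(scored_examples, shuffle=True, batch_size=2)
reranker.fit(train_dataloader=ce_loader, epochs=1, show_progress_bar=False)
```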