Tue Feb 11 / Preetam Joshi, Alex Lyzhov, Bibek Paudel, Puneet Anand
RRE-1 helps developers easily evaluate retrieval performance and fix relevance issues by applying the learnings from that evaluation in the re-ranking phase; RRE-1 can also be used as a low-latency re-ranker via a convenient API.
Good retrieval is like being able to quickly find the mountain lion camouflaged in this picture. (image credit: link)
Retrieval is the process of searching through a large collection of documents, data, or information to find the most relevant pieces in response to a specific query. It originated in early library systems and evolved with the rise of digital search engines, where it was primarily used for document retrieval. Over time, retrieval systems became integral to web search, e-commerce, recommendation systems, and large-scale knowledge bases. Fig.1 shows a typical example of a retrieval system.
A query expansion module is used to handle ambiguous queries (typically a single word, e.g. “space”, which could refer to astronomy, real estate, or just a stray keystroke). Traditional retrieval methods, such as reverse index-based searches, rely on keyword matching, which often fails to capture semantic meaning, leading to poor relevance. Vector database searches, on the other hand, improve semantic matching but struggle with precision, especially in domain-specific contexts. Recent advancements like Graph RAG (Retrieval-Augmented Generation with Graphs) and knowledge graphs have further enhanced vector database retrieval. By integrating structured relationships between entities, these methods improve contextual understanding, enabling retrieval systems to disambiguate queries, capture implicit relationships, and prioritize relevant results. Details about query expansion and Graph RAG are out of the scope of this article.
As datasets grow in size and complexity, poor relevance of retrieved documents becomes more pronounced. To address these limitations, two-phase ranking, where an initial retrieval phase is followed by a re-ranking phase, has resurged in popularity. The first stage quickly narrows down candidates using fast heuristics, while the second stage applies deeper contextual understanding using machine learning models. As shown in Fig. 1, re-ranking helps place the most relevant candidate documents (retrieved in the first phase) at the top. A re-ranker is more computationally expensive than the retrieval (vector/reverse index) phase and typically uses more sophisticated logic (either an ML model or complex heuristics) to better rank the retrieved candidate documents. In traditional search, tree-based models (GBDT or Random Forest) were widely used as the re-ranker of choice. This approach improves retrieval relevance, especially when the re-ranking models are adaptable to domain-specific nuances, ensuring higher precision. In this post, we outline our approach to re-ranking and evaluating retrieval relevance that can be easily customized to each domain.
Fig. 1 A typical retrieval system consisting of a query expansion module, a reverse/vector index and a re-ranker.
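To make the two-phase flow described above concrete, here is a minimal, library-free sketch (not AIMon-specific): a cheap keyword-overlap heuristic stands in for the first-stage index, and any more expensive scorer (a cross-encoder, or a re-ranker such as RRE-1) can be plugged in as rerank_fn.

# Minimal sketch of two-phase ranking. `cheap_score` stands in for the fast
# first-stage retriever; `rerank_fn` stands in for a slower, more accurate
# second-stage model.

def cheap_score(query: str, doc: str) -> float:
    # Phase 1 heuristic: simple keyword overlap between query and document.
    q_terms, d_terms = set(query.lower().split()), set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def two_phase_rank(query, corpus, rerank_fn, k_candidates=100, k_final=5):
    # Phase 1: score the whole corpus cheaply and keep a broad candidate set.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:k_candidates]
    # Phase 2: re-score only the candidates with the expensive re-ranker.
    reranked = sorted(candidates, key=lambda d: rerank_fn(query, d), reverse=True)
    return reranked[:k_final]

docs = ["the moon orbits earth", "moonbabies are a swedish duo", "bristol trip hop"]
# For the demo we reuse cheap_score as the re-ranker; in practice this would be an ML model.
print(two_phase_rank("swedish duo moonbabies", docs, rerank_fn=cheap_score, k_final=2))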
A Retrieval-Augmented Generation (RAG) application consists of several essential components working together to retrieve relevant information and generate meaningful responses. At its core, the process begins with query processing and embedding, where a user’s input is transformed into a dense vector representation using an embedding model. The effectiveness of this step depends on the model’s ability to capture semantic meaning, thus making the choice of embeddings a critical factor.
Once embedded, the query is matched against stored vectors in a vector database to find the most relevant documents. While vector retrieval improves upon traditional keyword-based searches, it has limitations—such as struggling with ambiguous queries, domain-specific jargon, and retrieval drift, where highly similar but irrelevant results appear. The retrieved documents are then re-ranked and filtered using machine learning models to ensure that the most relevant content appears first, improving overall response quality. Finally, the retrieved documents are passed into the generation model, allowing it to create responses enriched with retrieved knowledge rather than relying solely on internal training data.
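As a toy illustration of the embed-and-match step described above, the sketch below uses sentence-transformers as one example embedding model and brute-force cosine similarity in place of a real vector database; the model name and documents are illustrative only.

# Illustrative query embedding and nearest-neighbour matching.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example general-purpose embedding model

documents = [
    "Moonbabies is a Swedish duo formed in 1997.",
    "The Moon is Earth's only natural satellite.",
]
doc_vecs = model.encode(documents, normalize_embeddings=True)
query_vec = model.encode(["Which band is Moonbabies?"], normalize_embeddings=True)[0]

# With normalized vectors, the dot product equals the cosine similarity.
scores = doc_vecs @ query_vec
for doc, score in sorted(zip(documents, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")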
However, several challenges hinder the effectiveness of retrieval. One key issue is the quality of embeddings - a general-purpose embedding model may not capture the subtle nuances of a specialized domain like medicine, law, or finance. Domain-specific fine-tuning or using hybrid retrieval approaches (combining vector search with traditional keyword matching) can significantly improve results. Another challenge is handling long-tail queries, where rare or complex questions often return poor results due to insufficient training data. In addition, developers must ensure that retrieval remains performant at scale, balancing accuracy and response latency.
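One common way to realize the hybrid approach mentioned above is a weighted blend of normalized scores from a keyword retriever and a vector retriever; the sketch below assumes you already have per-document scores from both systems, and the normalization and weighting choices are illustrative, not prescriptive.

import numpy as np

def min_max_normalize(scores: np.ndarray) -> np.ndarray:
    # Put scores from different retrievers on a comparable 0-1 scale.
    lo, hi = scores.min(), scores.max()
    return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

def hybrid_scores(keyword_scores, vector_scores, alpha=0.5):
    # alpha controls the blend: 1.0 = keyword-only, 0.0 = vector-only.
    kw = min_max_normalize(np.asarray(keyword_scores, dtype=float))
    vec = min_max_normalize(np.asarray(vector_scores, dtype=float))
    return alpha * kw + (1 - alpha) * vec

# Example: BM25-style scores and cosine similarities for the same four documents.
print(hybrid_scores([12.1, 3.4, 0.0, 7.8], [0.83, 0.61, 0.10, 0.74], alpha=0.4))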
When building a RAG application, developers face several practical concerns. Data chunking strategies impact retrieval effectiveness; segmenting documents incorrectly can lead to fragmented, contextually incomplete responses. Keeping the vector database up to date is another challenge: stale or irrelevant data can degrade the quality of responses, particularly in dynamic fields like finance or current events. Poor retrieval can also lead to hallucinations in the output. Finally, the quality of the data being indexed matters; ensuring the data is properly cleaned takes significant time and effort.
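Chunking in particular is worth experimenting with early; here is a minimal sketch of fixed-size chunking with overlap (the sizes are arbitrary and should be tuned per corpus).

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100):
    # Fixed-size character chunks with overlap, so a sentence that spans a
    # chunk boundary still appears intact in at least one chunk.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

print(len(chunk_text("some long document ..." * 200)))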
Evaluating the quality of retrieval is crucial to ensure a RAG system delivers relevant, precise, and domain-aware responses. Poor retrieval not only leads to irrelevant answers but can also cause models to hallucinate information, reducing trust and usability. To quantitatively assess retrieval quality, several key metrics are commonly used when ground truth is available: Precision@K and Recall@K measure how many of the top K retrieved documents are relevant and whether any critical information is missing, while rank-aware metrics such as Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (NDCG) additionally reward placing relevant documents higher in the list.
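These metrics are straightforward to compute once relevance judgments are available; below is a minimal sketch of Precision@K, Recall@K, and MRR (the document IDs and labels are illustrative, and MAP and NDCG follow the same pattern).

def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved documents that are relevant.
    top_k = retrieved[:k]
    return sum(1 for d in top_k if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant documents that appear in the top-k.
    top_k = retrieved[:k]
    return sum(1 for d in top_k if d in relevant) / max(len(relevant), 1)

def mean_reciprocal_rank(all_retrieved, all_relevant):
    # Average of 1/rank of the first relevant document, over all queries.
    rr = []
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        rank = next((i + 1 for i, d in enumerate(retrieved) if d in relevant), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, 3), recall_at_k(retrieved, relevant, 3))
print(mean_reciprocal_rank([retrieved], [relevant]))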
Beyond quantitative measures, dataset selection plays a key role in benchmarking. Popular domain-specific datasets such as SQuAD for general QA, MIMIC for medical retrieval, and CaseLaw for legal applications provide a gold standard for comparison. However, real-world data isn’t always available, especially in enterprise settings. In such cases, synthetic datasets can be constructed using LLMs to generate queries and their corresponding expected results. Careful validation of synthetic data is needed to ensure it aligns with real-world use cases.
In addition, qualitative evaluation is equally important. Developers manually inspect retrieval results to check for factual correctness, contextual alignment, and response diversity. This step is essential in cases where models may rank results based on generic similarity rather than true relevance.
A critical aspect of measuring retrieval quality is query-context relevance within a specific domain. Different industries have unique data structures and retrieval expectations - for instance, a legal RAG model must retrieve court cases based on precedent, while a medical model should prioritize clinical guidelines. Enterprise applications often require adapting retrieval methods to specific data patterns, ensuring that the retrieved documents align with business needs. Without domain-specific tuning, even high-ranking results can fail to provide meaningful value, highlighting the importance of continuous adaptation and evaluation.
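Returning to synthetic evaluation sets: if real query logs aren't available, one workable pattern is to prompt an LLM to write a question answerable only from each indexed passage. The template below is purely illustrative; the actual LLM call and validation of its output depend on your stack and are omitted here.

# Illustrative prompt template for generating synthetic query-context pairs.
PROMPT_TEMPLATE = """You are building a retrieval evaluation set.
Given the passage below, write one question that can be answered only from
this passage, then answer it in one sentence.

Passage:
{passage}

Respond as JSON with keys "query" and "answer"."""

passages = [
    "Moonbabies is a Swedish duo formed in 1997 by Ola Frick and Carina Johansson.",
    "Massive Attack are an English trip hop group formed in 1988 in Bristol.",
]

# Send each prompt to your LLM of choice, then validate the generated pairs
# (e.g., manually spot-check a sample) before treating them as ground truth.
prompts = [PROMPT_TEMPLATE.format(passage=p) for p in passages]
for p in prompts:
    print(p, end="\n\n")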
Using AIMon’s retriever relevance evaluator (RRE-1), developers can customize the grader to their domain through the “task_definition” field, where they provide details of the task and a few examples of query-context pairs along with their relevance scores. The code snippet below provides a simple example:
from aimon import Detect
import os

# Configure the retrieval relevance evaluator here
aimon_config = {
    "retrieval_relevance": {"detector_name": "default"}
}

# Set up the AIMon decorator
rr_eval = Detect(
    values_returned=["user_query", "context", "task_definition"],
    api_key=os.getenv("AIMON_API_KEY"),
    config=aimon_config,
    async_mode=True,
    publish=True,
    application_name="summarization_app_jan_20_2025",
    model_name="my-llm",
)

@rr_eval
def my_llm_app(query, context):
    task_def = "Your task is to grade the relevance of context document against a specified user query. The domain here is music."
    return query, context, task_def

# The decorator passes through the function's return values and appends the AIMon response.
_, _, _, aimon_res = my_llm_app(
    "Among Moonbabies and Massive Attack, which group has a larger number of members who are musicians?",
    ["Moonbabies is a Swedish duo formed in 1997 by vocalists, multi-instrumentalists, producers, and songwriters Ola Frick (Vocals, guitar and various instruments) and Carina Johansson (Vocals and keyboards).Massive Attack are an English trip hop group formed in 1988 in Bristol, consisting of Robert \"3D\" Del Naja, Grant \"Daddy G\" Marshall and formerly Andy \"Mushroom\" Vowles (\"Mush\")."],
)

if aimon_res.status == 200:
    print(f"\U0001F7E2 {aimon_res.detect_response['message']}\n")

# Output:
# 🟢 Data successfully sent to AIMon
Here is the corresponding output (formatted in JSON for easy reading):
[
  {
    "retrieval_relevance": [
      {
        "explanations": [
          "1. Document 1 provides information about both Moonbabies and Massive Attack noting that Moonbabies is a duo with two members who are musicians, while Massive Attack consists of three members, all of whom are musicians as well. This directly relates to the query about which group has a larger number of musician members. However, the document does not provide additional context about the roles and contributions of each member, limiting the depth of comparison regarding their musical involvement."
        ],
        "query": "Among Moonbabies and Massive Attack, which group has a larger number of members who are musicians?",
        "relevance_scores": [
          39.75
        ]
      }
    ]
  }
]
Notice how AIMon returns both a relevance score (range: 0.0 to 100.0), which can be interpreted as a percentage match indicating the degree of relevance between the query and the context document (the higher the score, the better the relevance), and a text-based explanation to help the user understand why this particular relevance score was generated. The explanation further helps the user tweak their context documents or retrieval algorithms to return more relevant documents for queries that fail the relevance evaluation.
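Assuming the response shape shown above, one simple way to act on the scores is to flag (or drop) context documents that fall below a relevance threshold before they reach the generation step; the threshold below is arbitrary and should be tuned against your own evaluations.

# Post-processing of the (parsed) retrieval_relevance payload shown above; how
# you extract it from the SDK response object may differ.
RELEVANCE_THRESHOLD = 50.0  # arbitrary cut-off; tune for your domain

response_payload = [
    {
        "retrieval_relevance": [
            {
                "query": "Among Moonbabies and Massive Attack, which group has a larger number of members who are musicians?",
                "relevance_scores": [39.75],
                "explanations": ["..."],
            }
        ]
    }
]

for item in response_payload:
    for result in item["retrieval_relevance"]:
        low = [(i, s) for i, s in enumerate(result["relevance_scores"]) if s < RELEVANCE_THRESHOLD]
        if low:
            print(f"Low-relevance documents for query {result['query']!r}: {low}")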
In the next section, we will demonstrate how to improve your RAG retrievals by using the RRE evaluator in re-ranking mode, which runs at low latency.
A natural next step after running evaluations is to incorporate the learnings back into your retrieval stage. You can do this by running RRE-1 in re-ranker mode and reusing the task_definition that you built during the evaluation stage. As a reminder, the re-ranker is the more sophisticated step in the two-phase ranking process, used to enhance the relevance of the candidate documents returned by the vector/reverse index. AIMon lets you create a low-latency, domain-specific re-ranker model that can be plugged directly into the second ranking phase through our simple re-ranker API, invoked from the AIMon Python SDK. Here is an example:
from aimon import Client
import os

aimon_client = Client(auth_header="Bearer {}".format(os.getenv("AIMON_API_KEY")))

queries = ["Among Moonbabies and Massive Attack, which group has a larger number of members who are musicians?"]
context_docs = ["Moonbabies is a Swedish duo formed in 1997 by vocalists, multi-instrumentalists, producers, and songwriters Ola Frick (Vocals, guitar and various instruments) and Carina Johansson (Vocals and keyboards).Massive Attack are an English trip hop group formed in 1988 in Bristol, consisting of Robert \"3D\" Del Naja, Grant \"Daddy G\" Marshall and formerly Andy \"Mushroom\" Vowles (\"Mush\").", "The Moon is Earth's only natural satellite. It orbits at an average distance of 384,400 km, about 30 times the diameter of Earth."]

scores = aimon_client.retrieval.rerank(
    context_docs=context_docs,
    queries=queries,
    task_definition="Your task is to grade the relevance of context document against a specified user query. A document about music is more relevant than science articles.",
)
print(scores)
# [[27.375, 13.875]]
Notice that the task_definition here explicitly asks the model to prefer documents about music over science articles. This allows the model to rank the music article about Moonbabies (the band) above the science article about the Moon. Refer to this Google Colab notebook for more examples on how to use the task_definition field to align the model to a specific domain.
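To apply the re-ranking, sort each query's documents by the returned scores before handing them to the generation step; here is a minimal follow-on using the scores and context_docs from the example above.

# `scores` is a list of per-query score lists, aligned with `context_docs`
# (e.g. [[27.375, 13.875]] above). Re-order the first query's documents from
# most to least relevant, then pass the top documents to the generation model.
ranked = sorted(zip(context_docs, scores[0]), key=lambda pair: pair[1], reverse=True)
for doc, score in ranked:
    print(f"{score:7.3f}  {doc[:60]}...")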
In this section, we report the performance of our retrieval evaluator and re-ranker. To evaluate the usefulness of RRE-1’s text-based explanations, we created a synthetic dataset of query-context pairs from the HuggingFace mteb/hotpotqa dataset and graded the resulting explanations from RRE-1 using an ensemble of experts consisting of SOTA LLMs. We preferred this approach over human evaluations because we found the ensemble of expert LLMs to be better than human evaluations and much faster (we will share more on this topic in another blog post). We would also like to note that none of the ensemble LLM evaluations were used for training our model; they were only used to report the usefulness metric below.
Metric | Value |
---|---|
MAP (Mean Average Precision) | 61.3 |
Text explanation usefulness [1.0 - 10.0] (graded by an ensemble of expert SOTA LLMs) | 9.0 |
RRE-1 re-ranker latency (3k context length) | 671 ms (L4 GPU), 301 ms (A100 GPU) |
RRE-1 retrieval evaluator latency (3k context length) | 2.47 s |
Ensuring high-quality retrieval is crucial for effective RAG applications, as poor relevance can lead to incorrect or misleading responses. By integrating RRE-1 into the retrieval pipeline, developers can proactively diagnose relevance issues, use text-based explanations to debug retrieval, and apply low-latency re-ranking to boost accuracy. With strong benchmarks - including a Mean Average Precision (MAP) of 61.3 and an average text explanation usefulness score of 9.0/10.0 - RRE-1 enables developers to measure and refine retrieval performance with confidence.
Try AIMon’s retrieval relevance evaluator and re-ranker today by signing up on https://www.aimon.ai and enhance your RAG pipeline with better retrieval evaluation and re-ranking!
AIMon helps you build more deterministic Generative AI Apps. It offers specialized tools for monitoring and improving the quality of outputs from large language models (LLMs). Leveraging proprietary technology, AIMon identifies and helps mitigate issues like hallucinations, instruction deviation, and RAG retrieval problems. These tools are accessible through APIs and SDKs, enabling both offline analysis and real-time monitoring of LLM quality issues.