Tue Mar 12 / Preetam Joshi
Aimon Rely is a state-of-the-art, multi-model system for detecting LLM quality issues like hallucinations offline and online at low cost.
Aimon Rely is a cost-effective, multi-model system for detecting “hallucinations” in Large Language Models (LLMs), enhancing the reliability of AI applications in both offline and online environments. Hallucinations can undermine the credibility and efficiency of AI-driven operations. Aimon Rely supports scenarios with noisy and precise contexts, applicable to about 80% of LLM applications, offering a solution that is 10 times cheaper and 4 times faster than using GPT-4 Turbo, without significant loss in evaluation quality. The system provides a fully hosted, low-latency API that supports batch processing for both offline evaluations and real-time monitoring. In the near future, we plan to expand the system’s capabilities, including multiple evaluation metrics for more comprehensive assessment. Check out our GitHub repo, which contains examples using LangChain, API documentation and more! Join our Discord or reach out to us at info@aimon.ai for early access.
Large Language Models (LLMs) have become integral to automating and enhancing various business processes. However, a significant challenge these models face is “hallucinations” - outputs that, although fluent and confident, are factually incorrect or nonsensical. For enterprises relying on AI for decision-making, content creation, or customer service, these hallucinations can undermine credibility, spread misinformation, and disrupt operations. Recently, Air Canada lost a court case due to hallucinations in its chatbot [7]. Also, the 2024 Edelman Trust Barometer reported a drop in trust in AI companies from 61% to 53% compared to 90% 8 years ago [8]. Recognizing the urgency of the issue, we have developed a state-of-the-art system designed for both offline and online detection of hallucinations, ensuring higher reliability and trustworthiness in LLM outputs.
LLMs are typically deployed in 3 scenarios:
Aimon Rely helps detect hallucinations in scenarios 2 and 3. In our experience, we have seen that roughly 80% of LLM applications fall within scenarios 2 and 3.
In the sections below, we briefly discuss the need to detect hallucinations offline and online, describe Aimon Rely and its results on the standard industry benchmarks for hallucination detection, and close with our recommendation for the ideal LLM application architecture.
Offline evaluation is crucial for validating AI models, using large datasets to identify potential errors before deployment. Traditional methods like static datasets and manual reviews are expensive and lack scalability. Adopting a dynamic approach, such as a fully hosted API for batch processing, allows for comprehensive, automated testing across vast data volumes, ensuring reliability of models.
In addition to offline evaluation, continuous monitoring of LLM applications is vital due to evolving data and the probabilistic nature of LLMs, which can lead to variable outputs. This ongoing oversight is necessary to align models with current data trends and mitigate risks like hallucinations, protecting against financial loss and reputational damage. Tools like Aimon Rely offer scalable, real-time monitoring solutions, maintaining the integrity of LLM applications in production environments. As shown in the benchmarks below, our system’s latency is substantially lower than that of GPT-4 Turbo (about 4 times faster) without a significant loss of quality. You do not need to fly blind anymore! Refer to our previous blog on the necessity for continuous monitoring for your LLM applications.
The Aimon Rely system is a collection of multiple models that we have trained internally using multiple datasets that include paraphrasing, entailment and custom hallucination datasets. Our system draws inspiration from ensemble models and phased ranking approaches [2] that were successfully used in information retrieval tasks - a popular concept in search and recommendation systems. We experimented with several different approaches before arriving at this approach that gave us the best tradeoff between latency and quality.
In this section we will discuss our results on the industry standard benchmark datasets. In addition, we also discuss aspects like cost, latency and explainability of the system. We compare our results against the LLM-as-a-judge method (using GPT-4 Turbo) that is popularly used to evaluate LLM responses for hallucinations.
Aimon Rely Benchmarks
A few key takeaways from these results:
Overall, Aimon Rely is close to or even better than GPT-4 on the benchmarks, making it a suitable choice for both offline and online detection of hallucinations. To further help with costs, Aimon Rely has a generous free tier. Check out our GitHub repo, which contains examples using LangChain, API documentation and more! Join our Discord or reach out to us at info@aimon.ai for access.
Aimon Rely’s hosted API comes with support for a batch mode that allows you to run multiple sets of hallucination evaluations in a single request. This low-latency API can also be used online for continuous monitoring. The advantage of this API is the simplicity of integration: you can use it with an existing evaluation or guardrail framework of your choice.
Below is an example of the batch API with 3 pairs of context and generated_text. You can pass in multiple items as part of the request, and for each item the service will return a top-level is_hallucinated boolean along with a top-level score. In addition, to help debug exactly where the error lies in the generated text, the service provides sentence-level scores. Across our three benchmarks (see the performance benchmarks section), we see an average latency of 417ms.
Notice how Aimon Rely gives you sentence-level scores that greatly help with explainability, i.e., they allow you to tell exactly which sentence in the generated text contains the hallucination. Also observe that the system correctly handles things like paraphrasing and different variations in the generated text.
Request sample taken from the “Anyscale Ranking Test for Hallucinations” dataset
Response received for the above request
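To make the request and response shape described above more concrete, here is a minimal Python sketch. It is not the actual Aimon Rely client or schema: the field names `context`, `generated_text`, `is_hallucinated`, and `score` follow the description in this post, while the `sentences` key, the helper names, the 0.5 threshold, and the assumption that a higher score means "more likely hallucinated" are all illustrative guesses. The response dictionary is hand-written for the example and is not real service output.

```python
from dataclasses import dataclass


@dataclass
class SentenceScore:
    text: str
    score: float


@dataclass
class ItemResult:
    is_hallucinated: bool
    score: float
    sentences: list[SentenceScore]


def build_batch(pairs):
    """Turn (context, generated_text) pairs into a list of request items."""
    return [{"context": c, "generated_text": g} for c, g in pairs]


def parse_item(raw: dict) -> ItemResult:
    """Parse one response item (assumed shape, see the lead-in)."""
    sentences = [
        SentenceScore(s["text"], float(s["score"]))
        for s in raw.get("sentences", [])  # "sentences" is a hypothetical key
    ]
    return ItemResult(bool(raw["is_hallucinated"]), float(raw["score"]), sentences)


def flagged_sentences(result: ItemResult, threshold: float = 0.5) -> list[str]:
    """Return the sentences whose score meets the (assumed) threshold."""
    return [s.text for s in result.sentences if s.score >= threshold]


if __name__ == "__main__":
    # Hand-written illustrative response item, not real API output.
    raw = {
        "is_hallucinated": True,
        "score": 0.9,
        "sentences": [
            {"text": "Paris is in France.", "score": 0.1},
            {"text": "It has 90 million residents.", "score": 0.9},
        ],
    }
    print(flagged_sentences(parse_item(raw)))
```

Keeping the parsing separate from the HTTP call makes it easy to plug the same logic into whichever evaluation or guardrail framework you already use.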
The above diagram illustrates our recommended architecture for LLM applications. In an online setting, every response from the LLM, along with the context retrieved from your RAG system, can be evaluated either synchronously (with a small amount of latency overhead) or asynchronously using a sidecar-like mechanism [6]. The advantage of evaluating synchronously is that you can take immediate action, such as blocking the response completely, restating the prompt to the LLM or, at the very least, showing your user a warning label that the response may be hallucinated.
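The synchronous path can be sketched roughly as follows. This is one possible way to wire up the three actions mentioned above (block, retry the prompt, warn), not a prescribed implementation: `generate` and `detect` are hypothetical callables standing in for your LLM call and a hallucination detector, and the thresholds and retry count are arbitrary values chosen for illustration, with a higher score assumed to mean "more likely hallucinated".

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Action(Enum):
    PASS = "pass"
    WARN = "warn"
    RETRY = "retry"
    BLOCK = "block"


def decide_action(score: float, warn_at: float = 0.5, block_at: float = 0.9,
                  retries_left: int = 0) -> Action:
    """Map a hallucination score to an action (thresholds are assumptions)."""
    if score >= block_at:
        return Action.RETRY if retries_left > 0 else Action.BLOCK
    if score >= warn_at:
        return Action.WARN
    return Action.PASS


def guarded_response(
    context: str,
    generate: Callable[[str], str],       # hypothetical LLM call
    detect: Callable[[str, str], float],  # hypothetical detector: (context, text) -> score
    max_retries: int = 1,
) -> Tuple[Action, Optional[str]]:
    """Generate a response and check it synchronously before returning it."""
    retries_left = max_retries
    while True:
        text = generate(context)
        action = decide_action(detect(context, text), retries_left=retries_left)
        if action is Action.RETRY:
            retries_left -= 1  # ask the LLM again, a bounded number of times
            continue
        if action is Action.BLOCK:
            return action, None  # suppress the response entirely
        return action, text     # PASS, or WARN so the caller can add a label
```

For the asynchronous variant, the same `detect` call would run outside the request path, so the user sees no added latency but the response cannot be blocked before delivery.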
In an offline setting, you can log the responses and run the evaluation using Aimon Rely’s batch API. This way you are able to augment your golden dataset evaluations with a scalable evaluation approach using Aimon Rely.
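As a rough sketch of that offline flow, the snippet below groups logged records into fixed-size batches and computes an overall hallucination rate. The `evaluate_batch` callable is a hypothetical stand-in for a call to a batch evaluation endpoint that returns one boolean verdict per record; the record layout and the batch size of 3 are likewise assumptions made for the example.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(records: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive chunks of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(records)
    while chunk := list(islice(it, size)):
        yield chunk


def hallucination_rate(
    records: Iterable,
    evaluate_batch: Callable[[list], List[bool]],  # hypothetical batch evaluator
    batch_size: int = 3,
) -> float:
    """Fraction of logged responses flagged as hallucinated."""
    total = 0
    flagged = 0
    for chunk in batched(records, batch_size):
        verdicts = evaluate_batch(chunk)  # one request per chunk
        total += len(verdicts)
        flagged += sum(1 for v in verdicts if v)
    return flagged / total if total else 0.0
```

Tracking this rate across model or prompt versions gives a simple regression signal alongside your golden dataset evaluations.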
In this article, we covered Aimon Rely, our state-of-the-art system for detecting hallucinations in large language models in both offline and online environments. Aimon Rely supports both noisy and precise context deployments of large language models. Our benchmarks show that the system delivers evaluation quality similar to the more expensive LLM-as-a-judge method at one-tenth of the cost. A sufficiently large context window of 32,000 tokens, together with the added benefit of explainability in the output, should help developers reduce hallucinations in their LLM applications.