
100ms hallucination evaluations that beat GPT-4o mini and GPT-4 Turbo help you check, detect, and correct hallucinations.

Aimon Rely is a state-of-the-art, multi-model system for detecting LLM quality issues like hallucinations offline and online at low cost.

Summary

Aimon Rely is a cost-effective, multi-model system for detecting "hallucinations" in Large Language Models (LLMs), enhancing the reliability of AI applications in both offline and online environments. Hallucinations can undermine the credibility and efficiency of AI-driven operations. Aimon Rely supports scenarios with noisy and precise contexts, which cover roughly 80% of LLM applications, and is 10 times cheaper and 4 times faster than using GPT-4 Turbo without significant loss in evaluation quality. The system provides a fully hosted, low-latency API that supports batch processing for both offline evaluations and real-time monitoring. In the near future, we plan to expand the system's capabilities with multiple evaluation metrics for more comprehensive assessment. Check out our GitHub repo, which contains examples using Langchain, API documentation, and more! Join our Discord or reach out to us at info@aimon.ai for early access.

Introduction

Large Language Models (LLMs) have become integral to automating and enhancing various business processes. However, a significant challenge these models face is "hallucinations": outputs that, although fluent and confident, are factually incorrect or nonsensical. For enterprises relying on AI for decision-making, content creation, or customer service, these hallucinations can undermine credibility, spread misinformation, and disrupt operations. Recently, Air Canada lost a court case due to hallucinations in its chatbot [7]. The 2024 Edelman Trust Barometer also reported that trust in AI companies has dropped from 61% to 53%, compared to 90% eight years ago [8]. Recognizing the urgency of the issue, we have developed a state-of-the-art system designed for both offline and online detection of hallucinations, ensuring higher reliability and trustworthiness in LLM outputs.

LLMs are typically deployed in 3 scenarios:

  • Zero context: An LLM is used without any grounding documents. Example: asking ChatGPT to "help me write a sonnet about going to school in Palo Alto." While generating the sonnet, the LLM would need to incorporate specific facts about Palo Alto, the surrounding area, why it is popular, etc. These facts are generated without any supporting context, i.e., based only on the model's training data.
  • Noisy context: Multiple documents that serve as context to "ground" the LLM are retrieved from a vector DB. The answer is typically contained within these context documents if the retrieval was performed correctly. Example: a customer-support chatbot that uses retrieval-augmented generation (RAG).
  • Precise context: The LLM's output should come exclusively from the few documents supplied as input. Example: information extraction or text summarization.

Aimon Rely helps detect hallucinations in scenarios 2 and 3. In our experience, we have seen that roughly 80% of LLM applications fall within scenarios 2 and 3.

In the sections below, we briefly discuss the need to detect hallucinations offline and online, describe Aimon Rely along with its results on standard industry benchmarks for hallucination detection, and finally present our recommendation for an ideal LLM application architecture.

Offline Evaluation and Continuous Monitoring

Offline evaluation is crucial for validating AI models, using large datasets to identify potential errors before deployment. Traditional methods like static datasets and manual reviews are expensive and lack scalability. Adopting a dynamic approach, such as a fully hosted API for batch processing, allows for comprehensive, automated testing across vast data volumes, ensuring reliability of models.

In addition to offline evaluation, continuous monitoring of LLM applications is vital due to evolving data and the probabilistic nature of LLMs, which can lead to variable outputs. This ongoing oversight is necessary to align models with current data trends and mitigate risks like hallucinations, protecting against financial loss and reputational damage. Tools like Aimon Rely offer scalable, real-time monitoring solutions, maintaining the integrity of LLM applications in production environments. As shown in the benchmarks below, our system’s latency is orders of magnitude lower than that of GPT-4 without a significant loss of quality. You do not need to fly blind anymore! Refer to our previous blog on the necessity for continuous monitoring for your LLM applications.


Aimon Rely

How does it work?

The Aimon Rely system is a collection of multiple models that we trained internally on datasets covering paraphrasing, entailment, and custom hallucination examples. Our system draws inspiration from ensemble models and the phased ranking approaches [2] used successfully in information retrieval tasks, a popular concept in search and recommendation systems. We experimented with several different approaches before arriving at the one that gave us the best tradeoff between latency and quality.
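To make the phased-ranking idea concrete, here is a minimal sketch in Python. The two scoring functions are toy stand-ins (Aimon Rely's internal models are not public), but the control flow shows how a cheap first phase keeps average latency low by escalating only suspicious sentences to a more expensive model:

    def cheap_score(context: str, sentence: str) -> float:
        # Phase 1: fast, rough signal -- the fraction of sentence tokens
        # that never appear in the context. Toy stand-in for a small model.
        context_tokens = set(context.lower().split())
        tokens = sentence.lower().split()
        if not tokens:
            return 0.0
        return sum(t not in context_tokens for t in tokens) / len(tokens)

    def expensive_score(context: str, sentence: str) -> float:
        # Phase 2: slower, higher-quality model (e.g., an entailment
        # cross-encoder). Toy stand-in reuses the cheap signal here.
        return cheap_score(context, sentence)

    def sentence_scores(context: str, sentences: list[str],
                        escalate_above: float = 0.2) -> list[float]:
        scores = []
        for sentence in sentences:
            score = cheap_score(context, sentence)
            # Only ambiguous sentences hit the expensive model, which is
            # what keeps average latency and cost low.
            if score > escalate_above:
                score = expensive_score(context, sentence)
            scores.append(score)
        return scores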

Benchmarks

In this section we will discuss our results on the industry standard benchmark datasets. In addition, we also discuss aspects like cost, latency and explainability of the system. We compare our results against the LLM-as-a-judge method (using GPT-4 Turbo) that is popularly used to evaluate LLM responses for hallucinations.

[Table: Benchmark comparison of Aimon Rely against GPT-4 Turbo]

A few key takeaways from these results:

  • Aimon Rely is 10x cheaper than GPT-4 Turbo without significant loss in quality of the evaluations.
  • Aimon Rely is 4x faster than GPT-4 Turbo without significant loss in quality of the evaluations.
  • Aimon Rely provides the convenience of a fully hosted API that includes baked-in explainability.
  • Aimon Rely supports a context length of up to 32,000 tokens (with plans to expand this further in the near future).

Overall, Aimon Rely is close to or even better than GPT-4 on the benchmarks, making it a suitable choice for both offline and online detection of hallucinations. To further help with costs, Aimon Rely has a generous free tier. Check out our GitHub repo, which contains examples using Langchain, API documentation, and more! Join our Discord or reach out to us at info@aimon.ai for access.

API Details - Fully hosted with batch support

Aimon Rely's hosted API comes with support for a batch mode that allows you to run multiple sets of hallucination evaluations in a single request. This low-latency API can also be used online for continuous monitoring. The advantage of this API is its simplicity of integration: you can use it with an existing evaluation or guardrail framework of your choice.

Below is an example of the batch API with 3 pairs of context and generated_text. You can pass in multiple items as part of the request, and for each item the service returns a top-level is_hallucinated boolean along with a top-level score. In addition, to help debug exactly where the error lies in the generated text, the service provides sentence-level scores. Across our three benchmarks (see the benchmarks section above), we see an average latency of 417ms.

Notice how Aimon Rely gives you sentence-level scores, which greatly helps with explainability: it lets you tell exactly which sentence in the generated text contains the hallucination. Also observe that the system correctly handles paraphrasing and other variations in the generated text.

Request

[Request sample taken from the "Anyscale Ranking Test for Hallucinations" dataset]
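Since the sample above is embedded as an image, here is an illustrative sketch of the request shape described earlier. The endpoint URL, auth header, and exact field names are assumptions for illustration, not the verbatim API contract (see the API documentation in the GitHub repo for that):

    import requests

    # Illustrative batch request: a list of context / generated_text pairs.
    # The endpoint URL and auth header are hypothetical placeholders.
    payload = [
        {"context": "The warranty covers manufacturing defects for 24 months.",
         "generated_text": "Defects are covered for two years from purchase."},
        {"context": "Palo Alto is a city in the San Francisco Bay Area.",
         "generated_text": "Palo Alto is the capital of California."},
        {"context": "Q3 revenue grew 12% year over year.",
         "generated_text": "Revenue increased 12% compared to the prior year."},
    ]
    resp = requests.post(
        "https://<AIMON_RELY_ENDPOINT>/batch_detect",  # hypothetical URL
        json=payload,
        headers={"Authorization": "Bearer <API_KEY>"},
    )
    print(resp.json())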

Response

[Response received for the above request]
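The response is likewise shown as an image; the sketch below illustrates the shape described earlier, with one result per request item, a top-level is_hallucinated flag and score, and sentence-level scores. Field names are assumed for illustration:

    # Illustrative response shape (field names assumed, not verbatim output).
    # Sentence-level scores pinpoint exactly where a hallucination occurred.
    example_response = [
        {
            "is_hallucinated": False,
            "score": 0.03,  # top-level score for the whole generated text
            "sentence_scores": [
                {"sentence": "Defects are covered for two years from purchase.",
                 "score": 0.03},
            ],
        },
        # ... one entry per remaining item in the batch
    ]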

[Diagram: Recommended LLM application architecture with Aimon Rely]

The above diagram illustrates our recommended architecture for LLM applications. In an online setting, every response from the LLM, along with the context retrieved from your RAG system, can be evaluated either synchronously (with a small amount of latency overhead) or asynchronously using a sidecar-like mechanism [6]. The advantage of the synchronous path is that you can take immediate action, such as blocking the response completely, re-prompting the LLM, or at the very least attaching a warning label telling your user that the response may be hallucinated.
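As a concrete illustration of the synchronous path, here is a minimal sketch; call_llm(), check_hallucination(), and the thresholds are hypothetical stand-ins for your existing LLM call and the detection API:

    # Minimal sketch of a synchronous guardrail. call_llm() and
    # check_hallucination() are hypothetical stand-ins, not real APIs.
    def call_llm(prompt: str, context: str) -> str:
        raise NotImplementedError  # your existing LLM invocation

    def check_hallucination(context: str, generated_text: str) -> dict:
        raise NotImplementedError  # e.g., a call to the detection service

    def answer_with_guardrail(prompt: str, context: str) -> str:
        draft = call_llm(prompt, context)
        result = check_hallucination(context, draft)
        if result["score"] > 0.75:  # threshold chosen for illustration
            # Block the response outright, or re-prompt the LLM instead.
            return "Sorry, I could not produce a reliable answer."
        if result["is_hallucinated"]:
            # Borderline case: at the very least, attach a warning label.
            return draft + "\n(Note: this answer may contain inaccuracies.)"
        return draft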

In an offline setting, you can log the responses and run the evaluation using Aimon Rely’s batch API. This way you are able to augment your golden dataset evaluations with a scalable evaluation approach using Aimon Rely.
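Here is a minimal sketch of that offline path, assuming responses are logged as JSON lines with context and generated_text fields; the log format, endpoint, and batch size are assumptions for illustration:

    import json
    import requests

    # Illustrative offline evaluation over logged LLM responses.
    with open("llm_responses.jsonl") as f:
        items = [json.loads(line) for line in f]

    for start in range(0, len(items), 32):  # batch size chosen arbitrarily
        batch = items[start:start + 32]
        results = requests.post(
            "https://<AIMON_RELY_ENDPOINT>/batch_detect",  # hypothetical URL
            json=batch,
            headers={"Authorization": "Bearer <API_KEY>"},
        ).json()
        for item, result in zip(batch, results):
            if result["is_hallucinated"]:
                print("Flagged:", item["generated_text"][:80])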

Conclusion

In this article, we covered Aimon Rely, our state-of-the-art system for detecting hallucinations in large language models in both offline and online environments. Aimon Rely supports both noisy-context and precise-context deployments of large language models. Our benchmarks show that the system delivers evaluation quality similar to the more expensive LLM-as-a-judge method at one-tenth of the cost. A sufficiently large context window of 32,000 tokens, together with the added benefit of explainability in the output, should help developers reduce hallucinations in their LLM applications.

References