
Introducing HDM-1: The Industry-Leading Hallucination Detection Model with Unrivaled Accuracy and Speed

By: Bibek Paudel, Alex Lyzhov, Preetam Joshi, Puneet Anand

After extensive research and development, we are excited to announce the launch of HDM-1, our cutting-edge hallucination detection model that outperforms all competitors.

HDM-1 delivers unmatched accuracy and real-time evaluations, setting a new standard for reliability in hallucination evaluations for open-book LLM applications.

Key highlights

  • HDM-1 overview
  • Detailed benchmark results

Why is it so important to reduce hallucination in LLMs?

LLM hallucination occurs when an LLM generates statements that contradict or deviate from the facts provided in its context. This can result in factual inaccuracies or the creation of entirely fabricated information, which undermines trust in AI outputs and leads to significant errors in applications where accuracy is critical.

To address this problem, developers often rely on context augmentation backed by internal or external knowledge sources (RAG). However, LLMs may still generate inaccurate content, exposing LLM app builders to significant risks like reputational damage or negative financial impact.

HDM-1 is the ultimate solution for tackling the problem of LLM hallucinations, bringing both precision and accuracy to ensure AI outputs are factual and reliable.

About HDM-1

AIMon HDM-1 is our proprietary hallucination detection model, based on cutting-edge research and internally curated datasets.

It is immediately available in two different sizes:

  • HDM-1, a highly robust detector that surpasses LLMs and other competitive models in benchmarks.
  • HDM-1 Small, which can serve smaller workloads on CPUs with very similar performance.

Both models are smaller-sized single models specialized to detect inaccuracies and fabrications for real-time and offline evaluation use cases. They can be customized and deployed in your trusted network (on-premise) or hosted by AIMon.

Benchmarking HDM-1 and HDM-1 Small

The performance of both models has been rigorously tested across a range of industry-standard hallucination datasets. Below, we outline key performance metrics and how HDM-1 models compare to similar solutions. For all these measures, a higher number means better performance, with 100 being the highest possible score and 0 being the lowest.


Table 1: F1 Score, Accuracy, and Balanced Accuracy metrics for Hallucination evaluation models on various benchmark datasets.

N/A = Not available, not reported at source, or could not compute.

Notes:

  1. GPT-4 Turbo metrics obtained from RAGTruth [1], LLM-AggreFact Leaderboard [2], and Patronus Lynx paper [4].
  2. GPT-4o metrics obtained from LLM-AggreFact Leaderboard [2] and Patronus Lynx paper [4].
  3. Luna metrics obtained from Galileo Luna paper [5].
  4. HHEM 2.1 Open metrics obtained from HHEM 2.1-Open [3] and LLM-AggreFact Leaderboard [2].
  5. Lynx 8B and Lynx 70B metrics obtained from Patronus Lynx paper [4].
  6. MiniCheck-Flan-T5-Large metrics come from our evaluation of the model on the RAGTruth dataset [1], following the original evaluation method.
  7. HaluEval is an easy dataset to fit; we provide the numbers on it only for completeness.

Precision is a valuable metric for reporting on hallucinations because it measures the proportion of correctly detected hallucinations (true positive predictions) out of all detected hallucinations (true and false positive predictions), helping identify how many of the flagged results are genuinely inaccurate. Precision is more important than recall since false positives tend to result in a poor user experience due to the noise added by them. F1 Score combines both Precision and Recall into a measure that balances them. We also include the Balanced Accuracy (Bal.Acc.) measure for the LLM-AggreFact benchmark, as this is the measure used in the LLM-AggreFact leaderboard [2]. Balanced Accuracy measures the average accuracy of a model for both the minority and majority classes and is useful when dealing with imbalanced data, where one class has more labels than another.
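
To make these metrics concrete, here is a minimal sketch of how they are computed from binary hallucination labels. The labels below are made up purely for illustration, and scikit-learn is used only as a convenient reference implementation:

```python
from sklearn.metrics import (
    precision_score, recall_score, f1_score, balanced_accuracy_score
)

# Made-up labels: 1 = hallucination, 0 = faithful output.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]   # ground-truth annotations
y_pred = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]   # detector predictions

print("Precision:", precision_score(y_true, y_pred))           # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))               # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))                   # harmonic mean of the two
print("Bal. Acc.:", balanced_accuracy_score(y_true, y_pred))    # mean of per-class recall
```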


Table 2: Precision and Recall metrics for Hallucination Evaluation models on the RAGTruth datasets.

Notes:

  1. Precision is higher for HDM-1 than for HDM-1 Small, even though its recall is slightly lower. Higher precision means fewer false positives, which makes for a much better user experience.
  2. Metrics for GPT-4o and Lynx models were not available.

Latency metrics

A key lesson we learned from our customers is that the mainstream operationalization of LLMs requires real-time evaluations and mitigation strategies. That is why latency was top of mind as we architected and optimized HDM-1 from the get-go.

A low response time opens up opportunities such as re-prompting and multi-sampling for our LLM App builders. Needless to say, performance efficiency varies with the input size and hardware configurations used, meaning that HDM-1 can be significantly faster when a cluster of GPUs is deployed.
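
As one illustration, a low-latency detector makes a guard-and-retry loop practical, where the app re-prompts the LLM whenever its output is flagged. The sketch below is schematic only; `generate_answer` and `detect_hallucination` are hypothetical placeholders for your LLM call and a hallucination detector such as HDM-1, not actual AIMon SDK functions.

```python
# Schematic re-prompting loop: retry generation when the detector flags a hallucination.
# `generate_answer` and `detect_hallucination` are hypothetical placeholders,
# not real AIMon SDK calls.
MAX_RETRIES = 2

def answer_with_guard(context: str, query: str) -> str:
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    for _ in range(MAX_RETRIES + 1):
        answer = generate_answer(prompt)                       # your LLM call
        if not detect_hallucination(context, query, answer):   # grounded: return it
            return answer
        # Re-prompt with an explicit reminder to stay grounded in the context.
        prompt += ("\n\nYour previous answer contained unsupported claims. "
                   "Answer again using only the context above.")
    return answer  # fall back to the last attempt (or escalate, log, etc.)
```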

L4 and A100 are both part of NVIDIA’s lineup designed for AI workloads.


Latency metrics computed on just a single GPU.

Additionally, the smaller version, HDM-1 Small, can be run on CPUs. On a single AMD EPYC 7R13, it performs at a respectable 1-2.5 seconds for 500-1000 tokens.
Here are the detailed latency metrics for HDM-1 Small:

Latency metrics computed on just a single CPU or GPU.

Notes:

  1. CPU refers to a single AMD EPYC 7R13

Qualitative analysis

To demonstrate how HDM-1 handles a simple case, we picked a context about Meta's WhatsApp, a summarization query, and an LLM output. We then tested AIMon's HDM-1 (available for testing through our sandbox) and GPT-4o on this example to check whether each classifies it correctly. Please note that the LLM output does not contain a hallucination, and this example is not in the training data for HDM-1.

Example 1: A summarization query over a context about Meta's WhatsApp, with the LLM output under evaluation.

In this case, GPT-4o fails to identify the LLM output as accurate, whereas HDM-1 correctly determines that it contains no hallucination.

In the next example, we will look at a more complicated hallucination case about “Moonbabies”, a Swedish bubblegum pop duo formed in 1997.

Example 2: A question about which group has more members, Massive Attack or Moonbabies, with the LLM output under evaluation.

As shown in the context above, the band “Massive Attack” has 3 musicians and “Moonbabies” has 2 musicians. Llama-3-8B decides that Moonbabies is the music group with more members, but AIMon’s HDM-1 correctly identifies this as a hallucination.

Even for this more complicated problem involving counting, HDM-1 correctly determines that Moonbabies is the wrong answer, since it has fewer members.

Would you like to try this example out on our sandbox with “Massive Attack” as the LLM Output and see what HDM-1 identifies it as?

This is our small first step in catching inaccuracies for more complex tasks such as counting, addition, subtraction, and many other complicated problems.

These examples demonstrate AIMon’s superior hallucination detection compared to much larger LLMs such as Llama 3 8B and GPT-4o (the latter is thought to exceed one trillion parameters). HDM-1 models, by contrast, are sub-500-million-parameter models, yet they deliver state-of-the-art hallucination detection.

What does it cost to run AIMon’s HDM-1?

HDM-1 and HDM-1 Small are available through our Unified API on the AIMon platform, with all other detectors, such as Instruction Adherence and Context Quality, included. We offer 1M free tokens to allow you to run your evaluations. Once the first million tokens are exhausted, pricing is currently set at $0.49 per million tokens.

LLMs are a common choice for conducting evaluations. As you think about comparing AIMon to LLM evaluators, consider a few important points:

  1. If you use an LLM evaluator, you have to call the LLM once for each type of evaluation you want to conduct. With AIMon, you configure which detectors you would like to use and invoke all of them in a single call (see the sketch after this list). Your tokens are not double counted with AIMon.
  2. Any LLM evaluator requires an evaluation prompt, whereas AIMon’s detectors don’t. Evaluation prompts can add a lot of input tokens, since LLMs need a good number of few-shot examples to perform well as evaluators.
  3. There are drawbacks to using LLMs as evaluators, such as inconsistent judgments. We cover this topic in more detail in this blog post.
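
For comparison, here is a rough sketch of that single-call pattern. The endpoint URL, payload shape, and detector names below are illustrative assumptions only, not the actual AIMon API; see our docs for the real interface.

```python
import requests

# Hypothetical request shape for illustration only -- NOT the actual AIMon API schema.
retrieved_context = "..."   # the context your app passed to the LLM
llm_output = "..."          # the LLM's generated response

payload = {
    "context": retrieved_context,
    "generated_text": llm_output,
    "config": {                                  # enable several detectors at once
        "hallucination": {"detector_name": "hdm-1"},
        "instruction_adherence": {},
        "context_quality": {},
    },
}
# One request returns results for every configured detector, versus one LLM
# round-trip (plus an evaluation prompt) per metric with an LLM-as-judge setup.
response = requests.post("https://example-aimon-endpoint/v1/detect", json=payload)
```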

What the Industry Leaders are saying about AIMon

“The productivity gains provided by LLMs are only as valuable as the trust in the LLMs’ output. Reliability tools like AIMon are key to enabling that business value,” said Mosi K. Platt, Senior Security Compliance Engineer. “That is critical to professionals in fields like security compliance where programs are looking to drive adoption of these tools and use them as force multipliers.”

“We recently moved from a popular OSS framework to AIMon for its accuracy and latency benefits,” said Joel Ritossa, CTO of a YCombinator-backed startup, Duckie.

“AIMon will enable us to boost accuracy in GenAI solutions for SAP Clean Core to achieve superior quality analysis and automation in custom code transformations,” said Vyom Gupta, President and COO of smartShift.

How can I try HDM-1?

If you are interested in on-premise deployments, HDM-1 is available on the AWS Marketplace; please reach out to us for more details. Otherwise, you can try AIMon without signing up using our sandbox, and if you like our powerful detectors, you can sign up for free to use the AIMon platform and answer questions like:

  1. How much do my LLM Apps hallucinate?
  2. Which LLM models hallucinate least for my use case?
  3. What context retrieval issues hurt my app’s accuracy?
  4. How are the hallucination rates trending for each of my apps?
  5. Are my LLMs adhering to provided instructions?
  6. and more.

With quick one-line additions (using our decorators), it is a breeze to instrument your LLM apps with our Python and TypeScript SDKs. You can quickly configure the tool to track select key metrics such as hallucination (using HDM-1), instruction adherence, and context quality.
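
To give a feel for that pattern, here is a schematic of decorator-based instrumentation in plain Python. The decorator name, its arguments, and `report_to_detectors` are hypothetical placeholders rather than the actual AIMon SDK; refer to the quick start docs for the real one-line decorator.

```python
# Schematic of decorator-based instrumentation. All names here are hypothetical
# placeholders, not the actual AIMon SDK.
from functools import wraps

def track_quality(metrics):
    def decorator(fn):
        @wraps(fn)
        def inner(context, query):
            output = fn(context, query)
            # A real integration would forward context/query/output to the
            # configured detectors (e.g. hallucination via HDM-1) here.
            report_to_detectors(context, query, output, metrics)   # hypothetical helper
            return output
        return inner
    return decorator

@track_quality(metrics=["hallucination", "instruction_adherence", "context_quality"])
def answer_question(context, query):
    return my_llm(context, query)   # your existing LLM call, unchanged
```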

You can get started with AIMon by checking out the quick start page on our docs and reviewing the available recipes and examples.

Resources

Try out the AIMon Sandbox without signing up:

https://www.aimon.ai/sandbox

Try out HDM-1 and the AIMon platform:

https://www.app.aimon.ai/?screen=signup

Visit our Docs site:

https://docs.aimon.ai/

Pros and Cons of “LLM as a Judge”:

https://aimon.ai/posts/llm-as-judge-pros-and-cons

Definitions

Open-book LLM Apps

Open-book and closed-book LLM apps differ in whether the app can use external resources to answer queries. In an open-book setting, the app can use external resources, whereas in a closed-book setting it cannot. RAG (Retrieval Augmented Generation), a popular choice for enterprise LLM app design, is an example of the open-book setting, since it includes retrieved information in the prompt.
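
As a minimal sketch of the open-book pattern, the snippet below stuffs a retrieved passage into the prompt and instructs the model to answer only from it. The documents and the toy retriever are made up purely for illustration.

```python
# Minimal open-book (RAG-style) prompt assembly. The documents and the toy
# retriever below are made up purely for illustration.
documents = {
    "whatsapp": "WhatsApp is a messaging app owned by Meta, acquired in 2014.",
    "moonbabies": "Moonbabies are a Swedish pop duo formed in 1997.",
}

def retrieve(query: str) -> str:
    # Toy retriever: return the document whose key appears in the query.
    return next((text for key, text in documents.items() if key in query.lower()), "")

def build_open_book_prompt(query: str) -> str:
    context = retrieve(query)
    return (
        f"Context:\n{context}\n\n"
        "Answer the question using only the context above.\n"
        f"Question: {query}"
    )

print(build_open_book_prompt("When was WhatsApp acquired?"))
```

A hallucination detector like HDM-1 then checks the generated answer against exactly this retrieved context.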

Precision

Precision measures how accurate your positive predictions (hallucinations) are. It answers the question: When the model predicts something as a hallucination, how often is it correct?

Recall

Recall measures how well the model finds all actual positives (hallucinations). It answers the question: Of all the actual hallucinations, how many did the model correctly identify?

F1 Score

F1 score is a harmonic mean of both precision and recall. It gives a higher score only if both Precision and Recall are high.

Accuracy

Accuracy measures the overall correctness of the model. It answers the question: How often was the model correct overall?
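
Tying the four definitions together, here is a tiny worked example with made-up counts of detector decisions:

```python
# Worked example with made-up counts of detector decisions.
tp, fp, fn, tn = 80, 20, 10, 90   # true/false positives and negatives

precision = tp / (tp + fp)                                   # 80 / 100  = 0.80
recall    = tp / (tp + fn)                                   # 80 / 90   ~ 0.89
f1        = 2 * precision * recall / (precision + recall)    # ~ 0.84
accuracy  = (tp + tn) / (tp + fp + fn + tn)                  # 170 / 200 = 0.85
```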

References

[1] RAGTruth

[2] LLM-AggreFact Leaderboard

[3] HHEM 2.1-Open

[4] Patronus Lynx paper

[5] Galileo Luna paper