Tue Sep 10 / Puneet Anand
In this article, we’ll explore general strategies for detecting hallucinations in LLMs (in RAG-based and non-RAG apps).
Large Language Models (LLMs) like GPT-4, Claude 3, and Llama 3.1 are becoming essential across multiple industries, powering everything from customer service chatbots to complex data analysis.
Their ability to generate human-like text has made them widely adopted across the globe - however, they aren’t perfect. There are many challenges that hinder the adoption of LLMs in production.
Among these, hallucination - where the model generates outputs that sound plausible but are factually incorrect - is the most prominent, as it can lead to undesired consequences for users and expose companies to a range of risks.
By understanding the root causes of AI hallucination, exploring the most effective detection methods, and considering scalable solutions like AIMon Rely, we aim to help organizations improve the accuracy and reliability of their AI-powered apps.
Hallucinations in LLMs stem from specific factors that influence how outputs are generated, such as the quality of the training data, the model’s design and training objectives, how well it attends to context, the decoding strategy used at generation time, and its tendency to overgeneralize.
Furthermore, let’s not forget that this is still a matter of probabilities.
Even if a developer does everything in their power to solve context and query-related problems, LLMs might still hallucinate as they are probabilistic and non-deterministic systems.
Now, let’s look at the main root causes of LLM hallucination.
LLMs learn from datasets. Sometimes, these are vast and sourced from the internet, while other times they’re more limited in scope and drawn from non-public data.
If the data is inaccurate, the model can internalize and replicate these errors, effectively producing hallucinations. Similarly, if the data required to serve a user’s query is missing, the LLM may fabricate false information in its answer.
The design and training goals of an LLM can contribute to hallucinations. If a model is optimized for fluency over accuracy, it might generate plausible-sounding but factually incorrect outputs.
For example, the GPT base models are optimized for natural-language completion rather than instruction following, while GPT-4o is tuned for complex tasks, particularly instruction-driven text generation.
Additionally, Transformer architectures (which underpin virtually all modern LLMs) are excellent at predicting the next token, but have no built-in mechanism for validating the accuracy of the content they generate.
LLMs may struggle with attending to the correct context, especially over longer passages. This can lead to outputs that deviate from the expected input, as the model might lose track of key details, resulting in incorrect information.
For example, in a 2023 study conducted at Stanford University, GPT-3.5-Turbo was tested on a multi-document question-answering task.
The results indicated that the model performed well when relevant information was at the beginning or end of the text but struggled significantly when the information was in the middle.
This issue, often called the “lost in the middle” problem, highlights how LLMs can lose track of important details when processing longer inputs.
During the generation process, models use various decoding strategies that can introduce randomness into the output. If the model’s attention to the context is flawed or the wrong decoding technique is used, it can result in hallucinations.
Temperature settings are a good example, as they control the randomness of token predictions in LLM outputs.
A higher temperature can lead to more creative but less accurate outputs, while a lower temperature can make the output more deterministic.
If the temperature is set too high, the model may generate text that seems plausible but is factually incorrect.
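To make this concrete, the sketch below calls the same chat model twice with different temperature values, using the OpenAI Python SDK purely as an illustration; the model name and prompt are placeholders rather than recommendations.

```python
# Illustrative sketch using the OpenAI Python SDK (>= 1.0). Any chat-completion
# API that exposes a temperature parameter behaves in a similar way.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = [{"role": "user", "content": "Who won the 1998 FIFA World Cup?"}]

# Low temperature: near-deterministic, sticks to the most probable tokens.
factual = client.chat.completions.create(
    model="gpt-4o", messages=question, temperature=0.1
)

# High temperature: more diverse sampling, with a higher risk of
# confident-sounding but incorrect details creeping in.
creative = client.chat.completions.create(
    model="gpt-4o", messages=question, temperature=1.5
)

print(factual.choices[0].message.content)
print(creative.choices[0].message.content)
```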
LLMs are designed to generalize from the data they’ve seen, but this can lead to overgeneralization, where the model applies learned patterns too broadly. This can result in plausible but incorrect statements, especially in unfamiliar contexts.
Grokking - a form of generalization that has been observed in both small and large models - is one example of this.
On top of the reasons above, there are others as well, including:
Ambiguity in prompts: If a prompt is unclear or vague, the model may generate content that fills in the gaps with incorrect information.
Latency and resource constraints: In environments where response time is prioritized, output quality might be compromised, leading to higher hallucination rates.
Several detection methods have been developed to reduce hallucination in LLMs, each offering unique advantages and challenges. Let’s go over the main ones.
Rule-based detection systems flag content based on predefined rules and patterns.
These rules might include keyword matching, grammatical structures, or specific formats that are considered reliable or unreliable.
The simplicity of rule-based systems makes them easy to implement, especially for tasks where specific errors are predictable.
Pros: Straightforward, easy to customize, and effective in controlled environments where hallucinations are known in advance.
Cons: Lack of flexibility, as it may struggle to identify hallucinations that fall outside predefined rules. Also, this method doesn’t scale very well (how many rules can you add?). Additionally, it may produce false positives, flagging correct outputs as incorrect because they don’t fit the rules.
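As a rough illustration, here is a minimal rule-based check in Python; the suspicious patterns and the allow-list are invented for this example and would have to be tailored to a real domain.

```python
import re

# Hypothetical domain rules: patterns treated here as signals of likely
# fabrication. Real rule sets would be domain-specific and much larger.
SUSPICIOUS_PATTERNS = [
    r"\bas an ai language model\b",              # model talking about itself
    r"\b(?:approximately|around)\s+\d{1,3}\s*%", # unsourced precise statistics
]
KNOWN_MODELS = {"GPT-4", "Claude 3", "Llama 3.1"}  # simple allow-list

def rule_based_flags(output: str) -> list[str]:
    """Return human-readable reasons why the output was flagged, if any."""
    reasons = []
    for pattern in SUSPICIOUS_PATTERNS:
        if re.search(pattern, output, flags=re.IGNORECASE):
            reasons.append(f"matched suspicious pattern: {pattern}")
    # Flag model names that are not on the allow-list.
    for name in re.findall(r"\b(?:GPT|Claude|Llama)[- ][\w.]+", output):
        if name not in KNOWN_MODELS:
            reasons.append(f"unrecognized model name: {name}")
    return reasons

print(rule_based_flags("Around 87% of users prefer GPT-5 Turbo."))
```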
External knowledge verification involves cross-referencing the outputs of LLMs with reliable external databases or knowledge graphs.
This method helps identify hallucinations by checking whether the generated content aligns with verified information.
For instance, an LLM’s output about a historical event could be compared against entries in a trusted encyclopedia or database.
RAG can be seen as a sophisticated form of external knowledge verification too (“external” here meaning external to the LLM, not necessarily to the organization).
However, RAG doesn’t just verify after the fact - it integrates the retrieval process directly into the generation phase, ensuring that the model’s outputs are closely tied to the most relevant and accurate external information available.
Pros: This method enhances the accuracy of LLM outputs by anchoring them to verified information. Particularly effective for tasks requiring factual precision, such as legal document drafting or medical advice.
Cons: Integration and maintenance of external knowledge bases. These systems need regular updates to ensure the most current information is used, and they can be computationally expensive.
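A minimal sketch of this kind of post-hoc verification is shown below; the trusted_facts store and the extract_claims helper are hypothetical stand-ins for a real knowledge graph and a claim-extraction model.

```python
# Post-hoc verification against a trusted store (a toy stand-in for a
# knowledge graph or curated database).
trusted_facts = {
    "moon landing year": "1969",
    "speed of light km/s": "299792",
}

def extract_claims(llm_output: str) -> dict[str, str]:
    # Placeholder: in practice this would be a claim-extraction / NER step.
    return {"moon landing year": "1971"}

def verify(llm_output: str) -> list[str]:
    """Compare each extracted claim against the trusted store and report mismatches."""
    issues = []
    for key, claimed in extract_claims(llm_output).items():
        expected = trusted_facts.get(key)
        if expected is None:
            issues.append(f"unverifiable claim: {key}={claimed}")
        elif claimed != expected:
            issues.append(f"contradicts knowledge base: {key}={claimed}, expected {expected}")
    return issues

print(verify("The first crewed moon landing took place in 1971."))
```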
Human-in-the-loop (HITL) systems involve human reviewers who verify the outputs of LLMs.
This method combines the efficiency of AI with human judgment, and is useful in contexts where a highly nuanced understanding is required, such as content moderation, legal analysis, and high-stakes decision-making processes.
Reinforcement learning from human feedback (RLHF) is a human-in-the-loop technique used to train and improve LLMs.
Pros: Human reviewers can catch subtle hallucinations that automated systems might miss. This method is most valuable when there’s no room for mistakes.
Cons: Speed, cost, and scalability. As the volume of content increases, relying on human reviewers becomes time-consuming and expensive. Also, it’s highly dependent on the quality of human feedback.
In this approach, another LLM is used to evaluate or “judge” the output of the primary model.
This is common in offline settings where multiple models or iterations are tested to determine the most accurate or reliable output.
The “judge” model might check the coherence, factual accuracy, or adherence to specific guidelines.
Pros: Effective and easy to automate. Reduces the need for human intervention.
Cons: It might lead to overfitting if the judge model is too closely aligned with the primary model. Also, if the judge model shares the same biases as the model it evaluates, it may fail to catch some errors.
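The sketch below shows one way to wire this up, again using the OpenAI SDK only as an example; the judge prompt and the 1-5 grounding scale are assumptions, not an established standard.

```python
# LLM-as-a-judge sketch: a second model rates how well an answer is grounded
# in the context it was generated from.
from openai import OpenAI

client = OpenAI()

def judge_output(context: str, answer: str) -> str:
    prompt = (
        "You are a strict fact-checking judge.\n"
        f"Context:\n{context}\n\nAnswer:\n{answer}\n\n"
        "Rate the answer from 1 (fully hallucinated) to 5 (fully grounded) "
        "and explain your rating briefly."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # ideally a different model family than the generator
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # keep the judge deterministic
    )
    return response.choices[0].message.content

print(judge_output("The contract was signed on 2021-03-04.",
                   "The contract was signed in May 2021."))
```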
Aside from the methods listed above, others can be used to detect hallucinations in LLMs, including confidence scoring and consistency checks.
Confidence scoring assigns a probability score to the model’s outputs based on how confident the model is in its predictions.
Low-confidence outputs are more likely to be hallucinations and can be flagged for further review. However, confidence scoring alone is not the most reliable way to detect hallucinations.
Finally, we have consistency checks, which are often used in fine-tuning. This technique involves generating multiple outputs for the same input, and then comparing them.
Inconsistent outputs might indicate a hallucination.
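The snippet below sketches such a consistency check: it samples several answers to the same question and flags the case where they disagree too much. The word-overlap similarity and the 0.5 threshold are crude placeholders for a proper semantic-similarity model and a tuned cutoff.

```python
# Consistency check: sample several answers and measure how much they agree.
from openai import OpenAI

client = OpenAI()

def sample_answers(question: str, k: int = 3) -> list[str]:
    messages = [{"role": "user", "content": question}]
    return [
        client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, temperature=0.8
        ).choices[0].message.content
        for _ in range(k)
    ]

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def consistency_score(answers: list[str]) -> float:
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

answers = sample_answers("In what year did the Berlin Wall fall?")
if consistency_score(answers) < 0.5:  # threshold is an assumption to tune
    print("Low agreement across samples - possible hallucination:", answers)
```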
While RAG is one of the most important ways to mitigate LLM hallucinations, the dependency on RAG for context introduces problems of its own.
In other words, if RAG systems are not optimized or if they retrieve incorrect, outdated, or incomplete information, the resulting context can lead the LLM astray.
To make things worse, the detectors designed to identify hallucinations can fail in these cases because they might also rely on the accuracy of the context provided.
If the wrong information is fed into the system, the LLM might generate a hallucinated output that appears correct relative to the flawed context. The detector, which compares the output to this flawed context, won’t flag the hallucination either.
Let’s illustrate this with an example: Imagine that we ask an LLM to summarize a recent legal case.
The RAG system retrieves context from a legal database, but the database hasn’t been updated with the latest rulings.
As a result, the LLM might produce an output based on outdated legal information, leading to a hallucinated summary. The detector, comparing the LLM’s output with the outdated database, would find the output consistent with the retrieved context and fail to identify the inaccuracy.
This issue underscores the importance of keeping the databases that feed retrieval-augmented generation accurate and up to date.
Other significant challenges include scalability and choosing between real-time and offline detection.
As the deployment of LLMs scales, so does the challenge of detecting hallucinations.
One common method to manage this at scale is using an LLM-as-a-judge, where one model evaluates another model’s outputs. While it is often among the more reliable automated methods, it still presents scalability and cost issues.
In short, the computational cost of using an LLM-as-a-judge for every output is significant, and the latency introduced can slow down the entire process, making it impractical for real-time applications.
As a result, many organizations limit the scope of their detection efforts, applying these techniques only to a subset of outputs, which can lead to undetected hallucinations in unreviewed content.
Another challenge in hallucination detection is choosing between real-time detection and post-processing (offline detection).
In the real-time approach, hallucinations are detected as the LLM generates a response, allowing for immediate correction or flagging before the output is delivered to the end user.
The advantage of real-time detection is that it ensures only verified information is presented to users. The downside is that it consumes significant compute and can introduce latency.
In contrast, post-processing involves reviewing and verifying outputs after they have been generated. This method is less resource-intensive and can be applied selectively to a subset of outputs, reducing costs.
The trade-off is that incorrect information may be delivered to users before it is detected and corrected, which can be problematic in high-stakes environments.
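One way to picture this trade-off in practice is a routing policy that runs the expensive check inline only for high-stakes requests and defers everything else to an offline queue; the topic list, the realtime_check placeholder, and the queue in the sketch below are all hypothetical.

```python
# Hypothetical routing policy for combining real-time and offline detection.
import queue

offline_queue = queue.Queue()  # stand-in for a real message queue / batch job

HIGH_STAKES_TOPICS = {"medical", "legal", "financial"}

def realtime_check(context: str, answer: str) -> bool:
    # Placeholder for a fast inline detector (e.g. a call to a hallucination
    # detection service); returns True if the answer passes.
    return True

def deliver(topic: str, context: str, answer: str) -> str:
    if topic in HIGH_STAKES_TOPICS:
        # Real-time path: block the response until it passes the check.
        if not realtime_check(context, answer):
            return "We could not verify this answer; please consult a specialist."
        return answer
    # Offline path: return immediately, verify later in a batch job.
    offline_queue.put((context, answer))
    return answer
```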
Implementing a hybrid approach that combines several detection methods can significantly enhance the effectiveness of hallucination detection in LLMs.
While this approach is more resource-intensive, it offers the best chance of minimizing errors across various contexts and applications, and it is recommended for LLMs that assist with mission-critical tasks.
On top of this, it’s important to have a continuous monitoring program for hallucination detection systems. LLMs are not static; they learn and adapt over time, and the systems designed to detect hallucinations must also evolve.
Finally, developing feedback loops is another best practice.
ChatGPT’s feedback loop
Including feedback from end users can significantly enhance the detection process while reducing costs.
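As a small illustration, the sketch below logs a thumbs-up/down signal for each generated output so that flagged responses can later be reviewed and used to tune retrieval, prompts, or detection thresholds; the JSONL file and field names are arbitrary choices for the example.

```python
# Minimal user-feedback loop: append one JSON line per feedback event.
import datetime
import json

FEEDBACK_LOG = "feedback.jsonl"  # a real system would use a database

def record_feedback(output_id: str, helpful: bool, comment: str = "") -> None:
    event = {
        "output_id": output_id,
        "helpful": helpful,
        "comment": comment,
        "timestamp": datetime.datetime.utcnow().isoformat(),
    }
    with open(FEEDBACK_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

# Example: a user marks output "abc-123" as containing a factual error.
record_feedback("abc-123", helpful=False, comment="Cited a ruling that does not exist")
```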
We’ve explored various detection methods, including rule-based systems, external knowledge verification, human-in-the-loop processes, and LLM-as-a-judge techniques.
Each method has its strengths and challenges, but when combined, they provide a comprehensive framework for minimizing errors.
For organizations looking to reduce the hallucination rates of their LLMs, AIMon Rely offers a scalable approach to ensure that outputs remain accurate and trustworthy.
AIMon helps you build more deterministic Generative AI Apps. It offers specialized tools for monitoring and improving the quality of outputs from large language models (LLMs). Leveraging proprietary technology, AIMon identifies and helps mitigate issues like hallucinations, instruction deviation, and RAG retrieval problems. These tools are accessible through APIs and SDKs, enabling offline analysis and real-time monitoring of LLM quality issues.