Fri Oct 18 / Puneet Anand

Are LLMs the best way to judge LLMs?

“LLM as a judge” is a general technique where a language model, such as GPT-4, evaluates text generated by other models. This article dives into the pros and cons of using this popular technique for LLM evaluations.


As Gen AI becomes more popular, the need for better evaluation methods becomes critical.

At this point, it’s normal to see teams scrambling to evaluate their models in different ways, ranging from manual review to a popular approach known as “LLM as a judge”.

This technique utilizes LLMs to evaluate and provide feedback on the output of other models.

What is “LLM as a Judge”?

“LLM as a judge” is a general technique where a language model, such as GPT-4, evaluates text generated by other models.

These evaluations check different aspects of the model’s outputs, like tone, coherence, and factual accuracy, among others.

Essentially, LLMs are given the role of “judging” whether content meets specific criteria. Arguably, the most important reason this technique became so popular is that it automates work that humans (usually engineers) would otherwise need to do.

Note that this article covers “vanilla” LLMs, as opposed to LLMs fine-tuned for evaluation purposes, such as LlamaGuard.

How are LLM judges used?

LLM Judge Architecture

Imagine we have a base LLM to handle user queries and provide outputs. When using the LLM-as-judge technique, these are the usual steps followed:

  • We take the base LLM response to the user query, along with context and evaluation criteria.
  • We send this data to a separate LLM (the judge).
  • The LLM judge then grades the response based on whatever criteria are provided (these criteria can be highly customized, by the way).
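The steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: `JudgeRequest`, `build_judge_prompt`, and `run_judge` are hypothetical names, the prompt wording is invented for illustration, and `judge_fn` stands in for whatever client call reaches your judge model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class JudgeRequest:
    """Everything handed to the judge: query, context, response, criteria."""
    user_query: str
    context: str
    response: str
    criteria: List[str]


def build_judge_prompt(req: JudgeRequest) -> str:
    """Pack the request into a single prompt for the judge model."""
    criteria = "\n".join(f"- {c}" for c in req.criteria)
    return (
        "You are evaluating the response of another model.\n"
        f"User query:\n{req.user_query}\n\n"
        f"Context:\n{req.context}\n\n"
        f"Response to evaluate:\n{req.response}\n\n"
        f"Evaluation criteria:\n{criteria}\n\n"
        "Grade the response against each criterion and explain your grade."
    )


def run_judge(req: JudgeRequest, judge_fn: Callable[[str], str]) -> str:
    """Send the prompt to a separate LLM (the judge) and return its verdict.

    judge_fn is an assumption: any function that takes a prompt string and
    returns the judge model's text output.
    """
    return judge_fn(build_judge_prompt(req))


if __name__ == "__main__":
    # Stub judge so the sketch runs without network access.
    def fake_judge(prompt: str) -> str:
        return "Factual accuracy: 5/5. Tone: 4/5. Polite and on topic."

    request = JudgeRequest(
        user_query="What is your return policy?",
        context="Returns are accepted within 30 days of purchase.",
        response="You can return items within 30 days.",
        criteria=["factual accuracy", "tone"],
    )
    print(run_judge(request, fake_judge))
```

Keeping the judge behind a plain callable makes it easy to swap the judge model later without touching the rest of the pipeline.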

As a simple example, let us assume you’re using an LLM to evaluate creative writing submissions from high school students.

In that case, here’s how you could use an LLM as a judge:

  1. Select the LLM you will use as a judge. Will it be GPT-4, for example, or will a much smaller model like Llama 2 7B suffice? Choosing a smaller model can save cost in the long run, as smaller models are cheaper to run.
  2. Create evaluation criteria. At this step, you define the criteria the judge LLM should use to evaluate the submissions, for example, the factual accuracy of the output, adherence to the given topic, etc.
  3. Create an evaluation prompt. This is where you create the instructions for the LLM judge to evaluate the writing submission. Here is an example of an evaluation prompt to evaluate submissions.

LLM Judge example prompt

  4. Automate the judging. To use this LLM judge at a larger scale, you may submit the entries to the LLM inline (as outputs are generated) or in batches, allowing it to return scores and feedback automatically.
  5. Monitor results. Set up a cadence to review the LLM’s evaluations and ensure your criteria (such as accuracy) are met, or else adjust as needed.
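As an illustration of the criteria, prompt, and automation steps, here is one possible shape for an evaluation prompt and a batch loop. The prompt wording, the 1 to 5 scale, the `SCORE:` output convention, and the names `CRITERIA`, `build_prompt`, `parse_score`, and `judge_batch` are all assumptions made for this sketch; `judge_fn` is a placeholder for your judge model call.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical criteria for the creative-writing example.
CRITERIA = ["adherence to the given topic", "coherence", "tone"]

# Hypothetical evaluation prompt. The final SCORE line makes parsing easy.
PROMPT_TEMPLATE = """You are grading a creative writing submission from a high school student.
Topic: {topic}
Criteria: {criteria}

Submission:
{submission}

Give brief feedback, then end with a line of the form "SCORE: <1-5>"."""


def build_prompt(topic: str, submission: str) -> str:
    """Fill the template with the topic, the criteria, and one submission."""
    return PROMPT_TEMPLATE.format(
        topic=topic, criteria=", ".join(CRITERIA), submission=submission
    )


def parse_score(judge_output: str) -> Optional[int]:
    """Return the 1-5 score, or None if the judge ignored the format or scale."""
    match = re.search(r"SCORE:\s*(\d+)", judge_output)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 5 else None


def judge_batch(
    topic: str, submissions: List[str], judge_fn: Callable[[str], str]
) -> List[Dict[str, object]]:
    """Score a batch of submissions and keep the raw feedback for review."""
    results = []
    for text in submissions:
        output = judge_fn(build_prompt(topic, text))
        results.append(
            {"submission": text, "score": parse_score(output), "feedback": output}
        )
    return results
```

Entries whose score comes back as `None` are natural candidates for the manual review described in the monitoring step.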

Another way to use LLM judges involves taking two different outputs and asking the models to compare them for a given set of criteria.
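A pairwise comparison could look roughly like the following. Again, this is only a sketch: the prompt text, the `A`/`B`/`TIE` answer format, and the `compare` helper are hypothetical, and `judge_fn` stands in for the call to your judge model.

```python
from typing import Callable, List

# Hypothetical pairwise prompt; the one-word answer keeps parsing trivial.
PAIRWISE_TEMPLATE = """Compare two responses to the same task.
Task: {task}
Criteria: {criteria}

Response A:
{a}

Response B:
{b}

Answer with exactly one word: A, B, or TIE."""


def compare(
    task: str,
    criteria: List[str],
    response_a: str,
    response_b: str,
    judge_fn: Callable[[str], str],
) -> str:
    """Ask the judge which response is better; return A, B, TIE, or INVALID."""
    prompt = PAIRWISE_TEMPLATE.format(
        task=task, criteria=", ".join(criteria), a=response_a, b=response_b
    )
    verdict = judge_fn(prompt).strip().upper()
    # Anything outside the allowed answers is flagged rather than guessed at.
    return verdict if verdict in {"A", "B", "TIE"} else "INVALID"
```

Returning `INVALID` instead of guessing keeps malformed verdicts visible, so they can be retried or sent to a human reviewer.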

Where can we rely on LLM judges?

LLM Judges typically produce scores and some explanation. They work well for getting off the ground and give you a subjective review of how the original LLM’s output fared for a given task.

Instead of humans reviewing each output, this technique serves as a way to automatically grade the output. To list out some pros of using LLM-as-a-judge:

  1. Can be used offline or online. Evaluation LLMs can be plugged in for any use case.
  2. Cross-domain usage. To a certain degree, LLMs are well-equipped to handle evaluations on a variety of domains.
  3. They can help automate reading and understanding of hundreds of lines of text and can work 24/7.
  4. LLM Judges can successfully indicate if there is a major change in your LLM app’s success or failure rate.

But is this really an order of magnitude better than relying on humans alone?

Why LLM Judges aren’t always the optimal approach

While LLMs offer powerful capabilities for text evaluation and help you get off the ground, they are not always the best solution for every scenario.

Different LLM Scores for the same query
In the example above, the LLM judge (in this case GPT-4o-mini) states different scores for the exact same evaluation query.
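One way to put a number on this kind of inconsistency is to send the identical evaluation prompt several times and look at the spread of the returned scores. The sketch below assumes a hypothetical `score_fn` that calls your judge and returns a numeric score; the stub in the demo only imitates a judge whose answers vary between runs.

```python
import statistics
from typing import Callable, Dict, List


def score_spread(
    prompt: str, score_fn: Callable[[str], float], runs: int = 5
) -> Dict[str, float]:
    """Send the identical prompt `runs` times and summarize the scores."""
    scores: List[float] = [score_fn(prompt) for _ in range(runs)]
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        # Population standard deviation: 0.0 means perfectly repeatable.
        "stdev": statistics.pstdev(scores),
    }


if __name__ == "__main__":
    # Stub judge that returns a different score on each call.
    fake_scores = iter([3, 4, 3, 5, 4])
    summary = score_spread(
        "Rate this recommendation from 1 to 5.", lambda _: next(fake_scores)
    )
    print(summary)
```

A standard deviation of zero means the judge is repeatable on that input; a wide gap between `min` and `max` is a sign that a single score should not be trusted on its own.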

Here are a few key reasons why relying on LLM judges may present significant limitations:

  1. Consistency. LLMs that haven’t been fine-tuned are known for giving different answers to the same query. For this reason, using an LLM judge can lead to inconsistent evaluations of outputs. LLMs are trained to “generate” text, but they aren’t trained to generate scores (or grade submissions) the way a human would.
  2. Latency. Your base LLMs may take seconds to generate the output text and then the LLM judge may take about the same time to generate an evaluation. You need to check if this is acceptable to your users and other stakeholders.
  3. Cost. LLM providers typically charge by tokens and costs add up very quickly, especially if you are using LLM judges in production for each query or if you have a multi-agent system.
  4. Switching judges might cause breakage. Each LLM is different. When your organization switches from OpenAI to Anthropic, the accuracy of your entire LLM-as-judge setup might end up on shaky ground.
  5. Dependency on prompt design. The success of LLM judges heavily depends on how well the evaluation prompt is designed. Poorly designed prompts can lead to inconsistent or incorrect judgments.
  6. LLM Judges add work for humans. Yes, LLM judges were meant to reduce work for humans, remember? But unless you spend the effort to benchmark LLMs on the relevant datasets, someone still needs to read and review the scores and the reasoning generated. Blindly trusting the LLM to do a good job is just wishful thinking.
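To make the cost point concrete, here is a back-of-the-envelope estimate for judging every production query. Every number in the demo (traffic, token counts, and per-token prices) is a made-up placeholder rather than any provider's actual pricing, and `monthly_judge_cost` is a hypothetical helper.

```python
def monthly_judge_cost(
    queries_per_day: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_1m_input: float,
    usd_per_1m_output: float,
    days: int = 30,
) -> float:
    """Estimate the monthly cost of running one judge call per query."""
    per_query = (
        input_tokens * usd_per_1m_input + output_tokens * usd_per_1m_output
    ) / 1_000_000
    return per_query * queries_per_day * days


if __name__ == "__main__":
    # Placeholder values only: 10,000 queries per day, 1,500 prompt tokens and
    # 200 completion tokens per judge call, $1 and $4 per million tokens.
    cost = monthly_judge_cost(10_000, 1_500, 200, 1.0, 4.0)
    print(f"Estimated judge cost: ${cost:,.2f} per month")
```

With these placeholder numbers the estimate works out to about $690 per month for the judge alone, on top of the cost of the base LLM. Multi-agent systems multiply the figure by the number of judged steps.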

While some researchers argue that LLMs can be fine-tuned for assessment tasks and scoring, others highlight the inherent limitations of such models in replicating human grading behavior.

LLMs fail to respect scoring scales given to them

- Large Language Models are Inconsistent and Biased Evaluators [Link]

We find limited evidence that 11 state-of-the-art LLMs are ready to replace expert or non-expert human judges, and caution against using LLMs for this purpose.

- LLMs instead of Human Judges? A Large Scale Empirical Study [Link]

Best practices for using LLM as judge

The following best practices come in handy for getting the most out of this technique:

  1. Fine-tune an LLM before using it as a judge. You may need to fine-tune your LLMs to make them better at evaluations, or use techniques such as in-context learning. LLMs are trained for general-purpose goals, whereas evaluations often require a deep understanding of task-specific goals such as accuracy.

Few-shot in-context learning does lead to more consistent LLM-based evaluators

- Assessment and Mitigation of Inconsistencies in LLM-based Evaluations [Link]

  2. Don’t use the same model to generate and evaluate. As this paper suggests, LLMs exhibit various cognitive biases, which can affect their evaluation performance. This means that LLMs used for generation may not be reliable evaluators of their own or other models’ outputs.
  3. Provide LLM judges with examples of good and bad evaluations. Such examples help them differentiate between high-quality and low-quality responses.
  4. “Chain of Thought” (CoT) prompting can also improve LLM evaluations. By prompting the LLM to reason through its evaluation step-by-step, CoT often results in more accurate evaluations because it reduces impulsive or biased responses and helps the model follow a structured thought process. One limitation of CoT prompting is that it can lead to longer response times and might introduce unnecessary complexity or errors if the reasoning steps are flawed.
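The few-shot and Chain of Thought practices can be combined in a single evaluation prompt. The following is a minimal sketch, assuming a 1 to 5 factual-accuracy scale and a final `SCORE:` line; the example pairs, the prompt wording, and the helper names `build_cot_prompt` and `final_score` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical graded examples: (response, score, reasoning).
FEW_SHOT: List[Tuple[str, int, str]] = [
    ("The capital of France is Paris.", 5, "Matches the context exactly."),
    ("The capital of France is Lyon.", 1, "Contradicts the context."),
]


def build_cot_prompt(
    context: str, response: str, examples: List[Tuple[str, int, str]]
) -> str:
    """Build a judge prompt with graded examples and a step-by-step instruction."""
    shots = "\n\n".join(
        f"Response: {text}\nReasoning: {why}\nSCORE: {score}"
        for text, score, why in examples
    )
    return (
        "Rate the factual accuracy of a response against the context "
        "on a 1-5 scale.\n"
        "Think step by step: list the claims, check each one against the "
        "context, then give the score on a final line of the form "
        "'SCORE: <1-5>'.\n\n"
        f"Examples:\n{shots}\n\n"
        f"Context: {context}\nResponse: {response}\nReasoning:"
    )


def final_score(judge_output: str) -> Optional[int]:
    """Read the last SCORE line so numbers inside the reasoning are ignored."""
    matches = re.findall(
        r"^SCORE:\s*([1-5])\s*$", judge_output, flags=re.MULTILINE
    )
    return int(matches[-1]) if matches else None
```

Because the reasoning precedes the score, the parser reads only the last `SCORE:` line. The longer output is also where the extra latency mentioned above comes from.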

AIMon’s Hallucination Detection Model (HDM-1)

AIMon HDM-1 is our proprietary hallucination eval model, based on cutting-edge research, the latest innovations, and internally curated, battle-tested datasets.

It is immediately available in two different flavors: HDM-1, offering passage-level hallucination scores and HDM-1s, offering sentence-level hallucination scores.

It is a smaller model designed to detect factual inaccuracies and fabricated information in real-time and offline evaluation use cases.

Key highlights

AIMon HDM-1 Key highlights

Additionally, HDM-1 is trained to provide consistent output scores, as can be seen below for an LLM-based e-commerce recommendation. This is the same query that was served by the LLM evaluator above.

AIMon's HDM-1 scores

Going further than detection, the AIMon platform provides ways to identify root causes of hallucinations to help fix them and improve LLM apps incrementally. These are key features that vanilla LLM judges don’t provide.

About AIMon

AIMon helps you build more deterministic Generative AI Apps. It offers specialized tools for monitoring and improving the quality of outputs from large language models (LLMs). Leveraging proprietary technology, AIMon identifies and helps mitigate issues like hallucinations, instruction deviation, and RAG retrieval problems. These tools are accessible through APIs and SDKs, enabling offline analysis and real-time monitoring of LLM quality issues.