Tue Mar 11

Why Duckie replaced a popular RAG and LLM Evaluation framework with AIMon


Customer: An Agentic AI assistant automating customer support for technical products
Industry: Customer Experience
Primary Adopter: CTO, Software Engineering

Goals

  1. High LLM output accuracy while automatically answering product support questions for B2B SaaS products, including code snippets
  2. Continuous RAG relevance and optimized ranking of the input context passed to the LLM at inference time

Background

Duckie AI is an Agentic AI assistant that automates customer support for B2B SaaS companies. It quickly finds relevant information, generates solutions, and conducts technical investigations, leading to faster resolution times, increased productivity, and improved customer satisfaction for its customers.

Being an AI-first company, Duckie considered the accuracy and relevance of its AI systems to be of the highest importance. To gain insight into the output of its LLM and RAG systems, the team implemented a popular open-source LLM evaluation framework that internally used LLM judges to score LLM outputs on metrics such as contextual hallucination and RAG relevance.

Over time, however, they found the scores inconsistent and highly variable. Drawing the line between good and bad outputs became harder and required an ongoing time investment.

Joel Ritossa, Chief Technology Officer

“AIMon offers a suite of consistent evaluators that make it easier for us to draw a line between good and bad. Our previous experience with LLM Judges was making it hard to trust them for evals and required a continuous effort to tweak the evaluations.”

How AIMon helps

AIMon provides judging as a service. The platform lets Duckie choose from a variety of judges to evaluate different use cases, either offline or online. These judges run in parallel and return their results together. AIMon’s advanced hallucination detection and instruction adherence solutions help uphold accuracy.

  • Hallucination detection with HDM-1: AIMon’s HDM-1 model helped the team identify and flag AI-generated inaccuracies originating in the Duckie app. HDM-1 is a benchmark-leading hallucination detection model that can be deployed to check for hallucinations in real time. It meets or beats GPT-4o’s accuracy while adding only a few hundred milliseconds of latency.
  • Accuracy Improvements: AIMon helped the team analyze different topics that exhibited hallucinations and helped them identify the exact action items that would result in improvements. The team successfully used these insights to improve the accuracy of their Chatbot.
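The parallel-judging pattern described above can be sketched in a few lines. Note that this is an illustrative sketch only: the judge functions below are hypothetical stand-ins (AIMon’s actual SDK and scoring models are not shown), and it demonstrates only the general idea of running several evaluators concurrently and collecting their scores together.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in judges; real evaluators (e.g. HDM-1) are model-based.
def hallucination_score(context: str, output: str) -> float:
    # Placeholder heuristic: fraction of output sentences not found in the context.
    sentences = [s for s in output.split(". ") if s]
    grounded = sum(1 for s in sentences if s in context)
    return 1.0 - grounded / max(len(sentences), 1)

def adherence_score(instructions: str, output: str) -> float:
    # Placeholder heuristic: crude keyword overlap with the instruction string.
    keys = [w for w in instructions.lower().split() if len(w) > 4]
    hits = sum(1 for w in keys if w in output.lower())
    return hits / max(len(keys), 1)

def run_judges(context: str, instructions: str, output: str) -> dict:
    """Run all judges in parallel and return their scores together."""
    judges = {
        "hallucination": lambda: hallucination_score(context, output),
        "adherence": lambda: adherence_score(instructions, output),
    }
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in judges.items()}
        return {name: f.result() for name, f in futures.items()}
```

Because each judge is independent, fanning them out to a thread pool means the overall evaluation latency is bounded by the slowest judge rather than the sum of all of them.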

Results

After deploying AIMon, the company experienced major improvements in AI performance:

✅ 50% lower cost than using LLMs as judges, while achieving lower latencies.
✅ Ability to evaluate for hallucinations and instruction deviations offline and continuously.

Conclusion

By implementing AIMon’s specialized evaluation tools, Duckie has addressed critical challenges in their AI-powered customer support system. The transition to AIMon’s judging-as-a-service platform with HDM-1 model for hallucination detection has eliminated inconsistency issues they faced with traditional LLM judges, allowing them to clearly distinguish between acceptable and unacceptable AI outputs.

The integration has delivered substantial business benefits, including a 50% reduction in evaluation costs compared to previous methods. With an enhanced ability to guardrail against hallucinations and instruction deviations both in real-time and offline, Duckie has significantly improved system reliability. As they prepare to implement AIMon’s RAG Evaluation and Reranking model, they’re well-positioned to further optimize their context retrieval processes and continue delivering exceptional customer experiences.

About AIMon

AIMon helps you build more deterministic Generative AI apps. It offers specialized tools for monitoring and improving the quality of outputs from large language models (LLMs). Leveraging proprietary technology, AIMon identifies and helps mitigate issues such as hallucinations, instruction deviation, and RAG retrieval problems. These tools are accessible through APIs and SDKs, enabling both offline analysis and real-time monitoring of LLM quality issues.