Tue Mar 11

Why Duckie replaced a popular RAG and LLM Evaluation framework with AIMon


Customer: An Agentic AI assistant automating customer support for technical products
Industry: Customer Experience
Primary Adopter: CTO, Software Engineering

Goals

  1. High LLM output accuracy while automatically answering product support questions for B2B SaaS products, including code snippets
  2. Continuous RAG relevance and optimized ranking of the input context passed to the LLM at inference time

Background

Duckie AI is an Agentic AI assistant that automates customer support for B2B SaaS companies. It quickly finds relevant information, generates solutions, and conducts technical investigations, leading to faster resolution times, increased productivity, and improved customer satisfaction for its customers.

Being an AI-first company, Duckie considered the accuracy and relevance of its AI systems to be of the highest importance. To gain insight into the output of its LLM and RAG systems, the team implemented a popular open-source LLM evaluation framework that internally used LLM judges to score LLM outputs on metrics such as contextual hallucination and RAG relevance.

Over time, however, they found the scores inconsistent and highly variable. Drawing the line between good and bad outputs became harder and required an ongoing time investment.

Joel Ritossa, Chief Technology Officer

“AIMon offers a suite of consistent evaluators that make it easier for us to draw a line between good and bad. Our previous experience with LLM Judges was making it hard to trust them for evals and required a continuous effort to tweak the evaluations.”

How AIMon helps

AIMon provides judging as a service. The platform lets Duckie choose from a variety of judges to evaluate different use cases, either offline or online. These judges run in parallel and return their results together. AIMon’s advanced hallucination detection and instruction adherence solutions help uphold accuracy.

  • Hallucination detection with HDM-1: AIMon’s HDM-1 model helped the team identify and flag AI-generated inaccuracies originating in the Duckie app. HDM-1 is a benchmark-leading hallucination detection model that can be deployed to check for hallucinations in real time. It meets or beats GPT-4o’s accuracy while adding only a few hundred milliseconds of latency.
  • Accuracy Improvements: AIMon helped the team analyze different topics that exhibited hallucinations and helped them identify the exact action items that would result in improvements. The team successfully used these insights to improve the accuracy of their Chatbot.
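The parallel-judging pattern described above can be sketched in a few lines. Note that this is an illustrative sketch only: the judge functions below are hypothetical stand-ins (AIMon’s actual SDK and scoring models are not shown), and it demonstrates only the general idea of running several evaluators concurrently and collecting their scores together.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in judges; real evaluators (e.g. HDM-1) are model-based.
def hallucination_score(context: str, output: str) -> float:
    # Placeholder heuristic: fraction of output sentences not found in the context.
    sentences = [s for s in output.split(". ") if s]
    grounded = sum(1 for s in sentences if s in context)
    return 1.0 - grounded / max(len(sentences), 1)

def adherence_score(instructions: str, output: str) -> float:
    # Placeholder heuristic: crude keyword overlap with the instruction string.
    keys = [w for w in instructions.lower().split() if len(w) > 4]
    hits = sum(1 for w in keys if w in output.lower())
    return hits / max(len(keys), 1)

def run_judges(context: str, instructions: str, output: str) -> dict:
    """Run all judges in parallel and return their scores together."""
    judges = {
        "hallucination": lambda: hallucination_score(context, output),
        "adherence": lambda: adherence_score(instructions, output),
    }
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in judges.items()}
        return {name: f.result() for name, f in futures.items()}
```

Because each judge is independent, fanning them out to a thread pool means the overall evaluation latency is bounded by the slowest judge rather than the sum of all of them.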

Results

After deploying AIMon, the company experienced major improvements in AI performance:

✅ 50% lower cost than using LLMs as judges, while achieving lower latencies.
✅ Ability to evaluate for hallucinations and instruction deviations offline and continuously.

Conclusion

By implementing AIMon’s specialized evaluation tools, Duckie has addressed critical challenges in their AI-powered customer support system. The transition to AIMon’s judging-as-a-service platform with HDM-1 model for hallucination detection has eliminated inconsistency issues they faced with traditional LLM judges, allowing them to clearly distinguish between acceptable and unacceptable AI outputs.

The integration has delivered substantial business benefits, including a 50% reduction in evaluation costs compared to previous methods. With an enhanced ability to guardrail against hallucinations and instruction deviations both in real-time and offline, Duckie has significantly improved system reliability. As they prepare to implement AIMon’s RAG Evaluation and Reranking model, they’re well-positioned to further optimize their context retrieval processes and continue delivering exceptional customer experiences.

About AIMon

AIMon helps you build more deterministic Generative AI apps. It offers specialized tools for monitoring and improving the quality of outputs from large language models (LLMs). Leveraging proprietary technology, AIMon identifies and helps mitigate issues such as hallucinations, instruction deviation, and RAG retrieval problems. These tools are accessible through APIs and SDKs, enabling both offline analysis and real-time monitoring of LLM quality issues.