Puneet Anand
  • Feb 21st 2024

  • 5 minute read

The Case for Continuous Monitoring of Generative AI Models

Read on to learn why Generative AI requires a new continuous monitoring stack, what the market currently offers, and what we are building.


A Few Quick Words

You've arrived at this article, which means you're no stranger to Generative AI. We won't bore you with yet another take on Gen AI's awesomeness. What's noteworthy, however, is the emerging trend of Product Engineers spearheading many Generative AI initiatives, alongside the expected contribution of ML Engineers. What's new is that the technology has lent itself nicely to the pioneering Product Engineer, who has excellent skills in architecting complex applications powered by distributed, high-performance systems. Unlike traditional ML/AI, Gen AI is much more accessible, sidestepping the need for intricate training processes and cumbersome deployments. Yes, this is a nod to the highly skilled and immensely talented ML Engineers out there who handle these tasks effortlessly.

How Gen AI models are different from Traditional ML models

Generative AI models diverge from traditional ML models in their usage and applications. Traditional ML models focus on tasks such as classification and regression, excelling in scenarios such as ranking, sentiment analysis, and image classification. In contrast, generative AI models, leveraging architectures like Transformers, specialize in dynamic content creation and augmentation. They are instrumental in generating synthetic images, crafting coherent and contextually relevant text, and dynamically adapting to real-time inputs.

Overview of how Traditional ML models differ from Generative AI models.

Observability aspects for Traditional ML vs. Generative AI

Having understood how Traditional ML models are different from Generative AI models, let's explore the differences from the Observability perspective. Due to the nature of their functions and outputs, there are both similarities and some stark differences. Let’s cover these in a bit more detail:

High-level Similarities (but details matter):

  • Output Quality Metrics: Both types of models require monitoring of performance metrics. Depending on the use case, accuracy, precision, recall might be more relevant for traditional ML while for Generative AI, metrics like ROUGE, perplexity or BLEU score might be appropriate.
  • Model Drift: Both types of models can suffer a decline in performance over time due to Data Drift, where a change in data patterns impacts model performance. Concept Drift, where the definition of the model's goal changes, can also degrade performance. Monitoring for performance changes over time is essential for both, but the methods used for measuring and alleviating model drift differ.
  • Resource Usage: Monitoring computational resources like CPU, memory, and disk usage is important for both, especially since Generative AI models can be very resource-intensive.
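To make the output-quality metrics above concrete, here is a minimal, from-scratch sketch of a ROUGE-1-style unigram overlap score. This is purely illustrative; production systems would typically use a dedicated library such as rouge-score or sacrebleu rather than this hand-rolled version.

```python
from collections import Counter

def rouge_1(candidate: str, reference: str) -> dict:
    """Compute unigram (ROUGE-1-style) precision, recall, and F1
    between a model output and a reference answer."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Overlap counts each shared unigram up to its minimum frequency.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
```

The same shape of function, swapping in n-gram or embedding-based comparisons, underlies many of the LLM quality metrics mentioned here.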

Key Differences:

  • Coherence and Consistency: In the realm of Generative AI models, guaranteeing attributes like completeness, consistency, and coherence presents a distinct challenge, one that traditional ML models don't usually share. Consider a scenario where a Large Language Model (LLM) returns different facts to the same user when presented with slightly different but essentially equivalent contexts or input prompts.
  • Complexity: Generative AI models, especially those generating text or images, produce more complex and varied outputs compared to traditional ML models. This complexity makes it harder to define and measure accuracy.
  • Interpretability and Explainability: Traditional ML models (like linear or tree based models) are often more interpretable. In contrast, the complexity of Generative AI models makes it difficult to understand how they arrive at a particular output.
  • Content Safety and Ethics: For Generative AI, especially those generating content like text or images, it's crucial to vigilantly check for outputs that might be unethical or harmful. This concern is typically less pronounced in many traditional ML models.
  • Real-time Interactivity: Generative AI models used in interactive applications like chatbots may require real-time monitoring to ensure appropriate responses, a scenario less common in traditional ML applications.
  • User Interaction Patterns: How users interact with the two types of models is quite different, which can be crucial for understanding and iteratively improving business metrics. To give an example, imagine rolling out a ChatGPT-like experience offering product knowledge to end users vs. a recommendation (traditional ML) model that lets them select the best product according to their personal preferences.
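The consistency challenge described above can be approximated even without model internals: send a set of paraphrased but equivalent prompts and compare the responses. As a minimal sketch (the function names here are illustrative, and a real system would use semantic similarity rather than token overlap):

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two responses."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def consistency_score(responses: list[str]) -> float:
    """Average pairwise similarity across responses to equivalent
    prompts; a low score flags the model answering semantically
    identical questions differently."""
    if len(responses) < 2:
        return 1.0
    pairs = [(i, j) for i in range(len(responses))
             for j in range(i + 1, len(responses))]
    return sum(jaccard(responses[i], responses[j])
               for i, j in pairs) / len(pairs)
```

In practice, embedding-based similarity catches paraphrased-but-consistent answers that token overlap misses, but the monitoring loop looks the same.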

Do Traditional Observability and Monitoring tools work for LLMs?

Having looked at the Observability aspects of Generative AI and how they compare with Traditional ML, let's explore whether traditional Observability and Monitoring tools work for LLMs.

A wide range of traditional ML tools for Observability, Explainability, and Offline/Online Evaluations have been available on the market for a few years now. Some of these vendors have begun offering basic LLM performance monitoring, including system metrics, Data Quality, and Model Quality. Yet it's very important to note that they are not natively designed to assess real-time factors like hallucinations and completeness, or other pertinent metrics that reflect the quality of LLM outputs.

Offline Evaluations with Golden Datasets

Offline evaluations compute LLM quality metrics against 'golden datasets,' which consist of a predefined set of prompts, context, and optionally, the ideal answers. These datasets are then used like regression test suites (remember Selenium?) to test the LLMs at a specific point in time or at a preset frequency. Offline evaluations can be very useful if you are in the early stages and experimenting with various LLMs, RAG techniques, prompts, etc., or if you only care about a static set of prompts covering your top N business use cases. However, if your app's context or prompts vary greatly, this approach might not work too well. It would be like viewing the world through large blinders, and the resulting blind spots could meaningfully hurt your business.
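A golden-dataset evaluation harness of the kind described above can be sketched in a few lines. The names (GoldenCase, run_offline_eval, fake_llm) are hypothetical, and real harnesses score semantic correctness rather than simple keyword presence; this sketch only shows the regression-suite shape:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenCase:
    prompt: str
    context: str
    expected_keywords: list  # facts the answer must mention

def run_offline_eval(cases: list,
                     llm: Callable[[str, str], str]) -> dict:
    """Run each golden case through the model and check that the
    answer covers the expected facts. `llm` is any callable taking
    (prompt, context) and returning text."""
    results = []
    for case in cases:
        answer = llm(case.prompt, case.context).lower()
        hit = all(kw.lower() in answer for kw in case.expected_keywords)
        results.append(hit)
    return {"total": len(results), "passed": sum(results)}

# Usage with a trivial stand-in for a real model client:
fake_llm = lambda prompt, context: f"Based on the policy, {context}"
cases = [GoldenCase("What is the deductible?",
                    "the deductible is $500", ["$500"])]
report = run_offline_eval(cases, fake_llm)  # {'total': 1, 'passed': 1}
```

The limitation discussed next follows directly from this structure: the harness can only catch failures on prompts someone thought to include in `cases`.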

As the following quote suggests, different factors like prompts and context might induce hallucinations in real-time, which the golden dataset based approaches don’t cover completely.

LLMs are typically evaluated using accuracy, yet this metric does not capture the vulnerability of LLMs to hallucination-inducing factors like prompt and context variability.

— Assessing the Reliability of Large Language Model Knowledge

Why continuous real-time monitoring for Generative AI models is necessary in production

The above discussion reminds me of how bad I was at History class. I would cram the most important questions in preparation for the exam, but I was always stumped by questions I hadn't prepared for, or by familiar ones asked differently.

LLMs behave similarly. Imagine an insurance company called InsuranceAI that sells three products - Home, Automobile, and Renter's - and uses AI heavily. The company makes use of LLMs in the following ways:

  • For new users, it summarizes the policy behind a new quote and, at the same time, lets users ask arbitrary questions answered from that policy.
  • For existing users, InsuranceAI provides real-time AI-based support on their favorite device, backed by a chatbot that answers their questions.

Let us assume the LLMs get a few important facts on Renter's deductibles wrong, especially when the user asks for a bundled quote (multiple products quoted together). This could happen due to a lack of data on Renter's, poor implementation of fine-tuning or in-context learning (RAG) approaches, or various other reasons. We would all agree that this will impact existing and new business, brand value, and revenue in a major way. Let me ask you - how and when should InsuranceAI ideally learn about this? ASAP or a month later?


Continuous, automated, real-time monitoring of output quality would help them learn about these gaps instantly, nor would this kind of monitoring require them to proactively enumerate each and every aspect the LLM could get wrong. Why use an LLM if you had to go through all of that in the first place?
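The continuous monitoring loop described above can be sketched as a sliding-window monitor over live responses. ResponseMonitor and score_fn are hypothetical names; in a real deployment, score_fn would call a hallucination-detection or quality-scoring service rather than a toy scorer:

```python
import time
from collections import deque

class ResponseMonitor:
    """Sliding-window quality monitor for live LLM responses.

    Each response is scored in [0, 1]; if the rolling average
    falls below the threshold, an alert is recorded so the team
    learns about quality gaps as they happen, not a month later.
    """
    def __init__(self, score_fn, window=100, alert_threshold=0.8):
        self.score_fn = score_fn            # stand-in for a real quality scorer
        self.scores = deque(maxlen=window)  # only the most recent responses count
        self.alert_threshold = alert_threshold
        self.alerts = []

    def observe(self, prompt, context, response):
        """Score one live response and alert if quality is slipping."""
        score = self.score_fn(prompt, context, response)
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        if avg < self.alert_threshold:
            self.alerts.append({"time": time.time(), "avg": avg})
        return avg
```

Hooking such a monitor into the response path (or a log stream) is what replaces the manual spot-checking described below.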

I had the good fortune of meeting some amazing customers who use LLMs as a core part of their product stack and are very passionate about their user experience - so much so that they manually review hundreds of LLM responses every day to evaluate how well they served their customers. This demonstrates the lengths leaders go to in ensuring customers have a good experience, but also how tedious monitoring LLMs can be in real time and at scale. How amazing would it be if this could be replaced with continuous, automated monitoring?

We will be writing more about this topic in the near future, but this is a good segue into what we are building for the industry.

About Aimon Labs

We are a venture-backed startup focused on reliable and deterministic Generative AI adoption. With a team of patented inventors focused on Enterprise ML and AI, we have built a proprietary, best-in-class Hallucination Detection solution that complements RAG and identifies hallucinations down to the sentence and passage level. In addition, we have expertise in building Search Engines and Search Apps, and we love to help companies with implementation and best practices for RAG (Retrieval Augmented Generation, or in-context learning) approaches.