Tue May 06 / Devvrat Bhardwaj, Preetam Joshi
Learn how reflective agents can validate model behavior and catch instruction violations without human oversight.
Agentic applications are AI systems designed to act autonomously, reason over time, and pursue user goals without requiring constant human input. Unlike traditional apps that react to discrete inputs, agentic apps manage state, handle intermediate decisions, and often operate by leveraging large language models (LLMs) like Claude or ChatGPT. These agents interpret high-level instructions and break them down into smaller actions or reasoning steps.
But simply connecting an LLM to your app doesn’t make it agentic. The system must be able to interpret goals, manage execution, and adapt, all while staying within defined behavioral bounds. This creates new challenges: agents must not only be capable, but also reliable, predictable, and correct. To meet those standards, we need strong architectural foundations.
To explore these ideas in practice, we built an intelligent To-Do app that uses Claude to estimate how long tasks will take. While it looks simple on the surface, it’s powered by an agentic design that turns LLM outputs into structured, actionable planning. The app is built using AG2, a framework for constructing reliable, goal-driven agents.
AG2 is an open-source framework designed specifically for building agentic applications. It helps developers compose systems where language models can reason and act in a structured environment. With AG2, you don’t just call a model; you build an agent that interprets instructions, tracks state, and orchestrates LLM calls as part of a broader goal-oriented loop.
AG2 offers primitives for managing tasks, interfacing with LLMs, injecting feedback loops, and isolating behavior into traceable components. It’s especially suited for applications that require more than just reactive AI. It’s built for systems that must operate intelligently and independently, while remaining inspectable.
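As a rough sketch of what this wiring looks like in code, the snippet below sets up a single AG2 agent backed by Claude. It uses AG2's ConversableAgent quick-start interface; the Anthropic-flavored llm_config shape, the model name, and the prompt are illustrative assumptions rather than the app's actual source.

import os
from autogen import ConversableAgent  # AG2 keeps the familiar `autogen` import path

# Assumed config: route the agent's LLM calls to Claude via the Anthropic API
llm_config = {
    "config_list": [
        {
            "model": "claude-3-5-sonnet-20240620",  # illustrative model name
            "api_type": "anthropic",
            "api_key": os.environ["ANTHROPIC_API_KEY"],
        }
    ]
}

todo_agent = ConversableAgent(
    name="todo_agent",
    system_message="You estimate how long to-do tasks take, in hours.",
    llm_config=llm_config,
    human_input_mode="NEVER",  # fully autonomous: no human in the loop
)

reply = todo_agent.generate_reply(
    messages=[{"role": "user", "content": "How many hours will 'Fix the sink' take? Return a number."}]
)
print(reply)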
Figure 1: An AG2-based To-Do agent without output validation. Model outputs are used even if they violate constraints.
The first iteration of the app lets users enter tasks like “Finish the report” or “Fix the sink.” Under the hood, the AG2 agent sends each task to Claude with a prompt like: “How many hours will this task take? Return a number without any explanations.”
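The claude_service entries that appear in the logs later suggest a thin wrapper around Anthropic's Messages API. A minimal version of such a wrapper might look like the following; the function name estimate_hours and the model choice are assumptions for illustration.

import os
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

PROMPT_TEMPLATE = (
    "How many hours will this task take? "
    "Return a number without any explanations. Task: {task}"
)

def estimate_hours(task: str) -> str:
    # Send the estimation prompt to Claude and return the raw text reply
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # illustrative model name
        max_tokens=10,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(task=task)}],
    )
    return response.content[0].text.strip()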
Claude responds, and the agent parses and stores the estimate alongside the task to generate a daily workload summary. This feels seamless at first. In many cases, Claude returns a clean numeric value: “1” or “0.5.”
2025-05-05 17:22:02,479 - todo_agent - DEBUG - [Attempt 1] Estimating time for task: 'Feed the cat.'
2025-05-05 17:22:03,498 - todo_agent - DEBUG - Claude response: '0.08'
2025-05-05 17:22:05,034 - todo_agent - INFO - No instruction violations for task: 'Feed the cat.'
But over time, issues emerge. LLMs, by nature, don’t always follow strict formatting. Claude’s responses sometimes include words like “about,” add units like “hours,” or wrap the number in a sentence. These seemingly minor deviations can break downstream logic. When outputs don’t match the expected structure, the agent may crash, misinterpret the data, or pass along unreliable results.
This is a common failure mode in agentic systems. Without structure enforcement, autonomy becomes brittle. Small inconsistencies quietly erode reliability.
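The brittleness is easy to reproduce with a naive parser. The snippet below is illustrative only, not code from the app: it shows how a single chatty reply turns a working pipeline into a crash.

def parse_estimate(raw: str) -> float:
    # A naive parse works for clean replies like "0.5" or "1" ...
    return float(raw)

print(parse_estimate("0.5"))  # 0.5

try:
    parse_estimate("about 2 hours")  # ... but a slightly chatty reply breaks downstream logic
except ValueError as err:
    print(f"Parse failure: {err}")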
Prompting alone isn’t enough. Even carefully crafted prompts can’t guarantee consistent compliance from a language model. These models are probabilistic, not deterministic. And when you’re building structured systems, “usually correct” isn’t good enough.
This isn’t just about parsing numbers. It’s about building trust. If the agent can’t detect when it violates its own instructions, how can users trust it to carry out multi-step reasoning or execute long-term goals?
That’s where the Reflection Pattern comes in. It’s a design strategy that helps agents evaluate their own outputs and enforce correctness before continuing.
The Reflection Pattern adds a post-step evaluation process to the agent’s reasoning loop. Instead of assuming every model output is usable, the agent inspects it. It asks: did this output follow the instructions? Can I trust this to feed into downstream logic?
In our To-Do app, this means evaluating Claude’s estimate before accepting it. If the response includes units, explanations, or formatting violations, the agent catches it, then logs, retries, or corrects the issue.
This feedback loop turns fragile LLM prompts into robust, auditable behavior. It’s the foundation for reliability in agentic systems.
Figure 2: The same pipeline with AIMon enforcing strict output validation before results are accepted.
To implement the Reflection Pattern, we use AIMon, a lightweight, model-agnostic instruction adherence (IA) checker purpose-built for language model applications. AIMon inspects the model’s output after each step and verifies whether it meets specific formatting constraints.
In our case, when Claude is prompted to estimate how long a task will take, the instructions are deliberately strict, designed to eliminate ambiguity and keep outputs machine-parseable:
1. Respond only with a numeric value (e.g., 1.5).
2. Keep the numeric value in the range 0 to 4.0.
3. Do not include units such as "hours".
4. Do not include explanations or any surrounding text.
Even with well-written prompts, LLMs often drift. Claude might return values like ‘1.5 hours’ or preface the number with phrases like ‘Let’s say 2,’ both of which violate the required structure. AIMon sits between Claude’s response and the agent logic to enforce these rules.
If all four conditions are met, the result is passed on. If not, AIMon returns a failure report identifying which rule was broken. The agent then decides how to handle the violation. It may log the error, skip the task, or retry with a rephrased prompt.
This turns reflection from a manual process into a reliable, programmatic contract. We no longer assume correctness. We check for it.
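Put together, the agent-side contract looks roughly like the loop below. This is a simplified sketch: check_adherence is a local stand-in that covers only the numeric and range rules (in the real app, the instructions and Claude's output are sent to AIMon's detector, which returns the per-instruction failure report), and llm_call, the retry limit, and the stricter re-prompt are illustrative assumptions.

import re
from typing import Callable, Optional

# Two of the output rules, stated as plain-text instructions (see the list above)
INSTRUCTIONS = [
    "Respond only with a numeric value (e.g., 1.5).",
    "Keep the numeric value in the range 0 to 4.0",
]

def check_adherence(output: str) -> list[str]:
    # Local stand-in for AIMon's instruction-adherence check: returns the
    # instructions that the output violates. The real app delegates this
    # judgment to AIMon and reads back its failure report.
    violations = []
    if not re.fullmatch(r"\d+(\.\d+)?", output):
        violations.append(INSTRUCTIONS[0])
        return violations
    if not 0 <= float(output) <= 4.0:
        violations.append(INSTRUCTIONS[1])
    return violations

def estimate_with_reflection(task: str, llm_call: Callable[[str], str], max_attempts: int = 3) -> Optional[float]:
    prompt = (
        "How many hours will this task take? "
        f"Return a number without any explanations. Task: {task}"
    )
    for attempt in range(1, max_attempts + 1):
        raw = llm_call(prompt).strip()
        violations = check_adherence(raw)  # reflection step: validate before accepting
        if not violations:
            return float(raw)              # contract satisfied, safe to use downstream
        # Retry with a stricter prompt that restates the broken rules
        prompt += " Strictly follow these rules: " + " ".join(violations)
    return None  # give up; the caller may log the failure or skip the task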
For example, for a complex task such as “Cook for 20 people”, Claude’s first estimate falls outside the allowed range. AIMon flags the violation, and the agent retries with a stricter prompt until the response complies.
2025-05-05 17:43:12,409 - todo_agent - DEBUG - [Attempt 1] Estimating time for task: 'Cook for 20 people'
2025-05-05 17:43:13,466 - httpx - INFO - HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-05-05 17:43:13,470 - claude_service - DEBUG - Claude raw response: 8
2025-05-05 17:43:13,470 - todo_agent - DEBUG - Claude response: '8'
2025-05-05 17:43:15,308 - httpx - INFO - HTTP Request: POST https://pbe-api.aimon.ai/v2/detect "HTTP/1.1 200 OK"
2025-05-05 17:43:15,310 - todo_agent - WARNING - AIMon flagged 2 issue(s) for task: 'Cook for 20 people'
2025-05-05 17:43:15,310 - todo_agent - WARNING - → Instruction: Respond only with a numeric value (e.g., 1.5).
2025-05-05 17:43:15,310 - todo_agent - WARNING - Reason: The response '8' is numeric but exceeds the required range of 0 to 4.0.
2025-05-05 17:43:15,310 - todo_agent - WARNING - → Instruction: Keep the numeric value in the range 0 to 4.0
2025-05-05 17:43:15,310 - todo_agent - WARNING - Reason: The numeric value '8' is outside the specified range of 0 to 4.0.
2025-05-05 17:43:15,310 - todo_agent - DEBUG - [Attempt 2] Estimating time for task: 'Cook for 20 people'
2025-05-05 17:43:16,333 - httpx - INFO - HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-05-05 17:43:16,333 - claude_service - DEBUG - Claude raw response: 2.5
2025-05-05 17:43:16,333 - todo_agent - DEBUG - Claude response: '2.5'
2025-05-05 17:43:17,049 - httpx - INFO - HTTP Request: POST https://pbe-api.aimon.ai/v2/detect "HTTP/1.1 200 OK"
2025-05-05 17:43:17,049 - todo_agent - INFO - No instruction violations for task: 'Cook for 20 people'
2025-05-05 17:43:21,164 - httpx - INFO - HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
If you’d like to explore this application hands-on, follow the steps below.
First, make sure you have:
- Python 3 and pip installed
- virtualenv or venv for environment isolation
- API keys for Anthropic and AIMon

Then run the following commands:

# 1: Clone the repository and enter it
git clone https://github.com/aimonlabs/intelligent-todo-app.git
cd intelligent-todo-app
# 2: Create & activate a virtual environment
python3 -m venv venv
source venv/bin/activate
# 3: Upgrade pip and install dependencies
python -m pip install --upgrade pip
pip install -r requirements.txt
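# 4: Set API keys and launch the app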
export AIMON_API_KEY="your-aimon-key-here" \
ANTHROPIC_API_KEY="your-anthropic-key-here" && \
python -m streamlit run streamlit_app.py
Building agentic applications requires more than just plugging in a language model. It requires structure, reliability, and the ability to monitor whether an AI is behaving as expected. As we’ve seen, prompting alone isn’t enough, especially when structured outputs matter.
By combining AG2, a flexible framework for building agentic systems, with AIMon, a reflection layer for instruction adherence, we’re able to move beyond naive autonomy. The To-Do agent doesn’t just act; it checks its own outputs, catches violations, and builds trust through validation.
This design pattern, reflection through instruction adherence, is simple to apply yet powerful in effect. It turns AI responses into verifiable steps and transforms agents from brittle scripts into robust systems.
If you’re building AI-enhanced applications and care about correctness, trust, or user safety, reflection isn’t optional. It’s foundational.
AIMon helps you build more deterministic Generative AI apps. It offers specialized tools for monitoring and improving the quality of outputs from large language models (LLMs). Leveraging proprietary technology, AIMon identifies and helps mitigate issues like hallucinations, instruction deviation, and RAG retrieval problems. These tools are accessible through APIs and SDKs, enabling both offline analysis and real-time monitoring of LLM quality issues.