Welcome back to Agent Support—a developer advice column for those head-scratching moments when you’re building an AI agent! Each post answers a real question from the community with simple, practical guidance to help you build smarter agents.
Today’s question comes from a developer mid-refactor—trying to validate whether their new prompt actually improves output or just feels better.
💬 Dear Agent Support
I’ve made some updates to my agent—new prompt, different model, added tools—but now I’m not sure which version is actually better. How can I compare them?
Great question! This is one that pops up anytime your agent starts evolving. The short answer? Don’t guess. A/B test!
Let’s break it down!
🔍 Why A/B testing matters for agents
We tend to think of A/B testing as a marketer’s tool (e.g., headline A vs. headline B). But it’s just as useful for developers building agents.
Why? Because most agent improvements are experimental.
You’re changing one or more of the following:
- the system prompt
- the model
- the tool selection
- the output format
- the interaction flow
But without a structured way to test those changes, you’re just guessing. You might think your updated version is smarter or more helpful…but until you compare, you won’t know!
A/B testing helps you turn instincts into insight and gives you real data to back your next decision.
🚨 What happens if you don’t test?
When you skip A/B testing, a few things can happen:
- Regressions sneak in. That new tool call might slow down response time or break on edge cases.
- Subjective bias creeps in. You might love the new phrasing in version B, but does it actually outperform version A?
- Debugging gets harder. If performance dips, you won’t know which recent change caused it.
- You stall on improvements. It’s hard to feel confident making changes when you don’t have a clear way to measure success.
In short: no testing, no clarity.
📊 Compare changes without second-guessing
With agent versioning in the AI Toolkit, you can explore new prompts, models, or tools without breaking what already works.
It’s great for:
- Comparing a GPT-4 prompt vs. a GPT-4o version (or any other model pairing!)
- Swapping in a structured schema and seeing if responses improve (sketched below)
- Testing tool-based flows against simple chat-based flows
- Iterating quickly while preserving your best-performing versions
Once you’ve created your versions, you can simulate conversations, evaluate responses, and decide what works best. It turns “Hmm, I think this is better?” into “Here’s what actually performs better.”
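To make the structured-schema idea above a little more concrete, here’s a minimal sketch of what such a response schema might look like, using Pydantic purely as an illustration. The field names are hypothetical, and nothing here is required by the AI Toolkit:

```python
from pydantic import BaseModel

# Hypothetical response schema for a movie recommendation agent.
# Version A might return free-form text; version B is asked to fill this
# structure instead, with everything else held constant.
class MovieRecommendation(BaseModel):
    title: str            # the recommended movie
    summary: str          # a 1-2 sentence description
    why_recommended: str  # the mood, genre, or theme behind the pick
```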
Here’s how to do it:
- Open the Agent Builder from the AI Toolkit panel in Visual Studio Code.
- Click the + New Agent button and name the agent Movie Recommendation.
- Select a Model for your agent.
- Within the System Prompt section, enter: You recommend a movie based on the user’s favorite genre.
- Within the User Prompt section, enter: What’s a good {{genre}} movie to watch?
- On the right side of the Agent Builder, select the Evaluation tab.
- Select + Add an Empty Row.
- In the {{genre}} cell, enter Comedy. (Each row fills the {{genre}} placeholder in the User Prompt; see the sketch after these steps.)
- Add another row and enter Sci-Fi.
- For each row, select Run Response (i.e. the play button).
- In the Manual column, select a thumbs up or thumbs down to record your assessment of the response.
- In the Agent Builder, under the title for the agent, select Save as New Version. Enter the name V1.
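If it helps to picture what the Evaluation tab is doing with those rows, here’s a rough Python sketch of the same idea: each row supplies a value for the {{genre}} placeholder in the user prompt. The two prompt strings come straight from the steps above; everything else is illustrative:

```python
# Mirrors the V1 setup: one system prompt, a templated user prompt,
# and one evaluation row per test value.
SYSTEM_PROMPT_V1 = "You recommend a movie based on the user's favorite genre."
USER_PROMPT_TEMPLATE = "What's a good {genre} movie to watch?"

evaluation_rows = ["Comedy", "Sci-Fi"]

for genre in evaluation_rows:
    user_prompt = USER_PROMPT_TEMPLATE.format(genre=genre)
    # Run Response sends this system/user pair to the selected model.
    print(f"system: {SYSTEM_PROMPT_V1}")
    print(f"user:   {user_prompt}")
```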
We’ve now configured the first version of the Movie Recommendation agent. To take full advantage of versioning, let’s create a second version with a different system prompt that specifies how the model should format its response.
- In the Agent Builder, modify the System Prompt to the following:
You recommend a movie to watch. Your response should include:
- The movie title (bolded)
- A 1–2 sentence summary of what it’s about
- A brief note on why you’re recommending it (e.g., mood, genre, theme)
- On the Evaluation tab, rerun each row to generate the model’s response.
- In the Manual column, select a thumbs up or thumbs down to record your assessment of the response.
- In the Agent Builder, under the title for the agent, select Save as New Version. Enter the name V2.
You can now use the Compare feature to review the agent’s responses for both versions of the Movie Recommendation agent side by side. On the Evaluation tab, select the Compare button (i.e. the left and right arrow icon) and select the Movie Recommendation V1 agent.
The model’s output for both versions now appears together, ready for review.
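If you later want to reproduce the same comparison outside the Agent Builder (say, in a quick script or a CI job), a minimal sketch might look like the following. It assumes the OpenAI Python SDK and an OPENAI_API_KEY environment variable, and it hard-codes gpt-4o so that the system prompt is the only variable; none of this is required by the AI Toolkit, which handles versioning and comparison for you:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPTS = {
    "V1": "You recommend a movie based on the user's favorite genre.",
    "V2": (
        "You recommend a movie to watch. Your response should include: "
        "the movie title (bolded), a 1-2 sentence summary of what it's about, "
        "and a brief note on why you're recommending it (e.g., mood, genre, theme)."
    ),
}
TEST_GENRES = ["Comedy", "Sci-Fi"]

def run_version(system_prompt: str, genre: str) -> str:
    """Run one version of the agent against one test input."""
    response = client.chat.completions.create(
        model="gpt-4o",  # keep the model fixed so the prompt is the only change
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"What's a good {genre} movie to watch?"},
        ],
    )
    return response.choices[0].message.content

# Same inputs, both versions: the heart of the A/B comparison.
for genre in TEST_GENRES:
    for name, prompt in SYSTEM_PROMPTS.items():
        print(f"--- {name} / {genre} ---")
        print(run_version(prompt, genre))
```

Note that the model stays fixed across both versions; if you also want to test GPT-4 vs. GPT-4o, make that a separate experiment.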
✏️ Tips for meaningful A/B tests
A/B testing only works if your comparisons are well-structured; random side-by-sides won’t reveal what’s actually better. You need a consistent, thoughtful approach to how you test. Otherwise, you risk misinterpreting the results or chasing improvements that aren’t really there.
Here’s how to make your A/B testing count:
- Change one variable at a time. If version A uses GPT-4 and version B uses GPT-4o and a new tool and a new prompt, you won’t know which change made the difference. Isolate variables so you can learn what’s actually helping.
- Use the same test prompts. Run each version of your agent against the same scenarios—ideally ones that reflect real user needs or known edge cases. You want to see how each version handles the exact same input.
- Evaluate against the same criteria. Whether you’re using manual scoring (e.g., “Was this fluent, helpful, grounded?”) or automated metrics, keep your rubric consistent so you can compare apples to apples (see the sketch after this list).
- Keep notes or a changelog. Even a simple note like “V3 adds retrieval tool and uses shorter prompt” helps you track what changed and makes it easier to revisit decisions later if needed.
- Watch for trade-offs. Sometimes one version improves relevance but loses fluency, or speeds up performance but drops grounding. Testing helps you surface these tensions so you can prioritize intentionally.
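To make those tips concrete, here’s one way to keep the rubric and the changelog consistent across versions in code. The rubric items and the record structure are just an illustration of the bookkeeping, not a feature of any particular tool:

```python
from dataclasses import dataclass, field

# One rubric, applied identically to every version, keeps comparisons fair.
RUBRIC = ("fluent", "helpful", "grounded")

@dataclass
class VersionResult:
    name: str
    changelog: str                              # describe the one thing that changed
    scores: dict = field(default_factory=dict)  # test input -> rubric judgments

    def score(self, test_input: str, **judgments: bool) -> None:
        """Record a thumbs up/down for each rubric item on one test input."""
        assert set(judgments) == set(RUBRIC), "score every version on the same rubric"
        self.scores[test_input] = judgments

v1 = VersionResult("V1", "Baseline: genre-based recommendation prompt.")
v2 = VersionResult("V2", "Single change: adds response-format instructions.")

# Same test inputs and the same rubric for both versions.
v1.score("Comedy", fluent=True, helpful=True, grounded=True)
v2.score("Comedy", fluent=True, helpful=True, grounded=False)  # trade-off surfaced
```

Even a log this simple makes it obvious what changed between versions and whether a win on one criterion cost you another.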
The goal isn’t just to find “the best” version; it’s to understand why one works better than another. And that insight is what makes your agent stronger with every iteration!
🔁 Recap
Here’s a quick rundown of what we covered:
- A/B testing is essential for building better agents, especially when experimenting with prompts, models, or tools.
- Skipping A/B testing can lead to silent regressions, biased decisions, and messy debugging.
- The AI Toolkit makes A/B testing easy with built-in agent versioning.
- Test intentionally: small changes, same scenarios, clear logs.
📺 Want to Go Deeper?
I previously wrote about how to measure the quality of your agent’s responses! It’s worth reviewing whether you’re new to the concept of evaluations or just need a refresher.
In addition, the Evaluate and Improve the Quality and Safety of your AI Applications lab from Microsoft Build 2025 provides a comprehensive self-guided introduction to getting started with evaluations. You’ll learn what each evaluator means, how to analyze the scores, and why observability matters—plus how to use telemetry data locally or in the cloud to assess and debug your app’s performance!
👉 Explore the lab: https://github.com/microsoft/BUILD25-LAB334/
With the right tools, you’re not just building—you’re improving with purpose.