Embarking on the hype train
You write an AI agent or skill for your app, iterate against a handful of manual test cases, and ship it. Release day looks great. Early signs match up with everything you tested. Then the feedback rolls in: it's missing things and it's struggling with edge cases. Users keep asking you the same question: "Can you make this more deterministic?"
You take another pass, tighten the prose, and add guardrails. You think it got better. You ship again.
Now the people it previously worked for tell you it's worse. You fix one person's case and break someone else's – you're chasing many hares and catching none. The succinctness that made the prompt feel elegant is now the problem: any small edit ripples outward, positive and negative, in ways you can't fully predict.
And so you ask the same question everyone before you asked: how do I make this more deterministic without breaking what already works?
The wrong question
The real problem isn't determinism – though wanting more of it is completely understandable. But if your agent were truly deterministic all of the time, you'd have engineered away the very thing that makes it valuable: the ability to navigate complex, ill-defined situations in ways no rigid rule system could anticipate.
Researchers at Sierra found that agents passing a task once often can't pass it eight times in a row (Yao et al., 2024). They defined pass^k, the probability that an agent succeeds on all k independent trials of the same task, to measure the gap between "worked in a single demo" and "works in general".
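To make pass^k concrete, here is a minimal sketch of how it can be estimated from recorded trials. It assumes the combinatorial estimator C(c, k) / C(n, k) averaged over tasks, where n is the number of trials per task and c the number of successes, which is my reading of the paper. The task names and results are hypothetical.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials drawn from the recorded runs all pass."""
    if not 0 < k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    # comb(c, k) is 0 when c < k, so a single failure zeroes out pass^n.
    return comb(num_successes, k) / comb(num_trials, k)


def pass_hat_k_over_tasks(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over every task in the suite."""
    per_task = [pass_hat_k(len(runs), sum(runs), k) for runs in results.values()]
    return sum(per_task) / len(per_task)


# Hypothetical results: 8 recorded trials for each of two tasks.
results = {
    "refund_order": [True] * 8,
    "change_flight": [True, True, True, False, True, True, True, True],
}

print(pass_hat_k_over_tasks(results, k=1))  # 0.9375
print(pass_hat_k_over_tasks(results, k=8))  # 0.5
```

One flaky trial barely moves the single-run score but halves the eight-in-a-row score for this two-task suite.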
So what is the problem? It's the ever-present, classic lack of alignment among the humans. Every stakeholder carries a different mental model of what "good" looks like and what the agent should do, from the macro decisions down to the minutiae of execution.
Sales needs to demo the vision. Customer success needs it to nail each customer's use case. Marketing needs it to wow in a webinar.
When the agent behaves unexpectedly, each group interprets that through its own lens – and reaches for "make it more deterministic" as a proxy for "make it do what I expect." The question feels technical, but it's disguising a people problem.
So who's right? As it turns out, everyone is – at least partially.
Consensus
Each person has a unique perspective on the requirements. But given that LLM output is non-deterministic in practice, how do we achieve alignment within the team on what the AI should be doing?
We've always had the answer to this in more conventional coding work: codify requirements into automated tests and tooling, including stylistic choices through committed linter configurations. Those certainly behave deterministically, but more importantly, they objectively document the expected use cases and how they should be handled, and they encode the behavior the team has agreed on.
However, simply applying traditional unit testing approaches yields many challenges for validating a non-deterministic tool. What would you assert? Snapshots of generated code or the chat transcript? Would you mock tool calls and assert that those were called with the expected arguments? These approaches flounder as they attempt not only to encode what "good" looks like but also how "good" was achieved at a level of precision that fails far more often than it succeeds.
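As an illustration of that brittleness, the sketch below compares an exact-match assertion on tool-call arguments with a check on the trait that actually matters. The tool name, argument names, and both traces are hypothetical and not taken from any real SDK.

```python
from datetime import datetime, timezone

# Two hypothetical tool-call traces for the same scheduling request.
# Both book the same instant; only the timestamp's representation differs.
run_a = [{"tool": "create_meeting",
          "args": {"start": "2025-03-04T15:00:00+00:00", "duration_min": 30}}]
run_b = [{"tool": "create_meeting",
          "args": {"start": "2025-03-04T10:00:00-05:00", "duration_min": 30}}]

EXPECTED_START = datetime(2025, 3, 4, 15, 0, tzinfo=timezone.utc)


def brittle_check(trace: list[dict]) -> bool:
    """Snapshot-style assertion: the trace must match one recorded run exactly."""
    return trace == run_a


def trait_check(trace: list[dict]) -> bool:
    """Assert the outcome: exactly one meeting, at the right instant and length."""
    calls = [call for call in trace if call["tool"] == "create_meeting"]
    if len(calls) != 1:
        return False
    args = calls[0]["args"]
    start = datetime.fromisoformat(args["start"])
    return start == EXPECTED_START and args["duration_min"] == 30


print(brittle_check(run_a), brittle_check(run_b))  # True False
print(trait_check(run_a), trait_check(run_b))      # True True
```

The exact-match version rejects a correct run because it encodes how the result was expressed rather than what was achieved.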
Jamming determinism into an agent or skill can lead to numerous anti-patterns:
- Setting the temperature to 0 in a vain attempt to make the stochastic parrot say the same thing each time
- Enabling retries of tests in the suite
- Writing precise and ever-expanding prose that forces the AI into a single golden happy path, leaving spontaneity behind
- Continually adding more subagents (SDK or prose) to pull more responsibilities from AI into deterministic code
But all this is like attempting to hold sand with a clenched fist – you end up squeezing the magic and capabilities out of the system. The result is lots of green-checked tests but a much less capable product.
Objectivity over determinism
So, what do you do? I'd suggest starting with an evaluation framework. Eval frameworks are built with AI's non-determinism in mind. They enable you to codify specific traits of generated solutions rather than fully asserting against the final result. The line can be a bit blurry between them and unit tests that call an AI SDK or wrap CLI calls, since some tests and assertions still have merit.
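To show the shape of the idea without tying it to a particular framework, here is a minimal sketch of an eval loop: each case lists traits, the agent runs several times, and the case reports a pass rate against a threshold instead of a single pass or fail. The `EvalCase` structure, the threshold, and the stand-in agent are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    traits: list[Callable[[str], bool]]  # each returns True when the trait holds
    min_pass_rate: float = 0.8           # hypothetical team-agreed threshold


def run_case(agent: Callable[[str], str], case: EvalCase, trials: int = 10) -> float:
    """Run the agent several times and return the share of runs meeting every trait."""
    passes = 0
    for _ in range(trials):
        output = agent(case.prompt)
        if all(trait(output) for trait in case.traits):
            passes += 1
    return passes / trials


def fake_agent(prompt: str) -> str:
    """Stand-in for a real model call; the wording varies from run to run."""
    greeting = random.choice(["Sure!", "Happy to help.", "Here you go:"])
    return f"{greeting} Proposed time: 15:00 UTC"


case = EvalCase(
    name="names-a-time-zone",
    prompt="Find 30 minutes for Berlin and New York",
    traits=[lambda out: "UTC" in out, lambda out: len(out) < 200],
)

rate = run_case(fake_agent, case)
print(f"{case.name}: {rate:.0%} (needs {case.min_pass_rate:.0%})")
```

The wording differs on every run, yet the case stays green because only the agreed traits are asserted.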
For example, if your skill or agent is generating a Node project, you can at least expect the project to run. Put differently, an expected trait of every produced solution is that the Node interpreter can run it. You can check this deterministically with a traditional automated test, and you can also use deterministic assertions inside evals.
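A deterministic "does it run" check can be as small as the sketch below. The helper name `runs_cleanly` and its parameters are hypothetical. For a generated Node project the command would be something like `["node", "index.js"]`; the demo uses the current Python interpreter as a stand-in so the sketch runs without Node installed.

```python
import subprocess
import sys
from pathlib import Path


def runs_cleanly(project_dir: Path, command: list[str], timeout_s: float = 30.0) -> bool:
    """Return True when the command exits with status 0 inside project_dir."""
    try:
        result = subprocess.run(
            command, cwd=project_dir, capture_output=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hang, a missing binary, or a missing directory all count as "does not run".
        return False
    return result.returncode == 0


# For a generated Node project (assumed layout):
#   runs_cleanly(Path("generated-project"), ["node", "index.js"])

# Stand-in demo using the current Python interpreter instead of node.
ok = runs_cleanly(Path("."), [sys.executable, "-c", "print('ok')"])
failed = runs_cleanly(Path("."), [sys.executable, "-c", "raise SystemExit(1)"])
print(ok, failed)  # True False
```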
But having the project run isn't sufficient for validating what the user asked the AI for in the first place. Does it solve the problem? You'll likely be able to run snapshot tests, but they'll fall flat because of the myriad valid (and invalid) solutions to the user's request.
This is where LLM grading comes into play. It enables validating the looser characteristics of the solution. For example, if your agent helps schedule meetings across time zones, you'd want to ensure that the proposed time is as reasonable as possible in each of those zones. Another LLM call acts as a judge on the outcome – how did it handle complex requirements with less-than-ideally specified outcome expectations?
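Here is a rough sketch of what an LLM-graded check might look like. The rubric wording, the 1 to 5 scale, and the pass threshold are assumptions. The `judge` parameter stands in for whatever model call you use, and a stub replaces it here so the example runs offline.

```python
import json
from typing import Callable

# Hypothetical rubric; doubled braces are literal braces for str.format.
RUBRIC = """You are grading a meeting-scheduling assistant.
Request: {request}
Proposed solution: {solution}
Score from 1 (poor) to 5 (excellent) how reasonable the proposed time is
for every attendee's local time zone.
Reply with JSON only: {{"score": <int>, "reason": "<one sentence>"}}"""


def grade(request: str, solution: str, judge: Callable[[str], str],
          threshold: int = 4) -> dict:
    """Ask the judge for a score and turn it into a pass/fail verdict."""
    raw = judge(RUBRIC.format(request=request, solution=solution))
    try:
        verdict = json.loads(raw)
        score = int(verdict["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Judges are models too; treat an unparseable reply as a failed grade.
        return {"passed": False, "score": None, "reason": "unparseable judge reply"}
    return {"passed": score >= threshold, "score": score,
            "reason": verdict.get("reason", "")}


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call."""
    return '{"score": 4, "reason": "Within working hours for both attendees."}'


result = grade("Find 30 minutes for Berlin and New York",
               "15:00 UTC on Tuesday", stub_judge)
print(result["passed"], result["score"])  # True 4
```

Keeping the rubric in the repo next to the cases means the definition of "reasonable" gets reviewed like any other code.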
These LLM judgments can also be canonicalized into more deterministic assertions (a form of promotion, if you will). For example, if the judge grades solutions on needless indirection, a deterministic test could assert that the final solution has five or fewer files.
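A promoted check of that kind might look like the sketch below. The helper names, the ignored directories, and the demo project layout are assumptions. The limit of five comes from the example above.

```python
import tempfile
from pathlib import Path


def count_files(solution_dir: Path,
                ignore: tuple[str, ...] = ("node_modules", ".git")) -> int:
    """Count files in the solution, skipping anything under an ignored directory."""
    return sum(
        1
        for path in solution_dir.rglob("*")
        if path.is_file()
        and not any(part in ignore for part in path.relative_to(solution_dir).parts)
    )


def has_at_most_n_files(solution_dir: Path, limit: int = 5) -> bool:
    """Deterministic stand-in for the judge's 'too much indirection' verdict."""
    return count_files(solution_dir) <= limit


# Demo on a throwaway, hypothetical project layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("package.json", "index.js", "README.md"):
        (root / name).write_text("")
    (root / "node_modules" / "dep").mkdir(parents=True)
    (root / "node_modules" / "dep" / "index.js").write_text("")
    file_count = count_files(root)
    within_limit = has_at_most_n_files(root)

print(file_count, within_limit)  # 3 True
```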
You should absolutely allow a fair amount of push-and-pull between the layers as you gain confidence in what the skill or agent should be doing and how it should behave.
Reaping the benefits
Evals yield the same benefits as more traditional testing frameworks: tests are committed to the repo, new tests are shipped with subsequent changes, and older tests validate that prior cases continue to work satisfactorily. Even more beautifully, this aids in dealing with the unending stream of changes you do not control:
- New models, even just versions within the same model family
- Agent updates, particularly when you're shipping Skills
- User expectations, both in novel use cases and expanded expectations of existing use cases
Most importantly, it eliminates the very slow, very manual feedback loop of many individuals running small-sample tests and trying to draw broad conclusions. Instead, we can capture the expectations, even those not yet handled by the agent or skill, and codify them into the repo for future work. I'd absolutely recommend not being afraid to add known-failing tests to drive towards objectivity and consensus before diving into resolutions!
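One way to commit a known-failing expectation without turning the suite red is an expected-failure marker. The sketch below uses the standard library's `unittest.expectedFailure`; the `schedule` function is a hypothetical stand-in for the agent under test.

```python
import unittest


def schedule(request: str) -> str:
    """Hypothetical stand-in for the agent under test."""
    return "15:00 UTC"


class SchedulingExpectations(unittest.TestCase):
    def test_reply_names_a_time_zone(self):
        self.assertIn("UTC", schedule("Berlin and New York, 30 minutes"))

    @unittest.expectedFailure
    def test_reply_lists_each_attendees_local_time(self):
        # Agreed-upon expectation that the agent does not meet yet.
        reply = schedule("Berlin and New York, 30 minutes")
        self.assertIn("Berlin", reply)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SchedulingExpectations)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
print(outcome.wasSuccessful(), len(outcome.expectedFailures))  # True 1
```

Once the agent starts meeting the expectation, the test shows up as an unexpected success, which is the cue to remove the marker.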
Evals for improvement
So you now have an agreed-upon rubric by which to evaluate your tool against usage. And you're now even able to track incremental improvement in the precise handling of those use cases.
That enables something fascinating. What if you could leverage the same coding agents you already use to drive improvements to your skill or agent without requiring a human to be within that hot loop? Doing this moves beyond LLM-as-grader into LLM-as-optimizer territory, involving techniques such as TextGrad (Yuksekgonul et al., 2024) and, more recently, GEPA (Agrawal et al., 2025). More to come on this!
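To give a feel for the loop, the sketch below is a deliberately naive greedy search: propose a revised prompt, score it with the evals, and keep it only if the score improves. It is not an implementation of TextGrad or GEPA. The `propose` and `evaluate` callables are hypothetical, and stubs stand in for the coding agent and the eval suite.

```python
from typing import Callable


def optimize_prompt(prompt: str,
                    propose: Callable[[str, float], str],
                    evaluate: Callable[[str], float],
                    rounds: int = 5) -> tuple[str, float]:
    """Greedy loop: accept a candidate prompt only when its eval score improves."""
    best_prompt, best_score = prompt, evaluate(prompt)
    for _ in range(rounds):
        candidate = propose(best_prompt, best_score)
        score = evaluate(candidate)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt, best_score


# Stubs: a real setup would call an LLM in `propose` and run the evals in `evaluate`.
REQUIRED = ["time zone", "working hours"]


def stub_evaluate(prompt: str) -> float:
    """Score is the share of required considerations the prompt mentions."""
    return sum(phrase in prompt for phrase in REQUIRED) / len(REQUIRED)


def stub_propose(prompt: str, score: float) -> str:
    """Add the first missing consideration, or return the prompt unchanged."""
    for phrase in REQUIRED:
        if phrase not in prompt:
            return f"{prompt} Always consider each attendee's {phrase}."
    return prompt


best, score = optimize_prompt("Schedule the meeting.", stub_propose, stub_evaluate)
print(score)  # 1.0
```

The accept-only-on-improvement rule is what lets the committed evals, rather than a human in the hot loop, decide whether a change ships.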
At Prismatic, we've been exploring various techniques to measure and improve our own Skills, agents, and MCP flow server. And we're not done.
Building our Skills was definitely the easy part – that's table stakes at this point. Continually improving them without breaking them (from anyone's perspective) is the really hard part. But we are doing it. How? We build consensus, commit it in-repo, make it executable, and enjoy the many payoffs from leveraging that objectivity.
In part 2 of this post, we’ll dig into what should happen once you have scoring in place.




