Determinism Was Never the Point

Embarking on the hype train

You write an AI agent or skill for your app, iterate against a handful of manual test cases, and ship it. Release day looks great. Early signs match up with everything you tested. Then the feedback rolls in: it's missing things and it's struggling with edge cases. Users keep asking you the same question: "Can you make this more deterministic?"

You take another pass, tighten the prose, and add guardrails. You think it got better. You ship again.

Now the people it previously worked for tell you it's worse. You fix one person's case and break someone else's – you're chasing many hares and catching none. The succinctness that made the prompt feel elegant is now the problem: any small edit ripples outward, positive and negative, in ways you can't fully predict.

And so you ask the same question everyone before you asked: how do I make this more deterministic without breaking what already works?

The wrong question

The real problem isn't determinism – though wanting more of it is completely understandable. But if your agent were truly deterministic all of the time, you'd have engineered away the very thing that makes it valuable: the ability to navigate complex, ill-defined situations in ways no rigid rule system could anticipate.

Researchers at Sierra found that agents passing a task once often can't pass it eight times in a row (Yao et al., 2024). They defined pass^k, the probability that an agent succeeds on all k independent trials of the same task, to measure the gap between "worked in a single demo" and "works in general".
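
As a rough sketch of how that metric can be computed: assuming each task was run n times and c of those runs passed, pass^k for the task can be estimated as C(c, k) / C(n, k), then averaged across tasks. The names `TaskTrials` and `passHatK` are made up for this illustration and are not taken from the paper's code.

```typescript
// Illustrative sketch only: `TaskTrials` and `passHatK` are hypothetical names.
interface TaskTrials {
  trials: number;    // n: how many times the task was run
  successes: number; // c: how many of those runs passed
}

// Binomial coefficient C(n, k); returns 0 when k is out of range.
function choose(n: number, k: number): number {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) {
    // After step i, result equals C(n - k + i, i), so it stays an integer.
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Average over tasks of the chance that k trials drawn from the n recorded
// runs (without replacement) all passed.
function passHatK(tasks: TaskTrials[], k: number): number {
  if (tasks.length === 0) throw new Error("need at least one task");
  let total = 0;
  for (const { trials, successes } of tasks) {
    if (k < 1 || k > trials) throw new Error("k must be between 1 and the trial count");
    total += choose(successes, k) / choose(trials, k);
  }
  return total / tasks.length;
}

const recorded: TaskTrials[] = [
  { trials: 8, successes: 8 }, // rock solid
  { trials: 8, successes: 4 }, // passes half the time
];

console.log(passHatK(recorded, 1)); // 0.75
console.log(passHatK(recorded, 8)); // 0.5
```

The coin-flip task contributes half credit at k = 1 and nothing at k = 8, which is exactly the "demo versus in general" gap.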

So what is the problem? It's the ever-present, classic lack of alignment among the humans. Every stakeholder carries a different mental model of what "good" looks like and what the agent should do, from the macro decisions down to the minutiae of execution.

Sales needs to demo the vision. Customer success needs it to nail each customer's use case. Marketing needs it to wow in a webinar.

When the agent behaves unexpectedly, each group interprets that through its own lens – and reaches for "make it more deterministic" as a proxy for "make it do what I expect." The question feels technical, but it's disguising a people problem.

So who's right? As it turns out, everyone is – at least partially.

Consensus

Each person has a unique perspective on the requirements. But given that AI is non-deterministic by definition, how do we achieve alignment within the team on what AI should be doing?

We've always had the answer in more conventional coding work: codify requirements into automated tests and tooling, right down to stylistic choices captured in committed linter configurations. Those certainly behave deterministically. More importantly, they objectively document the expected use cases and how each should be handled, and they encode behavior the team has agreed on.

However, simply applying traditional unit testing approaches yields many challenges for validating a non-deterministic tool. What would you assert? Snapshots of generated code or the chat transcript? Would you mock tool calls and assert that those were called with the expected arguments? These approaches flounder as they attempt not only to encode what "good" looks like but also how "good" was achieved at a level of precision that fails far more often than it succeeds.

it("schedules a sync across timezones", async () => {
  const reply = await runScheduler(
    "Find 30 min next week for me, Bo, and Cy — NY, Berlin, Tokyo."
  );

  // Pin the slot → flaky: depends on "next week", live calendars, model whim.
  expect(reply).toContain("Tuesday at 9:00 AM ET");

  // Loosen it → meaningless: this passes even when 9:00 AM ET is 11pm in Tokyo.
  expect(reply).toMatch(/\d{1,2}:\d{2}/);
});

Jamming determinism into an agent or skill can lead to numerous anti-patterns:

Setting the temperature to 0 in a vain attempt to make the stochastic parrot say the same thing each time
Enabling retries of tests in the suite
Writing precise and ever-expanding prose that forces the AI into a single golden happy path, leaving spontaneity behind
Continually adding more subagents (SDK or prose) to pull more responsibilities from AI into deterministic code

But all this is like attempting to hold sand with a clenched fist – you end up squeezing the magic and capabilities out of the system. The result is lots of green-checked tests but a much less capable product.
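
To put rough numbers on the retry anti-pattern, here is a small arithmetic sketch. It assumes each run of a check passes independently with a fixed probability, which is a simplification, and both function names are invented for this example.

```typescript
// Illustrative arithmetic only. Assumes independent runs that each pass
// with the same probability `perRunPassRate`.

// Chance the suite shows green when a check may be attempted up to
// `maxAttempts` times and only needs to pass once.
function apparentPassRate(perRunPassRate: number, maxAttempts: number): number {
  return 1 - Math.pow(1 - perRunPassRate, maxAttempts);
}

// Chance the same check passes on every one of `runs` consecutive runs,
// which is closer to what a user experiences over repeated use.
function allRunsPassRate(perRunPassRate: number, runs: number): number {
  return Math.pow(perRunPassRate, runs);
}

console.log(apparentPassRate(0.5, 3)); // 0.875
console.log(allRunsPassRate(0.5, 3));  // 0.125
```

Under these assumptions, a check that passes half the time looks green 87.5% of the time with three attempts, while a user who relies on it three times in a row gets a clean run only 12.5% of the time.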

Objectivity over determinism

So, what do you do? I'd suggest starting with an evaluation framework. Eval frameworks are built with AI's non-determinism in mind. They enable you to codify specific traits of generated solutions rather than fully asserting against the final result. The line can be a bit blurry between them and unit tests that call an AI SDK or wrap CLI calls, since some tests and assertions still have merit.

export default defineEvalCase({
  id: "scheduler/cross-tz",
  prompt:
    "Can you find 30 min next week for me, Bo, and Cy? We're spread across " +
    "NY, Berlin, and Tokyo so it's always a pain to line up.",
  driver: {
    name: "claude-code",
    config: { askToolName: "AskUserQuestion", maxInterrupts: 3 },
  },
  answerer: {
    name: "scripted",
    config: {
      // The human in the loop — replies when the agent stops to ask.
      script: [
        { match: { kind: "ask" }, answer: "Tuesday or Wednesday morning is best. Skip Monday." },
      ],
    },
  },
  assertions: [ /* ...deterministic and judged checks go here */ ],
});

For example, if your skill or agent is generating a Node project, you can at least expect the project to run. Or rephrased, an expected trait of all produced solutions is that they should be runnable by the Node interpreter. You can do this deterministically with a traditional automated testing approach. You can also leverage deterministic assertions in evals:

// it grounded itself in real calendars instead of inventing free/busy
{ type: "tool-called", name: "get_availability" },

// it asked rather than guessed when the day was left open
{ type: "tool-called", name: "AskUserQuestion" },

// the time it lands on is anchored to a zone — the dropped-tz bug, for free
{
  type: "regex",
  name: "proposes a timezone-anchored time",
  pattern: "(UTC|GMT|[A-Z]{1,4}T|[+-][0-9]{2}:?[0-9]{2})",
  against: "transcript-all",
},
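
For a sense of what a runner might do with assertions like these, here is a minimal sketch. The `Transcript` shape, the `checkAssertion` function, and the reading of "transcript-all" as "every message joined together" are assumptions made for illustration, not the API of any particular eval framework. The regex in the sample data is a shortened pattern chosen for the example.

```typescript
// Hypothetical shapes for illustration; a real eval framework will differ.
type Assertion =
  | { type: "tool-called"; name: string }
  | { type: "regex"; name: string; pattern: string; against: "transcript-all" };

interface Transcript {
  toolCalls: { name: string }[]; // tools the agent invoked, in order
  messages: string[];            // everything the agent said
}

interface AssertionResult {
  label: string;
  passed: boolean;
}

// Evaluate one deterministic assertion against a recorded transcript.
function checkAssertion(assertion: Assertion, transcript: Transcript): AssertionResult {
  switch (assertion.type) {
    case "tool-called":
      return {
        label: `tool-called:${assertion.name}`,
        passed: transcript.toolCalls.some((call) => call.name === assertion.name),
      };
    case "regex":
      // Assumption: "transcript-all" means all messages joined together.
      return {
        label: assertion.name,
        passed: new RegExp(assertion.pattern).test(transcript.messages.join("\n")),
      };
  }
}

const sampleTranscript: Transcript = {
  toolCalls: [{ name: "get_availability" }],
  messages: ["How about Tuesday at 14:00 UTC?"],
};

const sampleAssertions: Assertion[] = [
  { type: "tool-called", name: "get_availability" },
  { type: "tool-called", name: "AskUserQuestion" },
  {
    type: "regex",
    name: "proposes a timezone-anchored time",
    pattern: "(UTC|GMT|[+-][0-9]{2}:?[0-9]{2})",
    against: "transcript-all",
  },
];

const sampleResults = sampleAssertions.map((a) => checkAssertion(a, sampleTranscript));
console.log(sampleResults.map((r) => `${r.passed ? "PASS" : "FAIL"} ${r.label}`).join("\n"));
```

In this sample the agent checked calendars and anchored the time to a zone, but never stopped to ask, so the second assertion fails while the other two pass.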

But having the project run isn't sufficient for validating what the user asked the AI for in the first place. Does it solve the problem? You'll likely be able to run snapshot tests, but they'll fall flat because of the myriad valid (and invalid) solutions to the user's request.

This is where LLM grading comes into play. It enables validating the looser characteristics of the solution. For example, your agent helping schedule meetings across time zones would want to ensure that times across those zones are as reasonable as possible. Another LLM call acts as a judge on the outcome – how did it handle complex requirements with less-than-ideally specified outcome expectations?

{
  type: "rubric",
  name: "humane in every timezone",
  criteria:
    "The proposed slot lands in roughly 08:00–18:00 local time for all three " +
    "of NY, Berlin, and Tokyo — never 03:00 in Tokyo to suit NY. Honors the " +
    "'Tuesday/Wednesday morning, skip Monday' answer the user gave.",
},
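
Under the hood, a rubric check like this usually boils down to a second model call plus some defensive parsing. The sketch below assumes a caller-supplied `complete` function that sends a prompt to whatever model you use and returns its text reply; that function, the prompt wording, and the JSON verdict format are all assumptions for illustration.

```typescript
// Hypothetical shapes for illustration; not any specific framework's API.
interface Rubric {
  name: string;
  criteria: string;
}

interface Verdict {
  pass: boolean;
  reason: string;
}

// Assumption: the caller supplies a function that calls their model of choice.
type Complete = (prompt: string) => Promise<string>;

// Build the grading prompt from the rubric and the recorded transcript.
function buildJudgePrompt(rubric: Rubric, transcript: string): string {
  return [
    "You are grading an AI agent's work against a rubric.",
    `Rubric "${rubric.name}": ${rubric.criteria}`,
    "Transcript:",
    transcript,
    'Reply with JSON only: {"pass": true or false, "reason": "<one sentence>"}',
  ].join("\n");
}

// Parse the judge's reply, treating anything malformed as a failure
// rather than silently passing.
function parseVerdict(raw: string): Verdict {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null) {
      const { pass, reason } = parsed as { pass?: unknown; reason?: unknown };
      if (typeof pass === "boolean" && typeof reason === "string") {
        return { pass, reason };
      }
    }
  } catch {
    // Not JSON at all; fall through to the failing verdict below.
  }
  return { pass: false, reason: "judge reply was not a valid verdict" };
}

// Run one rubric check end to end.
async function judge(rubric: Rubric, transcript: string, complete: Complete): Promise<Verdict> {
  return parseVerdict(await complete(buildJudgePrompt(rubric, transcript)));
}
```

Failing closed on an unparseable reply is a deliberate choice in this sketch: a judge that cannot be understood should not count as a pass.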

These LLM judgments can also be canonicalized into more deterministic assertions (a form of promotion, if you will). For instance, once graded runs have shown what acceptable solutions have in common, a deterministic test could ensure that the final solution has five or fewer files.
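
As another hedged illustration of promotion, the working-hours part of the timezone rubric could move into plain code once the agent's proposed slot is available as a timestamp. The sketch assumes that extraction has already happened; the function names are hypothetical, and the 08:00–18:00 window is taken from the rubric above.

```typescript
// Illustrative sketch: assumes the proposed slot was already parsed into a Date.
const ZONES = ["America/New_York", "Europe/Berlin", "Asia/Tokyo"];

// Hour of day (0-23) that `instant` falls on in the given IANA time zone.
function localHour(instant: Date, timeZone: string): number {
  const text = instant.toLocaleString("en-US", {
    timeZone,
    hour: "numeric",
    hour12: false,
  });
  // Some runtimes render midnight as "24" in this mode; fold it back to 0.
  return parseInt(text, 10) % 24;
}

// Zones where the slot starts outside [startHour, endHour) local time.
function zonesOutsideWorkingHours(
  instant: Date,
  zones: string[],
  startHour = 8,
  endHour = 18,
): string[] {
  return zones.filter((zone) => {
    const hour = localHour(instant, zone);
    return hour < startHour || hour >= endHour;
  });
}

// 13:00 UTC in mid-January: 08:00 in New York, 14:00 in Berlin, 22:00 in Tokyo.
const proposedSlot = new Date("2025-01-14T13:00:00Z");
const offenders = zonesOutsideWorkingHours(proposedSlot, ZONES);
console.log(offenders); // only Asia/Tokyo is flagged
```

The rubric's "roughly" is worth keeping in the judged layer: a hard window like this one may have no solution for three widely spread zones, which is the kind of push-and-pull between layers to expect.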

You should absolutely allow a fair amount of push-and-pull between the layers as you gain confidence in what the skill or agent should be doing and how it should behave.

Reaping the benefits

Evals yield the same benefits as more traditional testing frameworks: tests are committed to the repo, new tests are shipped with subsequent changes, and older tests validate that prior cases continue to work satisfactorily. Even more beautifully, this aids in dealing with the unending stream of changes you do not control:

New models, even just versions within the same model family
Agent updates, particularly when you're shipping Skills
User expectations, both in novel use cases and expanded expectations of existing use cases

Most importantly, it eliminates the very slow, very manual feedback loop of many individuals running small-sample tests and trying to draw broad conclusions. Instead, we can capture the expectations, even those not yet handled by the agent or skill, and codify them into the repo for future work. I'd absolutely recommend not being afraid to add known-failing tests to drive towards objectivity and consensus before diving into resolutions!

Evals for improvement

So you now have an agreed-upon rubric by which to evaluate your tool against usage. And you're now even able to track incremental improvement in the precise handling of those use cases.

That enables something fascinating. What if you could leverage the same coding agents you already use to drive improvements to your skill or agent without requiring a human to be within that hot loop? Doing this moves beyond LLM-as-grader into LLM-as-optimizer territory, involving techniques such as TextGrad (Yuksekgonul et al., 2024) and, more recently, GEPA (Agrawal et al., 2025). More to come on this!

At Prismatic, we've been exploring various techniques to measure and improve our own Skills, agents, and MCP flow server. And we're not done.

Building our Skills was definitely the easy part – that's table stakes at this point. Continually improving them without breaking them (from anyone's perspective) is the really hard part. But we are doing it. How? We build consensus, commit it in-repo, make it executable, and enjoy the many payoffs from leveraging that objectivity.

In part 2 of this post, we'll dig into what should happen once you have scoring in place.