Evals & Feedback · April 2026

Evals for Agents That Don't Have Weights

The entire eval industry assumes you can fine-tune a model. What happens when your agents are prompt-based, the problems are multi-dimensional, and the optimization surface is language?

The Standard Playbook Doesn't Apply

In traditional AI, evals feed a well-understood loop: collect human preferences, train a reward model, fine-tune weights, deploy the better model, repeat. RLHF. DPO. The entire tooling ecosystem — Braintrust, LangSmith, Inspect AI — is built around this assumption.

But autonomous agent systems don't work this way.

The agents are prompt-based. They run on foundation models you don't control. You can't adjust weights. Your optimization levers are prompts, memories, tool configurations, and architecture — all expressed in language. And the problems they solve aren't single-dimensional ("is this response good?") but multi-faceted: did the agent plan the right thing, build it well, verify honestly, and learn from the result?

This means you need a fundamentally different eval approach. Not just different metrics — a different paradigm.

Two Different Optimization Loops

Traditional: Weight Updates
Human preferences → reward model → gradient descent → new weights → better model. Mathematical optimization.

Prompt-Based: Linguistic Updates
Human evals → distillation engine → memory injection → new context → better behavior. Language optimization.

We call this a linguistic optimization loop. The "reward model" is a distillation engine that translates scores into actionable strategies. "Gradient descent" is memory injection that changes future behavior through context. The "learning rate" is how many memories you inject per cycle. And "regularization" is pruning memories that aren't helping.

Same feedback loop structure. Entirely different mechanism.
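The parallel can be sketched as a loop skeleton. This is a minimal illustration, not a real API: every function name here (run_cycle, human_score, distill, step) is a placeholder for a stage described above, and the stubbed bodies stand in for the actual agents and the human reviewer.

```python
# Hypothetical skeleton of the linguistic optimization loop.
# All names are placeholders; the stubs stand in for real stages.

def run_cycle(context):
    """Agents act under the currently injected context (stubbed)."""
    return {"output": "..."}

def human_score(result):
    """~10 minutes of human judgment per cycle (stubbed scores)."""
    return {"planning": 7.0, "execution": 3.0}

def distill(scores):
    """'Reward model' analogue: low scores become narrow strategies."""
    return [f"Improve {dim}: targeted strategy distilled from this cycle"
            for dim, s in scores.items() if s < 6]

def step(context, budget=15):
    """One 'gradient step': inject distilled lessons into the context,
    bounded by the memory budget (the loop's 'learning rate')."""
    scores = human_score(run_cycle(context))
    return (context + distill(scores))[-budget:]
```

The budget slice is doing the regularization work: once the context is full, new lessons displace old ones rather than accumulating forever.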

Why Single-Score Evals Fail for Agents

A chatbot eval can ask: "Was this response helpful? Yes/No." An agent building a product has at least five dimensions that can fail independently (planning, execution, verification, data, rendering). The three most instructive:

Planning

Did the agent work on the right thing? A perfectly built feature that nobody asked for is a planning failure, not a build failure.

Execution

Did the agent build it well? Correct plan, poor implementation. The spec was right but the code doesn't work on mobile.

Verification

Did the agent honestly assess its own work? Agents score themselves 8/10 on output that humans score 2.5/10. The gap is the eval problem.

A single "quality score" collapses all of these into one number. You know something is wrong but not where. Was it a bad plan, bad execution, or dishonest self-assessment? Without dimensional separation, you can't fix it — you can only say "try harder."

The purpose of an eval isn't to measure quality. It's to produce a signal specific enough that the system knows what to change.
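Concretely, the difference between a single score and a dimensional one is the difference between a float and a record. A minimal sketch (the dimension names follow the ones used in this piece; nothing here is a real API):

```python
from dataclasses import dataclass

@dataclass
class DimensionalEval:
    """One eval record: a score per failure dimension, not one number."""
    scores: dict  # dimension name -> score, 0..10

    def weakest(self):
        """The actionable signal: not 'quality is low' but 'X is low'."""
        dim = min(self.scores, key=self.scores.get)
        return dim, self.scores[dim]

ev = DimensionalEval({"planning": 8.0, "execution": 3.0, "verification": 7.0})
# ev.weakest() points at execution: the fix targets the builder, not the planner.
```

Collapsing those three numbers into a 6.0 average would hide exactly the fact the system needs.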

Evals Across the Full Lifecycle

The key insight: evals can't just sit at the end of the pipeline. They need to be woven through the entire lifecycle — from vision to output to learning.

Here's how we built it for a system with seven autonomous agents running parallel sprints to build an intelligence product:

1. Anchor on Vision + Feedback

Every eval starts from: what did we ask the system to do? Not "is the output good" in the abstract — but "did this output address the specific feedback the user gave and move toward the product vision?" The vision document and user feedback items are the eval anchors. Everything traces back to them.

2. Trace Through the Pipeline

For each feedback item, the system traces: Did the planner pick it up? Did the spec capture it correctly? Did the builder execute it? Does it render correctly in the product? This is a chain — and the eval identifies exactly where the chain broke.
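The chain check reduces to a first-failure walk. A minimal sketch, with hypothetical stage names standing in for the planner/spec/builder/rendering checks described above:

```python
from typing import Optional

# Hypothetical stage names for the chain, in pipeline order.
PIPELINE = ("planned", "specced", "built", "rendered")

def first_break(trace: dict) -> Optional[str]:
    """Walk the chain in order; return the first stage whose check
    failed, or None if the feedback item made it all the way through."""
    for stage in PIPELINE:
        if not trace.get(stage, False):
            return stage
    return None

# The item was planned and specced, but the build never landed:
broken_at = first_break({"planned": True, "specced": True, "built": False})
```

The value of ordering the walk is attribution: a rendering failure on an item that was never built is a build failure, and the eval should say so.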

3. Human Scores the Output

The human reviews the live product and scores each feedback item: how well was this addressed? For anything scored low, one additional question: where in the chain did it break? Planning, execution, data, rendering, or coordination? This takes about 10 minutes.
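The review record is correspondingly small: one score per feedback item, plus the single attribution question when the score is low. A hypothetical sketch (the break-point names come from the step above; the validation rule is one way to enforce "one additional question"):

```python
from dataclasses import dataclass
from typing import Optional

# The attribution options offered for any low-scoring item.
BREAK_POINTS = ("planning", "execution", "data", "rendering", "coordination")

@dataclass
class Review:
    feedback_item: str
    score: float                    # how well was this addressed, 0..10
    broke_at: Optional[str] = None  # asked only when the score is low

    def __post_init__(self):
        # Low scores must carry an answer to "where did the chain break?"
        if self.score < 5 and self.broke_at not in BREAK_POINTS:
            raise ValueError("low score needs a chain-break attribution")
```

A handful of these records per cycle is the entire human-facing surface of the eval.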

CRITICAL: 10 minutes of human input per cycle, not 2 hours.

4. Distillation, Not Raw Injection

Research across 16 sources converged on one finding: raw scores injected into prompts don't help. "You scored 3/10 on intelligence" is useless. What works is distillation — turning scores into specific, narrow strategies: "When building obligation cards, cross-reference 2+ data sources because single-source cards scored 3/10."
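The shape of a distilled lesson is a trigger plus a concrete rule plus the evidence. In the real system an LLM writes this text; the template below only illustrates the structure, and the field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ScoredItem:
    task: str        # what was being built, e.g. "obligation cards"
    score: float     # the human's score, 0..10
    diagnosis: str   # the human's note on what would have fixed it

def distill(item: ScoredItem) -> str:
    """Turn a raw score into a narrow, conditional strategy:
    a trigger ('when building X'), a rule, and the evidence."""
    return (f"When building {item.task}, {item.diagnosis}, "
            f"because previous attempts scored {item.score}/10.")

lesson = distill(ScoredItem("obligation cards", 3.0,
                            "cross-reference 2+ data sources"))
```

Note what is absent: the bare number never reaches the agent on its own. It only appears as evidence attached to an instruction the agent can act on.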

5. Calibrate the Self-Assessor

The verification phase — where agents check their own work — is consistently too generous. The fix isn't "be less generous." It's showing the verifier real examples of what the human scored high vs. low, so it can calibrate against human judgment, not abstract criteria.

Build a gold set of 30-50 human-scored examples over 5-10 cycles
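One way to use that gold set is to render contrastive anchors into the verifier's prompt. A hypothetical sketch, assuming the gold set is a list of records with a human score and a one-line summary (both field names are assumptions):

```python
# Instead of instructing the verifier to "be less generous",
# prepend real human-scored examples as calibration anchors.
def calibration_block(gold_set):
    """Render a few high- and low-scored gold examples as prompt anchors."""
    highs = [g for g in gold_set if g["human_score"] >= 7][:3]
    lows = [g for g in gold_set if g["human_score"] <= 4][:3]
    lines = ["Calibrate your self-assessment against these human judgments:"]
    for g in highs + lows:
        lines.append(f"- Human scored {g['human_score']}/10: {g['summary']}")
    return "\n".join(lines)

block = calibration_block([
    {"human_score": 9, "summary": "card cites two sources, numbers reconcile"},
    {"human_score": 2, "summary": "single-source card, stale figures"},
])
```

The contrast matters: showing only high-scored examples tells the verifier what good looks like, but the low-scored ones are what pull its 8/10 habit down toward the human's 2.5.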
6. Memory Budget + Pruning

Distilled lessons ("memories") are injected into the next planning cycle. But more isn't better — too many memories crowd out actual work context. A hard cap (15-20 per agent) forces active curation. Memories that don't improve scores after 3 cycles get removed. Bad lessons are worse than no lessons.
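The budget-plus-pruning rule fits in a few lines. A minimal sketch, assuming each memory tracks how long it has been active and the score change observed since injection (both fields are assumptions about what the system records):

```python
from dataclasses import dataclass

@dataclass
class Memory:
    lesson: str
    cycles_active: int   # cycles this lesson has been injected
    score_delta: float   # measured score change since injection

def prune(memories, cap=20):
    """Drop lessons that moved nothing after 3 cycles, then keep the
    cap's worth of lessons with the largest measured improvement."""
    survivors = [m for m in memories
                 if m.cycles_active < 3 or m.score_delta > 0]
    survivors.sort(key=lambda m: m.score_delta, reverse=True)
    return survivors[:cap]
```

New lessons get a three-cycle grace period to show an effect; after that, they must pay for their context window or go.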

7. The Loop Compounds

Each cycle, the system gets slightly better at the things the human cares about. Not uniformly better at everything — specifically better at the dimensions where the gap between agent judgment and human judgment was largest. Over 10 cycles, the gap narrows. Over 50, the system learns your standards.

The Full Picture

Vision-to-Learning Eval Loop

1. Anchor (Vision + Feedback): "You asked for X." The specific things the product should do, in the user's words.
2. Trace (Pipeline Audit): "The system did Y." What was planned, built, and shipped, each step traced to the original ask.
3. Eval (Human Scores): "Y is Z% of X." How close is the output to what was asked? Where did the chain break?
4. Learn (Distill + Inject): The gap becomes a lesson. The lesson becomes a memory. The memory shapes the next cycle. The system improves.

This is an eval system that's fundamentally different from "run a test suite and check pass/fail." It treats evaluation as a feedback-resolution audit: for every piece of feedback, trace whether it was resolved, where it wasn't, and why. Then turn that "why" into a specific lesson that changes the next cycle.

The eval isn't measuring quality in isolation. It's measuring convergence toward the vision.

What We Learned Building This

What We Expected vs. What Actually Worked

Expected: Score everything, inject all scores.
Worked: Score the output, distill into narrow strategies. Raw scores are noise. Strategies are signal.

Expected: Tell the verifier "be less generous."
Worked: Show the verifier real examples of what humans scored high and low. Calibrate with evidence, not instructions.

Expected: Accumulate all lessons forever.
Worked: A hard memory budget of 15-20 per agent. Force active curation. Remove lessons that don't improve scores. Bad memory is worse than no memory.

Expected: Evaluate overall "quality."
Worked: Evaluate per dimension: planning, execution, verification, data, rendering. You need to know where it broke, not just that it broke.

Expected: Full counterfactual replay for attribution.
Worked: One radio button per low-scoring item ("Where did the chain break?"). Gets 80% of the attribution signal at 1% of the cost.

Expected: Human reviews every output in detail.
Worked: 10 minutes per cycle. Score 3-5 items, flag where failures happened, scan active memories. That's enough signal for the system to learn.

The eval industry is building for a world where you train reward models and adjust weights. If your agents are prompt-based, you need to build for a world where improvement happens through language — distilled strategies, calibrated examples, and curated memory.

The Maturity Curve

Not everything needs to be built at once. We found a natural progression:

1. Ad-hoc Review

Agents run, human scrolls through output and gives unstructured feedback. No standardized metrics. This is where most people start — and where most people stay.

2. Structured Scoring

Human scores output per-dimension against original feedback items. Verifier gets calibration anchors (real examples of high and low scores). This alone is a massive improvement.

3. Distillation + Injection

Scores get automatically distilled into actionable lessons that feed the next cycle's planning. The feedback loop closes. The system starts compounding.

This is where we are today.

4. Automated Closed-Loop

Production failures auto-route to eval queues. Distillation triggers automatically. Memory pruning runs on schedule. The human's role is direction-setting and exception review.

5. Proactive Adaptation

The system detects quality drift before humans notice. Adversarial testing surfaces new failure modes. The eval system evolves alongside the product.

Each stage is independently valuable. You don't need stage 5 to get benefit from stage 2. Start where you are and build up.

The Core Idea

Evals for autonomous agents are not about measuring quality after the fact. They're about building a closed-loop system where human judgment shapes agent behavior through structured feedback, not through model training.

The human provides the signal — 10 minutes per cycle, scoring what matters. The system does the translation — distilling scores into strategies, calibrating self-assessment, pruning bad lessons, injecting good ones. The agents compound — getting better at the specific things the human cares about, sprint after sprint.

That's not an eval pipeline. That's a learning system.