T&F's iterative methodology for building reliable agent harnesses

The Harness Engineering Process

There's a pattern that shows up in every discipline that gets good at solving hard problems under uncertainty. Scientists don't try to prove a theory in one shot — they observe, hypothesise, design an experiment, run it, and use the result to refine their understanding. Lean startups don't try to build the right product on the first attempt — they build the smallest thing that tests their riskiest assumption, measure what happens, and learn from it. Both methods accept that you won't get it right the first time, and both build systems for getting progressively less wrong.

Harness engineering works the same way. The insight from both Anthropic's and OpenAI's recent work is that when an agent fails, the fix is almost never "try harder" or "use a smarter model." The fix is changing the environment the agent operates in: the constraints, the feedback loops, the structure of the work, the information available at each step. The harness is the thing you iterate on; the model is just one variable inside it.

This document describes T&F's process for that iteration. It's how we engineer our harnesses.

The three traditions

Before laying out the process, it's worth seeing where it comes from. The same core loop appears in three places, each with different vocabulary but identical logic.

Scientific method produces knowledge through structured uncertainty reduction. You observe a phenomenon, form a hypothesis about why it behaves that way, design an experiment that could falsify your hypothesis, run it, measure the result, and update your understanding. The output of each cycle isn't just a better experiment; it's a better model of reality.

Lean startup produces product-market fit through validated learning. You identify your riskiest assumption, build the minimum thing that tests it, put it in front of real users, measure what happens, and decide whether to persevere or pivot. The output of each cycle isn't just a better product; it's a better understanding of what the market actually needs.

Harness engineering produces reliable agent systems through structured environment design. You run the harness against real work, observe where agents fail or underperform, form a theory about what's missing from the environment (not the model), change the harness, re-run, and measure whether the failure mode is resolved. The output of each cycle isn't just a better harness; it's a better understanding of the boundary between what the model can handle alone and what needs structural support.

The common thread: you are not building toward a fixed destination. You are running cycles that produce compounding knowledge, and that knowledge is what makes the system better.


The cycle

Each iteration of harness engineering follows six steps. They map cleanly to both scientific method and lean startup, but the vocabulary is specific to the work.

1. Observe

Run the harness against real work. Read the traces. Watch what the agents actually do — not what you expected them to do. Look at the logs, the outputs, the places where a human had to intervene, the places where the agent confidently produced something wrong.

Anthropic found that you have to read agent traces on realistic problems to understand where performance breaks down. OpenAI found the same: their team's primary job was watching agents work and asking what went wrong. Observation is not optional and it is not passive. It requires someone who knows what good looks like sitting with the output and paying attention.

Scientific method: observation. Lean startup: measure.

2. Diagnose

Identify the failure mode. Not "the agent got it wrong" — that's a symptom. The diagnosis is structural: what was missing from the environment that caused the failure?

Anthropic identified two root failure modes across all their work: context coherence loss (the agent loses track of what it's doing as context fills up) and self-evaluation bias (the agent rates its own work too generously). OpenAI identified underspecified environments as their primary failure class — the agent had the capability but lacked the tools, abstractions, or structure to make progress.

The question at this step is always: what capability is missing from the environment, and how do we make it both legible and enforceable for the agent?

This is the hardest step. It requires judgment about whether the problem is in the harness structure (how work is decomposed), the information architecture (what the agent knows and when), the feedback mechanisms (how the agent learns it got something wrong), or the constraints (what the agent is and isn't allowed to do). Most failures map to one of these four categories.

Scientific method: hypothesis formation. Lean startup: identify the riskiest assumption.

3. Intervene

Change the harness. Not the prompt — the structure. The intervention should be the minimum change that addresses the diagnosed failure mode, and it should be permanent infrastructure, not a one-off fix.

OpenAI's principle: when an agent fails, the correction becomes a reusable constraint — a new lint rule, a structural test, a sub-agent, an architectural boundary. Anthropic's principle: remove one component at a time and review the impact, so you know what's actually load-bearing.

Possible interventions include: adding or removing an agent role, changing how work is decomposed, adding a circuit breaker, adjusting trust tier thresholds, modifying evaluation criteria, restructuring handoff artifacts, changing what information is available at a given step, adding or tightening architectural constraints.

The discipline here is restraint. Change one thing. Know what you changed and why.

Scientific method: experimental design. Lean startup: build.
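
One form a reusable constraint can take is a structural test. The sketch below is only an illustration, not OpenAI's or T&F's actual tooling. It assumes a hypothetical rule that code in a `ui` layer must not import from a `db` layer, and it checks source text with Python's standard `ast` module. The layer names, `FORBIDDEN_IMPORTS`, and both function names are invented for the example.

```python
import ast

# Hypothetical dependency rule: layer name -> layers it must not import from.
FORBIDDEN_IMPORTS = {"ui": {"db"}}


def imported_top_level_modules(source: str) -> set[str]:
    """Return the top-level package names imported by a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Only absolute imports are considered in this sketch.
            names.add(node.module.split(".")[0])
    return names


def boundary_violations(layer: str, source: str) -> set[str]:
    """Return the forbidden layers that `source` (living in `layer`) imports."""
    return imported_top_level_modules(source) & FORBIDDEN_IMPORTS.get(layer, set())
```

Run on every agent-produced change, a check like this turns a one-time correction ("don't reach into the database from the UI") into a rule the agent hits automatically the next time it drifts.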

4. Run

Execute the updated harness against the same (or comparable) work. The run should be as close to production conditions as possible — real client work, real codebases, real complexity. Benchmark runs against synthetic tasks are useful for rapid iteration, but they don't replace running against the actual work the harness exists to do.

Scientific method: run the experiment. Lean startup: ship.

5. Measure

Compare the result against your baseline using the metrics that matter. For T&F, those are: first-pass approval rate, circuit breaker frequency, post-deploy defect rate, cost per task, human time per task, and the north star — percentage of work that ships without needing to be fixed by a human.

The measurement should answer two questions. First, did the intervention resolve the diagnosed failure mode? Second, did it introduce any new failure modes?

Anthropic found that evaluator scores generally improved over iterations but not always linearly: sometimes a middle iteration was better than the last one. Improvement is not monotonic, so keep the score for every iteration rather than assuming the latest is the best.

Scientific method: collect and analyse data. Lean startup: measure.
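
The two measurement questions can be sketched as a comparison between a baseline snapshot and a new run. The metric keys mirror the list above. The direction of each metric (whether higher or lower is better) and the use of one target metric as a proxy for "failure mode resolved" are assumptions made for this sketch.

```python
# Metric names follow the list above; whether higher or lower is better
# is an assumption made for this sketch.
HIGHER_IS_BETTER = {
    "first_pass_approval_rate": True,
    "circuit_breaker_frequency": False,
    "post_deploy_defect_rate": False,
    "cost_per_task": False,
    "human_time_per_task": False,
    "ships_without_human_fix_pct": True,
}


def compare(baseline: dict, result: dict, target: str) -> dict:
    """Answer the two measurement questions for one run.

    `target` is the metric the diagnosed failure mode is expected to move.
    """
    def improved(name):
        delta = result[name] - baseline[name]
        return delta > 0 if HIGHER_IS_BETTER[name] else delta < 0

    def regressed(name):
        delta = result[name] - baseline[name]
        return delta < 0 if HIGHER_IS_BETTER[name] else delta > 0

    return {
        "resolved": improved(target),
        "regressions": sorted(m for m in baseline if m != target and regressed(m)),
    }


def best_iteration(scores: list[float]) -> int:
    """Index of the best-scoring iteration, which need not be the last one."""
    return max(range(len(scores)), key=scores.__getitem__)
```

`best_iteration` is there because improvement is not monotonic: keeping every iteration's score makes it possible to notice when an earlier harness version was the better one.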

6. Learn

Update your understanding. There are three possible outcomes:

The intervention worked. The failure mode is resolved, no new failure modes appeared. The intervention becomes part of the harness definition. Document what changed and why it mattered. This is validated learning.

The intervention didn't work. The failure mode persists. Your diagnosis was wrong, or the intervention didn't address it adequately. Return to step 2 with new information from the run.

The intervention worked but introduced new problems. The intervention addressed the original failure mode but created side effects. You now have a new failure mode to diagnose. Return to step 2.

In all three cases, the cycle produces knowledge that didn't exist before. Even a failed intervention tells you something about the structure of the problem.

Scientific method: update the model, refine the hypothesis. Lean startup: learn, decide to persevere or pivot.
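
The three outcomes of step 6 can be written down as a small decision function. This is a sketch only: the names `Outcome` and `learn` are hypothetical, and it assumes the Measure step has already produced a yes/no answer on the original failure mode plus a list of any new ones.

```python
from enum import Enum


class Outcome(Enum):
    WORKED = "adopt the intervention and document it"
    DID_NOT_WORK = "return to step 2 (Diagnose) with the new run data"
    NEW_PROBLEMS = "return to step 2 (Diagnose) for the new failure mode"


def learn(failure_resolved: bool, new_failure_modes: list[str]) -> Outcome:
    """Map a measured result to one of the three outcomes of step 6."""
    if not failure_resolved:
        # An unresolved failure counts as "didn't work" even if the run
        # also surfaced new failure modes.
        return Outcome.DID_NOT_WORK
    if new_failure_modes:
        return Outcome.NEW_PROBLEMS
    return Outcome.WORKED
```

For example, `learn(True, ["scope drift"])` returns `Outcome.NEW_PROBLEMS`, which sends the cycle back to Diagnose with a new failure mode in hand.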


The model change checkpoint

There's a seventh step that sits outside the regular cycle, triggered not by a failure but by an external event: a new model release.

When a new model lands, re-examine every component of the harness. Each component encodes an assumption about what the model can't do on its own. New models invalidate some of those assumptions. Anthropic found this directly — when Opus 4.6 replaced Opus 4.5, the sprint decomposition construct that had been essential became unnecessary overhead. The model could sustain coherence over longer sessions natively.

The process: take each harness component, remove it, run the harness, and measure what happens. If performance holds, the component was compensating for a model limitation that no longer exists. Strip it. If performance degrades, the component is still load-bearing. Keep it.
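
The remove-and-measure process above can be sketched as a simple ablation loop. `run_harness` is a hypothetical callable standing in for a real harness run. It is assumed to take the list of enabled components and return a single score where higher is better. The `tolerance` parameter is also an assumption, included because real runs are noisy.

```python
from typing import Callable


def ablate(
    components: list[str],
    run_harness: Callable[[list[str]], float],
    tolerance: float = 0.0,
) -> dict[str, str]:
    """Remove one component at a time and classify it as 'strip' or 'keep'.

    `run_harness` is a hypothetical callable that runs the harness with the
    given components enabled and returns a single score (higher is better).
    """
    baseline = run_harness(components)
    verdicts = {}
    for component in components:
        remaining = [c for c in components if c != component]
        score = run_harness(remaining)
        # Performance holds within tolerance: the component is no longer needed.
        verdicts[component] = "strip" if score >= baseline - tolerance else "keep"
    return verdicts
```

Each component is removed from the full set on its own, so this sketch does not capture interactions between components. Two components that each look removable in isolation may still need a combined run before both are stripped.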

Then ask the opposite question: what can the new model do that the old one couldn't? Are there harness configurations that weren't possible before but now are? This is where new capability gets unlocked — not by the model alone, but by the combination of a better model and a harness designed to take advantage of it.

As Anthropic put it: the space of interesting harness combinations doesn't shrink as models improve. It moves.


The four intervention categories

When diagnosing a failure mode (step 2), it helps to know where to look. Most harness failures map to one of four structural categories.

Decomposition: how work is broken into pieces. Failures here look like: agent loses coherence on long tasks, agent can't manage scope, agent tries to do too much at once. Interventions: add planning agents, introduce sprint constructs, change task granularity.

Information: what the agent knows and when. Failures here look like: agent lacks context it needs, agent is overwhelmed by context it doesn't need, agent can't find relevant information. Interventions: restructure handoff artifacts, change what's included in prompts, add or remove documentation.

Feedback: how the agent learns its output is wrong. Failures here look like: agent produces confidently wrong output, agent marks its own work as complete when it isn't, bugs pass through to deployment. Interventions: add evaluator agents, introduce circuit breakers, change review criteria, add automated testing.

Constraints: what the agent is and isn't allowed to do. Failures here look like: agent modifies things it shouldn't, agent drifts off-scope, agent makes decisions outside its authority. Interventions: add architectural boundaries, tighten permissions, introduce structural tests, enforce dependency rules.
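
When reading traces, it can help to tag each observed symptom with one of the four categories and count where failures cluster. The sketch below shows one possible way to do that. The symptom labels are shortened from the descriptions above, and the mapping and the `tally` function are illustrative, not a T&F artifact.

```python
from collections import Counter
from enum import Enum


class Category(Enum):
    DECOMPOSITION = "decomposition"
    INFORMATION = "information"
    FEEDBACK = "feedback"
    CONSTRAINTS = "constraints"


# Illustrative labels, one per category, drawn from the failure descriptions above.
SYMPTOM_TO_CATEGORY = {
    "loses coherence on long tasks": Category.DECOMPOSITION,
    "lacks needed context": Category.INFORMATION,
    "marks incomplete work as complete": Category.FEEDBACK,
    "modifies things it shouldn't": Category.CONSTRAINTS,
}


def tally(symptoms: list[str]) -> Counter:
    """Count observed symptoms per category, ignoring unrecognised labels."""
    return Counter(
        SYMPTOM_TO_CATEGORY[s] for s in symptoms if s in SYMPTOM_TO_CATEGORY
    )
```

A tally dominated by one category suggests where to look first in step 2. It does not replace the judgment that step requires.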


How T&F uses this

Every harness development cycle at T&F follows this process. When we build a new harness (product dev, marketing, or any future domain), we start by running it at the supervised trust tier against real client work, then iterate through the cycle to improve it. When we graduate agents to higher trust tiers, it's because the cycle has produced enough validated learning — measured in consecutive successes on specific task types — to justify the shift.
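
Graduation based on consecutive successes per task type could be tracked along these lines. This is a minimal sketch: the class name, the `required_streak` threshold, and the rule that any failure resets the streak are assumptions for illustration, not T&F's actual criteria.

```python
from collections import defaultdict


class GraduationTracker:
    """Count consecutive successes per task type; any failure resets the streak."""

    def __init__(self, required_streak: int):
        self.required_streak = required_streak  # hypothetical threshold
        self.streaks = defaultdict(int)

    def record(self, task_type: str, success: bool) -> None:
        """Extend the streak on success, reset it to zero on failure."""
        self.streaks[task_type] = self.streaks[task_type] + 1 if success else 0

    def ready_to_graduate(self, task_type: str) -> bool:
        """True once the current streak meets the required length."""
        return self.streaks[task_type] >= self.required_streak
```

Tracking streaks per task type, rather than per agent overall, matches the idea that trust is earned on specific kinds of work.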

The harness definition is shared across clients. The cycles run against isolated client instances. Feedback from any client's cycle improves the shared definition for all. That's the flywheel: the service delivers the learning, the learning improves the harness, the harness improves the service.

The process is the product. The harness is what we sell. And the cycle is how we make it better.