AI agents can now do real work. They can write code, review it, test it, draft content, analyse data. The capability is here. What isn't here is a way to trust them with it.
Not trust in the sense of whether the model is smart enough — that debate is mostly settled. Trust in the sense of: if I hand this agent a task that matters to my business, and something goes wrong, who's responsible? What caught the mistake? How do I know it won't happen again?
The AI industry has started to answer this for coding agents specifically. The term that's emerged is the harness: everything around the model that makes it usable. The constraints, the feedback loops, the tools, the observability. OpenAI built a million-line product with zero human-written code by engineering the harness, not by improving the model. Anthropic showed that separating an agent's work from its evaluation produced dramatically better results than letting the agent work alone. LangChain jumped from the middle of the pack to the top five on a major benchmark by changing nothing about the model and everything about the harness. The lesson is clear, and much of the industry is converging on it: the system around the model matters more than the model itself.
The frontier research is impressive, and the problems being tackled are genuinely hard. Building complete applications autonomously — the kind of complex systems that traditionally take months, multiple people, and multiple disciplines — is one of the harder challenges in software. Methodologies like agile evolved over decades to let teams build complex systems at speed under changing pressures. Getting agents to do this reliably will take continued research and continued breakthroughs.
But businesses don't have the luxury of waiting for the frontier to arrive. The pressure is here now — to increase output, ship faster, ship more, do it with fewer people. And that pressure doesn't limit itself to coding. It covers design, product management, marketing, sales, operations. Teams across every function are being asked to bring AI into their process, and many of them are doing it, because the capability is there and the pressure is real.
The problem is what happens to accountability when they do. If an agent writes code that breaks production, who gets the blame — the engineer who prompted it, the tech lead who approved the workflow, the CTO who adopted the tool? If an agent drafts marketing copy that makes a claim the company can't back up, is that on the marketer, the marketing director, or the person who decided to use AI in the first place? Right now, the answer is usually a shrug, or a vague hope that someone's checking the output.
This uncertainty plays out in two ways. Some teams take on more risk than they realise: they adopt agents quickly and get real productivity gains, but have no system for catching failures before they reach customers. Others limit their uptake: they see the risk, can't quantify it, and hold back, falling behind competitors who moved faster. The first is a quality and liability problem. The second is a competitive one. Both stem from the same root cause: there's no system in place that tells everyone involved exactly where accountability sits, at every point in the process, as work moves between humans and agents.
That's the problem we're solving.
The trajectory is already visible. Individual agents are here and improving fast. Harness engineering has emerged as a discipline to make them reliable. The research labs are pushing toward autonomous multi-step development, and they'll get there eventually. But the adoption curve isn't waiting for the research curve.
Right now, teams are weaving agents into their existing workflows — an agent that drafts a first pass, a human who reviews it, an agent that handles the revisions. This is happening in product development, in marketing, in customer support, in operations. It's happening informally, without shared standards, without consistent oversight, and without any structured way to answer the question of who's accountable for what the agent produces.
As agents get more capable and the pressure to adopt increases, this informal integration will give way to something more deliberate. Agent teams — multiple agents coordinating across the steps of a business function — will become the norm, not because it's a neat idea, but because the economics demand it. A founder who can run product development through a team of agents at a fraction of the cost of a human team has an enormous advantage, but only if the work is good enough to ship.
Beyond agent teams, the logical end point is agent organisations — where a business's core functions are each run by agent teams, coordinated and governed centrally. That's further out, and it raises harder questions about governance and cross-domain coordination. But the path from where we are now to where this is going runs through a single chokepoint: accountability. At every step, the question is the same — can the business trust the output enough to put its name behind it?
We build harnesses — but we use the term more broadly than the industry currently does. Where most harnesses today wrap a single coding agent to make it more reliable, ours wrap teams of agents to deliver complete business functions. A harness, as we define it, is the operating model for an AI agent team. It determines who does what, how work flows between agents, what quality controls are in place, who's accountable at every point, and how the whole thing improves over time.
A solution to the accountability problem needs four things.
It needs graduated autonomy — a way for agents to start fully supervised and earn independence by proving themselves, so that teams can adopt AI now without taking on unquantified risk. The level of human involvement decreases over time, but only because the data justifies it, not because someone decided to skip the checks.
It needs structural accountability — a clear answer, at every point in the workflow, to the question of who's responsible for this output. That answer is always a person. At the early stages it's the human reviewing every output. At later stages it's the operator who approved the agent's graduation, and the organisation that adopted the process. The harness doesn't absorb accountability. It provides the evidence that makes a human's decision to trust the agent a defensible one rather than a leap of faith.
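To make the first two requirements concrete, here is a minimal sketch in Python of how graduation and accountability could fit together. It is an illustration under stated assumptions, not a description of any particular harness: the level names, the thresholds (`MIN_REVIEWS`, `MIN_ACCEPTANCE`), and the functions `eligible_for_promotion` and `promote` are all hypothetical. The point it shows is that an agent moves up one level only when the track record supports it and a named person signs off.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Autonomy(IntEnum):
    """Hypothetical autonomy levels, from most to least supervised."""
    SUPERVISED = 0    # a human reviews every output
    SPOT_CHECKED = 1  # a human reviews a sample of outputs
    INDEPENDENT = 2   # outputs ship without routine review


@dataclass
class TrackRecord:
    reviewed: int = 0  # outputs a human has reviewed
    accepted: int = 0  # of those, outputs accepted without changes

    def record(self, accepted: bool) -> None:
        self.reviewed += 1
        if accepted:
            self.accepted += 1

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.reviewed if self.reviewed else 0.0


# Illustrative thresholds only; real values would come from the data.
MIN_REVIEWS = 20
MIN_ACCEPTANCE = 0.95


def eligible_for_promotion(level: Autonomy, record: TrackRecord) -> bool:
    """The data justifies promotion: enough reviews and a high acceptance rate."""
    if level == Autonomy.INDEPENDENT:
        return False  # already at the top level
    return (record.reviewed >= MIN_REVIEWS
            and record.acceptance_rate >= MIN_ACCEPTANCE)


def promote(level: Autonomy, record: TrackRecord,
            approved_by: Optional[str]) -> Autonomy:
    """Move up one level only if the data supports it AND a named person approves."""
    if approved_by and eligible_for_promotion(level, record):
        return Autonomy(level + 1)
    return level
```

Requiring `approved_by` alongside the evidence keeps the answer to "who's responsible?" a person: the record makes the decision defensible, but it does not make the decision.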
It needs safety mechanisms — ways to catch failures before they reach the customer. When an agent is uncertain, it should escalate rather than guess. When a process is stuck in a loop or running up costs, it should stop and surface the problem. These mechanisms have to be structural, not optional, because the whole point is that the system is trustworthy even when no one is watching every step.
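As a rough sketch of what "structural, not optional" could mean in practice, the Python below puts a guard around every agent step. Everything in it is an assumption made for illustration: the `Guard` class, the `Escalation` exception, the idea that each step reports a confidence score and a cost, and the default limits. The guard stops the run and hands control to a human when confidence is low, when the step count suggests a loop, or when the cost budget is spent.

```python
from dataclasses import dataclass


class Escalation(Exception):
    """Raised to stop the run and surface the problem to a human."""


@dataclass
class Guard:
    # Illustrative limits; a real harness would tune these per task type.
    min_confidence: float = 0.8
    max_steps: int = 10
    max_cost: float = 5.0
    steps: int = 0
    cost: float = 0.0

    def check(self, confidence: float, step_cost: float) -> None:
        """Call once per agent step, before acting on the step's output."""
        self.steps += 1
        self.cost += step_cost
        if confidence < self.min_confidence:
            raise Escalation(f"low confidence ({confidence:.2f}): escalate, don't guess")
        if self.steps > self.max_steps:
            raise Escalation("step limit exceeded: the process may be stuck in a loop")
        if self.cost > self.max_cost:
            raise Escalation("cost budget exceeded")
```

Because the check raises rather than returning a warning, the calling code cannot ignore it by accident. Continuing past a failed check would take a deliberate act.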
And it needs to improve over time. Every piece of work that runs through the harness should produce data that makes the harness better. The system should get measurably more reliable with use, and the people operating it should be able to see that improvement in concrete terms — fewer unplanned interventions, higher first-pass quality, more agents earning greater independence on more task types. If the numbers aren't moving, something is wrong and the data should tell you where.
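A small Python sketch of how two of those numbers could be tracked. The names `TaskRun`, `summarise`, and `is_improving` are hypothetical, and the definition of "improving" used here (first-pass rate not lower, interventions per task not higher, and at least one of them changed) is one possible choice among many, not an established standard.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TaskRun:
    passed_first_review: bool     # output was accepted on the first pass
    unplanned_interventions: int  # times a human had to step in unexpectedly


def summarise(runs: List[TaskRun]) -> Dict[str, float]:
    """Average first-pass quality and intervention load over a set of runs."""
    n = len(runs)
    if n == 0:
        # No data yet: report zeros rather than dividing by zero.
        return {"first_pass_rate": 0.0, "interventions_per_task": 0.0}
    return {
        "first_pass_rate": sum(r.passed_first_review for r in runs) / n,
        "interventions_per_task": sum(r.unplanned_interventions for r in runs) / n,
    }


def is_improving(earlier: List[TaskRun], later: List[TaskRun]) -> bool:
    """True if neither metric got worse and at least one of them moved."""
    before, after = summarise(earlier), summarise(later)
    return (after["first_pass_rate"] >= before["first_pass_rate"]
            and after["interventions_per_task"] <= before["interventions_per_task"]
            and after != before)
```

Comparing one period's runs against the next in this way gives operators a plain yes-or-no signal, and the per-metric summary shows where to look when the answer is no.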
The harness is what delivers our service. When we build a product for a client, that work runs through the harness, and in doing so it produces the data that improves the harness for everyone. The service and the system that powers it are the same thing, each making the other better.
We follow an iterative process for harness development that draws on the same logic as the scientific method and lean startup methodology. The core loop is: observe agents working on real tasks; diagnose the structural cause of any failure (not "the agent got it wrong", but what was missing from the environment that led to the failure); make a targeted change to the harness; run it again; measure the result; and learn from what happened. Each cycle produces knowledge that compounds.
The key discipline is diagnosing at the right level. When an agent fails, the temptation is to tweak the prompt or try a different model. Those are sometimes the right fixes, but more often the problem is structural: the work was decomposed poorly, the agent didn't have the right information at the right time, the feedback mechanism didn't catch the error, or the agent was operating outside the constraints it needed. Fixing the structure means the problem doesn't come back.
When a new model comes out, we re-examine every component of the harness. Each component encodes an assumption about what the model can't do on its own, and new models invalidate some of those assumptions. We strip what's no longer load-bearing and look for new configurations that weren't possible before. The space of useful harness designs doesn't shrink as models improve — it moves, and the work is to keep finding the next combination that unlocks better outcomes.
We're building an open-source evaluation framework alongside the harness itself. There's no standard way to measure whether a harness is working well, compare configurations, or track improvement over time. The framework provides that — a task specification format, an instrumentation layer, a scoring methodology, and longitudinal tracking. The harness definitions are proprietary; the evaluation methodology is public. This gives teams adopting agents a way to measure the thing that actually determines whether their agents are reliable — not the model, but the system around it.
The problem we're solving isn't specific to any size of company. Any business that's feeling the pressure to bring AI into its operations — but wants to do it in a way that's controlled, sustainable, and accountable — is dealing with some version of the same question: how do we move faster without taking on risk we can't see?
We're not interested in helping businesses chase hype. The organisations we want to work with care about what they ship. They're building something that matters (solving a real problem, creating real value, making a real impact) and they need to prove it. They work within tight constraints: limited budget, limited time, high stakes. They need to ship fast and ship frequently because there are people counting on them: investors, customers, employees, everyone in the value chain who stands to gain if it works and stands to lose if it doesn't.
These are the businesses that feel the accountability problem most acutely. A large company can absorb a failed AI experiment and write it off as a lesson learned. A startup running on savings and friends-and-family money can't. A small team trying to prove a thesis about a real problem can't. For them, the choice between adopting AI with unquantified risk and holding back entirely isn't a strategic trade-off. It's existential.
That's why graduated autonomy matters most here. These businesses need to move now, but they need to move in a way they can stand behind. Start supervised. Let the system prove itself. Build confidence from data, not hope. Every step forward is earned, and every piece of work that ships has a clear chain of accountability behind it.
We're drawn to these businesses because the values align. The harness engineering process is iterative: it rewards patience, long-term thinking, and a commitment to getting better rather than just getting more. It's a marathon, not a sprint. We want to work with people who see it that way, who value a sustainable pace over speed for its own sake, and who understand that building trust in a system takes time but compounds once it's there.
Founders feel this most intensely, and they're where we started. But the problem extends to any organisation where the stakes are real, the constraints are tight, and the pressure to adopt AI is meeting an unanswered question about who's accountable when it goes wrong.
We've been helping early-stage startups navigate uncertainty for over nine years. Our clients outsource their design, product, and engineering to us, and we are accountable for the products and systems we create.
Our work has always been about finding the right balance between shipping fast and building things that hold up. Iteration frequency is what allows startups to learn quickly and find product-market fit sooner, but shipping too fast creates risk. We share that risk with our clients. It's what they pay us for, and it's our job to manage it in a way that ensures they reach their long-term objective of creating sustainable value.
When we bring agents into that process, the dynamic doesn't change. The risk is still shared. We're still accountable for what ships. That reality is what drives how we build the harness — not as a research exercise, but as the system we depend on to deliver work we can stand behind.
The tools have changed. We used to close the gap between idea and execution with process templates, infrastructure playbooks, and boilerplate code. Now we can close it with AI agent teams — but only if accountability isn't lost in the process. That's what the harness is for. The underlying motivation hasn't moved. Make it possible for more people to build things that matter, and do it in a way that holds up when the stakes are real.