The Harness Platform is T&F's core product. It serves three purposes that are really one purpose at different scales.
First, it's where we design harnesses — define workflows, configure which steps are human and which are agent, set up the agents and their tools, and establish the trust controls that govern how much autonomy agents have.
Second, it's where work actually happens. Client projects run through harnesses. Agents execute tasks. Humans review outputs. Circuit breakers fire when something goes wrong. Every action is logged, every outcome is measured, and the data feeds directly back into improving the harness. The delivery tool and the improvement tool are the same thing.
Third, it's a product we can deploy for customers. An organisation stands up the platform, configures harnesses for their domains, deploys agent teams, and manages the transition from human-driven to AI-driven operations at their own pace, with full visibility and control at every point.
These aren't three separate products. They're the same platform at different levels of maturity and different points of deployment.
Accountability is structural, not aspirational. Every piece of work that moves through the platform has a clear chain of responsibility at every point. The platform doesn't just allow human oversight; it requires it at levels proportional to the trust the agents have earned. This is what makes it possible to be accountable for agent output.
Graduated autonomy is native. Trust tiers aren't a feature bolted onto an agent runner. They're the core mechanism. Agents start supervised and earn autonomy by proving themselves on specific task types. The platform tracks this automatically and enforces it structurally.
Observability is not optional. Full traces of agent reasoning, tool calls, inputs, outputs, and decisions are captured for every run. This serves three purposes: auditability (proving what happened), debugging (understanding why something went wrong), and improvement (feeding data back into the harness engineering cycle).
The flywheel is architectural. Using the platform to deliver work produces the data that improves the platform. This isn't a nice side effect; it's a design requirement. The instrumentation, metrics capture, and feedback loops exist to make the flywheel turn.
Human-in-the-loop is a first-class concept. Review queues, approval workflows, escalation paths, and intervention points are core platform primitives, not afterthoughts. The platform is designed around the assumption that humans need to stay in the loop, and that their involvement should decrease over time because trust has been earned, not because a configuration setting was changed.
Organisation
└── Harness (domain-specific workflow definition)
    └── Workflow
        └── Step (human | agent | hybrid)
            └── Agent Config (role, model, tools, knowledge, trust tier)
    └── Circuit Breakers
    └── Graduation Rules
    └── Evaluation Criteria
└── Project (client engagement)
    └── Task (unit of work)
        └── Run (execution of task through workflow)
            └── Trace (complete audit log)
            └── Review (human evaluation of output)
            └── Metrics (captured automatically)
The organisation is the top-level tenant. For T&F, this is T&F itself, managing harnesses and client projects. For a deployed customer, this is their organisation, running their own harnesses with their own teams.
An organisation contains harness definitions, projects, users (human operators), and platform configuration (model provider API keys, tool integrations, notification settings).
A harness is a workflow definition for a specific domain. Product development, marketing operations, financial analysis — each gets its own harness. A harness contains:
Harness definitions follow the shared definition / isolated execution model. The definition (workflow, agent prompts, circuit breaker rules, graduation criteria) is version-controlled and shared. Each project that uses the harness gets its own isolated instance (codebase, state, trust tier positions, run history).
Harness definitions are versioned. Changes to a harness produce a new version. Historical runs are always linked to the harness version they ran against, so you can trace how harness changes affect outcomes.
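The shared definition / isolated execution model and the version link on every run can be sketched as follows. This is a minimal illustration, not the platform's schema: `HarnessVersion`, `ProjectInstance`, `RunRecord`, `publishVersion`, and `recordRun` are hypothetical names, and the step names are placeholders.

```typescript
// Hypothetical shapes; field names are illustrative, not the platform's schema.
interface HarnessVersion {
  readonly harnessId: string;
  readonly version: number;
  readonly workflowSteps: readonly string[];
}

interface RunRecord {
  runId: string;
  harnessVersion: number; // the definition version this run executed against
}

interface ProjectInstance {
  projectId: string;
  harnessId: string;
  runHistory: RunRecord[]; // isolated per project, never shared
}

// Publishing returns a new frozen version; earlier versions are never mutated.
function publishVersion(
  versions: readonly HarnessVersion[],
  harnessId: string,
  workflowSteps: readonly string[],
): HarnessVersion {
  const latest = versions.reduce((max, v) => Math.max(max, v.version), 0);
  return Object.freeze({
    harnessId,
    version: latest + 1,
    workflowSteps: Object.freeze([...workflowSteps]),
  });
}

// Each run is stamped with the version it used, so outcomes can be compared across versions.
function recordRun(project: ProjectInstance, runId: string, definition: HarnessVersion): RunRecord {
  const run: RunRecord = { runId, harnessVersion: definition.version };
  project.runHistory.push(run);
  return run;
}

const versionOne = publishVersion([], "product-development", ["plan", "implement", "review"]);
const versionTwo = publishVersion([versionOne], "product-development", ["plan", "implement", "test", "review"]);

const clientA: ProjectInstance = { projectId: "client-a", harnessId: versionOne.harnessId, runHistory: [] };
const clientB: ProjectInstance = { projectId: "client-b", harnessId: versionOne.harnessId, runHistory: [] };

recordRun(clientA, "run-1", versionOne);
recordRun(clientA, "run-2", versionTwo);
recordRun(clientB, "run-3", versionTwo);
```

Both projects share the same definitions but keep separate run histories, and a run recorded against version 1 stays linked to version 1 after version 2 is published.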
A workflow is an ordered set of steps that a task moves through. Each step has a type:
Human: a person performs this step entirely. The platform provides the interface, captures the output, and logs the time spent. Human steps exist for work that agents aren't ready to handle, or for decisions that require human judgment regardless of agent capability.
Agent: an AI agent performs this step. The platform executes the agent, captures the full trace, and routes the output according to the trust tier rules. At supervised tier, the output goes to a human review queue before proceeding. At monitored tier, a percentage of outputs are sampled for review. At autonomous tier, the output proceeds unless a circuit breaker fires.
Hybrid: a human and an agent collaborate on this step. The agent produces a draft or recommendation, and a human refines, approves, or redirects it. This is the default for steps in early harness development, and it's the natural mode for steps where agent capability is improving but not yet trusted.
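The tier-based routing described for agent steps can be sketched as a small pure function. This is an illustration under stated assumptions: `RoutingInput`, `RoutingDecision`, and `routeAgentOutput` are hypothetical names, and the random source is injected only so the sampling decision can be tested deterministically.

```typescript
type TrustTier = "supervised" | "monitored" | "autonomous";

interface RoutingInput {
  tier: TrustTier;
  breakerFired: boolean;
  sampleRate: number;   // fraction of monitored outputs sampled for review, 0..1
  random: () => number; // injected so the sampling decision is testable
}

interface RoutingDecision {
  proceeds: boolean;        // does the output move to the next step now?
  queuedForReview: boolean; // does a human see it?
}

function routeAgentOutput(input: RoutingInput): RoutingDecision {
  // A fired circuit breaker halts the workflow at any tier.
  if (input.breakerFired) return { proceeds: false, queuedForReview: true };

  switch (input.tier) {
    case "supervised":
      // Every output waits for human review before proceeding.
      return { proceeds: false, queuedForReview: true };
    case "monitored":
      // Output proceeds; a sample is also sent for review.
      return { proceeds: true, queuedForReview: input.random() < input.sampleRate };
    case "autonomous":
      return { proceeds: true, queuedForReview: false };
  }
}

const sampled = routeAgentOutput({
  tier: "monitored",
  breakerFired: false,
  sampleRate: 0.3,
  random: () => 0.1,
});
console.log(sampled); // { proceeds: true, queuedForReview: true }
```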
Each step defines:
How work moves from one step to the next matters, and the platform makes it visible, because transitions are where context gets lost or handed off poorly: the exact problem both Anthropic and OpenAI identified.
Each transition defines a handoff artifact: a structured document that carries the output of the previous step plus the context the next step needs to pick up the work cleanly. Handoff artifacts are explicit, logged, and inspectable. They're the structured equivalent of Anthropic's context reset handoffs — a clean slate with enough state to continue.
When a step's output fails review, the transition loops back to the same step with the review feedback attached. The platform tracks how many times a step loops and can trigger a circuit breaker if the loop exceeds a threshold (CB2: review rejection loop).
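A handoff artifact and the loop-back behaviour might look like the sketch below. All names (`HandoffArtifact`, `StepAttempt`, `nextTransition`) and fields are assumptions made for illustration; only the default threshold of 2 rejections comes from the CB2 description.

```typescript
// Hypothetical artifact shape: previous output plus the context the next step needs.
interface HandoffArtifact {
  fromStep: string;
  toStep: string;
  output: string;
  context: Record<string, string>;
  reviewFeedback: string[]; // attached when a step loops back
  rejections: number;
}

interface StepAttempt {
  step: string;
  output: string;
  context: Record<string, string>;
  reviewFeedback: string[];
  rejections: number;
}

interface ReviewOutcome {
  approved: boolean;
  feedback?: string;
}

type Transition =
  | { kind: "advance"; artifact: HandoffArtifact }
  | { kind: "retry"; artifact: HandoffArtifact }
  | { kind: "circuit_breaker"; breaker: "CB2"; rejections: number };

function nextTransition(
  attempt: StepAttempt,
  review: ReviewOutcome,
  nextStep: string,
  maxRejections = 2,
): Transition {
  const base = { fromStep: attempt.step, output: attempt.output, context: attempt.context };
  if (review.approved) {
    return { kind: "advance", artifact: { ...base, toStep: nextStep, reviewFeedback: [], rejections: 0 } };
  }
  const rejections = attempt.rejections + 1;
  // More rejections than the threshold means the agent is not converging.
  if (rejections > maxRejections) return { kind: "circuit_breaker", breaker: "CB2", rejections };
  const reviewFeedback = review.feedback
    ? [...attempt.reviewFeedback, review.feedback]
    : attempt.reviewFeedback;
  // Loop back to the same step with the accumulated feedback attached.
  return { kind: "retry", artifact: { ...base, toStep: attempt.step, reviewFeedback, rejections } };
}
```

Modelling the transition as a tagged union keeps every outcome (advance, retry, escalate) explicit and loggable.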
Each agent step in a workflow has a configuration that defines how the agent operates.
The agent's identity and responsibilities within the workflow. Defined by a system prompt that establishes who the agent is, what it's responsible for, what it should and shouldn't do, and how it should handle uncertainty.
Roles in the product development harness:
Which model the agent uses. Configurable per agent, per step. Different steps may benefit from different models — a planning step might use a model with stronger reasoning, while a code generation step might use one optimised for coding.
The set of tools available to the agent. Tools are the agent's interface with the outside world — file system access, shell commands, API calls, browser automation, database queries, MCP servers.
Each tool has:
Tool permissions are governed by the trust tier. At supervised tier, destructive tools (write, execute, deploy) may require human approval before execution. At higher tiers, the agent can use them independently within its defined scope.
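One way to express tier-governed tool permissions is a single authorisation check in front of every tool call. This is a sketch, not the platform's API: `authorizeToolCall`, `AgentScope`, and the `approveDestructiveWhenSupervised` flag are hypothetical, with the flag standing in for the "may require" in the text.

```typescript
type TrustTier = "supervised" | "monitored" | "autonomous";
type ToolKind = "read" | "write" | "execute" | "deploy";
type Decision = "allow" | "needs_approval" | "deny";

// The destructive kinds named in the text.
const DESTRUCTIVE_KINDS: ReadonlySet<ToolKind> = new Set<ToolKind>(["write", "execute", "deploy"]);

interface ToolCall {
  tool: string;
  kind: ToolKind;
}

interface AgentScope {
  allowedTools: readonly string[];
  approveDestructiveWhenSupervised: boolean; // hypothetical per-harness setting
}

function authorizeToolCall(call: ToolCall, tier: TrustTier, scope: AgentScope): Decision {
  // Tools outside the agent's defined scope are refused at every tier.
  if (!scope.allowedTools.includes(call.tool)) return "deny";
  const destructive = DESTRUCTIVE_KINDS.has(call.kind);
  if (destructive && tier === "supervised" && scope.approveDestructiveWhenSupervised) {
    return "needs_approval";
  }
  return "allow";
}

const implementerScope: AgentScope = {
  allowedTools: ["read_file", "write_file", "run_shell"],
  approveDestructiveWhenSupervised: true,
};
```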
The information available to the agent for this step. This includes:
Knowledge base configuration determines what gets included in the agent's context and how. This is where the "give agents a map, not a manual" principle from OpenAI's work gets operationalised — the platform controls what information the agent sees and when.
The current trust level for this agent on this task type in this project. Trust tiers are:
Supervised: every output is reviewed by a human before it proceeds to the next step. The agent's work goes into a review queue. A human approves, requests changes, or rejects. The agent is building a track record.
Monitored: the agent operates independently and its output proceeds to the next step automatically. A configurable percentage of outputs (default 30%) is sampled for human review. If a sampled output fails review, the agent's consecutive success counter resets.
Autonomous: the agent operates independently. Human review happens only on escalation (circuit breaker) or exception. The agent has earned trust on this specific task type through a sustained track record.
Trust is earned per agent, per task type, per project. An agent might be autonomous on routine CRUD implementations but supervised on security-sensitive work, within the same project.
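Because trust is scoped to agent, task type, and project together, a lookup keyed on all three is a natural fit. The sketch below assumes a hypothetical in-memory `TrustRegistry`; the role and task type names are placeholders, and the `supervised` default mirrors the starting tier described for new projects.

```typescript
type TrustTier = "supervised" | "monitored" | "autonomous";

interface TrustKey {
  projectId: string;
  agentRole: string;
  taskType: string;
}

class TrustRegistry {
  private readonly tiers = new Map<string, TrustTier>();

  // Serialise the composite key; JSON avoids collisions from separators inside values.
  private static keyOf(key: TrustKey): string {
    return JSON.stringify([key.projectId, key.agentRole, key.taskType]);
  }

  // Anything without a recorded tier is treated as supervised.
  tierFor(key: TrustKey): TrustTier {
    return this.tiers.get(TrustRegistry.keyOf(key)) ?? "supervised";
  }

  setTier(key: TrustKey, tier: TrustTier): void {
    this.tiers.set(TrustRegistry.keyOf(key), tier);
  }
}

const registry = new TrustRegistry();
const crudWork: TrustKey = { projectId: "client-a", agentRole: "implementer", taskType: "crud-implementation" };
const securityWork: TrustKey = { ...crudWork, taskType: "security-sensitive" };

registry.setTier(crudWork, "autonomous");
console.log(registry.tierFor(crudWork));     // autonomous
console.log(registry.tierFor(securityWork)); // supervised
```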
Circuit breakers are the safety mechanism that catches failures before they propagate. They fire into a human review queue with full context, halting the workflow at that point until a human resolves the issue.
CB1: Agent uncertainty. The agent signals that it isn't confident in its output, or that the task requires information or judgment it doesn't have. The agent is instructed to escalate rather than guess.
CB2: Review rejection loop. A step has been rejected and re-attempted more than a configurable threshold (default: 2 rejections). The agent isn't converging on an acceptable output, and a human needs to intervene, whether by providing direction, adjusting the task, or handling it manually.
CB3: Test failure after fix attempt. Automated tests fail, the agent attempts a fix, and the tests still fail. The agent's fix didn't resolve the underlying issue, and further attempts risk introducing more problems.
CB4: Scope creep detection. The agent's output significantly exceeds or deviates from the expected scope defined in the task or sprint contract. Catches the tendency of agents to over-build or wander off-task.
CB5: Cost threshold. The run's token cost or duration exceeds a configurable threshold. Prevents runaway agent loops from burning through budget.
Each circuit breaker event is logged with: the breaker type, the trigger condition, the full agent trace leading up to the trigger, and the resolution (how the human handled it). Circuit breaker data feeds directly into the harness engineering cycle — high frequency on a specific breaker signals a failure mode to diagnose and fix.
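The breakers with mechanical trigger conditions can be sketched as checks over run state, with a frequency count for the harness engineering cycle. Everything here is an assumption for illustration: the `RunState` fields, the threshold names, and the omission of CB4, whose scope comparison needs a judgment this sketch does not attempt.

```typescript
type BreakerType = "CB1" | "CB2" | "CB3" | "CB4" | "CB5";

interface RunState {
  traceId: string;
  agentSignalledUncertainty: boolean; // CB1
  rejections: number;                 // CB2
  testsFailedAfterFix: boolean;       // CB3
  tokenCost: number;                  // CB5
  durationMs: number;                 // CB5
}

interface BreakerThresholds {
  maxRejections: number;
  maxTokenCost: number;
  maxDurationMs: number;
}

interface BreakerEvent {
  breaker: BreakerType;
  trigger: string;
  traceId: string;           // links to the full trace leading up to the trigger
  resolution: string | null; // filled in once a human resolves it
}

function checkBreakers(state: RunState, limits: BreakerThresholds): BreakerEvent[] {
  const fired: Array<[BreakerType, string]> = [];
  if (state.agentSignalledUncertainty) fired.push(["CB1", "agent escalated instead of guessing"]);
  if (state.rejections > limits.maxRejections) fired.push(["CB2", `${state.rejections} rejections`]);
  if (state.testsFailedAfterFix) fired.push(["CB3", "tests still failing after fix attempt"]);
  if (state.tokenCost > limits.maxTokenCost) fired.push(["CB5", `token cost ${state.tokenCost}`]);
  if (state.durationMs > limits.maxDurationMs) fired.push(["CB5", `duration ${state.durationMs}ms`]);
  return fired.map(([breaker, trigger]) => ({ breaker, trigger, traceId: state.traceId, resolution: null }));
}

// High counts on one breaker point at a failure mode worth diagnosing.
function breakerFrequency(events: readonly BreakerEvent[]): Map<BreakerType, number> {
  const counts = new Map<BreakerType, number>();
  for (const e of events) counts.set(e.breaker, (counts.get(e.breaker) ?? 0) + 1);
  return counts;
}
```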
Every agent run produces a complete trace: the sequence of the agent's reasoning steps, tool calls, inputs, outputs, and decisions. The trace viewer is the primary interface for understanding what an agent did and why.
A trace contains:
Traces are immutable and tamper-evident. They're the audit trail that makes accountability possible. When a question arises about why an agent made a particular decision, the trace provides the answer.
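The text does not say how tamper evidence is achieved. One common technique, shown here purely as an assumption, is a hash chain: each entry stores the hash of the previous entry, so editing any earlier entry invalidates everything after it. `TraceEntry`, `appendEntry`, and `verifyTrace` are hypothetical names; `createHash` is Node's built-in `node:crypto` API.

```typescript
import { createHash } from "node:crypto";

type EntryKind = "reasoning" | "tool_call" | "output";

interface TraceEntry {
  seq: number;
  kind: EntryKind;
  payload: string;
  prevHash: string;
  hash: string;
}

const GENESIS_HASH = "0".repeat(64);

function hashEntry(seq: number, kind: EntryKind, payload: string, prevHash: string): string {
  return createHash("sha256").update(JSON.stringify([seq, kind, payload, prevHash])).digest("hex");
}

// Appending returns a new array; existing entries are never modified.
function appendEntry(trace: readonly TraceEntry[], kind: EntryKind, payload: string): TraceEntry[] {
  const last = trace[trace.length - 1];
  const prevHash = last ? last.hash : GENESIS_HASH;
  const seq = trace.length;
  return [...trace, { seq, kind, payload, prevHash, hash: hashEntry(seq, kind, payload, prevHash) }];
}

// Recompute every hash and check each link back to the genesis value.
function verifyTrace(trace: readonly TraceEntry[]): boolean {
  let prevHash = GENESIS_HASH;
  let expectedSeq = 0;
  for (const entry of trace) {
    if (entry.seq !== expectedSeq || entry.prevHash !== prevHash) return false;
    if (entry.hash !== hashEntry(entry.seq, entry.kind, entry.payload, entry.prevHash)) return false;
    prevHash = entry.hash;
    expectedSeq += 1;
  }
  return true;
}

let exampleTrace: TraceEntry[] = [];
exampleTrace = appendEntry(exampleTrace, "reasoning", "plan the change");
exampleTrace = appendEntry(exampleTrace, "tool_call", "read_file src/index.ts");
exampleTrace = appendEntry(exampleTrace, "output", "patch proposed");
```

A chain like this only detects tampering if the latest hash is also recorded somewhere the writer cannot alter, such as the audit log.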
The review queue is where human operators interact with agent output that requires their attention. Items enter the review queue from:
Each review queue item shows:
Review decisions are logged and feed into the graduation system. Consecutive approvals build toward trust tier promotion. Rejections reset the consecutive success counter.
Real-time and historical view of all metrics from the metrics framework:
Filterable by project, harness, agent, task type, and time period. The dashboard is where operators monitor harness performance and identify areas for improvement.
A complete, immutable record of everything that happens on the platform. Every agent action, every human decision, every configuration change, every deployment. The audit log serves compliance, accountability, and debugging needs.
Graduation is the mechanism by which agents earn higher trust tiers. The platform evaluates eligibility automatically against explicit rules defined in the harness; the promotion itself requires human approval.
A graduation rule specifies:
When an agent meets all criteria for a task type, the platform proposes a graduation to the next trust tier. A human operator approves or defers the graduation. Graduation is never automatic without human sign-off — this is the accountability mechanism.
When an agent fails at a higher trust tier (output rejected, circuit breaker fired), the platform can automatically demote it back to the previous tier for that task type. Demotion is automatic; promotion requires human approval. This asymmetry is deliberate — it's easier to lose trust than to earn it.
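The asymmetry between promotion and demotion can be sketched as three state transitions. The single criterion used here (`requiredConsecutiveSuccesses`) is a hypothetical stand-in for whatever a graduation rule actually specifies, and resetting the streak after promotion is an assumption.

```typescript
type TrustTier = "supervised" | "monitored" | "autonomous";

interface GraduationRule {
  requiredConsecutiveSuccesses: number; // hypothetical criterion
}

interface TrackRecord {
  tier: TrustTier;
  consecutiveSuccesses: number;
  proposedTier: TrustTier | null; // awaiting human sign-off
}

function nextTier(tier: TrustTier): TrustTier | null {
  if (tier === "supervised") return "monitored";
  if (tier === "monitored") return "autonomous";
  return null;
}

function previousTier(tier: TrustTier): TrustTier {
  return tier === "autonomous" ? "monitored" : "supervised";
}

// An approved review builds the streak and may raise a proposal, but never changes the tier.
function recordApproval(record: TrackRecord, rule: GraduationRule): TrackRecord {
  const consecutiveSuccesses = record.consecutiveSuccesses + 1;
  const eligible = consecutiveSuccesses >= rule.requiredConsecutiveSuccesses;
  return {
    ...record,
    consecutiveSuccesses,
    proposedTier: eligible ? nextTier(record.tier) : record.proposedTier,
  };
}

// A rejection or fired circuit breaker demotes immediately, with no approval step.
function recordFailure(record: TrackRecord): TrackRecord {
  return { tier: previousTier(record.tier), consecutiveSuccesses: 0, proposedTier: null };
}

// Only a human operator's decision applies a proposed promotion.
function resolveProposal(record: TrackRecord, decision: "approve" | "defer"): TrackRecord {
  if (record.proposedTier === null || decision === "defer") return record;
  // Assumption: the streak restarts so the new tier builds its own track record.
  return { tier: record.proposedTier, consecutiveSuccesses: 0, proposedTier: null };
}
```

Keeping promotion in a separate function that requires an explicit decision makes it impossible for review outcomes alone to raise a tier.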
An operator creates a project by selecting a harness and providing the project context:
The project gets its own isolated instance of the harness definition. Trust tiers start at supervised for all agent roles and task types. The agents begin building their track record from zero.
The workflow supports continuous delivery natively. Tasks flow through the pipeline independently. Multiple tasks can be in-flight at different steps simultaneously. The platform manages dependencies between tasks (task B can't start until task A's output is available) and parallelism (tasks A and C can run concurrently if they don't depend on each other).
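Dependency handling and parallelism reduce to one question: which pending tasks have all their dependencies done? The sketch below answers it with a hypothetical `TaskNode` shape and `readyTasks` function, using the A, B, C example from the text.

```typescript
interface TaskNode {
  id: string;
  dependsOn: string[];
  status: "pending" | "in_flight" | "done";
}

// A task is ready when it is pending and every dependency has finished.
function readyTasks(tasks: readonly TaskNode[]): string[] {
  const done = new Set(tasks.filter((t) => t.status === "done").map((t) => t.id));
  return tasks
    .filter((t) => t.status === "pending" && t.dependsOn.every((dep) => done.has(dep)))
    .map((t) => t.id);
}

const pipeline: TaskNode[] = [
  { id: "A", dependsOn: [], status: "pending" },
  { id: "B", dependsOn: ["A"], status: "pending" },
  { id: "C", dependsOn: [], status: "pending" },
];

console.log(readyTasks(pipeline)); // [ 'A', 'C' ]
```

A and C can start together; B becomes ready only once A is marked done.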
Deployment steps in the workflow integrate with CI/CD pipelines. An agent can trigger a deployment, and the platform captures the deployment outcome as part of the run record. Post-deploy monitoring can feed back into the workflow — a production error creates a new task that enters the pipeline.
T&F runs the platform for its own client work. T&F is the organisation. Client projects are projects within that organisation. T&F's operators manage the harnesses, review agent output, and make graduation decisions. Clients see their project outcomes; T&F manages the system.
A customer deploys the platform within their own organisation. They configure their own harnesses (potentially starting from T&F's templates), manage their own agent teams, and control their own trust tier progression. T&F provides the platform, the harness templates, and advisory support. The customer operates it.
The platform supports this through standard multi-tenancy:
The platform is built in TypeScript end-to-end. Python is available if and when a specific capability requires it (ML tooling, certain data science libraries), but TypeScript is the default until proven unfit for a given purpose.
Core: TypeScript, Node.js, PostgreSQL, Redis (for queues and caching). API layer in Express or Fastify.
Agent execution: Claude API (primary), with model provider abstraction for flexibility. Anthropic TypeScript SDK for agent orchestration. Agent execution runs server-side with structured async workflows.
Observability: Structured logging to PostgreSQL. Trace storage with efficient querying for the trace viewer. Metrics aggregation for the dashboard.
Frontend: React with TypeScript (for the operator UI — review queue, trace viewer, metrics dashboard, harness configuration). Shared type definitions between frontend and backend.
Deployment: Docker containers, deployable to any cloud provider. Customer deployments can be self-hosted or T&F-managed.
Integration: HarnessEval (the open-source eval framework) is built in TypeScript. Every run automatically produces a run record compatible with HarnessEval's format. Eval suites can be run against the platform directly.
Monorepo structure: Single TypeScript monorepo with shared types, ensuring the entity model (harness, workflow, step, agent config, task, run, trace) is defined once and used everywhere — API, frontend, agent execution, and eval framework.
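Defining the entity model once might look like the sketch below, which in a monorepo would live in a shared package and be imported by the API, frontend, agent execution, and eval code alike. The field names, the placeholder model identifier, and the assumption that hybrid steps carry an agent config are all illustrative.

```typescript
type TrustTier = "supervised" | "monitored" | "autonomous";

interface AgentConfig {
  role: string;
  model: string;
  tools: string[];
  trustTier: TrustTier;
}

// The discriminated union means a human step cannot carry an agent config by mistake.
type Step =
  | { id: string; type: "human" }
  | { id: string; type: "agent" | "hybrid"; agent: AgentConfig };

interface Workflow {
  id: string;
  steps: Step[];
}

// One helper, usable unchanged by any package that imports the shared types.
function agentConfigs(workflow: Workflow): AgentConfig[] {
  const configs: AgentConfig[] = [];
  for (const step of workflow.steps) {
    if (step.type !== "human") configs.push(step.agent);
  }
  return configs;
}

const exampleWorkflow: Workflow = {
  id: "product-development",
  steps: [
    { id: "scope", type: "human" },
    {
      id: "implement",
      type: "agent",
      agent: { role: "implementer", model: "model-id-placeholder", tools: ["read_file"], trustTier: "supervised" },
    },
  ],
};
```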
For T&F delivering client work: a structured way to shift execution from human to agent while maintaining quality and accountability. Every project produces data that improves the harness. The more work we do, the better the system gets.
For T&F as a business: a deployable product that any organisation can use to stand up AI agent teams with the observability, auditability, and trust mechanisms they need to actually rely on them.
For the industry: a concrete implementation of accountability architecture — the system that answers "who's responsible when something goes wrong?" at every point in the workflow, from supervised through autonomous.
For the Anthropic application: a production multi-agent system with graduated autonomy, quantitative evaluation, and real client deployment — built on Claude, instrumented with HarnessEval, and demonstrated with data.