Thought & Function — What We're Building

What this is

The Harness Platform is T&F's core product. It serves three purposes that are really one purpose at different scales.

First, it's where we design harnesses — define workflows, configure which steps are human and which are agent, set up the agents and their tools, and establish the trust controls that govern how much autonomy agents have.

Second, it's where work actually happens. Client projects run through harnesses. Agents execute tasks. Humans review outputs. Circuit breakers fire when something goes wrong. Every action is logged, every outcome is measured, and the data feeds directly back into improving the harness. The delivery tool and the improvement tool are the same thing.

Third, it's a product we can deploy for customers. An organisation stands up the platform, configures harnesses for their domains, deploys agent teams, and manages the transition from human-driven to AI-driven operations at their own pace, with full visibility and control at every point.

These aren't three separate products. They're the same platform at different levels of maturity and different points of deployment.

Design principles

Accountability is structural, not aspirational.Every piece of work that moves through the platform has a clear chain of responsibility at every point. The platform doesn't just allow human oversight — it requires it at levels proportional to the trust the agents have earned. This is what makes it possible to be accountable for agent output.

Graduated autonomy is native.Trust tiers aren't a feature bolted onto an agent runner. They're the core mechanism. Agents start supervised and earn autonomy by proving themselves on specific task types. The platform tracks this automatically and enforces it structurally.

Observability is not optional. Full traces of agent reasoning, tool calls, inputs, outputs, and decisions are captured for every run. This serves three purposes: auditability (proving what happened), debugging (understanding why something went wrong), and improvement (feeding data back into the harness engineering cycle).

The flywheel is architectural.Using the platform to deliver work produces the data that improves the platform. This isn't a nice side effect — it's a design requirement. The instrumentation, metrics capture, and feedback loops exist to make the flywheel turn.

Human-in-the-loop is a first-class concept. Review queues, approval workflows, escalation paths, and intervention points are core platform primitives, not afterthoughts. The platform is designed around the assumption that humans need to stay in the loop — and that the degree of their involvement should decrease over time as trust is earned, not as a configuration choice.

Platform architecture

Entities

Organisation
└── Harness (domain-specific workflow definition)
    └── Workflow
        └── Step (human | agent | hybrid)
            └── Agent Config (role, model, tools, knowledge, trust tier)
    └── Circuit Breakers
    └── Graduation Rules
    └── Evaluation Criteria
└── Project (client engagement)
    └── Task (unit of work)
        └── Run (execution of task through workflow)
            └── Trace (complete audit log)
            └── Review (human evaluation of output)
            └── Metrics (captured automatically)

Organisation

The top-level tenant. For T&F, this is T&F itself — managing harnesses and client projects. For a deployed customer, this is their organisation — running their own harnesses with their own teams.

An organisation contains harness definitions, projects, users (human operators), and platform configuration (model provider API keys, tool integrations, notification settings).

Harness

A harness is a workflow definition for a specific domain. Product development, marketing operations, financial analysis — each gets its own harness. A harness contains:

Workflow — the sequence of steps that work moves through
Circuit breakers — the conditions that halt execution and escalate to humans
Graduation rules — the criteria for agents to earn higher trust tiers on specific task types
Evaluation criteria — the rubrics used to assess agent output quality

Harness definitions follow the shared definition / isolated execution model. The definition (workflow, agent prompts, circuit breaker rules, graduation criteria) is version-controlled and shared. Each project that uses the harness gets its own isolated instance (codebase, state, trust tier positions, run history).

Harness definitions are versioned. Changes to a harness produce a new version. Historical runs are always linked to the harness version they ran against, so you can trace how harness changes affect outcomes.

Workflow

A workflow is an ordered set of steps that a task moves through. Each step has a type:

Human— a person performs this step entirely. The platform provides the interface, captures the output, and logs the time spent. Human steps exist for work that agents aren't ready to handle, or for decisions that require human judgment regardless of agent capability.

Agent — an AI agent performs this step. The platform executes the agent, captures the full trace, and routes the output according to the trust tier rules. At supervised tier, the output goes to a human review queue before proceeding. At monitored tier, a percentage of outputs are sampled for review. At autonomous tier, the output proceeds unless a circuit breaker fires.

Hybrid— a human and an agent collaborate on this step. The agent produces a draft or recommendation, and a human refines, approves, or redirects it. This is the default for steps in early harness development, and it's the natural mode for steps where agent capability is improving but not yet trusted.

Each step defines:

What it expects as input (from the previous step or from the project context)
What it produces as output (passed to the next step or to a review queue)
Which agent config handles it (for agent and hybrid steps)
What review rules apply (governed by trust tier)

Step transitions

The way work transitions from one step to the next is important and visible in the platform, because transitions are where context gets lost or handed off poorly — the exact problem both Anthropic and OpenAI identified.

Each transition defines a handoff artifact: a structured document that carries the output of the previous step plus the context the next step needs to pick up the work cleanly. Handoff artifacts are explicit, logged, and inspectable. They're the structured equivalent of Anthropic's context reset handoffs — a clean slate with enough state to continue.

When a step's output fails review, the transition loops back to the same step with the review feedback attached. The platform tracks how many times a step loops and can trigger a circuit breaker if the loop exceeds a threshold (CB2: review rejection loop).

Agent configuration

Each agent step in a workflow has a configuration that defines how the agent operates.

Role

The agent's identity and responsibilities within the workflow. Defined by a system prompt that establishes who the agent is, what it's responsible for, what it should and shouldn't do, and how it should handle uncertainty.

Roles in the product development harness:

PM — decomposes requirements into tasks, writes acceptance criteria, prioritises work
Developer — implements tasks, writes code, runs tests
Reviewer — reviews code and output for quality, convention adherence, and correctness
QA — tests the running application, files bugs, verifies acceptance criteria

Model

Which model the agent uses. Configurable per agent, per step. Different steps may benefit from different models — a planning step might use a model with stronger reasoning, while a code generation step might use one optimised for coding.

Tools

The set of tools available to the agent. Tools are the agent's interface with the outside world — file system access, shell commands, API calls, browser automation, database queries, MCP servers.

Each tool has:

A permission level — what the tool is allowed to do (read-only, read-write, execute)
A scope — what the tool can access (specific directories, specific APIs, specific databases)
An audit hook — every tool invocation is logged with inputs, outputs, and the agent's stated reasoning for using the tool

Tool permissions are governed by the trust tier. At supervised tier, destructive tools (write, execute, deploy) may require human approval before execution. At higher tiers, the agent can use them independently within its defined scope.

Knowledge base

The information available to the agent for this step. This includes:

Project context — the codebase, documentation, and history for this specific project
Harness context — the conventions, patterns, and standards defined in the harness
Step context — the handoff artifact from the previous step, plus any review feedback from prior loops
Domain knowledge — reference material relevant to the task type

Knowledge base configuration determines what gets included in the agent's context and how. This is where the "give agents a map, not a manual" principle from OpenAI's work gets operationalised — the platform controls what information the agent sees and when.

Trust tier

The current trust level for this agent on this task type in this project. Trust tiers are:

Supervised— every output is reviewed by a human before it proceeds to the next step. The agent's work goes into a review queue. A human approves, requests changes, or rejects. The agent is building a track record.

Monitored— the agent operates independently and its output proceeds to the next step automatically. A configurable percentage of outputs (default 30%) are sampled for human review. If a sampled output fails review, the agent's consecutive success counter resets.

Autonomous — the agent operates independently. Human review happens only on escalation (circuit breaker) or exception. The agent has earned trust on this specific task type through a sustained track record.

Trust is earned per agent, per task type, per project. An agent might be autonomous on routine CRUD implementations but supervised on security-sensitive work, within the same project.

Circuit breakers

Circuit breakers are the safety mechanism that catches failures before they propagate. They fire into a human review queue with full context, halting the workflow at that point until a human resolves the issue.

CB1: Agent uncertainty.The agent signals that it isn't confident in its output, or that the task requires information or judgment it doesn't have. The agent is instructed to escalate rather than guess.

CB2: Review rejection loop.A step has been rejected and re-attempted more than a configurable threshold (default: 2 rejections). The agent isn't converging on an acceptable output, and a human needs to intervene — either by providing direction, adjusting the task, or handling it manually.

CB3: Test failure after fix attempt.Automated tests fail, the agent attempts a fix, and the tests still fail. The agent's fix didn't resolve the underlying issue, and further attempts risk introducing more problems.

CB4: Scope creep detection.The agent's output significantly exceeds or deviates from the expected scope defined in the task or sprint contract. Catches the tendency of agents to over-build or wander off-task.

CB5: Cost threshold.The run's token cost or duration exceeds a configurable threshold. Prevents runaway agent loops from burning through budget.

Each circuit breaker event is logged with: the breaker type, the trigger condition, the full agent trace leading up to the trigger, and the resolution (how the human handled it). Circuit breaker data feeds directly into the harness engineering cycle — high frequency on a specific breaker signals a failure mode to diagnose and fix.

Observability

Trace viewer

Every agent run produces a complete trace: the sequence of the agent's reasoning steps, tool calls, inputs, outputs, and decisions. The trace viewer is the primary interface for understanding what an agent did and why.

A trace contains:

Reasoning steps — the agent's internal chain of thought at each point
Tool calls — every tool invocation with inputs, outputs, and the reasoning that led to it
Context snapshots — what information was available to the agent at each decision point
Output artifacts — what the agent produced (code, documents, plans, reviews)
Timing and cost — token counts, latency, and cost for each step

Traces are immutable and tamper-evident. They're the audit trail that makes accountability possible. When a question arises about why an agent made a particular decision, the trace provides the answer.

Review queue

The review queue is where human operators interact with agent output that requires their attention. Items enter the review queue from:

Trust tier reviews (supervised: all outputs, monitored: sampled outputs)
Circuit breaker escalations
Agent-initiated escalations (CB1: uncertainty)

Each review queue item shows:

The task and step context
The agent's output
The agent's trace (collapsed by default, expandable)
The acceptance criteria or evaluation rubric for this step
Actions: approve, request changes (with feedback), reject, take over manually

Review decisions are logged and feed into the graduation system. Consecutive approvals build toward trust tier promotion. Rejections reset the consecutive success counter.

Metrics dashboard

Real-time and historical view of all metrics from the metrics framework:

Core metrics: prompts per task (planned/unplanned), HITL steps per task (planned/unplanned)
Quality: first-pass approval rate, circuit breaker frequency, post-deploy defect rate
Cost: cost per task, human time per task
Improvement: trust tier distribution, consecutive success streaks, unplanned intervention rate over time

Filterable by project, harness, agent, task type, and time period. The dashboard is where operators monitor harness performance and identify areas for improvement.

Audit log

A complete, immutable record of everything that happens on the platform. Every agent action, every human decision, every configuration change, every deployment. The audit log serves compliance, accountability, and debugging needs.

Graduation system

Graduation is the mechanism by which agents earn higher trust tiers. It's automatic but governed by explicit rules defined in the harness.

A graduation rule specifies:

Agent role — which agent this rule applies to
Task type — which kind of task the rule covers (trust is earned per task type)
Required streak — number of consecutive first-pass approvals needed
Required volume — minimum number of tasks completed at the current tier
Quality threshold — minimum rubric scores required
Circuit breaker constraint — maximum allowed circuit breaker events in the evaluation window

When an agent meets all criteria for a task type, the platform proposes a graduation to the next trust tier. A human operator approves or defers the graduation. Graduation is never automatic without human sign-off — this is the accountability mechanism.

When an agent fails at a higher trust tier (output rejected, circuit breaker fired), the platform can automatically demote it back to the previous tier for that task type. Demotion is automatic; promotion requires human approval. This asymmetry is deliberate — it's easier to lose trust than to earn it.

Project execution

Creating a project

An operator creates a project by selecting a harness and providing the project context:

Repository or codebase (for product development)
Project documentation and requirements
External integrations (CI/CD, deployment targets, communication channels)
Team configuration (which human operators are involved and in what roles)

The project gets its own isolated instance of the harness definition. Trust tiers start at supervised for all agent roles and task types. The agents begin building their track record from zero.

Task flow

A task is created — either by a human operator, by the PM agent (if trusted to do so), or from an external source (issue tracker, backlog tool).
The task enters the workflow at the first step.
At each step, the appropriate actor (human, agent, or hybrid) performs the work.
For agent steps, the platform executes the agent with full instrumentation, captures the trace, and routes the output according to trust tier rules.
If the output passes review (or doesn't require review at the current trust tier), it transitions to the next step via the handoff artifact.
If the output fails review, it loops back to the same step with feedback attached.
If a circuit breaker fires, the task enters the review queue for human resolution.
When the task completes all workflow steps, it's marked as done. The run record is finalised with all metrics.

Continuous delivery

The workflow supports continuous delivery natively. Tasks flow through the pipeline independently. Multiple tasks can be in-flight at different steps simultaneously. The platform manages dependencies between tasks (task B can't start until task A's output is available) and parallelism (tasks A and C can run concurrently if they don't depend on each other).

Deployment steps in the workflow integrate with CI/CD pipelines. An agent can trigger a deployment, and the platform captures the deployment outcome as part of the run record. Post-deploy monitoring can feed back into the workflow — a production error creates a new task that enters the pipeline.

Multi-tenant deployment

T&F's deployment

T&F runs the platform for its own client work. T&F is the organisation. Client projects are projects within that organisation. T&F's operators manage the harnesses, review agent output, and make graduation decisions. Clients see their project outcomes; T&F manages the system.

Customer deployment

A customer deploys the platform within their own organisation. They configure their own harnesses (potentially starting from T&F's templates), manage their own agent teams, and control their own trust tier progression. T&F provides the platform, the harness templates, and advisory support. The customer operates it.

The platform supports this through standard multi-tenancy:

Isolated data per organisation
Configurable model provider integrations (customer uses their own API keys)
Configurable tool integrations (customer's own repos, CI/CD, deployment targets)
Role-based access control (who can configure harnesses, who can approve graduations, who can view traces)

Technology

The platform is built in TypeScript end-to-end. Python is available if and when a specific capability requires it (ML tooling, certain data science libraries), but TypeScript is the default until proven unfit for a given purpose.

Core: TypeScript, Node.js, PostgreSQL, Redis (for queues and caching). API layer in Express or Fastify.

Agent execution: Claude API (primary), with model provider abstraction for flexibility. Anthropic TypeScript SDK for agent orchestration. Agent execution runs server-side with structured async workflows.

Observability: Structured logging to PostgreSQL. Trace storage with efficient querying for the trace viewer. Metrics aggregation for the dashboard.

Frontend: React with TypeScript (for the operator UI — review queue, trace viewer, metrics dashboard, harness configuration). Shared type definitions between frontend and backend.

Deployment:Docker containers, deployable to any cloud provider. Customer deployments can be self-hosted or T&F-managed.

Integration:HarnessEval (the open-source eval framework) is built in TypeScript. Every run automatically produces a run record compatible with HarnessEval's format. Eval suites can be run against the platform directly.

Monorepo structure: Single TypeScript monorepo with shared types, ensuring the entity model (harness, workflow, step, agent config, task, run, trace) is defined once and used everywhere — API, frontend, agent execution, and eval framework.

What this enables

For T&F delivering client work: a structured way to shift execution from human to agent while maintaining quality and accountability. Every project produces data that improves the harness. The more work we do, the better the system gets.

For T&F as a business: a deployable product that any organisation can use to stand up AI agent teams with the observability, auditability, and trust mechanisms they need to actually rely on them.

For the industry: a concrete implementation of accountability architecture — the system that answers "who's responsible when something goes wrong?" at every point in the workflow, from supervised through autonomous.

For the Anthropic application: a production multi-agent system with graduated autonomy, quantitative evaluation, and real client deployment — built on Claude, instrumented with HarnessEval, and demonstrated with data.

T&F Harness Platform