HarnessEval is a framework for measuring how well an AI agent harness performs — and whether it's getting better over time.
The AI industry has mature benchmarks for models. It has nothing equivalent for harnesses. This matters because recent work from Anthropic, OpenAI, and others has shown that harness design often has a larger impact on production outcomes than model choice. LangChain jumped from 52.8% to 66.5% on TerminalBench 2.0 by changing the harness alone — same model, dramatically better results. But there's no standardised way to measure that improvement, reproduce it, or compare harness configurations against each other.
HarnessEval fills that gap. It provides a task specification format, an instrumentation layer, a scoring methodology, and a comparison mechanism — everything needed to evaluate any harness, on any domain, and track its improvement over time.
The framework is open-source. The harnesses you evaluate with it are yours.
Harness-agnostic. The framework evaluates any harness: multi-agent or single-agent, any model provider, any orchestration approach. It doesn't care how your harness is built. It cares what it produces.
Metric-driven. Every evaluation produces structured, quantitative data. Subjective assessments are captured through rubric-based scoring with explicit criteria, not open-ended judgment.
Reproducible. The same task spec, run through the same harness config, should produce comparable results. The framework captures enough context to make runs meaningfully reproducible, while acknowledging that LLM non-determinism means exact reproduction isn't possible.
Temporal. A single eval run tells you where you are. A series of runs over time tells you whether you're improving. The framework is built for longitudinal tracking, not one-off benchmarks.
Minimal instrumentation burden. Integrating HarnessEval into an existing harness should take hours, not days. The instrumentation layer hooks into common patterns (agent calls, human review points, tool invocations) with lightweight adapters.
A task spec defines what the harness is being asked to do. It's a structured document that contains everything needed to reproduce and evaluate a run.
id: "task-2026-04-07-001"
domain: "product-dev"
description: "Implement user authentication with email/password
  and OAuth, including signup, login, password reset,
  and session management."
complexity: "medium"
inputs:
  codebase: "ref:repo/starter-template-v2"
  requirements: "ref:specs/auth-requirements.md"
acceptance_criteria:
  - id: "ac-01"
    description: "User can sign up with email and password"
    type: "functional"
    verification: "automated"
  - id: "ac-02"
    description: "OAuth flow completes without error
      for Google and GitHub providers"
    type: "functional"
    verification: "automated"
  - id: "ac-03"
    description: "Password reset email sends and
      token validates correctly"
    type: "functional"
    verification: "automated"
  - id: "ac-04"
    description: "Session persists across page reloads
      and expires after configured timeout"
    type: "functional"
    verification: "automated"
  - id: "ac-05"
    description: "Code follows project conventions
      and passes existing linter rules"
    type: "quality"
    verification: "rubric"
tags: ["auth", "backend", "security"]
estimated_complexity_hours: 8
Task specs can be functional (verifiable with automated tests), qualitative (scored against a rubric), or mixed. The framework ships with a starter library of task specs across common domains, and the spec format is extensible for custom domains.
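To make the spec format concrete, here is a minimal TypeScript sketch of how a parsed task spec might be typed, sanity-checked, and classified as functional, qualitative, or mixed. The names `TaskSpecData`, `validateTaskSpec`, and `classifyTaskSpec` are hypothetical and are not part of the HarnessEval API; the fields simply mirror the YAML above.

```typescript
// Hypothetical shape of a parsed task spec, mirroring the YAML fields above.
interface AcceptanceCriterion {
  id: string;
  description: string;
  type: "functional" | "quality";
  verification: "automated" | "rubric";
}

interface TaskSpecData {
  id: string;
  domain: string;
  description: string;
  complexity: string;
  acceptance_criteria: AcceptanceCriterion[];
  tags: string[];
}

// Returns a list of problems; an empty list means the spec looks usable.
function validateTaskSpec(spec: TaskSpecData): string[] {
  const problems: string[] = [];
  if (spec.acceptance_criteria.length === 0) {
    problems.push("spec has no acceptance criteria");
  }
  const seen = new Set<string>();
  for (const criterion of spec.acceptance_criteria) {
    if (seen.has(criterion.id)) {
      problems.push(`duplicate criterion id: ${criterion.id}`);
    }
    seen.add(criterion.id);
  }
  return problems;
}

// Classifies a spec by the verification methods its criteria use.
// Assumption: a spec with no criteria is reported as "functional" here;
// validateTaskSpec flags that case separately.
function classifyTaskSpec(
  spec: TaskSpecData
): "functional" | "qualitative" | "mixed" {
  const kinds = new Set(spec.acceptance_criteria.map((c) => c.verification));
  if (kinds.size > 1) return "mixed";
  return kinds.has("rubric") ? "qualitative" : "functional";
}
```

Applied to the authentication spec above (four automated criteria and one rubric criterion), `classifyTaskSpec` would return `"mixed"`.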
A run record is the complete output of evaluating a harness against a task spec. It captures everything that happened during the run.
run_id: "run-20260407-143022"
task_id: "task-2026-04-07-001"
harness_config:
  name: "tf-product-dev-harness"
  version: "0.3.1"
  agents: ["planner", "developer", "reviewer", "qa"]
  model: "claude-opus-4-6"
  trust_tier: "supervised"
timestamp: "2026-04-07T14:30:22Z"
duration_seconds: 2847
cost_usd: 14.32
outcome: "pass"
metrics:
  prompts_total: 12
  prompts_planned: 4
  prompts_unplanned: 8
  hitl_total: 6
  hitl_planned: 3
  hitl_unplanned: 3
  first_pass_approval: false
  circuit_breakers_fired: 1
  circuit_breaker_types: ["CB2:review-rejection"]
  acceptance_criteria_passed: 5
  acceptance_criteria_total: 5
  tokens_in: 284000
  tokens_out: 127000
scores:
  functional_completeness: 1.0
  code_quality: 0.82
  convention_adherence: 0.91
  overall: 0.88
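A run record is only useful if its counters are internally consistent. The sketch below shows one way to sanity-check the `metrics` block. It assumes that planned plus unplanned counts should equal the totals, which holds for the sample record but is not stated as a rule by the framework. `RunMetrics`, `checkRunMetrics`, and `unplannedShare` are hypothetical names.

```typescript
// Hypothetical TypeScript view of part of the `metrics` block in a run record.
interface RunMetrics {
  prompts_total: number;
  prompts_planned: number;
  prompts_unplanned: number;
  hitl_total: number;
  hitl_planned: number;
  hitl_unplanned: number;
  acceptance_criteria_passed: number;
  acceptance_criteria_total: number;
}

// Assumption: planned + unplanned equals the total for prompts and HITL steps.
function checkRunMetrics(m: RunMetrics): string[] {
  const issues: string[] = [];
  if (m.prompts_planned + m.prompts_unplanned !== m.prompts_total) {
    issues.push("prompt counts do not add up to prompts_total");
  }
  if (m.hitl_planned + m.hitl_unplanned !== m.hitl_total) {
    issues.push("HITL counts do not add up to hitl_total");
  }
  if (m.acceptance_criteria_passed > m.acceptance_criteria_total) {
    issues.push("more criteria passed than exist");
  }
  return issues;
}

// Share of interactions that were unplanned (0 when there were none at all).
function unplannedShare(unplanned: number, total: number): number {
  return total === 0 ? 0 : unplanned / total;
}

// Values copied from the sample run record above.
const sampleMetrics: RunMetrics = {
  prompts_total: 12,
  prompts_planned: 4,
  prompts_unplanned: 8,
  hitl_total: 6,
  hitl_planned: 3,
  hitl_unplanned: 3,
  acceptance_criteria_passed: 5,
  acceptance_criteria_total: 5,
};

console.log(checkRunMetrics(sampleMetrics).length); // prints 0
console.log(unplannedShare(sampleMetrics.hitl_unplanned, sampleMetrics.hitl_total)); // prints 0.5
```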
An eval suite is a collection of task specs designed to test a harness across a range of difficulties and task types within a domain. Running an eval suite produces a set of run records that, taken together, characterise the harness's current performance.
suite_id: "product-dev-core-v1"
domain: "product-dev"
description: "Core evaluation suite for product
  development harnesses."
tasks:
  - ref: "tasks/auth-implementation.yaml"
    weight: 1.0
  - ref: "tasks/crud-api.yaml"
    weight: 0.8
  - ref: "tasks/frontend-dashboard.yaml"
    weight: 1.0
  - ref: "tasks/bug-fix-regression.yaml"
    weight: 0.6
  - ref: "tasks/refactor-extract-service.yaml"
    weight: 0.8
  - ref: "tasks/ci-pipeline-setup.yaml"
    weight: 0.5
scoring:
  method: "weighted_average"
  pass_threshold: 0.75
A comparison runs the same eval suite (or task spec) through two or more harness configurations and produces a structured diff.
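The `weighted_average` method and `pass_threshold` in the suite definition could be computed roughly as follows. This sketch assumes the threshold applies to the suite-level weighted average rather than to each task individually; the source does not say which. `TaskResult`, `suiteScore`, and `suitePasses` are hypothetical names.

```typescript
// One task's result within a suite run. `overall` is the run's overall score (0 to 1).
interface TaskResult {
  ref: string;
  weight: number;
  overall: number;
}

// Weighted average of per-task overall scores, using the suite's task weights.
function suiteScore(results: TaskResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight <= 0) {
    throw new Error("suite needs at least one task with positive weight");
  }
  const weightedSum = results.reduce((sum, r) => sum + r.weight * r.overall, 0);
  return weightedSum / totalWeight;
}

// Assumption: the pass threshold is compared against the suite-level score.
function suitePasses(results: TaskResult[], passThreshold: number): boolean {
  return suiteScore(results) >= passThreshold;
}
```

For example, two tasks with weights 1.0 and 0.5 and illustrative scores of 0.9 and 0.6 give (0.9 + 0.3) / 1.5 = 0.8, which clears a threshold of 0.75.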
Comparison: tf-harness-v0.3.1 vs tf-harness-v0.4.0
Suite: product-dev-core-v1
---
                            v0.3.1    v0.4.0    delta
prompts_unplanned (avg)     8.2       5.1       -37.8%
hitl_unplanned (avg)        3.4       1.8       -47.1%
first_pass_approval_rate    0.42      0.67      +59.5%
cost_per_task (avg)         $14.32    $11.87    -17.1%
human_time_min (avg)        22.4      13.1      -41.5%
overall_score               0.78      0.86      +10.3%
This is the output that proves the harness engineering cycle is working. Each iteration of the cycle should produce a comparison that shows improvement on the metrics that matter.
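The delta column in the comparison above is a relative change against the older configuration. A minimal sketch of that arithmetic, with the hypothetical helper names `percentDelta`, `formatDelta`, and `isImprovement`:

```typescript
// Relative change from baseline to candidate, in percent.
function percentDelta(baseline: number, candidate: number): number {
  if (baseline === 0) throw new RangeError("baseline must be non-zero");
  return ((candidate - baseline) / baseline) * 100;
}

// Formats the change with an explicit sign and one decimal place.
function formatDelta(baseline: number, candidate: number): string {
  const delta = percentDelta(baseline, candidate);
  return `${delta >= 0 ? "+" : ""}${delta.toFixed(1)}%`;
}

// Whether a change is an improvement depends on the metric's direction:
// unplanned prompts and cost should fall, approval rate and score should rise.
function isImprovement(
  baseline: number,
  candidate: number,
  lowerIsBetter: boolean
): boolean {
  return lowerIsBetter ? candidate < baseline : candidate > baseline;
}

console.log(formatDelta(8.2, 5.1));   // prints -37.8%
console.log(formatDelta(0.42, 0.67)); // prints +59.5%
console.log(formatDelta(0.78, 0.86)); // prints +10.3%
```

A negative delta is good news for `prompts_unplanned` and bad news for `overall_score`, which is why the direction is passed explicitly instead of being inferred from the sign.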
A timeline tracks the same harness (or harness lineage) across multiple eval suite runs over time. This is the longitudinal view that shows the learning curve.
Timeline: tf-product-dev-harness
Suite: product-dev-core-v1
---
Date          Version    Unplanned HITL    FPA     Score
2026-04-07    v0.2.0     11.3              0.28    0.64
2026-04-21    v0.3.0     8.7               0.39    0.74
2026-05-05    v0.3.1     8.2               0.42    0.78
2026-05-19    v0.4.0     5.1               0.67    0.86
2026-06-02    v0.5.0     3.3               0.78    0.91
This is the table you put in the blog post. This is the chart you show in the Anthropic application. Declining unplanned interventions, rising first-pass approval, improving scores — over time, with real data.
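The claim that a timeline shows steady improvement can be checked mechanically. The sketch below encodes the timeline rows above and tests whether every run beats the previous one on all three columns. `TimelinePoint` and `isSteadilyImproving` are hypothetical names, and requiring strict improvement on every metric is an assumption; a real check might tolerate small regressions.

```typescript
// One row of a timeline; field names are illustrative.
interface TimelinePoint {
  date: string;
  version: string;
  unplannedHitl: number; // lower is better
  fpa: number;           // first-pass approval rate, higher is better
  score: number;         // overall score, higher is better
}

// True when each run strictly improves on the previous run on all three metrics.
function isSteadilyImproving(points: TimelinePoint[]): boolean {
  let prev: TimelinePoint | undefined;
  for (const curr of points) {
    if (
      prev &&
      !(curr.unplannedHitl < prev.unplannedHitl &&
        curr.fpa > prev.fpa &&
        curr.score > prev.score)
    ) {
      return false;
    }
    prev = curr;
  }
  return true;
}

// Rows copied from the timeline above.
const timeline: TimelinePoint[] = [
  { date: "2026-04-07", version: "v0.2.0", unplannedHitl: 11.3, fpa: 0.28, score: 0.64 },
  { date: "2026-04-21", version: "v0.3.0", unplannedHitl: 8.7, fpa: 0.39, score: 0.74 },
  { date: "2026-05-05", version: "v0.3.1", unplannedHitl: 8.2, fpa: 0.42, score: 0.78 },
  { date: "2026-05-19", version: "v0.4.0", unplannedHitl: 5.1, fpa: 0.67, score: 0.86 },
  { date: "2026-06-02", version: "v0.5.0", unplannedHitl: 3.3, fpa: 0.78, score: 0.91 },
];

console.log(isSteadilyImproving(timeline)); // prints true
```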
HarnessEval captures metrics by wrapping the key interaction points in a harness run. The instrumentation layer provides adapters for common patterns.
Agent calls. Every invocation of an agent (prompt in, response out) is logged with timestamp, token counts, agent role, and task context. This is the raw data for prompts-per-task.
Human touchpoints. Every point where a human reviews, edits, approves, rejects, or otherwise interacts with the output. Each touchpoint is tagged as planned (required by the trust tier or harness design) or unplanned (triggered by a failure or escalation). This is the raw data for HITL-steps-per-task.
Circuit breaker events. Every circuit breaker firing, with the breaker type, trigger condition, and resolution.
Acceptance criteria checks. Each acceptance criterion evaluated, with pass/fail and any automated test output or rubric scores.
Cost and timing. Token usage, API costs, wall-clock duration, and human time.
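The capture points listed above can be pictured as a stream of typed events that is folded into run-record metrics. The sketch below is one possible shape, not the framework's actual data model: `HarnessEvent`, `EventSummary`, and `summariseEvents` are hypothetical, and tagging agent calls as planned or unplanned is an assumption made so that `prompts_unplanned` can be derived.

```typescript
// Hypothetical event shapes for the capture points described above.
type HarnessEvent =
  | { kind: "agent_call"; agentRole: string; tokensIn: number; tokensOut: number; planned: boolean }
  | { kind: "human_touchpoint"; action: "review" | "edit" | "approve" | "reject"; planned: boolean }
  | { kind: "circuit_breaker"; breakerType: string };

// Subset of run-record metrics that can be derived from the event stream.
interface EventSummary {
  prompts_total: number;
  prompts_unplanned: number;
  hitl_total: number;
  hitl_unplanned: number;
  circuit_breakers_fired: number;
  tokens_in: number;
  tokens_out: number;
}

// Folds a list of events into counters.
function summariseEvents(events: HarnessEvent[]): EventSummary {
  const summary: EventSummary = {
    prompts_total: 0,
    prompts_unplanned: 0,
    hitl_total: 0,
    hitl_unplanned: 0,
    circuit_breakers_fired: 0,
    tokens_in: 0,
    tokens_out: 0,
  };
  for (const event of events) {
    switch (event.kind) {
      case "agent_call":
        summary.prompts_total += 1;
        if (!event.planned) summary.prompts_unplanned += 1;
        summary.tokens_in += event.tokensIn;
        summary.tokens_out += event.tokensOut;
        break;
      case "human_touchpoint":
        summary.hitl_total += 1;
        if (!event.planned) summary.hitl_unplanned += 1;
        break;
      case "circuit_breaker":
        summary.circuit_breakers_fired += 1;
        break;
    }
  }
  return summary;
}
```

Keeping the raw events and deriving the counters afterwards means new metrics can be added later without re-running the harness.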
import { Evaluator, TaskSpec } from 'harnesseval';

// Load a task spec
const task = await TaskSpec.fromFile(
  'tasks/auth-implementation.yaml'
);

// Wrap your harness.
// `myHarness` is your own harness instance, exposed through the adapter interface.
const evaluator = new Evaluator({
  harness: myHarness,
  task,
  config: {
    version: '0.3.1',
    trustTier: 'supervised'
  }
});

// Run the evaluation. This executes your harness
// with instrumentation capturing metrics automatically.
const runRecord = await evaluator.run();

// Run record contains all metrics, scores, traces
console.log(runRecord.summary());
await runRecord.save('runs/');
The Evaluator wraps the harness execution and captures metrics through hooks at agent call boundaries and human interaction points. Harness authors implement a thin adapter interface that tells HarnessEval where those boundaries are in their specific implementation.
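The adapter interface itself is not specified here, so the following is only a guess at what "thin" could mean: the harness receives a set of hooks and calls them at its own agent-call and human-review boundaries. `InstrumentationHooks`, `HarnessAdapter`, `toyHarness`, and `createRecorder` are all hypothetical names for illustration.

```typescript
// Hypothetical hooks the evaluator would pass into a harness.
interface InstrumentationHooks {
  onAgentCall(role: string, tokensIn: number, tokensOut: number): void;
  onHumanTouchpoint(action: string, planned: boolean): void;
}

// Hypothetical adapter: the only thing a harness author has to implement.
interface HarnessAdapter {
  run(taskDescription: string, hooks: InstrumentationHooks): Promise<string>;
}

// A toy harness that reports its boundaries through the hooks.
// A real harness would call models and wait for reviewers at these points.
const toyHarness: HarnessAdapter = {
  async run(taskDescription, hooks) {
    hooks.onAgentCall("planner", 1200, 300);
    hooks.onAgentCall("developer", 4000, 2500);
    hooks.onHumanTouchpoint("approve", true);
    return `done: ${taskDescription}`;
  },
};

// A recorder standing in for the evaluator side of the hooks.
function createRecorder(): { hooks: InstrumentationHooks; log: string[] } {
  const log: string[] = [];
  const hooks: InstrumentationHooks = {
    onAgentCall: (role) => {
      log.push(`agent:${role}`);
    },
    onHumanTouchpoint: (action, planned) => {
      log.push(`human:${action}:${planned ? "planned" : "unplanned"}`);
    },
  };
  return { hooks, log };
}
```

The harness stays unaware of how metrics are stored; it only announces when a boundary is crossed, which is what keeps integration effort low.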
Acceptance criteria with verification: "automated" are scored binary: pass or fail. The functional completeness score is the fraction of automated criteria that pass, reported on a 0 to 1 scale (1.0 when every automated criterion passes).
Acceptance criteria with verification: "rubric" are scored against a defined rubric. Rubrics follow the pattern established in T&F's harness engineering process: concrete, gradable criteria that turn subjective judgments into structured scores.
rubric:
  code_quality:
    weight: 0.4
    levels:
      1: "Code has significant structural issues,
        unclear naming, no error handling"
      2: "Code works but has inconsistent patterns,
        some unclear sections"
      3: "Code is clean, well-structured,
        follows conventions, handles errors"
      4: "Code is exemplary: clear abstractions,
        thorough error handling, well-documented"
  convention_adherence:
    weight: 0.3
    levels:
      1: "Ignores project conventions entirely"
      2: "Follows some conventions but introduces
        inconsistencies"
      3: "Follows all established conventions
        consistently"
      4: "Follows conventions and improves them
        where appropriate"
Rubric scoring can be performed by a human evaluator, an LLM evaluator, or both. When using LLM-based evaluation, the framework applies the same separation principle from Anthropic's work: the evaluator should be a different agent than the one that produced the work, calibrated with few-shot examples to prevent the positivity bias that agents show when grading their own output.
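How rubric levels (1 to 4) become the 0 to 1 scores seen in a run record is not spelled out. One plausible approach, sketched below, maps levels linearly onto 0 to 1 and averages across evaluators when both a human and an LLM grade the same criterion. Both the linear mapping and the averaging are assumptions, and `normaliseLevel` and `criterionScore` are hypothetical names.

```typescript
// Assumption: levels 1..maxLevel map linearly onto 0..1 (level 1 -> 0, level 4 -> 1).
function normaliseLevel(level: number, maxLevel: number = 4): number {
  if (!Number.isInteger(level) || level < 1 || level > maxLevel) {
    throw new RangeError(`level must be an integer between 1 and ${maxLevel}`);
  }
  return (level - 1) / (maxLevel - 1);
}

// Assumption: when several evaluators grade one criterion, their
// normalised levels are averaged with equal weight.
function criterionScore(levels: number[], maxLevel: number = 4): number {
  if (levels.length === 0) throw new Error("no evaluator levels supplied");
  const total = levels.reduce(
    (sum, level) => sum + normaliseLevel(level, maxLevel),
    0
  );
  return total / levels.length;
}
```

Under these assumptions, a human grade of 3 and an LLM grade of 4 on `code_quality` would average to (2/3 + 1) / 2, roughly 0.83.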
The overall score for a run is a weighted combination of functional and rubric scores, using the weights defined in the eval suite. This produces a single number that's comparable across runs and configurations.
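As a rough illustration of the weighted combination, the sketch below computes functional completeness from pass/fail results and then blends it with rubric scores. The rubric weights (0.4 and 0.3) come from the rubric above; the 0.3 weight on functional completeness is a placeholder chosen for this example, so the result is not expected to match the `overall` value in the sample run record. `functionalCompleteness`, `ScoreComponent`, and `overallScore` are hypothetical names.

```typescript
// Fraction of automated criteria that passed (0 to 1).
function functionalCompleteness(results: boolean[]): number {
  if (results.length === 0) throw new Error("no automated criteria to score");
  return results.filter(Boolean).length / results.length;
}

interface ScoreComponent {
  name: string;
  score: number;  // already on a 0 to 1 scale
  weight: number;
}

// Weighted average of the components, normalised by the total weight.
function overallScore(components: ScoreComponent[]): number {
  const totalWeight = components.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight <= 0) throw new Error("weights must sum to a positive value");
  const weightedSum = components.reduce((sum, c) => sum + c.score * c.weight, 0);
  return weightedSum / totalWeight;
}

const components: ScoreComponent[] = [
  // Placeholder weight: the source does not state the functional weight.
  { name: "functional_completeness", score: functionalCompleteness([true, true, true, true]), weight: 0.3 },
  { name: "code_quality", score: 0.82, weight: 0.4 },
  { name: "convention_adherence", score: 0.91, weight: 0.3 },
];

console.log(overallScore(components).toFixed(3)); // prints 0.901
```

Normalising by the total weight keeps the result on a 0 to 1 scale even when the configured weights do not sum to exactly 1.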
harnesseval/
├── README.md
├── LICENSE                      # MIT
├── package.json
├── tsconfig.json
├── src/
│   ├── core/
│   │   ├── task-spec.ts         # Task spec loader/validator
│   │   ├── run-record.ts        # Run record data model
│   │   ├── evaluator.ts         # Main evaluation orchestrator
│   │   ├── scorer.ts            # Scoring engine
│   │   └── timeline.ts          # Longitudinal tracking
│   ├── instrument/
│   │   ├── base.ts              # Adapter interface
│   │   ├── agent-tracker.ts     # Agent call instrumentation
│   │   ├── hitl-tracker.ts      # Human-in-the-loop tracking
│   │   └── cost-tracker.ts      # Token and cost tracking
│   ├── compare/
│   │   ├── diff.ts              # Configuration comparison
│   │   └── report.ts            # Comparison report generation
│   ├── types/
│   │   └── index.ts             # Shared type definitions
│   └── index.ts                 # Public API exports
├── suites/
│   └── product-dev/             # Starter eval suite
│       ├── suite.yaml
│       └── tasks/
│           ├── auth-implementation.yaml
│           ├── crud-api.yaml
│           ├── frontend-dashboard.yaml
│           └── ...
├── examples/
│   ├── basic-eval.ts
│   ├── compare-configs.ts
│   └── track-timeline.ts
├── docs/
│   ├── getting-started.md
│   ├── writing-task-specs.md
│   ├── custom-rubrics.md
│   ├── integration-guide.md
│   └── contributing.md
└── tests/
Open-source (this framework):
Proprietary (T&F's business):
The framework tells you how to measure any harness. What T&F builds inside the harness is the product.
Three reasons.
The field needs it. "Harness engineering" became an industry term in early 2026. There's a growing body of work on how to build harnesses, but no standard way to evaluate them. This fills the gap the same way that model benchmarks filled the gap for model evaluation: by giving the community a shared methodology and a comparable output format.
It makes T&F's results credible. When T&F publishes harness performance data, the methodology behind it is public and reproducible. Anyone can run the same eval suite against their own harness and compare. The numbers stand on their own.
It's the right application artifact. The Anthropic Research Engineer, Agents role asks for someone who can "design and implement rigorous quantitative benchmarks for large scale agentic tasks." An open-source eval framework, published on GitHub with real results, is a direct answer to that requirement.
v0.1 — Foundation (Month 1-2)
Core task spec format, run record data model, basic instrumentation adapters, functional scoring, CLI for running evals and viewing results. Starter eval suite for product development.
v0.2 — Comparison and rubrics (Month 2-3)
Rubric-based scoring, LLM-as-evaluator support, configuration comparison, basic reporting.
v0.3 — Timelines and community (Month 3-5)
Longitudinal tracking, timeline visualisation, expanded task library from community contributions, documentation for contributing custom eval suites.
v0.4 — Multi-domain (Month 5-8)
Eval suites beyond product development — marketing, operations. Domain-specific rubrics. Cross-domain comparison methodology.