Thought & Function — What We're Building

What this is

HarnessEval is a framework for measuring how well an AI agent harness performs — and whether it's getting better over time.

The AI industry has mature benchmarks for models. It has nothing equivalent for harnesses. This matters because recent work from Anthropic, OpenAI, and others suggests that harness design can have a larger impact on production outcomes than model choice. LangChain, for example, reported going from 52.8% to 66.5% on TerminalBench 2.0 by changing the harness alone, with the same model. But there's no standardised way to measure that improvement, reproduce it, or compare harness configurations against each other.

HarnessEval fills that gap. It provides a task specification format, an instrumentation layer, a scoring methodology, and a comparison mechanism — everything needed to evaluate any harness, on any domain, and track its improvement over time.

The framework is open-source. The harnesses you evaluate with it are yours.

Design principles

Harness-agnostic. The framework evaluates any harness: multi-agent, single-agent, any model provider, any orchestration approach. It doesn't care how your harness is built. It cares what it produces.

Metric-driven. Every evaluation produces structured, quantitative data. Subjective assessments are captured through rubric-based scoring with explicit criteria, not open-ended judgment.

Reproducible. The same task spec, run through the same harness config, should produce comparable results. The framework captures enough context (harness version, agents, model, trust tier) to make runs meaningfully reproducible, while acknowledging that LLM non-determinism means exact reproduction isn't possible.

Temporal. A single eval run tells you where you are. A series of runs over time tells you whether you're improving. The framework is built for longitudinal tracking, not one-off benchmarks.

Minimal instrumentation burden. Integrating HarnessEval into an existing harness should take hours, not days. The instrumentation layer hooks into common patterns (agent calls, human review points, tool invocations) with lightweight adapters.

Core concepts

Task spec

A task spec defines what the harness is being asked to do. It's a structured document that contains everything needed to reproduce and evaluate a run.

id: "task-2026-04-07-001"
domain: "product-dev"
description: "Implement user authentication with email/password
  and OAuth, including signup, login, password reset,
  and session management."
complexity: "medium"

inputs:
  codebase: "ref:repo/starter-template-v2"
  requirements: "ref:specs/auth-requirements.md"

acceptance_criteria:
  - id: "ac-01"
    description: "User can sign up with email and password"
    type: "functional"
    verification: "automated"
  - id: "ac-02"
    description: "OAuth flow completes without error
      for Google and GitHub providers"
    type: "functional"
    verification: "automated"
  - id: "ac-03"
    description: "Password reset email sends and
      token validates correctly"
    type: "functional"
    verification: "automated"
  - id: "ac-04"
    description: "Session persists across page reloads
      and expires after configured timeout"
    type: "functional"
    verification: "automated"
  - id: "ac-05"
    description: "Code follows project conventions
      and passes existing linter rules"
    type: "quality"
    verification: "rubric"

tags: ["auth", "backend", "security"]
estimated_complexity_hours: 8

Task specs can be functional (verifiable with automated tests), qualitative (scored against a rubric), or mixed. The framework ships with a starter library of task specs across common domains, and the spec format is extensible for custom domains.
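
As a rough illustration of the spec format, the sketch below types a parsed task spec and classifies it as functional, qualitative, or mixed. The names TaskSpecData, validateTaskSpec, and classifySpec are assumptions made for this example, not the framework's published API; only the field names come from the YAML above.

```typescript
// Hypothetical types mirroring the YAML fields shown above.
type Verification = "automated" | "rubric";

interface AcceptanceCriterion {
  id: string;
  description: string;
  type: string; // "functional" or "quality" in the example spec
  verification: Verification;
}

interface TaskSpecData {
  id: string;
  domain: string;
  description: string;
  complexity: string;
  acceptance_criteria: AcceptanceCriterion[];
  tags: string[];
}

type SpecKind = "functional" | "qualitative" | "mixed";

// Returns a list of problems; an empty list means the spec looks usable.
function validateTaskSpec(spec: TaskSpecData): string[] {
  const problems: string[] = [];
  if (spec.id.trim() === "") problems.push("id is empty");
  if (spec.acceptance_criteria.length === 0) {
    problems.push("no acceptance criteria");
  }
  const seen: string[] = [];
  for (const ac of spec.acceptance_criteria) {
    if (seen.indexOf(ac.id) !== -1) {
      problems.push("duplicate criterion id: " + ac.id);
    }
    seen.push(ac.id);
  }
  return problems;
}

// Classifies a spec by how its criteria are verified.
// A spec with no rubric criteria counts as "functional".
function classifySpec(spec: TaskSpecData): SpecKind {
  const automated = spec.acceptance_criteria.filter(
    (ac) => ac.verification === "automated"
  ).length;
  const rubric = spec.acceptance_criteria.length - automated;
  if (rubric === 0) return "functional";
  if (automated === 0) return "qualitative";
  return "mixed";
}

// Abbreviated version of the auth task from the example above.
const authTask: TaskSpecData = {
  id: "task-2026-04-07-001",
  domain: "product-dev",
  description: "Implement user authentication",
  complexity: "medium",
  acceptance_criteria: [
    {
      id: "ac-01",
      description: "User can sign up with email and password",
      type: "functional",
      verification: "automated",
    },
    {
      id: "ac-05",
      description: "Code follows project conventions",
      type: "quality",
      verification: "rubric",
    },
  ],
  tags: ["auth", "backend", "security"],
};

console.log(validateTaskSpec(authTask));
console.log(classifySpec(authTask));
```

Returning a list of problems rather than throwing lets a CLI report every issue in a spec at once.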

Run record

A run record is the complete output of evaluating a harness against a task spec. It captures everything that happened during the run.

run_id: "run-20260407-143022"
task_id: "task-2026-04-07-001"
harness_config:
  name: "tf-product-dev-harness"
  version: "0.3.1"
  agents: ["planner", "developer", "reviewer", "qa"]
  model: "claude-opus-4-6"
  trust_tier: "supervised"

timestamp: "2026-04-07T14:30:22Z"
duration_seconds: 2847
cost_usd: 14.32
outcome: "pass"

metrics:
  prompts_total: 12
  prompts_planned: 4
  prompts_unplanned: 8
  hitl_total: 6
  hitl_planned: 3
  hitl_unplanned: 3
  first_pass_approval: false
  circuit_breakers_fired: 1
  circuit_breaker_types: ["CB2:review-rejection"]
  acceptance_criteria_passed: 5
  acceptance_criteria_total: 5
  tokens_in: 284000
  tokens_out: 127000

scores:
  functional_completeness: 1.0
  code_quality: 0.82
  convention_adherence: 0.91
  overall: 0.88
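
The count fields in the example record agree with each other (planned plus unplanned equals the total). A loader could check this before a record is saved. The sketch below assumes that relationship always holds, which the example suggests but this document does not state. RunMetrics, checkRunMetrics, and unplannedHitlRate are hypothetical names.

```typescript
// Hypothetical shape for the count fields in a run record's metrics block.
interface RunMetrics {
  prompts_total: number;
  prompts_planned: number;
  prompts_unplanned: number;
  hitl_total: number;
  hitl_planned: number;
  hitl_unplanned: number;
  acceptance_criteria_passed: number;
  acceptance_criteria_total: number;
}

// Returns human-readable inconsistencies; an empty list means the counts agree.
function checkRunMetrics(m: RunMetrics): string[] {
  const issues: string[] = [];
  if (m.prompts_planned + m.prompts_unplanned !== m.prompts_total) {
    issues.push("prompts_planned + prompts_unplanned != prompts_total");
  }
  if (m.hitl_planned + m.hitl_unplanned !== m.hitl_total) {
    issues.push("hitl_planned + hitl_unplanned != hitl_total");
  }
  if (m.acceptance_criteria_passed > m.acceptance_criteria_total) {
    issues.push("more criteria passed than exist");
  }
  return issues;
}

// Share of human touchpoints that were not planned for (0 when there were none).
function unplannedHitlRate(m: RunMetrics): number {
  return m.hitl_total === 0 ? 0 : m.hitl_unplanned / m.hitl_total;
}

// Values copied from the example run record above.
const exampleMetrics: RunMetrics = {
  prompts_total: 12,
  prompts_planned: 4,
  prompts_unplanned: 8,
  hitl_total: 6,
  hitl_planned: 3,
  hitl_unplanned: 3,
  acceptance_criteria_passed: 5,
  acceptance_criteria_total: 5,
};

console.log(checkRunMetrics(exampleMetrics)); // no issues for the example
console.log(unplannedHitlRate(exampleMetrics)); // 0.5
```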

Eval suite

An eval suite is a collection of task specs designed to test a harness across a range of difficulties and task types within a domain. Running an eval suite produces a set of run records that, taken together, characterise the harness's current performance.

suite_id: "product-dev-core-v1"
domain: "product-dev"
description: "Core evaluation suite for product
  development harnesses."

tasks:
  - ref: "tasks/auth-implementation.yaml"
    weight: 1.0
  - ref: "tasks/crud-api.yaml"
    weight: 0.8
  - ref: "tasks/frontend-dashboard.yaml"
    weight: 1.0
  - ref: "tasks/bug-fix-regression.yaml"
    weight: 0.6
  - ref: "tasks/refactor-extract-service.yaml"
    weight: 0.8
  - ref: "tasks/ci-pipeline-setup.yaml"
    weight: 0.5

scoring:
  method: "weighted_average"
  pass_threshold: 0.75
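
One plausible reading of method: "weighted_average" is the sum of weight times score, divided by the sum of weights. The sketch below assumes that reading. The per-task scores in it are made up for illustration; only the weights and the pass threshold come from the suite above.

```typescript
// Hypothetical per-task result: the suite weight plus the run's overall score.
interface TaskResult {
  ref: string;
  weight: number;
  overall: number;
}

// Weighted average: sum(weight * score) / sum(weight).
function suiteScore(results: TaskResult[]): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const r of results) {
    weightedSum += r.weight * r.overall;
    totalWeight += r.weight;
  }
  if (totalWeight <= 0) throw new Error("suite has no positive weights");
  return weightedSum / totalWeight;
}

// A suite passes when its weighted score meets the threshold.
function suitePasses(results: TaskResult[], passThreshold: number): boolean {
  return suiteScore(results) >= passThreshold;
}

// Weights taken from the suite above; the scores are illustrative only.
const sampleResults: TaskResult[] = [
  { ref: "tasks/auth-implementation.yaml", weight: 1.0, overall: 0.88 },
  { ref: "tasks/crud-api.yaml", weight: 0.8, overall: 0.75 },
  { ref: "tasks/bug-fix-regression.yaml", weight: 0.6, overall: 0.6 },
];

console.log(suiteScore(sampleResults).toFixed(3));
console.log(suitePasses(sampleResults, 0.75));
```

Dividing by the total weight means the weights need not sum to 1, which matches the example suite, where they sum to 4.7.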

Comparison

A comparison runs the same eval suite (or task spec) through two or more harness configurations and produces a structured diff.

Comparison: tf-harness-v0.3.1 vs tf-harness-v0.4.0
Suite: product-dev-core-v1
---
                          v0.3.1    v0.4.0    delta
prompts_unplanned (avg)   8.2       5.1       -37.8%
hitl_unplanned (avg)      3.4       1.8       -47.1%
first_pass_approval_rate  0.42      0.67      +59.5%
cost_per_task (avg)       $14.32    $11.87    -17.1%
human_time_min (avg)      22.4      13.1      -41.5%
overall_score             0.78      0.86      +10.3%

This is the output that proves the harness engineering cycle is working. Each iteration of the cycle should produce a comparison that shows improvement on the metrics that matter.
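
The delta column in the table above is consistent with a relative change, (after - before) / before, expressed as a percentage. A minimal sketch of that arithmetic follows. The function names are hypothetical.

```typescript
// Relative change from `before` to `after`, in percent.
// Returns null when `before` is zero, since the ratio is undefined.
function percentDelta(before: number, after: number): number | null {
  if (before === 0) return null;
  return ((after - before) / before) * 100;
}

// Formats a delta with an explicit sign and one decimal place.
function formatDelta(before: number, after: number): string {
  const d = percentDelta(before, after);
  if (d === null) return "n/a";
  const sign = d > 0 ? "+" : "";
  return sign + d.toFixed(1) + "%";
}

// Rows from the comparison table above.
console.log(formatDelta(8.2, 5.1)); // -37.8%
console.log(formatDelta(0.42, 0.67)); // +59.5%
console.log(formatDelta(14.32, 11.87)); // -17.1%
```

A negative delta is an improvement for the cost and intervention metrics and a regression for the approval rate and score, so a report needs a per-metric direction to label each change as better or worse.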

Timeline

A timeline tracks the same harness (or harness lineage) across multiple eval suite runs over time. This is the longitudinal view that shows the learning curve.

Timeline: tf-product-dev-harness
Suite: product-dev-core-v1
---
Date        Version  Unplanned prompts  FPA   Score
2026-04-07  v0.2.0   11.3               0.28  0.64
2026-04-21  v0.3.0   8.7                0.39  0.74
2026-05-05  v0.3.1   8.2                0.42  0.78
2026-05-19  v0.4.0   5.1                0.67  0.86
2026-06-02  v0.5.0   3.3                0.78  0.91

This is the table you put in the blog post. This is the chart you show in the Anthropic application. Declining unplanned interventions, rising first-pass approval, improving scores — over time, with real data.
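
A timeline can also be checked programmatically, for example to flag a release that regresses. The sketch below assumes a simple rule: between consecutive entries, the unplanned count must not rise, and first-pass approval and score must not fall. TimelinePoint and isSteadilyImproving are hypothetical names, and a real check would probably allow some tolerance for run-to-run noise.

```typescript
// Hypothetical row type for the timeline table above.
interface TimelinePoint {
  date: string;
  version: string;
  unplanned: number; // third column of the table
  fpa: number; // first-pass approval rate
  score: number;
}

// True when no consecutive pair of points shows a regression on any metric.
function isSteadilyImproving(points: TimelinePoint[]): boolean {
  let prev: TimelinePoint | null = null;
  for (const cur of points) {
    if (
      prev !== null &&
      (cur.unplanned > prev.unplanned ||
        cur.fpa < prev.fpa ||
        cur.score < prev.score)
    ) {
      return false;
    }
    prev = cur;
  }
  return true;
}

// Rows copied from the timeline above.
const harnessTimeline: TimelinePoint[] = [
  { date: "2026-04-07", version: "v0.2.0", unplanned: 11.3, fpa: 0.28, score: 0.64 },
  { date: "2026-04-21", version: "v0.3.0", unplanned: 8.7, fpa: 0.39, score: 0.74 },
  { date: "2026-05-05", version: "v0.3.1", unplanned: 8.2, fpa: 0.42, score: 0.78 },
  { date: "2026-05-19", version: "v0.4.0", unplanned: 5.1, fpa: 0.67, score: 0.86 },
  { date: "2026-06-02", version: "v0.5.0", unplanned: 3.3, fpa: 0.78, score: 0.91 },
];

console.log(isSteadilyImproving(harnessTimeline)); // true
```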

Instrumentation

HarnessEval captures metrics by wrapping the key interaction points in a harness run. The instrumentation layer provides adapters for common patterns.

What gets instrumented

Agent calls. Every invocation of an agent (prompt in, response out) is logged with timestamp, token counts, agent role, and task context. This is the raw data for prompts-per-task.

Human touchpoints. Every point where a human reviews, edits, approves, rejects, or otherwise interacts with the output. Each touchpoint is tagged as planned (required by the trust tier or harness design) or unplanned (triggered by a failure or escalation). This is the raw data for HITL-steps-per-task.

Circuit breaker events. Every circuit breaker firing, with the breaker type, trigger condition, and resolution.

Acceptance criteria checks. Each acceptance criterion evaluated, with pass/fail and any automated test output or rubric scores.

Cost and timing. Token usage, API costs, wall-clock duration, and human time.
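
To make the data flow concrete, the sketch below models three of the instrumented event kinds as a discriminated union and folds them into the count fields used by the run record. The event shapes and the aggregateEvents helper are assumptions for illustration. Timestamps, task context, and cost fields are left out for brevity.

```typescript
// Hypothetical event shapes for three of the instrumented interaction points.
type HarnessEvent =
  | { kind: "agent_call"; role: string; tokensIn: number; tokensOut: number }
  | { kind: "human_touchpoint"; planned: boolean; action: string }
  | { kind: "circuit_breaker"; breakerType: string };

interface AggregatedMetrics {
  prompts_total: number;
  hitl_total: number;
  hitl_planned: number;
  hitl_unplanned: number;
  circuit_breakers_fired: number;
  circuit_breaker_types: string[];
  tokens_in: number;
  tokens_out: number;
}

// Folds a stream of events into run-record style counters.
function aggregateEvents(events: HarnessEvent[]): AggregatedMetrics {
  const m: AggregatedMetrics = {
    prompts_total: 0,
    hitl_total: 0,
    hitl_planned: 0,
    hitl_unplanned: 0,
    circuit_breakers_fired: 0,
    circuit_breaker_types: [],
    tokens_in: 0,
    tokens_out: 0,
  };
  for (const e of events) {
    switch (e.kind) {
      case "agent_call":
        m.prompts_total += 1;
        m.tokens_in += e.tokensIn;
        m.tokens_out += e.tokensOut;
        break;
      case "human_touchpoint":
        m.hitl_total += 1;
        if (e.planned) m.hitl_planned += 1;
        else m.hitl_unplanned += 1;
        break;
      case "circuit_breaker":
        m.circuit_breakers_fired += 1;
        m.circuit_breaker_types.push(e.breakerType);
        break;
    }
  }
  return m;
}

// A short made-up event stream.
const sampleEvents: HarnessEvent[] = [
  { kind: "agent_call", role: "planner", tokensIn: 100, tokensOut: 50 },
  { kind: "agent_call", role: "developer", tokensIn: 200, tokensOut: 70 },
  { kind: "human_touchpoint", planned: true, action: "approve" },
  { kind: "human_touchpoint", planned: false, action: "reject" },
  { kind: "circuit_breaker", breakerType: "CB2:review-rejection" },
];

console.log(aggregateEvents(sampleEvents));
```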

Integration pattern

import { Evaluator, TaskSpec } from 'harnesseval';

// Load a task spec (top-level await requires an ES module context)
const task = await TaskSpec.fromFile(
  'tasks/auth-implementation.yaml'
);

// Wrap your harness (myHarness is your own harness, exposed through the adapter interface)
const evaluator = new Evaluator({
  harness: myHarness,
  task,
  config: {
    version: '0.3.1',
    trustTier: 'supervised'
  }
});

// Run the evaluation — this executes your harness
// with instrumentation capturing metrics automatically
const runRecord = await evaluator.run();

// Run record contains all metrics, scores, traces
console.log(runRecord.summary());
await runRecord.save('runs/');

The Evaluator wraps the harness execution and captures metrics through hooks at agent call boundaries and human interaction points. Harness authors implement a thin adapter interface that tells HarnessEval where those boundaries are in their specific implementation.
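
The adapter interface itself is not specified here, so the following is only one possible shape, with every name in it (HarnessAdapter, EventSink, ToyHarness, instrumentedRun) invented for illustration. The harness receives a callback and calls it each time it crosses an agent-call or human-interaction boundary. The example is synchronous to stay short; a real harness would almost certainly be asynchronous.

```typescript
// Hypothetical boundary events reported by a harness.
type BoundaryEvent =
  | { kind: "agent_call"; role: string }
  | { kind: "human_touchpoint"; planned: boolean };

type EventSink = (event: BoundaryEvent) => void;

interface HarnessOutcome {
  taskId: string;
  outcome: string;
}

// Hypothetical adapter contract: run the task and report each boundary crossed.
interface HarnessAdapter {
  run(taskId: string, emit: EventSink): HarnessOutcome;
}

// Toy harness standing in for a real one; it does no actual work.
class ToyHarness implements HarnessAdapter {
  run(taskId: string, emit: EventSink): HarnessOutcome {
    emit({ kind: "agent_call", role: "planner" });
    emit({ kind: "agent_call", role: "developer" });
    emit({ kind: "human_touchpoint", planned: true });
    return { taskId: taskId, outcome: "pass" };
  }
}

// Runs the harness and collects every boundary event it reports.
function instrumentedRun(adapter: HarnessAdapter, taskId: string) {
  const events: BoundaryEvent[] = [];
  const result = adapter.run(taskId, (event) => {
    events.push(event);
  });
  return { result: result, events: events };
}

const toyRun = instrumentedRun(new ToyHarness(), "task-2026-04-07-001");
console.log(toyRun.result.outcome, toyRun.events.length);
```

Passing the callback into run keeps the harness unaware of how events are stored, which is one way to keep the integration burden small.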

Scoring

Functional scoring

Acceptance criteria with verification: "automated" are scored binary: pass or fail. The functional completeness score is the fraction of automated criteria that pass, expressed as a value from 0.0 to 1.0 (as in the functional_completeness field of the run record).
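
A minimal sketch of that calculation, assuming each criterion result carries its verification type and a pass flag. CriterionResult and functionalCompleteness are hypothetical names. Returning null when a task has no automated criteria is a choice made for this example, not something the document defines.

```typescript
// Hypothetical per-criterion result.
interface CriterionResult {
  id: string;
  verification: "automated" | "rubric";
  passed: boolean; // ignored here for rubric criteria
}

// Fraction of automated criteria that passed, or null if there are none.
function functionalCompleteness(results: CriterionResult[]): number | null {
  const automated = results.filter((r) => r.verification === "automated");
  if (automated.length === 0) return null;
  const passed = automated.filter((r) => r.passed).length;
  return passed / automated.length;
}

// Four automated criteria, one failing, plus one rubric criterion.
const criterionResults: CriterionResult[] = [
  { id: "ac-01", verification: "automated", passed: true },
  { id: "ac-02", verification: "automated", passed: true },
  { id: "ac-03", verification: "automated", passed: false },
  { id: "ac-04", verification: "automated", passed: true },
  { id: "ac-05", verification: "rubric", passed: true },
];

console.log(functionalCompleteness(criterionResults)); // 0.75
```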

Rubric scoring

Acceptance criteria with verification: "rubric" are scored against a defined rubric. Rubrics follow the pattern established in T&F's harness engineering process: concrete, gradable criteria that turn subjective judgments into structured scores.

rubric:
  code_quality:
    weight: 0.4
    levels:
      1: "Code has significant structural issues,
         unclear naming, no error handling"
      2: "Code works but has inconsistent patterns,
         some unclear sections"
      3: "Code is clean, well-structured,
         follows conventions, handles errors"
      4: "Code is exemplary — clear abstractions,
         thorough error handling, well-documented"

  convention_adherence:
    weight: 0.3
    levels:
      1: "Ignores project conventions entirely"
      2: "Follows some conventions but introduces
         inconsistencies"
      3: "Follows all established conventions
         consistently"
      4: "Follows conventions and improves them
         where appropriate"

Rubric scoring can be performed by a human evaluator, an LLM evaluator, or both. When using LLM-based evaluation, the framework applies the separation principle described in Anthropic's work: the evaluator should be a different agent from the one that produced the work, and it should be calibrated with few-shot examples to counter the positivity bias that agents show when grading their own output.
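
The rubric levels run from 1 to 4 while the scores in a run record fall between 0 and 1, so some normalisation has to happen in between. The document does not say which, so the sketch below assumes a linear mapping where level 1 becomes 0 and the top level becomes 1, followed by a weighted average over dimensions. All names are hypothetical.

```typescript
// Hypothetical rubric definition: a weight and a top level per dimension.
interface RubricDimension {
  weight: number;
  maxLevel: number;
}
type Rubric = { [dimension: string]: RubricDimension };
type Ratings = { [dimension: string]: number };

// Assumption: map level 1..maxLevel linearly onto 0..1.
function normaliseLevel(level: number, maxLevel: number): number {
  return (level - 1) / (maxLevel - 1);
}

// Weighted average of normalised levels across all rubric dimensions.
function rubricScore(rubric: Rubric, ratings: Ratings): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const dim of Object.keys(rubric)) {
    const definition = rubric[dim];
    const level = ratings[dim];
    if (definition === undefined || level === undefined) {
      throw new Error("missing rating for " + dim);
    }
    weighted += definition.weight * normaliseLevel(level, definition.maxLevel);
    totalWeight += definition.weight;
  }
  if (totalWeight <= 0) throw new Error("rubric has no positive weights");
  return weighted / totalWeight;
}

// Weights from the rubric above; the ratings are illustrative.
const exampleRubric: Rubric = {
  code_quality: { weight: 0.4, maxLevel: 4 },
  convention_adherence: { weight: 0.3, maxLevel: 4 },
};
const exampleRatings: Ratings = { code_quality: 3, convention_adherence: 4 };

console.log(rubricScore(exampleRubric, exampleRatings).toFixed(3));
```

Under this mapping a level-3 rating contributes about 0.667. A different convention, such as level divided by maxLevel, would shift every score upward, so the choice should be recorded alongside the results.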

Overall scoring

The overall score for a run is a weighted combination of functional and rubric scores, using the weights defined in the eval suite. This produces a single number that's comparable across runs and configurations.

Repository structure

harnesseval/
├── README.md
├── LICENSE                    # MIT
├── package.json
├── tsconfig.json
├── src/
│   ├── core/
│   │   ├── task-spec.ts       # Task spec loader/validator
│   │   ├── run-record.ts      # Run record data model
│   │   ├── evaluator.ts       # Main evaluation orchestrator
│   │   ├── scorer.ts          # Scoring engine
│   │   └── timeline.ts        # Longitudinal tracking
│   ├── instrument/
│   │   ├── base.ts            # Adapter interface
│   │   ├── agent-tracker.ts   # Agent call instrumentation
│   │   ├── hitl-tracker.ts    # Human-in-the-loop tracking
│   │   └── cost-tracker.ts    # Token and cost tracking
│   ├── compare/
│   │   ├── diff.ts            # Configuration comparison
│   │   └── report.ts          # Comparison report generation
│   ├── types/
│   │   └── index.ts           # Shared type definitions
│   └── index.ts               # Public API exports
├── suites/
│   └── product-dev/           # Starter eval suite
│       ├── suite.yaml
│       └── tasks/
│           ├── auth-implementation.yaml
│           ├── crud-api.yaml
│           ├── frontend-dashboard.yaml
│           └── ...
├── examples/
│   ├── basic-eval.ts
│   ├── compare-configs.ts
│   └── track-timeline.ts
├── docs/
│   ├── getting-started.md
│   ├── writing-task-specs.md
│   ├── custom-rubrics.md
│   ├── integration-guide.md
│   └── contributing.md
└── tests/

What's open-source, what's not

Open-source (this framework):

The evaluation methodology and scoring system
The instrumentation layer and adapter interface
The task spec format and starter task library
The comparison and timeline tools
The rubric specification format

Proprietary (T&F's business):

T&F's harness definitions (agent prompts, decomposition patterns, circuit breaker rules)
Client-specific data and run records
T&F's tuned evaluation criteria and rubric calibrations

The framework tells you how to measure any harness. What T&F builds inside the harness is the product.

Why open-source this

Three reasons.

The field needs it. "Harness engineering" became an industry term in early 2026. There's a growing body of work on how to build harnesses, but no standard way to evaluate them. This fills the gap the same way that model benchmarks filled the gap for model evaluation: by giving the community a shared methodology and a comparable output format.

It makes T&F's results credible. When T&F publishes harness performance data, the methodology behind it is public and reproducible. Anyone can run the same eval suite against their own harness and compare. The numbers stand on their own.

It's the right application artifact. The Anthropic Research Engineer, Agents role asks for someone who can "design and implement rigorous quantitative benchmarks for large scale agentic tasks." An open-source eval framework, published on GitHub with real results, is a direct answer to that requirement.

Roadmap

v0.1 — Foundation (Month 1-2)
Core task spec format, run record data model, basic instrumentation adapters, functional scoring, CLI for running evals and viewing results. Starter eval suite for product development.

v0.2 — Comparison and rubrics (Month 2-3)
Rubric-based scoring, LLM-as-evaluator support, configuration comparison, basic reporting.

v0.3 — Timelines and community (Month 3-5)
Longitudinal tracking, timeline visualisation, expanded task library from community contributions, documentation for contributing custom eval suites.

v0.4 — Multi-domain (Month 5-8)
Eval suites beyond product development — marketing, operations. Domain-specific rubrics. Cross-domain comparison methodology.