What Is an Agent Harness? / naïve-labs

01 / Harness, before AI

A harness predates software.

On a horse, the harness is the arrangement of straps that turns strength into pulled loads and plowed fields. The harness has no strength of its own. It channels power into controlled, useful work and keeps that power from going where it should not.

Software borrowed the word for testing. A test harness is the code around the code: it sets up a controlled environment, feeds the program inputs, runs it, and observes what comes out. The same shape holds: the harness is not the thing being run. It makes the run controlled and observable.

Both senses survive in current AI usage: controlled action, observable execution, repeatable work.

02 / Current usage

How the term is used now.

Current AI usage has not converged. The main definitions cluster around six sources.

Anthropic gives the cleanest functional definition: an agent harness is "the system that enables a model to act as an agent: it processes inputs, orchestrates tool calls, and returns results". Their docs describe Claude Code as "the agentic harness around Claude: it provides the tools, context management, and execution environment that turn a language model into a capable coding agent", and they call the Claude Agent SDK a "general-purpose agent harness".

LangChain popularized the formula most people now repeat: "Agent = Model + Harness. If you're not the model, you're the harness. A harness is every piece of code, configuration, and execution logic that isn't the model itself" — system prompts, tools and their descriptions, filesystem, sandbox, orchestration logic, hooks.

Cursor treats the harness as a product layer that is tuned together with each model: "the harness and the model together determine how good the agent is". Their harness gives different models different editing tools and different prompts, because models perform best with the machinery they were trained on.

METR, the evaluation organization, mostly does not say harness at all. They say scaffold: the thing that gives a model tools and manages the interaction loop. In their usage, ReAct — "an agent takes an action, sees the results of the action, and repeats" — and Claude Code belong to the same category, simpler and more elaborate scaffolds.

Hugging Face's agent glossary splits the difference: scaffolding is the behavior-defining layer the model works from (prompts, tool descriptions, parsing), harness is the execution layer that makes the agent run (calls the model, executes tool calls, decides when to stop) — while conceding that in practice most products use "harness" for the whole thing. The glossary exists because researchers at the same conference could not agree on what these words mean.

OpenAI uses the word most expansively. In their account of harness engineering, the harness reaches beyond the software around the model into the workspace itself: repository structure, CI configuration, linters, in-repo documentation, observability the agent can query. The engineer's job becomes "to design environments, specify intent, and build feedback loops that allow Codex agents to do reliable work."

Convergence. The model is not part of the harness. The harness surrounds it, executes tools, runs the loop, decides when to stop, and manages what the model sees: prompts, memory, and context.

Open boundary. The term still competes with scaffold, and Hugging Face splits the two into separate layers. Most sources stop at the software layer around the model. OpenAI extends harness engineering into the working environment. This page uses the narrower boundary: once everything an agent touches counts as harness, there is no word left for the layers that come after it.

03 / A working definition

A working definition.

Agent is an old word too: from Latin agere, to act. The one who does, as opposed to the one done to. Its technical sense comes from reinforcement learning, where an agent observes, acts, and observes again (HF glossary). In LLM systems, the common formula is:

Agent = Model + Harness

The model generates text. The harness turns that output into work:

Harness = Tools + Context + Loop

Tools are how the agent reaches the world: file operations, shell commands, web access, APIs. The model can only express the intent to act; the harness executes it. (HF glossary)
Context is everything the harness puts in front of the model: instructions, tool descriptions, memory carried across sessions, and the ongoing management of a limited context window. (Anthropic, Cursor)
The loop runs the system: call the model, execute what it asks for, feed back the result, repeat, and decide when to stop. Permissions live here: what the agent may execute, what needs approval. Feedback lives here too: tests, verification, and the error messages that let an agent correct itself. Simon Willison's compact definition holds: an agent "runs tools in a loop to achieve a goal".

Precision notes

Two details matter.

The loop is not a part like the others. It runs the others. A more accurate notation would be Agent = Loop(Model, Tools, Context): the harness as a function, the model as one of its arguments. The list form is easier to hold; the function form is closer to the system.

And the terms are not independent variables. Current models are post-trained inside specific harnesses, and harnesses are tuned per model — down to giving different models different file-editing tools (Cursor). Conceptually the cut between model and harness is clean. Empirically the two are grown together, and swapping one term while holding the other fixed changes more than the equation suggests.

Working definition, June 2026: an agent harness is the tool, context, and loop layer around a model. This page uses the narrower current boundary and treats scaffold as adjacent vocabulary. If field usage settles elsewhere, this definition changes. The agent is not the model; it exists when a harness supplies the observation-action loop. Agent vs. framework vs. workflow belongs on a separate page.

04 / Evidence

Does the harness matter?

Most writing about harnesses asserts that they matter more than the model. The evidence is mixed.

In favor: moving the same model into a different harness moved a coding agent from roughly 30th to 5th place on Terminal-Bench 2.0 (LangChain). Cursor reports spending weeks of engineering per model on harness tuning, because the gains are real (Cursor).

Complication: METR measured the autonomous task performance of frontier models on minimal scaffolds versus flagship product harnesses and found no statistically significant difference in time horizon for the models they tested.

A plausible reconciliation: harnesses matter most where work is interactive, environment-heavy, and tool-rich, and least in stripped-down autonomous measurement. That remains a hypothesis, not a result.

05 / Genealogy

Where the word came from.

Static timeline. Later, this can become an interactive view over the AI History event corpus.

Before AI
The test harness: code that runs other code under controlled conditions and observes it. (Wikipedia)
2021
The word enters the LLM world through evaluation, not agents: EleutherAI's lm-evaluation-harness becomes the standard infrastructure for running language models against benchmarks. A direct descendant of the test harness.
2023
Tool use and function calling: models gain a structured way to express intent to act; something now has to execute that intent.
2023–2024
Scaffolding: the early word for agent-side wrapping, used by evaluation groups (METR's usage continues this stratum).
Nov 2024
The Model Context Protocol standardizes how tools and context reach models, making the wrapping layer buildable as infrastructure.
2025
Agent harness becomes the product-side term: Claude Code describes itself as one; SDKs ship as harnesses.
2025–2026
Harness engineering is named as a discipline (LangChain, OpenAI, O'Reilly); the term starts to appear in academic survey literature. "Scaffold" remains live alongside it.

The pattern across every age of the word is stable: the harness is never the power. It is the structure that makes power do controlled, observable, repeatable work.

06 / Open questions

Maintained questions.

Harness or scaffold? The field uses both; one glossary splits them into separate layers. This page uses "harness" for the broad sense until usage settles.
How much does the harness matter, and where? Product benchmarks and autonomous evaluations currently disagree. What would a clean experiment look like?
Where exactly does the harness end? The narrower definition used here stops at the software layer around the model. OpenAI's wider usage pulls in the whole working environment. The boundary matters for what comes next — see below.

07 / Beyond the harness

Reliable action is not governed agency.

A harness makes agent action reliable. It does not say who set the goal, under what authority the agent acts, what norms bind it, or what happens when it causes consequences that outlast the session. Those are questions about governing agency, not about making action reliable — a different layer, with its own name: the agent habitat.

08 / Changelog

2026-06 First version: definition built origin-outward from a survey of current usage (Anthropic, OpenAI, METR, LangChain, Hugging Face, Cursor).
2026-06-12 Public-copy pass: reduced explanatory narration, tightened section transitions, and shifted the page toward reference tone.

What is an agent harness?