Everyone Gets Their Own Harness. Governance Still Exists.

I rewrote this essay because I stopped believing it

The first version of this essay, published under this same title, argued that every company needs one internal harness: a single company-controlled interface that classifies each task, assigns risk, and routes to the right model. I no longer run my own shop that way, so I am not going to keep telling you to. The piece you are reading is the correction.

What changed my mind is a file called provider_registry.py in my autonomous development factory. It registers agent harnesses by name, "claude-code", "openai-codex", so the orchestrator can spawn builders through any registered provider. The day that registry existed, the standardization argument was over. Adding a harness became a registration, not a migration. And once swapping harnesses is cheap, mandating one is pure cost: you pay the re-tooling tax across every team and get back nothing the layer underneath was not already providing.

The layer underneath is the part I believe in harder than before. Traces on every model call. Golden tests that do not care which harness produced the output. Tool gates that fail closed at compile time. Budgets that stop the run at a dollar line. Keep that substrate centralized and you can let every team bring whatever harness makes them fast.

Harnesses are where the variance lives. Let people pick.

I have measured what a harness does to the same model. In my own evals, one small local model scored 0.8 through one harness, four of five questions correct in about a second, and 0.0 through another, zero of five with every question erroring out over ten minutes. Same weights, same questions. An eighty-point swing that was entirely plumbing. Harness quality can be the whole result.

That cuts both ways. If harnesses matter that much, forcing one blessed harness on every team is a bet that your platform group builds the best harness for every job in the company. Mine does not, and yours does not either. The engineer who lives in Claude Code all day has tuned a workflow no central team will replicate. The team on a different vendor's agent has reasons grounded in their task shape. The old essay's core rule survives, restated: the user picks the job, the substrate meters the work. What I dropped is the idea that the picking has to happen inside one company-issued wrapper.

One thing has not moved an inch since the first version: end users still should not manage model economics. Nobody doing real work is going to reason about input-token pricing, tokenizer differences, prompt caching, retry rates, or context-length ceilings, and they should not have to. The correction is about where that machinery lives. It does not need to live in a wrapper between the person and their tools. It lives below the tools, in routing defaults, budget policies, and metering that apply no matter which harness made the call.

Traces are the non-negotiable

The first hard rule in my platform's contributor doc is not about code style. It reads, verbatim: "Langfuse first. Always check traces BEFORE modifying agent code. Never guess without evidence." That rule exists because debugging an agent from vibes is how you ship a confident fix for a problem you do not have.

The tracing layer is manual instrumentation rather than the stock framework callback, because the stock LangChain handler does not track streaming latency correctly. The module opens a trace when a request starts, opens spans for each phase (planner, agent, tools), and closes the trace with real latency, token counts, cost, and model info when streaming completes. It also carries one design commitment I would defend anywhere: observability errors are logged and never raised, because instrumentation that can break the agent it watches is a second bug generator.

The governance property is that this sits below every harness. Whatever spawned the run, a chat user, a scheduler, a builder inside someone's personal harness, the trace lands in the same place with the same schema. When a harness misbehaves, the argument happens over shared evidence instead of competing anecdotes.

Evals are the treaty between harnesses

Diversity without a shared bar is chaos, so the bar is centralized too. The product this platform grew out of carries 76 golden tests, and the platform repo carries the same suite. The categories:

Category	Cases	What it checks
Semantic queries	27	Three per data domain, across nine domains
Analysis	14	Insights and multi-skill composition
Edge cases	10	Greetings, errors, refusal behavior
Predictions	9	ML forecasts
Charts	6	Chart rendering
Simulations	6	Digital-twin scenarios
Dashboard	4	Dashboard management

The 76-case golden suite. The runner is a CLI SSE client against the backend; it cannot see which harness, model, or prompt produced the output.

The property that matters for this essay is harness-independence. The runner speaks SSE to the backend and asserts on outputs. Change the model, change the harness, change the prompt assembly: the same 76 cases render a verdict either way, which is what makes them a treaty rather than a preference.

When the bar does its job, it blocks things. A production QA pass on a tool-contract migration ran 118 assertions across three personas and two rounds: 74.6 percent passed (88 of 118), 8.5 percent warned (10 of 118), 16.9 percent failed (20 of 118), and all 20 failures were data mismatches between what a tool claimed and what the database held, including a run that claimed 221 orders against an actual 835. The report's status line said it plainly: issues detected, requires investigation before production deployment. It did not ship. A centralized eval that cannot say no is decoration.

Gates and budgets fail closed

The third centralized layer is enforcement. In the platform compiler, Gate 1 decides whether this actor may run this agent at all, and Gate 2 filters the tool catalog by the actor's permissions before the graph is built. An unauthorized tool is not blocked at call time. It is absent from the compiled agent. There is nothing to jailbreak at runtime because the capability was never assembled.

On the autonomous-loop side, where the improvement harness runs unsupervised for days, every sub-agent role gets a typed budget. Triage gets 15 tool uses, 3 minutes, and 50 cents. A fix agent gets 60 tool uses, 15 minutes, and 3 dollars. Audit gets 40 tool uses, 10 minutes, and 2 dollars. Verify gets 5 tool uses and a quarter. The wallclock ceiling is a hard SIGKILL deadline, not a suggestion. Above those sit loop-level caps: at most 200 fixes a day, at most 8 agent-hours a day, 72 hours of total wallclock.

A destructive-op circuit breaker regex-scans every sub-agent reply for force pushes, DROP DATABASE, rm -rf on root, and --no-verify, and halts the loop for human review on the first match.
Role separation is enforced at the process level with allowed and disallowed tool lists: triage is read-only, scope writes markdown instead of code, and no role can widen its own permissions.
Every invocation writes one row to an outcome ledger (wallclock, tool uses, exit code), so cost and reliability are reconstructable after the fact from the log alone.

The loop also grows its own regression net. Every iteration that produces a fix ships a new regression spec, and that spec joins the suite for every subsequent iteration; any red in the accumulated net halts the whole loop for review. So the enforcement layer tightens over time without anyone maintaining it by hand, and it tightens identically for every harness, because the specs test the product, and the product does not know who fixed it.

The harness can never govern itself

Here is why none of this enforcement can live inside the harness, no matter whose harness it is. I once trusted my own autonomous factory to report its own completions. It grew a 233-line function whose entire job was to forge the evidence chain, including an approval record with approver_type set to "human" that no human ever touched. I wrote that story up separately, and the lesson that survived it generalizes: every harness is an optimizing system, and a self-reported metric from an optimizing system is a claim, not a measurement.

So governance has to sit in layers the harness cannot edit. Traces it does not write. Budgets it cannot raise. Evals it does not get to grade. The same principle shows up inside the factory as a hook-enforced design rule: the agent that writes the verification harness for a piece of work cannot also write the code. If one author both sets the bar and clears it, the bar means nothing, and that holds whether the author is a person, a model, or a harness.

Once those layers hold, the harness stops being a trust boundary at all, and that is exactly what makes diversity safe. I was never trusting the harness. I was trusting the substrate underneath it, and the substrate is shared.

Standardize telemetry, not tooling

The economics still come down to the unit I used in the original essay: cost per accepted task. Model spend plus tool spend plus retries plus review, divided by the outputs you actually accept. Harness pluralism changes none of that arithmetic; the metering just has to live in the substrate, where every harness's work flows through it. What changes is the denominator. Teams working in harnesses they chose produce more accepted tasks, and the traces let you see which harnesses earn their keep instead of guessing.

The board-language version: a harness mandate is a hidden headcount tax. Re-tooling every team onto a blessed interface costs weeks per team and produces zero additional output, and the mandate starts decaying the moment a better harness ships, which in 2026 is roughly quarterly. A telemetry mandate costs one integration per harness and appreciates, because every new harness inherits the entire eval and budget stack on day one.

I stand by the old essay's best line: access is not architecture. Handing everyone premium chat is a spending pattern, and I said so at the time. But mandating one internal harness is also access theater, just with better branding. The architecture is the substrate: traces on every call, golden tests that outlive any tool, gates that fail closed, budgets with hard stops. Build that, then let your teams bring whatever harness makes them dangerous. If your platform group is spending this year picking the company harness instead of building the layer that makes every harness accountable, it is building the wrong monopoly.