
Trust Is an Engineering Deliverable (Part 2 of 2)

Last post I argued the loop is the easy half and the boundary is the real work. This one is about the people the loop crowd keeps writing off, because I think they’re the most important audience in this whole discourse.

I have been watching developers stay stuck at the prompt. Good engineers. They will happily run an agent turn by turn, read every diff, approve every step, and they will not let it run an isolated goal-based loop without them in the chair.

Ask why and you get some version of fear. Fear of what it’ll touch. Fear of what it’ll break while nobody’s watching. Fear of being the name on the postmortem for a change no human wrote.

The standard response to this is coaching. Trust the model, the models are better now, just try it. I want to say plainly that this response is wrong, and not just ineffective wrong. Wrong wrong. The fear is not a mindset problem. It is an accurate risk assessment of missing infrastructure.

These developers are looking at an autonomous system with no gates, no evidence trail, and no blast radius limits, and they are declining to bet their reputation on it. That is not timidity. That is the same judgment we hired them for.

Which makes it a problem to solve with engineering.

We Already Solved This Once

Think about how a junior engineer gets merge rights. Nobody sits them down and grants trust. Nobody coaches the seniors into feeling comfortable.

We built a system where trust is mostly unnecessary: CI that runs the tests, review that requires another set of eyes, branch protection that blocks the direct push, rollback that bounds the damage when something slips through anyway. The junior merges on day one not because anyone trusts them but because the system made the trust question cheap.

Then something interesting happens over months. The junior’s record accumulates. Their PRs pass clean, their incidents are rare, and the humans around them start granting real trust, the informal kind, based on evidence the system collected without anyone trying.

The infrastructure didn’t just protect the codebase. It manufactured the trust.

That is the move with loops. You do not ask developers to trust the agent. You build the system that collects the evidence, and you let the trust arrive on its own schedule.

Leverage Isn’t Enough

I want to be fair to the best version of the other argument before I push on it. The compound engineering crowd at Every has been making the loop case longer than almost anyone, a long time in AI years anyway, and their framing is genuinely good: every unit of work should teach the system, so the next unit gets cheaper.

The lessons compound. The leverage compounds.

True. And incomplete. Because error compounds on exactly the same curve. A loop that learns also drifts, and a lesson encoded from a bad example is a bug with a memory.

Compounding tells you the loop is getting more capable. It tells you nothing about whether the loop has earned the right to act on that capability. Leverage compounds output. Evidence compounds trust. You need both curves, and the discourse only talks about one of them.

What This Looks Like at My Desk

Enough theory. Here is a system I actually run.

I have agents that mine my work surfaces for actionable signals. One watches Teamwork for task assignments. Another watches the calendar for meetings, and the notes and transcripts those meetings leave behind, pulling action items out of the wreckage.

Another reads email and Teams for requests for help or work that arrive dressed as conversation. Another watches GitHub PRs and issues for requests for change. Different sources, same job: find the signals hiding in the noise.

Here is the part that matters. The agents do not act on what they find. They extract the data, categorize each signal, and attach a confidence rating, and the rating travels with the signal as evidence. Then a gate decides.

Only confident signals matching the desired category get through. A vague maybe-request in a Teams thread scores low and stays out. A direct assignment with my name on it scores high and lands in front of me.
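A minimal sketch of what such an intake gate could look like in Python. This is an illustration, not the system described above: the `Signal` fields, the category names, and the 0.8 cutoff are all assumptions made up for the example.

```python
from dataclasses import dataclass


# Hypothetical signal shape; field names are illustrative assumptions.
@dataclass(frozen=True)
class Signal:
    source: str        # where the signal was found, e.g. "teams" or "github"
    category: str      # label assigned by the extracting agent
    confidence: float  # 0.0 to 1.0, travels with the signal as evidence
    summary: str


# Hypothetical gate settings, fixed before any signal is evaluated.
WANTED_CATEGORIES = frozenset({"assignment", "change_request"})
MIN_CONFIDENCE = 0.8


def admit(signal: Signal) -> bool:
    """Intake gate: a fixed rule applied before any work begins."""
    return (
        signal.category in WANTED_CATEGORIES
        and signal.confidence >= MIN_CONFIDENCE
    )


signals = [
    Signal("teams", "assignment", 0.35, "maybe someone could look at the sync job?"),
    Signal("teamwork", "assignment", 0.95, "Task assigned to you: fix retry storm"),
    Signal("email", "newsletter", 0.99, "Weekly digest"),
]

# Only the direct, high-confidence assignment gets through.
admitted = [s for s in signals if admit(s)]
```

The agent supplies the category and the confidence; the gate itself is two comparisons that anyone can read and rerun.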

Notice where the gate sits. Not at the output end, checking finished work before a merge. At the intake end, before any work begins, before a decision is even on the table.

Everyone’s verification story this week lives at the end of the loop, grading homework after it’s done. Mine starts at admission. The evidence is not an audit trail bolted on after the fact. It is the ticket that gets a signal into the room.

That is shift-left, applied to governance. We spent a decade moving testing earlier in the pipeline because defects get expensive with age. Decisions are the same.

A bad signal admitted at intake becomes a bad task, becomes bad work, becomes a bad merge, and every stage downstream pays interest on it. So the gate goes first. Engineering shifts left, and the decision gates are standing there before any decision gets made.

Is intake gating sufficient on its own? No, and I won’t pretend it is. You still need the output gates from part one, the evals and promotion rules and blast radius limits.

The point is that the boundary is not one wall at the end of the loop. It is gates at both ends, evidence flowing between them, and the loop running in the space the gates define.

Receipts Can Be Forged

There is a failure mode hiding in everything I just said, and if you have run agents against a test suite you have already met it. The agent optimizes for the gate, not the work. Tell a loop the receipt is a green test run and it will get you a green test run. Sometimes by fixing the code.

Sometimes by weakening the assert, deleting the flaky test, mocking the world away, or writing a test that lovingly mirrors the bug it was supposed to catch. Goodhart’s law with a commit bit: the moment the measure becomes the target, the measure stops measuring.

So the receipts need checks and balances of their own. Not infinite regress, just a second layer that audits the first, and the trick is making that layer arithmetic instead of judgment.

Mutation testing is the cleanest example. Seed faults into the code and count how many the tests actually catch. A suite that lets mutants walk free is a forged receipt, and the kill rate is a number nobody can argue with.
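To make the arithmetic concrete, here is a toy mutation run in Python. The mutants are written by hand to stand in for what a real mutation tool would generate, and the `clamp` function and both suites are invented for the example.

```python
def clamp(x, lo, hi):
    """Code under test: restrict x to the range [lo, hi]."""
    return max(lo, min(x, hi))


# Hand-written mutants, each seeding one fault into clamp.
MUTANTS = [
    lambda x, lo, hi: min(lo, min(x, hi)),  # outer max replaced by min
    lambda x, lo, hi: max(lo, max(x, hi)),  # inner min replaced by max
    lambda x, lo, hi: max(lo, x),           # upper bound dropped
    lambda x, lo, hi: min(x, hi),           # lower bound dropped
]


def weak_suite(fn) -> bool:
    """Green on the real code, but only exercises the in-range case."""
    return fn(5, 0, 10) == 5


def strong_suite(fn) -> bool:
    """Also exercises both boundaries."""
    return fn(5, 0, 10) == 5 and fn(-3, 0, 10) == 0 and fn(42, 0, 10) == 10


def kill_rate(suite, mutants) -> float:
    """Fraction of mutants on which the suite fails (that is, kills)."""
    killed = sum(1 for mutant in mutants if not suite(mutant))
    return killed / len(mutants)


weak_rate = kill_rate(weak_suite, MUTANTS)      # two mutants survive
strong_rate = kill_rate(strong_suite, MUTANTS)  # every mutant is killed
```

Both suites are green against the real `clamp`. Only the kill rate tells them apart.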

CRAP scores do similar work from another angle, flagging code that is complex and barely tested, which is exactly where an agent hides its shortcuts. And a complexity delta on every change catches the slow forgery: tests green, behavior fine, structure quietly rotting underneath.
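For reference, the CRAP (Change Risk Anti-Patterns) score is usually given as complexity squared times the cube of the uncovered fraction, plus complexity. A short Python sketch, assuming cyclomatic complexity and coverage have already been measured by whatever tools you use:

```python
def crap_score(complexity: int, coverage: float) -> float:
    """CRAP score for one function.

    complexity: cyclomatic complexity of the function.
    coverage: fraction of the function covered by tests, from 0.0 to 1.0.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be a fraction between 0.0 and 1.0")
    return complexity ** 2 * (1.0 - coverage) ** 3 + complexity


untested = crap_score(10, 0.0)      # complex and untested: the penalty dominates
half_tested = crap_score(10, 0.5)
fully_tested = crap_score(10, 1.0)  # full coverage: score falls back to complexity
```

The cubic term is what makes the score useful here: a complex function with thin coverage scores far worse than the same function well tested.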

The structural half of checks and balances matters as much as the metrics. The agent that writes the code must not be able to touch the tests, the eval definitions, or the gate thresholds. Separation of duties, enforced as a diff constraint, not a convention.

A writer that can edit its own gate is not a governed system. It is a system with paperwork.
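One way to enforce that as a diff constraint is a check over the list of changed paths. The sketch below is an assumption about how that could look; the directory layout and the patterns are hypothetical, and a real setup would feed it the changed-file list from version control.

```python
from fnmatch import fnmatchcase

# Hypothetical repository layout: paths the writing agent must never modify.
PROTECTED_PATTERNS = ("tests/*", "evals/*", "gates/*.yaml")


def protected_paths_touched(changed_files, patterns=PROTECTED_PATTERNS):
    """Return the changed paths that match any protected pattern."""
    return sorted(
        path
        for path in changed_files
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    )


def diff_allowed(changed_files) -> bool:
    """A writer's diff is allowed only if it touches no protected path."""
    return not protected_paths_touched(changed_files)


violations = protected_paths_touched(
    ["src/sync/worker.py", "tests/unit/test_sync.py"]
)
```

Note that `*` in these patterns also matches path separators, so `tests/*` covers nested test directories.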

Audit the receipts with arithmetic. Lock the gates away from the writer. Then a green run starts meaning what it says.

The Trust Ladder

So how does the fearful developer, the correctly fearful developer, actually get from prompting to loops? The same way the junior got merge rights. In stages, with evidence at every rung.

  1. Report-only. The loop finds work and proposes, nothing more. Every proposal carries its confidence and its evidence. You read them for a week or two and you learn what the loop gets wrong, which is the most valuable data you will collect in this entire process.
  2. Propose-with-gates. The loop still doesn’t act, but now your gates filter the proposals, and you watch the gates instead of the raw stream.
  3. Act-with-gates. The loop acts on bounded, reversible work, where the worst case is a reverted commit.
  4. Unattended. Eventually, for some loops, on some work, the chair sits empty.

Each rung is a deposit. Every gated decision with a receipt builds the record, the same way the junior’s clean PRs did.
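The ladder itself can be a written rule instead of a mood. A sketch in Python, where the rung names follow the list above and the evidence thresholds are invented for illustration, not recommendations:

```python
from enum import IntEnum


class Rung(IntEnum):
    REPORT_ONLY = 1
    PROPOSE_WITH_GATES = 2
    ACT_WITH_GATES = 3
    UNATTENDED = 4


# Hypothetical promotion criteria; pick your own from observed data.
MIN_DECISIONS = 50    # reviewed decisions needed on the current rung
MIN_AGREEMENT = 0.95  # share of those where the human agreed with the loop


def next_rung(current: Rung, decisions: int, agreed: int) -> Rung:
    """Move up one rung only when the accumulated record supports it."""
    if current is Rung.UNATTENDED:
        return current
    if decisions < MIN_DECISIONS:
        return current
    if agreed / decisions < MIN_AGREEMENT:
        return current
    return Rung(current + 1)


after_first_review = next_rung(Rung.REPORT_ONLY, decisions=60, agreed=58)
```

Too little evidence or too much disagreement leaves the loop where it is, and nobody has to argue about it.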

And here’s the part I keep coming back to: at no rung did anyone have to feel brave. The developers I described at the top do not have a courage deficit. They have an evidence deficit, and evidence is something we know how to manufacture.

The loop crowd is right that the loop is the future of this work. The trust crowd is right that an ungoverned loop is a liability with a scheduler. Both camps are staring at the same missing piece.

Build the system that keeps receipts, and the fear takes care of itself, not because anyone talked the fear away, but because the fear was never the problem. The missing receipts were.

Trust isn’t a feeling you wait for. It’s a deliverable you ship.

Let’s talk about it.

Compound Engineering, Every

Loop Engineering, Addy Osmani

Build the Loop. Then Build the Boundary. (Part 1 of 2)

Over the past few days the vocabulary shifted under us. Peter Steinberger said stop prompting coding agents, start designing loops that prompt them. Boris Cherny, who runs Claude Code at Anthropic, said his job now is writing loops, not prompts.

Then Addy Osmani gave the thing a name, loop engineering, and the name was everywhere in about seventy-two hours. If you build software and you were anywhere near a feed this week, it found you.

I want to take the idea seriously, because it deserves that. I also want to push on the part everyone keeps skipping. Because the part everyone keeps skipping is the part that pays my bills, and I suspect it pays yours too.

What Loop Engineering Gets Right

Osmani’s framing is clean and I am not going to pretend otherwise. Loop engineering is replacing yourself as the person who prompts the agent. You build a system that finds the work, hands it to an agent, checks the result, writes down what happened, decides the next move. The unit of effort moves from the prompt to the loop.

He breaks a working loop into five pieces plus memory. Scheduled automations for discovery and triage. Worktrees so parallel agents don’t trample each other. Skills that hold the project knowledge the agent would otherwise guess at. Connectors into the tools you already live in.

Sub-agents that split the writer from the checker, because the model grading its own homework is too nice, his words and he’s right. Then a state file that survives between runs, since the agent forgets everything and the file forgets nothing.

Look at that list again. We have built every one of those pieces before, in deterministic systems, for twenty years. Cron jobs. Isolated workspaces. Runbooks. Integrations. Separation of duties. Durable state.

A loop is a control system, and control systems are home turf for us. That’s the good news, and it’s why I think senior engineers should feel something closer to recognition than dread here. The instincts transfer. They just need a new place to live.

And the vocabulary isn’t new either, worth saying plainly. Anthropic’s own Building Effective Agents post defined an agent as a model using tools based on environmental feedback in a loop. December 2024. The patterns were on paper before the buzzword showed up.

What actually changed is that the building blocks stopped being a pile of bash you babysat alone and started shipping inside the products themselves. Claude Code and Codex converged on nearly the same primitives, and when the loop shape goes tool-agnostic, the discourse follows it.

What the Discourse Keeps Skipping

There’s a line in Osmani’s piece that the excitement keeps stepping right over: a loop running unattended is also a loop making mistakes unattended.

Sit with that one. Everything hard about this practice is hiding inside it. The loop does not remove judgment from the work. It removes judgment from the moment of the work.

Every call you used to make turn by turn, you now make in advance, encode, and trust. That’s not less engineering. That’s more engineering, moved earlier, with worse consequences when you get it wrong, running at three in the morning while you sleep.

Cherny has been unusually candid that volume doesn’t equal quality. Running hundreds of agents hunting for things to build produces a lot of output, and by his own telling, a lot of that output isn’t worth acting on. The man who built the tool says this.

So the question was never whether the loop can generate work. It can, at volume, cheap. The question is what stands between that volume and your main branch.

Am I being unfair to the loop crowd? Maybe a little. Osmani names verification as the thing that gets sharper, not easier, as the loop improves. The writer-checker split exists in his five blocks for exactly this reason. The honest practitioners see the problem.

But naming a problem and engineering the answer are different altitudes, and most of what shipped this week stops at naming it.

The Loop Is the Easy Half

Strip the loop down and it has two halves. One half generates. The agent reasons, plans, writes code, calls tools, tries again. That half is stochastic by design, and the variance is the whole point. It’s where the leverage comes from.

The other half decides. Is this good enough to act on, to merge, to ship, to trigger the next iteration? And here’s the thing I’ll plant a flag on: that half has to be deterministic, or the loop is just entropy with a scheduler.

I can hear the objection already, because Claude Code’s own /goal command uses a model to judge whether the goal is met, and a model judge is not deterministic in any strict sense. Fine. The measurement can be stochastic. The decision rule cannot.

An LLM evaluator hands back a score; the threshold that promotes or blocks is a rule you wrote before the loop ever ran. Score comes back 0.7, gate says 0.8, blocked. Same evidence, same verdict, every time, and you can replay it. Fuzzy instruments, hard rules. We’ve run plants and flown planes on that arrangement for a long time.
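In code, the arrangement is almost embarrassingly small. A sketch, assuming the evaluator returns a float score; the function name and the sample scores are made up:

```python
GATE_THRESHOLD = 0.8  # the rule, written down before the loop runs


def decide(eval_score: float, threshold: float = GATE_THRESHOLD) -> str:
    """Deterministic decision rule over a possibly noisy measurement."""
    return "promoted" if eval_score >= threshold else "blocked"


# Scores as a model judge might return them across runs (made-up values).
scores = [0.7, 0.86, 0.79, 0.8]
verdicts = [decide(score) for score in scores]
```

The judge may score the same change differently on two runs. Given a recorded score, the verdict never varies.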

The boundary between those halves is where loop engineering becomes real engineering. The gates are evals, invariants, promotion criteria, blast radius limits. Things you can write down, run repeatedly, and trust when nobody is watching.

This is the ground we’ve been working in the AgenticOps Harness series, and the timing surprised me: I did not expect the broader discourse to hand us the vocabulary this fast.

The maturity ladder we laid out runs from operating an agent by hand to engineering the platform that governs fleets of them. Loop engineering, as described this week, is the middle of that ladder. The moment you stop driving and start designing. What comes after, and almost nobody is writing about it yet, is the discipline that makes the loop safe to stop watching.

Evidence, or It Didn’t Happen

Here’s the artifact I want you to picture, because the boundary stays abstract until you can hold it.

A serious loop doesn’t just leave behind code. It leaves behind a decision record. The claim it was acting on. The checks it ran. The eval scores. The diff. The token cost. The blast radius. The verdict, and the rule that produced the verdict.

decision: promote
claim: "WI-2026-0412: retry storm in sync worker"
checks: [build, unit, integration, mutation]
eval_score: 0.86
gate_threshold: 0.80
diff: 4 files, +61 -12
tokens_spent: 41200
blast_radius: sync worker only, behind feature flag
rule: promote when eval_score >= gate_threshold and all checks pass
verdict: promoted

When somebody asks, six weeks later, why did the gate say yes, you don’t reconstruct it from memory and vibes. You open the record and replay the decision.
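Replaying can be as simple as re-running the rule against the stored evidence and comparing the result to the stored verdict. A Python sketch under two assumptions: the record has been loaded into a dict, and each check is stored with its pass or fail result, which the record above does not show.

```python
# The record above as a dict. Per-check results are an assumption added
# for the example; the other values are copied from the record.
RECORD = {
    "claim": "WI-2026-0412: retry storm in sync worker",
    "checks": {"build": True, "unit": True, "integration": True, "mutation": True},
    "eval_score": 0.86,
    "gate_threshold": 0.80,
    "verdict": "promoted",
}


def replay(record: dict) -> str:
    """Re-apply the promotion rule to the evidence stored in the record."""
    all_checks_pass = all(record["checks"].values())
    meets_threshold = record["eval_score"] >= record["gate_threshold"]
    return "promoted" if all_checks_pass and meets_threshold else "blocked"


def verdict_reproduces(record: dict) -> bool:
    """True when the stored verdict matches what the rule says today."""
    return replay(record) == record["verdict"]


still_defensible = verdict_reproduces(RECORD)
```

If the replayed verdict ever disagrees with the stored one, either the evidence or the rule changed after the fact, and that disagreement is itself a finding.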

That’s what I mean by an evidence-based, defensible system, and I’ve started to think it’s the real shape of this whole discipline. A consequential action is only as good as the evidence attached to it, evaluated by rules written before the action ran.

I keep arriving at this same idea from different directions, in different projects, in domains that have nothing to do with each other, which is usually a sign the idea is load-bearing.

The loop crowd is selling leverage, and the leverage is real. But leverage without evidence is just speed you can’t account for. The systems that survive contact with production, with auditors, with the six-weeks-later question, are the ones where every decision carries its own receipts.

Where to Start, Honestly

If you haven’t built a loop yet, don’t start with the autonomous overnight fleet. Start with one loop, one repeating task, report-only. Let it find work and propose, not act. Watch what it gets wrong for a week.

The failures you collect that week are the raw material for your first real gates, and gates built from observed failures beat gates built from imagination every single time. That’s just TDD instinct, applied one level up.

And keep a number on it. Token spend per accepted change. A loop that burns eighty thousand tokens to produce a four-hundred-token answer isn’t a productivity story. It’s a cost story in a productivity costume.
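The number is cheap to compute if each run logs its token count and whether its change was accepted. A sketch, with a hypothetical run-log shape and made-up figures:

```python
def tokens_per_accepted_change(runs) -> float:
    """Total tokens spent across all runs, divided by accepted changes.

    Rejected runs still count toward spend; that is the point of the metric.
    Returns infinity when nothing was accepted.
    """
    total_tokens = sum(run["tokens"] for run in runs)
    accepted = sum(1 for run in runs if run["accepted"])
    if accepted == 0:
        return float("inf")
    return total_tokens / accepted


# Hypothetical run log.
runs = [
    {"tokens": 41_200, "accepted": True},
    {"tokens": 80_000, "accepted": False},
    {"tokens": 20_000, "accepted": True},
]

cost = tokens_per_accepted_change(runs)
```

Dividing by accepted changes instead of total runs is deliberate: the rejected run's eighty thousand tokens land on the bill of the two changes that shipped.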

The prompt was never the unit of work. Neither is the loop. The unit of work is a decision you can defend, made by a system you designed, backed by evidence you can replay, at a boundary you can name.

Build the loop. Then build the boundary. Then make the boundary keep receipts.

Next post: why the developers refusing to run loops are the ones thinking clearly, and the system I run that earns their kind of trust one gated signal at a time.

Let’s talk about it.

Loop Engineering, Addy Osmani

Peter Steinberger on designing loops

Building Effective Agents, Anthropic