Build the Loop. Then Build the Boundary. (Part 1 of 2)
Over the past few days the vocabulary shifted under us. Peter Steinberger said stop prompting coding agents, start designing loops that prompt them. Boris Cherny, who runs Claude Code at Anthropic, said his job now is writing loops, not prompts.
Then Addy Osmani gave the thing a name, loop engineering, and the name was everywhere in about seventy-two hours. If you build software and you were anywhere near a feed this week, it found you.
I want to take the idea seriously, because it deserves that. I also want to push on the part everyone keeps skipping. Because the part everyone keeps skipping is the part that pays my bills, and I suspect it pays yours too.
What Loop Engineering Gets Right
Osmani’s framing is clean and I am not going to pretend otherwise. Loop engineering is replacing yourself as the person who prompts the agent. You build a system that finds the work, hands it to an agent, checks the result, writes down what happened, decides the next move. The unit of effort moves from the prompt to the loop.
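To make that shape concrete, here is a minimal sketch of one pass through such a loop. Everything in it is an assumption for illustration: `find_work`, `run_agent`, and `check_result` are hypothetical stubs rather than any product's API, and the JSON state file is just one plausible way to write down what happened.

```python
import json
from pathlib import Path

# Hypothetical stand-ins: none of these names come from a real tool.

def find_work(state):
    """Discovery: return backlog items that earlier runs have not finished."""
    backlog = ["fix-flaky-test", "bump-dependency"]  # made-up work items
    return [item for item in backlog if item not in state["done"]]

def run_agent(item):
    """Agent stub: a real loop would call a model here."""
    return {"item": item, "patch": f"patch for {item}"}

def check_result(proposal):
    """Check stub: a real loop would run tests or evals here."""
    return proposal["patch"].startswith("patch for")

def load_state(path):
    """Read the state file if it exists, otherwise start empty."""
    if path.exists():
        return json.loads(path.read_text())
    return {"done": [], "log": []}

def run_once(state_path):
    """One pass: find work, hand it off, check it, record it, persist."""
    state = load_state(state_path)
    for item in find_work(state):
        proposal = run_agent(item)
        accepted = check_result(proposal)
        state["log"].append({"item": item, "accepted": accepted})
        if accepted:
            state["done"].append(item)
    state_path.write_text(json.dumps(state, indent=2))
    return state
```

Calling `run_once(Path("state.json"))` twice shows why the record matters: the second pass reads the file, finds nothing left to do, and adds no new log entries.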
He breaks a working loop into five pieces plus memory. Scheduled automations for discovery and triage. Worktrees so parallel agents don’t trample each other. Skills that hold the project knowledge the agent would otherwise guess at. Connectors into the tools you already live in.
Sub-agents that split the writer from the checker, because a model grading its own homework is too nice (his words, and he’s right). Then a state file that survives between runs, since the agent forgets everything and the file forgets nothing.
Look at that list again. We have built every one of those pieces before, in deterministic systems, for twenty years. Cron jobs. Isolated workspaces. Runbooks. Integrations. Separation of duties. Durable state.
A loop is a control system, and control systems are home turf for us. That’s the good news, and it’s why I think senior engineers should feel something closer to recognition than dread here. The instincts transfer. They just need a new place to live.
The vocabulary isn’t new either, and that’s worth saying plainly. Anthropic’s own Building Effective Agents post described agents as models using tools based on environmental feedback in a loop. December 2024. The patterns were on paper before the buzzword showed up.
What actually changed is that the building blocks stopped being a pile of bash you babysat alone and started shipping inside the products themselves. Claude Code and Codex converged on nearly the same primitives, and when the loop shape goes tool-agnostic, the discourse follows it.
What the Discourse Keeps Skipping
There’s a line in Osmani’s piece that the excitement keeps stepping right over: a loop running unattended is also a loop making mistakes unattended.
Sit with that one. Everything hard about this practice is hiding inside it. The loop does not remove judgment from the work. It removes judgment from the moment of the work.
Every call you used to make turn by turn, you now make in advance, encode, and trust. That’s not less engineering. That’s more engineering, moved earlier, with worse consequences when you get it wrong, running at three in the morning while you sleep.
Cherny has been unusually candid that volume doesn’t equal quality. Running hundreds of agents hunting for things to build produces a lot of output, and by his own telling, a lot of that output isn’t worth acting on. The man who built the tool says this.
So the question was never whether the loop can generate work. It can, at volume, cheap. The question is what stands between that volume and your main branch.
Am I being unfair to the loop crowd? Maybe a little. Osmani names verification as the thing that gets sharper, not easier, as the loop improves. The writer-checker split exists in his five blocks for exactly this reason. The honest practitioners see the problem.
But naming a problem and engineering the answer are different altitudes, and most of what shipped this week stops at naming it.
The Loop Is the Easy Half
Strip the loop down and it has two halves. One half generates. The agent reasons, plans, writes code, calls tools, tries again. That half is stochastic by design, and the variance is the whole point. It’s where the leverage comes from.
The other half decides. Is this good enough to act on, to merge, to ship, to trigger the next iteration? And here’s the thing I’ll plant a flag on: that half has to be deterministic, or the loop is just entropy with a scheduler.
I can hear the objection already, because Claude Code’s own /goal command uses a model to judge whether the goal is met, and a model judge is not deterministic in any strict sense. Fine. The measurement can be stochastic. The decision rule cannot.
An LLM evaluator hands back a score; the threshold that promotes or blocks is a rule you wrote before the loop ever ran. Score comes back 0.7, gate says 0.8, blocked. Same evidence, same verdict, every time, and you can replay it. Fuzzy instruments, hard rules. We’ve run plants and flown planes on that arrangement for a long time.
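A rough sketch of that split, assuming the evaluator returns a single score between 0 and 1. The `Gate` class and its method name are mine, invented for illustration; the point is only that the decision is a pure function of the score and a threshold fixed in advance.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gate:
    """Hypothetical gate: the threshold is set before the loop ever runs."""
    threshold: float

    def decide(self, eval_score: float) -> str:
        # Pure function of the evidence: no model call, no randomness.
        return "promote" if eval_score >= self.threshold else "block"

gate = Gate(threshold=0.8)
print(gate.decide(0.7))   # block
print(gate.decide(0.86))  # promote
```

The score may wobble from run to run. Given the same score, the verdict will not.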
The boundary between those halves is where loop engineering becomes real engineering. The gates are evals, invariants, promotion criteria, blast radius limits. Things you can write down, run repeatedly, and trust when nobody is watching.
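As one example of a gate you can write down, here is a sketch of a blast radius limit: a change passes only if every touched file sits under an allowed directory and the change stays under a file-count cap. The directory name and the cap are made-up values, and real limits would come from your own repository layout.

```python
import posixpath

# Hypothetical limits, chosen only for illustration.
ALLOWED_ROOTS = ("services/sync_worker/",)
MAX_FILES = 10

def within_blast_radius(changed_files, allowed_roots=ALLOWED_ROOTS, max_files=MAX_FILES):
    """Return True only if the change stays inside the allowed area."""
    if len(changed_files) > max_files:
        return False
    for path in changed_files:
        # Normalize first so "a/../b" cannot sneak out of an allowed root.
        normalized = posixpath.normpath(path)
        if not any(normalized.startswith(root) for root in allowed_roots):
            return False
    return True

print(within_blast_radius(["services/sync_worker/retry.py"]))     # True
print(within_blast_radius(["services/sync_worker/retry.py",
                           "services/billing/invoice.py"]))       # False
```

Nothing here asks a model for an opinion. The check gives the same answer at three in the morning as it does in review.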
This is the ground we’ve been working in the AgenticOps Harness series, and the timing surprised me: I did not expect the broader discourse to hand us the vocabulary this fast.
The maturity ladder we laid out runs from operating an agent by hand to engineering the platform that governs fleets of them. Loop engineering, as described this week, is the middle of that ladder. The moment you stop driving and start designing. What comes after, and almost nobody is writing about it yet, is the discipline that makes the loop safe to stop watching.
Evidence, or It Didn’t Happen
Here’s the artifact I want you to picture, because the boundary stays abstract until you can hold it.
A serious loop doesn’t just leave behind code. It leaves behind a decision record. The claim it was acting on. The checks it ran. The eval scores. The diff. The token cost. The blast radius. The verdict, and the rule that produced the verdict.
decision: promote
claim: "WI-2026-0412: retry storm in sync worker"
checks: [build, unit, integration, mutation]
eval_score: 0.86
gate_threshold: 0.80
diff: 4 files, +61 -12
tokens_spent: 41200
blast_radius: sync worker only, behind feature flag
rule: promote when eval_score >= threshold and all checks pass
verdict: promoted
When somebody asks, six weeks later, why did the gate say yes, you don’t reconstruct it from memory and vibes. You open the record and replay the decision.
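Replaying can be as plain as re-running the recorded rule over the recorded evidence and checking that the verdict still comes out the same. The sketch below assumes the record also stores a pass or fail result per check, which the example above only lists by name, and it assumes "blocked" is the opposite verdict. The class and function names are hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class DecisionRecord:
    claim: str
    checks: dict           # check name -> passed? (assumed shape)
    eval_score: float
    gate_threshold: float
    verdict: str           # what the gate said at the time

def apply_rule(record):
    """Rule from the record: promote when eval_score >= threshold and all checks pass."""
    ok = record.eval_score >= record.gate_threshold and all(record.checks.values())
    return "promoted" if ok else "blocked"

def replay(record):
    """Re-run the rule on the stored evidence; True means the verdict reproduces."""
    return apply_rule(record) == record.verdict

record = DecisionRecord(
    claim="WI-2026-0412: retry storm in sync worker",
    checks={"build": True, "unit": True, "integration": True, "mutation": True},
    eval_score=0.86,
    gate_threshold=0.80,
    verdict="promoted",
)
print(replay(record))                            # True
print(replay(replace(record, eval_score=0.70)))  # False: evidence no longer supports the verdict
```

If a replay ever disagrees with the stored verdict, either the record was altered or the rule changed underneath it, and both are worth knowing about.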
That’s what I mean by an evidence-based, defensible system, and I’ve started to think it’s the real shape of this whole discipline. A consequential action is only as good as the evidence attached to it, evaluated by rules written before the action ran.
I keep arriving at this same idea from different directions, in different projects, in domains that have nothing to do with each other, which is usually a sign the idea is load-bearing.
The loop crowd is selling leverage, and the leverage is real. But leverage without evidence is just speed you can’t account for. The systems that survive contact with production, with auditors, with the six-weeks-later question, are the ones where every decision carries its own receipts.
Where to Start, Honestly
If you haven’t built a loop yet, don’t start with the autonomous overnight fleet. Start with one loop, one repeating task, report-only. Let it find work and propose, not act. Watch what it gets wrong for a week.
The failures you collect that week are the raw material for your first real gates, and gates built from observed failures beat gates built from imagination every single time. That’s just TDD instinct, applied one level up.
And keep a number on it. Token spend per accepted change. A loop that burns eighty thousand tokens to produce a four-hundred-token answer isn’t a productivity story. It’s a cost story in a productivity costume.
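One way to compute that number, as a sketch: charge every run's tokens, including the rejected ones, against the changes that actually got accepted. The function name and the sample runs are invented for illustration.

```python
def tokens_per_accepted_change(runs):
    """runs: list of (tokens_spent, accepted) pairs. Rejected runs still cost tokens."""
    runs = list(runs)
    total_tokens = sum(tokens for tokens, _ in runs)
    accepted = sum(1 for _, ok in runs if ok)
    if accepted == 0:
        return float("inf")  # all cost, nothing to show for it
    return total_tokens / accepted

# Made-up week: two accepted changes, one expensive rejection.
week = [(41200, True), (80000, False), (12000, True)]
print(tokens_per_accepted_change(week))  # 66600.0
```

Counting rejected runs in the numerator is a deliberate choice here. Leaving them out would flatter exactly the loops that waste the most.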
The prompt was never the unit of work. Neither is the loop. The unit of work is a decision you can defend, made by a system you designed, backed by evidence you can replay, at a boundary you can name.
Build the loop. Then build the boundary. Then make the boundary keep receipts.
Next post: why the developers refusing to run loops are the ones thinking clearly, and the system I run that earns their kind of trust one gated signal at a time.
Let’s talk about it.