Category: AgenticOps

They Can Watch. They Cannot Stop.

An agent misbehaves. The dashboard lights up. An operator sees exactly what is happening: the scope violation, the unauthorized data access, the lateral movement across systems. And then nothing. No kill switch. No isolation. No way to stop what they are watching unfold in real time. This is not a hypothetical. The Kiteworks Data Security Forecast surveyed 225 security, IT, and risk leaders across ten industries and eight regions, and the results say this is the default state of enterprise AI governance. Organizations built the screens. They did not build the brakes.

The four containment rings from the AgenticOps model predict exactly where the failure occurs and in what order to fix it. Observation is one ring. Organizations treated it as all four.

The Numbers

63% of organizations cannot enforce purpose limitations on their AI agents. They know what agents should do. They cannot technically prevent other actions. 60% cannot terminate a misbehaving agent. They can see the agent doing something unexpected. They cannot stop it. 55% cannot isolate AI systems from broader network access. The agent has access to systems beyond its intended scope, and the organization has no mechanism to restrict that access after deployment.

Meanwhile, 58% report continuous monitoring. 59% report human-in-the-loop oversight. 56% report data minimization practices. They can observe their agents, log what agents do, and flag anomalies.

The gap between observation and containment is 15 to 20 percentage points. The security community calls this the governance-containment gap. Most organizations can see an AI agent doing something unexpected, but they cannot prevent it from exceeding its authorized scope, cannot shut it down, and cannot isolate it from sensitive systems.

Mapping the Gap to the Four Rings

The four containment rings from How Agents Stay in Bounds provide a structural explanation for this gap. Each ring addresses a different containment boundary. The Kiteworks data reveals which rings organizations invested in and which they skipped.

Ring 1 is constrain inputs: scope what the agent receives before it starts. Purpose limitation is a Ring 1 control. It defines what the agent is allowed to do by limiting the context, tools, and data it can access. 63% of organizations cannot enforce this ring. They give agents broad access and rely on instructions to stay in scope. How Agents Stay in Bounds described this as the difference between policy and infrastructure. Policy says “only access these systems.” Infrastructure says “these are the only systems that exist in your environment.” Most organizations chose policy.
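A minimal sketch of what "infrastructure, not policy" might look like for Ring 1: a runtime that only wires allowlisted tools into the agent's environment, so an out-of-scope action fails structurally rather than by instruction. The tool names and registry here are hypothetical, not a real agent framework's API.

```python
# Hypothetical sketch: Ring 1 purpose limitation enforced in the runtime,
# not in the prompt. Tools outside the allowlist do not exist for the agent.

ALLOWED_TOOLS = {"read_ticket", "draft_reply", "search_kb"}  # scoped at design time

def dispatch(tool_name: str, registry: dict):
    """Resolve a tool call against the scoped registry.

    An out-of-scope tool behaves like a nonexistent one: the agent
    cannot invoke what was never wired into its environment.
    """
    if tool_name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool '{tool_name}' is not in this agent's scope")
    return registry[tool_name]

registry = {
    "read_ticket": lambda tid: f"ticket {tid}",
    "draft_reply": lambda text: f"draft: {text}",
    "search_kb":   lambda q: f"results for {q}",
    "delete_user": lambda uid: f"deleted {uid}",  # exists in the system, not in scope
}

print(dispatch("read_ticket", registry)("42"))  # in scope: resolves normally
try:
    dispatch("delete_user", registry)           # blocked structurally
except PermissionError as e:
    print(e)
```

The point of the sketch is the placement of the check: the allowlist lives in the dispatcher, not in the agent's instructions, so no prompt injection or drifting context can route around it.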

Ring 2 is constrain environment: physically isolate the agent’s execution. Network isolation is a Ring 2 control. It prevents the agent from reaching systems outside its intended scope regardless of what it attempts. 55% cannot enforce this ring. The agent runtime has network access to the broader environment, and the only barrier is the agent’s instructions not to use it. OpenClaw Is Not an AI Assistant demonstrated how Docker sandboxes, tool sandboxing, and network allowlists create physical containment around agent runtimes. The organizations in this survey do not have that infrastructure.
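In production, Ring 2 lives in infrastructure (Docker network policy, firewall rules, egress proxies), but the rule itself is simple enough to show as code. This is a hedged sketch of an egress allowlist, with hypothetical internal hostnames standing in for a real environment.

```python
# Hypothetical sketch: Ring 2 network isolation as an egress allowlist.
# The same rule would normally be enforced below the runtime, in network
# configuration, so the agent cannot bypass it regardless of what it attempts.

from urllib.parse import urlparse

EGRESS_ALLOWLIST = {"api.internal.example", "kb.internal.example"}  # assumed hosts

def guarded_fetch(url: str) -> str:
    """Permit outbound requests only to hosts inside the sandbox boundary."""
    host = urlparse(url).hostname
    if host not in EGRESS_ALLOWLIST:
        # The agent's instructions are irrelevant here: the boundary is physical.
        raise ConnectionRefusedError(f"egress to '{host}' is outside the sandbox")
    return f"fetched {url}"  # placeholder for the real request

print(guarded_fetch("https://kb.internal.example/articles/7"))
try:
    guarded_fetch("https://payroll.internal.example/export")
except ConnectionRefusedError as e:
    print(e)
```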

Ring 3 is validate outputs: evaluate what the agent produced before it takes effect. This is where the 58% monitoring figure lives. Organizations invested here. They can observe agent behavior, flag anomalies, and log actions for review. This ring is necessary but insufficient on its own. Observation without the ability to act on what you observe is surveillance, not governance.

Ring 4 is gate promotion: prevent unverified work from reaching production systems. Kill switches are a Ring 4 control. They prevent or reverse the promotion of agent actions that violate criteria. 60% of organizations cannot terminate a misbehaving agent. They can see the agent is misbehaving through their Ring 3 investment. They cannot stop it because they never built Ring 4.
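The shape of a Ring 4 control can be sketched in a few lines: no agent action takes effect without passing evaluation, and a failed evaluation both blocks promotion and terminates the run. The class and criteria below are illustrative, not a specific product's API.

```python
# Hypothetical sketch: Ring 4 as a promotion gate plus kill switch.
# Nothing the agent produces takes effect until the gate passes, and a
# tripped gate can also halt the run entirely.

class AgentRun:
    def __init__(self):
        self.alive = True
        self.promoted = []

    def terminate(self):
        """The brake that 60% of surveyed organizations lack."""
        self.alive = False

    def promote(self, action: dict, evaluate) -> bool:
        """Apply an agent action only if evaluation passes; otherwise halt."""
        if not self.alive:
            return False
        if not evaluate(action):
            self.terminate()
            return False
        self.promoted.append(action)
        return True

in_scope = lambda action: action.get("scope") == "authorized"

run = AgentRun()
print(run.promote({"scope": "authorized", "op": "update_record"}, in_scope))   # True
print(run.promote({"scope": "lateral", "op": "read_other_tenant"}, in_scope))  # False: gate trips
print(run.alive)  # False: the run is terminated, not merely logged
```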

| Ring | Control | Capability | Organizations lacking |
|------|---------|------------|-----------------------|
| 1 | Constrain inputs | Purpose limitation | 63% |
| 2 | Constrain environment | Network isolation | 55% |
| 3 | Validate outputs | Continuous monitoring | 42% |
| 4 | Gate promotion | Kill switch / termination | 60% |

The pattern is clear. Organizations invested in Ring 3 and skipped Rings 1, 2, and 4. They built the middle of the defense and left the boundaries open. This is precisely what How Agents Stay in Bounds predicted: every ring you skip is a ring you inherit as runtime risk.

Why Organizations Built Observation First

The investment pattern is not irrational. Observation is easier to implement than containment. Monitoring can be added to existing agent deployments without redesigning the architecture. Log aggregation, anomaly detection, and human review queues are familiar patterns that security teams already understand from traditional application monitoring.

Containment requires architectural decisions made before deployment. Ring 1 requires scoping the agent’s context at design time. Ring 2 requires sandbox infrastructure that isolates the runtime. Ring 4 requires promotion gates that can halt or reverse agent actions. These are not things you bolt on after the agent is running. They are things you build into the deployment architecture from the start.

You Can Build This. Three Artifacts and a Sandbox. described the build order: constraint first, then agent definition, then gate, then sandbox. That order is deliberate. You define what the agent can do before you define the agent. You define what success looks like before you let the agent run. You build the sandbox before you execute inside it. The organizations in the Kiteworks survey did it the other way around. They deployed agents, then added monitoring, and now they are discovering that monitoring without containment leaves them watching problems they cannot fix.

Government Is in the Worst Position

The Kiteworks data includes a sector breakdown that makes the structural argument even more concrete. Government agencies are in the worst position across every containment metric. 90% lack purpose-binding controls. 76% lack kill switches. A third have no dedicated AI controls at all.

This is not because government agencies are less capable. It is because they face the most acute version of the structural problem. Procurement cycles are long. Architecture decisions are locked into multi-year contracts. The agents are deployed by vendors into environments the agency does not control. Ring 1 requires scoping the agent’s context, but the agency may not have visibility into what context the vendor’s agent receives. Ring 2 requires environmental isolation, but the agent may run in the vendor’s cloud, not the agency’s infrastructure.

The containment model assumed the organization controls the deployment environment. When that assumption breaks, when the agent runtime is managed by a third party, containment becomes a contractual problem, not just a technical one. The four rings still apply, but the implementation shifts from infrastructure configuration to vendor requirements and service-level agreements.

100% Have Agents on the Roadmap

The most striking number in the survey is not about failures. It is about ambition. 100% of organizations surveyed have agentic AI on their roadmap. Zero exceptions. Across 225 organizations, ten industries, eight regions, not a single organization plans to abstain from agent deployment.

This means the governance-containment gap is not a temporary condition that some organizations will avoid. It is the default starting position for every organization that deploys agents. The question is not whether an organization will face this gap. The question is whether they close it before something goes wrong.

What AgenticOps Actually Looks Like introduced three non-negotiables: no generation without defined contracts, no promotion without evaluation, no runtime without observability. The Kiteworks data suggests a fourth: no deployment without containment. Observability alone is not governance. It is the prerequisite for governance, not the thing itself.

Closing the Gap

The gap closes in a specific order. Ring 1 first: define what each agent is allowed to do, not in policy documents but in scoped contexts and tool allowlists. Ring 2 next: isolate the agent’s execution environment so that the boundary is physical, not instructional. Ring 4 last: build promotion gates that can halt or reverse agent actions when evaluation criteria fail. Ring 3, observation, is what most organizations already have. The gap is everything else.
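The four rings work as a diagnostic checklist as well as a build order. A minimal sketch, assuming a deployment inventory where each entry records which rings are implemented; every missing ring is inherited runtime risk.

```python
# Hypothetical sketch: mapping agent deployments against the four rings.
# The inventory format is illustrative; the rings are from the AgenticOps model.

RINGS = {
    1: "constrain inputs",
    2: "constrain environment",
    3: "validate outputs",
    4: "gate promotion",
}

def ring_gaps(deployment: dict) -> list[str]:
    """Return the named rings a deployment has not implemented."""
    implemented = set(deployment.get("rings", []))
    return [RINGS[r] for r in sorted(RINGS) if r not in implemented]

# The survey's typical organization: Ring 3 only.
typical = {"name": "support-agent", "rings": [3]}
print(ring_gaps(typical))  # everything except observation is missing
```

Run against the survey's typical organization, the diagnosis comes back as three gaps: inputs, environment, and promotion. That is the "watching without brakes" state in one line of output.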

This is the same build order from You Can Build This applied at enterprise scale. Write a constraint. Define the agent. Write a gate. Run it in a sandbox. The artifacts are different when the agent is a customer service bot instead of a code generator, but the pattern is identical. Scope the input. Isolate the environment. Verify the output. Gate the promotion.

The Agents of Chaos study also found that 75% of leaders will not let security concerns slow their AI deployment. This means the gap will not close by slowing down. It will close by building containment infrastructure that runs at deployment speed. How Agents Stay in Bounds made this argument for individual agents. The Kiteworks data makes it for entire organizations.

The model was built to describe this problem. Now the problem has data. The four rings are no longer a conceptual model. They are a diagnostic tool. Map your agent deployments against the four rings. Where you have gaps, you have risk. Where you have all four, you have governance.

The data says most organizations are watching. Watching is not governing. Governing requires the ability to stop what you are watching. That requires infrastructure, not observation.

Let’s talk about it.

Your Code Works. Nobody Knows Why.

There is a new kind of debt accumulating in AI-assisted codebases. It does not show up in failing builds or broken deployments. The tests pass. Features ship. The system runs. But the shared understanding of why the system works the way it does is fragmenting, silently, faster than anyone expected. Five independent researchers named the same problem within six weeks of each other. They are calling it cognitive debt, and it may be the most important concept in AI-assisted development right now.

A Name for the Problem

In February 2026, Margaret Storey published a post that gave the velocity-comprehension gap a precise definition. Cognitive debt is the accumulated gap between a system’s evolving structure and a team’s shared understanding of how and why that system works and can be changed over time. Unlike technical debt, which surfaces through failing builds and broken deployments, cognitive debt “tends not to announce itself through failing builds but rather shows up through a silent loss of shared theory.”

Simon Willison amplified the concept. Addy Osmani independently arrived at the same idea with “Comprehension Debt,” arguing that code has become cheaper to produce than to perceive. Five independent authors converged on the same concept within six weeks. When that happens, the concept is not a trend. It is a structural observation that multiple people noticed simultaneously because the underlying condition became impossible to ignore.

Storey grounded the concept in Peter Naur’s 1985 theory that a program is not its source code but a theory distributed across the minds of the people who built it. The source code is an artifact of that theory. When the theory fragments, when nobody on the team can explain why certain design decisions were made or how different parts of the system interact, the system becomes unmaintainable even when it functions correctly. AI accelerates this fragmentation because it generates structure faster than shared understanding can stabilize.

Why AI Makes It Exponentially Worse

Most Software Is Just CRUD. That’s Not the Problem. established an economic principle: anything that becomes cheap gets overproduced. AI made code generation cheap. The predictable result is overproduction of structure. More files, more functions, more abstractions, more system surface area, all generated faster than any team can internalize.

The traditional development cycle had a natural governor on cognitive debt. Writing code by hand forced the author to understand what they were building, at least at the moment of creation. The understanding might fade over time, but it existed at the origin. AI-assisted development removes that governor entirely. The agent generates a module. The developer reviews the output. The tests pass. The module ships. Six months later, nobody remembers why that module handles edge cases the way it does, because nobody decided it should handle them that way. The agent decided, and the decision was reasonable enough to pass review.

This is the scenario I described in I Was a 1x Coder at Best. AI Made Me a 0x Coder. when I said I write zero lines of code by hand. The 0x workflow works because I invested in containment infrastructure. But the honest version of that story includes the risk: every module the agent produces is a module whose design rationale exists only in the conversation context that created it. Once that context window closes, the rationale is gone unless something captures it.

Storey’s student teams demonstrated this concretely. By weeks seven and eight, students using AI assistance could not explain their own system’s architecture. The code worked. The tests passed. The team had lost the theory. They could not modify the system because they did not understand the design decisions that shaped it.

Technical Debt Has a Feedback Signal. Cognitive Debt Does Not.

Technical debt announces itself. Builds break. Deployments fail. Performance degrades. The system tells you something is wrong, and you fix it or you don’t. Either way, you know. Cognitive debt is silent. The system continues to function. Features continue to ship. The team continues to deliver. The problem only surfaces when someone needs to change something they do not understand, and by then the cost of understanding has compounded to the point where the change is more expensive than rewriting.

This asymmetry explains why cognitive debt is more dangerous than technical debt in AI-assisted environments. Technical debt accumulates linearly. You cut a corner, you pay interest, you can measure the interest and decide when to pay it down. Cognitive debt accumulates exponentially because each new module generated without shared understanding increases the surface area of the system that nobody fully grasps. The team’s capacity to understand does not grow. The system does.

Verification Beats Debugging described verification pipelines that make debugging unnecessary. Mutation testing, coverage gates, complexity limits. These are essential, and they address technical correctness. They do not address cognitive debt. A system can pass every verification gate while the team’s understanding of it decays. The gates prove the code works. They do not prove anyone knows why.

Knowledge Compression Is the Structural Response

What AgenticOps Actually Looks Like introduced six layers of the AgenticOps model. The sixth layer is Knowledge Compression. At the time, it may have seemed like an afterthought, a cleanup step after the important work of generation, evaluation, and promotion. Cognitive debt makes it clear that knowledge compression is not the last step. It is the step that determines whether the system remains governable over time.

Knowledge compression is the systematic reduction of system complexity into navigable artifacts. Not documentation in the traditional sense. Not the kind of documentation that gets written once and never updated. Compression artifacts are living summaries that capture the theory of the system in forms that humans can navigate quickly. One-page summaries. Lifecycle maps. Invariant lists. Decision logs that record not just what was decided but why, including the alternatives that were considered and rejected.

I’ve Never Fully Understood the Systems I Work In introduced these concepts: compression artifacts, lifecycle maps, invariant lists. At the time they were part of the argument that containment replaces comprehension. Cognitive debt makes the argument concrete. Without compression, the theory fragments. With compression, the theory is captured in artifacts that persist beyond the conversation context that created the code.

Compression Operates at Every Level

Compression is broader than decision logs. It operates at every level where theory can be lost, from individual commits to system architecture.

At the code level, compression means collapsing noisy history into navigable history. A feature implemented across thirty story-level commits contains the full record of false starts, revisions, and incremental progress. That granularity is useful during development and useless afterward. Compressing those thirty commits into a single feature-level commit with a clear description of what changed and why preserves the theory without preserving the noise. The detailed history still exists in the branch. The compressed version is what future developers encounter first.

At the decision level, compression means capturing the reasoning chain that led to each significant choice. A system I have been building tracks this through a document pipeline: a problem document captures the problem in the language of the requester, a research document explores the problem space, an options document maps the solution space, a decision document explains which option was chosen and why the alternatives were rejected, and a plan document breaks the chosen solution into deliverable work items. Each document compresses a phase of reasoning into a navigable artifact. Six months later, when someone asks “why does this system handle authentication this way,” the answer is not buried in a Slack thread or a closed pull request. It is in the decision document, linked to the options that were considered and the problem that started the whole chain.
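The traceability the pipeline provides can be sketched as a linked data structure: each artifact points back to the phase of reasoning that produced it, so any question about the system can be walked back to the originating problem. The document kinds match the pipeline described above; the titles and schema are illustrative.

```python
# Hypothetical sketch: the document pipeline as linked compression artifacts.
# Each document points to its parent, so production questions trace back
# to the problem that started the chain.

from dataclasses import dataclass

@dataclass
class Doc:
    kind: str              # problem | research | options | decision | plan
    title: str
    parent: "Doc | None" = None

def trace(doc: Doc) -> list[str]:
    """Walk from any artifact back to the originating problem."""
    chain = []
    while doc is not None:
        chain.append(f"{doc.kind}: {doc.title}")
        doc = doc.parent
    return chain

problem  = Doc("problem", "sessions expire mid-checkout")
research = Doc("research", "session-store landscape", problem)
options  = Doc("options", "sticky sessions vs shared store", research)
decision = Doc("decision", "shared store; sticky rejected for failover", options)
plan     = Doc("plan", "migrate session reads, then writes", decision)

print(trace(plan)[-1])  # the chain always ends at the original problem
```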

At the architecture level, compression means maintaining living documents that describe the system as it actually runs, not as it was originally designed. Blueprints capture the current architecture. Guides capture how users and developers interact with it. When a work item changes a feature, the documents that describe that feature update as part of the same workflow. The architecture documentation and the code stay in sync because they are governed by the same process, not maintained by separate teams on separate schedules.

This is the feedback loop from What AgenticOps Actually Looks Like, made concrete. Compressed knowledge feeds back into intent. The decision documents inform the next problem document. The blueprints inform the next options document. The guides inform the next plan. Without that loop, each development cycle starts from scratch. With it, each cycle builds on the accumulated theory.

Compression Is Not Documentation

Traditional documentation fails because it is a separate activity from development. It gets written after the fact, if at all, and it drifts out of sync with the code the moment someone changes something without updating the docs. The document pipeline described above succeeds because it is the development process, not a parallel activity. The problem document is written before development starts. The decision document is written before implementation begins. The blueprint updates when the feature ships. There is no separate “documentation phase” that can be skipped under deadline pressure.

This is the same pattern from You Can Build This: constraints define the space, agents work inside it, gates verify the output. Adding compression to the gate criteria means the system self-documents as it builds. No promotion without a decision document. No merge without a compressed commit history. No deployment without updated blueprints. These are not burdensome when agents generate them alongside the code. They are only burdensome when a human writes them by hand after the fact.
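What "compression as a gate criterion" might look like, sketched as a promotion check: the gate refuses to pass work that lacks its compression artifacts. The artifact names and work-item fields are assumptions for illustration, not a prescribed schema.

```python
# Hypothetical sketch: compression artifacts as promotion-gate criteria.
# Promotion fails unless the work item ships with its decision document,
# updated blueprint, and a compressed (squashed) commit history.

REQUIRED_ARTIFACTS = ("decision.md", "blueprint.md")  # illustrative names

def compression_gate(work_item: dict) -> tuple[bool, list[str]]:
    """Return (passed, missing) for a work item's compression artifacts."""
    missing = [a for a in REQUIRED_ARTIFACTS
               if a not in work_item.get("artifacts", [])]
    if work_item.get("commits", 0) > 1:
        missing.append("squashed history")
    return (not missing, missing)

# Thirty story-level commits and no blueprint update: the gate refuses.
ok, missing = compression_gate({"artifacts": ["decision.md"], "commits": 30})
print(ok, missing)
```

Because agents can generate these artifacts alongside the code, the gate adds verification overhead, not authoring overhead, which is the whole argument for making compression structural rather than behavioral.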

The result is that any team member, or any agent, can reconstruct the theory of the system by reading the compressed artifacts. They can trace from a production behavior back through the blueprint, to the decision that shaped it, to the options that were considered, to the problem that started the work. That traceability is the structural defense against cognitive debt. The theory is not in anyone’s head. It is in the repository, navigable and current.

The Velocity-Comprehension Contract

Cognitive debt reframes the relationship between velocity and understanding. The traditional framing was: move fast, accumulate debt, pay it down later. The cognitive debt framing is: move fast, lose the theory, discover later that you cannot pay it down because you no longer understand what you built.

The AgenticOps response is not to slow down. It is to make compression a structural requirement of the development process. Generate fast. Compress immediately. The velocity stays. The theory stays too. This is what I’ve Never Fully Understood the Systems I Work In meant by “I scale containment, not understanding.” The containment now includes containing the theory of the system before it dissipates.

The industry is beginning to recognize this. Storey’s follow-up post synthesized the community response and noted that teams are experimenting with checkpoints where they rebuild shared understanding through code reviews, retrospectives, and knowledge-sharing sessions. These are behavioral responses, necessary but insufficient. The structural response is to make the checkpoints automatic, to embed them in the gate criteria so they cannot be skipped when deadlines compress.

Cognitive debt is real. It is measurable. It is accelerating. And the sixth layer of the model, the one that might have looked like cleanup, is a primary defense against it.

Let’s talk about it.

Agentic Engineering Is a Practice. AgenticOps Is the Infrastructure.

In early 2026, multiple people working independently arrived at the same conclusion about how professional developers should work with AI coding agents. They converged on a term: agentic engineering. The practices they describe are correct. But practices live in behavior, and behavior degrades under pressure. Nothing in the agentic engineering model enforces the practice when the practitioner is tired, rushed, or outnumbered.

The Industry Converged

In early 2026, Andrej Karpathy proposed the term “agentic engineering” to describe what professional developers actually do when they work with AI coding agents. Addy Osmani wrote the definitive guide. IBM published a formal definition. Simon Willison drew the line that separates it from vibe coding: “I think the borderline is when you take responsibility for the code, and stop blaming the LLM for any mistakes.”

The core claims are familiar. Humans own architecture and quality. Agents handle implementation. Testing is the single biggest differentiator between disciplined and undisciplined AI-assisted work. Specifications before prompts produce better output than prompts alone. These are not new observations. I’ve Never Fully Understood the Systems I Work In described the same boundary between human judgment and agent execution. How Agents Stay in Bounds formalized it as four rings of containment. The language differs. The structure is the same.

What makes this convergence meaningful is not that smart people agree. It is that they agree independently, from different starting points, using different evidence. When multiple observers reach the same conclusion through different paths, the conclusion is likely structural rather than stylistic.

The Practice Is Necessary But Not Sufficient

Agentic engineering describes how a developer should work. Write specifications first. Review agent output with the same rigor as human code. Test relentlessly. Maintain architectural discipline. These are practices, and they are correct. Osmani’s observation that “agentic engineering actually rewards strong fundamentals more than traditional development” matches exactly what I found building the system described in You Can Build This. Three Artifacts and a Sandbox.

The problem is that practices depend on practitioners. A developer following agentic engineering principles produces governed output. A developer who skips the specification step, or merges without reviewing the diff, or runs without tests, produces ungoverned output. The practice has no mechanism to enforce itself. It relies on discipline, and discipline degrades under pressure. Deadlines compress. Scope expands. The careful developer who reviews every diff at 10 AM rubber-stamps the last three at 6 PM.

Any approach that lives in behavior rather than infrastructure has this limitation. Verification Beats Debugging made the same argument in the AgenticOps Applied series: the fix is not better discipline, it is verification pipelines that make discipline unnecessary. What matters is what happens when developers do not follow the practice, because eventually they will not.

Where Agentic Engineering Stops

The agentic engineering literature covers three activities well. Intent specification: write a design document or task description before prompting. Agent generation: let the agent implement within a scoped context. Evaluation: review the output, run tests, verify correctness. These map to the first three layers of the AgenticOps model from What AgenticOps Actually Looks Like: Intent, Agent Generation, and Evaluation.

The literature is largely silent on what happens after evaluation. Promotion, the controlled movement of verified work into production environments, is rarely addressed. Runtime governance, the observation and constraint of agent behavior in live systems, appears only in security-focused discussions. Knowledge compression, the systematic reduction of system complexity into navigable artifacts, is almost entirely absent.

These are layers four through six. They are the layers that turn a development practice into an operational system. Without them, agentic engineering produces high-quality work in development and hopes it stays that way in production.

The feedback loop matters. Knowledge compression feeds back into intent. The compressed understanding of how the system behaves in production shapes the next round of specifications. Without that loop, each development cycle starts fresh. With it, each cycle compounds on what the system learned about itself.

Containment Is Not a Practice

The four containment rings from How Agents Stay in Bounds illustrate the difference most clearly.

  1. Constrain inputs.
  2. Constrain environment.
  3. Validate outputs.
  4. Gate promotion.

These are not things a developer does. They are things the system enforces.

Ring 1, constraining inputs, means the agent receives a scoped context defined by skill files and schemas. The developer did not decide at runtime what to include. The constraint existed before the agent started. Ring 2, constraining the environment, means the agent runs in a sandbox that physically prevents access to anything outside the workspace. The developer did not have to trust the agent not to wander. The environment made wandering impossible. Ring 3, validating outputs, means automated gates evaluate the result against measurable criteria. The developer did not have to judge quality from a diff. The gate returned a score. Ring 4, gating promotion, means nothing moves forward without evidence. The developer did not have to remember to check. The system refused to proceed without passing the gate.
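The four rings composed into a single enforced run can be sketched in a few lines. Every name here is a stand-in: the agent, sandbox, and evaluator are placeholders for a real runtime, isolation layer, and scoring gate, and the threshold is an assumed criterion.

```python
# Hypothetical sketch: the four rings as one enforced pipeline.
# Each step is system behavior, not developer discipline.

def contained_run(task: str, agent, *, allowed_tools, sandbox, evaluate,
                  threshold=0.9):
    """Ring 1: scoped tools. Ring 2: sandboxed execution.
    Ring 3: scored evaluation. Ring 4: promotion only on evidence."""
    output = sandbox(lambda: agent(task, allowed_tools))  # rings 1 and 2
    score = evaluate(output)                              # ring 3
    if score < threshold:                                 # ring 4
        raise RuntimeError(f"gate failed: score {score:.2f} below {threshold}")
    return output

# Stand-ins for a real runtime:
agent    = lambda task, tools: f"result of {task} using {sorted(tools)}"
sandbox  = lambda fn: fn()       # placeholder for a real isolation layer
evaluate = lambda out: 0.95      # placeholder for a real scoring gate

print(contained_run("refactor module", agent,
                    allowed_tools={"read", "write"},
                    sandbox=sandbox, evaluate=evaluate))
```

The developer never decides whether to apply the rings; the only path through `contained_run` passes through all four.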

OpenClaw Is Not an AI Assistant demonstrated this with three isolation layers and multiple containment rings around a real agent runtime: Docker sandboxes, tool sandboxes, allowlists, network restrictions, and human approval gates. The containment was not a developer practice. It was infrastructure configuration. The agent could not violate the boundary because the boundary was physical, not procedural.

Agentic engineering asks the developer to be disciplined. AgenticOps builds the system so that discipline is the default and violation is structurally difficult. The distinction is the same one from How Agents Stay in Bounds: policy says “don’t.” Infrastructure says “can’t.”

Practice Drifts. Infrastructure Holds.

A Hacker News commenter captured the concern precisely: “The effects of vibe coding destroy trust inside teams and orgs, between engineers.” The damage comes not from individual failures but from inconsistency. One developer follows the practice. Another does not. The codebase contains governed and ungoverned output that looks identical in a pull request.

Infrastructure eliminates this variance. When every agent session runs inside the same sandbox, uses the same constraint files, and passes through the same gates, the output quality has a floor. Individual developers can exceed the floor. They cannot go below it. Most Software Is Just CRUD. That’s Not the Problem. argued that the danger is cheap generation without constraint discipline. Infrastructure is how constraint discipline survives contact with a team of twenty developers, varying skill levels, and a Friday afternoon deadline.

The Treasure Data case study illustrates this concretely. They tried speed without structure first. Then they embedded governance into infrastructure, not policy documents. The result was one engineer shipping a production AI tool in an hour. The constraint was the accelerator. Governance infrastructure made them faster because it removed the decision overhead that slowed them down. Every developer on the team produced output that met the same quality bar because the quality bar was enforced by the system, not by individual judgment.

The Build Plan Still Works

The three artifacts from You Can Build This are the implementation of this distinction.

  1. Constraints are Ring 1.
  2. Agent definitions with tool allowlists and forbidden lists are Ring 2.
  3. Gates with measurable success criteria are Rings 3 and 4.

The sandbox makes containment physical. None of these require the developer to remember to be disciplined. They require the developer to define the constraint once and let the system enforce it on every run.

This is why the convergence matters. Agentic engineering identified the right practices. AgenticOps provides the infrastructure to make those practices the default. The industry does not need to choose between them. It needs both. The practice tells you what to do. The infrastructure ensures you actually do it.

I am glad the industry arrived at the same conclusion. The next question is whether they will build the infrastructure or stop at the practice.

Let’s talk about it.

I Was a 1x Coder at Best. AI Made Me a 0x Coder.

Over four posts I built an argument. Total understanding is a myth. Cheap generation without governance creates invisible debt. AgenticOps is the discipline layer. Containment is the mechanism.

All of that was structural. This one is personal.

I Taught Myself to Code

I don’t have a computer science degree. I don’t have a software engineering degree. I have no formal training in the thing I’ve done for a living for well over two decades.

I learned from books. Then from Google. Then from StackOverflow. I learned from copying patterns I saw in codebases I didn’t fully understand. Eventually from building things that broke and figuring out why.

The learning never felt complete. It still doesn’t, and now it feels like I have so much more to learn.

I have OCD, ADD, depression, and imposter syndrome. The OCD means I fixate on problems until they resolve. The ADD means I struggle to focus long enough to resolve them efficiently. The depression and imposter syndrome make me doubt everything I do. Those forces fight each other constantly. Sometimes that tension produces good work. Sometimes it produces hours lost chasing details that didn’t matter.

On top of that I never felt like an engineer. The people I admired seemed to hold so much of the systems we worked on in their heads, reason about concurrency without breaking a sweat, debug memory and network issues by reading traces. They seemed to operate in a different register, a different dimension.

I watched conference talks and understood maybe half of what was said. I read papers and got the gist but not the math. I built mental models that were close enough to be useful but never precise enough to feel confident.

The 10x developer myth lived in my head. Not because I believed it literally, but because I measured myself against it. If they were 10x, I was 1x. Maybe. On a good day.

Yet, I ended up as a top producer or leader on all the teams I worked on, so I had some value, even if my brain doesn’t believe it.

I Spent Years Closing a Gap That Didn’t Matter

I tried to get faster. Better tooling, better shortcuts, better frameworks. I optimized my workflow to muscle memory. Split terminals, keyboard shortcuts, IDE configurations I’d tuned over years.

I got good enough. I shipped systems that handled real traffic, real money, real consequences. Payment services processing billions of dollars per month where a bug meant many people didn’t get paid. Multi-tenant platforms where a data leak meant one company could see another company’s information.

But I never shook the feeling that the real engineers were operating at a level I’d never reach. That the gap between us was fundamental, not experiential.

So I kept grinding. More books. More side projects. More late nights trying to understand things other people seemed to just know.

The gap I was trying to close was implementation speed. How fast can I translate intent into working code? How quickly can I go from “this is what we need” to “this is what exists”?

I was optimizing for the wrong variable the entire time.

AI Made Me a 0x Coder

Then, around October and November of 2025, it felt like AI arrived. Not the theoretical AGI kind. The real kind that writes code.

I started using AI agents to build systems. Not as a helper. Not as autocomplete. As the implementation layer.

Today I write zero lines of code by hand. Zero.

AI scaffolds services. AI implements business logic. AI writes tests. AI refactors modules. AI generates migrations. I define what needs to exist, what constraints it must satisfy, what acceptance criteria must be met, and I evaluate that they are met. The agent does the rest.

I code 0x.

The skill I spent twenty years building, the ability to translate intent into syntax, is fully delegated. The keystrokes I optimized. The frameworks I memorized. The patterns I drilled into muscle memory. All bypassed.

It still feels like a loss. A waste. The thing I’d spent my career trying to master was now something a machine does better and faster. A 1x coder didn’t become 10x. I became 0x.

0x Is Not a Deficit

Here’s what I didn’t expect. Letting go of implementation didn’t reduce my output. It multiplied it.

AI doesn’t just write code faster and better than me. It writes at a scale I could never match. Full service scaffolding in minutes. Test suites covering edge cases I would have missed. Rewrites and refactors across modules that would have taken me days.

I was never going to write at 100x. But I can govern at 100x.

In the first post I said I scale containment, not understanding. I wrote that before I’d lived it. Now I have.

In the second post I argued the hard parts were never typing. In October 2025 I meant it theoretically. Today, I mean it literally. I don’t type production code. The hard parts, the constraint decisions, the system boundaries, the verification criteria, those are the only parts I do.

The six layers from the third post. Intent, agent generation, evaluation, promotion, runtime governance, knowledge compression. Those are not a framework I designed in the abstract. They are the operating system I iterate on because I had to. Because governing agent output is the only way 0x works.

The four rings from the fourth post. Constrain inputs, constrain environment, validate outputs, gate promotion. Those are not best practices on a slide. They are the walls of the house I am building to live in. Without them, 0x is reckless. With them, 0x is operational.

What a Day Looks Like for a 0x Coder

Here is the 0x workflow in practice.

  1. Define intent. What value does this slice deliver? What state transitions does it manage? What must never break?
  2. Define contracts. Input schemas, output schemas, interface definitions, invariant list.
  3. Define tests. Contract tests, integration tests, edge case scenarios. The tests exist before the implementation does.
  4. Scope the agent. Mount the contracts, tests, and bounded context into the agent’s workspace. Nothing else.
  5. Generate. The agent plans, scaffolds, implements, and refactors inside its scope.
  6. Evaluate. The evaluation pipeline runs automatically. Contract tests, static analysis, security scanning, schema validation.
  7. Review outcomes. I don’t read generated code line by line. I review whether behavior matches intent. Test results, API output diffs, invariant checks.
  8. Approve or reject. If the evidence says it works, promote. If not, refine the constraints and loop.

That loop is my job. I don’t write code. I write constraints and review evidence. I don’t delegate my responsibility to deliver value.

There are skills besides coding that I spent twenty years building, and they still matter, but not in the way I expected.

I understand systems well enough to define intent for them. I understand failure modes well enough to write meaningful constraints and acceptance criteria. I understand architecture well enough to scope agents tightly. I understand system risks well enough to judge when evidence is sufficient.

Understanding code well enough to evaluate it is different from writing it. Both are valid. Evaluation may matter more now.

The Skills That Actually Compound

I worried about the wrong things for twenty years.

I thought typing speed mattered. What compounds is system design.

I thought language mastery mattered. What compounds is constraint definition.

I thought memorizing APIs mattered. What compounds is evaluating outcomes.

I thought writing code from scratch mattered. What compounds is judging output.

I thought line-by-line review mattered. What compounds is evidence-based verification.

I thought understanding every line of code mattered. What compounds is understanding boundaries.

The first list is what I optimized for as a 1x coder. The second is what I actually use as a 0x one.

Every one of those skills in the second list was something I was already doing alongside the coding. Designing systems, defining boundaries, testing behavior, evaluating risk. I just didn’t recognize them as the primary skills because the coding felt like the real work.

It never was.

The Code Was Never the Value

This is the part I had to live to believe.

I spent twenty years measuring myself by my ability to produce code. When AI took that ability away, it felt like losing the foundation of my career. Will I miss the act of coding? Yes. But more than that, I worried that without it, I had no value to add.

But the foundation was never the code. The foundation was the ability to solve problems and deliver value. My value add was being able to understand systems, judge outcomes, define constraints, and make decisions under uncertainty. Writing a for loop was not where the value lived. The code was always an artifact of those decisions. Not the decisions themselves.

The payments service worked because I understood the state transitions, not because I typed the implementation. The multi-tenant platform was secure because I understood the isolation requirements, not because I wrote the permission layer by hand.

In the first post I said my decisions are what matter. I believe that more now than when I originally wrote it. I’ve spent time producing real systems without writing a single line of code, and the outcomes are the same or better than what I produced when I typed everything myself.

Not because I’m better or AI is better. Because the division of labor is optimized. Humans are good at intent, constraints, judgment, and risk assessment. Agents are good at implementation, coverage, consistency, and speed. Combining them beats either one alone.

Coding and Building Are the Same Thing Again

Grace Hopper spent her career trying to get away from code. Trying to move programming toward natural language. Uncle Bob Martin called our continued use of the word “code” a reflection of our failure to meet her goal.

I think we’re close to meeting it now. Not because prompts are natural language. Because the distinction between “writing code” and “building systems” is dissolving.

For decades, building software required coding. You couldn’t build without typing in some weird cryptic syntax. The skills overlapped so completely that we treated them as the same thing.

They aren’t. Building is intent, constraints, architecture, verification, judgment. Coding is translating those into syntax. When AI handles the translation, building remains.

The distinction was always there. We just couldn’t see it because the two were inseparable. Now they’re separated. And it turns out the building side is where the value was all along. I am a system builder and an AI agent operator.

Five Posts. One Thesis.

I’ve never fully understood the systems I work in. AI made that worse, but containment made it manageable.

Most software is CRUD molded into value. Cheap generation without governance creates invisible debt, but constraint discipline prevents it.

AgenticOps is the governance model. Six layers. Four rings of containment. A hard line between what agents generate and what systems execute.

The human’s role didn’t shrink. It moved. From implementation to intent. From typing to judgment. From code review to evidence review.

And this last part is the one I had to live to believe. The code was never the value. The decisions were. A 0x coder governing a 100x agent produces better outcomes than a 1x coder typing everything by hand.

I know because I’m the 0x coder. And I believe the systems I’m building now are as good or better than the systems I hand coded. What’s your experience?

Let’s talk about it.

Previous: [How Agents Stay in Bounds]

Next: [You Can Build This. Three Artifacts and a Sandbox.]

How Agents Stay in Bounds

The last post defined AgenticOps. Six layers from intent to knowledge compression. But I left the hardest question unanswered: how do you actually keep agents inside their boundaries?

The honest answer is you can’t guarantee it. Not the way you can prove a compiler respects a type system. A stochastic system doesn’t make promises. It makes outputs.

So the strategy isn’t trust. It’s defense in depth. Multiple layers of deterministic containment around a probabilistic process, so that no single failure leads to unbounded impact.

Boundaries Are Infrastructure, Not Policy

This is where AgenticOps stops being philosophy and becomes architecture.

The primitive is simple. One sandboxed container per agent slice. Docker Sandbox. Constrained file permissions. Whitelisted network access. A schema-constrained context mounted in at startup. The agent lives in that box. Everything it needs is in there. Everything it doesn’t need isn’t reachable.

That’s not a metaphor. The agent literally cannot write files outside its slice. It cannot reach endpoints that aren’t on the whitelist. It cannot promote its own changes up the chain. There’s no exception path, no override flag, no escape hatch.

The containment isn’t a rule the agent follows. It’s a wall the agent cannot see past.

I’ve said for years that in systems, people aren’t the problem, processes are. Most failures aren’t malicious. They’re structural. The system made the bad outcome easy and the good outcome hard. Humans being humans, they took the path of least resistance.

With stochastic agents, it’s the same insight one layer deeper. The problem isn’t the agent. The problem is the infrastructure that gives the agent room to fail in ways you can’t predict or recover from.

You can’t reason about agent output the way you reason about deterministic code. You can’t read the function and know what it’ll return. You can test it, eval it, constrain its inputs. But you cannot trust it the way you trust a compiler. It’s stochastic all the way down.

If you’re relying on the agent to follow a policy, you’re trusting a stochastic system to be trustworthy. That’s not a risk you’re managing. That’s a risk you’re ignoring.

A policy says don’t do this. Infrastructure says you can’t. When you’re governing stochastic systems, you want the second one everywhere you can get it. Policies are for humans who can read them. Infrastructure is for systems that can’t.

The Context Window Is a Containment Boundary

There are two actors in this model. An orchestrator that manages the lifecycle and an execution agent that does the work.

The orchestrator decides what the agent reasons about. If an agent is working on an order service slice, the orchestrator loads the order contract, the relevant state machine definition, the test expectations, and the bounded interface definitions for adjacent services into the agent’s context.

That’s it. Not the user service internals. Not the payment provider credentials. Not the global config.

The agent doesn’t decide what’s in scope. The orchestrator does. The context window becomes a containment boundary. The agent literally cannot reason about what it wasn’t given.

That gives you something powerful: the blast radius of a misbehaving agent is bounded by what the orchestrator mounted, not by the agent’s judgment. A bad output can only be as wrong as the scope allows.

If the scope is one contract and one set of tests, the worst case is a failed evaluation. If the scope is the entire system, the worst case is an invisible invariant violation three services deep. Scope is risk management.

Four Rings of Containment

I think about agent containment as four concentric rings. Each ring is deterministic. What’s inside them is stochastic. That asymmetry is the whole point.

Ring One: Constrain the Inputs

The agent only sees what it’s scoped to see. Typed schemas, versioned contracts, bounded context. The narrower the input scope, the smaller the space of possible outputs.

This is where most teams fail first. They hand AI an entire codebase and say “fix it,” then wonder why the output is unpredictable. An agent working on a single slice with a single contract has a fundamentally different risk profile than an agent with access to everything.
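One way to make Ring One concrete is to treat the input scope as a typed object the orchestrator constructs before the agent ever runs. The field names and paths below are illustrative, not a real schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SliceScope:
    """Everything the agent is allowed to reason about.
    Nothing else gets mounted, so nothing else can shape the output.
    All fields and example values are hypothetical."""
    contract_path: str                     # versioned contract for this slice
    test_paths: tuple[str, ...]            # tests that gate promotion
    interface_paths: tuple[str, ...] = ()  # bounded interfaces of neighbors
    invariants: tuple[str, ...] = ()       # what must never break

order_scope = SliceScope(
    contract_path="contracts/order.v3.yaml",
    test_paths=("tests/order_contract_test.py",),
    interface_paths=("interfaces/payment.v1.yaml",),
    invariants=("an order is never marked paid without a payment id",),
)
print(order_scope.contract_path)
```

The `frozen=True` matters: the scope is decided once, by the orchestrator, and the agent cannot widen it mid-run.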

Ring Two: Constrain the Environment

The sandbox. No network access outside defined endpoints. Resource limits on CPU and memory. And a specific filesystem constraint that matters more than the others: the agent can read the broader system but can only write to the slice.

Docker volume mounts make this concrete. The repository mounts read-only. The slice directory mounts read-write. The operating system enforces it. The agent can see everything it needs to compile and resolve dependencies. It cannot modify anything outside its scope.

That distinction matters. The containment is write-scope, not visibility-scope. An agent that can only see its slice can’t build, can’t run tests, can’t verify its own work against real dependencies.

An agent that can see the system but only write to its slice can do all of those things. And the blast radius is still bounded by what it can change, not by what it can generate internally.

Builds produce artifacts outside the slice. Compiled outputs, temp files, package caches. Those writes happen in ephemeral directories that get discarded when the container stops. The only thing that survives the sandbox is the diff the orchestrator extracts from the slice directory.
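The read-broad, write-narrow layout can be sketched as the argument list an orchestrator might pass to `docker run`. The paths, limits, and image name are placeholders; the flags themselves are standard Docker options:

```python
def sandbox_args(repo: str, slice_dir: str, image: str = "agent-sandbox") -> list[str]:
    """Build `docker run` arguments enforcing read-broad, write-narrow.
    Paths and the image name are illustrative placeholders."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                 # no network unless whitelisted
        "--memory", "4g", "--cpus", "2",     # resource limits
        "-v", f"{repo}:/workspace:ro",       # whole repo: read-only
        "-v", f"{repo}/{slice_dir}:/workspace/{slice_dir}:rw",  # slice: read-write
        "--tmpfs", "/tmp",                   # ephemeral build artifacts
        image,
    ]

print(" ".join(sandbox_args("/srv/repo", "services/order")))
```

The more specific read-write mount layers over the read-only one, and the tmpfs gives builds somewhere to write that dies with the container. Nothing survives except the slice directory.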

Ring Three: Validate the Outputs

This is the evaluation layer. Before anything leaves the agent loop, it passes through deterministic gates. But not all gates are the same.

Static gates operate on files directly. Linting, AST validation, schema diff checks, security scanning. These work on the slice alone. They don’t need the broader system. They catch structural violations before anything compiles.

Build and test gates need more context. Contract tests, integration tests against bounded interfaces, compilation, snapshot comparison of API outputs. These work because Ring Two mounted the broader system as read-only.

The agent can build and test against the real dependency graph. It just can’t modify anything outside its scope.

The containment that matters here is not what the evaluation can see. It’s what survives extraction. The orchestrator collects only the diff from the slice directory. Build artifacts, test outputs, intermediate files, all discarded.

The evaluation runs against the full mounted context. The promotion pipeline sees only the slice-scoped changes.

That’s the honest version of “validate the outputs.” Some checks work on isolated files. Some checks need the system. Both run inside the sandbox. Neither requires the agent to have write access beyond the slice.

Ring Four: Gate the Promotion

The agent loop cannot self-promote. Period. Even if an agent produces something that passes every automated check, it does not reach production without human approval.

But what does the human actually review? Not the code. The evaluation pipeline already ran. What lands in the review queue is the evidence.

First, the human reviews the evaluation results. Which tests passed. Which contracts held. What the behavior diff looks like. API snapshots before and after. UI snapshots before and after. The evidence package tells you whether the system behaves as expected without reading a single line of generated code.

Second, the human checks scope. Did the agent touch only what it was supposed to touch? If the slice was the order service and the diff includes changes to the payment service, that’s a boundary violation.

You don’t need to read the implementation to catch that. You just need to see which files changed and whether those files belong to the slice.

Third, the human checks intent alignment. Does the behavior change match what was requested? Not “is the code clean” but “does the system do what I asked it to do.” That’s a contract question, not a code quality question.

Fourth, the human checks what machines can’t. Business judgment calls. Edge cases that require domain knowledge. Whether the thing that technically passes all gates is actually what a customer should experience. This is where human reasoning earns its place in the loop.

Fifth, the human verifies the running system. Deploy to a preview environment and test against the acceptance criteria. Does the change operate as expected when a real user touches it?

This is QA. It always was. The difference is the human is testing behavior that was generated and evaluated automatically, not behavior that was typed by hand.

That’s what code review becomes in an AgenticOps model. You stop reading code line by line. You start reviewing evidence, scope, intent, judgment, and behavior. The machines verify implementation. The human verifies outcomes.

Over time, as confidence grows, you might loosen this for certain categories of change. A low-risk schema migration that passes every gate, for example. But the default posture is closed. You earn openness through evidence.

Small Slices Make Containment Practical

There’s a principle underneath all four rings that makes them work. Scope the work small enough that boundary violations are obvious.

Small slices aren’t just a project management preference. They’re a containment strategy. The smaller the scope, the more deterministic the boundary, the more meaningful the evaluation, and the lower the stakes of getting it wrong.

What the Stack Looks Like

Put it all together and the concrete architecture looks like this.

The sequence in practice:

  1. The orchestrator creates the slice definition: contract, schema, test expectations, invariant list, and interface definitions for adjacent services.
  2. The orchestrator mounts the full repository read-only and the slice directory read-write into a sandboxed Docker container. No git CLI. No access to the remote repository. The agent can resolve dependencies and compile against the real system. It can only modify files in its slice.
  3. The execution agent generates against that context. Plans, scaffolds, implements, and refactors, all inside the sandbox. It reads broadly and writes narrowly.
  4. The evaluation pipeline runs inside the same sandbox. Static checks validate the slice files directly. Build and test checks compile and run against the full mounted context. Both enforce gates before anything leaves the container.
  5. If the output passes all gates, the orchestrator collects the diff, creates a branch, commits, and promotes to a human review queue with the evidence attached.
  6. If it does not pass, it loops back to the agent or fails out.

The execution agent never touches version control. Git operations are promotion, and promotion is outside the agent loop. The orchestrator handles branching, committing, and creating pull requests. The agent handles files.

The human never sees anything that didn’t survive the sandbox. The system never executes anything the human didn’t approve. The agent never touches anything outside its slice.

Anyone who has worked with parallel agent architectures knows this pattern is already emerging. Multiple instances against isolated issue slices, each with their own bounded context and evaluation gate.

I hope to build and experiment with this as we all learn to operate in our new AI reality. I plan on posting my results and findings in a new “AgenticOps Applied” series to share my experience.

Deterministic Boundaries Around Stochastic Processes

That’s the core design principle. Every previous abstraction step in programming was deterministic all the way down. This one isn’t. But it doesn’t need to be, as long as the containment layer is.

The agent is probabilistic. The sandbox is not. The evaluation is not. The promotion gate is not. The runtime telemetry is not. The human review is not.

The only thing that isn’t deterministic is the agent’s output. Everything else is a deterministic process that either makes it impossible for the agent to misbehave or makes it easy to detect when it does.

You don’t trust the agent to stay in bounds. You make it structurally impossible, or at minimum structurally detectable, when it doesn’t. And you scope the work small enough that detection is meaningful.

That’s how agents stay in bounds. Not by being trustworthy. By being contained.

Let’s talk about it.

Previous: [What AgenticOps Actually Looks Like]

Next: [I Was a 1x Coder at Best. AI Made Me a 0x Coder.]

What AgenticOps Actually Looks Like

Total understanding is a myth. Cheap generation without governance creates invisible debt. Those were the first two claims.

Now it’s time to make AgenticOps concrete. Not a vibe. Not “AI but responsibly.” A governance operating model for AI-amplified system production.

The Problem Is We’re Reviewing the Wrong Thing

AI can generate entire services, DTO layers, migrations, integration adapters, test scaffolding, refactors across modules. Generation bandwidth has exploded. Human review has not. And it won’t. Human review doesn’t scale linearly with generation speed.

That pain is real. You hear it even from people leading frontier AI labs. The bottleneck is no longer “who can write the code.” It’s “who can review all this safely.”

I believe the deeper issue is we're reviewing the wrong thing. We're reviewing lines of code. What we should be reviewing are outcomes. Does it satisfy the contract, pass the tests, respect the boundaries? AgenticOps makes structural review the default, not the exception.

The Failure Mode

The obvious failures aren’t the real danger. The real danger is slow structural drift. Behavior changing without anyone realizing the invariants were never encoded. Contract drift across services that only surfaces when two teams try to integrate. Feature interactions no one modeled because no one knew the features existed.

None of this announces itself. It accumulates. That’s the environment AgenticOps is designed for.

What Success Looks Like

Success in AgenticOps looks like this: humans define what the system must do and what must never break. Agents generate implementation inside explicit boundaries. Every change is evaluated automatically before promotion.

Runtime behavior is observable and reversible. Surface area growth does not increase risk proportionally.

Humans stop reviewing keystrokes. Instead they review contracts, invariants, risk surfaces, behavior diffs, telemetry, and business outcomes. Machines review implementation.

The Non-Negotiables

AgenticOps asserts a few invariants of its own. No generation without defined contracts. No promotion without evaluation. No runtime without observability. No agent autonomy without bounded scope. No hidden state transitions. No change without containment.

These are structural guarantees. They replace "LGTM" and rubber-stamp reviews with real safety nets. They make it harder to accidentally introduce systemic risk through sloppy review.

The Layers

I think about AgenticOps as six layers. They build on each other.

Intent is defined by humans. System purpose, value flow, state machines, invariants, constraints, risk tolerance. This is not code. This is the contract space, the thing that must exist before any agent generates anything.

Intent is the only thing that survives. Implementations get replaced. There may be 100 ways to do a thing, but intent persists.

Agent generation is where agents plan, scaffold, implement, and refactor. But they only work inside typed schemas, versioned contracts, bounded environments, and isolated slices (an end-to-end unit of work that delivers usable value). No free-roaming generation.

Stochastic output is fine as long as it’s contained. The boundaries are what make it safe.

Evaluation happens before anything gets promoted. Contract tests, integration tests, property-based tests, static analysis, security checks, policy enforcement, snapshot diffs of UI and API outputs.

We don’t scale review. We scale evidence. Humans don’t inspect 1,000 lines of code. They inspect whether behavior is expected.

Promotion is the most important architectural decision in AgenticOps. The agent loop cannot self-promote. Changes move to a human approval queue where humans review what changed in behavior, not what changed in code.

Governance sits outside the agent loop, not inside it. Self-promotion is where entropy becomes systemic.

Every “AI automation” narrative that gets sloppy gets sloppy at the boundary between what agents decide and what reaches production. AgenticOps draws a hard line.

Runtime governance covers what happens after deployment. Metrics, tracing, logs, SLO tracking, anomaly detection, feature flags, rollback paths. Understanding becomes observational, not memorized.

If something breaks, I don’t reread code. I interrogate behavior.

Knowledge compression means every slice of work produces artifacts. Updated documentation, system summary, state transition diagrams, dependency maps, invariant lists, change logs.

I don’t try to hold the code in my head. I hold compressed models. The system generates its own documentation as a byproduct of the governance process, not as an afterthought someone writes six months later.

The Maturity Model

AgenticOps isn’t binary. It evolves.

  • Level 0 is manual coding.
    Humans write and review everything. This is where most of the industry lived for decades.
  • Level 1 is AI-assisted coding.
    AI generates code. Humans still review line by line. This is where most teams are right now, and it’s where the pain is sharpest. Generation speed outpaces review capacity.
  • Level 2 is contract-first generation.
    Humans define contracts. AI implements against them. Tests gate promotion. This is the minimum viable AgenticOps.
  • Level 3 is governed agent loops.
    Slice queues, evaluation services, approval gates, containment enforced structurally. The governance isn’t a process someone follows. It’s built into the system.
  • Level 4 is observational governance.
    Runtime telemetry feeds back into planning and constraint refinement. The system learns from its own behavior in production.
  • Level 5 is adaptive governance.
    The system proposes constraint improvements within defined boundaries. Humans approve or reject. The governance loop itself becomes partially automated.

I suspect most teams today are somewhere between Level 1 and Level 2. Very few are at Level 3. That’s the gap.

The Model

Humans own outcomes. Agents produce implementation. The system enforces containment.

That’s AgenticOps. Six layers. Five maturity levels. A structural answer to how governance scales alongside generation.

Next, the hardest part. How you actually keep agents inside their boundaries when the thing generating the code is fundamentally probabilistic.

Let’s talk about it.

Previous: [Most Software Is Just CRUD. That’s Not the Problem.]

Next: [How Agents Stay in Bounds]