OpenClaw Is Not an AI Assistant
OpenClaw is getting a lot of attention right now. It’s usually described as an AI assistant. That description misses what it actually is. OpenClaw is an agent runtime.
It connects a language model to tools that interact with real systems. Those tools can read files, write code, run shell commands, and call APIs.
So the right mental model is not: “install an AI assistant.” The right mental model is: “deploy an autonomous process with the ability to operate on my machine.”
Once you see it that way, the real question isn’t how to install it. The real question is how to contain it.
What OpenClaw Actually Does
OpenClaw allows a language model to operate as an agent. Instead of just generating text, the model can decide to invoke tools that interact with the outside world.
Those tools can:
- read and write files
- execute code
- run shell commands
- call APIs
- interact with external services
These capabilities are organized as skills. A skill is a package that describes a capability and exposes tools the agent can use.
Example structure:
skills/
  github/
    SKILL.md
    tools/
      create_pr.js
      list_issues.js
The SKILL.md file explains to the model when and how to use those tools.
You can think of a skill as a capability module that expands what the agent is allowed to do.
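For illustration, a minimal SKILL.md might look something like this (hypothetical contents; the exact format is defined by OpenClaw and may differ):

```markdown
# GitHub Skill

Use this skill when the user asks to open pull requests or inspect issues.

## Tools
- tools/create_pr.js — open a pull request from the current branch
- tools/list_issues.js — list open issues for a repository

## Constraints
- Never force-push.
- Ask before opening a PR against a protected branch.
```

The point is that the file is instructions for the model, not code: it tells the agent when a capability applies and what its boundaries are.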
Installing OpenClaw
OpenClaw installs through Node and runs as a CLI with a gateway daemon.
Requirements
- Node 22 or later
- macOS, Linux, or Windows (WSL recommended)
Check Node:
node -v
If needed:
nvm install 24
Install OpenClaw:
npm install -g openclaw
Run onboarding:
openclaw onboard --install-daemon
This installs the gateway service that manages agent sessions.
Configure Models
OpenClaw connects to external models through configuration.
Example file:
~/.openclaw/models.yaml
Example configuration:
models:
  primary:
    provider: anthropic
    model: claude-3-opus
    api_key: ${ANTHROPIC_KEY}
  fallback:
    provider: openai
    model: gpt-5
    api_key: ${OPENAI_KEY}
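The ${ANTHROPIC_KEY} and ${OPENAI_KEY} placeholders suggest environment-variable substitution. Assuming that is how OpenClaw resolves them, export the keys in the shell that starts the runtime (the values below are placeholders, not real keys):

```shell
# Placeholder values; substitute your real provider keys.
# Keeping keys in the environment rather than the config file means the
# file can be committed or shared without leaking secrets.
export ANTHROPIC_KEY="sk-ant-placeholder"
export OPENAI_KEY="sk-openai-placeholder"
```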
Start the runtime:
openclaw start
At this point you have an operational agent runtime.
Installation Is Easy. Containment Is the Real Problem.
An OpenClaw agent can run shell commands, modify files, and call external services. That means the system should be treated as untrusted automation.
Most tutorials approach this with policy: “Don’t let the agent do dangerous things.” That approach is backwards. You don’t want policies. You want infrastructure that prevents the agent from doing dangerous things. Containment needs to be enforced by the environment.
Three Different Isolation Layers
There are three different isolation mechanisms involved when running OpenClaw. They solve different problems.
Runtime Containerization
The simplest layer is running OpenClaw itself inside Docker.
Example:
docker run -it \
  --name openclaw \
  -v claw-workspace:/workspace \
  openclaw/openclaw
In this setup the OpenClaw gateway runs inside a container. This gives you:
- a reproducible environment
- basic host isolation
- simpler deployment
But this alone does not sandbox the agent’s actions. This protects the host, not the runtime.
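If you run the gateway in Docker anyway, standard Docker flags can tighten the container further. A hedged sketch (resource limits are illustrative, and a read-only root filesystem may require extra writable mounts depending on where OpenClaw writes state):

```shell
# --read-only makes the root filesystem immutable; --tmpfs provides scratch
# space that dies with the container; --cap-drop ALL removes Linux
# capabilities the runtime should not need. Limits are illustrative.
docker run -it \
  --name openclaw \
  --read-only \
  --tmpfs /tmp \
  --memory 2g \
  --cpus 2 \
  --cap-drop ALL \
  -v claw-workspace:/workspace \
  openclaw/openclaw
```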
OpenClaw Tool Sandboxing
OpenClaw can sandbox tool execution. Instead of executing commands directly, the runtime launches a container for tool execution.
Architecture:
OpenClaw Gateway
  ↓
Agent Session → sandbox container
  ↓
Tool Execution
Tools that can be sandboxed include:
- shell commands
- file edits
- code execution
- browser automation
Configuration example:
agents.defaults.sandbox.mode: "all"
agents.defaults.sandbox.scope: "session"
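The dotted keys imply a nested structure. Assuming OpenClaw reads nested YAML (a common convention, but an assumption here), the equivalent configuration would be:

```yaml
agents:
  defaults:
    sandbox:
      mode: "all"       # sandbox every tool category
      scope: "session"  # one container per agent session
```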
Each session receives its own sandbox container.
This isolates agent actions, but the gateway process still runs outside the sandbox.
Docker Sandboxes
Docker recently introduced Docker Sandboxes specifically for AI workloads. A Docker Sandbox runs the agent inside a micro-VM style environment with strict boundaries.
Architecture:
Host
  ↓
Docker Sandbox
  ↓
OpenClaw Runtime
  ↓
Agent Tools
This environment provides stronger isolation:
- restricted filesystem access
- network proxy and allowlists
- external secret injection
- workspace-only file access
Secrets are injected from outside the sandbox rather than being stored in the runtime. Network access can be restricted to specific domains such as model providers or internal APIs. This shifts containment from policy to infrastructure. Instead of telling the agent not to do something, the environment simply prevents it.
The Containment Model That Makes Sense
The safest approach combines these layers.
Docker Sandbox
  ↓
OpenClaw Runtime
  ↓
OpenClaw Tool Sandbox
  ↓
Agent Tools
This creates multiple containment rings.
Ring 1 — Docker Sandbox
Ring 2 — OpenClaw tool sandbox
Ring 3 — tool allowlists
Ring 4 — network restrictions
Ring 5 — human approval gates
Each ring assumes the ring inside it may fail. That’s how you design systems around stochastic components.
Where OpenClaw Actually Becomes Useful
Once it’s contained, OpenClaw becomes a programmable operator. The value comes from defining skills that match the workflows you already run.
Engineering Agent
Skills:
- git
- test runner
- code review
- CI
Tasks:
- review pull requests
- generate architecture summaries
- run test suites
- produce coverage reports
Example:
review this PR and summarize the architectural impact
Research Agent
Skills:
- web search
- summarization
- synthesis
- writing
Typical workflow:
- gather sources
- summarize them
- extract insights
- draft documents
Operations Agent
Skills:
- calendar
- meeting summarization
- task management
Tasks:
- triage inbox
- extract action items
- schedule meetings
- produce summaries
Product Strategy Agent
Skills:
- market research
- competitor analysis
- financial modeling
- feedback synthesis
Outputs:
- product briefs
- experiment plans
- roadmap drafts
Structuring an Agent Runtime
For larger systems, it helps to treat the runtime as infrastructure hosting multiple agents.
Example:
Runtime
  research agent
  engineering agent
  planning agent
  writing agent
Each agent has:
- its own prompt
- its own skills
- the same runtime environment
The runtime provides infrastructure. The agents provide behavior.
A Note on Maturity
OpenClaw is still early. The capabilities are powerful, but the ecosystem is not hardened yet.
Security researchers are already demonstrating how prompt injection and malicious skills can manipulate agents with broad access. That doesn’t mean the system shouldn’t be used. It means the system should be designed with containment in mind from the start.
The Opportunity
The real opportunity isn’t running a single agent. The interesting direction is combining agent runtimes with orchestration and evaluation systems.
Example architecture:
Agent Runtime
  ↓
Workflow Engine
  ↓
Tool Execution
  ↓
Evaluation Loop
That changes the role of the agent. Instead of being an assistant, it becomes a component inside a controlled operational system. At that point you’re no longer experimenting with AI tools. You’re building infrastructure around them.
Let’s talk about it.
Previous: [Autonomy Without Infrastructure Is Just a Demo]
Next: [Verification Beats Debugging]
Autonomy Without Infrastructure Is Just a Demo
The AgenticOps series defines six layers, four containment rings, and a maturity model. All of it was framework vision. The AgenticOps Applied series tells stories about how that vision is realized through experiments and production case studies. This post is a case study that tests the framework against a production system that was built without it.
What Stripe Published
Stripe released two blog posts in early 2026 describing their internal coding agents, called Minions (Part 1 and Part 2). The numbers are striking. Over 1,300 merged pull requests per week. Every PR is human-reviewed. None contains human-written code.
Stripe didn’t build Minions from a governance framework. They built them from engineering first principles to solve a production problem. Autonomous coding agents at scale inside a system that processes payments.
The architecture they arrived at is worth examining. Not because it validates AgenticOps by name, but because independent convergence on the same structural patterns is stronger evidence than any single implementation built from the framework itself.
What They Built
Five components define the Minions architecture.
Devboxes. Every agent run executes in a disposable AWS EC2 instance. These environments arrive pre-warmed with the full codebase, built dependencies, and running services in about ten seconds. No internet access. No production connectivity. Destroyed after each run. Stripe already used devboxes for human engineers. The same infrastructure worked for agents.
Blueprints. Minion runs are not pure agent loops. They are hybrid state machines that interleave deterministic nodes with stochastic agent nodes. Deterministic steps handle linting, pushing branches, and triggering CI. Agent steps handle implementation and failure resolution. The agent gets freedom where reasoning helps. The system enforces what must always happen.

Toolshed. An internal MCP server with nearly 500 tools for internal systems and SaaS platforms. Agents receive curated subsets, not the full set. Security controls prevent destructive actions. Before a run begins, the system fetches context from tickets and documentation so agents start informed rather than searching blind.
Rule files. Static guidance scoped to directories. As the agent traverses the codebase, relevant rules load automatically. Stripe standardized on Cursor’s format and syncs rules to support Claude Code as well. Global rules fill the context window. Scoped rules provide signal where the agent is actually working.
Verification pipeline. Local lint runs in under five seconds after generation. Only after that passes does the system target CI against a suite of over three million tests (WTF). If CI fails, the agent gets one retry. Not infinite retries. One. Then the PR goes to a human. Stripe caps iterations because compute, tokens, and time cost money.
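The cap is simple to express as control flow. A sketch with the CI call stubbed out (Stripe's actual implementation is not public):

```shell
attempt=0
max_attempts=2                 # one initial CI run plus exactly one retry
result="escalate-to-human"

# Stub standing in for a real CI run: fails once, then passes.
ci_passes() { [ "$attempt" -ge 2 ]; }

while [ "$attempt" -lt "$max_attempts" ]; do
  attempt=$((attempt + 1))
  if ci_passes; then
    result="merged-after-attempt-$attempt"
    break
  fi
done
echo "$result"
```

If the loop exhausts its budget, the outcome is escalation, not another retry. That keeps token and compute spend bounded per task.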
Alignment to the Containment Rings
Post 4 of the main series introduced four rings. Here is where Stripe’s architecture maps.
| Ring | What It Requires | What Stripe Built |
| --- | --- | --- |
| 1: Constrain Inputs | Curated tool access, scoped context | Toolshed (curated MCP subsets), directory-scoped rule files, pre-hydrated context |
| 2: Constrain Environment | Isolated, disposable execution | Devboxes (pre-warmed EC2, no internet, destroyed after use) |
| 3: Validate Outputs | Layered verification | Local lint (seconds) + selective CI (minutes) + capped retry (one attempt) |
| 4: Gate Promotion | Human review as structural gate | Every PR goes to a human reviewer, agents never self-merge |
All four rings are present.
Ring 2 is the strongest. Devboxes provide binary isolation. The agent either cannot reach production, or the ring does not exist. There is no partial isolation. Stripe chose infrastructure over policy.
Ring 1 is more sophisticated than most implementations. Toolshed is not just tool access. It is curated, scoped, and security-controlled tool access. The distinction matters. Giving an agent 500 tools is not Ring 1. Giving it the 12 tools relevant to its task is.
Ring 3 includes a design decision that reveals operational maturity. Capping retries at one is an economic constraint, not a technical one. Infinite retries would burn tokens and compute chasing diminishing returns. The cap forces failed tasks back to humans rather than letting agents loop.
Ring 4 is non-negotiable at Stripe. Agent-generated code never merges itself. This is the same principle from the main series: governance sits outside the agent loop, not inside it.
Alignment to the Six Layers
The six layers tell a different story. Stripe covers some well and skips others entirely.
| Layer | Stripe Coverage | Evidence |
| --- | --- | --- |
| Intent | Partial | Tasks arrive from Slack, CLI, web UIs. No formal contract space, invariants, or state machines. |
| Agent Generation | Strong | Blueprints, devboxes, Toolshed. Agents generate inside explicit boundaries. |
| Evaluation | Strong | Lint + CI + capped iteration. Layered and cost-aware. |
| Promotion | Strong | Human PR review. No self-promotion. |
| Runtime Governance | Not described | Blog posts focus on agent infrastructure, not post-deployment observability of generated code. |
| Knowledge Compression | Not described | Minions produce PRs. No mention of compressed artifacts, invariant updates, or system documentation as output. |
The bottom four layers (Generation through Promotion) are well-built. The top and bottom layers (Intent and Knowledge Compression) are absent or informal.
This is not a criticism. Stripe solved the problem they had. But the gap is structurally interesting. Maybe intent isn’t mentioned because tasks are small and well-scoped. Maybe knowledge compression is absent because Stripe’s existing engineering culture handles documentation through other channels.
The AgenticOps model predicts that these layers become necessary at higher maturity levels. Stripe may not need them yet. Or they may have them and the blog posts simply didn’t cover them.
Maturity Assessment
Post 3 of the main series defined six maturity levels. Here is where Stripe sits.
Level 0, manual coding. Humans write and review everything. Stripe is past this.
Level 1, AI-assisted coding. AI generates, humans review line by line. Stripe is past this. Minions are not copilots. They are autonomous agents that produce complete pull requests.
Level 2, contract-first generation. Humans define contracts. AI implements against them. Tests gate promotion. Stripe partially meets this. Tests gate promotion, and rule files define constraints. But there is no formal contract space in the AgenticOps sense. No versioned invariants, no state machine definitions, no explicit risk tolerance declarations. The contracts are implicit in the test suite and rule files rather than formalized as a separate layer.
Level 3, governed agent loops. Slice queues, evaluation services, approval gates, containment enforced structurally. This is where Stripe lives. Blueprints are governed loops. Devboxes are structural containment. Human review is an approval gate. The governance is built into the system, not a process someone follows.
Level 4, observational governance. Runtime telemetry feeds back into planning and constraint refinement. Stripe tracks metrics on Minion performance, success rates, and merge rates. They iterate on blueprints and rules based on results. But the blog posts do not describe an automated feedback loop from runtime telemetry to constraint refinement. There are indicators of L4 thinking without the closed loop.
Level 5, adaptive governance. The system proposes constraint improvements within defined boundaries. Not described.
Stripe is solid Level 3 with early Level 4 signals. I bet that places them ahead of most organizations. Post 3 noted that most teams are between Level 1 and Level 2. Stripe jumped past the painful middle by investing in infrastructure rather than trying to scale human review.

What’s Not There
Three things the AgenticOps model calls for that Stripe’s published architecture does not describe.
Formalized intent. Tasks arrive as natural language requests through Slack or internal tools. There is no versioned contract space, no invariant classification, no explicit risk tolerance. In the next post I argue that intent rots without versioning. Stripe’s tasks are small enough that intent rot may not be a factor. At 1,300 PRs per week, the blast radius of any single task is small by design.
Knowledge compression. Minions produce code changes. The blog posts do not describe any system for producing compressed artifacts, updated documentation, invariant lists, or system summaries as a byproduct of agent work. In a future post I will also argue that compression without tiers is spam. Stripe may have solved this through other channels, or they may not need it at the task granularity Minions operate at.
The feedback loop. Post 1 argued that the six-layer diagram should be a cycle, not a waterfall. Knowledge compression feeds back into intent refinement. Stripe’s system appears linear: task in, PR out. The blog posts do not describe runtime signals feeding back into blueprint design or rule file updates, though Stripe almost certainly does this manually through engineering iteration.
None of these are failures. They are observations about where the model extends beyond what Stripe published. The interesting question is whether these gaps constrain Stripe’s ability to reach Level 4 and Level 5, or whether their task granularity makes the gaps irrelevant. Maybe they are past 4 and 5 and found gear 6.
What Convergence Means
Stripe did not read the AgenticOps posts. They did not reference containment rings. They solved an engineering problem and arrived at a structurally similar architecture.
The mapping nomenclature is mine, not theirs.
When independent teams approach the same class of problem from different starting points and still land on the same structural solutions, it usually means the problem space itself is constraining the design. The architecture isn’t ideology. It’s physics.
In this case the physics is stochastic software generation.
This is the first post in this series, and it shows the framework applied rather than the framework vision. The underlying principles are real, published, and operating at scale. The alignment to the containment model is analytical, not claimed by Stripe.
The containment rings hold. The maturity model places Stripe where the evidence suggests. The layers that Stripe skips are the ones the model predicts become necessary later.
Will it hold? Is it wrong?
Let’s talk about it.
Next: [Intent Drifts. Then Everything Drifts.]
How Agents Stay in Bounds
The last post defined AgenticOps. Six layers from intent to knowledge compression. But I left the hardest question unanswered: how do you actually keep agents inside their boundaries?
The honest answer is you can’t guarantee it. Not the way you can prove a compiler respects a type system. A stochastic system doesn’t make promises. It makes outputs.
So the strategy isn’t trust. It’s defense in depth. Multiple layers of deterministic containment around a probabilistic process, so that no single failure leads to unbounded impact.
Boundaries Are Infrastructure, Not Policy
This is where AgenticOps stops being philosophy and becomes architecture.
The primitive is simple. One sandboxed container per agent slice. Docker Sandbox. Constrained file permissions. Whitelisted network access. A schema-constrained context mounted in at startup. The agent lives in that box. Everything it needs is in there. Everything it doesn’t need isn’t reachable.
That’s not a metaphor. The agent literally cannot write files outside its slice. It cannot reach endpoints that aren’t on the whitelist. It cannot promote its own changes up the chain. There’s no exception path, no override flag, no escape hatch.
The containment isn’t a rule the agent follows. It’s a wall the agent cannot see past.
I’ve said for years that in systems, people aren’t the problem, processes are. Most failures aren’t malicious. They’re structural. The system made the bad outcome easy and the good outcome hard. Humans being humans, they took the path of least resistance.
With stochastic agents, it’s the same insight one layer deeper. The problem isn’t the agent. The problem is the infrastructure that gives the agent room to fail in ways you can’t predict or recover from.
You can’t reason about agent output the way you reason about deterministic code. You can’t read the function and know what it’ll return. You can test it, eval it, constrain its inputs. But you cannot trust it the way you trust a compiler. It’s stochastic all the way down.
If you’re relying on the agent to follow a policy, you’re trusting a stochastic system to be trustworthy. That’s not a risk you’re managing. That’s a risk you’re ignoring.
A policy says don’t do this. Infrastructure says you can’t. When you’re governing stochastic systems, you want the second one everywhere you can get it. Policies are for humans who can read them. Infrastructure is for systems that can’t.
The Context Window Is a Containment Boundary
There are two actors in this model. An orchestrator that manages the lifecycle and an execution agent that does the work.
The orchestrator decides what the agent reasons about. If an agent is working on an order service slice, the orchestrator loads the order contract, the relevant state machine definition, the test expectations, and the bounded interface definitions for adjacent services into the agent’s context.
That’s it. Not the user service internals. Not the payment provider credentials. Not the global config.
The agent doesn’t decide what’s in scope. The orchestrator does. The context window becomes a containment boundary. The agent literally cannot reason about what it wasn’t given.
That gives you something powerful: the blast radius of a misbehaving agent is bounded by what the orchestrator mounted, not by the agent’s judgment. A bad output can only be as wrong as the scope allows.
If the scope is one contract and one set of tests, the worst case is a failed evaluation. If the scope is the entire system, the worst case is an invisible invariant violation three services deep. Scope is risk management.
Four Rings of Containment
I think about agent containment as four concentric rings. Each ring is deterministic. What’s inside them is stochastic. That asymmetry is the whole point.
Ring One: Constrain the Inputs
The agent only sees what it’s scoped to see. Typed schemas, versioned contracts, bounded context. The narrower the input scope, the smaller the space of possible outputs.
This is where most teams fail first. They hand AI an entire codebase and say “fix it,” then wonder why the output is unpredictable. An agent working on a single slice with a single contract has a fundamentally different risk profile than an agent with access to everything.
Ring Two: Constrain the Environment
The sandbox. No network access outside defined endpoints. Resource limits on CPU and memory. And a specific filesystem constraint that matters more than the others: the agent can read the broader system but can only write to the slice.
Docker volume mounts make this concrete. The repository mounts read-only. The slice directory mounts read-write. The operating system enforces it. The agent can see everything it needs to compile and resolve dependencies. It cannot modify anything outside its scope.
That distinction matters. The containment is write-scope, not visibility-scope. An agent that can only see its slice can’t build, can’t run tests, can’t verify its own work against real dependencies.
An agent that can see the system but only write to its slice can do all of those things. And the blast radius is still bounded by what it can change, not by what it can generate internally.
Builds produce artifacts outside the slice. Compiled outputs, temp files, package caches. Those writes happen in ephemeral directories that get discarded when the container stops. The only thing that survives the sandbox is the diff the orchestrator extracts from the slice directory.
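Assuming a repository checkout on the host and a slice at services/orders (both paths are illustrative, as is the image name), the mount layout might look like:

```shell
# The slice mount shadows part of the read-only repository mount, so the
# agent can read everything but write only inside its slice. --network none
# cuts all network access; /build is scratch space discarded on exit.
docker run --rm \
  --network none \
  -v "$PWD/repo:/repo:ro" \
  -v "$PWD/repo/services/orders:/repo/services/orders:rw" \
  --tmpfs /build \
  agent-sandbox-image
```

The operating system enforces the boundary, not the agent. A write outside the slice fails with a permission error no matter what the model generates.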
Ring Three: Validate the Outputs
This is the evaluation layer. Before anything leaves the agent loop, it passes through deterministic gates. But not all gates are the same.
Static gates operate on files directly. Linting, AST validation, schema diff checks, security scanning. These work on the slice alone. They don’t need the broader system. They catch structural violations before anything compiles.
Build and test gates need more context. Contract tests, integration tests against bounded interfaces, compilation, snapshot comparison of API outputs. These work because Ring Two mounted the broader system as read-only.
The agent can build and test against the real dependency graph. It just can’t modify anything outside its scope.
The containment that matters here is not what the evaluation can see. It’s what survives extraction. The orchestrator collects only the diff from the slice directory. Build artifacts, test outputs, intermediate files, all discarded.
The evaluation runs against the full mounted context. The promotion pipeline sees only the slice-scoped changes.
That’s the honest version of “validate the outputs.” Some checks work on isolated files. Some checks need the system. Both run inside the sandbox. Neither requires the agent to have write access beyond the slice.
Ring Four: Gate the Promotion
The agent loop cannot self-promote. Period. Even if an agent produces something that passes every automated check, it does not reach production without human approval.
But what does the human actually review? Not the code. The evaluation pipeline already ran. What lands in the review queue is the evidence.
First, the human reviews the evaluation results. Which tests passed. Which contracts held. What the behavior diff looks like. API snapshots before and after. UI snapshots before and after. The evidence package tells you whether the system behaves as expected without reading a single line of generated code.
Second, the human checks scope. Did the agent touch only what it was supposed to touch? If the slice was the order service and the diff includes changes to the payment service, that’s a boundary violation.
You don’t need to read the implementation to catch that. You just need to see which files changed and whether those files belong to the slice.
Third, the human checks intent alignment. Does the behavior change match what was requested? Not “is the code clean” but “does the system do what I asked it to do.” That’s a contract question, not a code quality question.
Fourth, the human checks what machines can’t. Business judgment calls. Edge cases that require domain knowledge. Whether the thing that technically passes all gates is actually what a customer should experience. This is where human reasoning earns its place in the loop.
Fifth, the human verifies the running system. Deploy to a preview environment and test against the acceptance criteria. Does the change operate as expected when a real user touches it?
This is QA. It always was. The difference is the human is testing behavior that was generated and evaluated automatically, not behavior that was typed by hand.
That’s what code review becomes in an AgenticOps model. You stop reading code line by line. You start reviewing evidence, scope, intent, judgment, and behavior. The machines verify implementation. The human verifies outcomes.
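The scope check in particular is mechanical enough to automate before anything reaches a reviewer. A sketch, with the slice path and changed-file list hard-coded for illustration (a real orchestrator would read the list from the extracted diff):

```shell
SLICE="services/orders/"
CHANGED="services/orders/handler.js
services/orders/handler.test.js"

# Count changed files whose path does not start with the slice prefix.
# grep -c exits nonzero when the count is 0, so guard with || true.
violations=$(printf '%s\n' "$CHANGED" | grep -cv "^$SLICE" || true)
if [ "$violations" -eq 0 ]; then
  echo "scope check passed"
else
  echo "boundary violation: $violations file(s) outside $SLICE"
fi
```

A failing check can bounce the diff back automatically, so humans only ever see changes that already stayed in bounds.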
Over time, as confidence grows, you might loosen this for certain categories of change. A low-risk schema migration that passes every gate, for example. But the default posture is closed. You earn openness through evidence.
Small Slices Make Containment Practical
There’s a principle underneath all four rings that makes them work. Scope the work small enough that boundary violations are obvious.
Small slices aren’t just a project management preference. They’re a containment strategy. The smaller the scope, the more deterministic the boundary, the more meaningful the evaluation, and the lower the stakes of getting it wrong.
What the Stack Looks Like
Put it all together and the concrete architecture looks like this.

The sequence in practice:
- The orchestrator creates the slice definition: contract, schema, test expectations, invariant list, and interface definitions for adjacent services.
- The orchestrator mounts the full repository read-only and the slice directory read-write into a sandboxed Docker container. No git CLI. No access to the remote repository. The agent can resolve dependencies and compile against the real system. It can only modify files in its slice.
- The execution agent generates against that context. Plans, scaffolds, implements, and refactors, all inside the sandbox. It reads broadly and writes narrowly.
- The evaluation pipeline runs inside the same sandbox. Static checks validate the slice files directly. Build and test checks compile and run against the full mounted context. Both enforce gates before anything leaves the container.
- If the output passes all gates, the orchestrator collects the diff, creates a branch, commits, and promotes to a human review queue with the evidence attached.
- If it does not pass, it loops back to the agent or fails out.
The execution agent never touches version control. Git operations are promotion, and promotion is outside the agent loop. The orchestrator handles branching, committing, and creating pull requests. The agent handles files.
The human never sees anything that didn’t survive the sandbox. The system never executes anything the human didn’t approve. The agent never touches anything outside its slice.
Anyone who has worked with parallel agent architectures knows this pattern is already emerging. Multiple instances against isolated issue slices, each with their own bounded context and evaluation gate.
I hope to build and experiment with this as we all learn to operate in our new AI reality. I plan on posting my results and findings in a new “AgenticOps Applied” series to share my experience.
Deterministic Boundaries Around Stochastic Processes
That’s the core design principle. Every previous abstraction step in programming was deterministic all the way down. This one isn’t. But it doesn’t need to be, as long as the containment layer is.
The agent is probabilistic. The sandbox is not. The evaluation is not. The promotion gate is not. The runtime telemetry is not. The human review is not.
The only thing that isn’t deterministic is the agent’s output. Everything else is a deterministic process that either makes it impossible for the agent to misbehave or makes it easy to detect when it does.
You don’t trust the agent to stay in bounds. You make it structurally impossible, or at minimum structurally detectable, when it doesn’t. And you scope the work small enough that detection is meaningful.
That’s how agents stay in bounds. Not by being trustworthy. By being contained.
Let’s talk about it.
Previous: [What AgenticOps Actually Looks Like]
What AgenticOps Actually Looks Like
Total understanding is a myth. Cheap generation without governance creates invisible debt. Those were the first two claims.
Now it’s time to make AgenticOps concrete. Not a vibe. Not “AI but responsibly.” A governance operating model for AI-amplified system production.
The Problem Is We’re Reviewing the Wrong Thing
AI can generate entire services, DTO layers, migrations, integration adapters, test scaffolding, refactors across modules. Generation bandwidth has exploded. Human review has not. And it won’t. Human review doesn’t scale linearly with generation speed.
That pain is real. You hear it even from people leading frontier AI labs. The bottleneck is no longer “who can write the code.” It’s “who can review all this safely.”
I believe the deeper issue is we’re reviewing the wrong thing. We’re reviewing lines of code. What we should be reviewing are outcomes. Does it satisfy the contract, pass the tests, respect the boundaries? AgenticOps makes structural review the default, not the exception.
The Failure Mode
The obvious failures aren’t the real danger. The real danger is slow structural drift. Behavior changing without anyone realizing the invariants were never encoded. Contract drift across services that only surfaces when two teams try to integrate. Feature interactions no one modeled because no one knew the features existed.
None of this announces itself. It accumulates. That’s the environment AgenticOps is designed for.
What Success Looks Like
Success in AgenticOps looks like this: humans define what the system must do and what must never break. Agents generate implementation inside explicit boundaries. Every change is evaluated automatically before promotion.
Runtime behavior is observable and reversible. Surface area growth does not increase risk proportionally.
Humans stop reviewing keystrokes. Instead they review contracts, invariants, risk surfaces, behavior diffs, telemetry, and business outcomes. Machines review implementation.
The Non-Negotiables
AgenticOps asserts a few invariants of its own. No generation without defined contracts. No promotion without evaluation. No runtime without observability. No agent autonomy without bounded scope. No hidden state transitions. No change without containment.
These are structural guarantees. They replace “LGTM” and rubber-stamp reviews with real safety nets. They make it harder to accidentally introduce systemic risk through sloppy review.
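A minimal sketch of what “structural guarantee” means in practice: the non-negotiables expressed as a gate a change must pass, rather than a checklist someone remembers. The `Change` shape and `enforceInvariants` name are hypothetical, not part of any real tool.

```typescript
// Hypothetical sketch: the non-negotiables as a structural gate.
interface Change {
  contractId?: string;        // no generation without a defined contract
  evaluationPassed?: boolean; // no promotion without evaluation
  scope?: string[];           // no agent autonomy without bounded scope
  observable?: boolean;       // no runtime without observability
}

function enforceInvariants(change: Change): string[] {
  const violations: string[] = [];
  if (!change.contractId) violations.push("no defined contract");
  if (!change.evaluationPassed) violations.push("not evaluated");
  if (!change.scope || change.scope.length === 0) violations.push("unbounded scope");
  if (!change.observable) violations.push("no observability");
  return violations; // an empty list means the change may proceed
}
```

The point of the sketch is that a violation is a returned value the system acts on, not an opinion a reviewer may or may not voice.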
The Layers
I think about AgenticOps as six layers. They build on each other.
- Intent
- Agent generation
- Evaluation
- Promotion
- Runtime governance
- Knowledge compression
Intent is defined by humans. System purpose, value flow, state machines, invariants, constraints, risk tolerance. This is not code. This is the contract space, the thing that must exist before any agent generates anything.
Intent is the only thing that survives. Implementations get replaced. There may be 100 ways to do a thing, but intent persists.
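To make “intent as the contract space” concrete, here is an illustrative sketch: intent captured as data, an allowed state machine plus an invariant predicate, that survives any reimplementation. The order domain and every name here are invented for illustration.

```typescript
// Illustrative only: intent as a machine-checkable state machine.
type OrderState = "draft" | "submitted" | "paid" | "shipped";

// Defined by humans before any agent generates anything.
const allowedTransitions: Record<OrderState, OrderState[]> = {
  draft: ["submitted"],
  submitted: ["paid"],
  paid: ["shipped"],
  shipped: [],
};

// An invariant every implementation must respect, however it is coded.
function isLegalTransition(from: OrderState, to: OrderState): boolean {
  return allowedTransitions[from].includes(to);
}
```

There may be 100 ways to implement order handling; all of them are judged against this same table.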
Agent generation is where agents plan, scaffold, implement, and refactor. But they only work inside typed schemas, versioned contracts, bounded environments, and isolated slices (an end-to-end unit of work that delivers usable value). No free-roaming generation.
Stochastic output is fine as long as it’s contained. The boundaries are what make it safe.
Evaluation happens before anything gets promoted. Contract tests, integration tests, property-based tests, static analysis, security checks, policy enforcement, snapshot diffs of UI and API outputs.
We don’t scale review. We scale evidence. Humans don’t inspect 1,000 lines of code. They inspect whether behavior is expected.
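One way to picture “scaling evidence” is a snapshot diff over API output: the human reviews the behavioral delta, not the implementation. The `snapshotDiff` function is a sketch under assumed flat-object snapshots, not a real tool’s API.

```typescript
// Sketch: a behavior diff a human can review instead of 1,000 lines of code.
function snapshotDiff(
  before: Record<string, unknown>,
  after: Record<string, unknown>
): string[] {
  const keys = new Set([...Object.keys(before), ...Object.keys(after)]);
  const diffs: string[] = [];
  for (const k of keys) {
    const a = JSON.stringify(before[k]);
    const b = JSON.stringify(after[k]);
    if (a !== b) diffs.push(`${k}: ${a} -> ${b}`);
  }
  return diffs; // this list is what reaches the reviewer
}
```

A change that touches twelve files but produces an empty diff here is a very different review than one that changes a single field in a response.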
Promotion is the most important architectural decision in AgenticOps. The agent loop cannot self-promote. Changes move to a human approval queue where humans review what changed in behavior, not what changed in code.
Governance sits outside the agent loop, not inside it. Self-promotion is where entropy becomes systemic.
Every “AI automation” narrative that gets sloppy gets sloppy at the boundary between what agents decide and what reaches production. AgenticOps draws a hard line.
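That hard line can be sketched structurally: the agent loop can only enqueue, and promotion requires an approval field the loop cannot set. `QueuedChange`, `enqueue`, and `promote` are hypothetical names for illustration.

```typescript
// Sketch: governance outside the agent loop. Agents enqueue; humans promote.
interface QueuedChange {
  id: string;
  behaviorDiff: string[];     // what the human actually reviews
  approvedBy: string | null;  // written only by the approval step, never the agent
}

const approvalQueue: QueuedChange[] = [];

function enqueue(change: QueuedChange): void {
  // Whatever the agent claims, approval is reset on entry to the queue.
  approvalQueue.push({ ...change, approvedBy: null });
}

function promote(id: string): boolean {
  const change = approvalQueue.find((c) => c.id === id);
  return Boolean(change && change.approvedBy); // no human approval, no promotion
}
```

Self-promotion fails not because the agent is well behaved, but because the data model gives it no path to success.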
Runtime governance covers what happens after deployment. Metrics, tracing, logs, SLO tracking, anomaly detection, feature flags, rollback paths. Understanding becomes observational, not memorized.
If something breaks, I don’t reread code. I interrogate behavior.
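“Interrogating behavior” can be as simple as an SLO check wired to a rollback path. The shapes below are assumptions for the sketch, not a real observability API.

```typescript
// Sketch: observational governance. Breach the error budget, flip the flag back.
interface Slo {
  name: string;
  errorBudget: number; // max tolerated error rate, e.g. 0.01 for 1%
}

function shouldRollback(slo: Slo, observedErrorRate: number): boolean {
  return observedErrorRate > slo.errorBudget;
}

const flags: Record<string, boolean> = { "new-pricing": true };

function enforce(slo: Slo, observedErrorRate: number, flag: string): void {
  if (shouldRollback(slo, observedErrorRate)) flags[flag] = false; // rollback path
}
```

Nobody rereads the pricing code to make this decision; the telemetry and the budget make it.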
Knowledge compression means every slice of work produces artifacts. Updated documentation, system summary, state transition diagrams, dependency maps, invariant lists, change logs.
I don’t try to hold the code in my head. I hold compressed models. The system generates its own documentation as a byproduct of the governance process, not as an afterthought someone writes six months later.
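A toy version of documentation-as-byproduct: each completed slice emits a compressed artifact mechanically. `SliceResult` and `compress` are illustrative names only.

```typescript
// Sketch: every slice produces a compressed artifact, not a promise of docs later.
interface SliceResult {
  slice: string;
  invariantsTouched: string[];
  contractsChanged: string[];
}

function compress(result: SliceResult): string {
  return [
    `# ${result.slice}`,
    `Invariants touched: ${result.invariantsTouched.join(", ") || "none"}`,
    `Contracts changed: ${result.contractsChanged.join(", ") || "none"}`,
  ].join("\n"); // appended to the change log as part of promotion
}
```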
The Maturity Model
AgenticOps isn’t binary. It evolves.
- Level 0 is manual coding. Humans write and review everything. This is where most of the industry lived for decades.
- Level 1 is AI-assisted coding. AI generates code. Humans still review line by line. This is where most teams are right now, and it’s where the pain is sharpest. Generation speed outpaces review capacity.
- Level 2 is contract-first generation. Humans define contracts. AI implements against them. Tests gate promotion. This is the minimum viable AgenticOps.
- Level 3 is governed agent loops. Slice queues, evaluation services, approval gates, containment enforced structurally. The governance isn’t a process someone follows. It’s built into the system.
- Level 4 is observational governance. Runtime telemetry feeds back into planning and constraint refinement. The system learns from its own behavior in production.
- Level 5 is adaptive governance. The system proposes constraint improvements within defined boundaries. Humans approve or reject. The governance loop itself becomes partially automated.
I suspect most teams today are somewhere between Level 1 and Level 2. Very few are at Level 3. That’s the gap.
The Model
Humans own outcomes. Agents produce implementation. The system enforces containment.
That’s AgenticOps. Six layers. Six maturity levels. A structural answer to how governance scales alongside generation.
Next, the hardest part. How you actually keep agents inside their boundaries when the thing generating the code is fundamentally probabilistic.
Let’s talk about it.
Previous: [Most Software Is Just CRUD. That’s Not the Problem.]
Next: [How Agents Stay in Bounds]