Tagged: Architecture

March 9, 2026

What AgenticOps Actually Looks Like

Total understanding is a myth. Cheap generation without governance creates invisible debt. Those were the first two claims.

Now it’s time to make AgenticOps concrete. Not a vibe. Not “AI but responsibly.” A governance operating model for AI-amplified system production.

The Problem Is We’re Reviewing the Wrong Thing

AI can generate entire services, DTO layers, migrations, integration adapters, test scaffolding, refactors across modules. Generation bandwidth has exploded. Human review has not. And it won’t. Human review doesn’t scale linearly with generation speed.

That pain is real. You hear it even from people leading frontier AI labs. The bottleneck is no longer “who can write the code.” It’s “who can review all this safely.”

I believe the deeper issue is that we’re reviewing the wrong thing. We’re reviewing lines of code. What we should be reviewing are outcomes. Does it satisfy the contract, pass the tests, respect the boundaries? AgenticOps makes structural review the default, not the exception.

The Failure Mode

The obvious failures aren’t the real danger. The real danger is slow structural drift. Behavior changing without anyone realizing the invariants were never encoded. Contract drift across services that only surfaces when two teams try to integrate. Feature interactions no one modeled because no one knew the features existed.

None of this announces itself. It accumulates. That’s the environment AgenticOps is designed for.

What Success Looks Like

Success in AgenticOps looks like this: humans define what the system must do and what must never break. Agents generate implementation inside explicit boundaries. Every change is evaluated automatically before promotion.

Runtime behavior is observable and reversible. Surface area growth does not increase risk proportionally.

Humans stop reviewing keystrokes. Instead they review contracts, invariants, risk surfaces, behavior diffs, telemetry, and business outcomes. Machines review implementation.

The Non-Negotiables

AgenticOps asserts a few invariants of its own. No generation without defined contracts. No promotion without evaluation. No runtime without observability. No agent autonomy without bounded scope. No hidden state transitions. No change without containment.

These are structural guarantees. They replace “LGTM” and rubber stamp reviews with real safety nets. They make it harder to accidentally introduce systemic risk through sloppy review.

The Layers

I think about AgenticOps as six layers. They build on each other.

Intent is defined by humans. System purpose, value flow, state machines, invariants, constraints, risk tolerance. This is not code. This is the contract space, the thing that must exist before any agent generates anything.

Intent is the only thing that survives. Implementations get replaced. There may be 100 ways to do a thing, but intent persists.

Agent generation is where agents plan, scaffold, implement, and refactor. But they only work inside typed schemas, versioned contracts, bounded environments, and isolated slices (a slice is an end-to-end unit of work that delivers usable value). No free-roaming generation.

Stochastic output is fine as long as it’s contained. The boundaries are what make it safe.
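One way to picture containment is a check that runs on every agent change set before evaluation even starts. This is a minimal sketch, not a prescribed implementation: `SliceBoundary` and its path-prefix rule are hypothetical names and assumptions made for illustration, and a real boundary would likely also cover schemas, contracts, and environment access.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical: the set of path prefixes a single slice is allowed to touch.
public sealed class SliceBoundary
{
    private readonly string[] _allowedPrefixes;

    public SliceBoundary(params string[] allowedPrefixes)
    {
        // Normalize to a trailing slash so "src/orders" does not match "src/orders-legacy/".
        _allowedPrefixes = allowedPrefixes
            .Select(prefix => prefix.TrimEnd('/') + "/")
            .ToArray();
    }

    // Returns every touched path that falls outside the slice, in input order.
    public IReadOnlyList<string> Violations(IEnumerable<string> touchedPaths) =>
        touchedPaths.Where(path => !IsAllowed(path)).ToList();

    private bool IsAllowed(string path)
    {
        // Reject traversal segments outright rather than trying to resolve them.
        if (path.Split('/').Contains("..")) return false;

        return _allowedPrefixes.Any(prefix =>
            path.StartsWith(prefix, StringComparison.Ordinal));
    }
}
```

A change set with any violations would be rejected before it reaches evaluation, so the agent can be as stochastic as it likes inside the slice without that variance leaking into neighboring code.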

Evaluation happens before anything gets promoted. Contract tests, integration tests, property-based tests, static analysis, security checks, policy enforcement, snapshot diffs of UI and API outputs.

We don’t scale review. We scale evidence. Humans don’t inspect 1,000 lines of code. They inspect whether behavior is expected.

Promotion is the most important architectural decision in AgenticOps. The agent loop cannot self-promote. Changes move to a human approval queue where humans review what changed in behavior, not what changed in code.

Governance sits outside the agent loop, not inside it. Self-promotion is where entropy becomes systemic.

Every “AI automation” narrative that gets sloppy gets sloppy at the boundary between what agents decide and what reaches production. AgenticOps draws a hard line.
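To make that hard line concrete, here is a rough sketch of a promotion gate. Every type and name in it (`Actor`, `ChangeSet`, `PromotionGate`, the decision values) is hypothetical and invented for illustration. The only ideas taken from the text above are that evidence must exist and pass, and that the approver must be a human outside the agent loop.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum ActorKind { Agent, Human }

public sealed record Actor(string Name, ActorKind Kind);

// One piece of evaluation evidence, e.g. "contract tests" or "policy check".
public sealed record EvaluationResult(string Check, bool Passed);

public sealed record ChangeSet(
    string SliceId,
    Actor Author,
    IReadOnlyList<EvaluationResult> Evidence);

public enum PromotionDecision
{
    Promoted,
    RejectedNoEvidence,
    RejectedFailedEvaluation,
    RejectedApproverNotHuman
}

public static class PromotionGate
{
    // The gate sits outside the agent loop: it only looks at evidence and at who approves.
    public static PromotionDecision Decide(ChangeSet change, Actor approver)
    {
        // No promotion without evaluation.
        if (change.Evidence.Count == 0)
            return PromotionDecision.RejectedNoEvidence;

        if (change.Evidence.Any(result => !result.Passed))
            return PromotionDecision.RejectedFailedEvaluation;

        // The agent loop cannot self-promote, and neither can another agent.
        if (approver.Kind != ActorKind.Human)
            return PromotionDecision.RejectedApproverNotHuman;

        return PromotionDecision.Promoted;
    }
}
```

The useful property is that the approver never sees a code diff here. The inputs are evidence and identity, which is what reviewing behavior rather than keystrokes looks like once it is written down.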

Runtime governance covers what happens after deployment. Metrics, tracing, logs, SLO tracking, anomaly detection, feature flags, rollback paths. Understanding becomes observational, not memorized.

If something breaks, I don’t reread code. I interrogate behavior.

Knowledge compression means every slice of work produces artifacts. Updated documentation, system summary, state transition diagrams, dependency maps, invariant lists, change logs.

I don’t try to hold the code in my head. I hold compressed models. The system generates its own documentation as a byproduct of the governance process, not as an afterthought someone writes six months later.

The Maturity Model

AgenticOps isn’t binary. It evolves.

Level 0 is manual coding. Humans write and review everything. This is where most of the industry lived for decades.
Level 1 is AI-assisted coding. AI generates code. Humans still review line by line. This is where most teams are right now, and it’s where the pain is sharpest. Generation speed outpaces review capacity.
Level 2 is contract-first generation. Humans define contracts. AI implements against them. Tests gate promotion. This is the minimum viable AgenticOps.
Level 3 is governed agent loops. Slice queues, evaluation services, approval gates, containment enforced structurally. The governance isn’t a process someone follows. It’s built into the system.
Level 4 is observational governance. Runtime telemetry feeds back into planning and constraint refinement. The system learns from its own behavior in production.
Level 5 is adaptive governance. The system proposes constraint improvements within defined boundaries. Humans approve or reject. The governance loop itself becomes partially automated.

I suspect most teams today are somewhere between Level 1 and Level 2. Very few are at Level 3. That’s the gap.

The Model

Humans own outcomes. Agents produce implementation. The system enforces containment.

That’s AgenticOps. Six layers. Six maturity levels, from Level 0 to Level 5. A structural answer to how governance scales alongside generation.

Next, the hardest part. How you actually keep agents inside their boundaries when the thing generating the code is fundamentally probabilistic.

Let’s talk about it.


February 11, 2025

Building Resilient .NET Applications using Polly

In distributed systems, outages and transient errors are inevitable. Ensuring that your application stays responsive when a dependent service goes down is critical. This article explores service resilience using Polly, a .NET library that helps handle faults gracefully. It covers basic resilience strategies and explains how to keep your service running when a dependency is unavailable.

What Is Service Resilience

Service resilience is the ability of an application to continue operating despite failures such as network issues, temporary service unavailability, or unexpected exceptions. A resilient service degrades gracefully rather than crashing outright, ensuring users receive the best possible experience even during failures.

Key aspects of resilience include:

Retrying Failed Operations: Automatically attempting an operation again when a transient error occurs.
Breaking the Circuit: Preventing a system from continuously attempting operations that are likely to fail.
Falling Back: Providing an alternative response or functionality when a dependent service is unavailable.

Introducing Polly: The .NET Resilience Library

Polly is an open-source library for .NET that simplifies resilience strategies. Polly allows defining policies to handle transient faults, combining strategies into policy wraps, and integrating them into applications via dependency injection.

Polly provides several resilience strategies:

Retry: Automatically reattempts operations when failures occur.
Circuit Breaker: Stops attempts temporarily once failures exceed a threshold.
Fallback: Provides a default value or action when an operation still fails, for example after all retries are exhausted.
Timeout: Cancels operations that take too long.

These strategies can be combined to build a robust resilience pipeline.

Key Polly Strategies for Service Resilience

Retry Policy

The retry policy is useful when failures are transient. Polly can automatically re-execute failed operations after a configurable delay. Example:

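using System;
using System.Net.Http;
using Polly;

// Retry up to 3 times on network errors or non-success responses,
// waiting 2, 4, then 8 seconds between attempts (exponential backoff).
var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .OrResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
    .WaitAndRetryAsync(3, retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
        onRetry: (outcome, timespan, retryCount, context) =>
        {
            Console.WriteLine($"Retry {retryCount}: waiting {timespan} before next attempt.");
        });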

Circuit Breaker

A circuit breaker prevents an application from continuously retrying an operation that is likely to fail, protecting it from cascading failures. Example:

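// Open the circuit after 3 consecutive handled failures and keep it open for 30 seconds.
// While it is open, calls fail fast with BrokenCircuitException instead of reaching the dependency.
var circuitBreakerPolicy = Policy
    .Handle<HttpRequestException>()
    .OrResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
    .CircuitBreakerAsync(
        handledEventsAllowedBeforeBreaking: 3,
        durationOfBreak: TimeSpan.FromSeconds(30),
        onBreak: (outcome, breakDelay) =>
        {
            Console.WriteLine($"Circuit breaker opened for {breakDelay.TotalSeconds} seconds.");
        },
        onReset: () =>
        {
            Console.WriteLine("Circuit breaker reset.");
        });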

Fallback Strategy: Keeping Your Service Running

When a dependent service is down, a fallback policy provides a default or cached response instead of propagating an error. Example:

// Return a placeholder response when the call fails or comes back with a non-success status.
var fallbackPolicy = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .OrResult(r => !r.IsSuccessStatusCode)
    .FallbackAsync(
         fallbackAction: cancellationToken => Task.FromResult(
             new HttpResponseMessage(HttpStatusCode.OK)
             {
                 Content = new StringContent("Service temporarily unavailable. Please try again later.")
             }),
         // This overload pairs a CancellationToken-only fallback action with an outcome-only callback.
         onFallbackAsync: outcome =>
         {
             Console.WriteLine("Fallback executed: dependent service is down.");
             return Task.CompletedTask;
         });

Timeout Policy

A timeout policy ensures that long-running requests do not block system resources indefinitely. Example:

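// Optimistic timeout (Polly's default): the delegate must honor the CancellationToken; on timeout Polly throws TimeoutRejectedException.
var timeoutPolicy = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(10));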

Implementing Basic Service Resilience with Polly

Example Use Case: Online Payment Processing System

Imagine an e-commerce platform, ShopEase, which processes customer payments through an external payment gateway. To ensure a seamless shopping experience, ShopEase implements the following resilience strategies:

Retry Policy: If the payment gateway experiences transient network issues, ShopEase retries the request automatically before failing.
Circuit Breaker: If the payment gateway goes down for an extended period, the circuit breaker prevents continuous failed attempts.
Fallback Policy: If the gateway is unavailable, ShopEase allows customers to save their cart and receive a notification when payment is available.
Timeout Policy: If the payment gateway takes too long to respond, ShopEase cancels the request and notifies the customer.

By integrating these resilience patterns, ShopEase ensures a robust payment processing system that enhances customer trust and maintains operational efficiency, even when external services face issues.
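The four ShopEase policies can be combined into one pipeline with a policy wrap. The following is a minimal sketch, assuming the Polly v7-style `Policy` API used throughout this article. `PaymentResilience` and `BuildPipeline` are hypothetical names, the backoff is passed in so it can be shortened in tests, and the ordering (fallback outermost, then retry, circuit breaker, and a per-attempt timeout innermost) is one reasonable choice rather than the only one. The sketch also assumes the outer policies should handle `TimeoutRejectedException`, and that the fallback should handle `BrokenCircuitException`, so those failures do not escape the pipeline.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;
using Polly.Timeout;

public static class PaymentResilience
{
    // Hypothetical helper: builds fallback -> retry -> circuit breaker -> timeout.
    public static IAsyncPolicy<HttpResponseMessage> BuildPipeline(Func<int, TimeSpan> backoff)
    {
        // Innermost: each individual attempt gets at most 10 seconds.
        var timeout = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(10));

        // Opens after 3 consecutive handled failures, then rejects calls for 30 seconds.
        var breaker = Policy
            .Handle<HttpRequestException>()
            .Or<TimeoutRejectedException>()
            .OrResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
            .CircuitBreakerAsync(3, TimeSpan.FromSeconds(30));

        // Retries transient failures; an open circuit is not retried here.
        var retry = Policy
            .Handle<HttpRequestException>()
            .Or<TimeoutRejectedException>()
            .OrResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
            .WaitAndRetryAsync(3, backoff);

        // Outermost: whatever still fails becomes a placeholder response.
        var fallback = Policy<HttpResponseMessage>
            .Handle<HttpRequestException>()
            .Or<TimeoutRejectedException>()
            .Or<BrokenCircuitException>()
            .OrResult(r => !r.IsSuccessStatusCode)
            .FallbackAsync(cancellationToken => Task.FromResult(
                new HttpResponseMessage(HttpStatusCode.OK)
                {
                    Content = new StringContent("Payment is temporarily unavailable. Your cart has been saved.")
                }));

        // WrapAsync lists policies from outermost to innermost.
        return Policy.WrapAsync<HttpResponseMessage>(fallback, retry, breaker, timeout);
    }
}
```

A caller would then run the gateway request through the pipeline, for example `await pipeline.ExecuteAsync(ct => httpClient.GetAsync(url, ct), cancellationToken)`, where `httpClient` and `url` stand in for the real payment call. If the application uses `IHttpClientFactory`, the Microsoft.Extensions.Http.Polly package offers `AddPolicyHandler` to attach a policy like this one to a named or typed client.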

Conclusion

Building resilient services means designing systems that remain robust under pressure. Polly enables implementing retries, circuit breakers, timeouts, and fallback strategies to keep services running even when dependencies fail. This improves the user experience and enhances overall application reliability.

I advocate for 12-Factor Apps (https://12factor.net/), and while resilience is not directly part of the 12-Factor methodology, many of its principles support it indirectly. For truly resilient applications, combine several strategies: Polly for .NET, Kubernetes auto-recovery, and chaos engineering. 12-Factor principles, auto-recovery, and auto-scaling work alongside Polly to help services stay resilient and performant.

By applying these techniques, developers can create resilient architectures that gracefully handle failure scenarios while maintaining consistent functionality for users. Implement Polly and supporting resilience strategies to ensure applications stay operational despite unexpected failures.

June 22, 2015

Multitenant Thoughts

I am building my third multitenant SaaS solution. I am not referencing any of my earlier work because I think those solutions were way more work than they should have been. Also, I have since moved on from the whole ASP.NET Web Forms development mindset and I want to start with a fresh perspective instead of trying to improve my big balls of spaghetti code.

Today, my thoughts center around enforcing the inclusion and processing of a tenant ID in every command and query. My tenant model keeps all tenant data in a shared database and tables. To keep everything segregated, every time I write data and read data there has to be a tenant ID included so that we don’t mess with the wrong tenant’s data.

I have seen all kinds of solutions for this, some more complicated than I care to tackle at this moment. I am currently leaning towards enforcing it in the data repository.

I am using a generic repository for CRUD operations and an event repository for async event-driven workflows. In the repository APIs I want to introduce a validated parameter for tenant ID in every write and read operation. This will force all clients to provide the ID when they call the repos.

I just have to update a couple of classes in the repos to enforce inclusion of the tenant ID when I write data. Also, every read will use the tenant ID to scope the result set to a specific tenant’s data. I already have a proof of concept for this app, so this will be a breaking change for my existing clients, but it is still not a lot of work considering that I almost decided to enforce the tenant ID in a layer higher than the repo, which would have been a maintenance nightmare.
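A rough sketch of what that repository contract could look like follows. It is not the actual code from this project: `TenantId`, `ITenantOwned`, `InMemoryTenantRepository`, and `Invoice` are hypothetical names, and an in-memory list stands in for the shared database so the example stays self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical validated tenant ID: an empty ID cannot be constructed.
public sealed record TenantId
{
    public Guid Value { get; }

    public TenantId(Guid value)
    {
        if (value == Guid.Empty)
            throw new ArgumentException("Tenant ID is required.", nameof(value));
        Value = value;
    }
}

public interface ITenantOwned
{
    Guid Id { get; }
    TenantId TenantId { get; }
}

// Every read and write takes a tenant ID, so callers cannot forget it.
public sealed class InMemoryTenantRepository<T> where T : class, ITenantOwned
{
    private readonly List<T> _items = new List<T>();

    public void Add(TenantId tenantId, T item)
    {
        if (tenantId is null) throw new ArgumentNullException(nameof(tenantId));
        if (item is null) throw new ArgumentNullException(nameof(item));

        // Refuse to write an entity under a tenant it does not belong to.
        if (item.TenantId != tenantId)
            throw new InvalidOperationException("Entity belongs to a different tenant.");

        _items.Add(item);
    }

    public IReadOnlyList<T> Query(TenantId tenantId, Func<T, bool> predicate)
    {
        if (tenantId is null) throw new ArgumentNullException(nameof(tenantId));

        // The tenant filter is applied first and cannot be skipped by the caller.
        return _items
            .Where(item => item.TenantId == tenantId)
            .Where(predicate)
            .ToList();
    }
}

// Example entity for the sketch.
public sealed record Invoice(Guid Id, TenantId TenantId, decimal Total) : ITenantOwned;
```

Because the tenant ID is a parameter on every method rather than ambient state, a client that forgets it fails at compile time, and the scoping logic lives in one class instead of being repeated in every caller.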

Is this best practice? No. I don’t think there is a best practice besides the fact that you should use a tenant ID to segregate tenant data in a shared data store. This solution works for my problem and I am able to maintain it in just a couple classes. If the problem changes I can look into the fancy solutions I read about.

Now, how will I resolve the tenant ID? Sub-folder, sub-domain, query string, custom domain…?

