Verification Beats Debugging

A few days ago I read a post describing an intense engineering sprint.

In roughly three days the author reported:

  • designing and implementing a JVM language
  • building a wiki with its own web server
  • improving the AI of a strategy game
  • creating mutation testing tools
  • implementing a differential mutation strategy

All while enforcing strict engineering discipline.

  • Coverage above 90%.
  • CRAP score under 8.
  • Mutation testing enforced.
  • Files split when complexity exceeded limits.

When the systems were run for the first time, they worked. Not mostly worked. Worked.

That sounds like a miracle if you are used to the normal development loop. The author of that post was Robert C. Martin, often called Uncle Bob, and he reported this in an X post.

But the interesting part is not the accomplishments. It is the engineering loop behind them.


The Normal Development Loop

Most development follows this pattern:

  • Write code.
  • Run program.
  • Debug problems.
  • Repeat.

Execution becomes the discovery mechanism for defects. The system runs, something breaks, and we start searching for the cause. This works, but it is inefficient. Debugging becomes the dominant cost of development.


A Verification Loop

The workflow described in the post follows a different structure.

  • Specification
  • Acceptance tests (ATDD / Gherkin)
  • Unit tests (TDD)
  • Implementation
  • Run tests and fix failures
  • Measure coverage and increase it
  • Measure complexity and reduce it
  • Run mutation testing
  • Refactor until all constraints hold

This is not a coding loop. It is a verification loop. The system never moves forward until each layer of verification holds.


Constraints Instead of Discipline

The key insight is that this process does not rely on discipline alone. It relies on constraints enforced by tools.

The system continuously measures:

  • code coverage
  • cyclomatic complexity
  • CRAP score
  • mutation score

If the metrics fail, the code must be changed.

This turns engineering practice into infrastructure. Instead of relying on developers to remember best practices, the system requires them.


Code Coverage

Code coverage measures how much of the codebase is executed by the test suite.

Coverage tools typically track several dimensions:

  • line coverage
  • branch coverage
  • function coverage

Coverage answers a basic but important question. Did the tests actually execute the code? If large portions of the system are never exercised during testing, defects can hide in those paths.

Higher coverage increases the probability that tests interact with most of the system. Many teams set a minimum threshold such as:

  • coverage ≥ 80%
  • coverage ≥ 90% for critical systems

In the workflow described earlier, coverage was kept above 90%.

Coverage alone does not guarantee correctness. It only tells us that code executed during testing. That is why coverage must be combined with stronger signals like mutation testing.
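The gate itself can be sketched in a few lines. The executed-line sets below are illustrative stand-ins for what a real tool such as coverage.py would report:

```python
# Sketch of a line-coverage gate: fail the build when coverage drops
# below a threshold. In practice the line sets come from a coverage tool;
# here they are hard-coded for illustration.

def line_coverage(executed_lines: set[int], executable_lines: set[int]) -> float:
    """Fraction of executable lines that the test suite actually ran."""
    if not executable_lines:
        return 1.0
    return len(executed_lines & executable_lines) / len(executable_lines)

def coverage_gate(cov: float, threshold: float = 0.90) -> bool:
    """True when the constraint holds (>= 90%, as in the workflow above)."""
    return cov >= threshold

cov = line_coverage({1, 2, 3, 5, 8}, set(range(1, 11)))  # 5 of 10 lines ran
print(f"{cov:.0%}", coverage_gate(cov))  # 50% False
```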


Mutation Testing

Mutation testing strengthens traditional testing.

Coverage answers one question: did the code run? Mutation testing asks a stronger question: if the code were wrong, would the tests detect it?

A mutation engine introduces small semantic changes into the code:

  • flipping boolean conditions
  • changing comparison operators
  • altering arithmetic
  • removing conditions

Each change creates a mutant version of the program.

If the tests fail, the mutant is killed. If the tests pass, the mutant survives. A high mutation score means the tests actually verify behavior.

Execution coverage proves code runs. Mutation coverage proves the tests detect incorrect behavior.
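A toy mutation engine makes the killed/survived distinction concrete. This sketch applies a single boundary mutation (`>=` weakened to `>`) to one function; real engines such as PIT or mutmut apply many operators across a whole codebase:

```python
# Toy mutation engine: mutate one comparison operator in a function's AST,
# then run two test suites against the mutant and report the outcome.
import ast

SRC = """
def is_adult(age):
    return age >= 18
"""

class FlipCompare(ast.NodeTransformer):
    """One classic mutation: weaken >= to > (a boundary mutation)."""
    def visit_Compare(self, node):
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node

def weak_suite(is_adult) -> bool:
    # Never probes the boundary (age == 18), so the mutant slips through.
    return is_adult(30) and not is_adult(5)

def strong_suite(is_adult) -> bool:
    return is_adult(18)  # boundary case: exactly 18 must count as adult

mutant_tree = ast.fix_missing_locations(FlipCompare().visit(ast.parse(SRC)))
ns = {}
exec(compile(mutant_tree, "<mutant>", "exec"), ns)

print("survived" if weak_suite(ns["is_adult"]) else "killed")    # survived
print("survived" if strong_suite(ns["is_adult"]) else "killed")  # killed
```

The weak suite passes against the mutant, so the mutant survives; adding the boundary test kills it. That is exactly the signal a mutation score captures.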


Cyclomatic Complexity

Cyclomatic complexity measures how many independent execution paths exist through a function.

Each branch increases the number of paths.

Examples include:

  • `if` statements
  • loops
  • logical operators
  • conditional expressions

More paths mean more scenarios that must be tested and reasoned about.

Typical guidelines:

  • complexity ≤ 5 → simple
  • complexity ≤ 10 → manageable
  • complexity > 10 → refactor

High cyclomatic complexity does not mean code is wrong.

It means the code is becoming difficult to reason about and difficult to test. Limiting complexity forces functions to remain small and predictable.
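A rough way to see the metric in action is to count decision points in a function's AST. Real analyzers such as radon refine the counting rules, but the core idea is the same: one, plus one per branch point.

```python
# Approximate cyclomatic complexity: 1 plus one for each decision point
# (if, loop, ternary, exception handler, boolean operator).
import ast

DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp,
                  ast.ExceptHandler, ast.BoolOp)

def cyclomatic_complexity(source: str) -> int:
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, DECISION_NODES) for node in ast.walk(tree))

src = """
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and n > d:
            return "composite"
    return "prime-ish"
"""
print(cyclomatic_complexity(src))  # 5: two ifs, one loop, one boolean op, plus 1
```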


CRAP Score

CRAP stands for Change Risk Anti-Patterns.

It combines two signals:

  • cyclomatic complexity
  • test coverage

The idea is simple. Complex code increases risk. Untested code increases risk. Complex and untested code multiplies risk. CRAP quantifies that relationship.
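The commonly cited formula, attributed to Alberto Savoia and Bob Evans, combines the two signals directly, with coverage expressed as a fraction:

```python
# CRAP(m) = comp(m)^2 * (1 - cov(m))^3 + comp(m)
# where comp is a method's cyclomatic complexity and cov its coverage in [0, 1].

def crap_score(complexity: int, coverage: float) -> float:
    return complexity ** 2 * (1 - coverage) ** 3 + complexity

print(crap_score(5, 1.0))   # 5.0  — fully covered: risk collapses to complexity
print(crap_score(5, 0.0))   # 30.0 — same complexity, zero tests
print(crap_score(15, 0.5))  # 43.125 — complex and half-tested
```

Note the asymmetry: full coverage reduces the score to the raw complexity, but no amount of coverage rescues a method whose complexity is already high.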

Typical interpretation:

  • CRAP < 10 → low risk
  • CRAP 10-30 → moderate risk
  • CRAP > 30 → high risk

In the workflow described earlier the target was CRAP below 8.

That forces two things at the same time:

  • code must remain simple
  • tests must remain thorough

Together these dramatically reduce the probability of introducing defects.


Why This Matters for AI

AI-generated code has a predictable weakness. It often looks correct while being semantically fragile.

The code compiles. The tests run. But small behavioral changes break the system.

Mutation testing directly attacks that weakness. Cyclomatic complexity prevents large opaque functions from emerging. CRAP ensures complex areas remain heavily tested. Together these metrics create guardrails that stabilize generated code.

This fits naturally into an AgenticOps pipeline.


The AgenticOps Verification Loop

A practical AgenticOps workflow might look like this.

  • Specification
  • Agent generates acceptance tests
  • Agent generates unit tests
  • Agent generates implementation
  • Run tests and fix failures
  • Measure coverage and improve it
  • Reduce complexity and CRAP
  • Mutation testing attacks the code
  • Agent fixes surviving mutants
  • Repeat until all quality gates pass

The system continuously attempts to invalidate its own behavior. Only code that survives adversarial verification moves forward.
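The promotion decision at the end of that loop reduces to a set of boolean gates. A minimal sketch, with illustrative names and thresholds:

```python
# Quality gates for the verification loop: all must hold before code
# is promoted. Thresholds here are illustrative.
from dataclasses import dataclass

@dataclass
class Metrics:
    coverage: float        # fraction of lines executed by tests
    complexity: int        # max cyclomatic complexity across functions
    crap: float            # max CRAP score across functions
    mutation_score: float  # fraction of mutants killed

GATES = {
    "coverage >= 90%":       lambda m: m.coverage >= 0.90,
    "complexity <= 10":      lambda m: m.complexity <= 10,
    "CRAP < 8":              lambda m: m.crap < 8,
    "mutation score >= 90%": lambda m: m.mutation_score >= 0.90,
}

def failing_gates(m: Metrics) -> list[str]:
    return [name for name, check in GATES.items() if not check(m)]

m = Metrics(coverage=0.93, complexity=7, crap=6.5, mutation_score=0.84)
print(failing_gates(m))  # ['mutation score >= 90%'] — fix surviving mutants
```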


Architecture Through Measurement

Another interesting rule in the workflow was limiting files to a maximum number of mutation sites. Mutation sites correlate with complexity.

As files accumulate mutation points, they become harder to reason about. Instead of manually policing architecture, the system enforces limits:

  • maximum mutation sites per file
  • maximum cyclomatic complexity
  • maximum CRAP score

When limits are exceeded, refactoring becomes mandatory. Architecture emerges from constraints.
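Counting mutation sites is cheap enough to enforce per file. A sketch using Python's `ast` module, with an illustrative node set and budget:

```python
# Enforce a per-file mutation-site budget by counting nodes where a
# mutation engine would typically insert a change.
import ast

MUTABLE = (ast.Compare, ast.BoolOp, ast.BinOp, ast.UnaryOp)

def mutation_sites(source: str) -> int:
    return sum(isinstance(n, MUTABLE) for n in ast.walk(ast.parse(source)))

def needs_split(source: str, budget: int = 40) -> bool:
    """Flag a file for mandatory refactoring once it exceeds its budget."""
    return mutation_sites(source) > budget

small = "def f(x):\n    return x + 1\n"
print(mutation_sites(small), needs_split(small))  # 1 False
```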


Acceptance Tests First

Another subtle pattern is the order of operations. The systems were not executed during development. Behavior was defined through acceptance tests before the implementation existed. Only after the verification pipeline passed was the system executed.

Execution was confirmation. Not discovery.


Deterministic Pipelines

AI-assisted development introduces a fundamental challenge: trust. Developers often ask whether generated code “looks correct”. That is not the right question. The right question is whether the code passes the verification pipeline.

Pipelines provide deterministic evaluation of stochastic output. They transform judgment into measurement.


Parallel Verification

In the original story, everything ran on a single machine. Tests, mutation engines, coverage analysis, and refactoring cycles competed for CPU time.

Modern systems can push this further. Verification can run in parallel:

  • test workers
  • mutation workers
  • coverage analysis
  • linting
  • architecture checks

Parallel verification shortens feedback loops while preserving rigor.
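With Python's `concurrent.futures` the fan-out is a few lines. Each stage below is a stand-in for a real tool invocation (pytest, a mutation engine, a coverage report, a linter):

```python
# Sketch of running independent verification stages in parallel.
from concurrent.futures import ThreadPoolExecutor

def run_stage(name: str) -> tuple[str, bool]:
    # In practice this would shell out to the tool, e.g. subprocess.run([...]).
    return name, True  # pretend the stage passed

STAGES = ["unit tests", "mutation testing", "coverage", "lint", "architecture"]

with ThreadPoolExecutor(max_workers=len(STAGES)) as pool:
    results = dict(pool.map(run_stage, STAGES))

print(all(results.values()), results)
```

Stages that share no state can always run concurrently; only the final promotion decision needs all of them.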


Engineering Confidence

The most important takeaway is not productivity. It is confidence.

By the time the systems were executed, they had already survived:

  • acceptance tests
  • unit tests
  • mutation testing
  • coverage gates
  • structural constraints

Execution became almost a formality.

This kind of discipline has been advocated for years by engineers like Robert C. Martin, but the lesson is broader than any individual methodology. Verification beats debugging.


Convergent Patterns

This pattern appears across many engineering environments. Different teams use different tools, but the structure is consistent:

  • tight feedback loops
  • automated verification
  • promotion gates

The tools evolve. The principles remain.

AgenticOps applies these same ideas to AI-assisted development. The goal is not to trust the agent. The goal is to build systems where trust is unnecessary.

Let’s talk about it.
