2026.05.22 · ~8 min read · development · agents · trust
Essay · No. 05

My Automated Doubt Development Process

How front-loading scrutiny through multi-agent validation rebuilt trust in AI-assisted development.

This process originated out of a lack of trust. I lost trust early in my AI-assisted development by allowing our LLM partners to do too much, too quickly, and without the standard engineering practices I had come to internalize. I regained it by automating as much doubt as I could muster. What does performing doubt look like? Critiquing the implementation of an artifact, and doing so repeatedly. If you are using AI to write code, specs, docs, or any other artifact, you may find this piece useful.

I use subagents quite a bit. They are the fulcrum of the entire process. Each is specialized to audit a perspective that a standard instantiation of Claude wouldn't necessarily cover. The core idea in all of this is automated doubt from multiple perspectives and the front-loading of scrutiny. The more parallax coverage in AI development, the better: different vantage points catch different defects, the way two eyes give you depth. The development process goes something like this:

Phase 1 — Design

It starts with an idea or a feature I'd like to build and a specification. As in any good development practice, it's usually wise to start with a spec, PRD, plan, or whatever flavor of design document you prefer. I ask Claude to write the spec and spend 2–5 minutes skimming the file to verify that the core implementation aspects of the idea are captured. This is where the iteration process begins.

I start with a Pre-implementation workflow (a slash command in Claude Code), which consists of three agents performing the first round of doubt: Pre-Implementation Architect, Documentation Validator, and Assumption Excavator. These agents assess design quality, scope, and completeness, and surface documentation gaps and the hidden assumptions in the spec. The main terminal agent folds all relevant findings into the spec, usually 10–25 of them depending on the scope of the idea.

Example findings:

Assumption Excavator: "executionStatsSchema in registry-sdk returns {totalCount, recentCount, windowMinutes}. Spec assumes {avgScore, medianDurationMs, passRate, lastRunDate, lastRunScore}. Entire history section unbuildable without new API endpoint"

Pre-Implementation Architect: "HarnessProfile embeds mcp.read/merge/remove/write methods alongside path config. Consider extracting McpConfigStrategy to separate concerns. Each harness file will grow to 80–120 lines otherwise."
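The Assumption Excavator finding above comes down to a field-level comparison between what the schema returns and what the spec assumes. A minimal sketch of that check follows; the field names are copied from the finding, while `missingFields` and the two array names are hypothetical helpers introduced here for illustration, not part of registry-sdk.

```typescript
// Field names copied from the finding; everything else here is illustrative.
const returnedBySchema = ["totalCount", "recentCount", "windowMinutes"];
const assumedBySpec = [
  "avgScore",
  "medianDurationMs",
  "passRate",
  "lastRunDate",
  "lastRunScore",
];

// Returns every required field that the available set does not provide.
function missingFields(
  available: readonly string[],
  required: readonly string[],
): string[] {
  return required.filter((field) => available.indexOf(field) === -1);
}

const missing = missingFields(returnedBySchema, assumedBySpec);
console.log(missing); // all five assumed fields are absent from the schema
```

None of the assumed fields overlap with the returned ones, which is why the finding calls the entire history section unbuildable without a new endpoint.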

The scope determines the number of iterations I make. If the scope calls for it, the iteration continues with the next set of agents: Gap Analyzer, Implied Completeness Detector, and Ambiguity Mapper. These agents are particularly good at finding the aspects of the system that the spec omits. When gaps are discovered, they are added to the spec.

Example findings:

Gap Analyst: "McpConfigStrategy defines read/merge/write but does not specify behavior for malformed input, permission denied, partial write failure, or file locking. Destructive operation on user config files across 4 harnesses in 3 formats."

Implied Completeness Detector: "Manifest records version at root but installation state per-harness. When v0.3.0 user (Claude Code) runs v0.4.0 with --harness opencode, behavior undefined. Per-harness versioning or upgrade reconciliation not addressed."
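One way a spec could answer the Gap Analyst finding above is to name each failure mode explicitly so callers have to handle it. The sketch below is an assumption throughout: `McpConfigFailure`, `McpConfigResult`, `readJsonConfig`, and `describeFailure` are hypothetical names, and JSON stands in for whichever of the three formats a harness actually uses.

```typescript
// Hypothetical failure modes, named after the gaps listed in the finding.
type McpConfigFailure =
  | { kind: "malformed-input"; detail: string }
  | { kind: "permission-denied"; path: string }
  | { kind: "partial-write"; path: string }
  | { kind: "file-locked"; path: string };

type McpConfigResult<T> =
  | { ok: true; value: T }
  | { ok: false; failure: McpConfigFailure };

// Assumes a JSON config purely for illustration.
function readJsonConfig(text: string): McpConfigResult<unknown> {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (error) {
    const detail = error instanceof Error ? error.message : String(error);
    return { ok: false, failure: { kind: "malformed-input", detail } };
  }
}

// The switch covers every variant, so adding a new failure kind
// becomes a compile error here until it is handled.
function describeFailure(failure: McpConfigFailure): string {
  switch (failure.kind) {
    case "malformed-input":
      return "config could not be parsed: " + failure.detail;
    case "permission-denied":
      return "no permission to access " + failure.path;
    case "partial-write":
      return "write to " + failure.path + " did not complete";
    case "file-locked":
      return failure.path + " is locked by another process";
  }
}
```

Returning a result value rather than throwing keeps the destructive-operation paths visible in the type signature, which is the kind of behavior the finding says the spec left undefined.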

For practical use:

  • Small scope: Pre-implementation only
  • Medium scope: Pre-implementation with Gap, Implied, Ambiguity
  • Large scope: Full sweep with multiple runs with each, occasionally dipping into other specialized agents
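The scope tiers above can be written down as a small lookup. This is only a sketch: `planFor`, `Plan`, and the constant names are hypothetical, and the run count for large scope is an assumed placeholder because the text says only "multiple runs". The extra specialized agents sometimes used for large scope are not modeled.

```typescript
type Scope = "small" | "medium" | "large";

interface Plan {
  agents: string[];
  runs: number; // how many times the sweep is repeated
}

// Agent names as listed in the text.
const PRE_IMPLEMENTATION = [
  "Pre-Implementation Architect",
  "Documentation Validator",
  "Assumption Excavator",
];
const GAP_SWEEP = [
  "Gap Analyzer",
  "Implied Completeness Detector",
  "Ambiguity Mapper",
];

function planFor(scope: Scope): Plan {
  switch (scope) {
    case "small":
      return { agents: PRE_IMPLEMENTATION.slice(), runs: 1 };
    case "medium":
      return { agents: PRE_IMPLEMENTATION.concat(GAP_SWEEP), runs: 1 };
    case "large":
      // "Multiple runs" is not quantified in the text; 2 is an assumption.
      return { agents: PRE_IMPLEMENTATION.concat(GAP_SWEEP), runs: 2 };
  }
}

console.log(planFor("medium").agents.length); // 6
```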

Now I pause and spend some time reading the spec, roughly 15–60 minutes. If everything checks out and the spec is ready for development, I ask Claude to generate a companion checklist that we can follow and update as we go. The checklist is a separate file, which helps if you need to step away and close out a session mid-development.

Phase 2 — Development

Claude pulls up the spec and checklist and begins development. If I'm picking up a partly implemented spec in a new session, I usually ask Claude to explore, or send a Chain Tracer or Deep Explore subagent to get the complete picture before restarting.

One aspect of my development process that might stand out, and that I would like to highlight: I don't use subagents for writes. This comes back to the trust angle. Spawning subagents for writes went awry often enough, causing more harm than good, that I drew a temporary line in the sand. I say temporary because this will undoubtedly change. As I understand it, there are methods for proper swarm orchestration, worktrees, and agent-to-spec driven development, but that's a bit beyond my trust level for now. Sometimes the Claude terminal agent will spawn subagents for bulk updates, but I prefer a single Claude Code terminal instance building out the project.

I work through all phases of the specification until it is complete and verify that the build works. Then comes the post-implementation process. I mentioned automated doubt, and this is where it shines. The next several iterations involve running a Post-Implementation workflow consisting of the following subagents: Code Validator, Type Safety Validator, Test Architect, Code Optimizer, Public Interface Validator, and Security Analyst. These agents audit the codebase and report findings on code and testing quality, security posture, duplication, performance, semantic or structural integrity, documentation, the public interface, and so on. Depending on the scope, the first run usually generates 15–35 findings, usually with the first 15–20 flagged as critical or high severity. I address these findings and re-run the Post-Implementation workflow, then tackle the next set of issues, then the next, and so on until I've reached my idea of what quality ought to look like.
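The run, fix, and re-run cycle described above can be sketched as a bounded loop. Everything named here is hypothetical (`iterateUntilAcceptable`, `AuditFinding`, the callbacks), and the workflow is modeled as a plain synchronous function for brevity. In practice the stopping rule is the operator's judgement; `isBlocking` is just one way to encode a threshold.

```typescript
type Severity = "critical" | "high" | "medium" | "low";

interface AuditFinding {
  agent: string;
  severity: Severity;
  summary: string;
}

// Runs the workflow, hands blocking findings to `address`, and repeats
// until a run comes back clean or the round budget is spent.
function iterateUntilAcceptable(
  runWorkflow: () => AuditFinding[],
  address: (findings: AuditFinding[]) => void,
  isBlocking: (finding: AuditFinding) => boolean,
  maxRounds: number,
): { rounds: number; lastBlocking: AuditFinding[] } {
  let lastBlocking: AuditFinding[] = [];
  for (let round = 1; round <= maxRounds; round++) {
    lastBlocking = runWorkflow().filter(isBlocking);
    if (lastBlocking.length === 0) {
      return { rounds: round, lastBlocking };
    }
    address(lastBlocking);
  }
  // Budget exhausted: report what the final run still flagged.
  return { rounds: maxRounds, lastBlocking };
}
```

The `maxRounds` cap reflects the token cost of each pass: the loop ends either when a run is clean by the chosen threshold or when the budget says to stop and decide by hand.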

Example findings:

Code Validator: "Every other execution method calls trackIfEnabled() after completion. startPipeline() returns PipelineHandle directly without tracking. Async pipeline users get no tracking data."

Security Analyst: "PreflightError includes shellQuote-expanded target path verbatim. Error messages containing resolved filesystem paths may propagate to tracking API and dashboard."
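One possible mitigation for the Security Analyst finding above is to redact filesystem paths before an error message leaves the process. This is a sketch under stated assumptions, not the fix actually applied: `redactPaths` is a hypothetical helper, and its pattern only covers absolute POSIX-style paths without spaces.

```typescript
// Replaces absolute POSIX-style paths with a placeholder.
// Deliberately simple: no Windows paths, no paths containing spaces.
function redactPaths(message: string): string {
  // Group 1 keeps the character before the path (or start of string),
  // so ordinary text such as "and/or" is left alone.
  return message.replace(
    /(^|[\s'"(=:])(?:\/[\w.\-]+)+/g,
    (_match: string, prefix: string) => prefix + "<path>",
  );
}

console.log(redactPaths("PreflightError: cannot access /home/alice/projects/app/dist"));
// PreflightError: cannot access <path>
```

Redacting at the boundary where errors are handed to tracking keeps local messages useful for debugging while limiting what propagates to the API and dashboard.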

Phase 3 — Wrap-up and Ship

Once I'm satisfied with what I'm ready to release and everything checks out in both practical and quality terms, I run the final workflow: Ship. This workflow consists of the following agents: Code Validator, Type Safety Validator, Test Architect, Code Auditor, Public Interface Validator, Security Analyst, Anxiety Reader, API Contract Validator (if there is an API), and Release Readiness Validator. It finalizes the iterative process from the previous phase. Five of the nine agents were already in the post-implementation workflow, so they should be finding very little or entering preference territory. The others check the API contract (if relevant), runtime consistency, what could break, and the release posture of the system. When this runs, the question is: is this ready for release? Depending on the complexity, this may require two or more iterations of Ship.
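The overlap between the two workflows can be checked directly from the agent lists given in the text. The variable names below are illustrative only.

```typescript
// Agent names as listed for each workflow in the text.
const postImplementationAgents = [
  "Code Validator",
  "Type Safety Validator",
  "Test Architect",
  "Code Optimizer",
  "Public Interface Validator",
  "Security Analyst",
];
const shipAgents = [
  "Code Validator",
  "Type Safety Validator",
  "Test Architect",
  "Code Auditor",
  "Public Interface Validator",
  "Security Analyst",
  "Anxiety Reader",
  "API Contract Validator",
  "Release Readiness Validator",
];

// Agents that already ran during post-implementation.
const sharedAgents = shipAgents.filter(
  (agent) => postImplementationAgents.indexOf(agent) !== -1,
);
// Agents that first appear in Ship.
const shipOnlyAgents = shipAgents.filter(
  (agent) => postImplementationAgents.indexOf(agent) === -1,
);

console.log(`${sharedAgents.length} of ${shipAgents.length} agents overlap`); // 5 of 9 agents overlap
console.log(shipOnlyAgents);
```

The four Ship-only agents (Code Auditor, Anxiety Reader, API Contract Validator, Release Readiness Validator) are the ones bringing a fresh perspective to the final pass.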

Example findings:

Anxiety Reader: "Promise.allSettled fires all agents simultaneously with no concurrency limit, risking resource exhaustion and API rate limits."

Code Auditor: "File I/O errors in writeReportFiles caught by handleCoreError which gives SDK-specific hints instead of filesystem-specific messaging."
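The Anxiety Reader finding above points at unbounded fan-out. One common remedy is a small worker pool that settles every task but keeps only a fixed number in flight. The sketch below is an assumption about how that could look, not the change actually shipped; `allSettledLimited` and `Settled` are hypothetical names.

```typescript
// Same shape as the results of Promise.allSettled, declared locally
// so the sketch does not depend on a particular lib target.
type Settled<T> =
  | { status: "fulfilled"; value: T }
  | { status: "rejected"; reason: unknown };

// Settles every task, running at most `limit` of them at once.
// Results are returned in the same order as `tasks`.
async function allSettledLimited<T>(
  tasks: ReadonlyArray<() => Promise<T>>,
  limit: number,
): Promise<Settled<T>[]> {
  if (limit < 1 || Math.floor(limit) !== limit) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: Settled<T>[] = new Array(tasks.length);
  let next = 0;

  // Each worker pulls the next unclaimed task until none remain.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      try {
        results[index] = { status: "fulfilled", value: await tasks[index]() };
      } catch (reason) {
        results[index] = { status: "rejected", reason };
      }
    }
  }

  const workers: Promise<void>[] = [];
  for (let i = 0; i < Math.min(limit, tasks.length); i++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}
```

Tasks are passed as functions rather than promises so that nothing starts until a worker picks it up. A call such as `allSettledLimited(agentTasks, 3)` would then cap concurrent agent runs at three, where `agentTasks` is whatever list of agent invocations the caller has.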

Conclusion

On the philosophical end, this is the negotiation between the artifacts, the agents, and the operator, and it is where the idea of quality converges. We all have an idea of what quality means to us; even the agents have ideas of what both quantifies and qualifies as quality. This is the agreement we make with ourselves and the agents: what constitutes readiness. The foundation of it all is that we are aiming for some form of consistency, usability, readability, and maintainability, and underneath those, something we can be more confident in. Quality can be a subjective state with objective goals. I iterate until those ideas converge. How do you know when to terminate the loop? I'd like to think it's intuitive: a combination of patience, practice, judgement, and your expertise in asking the right questions. Is the juice worth the squeeze for this next fix or feature? It comes back to your personal threshold for the state of the project you are ready to release. The artist is never finished; is the engineer? It ultimately comes down to the operator. The good thing about versioning is that you can always add, subtract, or modify, and how that quality manifests derives from preference and the artifact's trajectory.

One consideration I can state with confidence: this process is not necessarily cheap on tokens. For those of us who have spent countless hours burning through tokens and hitting usage limits, this can play a major role in how we develop with AI. For some projects this process is absolutely overkill; for others it's simply not enough and requires an entirely different set of agents to audit. My personal inclination is to run this process, and run it repeatedly. I'd like to ensure that the code I develop with Claude or any other AI system can be verified and validated and, ideally, meets a higher standard. Some projects may require nothing more than a Code Validator and Test Architect for review; others involve 40+ agents from multiple perspectives. If there is one agent worth trying on any artifact (codebase, spec, docs, etc.), it's the Assumption Excavator, as it is near-universally applicable.


This process originated out of a lack of trust and has developed into a trust signal.

The agents, commands, and pipelines referenced in this post are available at github.com/aself101/agents-and-pipelines.