The Best AI Development Partner Is Another AI
Here's something counterintuitive: the most effective way I've found to use AI for development isn't getting one model to do everything perfectly. It's having multiple models argue with each other.
I've been exploring this for months now, and what strikes me is that the pattern mirrors what works with human teams. We've known for decades that humans produce better work when other humans challenge it. Turns out the same dynamic works with models.
The Multi-Agent Approach
The architecture is straightforward: agents that can perceive their environment, reason and plan, remember context, and take actions. What makes it interesting is using different models for different stages, deliberately pitted against each other.
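To make that concrete, here's roughly the shape I have in mind, sketched in Python. `ModelFn` is just a stand-in for whatever API you call your models through; none of these names come from a real framework.

```python
# Minimal sketch of the agent shape described above. `ModelFn` stands in for
# whatever model API you actually use; the names here are illustrative.
from dataclasses import dataclass, field
from typing import Callable, List

ModelFn = Callable[[str], str]  # prompt in, completion out

@dataclass
class Agent:
    name: str
    model: ModelFn          # which model backs this agent
    role: str               # e.g. "generate requirements" or "critique them"
    memory: List[str] = field(default_factory=list)  # remembered context

    def act(self, observation: str) -> str:
        # Perceive: fold the new observation into remembered context.
        self.memory.append(observation)
        # Reason, plan, act: one model call that sees the role plus history.
        prompt = f"Role: {self.role}\nContext so far:\n" + "\n".join(self.memory)
        result = self.model(prompt)
        self.memory.append(result)
        return result
```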
Sounds wasteful. It's not. Here's why.
The Workflow
The pattern starts earlier than you'd expect - not at coding, but at product design.
Design Phase: One model takes a rough idea and shapes it into detailed requirements. A different model challenges those requirements: what's missing? What's ambiguous? What assumptions haven't we examined? The first model revises.
Work Breakdown: Same pattern. One model decomposes work into tasks. Another reviews: are these actually independent? Is the sequencing right? Are we building the wrong thing first? Friction before consensus.
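Both of these phases are the same loop under the hood: generate, critique, revise. Here's a minimal sketch of it, assuming `author` and `critic` are two different models wrapped as simple prompt-in, text-out callables; the prompt wording is illustrative, not a recipe.

```python
# A sketch of the generate -> critique -> revise loop used in both the design
# and work-breakdown phases. `author` and `critic` are two *different* models.
from typing import Callable

ModelFn = Callable[[str], str]

def critique_loop(author: ModelFn, critic: ModelFn, task: str, rounds: int = 2) -> str:
    draft = author(f"Produce a first draft for this task:\n{task}")
    for _ in range(rounds):
        # The critic never edits; it only challenges.
        critique = critic(
            "Challenge this draft. What's missing, ambiguous, or assumed "
            f"without being examined?\n\nTask: {task}\n\nDraft:\n{draft}"
        )
        # The author revises in light of the critique.
        draft = author(
            f"Revise the draft to address this critique.\n\nDraft:\n{draft}\n\nCritique:\n{critique}"
        )
    return draft

# Usage: critique_loop(model_a, model_b, "Turn this rough idea into detailed requirements: ...")
```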
Implementation: By now the pattern is muscle memory. One model generates code. A different model reviews it - genuine adversarial critique. Then revision. Sometimes I have two models solve the same problem independently and compare.
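Sketched out, that looks something like this - again with model callables standing in for real APIs, and prompts that are mine, not gospel.

```python
# Sketch of the implementation phase: one model writes code, a different model
# reviews it adversarially, then the author revises. Optionally, two models
# solve the same problem independently so the outputs can be compared by hand.
from typing import Callable, Tuple

ModelFn = Callable[[str], str]

def implement_with_review(author: ModelFn, reviewer: ModelFn, spec: str) -> str:
    code = author(f"Implement this task:\n{spec}")
    review = reviewer(
        "Review this code as a skeptical senior engineer. Point out bugs, "
        f"edge cases, and questionable design decisions.\n\nTask: {spec}\n\nCode:\n{code}"
    )
    return author(f"Revise the code to address this review.\n\nCode:\n{code}\n\nReview:\n{review}")

def independent_solutions(model_a: ModelFn, model_b: ModelFn, spec: str) -> Tuple[str, str]:
    # Same spec, no shared context; diff the two results afterwards.
    prompt = f"Implement this task:\n{spec}"
    return model_a(prompt), model_b(prompt)
```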
Refactoring Loop: A fresh model looks at working code with one question: how would you safely simplify this? No new features, just clarity. A model that didn't write the code often spots abstractions that seemed clever but are actually just complicated.
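The whole pass can be a single, narrowly scoped prompt to a model with no prior context on the code. A sketch, with wording that's just one way to ask the question:

```python
# The refactoring pass, sketched: a model that has never seen this code is
# asked one narrow question. Prompt wording is illustrative, not a fixed recipe.
from typing import Callable

ModelFn = Callable[[str], str]

def simplification_pass(fresh_model: ModelFn, working_code: str) -> str:
    return fresh_model(
        "You did not write this code. How would you safely simplify it? "
        "No new features, no behavior changes - only clarity. "
        "Call out abstractions that add complexity without earning it.\n\n"
        + working_code
    )
```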
Testing: Integration tests, security review, performance analysis. Not premature optimization - targeted checks for obvious bottlenecks, loop complexity, and unnecessary network calls.
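One way to run these is as separate review passes, each with a narrow lens, rather than one vague "review everything" prompt. The lens list and wording below are illustrative:

```python
# Targeted analyses as separate review passes, each with a narrow lens.
from typing import Callable, Dict

ModelFn = Callable[[str], str]

LENSES = {
    "security": "Look only for security issues: injection, auth gaps, unsafe input handling.",
    "performance": "Look only for obvious bottlenecks: loop complexity, unnecessary network calls, repeated work.",
    "integration": "Look only at how components interact: contracts, error propagation, failure modes.",
}

def targeted_reviews(reviewer: ModelFn, code: str) -> Dict[str, str]:
    return {lens: reviewer(f"{instruction}\n\nCode:\n{code}") for lens, instruction in LENSES.items()}
```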
End-to-End Validation: Full user flows, written by a model given only the original requirements, not implementation details. Does the system actually do what we said it would? This catches a specific class of bug: the feature that works perfectly as coded but doesn't match what we needed.
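A sketch of how that constraint can be enforced: the test-writing model gets the requirements and a way to reach the running system, and nothing else. The `base_url` parameter and the prompt wording are assumptions on my part.

```python
# End-to-end validation, sketched: the test-writing model sees the original
# requirements and a way to drive the running system, never the implementation.
from typing import Callable

ModelFn = Callable[[str], str]

def write_e2e_tests(tester: ModelFn, requirements: str, base_url: str) -> str:
    return tester(
        "Write end-to-end tests for the system running at "
        f"{base_url}. You are given ONLY the original requirements below - "
        "no implementation details. Exercise full user flows and check that "
        "the system does what the requirements say it should.\n\n"
        f"Requirements:\n{requirements}"
    )
```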
The Surprising Economics
Here's where it gets interesting. You're burning tokens on what looks like redundant work: multiple models, multiple review passes, tests written by models that didn't write the code. Traditional efficiency thinking says this is waste.
But the token cost of critique is trivial compared to shipping bugs, security holes, or architecture that doesn't scale. The "wasteful" redundancy is cheaper than the rework it prevents.
This means we need to think differently about cost. The question isn't "how few tokens can I use?" It's "what's the total cost of the outcome I need?"
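Here's the back-of-envelope version with deliberately made-up numbers - swap in your own; the point is the shape of the comparison, not the figures.

```python
# Back-of-envelope arithmetic for "total cost of the outcome", with
# illustrative numbers only - plug in your own.
extra_review_tokens = 500_000          # assumed: extra critique/review passes per feature
price_per_million_tokens = 10.0        # assumed: blended $/1M tokens
review_cost = extra_review_tokens / 1_000_000 * price_per_million_tokens  # $5.00

engineer_hourly_cost = 100.0           # assumed: loaded cost of an engineer-hour
hours_to_fix_shipped_bug = 4           # assumed: rework for one escaped defect
rework_cost = engineer_hourly_cost * hours_to_fix_shipped_bug             # $400.00

# The "wasteful" reviews pay for themselves if they prevent roughly
# one shipped bug per eighty features.
break_even_bug_rate = review_cost / rework_cost   # 0.0125
print(review_cost, rework_cost, break_even_bug_rate)
```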
Real Talk: Where This Falls Apart
The pattern is simple - generate, critique, revise - applied at every stage. But managing coding agents manually? That's a bottleneck. It works for now, but it obviously begs for automation.
This is preparation for a future where extremely capable models manage hundreds of sub-agents. The orchestration layer matters more than any individual agent.
The other issue: context coordination. Each agent has limited memory. Keeping them aligned on the big picture requires careful prompt design and state management. It's solvable but not free.
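One shape that's worked for me, sketched below: a shared project brief plus compressed per-phase summaries, so each agent sees the big picture without dragging full transcripts around. The names are illustrative, not a framework.

```python
# A simple shared-state sketch for keeping agents aligned on the big picture.
from dataclasses import dataclass, field
from typing import Callable, Dict

ModelFn = Callable[[str], str]

@dataclass
class SharedContext:
    brief: str                               # the big picture every agent must see
    phase_summaries: Dict[str, str] = field(default_factory=dict)

    def record(self, phase: str, summarizer: ModelFn, transcript: str) -> None:
        # Compress each phase so later agents can fit it in their window.
        self.phase_summaries[phase] = summarizer(
            f"Summarize the decisions and open questions in a few sentences:\n{transcript}"
        )

    def prompt_for(self, phase: str, instructions: str) -> str:
        history = "\n".join(f"[{p}] {s}" for p, s in self.phase_summaries.items())
        return f"Project brief:\n{self.brief}\n\nPrior phases:\n{history}\n\n{instructions}"
```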
The Takeaway
Models, like people, do better work when someone's checking their homework. The twist is that the checker can be another model - and often should be.
I'll take a multi-model workflow with built-in friction over a single model trying to be perfect. The redundancy feels wrong. The results say otherwise.