I gave five AI coding agents a task: build a chess engine from scratch. One planned the architecture. Three built components in parallel. One supervised everything.
No external chess libraries. No internet lookups. Just agents, a test suite, and a goal: beat Stockfish at 1200 Elo at least 50% of the time.
The engine works. But what surprised me wasn't the output — it was what I learned about supervised AI agent execution along the way.
The Setup
The team looked like this:
```yaml
roles:
  - name: architect
    role_type: architect
    agent: claude
    instances: 1
    talks_to: [manager]
  - name: manager
    role_type: manager
    agent: claude
    instances: 1
    talks_to: [architect, engineer]
  - name: engineer
    role_type: engineer
    agent: claude
    instances: 3
    use_worktrees: true
    talks_to: [manager]
```
Five agents. One architect running Opus for planning. Three engineers running Sonnet for implementation. One manager routing work between them. Each engineer got its own git worktree — its own branch, its own directory, completely isolated from the others.
The task board was a Markdown file:
```markdown
## To Do
- [ ] Implement board representation (bitboard)
- [ ] Implement move generation (legal moves)
- [ ] Implement position evaluation (material + position tables)
- [ ] Implement search (alpha-beta with iterative deepening)
- [ ] Implement UCI protocol interface
- [ ] Write integration tests against known positions

## In Progress

## Done
```
I typed `batty start --attach` and watched.
Surprise 1: The Architect Was 10x More Important Than Any Engineer
This was the biggest lesson. I initially thought the engineers — the agents writing actual code — were the bottleneck. They weren't.
The architect was.
A good architecture plan meant engineers could work independently. Board representation, move generation, and evaluation are naturally isolated — they touch different files, use different data structures, and can be tested independently. The architect saw this and decomposed the work accordingly.
When I ran an earlier version with a weaker architecture plan, engineers kept blocking each other. The evaluation agent needed the board representation agent to finish first. The search agent needed both. Three agents, but only one could work at any given time. Parallel in theory, sequential in practice.
The fix was spending more time — and more expensive tokens — on the planning phase. I ran the architect on Opus (the most capable model) and gave it explicit instructions: "Decompose the work so that each engineer can start immediately without waiting for another engineer's output. Define interfaces upfront."
The architect produced a plan with clear module boundaries, shared type definitions in a common `types.rs` file, and stub implementations that each engineer could code against. All three engineers started within seconds of each other.
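A minimal sketch of what such a shared-types file might look like. All names here are hypothetical (the actual plan's interfaces aren't shown in this post); the point is that the trait and the stub let every engineer compile and test against the boundary from minute one:

```rust
// types.rs (hypothetical) -- shared definitions each engineer codes against.

/// A square index 0..64 (a1 = 0, h8 = 63).
pub type Square = u8;

/// Evaluation score; positive means the side to move is better.
pub type Score = i32;

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Move {
    pub from: Square,
    pub to: Square,
}

/// Stub board type so all three engineers compile immediately.
pub struct Board;

/// The interface the evaluation engineer implements; the search
/// engineer codes against the trait, never the implementation.
pub trait Evaluator {
    fn evaluate(&self, board: &Board) -> Score;
}

/// Stub evaluator: returns 0 until the real one lands.
pub struct StubEvaluator;

impl Evaluator for StubEvaluator {
    fn evaluate(&self, _board: &Board) -> Score {
        0
    }
}

fn main() {
    let eval = StubEvaluator;
    println!("stub score: {}", eval.evaluate(&Board));
}
```

With this in place, the search module depends only on `Evaluator`, so swapping the stub for the real evaluation is a one-line change at the call site.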
Lesson: In supervised AI agent execution, the quality of task decomposition determines everything. A great architect with mediocre engineers outperforms mediocre architecture with great engineers.
Surprise 2: Test Gating Caught Things I Would Have Missed
Every engineer's branch had to pass `cargo test` before merging. No exceptions. The agent says "done" — the supervisor runs the tests. Exit code 0 means done. Anything else means try again.
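The gate logic itself is tiny. This is a hypothetical sketch of the check (not batty's actual source), assuming the supervisor shells out to whatever test command the config names:

```rust
use std::process::Command;

/// Run the configured gate command; the branch merges only on exit code 0.
/// Any spawn failure counts as a failing gate rather than a silent pass.
fn gate_passes(test_cmd: &str) -> bool {
    Command::new("sh")
        .arg("-c")
        .arg(test_cmd)
        .status()
        .map(|s| s.success())
        .unwrap_or(false)
}

fn main() {
    // In the real setup the command would be "cargo test".
    let verdict = if gate_passes("exit 0") { "merge" } else { "retry" };
    println!("{verdict}");
}
```

The important design choice is that the agent's self-report never enters the decision: only the exit status does.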
Here's what the test gate caught that I wouldn't have noticed in code review:
Off-by-one in move generation. Engineer 2 implemented pawn moves. The code looked correct — clean, well-structured, proper handling of en passant and promotion. But the test suite included known positions from the perft suite, positions where the exact number of legal moves is known. Engineer 2's implementation generated 19 moves in a position that should have 20. The cause: a missing edge case in castling rights after a rook capture. It's the kind of bug that passes code review because the logic reads correctly.
The test gate caught it. The agent got the failure output, saw exactly which position failed and by how many moves, and fixed it in the retry.
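For readers who haven't met perft: it counts the leaf nodes of the legal-move tree at a fixed depth and compares against known-correct totals. Here's a toy version over a fake position with a constant branching factor, just to show the counting shape. A real perft generates, makes, and unmakes actual moves, and the toy numbers diverge from real chess at depth 3 and beyond:

```rust
/// Toy "position": its only state is how many legal moves it has.
/// A real position would carry the full board and generate real moves.
#[derive(Clone, Copy)]
struct ToyPosition {
    legal_moves: u64,
}

/// perft(depth) = number of leaf nodes in the move tree at that depth.
fn perft(pos: ToyPosition, depth: u32) -> u64 {
    if depth == 0 {
        return 1;
    }
    let mut nodes = 0;
    for _ in 0..pos.legal_moves {
        // A real engine makes the move here and recurses into the child.
        nodes += perft(pos, depth - 1);
    }
    nodes
}

fn main() {
    // The starting chess position has exactly 20 legal moves, so a
    // correct generator's perft(1) must be 20; 19 means a bug.
    let start = ToyPosition { legal_moves: 20 };
    println!("perft(1) = {}", perft(start, 1)); // 20
    println!("perft(2) = {}", perft(start, 2)); // 400
}
```

Because the expected counts are exact, perft turns a subtle generator bug into a hard, unambiguous test failure with a precise delta.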
Type mismatch in evaluation scores. Engineer 3 implemented position evaluation using centipawn scores. The search module expected scores in a different range. Both modules compiled independently. Both had passing unit tests. The integration test — which ran the full engine against a known position — produced moves that were legal but strategically terrible. The engine was maximizing the wrong scale.
Without the test gate, this would have merged. I would have spent an hour debugging "why does the engine sacrifice its queen for no reason" before finding the score scaling issue.
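One way to make this class of bug impossible at compile time, rather than catching it at integration time, is a newtype around the score. This is a hypothetical sketch, not the engine's actual code:

```rust
/// Score in centipawns (1 pawn = 100 cp). Wrapping the i32 in a newtype
/// means a module using a different scale can't pass its raw numbers in
/// by accident -- the compiler rejects a bare i32 where Centipawns is
/// expected.
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
pub struct Centipawns(pub i32);

impl Centipawns {
    /// Convert from pawn-unit scores (e.g. 3.5 pawns -> 350 cp).
    pub fn from_pawns(pawns: f64) -> Self {
        Centipawns((pawns * 100.0).round() as i32)
    }
}

fn main() {
    let queen = Centipawns::from_pawns(9.0);
    println!("queen = {:?}", queen); // Centipawns(900)
    // A search function typed as fn search(eval: Centipawns, ...) would
    // have rejected the mismatched scale at compile time -- the same
    // mismatch the integration test only caught at runtime.
}
```

Either defense works; the difference is that the type-level one fails in seconds during the engineer's own build, before the merge queue is involved.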
Lesson: Test gates don't just catch bugs. They catch the class of bugs that look correct in isolation but break at integration boundaries. This is exactly where multi-agent systems fail — each agent's work is locally correct but globally broken.
Surprise 3: Five Agents Was the Sweet Spot
I tried three configurations:
| Config | Agents | Result |
|---|---|---|
| Pair | 1 architect + 1 engineer | Works but sequential. ~45 minutes. |
| Team | 1 architect + 3 engineers + 1 manager | Parallel execution. ~18 minutes. |
| Squad | 1 architect + 5 engineers + 1 manager | Merge complexity killed the gains. ~22 minutes. |
Going from 1 to 3 engineers was a clear win. Each engineer worked on a different module. Merges were clean because worktree isolation prevented file conflicts, and the architect's decomposition kept modules independent.
Going from 3 to 5 engineers actually slowed things down.
Why? Two reasons:
Merge serialization. Batty merges branches sequentially with a file lock. With 3 engineers finishing around the same time, merges queue briefly but resolve quickly. With 5, the queue backs up. Each merge triggers a test run in the target branch, and later merges sometimes conflict with earlier ones because the codebase has changed underneath them.
Task granularity. A chess engine has about 5-6 natural modules. With 3 engineers, each gets a substantial chunk of work. With 5, you're splitting modules into smaller pieces that have tighter coupling. Engineer 4 needs to implement the UCI protocol, but it depends on the search module (Engineer 3) and the board representation (Engineer 1). The independence that made 3 agents work breaks down at 5.
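The merge-serialization cost is easy to put numbers on. A back-of-envelope sketch, with hypothetical timings (batty's real merge and test durations vary per task): if every merge holds the lock for the merge plus a full test run, and all engineers finish at roughly the same time, time spent behind the lock grows linearly with team size.

```rust
/// Worst-case wall time spent in the merge queue when all engineers
/// finish simultaneously and each merge + gated test run holds the
/// lock for `secs_per_merge`. Hypothetical numbers, to show the shape.
fn merge_queue_seconds(engineers: u32, secs_per_merge: u32) -> u32 {
    engineers * secs_per_merge
}

fn main() {
    for n in [3, 5] {
        println!(
            "{n} engineers: {}s of serialized merging",
            merge_queue_seconds(n, 90)
        ); // 270s vs 450s
    }
}
```

And this linear model is optimistic: it ignores the re-work when a queued branch conflicts with a merge that landed ahead of it, which gets more likely the longer the queue.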
Lesson: More agents isn't always better. The optimal team size depends on how many truly independent tasks exist. If you have to create artificial boundaries to give agents work, you've gone too far.
Surprise 4: Token Costs Weren't What I Expected
The naive assumption: 5 agents = 5x the cost. The reality was closer to 2x.
Here's why:
Scoped context. Each engineer only loaded the files relevant to its module. Engineer 1 (board representation) never saw the evaluation code. Engineer 3 (evaluation) never saw the UCI protocol. A strict `.claudeignore` file kept each agent's context to ~25K tokens instead of the full ~80K project context.
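For illustration, an ignore file for the board-representation engineer might look like this — a hypothetical example assuming gitignore-style patterns and module paths (the actual file isn't shown in this post):

```
# .claudeignore for Engineer 1 (board representation):
# hide every module that isn't its own, plus build output
src/eval/
src/search/
src/uci/
target/
```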
Session resets. After each task, the agent got a fresh session. No accumulated conversation history from previous tasks. Clean context = fewer tokens per completion.
Model mixing. The architect ran on Opus (roughly 5x more expensive per token than Sonnet). The engineers ran on Sonnet. Since engineers do 80% of the token-consuming work, the blended cost was much lower than running everything on Opus.
| Cost Component | Tokens | Cost |
|---|---|---|
| Architect (Opus) | ~40K | ~$1.20 |
| Engineer 1 (Sonnet) | ~60K | ~$0.36 |
| Engineer 2 (Sonnet) | ~55K | ~$0.33 |
| Engineer 3 (Sonnet) | ~65K | ~$0.39 |
| Manager (Sonnet) | ~15K | ~$0.09 |
| Total | ~235K | ~$2.37 |
A single agent doing the same work sequentially would use ~180K tokens on Opus (~$5.40) because it carries the full context throughout. The multi-agent approach was both faster and cheaper.
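The arithmetic checks out against the per-million-token rates the table implies (~$30/M for Opus, ~$6/M for Sonnet; both are blended approximations, not official price-sheet numbers):

```rust
/// Approximate blended $/million-token rates implied by the cost table.
const OPUS: f64 = 30.0;
const SONNET: f64 = 6.0;

/// Cost of `tokens_k` thousand tokens at `rate_per_m` dollars per million.
fn cost(tokens_k: f64, rate_per_m: f64) -> f64 {
    tokens_k * 1_000.0 * rate_per_m / 1_000_000.0
}

fn main() {
    let team = cost(40.0, OPUS)                     // architect on Opus
        + cost(60.0 + 55.0 + 65.0 + 15.0, SONNET);  // engineers + manager
    let solo = cost(180.0, OPUS); // one agent, full context, all on Opus
    println!("team: ${team:.2}"); // $2.37
    println!("solo: ${solo:.2}"); // $5.40
}
```

The split matters more than the totals: the expensive model touches only the ~17% of tokens where planning quality pays off.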
Lesson: Multi-agent execution is a cost optimization strategy, not just a speed optimization. Scoped tasks + model mixing + session resets cut costs more than you'd expect.
What the Engine Looks Like
The result: `chess_test`. A Rust chess engine built entirely by AI agents under supervision.
It's not going to beat Stockfish at full strength. But against Stockfish capped at 1200 Elo, it wins consistently. The architecture is clean — separate modules for board representation, move generation, evaluation, search, and UCI protocol. Each module has its own test suite.
The interesting thing isn't the engine itself. It's that the development process — supervised AI agent execution with worktree isolation, test gating, and hierarchical task dispatch — produced a codebase that's more modular and better-tested than what I typically get from a single long agent session.
When one agent does everything, it tends to take shortcuts. Shared mutable state. Implicit dependencies. Tests that pass but don't cover edge cases. When multiple agents work in isolation with hard boundaries, the code is forced to be modular because agents literally can't access each other's files.
How to Try This
If you want to run a similar experiment:
```sh
cargo install batty-cli
cd your-project
batty init --template team   # architect + 3 engineers + manager
```
Edit `.batty/team_config/team.yaml` to configure agents, roles, and the test command. Add tasks to the kanban board. Run `batty start --attach` and watch agents work in adjacent tmux panes.
The demo video shows the chess engine build from start to finish — architect planning, engineers implementing in parallel, test gates catching bugs, branches merging.
Source: github.com/battysh/batty
The Takeaway
Supervised AI agent execution isn't about making agents faster. It's about making their output trustworthy.
Five agents building a chess engine taught me:
- Invest in the architect. Task decomposition quality > agent count. Use your best model for planning.
- Test gates are non-negotiable. Agents produce confident, plausible, broken code. Exit code 0 is the cheapest reviewer you'll ever hire.
- More agents ≠ better. Match team size to the number of naturally independent tasks. Stop at the boundary where you'd have to create artificial splits.
- Multi-agent is a cost play. Scoped context + model mixing + session resets = faster AND cheaper than one expensive agent doing everything sequentially.
The agents didn't surprise me with their code quality. They surprised me with how much the supervision layer — task decomposition, isolation, test gating — determined the outcome.
The code wrote itself. The architecture didn't.
What's the most agents you've run on a single project? Where did the coordination break down? I'm curious whether the 5-agent ceiling holds for other codebases or if it's specific to this kind of project.