I gave five AI coding agents a task: build a chess engine from scratch. One planned the architecture. Three built components in parallel. One supervised everything.
No external chess libraries. No internet lookups. Just agents, a test suite, and a goal: beat Stockfish at 1200 Elo at least 50% of the time.
The engine works. But what surprised me wasn't the output — it was what I learned about supervised AI agent execution along the way.
The Setup
The team looked like this:
```yaml
roles:
  - name: architect
    role_type: architect
    agent: claude
    instances: 1
    talks_to: [manager]
  - name: manager
    role_type: manager
    agent: claude
    instances: 1
    talks_to: [architect, engineer]
  - name: engineer
    role_type: engineer
    agent: claude
    instances: 3
    use_worktrees: true
    talks_to: [manager]
```
Five agents. One architect running Opus for planning. Three engineers running Sonnet for implementation. One manager routing work between them. Each engineer got its own git worktree — its own branch, its own directory, completely isolated from the others.
The task board was a Markdown file:
```markdown
## To Do
- [ ] Implement board representation (bitboard)
- [ ] Implement move generation (legal moves)
- [ ] Implement position evaluation (material + position tables)
- [ ] Implement search (alpha-beta with iterative deepening)
- [ ] Implement UCI protocol interface
- [ ] Write integration tests against known positions

## In Progress

## Done
```
I typed `batty start --attach` and watched.
Surprise 1: The Architect Was 10x More Important Than Any Engineer
This was the biggest lesson. I initially thought the engineers — the agents writing actual code — were the bottleneck. They weren't.
The architect was.
A good architecture plan meant engineers could work independently. Board representation, move generation, and evaluation are naturally isolated — they touch different files, use different data structures, and can be tested independently. The architect saw this and decomposed the work accordingly.
When I ran an earlier version with a weaker architecture plan, engineers kept blocking each other. The evaluation agent needed the board representation agent to finish first. The search agent needed both. Three agents, but only one could work at any given time. Parallel in theory, sequential in practice.
The fix was spending more time — and more expensive tokens — on the planning phase. I ran the architect on Opus (the most capable model) and gave it explicit instructions: "Decompose the work so that each engineer can start immediately without waiting for another engineer's output. Define interfaces upfront."
The architect produced a plan with clear module boundaries, shared type definitions in a common `types.rs` file, and stub implementations that each engineer could code against. All three engineers started within seconds of each other.
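A minimal sketch of what such a shared-types file might look like. All names here are hypothetical (the actual plan's interfaces aren't shown in this post); the point is that the trait and the stub let every engineer compile and test against the boundary from minute one:

```rust
// types.rs (hypothetical) -- shared definitions each engineer codes against.

/// A square index 0..64 (a1 = 0, h8 = 63).
pub type Square = u8;

/// Evaluation score; positive means the side to move is better.
pub type Score = i32;

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Move {
    pub from: Square,
    pub to: Square,
}

/// Stub board type so all three engineers compile immediately.
pub struct Board;

/// The interface the evaluation engineer implements; the search
/// engineer codes against the trait, never the implementation.
pub trait Evaluator {
    fn evaluate(&self, board: &Board) -> Score;
}

/// Stub evaluator: returns 0 until the real one lands.
pub struct StubEvaluator;

impl Evaluator for StubEvaluator {
    fn evaluate(&self, _board: &Board) -> Score {
        0
    }
}

fn main() {
    let eval = StubEvaluator;
    println!("stub score: {}", eval.evaluate(&Board));
}
```

With this in place, the search module depends only on `Evaluator`, so swapping the stub for the real evaluation is a one-line change at the call site.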
Lesson: In supervised AI agent execution, the quality of task decomposition determines everything. A great architect with mediocre engineers outperforms mediocre architecture with great engineers.
Surprise 2: Test Gating Caught Things I Would Have Missed
Every engineer's branch had to pass `cargo test` before merging. No exceptions. The agent says "done" — the supervisor runs the tests. Exit code 0 means done. Anything else means try again.
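The gate logic itself is tiny. This is a hypothetical sketch of the check (not batty's actual source), assuming the supervisor shells out to whatever test command the config names:

```rust
use std::process::Command;

/// Run the configured gate command; the branch merges only on exit code 0.
/// Any spawn failure counts as a failing gate rather than a silent pass.
fn gate_passes(test_cmd: &str) -> bool {
    Command::new("sh")
        .arg("-c")
        .arg(test_cmd)
        .status()
        .map(|s| s.success())
        .unwrap_or(false)
}

fn main() {
    // In the real setup the command would be "cargo test".
    let verdict = if gate_passes("exit 0") { "merge" } else { "retry" };
    println!("{verdict}");
}
```

The important design choice is that the agent's self-report never enters the decision: only the exit status does.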
Here's what the test gate caught that I wouldn't have noticed in code review:
Off-by-one in move generation. Engineer 2 implemented pawn moves. The code looked correct — clean, well-structured, proper handling of en passant and promotion. But the test suite included known positions from the perft suite, positions where the exact number of legal moves is known. Engineer 2's implementation generated 19 moves in a position that should have 20. The cause: a missing edge case in castling rights after a rook capture. It's the kind of bug that passes code review because the logic reads correctly.
The test gate caught it. The agent got the failure output, saw exactly which position failed and by how many moves, and fixed it in the retry.
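For readers who haven't met perft: it counts the leaf nodes of the legal-move tree at a fixed depth and compares against known-correct totals. Here's a toy version over a fake position with a constant branching factor, just to show the counting shape. A real perft generates, makes, and unmakes actual moves, and the toy numbers diverge from real chess at depth 3 and beyond:

```rust
/// Toy "position": its only state is how many legal moves it has.
/// A real position would carry the full board and generate real moves.
#[derive(Clone, Copy)]
struct ToyPosition {
    legal_moves: u64,
}

/// perft(depth) = number of leaf nodes in the move tree at that depth.
fn perft(pos: ToyPosition, depth: u32) -> u64 {
    if depth == 0 {
        return 1;
    }
    let mut nodes = 0;
    for _ in 0..pos.legal_moves {
        // A real engine makes the move here and recurses into the child.
        nodes += perft(pos, depth - 1);
    }
    nodes
}

fn main() {
    // The starting chess position has exactly 20 legal moves, so a
    // correct generator's perft(1) must be 20; 19 means a bug.
    let start = ToyPosition { legal_moves: 20 };
    println!("perft(1) = {}", perft(start, 1)); // 20
    println!("perft(2) = {}", perft(start, 2)); // 400
}
```

Because the expected counts are exact, perft turns a subtle generator bug into a hard, unambiguous test failure with a precise delta.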
Type mismatch in evaluation scores. Engineer 3 implemented position evaluation using centipawn scores. The search module expected scores in a different range. Both modules compiled independently. Both had passing unit tests. The integration test — which ran the full engine against a known position — produced moves that were legal but strategically terrible. The engine was maximizing the wrong scale.
Without the test gate, this would have merged. I would have spent an hour debugging "why does the engine sacrifice its queen for no reason" before finding the score scaling issue.
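One way to make this class of bug impossible at compile time, rather than catching it at integration time, is a newtype around the score. This is a hypothetical sketch, not the engine's actual code:

```rust
/// Score in centipawns (1 pawn = 100 cp). Wrapping the i32 in a newtype
/// means a module using a different scale can't pass its raw numbers in
/// by accident -- the compiler rejects a bare i32 where Centipawns is
/// expected.
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
pub struct Centipawns(pub i32);

impl Centipawns {
    /// Convert from pawn-unit scores (e.g. 3.5 pawns -> 350 cp).
    pub fn from_pawns(pawns: f64) -> Self {
        Centipawns((pawns * 100.0).round() as i32)
    }
}

fn main() {
    let queen = Centipawns::from_pawns(9.0);
    println!("queen = {:?}", queen); // Centipawns(900)
    // A search function typed as fn search(eval: Centipawns, ...) would
    // have rejected the mismatched scale at compile time -- the same
    // mismatch the integration test only caught at runtime.
}
```

Either defense works; the difference is that the type-level one fails in seconds during the engineer's own build, before the merge queue is involved.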
Lesson: Test gates don't just catch bugs. They catch the class of bugs that look correct in isolation but break at integration boundaries. This is exactly where multi-agent systems fail — each agent's work is locally correct but globally broken.
Surprise 3: Five Agents Was the Sweet Spot
I tried three configurations:
| Config | Agents | Result |
|---|---|---|
| Pair | 1 architect + 1 engineer | Works but sequential. ~45 minutes. |
| Team | 1 architect + 3 engineers + 1 manager | Parallel execution. ~18 minutes. |
| Squad | 1 architect + 5 engineers + 1 manager | Merge complexity killed the gains. ~22 minutes. |
Going from 1 to 3 engineers was a clear win. Each engineer worked on a different module. Merges were clean because worktree isolation prevented file conflicts, and the architect's decomposition kept modules independent.
Going from 3 to 5 engineers actually slowed things down.
Why? Two reasons:
Merge serialization. Batty merges branches sequentially with a file lock. With 3 engineers finishing around the same time, merges queue briefly but resolve quickly. With 5, the queue backs up. Each merge triggers a test run in the target branch, and later merges sometimes conflict with earlier ones because the codebase has changed underneath them.
Task granularity. A chess engine has about 5-6 natural modules. With 3 engineers, each gets a substantial chunk of work. With 5, you're splitting modules into smaller pieces that have tighter coupling. Engineer 4 needs to implement the UCI protocol, but it depends on the search module (Engineer 3) and the board representation (Engineer 1). The independence that made 3 agents work breaks down at 5.
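The merge-serialization cost is easy to put numbers on. A back-of-envelope sketch, with hypothetical timings (batty's real merge and test durations vary per task): if every merge holds the lock for the merge plus a full test run, and all engineers finish at roughly the same time, time spent behind the lock grows linearly with team size.

```rust
/// Worst-case wall time spent in the merge queue when all engineers
/// finish simultaneously and each merge + gated test run holds the
/// lock for `secs_per_merge`. Hypothetical numbers, to show the shape.
fn merge_queue_seconds(engineers: u32, secs_per_merge: u32) -> u32 {
    engineers * secs_per_merge
}

fn main() {
    for n in [3, 5] {
        println!(
            "{n} engineers: {}s of serialized merging",
            merge_queue_seconds(n, 90)
        ); // 270s vs 450s
    }
}
```

And this linear model is optimistic: it ignores the re-work when a queued branch conflicts with a merge that landed ahead of it, which gets more likely the longer the queue.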
Lesson: More agents isn't always better. The optimal team size depends on how many truly independent tasks exist. If you have to create artificial boundaries to give agents work, you've gone too far.
Surprise 4: Token Costs Weren't What I Expected
The naive assumption: 5 agents = 5x the cost. The reality was closer to 2x.
Here's why:
Scoped context. Each engineer only loaded the files relevant to its module. Engineer 1 (board representation) never saw the evaluation code. Engineer 3 (evaluation) never saw the UCI protocol. A strict `.claudeignore` file kept each agent's context to ~25K tokens instead of the full ~80K project context.
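For illustration, an ignore file for the board-representation engineer might look like this — a hypothetical example assuming gitignore-style patterns and module paths (the actual file isn't shown in this post):

```
# .claudeignore for Engineer 1 (board representation):
# hide every module that isn't its own, plus build output
src/eval/
src/search/
src/uci/
target/
```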
Session resets. After each task, the agent got a fresh session. No accumulated conversation history from previous tasks. Clean context = fewer tokens per completion.
Model mixing. The architect ran on Opus (roughly 5x more expensive per token than Sonnet). The engineers ran on Sonnet. Since engineers do 80% of the token-consuming work, the blended cost was much lower than running everything on Opus.
| Cost Component | Tokens | Cost |
|---|---|---|
| Architect (Opus) | ~40K | ~$1.20 |
| Engineer 1 (Sonnet) | ~60K | ~$0.36 |
| Engineer 2 (Sonnet) | ~55K | ~$0.33 |
| Engineer 3 (Sonnet) | ~65K | ~$0.39 |
| Manager (Sonnet) | ~15K | ~$0.09 |
| Total | ~235K | ~$2.37 |
A single agent doing the same work sequentially would use ~180K tokens on Opus (~$5.40) because it carries the full context throughout. The multi-agent approach was both faster and cheaper.
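The arithmetic checks out against the per-million-token rates the table implies (~$30/M for Opus, ~$6/M for Sonnet; both are blended approximations, not official price-sheet numbers):

```rust
/// Approximate blended $/million-token rates implied by the cost table.
const OPUS: f64 = 30.0;
const SONNET: f64 = 6.0;

/// Cost of `tokens_k` thousand tokens at `rate_per_m` dollars per million.
fn cost(tokens_k: f64, rate_per_m: f64) -> f64 {
    tokens_k * 1_000.0 * rate_per_m / 1_000_000.0
}

fn main() {
    let team = cost(40.0, OPUS)                     // architect on Opus
        + cost(60.0 + 55.0 + 65.0 + 15.0, SONNET);  // engineers + manager
    let solo = cost(180.0, OPUS); // one agent, full context, all on Opus
    println!("team: ${team:.2}"); // $2.37
    println!("solo: ${solo:.2}"); // $5.40
}
```

The split matters more than the totals: the expensive model touches only the ~17% of tokens where planning quality pays off.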
Lesson: Multi-agent execution is a cost optimization strategy, not just a speed optimization. Scoped tasks + model mixing + session resets cut costs more than you'd expect.
What the Engine Looks Like
The result: `chess_test`. A Rust chess engine built entirely by AI agents under supervision.
It's not going to beat Stockfish at full strength. But against Stockfish capped at 1200 Elo, it wins consistently. The architecture is clean — separate modules for board representation, move generation, evaluation, search, and UCI protocol. Each module has its own test suite.
The interesting thing isn't the engine itself. It's that the development process — supervised AI agent execution with worktree isolation, test gating, and hierarchical task dispatch — produced a codebase that's more modular and better-tested than what I typically get from a single long agent session.
When one agent does everything, it tends to take shortcuts. Shared mutable state. Implicit dependencies. Tests that pass but don't cover edge cases. When multiple agents work in isolation with hard boundaries, the code is forced to be modular because agents literally can't access each other's files.
How to Try This
If you want to run a similar experiment:
```sh
cargo install batty-cli
cd your-project
batty init --template team   # architect + 3 engineers + manager
```
Edit `.batty/team_config/team.yaml` to configure agents, roles, and the test command. Add tasks to the kanban board. Run `batty start --attach` and watch agents work in adjacent tmux panes.
The demo video shows the chess engine build from start to finish — architect planning, engineers implementing in parallel, test gates catching bugs, branches merging.
Source: github.com/battysh/batty
The Takeaway
Supervised AI agent execution isn't about making agents faster. It's about making their output trustworthy.
Five agents building a chess engine taught me:
- Invest in the architect. Task decomposition quality > agent count. Use your best model for planning.
- Test gates are non-negotiable. Agents produce confident, plausible, broken code. Exit code 0 is the cheapest reviewer you'll ever hire.
- More agents ≠ better. Match team size to the number of naturally independent tasks. Stop at the boundary where you'd have to create artificial splits.
- Multi-agent is a cost play. Scoped context + model mixing + session resets = faster AND cheaper than one expensive agent doing everything sequentially.
The agents didn't surprise me with their code quality. They surprised me with how much the supervision layer — task decomposition, isolation, test gating — determined the outcome.
The code wrote itself. The architecture didn't.
What's the most agents you've run on a single project? Where did the coordination break down? I'm curious whether the 5-agent ceiling holds for other codebases or if it's specific to this kind of project.