Fetching latest headlines…
Coding Agents Suck at Tools
NORTH AMERICA
🇺🇸 United StatesJune 21, 2026

Coding Agents Suck at Tools

0 views0 likes0 comments
Originally published byDev.to

Open up the source code of any agent framework, harness, etc. Hermes, Copilot, Pi, Opencode or whatever.
You will find tools, tools everywhere. 

Examples: 

https://github.com/NousResearch/hermes-agent/tree/main/tools
https://github.com/anomalyco/opencode/tree/dev/packages/opencode/src/tool

These are bash hooks, file viewers, file editors, greps, questions, tasks, web search.

Your harness is a loop, and on each loop, all these tools and their individual instructions are injected in the context. That is why initial request using opencode is 10k tokens for no reason.

Do we have problems with tools? Yes, and a lot.
Some tools just don’t work properly, there are bugs there in these tools. Some don't count lines properly, some truncate files in order to save some tokens. Some invalidate cache, making your token/$ ratio worse.

Tool signatures differ from one harness to the other. And models absolutely suck at this. The longer session goes, the more model “forgets” and shifts its attention from this noisy tool descriptions. Harness starts failing on basic operations like finding files. Poor models drop into endless loops, eating your budget with no output at all.

When you force a model to context-switch between writing clean JavaScript and formatting a deviant, rigid JSON payload just to view a file, it all breaks down, and if not, its just not efficient.

Models are not trained on every harness available, they are trained on bash and coding. That’s what they need to do, and that’s why Pi is so good. But it could be better!


The failure peak is always file editing. Writing a file from scratch is a forward-flowing token stream, quite easy we guess. Editing an existing file requires the model to hold an exact mental map of the code's Abstract Syntax Tree (AST), match white-space indentation perfectly (tabs vs. spaces), and calculate precise line diffs.
When poor models attempt a search-and-replace edit, it almost always misses a newline or a trailing brace. The harness rejects the edit. The model gets confused by the raw bash or parser error, loses its place in the file, and begins modifying the wrong lines entirely - corrupting the codebase until the context window is nothing but garbage. Tools produce garbage and pollute your context. The more tools harness has, the more unrelated garbage is in your context.

We think we solved it. What if harness will just build code to edit other code? It’s already trained on doing that, kind of. So we decided to build a harness with only 1 “tool”, which is Elixir Eval. We call it eeva.

Elixir is a perfect language for models. It both looks similar to bash, has the same “piping” behavior like bash, has very similar out of the box functions like File.ls or File.read. And it’s a clean functional language, models are very good at this. They are good both with bash, and with Elixir. They combine the knowledge and attention to solve tasks. Every time model fails to make an operation, harness feeds the model with always similar Elixir error traces.

This seems like a small change, but it really flips the game a bit. The longer your agent is working, the longer the context, the more precise are the edits and operations. Instead of feeding the context with junk errors of random tools, we feed it with elixir compile errors, forcing model into elixir, basically “fine-tuning” it on the fly with high quality outputs and results.

With bigger context all coding agents are failing eventually, even ours. But the ceiling is much higher this time.

So if you wanna try this approach, take a shot: https://github.com/beamcore/agent

Comments (0)

Sign in to join the discussion

Be the first to comment!