Building an AI Agent Runtime That Uses Codex CLI / Claude Code as Workers and Closes Tasks Only With Evidence
🇺🇸 United States, May 10, 2026

Originally published by Dev.to

Most AI agents treat “done” as a message.

In Spectrion, “done” is a state transition.

A task is not completed until the runtime has gone through the full path:

- selected the ready task from the plan;
- checked dependencies;
- checked policy and approvals;
- executed the work through tools, CLI workers, or subagents;
- collected evidence;
- updated state through execute_plan;
- continued to the next task or honestly stopped on a blocker.
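
The first two steps of that path (pick a ready task, check its dependencies) can be sketched as follows. This is a minimal illustration, not Spectrion's code: the `Task` record, its field names, and `ready_tasks` are all hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical task record; the field names are illustrative, not Spectrion's schema.
@dataclass
class Task:
    id: str
    status: str = "pending"  # pending | completed | skipped | failed | blocked
    depends_on: list[str] = field(default_factory=list)

def ready_tasks(tasks: list[Task]) -> list[Task]:
    """Return pending tasks whose dependencies are all completed or skipped."""
    closed = {t.id for t in tasks if t.status in ("completed", "skipped")}
    return [
        t for t in tasks
        if t.status == "pending" and all(dep in closed for dep in t.depends_on)
    ]

plan = [
    Task("audit", status="completed"),
    Task("fix-tools", depends_on=["audit"]),
    Task("smoke-test", depends_on=["fix-tools"]),
]
print([t.id for t in ready_tasks(plan)])  # ['fix-tools']
```

Only `fix-tools` is ready: `smoke-test` stays out of reach until its dependency is actually closed.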

The most important part: Codex CLI, Claude Code, or any other headless CLI can be used as workers inside the plan.

But they do not decide when the task is finished.

Spectrion verifies their output and closes the task only with evidence.

Why a normal agent loop is weak

A normal AI chat:

user request
  -> model
  -> answer

An agent with tools:

user request
  -> model
  -> tool call
  -> tool result
  -> answer

That is enough for simple tasks.

It is not enough for large engineering work.

For example:

Systematically discover and fix all remaining bugs across agent runtime,
UI/UX, tools, memory, planning system, and server.

Add comprehensive test coverage.
Maintain server backward compatibility.
Resolve planning mode confusion.

You cannot just write a plan and say “done.”

You need a runtime that holds state:

plan
tasks
dependencies
approval gates
evidence
CLI sessions
subagents
blockers
final acceptance criteria

That is what I am building in Spectrion.

The main flow: plan + CLI worker

In Spectrion, a large goal becomes an execution plan:

user goal
  -> create_plan
  -> approval / questions / assumptions
  -> execute ready task
  -> optional CLI worker: Codex CLI / Claude Code / another CLI
  -> collect output
  -> verify with code / tests / logs / artifacts
  -> attach evidence
  -> execute_plan: mark_completed / mark_failed / blocked
  -> continue next ready task
  -> finish only when plan state is terminal

The external CLI is a worker.

Spectrion is the supervisor.

The CLI can say “done.”

Spectrion must verify whether the task can actually be closed.
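
A rough sketch of that worker/supervisor split, under the assumption that a worker is any callable returning text and a verifier is any callable returning a list of evidence strings. `run_task`, `Worker`, and `Verifier` are made-up names for illustration.

```python
from __future__ import annotations
from typing import Callable

# Hypothetical signatures: the worker returns raw text, the verifier returns
# evidence strings, or an empty list when nothing could be confirmed.
Worker = Callable[[str], str]
Verifier = Callable[[str, str], list]

def run_task(task: str, worker: Worker, verify: Verifier) -> dict:
    """Run one task and derive its state from evidence, not from the worker's words."""
    output = worker(task)
    evidence = verify(task, output)
    status = "completed" if evidence else "blocked"
    return {"task": task, "status": status, "evidence": evidence}

def talkative_worker(task: str) -> str:
    return "done"  # claims success without doing anything

def strict_verify(task: str, output: str) -> list:
    return []  # no test output, no diff, no artifact

result = run_task("fix timeout bug", talkative_worker, strict_verify)
print(result["status"])  # blocked
```

The worker's "done" never reaches the status field. Only the verifier's evidence does.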

A plan is not a Markdown list

A typical “plan” often looks like this:

1. Inspect the code
2. Find bugs
3. Fix them
4. Verify

That list guarantees nothing.

In Spectrion, a plan is a runtime artifact:

objective
scope
phases
tasks
dependencies
risk level
approval gates
open questions
acceptance criteria
rollback notes
required evidence
status
progress

The main invariant:

model message does not close the task
task state closes the task
completed task requires evidence
risky task requires approval
blocked task stays blocked

The plan is not finished until every task is closed as completed or skipped, or until the runtime honestly stops on a blocker / approval / open question.
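
These invariants are easiest to see as guarded state transitions. The sketch below is an assumption about how such guards could look; `PlanTask`, `TransitionError`, `start`, and `plan_finished` are hypothetical, and `mark_completed` only borrows its name from the action mentioned above.

```python
from __future__ import annotations
from dataclasses import dataclass, field

class TransitionError(Exception):
    """A requested state change would break a plan invariant."""

@dataclass
class PlanTask:
    id: str
    risky: bool = False
    approved: bool = False
    status: str = "pending"  # pending | running | completed | skipped | failed | blocked
    evidence: list[str] = field(default_factory=list)

def start(task: PlanTask) -> None:
    if task.status == "blocked":
        raise TransitionError(f"{task.id}: blocked task stays blocked")
    if task.risky and not task.approved:
        raise TransitionError(f"{task.id}: risky task requires approval")
    task.status = "running"

def mark_completed(task: PlanTask, evidence: list[str]) -> None:
    if task.status != "running":
        raise TransitionError(f"{task.id}: only a running task can be completed")
    if not evidence:
        raise TransitionError(f"{task.id}: completed task requires evidence")
    task.evidence = list(evidence)
    task.status = "completed"

def plan_finished(tasks: list[PlanTask]) -> bool:
    """A plan is finished only when every task is completed or skipped."""
    return all(t.status in ("completed", "skipped") for t in tasks)

task = PlanTask("fix-timeout")
start(task)
try:
    mark_completed(task, evidence=[])
except TransitionError as err:
    print(err)  # fix-timeout: completed task requires evidence
```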

create_plan vs todo

todo is for short work inside one turn:

- check a file;
- edit text;
- run one test;
- remember to update README;
- keep a small checklist.

create_plan is for situations with:

- multiple phases;
- dependencies;
- approval gates;
- risky changes;
- migrations;
- deployments;
- long-running work;
- rollback;
- acceptance criteria;
- cross-turn execution.
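
One way to picture the choice is as a simple signal check. This is only an illustrative heuristic: the signal names are made up from the list above, and nothing here claims to be how Spectrion actually decides.

```python
from __future__ import annotations

# Hypothetical signal names derived from the create_plan list above.
PLAN_SIGNALS = {
    "multiple_phases", "dependencies", "approval_gates", "risky_changes",
    "migrations", "deployments", "long_running", "rollback",
    "acceptance_criteria", "cross_turn",
}

def choose_tracker(signals: set[str]) -> str:
    """Return "create_plan" if any plan-level signal is present, else "todo"."""
    return "create_plan" if signals & PLAN_SIGNALS else "todo"

print(choose_tracker({"run_one_test"}))            # todo
print(choose_tracker({"migrations", "rollback"}))  # create_plan
```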

Example:

Find all remaining bugs in runtime, UI, tools, memory,
planning, and server. Add tests. Preserve backward compatibility.

That is not a todo list.

That is an execution plan.

In one real case, the plan broke the work into 35 tasks: discovery, runtime/planning fixes, tools layer, UI/UX, server compatibility, tests, and live smoke checks.

Figure: The plan is not a Markdown list. It is a runtime artifact: objective, scope, questions, phases, tasks, progress, and approval gates.


Figure: Risky or critical-path tasks do not execute silently. They stop at an approval gate.

What counts as evidence

Evidence is not a model sentence like “I checked it.”

Evidence is a verifiable trace of execution:

- test output;
- process log;
- diff or patch;
- path to changed file;
- reproducible scenario;
- HTTP response metadata;
- screenshot;
- artifact id;
- concrete blocker reason;
- scope or approval constraint.

If there is no evidence, the task should not be closed.
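
As a data structure, evidence can be as small as a typed pointer to a trace. The sketch below assumes a made-up `Evidence` record and made-up kind names that mirror the list above. The point is that a bare claim has no valid kind and cannot be attached.

```python
from __future__ import annotations
from dataclasses import dataclass

# Hypothetical kind names mirroring the evidence list above.
EVIDENCE_KINDS = {
    "test_output", "process_log", "diff", "file_path", "repro_scenario",
    "http_metadata", "screenshot", "artifact_id", "blocker_reason", "constraint",
}

@dataclass(frozen=True)
class Evidence:
    kind: str
    ref: str  # where the trace lives: a path, an id, or a log excerpt

    def __post_init__(self) -> None:
        if self.kind not in EVIDENCE_KINDS:
            raise ValueError(f"unknown evidence kind: {self.kind!r}")
        if not self.ref.strip():
            raise ValueError("evidence needs a concrete reference")

try:
    Evidence("claim", "I checked it")
except ValueError as err:
    print(err)  # unknown evidence kind: 'claim'
```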

Codex CLI / Claude Code as workers

Suppose the plan contains this task:

Audit tools layer for schema mismatches and timeout bugs.

Spectrion can launch Codex CLI:

codex exec "Scan Agent/Tools for schema mismatches, timeout handling gaps, and unsafe parsing. Return confirmed issues with file paths, reproduction notes, and suggested regression tests."

Then continue the same context:

codex exec resume --last "Convert the top confirmed findings into concrete patch steps and regression tests. Do not edit files yet."

Or it can use Claude Code / another CLI in headless or line-oriented mode.

But Spectrion does not trust CLI output blindly.

The pipeline looks like this:

Spectrion task
  -> launch CLI worker
  -> read output
  -> check files
  -> run tests
  -> compare with plan objective
  -> filter weak claims
  -> attach evidence
  -> mark task completed only if verified

A CLI can find a suspicious code path.

A CLI can propose a patch.

A CLI can collect logs.

But the Spectrion runtime closes the task.
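
The "launch CLI worker, read output" step could look roughly like this. It is a sketch under assumptions: `run_cli_worker` and `WorkerResult` are hypothetical names, and the demo runs the Python interpreter as a stand-in worker so it works without any agent CLI installed. In practice the command list would be something like `["codex", "exec", "<prompt>"]`, as in the example above.

```python
from __future__ import annotations
import subprocess
import sys
from dataclasses import dataclass

@dataclass
class WorkerResult:
    command: list[str]
    exit_code: int
    stdout: str
    stderr: str

def run_cli_worker(command: list[str], timeout_s: float = 600.0) -> WorkerResult:
    """Run a headless CLI once and capture everything it printed."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A timeout is a result too: record it instead of pretending success.
        return WorkerResult(command, -1, "", f"timed out after {timeout_s}s")
    return WorkerResult(command, proc.returncode, proc.stdout, proc.stderr)

# Stand-in worker: a Python one-liner that prints a claim.
result = run_cli_worker(
    [sys.executable, "-c", "print('Found a likely timeout bug')"]
)
print(result.exit_code, result.stdout.strip())
```

The captured output is stored as input for verification. Nothing in `WorkerResult` marks the task as completed.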

Why you cannot simply trust a CLI

CLI output is input, not a verdict.

It can:

- mix up a file;
- miss an edge case;
- propose a patch without a test;
- call a hypothesis a bug;
- say done even though a command was never run;
- forget backward compatibility;
- miss acceptance criteria from the plan.

Example verification:

Codex says:
  "Found a likely timeout bug in ToolExecutor."

Spectrion checks:
  - file path exists;
  - code path reachable;
  - bug reproducible;
  - patch applicable;
  - test fails before fix;
  - test passes after fix;
  - no neighboring regression.

Only then:
  execute_plan(mark_completed, evidence=...)

That is the difference between “chat with a command” and an execution runtime.
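
The checklist above maps naturally onto a list of named checks that must all pass. In this sketch `verify_claim` is a hypothetical helper and the lambdas are stubs. Real checks would inspect the file system, run the test suite, and so on.

```python
from __future__ import annotations
from typing import Callable

def verify_claim(
    checks: list[tuple[str, Callable[[], bool]]]
) -> tuple[bool, list[str]]:
    """Run every check; return (verified, names of the checks that failed)."""
    failed = [name for name, check in checks if not check()]
    return (not failed, failed)

# Stubbed results for illustration only.
checks = [
    ("file path exists", lambda: True),
    ("test fails before fix", lambda: True),
    ("test passes after fix", lambda: False),
]
verified, failed = verify_claim(checks)
print(verified, failed)  # False ['test passes after fix']
```

Returning the failed check names matters: they become the concrete blocker reason when the task cannot be closed.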

Persistent terminal sessions

Many development tasks need a live process:

- dev server;
- test watcher;
- REPL;
- long-running CLI;
- process waiting for stdin;
- server logs;
- a process that must stay alive between tool calls.

Spectrion can keep a persistent terminal session:

terminal start:
  session_id = web-dev
  command = npm run dev

terminal read:
  latest server logs

terminal send:
  r

terminal read:
  restart result

For a long-running CLI:

terminal start:
  session_id = tools-audit
  command = some-cli --headless

terminal read:
  partial output

terminal send:
  follow-up prompt

terminal read:
  final result

This turns the agent into a process operator, not just a command generator.

But a terminal is a powerful tool: it runs with the permissions of the current environment. So it needs policy boundaries: approval, command logs, workspace limits, a kill switch, and restrictions on dangerous behavior.
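
A persistent session boils down to a child process that outlives individual calls, plus non-blocking reads. The sketch below is an assumption about the shape of such a tool, not Spectrion's implementation: `TerminalSession` and its methods are hypothetical, and the child is a small Python echo loop standing in for a dev server or headless CLI.

```python
from __future__ import annotations
import queue
import subprocess
import sys
import threading

class TerminalSession:
    """Hypothetical persistent session: one child process kept alive across calls."""

    def __init__(self, session_id: str, command: list[str]):
        self.session_id = session_id
        self.proc = subprocess.Popen(
            command, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT, text=True, bufsize=1,
        )
        self._lines = queue.Queue()
        threading.Thread(target=self._pump, daemon=True).start()

    def _pump(self) -> None:
        # Copy child output into a queue so read() never blocks indefinitely.
        for line in self.proc.stdout:
            self._lines.put(line.rstrip("\n"))

    def send(self, text: str) -> None:
        self.proc.stdin.write(text + "\n")
        self.proc.stdin.flush()

    def read(self, wait_s: float = 2.0) -> list[str]:
        """Return buffered lines, waiting up to wait_s for the first one."""
        lines = []
        try:
            lines.append(self._lines.get(timeout=wait_s))
            while True:
                lines.append(self._lines.get_nowait())
        except queue.Empty:
            pass
        return lines

    def kill(self) -> None:
        # The kill switch: terminate the child and reap it.
        self.proc.kill()
        self.proc.wait()

# Stand-in child process that echoes whatever it receives on stdin.
ECHO = [sys.executable, "-u", "-c",
        "import sys\nfor line in sys.stdin: print('echo:', line.strip())"]

session = TerminalSession("web-dev", ECHO)
session.send("r")
print(session.read(wait_s=10.0))
session.kill()
```

A real runtime would wrap `send` and the start command in the policy checks listed above before anything reaches the process.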

/afk

In real work, the user is not always sitting next to the agent.

They may start an audit and leave.

A normal chat gets stuck on the first clarification.

Spectrion has /afk: a mode where the agent can continue long-running work without constant user presence.

Inside the runtime, that means:

- do not ask non-blocking questions;
- make conservative assumptions;
- keep plan/todo state up to date;
- continue ready tasks;
- verify evidence;
- stop on real blockers;
- do not bypass approval, credential, payment, security, or destructive-action boundaries;
- finalize only with outcome and evidence.

AFK does not bypass rules.

AFK should not turn a terminal into an unlimited root script.

AFK exists so the task does not die because of a minor decision point.
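
Reduced to a decision function, the AFK rules might look like this. The category names and `afk_decision` are hypothetical, chosen to mirror the list above.

```python
# Hypothetical categories that an unattended run must never decide on its own.
HARD_BOUNDARIES = {"approval", "credentials", "payment", "security", "destructive"}

def afk_decision(kind: str, blocking: bool) -> str:
    """Decide what an unattended run does when a question or gate comes up."""
    if kind in HARD_BOUNDARIES:
        return "stop"    # wait for the user, never bypass
    if blocking:
        return "stop"    # a real blocker: report it with a concrete reason
    return "assume"      # record a conservative assumption and continue

print(afk_decision("naming", blocking=False))   # assume
print(afk_decision("payment", blocking=False))  # stop
```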

Subagents

Spectrion can run subagent sessions.

There are two modes:

delegate_task     -> blocking delegation
sessions_spawn    -> background session

For a large bugfix, the pattern may look like this:

subagent A -> LLM streaming
subagent B -> tools layer
subagent C -> UI regressions
subagent D -> server compatibility
main agent -> plan, dependencies, verification, final quality

Subagents speed up the work, but the parent runtime should not accept their output blindly.

It needs to verify:

- what was tested;
- which files/targets were covered;
- what evidence was attached;
- where the result is a hypothesis vs a confirmed fact;
- what limitations remain.

Responsibility for closing the plan task stays with the parent runtime.
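
A sketch of that parent-side review, assuming subagents return a structured report. `SubagentReport`, `run_subagent`, and `supervise` are made-up names, and the findings are placeholder demo data rather than real results.

```python
from __future__ import annotations
import asyncio
from dataclasses import dataclass, field

@dataclass
class SubagentReport:
    area: str
    confirmed: list[str] = field(default_factory=list)   # findings with evidence
    hypotheses: list[str] = field(default_factory=list)  # unverified claims

async def run_subagent(area: str) -> SubagentReport:
    # Stand-in for a real subagent session; returns placeholder demo data.
    await asyncio.sleep(0)
    if area == "tools layer":
        return SubagentReport(area, confirmed=["finding with test output attached"])
    return SubagentReport(area, hypotheses=["suspected issue, not reproduced"])

async def supervise(areas: list[str]) -> dict[str, str]:
    """Run subagents concurrently, then let the parent judge each report."""
    reports = await asyncio.gather(*(run_subagent(a) for a in areas))
    # Only confirmed findings move a plan task forward.
    return {
        r.area: ("accepted" if r.confirmed else "needs verification")
        for r in reports
    }

print(asyncio.run(supervise(["tools layer", "UI regressions"])))
# {'tools layer': 'accepted', 'UI regressions': 'needs verification'}
```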

Remote CLI

Sometimes the work should not run on the local Mac:

- long audit;
- heavy tests;
- server environment;
- 24/7 runner;
- isolated Linux environment.

For that, Spectrion has remote CLI.

Spectrion can deploy a CLI container to a Linux server, connect it through mesh, and execute commands remotely:

deploy remote CLI
  -> check status
  -> stream logs
  -> exec command
  -> collect evidence
  -> restart / stop / remove when done

To the user, it is one agent.

Physically, work may happen on iPhone, Mac, or a remote Linux runner.

The runtime keeps the shared plan, state, and evidence.

Bug Bounty Hunter mode

Bug bounty is a mode where scope, approval, and evidence matter even more.

It is not “scan any website.”

The bug bounty agent starts with an intake gate:

- program/platform;
- rules/scope URL;
- concrete target;
- what is allowed / forbidden.

If that data is missing, the agent does not run tools.

It does not guess scope and it does not start active checks.

Flow:

1. read rules;
2. lock scope;
3. passive recon;
4. attack surface map;
5. hypothesis/evidence ledger;
6. classify each finding as HYPOTHESIS / INDICATION / PROVEN;
7. active validation only after approval;
8. report-ready output.

Simple rule:

no scope -> stop
unclear permission -> ask
active validation -> approval
no proof + no impact -> not a vulnerability
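
The same rules as an ordered gate, where the first failing condition wins. This is a minimal sketch. `gate` and `may_report` are hypothetical names, and `may_report` takes the last rule literally, excluding only the case where both proof and impact are missing.

```python
def gate(scope_locked: bool, permission_clear: bool,
         active_check: bool, approved: bool) -> str:
    """Apply the rules in order; the first failing gate decides the action."""
    if not scope_locked:
        return "stop"
    if not permission_clear:
        return "ask"
    if active_check and not approved:
        return "wait_for_approval"
    return "proceed"

def may_report(has_proof: bool, has_impact: bool) -> bool:
    # Literal reading: no proof and no impact means not a vulnerability.
    return has_proof or has_impact

print(gate(scope_locked=False, permission_clear=True,
           active_check=False, approved=False))  # stop
print(gate(scope_locked=True, permission_clear=True,
           active_check=True, approved=False))   # wait_for_approval
```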

This is not a “hack” button. It is a controlled mode for authorized research.

UX: trust state, not words

The core UX is not a fancy button.

The core UX is trust in state.

The user should see:

Phase 2 is running.
This task depends on completed audit.
This task requires approval.
This task is blocked because evidence is missing.
This task was closed with test output and diff.

Not “the agent is thinking somewhere.”

Clear work state.
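
Status lines like these can be derived directly from stored task state rather than written by the model. A minimal sketch, assuming a hypothetical task dictionary with `name`, `status`, `reason`, and `evidence` keys:

```python
def describe(task: dict) -> str:
    """Turn stored task state into one user-facing status line."""
    name, status = task["name"], task["status"]
    if status == "waiting_approval":
        return f"{name}: requires approval."
    if status == "blocked":
        return f"{name}: blocked because {task['reason']}."
    if status == "completed":
        return f"{name}: closed with {', '.join(task['evidence'])}."
    return f"{name}: {status}."

print(describe({
    "name": "Fix timeout",
    "status": "completed",
    "evidence": ["test output", "diff"],
}))  # Fix timeout: closed with test output, diff.
```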

Conclusion

A good agent is not a model with a large context window and a list of tools.

A good agent is a runtime that can hold commitments.

It knows when a plan is required and when a todo list is enough.

It continues in /afk, but does not bypass approval.

It manages Codex CLI, Claude Code, terminal sessions, and subagents, but does not trust their answers blindly.

It can work in bug bounty mode, but starts with scope and rules.

Most importantly, it closes a task only when there is evidence.

The user is not asking the agent to write a beautiful status update.

The user is asking the agent to do the task.

That is why Spectrion is being built not as a chat with functions, but as an execution runtime.

Where to find it:

Site: https://spectrion.app
App Store: https://apps.apple.com/app/spectrion-agent-ai/id6759151825
